# 🏆 ClueWeb-Reco Leaderboard

## Candidate Ranking Results
| Model | Recall@10 | NDCG@10 | Recall@50 | NDCG@50 | Recall@100 | NDCG@100 |
|---|---|---|---|---|---|---|
| GPT-4.1-QueryGen | 0.0107 | 0.0050 | 0.0195 | 0.0068 | 0.0254 | 0.0077 |
| HLLM | 0.0088 | 0.0041 | 0.0137 | 0.0052 | 0.0176 | 0.0059 |
| GPT-3.5-Turbo-QueryGen | 0.0088 | 0.0027 | 0.0176 | 0.0050 | 0.0312 | 0.0072 |
| GPT-4o-QueryGen | 0.0068 | 0.0042 | 0.0146 | 0.0058 | 0.0264 | 0.0077 |
| Gemini-2.5-Flash-QueryGen | 0.0068 | 0.0042 | 0.0146 | 0.0058 | 0.0264 | 0.0077 |
| Claude-Sonnet-4-QueryGen | 0.0068 | 0.0032 | 0.0166 | 0.0052 | 0.0215 | 0.0060 |
| TASTE | 0.0020 | 0.0015 | 0.0039 | 0.0019 | 0.0039 | 0.0019 |
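For reference, the metrics above are straightforward to compute when each browsing sequence has a single held-out next page: Recall@K is 1 if the target appears in the top-K ranked candidates and 0 otherwise, and NDCG@K discounts a hit by the logarithm of its rank. The snippet below is a minimal sketch under that single-target assumption; it is not the benchmark's official evaluation script.

```python
import math

def recall_at_k(ranked_ids: list[str], target_id: str, k: int) -> float:
    """1.0 if the ground-truth item appears in the top-k candidates, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def ndcg_at_k(ranked_ids: list[str], target_id: str, k: int) -> float:
    """DCG of a single relevant item at its rank; the ideal DCG (rank 1) is 1."""
    if target_id in ranked_ids[:k]:
        rank = ranked_ids.index(target_id) + 1  # 1-based rank
        return 1.0 / math.log2(rank + 1)
    return 0.0

# Example: the target is ranked 3rd among the returned candidates.
ranked = ["doc7", "doc2", "doc5", "doc1"]
print(recall_at_k(ranked, "doc5", k=10))  # 1.0
print(ndcg_at_k(ranked, "doc5", k=10))    # 1 / log2(4) = 0.5
```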
## Prompt Construction for Query Generation
To assess the generalization power of LLM-based recommenders, ClueWeb-Reco includes a query generation task. Browsing-history titles are formatted into a prompt, and the LLM is asked to infer the user's next likely interest without rephrasing the history. The generated query is then embedded and matched against the candidate pool via dense retrieval, as in the sketch below.
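A minimal end-to-end sketch of this pipeline follows. The prompt wording, the OpenAI chat model, and the `all-MiniLM-L6-v2` sentence-transformers embedder are illustrative assumptions, not the benchmark's official setup.

```python
# Hypothetical sketch of the query-generation pipeline described above.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder dense retriever

def generate_query(history_titles: list[str], model: str = "gpt-4o") -> str:
    """Format browsing-history titles into a prompt and ask the LLM for a next-interest query."""
    prompt = (
        "Below are the titles of web pages a user has browsed, in order:\n"
        + "\n".join(f"{i + 1}. {t}" for i, t in enumerate(history_titles))
        + "\n\nWrite a short search query for the page the user is most likely "
          "to visit next. Do not rephrase the titles above; output only the query."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def rank_candidates(query: str, candidate_titles: list[str], k: int = 100) -> list[int]:
    """Embed the generated query and rank candidates by cosine similarity (dense retrieval)."""
    query_vec = embedder.encode([query], normalize_embeddings=True)
    cand_vecs = embedder.encode(candidate_titles, normalize_embeddings=True)
    scores = (cand_vecs @ query_vec.T).squeeze(-1)
    return np.argsort(-scores)[:k].tolist()
```

In practice the candidate pool would be embedded once and indexed (e.g. with an ANN library) rather than re-encoded per query; the per-query encoding here is only to keep the sketch self-contained.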
