🏆 ClueWeb-Reco Leaderboard
Candidate Ranking Results
| Model | Recall@10 ▼ | NDCG@10 | Recall@50 | NDCG@50 | Recall@100 | NDCG@100 |
|---|---|---|---|---|---|---|
| DeepSeek-V3-QueryGen | 0.0127 | 0.0082 | 0.0264 | 0.0111 | 0.0371 | 0.0129 |
| GPT-4.1-QueryGen | 0.0107 | 0.0050 | 0.0195 | 0.0068 | 0.0254 | 0.0077 |
| HLLM | 0.0088 | 0.0041 | 0.0137 | 0.0052 | 0.0176 | 0.0059 |
| GPT-3.5-Turbo-QueryGen | 0.0088 | 0.0027 | 0.0176 | 0.0050 | 0.0312 | 0.0072 |
| Qwen3-235B-QueryGen | 0.0088 | 0.0046 | 0.0234 | 0.0077 | 0.0303 | 0.0088 |
| GPT-4o-QueryGen | 0.0068 | 0.0042 | 0.0146 | 0.0058 | 0.0264 | 0.0077 |
| Gemini-2.5-Flash-QueryGen | 0.0068 | 0.0042 | 0.0146 | 0.0058 | 0.0264 | 0.0077 |
| Claude-Sonnet-4-QueryGen | 0.0068 | 0.0032 | 0.0166 | 0.0052 | 0.0215 | 0.0060 |
| Kimi-K2-QueryGen | 0.0039 | 0.0022 | 0.0156 | 0.0050 | 0.0234 | 0.0062 |
| TASTE | 0.0020 | 0.0015 | 0.0039 | 0.0019 | 0.0039 | 0.0019 |
Prompt Construction for Query Generation
To assess the generalization power of LLM-based recommenders, ClueWeb-Reco includes a query generation task. Browsing history titles are formatted into a prompt, and LLMs are asked to infer the next likely interest without rephrasing. The generated query is then embedded and matched to the candidate pool via dense retrieval.

Note: Prompt design plays a critical role in the performance of LLM-QueryGen models. ORBIT encourages community contributions of custom prompts to better reflect the strengths and nuances of each language model.