ORBIT

🏆 ClueWeb-Reco Leaderboard

Candidate Ranking Results

| Model | Recall@10 | NDCG@10 | Recall@50 | NDCG@50 | Recall@100 | NDCG@100 |
|---|---|---|---|---|---|---|
| DeepSeek-V3-QueryGen | 0.0127 | 0.0082 | 0.0264 | 0.0111 | 0.0371 | 0.0129 |
| GPT-4.1-QueryGen | 0.0107 | 0.0050 | 0.0195 | 0.0068 | 0.0254 | 0.0077 |
| HLLM | 0.0088 | 0.0041 | 0.0137 | 0.0052 | 0.0176 | 0.0059 |
| GPT-3.5-Turbo-QueryGen | 0.0088 | 0.0027 | 0.0176 | 0.0050 | 0.0312 | 0.0072 |
| Qwen3-235B-QueryGen | 0.0088 | 0.0046 | 0.0234 | 0.0077 | 0.0303 | 0.0088 |
| GPT-4o-QueryGen | 0.0068 | 0.0042 | 0.0146 | 0.0058 | 0.0264 | 0.0077 |
| Gemini-2.5-Flash-QueryGen | 0.0068 | 0.0042 | 0.0146 | 0.0058 | 0.0264 | 0.0077 |
| Claude-Sonnet-4-QueryGen | 0.0068 | 0.0032 | 0.0166 | 0.0052 | 0.0215 | 0.0060 |
| Kimi-K2-QueryGen | 0.0039 | 0.0022 | 0.0156 | 0.0050 | 0.0234 | 0.0062 |
| TASTE | 0.0020 | 0.0015 | 0.0039 | 0.0019 | 0.0039 | 0.0019 |
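
For reference, when each user has a single held-out ground-truth item (the usual next-item recommendation setup), Recall@K and NDCG@K reduce to the simple forms sketched below. This is an illustrative implementation under that assumption, not ORBIT's official evaluation code; the `evaluate` helper is hypothetical.

```python
import math

def recall_at_k(ranked_ids: list[str], target_id: str, k: int) -> float:
    """Recall@K with one ground-truth item: 1.0 if the target appears
    among the top-K ranked candidates, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def ndcg_at_k(ranked_ids: list[str], target_id: str, k: int) -> float:
    """NDCG@K with one ground-truth item: 1 / log2(rank + 1) if the
    target is ranked within the top K (rank is 1-indexed), else 0.0.
    The ideal DCG is 1, so no extra normalization is needed."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == target_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def evaluate(runs: list[tuple[list[str], str]], k: int) -> tuple[float, float]:
    """Average per-user scores over (ranked_ids, target_id) pairs,
    mirroring how leaderboard numbers are typically produced."""
    n = len(runs)
    recall = sum(recall_at_k(r, t, k) for r, t in runs) / n
    ndcg = sum(ndcg_at_k(r, t, k) for r, t in runs) / n
    return recall, ndcg
```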

Prompt Construction for Query Generation

To assess the generalization power of LLM-based recommenders, ClueWeb-Reco includes a query generation task. A user's browsing-history titles are formatted into a prompt, and each LLM is asked to infer the user's next likely interest as a fresh search query rather than a rephrasing of the history titles. The generated query is then embedded and matched against the candidate pool via dense retrieval.
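
A minimal sketch of this pipeline is shown below. The prompt wording and the `build_prompt` and `rank_candidates` helpers are illustrative assumptions, not ORBIT's official code; the LLM call and the embedding step are left to whichever model and text encoder the evaluator plugs in.

```python
import numpy as np

def build_prompt(history_titles: list[str]) -> str:
    """Format browsing-history titles into a next-interest prompt.
    The wording here is illustrative, not ORBIT's official prompt."""
    history = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(history_titles))
    return (
        "A user visited the following web pages, in order:\n"
        f"{history}\n\n"
        "Write one short search query describing the page the user is "
        "most likely to visit next. Do not copy or rephrase the titles above."
    )

def rank_candidates(query_vec: np.ndarray, cand_vecs: np.ndarray,
                    k: int = 100) -> np.ndarray:
    """Dense retrieval step: rank the candidate pool by cosine similarity
    between the embedded query and each candidate embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    scores = c @ q                  # cosine similarity per candidate
    return np.argsort(-scores)[:k]  # indices of the top-k candidates
```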

*Figure: prompt construction for the query generation task.*

Note: Prompt design plays a critical role in the performance of LLM-QueryGen models. ORBIT encourages community contributions of custom prompts to better reflect the strengths and nuances of each language model.
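
For example, a contributed model-specific prompt might look like the sketch below; the template text and the `render` helper are hypothetical, and ORBIT's actual submission format may differ.

```python
# Hypothetical model-specific prompt variant a contributor might submit.
CUSTOM_QUERYGEN_PROMPT = (
    "You are predicting a web user's next interest.\n"
    "Browsing history (oldest first):\n{history}\n\n"
    "Respond with only a terse search query for the next likely page. "
    "Do not copy or rephrase any title verbatim."
)

def render(template: str, history_titles: list[str]) -> str:
    """Fill the {history} slot with one bulleted title per line."""
    history = "\n".join(f"- {t}" for t in history_titles)
    return template.format(history=history)
```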