About ORBIT
ORBIT (Open Recommendation Benchmark for Reproducible Research with Hidden Tests) is a unified benchmarking framework for recommender systems, designed to promote reproducible, fair, and practically relevant evaluation. Recommender systems power many of today's digital services, yet existing benchmarks often fail to capture real-world user behavior or to enforce consistent evaluation settings.
ORBIT addresses these gaps by introducing two core contributions:
- A standardized evaluation suite covering 12 representative models across 5 widely used public datasets, with consistent data splits and metrics (see the metric sketch after this list), released through an open leaderboard.
- ClueWeb-Reco, a novel web page recommendation task featuring real U.S. user browsing sessions. These sessions are mapped to 87M+ public ClueWeb22-B EN documents through a semantic dense retrieval pipeline, enabling privacy-preserving yet realistic evaluations.
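To make the shared metrics from the first point concrete, below is a minimal sketch of two ranking metrics commonly reported on recommendation leaderboards, Recall@K and NDCG@K. The function names and the toy example are illustrative only, not ORBIT's exact evaluation code or metric set.

```python
import math

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of relevant items that appear in the top-k ranking."""
    top_k = set(ranked_items[:k])
    return len(top_k & set(relevant_items)) / max(len(relevant_items), 1)

def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG: discounted gain of hits, normalized by the ideal ranking."""
    relevant = set(relevant_items)
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Toy usage: a model ranks item IDs and the held-out target is item 42.
ranking = [7, 42, 13, 99, 5]
print(recall_at_k(ranking, [42], k=3))  # 1.0
print(ndcg_at_k(ranking, [42], k=3))    # ~0.63 (hit at rank 2)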
Data collection for ClueWeb-Reco was conducted with explicit user consent via vetted human research platforms. Quality control mechanisms filter out scam, inappropriate, or low-quality pages. Collected URLs are then soft-matched to ClueWeb22 pages using dense embeddings, forming fully anonymized sequences for evaluation.
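As a rough illustration of that soft-matching step, the sketch below embeds a collected page and retrieves the nearest document by cosine similarity over dense embeddings. The encoder choice, FAISS index type, acceptance threshold, and tiny toy corpus are assumptions made for illustration; the actual ClueWeb-Reco pipeline may differ.

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

# Toy stand-ins for ClueWeb22-B documents (87M+ entries in the real corpus).
clueweb_docs = [
    "Guide to hiking trails in the Appalachian mountains",
    "Recipe: classic sourdough bread at home",
    "Review of the latest mirrorless cameras",
]
doc_vecs = encoder.encode(clueweb_docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product = cosine on normalized vectors
index.add(np.asarray(doc_vecs, dtype="float32"))

# A collected browsing event, represented here by its page title.
visited_page = "How to bake sourdough bread for beginners"
query_vec = encoder.encode([visited_page], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 1)

THRESHOLD = 0.5  # illustrative: keep only sufficiently close matches
if scores[0][0] >= THRESHOLD:
    print("soft-matched to:", clueweb_docs[ids[0][0]])  # anonymized doc ID in practice
else:
    print("no confident match; event dropped")
```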
On public benchmarks, ORBIT shows that content-based models outperform ID-based approaches, especially in sparse settings. For ClueWeb-Reco, traditional models struggle to generalize across the massive item space. In contrast, our proposed LLM-QueryGen baseline, which frames recommendation as retrieval using LLM-generated queries, demonstrates strong zero-shot performance.
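The sketch below illustrates the query-generation idea under simplifying assumptions: an off-the-shelf LLM drafts a search query from the browsing session, and a dense retriever scores candidate pages against it. The prompt wording, model names, and toy corpus are hypothetical and not the exact LLM-QueryGen baseline.

```python
import numpy as np
import faiss
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()  # assumes OPENAI_API_KEY is set; any capable LLM would do
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed retriever encoder

# Toy stand-in for the ClueWeb22-B candidate pool (87M+ documents in the real task).
docs = [
    "Classic sourdough bread recipe with step-by-step photos",
    "Top 10 hiking trails in the Appalachians",
    "Choosing a stand mixer: a buyer's guide",
]
index = faiss.IndexFlatIP(encoder.get_sentence_embedding_dimension())
index.add(np.asarray(encoder.encode(docs, normalize_embeddings=True), dtype="float32"))

def generate_query(session_titles):
    """Ask the LLM what the user is likely to browse next, phrased as a query."""
    prompt = (
        "A user visited these web pages in order:\n"
        + "\n".join(f"- {t}" for t in session_titles)
        + "\nWrite one short search query for the page they are likely to visit next."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

session = ["Beginner sourdough starter guide", "Best flour for bread baking"]
query = generate_query(session)
q_vec = np.asarray(encoder.encode([query], normalize_embeddings=True), dtype="float32")
scores, ids = index.search(q_vec, 3)
print("generated query:", query)
print("retrieved candidates:", [docs[i] for i in ids[0] if i != -1])
```

Because retrieval happens in a shared embedding space rather than over learned item IDs, this style of baseline can be applied zero-shot to an unseen item catalog.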
ORBIT is publicly available and welcomes participation from the community. Submit your model predictions to our leaderboard and join us in advancing reproducible, realistic recommendation research.