ORBIT
Open Recommendation Benchmark for Integrative Tasks
Benchmark datasets, loaders, and evaluators for recommendation models
Motivation
Advancing recommender systems research requires benchmarks that are not only fair and reproducible, but also grounded in real-world user behavior. However, existing evaluations often rely on outdated or synthetic datasets and inconsistent protocols, making meaningful comparisons across models difficult.
ORBIT addresses these challenges by providing a unified benchmarking framework with standardized data splits and transparent evaluation settings. It further introduces ClueWeb-Reco, a large-scale, privacy-preserving web page recommendation task built from real user browsing histories mapped to the ClueWeb22-B EN corpus using dense retrieval. This hidden test set rigorously evaluates models' generalization capabilities in realistic, open-world scenarios.
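To make the dense-retrieval mapping concrete, below is a minimal, illustrative sketch: browsing-history entries and corpus documents are embedded with a text encoder, and each history entry is matched to its nearest ClueWeb22-B document by cosine similarity. The encoder choice (`all-MiniLM-L6-v2` via sentence-transformers), the toy document IDs, and the variable names (`history_titles`, `clueweb_docs`) are assumptions for illustration, not the pipeline ORBIT actually uses.

```python
# Illustrative sketch only; not ORBIT's actual mapping pipeline or encoder.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, for illustration

# Hypothetical inputs: page titles from a user's browsing history and a tiny
# slice of the ClueWeb22-B EN corpus keyed by (made-up) document IDs.
history_titles = ["how to fine-tune a sequence model", "best hiking trails near Pittsburgh"]
clueweb_docs = {
    "clueweb22-en-doc-A": "A practical guide to fine-tuning sequence models ...",
    "clueweb22-en-doc-B": "Top hiking trails around Pittsburgh, PA ...",
}

# Embed both sides; normalized embeddings make the dot product a cosine similarity.
doc_ids = list(clueweb_docs)
doc_vecs = encoder.encode([clueweb_docs[d] for d in doc_ids], normalize_embeddings=True)
query_vecs = encoder.encode(history_titles, normalize_embeddings=True)

# Map each browsing-history entry to its nearest corpus document.
for title, q in zip(history_titles, query_vecs):
    best = int(np.argmax(doc_vecs @ q))
    print(f"{title!r} -> {doc_ids[best]}")
```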
Our Goals
- Unified & Reproducible Evaluation: We provide a standardized suite of datasets, loaders, and evaluation protocols that enable consistent and transparent model comparisons across domains.
- Hidden Test Sets from Real User Behavior: We introduce privacy-preserving test sets derived from real user interaction logs, offering a more accurate reflection of real-world performance without compromising user privacy.
- Centralized Leaderboard Across Domains: ORBIT consolidates model results across datasets, domains, and metrics, helping researchers track state-of-the-art performance with clarity.
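To illustrate what a standardized evaluation protocol involves, here is a minimal sketch of ranking-metric computation (Recall@K and NDCG@K) over a fixed held-out split. It is not ORBIT's actual evaluator API; the function names and the toy `rankings`/`ground_truth` data are hypothetical.

```python
# Minimal, illustrative metric sketch; not ORBIT's evaluator API.
import math

def recall_at_k(ranked_items, relevant_items, k=10):
    """Fraction of a user's relevant items that appear in the top-k ranking."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items) if relevant_items else 0.0

def ndcg_at_k(ranked_items, relevant_items, k=10):
    """Binary-relevance NDCG: discounted gain of hits, normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k]) if item in relevant_items)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant_items), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Hypothetical standardized split: per-user model rankings and held-out ground truth.
rankings = {"user_1": ["item_9", "item_3", "item_7"], "user_2": ["item_2", "item_5", "item_1"]}
ground_truth = {"user_1": {"item_3"}, "user_2": {"item_1", "item_4"}}

for user, ranked in rankings.items():
    print(user,
          f"Recall@3={recall_at_k(ranked, ground_truth[user], k=3):.3f}",
          f"NDCG@3={ndcg_at_k(ranked, ground_truth[user], k=3):.3f}")
```

Fixing the split, the candidate set, and the metric definitions in one place is what makes scores comparable across models and domains; the sketch above only shows the metric half of that contract.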