Evaluation
Evaluation Protocol
ORBIT frames recommendation as a sequential prediction task: given a user's historical interaction sequence, the goal is to predict the next item they will interact with. We use the leave-one-out method to split each user's interaction sequence of length n into training, validation, and test sets:
- The first n - 2 items are used as the training input
- The (n - 1)-th item is used for validation
- The n-th item is reserved for testing
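As a concrete illustration, here is a minimal sketch of this split in Python. The function name and the plain list-of-item-IDs representation are assumptions made for readability, not the actual ORBIT data loader interface.

```python
# Minimal sketch of the leave-one-out split described above (illustrative only).
from typing import List, Tuple

def leave_one_out_split(sequence: List[int]) -> Tuple[List[int], int, int]:
    """Split one user's chronological item sequence of length n.

    Returns (training input of first n - 2 items, validation item, test item).
    """
    if len(sequence) < 3:
        raise ValueError("Need at least 3 interactions to form all three splits.")
    train_input = sequence[:-2]   # first n - 2 items
    valid_item = sequence[-2]     # (n - 1)-th item
    test_item = sequence[-1]      # n-th item
    return train_input, valid_item, test_item

# Example: a user who interacted with items 7, 3, 9, 4, 1 in that order.
train, val, test = leave_one_out_split([7, 3, 9, 4, 1])
# train == [7, 3, 9], val == 4, test == 1
```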
During evaluation, the model must rank the held-out item against the entire item pool, which simulates realistic candidate sets and provides a rigorous test of generalization.
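For intuition, the sketch below shows what full-pool ranking looks like, assuming the model's per-item scores over the whole catalogue are available as a NumPy array; how those scores are produced is left abstract and is not part of the ORBIT API.

```python
# Illustrative sketch of ranking the held-out item against the full item pool.
import numpy as np

def rank_of_target(scores: np.ndarray, target_item: int) -> int:
    """Return the 1-based rank of the target item under the given scores.

    `scores[i]` is the model's score for item i over the entire item pool;
    higher scores rank earlier. Ties are counted against the target.
    """
    return int((scores >= scores[target_item]).sum())

# Example with a hypothetical 5-item catalogue.
scores = np.array([0.1, 2.3, 0.7, 1.5, 0.2])
print(rank_of_target(scores, target_item=3))  # item 3 has the 2nd-highest score -> 2
```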
Metrics
ORBIT evaluates models using two standard metrics for top-K ranking performance:
- Recall@K: Measures the fraction of relevant items successfully retrieved within the top-K predictions. For one-relevant-item scenarios, this is equivalent to HitRate@K.
- NDCG@K: Normalized Discounted Cumulative Gain evaluates the position of the relevant item, rewarding correct items that appear higher in the ranking.
We report these metrics at K ∈ {1, 10, 50, 100} for consistency across datasets.
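For reference, a minimal sketch of both metrics in the single-relevant-item case is shown below, computed from the target item's 1-based rank in the full ranking. This mirrors the definitions above but is not the exact ORBIT evaluation code.

```python
# Illustrative Recall@K and NDCG@K for the single-relevant-item setting.
import math

def recall_at_k(rank: int, k: int) -> float:
    """Recall@K with one relevant item: 1 if it appears in the top-K, else 0.

    With a single relevant item this equals HitRate@K.
    """
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank: int, k: int) -> float:
    """NDCG@K with one relevant item: 1 / log2(rank + 1) if in the top-K, else 0.

    The ideal DCG is 1 (relevant item at rank 1), so no further normalization is needed.
    """
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

# Example: the target item is ranked 4th in the full item pool.
for k in (1, 10, 50, 100):
    print(k, recall_at_k(4, k), round(ndcg_at_k(4, k), 4))
# K=1 gives 0.0 for both; K>=10 gives Recall 1.0 and NDCG 1/log2(5) ~= 0.4307
```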