ORBIT

Evaluation


Evaluation Protocol

ORBIT frames recommendation as a sequential prediction task, where the goal is to predict the next item a user will interact with based on their historical sequence. We use the leave-one-out method to split the data into training, validation, and test sets:

  • The first n - 2 items are used as the training input
  • The (n - 1)th item is used for validation
  • The nth item is reserved for testing

During evaluation, the model must rank the held-out item against the entire item pool, simulating a realistic candidate set and providing a rigorous test of generalization.
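As a rough sketch of this split (the function name and data layout here are illustrative assumptions, not the ORBIT codebase itself), a leave-one-out split over per-user interaction sequences could look like this:

```python
from typing import Dict, List, Tuple

def leave_one_out_split(
    user_sequences: Dict[str, List[int]],
) -> Tuple[
    Dict[str, List[int]],
    Dict[str, Tuple[List[int], int]],
    Dict[str, Tuple[List[int], int]],
]:
    """Split each user's chronologically ordered item sequence.

    For a sequence of n items:
      items 1..n-2 -> training input
      item  n-1    -> validation target (input: items 1..n-2)
      item  n      -> test target       (input: items 1..n-1)
    """
    train, valid, test = {}, {}, {}
    for user, items in user_sequences.items():
        if len(items) < 3:
            continue  # need at least one item for each split
        train[user] = items[:-2]
        valid[user] = (items[:-2], items[-2])
        test[user] = (items[:-1], items[-1])
    return train, valid, test
```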

Metrics

ORBIT evaluates models using two standard metrics for top-K ranking performance:

  • Recall@K: Measures the fraction of relevant items successfully retrieved within the top-K predictions. Since each test case has a single relevant item, this is equivalent to HitRate@K.
  • NDCG@K: Normalized Discounted Cumulative Gain evaluates the position of the relevant item, rewarding rankings that place it closer to the top.

We report these metrics at K ∈ {1, 10, 50, 100} for consistency across datasets.
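In the single-relevant-item setting both metrics reduce to simple formulas. A minimal sketch (a hypothetical helper, not the official evaluation code) is:

```python
import math
from typing import Sequence, Tuple

def recall_and_ndcg_at_k(
    ranked_items: Sequence[int], relevant_item: int, k: int
) -> Tuple[float, float]:
    """Recall@K (= HitRate@K with one relevant item) and NDCG@K.

    ranked_items: item IDs from the full pool, sorted best-first by model score.
    relevant_item: the single held-out ground-truth item.
    """
    top_k = list(ranked_items[:k])
    if relevant_item not in top_k:
        return 0.0, 0.0
    rank = top_k.index(relevant_item) + 1   # 1-based position in the ranking
    recall = 1.0                            # the only relevant item was retrieved
    ndcg = 1.0 / math.log2(rank + 1)        # ideal DCG is 1, so DCG is already normalized
    return recall, ndcg

# Example: a hit at rank 2 within the top 10
# recall_and_ndcg_at_k([7, 42, 13], relevant_item=42, k=10) -> (1.0, ~0.631)
```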