MasterSet: A Benchmark for Must-Cite Citation Recommendation

Apr 20, 2026 · 2 min read

Overview

MasterSet is a large-scale benchmark designed to evaluate must-cite citation recommendation in AI and machine learning research. Given only the title and abstract of a paper, the task is to retrieve the small set of papers so central to the work—direct experimental baselines, foundational methods, core datasets—that omitting them would misrepresent the contribution’s novelty or undermine reproducibility.

A live demo is available at mustcite.com.

Motivation

The volume of AI/ML publications has grown by an order of magnitude over the past decade. Existing citation recommendation systems focus on broad topical relevance, but researchers need something more targeted: which specific papers must they cite? Missing a key baseline or foundational method is not merely an oversight—it can constitute an incomplete or misleading submission.

MasterSet is the first benchmark specifically designed to evaluate this harder, higher-stakes task.

Dataset

MasterSet is built on 153,373 papers collected from official proceedings of 15 peer-reviewed venues, including NeurIPS, ICML, ICLR, CVPR, ACL, and others spanning core ML, computer vision, NLP, and probabilistic methods. Papers are collected using Open Papers, a venue-specific scraper that retrieves directly from official proceedings websites rather than aggregator APIs, yielding exact, verified paper counts free of preprint conflation.

Annotation

Every citation instance is annotated with a three-tier labeling scheme:

  • Type I: Experimental baseline status (binary)
  • Type II: Core relevance on a 1–5 scale
  • Type III: Intra-paper mention frequency

Over 2 million citation instances are labeled using Gemini 2.5 Flash as an LLM judge, validated against human expert annotations on a stratified sample of 510 instances.
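To make the three-tier scheme concrete, here is a minimal sketch of what one labeled citation instance could look like. The field names, the `is_must_cite` rule, and its threshold are illustrative assumptions, not the benchmark's actual schema or definition:

```python
from dataclasses import dataclass

@dataclass
class CitationLabel:
    """One citation instance under a three-tier scheme like MasterSet's.

    Field names are illustrative; the benchmark's real schema may differ.
    """
    citing_paper_id: str   # paper whose title + abstract form the query
    cited_paper_id: str    # candidate paper being labeled
    is_baseline: bool      # Type I: experimental baseline status (binary)
    core_relevance: int    # Type II: core relevance, 1 (peripheral) to 5 (central)
    mention_count: int     # Type III: intra-paper mention frequency

    def is_must_cite(self, relevance_threshold: int = 4) -> bool:
        # Hypothetical rule: treat a citation as must-cite if it is an
        # experimental baseline or rated highly core-relevant.
        return self.is_baseline or self.core_relevance >= relevance_threshold

label = CitationLabel("paper_a", "paper_b",
                      is_baseline=True, core_relevance=5, mention_count=7)
print(label.is_must_cite())  # True under the hypothetical rule
```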

Benchmark Results

We evaluate sparse retrieval (BM25), dense scientific embeddings (SPECTER, SciNCL, SciBERT), and graph-based methods. The best baseline, SciBERT fine-tuned with contrastive loss, recovers fewer than 50% of must-cite papers within its top 100 results from a 67,761-paper candidate pool, confirming that must-cite retrieval remains a substantially open problem.
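The headline number corresponds to recall@k: the fraction of a paper's must-cite references that a system recovers among its top-k retrieved candidates. A minimal sketch of that metric (the exact evaluation protocol used by the benchmark is an assumption here):

```python
def recall_at_k(ranked_ids: list[str], must_cite_ids: set[str], k: int = 100) -> float:
    """Fraction of the must-cite set appearing in the top-k ranked candidates."""
    if not must_cite_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & must_cite_ids) / len(must_cite_ids)

# Toy example: only 2 of 4 must-cite papers land in the top 100.
ranking = [f"p{i}" for i in range(1000)]  # system's ranked candidate list
gold = {"p3", "p42", "p500", "p999"}      # must-cite set for this query
print(recall_at_k(ranking, gold))  # 0.5
```

Averaging this per-query score over all test papers gives the benchmark-level figure; a best score below 0.5 means more than half of the must-cite set is typically missed even with 100 guesses.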