How I measured it all
Quality
We report nDCG@10, Recall@100, MRR@10, and MAP@100 over the full query set for each dataset.
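The rank-position metrics can be stated compactly. A minimal sketch of two of them with binary relevance (`ndcg_at_k` and `mrr_at_k` are illustrative names, not the harness's actual functions):

```python
import math

def ndcg_at_k(ranked_ids, relevant, k=10):
    """nDCG@k with binary relevance: DCG of the ranking divided by the
    ideal DCG (all relevant docs packed at the top)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

def mrr_at_k(ranked_ids, relevant, k=10):
    """Reciprocal rank of the first relevant hit within the top k."""
    for i, doc in enumerate(ranked_ids[:k]):
        if doc in relevant:
            return 1.0 / (i + 1)
    return 0.0
```

In practice a TREC-style evaluator computes all four metrics from qrels and run files; the sketch just pins down the definitions.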
Latency
Latency is end-to-end query latency at batch=1 on an A100 40GB: p50 over a fixed 200-query sample after 10 warmup queries. It includes tokenization and query preprocessing, the query embedding forward pass, and the backend retrieval call for top-100. It excludes corpus indexing, corpus embedding generation, index build, network or API overhead, application-layer reranking, and disk cold-start.
We also store the encoding latency (query_encode_ms_p50) and the retrieval
latency (retrieval_ms_p50_topk100) as an additional breakdown.
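The measurement loop is simple enough to sketch. This is a hedged illustration of the protocol above (warmup discarded, median over a fixed sample), not the actual harness code; `run_query` stands in for whatever callable wraps encode-plus-retrieve:

```python
import time
import statistics

def p50_latency_ms(run_query, queries, warmup=10, sample=200):
    """Median end-to-end latency in ms over a fixed query sample.
    Warmup queries run first and their timings are discarded, so JIT,
    caches, and allocators settle before measurement."""
    for q in queries[:warmup]:
        run_query(q)
    timings = []
    for q in queries[:sample]:
        t0 = time.perf_counter()
        run_query(q)
        timings.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(timings)
```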
Storage
We report index bytes for the serving index when a backend creates one, otherwise the representation size. Compression variants (fp16, int8, binary+rerank, MUVERA, FastPlaid) each get their own row rather than asterisked footnotes.
Backends
Dense: flat (NumPy BLAS matmul), HNSW (hnswlib, M=32, ef_search=128, tuned for high recall), OPQ-IVF-PQ, RaBitQ, ScaNN, binary+rerank.
Late-interaction: exact MaxSim, FastPlaid, MUVERA.
Sparse baseline: BM25 (Pyserini / Anserini).
Hybrid rows compose two or three of the above as first-stage retrievers.
For corpora up to ~171k documents, exact contiguous flat search can outperform high-recall HNSW because graph traversal overhead dominates the BLAS matmul.
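The flat backend's core is a single matmul followed by a partial sort, which is why it stays competitive at small scale. A minimal sketch of the idea (`flat_topk` is an illustrative name; the real backend handles batching and dtype details):

```python
import numpy as np

def flat_topk(query_vec, corpus_matrix, k=100):
    """Exact dense retrieval: one BLAS matmul over the contiguous corpus
    matrix, then a partial sort to rank only the k winners."""
    scores = corpus_matrix @ query_vec            # (N,) inner products
    idx = np.argpartition(-scores, k - 1)[:k]     # unordered top-k, O(N)
    order = np.argsort(-scores[idx])              # sort just k entries
    return idx[order], scores[idx[order]]
```

At ~171k docs this is one sequential pass over memory, whereas HNSW pays pointer-chasing costs per hop.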
Hybrid fusion
Hybrid rows fuse component run lists with Reciprocal Rank Fusion at k=60.
Each component contributes a top-1000 run list; the fused list is evaluated at top-100.
We use one representative dense model and one representative late-interaction model per
hybrid row, so the comparison reads as BM25 / dense / LI / hybrid rather than a
fusion-weight grid.
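RRF at k=60 fits in a few lines. A sketch of the fusion step described above, assuming each component run list is already rank-ordered (`rrf_fuse` is an illustrative name):

```python
def rrf_fuse(run_lists, k=60, depth=1000, topn=100):
    """Reciprocal Rank Fusion: each component contributes 1/(k + rank)
    per document; the fused list is truncated to topn for evaluation."""
    scores = {}
    for run in run_lists:
        for rank, doc in enumerate(run[:depth], start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:topn]
```

The k=60 constant damps the influence of exact rank positions, which is why RRF needs no score normalisation across heterogeneous components.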
Hybrid latency
Hybrid p50 e2e is max(component p50) + fusion_ms. This matches how production hybrids serve: components dispatch concurrently and then merge. We also store serial latency (sum(component p50) + fusion_ms) and per-component latencies, both visible in the row detail panel.
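Both numbers come from the same per-component measurements. A one-function sketch of the arithmetic (`hybrid_latencies` is an illustrative name):

```python
def hybrid_latencies(component_p50s_ms, fusion_ms):
    """Concurrent serving: the slowest component gates the response, and
    fusion runs after the merge barrier. Serial is the pessimistic
    single-threaded fallback."""
    concurrent = max(component_p50s_ms) + fusion_ms
    serial = sum(component_p50s_ms) + fusion_ms
    return concurrent, serial
```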
Applicability
We skip configurations that would mislead at small scale: OPQ-IVF-PQ doesn't run on corpora below ~10k documents, where the index would be under-trained. Skipped rows don't appear on the leaderboard. Rows marked quality-only have nDCG and recall but no latency, usually because the backend variant exists for storage tradeoffs rather than serving.
Best balance
Best balance is the knee point of the Pareto frontier. We normalise both axes to
[0, 1], draw a chord from the cheapest frontier member to the highest-quality
one, and pick the frontier system with the largest perpendicular distance above that
chord. That's where the curve bends: where adding cost stops buying proportional quality.
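The chord construction above maps to a short perpendicular-distance computation. A sketch, assuming frontier members are sorted by increasing cost (`knee_point` is an illustrative name):

```python
import numpy as np

def knee_point(cost, quality):
    """Knee of the Pareto frontier: normalise both axes to [0, 1], draw a
    chord from the cheapest member to the highest-quality one, and return
    the index with the largest perpendicular distance above that chord."""
    cost = np.asarray(cost, dtype=float)
    quality = np.asarray(quality, dtype=float)
    c = (cost - cost.min()) / max(cost.max() - cost.min(), 1e-12)
    q = (quality - quality.min()) / max(quality.max() - quality.min(), 1e-12)
    x0, y0, x1, y1 = c[0], q[0], c[-1], q[-1]
    # signed distance to the chord via the 2D cross product; positive = above
    dist = ((x1 - x0) * (q - y0) - (c - x0) * (y1 - y0)) / np.hypot(x1 - x0, y1 - y0)
    return int(np.argmax(dist))
```

The normalisation matters: without it, the axis with the larger raw range would dominate the distance and drag the knee toward one end of the frontier.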
Result logs
Per-row JSON for every measured configuration lives in
results/.