How I measured it all

Quality

We report nDCG@10, Recall@100, MRR@10, and MAP@100 over the full query set for each dataset.
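
For concreteness, here's a minimal sketch of computing those four metrics with the ir_measures library (the library choice and TREC-format files are assumptions; the actual harness may evaluate differently):

```python
import ir_measures
from ir_measures import nDCG, R, RR, AP

# qrels: {query_id: {doc_id: relevance}}; run: {query_id: {doc_id: score}}
qrels = ir_measures.read_trec_qrels("qrels.txt")
run = ir_measures.read_trec_run("run.txt")

metrics = ir_measures.calc_aggregate(
    [nDCG @ 10, R @ 100, RR @ 10, AP @ 100],  # nDCG@10, Recall@100, MRR@10, MAP@100
    qrels,
    run,
)
print(metrics)
```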

Latency

Latency is end-to-end query latency at batch=1 on an A100 40GB, reported as p50 over a fixed 200-query sample after 10 warmup queries. It includes tokenization and query preprocessing, the query embedding forward pass, and the backend retrieval call for top-100. It excludes corpus indexing, corpus embedding generation, index build, network or API overhead, application-layer reranking, and disk cold start.

We also store the encoding latency (query_encode_ms_p50) and the retrieval latency (retrieval_ms_p50_topk100) as an additional breakdown.
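
A sketch of the timing loop, assuming per-query callables `encode` and `retrieve` (illustrative names, not the harness API):

```python
import time
import numpy as np

def measure_latency(queries, encode, retrieve, warmup=10, sample=200, k=100):
    """p50 end-to-end / encode / retrieval latency in ms at batch=1."""
    for q in queries[:warmup]:                 # warm caches, CUDA context, JIT
        retrieve(encode(q), k=k)
    e2e, enc, ret = [], [], []
    for q in queries[:sample]:                 # fixed 200-query sample
        t0 = time.perf_counter()
        emb = encode(q)                        # tokenization + forward pass
        t1 = time.perf_counter()
        retrieve(emb, k=k)                     # backend top-100 call
        t2 = time.perf_counter()
        enc.append((t1 - t0) * 1e3)
        ret.append((t2 - t1) * 1e3)
        e2e.append((t2 - t0) * 1e3)
    return {
        "e2e_ms_p50": float(np.median(e2e)),
        "query_encode_ms_p50": float(np.median(enc)),
        "retrieval_ms_p50_topk100": float(np.median(ret)),
    }
```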

Storage

We report index bytes for the serving index when a backend creates one, otherwise the representation size. Compression variants — fp16, int8, binary+rerank, MUVERA, FastPlaid — get their own row rather than asterisked footnotes.
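
When a backend ships no serving index, the representation size is a straight dtype-width calculation. A sketch for the dense variants (the binary variant packs 1 bit per dimension):

```python
def dense_representation_bytes(n_docs: int, dim: int, variant: str = "fp32") -> int:
    """Raw embedding-matrix size; backends with a real index are measured on disk."""
    if variant == "binary":
        return n_docs * ((dim + 7) // 8)       # 1 bit per dim, packed into bytes
    return n_docs * dim * {"fp32": 4, "fp16": 2, "int8": 1}[variant]

# 171k docs x 768 dims: fp32 ~525 MB, fp16 ~263 MB, int8 ~131 MB, binary ~16 MB
```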

Backends

Dense: flat (numpy BLAS matmul), HNSW (hnswlib, M=32, ef_search=128 — high recall; see the sketch after this list), OPQ-IVF-PQ, RaBitQ, ScaNN, binary+rerank.
Late-interaction: exact MaxSim, FastPlaid, MUVERA.
Sparse baseline: BM25 (Pyserini / Anserini).
Hybrid rows compose two or three of the above as first-stage retrievers.
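
The HNSW settings above map directly onto hnswlib. A minimal sketch (ef_construction and the inner-product space are assumptions, not reported settings):

```python
import hnswlib
import numpy as np

dim, n = 768, 100_000
docs = np.random.rand(n, dim).astype(np.float32)   # stand-in corpus embeddings

index = hnswlib.Index(space="ip", dim=dim)
index.init_index(max_elements=n, M=32, ef_construction=200)
index.add_items(docs, np.arange(n))
index.set_ef(128)                                  # ef_search=128: high recall

labels, distances = index.knn_query(docs[:1], k=100)
```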

For corpora up to ~171k documents, exact contiguous flat search can outperform high-recall HNSW: the graph-traversal overhead exceeds the cost of a single BLAS matmul over the whole corpus.
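
The flat baseline that wins at that scale is one matmul plus a partial sort; a sketch:

```python
import numpy as np

def flat_topk(query: np.ndarray, docs: np.ndarray, k: int = 100) -> np.ndarray:
    """Exact top-k by inner product over a contiguous fp32 matrix (one BLAS call)."""
    scores = docs @ query                      # (n_docs,)
    top = np.argpartition(-scores, k)[:k]      # O(n) partial selection
    return top[np.argsort(-scores[top])]       # full sort only over the k survivors
```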

Hybrid fusion

Hybrid rows fuse component run lists with Reciprocal Rank Fusion at k=60. Each component contributes a top-1000 run list; the fused list is evaluated at top-100. We use one representative dense model and one representative late-interaction model per hybrid row, so the comparison reads as BM25 / dense / LI / hybrid rather than a fusion-weight grid.
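
RRF scores each document by summing 1/(k + rank) across the component run lists it appears in; a minimal sketch at k=60:

```python
from collections import defaultdict

def rrf(run_lists, k: int = 60, depth: int = 1000, topn: int = 100):
    """Fuse ranked doc-id lists: each component contributes its top-1000,
    and the fused list is cut to top-100 for evaluation."""
    scores = defaultdict(float)
    for ranking in run_lists:
        for rank, doc_id in enumerate(ranking[:depth], start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:topn]
```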

Hybrid latency

Hybrid p50 e2e is computed as max(component p50) + fusion_ms, which matches how production hybrids serve: components dispatch concurrently and then merge. We also store serial latency (sum(component p50) + fusion_ms) and per-component latencies, both visible in the row detail panel.
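
In code, with component p50s in milliseconds (names illustrative):

```python
def hybrid_latency(component_p50_ms: list[float], fusion_ms: float) -> dict:
    """Parallel dispatch: wall time is the slowest component, plus the merge."""
    return {
        "e2e_ms_p50": max(component_p50_ms) + fusion_ms,     # reported headline
        "serial_ms_p50": sum(component_p50_ms) + fusion_ms,  # also stored
    }

# e.g. BM25 at 4 ms + dense at 9 ms with 0.3 ms fusion -> 9.3 ms parallel, 13.3 ms serial
```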

Applicability

We skip configurations that would mislead at small scale — OPQ-IVF-PQ doesn't run on corpora below ~10k documents, where the index would be under-trained. Skipped rows don't appear on the leaderboard. Rows marked "quality only" have nDCG and recall but no latency, usually because the backend variant exists for storage tradeoffs rather than serving.

Best balance

Best balance is the knee point of the Pareto frontier. We normalise both axes to [0, 1], draw a chord from the cheapest frontier member to the highest-quality one, and pick the frontier system with the largest perpendicular distance above that chord. That's where the curve bends: where adding cost stops buying proportional quality.
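
A sketch of that selection, assuming the frontier arrives as (cost, quality) pairs already filtered to Pareto-optimal systems:

```python
import numpy as np

def knee_point(frontier: list[tuple[float, float]]) -> tuple[float, float]:
    """Frontier member with the largest perpendicular distance from the chord."""
    pts = np.asarray(sorted(frontier))                    # (cost, quality), by cost
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    norm = (pts - lo) / np.where(hi > lo, hi - lo, 1.0)   # both axes to [0, 1]
    a, b = norm[0], norm[-1]                              # cheapest -> highest quality
    chord = (b - a) / np.linalg.norm(b - a)
    rel = norm - a
    # perpendicular distance via the 2D cross product; frontier points all
    # sit on the same side of the chord, so the absolute value is safe
    dist = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])
    return tuple(pts[np.argmax(dist)])
```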

Result logs

Per-row JSON for every measured configuration lives in results/.