On the Theoretical Limitations of Embedding-Based Retrieval


TL;DR

This paper connects learning theory with embedding-based retrieval, showing both theoretically and empirically that single-vector embedding models, because of their fixed dimensionality, cannot represent all possible combinations of relevant documents. It also constructs a deliberately simple dataset, LIMIT, revealing that even today's state-of-the-art models run into this fundamental limitation.

Background

The core argument of this paper rests on a mathematical formalization of the retrieval problem and of vector representational capacity. The motivation is as follows:

Current neural-network-based embedding models (i.e., dense retrieval) are being used to tackle increasingly complex retrieval tasks, such as instruction-following and multimodal retrieval. A common trend in the community is to push models toward handling “any query” and “any definition of relevance,” implicitly assuming that embedding models have unlimited representational capacity.

However, existing academic benchmarks (such as MTEB and QUEST) usually contain only a limited number of queries, largely due to annotation costs. These queries cover only a tiny fraction of all possible combinations of relevant documents, thereby masking the potential representational bottlenecks of embedding models. Although some studies have empirically explored the effect of dimensionality, there has been no solid theoretical foundation explaining this fundamental limitation.

This paper aims to fill this gap, asking specifically: do single-vector embedding models have a fundamental, theoretical limit on representing all possible top-$k$ combinations of relevant documents? How does this limit relate to the embedding dimension $d$? And does it surface in practice even in simple settings?

Method

Theoretical Foundation and Formalization

The paper first formalizes the retrieval problem mathematically. Given $m$ queries and $n$ documents, their relevance can be represented by a binary matrix $A \in {0, 1}^{m \times n}$ (called qrels in information retrieval). An embedding model maps query $i$ to a vector $u_i \in \mathbb{R}^d$ and document $j$ to a vector $v_j \in \mathbb{R}^d$. Retrieval scores are computed by the dot product $u_i^T v_j$. All scores form a score matrix $B = U^T V$, whose rank is at most $d$.
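This formalization is easy to make concrete. The sketch below (my illustration, not the paper's code) builds random query and document embeddings, forms the score matrix $B = U V^\top$, and checks the rank bound that the whole argument rests on: no matter how $U$ and $V$ are chosen, $\operatorname{rank} B \leq d$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 100, 500, 32                  # m queries, n documents, embedding dim d

U = rng.standard_normal((m, d))         # rows are query embeddings u_i
V = rng.standard_normal((n, d))         # rows are document embeddings v_j

B = U @ V.T                             # score matrix: B[i, j] = u_i . v_j
print(B.shape)                          # (100, 500)
print(np.linalg.matrix_rank(B) <= d)    # True: rank(B) <= d, whatever U and V are

# Retrieval returns, for each query, the k highest-scoring documents.
k = 10
topk = np.argsort(-B, axis=1)[:, :k]
```

The rank bound is the crux: $B$ lives in an $m \times n$ space, but it can never leave the set of rank-$\leq d$ matrices, so only qrel patterns compatible with some low-rank score matrix are representable.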

The paper’s main theoretical contribution is to connect the requirement of “being able to retrieve correctly” with matrix rank. For a qrel matrix $A$, it defines the row-wise order-preserving rank $\operatorname{rank}_{\text{rop}} A$ as the smallest rank of any score matrix $B$ that, in every row, scores all relevant documents above all irrelevant ones; the row-wise thresholdable rank $\operatorname{rank}_{\text{rt}} A$ (some per-query threshold separates relevant from irrelevant scores) and the globally thresholdable rank $\operatorname{rank}_{\text{gt}} A$ (a single threshold works for every query) are defined analogously. Writing $\operatorname{rank}_{\pm} M$ for the sign rank of a matrix $M$ (the smallest rank of any matrix whose entries match the signs of $M$), a series of proofs derives the following key relationship:

\[\operatorname{rank}_{\pm}(2A-\mathbf{1}_{m\times n})-1 \leq \operatorname{rank}_{\text{rop}}A = \operatorname{rank}_{\text{rt}}A \leq \operatorname{rank}_{\text{gt}}A \leq \operatorname{rank}_{\pm}(2A-\mathbf{1}_{m\times n})\]

This chain of inequalities pins the minimum embedding dimension $d$ the model needs (i.e., $\operatorname{rank}_{\text{rop}} A$) to within one of the sign rank $\operatorname{rank}_{\pm}(2A-\mathbf{1}_{m\times n})$ of the shifted relevance matrix.
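As a sanity check on the chain (my worked example, not from the paper), take two queries that each want a different one of two documents:

\[A=\begin{pmatrix}1&0\\0&1\end{pmatrix},\qquad 2A-\mathbf{1}_{2\times 2}=\begin{pmatrix}1&-1\\-1&1\end{pmatrix}.\]

The shifted matrix already has rank 1, so $\operatorname{rank}_{\pm}(2A-\mathbf{1}) = 1$, and the chain sandwiches $\operatorname{rank}_{\text{rop}} A$ between $0$ and $1$; since a rank-0 (all-zero) score matrix only produces ties, $\operatorname{rank}_{\text{rop}} A = 1$. Indeed $u_1 = v_1 = 1$, $u_2 = v_2 = -1$ gives the score matrix

\[B = \begin{pmatrix}1&-1\\-1&1\end{pmatrix},\]

in which each query ranks its relevant document first, so $d=1$ suffices here. The paper's point is that other qrel patterns have sign rank, and hence required dimension, that grows without bound.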

Innovations

The essential innovation of this paper is that it does not propose a new model, but instead reveals the ceiling of the existing paradigm at the theoretical level. Its core ideas are:

  1. Connecting theory and practice: For the first time, the concept of “sign rank” from communication complexity theory is introduced into modern neural information retrieval, providing a solid mathematical tool for analyzing the representational capacity of embedding models.
  2. Proving a fundamental limitation: It proves that for any given embedding dimension $d$, there always exist some combinations of relevant documents (i.e., a qrel matrix $A$) that cannot be perfectly represented by a $d$-dimensional embedding model. This is because the sign rank of qrel matrices can be arbitrarily high, while the model’s representational capacity is constrained by the dimension $d$.
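A toy illustration of point 2 (my example, not from the paper): with $d=1$, every row of $B = UV^\top$ is a scalar multiple of the document vector $v$, so each query's top-1 document is either $\arg\max v$ or $\arg\min v$. Three queries that each want a distinct top-1 document (qrels $A = I_3$ with $k=1$) are therefore unrepresentable in one dimension, for any choice of embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# With d = 1, row i of B = U V^T is u_i * v. Only the sign of u_i affects the
# ranking: the top-1 document is argmax(v) when u_i > 0 and argmin(v) when
# u_i < 0. So across ALL queries, at most 2 distinct documents can ever be
# ranked first, and A = I_3 (three queries, three distinct top-1 docs) is
# unrepresentable -- no matter how U and V are chosen.
for trial in range(10_000):
    v = rng.standard_normal(3)                             # any 1-d doc embeddings
    tops = {int(np.argmax(s * v)) for s in (+1.0, -1.0)}   # only u's sign matters
    assert len(tops) <= 2                                  # never all three docs
print("A = I_3 needs d >= 2: confirmed over 10,000 random document embeddings")
```

In two dimensions the pattern is trivially representable (place the three documents in distinct directions), which is exactly what the sign-rank machinery predicts on larger scales.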

Empirical Validation and Dataset Construction

To validate the theory and demonstrate its real-world impact, the paper designs a two-part empirical study.

Best-Case Optimization Experiments

To eliminate confounding factors such as natural-language modeling, the paper designs a “free embedding” optimization experiment: the query and document vectors themselves are directly optimized parameters, trained to fit a given qrel matrix. This gives an upper bound on what any embedding model of the same dimension could achieve.
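The free-embedding setup can be sketched in a few lines of NumPy (my illustrative reconstruction, not the paper's code): the query and document matrices are the only parameters, trained by gradient descent on a pairwise softplus loss over a LIMIT-style dense qrel pattern, in which every pair of documents is the relevant set of some query.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def free_embedding_fit(d, n=8, steps=3000, lr=1.0):
    """Directly optimize free query/document vectors of dimension d to fit the
    dense qrel pattern: every pair of n documents is some query's relevant set."""
    pairs = list(combinations(range(n), 2))      # all C(n, 2) = 28 queries for n = 8
    m, k = len(pairs), 2
    A = np.zeros((m, n), dtype=bool)
    for i, (p, q) in enumerate(pairs):
        A[i, [p, q]] = True

    U = 0.5 * rng.standard_normal((m, d))        # free query embeddings
    V = 0.5 * rng.standard_normal((n, d))        # free document embeddings

    for _ in range(steps):
        S = U @ V.T                              # score matrix, rank <= d
        diff = S[:, None, :] - S[:, :, None]     # diff[i, p, q] = S[i, q] - S[i, p]
        mask = A[:, :, None] & ~A[:, None, :]    # (relevant p, irrelevant q) pairs
        g = mask / (1.0 + np.exp(-np.clip(diff, -50, 50)))   # d softplus / d diff
        dS = g.sum(axis=1) - g.sum(axis=2)       # gradient w.r.t. the scores
        U, V = U - lr * (dS @ V) / (m * k), V - lr * (dS.T @ U) / (m * k)

    # A query counts as solved only if its top-k set is exactly its relevant set.
    S = U @ V.T
    topk = np.argsort(-S, axis=1)[:, :k]
    return float(np.mean([set(topk[i]) == set(np.flatnonzero(A[i]))
                          for i in range(m)]))

for d in (2, 4, 8):
    print(f"d={d}: fraction of queries solved exactly = {free_embedding_fit(d):.2f}")
```

With $d = n$ a perfect fit always exists (take $V = I$ and $U = A$), so failures at smaller $d$ reflect the rank constraint itself rather than any modeling deficiency, which is precisely the point of the best-case experiment.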


The LIMIT Dataset

To reproduce the above theoretical limitation in a realistic natural-language setting, the paper constructs a new dataset called LIMIT (Limitations of Instructions & Meaning In Text). Its queries ask which people like a given item and its documents state each person's preferences; crucially, every query has exactly two relevant documents, and the qrels instantiate all pairwise combinations over a small core document set (the dense pattern the theory identifies as hardest), embedded in a much larger corpus of distractor documents.


Experimental Conclusions

The paper evaluates a range of SOTA embedding models on LIMIT and conducts an in-depth analysis.

Key Findings

Despite LIMIT's trivial surface form, state-of-the-art single-vector embedding models perform poorly on it: recall@100 remains low across the models tested, and scores rise with embedding dimension, as the theory predicts. Sparse lexical retrieval (BM25), whose effective dimensionality is very high, and multi-vector models are far less affected.

Further Analysis

The free-embedding experiments trace the limit directly: for each dimension $d$ there is a critical number of documents beyond which even directly optimized vectors can no longer fit all combinations, and this critical point grows only polynomially with $d$. The analysis also shows that difficulty depends on the qrel pattern (the dense, all-combinations pattern is hardest) and is not merely a domain-shift artifact.

Summary

The experimental results strongly confirm the theoretical predictions: single-vector embedding models have a fundamental representational ceiling determined by their dimensionality. As retrieval tasks, especially instruction-following tasks, grow more complex and require models to return combinations of documents never seen during training, this theoretical limit becomes a serious practical bottleneck. The community should therefore be fully aware of this limitation when evaluating and designing next-generation retrieval systems, and should actively explore more expressive alternatives such as multi-vector models, sparse representations, or hybrid approaches.