Visual Language Hypothesis

ByteDance’s hardcore derivation: Is the essence of visual understanding a “fiber bundle”? Unveiling the topological truth behind Expand-and-Snap

In the era of large models, we have grown used to trading compute and data for the emergence of intelligence. Yet one fundamental question has remained unresolved: can a model really gain “semantic understanding” simply by reconstructing pixels or preserving local consistency?

ArXiv URL：http://arxiv.org/abs/2512.23335v1

A recent theoretical study from ByteDance (Bytedance) gives a negative answer. Rather than proposing a new SOTA model to top the charts, this paper takes a striking perspective from topology and group theory: the prerequisite for visual understanding is the existence of a “visual semantic language,” and the formation of such a language must go through a dramatic topological “collapse.”

This paper not only explains why purely generative models struggle to acquire high-level semantics, but also reveals from the mathematical foundation the true role of Attention and Softmax in the Transformer architecture — they are not merely computational mechanisms, but scalpels for performing “topological surgery.”

The “fiber bundle” structure of the visual world

The core starting point of the study is a simple but powerful hypothesis: the Visual Language Hypothesis.

The researchers argue that visual understanding is not just perception; it presupposes a “semantic language.” In this language, countless ever-changing perceptual observations (Observations) ultimately correspond to only a very small number of discrete semantic states (Semantic States).

If we combine this hypothesis with the notion of “transferability” in machine learning, a necessary geometric structure emerges: the visual observation space must be organized in a form similar to a Fiber Bundle.

Total space $X$: the raw pixel world we see, full of chaos and detail.
Fibers: spaces filled by physical laws or nuisance variations. For example, a cup’s rotation, lighting changes, and occlusion all form high-dimensional “fibers.”
Base space $X/G$: the “quotient space” after stripping away all nuisance factors, where true semantics reside.

In this model, the essence of visual understanding is to find a mapping $\pi$ that collapses all observations lying on the same “fiber” — for example, the same cup seen from different angles — onto the same point in the base space.

Why can’t “smooth deformation” produce semantics?

After understanding the fiber bundle structure, the paper derives a counterintuitive conclusion: true semantic abstraction cannot be achieved through smooth, continuous deformation.

In deep learning, the reconstruction losses we commonly use (such as Autoencoder, VAE) or some self-supervised learning methods are essentially learning a continuous function. From a topological perspective, these functions are performing a “homotopy transformation” on the data manifold. It is like having a piece of modeling clay (the data manifold): you can stretch it, bend it, or even knead it into a ball, but you cannot tear it apart, nor can you forcibly glue two separate points together.

Yet semantic abstraction precisely requires “gluing.”

To obtain the semantic quotient space $X/G$, the model must collapse all nuisance variables along an orbit (for example, all images produced by rotating an object) into a single point. This is a non-homeomorphic process.

In other words, if your training objective is merely to perfectly reconstruct pixels, or to preserve local geometric structure, then your model is only giving the data a “facelift,” not a “soul upgrade.” It preserves the original topological structure of the data, and therefore preserves all redundant information, making it impossible to form truly abstract concepts.

The necessary path to semantic understanding: Expand-and-Snap

If smooth deformation does not work, how does a model achieve semantic abstraction? The paper proposes an extremely vivid mechanism: Expand-and-Snap.

This reveals the deep logic behind deep neural network architecture design:

Expand: First, the model needs to map the data into a higher-dimensional space. This corresponds to Cover’s Theorem in classical learning theory — in high-dimensional spaces, complex entangled structures are more easily linearly separable. Transformer increases the internal dimensionality precisely to give the manifold enough “stretching room” to untangle it.
Snap: This is the most critical step. After disentanglement, the model must perform a topological “surgery,” forcibly collapsing a continuous manifold into discrete regions.

The paper points out that this “collapse” capability is usually provided by specific architectural components. For example, the Softmax operation and routing in the attention mechanism are essentially performing selective path activation and mass concentration. They are no longer geometry-preserving transformations; instead, they approximate an identification operation on a quotient space.

This is why purely geometry-preserving architectures (such as simple fully connected layers) can only reshape the appearance manifold, while architectures capable of routing, gating, or selective collapse (such as Transformer and MoE) have the ability to approximate the semantic quotient space.

Why do large models need “discriminative” objectives?

The theory also explains why recent multimodal models (such as CLIP) and large language models (LLMs) have been so successful in semantic understanding.

According to the derivation, to break the constraints of homotopy equivalence, one must introduce a non-homeomorphic, discriminative target.

Generative objectives (such as pixel reconstruction): tend to preserve all information and make it difficult to discard nuisance variables through collapse.
Discriminative objectives (such as label classification and image-text alignment): introduce external semantic structure (Label or Text). These objectives force the model to map different images — as long as they belong to the same class — to the same point.

This external strong constraint forces the model to perform “topological surgery” in its internal representations, cutting the fibers and extracting the base space. Although LLMs are generative, their core task is to predict the next Token (choosing from a finite vocabulary), which is essentially an extremely strong classification task that forces a continuous stream of thought to collapse into discrete symbols.

Summary

Through rigorous mathematical derivation, this paper provides us with a new “topological lens” for examining AI models. It tells us:

Dimensionality is a geometric problem, cardinality is a topological problem. The difficulty of semantic abstraction does not lie in handling high-dimensional data, but in how to collapse continuous infinite states into a finite set of discrete symbols.
Attention peaks are not just feature selection, but topological surgery. The sharp Attention distributions in Transformer are physical evidence that large models are trying to crush continuous manifolds into discrete semantic units.
Expand-and-Snap is a universal paradigm of intelligence. First deconstruct in high-dimensional space (Expand), then induct in semantic space (Snap) — this may be the necessary path from perception to cognition.

This theoretical framework not only resonates with Vapnik’s structural risk minimization principle, but also points the way toward designing more efficient visual representation architectures in the future: do not focus only on fitting the geometric shape of pixels; dare to “cut into” the topological structure of the manifold.