A few days ago, we released a preprint1 showing evidence that symbol grounding can emerge in language models. 1 The Mechanistic Emergence of Symbol Grounding in Language Models (Wu*, Ma* and Luo* et al., 2025) The dust of the release had not yet settled when a friend, after reading the manuscript, messaged me (Martin):
"I wouldn't have thought to link these things. What made you do it?"
That question stayed with us, as it exposed a quiet truth about research: we too often present it as a polished surface or a linear arc from premise to conclusion, when in practice it unfolds as a cartography of detours. What looks, in retrospect, like a straight line was, in fact, a constellation of hunches, failed starts, fragile insights, and long hours spent staring at phenomena no one else thought worth staring at.
The ethos of open-notebook science has long resonated with us. It is a quiet belief that knowledge does not emerge solely from final proofs or polished figures, but also from the raw material of inquiry: half-formed ideas, imperfect setups, unpublished (even negative2) results, and the many intermediate steps that usually remain unseen. 2 Illuminating 'the ugly side of science': fresh incentives for reporting negative results (Brazil, 2024)
Interpretability research moves forward through curious observations and moments when we choose to look closer instead of turning away. This blog is not only about what we found; it is about how we came to see, shape, and articulate those findings. What follows keeps the non-linear nature of the work intact. Results that shaped the paper, results that refused to fall into place, and results that were simply too strange to overlook are all included here, with only the most obvious bugs set aside.
Observations
Over the past few years, (large) vision-language models (VLMs) have come to be seen as general-purpose problem solvers, capable of integrating text, images, and other perceptual inputs. We see clear evidence that semantic correspondences emerge inside these models: language tokens align with visual entities and structures, enabling zero-shot localization, segmentation, and cross-modal matching. 3 GROUNDHOG: Grounding Large Language Models to Holistic Segmentation (Zhang et al., 2024)
 
                        Recent studies describe this phenomenon as "grounding" emerging without explicit supervision, based on the observation that language tokens naturally map onto visual regions and structures during large-scale training. Below are a few representative examples:
| Paper | What's Observed | Operational Definition of Grounding | Operational Definition of Emergence | 
|---|---|---|---|
| Bousselham et al. (2024)4 | Self-attention patterns cluster visual tokens into object-like groups aligned with textual prompts. | Text-aligned object localization and segmentation obtained directly from pretrained attention structure. | Localization emerges without fine-tuning or additional objectives, as a property of pretrained attention. | 
| Cao et al. (2025)5 | Attention maps produce pixel-level masks through an attend-and-segment procedure. | Alignment between language spans and pixel regions derived from internal attention. | Masking ability appears spontaneously during large-scale training without grounding supervision. | 
| Schnaus et al. (2025)6 | Independently trained language and vision embedding spaces align through geometric matching. | Cross-modal correspondence between linguistic and visual representations without paired data. | Alignment emerges from structural convergence in embedding geometry, not explicit supervision. | 
These observations are genuinely interesting: they suggest that the capacity to align language symbols with perceptual structure may arise spontaneously during large-scale training. Yet they are not examples of symbol grounding in the classical sense of Harnad (1990)7. 7 The Symbol Grounding Problem (Harnad, 1990) While these studies provide preliminary evidence of grounding-like behavior, their operational definitions of grounding and emergence are limited. Grounding is framed primarily through spatial correlations or geometric correspondences, and emergence is defined descriptively as the presence of such patterns in a trained model rather than a principled account of when, how, or why they arise during learning. Thus one can only conclude that what emerges in these settings is evidence of statistical correlations, rather than demonstrable causal anchoring of symbols to referents in the world. Any stronger claim about grounding would extend beyond what the evidence supports.
 
                        This gap between emergent correlations and symbol grounding is precisely where our curiosity begins. Rather than studying grounding only as a static property of a finished model, we turn our attention to the learning trajectory itself: when does grounding emerge, how does it specialize, and what mechanisms give rise to it. By tracing these dynamics with causal interventions and controlled comparisons, we aim to move from observing alignment to understanding its origin.
Minimal Testbed
The first step in our investigation is to establish clear (i) conceptual and (ii) operational definitions, together with (iii) a minimal testbed that allows us to formalize the problem in a controlled setting.
- We treat emergence conceptually as a capacity or structure that arises spontaneously as a byproduct of large-scale optimization on a proxy task, rather than being explicitly imposed through supervision or inductive biases. Operationally, emergence should not be inferred solely from the final trained model. Instead, it must be probed throughout the entire training trajectory, which requires either retraining our language models from scratch or leveraging open-weight models with accessible intermediate checkpoints, such as Pythia or Olmo.
- We define grounding conceptually as the process of mapping a primary data source X to an external resource Y that serves as its ground.8 8 Learning Language through Grounding (Shi et al., 2025) In the VLM setting, symbol grounding corresponds to the question of how discrete linguistic symbols (tokens) acquire meaning through their mapping to visual perception.
To minimize confounds introduced by the visual perception system and study symbol grounding in a controlled setting, we construct a minimal testbed that satisfies two key requirements: (i) the dataset should be as simple as possible while still capturing essential aspects of realistic language model training; and (ii) the task should have a "true" grounded solution, meaning that certain tokens explicitly refer to external entities or events. This makes it possible to probe whether and how such links emerge during learning.
A natural candidate corpus for this purpose is the Child Language Data Exchange System (CHILDES9), 9 The CHILDES Project: Tools for Analyzing Talk (MacWhinney, 2000) which provides speech transcriptions paired with environmental annotations. This offers a structured yet ecologically meaningful context for examining grounding in language models. Below is an example corpus snippet:
*CHI: My favorite animal is the horse.
%act: painted a picture of a horse
                        The above utterance from a child (CHI) is paired with a non-verbal description of their actions (%act),
                        which essentially means that the child was painting a picture of a horse while saying "My favorite animal is the horse."
                    
                        We thus design a minimal testbed with a custom word-level tokenizer, in which every lexical item is represented in 
                        two corresponding forms. One token appears in non-verbal descriptions 
                        (for example, a horse in the scene description), and another appears in utterances 
                        (for example, horse in speech). 
                        We refer to these as environmental tokens (<ENV>) and linguistic tokens (<LAN>), respectively.
                        We process the ENG-US and ENG-UK subsets of CHILDES to construct our training corpus. 
                        The previous example can be rewritten in this format as:
                    
<CHI> painted<ENV> a<ENV> picture<ENV> of<ENV> a<ENV> horse<ENV>
<CHI> My<LAN> favorite<LAN> animal<LAN> is<LAN> the<LAN> horse<LAN>
 
It's worth noting that in this example, horse<ENV> and horse<LAN> are converted into two completely different indices by the word-level tokenizer, thereby removing any superficial form-level connection between them.
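For concreteness, here is a minimal sketch of what such a tokenizer might look like (the class and helper names are illustrative, not our actual implementation):

```python
class WordLevelTokenizer:
    """Minimal word-level tokenizer: each distinct surface form gets its own index,
    so horse<ENV> and horse<LAN> map to unrelated IDs."""
    def __init__(self):
        self.vocab = {}

    def _id(self, token):
        if token not in self.vocab:
            self.vocab[token] = len(self.vocab)
        return self.vocab[token]

    def encode(self, line):
        return [self._id(tok) for tok in line.split()]

tok = WordLevelTokenizer()
tok.encode("<CHI> painted<ENV> a<ENV> picture<ENV> of<ENV> a<ENV> horse<ENV>")
tok.encode("<CHI> My<LAN> favorite<LAN> animal<LAN> is<LAN> the<LAN> horse<LAN>")
assert tok.vocab["horse<ENV>"] != tok.vocab["horse<LAN>"]  # no shared surface form
```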
                    
Model training follows the standard causal language modeling objective with a whole-word tokenizer. Our primary model architectures include a Causal Transformer in the style of GPT-2, a Unidirectional LSTM, and a State-Space Model (Mamba-2). All experiments are repeated with 5 random seeds, randomizing both model initialization and corpus shuffle order.
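A rough sketch of the training loop, with stand-in architecture names and placeholder hyperparameters rather than the exact values we used:

```python
import torch

# Illustrative setup; architectures are stand-ins, hyperparameters are placeholders.
ARCHITECTURES = ["transformer", "lstm", "mamba2"]   # GPT-2-style, unidirectional LSTM, Mamba-2
SEEDS = [0, 1, 2, 3, 4]                             # 5 seeds: init + corpus shuffle order

def train_one_epoch(model, chunks, optimizer):
    """Next-token prediction over 512-token training chunks of shape (batch, 512)."""
    loss_fn = torch.nn.CrossEntropyLoss()
    for chunk in chunks:
        logits = model(chunk[:, :-1])               # (batch, 511, vocab)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), chunk[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```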
Behavioral Evidence
The first step is to confirm that this phenomenon of emergence is robust and significant. Drawing inspiration from priming in psychology, we test the conditional probability of a linguistic token (<LAN>) under two conditions. Take book<LAN> as the target word:
                        
- Match: The corresponding <ENV> token is present in the context.
- Mismatch: The corresponding <ENV> token is not present in the context.
book<ENV> <CHI> I<LAN> love<LAN> reading<LAN> ______,
toy<ENV> <CHI> I<LAN> love<LAN> reading<LAN> ______,
For a fair comparison, these two conditions should be minimally different, so we construct the mismatched context by replacing the corresponding <ENV> token with other <ENV> tokens. If the conditional probability of the target <LAN> token in the match condition is significantly higher than in the mismatch condition, we can claim reliable behavioral evidence of grounding (though not necessarily symbol grounding yet).
                    
At this stage, two limitations of our initial formulation became clear.
- The linguistic template itself introduced semantic bias: the word reading strongly implies book, making the match condition artificially easier and the mismatch (e.g., toy) linguistically implausible. To avoid such bias, we decided to replace the original template with semantically lighter constructions that do not presuppose specific referents.
- We already planned to examine attention patterns, but language models tend to disproportionately focus on the first token in the sequence, creating an attention sink that can mask more subtle grounding dynamics.10 10 Why do LLMs attend to the first token? (Barbero et al., 2025) This raises an important ambiguity: we cannot immediately tell whether the <ENV> token receives more attention because models naturally favor early positions, or because it genuinely serves as the ground for the linguistic form to be predicted.
To address this, we designed more complex environmental contexts and semantically neutral linguistic templates, ensuring that observed differences reflect genuine grounding behavior rather than artifacts of template choice or attention distribution. An example of these new templates is as follows:
<CHI> asked<ENV> for<ENV> a<ENV> new<ENV> book<ENV>
<CHI> I<LAN> love<LAN> this<LAN> ______,
For clarity, we plot the surprisal, defined as the negative log probability of the target token conditioned on context.
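A minimal sketch of how the match/mismatch surprisal can be computed, assuming a causal LM that returns next-token logits (all names here are illustrative):

```python
import torch
import torch.nn.functional as F

def surprisal(model, tokenizer, context, target):
    """Surprisal (negative log probability) of `target` given `context`.
    `model` and `tokenizer` are stand-ins for our causal LM and word-level tokenizer."""
    ids = torch.tensor([tokenizer.encode(context)])
    with torch.no_grad():
        logits = model(ids)[:, -1, :]          # next-token distribution after the context
    logp = F.log_softmax(logits, dim=-1)
    return -logp[0, tokenizer.vocab[target]].item()

# Match vs. mismatch contexts for the target book<LAN>:
match    = "book<ENV> <CHI> I<LAN> love<LAN> this<LAN>"
mismatch = "toy<ENV> <CHI> I<LAN> love<LAN> this<LAN>"
# grounding gap = surprisal(model, tok, mismatch, "book<LAN>") - surprisal(model, tok, match, "book<LAN>")
```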
 
                        We also attempted to extend our study to hybrid architectures that combine Transformer and Mamba-2 layers. Specifically, we experimented with 12-layer hybrid models in which either layers 4-7 or all even-numbered layers were replaced with Mamba-2, while the remaining layers retained the Transformer structure. Interestingly, we still observed consistent grounding behaviors under these configurations. However, we soon realized that a systematic investigation of hybrid architectures would require addressing many design considerations.11 11 A Systematic Analysis of Hybrid Linear Attention (Wang* and Zhu* et al., 2025) Given the scope of current research questions, we decided to omit these results from the main paper and include them here in this blog to motivate future work.
 
                             
                            As shown above, all models show clear separation between match and mismatch conditions. We observed two unexpected behaviors during this phase of experimentation:
- Mamba-2 exhibits mild overfitting effects toward the end of training. This behavior can likely be explained by existing work12 on state-space models, 12 Revealing and Mitigating the Local Pattern Shortcuts of Mamba (You* and Tang* et al., 2025) and it does not undermine the main conclusions of our study.
- Transformers and Mamba-2 have much clearer separation than LSTM. One might simply attribute this to the higher capacity of these architectures, but it also raises some concerns about what this behavioral finding truly reflects.
                        One simple explanation could make these results trivial: 
                        the match advantage might come from frequent co-occurrence between <LAN> and <ENV> tokens. 
                        In that case, the models may have memorized surface-level token statistics rather than learning a genuine mapping.
                    
To test this hypothesis, we start with a data ablation experiment on a 12-layer Transformer. We say that corresponding <ENV> and <LAN> tokens co-occur when they appear within the same context window (a 512-token training chunk). In the CHILDES corpus, roughly one-third of the data contains such co-occurring pairs. We retrain models after randomly removing different proportions of these co-occurring chunks from the training data. The results are as follows:
                    
 
                        As expected, increasing the amount of co-occurring data enlarges the gap between match and mismatch conditions. However, even when all co-occurring data are removed, a noticeable gap remains. This indicates that language models can still learn to associate words with their grounds in the absence of explicit co-occurrence. In other words, the observed performance gain cannot be explained by surface-level co-occurrence statistics alone.
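For concreteness, the chunk-filtering step behind this ablation might look roughly like the following sketch (the helper names and the env_to_lan mapping table are hypothetical):

```python
import random

def cooccurs(chunk_tokens, env_to_lan):
    """True if any corresponding <ENV>/<LAN> pair appears in the same 512-token chunk.
    `chunk_tokens`: list of surface tokens such as "horse<ENV>".
    `env_to_lan`: mapping like {"horse<ENV>": "horse<LAN>", ...} (hypothetical table)."""
    present = set(chunk_tokens)
    return any(env in present and lan in present for env, lan in env_to_lan.items())

def ablate(chunks, env_to_lan, remove_fraction, seed=0):
    """Randomly drop `remove_fraction` of the chunks that contain co-occurring pairs."""
    rng = random.Random(seed)
    cooc = [c for c in chunks if cooccurs(c, env_to_lan)]
    keep_cooc = rng.sample(cooc, int(len(cooc) * (1 - remove_fraction)))
    return [c for c in chunks if not cooccurs(c, env_to_lan)] + keep_cooc
```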
Fortunately, standard statistical toolkits offer an easy way to quantify whether this is indeed the case. We can fit a linear regression between the average surprisal difference of a target <LAN> token (which, from an information-theoretic perspective, can also be interpreted as information gain) and the cumulative co-occurrence, quantified by the number of examples in which both <LAN> and <ENV> are present. The R2 value of this linear regression over time shows how well co-occurrence predicts the information gain. If the information gain simply comes from co-occurrence, we should see a strictly increasing R2 over time.
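A small sketch of this regression analysis, assuming per-word arrays of information gain and cumulative co-occurrence counts at each checkpoint (array shapes and names are illustrative):

```python
import numpy as np
from scipy import stats

def r_squared_over_time(info_gain_per_word, cumulative_cooc_per_word):
    """At each checkpoint, regress per-word information gain (match-minus-mismatch
    surprisal difference) on cumulative co-occurrence counts and record R^2.
    Both arguments: arrays of shape (num_checkpoints, num_words)."""
    r2 = []
    for gain, cooc in zip(info_gain_per_word, cumulative_cooc_per_word):
        slope, intercept, r, p, se = stats.linregress(cooc, gain)
        r2.append(r ** 2)
    return np.array(r2)
```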
                    
 
                        
                        We show the R2 values (orange) alongside the grounding information gain (blue) for different model architectures. 
                        In both the Transformer and Mamba-2, R2 rises sharply during the early training steps but then declines, 
                        even as grounding information gain continues to increase. 
                        This pattern suggests that grounding in these architectures cannot be fully explained by co-occurrence statistics. 
Models initially rely on surface regularities, but later improvements reflect more complex learned representations. In contrast, the LSTM shows steadily increasing R2 but little growth in grounding information gain, indicating that it encodes co-occurrence but lacks the architectural capacity to convert it into predictive grounding.
                    
As a bonus analysis, we flipped the template to test the reverse direction: given linguistic conditioning, can the model predict the corresponding <ENV> token? We did not expect satisfying performance, as the model was never trained in this direction.
                    
 
Figure: <LAN>-to-<ENV> surprisal results and <LAN>-to-<ENV> R2 results.
We observed that surprisal for both match and mismatch conditions increases steadily as training progresses, yet the match group consistently maintains lower surprisal than the mismatch group. Furthermore, the R2 trend mirrors our previous findings. This (very weakly) suggests that the model might internalize a bidirectional mapping between linguistic and environmental tokens even without explicit training in the reverse direction, reflecting a more general underlying grounding mechanism rather than a task-specific artifact.
                    
This is not discussed in the paper but is included here to stimulate possible future discussions and open questions.
Mechanistic Evidence
Taking a closer look at our task setup, we see some similarity between ours and that of Wang et al. (2023).13 13 Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning (Wang et al., 2023) Wang et al. (2023) study in-context classification settings and, by examining the information flow, find that label words serve as "anchors" that collect information from previous tokens. In our setup, if we view the <ENV> tokens as the anchors, we can analyze how they facilitate information flow and grounding in a similar manner. We hypothesize that, during prediction, salient information flows from the <ENV> tokens into the token immediately preceding the target <LAN> token, facilitating the prediction of the linguistic token.
                    
Following Wang et al. (2023), the information flow (i.e., saliency) matrix at Layer $\ell$ and Head $h$ is defined as $I_{\ell, h} = \frac{\partial L}{\partial A_{\ell, h}} \odot A_{\ell, h}$, where $L$ is the loss function, $A_{\ell, h}$ is the attention matrix at Layer $\ell$ and Head $h$, and $\odot$ denotes element-wise multiplication. The overall saliency score from token $i$ to token $j$ at Layer $\ell$ is then computed by aggregating the information flow across all heads: $S_{\ell}(i, j) = \left|\sum_{h} I_{\ell, h}(i, j)\right|$.
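In code, this saliency can be sketched as follows, assuming the attention matrices have been exposed during the forward pass (how to expose them is model-specific and omitted here):

```python
import torch

def saliency(loss, attn_per_layer):
    """Information-flow saliency, following Wang et al. (2023).
    `attn_per_layer`: attention tensors A_l of shape (num_heads, seq, seq), taken from
    the forward pass that produced `loss`."""
    grads = torch.autograd.grad(loss, attn_per_layer, retain_graph=True)
    per_head = [g * a for g, a in zip(grads, attn_per_layer)]   # I_{l,h} = dL/dA ⊙ A
    per_layer = [i.sum(dim=0).abs() for i in per_head]          # S_l(i,j) = |sum_h I_{l,h}(i,j)|
    return per_head, per_layer
```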
 
                         
We analyze a 12-layer Transformer. As shown above, the saliency at the middle layers (Layers 7-9) starts to increase at around 4,500 training steps, exactly where the R2 (between grounding information gain and co-occurrence) starts to drop in the previous analysis! If the two are indeed connected, this is evidence suggesting that some mechanism in the middle layers facilitates grounding beyond co-occurrence statistics.
                    
If the attention mechanism reveals where information flows, then the hidden representations capture what information is being encoded and transformed along the layers. Our next analysis uses the tuned lens (Belrose et al., 2023),14 14 Eliciting Latent Predictions from Transformers with the Tuned Lens (Belrose et al., 2023) which aims to translate model internals into human-understandable text. A tuned lens is a trained linear projection layer that maps the hidden states at each layer to the vocabulary space. We train a tuned lens for each layer at different training steps and analyze how the surprisal of the target <LAN> token changes over training.
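A simplified stand-in for the per-layer lens we train (the actual tuned lens ties the unembedding to the model's own and trains against the final-layer distribution; this is only a sketch):

```python
import torch

class TunedLens(torch.nn.Module):
    """Simplified per-layer lens: a learned affine translator followed by an unembedding."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.translator = torch.nn.Linear(d_model, d_model)
        self.unembed = torch.nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, hidden_states):      # (batch, seq, d_model) from layer l
        return self.unembed(self.translator(hidden_states))
```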
                    
 
                        Resonating with the above findings, the tuned lens analysis also highlights the middle layers (Layers 7-9) as critical for grounding. At about 10K to 20K steps, we see a significant drop in surprisal at these layers.
 
Figure: tuned-lens surprisal of the <LAN> token in the match condition.
Taking all this evidence together, we hypothesize that some internal mechanism in the middle layers, emerging after 5K steps, implements grounding beyond co-occurrence statistics.
To make our hypothesis more concrete, we examine the saliency of attention heads by plotting them directly. We visualize all attention heads across layers in a grid: each row corresponds to a layer (from top to bottom, layers 1 to 12), and each column represents a head. All values are normalized per head, so we focus first on the behavior of each head individually rather than directly comparing their absolute strengths. This allows us to identify where and when specific heads might specialize in grounding, without the confound of global scale differences across heads.
 
                         
                         
                         
As shown above, there are a few heads that capture our attention. Some of them (mostly in layers 1-4) are almost a horizontal strip, indicating that information from all previous tokens flows into one position, whereas others (mostly in layers 7-9) are a single salient dot, indicating that the head is simply transferring information from one specific token to another. This resonates with the findings of Bick et al. (2025),15 who argue that such gather (horizontal strip) and aggregate (single dot) attention patterns are crucial for in-context learning. 15 Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism (Bick et al., 2025)
In our setting, typical gather heads look like these:
 
                        And typical aggregate heads look like these:
 
                        We now specifically hypothesize that the gather-and-aggregate mechanism plays a crucial role in enabling symbol grounding, and verify this hypothesis through causal interventions. We detect gather-and-aggregate heads by searching for attention heads that exhibit the following two properties:
- Gather property: The strip at a certain position occupies at least 30% of the total saliency mass of the head.
- Aggregate property: A single dot in the head occupies at least 30% of the total saliency mass of the head.
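A minimal sketch of these two detection criteria, operating on a per-head saliency matrix (the 30% thresholds follow the criteria above; the exact detection code may differ):

```python
def head_properties(sal, threshold=0.3):
    """Return which properties a head satisfies, given its saliency matrix `sal`
    (a NumPy array of shape (seq, seq), rows = destination positions, columns = sources)."""
    total = sal.sum() + 1e-12
    props = set()
    if sal.sum(axis=1).max() / total >= threshold:   # one horizontal strip holds >= 30% of mass
        props.add("gather")
    if sal.max() / total >= threshold:               # one single dot holds >= 30% of mass
        props.add("aggregate")
    return props
```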
After identifying the set of gather and aggregate heads for each context in the final model, we conduct an over-time analysis to measure how their relative contributions evolve during training. Specifically, we calculate the proportion of saliency attributed to these heads compared to the total saliency across all heads. This allows us to track when grounding-related mechanisms become active and how their influence changes over the course of learning.
 
By zeroing out these heads during inference and comparing to zeroing out random heads that do not satisfy these properties, we can test whether these heads are indeed crucial for grounding. We perform these causal intervention experiments on the identified gather and aggregate heads across training checkpoints. Avg. Count refers to the average number of heads of each type across inference runs, and Avg. Layer indicates the average layer index at which they appear. Interv. Sps. reports surprisal after zeroing out the identified heads, while Ctrl. Sps. reports surprisal after zeroing out an equal number of randomly selected heads. Original refers to baseline surprisal without any intervention. *** indicates a significant difference (p < 0.001), where the intervention surprisal is higher than in the corresponding control experiment.
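A sketch of the zero-ablation itself, using forward hooks on a hypothetical GPT-2-style module layout (model.blocks[l].attn and the output shape are assumptions, not our exact code):

```python
def zero_heads(model, heads, d_head):
    """Zero the contribution of selected attention heads at inference time.
    `heads`: iterable of (layer_idx, head_idx) pairs. Assumes each attention module's
    output has shape (batch, seq, n_heads * d_head)."""
    by_layer = {}
    for l, h in heads:
        by_layer.setdefault(l, []).append(h)
    handles = []
    for l, hs in by_layer.items():
        def hook(module, inputs, output, hs=hs):
            out = output.clone()
            for h in hs:
                out[..., h * d_head:(h + 1) * d_head] = 0.0   # wipe this head's slice
            return out
        handles.append(model.blocks[l].attn.register_forward_hook(hook))
    return handles   # call handle.remove() on each after measuring surprisal
```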
 
                        Results above show that only aggregate heads have significant causal effect on grounding performance, partially confirming our hypothesis.
In fact, when we normalize all attention values per model rather than per head, we observe that aggregate heads are substantially more salient than gather heads. In many cases, only one or a few aggregate heads stand out prominently. An example illustrating this pattern is shown below.
 
Figure: saliency flows from horse<ENV> to the current position, encouraging the model to predict horse<LAN> as the next token.
This is not surprising. In our setup, the task structure itself makes gather heads unnecessary. Because each target <LAN> token has a single corresponding <ENV> token, the model can retrieve the needed information in a single hop directly from the referent, which is precisely what aggregate heads are designed to do. In contrast, gather heads become useful when there is an intermediate anchor token (such as label words in few-shot ICL settings) that first collects information before being read by aggregate heads. Since our format is semantically light and lacks these few-shot templates, no anchor emerges, leaving the aggregate mechanism as the most efficient solution.
                    
Realistic Settings
Now we extend our experiments to more complicated settings, using the Visual Dialog dataset and Stable Diffusion 2 for image inpainting, as depicted below:
 
Figure: environmental contexts (<ENV>, shaded) are paired with linguistic tokens (<LAN>). Test examples are constructed with either matched (book) or mismatched (toy) environmental contexts, paired with corresponding linguistic prompts. Note that in both child-directed speech and caption-grounded dialogue, book<ENV> and book<LAN> are two distinct tokens as received by the language model.
For the VLM setting, we use DINOv2 as our ViT tokenizer because it is trained purely on vision data without any auxiliary text supervision. In contrast to models like CLIP, this ensures that environmental tokens capture only visual information, providing a cleaner and more controlled grounding signal. While standard LLaVA uses a two-layer perceptron to project ViT embeddings into the language model, we bypass this projection and directly feed the DINOv2 representations into the language model. This allows us to avoid introducing additional learned components and preserves the original visual signal more faithfully.
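A sketch of this direct-feed setup with Hugging Face Transformers; the checkpoint name is illustrative, and matching the DINOv2 feature dimension to the language model's embedding size is omitted here:

```python
import torch
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")  # illustrative checkpoint
vision = AutoModel.from_pretrained("facebook/dinov2-base")

def environmental_tokens(image):
    """Return DINOv2 patch embeddings to be fed to the LM as <ENV> token embeddings,
    bypassing any learned projection layer."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        hidden = vision(pixel_values).last_hidden_state   # (1, 1 + num_patches, dim)
    return hidden[:, 1:, :]                               # drop the [CLS] token
```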
 
                        
As shown above, the replication of the behavioral findings is successful. There is also an increase-then-decrease trend in the R2 metric, which encourages us to directly test whether the aggregate mechanism also plays a role here.
                    
Before we moved on, we also repeated the experiment using a randomly initialized DINOv2 (see below). An advantage of the match condition over the mismatch condition was still observed, but it required significantly more training steps to approach convergence and did not fully converge even after 200K steps. While this is an interesting direction for further investigation, it is not central to our current research question, so we exclude this setting from the main analysis.
 
                             
Figure: R2 results.
Similar to the previous text-only visualization, we plot the saliency of attention heads from each visual patch to the tested linguistic form.
 
                         
                         
                         
                        Based on the above visualization, we identify attention heads as aggregate according to the following criterion (we do not define gather heads in this setting): an attention head is classified as an aggregate head if at least a certain threshold (70% or 90% in our experiments) of its total image patch-to-word saliency flows from patches inside the bounding box to the token immediately preceding the corresponding linguistic token.
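A minimal sketch of this criterion, assuming we already have the patch-to-word saliency vector and a bounding-box mask for the referent (names are illustrative):

```python
def is_aggregate_head(patch_to_word_saliency, inside_box, threshold=0.7):
    """`patch_to_word_saliency`: NumPy array of shape (num_patches,) with saliency flowing
    from each image patch to the position right before the target linguistic token.
    `inside_box`: boolean mask marking patches inside the referent's bounding box.
    Threshold is 0.7 or 0.9, as in the text."""
    total = patch_to_word_saliency.sum() + 1e-12
    return patch_to_word_saliency[inside_box].sum() / total >= threshold
```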
 
Consistent with what we saw in the minimal testbed setup, we find that aggregate heads are also key to performance in realistic vision-language models. Although the early layers display little salient information flow (unlike in the minimal testbed), pronounced saliency in the middle layers remains a consistent and distinguishing feature.
We finally extend our grounding-as-aggregation hypothesis to a full-scale VLM (LLaVA-1.5-7B). Even in this heavily engineered architecture, we can easily identify attention heads that exhibit aggregation behavior consistent with our earlier findings. This reinforces our findings that symbol grounding arises from specialized heads, rather than being an artifact of statistics.
 
                         
                         
                         
However, several challenges arise when we try to systematically identify and intervene on these heads at scale, as we did in the prior sections.
- VLMs often incorporate multiple visual encoders (for example, LLaVA-1.5-HD), increasing architectural complexity.
- Many models rely on CLIP-derived embeddings that already encode language priors, making it difficult to treat the experiment as a controlled test of grounding emergence.
- The register token effect: global information may be stored in redundant artifact tokens rather than object-centric regions in the image.16 16 Vision Transformers Need Registers (Darcet et al., 2024)
- The large number of visual (environmental) tokens greatly increases computational cost and makes it difficult to define a universal criterion for identifying genuine aggregation heads, especially making fair comparisons across different VLMs.
We believe this is an important direction to pursue, but for now we have reached a natural stopping point. We have, so far, built a clean and formal testbed, presented reasonably realistic VLM setups, and provided case studies on full-scale models. Addressing the challenges discussed above will be essential for future work that aims to generalize our findings to more complex, real-world systems. However, we see this as a stand-alone project that goes beyond the scope of the current paper.
So we call it a day here.
Practical Implications
One might ask...does this mean we don't need grounding supervision to build mechanistically grounded VLMs? No. What our findings show is that grounding can emerge naturally from the training signal itself, but that does not make supervision unnecessary. Fine-grained mappings between text spans and semantic regions are well known to accelerate learning. While naturalistic data and minimal objectives alone can induce symbol grounding given sufficient scale and proper architecture, incorporating well-designed supervision or inductive biases can make models significantly more data-efficient, robust, and reliable, especially in low-resource settings.17,18 17 Grounded Language-Image Pre-training (Li*, Zhang*, and Zhang* et al., 2022) 18 World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models (Ma* and Pan* et al., 2023)
Our findings also imply a practical pathway to improving the reliability of language model outputs. By identifying aggregation heads that mediate grounding between environmental and linguistic tokens, we reveal an internal mechanism that can be monitored before generation occurs. This allows us to detect potential failure modes upstream rather than relying solely on output-level checks. 19 OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation (Huang et al., 2024)
 
                        Many hallucination errors can be traced to misallocated attention in intermediate layers.20 20 Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens (Jiang et al., 2025) Because grounding is not uniformly distributed across the model, certain attention heads act as bottlenecks where environmental signals are either integrated correctly or distorted. Monitoring these heads provides a way to detect when the model is likely to produce unreliable outputs. Such attention-level signals offer a promising foundation for practical interventions. They can inform decoding-time strategies that adjust or constrain generation based on grounding strength, allowing for targeted mitigation and, ultimately, prevention of hallucinations.
Philosophical Implications
Our findings also highlight the need to sharpen what we mean by grounding in multimodal models. Much of the existing discourse has treated grounding as little more than statistical correlation between visual and linguistic signals. While such correlations are informative, they do not address the deeper question of whether symbols are causally anchored to their referents in the environment.
On one side of the debate, Gubelmann (2024)21 21 Pragmatic Norms Are All You Need - Why The Symbol Grounding Problem Does Not Apply to LLMs (Gubelmann, 2024) argues that the symbol grounding problem does not apply to large language models:
"LLMs, in contrast, are connectionist, statistical devices that have no intrinsic symbolic structure. It is, however, this symbolic structure that is presupposed in the CTM, the Chinese Room Thought Experiment, and hence also in the symbol grounding problem."
Our results suggest otherwise. We observe emergent symbolic structure as a mechanistic property of the model: it can be traced along the training trajectory, localized to specific attention heads, and tested through causal interventions.
We echo Pavlick (2023)22 22 Symbols and grounding in large language models (Pavlick, 2023) that "LLMs lack the capacity to represent abstract symbolic structure" should not be accepted a priori. Instead, these claims must be empirically tested, with the focus placed on uncovering the models' underlying competence rather than relying solely on surface-level performance.
This perspective shifts the debate from abstract philosophical claims to testable mechanisms. Grounding can be investigated as a measurable process within models, rather than simply dismissed as irrelevant by definition. More importantly, it provides a practical diagnostic pathway for identifying when and how models begin to tie symbols to meaning, moving the discussion beyond surface correlations toward mechanistic explanations with the power of computation.
Citation
To give credit if you find this work useful, please cite:
@article{wu2025mechanistic,
  title={The Mechanistic Emergence of Symbol Grounding in Language Models},
  author={Wu, Shuyu and Ma, Ziqiao and Luo, Xiaoxi and Huang, Yidong and Torres-Fonseca, Josue and Shi, Freda and Chai, Joyce},
  journal={arXiv preprint arXiv:2510.13796},
  year={2025}
}
                 
                                    