[Research] Martin Ziqiao Ma

Research Statement (Quite long, please read...?)

As a muggle (?), my ultimate goal is to enable Mechanistic Alignment & Grounding for Interactive Cognition (aka MAGIC). The three constant themes of my research are Language, Interaction, and Embodiment from a scalable and cognitive angle. I will break it down and elaborate:

Language Grounding and Alignment. We develop our language systems from natural supervision. Our language develops through sensorimotor and sociolinguistic experiences in the physical world (semantic/static grounding) and through interactions with others (communicative/dynamic grounding). We acquire lexical semantics and syntactic structures via this grounded language learning, and we apply our language pragmatically in everyday communication. To me, grounding is about mapping a language system to something external, whether it be another language, perception, or shared beliefs. I (mostly) agree with Freda's view on the definition of grounding. I see alignment as closely related to grounding but slightly different from it, and as two-fold: in-vocabulary alignment (intent/value/preference/safety alignment) and out-of-vocabulary alignment (i.e., expanding the action space into multimodal/multilingual/code tokens). I (together with my colleagues) have a tutorial on grounding and wrote something on alignment. I will find some time to put down my thoughts on the difference between alignment and grounding (TODO list +1), but in short, I think grounding is a property of our language representations, whereas alignment additionally concerns how we generate from given representations.

Grounding and alignment: connecting language to everything non-linguistic.

Grounding language to the physical world: Understanding and generating language that is grounded in sensorimotor experiences and physical situations. There is more to explore beyond 2D grounding, e.g., video, 3D, and generative world models.
Grounding language to human interactions: (Co-)situated human-AI interaction in a shared environment with disparate mental states, and collaboration towards common ground.
Alignment in post-training and at inference time: Human-like planning and reasoning that is deliberate, (inter)active, lifelong, and steerable, built upon pre-trained systems.
Applications of frontier models in situated/embodied agents as well as content generation.

Mechanistic (Mis)alignment. In my view, the goal of cognitive science is to understand the underlying mechanisms that give rise to intelligence. I regard humans and machines as fundamentally distinct intelligent systems, and I believe there will come a point where human-like learning will no longer offer meaningful insights for superhuman AI models. My ultimate research question centers on what I refer to as "mechanistic (mis)alignment": investigating which factors drive shared cognitive behaviors between humans and machines, and which mechanistic differences account for their divergent cognitive behaviors. I always remind myself to be epistemically rigorous and to avoid both anthropomorphizing AI models and overclaiming. I (mostly) agree with "A roadmap for reverse-engineering the infant language-learner" and "What Artificial Neural Networks Can Tell Us About Human Language Acquisition".

Scalable (data-driven but sufficiently efficient learning of) representations as computational abstractions of cognition.

Scaling laws and developmental psychology: Exploring the developmental trajectories of data-driven models and the cognitive capabilities that emerge over the course of development.

Efficient learning with minimal supervision: Learning that is data-efficient, over multiple modalities, on (semi-)structured data.

Cross-cultural and cross-lingual conventions of cognition. (Languages are dying! Under-represented languages are dear to my heart but I plan (try hard) not to do (too much) research on this topic before I finish my PhD lol)

The Connections

This is how I perceive the connections between the pieces.

Vision and Physical Embodiment

Work on Semantic Grounding

Learning Crossmodal Correspondence
- Entity Grounded VLMs: [OctoBERT; ACL'23] [GroundHog; CVPR'24]
- Object Hallucination: [ROPE; NeurIPS'24]
- Crossmodal Interpretability: [Coming]
Learning Underlying World Representation
- Frame of Reference: [COMFORT; ICLR'25]
- Spacetime Representations: [Coming]
Learning Visual Concept Manipulation
- Image Concepts: [CycleNet; NeurIPS'23] [InfEdit; CVPR'24]
- Video Concepts: [VEGGIE; Preprint'25]

Language
Applications: Embodied Dialogue Agents
- Overview: [TMLR'24]
- Interactive Autonomous Driving: [Dorothie; EMNLP'22 Findings] [DriVLMe; IROS'24]
- Interactive Household Robots: [DANLI; EMNLP'22] [SEAGULL; Alexa Prize]

Theory of Mind (ToM)
- ToM in LLMs: [EMNLP'23 Findings]
- ToM for Planning: [IJCAI'23]
- ToM for Reasoning and Collaborations: [Coming]
Trials, Errors, Demos (TED)
- TED in Training LLMs: [NAACL'25]
- TED in LLM Agents: [Coming]
Proactivity and Steerability
- AL with Structured Data: [TMLR'23]
- Learning to Teach: [TRAVER; Preprint'25]
- Steerable alignment: [Coming]

Interaction with Humans and Other Agents

Acknowledgement: Thanks to Jiayuan Mao for this amazing template!

Get In Touch

You are welcome to drop me a message :)

Phone
xxx-xxx-xxxx
marstin0607
ziqiao_ma
Address
Bob and Betty Beyster Building 4909,
2260 Hayward Street,
Ann Arbor, MI 48109.
