Here is a virtual greeting from Ziqiao.
As many of my friends find it hard to pronounce my name (马子乔, pronounced /ma˨˩˦ tsɨ˧˥ tɕʰiɑʊ˧˥/ in Mandarin),
it's absolutely fine to just call me Martin instead.
I am a language person at heart, but I believe in semantic externalism and embodied cognition,
so the hardest questions
in computational linguistics should not, and cannot, be answered by language alone.
The three constant themes of my research are language, interaction, and embodiment, approached from both a scalable and a cognitive angle.
As a linguist, I am interested in grounded language processing, computational psycholinguistics, language development, and multilingualism.
As a machine learning researcher, I am interested in learning with minimal and natural supervision,
e.g., self-supervised learning, learning and evaluation with no train-test boundary, and learning with natural cross-modal correspondence.
Learning with natural supervision: grounding and alignment
My collaborators (in alphabetical order)
Andrew Yang (Multimodality, Memory, Graphs),
Dezhi Luo (Cogsci, Philosophy, Consciousness),
Ding Zhong (Vision, Multimodality, Reasoning),
Jiawei Ren (Multimodality, Agent),
Junyu Zhang (Reasoning, RL, Embodied AI),
Shuyu Wu (Multimodality, Interpretability),
Xiaokang Ye (Agents, Multimodality, Embodied AI),
Xiaoxi Luo (Historical Linguistics, Interpretability),
Xueyang Yu (Vision, Multimodality, Embodied AI),
are looking for grad school opportunities.
They are extremely talented, self-motivated, and pleasant to work with. Please consider them if you have openings!
[Pinned] I have been part of the ACL Year-Round Mentorship program since 2025. Come to our monthly mentoring sessions and let's grow together :)
[Pinned] We are building GrowAI, an open-source community uniting researchers interested in human-like artificial general intelligence and growing AI like a child at scale.
[Aug. 2022] I will be the Graduate Student Instructor for EECS 595 (NLP) in Fall 2022 at Michigan.
[Mar. 2021] I will join the family of the Michigan AI as a Ph.D. student this fall. Go Blue!
[Dec. 2020] I will be the Instructional Aide for EECS 492 (Intro. AI) in Winter 2021 at Michigan.
Seminar and Technical Talks
[20251212] The Science of Evaluation and Benchmarking.
[20250416] Grounding Lexical Semantics in the Era of Vision-Language Models @ Theoretical and Computational Neuroscience Journal Club, JHU.
[20250326] Language Grounding to the Visual World and Human Interactions: How Far Are We from Embodied Dialogue Agents @ HAAG Seminar, Georgia Institute of Technology.
[20250206] Bridging Minds and Machines: Cognitive Insights for Developing and Evaluating AI Systems @ Foreseer Group, UMich.
NEPA (Next-Embedding Predictive Autoregression) relies solely on a next-embedding prediction loss to learn broad, generalizable models for diverse downstream vision problems.
NEPA requires no offline encoders and lets autoregression operate directly on the embeddings from the encoder.
We train modern Vision Transformers with NEPA and achieve competitive performance after supervised fine-tuning.
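To make the objective concrete, here is a minimal sketch of what a next-embedding prediction loss can look like in PyTorch; the module names, the causal predictor, and the stop-gradient on the target are illustrative assumptions, not NEPA's actual implementation.

```python
# Minimal sketch of a next-embedding prediction objective (illustrative only;
# module names, the causal predictor, and the stop-gradient are assumptions,
# not NEPA's actual implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextEmbeddingPrediction(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int, depth: int = 4, heads: int = 8):
        super().__init__()
        self.encoder = encoder  # e.g., a ViT mapping an image/video to a sequence of embeddings
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        z = self.encoder(inputs)  # (B, T, D): embeddings produced by the trainable encoder
        mask = nn.Transformer.generate_square_subsequent_mask(z.size(1), device=z.device)
        pred = self.predictor(z, mask=mask)  # autoregression directly over the embeddings
        # Regress each position onto the *next* encoder embedding; the detach
        # (stop-gradient) on the target is a common anti-collapse choice and an assumption here.
        return F.smooth_l1_loss(pred[:, :-1], z[:, 1:].detach())
```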
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
We introduce ROVER, the first benchmark targeting reciprocal cross-modal reasoning, where one modality guides, verifies, or refines outputs in another;
Cross-modal reasoning strongly correlates with visual generation performance, while current models show limited visually-augmented reasoning capabilities.
The Mechanistic Emergence of Symbol Grounding in Language Models
Grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms;
This phenomenon replicates across data (text-only and text-image) and architectures (Transformers and SSMs), but not in unidirectional LSTMs.
4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time
Ziqiao Ma, Xuweiyi Chen, Shoubin Yu, Sai Bi, Kai Zhang, Chen Ziwen, Sihan Xu, Jianing Yang, Zexiang Xu, Kalyan Sunkavalli, Mohit Bansal, Joyce Chai, Hao Tan
VLM-Lens is a toolkit designed to support the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs.
VLM-Lens integrates various interpretability and analysis pipelines (probing, attention visualization, PCA, concept similarity, etc.) to facilitate the interpretation of VLMs.
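As a rough illustration of the general mechanism behind such extraction (not VLM-Lens's actual API), intermediate outputs of an open-source VLM can be captured with PyTorch forward hooks; the layer name in the usage comment is a placeholder.

```python
# Generic forward-hook sketch for capturing intermediate outputs during a forward pass.
# Illustrative only: this is not VLM-Lens's actual interface, and the layer name in the
# usage comment is a placeholder.
import torch
import torch.nn as nn

def capture_outputs(model: nn.Module, layer_names: set[str]):
    """Register hooks that record the outputs of the named submodules."""
    captured, handles = {}, []
    for name, module in model.named_modules():
        if name in layer_names:
            handles.append(module.register_forward_hook(
                lambda mod, inp, out, name=name: captured.__setitem__(name, out)
            ))
    return captured, handles

# Usage sketch (placeholder layer name):
#   captured, handles = capture_outputs(vlm, {"language_model.model.layers.16"})
#   with torch.no_grad():
#       _ = vlm(**inputs)          # forward pass populates `captured`
#   for h in handles:
#       h.remove()                 # detach hooks when done
```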
Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry
We study LLM agents in task collaboration with information asymmetry, where agents have disparities in their knowledge and skills and need to work together to complete a shared task;
Agents w/o communication can achieve high task performance but receive lower trust from human evaluators.
AimBot: A Simple Auxiliary Visual Cue to Enhance Spatial Awareness of Visuomotor Policies
Yinpei Dai*, Jayjun Lee*, Yichi Zhang, Ziqiao Ma, Jianing Yang, Amir Zadeh, Chuan Li, Nima Fazeli, Joyce Chai.
WM-ABench is a large-scale benchmark comprising 23 fine-grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations.
We introduce a two-stage framework that assesses perception (visual, spatial, temporal, quantitative, and motion) and prediction (mechanistic simulation, transitive inference, compositional inference) to provide an atomic evaluation of VLMs as WMs.
While VLMs excel in scenarios with pronounced differences, they struggle with 3D and dynamic perception, fail to differentiate subtle physical distinctions, and exhibit failures in understanding world transitions of transitive and compositional scenarios.
Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors
We introduce a novel agent workflow called Trace-and-Verify (TRAVER), which combines knowledge tracing to estimate a student's knowledge state and turn-by-turn verification to ensure effective guidance toward task completion.
We introduce the coding tutoring task as the testbed for tutoring LLM agents.
TRAVER effectively enables inference-time scaling for tutoring agents.
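For intuition, the overall loop can be pictured roughly as below; this is a schematic with placeholder names, not the actual TRAVER implementation.

```python
# Schematic of a trace-and-verify tutoring loop (placeholder names; not the actual
# TRAVER implementation).
def tutoring_session(task, student, tutor, verifier, max_turns=20, n_candidates=4):
    state = tutor.init_knowledge_state(task)          # knowledge-tracing prior over the student
    dialogue = []
    for _ in range(max_turns):
        # Sample several candidate tutor utterances; scoring more candidates is one
        # way such a verifier supports inference-time scaling.
        candidates = [tutor.propose(task, dialogue, state) for _ in range(n_candidates)]
        utterance = max(candidates,
                        key=lambda u: verifier.score(task, dialogue, state, u))
        reply = student.respond(utterance)
        dialogue += [("tutor", utterance), ("student", reply)]
        state = tutor.update_knowledge_state(state, reply)  # trace: refresh the estimate
        if task.completed_by(student):
            break
    return dialogue
```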
Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation
Ziqiao Ma*, Jing Ding*, Xuejun Zhang, Dezhi Luo, Jiahe Ding, Sihan Xu, Yuchen Huang, Run Peng, Joyce Chai
We introduce RefOI, a new dataset of 1.5k objects, each with 3 written and 2 spoken human-produced referring expressions.
We also release RefOI-TLHF, a large dataset of token-level human feedback for 10.6k referring expressions.
We identify three key failures of pragmatic competence in VLMs:
(1) failing to uniquely identify the referent,
(2) including excessive or irrelevant information, and
(3) misaligning with human pragmatic preferences.
VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation
We introduce VEGGIE, a diffusion-loss-only video generative model that handles various tasks for both video concept grounding and editing from user instructions.
Pixel-level grounded training helps various video concept editing tasks in multi-task learning.
VEGGIE shows emergent zero-shot multimodal instructional and in-context video editing.
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities
We introduce COMFORT, a protocol to evaluate spatial reasoning in VLMs across multilingual and ambiguous frames of reference (FoR);
VLMs exhibit poor robustness and consistency, lack the flexibility to accommodate multiple FoRs, and fail to adhere to language-specific or culture-specific conventions in cross-lingual tests.
Babysit A Language Model From Scratch: Interactive Language Learning by Trials and Demonstrations
We introduce a trial-and-demonstration (TnD) learning framework that incorporates three components: student trials, teacher demonstrations, and a reward conditioned on language competence at various developmental stages;
TnD accelerates word representation learning for student models with equal or smaller numbers of parameters, and both trials and demonstrations matter.
We further show that the teacher's choices of words influence students' word-specific learning efficiency, and a practice-makes-perfect effect is evident by a strong correlation between the frequency of words in trials and their respective learning curves.
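A rough sketch of a single TnD update, with all names as placeholder assumptions rather than the paper's actual code:

```python
# Rough sketch of one trial-and-demonstration (TnD) update step.
# All object and method names are placeholders, not the paper's actual code.
def tnd_step(student, teacher, context, reward_model):
    trial = student.generate(context)              # student trial (production)
    reward = reward_model.score(context, trial)    # reward conditioned on language competence
    student.reinforce(context, trial, reward)      # policy-gradient-style update on the trial
    demo = teacher.generate(context)               # teacher demonstration
    student.imitate(context, demo)                 # likelihood (imitation) update on the demo
```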
Humanity's Last Exam
Community contribution, led by the Center for AI Safety and Scale AI.
A multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage;
State-of-the-art LLMs demonstrate low accuracy and poor calibration; gaps remain compared to the expert human frontier on closed-ended academic questions.
Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models
We introduce ROPE, an evaluation protocol for hallucination across multiple objects using visual referring prompts;
VLMs hallucinate more with multiple objects, are influenced by object class distribution, and exhibit behavior driven by data-specific and intrinsic factors.
DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences
Yidong Huang, Jacob Sansom, Ziqiao Ma§, Felix Gervits, Joyce Chai
We introduce DriVLMe, a video-language-model-based agent to facilitate natural and effective communication between humans and autonomous vehicles that perceive the environment and navigate;
We develop DriVLMe from both embodied experiences in a simulated environment and social experiences from real human dialogue.
GroundHog: Grounding Large Language Models to Holistic Segmentation
We introduce GroundHog, a multimodal large language model grounded in holistic segmentation, using a masked feature extractor and unified grounding masks for fine-grained visual understanding.
Trained on the curated M3G2 dataset, GroundHog outperforms prior models on language grounding tasks, reduces object hallucination, and offers improved diagnosability on complex visual inputs.
Inversion-Free Image Editing with Language-Guided Diffusion Models
We derive Denoising Diffusion Consistent Model (DDCM), showing that when the initial sample is known, a special variance schedule reduces the denoising step to the same form as the multi-step consistency sampling;
DDCM yields an inversion-free strategy for image editing that requires no explicit inversion during sampling;
We further unify the attention control mechanisms in an inference-time algorithm for text-guided editing, taking less than 3 seconds per edit.
CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
We introduce CycleNet, a theoretically derived model that incorporates cycle consistency (and a self-supervision loss) into diffusion models to regularize image manipulation;
CycleNet is robust even with very limited training data (around 2k) and requires minimal computational resources (1 GPU) to train.
Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models
We taxonomize machine ToM into 7 mental state categories and delineate existing benchmarks to identify under-explored aspects of ToM;
We conduct pilot studies toward a holistic and situated evaluation of ToM, breaking ToM into individual components and treating LLMs as agents physically situated in environments and socially situated in interactions with humans.
Towards Collaborative Plan Acquisition through Theory of Mind Modeling in Situated Dialogue
We study collaborative plan acquisition in human-AI tasks, where agents predict missing task knowledge for themselves and their partners using perceptual and dialogue history;
We show that predicting a partner's missing knowledge, coupled with explicit modeling of dialogue moves and mental states, leads to better collaboration.
World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models
We introduce OctoBERT, a visually grounded language model designed to acquire grounding ability during pre-training and enable fast mapping of new words through few-shot learning without explicit grounding supervision;
Visual grounding accelerates grounded word representation learning;
Imageability aligns positively with human intuition and prediction metrics, while concreteness shows the opposite correlation, suggesting the need for language learning agents to acquire word meanings through physical interaction.
NLP Reproducibility For All: Understanding Experiences of Beginners
Shane Storks, Keunwoo Peter Yu, Ziqiao Ma, Joyce Chai
We studied 93 NLP students replicating recent NLP papers;
Programming skills and paper comprehension had limited impact on effort, while accessible documentation, coding practices, and data availability were critical.
SEAGULL: An Embodied Agent for Instruction Following through Situated Dialog
We introduce GraphPart, a partition-based active learning method for GNN that selects representative nodes from graph partitions for querying;
GraphPart is motivated by classification error analysis under smoothness assumptions;
GraphPart outperforms existing active learning methods across benchmarks and budget constraints and reduces the accuracy disparity compared to random training node selection across most datasets.
If you like my figures here, I highly recommend you also visit SiX's homepage.
Misc
Fun Facts
I was given my Chinese name, 马子乔, through a visually symbolic process, breaking down the radicals of 骄子, which roughly translates to 'the gifted child' in English. This is one of the reasons why logographic languages are so beautiful.
I was born and raised in Chengdu, the home of pandas. I am proud of Chengdu Foreign Languages School, my high school, and identify myself as a CFLSer. 成外人永远不会成外人 (a CFLSer will never become an outsider).
I am an INFJ according to the Myers-Briggs, and my friends say that I exhibit stereotypical traits of this personality type... lol
I love literature and plays. I am particularly interested in Shakespeare's plays, traditional Chinese opera, Latin American literature, and modern Asian literature.
I love movies; these days I am obsessed with the Czechoslovak New Wave and psychological thrillers.
I seriously considered a career in game design when I first started college, and although I ultimately chose a different path,
it provided excellent preparation for my work in embodied AI research, which often involves intensive programming with simulators.
a list of student-made games I enjoyed:
Mogu,
Turbo Neon.
a list of video games I enjoyed:
Sandbox (Minecraft, Terraria),
Story-based RPG & Interactive Stories (Stardew Valley, Undertale, The Stanley Parable, This War of Mine, To the Moon, Season: A Letter to the Future),
Roguelike (Soul Knight, Risk of Rain),
Puzzle & Puzzle-based Platformer (Gorogoa, Limbo, Inside, Chants of Sennaar),
Battle Royales (Naraka Bladepoint, Fall Guys).
Here are some of the projects we worked on:
Contracts
Zekai Fan, Shiyu Qu, Juan Rivera Plata, Yihao Huang, Ziqiao Martin Ma
I understand that access to research opportunities can be hard, particularly for beginners and the underrepresented.
If there is a match in research interests, I am happy to collaborate with undergraduate and master's students when I have the bandwidth.
Please find more details here.
I've been fortunate to have (co-)mentored and collaborated with these amazingly talented young researchers:
Take a random virtual stroll over to one of my friends' homepage!
It's like a digital house call, minus the awkward small talk and the "sorry, my place is a mess" excuse!
When I was exhausted but couldn't take time off to travel, I'd go on virtual adventures instead:
randomly searching for remote destinations and dropping pins on Google Maps.
Here are a few spots that I swear I'll visit in person...someday, eventually!
Why am I still staying alive?
Chat?
If you would like to have a random (virtual) coffee chat with me, please visit my calendly page.
I am happy to talk if you want to share your stress or just want to chat about life in general (when I have time),
but be sure to check out the On-Campus Mental Health Resources @ Michigan.
Get In Touch
Drop me a message here :)
marstin0607 (work only)
ziqiao_ma
Office
Bob and Betty Beyster Building 4909,
2260 Hayward Street,
Ann Arbor, MI 48109.