About

I am a Member of Technical Staff at Thinking Machines Lab, and a Member of Less Technical Stuff at ACL Mentorship. I did my Ph.D. at the University of Michigan, advised by Joyce Chai.

Research Interests

I am currently interested in (i) predictive evaluation of scalable learning systems; (ii) continual multimodal learning with minimal and natural supervision, e.g., self-supervised learning, and learning and inference with no turn-taking or train-test boundary; and (iii) learning from natural cross-modal correspondence.

Recent Research

  • Interaction Models for Human-AI Collaboration

    Interaction models handle interaction natively rather than through external scaffolding. They let people collaborate with AI the way we naturally collaborate with each other: continuously taking in audio, video, and text, and thinking, responding, and acting in real time. We train an interaction model with a multi-stream, micro-turn design.
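
    A minimal sketch of one way a multi-stream, micro-turn design could interleave inputs; the names and layout here are my illustration, not the model's actual design. Each stream is chunked into short micro-turns, then merged into a single time-ordered token sequence with stream tags.

    ```python
    # Hypothetical sketch: interleave per-stream micro-turns by wall-clock time.
    import heapq
    from dataclasses import dataclass, field

    @dataclass(order=True)
    class MicroTurn:
        t_start: float                       # wall-clock start of this chunk
        stream: str = field(compare=False)   # "audio", "video", or "text"
        tokens: list = field(compare=False)  # tokenized content of the chunk

    def interleave(*streams):
        """Merge time-sorted per-stream micro-turns into one tagged sequence."""
        sequence = []
        for turn in heapq.merge(*streams):   # each stream is already time-sorted
            sequence.append(f"<{turn.stream}>")  # stream tag before each chunk
            sequence.extend(turn.tokens)
        return sequence

    audio = [MicroTurn(0.0, "audio", ["hi"]), MicroTurn(0.4, "audio", ["there"])]
    text = [MicroTurn(0.2, "text", ["hello"])]
    print(interleave(audio, text))
    # ['<audio>', 'hi', '<text>', 'hello', '<audio>', 'there']
    ```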

  • Elastic Test-Time Training For Space and Time

    Large-Chunk Elastic TTT (LaCET) reframes test-time training as continual learning and introduces a Fisher-weighted elastic consolidation term, so that fast weights keep adapting across chunks without unconstrained drift.
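
    A minimal sketch of a Fisher-weighted elastic consolidation penalty in the EWC style; the exact weighting and update schedule here are assumptions for illustration, not LaCET's precise formulation.

    ```python
    import torch

    def elastic_consolidation_loss(fast_params, anchor_params, fisher, lam=1.0):
        """EWC-style penalty: (lam / 2) * sum_i F_i * (theta_i - anchor_i)^2.

        All three arguments are dicts keyed by parameter name; `fisher`
        holds diagonal Fisher estimates accumulated over earlier chunks.
        """
        loss = torch.zeros(())
        for name, p in fast_params.items():
            loss = loss + (fisher[name] * (p - anchor_params[name]).pow(2)).sum()
        return 0.5 * lam * loss

    # Per chunk: total = task_loss(chunk) + elastic_consolidation_loss(...),
    # then refresh the diagonal Fisher from squared task-loss gradients.
    ```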

    4D-LRM pretrains general space-time representations that reconstruct an object from a few views at some times to any view at any time. 4D-LRM adopts a clean and minimal Transformer design and unifies space and time by predicting 4D Gaussian primitives directly from multi-view tokens.
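
    As a toy illustration of the prediction target, here is a head mapping per-token Transformer features to flattened 4D Gaussian parameters; the 16-value layout below is an assumed simplification, not 4D-LRM's actual parameterization.

    ```python
    import torch.nn as nn

    class GaussianHead(nn.Module):
        """Maps per-token features to one 4D Gaussian primitive per token.

        Assumed layout (16 values): xyzt center (4), per-axis scale (4),
        rotation quaternion (4), opacity (1), RGB color (3).
        """
        def __init__(self, d_model=768, n_params=16):
            super().__init__()
            self.proj = nn.Linear(d_model, n_params)

        def forward(self, tokens):      # tokens: (batch, n_tokens, d_model)
            return self.proj(tokens)    # (batch, n_tokens, 16)
    ```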

  • Omnimodality with Embedding Autoregression

    NEPA implements minimal latent autoregression with a next-embedding prediction loss to learn broad, generalizable models for diverse downstream vision problems. It requires no offline encoders and lets autoregression operate directly on the model's native embeddings.
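
    A minimal sketch of what a next-embedding prediction objective can look like; the cosine regression and interfaces here are my assumptions for illustration.

    ```python
    import torch.nn.functional as F

    def next_embedding_loss(model, embeddings):
        """Regress embedding t+1 from embeddings up to t, in latent space.

        embeddings: (batch, seq, dim), the model's own native embeddings.
        model: any causal sequence model mapping (batch, seq, dim) -> same.
        """
        pred = model(embeddings[:, :-1])     # predictions for positions 1..seq-1
        target = embeddings[:, 1:].detach()  # the actual next embeddings
        # Cosine regression; plain MSE would be another reasonable choice.
        return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    ```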

  • Grounded Vision Language Models

    VEGGIE is an instructional video generative model for concept grounding and editing, trained with a diffusion loss only. VEGGIE shows emergent zero-shot multimodal instruction following and in-context video editing.

    GroundHog is a generative vision language model grounded in segmentation. It proposes segmentation masks of regions with discernible semantic content, and recognizes entities while generating language.

    OctoBERT is an object-centric encoder-based vision language model, designed to acquire grounding ability during pre-training and transfer to new words through few-shot learning without explicit grounding supervision.

  • Behavioral Evaluation and Mechanistic Interpretation

    VLM-Lens is a toolkit designed to support the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs.
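
    The general mechanism is standard PyTorch forward hooks; the sketch below illustrates the idea generically and is not VLM-Lens's actual interface.

    ```python
    import torch

    def capture_layer_outputs(model, layer_names):
        """Register hooks that cache each named layer's forward output."""
        cache, handles = {}, []
        modules = dict(model.named_modules())
        for name in layer_names:
            def hook(module, inputs, output, name=name):
                cache[name] = output.detach() if torch.is_tensor(output) else output
            handles.append(modules[name].register_forward_hook(hook))
        return cache, handles

    # Usage: run one forward pass, read `cache`, then remove the handles.
    #   cache, handles = capture_layer_outputs(vlm, ["vision_tower.layers.11"])
    #   vlm(**inputs); feats = cache["vision_tower.layers.11"]
    #   for h in handles: h.remove()
    ```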

    Check out our line of behavioral evaluations and analyses of (V)LMs, e.g., spatial representation, objectness, pragmatic generation, world modeling, and mental state modeling.

    Check out our line of work advancing the mechanistic interpretability of (V)LMs, e.g., cross-modal grounding and conceptual metonymy.