About

I am a Member of Technical Staff at Thinking Machines Lab, and a Member of Less Technical Stuff at ACL Mentorship. I obtained my Ph.D. at the University of Michigan, advised by Joyce Chai.

Research Interests

I am currently interested in (i) predictive evaluation of scalable learning systems, and (ii) continual multimodal learning with minimal and natural supervision, e.g., self-supervised learning, learning and inference with no train-test boundary, and learning from natural cross-modal correspondence.

Recent Tutorials

Selected Research

  • Scalable Elastic Test-Time Training

    Large-Chunk Elastic TTT (LaCET) reframes test-time training as continual learning, adding a Fisher-weighted elastic consolidation term so that fast weights can keep adapting across chunks without unconstrained drift.
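A Fisher-weighted consolidation term of this kind can be sketched in a few lines. This is a hypothetical illustration of the general EWC-style penalty, not LaCET's actual objective; the function and parameter names (`task_loss`, `anchors`, `fisher`, `lam`) are assumptions for the sketch.

```python
def elastic_consolidation_loss(task_loss, params, anchors, fisher, lam=1.0):
    """Illustrative EWC-style objective: the adaptation loss plus a
    Fisher-weighted quadratic penalty on drift from anchor weights.

    params, anchors, fisher: dicts mapping parameter names to scalars
    (in practice these would be tensors of the same shape).
    """
    # Penalize squared drift of each fast weight from its anchor,
    # scaled by its estimated Fisher information (importance).
    penalty = sum(fisher[k] * (params[k] - anchors[k]) ** 2 for k in params)
    return task_loss + lam * penalty
```

Parameters with high Fisher weight are held near their anchors, while unimportant ones remain free to adapt to each new chunk.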

  • Next-Embedding Predictive Autoregression

    NEPA implements minimal latent autoregression with a next-embedding prediction loss to learn broad, generalizable models for diverse downstream vision problems. It requires no offline encoders and lets autoregression operate directly on the model's native embeddings.
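The next-embedding prediction objective can be sketched as follows. This is a simplified stand-in, not NEPA's actual loss: it assumes a regression (MSE) target on the next embedding, and the `predictor` callable is a placeholder for the autoregressive model.

```python
def next_embedding_prediction_loss(embeddings, predictor):
    """Illustrative next-embedding objective: at each step t, the
    predictor sees embeddings[:t] and regresses embeddings[t].

    embeddings: list of embedding vectors (lists of floats).
    predictor: callable mapping a prefix of embeddings to one vector.
    """
    total = 0.0
    for t in range(1, len(embeddings)):
        pred = predictor(embeddings[:t])
        target = embeddings[t]
        # Mean squared error between predicted and actual next embedding.
        total += sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)
    return total / (len(embeddings) - 1)
```

A trivial copy-last baseline (`lambda ctx: ctx[-1]`) already scores well on static sequences, which is why such objectives are evaluated on sequences with genuine temporal structure.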

  • Large Space-Time Reconstruction Model

    4D-LRM pretrains general space-time representations that reconstruct an object from a few views at some times to any view at any time. It adopts a clean, minimal Transformer design and unifies space and time by predicting 4D Gaussian primitives directly from multi-view tokens.

  • Grounded Vision Language Models

    VEGGIE is an instructional video generative model for concept grounding and editing, trained with a diffusion loss only. VEGGIE shows emergent zero-shot multimodal instruction following and in-context video editing.

    GroundHog is a generative vision language model grounded in segmentation. It proposes segmentation masks of regions with discernible semantic content, and recognizes entities while generating language.

    OctoBERT is an object-centric encoder-based vision language model, designed to acquire grounding ability during pre-training and transfer to new words through few-shot learning without explicit grounding supervision.

  • Behavioral Evaluation and Mechanistic Interpretation

    VLM-Lens is a toolkit designed to support the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs.
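The underlying mechanism for this kind of extraction can be illustrated with PyTorch forward hooks. This is a minimal sketch of the general pattern, not the VLM-Lens API; the function name and the assumption that each hooked submodule returns a single tensor are mine.

```python
import torch
import torch.nn as nn

def capture_layer_outputs(model, layer_names, inputs):
    """Illustrative sketch: record the forward-pass outputs of the
    named submodules by temporarily attaching forward hooks."""
    captured, handles = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:
        # Bind `name` per hook; hooks fire with (module, input, output).
        def hook(mod, inp, out, name=name):
            captured[name] = out.detach()
        handles.append(modules[name].register_forward_hook(hook))
    try:
        with torch.no_grad():
            model(inputs)
    finally:
        # Always remove hooks so the model is left unmodified.
        for h in handles:
            h.remove()
    return captured
```

For submodules that return tuples (as many transformer blocks do), the hook would need to select the relevant element before detaching.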

    Check out our line of behavioral analyses and evaluations of (V)LMs, e.g., spatial representation, objectness, pragmatic generation, world modeling, and mental state modeling.

    Check out our line of work advancing the mechanistic interpretability of (V)LMs, e.g., cross-modal grounding and conceptual metonymy.