Multimodal Models

Does AI assistance always help reasoning or does it carry hidden costs?

When AI systems intervene during human reasoning tasks, do they uniformly improve performance, or does the disruption to cognitive focus create a hidden tax that could offset their benefits?

Can a single model generate all modalities without external encoders?

Most multimodal systems rely on separate encoders for each modality. This research explores whether training a unified foundation model on discrete tokens across text, image, video, and speech can enable any-to-any generation without those external components.
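One common way to put several modalities into a single token stream is to give each modality's codebook its own slice of one shared vocabulary. The sketch below illustrates that idea only; the codebook sizes and the names `CODEBOOK_SIZES`, `to_global`, and `to_local` are assumptions for illustration, not details taken from the research described above.

```python
# Hypothetical codebook sizes per modality (illustrative values only).
CODEBOOK_SIZES = {"text": 32000, "image": 8192, "video": 8192, "speech": 1024}


def build_offsets(sizes):
    """Assign each modality a contiguous ID range in one shared vocabulary."""
    offsets = {}
    start = 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # start is now the total vocabulary size


OFFSETS, VOCAB_SIZE = build_offsets(CODEBOOK_SIZES)


def to_global(modality, local_id):
    """Map a modality-local token ID to its ID in the shared vocabulary."""
    if not 0 <= local_id < CODEBOOK_SIZES[modality]:
        raise ValueError(f"{local_id} is out of range for {modality}")
    return OFFSETS[modality] + local_id


def to_local(global_id):
    """Recover (modality, local_id) from a shared-vocabulary token ID."""
    for name, start in OFFSETS.items():
        if start <= global_id < start + CODEBOOK_SIZES[name]:
            return name, global_id - start
    raise ValueError(f"{global_id} is outside the shared vocabulary")
```

With a layout like this, one model can predict a token from any modality with a single output layer, and the ID range alone says which decoder should render it.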

When and how much should AI interrupt human reasoning?

Most AI explanations focus on what to say, not when to say it or how intrusively. This explores how the timing and scale of interventions shape whether support feels collaborative or disruptive.

Can generating entire videos at once beat keyframe interpolation?

Does synthesizing a video's full temporal duration in a single pass, rather than generating keyframes and filling gaps, produce more globally coherent motion? This explores whether pipeline decomposition fundamentally limits motion consistency.

Can bounding boxes replace image encoders for document understanding?

Explores whether spatial layout information alone, encoded as bounding boxes, can capture the multimodal signal needed for document understanding without expensive visual encoding. This matters because image encoders add significant computational cost to document processing systems.
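A minimal sketch of what "layout encoded as bounding boxes" could look like in practice: OCR words are paired with page-normalized box coordinates and serialized as text, so no image encoder is involved. The `OcrWord` type, the 0 to 1000 grid, and the `<box>` tag format are assumptions chosen for illustration, not the encoding used by any particular system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OcrWord:
    text: str
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def normalize_box(box, page_width, page_height, grid=1000):
    """Rescale pixel coordinates to integers on a 0..grid page-relative scale."""
    x0, y0, x1, y1 = box

    def scale(value, size):
        # Clamp so boxes that spill past the page edge stay in range.
        return max(0, min(grid, round(value * grid / size)))

    return (
        scale(x0, page_width),
        scale(y0, page_height),
        scale(x1, page_width),
        scale(y1, page_height),
    )


def serialize_layout(words: List[OcrWord], page_width, page_height):
    """Emit one line per word: the text followed by its normalized box."""
    lines = []
    for word in words:
        x0, y0, x1, y1 = normalize_box(word.box, page_width, page_height)
        lines.append(f"{word.text} <box>{x0},{y0},{x1},{y1}</box>")
    return "\n".join(lines)
```

Normalizing to a fixed grid makes the coordinates independent of scan resolution, so the same layout yields the same tokens at any page size.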

Can we solve modality competition through architectural design?

Does modality competition in multimodal models stem from fundamental training conflicts, or from specific architectural choices? Understanding the root cause could reveal whether the trade-off is solvable.

Can AI systems read cognitive state from interaction patterns alone?

Explores whether behavioral telemetry—gaze, typing hesitation, interaction speed—can serve as a reliable continuous signal of user cognitive state without explicit self-report, and what design constraints this imposes.

Does multimodal zero-shot performance actually generalize or interpolate?

Explores whether multimodal models like CLIP truly generalize to unseen concepts or whether their impressive performance merely reflects memorization of frequently seen concepts during pretraining.

Are text-only language models fundamentally limited by abstraction?

Explores whether text's compression of physics, geometry, and causality into symbols creates an irreducible ceiling for language-only AI, and whether multimodal approaches can overcome this structural constraint.

Can video language models actually understand time?

This research investigates whether video LLMs truly grasp temporal concepts like causality and event progression, or merely recognize spatial content across frames. Understanding this gap matters for video understanding tasks that depend on reasoning about time.

Why do vision and language scale so differently?

IsoFLOP analysis reveals vision and language follow distinct scaling curves—vision demands far more training data than language at equivalent compute budgets. Understanding this asymmetry matters for designing multimodal architectures that serve both modalities well.
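For readers unfamiliar with the method, an IsoFLOP analysis holds the compute budget fixed, trades model size against training data, and looks for the size where loss is lowest. The sketch below shows those mechanics under the common approximation that training compute is about 6 x parameters x tokens for dense transformers; applying that approximation to vision models, and the function names used here, are assumptions rather than details from the analysis described above.

```python
import math


def isoflop_tokens(compute_budget, num_params):
    """Training tokens affordable at a fixed budget, assuming C ~ 6 * N * D."""
    return compute_budget / (6.0 * num_params)


def isoflop_sweep(compute_budget, param_counts):
    """List (params, tokens, tokens_per_param) points on one IsoFLOP curve."""
    points = []
    for n in param_counts:
        d = isoflop_tokens(compute_budget, n)
        points.append((n, d, d / n))
    return points


def parabola_minimum(xs, ys):
    """x-coordinate of the vertex of the parabola through three (x, y) points.

    In an IsoFLOP fit, x is typically log10(params) and y the measured loss.
    """
    (a, b, c), (fa, fb, fc) = xs, ys
    numerator = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    denominator = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    if denominator == 0:
        raise ValueError("points are collinear; no unique minimum")
    return b - 0.5 * numerator / denominator


# Example: model sizes to train at one budget (no losses are invented here;
# they would come from actually training each configuration).
sweep = isoflop_sweep(6e18, [1e8, 1e9, 1e10])
log_sizes = [math.log10(n) for n, _, _ in sweep]
```

Repeating the sweep at several budgets, separately for vision and for language, gives one optimal size per budget and modality; differences in how those optima shift with compute are the kind of asymmetry the question refers to.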