FounderFiles · N°013
Recipe Transfer · World Models · LLM Foundations
Subject · Ethan He (Yihui He)
Ethan He.
Video-model intelligence is now mostly coming from the language model, not the video distribution model itself.
He led the small team that took xAI’s Grok Imagine from “no data, no infra, no model” to a shipped v0.9 in three months. Then he made a deliberate choice: after delivering reference-to-video, video extension, and world-model work, he left to focus on what he now sees as the higher-leverage layer — LLM foundations, context management, and agent orchestration.
Finding the Minimal Sufficient Representation
Across more than a decade, Ethan He has repeatedly solved the same underlying problem: how do you preserve the intelligence that matters while dramatically reducing what the system has to carry?
It began with Channel Pruning at Megvii (ICCV 2017), where he developed methods to remove redundant channels from deep networks while reconstructing the feature maps that actually drove downstream accuracy. The same instinct reappears in his NVIDIA work on Mixture-of-Experts routing — activating only the experts a token needs — and again in the VAE temporal compression choices inside Cosmos and Grok Imagine. Most recently, it has migrated into LLM context management and agent harness design.
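The routing idea mentioned above, activating only the experts a token needs, can be illustrated with a minimal top-k gating sketch. This is a generic textbook-style scheme (softmax over the selected experts only), not NVIDIA's or xAI's implementation; the names `top_k_route` and `moe_forward` and the scalar toy experts are hypothetical and chosen for illustration.

```python
import math


def top_k_route(gate_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    gate_logits: one router score per expert for a single token.
    Weights are a softmax over the selected experts only, so they sum to 1.
    """
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits; ties resolve to the lower index.
    order = sorted(range(len(gate_logits)), key=lambda i: (-gate_logits[i], i))
    chosen = order[:k]
    # Softmax restricted to the chosen experts (max-subtracted for stability).
    peak = max(gate_logits[i] for i in chosen)
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, experts, gate_logits, k=2):
    """Combine the outputs of only the selected experts for a scalar input x."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))
```

Only the `k` selected experts are ever called, which is where the compute saving comes from: the remaining experts hold capacity without costing anything for this token.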
He has described the long-context problem in video models and the context-compaction problem in LLM agents as fundamentally the same research question posed on different substrates.
Grok Imagine: Velocity as a Systems Problem
When He joined xAI in July 2025, the team had no data pipeline, no training infrastructure, and no model. Three months later, on October 7, 2025, Grok Imagine v0.9 shipped — five days after OpenAI released Sora 2.
He attributes the speed less to raw compute and more to a combination of extreme talent density, almost empty calendars, strong pre-existing inference and data foundations, and a transferable technical recipe from his time on NVIDIA’s Cosmos project. The same image-first bootstrapping, synthetic captioning, VAE tokenizer, and step-distillation approach that worked at NVIDIA was adapted and accelerated at xAI.
Compute mattered, but primarily as a multiplier of iterations per day rather than as the decisive variable.
“The visual intelligence is actually mostly coming from language. Every time you see some improvement on these models, I would say mostly this comes from the language model, not coming from the video distribution models themselves.”
Where the Marginal Gains Have Moved
Ethan He’s central claim is that the frontier of progress in video and world models has shifted. The diffusion or video generation component has become relatively mature; the intelligence that turns vague user intent into rich, coherent output now lives primarily in the language model layer: the prompt rewriter, planner, and orchestrator.
This is not a claim that video data is irrelevant. It is a claim about where the highest-leverage work currently sits. In his view, the next major qualitative leap will come from better agentic systems that can plan, generate, critique, edit, and iterate — treating the video model as one tool among several rather than the sole source of capability.
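The plan, generate, critique, edit, and iterate cycle described above can be sketched as a small control loop in which the video model is just one callable among several. This is an illustrative skeleton under stated assumptions, not a description of any shipped system: `run_agent`, `Draft`, and the four callables are hypothetical names, and the artifact is a plain string standing in for a generated video.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Draft:
    prompt: str
    artifact: str
    notes: List[str] = field(default_factory=list)


def run_agent(
    intent: str,
    plan: Callable[[str], str],
    generate: Callable[[str], str],
    critique: Callable[[str, str], List[str]],
    edit: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Draft:
    """Plan once, generate, then critique and edit until no issues remain."""
    prompt = plan(intent)
    draft = Draft(prompt=prompt, artifact=generate(prompt))
    for _ in range(max_rounds):
        issues = critique(intent, draft.artifact)
        if not issues:
            break  # the critic is satisfied, stop iterating
        draft.notes.extend(issues)
        draft.prompt = edit(draft.prompt, issues)
        draft.artifact = generate(draft.prompt)
    return draft
```

In this framing the generator is called at most `max_rounds + 1` times, and every improvement between calls comes from the planner, critic, and editor, which is the layer the argument above locates in the language model.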
The Product and Research Direction
He predicts that by the end of 2026, production-grade video agents will trigger a new wave of capability and spending — particularly from enterprises willing to pay for iterative, multi-step creative and simulation workflows that single-shot generation cannot deliver.
Grok Imagine’s Agent Mode, the open canvas where the system plans and stitches together longer outputs, was an early signal of this direction. He sees generative interfaces and real-time interactive world models as the longer-term destination, where the boundary between model and application becomes increasingly fluid.
Choosing Research Freedom Over Scale
He made three deliberate moves up the institutional compute ladder: FAIR, then NVIDIA, then xAI. After delivering v0.9, reference-to-video, video extension, and world-model work at xAI, he chose to leave.
His stated reasons were direct: there was research he wanted to pursue that he could not do inside a company, and company priorities can shift quickly. This was not an impulsive exit but a calculated reversal — trading access to massive compute and engineering velocity for the autonomy to work on LLM foundations, self-managed context, and test-time model behavior.
His departure sits within a broader 2026 pattern at xAI following the SpaceX acquisition, as multiple researchers and engineers opted for smaller, more independent environments once the organization’s focus shifted.
The Current Research Agenda
He is now focused on problems that extend his long-standing interest in minimal sufficient representations into the language model domain:
- Models that can understand and actively manage their own context length
- Agent harnesses that a model can inspect and modify at test time
- Moving from heuristic context pruning to learned, continual mechanisms
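For contrast with the learned mechanisms in this agenda, a heuristic context-pruning baseline might look like the sketch below: always keep the first (system) message, then keep the newest turns that fit a token budget. Everything here is an assumption for illustration, including the function names, the message shape, and the whitespace word count standing in for a real tokenizer.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def compact_context(messages, budget):
    """Keep the first (system) message plus the newest messages that fit.

    messages: list of {"role": str, "content": str}, oldest first.
    budget: maximum total tokens, as measured by count_tokens.
    The first message is always kept, even if it alone exceeds the budget.
    """
    if not messages:
        return []
    head, tail = messages[0], messages[1:]
    remaining = budget - count_tokens(head["content"])
    kept = []
    # Walk backwards so the most recent turns are retained first.
    for msg in reversed(tail):
        cost = count_tokens(msg["content"])
        if cost > remaining:
            break  # stop at the first turn that does not fit
        kept.append(msg)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return [head] + kept
```

A rule like this is blind to content: it drops an old turn whether or not that turn still matters, which is the kind of limitation a learned, continual mechanism would aim to remove.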
The through-line across his career remains consistent: identify what actually carries the intelligence, strip away what does not, and build systems that can act intelligently on the remainder.
- 2017 · Channel Pruning (ICCV). Introduced iterative channel selection and feature reconstruction. First major expression of his core instinct: preserve what matters, remove what does not.
- 2019-2021 · FAIR / Reality Labs. Epipolar Transformers and early multimodal fusion work. Began exploring cross-view and temporal structure in human-centric vision.
- 2023-2025 · NVIDIA Cosmos & MoE. Co-authored the Cosmos world foundation model and led upcycling work for large Mixture-of-Experts models. Developed the full technical recipe later used at xAI.
- July 2025 - Early 2026 · xAI Grok Imagine. Joined when there was no data, infra, or model. Shipped v0.9 in three months, followed by reference-to-video, video extension, and world-model work.
- 2026- · Independent Research. Left xAI to focus on LLM foundations: self-managed context, test-time model modification, and moving from heuristic to learned continual mechanisms.
Current Direction. LLM foundations with emphasis on self-managed context, test-time harness modification, and continual learning mechanisms.
Signature Strength. Transferring working technical recipes across domains while dramatically improving iteration speed and efficiency.
Notable Institutions. Megvii, CMU Robotics Institute, Facebook AI Research / Reality Labs, NVIDIA (Cosmos & MoE), xAI (Grok Imagine).
Caveat. Internal Grok Imagine architecture is not public. This file treats Cosmos as a recipe-transfer analog, not as proof of xAI implementation details.
Dash Velocity Generalist
Composes and orchestrates rather than digging; treats speed of assembly as the moat and gets to a working artifact in weeks.
- Credential Path: Practitioner
- Abstraction: Bottom Up
- Exit Horizon: Velocity
- Moat Instinct: Orchestration
- Capital Posture: Venture
- Fast-shipping small teams
- Orchestration-over-scaling proponents
A small reasoning persona distilled from this file. Inject it into a chat or deep-research context to assess a business problem the way He would.
Reason as a velocity-first builder. Ask what a small team could ship in weeks by composing existing pieces rather than building from scratch. Look for leverage in orchestration, context management, and agent harnesses rather than raw scale. Optimize for the fastest path to a working end-to-end artifact, then iterate.
{
"$schema": "https://www.contextjamming.com/schemas/founder-context-v1.json",
"file": "N°013",
"persona": "Ethan He",
"archetype": "dash-velocity",
"shape": "—",
"one_line": "Treats orchestration and assembly speed as the moat, not raw model scale.",
"cognitive_basis": {
"credentialPath": "practitioner",
"abstractionDirection": "bottom-up",
"exitHorizon": "velocity",
"moatInstinct": "orchestration",
"capitalPosture": "venture"
},
"operating_questions": [
"What can a small team ship in weeks by composing existing pieces?",
"Is the next leap in scale, or in orchestration, context, and the agent harness?",
"What is the fastest path from zero to a working artifact?"
],
"first_principles": [
"Composition and orchestration beat raw scaling for the next increment.",
"Velocity is a moat; the team that ships v0.9 first sets the agenda
…