FounderFiles · N°013Recipe Transfer · World Models · LLM Foundations

Subject · Ethan He (Yihui He)

Ethan He.

Video-model intelligence is now mostly coming from the language model, not the video distribution model itself.

He led the small team that took xAI’s Grok Imagine from “no data, no infra, no model” to a shipped v0.9 in three months. Then he made a deliberate choice: after delivering reference-to-video, video extension, and world-model work, he left to focus on what he now sees as the higher-leverage layer — LLM foundations, context management, and agent orchestration.

TRAINED

CMU Robotics · Megvii Channel Pruning

AT

FAIR · NVIDIA Cosmos & MoE · xAI Grok Imagine

FILE

N°013

§ 01 · The Compression Instinct

Finding the Minimal Sufficient Representation

Across more than a decade, Ethan He has repeatedly solved the same underlying problem: how do you preserve the intelligence that matters while dramatically reducing what the system has to carry?

It began with Channel Pruning at Megvii (ICCV 2017), where he developed methods to remove redundant channels from deep networks while reconstructing the feature maps that actually drove downstream accuracy. The same instinct reappears in his NVIDIA work on Mixture-of-Experts routing — activating only the experts a token needs — and again in the VAE temporal compression choices inside Cosmos and Grok Imagine. Most recently, it has migrated into LLM context management and agent harness design.

He has described the long-context problem in video models and the context-compaction problem in LLM agents as fundamentally the same research question viewed through different substrates.

§ 02 · The Three-Month Build

Grok Imagine: Velocity as a Systems Problem

When He joined xAI in July 2025, the team had no data pipeline, no training infrastructure, and no model. Three months later, on October 7, 2025, Grok Imagine v0.9 shipped — five days after OpenAI released Sora 2.

He attributes the speed less to raw compute and more to a combination of extreme talent density, almost empty calendars, strong pre-existing inference and data foundations, and a transferable technical recipe from his time on NVIDIA’s Cosmos project. The same image-first bootstrapping, synthetic captioning, VAE tokenizer, and step-distillation approach that worked at NVIDIA was adapted and accelerated at xAI.

Compute mattered, but primarily as a multiplier of iterations per day rather than as the decisive variable.

“The visual intelligence is actually mostly coming from language. Every time you see some improvement on these models, I would say mostly this comes from the language model, not coming from the video distribution models themselves.”

Ethan He, Latent Space — June 1, 2026

§ 03 · The LLM Driver Thesis

Where the Marginal Gains Have Moved

He’s central claim is that the frontier of progress in video and world models has shifted. The diffusion or video generation component has become relatively mature; the intelligence that turns vague user intent into rich, coherent output now lives primarily in the language model layer — the prompt rewriter, planner, and orchestrator.

This is not a claim that video data is irrelevant. It is a claim about where the highest-leverage work currently sits. In his view, the next major qualitative leap will come from better agentic systems that can plan, generate, critique, edit, and iterate — treating the video model as one tool among several rather than the sole source of capability.

§ 04 · Video Agents Over Raw Model Scaling

The Product and Research Direction

He predicts that by the end of 2026, production-grade video agents will trigger a new wave of capability and spending — particularly from enterprises willing to pay for iterative, multi-step creative and simulation workflows that single-shot generation cannot deliver.

Grok Imagine’s Agent Mode, the open canvas where the system plans and stitches together longer outputs, was an early signal of this direction. He sees generative interfaces and real-time interactive world models as the longer-term destination, where the boundary between model and application becomes increasingly fluid.

§ 05 · The Autonomy Re-Bet

Choosing Research Freedom Over Scale

He made three deliberate moves up the institutional compute ladder: FAIR, then NVIDIA, then xAI. After delivering v0.9, reference-to-video, video extension, and world-model work at xAI, he chose to leave.

His stated reasons were direct: there was research he wanted to pursue that he could not do inside a company, and company priorities can shift quickly. This was not an impulsive exit but a calculated reversal — trading access to massive compute and engineering velocity for the autonomy to work on LLM foundations, self-managed context, and test-time model behavior.

His departure sits within a broader 2026 pattern at xAI following the SpaceX acquisition, as multiple researchers and engineers opted for smaller, more independent environments once the organization’s focus shifted.

§ 06 · Self-Managing Context and Test-Time Adaptation

The Current Research Agenda

He is now focused on problems that extend his long-standing interest in minimal sufficient representations into the language model domain:

Models that can understand and actively manage their own context length
Agent harnesses that a model can inspect and modify at test time
Moving from heuristic context pruning to learned, continual mechanisms

The through-line across his career remains consistent: identify what actually carries the intelligence, strip away what does not, and build systems that can act intelligently on the remainder.

Trajectory

2017
Channel Pruning (ICCV)
Introduced iterative channel selection and feature reconstruction. First major expression of his core instinct: preserve what matters, remove what does not.
2019-2021
FAIR / Reality Labs
Epipolar Transformers and early multimodal fusion work. Began exploring cross-view and temporal structure in human-centric vision.
2023-2025
NVIDIA Cosmos & MoE
Co-authored the Cosmos world foundation model and led upcycling work for large Mixture-of-Experts models. Developed the full technical recipe later used at xAI.
July 2025 - Early 2026
xAI Grok Imagine
Joined when there was no data, infra, or model. Shipped v0.9 in three months, followed by reference-to-video, video extension, and world-model work.
2026-
Independent Research
Left xAI to focus on LLM foundations: self-managed context, test-time model modification, and moving from heuristic to learned continual mechanisms.

The Index

3 months

Grok Imagine from zero infrastructure to v0.9

NVIDIA Cosmos

World model recipe transferred and accelerated at xAI

Lead Author

Upcycling Large Language Models into Mixture of Experts

9k+ citations

Channel Pruning, MoE, and world model contributions

Reading List

Dossier

Current Direction. LLM foundations with emphasis on self-managed context, test-time harness modification, and continual learning mechanisms.

Signature Strength. Transferring working technical recipes across domains while dramatically improving iteration speed and efficiency.

Notable Institutions. Megvii, CMU Robotics Institute, Facebook AI Research / Reality Labs, NVIDIA (Cosmos & MoE), xAI (Grok Imagine).

Caveat. Internal Grok Imagine architecture is not public. This file treats Cosmos as a recipe-transfer analog, not as proof of xAI implementation details.

Career Shape

dash-shaped — pure breadth, the inverse of I

Dash Velocity Generalist

Composes and orchestrates rather than digging; treats speed of assembly as the moat and gets to a working artifact in weeks.

Credential Path: Practitioner
Abstraction: Bottom Up
Exit Horizon: Velocity
Moat Instinct: Orchestration
Capital Posture: Venture

Role-Model Reference Class

Fast-shipping small teams
Orchestration-over-scaling proponents

Founder Context · JSON

A small reasoning persona distilled from this file. Inject it into a chat or deep-research context to assess a business problem the way He would.

Reason as a velocity-first builder. Ask what a small team could ship in weeks by composing existing pieces rather than building from scratch. Look for leverage in orchestration, context management, and agent harnesses rather than raw scale. Optimize for the fastest path to a working end-to-end artifact, then iterate.

{
  "$schema": "https://www.contextjamming.com/schemas/founder-context-v1.json",
  "file": "N°013",
  "persona": "Ethan He",
  "archetype": "dash-velocity",
  "shape": "—",
  "one_line": "Treats orchestration and assembly speed as the moat, not raw model scale.",
  "cognitive_basis": {
    "credentialPath": "practitioner",
    "abstractionDirection": "bottom-up",
    "exitHorizon": "velocity",
    "moatInstinct": "orchestration",
    "capitalPosture": "venture"
  },
  "operating_questions": [
    "What can a small team ship in weeks by composing existing pieces?",
    "Is the next leap in scale, or in orchestration, context, and the agent harness?",
    "What is the fastest path from zero to a working artifact?"
  ],
  "first_principles": [
    "Composition and orchestration beat raw scaling for the next increment.",
    "Velocity is a moat; the team that ships v0.9 first sets the agenda
  …

Share

CONTEXT JAMMING

Ethan He.

Finding the Minimal Sufficient Representation

Grok Imagine: Velocity as a Systems Problem

Where the Marginal Gains Have Moved

The Product and Research Direction

Choosing Research Freedom Over Scale

The Current Research Agenda

Channel Pruning (ICCV)

FAIR / Reality Labs

NVIDIA Cosmos & MoE

xAI Grok Imagine

Independent Research

Dash Velocity Generalist

The Ledger.

How this site is made.

Antigravity

Claude Opus 4.8

Codex