Distributed LLM Inference
Schedulers, KV-cache reuse, and load balancing across GPUs. Built Preble, the first distributed system targeting prompt sharing.
I build the systems that make large language model inference fast at scale: distributed prefix caching, multi-SLO serving, RL post-training, and fast cloning for agent workspaces.
My research on distributed prefix scheduling, intercept-aware serving, and scheduler overhead has informed the design of production LLM inference systems across major industry laboratories and open-source stacks.
I am a final-year PhD candidate at UCSD's WukLab, advised by Professor Yiying Zhang, and previously completed my B.S. and M.S. in Electrical Engineering and Computer Science at UC Berkeley, where I worked in Sky and RISELab under Professor Joseph Gonzalez. My research concerns the systems infrastructure underpinning large language model inference and reinforcement-learning training.
Current and prior projects include: Preble (ICLR 2025), which formalized prefix caching as a distributed systems problem; Nitsum, a runtime for multi-SLO scheduling with adaptive tensor parallelism; DAS (MLSys 2026), distribution-aware speculative decoding for reinforcement-learning rollouts; TClone, a kernel-level system for forkable workspaces in computer-use agents; and InferCept (ICML 2024), an inference framework for tool-augmented language models.
At Berkeley I worked on serverless execution across cloud, 5G, and edge environments, as well as on machine-learning robustness under distribution shift. My industry experience includes Amazon Web Services (AWS SAM-CLI), Amazon Alexa, New Relic Pixie Labs (eBPF observability), and an earlier role as Chief Technology Officer of an early-stage startup.

Identified and formalized distributed prefix caching as a systems problem, including the tradeoffs between cache reuse, fairness, and prefill/decode balance.
Rust global scheduler + C++/Python local schedulers; custom CUDA kernels to dynamically meet strict per-tenant SLOs.
DAS: length-aware speculative decoding that beats the long tail of agentic rollouts, without changing accuracy.
TClone uses kernel-level CRIU, copy-on-write memory, and filesystem versioning for forkable GUI workspaces with sub-second branching and isolated state.
Workspace primitives for CUAs: isolated process / memory / filesystem / GUI snapshots; evaluated across 600+ agent tasks.
Length-aware speculation prioritizes long trajectories: 50% lower rollout latency on agentic reasoning, with no accuracy loss.
Linux kernel + CRIU + CoW memory + FS versioning. Live GUI workspaces fork with isolated process/memory/FS/GUI state.
Treats tensor parallelism as a first-class runtime control surface. Jointly optimizes TP level, prefill/decode split, and scheduling, with TP-aware weight reuse and fast KV migration to make frequent adaptation practical.
Adaptive hierarchical search (AdaSeek) autotunes workflow structure, operators, and prompts. Up to 2.8× higher quality, 10× lower cost, and 2.7× lower latency.
First distributed serving system targeting prompt sharing. Co-optimizes KV reuse with load balancing across a fleet.
First inference framework targeting tool-augmented LLMs. Avoids 37–40% recomputation. 1.6–2× throughput.
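To make the idea of co-optimizing KV reuse with load balancing concrete, here is a minimal toy sketch of a prefix-aware router. It is not Preble's actual algorithm: the names `Gpu`, `route`, `shared_prefix_len`, and `load_weight`, as well as the cost function itself, are hypothetical and chosen only for illustration. Each request goes to the GPU with the lowest estimated cost, where cost is the number of tokens that would need fresh prefill plus a weighted penalty for the work already queued there.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    name: str
    load: int = 0  # toy load metric: prefill tokens assigned so far (never drained here)
    cached: list = field(default_factory=list)  # token sequences whose KV is assumed cached


def shared_prefix_len(a, b):
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_hit(gpu, tokens):
    """Longest prefix of `tokens` already cached on `gpu` (0 if nothing matches)."""
    return max((shared_prefix_len(tokens, c) for c in gpu.cached), default=0)


def route(gpus, tokens, load_weight=1.0):
    """Pick the GPU minimizing (uncached tokens) + load_weight * (current load)."""

    def cost(gpu):
        miss = len(tokens) - best_hit(gpu, tokens)
        return miss + load_weight * gpu.load

    target = min(gpus, key=cost)  # ties go to the earliest GPU in the list
    target.load += len(tokens) - best_hit(target, tokens)
    target.cached.append(list(tokens))
    return target


if __name__ == "__main__":
    system_prompt = list(range(100))  # stand-in for a long shared prompt
    gpus = [Gpu("a"), Gpu("b")]
    for suffix in ([1000, 1001], [2000], [3000]):
        chosen = route(gpus, system_prompt + suffix, load_weight=0.5)
        print(chosen.name, [g.load for g in gpus])
```

In this sketch, a small `load_weight` favors cache reuse (requests sharing a prompt pile onto one GPU), while a large one favors spreading work even at the cost of recomputing the shared prefix. A real system would also need to account for cache eviction, decode load, and fairness.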
Full author lists for each project are on the Publications page.
I've given invited talks on this research at Meta, Together AI, and MLSys.