On the job market · Final-year PhD

Vikranth Srivatsa.

I build the systems that make large language model inference fast at scale: distributed prefix caching, multi-SLO serving, RL post-training, and fast cloning for agent workspaces.

PhD · UCSD WukLab · Advised by Yiying Zhang · B.S./M.S. · UC Berkeley · Sky / RISELab

Publications →
Read on WukLab ↗
Google Scholar · 130+ citations ↗

Where my work shows up

vLLM

Scheduler redesign

≈ 30% performance improvement, upstreamed from my analysis

via Scheduling Overhead blog

SGLang

Prefix-aware routing

Distributed prefix-caching ideas adopted in production routing

via Preble · ICLR '25

Mooncake

KV-cache disaggregation

Distributed KV-state reuse framing adopted in Moonshot's serving stack

via Preble · ICLR '25

NVIDIA · OpenAI · ByteDance · Together AI

Inference & RL post-training

Prefix routing, multi-tier SLO scheduling, and speculative decoding for RL

via Preble · Nitsum · DAS

My research on distributed prefix scheduling, intercept-aware serving, and scheduler overhead has informed the design of production LLM inference systems across major industry laboratories and open-source stacks.

About

MLSys Researcher.

I am a final-year PhD candidate at UCSD's WukLab, advised by Professor Yiying Zhang, and previously completed my B.S. and M.S. in Electrical Engineering and Computer Science at UC Berkeley, where I worked in Sky and RISELab under Professor Joseph Gonzalez. My research concerns the systems infrastructure underpinning large language model inference and reinforcement-learning training.

Current and prior projects include: Preble (ICLR 2025), which formalized prefix caching as a distributed systems problem and has informed the design of production LLM serving stacks across major industry laboratories; Nitsum, a runtime for multi-SLO scheduling with adaptive tensor parallelism; DAS (MLSys 2026), distribution-aware speculative decoding for reinforcement-learning rollouts; TClone, a kernel-level system for forkable workspaces in computer-use agents; and InferCept (ICML 2024), an inference framework for tool-augmented language models.

At Berkeley I worked on serverless execution across cloud, 5G, and edge environments, as well as on machine-learning robustness under distribution shift. My industry experience includes Amazon Web Services (AWS SAM-CLI), Amazon Alexa, New Relic Pixie Labs (eBPF observability), and an earlier role as Chief Technology Officer of an early-stage startup.

Vikranth Srivatsa

PhD candidate · UCSD

Research

What I work on.

/01

Distributed LLM Inference

Schedulers, KV-cache reuse, and load balancing across GPUs. Built Preble, the first distributed system targeting prompt sharing.

/02

Prefix Caching at Scale

Discovered and formalized distributed prefix caching as a systems problem, including the tradeoffs between cache reuse, fairness, and prefill/decode balance.

/03

Multi-SLO Serving + Tensor Parallelism

Rust global scheduler + C++/Python local schedulers; custom CUDA kernels to dynamically meet strict per-tenant SLOs.

/04

RL Post-Training Systems

DAS: length-aware speculative decoding that beats the long tail of agentic rollouts, without changing accuracy.

/05

Fast Cloning for Serverless & RL Envs

TClone uses kernel-level CRIU, copy-on-write memory, and filesystem versioning for forkable GUI workspaces with sub-second branching and isolated state.

/06

Computer-Use Agents

Workspace primitives for CUAs: isolated process / memory / filesystem / GUI snapshots; evaluated across 600+ agent tasks.

Selected work

Papers & systems.

All publications →

2026
MLSys 2026
Beat the Long Tail: Distribution-Aware Speculative Decoding for RL Training ↗
Length-aware speculation prioritizes long trajectories: 50% lower rollout latency on agentic reasoning, with no accuracy loss.
RL Post-Training · Speculative Decoding · Scheduling
50% ↓
rollout latency
2025
Pre-print
TClone: Low-Latency Forking of Live GUI Environments for Computer-Use Agents ↗
Linux kernel + CRIU + CoW memory + FS versioning. Live GUI workspaces fork with isolated process/memory/FS/GUI state.
Computer-Use Agents · Kernel Systems · CRIU · RL Environments
1.9× ↓
task latency vs. KVM
2025
Pre-print
Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism ↗
Treats tensor parallelism as a first-class runtime control surface. Jointly optimizes TP level, prefill/decode split, and scheduling, with TP-aware weight reuse and fast KV migration to make frequent adaptation practical.
LLM Serving · Multi-Tier SLO · Tensor Parallelism · Goodput
5.3× ↑
SLO-compliant goodput
2026
KDD 2026
Cognify: Supercharging Gen-AI Workflows With Hierarchical Autotuning ↗
Adaptive hierarchical search (AdaSeek) autotunes workflow structure, operators, and prompts. 2.8× higher quality, 10× lower cost, 2.7× lower latency.
Gen-AI Workflows · Autotuning · RAG
10× ↓
monetary cost
2025
ICLR 2025
Preble: Efficient Distributed Prompt Scheduling for LLM Serving ↗
First distributed serving system targeting prompt sharing. Co-optimizes KV reuse with load balancing across a fleet.
LLM Serving · Distributed Scheduling · Prefix Caching · Load Balancing
14.5× ↓
p99 latency
2024
ICML 2024
InferCept: Efficient Intercept Support for Augmented LLM Inference ↗
First inference framework targeting tool-augmented LLMs. Avoids 37–40% recomputation. 1.6–2× throughput.
Augmented LLMs · LLM Inference · Memory Management
2× ↑
throughput

Collaborated with · WukLab & Berkeley

Yiying Zhang ↗
PhD advisor · UCSD WukLab
Joseph Gonzalez ↗
M.S./B.S. advisor · UC Berkeley Sky / RISELab
Yutong Huang ↗
TClone (second author)
Zijian He ↗
Preble · InferCept · Cognify · Nitsum
Reyna Abhyankar ↗
InferCept · Cognify · Scheduling overhead
Pu Guo ↗
Nitsum
Dongming Li ↗
Preble · Nitsum · Scheduling overhead
Alan Pham
Worst-group Generalization · Cloudless
Eunice Chan
Worst-group Generalization (co-first)

External collaborators

Full author lists for each project are on the Publications page.

Talks

I've given invited talks on this research at Meta, Together AI, and MLSys.

News

May 2026
DAS accepted to MLSys 2026: distribution-aware speculative decoding for RL post-training, with a 50% rollout latency reduction.
Feb 2026
On the academic & industry job market for Research Scientist and Member of Technical Staff roles.
Feb 2025
Cognify accepted to KDD 2026: hierarchical autotuning for Gen-AI workflows.
Apr 2024
Preble accepted to ICLR 2025: distributed prefix caching with 14.5× lower latency than SOTA.
Jul 2024
InferCept published at ICML 2024: efficient serving for augmented LLMs.
