On the job market · Final-year PhD

Vikranth Srivatsa.

I build the systems that make large language model inference fast at scale: distributed prefix caching, multi-SLO serving, RL post-training, and fast cloning for agent workspaces.

PhD · UCSD WukLab
Advised by Yiying Zhang
B.S./M.S. · UC Berkeley · Sky / RISELab
Where my work shows up
vLLM
Scheduler redesign
≈ 30% perf improvement, upstreamed from my analysis
via Scheduling Overhead blog
SGLang
Prefix-aware routing
Distributed prefix-caching ideas adopted in production routing
via Preble · ICLR '25
Mooncake
KV-cache disaggregation
Distributed KV-state reuse framing adopted in Moonshot's serving stack
via Preble · ICLR '25
NVIDIA · OpenAI · ByteDance · Together AI
Inference & RL post-training
Prefix routing, multi-tier SLO scheduling, and speculative decoding for RL
via Preble · Nitsum · DAS

My research on distributed prefix scheduling, intercept-aware serving, and scheduler overhead has informed the design of production LLM inference systems across major industry laboratories and open-source stacks.

About

MLSys Researcher.

I am a final-year PhD candidate at UCSD's WukLab, advised by Professor Yiying Zhang, and previously completed my B.S. and M.S. in Electrical Engineering and Computer Science at UC Berkeley, where I worked in Sky and RISELab under Professor Joseph Gonzalez. My research concerns the systems infrastructure underpinning large language model inference and reinforcement-learning training.

Current and prior projects include: Preble (ICLR 2025), which formalized prefix caching as a distributed systems problem and has informed the design of production LLM serving stacks across major industry laboratories; Nitsum, a runtime for multi-SLO scheduling with adaptive tensor parallelism; DAS (MLSys 2026), distribution-aware speculative decoding for reinforcement-learning rollouts; TClone, a kernel-level system for forkable workspaces in computer-use agents; and InferCept (ICML 2024), an inference framework for tool-augmented language models.

At Berkeley I worked on serverless execution across cloud, 5G, and edge environments, as well as on machine-learning robustness under distribution shift. My industry experience includes Amazon Web Services (AWS SAM-CLI), Amazon Alexa, New Relic Pixie Labs (eBPF observability), and an earlier role as Chief Technology Officer of an early-stage startup.

Vikranth Srivatsa
PhD candidate · UCSD
Research

What I work on.

/01

Distributed LLM Inference

Schedulers, KV-cache reuse, and load balancing across GPUs. Built Preble, the first distributed system targeting prompt sharing.

/02

Prefix Caching at Scale

Discovered and formalized distributed prefix caching as a systems problem, exploring the tradeoffs between cache reuse, fairness, and prefill/decode balance.

/03

Multi-SLO Serving + Tensor Parallelism

Rust global scheduler + C++/Python local schedulers; custom CUDA kernels to dynamically meet strict per-tenant SLOs.

/04

RL Post-Training Systems

DAS: length-aware speculative decoding that beats the long tail of agentic rollouts, without changing accuracy.

/05

Fast Cloning for Serverless & RL Envs

TClone uses kernel-level CRIU, copy-on-write memory, and filesystem versioning for forkable GUI workspaces with sub-second branching and isolated state.

/06

Computer-Use Agents

Workspace primitives for CUAs: isolated process / memory / filesystem / GUI snapshots; evaluated across 600+ agent tasks.

Selected work

Papers & systems.

  1. 2026
    MLSys 2026

    Beat the Long Tail: Distribution-Aware Speculative Decoding for RL Training ↗

    Length-aware speculation prioritizes long trajectories: 50% lower rollout latency on agentic reasoning, with no accuracy loss.

    RL Post-Training · Speculative Decoding · Scheduling
  2. 2025
    Pre-print

    TClone: Low-Latency Forking of Live GUI Environments for Computer-Use Agents ↗

    Linux kernel + CRIU + CoW memory + FS versioning. Live GUI workspaces fork with isolated process/memory/FS/GUI state.

    Computer-Use Agents · Kernel Systems · CRIU · RL Environments
  3. 2025
    Pre-print

    Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism ↗

    Treats tensor parallelism as a first-class runtime control surface. Jointly optimizes TP level, prefill/decode split, and scheduling, with TP-aware weight reuse and fast KV migration to make frequent adaptation practical.

    LLM Serving · Multi-Tier SLO · Tensor Parallelism · Goodput
  4. 2026
    KDD 2026

    Cognify: Supercharging Gen-AI Workflows With Hierarchical Autotuning ↗

    Adaptive hierarchical search (AdaSeek) autotunes workflow structure, operators, and prompts. Up to 2.8× higher quality, 10× lower cost, and 2.7× lower latency.

    Gen-AI Workflows · Autotuning · RAG
  5. 2025
    ICLR 2025

    Preble: Efficient Distributed Prompt Scheduling for LLM Serving ↗

    First distributed serving system targeting prompt sharing. Co-optimizes KV reuse with load balancing across a fleet.

    LLM Serving · Distributed Scheduling · Prefix Caching · Load Balancing
  6. 2024
    ICML 2024

    InferCept: Efficient Intercept Support for Augmented LLM Inference ↗

    First inference framework targeting tool-augmented LLMs. Avoids 37–40% recomputation and delivers 1.6–2× higher throughput.

    Augmented LLMs · LLM Inference · Memory Management
Collaborated with · WukLab & Berkeley
External collaborators

Full author lists for each project are on the Publications page.

Talks

I've given invited talks on this research at Meta, Together AI, and MLSys.

News
  • May 2026
    DAS accepted to MLSys 2026: distribution-aware speculative decoding for RL post-training, with a 50% rollout latency reduction.
  • Feb 2026
    On the academic & industry job market for Research Scientist and Member of Technical Staff roles.
  • Feb 2025
    Cognify accepted to KDD 2026: hierarchical autotuning for Gen-AI workflows.
  • Apr 2024
    Preble accepted to ICLR 2025: distributed prefix caching, with up to 14.5× lower latency than SOTA.
  • Jul 2024
    InferCept published at ICML 2024: efficient serving for augmented LLMs.