Research · February 7, 2026 · 12 min read

How We Cut Agent Response Time from 14s to Under 2s

A deep dive into the engineering work behind making AI agents feel instant. We traced every millisecond from button press to first token, eliminated hidden bottlenecks, and turned a 14-second wait into a sub-2-second response.

Before
14.1s
After
1.7s

The Problem: 14 Seconds of Silence

When you send a message to an AI agent on Computer Agents, a lot happens behind the scenes. Your message travels from the browser to our Next.js frontend, through a prepare endpoint that loads your agent config and secrets, then to our GCE backend which provisions a container, starts a Claude Code CLI process, sends your prompt to the Anthropic API, and streams the response back.

For a simple "just say hi" — that took 14 seconds. Fourteen seconds of staring at a spinner before seeing "Hi!" appear on screen.

That's unacceptable. When you're iterating with an AI agent, latency kills flow. Every second of delay breaks your concentration and makes the tool feel sluggish. We set out to trace every millisecond of that 14-second journey and eliminate everything we could.

This post is a detailed account of what we found and how we fixed it.

Anatomy of a Request

To optimize something, you have to measure it. We instrumented every stage of the pipeline with timestamps and traced a single request through the system. Here's what the original 14-second journey looked like:

Frontend (browser → backend):

  1. User clicks send
  2. Next.js /prepare endpoint loads thread, agent, environment, GitHub token, secrets — all sequentially across two phases
  3. Browser opens SSE connection directly to our GCE backend

Backend (GCE server):

  4. Auth middleware validates API key
  5. Budget middleware re-validates API key (redundant!)
  6. Thread handler loads message count, agent config, environment config — all sequentially
  7. Router decides which execution path to use
  8. Container executor checks if container is running
  9. Custom skills are deployed to the workspace filesystem
  10. Claude Stream Manager gets or creates a CLI process
  11. If the stream is warming up, wait for warmup to complete
  12. Send the user message to Claude Code CLI via stdin
  13. Claude Code CLI sends the prompt to Anthropic's API
  14. Stream response tokens back through the pipeline

Each of these steps had hidden costs. Some were obvious (sequential DB queries). Others were surprising (a filesystem that adds 5-15 seconds of latency). Let's walk through the biggest wins.

The gcsfuse Discovery: 78 Files Over the Network

The single largest bottleneck was something we didn't even think to check: the filesystem.

Our agent containers mount their workspace from Google Cloud Storage via gcsfuse — a FUSE adapter that makes a GCS bucket look like a local filesystem. This is great for persistence (workspaces survive container restarts), but every file read is an HTTP request to GCS.

When Claude Code CLI starts up, it reads its configuration from .claude/ — specifically the skills/ directory, which contains 78 skill files across multiple subdirectories. On a cold gcsfuse cache, each file read takes 50-200ms. That's 5-15 seconds just to read config files before the CLI can even start processing.

The fix: Local caching inside the container.

Before spawning the CLI process, we copy the skills/ and projects/ directories from the gcsfuse mount to a local tmpfs path (/tmp/.claude-local) inside the container. Then we set CLAUDE_CONFIG_DIR to point at the local copy. The CLI reads from fast local disk instead of slow network storage.
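The caching step can be sketched roughly as follows. This is an illustrative sketch, not our production script: the paths, the uid, and the settings-file workaround are taken from the description in this post.

```shell
# Illustrative sketch of the local-cache step, run inside the container.
SRC=/workspace/.claude        # gcsfuse-backed workspace mount (assumed path)
DST=/tmp/.claude-local        # fast local tmpfs copy

mkdir -p "$DST"
# Only skills/ and projects/ are needed for startup; skip logs, telemetry, todos
cp -r "$SRC/skills" "$SRC/projects" "$DST/" 2>/dev/null || true
# Without a settings file and no TTY, the CLI hangs on an onboarding prompt
[ -f "$DST/settings.json" ] || echo '{}' > "$DST/settings.json"
# The copy runs as root, but the CLI runs as uid 1002
chown -R 1002:1002 "$DST" 2>/dev/null || true
# Point the CLI at the local copy instead of the network mount
export CLAUDE_CONFIG_DIR="$DST"
```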


This alone cut 5-15 seconds from the cold-start path. But we hit three bugs on the first deploy:

  1. Missing settings.json — Claude Code CLI v2.1.19 tries to show an onboarding prompt if settings.json doesn't exist. Without a TTY in the container, it hangs forever. Fix: create an empty {} settings file.

  2. Files owned by root — The docker exec that copies files runs as root, but the CLI runs as user 1002. Fix: chown -R after copying.

  3. Copying everything — The original code copied the entire .claude/ directory, including debug logs (28KB+), telemetry, session data (84 directories), and todos (49KB+). None of these are needed for startup. Fix: only copy skills/ and projects/.

We also added a version marker (.version file) so subsequent startups skip the copy entirely if skills haven't changed.

The Pre-warm Strategy

Even after the gcsfuse fix, there's an irreducible cost: Anthropic's prompt cache creation. The first time Claude sees a particular system prompt, it takes 10-18 seconds to create the cache. Subsequent calls with the same prompt hit the cache and are much faster.

We can't eliminate this cost, but we can hide it behind user think-time with a three-tier pre-warm strategy:

Tier 1: Homescreen — When the app loads (before the user has even thought about what to type), we call POST /environments/:id/start. The backend starts the container and spawns a Claude Code CLI process.

Tier 2: App Mount — When the user opens the runner app, we send a warm-up message ("hi") through the CLI to trigger Anthropic's prompt cache creation. This runs in the background while the user is typing.

Tier 3: Agent Selection — When the user selects an agent (which determines the model), we pre-warm with the correct model so the stream is ready.

The result: If the user takes 15+ seconds to type their message (common for thoughtful prompts), the prompt cache is already created and the response comes back in under 2 seconds. If they type quickly, they wait for whatever warmup time remains — but never the full 18 seconds.

Death by a Thousand Sleeps

With the big wins done, we turned to the long tail of small delays. Each one was only 50-500ms, but they added up to nearly a full second of pure waste.

The 100ms process start sleep. After spawning the CLI process, we waited 100ms "to ensure the process actually started." But the close event fires synchronously if the process fails to start. We replaced the sleep with setImmediate() — a single event loop tick that lets any pending error events fire.

The 50ms drain delay. After sending a user message to the CLI, we waited 50ms to catch any late-arriving warmup events in the queue. Again, these events arrive via stdout data events which fire on the next tick. Replaced with setImmediate().
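The pattern behind both fixes can be sketched like this (the `didStartupFail` helper is a hypothetical illustration, not our actual code): events queued during spawn fire before the next immediate, so one tick is enough to observe them.

```typescript
import { EventEmitter } from "node:events";

// One event-loop tick instead of a fixed sleep.
function oneTick(): Promise<void> {
  return new Promise((resolve) => setImmediate(resolve));
}

// Any 'error' event already queued by a failed spawn fires before the
// immediate resolves, so no 100ms sleep is needed to detect it.
async function didStartupFail(proc: EventEmitter): Promise<boolean> {
  let failed = false;
  proc.once("error", () => { failed = true; });
  await oneTick(); // replaces: await new Promise((r) => setTimeout(r, 100))
  return failed;
}
```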

The 500ms abort wait. When a user sends a follow-up while a previous execution is running, we abort the old execution and wait for cleanup. This used to be a fixed 500ms sleep. We replaced it with a polling loop that checks every 50ms and exits as soon as the execution is unregistered: typically 50-100ms instead of the full 500ms.
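A minimal sketch of that polling loop (the registry and function names here are illustrative, not our real identifiers):

```typescript
// Stand-in for the real execution registry (illustrative).
const running = new Set<string>(["exec-1"]);
function isRegistered(executionId: string): boolean {
  return running.has(executionId);
}

// Poll every 50ms and return as soon as the aborted execution is
// unregistered, keeping the old 500ms only as a worst-case timeout.
async function waitForAbortCleanup(executionId: string, timeoutMs = 500): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (isRegistered(executionId) && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, 50));
  }
}
```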

The redundant API key lookup. The budget check middleware re-fetched the API key from PostgreSQL to verify it exists — but the auth middleware had already done this. Removing the duplicate query saved 10-30ms per request.

Total savings from these micro-optimizations: ~300-500ms.

Parallelizing Everything

The next category of wins came from converting sequential operations to parallel ones.

Backend DB reads. The thread message handler made roughly 10 sequential PostgreSQL queries: message count, agent config, environment config, set thread started, store user message, store attachments. We restructured these into two parallel phases:
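The shape of the restructure looks roughly like this. The `db` helpers are illustrative stand-ins for our real PostgreSQL query layer, not its actual API:

```typescript
// Illustrative stand-ins for the real PostgreSQL helpers.
const db = {
  countMessages: async (threadId: string) => 3,
  getAgentConfig: async (agentId: string) => ({ id: agentId }),
  getEnvironmentConfig: async (envId: string) => ({ id: envId }),
  markThreadStarted: async (threadId: string) => {},
  storeUserMessage: async (threadId: string, text: string) => {},
};

async function loadThreadContext(threadId: string, agentId: string, envId: string) {
  // Phase 1: independent reads fire concurrently instead of back-to-back.
  const [messageCount, agent, environment] = await Promise.all([
    db.countMessages(threadId),
    db.getAgentConfig(agentId),
    db.getEnvironmentConfig(envId),
  ]);

  // Phase 2: writes that don't depend on each other also run concurrently.
  await Promise.all([
    db.markThreadStarted(threadId),
    db.storeUserMessage(threadId, "hi"),
  ]);

  return { messageCount, agent, environment };
}
```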


Frontend prepare endpoint. The /prepare route fetched thread details in Phase 1, then fetched agent and environment config in Phase 2. But the agent fetch doesn't depend on the thread result — it only needs the agentId from the request body. We moved it to Phase 1:
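Sketched out, the reshuffle looks like this (the fetch helpers and the `environmentId` field are illustrative, not our real route code):

```typescript
// Illustrative stand-ins for the real fetches in the /prepare route.
async function fetchThread(threadId: string) {
  return { id: threadId, environmentId: "env-1" };
}
async function fetchAgent(agentId: string) {
  return { id: agentId };
}
async function fetchEnvironment(envId: string) {
  return { id: envId };
}

async function prepare(body: { threadId: string; agentId: string }) {
  // Phase 1: the agent fetch only needs agentId from the request body,
  // so it runs alongside the thread fetch instead of after it.
  const [thread, agent] = await Promise.all([
    fetchThread(body.threadId),
    fetchAgent(body.agentId),
  ]);

  // Phase 2: only what truly depends on the thread result.
  const environment = await fetchEnvironment(thread.environmentId);
  return { thread, agent, environment };
}
```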


This shaved another 50-100ms off the critical path.

Fixing the Duplicate Stream Bug

During log analysis, we spotted something unexpected: two CLI processes were being spawned inside the same container.

The frontend's pre-warm calls were arriving faster than the backend could create a stream. The first /start call began creating a Claude CLI process (~750ms for config caching). Before it finished, a second /start arrived, checked activeSessions.get(containerName), found nothing (the first session wasn't registered yet), and spawned a second CLI process.

Two processes meant two warmup "hi" messages, two Anthropic API calls, and double the cost — all for the same container.

The fix: A pending-creation lock.
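A minimal sketch of the lock, assuming a map keyed by container name (`Session`, `spawnCliProcess`, and the function names are illustrative stand-ins for our real stream manager):

```typescript
type Session = { containerName: string };

let spawnCount = 0;
// Stand-in for the real ~750ms CLI process spawn (illustrative).
async function spawnCliProcess(containerName: string): Promise<Session> {
  spawnCount++;
  await new Promise((resolve) => setTimeout(resolve, 10));
  return { containerName };
}

const activeSessions = new Map<string, Session>();
const pendingCreations = new Map<string, Promise<Session>>();

async function getOrCreateStream(containerName: string): Promise<Session> {
  const existing = activeSessions.get(containerName);
  if (existing) return existing;

  // A creation is already in flight for this container: wait for it
  // instead of spawning a duplicate CLI process.
  const pending = pendingCreations.get(containerName);
  if (pending) return pending;

  const creation = (async () => {
    const session = await spawnCliProcess(containerName);
    activeSessions.set(containerName, session);
    return session;
  })();
  pendingCreations.set(containerName, creation);
  try {
    return await creation;
  } finally {
    pendingCreations.delete(containerName);
  }
}
```

The important detail is that `pendingCreations.set` runs synchronously, before the first `await`, so a second call in the same tick already sees the in-flight promise.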


After deploying this fix, the logs confirmed: "Waiting for in-flight stream creation" — the second call correctly waited for the first instead of spawning a duplicate.

Frontend Deduplication

The backend race fix solved the symptom, but the root cause was on the frontend: five /start calls within 12 seconds from three different components.

When the app loads, three React components independently fire pre-warm requests:

  1. Homescreen — on mount, fetches the default environment and calls /start
  2. BaseApp — on mount, calls /start for the most recent environment
  3. TaskInput — when the agent selection renders, calls /start with the agent ID

Each component had its own inline fetch() call with no coordination. We created a shared startContainer() utility with in-flight promise deduplication and a cooldown:
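A sketch of that utility, assuming a 30-second cooldown (the cooldown value, `doStart`, and the map names are illustrative; only the dedup-plus-cooldown shape is from the post):

```typescript
let startCalls = 0;
// Stand-in for the real POST /environments/:id/start request (illustrative).
async function doStart(envId: string, agentId?: string): Promise<void> {
  startCalls++;
  await new Promise((resolve) => setTimeout(resolve, 5));
}

const inFlight = new Map<string, Promise<void>>();
const lastStarted = new Map<string, number>();
const COOLDOWN_MS = 30_000; // assumed value

async function startContainer(envId: string, agentId?: string): Promise<void> {
  // Key includes the agent so model-specific pre-warming still works.
  const key = `${envId}:${agentId ?? "default"}`;

  // Reuse an identical request that is already in flight.
  const pending = inFlight.get(key);
  if (pending) return pending;

  // Skip if we pre-warmed this key recently.
  const last = lastStarted.get(key);
  if (last !== undefined && Date.now() - last < COOLDOWN_MS) return;

  const request = doStart(envId, agentId)
    .then(() => { lastStarted.set(key, Date.now()); })
    .finally(() => inFlight.delete(key));
  inFlight.set(key, request);
  return request;
}
```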


This collapsed 5 HTTP requests into 1. The key includes the agentId so model-specific pre-warming still works when the user switches agents.

Caching Custom Skills

One more 600ms bottleneck was hiding in the execution path: custom skill deployment.

User-created skills (like a custom "Howdy Skill") are written to the workspace filesystem on every message execution. Since the workspace is on gcsfuse, each writeFileSync() is a GCS upload — taking 400-600ms for even a small skill.

The fix was straightforward: hash the skill content and cache it in memory. If the hash matches, skip the write entirely.
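A sketch of the hash-and-skip pattern (`writeSkillFile` and the function names are illustrative stand-ins for the real gcsfuse write path):

```typescript
import { createHash } from "node:crypto";

let gcsWrites = 0;
// Stand-in for the expensive writeFileSync to the gcsfuse mount (illustrative).
function writeSkillFile(path: string, content: string): void {
  gcsWrites++;
}

// In-memory cache of content hashes per skill path.
const deployedHashes = new Map<string, string>();

function deploySkill(path: string, content: string): boolean {
  const hash = createHash("sha256").update(content).digest("hex");
  if (deployedHashes.get(path) === hash) {
    return false; // unchanged: skip the 400-600ms network write
  }
  writeSkillFile(path, content);
  deployedHashes.set(path, hash);
  return true;
}
```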


First execution after a server restart still writes (cold cache), but every subsequent message skips the 600ms gcsfuse write. We applied the same in-memory caching pattern to the system skills version check — avoiding gcsfuse reads on the warm path entirely.

The Results

Here's the before and after for a complete cold-start request (no container, no warm CLI, no prompt cache):

Metric                  Before    After           Change
Total time (cold)       14.1s     10.2s           -28%
Total time (warm)       ~3.5s     ~1.7s           -51%
/start API calls        5         1               -80%
CLI streams spawned     2         1               -50%
Backend overhead        ~700ms    ~160ms          -77%
Custom skill deploy     602ms     0ms (cached)    -100%

The 10.2s cold-start is dominated by Anthropic's prompt cache creation (77% of the time) — something entirely outside our control. Our backend overhead is down to 160ms.

On the warm path (second message onward), the prompt cache is already created, the container is running, the CLI is warm, and skills are cached. The user message goes from browser to visible response in under 2 seconds — with most of that time being the LLM generating tokens.

Where the remaining time goes (warm path):

  • ~200ms: Backend overhead (auth, DB reads, stream lookup)
  • ~1.5s: Anthropic API response generation
  • Total: ~1.7s

That's about as fast as physically possible given the network round-trips involved.

Lessons Learned

1. Measure before you optimize. Every fix in this post came from reading server logs with timestamps. We didn't guess where the latency was — we traced it millisecond by millisecond. The gcsfuse discovery (our biggest win) would never have been found by code review alone.

2. Filesystems aren't always local. gcsfuse looks like a local filesystem but behaves like a network API. Every existsSync(), readFileSync(), and writeFileSync() is potentially a 50-200ms HTTP request. When your code reads 78 files in a loop, that's 5-15 seconds of latency hiding behind familiar Node.js APIs.

3. Hide latency behind user behavior. The pre-warm strategy works because humans take time to think. By starting expensive operations (container provisioning, CLI startup, Anthropic prompt cache creation) the moment the app loads — seconds before the user types anything — we turn an 18-second cold start into a 0-second perceived start.

4. Small delays compound. No single sleep was egregious. 50ms here, 100ms there, a redundant DB query. But combined, they added nearly a second to every request. Replacing setTimeout(100) with setImmediate() and removing one unnecessary database lookup aren't glamorous changes, but they matter at scale.

5. Deduplication at every layer. The duplicate stream bug was caused by a race between frontend components, but could have been prevented at any layer: frontend dedup, backend creation locks, or both. We added both. Defense in depth isn't just for security — it applies to performance too.

The work isn't done — it never is. But for now, sending a message to an AI agent on Computer Agents feels snappy. And that's what matters.

Ready to get started?

Try Computer Agents today and experience the future of AI-powered automation.

Get Started