Infrastructure

Computer Agents runs on Google Cloud Platform with production-grade infrastructure designed for reliability, performance, and security.

Global Load Balancing

All traffic enters through a global HTTPS load balancer that provides:

SSL/TLS Termination

  • Managed certificates - Automatic renewal via Google-managed SSL
  • TLS 1.3 - Latest encryption standards
  • HTTP/2 - Efficient multiplexed connections

Health Checking

```
Health Check Configuration
├── Protocol: HTTP
├── Path: /health
├── Interval: 10 seconds
├── Timeout: 5 seconds
├── Healthy threshold: 2 consecutive successes
└── Unhealthy threshold: 3 consecutive failures
```

Unhealthy instances are automatically removed from the load balancer pool until they recover.
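
For illustration, a minimal sketch of such a health endpoint on the API server (Express.js, per the instance configuration below); what the check verifies beyond "the process is serving" is an assumption:

```ts
import express from "express";

const app = express();

// The load balancer probes this path every 10 seconds and expects an
// HTTP 200 within the 5-second timeout.
app.get("/health", (_req, res) => {
  res.status(200).json({ status: "ok" });
});

app.listen(8080); // instances serve HTTP on port 8080
```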

Traffic Distribution

  • Round-robin load balancing across healthy instances
  • Connection draining - Graceful handling during deployments
  • DDoS protection - Built-in protection at the edge

Compute Layer

Managed Instance Groups

Our compute layer uses Google Cloud Managed Instance Groups (MIG) for automatic scaling and self-healing:

```
Managed Instance Group
├── Min instances: 1
├── Max instances: N (scales with demand)
├── Machine type: e2-standard-4 (4 vCPU, 16 GB RAM)
├── Boot disk: 50 GB SSD
└── Auto-scaling target: 70% CPU utilization
```

Instance Configuration

Each instance runs:

| Component | Description |
| --- | --- |
| API Server | Express.js handling REST requests |
| Docker Engine | Container execution environment |
| gcsfuse | Cloud Storage mount for workspaces |

Auto-Healing

If an instance becomes unhealthy:

  1. Health check fails 3 consecutive times
  2. Instance is marked unhealthy
  3. Traffic is redirected to healthy instances
  4. MIG automatically recreates the instance
  5. New instance joins the pool once healthy

Auto-healing typically completes within 2-3 minutes, ensuring minimal impact on availability.

Database Layer

Cloud SQL PostgreSQL

We use Cloud SQL PostgreSQL for reliable, scalable data storage:

| Property | Value |
| --- | --- |
| Version | PostgreSQL 15 |
| High Availability | Enabled |
| Automated Backups | Daily |
| Point-in-Time Recovery | 7 days |
| Encryption | AES-256 at rest |

Data Stored

The database stores:

  • User accounts and API keys
  • Thread and message history
  • Environment configurations
  • Agent definitions
  • Billing and usage records
  • Schedule definitions

Connection Pooling

Each API instance maintains a connection pool to the database:

  • Max connections per instance - Capped so pooled connections can't exhaust the database's connection limit
  • Connection timeout - Prevents hung connections
  • Automatic reconnection - Handles transient failures
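
As a sketch, a pool built with the node-postgres (pg) library might look like this; the library choice, table name, and limits are illustrative, not our production values:

```ts
import { Pool } from "pg";

// Illustrative pool settings; production values are tuned per instance.
const pool = new Pool({
  host: process.env.DB_HOST,
  database: "computer_agents",     // placeholder database name
  max: 10,                         // max connections per instance
  connectionTimeoutMillis: 5_000,  // fail fast instead of hanging
  idleTimeoutMillis: 30_000,       // recycle idle connections
});

// pg opens a fresh connection on the next query after a transient
// failure, so callers simply retry at the request level.
export async function getThread(id: string) {
  // "threads" table is assumed from the data list above
  const { rows } = await pool.query("SELECT * FROM threads WHERE id = $1", [id]);
  return rows[0];
}
```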

Storage Layer

Google Cloud Storage

All workspace files are stored in Cloud Storage:

```
Storage Structure
├── workspaces/
│   └── {environmentId}/
│       ├── src/
│       ├── package.json
│       └── ...
└── sessions/
    └── {threadId}/
        └── artifacts/
```

Storage Features

| Feature | Benefit |
| --- | --- |
| Multi-region replication | High durability (99.999999999%, i.e. eleven 9s) |
| Versioning | Recover from accidental changes |
| Encryption | AES-256 at rest |
| Access control | Per-environment isolation |

gcsfuse Integration

Storage is mounted directly to compute instances via gcsfuse:

  • Read-after-write consistency - Changes visible immediately
  • Parallel access - Multiple instances can access the same workspace
  • Automatic sync - No manual file transfer needed
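
Because the bucket appears as a regular filesystem, workspace access from an instance is plain file I/O. A minimal sketch, assuming a mount point of /mnt/workspaces (the actual path is deployment-specific):

```ts
import { promises as fs } from "node:fs";
import path from "node:path";

// Assumed gcsfuse mount point; the real path is deployment-specific.
const WORKSPACES = "/mnt/workspaces";

// Reads pass through the gcsfuse mount to Cloud Storage; a write from
// any instance is visible here immediately afterwards thanks to
// read-after-write consistency.
export async function readWorkspaceFile(environmentId: string, relPath: string) {
  return fs.readFile(path.join(WORKSPACES, environmentId, relPath), "utf8");
}
```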

Container Execution

Docker Runtime

Each task executes in an isolated Docker container:

```
Container Configuration
├── Base image: Custom with Node.js, Python, Codex CLI
├── Resource limits: CPU and memory caps
├── Network: Isolated per container
├── Volumes: Workspace mounted from GCS
└── Cleanup: Automatic after execution
```
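
A hedged sketch of this configuration using the dockerode client; the image name, resource limits, and mount path are illustrative assumptions:

```ts
import Docker from "dockerode";

const docker = new Docker(); // defaults to /var/run/docker.sock

// Illustrative container spec; the real image, limits, and paths differ.
export async function runTask(environmentId: string, cmd: string[]) {
  const container = await docker.createContainer({
    Image: "computer-agents/runtime:latest", // placeholder image name
    Cmd: cmd,
    HostConfig: {
      Memory: 2 * 1024 ** 3,    // 2 GiB memory cap
      NanoCpus: 2_000_000_000,  // 2 vCPUs
      Binds: [`/mnt/workspaces/${environmentId}:/workspace`], // GCS-backed mount
      NetworkMode: "none",      // isolated unless the environment enables egress
    },
  });
  await container.start();
  await container.wait();   // block until the task exits
  await container.remove(); // automatic cleanup after execution
}
```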

Container Pool

We maintain warm containers for faster execution:

| Path | Latency |
| --- | --- |
| Warm container | ~100-500 ms startup |
| Cold start | ~3-5 seconds |

Warm containers are kept alive for 15 minutes after last use, then automatically cleaned up.
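
A minimal sketch of the pooling logic, assuming a pool keyed by environment ID (only the 15-minute TTL comes from above):

```ts
// Reuse a container per environment if it was used within the last
// 15 minutes; otherwise fall through to a cold start.
const TTL_MS = 15 * 60 * 1000;
const warm = new Map<string, { containerId: string; lastUsed: number }>();

export function acquire(environmentId: string): string | undefined {
  const entry = warm.get(environmentId);
  if (entry && Date.now() - entry.lastUsed < TTL_MS) {
    entry.lastUsed = Date.now();
    return entry.containerId; // warm path: ~100-500 ms
  }
  warm.delete(environmentId);
  return undefined;           // cold path: ~3-5 s to start a new container
}

// Periodic sweep removes containers idle past the TTL.
setInterval(() => {
  const now = Date.now();
  for (const [env, entry] of warm) {
    if (now - entry.lastUsed >= TTL_MS) warm.delete(env);
  }
}, 60_000);
```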

Execution Flow

  1. Request arrives at API server
  2. Container pool checked for warm container
  3. If cold: start new container from image
  4. Mount workspace from Cloud Storage
  5. Execute task via Codex SDK
  6. Stream results back to client
  7. Update container pool state
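
Putting the pieces together, a request handler might orchestrate these steps roughly as follows; the route, the SSE streaming format, and the helper functions (declared but not implemented here) are all hypothetical:

```ts
import express from "express";

// Hypothetical helpers corresponding to the sketches in earlier sections.
declare function acquire(envId: string): string | undefined;
declare function coldStart(envId: string): Promise<string>;
declare function executeTask(id: string, task: unknown): AsyncIterable<unknown>;
declare function release(envId: string, containerId: string): void;

const app = express();
app.use(express.json());

app.post("/v1/tasks", async (req, res) => {
  const { environmentId, task } = req.body;

  // Steps 2-3: reuse a warm container when possible, else cold-start one.
  const containerId = acquire(environmentId) ?? (await coldStart(environmentId));

  // Step 4 is implicit: the container's bind mount points at the gcsfuse path.
  // Steps 5-6: execute the task and stream results as they arrive.
  res.setHeader("Content-Type", "text/event-stream");
  for await (const event of executeTask(containerId, task)) {
    res.write(`data: ${JSON.stringify(event)}\n\n`);
  }
  res.end();

  // Step 7: mark the container warm for the pool's TTL sweep.
  release(environmentId, containerId);
});
```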

Network Architecture

External Access

```
Internet
   ↓
Global Load Balancer (HTTPS, port 443)
   ↓
Backend Service
   ↓
Instance Group (HTTP, port 8080)
```

Internal Communication

  • API servers → PostgreSQL: Private network
  • API servers → Cloud Storage: Google internal network
  • Container → Internet: Optional per environment

Firewall Rules

| Rule | Source | Destination | Ports |
| --- | --- | --- | --- |
| Load balancer | Google IPs | Instances | 8080 |
| Health check | Google IPs | Instances | 8080 |
| SSH (admin) | Authorized IPs | Instances | 22 |

Monitoring & Observability

Metrics Collected

  • Request latency (P50, P95, P99)
  • Error rates by endpoint
  • CPU and memory utilization
  • Container startup times
  • Database query performance
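
As a sketch, per-request latency can be captured with simple Express middleware and summarized into P50/P95/P99 downstream; the metric name and log-based transport are assumptions:

```ts
import type { Request, Response, NextFunction } from "express";

// Records wall-clock latency per request; P50/P95/P99 are computed
// from these samples by the monitoring pipeline.
export function latencyMiddleware(req: Request, res: Response, next: NextFunction) {
  const start = process.hrtime.bigint();
  res.once("finish", () => {
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(JSON.stringify({ metric: "request_latency_ms", path: req.path, ms }));
  });
  next();
}
```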

Logging

All logs are shipped to Cloud Logging:

  • Structured JSON format
  • Request tracing across services
  • 30-day retention
  • Query and alerting capabilities
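
For example, emitting one JSON object per line is enough for Cloud Logging to index fields for querying and alerting; the field names and values here are illustrative:

```ts
// One structured JSON object per line; Cloud Logging parses the fields
// so they can be queried and used in alert conditions.
function logEvent(message: string, fields: Record<string, unknown>) {
  console.log(JSON.stringify({
    severity: "INFO",
    message,
    timestamp: new Date().toISOString(),
    ...fields,
  }));
}

logEvent("task completed", { threadId: "thr_123", latencyMs: 1240 });
```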

Alerting

Automatic alerts for:

  • Error rate > 5% for 5 minutes
  • Latency P99 > 30 seconds
  • Instance count at max capacity
  • Database connection failures

Disaster Recovery

Backup Strategy

| Component | Backup Frequency | Retention |
| --- | --- | --- |
| Database | Daily + continuous WAL | 7 days |
| Cloud Storage | Versioning enabled | 30 days |
| Configuration | Infrastructure as Code | Git history |

Recovery Procedures

| Scenario | Recovery Time |
| --- | --- |
| Instance failure | ~2-3 minutes (auto-healing) |
| Zone outage | ~5 minutes (traffic rerouting) |
| Database failover | ~60 seconds (HA automatic) |
| Full region recovery | ~30 minutes (manual) |

Infrastructure Summary

| Component | Technology | Purpose |
| --- | --- | --- |
| Edge | Global HTTPS LB | Traffic management |
| Compute | GCE MIG | API and execution |
| Database | Cloud SQL PostgreSQL | Persistent storage |
| Files | Cloud Storage | Workspace storage |
| Containers | Docker | Task isolation |
| Monitoring | Cloud Monitoring | Observability |