Infrastructure

Computer Agents runs on Google Cloud Platform with production-grade infrastructure designed for reliability, performance, and security.

Global Load Balancing

All traffic enters through a global HTTPS load balancer that provides:

SSL/TLS Termination

  • Managed certificates - Automatic renewal via Google-managed SSL
  • TLS 1.3 - Latest encryption standards
  • HTTP/2 - Efficient multiplexed connections

Health Checking

```
Health Check Configuration
├── Protocol: HTTP
├── Path: /health
├── Interval: 10 seconds
├── Timeout: 5 seconds
├── Healthy threshold: 2 consecutive successes
└── Unhealthy threshold: 3 consecutive failures
```

Unhealthy instances are automatically removed from the load balancer pool until they recover.
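
For illustration, a minimal sketch of such a health endpoint on the API server (Express.js, per the instance configuration below); what the check verifies beyond "the process is serving" is an assumption:

```ts
import express from "express";

const app = express();

// The load balancer probes this path every 10 seconds and expects an
// HTTP 200 within the 5-second timeout.
app.get("/health", (_req, res) => {
  res.status(200).json({ status: "ok" });
});

app.listen(8080); // instances serve HTTP on port 8080
```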

Traffic Distribution

  • Round-robin load balancing across healthy instances
  • Connection draining - Graceful handling during deployments
  • DDoS protection - Built-in protection at the edge

Compute Layer

Managed Instance Groups

Our compute layer uses Google Cloud Managed Instance Groups (MIG) for automatic scaling and self-healing:

```
Managed Instance Group
├── Min instances: 1
├── Max instances: N (scales with demand)
├── Machine type: e2-standard-4 (4 vCPU, 16 GB RAM)
├── Boot disk: 50 GB SSD
└── Auto-scaling target: 70% CPU utilization
```

Instance Configuration

Each instance runs:

| Component | Description |
| --- | --- |
| API Server | Express.js handling REST requests |
| Docker Engine | Container execution environment |
| gcsfuse | Cloud Storage mount for workspaces |

Auto-Healing

If an instance becomes unhealthy:

  1. Health check fails 3 consecutive times
  2. Instance is marked unhealthy
  3. Traffic is redirected to healthy instances
  4. MIG automatically recreates the instance
  5. New instance joins the pool once healthy

Auto-healing typically completes within 2-3 minutes, ensuring minimal impact on availability.

Database Layer

Cloud SQL PostgreSQL

We use Cloud SQL PostgreSQL for reliable, scalable data storage:

| Property | Value |
| --- | --- |
| Version | PostgreSQL 15 |
| High Availability | Enabled |
| Automated Backups | Daily |
| Point-in-Time Recovery | 7 days |
| Encryption | AES-256 at rest |

Data Stored

The database stores:

  • User accounts and API keys
  • Thread and message history
  • Environment configurations
  • Agent definitions
  • Billing and usage records
  • Schedule definitions

Connection Pooling

Each API instance maintains a connection pool to the database:

  • Max connections per instance - Capped so pooled connections can't exhaust the database's connection limit
  • Connection timeout - Prevents hung connections
  • Automatic reconnection - Handles transient failures
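
As a sketch, a pool built with the node-postgres (pg) library might look like this; the library choice, table name, and limits are illustrative, not our production values:

```ts
import { Pool } from "pg";

// Illustrative pool settings; production values are tuned per instance.
const pool = new Pool({
  host: process.env.DB_HOST,
  database: "computer_agents",     // placeholder database name
  max: 10,                         // max connections per instance
  connectionTimeoutMillis: 5_000,  // fail fast instead of hanging
  idleTimeoutMillis: 30_000,       // recycle idle connections
});

// pg opens a fresh connection on the next query after a transient
// failure, so callers simply retry at the request level.
export async function getThread(id: string) {
  // "threads" table is assumed from the data list above
  const { rows } = await pool.query("SELECT * FROM threads WHERE id = $1", [id]);
  return rows[0];
}
```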

Storage Layer

Google Cloud Storage

All workspace files are stored in Cloud Storage:

```
Storage Structure
├── workspaces/
│   └── {environmentId}/
│       ├── src/
│       ├── package.json
│       └── ...
└── sessions/
    └── {threadId}/
        └── artifacts/
```

Storage Features

| Feature | Benefit |
| --- | --- |
| Multi-region replication | High durability (99.999999999%, i.e. eleven 9s) |
| Versioning | Recover from accidental changes |
| Encryption | AES-256 at rest |
| Access control | Per-environment isolation |

gcsfuse Integration

Storage is mounted directly to compute instances via gcsfuse:

  • Read-after-write consistency - Changes visible immediately
  • Parallel access - Multiple instances can access the same workspace
  • Automatic sync - No manual file transfer needed
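
Because the bucket appears as a regular filesystem, workspace access from an instance is plain file I/O. A minimal sketch, assuming a mount point of /mnt/workspaces (the actual path is deployment-specific):

```ts
import { promises as fs } from "node:fs";
import path from "node:path";

// Assumed gcsfuse mount point; the real path is deployment-specific.
const WORKSPACES = "/mnt/workspaces";

// Reads pass through the gcsfuse mount to Cloud Storage; a write from
// any instance is visible here immediately afterwards thanks to
// read-after-write consistency.
export async function readWorkspaceFile(environmentId: string, relPath: string) {
  return fs.readFile(path.join(WORKSPACES, environmentId, relPath), "utf8");
}
```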

Container Execution

Docker Runtime

Each task executes in an isolated Docker container:

```
Container Configuration
├── Base image: Custom with Node.js, Python, Codex CLI
├── Resource limits: CPU and memory caps
├── Network: Isolated per container
├── Volumes: Workspace mounted from GCS
└── Cleanup: Automatic after execution
```
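
A hedged sketch of this configuration using the dockerode client; the image name, resource limits, and mount path are illustrative assumptions:

```ts
import Docker from "dockerode";

const docker = new Docker(); // defaults to /var/run/docker.sock

// Illustrative container spec; the real image, limits, and paths differ.
export async function runTask(environmentId: string, cmd: string[]) {
  const container = await docker.createContainer({
    Image: "computer-agents/runtime:latest", // placeholder image name
    Cmd: cmd,
    HostConfig: {
      Memory: 2 * 1024 ** 3,    // 2 GiB memory cap
      NanoCpus: 2_000_000_000,  // 2 vCPUs
      Binds: [`/mnt/workspaces/${environmentId}:/workspace`], // GCS-backed mount
      NetworkMode: "none",      // isolated unless the environment enables egress
    },
  });
  await container.start();
  await container.wait();   // block until the task exits
  await container.remove(); // automatic cleanup after execution
}
```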

Container Pool

We maintain warm containers for faster execution:

| Path | Latency |
| --- | --- |
| Warm container | ~100-500 ms startup |
| Cold start | ~3-5 seconds |

Warm containers are kept alive for 15 minutes after last use, then automatically cleaned up.
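
A minimal sketch of the pooling logic, assuming a pool keyed by environment ID (only the 15-minute TTL comes from above):

```ts
// Reuse a container per environment if it was used within the last
// 15 minutes; otherwise fall through to a cold start.
const TTL_MS = 15 * 60 * 1000;
const warm = new Map<string, { containerId: string; lastUsed: number }>();

export function acquire(environmentId: string): string | undefined {
  const entry = warm.get(environmentId);
  if (entry && Date.now() - entry.lastUsed < TTL_MS) {
    entry.lastUsed = Date.now();
    return entry.containerId; // warm path: ~100-500 ms
  }
  warm.delete(environmentId);
  return undefined;           // cold path: ~3-5 s to start a new container
}

// Periodic sweep removes containers idle past the TTL.
setInterval(() => {
  const now = Date.now();
  for (const [env, entry] of warm) {
    if (now - entry.lastUsed >= TTL_MS) warm.delete(env);
  }
}, 60_000);
```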

Execution Flow

  1. Request arrives at API server
  2. Container pool checked for warm container
  3. If cold: start new container from image
  4. Mount workspace from Cloud Storage
  5. Execute task via Codex SDK
  6. Stream results back to client
  7. Update container pool state
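
Putting the pieces together, a request handler might orchestrate these steps roughly as follows; the route, the SSE streaming format, and the helper functions (declared but not implemented here) are all hypothetical:

```ts
import express from "express";

// Hypothetical helpers corresponding to the sketches in earlier sections.
declare function acquire(envId: string): string | undefined;
declare function coldStart(envId: string): Promise<string>;
declare function executeTask(id: string, task: unknown): AsyncIterable<unknown>;
declare function release(envId: string, containerId: string): void;

const app = express();
app.use(express.json());

app.post("/v1/tasks", async (req, res) => {
  const { environmentId, task } = req.body;

  // Steps 2-3: reuse a warm container when possible, else cold-start one.
  const containerId = acquire(environmentId) ?? (await coldStart(environmentId));

  // Step 4 is implicit: the container's bind mount points at the gcsfuse path.
  // Steps 5-6: execute the task and stream results as they arrive.
  res.setHeader("Content-Type", "text/event-stream");
  for await (const event of executeTask(containerId, task)) {
    res.write(`data: ${JSON.stringify(event)}\n\n`);
  }
  res.end();

  // Step 7: mark the container warm for the pool's TTL sweep.
  release(environmentId, containerId);
});
```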

Network Architecture

External Access

```
Internet
   ↓
Global Load Balancer (HTTPS, port 443)
   ↓
Backend Service
   ↓
Instance Group (HTTP, port 8080)
```

Internal Communication

  • API servers → PostgreSQL: Private network
  • API servers → Cloud Storage: Google internal network
  • Container → Internet: Optional per environment

Firewall Rules

| Rule | Source | Destination | Ports |
| --- | --- | --- | --- |
| Load balancer | Google IPs | Instances | 8080 |
| Health check | Google IPs | Instances | 8080 |
| SSH (admin) | Authorized IPs | Instances | 22 |

Monitoring & Observability

Metrics Collected

  • Request latency (P50, P95, P99)
  • Error rates by endpoint
  • CPU and memory utilization
  • Container startup times
  • Database query performance
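
As a sketch, per-request latency can be captured with simple Express middleware and summarized into P50/P95/P99 downstream; the metric name and log-based transport are assumptions:

```ts
import type { Request, Response, NextFunction } from "express";

// Records wall-clock latency per request; P50/P95/P99 are computed
// from these samples by the monitoring pipeline.
export function latencyMiddleware(req: Request, res: Response, next: NextFunction) {
  const start = process.hrtime.bigint();
  res.once("finish", () => {
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(JSON.stringify({ metric: "request_latency_ms", path: req.path, ms }));
  });
  next();
}
```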

Logging

All logs are shipped to Cloud Logging:

  • Structured JSON format
  • Request tracing across services
  • 30-day retention
  • Query and alerting capabilities
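
For example, emitting one JSON object per line is enough for Cloud Logging to index fields for querying and alerting; the field names and values here are illustrative:

```ts
// One structured JSON object per line; Cloud Logging parses the fields
// so they can be queried and used in alert conditions.
function logEvent(message: string, fields: Record<string, unknown>) {
  console.log(JSON.stringify({
    severity: "INFO",
    message,
    timestamp: new Date().toISOString(),
    ...fields,
  }));
}

logEvent("task completed", { threadId: "thr_123", latencyMs: 1240 });
```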

Alerting

Automatic alerts for:

  • Error rate > 5% for 5 minutes
  • Latency P99 > 30 seconds
  • Instance count at max capacity
  • Database connection failures

Disaster Recovery

Backup Strategy

| Component | Backup Frequency | Retention |
| --- | --- | --- |
| Database | Daily + continuous WAL | 7 days |
| Cloud Storage | Versioning enabled | 30 days |
| Configuration | Infrastructure as Code | Git history |

Recovery Procedures

| Scenario | Recovery Time |
| --- | --- |
| Instance failure | ~2-3 minutes (auto-healing) |
| Zone outage | ~5 minutes (traffic rerouting) |
| Database failover | ~60 seconds (HA automatic) |
| Full region recovery | ~30 minutes (manual) |

Infrastructure Summary

| Component | Technology | Purpose |
| --- | --- | --- |
| Edge | Global HTTPS LB | Traffic management |
| Compute | GCE MIG | API and execution |
| Database | Cloud SQL PostgreSQL | Persistent storage |
| Files | Cloud Storage | Workspace storage |
| Containers | Docker | Task isolation |
| Monitoring | Cloud Monitoring | Observability |