Infrastructure
Computer Agents runs on Google Cloud Platform with a production-grade infrastructure designed for reliability, performance, and security.
Global Load Balancing
All traffic enters through a global HTTPS load balancer that provides:
SSL/TLS Termination
- Managed certificates - Automatic renewal via Google-managed SSL
- TLS 1.3 - Latest encryption standards
- HTTP/2 - Efficient multiplexed connections
Health Checking
Health Check Configuration:
├── Protocol: HTTP
├── Path: /health
├── Interval: 10 seconds
├── Timeout: 5 seconds
├── Healthy threshold: 2 consecutive successes
└── Unhealthy threshold: 3 consecutive failures
Unhealthy instances are automatically removed from the load balancer pool until they recover.
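Since the API server is Express.js, the /health endpoint the load balancer probes could be a minimal handler like the sketch below; the response body is illustrative, and a production check might also verify database and storage connectivity:

```typescript
import express from "express";

const app = express();

// Minimal health endpoint matching the load balancer's configured path.
// A production version might also probe the database and storage mounts.
app.get("/health", (_req, res) => {
  res.status(200).json({ status: "ok" });
});

// Instances serve HTTP on port 8080 behind the load balancer.
app.listen(8080);
```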
Traffic Distribution
- Round-robin load balancing across healthy instances
- Connection draining - Graceful handling during deployments
- DDoS protection - Built-in mitigation at the edge
Compute Layer
Managed Instance Groups
Our compute layer uses Google Cloud Managed Instance Groups (MIG) for automatic scaling and self-healing:
Managed Instance Group
├── Min instances: 1
├── Max instances: N (scales with demand)
├── Machine type: e2-standard-4 (4 vCPU, 16 GB RAM)
├── Boot disk: 50 GB SSD
└── Auto-scaling target: 70% CPU utilization
Instance Configuration
Each instance runs:
| Component | Description |
|---|---|
| API Server | Express.js handling REST requests |
| Docker Engine | Container execution environment |
| gcsfuse | Cloud Storage mount for workspaces |
Auto-Healing
If an instance becomes unhealthy:
1. Health check fails 3 consecutive times
2. Instance is marked unhealthy
3. Traffic is redirected to healthy instances
4. MIG automatically recreates the instance
5. New instance joins the pool once healthy
Auto-healing typically completes within 2-3 minutes, ensuring minimal impact on availability.
Database Layer
Cloud SQL PostgreSQL
We use Cloud SQL PostgreSQL for reliable, scalable data storage:
| Property | Value |
|---|---|
| Version | PostgreSQL 15 |
| High Availability | Enabled |
| Automated Backups | Daily |
| Point-in-Time Recovery | 7 days |
| Encryption | AES-256 at rest |
Data Stored
The database stores:
- User accounts and API keys
- Thread and message history
- Environment configurations
- Agent definitions
- Billing and usage records
- Schedule definitions
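As an illustrative sketch only, a few of these records might be modeled as TypeScript types; the names and fields below are assumptions, not the actual schema:

```typescript
// Hypothetical record shapes for illustration; not the production schema.
interface Thread {
  id: string;
  environmentId: string;
  createdAt: Date;
}

interface Message {
  id: string;
  threadId: string;
  role: "user" | "agent";
  content: string;
  createdAt: Date;
}

interface Schedule {
  id: string;
  agentId: string;
  cron: string; // e.g. "0 9 * * 1" would run Mondays at 09:00
  enabled: boolean;
}
```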
Connection Pooling
Each API instance maintains a connection pool to the database:
- Max connections per instance - Sized so the total across all instances stays within the database's connection limit
- Connection timeout - Prevents hung connections
- Automatic reconnection - Handles transient failures
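A sketch of such a pool using node-postgres is shown below; the numeric limits are placeholders, not our production values:

```typescript
import { Pool } from "pg";

// Placeholder limits for illustration; production values are tuned per instance.
const pool = new Pool({
  host: process.env.DB_HOST,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  max: 10,                        // max connections held by this instance
  connectionTimeoutMillis: 5_000, // fail fast instead of leaving requests hung
  idleTimeoutMillis: 30_000,      // recycle idle connections
});

// Surface pool-level errors so transient failures are logged rather than
// crashing the process; pg creates replacement connections as needed.
pool.on("error", (err) => console.error("idle client error", err));
```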
Storage Layer
Google Cloud Storage
All workspace files are stored in Cloud Storage:
Storage Structure
├── workspaces/
│ └── {environmentId}/
│ ├── src/
│ ├── package.json
│ └── ...
└── sessions/
└── {threadId}/
└── artifacts/
Storage Features
| Feature | Benefit |
|---|---|
| Multi-region replication | High durability (11 9’s) |
| Versioning | Recover from accidental changes |
| Encryption | AES-256 at rest |
| Access control | Per-environment isolation |
gcsfuse Integration
Storage is mounted directly to compute instances via gcsfuse:
- Read-after-write consistency - Changes visible immediately
- Parallel access - Multiple instances can access the same workspace
- Automatic sync - No manual file transfer needed
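Because gcsfuse presents the bucket as an ordinary filesystem, workspace files can be read with plain fs calls; the mount path below is a hypothetical example:

```typescript
import { promises as fs } from "node:fs";
import path from "node:path";

// Hypothetical mount point; the actual path is deployment-specific.
const WORKSPACE_MOUNT = "/mnt/workspaces";

// Reads a file from an environment's workspace through the gcsfuse mount.
// Read-after-write consistency means a just-written file is safe to read back.
async function readWorkspaceFile(environmentId: string, relPath: string): Promise<string> {
  return fs.readFile(path.join(WORKSPACE_MOUNT, environmentId, relPath), "utf8");
}
```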
Container Execution
Docker Runtime
Each task executes in an isolated Docker container:
Container Configuration
├── Base image: Custom with Node.js, Python, Codex CLI
├── Resource limits: CPU and memory caps
├── Network: Isolated per container
├── Volumes: Workspace mounted from GCS
└── Cleanup: Automatic after execution
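As a sketch of what this configuration looks like programmatically, the snippet below uses the dockerode client; the image name, resource caps, and mount path are assumptions rather than our actual values:

```typescript
import Docker from "dockerode";

const docker = new Docker();

// Illustrative container launch; image name, limits, and paths are assumptions.
async function startTaskContainer(environmentId: string) {
  const container = await docker.createContainer({
    Image: "computer-agents/runtime:latest", // hypothetical custom base image
    HostConfig: {
      Memory: 2 * 1024 ** 3,    // memory cap (2 GB here, for illustration)
      NanoCpus: 2_000_000_000,  // CPU cap (2 vCPUs here, for illustration)
      NetworkMode: "none",      // no network by default; internet is optional per environment
      Binds: [`/mnt/workspaces/${environmentId}:/workspace`], // GCS-backed workspace
      AutoRemove: true,         // automatic cleanup after execution
    },
  });
  await container.start();
  return container;
}
```
Container Pool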
We maintain warm containers for faster execution:
| Startup path | Latency |
|---|---|
| Warm container | ~100-500 ms |
| Cold start | ~3-5 seconds |
Warm containers are kept alive for 15 minutes after last use, then automatically cleaned up.
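A minimal sketch of the warm-pool bookkeeping, assuming one container per environment and the 15-minute TTL above; the real implementation also handles concurrency and capacity limits:

```typescript
const WARM_TTL_MS = 15 * 60 * 1000; // 15 minutes after last use

interface PooledContainer {
  container: { stop(): Promise<void> }; // e.g. a dockerode Container handle
  lastUsed: number;
}

const warmPool = new Map<string, PooledContainer>();

// Returns a warm container (the ~100-500 ms path) or null, signalling a cold start.
function acquire(environmentId: string): PooledContainer["container"] | null {
  const entry = warmPool.get(environmentId);
  if (!entry) return null; // cold start: ~3-5 s to boot a fresh container
  entry.lastUsed = Date.now();
  return entry.container;
}

// Periodically evict containers idle past the TTL.
setInterval(() => {
  const now = Date.now();
  for (const [id, entry] of warmPool) {
    if (now - entry.lastUsed > WARM_TTL_MS) {
      warmPool.delete(id);
      void entry.container.stop();
    }
  }
}, 60_000);
```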
Execution Flow
1. Request arrives at API server
2. Container pool checked for warm container
3. If cold: start new container from image
4. Mount workspace from Cloud Storage
5. Execute task via Codex SDK
6. Stream results back to client
7. Update container pool state
Network Architecture
External Access
Internet
│
▼
Global Load Balancer (HTTPS, port 443)
│
▼
Backend Service
│
▼
Instance Group (HTTP, port 8080)
Internal Communication
- API servers → PostgreSQL: Private network
- API servers → Cloud Storage: Google internal network
- Container → Internet: Optional per environment
Firewall Rules
| Rule | Source | Destination | Ports |
|---|---|---|---|
| Load balancer | Google IPs | Instances | 8080 |
| Health check | Google IPs | Instances | 8080 |
| SSH (admin) | Authorized IPs | Instances | 22 |
Monitoring & Observability
Metrics Collected
- Request latency (P50, P95, P99)
- Error rates by endpoint
- CPU and memory utilization
- Container startup times
- Database query performance
Logging
All logs are shipped to Cloud Logging:
- Structured JSON format
- Request tracing across services
- 30-day retention
- Query and alerting capabilities
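As an illustration of the structured format, a log entry might be emitted as a single JSON line; the field names here are assumptions, and we assume the logging agent is configured to parse JSON output:

```typescript
// Hypothetical structured log helper; field names are illustrative.
function logEvent(entry: {
  severity: "INFO" | "WARNING" | "ERROR";
  message: string;
  traceId?: string;  // enables request tracing across services
  latencyMs?: number;
}) {
  // One JSON object per line, ingested by Cloud Logging as structured fields.
  console.log(JSON.stringify({ timestamp: new Date().toISOString(), ...entry }));
}

logEvent({ severity: "INFO", message: "task completed", traceId: "abc123", latencyMs: 420 });
```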
Alerting
Automatic alerts for:
- Error rate > 5% for 5 minutes
- Latency P99 > 30 seconds
- Instance count at max capacity
- Database connection failures
Disaster Recovery
Backup Strategy
| Component | Backup Frequency | Retention |
|---|---|---|
| Database | Daily + continuous WAL | 7 days |
| Cloud Storage | Versioning enabled | 30 days |
| Configuration | Infrastructure as Code | Git history |
Recovery Procedures
| Scenario | Recovery Time |
|---|---|
| Instance failure | ~2-3 minutes (auto-healing) |
| Zone outage | ~5 minutes (traffic rerouting) |
| Database failover | ~60 seconds (HA automatic) |
| Full region recovery | ~30 minutes (manual) |
Infrastructure Summary
| Component | Technology | Purpose |
|---|---|---|
| Edge | Global HTTPS LB | Traffic management |
| Compute | GCE MIG | API and execution |
| Database | Cloud SQL PostgreSQL | Persistent storage |
| Files | Cloud Storage | Workspace storage |
| Containers | Docker | Task isolation |
| Monitoring | Cloud Monitoring | Observability |