Bedrock Access Gateway – Design Spec

Doc status: Draft v1.1
Owner: Miguel Merlin
Date: Aug 11, 2025
Reviewers: Brandon Yen

1) Overview

Provide Bedrock access to internal IAM users via a controlled gateway that works from a variety of interfaces, ranging from developer IDE extensions (Cline) to a front-facing UI. The system exposes a stable HTTP API (OpenAI‑style), enforces per‑user throttling and monthly token budgets, and includes multiple kill‑switches.

Primary goals

Simple IDE integration via HTTPS + API key
Per‑user TPS throttling and monthly cost limits
Hard and soft kill‑switches
Centralized audit, cost visibility, and guardrails

2) Requirements

2.1 Functional

Users can call the Lambda Function URL from IDE extensions or the UI.
Requests from IDE extensions are authenticated via STS credentials generated from the user's IAM credentials.
Requests from the UI are authenticated via Cognito JWT tokens.
Enforce rate limits (TPS/burst) per user (or per team plan) at the edge.
Enforce monthly cost limits per user; reject after budget exceeded.
Support streaming responses (SSE) for chat/completions.
Allowlist of Bedrock models; requests for any other model are rejected.
  1. The current allowlist includes:
     1. Claude 3 Haiku
     2. Claude 3.5 Sonnet
  2. All other Anthropic models were deemed unnecessary or require provisioned throughput.
Provide usage/remaining quota endpoint.
Emit structured logs and metrics for cost and auditing.

2.2 Operational

Soft kill: feature flag to disable traffic quickly.
Hard kill: SCP denying Bedrock; model access toggle in Bedrock.
Per-user soft kill: activate/deactivate access keys for each user.
Config changes (caps, models, plans) without redeploy.
SLO: 99.9% monthly availability for gateway.
P90 end‑to‑end latency target: ≤ 1.5s for 2‑KB prompts, non‑streaming.

2.3 Security & Compliance

Users cannot directly call Bedrock; only the gateway’s IAM role can.
TLS 1.2+, HSTS, no PII in logs by default; opt‑in redaction allowlist.

3) High‑Level Architecture

IDE/CLI ──HTTPS──> Lambda Function URL ──> Lambda Proxy ──> Bedrock
                                                └─> DynamoDB (usage)
                             

Control plane: DynamoDB (config/limits)
Kill‑switches: Org SCP + Bedrock model access (hard)

Key Components

API Gateway (HTTP/REST): Routing, API Keys, Usage Plans, throttling.
Lambda Authorizer: Validates API key, checks feature flag, reads per‑user caps, pre‑checks remaining daily tokens (estimated).
Lambda Proxy: Validates payload, calls Bedrock, streams results, updates usage counters from actual token usage.
DynamoDB: Token metering per user/day; user profiles; config.
SSM Parameter Store: Feature flags + global toggles.
Organizations SCP: Bedrock deny policy for emergency.
WAF (optional): Rate‑based block/allow rules.

4) Data Model

4.1 DynamoDB Tables

**Transaction Table:**

PK: userId (S)
SK: timestamp (S, YYYY-MM-DDTHH:mm:ss)
Attributes: cost (float), modelId (S), usage (outputTokens (S), inputTokens (N), expireAt (S))

**Monthly Usage Table:**

PK: userArn (S)
SK: month_year (S, MM_YYYY)
Attributes: cost (float), invocations (int)
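As an illustration of the two key schemas, the sketch below builds a Transaction item and a Monthly Usage key. It assumes the attribute names `userId`, `timestamp`, `cost`, `modelId`, `usage`, `userArn`, and `month_year`; the class and function names are hypothetical. `Decimal` is used for `cost` because boto3's DynamoDB serializer does not accept Python floats.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    """One row of the Transaction table (hypothetical helper type)."""
    user_id: str
    timestamp: datetime
    cost: Decimal
    model_id: str
    input_tokens: int
    output_tokens: int

    def to_item(self) -> dict:
        # Sort key uses the YYYY-MM-DDTHH:mm:ss layout from the spec.
        return {
            "userId": self.user_id,
            "timestamp": self.timestamp.strftime("%Y-%m-%dT%H:%M:%S"),
            "cost": self.cost,
            "modelId": self.model_id,
            "usage": {
                "inputTokens": self.input_tokens,
                "outputTokens": self.output_tokens,
            },
        }


def monthly_usage_key(user_arn: str, when: datetime) -> dict:
    """Primary key of the Monthly Usage table: userArn + month_year (MM_YYYY)."""
    return {"userArn": user_arn, "month_year": when.strftime("%m_%Y")}
```

Because the sort key is a fixed-width timestamp string, a range query on `timestamp` with a month prefix would return one user's transactions for that month in chronological order.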

4.2 SSM Parameters

/app/bedrock/enabled = true|false (global soft kill‑switch)

5) API Design

5.1 Authentication

Headers:
- x-aws-session-token: <key> (STS session token)
- x-aws-access-key: <key> (STS access key)
- x-aws-secret-key: <key> (STS secret key)
The inference proxy finds the user ARN based on these credentials.
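A minimal sketch of how the proxy might pull the three STS values out of the request headers. The header names are assumed to be `x-aws-access-key`, `x-aws-secret-key`, and `x-aws-session-token`; `StsCredentials` and `credentials_from_headers` are hypothetical names.

```python
from typing import Mapping, NamedTuple


class StsCredentials(NamedTuple):
    access_key: str
    secret_key: str
    session_token: str


# Assumed header names, mapped to the credential field they carry.
_HEADER_TO_FIELD = {
    "x-aws-access-key": "access_key",
    "x-aws-secret-key": "secret_key",
    "x-aws-session-token": "session_token",
}


def credentials_from_headers(headers: Mapping[str, str]) -> StsCredentials:
    """Extract the STS values; header names are matched case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}
    missing = [h for h in _HEADER_TO_FIELD if not lowered.get(h)]
    if missing:
        raise ValueError("missing headers: " + ", ".join(sorted(missing)))
    return StsCredentials(**{f: lowered[h] for h, f in _HEADER_TO_FIELD.items()})
```

One way to resolve the caller's ARN from these values would be to create an STS client with them and read the `Arn` field of the `GetCallerIdentity` response; whether the gateway does exactly this is not stated in the spec.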

5.2 Endpoints (JSON)

`POST /v1/chat` (Lambda Function URL)

Request:

{
    "modelId": "anthropic.claude-3-haiku-20240307-v1:0",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "text": "<task>"
                },
                {
                    "text": "<system prompt>"
                },
                {
                    "text": "<environment details>"
                }
            ]
        }
    ],
    "system": [
        {
            "text": "<system prompt>"
        }
    ],
    "inferenceConfig": {
        "maxTokens": 4096,
        "temperature": 0
    },
    "additionalModelRequestFields": {}
}
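The request body uses the same top-level field names as the Bedrock Converse API (`modelId`, `messages`, `system`, `inferenceConfig`, `additionalModelRequestFields`), so the proxy could forward it almost verbatim. The sketch below shows one possible mapping; `to_converse_kwargs` is a hypothetical helper, and dropping unknown fields is an assumption rather than documented behavior.

```python
# Top-level fields forwarded to Bedrock; anything else is dropped.
FORWARDED_FIELDS = (
    "modelId",
    "messages",
    "system",
    "inferenceConfig",
    "additionalModelRequestFields",
)


def to_converse_kwargs(body: dict) -> dict:
    """Validate required fields and keep only the Converse-compatible ones."""
    for required in ("modelId", "messages"):
        if required not in body:
            raise ValueError(f"missing required field: {required}")
    return {name: body[name] for name in FORWARDED_FIELDS if name in body}
```

With a boto3 `bedrock-runtime` client, the result could be passed as `client.converse(**to_converse_kwargs(body))`.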

Response (non‑stream):

{
  "id": "chatcmpl_...",
  "model": "...",
  "created": 1723456789,
  "usage": {"input_tokens": 123, "output_tokens": 278},
  "choices": [
    {"index": 0, "message": {"role": "assistant", "content": "..."}, "finish_reason": "stop"}
  ]
}

Response (stream): text/event-stream with data: {"delta": "..."} frames; final frame includes usage.
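A minimal sketch of the streaming framing, assuming each chunk is sent as one `data:` line followed by a blank line (the standard SSE event delimiter) and that usage travels in a final frame of its own. The exact shape of the final frame is an assumption; `sse_frames` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def sse_frames(deltas: Iterable[str], usage: dict) -> Iterator[str]:
    """Yield one SSE frame per text delta, then a final frame carrying usage."""
    for delta in deltas:
        yield "data: " + json.dumps({"delta": delta}) + "\n\n"
    yield "data: " + json.dumps({"usage": usage}) + "\n\n"
```

Sending usage last lets the proxy update its counters from the actual token counts once the model has finished generating.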

`POST /v1/complete`

Simple prompt → completion variant; same usage object.

`GET /v1/usage`

Include the userArn in the header to retrieve the user's monthly usage statistics and monthly limit.

`GET /v1/usage/today`

Response:

{
  "date": "2025-08-12",
  "output_tokens_used": 9217,
  "output_tokens_cap": 50000,
  "remaining": 40783
}
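The `remaining` field is the cap minus the tokens used. A small sketch, assuming the value is floored at zero once the cap is exceeded (`usage_today` is a hypothetical helper):

```python
def usage_today(date: str, used: int, cap: int) -> dict:
    """Build the usage body; remaining never goes negative."""
    return {
        "date": date,
        "output_tokens_used": used,
        "output_tokens_cap": cap,
        "remaining": max(cap - used, 0),
    }
```

For the example above, `usage_today("2025-08-12", 9217, 50000)` gives a `remaining` value of 40783.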

Error Format

{ "error": { "code": "DAILY_CAP_EXCEEDED", "message": "Reached daily token cap." } }

HTTP Codes: 200, 400 (validation), 401/403 (auth/quota), 429 (rate limit), 5xx (upstream/internal).
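One way to keep the error body and the HTTP status in sync is a single lookup table. In the sketch below only `DAILY_CAP_EXCEEDED` comes from this spec; `VALIDATION_ERROR` and `RATE_LIMITED` are hypothetical code names, and the fallback to 500 is an assumption. The return value follows the `statusCode`/`headers`/`body` shape that Lambda Function URLs accept.

```python
import json

# Status per error code; unknown codes fall back to 500.
STATUS_BY_CODE = {
    "VALIDATION_ERROR": 400,    # hypothetical code name
    "DAILY_CAP_EXCEEDED": 403,  # quota errors map to 403 per the list above
    "RATE_LIMITED": 429,        # hypothetical code name
}


def error_response(code: str, message: str) -> dict:
    """Build a Lambda response carrying the gateway's error format."""
    return {
        "statusCode": STATUS_BY_CODE.get(code, 500),
        "headers": {"content-type": "application/json"},
        "body": json.dumps({"error": {"code": code, "message": message}}),
    }
```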

6) Throttling & Quotas

6.1 Edge TPS (API Gateway Usage Plans)

Plans:

Standard: rate=2 RPS, burst=10
Power: rate=5 RPS, burst=20

Attach one or more API keys per plan. Optionally add an API Gateway per-day request quota as a coarse control.
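To make the `rate` and `burst` semantics concrete, here is a minimal token-bucket sketch. It illustrates the general algorithm only and is not API Gateway's implementation; the `TokenBucket` class and its explicit `now` argument are assumptions made so the behavior is deterministic.

```python
class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `burst` tokens."""

    def __init__(self, rate: float, burst: int, now: float = 0.0):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.updated = now

    def allow(self, now: float) -> bool:
        # Credit tokens for the elapsed time, capped at the bucket size.
        elapsed = max(now - self.updated, 0.0)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With the Standard plan values (`rate=2`, `burst=10`), ten back-to-back requests pass, the eleventh is rejected, and one more request is admitted after half a second.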

6.2 Token Budgets (Authorizer + Proxy)

Pre‑check (Authorizer): estimate estOut = min(request.max_tokens, user.maxTokensPerCall); if used + estOut > dailyCap, deny.
Post‑accounting (Proxy): parse Bedrock usage and ADD to BedrockTokenUsage atomically:
- SET outputTokensTotal = if_not_exists(outputTokensTotal, :zero) + :out
- SET inputTokensTotal = if_not_exists(inputTokensTotal, :zero) + :in
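The pre-check and the atomic counter update can be sketched as follows. Note that a DynamoDB `UpdateExpression` may contain the `SET` keyword only once, so the two assignments above are joined with a comma. The table name `BedrockTokenUsage` comes from this section; the key attributes `userId` and `date` and both function names are assumptions.

```python
def precheck(requested_max_tokens: int, user_max_per_call: int,
             used_today: int, daily_cap: int) -> bool:
    """Return True if the request may proceed under the daily cap."""
    est_out = min(requested_max_tokens, user_max_per_call)
    return used_today + est_out <= daily_cap


def usage_update(user_id: str, date: str, input_tokens: int, output_tokens: int) -> dict:
    """Arguments for a low-level DynamoDB UpdateItem call."""
    return {
        "TableName": "BedrockTokenUsage",
        "Key": {"userId": {"S": user_id}, "date": {"S": date}},  # assumed key schema
        "UpdateExpression": (
            "SET outputTokensTotal = if_not_exists(outputTokensTotal, :zero) + :out, "
            "inputTokensTotal = if_not_exists(inputTokensTotal, :zero) + :in"
        ),
        "ExpressionAttributeValues": {
            ":zero": {"N": "0"},
            ":out": {"N": str(output_tokens)},
            ":in": {"N": str(input_tokens)},
        },
    }
```

The dict could be passed to a boto3 DynamoDB client as `client.update_item(**usage_update(...))`. Because the increment happens inside DynamoDB, concurrent requests from the same user do not overwrite each other's counts.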

6.3 Kill‑switches

Soft: /app/bedrock/enabled=false ⇒ Authorizer denies all.
Hard: Org SCP denying bedrock:*; or revoke model access in Bedrock.
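A sketch of the soft kill-switch check. It assumes the gateway should fail closed, meaning that any value other than the literal `true`, or any error while reading the parameter, disables traffic; that policy is a design assumption, and `gateway_enabled` is a hypothetical helper. The parameter reader is injected so the check does not depend on a specific client.

```python
from typing import Callable


def gateway_enabled(read_parameter: Callable[[str], str],
                    name: str = "/app/bedrock/enabled") -> bool:
    """Return True only when the flag is readable and equals 'true'."""
    try:
        return read_parameter(name).strip().lower() == "true"
    except Exception:
        # Fail closed: an unreadable flag is treated as disabled.
        return False
```

With boto3, the reader could be `lambda n: ssm.get_parameter(Name=n)["Parameter"]["Value"]`. Caching the value for a few seconds would reduce SSM calls at the cost of a slightly slower shutdown.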

7) Request Flow (Sequence)

IDE sends request with x-api-key → API Gateway.
API Gateway authenticates API key and enforces plan TPS/burst.
Lambda Authorizer executes:
- Check SSM flag enabled; check GatewayUsers.status.
- Load user profile & today's usage; pre‑check against cap.
- Return Allow with context {userId, planId} or Deny.
Lambda Proxy validates payload, ensures model is in allowlist, clamps max_tokens.
Call Bedrock Converse/Invoke; stream or buffer output back to client.
On completion, parse usage and update BedrockTokenUsage.
Emit metrics and structured logs; return final body.

8) IAM & Network

8.1 Execution Role (Proxy Lambda)

Permissions:

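bedrock:InvokeModel* (scoped to allowed models); Converse and ConverseStream are authorized through bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream, so no separate Converse action is needed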
dynamodb:GetItem/UpdateItem on both tables
ssm:GetParameter on /app/bedrock/enabled
CloudWatch Logs

8.2 End‑User Principals

Explicit Deny policy for bedrock:* to prevent bypass.
Allow invoke of API Gateway only via API Keys (no SigV4 from clients).
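The two IAM statements described in 8.1 and 8.2 might look like the sketch below: an allow scoped to the allowlisted foundation-model ARNs for the proxy role, and an explicit deny for end users. The helper names and the region are hypothetical.

```python
import json


def execution_role_statement(region: str, model_ids: list) -> dict:
    """Allow invocation of only the allowlisted foundation models."""
    return {
        "Effect": "Allow",
        "Action": ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
        "Resource": [
            f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
            for model_id in model_ids
        ],
    }


def end_user_deny_statement() -> dict:
    """Explicit deny so end users cannot call Bedrock directly."""
    return {"Effect": "Deny", "Action": "bedrock:*", "Resource": "*"}


def policy_document(*statements: dict) -> str:
    """Wrap statements in a standard IAM policy document."""
    return json.dumps({"Version": "2012-10-17", "Statement": list(statements)}, indent=2)
```

An explicit deny overrides any allow attached to the same principal, which is what makes the end-user statement an effective bypass guard.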

8.3 Network

Private VPC Endpoint to Bedrock (where available); add IAM condition aws:SourceVpce on the Lambda role.

9) Validation & Limits

Payload size: limit prompt/messages total ≤ 64 KB (configurable).
**Per‑call** `max_tokens`: clamp to user or global max (e.g., 1024).
Allowed models: configured in GatewayConfig; reject others.
Timeouts: Lambda 30s–60s; API GW 29s for non‑stream; for streaming use integration with Lambda function URLs or REST API w/ chunked responses.
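The size, allowlist, and clamping rules above could be combined into one validation step. This is a sketch under assumptions: the request uses `modelId` and `inferenceConfig.maxTokens`, size is measured on the JSON-encoded `messages`, and `PAYLOAD_TOO_LARGE` is a hypothetical error code (`MODEL_NOT_ALLOWED` appears in section 10.1).

```python
import json


def validate_request(body: dict, allowed_models: set,
                     max_bytes: int = 64 * 1024, max_tokens_cap: int = 1024) -> dict:
    """Reject oversized or non-allowlisted requests and clamp maxTokens."""
    size = len(json.dumps(body.get("messages", [])).encode("utf-8"))
    if size > max_bytes:
        raise ValueError("PAYLOAD_TOO_LARGE")
    if body.get("modelId") not in allowed_models:
        raise ValueError("MODEL_NOT_ALLOWED")
    config = dict(body.get("inferenceConfig", {}))
    # A missing maxTokens defaults to the cap; larger values are clamped down.
    config["maxTokens"] = min(int(config.get("maxTokens", max_tokens_cap)), max_tokens_cap)
    return {**body, "inferenceConfig": config}
```

The function returns a new dict rather than mutating the input, so the original request stays available for logging.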

10) Observability

10.1 Metrics (CloudWatch, with EMF)

Requests: count, 4xx, 5xx, latency p50/p90/p99
Bedrock call duration and errors by model
Tokens: input/output per user/day; top N users
Rejections: DAILY_CAP_EXCEEDED, MODEL_NOT_ALLOWED, DISABLED_FLAG
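With the CloudWatch Embedded Metric Format, a metric is published by writing a JSON log line that carries an `_aws` metadata block. A minimal sketch follows; the namespace `BedrockGateway`, the metric name, and the dimension are hypothetical choices.

```python
import json
import time
from typing import Optional


def emf_record(namespace: str, metric: str, value: float, unit: str,
               dimensions: dict, timestamp_ms: Optional[int] = None) -> str:
    """Serialize one metric as an EMF log line."""
    ts = int(time.time() * 1000) if timestamp_ms is None else timestamp_ms
    record = {
        "_aws": {
            "Timestamp": ts,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],  # one dimension set
                "Metrics": [{"Name": metric, "Unit": unit}],
            }],
        },
        **dimensions,   # dimension values live at the top level
        metric: value,  # as does the metric value itself
    }
    return json.dumps(record)
```

In Lambda, printing the returned string to stdout is enough for CloudWatch to extract the metric. Using the user id as a dimension would create one metric per user, so per-user token totals may be better kept as plain log fields and queried with Logs Insights.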

10.2 Logs

Correlation id (request id) end‑to‑end
Redact message content by default (toggle for debugging)
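Default redaction might be implemented by replacing message text before the request is logged, as in this sketch. The placeholder string and the `debug` toggle are assumptions; the field layout follows the request example in section 5.2.

```python
import copy

PLACEHOLDER = "[REDACTED]"


def redact_for_log(body: dict, debug: bool = False) -> dict:
    """Return a copy of the request that is safe to log."""
    if debug:
        return body
    redacted = copy.deepcopy(body)
    for message in redacted.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):
            message["content"] = [{"text": PLACEHOLDER} for _ in content]
        elif content is not None:
            message["content"] = PLACEHOLDER
    if isinstance(redacted.get("system"), list):
        redacted["system"] = [{"text": PLACEHOLDER} for _ in redacted["system"]]
    return redacted
```

Keeping the number of content blocks intact preserves the request's shape for debugging without exposing the text.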

10.3 Alerts (SNS/Slack)

5xx rate > 2% over 5m
/app/bedrock/enabled set to false, i.e. soft kill engaged (notify)
User at 80% and 100% of daily cap (optional DM)
Cost anomaly (tokens/day spike)
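To avoid repeating the 80% and 100% notifications on every request, the proxy could alert only when a request moves usage across a threshold. A sketch using integer arithmetic (`crossed_thresholds` is a hypothetical helper):

```python
def crossed_thresholds(used_before: int, used_after: int, cap: int,
                       thresholds_pct=(80, 100)) -> list:
    """Return the percentage thresholds crossed by this request."""
    return [
        pct for pct in thresholds_pct
        if used_before * 100 < pct * cap <= used_after * 100
    ]
```

A threshold is reported once, on the request whose usage first reaches it.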

11) Security Considerations

No API key leakage in logs; rotate keys quarterly.
Optional HMAC sidecar token per request (defense in depth).
WAF geo/IP allowlist if necessary.
Content guardrails (Bedrock Guardrails) for policy compliance.

12) Deployment & IaC

CDK (TypeScript/Java) project containing:
- API Gateway (routes, models), Usage Plans, API Keys (seed via script)
- Lambdas (Authorizer + Proxy) with env vars for table names/params
- DynamoDB tables, autoscaling RCU/WCU
- SSM params with defaults
- Optional WAF association
Environments: dev, staging, prod; feature flags per env
CI/CD: GitHub Actions to synth/deploy; unit + integration tests

13) Runbooks

13.1 Emergency Shutdown

Set /app/bedrock/enabled=false (Immediate soft stop).
If not sufficient, attach Org SCP denying bedrock:* to account/OU.
Optionally disable Bedrock model access in console.

13.2 Raise/Lower User Cap

Update GatewayUsers.dailyOutputCap; change takes effect immediately.

13.3 Key Rotation

Create new API key, map to user, notify; revoke old key after grace period.

13.4 Hotspot/Abuse

Move user to stricter Usage Plan or suspend user; add WAF rule.

14) Testing Strategy

Unit tests: payload validation, authorizer cap math, DDB updates
Contract tests: /v1/chat success/429/403/400 cases
Load tests: confirm API GW TPS enforcement; backpressure behavior
Chaos: Bedrock 5xx/latency injections; ensure graceful degradation
Security: key rotation test; WAF rule efficacy

15) Cost Model (rough)

API Gateway requests (per million)
Lambda GB‑seconds (authorizer + proxy)
DynamoDB RCUs/WCUs proportional to calls (two writes per request worst‑case)
Bedrock tokens (dominant cost) – tracked via usage metrics

Optimizations: batch writes (streaming buffer), on‑demand → provisioned with autoscaling, aggregate counters with periodic compaction.
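Since Bedrock tokens dominate, the per-request cost can be estimated directly from the usage counts. The sketch below takes prices as parameters; the prices in the usage note are placeholders for illustration only, not actual Bedrock pricing.

```python
from decimal import Decimal


def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_1k_input: Decimal, usd_per_1k_output: Decimal) -> Decimal:
    """Cost in USD of one request, given per-1K-token prices."""
    total = (Decimal(input_tokens) * usd_per_1k_input
             + Decimal(output_tokens) * usd_per_1k_output)
    return total / Decimal(1000)
```

With placeholder prices of 0.001 USD per 1K input tokens and 0.002 USD per 1K output tokens, a request with 123 input and 278 output tokens would cost 0.000679 USD. `Decimal` avoids floating-point drift when many small amounts are summed into a monthly total.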