You built an MCP server. It works on your laptop. Claude connects, tools fire, results come back. Ship it, right?
Not so fast. An April 2026 scan of 2,181 remote MCP server endpoints found that 52% were completely dead. Only 9% were fully healthy. The rest? Degraded — responding slowly, returning stale data, or failing silently with 200 OK responses full of parsing errors.
The gap between “works locally” and “works in production” is where MCP servers go to die. This guide covers the architecture, tooling, and operational patterns that keep yours in that 9%.
We run two MCP servers on Apify that handle real traffic daily. Everything here comes from what we learned getting those to stay alive — plus patterns we’ve stolen from teams running much larger deployments.
The STDIO Trap
Here’s the thing most MCP tutorials won’t tell you: the default transport doesn’t work for production.
When you scaffold an MCP server with the official SDK, it uses STDIO transport. That’s fine for local development — Claude Desktop, Cursor, and other clients spawn your server as a subprocess and pipe JSON-RPC messages through stdin/stdout. Simple. Fast. Zero network config.
But STDIO means the client and server run on the same machine. You can’t deploy that to a remote server. You can’t scale it. You can’t put it behind a load balancer. You can’t even share it with a teammate without them cloning your repo and running it locally.
The fix: switch to Streamable HTTP transport before you deploy anywhere remote.
Streamable HTTP (introduced in the March 2025 MCP spec revision) replaces the older HTTP+SSE transport. It turns your MCP server into a standard HTTP API that accepts POST requests at a /mcp endpoint. Any client that can make HTTP requests can connect — no subprocess spawning, no local filesystem access required.
The actual code change is small. In Python with FastMCP:
# Local development (STDIO)
mcp.run()
# Production (Streamable HTTP)
mcp.run(transport="streamable-http", host="0.0.0.0", port=8000)
In TypeScript:
// Local development
const transport = new StdioServerTransport();
// Production
const transport = new StreamableHTTPServerTransport({ port: 8000 });
await server.connect(transport);
That’s the minimum change. But switching transport is only step one. The real work is everything that comes after.
The Production Architecture
A production MCP server isn’t just your tool code running on a VPS. Here’s the architecture that actually survives contact with real traffic:
┌─────────────────────────────────────┐
│ AI Client (Claude, etc.) │
└──────────────┬──────────────────────┘
│ HTTPS
▼
┌─────────────────────────────────────┐
│ Reverse Proxy (Nginx) │
│ TLS termination, rate limiting │
└──────────────┬──────────────────────┘
│ HTTP
▼
┌─────────────────────────────────────┐
│ MCP Server Container │
│ Streamable HTTP on port 8000 │
│ Non-root user, health checks │
└──────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ External APIs / Databases │
│ (The actual data sources) │
└─────────────────────────────────────┘
Three layers: proxy, server, data sources. Each with its own failure modes and its own mitigation.
If you’re running 3+ MCP servers, add a gateway layer between the proxy and the individual servers. Gateways like Docker’s MCP Gateway (GA since late 2025) aggregate multiple servers behind a single endpoint, handle centralized auth, prevent tool-name collisions, and give you unified logging. Without a gateway, you end up with separate auth flows, separate monitoring, and separate secrets management for each server — and that operational complexity grows faster than you expect.
Dockerizing Your MCP Server
Containers aren’t optional for production MCP. They give you reproducible builds, consistent environments, and the ability to deploy anywhere that runs Docker — which in 2026 is basically everywhere.
Here’s a production Dockerfile for a Python MCP server:
FROM python:3.12-slim
# Don't run as root
RUN useradd --create-home --shell /bin/bash mcpuser
WORKDIR /app
# Install dependencies first (better layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Switch to non-root user
USER mcpuser
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
CMD ["python", "-m", "server", "--transport", "streamable-http"]
Key decisions here:
Use python:3.12-slim, not Alpine. Alpine uses musl instead of glibc, and that causes subtle issues with some Python packages (especially anything that touches async I/O). The slim image is 40MB larger but saves you hours of debugging.
Run as a non-root user. If your server gets compromised, the attacker shouldn’t have root access to the container. This is basic hygiene but a shocking number of MCP server Dockerfiles skip it.
HEALTHCHECK in the Dockerfile. This tells Docker (and any orchestrator above it) how to know if your server is actually working. More on health checks below — they’re more nuanced than they look.
Pin your dependencies. Your requirements.txt should have exact versions (mcp==1.7.1, not mcp>=1.7). The MCP SDK is still evolving fast, and a minor version bump can change behavior.
And the docker-compose.yml for local production testing:
version: "3.8"
services:
mcp-server:
build: .
ports:
- "8000:8000"
environment:
- API_KEY=${API_KEY}
- LOG_LEVEL=info
restart: unless-stopped
deploy:
resources:
limits:
memory: 512M
cpus: "0.5"
Never put secrets in your Dockerfile or docker-compose file. Use environment variables with ${VAR} substitution, keep the actual values in a .env file, and add .env to .gitignore. For production, use Docker secrets or your platform’s secret manager (AWS Secrets Manager, Vault, etc.).
Health Checks That Actually Work
This is where most production MCP servers fail. And I don’t mean they fail the health check — I mean they pass it while being completely broken.
The naive approach:
@app.get("/health")
async def health():
return {"status": "ok"}
This tells you the HTTP process is running. Congratulations — you’ve verified the least interesting thing about your server. If your MCP server wraps a Google Maps API and that API key expired yesterday, this health check still returns 200 OK while every tool call fails.
A real health check verifies the full dependency chain:
@app.get("/health")
async def health():
checks = {}
# Check upstream API connectivity
try:
response = await http_client.get(
"https://maps.googleapis.com/maps/api/place/details/json",
params={"place_id": "ChIJN1t_tDeuEmsRUsoyG83frY4", "key": API_KEY},
timeout=5
)
checks["google_maps_api"] = response.status_code == 200
except Exception:
checks["google_maps_api"] = False
# Check database if applicable
try:
await db.execute("SELECT 1")
checks["database"] = True
except Exception:
checks["database"] = False
all_healthy = all(checks.values())
return JSONResponse(
status_code=200 if all_healthy else 503,
content={"status": "healthy" if all_healthy else "degraded", "checks": checks}
)
This tells you something useful. When Google’s API goes down or your key gets revoked, your health check goes red before your users file a bug report.
Run this check on a 30-second interval. Set your orchestrator to restart the container after 3 consecutive failures. And log every failure — health check failures are your early warning system.
The Silent Killer: Schema Drift
Here’s a failure mode that won’t show up in any health check if you’re not specifically looking for it.
Your MCP server wraps an external API. That API changes its response format — maybe a field gets renamed, maybe a nested object becomes an array, maybe a field that was always present stops appearing for certain queries. Your server still gets 200 OK from the API. It parses what it can, returns half-broken results, and the LLM downstream tries to make sense of garbage data.
This is called schema drift, and it was called out in the April 2026 endpoint analysis as one of the hardest production bugs to catch.
The defense is response validation:
from pydantic import BaseModel, ValidationError
class PlaceDetails(BaseModel):
name: str
formatted_address: str
rating: float | None = None
user_ratings_total: int | None = None
@mcp.tool()
async def get_place_details(place_id: str) -> str:
response = await fetch_from_google(place_id)
try:
validated = PlaceDetails(**response)
except ValidationError as e:
logger.error(f"Schema drift detected: {e}")
raise McpError(
ErrorCode.InternalError,
"Upstream API response format has changed. This tool needs maintenance."
)
return validated.model_dump_json()
With Pydantic models (or Zod schemas in TypeScript), you catch format changes immediately instead of silently passing bad data through. When validation fails, you return a clear error to the LLM instead of letting it hallucinate based on malformed input.
Pin the API version in your requests too, if the upstream API supports versioning. And set up an alert when validation errors spike — that’s your signal to update the schema.
Security: Treat Each Server Like a Microservice
The MCP spec now mandates OAuth 2.1 with PKCE for authentication. If you’re building a server that faces the internet, implement it. If you’re running internal-only servers, you still need auth — the question is just how much.
Minimum security checklist:
TLS everywhere. Your reverse proxy terminates TLS, but the connection between client and proxy must be HTTPS. No exceptions. Self-signed certs are fine for internal testing, but use Let’s Encrypt (it’s free) for anything public.
Scope your API keys. If your MCP server wraps the Google Maps API, create a dedicated API key with only the Places API enabled. If it gets compromised, the attacker can look up restaurants but can’t read your Gmail.
Rate limit at the proxy. Nginx or your cloud provider’s API gateway should enforce per-client rate limits before requests hit your server. An LLM in a retry loop can generate thousands of requests per minute.
Validate all inputs. Every tool parameter gets validated against its JSON schema before you do anything with it. The MCP SDK handles basic type validation, but add your own bounds checking (e.g., max 100 results per query, coordinates within valid ranges).
Log tool calls, not tool results. Log which tools were called, by whom, with what parameters. Don’t log the full response data — that might contain PII from the upstream source. If you need response logging for debugging, make it opt-in and time-limited.
Monitoring and Observability
You need three things running from day one:
Metrics. Track request count, latency (p50, p95, p99), error rate, and upstream API latency — separately for each tool. Prometheus + Grafana is the standard stack, but any metrics system works. The key metric to watch: the ratio of successful tool calls to total tool calls, broken down by tool name. When that ratio drops, something’s wrong.
Structured logging. Every tool call should produce a structured log entry with the tool name, request ID, duration, success/failure, and error type if applicable. JSON logs that you can query with Loki, the ELK stack, or even CloudWatch Logs Insights. When something breaks at 3 AM, you need to answer “what happened” without SSH-ing into the container.
Alerting. At minimum, alert on: health check failures (3+ consecutive), error rate above 5% over a 5-minute window, p95 latency above your SLA (we use 10 seconds), and container restarts. Pipe alerts to wherever you’ll actually see them — Slack, PagerDuty, email, whatever you actually check.
Here’s a practical observability setup for a Python MCP server:
import time
import structlog
from prometheus_client import Counter, Histogram, start_http_server
logger = structlog.get_logger()
TOOL_CALLS = Counter("mcp_tool_calls_total", "Total tool calls", ["tool", "status"])
TOOL_LATENCY = Histogram("mcp_tool_duration_seconds", "Tool call duration", ["tool"])
# Start metrics server on a separate port
start_http_server(9090)
@mcp.tool()
async def get_reviews(place_id: str) -> str:
start = time.monotonic()
tool_name = "get_reviews"
try:
result = await fetch_reviews(place_id)
TOOL_CALLS.labels(tool=tool_name, status="success").inc()
logger.info("tool_call", tool=tool_name, place_id=place_id,
duration=time.monotonic() - start)
return result
except Exception as e:
TOOL_CALLS.labels(tool=tool_name, status="error").inc()
logger.error("tool_call_failed", tool=tool_name,
error=str(e), duration=time.monotonic() - start)
raise
finally:
TOOL_LATENCY.labels(tool=tool_name).observe(time.monotonic() - start)
This gives you counters and histograms per tool, plus structured logs that tell you exactly what happened on every call.
Avoiding Cold Start Hell
If you’re deploying to a serverless platform (AWS Lambda, Google Cloud Run, Fly.io), cold starts will bite you. An MCP client sends a request, your function spins up from zero, loads dependencies, establishes API connections… and the client times out after 30 seconds waiting for a response.
Three fixes:
Set a minimum instance count. Cloud Run and Lambda both support this. It costs a few dollars per month to keep one instance warm, and it eliminates the cold start for the first request after a quiet period. Worth it.
Lazy-load heavy dependencies. Don’t import your entire ML model or API client at module load time. Initialize them on first use and cache them for subsequent calls.
Reduce your container image size. A 2GB image takes 15+ seconds to cold start. A 200MB image starts in 2-3 seconds. Use slim base images, multi-stage Docker builds, and don’t ship dev dependencies to production.
If cold starts are still unacceptable, skip serverless entirely and run on a small VPS. A $5/month DigitalOcean droplet with Docker Compose handles more MCP traffic than you probably think.
The Idempotency Requirement
AI agents retry. A lot. If the network glitches mid-response, the client resends the same request. If the LLM decides it needs the same data again, it calls the same tool with the same parameters.
Your tools need to handle this gracefully. A get_reviews call is naturally idempotent — calling it twice with the same place_id returns the same data. But a create_report tool that generates a new database record on each call? That’s a problem. Two retries give you three duplicate reports.
The fix: accept a client-provided request_id parameter (or use the MCP request ID from the JSON-RPC envelope), check if you’ve already processed that ID, and return the cached result if so. This is the same pattern that payment APIs have used for decades, and it’s just as important here.
Deployment Checklist
Before you push to production, verify every item:
Transport: Switched from STDIO to Streamable HTTP. Test with a remote client, not just localhost.
Container: Dockerfile uses slim base image, non-root user, HEALTHCHECK directive. Image builds successfully with docker build --no-cache.
Health check: /health endpoint verifies all upstream dependencies. Returns 503 when any dependency is down.
Secrets: No credentials in code, Dockerfile, or docker-compose.yml. All secrets injected via environment variables or a secret manager.
Auth: OAuth 2.1 implemented for public-facing servers. API keys scoped to minimum required permissions.
TLS: HTTPS enforced. HTTP requests redirect to HTTPS.
Rate limiting: Configured at the reverse proxy or gateway level.
Monitoring: Metrics exported (Prometheus or equivalent). Structured logging enabled. Alerts configured for errors, latency, and health check failures.
Schema validation: All upstream API responses validated against Pydantic/Zod models. Validation failures logged and alerted.
Idempotency: Write operations handle duplicate requests gracefully.
Resource limits: Memory and CPU limits set in Docker/orchestrator config. You don’t want one runaway request taking down the host.
Documentation: Tool descriptions are clear enough that an LLM can use them without additional context. Bad tool descriptions cause more production issues than bad code.
When to Use a Managed Platform
Not everyone needs to manage their own infrastructure. Platforms like Apify handle the deployment, scaling, monitoring, and billing for you. You write the tool logic, they handle everything else.
This makes sense when you’re a solo developer or small team and don’t want to manage Docker, Nginx, TLS certs, and monitoring infrastructure. It also makes sense when you want usage-based pricing instead of fixed server costs, or when your traffic is spiky and you’d rather pay per request than keep servers warm.
See It in Action
Godberry Studios runs two production MCP servers on the Apify Store — Content-to-Social and Google Reviews Scraper. Both handle real customer traffic with zero infrastructure management on our end.
The trade-off: you give up control over the infrastructure and pay a platform fee. For servers generating revenue, the math usually works out. For internal tools or hobby projects, self-hosting on a $5 VPS is more cost-effective.
What We Learned Shipping Two MCP Servers
Running production MCP servers since April 2026 has taught us a few things that don’t show up in documentation:
Tool descriptions matter more than tool code. An LLM decides which tool to call based on the description you provide. If your description is ambiguous, the LLM calls the wrong tool — or calls the right tool with wrong parameters. We’ve spent more time refining tool descriptions than refining tool logic.
Upstream APIs are your biggest risk. Your code doesn’t change, but the APIs you depend on do. Google changes rate limits, adds new required parameters, or deprecates endpoints. A weekly manual smoke test of each tool catches problems before your users do.
Start with one server that does one thing well. The teams reporting the most production pain are the ones with 10+ servers and no centralized management. One focused server with good monitoring beats five servers with none.
FAQ
Q: Do I need Kubernetes for MCP servers? For most use cases, no. Docker Compose on a single VPS handles surprising amounts of traffic. Kubernetes adds value when you’re running 5+ servers, need auto-scaling, or have strict uptime requirements. Don’t add orchestration complexity before you need it.
Q: How much traffic can one MCP server handle? That depends entirely on what your tools do. A tool that returns cached data can handle thousands of requests per second. A tool that calls a slow external API might max out at 10 concurrent requests. Profile your specific tools under load before sizing your infrastructure.
Q: Should I use SSE or Streamable HTTP? Streamable HTTP. The older HTTP+SSE transport still works, but Streamable HTTP is the current spec recommendation and has better client support. It’s simpler to deploy (single endpoint vs. two) and works better behind load balancers.
Q: What about authentication for internal servers? At minimum, use an API key passed in a header and validated by your reverse proxy. For anything touching sensitive data, implement OAuth 2.1. The MCP spec provides a full auth framework — use it rather than inventing your own.
Q: How do I test MCP servers before deploying?
The MCP Inspector (npx @modelcontextprotocol/inspector) lets you call tools interactively and inspect responses. Use it for manual testing, and write automated integration tests that call each tool with known inputs and verify the output structure. Test with the Streamable HTTP transport, not STDIO — transport bugs only surface in the transport you actually use.
Where to Go From Here
If you’re building your first production MCP server, here’s the sequence that works:
- Get your tool logic working locally with STDIO transport.
- Switch to Streamable HTTP and verify it works with the MCP Inspector.
- Dockerize with the patterns from this guide.
- Set up health checks that test real upstream connectivity.
- Deploy behind a reverse proxy with TLS and rate limiting.
- Add Prometheus metrics and structured logging.
- Set up alerts for the three critical signals: error rate, latency, and health check failures.
That’s it. No Kubernetes, no service mesh, no over-engineering. Get the basics right and your server will be in the 9% that actually stays healthy.
If you’re new to web scraping and data extraction — the domain where many MCP servers live — our beginner’s guide to web scraping covers the fundamentals. And if you want to see how we’ve built MCP servers that handle social media content transformation, that guide walks through the architecture.
The MCP SDK crossed 164 million monthly downloads in April 2026. The ecosystem is growing fast, but production quality isn’t keeping pace. The developers who figure out deployment and reliability now will own this space for the next few years.
Build something that stays alive.