Do I need Kubernetes for MCP servers?

For most use cases, no. Docker Compose on a single VPS handles surprising amounts of traffic. Kubernetes adds value when you're running 5+ servers, need auto-scaling, or have strict uptime requirements. Don't add orchestration complexity before you need it.

Should I use SSE or Streamable HTTP?

Streamable HTTP. The older HTTP+SSE transport still works, but Streamable HTTP is the current spec recommendation and has better client support. Simpler to deploy and works better behind load balancers.

How do I test MCP servers before deploying?

Use the MCP Inspector (npx @modelcontextprotocol/inspector) to call tools interactively and inspect responses. Write automated integration tests that call each tool with known inputs and verify output structure. Test with the Streamable HTTP transport, not STDIO — transport bugs only surface in the transport you actually use.
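
A minimal sketch of the "known inputs, verify output structure" check, assuming the tool returns JSON text inside a standard tools/call result. The helper names (build_tool_call, check_tool_result) and the canned response are hypothetical; in a real test the response would come from POSTing the request to your /mcp endpoint.

```python
import json


def build_tool_call(tool: str, arguments: dict, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 tools/call request body."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def check_tool_result(response: dict, required_fields: set) -> list:
    """Return a list of problems found in a tools/call response (empty means OK)."""
    if "error" in response:
        return [f"JSON-RPC error: {response['error'].get('message')}"]
    result = response.get("result", {})
    problems = []
    if result.get("isError"):
        problems.append("tool reported isError=true")
    text_items = [c for c in result.get("content", []) if c.get("type") == "text"]
    if not text_items:
        return problems + ["no text content in result"]
    try:
        payload = json.loads(text_items[0]["text"])
    except json.JSONDecodeError:
        return problems + ["text content is not valid JSON"]
    if not isinstance(payload, dict):
        return problems + ["payload is not a JSON object"]
    missing = required_fields - set(payload)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    return problems


# Hypothetical canned response standing in for a real HTTP round trip.
sample = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": '{"name": "Cafe", "rating": 4.5}'}]},
}
request = build_tool_call("get_place_details", {"place_id": "abc"})
print(check_tool_result(sample, {"name", "formatted_address"}))
```

Returning a list of problems instead of raising on the first one makes the test report every structural gap in a single run.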

Building Production-Ready MCP Servers: The 2026 Deployment Playbook

Your MCP server works on your laptop. Claude connects, tools fire, results come back. Ship it, right?

An early-2026 audit of 1,847 MCP servers found 52% abandoned and only 17% meeting a production bar — the rest lightly maintained, stale, or failing silently with 200 OK responses full of parsing errors. The gap between “works locally” and “works in production” is where MCP servers go to die. The patterns below keep yours in the 17% — drawn from shipping Content-to-Social MCP to the Apify Store on 2026-04-12 and running it on real customer traffic ever since.

The STDIO trap

The default transport doesn’t work for production. Scaffold with the official SDK and you get STDIO — fine for Claude Desktop or Cursor spawning your server as a subprocess, useless for anything remote. No load balancer, no sharing without a repo clone.

Switch to Streamable HTTP before you deploy. Added in the spec’s 2025-03-26 revision, it deprecated the older HTTP+SSE transport (still supported for backward compatibility) and turns your server into a standard HTTP API accepting POST at a /mcp endpoint. Single endpoint, works with load balancers, works with any HTTP client. The code change is small — in Python with FastMCP:

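# Local development (STDIO)
mcp.run()

# Production (Streamable HTTP)
# host/port as run() arguments work with the standalone fastmcp package;
# with the official SDK's mcp.server.fastmcp, pass them to the FastMCP(...) constructor instead.
mcp.run(transport="streamable-http", host="0.0.0.0", port=8000)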

In TypeScript the transport itself takes no port — it’s an HTTP handler you mount on a route, and the web framework owns the port. Wire it into Express:

import express, { Request, Response } from "express";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";

const app = express();
app.use(express.json());

app.post("/mcp", async (req: Request, res: Response) => {
  // Stateless mode: a fresh server and transport per request, no session IDs.
  const server: McpServer = getServer(); // your factory returning a configured McpServer
  const transport = new StreamableHTTPServerTransport({
    sessionIdGenerator: undefined,
  });
  // Register cleanup before handling, so an early disconnect still closes both.
  res.on("close", () => { transport.close(); server.close(); });
  await server.connect(transport);
  await transport.handleRequest(req, res, req.body);
});

app.listen(8000);

This is stateless mode (sessionIdGenerator: undefined): simpler, no resumability. For stateful sessions, pass sessionIdGenerator: () => randomUUID() and keep one transport per session ID, reusing it across requests; a new transport per request cannot serve a session that an earlier transport initialized. That’s step one. The real work is everything that comes after.

Production architecture

The shape that survives real traffic:

┌─────────────────────────────────────┐
│          AI Client (Claude, etc.)    │
└──────────────┬──────────────────────┘
               │ HTTPS
               ▼
┌─────────────────────────────────────┐
│         Reverse Proxy (Nginx)       │
│    TLS termination, rate limiting    │
└──────────────┬──────────────────────┘
               │ HTTP
               ▼
┌─────────────────────────────────────┐
│        MCP Server Container         │
│   Streamable HTTP on port 8000      │
│   Non-root user, health checks      │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│       External APIs / Databases     │
│   (The actual data sources)         │
└─────────────────────────────────────┘

Three layers, each with its own failure modes. Running 3+ MCP servers? Add a gateway layer between the proxy and the individual servers. Gateways like Docker’s MCP Gateway (open-sourced by Docker in mid-2025) aggregate multiple servers behind a single endpoint, centralize auth, prevent tool-name collisions, and give you unified logging. Without one you end up with separate auth flows, monitoring, and secrets management per server — and operational complexity grows faster than you expect.
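
To make the tool-name collision point concrete, here is a small sketch of one common approach: prefix every tool with its server name at the gateway. This illustrates the idea only and is not how Docker’s MCP Gateway is implemented; the function name, separator, and server names are all made up.

```python
def namespace_tools(servers: dict, sep: str = "__") -> dict:
    """Map gateway-level tool names to (server, original tool name)."""
    routes = {}
    for server, tools in servers.items():
        for tool in tools:
            routes[f"{server}{sep}{tool}"] = (server, tool)
    return routes


# Two hypothetical servers that both expose a tool called "search".
servers = {
    "maps": ["search", "get_details"],
    "reviews": ["search", "get_reviews"],
}
routes = namespace_tools(servers)
print(sorted(routes))
print(routes["reviews__search"])
```

Both search tools survive under distinct names, and the routing table tells the gateway which backend to forward each call to.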

Dockerizing your MCP server

Containers aren’t optional. A production Dockerfile for a Python MCP server:

FROM python:3.12-slim

# Don't run as root
RUN useradd --create-home --shell /bin/bash mcpuser

WORKDIR /app

# Install dependencies first (better layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Switch to non-root user
USER mcpuser

EXPOSE 8000

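# python:3.12-slim ships without curl, so probe with the standard library instead
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)" || exit 1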

CMD ["python", "-m", "server", "--transport", "streamable-http"]

Decisions:

Slim, not Alpine. musl breaks Python packages touching async I/O; slim is 40MB larger and saves hours of debugging.
Non-root. A compromise shouldn’t equal root, and a shocking number of public MCP Dockerfiles skip it.
HEALTHCHECK in the Dockerfile. Docker and any orchestrator above it then know whether the server is actually working.
Pinned dependencies. Pin to an exact current version (mcp==1.27.1, not mcp>=1.27); the SDK is still evolving fast, so pin whatever is current the day you build.
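
Pinning is easy to enforce mechanically. A rough sketch of a pre-build check that flags any requirements.txt line not pinned with ==; the function name and sample file are illustrative, and the regex deliberately ignores edge cases such as environment markers and direct URLs.

```python
import re

# A package name (optionally with extras) followed by an exact '==' pin.
PINNED = re.compile(r"^[A-Za-z0-9._\-\[\],]+==[^=\s]+")


def unpinned_requirements(text: str) -> list:
    """Return requirement lines that are not pinned with '=='."""
    bad = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options like -r
            continue
        if not PINNED.match(line):
            bad.append(line)
    return bad


sample = """\
mcp==1.27.1
httpx>=0.27      # floating range
pydantic
"""
print(unpinned_requirements(sample))
```

Run it in CI and fail the build when the list is non-empty.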

docker-compose.yml for local production testing:

services:
  mcp-server:
    build: .
    ports:
      - "8000:8000"
    environment:
      - API_KEY=${API_KEY}
      - LOG_LEVEL=info
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "0.5"

Never bake secrets into the image. ${VAR} substitution, values in .env, .env in .gitignore; in production, Docker secrets or your platform’s secret manager (AWS Secrets Manager, Vault, Apify’s env-var store).
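
On the application side, a minimal sketch of reading those variables at startup and refusing to boot when the secret is missing. API_KEY and LOG_LEVEL match the compose file above; the Settings class, load_settings, and redact are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_key: str
    log_level: str = "info"


def load_settings(env=os.environ) -> Settings:
    """Read config from environment variables, failing fast if the secret is missing."""
    api_key = env.get("API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("API_KEY is not set; refusing to start")
    return Settings(api_key=api_key, log_level=env.get("LOG_LEVEL", "info"))


def redact(secret: str) -> str:
    """Show only the last 4 characters so logs never contain the full key."""
    return "*" * max(len(secret) - 4, 0) + secret[-4:]


# A plain dict stands in for the real environment here.
settings = load_settings({"API_KEY": "sk-example-1234", "LOG_LEVEL": "debug"})
print(redact(settings.api_key), settings.log_level)
```

Failing at startup beats discovering the missing key on the first tool call.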

The Apify-flavored version — same transport + billing wiring I run in production — is open-sourced at mcp-server-apify-starter (MIT). If Apify is your target, start there and skip the Dockerfile entirely.

Health checks that actually work

This is where most production MCP servers fail — and not by failing the check. They pass it while completely broken. The naive approach:

@app.get("/health")
async def health():
    return {"status": "ok"}

This tells you the HTTP process is running — the least interesting thing about your server. If your MCP server wraps a Google Maps API and that key expired yesterday, this still returns 200 OK while every tool call fails. A real health check verifies the full dependency chain:

from fastapi.responses import JSONResponse  # assumes `app` is a FastAPI app

@app.get("/health")
async def health():
    checks = {}

    # Check upstream API connectivity. Google answers HTTP 200 even for a
    # rejected key, so inspect the "status" field in the body as well.
    try:
        response = await http_client.get(
            "https://maps.googleapis.com/maps/api/place/details/json",
            params={"place_id": "ChIJN1t_tDeuEmsRUsoyG83frY4", "key": API_KEY},
            timeout=5
        )
        checks["google_maps_api"] = (
            response.status_code == 200 and response.json().get("status") == "OK"
        )
    except Exception:
        checks["google_maps_api"] = False

    # Check database if applicable
    try:
        await db.execute("SELECT 1")
        checks["database"] = True
    except Exception:
        checks["database"] = False

    all_healthy = all(checks.values())
    return JSONResponse(
        status_code=200 if all_healthy else 503,
        content={"status": "healthy" if all_healthy else "degraded", "checks": checks}
    )

Run this on a 30-second interval, restart the container after 3 consecutive failures, log every failure. Health check failures are your early warning system.
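
The "3 consecutive failures" rule is a small piece of state. A sketch of that counter, assuming you poll /health yourself from a watchdog; the class name is made up. Docker’s HEALTHCHECK --retries does the same counting to mark a container unhealthy, but as far as I know plain Docker only marks it, so something (an orchestrator or a watchdog like this) still has to act on it.

```python
class FailureTracker:
    """Count consecutive failed health checks and signal when a restart is due."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True once the threshold is reached."""
        self.consecutive = 0 if healthy else self.consecutive + 1
        return self.consecutive >= self.threshold


tracker = FailureTracker(threshold=3)
results = [False, False, True, False, False, False]
print([tracker.record(r) for r in results])
```

A single healthy check resets the counter, so one flaky probe never triggers a restart on its own.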

The silent killer: schema drift

A failure mode that won’t show up in any health check. The upstream API changes its response format — a field renamed, a nested object becomes an array, a field that was always present stops appearing for certain queries. Your server still gets 200 OK, parses what it can, returns half-broken results, and the LLM downstream tries to make sense of garbage. The April 2026 endpoint analysis flagged schema drift as one of the hardest production bugs to catch. The defense is response validation:

import logging

from mcp.shared.exceptions import McpError
from mcp.types import INTERNAL_ERROR, ErrorData
from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)

class PlaceDetails(BaseModel):
    name: str
    formatted_address: str
    rating: float | None = None
    user_ratings_total: int | None = None

@mcp.tool()
async def get_place_details(place_id: str) -> str:
    response = await fetch_from_google(place_id)  # your helper; returns the place dict

    try:
        validated = PlaceDetails(**response)
    except ValidationError as e:
        logger.error(f"Schema drift detected: {e}")
        # The Python SDK's McpError wraps an ErrorData payload
        raise McpError(ErrorData(
            code=INTERNAL_ERROR,
            message="Upstream API response format has changed. This tool needs maintenance.",
        ))

    return validated.model_dump_json()

Pydantic (or Zod in TypeScript) catches format changes immediately instead of silently passing bad data through. Pin the upstream API version when supported, and alert on validation-error spikes.

This is not theoretical: Google migrated reviewer anchors from <a href> to <button data-href> and silently emptied reviewerUrl for every customer of my scraper until I caught the validation gap. The v0.3.1 remake added an npm run smoke script (run before deploys, and on a daily schedule) that hits three known-good targets and asserts field-population thresholds. That single smoke test is the highest-leverage check running on the deployment.
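
The field-population idea fits in a few lines. This is a Python sketch of the concept, not the actual smoke script from the scraper; the function names, sample records, and thresholds are illustrative.

```python
def population_ratio(records: list, field: str) -> float:
    """Fraction of records where `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, "", [], {}))
    return filled / len(records)


def smoke_check(records: list, thresholds: dict) -> list:
    """Return the fields whose population ratio fell below their threshold."""
    return [
        field
        for field, minimum in thresholds.items()
        if population_ratio(records, field) < minimum
    ]


# Made-up scrape output where reviewerUrl has silently gone empty.
records = [
    {"text": "Great food", "reviewerUrl": ""},
    {"text": "Slow service", "reviewerUrl": ""},
    {"text": "Fine", "reviewerUrl": "https://example.com/u/1"},
]
print(smoke_check(records, {"text": 0.9, "reviewerUrl": 0.8}))
```

Type validation alone would pass these records, since an empty string is still a string; the threshold is what catches a field that quietly stopped being filled.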

Security: treat each server like a microservice

The MCP authorization spec builds on OAuth 2.1 (with PKCE) for HTTP-facing servers; STDIO servers are out of its scope, since they take credentials from the host process’s environment. If your server faces the internet, implement it; internal-only servers still need auth, though a scoped API key or network-level controls can be enough. Minimum checklist:

TLS everywhere. Proxy terminates TLS, but client-to-proxy must be HTTPS. Let’s Encrypt is free for anything public.
Scope your API keys. Dedicated key with only the permissions you need — compromise means the attacker looks up restaurants, not reads your Gmail.
Rate limit at the proxy. Nginx or your cloud’s API gateway enforces per-client limits. An LLM in a retry loop can generate thousands of requests per minute.
Validate all inputs. Every parameter against its JSON schema, plus your own bounds (max results, valid coordinate ranges). The SDK handles types; you add semantics.
Log tool calls, not tool results. Tool name, caller, parameters — never full response data, which may contain upstream PII. Response logging is opt-in and time-limited.
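
For the "you add semantics" part of input validation, a minimal sketch of bounds checks layered on top of schema type checks. The parameter names and the cap of 50 are assumptions for illustration.

```python
MAX_RESULTS_CAP = 50  # hypothetical upper bound for a single call


def validate_search_params(lat: float, lng: float, max_results: int) -> dict:
    """Apply semantic bounds that a JSON schema type check alone would not catch."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"lat out of range: {lat}")
    if not -180.0 <= lng <= 180.0:
        raise ValueError(f"lng out of range: {lng}")
    if max_results < 1:
        raise ValueError("max_results must be at least 1")
    # Clamp rather than reject, so an over-eager agent still gets a bounded answer.
    return {"lat": lat, "lng": lng, "max_results": min(max_results, MAX_RESULTS_CAP)}


print(validate_search_params(40.7, -74.0, 500))
```

Whether to clamp or reject is a judgment call: clamping keeps agents moving, rejecting makes the limit visible to the caller.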

For the full threat surface — tool poisoning, prompt injection through tool descriptions, the patterns that get servers compromised — see the MCP security playbook.

Monitoring and alerts

Three things from day one:

Metrics. Request count, latency (p50/p95/p99), error rate, upstream API latency per tool; Prometheus + Grafana is standard but anything works. The metric to watch is the ratio of successful to total tool calls per tool: when it drops, something’s wrong.
Structured logging. Every tool call as a JSON entry with tool name, request ID, duration, success/failure, error type; queryable via Loki, ELK, or CloudWatch Logs Insights.
Alerting. Minimum is health check failures (3+ consecutive), error rate >5% over 5 minutes, p95 above your SLA (I use 10s), and container restarts. Pipe to Slack, PagerDuty, or email, wherever you actually look.

A practical setup:

import time, structlog
from prometheus_client import Counter, Histogram, start_http_server

logger = structlog.get_logger()
TOOL_CALLS = Counter("mcp_tool_calls_total", "Total tool calls", ["tool", "status"])
TOOL_LATENCY = Histogram("mcp_tool_duration_seconds", "Tool call duration", ["tool"])
start_http_server(9090)

@mcp.tool()
async def get_reviews(place_id: str) -> str:
    start = time.monotonic()
    try:
        result = await fetch_reviews(place_id)
        TOOL_CALLS.labels(tool="get_reviews", status="success").inc()
        logger.info("tool_call", tool="get_reviews", place_id=place_id,
                    duration=time.monotonic() - start)
        return result
    except Exception as e:
        TOOL_CALLS.labels(tool="get_reviews", status="error").inc()
        logger.error("tool_call_failed", tool="get_reviews", error=str(e))
        raise
    finally:
        TOOL_LATENCY.labels(tool="get_reviews").observe(time.monotonic() - start)

Counters, histograms, and structured logs per tool — exactly what happened on every call.
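
The "error rate >5% over 5 minutes" alert is a sliding-window calculation. With Prometheus you would normally express it as an alert rule; the in-process sketch below just shows the arithmetic. The class name and the min_calls floor (to avoid alerting on two calls, one of which failed) are my own assumptions.

```python
from collections import deque


class ErrorRateWindow:
    """Track call outcomes over a sliding time window and flag high error rates."""

    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.05,
                 min_calls: int = 20):
        self.window = window_seconds
        self.threshold = threshold
        self.min_calls = min_calls
        self.events = deque()  # (timestamp, ok) pairs, oldest first

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()

    def record(self, ok: bool, now: float) -> None:
        self.events.append((now, ok))
        self._evict(now)

    def should_alert(self, now: float) -> bool:
        self._evict(now)
        total = len(self.events)
        if total < self.min_calls:
            return False
        errors = sum(1 for _, ok in self.events if not ok)
        return errors / total > self.threshold


window = ErrorRateWindow()
for i in range(100):
    window.record(ok=(i % 10 != 0), now=float(i))  # every 10th call fails
print(window.should_alert(now=100.0))
print(window.should_alert(now=1000.0))
```

Timestamps are passed in explicitly so the logic can be tested without sleeping; in a live server you would pass time.monotonic().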

Operational realities

Cold starts. Serverless (Lambda, Cloud Run, Fly.io) bites. Three fixes: minimum instance count (a few dollars a month keeps one warm); lazy-load heavy dependencies; shrink the image (2GB → 15s+ cold start, 200MB → 2–3s). If cold starts stay unacceptable, skip serverless — a $5/month DigitalOcean droplet running Docker Compose handles more MCP traffic than you’d think.
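
Lazy-loading in Python mostly means moving the import out of module scope and into the code path that needs it. A tiny sketch, with the standard-library statistics module standing in for a genuinely heavy dependency (pandas, a model runtime); the function names are illustrative.

```python
import importlib


def lazy_import(module_name: str):
    """Import a module on first use; Python caches it in sys.modules afterwards."""
    return importlib.import_module(module_name)


def summarize(values: list) -> float:
    # The import cost is paid on the first call to this tool, not at server startup.
    stats = lazy_import("statistics")
    return stats.mean(values)


print(summarize([1.0, 2.0, 3.0]))
```

The trade-off is that the first call to that tool is slower, which is usually better than every cold start being slower.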

Idempotency. Agents retry, a lot. get_reviews is naturally idempotent. create_report that writes a new record per call? Two retries → three copies of the same report. Accept a client-provided request_id (or use the MCP request ID from the JSON-RPC envelope), check whether you’ve processed it, return the cached result if so. Same pattern payment APIs have used for decades.
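
A minimal in-memory sketch of that check-then-return-cached pattern. The class and function names are hypothetical; a real deployment would want a shared store with a TTL (Redis, a database table) and some handling for two retries arriving concurrently, which this sketch skips.

```python
class IdempotencyCache:
    """Run an operation at most once per request_id and replay its result."""

    def __init__(self):
        self._results = {}

    def run_once(self, request_id: str, operation):
        if request_id in self._results:
            return self._results[request_id]  # retry: replay, do not re-run
        result = operation()
        self._results[request_id] = result
        return result


reports = []


def create_report() -> str:
    """Stand-in for a write that creates a new record on every call."""
    reports.append({"id": len(reports) + 1})
    return f"report-{len(reports)}"


cache = IdempotencyCache()
outcomes = [cache.run_once("req-42", create_report) for _ in range(3)]
print(outcomes, len(reports))
```

Three calls with the same request_id produce one record, and every caller sees the same result.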

Charge on success, not on attempt. If you’re metering (Apify PPE, Stripe, x402), fire the charge only after the JSON-RPC response actually ships, and make it throw on failure in production. The default I shipped on Google Reviews v0.3.1 is if (Actor.isAtHome) throw around Actor.charge; anything else and you eventually deliver data uncharged because a network blip dropped the billing event. With unbounded retries, naive metering that charges on attempt bills three events for one 5xx failure.
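
The ordering is the whole trick, so here is a platform-neutral sketch of it. run_tool, send_response, and charge are hypothetical callables you would wire to your own tool, transport, and billing client; strict mirrors the "throw on failure in production" rule.

```python
def handle_with_billing(run_tool, send_response, charge, strict: bool = True):
    """Run the tool, ship the response, and only then fire the charge."""
    result = run_tool()        # raises on failure: nothing is charged
    send_response(result)      # raises if delivery fails: nothing is charged
    try:
        charge()
    except Exception:
        if strict:
            raise              # production: surface billing failures loudly
    return result


charges, sent = [], []


def failing_tool():
    raise RuntimeError("upstream 502")


handle_with_billing(lambda: "ok", sent.append, lambda: charges.append("tool-call"))
try:
    handle_with_billing(failing_tool, sent.append, lambda: charges.append("tool-call"))
except RuntimeError:
    pass
print(charges, sent)
```

One successful call, one failed call, one charge. A retried failure never reaches the charge line, so retries cannot multiply the bill.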

Free-plan gates are the platform reality nobody mentions — they cost me 5,407 results’ worth of compute before I caught them. On Apify (and any platform where developer revenue depends on the user’s subscription tier, not just the event firing), Actor.charge succeeds and emits charge events but pays the developer nothing when the caller is on the free plan — the platform keeps it. Build the gate into the actor on day one, like the pattern shipped on Google Reviews 0.2.5:

// Fail open: on any error reading plan info, DO NOT gate.
// Never throttle a paying customer over a plan-API blip.
let isFreePlan = false;
try {
  const userInfo = await Actor.client.user('me').get();
  isFreePlan = userInfo?.plan === 'FREE';
} catch (err) {
  console.warn('Plan lookup failed; treating caller as paid', err);
}

const reviewLimit = isFreePlan ? 10 : input.maxReviews;
const placeLimit = isFreePlan ? 1 : input.maxPlaces;

Two rules: fail open (lookup throws → treat as paid; throttling a paying customer is the worst failure mode) and surface an upgrade prompt in the run log when the gate clamps. Generalizes off Apify — any platform that meters on event-emit but pays out on subscriber tier needs a tier check before the expensive work starts.

When to use a managed platform

Not everyone needs their own infrastructure. Platforms like Apify handle deployment, scaling, monitoring, and billing — you write tool logic. Fits solo developers, usage-based pricing, and spiky traffic.

See It in Action

I run a production MCP server (Content-to-Social) plus scrapers on the Apify Store, all handling real customer traffic with zero infrastructure management on my end.

Trade-off: less infrastructure control, a platform fee. For revenue-generating servers the math usually works out; for internal tools, a $5 VPS is cheaper. The MCP server monetization playbook breaks down Apify vs MCPize vs self-hosted economics, and the Apify pay-per-event migration playbook covers the billing-model migration step-by-step.

The deployment checklist

Transport on Streamable HTTP, tested with a remote client.
Container uses slim base, non-root user, HEALTHCHECK directive.
Health check verifies all upstream dependencies; returns 503 when any is down.
Secrets via env vars or a secret manager — never baked into the image.
Auth is OAuth 2.1 for public servers; API keys scoped to minimum permissions.
TLS enforced, HTTP redirects to HTTPS.
Rate limiting at the proxy or gateway.
Monitoring: metrics, structured logs, alerts on errors / latency / health failures.
Schema validation on upstream responses; spikes alert.
Idempotency on every write.
Resource limits in Docker/orchestrator config.
Billing (if charging): fires only on successful response, throws on failure in production, tier/plan gate in front of compute-heavy work.
Tool descriptions sharp enough that an LLM picks the right tool with the right parameters — bad descriptions cause more production issues than bad code.

End-to-end sequence: tool logic working locally on STDIO → Streamable HTTP verified with the MCP Inspector → Dockerize (or skip to mcp-server-apify-starter if Apify is your target) → real health checks → reverse proxy with TLS + rate limiting → metrics, logs, alerts. No Kubernetes, no service mesh, no over-engineering.

The MCP SDKs crossed ~97 million monthly downloads as of March 2026 — the ecosystem is growing faster than production quality is. The developers who get deployment and reliability right now will own this space. Build something that stays alive, and bill the people calling it.
