API Gateway Patterns: Authentication, Routing, and Transformation at the Edge

Q: When should I use a gateway versus a service mesh?

They solve overlapping but different problems. A gateway manages north-south traffic (requests entering from external clients). A service mesh (Istio, Linkerd) manages east-west traffic (requests between internal services). Many architectures use both: gateway at the edge, mesh internally for mTLS, service discovery, and internal load balancing.

Q: Should the gateway handle response aggregation?

It can, and for simple cases (a mobile BFF stitching together user profile and recent orders) it works well. But aggregation adds latency and failure modes. If the logic involves business rules or conditional branching, it belongs in a dedicated aggregation service behind the gateway, not in the gateway itself.

Q: How does the Kubernetes Gateway API change things?

The Kubernetes Gateway API is the successor to the older Ingress resource, with a more expressive, role-oriented model. It separates concerns between infrastructure providers (who define GatewayClasses), cluster operators (who deploy Gateways), and application developers (who configure HTTPRoutes). For teams on Kubernetes, it means more portable routing configuration that works across different gateway implementations. It doesn't replace the gateway. It standardizes how you talk to it.

Q: Do I need a separate AI gateway?

If your LLM traffic is minimal (one model, low volume), your existing gateway can handle basic proxying and auth. But once you need token-based rate limiting, semantic caching, model failover, prompt logging for compliance, or cost attribution across teams, a purpose-built AI gateway (or AI gateway plugin for platforms like Kong) becomes necessary. Traditional gateways count requests and bytes. AI gateways count tokens and meaning. The difference is not academic once your monthly inference bill has a comma in it.

Q: How do I prevent the gateway from becoming a single point of failure?

Deploy multiple instances behind a load balancer, spread across availability zones or regions. Use health checks so the balancer removes unhealthy gateway instances automatically. Keep the gateway stateless (no session data stored locally) so any instance handles any request. Most managed services handle this transparently. For self-managed gateways (Kong, Envoy), horizontal scaling and automated failover are day-one requirements.

Edge Logic

Why the Edge Is the Right Place

Placing authentication, routing, and transformation at the network boundary rather than scattering them across services is not a convenience. It is an architectural decision with compounding consequences.

Without a gateway, every service implements its own authentication. Every service understands how clients address it. Every service handles format mismatches and versioning quirks. You end up with dozens of independent security policies, each maintained by a different team, each with its own bugs. One service forgets to check credentials. Another accepts the wrong format. The inconsistency becomes the vulnerability.

A centralized gateway consolidates these concerns into a single, auditable layer. TLS termination (decryption of encrypted traffic so internal services don't each manage certificates) happens once. Token validation happens once. Rate limiting is enforced in the same place. Services behind the gateway never see raw traffic. They receive pre-screened requests and focus on business logic.

Modern edge deployments push gateway logic to points of presence around the world. A request from Tokyo hits a gateway node in Tokyo, gets authenticated and transformed there, and only then travels to backend services in Virginia. The user feels the distance to the nearest edge node, not to the backend.

Identity Layer

Authentication at the Gate

Authentication answers one question: is this caller who they claim to be? Authorization (whether the caller is allowed) is a different question. The gateway validates identity (authentication) and can enforce coarse-grained access rules, but fine-grained business authorization ("can this user edit this specific record?") belongs in the service that owns the domain.

API keys are the simplest credential: a unique string included in every request. They work for server-to-server communication where both parties are trusted. They fail when keys leak, because an API key is a bearer credential: possession alone grants access. For internal tools and low-sensitivity endpoints, they suffice. For anything customer-facing or handling sensitive data, use OAuth 2.0 flows with short-lived tokens instead.

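JWTs carry claims (structured statements about identity, roles, and permissions) signed by a trusted authority. The gateway validates a JWT without calling a central auth server on every request, because the signature proves the token has not been tampered with. A valid signature is not enough on its own: the gateway must also check expiry, issuer, and audience, or it will accept tokens that are authentic but stale or meant for a different API.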

OAuth 2.0 and OIDC formalize obtaining and validating tokens through a trusted third party. The gateway validates the access token and extracts claims to pass downstream as headers, sparing each service the burden of token parsing.

Mutual TLS requires both client and server to present certificates, creating a two-way cryptographic handshake. It is common in service-to-service communication within zero-trust architectures (security models that assume no implicit trust, even for traffic inside the network perimeter).

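The pattern is offloading. The gateway validates the credential, then forwards identity context as HTTP headers (X-User-ID, X-Tenant-ID). Services trust the gateway and never validate a token themselves. That trust only holds if the gateway strips any client-supplied copies of those headers and the services accept traffic only from the gateway; otherwise anyone who can reach a service directly can claim any identity.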
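The offloading step might look something like the sketch below. The `tenant` claim name and the function name are assumptions for illustration; the header names come from the text above.

```python
# Header names the gateway owns. Clients must never be able to set these.
IDENTITY_HEADERS = ("x-user-id", "x-tenant-id")


def build_upstream_headers(client_headers: dict, claims: dict) -> dict:
    """Copy client headers, drop spoofed identity headers, inject trusted ones."""
    forwarded = {
        name: value
        for name, value in client_headers.items()
        if name.lower() not in IDENTITY_HEADERS
    }
    # Values come from already-validated token claims, never from the request.
    forwarded["X-User-ID"] = str(claims["sub"])
    forwarded["X-Tenant-ID"] = str(claims["tenant"])  # hypothetical claim name
    return forwarded
```

Stripping before injecting matters: a client that sends its own `x-user-id: admin` should see it silently replaced, not merged.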

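Never use permissive CORS in any environment where credentials are sent. Browsers refuse credentialed requests when the response carries Access-Control-Allow-Origin: *, and the common workaround, echoing back whatever Origin the caller sent, is the real hole: it lets any site read authenticated responses. Specify exact domains. An overly permissive CORS policy at the gateway is a security hole that most teams do not notice until it is exploited.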
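A small sketch of an exact-match origin allowlist. The origins listed and the function name are placeholders, not values from any real deployment.

```python
from typing import Optional

# Hypothetical allowlist; exact scheme + host, no wildcards or suffix matching.
ALLOWED_ORIGINS = frozenset({"https://app.example.com", "https://admin.example.com"})


def cors_headers(request_origin: Optional[str]) -> dict:
    """Return CORS response headers only for origins on the allowlist."""
    if request_origin not in ALLOWED_ORIGINS:
        # Unknown origin: emit no CORS headers, so the browser blocks the read.
        return {}
    return {
        "Access-Control-Allow-Origin": request_origin,
        "Access-Control-Allow-Credentials": "true",
        # The response differs per origin, so caches must key on it.
        "Vary": "Origin",
    }
```

The lookup is a set membership test on the whole origin string, which avoids the classic mistake of prefix or suffix matching (`https://app.example.com.evil.test`).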

API keys

Useful for trusted server-to-server traffic, but weak once leaked because possession alone grants access.

JWTs

Signed claims the gateway can validate statelessly and translate into downstream identity context.

OAuth 2.0 and OIDC

Standardized delegated access flows that spare every service from owning token parsing and validation.

Mutual TLS

Strong two-way identity for zero-trust and internal service-to-service communication.

Traffic Control

Routing

Once the caller is identified, the gateway decides where to send them.

In a monolithic application, routing is trivial: everything goes to the same place. In a microservices architecture, routing is the connective tissue holding the system together. A single public endpoint like /api/orders/12345 might need to reach an order service, while /api/inventory/check goes to completely different infrastructure.

Path-based routing is the most common: /api/users/* goes to the user service, /api/payments/* goes to the payment service. Simple, readable, easy to reason about.
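As a rough sketch, path-based routing reduces to a prefix table plus a rule for overlapping prefixes. The upstream addresses and the `/api/users/admin/` route are made up for the example; longest-prefix-wins is one common tie-break, not the only one.

```python
from typing import Optional

# Hypothetical route table: public path prefix -> internal upstream.
ROUTES = {
    "/api/users/": "http://user-service.internal:8080",
    "/api/users/admin/": "http://admin-service.internal:8080",
    "/api/payments/": "http://payment-service.internal:8080",
}


def resolve_upstream(path: str, routes: dict = ROUTES) -> Optional[str]:
    """Pick the upstream whose prefix matches the path; longest prefix wins."""
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    if not matches:
        return None  # the caller would answer 404
    return routes[max(matches, key=len)]
```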

Header-based routing directs traffic based on request metadata. An X-Version: beta header might route to a canary deployment (a new version running alongside the stable one, receiving a small fraction of traffic to test before full rollout). A Content-Type: application/grpc header might route to a different upstream entirely.

Weighted routing distributes traffic across service versions by percentage. 95% to stable, 5% to new. This implements canary or blue-green deployment patterns entirely at the gateway without the services knowing they are being tested.
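One way to sketch header-based and weighted routing together is to hash a stable client identifier into a 0-99 bucket, so the same client keeps landing on the same version during a rollout. Hashing rather than random choice is an assumption of this sketch; the function names are illustrative.

```python
import hashlib


def bucket_for(key: str) -> int:
    """Map a stable key (client ID, API key) to a bucket from 0 to 99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(headers: dict, client_id: str, canary_percent: int = 5) -> str:
    """Return "canary" or "stable" for this request."""
    # Header-based override: explicit beta opt-in always goes to the canary.
    # Assumes header names were normalized to this exact casing upstream.
    if headers.get("X-Version") == "beta":
        return "canary"
    # Weighted split: buckets below the threshold go to the canary.
    return "canary" if bucket_for(client_id) < canary_percent else "stable"
```

Raising `canary_percent` from 5 to 50 to 100 moves clients over without reshuffling the ones already on the canary, because each client's bucket never changes.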

Service discovery integration connects the gateway to a registry of healthy instances. Rather than hard-coding addresses, the gateway asks: "Who is alive and serving orders right now?" This can be client-side discovery (the gateway queries the registry directly and picks an instance itself) or server-side discovery (the gateway sends the request to a load balancer, which consults the registry and picks the instance). Either way, the visitor doesn't need to know the shop moved. The directory stays current.
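A toy registry illustrating the client-side variant: instances send heartbeats, entries older than a TTL count as dead, and the gateway round-robins over what remains. The class, the TTL, and the addresses are assumptions for the sketch, not the API of any real registry.

```python
import itertools
from typing import Optional


class Registry:
    def __init__(self, ttl_seconds: float = 10.0):
        self.ttl = ttl_seconds
        self.last_seen = {}  # (service, address) -> time of last heartbeat
        self._counter = itertools.count()

    def heartbeat(self, service: str, address: str, now: float) -> None:
        self.last_seen[(service, address)] = now

    def healthy(self, service: str, now: float) -> list:
        """Addresses for the service whose heartbeat is within the TTL."""
        return sorted(
            address
            for (name, address), seen in self.last_seen.items()
            if name == service and now - seen <= self.ttl
        )

    def pick(self, service: str, now: float) -> Optional[str]:
        """Round-robin over healthy instances; None if nothing is alive."""
        candidates = self.healthy(service, now)
        if not candidates:
            return None
        return candidates[next(self._counter) % len(candidates)]
```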

Rate limiting also lives in the routing layer. The gateway enforces global limits (no more than 10,000 requests per minute across all clients), per-client limits (this API key gets 500 per minute), and burst allowances (a client can briefly exceed their limit for spikes, then must slow down). Rate limiting is a routing decision: the request either goes to the service, to a queue, or back to the caller with a 429. It is the gatehouse controlling foot traffic so the narrow streets inside do not become impassable.
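Per-client limits with burst allowances are often implemented as a token bucket; the sketch below is one such version, with the clock passed in explicitly so the behavior is easy to follow. The class and parameter names are illustrative, and a real gateway would keep one bucket per API key in shared storage.

```python
class TokenBucket:
    """Allows `burst` requests at once, then refills at a steady rate."""

    def __init__(self, rate_per_minute: float, burst: int, now: float = 0.0):
        self.refill_per_second = rate_per_minute / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial spike is allowed
        self.updated_at = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def status_for(bucket: TokenBucket, now: float) -> int:
    """Forward the request (200 stands in for proxying) or reject with 429."""
    return 200 if bucket.allow(now) else 429
```

With `rate_per_minute=60, burst=3`, a client can fire three requests in the same instant, is rejected on the fourth, and earns one more request for each second it waits.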

Path-based routing

Readable public paths map requests to the right upstream services with minimal client awareness of internal topology.

Header and weighted routing

Metadata and traffic percentages let the gateway support protocol shifts, beta traffic, and rollout strategies.

Discovery and rate control

The gateway tracks healthy instances and decides which traffic proceeds, slows, queues, or gets rejected.

Contract Shaping

Transformation

Transformation reshapes requests and responses so client expectations and service requirements don't have to match. This is where the gateway removes coupling that clients never see.

Protocol translation: A client sends REST over HTTP; the backend speaks gRPC. The gateway translates in both directions. The client never knows gRPC exists. The service never knows REST was involved.

Header manipulation: The gateway strips sensitive headers, adds tracing IDs (X-Request-ID for distributed tracing), injects tenant context, and rewrites Host headers for services behind a reverse proxy.

Path rewriting: The public API exposes /customers/1234; the backend expects ?customerID=1234. The gateway rewrites one into the other, allowing internal services to evolve without breaking external contracts.
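A minimal sketch of that rewrite. The source only specifies the query string, so the backend path `/customer-lookup` is a placeholder, as is the function name.

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Public contract: /customers/<numeric id>
_PUBLIC_CUSTOMER = re.compile(r"/customers/(?P<customer_id>\d+)")


def rewrite_customer_path(public_path: str, backend_path: str = "/customer-lookup") -> Optional[str]:
    """Turn /customers/1234 into <backend_path>?customerID=1234."""
    match = _PUBLIC_CUSTOMER.fullmatch(public_path)
    if match is None:
        return None  # not this route; let other rules handle it
    query = urlencode({"customerID": match["customer_id"]})
    return f"{backend_path}?{query}"
```

If the backend later renames the parameter, only this mapping changes; the public path stays as it was.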

Body transformation: Incoming XML from a legacy partner converts to JSON before reaching a modern service. Verbose internal responses get trimmed to include only fields the public contract specifies, reducing payload size and preventing accidental exposure of internal data.
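Both directions of body transformation can be sketched in a few lines. The element and field names are invented, and the XML handling is deliberately naive: it flattens one level of child elements only. Python's built-in XML parsers are also not hardened against malicious input, so a gateway accepting partner XML would want a hardened parser.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical public contract: only these fields may leave the gateway.
PUBLIC_FIELDS = ("id", "name", "email")


def xml_to_json(xml_text: str) -> str:
    """Convert a flat XML document (one level of children) to a JSON object."""
    root = ET.fromstring(xml_text)
    flat = {child.tag: (child.text or "").strip() for child in root}
    return json.dumps(flat, sort_keys=True)


def trim_response(internal: dict, allowed: tuple = PUBLIC_FIELDS) -> dict:
    """Keep only fields named in the public contract; drop everything else."""
    return {field: internal[field] for field in allowed if field in internal}
```

`trim_response` is an allowlist rather than a blocklist on purpose: a new internal field stays private until someone adds it to the contract.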

API versioning: When you deprecate an endpoint, the gateway accepts old-format requests, transforms them to the new spec, and forwards them. Backward compatibility without maintaining two code paths.

Transformation is an underrated decoupling strategy. We had a billing team that needed to migrate from REST to gRPC internally. Without gateway transformation, that would have meant coordinating with every external partner to update their integrations at the same time. Instead, the gateway translated the public REST contract on the fly. Partners never knew the migration happened. The team shipped in two weeks instead of an estimated eight weeks. Every conversion the gateway handles is a dependency that does not exist between client and service.

LLM Traffic

AI Gateways

AI-driven API traffic from LLM applications, agents, and RAG pipelines behaves differently from human-driven traffic. Agents decide call frequency on their own, generate bursty load, and send token-heavy requests. Gartner projects that by 2026, over 30 percent of API demand growth will come from AI and LLM-driven tools.

This has produced the AI gateway, a specialized evolution that counts tokens instead of requests, uses semantic caching (storing responses based on prompt meaning rather than URL, so semantically similar questions can return cached answers without calling the model again), and enforces token budgets per agent, model, or department.
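The lookup side of semantic caching can be sketched as nearest-neighbor search over prompt vectors. Important caveat: `toy_embed` below is a bag-of-words stand-in that measures word overlap, not meaning; a real deployment would plug in an embedding model. The threshold and all names are assumptions.

```python
import math
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: word counts, nothing semantic.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed=toy_embed, threshold: float = 0.8):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (vector, cached response)

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        """Return the closest cached response if it clears the threshold."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

The threshold is the whole design decision: too low and users get answers to questions they did not ask, too high and the cache never hits.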

AI gateways also introduce model-level routing: directing requests to different LLM providers based on cost, latency, or capability. If the primary model is overloaded, the gateway fails over to a secondary provider. Simple requests route to a smaller, cheaper model; high-accuracy requests route to a larger one.
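A rough sketch of both ideas: tier selection and ordered failover. The providers are plain callables so nothing here pretends to be a real vendor SDK, and the rule for what counts as a "simple" request (word count) is an arbitrary assumption.

```python
from typing import Callable, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Raised by a provider callable when it is overloaded or down."""


def pick_tier(prompt: str, needs_high_accuracy: bool, small_limit_words: int = 50) -> str:
    """Route short, low-stakes prompts to the cheaper tier (assumed heuristic)."""
    if needs_high_accuracy or len(prompt.split()) > small_limit_words:
        return "large"
    return "small"


def complete_with_failover(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try providers in priority order; return (provider name, completion)."""
    last_error = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            last_error = exc  # remember why, then try the next provider
    raise ProviderUnavailable("all providers failed") from last_error
```

Only `ProviderUnavailable` triggers failover. Other exceptions, such as a rejected prompt, propagate, since retrying them against a second provider would just repeat the failure at extra cost.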

The choice between a dedicated AI gateway and extending the existing gateway with token-counting middleware depends on scale. For teams running fewer than a dozen agents, the dedicated gateway feels like over-engineering. Once you manage multiple models, department-level budgets, and agents that retry on their own, dedicated gateway tooling starts to make sense. The AI gateway market reflects the urgency: $3.9 billion in 2024, projected to reach $9.8 billion by 2031.
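For a sense of what token-counting middleware amounts to at small scale, here is a minimal per-owner budget. It assumes token counts arrive from the model provider's usage data; the class name and the owners are hypothetical.

```python
class TokenBudget:
    """Tracks token spend per owner (agent, model, or department)."""

    def __init__(self, limits: dict):
        self.limits = dict(limits)
        self.used = {owner: 0 for owner in limits}

    def remaining(self, owner: str) -> int:
        return self.limits[owner] - self.used[owner]

    def try_spend(self, owner: str, prompt_tokens: int, completion_tokens: int) -> bool:
        """Record the spend if it fits the budget; refuse it otherwise."""
        if owner not in self.limits:
            return False  # unknown owners get nothing by default
        cost = prompt_tokens + completion_tokens
        if cost > self.remaining(owner):
            return False
        self.used[owner] += cost
        return True
```

Note the unit: one request can cost ten tokens or ten thousand, which is why a request counter alone does not bound spend.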

Token-aware control

AI gateways measure prompts, completions, and budgets rather than just counting HTTP requests and bytes.

Semantic caching

Caching shifts from exact URL matching to similarity-aware reuse of responses for prompt-driven systems.

Model-level routing

The gateway can choose providers and model sizes based on cost, latency, capacity, or task complexity.

Topology Choices

Architecture Patterns

Centralized edge gateway: All traffic through one gateway. Simplest to secure and audit. The right starting point for most teams new to microservices. It can become a bottleneck as services grow, and configuration accumulates complexity. The classic weakness is single-point-of-failure risk, mitigated by running multiple instances behind a load balancer.

Backend-for-Frontend (BFF): A dedicated gateway per client type. Mobile gets smaller responses, fewer round trips, and aggressive caching. Web gets richer data structures and WebSocket support. Partners get strict contract enforcement and detailed audit logging. This prevents the one-size-fits-all problem where mobile clients receive bloated desktop payloads and partner integrations inherit consumer-facing rate limits. In the health-tech case below, the mobile BFF cut average payload size by roughly 40% and reduced round trips from three to one.

Dual-layer: External edge gateway for public APIs, separate internal gateway for service-to-service communication. The external enforces strict security, transformation, and throttling; the internal focuses on service discovery and load balancing with lighter overhead (though in zero-trust architectures, the internal gateway still validates mTLS). This is a system with an outer wall and an inner wall.

Microgateway: Lightweight gateways alongside individual services or small clusters. Each service has its own tiny gateway handling its specific concerns. Reduces single points of failure and enables independent scaling, but increases operational complexity: more gates to monitor, more configurations to keep consistent, more surfaces to secure.

Most mature architectures blend these patterns: a centralized edge gateway for external traffic, a BFF layer for client customization, and service meshes (infrastructure layers managing service-to-service communication through sidecar proxies) for internal routing. Match the pattern to the layer, and resist solving every problem at the same gate.

Pattern	Best use	Tradeoff
Centralized edge gateway	Simplest secure starting point for many external APIs.	Can become a scaling and configuration bottleneck if it grows without structure.
Backend-for-Frontend	Optimizes payloads and behavior for specific client types such as mobile, web, or partner integrations.	Adds more gateway surfaces and ownership boundaries to manage.
Dual-layer	Separates strict public-edge controls from internal service-to-service traffic management.	Requires clarity about which concerns belong on which layer.
Microgateway	Useful when services need highly specific edge behavior and independent scaling.	More gateways to keep consistent, secure, and monitored, which raises operational complexity.

Failure Handling

Resilience and Observability

A gateway that falls over under load is worse than no gateway. When the gateway collapses, the entire system goes dark.

Circuit breakers stop sending traffic to a failing downstream service and return fallback responses (cached results, graceful errors, default values).
Retry policies handle transient failures with exponential backoff, though aggressive retries amplify the overload they are trying to survive.
Timeouts prevent the gateway from waiting forever. A payment call might get 10 seconds. A health check gets 500 milliseconds. Without timeouts, slow services consume gateway connections like a traffic jam that never clears.
Connection pooling reuses connections to backend services, reducing latency and resource consumption.
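Of the four mechanisms above, the circuit breaker is the least obvious in code. A simplified sketch follows, with the clock passed in and a single trial call allowed once the cool-down expires (the "half-open" state). The thresholds and names are illustrative.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, upstream, fallback, now: float):
        """Run upstream() unless the circuit is open; use fallback() on failure."""
        if self.opened_at is not None and now - self.opened_at < self.reset_after:
            # Open: skip the upstream entirely and answer from the fallback.
            return fallback()
        # Closed, or half-open after the cool-down: let this call through.
        try:
            result = upstream()
        except Exception:
            self.failures += 1
            tripped = self.failures >= self.failure_threshold
            if self.opened_at is not None or tripped:
                self.opened_at = now  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The important property is that while the circuit is open, the failing service receives no traffic at all, which gives it room to recover instead of being hammered by requests it cannot serve.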

Health checks complete the picture. The gateway should actively probe downstream services and remove unhealthy ones from the routing table before they cause visible failures.

For preventing the gateway itself from becoming a single point of failure: deploy multiple instances behind a load balancer, spread across availability zones. Keep the gateway stateless so any instance can handle any request. Most managed gateway services (AWS API Gateway, Azure API Management) handle this transparently.

The gateway is the first and last point a request touches. If you instrument only one layer, instrument this one.

Myths

Myths That Waste Engineering Time

"The gateway is just a reverse proxy." A reverse proxy forwards requests. A gateway authenticates, transforms, rate-limits, circuit-breaks, translates protocols, and enforces contracts. The difference is significant.

"Put all the logic in the gateway." The moment your gateway knows about order discount rules or notification preferences, you have coupled infrastructure to domain logic, and deployments grind to a halt. The gateway handles cross-cutting infrastructure concerns: identity, traffic management, protocol translation. Not business logic.

"One gateway is always enough." For a small system, yes. For a system with mobile, web, partner, and AI consumers each with different payloads, latency tolerances, and security models, a single gateway becomes an overloaded chokepoint.

"API keys are sufficient for production auth." API keys identify a client. They do not expire gracefully, carry no claims, and offer no defense if leaked beyond revocation.

"Transformation at the gateway is a performance bottleneck." For trivial transformations (header rewriting, path mapping, claim extraction) the overhead is negligible. For heavy transformations (XML-to-JSON on large payloads) it adds latency, but that cost is almost always less than the organizational cost of coupling clients directly to service internals.

Gateways are more than proxies

They combine identity, traffic policy, transformation, and resilience, not just forwarding.

Infrastructure logic only

Cross-cutting concerns belong at the gateway. Business rules do not.

Performance tradeoffs are contextual

Small transformations cost little, and even larger ones often cost less than tight client-service coupling.

Case Study

How It Played Out

A mid-size health-tech company with around 140 engineers had grown to 60 microservices. Each managed its own auth. Fourteen services parsed JWTs independently using three different libraries. Two had quietly fallen behind on updates and accepted expired tokens.

Phase 1 was supposed to take two weeks. It took five. We assumed deploying the gateway in front of the existing services would be a clean cutover, but three services were doing non-standard things with their JWT validation that we didn't discover until traffic shifted. One service was extracting custom claims from the token body and using them for tenant routing. Another had implemented its own token refresh logic that conflicted with the gateway's behavior. We had to add claim-forwarding headers we hadn't planned for and write a compatibility shim for the refresh flow. Once those were resolved, every inbound request hit the gateway, which validated the JWT and forwarded identity headers downstream. The two services with stale libraries were fixed the moment the gateway went live.

Phase 2 added routing. The mobile team got a BFF gateway that aggregated three service calls into one, cutting payload size by roughly 40% and reducing mobile round trips from three to one.

Phase 3 introduced transformation for the billing team's REST-to-gRPC migration. The billing team had estimated eight weeks for a coordinated client migration. Instead, the gateway translated the public REST contract to gRPC on the fly. External partners never knew the migration happened. The team shipped in two weeks.

Six months in: authentication incidents dropped to zero, down from about 2.3 per quarter. Mean time to onboard a new partner integration fell from 14 days to 3 days, because the gateway handled format translation and the partner never needed to learn internal conventions. Gateway-level metrics caught a failing pharmacy-lookup service 90 seconds before it would have been visible to patients, because the circuit breaker tripped and returned cached results while the service restarted.

The CTO's summary at the retrospective: "We spent three years arguing about whose job it was to check credentials and translate formats. The gateway made it nobody's job and everybody's problem went away."

Phase 1: auth centralization

Edge validation replaced fourteen independent JWT implementations, including stale ones that had been accepting expired tokens.

Phase 2: routing and BFF

A mobile-specific BFF reduced payload size and client round trips by aggregating calls at the edge.

Phase 3: transformation

REST-to-gRPC translation at the gateway removed the need for synchronized partner rewrites.

Operational result

Authentication incidents disappeared, integration onboarding sped up, and gateway metrics exposed failing services before patients noticed.

Takeaways

Authentication, rate limiting, and protocol translation belong at the gateway because centralizing them eliminates inconsistency and reduces security surface area.
Authentication and authorization are different jobs. The gateway validates identity; fine-grained business authorization belongs in the domain service.
Transformation is a decoupling strategy. Every conversion the gateway handles is a dependency that does not exist between client and service.
AI traffic demands token counting, semantic caching, and model-level routing, and these needs are growing faster than traditional API traffic.
One gateway pattern does not fit all. Match the pattern to the trust boundary and consumer type.
Resilience at the gateway is non-negotiable. Circuit breakers, timeouts, retries, and health checks prevent it from becoming the single point of failure it was designed to eliminate.
Keep business logic out of the gateway. Infrastructure concerns in, domain logic out.
Observability starts at the gate. The gateway sees every request, so it's where to instrument first.
