GraphQL in Production 2026: Schema Design, DataLoader, Persisted Queries, and Federation

Introduction
GraphQL turned ten in 2025, and the ecosystem has finally caught up to its ambitions. What was once an API curiosity driven by Facebook's mobile needs is now the default choice for any system where the client's data requirements are complex, varied, or rapidly evolving. In 2026, the conversation has shifted from "should we use GraphQL?" to "how do we run it properly at scale?"
The pitch is familiar: one endpoint, clients ask for exactly what they need, no over-fetching, no under-fetching. Compared to REST, GraphQL eliminates the proliferation of specialized endpoints — /users/:id/posts/recent-with-authors and friends — and puts the query structure in the client's hands. That matters most when you have mobile clients on slow networks, multiple frontends (web, iOS, Android, internal tools) with different data shapes, or a team structure where frontend and backend move independently.
Where GraphQL still loses to REST: simple CRUD APIs with predictable data shapes, systems where HTTP caching is non-negotiable, teams without the tooling investment to manage schema evolution, or anywhere the operational overhead of a schema registry and query planner is not justified by the complexity saved. REST with OpenAPI and a good client generator solves most of what REST developers reach for GraphQL to fix. Choose your weapons deliberately.
But for complex, multi-client, multi-team systems, GraphQL wins on ergonomics — and that is increasingly where production systems live. The patterns in this post reflect what actually works at load: schema design choices that age well, DataLoader as the mandatory antidote to the N+1 problem, persisted queries as the production security boundary, and federation as the path to scaling schema ownership across teams.
The N+1 problem is the fulcrum. If you deploy GraphQL without DataLoader and your schema has any relationship fields at all, you will hit it immediately in production. A list of 100 posts, each with an author resolved by a separate DB query, produces 101 database round-trips instead of 2. At scale that is the difference between a 40ms response and a 4-second one. Every other optimization in this post builds on getting that right first.
1. Schema Design for Production
A GraphQL schema is a long-lived contract. Unlike a REST endpoint you can quietly change, a schema is introspectable — clients query it to understand what is available. Decisions made on day one compound over years. These are the ones that matter.
Nullability Strategy
The GraphQL spec defaults fields to nullable. The community has divided itself into two camps: nullable-by-default (the spec's intent) versus non-null-by-default (the pragmatic camp).
The nullable-by-default argument: partial results are a first-class GraphQL feature. If one resolver fails, the query can still return the rest. Making fields non-null means one resolver error propagates up to the nearest nullable parent, potentially nulling out entire subtrees.
The non-null-by-default argument: nullable types in generated TypeScript clients produce T | null | undefined everywhere, and clients have to defensively null-check fields that will never actually be null. This erodes code quality fast.
The production answer: be deliberate, not dogmatic. Mark fields non-null only when you can contractually guarantee they will always have a value. Keep relationship fields and computed fields nullable so partial failure degrades gracefully. Never mark a field non-null if the resolver can legitimately return null due to data state or access control.
# Good: id is always present, name may be missing on legacy records
type User {
  id: ID!         # Non-null: always exists
  name: String    # Nullable: may be empty on legacy accounts
  email: String!  # Non-null: required at registration
  posts: [Post!]  # Nullable list: null means "failed to load", [] means "no posts"
}
The distinction between [Post!] (non-null items, nullable list), [Post]! (null items allowed, list itself non-null), and [Post!]! (nothing nullable) matters. Pick the one that reflects the actual contract.
Input Types vs Inline Arguments
For mutations with more than two or three arguments, always use input types:
# Bad: inline args on the mutation field don't compose, don't reuse,
# and every addition churns the signature
type Mutation {
  createPost(
    title: String!
    body: String!
    authorId: ID!
    publishAt: DateTime
    tags: [String!]
  ): Post
}
# Good: input type is reusable, versionable, and documented
input CreatePostInput {
  title: String!
  body: String!
  authorId: ID!
  publishAt: DateTime
  tags: [String!]
}

mutation CreatePost($input: CreatePostInput!) {
  createPost(input: $input) {
    post { id title }
    errors { field message }
  }
}
The mutation result pattern — returning both the created object and a structured errors array — is critical. It lets clients handle validation errors without catching GraphQL errors, which are a separate concern.
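On the client side, this pattern means checking the errors array before using the payload. A minimal sketch (the payload shape mirrors the CreatePostPayload pattern above; the handler name and types are illustrative, not from any client library):

```typescript
interface FieldError {
  field: string;
  message: string;
}

interface CreatePostPayload {
  post: { id: string; title: string } | null;
  errors: FieldError[];
}

// Validation failures arrive as data, not as thrown GraphQL errors,
// so the client can map them straight back onto form fields.
function describeResult(payload: CreatePostPayload): string {
  if (payload.errors.length > 0) {
    return payload.errors.map(e => `${e.field}: ${e.message}`).join('; ');
  }
  // errors is empty, so the created post is expected to be present
  return `Created post ${payload.post?.id}`;
}
```
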
Connection Pattern for Pagination
Never return raw arrays for paginated collections. The Relay Connection spec is the production standard:
type PostConnection {
  edges: [PostEdge!]!
  pageInfo: PageInfo!
  totalCount: Int!
}

type PostEdge {
  node: Post!
  cursor: String!
}

type PageInfo {
  hasNextPage: Boolean!
  hasPreviousPage: Boolean!
  startCursor: String
  endCursor: String
}

type Query {
  posts(first: Int, after: String, last: Int, before: String): PostConnection!
}
Cursor-based pagination is O(1) regardless of page depth. Offset-based pagination (page: 3, limit: 20) forces the database to scan and discard every skipped row, so cost grows linearly with depth, and it shifts results when records are inserted mid-browse. Cursors avoid both problems. The verbosity of the connection pattern pays off in client predictability.
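The server-side helpers this implies, opaque cursor encoding plus slicing an over-fetched page, fit in a few lines. This sketch uses the same names (buildConnection, decodeCursor) as the resolver examples later in this post, but the implementation is illustrative, not canonical:

```typescript
interface Edge<T extends { id: string }> {
  node: T;
  cursor: string;
}

interface Connection<T extends { id: string }> {
  edges: Edge<T>[];
  pageInfo: {
    hasNextPage: boolean;
    hasPreviousPage: boolean;
    startCursor: string | null;
    endCursor: string | null;
  };
}

// Cursors are opaque to clients; base64 of the row id is the simplest scheme
const encodeCursor = (id: string) => Buffer.from(id, 'utf8').toString('base64');
const decodeCursor = (cursor: string | null | undefined) =>
  cursor ? Buffer.from(cursor, 'base64').toString('utf8') : null;

// rows were fetched with LIMIT first + 1; the extra row signals another page
function buildConnection<T extends { id: string }>(rows: T[], first: number): Connection<T> {
  const hasNextPage = rows.length > first;
  const page = rows.slice(0, first);
  const edges = page.map(node => ({ node, cursor: encodeCursor(node.id) }));
  return {
    edges,
    pageInfo: {
      hasNextPage,
      hasPreviousPage: false, // forward-only pagination in this sketch
      startCursor: edges[0]?.cursor ?? null,
      endCursor: edges[edges.length - 1]?.cursor ?? null,
    },
  };
}
```

The over-fetch-by-one trick avoids a separate COUNT query just to compute hasNextPage.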
Union Types and Interfaces
Use interfaces when types share fields and behavior. Use unions when types are fundamentally different but appear in the same position:
interface Node {
  id: ID!
}

interface Auditable {
  createdAt: DateTime!
  updatedAt: DateTime!
}

type User implements Node & Auditable {
  id: ID!
  createdAt: DateTime!
  updatedAt: DateTime!
  email: String!
}

# Union for a search result that can be multiple disjoint types
union SearchResult = User | Post | Comment | Tag
Schema Versioning with @deprecated
Never remove a field without a deprecation window. The @deprecated directive is your migration tool:
type User {
  id: ID!
  username: String! @deprecated(reason: "Use `handle` instead. Will be removed 2026-12-01.")
  handle: String!
  fullName: String @deprecated(reason: "Split into `firstName` and `lastName`.")
  firstName: String
  lastName: String
}
Introspection surfaces these deprecations. Client generators (GraphQL Codegen, Relay) can be configured to warn on deprecated field usage at build time, giving you a concrete migration signal without breaking existing clients.
Full Schema Example
type Query {
  user(id: ID!): User
  post(id: ID!): Post
  posts(first: Int, after: String): PostConnection!
  search(query: String!): [SearchResult!]!
}

type Mutation {
  createPost(input: CreatePostInput!): CreatePostPayload!
  updatePost(id: ID!, input: UpdatePostInput!): UpdatePostPayload!
  deletePost(id: ID!): DeletePostPayload!
  addComment(input: AddCommentInput!): AddCommentPayload!
}

type User implements Node & Auditable {
  id: ID!
  handle: String!
  email: String!
  firstName: String
  lastName: String
  posts(first: Int, after: String): PostConnection!
  createdAt: DateTime!
  updatedAt: DateTime!
}

type Post implements Node & Auditable {
  id: ID!
  title: String!
  body: String!
  author: User!
  comments(first: Int, after: String): CommentConnection!
  tags: [String!]!
  publishedAt: DateTime
  createdAt: DateTime!
  updatedAt: DateTime!
}

type Comment implements Node & Auditable {
  id: ID!
  body: String!
  author: User!
  post: Post!
  createdAt: DateTime!
  updatedAt: DateTime!
}

union SearchResult = User | Post | Comment

2. The N+1 Problem and DataLoader
The N+1 problem is not a GraphQL-specific bug — it exists in any ORM with lazy loading. But GraphQL makes it worse because the resolver tree hides it. Each resolver is a small, isolated function that fetches data for one node. Composing them naively means each field on a list of N items fires its own query.
What N+1 Looks Like
// This looks innocent
const resolvers = {
  Query: {
    posts: () => db.query('SELECT * FROM posts LIMIT 100'),
  },
  Post: {
    // Called once per post — 100 posts = 100 separate author queries
    author: (post) => db.query('SELECT * FROM users WHERE id = $1', [post.authorId]),
  },
};
A request for 100 posts with their authors fires:
- 1 query: SELECT * FROM posts LIMIT 100
- 100 queries: SELECT * FROM users WHERE id = ? — once per post
Total: 101 queries. With DataLoader: 2 queries. At 100 posts, that is a 50x reduction in database round-trips. At 1,000 posts, it is 500x.
DataLoader Batching Mechanism
DataLoader works by deferring individual load calls until the end of the current event loop tick, collecting all requested IDs, then firing a single batch function. The per-request cache prevents duplicate fetches within the same request lifecycle.
import DataLoader from 'dataloader';
import { Pool } from 'pg';

// Batch function: receives array of IDs, returns array of results in same order
async function batchUsers(
  db: Pool,
  userIds: readonly string[]
): Promise<(User | Error)[]> {
  const { rows } = await db.query<User>(
    'SELECT * FROM users WHERE id = ANY($1::uuid[])',
    [userIds]
  );
  // DataLoader requires results in the SAME ORDER as input keys
  const userMap = new Map(rows.map(u => [u.id, u]));
  return userIds.map(id => userMap.get(id) ?? new Error(`User ${id} not found`));
}

// Factory: create a new DataLoader per request (never singleton)
export function createLoaders(db: Pool) {
  return {
    userById: new DataLoader<string, User>(
      (ids) => batchUsers(db, ids),
      {
        // Cache is scoped to this DataLoader instance (per-request)
        cache: true,
        // Maximum batch size — tune based on DB max_query_params
        maxBatchSize: 1000,
      }
    ),
    commentsByPostId: new DataLoader<string, Comment[]>(
      async (postIds) => {
        const { rows } = await db.query<Comment>(
          'SELECT * FROM comments WHERE post_id = ANY($1::uuid[])',
          [postIds]
        );
        // Group by post_id, return in input order
        const grouped = new Map<string, Comment[]>();
        for (const comment of rows) {
          const list = grouped.get(comment.postId) ?? [];
          list.push(comment);
          grouped.set(comment.postId, list);
        }
        return postIds.map(id => grouped.get(id) ?? []);
      }
    ),
  };
}

export type Loaders = ReturnType<typeof createLoaders>;
Per-Request Instantiation
This is the most common DataLoader mistake in production: creating DataLoader as a singleton. A singleton's cache persists across requests, which means:
- User A requests post 42. DataLoader caches it.
- User B requests post 42. Gets User A's cached result — even if permissions differ.
- Post 42 is updated. Cache returns the stale version indefinitely.
Always instantiate DataLoader inside request context:
// Apollo Server context function — runs once per request
const server = new ApolloServer({
  typeDefs,
  resolvers,
  context: ({ req }): AppContext => ({
    db,
    user: extractUser(req),
    loaders: createLoaders(db), // Fresh instance per request
  }),
});
Using DataLoader in Resolvers
const resolvers: Resolvers<AppContext> = {
  Query: {
    posts: async (_parent, { first = 20, after }, { db }) => {
      const { rows } = await db.query<Post>(
        `SELECT * FROM posts
         WHERE ($1::uuid IS NULL OR id < $1::uuid)
         ORDER BY id DESC
         LIMIT $2`,
        [decodeCursor(after), first + 1]
      );
      return buildConnection(rows, first);
    },
  },
  Post: {
    // No N+1: DataLoader batches all author loads from this request tick
    author: async (post, _args, { loaders }) => {
      return loaders.userById.load(post.authorId);
    },
    comments: async (post, { first = 10, after }, { loaders }) => {
      const comments = await loaders.commentsByPostId.load(post.id);
      return buildConnection(paginateComments(comments, after, first), first);
    },
  },
  Comment: {
    // Also batched — DataLoader catches this nested resolver too
    author: async (comment, _args, { loaders }) => {
      return loaders.userById.load(comment.authorId);
    },
  },
};
The key insight: loaders.userById.load() does not fire a query immediately. It schedules the load. After all synchronous resolver code for this tick completes, DataLoader calls the batch function with all accumulated IDs. This works across nested resolvers — the Post author loads and Comment author loads are batched together if they occur in the same event loop tick.
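The mechanism is small enough to sketch. This toy loader (illustrative only — the real dataloader package also handles caching, error entries, and scheduler edge cases) defers loads until the current synchronous work finishes, then fires one batch call:

```typescript
type BatchFn<K, V> = (keys: readonly K[]) => Promise<V[]>;

class TinyLoader<K, V> {
  private queue: { key: K; resolve: (v: V) => void }[] = [];

  constructor(private batchFn: BatchFn<K, V>) {}

  load(key: K): Promise<V> {
    return new Promise<V>(resolve => {
      if (this.queue.length === 0) {
        // First load this tick: schedule ONE flush after sync code completes
        process.nextTick(() => this.flush());
      }
      this.queue.push({ key, resolve });
    });
  }

  private async flush(): Promise<void> {
    const batch = this.queue;
    this.queue = [];
    // One batch call for every key accumulated during this tick
    const values = await this.batchFn(batch.map(b => b.key));
    batch.forEach((b, i) => b.resolve(values[i]));
  }
}
```

Three load calls issued in the same tick produce a single batchFn invocation, which is exactly why the 101-query request above collapses to 2.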
3. Persisted Queries and Security
A public GraphQL endpoint accepting arbitrary queries is an invitation for abuse. An attacker can send deeply nested queries, field explosion attacks, or resource-exhausting introspection queries. Persisted queries are the production answer.
The Arbitrary Query Problem
The developer experience of GraphQL — write any query, get exactly that data — is also the attack surface. Consider:
# Deeply nested query — exponential resolver tree
{
  user(id: "1") {
    friends {
      friends {
        friends {
          friends {
            posts { comments { author { posts { comments { author { id } } } } } }
          }
        }
      }
    }
  }
}
This resolves to a tree with thousands of nodes. Without protection, a single request like this can saturate your server.
Automatic Persisted Queries (APQ)
APQ (Apollo's protocol, supported by most clients) is a short negotiation:
- Client sends the SHA-256 hash of the query, without the query itself
- Server looks up the hash in its registry; if found, it executes. If not, it responds with PERSISTED_QUERY_NOT_FOUND
- Client re-sends the full query plus the hash; the server stores the mapping and executes
After the first round-trip, subsequent requests send only the hash — smaller payloads, faster network round-trips, and critically: in production you can disable new query registration and only accept known hashes.
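The hash is simply the SHA-256 hex digest of the exact query string, carried in an extensions field. A sketch of the client-side payload construction (the field names follow Apollo's APQ protocol; the helper name is illustrative):

```typescript
import { createHash } from 'crypto';

// APQ extension: protocol version + SHA-256 hex digest of the query text
function apqExtensions(query: string) {
  const sha256Hash = createHash('sha256').update(query).digest('hex');
  return { persistedQuery: { version: 1, sha256Hash } };
}

const query = '{ user(id: "1") { id handle } }';

// Optimistic request: hash only, no query body
const firstTry = JSON.stringify({ extensions: apqExtensions(query) });

// Retry after PERSISTED_QUERY_NOT_FOUND: full query + hash
const retry = JSON.stringify({ query, extensions: apqExtensions(query) });
```

Client libraries do this automatically; the sketch just shows what crosses the wire.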
import { createServer } from '@graphql-yoga/node';
import { usePersistedOperations } from '@graphql-yoga/plugin-persisted-operations';

// In production: load from a static file generated at build time
const persistedQueries = new Map<string, string>(
  Object.entries(require('./persisted-queries.json'))
);

const server = createServer({
  schema,
  plugins: [
    usePersistedOperations({
      getPersistedOperation(sha256Hash: string) {
        return persistedQueries.get(sha256Hash) ?? null;
      },
      // In production: reject unknown queries entirely
      allowArbitraryOperations: process.env.NODE_ENV !== 'production',
    }),
  ],
});
Generate the persisted queries map at build time with GraphQL Codegen or Relay compiler, then deploy it alongside your server. New queries require a deploy — which is the right constraint. It means your production server only executes queries your own clients wrote.
Query Depth and Complexity Limiting
Even with APQ, defense in depth matters. For development environments and internal APIs that accept arbitrary queries:
import { createComplexityRule, fieldExtensionsEstimator, simpleEstimator } from 'graphql-query-complexity';
import depthLimit from 'graphql-depth-limit';

const server = createServer({
  schema,
  validationRules: [
    // Reject queries nested deeper than 7 levels
    depthLimit(7),
    // Reject queries scoring above 1000 complexity points
    createComplexityRule({
      maximumComplexity: 1000,
      estimators: [
        // Reads per-field cost functions from schema extensions (see below)
        fieldExtensionsEstimator(),
        // Fallback: 1 point per field
        simpleEstimator({ defaultComplexity: 1 }),
      ],
      onComplete(complexity) {
        console.log(`Query complexity: ${complexity}`);
      },
    }),
  ],
});
Mark expensive fields in the schema extensions:
const PostType = new GraphQLObjectType({
  name: 'Post',
  fields: {
    comments: {
      type: CommentConnectionType,
      extensions: {
        // Lists multiply: this field costs 10x its children's complexity
        complexity: ({ childComplexity }) => childComplexity * 10,
      },
    },
  },
});
Disabling Introspection in Production
Introspection reveals your entire schema to anyone who can reach the endpoint. Disable it in production after your client tooling has generated its types:
import { NoSchemaIntrospectionCustomRule } from 'graphql';

const server = createServer({
  schema,
  validationRules: process.env.NODE_ENV === 'production'
    ? [NoSchemaIntrospectionCustomRule]
    : [],
});
Field-level authorization belongs in resolvers or middleware, not schema definitions. Use a pattern like:
const resolvers = {
  User: {
    email: (user, _args, { currentUser }) => {
      // Only the user themselves or admins can see email
      // (guard for unauthenticated requests, where currentUser is undefined)
      if (!currentUser || (currentUser.id !== user.id && currentUser.role !== 'ADMIN')) {
        return null; // Return null for nullable, throw for non-null
      }
      return user.email;
    },
  },
};
4. Federation and the Supergraph
When your company has multiple teams each owning a service, a monolithic GraphQL schema becomes a coordination problem. Federation solves this by composing independently deployed subgraphs into a single supergraph at the router layer — clients see one API, teams own their domains.
Subgraph Architecture
Each team owns a subgraph: a complete, independently deployable GraphQL service that handles one domain. The router (Apollo Router or GraphQL Hive Gateway) fetches from each subgraph and stitches results together:
Client → Router (supergraph) → Users Subgraph
                             → Products Subgraph
                             → Orders Subgraph
Each subgraph can reference entities from other subgraphs using the @key directive without importing the full schema.
The @key Directive and Entity References
# users-subgraph: owns the User type
type User @key(fields: "id") {
id: ID!
handle: String!
email: String!
}
# orders-subgraph: references User without owning it
extend type User @key(fields: "id") {
id: ID! @external
orders(first: Int): OrderConnection!
}
type Order @key(fields: "id") {
id: ID!
userId: ID!
user: User!
totalAmount: Float!
status: OrderStatus!
createdAt: DateTime!
}
The orders subgraph declares User as an external entity it can extend. When a client queries order.user.handle, the router fetches Order from the orders subgraph, extracts the userId, then fetches User from the users subgraph — transparently to the client.
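Under the hood, the router resolves the reference by sending the users subgraph an _entities query whose variables carry "representation" objects: the __typename plus the @key fields. This is the federation spec's wire format, shown here only for illustration; you never write it by hand:

```graphql
# Router → users-subgraph: resolve these User references
query ($representations: [_Any!]!) {
  _entities(representations: $representations) {
    ... on User {
      handle
    }
  }
}

# variables:
# { "representations": [{ "__typename": "User", "id": "u-123" }] }
```

The representation object is exactly what the reference resolver below receives as its first argument.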
Reference Resolvers
Each subgraph that defines a @key type must implement a __resolveReference resolver:
// users-subgraph resolvers
const resolvers = {
User: {
// Called by the router when another subgraph references a User by id
__resolveReference: async (reference: { id: string }, { loaders }: AppContext) => {
return loaders.userById.load(reference.id);
},
// Normal field resolvers
posts: async (user, { first = 20 }, { loaders }) => {
return loaders.postsByUserId.load(user.id);
},
},
};
// orders-subgraph resolvers
const orderResolvers = {
Order: {
__resolveReference: async (ref: { id: string }, { db }) => {
const { rows } = await db.query('SELECT * FROM orders WHERE id = $1', [ref.id]);
return rows[0];
},
user: (order: Order) => ({ __typename: 'User', id: order.userId }),
},
User: {
// Extends User with order data — runs in orders subgraph context
orders: async (user: { id: string }, { first = 20 }, { loaders }) => {
return loaders.ordersByUserId.load(user.id);
},
},
};
@external, @requires, @provides
These directives handle cases where a resolver in one subgraph needs a field owned by another:
# shipping-subgraph needs the user's address to calculate shipping
extend type User @key(fields: "id") {
id: ID! @external
address: String @external # Owned by users-subgraph
shippingEstimate: Float @requires(fields: "address") # Needs address at resolve time
}
The @requires directive tells the router: before calling the shippingEstimate resolver on this subgraph, fetch address from the users subgraph and include it in the reference object.
@provides is the inverse — a subgraph can declare that it can provide certain fields from another entity, avoiding a round-trip to the owning subgraph when those fields are already available in the response.
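An illustrative @provides declaration (a hypothetical reviews subgraph, not part of the schema above): if the reviews data store already joins in the author's handle, the subgraph can serve it directly and the router skips the round-trip to the users subgraph when only that field is selected:

```graphql
# reviews-subgraph: its backing query already returns the author's handle
type Review @key(fields: "id") {
  id: ID!
  body: String!
  author: User! @provides(fields: "handle")
}

extend type User @key(fields: "id") {
  id: ID! @external
  handle: String! @external
}
```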
When Federation Is Worth It
Federation adds real operational complexity: a router process, a schema registry, composition validation, and distributed tracing across subgraphs. It pays off when:
- You have 3+ teams that need to evolve their schemas independently
- You are experiencing merge conflicts and coordination overhead on a shared schema repo
- Different subgraphs have meaningfully different scaling requirements
It does not pay off for a small team (under 5 engineers) or a single service. For single-service architectures, a modular single schema (Pothos modules or NestJS GraphQL modules) gives you the organizational benefits without the operational overhead.

5. Subscriptions and Real-Time
GraphQL subscriptions give clients a way to receive pushed updates using the same query language as regular operations. The two transport options differ significantly in production operational profile.
WebSocket-Based Subscriptions
The graphql-ws protocol (successor to the deprecated subscriptions-transport-ws) is the standard WebSocket implementation:
import { createServer as createHttpServer } from 'http';
import { createServer } from '@graphql-yoga/node';
import { useServer } from 'graphql-ws/lib/use/ws';
import { WebSocketServer } from 'ws';

const yoga = createServer({ schema });
const httpServer = createHttpServer(yoga);

const wsServer = new WebSocketServer({
  server: httpServer,
  path: '/graphql',
});

useServer({ schema }, wsServer);
WebSockets are stateful connections — every open subscription holds a connection. At 10,000 concurrent subscribers, you are holding 10,000 TCP connections. This is manageable, but it means your GraphQL server cannot be stateless; load balancers must use sticky sessions or connection-aware routing.
Server-Sent Events (SSE) — Lighter Weight
SSE uses standard HTTP — unidirectional push from server to client over a long-lived HTTP response. With HTTP/2 multiplexing it avoids the old per-domain connection limits, it does not require WebSocket upgrades, and it is simpler to scale behind standard load balancers:
// GraphQL Yoga supports SSE subscriptions out of the box
// Client uses EventSource or fetch with stream reading
const yoga = createServer({
schema,
// Yoga defaults to SSE for subscriptions when client requests it
});
For most subscription use cases (notifications, feed updates, status changes), SSE is simpler to operate than WebSockets. Use WebSockets when you need bidirectional communication beyond what GraphQL subscriptions provide.
Subscription Resolver with Async Iterator
import { PubSub, withFilter } from 'graphql-subscriptions';

const pubsub = new PubSub();

const resolvers = {
  Subscription: {
    commentAdded: {
      // Filter: only send to subscribers watching this specific post
      subscribe: withFilter(
        () => pubsub.asyncIterator(['COMMENT_ADDED']),
        (payload: { commentAdded: Comment }, variables: { postId: string }) => {
          return payload.commentAdded.postId === variables.postId;
        }
      ),
      resolve: (payload: { commentAdded: Comment }) => payload.commentAdded,
    },
  },
  Mutation: {
    addComment: async (_parent, { input }, { db, loaders }) => {
      const { rows } = await db.query(
        'INSERT INTO comments (body, author_id, post_id) VALUES ($1, $2, $3) RETURNING *',
        [input.body, input.authorId, input.postId]
      );
      const comment = rows[0];
      // Publish to all subscribers
      pubsub.publish('COMMENT_ADDED', { commentAdded: comment });
      return { comment };
    },
  },
};
Scaling with Redis Pub/Sub
The in-memory PubSub above only works for single-instance deployments. With multiple server instances, a comment added via instance A never reaches subscribers connected to instance B. Redis pub/sub is the standard broadcast layer:
import { RedisPubSub } from 'graphql-redis-subscriptions';
import Redis from 'ioredis';

const pubsub = new RedisPubSub({
  publisher: new Redis({ host: process.env.REDIS_HOST }),
  subscriber: new Redis({ host: process.env.REDIS_HOST }),
});

// Replace the in-memory PubSub with RedisPubSub — same API
// Now publishes fan out to all server instances via Redis
Redis pub/sub is eventually consistent and at-most-once delivery. For strong guarantees (at-least-once, ordering), use Kafka or a message queue as the event backbone, with pub/sub only for the final WebSocket fan-out hop.
When Subscriptions Beat Polling
Polling at one-second intervals for 1,000 clients means 1,000 requests/second to your GraphQL server — 86.4 million requests/day — most of which return empty results. Subscriptions invert this: events flow only when data changes. For applications with change rates below 1 event per second per subscriber, subscriptions dramatically reduce server load. For high-frequency data (>10 updates/second per subscriber), consider whether WebSocket raw streaming or SSE with delta encoding is more appropriate than GraphQL subscriptions.
6. Production Considerations
Tracing with OpenTelemetry
Resolver-level tracing tells you exactly which field is slow — not just which request:
import { useOpenTelemetry } from '@envelop/opentelemetry';
import { NodeTracerProvider } from '@opentelemetry/node';

const provider = new NodeTracerProvider();
provider.register();

const server = createServer({
  schema,
  plugins: [
    useOpenTelemetry({
      resolvers: true, // Span per resolver call
      variables: true, // Include query variables in spans
      document: true,  // Include query document in spans
      result: false,   // Don't include result data (PII risk)
    }),
  ],
});
With resolver-level spans, your trace shows: query.posts (12ms) → Post.author [DataLoader] (2ms batched) → db.query (18ms). You can see at a glance whether slowness is in the resolver logic, the DataLoader batch, or the database query.
Caching Strategy
GraphQL's single-endpoint pattern breaks standard HTTP caching. The fix is multi-layered:
- Persisted queries + GET requests: APQ queries sent via HTTP GET can be cached by CDN. This only works for queries (not mutations), but it covers the majority of traffic.
- DataLoader: Per-request in-memory cache. Not cross-request, but eliminates duplicate fetches within a single response.
- Response cache plugin: Cache entire query results keyed by query + variables + user role. Use with care — cache invalidation is hard, and cached responses can leak data across users if the cache key does not account for authorization context.
import { useResponseCache } from '@graphql-yoga/plugin-response-cache';

useResponseCache({
  session: (request) => {
    // Cache key includes user role — never mix user-specific data
    const user = extractUser(request);
    return user?.role ?? 'anonymous';
  },
  ttl: 10_000, // 10 seconds default
  ttlPerSchemaCoordinate: {
    'Query.posts': 30_000, // Posts list: 30s
    'Query.user': 5_000,   // User data: 5s
  },
});
Error Handling and Partial Results
GraphQL's error model is one of its most underused features. Unlike REST where a single failure means 500, GraphQL returns partial results:
{
  "data": {
    "posts": [
      { "id": "1", "title": "First Post", "author": { "id": "u1", "handle": "alice" } },
      { "id": "2", "title": "Second Post", "author": null }
    ]
  },
  "errors": [
    {
      "message": "User not found",
      "path": ["posts", 1, "author"],
      "extensions": { "code": "NOT_FOUND" }
    }
  ]
}
Post 2's author failed to resolve, but the rest of the response is valid. Clients should handle data and errors independently. Returning an error in errors while still returning data in data is correct GraphQL behavior — do not throw errors from resolvers when you can return null + an error entry.
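A small helper makes this concrete on the client: given the errors array, find the error (if any) that explains a null at a specific path. The types and helper name here are illustrative, not from any particular client library:

```typescript
interface GraphQLErrorShape {
  message: string;
  path?: (string | number)[];
  extensions?: { code?: string };
}

// Returns the error whose `path` exactly matches the nulled field's path
function errorAtPath(
  errors: GraphQLErrorShape[],
  path: (string | number)[]
): GraphQLErrorShape | undefined {
  return errors.find(
    e =>
      e.path !== undefined &&
      e.path.length === path.length &&
      e.path.every((seg, i) => seg === path[i])
  );
}
```

When rendering posts[1].author, the client can call errorAtPath(errors, ['posts', 1, 'author']) and show "author unavailable" rather than treating the null as "this post has no author".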
Rate Limiting by Complexity
Traditional rate limiting counts requests. GraphQL requests are not equivalent — a simple { user(id:"1") { id } } and a deeply nested post/comments/authors traversal are wildly different in cost. Rate limit by query complexity:
// Sketch: track a per-user complexity budget on a sliding window.
// Note: createComplexityRule's onComplete receives only the score, so to
// reach the current user you need a per-request validation rule (or a
// server plugin) that closes over the request context, as sketched here.
const complexityBudget = new Map<string, number>();

function complexityRuleFor(userId: string) {
  return createComplexityRule({
    maximumComplexity: 1000,
    estimators: [simpleEstimator({ defaultComplexity: 1 })],
    onComplete(complexity) {
      const current = complexityBudget.get(userId) ?? 0;
      if (current + complexity > 10_000) {
        throw new GraphQLError('Rate limit exceeded', {
          extensions: { code: 'RATE_LIMITED', retryAfter: 60 },
        });
      }
      complexityBudget.set(userId, current + complexity);
      // Reset budgets on a sliding window timer (not shown)
    },
  });
}
Monitoring Key Metrics
Fields to alert on:
- Resolver error rate per field: a spike in Post.author errors signals a data integrity issue
- Slow resolver p99: DataLoader batch queries should be under 20ms; anything over 100ms needs investigation
- Persisted query miss rate: rising misses indicate a client version deploying new queries not yet registered
- Subscription connection count: watch for connection leaks — clients that subscribe but never unsubscribe
Conclusion
GraphQL earns its place in production when your system has genuine complexity: multiple clients with different data needs, multi-team schema ownership, or intricate relationship graphs that would produce REST endpoint sprawl. When those conditions hold, the patterns in this post are what separate a GraphQL deployment that performs well at scale from one that collapses under its own weight.
The non-negotiables: DataLoader on every relationship field, persisted queries before you open traffic to the internet, and a clear nullability policy communicated to client developers. Federation is the right answer for multi-team schemas — but only after you have outgrown a single schema's organizational limits. Start with a modular monolith-style schema using Pothos or NestJS GraphQL, and migrate to federation when coordination pain becomes real rather than anticipated.
Where REST still wins: simple CRUD with predictable shapes, systems that depend heavily on HTTP caching semantics, and teams that do not yet have the tooling investment to manage schema evolution safely. GraphQL's power is proportional to the complexity it is solving — applied to simple problems, it adds overhead without benefit.
The production maturity of the GraphQL ecosystem in 2026 — stable federation spec, battle-tested DataLoader, OpenTelemetry resolver tracing, APQ support across all major clients — means the operational risk of adopting it is lower than ever. The patterns exist. The tooling exists. The question is whether your problem is complex enough to justify them.
Sources
- GraphQL Specification
- DataLoader GitHub
- Apollo Federation Specification v2
- Relay Connection Specification
- GraphQL Yoga Documentation
- Automatic Persisted Queries — Apollo Docs
- graphql-query-complexity
- OpenTelemetry Envelop Plugin