The N+1 Problem Is Bigger Than Queries

How “Calls per Request” helps you catch hidden performance and cost bugs early

Most “slow apps” are not slow because the database is bad.

They are slow because a single request quietly triggers far more work than anyone expects. This pattern is commonly known as N+1 queries, but in real systems it goes beyond databases.

It becomes N+1 work.

And once it shows up, it hurts performance, reliability, and cloud cost at the same time.

This post explains what N+1 work looks like, why it is dangerous, and how to prevent it using one simple measurement: calls per request.


1) What N+1 work actually is

N+1 happens when something that should happen once happens N times, usually because it sits inside a loop.

A classic example:

  • You fetch 100 orders.

  • For each order, you fetch customer details.

  • That becomes 1 query for orders + 100 queries for customers.
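The classic pattern can be sketched in a few lines. This is a minimal illustration, not a real ORM: `CountingDB` is a hypothetical in-memory stand-in that simply counts how many queries a request triggers.

```python
class CountingDB:
    """Tiny in-memory stand-in for a database that counts queries."""
    def __init__(self, orders, customers):
        self.orders = orders
        self.customers = customers
        self.query_count = 0

    def fetch_orders(self):
        self.query_count += 1
        return list(self.orders)

    def fetch_customer(self, customer_id):
        self.query_count += 1
        return self.customers[customer_id]

def load_orders_n_plus_one(db):
    orders = db.fetch_orders()  # 1 query for the orders
    for order in orders:
        # 1 extra query per order -> N more queries
        order["customer"] = db.fetch_customer(order["customer_id"])
    return orders

customers = {i: {"id": i, "name": f"c{i}"} for i in range(100)}
orders = [{"id": i, "customer_id": i} for i in range(100)]
db = CountingDB(orders, customers)
load_orders_n_plus_one(db)
print(db.query_count)  # 101: the "1 + N" shape for 100 orders
```

The loop looks harmless in code review; only the counter makes the 101 round trips visible.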

But the same pattern appears outside the database too:

  • One API request triggers 1 service call, then 30 more in a loop

  • One page load triggers 50 small network calls

  • One background job performs hundreds of tiny writes

  • One feature causes repeated cache misses

The system looks fine at small scale, because N is small.
At real usage, N grows and the architecture starts to feel “slow” even though nothing obvious changed.


2) Why N+1 work becomes a serious problem

N+1 work is dangerous because it multiplies three things at once.

A) Latency multiplies

Even if each downstream call is “fast,” stacking many calls increases response time.

A request that should take 150 ms can stretch to 1 to 3 seconds simply because it stacks dozens of small sequential operations.

B) Reliability drops

More calls mean more points of failure.

Even a tiny failure rate becomes visible when multiplied.
For example, if a dependency fails 1% of the time and a request calls it 50 times, roughly 40% of those requests will see at least one failure (1 - 0.99^50 ≈ 0.39).
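One way to quantify "often": assuming each call fails independently with probability p, the chance that a request touching n dependencies sees at least one failure is 1 - (1 - p)^n.

```python
def request_failure_rate(p: float, n: int) -> float:
    """Probability that at least one of n independent calls fails,
    given a per-call failure rate p."""
    return 1 - (1 - p) ** n

print(round(request_failure_rate(0.01, 1), 3))   # 0.01  -> 1% of requests fail
print(round(request_failure_rate(0.01, 50), 3))  # 0.395 -> ~40% of requests fail
```

The independence assumption is optimistic; real failures often correlate, but even this lower bound shows how fan-out turns a "rare" error into a routine one.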

C) Cost rises silently

N+1 work increases:

  • compute usage

  • DB load

  • third-party API spend

  • queue usage

  • logging volume

  • network egress

This is why it is painful. The feature “works,” but the bill and the latency keep climbing.


3) The metric most teams miss: Calls per Request

Most teams monitor:

  • CPU

  • memory

  • response time

  • error rate

Those are important, but they often tell you the symptoms, not the cause.

A better early-warning signal is:

Calls per request
How many downstream operations does one inbound request trigger?

This includes:

  • database queries

  • service-to-service calls

  • cache reads

  • message queue operations

  • third-party API calls

When calls per request increases, it usually means:

  • hidden loops

  • missing batching

  • poorly shaped endpoints

  • repeated lookups

  • accidental fan-out

If you track this metric, you will catch N+1 work before it becomes a production incident.


4) How to measure Calls per Request in a practical way

You do not need a perfect observability setup to start.

Here are three realistic options.

Option 1: Distributed tracing (best)

If you use OpenTelemetry, Datadog, New Relic, or similar tools, count:

  • spans per trace

  • outbound calls per endpoint

  • DB queries per request

Endpoints with unusually high span counts are the first ones to investigate.

Option 2: Simple counters inside your code

Add lightweight counters per request:

  • query_count

  • outbound_http_count

  • cache_get_count

  • queue_ops_count

Log them only at debug level or sample 1% of traffic to avoid noise.

Even one line like this is powerful:
/orders summary: db=12 http=18 cache=22 total=52
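A minimal sketch of such counters, assuming you can call a `record()` helper from your DB, HTTP, and cache wrappers (the function names here are hypothetical). `ContextVar` keeps each request's counts isolated, including across async tasks.

```python
from collections import Counter
from contextvars import ContextVar

# One Counter per request; ContextVar isolates it across concurrent tasks.
_calls: ContextVar[Counter] = ContextVar("calls")

def start_request():
    _calls.set(Counter())

def record(kind: str):
    # Call this from your DB/HTTP/cache wrappers, e.g. record("db")
    _calls.get()[kind] += 1

def summary(path: str) -> str:
    c = _calls.get()
    parts = " ".join(f"{k}={v}" for k, v in sorted(c.items()))
    return f"{path} summary: {parts} total={sum(c.values())}"

# Simulated request:
start_request()
for _ in range(12): record("db")
for _ in range(18): record("http")
for _ in range(22): record("cache")
print(summary("/orders"))  # /orders summary: cache=22 db=12 http=18 total=52
```

Emit the summary line at debug level, or for a sampled fraction of traffic, and you have a calls-per-request signal without any new infrastructure.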

Option 3: Gateway or service mesh metrics

If you use a gateway or service mesh, track outbound requests per inbound request.
This is often enough to detect fan-out patterns.


5) Where N+1 work usually comes from

In practice, these are the most common sources.

A) Looping calls inside request handlers

A single endpoint loops over items and calls another dependency per item.

B) UI-driven fragmentation

The UI makes many small calls to build one screen instead of requesting an aggregated payload.

C) Background jobs performing tiny writes

Jobs that process items one by one instead of batching writes.

D) Cache-as-a-database misuse

Code repeatedly performs cache lookups inside loops rather than fetching keys in bulk.


6) Fixes that actually work

Here are the solutions that consistently remove N+1 work.

1) Batch the work

Replace per-item calls with batched operations.

Examples:

  • DB: WHERE id IN (...)

  • API: GET /customers?ids=...

  • Cache: multi-get operations

Batching is the simplest and most reliable fix.
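A sketch of the before-and-after, using a hypothetical `FakeAPI` that counts outbound calls. The single-item and bulk endpoints stand in for `GET /customers/{id}` and `GET /customers?ids=...`.

```python
class FakeAPI:
    """Counts calls; stands in for a customers service."""
    def __init__(self):
        self.calls = 0

    def get_customer(self, customer_id):
        self.calls += 1
        return {"id": customer_id}

    def get_customers(self, ids):  # bulk endpoint, e.g. GET /customers?ids=...
        self.calls += 1
        return {i: {"id": i} for i in ids}

def fetch_one_by_one(api, ids):
    # N calls: one per id -- the N+1 shape
    return {i: api.get_customer(i) for i in ids}

def fetch_batched(api, ids, batch_size=50):
    # ceil(N / batch_size) calls
    out = {}
    for start in range(0, len(ids), batch_size):
        out.update(api.get_customers(ids[start:start + batch_size]))
    return out

api = FakeAPI()
fetch_one_by_one(api, list(range(100)))
print(api.calls)  # 100

api.calls = 0
fetch_batched(api, list(range(100)))
print(api.calls)  # 2
```

Same data, 100 calls versus 2. The batch size cap also doubles as a guardrail against oversized payloads.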

2) Cache the expensive part

If the same data is fetched repeatedly, cache it:

  • per-request memoization

  • short TTL caching

  • “last known good” caching

The goal is not caching everything.
The goal is stopping repeated work.
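Per-request memoization can be as small as wrapping the expensive lookup in `functools.lru_cache` for the lifetime of one request. `fetch_price` is a hypothetical stand-in for a DB or API call.

```python
from functools import lru_cache

call_count = 0

def fetch_price(sku):
    """Stands in for an expensive DB or API lookup."""
    global call_count
    call_count += 1
    return {"sku": sku, "price": 9.99}

def make_request_scoped_lookup(fetch):
    # Build one of these per request; the memo dies with the wrapper,
    # so there is no staleness beyond a single request.
    @lru_cache(maxsize=None)
    def lookup(key):
        return fetch(key)
    return lookup

lookup = make_request_scoped_lookup(fetch_price)  # once per request
for sku in ["a", "b", "a", "a", "b"]:
    lookup(sku)
print(call_count)  # 2: only the first lookup per sku reached the backend
```

Because the cache is scoped to the request, this avoids the invalidation questions that a shared cache raises, while still collapsing repeated lookups.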

3) Limit fan-out

Put guardrails on how much work one request can trigger:

  • pagination

  • maximum items per request

  • explicit caps for exports and bulk endpoints

  • degrade gracefully for large payloads

Without limits, one “large” request becomes an accidental load test.
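The simplest guardrail is a hard cap on page size, applied before any work happens. The limit here is a made-up value; tune it per endpoint.

```python
MAX_PAGE_SIZE = 100  # hypothetical cap; choose per endpoint

def clamp_page_size(requested: int) -> int:
    """Never let one request ask for unbounded downstream work."""
    if requested < 1:
        return 1
    return min(requested, MAX_PAGE_SIZE)

print(clamp_page_size(25))      # 25
print(clamp_page_size(10_000))  # 100: a "large" request is capped, not honored
```

Combined with batching, this bounds calls per request at roughly `MAX_PAGE_SIZE / batch_size` for the worst case, a number you can reason about and alert on.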

4) Change the shape of the API

Sometimes the design causes the problem.

If a screen needs five endpoints to render, you will eventually see N+1-style behavior at the page level.

Fix this by creating aggregated endpoints:

  • GET /dashboard-summary

  • GET /order-summary

  • GET /profile-with-dependencies

This reduces round trips and prevents UI-driven fragmentation.
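A sketch of an aggregated endpoint: the three service functions below are hypothetical placeholders for whatever the screen needs, and the point is that the server composes them in one round trip instead of the browser making three.

```python
# Hypothetical internal lookups the dashboard screen needs:
def get_profile(user_id):
    return {"id": user_id, "name": "Ada"}

def get_orders(user_id):
    return [{"id": 1}, {"id": 2}]

def get_wishlist(user_id):
    return [{"sku": "a"}]

def dashboard_summary(user_id):
    """One handler (e.g. GET /dashboard-summary) replaces three
    browser-to-server calls. Server-side, these lookups can be
    batched or run in parallel."""
    return {
        "profile": get_profile(user_id),
        "orders": get_orders(user_id),
        "wishlist": get_wishlist(user_id),
    }

print(sorted(dashboard_summary(42).keys()))  # ['orders', 'profile', 'wishlist']
```

The client sees one call; any remaining fan-out moves server-side, where it is cheaper, easier to batch, and visible in your calls-per-request metric.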


7) A simple pre-release checklist

Before shipping a feature, ask:

  • How many downstream calls can one request trigger in the worst case?

  • Are we batching, or looping?

  • Do we have a safe cap or pagination?

  • Can we cache the repeated lookups?

  • What is our baseline calls per request for this endpoint?

  • What alert would fire if calls per request suddenly doubled?

If you do this, you prevent most surprises.


Final thought

N+1 is rarely a “performance bug.”
It is usually a design bug.

The system is doing more work than it needs to do.

If you track calls per request, you stop guessing and start seeing the real shape of your system.


Question for you

What is the worst N+1 pattern you have seen in production?
Was it queries, service calls, UI calls, or background jobs?

_____________________________________________________________________________________

Written by Sharath Chandra Odepalli
