
SQL query performance issues rarely announce themselves early. They sit quietly while your dataset is small, your traffic is predictable, and your indexes happen to align with your queries. Then one day, something shifts (data size, access patterns, concurrency) and the same query that felt harmless starts dragging everything else with it.
This is how most real systems hit their first serious performance wall. One query. No code change. Suddenly it’s responsible for half your database load.
It Usually Starts With a Plan That Looked Fine Before
In one system I worked on, we had a query joining a users table (3 million rows) with an activity table (80 million rows). The query filtered by a recent timestamp and a user status flag. On paper, it was selective enough to use an index. For months, it did exactly that.
Then performance dropped. Not gradually, noticeably. Latency jumped from 120ms to over 3 seconds during peak hours. The query hadn’t changed. The plan had.
The optimizer started choosing a sequential scan on the activity table. Its estimate suggested a large portion of rows would match the filter, so scanning became “cheaper” than using the index. The estimate was off by a wide margin.
Instead of scanning a few hundred thousand rows, it read tens of millions. You could see it clearly once we ran EXPLAIN ANALYZE actual row counts were nowhere near the planner’s expectations.
The root issue wasn’t the query. It was stale statistics combined with a change in how recent activity was distributed across users.
Once we updated stats and adjusted the index to better reflect the filter pattern, the plan flipped back. Latency dropped to 150ms. Same SQL. Completely different cost.
Indexes Don’t Help If They Don’t Match Reality
Another case was more subtle. A payments service had an index on (status, created_at). Most queries filtered by status first, then time. That worked well for a long time.
Then reporting queries started filtering heavily on created_at ranges, with status as a secondary condition. The index was still there. It just wasn’t useful anymore.
Instead of efficient index scans, the database was scanning large portions of the table, then filtering. On a table with 120 million rows, that became expensive fast.
This is one of those problems that looks obvious in hindsight. Index order matters. The database can’t rearrange it for you. MySQL’s documentation on how indexes are structured makes that clear, but it’s easy to overlook when queries evolve over time.
We introduced a new index aligned with the actual query pattern. Not more indexes, just the right one.
Query time dropped from 2.4 seconds to under 200ms. No schema redesign. No caching layer. Just alignment.
Load Changes How Queries Behave
Some problems don’t show up until concurrency increases.
A reporting query in one system ran every few seconds, scanning a moderately large dataset and aggregating results. On its own, it was fine, around 400ms. Under load, it became a problem.
It held locks long enough to block updates on the same table. Not for long, but long enough that with enough concurrent requests, those delays stacked up.
Write latency increased. Then retries increased. Then overall load increased.
That’s how a “read-only” query starts affecting writes.
The behavior lines up with how PostgreSQL describes locking and blocking, but seeing it in production feels different. You don’t just get slower queries—you get cascading effects.
We ended up isolating the workload and adjusting how often the query ran. Not a complicated fix, but not something you’d catch by looking at the query in isolation.
Growth Breaks Assumptions, Not Just Performance
There’s a point where data stops fitting comfortably in memory. You don’t always notice when you cross it, but the database does.
One analytics table grew from ~15GB to ~90GB over a few months. Queries that relied on scanning recent partitions started pulling in more data than expected because “recent” had expanded.
At 15GB, most of the working set stayed in memory. At 90GB, it didn’t.
Disk reads increased. Latency followed. Not dramatically at first, but enough to show up in p95 and p99 metrics.
Joins became heavier too. Intermediate results grew larger before filters were applied, which increased memory usage and occasionally forced temporary data to spill to disk.
That’s the kind of shift you don’t see in code review. It shows up in metrics.
What Keeps Things Stable Over Time
The systems that hold up under growth tend to do a few things consistently.
They look at execution plans early, not just when something breaks. They keep statistics up to date so the optimizer has a realistic picture of the data. They revisit indexes as query patterns evolve, instead of treating them as permanent.
They also watch what actually runs in production. Tools like pg_stat_statements help surface queries that don’t look expensive individually but become costly through repetition. And they accept that performance is not static.
A query that works today can become a liability later without any change in code. The only reliable way to stay ahead of that is to keep checking how the database is executing the work, not just what the SQL says.
