Brest, France, 12 January 2026
We recently encountered a strange optimizer behaviour, reported by one of our customers:
Customer: “Hi Dalibo, we have a query that is very slow on the first execution after a batch process, and then very fast. We initially suspected a caching effect, but then we noticed that the execution plan was different.”
Dalibo: “That’s a common issue. Autoanalyze didn’t have the opportunity to process the table after the batch job had finished, and before the first execution of the query. You should run the VACUUM ANALYZE command (or at least ANALYZE) immediately after your batch job.”
Customer: “Yes, it actually solves the problem, but… your hypothesis is wrong. We looked at pg_stat_user_tables, and we are certain that the tables were not vacuumed or analyzed between the slow and fast executions. We don’t have a production problem, but we would like to understand.”
Dalibo: “That’s very surprising! We would also like to understand…”
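For reference, here is the kind of query that shows when a table was last (auto)vacuumed or (auto)analyzed (using the anonymized table names of this article):

SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('foo', 'bar');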
So let’s dive in!
Execution plans
The query is quite basic (table and column names have been anonymized):
SELECT *
FROM bar
LEFT JOIN foo ON (bar.a = foo.a)
WHERE id = 10744501
ORDER BY bar.x DESC, foo.x DESC;
Here’s the plan of the first execution of the query after the batch job:
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=17039.22..17042.11 rows=1156 width=786) (actual time=89056.476..89056.480 rows=6 loops=1)
   Sort Key: bar.x DESC, foo.x DESC
   Sort Method: quicksort  Memory: 25kB
   Buffers: shared hit=2255368 read=717581 dirtied=71206 written=11997
   ->  Merge Right Join  (cost=2385.37..16980.41 rows=1156 width=786) (actual time=89056.428..89056.432 rows=6 loops=1)
         Inner Unique: true
         Merge Cond: (foo.a = bar.a)
         Buffers: shared hit=2255365 read=717581 dirtied=71206 written=11997
         ->  Index Scan using foo_fk1 on foo  (cost=0.57..145690068.16 rows=80462556 width=734) (actual time=89050.555..89050.557 rows=1 loops=1)
               Buffers: shared hit=2255360 read=717574 dirtied=71206 written=11997
         ->  Sort  (cost=2384.81..2387.70 rows=1156 width=52) (actual time=5.853..5.854 rows=6 loops=1)
               Sort Key: bar.a
               Sort Method: quicksort  Memory: 25kB
               Buffers: shared hit=5 read=7
               ->  Index Scan using bar_fk1 on bar  (cost=0.58..2326.00 rows=1156 width=52) (actual time=1.514..5.808 rows=6 loops=1)
                     Index Cond: (bar.id = 10744501)
                     Buffers: shared hit=5 read=7
 Settings: effective_cache_size = '20GB', random_page_cost = '2', work_mem = '100MB'
 Planning:
   Buffers: shared hit=11073 read=10738 dirtied=2
 Planning Time: 209.686 ms
 Execution Time: 89056.610 ms
And here’s the plan of the 2nd execution of the query:
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=544154.99..544157.54 rows=1020 width=789) (actual time=0.222..0.224 rows=6 loops=1)
   Sort Key: bar.x DESC, foo.x DESC
   Sort Method: quicksort  Memory: 25kB
   Buffers: shared hit=39
   ->  Nested Loop Left Join  (cost=1.15..544104.02 rows=1020 width=789) (actual time=0.087..0.164 rows=6 loops=1)
         Buffers: shared hit=36
         ->  Index Scan using bar_fk1 on bar  (cost=0.58..2052.42 rows=1020 width=52) (actual time=0.073..0.127 rows=6 loops=1)
               Index Cond: (bar.id = 10744501)
               Buffers: shared hit=12
         ->  Index Scan using foo_fk1 on foo  (cost=0.57..528.77 rows=265 width=737) (actual time=0.004..0.004 rows=0 loops=6)
               Index Cond: (foo.a = bar.a)
               Buffers: shared hit=24
 Settings: effective_cache_size = '20GB', random_page_cost = '2', work_mem = '100MB'
 Planning:
   Buffers: shared hit=329
 Planning Time: 10.390 ms
 Execution Time: 0.373 ms
Searching for the culprit
The main difference between the two plans is the join strategy. It’s a Merge Join in the first one, and a Nested Loop Join in the second one.
Since the two tables haven’t been analyzed or vacuumed, we know the statistics [1] didn’t change between the two executions. We observe that the first execution dirties a lot of buffers (dirtied=71206), and that’s our first clue.
I was surprised by the high total cost of the outer index scan in the Merge Join node (145 million), especially since the cost of the Merge Join itself is only 16,980.
In fact, the planner knows that only a small fraction of one of the two Merge Join child nodes will be executed: since a Merge Join reads both inputs in sort order, it can stop as soon as one of them is exhausted. This occurs, for example, when the histograms of the join clause columns (foo.a and bar.a, in our case) have only a small overlap, or no overlap at all. (I would like to thank Robert Haas for teaching me this when I asked about it on the PostgreSQL Hacking Discord.)
In our case, the histograms are as follows:

- histogram of column foo.a: {1532523860, 1532923673, <97 more values>, 1573407772, 1573803559}
- histogram of column bar.a: {16877, 15720140, <97 more values>, 1485178901, 1499389426}
So they actually don’t overlap at all: the largest value of bar.a (1,499,389,426) is smaller than the smallest value of foo.a (1,532,523,860).
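The histogram bounds above can be retrieved with a query along these lines:

SELECT tablename, attname, histogram_bounds
FROM pg_stats
WHERE tablename IN ('foo', 'bar')
  AND attname = 'a';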
When I saw this, I suddenly remembered a case from a few years ago when we had to deal with a query whose planning time was absurdly high on the first execution. There were many index tuples pointing to dead heap tuples.
When the optimizer computes a selectivity estimate from a histogram, it may probe the index to determine the actual extreme values. (The function is get_actual_variable_endpoint() [2], called by get_actual_variable_range() when the first or last histogram entry is accessed by the function ineq_histogram_selectivity().)
At that time, this function could end up reading many heap pages, and also writing index pages in order to set the dead flag for the index tuples pointing to dead heap tuples. This explained the very high planning time.
However, in November 2022, a patch from Simon Riggs limited the number of heap pages that get_actual_variable_endpoint() could fetch: after reading 100 heap pages, it gives up, preventing the planning time explosion.
As a consequence, we have this strange case where the first execution of a query is planned differently from the second one. The first time, get_actual_variable_endpoint() gives up, so the planner uses the extreme value recorded in the histogram. The second time, if enough of the index tuples pointing to dead heap tuples were killed in the meantime (notably by the first execution’s long index scan, which dirtied 71,206 buffers), get_actual_variable_endpoint() successfully returns the actual extreme value. Thus, the planner works with more accurate information.
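If you want to observe these killed index tuples yourself, the pageinspect extension can help: on recent PostgreSQL versions, bt_page_items() exposes the LP_DEAD hint in its dead column. Here is a sketch against the foo(a) index created by the reproduction script below (leaf page 1 is just an example; the relevant page depends on the index):

CREATE EXTENSION IF NOT EXISTS pageinspect;
-- list the first items of one B-tree leaf page; "dead" is the LP_DEAD hint
SELECT itemoffset, dead, htid
FROM bt_page_items('foo_a_idx', 1)
LIMIT 10;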
Verifying the hypothesis
1st execution
Assuming our hypothesis is correct, on the first execution, get_actual_variable_endpoint() gives up. Since our histograms don’t overlap at all, the planner could estimate a near-zero total cost for the Merge Join node. However, the planner is wary of extremely low or high selectivity estimates [3], so it estimates (simplifying a bit) that it will run at least a small fraction (0.01 %, or one hundredth of the histogram resolution) of the outer child node, and, at worst, almost all (99.99 %) of the inner child node to find the first matching tuple, and all of it to find them all. So the startup cost of the Merge Join node is close to the total cost of the inner child node, and its run cost (total cost minus startup cost) should be close to 0.0001 times the total cost of the outer child node.
Let’s do the math. In the above plan, the Merge Join startup cost is 2,385, which is slightly higher than the total cost of the inner index scan multiplied by 0.9999 (2,326 × 0.9999 ≈ 2,326). Therefore, it’s consistent. The run cost equals 16,980 − 2,385 = 14,595, which is slightly higher than the total cost of the outer index scan multiplied by 0.0001 (145,690,068 × 0.0001 ≈ 14,569). Once again, this is consistent with our hypothesis.
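The same back-of-the-envelope check, done in SQL:

SELECT round(2326.00 * 0.9999, 2) AS startup_lower_bound,  -- ≈ 2325.77, vs. actual 2385
       round(145690068.16 * 0.0001, 2) AS run_lower_bound; -- ≈ 14569.01, vs. actual 14595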
2nd execution
Assuming our hypothesis is correct, on the second execution, get_actual_variable_endpoint() successfully returns the actual extreme values, thanks to the index tuples that were already killed. We then expect the total cost of the Merge Join to be higher than the total cost of the Nested Loop Join node in the above “fast” plan, which is 544,104. We can’t say anything very precise here, because we didn’t ask our customer for the real minimum value of foo.a and the real maximum value of bar.a. Besides, it would be very tedious to unfold the computations performed by the planner. Suffice it to say that if PostgreSQL estimates that it will have to run at least 0.5 % of the outer index scan, then the total cost of the Merge Join node exceeds 0.005 × 145,690,068 = 728,450, which is greater than the total cost of the Nested Loop node. So the planner chooses the Nested Loop.
A script to reproduce the case
The following script reproduces the case. First, we create two tables and disable autovacuum for them. We insert one million rows into foo, with values of column a between 200,000 and 299,999, and one million rows into bar, with values of column a between 100,000 and 199,999 (no overlap with foo.a). We run VACUUM ANALYZE on both tables, and then insert 100,000 more rows into foo, with values of column a between 185,000 and 195,000 (inclusive), so the data now overlaps. Since autovacuum is disabled, we can verify in the statistics view that the histograms still don’t overlap. We create the indexes. Then we delete all the rows that were inserted by the last INSERT, except those with the last value (a = 195,000). So the minimum value of foo.a is now 195,000, but the beginning of the index still contains many entries pointing to dead heap tuples, and get_actual_variable_endpoint() will give up and return false on the first execution.
When you run the script, the last command should output a plan with an underestimated Merge Join node. You’ll see Heap Fetches, hinting that index entries pointing to dead tuples are being killed. Running this EXPLAIN command a second time will result in a Nested Loop Join. With SET enable_nestloop TO off, you can verify that the cost of the Merge Join is then much higher (see the snippet after the script).
DROP TABLE IF EXISTS foo, bar;
CREATE UNLOGGED TABLE foo(id int, a int);
CREATE UNLOGGED TABLE bar(id int, a int);
ALTER TABLE foo SET (autovacuum_enabled = off);
ALTER TABLE bar SET (autovacuum_enabled = off);
-- foo.a and bar.a don't overlap at this point
INSERT INTO foo SELECT i, 200000 + random() * 100000 FROM generate_series(1, 999999) i;
INSERT INTO bar SELECT i%100000, 100000 + i/10 FROM generate_series(1, 999999) i;
VACUUM ANALYZE foo, bar;
-- create some overlap, invisible to the now-frozen statistics
INSERT INTO foo SELECT 1000000 + i, 185000 + i/10 FROM generate_series(1, 100000) i;
-- check that the histograms still don't overlap
SELECT * FROM pg_stats WHERE tablename IN ('foo', 'bar') AND attname = 'a';
CREATE INDEX ON foo(a);
CREATE INDEX ON bar(id, a);
-- leave many index entries pointing to dead heap tuples at the start of foo's index
DELETE FROM foo WHERE a >= 185000 AND a < 195000;
EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT foo.a FROM foo JOIN bar USING (a) WHERE bar.id = 2047 ORDER BY a;
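As suggested above, here is a minimal way to compare the Merge Join cost once the second execution has switched to a Nested Loop (to be run in the same session, right after the script):

SET enable_nestloop TO off;
EXPLAIN SELECT foo.a FROM foo JOIN bar USING (a) WHERE bar.id = 2047 ORDER BY a;
RESET enable_nestloop;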
Conclusion
We found a case in which the query plan can change between two executions, while the data and statistics remain exactly the same. As far as I know, there are no other known cases [4], but if you know of one, or have any questions or feedback about this article, please contact Frédéric Yhuel at frederic.yhuel@dalibo.com.
One more thing: typically, estimation errors in the planner result in an
output row count that differs greatly from the actual count, for a given node. This kind of
discrepancy is easy to spot, and graphical tools like explain.dalibo.com and explain.depesz.com highlight it.
A common example is when the planner selects a Nested Loop Join instead of a Hash Join because the estimated
number of output rows of the outer child node is significantly underestimated. The reverse also happens, albeit
less frequently. However, we have encountered another type of estimation error that is much more difficult to spot,
because it comes from the direct use of indexes by the planner.
[1] Not only those contained in pg_statistic, but also those in pg_class. The latter can also be updated by an index creation or a reindexation, but this was not the case here.

[2] See selfuncs.c

[3] See selfuncs.c

[4] I can think of one other case, involving GEQO and a different GEQO seed, but it’s not very interesting to me.