Overview
We’re going to use the demodb database available at https://postgrespro.com/community/demodb. We at Metis provide a docker container with this database: https://github.com/metis-data/postgresql-demo-database
Specifically, we’ll use just two tables: ticket_flights with ~8 million rows and flights with ~200 thousand rows. We can find the schema at https://postgrespro.com/docs/postgrespro/10/apjs02.html:
Table flights has a primary key configured on the flight_id column, which means the engine maintains a B-tree index on that column. Similarly, ticket_flights has a primary key configured on the pair (ticket_no, flight_id), backed by a B-tree index on those two columns.
Before moving on, let’s also set parallel scans to 0 with the following query:
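In PostgreSQL this is done by disallowing parallel workers for gather nodes:

```sql
SET max_parallel_workers_per_gather = 0;
```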
Parallel scans don’t change the algorithms, so we can ignore them in the scope of this article.
Nested Loop Join
First and the simplest join strategy is called Nested Loop Join. It can be depicted with the following pseudocode:
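The original pseudocode isn’t reproduced here; a minimal runnable Python sketch of the idea (function and parameter names are illustrative):

```python
def nested_loop_join(table1, table2, condition):
    """Naive nested loop join: test the join condition on every pair of rows."""
    result = []
    for row1 in table1:          # outer loop over the first table
        for row2 in table2:      # inner loop over the second table
            if condition(row1, row2):
                result.append((row1, row2))
    return result
```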
We iterate over both tables with two nested loops and join them naively. The time complexity is O(size(table1) * size(table2)), i.e. quadratic, and the memory complexity is O(1).
Let’s now see that in action. Take this query:
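The exact query isn’t shown above; judging from the join condition in the plan below, it was presumably an inequality join along these lines (the SELECT list is an assumption):

```sql
SELECT *
FROM ticket_flights tf
JOIN flights f ON f.flight_id < tf.flight_id;
```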
And here is the plan we obtained:
Nested Loop (cost=0.42..16532324955.82 rows=601044021228 width=95)
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=32)
-> Index Scan using flights_pkey on flights f (cost=0.42..1253.81 rows=71622 width=63)
Index Cond: (flight_id < tf.flight_id)
We can see that the engine decided to use an index to scan the flights table. No index was used to scan the ticket_flights table, which is unfortunate: a sequential scan reads every row, which amounts to a lot of data. When filtering, we generally want to avoid scanning whole tables and read as little data as possible. Both scans are then joined with the Nested Loop algorithm.
Let’s see if changing the order of joins matters. Take this query:
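Reconstructed the same way, with the tables listed in the opposite order (the SELECT list is an assumption):

```sql
SELECT *
FROM flights f
JOIN ticket_flights tf ON f.flight_id < tf.flight_id;
```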
And here is the plan we get:
Nested Loop (cost=0.42..16532324955.82 rows=601044021228 width=95)
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=32)
-> Index Scan using flights_pkey on flights f (cost=0.42..1253.81 rows=71622 width=63)
Index Cond: (flight_id < tf.flight_id)
We can see that the results are the same.
However, changing the output aggregation can change the way the tables are scanned, even though it doesn’t change the join algorithm. Let’s take this query:
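Based on the Aggregate node and the flipped condition in the plan below, the query was presumably of this shape:

```sql
SELECT COUNT(*)
FROM ticket_flights tf
JOIN flights f ON f.flight_id > tf.flight_id;
```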
We get the following plan:
Aggregate (cost=18034924512.89..18034924512.90 rows=1 width=8)
-> Nested Loop (cost=0.42..16532314459.82 rows=601044021228 width=0)
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=4)
-> Index Only Scan using flights_pkey on flights f (cost=0.42..1253.81 rows=71622 width=4)
Index Cond: (flight_id > tf.flight_id)
We can see that the flights table is now scanned with an Index Only Scan. This operation doesn’t need to read the table rows at all; it gets everything from the index, which makes it even faster than a regular Index Scan. The scans are once again joined with the Nested Loop operation, and finally the Aggregate operation computes the row count.
Hash Join
The next strategy is called Hash Join. The Hash Join algorithm consists of two phases. In the first phase we build a hash table from one of the tables we want to join. In the second phase we iterate over the rows of the other table and look each one up in the hash table. The algorithm looks like this:
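A minimal runnable Python sketch of the two phases (names are illustrative):

```python
def hash_join(table1, table2, key1, key2):
    """Equi-join: build a hash table on table1, then probe it with table2."""
    # Phase 1: build. Keys may repeat, so store lists of rows per key.
    hashtable = {}
    for row1 in table1:
        hashtable.setdefault(key1(row1), []).append(row1)
    # Phase 2: probe. Each row of table2 is matched via an O(1) lookup.
    result = []
    for row2 in table2:
        for row1 in hashtable.get(key2(row2), []):
            result.append((row1, row2))
    return result
```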
The complexity is O(size(table1) + size(table2)) if we assume that the hashing algorithm is good and we have O(1) lookup time. The memory complexity is O(size(table1)) so the order matters. The engine generally prefers to hash the smaller table.
Let’s see that in action:
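An equality join matching the Hash Cond in the plan below (the SELECT list is an assumption):

```sql
SELECT *
FROM ticket_flights tf
JOIN flights f ON tf.flight_id = f.flight_id;
```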
This is the plan:
Hash Join (cost=9767.51..302691.07 rows=8391852 width=95)
Hash Cond: (tf.flight_id = f.flight_id)
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=32)
-> Hash (cost=4772.67..4772.67 rows=214867 width=63)
-> Seq Scan on flights f (cost=0.00..4772.67 rows=214867 width=63)
We can see that the engine decided to scan the flights table and then build a hash table out of it. Next, it iterates over the ticket_flights table and matches the rows based on the condition.
If we swap the join order in the SQL query like this:
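Presumably the swapped version looked like this:

```sql
SELECT *
FROM flights f
JOIN ticket_flights tf ON tf.flight_id = f.flight_id;
```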
then we get exactly the same plan:
Hash Join (cost=9767.51..302691.07 rows=8391852 width=95)
Hash Cond: (tf.flight_id = f.flight_id)
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=32)
-> Hash (cost=4772.67..4772.67 rows=214867 width=63)
-> Seq Scan on flights f (cost=0.00..4772.67 rows=214867 width=63)
The engine is allowed to do so. SQL queries are declarative, so they define what the result is, but they don’t dictate how the result is calculated.
However, if we add an aggregation:
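Presumably a count over the same join:

```sql
SELECT COUNT(*)
FROM ticket_flights tf
JOIN flights f ON tf.flight_id = f.flight_id;
```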
Then we get this plan:
Aggregate (cost=271560.70..271560.71 rows=1 width=8)
-> Hash Join (cost=8298.51..250581.07 rows=8391852 width=0)
Hash Cond: (tf.flight_id = f.flight_id)
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=4)
-> Hash (cost=4772.67..4772.67 rows=214867 width=4)
-> Seq Scan on flights f (cost=0.00..4772.67 rows=214867 width=4)
You can see that the flights table is still scanned sequentially, just as before. Unlike in the Nested Loop example, adding the aggregation did not switch the plan to an index scan this time.
Merge Join
The Merge Join algorithm is used when the rows of both inputs can be iterated in sorted order. It works like this:
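A runnable Python sketch, including the sort step and the rewind needed for duplicate keys (names are illustrative):

```python
def merge_join(table1, table2, key1, key2):
    """Sort both inputs, then advance two cursors through them in lockstep."""
    t1 = sorted(table1, key=key1)   # these sorts are skipped if the input is
    t2 = sorted(table2, key=key2)   # already ordered, e.g. from a B-tree index
    result, i, j = [], 0, 0
    while i < len(t1) and j < len(t2):
        k1, k2 = key1(t1[i]), key2(t2[j])
        if k1 < k2:
            i += 1
        elif k1 > k2:
            j += 1
        else:
            # Emit all pairs for this key, rewinding the right cursor
            # when the left side has duplicate keys.
            j_start = j
            while j < len(t2) and key2(t2[j]) == k1:
                result.append((t1[i], t2[j]))
                j += 1
            i += 1
            if i < len(t1) and key1(t1[i]) == k1:
                j = j_start
    return result
```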
The time complexity is O(size(table1)*log(size(table1)) + size(table2)*log(size(table2)) + size(table1) + size(table2)). The memory complexity is O(size(table1) + size(table2)).
However, if the data is already ordered, then we get O(size(table1) + size(table2)) for time complexity and O(1) for memory complexity.
Let’s see that in action. First, disable the hash join strategy:
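In PostgreSQL:

```sql
SET enable_hashjoin = off;
```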
And then run this query:
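Presumably the same equality join as before, matching the Merge Cond in the plan below:

```sql
SELECT *
FROM flights f
JOIN ticket_flights tf ON f.flight_id = tf.flight_id;
```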
We get the following plan:
Merge Join (cost=1520511.52..1676140.76 rows=8391852 width=95)
Merge Cond: (f.flight_id = tf.flight_id)
-> Index Scan using flights_pkey on flights f (cost=0.42..8245.57 rows=214867 width=63)
-> Materialize (cost=1520506.91..1562466.17 rows=8391852 width=32)
-> Sort (cost=1520506.91..1541486.54 rows=8391852 width=32)
Sort Key: tf.flight_id
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=32)
We can see that the engine had to sort the ticket_flights table after scanning it. However, the flights table was already sorted because it has the B-tree already built for the primary key.
The ticket_flights table needs to be sorted because its primary key consists of the ticket number and the flight id, in that order. The order of columns in an index matters: the entries are sorted by ticket_no first, so the flight_id values are not stored in order on their own.
Summary
The engine can choose how to calculate the join of two tables. Various algorithms have different time and memory complexities, so it’s useful to understand how we can speed things up. We can do that by adding indexes or making sure that we use conditions that allow us to use the more efficient join operations.