We usually make small iterative improvements in performance. However, on rare occasions, an opportunity arises to make an improvement of five orders of magnitude by changing a single line!
After learning about a slow query that was hitting our statement timeout from one of our Site Reliability Engineers, I reformatted it for ease of readability and ran it through EXPLAIN ANALYZE. The query was about 80 lines long and took me a while to look through. In the middle of the query, there was a curious join:
... INNER JOIN table2 ON table1.table2_id = 123456 ...
I thought to myself: I bet that doesn't do what they think it does.
Notice that this doesn't specify a column on the right side to join with. I thought this would produce a partial Cartesian product, but I always like to verify such things.
So, I decided to take a closer look. The query on the left joins in the same way the original query does. The query on the right specifies a normal join condition plus an additional WHERE clause. Please, note the result set: ~66 million on the left vs. 48 on the right.
|Original JOIN||JOIN w/ Right Column Specified|
SELECT table1.id, table2.id FROM table1 INNER JOIN table2 ON table1.table2_id = 123456;
SELECT table1.id, table2.id FROM table1 INNER JOIN table2 ON table1.table2_id = table2.id WHERE table2.id = 123456;
|Rows returned: 66,408,624||Rows returned: 48|
Without the right-side column specified, we end up with the Cartesian product of all rows in table1 where table2_id is 123456, and every row in table2. By specifying the right-side column and moving the condition to the WHERE clause, we get the expected result set.
So, if we bring that back to our original query, we find quite a decent improvement!
|Original Query Run Time||New Query Run Time|
5588997.823 ms (01:33:08.988)
You see, the mistake is pretty simple. By thinking, “Hey, if I know the id, I can just slap that into the join,” you just end up with that Cartesian product and some sad Database Administrators without a right-side column to join against.
As exciting as this was to work on, let’s not allow a problem of this size to slip through code reviews in the future. We created a danger rule (https://danger.systems/ruby/) alert as guardrails to enable teams to spot similar issues.