May 14, 2008

Faking independent subqueries in MySQL

So one of my associates laid this little gem on me yesterday that kinda floored me.

Say you have a query that counts the number of members in a group that looks like this:

select count(*) as members from users where id in (select user_id from memberships where group_id = 1);

You may know that this query can take a long time as the number of rows in the users table grows. Why? Take a look at the explanation:

+----+--------------------+-------------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type        | table       | type  | possible_keys | key     | key_len | ref  | rows | Extra                    |
+----+--------------------+-------------+-------+---------------+---------+---------+------+------+--------------------------+
|  1 | PRIMARY            | users       | index | NULL          | PRIMARY | 4       | NULL |    1 | Using where; Using index | 
|  2 | DEPENDENT SUBQUERY | memberships | ALL   | NULL          | NULL    | NULL    | NULL |    1 | Using where              | 
+----+--------------------+-------------+-------+---------------+---------+---------+------+------+--------------------------+

That inner query is dependent, which means it’ll get run for each user in the table, even though it really doesn’t have to be. It would be nice to store this in a temporary table, but that’s a pain right? Or is it? Check out this “optimization”:

select count(*) as members from users where id in (select user_id from (select user_id from memberships where group_id = 1) as x);

That’s an extra query right? Why would that help? Turns out that wrapping the dependent subquery like this creates a derived table. MySQL just creates that temporary table for you for free. Here’s the explanation.

+----+--------------------+-------------+--------+---------------+---------+---------+------+------+--------------------------+
| id | select_type        | table       | type   | possible_keys | key     | key_len | ref  | rows | Extra                    |
+----+--------------------+-------------+--------+---------------+---------+---------+------+------+--------------------------+
|  1 | PRIMARY            | users       | index  | NULL          | PRIMARY | 4       | NULL |    1 | Using where; Using index | 
|  2 | DEPENDENT SUBQUERY | <derived3>  | system | NULL          | NULL    | NULL    | NULL |    0 | const row not found      | 
|  3 | DERIVED            | memberships | ALL    | NULL          | NULL    | NULL    | NULL |    1 | Using where              | 
+----+--------------------+-------------+--------+---------------+---------+---------+------+------+--------------------------+

So now instead of running that query against the memberships table it runs it against the derived table. This derived table only contains one column, resides in memory and never changes allowing it to run much much faster without changing the query structure too heavily.

Update: This may not be the best example by itself. Let me clarify my additional requirement of counting all the users that aren’t in the given group as well. The above query handles that easily via a “not”:

select count(*) as members from users where id NOT in (select user_id from (select user_id from memberships where group_id = 1) as x);
May 6, 2008
&gt; git gui How the eff did I miss this!?
> git gui
How the eff did I miss this!?
March 25, 2008

March 17, 2008
crabs casino!
crabs casino!
March 15, 2008
Boarding time!
Boarding time!
Just like my first trip to japan!
Just like my first trip to japan!
The short layover race begins!
The short layover race begins!
My flight to detroit. Now boarding.
My flight to detroit. Now boarding.
March 12, 2008
March 9, 2008