Friday, January 1, 2010

Get data without reading it - the power of covering indexes in MySQL

It's no real breakthrough, but it can have a very significant performance meaning to use covering indexes. Now you are sure to ask what a covering index really is? The answer is quite simple. If a index can provide all the data a query needs it becomes a covering index for it thus covering all the data the query needs. Still sounds too complicated? Let's have a look at a simple example. Let us suppose a pretty common example. A table with users, ids, birth dates, join dates. For some marketing purposes we need to determine the average birth year of users that joined us from the beginning of 2006:

1. Let's create a simple table:


CREATE TABLE IF NOT EXISTS `users` (
  `id` int(11) NOT NULL auto_increment,
  `birth_date` date NOT NULL,
  `joined_date` timestamp NOT NULL default CURRENT_TIMESTAMP,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci AUTO_INCREMENT=1 ;



2. We need to insert some random data. To make it a more real-life case I inserted 1 000 000 rows, hope my notebook (which has the test mysql running) won't explode ;-)

INSERT INTO `example_table` (`name`, `birth_date`, `joined_date`) VALUES
[a few pretty much random records]


3. What we need to acomplish is the result of a simple sql query:

SELECT AVG(YEAR(`birth_date`)) FROM `users` 
WHERE joined_date >= '2006-01-01 00:00:00'


First I'll try to calculate the average using the AVG() function without any indexes except the PK index. The explain plan looks something like this:

select type: simple,
table: users,
type: all,
possible keys: null,
key_len: null,
ref: null,
rows: 1000000,
extra: Using where

The "null" after the possible_keys indicates that there are no indexes to be used with this select. There is pretty much no difference if the query cache is primed or not if you have to fetch and search for the data making disk I/Os. Let us add two indexes: one for the `birth_year` field and one for the `joined_date` column and see how it looks on the explain plan:

ALTER TABLE  `test`.`users` ADD INDEX  `i_birth` (  `birth_date` );
ALTER TABLE  `test`.`users` ADD INDEX  `i_joined` (  `joined_date` );

Explain plan:

select type: simple,
table: users,
type: all,
possible keys: i_joined,
key_len: null,
ref: null,
rows: 1000000,
extra: Using where

You can see that MySQL tried to find a fitting index to execute the query but still it took forever with an empty query cache, an eternity. That's clearly the wrong way to go.

Now we will create a covering index for the said query.

ALTER TABLE  `test`.`users` ADD INDEX  `i_birth_joined` (  `birth_date`, `joined_date` );

The index has all the fields the query needs to be executed so the explain plan :


select type: simple,
table: users,
type: index,
possible keys: i_birth_joined,
key_len: 7,
ref: null,
rows: 1000000,
extra: Using where; Using index 

The changes are quite clear. The covering index we created caused MySQL to use it not only in the search operation but also as a data provider. That speeds up thing a whole lot but it also slows down every INSERT/UPDATE operation with rebuilding a large index.

Summarizing:

While creating a covering indexes is a good practice to speed up select queries, you have to take into account the downside of building/rebuilding large indexes that cover a lot of data, the obvious extra space the indexes take up in memory and disk space, or finally the temptation to make too much indexes, instead of taking some slower queries to the background via e.g. a cron process that fills a cache.




3 comments:

adelon said...
This comment has been removed by a blog administrator.
pejot said...

just plain english please ;-)

Martin Farach-Colton said...

Great post. You point out the insertion and disk space load induced by covering indexes. You might take a look at TokuDB, a storage engine for MySQL that supports very fast insertions into indexes, as well as compression. This combination makes covering indexes especially useful, because you can keep a bunch of them. Click here for a discussion of covering indexes, and some performance numbers for TokuDB. Furthermore, TokuDB supports a special kind of covering index, the clustering index, which is a clustering index on a different sort order than the primary key, and thus a covering index for any query that benefits from the chosen sort order.