Saturday, December 25, 2010

My favorite Christmas songs


So this is Christmas...

the time of the year when we basically forget about our computers and focus on the important things in life... eating yourself sick with your whole family.

No Christmas would be complete without a good selection of seasonal music that accompanies us every year, here is my favourites on the subject:

1. John Lennon - so this is Christmas

This song brings back old memories of one of my first Christmases I can remember (surely because the song has about 30 years).

2. Wham! - Last Christmas

A nemesis of polish radio for the past 10 years. Just no holidays without it.

3. Bobby Caldwell & Vanessa Williams - Baby It's Cold Outside

Song written in 1944 by Frank Loesser though the years had a lot of great interpretations but my favorite one is from the '96 Vanessa Williams album - Star Bright. BTW she is a very beautiful woman ;)

4. Skaldowie - z kopyta kulig rwie

This is a polish song from 1968 so you probably wouldn't understand much, but the main theme here is a sleigh ride through the forest (you can find some places that still do it by horse).

5. Chris Rea - Driving Home for Christmas

This list isn't in any particular order because my private Christmas "experience" starts with listening to this song while driving in a snowy, icy road through the forest. The best song ever!

6. Mariah Carey - All I want for Christmas is You

Very warm song performed by the pop diva brings up very warm feelings for sure. This will worm us up in those long and lustrous winters to come.

7. Kenny Rogers & Dolly Parton - Winter Wonderland

Some might say it is a bit corny, in my opinion its a beautiful song that has a lot of positive energy, plus the young Dolly Parton was surely something warming up to look at ;)

Well folks, this concludes my list, feel free to add suggestions to the list,
Have a quiet holiday with your families.

Saturday, December 18, 2010

Control authentication with KeePass and Dropbox


A few weeks ago it's been brought to my attention that may people have problem with dealing with password sharing and access control of may resources at the same time.
Imagine you have a team consisting of:

- system admins
- programmers
- marketers

Every position requires a different set of passwords, usually there is a lot of passwords (20+) that for security reasons should not be the same text.

I've actually came across a nice, free and fast resolution for the problem.
KeePass (and for Linux KeePassX) is a free password database that can help put in situations like that.
Your team just has to remember the password to the KeePass database. Every access type, or even person can have it's own database, but there emerges a problem of password versioning. Every change of every password would trigger and endless wave of emails to your employees about the changes. Actually there is a neat resolution for the problem as well.

DropBox is a great network service which i love to use. It makes sharing files between my computers at work and home really easy. No more sending yourself emails with files, just drop something to your dropbox folder on one end, and it automatically updates everywhere.

The idea for sharing with dropbox is that you need to create a "company dropbox" account, that everyone is connected to. Every time you or anybody else updates a file everyone that needs to know about the changes uses another password.

Lets summarize what are the benefits here:

- passwords that don't need to be remembered can be very long and complicated (KeePass has a pass generator too)

- an employee has to remember:
1) Dropbox account and password, that's two
2) Keepass db password, that's one
So only three passwords to remember, plus one per every database the employee has access to.

- passwords are automatically updated across our business

- passwords are pretty safe, keepass encrypts with a 256-bit AES or a 256-bit Twofish algorithm that are considered to be very secure by the cryptographic community

- every time someone quite their job, you need to change all their passwords + the drobbox password. Changing all passwords actually isn't that much of a problem, since dropbox + keepass resolves it.

- there is limited possibility to create passwords per employee

After this short summary IMHO this is an easy and secure way to share your passwords (or just keep your passwords at bay) without any serious knowledge about security. You may want to check out the keepass plug-in page for additional functionality

Thanks, and see you next time,

Friday, December 17, 2010

Multiple repositories in one with svn:externals

Hi there!

Recently I had a problem with my services:
- I had 2 separate websites that use a kind of token authorization system
- I had a lot of code that was used in both web services

Of course the only smart thing to do when you have a sender and receiver on both ends is that its performed by the exactly same code as follows:

Sender service:
$token = new Token();
echo $token->generateToken();

Receiver service:
$token = new $token($_GET['token']);
echo $token->isValid();

So what we need to achieve here is pretty arbitrary to the point I'm making. Just a simple sending and receiving of some token that has to be validated on the other side.

The problem is that those two sites are entirely different projects that share some common logic. The mistake people make is to copy the common code between the services in the hope that no one will change it, or when changed, someone will change it on both ends. There is a lot of room for mistakes here.

Luckily we have svn:externals to help us with the whole process. What we want to achieve can be summarized in a few simple steps:

1) Isolate the common code in another repository to protect it from unwanted changes. Let's say we will name the repository CORE.

2) Remove the obsolete code from all code base in which i was present.

3) Add a external checkout to our library folder:
>svn propget svn:externals library

What this will do is when someone updates the library folder, svn will know to get the external items from the common code repository.

In this example you will have to commit your code to the external repository, but an easier way is that you create a directory for the code in one repository and add a directory externally to another repository that requires the code.

Hope you will take something from this simple example,
see you next time ;)

Wednesday, December 8, 2010

Starting your own IT business? A few lessons to be learned.


The thing is I decided to start My own business ( but it isn't much to look at, still working on it), producing software and websites of course. After some three weeks of muddling through the polish law system I finally managed to start it officially :)

After a few months I have some things I want to talk about that may help you in your endeavor.
Aside from all problems, the first assignment (, a small co repetition system was quite an interesting task. Not only learning (the hard way) that absolutely everything must be perfect up to 1 pixel, but also learned how to employ people. You always have to be in control, if you outsource some project, every few days look at the results, the time/hour costs.

Lesson one:
Always have a written contract with the functionality annexed to it.

Even if someone won't pay you, you still have a chance of getting back the costs.
It's an industry standard so you won't have any problem finding templates on the web.

Lesson two:
Include the first changes (except obvious bugs) in the price, for next changes bill your client separately.

As a young business you lack credibility. The clients are unsure if you won't run and leave them with the problems on the website. If you decide to include project maintenance over an year or so in the project you might get bigger projects.

Lesson three:
If you work with someone, keep track of the work being done.

If you leave everything to your employee/subcontractor there is a big possibility that the client won't get what he/she ordered or anything at all. Every few days have your employees (programmers) show you whet is accomplished and what is still on the to do list. Tell them what has the highest priority for the client. I personally use Redmine to track progress, keep in touch with clients (although there are better tools for that out there), bug tracking and calculate bills (from spent time).

If you're looking for something more sophisticated, actually JIRA has some event where they give you a starter package for 10 users for $10 so its really cheap (for some charity I think).

Lesson three:
Don't ask people to work for you, let them ask, just be in the right place.

Over time I've worked with many of my colleagues that were working with me for some time. Really bib disappointment, pretty much nothing was done right. I've learned to let people look form me, not the other way. Some of the best employees were hired from some small cities university IT forum. The main thought here is - look for people in niches not on or something like that. You wont get someone that is very much skilled but they sure are motivated and in my opinion that's the key to success.

I only hope that the lessons I had to learn the hard way will be of some use to you, my Dear Readers :)

See you soon.

Thursday, November 25, 2010

New exciting work to be done

Hi ppl,

long time no see. Fortunately my new business (creating www software) is growing fast, so I didn't have much time to write but a lot of stuff happened.
I got a new job, a new car and and moved to a quieter city :)

The car:

(not the actual photo, but damn close)
Black Jeep ZJ '96 - sounds pretty old - yep but the advantage is that its a V8 engine with over 200HP, makes a lot of fun:)

The city:

Sopot - it's like the Polish Beverly Hills - "the place to see end be seen" in the summer, now in the early winter its just a nice place to live :)

The job: - a great new adventure. A very innovative project - recording everything that happens on a flash website -> store it -> analyze it -> draw conclusions. The basic idea here is to know what you need to change on a flash website to raise user experience. There are some really interesting technical difficulties that will eventually need solving:

1) The project has potential - fast growth isn't a problem if you have the infrastructure or use a cloud (and your application has basic scalability possibilities)

2) A lot of simple traffic coming in - recording user traffic is a lot of inbound traffic. The architecture has to take into account that traffic can vary with pretty big amplitudes.

3) Big OLAP queries to execute - trends over time, user behavior changes. Analytical queries have a very much different profile that the usual website OLTP stuff, some things just take long, some can be made faster (I hope ;) ).

4) What happens if you get a client that wants to record a lot more? Is the system scalable enough to handle bigger players or is the query execution time multiplier equal to the size of the database?

Some interesting month are ahead, we are yet to an official product start (it's currently in semi open BETA). I hope I will have the pleasure to write about interesting problems, and their smart resolutions.

I didn't write anything about my business, but that's another topic all together :)

See you soon,

Sunday, April 4, 2010

Some feedback on Google Closure Templates

We've been using GCT at (My Virtual Museum). After the introduction of the new technology, we had some problems that needed solving:

1) The GCT compiler is a pretty unhandy tool, after some php modifications, we managed to get results faster.

2) No predefined directory/file handling structure was a big problem. I personally hate the directory maze of some developers, we had to keep it simple yet practical. A simple structure of front-end ( guest users ) and back-end ( logged in users ), and an additional categorization per structure, made every decision about where everything should be really straight forward.

3) Loading of the templates should be automatically. We'll soon migrate to a multilingual system, that has to work automatically from the js template side, so auto loading was an issue, but no more.

Google Closure Tools is a great tool, that, if used with a little brain, can make Your js look and feel a lot better. It's something worth recommending.

Saturday, January 23, 2010

Organize your spaghetti javascript/java with Google Closure Templates

If you ever developer something in javascript that has to have some html, its a real chore, for me anyway. Plain dom manipulation is pretty easy, but what do you do when you need to insert some dom nodes that weren't there before? I think it's pretty common to write something like this:

//jQuery javascript:
 var someHTML = 'A new paragraph!';
 jQuery('#paragraphContainerDiv').html( someHTML ); 

The code, if you have jQuery of course, should insert the paragraph as a new node in the selected container. When you're a more advanced js developer, you'll probably try to generate the nodes with some js functions and then append them to the containers and so on, which by the way make things a lot more complicated to the person that maintains the code later on. If you are smart you'll probably think about templating. The normal approach would be to write some functions in which you input some template data:

 var Template = {
 helloThere : function (name){
  return 'Hello there '+name+' !';

When you paste the code into you firebug, you probably should see a "Hello there Piter !
" text. What we have achieved is to unbind the html from the javascript, which is the right way to go, but still when you look at the Template.helloThere() function, its a pretty messy piece of code. Let us add some internationalization features like so:

var Template = {
 helloThere : function (hello, name){
  return ''+hello+' '+name+' !';
 document.write(Template.helloThere('Hi there','Piter'));

It's not hard to imagine how this template would look like within a real exapmle that has 20-50 html tags, full I18n support and additional data variables. Even if you would pass the variables as arrays the main problem stands: spaghetti html and js.
Fortunately there is a way to rapidly develop templates and manage the very easily, especially in a team when the js programmer and web developer are two different people. The answer is Google Closure Templates. The pretty recently featured Template engine is used among the most commonly used G applications like Gmail, co as Google says it's well battle-tested :-), and well documented. It not only speeds up development of Javascript/Java but also makes the code a lot easier to maintain and localize. Let's Have a look on what the code above would look like in GCT, but first you need to follow the instructions on Google code Closure site, in a shortcut (for windows/javascript):

  1. You need to install Java Runtime Environment,
  2. Download the closure template compiler (for js)
  3. Unpack the library and put it in some locally hosted folder (you need Apache or some other web server)
  4. Create a file and start editing it

//soy example
{namespace Template}

* Simple soy file example
* Here you declare variables via the documentation comment
* @param hello - the hello message
* @param name - name of the greated person
{template .helloThere}
  {$hello} {$name} !

What you want do do is to compile the
java -jar SoyToSrcCompiler.jar --outputPathFormat simple.js

If you get some exceptions... deal with them :) They are usually pretty strait forward and easy to understand. If there are no errors, you'll get a simple.js that is the javascript compiled version of your template file. After linking soyutils.js and simple.js to a html file you can post the code in a script section or just into firebug:

document.write(Template.helloThere({hello:'Hi there', name: 'Piter'})); 
 document.write(Template.helloThere({hello:'Witaj', name: 'Piter'})); 
 document.write(Template.helloThere({hello:'Bonjour', name: 'Piter'})); 

The code looks very similar to the simple javascript function but it has a real advantage, you can request a ready template from your web developer and just fill the gaps with variables, what makes developing really fast and easy for everyone. You even can put your translations in another file so your translators can work at the same time too.

Summarizing Closure Templates have every advantage a normal e.g. php based template engine has, like variables, globals, loops, conditional functions and more. Everything is documented at so as my Linux addicted friends would say RTFM and start using Google Closure Templates. Also there is no catch, Google serves it free under the Apache License 2.0 and encourages strongly to use them.

BTW Still hate Bloggers text editor... that is why every post looks different :)

Thursday, January 14, 2010

Get a load of your database - paginated caching

Your site is getting awfully slow? There's just to much reads to your database and you have already tweaked the performance of every query? In most cases data caching is the solution to your problem!

The idea is to cache all processed data you heave retrieved from the database. Let us look on a example. It uses a mockup class that basically can handle any caching system like memcached or xcache:

//just a mockup
abstract class cache{
    static public function save($pool, $data, $ttl){
    //some cache processing
    static public function load($pool){

Now what we want to do is to save time by implementing caching on heavy load environments. The way to save some execution time and/or decrease database load is to search the cache first, before even querying the db:

//first define what we will save/load adn the time
//the pool name is mainly for verification (if you even write/read the right thing) and is optional

$pool_name = 'somePoolName_';
$data_name = 'someDataName-userId';
$ttl = 3600;

//load data from cache
$data = cache::load($pool_name.$data_name);

//if theres no data in the cache, we will get a false/null value
if(!$data || $data == 'thereIsNoData'){
    $data = DB::exec($someHeavySql);
    //here's a sensitive point where many people make a mistake
    //if you would just save the data and go on with your life, there is a big probability that
    //you would save a NULL/false value. We need to avoid that

        cache::save($pool_name.$data_name, 'thereIsNoData',$ttl);
        cache::save($pool_name.$data_name, $data, $ttl);

Every time someone generates a page view, the data is either cached, or retrieved and the cache is field. Either way we avoid to execute "$someHeavySql" for $ttl seconds. That's just the easy part. What we need to accomplish here is cache almost everything including paginated results.
It's not hard to imagine the need to paginate query results. Let's just think about a site with products. There are about 500 products in the sites database, and there is no way of showing everything on one page (maybe in 1px font size ;-) ) Because the products page is so popular that our database is barely handling all requests, we decided that we will use a caching layer to help the database a little. The problem is the whole pagination issue. When we cached every site like above, adding the page, we encountered a problem.

//pool name, data name, page
$pool_name = 'products_';
$data_name = 'mainProducts-'.$page;

Every time we change the price of a product, we need to delete the cached data. The problem is that we never know on which page is the product we changed, therefore which one to clear. Obviously we need to delete all the cached pages, the product could change, it could be deleted or there could be a new one inserted. Either way the whole cache becomes obsolete. We could iterate and delete all the pages, but that would be very time costly unnecessary. What we want to achieve is to give the cache name an additional parameter.

$cache_version = cache::load('someCacheWithTheVersion');
$pool_name = 'products_';
$data_name = 'mainProducts-'.$cache_version.'-'.$page;

Now when we want do delete the cache, we just increment the version via cache. All the older cached pages become unused and eventually get deleted by the caches garbage collector (if present). Unfortunately we need to make and additional cache request for every page, but it still saves us a lot of resources.
Another problem is how the development cycle of a site. When for instance you have an array with 10 keys that you always cache, and you've made some changes to the structure of said array. What will happen if you sync the new code do your site? You can imagine the things that could happen from simple errors to site wide errors, data inconsistency and so on. You can flush the whole cache but then your databases will get overwhelmed by all the requests, that of course get cached eventually but firs will produce a very nasty vain on your DBA's
 forehead ;-). The easiest way to ensure the usage of new data is a additional version attribute for each cache pool:

//name-version array
$caches = array(
$pool_version = $caches[ $cache_name ][0];
$pool_name = // the poll name and version number, then data, pages and so on

You don't even need to increase the version numbers, just make sure they change every time, and include them in the pool names within your caching layer class.

Hope this helps someone to get started with caching data :-)

BTW: I'm really starting to hate blogger's WYSIWYG... it is really annoying...

Tuesday, January 12, 2010

Sphinx and big integers

I encountered a problem that can give headaches even to the toughest programmers. I think that the solution will be quite a relief to some people.
The problem is when indexing in sphinx, the indexer sometimes prints out a warning/error message:

WARNING: DOCID_MAX document_id, skipping

which by the way is very annoying because it stops your indexing :)

I've searched the web for some solution, but I didn't find anything useful. When I experimented a little with the query I have discovered that the base id's are simply to big. The sphinx was built without the --enable-id64 option, what introduced only 32 integers and not micro timestamp id integers that have about 15 digits. 

Hope that this solution  helps some some of you to get to like sphinx ;-)

Wednesday, January 6, 2010

What motivates IT professionals?

In my practice as a programmer I've come across some interesting people. A lot of my friends and co-workers of course are programmers, designers, network administrators. I is quite fascinating to look from a objective position on everyone and watch what motivates them the most. I've seen a few separate types of things that motivate people around. Here are a few person types I could distinguish:

  1. The Banker - Money is the most obvious reason to go to work every day, but not the only, and for many people not the main motivator. Bankers tend to work better given small raises/bonuses now and then, and they are absolutely not to argue with if they want a get-or-leave-raise. The words "either I get a raise in pay or I'll leave" are dead serious with this kind of people, they most probably already found a new employer and just want to give you a last chance to hold them in your company.
  2. The Creator - this part of the motivation pie is a pretty big piece in most of my programmer friends. The Creator creates stuff just to see it work ( to see it being used is IMHO something else ). People who are motivated by creating can write something that was written a lot of times and try to make it better every time. This is pretty much the "easiest" employee to keep happy, a creator type just needs a computer and most often someone to participate and share ideas with.
  3. The Innovator- now innovation is seen pretty often among IT workers. The Innovator absolutely hates having people to him/her to make another message board or log in form. He/She like a cat wanting to go his/hers own ways. They thrive on new and interesting endeavours and ideas. Now that's a great thing, but people with this quality tend to produce unstable, poorly maintainable solutions, so they can work on another interesting thing. When you have an enough innovative field in IT, you can easily hold the innovator employee, given that he has a choice to go somewhere with a little better pay but a boring business profile.
  4. The Egoist - over the time I've seen many people thrilled by the idea that something they create will be used by a large number of people and/or his/hers friends. I think that an Egoist is a Creator that does it just for a different reason.The Egoist is very hard to keep happy. Statistically I can just say that most graphics designers, artists basically, are the egoist type. People with that quality tend to work much harder and better given the right amount of "ego steroids" like ensuring they are the best people for the job or that their work will be most certainly recognised by many people.
Here is the place where the "worker-key-system" comes in. Try to imagine a IT professional as a kind of a small piano keyboard and his/hers employer as the player like in the picture (the employer being the rat ;-) ) 

[image source:]

Now what you want to do as the boss, is to play a good and robust song, or to make a good and robust program with the least amount of spending. This keyboard is a very funny instrument. It already knows some songs, but it won't make absolutely no sound when stricken in the wrong key, and will eventually break if you won't strike the right key with the right force, and produce the same song if you strike the right keys with more force that they need.
Leaving all metaphors, only a good project manager can ensure the right amount of money, creation, innovation and ego by hiring the right people and balancing out the proportions to make a great product with the least amount of money. If your project is another CMS like millions out there, don't hire innovators, the surely aren't the right people for the job, because they'll get bored and leave for a more interesting position. If you are creating a start-up, don't hire bankers, they are very good at what they do, but they will leave your project when given the first opportunity to get more money. There surely are many more aspects to take in account but in my opinion these are the main things that drive IT specialists to work better, harder and enjoy their work more, thus generating many good ideas and products for your company.

Sunday, January 3, 2010

How to build a fast, good scaling user notification system pt. 2

This post is the second part of my earlier post on How to build a fast, good scaling user notification system where we discussed the problem areas of the said system. This post will be mostly about a strategy on the infrastructure to store all those notifications and retrieve them as fast as possible.
The most common approach to store data like user notifications would be to create a table containign a PK id, timestamp when the notification was added and of course some additional meta data, for instance:

CREATE TABLE IF NOT EXISTS `notifications` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `date_added` timestamp NOT NULL default '0000-00-00 00:00:00',
  `user_id` int(10) unsigned NOT NULL,
  `content` char(255) NOT NULL,
  PRIMARY KEY  (`id`)

Every user entering his/hers notification page (which btw is the most frequently visited page) will most probably generate an sql like this:

SELECT id, date_added, content FROM `notifications` WHERE user_id = 123 ORDER BY date_added DESC limit 0,10

You will need the date_added to show to the user when the event occurred, the PK id to take some additional actions on the event and finally the content of the specified notification, we also have to apply a limit to the query to paginate the results. To make queries faster let us create a covering index for the query e.g. modifying the primary index or adding a separate index:

ALTER TABLE  `test`.`notifications` DROP PRIMARY KEY ,
ADD PRIMARY KEY (  `id` ,  `user_id` ,  `date_added` )

Don't get me wrong, that above is a perfectly good system to store user notifications, it is simple, pretty fast up to a certain point, but we can do it better.

First let us think about the holding of chronological data. When you insert data in a cronological order, the PK id is always in some correlation with the timestamp of when the event occurred. Both fields are always growing but we don't really need the id, only the timestamp is of some value. What we can do is create an ID before inserting the data:

$micro = explode(' ',microtime());
define('MICRO_ID', $micro[1].substr($micro[0],2,5));

That piece of code defines a constant consisting of 10 digits representing the current timestamp and 5 digits representing the micro timestamp. So we have a unique time-related ID that we insert instead of the auto-incremented id and the date_added, and we get a free! index on the timestamp value. To prepare the new structure first we need to change the `notifications` table:

ALTER TABLE  `notifications` DROP  `date_added`;
ALTER TABLE  `notifications` CHANGE  `id`  `id` BIGINT( 15 ) NOT NULL;
ADD PRIMARY KEY ( `id`, `user_id` );

Let me show you an example of an insert in a ORM like Doctrine:

$nt = new Notifications();
$nt->id = MICRO_ID;
$nt->content = 'Some event description';
$nt->user_id = 1;

Surely ;-) you wonder "How about an multi-server environment?". Well, actually I've seen it work in a a big (100+ servers) environment with just 3 digits of micro time and it worked just great. The key here is to make a combined primary index. The only issue we need to face is when we insert two notifications to the same user in the lifetime of the same program. To be extra careful we can use a ON DUPLICATE KEY clause or simply increment the MICRO_ID constant by 1 via a new variable of course. To construct a proper index, we need to look at the new query retrieving the data:

SELECT id, user_id, content FROM  `notifications` WHERE user_id = 123 ORDER BY id DESC LIMIT 0,10

Because of the user specific nature of the query we have to rebuild the primary index. The trick is to make a combined index on the user_id first and the id, which is the pk, second. When you'll try (try it) to make it the other way around, MySQL will have to make a extra pass to sort the values by id, therefore making it a lot slower (I had a performance drop of over 10 times !).

ALTER TABLE  `test`.`notifications` DROP PRIMARY KEY ,
ADD PRIMARY KEY (  `user_id` ,  `id` );

The explain plan for the SELECT query above:

select type: simple;
type: ref;
possible keys: PRIMARY;
key len: 4;
ref: const;
Extra: Using Where

That looks pretty nice. Our query uses the index we have created earlier and it is very fast, in addition we save a whole lot of space that the timestamp field would take up.

The normalization of the table was left out for last on purpose. We need to divide the `content` field into some smaller ones. The real advantage of this kind of notification system is that the whole text is in a template which is used by a object in the program itself (example hierarchy generated with You can of course use some complicated design patterns to enhance the flexibility of the system but the baseline is pretty simple:

In this example \tThe render() method generates the html for each notification row via the template set seperatly for each subclass and a prepareData() method to make additional processing of raw data. We obviosly need another field in the notification table containing the type id:

ALTER TABLE `notifications` DROP `content`;
ALTER TABLE `notifications` ADD `type_id` TINYINT NOT NULL ,
ADD `info` CHAR( 100 ) NOT NULL;

The info field stores all the ids, of short text information you need e.g. two ids and some text '123||456||some text' that you can split later on in the mentioned prepareData() method before using the render() method. I'm not going to carry on about the implementation specifics, we just need to know the baseline of the system. We have successfully created a table structure and a class hierarchy to select and insert user notifications. The next post will handle about inserting multiple notifications (notify multiple users about an event), and the structures to provide such functionality.

by the way... WYSIWYG in blogger is really lame... thinking about going to wordpress...

Friday, January 1, 2010

Get data without reading it - the power of covering indexes in MySQL

It's no real breakthrough, but it can have a very significant performance meaning to use covering indexes. Now you are sure to ask what a covering index really is? The answer is quite simple. If a index can provide all the data a query needs it becomes a covering index for it thus covering all the data the query needs. Still sounds too complicated? Let's have a look at a simple example. Let us suppose a pretty common example. A table with users, ids, birth dates, join dates. For some marketing purposes we need to determine the average birth year of users that joined us from the beginning of 2006:

1. Let's create a simple table:

  `id` int(11) NOT NULL auto_increment,
  `birth_date` date NOT NULL,
  `joined_date` timestamp NOT NULL default CURRENT_TIMESTAMP,
  PRIMARY KEY  (`id`)

2. We need to insert some random data. To make it a more real-life case I inserted 1 000 000 rows, hope my notebook (which has the test mysql running) won't explode ;-)

INSERT INTO `example_table` (`name`, `birth_date`, `joined_date`) VALUES
[a few pretty much random records]

3. What we need to acomplish is the result of a simple sql query:

SELECT AVG(YEAR(`birth_date`)) FROM `users` 
WHERE joined_date >= '2006-01-01 00:00:00'

First I'll try to calculate the average using the AVG() function without any indexes except the PK index. The explain plan looks something like this:

select type: simple,
table: users,
type: all,
possible keys: null,
key_len: null,
ref: null,
rows: 1000000,
extra: Using where

The "null" after the possible_keys indicates that there are no indexes to be used with this select. There is pretty much no difference if the query cache is primed or not if you have to fetch and search for the data making disk I/Os. Let us add two indexes: one for the `birth_year` field and one for the `joined_date` column and see how it looks on the explain plan:

ALTER TABLE  `test`.`users` ADD INDEX  `i_birth` (  `birth_date` );
ALTER TABLE  `test`.`users` ADD INDEX  `i_joined` (  `joined_date` );

Explain plan:

select type: simple,
table: users,
type: all,
possible keys: i_joined,
key_len: null,
ref: null,
rows: 1000000,
extra: Using where

You can see that MySQL tried to find a fitting index to execute the query but still it took forever with an empty query cache, an eternity. That's clearly the wrong way to go.

Now we will create a covering index for the said query.

ALTER TABLE  `test`.`users` ADD INDEX  `i_birth_joined` (  `birth_date`, `joined_date` );

The index has all the fields the query needs to be executed so the explain plan :

select type: simple,
table: users,
type: index,
possible keys: i_birth_joined,
key_len: 7,
ref: null,
rows: 1000000,
extra: Using where; Using index 

The changes are quite clear. The covering index we created caused MySQL to use it not only in the search operation but also as a data provider. That speeds up thing a whole lot but it also slows down every INSERT/UPDATE operation with rebuilding a large index.


While creating a covering indexes is a good practice to speed up select queries, you have to take into account the downside of building/rebuilding large indexes that cover a lot of data, the obvious extra space the indexes take up in memory and disk space, or finally the temptation to make too much indexes, instead of taking some slower queries to the background via e.g. a cron process that fills a cache.