Sunday, January 3, 2010

How to build a fast, good scaling user notification system pt. 2

This post is the second part of my earlier post on How to build a fast, good scaling user notification system where we discussed the problem areas of the said system. This post will be mostly about a strategy on the infrastructure to store all those notifications and retrieve them as fast as possible.
The most common approach to store data like user notifications would be to create a table containign a PK id, timestamp when the notification was added and of course some additional meta data, for instance:

CREATE TABLE IF NOT EXISTS `notifications` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `date_added` timestamp NOT NULL default '0000-00-00 00:00:00',
  `user_id` int(10) unsigned NOT NULL,
  `content` char(255) NOT NULL,
  PRIMARY KEY  (`id`)

Every user entering his/hers notification page (which btw is the most frequently visited page) will most probably generate an sql like this:

SELECT id, date_added, content FROM `notifications` WHERE user_id = 123 ORDER BY date_added DESC limit 0,10

You will need the date_added to show to the user when the event occurred, the PK id to take some additional actions on the event and finally the content of the specified notification, we also have to apply a limit to the query to paginate the results. To make queries faster let us create a covering index for the query e.g. modifying the primary index or adding a separate index:

ALTER TABLE  `test`.`notifications` DROP PRIMARY KEY ,
ADD PRIMARY KEY (  `id` ,  `user_id` ,  `date_added` )

Don't get me wrong, that above is a perfectly good system to store user notifications, it is simple, pretty fast up to a certain point, but we can do it better.

First let us think about the holding of chronological data. When you insert data in a cronological order, the PK id is always in some correlation with the timestamp of when the event occurred. Both fields are always growing but we don't really need the id, only the timestamp is of some value. What we can do is create an ID before inserting the data:

$micro = explode(' ',microtime());
define('MICRO_ID', $micro[1].substr($micro[0],2,5));

That piece of code defines a constant consisting of 10 digits representing the current timestamp and 5 digits representing the micro timestamp. So we have a unique time-related ID that we insert instead of the auto-incremented id and the date_added, and we get a free! index on the timestamp value. To prepare the new structure first we need to change the `notifications` table:

ALTER TABLE  `notifications` DROP  `date_added`;
ALTER TABLE  `notifications` CHANGE  `id`  `id` BIGINT( 15 ) NOT NULL;
ADD PRIMARY KEY ( `id`, `user_id` );

Let me show you an example of an insert in a ORM like Doctrine:

$nt = new Notifications();
$nt->id = MICRO_ID;
$nt->content = 'Some event description';
$nt->user_id = 1;

Surely ;-) you wonder "How about an multi-server environment?". Well, actually I've seen it work in a a big (100+ servers) environment with just 3 digits of micro time and it worked just great. The key here is to make a combined primary index. The only issue we need to face is when we insert two notifications to the same user in the lifetime of the same program. To be extra careful we can use a ON DUPLICATE KEY clause or simply increment the MICRO_ID constant by 1 via a new variable of course. To construct a proper index, we need to look at the new query retrieving the data:

SELECT id, user_id, content FROM  `notifications` WHERE user_id = 123 ORDER BY id DESC LIMIT 0,10

Because of the user specific nature of the query we have to rebuild the primary index. The trick is to make a combined index on the user_id first and the id, which is the pk, second. When you'll try (try it) to make it the other way around, MySQL will have to make a extra pass to sort the values by id, therefore making it a lot slower (I had a performance drop of over 10 times !).

ALTER TABLE  `test`.`notifications` DROP PRIMARY KEY ,
ADD PRIMARY KEY (  `user_id` ,  `id` );

The explain plan for the SELECT query above:

select type: simple;
type: ref;
possible keys: PRIMARY;
key len: 4;
ref: const;
Extra: Using Where

That looks pretty nice. Our query uses the index we have created earlier and it is very fast, in addition we save a whole lot of space that the timestamp field would take up.

The normalization of the table was left out for last on purpose. We need to divide the `content` field into some smaller ones. The real advantage of this kind of notification system is that the whole text is in a template which is used by a object in the program itself (example hierarchy generated with You can of course use some complicated design patterns to enhance the flexibility of the system but the baseline is pretty simple:

In this example \tThe render() method generates the html for each notification row via the template set seperatly for each subclass and a prepareData() method to make additional processing of raw data. We obviosly need another field in the notification table containing the type id:

ALTER TABLE `notifications` DROP `content`;
ALTER TABLE `notifications` ADD `type_id` TINYINT NOT NULL ,
ADD `info` CHAR( 100 ) NOT NULL;

The info field stores all the ids, of short text information you need e.g. two ids and some text '123||456||some text' that you can split later on in the mentioned prepareData() method before using the render() method. I'm not going to carry on about the implementation specifics, we just need to know the baseline of the system. We have successfully created a table structure and a class hierarchy to select and insert user notifications. The next post will handle about inserting multiple notifications (notify multiple users about an event), and the structures to provide such functionality.

by the way... WYSIWYG in blogger is really lame... thinking about going to wordpress...

No comments: