Saturday, January 23, 2010

Organize your spaghetti JavaScript/Java with Google Closure Templates


If you ever develop something in JavaScript that has to produce some HTML, it's a real chore, for me anyway. Plain DOM manipulation is pretty easy, but what do you do when you need to insert some DOM nodes that weren't there before? I think it's pretty common to write something like this:

//jQuery javascript:
var someHTML = '<p>A new paragraph!</p>';
jQuery('#paragraphContainerDiv').html( someHTML );

The code, if you have jQuery of course, should insert the paragraph as a new node in the selected container. When you're a more advanced JS developer, you'll probably try to generate the nodes with some JS functions and then append them to the containers and so on, which by the way makes things a lot more complicated for the person that maintains the code later on. If you are smart, you'll probably think about templating. The normal approach would be to write some functions into which you feed some template data:

//javascript:
var Template = {
  helloThere : function (name){
    return '<p>Hello there '+name+' !</p>';
  }
};
document.write(Template.helloThere('Piter'));

When you paste the code into your Firebug console, you should see a "Hello there Piter !" paragraph. What we have achieved is to unbind the HTML from the JavaScript, which is the right way to go, but still, when you look at the Template.helloThere() function, it's a pretty messy piece of code. Let us add some internationalization features like so:

//javascript:
var Template = {
  helloThere : function (hello, name){
    return '<p>'+hello+' '+name+' !</p>';
  }
};

//english
document.write(Template.helloThere('Hi there','Piter'));
//polish
document.write(Template.helloThere('Witaj','Piter'));
//french
document.write(Template.helloThere('Bonjour','Piter'));
//japanese
document.write(Template.helloThere('Konnichiwa','Piter'));

It's not hard to imagine what this template would look like in a real example that has 20-50 HTML tags, full I18n support and additional data variables. Even if you passed the variables as arrays, the main problem remains: spaghetti HTML and JS.
Fortunately there is a way to rapidly develop templates and manage them very easily, especially in a team where the JS programmer and the web developer are two different people. The answer is Google Closure Templates. This fairly recently released template engine is used in the most commonly used Google applications like Gmail, so as Google says it's well battle-tested :-), and it is well documented. It not only speeds up development in JavaScript/Java but also makes the code a lot easier to maintain and localize. Let's have a look at what the code above would look like in GCT, but first you need to follow the instructions on the Google Code Closure site; in short (for Windows/JavaScript):

  1. Install the Java Runtime Environment,
  2. Download the Closure Templates compiler (for JS),
  3. Unpack the library and put it in some locally hosted folder (you need Apache or some other web server),
  4. Create a simple.soy file and start editing it:

//soy example
{namespace Template}

/**
 * Simple soy file example
 * Here you declare variables via the documentation comment
 * @param hello - the hello message
 * @param name - name of the greeted person
 */
{template .helloThere}
  <p>{$hello} {$name} !</p>
{/template}

What you want to do is to compile simple.soy:
java -jar SoyToJsSrcCompiler.jar --outputPathFormat simple.js simple.soy

If you get some exceptions... deal with them :) They are usually pretty straightforward and easy to understand. If there are no errors, you'll get a simple.js that is the compiled JavaScript version of your template file. After linking soyutils.js and simple.js to an HTML file you can post the code in a script section or just into Firebug:

document.write(Template.helloThere({hello:'Hi there', name: 'Piter'})); 
 document.write(Template.helloThere({hello:'Witaj', name: 'Piter'})); 
 document.write(Template.helloThere({hello:'Bonjour', name: 'Piter'})); 
 document.write(Template.helloThere({hello:'Konnichiwa',name:'Piter'})); 

The code looks very similar to the simple JavaScript function, but it has a real advantage: you can request a ready template from your web developer and just fill the gaps with variables, which makes development really fast and easy for everyone. You can even put your translations in another file so your translators can work at the same time too.
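
For reference, a minimal host page wiring the compiled template together might look roughly like this (a sketch; only the two file names come from the steps above, the rest of the layout is an assumption):

html:
<html>
 <head>
  <script type="text/javascript" src="soyutils.js"></script>
  <script type="text/javascript" src="simple.js"></script>
 </head>
 <body>
  <script type="text/javascript">
   document.write(Template.helloThere({hello: 'Hi there', name: 'Piter'}));
  </script>
 </body>
</html>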

Summarizing, Closure Templates have every advantage a normal, e.g. PHP-based, template engine has, like variables, globals, loops, conditional functions and more. Everything is documented at http://code.google.com/intl/pl-PL/closure/templates/docs/overview.html so, as my Linux-addicted friends would say, RTFM and start using Google Closure Templates. Also there is no catch: Google serves it free under the Apache License 2.0 and strongly encourages you to use it.

BTW I still hate Blogger's text editor... that is why every post looks different :)

Thursday, January 14, 2010

Get a load of your database - paginated caching

Is your site getting awfully slow? Are there just too many reads hitting your database, and have you already tweaked the performance of every query? In most cases data caching is the solution to your problem!


The idea is to cache all processed data you have retrieved from the database. Let us look at an example. It uses a mockup class that can basically sit on top of any caching system like memcached or XCache:


php:
//just a mockup
abstract class cache{
    static public function save($pool, $data, $ttl){
    //some cache processing
    }
    static public function load($pool){
    }
}



Now what we want to do is to save time by implementing caching in heavy-load environments. The way to save some execution time and/or decrease database load is to search the cache first, before even querying the DB:




//first define what we will save/load and the TTL
//the pool name is mainly for verification (whether you even write/read the right thing) and is optional


php:
$pool_name = 'somePoolName_';
$data_name = 'someDataName-userId';
$ttl = 3600;

//load data from cache
$data = cache::load($pool_name.$data_name);

//if there's no data in the cache, we will get a false/null value
if(!$data){
    $data = DB::exec($someHeavySql);
    //here's a sensitive point where many people make a mistake:
    //if you would just save the data and go on with your life, there is a big probability
    //that you would save a NULL/false value and re-query the database on every request.
    //We need to avoid that.

    if(!$data){
        //remember that the query legitimately returned nothing
        cache::save($pool_name.$data_name, 'thereIsNoData', $ttl);
    }else{
        cache::save($pool_name.$data_name, $data, $ttl);
    }
}elseif($data == 'thereIsNoData'){
    //the cached "no data" marker - treat it as an empty result without hitting the database again
    $data = false;
}



Every time someone generates a page view, the data is either already cached, or retrieved and the cache is filled. Either way we avoid executing "$someHeavySql" for $ttl seconds. That's just the easy part. What we need to accomplish here is to cache almost everything, including paginated results.
It's not hard to imagine the need to paginate query results. Let's just think about a site with products. There are about 500 products in the site's database, and there is no way of showing everything on one page (maybe in 1px font size ;-) ). Because the products page is so popular that our database is barely handling all the requests, we decide to use a caching layer to help the database a little. The problem is the whole pagination issue. When we cache every page like above, adding the page number to the data name, we encounter a problem.


php:
//pool name, data name, page
$pool_name = 'products_';
$data_name = 'mainProducts-'.$page;


Every time we change the price of a product, we need to delete the cached data. The problem is that we never know on which page the product we changed is, and therefore which entry to clear. Obviously we need to delete all the cached pages: the product could change, it could be deleted or a new one could be inserted. Either way the whole cache becomes obsolete. We could iterate over and delete all the pages, but that would be very time-costly and unnecessary. What we want to do instead is give the cache name an additional parameter.


php:
$cache_version = cache::load('someCacheWithTheVersion');
$pool_name = 'products_';
$data_name = 'mainProducts-'.$cache_version.'-'.$page;




Now when we want to invalidate the cache, we just increment the version stored in the cache. All the older cached pages become unused and eventually get deleted by the cache's garbage collector (if present). Unfortunately we need to make an additional cache request for every page, but it still saves us a lot of resources.
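
A minimal sketch of that invalidation step, reusing the cache mockup from above (treating a TTL of 0 as "no expiry" is an assumption about the backend):

php:
//load the current cache version, starting at 1 if it was never set
$cache_version = cache::load('someCacheWithTheVersion');
if(!$cache_version){
    $cache_version = 1;
    cache::save('someCacheWithTheVersion', $cache_version, 0);
}

//...and when a product changes, bump the version once to invalidate every cached page at once
cache::save('someCacheWithTheVersion', $cache_version + 1, 0);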
Another problem comes from the development cycle of a site. Say you have an array with 10 keys that you always cache, and you've made some changes to the structure of said array. What will happen when you sync the new code to your site? You can imagine the things that could happen, from simple errors to site-wide errors, data inconsistency and so on. You could flush the whole cache, but then your databases would get overwhelmed by all the requests, which of course get cached eventually but first produce a very nasty vein on your DBA's forehead ;-). The easiest way to ensure the usage of new data is an additional version attribute for each cache pool:



php:
//name => version array; bump a value whenever the cached structure changes
$caches = array(
    'namesCache'     => 1,
    'someOtherCache' => 3,
);
$pool_version = $caches[$cache_name];
//build the pool name from the name and version, then append the data name, pages and so on
$pool_name = $cache_name.'_'.$pool_version.'_';


You don't even need to increase the version numbers, just make sure they change every time, and include them in the pool names within your caching layer class.
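
A sketch of how that could look inside the caching layer; the class name and pool-name format below are made up for illustration, and it reuses the cache mockup from earlier:

php:
class VersionedCache
{
    //name => version array; change a value whenever the cached structure changes
    private static $versions = array(
        'namesCache'     => 1,
        'someOtherCache' => 3,
    );

    private static function poolName($cache_name, $data_name)
    {
        return $cache_name.'_v'.self::$versions[$cache_name].'_'.$data_name;
    }

    static public function save($cache_name, $data_name, $data, $ttl)
    {
        return cache::save(self::poolName($cache_name, $data_name), $data, $ttl);
    }

    static public function load($cache_name, $data_name)
    {
        return cache::load(self::poolName($cache_name, $data_name));
    }
}

//usage: $data = VersionedCache::load('namesCache', 'someDataName-userId');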


Hope this helps someone to get started with caching data :-)


BTW: I'm really starting to hate blogger's WYSIWYG... it is really annoying...

Tuesday, January 12, 2010

Sphinx and big integers

I encountered a problem that can give headaches even to the toughest programmers. I think that the solution will be quite a relief to some people.
The problem: when indexing in Sphinx, the indexer sometimes prints out a warning/error message:

WARNING: DOCID_MAX document_id, skipping


which by the way is very annoying because it stops your indexing :)


I've searched the web for some solution, but I didn't find anything useful. When I experimented a little with the query, I discovered that the document IDs were simply too big. Sphinx had been built without the --enable-id64 option, so it supported only 32-bit document IDs and not microtime-based IDs that have about 15 digits.
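
The fix is to rebuild Sphinx with 64-bit document IDs and then re-run the indexer; roughly the usual build from source with the extra flag (a sketch, assuming you build Sphinx yourself):

./configure --enable-id64
make
make install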


Hope that this solution helps some of you get to like Sphinx ;-)

Wednesday, January 6, 2010

What motivates IT professionals?

In my practice as a programmer I've come across some interesting people. A lot of my friends and co-workers are, of course, programmers, designers and network administrators. It is quite fascinating to look at everyone from an objective position and watch what motivates them the most. I've seen a few separate types of things that motivate the people around me. Here are a few personality types I could distinguish:


  1. The Banker - Money is the most obvious reason to go to work every day, but not the only one, and for many people not the main motivator. Bankers tend to work better given small raises/bonuses now and then, and they are absolutely not to be argued with if they want a get-or-leave raise. The words "either I get a raise in pay or I'll leave" are dead serious with this kind of person; they have most probably already found a new employer and just want to give you a last chance to keep them in your company.
  2. The Creator - this part of the motivation pie is a pretty big piece for most of my programmer friends. The Creator creates stuff just to see it work (to see it being used is IMHO something else). People who are motivated by creating can write something that has been written a lot of times before and try to make it better every time. This is pretty much the "easiest" employee to keep happy: a creator type just needs a computer and, most often, someone to participate and share ideas with.
  3. The Innovator - innovation is seen pretty often among IT workers. The Innovator absolutely hates having people come to him/her to make another message board or login form. He/she is like a cat wanting to go his/her own way. They thrive on new and interesting endeavours and ideas. Now that's a great thing, but people with this quality tend to produce unstable, poorly maintainable solutions, so they can move on to another interesting thing. When you work in an innovative enough field of IT, you can easily hold on to the innovator employee, even when he has the choice to go somewhere with slightly better pay but a boring business profile.
  4. The Egoist - over time I've seen many people thrilled by the idea that something they create will be used by a large number of people and/or their friends. I think that an Egoist is a Creator who does it just for a different reason. The Egoist is very hard to keep happy. Statistically I can just say that most graphic designers, artists basically, are the egoist type. People with that quality tend to work much harder and better given the right amount of "ego steroids", like assuring them they are the best people for the job or that their work will most certainly be recognised by many people.
Here is the place where the "worker-key-system" comes in. Try to imagine an IT professional as a kind of small piano keyboard and his/her employer as the player, like in the picture (the employer being the rat ;-) ).


[image source: www.pixdaus.com]

Now what you want to do as the boss is to play a good and robust song, or rather to make a good and robust program, with the least amount of spending. This keyboard is a very funny instrument. It already knows some songs, but it will make absolutely no sound when struck on the wrong key, it will eventually break if you don't strike the right key with the right force, and it will produce the same song even if you strike the right keys with more force than they need.
Leaving all metaphors aside, only a good project manager can ensure the right amount of money, creation, innovation and ego by hiring the right people and balancing out the proportions to make a great product with the least amount of money. If your project is another CMS like millions out there, don't hire innovators; they surely aren't the right people for the job, because they'll get bored and leave for a more interesting position. If you are creating a start-up, don't hire bankers; they are very good at what they do, but they will leave your project when given the first opportunity to get more money. There surely are many more aspects to take into account, but in my opinion these are the main things that drive IT specialists to work better, harder and enjoy their work more, thus generating many good ideas and products for your company.

Sunday, January 3, 2010

How to build a fast, good scaling user notification system pt. 2

This post is the second part of my earlier post on How to build a fast, good scaling user notification system, where we discussed the problem areas of said system. This post will be mostly about a strategy for the infrastructure to store all those notifications and retrieve them as fast as possible.
The most common approach to storing data like user notifications would be to create a table containing a PK id, a timestamp of when the notification was added and of course some additional meta data, for instance:

CREATE TABLE IF NOT EXISTS `notifications` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `date_added` timestamp NOT NULL default '0000-00-00 00:00:00',
  `user_id` int(10) unsigned NOT NULL,
  `content` char(255) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

Every user entering his/her notification page (which btw is the most frequently visited page) will most probably generate an SQL query like this:


SELECT id, date_added, content FROM `notifications` WHERE user_id = 123 ORDER BY date_added DESC limit 0,10

You will need the date_added to show the user when the event occurred, the PK id to take some additional actions on the event and finally the content of the specified notification. We also have to apply a limit to the query to paginate the results. To make queries faster, let us create a covering index for the query, e.g. by modifying the primary index or adding a separate index:

ALTER TABLE  `test`.`notifications` DROP PRIMARY KEY ,
ADD PRIMARY KEY (  `id` ,  `user_id` ,  `date_added` )

Don't get me wrong, the above is a perfectly good system to store user notifications; it is simple and pretty fast up to a certain point, but we can do better.

First let us think about how we store chronological data. When you insert data in chronological order, the PK id always has some correlation with the timestamp of when the event occurred. Both fields are always growing, but we don't really need the id; only the timestamp is of some value. What we can do is create an ID before inserting the data:


php:
//microtime() returns e.g. "0.12345678 1262300400" - the fraction first, then the unix timestamp
$micro = explode(' ',microtime());
//10 digits of the timestamp + 5 digits of the fraction = a 15-digit, time-ordered ID
define('MICRO_ID', $micro[1].substr($micro[0],2,5));


That piece of code defines a constant consisting of 10 digits representing the current timestamp and 5 digits representing the microsecond part. So we have a unique, time-related ID that we insert instead of the auto-incremented id and the date_added, and we get an index on the timestamp value for free! To prepare the new structure, first we need to change the `notifications` table:

ALTER TABLE  `notifications` DROP  `date_added`;
ALTER TABLE  `notifications` CHANGE  `id`  `id` BIGINT( 15 ) NOT NULL;
ALTER TABLE  `notifications` DROP PRIMARY KEY ,
ADD PRIMARY KEY ( `id`, `user_id` );


Let me show you an example of an insert in an ORM like Doctrine:

php:
$nt = new Notifications();
$nt->id = MICRO_ID;
$nt->content = 'Some event description';
$nt->user_id = 1;
$nt->save();
 


Surely ;-) you wonder "How about a multi-server environment?". Well, actually I've seen it work in a big (100+ servers) environment with just 3 digits of micro time and it worked just great. The key here is to make a combined primary index. The only issue we need to face is when we insert two notifications for the same user within the lifetime of the same program. To be extra careful we can use an ON DUPLICATE KEY clause or simply increment the MICRO_ID constant by 1 (via a new variable, of course). To construct a proper index, we need to look at the new query retrieving the data:



SELECT id, user_id, content FROM  `notifications` WHERE user_id = 123 ORDER BY id DESC LIMIT 0,10


Because of the user-specific nature of the query we have to rebuild the primary index. The trick is to make a combined index on the user_id first and the id, which is the PK, second. When you try (try it) to make it the other way around, MySQL will have to make an extra pass to sort the values by id, making it a lot slower (I had a performance drop of over 10 times!).

ALTER TABLE  `test`.`notifications` DROP PRIMARY KEY ,
ADD PRIMARY KEY (  `user_id` ,  `id` );


The explain plan for the SELECT query above:

select type: simple;
type: ref;
possible keys: PRIMARY;
key: PRIMARY;
key len: 4;
ref: const;
Extra: Using Where

That looks pretty nice. Our query uses the index we created earlier and it is very fast; in addition we save a whole lot of space that the timestamp field would take up.


The normalization of the table was left out for last on purpose. We need to divide the `content` field into some smaller ones. The real advantage of this kind of notification system is that the whole text lives in a template which is used by an object in the program itself (example hierarchy generated with yuml.me). You can of course use some complicated design patterns to enhance the flexibility of the system, but the baseline is pretty simple:
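
A rough PHP sketch of that baseline; the class names, the placeholder-style templates and the '||' separator (explained a bit further down) are illustrative assumptions, not the exact implementation:

php:
//a rough sketch of the baseline hierarchy described above
abstract class Notification
{
    protected $template = '';   //HTML row template, set separately in each subclass
    protected $data = array();  //processed data, filled by prepareData()

    public function __construct($rawInfo)
    {
        $this->data = $this->prepareData($rawInfo);
    }

    //additional processing of the raw data stored with the notification
    abstract protected function prepareData($rawInfo);

    //generates the HTML for one notification row from the subclass template
    public function render()
    {
        $html = $this->template;
        foreach($this->data as $key => $value){
            $html = str_replace('{'.$key.'}', $value, $html);
        }
        return $html;
    }
}

class FriendRequestNotification extends Notification
{
    protected $template = '<li>User {user_id} sent you a friend request: {text}</li>';

    protected function prepareData($rawInfo)
    {
        //e.g. '123||456||some text' -> two ids and a short message
        list($userId, $targetId, $text) = explode('||', $rawInfo);
        return array('user_id' => $userId, 'target_id' => $targetId, 'text' => $text);
    }
}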


In this example the render() method generates the HTML for each notification row via the template, set separately for each subclass, and a prepareData() method does additional processing of the raw data. We obviously need another field in the notification table containing the type id:

ALTER TABLE `notifications` DROP `content`;
ALTER TABLE `notifications` ADD `type_id` TINYINT NOT NULL ,
ADD `info` CHAR( 100 ) NOT NULL;


The info field stores all the IDs or short text information you need, e.g. two IDs and some text: '123||456||some text', which you can split later on in the mentioned prepareData() method before using the render() method. I'm not going to carry on about the implementation specifics; we just need to know the baseline of the system. We have successfully created a table structure and a class hierarchy to select and insert user notifications. The next post will be about inserting multiple notifications (notifying multiple users about an event), and the structures to provide such functionality.

by the way... WYSIWYG in blogger is really lame... thinking about going to wordpress...

Friday, January 1, 2010

Get data without reading it - the power of covering indexes in MySQL

It's no real breakthrough, but using covering indexes can have a very significant performance impact. Now you are sure to ask: what is a covering index really? The answer is quite simple. If an index can provide all the data a query needs, it becomes a covering index for that query, covering all the data the query needs. Still sounds too complicated? Let's have a look at a simple example. Let us take a pretty common case: a table with users, IDs, birth dates and join dates. For some marketing purposes we need to determine the average birth year of users that joined us from the beginning of 2006:

1. Let's create a simple table:


CREATE TABLE IF NOT EXISTS `users` (
  `id` int(11) NOT NULL auto_increment,
  `birth_date` date NOT NULL,
  `joined_date` timestamp NOT NULL default CURRENT_TIMESTAMP,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci AUTO_INCREMENT=1 ;



2. We need to insert some random data. To make it a more real-life case I inserted 1 000 000 rows; hope my notebook (which runs the test MySQL instance) won't explode ;-)

INSERT INTO `users` (`birth_date`, `joined_date`) VALUES
[a few pretty much random records]


3. What we need to accomplish is the result of a simple SQL query:

SELECT AVG(YEAR(`birth_date`)) FROM `users` 
WHERE joined_date >= '2006-01-01 00:00:00'


First I'll try to calculate the average using the AVG() function without any indexes except the PK index. The explain plan looks something like this:

select type: simple,
table: users,
type: all,
possible keys: null,
key_len: null,
ref: null,
rows: 1000000,
extra: Using where

The "null" after possible_keys indicates that there are no indexes to be used with this select. There is pretty much no difference whether the query cache is primed or not if you have to fetch and search through the data doing disk I/Os. Let us add two indexes, one for the `birth_date` field and one for the `joined_date` column, and see how it looks on the explain plan:

ALTER TABLE  `test`.`users` ADD INDEX  `i_birth` (  `birth_date` );
ALTER TABLE  `test`.`users` ADD INDEX  `i_joined` (  `joined_date` );

Explain plan:

select type: simple,
table: users,
type: all,
possible keys: i_joined,
key_len: null,
ref: null,
rows: 1000000,
extra: Using where

You can see that MySQL tried to find a fitting index to execute the query, but it still took forever with an empty query cache, an eternity. That's clearly the wrong way to go.

Now we will create a covering index for the said query.

ALTER TABLE  `test`.`users` ADD INDEX  `i_birth_joined` (  `birth_date`, `joined_date` );

The index has all the fields the query needs to be executed, so the explain plan looks like this:


select type: simple,
table: users,
type: index,
possible keys: i_birth_joined,
key_len: 7,
ref: null,
rows: 1000000,
extra: Using where; Using index 

The changes are quite clear. The covering index we created caused MySQL to use it not only in the search operation but also as a data provider. That speeds things up a whole lot, but it also slows down every INSERT/UPDATE operation, because a large index has to be updated.

Summarizing:

While creating covering indexes is a good practice to speed up select queries, you have to take into account the downside of building/rebuilding large indexes that cover a lot of data, the obvious extra space the indexes take up in memory and on disk, and finally the temptation to make too many indexes instead of taking some slower queries to the background via e.g. a cron process that fills a cache.