Thursday, January 14, 2010

Get a load off your database - paginated caching

Is your site getting awfully slow? Are there just too many reads hitting your database, even though you have already tuned every query? In most cases data caching is the solution to your problem!


The idea is to cache all the processed data you have retrieved from the database. Let us look at an example. It uses a mockup class that could wrap basically any caching system, like memcached or xcache:


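php:
//just a mockup: an in-memory stand-in for memcached, xcache and the like
abstract class cache{
    private static $store = array();
    //save $data under the key $pool for $ttl seconds
    public static function save($pool, $data, $ttl){
        self::$store[$pool] = array('data' => $data, 'expires' => time() + $ttl);
    }
    //return the cached data, or false if it is missing or expired
    public static function load($pool){
        if(!isset(self::$store[$pool]) || self::$store[$pool]['expires'] < time()){
            return false;
        }
        return self::$store[$pool]['data'];
    }
}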



Now what we want to do is save time by adding caching in heavy-load environments. The way to save execution time and/or decrease database load is to search the cache first, before even querying the database:




php:
//first define what we will save/load and for how long
//the pool name is mainly for verification (that you read/write the right thing) and is optional
$pool_name = 'somePoolName_';
$data_name = 'someDataName-userId';
$ttl = 3600;

//load data from cache
$data = cache::load($pool_name.$data_name);


//if there's no data in the cache, we will get a false/null value
if($data === false || $data === null){
    $data = DB::exec($someHeavySql);
    //here's a sensitive point where many people make a mistake
    //if you just saved the data and went on with your life, there is a big probability that
    //you would save a NULL/false value, which looks exactly like a cache miss. We need to avoid that
    if(!$data){
        $data = 'thereIsNoData';
    }
    cache::save($pool_name.$data_name, $data, $ttl);
}
//the placeholder means "the query ran and found nothing", so we don't query again
if($data === 'thereIsNoData'){
    $data = null;
}
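
The lookup above tends to get repeated for every query, so it can be wrapped in one helper. What follows is only a minimal sketch: the ArrayCache class, the NO_DATA constant and the cached() function are names made up for this illustration (they are not part of memcached, xcache or any library), and the closure stands in for DB::exec($someHeavySql).

```php
<?php
// Hypothetical in-memory stand-in for memcached/xcache, for illustration only.
final class ArrayCache
{
    private static $store = array();

    public static function save($key, $data, $ttl)
    {
        self::$store[$key] = array('data' => $data, 'expires' => time() + $ttl);
    }

    // Returns false on a miss or when the entry has expired.
    public static function load($key)
    {
        if (!isset(self::$store[$key]) || self::$store[$key]['expires'] < time()) {
            return false;
        }
        return self::$store[$key]['data'];
    }
}

// Placeholder stored when the query returned nothing (assumed name).
const NO_DATA = 'thereIsNoData';

// Returns the cached value for $key and calls $loader only on a cache miss.
function cached($key, $ttl, callable $loader)
{
    $data = ArrayCache::load($key);
    if ($data === false) {
        $data = $loader();
        // Store the placeholder instead of an empty result, so that
        // "nothing found" is cached too and stays distinct from a miss.
        if (!$data) {
            $data = NO_DATA;
        }
        ArrayCache::save($key, $data, $ttl);
    }
    return $data === NO_DATA ? null : $data;
}

// Usage: the closure stands in for the heavy database query.
$calls = 0;
$loader = function () use (&$calls) {
    $calls++;
    return array('id' => 7, 'name' => 'demo');
};
$first  = cached('somePoolName_someDataName-userId', 3600, $loader);
$second = cached('somePoolName_someDataName-userId', 3600, $loader);
```

The second call never reaches the closure, and an empty result is remembered as well, so a query that finds nothing is not repeated on every page view.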



Every time someone generates a page view, the data is either read from the cache, or retrieved from the database and the cache is filled. Either way we avoid executing "$someHeavySql" more than once per $ttl seconds. That's just the easy part. What we need to accomplish here is to cache almost everything, including paginated results.
It's not hard to imagine the need to paginate query results. Let's just think about a site with products. There are about 500 products in the site's database, and there is no way of showing everything on one page (maybe in 1px font size ;-) ). Because the products page is so popular that our database is barely handling all the requests, we decided to use a caching layer to help the database a little. The problem is the whole pagination issue. When we cached every page like above, adding the page number to the name, we encountered a problem.


php:
//pool name, data name, page
$pool_name = 'products_';
$data_name = 'mainProducts-'.$page;


Every time we change the price of a product, we need to delete the cached data. The problem is that we never know which page the changed product is on, and therefore which page to clear. Obviously we need to delete all the cached pages: the product could change, it could be deleted, or a new one could be inserted. Either way the whole cache becomes obsolete. We could iterate over all the pages and delete them, but that would be costly and unnecessary. What we want instead is to give the cache name an additional parameter.


php:
$cache_version = cache::load('someCacheWithTheVersion');
$pool_name = 'products_';
$data_name = 'mainProducts-'.$cache_version.'-'.$page;




Now when we want to delete the cache, we just increment the version stored in the cache. All the older cached pages become unused and eventually get deleted by the cache's garbage collector (if present). Unfortunately we need to make an additional cache request for every page, but it still saves us a lot of resources.
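
To make the version trick concrete, here is a rough sketch of how reading and bumping the version could look. Everything in it is an assumption made for the example: the VersionedDemoCache class is an in-memory stand-in with no expiry, and products_version(), products_key() and invalidate_products() are hypothetical helper names.

```php
<?php
// Hypothetical in-memory stand-in for a real cache backend (no expiry, for brevity).
final class VersionedDemoCache
{
    private static $store = array();

    public static function save($key, $data)
    {
        self::$store[$key] = $data;
    }

    // Returns false on a miss.
    public static function load($key)
    {
        return isset(self::$store[$key]) ? self::$store[$key] : false;
    }
}

// Reads the current version, starting at 1 if none is stored yet.
function products_version()
{
    $version = VersionedDemoCache::load('products_version');
    if ($version === false) {
        $version = 1;
        VersionedDemoCache::save('products_version', $version);
    }
    return $version;
}

// Builds the key for one page of products under the current version.
function products_key($page)
{
    return 'products_mainProducts-' . products_version() . '-' . $page;
}

// Invalidates every cached page at once by moving to the next version.
function invalidate_products()
{
    VersionedDemoCache::save('products_version', products_version() + 1);
}

VersionedDemoCache::save(products_key(1), array('page 1 rows'));
VersionedDemoCache::save(products_key(2), array('page 2 rows'));

invalidate_products(); // e.g. after a price change

// Both pages now miss, because their keys contain the new version.
$page1 = VersionedDemoCache::load(products_key(1));
$page2 = VersionedDemoCache::load(products_key(2));
```

The read-then-write in invalidate_products() can lose an update when two requests run at the same moment. On a real memcached setup, an atomic increment (for instance Memcached::increment in PHP's Memcached extension) would probably be the safer choice.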
Another problem is the development cycle of a site. Say you have an array with 10 keys that you always cache, and you've made some changes to the structure of that array. What will happen when you sync the new code to your site? You can imagine the things that could happen, from simple errors to site-wide errors, data inconsistency and so on. You can flush the whole cache, but then your databases will get overwhelmed by all the requests, which of course get cached eventually but first will produce a very nasty vein on your DBA's forehead ;-). The easiest way to ensure the usage of new data is an additional version attribute for each cache pool:



php:
//name-version array
$caches = array(
    'namesCache'=>1,
    'someOtherCache'=>3,
);
$cache_name = 'namesCache';
$pool_version = $caches[$cache_name];
//the pool name and version number, then data, pages and so on
$pool_name = $cache_name.'_'.$pool_version.'_';


You don't even need to increment the version numbers; just make sure a pool's number changes whenever its cached structure changes, and include it in the pool names within your caching layer class.
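
As a sketch of what "include them in the pool names" could mean in practice, the key building can live in a single function. The pool_key() helper and the '_v' separator are assumptions made for this example, not part of any caching library.

```php
<?php
// Hypothetical version map: change a number whenever the cached structure changes.
$caches = array(
    'namesCache'     => 1,
    'someOtherCache' => 3,
);

// Builds the full key: pool name, pool version, then the data name.
function pool_key(array $caches, $cache_name, $data_name)
{
    if (!isset($caches[$cache_name])) {
        throw new InvalidArgumentException("Unknown cache pool: $cache_name");
    }
    return $cache_name . '_v' . $caches[$cache_name] . '_' . $data_name;
}

$old_key = pool_key($caches, 'namesCache', 'someDataName-42');

// After a deploy that changes the array layout, the version is changed...
$caches['namesCache'] = 2;
// ...so the new code reads and writes under a different key.
$new_key = pool_key($caches, 'namesCache', 'someDataName-42');
```

Old entries are simply never asked for again and expire on their own, so only the pool whose structure changed has to be rebuilt from the database.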


Hope this helps someone to get started with caching data :-)


BTW: I'm really starting to hate blogger's WYSIWYG... it is really annoying...






