Benchmarking (in general)

I just wanted to follow up on Davey's post about extending PDO, because I think he isn't being very clear about what he's doing (no offence, Davey!)

Davey's Cortex framework allows you to pass in either a DSN or an already-instantiated PDO object to its data object class, and Davey's post claims, quite rightly, that it is faster not to create a brand new connection each time you create an instance of his framework objects.

Let's see if we can come up with a slightly more scientific and perhaps more fair set of tests.

When benchmarking it's important to get some decent numbers. If your test is over in less than 1 second, your readings are probably wildly inaccurate, because your system may not have had a chance to properly utilize hardware or OS-level caches, or otherwise adjust in the same way it would under a consistent level of load.

If you're running a quick test, try to make it last more than 10 seconds. If you want better numbers, run the test for longer; 5 or 10 minutes should give pretty decent results for a given code fragment, and an hour is probably as good as you could ever hope for.

Here's a simple test harness that runs for approximately 5 minutes:

<?php
    $start = time();
    $deadline = $start + (5 * 60); // run for 5 minutes
    $iters = 0;
    do {    
      something();
      ++$iters;
    } while (time() <= $deadline);
    $end = time();
    $diff = $end - $start;
    printf("Ran %.2f iterations per minute (%d/%d)\\n",
        (60.0 * $iters) / $diff, $iters, $diff);
    ?>

This harness simply repeats a task until the time limit is more or less up, and then summarizes how many times it managed to run within the specified time, normalizing it to iterations per minute.

Notice that I'm not particularly bothered about sub-second time intervals here, because they don't really have much impact when compared to a 5 minute duration; 5 minutes plus or minus half a second is still, near as damn it, 5 minutes.

Our first test creates some kind of object that does some kind of work on a database connection. We'll make this one re-connect to the database each time; this is equivalent to Davey's extending PDO case:

<?php
   // represents some object in your framework
   class TestObject {
      var $db;
      function __construct($db) {
         $this->db = $db;
      }
      function doWork() {
         # Limited to 100 rows, because the connection cost
         # will be lost in the noise of the fetch otherwise
         array_reverse($this->db->query("select * from words LIMIT 100")->fetchAll());
      }
   }
   function something() {
       global $dsn, $user, $pass; // connection details defined at the top of the test script
       $db = new PDO($dsn, $user, $pass);
       $obj = new TestObject($db);
       $obj->doWork();
   }
   ?>

The next test uses the same test object class, but caches the PDO instance. This is equivalent to Davey's call proxying case:

<?php
   function something() {
       global $dsn, $user, $pass;
       static $db = null;
       if ($db === null) $db = new PDO($dsn, $user, $pass);
       $obj = new TestObject($db);
       $obj->doWork();
   }
   ?>

The third test uses persistent connections; this is equivalent to Davey's extending PDO case, but "smarter". Even though $db falls out of scope and is destroyed at the end of each call to the something() function, the underlying connection is cached, so subsequent calls don't need to re-connect. This is transparent to the calling script, apart from the extra parameter to the constructor, and is generally a very good thing to do with database connections:

<?php
   function something() {
       global $dsn, $user, $pass;
       $db = new PDO($dsn, $user, $pass, array(PDO::ATTR_PERSISTENT => true));
       $obj = new TestObject($db);
       $obj->doWork();
   }
   ?>

Since I'm lazy, I'm only running mine for about 30 seconds each. I used a sqlite database with the contents of /usr/share/dict/words inserted into it (234937 words).
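A table like that can be knocked together with a quick loading script along these lines (a rough sketch; the words.db filename and the single word column are just placeholders, not anything you need to copy exactly):

<?php
   // rough sketch: load /usr/share/dict/words into a single-column
   // sqlite table called "words"
   $db = new PDO('sqlite:words.db');
   $db->exec('CREATE TABLE words (word TEXT)');
   $insert = $db->prepare('INSERT INTO words (word) VALUES (?)');
   $db->beginTransaction(); // one big transaction keeps the bulk insert quick
   foreach (file('/usr/share/dict/words', FILE_IGNORE_NEW_LINES) as $word) {
      $insert->execute(array($word));
   }
   $db->commit();
   ?>

Here are the results I got: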

   one:   Ran 46734.19 iterations per minute (24146/31)
   two:   Ran 68504.52 iterations per minute (35394/31)
   three: Ran 64689.68 iterations per minute (33423/31)

The results speak for themselves: initiating a new connection every time you want to do some work is the slowest approach. Caching the connection in a PHP variable is faster than using persistent connections through PDO, because it doesn't need to create a new PDO object each time. Persistent connections are "almost" as fast as caching in a PHP variable; they still create a new object on each call, but reference the same underlying connection internally.

It's worth mentioning that benchmarks are tricky things. For instance, if you take out the "LIMIT 100" clause from the SELECT statement, the connection overhead becomes so small compared to the time it takes to fetch the data that all the tests wind up the same (about 18 iterations per minute with my test data). Similarly, if you limit the fetch to 1 row, you'll see a more exaggerated difference in the numbers, because the benchmark script is exercising your system differently.
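To make those two variations concrete, they only involve changing the query inside doWork(); hypothetical variants of the TestObject class above would look something like this:

<?php
   // hypothetical variants of TestObject::doWork() from the tests above
   class TestObjectNoLimit extends TestObject {
      function doWork() {
         # no LIMIT: fetching every row dominates the run, so the connection
         # overhead is lost in the noise and all three tests converge
         array_reverse($this->db->query("select * from words")->fetchAll());
      }
   }
   class TestObjectOneRow extends TestObject {
      function doWork() {
         # a single row: fetching is nearly free, so the connection
         # overhead dominates and the gap between the tests widens
         array_reverse($this->db->query("select * from words LIMIT 1")->fetchAll());
      }
   }
   ?>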

If you're running against mysql, the difference between test one and test two will be greater, because there is more overhead in establishing a connection over a socket than there is for sqlite to open up a file or two. You'll see a bigger difference again when connecting to Oracle, because it does a pretty hefty amount of work at connection time.
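Repeating the tests against mysql only means changing the connection details that the something() functions pick up; a minimal sketch, with hypothetical credentials and assuming a copy of the words table on a local server:

<?php
   // hypothetical mysql connection details; swap these in for the sqlite
   // DSN and the same tests run unchanged
   $dsn  = 'mysql:host=localhost;dbname=test';
   $user = 'benchuser';   // assumed account
   $pass = 'benchpass';   // assumed password
   ?>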

The main lesson to be learned here is that benchmarking an artificial code fragment will give you artificial results; they can help you gauge how fast something will run in general, but the answers you get back depend on the questions you're asking. If you don't ask appropriate questions, or enough of them (eg: Davey's quick tests didn't include persistent connections), you're not going to get all the information you need to tune your application effectively.

PS: There's a fourth test case that I didn't cover; it's important and probably yields the best results out of all the cases presented here. Anyone care to suggest what that case might be?