Optimizing database queries in CakePHP

We’ve been using CakePHP for a while now. I’ve found it to be very helpful for quick prototyping and development of core site functionality. This is nice because it allows more time for tweaking a site’s user interface, functionality, and performance. That last one, performance, can occasionally be a challenge. One of the means of optimizing a site’s performance is through the crafting of database queries that take into account indexes and conditions. But because CakePHP is building the SQL we have to keep in mind how CakePHP does this as we construct our query parameters.

Let’s review one of those situations. Say we have the following (simplified) models:

class Item extends AppModel {
  var $hasMany = array('Answer');
  var $belongsTo = array('Packet');
}

class Student extends AppModel {
  var $hasMany = array('Answer');
} 

class Answer extends AppModel {
  var $belongsTo = array('Student','Item','Packet');
}

class Packet extends AppModel {
  var $hasMany = array('Answer','Student');
  var $hasAndBelongsToMany = array('Item');
}

Let me explain the model relationships. We have test items that are assigned to packets. Packets are sent out for testing to a group of student. So each answer is unique based on the packet, item, and student. The (somewhat circular) relationships between these models provides for easy access to specific data. For example, let’s say we want to know all the answers from a particular state for a couple of our items. Here’s an example query that present some optimization issues:

$queryOpts = array(
  'fields' => array('Item.*');
  'conditions' => array('Item.id' => array(1995, 1726, 1971, 1707, 1972));
  'contain' => array(
    'Answer' => array(
      'fields' => array('Answer.*');
      'conditions' => array(
        'Answer.packet_id IN (SELECT id FROM packets WHERE state = "CA")'
      )
    )
  );
);
$items = $this->Item->find('all',$queryOpts);

You’ll note that I’m not using CakePHP syntax for the Answer model conditional. We had been using this conditional in our web application prior to moving to CakePHP. Since there is not, so far as I can tell, an easy way to work with SQL subqueries we decided to start by forcing a query similar to the original.

Running the above creates the following SQL statement when querying the answers table:

SELECT `Answer`.* FROM `answers` AS `Answer` WHERE `Answer`.`packet_id` IN (SELECT id FROM packets WHERE packet_type = "F") AND `Answer`.`item_id` IN (1995, 1726, 1971, 1707, 1972)

We have a fairly large data set so the above query takes ~32 seconds to run on my system. But that query is not optimized; reordering the conditions in the WHERE clause produces the following query:

SELECT `Answer`.* FROM `answers` AS `Answer` WHERE `Answer`.`item_id` IN (1995, 1726, 1971, 1707, 1972) AND `Answer`.`packet_id` IN (SELECT id FROM packets WHERE packet_type = "F")

The optimized query performs significantly better, <1 second.

Since for this example we know which items we want we can update the query to manually force the desired ordering:

$queryOpts = array(
  'fields' => array('Item.*');
  'conditions' => array('Item.id' => array(1995, 1726, 1971, 1707, 1972));
  'contain' => array(
    'Answer' => array(
      'fields' => array('Answer.*');
      'conditions' => array(
        'Item.id' => array(1995, 1726, 1971, 1707, 1972)
        'Answer.packet_id IN (SELECT id FROM packets WHERE state = "CA")'
      )
    )
  );
);
$items = $this->Item->find('all',$queryOpts);

But what if we don’t know what items we want. Say we’re pulling items (and answers) as part of another containable query. We have no way to specify the items in the Answer model conditionals. But we can improve the query so that we’re not using a subquery each time (i.e. we’re doing it the “CakePHP” way):

$fields = array('Packet.id');
$conditions = array('Packet.packet_type' => 'F');
$packet_list = $this->Item->Packet->find('list',array('fields'=>$fields,'conditions'=>$conditions));

$queryOpts = array(
  'fields' => array('Item.*');
  'conditions' => array('Item.id' => array(1995, 1726, 1971, 1707, 1972));
  'contain' => array(
    'Answer' => array(
      'fields' => array('Answer.*');
      'conditions' => array(
        'Answer.packet_id' => $packet_list
      )
    )
  );
);
$items = $this->Item->find('all',$queryOpts);

Which produces the following two queries:

SELECT `Packet`.`id` FROM `packets` AS `Packet` WHERE `Packet`.`packet_type` = 'F';
SELECT `Answer`.*, `Answer`.`item_id` FROM `answers` AS `Answer` WHERE `Answer`.`packet_id` IN (1278, 1277, 1276, 1274,...,1726, 1971, 1972, 1995);

Again we’re seeing the kind of performance we want, <1 second for both queries. As good as when we knew all our conditions in advance. The main drawback to doing it this way is the size of the query string sent to your database. If your first query is pulling a large number of results then your second query will be quite large. And most databases have limits on how large a query can be.

This isn’t the only way to get to the end point either. As with any programming task there are a variety of ways to arrive at the desired result. For example, you could break up the query into multiple queries that build off of the previous query. Sticking with our items/answers query you could first get the items in one query. Then foreach through the results and run a query to get the answers, integrating the results manually. It’s more work, and likely would decrease performance since you’re increasing the number of database queries, but if you can’t seem to optimize your all-in-one queries it’s worthwhile to try other options to see if you can get better performance.

Aside: performance problems due to poorly optimized queries can be somewhat mitigated by query caching. Even so, you’re better of with an optimized query to begin with since it saves you from a performance hit every time the relevant cached result is flushed.