Tag Archives: CakePHP 1.2.x

Optimizing database queries in CakePHP

We’ve been using CakePHP for a while now. I’ve found it to be very helpful for quick prototyping and development of core site functionality. This is nice because it allows more time for tweaking a site’s user interface, functionality, and performance. That last one, performance, can occasionally be a challenge. One of the means of optimizing a site’s performance is through the crafting of database queries that take into account indexes and conditions. But because CakePHP is building the SQL we have to keep in mind how CakePHP does this as we construct our query parameters.

Let’s review one of those situations. Say we have the following (simplified) models:

class Item extends AppModel {
  var $hasMany = array('Answer');
  var $belongsTo = array('Packet');
}

class Student extends AppModel {
  var $hasMany = array('Answer');
} 

class Answer extends AppModel {
  var $belongsTo = array('Student','Item','Packet');
}

class Packet extends AppModel {
  var $hasMany = array('Answer','Student');
  var $hasAndBelongsToMany = array('Item');
}

Let me explain the model relationships. We have test items that are assigned to packets. Packets are sent out for testing to a group of student. So each answer is unique based on the packet, item, and student. The (somewhat circular) relationships between these models provides for easy access to specific data. For example, let’s say we want to know all the answers from a particular state for a couple of our items. Here’s an example query that present some optimization issues:

$queryOpts = array(
  'fields' => array('Item.*');
  'conditions' => array('Item.id' => array(1995, 1726, 1971, 1707, 1972));
  'contain' => array(
    'Answer' => array(
      'fields' => array('Answer.*');
      'conditions' => array(
        'Answer.packet_id IN (SELECT id FROM packets WHERE state = "CA")'
      )
    )
  );
);
$items = $this->Item->find('all',$queryOpts);

You’ll note that I’m not using CakePHP syntax for the Answer model conditional. We had been using this conditional in our web application prior to moving to CakePHP. Since there is not, so far as I can tell, an easy way to work with SQL subqueries we decided to start by forcing a query similar to the original.

Running the above creates the following SQL statement when querying the answers table:

SELECT `Answer`.* FROM `answers` AS `Answer` WHERE `Answer`.`packet_id` IN (SELECT id FROM packets WHERE packet_type = "F") AND `Answer`.`item_id` IN (1995, 1726, 1971, 1707, 1972)

We have a fairly large data set so the above query takes ~32 seconds to run on my system. But that query is not optimized; reordering the conditions in the WHERE clause produces the following query:

SELECT `Answer`.* FROM `answers` AS `Answer` WHERE `Answer`.`item_id` IN (1995, 1726, 1971, 1707, 1972) AND `Answer`.`packet_id` IN (SELECT id FROM packets WHERE packet_type = "F")

The optimized query performs significantly better, <1 second.

Since for this example we know which items we want we can update the query to manually force the desired ordering:

$queryOpts = array(
  'fields' => array('Item.*');
  'conditions' => array('Item.id' => array(1995, 1726, 1971, 1707, 1972));
  'contain' => array(
    'Answer' => array(
      'fields' => array('Answer.*');
      'conditions' => array(
        'Item.id' => array(1995, 1726, 1971, 1707, 1972)
        'Answer.packet_id IN (SELECT id FROM packets WHERE state = "CA")'
      )
    )
  );
);
$items = $this->Item->find('all',$queryOpts);

But what if we don’t know what items we want. Say we’re pulling items (and answers) as part of another containable query. We have no way to specify the items in the Answer model conditionals. But we can improve the query so that we’re not using a subquery each time (i.e. we’re doing it the “CakePHP” way):

$fields = array('Packet.id');
$conditions = array('Packet.packet_type' => 'F');
$packet_list = $this->Item->Packet->find('list',array('fields'=>$fields,'conditions'=>$conditions));

$queryOpts = array(
  'fields' => array('Item.*');
  'conditions' => array('Item.id' => array(1995, 1726, 1971, 1707, 1972));
  'contain' => array(
    'Answer' => array(
      'fields' => array('Answer.*');
      'conditions' => array(
        'Answer.packet_id' => $packet_list
      )
    )
  );
);
$items = $this->Item->find('all',$queryOpts);

Which produces the following two queries:

SELECT `Packet`.`id` FROM `packets` AS `Packet` WHERE `Packet`.`packet_type` = 'F';
SELECT `Answer`.*, `Answer`.`item_id` FROM `answers` AS `Answer` WHERE `Answer`.`packet_id` IN (1278, 1277, 1276, 1274,...,1726, 1971, 1972, 1995);

Again we’re seeing the kind of performance we want, <1 second for both queries. As good as when we knew all our conditions in advance. The main drawback to doing it this way is the size of the query string sent to your database. If your first query is pulling a large number of results then your second query will be quite large. And most databases have limits on how large a query can be.

This isn’t the only way to get to the end point either. As with any programming task there are a variety of ways to arrive at the desired result. For example, you could break up the query into multiple queries that build off of the previous query. Sticking with our items/answers query you could first get the items in one query. Then foreach through the results and run a query to get the answers, integrating the results manually. It’s more work, and likely would decrease performance since you’re increasing the number of database queries, but if you can’t seem to optimize your all-in-one queries it’s worthwhile to try other options to see if you can get better performance.

Aside: performance problems due to poorly optimized queries can be somewhat mitigated by query caching. Even so, you’re better of with an optimized query to begin with since it saves you from a performance hit every time the relevant cached result is flushed.

Integrating calculated fields and model data in CakePHP

(This is mostly a summary of Dealing with calculated fields in CakePHP’s find().)

One of the great things about CakePHP is that if it doesn’t have some core functionality you want/need there are easy ways to add it. More and more I’m taking advantage of this ability. This all comes about because I wanted to have calculated fields available inline with the model data. By calculated fields I mean results that are not data columns (e.g. SELECT *, CURDATE() AS current_date FROM users … yes, that’s a fairly contrived example).

By default Cake places results from calculated fields outside the model data, like this:

Array
(
    [0] => Array
        (
            [User] => Array
                (
                    [id] => 1
                    [username] => aaas
                )

            [0] => Array
                (
                    [current_date] => '2011-05-13'
                )

        )
)

What we want is to place the “current_date” calculated field inside the User model, so it’s more naturally accessed with $users[0]['User']['current_date'] instead of $users[0]['User'][0]['current_date']. Easy enough to do through a model’s afterFind() callback method (to make it widely available place the function in app_model.php).

function afterFind($results, $primary=false) {
	if($primary == true) {
		if(Set::check($results, '0.0') && Set::check($results, "0." . $this->alias)) {
			$fields = array_keys( $results[0][0] );
			foreach($results as $key=>$value) {
				foreach( $fields as $fieldName ) {
					$results[$key][$this->alias][$fieldName] = $value[0][$fieldName];
				}
				unset($results[$key][0]);
			}
		}
	}
	return $results;
}

And yet, even as I add such enhanced functionality to my web apps I’m finding limits. The above logic is complicated a little because you don’t know what type of results you’re getting. The results you get from calling, for example, find('all') versus find('count') are not the same. But Cake doesn’t give any hints to the afterFind() callback, and as a result additional logic is included to try and guess at our data structure. The above adds a quick but incomplete hack by a) checking that the results are for the primary model queried (e.g. not from contained model), b) checking for the presence of nested numerical keys, and c) checking that there is model data to integrate with.

The take away is that while the code produces the desired data structure, in its current form it does so only for specific results.