Category Archives: PHP

Optimizing database queries in CakePHP

We’ve been using CakePHP for a while now. I’ve found it to be very helpful for quick prototyping and development of core site functionality. This is nice because it allows more time for tweaking a site’s user interface, functionality, and performance. That last one, performance, can occasionally be a challenge. One of the means of optimizing a site’s performance is through the crafting of database queries that take into account indexes and conditions. But because CakePHP is building the SQL we have to keep in mind how CakePHP does this as we construct our query parameters.

Let’s review one of those situations. Say we have the following (simplified) models:

class Item extends AppModel {
  var $hasMany = array('Answer');
  var $belongsTo = array('Packet');
}

class Student extends AppModel {
  var $hasMany = array('Answer');
} 

class Answer extends AppModel {
  var $belongsTo = array('Student','Item','Packet');
}

class Packet extends AppModel {
  var $hasMany = array('Answer','Student');
  var $hasAndBelongsToMany = array('Item');
}

Let me explain the model relationships. We have test items that are assigned to packets. Packets are sent out for testing to a group of student. So each answer is unique based on the packet, item, and student. The (somewhat circular) relationships between these models provides for easy access to specific data. For example, let’s say we want to know all the answers from a particular state for a couple of our items. Here’s an example query that present some optimization issues:

$queryOpts = array(
  'fields' => array('Item.*');
  'conditions' => array('Item.id' => array(1995, 1726, 1971, 1707, 1972));
  'contain' => array(
    'Answer' => array(
      'fields' => array('Answer.*');
      'conditions' => array(
        'Answer.packet_id IN (SELECT id FROM packets WHERE state = "CA")'
      )
    )
  );
);
$items = $this->Item->find('all',$queryOpts);

You’ll note that I’m not using CakePHP syntax for the Answer model conditional. We had been using this conditional in our web application prior to moving to CakePHP. Since there is not, so far as I can tell, an easy way to work with SQL subqueries we decided to start by forcing a query similar to the original.

Running the above creates the following SQL statement when querying the answers table:

SELECT `Answer`.* FROM `answers` AS `Answer` WHERE `Answer`.`packet_id` IN (SELECT id FROM packets WHERE packet_type = "F") AND `Answer`.`item_id` IN (1995, 1726, 1971, 1707, 1972)

We have a fairly large data set so the above query takes ~32 seconds to run on my system. But that query is not optimized; reordering the conditions in the WHERE clause produces the following query:

SELECT `Answer`.* FROM `answers` AS `Answer` WHERE `Answer`.`item_id` IN (1995, 1726, 1971, 1707, 1972) AND `Answer`.`packet_id` IN (SELECT id FROM packets WHERE packet_type = "F")

The optimized query performs significantly better, <1 second.

Since for this example we know which items we want we can update the query to manually force the desired ordering:

$queryOpts = array(
  'fields' => array('Item.*');
  'conditions' => array('Item.id' => array(1995, 1726, 1971, 1707, 1972));
  'contain' => array(
    'Answer' => array(
      'fields' => array('Answer.*');
      'conditions' => array(
        'Item.id' => array(1995, 1726, 1971, 1707, 1972)
        'Answer.packet_id IN (SELECT id FROM packets WHERE state = "CA")'
      )
    )
  );
);
$items = $this->Item->find('all',$queryOpts);

But what if we don’t know what items we want. Say we’re pulling items (and answers) as part of another containable query. We have no way to specify the items in the Answer model conditionals. But we can improve the query so that we’re not using a subquery each time (i.e. we’re doing it the “CakePHP” way):

$fields = array('Packet.id');
$conditions = array('Packet.packet_type' => 'F');
$packet_list = $this->Item->Packet->find('list',array('fields'=>$fields,'conditions'=>$conditions));

$queryOpts = array(
  'fields' => array('Item.*');
  'conditions' => array('Item.id' => array(1995, 1726, 1971, 1707, 1972));
  'contain' => array(
    'Answer' => array(
      'fields' => array('Answer.*');
      'conditions' => array(
        'Answer.packet_id' => $packet_list
      )
    )
  );
);
$items = $this->Item->find('all',$queryOpts);

Which produces the following two queries:

SELECT `Packet`.`id` FROM `packets` AS `Packet` WHERE `Packet`.`packet_type` = 'F';
SELECT `Answer`.*, `Answer`.`item_id` FROM `answers` AS `Answer` WHERE `Answer`.`packet_id` IN (1278, 1277, 1276, 1274,...,1726, 1971, 1972, 1995);

Again we’re seeing the kind of performance we want, <1 second for both queries. As good as when we knew all our conditions in advance. The main drawback to doing it this way is the size of the query string sent to your database. If your first query is pulling a large number of results then your second query will be quite large. And most databases have limits on how large a query can be.

This isn’t the only way to get to the end point either. As with any programming task there are a variety of ways to arrive at the desired result. For example, you could break up the query into multiple queries that build off of the previous query. Sticking with our items/answers query you could first get the items in one query. Then foreach through the results and run a query to get the answers, integrating the results manually. It’s more work, and likely would decrease performance since you’re increasing the number of database queries, but if you can’t seem to optimize your all-in-one queries it’s worthwhile to try other options to see if you can get better performance.

Aside: performance problems due to poorly optimized queries can be somewhat mitigated by query caching. Even so, you’re better of with an optimized query to begin with since it saves you from a performance hit every time the relevant cached result is flushed.

Integrating calculated fields and model data in CakePHP

(This is mostly a summary of Dealing with calculated fields in CakePHP’s find().)

One of the great things about CakePHP is that if it doesn’t have some core functionality you want/need there are easy ways to add it. More and more I’m taking advantage of this ability. This all comes about because I wanted to have calculated fields available inline with the model data. By calculated fields I mean results that are not data columns (e.g. SELECT *, CURDATE() AS current_date FROM users … yes, that’s a fairly contrived example).

By default Cake places results from calculated fields outside the model data, like this:

Array
(
    [0] => Array
        (
            [User] => Array
                (
                    [id] => 1
                    [username] => aaas
                )

            [0] => Array
                (
                    [current_date] => '2011-05-13'
                )

        )
)

What we want is to place the “current_date” calculated field inside the User model, so it’s more naturally accessed with $users[0]['User']['current_date'] instead of $users[0]['User'][0]['current_date']. Easy enough to do through a model’s afterFind() callback method (to make it widely available place the function in app_model.php).

function afterFind($results, $primary=false) {
	if($primary == true) {
		if(Set::check($results, '0.0') && Set::check($results, "0." . $this->alias)) {
			$fields = array_keys( $results[0][0] );
			foreach($results as $key=>$value) {
				foreach( $fields as $fieldName ) {
					$results[$key][$this->alias][$fieldName] = $value[0][$fieldName];
				}
				unset($results[$key][0]);
			}
		}
	}
	return $results;
}

And yet, even as I add such enhanced functionality to my web apps I’m finding limits. The above logic is complicated a little because you don’t know what type of results you’re getting. The results you get from calling, for example, find('all') versus find('count') are not the same. But Cake doesn’t give any hints to the afterFind() callback, and as a result additional logic is included to try and guess at our data structure. The above adds a quick but incomplete hack by a) checking that the results are for the primary model queried (e.g. not from contained model), b) checking for the presence of nested numerical keys, and c) checking that there is model data to integrate with.

The take away is that while the code produces the desired data structure, in its current form it does so only for specific results.

suhosin to [internal web app]: you talk too much

Following up on my earlier post, I’ve had to make some further configuration adjustments to avoid suhosin-related restrictions in one of our custom web applications. This particular application has a function that generates a summary of data from student assessments. The summaries can be generated based on groupings of packets and items. Depending on the filtering parameters selected there can be a fairly large number of packets and items. Not all of the packets necessarily contain the items of interest, but it’s always easier to select all if you want an overall summary of item performance.

I recently noticed the following alert in the system log:

ALERT – configured POST variable limit exceeded – dropped variable ‘included_packet_ids[]’ (attacker USER_IP_ADDRESS, file REPORT_FILTERING_PAGE)

One of the reasons I use POST variables on this page is to avoid the relatively small data size limit of GET. Suhosin adds additional limits, including in the number of times you can reference an individual variable.  Our limit was set at 1000, meaning there were over 1000 packets selected. This points to a need to adjust how the filter “selects all” … but for now I’ve adjusted the suhosin limit upward by modifying the suhosin.post.max_vars setting.

References:

suhosin to WordPress: go on a diet

We were seeing a lot of suhosin alerts in the system messages log of the type:

ALERT – script tried to increase memory_limit to 268435456 bytes which is above the allowed value (attacker SERVER_IP_ADDRESS, file WP_MAIN_ADMIN_PAGE, line 96)

The source of the issue is WordPress. The application is trying to raise the memory limit and suhosin won’t let it. Apparently WordPress will try to set a 256MB memory limit before executing certain functions. The necessity of adjusting this setting seems questionable to me, but I also understand that it’s often better to play it safe when developing software for public consumption.

I don’t particularly like applications attempting to specify their own resource usage in a web environment. In my mind applications should specify a required/recommended memory limit in the system requirements and stay away from adjusting this setting behind-the-scenes. Tell me during setup if the current setting may result in non-optimal performance or even a halt in script execution. That’s not how it’s done here, but really no harm is done in the long run beyond the annoyance of suhosin throwing errors at the system logs.

There are two easy fixes to the problem:

  1. Set the PHP memory limit to 256MB
  2. Modify the suhosin.memory_limit parameter to 256MB

In our particular situation it’s just as easy to set the PHP memory limit. There’s always a risk of overloading the physical resources, but this site receives little enough traffic that I’m not concerned about the right confluence of request occurring to cause a crash.

References:

ALERT – script tried to increase memory_limit to 268435456 bytes which is above the allowed value

CakePHP and CURRENT_TIMESTAMP

As of cakePHP 1.2.1.8004 you can’t use CURRENT_TIMESTAMP as the default for a column. With MySQL when the default is set to CURRENT_TIMESTAMP the current date+time is inserted for a new row that doesn’t specifically define the value of the column.

From what I can tell, when CakePHP specifies the values for all columns when it creates a new record. A column with a defined default that is not specifically set by the user is manually set by CakePHP (rather than let MySQL handle defaults upon record insertion). But CakePHP doesn’t understand the CURRENT_TIMESTAMP keyword and so treats it as a string and wraps it in quotes. This breaks the resulting INSERT statement and you receive an error:

Incorrect datetime value: ‘CURRENT_TIMESTAMP’

Interestingly, columns that are named “created” and “modified” receive special handling by CakePHP. These columns are treated like auto-update columns by CakePHP and it sets them as expected. With the special handling of these columns in mind it is possible to get around the CURRENT_TIMESTAMP bug by following the recommended settings for created/modified columns, i.e. specifying the field as DATETIME with a default of NULL. CakePHP will automatically update the columns when inserting/updating records.

References:

CakePHP, Apache, and Ampersands

I’m still learning CakePHP. My most recent experimentation is with forms and URL redirecting. In the process I noticed that if the URL includes an ampersand, even in URL-encoded form, CakePHP will read in the URL incorrectly.

In my case, I was taking a form submission and redirecting it to a new page, appending the submitted data as URL parameters (e.g. of the form /controller/action/name:value). CakePHP handles ampersand in POST-submitted forms just fine, but during the redirect I noticed that the value of the parameter was cut off at the ampersand. For example, I’m creating a search form for a list of items (located at /items/index). When you submit the form with a value of “ball & chain” the controller sees it just fine. The form is submitted to the search action (/items/search) which takes the values, constructs a new URL, and redirects back to the item list (/items/index/Search.keywords:ball+%26+chain). However, the index action only sees “ball ” … the rest of the value is assumed to be additional key/value pairs in the querystring.

This problem seems like a pretty major issue to me. A bug report has already been filed, but I don’t believe it will be addressed anytime soon. And for good reason, the bug is actually in Apache’s mod_rewrite module rather than in CakePHP. When mod_rewrite processes a URL it unescapes any escaped characters (newer releases unescape to two levels). What this means for CakePHP is that any escaped characters in the URL become normal characters in the querystring (%26 becomes &).

The best overview of the problem has a good work-around, but it requires modification of the core CakePHP scripts. I’ll leave that to the experts. On our install I’ve found that a triple-escaped value will only be unescaped twice, resulting in the proper escaping after being parsed by mod_rewrite.

References

Playing nice with CPU usage

I needed to do a mass resampling of around 280,000 images. There are a number of ways to do this, but I settled on doing it via PHP because the images are stored on our web server, the total size of the images is large (~10GB), and I didn’t want to kill my machine trying to get it done.

PHP is ideal for a task such as this: parsing directories and subdirectories for images is easy; resampling using the built-in library (GD) is a breeze; specifying the destination as a subdirectory is simple. The one minus was processor usage. Performing image manipulation eats up the CPU in a big way.

Luckily linux systems have a built-in utility for addressing a situation like this: nice. nice will “run a program with modified scheduling priority.” I’m running the image manipulation script using the following command:

nice --adjustment=19 php script.php

If nothing else is going on the script will use whatever resources are available. When anything with a higher priority executes, that program will take precedence over the script with regard to system resources. The script should thus not affect the responsiveness of the web server. This is the reason I was searching for this kind of functionality.

Creating Graphs in PHP

I was tasked with creating a demonstration of a dynamically generated graph for one of the grant proposals being worked on at present. This isn’t something we’ve really done before in PHP, but I had a feeling it would not be a unique requirement. A quick search revealed that a PEAR package exists for just this purpose: Image_Graph.

Installation was not too difficult, though I did have to ensure that a few dependencies were installed before PEAR would load the package. Documentation for the module is not that great, but there are a number of samples and a fairly active discussion forum. (There used to be a decent community, but the site appears to have gone under.)

I did not find the graphing object very intuitive. It seems to me that a lot of the properties and functions are abstracted from the objects they affect. I suspect this was a design decision to allow for greater flexibility, but it may also be due to an older coding style. The current branch appears to have been developed in earnest starting in 2005, and it’s been a few years since a new release. Even so, my initial exposure leads me to believe the package may be stable enough for even a production-level project so long as sufficient QA has been performed.

It took me only a couple of days to hack together the demonstration. Though the requirements of the script were few, I decided to push the limits of my object-oriented PHP and create the graphing component as a class. Overall it was a good first-try at producing something of this nature. You can find the demonstrator at http://flora.p2061.org/weather_data/.

Items Utility: Piloting Data Entry Updates

Piloting packets for this year are starting to come in and the data from these packets needs to be entered into the items utility. Because the questions in the packet have changed from previous years, and due to the rigid structure of the piloting data entry form and data tables, I decided to rewrite the data entry form. The new format allows more flexible on the front-end without requiring modification of the back-end. In addition I modified the way PHP interacts with the packet data from MySQL by utilizing a structured array to hold the data.

Continue reading Items Utility: Piloting Data Entry Updates