Efficient Data Crunching in PHP


When it comes to data crunching for reporting purposes in PHP, I have a predefined set of rules that I follow religiously, and they ensure that my report-writing experience stays smooth and problem-free.

  1. Do not use ORMs
  2. Use non-buffered TCP connections
  3. Use PHP generators with abstractions
  4. Prefer easy-to-stream content types

There you have it! My promise is that if you follow these four simple rules, your report-writing experience will be much more pleasant and your application will generate reports far more efficiently. Let’s review these rules one by one:

Do NOT use ORMs

ORMs, as their acronym indicates, have a very specific purpose: mapping relational database tables to the objects that contain our business logic. This lets us abstract away persistence details and focus on expressing our domain in code. But a report is neither a domain action nor a business rule. It’s just a report: someone or something wants to see some data in a certain format.

Anything we use the ORM for beyond the purpose stated above constitutes a misuse of it. ORMs were designed for nothing other than mapping domain objects.

If I load the ORM with ten thousand rows’ worth of database records, I won’t even reach a thousand before my script runs out of memory. This is because most ORMs keep an in-memory cache of some sort, and also because objects consume more memory than simpler data types like hash maps or arrays. Some terrible things I’ve seen developers do to solve this are increasing the application’s default memory limit (the worst idea ever, because it doesn’t solve the problem, just pushes it away) or clearing the Entity Manager’s memory cache every N iterations (not that bad, but it’s still telling you that you probably don’t need an ORM for this).
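For reference, that clearing workaround usually looks something like this sketch (a Doctrine-style API is assumed; $query, $entityManager, and process() are placeholders):

<?php

// Hypothetical Doctrine-style workaround: detach managed entities
// every 1000 rows so the identity map does not grow unbounded.
$i = 0;

foreach ($query->toIterable() as $user) {
    process($user); // hypothetical per-row handler

    if (++$i % 1000 === 0) {
        $entityManager->clear(); // frees the in-memory entity cache
    }
}

It works, but the very need for it is a hint that the ORM is the wrong tool for the job.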

Even when you can get around the memory issue, you are still putting your system under a lot of stress because of all the mapping logic. Just stop and ask yourself: “Do I really need hydrated objects for this?” You’ll find that the answer is, in 99.9% of cases, no.

This is really all you need.

<?php

interface UserReporter
{
    public function activeUsers(): iterable;

    public function inactiveUsers(): iterable;
}

You don’t need anything more complex than this. Implement it with PDO or any other database library, perform the necessary translations/normalizations, and return an iterable of some sort. You’ll be amazed at how easy and efficient this approach is.
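Consuming it is just as simple. A quick sketch (the $reporter implementation and the $out stream are assumptions here):

<?php

// Rows arrive as plain arrays: no hydration, no identity map.
foreach ($reporter->activeUsers() as $row) {
    fputcsv($out, $row); // $out: any writable stream resource
}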

Use Non-Buffered TCP Connections

TCP connections are slow, at least compared to reading from the local filesystem. So almost every database client I know of uses something called buffering. In PDO, for example, this works by reading the entire resultset of a query off the socket and accumulating it in the client’s memory before your first fetch. This means that in PHP, every time you query something, you are loading the whole result into memory.

Buffering is a technique used to improve socket performance. The idea is that if you send many small packets, the per-packet overhead outweighs sending one big packet. By buffering the data you send fewer packets, and thus improve performance. The tradeoff, though, is that the whole resultset gets allocated in the client’s memory.
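You can observe this tradeoff yourself. A rough sketch, assuming a MySQL connection and some sufficiently large big_table:

<?php

// Unbuffered: rows stream in one at a time, peak memory stays flat.
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$stmt = $pdo->query('SELECT * FROM `big_table`');

foreach ($stmt as $row) {
    // process the row...
}

echo memory_get_peak_usage(true), PHP_EOL;

Flip the attribute to true and memory_get_peak_usage() will jump by roughly the size of the resultset (with mysqlnd, buffered results count against PHP’s memory limit). One caveat: with unbuffered queries you must fully consume or close the cursor before issuing another query on the same connection.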

Now, depending on the size of your table, this may also kill your application. If your table is 5 GB and you are dumping all its records into a CSV file, that is going to allocate 5 GB in your app’s memory. I remember the first time I encountered this problem while working for a client. They had a database with the amount of electricity, in MW, generated by every single electrical generation plant in Chile, every hour, over the span of 5 years. There are around 150 electrical generation plants in Chile and 8,760 hours in a year, so that table had close to 6,127,000 records, each roughly weighing 1 KB. That sole table was around 6 GB. Of course, when I tried to allocate that much memory in my system, it crashed.

Reports like this can easily kill your app if you are not cautious. In cases like this, it is always good to stream the results as they arrive over the TCP connection. In PDO, you can do this by disabling buffering for queries:

<?php
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

Then you iterate over your resultset and get every record as it is found, keeping your memory usage very low at the cost of lower speed (because of TCP overhead) and a bit more CPU.

You can read more about buffered and unbuffered queries in the PHP manual.

Use PHP Generators with Abstractions

There is little point in disabling buffering for your query if you then collect every result into an array and carry on from there. What you want here is the power of generators.

I’m not going to explain in detail what they are; for that, you can read the amazing blog post that Nikita Popov wrote about them. Suffice it to say that a generator is a special kind of iterator that yields the items of an iteration without necessarily knowing the whole iterable structure in advance.

They fit our non-buffered PDO queries very well: since we do not know the whole resultset in advance, we can yield values as they reach our application.

<?php

// A PDO-backed implementation of the UserReporter interface
// (inactiveUsers() omitted for brevity).
class PdoUserReporter
{
    private PDO $pdo;

    public function __construct(PDO $pdo)
    {
        $this->pdo = $pdo;
    }

    public function activeUsers(): Iterator
    {
        // Stream rows instead of buffering the whole resultset client-side.
        $this->pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

        $stmt = $this->pdo->prepare("SELECT * FROM `users` WHERE `status` = 'active'");

        if (!$stmt->execute()) {
            throw new RuntimeException('Query error');
        }

        while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
            yield $row;
        }
    }
}

When a function has the yield keyword inside it, its return value will be an Iterator (specifically, a Generator instance), and every iteration will produce the next yielded value. When the generator is advanced, its code executes until a yield is reached, and the yielded value becomes the current item. The function is then paused until the iterator is advanced again, at which point it resumes where it left off, until there are no more yields.
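A tiny standalone example makes the pause-and-resume behavior visible:

<?php

function numbers(): Generator
{
    echo "start\n";
    yield 1; // paused here until the consumer advances the iterator
    echo "resumed\n";
    yield 2;
}

foreach (numbers() as $n) {
    echo $n, "\n";
}

// Output order: start, 1, resumed, 2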

This is a powerful feature, useful for implementing streams, data transformation pipelines, and even coroutines (as Nikita’s blog post shows). You can compose iterators over iterators to create pipelines and process your report data while keeping concerns neatly separated.

<?php

class UppercaseNames implements IteratorAggregate
{
    private iterable $iterable;

    public function __construct(iterable $iterable)
    {
        $this->iterable = $iterable;
    }

    public function getIterator(): Iterator
    {
        foreach ($this->iterable as $item) {
            // Transform the row, then yield it downstream.
            $item['Name'] = strtoupper($item['Name']);
            yield $item;
        }
    }
}

class NormalizeEmails implements IteratorAggregate
{
    private iterable $iterable;

    public function __construct(iterable $iterable)
    {
        $this->iterable = $iterable;
    }

    public function getIterator(): Iterator
    {
        foreach ($this->iterable as $item) {
            $item['Email'] = strtolower(trim($item['Email']));
            yield $item;
        }
    }
}

class TranslateFields implements IteratorAggregate
{
    private iterable $iterable;

    public function __construct(iterable $iterable)
    {
        $this->iterable = $iterable;
    }

    public function getIterator(): Iterator
    {
        foreach ($this->iterable as $item) {
            // Map raw column names to the labels the report expects.
            yield ['Name' => $item['name'], 'Email' => $item['email']];
        }
    }
}

// The innermost iterator runs first: translate, then normalize, then uppercase.
$pipeline = new UppercaseNames(
    new NormalizeEmails(
        new TranslateFields($userReporter->activeUsers())
    )
);

foreach ($pipeline as $item) {
    // Write to a CSV file or wherever the report goes.
}

Who needs ETL processing libraries when you have the power of Generators at your disposal?

Prefer Easy-to-Stream Content Types

One of the first things I ask when someone wants a report dumped from a set of SQL tables is “Can the report be a CSV?” Most of the time the answer is an annoying no, because importing a CSV file into Excel is supposedly too complex for the business people (really?). But I try to make my point anyway.

Every content type structured as a tree is hard to stream, especially in PHP. I’m mainly talking about XML and JSON here. As you might know, an Excel file is just compressed XML. Even excellent packages like box/spout have to resort to in-memory caching or temp-file gymnastics to stream .xlsx files. JSON files can be streamed as well, but not with the native json extension functions; for streaming JSON in PHP you should use violet-php/streaming-json-encoder. But the simplest format of all is the beautiful CSV: just line-by-line streaming, with no parsing and no reading byte by byte.

The reason to prefer streams is that you will potentially be sending this somewhere else, be it object storage or a client over HTTP. Trust me, you want to stream that.

I’ve seen applications where clicking “Download Report” makes the browser wait for three whole minutes before the download tray icon pops up, and then the file downloads in a second. This is because most reports build the file in memory or on disk and only then send it to the client. This is poor UX. At least let users know that data is coming by streaming the content instead of making them wait for three minutes.
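Putting it all together, streaming a CSV over HTTP can be this simple (a plain-PHP sketch; $pipeline is the iterator composed in the previous section, and frameworks offer streaming-response equivalents):

<?php

header('Content-Type: text/csv');
header('Content-Disposition: attachment; filename="users.csv"');

$out = fopen('php://output', 'w');

fputcsv($out, ['Name', 'Email']); // header row

foreach ($pipeline as $row) {
    fputcsv($out, $row); // each row goes to the client as it is produced
    flush();
}

fclose($out);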

Conclusion

Writing reports is one of a developer’s most boring tasks, but it happens to be one of the most common. Do it well and you’ll save yourself a lot of time that you can spend on other, more interesting things. Besides, you’ll learn a thing or two about streaming, generators, and TCP connections.