
Efficient Reports in PHP

10 December 2019 | 8 minutes read | 1512 words

When it comes to reporting, I have a predefined set of rules that I follow religiously, and they ensure that my report-writing experience is pleasant and problem-free.

  1. Do not use ORMs
  2. Use non-buffered TCP connections
  3. Use PHP generators with abstractions
  4. Prefer easy-to-stream content types

There you have it! My promise is that if you follow these four simple rules, your report-writing experience will be much more pleasant and your application will produce reports more efficiently. Let’s review these rules one by one:

1. Do not use ORMs

ORMs, as their acronym indicates, have a very specific purpose: to map relational database tables to the objects that contain our business logic. This lets us abstract away persistence details and focus on expressing our domain in code. But a report is neither a domain action nor a business rule. It’s just a report: some business guy wants to see data in a spreadsheet.

Anything we use the ORM for beyond the purpose stated above constitutes a misuse of it. ORMs were designed for nothing other than mapping domain objects.

If I load the ORM with ten thousand rows’ worth of database records, I’m not even going to reach a thousand before my script runs out of memory. This is because most ORMs have an in-memory cache of some sort, and also because objects consume more memory than simpler data types like hash maps or arrays. Some terrible things I’ve seen developers do to work around this are increasing the application’s default memory limit (worst idea ever) or clearing the Entity Manager’s memory cache every X iterations (not as bad, but still not the right thing to do).

Even when you can get around the memory issue, you are still putting your system under a lot of stress because of all the mapping logic. Just stop and ask yourself: “Do I really need hydrated objects for this?” You’ll find that the answer is, in 99% of cases, no.

This is really all you need.

<?php

interface UserReporter
{
    public function activeUsers(): iterable;

    public function inactiveUsers(): iterable;
}

You don’t need anything more complex than this. Implement it with PDO or any other database library, perform the necessary translations/normalizations, and return an iterable of some sort. You’ll be amazed at how easy and efficient this approach is.
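For illustration, here is a minimal sketch of a consumer of such an implementation. The $reporter variable and the output file name are hypothetical, and rows are assumed to be associative arrays:

<?php

// Hypothetical consumer: stream active users straight into a CSV file.
// $reporter is assumed to be some implementation of UserReporter that
// yields one associative array per row.
$out = fopen('active-users.csv', 'w');

foreach ($reporter->activeUsers() as $row) {
    fputcsv($out, $row); // One row at a time: no hydration, no object graph
}

fclose($out);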

2. Use non-buffered TCP connections

TCP connections are slow (compared to writing to the host’s filesystem, for example), so almost every database client I know of uses a technique called buffering. In PDO, for example, a buffered query transfers the entire resultset from the database into the client’s memory as soon as the query executes, and only then hands control back to your code. This means that in PHP, every time you query something, you are loading the whole result into memory.

Now, depending on the size of your table, this may also kill your application. If your table is 5 GB and you are dumping all of its records into a CSV file, that is going to allocate 5 GB in your app’s memory. I remember the first time I encountered this problem, working for a client. They had a database with the hourly output, in MW, of every single power generation plant in Chile over a span of 5 years. Chile has around 150 generation plants and a year has 8,760 hours, so that table held around 150 × 8,760 × 5 ≈ 6,570,000 records, each weighing roughly 1 KB. That single table was around 6 GB. Of course, when I tried to allocate that much memory, my system crashed.

Reports like this can easily kill your app if you are not cautious. In cases like this, it is much better to stream the results as they arrive over the TCP connection. In PDO, you can do this by disabling query buffering:

<?php

$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

Then you iterate over your resultset and get every record as it arrives, keeping your memory usage very low at the cost of some speed and a bit more CPU.
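To make that concrete, here is a minimal sketch of iterating an unbuffered resultset row by row. The DSN, credentials and table name are placeholders:

<?php

// Placeholders: adjust the DSN, credentials and table to your setup.
$pdo = new PDO('mysql:host=localhost;dbname=reports', 'user', 'secret');
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

// PDOStatement is Traversable, so we can foreach over it directly
$stmt = $pdo->query('SELECT * FROM `measurements`');

foreach ($stmt as $row) {
    // Each row arrives over the socket as we ask for it,
    // so peak memory stays flat no matter how big the table is
}

echo memory_get_peak_usage(true), " bytes peak\n";

One caveat: while an unbuffered resultset is still open, you cannot issue another query over the same connection, so fully consume (or close) the statement first.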

You can read more about non-buffered queries in PDO in the PHP manual’s section on buffered and unbuffered queries.

3. Use PHP generators with abstractions

There is little use in disabling buffering on your query if you then put every result into an array and carry on. What you want here is the power of generators.

I’m not going to explain in detail what they are; for that, you can read the amazing blog post that Nikita Popov wrote about them. I will just say that a generator is a special kind of iterator that yields the items of an iteration without necessarily knowing the whole iterable structure in advance.

They fit our non-buffered PDO queries very well: since we do not know the whole resultset in advance, we can yield values as they reach our server.

<?php

class PdoUserReporter implements UserReporter
{
    private PDO $pdo;

    public function __construct(PDO $pdo)
    {
        $this->pdo = $pdo;
    }

    public function activeUsers(): Iterator
    {
        return $this->usersByStatus('active');
    }

    public function inactiveUsers(): Iterator
    {
        return $this->usersByStatus('inactive');
    }

    private function usersByStatus(string $status): Iterator
    {
        // Disable buffering: rows arrive over the socket as we ask for them
        $this->pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
        $stmt = $this->pdo->prepare("SELECT * FROM `users` WHERE `status` = :status");
        if (!$stmt->execute(['status' => $status])) {
            throw new RuntimeException('Query error');
        }
        while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
            yield $row; // This is the keyword!!
        }
    }
}

When a function contains the yield keyword, its return value will be an Iterator (specifically, a Generator instance), and every iteration will produce the next yielded value. When the iterator advances, code runs until a yield is reached, and the yielded expression becomes the current value. The generator then stays paused until Iterator::next() is called again, resuming right where it left off, until there are no more yields.
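As a quick illustration of this pause-and-resume behaviour, here is a toy example (not part of the report code):

<?php

function numbers(): Generator
{
    echo "generator started\n";
    yield 1; // Execution pauses here until the consumer asks for the next value
    echo "resumed after the first yield\n";
    yield 2;
    echo "no more values\n";
}

$gen = numbers();      // Nothing is printed yet: the body has not started running

foreach ($gen as $n) { // Each iteration resumes the body up to the next yield
    echo "got {$n}\n";
}

Running this interleaves the generator’s output with the consumer’s, showing that the body only executes on demand.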

This is a powerful feature, useful for implementing streams, data transformation pipelines and even coroutines (as Nikita’s blog post shows). You can compose iterables over iterables to create neat pipelines and process your report data while separating concerns effectively.

<?php

class UppercaseNames implements IteratorAggregate
{
    private iterable $iterable;

    public function __construct(iterable $iterable)
    {
        $this->iterable = $iterable;
    }

    public function getIterator(): Iterator
    {
        foreach ($this->iterable as $item) {
            // Transform the data and then yield (assumes each row has a 'name' key)
            $item['name'] = mb_strtoupper($item['name']);
            yield $item;
        }
    }
}

class NormalizeEmails implements IteratorAggregate
{
    private iterable $iterable;

    public function __construct(iterable $iterable)
    {
        $this->iterable = $iterable;
    }

    public function getIterator(): Iterator
    {
        foreach ($this->iterable as $item) {
            // Transform the data and then yield (assumes each row has an 'email' key)
            $item['email'] = mb_strtolower(trim($item['email']));
            yield $item;
        }
    }
}

class TranslateFields implements IteratorAggregate
{
    private iterable $iterable;

    /** Example map of column names to the labels wanted in the report */
    private const FIELD_NAMES = ['name' => 'Nombre', 'email' => 'Correo'];

    public function __construct(iterable $iterable)
    {
        $this->iterable = $iterable;
    }

    public function getIterator(): Iterator
    {
        foreach ($this->iterable as $item) {
            // Transform the data and then yield: rename each known field
            $translated = [];
            foreach ($item as $field => $value) {
                $translated[self::FIELD_NAMES[$field] ?? $field] = $value;
            }
            yield $translated;
        }
    }
}

$pipeline = new UppercaseNames(
    new NormalizeEmails(
        new TranslateFields($userReporter->activeUsers())
    )
);

foreach ($pipeline as $item) {
    // Write in a csv or something
}

Who needs ETL processing libraries when you have the power of Generators at your disposal? 🤓️

[gif] OMG generators are soo cool!

4. Prefer easy-to-stream content types

One of the first things I ask when someone requests a report dumped from a set of SQL tables is “Can the report be a CSV?” Most of the time the answer is an annoying no, because it is supposedly too complex for the business guy to import a CSV file into Excel (really?). But I try to make my point anyway.

Every content type structured as a tree is hard to stream, especially in PHP. I’m mainly talking about XML and JSON here. As you might know, an Excel file is just compressed XML. Even really good libraries like box/spout have to resort to in-memory caches or temp-file gymnastics to stream .xlsx files. JSON files can be streamed as well, but not with the native json extension functions; for streaming JSON in PHP you should use violet-php/streaming-json-encoder. The simplest format of all is the beautiful CSV: just line-by-line streaming, with no parsing and no byte-by-byte reading.

The reason to prefer streams is that, potentially, you will be sending this somewhere else, be it object storage or a client over HTTP. Trust me, you want to stream that.

I’ve seen applications where clicking “Download Report” makes the browser wait three whole minutes before the download tray icon pops up, and then the file downloads in a second. This is because most reports build the file in memory or on disk, and only send it to the client once it is done. This is poor UX. At least let users know that data is being sent to them by streaming the content, instead of making them wait for three minutes.
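For example, here is a minimal sketch of a download endpoint that streams the CSV as it is produced, reusing the $pipeline iterable from the previous section (the surrounding routing is omitted):

<?php

// Stream the report straight to the HTTP client, row by row.
header('Content-Type: text/csv');
header('Content-Disposition: attachment; filename="report.csv"');

$out = fopen('php://output', 'w');

foreach ($pipeline as $row) {
    fputcsv($out, $row);
    flush(); // Push each chunk to the client instead of building the whole file first
}

fclose($out);

If output buffering is enabled, you may also need ob_flush() alongside flush() for the chunks to actually leave the server.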

Conclusion

Writing reports is one of the most boring tasks for a developer, but it happens to be one of the most common. Do it well, and you’ll save yourself a lot of time that you can spend on other, more interesting things. Besides, you’ll learn a thing or two about streaming, generators and TCP connections. 🙂️

Tagged: php, reporting, generators, mysql, pdo
