Handling large CSVs with Laravel

In my job at Resolute Property Tax and in one of my side projects, I've spent way (way) too much time worrying about CSVs. At Resolute, we will regularly get CSVs that are millions of rows long and dozens of columns wide – the files can sometimes hit two or three gigabytes. When you're dealing with files that size, handling them efficiently and effectively becomes a lot more important than when you're dealing with a couple hundred rows.

I'll share some of the strategies I've developed over the past couple of years that I think might be helpful if you're handling large CSVs.

Please Use a CSV Library

My first and highest recommendation may seem like a no-brainer, but we've all been tempted to think: "It's just a CSV, how hard could it be to parse on my own?"

Answer: very hard.

In the off chance you don't believe me, take a look at Falsehoods Programmers Believe About CSVs and ask yourself if you had planned to handle all (any?) of those. I didn't even know most of them!

For the sake of your sanity, I suggest you pull in a proper CSV parser. I highly, highly recommend the one supported by The PHP League. It's fantastic, easy to use, tested and maintained.

Never Trust Your Data Provider

The large CSVs that we load into our app are primarily exports from other systems, either Government systems or from our data providers. It may sound harsh to say "never trust your data provider", but they will unknowingly change things all the time. It's better to be prepared for changes than hope they never come.

With no notice to you, they might:

change the order of the columns
change the name of a column
add or remove columns
have an escaping issue on their end

We load all of our data as a part of scheduled commands that run without intervention, in off hours. The system fetches the CSV, opens, and imports it. That means that we need to make sure the file is in the correct shape before we start loading millions of rows.

Ensure Header Consistency

As long as you're using the League/CSV package, it doesn't really matter what order the source data has the columns in, because you'll get one associative array per row. What does matter, however, is when a column name changes or a column is dropped. That can lead to pernicious, subtle bugs that manifest themselves later when data is not populated.

You really, really want to make those issues explicit instead of silently letting in empty data.

The way we do this is we set up an "expectation" of what the headers should be for a file, and the actual headers don't match, then we throw an exception and everything grinds to a halt.

We usually get them directly from the file using head:

protected function validateHeaders($path, $expectedHeaders)
{
    $headersFromFile = trim(shell_exec("head -1 $path"));
 
    if ($expectedHeaders !== $headersFromFile) {
        Log::error('Expected: ' . $expectedHeaders);
        Log::error('Actual: ' . $headersFromFile);
 
        throw new Exception("Headers did not match. Check the logs.");
    }
}

Code highlighting powered by torchlight.dev (A service I created!)

You could also check the first record given back to you by the Reader object.

protected function validateHeaders($reader, $expectedHeaders)
{
    $headersFromFile = array_keys($reader->fetchOne());
 
    if($expectedHeaders !== $headersFromFile) {
        Log::error('Expected: ' . $this->originalHeaders);
        Log::error('Actual: ' . $headersFromFile);
 
        throw new Exception("Headers did not match. Check the logs.");
    }
}

Regardless of which way you do it, I would recommend that you check your headers to make sure you're working with the data that you expect.

No Headers?

It's hard to check headers for changes if there aren't any headers in the first place!

That totally happens. In that case we treat the first row of data as the "header" and monitor that specific row for changes just like we would monitor the header row. Most of the data we get is relatively stable, so while we do get a few false exceptions, we don't have to update the "fixture" data very often.

Ensure (Mostly) Correct Escaping

The League's CSV package will handle most of this for you, but there are still going to be times where your provider has encoded some data in... interesting ways. It's always good to sanity check the data before you blindly load it into your system.

For us, we ran into an issue where the provider was incorrectly escaping a character that caused two records to be merged into one. Of course that's not what the error said, because how could the library know? The library was just following the CSV "spec", as it were. We only discovered this when we tried to insert a field into the database and got a "value too long" error from MySQL. I noticed that an entire row's worth of data was in a single field, which tipped me off that something wasn't parsing correctly.

The way we catch this now is a bit complicated, but I'll briefly describe how we do it.

The first thing we do when we download a new CSV is check to see how many records are in the file itself. We used to do this by counting the number lines in the file, but some fields can have line breaks so we had to come up with a different way. Every row in this particular file contains an easily identifiable pattern e.g. | 043,2020,, so if we count the number of times that pattern appears we know how many records there should be.

public function linesInFile($path)
{
    $pattern = " \| 043,2020,";
 
    // The `E` turns on regex, and the `a` forces it to treat the
    // file as text (which it is). `o` tells it to only show the
    // match, not the whole line.
    return (int)trim(exec("grep -oaE '$pattern' $path | wc -l"));
}

Knowing how many records there should be, we compare with how many records we actually got:

public function ensureCsvParsed($path)
{
    $reader = $this->getReader($path);
    $lines = $this->linesInFile($path);
 
    if ($lines == $reader->count()) {
        // The CSV document has as many records as there are
        // lines in the file.
        return;
    }
 
    // Figure out which line is bad...
 
    throw new Exception('Line [X] is bad.');
}

If the two don't match, then we know that there was an error somewhere.

We like to also figure out which line is bad by bisecting the file over and over until we hit a mismatch. This tells us which line needs to be properly escaped.

I won't put all the bisecting code here, but I will explain generally how we do it. We start at the very end of the file and get one line:

$lineFromFile = exec("sed -n '{$line},{$line}p' $external_url");

We then get the line that is in the reader at $line - 2. (Subtract one to compensate for headers, and one to compensate for 0-based indexing.)

$lineFromReader = $reader->fetchOne($line - 2);

Code highlighting powered by torchlight.dev (A service I created!)

We compare those two lines and see if they contain the same values. If they do, the error is before that line, so we jump to the middle of the file and try again. This time if they do match, we know we've gone too far back, so we set a variable telling us where the last good line is, and then jump forward a bunch.

Repeat this jumping around until we've narrowed down to the bad line, which is usually full of a random number of backslashes and quotes. Congrats, you found it!

Stream the File to Save Memory

If your files are multiple hundreds of megabytes, or even really tens of megabytes, you're going to want to make sure that you're streaming the file through your reader and not loading the entire thing into memory all at once.

Streaming the file means that it will only load a small portion into memory at any given time. This is very easy to accomplish and will greatly improve your throughput.

// Loads the entire thing into memory (Do NOT do this!)
$reader = Reader::createFromString(file_get_contents('/path/to/my/file.csv'));
 
// Opens as a stream (Do this!)
$reader = Reader::createFromPath('/path/to/my/file.csv', 'r');

Fortunately the PHP League's package handles all of the hard parts about this for us. You just have to make sure you call the right method.

If your files aren't local, it's likely that you can still get a hold of a stream for them.

If they're on S3 you're especially in luck. AWS provides a stream wrapper for just this purpose. If you're using Laravel, here's how you'd get a stream for a file on S3:

public function getStream($disk, $path)
{
    // Get the underlying S3 adapter.
    $adapter = $disk->getDriver()->getAdapter();
 
    // Register the s3:// stream wrapper.
    $adapter->getClient()->registerStreamWrapper();
 
    // Now we can use the s3:// protocol.
    $path = 's3://' . $adapter->getBucket() . '/' . $path;
 
    // Return the stream.
    return fopen($path, 'r', false, stream_context_create([
        's3' => ['seekable' => true]
    ]));
}
 
//...
 
$stream = $this->getStream($disk, $external_url);
$reader = Reader::createFromStream($stream);

You're now streaming files from S3 chunk by chunk instead of having to download it all in one go!

Note that many of the other methods in this guide require that the file be local, so I like to download the whole file from S3 and then stream it from the filesystem.

Go Page by Page

In the case where you have a couple of million rows that you're trying to churn through in a reasonable amount of time, you'll find yourself limited by trying to read the whole file a row at a time on one thread.

Instead, take a huge page of 50,000 or 100,000 records and spawn a new process to handle that page alone. Then you can spawn multiple processes to handle different pages and they can all work independently in parallel instead of a single thread working sequentially.

The way we do this is to have a single "parent" command that is in charge of spawning "children" commands to import individual pages:

public function importFile($path)
{
    // Totally up to you and your requirements.
    $maxProcesses = 10;
    $perPage = 50000;
 
    $pages = ceil($this->linesInFile($path) / $perPage);
    $page = 1;
 
    $processes = [];
 
    while ($page <= $pages || count($processes)) {
        $processes = $this->getRunningProcesses($processes);
 
        // If there is room to start another process and we need
        // to import another page, then start a new process
        if ($page <= $pages && count($processes) < $maxProcesses) {
            $command = "php artisan import:page $page $path";
 
            // Spawn the command in the background.
            exec($command . ' > /dev/null 2>&1 & echo $!', $processes);
 
            $page++;
        }
 
        sleep(10);
    }
}

And then in the child, you can take advantage of statements to only get the records for that particular page.

public function records($reader)
{
    return (new Statement)
        ->offset($this->page * $this->perPage)
        ->limit($this->perPage)
        ->process($reader);
}

Once you break it up into a page-by-page processing pattern, you can more easily control speed and throughput, weighed against system resources.

Processing in Jobs vs. Commands vs. Daemons

Where your process your CSVs is a bit of an open question, but I have some suggestions.

Jobs

The first option would be to do your processing inside of a queued job. This is probably the easiest to set and forget and is perfect for small imports stemming from a discrete action, like a user upload.

There are a few drawbacks though. The first drawback is the job timeout. If your imports are of any size at all, you're quickly going to blow past the 90 second default timeout for queued jobs. Of course you can set up a long or import queue with a longer timeout, but eventually you're going to blow past that timeout too. Some of our imports run for 3-4 hours.

If you were to break up those long running jobs into multiple smaller jobs you might be able to get away with it, but then you have to worry about coordination between independent jobs. How do you know when the import is done? Laravel 8's new job batching may help there, but I can't speak directly to that.

Commands

If your data is available programmatically, either by API, public data, or loaded onto an FTP share (really) for you, then you can schedule a command to go and fetch it and start working. This is nice because you don't have to worry about timeouts with scheduled commands. Do make sure you call ->runInBackground()->withoutOverlapping() on these commands if they are going to take hours.

This is the method we use to import our largest files overnight. We load the files onto S3 at some point during the work day, and then at night the command wakes up to see if anything is there.

It works wonderfully because we don't have to worry about timeouts and we can one parent command spawning multiple child processes so we get parallel importing. (See above.)

Daemons

You may already know that a daemon is simply a process that runs in the background instead of one that's under the direct control of a user. The horizon:work command in Laravel is typically set up as a daemon.

Using a daemon for imports is nice for a couple of reasons. The first is that, like commands, they won't timeout like jobs might. The second benefit is something that a scheduled command doesn't have going for it: a daemon can start working immediately.

If your imports are primarily user-provided imports, you probably won't want to wait around until midnight to import them. In fact, checking every 5 minutes or even every minute may not be fast enough. Users will probably expect to see some sort of progress pretty quickly.

For user imports, we keep a daemon running at all times just waiting to see a new unprocessed import in the database. Once it sees that, it picks it up and starts processing it.

If you implement this strategy, you'll need to make sure you have a lock when selecting the next import, otherwise two daemons could get the same import, leading to duplicated data:

public function handleNextImport()
{
    // Make sure that we are the only process looking for a new import.
    $handler = Cache::lock('selectNextImport')->get(function () {
        $import = Import::processable()->first();
 
        if (!$import) {
            return;
        }
 
        $import->markAsProcessing();
 
        return $import->makeHandler();
    });
 
    optional($handler)->handle();
}

Truth be told we don't run a proper daemon, because setting them up is kind of a pain. We run a pseudo-daemon, which I wrote all about here.

Clean Your Data

All of your data is going to come in as a string, because that's how CSVs work. You'll likely not want to keep all of that data as strings but rather cast them to their correct primitive types. Obviously all of your data will be specific to your application, but I'll cover some of the normal issues I've run into.

Removing Extra Spaces

The first thing I always do is utf8_encode the value and get rid of any extra spaces. UTF-8 encoding gets rid of some wonky values that can sneak in, such as \x00.

At this point I also convert any empty strings to proper nulls as that is what they actually are.

public function clean($record)
{
    foreach ($record as $key => $value) {
        $value = utf8_encode($value);
        $value = trim($value);
        $value = $value === '' ? null : $value;
 
        $record[$key] = $value;
    }
 
    return $record;
}

Now that your data is more generally accurate, you can go about casting it to the correct primitives.

Boolean

If your field is a true/false, you'll need to look out for

yes / y / Y / YES
no / n / NO / N
true / TRUE
false / FALSE
1
0
(blank)

Numbers

Depending on if your number is supposed to be an integer or a float depends on what you do here. Regardless, you'll need to make sure to remove commas (,) and any currency indicators ($).

If you only want integers, I like to explode off any decimals (because it sounds so cool) and then run digits_only over it.

function digits_only($val, $replacement = '')
{
    return preg_replace('/\D/', $replacement, $val);
}
 
$value = explode('.', $value);
$value = head($value);
$value = digits_only($value);

If you're looking to handle both floats and integers, you can remove the commas and currency indicators and then take advantage of PHPs goofy typecasting:

$value = 0 + $value;

If $value is a float-like string (123.45) you'll get a float back. If it's an int-like string (123) you'll get an int back.

Dates

Hopefully the format of dates is consistent throughout your data file. If it is, then you should take advantage of Carbon's createFromFormat method:

public function date($record, $key, $format)
{
    if (Arr::has($record, $key)) {
        $record[$key] = Carbon::createFromFormat($format, $record[$key]);
    }
 
    return $record;
}

If Carbon cannot create a date from the string given it will throw an exception, which is what you want! Better an explicit error than a silent one.

EOF

CSV is a fantastic format for data-portability. Almost every system can export to CSV and reading them can be very simple, as long as you keep a few things in mind!

I'll keep adding to this guide over time as I come across more useful tips and tricks.

Have you handled a lot of CSVs? What did I miss? I'd love to hear, hit me up on twitter.