Storing tables efficiently

2019-04-03
Python Performance Benchmark



When working with pandas people usually need to store one or more tables. There are a lot of different formats to do that. In this post I am going to compare the performance between them.

How test will be done

To do the tests I downloaded 3 datasets of different sizes:

I downloaded them from kaggle, one of the best places to find datasets.

The formats that I will test are:

Test 1: Simple test

First of all I will check how they perform without changing any parameters. I will do 100 iterations with the small dataset and 10 with the medium one.

Size small and iterations 100 with all extensions

First of all the average reading/writing times for each format.

It seems that xlsx is a slow option

Size medium and iterations 10 with all extensions

With the medium dataframe the results are very similar.

xlsx is a slow solution. I would only recommend it for small dataframes.

Regarding the file size both xlsx and parquet outperform the rest. If you look the pandas documentation you will see that all extensions except xlsx and feather allow different types of compression. Also parquet uses a more agresive compression by default. We need to test the formats and the different compressions.

Test 2: Include compressions

This time each extensions will be tested using all posible compressions. The results are:

format compression read [s] write [s] size [MB]
csv None 3.05 5.90 236.10
bz2 10.52 24.44 47.58
gzip 4.26 49.02 65.80
infer 3.09 6.00 236.10
xz 9.92 288.62 43.99
zip 4.05 21.26 66.44
feather feather 0.61 1.00 213.38
msgpack None 0.49 1.32 210.09
blosc 0.48 1.37 202.88
zlib 0.51 2.05 202.90
parquet None 0.77 1.41 209.58
brotli 0.99 27.04 50.82
gzip 1.00 18.28 56.75
snappy 0.81 1.87 96.37
pickle None 0.47 1.34 207.54
bz2 7.78 18.51 40.93
gzip 1.49 40.68 54.85
infer 0.48 1.37 207.54
xz 6.26 254.32 38.05
zip 2.07 17.86 55.43

By plotting both reading and writing time is easier to se that writing times for xz compression are really slow.

It is not clear which formats and compressions work best. Let's plot reading/writing time vs file size.

csv underperforms compared to the other formats

Test 3: Using the big table

For this test I will only use the best formats from the previous test. Those are:

format compression read [min] write [min] size [GB]
feather feather - - 4.24
msgpack None 5.86 6.32 7.21
blosc 5.45 5.01 6.50
zlib 5.49 5.55 6.50
parquet None 1.15 2.69 5.96
brotli 0.92 11.22 1.56
gzip 0.97 8.98 2.00
snappy 0.89 2.78 3.30
pickle None 1.38 7.47 6.25
bz2 5.15 17.47 1.24
gzip 1.88 20.76 1.81
infer 1.46 5.80 6.25
xz 3.46 97.62 0.97

feather did not work since it has a restriction of 2 GB per column and it was exceeded.

With a file of this size it is clear that parquet is the best option. For compression it depends if your priority is file size or Input/Output time. For fast writing/reading use parquet without compression, for minimum file size blosc and zlib is a solution that is between the other two.

parquet is the best for big tables.

Conlusions

  1. Excel is only useful for small files when you are planning to open with an Excel program.
  2. csv performs poorly
  3. pickle, msgpack and parquet are good options
  4. feather is another option when the table is not big
  5. parquet is a good option in general and the best one for big files.
  6. parquet without compression is the fastest solution for big files and parquet with blosc the best at compressing big files.