When working with pandas, you usually need to store one or more tables. There are many formats available for that. In this post I am going to compare their performance.
To do the tests I downloaded 3 datasets of different sizes:
I downloaded them from Kaggle, one of the best places to find datasets.
The formats that I will test are:
First of all I will check how they perform without changing any parameters. I will do 100 iterations with the small dataset and 10 with the medium one.
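A minimal sketch of how a default-parameter benchmark like this could be set up. The `benchmark` helper, the small generated dataframe and the restriction to csv and pickle (which need no optional dependencies) are assumptions for illustration, not the code behind the results in this post.

```python
import os
import tempfile
import time

import pandas as pd


def benchmark(df, write, read, path, iterations):
    """Return average write time [s], average read time [s] and file size [bytes]."""
    write_total = 0.0
    read_total = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        write(df, path)
        write_total += time.perf_counter() - start

        start = time.perf_counter()
        read(path)
        read_total += time.perf_counter() - start
    return write_total / iterations, read_total / iterations, os.path.getsize(path)


# Hypothetical stand-in for the small dataset
df = pd.DataFrame({"a": range(1000), "b": [str(i) for i in range(1000)]})

# Default parameters only; other formats can be added in the same way
formats = {
    "csv": (lambda d, p: d.to_csv(p, index=False), pd.read_csv),
    "pickle": (lambda d, p: d.to_pickle(p), pd.read_pickle),
}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, (write, read) in formats.items():
        path = os.path.join(tmp, f"data.{name}")
        results[name] = benchmark(df, write, read, path, iterations=10)

for name, (w, r, size) in results.items():
    print(f"{name}: write {w:.4f} s, read {r:.4f} s, size {size} bytes")
```

Averaging over several iterations smooths out noise from disk caching and other processes, which matters most for the small dataset where a single read or write takes only a few milliseconds.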
Let's start with the average reading/writing times for each format.
It seems that xlsx is a slow option.
With the medium dataframe the results are very similar.
xlsx is a slow solution. I would only recommend it for small dataframes.
Regarding the file size both
parquet outperform the rest. If you look at the pandas documentation you will see that all extensions except
feather allow different types of compression. Also
parquet uses a more aggressive compression by default. We need to test the formats together with the different compressions.
This time each extension will be tested using all possible compressions. The results are:
|format|compression|read [s]|write [s]|size [MB]|
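A table with these columns could be produced along the following lines. This is only a sketch: the generated dataframe is a hypothetical stand-in, and it covers just csv and pickle with the `gzip`, `bz2` and `xz` options of the pandas `compression` parameter, not every format and compression tested in this post.

```python
import os
import tempfile
import time

import pandas as pd

# Hypothetical stand-in data; repeated values compress well
df = pd.DataFrame({"a": range(5000), "b": ["some repeated text"] * 5000})

# Compression names accepted by both to_csv and to_pickle; None means no compression
compressions = [None, "gzip", "bz2", "xz"]

writers = {
    "csv": lambda d, p, c: d.to_csv(p, index=False, compression=c),
    "pickle": lambda d, p, c: d.to_pickle(p, compression=c),
}
readers = {
    "csv": lambda p, c: pd.read_csv(p, compression=c),
    "pickle": lambda p, c: pd.read_pickle(p, compression=c),
}

rows = []
with tempfile.TemporaryDirectory() as tmp:
    for fmt in writers:
        for comp in compressions:
            label = comp or "none"
            path = os.path.join(tmp, f"data_{fmt}_{label}")

            start = time.perf_counter()
            writers[fmt](df, path, comp)
            write_s = time.perf_counter() - start

            start = time.perf_counter()
            restored = readers[fmt](path, comp)
            read_s = time.perf_counter() - start

            # Sanity check on the round trip
            assert restored.shape == df.shape

            rows.append({
                "format": fmt,
                "compression": label,
                "read [s]": read_s,
                "write [s]": write_s,
                "size [MB]": os.path.getsize(path) / 1e6,
            })

table = pd.DataFrame(rows)
print(table)
```

Passing the compression explicitly on both the write and the read side avoids relying on pandas inferring it from the file extension.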
By plotting both reading and writing times it is easier to see that writing times for
xz compression are really slow.
It is not clear which formats and compressions work best. Let's plot reading/writing time vs file size.
csv underperforms compared to the other formats
For this test I will only use the best formats from the previous test. Those are:
|format|compression|read [min]|write [min]|size [GB]|
feather did not work since it has a restriction of 2 GB per column and it was exceeded.
With a file of this size it is clear that parquet is the best option. The choice of compression depends on whether your priority is file size or I/O time. For fast writing/reading use parquet without compression, for minimum file size
zlib is a solution that is between the other two.
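The trade-off between uncompressed and compressed parquet can be checked on your own data with a small helper like the one below. It is a sketch under assumptions: `parquet_sizes` is a hypothetical name, and `gzip` is used only as an example of a codec accepted by `DataFrame.to_parquet`, not as the compression measured above.

```python
import os
import tempfile

import pandas as pd


def parquet_sizes(df, directory, codecs=(None, "gzip")):
    """Write df once per codec and return {codec label: file size in bytes}.

    Needs a parquet engine (pyarrow or fastparquet); pandas raises
    ImportError if neither is installed.
    """
    sizes = {}
    for codec in codecs:
        label = codec or "uncompressed"
        path = os.path.join(directory, f"data_{label}.parquet")
        # compression=None writes the file without any compression
        df.to_parquet(path, compression=codec)
        # Read the file back to confirm it is usable
        assert len(pd.read_parquet(path)) == len(df)
        sizes[label] = os.path.getsize(path)
    return sizes
```

Calling `parquet_sizes(df, tempfile.mkdtemp())` on a dataframe returns a dictionary with one size per codec, which can then be weighed against the read and write times of each option.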
parquet is the best for big tables.
Excel is only useful for small files when you are planning to open them with an Excel program.
parquet are good options
feather is another option when the table is not big
parquet is a good option in general and the best one for big files.
parquet without compression is the fastest solution for big files and
parquet with blosc the best at compressing big files.