Storing tables efficiently with Pandas

0. Intro

When working with pandas people usually need to store one or more tables. There are a lot of different formats to do that. In this post I am going to compare the performance between them.

1. How test will be done

To do the tests I downloaded 3 datasets of different sizes:

Small (56 KB)Medium (233 MB)Large (6.62 GB)

I downloaded them from kaggle, one of the best places to find datasets.

The formats that I will test are:

csv
feather
msgpack
parquet
pickle
xlsx

2. Test 1: Simple test

First of all I will check how they perform without changing any parameters. I will do 100 iterations with the small dataset and 10 with the medium one.

2.1. Size small and iterations 100 with all extensions

First of all the average reading/writing times for each format.

It seems that xlsx is a slow option

2.2. Size medium and iterations 10 with all extensions

With the medium dataframe the results are very similar.

xlsx is a slow solution. I would only recommend it for small dataframes.

Regarding the file size both xlsx and parquet outperform the rest. If you look the pandas documentation you will see that all extensions except xlsx and feather allow different types of compression. Also parquet uses a more aggressive compression by default. We need to test the formats and the different compressions.

3. Test 2: Include compressions

This time each extension will be tested using all possible compressions. The results are:

format	compression	read [s]	write [s]	size [MB]
csv	None	3.05	5.90	236.10
	bz2	10.52	24.44	47.58
	gzip	4.26	49.02	65.80
	infer	3.09	6.00	236.10
	xz	9.92	288.62	43.99
	zip	4.05	21.26	66.44
feather	feather	0.61	1.00	213.38
msgpack	None	0.49	1.32	210.09
	blosc	0.48	1.37	202.88
	zlib	0.51	2.05	202.90
parquet	None	0.77	1.41	209.58
	brotli	0.99	27.04	50.82
	gzip	1.00	18.28	56.75
	snappy	0.81	1.87	96.37
pickle	None	0.47	1.34	207.54
	bz2	7.78	18.51	40.93
	gzip	1.49	40.68	54.85
	infer	0.48	1.37	207.54
	xz	6.26	254.32	38.05
	zip	2.07	17.86	55.43

By plotting both reading and writing time it is easier to see that writing times for xz compression are really slow.

It is not clear which formats and compressions work best. Let’s plot reading/writing time vs file size.

csv underperforms compared to the other formats

4. Test 3: Using the big table

For this test I will only use the best formats from the previous test. Those are:

feather
msgpack
parquet
pickle

format	compression	read [min]	write [min]	size [GB]
feather	feather	-	-	4.24
msgpack	None	5.86	6.32	7.21
	blosc	5.45	5.01	6.50
	zlib	5.49	5.55	6.50
parquet	None	1.15	2.69	5.96
	brotli	0.92	11.22	1.56
	gzip	0.97	8.98	2.00
	snappy	0.89	2.78	3.30
pickle	None	1.38	7.47	6.25
	bz2	5.15	17.47	1.24
	gzip	1.88	20.76	1.81
	infer	1.46	5.80	6.25
	xz	3.46	97.62	0.97

feather did not work since it has a restriction of 2 GB per column and it was exceeded.

With a file of this size it is clear that parquet is the best option. For compression it depends if your priority is file size or Input/Output time. For fast writing/reading use parquet without compression, for minimum file size blosc and zlib is a solution that is between the other two.

parquet is the best for big tables.

5. Conlusions

Excel is only useful for small files when you are planning to open with an Excel program.
csv performs poorly
pickle, msgpack and parquet are good options
feather is another option when the table is not big
parquet is a good option in general and the best one for big files.
parquet without compression is the fastest solution for big files and parquet with blosc the best at compressing big files.

Storing tables efficiently with Pandas

0. Intro

1. How test will be done

2. Test 1: Simple test

2.1. Size small and iterations 100 with all extensions

2.2. Size medium and iterations 10 with all extensions

3. Test 2: Include compressions

4. Test 3: Using the big table

5. Conlusions

Share :

Related Posts

Google Analytics ETL with python

SQL Alchemy with python

Java UDF with pyspark