Improving python performance with Numba
0. Intro
This is the first post about python performance. This post will be of how to make python code go faster.
1. Exploring python compilers
The first way to improve the python performance is by using different compilers. The most famous ones are:
Numba and cython are similar in terms of speed and pypy is a little bit slower. You can read more at How to use Cython, Numba and Pypy.
1.1. Cython overview
Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex).
When using cython you will need to specify variables classes so the code will look slightly different. For example:
cdef int a = 0
for i in range(10):
a += i
print(a)
You can also cythonize python code. So for example you can create a file to compute Fibonacci series:
fib.pyx
def fib(n):
"""Print the Fibonacci series up to n."""
a, b = 0, 1
while b < n:
print(b, end=' ')
a, b = b, a + b
print()
And then transform it to cython:
from distutils.core import setup
from Cython.Build import cythonize
setup(
ext_modules=cythonize("fib.pyx"),
)
Even though cython is fast I don’t like having to change the code to adapt it to cython.
1.2. Pypy overview
PyPy is a fast, compliant alternative implementation of the Python language (2.7.13 and 3.5.3, 3.6). It has several advantages and distinct features:
- Speed: thanks to its Just-in-Time compiler, Python programs often run faster on PyPy. (What is a JIT compiler?)
- Memory usage: memory-hungry Python programs (several hundreds of MBs or more) might end up taking less space than they do in CPython.
- Compatibility: PyPy is highly compatible with existing python code. It supports cffi and can run popular python libraries like twisted and django.
- Stackless: PyPy comes by default with support for stackless mode, providing micro-threads for massive concurrency.
When using pypy you can write regular python code. The main disadvantage of pypy is that you can’t use other libraries out of the box. So if for example you wan to use pandas you will need a pypy implementation of it. This makes using pypy along common python packages unconvinient.
1.3. Numba overview
Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.
You don’t need to replace the Python interpreter, run a separate compilation step, or even have a C/C++ compiler installed. Just apply one of the Numba decorators to your Python function, and Numba does the rest.
So you can use python code without modifications and you won’t have compatibility problems with other packages when using numba. This is the reason I preffer numba over cython and pypy and it is also one of the faster of the three.
2. Using numba
Numba provides some decorators that will transform functions to C
. This way the execution times will be faster.
By adding the jit
decorator to a function numba will transform it to C
. For example:
from numba import jit, njit
@jit
def m_sum(data):
out = 0
for x in data:
out += x
return out
m_sum([1, 2, 3])
> 6
2.1. jit
vs njit
decorators
The default numba decorator is jit
. It is possible to pass use nopython mode with @jit(nopython=True)
or with the decorator njit
which is equivalent.
njit
decorator is faster than jit
but supports less features than jit.
For example you can’t cast to string (with for example str(10)
) inside a njit
decorated function.
So you should try to use njit
decorator if possible specialy for numerical calculations.
2.2. vectorize
decorator
Using the vectorize
decorator, you write your function as operating over input scalars, rather than arrays. Numba will generate the surrounding loop (or kernel) allowing efficient iteration over the actual inputs.
So for example given a np.array
we can calculat the square with the follwing function.
from numba import vectorize
mlist = np.array(range(10))
@vectorize
def square(x):
return x**2
square(mlist)
> array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81], dtype=int64)
To increase the speed you can specify the inputs and output dtypes using int8
since in this example is enough for storing 81
which is the max value.
You can check the Numpy documentation to see the different dtypes
.
from numba import vectorize, int8
mlist = np.array(range(10))
@vectorize([int8(int8)])
def square(x):
return x**2
square(mlist)
> array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81], dtype=int8)
The format for declaring the dtypes
is [output(input1, input2, ...)]
.
3. Test numba performance
3.1. Test sum
function
First we will time the execution of different sum
functions. The first option will be a for loop in python:
iter_and_sum
def iter_and_sum(data):
""" Sums each element in an iterable """
out = 0
for x in data:
out += x
return out
And we will test both jit
and njit
numba decorators. As an example this is how the jit
version would be defined:
jit
@jit
def jit_iter_and_sum(data):
""" Sums each element in an iterable """
out = 0
for x in data:
out += x
return out
# Or equivalent:
jit(iter_and_sum)
The other options is to use the sum()
python function and the numpy.sum()
.
For the tests we will try different array lenghts.
numpy size | iter_and_sum | jit | njit | np.sum | sum |
---|---|---|---|---|---|
10^4 | 0.000926 | 0.000001 | 0.000001 | 0.000008 | 0.000750 |
10^5 | 0.009162 | 0.000009 | 0.000010 | 0.000042 | 0.007408 |
10^6 | 0.092395 | 0.000106 | 0.000120 | 0.000408 | 0.075894 |
10^7 | 0.936200 | 0.002313 | 0.002345 | 0.004393 | 0.766890 |
10^8 | 9.335412 | 0.021232 | 0.020506 | 0.042701 | 7.549756 |
10^9 | 96.134125 | 0.238718 | 0.225308 | 0.433185 | 76.537118 |
First time result has been excluded to avoid computing the time of compilation before calculating the mean.
It is important to mention that the results would be different using python lists instead of numpy
arrays.
For small arrays numpy.sum
is the best option but for large arrays both jit
and njit
perform really well.
In general is a really good idea to use numpy functions since they use cython
under the hood.
In all cases using the raw loop iter_and_sum
or the default python sum
gives poor results.
numba can be 400x better than a python loop
3.2. Test fix time
At one of the projects I was working at my company I found an intersting problem.
I had a huge dataframe (650 milion rows) and among other columns there was one for the date
and one for time
.
The date
column followed the format YYYY-MM-DD
like 2019-05-15. The problem was with the time
column.
The original format was hhmmsscc
(c
for centiseconds) and instead of being a string it was presented as an integer. So for example you could find the following values:
value | represents |
---|---|
23411210 | 23:41:12.10 |
9051137 | 09:05:11.37 |
712 | 00:00:07.12 |
3.2.1. The functions
So to fix that I transformed the values to string, then added 0 until I had 8 chars and finally split and add :
and .
so that then it could be transformed.
Let’s use this problem for testing numba. First let’s see all different options I came up.
3.2.1.1. zfill
This was my first approach which was good enough. The idea is to transform the time
column to string and then apply zfill(8)
. After that I created a string with the appropiate format and transform the whole datetime
with pd.to_datetime
.
def zfill(df):
"""
1. Transform time to str
2. zfill
3. split time as string lists
4. pd.to_datetime
"""
aux = df["time"].apply(str).apply(lambda x: x.zfill(8)).str
return pd.to_datetime(df["date"] + " " + aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:])
3.2.1.2. fix_time_individual
The idea in this option is to create a function decorated with jit
that transform one element of the time
column and apply the function to the column.
Insted of using pd.to_datetime
I use .astype(np.datetime64)
since is faster.
def fix_time_individual(df):
"""
1. pandas.apply a jit function to add 0 to time
2. concat date + time
3. change to np.datetime64
"""
@jit
def _fix_time(x):
aux = "0"*(8 - len(str(x))) + str(x)
return aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]
return (df["date"] + " " + df["time"].apply(_fix_time)).astype(np.datetime64)
3.2.1.3. fix_time_np_string
With this solution I created and empty numpy
array and filled with a loop inside the jit
decorated function.
def fix_time_np_string(df):
"""
1. Use a jit function to add 0 to each time
2. concat date + time
3. change to np.datetime64
"""
@jit
def _fix_time(mlist):
out = np.empty(mlist.shape, dtype=np.object)
for i in range(len(mlist)):
elem = str(mlist[i])
aux = "0"*(8 - len(elem)) + elem
out[i] = aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]
return out
return (df["date"].values + " " + _fix_time(df["time"].values)).astype(np.datetime64)
3.2.1.4. fix_time_np_datetime
In this case the I also create an empty numpy
array but with datetime64[s]
dtype. This way I can iterate over both time
and date
at the same time.
def fix_time_np_datetime(df):
"""
1. Iterate time and date with jit function
2. Transform each element to string and add 0s
3. Split the string
4. Cast each element to np.datetime64
"""
@jit
def _fix_date(mdate, mtime):
out = np.empty(mtime.shape, dtype="datetime64[s]")
for i in range(len(mtime)):
elem = str(mtime[i])
aux = "0"*(8 - len(elem)) + elem
aux = mdate[i] + " " + aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]
out[i] = np.datetime64(aux)
return out
return _fix_date(df["date"].values, df["time"].values)
3.2.1.5. np_divmod_jit
In this solution I process the time
column as number and with the np.divmod
function I create a value that represents a timedelta. After transforming the time
column I change the dtype to timedelta64[ms]
and sum it to the date
column as a datetime64
.
def np_divmod_jit(df):
"""
1. Iterate time and date with jit function
2. Use np.divmod to transfom HHMMSSCC to miliseconds integer
3. Cast date as np.datetime and time to timedelta
4. Sum date and time
"""
@jit
def _fix_date(mdate, mtime):
time_out = np.empty(mtime.shape[0], dtype=np.int32)
for i in range(mtime.shape[0]):
aux, cent = np.divmod(mtime[i], 100)
aux, seconds = np.divmod(aux, 100)
hours, minutes = np.divmod(aux, 100)
time_out[i] = 10*(cent + 100*(seconds + 60*(minutes + 60*hours)))
return mdate.astype(np.datetime64) + time_out.astype("timedelta64[ms]")
return _fix_date(df["date"].values, df["time"].values)
3.2.1.6. divmod_njit
It is the same as the previous example but after changing np.divmod
to the python divmod
function I can use the njit
decorator for the first time.
def divmod_njit(df):
"""
1. Iterate time with njit function
2. Use divmod to transfom HHMMSSCC to miliseconds integer
3. Outside the njit function cast date as np.datetime and time to timedelta
4. Sum date and time
"""
@njit
def _fix_time(mtime):
time_out = np.empty(mtime.shape)
for i in range(mtime.shape[0]):
aux, cent = divmod(mtime[i], 100)
aux, seconds = divmod(aux, 100)
hours, minutes = divmod(aux, 100)
time_out[i] = 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))
return time_out
return df["date"].values.astype(np.datetime64) + _fix_time(
df["time"].values.astype(np.int32)
).astype("timedelta64[ms]")
3.2.1.7. divmod_vectorize
This case is exactly the same as the previous one but instead of doing the for
loop I use the numba vectorize
decorator.
def divmod_vectorize(df):
"""
1. Use divmod to transfom HHMMSSCC to miliseconds integer with vectorize
2. Outside the njit function cast date as np.datetime and time to timedelta
3. Sum date and time
"""
@vectorize([int32(int32)])
def _fix_time(mtime):
aux, cent = divmod(mtime, 100)
aux, seconds = divmod(aux, 100)
hours, minutes = divmod(aux, 100)
return 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))
return df["date"].values.astype(np.datetime64) + _fix_time(
df["time"].values.astype(np.int32)
).astype("timedelta64[ms]")
3.2.2. The results
size | zfill | fix_time_individual | fix_time_np_string | fix_time_np_datetime | np_divmod_jit | divmod_njit | divmod_vectorize |
---|---|---|---|---|---|---|---|
10^2 | 0.0044 | 0.2134 | 0.4346 | 0.5354 | 0.4284 | 0.1323 | 0.0886 |
10^3 | 0.0072 | 0.2238 | 0.4435 | 0.5184 | 0.4371 | 0.1311 | 0.0900 |
10^4 | 0.0304 | 0.2655 | 0.4835 | 0.5588 | 0.5370 | 0.1429 | 0.1013 |
10^5 | 0.2955 | 0.7117 | 0.7359 | 0.8380 | 1.4360 | 0.1683 | 0.1274 |
10^6 | 3.4249 | 5.1130 | 3.4255 | 3.5275 | 10.1573 | 0.4198 | 0.3844 |
10^7 | 37.4349 | 50.3042 | 30.2810 | 32.3781 | 97.4159 | 2.9162 | 2.8040 |
10^8 | 721.8816 | 508.8800 | 310.1489 | 326.4423 | 973.3178 | 31.4919 | 31.6025 |
Both divmod_njit
and divmod_vectorize
performs really well compared to the other options. Is intersting that my first approach (zfill
) is the best for small dataframes but it starts to underperform at 10^5.
Working with numbers in numba is really fast.
3.2.3. Speed with different machines
I did this test with up to 10^7
elements with my computer. I was not able to increase the number of elements due to an out of memory error (not enough RAM).
Then I repeated everything with an M64
azure machine.
The specs of each machine are:
processor | cores | RAM | |
---|---|---|---|
my computer | Intel Core i5-6500 3.2Ghz | 4 | 16 GB |
azure M64 | Intel Xeon E7-8890 v3 2.5GHz (Haswell) | 64 | 1 TB |
Let’s compare the results of two functions.
size | zfill (i5) | zfill (azure) | divmod_vectorize (i5) | divmod_vectorize (azure) |
---|---|---|---|---|
10^2 | 0.0040 | 0.0044 | 0.0902 | 0.0886 |
10^3 | 0.0054 | 0.0072 | 0.0917 | 0.0900 |
10^4 | 0.0314 | 0.0304 | 0.1001 | 0.1013 |
10^5 | 0.2917 | 0.2955 | 0.1910 | 0.1274 |
10^6 | 3.1439 | 3.4249 | 1.0513 | 0.3844 |
10^7 | 32.7804 | 37.4349 | 9.7113 | 2.8040 |
If we take a look at the results we can see that they both perform similar.
It is important to remember that pandas only work with one core so I am not using the full potential of the machines. Having more RAM allows to work with more data but it does not increase the speed. With the numba vectorize the azure machine performs better as the size increases.
4. More info
You can read the Numba documentation. Some examples of numba form their web page at Numba examples.
I also suggest you read this post from Jake VanderPlas about Code optimization with numba.