This is the first post about Python performance. It covers how to make Python code run faster.
The first way to improve Python performance is to use a different compiler. The best-known ones are Cython, PyPy and Numba.
Numba and Cython are similar in terms of speed, and PyPy is a little bit slower. You can read more at this Quora question.
Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex).
When using Cython you need to declare variable types, so the code looks slightly different from regular Python. For example:

    cdef int a = 0
    for i in range(10):
        a += i
    print(a)
You can also cythonize Python code. For example, you can create a file (fib.pyx) that computes the Fibonacci series:

    def fib(n):
        """Print the Fibonacci series up to n."""
        a, b = 0, 1
        while b < n:
            print(b, end=' ')
            a, b = b, a + b
        print()
And then compile it with Cython using a setup.py file:

    # distutils was removed from recent Python versions, so setuptools is used here
    from setuptools import setup
    from Cython.Build import cythonize

    setup(
        ext_modules=cythonize("fib.pyx"),
    )
Even though Cython is fast, I don't like having to change my code to adapt it to Cython.
PyPy is a fast, compliant alternative implementation of the Python language (2.7.13 and 3.5.3, 3.6). It has several advantages and distinct features:
When using PyPy you can write regular Python code. The main disadvantage of PyPy is that you can't use other libraries out of the box. So if, for example, you want to use pandas, you will need a build of it that works on PyPy. This makes using PyPy alongside common Python packages inconvenient.
Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.
You don't need to replace the Python interpreter, run a separate compilation step, or even have a C/C++ compiler installed. Just apply one of the Numba decorators to your Python function, and Numba does the rest.
So you can use Python code without modifications, and you won't have compatibility problems with other packages when using Numba. This is the reason I prefer Numba over Cython and PyPy, and it is also one of the fastest of the three.
Numba provides some decorators that compile functions to machine code, which makes them run faster.
When you add the jit decorator to a function, Numba compiles it the first time it is called. For example:

    from numba import jit, njit

    @jit
    def m_sum(data):
        out = 0
        for x in data:
            out += x
        return out

    m_sum([1, 2, 3])
The default Numba decorator is jit. It is possible to use nopython mode with @jit(nopython=True) or with the njit decorator, which is equivalent.
The njit decorator is faster than jit but supports fewer features.
For example, you can't cast to a string (with str(10), for instance) inside an njit-decorated function.
So you should try to use the njit decorator whenever possible, especially for numerical calculations.
With the vectorize decorator, you write your function as operating over input scalars, rather than arrays. Numba will generate the surrounding loop (or kernel), allowing efficient iteration over the actual inputs.
So, for example, given an np.array we can calculate the square of each element with the following function.

    import numpy as np
    from numba import vectorize

    mlist = np.array(range(10))

    @vectorize
    def square(x):
        return x**2

    square(mlist)
To increase the speed you can specify the input and output dtypes. Here int8 is enough, since it can store 81, which is the maximum value in this example.
You can check the NumPy documentation to see the different dtypes.

    import numpy as np
    from numba import vectorize, int8

    # The input is created as int8 so that it matches the declared signature
    mlist = np.arange(10, dtype=np.int8)

    @vectorize([int8(int8)])
    def square(x):
        return x**2

    square(mlist)
The format for declaring the dtypes is [output(input1, input2, ...)].
First we will time the execution of different sum functions. The first option will be a for loop in Python:

    def iter_and_sum(data):
        """ Sums each element in an iterable """
        out = 0
        for x in data:
            out += x
        return out
We will also test both the jit and njit Numba decorators. As an example, this is how the jit version would be defined:

    @jit
    def jit_iter_and_sum(data):
        """ Sums each element in an iterable """
        out = 0
        for x in data:
            out += x
        return out

    # Or equivalent: jit(iter_and_sum)
The other options are the built-in sum() Python function and numpy.sum.
For the tests we will try different array lengths.
The first run of each function has been excluded before calculating the mean, so that compilation time is not counted.
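The measurement idea described above (run several times, drop the first result, average the rest) can be sketched with the standard library alone. This is only an illustration, not the code used for the results in this post: the helper name time_function and its repeats parameter are made up for the example, and it times the plain Python loop on a list instead of a Numba function on a NumPy array.

```python
import time
from statistics import mean


def time_function(func, data, repeats=5):
    """Return the mean run time of func(data) in seconds, skipping the first call."""
    timings = []
    for _ in range(repeats + 1):
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    # The first measurement is dropped: for a jit-compiled function
    # it would include the compilation time.
    return mean(timings[1:])


def iter_and_sum(data):
    """Sums each element in an iterable"""
    out = 0
    for x in data:
        out += x
    return out


# Hypothetical lengths, chosen only for this example
for length in (10, 1_000, 100_000):
    data = list(range(length))
    print(length, time_function(iter_and_sum, data))
```

The same helper could be called with a jit or njit decorated function. In that case dropping the first call is what keeps the compilation step out of the mean.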
It is important to mention that the results would be different using Python lists instead of NumPy arrays.
For small arrays numpy.sum is the best option, but for large arrays both jit and njit perform really well.
In general it is a really good idea to use NumPy functions, since they run compiled C code under the hood.
In all cases the raw loop iter_and_sum and the built-in Python sum give poor results.
Numba can be 400x faster than a Python loop.
In one of the projects I was working on at my company I found an interesting problem.
I had a huge dataframe (650 million rows) and, among other columns, there was one for the date and one for the time. The date column followed the format YYYY-MM-DD, like 2019-05-15. The problem was with the time column.
The original format was HHMMSScc (c for centiseconds), and instead of being a string it was stored as an integer. Since an integer drops leading zeros, the values could have fewer than 8 digits.
So to fix that I transformed the values to strings, padded them with zeros on the left until they had 8 characters, and finally split them and added : and . so that they could be transformed into a time.
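As a small illustration of that fix on single values, here is a sketch in plain Python. The function name format_time and the sample integers are invented for this example; they are not values from the original dataframe.

```python
def format_time(value):
    """Turn an integer in HHMMSScc format into an 'HH:MM:SS.cc' string."""
    # zfill restores the leading zeros that were lost when the time was stored as an integer
    text = str(value).zfill(8)
    return f"{text[:2]}:{text[2:4]}:{text[4:6]}.{text[6:]}"


print(format_time(12345678))  # 12:34:56.78
print(format_time(1234567))   # 01:23:45.67
print(format_time(5))         # 00:00:00.05
```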
Let's use this problem for testing Numba. First let's see all the different options I came up with.
This was my first approach, which was good enough. The idea is to transform the time column to string and then apply zfill(8). After that I created a string with the appropriate format and transformed the whole column with pd.to_datetime.

    def zfill(df):
        """
            1. Transform time to str
            2. zfill
            3. split time as string lists
            4. pd.to_datetime
        """
        aux = df["time"].apply(str).apply(lambda x: x.zfill(8)).str
        return pd.to_datetime(
            df["date"] + " " + aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]
        )
The idea in this option is to create a function decorated with jit that transforms one element of the time column, and then apply that function to the column.
Instead of using pd.to_datetime I use .astype(np.datetime64), since it is faster.

    def fix_time_individual(df):
        """
            1. pandas.apply a jit function to add 0 to time
            2. concat date + time
            3. change to np.datetime64
        """

        @jit
        def _fix_time(x):
            aux = "0"*(8 - len(str(x))) + str(x)
            return aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]

        return (df["date"] + " " + df["time"].apply(_fix_time)).astype(np.datetime64)
With this solution I created an empty numpy array and filled it with a loop inside the jit decorated function.

    def fix_time_np_string(df):
        """
            1. Use a jit function to add 0 to each time
            2. concat date + time
            3. change to np.datetime64
        """

        @jit
        def _fix_time(mlist):
            # np.object was removed from recent NumPy versions; the builtin object is equivalent
            out = np.empty(mlist.shape, dtype=object)
            for i in range(len(mlist)):
                elem = str(mlist[i])
                aux = "0"*(8 - len(elem)) + elem
                out[i] = aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]
            return out

        return (df["date"].values + " " + _fix_time(df["time"].values)).astype(np.datetime64)
In this case I also create an empty numpy array, but with the datetime64[s] dtype. This way I can iterate over both time and date at the same time.

    def fix_time_np_datetime(df):
        """
            1. Iterate time and date with jit function
            2. Transform each element to string and add 0s
            3. Split the string
            4. Cast each element to np.datetime64
        """

        @jit
        def _fix_date(mdate, mtime):
            out = np.empty(mtime.shape, dtype="datetime64[s]")
            for i in range(len(mtime)):
                elem = str(mtime[i])
                aux = "0"*(8 - len(elem)) + elem
                aux = mdate[i] + " " + aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]
                out[i] = np.datetime64(aux)
            return out

        return _fix_date(df["date"].values, df["time"].values)
In this solution I process the time column as a number, and with the np.divmod function I create a value that represents a timedelta. After transforming the time column I change its dtype to timedelta64[ms] and add it to the date column as a datetime64.

    def np_divmod_jit(df):
        """
            1. Iterate time and date with jit function
            2. Use np.divmod to transform HHMMSSCC to milliseconds integer
            3. Cast date as np.datetime and time to timedelta
            4. Sum date and time
        """

        @jit
        def _fix_date(mdate, mtime):
            time_out = np.empty(mtime.shape, dtype=np.int32)
            # shape is a tuple, so its first element is used as the loop length
            for i in range(mtime.shape[0]):
                aux, cent = np.divmod(mtime[i], 100)
                aux, seconds = np.divmod(aux, 100)
                hours, minutes = np.divmod(aux, 100)
                time_out[i] = 10*(cent + 100*(seconds + 60*(minutes + 60*hours)))
            return mdate.astype(np.datetime64) + time_out.astype("timedelta64[ms]")

        return _fix_date(df["date"].values, df["time"].values)
It is the same as the previous example, but after changing np.divmod to the Python divmod function I can use the njit decorator for the first time.

    def divmod_njit(df):
        """
            1. Iterate time with njit function
            2. Use divmod to transform HHMMSSCC to milliseconds integer
            3. Outside the njit function cast date as np.datetime and time to timedelta
            4. Sum date and time
        """

        @njit
        def _fix_time(mtime):
            time_out = np.empty(mtime.shape)
            # shape is a tuple, so its first element is used as the loop length
            for i in range(mtime.shape[0]):
                aux, cent = divmod(mtime[i], 100)
                aux, seconds = divmod(aux, 100)
                hours, minutes = divmod(aux, 100)
                time_out[i] = 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))
            return time_out

        return df["date"].values.astype(np.datetime64) + _fix_time(
            df["time"].values.astype(np.int32)
        ).astype("timedelta64[ms]")
This case is exactly the same as the previous one, but instead of writing the for loop I use the Numba vectorize decorator.

    from numba import vectorize, int32

    def divmod_vectorize(df):
        """
            1. Use divmod to transform HHMMSSCC to milliseconds integer with vectorize
            2. Outside the vectorize function cast date as np.datetime and time to timedelta
            3. Sum date and time
        """

        @vectorize([int32(int32)])
        def _fix_time(mtime):
            aux, cent = divmod(mtime, 100)
            aux, seconds = divmod(aux, 100)
            hours, minutes = divmod(aux, 100)
            return 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))

        return df["date"].values.astype(np.datetime64) + _fix_time(
            df["time"].values.astype(np.int32)
        ).astype("timedelta64[ms]")
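The arithmetic shared by the divmod-based options can be checked by hand on a single value. The sketch below uses only the standard library: datetime and timedelta stand in for the NumPy datetime64 and timedelta64 types, and the function name hhmmsscc_to_ms and the sample value are made up for the example.

```python
from datetime import datetime, timedelta


def hhmmsscc_to_ms(value):
    """Convert an integer in HHMMSScc format to milliseconds since midnight."""
    aux, cent = divmod(value, 100)     # last two digits: centiseconds
    aux, seconds = divmod(aux, 100)    # next two digits: seconds
    hours, minutes = divmod(aux, 100)  # remaining digits: hours and minutes
    # Everything is accumulated in centiseconds and multiplied by 10 to get milliseconds
    return 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))


ms = hhmmsscc_to_ms(12345678)  # 12:34:56.78
print(ms)  # 45296780

# Adding the offset to the date, as the functions above do with timedelta64[ms]
full = datetime.strptime("2019-05-15", "%Y-%m-%d") + timedelta(milliseconds=ms)
print(full.isoformat())  # 2019-05-15T12:34:56.780000
```

Working with integers like this avoids building a string for every row, which is the part that the string-based options spend their time on.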
divmod_vectorize performs really well compared to the other options. It is interesting that my first approach (zfill) is the best for small dataframes, but it starts to underperform at 10^5 elements.
Working with numbers in Numba is really fast.
I ran this test with up to 10^7 elements on my computer. I was not able to increase the number of elements because of an out of memory error (not enough RAM).
Then I repeated everything with an M64 Azure machine.
The specs of each machine are:

| feature | my computer | Azure M64 |
|---|---|---|
| processor | Intel Core i5-6500 3.2 GHz | Intel Xeon E7-8890 v3 2.5 GHz (Haswell) |
| RAM | 16 GB | 1 TB |
Let's compare the results of two functions.
If we take a look at the results, we can see that they both perform similarly.
It is important to remember that pandas only works with one core, so I am not using the full potential of the machines. Having more RAM allows you to work with more data, but it does not increase the speed. With the Numba vectorize function, the Azure machine performs better as the size increases.
I also suggest you read this post from Jake VanderPlas about code optimization with numba.