Improving python performance with Numba

Intro

This is the first post about python performance. It shows how to make python code run faster.

Exploring python compilers

The first way to improve python performance is to use a different compiler. The most famous ones are Cython, PyPy and Numba.

Numba and cython are similar in terms of speed, and pypy is a little bit slower. You can read more at How to use Cython, Numba and Pypy.

Cython overview

Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex).

When using cython you need to declare the types of your variables, so the code looks slightly different. For example:

cdef int a = 0
for i in range(10):
    a += i
print(a)

You can also cythonize regular python code. For example, you can create a file that computes the Fibonacci series:

fib.pyx

def fib(n):
    """Print the Fibonacci series up to n."""
    a, b = 0, 1
    while b < n:
        print(b, end=' ')
        a, b = b, a + b

    print()

And then compile it to a C extension with a setup script:

from setuptools import setup  # distutils was removed in Python 3.12
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("fib.pyx"),
)

Even though cython is fast I don’t like having to change the code to adapt it to cython.

Pypy overview

PyPy is a fast, compliant alternative implementation of the Python language (2.7.13 and 3.5.3, 3.6). It has several advantages and distinct features:

Speed: thanks to its Just-in-Time compiler, Python programs often run faster on PyPy. (What is a JIT compiler?)
Memory usage: memory-hungry Python programs (several hundreds of MBs or more) might end up taking less space than they do in CPython.
Compatibility: PyPy is highly compatible with existing python code. It supports cffi and can run popular python libraries like twisted and django.
Stackless: PyPy comes by default with support for stackless mode, providing micro-threads for massive concurrency.

When using pypy you can write regular python code. The main disadvantage of pypy is that many libraries do not work out of the box. For example, if you want to use pandas you need a build of it that works on pypy. This makes using pypy alongside common python packages inconvenient.

Numba overview

Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.

You don’t need to replace the Python interpreter, run a separate compilation step, or even have a C/C++ compiler installed. Just apply one of the Numba decorators to your Python function, and Numba does the rest.

So you can use python code without modifications and you won’t have compatibility problems with other packages when using numba. This is why I prefer numba over cython and pypy, and it is also one of the fastest of the three.

Using numba

Numba provides decorators that compile functions to machine code, which makes them run faster.

When you add the jit decorator to a function, numba compiles it the first time it is called. For example:

from numba import jit, njit

@jit
def m_sum(data):
    out = 0
    for x in data:
        out += x

    return out

m_sum([1, 2, 3])

> 6

`jit` vs `njit` decorators

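The default numba decorator is jit. You can request nopython mode with @jit(nopython=True) or with the equivalent njit decorator. In nopython mode the compiled code does not go through the Python interpreter, so njit is faster than object-mode jit but supports fewer features. Since Numba 0.59 a bare @jit also defaults to nopython mode, so the string-based functions later in this post need @jit(forceobj=True) on those versions.

At the time of writing, for example, you can’t cast to string (with str(10)) inside an njit decorated function. Numerical code rarely needs such features, so you should use the njit decorator whenever possible, especially for numerical calculations.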
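As a minimal sketch of the kind of function that suits njit, here is a purely numerical loop. The name sum_of_squares is made up for illustration, and the fallback decorator is only there so the snippet still runs where Numba is not installed.

```python
try:
    from numba import njit
except ImportError:
    # Assumption for this sketch only: without Numba, run the plain Python function.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    """Sum i*i for i in 0..n-1 using only integers and a loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10))  # 285
```

The function only uses integers and a range loop, which nopython mode handles well.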

`vectorize` decorator

Using the vectorize decorator, you write your function as operating over input scalars, rather than arrays. Numba will generate the surrounding loop (or kernel) allowing efficient iteration over the actual inputs.

So for example given a np.array we can calculate the square with the following function.

import numpy as np
from numba import vectorize

mlist = np.array(range(10))

@vectorize
def square(x):
    return x**2

square(mlist)

> array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81], dtype=int64)

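To increase the speed you can specify the input and output dtypes. Here int8 is enough, since it can store values up to 127 and the max value in this example is 81. The input array must use the declared dtype, because an int64 array cannot be safely cast to int8. You can check the Numpy documentation to see the different dtypes.

import numpy as np
from numba import vectorize, int8

mlist = np.arange(10, dtype=np.int8)  # match the declared input dtype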

@vectorize([int8(int8)])
def square(x):
    return x**2

square(mlist)

> array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81], dtype=int8)

The format for declaring the dtypes is [output(input1, input2, ...)].

Test numba performance

Test `sum` function

First we will time the execution of different sum functions. The first option will be a for loop in python:

iter_and_sum

def iter_and_sum(data):
    """ Sums each element in an iterable """
    out = 0
    for x in data:
        out += x

    return out

And we will test both jit and njit numba decorators. As an example this is how the jit version would be defined:

jit

@jit
def jit_iter_and_sum(data):
    """ Sums each element in an iterable """
    out = 0
    for x in data:
        out += x

    return out

# Or, equivalently:
jit_iter_and_sum = jit(iter_and_sum)

The other options are the built-in python sum() function and numpy.sum().

For the tests we will try different array lengths.

array length	iter_and_sum	jit	njit	np.sum	sum
10^4	0.000926	0.000001	0.000001	0.000008	0.000750
10^5	0.009162	0.000009	0.000010	0.000042	0.007408
10^6	0.092395	0.000106	0.000120	0.000408	0.075894
10^7	0.936200	0.002313	0.002345	0.004393	0.766890
10^8	9.335412	0.021232	0.020506	0.042701	7.549756
10^9	96.134125	0.238718	0.225308	0.433185	76.537118

The result of the first run has been excluded before calculating the mean, so that compilation time is not counted.
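A timing helper along these lines would discard the first call. This is only a sketch: mean_time and its repeats parameter are illustrative names, not the code actually used for the table.

```python
import time


def mean_time(func, *args, repeats=5):
    """Mean wall-clock time of func(*args), ignoring the first call."""
    func(*args)  # warm-up call: a numba function compiles here, not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


print(mean_time(sum, list(range(100_000))))
```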

It is important to mention that the results would be different using python lists instead of numpy arrays.

In the table both jit and njit beat numpy.sum at every size, by roughly 8x at 10^4 elements and about 2x at 10^9.

In general it is a really good idea to use numpy functions, since they run compiled C code under the hood. In all cases the raw loop iter_and_sum and the default python sum give poor results.

For the largest arrays numba is around 400x faster than a python loop.

Test fix time

In one of the projects I was working on at my company I found an interesting problem. I had a huge dataframe (650 million rows) and, among other columns, there was one for the date and one for the time.

The date column followed the format YYYY-MM-DD like 2019-05-15. The problem was with the time column. The original format was hhmmsscc (c for centiseconds) and instead of being a string it was presented as an integer. So for example you could find the following values:

value	represents
23411210	23:41:12.10
9051137	09:05:11.37
712	00:00:07.12

The functions

To fix that I transformed the values to strings, padded them with leading zeros up to 8 characters, and finally split them and added : and . so that the result could be parsed as a time.
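Applied to a single value, those steps look roughly like this. format_time is a hypothetical helper written only to show the idea on the three sample values from the table.

```python
def format_time(value):
    """Turn an integer like 712 into a string like '00:00:07.12'."""
    padded = str(value).zfill(8)  # left-pad with zeros up to hhmmsscc
    return f"{padded[:2]}:{padded[2:4]}:{padded[4:6]}.{padded[6:]}"


for value in (23411210, 9051137, 712):
    print(value, "->", format_time(value))
# 23411210 -> 23:41:12.10
# 9051137 -> 09:05:11.37
# 712 -> 00:00:07.12
```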

Let’s use this problem for testing numba. First let’s see all the different options I came up with.

zfill

This was my first approach, which was good enough. The idea is to transform the time column to string and then apply zfill(8). After that I build a string with the appropriate format and convert the whole datetime with pd.to_datetime.

def zfill(df):
    """
        1. Transform time to str
        2. zfill
        3. split time as string lists
        4. pd.to_datetime
    """

    aux = df["time"].apply(str).apply(lambda x: x.zfill(8)).str
    return pd.to_datetime(df["date"] + " " + aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:])

fix_time_individual

The idea in this option is to create a function decorated with jit that transforms one element of the time column, and to apply that function to the column.

Instead of using pd.to_datetime I use .astype(np.datetime64) since it is faster.

def fix_time_individual(df):
    """
        1. pandas.apply a jit function to add 0 to time
        2. concat date + time
        3. change to np.datetime64
    """
    
    @jit
    def _fix_time(x):
        aux = "0"*(8 - len(str(x))) + str(x)
        return aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]
    
    return (df["date"] + " " + df["time"].apply(_fix_time)).astype(np.datetime64)

fix_time_np_string

With this solution I created an empty numpy array and filled it with a loop inside the jit decorated function.

def fix_time_np_string(df):
    """
        1. Use a jit function to add 0 to each time
        2. concat date + time
        3. change to np.datetime64
    """

    @jit
    def _fix_time(mlist):
    
        out = np.empty(mlist.shape, dtype=object)  # np.object was removed in NumPy 1.24

        for i in range(len(mlist)):

            elem = str(mlist[i])
            aux = "0"*(8 - len(elem)) + elem

            out[i] = aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]

        return out
    
    return (df["date"].values + " " + _fix_time(df["time"].values)).astype(np.datetime64)

fix_time_np_datetime

In this case I also create an empty numpy array, but with datetime64[s] dtype. This way I can iterate over both time and date at the same time.

def fix_time_np_datetime(df):
    """
        1. Iterate time and date with jit function
        2. Transform each element to string and add 0s
        3. Split the string
        4. Cast each element to np.datetime64
    """
    
    @jit
    def _fix_date(mdate, mtime):

        out = np.empty(mtime.shape, dtype="datetime64[s]")

        for i in range(len(mtime)):

            elem = str(mtime[i])
            aux = "0"*(8 - len(elem)) + elem

            aux = mdate[i] + " " + aux[:2] + ":" + aux[2:4] + ":" + aux[4:6] + "." + aux[6:]

            out[i] = np.datetime64(aux)

        return out
    
    return _fix_date(df["date"].values, df["time"].values)

np_divmod_jit

In this solution I process the time column as a number, and with the np.divmod function I compute a value that represents a timedelta in milliseconds. After transforming the time column I change the dtype to timedelta64[ms] and add it to the date column cast as datetime64.

def np_divmod_jit(df):
    """
        1. Iterate time and date with jit function
        2. Use np.divmod to transform HHMMSSCC to milliseconds integer
        3. Cast date as np.datetime and time to timedelta
        4. Sum date and time
    """
        
    @jit
    def _fix_date(mdate, mtime):

        time_out = np.empty(mtime.shape[0], dtype=np.int32)

        for i in range(mtime.shape[0]):
            aux, cent = np.divmod(mtime[i], 100)
            aux, seconds = np.divmod(aux, 100)
            hours, minutes = np.divmod(aux, 100)

            time_out[i] = 10*(cent + 100*(seconds + 60*(minutes + 60*hours)))

        return mdate.astype(np.datetime64) + time_out.astype("timedelta64[ms]")
    
    return _fix_date(df["date"].values, df["time"].values)
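To make the arithmetic concrete, here is the same conversion on single values in plain python. hhmmsscc_to_ms is an illustrative name; the formula is the one used in the loop above.

```python
def hhmmsscc_to_ms(value):
    """Convert an hhmmsscc integer to milliseconds since midnight."""
    rest, cent = divmod(value, 100)     # last two digits: centiseconds
    rest, seconds = divmod(rest, 100)   # next two digits: seconds
    hours, minutes = divmod(rest, 100)  # what is left: hours and minutes
    return 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))


print(hhmmsscc_to_ms(23411210))  # 85272100, i.e. 23:41:12.10
print(hhmmsscc_to_ms(9051137))   # 32711370, i.e. 09:05:11.37
print(hhmmsscc_to_ms(712))       # 7120, i.e. 00:00:07.12
```

The largest possible result is below 86,400,000 (the milliseconds in a day), so it fits comfortably in an int32.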

divmod_njit

It is the same as the previous example but after changing np.divmod to the python divmod function I can use the njit decorator for the first time.

def divmod_njit(df):
    """
        1. Iterate time with njit function
        2. Use divmod to transform HHMMSSCC to milliseconds integer
        3. Outside the njit function cast date as np.datetime and time to timedelta
        4. Sum date and time
    """

    @njit
    def _fix_time(mtime):

        time_out = np.empty(mtime.shape)

        for i in range(mtime.shape[0]):
            aux, cent = divmod(mtime[i], 100)
            aux, seconds = divmod(aux, 100)
            hours, minutes = divmod(aux, 100)

            time_out[i] = 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))

        return time_out

    return df["date"].values.astype(np.datetime64) + _fix_time(
        df["time"].values.astype(np.int32)
    ).astype("timedelta64[ms]")

divmod_vectorize

This case is exactly the same as the previous one but instead of doing the for loop I use the numba vectorize decorator.

def divmod_vectorize(df):
    """
        1. Use divmod to transform HHMMSSCC to milliseconds integer with vectorize
        2. Outside the vectorized function cast date as np.datetime and time to timedelta
        3. Sum date and time
    """

    @vectorize([int32(int32)])
    def _fix_time(mtime):

        aux, cent = divmod(mtime, 100)
        aux, seconds = divmod(aux, 100)
        hours, minutes = divmod(aux, 100)

        return 10 * (cent + 100 * (seconds + 60 * (minutes + 60 * hours)))

    return df["date"].values.astype(np.datetime64) + _fix_time(
        df["time"].values.astype(np.int32)
    ).astype("timedelta64[ms]")
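The last step shared by the divmod variants, adding a millisecond timedelta to a date, can be checked in isolation with numpy alone. The sample values below are taken from the examples earlier in the post: 23:41:12.10 is 85272100 ms after midnight.

```python
import numpy as np

# One date and the matching time already converted to milliseconds.
dates = np.array(["2019-05-15"]).astype(np.datetime64)
millis = np.array([85272100], dtype=np.int64).astype("timedelta64[ms]")

# Adding a day-resolution date and a millisecond timedelta gives datetime64[ms].
result = dates + millis
print(result)  # ['2019-05-15T23:41:12.100']
```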

The results

size	zfill	fix_time_individual	fix_time_np_string	fix_time_np_datetime	np_divmod_jit	divmod_njit	divmod_vectorize
10^2	0.0044	0.2134	0.4346	0.5354	0.4284	0.1323	0.0886
10^3	0.0072	0.2238	0.4435	0.5184	0.4371	0.1311	0.0900
10^4	0.0304	0.2655	0.4835	0.5588	0.5370	0.1429	0.1013
10^5	0.2955	0.7117	0.7359	0.8380	1.4360	0.1683	0.1274
10^6	3.4249	5.1130	3.4255	3.5275	10.1573	0.4198	0.3844
10^7	37.4349	50.3042	30.2810	32.3781	97.4159	2.9162	2.8040
10^8	721.8816	508.8800	310.1489	326.4423	973.3178	31.4919	31.6025

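Both divmod_njit and divmod_vectorize perform really well compared to the other options. It is interesting that my first approach (zfill) is the best for small dataframes, but it starts to underperform at 10^5 rows. The roughly constant cost of the numba options at small sizes is most likely compilation time, since the decorated functions are defined inside the outer function and are compiled again on every call.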

Working with numbers in numba is really fast.

Speed with different machines

I ran this test with up to 10^7 elements on my computer. I could not go further because of an out of memory error (not enough RAM).

Then I repeated everything with an M64 azure machine.

The specs of each machine are:

	processor	cores	RAM
my computer	Intel Core i5-6500 3.2GHz	4	16 GB
azure M64	Intel Xeon E7-8890 v3 2.5GHz (Haswell)	64	1 TB

Let’s compare the results of two functions.

size	zfill (i5)	zfill (azure)	divmod_vectorize (i5)	divmod_vectorize (azure)
10^2	0.0040	0.0044	0.0902	0.0886
10^3	0.0054	0.0072	0.0917	0.0900
10^4	0.0314	0.0304	0.1001	0.1013
10^5	0.2917	0.2955	0.1910	0.1274
10^6	3.1439	3.4249	1.0513	0.3844
10^7	32.7804	37.4349	9.7113	2.8040

If we look at the results we can see that zfill performs similarly on both machines.

It is important to remember that pandas only works with one core, so I am not using the full potential of the machines. Having more RAM allows working with more data but it does not increase the speed. With numba vectorize the azure machine performs better as the size increases, and it is about 3.5x faster at 10^7.

More info

You can read the Numba documentation. There are also some examples of numba on their web page at Numba examples.

I also suggest you read this post from Jake VanderPlas about Code optimization with numba.
