Easy Proof Omitted

A sink for random thoughts & experiments

0%

C through Python

Python has been always the ultimate glue language in various fields due to it's intuitive syntax and a huge community that support different high-level tools and libraries. But it's no secret that python is extremely slow compared to compiled languages like C.

Luckily, Python can easily interface with low-level languages to gain the best of both worlds. This post will show how to find and elevate some of the performance bottlenecks in your code.

In mathematics, the Fibonacci numbers, commonly denoted \(F_n\), form a sequence, called the Fibonacci sequence, such that each number is the sum of the two preceding ones, starting from 0 and 1. That is,

\[\begin{align*} &\quad F_0=0, F_1= 1,\\ &F_n=F_{n-1} + F_{n-2}, &&\textbf{n} > 1 \end{align*}\]

The beginning of the sequence is thus: \(0,\;1,\;1,\;2,\;3,\;5,\;8,\;13,\;21,\;34,\;55,\;89,\;144,\; \ldots\)

Fibonacci numberWikipedia

For the demonstration purposes , A grossly inefficient implementation of Fibonacci sequence will be used as a toy example:

fib.py
1
2
3
4
def f(x):
return f(x - 1) + f(x - 2) if x > 2 else 1

print(f(40))

Identify bottlenecks

Python has a couple of built-in profilers that can be used to benchmark the performance of your code: profile and cProfile. profile is written in pure python which in return adds a significant overhead and wastes more time than cProfile, which is a C extension with minimal overhead.

To run cProfile on your script, simply type:

python -m cProfile fib.py

102334155
204668313 function calls (5 primitive calls) in 76.602 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.011 0.011 76.602 76.602 fib.py:6(<module>)
204668309/1 76.591 0.000 76.591 76.591 fib.py:6(f)
1 0.000 0.000 76.602 76.602 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}

To calculate the 40th fibonacci number in the sequence, the script spent 76.602 seconds in f(x) strongly hinting that this causes a serious bottleneck. 1

What can we do about it?

Cython

Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex). It makes writing C extensions for Python as easy as Python itself.

In a nutshell, Cython will transpile our python code to C then compile it into a shared object.

First, Write the function in a new file fib.pyx providing as much type hints as possible

1
2
def f(int x): # type hint
return f(x - 1) + f(x - 2) if x > 2 else 1

Create a new file setup.py with the following:

1
2
3
4
5
6
from setuptools import setup
from Cython.Build import cythonize

setup(
ext_modules=cythonize("fib.pyx")
)

Finally run setup.py with the following arguments:

python setup.py build_ext --inplace

The previous step generated the required shared object and the necessary interfaces that integrates seamlessly with python using a regular import statement.

1
2
import fib
print(fib.f(40))

re-run cProfile

102334155
97 function calls in 4.156 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:103(release)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:143(__init__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:147(__enter__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:151(__exit__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:176(cb)
2 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
4 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:222(_verbose_message)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:342(__init__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:376(cached)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:389(parent)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:397(has_location)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:477(_init_module_attrs)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:549(module_from_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:58(__init__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:650(_load_unlocked)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:725(find_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:78(acquire)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:800(find_spec)
3 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:863(__enter__)
3 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:867(__exit__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:890(_find_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:956(_find_and_load_unlocked)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:986(_find_and_load)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1088(__init__)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1099(create_module)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1107(exec_module)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1265(_path_importer_cache)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1302(_get_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1334(find_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1426(_get_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1431(find_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:40(_relax_case)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:424(_get_cached)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:62(_path_join)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:629(spec_from_file_location)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:64(<listcomp>)
2 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:80(_path_stat)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:90(_path_is_mode_type)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:99(_path_isfile)
1 0.000 0.000 4.156 4.156 fib.py:10(<module>)
5 0.000 0.000 0.000 0.000 {built-in method _imp.acquire_lock}
1 0.000 0.000 0.000 0.000 {built-in method _imp.create_dynamic}
1 0.000 0.000 0.000 0.000 {built-in method _imp.exec_dynamic}
1 0.000 0.000 0.000 0.000 {built-in method _imp.is_builtin}
1 0.000 0.000 0.000 0.000 {built-in method _imp.is_frozen}
5 0.000 0.000 0.000 0.000 {built-in method _imp.release_lock}
2 0.000 0.000 0.000 0.000 {built-in method _thread.allocate_lock}
2 0.000 0.000 0.000 0.000 {built-in method _thread.get_ident}
1 0.000 0.000 4.156 4.156 {built-in method builtins.exec}
6 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
3 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
1 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
1 0.000 0.000 0.000 0.000 {built-in method posix.fspath}
1 0.000 0.000 0.000 0.000 {built-in method posix.getcwd}
2 0.000 0.000 0.000 0.000 {built-in method posix.stat}
1 4.155 4.155 4.155 4.155 {fib.f}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {method 'endswith' of 'str' objects}
2 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}
3 0.000 0.000 0.000 0.000 {method 'rpartition' of 'str' objects}
2 0.000 0.000 0.000 0.000 {method 'rstrip' of 'str' objects}

Went down from 76.602 seconds to 4.156 almost 18x speed-up.

Although this approach is much easier, but the generated C code is part of large generic boilerplate and results in a noticeable overhead. Therefore, it's highly recommended to provide as much type hints as possible before compilation process.

ctypes

ctypes is a foreign function library for Python. It provides C compatible data types, and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python.

First, Rewrite the performance critical functions in C or C++.

1
2
3
extern "C" int f(int x) {
return x > 2 ? f(x - 1) + f(x - 2) : 1;
}

extern "C" tells the compiler to avoid mangling the function name. which will work with simple function prototypes, but one-to-one mapping between C++ classes and python classes is not possible without an interface library like pybind11.

Then compile it into a shared object:

clang++ -shared -o fib.so fib.cpp

Lastly we can call any function from the shared object inside python:

1
2
3
4
import ctypes
fib = ctypes.CDLL('./fib.so')

print(fib.f(40))

Running cProfile again yields a few extra system calls that loads the shared object:

102334155
1265 function calls (1242 primitive calls) in 0.617 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:1017(_handle_fromlist)
6 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:103(release)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:143(__init__)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:147(__enter__)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:151(__exit__)
6 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:176(cb)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:194(_lock_unlock_module)
7/1 0.000 0.000 0.002 0.002 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
75 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:222(_verbose_message)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:342(__init__)
3 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:35(_new_module)
8 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:376(cached)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:389(parent)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:397(has_location)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:477(_init_module_attrs)
5 0.000 0.000 0.001 0.000 <frozen importlib._bootstrap>:549(module_from_spec)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:58(__init__)
5/1 0.000 0.000 0.002 0.002 <frozen importlib._bootstrap>:650(_load_unlocked)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:725(find_spec)
6 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:78(acquire)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:800(find_spec)
15 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:863(__enter__)
15 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:867(__exit__)
5 0.000 0.000 0.001 0.000 <frozen importlib._bootstrap>:890(_find_spec)
5/1 0.000 0.000 0.003 0.003 <frozen importlib._bootstrap>:956(_find_and_load_unlocked)
5/1 0.000 0.000 0.003 0.003 <frozen importlib._bootstrap>:986(_find_and_load)
3 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1010(path_stats)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:104(_path_isdir)
2 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1088(__init__)
2 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1099(create_module)
2 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1107(exec_module)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1252(_path_hooks)
19 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1265(_path_importer_cache)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1302(_get_spec)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1334(find_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1394(__init__)
8 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1400(<genexpr>)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1426(_get_spec)
15 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1431(find_spec)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1479(_fill_cache)
1 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1520(path_hook_for_FileFinder)
6 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:294(cache_from_source)
15 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:40(_relax_case)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:424(_get_cached)
3 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:456(_check_name_wrapper)
3 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:493(_classify_pyc)
3 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:51(_unpack_uint32)
3 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:554(_validate_hash_pyc)
3 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:578(_compile_bytecode)
71 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:62(_path_join)
5 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:629(spec_from_file_location)
71 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:64(<listcomp>)
6 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:68(_path_split)
3 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:774(create_module)
3/1 0.000 0.000 0.002 0.002 <frozen importlib._bootstrap_external>:777(exec_module)
28 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:80(_path_stat)
3 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:849(get_code)
9 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:90(_path_is_mode_type)
3 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:939(__init__)
3 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:964(get_filename)
6 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:969(get_data)
8 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:99(_path_isfile)
1 0.000 0.000 0.000 0.000 <frozen zipimport>:63(__init__)
1 0.000 0.000 0.002 0.002 __init__.py:1(<module>)
2 0.000 0.000 0.000 0.000 __init__.py:101(CFunctionType)
14 0.000 0.000 0.000 0.000 __init__.py:141(_check_size)
1 0.000 0.000 0.000 0.000 __init__.py:153(py_object)
1 0.000 0.000 0.000 0.000 __init__.py:162(c_short)
1 0.000 0.000 0.000 0.000 __init__.py:166(c_ushort)
1 0.000 0.000 0.000 0.000 __init__.py:170(c_long)
1 0.000 0.000 0.000 0.000 __init__.py:174(c_ulong)
1 0.000 0.000 0.000 0.000 __init__.py:183(c_int)
1 0.000 0.000 0.000 0.000 __init__.py:187(c_uint)
1 0.000 0.000 0.000 0.000 __init__.py:191(c_float)
1 0.000 0.000 0.000 0.000 __init__.py:195(c_double)
1 0.000 0.000 0.000 0.000 __init__.py:199(c_longdouble)
1 0.000 0.000 0.000 0.000 __init__.py:220(c_ubyte)
1 0.000 0.000 0.000 0.000 __init__.py:227(c_byte)
1 0.000 0.000 0.000 0.000 __init__.py:232(c_char)
1 0.000 0.000 0.000 0.000 __init__.py:237(c_char_p)
1 0.000 0.000 0.000 0.000 __init__.py:243(c_void_p)
1 0.000 0.000 0.000 0.000 __init__.py:248(c_bool)
1 0.000 0.000 0.000 0.000 __init__.py:253(c_wchar_p)
1 0.000 0.000 0.000 0.000 __init__.py:258(c_wchar)
1 0.000 0.000 0.000 0.000 __init__.py:261(_reset_cache)
1 0.000 0.000 0.000 0.000 __init__.py:318(CDLL)
2 0.000 0.000 0.001 0.001 __init__.py:339(__init__)
2 0.000 0.000 0.000 0.000 __init__.py:367(_FuncPtr)
1 0.000 0.000 0.000 0.000 __init__.py:383(__getattr__)
1 0.000 0.000 0.000 0.000 __init__.py:390(__getitem__)
1 0.000 0.000 0.000 0.000 __init__.py:396(PyDLL)
1 0.000 0.000 0.000 0.000 __init__.py:436(LibraryLoader)
2 0.000 0.000 0.000 0.000 __init__.py:437(__init__)
3 0.000 0.000 0.000 0.000 __init__.py:498(PYFUNCTYPE)
3 0.000 0.000 0.000 0.000 __init__.py:499(CFunctionType)
2 0.000 0.000 0.000 0.000 __init__.py:75(CFUNCTYPE)
1 0.000 0.000 0.000 0.000 _endian.py:1(<module>)
1 0.000 0.000 0.000 0.000 _endian.py:23(_swapped_meta)
1 0.000 0.000 0.000 0.000 _endian.py:46(BigEndianStructure)
1 0.613 0.613 0.617 0.617 fib.py:1(<module>)
1 0.000 0.000 0.000 0.000 struct.py:3(<module>)
2 0.000 0.000 0.000 0.000 {built-in method _ctypes.POINTER}
2 0.001 0.001 0.001 0.001 {built-in method _ctypes.dlopen}
38 0.000 0.000 0.000 0.000 {built-in method _ctypes.sizeof}
3 0.000 0.000 0.000 0.000 {built-in method _imp._fix_co_filename}
26 0.000 0.000 0.000 0.000 {built-in method _imp.acquire_lock}
2 0.000 0.000 0.000 0.000 {built-in method _imp.create_dynamic}
2 0.000 0.000 0.000 0.000 {built-in method _imp.exec_dynamic}
4 0.000 0.000 0.000 0.000 {built-in method _imp.is_builtin}
5 0.000 0.000 0.000 0.000 {built-in method _imp.is_frozen}
26 0.000 0.000 0.000 0.000 {built-in method _imp.release_lock}
3 0.000 0.000 0.000 0.000 {built-in method _imp.source_hash}
18 0.000 0.000 0.000 0.000 {built-in method _struct.calcsize}
10 0.000 0.000 0.000 0.000 {built-in method _thread.allocate_lock}
12 0.000 0.000 0.000 0.000 {built-in method _thread.get_ident}
30 0.001 0.000 0.001 0.000 {built-in method builtins.__build_class__}
4/1 0.000 0.000 0.617 0.617 {built-in method builtins.exec}
30 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
26 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
31 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
12 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
2 0.000 0.000 0.000 0.000 {built-in method builtins.setattr}
3 0.000 0.000 0.000 0.000 {built-in method from_bytes}
6 0.000 0.000 0.000 0.000 {built-in method io.open_code}
3 0.000 0.000 0.000 0.000 {built-in method marshal.loads}
11 0.000 0.000 0.000 0.000 {built-in method posix.fspath}
4 0.000 0.000 0.000 0.000 {built-in method posix.getcwd}
1 0.000 0.000 0.000 0.000 {built-in method posix.listdir}
28 0.000 0.000 0.000 0.000 {built-in method posix.stat}
2 0.000 0.000 0.000 0.000 {method 'clear' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
7 0.000 0.000 0.000 0.000 {method 'endswith' of 'str' objects}
3 0.000 0.000 0.000 0.000 {method 'extend' of 'list' objects}
10 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
77 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
9 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}
6 0.000 0.000 0.000 0.000 {method 'read' of '_io.BufferedReader' objects}
37 0.000 0.000 0.000 0.000 {method 'rpartition' of 'str' objects}
148 0.000 0.000 0.000 0.000 {method 'rstrip' of 'str' objects}
5 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}

We went from 76.602 seconds to 0.612 almost 125x speed-up!


  1. obviously since it's the only function.↩︎