Cython Performance Case Study

I’ve been learning different programming languages since the early 2000s. It can be tempting to cling to one language and use it no matter what. There are advantages to mastering a single language, both at a personal level and at a company level. However, the disadvantages include:

(1) not taking advantage of new languages and their inherent strengths,

(2) being reliant on that language’s support for new tools,

(3) having your build and testing infrastructure stuck in one language (I really like Makefiles because they work with all languages!),

(4) missing out on how using multiple languages can guide your implementation toward clearly defined modules, and

(5) limiting yourself as a programmer by not learning new languages (see Nathan Marz’s blog post, which suggests that learning new languages makes us better programmers).

I started programming with C and C++, but higher-level languages like Python are attractive in terms of code density and readability, which in my experience greatly help maintainability. Many Java and C++ software engineers dismiss Python as a production-grade language because it is much slower and not statically typed (oftentimes they dismiss each other as well :-D, but that’s another topic).

I wanted to do a short case study on sorting unique numbers in linear time using bit vectors in each of these languages. I wanted to see how much slower Python would be compared to the statically typed languages for bit manipulation and, as the title suggests, compare all of these to Cython.

The precise task: sort a file of unique numbers and write them out to a file in sorted order. I plan to use a bit vector built from integers, setting the bit whose position corresponds to each number’s value. Once every number is represented by a single bit, I can loop through the bits of the bit vector and print the numbers out in order.
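To make the idea concrete, here is a minimal sketch of the approach in plain Python, using a single arbitrary-precision integer as the bit vector (this is only an illustration of the algorithm, not one of the implementations benchmarked below):

def bitvector_sort(nums):
    # set the bit whose position equals each number's value
    bv = 0
    for n in nums:
        bv |= 1 << n
    # walk the bits from low to high; set bits come out in sorted order
    out = []
    pos = 0
    while bv:
        if bv & 1:
            out.append(pos)
        bv >>= 1
        pos += 1
    return out

print(bitvector_sort([5, 2, 9, 0, 7]))  # [0, 2, 5, 7, 9]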

For me it was natural to think about implementing this algorithm in C. Here is a link to the full 32-bit based implementation. The important snippets below:

// bit vector
uint32_t *bvp = (uint32_t*)malloc(bv_size*sizeof(uint32_t));

// clear bit vector
for (idx = 0; idx < bv_size; idx++) {
    bvp[idx] = 0;
}

...

// ---- reads file and sets bits in bit vector ----
while (fscanf(ifp, "%d", ph_num) != EOF) {
    idx = *ph_num / 32;   // calc which int in bv to set
    bit = *ph_num % 32;   // calc which bit in int to set
    comp_int = 1;
    comp_int <<= bit;
    *(bvp + idx) = (*(bvp + idx) | comp_int);  // set bit
}

// ---- outputs sorted numbers in file ----
for (idx = 0; idx < bv_size; idx++) {
    comp_int = 1;
    if (*(bvp + idx) > 0) {  // small optimization
        for (bit = 0; bit < 32; bit++) {
            if (comp_int & *(bvp + idx))
                fprintf(ofp, "%d\n", idx*32 + bit);
            comp_int <<= 1;
        }
    }
}

Then, I wrote this in Python. I had never cared deeply about an integer’s bit length before in Python, so my first attempt was to use NumPy’s number types. Here is a link to the full Python implementation. The important snippets below:

bit_vector = numpy.zeros(bv_size, dtype=numpy.uint32)
for idx in xrange(bv_size):
    bit_vector[idx] = 0

# reading from file
with open('unsorted_phonebook.txt', 'r') as ifh:
    for line in ifh:
        ph_num = int(line)
        bit = ph_num % int_size
        idx = ph_num / int_size
        comp_int = numpy.uint32(1)
        comp_int <<= bit
        bit_vector[idx] |= comp_int

with open('sorted_phonebook_python.txt', 'w') as ofh:
    for idx in xrange(bv_size):
        comp_int = numpy.uint32(1)
        if bit_vector[idx] > 0:  # small optimization
            for bit in xrange(int_size):
                if comp_int & bit_vector[idx]:
                    ofh.write(str(idx*32 + bit) + '\n')
                comp_int <<= 1

The file handling is included here (the with blocks take care of closing the files, even on errors), showing off the nice code density of Python, but of course it’s slow (how much slower, we’ll see below). Here is a link to the full Cython implementation:

def sort_phonebook():
    DEF bv_size = 321500  # 10M / 32 supports 0 - 9999999 (or ph:999-9999)
    DEF int_size = 32
    cdef int bit
    cdef int idx
    cdef int ph_num
    cdef unsigned int comp_int
    cdef unsigned int bit_vector_i
    cdef unsigned int bit_vector[bv_size]

    for idx in xrange(bv_size):
        bit_vector[idx] = 0

    # reading from file
    with open('unsorted_phonebook.txt', 'r') as ifh:
        for line in ifh:
            ph_num = int(line)
            bit = ph_num % int_size
            idx = ph_num / int_size
            comp_int = 1
            comp_int <<= bit
            bit_vector[idx] |= comp_int

    with open('sorted_phonebook_cython.txt', 'w') as ofh:
        for idx in xrange(bv_size):
            comp_int = 1
            if bit_vector[idx] > 0:  # small optimization
                for bit in xrange(int_size):
                    if comp_int & bit_vector[idx]:
                        ofh.write(str(idx*int_size + bit) + '\n')
                    comp_int <<= 1

Notice how the Cython code is not very different from the Python code! The main changes are the cdef declarations, which let Cython use C integers and a C array for the bit vector instead of Python objects. Cython needs to compile this .pyx file before Python can call the function. The setup file is here:

from distutils.core import setup
from Cython.Build import cythonize

setup(
    ext_modules = cythonize("sort_phonebook_cython.pyx")
)

And then execute the setup with the following command:

python setup_cython_phonebook_sort.py build_ext --inplace

The build drops a compiled extension module next to the source, so now we can call this function in another Python program with:

import sort_phonebook_cython

sort_phonebook_cython.sort_phonebook()
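To collect timings like the ones reported below, wrapping each run with time.time() is enough. Here is a rough sketch of such a harness (the repo’s run_all_tests.py may well differ), reusing the C binary name produced by the Makefile:

import subprocess
import time

import sort_phonebook_cython

def timed(label, fn):
    # wall-clock timing, printed in the same style as the results below
    start = time.time()
    fn()
    print(str(time.time() - start) + ' seconds to sort in ' + label)

# time the compiled Cython extension directly...
timed('cython', sort_phonebook_cython.sort_phonebook)

# ...and shell out to the C binary built by the Makefile
timed('c', lambda: subprocess.call(['./sort_phonebook_in_c']))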

For more comprehensive benchmarking, I also wrote this algorithm in Java. For brevity, see the Java implementation here.

On to the performance results. I sorted 6- and 7-digit numbers with these programs.
The results for 6-digit numbers:

$ make
python create_phonebook_file.py
gcc sort_phonebook.c -o sort_phonebook_in_c
javac SortPhonebook.java
python setup_cython_phonebook_sort.py build_ext --inplace
running build_ext
python run_all_tests.py
0.608454942703 seconds to sort in c
1.06622695923 seconds to sort in cython
12.8049960136 seconds to sort in python
1.31410098076 seconds to sort in java
diff sorted_phonebook_c.txt sorted_phonebook_cython.txt
diff sorted_phonebook_c.txt sorted_phonebook_python.txt
diff sorted_phonebook_c.txt sorted_phonebook_java.txt

The results for 7-digit numbers (I excluded the creation of the input file, since that is really slow in Python with my current implementation):

$ make run_tests
python run_all_tests.py
4.33113193512 seconds to sort in c
10.2070810795 seconds to sort in cython
130.906479836 seconds to sort in python
10.8574941158 seconds to sort in java
diff sorted_phonebook_c.txt sorted_phonebook_cython.txt
diff sorted_phonebook_c.txt sorted_phonebook_python.txt
diff sorted_phonebook_c.txt sorted_phonebook_java.txt

A few points:

(1) Confirmed (anecdotally) that the runtime of these functions grows roughly linearly: about 10x going from 6-digit to 7-digit numbers (a 10x growth in the count of numbers, from 1M to 10M).

(2) The Python implementation was 20x slower than C for sorting 1M numbers and 30x slower than C for sorting 10M numbers

(3) Cython edged out Java for sorting 1M numbers, but both implementations were roughly 2–2.5x slower than C.

(4) There are 25 source lines of code (SLOC) for Python, 35 SLOC for Cython, 51 SLOC for Java, and 56 SLOC for C. There is an argument for treating this count as a kind of performance too, in terms of developer time.

There you have it. Try Cython if you have a computationally heavy function in Python (or one with high constant factors, as in this evaluation) that you’d like to avoid rewriting in C or Java. But the ultimate performance king is still C.

Originally published at http://ryaneirwin.com on July 19, 2014.