Using PyCUDA in C++ Python Bindings
Some of the work I’ve been doing lately involves writing CUDA code. In particular, I’ve been working on a CUDA and OpenMP enabled simulated pulsar generator, psr-gen. Depending on how things shake out, running this code on the GPU can be anywhere from 5 to 30 times faster than the equivalent multi-core, CPU-bound version, including transfer times to and from the GPU. Given that this project is written in a combination of C++ and CUDA, things can get incredibly verbose very quickly. In order to make it easier for developers to use the psr-gen API, I’ve put together some SWIG-based Python bindings for the project. In doing so, I’ve found a reasonably slick way to access data that have been allocated on the GPU with PyCUDA from C++ land. This means that we can operate on PyCUDA GPUArrays using the psr-gen Python API. Instead of walking through the full psr-gen codebase, I’m going to show a minimal example that demonstrates the approach. pycuda-swig-example is located here and implements a warp reduction sum in CUDA.
Why not just use pycuda.compiler.SourceModule?
PyCUDA has a really slick means of calling custom CUDA kernels directly from a Python environment. We could use this functionality to call the kernels used in pycuda-swig-example without using any C++. This approach is fine when working with individual CUDA kernels, but it becomes problematic when implementing an algorithm that is not simply a series of calls to CUDA kernels. When calling an algorithm from Python that was written in a mix of C/C++ and CUDA, it becomes necessary to either translate the logic of the original algorithm, or to write some Python bindings.
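For reference, here’s roughly what the SourceModule route looks like. This is just a sketch with a toy kernel, not something taken from pycuda-swig-example:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

# compile a toy kernel at runtime; nvcc is invoked behind the scenes
mod = SourceModule("""
__global__ void scale (double* arr, double factor, unsigned size) {
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        arr[idx] *= factor;
    }
}
""")

scale = mod.get_function("scale")
arr = gpuarray.arange(1024, dtype=np.float64)
# launch with 4 blocks of 256 threads to cover all 1024 elements
scale(arr.gpudata, np.float64(2.0), np.uint32(1024), block=(256, 1, 1), grid=(4, 1))
print(arr.get()[:4])  # [0. 2. 4. 6.]

This works beautifully for one-off kernels, but the host-side logic lives entirely in Python.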
C++ interface
Before jumping into creating Python bindings for pycuda-swig-example, I want to show a simple example of the C++/CUDA interface.
// example.cpp
#include <iostream>
#include <vector>
#include <cuda_runtime.h>
#include "example/sum.hpp"

int main () {
    unsigned size = 53735;
    std::vector<double> in(size);
    double* in_d;
    // fill the host-side buffer with 0, 1, 2, ...
    for (unsigned idx = 0; idx < size; idx++) {
        in[idx] = (double) idx;
    }
    // allocate device memory and copy the host buffer over
    cudaMalloc(&in_d, size * sizeof(double));
    cudaMemcpy(in_d, in.data(), size * sizeof(double), cudaMemcpyHostToDevice);
    // reduce on the GPU
    double result = example::sum<double>(in_d, size);
    std::cerr << result << std::endl;
    cudaFree(in_d);
    return 0;
}
In the pycuda-swig-example directory, we can compile this as follows:
nvcc src/sum.cu -c -o sum.o -I include
g++ -o example example.cpp sum.o -I include -l cudart
(For those unfamiliar, we have to treat CUDA source files like we might Fortran source files – we compile separately and then link when creating our final library/executable. This introduces significant constraints on how we design CUDA libraries.)
We could use thrust to get a more idiomatic C++ interface, but then we’d have to either 1) write a wrapper for example::sum so we can compile with g++, or 2) use thrust in our final executable, but compile with nvcc. nvcc is not really a drop-in replacement for g++: see here and here.
Compare this to the syntax we can achieve with the Python bindings:
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import example

def main():
    size = 53735
    # allocate and fill the array directly on the GPU
    arr = gpuarray.arange(size, dtype=np.float64)
    result = example.sum(arr)
    print(result)

if __name__ == "__main__":
    main()
With the Python interface, we don’t have to worry about managing GPU memory, and given Python’s dynamic typing, we don’t have to think about templates. Perhaps most important is the fact that we don’t have to worry about compiling and linking any code!
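To make the dynamic typing point concrete, here’s a sketch of what calling the bindings with different dtypes will look like once we’ve built the dispatch logic below:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import example

# one Python entry point covers both template instantiations
arr32 = gpuarray.arange(10, dtype=np.float32)
arr64 = gpuarray.arange(10, dtype=np.float64)
print(example.sum(arr32))  # routed to example::sum<float>
print(example.sum(arr64))  # routed to example::sum<double>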
SWIG and PyCUDA
Before writing a SWIG interface file, we need to figure out what API we’d like to expose to Python users. PyCUDA GPUArrays carry with them information about their dimensionality, so we don’t need to send along that information as a separate argument. Moreover, our Python interface should take advantage of the fact that GPUArrays know what type of data they contain. Using type annotations, the function we’d like to expose should be something like this (ignoring complex numbers for the moment):
def sum(arr: pycuda.gpuarray.GPUArray) -> typing.Union[float, int]:
    ...
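As a quick sanity check on what a GPUArray already knows about itself:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

arr = gpuarray.arange(53735, dtype=np.float64)
print(arr.dtype)  # float64
print(arr.shape)  # (53735,)
print(arr.ptr)    # integer device address of the underlying allocation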
With this in mind, let’s start our SWIG interface file.
%module example

%{
#include "example/sum.hpp"
%}

%include "example/sum.hpp"

%pythoncode %{
import numpy as np

def sum(in_array):
    if in_array.dtype == np.float32:
        ...  # do float sum
    elif in_array.dtype == np.float64:
        ...  # do double sum
    else:
        raise TypeError("Sum not implemented for types other than np.float32 and np.float64")
%}
Here things start to get a little awkward. Given that Python is dynamically typed, we can’t directly use the nice template-based C++ API that we’ve exposed, as templates require knowing data types at compile time. SWIG does have a useful utility for just this situation:
%include "example/sum.hpp"

%inline %{
namespace example {
    template<typename Type>
    PyObject* sum (PyObject* in) {
        // body to come: extract the device pointer and size, then reduce
    }
}
%}
%template(sum_float) example::sum<float>;
%template(sum_double) example::sum<double>;
%pythoncode %{
import numpy as np

def sum(in_array):
    if in_array.dtype == np.float32:
        return sum_float(in_array)
    elif in_array.dtype == np.float64:
        return sum_double(in_array)
    else:
        raise TypeError("Sum not implemented for types other than np.float32 and np.float64")
%}
Now that we’ve added the C++ example::sum overload that takes and returns a pointer to a PyObject, we need to figure out how to get at the data to which the GPUArray refers. Lucky for us, the GPUArray has a ptr attribute that holds an integer corresponding to the memory address of the allocated array. On the C side we can read it out with PyLong_AsLongLong and use reinterpret_cast to turn it into a pointer of the appropriate datatype:
%module example

%{
#include "example/sum.hpp"

// Pull the device address out of a GPUArray's `ptr` attribute.
// (Reference counting is elided here for brevity.)
long long pycuda_get_ptr (PyObject* obj) {
    return PyLong_AsLongLong(
        PyObject_GetAttrString(obj, "ptr")
    );
}

// Read the first element of the GPUArray's `shape` tuple.
long long pycuda_get_size (PyObject* obj) {
    return PyLong_AsLongLong(
        PyTuple_GetItem(
            PyObject_GetAttrString(obj, "shape"), 0));
}
%}

%include "example/sum.hpp"

%inline %{
namespace example {
    template<typename Type>
    PyObject* sum (PyObject* in) {
        long long in_ptr = pycuda_get_ptr(in);
        long long in_size = pycuda_get_size(in);
        Type result = example::sum<Type>(
            reinterpret_cast<Type*>(in_ptr),
            static_cast<unsigned>(in_size)
        );
        return PyFloat_FromDouble((double) result);
    }
}
%}
%template(sum_float) example::sum<float>;
%template(sum_double) example::sum<double>;
%pythoncode %{
import numpy as np

def sum(in_array):
    if in_array.dtype == np.float32:
        return sum_float(in_array)
    elif in_array.dtype == np.float64:
        return sum_double(in_array)
    else:
        raise TypeError("Sum not implemented for types other than np.float32 and np.float64")
%}
This is pretty cool. PyCUDA no doubt has a C++ API that we could use to access these attributes more efficiently, but it is far easier to leverage the Python C API to get at them. I’m not entirely sure why PyCUDA exposes the memory address of the cudaMalloc’d arrays, but it sure is useful here!
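As a sketch of why the exposed address is handy, PyCUDA’s own driver-level functions will accept that raw integer anywhere they expect a device pointer:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

arr = gpuarray.arange(16, dtype=np.float64)
host = np.empty(16, dtype=np.float64)
# copy device -> host using the bare integer address from arr.ptr
drv.memcpy_dtoh(host, arr.ptr)
print(host[:4])  # [0. 1. 2. 3.]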
Let’s compile the interface file, which is located in the python subdirectory.
swig -python -outdir ./ -c++ -Iinclude -I/python/include/path/here -o example_wrap.cxx python/example.i
This creates two files: example_wrap.cxx and example.py. example.py expects the presence of a shared library called _example.so, which we can create as follows:
g++ -fPIC -o example_wrap.o -c -Iinclude -I/python/include/path/here example_wrap.cxx
nvcc --compiler-options -fPIC -o sum.o -c -Iinclude src/sum.cu
g++ -fPIC -shared -o _example.so example_wrap.o sum.o /path/to/python/shared/lib/here -l cudart
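With _example.so built and on the Python path, a quick smoke test might look like this:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import example

# 0 + 1 + ... + 9 == 45
print(example.sum(gpuarray.arange(10, dtype=np.float64)))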
All this compilation is a little complicated, which is why pycuda-swig-example uses CMake to help configure and build everything:
mkdir build && cd build
cmake ..
make
python ./../python/test_example.py
I’ve also put together a simple Makefile that illustrates in a more obvious way the series of steps needed to build the Python bindings.
python/test_example.py runs a small suite of unit tests to ensure that the sum function works.
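For reference, a minimal version of such a test might look something like this (a sketch; the actual tests in the repository may differ):

import unittest
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import example

class TestSum(unittest.TestCase):

    def test_sum_float64(self):
        arr = gpuarray.arange(53735, dtype=np.float64)
        # compare the GPU reduction against numpy's host-side sum
        self.assertAlmostEqual(example.sum(arr), arr.get().sum())

    def test_sum_float32(self):
        arr = gpuarray.arange(1024, dtype=np.float32)
        self.assertAlmostEqual(example.sum(arr), float(arr.get().sum()), places=3)

    def test_sum_unsupported_dtype(self):
        # integer arrays should be rejected by the dispatch logic
        arr = gpuarray.zeros(16, dtype=np.int32)
        with self.assertRaises(TypeError):
            example.sum(arr)

if __name__ == "__main__":
    unittest.main()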