Thursday, 15 December 2011

Conversion of numpy arrays from C++ classes without additional memory allocation

This post discusses the methods that can be used to integrate the array data structures implemented in C++ into Python code. At the end we want to be able to initialize C++ array objects from Python number sequences as well as to convert these objects into ndarray -- the array type of NumPy library.

The code bellow shows a simple example of an array data structure in C++.

Header file:

/*
* @file: MyArray.hpp
*/

#ifndef _MY_ARRAY_
#define _MY_ARRAY_

#include 

class MyArray 
{
public:
/*
* Constructor allocated memory of given size
*/
MyArray(int size);

/*
* Constructor creates MyArray object using given data
*/
MyArray(double* input, int size);

//getters
double* getData();
int getSize();
double getItem(int idx);

private:
double* data; //pointer to the memory segment
int size; //array size
};

#endif

And the source file:

/*
* @file: MyArray.cpp
*/

#include "MyArray.hpp"

MyArray::MyArray(int size){
  this->data = new double[size];
  this->size = size;
}

MyArray::MyArray(double* input, int size){
  this->size = size;
  // create new vector
  this->data = new double[size];
  // copy data
  memcpy(this->data, input, size * sizeof(double));
}

double* MyArray::getData(){
  return this->data;
}

int MyArray::getSize(){
  return this->size;
}

double MyArray::getItem(int idx){
  return this->data[idx];
}

Basically, we want to achieve the following in Python:

from myarray import MyArray
x = MyArray([1,2,3]) # we use Python array to initialize a C++ class
y = x.array() # create new view on the same data segment
type(y)
>> <type 'numpy.ndarray'>
del x # the data still exist since y is pointing on it
y[0]
>> 1.0
del y # memory is deallocated

SWIG is a useful framework for wrapping C/C++ code into libraries, which can be used in script languages like Python or Ruby. The general steps for compilation are described in a number of tutorials. If you are completely new to SWIG you probably should read these tutorials first.

In order to use SWIG with NumPy, we need the file numpy.i that contains the description of NumPy interfaces, i.e. numerical arrays, necessary for the work with Python. If you are using Ubuntu, you can get the file from the package python-instant in Ubuntu 11.10 or from python-numpy in some older versions. Alternatively, you can download the interface file from the NumPy github repository and put it into the directory with other interface files.

Now let us consider two parts of the problem separately.

Part 1. Initialization of a C++ class using a Python number sequence

The solution I have found in a stackoverflow post. We create an interface file MyArray.i for wrapping the C++ class.

/*
* @file MyArray.i
*/ 

%module myarray

// typemap allowing to pass sequence of numbers to constructor
%typemap(in) (double* data, int size)
{
  if (!PySequence_Check($input)) {
    PyErr_SetString(PyExc_ValueError, "Expected a sequence");
    return NULL;
  }
  $2 = PySequence_Size($input);
  $1 = (double *) malloc(sizeof(double)*$2);
  for (int i = 0; i < $2; i++) {
    PyObject *o = PySequence_GetItem($input,i);
    if (PyNumber_Check(o)) {
      $1[i] = (double) PyFloat_AsDouble(o);
    } else {
      PyErr_SetString(PyExc_ValueError,"Sequence elements must be numbers");      
      return NULL;
    }
  }
}
%typemap(freearg) (double *input, int size)
{
  if ($1) free($1);
}
%typecheck(SWIG_TYPECHECK_FLOAT) (double* data, int size)
{
$1 = PySequence_Check($input) ? 1 : 0;
}

In the code above all work is done manually. We check if the input parameters have a proper type, initialize an array of doubles with size elements, and then one by one copy the elements from Python sequence to our new array. Although I will show a simpler way below, this solution is useful if we need some additional magic during the conversion.

Alternatively, a cleaner solution uses input arrays available in SWIG. In this case MyArray.i file will contain following lines:

/*
* @file MyArray.i
*/

%module(directors="1") myarray

%include "typemaps.i"

%{
#define SWIG_FILE_WITH_INIT
#include "MyArray.hpp"
%}
%include "numpy.i"
%init %{
import_array();
%}


%apply (double* IN_ARRAY1, int DIM1) {(double* input, int size)}

//...

We need to include typemaps.i and numpy.i and to call function the function import_array that handles all necessary initialization routinges. Finally, with the %apply statement with tell the preprocessor the required type mapping.

Important: due to the operation of %apply the signatures and variable names should exactly match those used in the function signatures! Save yourself hours of bug-fixing and just copy/paste it.

Part 2. Create a persistent ndarray view on data in C++

Now we want to create a representation of our C++ class as an ndarray. Moreover, as in scientific computing or machine learning we usually deal with very large data, we would like to avoid allocation of additional memory and data copying. An obvious solution would be to use argoutview arrays from numpy.i. In this case, we would need to extend the MyArray.i file as follows:

//...

%apply ( DATA_TYPE** ARGOUTVIEW_ARRAY1, DIM_TYPE* DIM1 ) {(double** vec, int n)};

class MyArray {
public:
/*
* Constructor allocated memory of given size
*/
MyArray(int size);

/*
* Constructor creates MyArray object using given data
*/
MyArray(double* data, int size);

%extend {
    // Create a ndarray view from the MyArray data
    // using ARGOUTVIEW
    
    void __array(double** vec, int n){
      //Get the data and number of entries
      vec = &$self->getData();
      int n = $self->getSize();   
    }
     %pythoncode
     {
        def array(self):   
          return self.__array(self)
     }
  }
}

This is a simple and clean solution. Unfortunately, it has a big drawback, since without explicit memory management, we cannot guarantee the data integrity for variables containing view after the variable originally pointing to MyArray object was deleted or reassigned. The following example illustrates the problem:

import myarray
x = myarray.MyArray([1,2,3])
y = x.array()
y[0]
>> 1.0
x = 'foobar'
y[0]
>> [1.93759274e-10, 2,3]

Hence, the solution can be used only in cases, when the original MyArray variable (x in our example) lives as long as all other variables pointing to the underlying (y in our example).

To overcome the memory management issue, we need to create a NumPy array manually and use the Python pointer counting mechanism to guarantee that the memory will not be freed before the last variable pointing to the array was deleted or reassigned. The solution was inspired by this blog post.

//...

%inline %{
/*
* Memory managment function used in sg::base::DataVector::__array()
* it simply decrements the number of references to the PyObject datavector 
* every time a referencing ndarray is deleted.
* After reference counter reaches 0, the memory of DataVector will be deallocated 
* automatically. 
*/
void free_array(void* ptr, void* arr){
   double* vec = (double *) ptr;
   PyObject* array = (PyObject*)arr;
   Py_DECREF(array);
  }
%}

class MyArray {
public:
/*
* Constructor allocated memory of given size
*/
MyArray(int size);

/*
* Constructor creates MyArray object using given data
*/
MyArray(double* data, int size);

%extend {
    // Create a ndarray view from the MyArray data
    // an alternative approach using ARGOUTVIEW will fail since it does not allow to do a proper memory management
    PyObject* __array(PyObject* myarray){
        //Get the data and number of entries
      double *vec = $self->getPointer();
      int n = $self->getSize();

      npy_intp dims[1] = {n};
      
      // Create a ndarray with data from vec
      PyObject* arr = PyArray_SimpleNewFromData(1,dims, NPY_DOUBLE, vec);
      
      // Let the array base point on the original data, free_array is a additional destructor for our ndarray, 
      // since we need to DECREF MyArray object
      PyObject* base = C((void*)vec, (void*)myarray, free_array);
      PyArray_BASE(arr) = base;
      
      // Increase the number of references to PyObject MyArray, after the object the variable is reinitialized or deleted the object
      // will still be on the heap, if the reference counter is positive.
      Py_INCREF(myarray);
      
      return arr;
    }
     %pythoncode
     {
        def array(self):   
          return self.__array(self)
     }
  }
}

The function PyArray_SimpleNewFromData in the line 41 creates a 1D Python array of size dims from the memory segment referenced by vec. In the line 46 we set the base object of the MyArray wrapper class as the base object of the ndarray object arr. Now, if the ndarray object will be deleted or reassigned, Python calls free_array method on its base object. In the function free_array we then would decrement the pointer counter and if no pointer remains, the memory will be deallocated automatically.

Finally, we need to increase the number of pointer to the memory in the line 50 (in fact, we increase the number of pointer to the MyArray object and not to the double* data pointer).

Step 3. Compilation and Installation

To make the compilation simple, I am using distutils and setup.py file adapted from this tutorial.

#! /usr/bin/env python

# System imports
from distutils.core import *
from distutils      import sysconfig

# Third-party modules - we depend on numpy for everything
import numpy

# Obtain the numpy include directory.  This logic works across numpy versions.
try:
    numpy_include = numpy.get_include()
except AttributeError:
    numpy_include = numpy.get_numpy_include()

# inplace extension module
_myarray = Extension("_myarray",
                   ["MyArray.i","MyArray.cpp"],
                   include_dirs = [numpy_include],
      swig_opts=['-c++']
                   )

# NumyTypemapTests setup
setup(  name        = "C++/Numpy Conversion",
        description = "Integration of array data structures implemented in C++ into Python code",

        author      = "Valeriy Khakhutskyy",
        version     = "1.0",
        ext_modules = [_myarray]
        )

For detailed explanation of the file, please refer to the tutorial.

To build the code use

python setup.py build

Now copy the library that was compiled into build/lib.XXX directory into the folder with myarray.py file. If everything went right, running example code will give you correct output.

You can find the example files described in this post in github.

1 comment:

  1. Maybe add a final checklist, including where we should save the files.

    ReplyDelete