Thursday, 15 December 2011

Conversion of numpy arrays from C++ classes without additional memory allocation

This post discusses the methods that can be used to integrate the array data structures implemented in C++ into Python code. At the end we want to be able to initialize C++ array objects from Python number sequences as well as to convert these objects into ndarray -- the array type of NumPy library.

The code bellow shows a simple example of an array data structure in C++.

Header file:

/*
* @file: MyArray.hpp
*/

#ifndef _MY_ARRAY_
#define _MY_ARRAY_

#include 

class MyArray 
{
public:
/*
* Constructor allocated memory of given size
*/
MyArray(int size);

/*
* Constructor creates MyArray object using given data
*/
MyArray(double* input, int size);

//getters
double* getData();
int getSize();
double getItem(int idx);

private:
double* data; //pointer to the memory segment
int size; //array size
};

#endif

And the source file:

/*
* @file: MyArray.cpp
*/

#include "MyArray.hpp"

MyArray::MyArray(int size){
  this->data = new double[size];
  this->size = size;
}

MyArray::MyArray(double* input, int size){
  this->size = size;
  // create new vector
  this->data = new double[size];
  // copy data
  memcpy(this->data, input, size * sizeof(double));
}

double* MyArray::getData(){
  return this->data;
}

int MyArray::getSize(){
  return this->size;
}

double MyArray::getItem(int idx){
  return this->data[idx];
}

Basically, we want to achieve the following in Python:

from myarray import MyArray
x = MyArray([1,2,3]) # we use Python array to initialize a C++ class
y = x.array() # create new view on the same data segment
type(y)
>> <type 'numpy.ndarray'>
del x # the data still exist since y is pointing on it
y[0]
>> 1.0
del y # memory is deallocated

SWIG is a useful framework for wrapping C/C++ code into libraries, which can be used in script languages like Python or Ruby. The general steps for compilation are described in a number of tutorials. If you are completely new to SWIG you probably should read these tutorials first.

In order to use SWIG with NumPy, we need the file numpy.i that contains the description of NumPy interfaces, i.e. numerical arrays, necessary for the work with Python. If you are using Ubuntu, you can get the file from the package python-instant in Ubuntu 11.10 or from python-numpy in some older versions. Alternatively, you can download the interface file from the NumPy github repository and put it into the directory with other interface files.

Now let us consider two parts of the problem separately.

Part 1. Initialization of a C++ class using a Python number sequence

The solution I have found in a stackoverflow post. We create an interface file MyArray.i for wrapping the C++ class.

/*
* @file MyArray.i
*/ 

%module myarray

// typemap allowing to pass sequence of numbers to constructor
%typemap(in) (double* data, int size)
{
  if (!PySequence_Check($input)) {
    PyErr_SetString(PyExc_ValueError, "Expected a sequence");
    return NULL;
  }
  $2 = PySequence_Size($input);
  $1 = (double *) malloc(sizeof(double)*$2);
  for (int i = 0; i < $2; i++) {
    PyObject *o = PySequence_GetItem($input,i);
    if (PyNumber_Check(o)) {
      $1[i] = (double) PyFloat_AsDouble(o);
    } else {
      PyErr_SetString(PyExc_ValueError,"Sequence elements must be numbers");      
      return NULL;
    }
  }
}
%typemap(freearg) (double *input, int size)
{
  if ($1) free($1);
}
%typecheck(SWIG_TYPECHECK_FLOAT) (double* data, int size)
{
$1 = PySequence_Check($input) ? 1 : 0;
}

In the code above all work is done manually. We check if the input parameters have a proper type, initialize an array of doubles with size elements, and then one by one copy the elements from Python sequence to our new array. Although I will show a simpler way below, this solution is useful if we need some additional magic during the conversion.

Alternatively, a cleaner solution uses input arrays available in SWIG. In this case MyArray.i file will contain following lines:

/*
* @file MyArray.i
*/

%module(directors="1") myarray

%include "typemaps.i"

%{
#define SWIG_FILE_WITH_INIT
#include "MyArray.hpp"
%}
%include "numpy.i"
%init %{
import_array();
%}


%apply (double* IN_ARRAY1, int DIM1) {(double* input, int size)}

//...

We need to include typemaps.i and numpy.i and to call function the function import_array that handles all necessary initialization routinges. Finally, with the %apply statement with tell the preprocessor the required type mapping.

Important: due to the operation of %apply the signatures and variable names should exactly match those used in the function signatures! Save yourself hours of bug-fixing and just copy/paste it.

Part 2. Create a persistent `ndarray` view on data in C++

Now we want to create a representation of our C++ class as an ndarray. Moreover, as in scientific computing or machine learning we usually deal with very large data, we would like to avoid allocation of additional memory and data copying. An obvious solution would be to use argoutview arrays from numpy.i. In this case, we would need to extend the MyArray.i file as follows:

//...

%apply ( DATA_TYPE** ARGOUTVIEW_ARRAY1, DIM_TYPE* DIM1 ) {(double** vec, int n)};

class MyArray {
public:
/*
* Constructor allocated memory of given size
*/
MyArray(int size);

/*
* Constructor creates MyArray object using given data
*/
MyArray(double* data, int size);

%extend {
    // Create a ndarray view from the MyArray data
    // using ARGOUTVIEW
    
    void __array(double** vec, int n){
      //Get the data and number of entries
      vec = &$self->getData();
      int n = $self->getSize();   
    }
     %pythoncode
     {
        def array(self):   
          return self.__array(self)
     }
  }
}

This is a simple and clean solution. Unfortunately, it has a big drawback, since without explicit memory management, we cannot guarantee the data integrity for variables containing view after the variable originally pointing to MyArray object was deleted or reassigned. The following example illustrates the problem:

import myarray
x = myarray.MyArray([1,2,3])
y = x.array()
y[0]
>> 1.0
x = 'foobar'
y[0]
>> [1.93759274e-10, 2,3]

Hence, the solution can be used only in cases, when the original MyArray variable (x in our example) lives as long as all other variables pointing to the underlying (y in our example).

To overcome the memory management issue, we need to create a NumPy array manually and use the Python pointer counting mechanism to guarantee that the memory will not be freed before the last variable pointing to the array was deleted or reassigned. The solution was inspired by this blog post.

//...

%inline %{
/*
* Memory managment function used in sg::base::DataVector::__array()
* it simply decrements the number of references to the PyObject datavector 
* every time a referencing ndarray is deleted.
* After reference counter reaches 0, the memory of DataVector will be deallocated 
* automatically. 
*/
void free_array(void* ptr, void* arr){
   double* vec = (double *) ptr;
   PyObject* array = (PyObject*)arr;
   Py_DECREF(array);
  }
%}

class MyArray {
public:
/*
* Constructor allocated memory of given size
*/
MyArray(int size);

/*
* Constructor creates MyArray object using given data
*/
MyArray(double* data, int size);

%extend {
    // Create a ndarray view from the MyArray data
    // an alternative approach using ARGOUTVIEW will fail since it does not allow to do a proper memory management
    PyObject* __array(PyObject* myarray){
        //Get the data and number of entries
      double *vec = $self->getPointer();
      int n = $self->getSize();

      npy_intp dims[1] = {n};
      
      // Create a ndarray with data from vec
      PyObject* arr = PyArray_SimpleNewFromData(1,dims, NPY_DOUBLE, vec);
      
      // Let the array base point on the original data, free_array is a additional destructor for our ndarray, 
      // since we need to DECREF MyArray object
      PyObject* base = C((void*)vec, (void*)myarray, free_array);
      PyArray_BASE(arr) = base;
      
      // Increase the number of references to PyObject MyArray, after the object the variable is reinitialized or deleted the object
      // will still be on the heap, if the reference counter is positive.
      Py_INCREF(myarray);
      
      return arr;
    }
     %pythoncode
     {
        def array(self):   
          return self.__array(self)
     }
  }
}

The function PyArray_SimpleNewFromData in the line 41 creates a 1D Python array of size dims from the memory segment referenced by vec. In the line 46 we set the base object of the MyArray wrapper class as the base object of the ndarray object arr. Now, if the ndarray object will be deleted or reassigned, Python calls free_array method on its base object. In the function free_array we then would decrement the pointer counter and if no pointer remains, the memory will be deallocated automatically.

Finally, we need to increase the number of pointer to the memory in the line 50 (in fact, we increase the number of pointer to the MyArray object and not to the double* data pointer).

Step 3. Compilation and Installation

To make the compilation simple, I am using distutils and setup.py file adapted from this tutorial.

#! /usr/bin/env python

# System imports
from distutils.core import *
from distutils      import sysconfig

# Third-party modules - we depend on numpy for everything
import numpy

# Obtain the numpy include directory.  This logic works across numpy versions.
try:
    numpy_include = numpy.get_include()
except AttributeError:
    numpy_include = numpy.get_numpy_include()

# inplace extension module
_myarray = Extension("_myarray",
                   ["MyArray.i","MyArray.cpp"],
                   include_dirs = [numpy_include],
      swig_opts=['-c++']
                   )

# NumyTypemapTests setup
setup(  name        = "C++/Numpy Conversion",
        description = "Integration of array data structures implemented in C++ into Python code",

        author      = "Valeriy Khakhutskyy",
        version     = "1.0",
        ext_modules = [_myarray]
        )

For detailed explanation of the file, please refer to the tutorial.

To build the code use

python setup.py build

Now copy the library that was compiled into build/lib.XXX directory into the folder with myarray.py file. If everything went right, running example code will give you correct output.

You can find the example files described in this post in github.

Sunday, 7 August 2011

Lists of lambda functions in Python

Lambda functions in Python offer a very convenient way to integrate functional programming in your imperative or object-oriented code. To keep it flexible one may need to generate the code of the lambda function at runtime. To make it work a simple trick is needed.

For the sake of illustration let us take a very simple example. Supposedly, we want to evaluate the sum of every Nth element in the list. We can try to solve the problem by creating a list of lambda functions and then applying it to the list:

functions_list = []
N = 3
for i in xrange(N):
    functions_list.append(lambda x: sum([my_list[j*N + i] 
                       for j in xrange(len(my_list)/N)]))

If we test our functions now, we will get a wrong results:



>>> my_list = range(12)

>>> [f(my_list) for f in functions_list]

[26, 26, 26]

The problem is that interpreter will look for the value of i in the scope, in our case it will find equal to 2 and use for every function in the list.

Using eval() function allows to "embed" the value of i in the functions body at the moment of its creation:

functions_list = []
for i in xrange(N):
    f_str = 'lambda x: sum([my_list[j*%i + %i] ' + \
    'for j in xrange(len(my_list)/%i)])' %( N, i, N)
    functions_list.append(eval(f_str))

And this will lead to the correct solution.



>>> [f(my_list) for f in functions_list]

[18, 22, 26]

Thursday, 16 June 2011

Shared repository using git

Revision control systems, or source code repositories, are vital for managing software development projects. But a new research project, or writing a paper, you can benefit greatly from using a repository as well. Besides the ability to revert changes or lookup the old version of a text, revision control systems allow simple sharing and simultaneous work on same files.

In this post I describe how to initialize a git repository on Windows systems using a shared drive. In my next post I will show how to share your repository using Dropbox. If you want to create a shared repository on a server with ssh access, I would refer you to this post.

Getting started

STEP 1. Download and install git client for Windows from the official website. Please check the integration of "Git Bash Here" and "Git Gui Here" into Windows Explorer, since it will make some things a lot more comfortable.

STEP 2. Connect your network drive. Open Explorer and then choose Tools->Map Network Drive...

Type in the path to the shared drive and check the box "Reconnect at logon"

STEP 3. On the network drive create a new folder for your repository. Now, click on the folder with the right mouse button and choose "Git Bash Here" from the context menu.

This will open a command line window where you have to run the command

git init --bare --shared=group

to initialize an empty repository.

STEP 4. Create a local folder in Explorer for your working copy (in this instruction I will use the folder "myrepo") and again open the command line window by clicking on the folder with the right mouse button and choosing “Git Bash here” from the context menu. In the command line window type



git clone //my/shared/drive/my_shared_repo myrepo

and hit "Enter" to clone the shared repository to your local working repository. Make sure to enter the proper paths instead of //my/shared/drive/my_shared_repo and myrepo

STEP 5. At last you need to configure git to use proper identification information:



$ git config --global user.name "Firstname Lastname"

$ git config --global user.email "your_email@youremail.com"

That's it!

General Workflow

Now you have two repositories: your local working repository with information stored in .git directory and the shared repository in the network folder. To share your repository, repeat the steps 2, 4, and 5 on other computers as well.

The usual workflow includes the following three steps:

Pull the recent changes from the shared repository to get your working copy up-to-date,
Make changes and commits them locally. If you are ready,
Push all your commits to the shared repository to make the changes available for other users as well.

STEP 1. To pull the recent global commit and get your repository up-to-date call in your working directory:



git pull

STEP 2. To check the status of your working copy:



git status

This will show you which files were changed/added after the last commit

STEP 3. To add new files to repository:



git add filename

STEP 4. To commit changes to your local version control repository:



git commit -a -m “your message here”

Think about putting some meaningful messages to recognize your commits later.

STEP 5. To send your commits to the shared repository:



git push

Thursday, 26 May 2011

Database copy with minimal access rights

Currently, I faced a very specific problem. I needed to duplicate a remote database locally. The problem was that I had only rights to execute SELECT statements on the remote computer. Maybe this post will help others to save time looking for the solution.

The reference manual of MySQL describes how to copy a database from one computer to another. Using the instructions in the manual, I ended up using mysqldump and mysql programs on my local computer. This utility programs are parts of the Community Server package and not of Workbench Tools as some people would assume ;-)

Once the software is installed. Run



mysqldump -h <remote_db_hostname> -u <username> -p --compress --single-transaction db_name | gzip > db_name.gz

to save the tables of the database locally. And then



gunzip < db_name.gz | mysql db_name

to create the tables in the local database.

Without the parameter --single-transaction for mysqldump I received the error message

mysqldump: Got error: 1044: Access denied for user 'username'@'%' to database 'Databasename' when using LOCK TABLES

P.S. If you are using SQuirreL SQL client, there is a plugin DBCopy available for these purposes.

Wednesday, 30 March 2011

Serialisation of trees in Matlab

I have written a small but handy Matlab function to serialise tree structures from an adjacency matrix. I use the string then to process the trees in Python, but one can come up with some other reasons to do that.

The function has some constraints, i.e. since there is no separators between node indices it will only work for trees with small number of nodes, but it was sufficient for me and it is easy to extend.

function str = adjmatrix2string( adjmatrix, node, parent )
%ADJMATRIX2STRING converts an adjacency matrix of a tree to 
string
%   Converts an adjacency matrix of a tree to its 
%   string representation of 
%   the form 1(2(34)5(67))
%   adjmatrix - adjacency matrix
%   node - index of the current node
%   parent - index of parrent node, 0 for root
%   call example: adjmatrix2string(adjmatrix, 1, 0)

str = int2str(node);
childnodes = find(adjmatrix(node, :));
% only one node in a tree
if isempty(childnodes)
str = strcat(str, node);
elseif length(childnodes) == 1
% node is a root
if parent == 0
str = strcat(str, '(', adjmatrix2string(adjmatrix, 
             childnodes(1), node), ')');
end
else
str = strcat(str, '(');
for childnode = childnodes
if childnode ~= parent
str = strcat(str, adjmatrix2string(adjmatrix, 
      childnode, node));
end
end
str = strcat(str, ')');
end
end

It will convert an adjacency matrix
$\begin{pmatrix}0&1&1&0&1&0&0\\1&0&1&1&0&0&0\\0&1&0&0&0&0&0\\0&1&0&0&0&0&0\\1&0&0&0&0&1&1\\0&0&0&0&1&0&0\\0&0&0&0&1&0&0\\\end{pmatrix}$
to the string '1(2(34)5(67))'.