For most of the last two years I have been playing around with the concept of the Self-Organizing Map (SOM), which aligns very well with the theoretical lines of argumentation in our research group. Although the algorithm itself is easy to implement, so far I have been using either the prominent SOM Toolbox for Matlab, written by researchers at Helsinki University of Technology, or the open source Java package DataBionics ESOM.
However, these days you hear everywhere about Python as a high-level programming language for scientific computing, in which you can speed up your code in different ways, such as Cython, parallel computing in the cloud, multi-core processing, and lots of interesting matrix calculation methods in numpy.
At the same time, unfortunately, once you go above a certain data set size, the above-mentioned packages are not fast enough. Therefore, in most machine learning communities you might hear that SOM is conceptually interesting, but that, for example, SVM is faster, though it lives in a linear world.
Although I believe the majority of SOM applications are limited to simple visualizations, SOM has lots of other capabilities: clustering, classification, prediction, function approximation beyond the least squares method and polynomial functions, computation based on symbolic features (i.e. nonlinear, contextual features that have no closed-form functional relation to the lower-level features they are built from), and even multi-modal data analysis. Anyway, I don’t want to discuss this issue here, but I will write about it later. (update: I wrote a paper about it. You can find it here)
Therefore, one good reason for me to learn Python was to write a new SOM package in Python that is fast and scalable. I checked several of the existing SOM modules in the Python community, but all of them use the classic sequential learning algorithm, which is not easy to speed up or parallelize. The reason is that the loop over the training data set (of length dlen) has to run serially, since the weight vectors of the SOM are updated after each individual data record. Considering that this loop over a large data set has to be repeated several times (the training length, normally more than 10), SOM can be considered a slow algorithm.
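To make the serial dependency concrete, here is a minimal sketch of sequential training in numpy. It is not taken from any of the packages mentioned above; the function names, the Gaussian neighborhood and the linear decay of the learning rate and radius are all assumptions chosen for illustration.

```python
import numpy as np

def make_grid(rows, cols):
    """Return the (rows*cols, 2) array of unit coordinates on a rectangular map."""
    return np.indices((rows, cols)).reshape(2, -1).T.astype(float)

def train_sequential(data, codebook, grid, n_epochs=10, lr0=0.5,
                     sigma0=2.0, sigma_end=0.5):
    """Sequential (online) SOM training: one codebook update per data record.

    Illustrative only: schedules and neighborhood function are assumptions.
    """
    codebook = np.array(codebook, dtype=float)  # work on a copy
    dlen = data.shape[0]
    total_steps = n_epochs * dlen
    step = 0
    for _ in range(n_epochs):          # training length
        for x in data:                 # serial loop over all dlen records
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                       # decaying learning rate
            sigma = sigma0 + (sigma_end - sigma0) * frac  # shrinking radius
            # best matching unit for this single record
            bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))
            # Gaussian neighborhood around the BMU, measured on the map grid
            grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            # this update changes the codebook that the next record will see
            codebook += lr * h[:, None] * (x - codebook)
            step += 1
    return codebook
```

Because every record is matched against a codebook that the previous record has just modified, the inner loop cannot simply be replaced by one matrix operation or split across cores.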
However, there is another learning algorithm, in which the whole data set is processed in one step instead of record by record. This algorithm is called batch training. The basic idea behind my library, which I call SOMPY, is to speed up this algorithm and make it suitable for larger data sizes.
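As a rough illustration of why batch training vectorizes well, the sketch below performs one batch update with a handful of matrix operations. It is a simplified sketch and not SOMPY's actual code; the names batch_epoch and train_batch, the Gaussian neighborhood and the linearly shrinking radius are assumptions.

```python
import numpy as np

def batch_epoch(data, codebook, grid, sigma):
    """One batch update: all records are matched against the same codebook."""
    # squared distances between every record and every unit: (dlen, n_units)
    d2 = (np.einsum("ij,ij->i", data, data)[:, None]
          - 2.0 * data @ codebook.T
          + np.einsum("ij,ij->i", codebook, codebook)[None, :])
    bmu = d2.argmin(axis=1)                      # best matching unit per record
    n_units = codebook.shape[0]
    # per-unit sum of the records it won, and how many it won
    sums = np.zeros_like(codebook, dtype=float)
    np.add.at(sums, bmu, data)
    counts = np.bincount(bmu, minlength=n_units).astype(float)
    # Gaussian neighborhood between all pairs of units on the map grid
    g2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-g2 / (2.0 * sigma ** 2))
    # new weight = neighborhood-weighted mean of the data
    numer = H @ sums
    denom = H @ counts
    safe = np.maximum(denom, np.finfo(float).tiny)
    # keep the old weight if a unit receives no weight at all
    return np.where(denom[:, None] > 0, numer / safe[:, None], codebook)

def train_batch(data, codebook, grid, n_epochs=10, sigma0=2.0, sigma_end=0.5):
    """Repeat the batch update while shrinking the neighborhood radius."""
    codebook = np.array(codebook, dtype=float)
    for sigma in np.linspace(sigma0, sigma_end, n_epochs):
        codebook = batch_epoch(data, codebook, grid, sigma)
    return codebook
```

Here the only Python-level loop runs over the training length, not over the dlen records, so the heavy work sits in matrix products that numpy can hand to optimized routines.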
I should mention that it is a real joy to be engaged in the Python community. Besides its simple framework as an object-oriented programming language, you can find whatever you want in the scipy community. It therefore took me only around one month to learn the basics of programming in Python from scratch and then move on to more advanced methods and libraries for matrix calculations, such as numba and numpy.einsum, and to memmap and pytables for memory management. I tested several ideas for speeding up the code and managing memory. I didn’t get good results from memmap and pytables with respect to the memory limit of my computer, but I am skeptical about my implementation.
The current version of the code is about 3 times faster (!?) than the SOM Toolbox for Matlab, which I found to be the fastest among the existing packages. This figure is mainly bounded by the limits of my CPU and RAM, whereas the Matlab toolbox does not reach the limits of the CPU and memory.
Currently, SOMPY works only on a single machine, but later on I am going to adapt it for parallel computing on a cluster. The IPython parallel computing framework looks very straightforward and simple for this goal.
You can see a quick demo of SOMPY here.
You can get the library files from its GitHub page, here.