Saturday, May 7, 2011

K-means clustering using R

I'm trying to learn R, and I'm a firm believer that there's no better way to learn than by getting your hands dirty. After reading an excellent post on the Intelligent Trading blog, I started wondering how you would do a clustering analysis with R, using K-means.

In the rest of this post I will try to detail the different steps that I followed, in the hope that they can be useful to others. I will be using a few R packages, namely quantmod, fpc and a few others. The most crucial is quantmod. You can install it with:
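A minimal sketch of that install step, assuming a reachable CRAN mirror. I also pull in fpc here since it is used further down; the `pkgs` and `to_install` names are just for illustration:

```r
# Packages needed for the clustering steps in this post
pkgs <- c("quantmod", "fpc")

# Install only what is not already available (one-time step)
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install) > 0) install.packages(to_install)

# Load them for the current session
library(quantmod)
library(fpc)
```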

The packages are used for:

  • quantmod: this is your bread and butter for time series analysis
  • graphics, scatterplot3d, gplots and RColorBrewer: used for plotting
  • fpc: a package dedicated to clustering

First, this article assumes that you already have your data handy in an xts object, as used by quantmod. If that's not the case, have a look at this article.

In the details below, "x" is the name of the object that contains my time series. Now let's dig into it.

The first thing you need to do is create a matrix with the different criteria you want to use for the clustering. In my case I'm going to use three normalized ratios, Close / Open, High / Open and Low / Open, and stuff them into a matrix. Then you want to run the actual clustering and display it:
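Here is a minimal sketch of those steps. To keep it self-contained, "x" is a synthetic OHLC series rather than real market data; the 5 clusters, the `nstart` value and the base `pairs` plot (standing in for a 3D scatter plot) are assumptions for illustration, not necessarily the settings behind the result shown in this post:

```r
library(quantmod)  # also loads xts

# Hypothetical stand-in for "x": 250 synthetic daily OHLC bars
set.seed(42)
n  <- 250
op <- 100 + cumsum(rnorm(n))
cl <- op + rnorm(n)
hi <- pmax(op, cl) + abs(rnorm(n))
lo <- pmin(op, cl) - abs(rnorm(n))
x  <- xts(cbind(Open = op, High = hi, Low = lo, Close = cl),
          order.by = as.Date("2010-01-01") + 0:(n - 1))

# Three ratios, each normalized by the open, stored as matrix columns
m <- cbind(coredata(Cl(x) / Op(x)),
           coredata(Hi(x) / Op(x)),
           coredata(Lo(x) / Op(x)))
colnames(m) <- c("close_open", "high_open", "low_open")

# K-means with an assumed number of clusters
k   <- 5
fit <- kmeans(m, centers = k, nstart = 20)

# Display: cluster centers, cluster sizes, and a colored scatter plot matrix
print(fit$centers)
print(table(fit$cluster))
pairs(m, col = fit$cluster, pch = 19)
```

Each row of `m` is one bar, and `fit$cluster` holds the cluster label assigned to that bar.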

The result looks like:
This works pretty well, except that you have to specify the number of clusters you're looking for. Another way to do this is through the pamk function from the fpc package, for which you don't specify the actual number of clusters: it is calculated for you (you do provide a range of values, though):
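A minimal sketch of the pamk call, again on synthetic ratios so that it runs on its own. The range 2:8 is an assumption for illustration; with its default settings pamk picks the number of clusters with the best average silhouette width:

```r
library(fpc)

# Hypothetical stand-in for the ratio matrix built from "x"
set.seed(42)
n  <- 250
op <- 100 + cumsum(rnorm(n))
cl <- op + rnorm(n)
hi <- pmax(op, cl) + abs(rnorm(n))
lo <- pmin(op, cl) - abs(rnorm(n))
m  <- cbind(close_open = cl / op, high_open = hi / op, low_open = lo / op)

# Let pamk choose the number of clusters within the given range
zpamk <- pamk(m, krange = 2:8)

# Number of clusters that was selected
print(zpamk$nc)
```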

Note the difference from the kmeans method: the cluster information is packed into a pam object. That's why you access the cluster details through zpamk$pamobject.
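To make that difference concrete, here is a small side-by-side sketch. The random matrix and the `fit` name are stand-ins for illustration only:

```r
library(fpc)

# Hypothetical stand-in data: 100 rows, 3 ratio-like columns close to 1
set.seed(1)
m <- matrix(rnorm(300, mean = 1, sd = 0.01), ncol = 3)

fit   <- kmeans(m, centers = 3, nstart = 20)  # kmeans: results at the top level
zpamk <- pamk(m, krange = 2:5)                # pamk: results wrapped in a pam object

# Cluster label of each row
head(fit$cluster)
head(zpamk$pamobject$clustering)

# Cluster representatives: means for kmeans, medoids (actual rows of m) for pam
print(fit$centers)
print(zpamk$pamobject$medoids)
```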