Wednesday, November 30, 2011

Quantifying dependency between attributes.

Given two sets of attributes, X = (x1, x2, ..., xp) and Y = (y1, y2, ..., yq), how do we quantify the dependency between X and Y?

If  both X and Y are one dimensional numerical attributes, we can use linear correlation coefficient to quantify the dependency between X and Y. For multidimensional X and Y (that is our data are numerical matrices of dimension n×p and n×q respectively, assuming that we have sampled n data points)  we can use canonical correlation analysis (CCA) to capture the dependency between X and Y. Linear methods are in general effective and simple to use. VisuMap offers support for CCA and many other linear analysis services through a plugin module MultivariateAnalysis.  However, linear methods work only for numerical attributes and can only capture linear relationships.

In this note I'll describe a more general method to quantify dependency between attributes. The method can be applied to non-numerical data and can capture non-linear relationships. The method is based on the concept conditional entropy from information theory. Assuming that X and Y are two random variables, the information gain ratio from X to Y is defined as following:

where H(Y) is the entropy of Y, and H(Y|X) is the conditional entropy of Y under the conditional that X is known. The numerator H(Y) - H(Y|X) is also called the mutual information between X and Y, and it is normally denoted as I(X; Y). So, the above formula can also be defined as R(X;Y):=I(X;Y)/H(Y). The following diagram shows the relationship between various types of entropy relating to two random variables:

In general, R(X; Y) tells us how much information we gain about Y when the value of X becomes known. R(X; Y) is a number between 0 and 1.0. If R(X;Y)=0, then X and Y are independent from each other, knowing X does not provide any information about Y. If R(X;Y)=1.0, then Y is completely dependent on X, i.e. Y is completely determined by the value of X.

Now, let's go back to our original problem. We assume that X and Y are now two tables of dimension n×p and n×q respectively, and X and Y may contain numerical and non-numerical values. In order to calculate the information gain ratio between X and Y we first convert them to random variables. To do so, we pick two clustering algorithms C and D that can be applied on X and Y. The clustering algorithm will assign each of the n data point to a label of label sets X and Y respectively:

                  C: N → X

                  D: N → Y

where N:= {1,2,3, ..., n} is the set of data sample indexes. We then define two random variables X' and Y' over X and Y as following:

With above definitions we can then apply the formula for R(X;Y) to the random variables X' and Y'. In particular, we have the following detailed formula to calculate the generalized information gain ratio (GIGR) of X for Y:


We notice that GIGR in above formula is indexed by the two clustering algorithms C and D. This means that the calculation is dependent on what clustering algorithms we choose for X and Y. In other words, the method described here is actually a meta-algorithm that requires, apart from two input data tables, also two clustering algorithms as input.

The clustering algorithms C and D plays an essential role in GIGR. They determine the granularity of the entropy calculation. More particularly, RC,D will be maximum if C and D are the finest clustering algorithms by creating unique label for each distinguished data point. On the other side, RC,D will be zero, if C and D, as the coarsest clustering algorithm, simply assign all data points to single label.

We can also consider the clustering algorithm in GIGR as a quantization method that converts data from multidimensional discrete or continues domain to one-dimensional discrete domain, so that we can effectively apply the information theoretical calculation on them.

In practise, C and D can be different clustering algorithms. The main requirement on C and D is that they can be applied on data tables X and Y separately and can assign a label to each data row. In fact, the data X and Y don't even have to be tables. They can be, for instance, a tree structures or DNA sequences. To apply our method, we just need to pick appropriate clustering algorithms for the data types.

It should be pointed out here that GIGR described here does not bias towards any type of clustering algorithms. In fact, I believe that any clustering algorithm would be appropriate for certain type of problems. The attached script code implemented 6 different groups of clustering algorithms to calculate GIGR with VisuMap.

Since the two clustering algorithms in GIGR are kind of input, we can use GIGR to investigate similarity between different clustering algorithms. For instance, if RC,D(X; X) is close 1.0 for a data set X, then the two clustering algorithms C and D produce similar results with respect to our data. Thus, in the practise, we might use the algorithm that is easier to implement, if the GIGR indicates that two algorithm produce similar results.

Also along the same line, GIGR can be used as an indicator for the stability of clustering algorithms. Many clustering algorithms, like the k-mean algorithm, are non-deterministic in the sense that repeating the algorithm on the same data normally yields different result. We would normally favor an algorithm that maintain certain level of stability so that the clustering results remain more or less the same. When we calculate RC,C(X; X) by repeating algorithm C twice on X, we would normally get a GIGR value that is smaller than 1.0 for non-deterministic clustering algorithms. In these cases, there closer RC,C(X; X) is to the value 1.0, the more stable is C with respect to the data set X.

The following JavaScript code helps users to calculate the GIGR with VisuMap software application:

Tuesday, November 29, 2011

K-Mean clustering under no-optimal conditions

K-Mean algorithm (KMA) is a widely used clustering algorithm for multidimensional data. In this note, I'll show some intuitive behaviors of KMA with a very simple quantization task: Grouping a series of numeric values into multiple groups. An intuitive requirement for this task is: if some values are concentrated at a location, those values should be grouped together.

I have generated 5 datasets, each with 5000 values between -5.0 and +20.0. Data points in the 5 datasets are successively concentrated at 1 to 5 locations according to Gaussian distribution. I used KMA to cluster these data into 3 clusters. I ran KMA on each dataset twice; the result is shown in the following picture:

In above picture, each spectrum shows the result of one run of KMA on a dataset. Each bar in the spectrum indicates a data point in a dataset. The curve on the spectrum shows the density of the data points. The bar colors indicate the clusters created by the KMA. We notice that all bars are colored by 3 different colors as the number of clusters, K, used KMA is 3.

Let N be the natural clusters present in the dataset (i.e. N=1,2,3,4,5 for these 5 datasets,) it would be helpful to distinguish the behaviors of KMA in to the following three cases:

(1) N < K : In this case the KMA has to group fewer natural clusters into larger number of clusters. As be shown in spectrum 1.a, 1.b, 2.a and 2.b, KMA assigned more than one clusters (colors) to a natural cluster, but didn't assign one to multiple natural clusters. For instance, in spectrum 2.a, the first natural cluster has been split evenly into two clusters.

(2) N = K : In this case the KMA has to group exact N natural clusters in to N clusters. We see in spectrum 3.a and 3.b that KMA did a nice job in this case. KMA acutally found the natural clusters.

(3) N > K : KMA has to group more natural clusters into fewer clusters. As be shown in spectrum 4.a, 4.b, 5.a and 5.b, KMA, in this case, grouped multiple natural clusters together. KMA has not, at least not significantly, split natural clusters. How KMA groups the natural clusters can change from run to run, depending on the initialization of the algorithm. For instance, spectrum 4.a and 4.b show two different grouping created by two different KMA runs.

In above test, our datasets all have clearly separated natural clusters. I would say that KMA did a pretty good job: its results reflect pretty much our intuitive expectation. However, if the natural cluster are not so clearly separated, the result of KMA would not be such nice. The following picture shows how KMA groups less separated natural clusters:

In above picture, the 5 datasets are created similarly as before except that these natural clusters overlap each other somewhat. We see that, only for the case K=N, KMA found the exact natural clusters. For other causes, KMA does not respect the natural clusters. Actually, we can visually do a better job based on the density curves.

Friday, May 13, 2011

VisuMap v3.5 Relased.

Today we have released VisuMap version 3.5.864. This version is a major upgrade from its previous version 3.2; it is marked by the following major changes:
  • Multi-core optimization: we have implemented extensive optimization for most algorithms to exploit multi-core computer system. By default, VisuMap 3.5 will automatically make use of all available CPU cores with multi-threading technology. The user can also explicitly limit the number of processing unit through the application configuration property WorkThreads. On our 6-core Windows 7 test machine we have often achieved 3-5 times performance speedup.

  • Updated DirectX Library: VisuMap used DirectX+MDX (managed directX) to implement hardware accelerated visualization service. Since MDX has been discontinued as a component in new generations of .NET platform, we have to switch to another product. In VisuMap v3.5 we have replaced MDX by SlimDX that is an open source library with similar goal as MDX. With this change, we not only significantly improved the rendering performance, but also reduced the dependency on old obsoleted software components. VisuMap can now be installed on most Windows 7 machines without pre-installing any other software. Some older systems may need to upgrade to the latest DirectX runtime library through the normal DirectX web installer.
  • Many bug fixes and enhancements: Based on customer feedback VisuMap has undergone many small changes. Most of them have already made available through previous minor releases. Worth mentioning is here the extension of the XY plot that made this simple service a generic tool for large datasets; as well as addition of new glyphsets and utility scripts. We have recently also released some VisuMap plugins which extend VisuMap for special user groups. For instance, the plugin for Multivariate Analysis and Wave Transformations. Also, the Customer Importer plugin has been extended to import FCS (Flow Cytometry Standard) files.

VisuMap v3.5 is backwards compatible with its previous versions. All existing datasets will silently upgraded to the new version when loaded into VisuMap. Existing users can upgrade to the new version, as long as the license type allows it, by simply clicking on the menu Help>Check for Updates in VisuMap's main window.

Wednesday, April 6, 2011

Self-similarity of high dimensional random walk process

A high dimensional random walk process (RWP) is the trajectory of a vector variable whose components change independently and randomly step by step, in discrete time space, for a small constant percentage. High dimensional RWP can be used as the starting point to investigate changing complex systems. For instance, as discussed in this blog, the stocks price history of 500 stocks can be considered as a 500 dimensional RWP.

Self-similarity means that an object is, from certain point of view, similar to parts of itself. A high dimensional RWP is self-similar in the sense that each sub-section of it follow the same statistical constraints. Thus, the randomness is a property shared by RWP and its sub-sections. We can generate series of 1000 data points of a 500-dimensional RWP with the following VisuMap's JavaScript code:

var n = 1000;
var dim = 500;
var rwp = New.NumberTable(n, dim);
var m = rwp.Matrix;
for( var col=0; col < dim; col++) {
m[0][col] = 1.0;
for(var t=1; t < n; t++) {
m[t][col] = m[t-1][col] * (1 + 0.01*(Math.random() - 0.5) );

How can we visualize the self-similar randomness in high dimensional RWP? A simple way is to use a dimensionality reduction method to map the high dimensional trajectory to low dimensional space, then plot it out on paper or screen. The following picture for instance shows two CCA (curvilinear component analysis) maps of the 1000 data points mapped to the 3 dimensional space:

We can see the randomness of the trajectory in above picture, but the self-similarity is hardly apparent. This is because, as intrinsically to the human perception, we normally only good at recognizing similarity between geometric patterns, but not similarity between random patterns.

Fortunately, when we use principal component analysis (PCA) to project the trajectory to low dimensional space, we get much easier recognizable patterns. For instance, the following picture shows the projections of our sample RWP to some major principal components (ie. eigenvectors):

The PCA projection to the major principal components apparently filtered out the randomness of the data, like a low pass filter suppresses high frequency noises. Now, we can select three principal components, say the second, third and forth components as our projection axes; then plot the PCA projection of the data (or sub-sections of it) to these three axes. As be illustrated in the following picture, we can see that they are all geometrically similar to each other albeit with different densities.

Notice that in above picture, the projection axes are the second, third and forth principal components of the selected data, not those of the complete dataset. The following video clip shows what we have discussed above in a more intuitively way:

One practical use of studying RWP is to find non-randomness ( that is information ) in seemly random data. Using the PCA technique we can show random data as easy recognizable patterns which enable us to detect deviations from those patterns, so that we can quickly find potential information in apparently random data.

Friday, April 1, 2011

Pirated software web sites

I have notice that more than a dozens so-called freeware web sites now provide pirated version of VisuMap software. All those sites seem to be originated from a single source, as they are all organized in a similar way and all pirated the version 3.2.854 of VisuMap. I have spent 2 or 3 hours to investigate those sites, the followings are what I have found:

  • These sites have done a pretty good SEO (search engine optimization) job, they pretty much over flooded search results when I searched for "VisuMap download" on Google. Google is clearly loosing the battle against those mis-information.

  • Most of these sites are not really free. They usually re-direct users through a chain of redirections with dubious ads (they hold the user with popups for popups, you need to shutdown the browser to get away from them), then at the end they usually require users to sign up for paid service for "fast" download.

  • All product information about VisuMap on those sites are copy and pasted from VisuMap's web site. Some sites used translation engine to translate English to other languages with horrible results.

  • All information about downloading sites, speed, etc. on those site are faked and likely automatically generated to fool users and search engines.

  • I have tested the pirated software on an isolated machine. I scanned the software with anti-virus software, no virus or spyware has been detected.

  • The pirated software seemed to run smoothly on the test machine, although it felt a little sluggish. Since it has no proper license, it won't be able to get online update from VisuMap's web site.

  • The pirated software has an obsoleted version, so that it won't be able to use some new plugin extensions for VisuMap.

With so many limitations and apparent dubious practice, I don't think those pirate sites will have serious impact on the legitimated software vendors. But, they clearly post a challenge for Internet searching engines.