Visualizing High Dimensional Data

Friday, August 31, 2007

The null-dataset

When we analyze a high dimensional dataset the first question is often:

Does the dataset contain useful information at all?

A simple way to answer this questions is as follows: We first construct a so called null-dataset that contains 100 data points; each data point is a 100 dimensional row vector of the following table:

1, 0, 0, ..., 0

0, 1, 0, ..., 0
0, 0, 1,...., 0

...

0, 0, 0,..., 1

Since the Euclidean distance between any two points from the null-dataset is the constant sqrt(2), the distance matrix contains no information about the data points in the sense that the distance matrix does not single out any group of points with special meaning.

Now for a new high dimensional dataset, we can simply apply a collection of MDS mapping methods to the dataset and to the null-dataset with the same configuration settings. If the MDS maps of the new dataset resemble those of the null-dataset, we can expect that the new dataset contains no meaningful information.

The following pictures show the 2D MDS maps of the null-dataset with 100 data points created with 4 MDS algorithms (i.e. RPM, Sammon, CCA and SMACOF) available in VisuMap:

We notice that three of these MDS algorithms resulted in similar disk alike maps, whereas the RPM algorithm produced a homogeneously distributed map (the first map).

In order to understand the behavior of MDS algorithms it is helpful to see how they react to injection of information by means of broken symmetry between the data points. For this purpose I have altered the null-dataset slightly be multiplying the vector of the first data point by the factor 1.05. This change singles out the first data point as it has slightly larger distance to other points. The following pictures are the corresponding MDS maps of dataset:

The first data point is shown as a red spot in above pictures. We can see that 3 of our MDS algorithms singled out the first point as an out-lier, whereas the RPM method ignored such alteration in the dataset. Similarly, we can alter the null-dataset by multiplying a factor 0.95 to the first data point. The following pictures show the corresponding MDS maps:

We can see that all 4 MDS mapping algorithms singled the first data point by moving it to the center of these maps.

In order to test our method for a real dataset I have created a dataset with ca. 500 data points. Each data point contains 100 daily price changes in percentage. The following pictures show its MDS maps created with the same set of configuration settings used for the null-dataset:

We can notice, more or less, some resemblance between above pictures and those of the null-dataset. This means, that the distance matrix probably doesn't carry much information about the stocks; and we probably won't be able to extract much information from the dataset using methods based on Euclidean distance. For example, it won't make sense to apply k-mean clustering or neural network algorithms directly on this dataset.

It should be pointed out here that above statments only apply to the Euclidean distance. If we use other distance metric to measure to distance between the data points (for instance use the correlation as a kind of distance) we might be able to extract useful information.

In general, the null-dataset method described above only applies to Euclidean distance. If we explore the dataset as a non-euclidean space, we need to adapt the null-dataset to that non-euclidean space.

Thursday, August 30, 2007

VisFinance 1.2 Released

We have just released VisFinance 1.2.
VisFinance is a plugin module for VisuMap to provide
specialized services for finance data analysis.

Free trial version of VisFinance 1.2 can be downloaded at here.

Wednesday, August 22, 2007

VisuMap 2.5 Released.

Today we have released VisuMap version 2.5. This version has undergone major extensions and changes in scripting & pluging interfaces in order to accommodate more advanced plugin applications (e. g. the upcoming VisFinance plugin for finance service).

In this release we have not changed the formats of data files nor the configuration file, but plugins and scripts developed for version 2.4 or older might need to be upgraded to the new interface.

This release also includes many bug fixes, enhancements in user interfaces and performance. For instance, double-mouse-clicking has now been consistently linked to the task to switch the mode of a window. In the main window, double-clicking the main map now switches the main map between editing and shifting mode, instead of starting the RPM mapping algorithm as in the previous version.

In general, data drilling features (i.e. PCA, Data Details and Diagram view) now work in a more consistent and integrated way. For instance, you can now create a new dataset for selected data in almost all views with few mouse clicks.

Also, we have a new price list for our products and services.

Tuesday, July 24, 2007

The symmetry between repulsive and attractive force

The relational perspective map (RPM) algorithm in its core simulates a multi-particle system on a torus surface in which the particles exerts repulsive forces to each other. Some people has asked the question why just repulsive force? Why not use attractive forces like other force directed mapping methods?

A simple answer to this question is that we prefer to use a simpler dynamic system if it can solve the problem. Just like physicists who try to reduce the number of fundamental interactions, I would prefer to avoid using attractive force if it is not absolutely needed.

Classical multidimensional scaling (MDS) methods have to use both types of forces (as can be derived from their stress function) because their base information space is the infinite open Euclidean space: without attractive force their configuration will quickly degrade to infinite size; and without repulsive forces their configuration will shrink to a single point. With RPM method, the closed manifold (i.e. the torus surface) confines the configuration into a limited size.

Another not-so-obvious answer to above question is that on a closed manifold the repulsive and attractive forces are the manifestation of the same thing, from certain point of view at least. To see this, let use consider the simplest case of 1-dimensional curled space (i.e. the circle). As be shown in the following picture, imaging that we have two ants living on the circle and there are two positively charged particles on the circle which can move freely on the circle but are confined on the circle.

From the point of view of the ant on the left side, the two charges exert repulsive force to each other according to coulomb's law.

However, from the point of view of the ant on the right side, the two charges attract each other. This is a little contra-intuitive for us as observers from the 3-dimensional space. But, we need to remind us that these ants are living in the 1-dimensional space. The ant on the right side has no way to see the two charges moving apart from each other. This ant can only move on the circle, particularly the right arc of the circle, to measure the distance between the two charges. What this ant would find out is that the same force as indicated in the picture will seemly press the two charge closer to each other; and larger the charge, the stronger the force. Thus, the ant on the right side would claim that two charges attract each other according a law similar to the coulomb's law (the anti-coulomb's law?).