Visualizing High Dimensional Data: August 2007

When we analyze a high dimensional dataset the first question is often:

Does the dataset contain useful information at all?

A simple way to answer this questions is as follows: We first construct a so called null-dataset that contains 100 data points; each data point is a 100 dimensional row vector of the following table:

1, 0, 0, ..., 0

0, 1, 0, ..., 0
0, 0, 1,...., 0

...

0, 0, 0,..., 1

Since the Euclidean distance between any two points from the null-dataset is the constant sqrt(2), the distance matrix contains no information about the data points in the sense that the distance matrix does not single out any group of points with special meaning.

Now for a new high dimensional dataset, we can simply apply a collection of MDS mapping methods to the dataset and to the null-dataset with the same configuration settings. If the MDS maps of the new dataset resemble those of the null-dataset, we can expect that the new dataset contains no meaningful information.

The following pictures show the 2D MDS maps of the null-dataset with 100 data points created with 4 MDS algorithms (i.e. RPM, Sammon, CCA and SMACOF) available in VisuMap:

We notice that three of these MDS algorithms resulted in similar disk alike maps, whereas the RPM algorithm produced a homogeneously distributed map (the first map).

In order to understand the behavior of MDS algorithms it is helpful to see how they react to injection of information by means of broken symmetry between the data points. For this purpose I have altered the null-dataset slightly be multiplying the vector of the first data point by the factor 1.05. This change singles out the first data point as it has slightly larger distance to other points. The following pictures are the corresponding MDS maps of dataset:

The first data point is shown as a red spot in above pictures. We can see that 3 of our MDS algorithms singled out the first point as an out-lier, whereas the RPM method ignored such alteration in the dataset. Similarly, we can alter the null-dataset by multiplying a factor 0.95 to the first data point. The following pictures show the corresponding MDS maps:

We can see that all 4 MDS mapping algorithms singled the first data point by moving it to the center of these maps.

In order to test our method for a real dataset I have created a dataset with ca. 500 data points. Each data point contains 100 daily price changes in percentage. The following pictures show its MDS maps created with the same set of configuration settings used for the null-dataset:

We can notice, more or less, some resemblance between above pictures and those of the null-dataset. This means, that the distance matrix probably doesn't carry much information about the stocks; and we probably won't be able to extract much information from the dataset using methods based on Euclidean distance. For example, it won't make sense to apply k-mean clustering or neural network algorithms directly on this dataset.

It should be pointed out here that above statments only apply to the Euclidean distance. If we use other distance metric to measure to distance between the data points (for instance use the correlation as a kind of distance) we might be able to extract useful information.

In general, the null-dataset method described above only applies to Euclidean distance. If we explore the dataset as a non-euclidean space, we need to adapt the null-dataset to that non-euclidean space.

Today we have released VisuMap version 2.5. This version has undergone major extensions and changes in scripting & pluging interfaces in order to accommodate more advanced plugin applications (e. g. the upcoming VisFinance plugin for finance service).

In this release we have not changed the formats of data files nor the configuration file, but plugins and scripts developed for version 2.4 or older might need to be upgraded to the new interface.

This release also includes many bug fixes, enhancements in user interfaces and performance. For instance, double-mouse-clicking has now been consistently linked to the task to switch the mode of a window. In the main window, double-clicking the main map now switches the main map between editing and shifting mode, instead of starting the RPM mapping algorithm as in the previous version.

In general, data drilling features (i.e. PCA, Data Details and Diagram view) now work in a more consistent and integrated way. For instance, you can now create a new dataset for selected data in almost all views with few mouse clicks.

Also, we have a new price list for our products and services.

Visualizing High Dimensional Data

Friday, August 31, 2007

The null-dataset

Thursday, August 30, 2007

VisFinance 1.2 Released

Wednesday, August 22, 2007

VisuMap 2.5 Released.

About Me

Blog Archive

Tweet