Friday, August 31, 2007

The null-dataset

When we analyze a high dimensional dataset the first question is often:

Does the dataset contain useful information at all?

A simple way to answer this questions is as follows: We first construct a so called null-dataset that contains 100 data points; each data point is a 100 dimensional row vector of the following table:

1, 0, 0, ..., 0
0, 1, 0, ..., 0
0, 0, 1,...., 0
...
0, 0, 0,..., 1

Since the Euclidean distance between any two points from the null-dataset is the constant sqrt(2), the distance matrix contains no information about the data points in the sense that the distance matrix does not single out any group of points with special meaning.

Now for a new high dimensional dataset, we can simply apply a collection of MDS mapping methods to the dataset and to the null-dataset with the same configuration settings. If the MDS maps of the new dataset resemble those of the null-dataset, we can expect that the new dataset contains no meaningful information.

The following pictures show the 2D MDS maps of the null-dataset with 100 data points created with 4 MDS algorithms (i.e. RPM, Sammon, CCA and SMACOF) available in VisuMap:





We notice that three of these MDS algorithms resulted in similar disk alike maps, whereas the RPM algorithm produced a homogeneously distributed map (the first map).

In order to understand the behavior of MDS algorithms it is helpful to see how they react to injection of information by means of broken symmetry between the data points. For this purpose I have altered the null-dataset slightly be multiplying the vector of the first data point by the factor 1.05. This change singles out the first data point as it has slightly larger distance to other points. The following pictures are the corresponding MDS maps of dataset:




The first data point is shown as a red spot in above pictures. We can see that 3 of our MDS algorithms singled out the first point as an out-lier, whereas the RPM method ignored such alteration in the dataset. Similarly, we can alter the null-dataset by multiplying a factor 0.95 to the first data point. The following pictures show the corresponding MDS maps:



We can see that all 4 MDS mapping algorithms singled the first data point by moving it to the center of these maps.

In order to test our method for a real dataset I have created a dataset with ca. 500 data points. Each data point contains 100 daily price changes in percentage. The following pictures show its MDS maps created with the same set of configuration settings used for the null-dataset:




We can notice, more or less, some resemblance between above pictures and those of the null-dataset. This means, that the distance matrix probably doesn't carry much information about the stocks; and we probably won't be able to extract much information from the dataset using methods based on Euclidean distance. For example, it won't make sense to apply k-mean clustering or neural network algorithms directly on this dataset.

It should be pointed out here that above statments only apply to the Euclidean distance. If we use other distance metric to measure to distance between the data points (for instance use the correlation as a kind of distance) we might be able to extract useful information.

In general, the null-dataset method described above only applies to Euclidean distance. If we explore the dataset as a non-euclidean space, we need to adapt the null-dataset to that non-euclidean space.






No comments: