Visualizing High Dimensional Data

Saturday, March 16, 2013

On the weighted Hamming distance

The Hamming distance is a simple distance function that imposes a metric structure on discrete, non-numerical data. It has been widely used to cluster or visualize discrete data. In VisuMap, the Hamming distance has so far been the only distance function for discrete data. In this note I'll introduce the weighted Hamming distance (WHD) as an extension of the Hamming distance. After a brief review of the original Hamming distance, I'll describe WHD and demonstrate its strength with two sample datasets.

Let's first consider the one-dimensional case. Let Ω be a set of data points, potentially with duplicates. Ω may represent, for instance, the hair color or blood type of a group of patients. The Hamming distance between two data points x and y is simply defined as: d(x, y) = 0 if x = y, and d(x, y) = 1 otherwise.

The weighted Hamming distance (WHD) is defined as:

where p(x) is the frequency of the value x in Ω, and |Ω| is the number of data points in Ω. w(x, y) is an extension of d(x, y) in the sense that d(x, y) < d(x', y') always implies w(x, y) < w(x', y'). On the other hand, w(,) is not a distance function in the strict mathematical sense, since it does not fulfill the requirement w(x, x) ≡ 0.

Notice that d(,) takes only the values 0 and 1, whereas w(,) can take any value from 1 to |Ω|. d(,) captures only information about equality, whereas w(,) also captures the frequency of each discrete value. In particular, for two points sharing a value x, w(,) yields a smaller distance if x is infrequent in Ω. In terms of our example with patients as data points, w(,) will ascribe greater closeness (i.e. a smaller distance) to a pair of patients if they share a rare property value, say a rare blood type. This property makes WHD focus more on rare property values.
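As a minimal sketch of the one-dimensional case, the Python snippet below implements d(,) together with one form of w(,) that is consistent with the properties listed above: w(x, y) = p(x) when x = y, with p(x) taken as the number of occurrences of x in Ω, and w(x, y) = |Ω| otherwise. This exact form, the helper name make_whd and the sample blood types are assumptions made for illustration only.

```python
from collections import Counter


def hamming(x, y):
    """Plain Hamming distance for a single discrete attribute."""
    return 0 if x == y else 1


def make_whd(omega):
    """Build a weighted Hamming distance for the data set omega.

    Assumed form (not taken from the original formula):
      w(x, y) = number of occurrences of x in omega   if x == y
      w(x, y) = len(omega)                             otherwise
    """
    counts = Counter(omega)
    size = len(omega)

    def whd(x, y):
        return counts[x] if x == y else size

    return whd


# Hypothetical blood types of six patients.
blood_types = ["A", "A", "A", "O", "O", "AB"]
whd = make_whd(blood_types)

rare_pair = whd("AB", "AB")    # shared rare value: smallest distance
common_pair = whd("A", "A")    # shared common value: larger distance
different = whd("A", "O")      # different values: largest distance
```

With these data the shared rare type AB gives a smaller value than the shared common type A, and any pair with different values gets the maximal value |Ω|.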

In the multidimensional case, the data points are assumed to come from a multidimensional space: x = (x₁, x₂, ..., x_n) ∈ Ω₁×Ω₂×...×Ω_n. Let y = (y₁, y₂, ..., y_n) be a second data point, and let w_k be the weighted Hamming distance between x_k and y_k. We then define the multidimensional WHD between x and y as:

Notice that the multidimensional WHD is, up to a constant factor, the harmonic mean of the weighted Hamming distances of the individual dimensions. The harmonic mean, like the ordinary arithmetic mean, yields a number between the minimal and maximal values in {w₁, w₂, ..., w_n}, except that the harmonic mean is normally shifted towards the minimal value. Using the harmonic mean instead of the sum of squares (as used by the Euclidean distance) biases w(,) towards the smaller values in the set {w₁, w₂, ..., w_n}. In terms of our example with patient data, this means that we ascribe more importance to those properties which show larger proximity between data points. For instance, if the hair colors are more or less randomly distributed among the patients, but some special blood types occur only rarely, then the blood type property will carry more weight in the harmonic mean.
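The harmonic-mean behaviour can be sketched in Python. The snippet assumes that the multidimensional WHD is the reciprocal of the sum of reciprocals, w = 1 / (1/w₁ + ... + 1/w_n), which equals the harmonic mean divided by n. The per-dimension form (count of the shared value if the values are equal, |Ω| otherwise), the function name make_multi_whd and the patient records are all hypothetical.

```python
from collections import Counter


def make_multi_whd(rows):
    """Build a multidimensional WHD over rows of categorical values.

    Assumptions (illustrative, not the original formula):
      per dimension  w_k = count of x_k in column k   if x_k == y_k
                     w_k = number of rows              otherwise
      combined       w   = 1 / sum_k (1 / w_k)
    """
    size = len(rows)
    column_counts = [Counter(column) for column in zip(*rows)]

    def whd(x, y):
        inverse_sum = 0.0
        for k, (xk, yk) in enumerate(zip(x, y)):
            wk = column_counts[k][xk] if xk == yk else size
            inverse_sum += 1.0 / wk
        return 1.0 / inverse_sum

    return whd


# Hypothetical patients: (hair color, blood type).
patients = [
    ("brown", "O"), ("black", "O"), ("blond", "O"),
    ("brown", "A"), ("black", "A"), ("blond", "A"),
    ("brown", "AB"), ("black", "AB"),
]
whd = make_multi_whd(patients)

share_rare = whd(patients[6], patients[7])    # both AB (2 of 8 patients)
share_common = whd(patients[0], patients[1])  # both O (3 of 8 patients)
share_nothing = whd(patients[0], patients[4]) # no shared value
```

Because the smallest per-dimension value dominates the reciprocal sum, the pair sharing the rare blood type ends up closest even though their hair colors differ.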

To see the difference between WHD and the normal Hamming distance, we first consider the Adult dataset from the UCI Machine Learning Repository, which collects census information about adults. For the sake of simplicity I use just two properties, education and race, to generate MDS maps. The following are the MDS maps generated with the CCA mapping algorithm using WHD and the normal Hamming distance, on the left and right side respectively:

In the above picture, the left map clearly shows three large clusters occupying the whole data space, whereas the right map only shows a cloud of discrete data points, without visual information about the cluster sizes.

As a second example we consider a dataset from haplotype analysis. This dataset contains about 2000 data points, each comprising the haplotypes of about 160 SNPs. The following picture shows the MDS maps generated with the tSNE algorithm using WHD and the normal Hamming distance, on the left and right side respectively:

Notice that the map on the left side reveals about 8 clusters of different sizes, whereas the map on the right side (generated with the Hamming distance) shows four more or less symmetrical clusters. The map generated with WHD shows more structure and complexity, but it is up to the scientist to verify the relevance of this extra structural information.

The weighted Hamming distance has been implemented in VisuMap 4.0 as a new metric for enumerated data. The above sample maps were generated with VisuMap version 4.0.891.

Friday, March 8, 2013

VisuMap 4.0 Released

We have just released VisuMap 4.0, together with a new price list for our software products and services. This release is a major upgrade from the previous release, v3.5 (released about two years ago). The main changes in this release include:

Upgraded the .Net Runtime library from version 2.0 to 4.0. Some new features in .Net 4.0 significantly simplified the internal implementation and most likely also improved performance and stability. For example, the support for generic covariant lists greatly sped up a lot of data processing.
Implemented new services to analyze tables with non-numerical data. For instance, we have introduced the weighted Hamming distance to geometrize multidimensional enumerated data. These features reflect our R&D in the past year, which has moved gradually towards the analysis of non-numerical data. We'll talk about this simple and interesting metric in later blog posts.
We have increased the prices of our products and services by 10% to 15%. This is the first price increase in the last two years.

For more information about VisuMap 4.0 please see the "What's new" section in the product's on-line help and our web site.

Wednesday, December 19, 2012

New VisuMap release with support for k-NN.

We have just released VisuMap version 3.5.888. Apart from the new special release number, this version offers the k-NN (k nearest neighbors) classification service. k-NN is a kind of model-less supervised machine learning algorithm: it uses the training data directly as the "model", whereas other classification methods, like SOM (self-organizing map), use a network trained with the training data as the model.
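To make the "training data as model" idea concrete, here is a minimal generic k-NN sketch in Python. It is not VisuMap's implementation; the function name knn_classify, the Euclidean metric and the toy training points are assumptions chosen for illustration.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Classify query by majority vote among its k nearest training points.

    training: list of (point, label) pairs, where point is a tuple of numbers.
    There is no fitting step: the labeled training data is the whole "model".
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled points forming two groups in the plane.
training = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
]

label_near_origin = knn_classify(training, (0.2, 0.3))
label_far_corner = knn_classify(training, (5.5, 5.2))
```

Each query is labeled purely by looking up its closest training points, which is why no separate model needs to be trained or stored.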

The following short video shows a simple scenario where k-NN is used to match the sphere surface with that of torus:

Saturday, October 6, 2012

Generalized Relational Perspective Map

This post describes a variation of the Relational Perspective Map (RPM) algorithm [1]. The new algorithm, called generalized RPM (gRPM), is more sound from the viewpoint of a simulated dynamical system, and in general produces more consistent maps. gRPM has been implemented in VisuMap and has been available since version 2.4. In the following I'll first review the RPM algorithm, then describe gRPM for a special case that employs the flat torus topology, and finally discuss possible variations and extensions of the algorithm.

As an MDS (multidimensional scaling) algorithm, RPM seeks to map a set of data points from a high-dimensional space to the surface of a torus, so that the following "energy" is minimal:

where δ_ij is the distance matrix of the data points in the high-dimensional space, and d_ij is the distance matrix of the data points on the torus. RPM uses the gradient-descent algorithm to minimize the energy E, so that RPM in effect simulates a multi-particle system that seeks a minimal energy configuration on the torus, directed by the following "force":

The negative sign in the above formula indicates that the "particles" exert a repulsive force on each other; the closer two "particles" are on the torus, the larger the repulsive force. The following picture illustrates the RPM algorithm:

Figure 1: RPM as a mapping algorithm.

In the initial RPM paper, the distance d_ij is defined as the minimal geodesic distance on the torus surface. This definition leads to a problem: when two particles are located close to opposite sides of the torus, the repulsive force between them becomes unstable, with alternating directions. The following diagram illustrates this case (where I have made the thickness of the torus infinitely small, so that the torus becomes in effect a ring):

Figure 2: RPM dynamics anomaly on a ring: the forces from A to B₊ and B_ have different directions, even though B₊ and B_ can be arbitrarily close to B.

In the above diagram, the repulsive force between A and B can abruptly change direction when the particle B moves from position B_ to B₊, because the repulsive force follows the minimal path, which changes from the upper half arc to the lower half arc. As will be shown later, this kind of discontinuity can lead to unwanted artifacts in the resulting RPM maps.

In order to fix the discontinuity problem described above, the gRPM algorithm extends the interaction between two particles to multiple paths between them. In the case of the ring topology, gRPM defines the interaction between two particles as the superposition of two forces: one along the upper arc and one along the lower arc in the above picture. By this definition, when two particles are at exactly opposite positions, A and B, on the ring, the two paths between them (i.e. the two arcs) have equal length, so that the two forces have equal absolute value but opposite directions. Thus, the two forces cancel each other out, and the two particles can stay stably at the positions A and B.

Formally, let's represent the ring by the numbers in the interval [0, w], where w is the length of the ring; let i and j be two data points mapped to the positions x and y in [0, w]. With this representation, the two end points 0 and w, as shown in the following picture, should be considered glued together:

Figure 3: Interval representation of a ring topology.

Then the force between i and j according to RPM will be:

The energy and force between i and j according to gRPM will be:

We can easily verify that F_ij, as a function of x and y, is not continuous at |x-y|=w/2, whereas F^*_ij is continuous and has the value 0 at |x-y|=w/2.
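The continuity claim can be checked numerically with a small Python sketch. It assumes an inverse-distance energy δ/d (the 1/d^p form with p = 1; other exponents should behave analogously), writes u = |x-y|, and takes the force as the derivative of the energy with respect to u. The function names and the sign convention are assumptions, not the original formulas.

```python
def rpm_force(u, w, delta=1.0):
    """dE/du for the single-path energy E = delta / min(u, w - u), 0 < u < w.

    At exactly u = w/2 the minimal path is ambiguous; this sketch
    arbitrarily picks the direct arc there.
    """
    if u <= w / 2:
        return -delta / u ** 2
    return delta / (w - u) ** 2


def grpm_force(u, w, delta=1.0):
    """dE*/du for the two-path energy E* = delta * (1/u + 1/(w - u))."""
    return delta * (-1.0 / u ** 2 + 1.0 / (w - u) ** 2)


w = 1.0      # length of the ring
eps = 1e-6   # small offset around the opposite position u = w/2

rpm_before = rpm_force(w / 2 - eps, w)   # about -4: pushes along one arc
rpm_after = rpm_force(w / 2 + eps, w)    # about +4: pushes along the other arc
grpm_middle = grpm_force(w / 2, w)       # the two arc contributions cancel
grpm_after = grpm_force(w / 2 + eps, w)  # stays close to zero
```

Under these assumptions the single-path force flips sign across u = w/2, while the two-path force passes smoothly through zero there.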

For a two-dimensional torus, there are 4 different paths connecting any two different points on the torus surface. The four paths are illustrated in the following picture:

Figure 4: Four paths between two points on a flat torus. Opposite edges of the rectangle should be considered glued together.

Let (x₀, x₁) and (y₀, y₁) be the coordinates of two data points i and j, d₀:=|x₀-y₀|, d₁:=|x₁-y₁|; then the energy and force between the two points are:

The constants w and h in the above equation are the width and height of the flat torus respectively.
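One natural reading of the four-path construction can be sketched in Python: each path is taken as a straight segment whose horizontal extent is either d₀ or w-d₀ and whose vertical extent is either d₁ or h-d₁, and the energy is the sum of δ/length over the four paths. This reading, the inverse-distance energy and the function names are assumptions for illustration, not the original formulas.

```python
import math


def grpm_torus_energy(d0, d1, w, h, delta=1.0):
    """Energy summed over four straight paths; requires (d0, d1) != (0, 0)."""
    total = 0.0
    for a in (d0, w - d0):        # horizontal extent of the path
        for b in (d1, h - d1):    # vertical extent of the path
            total += delta / math.hypot(a, b)
    return total


def grpm_torus_force(d0, d1, w, h, delta=1.0):
    """Gradient (dE/dd0, dE/dd1) of the four-path energy above."""
    f0 = f1 = 0.0
    # sa and sb are the derivatives of a and b with respect to d0 and d1.
    for a, sa in ((d0, 1.0), (w - d0, -1.0)):
        for b, sb in ((d1, 1.0), (h - d1, -1.0)):
            r3 = math.hypot(a, b) ** 3
            f0 -= delta * sa * a / r3
            f1 -= delta * sb * b / r3
    return f0, f1


w, h = 2.0, 1.0
# Horizontally opposite points: the horizontal component cancels.
f0_opposite, _ = grpm_torus_force(w / 2, 0.3, w, h)
# Fully opposite points: both components cancel.
f_center = grpm_torus_force(w / 2, h / 2, w, h)
# Very close points: the shortest path dominates and the force is repulsive.
f_close = grpm_torus_force(0.1, 0.1, w, h)
```

As on the ring, the contributions of the paired paths cancel at the opposite positions, so under this reading the force has no jump there.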

To compare gRPM with RPM, we have applied both algorithms to the sphere data set. The following pictures show the results:

Figure 5: Mapping 1000 data points sampled from a sphere surface to the flat torus: a) The original dataset displayed as a scatter plot in 3D space. b) The map generated by the RPM algorithm. c) The map generated by the gRPM algorithm.

Notice that there is a kind of framing effect in the map generated by the RPM algorithm: a high concentration of data points along the boundary of the two square fragments. The map generated by gRPM does not suffer from such a framing effect.

The key technique for moving from RPM to gRPM is to find a set of "conjugated" paths, so that the forces induced by them cancel each other out at the "discontinuous" configurations of RPM. With this technique we have worked out the gRPM algorithm for other relatively simple spaces (e.g. 2D smooth manifolds). In VisuMap we have implemented gRPM for the flat Klein bottle, flat sphere, flat real projective plane (all flat fundamental polygons), 3D sphere and the projective plane in the half-sphere model. The sets of conjugated paths for these manifolds comprise 2 to 8 paths.

Discussions:

The initial RPM algorithm defines the energy as 1/d^p_ij, where p is any positive number. Just for the sake of simplicity, we have here just considered the case p=0. The handling of other values of p should be analogous. In fact, for any smooth, monotonically increasing function h(x), we can use the function 1/h(d_ij) as the energy function for gRPM.

Notice that, for the ring topology, the two "conjugated" paths form a complete winding loop (see Figure 2). As an extension, we could require that the path pair forms two winding loops. As shown in the following diagram, two particles will then have minimal energy when they are at the same spot (that is, the same phase but a different loop). Thus, the repulsive force will effectively be turned into an attractive force.

In general, we can easily verify that, for the ring topology, the interaction is repulsive when the path pair forms an odd number of loops, and attractive when the path pair forms an even number of loops. This kind of winding number resembles the spin numbers of bosons and fermions in particle physics. It might be interesting to investigate gRPM for these more general cases.

[1] Visualization of high-dimensional data with relational perspective map, James X. Li, Information Visualization 2004; 3, 49-59.