Monday, March 10, 2008

The "Big Three" Data Categories

Data analysis is an essential part of science and engineering. As a guide for tackling various data analysis tasks, I find it helpful to group data into three basic categories:
  • Multidimensional tabular data (e.g. tables of numbers and text).
  • Time series data (e.g. image, audio/video data).
  • Graphics oriented data (e.g. tree and network structured data).
All data analysis methods can be grouped into three categories, depending on which data category they target. For instance, wavelet transformation is a method for time series data, whereas MDS (multidimensional scaling) is a method for multidimensional data. Context-free grammars and text parsers are mainly tools for graphics oriented data.
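
To make the three categories concrete, here is a minimal Python sketch of what one small dataset of each kind might look like in memory. The sample values, the variable names, and the `count_edges` helper are all made up for illustration, and these are only one possible representation of each category.

```python
# Hypothetical sample data, one small dataset per category.

# Multidimensional tabular data: rows of numbers and text.
table = [
    {"name": "sample_a", "height": 1.2, "weight": 3.4},
    {"name": "sample_b", "height": 2.1, "weight": 0.7},
]

# Time series data: values ordered along a time axis (e.g. audio samples).
audio_samples = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]

# Graphics oriented data: a network stored as an adjacency mapping
# (node -> list of nodes it points to).
network = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}


def count_edges(adjacency):
    """Count the directed edges in an adjacency mapping."""
    return sum(len(neighbors) for neighbors in adjacency.values())


print(len(table), len(audio_samples), count_edges(network))
```

The point of the sketch is that each category calls for different operations: column-wise statistics make sense for the table, ordering along the time axis matters for the samples, and connectivity matters for the network.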

Furthermore, all software applications can also be grouped into three corresponding categories. A software package that supports multiple data categories (like the Java platform or Matlab) is either a generic tool that requires further software before end users can work with it, or it is hopelessly difficult to design, implement, understand, and use.

When I encounter a new data analysis problem, I first identify its category, and then I search for methods and software packages targeting that category. I would not, for instance, try to use MDS or SQL to analyze images or audio/video data.
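
This workflow can be sketched as a simple lookup in Python. The table below lists only the example methods named in this post (placing SQL under multidimensional data is my reading of the remark above), and the names `METHODS_BY_CATEGORY`, `candidate_methods`, and `targets` are hypothetical helpers, not part of any library.

```python
# Example methods per data category, taken from this post; far from exhaustive.
METHODS_BY_CATEGORY = {
    "multidimensional": ["MDS", "SQL"],
    "time series": ["wavelet transformation"],
    "graphics oriented": ["context-free grammar", "text parser"],
}


def candidate_methods(category):
    """Return the example methods that target the given data category."""
    if category not in METHODS_BY_CATEGORY:
        raise ValueError(f"unknown data category: {category!r}")
    return list(METHODS_BY_CATEGORY[category])


def targets(method, category):
    """Check whether a method is listed as targeting a data category."""
    return method in candidate_methods(category)


# Step 1: identify the category of the problem (here: audio data).
category = "time series"

# Step 2: search only among methods that target that category.
print(candidate_methods(category))

# MDS and SQL are not listed for time series data, so they are skipped.
print(targets("MDS", category), targets("SQL", category))
```

A real catalog would hold many more entries per category, but the idea stays the same: the category is decided first, and it narrows the search before any specific method or package is considered.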