by Puripant Ruchikachorn
“Big Data” is often written with capital letters or in quotation marks. The styling conveys not only the sheer amount of data made available by ever more convenient acquisition but also a slight disbelief in the relevance of data science. Big data is real, but its usual definition as a very large collection of samples may not seem applicable to most people, unless you are an employee of a growing internet company or of the National Security Agency (NSA). There is another aspect of big data: we have not only more samples but also many more attributes per sample to analyze. Advances in computing speed and novel techniques allow us to “see” a multidimensional dataset and extract insights from it.
These sample attributes, typically displayed as columns of a data table (think, for example, of an Excel spreadsheet), are often called dimensions. This may confuse or even scare a lot of people; our mortal world is three-dimensional! That is true, but only for physical positions. Objects also have properties associated with time, color, transparency, smell, and many others that are quite subjective. Basically, dimensionality in data science refers to the number of selected attributes of a sample. For example, a car easily has at least four attributes: say, year, acceleration, weight, and country of origin.
As you can imagine, a four-dimensional (4D) dataset can be embedded in a simple two-dimensional (2D) scatterplot: the first attribute is assigned to horizontal position, the second to vertical position, the third to circle radius, and the fourth to color hue. The chart above uses this “encoding” and shows a real dataset of 392 car models from the period 1970–1983. (Note that acceleration here is measured in seconds, so the lower the number, the faster the car.) Pretty trivial. We can also use other visual variables such as color saturation, shape, and texture for, say, miles per gallon (MPG), the number of cylinders, and horsepower, respectively. What if we have a hundred dimensions? Actually, that’s not uncommon!
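The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `rescale` and `encode` are hypothetical, the three rows in `toy_cars` are made up for the example (they are not taken from the 392-car dataset), and the hue attribute is assumed to be categorical.

```python
def rescale(value, lo, hi):
    """Map value linearly from [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo) if hi != lo else 0.5


def encode(rows, x_key, y_key, radius_key, hue_key,
           min_radius=2.0, max_radius=10.0):
    """Turn each row into a mark with x, y, radius, and hue channels."""
    ranges = {
        key: (min(r[key] for r in rows), max(r[key] for r in rows))
        for key in (x_key, y_key, radius_key)
    }
    # Assumption: the hue attribute is categorical, so each category
    # gets its own evenly spaced hue in [0, 1).
    categories = sorted({r[hue_key] for r in rows})
    marks = []
    for r in rows:
        size = rescale(r[radius_key], *ranges[radius_key])
        marks.append({
            "x": rescale(r[x_key], *ranges[x_key]),
            "y": rescale(r[y_key], *ranges[y_key]),
            "radius": min_radius + size * (max_radius - min_radius),
            "hue": categories.index(r[hue_key]) / len(categories),
        })
    return marks


# Made-up rows for illustration only, not real measurements.
toy_cars = [
    {"year": 1970, "acceleration": 12.0, "weight": 3500, "origin": "USA"},
    {"year": 1976, "acceleration": 16.0, "weight": 2500, "origin": "Europe"},
    {"year": 1982, "acceleration": 20.0, "weight": 2000, "origin": "Japan"},
]
marks = encode(toy_cars, "year", "acceleration", "weight", "origin")
```

Each mark now carries four visual channels for four data attributes, and any plotting library could draw it as a colored circle. Every further attribute needs yet another channel, which is why this approach runs out of steam long before a hundred dimensions.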
There are tools for an arbitrary number of dimensions, such as the scatterplot matrix and the parallel coordinates plot. The former basically extends a 2D chart to multidimensional space by arranging the charts of all dimension pairs in a grid. The latter requires more radical thinking. Instead of displaying dimensions as orthogonal axes, parallel coordinates place the axes side by side, in parallel as the name suggests. Each data point then becomes a polyline, i.e., a connected series of line segments between adjacent axes. In theory, this display can support an unlimited number of dimensions, so it provides a good graphical overview of multidimensional data. Below is a plot of the same dataset of 4D car properties.
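To make the two ideas concrete, here is a minimal Python sketch. It assumes every dimension is numeric (the country of origin is replaced by a hypothetical integer code), and the names `scatterplot_matrix_pairs` and `to_polylines` as well as the two rows are invented for illustration.

```python
from itertools import combinations


def scatterplot_matrix_pairs(dimensions):
    """List the distinct dimension pairs a scatterplot matrix has to draw."""
    return list(combinations(dimensions, 2))


def to_polylines(rows, dimensions):
    """Convert each row into a polyline across parallel axes.

    Axis i sits at horizontal position i; the height on each axis is the
    value rescaled to [0, 1] over that dimension's own range.
    """
    ranges = {
        d: (min(r[d] for r in rows), max(r[d] for r in rows))
        for d in dimensions
    }
    polylines = []
    for r in rows:
        line = []
        for i, d in enumerate(dimensions):
            lo, hi = ranges[d]
            height = (r[d] - lo) / (hi - lo) if hi != lo else 0.5
            line.append((i, height))
        polylines.append(line)
    return polylines


# Made-up rows; "origin" is a hypothetical integer code for the country.
dimensions = ["year", "acceleration", "weight", "origin"]
rows = [
    {"year": 1970, "acceleration": 12.0, "weight": 3500, "origin": 0},
    {"year": 1982, "acceleration": 20.0, "weight": 2000, "origin": 2},
]
pairs = scatterplot_matrix_pairs(dimensions)
polylines = to_polylines(rows, dimensions)
```

With four dimensions the scatterplot matrix needs six distinct pairwise charts, and that count grows quadratically, whereas parallel coordinates add only one axis per extra dimension.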
A visualization is often used to show data insights, insights which seem to come automatically or even magically to the minds of analytical geniuses. Yet even such analysts use tools to understand messy arrays of numbers, and the visualization outputs are only the last step, the end product, of their analytical process.
Statistics is an important tool. It aggregates numbers, confirms or rejects assumptions, and even suggests causality. However, similar to extending a square to a cube, a high-dimensional space holds far more volume than its low-dimensional counterpart, so data points become much sparser as the number of dimensions grows. This so-called curse of dimensionality makes statistical significance harder to achieve in high-dimensional space.
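One simple way to put numbers on this sparsity is to look at a unit hypercube and ask how much of it a smaller sub-cube covers. The sketch below is only an illustration; the function names are hypothetical and the unit-cube setting is an assumption chosen for simplicity.

```python
def covered_fraction(side, dimensions):
    """Fraction of a unit hypercube covered by a sub-cube with the given side."""
    return side ** dimensions


def side_needed(fraction, dimensions):
    """Side length a sub-cube needs in order to cover the given volume fraction."""
    return fraction ** (1.0 / dimensions)


for d in (1, 2, 3, 10, 100):
    print(d, covered_fraction(0.5, d), side_needed(0.01, d))
```

A sub-cube spanning half of every axis covers a quarter of a square and an eighth of a cube, and the fraction keeps halving with each added dimension. Conversely, in 100 dimensions a sub-cube must span more than 95% of every axis just to contain 1% of the volume, so any "local" neighborhood of a data point is either nearly empty or not local at all.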
Not only can a visualization present data to show specific messages, it can also be used to analyze data. After we train ourselves, we can “see” things through a multidimensional visualization that an analysis of subsets of lower dimensions cannot provide. Humans can spot trends, clusters, and anomalies which can be a part of a scientific process.
As shown above, the change in colors of car production year from 1970 in blue to 1983 in red suggests a trend toward lighter but not necessarily faster cars. The topmost cluster of lines on the country-of-origin axis represents Japan, which manufactured only lightweight cars, unlike the bottom group of American models. Counter-intuitively, weight and acceleration time seem inversely related. They are indeed negatively correlated, but because of another feature that is not shown here: the number of cylinders, which increases weight but also horsepower, tending to decrease acceleration time. This illustrates that correlation does not imply causation, and it also shows the limitation of analysis in lower-dimensional space.
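The role of the hidden dimension can be illustrated with a small synthetic example in Python. All numbers below are invented for the purpose; they are not drawn from the car dataset. The toy model assumes that each cylinder adds weight and, through extra horsepower, shortens acceleration time, while extra weight at a fixed cylinder count lengthens it.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic toy model (an assumption, not real data).
toy = []
for cylinders in (4, 6, 8):
    for extra in (-200, 0, 200):
        weight = 500 * cylinders + extra
        seconds = 25 - 1.5 * cylinders + 0.002 * extra
        toy.append((cylinders, weight, seconds))

# Correlation over all cars, ignoring the number of cylinders.
overall = pearson([w for _, w, _ in toy], [s for _, _, s in toy])

# Correlation within each group of cars sharing a cylinder count.
within = {
    c: pearson([w for k, w, _ in toy if k == c],
               [s for k, _, s in toy if k == c])
    for c in (4, 6, 8)
}
```

In this toy model `overall` is strongly negative, while every value in `within` is positive: once the number of cylinders is held fixed, heavier cars are slower, as intuition suggests. The sign flip only becomes visible when the third dimension is brought into the analysis.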
A subfield of visualization, aptly named visual analytics, uses visual displays and interactions (and often real-time visual feedback) to augment analytical thinking. Different views of the same dataset can further reveal important features. For example, swapping the axes of year and weight would bring year and country of origin next to each other and reveal the relationship between the two dimensions, along with Japan’s emergence as a major automobile manufacturer.
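Why axis order matters can be shown with a short sketch: in a parallel coordinates plot, only neighboring axes are joined by line segments, so only those pairs can be compared directly. The starting order below is an assumption (the article does not state the order used in its plot), and the helper names are hypothetical.

```python
def swap_axes(order, a, b):
    """Return a copy of the axis order with axes a and b exchanged."""
    new_order = list(order)
    i, j = new_order.index(a), new_order.index(b)
    new_order[i], new_order[j] = new_order[j], new_order[i]
    return new_order


def adjacent_pairs(order):
    """Pairs of neighboring axes, i.e. the relationships drawn as segments."""
    return list(zip(order, order[1:]))


# Assumed initial axis order, for illustration only.
initial = ["year", "acceleration", "weight", "origin"]
swapped = swap_axes(initial, "year", "weight")

before = adjacent_pairs(initial)
after = adjacent_pairs(swapped)
```

Under this assumed order, year and origin are not neighbors at first, so their relationship is hidden behind two intermediate axes. After the swap, the pair `("year", "origin")` appears among the adjacent pairs. An interactive tool that lets the analyst drag axes around performs exactly this kind of reordering.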
Still, there is more work to do to understand the high-dimensional world that is already here and with which we are still playing catch-up. Right now, we still need humans in the analytical loop, especially those with domain-specific expertise in fields such as biology, and thus we need a broader understanding of multidimensional visualization among them. Many techniques are already available, such as the parallel coordinates plot, which has become a standard in the visualization community, but they often get a blank look from people outside the field. The field of visualization has many powerful tools that have yet to become popular. As a researcher in visualization, I’m keen to ignite public interest in visual literacy and, ultimately, increased adoption of powerful visualizations as analytical tools.
Puripant Ruchikachorn is a 2010 fellow of the Fulbright Science & Technology Award, from Thailand, and a PhD Candidate in Computer Science at Stony Brook University.