Exploring the Structure of High-Dimensional Data with HyperTools
The datasets we encounter as scientists, analysts, and data nerds are increasingly complex. Much of machine learning is focused on extracting meaning from complex data. However, there is still a place for us lowly humans: the human visual system is phenomenal at detecting complex structure and discovering subtle patterns hidden in massive amounts of data. Every second that our eyes are open, countless data points (in the form of light patterns hitting our retinas) are pouring into visual areas of our brain. And yet, remarkably, we have no problem at all recognizing a neat looking shell on a beach, or our friend’s face in a large crowd. Our brains are “unsupervised pattern discovery aficionados.”
On the other hand, there is at least one major drawback to relying on our visual systems to extract meaning from the world around us: we are essentially capped at perceiving just 3 dimensions at a time, and many datasets we encounter today are higher dimensional.
So, the question of the hour is: how can we harness the incredible pattern-recognition superpowers of our brains to visualize complex and high-dimensional datasets?
In comes dimensionality reduction, stage right. Dimensionality reduction is just what it sounds like– transforming a high-dimensional dataset into a lower-dimensional dataset. For example, take this UCI ML dataset on Kaggle comprising observations about mushrooms, organized as a big matrix. Each row is comprised of a bunch of features of the mushroom, like cap size, cap shape, cap color, odor etc. The simplest way to do dimensionality reduction might be to simply ignore some of the features (e.g. pick your favorite three—say size, shape, and color—and ignore everything else). However, this is problematic if the features you drop contained valuable diagnostic information (e.g. whether the mushrooms are poisonous).
A more sophisticated approach is to reduce the dimensionality of the dataset by only considering its principal components, or the combinations of features that explains the most variance in the dataset. Using a technique called principal components analysis (or PCA), we can reduced the dimensionality of a dataset, while preserving as much of its precious variance as possible. The key intuition is that we can create a new set of (a smaller number of) features, where each of the new features is some combination of the old features. For example, one of these new features might reflect a mix of shape and color, and another might reflect a mix of size and poisonousness. In general, each new feature will be constructed from a weighted mix of the original features.
Below is a figure to help with the intuition. Imagine that you had a 3 dimensional dataset (left), and you wanted to reduce it to a 2 dimensional dataset (right). PCA finds the principal axes in the original 3D space where the variance between points is the highest. Once we identify the two axes that explain the most variance (the black lines in the left panel), we can re-plot the data along just those axes, as shown on the right. Our 3D dataset is now 2D. Here we have chosen a low-dimensional example so we could visualize what is happening. However, this technique can be applied in the same way to higher-dimensional datasets.
We created the
HyperTools package to facilitate these sorts of dimensionality reduction-based visual explorations of high-dimensional data. The basic pipeline is to feed in a high-dimensional dataset (or a series of high-dimensional datasets) and, in a single function call, reduce the dimensionality of the dataset(s) and create a plot. The package is built atop many familiar friends, including
HyperTools is designed with ease of use as a primary objective. We highlight two example use cases below.
First, let’s explore the mushrooms dataset we referenced above. We start by importing the relevant libraries:
import pandas as pd import hypertools as hyp
and then we read in our data into a
data = pd.read_csv('../input/mushrooms.csv') data.head()
Each row of the DataFrame corresponds to a mushroom observation, and each column reflects a descriptive feature of the mushroom (only some of the rows and columns are shown above). Now let’s plot the high-dimensional data in a low dimensional space by passing it to
HyperTools. To handle text columns,
HyperTools will first convert each text column into a series of binary ‘dummy’ variables before performing the dimensionality reduction. For example, if the ‘cap size’ column contained ‘big’ and ‘small’ labels, this single column would be turned into two binary columns: one for ‘big’ and one for ‘small’, where 1s represents the presence of that feature and 0s represents the absence (for more on this, see the documentation for the
get_dummies function in
In plotting the DataFrame, we are effectively creating a three-dimensional “mushroom space,” where mushrooms that exhibit similar features appear as nearby dots, and mushrooms that exhibit different features appear as more distant dots. By visualizing the DataFrame in this way, it becomes immediately clear that there are multiple clusters in the data. In other words, all combinations of mushroom features are not equally likely, but rather certain combinations of features tend to go together. To better understand this space, we can color each point according to some feature in the data that we are interested in knowing more about. For example, let’s color the points according to whether the mushrooms are (p)oisonous or (e)dible (the
hyp.plot(data,'o', group=class_labels, legend=list(set(class_labels)))
Visualizing the data in this way highlights that mushrooms’ poisonousness appears stable within each cluster (e.g. mushrooms that have similar features), but varies across clusters. In addition, it looks like there are a number of distinct clusters that are poisonous/edible. We can explore this further by using the ‘cluster’ feature of
HyperTools, which colors the observations using k-means clustering. In the description of the dataset, it was noted that there were 23 different types of mushrooms represented in this dataset, so we’ll set the
n_clusters parameter to 23:
hyp.plot(data, 'o', n_clusters=23)
To gain access to the cluster labels, the clustering tool may be called directly using
hyp.tools.cluster, and the resulting labels may then be passed to
cluster_labels = hyp.tools.cluster(data, n_clusters=23) hyp.plot(data, group=cluster_labels)
HyperTools uses PCA to do dimensionality reduction, but with a few additional lines of code we can use other dimensionality reduction methods by directly calling the relevant functions from
sklearn. For example, we can use t-SNE to reduce the dimensionality of the data using:
from sklearn.manifold import TSNE TSNE_model = TSNE(n_components=3) reduced_data_TSNE = TSNE_model.fit_transform(hyp.tools.df2mat(data)) hyp.plot(reduced_data_TSNE,'o', group=class_labels, legend=list(set(class_labels)))
Different dimensionality reduction methods highlight or preserve different aspects of the data. A repository containing additional examples (including different dimensionality reduction methods) may be found here.
The data expedition above provides one example of how the geometric structure of data may be revealed through dimensionality reduction and visualization. The observations in the mushrooms dataset formed distinct clusters, which we identified using
HyperTools. Explorations and visualizations like this could help guide analysis decisions (e.g. whether to use a particular type of classifier to discriminate poisonous vs. edible mushrooms). If you’d like to play around with HyperTools and the mushrooms dataset, check out and fork this Kaggle Kernel!
Whereas the mushrooms dataset comprises static observations, here we will take a look at some global temperature data, which will showcase how
HyperTools may be used to visualize timeseries data using dynamic trajectories.
This next dataset is made up of monthly temperature recordings from a sample of 20 global cities over the 138 year interval ranging from 1875–2013. To prepare this dataset for analysis with
HyperTools, we created a time by cities matrix, where each row is a temperature recording for subsequent months, and each column is the temperature value for a different city. You can replicate this demo by using the Berkeley Earth Climate Change dataset on Kaggle or by cloning this GitHub repo. To visualize temperature changes over time, we will use
HyperTools to reduce the dimensionality of the data, and then plot the temperature changes over time as a line:
Well that just looks like a hot mess, now doesn’t it? However, we promise there is structure in there– so let’s find it! Because each city is in a different location, the mean and variance of its temperature timeseries may be higher or lower than the other cities. This will in turn affect how much that city is weighted when dimensionality reduction is performed. To normalize the contribution of each city to the plot, we can set the
normalize flag (default value:
normalize='across' HyperTools incorporates a number of useful normalization options, which you can read more about here.
Now we’re getting somewhere! Rotating the plot with the mouse reveals an interesting shape to this dataset. To help highlight the structure and understand how it changes over time, we can color the lines by year, where more red lines indicates early and more blue lines indicate later timepoints:
hyp.plot(temps, normalize='across', group=years.flatten(), palette='RdBu_r')
Coloring the lines has now revealed two key structural aspects of the data. First, there is a systematic shift from blue to red, indicating a systematic change in the pattern of global temperatures over the years reflected in the dataset. Second, within each year (color), there is a cyclical pattern, reflecting seasonal changes in the temperature patterns. We can also visualize these two phenomena using a two dimensional plot:
hyp.plot(temps, normalize='across', group=years.flatten(), palette='RdBu_r', ndims=2)
Now, for the grand finale. In addition to creating static plots,
HyperTools can also create animated plots, which can sometimes reveal additional patterns in the data. To create an animated plot, simply pass
hyp.plot when visualizing timeseries data. If you also pass
chemtrails=True, a low-opacity trace of the data will remain in the plot:
hyp.plot(temps, normalize='across', animate=True, chemtrails=True)
That pleasant feeling you get from looking at the animation is called “global warming.”
This concludes our exploration of climate and mushroom data with
HyperTools. For more, please visit the project’s GitHub repository, readthedocs site, a paper we wrote, or our demo notebooks.
Andrew is a Cognitive Neuroscientist in the Contextual Dynamics Laboratory. His postdoctoral work integrates ideas from basic learning and memory research with computational techniques used in data science to optimize learning in natural educational settings, like the classroom or online. Additionally, he develops open-source software for data visualization, research and education.
The Contextual Dynamics Lab at Dartmouth College uses computational models and brain recordings to understand how we extract information from the world around us.
InnoValeur | Data Science | Smart Data | Machine Learning | AI