A Challenge to Analyze the World’s Most Interesting Data: The Department of Commerce Publishes its Datasets on Kaggle

Since opening up our public datasets platform in August, we’ve been amazed by the depth and breadth of projects our community has created, the thoughtful analyses shared, and the words of wisdom exchanged. This is why, when the Department of Commerce – “America’s Data Agency” – issued a call to the private sector to democratize data and promote data equality in September 2016, we responded. Since then we have been working with the DOC to bring what we see as some of the world’s most interesting data to you, our talented community.

In this blog post, we introduce some of the latest datasets made available on Kaggle through our work with the data scientists at the Department of Commerce. Together, we challenge you to explore innovation, creativity, and technological progress in the United States and to dig deeply into the stories of how Americans live and work. Thanks to the repository of code available on Kernels, you can quickly move from accessible data to reproducible insights.

We would love to see what you create, so share with us and the world. Authors of top kernels on Department of Commerce datasets will receive our newest Kaggle swag. If you download the data, let us know how you use it!


The United States Census Bureau is responsible for producing data about the American people and economy. Working with data scientists at the DOC, we have made two exciting Census datasets available: the 2014 American Community Survey and the Current Population Survey. You can learn more about the US population through these datasets than anywhere else.

The 2014 American Community Survey

The American Community Survey (ACS) is an ongoing survey that provides vital information on a yearly basis about our nation and its people. Information from the survey generates data that help determine how more than $400 billion in federal and state funds are distributed each year.

The best place to get started with the 2014 ACS is with the 2013 ACS, also published on Kaggle. Here you’ll find an incredibly rich repository of code and discussion. We encourage you to replicate and extend some of our favorite kernels created using this granular dataset about fascinating facets of Americans’ lives.

The relationship between work arrival times and income is explored in this kernel.

Some of the great analyses by Kagglers include:

Here are a few additional resources for working with the 2014 American Community Survey on Kaggle:

The Current Population Survey

The Current Population Survey (CPS) is one of the oldest, largest, and most well-recognized surveys in the United States. It is immensely important, providing information on many of the things that define us as individuals and as a society – our work, our earnings, and our education.

In this dataset, you can delve into a detailed snapshot of Americans’ lives including:

  • how many people were working and how many were laid off from their jobs;
  • household characteristics;
  • and details about government assistance programs.

This dataset, which was converted from fixed-width format to a much more accessible CSV format, includes a detailed data dictionary. You can also get started with geographical analyses as survey responses are recorded at the FIPS county level.
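To illustrate what a fixed-width to CSV conversion involves, here is a minimal sketch using only the Python standard library. The column names, offsets, and sample records are hypothetical; the real layout is defined by the dataset's data dictionary. Codes are kept as strings so that the leading zeros in FIPS codes survive.

```python
import csv
import io

# Hypothetical layout: (column name, start, end), 0-based with end exclusive.
# The real offsets come from the dataset's data dictionary.
LAYOUT = [
    ("state_fips", 0, 2),
    ("county_fips", 2, 5),
    ("age", 5, 7),
    ("employed", 7, 8),
]

# Made-up sample records, not real survey responses.
SAMPLE = ["06037341", "36061520"]


def fixed_width_to_rows(lines, layout):
    """Slice each fixed-width line into a dict keyed by column name."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        rows.append({name: line[start:end].strip() for name, start, end in layout})
    return rows


def rows_to_csv(rows, layout):
    """Serialize the rows as CSV text with a header row."""
    buffer = io.StringIO()
    fieldnames = [name for name, _, _ in layout]
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def county_key(row):
    """Join state and county codes into a five-character county FIPS key."""
    return row["state_fips"] + row["county_fips"]


rows = fixed_width_to_rows(SAMPLE, LAYOUT)
csv_text = rows_to_csv(rows, LAYOUT)
```

A key built by `county_key` can then be used to join survey rows to any county-level geography table that carries the same five-character code.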

Challenge conventional wisdom about the American people

The mission of the National Oceanic and Atmospheric Administration (NOAA) is to understand and predict changes in climate, weather, oceans, and coasts, to share that knowledge and information with others, and to conserve and manage coastal and marine ecosystems and resources. With global warming becoming one of our most pressing concerns as a species, analyzing our planet’s climate and weather data is of enormous value.

Global Historical Climatology Network

How has the climate of our planet changed over the past 100+ years? This dataset, compiled through the aggregation and analysis of many thousands of weather station records, permits the quantification of changes in the mean monthly temperature and precipitation for the earth’s surface. Gridded data for every month from the year 1880 to 2016 is available.

As an example of what you can do with this dataset, check out this kernel by Ed King showing the coldest and hottest months on record. We’ll leave it up to you to fill in everything in-between.

The hottest and coldest months since 1880 based on data from NOAA published on Kaggle.

Severe Weather Data Inventory

The Severe Weather Data Inventory (SWDI) is an integrated database of severe weather records for the United States. The records in SWDI come from a variety of sources in the NCDC archive and cover a number of weather phenomena. This extract from 2015 covers hail detections, including the probability of a weather event as well as the size and severity of hail – all of which help in assessing potential damage to property and injury to people.

There are nearly 11 million hailstorm events on record in this expansive dataset which you can use to understand:

  • how often damaging storms occur;
  • where these events happen geographically;
  • and what statistical and geospatial techniques can reveal patterns in the storms.
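As one possible starting point for the questions above, the sketch below counts damaging hail events by month and by one-degree latitude/longitude cell. The field names, the sample records, and the one-inch "damaging" threshold are all assumptions made for illustration; check the dataset's own columns and documentation before adapting it.

```python
import math
from collections import Counter

# Made-up records; the field names are hypothetical, not the dataset's schema.
EVENTS = [
    {"month": 4, "lat": 35.2, "lon": -97.4, "max_size_in": 1.75},
    {"month": 5, "lat": 35.9, "lon": -97.1, "max_size_in": 0.50},
    {"month": 5, "lat": 32.8, "lon": -96.8, "max_size_in": 2.00},
    {"month": 6, "lat": 41.3, "lon": -95.9, "max_size_in": 1.00},
]

# Assumed cutoff for "damaging" hail, in inches.
DAMAGING_SIZE_IN = 1.0


def damaging(events, threshold=DAMAGING_SIZE_IN):
    """Keep only events whose hail size meets the threshold."""
    return [e for e in events if e["max_size_in"] >= threshold]


def count_by_month(events):
    """How often events occur in each calendar month."""
    return Counter(e["month"] for e in events)


def count_by_grid_cell(events, cell_deg=1.0):
    """Bin events into square lat/lon cells, keyed by floored cell indices."""
    return Counter(
        (math.floor(e["lat"] / cell_deg), math.floor(e["lon"] / cell_deg))
        for e in events
    )


monthly = count_by_month(damaging(EVENTS))
cells = count_by_grid_cell(EVENTS)
```

Simple grid binning like this is only a first pass; with millions of rows, the same counts are usually computed with a dataframe library or a spatial index.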

Study over 100 years of global weather data

The United States Patent and Trademark Office (USPTO) is the federal agency for granting United States patents and registering trademarks. The vitality of the US economy depends directly on effective mechanisms that protect new ideas and investments in innovation and creativity, and the datasets presented on Kaggle offer data scientists an opportunity to analyze themes underlying technological progress.

Patent Grant Full Text

Every Tuesday, the USPTO issues approximately 6,000 patent grants and posts the full text of the patents online. These patent grant documents contain much of the supporting detail for a given patent. From this dataset published on Kaggle, you can track and compare trends in innovation across industries.

Analyze this library of inventions and innovations. Get started by forking this Python kernel.

While the details behind many advances are often closely guarded by their authors, the full text of the patent grants made available in this dataset presents a unique opportunity to learn more about the research and techniques that have gone into improving our daily lives.

Patent Assignment Daily

Each day, the US Patent and Trademark Office (USPTO) records patent assignments (changes in ownership). These assignments can be used to track chain-of-ownership for patents and patent applications.

In this dataset, you can, for example:

  • examine where hotbeds of innovation are located geographically in the United States;
  • use information from patent titles to pick a good name for your next invention;
  • find out how many patents Google has been awarded in this timeframe.
