Welcome! This class is an introduction to data cleaning, analysis, and visualization. We will walk you through the process as we analyze real-world datasets. Each day, we will spend the first 30 minutes introducing the day's concepts and the rest of the class on the lab exercises. We have written a daily walkthrough that you will read and program through in class, and we will be available to help.
This is our first time teaching this course, and we'll be learning as much as you will. Don't hesitate to ask us to change or improve something. We'll be grateful.
Prerequisites
We assume you have a working knowledge of Python (perhaps from 6.01) and are willing to write code. Most of the code you interact with will come with an example that you can modify. Hopefully you won't need to write too much custom code—unless you're inspired to write more!
We expect you to install several development-related Python modules and download the datasets we will be using in class. Instructions can be found under Day 0, in the Lectures and Labs section.
What We Will Teach
We will teach the basics of data analysis through concrete examples. All of your code will be written in Python. The schedule is as follows:
- Day 0 (today): setup
- Day 1: An end-to-end example getting you from a dataset found online to several plots of campaign contributions.
- Day 2: Lots of visualization examples, and practice going from data to chart.
- Day 3: Statistics basics, including t-tests, linear regression, and statistical significance. We'll use campaign finance and per-county health rankings.
- Day 4: Text processing on a large text corpus (the Enron email dataset) using tf-idf and cosine similarity.
- Day 5: Scaling up to process large datasets using Hadoop/MapReduce on a larger copy of the Enron dataset.
- Day 6: You tell us! Get into groups or work on your own to analyze a dataset of your choosing, and tell us a story!
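As a small preview of the Day 4 material, here is a minimal sketch of tf-idf weighting and cosine similarity in plain Python. It is not the course's lab code: the function names, the whitespace tokenizer, the particular tf-idf variant (relative term frequency times log inverse document frequency), and the three toy "emails" are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Assumption: naive lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Weight = (relative term frequency) * log(N / document frequency).
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy documents, invented for this example (not taken from the Enron data).
docs = [
    "meeting about energy prices",
    "energy prices rose again",
    "lunch on friday",
]
vectors = tf_idf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # shares "energy" and "prices"
print(cosine_similarity(vectors[0], vectors[2]))  # no terms in common
```

Documents that share distinctive terms get a similarity above zero, while documents with no terms in common score exactly zero. Terms that appear in every document receive a weight of zero under this variant, which is why common words contribute little to the comparison.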
What We Will Not Teach
- R. R is a wonderful data analysis, statistics, and plotting framework. We will not be using it because we can achieve all of our objectives in Python, and more MIT undergraduates know Python.
- Visualization using browser technology (canvas, SVG, D3, etc.) or in non-Python languages (Processing). These tools are very interesting, and many visualizations on the web (such as those from the New York Times) are built with them, but they are outside the scope of this class. We'll teach you how to visualize data in static charts. If this area interests you, the next step is to build interactive visualizations that the world can explore, and we can point you in the right direction.
GitHub
The course materials can also be found in this GitHub repository, which may be updated more frequently than the OCW site.