I’m leading a pair of introductory data journalism workshops on Tuesday, September 4 and Monday, September 10 here in New York, along with Sha Hwang from Trulia, who will lead similar workshops simultaneously in San Francisco.

All of these workshops draw their examples from data made available through the White House Office of Science and Technology Policy’s Safety Data Initiative. Below are some helpful links that I’ll mention during the sessions.

  • Start by downloading and installing Google Refine.
  • My first example dataset is this list of mine accidents. The data series page, which is referenced in the metadata, includes the all-important definition file as well as the master list of all mines. I used the joining method outlined by Tony Hirst on his blog.
  • My second example dataset is this KML file of auto accidents.
  • Some useful articles on Google Refine techniques from Paul Bradshaw. H/T to Jonathan Hayter for pointing these out.
  • Some relevant O’Reilly books. Obviously, I work for O’Reilly, but I also learned to program with the famous “animal books.” These are great introductions, and you can download them as DRM-free e-books:
      • Introducing Regular Expressions: Regular expressions (regex), which let you extract data from formatted text, are one of the most powerful tools in data mining. Google Refine offers plenty of ways to use them, and they’re also a key feature of scripting languages like Python.
      • Learning Python: Python is my favorite scripting language. Once you’re familiar with it, you’ll find it easy to build small bits of software that can, say, scrape thousands of web pages in a few minutes or turn a spreadsheet into thousands of graphs. And it scales up from there: Python is one of the languages that Google uses to build some huge systems.
      • R in a Nutshell: R is the most widely used open-source statistical package, with free add-on libraries for every discipline. I got my start with Stata and found R difficult to learn, but once you’ve learned it you’ll find it to be a very powerful tool that can stand in for Python and full-scale database applications in many data-journalism projects.
  • Advanced resources: at some point you’ll want to use a database to store and analyze data. MySQL is the reigning king of open-source database software, and you can learn it with the aptly named Learning MySQL. For certain applications, MongoDB’s looser, schema-free document model is more useful, and you might want to take a look at MongoDB and Python, which outlines ways to use Python to fill and analyze a Mongo database.
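
To give a feel for what extracting data from formatted text with a regular expression looks like in Python, here is a minimal sketch. The sample lines, the field names (date, mine_id, injuries), and the helper extract_records are all made up for illustration; they are not taken from the workshop datasets, so adapt the pattern to whatever your own text looks like.

```python
import re

# Hypothetical lines of formatted text (assumption: not from the real datasets).
lines = [
    "2011-03-14 | Mine ID 4601234 | injuries: 2",
    "2011-07-02 | Mine ID 1500987 | injuries: 0",
    "this line does not follow the format",
]

# Named groups make each captured field easy to refer to later.
PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) \| "
    r"Mine ID (?P<mine_id>\d+) \| "
    r"injuries: (?P<injuries>\d+)"
)


def extract_records(rows):
    """Return one dict per row that matches PATTERN; skip rows that do not."""
    records = []
    for row in rows:
        match = PATTERN.search(row)
        if match is None:
            continue  # ignore lines that are not in the expected format
        record = match.groupdict()
        record["injuries"] = int(record["injuries"])  # convert the count to a number
        records.append(record)
    return records


records = extract_records(lines)
for record in records:
    print(record)
```

The mine ID is kept as a string on purpose, since identifiers often carry leading zeros that a numeric conversion would drop.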
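
The join between the accident list and the master list of mines can also be sketched in Python. This is not the Google Refine method from Tony Hirst's post, just the same idea of looking up rows by a shared key, written as a small script. The column names (mine_id, mine_name, state), the sample rows, and the helper join_on_key are assumptions for illustration; check the definition file for the real field names.

```python
import csv
import io

# Hypothetical stand-ins for the two files (assumption: invented columns and rows).
ACCIDENTS_CSV = """mine_id,accident_date,injuries
100,2011-03-14,2
200,2011-07-02,0
999,2011-09-30,1
"""

MINES_CSV = """mine_id,mine_name,state
100,Example Mine A,WV
200,Example Mine B,KY
"""


def join_on_key(left_rows, right_rows, key):
    """Add the columns of right_rows to each row of left_rows that shares `key`.

    Rows on the left with no match on the right are kept, and the extra
    columns are filled with empty strings.
    """
    lookup = {row[key]: row for row in right_rows}
    extra_columns = [c for c in right_rows[0] if c != key] if right_rows else []
    joined = []
    for row in left_rows:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for column in extra_columns:
            merged[column] = match.get(column, "")
        joined.append(merged)
    return joined


accidents = list(csv.DictReader(io.StringIO(ACCIDENTS_CSV)))
mines = list(csv.DictReader(io.StringIO(MINES_CSV)))
joined = join_on_key(accidents, mines, "mine_id")

for row in joined:
    print(row)
```

With real files, you would replace the io.StringIO wrappers with open() calls on the downloaded CSVs.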

Jon Bruner

AI, data, bots, and hardware person at O'Reilly Media