Real World Data

Dealing with it

A talk by Travis Swicegood / @tswicegood / #rwdata

Housekeeping

Where are we going?

  • Understanding
  • Strategies
  • Tools
  • Case Studies

Questions: Ask Them

Twitter!

@tswicegood | #rwdata

Slides are Online

Link at the end, so hold your horses

Texas Tribune

Story Time

What Is It?

Not Big Data

Normally…

Dirty

Unpredictable

Limited

Never Exactly What You Need

Unpleasant

Constrained

Quantifiable

  • 150 Representatives
  • 31 Senators
  • 254 Counties

Really Interesting

Strategies

Script Everything

Seriously, Everything

Keep Copies

Know Your Source

Know What Humans Touched

Know Everything

Document Everything

At least everything you use and learn

Embrace Change

Tools

Git

CSVKit

Tabula

Unix Philosophy

Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
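A minimal sketch of that idea in Python, assuming a CSV file with a header row: one small program that does a single job (keep the rows where a column matches a value) and talks only through text streams, so it can sit in a pipeline next to other tools. The function name filter_rows, the script name, and the column names in the usage comment are hypothetical and chosen only for illustration.

```python
import csv
import sys


def filter_rows(in_stream, out_stream, column, value):
    """Copy CSV rows whose `column` equals `value` from one text stream to another.

    Returns the number of data rows written (the header is not counted).
    """
    reader = csv.DictReader(in_stream)
    if reader.fieldnames is None:
        # Empty input: nothing to copy, not even a header.
        return 0
    writer = csv.DictWriter(
        out_stream, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[column] == value:
            writer.writerow(row)
            kept += 1
    return kept


# Hypothetical command-line use, reading stdin and writing stdout:
#   python filter_rows.py chamber House < members.csv > house.csv
if __name__ == "__main__" and len(sys.argv) == 3:
    filter_rows(sys.stdin, sys.stdout, sys.argv[1], sys.argv[2])
```

Because the script only reads standard input and writes standard output, its result can be piped straight into another text-oriented tool or redirected to a file that is then kept under version control.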

Python

NumPy

pandas

R

Chrome Web Tools

jQuery -> pyQuery

HTTP Proxies

OpenRefine

Overview

Data Stores

  • CouchDB
  • MongoDB
  • Postgres/PostGIS

Use Cases

Public Schools Explorer

8 Different Sources

Over 70 million rows

1 data source has over 700 columns

12 Steps to Import

State Employee Salaries

141 Sources

Mostly Provided Separately

Still Automating

State Prisoners

1 Source

On auto-pilot

Reservoir Map

JSON Data!

Built Relationship

  • Early Access
  • Provided Feedback

Talk to Your Sources

Questions?

@tswicegood | #rwdata

tswicegood.github.io/real-world-data/