How to Wrangle Huge Amounts of Data

I used MongoDB, Ruby, and Google Fusion Tables to make a map of long, early-morning taxi trips.

The map shows the start points of 5,533 NYC taxi trips that were at least 10 miles long and began between 4 a.m. and 6 a.m. during one week in March 2009. The data come from the NYC Taxi & Limousine Commission.

I imported the TLC's data into MongoDB with mongoimport and then installed the Mongo Ruby gem. Working from a few of the driver's documentation pages, I ran a bunch of find queries against the data to see what I could get, just using puts to print the results to the screen.
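The import step is a one-liner on the command line, something like `mongoimport --db taxi --collection trips --type csv --headerline --file trip_data.csv` (the database, collection, and file names here are my own placeholders). From there, a first poke at the data in Ruby might look like this sketch, which assumes the current mongo gem API and a TLC-style trip_distance field:

```ruby
require 'mongo'

# Connect to a local MongoDB instance. The 'taxi' database and
# 'trips' collection names are assumptions matching the import step.
client = Mongo::Client.new(['127.0.0.1:27017'], database: 'taxi')
trips  = client[:trips]

# A first look: five trips of 10+ miles, printed straight to the screen.
trips.find(trip_distance: { '$gte' => 10 }).limit(5).each do |doc|
  puts doc.inspect
end
```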

I even figured out some regular expressions to pull only the records where the hour was 04 or 05, and to get rid of the commas and extra spaces in the address fields.
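Continuing from the connection above, here's a sketch of that filtering and cleanup. The pickup_datetime and pickup_address field names, and the "YYYY-MM-DD HH:MM:SS" timestamp format, are assumptions about how the TLC columns came through mongoimport:

```ruby
# Match timestamps whose hour field is 04 or 05. The leading space
# anchors the match to the hour, not the minute (e.g. "12:04:11").
early = trips.find(pickup_datetime: / 0[45]:/)

early.each do |doc|
  # Drop commas (they break a hand-built CSV) and collapse runs of spaces.
  address = doc['pickup_address'].to_s.gsub(',', ' ').squeeze(' ').strip
  puts address
end
```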

Here's roughly where I ended up. The sketch below pulls the pieces together; again, the field names are assumptions about the TLC's CSV headers, so adjust them to match your import:
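```ruby
require 'mongo'

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'taxi')
trips  = client[:trips]

# Trips of 10+ miles that started in the 4 a.m. or 5 a.m. hour.
query = {
  trip_distance:   { '$gte' => 10 },
  pickup_datetime: / 0[45]:/
}

trips.find(query).each do |doc|
  addr = doc['pickup_address'].to_s.gsub(',', ' ').squeeze(' ').strip
  # One CSV-style row per trip: address, latitude, longitude.
  puts "#{addr},#{doc['pickup_latitude']},#{doc['pickup_longitude']}"
end
```

Each line of output is already a CSV row, which is what made the next step so easy.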

Instead of writing to a file, I just copy-pasted the output from my terminal into a .csv file and uploaded it to Google Fusion Tables.
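If you'd rather skip the copy-paste step, Ruby's built-in CSV library can write the file directly, and it quotes stray commas for you, so the address cleanup becomes optional. A sketch, under the same assumed field names:

```ruby
require 'csv'
require 'mongo'

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'taxi')
trips  = client[:trips]

CSV.open('early_trips.csv', 'w') do |csv|
  csv << ['address', 'latitude', 'longitude'] # header row for Fusion Tables
  trips.find(trip_distance: { '$gte' => 10 },
             pickup_datetime: / 0[45]:/).each do |doc|
    csv << [doc['pickup_address'], doc['pickup_latitude'], doc['pickup_longitude']]
  end
end
```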

And here's a bigger version of the map.