Ben Wellington didn't mean to start a data revolution. He just wanted to teach his students. As the visiting assistant professor in Pratt Institute's City & Regional Planning program, he teaches a statistics course based on NYC Open Data, a repository of information provided by the city. After he started playing around with the data sets, he decided to launch a blog. I Quant NY features his discoveries—from the farthest Manhattan apartment from the subway to the fire hydrant that brought in the most ticket revenue—and has become something of a media sensation, even affecting policy. Wellington spoke with Pacific Standard about asking the right questions, the simplicity and power of summing, and why posts don't need to be difficult.
Did you anticipate the blog getting attention beyond your students?
"For the most part, all my work is just counts and means. The fanciest thing is a correlation."
I was definitely surprised by the interest. I think there are a lot of compelling stories to be told within the different data sets. Often, people do really fancy visualizations and beautiful things. I don't have the skills to make beautiful things. I'm taking the opposite approach. I take small slices of things, things that people can actually internalize. On one hand, that's probably more relatable. Watching every CitiBike fly around the city is cool, but exploring where the most females and males are is simpler to internalize and easier to draw conclusions from.
It can be really hard to have an idea and express it in a simple way with stats and data. You seem to have mastered that.
The key is figuring out the right question to ask. If you look through Freakonomics, they aren't doing the fanciest analysis. They are looking at one variable, but they are looking at it from a new angle. For the most part, all my work is just counts and means. The fanciest thing is a correlation. After I did the post about the fire hydrant that was generating the most ticket revenue, I got a call from the New York Post asking about my methodology. I told them that I summed. I added. It was an awkward conversation.
There's a fear of numbers. I work at Pratt to teach city planners that you don't need to be a statistician or computer scientist to do this type of work. If you can learn a little bit about Excel, you can go on Open Data and download a .csv file and do the same CitiBike study I did with a few clicks.
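[Editor's sketch: the "I summed" approach described above can be illustrated in a few lines of Python. This is a minimal example, not Wellington's actual method. The column names `location` and `fine_amount` and the sample rows are invented stand-ins for a downloaded .csv file, not real NYC Open Data.]

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows standing in for a downloaded .csv file.
# Column names and values are hypothetical, not real NYC Open Data.
SAMPLE_CSV = """location,fine_amount
Hydrant A,115
Hydrant B,115
Hydrant A,115
Hydrant C,115
Hydrant A,115
"""


def revenue_by_location(csv_file):
    """Sum fine amounts per location; returns a dict of location -> total."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row["location"]] += int(row["fine_amount"])
    return dict(totals)


totals = revenue_by_location(io.StringIO(SAMPLE_CSV))

# The "methodology": add everything up, then pick the largest total.
top_location, top_total = max(totals.items(), key=lambda item: item[1])
print(top_location, top_total)
```

[To try this on a real download, pass an opened file to `revenue_by_location` in place of the `io.StringIO` wrapper and adjust the column names to match that file's header row.]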
How do you think of the questions to ask?
It's probably three ways. I keep an ear open to interesting current trends that are happening. If a hearing about Vision Zero is coming up, I might have an extra look at their data. I live in the city, and I'm always asking myself, "Why, why, why?" My experience as a New Yorker really leads me to a fascination with a certain type of data. I've gotten my share of parking tickets in my life. The third way is to look at a data set and start rattling off questions. I'll ask myself a question, look at the answer and if it seems cool, write about it.
What percentage of the time is there an interesting conclusion?
More than half, honestly. There's no data set in which I couldn't find something. Some are more "pop culture" exciting. The recent work I did about finding the apartment farthest from the subway somehow got picked up by a bunch of media outlets. I find that less interesting, but it's more relatable to people. There's a balance between interesting and useful, where maybe I can help with policy decisions, and exciting to the Internet. That's a conflicting battle I have.
I first found the blog after the MTA post about the perfect value to add to your Metrocard. That's similar in the not-all-that-useful-but-Internet-fascinating way.
Yeah, that's another example. I hope they fix it, but it's not that compelling analysis. It's more "this is quirky, let's fix it." Those aren't the things I'm most excited about, but they are the most relatable, I think. People flock to those types of things.
The apartment one was fun but also sort of silly. It's just an apartment on the water. That's not all that surprising.
I thought it was funny to find the apartment, see it listed, and see the price. [It's on the market for $18.9 million.] But to do the same thing in Brooklyn would be more compelling because you could talk about public transportation and places where you have a real distance. You could do a lot, though, like count the number of apartments that will have closer access to the subway when the Second Avenue line starts. What percentage of Manhattan was affected by that?
"I've had a lot of people call it 'data journalism,' and I've never thought about it that way. It never occurred to me at all to call it that. It's just analysis."
You've done a lot of work with CitiBike.
One of the cool things I saw in the CitiBike data was the median age of riders per station. The Lower East Side has the oldest riders. The median age is 41 or 42. That was a data point that stuck out to me and got me thinking. It could mean one of two things: Either they need more bikes and that's a good thing or they feel their only option is to bike to get to work and maybe that's a sign that we need better public transportation. You'd have to go figure out what is going on, but it's certainly an outlier.
That was cool, too, because the youngest area is the East Village. I colored it by age and you have these opposing things going on.
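[Editor's sketch: a per-station median like the one described above takes only a grouping step and a median. The version below is a minimal illustration under stated assumptions, not the analysis behind the blog post. The station names, birth years, and the `RIDE_YEAR` value are all invented placeholders, and the input is assumed to record each rider's birth year per trip.]

```python
import statistics
from collections import defaultdict

# Invented (station, rider birth year) pairs; not real CitiBike data.
TRIPS = [
    ("Station X", 1970),
    ("Station X", 1975),
    ("Station X", 1980),
    ("Station Y", 1985),
    ("Station Y", 1988),
    ("Station Y", 1990),
    ("Station Y", 1992),
]

RIDE_YEAR = 2014  # assumed year of the trips, used to turn birth year into age


def median_age_by_station(trips, ride_year):
    """Group rider ages by station and return station -> median age."""
    ages = defaultdict(list)
    for station, birth_year in trips:
        ages[station].append(ride_year - birth_year)
    return {station: statistics.median(values) for station, values in ages.items()}


medians = median_age_by_station(TRIPS, RIDE_YEAR)

# Sort stations from oldest to youngest median rider to spot outliers.
for station, age in sorted(medians.items(), key=lambda item: item[1], reverse=True):
    print(station, age)
```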
Are you surprised that public agencies have responded?
I put the data out there. I don't consider myself a journalist. Usually what happens is a journalist finds the data, reaches out to the agencies, and the agencies respond to them. I will read that response. This is all a brand new thing for government.
Open data plays two roles. You're leveraging the power of people who are passionate to find things. The fact that you can help find issues with streets and have them fixed is maybe useful to the Department of Transportation. If the Fire Department were to release information about the times of fires, I bet people would start modeling the likelihood of fires. Maybe the Fire Department could do that, but there are a bunch of scientists out there who would have a great time doing it, and the Fire Department could leverage that work for free. On the other hand, it's also a bit of a watchdog with transparency and accountability.
There are two sides to the coin. Anytime you point out something, it could go either way. If you tell the Department of Health that there's something wrong with the rating system, they could either say, "Wow, let's look into that" or they could play defensive. Generally, agencies are defensive, but there's also not a good mechanism for them to take in information like this. They get caught off guard. I hope in the coming years they build in ways to reach out like this. If there were a liaison I could reach out to, maybe I would go that route. But right now, the only way to get attention is through the media. Unfortunately, that can create an adversarial relationship, which I think is the wrong way to look at open data. I really believe that if you empower people, you'll get much more out than you'll get criticism.
Is this happening in other cities?
It's definitely growing. A lot of cities have open data portals. I think New York is a leader in its size and scope, but the federal government has open data and the state has a large one as well. Los Angeles does as well. There are dozens and dozens of cities that have jumped on.
Have you had interest from other people in other cities?
A lot of people have reached out and asked me to do something I've done for New York for their city. If I do something for New York, they ask about Philly. Or if I do something for Manhattan, someone will ask if I can do it for Brooklyn. There is a demand for this kind of thing.
People study data sets all over the country, but most studies are these long, in-depth analyses. You do real reporting, show up somewhere, look around, ask questions, and then write a report that makes recommendations. I've taken the opposite approach, a breadth over depth approach. I want to dig up as many interesting tidbits as possible and put them out there. I let the experts dig in. I don't know much about traffic safety, but I can analyze the most dangerous places, give that to people who actually know about it, and let them go with it. I'm not going to sit there and make infrastructural suggestions.
Do you think we'll see more of what you're doing? There's definitely a trend in this type of data journalism.
I've had a lot of people call it “data journalism,” and I've never thought about it that way. It never occurred to me at all to call it that. It's just analysis.
One of my favorite things was a tongue-in-cheek 4th of July post I did about the best place to see illegal fireworks. I looked at the number of complaints about illegal fireworks and made a map. It turns out that Inwood was the best place. An hour or so after I posted it, the Times had an article about Inwood fireworks. It had nothing to do with my work. They actually had done a real story. They went there, had the pictures, and had done the reporting. The fact that we had zeroed in on the same place—me from my couch in a few hours; them in a much more compelling, interesting, and deep way—was an affirmation that there was value. It was cool to see.