Author: Anders S

Beyond Data “Hot Pockets”: Creating a Continuum of Tools

How do you make Data Science more accessible? A lot of folks say the answer is to create easy-to-use drag-and-drop tools that hide all the complex, icky stuff. At Data Chefs, we totally get why smart, dedicated people think Data Science should head in this direction. But we don’t think it’s sustainable.

If you know exactly, precisely the kind of data analysis people are going to want to do, then easy-to-use drag-and-drop tools are great. But in our experience, it often doesn’t play out that way.

Say you’re part of a coalition that’s trying to reduce kids’ asthma caused by polluted air. Someone finds this great little tool that lets you easily import data and map it to your heart’s content. Problem solved!

And then someone points out that for two-thirds of the data visualizations you want to do, the maps are perfect, but for the other third you need a different kind of map – and the easy-to-use tool doesn’t make those kinds of maps.

Even more likely, you run into a snag with the data. The maps would be a lot more useful if you could merge in some census data, only the tool can’t do that. Or it can only merge in really small data sets, or census data that’s in another format, or…

Okay, you say, we can come up with a workaround. Maybe there’s another tool you can use to merge or reformat data before you import the data into your great little mapping tool. The catch: this new tool isn’t as easy to use. Someone needs to learn how to run it from the command line, decipher the obscure documentation and make sense of the cryptic, passive-aggressive error messages the tool spits out whenever you make a tiny mistake. So maybe instead you run some of your data through Excel, and then use this other little utility, and then clean some of the data by hand, and then add some other ugly little duct tape hack…

In short, using the “easy” tool has morphed into a complicated, painful process that nobody will remember how to do three months from now when you need to create another set of maps.

Or worse yet, there is no obvious workaround – now you need a programmer.

We don’t think it makes sense to keep building a data science landscape where there are easy-to-use tools but when you need to go a little beyond what they can do you fall off a cliff. Instead, we think it’s time to take a page from the world of cooking.

The world of cooking isn’t divided into people who can only microwave a hot pocket and Master Chefs. Instead there’s a continuum of experience.

A lot of people start off their cooking journey by just being able to microwave a TV dinner or add milk to cereal. Then maybe they learn how to make mac & cheese from a box, scramble eggs in a pan, or throw together a simple salad. Then maybe they learn how to grill burgers or bake chocolate chip cookies from the recipe on the chocolate chip package. Then they build up a repertoire of a handful of go-to recipes. Or maybe early on one of their relatives teaches them how to make simple versions of the foods that are part of their heritage. Or maybe they decide they’re going to take a class.

There are lots of paths you can take to learn to cook. And not everyone gets to or needs to get to the same level of skill; plenty of people can take care of their culinary needs without becoming an expert cook.

But what most of these paths have in common is that there’s a continuum. If you want to do more than microwave hot pockets, you don’t have to enroll in the Culinary Institute of America. You can take baby steps towards getting more skill based on what kind of food interests you the most and how much time & effort you want to put into it.

We think the world of data science would be a lot more accessible – and a lot more diverse – if it was built around a continuum of tools that, like cooking, let folks take baby steps towards getting more skill as they had the need and time for it.

What exactly would this continuum look like? We’re not sure. That’s what Data Chefs aims to find out.

Reshaping the Tools to Fit Our Communities

When I first started learning pandas, I spent way too much time feeling like an idiot.

Some parts of pandas were a snap to pick up, and they let you easily slice and dice data with a few simple commands. But other parts of pandas could be maddeningly difficult or just plain bizarre. And my frustration was compounded by the fact that like a lot of people who need a tool like pandas, I only had an hour here and there in the middle of my daily grind to learn it.

The documentation had similar problems. Some of it was really well written. But sometimes figuring out the basics I needed to do my work felt like a game of Marco Polo.

After about six months of playing with pandas, I figured out what was going on: pandas was great, but it wasn’t designed with folks like me in mind.

Take how pandas handles time. pandas has elegant, powerful commands for handling what’s called “time series” data – data like stock market data where you get share prices at regular intervals of time. But when it came to the dates I usually worked with — sales dates, membership dates — pandas was like the DMV on acid. Want to get the year of the sales purchase? Here’s what you need to do:

sales['year'] = sales.purchase_date.apply(lambda x: x.year)

Ick!

(A quick nerdy aside. Complain about this to a pandanista and odds are they’ll say, “if you don’t like this approach, just convert your sales data into time series data by making sales.purchase_date into sales’ index.” To which I say: I rest my case.)
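For anyone who wants to see the two routes side by side, here is a rough sketch. The sales table is made up for illustration (only the purchase_date column name comes from the snippet above; the amount column, the values and the year_from_index name are assumptions), and it is meant as a small demonstration, not the one right way to do it.

```python
import pandas as pd

# Hypothetical sales table: the values and the "amount" column are
# invented for illustration; only "purchase_date" comes from the text.
sales = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2013-03-15", "2014-07-04", "2014-12-25"]),
    "amount": [25.0, 40.0, 15.5],
})

# Route 1 (from the text): pull the year out of each date, one row at a time.
sales["year"] = sales["purchase_date"].apply(lambda x: x.year)

# Route 2 (the pandanista's suggestion): make purchase_date the index,
# then read the year off the resulting DatetimeIndex.
indexed = sales.set_index("purchase_date")
indexed["year_from_index"] = indexed.index.year

print(sales["year"].tolist())
print(indexed["year_from_index"].tolist())
```

Both routes end up with the same years. The difference is how much you have to know about pandas internals to get there.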

Was this some bizarre form of coding masochism – call it Fifty Shades Of Pandas? No. The pandas community was incredibly nice and tried to be as helpful as possible to folks who were flailing. The problem is that pandas and the tools it was built on – e.g. numpy – were designed around the needs of the people who were building them, who were mostly financial quants and scientists. Quants and scientists mostly use time series. Folks like me who work for nonprofits? Not so much.

And this brings me to one of the main points of Data Chefs. A lot of really good people are trying to figure out how to make it easier for folks in the community to learn how to use the tools we have. We think that’s a good thing. But we also think it’s time to work in the other direction. We need to start making the tools easier for more folks to use — not by dumbing them down but by redesigning them with a different audience in mind.

In the long run, what Data Chefs wants to accomplish is crazy ambitious. I used to work at SEIU, where I worked with union locals on data issues, and eventually I want janitors who are active in their local to be able to use tools like pandas to understand and act on the world around them. That’s not going to happen overnight.

But in the meantime, we’ve got plenty of low-hanging fruit to pick. There’s no reason why working with dates in pandas can’t be easy for folks in the nonprofit community, and fixing it isn’t rocket science. But that won’t happen unless those of us who aren’t currently part of the communities that are building tools like pandas start making our voices heard.