Can I Cook?: Impostor Syndrome and “Data Science”

Lately, I’ve been thinking a lot about the difference between amateurs and experts. In particular, I’ve been wrestling with our use of the term “data science” on this site.  To me, the term denotes a level of expertise that I don’t feel comfortable claiming yet. Through a series of conversations with my Data Chefs colleagues, who have experienced similar discomfort at times, I’ve learned a great deal.

My colleagues and I have discussed the need to actively combat impostor syndrome, an accomplished person’s fear that s/he is a fraud who will eventually be exposed.  Left unchecked, impostor syndrome can stifle creativity and momentum, especially among women and people of color. Still, I’ve learned that I feel much better if I don’t seek to claim an identity as a “data scientist,” but instead think of myself as someone who’s “doing data science” (albeit at a beginner’s pace).

As mentioned in the last post, most of us know some incredible home cooks who didn’t go to fancy culinary schools or study under distinguished Michelin-starred chefs.  These amateur cooks have made dishes hundreds of times, perfecting them, first through trial and error, and eventually through skill and intuition. For that reason, I’d put their knowledge up against that of most formally recognized experts.

If you’re like me, an amateur struggling to take stock of your data science abilities and accomplishments, the relevant question to ask yourself isn’t the equivalent of “Do I consider myself a chef?”  It’s “Am I learning how to cook?” If the answer to that is “Yes,” then, eventually, you’ll be able to ask yourself the only question that matters: “Can I cook?”

You Don’t Have To Be a Data Chemist to Bake Data Cookies

One of the reactions I’ve gotten to the argument behind my last post is that it’s unrealistic to think we can smooth data science’s learning curve. When you get beyond very simple point-and-click, you’ve got to immerse yourself in the dirty details of how statistics, machine learning, etc. work. In other words, we can’t really make data science accessible because the body of knowledge you need to go beyond baby steps is just too large.

When I first ran into this argument, I would reply with stories about skilled practitioners I’ve worked with who’ve forgotten a lot of what they learned in, say, intro stats – they couldn’t perform a chi-square test by hand if their lives depended on it – but still produce very powerful, highly influential work. These days my answer is a lot simpler.

Let’s have a show of hands of everyone who has relatives or friends who are amazing cooks. Now keep your hand up if most of those amazing cooks know the chemistry and physics behind what they do. Not a whole lot of hands left up.

It’s not that these amazing cooks don’t have any of the knowledge that’s embodied in chemistry and physics. They know a lot about how to work with boiling water, how you know when something they’ve been frying is done, etc. But the model they have in their head – or “in their fingers” – isn’t the one you get in chemistry class.

I think Data Chefs is going to end up demonstrating that’s also true for data science: you don’t need to be a Data Chemist to bake great data cookies. I don’t have any concrete empirical data to back me up. But neither do the people who say it can’t be done. All we know for sure is that this isn’t how it’s been taught in the past. And if the data-driven revolution has taught us anything, it’s that you wouldn’t want to build the foundation of data science training on “but that’s the way it’s always been done.”

Beyond Data “Hot Pockets”: Creating a Continuum of Tools

How do you make data science more accessible? A lot of folks say the answer is to create easy-to-use drag-and-drop tools that hide all the complex, icky stuff. At Data Chefs, we totally get why smart, dedicated people think data science should head in this direction. But we don’t think it’s sustainable.

If you know exactly, precisely the kind of data analysis people are going to want to do, then easy-to-use drag-and-drop tools are great. But in our experience, it often doesn’t play out that way.

Say you’re part of a coalition that’s trying to reduce kids’ asthma caused by polluted air. Someone finds a great little tool that lets you easily import data and map it to your heart’s content. Problem solved!

And then someone points out that for two-thirds of the data visualizations you want to do, the maps are perfect, but for the other third you need a different kind of map – and the easy-to-use tool doesn’t make those kinds of maps.

Even more likely, you run into a snag with the data. The maps would be a lot more useful if you could merge in some census data, only the tool can’t do that. Or it can only merge in really small data sets, or census data that’s in another format, or…
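For a sense of scale, the kind of merge that stops the point-and-click tool cold is a few lines in a scriptable tool like Python’s pandas. Here’s a sketch, with made-up zip codes, rates, and column names:

```python
import pandas as pd

# Made-up asthma data by zip code (hypothetical columns and values)
asthma = pd.DataFrame({
    "zip": ["20001", "20002", "20003"],
    "asthma_rate": [9.1, 7.4, 11.2],
})

# Made-up census extract covering the same areas
census = pd.DataFrame({
    "zip": ["20001", "20002", "20003"],
    "median_income": [52000, 61000, 43000],
})

# The merge the mapping tool couldn't do: one line
merged = asthma.merge(census, on="zip")
```

Of course, that one line assumes someone has already climbed far enough up the learning curve to be comfortable at a Python prompt – which is exactly the gap we’re talking about.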

Okay, you say, we can come up with a workaround. Maybe there’s another tool you can use to merge or reformat data before you import it into your great little mapping tool. The catch: this new tool isn’t as easy to use. Someone needs to learn how to run it from the command line, decipher the obscure documentation, and make sense of the cryptic, passive-aggressive error messages the tool spits out whenever you make a tiny mistake. So maybe instead you run some of your data through Excel, then use this other little utility, then clean some of the data by hand, then add some other ugly little duct-tape hack…

In short, using the “easy” tool has morphed into a complicated, painful process that nobody will remember how to do three months from now when you need to create another set of maps.

Or worse yet, there is no obvious workaround – now you need a programmer.

We don’t think it makes sense to keep building a data science landscape where there are easy-to-use tools but when you need to go a little beyond what they can do you fall off a cliff. Instead, we think it’s time to take a page from the world of cooking.

The world of cooking isn’t divided into people who can only microwave a hot pocket and Master Chefs. Instead there’s a continuum of experience.

A lot of people start off their cooking journey by just being able to microwave a TV dinner or add milk to cereal. Then maybe they learn how to make mac & cheese from a box, scramble eggs in a pan, or throw together a simple salad. Then maybe they learn how to grill burgers or bake chocolate chip cookies from the recipe on the chocolate chip package. Then they build up a repertoire of a handful of go-to recipes. Or maybe early on a family member teaches them how to make simple versions of the foods that are part of their heritage. Or maybe they decide they’re going to take a class.

There are lots of paths you can take to learn to cook. And not everyone gets to or needs to get to the same level of skill; plenty of people can take care of their culinary needs without becoming an expert cook.

But what most of these paths have in common is that there’s a continuum. If you want to do more than microwave hot pockets, you don’t have to enroll in the Culinary Institute of America. You can take baby steps towards getting more skill based on what kind of food interests you the most and how much time & effort you want to put into it.

We think the world of data science would be a lot more accessible – and a lot more diverse – if it was built around a continuum of tools that, like cooking, let folks take baby steps towards getting more skill as they had the need and time for it.

What exactly would this continuum look like? We’re not sure. That’s what Data Chefs aims to find out.

Overcoming Obstacles to Sharing Science Data

Over at The Atlantic, Maggie Puniewska recently wrote a nice summary of several studies exploring the various factors that prevent scientists from sharing their data: “Scientists Have a Sharing Problem.”

Puniewska’s piece is a great read, as are all of the studies she references. A few quotes are worth highlighting. Among them,

    These findings show that data withholding isn’t always motivated by vengeance or the desire to get ahead; in some cases, the lack of resources makes it difficult to share it.

Also, regarding efforts to correct this lack of data sharing, Puniewska writes,

    Scientists would need a centralized place to store their data, meaning more digital repositories would need to be created… The scientific community would also need to establish protocols on how data should be stored, so that it becomes less time-consuming for other researchers to locate and interpret results.

The studies Puniewska examines are primarily about scientists sharing their datasets. At Data Chefs, we believe that making data accessible means more than just posting raw datasets for public consumption; there must also be tools and concerted efforts to make shared data easier to analyze and understand. We aim to be a community that provides the resources, repositories, and protocols to make this a reality.

Reshaping the Tools to Fit Our Communities

When I first started learning pandas, I spent way too much time feeling like an idiot.

Some parts of pandas were a snap to pick up, and they let you easily slice and dice data with a few simple commands. But other parts of pandas could be maddeningly difficult or just plain bizarre. And my frustration was compounded by the fact that like a lot of people who need a tool like pandas, I only had an hour here and there in the middle of my daily grind to learn it.

The documentation had similar problems.  Some of it was really well written.  But sometimes figuring out the basics I needed to do my work felt like a game of Marco Polo.

After about six months of playing with pandas, I figured out what was going on: pandas was great, but it wasn’t designed with folks like me in mind.

Take how pandas handles time. pandas has elegant, powerful commands for handling what’s called “time series” data – stock market data, for example, where you get share prices at regular intervals. But when it came to the dates I usually worked with – sales dates, membership dates – pandas was like the DMV on acid. Want to get the year of a sales purchase? Here’s what you need to do:

sales["year"] = sales.purchase_date.apply(lambda x: x.year)

(A quick nerdy aside. Complain about this to a pandanista and odds are they’ll say, “if you don’t like this approach, just convert your sales data into time series data by making sales.purchase_date into sales’ index.” To which I say: I rest my case.)
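(For the extra curious, the conversion the pandanistas suggest looks something like this – a sketch with a hypothetical sales table:)

```python
import pandas as pd

# Hypothetical sales table with a datetime column
sales = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2013-01-15", "2014-03-20"]),
    "amount": [25, 40],
})

# The "pandanista" fix: make purchase_date the index,
# so the frame becomes time series data...
sales = sales.set_index("purchase_date")

# ...and now the year lives on the index
years = sales.index.year
```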

Was this some bizarre form of coding masochism – call it Fifty Shades of Pandas? No. The pandas community was incredibly nice and tried to be as helpful as possible to folks who were flailing. The problem is that pandas and the tools it was built on – e.g., numpy – were designed around the needs of the people who were building it, who were mostly financial quants and scientists. Quants and scientists mostly use time series. Folks like me who work for nonprofits? Not so much.

And this brings me to one of the main points of Data Chefs. A lot of really good people are trying to figure out how to make it easier for folks in the community to learn how to use the tools we have. We think that’s a good thing. But we also think it’s time to work in the other direction. We need to start making the tools easier for more folks to use — not by dumbing them down but by redesigning them with a different audience in mind.

In the long run, what Data Chefs wants to accomplish is crazy ambitious. I used to work at SEIU, helping union locals with data issues, and eventually I want janitors who are active in their local to be able to use tools like pandas to understand and act on the world around them. That’s not going to happen overnight.

But in the meantime, we’ve got plenty of low-hanging fruit to pick. There’s no reason why working with dates in pandas can’t be easy for folks in the nonprofit community, and fixing it isn’t rocket science. But that won’t happen unless those of us who aren’t currently part of the communities building tools like pandas start making our voices heard.
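For what it’s worth, the date headache above shows how small the fix can be: newer versions of pandas grew a friendlier `.dt` accessor that pulls date parts straight out of an ordinary column. A sketch with made-up sales data:

```python
import pandas as pd

# Made-up nonprofit-style sales data
sales = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2013-01-15", "2013-06-02", "2014-03-20"]),
    "amount": [25, 40, 15],
})

# No lambdas, no index gymnastics: .dt exposes date parts directly
sales["year"] = sales["purchase_date"].dt.year
```

One small change in the tool, and a whole class of users no longer needs a workaround.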