Beyond Boot Camps

Boot camps have become an increasingly popular way for folks in the community to get started in Data Science. It’s understandable why. Data Science can be pretty overwhelming at first, so getting a concentrated dose with lots of support can be invaluable.

I have a tremendous amount of respect for the people who make data science boot camps happen, and they have made a huge difference in the lives of some of the people who have gone through them. But I think we are at the point where we are hitting the limits of boot camps.

First, most boot camps cost more money than many people can afford. A number of programs aimed at increasing the diversity of Data Science offer scholarships for some or all of their participants. But given how much boot camps cost to run, they can only reach a limited number of people. As a model, it just doesn’t scale – and given how many data science jobs there are out there, that’s a serious problem.

Similarly, most boot camps take far more time than many people can afford. Again, boot camps that try to increase the diversity of data science work very hard to help folks overcome this barrier. But for single working parents and many other people, a model built on one very concentrated dose of learning over several months just isn’t going to work.

Finally, most boot camps simply can’t afford to provide real support once the boot camp is over. This is an issue for a lot of folks who go through boot camps, because no matter how dedicated the instructors are, many folks can only absorb so much info at one time. That’s a problem even if you’re just learning one programming language or skill. But when it comes to retaining even a basic mastery of the array of skills many data science jobs require, boot camps don’t offer a good answer.

So in addition to boot camps, we need another approach that can scale up. That’s why Data Chefs argues for creating a continuum of tools and smoothing the learning curve among these tools. Part of the reason we need boot camps is that learning these tools is way too hard. Many of these tools are open source, and the makers of the ones that aren’t are very interested in growing their markets. There is no reason why a movement couldn’t change the trajectory of these tools to make it far easier to get started and far easier to make progress.

Similarly, there’s no reason we couldn’t create a more robust, community-centered ecosystem around learning and using these tools so a much wider range of folks could get exposed to them, get their feet wet, and begin to make progress at a pace that their lives could handle.

But won’t this take a lot of work? Yes, it will. But so do boot camps.

Boot camps require a staggering amount of time and energy – one of the many reasons I have so much respect for the people who make them happen. For all the time and energy that go into boot camps, they can only reach a limited number of people. And for the most part, each boot camp – or school of boot camps – is an island unto itself. As a result, they never get the payoff of having many people across many communities working together towards a common goal.

So maybe it’s time to think about taking some of the considerable energy going into boot camps right now and using it to build a solution that can reach a lot more people.

Our Data, Ourselves

One of Data Chefs’ core assumptions is that there’s no reason data science can’t be accessible to a much wider audience. Some people think that’s crazy. Slicing and dicing data, making sense of data – it’s just too complicated for anyone other than an expert.

Back in the early 60s, that’s exactly how most folks thought about medicine. Nancy Miriam Hawley recalls an encounter she had with her OB/GYN:

Imagine me as a 23 year old professional young woman asking a question after the doctor (he) recommended that I use a new-to-market pill for birth control. What’s in this pill? I ask. His response: condescending pat on my head and literally said “don’t worry your pretty little head!”

Minus the head pat, that was pretty much the standard answer doctors were expected to give. They had years and years of intensive training. How could anyone — let alone a woman — be expected to have any real say in their treatment given that they couldn’t possibly understand medicine?

In 1969, Hawley and several other women who had met at a women’s conference decided it was time for a change.

We had all experienced similar feelings of frustration and anger toward specific doctors and the medical maze in general, and initially we wanted to do something about those doctors who were condescending, paternalistic, judgmental and noninformative. As we talked and shared our experiences with one another, we realized just how much we had to learn about our bodies. So we decided on a summer project: to research those topics which we felt were particularly pertinent to learning about our bodies, to discuss in the group what we had learned, then to write papers individually or in groups of two or three, and finally to present the results in the fall as a course for women on women and their bodies.

As we developed the course we realized more and more that we really were capable of collecting, understanding, and evaluating medical information. Together we evaluated our reading of books and journals, our talks with doctors and friends who were medical students. We found we could discuss, question, and argue with each other in a new spirit of cooperation rather than competition. We were equally struck by how important it was for us to be able to open up with one another and share our feelings about our bodies. The process of talking was as crucial as the facts themselves. Over time the facts and feelings melted together in ways that touched us very deeply, and that is reflected in the changing titles of the course and then the book, from “Women and Their Bodies” to “Women and Our Bodies” to, finally, “Our Bodies, Ourselves.”

Today, the idea that we couldn’t understand enough about medicine to have an informed opinion seems about as antiquated as using leeches. In fact, these days you can even get a degree in the art and science of making medical information accessible to the public.

And as complex as data science is, it’s not in the same league as medicine. To understand the human body, you need to understand biology, physics, chemistry, psychology, statistics, etc. In fact, medicine is so complex that even someone with years and years of training in one medical specialty isn’t qualified to have an expert opinion about another specialty.

So the next time someone talking about data science does the equivalent of patting you on the head, remember that the only reason that they can get away with that crap is that we are just at the beginning of a movement that’s committed to doing in data science what those women did “about those doctors who were condescending, paternalistic, judgmental and noninformative.”

Why We Still Need to Worry About Diversity in Tech: the Sexist Idiot Edition

Sarah Drasner is an expert in the arcane, super geeky world of Scalable Vector Graphics (SVG) animation – basically one of the main ways to do really cool interactive work, like data viz, on the web. Parts of SVG animation are mind-numbingly painful enough to make daytime drinking under your desk seem like a very reasonable response. Drasner’s book, SVG Animations, which was published by O’Reilly, is hands down the best book on this subject. And yet in 2017, she still has to put up with crap like this:

After my talk:

Guy: so who coded your demos?

Me: I did

G: so you used a GUI?

M: no I coded it

G: you code?

M: yes

G: no, like actual code

And as she tweets, this wasn’t a one-off:

It’s like, every day now. Just cut it out.

please stop, this shit is exhausting

In case anyone is having trouble wrapping their head around why diversity in tech matters, this is why:  so there are enough women in tech that no guy would dare do this.

Data Viz Revision: O’Reilly 2016 Data Science Salary Survey (Part 2)

This post is part of a series based on the data displayed in O’Reilly’s 2016 Data Science Salary Survey. Using the Data Chefs Revision Organizer as a guide, we will rethink and revise some of the visualizations featured in the report.

In this post, I want to focus on the visualization for the share of survey respondents by self-reported age category:

oreilly-age

Again, the authors used the arcing blue circle theme to depict the breakdown by age category. On the plus side, the data labels are consistently placed, all falling along the bottom-right of each value circle (or the inside of the arc), and the order is intuitive: youngest to oldest. Also, the circles appear to be sized properly by area (as opposed to diameter).
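For anyone who wants to check the area-versus-diameter point in their own charts, here is a minimal Python sketch. It is my own illustration, not anything taken from the O’Reilly report: the function name `radius_for_value` and the sample values are made up. The idea is that if a circle’s area should be proportional to a value, its radius has to grow with the square root of that value.

```python
from math import pi, sqrt


def radius_for_value(value, area_per_unit=1.0):
    """Return the radius of a circle whose AREA is proportional to value.

    area_per_unit is a hypothetical scaling constant (chart units of area
    per one unit of data); pick whatever fits your canvas.
    """
    return sqrt(value * area_per_unit / pi)


# Hypothetical shares, chosen only to make the arithmetic easy to follow.
small, large = 10, 40

# Sized by area: a value 4x larger gets a radius about 2x larger,
# so the circle covers about 4x the area.
radius_ratio = radius_for_value(large) / radius_for_value(small)
print(f"radius ratio when sizing by area: {radius_ratio:.2f}")

# Sized by diameter (the common mistake): a 4x larger diameter
# means the circle covers 16x the area, which overstates the difference.
area_ratio_if_sized_by_diameter = (large / small) ** 2
print(f"area ratio when sizing by diameter: {area_ratio_if_sized_by_diameter:.0f}")
```

Sizing by diameter would make the larger category look far bigger than it really is, which is why getting this detail right matters.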

Using circles is not necessarily a bad way to depict category data, but doing so has some limitations. The main drawback is that by using distinct circles, you lose the relation of each part to the whole.

data-chefs-viz-revision-organizer-oreilly-age

For this data, I propose using a form of visualization in which the part/whole relationship is central: pie chart, donut chart, waffle chart, or stacked 100% bar chart, shown below:

oreilly age revisions.png

The biggest downside to using these part/whole visualizations is that there isn’t a lot of room to label smaller values.  For that reason, I created a legend for all the values in each graph.

And, although this isn’t a problem with the visualization itself, if you pay attention to the values in the original, you’ll see that they add up to more than 100%: 101% to be exact. What probably happened is that more than one value was rounded up, giving the total an extra full percent. In my revisions, I changed the value for the 41-50 category from 16% to 15% so that the values would sum to 100%. This was a completely arbitrary choice because I had no access to the raw data to know exactly how they were rounded.
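If you do have the raw shares, there is a less arbitrary way to handle this. The sketch below is a hedged illustration in Python: the numbers in `raw` are hypothetical (they are not the survey’s data, which I don’t have), and `round_to_total` is a name I made up. It shows how rounding each value on its own can push the total to 101, and how the largest-remainder method (round everything down, then hand the leftover points to the values with the biggest fractional parts) keeps the total at exactly 100.

```python
from math import floor


def round_to_total(shares, total=100):
    """Round shares to whole numbers that sum to exactly `total`,
    using the largest-remainder method."""
    floors = [floor(s) for s in shares]
    leftover = total - sum(floors)
    # Indices ordered by fractional part, largest first.
    by_remainder = sorted(
        range(len(shares)),
        key=lambda i: shares[i] - floors[i],
        reverse=True,
    )
    result = floors[:]
    # Give one extra point to each of the `leftover` largest remainders.
    for i in by_remainder[:leftover]:
        result[i] += 1
    return result


# Hypothetical unrounded percentages (NOT the survey's raw data).
raw = [8.6, 35.9, 30.8, 15.7, 9.0]

naive = [round(x) for x in raw]
print(naive, sum(naive))      # [9, 36, 31, 16, 9] 101

adjusted = round_to_total(raw)
print(adjusted, sum(adjusted))  # [8, 36, 31, 16, 9] 100
```

With this approach, the category that loses a point is the one that was closest to being rounded down anyway, rather than one picked at random.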

I think any one of these would work in place of the original. Thoughts?

Where We Are Headed: Data Visualization, Data Manipulation

When we first started Data Chefs, we thought we were going to focus on making it a lot easier to clean up and slice & dice data. Many folks in “data science” will tell you that they spend the vast majority of their time prepping their data before they can analyze it, so if we could make this easier for folks in the community, it would be a real win.

After banging our heads against various doors, we realized we were trapped in a chicken-and-egg problem: without something concrete to show folks, the idea that they could have any say in how easy or hard the tools for wrangling data were seemed overwhelming, and without community folks to figure out what “easier” looked like, we wouldn’t be able to convince data geeks to build easier-to-use tools.

So over the past six months, we’ve gradually been shifting our focus. You’re going to start seeing a lot more about data visualization on this blog. That’s because even if people are scared of or overwhelmed by the idea that they could shape the tools they use, everybody likes shiny objects. So going forward, we’re going to explore what it means to make an organization data visualization-literate, and we’re going to do some D3 experiments to see if we can smooth its learning curve.

That said, we aren’t entirely giving up on data manipulation. Although we weren’t able to put together a community of folks, we did learn a lot about the problems that a Data Chefs approach would have to solve; I’ll blog about it in the next few months.

“Fluff” Vs Context: In the Right Environment, Girls Out-Program Boys

At the beginning of January, Motherboard’s Michael Byrne argued that if you had a New Year’s resolution to start learning how to code, you should learn to do it “the hard way.”

I learned to program in C at a community college and I wouldn’t have done it any other way….. I was hooked on problem solving and algorithms. This is a thing unique to programming: the immediate, nearly continuous rush of problem solving. To me, programming feels like solving puzzles (or, rather, it is solving puzzles), or a great boss fight in a video game….

The point is to learn programming as it is nakedly, minus as much gunk and fluff that can possibly be removed from the experience.

Let’s put aside for a moment the “hard way” vs “gunk and fluff” pseudo-macho, my-head-was-shoved-in-a-locker-in-high-school-and-I-haven’t-got-over-it framing used by guys like Byrne. There are an awful lot of people who’d agree this is the best way to teach anyone who wants to learn. But that’s because it appeals to them, not because empirical evidence backs them up.

Take a recent study by Kate Howland and Judith Good.

Researchers in the University’s Informatics department asked pupils at a secondary school to design and program their own computer game using a new visual programming language that shows pupils the computer programs they have written in plain English.

Dr Kate Howland and Dr Judith Good found that the girls in the classroom wrote more complex programs in their games than the boys and also learnt more about coding compared to the boys.

Why did girls do so much better? Here’s what Good thinks is happening:

Given that girls’ attainment in literacy is higher than boys across all stages of the primary and secondary school curriculum, it may be that explicitly tying programming to an activity that they tend to do well in leads to a commensurate gain in their programming skills.

In other words, if girls’ stories are typically more complex and well developed, then when creating stories in games, their stories will also require more sophisticated programs in order for their games to work.

And it isn’t just that these girls are more skilled at telling complex stories; they also enjoy doing it.

It’s an important lesson, not only for teaching children but also adults. If we want to make Data Science more accessible, the first thing we need to ask is where the audiences we are trying to reach are coming from. If we understand what gets someone fired up and what skills they bring to the table, it can go a long way in unlocking their ability – and just as importantly, their desire – to excel in this new field.


UPDATE: I’d also like to point out that there are a decent number of guys like me who, unlike Byrne, think solving abstract puzzles is boring as hell. In my three decades of coding and managing complex software projects, this lack of enthusiasm for abstract puzzles hasn’t been a problem so far.

Can I Cook?: Impostor Syndrome and “Data Science”

Lately, I’ve been thinking a lot about the difference between amateurs and experts. In particular, I’ve been wrestling with our use of the term “data science” on this site.  To me, the term denotes a level of expertise that I don’t feel comfortable claiming yet. Through a series of conversations with my Data Chefs colleagues, who have experienced similar discomfort at times, I’ve learned a great deal.

My colleagues and I have discussed the need to actively combat impostor syndrome, an accomplished person’s fear that s/he is a fraud who will eventually be exposed.  Left unchecked, impostor syndrome can stifle creativity and momentum, especially among women and people of color. Still, I’ve learned that I feel much better if I don’t seek to claim an identity as a “data scientist,” but instead think of myself as someone who’s “doing data science” (albeit at a beginner’s pace).

As mentioned in the last post, most of us know some incredible home cooks who didn’t go to fancy culinary schools or study under distinguished Michelin-rated chefs.  These amateur chefs have made dishes hundreds of times, perfecting them, first through trial and error, and eventually through skill and intuition. For that reason, I’d put their knowledge up against that of most formally recognized experts.

If you’re like me, an amateur struggling to take stock of your data science abilities and accomplishments, the relevant question to ask yourself isn’t the equivalent of “Do I consider myself a chef?”  It’s “Am I learning how to cook?” If the answer to that is “Yes,” then, eventually, you’ll be able to ask yourself the only question that matters: “Can I cook?”

You Don’t Have To Be a Data Chemist to Bake Data Cookies

One of the reactions I’ve gotten to the argument behind my last post is that it’s unrealistic to think we can smooth data science’s learning curve. When you get beyond very simple point and click, you’ve got to immerse yourself in the dirty details of how statistics, machine learning, etc. work. In other words, we can’t really make data science accessible because the body of knowledge you need to go beyond baby steps is just too large.

When I first ran into this argument, I would reply with stories about the skilled practitioners in the field I’ve worked with who’ve forgotten a lot of what they learned in, say, intro stats – couldn’t perform a chi-square test by hand if their life depended on it – but still produce very powerful, highly influential work. These days my answer is a lot simpler.

Let’s have a show of hands of everyone who has relatives or friends who are amazing cooks. Now keep your hand up if most of those amazing cooks know the chemistry and physics behind what they do. Not a whole lot of hands left up.

It’s not that these amazing cooks don’t have any of the knowledge that’s embodied in chemistry and physics. They know a lot about how to work with boiling water, how to tell when something they’ve been frying is done, etc. But the model they have in their head – or “in their fingers” – isn’t the one you get in chemistry class.

I think Data Chefs is going to end up demonstrating that’s also true for data science: you don’t need to be a Data Chemist to bake great data cookies. I don’t have any concrete empirical data to back me up. But neither do the people who are saying it can’t be done. All we know for sure is that that’s not how it’s been taught in the past. And if the data-driven revolution has taught us anything, it’s that you wouldn’t want to build the foundation of data science training on “but that’s the way it’s always been done.”

Beyond Data “Hot Pockets”: Creating a Continuum of Tools

How do you make Data Science more accessible? A lot of folks say the answer is to create easy-to-use drag-and-drop tools that hide all the complex, icky stuff. At Data Chefs, we totally get why smart, dedicated people think Data Science should head in this direction. But we don’t think it’s sustainable.

If you know exactly, precisely the kind of data analysis people are going to want to do, then easy-to-use drag-and-drop tools are great. But in our experience, it often doesn’t play out that way.

Say you’re part of a coalition that’s trying to reduce kids’ asthma caused by polluted air. Someone finds this great little tool that lets you easily import data and map it to your heart’s content. Problem solved!

And then someone points out that for two thirds of the data visualizations you want to do, the maps are perfect, but for the other third you need a different kind of map – and the easy-to-use tool doesn’t make those kinds of maps.

Even more likely, you run into a snag with the data. The maps would be a lot more useful if you could merge in some census data, only the tool can’t do that. Or it can only merge in really small data sets, or census data that’s in another format, or…

Okay, you say, we can come up with a workaround. Maybe there’s another tool you can use to merge or reformat data before you import the data into your great little mapping tool. The catch: this new tool isn’t as easy to use. Someone needs to learn how to run it from the command line, decipher the obscure documentation, and make sense out of the cryptic, passive-aggressive error messages the tool spits out whenever you make a tiny mistake. So maybe instead, you run some of your data through Excel, and then use this other little utility, and then clean some of the data by hand, and then add some other ugly little duct tape hack…

In short, using the “easy” tool has morphed into a complicated, painful process that nobody will remember how to do three months from now when you need to create another set of maps.

Or worse yet, there is no obvious workaround – now you need a programmer.

We don’t think it makes sense to keep building a data science landscape where there are easy-to-use tools but when you need to go a little beyond what they can do you fall off a cliff. Instead, we think it’s time to take a page from the world of cooking.

The world of cooking isn’t divided into people who can only microwave a hot pocket and Master Chefs. Instead there’s a continuum of experience.

A lot of people start off their cooking journey by just being able to microwave a TV dinner or add milk to cereal. Then maybe they learn how to make mac & cheese from a box, scramble eggs in a pan, or throw together a simple salad. Then maybe they learn how to grill burgers or bake chocolate chip cookies from the recipe on the chocolate chip package. Then they build up a repertoire of a handful of go-to recipes. Or maybe early on one of their relatives teaches them how to make simple versions of the foods that are part of their heritage. Or maybe they decide they’re going to take a class.

There are lots of paths you can take to learn to cook. And not everyone gets to or needs to get to the same level of skill; plenty of people can take care of their culinary needs without becoming an expert cook.

But what most of these paths have in common is that there’s a continuum. If you want to do more than microwave hot pockets, you don’t have to enroll in the Culinary Institute of America. You can take baby steps towards getting more skill based on what kind of food interests you the most and how much time & effort you want to put into it.

We think the world of data science would be a lot more accessible – and a lot more diverse – if it were built around a continuum of tools that, like cooking, let folks take baby steps towards getting more skill as they had the need and time for it.

What exactly would this continuum look like? We’re not sure. That’s what Data Chefs aims to find out.