data visualization

Data Viz Revision: Maeve’s Westworld Attribute Matrix

This post contains spoilers about “Westworld

 

In episode 6 of HBO’s wildly popular drama “Westworld,” viewers got a brief look at the “Attribute Matrix” of Maeve, one of the host androids featured in the show (h/t reddit):

westworld-maeve-attribute-chart

The attribute matrix is a graph of the values assigned to each trait (on a scale from 0 (1?) to 20). The visualization itself is just a radar chart. I’ve reproduced a rough version below for better visibility:

ww_maeve_radar

Since I first got a glimpse of this back in episode 6, I’ve been thinking about better ways to visualize the Westworld hosts’ attributes. The biggest problem with using a radar chart is that there doesn’t seem to be any meaningful order or organization of the host attributes; the polygon carved out by the radar chart values is an arbitrary shape that could change drastically with a different attribute order.

Radar charts are sometimes used when comparing multiple attributes among different series of values.  In this example, the values of six different attributes are compared across several countries and the resulting polygons are laid on top of one another:

kap-lab-radar

Kap Lab (via Scott Logic)

In this next example, the concept of small multiples is used to compare the 12 NBA players who made the 2013 All Star Game for the Eastern Conference based on how they rank in 11 statistical categories:

moghadam-nba-all-star-east-2013

Rami Moghadam

In these 2 examples, the polygon shapes formed by connecting each series value make sense to compare in the context of the visualizations. They compare multiple bundles of values on a common scale. In the two examples above, those bundles are countries and NBA players, respectively.

But in the image from Westworld, only one host’s values–Maeve’s–are shown.  This removes the main advantage a radar chart has, namely, comparing multiple values across many series.

Given that, I decided on 4 potential revisions.

 

Option 1: Bar Chart

ww_maeve_bar

I decided to do a pretty standard bar chart for the first revision.  Gone is the unwieldy polygon of the radar chart; in its place is a series of bars, ranked from highest to lowest.  This allows the audience to more easily grasp the relationship among the attribute values.

I decided against the random order of the original attribute matrix or alphabetical ordering because they don’t really help when looking at a single host’s data.

 

Option 2: Bullet Chart

ww_maeve_bullet

This is like the previous bar chart, only with an added series showing the maximum value of 20.  The benefit of this one is that for each attribute, you can see how far the value is from the maximum, so it gives the effect of a bar filling up.  I like this one.

 

Option 3: Lollipop Chart

ww_maeve_lollipop

This one is similar to the bar chart, only with thinner bars and a filled circle at the end.  The lollipop looks a bit cleaner to me, probably because the bars take up less space.

(h/t Stephanie Evergreen for the Excel tutorial).

 

Option 4: Table with Conditional Formatting

ww_maeve_conditional-formatting

This final revision is just a table with the values ranked from highest value to lowest. I added shading created by conditional formatting based on the same ranking.  For that reason, the shading is redundant, but I like the look.

 

westworld-maeve-organizer

I think any one of these is preferable to the original radar chart.  Which one would you choose?  Is there another, more effective visualization that I’ve overlooked?

Data Viz Revision: O’Reilly 2016 Data Science Salary Survey (Part 3)

This post is part of a series based on the data displayed in O’Reilly’s 2016 Data Science Salary Survey. Using the Data Chefs Revision Organizer as a guide, we will rethink and revise some of the visualizations featured in the report.

In this visualization, the authors are trying to show the proportion of  survey respondents based on their location in specific regions of the world:

oreilly-world-region-original

The blue circles do not depict the underlying data in this map, as they did in the visualizations from the first two posts in this series.  Instead, the blue bubbles here are merely a stylistic choice: they serve as pixels representing the world’s land mass. The numeric values are then laid on top of their corresponding regions.

It’s important to note that while all the categories are regional, the units vary. Sometimes they refer to countries (e.g., the United States, Canada), sometimes to entire continents (e.g., Africa, Asia), and sometimes to vague regional groupings (e.g. Latin America). Given the inconsistency in the data categories, it’s no surprising that the visualization is a little unclear too.

One of the problems with this visualization is that the values are represented as numbers, so the reader does not immediately notice the difference between the size of the values.  If you move back a little bit or squint your eyes until you can’t quite read the exact values, there’s nothing that immediately distinguishes the highest value (United States) and the lowest (Africa). Both appear to be white text that takes up roughly the same amount of space on a blue grid.

As I considered how to revise this map, my first thought was to try to salvage the blue bubble theme by using blue bubbles sized based on the values and placed over a geographic map.  Here’s a mockup I did using carto:

oreilly-world-region-excel-map

And here’s one I did using PowerBI:

oreilly-world-region-powerbi-map

While you can immediately see the size difference in values on these revisions, this type of map still has the same issue as the original, namely, confusion caussed by inconsistent geographic categories.  What countries constitute “Latin America,” for instance? If we assume that a number of the Caribbean island nations are part of Latin America, then it seems a little weird that the value is placed in the middle of South America.  Using another example, respondents from Iceland probably fall under Europe/non-UK, but there’s a disconnect (literally), because the  value bubble is all the way in mainland Europe.

There’s also a secondary problem that arise from the limitations of the tools I used: PowerBI and carto. If you look in my examples, the bubbles are not sized consistently.  In both tools, it’s difficult to make bubble maps in which the size of the circles accurately reflect area, not diameter.  For these reasons, I ruled out the bubble map.

Next, I considered a part/whole visualization, like the ones in part 2, but the fact that there are eight distinct categories, and some of the values are relatively small, I knew that there would be issues seeing the smaller values and their labels.

So, ultimately, I settled on this revision:

oreilly-world-region

 

 

data-chefs-viz-revision-organizer-oreilly-world-region

It’s just a simple bar chart, with values ranked from highest to lowest.  The benefit of using this simple graph, rather than the map, is that it elimiates the confusion caused by the inconsistent units of the regional categories. Now, because we don’t see every country on this chart, we don’t worry about it.

This may not be as visually appealing as the original, but, sometimes, the simplest solution is the best solution.

Data Viz Revision: O’Reilly 2016 Data Science Salary Survey (Part 2)

This post is part of a series based on the data displayed in O’Reilly’s 2016 Data Science Salary Survey. Using the Data Chefs Revision Organizer as a guide, we will rethink and revise some of the visualizations featured in the report.

In this post, I want to focus on the visualization for the share of survey respondents by self-reported age category:

oreilly-age

Again, the authors used the arcing blue circle theme to depict the breakdown by age category.  On the plus side, the data labels are consistently placed, all falling along the bottom-right of each value circle (or the inside of the arc), and the order is intuive: youngest to oldest. Also, the circles appear to be sized properly by area (as opposed to diameter).

Using circles is not necessarily a bad way to depict category data, but doing so has some limitations. The main drawback is that by using distinct circles, you lose the relation of each part to the whole.

 

data-chefs-viz-revision-organizer-oreilly-age

For this data, I propose using a form of visualization in which the part/whole relationship is central: pie chart, donut chart, waffle chart, or stacked 100% bar chart, shown below:

oreilly age revisions.png

The biggest downside to using these part/whole visualizations is that there isn’t a lot of room to label smaller values.  For that reason, I created a legend for all the values in each graph.

And, although this isn’t a problem with the visualization itself,  if you pay attention to the values in the original, you’ll see that they add up to greater than 100%: 101% to be exact. What probably happened is that more than one value was rounded up, giving the total an extra full percent.  In my revisions, I changedthe value for the 41-50 category, from 16% to 15% so that the values would sum to 100%. This was a compltely arbitrary choice because I had no access to the raw data to know exactly how they were rounded.

I think any one of these would work in place of the original.  Thoughts?

 

 

 

Data Viz Revision: O’Reilly 2016 Data Science Salary Survey (Part 1)

We will be posting a series based on the data displayed in O’Reilly’s 2016 Data Science Salary Survey. Using the Data Chefs Revision Organizer as a guide, we will rethink and revise some of the visualizations featured in the report.

 

I recently read  O’Reilly’s 2016 Data Science Salary Survey (by John King & Roger Magoulas). People who worked in the field of Data Science answered questions about their job titles, age, salaries, tools, tasks, etc., and this report summarized the results.  I thought the report offered a pretty fascinating overview of the data science industry, and is definitely worth the read.

However, I was a little thrown off by the choices the authors made in visualizing the data.  Here is a selection of representative pages:

oreilly-circle-theme

As you can see, King & Magoulas opted to use a series of blue circles to represent the data throughout the report.  While the circles provide a common visual theme, I don’t think they best represent this particular data.

One example is the visualization for tasks: work activities in which the data science survey respondents reported major engagement:

oreilly-tasks

The values are displayed as circle areas, sorted from highest to lowest, starting from bottom-left and curving clockwise around to the bottom-middle.  The relative sizes of the circle areas seem to be accurate., but notice the positioning of the labels on the circles.  From 69% down through 36%, the data and category labels are consistently positioned to the right of each circle.  From 32% on down, the data label placement starts to get inconsistent: left sometimes, right other times, based on space constraints.

This space constraint also forces the authors to alter the positioning of the value circles.  In order to fit the long text of the categories, the bottom right side of the arc had to be squashed. This gives the visualization an odd, bean-like shape.

 

data-viz-revision-organizer-oreilly-tasks

The revision I’ve proposed, a horizontal bar chart, is a lot cleaner. The data labels are consistent: categories to the left of the bars, values to the right.  Also, the relative sizes of the bars are pretty clear. That’s not really the case with the circle values.

oreilly-tasks-revision

This bar chart may lack the novelty or the visual pop of the original, but I think it’s more appropriate for the data, and far easier to understand.

What do you think?

New Data Chefs Data Viz Revision Organizer

This post is part of an ongoing series about the Data Chefs Viz Organizer, a planning document designed to help people create visualizations from conception to end product.

 

Because the 1st Cohort of Data Viz Ambassadors (DVAs) had some success using the original Visualization Organizer, I was a little surprised when I realized that the Organizer wasn’t as helpful to the 2nd Cohort.

After some investigation, I figured out the issue: while the 1st Cohort mostly wanted to create visualizations from scratch; those in the 2nd Cohort were more interested in revising existing visualizations that weren’t up to par.  For them, the Question/Problem section of the Visualization Organizer just wasn’t necessary.

Compare the kinds of problems and questions from the 1st Cohort…

question3

…to the common problem of those in Cohort 2 who revised visualizations:

revision question.png

 

This is not the kind of in-depth, substantive problem/question that typically drives data inquiry.  But it made me think about the important functional task of revising data visualizations and sugested the need to tweak the template for those who are just revising them, not designing them from scratch.

I came up with this (click on the image for a PDF version):

Data Chefs Viz Revision Organizer blank

Compare the original from scratch Organizer (left) to this Revision Organizer (right):

data chefs org side by side

Again, because the Problem/Question section was not necessary for revising, I scrapped it altogether, along with the Assumptions section. I also beefed up the Chart/Visualization section, adding space to describe the old visualization and its drawbacks as well as the proposed revision and the rationale for the change.

Our hope is that people use this organizer to document their revision process.

Using one of the revisions we did here a while back, we’ve completed a Revision Organizer below (click image for PDF version):

Data Chefs Viz Revision Organizer NIMSP

What do you think? We’d love to hear your thoughts.

On Using “Racial” Color Categories in Data Visualization

ethnicity americas.png

 

Today I came across this map depicting the ethnic composition in the Americas (h/t Randy Olson).

It rubbed me the wrong way immediately, and I’m not talking about its use of multiple pie charts. It brought to mind these examples from Stephanie Evergreen (via Vidhya Shanker).

The first issue is that the categories themselves seem arbitrarily defined, and there is a conflation of ethnicity and race. For instance, “Mestizo,” “Mulatto,” “Garifuna,” and “Zambo” are all multiracial or multiethnic groups, yet, there is also an “Other, Multiracial, Mixed” category.

ethnic categories.png

Complicating matters further, as a result of legal classifications regulating those of African descent, in some places (e.g. the United States), a number of those categorized as “Black” have been of multiracial descent.  Moreover, the title refers specifically to “Ethnic Composition,” but “Black” and “White” are technically not ethnicities.

This gets me to the terms themselves. While the Spanish term “mulato” may still be acceptable, the English “mulatto” is definitely no longer considered appropriate (and is often considered a racial slur), yet there it is representing people in The U.S. and Canada.

Then, there is the most glaring problem, in my view: the use of symbolic racial category colors to represent the different groups.  I’m sure the thought behind it was to use immediately recognizable colors to limit confusion, but in data visualization, a good practice is asking if the benefits of familiarity outweigh the costs.

What are some of those costs? Specifically, using one stylized, but supposedly realistic “racial” color to represent each group brings us back to the earlier point about conflating race and ethnicity.  It also takes groups that contain people with a wide range of skin tones and represents each one using only one shade. This is not a problem when the colors are abstract (see this racial dot map featured below–no African Americans actually have green skin, for instance). But when it’s supposed to represent real people, it feels both reductive and exclusive.

racial dot map.png

And what are we supposed to make of the fact that most of the racial colors are “realistic,” except for the bright Red of “Native American” and the yellow of “East Asian, East Indian, Javanese” category?

There has also been some pushback against this kind of “familiar” color categorization with respect to sex and gender.

Bottom line: data visualization is all about deliberate choices and tradeoffs. When confronted with “sensitive” data, it’s a good idea to ask yourself, “Could the  choices I’ve made offend people?”

Let’s say you have an aversion to this kind of framing: to “offense” as a legitimate constraint.  That’s fine. In that case, I’d suggest you modify the question to “Could the presentation & classification choices I’ve made distract from the content?”

Thoughts?

Data Chefs Data Viz Organizer: Complete Example

This post is part of an ongoing series about the Data Chefs Viz Organizer, a planning document designed to help people create visualizations from conception to end product.

 

In an earlier post, I explored the problem and question section of the Data Chefs Viz Organizer and offered some examples.  In this post, I’d like to provide an example of a fully completed organizer template. I think this will demonstrate how the organizer can help guide the work would-be visualizers.

Click on the image below for the PDF version with working links.

 

committefromscratch

As I detailed in prior posts, the 3 major sections of the organizer-the ones shaded purple-are based on the the Junk Charts Trifecta Checkup (JCTC).  I don’t think it makes sense to limit “Assumptions” to any one of the 3 sections (Question, Data, and Chart), so I placed it on its own (number IV.).

Below are the final versions of both hexmap graphs.

This one for the raw number of Committe Members in each state:

 

…and this one for the difference between Actual and Expected Committee Members in each state:

 

Thoughts about the organizer and the example?  How could it be improved? Could you use this organizer help guide and document your process for creating a visualization from scratch?

Data Viz Revision: NIMSP’s Campaign Contributions Disclosure Scorecard

 

Earlier this year, the National Institute on Money in State Politics (NIMSP) created a brilliant campaign contributions disclosure scorecard depicting each US state’s grade.

They used Tableau to visualize the data as a geographic state choropleth and labeled each state with its scorecard grade (A through F):

Sunlight Scorecard

 

There are a few problems with visualizing this data in this way.  First, there’s no way to see Washington, DC on this map. Also, on this map, the grade labels are redundant because the legend already shows what color represents each grade. But the biggest problem is that using a geographic map creates scale issues: you have to scroll and/or zoom to see the small states in the Northeast (e.g., Delaware, Rhode Island), not to mention Alaska and Hawaii.

The scorecard map is showing which states got which overall grade, so it doesn’t matter how big each state is.  For the purposes of this data, the states have equal weight. To try to solve this scale issue, I decided to revise the map as an abstract state map, with every state depicted using a figure of the same size and shape (I was inspired by this post by Danny DeBelius from NPR’s Visuals Team).

I decided to make a square tile map in Excel, applying conditional formatting on the state squares to get the different colors.  I downloaded the data from NIMSP’s Tableau file, then used this awesome tutorial by Caitlyn Dempsey at GIS Lounge. Here’s what my Excel square tile map looks like:

NIMSP Excel Scorecard

 

I think the legend is a bit too big (It’s not that easy to make the cells smaller. You’d probably have to make all of the cells tiny, and merge them to get the bigger state squares).

Overall, the Excel tile map is fine, but I think it lacks a bit of visual pop.

Next, I decided to try make the same tile map in Tableau, only with hexagons instead of squares.   I followed Matt Chambers’ user-friendly tutorial at Sir-Viz-a-Lot and made the following map:

NIMSP Scorecard Hex Tableau.jpg

Just for kicks, I switched the tiles back to squares instead of hexagons, and got what looks like a keyboard.  With this tile configuration, I think the hexagons look much better.

NIMSP Scorecard Square Tableau.jpg

Finally, I incorporated a little from Keir Clark of Maps Mania, Andrew W Hill from the CartoDB Team, and the Danny DeBelius NPR post I mentioned earlier.  Here’s the result:

NIMSP Scorecard Hex CartoDB2.png

So, what do you think?  Which revision do you think is most effective?

 

Data Chefs Data Viz Organizer: Problems & Questions

 

This post is part of an ongoing series about the Data Chefs Viz Organizer, a planning document designed to help people create visualizations from conception to end product.

 

As I said in the last post, I created the Data Chefs Visualization Organizer to provide a little more guidance than the Junk Charts Trifecta Checkup (JCTC), especially for crafting the question driving the data visualization.  This organizer proved to be really useful for the first cohort, who built their visualizations from scratch.

Here are a couple (modified) examples of the question section of the organizer when completed:

question3

Again, the benefit of expanding on the JCTC is that students now have some context they can use when creating their questions; bracketing each question with a formulated problem and possible decisions to be made as a result of answering the question made for much more effective questions.

 

Data Chefs Data Viz Organizer: Building on the Junk Charts Trifecta

This post is the first in an ongoing series about the Data Chefs Viz Organizer, a planning document designed to help people create visualizations from conception to end product.

 

I work at a nonprofit, and I’ve recently started working to help the organization get better at data viz. One thing I’m doing is training small workgroups of Data Viz Ambassadors (DVAs—pronounced “divas”), to help spread best data viz practices from the ground up. In addition to learning about best practices, DVAs must produce a practical visualization using their own data.

My syllabus for the workgroup is strongly influenced by Kaiser Fung’s Junk Charts Trifecta Checkup (JCTC):

junkcharts trifecta

I’ve had the workgroup use the JCTC as an organizing tool because it puts the question—which is very important, but often overlooked—on equal footing with the data and the visualization.  The JCTC is ideal for analyzing visualizations, but most of the DVAs have struggled to figure out how to use the JCTC when producing their own visualizations.

When I realized that this was a common problem, I created the data viz template below to help with the process of planning and documentation (note how the Data Chefs organizer parallels the JCTC):

DataChefsVizOrganizer

The goal is to keep the question front and center, like the JCTC does, while providing a bit more structure to the process of creating and documenting data visualizations.  Hopefully, someone else in the organization can look at a completed organizer and understand (and even replicate) the process and the final data visualization.

I will be returning to this organizer in later posts, going into more detail on each component, and soliciting feedback on how to improve the document.

Initial thoughts on the organizer as a planning guide?