On Salutary Obfuscation

Last week, a map I made about swearing on Twitter gained its fifteen minutes of Internet fame. I heard a lot of comments on the design, and one thing that many of the more negative commenters (on sites other than mine) were displeased by was the color scheme. It was, they said, very hard to distinguish between the fifteen different shades of red used to indicate the profanity rate. This complaint was probably a good thing, because I did not particularly want readers to tell the shades of red apart and trace them back to a specific number.

In designing the map, I took a couple of steps that made it more difficult for people to get specific data off of it. Before I can explain why I would want to do this, you need a quick, general background on how the map was made.

This is a map based on a very limited sample of tweets. Twitter will give you a live feed of tweets as people post them, but only about 10% of them, chosen randomly. On top of that, I could only use tweets that are geocoded, which means the user had to have a smartphone that could add a GPS reading to the tweet. A third limitation was that I could only use tweets flagged as being in English, since I don’t know any curse words in other languages besides Latin. Finally, there were occasional technical glitches in collecting the data, which caused the program my colleagues and I were using to stop listening to the live feed from time to time. Add those four limitations up, and it means I made use of somewhere between about 0.5% and 1% of all tweets posted in the US during the time period analyzed. Possibly not a strongly representative sample, but still a large one, at 1.5 million data points.
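For the curious, the filtering amounts to something like the sketch below (not my actual collection code; the dictionary field names here are just illustrative stand-ins for what the feed provides): keep a tweet only if it carries coordinates and is flagged as English.

```python
def keep_tweet(tweet):
    """Return True if a tweet survives the filters described above."""
    if tweet.get("coordinates") is None:   # must carry a GPS reading
        return False
    if tweet.get("lang") != "en":          # must be flagged as English
        return False
    return True

# A few toy tweets to show the filter in action.
sample = [
    {"coordinates": (43.07, -89.40), "lang": "en", "text": "hello"},
    {"coordinates": None,            "lang": "en", "text": "no GPS"},
    {"coordinates": (40.76, -111.89), "lang": "es", "text": "hola"},
]
kept = [t for t in sample if keep_tweet(t)]   # only the first survives
```

The 10% random sample and the collection glitches aren’t shown; those happen upstream, before any tweet reaches this filter.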

In that limited sample, I searched for profanities. This is based on my subjective assessment of what counts as a profanity (as many readers sought to remind me), and the simple word searches I did may have missed more creative uses of language. Once I had the number and location of profanities, I could start to do some spatial analysis. I didn’t want to map simply the number of profanities, because that just shows where most people live, not how likely they are to be swearing. So, I set up some calculations in my software so that each isoline gives the number of profanities in the nearest 500 tweets, yielding a rate of profanity instead of a raw total. Unfortunately, for places that are really sparsely populated, like Utah, the algorithm had to search pretty far, sometimes 100 miles, to get 500 tweets, meaning the lines you see there are based partially on swearing people did (or didn’t do) in places far away. If I hadn’t done this, there would be too few data points in Utah and similar places to get a good, robust rate (counting the number of profanities in 10 tweets is probably not the most representative sample; we need something much bigger to be stable). Maybe I should have just put “no data” in those low areas, but that’s another debate.
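The rate calculation itself is easy to sketch: for each location being evaluated, find the k nearest tweets and report the share of them containing a profanity. (My actual analysis ran in GIS software with k = 500; the brute-force distance search and the tiny made-up data set below are only stand-ins to show the idea.)

```python
import math

def profanity_rate(point, tweets, k=500):
    """Share of the k tweets nearest to `point` that contain a profanity."""
    def dist(t):
        return math.hypot(point[0] - t["loc"][0], point[1] - t["loc"][1])
    nearest = sorted(tweets, key=dist)[:k]   # brute-force nearest-k search
    return sum(t["profane"] for t in nearest) / len(nearest)

# Toy data: 20 tweets spaced along a line, every fifth one profane.
tweets = [{"loc": (float(i), 0.0), "profane": i % 5 == 0} for i in range(20)]
rate = profanity_rate((0.0, 0.0), tweets, k=10)   # rate in the nearest 10
```

Note how the nearest-k search behaves exactly as described above: in a sparse area it simply reaches farther and farther out until it has collected k tweets, borrowing data from far away.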

So, the map is based on a limited sample of tweets, and the analysis requires some subjective judgments of what’s a swear word, and then some heavy smoothing and borrowing of data from areas nearby in order to get a good number. What all that means is: you shouldn’t take this as a really precise assessment of the swearing rate in your city. If I had chosen to look for different words, or if the Twitter feed had sent a different random 10% of tweets, or if I had chosen to search profanities in the nearest 300, rather than 500, tweets, then the map would end up looking different. Peaks would drift around some and change shape. But my feeling is that the big picture would not change significantly. Under different conditions, you’d still see a general trend of more profanity in the southeast, a low area around Utah, etc. The finer details of the distribution are the shakiest part.

Okay, back to my main point about trying to make it difficult to get specific numbers. What I wanted readers to do is focus on that big picture, which I think is much more stable and reliable. And so I made some decisions in the design that were intended to gently push them toward my desired reading. First off is that color scheme: the changes between each level of swearing are so small that it’s hard to look at your home and tell if it’s in the zone for 12 or 13 or 14 profanities per 100 tweets. What’s important is that you know your home is at the high end. Whether it measured at 12 or 14 doesn’t matter, because that number is based on a lot of assumptions and limitations, and is likely to go up or down some if I collected on a different day. The color scheme makes the overall patterns pretty clear — bright areas, dark areas, medium areas, which is where I want the reader to focus. It’s weaker in showing the details I would rather they avoid.
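To give a sense of why neighboring classes blur together, here’s a toy sketch that interpolates fifteen shades between a light and a dark red. (The endpoint colors are made up for illustration; they aren’t the ones on the map.) Neighboring shades end up differing by only about 10 to 15 in each RGB channel, which is hard to tell apart by eye.

```python
def ramp(start, end, n):
    """n evenly spaced RGB shades interpolated from start to end."""
    return [
        tuple(round(s + (e - s) * i / (n - 1)) for s, e in zip(start, end))
        for i in range(n)
    ]

shades = ramp((255, 224, 210), (120, 10, 10), 15)   # fifteen reds
# The per-channel difference between adjacent classes is small:
step = tuple(a - b for a, b in zip(shades[0], shades[1]))
```

Spread the same two endpoints over five classes instead of fifteen and the steps triple in size, which is exactly the trade-off: fewer classes are easier to read back to a number, and that’s what I was trying to avoid.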

The other thing I did was to smooth out the isolines manually. The isolines I got from my software had a very precise look to them: lots of little detailed bends and jogs, which make it look like I knew exactly where the line between 8 and 9 profanities per 100 tweets fell. That lends an impression of precision at odds with the reality of the data set, so I generalized the lines to look broader and more sweeping. Their exact location is not entirely reflective of reality anyway, so there’s no harm in moving them a bit; they would shift around quite a bit on their own if the sample had been different.

Original digital isolines

Manually smoothed in Illustrator
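I did my smoothing by hand in Illustrator, but the same softening can be had algorithmically. Here’s a sketch using Chaikin’s classic corner-cutting method (not what I actually used, just one way to get the effect): each pass replaces every segment of the polyline with two points a quarter and three-quarters of the way along it, which rounds off the jagged corners.

```python
def chaikin(points, iterations=2):
    """Soften a polyline by corner-cutting: each pass replaces every
    segment with its 1/4 and 3/4 points, keeping the two endpoints."""
    for _ in range(iterations):
        smoothed = [points[0]]
        for (x1, y1), (x2, y2) in zip(points, points[1:]):
            smoothed.append((0.75 * x1 + 0.25 * x2, 0.75 * y1 + 0.25 * y2))
            smoothed.append((0.25 * x1 + 0.75 * x2, 0.25 * y1 + 0.75 * y2))
        smoothed.append(points[-1])
        points = smoothed
    return points

# A deliberately jagged line, softened with two passes.
jagged = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0), (3.0, 2.0)]
smooth = chaikin(jagged)
```

Note that this does the opposite of what the software’s precise output implied: it deliberately throws away positional detail that the data couldn’t support in the first place.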

This is a subtler change, but I hope it helped make the map feel a bit less like 100% truth and more like a general idea of swearing. Readers have a rather frightening propensity for assuming what they see in a map is true (myself included), and I’d rather they not take the little details as though they were fact.

If I had it to do over again, I probably would have made it smaller (it’s 18″ x 24″). Doing it at 8.5″ x 11″ would have taken the small details even further out of focus and perhaps kept people thinking about regional, rather than local, patterns. Maybe I shouldn’t have used isolines at all, but rather a continuous-tone raster surface. There are many ways to second-guess how I made the map.

Anyway, the point I mostly want to make is that it’s sometimes preferable to make the design focus on the big picture, and to do so you may need to obfuscate the little things. Certainly, though, a number of people were unhappy that I impaired their preferred reading of the map. People like precision, and they like getting specific information about places. But I didn’t feel I had the data to support that, and I would have been misleading readers if I had made it easy.