The following is a continuation of a series where we use common data extraction, analysis, and machine learning techniques to make our business smarter. You can read Part 1 here.

The first post in this series kicked off the new blog and got a really incredible response (frankly, we were a little surprised). Since then, we’ve decided to ramp up content and make sure everyone on the team can contribute. What we’ve learned from reading a lot of posts is that good content draws from personal experience to become insightful and relatable.

But now it’s time for another nerdy data science post. Thank goodness. 😃  Here’s a preview of what’s at the end of the rainbow if you keep reading:

Part 2 begins

In our last “What I Learned…” post, I ended on a slightly boastful note. I think I said something along the lines of…

The really fun stuff begins when you start digging into sentence structures, keyword frequencies, sentiment, and readability.

This was close to the exact definition of hubris: an Icarian-level pronouncement that would clearly bring failure and humiliation, given that I’d never actually done any NLP (natural language processing). But it was out there, so I had to at least try. As our CEO, Ry Walker, likes to say, I was “iterating in the wild.” In Astronomer speak, that basically means jumping off a cliff and trying to build wings on your way down. (Kudos to Ray Bradbury for the metaphor. He really understood startups.) If I could find any pattern at all that helped us understand why some posts get shared more than others, I’d consider it a success.

So I Got to Work

I began with the basics. I found a useful guide titled “Getting Started with Natural Language Processing with Python” to walk me through NLTK (the Natural Language Toolkit, a popular Python package) and a nice accompanying cheat sheet titled “Text Analysis with NLTK Cheatsheet.” Not too creative but hey, it works.

I read about bag-of-words, TF-IDF, NLTK vs. OpenNLP, the Stanford vs. Berkeley parsers… the list goes on. It soon became clear that I wasn’t going to be able to create an algorithm I’d be confident using within a timeframe that mattered.

Then the project got shelved, because we’re a startup and text analysis wasn’t exactly helping the bottom line. Before I knew it, I had another blog post to write, and that’s when I saw something that hit me like lightning… overused metaphor aside.

The “Aha” Moment

Browsing LinkedIn one night (you don’t?), I found a post by Jonathan Pickard, founder of the business intelligence firm Analyzer 1. In it, he took President Obama’s 2015 State of the Union address and, using Watson (part of IBM Bluemix), analyzed it for personality traits. I loved this post and the concept of leveraging an established NLP system like Watson (built by the same project group at IBM that won Jeopardy! a few years back) to power this project. So I signed up for a Bluemix* account and started loading in data.

*Technically, I signed up for 10 Bluemix accounts. IBM caps Bluemix’s API at 1,000 calls/day on the free tier (and I didn’t have a budget), so I needed a few keys to process all my data.
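For anyone curious, the key juggling wasn’t fancy. A round-robin sketch like the one below is enough to keep each account under its daily limit (the key names are made up, and the cap constant just mirrors the free-tier limit):

```python
from itertools import cycle

# Hypothetical free-tier keys, one per Bluemix account.
API_KEYS = ["key-account-01", "key-account-02", "key-account-03"]
DAILY_CAP = 1000  # free-tier limit per key per day

class KeyRotator:
    """Hands out keys round-robin, retiring each one once it hits the daily cap."""

    def __init__(self, keys, cap=DAILY_CAP):
        self.calls = {key: 0 for key in keys}
        self.cap = cap
        self._ring = cycle(keys)

    def next_key(self):
        # Try each key at most once per request.
        for _ in range(len(self.calls)):
            key = next(self._ring)
            if self.calls[key] < self.cap:
                self.calls[key] += 1
                return key
        raise RuntimeError("Every key has hit its cap; wait for the daily reset.")

rotator = KeyRotator(API_KEYS)
key = rotator.next_key()  # use this key for the next API call
```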

The Stack

After our last post, a few people reached out to learn more about our process/tool stack. I’ll write a follow-up post about that at some point, but here’s a quick overview if you’re interested (if not, feel free to skip this section).

Setting up the NLP Engine

Getting the Values I Needed

Doing the Analysis

Plotly Graphs

Keywords, Keywords, Keywords

The first endpoint from AlchemyAPI we used was Keyword Extraction. Depending on the length of the post, it’ll give you up to 50 top keywords, each with its own sentiment and relevancy score. Overall? Pretty positive phrasing. Well done, people. Positive keywords outnumber negative keywords by almost 5:1.
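Here’s roughly what a call to that endpoint looked like. This is a minimal sketch of the old AlchemyAPI REST interface as it existed on Bluemix at the time (the service has since been folded into Watson Natural Language Understanding, so treat the URL and parameter names as historical):

```python
import requests

# AlchemyAPI's keyword-extraction endpoint, circa 2016 (now retired).
ENDPOINT = "http://access.alchemyapi.com/calls/text/TextGetRankedKeywords"

def extract_keywords(api_key, post_text, max_keywords=50):
    """Return a list of keyword dicts with relevance and sentiment scores."""
    response = requests.post(ENDPOINT, data={
        "apikey": api_key,
        "text": post_text,
        "outputMode": "json",  # the API defaulted to XML
        "sentiment": 1,        # attach a sentiment score to each keyword
        "maxRetrieve": max_keywords,
    })
    response.raise_for_status()
    return response.json().get("keywords", [])
```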

[Chart: sentiment distribution of extracted keywords across all posts]

It’s nice to have a basic understanding of the distribution, but we can (and should) always get a little more granular. Using some basic GROUP BY commands to consolidate the most commonly used keywords by company, and then a quick plyr.arrange in R to rank-order them by frequency, we get the following:

[Chart: top 10 keywords by frequency for one company]

Above are the top 10 keywords used by one company. Generally, there aren’t any HUGE surprises here, but it is interesting to see what they’re mentioning and how often. The real value will come as we track this list over time and begin to understand how it changes. What new technologies are suddenly getting mentioned? Which product features are being highlighted as most important?
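If you’d rather stay in Python than hop between SQL and R, the same two steps look roughly like this in pandas (a sketch; the file and column names are assumptions):

```python
import pandas as pd

# Assumed shape: one row per extracted keyword occurrence.
keywords_df = pd.read_csv("keywords.csv")  # columns: company, keyword, ...

# GROUP BY company/keyword to count mentions, then rank by frequency
# and keep the top 10 per company (the plyr.arrange step described above).
top_keywords = (
    keywords_df.groupby(["company", "keyword"])
    .size()
    .reset_index(name="mentions")
    .sort_values(["company", "mentions"], ascending=[True, False])
    .groupby("company")
    .head(10)
)
print(top_keywords)
```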

Sentiment and Readability

Beyond keyword inspection, we needed a more holistic view of the posts themselves. How positive/negative were the posts overall? How easy were they to read? How did this change across companies? Glad you asked.

Sentimental Fools

It may surprise you, but there are actually some posts (fewer than 100) with an overall negative tone. The majority, however, fall on the positive side, in a mostly normal distribution centered around 0.4–0.5 on AlchemyAPI’s -1 to 1 sentiment scale.

[Chart: count of posts by sentiment score]

Breaking it down by company doesn’t tell us much more. Some companies skew more positive than others, but there is generally still a normal distribution around 0.4–0.5 and few negative posts.

[Chart: sentiment histograms by company]

In the boxplots below, we can confirm that for most companies, the negative posts that do exist are statistical outliers. For others… well, maybe focus on being a bit more positive.

[Chart: sentiment boxplots by company]
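Charts like these are nearly one-liners with today’s Plotly API (the syntax was a bit different back then). A sketch, assuming a DataFrame with company and sentiment columns:

```python
import pandas as pd
import plotly.express as px

posts_df = pd.read_csv("posts.csv")  # assumed columns: company, sentiment

# One box per company; points beyond the whiskers are the outliers
# discussed above.
fig = px.box(posts_df, x="company", y="sentiment")
fig.show()
```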

How about Readability?

Now, we didn’t exclusively use Bluemix to perform this analysis. We also wanted to examine the readability of each post, and for that, all I needed was a Python package that implements the ‘Flesch Reading-Ease’ index. You can read more about it at the link I’ve provided. Essentially, as the average number of words per sentence and syllables per word increases, the score drops. The highest possible score is roughly 120 (achieved by a two-word sentence in which each word has one syllable), with no theoretical lowest score (as some sentences can go on and on and on and on and you get the idea). Any score lower than 50 is considered college-level or harder. Here’s the quick wiki reference we used:

90–100: Very Easy to Read (5th grade)
80–90: Easy (6th grade)
70–80: Fairly Easy (7th grade)
60–70: Standard / Plain English (8th–9th grade)
50–60: Fairly Difficult (10th–12th grade)
30–50: Difficult (College)
0–30: Very Difficult (College graduate)
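If you want to roll your own, the formula is simple enough to sketch directly. The syllable counter below is deliberately crude (packages like textstat do this properly):

```python
import re

def count_syllables(word):
    """Crude syllable estimate: count vowel groups, drop a silent final 'e'."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    count = len(groups)
    if word.lower().endswith("e") and count > 1:
        count -= 1
    return max(1, count)

def flesch_reading_ease(text):
    """206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

print(flesch_reading_ease("Go on."))  # short, simple sentences score near the ~120 max
```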

So, across all companies, how difficult are these posts to read?

[Chart: readability distribution across all posts]
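The difficulty buckets on the x-axis come straight from the reference table above; deriving them from raw scores is a one-liner with pandas (a sketch; the flesch column is an assumption):

```python
import pandas as pd

posts_df = pd.read_csv("posts.csv")  # assumed to include a 'flesch' column

# Bin edges and labels follow the Flesch reference table above.
bins = [0, 30, 50, 60, 70, 80, 90, 100]
labels = ["Very Difficult", "Difficult", "Fairly Difficult",
          "Standard", "Fairly Easy", "Easy", "Very Easy to Read"]
posts_df["difficulty"] = pd.cut(posts_df["flesch"], bins=bins, labels=labels)
print(posts_df["difficulty"].value_counts())
```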

There is a fairly normal distribution of reading difficulty, with a majority of posts falling in the “Fairly Difficult” category and an overall slight skew to the left. Notice that there are no “Very Easy to Read” posts, suggesting that we’re all at least smarter than a 5th grader. Now, let’s break it down by company.

[Chart: readability distributions by company]

This is where it starts to get really interesting, because we’re beginning to understand the complexity distribution across every competitor. Although readability has an overall normal distribution centered on “Fairly Difficult,” some companies skew towards easier-to-read posts and some towards more difficult ones. In the boxplots below, we see how reliably each company posts in certain difficulty ranges.

[Chart: readability boxplots by company]

The Social Vortex

Tying this all back to total shares (the end goal of all this analysis), we needed to dig into how factors like a post’s readability and sentiment contribute to the total shares it receives. And because I was using Plot.ly’s API to create charts, I was able to take the visual component of the analysis up a notch to help figure this out.

A Quick Word on Principles of Perception

Data visualization isn’t just a “nice-to-have”; it’s a measurably faster way to convey complex information.* I promise I’m not making this up. In the early 20th century, the Gestalt school of psychology established seven “Principles of Perception” (a.k.a. the Laws of Grouping) that describe how we cognitively interpret visual stimuli. They found that using these seven patterns could actually increase the speed at which subjects made connections and drew conclusions.

*If you’re interested in learning more and helping your team work faster through well-thought-out data viz, I recommend The Functional Art by Alberto Cairo as a starting point.

What I’m trying to say is that good viz can make insights much more pronounced. Case in point: what if I gave you a chart mapping the distribution of total shares by sentiment?

[Chart: total shares by sentiment]

And shares by readability?

[Chart: total shares by readability]

And what about readability by sentiment?

[Chart: readability by sentiment]

All somewhat interesting, but these charts don’t tell us much by themselves. What if I could combine all of them into one three-dimensional view?

Boom.

[Interactive 3D chart: The Optimal Social Zone]

Notice how, when we view this in three dimensions, a vortex-esque shape seems to form between a sentiment score of 0.2–0.6 and a readability score of 35–75. We call this the Optimal Social Zone. Posts with too many negative keywords (low sentiment) or overly complex language (low readability) aren’t among the top performers. Nearly all of the top-performing posts fall in this range, suggesting that while a post won’t get a lot of shares simply because it has those qualities, it’s unlikely to score well if it falls outside of them. A larger sample size and more rigorous analysis will be needed to confirm this, but it’s a really interesting early finding.
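If you want to build a view like this yourself, here’s a minimal sketch with today’s Plotly API (the file and column names are assumptions):

```python
import pandas as pd
import plotly.graph_objects as go

posts_df = pd.read_csv("posts.csv")  # assumed columns: sentiment, flesch, shares

fig = go.Figure(go.Scatter3d(
    x=posts_df["sentiment"],
    y=posts_df["flesch"],
    z=posts_df["shares"],
    mode="markers",
    marker=dict(size=4, color=posts_df["shares"], colorscale="Viridis"),
))
fig.update_layout(scene=dict(
    xaxis_title="Sentiment",
    yaxis_title="Readability (Flesch)",
    zaxis_title="Total shares",
))
fig.show()
```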

What’s also interesting is that you can see a clear pattern of higher shares as readability increases (i.e., as posts become easier to read). Sharing begins rising around a score of 35 (approximately college level) and peaks around 75, which corresponds to about a 7th-grade reading level.

[Chart: total shares rising with readability from 35 to 75]

Because there are many more less-frequently-shared posts, it’s a bit difficult to see how this pattern holds across the entire distribution. The top posts are easy to see, but what about the lower ones? Are these high-ranking posts outliers? Let’s log-transform total social shares to spread the distribution out.
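A log1p transform (the log of one plus the value, so zero-share posts stay defined) does the spreading. A sketch, with the same assumed columns as above:

```python
import numpy as np
import pandas as pd
import plotly.express as px

posts_df = pd.read_csv("posts.csv")  # assumed columns: flesch, shares

# log1p = log(1 + x); posts with zero shares map to 0 instead of -inf.
posts_df["log_shares"] = np.log1p(posts_df["shares"])

fig = px.scatter(posts_df, x="flesch", y="log_shares",
                 labels={"flesch": "Readability (Flesch)",
                         "log_shares": "log(1 + total shares)"})
fig.show()
```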

So What Did We Learn? (Part 2)

And… What’s Next?

That’s a really good question. The truth is, opportunities to effectively use data are endless. Around here, we like to quote Carl Sagan (a “lower-case a” astronomer): “Somewhere, something is waiting to be known.” Insights are out there—it just takes some data wrangling to find them. Astronomer’s platform takes the collection, processing and unification of your enterprise data off your hands, so you can get straight to analytics, data science and—more importantly—insights.

Want to talk about how your organization can leverage Astronomer to do better, more robust analytics? Contact us. Insights might be closer than you think.