In 2011, the consulting firm McKinsey & Co. made headlines when it predicted that, in a mere seven years, the newly minted "Data Scientist" role would face a 200,000-person talent deficit. The prediction lent enormous credibility to the idea that the economy as a whole was becoming more data driven, and the study continues to be quoted today, appearing as recently as last December in TechCrunch. Although different research firms and publications give slightly different deficit figures and timelines, one message is consistent across all of them: we don't have enough people with the skills to do the job, and we won't anytime soon.
Fast forward to today to see how things have changed. For that, we look to the fine people at RedMonk, who have kept their language rankings project going. The chart below shows the rankings as they stood in January 2016.
What do language rankings have to do with the Data Scientist talent gap?
It's all starting to make a lot of sense that we're looking at a 200k-person deficit, isn't it? No wonder anyone with that entire skill set is colloquially termed a "unicorn."
1) The IDE (Integrated Development Environment)
2) The Libraries
An IDE is great, but it's only going to be as good as the algorithms you're able to create through it. The ability to efficiently create meaningful algorithms is key to any good data science stack and, in large part, depends on the available libraries and the communities behind them. The balance of powerful yet usable libraries (e.g., nltk, scikit-learn, scipy, numpy, pandas) that make statistical and machine learning tasks approachable is largely credited for Python's position as a pervasive and essential data science language. Just check out GitHub and you'll see how big the community is around these tools: NLTK has over 3,000 stars, pandas has over 6,000, and scikit-learn has over 11,000 (with nearly 600 contributors!). But look closely at NPM and you'll see a growing community of comparable packages being written in Node. Don't believe me? Check out this great presentation by Sean Byrnes, the founder of Flurry, at last year's Node Summit. Want to implement a multi-armed bandit algorithm? Use Percipio. How about importing a Bloom filter as a self-contained module? Use Bloom.js. While Python without question has the upper hand right now, the growing number of data science packages in Node makes it an increasingly feasible alternative.
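To give a flavor of what this looks like in plain Node, here's a minimal epsilon-greedy multi-armed bandit sketch. It's illustrative only (the class, parameters, and simulated conversion rates are our own, not Percipio's API), but it's the core logic that packages like the ones above wrap for you.

```javascript
// Epsilon-greedy multi-armed bandit: explore a random arm with
// probability epsilon, otherwise exploit the best-known arm.
class EpsilonGreedyBandit {
  constructor(numArms, epsilon = 0.1) {
    this.epsilon = epsilon;
    this.counts = new Array(numArms).fill(0); // pulls per arm
    this.values = new Array(numArms).fill(0); // running mean reward per arm
  }

  selectArm() {
    if (Math.random() < this.epsilon) {
      return Math.floor(Math.random() * this.counts.length); // explore
    }
    return this.values.indexOf(Math.max(...this.values));    // exploit
  }

  // Incrementally update the chosen arm's mean reward.
  update(arm, reward) {
    this.counts[arm] += 1;
    this.values[arm] += (reward - this.values[arm]) / this.counts[arm];
  }
}

// Simulate two ad variants, where variant 1 converts twice as often.
const bandit = new EpsilonGreedyBandit(2, 0.1);
const trueRates = [0.05, 0.1];
for (let i = 0; i < 10000; i++) {
  const arm = bandit.selectArm();
  bandit.update(arm, Math.random() < trueRates[arm] ? 1 : 0);
}
```

After enough trials, the bandit's traffic shifts toward the better-converting variant without you ever running a fixed-length A/B test.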
All that being said, building a meaningful algorithm is not a trivial task. Even with the right tools, it's far too easy to draw inferences that are not actually statistically significant or to employ a model inappropriate for your data's distribution. Unless your data team includes someone with advanced training in applied mathematics, you're taking a risk when you roll your own algorithm from these packages and trust that you know what you're doing. In response, many large tech companies are stepping in to help you outsource the algorithmic work that might require an expert eye. Google, IBM, and Amazon all have these products and, you guessed it, they all come with a Node SDK. We've personally used IBM's Bluemix for some internal projects and found it significantly cut our development time.
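A concrete example of the "not actually significant" trap: two metrics can differ without the difference meaning anything. The sketch below (our own illustrative code and sample data, not from any library mentioned here) computes Welch's t-statistic for two samples; as a rough rule of thumb, an |t| below about 2 means the observed gap could easily be noise.

```javascript
// Welch's t-statistic: how many standard errors apart are two sample means?
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function variance(xs) {
  const m = mean(xs);
  // Sample variance (n - 1 denominator).
  return xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1);
}

function welchT(a, b) {
  const se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
  return (mean(a) - mean(b)) / se;
}

// Two small A/B samples with a visible but noisy gap in means.
const variantA = [12, 15, 11, 14, 13, 16, 12, 15];
const variantB = [14, 17, 13, 18, 16, 15, 17, 19];
const t = welchT(variantA, variantB);
```

A sizable |t| suggests the gap is real; a small one says "collect more data before you ship." This is exactly the kind of judgment call that's easy to skip when the tooling makes fitting a model feel effortless.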
3) The Processing
4) The Processing Alternative
5) The Pipelines
Astronomer for Data Scientists
Of course, as anyone who's dabbled in data science knows, the key to discovering new intelligence is having reliable data in real-time—once data is in your hands, the fun stuff begins: processing, analysis, modeling, prediction, etc. If your organization wants to focus on insights, contact Astronomer, a platform for data engineering. We'll make it easy to get the data you need, wherever you need it, meaning insights are closer than you think.
Ready to build your data workflows with Airflow?
Astronomer is the data engineering platform built by developers for developers. Send data anywhere with automated Apache Airflow workflows, built in minutes...