If you’ve read our blog before, you know we talk a lot about what industries (from healthcare to the NBA) can do with properly structured and stored data. But what kind of data are we talking about, and why exactly is it so hard to access, organize, and store?
The truth is, there’s no single reason why large organizations fall into data disarray, because data is constantly flowing from every corner of their operations, and different sources carry different constraints. Some sources produce millions of data points every hour and require robust engineering operations just to keep under control. Others are smaller but contain federally protected information that can carry severe penalties if accessed outside of narrowly defined circumstances. Knowing the root causes of why your data is playing hard to get is the first step in understanding how to get your house in order.
We’ve identified 14 types of data that present enormous opportunity but are incredibly difficult to access because of their 1) inherent properties, 2) corruption from human interaction, or 3) manner of change.
Difficult by Inherent Property
- Internet of Things (IoT) - Much like clickstream data, IoT generates a large quantity of data at a very fast rate. Cisco estimates that by 2020, the annual amount of data generated by connected devices will be approximately 600 zettabytes (or 600 trillion gigabytes). Combine this with a variety of manufacturers, platforms, and protocols, and the complexity of managing this data moves beyond simply storing it (already an overwhelming issue) to accessing and purposefully using it in a meaningful timeframe.
- Healthcare/HIPAA - The 1996 Act protecting health record privacy is simultaneously a strong ally of patient privacy and an enormous hurdle for anyone innovating in the healthcare sector. HIPAA compliance requires a significant investment in infrastructure and expertise simply to receive healthcare records without risking a breach and exposure to multi-million-dollar lawsuits. A really interesting company in Baltimore named Protenus is building an entire business around the monitoring and visualization of HIPAA offenses. Want to know who the worst violators are? Cardiologists (seriously!).
- Financial - Financial records can be difficult for two reasons. The first is the sheer volume and velocity of record generation, which pushes the limits of existing engineering capabilities. Setting aside the data generated by the recent rise of algorithmic/flash trading (where real-time means microseconds), everyday commercial transactions (any given purchase at any given store) generate a staggering amount of data. In 2016, Mastercard was estimated to average 160 million transactions per hour. The second reason this data is difficult to take advantage of is federal regulation under the Gramm–Leach–Bliley Act of 1999. Like HIPAA, this Act places strict limits on the sharing of customer data. With everything now digital, what was once a purely regulatory problem is also an engineering one, as these records can be accessed broadly and carry greater risk if leaked. And while this might seem like a problem exclusive to banks, the types of organizations affected include non-bank mortgage lenders, real estate appraisers, loan brokers, debt collectors, tax return preparers, and real estate settlement service providers.
Difficult by Human Interaction/Choice
- SaaS Silos - When you’re picking out a tool to solve a specific need, your focus is understandably on solving that need. If you want a CRM, you’ll assess how well Salesforce or HubSpot helps you keep everything in line, probably in large part via the interface. But what typically goes overlooked is how well you can report OUT of the tool once you’ve been using it for a while. In the best scenarios, data export is available via an API, but these APIs are often not as descriptive or robust as they could be, given that the companies you’re paying typically have little incentive to help you become independent of their platforms.
- Customer-Inputted - Especially for insights companies, collecting “first-person data” from customers can enhance your ability to deliver value and increase customer stickiness. However, whenever humans are involved, opportunities for data inconsistency and corruption arise. Are they from New York or New York City or NYC or New York, New York? Is their shoe size 9.5, 9 ½, or nine and a half? The number of irregularities that arise from human input is bound only by the number of questions you ask.
- Cross-organizational - Data can also come from other departments, a parent/sister/child company, or even a “semi-hostile” internal team, and the politics can get divisive. Say you’re the team in charge of Coke Zero and you’re charged with improving sales over the next year. How incentivized would you be to share data with Diet Coke to see if there are any trends between the two products’ performances? It’s all Coca-Cola, isn’t it? But what if the Diet Coke team uses this data against you to boost their sales and make you look like an idiot? Maybe you shouldn’t share that data after all. Now imagine you’re in charge of all sales of all products, regardless of sub-brand: how does the last scenario strike you? Is Coca-Cola as an organization doing as well as it could without this data collaboration? Are you comfortable letting people with their own agendas decide whether or not to share their data?*
- Open Data - This shouldn't be hard, right? It's called open data, after all. Sadly, the name is often a misnomer, as this data can come in the most frustrating and incomprehensible formats. Much like SaaS platforms, open data sources do not always have the actual use of their data front of mind. Especially when coming from a non-profit or public entity, these datasets are often published as part of a regulatory procedure and, while valuable, can be difficult to parse. Don’t believe me? Check out the European Union Statistical Office’s database: it’s a minefield of complexity. While free, the effort and time required to access it risk outweighing the actual value it can provide.
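Customer-input irregularities like the ones above (New York vs. NYC, 9.5 vs. nine and a half) are usually tamed by canonicalization: profile the variants you actually receive, then collapse them onto one canonical value. Here's a minimal Python sketch; the alias table, word list, and function names are our own illustration, not any standard library:

```python
import re

# Hypothetical alias table: every variant observed in the wild maps to one
# canonical value. In practice you build this by profiling your own data.
CITY_ALIASES = {
    "new york": "New York, NY",
    "new york city": "New York, NY",
    "nyc": "New York, NY",
    "new york, new york": "New York, NY",
}

def normalize_city(raw):
    """Collapse free-text city entries onto a canonical spelling."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return CITY_ALIASES.get(key, raw.strip())

SIZE_WORDS = {"eight": "8", "nine": "9", "ten": "10"}  # extend as data demands

def normalize_shoe_size(raw):
    """Turn '9.5', '9 1/2', '9 ½', or 'nine and a half' into 9.5."""
    s = raw.strip().lower().replace("½", "1/2").replace("and a half", "1/2")
    s = " ".join(SIZE_WORDS.get(tok, tok) for tok in s.split())
    m = re.match(r"^(\d+(?:\.\d+)?)\s*(1/2)?$", s)
    if not m:
        raise ValueError(f"unparseable shoe size: {raw!r}")
    return float(m.group(1)) + (0.5 if m.group(2) else 0.0)
```

The key design choice is to normalize at ingestion time while keeping the raw value, so a bad mapping can always be corrected later.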
Difficult by Process (Lost in Translation)
- Algorithmic - This is not only hard to generate, but also hard to use. Financial records are once again relevant here, though other data falls into this category as well. According to Mastercard, every transaction they process is subject to an estimated 1.9 million rules and algorithms. So you have the canonical record of the transaction itself occurring, but then you have literally millions of sub-values being generated to inform models, dashboards, alerts, and even more algorithms that in turn inform models, dashboards, alerts, and even more algorithms that, in turn...you get the idea.
- Cleaned Data - Cleaning data (combining sources, applying scoring algorithms, and so on) creates new data from your existing data, and whenever there is human interaction, there is opportunity for corruption or invalidity. Various cleaning algorithms can be run to bring your data into greater uniformity, but this requires version control at each step in case you need to return to the original state for reproducibility. A newer and growing trend is to use machine learning to establish cleaner master data. One exciting prospect in this area is the ActiveClean project out of Berkeley, Columbia, and Simon Fraser universities.
- Model generation, version control, etc. - One of the most overlooked elements of strong data science operations is reproducibility of algorithms and outputs. Like any software development, this requires identical environments for the algorithm to run in, but it also requires identical inputs. If you work with a dataset in a way that alters its original form (e.g. normalizing, cleaning, enriching, creating features) and you don't keep a copy of that original form, you're unwittingly destroying your ability to reproduce results down the line. This is doubly true when working on a larger team with many people touching the original dataset and models. In cases like that, simply logging user interactions to create some sort of provenance for the data is, ironically, a new dataset that you now also need to track. For more information, we recommend Google Research's hilariously titled paper: Machine Learning: The High-Interest Credit Card of Technical Debt.
- Analytics, correlations, cohorts - Good insights are all about context and, as important as the raw data is to capture, it’s just as important to view this data’s relationship to other data or to itself at a previous time. It’s interesting to see that you sold X widgets today. It’s more interesting to see that 70% of those purchasing are repeat customers. It’s even more interesting to see that, of those 70%, the single largest purchasing cohort joined 4 months ago. And it’s even more interesting than THAT to realize that, on average, their cart purchase is trending upwards. And it’s even more interesting than THAT to realize that nearly all of them signed up via an ad you ran on Facebook. To see the whole story, you need all the variables and you need them over time.
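A low-tech defense against the lost-original problem described above is to snapshot and fingerprint the raw dataset before any transformation runs. The sketch below is our own illustration (the `snapshot` function, archive layout, and manifest format are invented for this example, not an established tool):

```python
import hashlib
import json
import shutil
import time
from pathlib import Path

def snapshot(dataset_path, archive_dir="raw_snapshots"):
    """Archive the untouched dataset and record its SHA-256 digest, so any
    model or analysis can later be traced back to the exact bytes it saw."""
    src = Path(dataset_path)
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    # Prefix the copy with part of its hash so distinct versions never collide.
    dest = archive / f"{digest[:12]}_{src.name}"
    shutil.copy2(src, dest)
    # Append a provenance record; this manifest is itself data worth tracking.
    entry = {"source": str(src), "sha256": digest,
             "archived_as": str(dest), "timestamp": time.time()}
    with (archive / "manifest.jsonl").open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return digest
```

Calling `snapshot("orders.csv")` before your first cleaning step costs one file copy and buys you an immutable reference point for every downstream result.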
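The layered questions above become cheap to answer once each raw event carries the needed context (customer identity, signup date, cart value). A toy Python sketch with invented purchase records, showing two of the cuts described: repeat-customer share and average cart by signup cohort:

```python
from collections import Counter, defaultdict

# Toy purchase records (hypothetical fields and values, for illustration only).
purchases = [
    {"customer": "a", "signup_month": "2023-01", "cart": 40.0},
    {"customer": "a", "signup_month": "2023-01", "cart": 55.0},
    {"customer": "b", "signup_month": "2023-03", "cart": 30.0},
    {"customer": "c", "signup_month": "2023-03", "cart": 35.0},
    {"customer": "c", "signup_month": "2023-03", "cart": 45.0},
]

# Share of purchases made by repeat customers (more than one purchase).
counts = Counter(p["customer"] for p in purchases)
repeat_share = sum(1 for p in purchases if counts[p["customer"]] > 1) / len(purchases)

# Average cart value per signup cohort.
by_cohort = defaultdict(list)
for p in purchases:
    by_cohort[p["signup_month"]].append(p["cart"])
cohort_avg = {month: sum(carts) / len(carts) for month, carts in by_cohort.items()}
```

Add a time dimension (purchase date) to the same records and the trend questions, like whether a cohort's cart value is rising, fall out of the same grouping pattern.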
No matter why your data is difficult, once you understand the “game” it’s playing, it can’t hide from you. You can identify the right steps to chase it down, organize it and—ultimately—use it proactively to drive every business decision.
*It’s worth noting that this scenario is purely hypothetical; we have no specific reason to think that Coke Zero people hate Diet Coke people. (But they probably do.)
Ready to build your data workflows with Airflow?
Astronomer is the data engineering platform built by developers for developers. Send data anywhere with automated Apache Airflow workflows, built in minutes...