The Pareto principle, also known as the 80-20 rule, states that roughly 80% of consequences come from 20% of causes, making the remaining causes far less influential.
People who work with data may have heard a different version of the 80-20 rule: a data scientist spends 80% of their time cleaning up data rather than actually analyzing it or generating insights. To get a feel for the effect, imagine a 30-minute drive stretched by traffic jams into two and a half hours.
That leaves data scientists as little as 20% of their time for analysis. Turning a mess of raw data into an easily accessible dataset is still a lot of work: removing duplicate records, correcting mis-formatted entries, and other preparatory chores.
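A minimal sketch of that preparatory work, using pandas on a small hypothetical export (the column names and values are illustrative, not from the article):

```python
import pandas as pd

# Hypothetical raw export: a duplicated row, and a number stored as text.
raw = pd.DataFrame({
    "customer_id": [1525, 1526, 1526, 1527],
    "spend": ["100.0", "250.5", "250.5", "three hundred"],
})

clean = (
    raw.drop_duplicates()  # remove exact duplicate rows
       # coerce the text column to numbers; unparseable entries become NaN
       .assign(spend=lambda df: pd.to_numeric(df["spend"], errors="coerce"))
)

print(clean)
```

Entries that could not be parsed end up as NaN, so they can be reviewed or dropped explicitly instead of silently corrupting later calculations.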
Anaconda's recent survey found that this workflow stage consumes about 45% of total time. CrowdFlower had previously estimated 60%, and other surveys report similar figures.
This is not to suggest that data preparation isn't important. "Garbage in, garbage out" is a well-known rule in computer science circles, and it applies to data science as well. In the best case, the script simply returns an error, warning that it cannot calculate average client spending because customer #1527's entry is recorded as text rather than a number. In the worst case, the company acts on insights that do not reflect reality.
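The "best case" failure can be sketched in a few lines (the customer IDs and amounts are made up for illustration):

```python
# One entry is stored as text instead of a number, as in the
# hypothetical customer #1527 example.
spending = {1525: 100.0, 1526: 250.5, 1527: "250.5"}

try:
    average = sum(spending.values()) / len(spending)
except TypeError as err:
    # The script stops loudly instead of producing a silently wrong number.
    average = None
    print(f"Cannot calculate average spending: {err}")
```

The worst case is subtler: if the bad value were silently skipped or misread, the average would be computed without complaint and the company would act on a figure that is simply wrong.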
It is worth asking whether re-formatting customer #1527's entry is the best use of a highly paid expert's time. By various estimates, the average data scientist earns between $95,000 and $120,000 a year, so tedious cleanup wastes both the employee's time and the company's money. Real-world data also goes stale: if a project takes too long to gather and process, the data can be outdated by the time it is used.
Companies also lean on non-data-focused employees to collect data, asking them to fetch and produce datasets instead of doing their normal work. Since companies often use less than half of the data they collect, much of that effort is wasted, on top of the operational delays and losses it causes.
Meanwhile, the data science team rarely reviews everything that is collected; they are simply too busy.
All about data, and all for data
These issues all point to the same fact: apart from data pioneers like Google and Facebook, companies are still figuring out how to re-imagine their businesses for the data-driven age. Data scientists pull data into large databases and spend their days cleaning it up, while the people who fetched that data see little benefit from it.
Data transformation is still in its infancy. The tech giants that made data a core part of their business models lit a fire that is only beginning to spread. Mixed results so far are a sign that most companies have yet to master data thinking.
Businesses know that data is valuable, and even non-tech companies are eager to hire AI specialists. But companies can only succeed if they focus on people as much as on AI.
Data can improve the operation of almost any part of an organization. It may seem appealing to imagine a future with a machine-learning model for every business process, but we don't need to go that far. Any company that wants to tap its data should aim to get it from point A to point B. Point A is the place in the workflow where data is collected; point B is the person who needs that data to make a decision.
Importantly, point B doesn't have to be a data scientist. It could be a manager designing a better workflow, an engineer looking for flaws in a manufacturing process, or a UI designer running an A/B test on a new feature. All of these people need the relevant data at their fingertips so they can draw the right insights and make sound decisions.
Data can be just as useful to people as models are, especially if the company invests in teaching basic analytics skills. In this approach, accessibility must be the goal.
Skeptics may claim that big data is a fad, but advanced analytics can improve any company's bottom line, provided there is a clear plan and realistic expectations. The priority should be making data clean and accessible, not hoarding as much of it as possible.
In other words, a company's overall data culture matters as much as its data infrastructure.