Data Science and Machine Learning.
These buzzwords are heard everywhere, but not everyone understands how exactly they can be useful for business. Meanwhile, these disciplines solve quite pragmatic problems.
For example, we do predictive analytics: a class of Data Science methods used to forecast indicators that are important to the client. In this article, we explain how predictive analytics works and how it helps, for example, to estimate revenue and save money.
Say we have a large retailer with a very specific request: “I want to know where to open a new shop and how much revenue it will generate.”
Is it doable? It is.
First, we look at what data the customer already has. This is called internal data.
Internal data
The store usually already has some existing data points:
- assortment,
- turnover,
- sales area and so on.
Using only this data, we can train a model and try to predict, for example, the revenue of each outlet. We split the existing data 70/30, train the model on 70% of the data, and use the remaining 30% to check how accurately the model has learned to predict revenue.
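As a rough illustration of the 70/30 procedure, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the store records are synthetic, the only feature is sales area, the model is a simple least-squares line, and the helper names (`split_70_30`, `fit_line`, `mape`) are hypothetical. A real project would use richer data and a proper ML library.

```python
import random

def split_70_30(rows, seed=0):
    """Shuffle rows reproducibly and return (train, test) in a 70/30 proportion."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]

def fit_line(rows):
    """Ordinary least squares for revenue = a * area + b."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x

def mape(rows, a, b):
    """Mean absolute percentage error of the fitted line on the given rows."""
    return sum(abs((a * x + b) - y) / y for x, y in rows) / len(rows)

# Synthetic stores: (sales area in m^2, monthly revenue). Made-up numbers.
rng = random.Random(42)
stores = []
for _ in range(100):
    area = rng.uniform(50, 500)
    revenue = 5000 + 120 * area + rng.gauss(0, 2000)
    stores.append((area, revenue))

train, test = split_70_30(stores)
a, b = fit_line(train)  # the model only ever sees the 70%
print(f"train={len(train)} test={len(test)} MAPE on test={mape(test, a, b):.1%}")
```

The error is measured only on the 30% that took no part in fitting, which is what makes it an honest estimate of how the model behaves on new outlets.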
The problem is that the accuracy of such a model may not be high: it simply does not have enough data for training. In other words, if we only have internal data from stores, it may not be enough to predict with reasonable accuracy how much a store will make in a month.
What do we do in that case? We enrich the data, i.e. supplement what the customer already has with external data.
External data
There’s a lot of external data.
Weather, currency exchange rates, the SpaceX rocket launch schedule: all of this is external data in relation to our client.
Clearly, we do not need all of it, and we cannot obtain all of it either. At this stage an analyst joins us: someone well versed in the types and sources of external data who can give an expert opinion on which of them will be relevant. Before we develop a model, we do research that helps us understand which data will and will not be useful.
In the case of a store, it may be useful to find out, for example, what data is available for the area, which competitors are located nearby, and how much revenue the outlets in the area bring in.
Based on these hypotheses, we can pull in the external data and train the model on it.
Predictive power usually improves in this case. We can train the model several times, adding and removing data sets, to reach higher and higher accuracy.
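To make the idea of enrichment more concrete, the sketch below derives one external feature, the number of competitors within a given radius of each store, and attaches it to the internal records. The coordinates, field names, the 1 km radius and the helper `add_competitor_count` are all assumptions for illustration; the distance is computed with the standard haversine formula.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

def add_competitor_count(stores, competitors, radius_km=1.0):
    """Return copies of the store records enriched with 'competitors_nearby'."""
    enriched = []
    for store in stores:
        count = sum(
            1 for c in competitors
            if haversine_km(store["lat"], store["lon"], c["lat"], c["lon"]) <= radius_km
        )
        enriched.append({**store, "competitors_nearby": count})
    return enriched

# Hypothetical internal data (the retailer's own stores).
stores = [
    {"id": "S1", "lat": 52.5200, "lon": 13.4050, "sales_area_m2": 300},
    {"id": "S2", "lat": 52.4000, "lon": 13.1000, "sales_area_m2": 150},
]
# Hypothetical external data (competitor locations from some public source).
competitors = [
    {"lat": 52.5210, "lon": 13.4060},  # a short walk from S1
    {"lat": 52.5250, "lon": 13.4100},  # still within 1 km of S1
    {"lat": 52.6000, "lon": 13.6000},  # far from both stores
]

enriched = add_competitor_count(stores, competitors)
for row in enriched:
    print(row["id"], row["competitors_nearby"])
```

The new column can then be added to or dropped from the training set to see whether it actually improves accuracy.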
How do we get external data?
Some data aggregator services provide data for free, sometimes even in a convenient XML or JSON format; OpenStreetMap, for example, offers geographical data about an object. There are also public datasets, for example from Google: large, already collected data sets on various topics that are publicly available and can be freely used to train your model.
Some data is publicly available but inconvenient to use. Then you have to scrape the sites, that is, extract the data automatically (as long as it is legal, of course, but in most cases it is).
And some data has to be bought or licensed, for example if you work with fiscal data operators who can allow you to use some information about receipts.
In each case, we decide how much we need the data, how much it will improve the accuracy of the model and how important it is to the customer. Let’s suppose that some data set will allow us to make the model 10% more accurate.
How good is this for the customer? How much money will he save or get if our model predictions are 10% more accurate? Is it worth buying this data set?
To understand this, we need to know a lot about the client, so at the task-understanding stage we ask many questions about the business, its sources of income and how it operates.
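The question of whether a data set is worth buying can be framed as simple break-even arithmetic. The sketch below is only one possible framing and rests on a strong assumption: that a model which is 10% more accurate reduces the client's yearly cost of forecasting errors by the same 10%. All figures and the function name are hypothetical.

```python
def dataset_is_worth_buying(annual_cost_of_errors, error_reduction, dataset_price):
    """Compare the expected yearly saving with the price of the data set.

    Assumes (for illustration only) that the saving is proportional
    to the reduction in forecasting error.
    """
    expected_saving = annual_cost_of_errors * error_reduction
    return expected_saving > dataset_price, expected_saving

worth_it, saving = dataset_is_worth_buying(
    annual_cost_of_errors=500_000,  # hypothetical: what bad forecasts cost per year
    error_reduction=0.10,           # the 10% improvement from the text
    dataset_price=30_000,           # hypothetical yearly licence price
)
print(worth_it, round(saving))
```

In practice the relationship between accuracy and money is rarely this linear, which is exactly why the questions about the client's business matter.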
How do we check the accuracy of the model?
How do we check (and prove to the client) that our model really makes sense and that it predicts with the required accuracy?
We split all the data we have randomly 80/20. We work with the 80% and train the model on it. The remaining 20% is set aside: we will need it later to test the model and make sure that everything works. This is the validation sample.
The 80% is in turn divided into training and test samples (70/30). We train the model on the 70% and check it on the remaining 30%. When we are satisfied with the accuracy of the model, we run a final check on the validation sample, that is, on data the model has never seen before. This lets us make sure that the model actually predicts with the stated accuracy.
As a rule, the accuracy of the model on the test and validation samples is almost identical. If they differ a lot, something is probably wrong with the data: for example, the data set may not have been split into training and validation samples randomly, or the data may be heterogeneous.
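The two-stage split described above can be sketched as follows. The sample names follow this article (the 20% hold-out is called the validation sample; many other texts swap the terms "test" and "validation"). The helper names and the 0.05 tolerance for comparing accuracies are assumptions chosen for illustration.

```python
import random

def split(rows, numerator, denominator, seed):
    """Randomly split rows into two parts: numerator/denominator and the rest."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * numerator // denominator
    return shuffled[:cut], shuffled[cut:]

def accuracies_agree(test_accuracy, validation_accuracy, tolerance=0.05):
    """True if test and validation accuracy are close enough to trust the model."""
    return abs(test_accuracy - validation_accuracy) <= tolerance

records = list(range(1000))  # stand-ins for real store records

# Stage 1: 80% to work with, 20% locked away as the validation sample.
working, validation = split(records, 8, 10, seed=1)
# Stage 2: the working 80% is split 70/30 into training and test samples.
train, test = split(working, 7, 10, seed=2)

print(len(train), len(test), len(validation))  # prints: 560 240 200
```

A large gap between the two accuracies, for instance `accuracies_agree(0.90, 0.70)` returning `False`, is the signal to go back and inspect how the data was split.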
MVP and industrial solution
When we discuss a task with a client, we define, among other things, the success criteria for the project. How do we know that we have completed the task? What accuracy should the resulting model have, and why?
We always start a project with an MVP: a relatively cheap test of our hypotheses and a model that can already be valuable. We train the model on the available data and find a baseline, the minimum accuracy of the model (for example, 75%). We keep trying to improve this accuracy as long as it is cost-effective and reasonable.
When we are finally satisfied with the accuracy of the model, we package it into a web service or a mobile application with a user-friendly interface. In our example of opening a store and forecasting its revenue, the web service could look like an interactive map where areas are highlighted in different colors depending on how promising they are for a new store, and each selected point shows a card with the revenue forecast for a store placed there.
The difference between the MVP and the industrial solution is that the MVP model cannot be retrained. Yet the accuracy of any model deteriorates over time, so it needs to be retrained. That is why, for the industrial solution, we implement one of two support options:
- either we support it by ourselves, constantly training the model (and increasing its accuracy), or
- we implement a cycle of retraining of the model directly inside the software itself.
Support from a live team is, of course, more expensive. But the disadvantage of automatic retraining is that it cannot account for sudden changes in the nature of the data. It will not take into account, for example, that because of market conditions the store stopped selling certain types of goods and its revenue decreased. The accuracy of the model will then fall significantly, and it will have to be retrained manually with the missing data added.
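Whichever support option is chosen, something has to notice that accuracy has dropped. The sketch below shows one simple way to do that: track the relative error of recent forecasts and flag the model when the rolling average drifts above the agreed baseline. The class name, the window size and the tolerance are hypothetical, and this is not a description of any particular product's retraining logic.

```python
from collections import deque

class RetrainingMonitor:
    """Flags a model for retraining when its recent error exceeds the baseline."""

    def __init__(self, baseline_error, tolerance=0.05, window=30):
        self.baseline_error = baseline_error  # error agreed at delivery, e.g. 0.10
        self.tolerance = tolerance            # how much degradation is acceptable
        self.recent = deque(maxlen=window)    # relative errors of the latest forecasts

    def record(self, predicted, actual):
        """Store the relative error of one forecast (actual must be non-zero)."""
        self.recent.append(abs(predicted - actual) / abs(actual))

    def needs_retraining(self):
        """True once the window is full and the rolling error is too high."""
        if len(self.recent) < self.recent.maxlen:
            return False
        rolling_error = sum(self.recent) / len(self.recent)
        return rolling_error > self.baseline_error + self.tolerance

monitor = RetrainingMonitor(baseline_error=0.10, window=5)

for _ in range(5):  # forecasts that are off by about 10%, as agreed
    monitor.record(predicted=110, actual=100)
print(monitor.needs_retraining())

for _ in range(5):  # the assortment changed, forecasts are now off by 40%
    monitor.record(predicted=140, actual=100)
print(monitor.needs_retraining())
```

A check like this only detects that something changed. Deciding which missing data to add is still the manual step described above.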
Predictive analytics: deliverables
- A web service or mobile application with a user-friendly interface that clearly shows the customer the answer to their question (for example, where to open a store and how much revenue it will bring in).
- Under the hood, a model that provides predictions with a specified (and agreed) accuracy based on the available data: internal customer data and the external data that we decided to collect and use in this model.
- Support of the model, either as continuous updating by a live DS team or as a built-in function of periodic retraining within the software itself. The support option is selected depending on the nature of the data and the business tasks that the model solves.
- Visual confirmation that Data Science and Machine Learning are not just fashionable technologies, but tools that help solve real business problems quickly and accurately.