6 Data Collection Rules for Your Future Perfect Machine Learning Dataset
Modern companies produce gigantic amounts of data, which later becomes part of their machine learning datasets. Those datasets are used to build models that solve various problems a business may face and make it more profitable, customer-oriented and, of course, data-driven. Machine learning depends heavily on data: it is what makes algorithm training possible. Yet regardless of how much information and data science expertise we have, machine learning may be useless or even harmful if a poor data collection process is in place.
The thing is, the perfect dataset probably doesn’t exist. However, there are a number of things businesses can do to get the best results from their future data science and machine learning initiatives.
Here is what a business should do to establish the right data collection mechanism:
1. Ensure the data has no gaps
Of course, it is hard to know in advance what kind of data will be helpful in the future. However, it may turn out to be really hard to build a model if certain variables are missing. Ideally, every company should have a data strategy in place long before it starts collecting any data; this helps avoid collecting unnecessary data. If you still don’t have a data strategy, it is highly recommended to collect complete data. It may turn out to be cheaper to pay for additional data storage than to have a whole team wait for the necessary data to be collected. If some data points were never gathered, there is no way to restore them.
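To make this concrete, here is a minimal sketch of a completeness check at ingestion time. The schema and field names (`user_id`, `timestamp`, `event_type`) are hypothetical; the point is that incomplete records get quarantined for review rather than silently dropped, because lost data points can’t be restored later.

```python
# Hypothetical required fields for an event record.
REQUIRED_FIELDS = {"user_id", "timestamp", "event_type"}

def find_gaps(record: dict) -> set:
    """Return the set of required fields missing from a record."""
    return REQUIRED_FIELDS - record.keys()

def ingest(record: dict, store: list, rejects: list) -> None:
    """Store complete records; quarantine incomplete ones for review
    instead of silently dropping them."""
    if find_gaps(record):
        rejects.append(record)  # keep for inspection, don't discard
    else:
        store.append(record)
```

A rejects queue like this also doubles as a cheap monitor: if it starts filling up, you know a client or a pipeline stage began sending incomplete data.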
2. Keep your raw data raw
With today’s low storage costs, companies can stop worrying about compressing their data and start worrying about making sure they fully understand it.
Keeping the data raw has a lot of advantages. Raw data keeps data analysis fast and reliable. Companies go through so much trouble collecting data that it is completely illogical to start throwing parts of it away. Pre-processing data without knowing exactly how it will be used later is a relic of the past, when data storage costs were enormous.
Let’s imagine you collect raw data and at some point decide that it takes up too much space. You have an idea that only nouns will be useful for future analysis, so you filter all your data and keep just the nouns, saving space on your hard drive and perhaps time spent on data pre-processing.
This approach is wrong in every possible way. You can never know what kind of data may be useful; it is not always obvious. There is no guarantee that everything was done right during the pre-processing stage. It should always be possible to repeat data processing on the original data.
In any case, there is no need to try to predict what form of data will be convenient for data analysts and machine learning engineers in the future. It is better to spend a week or two writing the necessary transformations than to find out that something is irrevocably missing, or that all the data structures need to be transformed because someone assumed the work would be done differently.
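The "keep raw, derive later" idea can be sketched in a few lines. Raw events are appended untouched as JSON lines, and any processed view is a pure function of the raw log, so it can be recomputed whenever the requirements change (the "uppercase the text" transform below is just a stand-in for real pre-processing):

```python
import json

def store_raw(events, path):
    """Append raw events untouched, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")

def derive_view(path, transform):
    """Re-derive a processed view from the raw log; repeatable by design."""
    with open(path, encoding="utf-8") as f:
        return [transform(json.loads(line)) for line in f]
```

If the "nouns only" idea from the example above turns out to be wrong, nothing is lost: you simply write a different `transform` and run it over the same raw log.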
3. Foresee and document all the possible missing values and outliers in your data
During the exploratory data analysis phase, data scientists often spend a lot of time trying to understand why some datasets have missing values. For instance, some fields may be optional, or the gaps may result from developers’ mistakes, network malfunctions, or other system failures. Of course, it is hard to foresee every case, but whenever you know the reason for such gaps, it is better to document it for future users. Understanding the nature of missing values and outliers helps people work with the data faster and more effectively in the future.
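Such documentation can live next to the data as a machine-readable data dictionary. Here is a minimal sketch; the field names and reasons are hypothetical examples of the kinds of notes a future analyst would be grateful for:

```python
# Hypothetical data dictionary: why a field may have no (or a sentinel) value.
MISSING_VALUE_NOTES = {
    "referrer":   "optional field -- absent for direct visits",
    "device_id":  "null before app v2.3 due to a client bug",
    "latency_ms": "-1 is written on network timeout, treat as missing",
}

def explain_missing(field: str) -> str:
    """Tell a future analyst why this field may have no value."""
    return MISSING_VALUE_NOTES.get(field, "no documented reason -- investigate")
```

Even a plain dictionary like this, checked into the repository alongside the data schema, saves hours of guesswork during exploratory analysis.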
4. Changelogs and data structures versioning
Every application, website, and platform expands and transforms, which causes changes in the structure used to store user data. Some fields are added, others become obsolete or are modified. It is important to keep up-to-date logs describing all data structure changes, as well as to save all previous versions of the documentation. This will save data engineers and analysts a lot of time in the future.
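One common way to make such versioning executable is to stamp every record with a schema version and keep a chain of migration functions that upgrade old records to the current layout. A minimal sketch, with hypothetical field names and versions:

```python
CURRENT_VERSION = 3

def v1_to_v2(record):
    record["full_name"] = record.pop("name")  # v2 renamed "name"
    return record

def v2_to_v3(record):
    record.setdefault("country", None)        # v3 added an optional field
    return record

MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}

def upgrade(record: dict) -> dict:
    """Bring a record of any known version up to the current schema."""
    version = record.get("schema_version", 1)
    while version < CURRENT_VERSION:
        record = MIGRATIONS[version](record)
        version += 1
    record["schema_version"] = version
    return record
```

The migration functions double as the changelog: reading them top to bottom tells a new engineer exactly how and why the structure evolved.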
5. Ensure the data points can’t get lost
Unfortunately, anything can happen when the network is involved: a lost connection can result in data loss. This is why it is better to build a failover mechanism into your data collection and analytics pipeline. Alternatively, you can push data to a message queue with a sufficient time-to-live. This way, even if network problems occur, the data won’t get lost.
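The failover idea can be sketched as a local buffer in front of the network: if sending fails, points stay in the queue and are retried later instead of disappearing. `send` below is a stand-in for your real transport (HTTP client, message-queue producer, etc.):

```python
import collections

class BufferedSender:
    """Keep unsent data points in a local queue until delivery succeeds."""

    def __init__(self, send):
        self._send = send
        self._buffer = collections.deque()

    def push(self, point):
        self._buffer.append(point)
        self.flush()

    def flush(self):
        while self._buffer:
            point = self._buffer[0]
            try:
                self._send(point)
            except ConnectionError:
                return  # network is down; keep the point, retry later
            self._buffer.popleft()  # remove only after confirmed delivery
```

A production setup would add persistence of the buffer to disk and backoff between retries, but the invariant is the same: a point is removed only after the send is confirmed.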
6. Hire a data officer
There should be a person in every company who knows everything about its data. This is especially important for fast-growing companies that actively attract new users and add new services and features. Such companies spawn many autonomous teams working with different types of data, and as a result the overall picture of the company’s data gets lost. This may lead to a situation where a team starts working on a new project without knowing that somewhere in the company there is data that could help them solve their task better. For instance, a team starts a project to identify fraudulent transactions and doesn’t know that another team has already collected data about the IP addresses of their users, which might turn out to be very useful. The data officer position has also become extremely popular since the GDPR came into force in the EU on May 25, 2018.
Of course, it is also important to remember certain basic things that are sometimes underestimated or forgotten during data collection. For instance, foreign keys are often missing from the data, which makes it hard to merge different entities. Or it is easy to forget to record the user’s time zone while logging their activity. All of these small gaps and flaws in the data may lead to serious inaccuracies in the final results of data analysis. But if you try to follow at least some of the data collection rules we’ve discussed, you’ll make the life of your future data science team a lot easier and your projects more successful.
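Both of those small details can be baked into the logging code itself. A minimal sketch (the field names are hypothetical): store timestamps as timezone-aware UTC, keep the user’s own time zone alongside, and always include the foreign key so the event can later be joined to other entities:

```python
from datetime import datetime, timezone

def log_event(user_id: int, event_type: str, user_tz: str) -> dict:
    """Build an event record with a foreign key and unambiguous timestamps."""
    return {
        "user_id": user_id,    # foreign key for future joins with other tables
        "event_type": event_type,
        "ts_utc": datetime.now(timezone.utc).isoformat(),  # always UTC
        "user_tz": user_tz,    # e.g. "Europe/Berlin", for local-time analysis
    }
```

Recording UTC plus the zone name (rather than a naive local time) means an analyst can always reconstruct both the absolute moment and the user’s local hour of day.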