Starting a data science project: Three things to remember about your data
Many companies collect and manage data with little to no forethought. Yet just as planning is key to any strategic business project, forethought is essential when dealing with data. Here are three key points to remember when starting a data science project.
Identify the right questions about your data
In most cases, a key to successful machine learning development is a large amount of good-quality data. Getting it takes a lot of work, starting from the very beginning of a data science project.
Every element of the development process must be purposeful and closely aligned with business goals. To achieve this, the company should first answer the following questions: what business problem are we trying to solve by applying data science techniques, or what will the AI system do in our product or service? Working on a data strategy at an early stage helps to answer these questions while ensuring that data collection runs smoothly and stays aligned with business goals. With a solid data strategy in place, you are far more likely to have the necessary data at the right moment.
Don’t wait too long to start collecting data in the right way
If the company does not have a data science team in place, it can get professional advice on collecting and storing data from companies that provide data strategy consulting services.
In many cases, data scientists are brought into projects too late and can’t answer the questions they are asked because the right data was not collected. There may be too little tracking and instrumentation, or it may turn out that the data was stored in a database that is not suitable for future tasks.
For example, when InData Labs started working with Flo (a mobile app that uses neural networks to predict women’s menstrual cycles), the project began with migrating the app from the Parse platform, which provided no ability to conduct analytics, even though analytics was supposed to become the application’s key competitive advantage. Had data science consultants been involved in the project earlier, the Flo team could have avoided a lot of unnecessary work and saved a lot of time.
Don’t expect data science models to perform well on any type of data
One of the first questions potential clients ask us about data is “How much data do I need to start working on a data science project with your company?” The common statement that “more data gives better results” is misleading. That’s why it is worth explaining in which scenarios more data and more features in a model are helpful, and how the choice of algorithm affects the end result.
There are two major reasons a data science model might not perform well. In the first case, the model is too complicated for the amount of data we have. This situation is known as high variance, and it leads to overfitting. We can identify a high variance problem when the training error is much lower than the test error. Problems associated with high variance can be addressed by reducing the number of features or increasing the number of data points.
In the other case, the model is too simple to explain the data we have. This situation is known as high bias, and it shows up as a high error on both the training and the test data. Here, more data won’t improve model performance; we need to increase the number of features in the model instead.
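To make the two failure modes concrete, here is a minimal sketch that compares training and test error for a very flexible and a very rigid model. Everything in it is an assumption chosen for illustration and does not come from any client project: the synthetic sine data, the k-nearest-neighbours regressor standing in for "a model", the diagnose helper, and its target_err and gap_ratio thresholds.

```python
import math
import random


def make_data(n, rng, noise=0.3):
    """Draw n noisy samples of y = sin(x) on [0, 2*pi] (synthetic data)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def knn_predict(train_x, train_y, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return sum(train_y[i] for i in order[:k]) / k


def mean_squared_error(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    total = sum(
        (knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)
    )
    return total / len(xs)


def diagnose(train_err, test_err, target_err, gap_ratio=2.0):
    """Hypothetical rule of thumb; both thresholds are illustrative choices."""
    if train_err > target_err:
        # The model cannot even fit the data it was trained on.
        return "high bias"
    if test_err > gap_ratio * train_err:
        # Training error is much lower than test error.
        return "high variance"
    return "no obvious problem"


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(60, rng)
test_x, test_y = make_data(200, rng)

results = {}
# k=1 memorises the training set; k=len(train_x) always predicts the mean.
for k in (1, len(train_x)):
    train_err = mean_squared_error(train_x, train_y, train_x, train_y, k)
    test_err = mean_squared_error(train_x, train_y, test_x, test_y, k)
    verdict = diagnose(train_err, test_err, target_err=0.25)
    results[k] = (train_err, test_err, verdict)
    print(f"k={k}: train={train_err:.3f} test={test_err:.3f} -> {verdict}")
```

With k=1 the model reproduces every training point, so its training error is zero while its test error is not, which is the high variance signature. With k equal to the training set size the model predicts the same average everywhere and misses the shape of the data on both sets, which is the high bias signature. In a real project the thresholds would have to come from the business requirements rather than from fixed constants like these.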
The common misconception that more data always improves the performance of a data science model leads to inflated expectations of fast results. In reality, choosing and tuning the model to fit the data already in place is one of the most time-consuming and fundamental steps in a data science project.
Companies that are just starting out on the path to data science and AI adoption should keep these points in mind when planning their activities. Doing so will help them better understand how data science teams work and what to expect given the amount and quality of their data.
Already have a project in mind but not sure whether your data is ready? Let’s talk.