Entrepreneurs and business owners who are not data scientists often think along these lines: “I’ve got lots of data, so why not work some data science magic on it to find the dependencies and connections I couldn’t guess myself?” That looks like a job for a data scientist. We are not going to discuss the results of such an analysis in this article, since they depend on each particular case and each database. Instead, we will start with the issues involved in preparing data for analysis: data cleansing.
According to The New York Times, data scientists spend from 50 to 80 percent of their time collecting and preparing data before it can actually be analyzed. This makes data cleansing the most time-consuming part of the process.
Let’s look into possible reasons for these findings.
Cleansing your own data
Imagine you own a small online shop. You deal with customers and sell clothes of different brands. To keep things in order, you store all the data from your website in dedicated directories (Figure 1).
You also attract customers with promotional sales (e.g. a jeans sale, a brand A sale, a buy 2 get 1 free sale). These promotions are recorded in your data as well (Figure 2).
As the owner, you want to know which of those promotions are the most effective, and in what way. Such data is already clean in structure. However, before the analysis it is essential to merge the data on items sold with the data on promotional sales into a single clean database.
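To make the merge step more concrete, here is a minimal sketch in Python. The record layout (order_id, promo_id and so on), the promotion codes and the figures are all invented for illustration; the real structure is whatever your shop actually stores (Figures 1 and 2).

```python
from collections import defaultdict

# Hypothetical "sold" records; promo_id is None when no promotion applied.
orders = [
    {"order_id": 1, "item": "jeans", "brand": "A", "price": 40.0, "promo_id": "P1"},
    {"order_id": 2, "item": "shirt", "brand": "B", "price": 25.0, "promo_id": None},
    {"order_id": 3, "item": "jeans", "brand": "A", "price": 40.0, "promo_id": "P1"},
    {"order_id": 4, "item": "jacket", "brand": "A", "price": 90.0, "promo_id": "P2"},
]

# Hypothetical promotional "sales" records, keyed by an assumed promotion code.
promotions = {
    "P1": "jeans sale",
    "P2": "brand A sale",
}


def merge_orders_with_promotions(orders, promotions):
    """Attach the promotion name to every order (a left join on promo_id)."""
    merged = []
    for order in orders:
        row = dict(order)  # copy, so the raw data stays untouched
        row["promo_name"] = promotions.get(order["promo_id"], "no promotion")
        merged.append(row)
    return merged


def revenue_by_promotion(merged):
    """Sum the order prices for each promotion name."""
    totals = defaultdict(float)
    for row in merged:
        totals[row["promo_name"]] += row["price"]
    return dict(totals)


merged = merge_orders_with_promotions(orders, promotions)
totals = revenue_by_promotion(merged)
print(totals)
```

Once both sources live in one table, comparing promotions becomes a simple aggregation. With real data the join key is rarely this tidy, and matching the two sources is often where most of the cleansing effort goes.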
Data cleansing for social media data
This time you own a travel agency that specializes mostly in trips around Europe. The agency has a network of offices in every US state. You also have ample opportunity to run an ad campaign anywhere you want, using billboards, magazines and public transport advertisements. Let’s assume your only accessible sources of data are your Twitter and Facebook accounts. A certain number of followers share their emotions, thoughts and reviews about your services there, and you want to take advantage of this information as well.
Data scientists can extract data from your streams by looking at text sentiment, customers’ hashtags and so on. Once collected, the data has to be merged and cleaned of possible spam before it can be analyzed. The analysis will then help you learn in which areas you are already recognized and which ones you still need to work on (Figure 3).
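A rough sketch of the merge-and-clean step might look like this in Python. The post records, the state field and the spam markers are assumptions made up for this example. Real Twitter and Facebook data arrives in a different shape, and real spam filtering needs far more than a keyword list.

```python
import re

# Hypothetical posts from two accounts; real API responses look different.
twitter_posts = [
    {"id": "t1", "text": "Loved our trip to Rome! #italy #travel", "state": "TX"},
    {"id": "t2", "text": "WIN A FREE PHONE!!! click http://spam.example", "state": "TX"},
    {"id": "t3", "text": "Paris was great, thanks! #france", "state": "NY"},
]
facebook_posts = [
    # Same review cross-posted by the same customer.
    {"id": "f1", "text": "Paris was great, thanks!   #france", "state": "NY"},
    {"id": "f2", "text": "Booking was slow, but Prague was worth it #czech", "state": "CA"},
]

# Assumed, deliberately naive spam rules.
SPAM_MARKERS = ("free phone", "click http", "buy followers")


def is_spam(text):
    lowered = text.lower()
    return any(marker in lowered for marker in SPAM_MARKERS)


def normalize(text):
    """Collapse whitespace and case so near-identical posts compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def merge_and_clean(*sources):
    """Merge several post lists, dropping spam and duplicate texts."""
    seen = set()
    cleaned = []
    for source in sources:
        for post in source:
            key = normalize(post["text"])
            if key in seen or is_spam(post["text"]):
                continue
            seen.add(key)
            hashtags = re.findall(r"#(\w+)", post["text"].lower())
            cleaned.append({**post, "hashtags": hashtags})
    return cleaned


clean_posts = merge_and_clean(twitter_posts, facebook_posts)

# Count the remaining posts per state as a crude measure of recognition.
mentions_by_state = {}
for post in clean_posts:
    mentions_by_state[post["state"]] = mentions_by_state.get(post["state"], 0) + 1
print(mentions_by_state)
```

Only after this kind of filtering does it make sense to look at sentiment or to compare regions, since spam and cross-posted duplicates would otherwise inflate the counts.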
Data from open databases
Here we will have a look at different public data repositories and sources.
Let’s start with the book database of goodreads.com. The XML returned for every book query on GoodReads looks like this (Figure 4).
And here is an example of cleansing this public data (//github.com/sidooms/Twitter-ratings).
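As a small illustration of what cleansing an XML response can involve, consider the Python sketch below. The element names and values are invented for this example and are not the actual GoodReads schema from Figure 4. The point is only that text fields need trimming and that empty or malformed numbers have to be handled explicitly.

```python
import xml.etree.ElementTree as ET

# Hypothetical response; element names are assumptions, not the real schema.
SAMPLE_XML = """
<response>
  <book>
    <title>  The Hobbit </title>
    <author>J.R.R. Tolkien</author>
    <average_rating>4.28</average_rating>
    <num_pages>366</num_pages>
  </book>
  <book>
    <title>Untitled draft</title>
    <author></author>
    <average_rating>n/a</average_rating>
    <num_pages></num_pages>
  </book>
</response>
"""


def to_number(text, cast):
    """Convert text to a number; return None for missing or malformed values."""
    try:
        return cast(text.strip())
    except (AttributeError, ValueError):
        return None


def parse_books(xml_text):
    """Turn every <book> element into a flat dictionary with clean types."""
    root = ET.fromstring(xml_text.strip())
    books = []
    for node in root.iter("book"):
        books.append({
            "title": (node.findtext("title") or "").strip(),
            "author": (node.findtext("author") or "").strip() or None,
            "average_rating": to_number(node.findtext("average_rating"), float),
            "num_pages": to_number(node.findtext("num_pages"), int),
        })
    return books


books = parse_books(SAMPLE_XML)
for book in books:
    print(book)
```

Mapping bad values to None, rather than to zero or an empty string, keeps them visible for the later check for missing data.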
Next, let’s take a closer look at the WTO databases (Figure 5). The data is available as HTML only and still has to be parsed.
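Parsing an HTML table can be sketched with nothing more than the Python standard library. The fragment below is a made-up table with fictional countries and figures, not a real WTO page, whose markup is larger and messier. The sketch also assumes a single, non-nested table whose first row is the header.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment with fictional data; not an actual WTO page.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Year</th><th>Exports (USD bn)</th></tr>
  <tr><td>Freedonia</td><td>2014</td><td>12.5</td></tr>
  <tr><td>Sylvania</td><td>2014</td><td> 7.0 </td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html_text):
    """Return the table as a list of dictionaries keyed by the header row."""
    parser = TableParser()
    parser.feed(html_text)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = parse_table(SAMPLE_HTML)
for record in records:
    # Everything in HTML is text, so numeric columns are converted explicitly.
    record["Year"] = int(record["Year"])
    record["Exports (USD bn)"] = float(record["Exports (USD bn)"])
print(records)
```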
Internet sources offer a lot of public data, but all of it has to be reworked: prepared, cleaned, organized and augmented.
The key characteristics of data preparation, such as the methods used, the amount of data in the tidy set, and the main trends and linkages between data series, depend mostly on the problem to be solved later. This is the starting point for data science: the data owner should set clear goals and make sure that the data scientist understands them.
The data preparation process consists of several steps:
- Cleansing the data: throwing out the outright rubbish (advertising, spam etc.);
- Checking for outliers: if anything in your data really stands out, it has to be studied separately;
- Checking your sample for normality: if your data is approximately normally distributed, this often suggests that it was produced by many similar, independent sources. This is the central limit theorem, one of the most interesting results of probability theory. Many classical statistical methods assume approximately normal data, so it is worth checking before you rely on them;
- Checking some assumptions you have in mind: average value, extremes, amount of data available for the study;
- Checking for missing data: just as with outliers, decide whether the gaps need additional study.
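Several of the checks above can be sketched in a few lines of Python. The numbers are invented. The 1.5 × IQR rule is just one common convention for flagging outliers. Skewness is only a rough hint about normality, not a formal test; for that you would use something like the Shapiro-Wilk test, available in SciPy as scipy.stats.shapiro.

```python
import statistics

# Hypothetical daily order counts; None marks a missing observation.
raw = [12, 15, 14, None, 13, 16, 15, 14, 95, 13, None, 15]

# Missing data: count the gaps and set them aside for separate study.
missing = sum(1 for x in raw if x is None)
values = [x for x in raw if x is not None]

# Outliers: flag points further than 1.5 * IQR from the quartiles.
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in values if x < low or x > high]
core = [x for x in values if low <= x <= high]

# Basic assumptions: amount of data, average value, extremes.
summary = {
    "n": len(core),
    "mean": statistics.mean(core),
    "min": min(core),
    "max": max(core),
}


def skewness(xs):
    """Population skewness; values near 0 are consistent with symmetry."""
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)
    return sum(((x - mean) / std) ** 3 for x in xs) / len(xs)


print("missing:", missing)
print("outliers:", outliers)
print("summary:", summary)
print("skewness:", round(skewness(core), 3))
```

Note that the outlier is set aside rather than silently deleted: whether 95 orders in one day is a data error or your best promotion ever is exactly the kind of question that needs separate study.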
After performing these steps you should have a really tidy data set. Only with accurate, clean data can you carry out consistent research.
Also note that for a proper analysis your data should be organized in the simplest way possible: a .csv file for small data sets, a database or cloud solution for larger ones.
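Both storage options are easy to try from Python. In the sketch below the rows and the table layout are assumptions for illustration. An in-memory buffer and an in-memory SQLite database stand in for a real file and a real database server.

```python
import csv
import io
import sqlite3

# Hypothetical tidy records.
rows = [
    {"order_id": 1, "item": "jeans", "price": 40.0},
    {"order_id": 2, "item": "shirt", "price": 25.0},
]

# Small data set: a flat CSV file. An in-memory buffer is used here;
# for a real file, use open("orders.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["order_id", "item", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Larger data set: a database table that can be queried without
# loading everything into memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, item TEXT, price REAL)"
)
conn.executemany("INSERT INTO orders VALUES (:order_id, :item, :price)", rows)
total = conn.execute("SELECT SUM(price) FROM orders").fetchone()[0]
conn.close()

print(csv_text)
print("total revenue:", total)
```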
By Hanna Rudakouskaya for InData Labs.