Keys to building robust data infrastructure for a data science project
Once you decide to leverage data science techniques in your company, it is time to make sure the data infrastructure is ready for it.
Starting a data science project is a big investment, not just a financial one. It involves a lot of time, effort, and preparatory work.
Data science is about leveraging a company’s data to optimize operations or profitability. Therefore, all of the processes that come before this stage, such as data warehousing and data engineering, should be fully operational before the data science part of a project begins.
Important qualities of the data infrastructure for a data science project
Software infrastructure that allows a company to both store and access its data is needed from the start. That’s what data engineers do: they build and maintain the data infrastructure, and they make sure the data is accessible to the data scientists who will analyze it and make it useful to the company. Generally speaking, data engineers are needed in the early stages of a company’s life. It might also be useful to contract a data scientist or a data science consulting company at this stage, to ensure that the initial infrastructure is built in a way that will still be useful when the business is ready for a full-time data scientist. This approach can help avoid redoing things in the future.
Also, it is important to keep scalability in mind. If a company is planning to grow, its engineers should build a scalable data infrastructure.
Once a company has collected enough data to produce meaningful insights and its stakeholders start asking questions about optimizing the business, the company is ready for data science.
What if your company’s data infrastructure doesn’t work the way it should
If the existing data infrastructure doesn’t support the type of analysis and experiments the data scientists need to perform, they will either end up idle while you try to catch your infrastructure up, or get frustrated by not having the tools they need.
Monica Rogati, one of the first members of LinkedIn’s data team, encourages companies to give more thought to what a data scientist needs to be successful. She describes the problem with the common expectation that hiring a data scientist will “sprinkle machine learning dust over data to solve all the problems”. In many cases, data scientists join companies that lack the infrastructure needed for their work, or they are simply not granted access to the data.
Often, data is housed on multiple servers, which makes it hard for engineers to integrate the data so that it can be analyzed properly. Data processing is also a challenge, as powerful computers, programs, and a lot of preparatory data engineering work are required to crunch massive data sets.
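To make the integration problem concrete, here is a minimal sketch of combining records that live in separate systems into one view per customer. The source names ("crm", "billing"), the shared "customer_id" key, and the function name are assumptions made for illustration only; real integrations usually also have to deal with mismatched keys, schemas, and formats.

```python
from collections import defaultdict


def integrate(sources):
    """Merge rows from several sources into one record per customer_id.

    `sources` maps a source name to a list of row dicts. Each field is
    prefixed with its source name so that origins stay visible and
    same-named fields from different systems do not overwrite each other.
    """
    merged = defaultdict(dict)
    for source_name, rows in sources.items():
        for row in rows:
            key = row["customer_id"]  # assumed shared key across systems
            for field, value in row.items():
                if field != "customer_id":
                    merged[key][f"{source_name}.{field}"] = value
    return dict(merged)


# Hypothetical data pulled from two separate servers.
sources = {
    "crm": [{"customer_id": 1, "name": "Ada"}],
    "billing": [
        {"customer_id": 1, "plan": "pro"},
        {"customer_id": 2, "plan": "free"},
    ],
}
combined = integrate(sources)
```

Customers that appear in only one system still get a record, which makes gaps between the sources easy to spot before any analysis starts.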
Security barriers in data science projects
Although most companies investing in machine learning projects own and store a lot of data, the data is not always ready to use.
Companies may be ready to work with processing systems or perform data aggregation, but during data extraction it may turn out that their data includes a lot of personal or “sensitive” information. This raises data security issues. Such data may need to be encrypted before it is fed into a machine learning model, and this can be time-consuming. In its data science blog, Airbnb stresses the importance of this process. The company has even built an encryption service called Cipher to address the technical challenges and enable engineers to encrypt data easily and consistently across Airbnb infrastructure. Cipher abstracts away the complexities that come with encryption, such as algorithms, key bootstrapping, key distribution and rotation, access control, and monitoring.
Another way of avoiding those technical challenges is to store personal and sensitive data separately from the rest of the data. Such an approach can minimize security risks and reduce the amount of data that needs special protection. The rest of the data is anonymized and ready for cross-team use. This allows for faster testing and experimenting with data while working on proof-of-concept projects.
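As an illustration of keeping sensitive fields apart, the sketch below splits each record into a restricted part and a shareable part, linked only by a keyed-hash token. The field names, the "customer_id" key, and the function names are hypothetical. Note that replacing an ID with a token is pseudonymization rather than full anonymization: the remaining fields may still identify a person in combination, so treat this as a starting point under those assumptions.

```python
import hashlib
import hmac
from typing import Tuple

# Hypothetical list of fields treated as personal or sensitive.
SENSITIVE_FIELDS = {"name", "email", "phone"}


def pseudonym(customer_id: str, secret_key: bytes) -> str:
    """Derive a stable token from an ID using HMAC-SHA256."""
    return hmac.new(secret_key, customer_id.encode("utf-8"), hashlib.sha256).hexdigest()


def split_record(record: dict, secret_key: bytes) -> Tuple[dict, dict]:
    """Return (sensitive_part, shared_part) for one record.

    The sensitive part keeps the real ID and personal fields and belongs
    in a restricted store. The shared part carries only the token and the
    non-sensitive fields, so it can be passed to other teams.
    """
    token = pseudonym(str(record["customer_id"]), secret_key)
    sensitive = {"token": token, "customer_id": record["customer_id"]}
    shared = {"token": token}
    for field, value in record.items():
        if field == "customer_id":
            continue
        if field in SENSITIVE_FIELDS:
            sensitive[field] = value
        else:
            shared[field] = value
    return sensitive, shared


# Example with made-up data; the key would normally come from a secrets store.
record = {"customer_id": 42, "name": "Ada", "email": "ada@example.com",
          "plan": "pro", "monthly_spend": 120.0}
sensitive, shared = split_record(record, b"demo-key")
```

Because the token is derived with a secret key, teams working with the shared data can still join records that belong to the same customer, while only holders of the restricted store can map a token back to a person.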
The idea of introducing data science technologies into a company may seem overwhelming for any business owner. However, with the right professional help and solid preparatory work on data infrastructure for a data science project, the results won’t keep you waiting.