InData Labs Founders on Specifics of Building a Data Science Company
In just three years, the data science company InData Labs has become a significant player in the data research and artificial intelligence market, both in Belarus and overseas. In a long interview with dev.by, the founders of this Minsk start-up share their experience from successful projects and talk about their technology stack and the intricacies of working with data.
“Back in 2014, when the idea of the data science company was born, very few people worked in the field of big data and data science in Belarus,” says Ilya Kirillov, co-founder and CEO of InData Labs. “That was why we spent so much time checking the idea: we studied analytical reports, liaised with colleagues from Wargaming and RadiumOne, and made sure there was room for business development in this sphere. I was full of enthusiasm and needed a like-minded person who would share this attitude. Irina, who worked for Marat Karpeko (co-founder of InData Labs and COO of Wargaming; dev.by note) back then, joined me. The two of us got it started.”
“In early 2015 we started to build the team,” recalls Irina Kryshneva, the company’s operations manager. “Dzianis Pirshtuk was one of our first employees. Today he is head of the Department of Data Science, and InData employs 30 people.”
According to the interviewees, forming a data science team was no easy job at the time: hardly anyone had experience in such projects. Still, they managed to find some people with the required expertise on the market. Others were trained in-house. For this purpose, the company joined forces with the Faculty of Applied Mathematics and Computer Science (FAMCS) of Belarusian State University and its Scientific Research Institute of Mathematics and Informatics and set up a lab.
“We realized that some students had excellent theoretical training but lacked practical experience. In our lab we offered them the opportunity to apply their knowledge to real cases,” says Dzianis Pirshtuk, chief data scientist at InData. “We gave them tasks and provided great datasets and collections of conference materials that described various approaches to solving the tasks and the potential pitfalls. These materials didn’t contain a single line of code. The students had to study the materials carefully and come up with a solution.”
At first, they enrolled only a few FAMCS students who were personally invited to the program by the company’s executives. Interest in data analysis has increased significantly since then. Not long ago InData opened enrollment for the third time. This time the company received solutions to the given task from people of different ages, professions and even countries.
“To Be Able to Work with Data, You Need to Understand It Well”
InData positions itself as a service company. Writing and implementing code is just one stage of a project at the start-up. A lot of time is spent on consulting, data collection and analysis, and, if the project succeeds, validation of the results.
Irina: An overwhelming majority of InData employees are data scientists and engineers. We sell science-driven services, so the guys from a small business development team have to dive deep into data science and study neural networks. Yet we involve data specialists or engineers as early as at the presales stage, so that we can clearly understand whether or not we are able to work with the task, whether or not the potential customer has all the data required to put the project into action.
Ilya: Presales is as exciting as it is challenging. It involves employees from sales and technical departments. It is at this stage that we showcase our knowledge of the business domain and technical expertise and we can even provide technological solutions.
Dzianis: We form small teams to work on projects, which is an advantage for us as well as for our customers. We build projects around data, and lately we tend to start with a business analysis phase: to be able to work with data, you need to understand it well. Since the data is often specific and comes from unfamiliar domains, it takes each employee a lot of time to become immersed in it. And time is the customer’s money. In addition, the fewer people are involved in the project, the easier it is to communicate and synchronize actions with the customer.
Irina: We have different customers. Firstly, we work with start-ups that design their app architecture from the outset so as to collect data suitable for further processing and analysis. Secondly, we have enterprise companies with problems that have accumulated over years of operation. Basically, they come to us with specific business tasks, which may be diverse. For example, we deal with customer churn prediction, user segmentation and building recommendation systems.
Dzianis: Many product companies want to at least try machine learning and by doing so to bring new functionality to their users. To be able to do this, you need to have a clear understanding of the audience and know its interests well.
Irina: We put the main focus on working with the audience. We analyze the users of telecommunications companies, mobile apps and retailers. We can’t disclose the names of many customers, since data is often a very delicate and sensitive topic, especially for telecoms and banks. But by and large, we can dive into any domain. For example, in one of our recent projects we worked with a women’s calendar app called Flo. We helped the customer implement neural networks in the application and deliver an interesting case: predicting the fertile window.
“Sometimes We Suggest Cases that the Customer Hasn’t Even Thought of”
Dzianis: More often than not we work with already collected data. When we started working on Flo, the guys already had a large audience and a lot of collected data — a solid basis for building an initial model.
Data is Flo’s competitive advantage, obtained through a long-term strategy. Millions of women enter a huge amount of data in the app: Flo’s surveys contain up to 100 points. No physician and no university has such detailed clinical information. To achieve this, Flo had to work closely on the UX and go to great lengths to get its users to input as much information as possible.
It was the chance to work with unique data that motivated our employees. By analyzing millions of records, one can find very unusual patterns and greatly improve prediction accuracy, which is almost research work.
Irina: Our team had to study Flo’s subject in great detail: they analyzed a lot of scientific articles and consulted with experts in the field. By now they probably know more about the specifics of menstrual cycles than many women!
Dzianis: Flo is a great case. But sometimes the customer’s data is not sufficient to solve the task. In such cases our engineers can advise on how to collect the necessary data, that is, develop a data strategy, or explain how to make good use of the data available.
Ilya: So far, we haven’t had a single instance when we couldn’t figure out how the customer could take advantage of the collected data. Sometimes we suggest cases that the customer hasn’t even thought of.
Dzianis: We often implement architectural improvements to make the customer’s system easier to extend. The same was true with Flo: we analyzed the back-end of the application, identified bottlenecks in the architecture, then proposed and built a new server side. Our solution is a complex trade-off between the flexibility required for rapid development of data-driven functionality and analytics and the necessary scalability margin, calculated for constantly growing load and data volumes.
Irina: We have a long list of competencies, or, to put it another way, full-stack data engineering and data science consulting. We can also collect data for the customer ourselves, usually from open sources, as we did, for example, in the project with the Californian start-up Captiv8. Our team introduced social media audience analytics: prediction of users’ demographic characteristics and interests.
Dzianis: Let me elaborate. Brand advertising often engages so-called influencers. They are bloggers who can affect the audience’s choices. Before launching an ad, marketing specialists need to understand how closely the blogger’s audience matches their own target audience. They are interested in segmenting users by basic characteristics: gender, age, race, religion, where they live, what languages they speak, and what interests they have in terms of advertising categories. This is behavioral analytics: for example, from the point of view of a marketer, Barack Obama is white.
To get analytics like this, advertisers want to have a tool that uses a “clean” social networking API and does not require addressing the blogger personally. We created such a tool for Captiv8. The solution turned out to be “turnkey”: the system downloads all public information about users from social networks, analyzes it, produces a behavioral forecast and saves the results in the repository. The case turned out extremely useful for ourselves: we gained valuable experience of collecting data from open sources.
Text Analytics: “A Gamer May Express Positive Emotions by Cursing”
Dzianis: We implemented another interesting case for a large gaming company which pays much attention to working with the community. People post tons of comments about the company’s products on YouTube, social networks and forums. It is physically impossible to read them all. That’s why our customer wanted us to develop an additional service for audience researchers: an analytical system that would surface the most relevant texts.
There are a lot of ready-made voice-of-customer systems, many of which are well designed. But as a rule, they work well only for the mass market: for traditional brand monitoring and tracking reviews in online stores. To solve our customer’s task we needed a fundamentally different architecture. A traditional system would either go down under the massive data flow or cost our customer a fortune. In addition, standard text analytics won’t work well for a game company, since the gaming community uses too much specific vocabulary.
Our customer had no data collected, but they knew exactly which texts to collect and where to get them. We built a system that made it possible to set required parameters: to select certain review channels, keywords and dates. The amount of data is very large: sometimes a single YouTube video may get hundreds of thousands of comments. The system collects them within a reasonable timeframe and saves them in the database.
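The collection parameters described above can be sketched as a small filter object. This is only an illustration under stated assumptions: the class name `CollectionFilter`, its fields and the shape of a comment record are hypothetical and are not taken from the actual system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CollectionFilter:
    """Hypothetical collection settings: channels, keywords and a date range."""
    channels: frozenset
    keywords: tuple
    start: date
    end: date

    def matches(self, comment: dict) -> bool:
        """Return True if the comment should be collected and stored."""
        text = comment["text"].lower()
        return (
            comment["channel"] in self.channels
            and self.start <= comment["posted"] <= self.end
            and any(keyword.lower() in text for keyword in self.keywords)
        )


# Made-up example settings and a made-up comment record.
example_filter = CollectionFilter(
    channels=frozenset({"youtube"}),
    keywords=("balance",),
    start=date(2017, 1, 1),
    end=date(2017, 12, 31),
)
example_comment = {
    "channel": "youtube",
    "text": "Game BALANCE is off again",
    "posted": date(2017, 5, 1),
}
```

Calling `example_filter.matches(example_comment)` checks the channel, the date range and a case-insensitive keyword match in one place, so the collector can apply the same rule to every source.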
At the same time, all texts undergo preprocessing before entering the database. Firstly, each text is assigned a subject category. Secondly, the tone of the statement is analyzed: this is sentiment analysis, aimed at identifying the emotional coloring of the text. For tasks like that we built a distributional model based on neural networks. We used a Wikipedia dump as a reference corpus from which the neural network learned the connections between words. Then we added tens of millions of game-related texts, roughly the volume of Wikipedia.
As a result, the model began to nicely distinguish the emotional coloring of the text given all the peculiarities: specific lexicon of active players, frequent use of explicit vocabulary (not necessarily a negative sign, it may also express positive emotions). Of course, the model isn’t as good at distinguishing tones as human audience researchers, but we did not try to completely replace them. The fact is that the system recognizes these tones way better than a person not involved in the gaming community.
All the texts saved in the database, in fact tens of millions of them, are available for full-text search. We use Elasticsearch, a fairly classic, highly scalable tool. We also implemented a distributed search to go with it.
If a system user wants to find texts about game balance, returning only the texts that contain the exact phrase “game balance” is unlikely to be enough. Using a neural network, we automatically expand the search query with related keywords, those that often occur alongside the search terms. In the end, a simple query returns a sufficient set of results with a category breakdown and an emotional coloring rating.
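A minimal sketch of query expansion may help here. It assumes that each word already has a vector learned by some embedding model; the tiny hand-written vectors, the function names and the `text` field name are all hypothetical, and the interview does not describe how the actual system computes related keywords.

```python
import math

# Hypothetical toy embeddings; a real system would learn them from a large corpus.
EMBEDDINGS = {
    "balance": [0.9, 0.1, 0.0],
    "nerf": [0.8, 0.2, 0.1],
    "overpowered": [0.85, 0.15, 0.05],
    "graphics": [0.0, 0.9, 0.2],
    "lag": [0.1, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def expand_query(terms, top_n=2, min_similarity=0.9):
    """Append up to top_n sufficiently similar words for each query term."""
    expanded = list(terms)
    for term in terms:
        vector = EMBEDDINGS.get(term)
        if vector is None:
            continue  # unknown word: keep it as is, nothing to expand
        neighbours = sorted(
            ((cosine(vector, other), word)
             for word, other in EMBEDDINGS.items() if word != term),
            reverse=True,
        )
        for similarity, word in neighbours[:top_n]:
            if similarity >= min_similarity and word not in expanded:
                expanded.append(word)
    return expanded


def build_search_body(terms, field="text"):
    """Build an Elasticsearch bool/should query body matching any of the terms."""
    return {
        "query": {
            "bool": {
                "should": [{"match": {field: term}} for term in terms],
                "minimum_should_match": 1,
            }
        }
    }
```

With these toy vectors, `expand_query(["balance"])` adds "overpowered" and "nerf" but leaves out "graphics" and "lag". The expanded list can then be passed to `build_search_body` to produce a request body for the search engine.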
“Stack Variation Is a Trade-off between Flexibility and Efficiency”
InData’s technology stack varies with every project. Dzianis Pirshtuk attributes this to the very nature of a data science company’s work.
Dzianis: We must constantly experiment, offer something new, maintain flexible development and quickly adjust to the customer’s tasks.
After formulating a task we explore the data extensively, then transform business requirements into technical ones and compare what we have with what we want. If everything goes well at this stage, we start to experiment with machine learning. Then the validation phase begins, using A/B tests as well as offline tests based on historical data.
Data may be unpredictable, so we don’t always achieve the desired result on the first try. Sometimes we have to go back to the experiment stage, and sometimes to the formulation of business requirements. And even after implementation the nature of the data may change slightly, and the model will have to be retrained. But if more data is collected, we may be able to build a new model of fundamentally better quality. Development goes around this cycle multiple times.
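An offline test on historical data can be sketched as a time-based split: fit on older records, then evaluate on newer ones the model has not seen. The example below is a sketch under loud assumptions: the churn records are made up, the field names are hypothetical, and the one-parameter threshold rule merely stands in for a real machine learning model.

```python
from datetime import date

# Made-up historical records for illustration only.
RECORDS = [
    {"day": date(2017, 1, 5), "days_inactive": 1, "churned": False},
    {"day": date(2017, 1, 20), "days_inactive": 12, "churned": True},
    {"day": date(2017, 2, 3), "days_inactive": 3, "churned": False},
    {"day": date(2017, 2, 18), "days_inactive": 9, "churned": True},
    {"day": date(2017, 3, 2), "days_inactive": 2, "churned": False},
    {"day": date(2017, 3, 15), "days_inactive": 14, "churned": True},
]


def time_split(records, cutoff):
    """Split records into those before the cutoff (train) and the rest (test)."""
    train = [r for r in records if r["day"] < cutoff]
    test = [r for r in records if r["day"] >= cutoff]
    return train, test


def accuracy(records, threshold):
    """Share of records where 'inactive for >= threshold days' equals the label."""
    hits = sum((r["days_inactive"] >= threshold) == r["churned"] for r in records)
    return hits / len(records)


def fit_threshold(train):
    """Pick the inactivity threshold with the best accuracy on the training part."""
    candidates = sorted({r["days_inactive"] for r in train})
    return max(candidates, key=lambda t: accuracy(train, t))


train, test = time_split(RECORDS, cutoff=date(2017, 3, 1))
threshold = fit_threshold(train)
offline_score = accuracy(test, threshold)
```

Splitting by date rather than at random keeps future information out of training, which is the point of testing on historical data before running an A/B test on live users.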
As a rule, we use Python for working with data. Today it is the number-one language for data analysis. It has a great ecosystem and a huge number of libraries. It perfectly matches our desire to stay flexible: it’s easier to build solutions with it, and making changes along the way is less of a pain, too. And it’s Python that is used to implement the most advanced algorithms. The back-end of Flo, for example, is written entirely in Python.
With Captiv8 we used the Hadoop stack: Hadoop itself, HBase for data storage and Kafka for distributed queues. It was a justified decision since there was a lot of data. But with Flo we used PostgreSQL as the database. This option proved to be optimal in terms of development speed and reliability. PostgreSQL is a proven tool: Instagram was launched on it, and Twitch uses hundreds of servers with PostgreSQL. You cannot act like a child and play with every new tool that comes along. The more new tools you use in a project, the more potential pitfalls there are.
In the project for the game company I mentioned earlier we also used Python to train distributional models, Kafka as middleware to make the system fault-tolerant and easily extensible, and Elasticsearch, as I said, for search. It is a tool written in Java, but easily accessible from code in other languages. We did not set out to write our own Elasticsearch; we were solving the business problem using the best available tools.
Stack variation is always a trade-off between flexibility and efficiency. NoSQL, for example, is great, but using a non-relational database has its limitations: for example, storing data in an unnormalized form. If the team has a clear development plan for six months ahead, using Java, the so-called language of “bloody enterprise”, is a great idea. But if you have a hi-tech project that requires research and you need to go into production quickly, it makes more sense to use Python. We usually rewrite the computationally heavy parts in C++.
“We Feel Comfortable with Being Data Science Company from Belarus”
InData Labs frequently gets acquisition offers, but the management has no intention of selling the business. Ilya and Irina clearly outline the goals of the start-up today. They aim to become a data science and artificial intelligence center of excellence that is the best in Belarus and one of the best in the world.
Irina: We feel comfortable with being from Belarus. Development takes place in Minsk and our customers are companies from all over the world: Europe, America, Asia.
Dzianis: In general, the more geographically distant you are from the customer, the more challenging it is to cooperate. Data is a very sensitive thing. It’s not an easy task to entrust even a service company with such a closely held asset. So the only chance to win over a remote customer is to have big advantages over competitors.
Our wish to become a well-known center of excellence in the field of Big Data, machine learning and AI is a matter of survival. You are either prominent, the country is proud of you and you are recognized worldwide, or you are nothing, and no one wants to work with you. It’s either of the two.
Ilya: Today we have a lot of competitors on the market. We compete for resources, yet we are only too happy about it: this raises the overall level and qualifications of specialists. We strongly advocate for market growth and the development of data science.
Dzianis: It’s important to study other companies’ cases. No company in the AI and big data field comes up with ideas from scratch. We need to carefully analyze what has already been done in the sphere. This way we can gain a competitive edge. Any service company should have the best expertise and be one step ahead of everyone else. So it is vital for us to boost the competence of our employees.
Being a team leader, I don’t want to worry about retaining people in the team. Instead, we are trying to create an atmosphere in which they would be happy to grow. It is crucial that experts exchange views and knowledge and become part of a wider data science community. We look for employees who are passionate about the subject and we want them to see InData as a leader both on the market and in the community.
Photo credit to Andrey Davydchyk
Originally published at dev.by