Accessing multiple data sources with data scraping.
The client is a non-profit organization that supports African American small businesses and entrepreneurs. They are proud to offer services that help African American business owners secure grants and succeed in competitions.
The client regularly handles large volumes of data coming from a variety of sources, so data management has become a concern for them.
They wanted to scrape job offers, mentorship and networking opportunities for talented African American entrepreneurs from various websites and publish them on their own platform, so that entrepreneurs can easily discover businesses owned by African Americans, support them, or locate their own.
InData Labs was challenged to develop a strong data scraping solution for the client’s marketplace.
Our team of engineers has applied its data scraping expertise to enable effective data collection from various sources.
The InData Labs team had to set up the infrastructure and code flow for the client:
For code management, we used an Azure DevOps repository with a pipeline setup that allowed our team to build and push Docker images to the registry using a parallel job agent.
Next, we created an Azure Container Registry in the Azure portal to store our Docker images.
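As an illustration of the build-and-push step that the pipeline automates, here is a minimal Python sketch using the Docker SDK. The registry name, image tag, and credential variables are placeholders, not the client's actual setup.

```python
# Minimal sketch of the build-and-push step the pipeline automates.
# Registry name, image tag, and credential variables are placeholders.
import os
import docker

REGISTRY = "scrapersregistry.azurecr.io"            # hypothetical ACR login server
IMAGE = f"{REGISTRY}/scrapers/job-offers:latest"    # hypothetical image tag

client = docker.from_env()

# Authenticate against the Azure Container Registry
client.login(
    username=os.environ["ACR_USERNAME"],
    password=os.environ["ACR_PASSWORD"],
    registry=REGISTRY,
)

# Build the scraper image from the local Dockerfile and push it to the registry
image, _build_logs = client.images.build(path=".", tag=IMAGE)
for line in client.images.push(IMAGE, stream=True, decode=True):
    print(line)
```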
Then we needed to create container instances from those images using an Azure Logic App, so that the scraper code could run separately and in parallel.
During this stage, the InData Labs team created the container instances with Logic Apps. Each container then needed access to Azure resources and to sensitive data, such as passwords and connection strings, stored in Azure Key Vault.
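A minimal sketch of how a scraper container can read its secrets at startup with the Azure SDK for Python is shown below; the vault URL and secret names are hypothetical.

```python
# Minimal sketch: a scraper container reading its secrets from Key Vault.
# The vault URL and secret names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

VAULT_URL = "https://scrapers-kv.vault.azure.net"   # hypothetical vault

# DefaultAzureCredential picks up the container's managed identity
# (or environment credentials) so no passwords are baked into the image.
credential = DefaultAzureCredential()
secrets = SecretClient(vault_url=VAULT_URL, credential=credential)

db_connection_string = secrets.get_secret("db-connection-string").value
storage_connection_string = secrets.get_secret("storage-connection-string").value
```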
To store scraper outputs, our team created a Storage Account, which acts as a cloud folder for the scraped data. After that, we could start the scrapers manually, but we still needed orchestration, automation, and post-processing.
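For illustration, saving a scraper's output to the Storage Account might look like the following sketch; the connection string variable, container name, and blob path are placeholders.

```python
# Sketch of persisting a scraper's output to Blob Storage.
# Connection string, container name, and blob path are placeholders.
import json
import os
from datetime import date
from azure.storage.blob import BlobServiceClient

# The connection string would normally come from Key Vault, as shown above.
service = BlobServiceClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"]
)

scraped_items = [{"name": "Example Business", "city": "Atlanta"}]  # placeholder output

blob = service.get_blob_client(
    container="scraper-output",
    blob=f"job-offers/{date.today().isoformat()}.json",
)
blob.upload_blob(json.dumps(scraped_items), overwrite=True)
```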
Our engineers ran all the scrapers on a time trigger, within a single pipeline run, using Azure Data Factory.
The main pipeline starts all containers with requests to the Azure API and then runs Databricks notebooks to process the collected data.
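The following is a simplified Python sketch of what that orchestration amounts to: start the scraper container groups through the Azure management API, then trigger the Databricks processing job. In production this logic lives in the Data Factory pipeline; the resource group, container group names, workspace URL, and job ID are hypothetical.

```python
# Simplified sketch of what the Data Factory pipeline orchestrates:
# start the scraper container groups, then run the processing notebook.
# Resource group, container names, workspace URL, and job ID are placeholders.
import os
import requests
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerinstance import ContainerInstanceManagementClient

SUBSCRIPTION_ID = os.environ["AZURE_SUBSCRIPTION_ID"]
RESOURCE_GROUP = "rg-scrapers"                                      # hypothetical
SCRAPER_GROUPS = ["aci-jobs", "aci-mentorship", "aci-networking"]   # hypothetical

aci = ContainerInstanceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Start each scraper container group (the poller completes once the start
# operation has gone through; assumes a recent azure-mgmt-containerinstance).
pollers = [
    aci.container_groups.begin_start(RESOURCE_GROUP, name)
    for name in SCRAPER_GROUPS
]
for poller in pollers:
    poller.wait()

# Trigger the Databricks job that post-processes the scraped data
requests.post(
    "https://adb-123456789.0.azuredatabricks.net/api/2.1/jobs/run-now",  # hypothetical workspace
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"job_id": 42},  # hypothetical job id
    timeout=30,
)
```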
At this stage, we scraped the full data set from the websites (incremental loading from websites is difficult or impossible), processed it, and saved it to the database. Before loading the new data into the database, we deleted the existing data.
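Inside a Databricks notebook, that full-reload step could look roughly like the sketch below; the storage path, column names, JDBC URL, and table name are assumptions for illustration, and `spark` and `dbutils` are provided by the notebook environment.

```python
# Databricks notebook sketch of the full-reload step: read the latest scraped
# JSON from the Storage Account and overwrite the target database table.
# Storage path, column names, JDBC URL, and table name are placeholders.
raw = spark.read.json(
    "wasbs://scraper-output@scrapersstorage.blob.core.windows.net/job-offers/"
)

# Basic post-processing: drop duplicates and rows without a business name
cleaned = raw.dropDuplicates(["name", "city"]).na.drop(subset=["name"])

# mode("overwrite") replaces the existing rows, matching the
# delete-then-load approach described above.
(cleaned.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://scrapers-db.database.windows.net:1433;database=marketplace")
    .option("dbtable", "dbo.job_offers")
    .option("user", dbutils.secrets.get("scrapers", "db-user"))
    .option("password", dbutils.secrets.get("scrapers", "db-password"))
    .mode("overwrite")
    .save())
```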
As a result, the client has a robust data scraping solution that collects data from multiple sites and business listings and gathers information about businesses founded by African Americans that is useful to the subscribers of the client's platform.
Our team of data scientists and engineers drew upon multiple sources to meet the client’s data scraping needs.
Our solution has empowered the client in the following ways: