Case Study: Data Extraction & Classification AI Tool

Mad Devs created an AI-driven solution that eliminates manual data collection by automatically extracting, structuring, and classifying information from hundreds of online platforms.

Overview

Mad Devs developed an AI-based system that automates large-scale data extraction, parsing, and classification across over 100 online platforms. The project was delivered for a commercial client who requested confidentiality under a non-disclosure agreement. The system streamlines data collection, ensures consistent quality, and provides a scalable foundation for information management.

Scale of the database

[Image: scale of the database]

The project focused on designing a modular architecture capable of adapting to changing website structures, managing diverse content formats, and maintaining high throughput while keeping infrastructure costs predictable.

Challenges and Solutions

Within just a few months, Mad Devs delivered an AI-powered tool that reliably automates data collection and processing from a wide range of complex web sources.

Challenge 1: Diverse obstacles when scraping more than 100 platforms

Collecting and processing structured and unstructured data from more than 100 platforms presented multiple technical challenges. Websites differed in architecture, rendering type, and access mechanisms, requiring a unified yet flexible approach.

Key challenges included:

[Image: overview of the key scraping challenges]

Solution:

To address the diversity of sources, Mad Devs built a universal web-scraping framework with specialized modules for each challenge.

Rendering type handling

The framework automatically detects whether a site uses SSR, CSR, or API endpoints and applies the appropriate extraction strategy.
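In simplified form, this detection step can be sketched as a heuristic that probes for a JSON endpoint and measures how much visible text the static HTML carries; the helper names, thresholds, and strategy table below are illustrative assumptions, not the framework's actual API.

```python
# Illustrative heuristic for picking an extraction strategy per site.
# A site is treated as CSR if the static HTML carries little visible text,
# and as API-driven if it exposes a JSON endpoint. Names and thresholds
# are hypothetical, not the production framework.
import requests
from bs4 import BeautifulSoup

def detect_rendering_type(url: str, api_probe: str | None = None) -> str:
    """Return 'api', 'ssr', or 'csr' for a given site (rough heuristic)."""
    if api_probe:
        resp = requests.get(api_probe, timeout=15)
        if resp.ok and "application/json" in resp.headers.get("Content-Type", ""):
            return "api"  # structured endpoint available, prefer it

    html = requests.get(url, timeout=15).text
    text = BeautifulSoup(html, "html.parser").get_text(strip=True)

    # SSR pages ship most of their content in the initial HTML;
    # CSR pages ship a JS bundle and an almost empty <body>.
    return "ssr" if len(text) > 2000 else "csr"

STRATEGIES = {
    "api": "call the JSON endpoint directly",
    "ssr": "parse the static HTML with a plain HTTP client",
    "csr": "render the page in a headless browser first",
}
```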

Dynamic content rendering

Custom JavaScript routines were developed to manage infinite scroll, automate pagination, bypass lazy loading, and reproduce complex user actions such as clicks and form submissions.
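As an illustration of this kind of browser automation, the sketch below drives infinite scroll with Playwright until no new items load; the selector, round limit, and timings are placeholders rather than the project's actual scripts.

```python
# Hypothetical infinite-scroll handler using Playwright's sync API.
# Scrolls until no new items appear or a hard round limit is reached.
from playwright.sync_api import sync_playwright

def collect_items(url: str, item_selector: str, max_rounds: int = 20) -> list[str]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        seen = 0
        for _ in range(max_rounds):
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(1500)  # give lazy-loaded content time to appear
            count = page.locator(item_selector).count()
            if count == seen:            # nothing new loaded, stop scrolling
                break
            seen = count

        items = page.locator(item_selector).all_inner_texts()
        browser.close()
        return items
```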

Anti-scraping protection

The system integrates CAPTCHA-solving services, rotating proxies, and request throttling to avoid rate limits and IP blocking.
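The proxy-rotation and throttling idea reduces, in its simplest form, to a wrapper like the one below; the proxy URLs, delays, and user agent are placeholders, and the production system additionally relies on CAPTCHA-solving services.

```python
# Illustrative request wrapper: rotates proxies and throttles requests
# to stay under per-site rate limits. Proxy URLs are placeholders.
import itertools
import random
import time
import requests

PROXIES = itertools.cycle([
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
    "http://proxy-3.example.com:8080",
])

def polite_get(url: str, min_delay: float = 1.0, max_delay: float = 4.0) -> requests.Response:
    proxy = next(PROXIES)
    time.sleep(random.uniform(min_delay, max_delay))  # throttle to avoid rate limits
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "Mozilla/5.0 (compatible; data-collector)"},
        timeout=30,
    )
```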

Authentication support

Session and cookie management modules allow stable scraping of protected platforms without losing authorization.
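Conceptually, authenticated scraping comes down to establishing a session once and persisting its cookies between runs; the sketch below shows that pattern with requests, using hypothetical login form fields and file paths.

```python
# Hypothetical session manager: logs in once, persists cookies to disk,
# and reuses them so later runs stay authorized.
import pickle
from pathlib import Path
import requests

COOKIE_FILE = Path("session_cookies.pkl")

def get_session(login_url: str, username: str, password: str) -> requests.Session:
    session = requests.Session()
    if COOKIE_FILE.exists():
        session.cookies.update(pickle.loads(COOKIE_FILE.read_bytes()))
        return session

    # Form field names depend on the target platform's login page.
    session.post(login_url, data={"username": username, "password": password})
    COOKIE_FILE.write_bytes(pickle.dumps(session.cookies))
    return session
```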

Together, these capabilities created a scalable and adaptable system that could process data consistently across all websites.

[Diagram: high-level system architecture]

This diagram illustrates the high-level architecture: AgentRunner orchestrates data extraction and processing in parallel, ensuring scalability across 150+ websites.

Challenge 2: Data quality control and standardization

Data collected from multiple platforms arrived in different formats, with inconsistent field structures and potential duplicates. Reliable data quality controls were essential for maintaining consistency and accuracy across the entire dataset.

To ensure data quality, a monitoring and validation framework was introduced (a simplified metric sketch follows the list):

  • Continuous tracking of data quality metrics.
  • Completeness and coverage measurement for parsed records.
  • Field-level accuracy validation.
  • Cost monitoring for LLM-based components within the parsing pipeline.
  • Visualization of key indicators via Metabase dashboards.
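A simplified version of the batch-level metric calculation might look like the sketch below; the required fields and report structure are illustrative, while in the real pipeline such results are stored in Postgres and visualized in Metabase.

```python
# Simplified quality-metric computation for a batch of parsed records.
# Field names are illustrative; in the real pipeline the results feed
# Postgres tables that back the Metabase dashboards.
from dataclasses import dataclass

REQUIRED_FIELDS = ["title", "category", "published_at", "source_url"]

@dataclass
class BatchReport:
    completeness: float   # share of required fields that are filled
    coverage: float       # share of records with all required fields present
    llm_cost_usd: float   # accumulated cost of LLM calls for the batch

def evaluate_batch(records: list[dict], llm_cost_usd: float) -> BatchReport:
    filled = sum(
        1 for r in records for f in REQUIRED_FIELDS if r.get(f) not in (None, "")
    )
    complete_records = sum(
        1 for r in records if all(r.get(f) not in (None, "") for f in REQUIRED_FIELDS)
    )
    total_fields = len(records) * len(REQUIRED_FIELDS) or 1
    return BatchReport(
        completeness=filled / total_fields,
        coverage=complete_records / max(len(records), 1),
        llm_cost_usd=llm_cost_usd,
    )
```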

Solution:

An AI-driven parsing system was implemented with specialized LLM-based components for each stage of data processing (a minimal orchestration sketch follows the list):

  • Extraction pipeline for asynchronous processing of large datasets.
  • Entity recognition agents to detect relevant data fields.
  • Summarization agents to generate concise and structured text.
  • Adapters for processing both textual and tabular data (PDF, DOC, XLSX, Markdown).
  • Validation layer for ensuring accuracy and removing inconsistencies.
  • Agent orchestration layer with model switching and tool integration.
  • Configurable model settings directly available through the user interface.
  • LangGraph-based architecture prepared for scalable deployment.
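To make the orchestration idea concrete, here is a minimal LangGraph-style sketch of an extract, summarize, and validate flow; the state shape, node bodies, and Pydantic schema are illustrative stand-ins for the project's actual agents and model calls.

```python
# Minimal sketch of a LangGraph pipeline: extraction -> summarization -> validation.
# Node bodies are stubs; in the real system each node wraps an LLM-based agent.
from typing import TypedDict
from pydantic import BaseModel, ValidationError
from langgraph.graph import StateGraph, END

class Record(BaseModel):              # illustrative target schema
    title: str
    category: str
    summary: str

class PipelineState(TypedDict, total=False):
    raw_html: str
    fields: dict
    summary: str
    valid: bool

def extract_entities(state: PipelineState) -> PipelineState:
    # Placeholder for the entity-recognition agent.
    return {"fields": {"title": "Example title", "category": "Example category"}}

def summarize(state: PipelineState) -> PipelineState:
    # Placeholder for the summarization agent.
    return {"summary": "Short structured summary of the record."}

def validate(state: PipelineState) -> PipelineState:
    # Schema check with Pydantic stands in for the validation layer.
    try:
        Record(**state.get("fields", {}), summary=state.get("summary", ""))
        return {"valid": True}
    except ValidationError:
        return {"valid": False}

graph = StateGraph(PipelineState)
graph.add_node("extract", extract_entities)
graph.add_node("summarize", summarize)
graph.add_node("validate", validate)
graph.set_entry_point("extract")
graph.add_edge("extract", "summarize")
graph.add_edge("summarize", "validate")
graph.add_edge("validate", END)

pipeline = graph.compile()
result = pipeline.invoke({"raw_html": "<html></html>"})
```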

[Diagram: specialized parsing and validation agents]

This diagram shows the specialized agents responsible for ensuring accurate classification, duplication control, and standardized outputs.

Challenge 3: Performance and efficiency

The system needed to process hundreds of records daily while keeping infrastructure costs stable. Manual workflows and static infrastructure previously limited throughput and flexibility.

Solution:

A parallel data processing architecture was introduced, allowing multiple AI models to work concurrently on batches of records. Different models were applied depending on the task:

  • Entity recognition models to identify key entities, attributes, and relationships within extracted data.
  • Summarization models to generate concise, structured overviews from unstructured text.
  • Classification models to categorize records according to predefined taxonomies or data schemas.
  • Validation models to cross-check outputs, remove duplicates, and ensure data consistency and accuracy.

These models are orchestrated within a Microsoft Azure environment designed for both scalability and cost-efficiency. Autoscaling ensures compute nodes are provisioned only when parsing jobs are running and automatically released after completion. Spot instances are used for time-bounded workloads, reducing expenses while maintaining performance. Task orchestration with Kubernetes allows jobs to run in parallel with minimal idle time, while Grafana and VictoriaMetrics provide real-time visibility into infrastructure load and model-related costs.
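The parallel-processing pattern can be illustrated with the asyncio sketch below, which fans a batch of records out across task-specific agents under a concurrency cap; the agent coroutines are stubs standing in for the deployed entity recognition, summarization, classification, and validation models.

```python
# Illustrative fan-out of a record batch across task-specific agents.
# The agent coroutines are stubs standing in for calls to the deployed models.
import asyncio

async def recognize_entities(record: dict) -> dict:
    return {"entities": []}          # placeholder for the entity-recognition model

async def summarize(record: dict) -> dict:
    return {"summary": ""}           # placeholder for the summarization model

async def classify(record: dict) -> dict:
    return {"category": "unknown"}   # placeholder for the classification model

async def validate(record: dict) -> dict:
    return {"valid": True}           # placeholder for the validation model

async def process_record(record: dict) -> dict:
    for agent in (recognize_entities, summarize, classify, validate):
        record |= await agent(record)
    return record

async def process_batch(records: list[dict], concurrency: int = 20) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)   # cap concurrent model calls

    async def bounded(record: dict) -> dict:
        async with semaphore:
            return await process_record(record)

    return list(await asyncio.gather(*(bounded(r) for r in records)))

# Example: asyncio.run(process_batch(batch_of_records))
```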

As a result, the platform processes large data volumes at scale while keeping infrastructure expenses at approximately €750 per month.

Results

The collaboration between Mad Devs and the client resulted in an automated, AI-powered data extraction and classification system that transformed the client’s operations:

Automated large-scale data collection

Continuous extraction and processing from 100+ web platforms, replacing manual monitoring and input.

Higher processing capacity

Parallelized architecture boosted daily processing capacity from thousands to tens of thousands of records.

Improved data quality

LLM-based validation significantly reduced duplicates and formatting inconsistencies, improving dataset reliability.

Predictable infrastructure costs

Cloud autoscaling and spot-instance optimization stabilized monthly expenses at around €750 while maintaining performance.

Future-ready architecture

The system’s component-based design allows quick integration of new data formats and processing pipelines without service downtime.

Tech stack

Backend:

  • LangGraph
  • LangChain
  • Pydantic
  • Postgres
  • Metabase
  • Python
  • Django
  • Apache Airflow
  • Crawl4AI
  • OpenRouter
  • OpenAI
  • Google Gemini
  • Claude

Infrastructure:

  • Microsoft Azure
  • Kubernetes
  • Docker
  • GitHub Actions
  • Grafana
  • VictoriaMetrics

Meet the team

  • Nakylai Taiirova
    Main Backend Developer

  • Anton Kozlov
    Backend Consultant

  • Pavel Silaenkov
    ML Engineer

  • Farida Bagirova
    Junior ML Engineer

  • Roman Panarin
    Lead ML Engineer, Consultant

  • Alexander Bryl
    Lead ML Engineer, Consultant

  • Jamilya Baigazieva
    DevOps Engineer

  • Dmitrii Khalezin
    Lead DevOps Engineer

  • Alice Jang
    Delivery Manager

  • Maksim Pankov
    Project Manager