Data Extraction & Classification AI Tool
Mad Devs created an AI-driven solution that eliminates manual data collection by automatically extracting, structuring, and classifying information from hundreds of online platforms.
Overview
Mad Devs developed an AI-based system that automates large-scale data extraction, parsing, and classification across over 100 online platforms. The project was delivered for a commercial client who requested confidentiality under a non-disclosure agreement. The system streamlines data collection, ensures consistent quality, and provides a scalable foundation for information management.
Scale of the database

The project focused on designing a modular architecture capable of adapting to changing website structures, managing diverse content formats, and maintaining high throughput while keeping infrastructure costs predictable.
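To illustrate what such modularity can look like in Python, here is a minimal sketch of a per-platform adapter contract; the names (PlatformAdapter, Record, pick_adapter) are hypothetical and stand in for the system's actual interfaces:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Record:
    """A single extracted item in a platform-agnostic shape."""
    source_url: str
    fields: dict[str, str]


class PlatformAdapter(Protocol):
    """Contract every site-specific module implements, so a new or
    restructured website only requires swapping one adapter."""

    name: str

    def matches(self, url: str) -> bool:
        """Return True if this adapter handles the given URL."""
        ...

    def extract(self, html: str, url: str) -> list[Record]:
        """Parse raw HTML into normalized records."""
        ...


def pick_adapter(adapters: list[PlatformAdapter], url: str) -> PlatformAdapter:
    # First adapter that claims the URL wins; a generic catch-all
    # adapter can be registered last as a fallback.
    for adapter in adapters:
        if adapter.matches(url):
            return adapter
    raise LookupError(f"no adapter registered for {url}")
```

Keeping site-specific logic behind one narrow interface like this is what lets the rest of the pipeline stay unchanged when an individual website redesigns its pages.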
Challenges and Solutions
Within just a few months, Mad Devs delivered an AI-powered tool that reliably automates data collection and processing from a wide range of complex web sources.
Challenge 1: Diverse obstacles when scraping more than 100 platforms
Collecting and processing structured and unstructured data from more than 100 platforms presented multiple technical challenges. Websites differed in architecture, rendering type, and access mechanisms, requiring a unified yet flexible approach.
Key challenges included:
- Widely varying site architectures and page structures.
- Mixed rendering types, from static HTML to JavaScript-heavy pages.
- Differing access mechanisms and restrictions from platform to platform.
Solution:
To address the diversity of sources, Mad Devs built a universal web-scraping framework with specialized modules for each challenge.
Together, these capabilities created a scalable and adaptable system that could process data consistently across all websites.
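Since Crawl4AI is part of the stack (see Tech stack below), the unified entry point for both static and JavaScript-rendered sources could look roughly like this; the fetch_page helper is illustrative, not the framework's actual code:

```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def fetch_page(url: str) -> str | None:
    """Fetch a page with Crawl4AI, which drives a headless browser,
    so static and JavaScript-heavy sites share one entry point."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return result.html if result.success else None


if __name__ == "__main__":
    html = asyncio.run(fetch_page("https://example.com"))
    print(len(html or ""))
```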

This diagram illustrates the high-level architecture: AgentRunner orchestrates data extraction and processing in parallel, ensuring scalability across 150+ websites.
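A minimal sketch of how this kind of parallel orchestration can be expressed with asyncio; the concurrency cap and the process_site stub are assumptions, not the actual AgentRunner internals:

```python
import asyncio

MAX_CONCURRENCY = 20  # illustrative cap; tuned to infrastructure limits


async def process_site(url: str) -> list[dict]:
    # Placeholder: in the real pipeline this would fetch, parse,
    # and classify one source (see the sketches above).
    await asyncio.sleep(0)
    return [{"source": url}]


async def run_all(urls: list[str]) -> list[list[dict]]:
    # A semaphore bounds how many sources are crawled at once, so
    # 150+ sites can be processed in parallel without overload.
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(url: str) -> list[dict]:
        async with sem:
            return await process_site(url)

    return await asyncio.gather(*(bounded(u) for u in urls))


if __name__ == "__main__":
    sites = [f"https://site-{i}.example" for i in range(3)]
    print(asyncio.run(run_all(sites)))
```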
Challenge 2: Data quality control and standardization
Data collected from multiple platforms arrived in different formats, with inconsistent field structures and potential duplicates. Reliable data quality controls were essential for maintaining consistency and accuracy across the entire dataset.
Solution:
To ensure data quality, a monitoring and validation framework was introduced (a code sketch follows the list):
- Continuous tracking of data quality metrics.
- Completeness and coverage measurement for parsed records.
- Field-level accuracy validation.
- Cost monitoring for LLM-based components within the parsing pipeline.
- Visualization of key indicators via Metabase dashboards.
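With Pydantic in the stack, field-level validation and the kind of completeness metric charted in Metabase could be sketched as follows; the ParsedRecord schema is illustrative, not the client's actual field set:

```python
from pydantic import BaseModel, ValidationError


class ParsedRecord(BaseModel):
    # Illustrative schema; the real field set is client-specific.
    title: str
    category: str
    source_url: str
    description: str | None = None


def validate_batch(raw_rows: list[dict]) -> tuple[list[ParsedRecord], dict]:
    """Validate rows against the schema and compute completeness
    metrics of the kind a Metabase dashboard could chart."""
    valid: list[ParsedRecord] = []
    errors = 0
    for row in raw_rows:
        try:
            valid.append(ParsedRecord(**row))
        except ValidationError:
            errors += 1
    total = len(raw_rows)
    metrics = {
        "records_total": total,
        "records_valid": len(valid),
        "completeness": len(valid) / total if total else 0.0,
    }
    return valid, metrics


valid, metrics = validate_batch([
    {"title": "A", "category": "news", "source_url": "https://example.com/a"},
    {"title": "B"},  # missing required fields -> counted as an error
])
print(metrics)  # {'records_total': 2, 'records_valid': 1, 'completeness': 0.5}
```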
Building on these controls, an AI-driven parsing system was implemented with specialized LLM-based components for each stage of data processing:

This diagram shows the specialized agents responsible for ensuring accurate classification, duplication control, and standardized outputs.
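A simplified sketch of how such a staged agent pipeline can be wired with LangGraph, which the project uses; the node bodies below are placeholders for the actual LLM calls:

```python
from typing import TypedDict

from langgraph.graph import END, START, StateGraph


class PipelineState(TypedDict):
    record: dict
    category: str
    is_duplicate: bool


# Each node stands in for an LLM-backed agent; the real system
# would call a model (e.g. via OpenRouter) inside each one.
def classify(state: PipelineState) -> dict:
    return {"category": "uncategorized"}  # LLM classification goes here


def dedupe(state: PipelineState) -> dict:
    return {"is_duplicate": False}  # duplicate detection goes here


def standardize(state: PipelineState) -> dict:
    return {"record": {**state["record"], "category": state["category"]}}


graph = StateGraph(PipelineState)
graph.add_node("classify", classify)
graph.add_node("dedupe", dedupe)
graph.add_node("standardize", standardize)
graph.add_edge(START, "classify")
graph.add_edge("classify", "dedupe")
graph.add_edge("dedupe", "standardize")
graph.add_edge("standardize", END)
pipeline = graph.compile()

result = pipeline.invoke(
    {"record": {"title": "example"}, "category": "", "is_duplicate": False}
)
```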
Challenge 3: Performance and efficiency
The system needed to process hundreds of records daily while keeping infrastructure costs stable. Manual workflows and static infrastructure previously limited throughput and flexibility.
Solution:
A parallel data processing architecture was introduced, allowing multiple AI models to work concurrently on batches of records. Different models were applied depending on the task (see the sketch after this list):
- Entity recognition models to identify key entities, attributes, and relationships within extracted data.
- Summarization models to generate concise, structured overviews from unstructured text.
- Classification models to categorize records according to predefined taxonomies or data schemas.
- Validation models to cross-check outputs, remove duplicates, and ensure data consistency and accuracy.
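As a sketch of per-task model routing through OpenRouter's OpenAI-compatible API: the model IDs and task prompts below are illustrative, not the project's actual configuration:

```python
import asyncio

from openai import AsyncOpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so one client
# can route each task to a different underlying model.
client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # requires a real key to run
)

MODEL_BY_TASK = {
    "entities": "openai/gpt-4o-mini",
    "summary": "google/gemini-flash-1.5",
    "classify": "anthropic/claude-3.5-haiku",
}


async def run_task(task: str, text: str) -> str:
    resp = await client.chat.completions.create(
        model=MODEL_BY_TASK[task],
        messages=[{"role": "user", "content": f"{task}:\n{text}"}],
    )
    return resp.choices[0].message.content or ""


async def process_record(text: str) -> dict:
    # The three task-specific models run concurrently per record.
    entities, summary, label = await asyncio.gather(
        run_task("entities", text),
        run_task("summary", text),
        run_task("classify", text),
    )
    return {"entities": entities, "summary": summary, "category": label}
```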
These models are orchestrated within a Microsoft Azure environment designed for both scalability and cost-efficiency. Autoscaling ensures compute nodes are provisioned only when parsing jobs are running and automatically released after completion. Spot instances are used for time-bounded workloads, reducing expenses while maintaining performance. Task orchestration with Kubernetes allows jobs to run in parallel with minimal idle time, while Grafana and VictoriaMetrics provide real-time visibility into infrastructure load and model-related costs.
As a result, the platform processes large data volumes at scale while keeping infrastructure expenses at approximately €750 per month.
Results
The collaboration between Mad Devs and our client resulted in an automated, AI-powered data extraction and classification system that transformed the client's operations.
Tech stack
Backend:
LangGraph
LangChain
Pydantic
Postgres
Metabase
Python
Django
Apache Airflow
Crawl4AI
OpenRouter
OpenAI
Google Gemini
Claude
Infrastructure:
Microsoft Azure
Kubernetes
Docker
GitHub Actions
Grafana
VictoriaMetrics
Meet the team

Nakylai Taiirova
Main Backend Developer

Anton Kozlov
Backend Consultant

Pavel Silaenkov
ML Engineer

Farida Bagirova
Junior ML Engineer

Roman Panarin
Lead ML Engineer, Consultant

Alexander Bryl
Lead ML Engineer, Consultant

Jamilya Baigazieva
DevOps Engineer

Dmitrii Khalezin
Lead DevOps Engineer

Alice Jang
Delivery Manager

Maksim Pankov
Project Manager