Case Study: Data Extraction & Classification AI Tool

Mad Devs created an AI-driven solution that eliminates manual data collection by automatically extracting, structuring, and classifying information from hundreds of online platforms.

Overview

Mad Devs developed an AI-based system that automates large-scale data extraction, parsing, and classification across over 100 online platforms. The project was delivered for a commercial client who requested confidentiality under a non-disclosure agreement. The system streamlines data collection, ensures consistent quality, and provides a scalable foundation for information management.

Scale of the database

[Image: scale of the database]

The project focused on designing a modular architecture capable of adapting to changing website structures, managing diverse content formats, and maintaining high throughput while keeping infrastructure costs predictable.

Challenges and Solutions

Within just a few months, Mad Devs delivered an AI-powered tool that reliably automates data collection and processing from a wide range of complex web sources.

Challenge 1: Diverse obstacles when scraping more than 100 platforms

Collecting and processing structured and unstructured data from more than 100 platforms presented multiple technical challenges. Websites differed in architecture, rendering type, and access mechanisms, requiring a unified yet flexible approach.

Key challenges included:

[Image: overview of the key scraping challenges]

Solution:

To address the diversity of sources, Mad Devs built a universal web-scraping framework with specialized modules for each challenge.

Rendering type handling

The framework automatically detects whether a site uses SSR, CSR, or API endpoints and applies the appropriate extraction strategy.
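In simplified form, this detection step can be sketched as a heuristic that probes for a JSON endpoint and measures how much visible text the static HTML carries; the helper names, thresholds, and strategy table below are illustrative assumptions, not the framework's actual API.

```python
# Illustrative heuristic for picking an extraction strategy per site.
# A site is treated as CSR if the static HTML carries little visible text,
# and as API-driven if it exposes a JSON endpoint. Names and thresholds
# are hypothetical, not the production framework.
import requests
from bs4 import BeautifulSoup

def detect_rendering_type(url: str, api_probe: str | None = None) -> str:
    """Return 'api', 'ssr', or 'csr' for a given site (rough heuristic)."""
    if api_probe:
        resp = requests.get(api_probe, timeout=15)
        if resp.ok and "application/json" in resp.headers.get("Content-Type", ""):
            return "api"  # structured endpoint available, prefer it

    html = requests.get(url, timeout=15).text
    text = BeautifulSoup(html, "html.parser").get_text(strip=True)

    # SSR pages ship most of their content in the initial HTML;
    # CSR pages ship a JS bundle and an almost empty <body>.
    return "ssr" if len(text) > 2000 else "csr"

STRATEGIES = {
    "api": "call the JSON endpoint directly",
    "ssr": "parse the static HTML with a plain HTTP client",
    "csr": "render the page in a headless browser first",
}
```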

Dynamic content rendering

Custom JavaScript routines were developed to manage infinite scroll, automate pagination, bypass lazy loading, and reproduce complex user actions such as clicks and form submissions.
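As an illustration of this kind of browser automation, the sketch below drives infinite scroll with Playwright until no new items load; the selector, round limit, and timings are placeholders rather than the project's actual scripts.

```python
# Hypothetical infinite-scroll handler using Playwright's sync API.
# Scrolls until no new items appear or a hard round limit is reached.
from playwright.sync_api import sync_playwright

def collect_items(url: str, item_selector: str, max_rounds: int = 20) -> list[str]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        seen = 0
        for _ in range(max_rounds):
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(1500)  # give lazy-loaded content time to appear
            count = page.locator(item_selector).count()
            if count == seen:            # nothing new loaded, stop scrolling
                break
            seen = count

        items = page.locator(item_selector).all_inner_texts()
        browser.close()
        return items
```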

Anti-scraping protection

The system integrates CAPTCHA-solving services, rotating proxies, and request throttling to avoid rate limits and IP blocking.
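The proxy-rotation and throttling idea reduces, in its simplest form, to a wrapper like the one below; the proxy URLs, delays, and user agent are placeholders, and the production system additionally relies on CAPTCHA-solving services.

```python
# Illustrative request wrapper: rotates proxies and throttles requests
# to stay under per-site rate limits. Proxy URLs are placeholders.
import itertools
import random
import time
import requests

PROXIES = itertools.cycle([
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
    "http://proxy-3.example.com:8080",
])

def polite_get(url: str, min_delay: float = 1.0, max_delay: float = 4.0) -> requests.Response:
    proxy = next(PROXIES)
    time.sleep(random.uniform(min_delay, max_delay))  # throttle to avoid rate limits
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "Mozilla/5.0 (compatible; data-collector)"},
        timeout=30,
    )
```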

Authentication support

Session and cookie management modules allow stable scraping of protected platforms without losing authorization.
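Conceptually, authenticated scraping comes down to establishing a session once and persisting its cookies between runs; the sketch below shows that pattern with requests, using hypothetical login form fields and file paths.

```python
# Hypothetical session manager: logs in once, persists cookies to disk,
# and reuses them so later runs stay authorized.
import pickle
from pathlib import Path
import requests

COOKIE_FILE = Path("session_cookies.pkl")

def get_session(login_url: str, username: str, password: str) -> requests.Session:
    session = requests.Session()
    if COOKIE_FILE.exists():
        session.cookies.update(pickle.loads(COOKIE_FILE.read_bytes()))
        return session

    # Form field names depend on the target platform's login page.
    session.post(login_url, data={"username": username, "password": password})
    COOKIE_FILE.write_bytes(pickle.dumps(session.cookies))
    return session
```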

Together, these capabilities created a scalable and adaptable system that could process data consistently across all websites.

[Diagram: high-level system architecture]

This diagram illustrates the high-level architecture: AgentRunner orchestrates data extraction and processing in parallel, ensuring scalability across 150+ websites.

Challenge 2: Data quality control and standardization

Data collected from multiple platforms arrived in different formats, with inconsistent field structures and potential duplicates. Reliable data quality controls were essential for maintaining consistency and accuracy across the entire dataset.

To ensure data quality, a monitoring and validation framework was introduced (a simplified metric sketch follows the list):

  • Continuous tracking of data quality metrics.
  • Completeness and coverage measurement for parsed records.
  • Field-level accuracy validation.
  • Cost monitoring for LLM-based components within the parsing pipeline.
  • Visualization of key indicators via Metabase dashboards.
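A simplified version of the batch-level metric calculation might look like the sketch below; the required fields and report structure are illustrative, while in the real pipeline such results are stored in Postgres and visualized in Metabase.

```python
# Simplified quality-metric computation for a batch of parsed records.
# Field names are illustrative; in the real pipeline the results feed
# Postgres tables that back the Metabase dashboards.
from dataclasses import dataclass

REQUIRED_FIELDS = ["title", "category", "published_at", "source_url"]

@dataclass
class BatchReport:
    completeness: float   # share of required fields that are filled
    coverage: float       # share of records with all required fields present
    llm_cost_usd: float   # accumulated cost of LLM calls for the batch

def evaluate_batch(records: list[dict], llm_cost_usd: float) -> BatchReport:
    filled = sum(
        1 for r in records for f in REQUIRED_FIELDS if r.get(f) not in (None, "")
    )
    complete_records = sum(
        1 for r in records if all(r.get(f) not in (None, "") for f in REQUIRED_FIELDS)
    )
    total_fields = len(records) * len(REQUIRED_FIELDS) or 1
    return BatchReport(
        completeness=filled / total_fields,
        coverage=complete_records / max(len(records), 1),
        llm_cost_usd=llm_cost_usd,
    )
```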

Solution:

An AI-driven parsing system was implemented with specialized LLM-based components for each stage of data processing (a minimal orchestration sketch follows the list):

  • Extraction pipeline for asynchronous processing of large datasets.
  • Entity recognition agents to detect relevant data fields.
  • Summarization agents to generate concise and structured text.
  • Adapters for processing both textual and tabular data (PDF, DOC, XLSX, Markdown).
  • Validation layer for ensuring accuracy and removing inconsistencies.
  • Agent orchestration layer with model switching and tool integration.
  • Configurable model settings directly available through the user interface.
  • LangGraph-based architecture prepared for scalable deployment.
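To make the orchestration idea concrete, here is a minimal LangGraph-style sketch of an extract, summarize, and validate flow; the state shape, node bodies, and Pydantic schema are illustrative stand-ins for the project's actual agents and model calls.

```python
# Minimal sketch of a LangGraph pipeline: extraction -> summarization -> validation.
# Node bodies are stubs; in the real system each node wraps an LLM-based agent.
from typing import TypedDict
from pydantic import BaseModel, ValidationError
from langgraph.graph import StateGraph, END

class Record(BaseModel):              # illustrative target schema
    title: str
    category: str
    summary: str

class PipelineState(TypedDict, total=False):
    raw_html: str
    fields: dict
    summary: str
    valid: bool

def extract_entities(state: PipelineState) -> PipelineState:
    # Placeholder for the entity-recognition agent.
    return {"fields": {"title": "Example title", "category": "Example category"}}

def summarize(state: PipelineState) -> PipelineState:
    # Placeholder for the summarization agent.
    return {"summary": "Short structured summary of the record."}

def validate(state: PipelineState) -> PipelineState:
    # Schema check with Pydantic stands in for the validation layer.
    try:
        Record(**state.get("fields", {}), summary=state.get("summary", ""))
        return {"valid": True}
    except ValidationError:
        return {"valid": False}

graph = StateGraph(PipelineState)
graph.add_node("extract", extract_entities)
graph.add_node("summarize", summarize)
graph.add_node("validate", validate)
graph.set_entry_point("extract")
graph.add_edge("extract", "summarize")
graph.add_edge("summarize", "validate")
graph.add_edge("validate", END)

pipeline = graph.compile()
result = pipeline.invoke({"raw_html": "<html></html>"})
```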

[Diagram: specialized parsing and validation agents]

This diagram shows the specialized agents responsible for ensuring accurate classification, duplication control, and standardized outputs.

Challenge 3: Performance and efficiency

The system needed to process hundreds of records daily while keeping infrastructure costs stable. Manual workflows and static infrastructure previously limited throughput and flexibility.

Solution:

A parallel data processing architecture was introduced, allowing multiple AI models to work concurrently on batches of records. Different models were applied depending on the task:

  • Entity recognition models to identify key entities, attributes, and relationships within extracted data.
  • Summarization models to generate concise, structured overviews from unstructured text.
  • Classification models to categorize records according to predefined taxonomies or data schemas.
  • Validation models to cross-check outputs, remove duplicates, and ensure data consistency and accuracy.

These models are orchestrated within a Microsoft Azure environment designed for both scalability and cost-efficiency. Autoscaling ensures compute nodes are provisioned only when parsing jobs are running and automatically released after completion. Spot instances are used for time-bounded workloads, reducing expenses while maintaining performance. Task orchestration with Kubernetes allows jobs to run in parallel with minimal idle time, while Grafana and VictoriaMetrics provide real-time visibility into infrastructure load and model-related costs.
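The parallel-processing pattern can be illustrated with the asyncio sketch below, which fans a batch of records out across task-specific agents under a concurrency cap; the agent coroutines are stubs standing in for the deployed entity recognition, summarization, classification, and validation models.

```python
# Illustrative fan-out of a record batch across task-specific agents.
# The agent coroutines are stubs standing in for calls to the deployed models.
import asyncio

async def recognize_entities(record: dict) -> dict:
    return {"entities": []}          # placeholder for the entity-recognition model

async def summarize(record: dict) -> dict:
    return {"summary": ""}           # placeholder for the summarization model

async def classify(record: dict) -> dict:
    return {"category": "unknown"}   # placeholder for the classification model

async def validate(record: dict) -> dict:
    return {"valid": True}           # placeholder for the validation model

async def process_record(record: dict) -> dict:
    for agent in (recognize_entities, summarize, classify, validate):
        record |= await agent(record)
    return record

async def process_batch(records: list[dict], concurrency: int = 20) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)   # cap concurrent model calls

    async def bounded(record: dict) -> dict:
        async with semaphore:
            return await process_record(record)

    return list(await asyncio.gather(*(bounded(r) for r in records)))

# Example: asyncio.run(process_batch(batch_of_records))
```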

As a result, the platform processes large data volumes at scale while keeping infrastructure expenses at approximately €750 per month.

Results

The collaboration between Mad Devs and the client resulted in an automated, AI-powered data extraction and classification system that transformed the client’s operations:

Automated large-scale data collection

Continuous extraction and processing from 100+ web platforms, replacing manual monitoring and input.

Higher processing capacity

Parallelized architecture boosted daily processing capacity from thousands to tens of thousands of records.

Improved data quality

LLM-based validation significantly reduced duplicates and formatting inconsistencies, improving dataset reliability.

Predictable infrastructure costs

Cloud autoscaling and spot-instance optimization stabilized monthly expenses at around €750 while maintaining performance.

Future-ready architecture

The system’s component-based design allows quick integration of new data formats and processing pipelines without service downtime.

Tech stack

Backend:

  • LangGraph
  • LangChain
  • Pydantic
  • Postgres
  • Metabase
  • Python
  • Django
  • Apache Airflow
  • Crawl4AI
  • OpenRouter
  • OpenAI
  • Google Gemini
  • Claude

Infrastructure:

  • Microsoft Azure
  • Kubernetes
  • Docker
  • GitHub Actions
  • Grafana
  • VictoriaMetrics

Meet the team

  • Nakylai Taiirova
    Main Backend Developer

  • Anton Kozlov
    Backend Consultant

  • Pavel Silaenkov
    ML Engineer

  • Farida Bagirova
    Junior ML Engineer

  • Roman Panarin
    Lead ML Engineer, Consultant

  • Alexander Bryl
    Lead ML Engineer, Consultant

  • Jamilya Baigazieva
    DevOps Engineer

  • Dmitrii Khalezin
    Lead DevOps Engineer

  • Alice Jang
    Delivery Manager

  • Maksim Pankov
    Project Manager