Stop Saving Bad Data: Validating Scraped Content in Real-Time

We've all been there. You spend hours perfecting a scraper, set it to run overnight, and wake up to a 50MB CSV file. But when you open it, your heart sinks. Half the price columns are empty, the product titles are actually "Access Denied" messages, and your rating field contains random HTML tags.

The website changed its layout at 2:00 AM, and your scraper—blissfully unaware—kept right on "working."

In web scraping, this is a silent failure. Because the script didn't crash, your monitoring tools didn't alert you, but the resulting data is useless. To solve this, we need to move away from the "scrape now, clean later" mindset.

Instead, we need to implement Data Quality Gates. This is a validation layer that sits between your extraction logic and your database, ensuring only data meeting your exact standards ever gets saved.

Why "Look Before You Save" Matters

Traditionally, scraping pipelines follow a linear path: Extract -> Save -> Clean. The problem is that by the time you realize the data is corrupted, you may have already overwritten good data or wasted thousands of proxy credits on garbage results.

By implementing a Data Quality Gate, we shift to a more reliable workflow: Extract -> Validate -> Save. This offers three major advantages:

Fail Fast: You catch a broken CSS selector on the very first page, allowing you to stop the crawl before wasting resources.
Schema Guarantee: Downstream applications, such as price trackers or machine learning models, can trust that the data matches the expected format.
Precise Debugging: Instead of seeing a generic NoneType error, your logs will tell you exactly which field failed: Value error, price: 'Out of Stock' is not a valid float.

Tooling: Why Pydantic?

While you could write manual if statements to check your data, it quickly becomes a maintenance nightmare. Consider this manual check:

# The manual approach that doesn't scale
def validate_product(data):
    if not data.get('name'):
        return False
    if 'price' in data:
        try:
            float(data['price'].replace('$', ''))
        except ValueError:
            return False
    return True

This is brittle: every new field needs another branch, and the function only tells you that a record failed, not why. Instead, use Pydantic, a widely used data validation library for Python. Pydantic uses Python type hints to enforce schemas. It doesn't just check types; by default it also coerces compatible values. If you send the string "10.99" to a float field, Pydantic handles the conversion for you.
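To make the coercion behavior concrete, here is a minimal sketch assuming Pydantic v2 in its default (lax) mode. The one-field PriceOnly model is hypothetical and exists only for this illustration.

```python
from pydantic import BaseModel, ValidationError


class PriceOnly(BaseModel):  # hypothetical one-field model, for illustration only
    price: float


# A numeric string is coerced to a float
item = PriceOnly(price="10.99")

# A non-numeric string is rejected, and the error names the field
try:
    PriceOnly(price="Out of Stock")
    failed_fields = []
except ValidationError as exc:
    failed_fields = [err["loc"] for err in exc.errors()]
```

Each entry returned by errors() carries a loc tuple that identifies the failing field, which is what makes field-level logging possible later on.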

Step 1: Defining the Data Schema

Let's build a practical example. Suppose we are scraping product data from an e-commerce site. We define what a valid product looks like using Pydantic's BaseModel.

from pydantic import BaseModel, Field, HttpUrl
from typing import Optional

class ProductSchema(BaseModel):
    # Required fields
    product_id: str = Field(min_length=5)
    name: str = Field(min_length=1)
    price: float = Field(gt=0) # Price must be greater than 0
    url: HttpUrl

    # Optional fields
    description: Optional[str] = None
    rating: Optional[float] = Field(None, ge=0, le=5) # 0 to 5 scale
    review_count: int = 0

In this schema, we’ve set strict rules. A product without a price or a product_id is considered invalid, while a missing description is acceptable.
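As a quick check of those rules, the sketch below repeats the schema and runs two sample records through it. The failed_fields helper, the sample values, and the example.com URL are assumptions made for illustration, not part of any real site.

```python
from typing import Optional

from pydantic import BaseModel, Field, HttpUrl, ValidationError


class ProductSchema(BaseModel):
    product_id: str = Field(min_length=5)
    name: str = Field(min_length=1)
    price: float = Field(gt=0)
    url: HttpUrl
    description: Optional[str] = None
    rating: Optional[float] = Field(None, ge=0, le=5)
    review_count: int = 0


def failed_fields(raw: dict) -> list:
    """Hypothetical helper: return the sorted names of fields that fail validation."""
    try:
        ProductSchema(**raw)
        return []
    except ValidationError as exc:
        return sorted({str(err["loc"][0]) for err in exc.errors()})


# No description: still valid, because the field is optional
complete = {
    "product_id": "SKU-12345",
    "name": "Desk Lamp",
    "price": 19.99,
    "url": "https://example.com/p/1",
}

# No price and an out-of-range rating: both fields are reported
broken = {
    "product_id": "SKU-12345",
    "name": "Desk Lamp",
    "url": "https://example.com/p/1",
    "rating": 7,
}

complete_result = failed_fields(complete)
broken_result = failed_fields(broken)
```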

Step 2: Building the Validation Gate

Websites are messy. They include currency symbols, commas in numbers, and extra whitespace. We can use Pydantic's @field_validator to clean this data during the validation process.

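from typing import Optional

from pydantic import BaseModel, Field, HttpUrl, field_validator

class ProductSchema(BaseModel):
    # Same constraints as in Step 1, so the rules are not lost when the class is redefined
    product_id: str = Field(min_length=5)
    name: str = Field(min_length=1)
    price: float = Field(gt=0)
    url: HttpUrl
    description: Optional[str] = None
    rating: Optional[float] = Field(None, ge=0, le=5)
    review_count: int = 0

    # mode='before' runs on the raw scraped value, before type coercion
    @field_validator('price', mode='before')
    @classmethod
    def clean_price(cls, value):
        if isinstance(value, str):
            # Remove currency symbols and commas
            clean_val = value.replace('$', '').replace(',', '').strip()
            # A ValueError raised here is reported by Pydantic as a validation error
            return float(clean_val)
        return value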

Now, let's implement the gatekeeper function. This function takes the raw dictionary from your scraper and attempts to pass it through the gate.

from pydantic import ValidationError

def gatekeeper(raw_data: dict):
    try:
        # Attempt to create a validated object
        validated_item = ProductSchema(**raw_data)
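        # mode="json" converts HttpUrl and similar types to plain strings,
        # so the result can be passed to json.dumps later
        return {"status": "valid", "data": validated_item.model_dump(mode="json")}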
    except ValidationError as e:
        # Capture exactly why it failed
        return {"status": "invalid", "errors": e.errors(), "raw": raw_data}

Step 3: Determining "Drop vs. Alert" Logic

In a production scraper, not all validation errors should stop the process. We categorize failures into two buckets:

Critical Failures: Missing price or product_id. If these fail, drop the record and log a high-priority error.
Non-Critical Failures: Missing image_url or rating. You might still want to keep the record but log a warning.
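One way to sketch this split is to look at the loc of each entry in the error list returned by e.errors(). The triage function, the CRITICAL_FIELDS set, and the logger name below are assumptions for illustration. Because Pydantic rejects the whole record on any error, this sketch "keeps" a record by blanking the failing non-critical fields so it can be validated again.

```python
import logging

logger = logging.getLogger("quality_gate")  # hypothetical logger name

# Assumption: mirrors the critical fields named above
CRITICAL_FIELDS = {"price", "product_id"}


def triage(raw: dict, errors: list) -> tuple:
    """Decide whether to drop a record or keep a repaired copy of it.

    `errors` is a list of dicts shaped like Pydantic's e.errors() output,
    where each dict has a "loc" tuple naming the failing field.
    """
    failed = {str(err["loc"][0]) for err in errors if err.get("loc")}
    critical = failed & CRITICAL_FIELDS
    if critical:
        logger.error("Dropping record, critical fields failed: %s", sorted(critical))
        return "drop", None

    logger.warning("Keeping record, non-critical fields failed: %s", sorted(failed))
    # Blank out the failing optional fields; the caller can re-validate this copy
    repaired = {key: (None if key in failed else value) for key, value in raw.items()}
    return "keep", repaired
```

The repaired copy should go back through the gate rather than straight to storage, so the schema still has the final say.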

You also need a halt threshold. If 100% of your items fail validation over a five-minute window, the website has likely changed its structure or blocked your bots. In this case, your script should automatically shut down to save your proxy budget.
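A halt threshold like this can be sketched with a sliding window of recent results. The HaltMonitor class and its parameters are hypothetical. The defaults follow the five-minute, 100% rule described above, and min_items is an added assumption that prevents a shutdown based on only a handful of records.

```python
import time
from collections import deque


class HaltMonitor:
    """Track recent validation results and signal when the crawl should stop."""

    def __init__(self, window_seconds=300.0, max_failure_rate=1.0, min_items=20):
        self.window_seconds = window_seconds
        self.max_failure_rate = max_failure_rate
        self.min_items = min_items
        self._events = deque()  # (timestamp, is_valid) pairs, oldest first

    def record(self, is_valid, now=None):
        # `now` can be injected for testing; otherwise use a monotonic clock
        now = time.monotonic() if now is None else now
        self._events.append((now, is_valid))
        # Evict results that have fallen out of the window
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def should_halt(self):
        total = len(self._events)
        if total < self.min_items:
            return False  # not enough evidence yet
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total >= self.max_failure_rate
```

In a crawl loop you would call record(result["status"] == "valid") after each item and stop scheduling requests once should_halt() returns True. Lowering max_failure_rate (for example to 0.9) trades a little sensitivity to noise for a faster shutdown.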

Step 4: Monitoring Field Coverage

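Sometimes data is "valid" (it’s a string) but "wrong" (it’s the wrong string), and sometimes a failure is only visible in aggregate. If the selector for an optional field such as description quietly stops matching, every record still passes validation with that field set to None. And if a site starts showing "Sign in to see price" instead of the actual price, the float validator rejects each record individually, but a single error in the log doesn't tell you that the whole crawl is affected.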

Track this using Field Coverage. If you scrape 1,000 items and only 200 have a description, your coverage for that field is 20%.

class StatsCollector:
    def __init__(self):
        self.total_scraped = 0
        self.valid_count = 0
        self.field_counts = {}

    def track(self, status, data=None):
        self.total_scraped += 1
        if status == "valid":
            self.valid_count += 1
            for field, value in data.items():
                if value is not None:
                    self.field_counts[field] = self.field_counts.get(field, 0) + 1

    def get_report(self):
        # Coverage per field, as a percentage of all scraped items
        if self.total_scraped == 0:
            return {}
        return {field: (count / self.total_scraped) * 100
                for field, count in self.field_counts.items()}

If your price coverage drops from 99% to 0% suddenly, you know exactly which selector to fix.
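To turn that observation into an automatic check, you could compare each report against a baseline from a known-good run. The coverage_drops function and the 20-point default are assumptions for illustration. Both arguments are dictionaries shaped like the output of get_report().

```python
def coverage_drops(baseline: dict, current: dict, max_drop: float = 20.0) -> dict:
    """Return fields whose coverage fell by more than max_drop percentage points.

    The result maps each affected field to (baseline_pct, current_pct).
    """
    drops = {}
    for field, expected in baseline.items():
        # A field absent from the current report was never seen, so treat it as 0%
        observed = current.get(field, 0.0)
        if expected - observed > max_drop:
            drops[field] = (expected, observed)
    return drops


baseline = {"price": 99.0, "description": 60.0, "rating": 80.0}
current = {"price": 98.5, "description": 55.0}
alerts = coverage_drops(baseline, current)
```

Here rating is missing from the current report entirely, so it is flagged, while the small dips in price and description are not.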

Step 5: Efficient Storage with JSONL

Once the data passes the gate, it's time to save it. For web scraping, JSON Lines (JSONL) is often better than standard JSON. In JSONL, every line is a separate JSON object.

import json

def save_to_jsonl(validated_data, filename="products.jsonl"):
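    # Append mode: each record becomes one new line; earlier lines are never rewritten
    with open(filename, "a", encoding="utf-8") as f: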
        f.write(json.dumps(validated_data) + "\n")

JSONL is useful for scraping because:

Memory Efficiency: You don't need to load a massive list into RAM.
Corruption Resistance: If the script crashes, the previous lines are already saved and valid.
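Both properties show up on the reading side too. The sketch below streams a JSONL file one record at a time and skips a line that fails to parse, such as a partial last line left by a crash mid-write. The read_jsonl name is an assumption for illustration.

```python
import json
from typing import Iterator


def read_jsonl(filename: str) -> Iterator[dict]:
    """Yield one parsed record per line without loading the whole file."""
    with open(filename, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Most likely a truncated write; the other lines are unaffected
                print(f"Skipping malformed line {line_number}")
```

Because this is a generator, a loop such as `for product in read_jsonl("products.jsonl")` holds only one record in memory at a time.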

To Wrap Up

Implementing Data Quality Gates turns a fragile script into a professional data pipeline. By validating in real-time, you ensure your database remains a source of truth rather than a collection of broken HTML fragments.

Key Takeaways:

Use Pydantic to define schemas and handle data cleaning.
Fail Fast by setting thresholds that stop the scraper if too many items fail validation.
Monitor Coverage to detect subtle website changes that don't trigger hard errors.
Store in JSONL for better data integrity and memory efficiency during long crawls.

If you’re building production-grade scrapers and want real-world examples, check out the Walmart.com scraper examples on GitHub. These repositories show practical implementations you can adapt into your own validation-first pipelines.
