Stop Saving Bad Data: Validating Scraped Content in Real-Time

We've all been there. You spend hours perfecting a scraper, set it to run overnight, and wake up to a 50MB CSV file. But when you open it, your heart sinks. Half the price columns are empty, the product titles are actually "Access Denied" messages, and your rating field contains random HTML tags.
The website changed its layout at 2:00 AM, and your scraper—blissfully unaware—kept right on "working."
In web scraping, this is a silent failure. Because the script didn't crash, your monitoring tools didn't alert you, but the resulting data is useless. To solve this, we need to move away from the "scrape now, clean later" mindset.
Instead, we need to implement Data Quality Gates. This is a validation layer that sits between your extraction logic and your database, ensuring only data meeting your exact standards ever gets saved.
Why "Look Before You Save" Matters
Traditionally, scraping pipelines follow a linear path: Extract -> Save -> Clean. The problem is that by the time you realize the data is corrupted, you may have already overwritten good data or wasted thousands of proxy credits on garbage results.
By implementing a Data Quality Gate, we shift to a more reliable workflow: Extract -> Validate -> Save. This offers three major advantages:
Fail Fast: You catch a broken CSS selector on the very first page, allowing you to stop the crawl before wasting resources.
Schema Guarantee: Downstream applications, such as price trackers or machine learning models, can trust that the data matches the expected format.
Precise Debugging: Instead of seeing a generic NoneType error, your logs will tell you exactly which field failed: Value error, price: 'Out of Stock' is not a valid float.
Tooling: Why Pydantic?
While you could write manual if statements to check your data, it quickly becomes a maintenance nightmare. Consider this manual check:
# The manual approach that doesn't scale
def validate_product(data):
    if not data.get('name'):
        return False
    if 'price' in data:
        try:
            float(data['price'].replace('$', ''))
        except ValueError:
            return False
    return True
This is brittle. Instead, use Pydantic, the most popular data validation library for Python. Pydantic uses Python type hints to enforce schemas. It doesn't just check types; it coerces them. If you send a string "10.99" to a float field, Pydantic handles the conversion for you.
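To make the coercion behavior concrete, here is a minimal sketch assuming Pydantic v2. The PriceOnly model is a hypothetical one-field example, not part of the scraper. It shows a numeric string being converted and a non-numeric string being rejected with an error that names the field.

```python
from pydantic import BaseModel, ValidationError


class PriceOnly(BaseModel):
    # Hypothetical one-field model, used only for this illustration
    price: float


# A numeric string is coerced to a float
item = PriceOnly(price="10.99")
print(type(item.price).__name__, item.price)

# A non-numeric string is rejected, and the error points at the field
try:
    PriceOnly(price="Out of Stock")
    first_error = None
except ValidationError as exc:
    first_error = exc.errors()[0]
    print(first_error["loc"], first_error["msg"])
```

The exact wording of the error message depends on the Pydantic version, so it is safer to rely on the "loc" entry than on the message text.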
Step 1: Defining the Data Schema
Let's build a practical example. Suppose we are scraping product data from an e-commerce site. We define what a valid product looks like using Pydantic's BaseModel.
from pydantic import BaseModel, Field, HttpUrl
from typing import Optional

class ProductSchema(BaseModel):
    # Required fields
    product_id: str = Field(min_length=5)
    name: str = Field(min_length=1)
    price: float = Field(gt=0)  # Price must be greater than 0
    url: HttpUrl

    # Optional fields
    description: Optional[str] = None
    rating: Optional[float] = Field(None, ge=0, le=5)  # 0 to 5 scale
    review_count: int = 0
In this schema, we’ve set strict rules. A product without a price or a product_id is considered invalid, while a missing description is acceptable.
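The sketch below exercises those rules, assuming Pydantic v2. The schema is restated so the block runs on its own; the failed_fields helper and the sample record are hypothetical and exist only for this illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field, HttpUrl, ValidationError


class ProductSchema(BaseModel):
    product_id: str = Field(min_length=5)
    name: str = Field(min_length=1)
    price: float = Field(gt=0)
    url: HttpUrl
    description: Optional[str] = None
    rating: Optional[float] = Field(None, ge=0, le=5)
    review_count: int = 0


def failed_fields(raw: dict) -> set:
    """Return the names of fields that failed validation (empty set if valid)."""
    try:
        ProductSchema(**raw)
        return set()
    except ValidationError as exc:
        return {err["loc"][0] for err in exc.errors()}


# Hypothetical record with every required field and no optional ones
base = {
    "product_id": "SKU-12345",
    "name": "Desk Lamp",
    "price": 19.99,
    "url": "https://example.com/p/12345",
}

no_description = failed_fields(base)  # description is optional
no_price = failed_fields({k: v for k, v in base.items() if k != "price"})
bad_rating = failed_fields({**base, "rating": 7})  # outside the 0 to 5 range

print(no_description, no_price, bad_rating)
```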
Step 2: Building the Validation Gate
Websites are messy. They include currency symbols, commas in numbers, and extra whitespace. We can use Pydantic's @field_validator to clean this data during the validation process.
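from typing import Optional
from pydantic import BaseModel, Field, HttpUrl, field_validator

class ProductSchema(BaseModel):
    # Same fields and constraints as in Step 1
    product_id: str = Field(min_length=5)
    name: str = Field(min_length=1)
    price: float = Field(gt=0)
    url: HttpUrl
    description: Optional[str] = None
    rating: Optional[float] = Field(None, ge=0, le=5)
    review_count: int = 0

    @field_validator('price', mode='before')
    @classmethod
    def clean_price(cls, value):
        if isinstance(value, str):
            # Remove currency symbols and commas
            clean_val = value.replace('$', '').replace(',', '').strip()
            return float(clean_val)
        return value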
Now, let's implement the gatekeeper function. This function takes the raw dictionary from your scraper and attempts to pass it through the gate.
from pydantic import ValidationError

def gatekeeper(raw_data: dict):
    try:
        # Attempt to create a validated object
        validated_item = ProductSchema(**raw_data)
        # mode="json" converts types such as HttpUrl to plain strings,
        # so the result can be passed straight to json.dumps
        return {"status": "valid", "data": validated_item.model_dump(mode="json")}
    except ValidationError as e:
        # Capture exactly why it failed
        return {"status": "invalid", "errors": e.errors(), "raw": raw_data}
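Here is a rough end-to-end run of the gate, assuming Pydantic v2. The schema and gatekeeper are restated in trimmed form so the block is self-contained, and model_dump(mode="json") is used so the URL comes back as a plain string. Both sample records are invented for illustration.

```python
from pydantic import BaseModel, Field, HttpUrl, ValidationError, field_validator


class ProductSchema(BaseModel):
    product_id: str = Field(min_length=5)
    name: str = Field(min_length=1)
    price: float = Field(gt=0)
    url: HttpUrl

    @field_validator("price", mode="before")
    @classmethod
    def clean_price(cls, value):
        if isinstance(value, str):
            # Strip currency symbols, thousands separators, and whitespace
            return float(value.replace("$", "").replace(",", "").strip())
        return value


def gatekeeper(raw_data: dict) -> dict:
    try:
        item = ProductSchema(**raw_data)
        return {"status": "valid", "data": item.model_dump(mode="json")}
    except ValidationError as e:
        return {"status": "invalid", "errors": e.errors(), "raw": raw_data}


# Hypothetical scraped records: one messy but usable, one from a blocked page
good = gatekeeper({
    "product_id": "SKU-12345",
    "name": "Desk Lamp",
    "price": " $1,299.50 ",
    "url": "https://example.com/p/12345",
})
blocked = gatekeeper({
    "product_id": "SKU-12345",
    "name": "Access Denied",
    "price": "Sign in to see price",
    "url": "https://example.com/p/12345",
})

print(good["status"], good["data"]["price"])
for err in blocked.get("errors", []):
    print(blocked["status"], err["loc"], err["msg"])
```

Note that the second record still has a name that passes validation ("Access Denied" is a non-empty string). Only the price check catches it, which is one reason to keep at least one strict numeric field in the schema.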
Step 3: Determining "Drop vs. Alert" Logic
In a production scraper, not all validation errors should stop the process. We categorize failures into two buckets:
Critical Failures: Missing price or product_id. If these fail, drop the record and log a high-priority error.
Non-Critical Failures: Missing image_url or rating. You might still want to keep the record but log a warning.
You also need a halt threshold. If 100% of your items fail validation over a five-minute window, the website has likely changed its structure or blocked your bots. In this case, your script should automatically shut down to save your proxy budget.
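One possible way to wire up both ideas is sketched below using only the standard library. The names classify and HaltMonitor, the min_items guard, and the default values are assumptions made for this example. The error dictionaries are assumed to have the same shape as the output of e.errors().

```python
import time
from collections import deque

CRITICAL_FIELDS = {"price", "product_id"}  # the critical bucket described above


def classify(errors: list) -> str:
    """Return 'critical' if any failing field is critical, else 'non_critical'."""
    failed = {err["loc"][0] for err in errors if err.get("loc")}
    return "critical" if failed & CRITICAL_FIELDS else "non_critical"


class HaltMonitor:
    """Tracks validation outcomes over a sliding time window."""

    def __init__(self, window_seconds=300, min_items=20, max_failure_rate=1.0):
        self.window_seconds = window_seconds
        self.min_items = min_items  # avoid halting on a tiny sample
        self.max_failure_rate = max_failure_rate
        self.events = deque()  # (timestamp, is_valid) pairs

    def record(self, is_valid, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, is_valid))
        # Discard events that have fallen out of the window
        while self.events and now - self.events[0][0] > self.window_seconds:
            self.events.popleft()

    def should_halt(self):
        if len(self.events) < self.min_items:
            return False
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events) >= self.max_failure_rate


# Explicit timestamps keep the example deterministic
monitor = HaltMonitor(window_seconds=300, min_items=3)
for t in (0, 10, 20):
    monitor.record(False, now=t)
halt_after_failures = monitor.should_halt()

monitor.record(True, now=30)
halt_after_success = monitor.should_halt()

print(classify([{"loc": ("price",)}]), classify([{"loc": ("rating",)}]))
print(halt_after_failures, halt_after_success)
```

In a real crawl you would call record() after every gatekeeper result and stop the scraper when should_halt() returns True. Lowering max_failure_rate below 1.0 makes the gate trip earlier, at the cost of more false alarms.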
Step 4: Monitoring Field Coverage
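Sometimes data is "valid" (it’s a string) but "wrong" (it’s the wrong string), and sometimes a field quietly disappears. For example, if a site starts showing "Sign in to see price" instead of the actual price, your float validator will reject every item, while an optional field such as description can go missing without raising a single error.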
Track this using Field Coverage. If you scrape 1,000 items and only 200 have a description, your coverage for that field is 20%.
class StatsCollector:
    def __init__(self):
        self.total_scraped = 0
        self.valid_count = 0
        self.field_counts = {}

    def track(self, status, data=None):
        self.total_scraped += 1
        if status == "valid":
            self.valid_count += 1
            for field, value in data.items():
                if value is not None:
                    self.field_counts[field] = self.field_counts.get(field, 0) + 1

    def get_report(self):
        if self.total_scraped == 0:
            return {}
        return {field: (count / self.total_scraped) * 100
                for field, count in self.field_counts.items()}
If your price coverage drops from 99% to 0% suddenly, you know exactly which selector to fix.
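To turn that observation into an automatic check, you could compare each coverage report against a baseline from a known-good run. The function below is a sketch under that assumption: coverage_drops, the 20-point threshold, and the sample numbers are all made up for illustration.

```python
def coverage_drops(baseline: dict, current: dict, max_drop: float = 20.0) -> dict:
    """Return fields whose coverage fell by more than max_drop percentage points."""
    drops = {}
    for field, expected in baseline.items():
        # A field that is absent from the current report has 0% coverage
        observed = current.get(field, 0.0)
        if expected - observed > max_drop:
            drops[field] = (expected, observed)
    return drops


# Hypothetical reports in the same shape as StatsCollector.get_report()
baseline = {"price": 99.0, "description": 20.0, "rating": 85.0}
current = {"description": 18.5, "rating": 40.0}  # price vanished entirely

alerts = coverage_drops(baseline, current)
for field, (expected, observed) in alerts.items():
    print(f"{field}: coverage fell from {expected}% to {observed}%")
```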
Step 5: Efficient Storage with JSONL
Once the data passes the gate, it's time to save it. For web scraping, JSON Lines (JSONL) is often better than standard JSON. In JSONL, every line is a separate JSON object.
import json

def save_to_jsonl(validated_data, filename="products.jsonl"):
    with open(filename, "a") as f:
        f.write(json.dumps(validated_data) + "\n")
JSONL is useful for scraping because:
Memory Efficiency: You don't need to load a massive list into RAM.
Corruption Resistance: If the script crashes, the previous lines are already saved and valid.
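Both properties also show up on the reading side. The following sketch streams a JSONL file one record at a time and skips a line that was cut off mid-write. The read_jsonl name and the choice to skip rather than raise are assumptions for this example.

```python
import json


def read_jsonl(filename="products.jsonl"):
    """Yield one record per line, skipping lines that are not valid JSON."""
    with open(filename, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # For example, a final line truncated by a crash mid-write
                print(f"Skipping malformed line {line_number}")
```

Because read_jsonl is a generator, a loop such as for record in read_jsonl("products.jsonl") holds only one record in memory at a time, no matter how large the file has grown.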
To Wrap Up
Implementing Data Quality Gates turns a fragile script into a professional data pipeline. By validating in real-time, you ensure your database remains a source of truth rather than a collection of broken HTML fragments.
Key Takeaways:
Use Pydantic to define schemas and handle data cleaning.
Fail Fast by setting thresholds that stop the scraper if too many items fail validation.
Monitor Coverage to detect subtle website changes that don't trigger hard errors.
Store in JSONL for better data integrity and memory efficiency during long crawls.
If you’re building production-grade scrapers and want real-world examples, check out the Walmart.com scraper examples on GitHub. These repositories show practical implementations you can adapt into your own validation-first pipelines.




