Solving CHALLENGEs in Browser Automation with Custom OCR

Integrating a custom-trained OCR model into browser automation to solve text-based CHALLENGEs: the OCR service, confidence thresholds, and retry logic.

Status: This post describes the planned CHALLENGE-solving integration. The OCR model described in Teaching AI to Distrust Itself is trained and ready; integration with browser automation is pending.

Text-based CHALLENGEs are speed bumps, not walls. Distorted letters, wavy backgrounds, overlapping characters - they’re designed to stop bots, but a well-trained OCR model handles them reliably.

In Teaching AI to Distrust Itself, I described building a 98% accurate OCR model through iterative label refinement. This post covers the integration: wrapping that model in an HTTP service and calling it from the browser automation pipeline.

The OCR Service

The trained model runs as a simple HTTP service:

from flask import Flask, request, jsonify
from model import CRNNModel  # Your trained model

app = Flask(__name__)
model = CRNNModel.load("models/ocr_98pct.pt")

@app.route("/predict", methods=["POST"])
def predict():
    if "image" not in request.files:
        return jsonify({"error": "No image provided"}), 400

    image_file = request.files["image"]
    image_bytes = image_file.read()

    prediction, confidence = model.predict_with_confidence(image_bytes)

    return jsonify({
        "prediction": prediction,
        "confidence": float(confidence)
    })

@app.route("/health")
def health():
    return jsonify({"status": "healthy"})

if __name__ == "__main__":
    # Listen on all interfaces; the renderer reaches this service on port 8082
    app.run(host="0.0.0.0", port=8082)

For model training details - the CRNN architecture, CTC loss, and the iterative label refinement that got us to 98% accuracy - see Teaching AI to Distrust Itself.

Deployment Options

The OCR service can run:

  • Locally on the Pi - lowest latency, but requires deploying the model to each Pi
  • On GCE - centralized, single model instance for multiple Pis
  • Cloud Run - auto-scaling, pay-per-use

For a single-Pi setup, running the service on GCE keeps the Pi focused on browser automation while centralizing the ML inference.
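On GCE, the service can be supervised like any other long-running process. Here's a sketch of a systemd unit, assuming gunicorn and the file layout above; the unit name, paths, and bind address are illustrative, not from the original setup:

```ini
# /etc/systemd/system/ocr-service.service (illustrative paths)
[Unit]
Description=OCR CHALLENGE-solving service
After=network-online.target

[Service]
WorkingDirectory=/opt/ocr-service
# Bind to the WireGuard address so only tunneled Pis can reach it
ExecStart=/opt/ocr-service/venv/bin/gunicorn -w 2 -b 10.0.0.1:8082 app:app
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Binding to the WireGuard interface rather than 0.0.0.0 keeps the prediction endpoint off the public internet.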

Calling the Service

From the renderer:

import httpx

async def solve_challenge(image_bytes: bytes) -> dict:
    """Send CHALLENGE image to OCR service.

    Returns:
        dict with keys: text, confidence, success (plus error on failure)
    """
    async with httpx.AsyncClient() as client:
        try:
            response = await client.post(
                "http://10.0.0.1:8082/predict",  # GCE via WireGuard
                files={"image": ("challenge.png", image_bytes, "image/png")},
                timeout=10.0
            )
        except httpx.HTTPError as exc:
            # Timeout or network error - treat it as a failed solve
            return {
                'text': None,
                'confidence': 0,
                'success': False,
                'error': str(exc)
            }

        if response.status_code == 200:
            result = response.json()
            return {
                'text': result['prediction'],
                'confidence': result['confidence'],
                'success': True
            }
        else:
            return {
                'text': None,
                'confidence': 0,
                'success': False,
                'error': response.text
            }

Confidence Thresholds

The model returns a confidence score with each prediction. This is useful for deciding whether to submit or request a new CHALLENGE:

MIN_CONFIDENCE = 0.7

result = await solve_challenge(image_bytes)

if result['confidence'] < MIN_CONFIDENCE:
    # Low confidence - refresh for a new CHALLENGE
    await page.reload()
else:
    # High confidence - submit the solution
    await submit_solution(result['text'])

Some CHALLENGEs are harder than others. Refreshing for a new one often yields something the model reads more confidently.
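For context, one common way a CTC-style model can produce a confidence score is to average the per-timestep softmax maxima. This is a sketch of that idea, not necessarily how the trained model computes its score:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_confidence(logits_per_step):
    """Average the max softmax probability across decoder timesteps.

    logits_per_step: one logit vector per predicted character position.
    """
    maxima = [max(softmax(step)) for step in logits_per_step]
    return sum(maxima) / len(maxima)

# One logit clearly dominates each step -> high confidence
confident = sequence_confidence([[9.0, 0.0, 0.0], [0.0, 8.0, 0.0]])
# Near-uniform logits -> low confidence
uncertain = sequence_confidence([[1.0, 1.1, 0.9], [0.5, 0.4, 0.6]])
```

A per-character minimum instead of the mean is a stricter variant: one ambiguous character then sinks the whole prediction below the threshold.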

Retry Logic

OCR isn’t perfect. Build in retries:

async def solve_with_retry(
    page,
    extract_fn,  # Target-specific extraction
    submit_fn,   # Target-specific submission
    max_attempts: int = 3,
    min_confidence: float = 0.7
) -> dict:
    """Attempt to solve CHALLENGE with retries."""
    for attempt in range(1, max_attempts + 1):
        # Extract CHALLENGE image (target-specific)
        image = await extract_fn(page)
        if not image:
            # No CHALLENGE on the page - nothing to solve
            return {'solved': True, 'reason': 'no_challenge'}

        # Get prediction
        result = await solve_challenge(image)

        if not result['success']:
            continue  # service error - try again

        # Skip low-confidence predictions
        if result['confidence'] < min_confidence:
            await page.reload()
            continue

        # Submit solution (target-specific)
        solved = await submit_fn(page, result['text'])

        if solved:
            return {
                'solved': True,
                'attempts': attempt,
                'confidence': result['confidence']
            }

        # Wrong answer - refresh and retry
        await page.reload()

    return {
        'solved': False,
        'attempts': max_attempts,
        'reason': 'max_attempts_exceeded'
    }

The extract_fn and submit_fn are target-specific - they know where the CHALLENGE image is and where to submit the solution for that particular site.

Handling Failures

When all retries fail:

  1. Save the image for manual review and potential training data
  2. Log the attempt with confidence scores
  3. Return failure to the orchestrator

import logging
from datetime import datetime

logger = logging.getLogger(__name__)

async def handle_challenge_failure(
    job_id: str,
    challenge_image: bytes,
    attempts: list
):
    """Handle failed CHALLENGE solving."""
    # Save image for training data
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    save_path = f"/var/log/challenges/failed_{job_id}_{timestamp}.png"

    with open(save_path, "wb") as f:
        f.write(challenge_image)

    logger.warning(
        f"CHALLENGE failed for job {job_id}: "
        f"{len(attempts)} attempts, saved to {save_path}"
    )

Failed CHALLENGEs become training data. This closes the loop with the iterative refinement process described in the OCR training post - failures improve the model for next time.

What’s Next

The OCR model is trained and ready. Here’s where we stand:

  • WireGuard tunnel for Pi-to-GCE communication ✅
  • Multi-dongle networking for IP rotation ✅
  • Custom OCR for CHALLENGE solving ✅
  • Playwright renderer for page capture - coming next

The next post covers the GCE orchestrator that coordinates jobs across multiple Pis.

About the Author

Ashish Anand

Founder & Lead Developer

Full-stack developer with 10+ years of experience in Python, JavaScript, and DevOps. Creator of DevGuide.dev. Previously worked at Microsoft. Specializes in developer tools and automation.