Solving CHALLENGEs in Browser Automation with Custom OCR
Integrating a custom-trained OCR model into browser automation to solve text-based CHALLENGEs: the OCR service, confidence thresholds, and retry logic.
SERIES
Distributed Browser Automation
Status: This post describes the planned CHALLENGE-solving integration. The OCR model described in Teaching AI to Distrust Itself is trained and ready; integration with browser automation is pending.
Text-based CHALLENGEs are speed bumps, not walls. Distorted letters, wavy backgrounds, overlapping characters - they’re designed to stop bots, but a well-trained OCR model handles them reliably.
In Teaching AI to Distrust Itself, I described building a 98% accurate OCR model through iterative label refinement. This post covers the integration: wrapping that model in an HTTP service and calling it from the browser automation pipeline.
The OCR Service
The trained model runs as a simple HTTP service:
from flask import Flask, request, jsonify
from model import CRNNModel  # Your trained model

app = Flask(__name__)
model = CRNNModel.load("models/ocr_98pct.pt")

@app.route("/predict", methods=["POST"])
def predict():
    if "image" not in request.files:
        return jsonify({"error": "No image provided"}), 400
    image_file = request.files["image"]
    image_bytes = image_file.read()
    prediction, confidence = model.predict_with_confidence(image_bytes)
    return jsonify({
        "prediction": prediction,
        "confidence": float(confidence)
    })

@app.route("/health")
def health():
    return jsonify({"status": "healthy"})
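Before wiring this into the pipeline, it's worth a quick smoke test. Here's a minimal sketch using Flask's built-in test client, assuming the service above is saved as app.py and a sample image exists at samples/challenge.png (both assumptions):

import io

from app import app  # Assumes the service module above is saved as app.py

def smoke_test():
    client = app.test_client()
    with open("samples/challenge.png", "rb") as f:  # Sample path is an assumption
        data = {"image": (io.BytesIO(f.read()), "challenge.png")}
    resp = client.post("/predict", data=data, content_type="multipart/form-data")
    assert resp.status_code == 200
    body = resp.get_json()
    print(body["prediction"], body["confidence"])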
For model training details - the CRNN architecture, CTC loss, and the iterative label refinement that got us to 98% accuracy - see Teaching AI to Distrust Itself.
Deployment Options
The OCR service can run:
- Locally on the Pi - lowest latency, but the model must be deployed to each device
- On GCE - centralized; a single model instance serves multiple Pis
- On Cloud Run - auto-scaling, pay-per-use
For a single Pi setup, running on GCE keeps the Pi focused on browser automation while centralizing the ML inference.
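Wherever the service runs, a cheap liveness probe before dispatching jobs avoids wasted renderer attempts. A minimal sketch against the /health endpoint, assuming the WireGuard address used in the next section:

import httpx

OCR_BASE_URL = "http://10.0.0.1:8082"  # GCE over WireGuard, as in the client below

def ocr_service_healthy() -> bool:
    """Return True if the OCR service's /health endpoint responds."""
    try:
        resp = httpx.get(f"{OCR_BASE_URL}/health", timeout=5.0)
        return resp.status_code == 200 and resp.json().get("status") == "healthy"
    except httpx.HTTPError:
        return False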
Calling the Service
From the renderer:
import httpx

async def solve_challenge(image_bytes: bytes) -> dict:
    """Send CHALLENGE image to OCR service.

    Returns:
        dict with keys: text, confidence, success
    """
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://10.0.0.1:8082/predict",  # GCE via WireGuard
            files={"image": ("challenge.png", image_bytes, "image/png")},
            timeout=10.0
        )
        if response.status_code == 200:
            result = response.json()
            return {
                'text': result['prediction'],
                'confidence': result['confidence'],
                'success': True
            }
        else:
            return {
                'text': None,
                'confidence': 0,
                'success': False,
                'error': response.text
            }
Confidence Thresholds
The model returns a confidence score with each prediction. This is useful for deciding whether to submit or request a new CHALLENGE:
MIN_CONFIDENCE = 0.7

result = await solve_challenge(image_bytes)
if result['confidence'] < MIN_CONFIDENCE:
    # Low confidence - refresh for a new CHALLENGE
    await page.reload()
else:
    # High confidence - submit the solution
    await submit_solution(result['text'])
Some CHALLENGEs are harder than others. Refreshing for a new one often yields something the model reads more confidently.
Retry Logic
OCR isn’t perfect. Build in retries:
async def solve_with_retry(
    page,
    extract_fn,  # Target-specific extraction
    submit_fn,   # Target-specific submission
    max_attempts: int = 3,
    min_confidence: float = 0.7
) -> dict:
    """Attempt to solve CHALLENGE with retries."""
    for attempt in range(1, max_attempts + 1):
        # Extract CHALLENGE image (target-specific)
        image = await extract_fn(page)
        if not image:
            return {'solved': True, 'reason': 'no_challenge'}

        # Get prediction
        result = await solve_challenge(image)
        if not result['success']:
            continue

        # Skip low-confidence predictions
        if result['confidence'] < min_confidence:
            await page.reload()
            continue

        # Submit solution (target-specific)
        solved = await submit_fn(page, result['text'])
        if solved:
            return {
                'solved': True,
                'attempts': attempt,
                'confidence': result['confidence']
            }

        # Wrong answer - refresh and retry
        await page.reload()

    return {
        'solved': False,
        'attempts': max_attempts,
        'reason': 'max_attempts_exceeded'
    }
The extract_fn and submit_fn are target-specific - they know where the CHALLENGE image is and where to submit the solution for that particular site.
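For illustration, here's a hedged sketch of what that pair might look like for a hypothetical page, using Playwright's async API. The selectors and the success check are assumptions, not a real target:

from playwright.async_api import Page

async def extract_example(page: Page) -> bytes | None:
    """Screenshot the CHALLENGE image element if one is present."""
    el = await page.query_selector("img.challenge-image")  # Hypothetical selector
    if el is None:
        return None  # No CHALLENGE on this page
    return await el.screenshot()

async def submit_example(page: Page, text: str) -> bool:
    """Fill in the solution and submit; the success check is an assumption."""
    await page.fill("input[name='challenge_answer']", text)  # Hypothetical selector
    await page.click("button[type='submit']")
    await page.wait_for_load_state("networkidle")
    # Treat the CHALLENGE disappearing as success
    return await page.query_selector("img.challenge-image") is None

# Usage:
# result = await solve_with_retry(page, extract_example, submit_example)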
Handling Failures
When all retries fail:
- Save the image for manual review and potential training data
- Log the attempt with confidence scores
- Return failure to the orchestrator
import logging
from datetime import datetime

logger = logging.getLogger(__name__)

async def handle_challenge_failure(
    job_id: str,
    challenge_image: bytes,
    attempts: list
):
    """Handle failed CHALLENGE solving."""
    # Save image for manual review and future training data
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    save_path = f"/var/log/challenges/failed_{job_id}_{timestamp}.png"
    with open(save_path, "wb") as f:
        f.write(challenge_image)
    logger.warning(
        f"CHALLENGE failed for job {job_id}: "
        f"{len(attempts)} attempts, saved to {save_path}"
    )
Failed CHALLENGEs become training data. This closes the loop with the iterative refinement process described in the OCR training post - failures improve the model for next time.
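The mechanics of that loop are outside this post, but as a sketch: assuming a reviewed image is hand-labeled by renaming it to its transcription (an assumed convention, not the training post's actual pipeline), promoting it into the training set is a small script:

from pathlib import Path
import shutil

FAILED_DIR = Path("/var/log/challenges")
TRAIN_DIR = Path("data/train_extra")  # Hypothetical training-data directory

def promote_labeled_failures() -> int:
    """Move reviewed images (renamed to '<label>.png') into the training set."""
    TRAIN_DIR.mkdir(parents=True, exist_ok=True)
    moved = 0
    for img in FAILED_DIR.glob("*.png"):
        if img.name.startswith("failed_"):
            continue  # Still awaiting manual review
        shutil.move(str(img), TRAIN_DIR / img.name)
        moved += 1
    return moved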
What’s Next
The OCR model is trained and ready. Here’s where we stand:
- WireGuard tunnel for Pi-to-GCE communication ✅
- Multi-dongle networking for IP rotation ✅
- Custom OCR model trained for CHALLENGE solving ✅ (integration pending)
- Playwright renderer for page capture - coming next
The next post covers the GCE orchestrator that coordinates jobs across multiple Pis.