How to Build a Grammar Checker using spaCy and a Fine-Tuned LLM
Learn how to create a grammar checking tool by combining spaCy's NLP capabilities with a fine-tuned large language model (LLM) for accurate text correction.
If you have ever wanted to build a grammar checking tool similar to Grammarly, combining the strengths of spaCy for linguistic analysis and a fine-tuned large language model (LLM) for text correction, then you are in the right place.
This tutorial will give you a step-by-step roadmap to create a grammar checker that can analyze text, detect grammatical errors, and provide corrections.
Let's walk through a clear technical roadmap, with architecture options and example code, so you can realistically build and deploy it.
Step-by-Step Plan: Grammar Checker using spaCy + LLM
1. Define the Goal
You want:
- To detect grammar errors in web content (sentences, paragraphs, HTML).
- To return corrected text, error positions, and possibly explanations.
So the model must handle both:
- Detection: identifying incorrect grammar (token-level or sentence-level).
- Correction: rewriting text correctly (sentence-level generative task).
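One way to pin down this goal is to sketch the shape of the result before writing any model code. The dataclasses below are a hypothetical schema: the names `GrammarIssue` and `CheckResult` and their fields are assumptions made for illustration, not part of spaCy or any other library. They only show how detection output and correction output could sit side by side.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class GrammarIssue:
    """One detected error (hypothetical schema)."""
    type: str                          # e.g. "Double Determiner"
    text: str                          # the offending substring
    start: int                         # character offset where the error begins
    end: int                           # character offset just past the error
    explanation: Optional[str] = None  # optional human-readable reason


@dataclass
class CheckResult:
    """Full response for one piece of text (hypothetical schema)."""
    original: str
    corrected: str
    issues: List[GrammarIssue] = field(default_factory=list)


original = "I saw the a cat."
bad_span = "the a"
issue = GrammarIssue(
    type="Double Determiner",
    text=bad_span,
    start=original.index(bad_span),
    end=original.index(bad_span) + len(bad_span),
)
result = CheckResult(original=original, corrected="I saw a cat.", issues=[issue])
payload = asdict(result)  # plain dict, ready for JSON serialization
```

Keeping `start` and `end` as character offsets into the original text lets a client highlight the error without re-tokenizing.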
2. Why Combine spaCy + LLM?
| Component | Role | Reason |
|---|---|---|
| spaCy | Tokenization, POS tagging, dependency parsing | Fast and deterministic pre-processing; can highlight specific grammar violations (like subject-verb disagreement, double determiners, etc.). |
| LLM (fine-tuned) | Grammar correction | Generates natural, grammatically correct rewrites and explanations. |
| Combined | Hybrid model | You get accuracy + interpretability + low latency (LLM handles only sentences flagged by spaCy). |
3. Overall Architecture
┌────────────────────────┐
│ Website Content (HTML) │
└────────────┬───────────┘
             │
  [1] Extract visible text
             │
             ▼
    ┌─────────────────┐
    │ spaCy NLP Model │
    │ (tokenize, tag) │
    └────────┬────────┘
             │
  [2] Detect rule-based grammar issues
             │
             ▼
┌──────────────────────────┐
│ Fine-tuned LLM (Grammar) │
│ Correct & explain errors │
└────────────┬─────────────┘
             │
  [3] Merge results + JSON
             │
             ▼
  Response to API client
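Step [1] of the diagram, extracting visible text, is not covered by spaCy itself. A minimal sketch using only Python's standard `html.parser` module might look like the following. The class name `VisibleTextExtractor`, the helper `extract_visible_text`, and the two tag sets are assumptions chosen for illustration; a production crawler would likely need a more complete list and a more tolerant parser.

```python
from html.parser import HTMLParser

# Tags whose content is never rendered as page text (illustrative, not exhaustive)
SKIPPED_TAGS = {"script", "style", "noscript", "head", "template"}
# Tags that should act as word boundaries (illustrative, not exhaustive)
BLOCK_TAGS = {"p", "div", "li", "br", "tr", "section", "article",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes while skipping non-rendered elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace into single spaces
        return " ".join("".join(self._chunks).split())


def extract_visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><title>Demo</title><style>p{}</style></head>"
    "<body><p>I <b>has</b> a pen.</p><script>var x = 1;</script>"
    "<p>She go to school.</p></body></html>"
)
visible = extract_visible_text(sample)
```

Inline tags such as `<b>` are left alone so that words are not split, while block-level tags insert a space so that adjacent paragraphs do not run together.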
4. spaCy Pipeline Setup
Install:
pip install spacy
python -m spacy download en_core_web_sm
Example base pipeline:
import spacy

nlp = spacy.load("en_core_web_sm")

def detect_grammar_issues(text):
    """Return rule-based grammar issues found in `text`."""
    doc = nlp(text)
    issues = []
    for token in doc:
        # Simple rule example: double determiners
        if token.pos_ == "DET" and token.i < len(doc) - 1:
            next_token = doc[token.i + 1]
            if next_token.pos_ == "DET":
                issues.append({
                    "type": "Double Determiner",
                    "text": f"{token.text} {next_token.text}",
                    # Character offsets: start of the first token, end of the second
                    "position": (token.idx, next_token.idx + len(next_token.text)),
                })
    return issues
This gives you linguistically explainable error detection, and its output can be passed to your LLM for correction.
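The same rule-based pattern extends to other checks, such as the subject-verb disagreement mentioned earlier. The sketch below is an assumption-laden illustration: to stay runnable without a downloaded model, it works on a hypothetical `Tok` record that mirrors the spaCy attributes `token.i`, `token.text`, `token.tag_`, `token.dep_`, and `token.head.i`, and the tags in the example are assigned by hand. It covers noun subjects only, and a real tagger may label ungrammatical input differently than shown here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Tok:
    """Hypothetical stand-in for the spaCy token attributes the rule needs."""
    i: int      # token index in the sentence (spaCy: token.i)
    text: str   # surface form (spaCy: token.text)
    tag: str    # Penn Treebank tag (spaCy: token.tag_)
    dep: str    # dependency label (spaCy: token.dep_)
    head: int   # index of the syntactic head (spaCy: token.head.i)


SINGULAR_SUBJECT_TAGS = {"NN", "NNP"}
PLURAL_SUBJECT_TAGS = {"NNS", "NNPS"}


def detect_subject_verb_disagreement(tokens: List[Tok]) -> List[Dict[str, str]]:
    issues = []
    for tok in tokens:
        if tok.dep != "nsubj":
            continue
        verb = tokens[tok.head]
        # VBZ = 3rd-person singular present ("barks"); VBP = other present ("bark")
        singular_clash = tok.tag in SINGULAR_SUBJECT_TAGS and verb.tag == "VBP"
        plural_clash = tok.tag in PLURAL_SUBJECT_TAGS and verb.tag == "VBZ"
        if singular_clash or plural_clash:
            issues.append({
                "type": "Subject-Verb Disagreement",
                "text": f"{tok.text} ... {verb.text}",
            })
    return issues


# "The dogs barks loudly." with hand-assigned tags and heads
sentence = [
    Tok(0, "The", "DT", "det", 1),
    Tok(1, "dogs", "NNS", "nsubj", 2),
    Tok(2, "barks", "VBZ", "ROOT", 2),
    Tok(3, "loudly", "RB", "advmod", 2),
    Tok(4, ".", ".", "punct", 2),
]
found = detect_subject_verb_disagreement(sentence)
```

With a real pipeline, the `Tok` records would be replaced by the tokens of `nlp(text)`, and the rule body would read the corresponding attributes directly.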
5. Fine-Tuning an LLM for Grammar Correction
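You can use an open-source model like LLaMA 3, Mistral, or T5 (a sequence-to-sequence model that is well suited to text-rewriting tasks such as grammar correction).
Example: Fine-tuning with T5
T5 suits grammar correction because it learns a direct mapping from erroneous text to corrected text: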
Input: "I has a pen."
Output: "I have a pen."
Fine-tune with Hugging Face transformers:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy dataset for illustration; real fine-tuning needs far more pairs
examples = [
    {"input_text": "I has a pen.", "target_text": "I have a pen."},
    {"input_text": "She go to school.", "target_text": "She goes to school."},
]
train_encodings = tokenizer([e["input_text"] for e in examples], truncation=True, padding=True)
labels = tokenizer([e["target_text"] for e in examples], truncation=True, padding=True)

class GrammarDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        label_ids = torch.tensor(self.labels["input_ids"][idx])
        # Replace padding ids with -100 so the loss ignores padded positions
        label_ids[label_ids == tokenizer.pad_token_id] = -100
        item["labels"] = label_ids
        return item

    def __len__(self):
        return len(self.labels["input_ids"])

train_dataset = GrammarDataset(train_encodings, labels)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()

# Save the tokenizer alongside the model so the directory can be loaded later
model.save_pretrained("./grammar-t5")
tokenizer.save_pretrained("./grammar-t5")
You now have a fine-tuned grammar correction model.
6. Integrate Both (spaCy + LLM)
from transformers import pipeline

# Load the fine-tuned model from the directory saved in step 5
grammar_corrector = pipeline("text2text-generation", model="./grammar-t5")

def grammar_check_pipeline(text):
    # Step 1: rule-based detection
    issues = detect_grammar_issues(text)
    # Step 2: correct with the LLM (this simple version sends the whole text,
    # not only the sentences flagged in step 1)
    corrections = grammar_corrector(text, max_length=256)[0]["generated_text"]
    return {
        "original": text,
        "corrections": corrections,
        "issues": issues,
    }
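The table in section 2 describes a lower-latency variant in which the LLM handles only sentences flagged by spaCy. A minimal sketch of that routing is shown below, under stated assumptions: `check_flagged_only` and `split_sentences` are hypothetical helpers, the detector and corrector are passed in as plain callables, and the two stubs at the bottom stand in for the spaCy rules and the fine-tuned model so the example runs on its own.

```python
import re
from typing import Callable, Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter for illustration; spaCy's doc.sents could replace it."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def check_flagged_only(
    text: str,
    detect: Callable[[str], List[dict]],
    correct: Callable[[str], str],
) -> Dict[str, object]:
    corrected_sentences = []
    all_issues = []
    model_calls = 0
    for sentence in split_sentences(text):
        issues = detect(sentence)
        if issues:
            # Only flagged sentences pay the cost of a model call
            corrected_sentences.append(correct(sentence))
            all_issues.extend(issues)
            model_calls += 1
        else:
            corrected_sentences.append(sentence)
    return {
        "original": text,
        "corrections": " ".join(corrected_sentences),
        "issues": all_issues,  # positions, if any, are relative to each sentence
        "model_calls": model_calls,
    }


# Stubs standing in for the spaCy detector and the fine-tuned corrector
def stub_detect(sentence: str) -> List[dict]:
    return [{"type": "Double Determiner", "text": "the a"}] if "the a" in sentence else []


def stub_correct(sentence: str) -> str:
    return sentence.replace("the a", "a")


routed = check_flagged_only("I saw the a cat. It was asleep.", stub_detect, stub_correct)
```

The trade-off is coverage: any error the rules fail to flag never reaches the model, so this variant is only as thorough as the rule set.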
7. Wrap as an API
Using FastAPI:
from fastapi import FastAPI, Body

app = FastAPI()

@app.post("/check")
async def check_grammar(content: str = Body(..., embed=True)):
    # embed=True makes the endpoint expect a JSON object: {"content": "..."}
    result = grammar_check_pipeline(content)
    return result
Run it (assuming the code is saved as app.py):
uvicorn app:app --reload
8. Training Data Sources (Grammar Correction)
You can bootstrap data from:
- JFLEG, BEA-2019, or Lang-8 datasets
- Or generate synthetic errors using spaCy (e.g., swapping verbs, removing articles)
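The synthetic route can be sketched as a pair of corruption functions applied to clean sentences. The example below is a simplified assumption: it uses plain string rules and a tiny hand-written verb table (`VERB_SWAPS`) rather than spaCy, so that it runs standalone. With spaCy, POS tags would typically decide which tokens to corrupt. The function names are hypothetical.

```python
import random
import re
from typing import Dict, List

ARTICLES = {"a", "an", "the"}
# Tiny hand-written table of agreement swaps; a real table would be far larger
VERB_SWAPS = {"have": "has", "has": "have", "goes": "go", "go": "goes",
              "is": "are", "are": "is"}
VERB_PATTERN = re.compile(r"\b(" + "|".join(VERB_SWAPS) + r")\b")


def remove_articles(sentence: str) -> str:
    """Drop every article, producing a missing-determiner error."""
    return " ".join(w for w in sentence.split() if w.lower() not in ARTICLES)


def swap_verb_form(sentence: str) -> str:
    """Swap the first listed verb for its disagreeing form."""
    return VERB_PATTERN.sub(lambda m: VERB_SWAPS[m.group(0)], sentence, count=1)


def make_pairs(clean_sentences: List[str], seed: int = 0) -> List[Dict[str, str]]:
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    corruptions = [remove_articles, swap_verb_form]
    pairs = []
    for sentence in clean_sentences:
        corrupted = rng.choice(corruptions)(sentence)
        if corrupted != sentence:  # skip sentences the chosen rule left unchanged
            pairs.append({"input_text": corrupted, "target_text": sentence})
    return pairs


clean = ["I have a pen.", "She goes to school.", "The cat is on the mat."]
pairs = make_pairs(clean)
```

The output uses the same `input_text` / `target_text` keys as the fine-tuning example, so the pairs can be fed to that dataset class directly.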
9. (Optional) Website Integration
Once the API is live:
- Create a JavaScript snippet for websites to call your API:
fetch('https://api.grammarcheckapi.io/check', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ content: document.body.innerText }),
})
  .then(r => r.json())
  .then(console.log);
10. Future Enhancements
- Use spaCy custom components to mark error spans in Doc.spans.
- Integrate with FastAPI + Redis for caching corrections.
- Add confidence scores.
- Provide diff output between original and corrected text.
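Of these enhancements, the diff output is the easiest to prototype because Python's standard `difflib` module already does the alignment. The following is a minimal word-level sketch; the function name `word_diff` and the shape of the returned records are assumptions for illustration.

```python
import difflib
from typing import Dict, List


def word_diff(original: str, corrected: str) -> List[Dict[str, str]]:
    """Return the word-level edits that turn `original` into `corrected`."""
    src = original.split()
    dst = corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=dst, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        edits.append({
            "op": op,  # "replace", "delete", or "insert"
            "before": " ".join(src[i1:i2]),
            "after": " ".join(dst[j1:j2]),
        })
    return edits


edits = word_diff("She go to school.", "She goes to school.")
```

Because the comparison is on whitespace-separated words, punctuation stays attached to its neighbor; diffing spaCy tokens instead would give finer-grained edits.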
Next steps:
- How to generate a large grammar correction dataset automatically using spaCy rules and synthetic corruptions. This step is essential if you want to fine-tune your own LLM instead of relying on a pre-trained one.