How to Build a Grammar Checker using spaCy and a Fine-Tuned LLM

Learn how to create a grammar checking tool by combining spaCy's NLP capabilities with a fine-tuned large language model (LLM) for accurate text correction.

Published: 11-Nov-2025 ⏱ 5 min read

If you have ever wanted to build a grammar checking tool similar to Grammarly, combining the strengths of spaCy for linguistic analysis and a fine-tuned large language model (LLM) for text correction, then you are in the right place.

This tutorial will give you a step-by-step roadmap to create a grammar checker that can analyze text, detect grammatical errors, and provide corrections.

Let’s walk through a clear technical roadmap, with architecture options and example code, so you can realistically build and deploy it.


🧠 Step-by-Step Plan: Grammar Checker using spaCy + LLM


1. Define the Goal

You want:

  • To detect grammar errors in web content (sentences, paragraphs, HTML).
  • To return corrected text, error positions, and possibly explanations.

So the model must handle both:

  • Detection: identifying incorrect grammar (token-level or sentence-level).
  • Correction: rewriting text correctly (sentence-level generative task).
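
It can help to fix the shape of the result before writing any detection logic. The sketch below is only one possible layout: the class names `GrammarIssue` and `GrammarResult` and their fields are hypothetical, chosen to mirror the three outputs listed above (corrected text, error positions, explanations).

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class GrammarIssue:
    # Hypothetical schema: one detected error
    type: str                          # e.g. "Double Determiner"
    start: int                         # character offset where the error begins
    end: int                           # character offset just past the error
    explanation: Optional[str] = None  # optional human-readable reason


@dataclass
class GrammarResult:
    # Hypothetical schema: everything returned for one input text
    original: str
    corrected: str
    issues: List[GrammarIssue] = field(default_factory=list)


result = GrammarResult(
    original="I bought the a book.",
    corrected="I bought a book.",
    issues=[GrammarIssue("Double Determiner", 9, 14, "Two determiners in a row.")],
)

# asdict() converts nested dataclasses to plain dicts, ready for JSON
payload = asdict(result)
```

Storing offsets as a half-open `(start, end)` range means `original[start:end]` returns exactly the flagged text, which keeps highlighting code simple.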

2. Why Combine spaCy + LLM?

Component | Role | Reason
spaCy | Tokenization, POS tagging, dependency parsing | Fast and deterministic pre-processing; can highlight specific grammar violations (like subject-verb disagreement, double determiners, etc.).
LLM (fine-tuned) | Grammar correction | Generates natural, grammatically correct rewrites and explanations.
Combined | Hybrid model | You get accuracy + interpretability + low latency (LLM handles only sentences flagged by spaCy).

3. Overall Architecture

                ┌────────────────────────┐
                │ Website Content (HTML) │
                └────────────┬───────────┘
                             │
                  [1] Extract visible text
                             │
                             ▼
                   ┌─────────────────┐
                   │ spaCy NLP Model │
                   │ (tokenize, tag) │
                   └───────┬─────────┘
                           │
           [2] Detect rule-based grammar issues
                           │
                           ▼
             ┌──────────────────────────┐
             │ Fine-tuned LLM (Grammar) │
             │ Correct & explain errors │
             └─────────────┬────────────┘
                           │
                 [3] Merge results + JSON
                           │
                           ▼
                  Response to API client
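
Step [1] of the diagram, extracting visible text, can be sketched with Python's standard-library `html.parser`. This is a minimal illustration, not a full HTML renderer: the class name `VisibleTextExtractor`, the helper `extract_visible_text`, and the two tag sets are assumptions made for this example, and CSS-hidden elements are not detected.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping tags whose content is never shown as body text."""

    SKIP_TAGS = {"script", "style", "noscript", "template", "title"}
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "th",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append(" ")  # block boundaries separate words

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces
        return " ".join("".join(self._chunks).split())


def extract_visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = ("<html><head><title>T</title><style>p{color:red}</style></head>"
        "<body><p>She go to school.</p><script>var x = 1;</script>"
        "<p>I has <b>a</b> pen.</p></body></html>")
visible = extract_visible_text(page)
```

Inline tags such as `<b>` add no extra spaces, while block tags do, so adjacent paragraphs do not run together before the text reaches spaCy.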

4. spaCy Pipeline Setup

Install:

pip install spacy
python -m spacy download en_core_web_sm

Example base pipeline:

import spacy

nlp = spacy.load("en_core_web_sm")

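def detect_grammar_issues(text):
    """Return rule-based issues found in `text`, with character offsets."""
    doc = nlp(text)
    issues = []
    for token in doc:
        # Simple rule example: two determiners in a row ("the a book").
        # Predeterminers (tag PDT, as in "all the books") are skipped
        # because they legitimately precede another determiner.
        if token.pos_ == "DET" and token.tag_ != "PDT" and token.i < len(doc) - 1:
            next_token = doc[token.i + 1]
            if next_token.pos_ == "DET":
                issues.append({
                    "type": "Double Determiner",
                    "text": f"{token.text} {next_token.text}",
                    # (start, end) character span covering both tokens
                    "position": (token.idx, next_token.idx + len(next_token))
                })
    return issues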

This gives you linguistically explainable error detection, to be passed to your LLM for correction.
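
Character offsets are what make these detections explainable to a reader. As a small illustration, the helper below wraps each flagged span in markers. The function name `highlight_issues` and the marker strings are hypothetical, and the sketch assumes each span is a `(start, end)` pair of character offsets with an exclusive end.

```python
def highlight_issues(text, spans, open_mark="[[", close_mark="]]"):
    """Wrap each (start, end) character span of `text` in markers."""
    out = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            continue  # skip spans that overlap one already marked
        out.append(text[cursor:start])
        out.append(open_mark + text[start:end] + close_mark)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


marked = highlight_issues("I bought the a book.", [(9, 14)])
```

The same offsets can drive underlines in a web editor or be sent alongside the sentence as context for the correction model.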


5. Fine-Tuning an LLM for Grammar Correction

You can use an open-source model such as Llama 3, Mistral, or T5 (an encoder-decoder model whose text-to-text design suits grammar correction).

Example: Fine-tuning with T5

T5 is a good fit for grammar tasks because it treats correction as a text-to-text mapping:

Input:  "I has a pen."
Output: "I have a pen."

Fine-tune with Hugging Face transformers:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy dataset for illustration; real fine-tuning needs far more pairs
examples = [
    {"input_text": "I has a pen.", "target_text": "I have a pen."},
    {"input_text": "She go to school.", "target_text": "She goes to school."},
]

train_encodings = tokenizer([e["input_text"] for e in examples], truncation=True, padding=True)
labels = tokenizer([e["target_text"] for e in examples], truncation=True, padding=True)

class GrammarDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        label_ids = torch.tensor(self.labels["input_ids"][idx])
        # Replace padding with -100 so the loss ignores those positions
        label_ids[label_ids == tokenizer.pad_token_id] = -100
        item["labels"] = label_ids
        return item

    def __len__(self):
        return len(self.labels["input_ids"])

train_dataset = GrammarDataset(train_encodings, labels)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()

# Save the model and tokenizer together so the directory can be reloaded later
model.save_pretrained("./grammar-t5")
tokenizer.save_pretrained("./grammar-t5")

You now have a fine-tuned grammar correction model.
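
Before wiring the model into anything else, it is worth measuring it on pairs it did not see during training. The sketch below computes a strict exact-match rate. The function name `exact_match_rate` is hypothetical, and `toy_corrector` is a stand-in lookup table used only so the example runs without a trained model. In practice you would pass a function that calls your fine-tuned model.

```python
def exact_match_rate(pairs, correct):
    """Fraction of (source, target) pairs where correct(source) equals target."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(
        1 for source, target in pairs
        if correct(source).strip() == target.strip()
    )
    return hits / len(pairs)


held_out = [
    ("I has a pen.", "I have a pen."),
    ("She go to school.", "She goes to school."),
    ("He don't like it.", "He doesn't like it."),
]

# Stand-in corrector (assumption): fixes two known sentences, echoes the rest
KNOWN_FIXES = {
    "I has a pen.": "I have a pen.",
    "She go to school.": "She goes to school.",
}


def toy_corrector(sentence):
    return KNOWN_FIXES.get(sentence, sentence)


rate = exact_match_rate(held_out, toy_corrector)
```

Exact match is deliberately strict: a correction that is valid but worded differently from the reference counts as a miss, so treat the number as a lower bound.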


6. Integrate Both (spaCy + LLM)

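from transformers import pipeline

grammar_corrector = pipeline("text2text-generation", model="./grammar-t5")

def grammar_check_pipeline(text):
    issues = []
    corrected_parts = []
    for sent in nlp(text).sents:
        # Step 1: rule-based detection, one sentence at a time
        sent_issues = detect_grammar_issues(sent.text)
        for issue in sent_issues:
            # Shift sentence-relative offsets to offsets in the full text
            start, end = issue["position"]
            issue["position"] = (start + sent.start_char, end + sent.start_char)
        issues.extend(sent_issues)

        # Step 2: send only the flagged sentences to the LLM
        if sent_issues:
            fixed = grammar_corrector(sent.text, max_length=256)[0]["generated_text"]
            corrected_parts.append(fixed + sent[-1].whitespace_)
        else:
            corrected_parts.append(sent.text_with_ws)

    return {
        "original": text,
        "corrections": "".join(corrected_parts),
        "issues": issues
    }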

7. Wrap as an API

Using FastAPI:

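from fastapi import FastAPI, Body

app = FastAPI()

# embed=True makes the endpoint expect a JSON object: {"content": "..."}
@app.post("/check")
def check_grammar(content: str = Body(..., embed=True)):
    # A plain `def` lets FastAPI run the blocking model call in a worker thread
    result = grammar_check_pipeline(content)
    return result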

Run it (assuming the code is saved as app.py and FastAPI and Uvicorn are installed):

uvicorn app:app --reload
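
With the server running, you can exercise the endpoint from Python using only the standard library. This sketch assumes Uvicorn's default address (`http://127.0.0.1:8000`) and that the endpoint accepts a JSON object with a `content` field, which is the shape the JavaScript snippet later in this article sends. The helper names `build_check_request` and `check_text` are hypothetical.

```python
import json
import urllib.request

DEFAULT_URL = "http://127.0.0.1:8000/check"  # assumption: Uvicorn defaults


def build_check_request(text, url=DEFAULT_URL):
    """Build a POST request carrying {"content": text} as JSON."""
    body = json.dumps({"content": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_text(text, url=DEFAULT_URL):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_check_request(text, url), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_check_request("I has a pen.")
```

Calling `check_text("I has a pen.")` performs the actual round trip. Building the request separately makes it easy to inspect the payload without a server.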

8. Training Data Sources (Grammar Correction)

You can bootstrap data from:

  • JFLEG, BEA-2019, or Lang-8 datasets
  • Or generate synthetic errors using spaCy (e.g., swapping verbs, removing articles)
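
The synthetic route can be sketched without any model at all. The example below is a simplified, spaCy-free stand-in for the idea: it corrupts clean sentences by dropping an article or swapping a verb form, and emits pairs in the `input_text` / `target_text` format used earlier. The function names and the small `VERB_SWAPS` table are assumptions for illustration. A spaCy-based version would pick verbs and determiners from POS tags instead of a fixed list.

```python
import random
import re

ARTICLE_RE = re.compile(r"\b(a|an|the)\s+", flags=re.IGNORECASE)

# Tiny illustrative table of agreement swaps (assumption, not exhaustive)
VERB_SWAPS = {
    "has": "have", "have": "has",
    "goes": "go", "go": "goes",
    "is": "are", "are": "is",
}


def drop_article(sentence, rng):
    """Remove one randomly chosen article, if any."""
    matches = list(ARTICLE_RE.finditer(sentence))
    if not matches:
        return sentence
    m = rng.choice(matches)
    return sentence[:m.start()] + sentence[m.end():]


def swap_verb(sentence, rng):
    """Replace one listed verb with its mismatched form, if any."""
    words = sentence.split(" ")
    candidates = [i for i, w in enumerate(words) if w.lower() in VERB_SWAPS]
    if not candidates:
        return sentence
    i = rng.choice(candidates)
    words[i] = VERB_SWAPS[words[i].lower()]
    return " ".join(words)


def make_pair(clean, rng):
    """Corrupt a clean sentence and return a training pair."""
    corrupt = rng.choice([drop_article, swap_verb])
    return {"input_text": corrupt(clean, rng), "target_text": clean}


rng = random.Random(0)  # seeded so runs are repeatable
pairs = [make_pair(s, rng) for s in ["She has a pen.", "He goes to the park."]]
```

Some corruptions leave a sentence unchanged (for example, when it contains no article). Filtering out pairs where input and target are identical, or keeping a controlled share of them, is a design choice worth making explicitly.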

9. (Optional) Website Integration

Once the API is live (and configured to accept cross-origin requests from your site):

  • Create a JavaScript snippet for websites to call your API:
fetch('https://api.grammarcheckapi.io/check', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ content: document.body.innerText }),
})
  .then(r => r.json())
  .then(console.log);

10. Future Enhancements

  • Use spaCy custom components to mark error spans in Doc.spans.
  • Integrate with FastAPI + Redis for caching corrections.
  • Add confidence scores.
  • Provide diff output between original and corrected text.
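
As a starting point for the diff output, Python's standard-library `difflib` can compare the original and corrected text word by word. This is a minimal sketch: the function name `word_diff` and the shape of the returned dictionaries are assumptions, and splitting on whitespace keeps punctuation attached to words.

```python
import difflib


def word_diff(original, corrected):
    """List the word-level edits that turn `original` into `corrected`."""
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # unchanged stretches are not reported
        edits.append({
            "op": op,                       # "replace", "delete", or "insert"
            "before": " ".join(a[i1:i2]),
            "after": " ".join(b[j1:j2]),
        })
    return edits


edits = word_diff("She go to school.", "She goes to school.")
```

Each entry can be returned next to the corrected text so a client can show exactly what changed instead of only the final sentence.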

Next steps:

  • How to generate a large grammar correction dataset automatically using spaCy rules + synthetic corruptions. This step is essential if you want to fine-tune your own LLM instead of relying on pre-trained ones.