A lot of "learn English with YouTube" tools just dump every word from the captions into your face and call it a vocabulary list. The result is 80% noise — pronouns, articles, contractions, proper nouns, and the same 200 high-frequency words repeated until your brain melts.
When I was building TubeVocab, the hardest engineering problem wasn't scraping subtitles or shipping the React UI. It was the linguistic plumbing between raw caption text and a card a B1 learner would actually benefit from studying. That plumbing is a 4-stage NLP pipeline I tuned over 14 days and ~3,000 manual quality reviews. Here it is end-to-end, with the spaCy snippets that actually run in prod.
Stage 1: Lemmatization (one card per lemma, not per inflection)
If a video's subtitles contain "running", "ran", "runs", and "has run", learners don't need four cards. They need one card for run, with every inflected form surfaced as an example.
spaCy's en_core_web_sm lemmatizer does ~95% of this for free. The only tuning needed: I disable every pipeline component I don't use, which keeps throughput at ~12k tokens/sec on a single CPU.
import spacy

# Only the components lemmatization needs stay enabled (tok2vec, tagger,
# attribute_ruler, lemmatizer; the rule-based lemmatizer relies on POS tags).
# The parser and NER are dead weight for this stage.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def extract_lemmas(text: str) -> list[tuple[str, str]]:
    doc = nlp(text)
    return [
        (tok.lemma_.lower(), tok.pos_)
        for tok in doc
        if tok.is_alpha and not tok.is_stop  # skip punctuation, digits, stopwords
    ]
The is_stop filter alone removes ~40% of tokens (the/a/and/is/etc), which cascades into massive savings downstream.
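For a quick sanity check, here's a minimal way to exercise extract_lemmas on one caption line. The caption text is made up, and the exact lemmas and tags depend on the spaCy and model versions installed:

# Requires: pip install spacy && python -m spacy download en_core_web_sm
caption = "He has been running a marathon and ran faster than expected"
print(extract_lemmas(caption))
# Roughly: [('run', 'VERB'), ('marathon', 'NOUN'), ('run', 'VERB'),
#           ('fast', 'ADV'), ('expect', 'VERB')]  (varies by model version)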
Stage 2: POS-tag filtering (kill the proper nouns and the junk)
After lemmatization I have things like ("netflix", "PROPN"), ("ok", "INTJ"), ("uh", "INTJ"). None of these belong on a flashcard.
I keep only NOUN, VERB, ADJ, ADV and explicitly drop PROPN, INTJ, NUM, and anything tagged X (unknown).
KEEP_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def filter_by_pos(lemmas: list[tuple[str, str]]) -> list[str]:
    # Keeping a whitelist implicitly drops PROPN, INTJ, NUM, X, and everything else.
    return [lem for lem, pos in lemmas if pos in KEEP_POS]
Sounds trivial. The first version of TubeVocab didn't do this and ~18% of generated cards were words like "MrBeast", "TikTok", or "umm". Conversion to paid tanked because the first 5 cards a free user saw made the product look broken.
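To see what gets dropped without even loading spaCy, here's the filter applied to hand-written (lemma, POS) pairs of the kind Stage 1 emits. The pairs are illustrative, not real model output:

stage1_output = [
    ("mrbeast", "PROPN"),  # proper noun: dropped
    ("umm", "INTJ"),       # filler: dropped
    ("give", "VERB"),      # kept
    ("island", "NOUN"),    # kept
    ("ten", "NUM"),        # number: dropped
]
print(filter_by_pos(stage1_output))  # ['give', 'island']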
Stage 3: CEFR difficulty classification (the part that took 14 days)
Every card needs a CEFR band — A1 through C2. I tried 3 approaches:
- A pretrained CEFR classifier from HuggingFace — slow (~120ms/word), 25% disagreement with native-speaker spot checks.
- A custom fine-tuned BERT — 91% agreement but +800MB Docker image and 4s cold start. Not worth it.
- A frequency-band lookup with hand-tuned overrides — this won.
I merged EFLLex (CEFR-aligned word lists) with SUBTLEX-US (film/TV subtitle frequencies) and added ~600 manual overrides:
CEFR_BAND = load_static_cefr()  # {"run": "A1", "ostensibly": "C1", ...}

def classify(lemma: str, default: str = "B2") -> str:
    # Pure dict lookup: no model, no network call.
    return CEFR_BAND.get(lemma, default)
B2 is the default for unknown words because it's the median band for educational YouTube content. The lookup now takes ~0.4ms per word and agrees with manual reviews 89% of the time.
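The merge itself happens offline, and load_static_cefr just deserializes the result. Below is a sketch of how the table could be assembled; the file names, column names, and frequency thresholds are assumptions for illustration, since the real EFLLex and SUBTLEX-US distributions need their own pre-processing.

import csv

def build_cefr_table(
    efllex_path: str = "efllex_lemma_cefr.csv",    # hypothetical pre-processed file
    subtlex_path: str = "subtlex_lemma_freq.csv",  # hypothetical pre-processed file
    overrides: dict[str, str] | None = None,
) -> dict[str, str]:
    table: dict[str, str] = {}

    # 1. SUBTLEX-US frequency as a fallback: more frequent -> easier band.
    #    Thresholds (per million words) are illustrative, not tuned values.
    with open(subtlex_path, newline="") as f:
        for row in csv.DictReader(f):
            freq = float(row["freq_per_million"])
            if freq > 1000:
                band = "A1"
            elif freq > 100:
                band = "A2"
            elif freq > 20:
                band = "B1"
            elif freq > 5:
                band = "B2"
            elif freq > 1:
                band = "C1"
            else:
                band = "C2"
            table[row["lemma"].lower()] = band

    # 2. EFLLex bands win where available (CEFR-aligned by construction).
    with open(efllex_path, newline="") as f:
        for row in csv.DictReader(f):
            table[row["lemma"].lower()] = row["cefr"]

    # 3. Manual overrides get the final say (~600 of them in prod).
    table.update(overrides or {})
    return table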
Stage 4: Dedupe-by-context (the secret sauce)
A learner doesn't need 12 cards for run even if it appears in 12 videos. They need one card with the best example sentence.
For each lemma I score every candidate context sentence on three things: length (closest to ~15 tokens, inside the 10–20 sweet spot), CEFR density (sentences packed with other above-band words get penalized), and a small TextRank clarity score:
def best_context(lemma: str, candidates: list[Sentence]) -> Sentence:
    return max(
        candidates,
        key=lambda s: (
            -abs(len(s.tokens) - 15)          # prefer ~15-token sentences
            - 2 * count_above_band(s, "B2")   # penalize other hard words
            + textrank_score(s)               # reward "central", clear sentences
        ),
    )
This single change moved 14-day retention from 18% to 31%.
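Sentence, count_above_band, and textrank_score are TubeVocab internals that aren't shown here. A plausible sketch of what they could look like is below; the dataclass fields, the CEFR ordering, and the cached-centrality stand-in for TextRank are all assumptions, not the production code.

from dataclasses import dataclass

@dataclass
class Sentence:
    text: str
    tokens: list[str]   # lowercased lemmas for this subtitle sentence
    video_id: str
    timestamp: float    # seconds into the video where the sentence starts

CEFR_ORDER = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}

def count_above_band(sent: Sentence, band: str) -> int:
    # How many words in the sentence sit above the target band (e.g. C1/C2 words
    # in a B2 example sentence)? Uses the Stage 3 classify() lookup.
    limit = CEFR_ORDER[band]
    return sum(1 for tok in sent.tokens if CEFR_ORDER[classify(tok)] > limit)

# In prod the TextRank centrality would be computed once per video over all of its
# sentences; here it's a lookup into a hypothetical precomputed cache.
TEXTRANK_CACHE: dict[str, float] = {}

def textrank_score(sent: Sentence) -> float:
    return TEXTRANK_CACHE.get(sent.text, 0.0)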
Before vs. after on one real 12-min MrBeast video
| Stage (v4 pipeline) | Remaining after stage |
|---|---|
| Lemmatization (tokens) | 1,847 |
| POS filter (cards) | 612 |
| CEFR-band trim, B1–B2 target (cards) | 184 |
| Context dedupe (cards) | 71 |

User-reported "useful" rate on the resulting cards (n=40): 22% with v1 (lemma + frequency filter only) vs. 78% with v4 (the full pipeline above).
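To make that funnel concrete, here's roughly how the four stages chain together for one video, reusing the Sentence sketch above. This wrapper is a simplified sketch rather than the production code: it assumes a single B1–B2 target band, treats each caption line as one sentence, and ignores timestamps, multi-video merging, and caching.

from collections import defaultdict

TARGET_BANDS = {"B1", "B2"}  # assumed learner target range

def cards_for_video(captions: list[str]) -> dict[str, Sentence]:
    candidates: dict[str, list[Sentence]] = defaultdict(list)
    for line in captions:
        pairs = extract_lemmas(line)                       # Stage 1
        sent = Sentence(text=line,
                        tokens=[lem for lem, _ in pairs],
                        video_id="demo", timestamp=0.0)
        for lemma in filter_by_pos(pairs):                 # Stage 2
            if classify(lemma) in TARGET_BANDS:            # Stage 3
                candidates[lemma].append(sent)
    # Stage 4: one card per lemma, built from its best example sentence
    return {lemma: best_context(lemma, sents) for lemma, sents in candidates.items()}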
If you want to see the output without reading any spaCy code, these four stages are exactly what runs inside TubeVocab: paste a YouTube URL and get back ~50–100 CEFR-tagged cards, each with a timestamp that jumps back to the exact second the word was spoken.