Text Preprocessing
A pipeline tokenizes the sentence "Dr. Smith lives in Washington D.C." by splitting on whitespace, so the downstream model receives 6 tokens, with the periods still attached to "Dr." and "D.C.". A second pipeline first applies a rule-based sentence tokenizer, which produces 1 sentence, and then word-tokenizes that sentence. Why does naive whitespace tokenization fail here compared to the rule-based sentence tokenizer?
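The contrast above can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: the `ABBREVIATIONS` set and the `rule_based_sentences` helper are hypothetical stand-ins for what a real rule-based sentence tokenizer (e.g. one with a learned or curated abbreviation list) does internally.

```python
import re

text = "Dr. Smith lives in Washington D.C."

# 1. Whitespace tokenization: punctuation stays glued to the words,
#    and nothing distinguishes an abbreviation period from a full stop.
ws_tokens = text.split()
print(ws_tokens)  # ['Dr.', 'Smith', 'lives', 'in', 'Washington', 'D.C.']

# 2. Naive sentence splitting on '.' treats every period followed by
#    whitespace as a boundary, so it breaks after "Dr.".
naive_sents = [s for s in re.split(r"(?<=\.)\s+", text) if s]
print(naive_sents)  # two fragments, not one sentence

# 3. A rule-based splitter consults an abbreviation list before
#    committing to a sentence boundary (hypothetical minimal version).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "d.c.", "u.s."}

def rule_based_sentences(text: str) -> list[str]:
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # Only end a sentence on a period that is NOT an abbreviation.
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(rule_based_sentences(text))  # the single original sentence
```

The whitespace splitter yields 6 tokens with punctuation fused to "Dr." and "D.C.", while the abbreviation-aware splitter correctly keeps the input as one sentence before word tokenization.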