d:["$","$L16",null,{"section":{"slug":"machine-learning","label":"Machine Learning","shortLabel":"Machine Learning","description":"Bias-variance, metrics, and theoretical trade-offs.","seoTitle":"Machine Learning Interview Questions","seoDescription":"Practice Machine Learning interview questions on bias-variance, metrics, and theoretical trade-offs.","keywords":["Machine learning interview questions","ML MCQs"],"icon":"M","iconColor":"bg-orange-600","status":"active","phase":2,"priority":0.9},"learnMcqs":[{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01001","difficulty":"easy","orderIndex":1,"question":"A company labels 100,000 emails as \"spam\" or \"not spam\" and trains a binary classifier. Six months later, they build a second model using only email metadata (no labels) to group emails into clusters. Which learning paradigm does each model use, and what structural property of the data drives this distinction?","options":{"A":"Both use supervised learning because the data was collected by the same team","B":"The first uses supervised learning (labeled targets drive optimization), the second uses unsupervised learning (no target variable — structure is inferred from the data distribution)","C":"The first uses unsupervised learning because clustering happens implicitly during classification; the second uses supervised learning because cluster labels become targets","D":"The second model uses reinforcement learning because it must explore and evaluate clusters through trial and error"},"correct":"B","explanation":{"correct":"- The defining structural property is the presence of a target variable `y`: supervised learning minimizes loss between predictions and labels; unsupervised learning finds structure (clusters, embeddings, densities) without any `y`.\n- The same raw dataset can produce both paradigms depending on whether labels are used — this is the conceptual foundation of semi-supervised learning, a common 
interview follow-up.\n- In production, the paradigm determines the evaluation strategy: supervised models use labeled held-out data for accuracy/F1; unsupervised models use intrinsic metrics (silhouette, inertia) or downstream task performance.","A":"Who collected the data has no bearing on the learning paradigm. The paradigm is determined by whether target labels are present in the optimization objective.","B":"","C":"Classification does not involve implicit clustering. The reversal here exploits confusion about what \"finding groups\" means — clustering and classification are distinct operations.","D":"Reinforcement learning requires an agent, an environment, actions, and a reward signal — none of which are present in clustering. Algorithmic exploration in k-means is not RL-style policy learning."},"reference":"- Goodfellow et al., Deep Learning, Chapter 5: https://www.deeplearningbook.org/contents/ml.html"},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01002","difficulty":"easy","orderIndex":2,"question":"You split a dataset into 70% train, 15% validation, and 15% test. Your model scores 95% on the validation set but only 67% on the test set. A teammate says this proves the model overfit to training data. 
What is the most precise diagnosis?","options":{"A":"The model overfit to training data — the 28-point gap is definitive proof of overfitting","B":"The model overfit to the validation set through repeated hyperparameter tuning, not to training data — this is exactly why a held-out test set is required","C":"The test set is too small at 15% to be statistically reliable, so the gap is sampling noise","D":"The validation and test sets were not stratified, causing class imbalance to distort the test score"},"correct":"B","explanation":{"correct":"- When you repeatedly tune hyperparameters by evaluating on the validation set and picking the best-performing configuration, information from the validation set leaks into your model selection process — this is called \"validation set overfitting\" or \"meta-overfitting.\"\n- Overfitting to training data would show low training loss and comparatively higher validation loss — but validation accuracy is 95%, so the model generalizes to the validation distribution. The gap is specifically between validation and test.\n- This is why the test set must never influence any decision — architecture, hyperparameters, feature engineering — and should be evaluated exactly once at the very end.","A":"Overfitting to training data would manifest as a gap between training and validation metrics, not between validation and test. High validation accuracy rules out classic overfitting to training data.","B":"","C":"15% of most ML datasets is thousands of samples — more than enough for statistical reliability. Attributing a 28-point gap to sampling noise requires extreme justification.","D":"While stratification matters, a 28-point gap from a stratification issue would require catastrophic class imbalance that the question does not state. 
Stratification problems cause consistent bias, not a cliff-edge drop."},"reference":"- Hastie et al., The Elements of Statistical Learning, Chapter 7: https://hastie.su.domains/ElemStatLearn/"},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01003","difficulty":"easy","orderIndex":3,"question":"A data scientist standardizes all features using `StandardScaler` fitted on the entire dataset before splitting into train and test sets. The model achieves 94% test accuracy. A reviewer flags this as a critical error. Why?","codeSnippet":"scaler = StandardScaler()\nX_scaled = scaler.fit_transform(X) # entire dataset\nX_train, X_test = train_test_split(X_scaled, test_size=0.2)","options":{"A":"`StandardScaler` should be applied after model training, not before","B":"The scaler was fitted on test data too, so test-set statistics (mean and std) influenced the training feature distribution — this is data leakage","C":"`fit_transform` is slower than fitting separately; the correct approach is `scaler.fit(X).transform(X)`","D":"Standardization should only be applied to the target variable, not to input features"},"correct":"B","explanation":{"correct":"- `fit_transform` on the full dataset computes mean and std using all rows, including test rows. Training features are now scaled using statistics that \"know about\" the test distribution.\n- In production, test data arrives after deployment — you would never have access to it during training. Fitting the scaler only on training data (`scaler.fit(X_train)`) correctly simulates this.\n- The performance impact is often small in practice, but in time-series or distribution-shift scenarios it can be significant. The conceptual violation is always critical in an interview context.","A":"Preprocessing is applied before model training — this is correct pipeline order. 
The error is not about timing relative to the model; it is about which data was used to compute the scaler parameters.","B":"","C":"`scaler.fit(X).transform(X)` is functionally identical to `fit_transform(X)` and has the exact same leakage problem. This option exploits confusion about method equivalence.","D":"Standardization is most commonly applied to input features. Applying it only to targets would be unusual and incorrect for most standard ML models."},"reference":"- scikit-learn Pipeline documentation: https://scikit-learn.org/stable/modules/compose.html"},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01004","difficulty":"easy","orderIndex":4,"question":"A fraud detection model is evaluated on a dataset with 1 million transactions, where 0.1% are fraud. The team does a random 80/20 train/test split and reports 99.9% test accuracy. Why should they be skeptical of this result before celebrating?","options":{"A":"80/20 is too large a training set — 50/50 splits are required for imbalanced data","B":"99.9% accuracy on this dataset is achievable by a model that predicts \"not fraud\" for every transaction — the metric is uninformative under extreme class imbalance","C":"The model is certainly overfit because 99.9% accuracy is unrealistically high for any real-world dataset","D":"A random split is invalid for fraud data because fraud events cluster in time and must use a temporal split"},"correct":"B","explanation":{"correct":"- With 0.1% fraud, a zero-rule classifier (always predict \"not fraud\") achieves exactly 99.9% accuracy trivially. High accuracy on imbalanced data is the canonical misleading metric trap.\n- The meaningful metrics for fraud detection are precision, recall, F1 on the minority class, AUC-ROC, and the Precision-Recall curve — none of which are reported here.\n- In production, a fraud model that never triggers allows real fraud to pass undetected. 
Reporting accuracy alone on an imbalanced problem is a red flag in any ML design review.","A":"The train/test ratio is not the issue. 80/20 is standard. No fixed ratio is \"required\" for imbalanced data — the problem is the choice of metric, not the split proportion.","B":"","C":"99.9% accuracy is not inherently a sign of overfitting — it is trivially achievable as demonstrated. Overfitting is diagnosed by comparing training accuracy to test accuracy, not by the absolute value alone.","D":"Temporal clustering is a valid concern for time-series fraud but is a separate, secondary issue from the metric problem the question is testing."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01005","difficulty":"easy","orderIndex":5,"question":"An ML pipeline processes data in this order: (1) raw data ingestion → (2) feature engineering → (3) model training → (4) evaluation → (5) deployment. A team adds a new step: \"re-engineer features based on test set error analysis.\" Precisely between which steps does this new step violate pipeline integrity, and why?","options":{"A":"Between steps 1 and 2 — feature engineering must happen before any data is seen by the model","B":"Between steps 4 and 2 — using test set errors to redesign features feeds test-set information backward into the feature space, creating look-ahead bias","C":"Between steps 3 and 4 — model training must complete before any analysis is performed","D":"Between steps 2 and 3 — features must be completely frozen before training begins"},"correct":"B","explanation":{"correct":"- The ML pipeline is a one-directional flow. Feeding test set error signals back into step 2 (feature engineering) means test data implicitly shapes the feature representation — a form of look-ahead bias or indirect data leakage.\n- This is analogous to a researcher peeking at exam answers before designing the exam questions. 
The resulting performance metrics are no longer trustworthy estimates of generalization.\n- The correct approach: analyze errors on a validation set, re-engineer features, then evaluate final performance on a completely untouched test set.","A":"Feature engineering after data ingestion is the correct order — this is not a violation. The violation is about direction of information flow, not absolute pipeline position.","B":"","C":"Model training completing before evaluation is correct pipeline order — this describes a valid step, not a violation.","D":"Freezing features before training is correct practice. But the question asks where the \"re-engineer from test errors\" step creates the problem — that is specifically the backward feedback from test results to feature design."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01006","difficulty":"medium","orderIndex":6,"question":"A team trains a churn prediction model. During feature engineering they include \"number of support tickets submitted in the 7 days after the billing date.\" The model scores 0.91 AUC in offline evaluation but drops to 0.54 AUC in production. What is the most likely cause?","options":{"A":"The model overfit due to too many features — regularization would close the production gap","B":"\"Tickets submitted in the 7 days after billing date\" cannot be known at prediction time — the model trained on future information relative to the prediction timestamp, which is temporal data leakage","C":"AUC is not an appropriate metric for churn prediction; the team should use accuracy instead","D":"The production dataset has a different class balance than the training data, causing the metric to drop"},"correct":"B","explanation":{"correct":"- Temporal data leakage occurs when a feature uses information from the future relative to the prediction timestamp. 
\"Tickets submitted in 7 days after billing date\" is knowable only 7 days after the billing date — but churn prediction runs at or before the billing date.\n- The model learned to rely on a signal that is causally downstream of the prediction event. In offline evaluation, future data was present in the dataset; in production, it isn't available.\n- This is one of the hardest leakage types to catch because the feature is plausible (\"support tickets predict churn\"). Always audit feature timestamps against the prediction timestamp.","A":"A 37-point AUC drop between offline and production is not a regularization problem. Overfitting would cause a smaller gap and would appear as training AUC >> validation AUC, not as an offline-to-production cliff.","B":"","C":"AUC is widely used and appropriate for churn prediction, especially when score ranking matters for intervention campaigns. The metric is not the cause of the drop.","D":"Class balance shifts can affect metric values, but AUC is relatively robust to class imbalance since it measures ranking across all thresholds. A 37-point collapse requires a systematic cause like leakage."},"reference":"- Kaufman et al., \"Leakage in Data Mining\": https://dl.acm.org/doi/10.1145/2020408.2020496"},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01007","difficulty":"medium","orderIndex":7,"question":"You are designing a train/test split for a dataset of 10,000 user sessions where each user contributes an average of 50 sessions. A colleague applies session-level random splitting. 
Why is this split strategy incorrect for evaluating generalization to new users?","codeSnippet":"from sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)","options":{"A":"The test set is too small; it should be at least 30% of the data for reliable evaluation","B":"Session-level random splits allow the same user's sessions to appear in both train and test, letting the model memorize user-specific patterns and overestimate generalization to new users","C":"`train_test_split` does not support session data — a time-series split must always be used for session-level data","D":"`random_state` is not set, so results are non-reproducible — this is the primary flaw"},"correct":"B","explanation":{"correct":"- With 10,000 sessions from ~200 users (50 sessions each), a session-level random split puts roughly 80% of each user's sessions in train and 20% in test. The model can learn user-identity patterns and apply them to the same user's test sessions.\n- This inflates test performance because you are measuring interpolation within known users, not generalization to unseen users. In production, the model will encounter new users with zero historical sessions.\n- The correct approach is a **user-level split**: all sessions from a given user go entirely into train or entirely into test.","A":"20% test size (2,000 sessions) is statistically adequate. The issue is user-identity leakage, not dataset size.","B":"","C":"`train_test_split` can be applied to any tabular data including sessions. The problem is the granularity of the split entity, not the function used. Time-series splits address temporal ordering, which is a different concern.","D":"Not setting `random_state` affects reproducibility, not correctness. 
A non-reproducible split is a best-practice concern, not a validity issue."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01008","difficulty":"medium","orderIndex":8,"question":"A reinforcement learning agent is trained to play chess. A developer describes it as: \"the agent sees the board, a neural net predicts the best move, and the model is trained on historical grandmaster games with move quality labels.\" A senior ML engineer says this description is wrong about the learning paradigm. Who is correct and why?","options":{"A":"The developer is correct — predicting move quality from labeled data is supervised learning regardless of the game domain","B":"The senior engineer is correct — any game-playing agent is by definition reinforcement learning","C":"The senior engineer is correct — the description is of supervised learning, not RL; RL requires an agent learning from delayed outcome rewards, not from labeled (board, move-quality) pairs","D":"Both are correct — RL and supervised learning are equivalent when the reward is immediate"},"correct":"C","explanation":{"correct":"- Reinforcement learning requires an agent that takes actions, receives delayed rewards from an environment, and learns a policy through trial and interaction. It does not require labeled training examples.\n- Learning from a dataset of (board state → labeled move quality) pairs is supervised learning, regardless of the domain being chess. AlphaGo's first phase used supervised learning on human games before switching to RL through self-play.\n- The domain (chess, games) does not determine the paradigm — the training signal does. This is a common misconception that trips up developers in interviews.","A":"The developer's description is technically accurate about the training signal. 
The question asks whether the senior engineer's correction is valid — it is, because calling a supervised setup \"RL\" is a paradigm misclassification.","B":"Game-playing does not imply RL. Deep Blue used minimax search with hand-crafted evaluation — no learning at all. AlphaGo's SL policy network used supervised learning on human moves before RL self-play.","C":"","D":"Immediate rewards do not make RL equivalent to supervised learning. In RL, a scalar feedback from the environment follows an action; in supervised learning, a label is paired with each input. They are distinct training regimes with different update rules."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01009","difficulty":"medium","orderIndex":9,"question":"A data scientist applies SMOTE oversampling before the train/test split to handle class imbalance. Validation F1 on the minority class is 0.87. A reviewer marks this result as inflated. What is the exact mechanism causing the inflation?","codeSnippet":"X_resampled, y_resampled = SMOTE().fit_resample(X, y) # before split\nX_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled)","options":{"A":"SMOTE creates too many synthetic samples, which always inflates F1 regardless of when it is applied","B":"SMOTE generates synthetic points by interpolating between real minority-class samples; if applied before splitting, synthetic test samples are geometric near-neighbors of training samples, giving the model an artificial advantage on the test set","C":"SMOTE should not be used on imbalanced datasets; stratified sampling is the only valid approach","D":"The test set created from a SMOTE-augmented dataset still contains real samples, so no inflation is possible"},"correct":"B","explanation":{"correct":"- SMOTE generates synthetic points by interpolating between a minority-class sample and one of its k-nearest neighbors. 
If SMOTE runs on the full dataset before splitting, some synthetic test samples will be geometrically close to training samples — the model has effectively \"seen\" the test-space neighborhood.\n- This violates the independence assumption between train and test. Evaluation results are optimistically biased.\n- The correct practice: SMOTE is applied **only to the training set** after splitting. The test set must consist of real, unaugmented samples reflecting the production distribution.","A":"SMOTE applied correctly (after splitting, on training data only) does not inflate test metrics. The inflation is caused by when SMOTE is applied, not the technique itself.","B":"","C":"SMOTE is a valid and widely-used oversampling technique. Stratified sampling addresses split proportions, not class imbalance during training.","D":"Even though test samples may be \"real,\" the synthetic training samples are near-neighbors of those real test samples due to the interpolation process. The independence assumption is violated regardless of sample authenticity."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01010","difficulty":"medium","orderIndex":10,"question":"A house price prediction model includes a feature: \"this property's last recorded sale price.\" The model achieves very high R² on the test set. During production, this feature is unavailable so it is removed — performance collapses. 
What does this reveal about the training pipeline?","options":{"A":"The model needed regularization to prevent over-reliance on a single feature","B":"The feature was a target proxy — it encoded near-direct information about the target variable (house price), making it a form of target leakage","C":"The test data was collected from a different time period than training data, causing distribution shift","D":"R² is not a reliable metric for regression models with highly correlated features"},"correct":"B","explanation":{"correct":"- Target leakage occurs when a feature encodes the target variable directly or as a near-proxy. \"Last recorded sale price\" is essentially a direct measurement of what the model is predicting — house prices — so the model learned to use historical price as its answer.\n- In a real deployment, this feature doesn't exist before the sale completes — it cannot be used for prediction. The pipeline must always verify feature availability at prediction time, not just at training time.\n- This is subtler than pure data leakage: the feature is plausible (real estate agents reference recent sales), but temporal availability at inference time was never verified.","A":"Regularization prevents overfitting to statistical noise, not over-reliance on a causally plausible but temporally unavailable feature. The problem is feature availability, not model complexity.","B":"","C":"Distribution shift causes gradual degradation. The question states the collapse occurred directly upon removing the feature — this is causal, not temporal.","D":"R² is a valid regression metric. 
High R² driven by a leaking feature is a pipeline design failure, not a metric deficiency."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01011","difficulty":"medium","orderIndex":11,"question":"A team evaluates five model architectures by testing each one on the held-out test set and selects the architecture with 95% test accuracy for deployment. A month later, production accuracy is 82%. What mistake was made, and what process would have prevented it?","options":{"A":"The model overfit to training data — more regularization during training would prevent the gap","B":"The test set was used as a selection criterion, converting it into an implicit validation set — the 95% is no longer an unbiased estimate; a truly held-out final test set never used in selection would have given an honest estimate","C":"The team should have used cross-validation instead of a fixed split for architecture comparison","D":"A 13-point production gap is expected noise — test-to-production gaps of this size are normal"},"correct":"B","explanation":{"correct":"- A test set provides an unbiased performance estimate only if it is evaluated exactly once on the final selected model. Using it to choose among 5 architectures turns it into an implicit validation set — you are doing 5-way model selection on it.\n- With 5 candidates, there is a meaningful probability that one will \"luck into\" high test accuracy due to random alignment with the test distribution, not genuine generalization.\n- The correct setup: use a validation set for all selection decisions; reserve the test set for the single final evaluation after all model choices are made.","A":"Overfitting to training data would show low training loss and high validation loss. 
The test accuracy is 95% — the gap is between test and production, indicating test set integrity was compromised, not that training overfit.","B":"","C":"Cross-validation is good practice but does not resolve the issue if the test set is still used for final selection. The problem is the test set's role in decision-making, not the validation strategy.","D":"A 13-point production gap is not expected noise. With a test set of a few thousand samples, sampling noise in accuracy is typically around one to two points. A gap this large indicates a systematic flaw in the evaluation setup."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01012","difficulty":"hard","orderIndex":12,"question":"A medical imaging model is trained on data from 5 hospitals using a random 80/20 sample-level split. The model performs well in cross-hospital evaluation at deployment to a 6th hospital, but poorly on a 7th hospital from a different country. Which type of leakage or bias is responsible, and what split strategy would better estimate cross-institution generalization?","options":{"A":"The model overfit due to too few samples — collecting more data from the 5 original hospitals would fix generalization","B":"Sample-level splitting distributed all 5 hospitals into both train and test, so the model learned institution-specific confounders (equipment, demographics, labeling conventions); a hospital-level split would test true cross-institution generalization","C":"The test set should have been balanced across disease categories using stratified sampling to ensure all classes are represented","D":"Random splits are always appropriate for medical data — the poor performance on the 7th hospital is uncontrollable distribution shift"},"correct":"B","explanation":{"correct":"- Sample-level splitting puts samples from all 5 hospitals in both train and test. The model learns hospital-specific signals: scanner calibration artifacts, patient population characteristics, radiologist labeling conventions. 
These are institution confounders, not generalizable medical knowledge.\n- Evaluating within the same 5 hospitals (even with a random split) measures in-distribution performance. Generalizing to unseen institutions requires an institution-level held-out split.\n- The fix is a **site-level split**: hold out all samples from one or more hospitals entirely for testing. This is standard practice in federated learning and clinical ML validation (e.g., multi-site trials).","A":"More data from the same 5 hospitals deepens the institution-specific confounders rather than helping cross-institution generalization. It may make things worse.","B":"","C":"Stratified sampling ensures class representation in train/test but does not address institution confounders. A disease-stratified random split still contaminates test with all 5 hospitals' signals.","D":"Distribution shift from a new institution is the symptom, not the root cause. The root cause is a split strategy that never tested out-of-institution generalization. This is absolutely addressable by changing split granularity."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01013","difficulty":"hard","orderIndex":13,"question":"You train a binary classifier and achieve 0.92 AUC-ROC on the test set. You threshold predictions at 0.5 and report accuracy. 
A stakeholder says \"92% AUC means the model is 92% accurate.\" In what specific scenario would a model with 0.92 AUC coexist with accuracy near the trivial baseline?","options":{"A":"Only when the dataset has more than 1 million samples — AUC and accuracy decouple at scale","B":"When the positive class is rare (e.g., 1%), a model with 0.92 AUC may still place most predicted probabilities below 0.5 — predicting \"negative\" for all samples would yield 99% accuracy, making the threshold-based accuracy trivially high and misleading","C":"AUC is always higher than accuracy on imbalanced datasets because it corrects for class imbalance mathematically","D":"AUC above 0.9 guarantees that accuracy at any threshold will be above 90%"},"correct":"B","explanation":{"correct":"- AUC-ROC measures the probability that the model ranks a random positive above a random negative across all possible thresholds. It evaluates ranking quality, not prediction at a specific threshold.\n- On a 1% positive class, a model with excellent ranking (0.92 AUC) may still produce raw probabilities mostly below 0.5. A threshold at 0.5 then predicts negative for nearly everything, yielding 99% accuracy — the same as predicting the majority class always.\n- Conversely, a miscalibrated model can have high AUC but very low accuracy at the default threshold. This is why threshold tuning and proper calibration are separate steps from ranking evaluation.","A":"Dataset size does not determine the AUC-accuracy relationship. The decoupling occurs due to class imbalance and probability miscalibration, both of which can happen at any scale.","B":"","C":"AUC does not \"correct\" for class imbalance. Precision-Recall AUC is generally preferred for imbalanced datasets precisely because ROC AUC can appear optimistically high due to the large true-negative pool inflating the metric.","D":"AUC above 0.9 guarantees only ranking quality. 
A perfectly ranked model (AUC = 1.0) with all probabilities in [0.01, 0.02] range will predict \"negative\" at threshold 0.5 for every sample — 99% accuracy on a 1% positive class."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01014","difficulty":"hard","orderIndex":14,"question":"A team trains a loan default prediction model. After deployment, they discover their preprocessing pipeline (imputation, encoding, scaling) was fitted on the full dataset including future months of data. The business insists performance is great — \"the deployed model is identical to what we tested.\" Why does the leakage matter even if production performance looks strong?","options":{"A":"It doesn't — if production performance is strong, the leakage is irrelevant","B":"The leakage corrupts the measurement system: the team cannot distinguish genuine generalization from leakage-inflated metrics, future retraining on new data may degrade silently without explanation, and model comparison decisions made during development were potentially wrong","C":"The model must be retrained immediately because leaked preprocessing invalidates all model weights","D":"Data leakage only matters in healthcare; financial models are not affected because regulations require fair data use"},"correct":"B","explanation":{"correct":"- Leakage corrupts the evaluation system, not necessarily the deployed weights. The model may genuinely perform well — but the team cannot establish how much performance is due to generalization vs. leakage-assisted metric inflation.\n- The real danger appears at retraining time: when the team retrains periodically on new data (without future leakage), they may see a performance drop and not know why. They will chase a phantom problem, potentially deploying an inferior model.\n- Leakage also poisons model comparison. 
If Leaky Model A scores 0.93 AUC and Clean Model B scores 0.89 AUC, the team deploys A when B may actually be the better generalizer.","A":"\"It works in production\" is survivorship bias. Short-term production metrics can look good due to temporal correlation, luck, or leakage. Without an unbiased evaluation, you cannot confirm what is driving performance.","B":"","C":"Leakage in preprocessing does not technically corrupt model weights — it means the weights were optimized using inflated feature representations. Retraining is advisable, but calling weights \"invalidated\" overstates the technical mechanism.","D":"Data leakage is a universal ML problem independent of domain. Financial regulations address fairness and explainability — they do not specifically prevent preprocessing leakage, and this claim is categorically false."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01015","difficulty":"hard","orderIndex":15,"question":"A team uses 5-fold cross-validation. For each fold, they fit a `StandardScaler` on the training fold and transform the validation fold separately. 
A new team member suggests fitting the scaler once on all data — \"StandardScaler parameters barely change between folds.\" What is the senior engineer's precise objection?","codeSnippet":"# Current correct implementation\nfor train_idx, val_idx in kfold.split(X):\n scaler = StandardScaler()\n X_train_scaled = scaler.fit_transform(X[train_idx])\n X_val_scaled = scaler.transform(X[val_idx])","options":{"A":"The new team member is correct — fitting the scaler once is computationally more efficient and produces numerically identical results","B":"Fitting the scaler on all data leaks validation fold statistics (mean and std) into the training fold's preprocessing, violating the independence of each fold as a simulated held-out set","C":"`StandardScaler` parameters change negligibly between folds so the bias is practically zero — the senior engineer is over-engineering","D":"The correct fix is to use `MinMaxScaler` instead, which doesn't require fitting on the training set"},"correct":"B","explanation":{"correct":"- Cross-validation simulates the train-on-some/evaluate-on-held-out process. If the scaler is fitted on all data, the validation fold's mean and std values are embedded in the scaler parameters — the validation fold is no longer truly unseen.\n- This is particularly harmful for small datasets, features with outliers, or non-stationary distributions where fold-level statistics differ meaningfully.\n- `sklearn.pipeline.Pipeline` automates this correctly: any transformer inside a Pipeline is fitted only on training data within each fold automatically, which is exactly what the correct loop does manually.","A":"Fitting once does not produce identical results — the global mean and std include contribution from each fold's held-out samples. Results may be numerically close for large datasets but the principle is violated and the bias is real.","B":"","C":"\"Practically zero\" is context-dependent and misleading. 
For small datasets, skewed features, or few folds, the bias can be significant. More critically, the methodology is wrong even when the numerical difference is small — pipelines fail unexpectedly in edge cases.","D":"`MinMaxScaler` also requires fitting (to compute feature-wise min and max). It has the exact same leakage problem if fitted on all data. Switching scalers does not resolve the underlying issue."},"reference":"- scikit-learn Pipeline documentation: https://scikit-learn.org/stable/modules/compose.html#pipeline\n- Cross-validation guide: https://scikit-learn.org/stable/modules/cross_validation.html"},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02001","difficulty":"easy","orderIndex":1,"question":"You train a linear regression model using OLS. The closed-form solution gives coefficients that perfectly minimize the training loss. A colleague says \"since the loss is minimized, the model is optimal.\" What critical nuance does this claim miss?","options":{"A":"OLS does not minimize squared error — it minimizes absolute error","B":"OLS minimizes training loss exactly, but \"optimal\" requires generalization to unseen data, which OLS cannot guarantee — a model with perfect training loss can still overfit if the number of features approaches the number of samples","C":"OLS can only minimize loss when features are uncorrelated with each other","D":"The closed-form solution minimizes loss only when all feature values are positive"},"correct":"B","explanation":{"correct":"- OLS minimizes the sum of squared residuals on training data exactly via the normal equations. This is the mathematical definition of what OLS does.\n- \"Optimal\" in ML means generalizing to unseen data. 
When the number of predictors is close to the number of observations, OLS fits noise perfectly (R² = 1) but generalizes poorly — this is overfitting in the classical sense.\n- In production, a model with zero training loss and terrible test loss is worse than a regularized model with slightly higher training loss. OLS optimality is strictly in-sample.","A":"OLS minimizes sum of squared residuals (L2 loss), not absolute error. Least absolute deviations (LAD) regression minimizes absolute error — they are different estimators with different robustness properties.","B":"","C":"OLS computes valid coefficient estimates regardless of feature correlation. High multicollinearity makes coefficients unstable and hard to interpret, but OLS will still converge to a solution (unless features are perfectly collinear).","D":"OLS makes no assumption about the sign of feature values. The normal equations work on any real-valued feature matrix."},"reference":"- Hastie et al., The Elements of Statistical Learning, Chapter 3: https://hastie.su.domains/ElemStatLearn/"},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02002","difficulty":"easy","orderIndex":2,"question":"A linear regression model is trained to predict employee salary from years of experience. The residual plot shows a clear curved (parabolic) pattern rather than random scatter. 
Which OLS assumption is violated, and what is the consequence?","options":{"A":"Homoscedasticity — the variance of residuals changes across predicted values, inflating standard errors","B":"Linearity — the true relationship between the predictor and outcome is nonlinear, so OLS fits a line through a curve, producing systematically biased predictions across the predictor's range","C":"Independence — the residuals are correlated with each other because salary data is collected sequentially","D":"Normality of errors — the curved residuals indicate non-normal error distribution, which invalidates p-values"},"correct":"B","explanation":{"correct":"- The linearity assumption requires that the true relationship between predictors and the outcome is linear. A parabolic residual pattern means the model is missing a nonlinear component — the error is not random noise but systematic bias.\n- The consequence is that predictions are wrong in a directional, predictable way: the model underpredicts at low and high values and overpredicts in the middle (or vice versa), depending on the curve direction.\n- The fix is feature transformation (e.g., adding `experience²` as a predictor) or switching to a nonlinear model. A curved residual plot is one of the clearest diagnostic signals in regression.","A":"Homoscedasticity violations show a funnel shape in residuals (variance increasing or decreasing with predicted value), not a systematic curve. A curved pattern is not a variance issue.","B":"","C":"Independence violations produce autocorrelation in residuals — typically diagnosed with a Durbin-Watson test on time-ordered data, not a parabolic pattern in a residual vs. fitted plot.","D":"Non-normal errors produce a skewed or heavy-tailed residual distribution, visible in a Q-Q plot, not a systematic curve in the residual vs. 
fitted plot."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02003","difficulty":"easy","orderIndex":3,"question":"A linear regression model on house prices achieves R² = 0.85 on the training set. A manager concludes \"the model explains 85% of the variance in house prices.\" Is this interpretation correct, and what common misuse does it enable?","options":{"A":"The interpretation is incorrect — R² of 0.85 means the model is 85% accurate in absolute price terms","B":"The interpretation is correct for training data, but reporting training R² as model quality enables overfitting — the same model may have R² = 0.30 on the test set, which the training R² completely hides","C":"R² = 0.85 means 85% of predictions are within one standard deviation of the true price","D":"The interpretation is correct and training R² is always a reliable estimate of generalization quality"},"correct":"B","explanation":{"correct":"- R² measures the proportion of variance in the target explained by the model: $R^2 = 1 - \\frac{SS_{res}}{SS_{tot}}$. An R² of 0.85 does mean the model explains 85% of training variance — the interpretation itself is technically correct.\n- The misuse is treating training R² as a generalization metric. A model with many features can achieve R² close to 1.0 on training data by overfitting, while explaining almost nothing on unseen data.\n- Always report test set R² or cross-validated R². Training R² is a diagnostic for fit, not for generalization.","A":"R² is a variance-explained measure, not an absolute accuracy measure. 85% accuracy in absolute terms would require a different metric like MAPE or MAE relative to price.","B":"","C":"R² has no direct relationship to predictions being within one standard deviation. That would be a confidence interval statement, not an R² statement.","D":"Training R² is not a reliable estimate of generalization. 
Adding irrelevant features always increases (or maintains) training R² even when they add noise — this is why adjusted R² and test-set R² exist."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02004","difficulty":"easy","orderIndex":4,"question":"A linear regression model is trained on daily stock returns. The Durbin-Watson statistic comes back at 0.8. A data scientist says this is a minor issue and proceeds with standard OLS inference. Why is this dangerous?","options":{"A":"A Durbin-Watson value below 2 is normal and indicates a well-fitted model","B":"A Durbin-Watson value near 0 indicates strong positive autocorrelation in residuals — this violates the independence assumption, making OLS standard errors underestimated and all hypothesis tests (p-values, confidence intervals) invalid","C":"The Durbin-Watson test only applies to classification models; for regression it is irrelevant","D":"Autocorrelation in residuals only matters when the dataset has fewer than 1,000 rows"},"correct":"B","explanation":{"correct":"- Durbin-Watson ranges from 0 to 4: 2 indicates no autocorrelation, values near 0 indicate positive autocorrelation, values near 4 indicate negative autocorrelation. A value of 0.8 signals strong positive autocorrelation in residuals.\n- Positive autocorrelation makes the effective sample size smaller than the nominal sample size — OLS treats correlated observations as independent, artificially deflating standard errors. Confidence intervals are too narrow and p-values are too small.\n- For time-series data, the correct approaches are GLS (generalized least squares), ARIMA, or including lagged terms. Proceeding with OLS produces spuriously significant results.","A":"Values below 2 are not automatically \"normal.\" The reference value is exactly 2 for no autocorrelation. 
Deviations in either direction are violations.","B":"","C":"The Durbin-Watson test was specifically designed for regression residuals, particularly in time-series contexts. It is not applicable to classification and is extremely relevant to regression.","D":"Autocorrelation violates OLS assumptions regardless of sample size. Larger samples do not repair the standard errors: a variance estimator that ignores autocorrelation stays wrong no matter how much data is collected."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02005","difficulty":"easy","orderIndex":5,"question":"A residual plot for a linear regression shows that residuals fan out as the predicted value increases — small predictions have small residuals, large predictions have large residuals. Which assumption is violated and what does this mean for OLS coefficient estimates?","options":{"A":"Linearity is violated — the fanning pattern means a polynomial term is needed","B":"Homoscedasticity is violated — variance of residuals is not constant across predicted values; OLS coefficient estimates remain unbiased but are no longer the minimum-variance estimators (BLUE), and standard errors are wrong","C":"Independence is violated — fanning indicates autocorrelation among observations","D":"Normality is violated — the fanning indicates a heavy-tailed error distribution requiring robust regression"},"correct":"B","explanation":{"correct":"- Homoscedasticity requires that the variance of the error term $\\varepsilon$ is constant: $\\text{Var}(\\varepsilon_i) = \\sigma^2$ for all $i$. A fanning pattern (heteroscedasticity) means $\\text{Var}(\\varepsilon_i)$ increases with predicted value.\n- Under heteroscedasticity, OLS coefficients are still unbiased (unbiasedness relies on the errors having zero mean given the predictors, not on homoscedasticity). 
However, they are no longer BLUE (Best Linear Unbiased Estimators) — GLS or WLS achieves lower variance.\n- More practically: OLS standard errors are wrong, making all t-tests and confidence intervals unreliable. This is why heteroscedasticity-robust standard errors (White's sandwich estimator) exist.","A":"Linearity violations show a curved (systematic) pattern in residuals. A fanning pattern is a variance pattern (grows with fitted values), not a curvature pattern.","B":"","C":"Independence violations (autocorrelation) are diagnosed on time-ordered residual plots, not residual-vs-fitted plots, and show wavelike patterns rather than fanning.","D":"Heavy-tailed distributions produce outliers in residuals uniformly, not a fan that grows with predicted value. The fanning is specifically about variance scaling."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02006","difficulty":"easy","orderIndex":6,"question":"You add 20 new random noise features (completely uncorrelated with the target) to a linear regression model. What happens to the training R², the test R², and why do they diverge?","options":{"A":"Both training R² and test R² increase because more features always improve fit","B":"Training R² increases or stays the same (OLS fits noise), test R² decreases or stays the same — the divergence is the gap between in-sample fit inflation and generalization degradation","C":"Training R² stays the same because OLS ignores features with zero correlation with the target","D":"Both decrease because adding irrelevant features introduces multicollinearity"},"correct":"B","explanation":{"correct":"- OLS will assign small, nonzero coefficients to noise features because they capture random in-sample correlation with the target. This always increases (or maintains) training R².\n- Noise features add noise to predictions on unseen data — the model learned to use patterns that don't generalize. 
Test R² decreases as variance from noise coefficients accumulates.\n- This is why adjusted R² penalizes for the number of predictors: $\\bar{R}^2 = 1 - (1-R^2)\\frac{n-1}{n-k-1}$ where $k$ is the number of predictors. It can decrease when useless features are added.","A":"More features mechanically increase training R² but not test R². The divergence is the very definition of overfitting in regression.","B":"","C":"OLS does not ignore zero-correlation features. It assigns whatever coefficients minimize training residuals — for noise features, those coefficients are small but nonzero and still inflate training R².","D":"Noise features don't introduce multicollinearity among existing features. They may increase the condition number of the feature matrix, but the primary effect is overfitting via noise coefficient absorption."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02007","difficulty":"medium","orderIndex":7,"question":"A dataset has 500,000 rows and 8 features. A teammate argues that gradient descent should be used instead of the OLS closed-form solution for fitting linear regression. 
Under what condition would this argument be correct, and when is it incorrect?","options":{"A":"Gradient descent is always preferred because it is more numerically stable than the normal equations","B":"Gradient descent is preferred when the feature matrix is too large to invert efficiently (e.g., millions of features) — for 8 features and 500,000 rows, the normal equations solve in milliseconds and gradient descent adds unnecessary hyperparameter complexity","C":"Gradient descent is preferred when features are correlated, because the normal equations produce incorrect results under multicollinearity","D":"The closed-form OLS solution requires normally distributed features; gradient descent has no such requirement"},"correct":"B","explanation":{"correct":"- The OLS closed-form requires computing $(X^TX)^{-1}$, a $(p \\times p)$ matrix inversion where $p$ is the number of features. For $p = 8$, this is trivially fast regardless of the number of rows.\n- The computational cost of the normal equations scales as $O(p^3)$ for the inversion and $O(np^2)$ for $X^TX$. When $p$ is large (hundreds of thousands of features), the inversion becomes infeasible and gradient descent is preferred.\n- With 500,000 rows and 8 features, gradient descent introduces learning rate tuning, convergence checking, and mini-batch sizing for no benefit over the exact closed-form solution.","A":"The normal equations are numerically stable for well-conditioned feature matrices. Near-singular matrices (high multicollinearity) can cause numerical issues, but this is addressed via regularization or feature pruning — not by defaulting to gradient descent.","B":"","C":"Multicollinearity makes $(X^TX)$ near-singular, which causes numerical instability in OLS — but this is also a problem for gradient descent convergence. Neither method magically \"handles\" multicollinearity correctly.","D":"OLS makes no distributional assumptions about features. 
The normality assumption applies to errors (residuals), not features, and even then is only needed for valid hypothesis testing — not for the coefficient estimates themselves."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02008","difficulty":"medium","orderIndex":8,"question":"A linear regression model predicts employee salary using age, years of experience, and age × experience as a feature. The VIF (Variance Inflation Factor) for \"age\" is 47. The model has R² = 0.88. What does this tell you, and what is the specific risk?","options":{"A":"VIF of 47 confirms the model is overfit — reducing features would lower the VIF and improve generalization","B":"VIF of 47 indicates severe multicollinearity — the coefficient for \"age\" is highly unstable; small changes to training data will cause large swings in the age coefficient, making it uninterpretable and sensitive to sampling variation","C":"VIF above 10 invalidates R², so the reported 0.88 is meaningless","D":"High VIF means the model cannot make predictions for new data points"},"correct":"B","explanation":{"correct":"- VIF measures how much the variance of a coefficient is inflated due to correlation with other predictors: $\\text{VIF}_j = \\frac{1}{1 - R_j^2}$ where $R_j^2$ is the R² from regressing feature $j$ on all other features. VIF = 47 means the variance of the age coefficient is 47× what it would be if age were uncorrelated with other features.\n- This does not prevent predictions — the combined prediction $\\hat{y} = \\beta_1 x_1 + \\beta_2 x_2 + \\beta_3 x_3$ can still be accurate. But individual coefficients are unreliable for interpretation or inference.\n- The interaction term (age × experience) is the cause: it is nearly a linear combination of age and experience when both are continuous, creating near-perfect collinearity.","A":"Multicollinearity and overfitting are separate concepts. High VIF does not indicate overfitting. 
Overfitting is about train/test gap; multicollinearity is about coefficient stability.","B":"","C":"High VIF does not invalidate R². R² measures variance explained in the outcome, which is not affected by predictor collinearity. The predictions can be fine even when individual coefficients are unstable.","D":"Multicollinearity does not prevent prediction. The model can still produce predictions for new data, and those predictions stay stable even when coefficients swing wildly, because the multicollinear predictors compensate for each other. What suffers is coefficient interpretation."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02009","difficulty":"medium","orderIndex":9,"question":"A team fits a linear regression model with 50 predictors on 60 observations. The model achieves R² = 0.98 on training data. They report this as evidence of a strong model. What is the correct diagnosis?","options":{"A":"R² = 0.98 with 50 predictors and 60 observations is strong evidence the model has found real signal in the data","B":"With 50 predictors and 60 observations, OLS has near-perfect freedom to fit training noise — R² close to 1 is mathematically expected regardless of real signal, and the model almost certainly has negative R² on holdout data","C":"The model is valid because R² = 0.98 exceeds the standard 0.90 threshold for publication quality","D":"The model needs regularization only if test R² drops below 0.80; otherwise R² = 0.98 is reliable"},"correct":"B","explanation":{"correct":"- OLS with $p$ predictors and $n$ observations can achieve R² = 1 exactly when $p = n$ (perfect interpolation). With $p/n = 50/60 = 0.83$, the model has enormous freedom to fit noise — R² near 1 is expected even if all features are random.\n- The model has only 9 residual degrees of freedom ($n - p - 1 = 60 - 50 - 1 = 9$). This is insufficient to estimate generalization. 
The true test performance would likely show negative or near-zero R².\n- This is the near-$p = n$ regime: $p$ is still below $n$ here, but only barely. Ridge regression or feature selection is mandatory before drawing any conclusions.","A":"With $p/n$ ratio near 1, high training R² provides zero evidence of real signal. The model is fitting the sampling noise in those 60 observations. Cross-validation would expose this.","B":"","C":"There is no universal R² threshold for \"publication quality.\" This is domain-dependent, and training R² is never the relevant metric for model quality assessment.","D":"Test R² should be measured and reported regardless of the training value. The threshold for \"needing regularization\" is not 0.80 test R² — any regime where $p/n$ is high requires regularization by default."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02010","difficulty":"medium","orderIndex":10,"question":"You fit a linear regression model on a dataset where the true relationship is $y = 2x_1 + 3x_2 + \\varepsilon$. After training, you find $\\hat{\\beta}_1 = 8.4$ and $\\hat{\\beta}_2 = -3.2$, far from the true values. The model's predictions are accurate. What explains this phenomenon?","options":{"A":"OLS has a bug when the true coefficients differ by a factor of more than 2","B":"The features $x_1$ and $x_2$ are highly correlated — multicollinearity makes individual coefficient estimates unstable, but because the multicollinear predictors compensate for each other, predictions remain accurate","C":"The model converged to a local minimum in the loss landscape, missing the global solution","D":"The dataset has too few observations relative to the number of features"},"correct":"B","explanation":{"correct":"- When $x_1$ and $x_2$ are highly correlated ($x_1 \\approx x_2$), OLS cannot distinguish their individual contributions. 
Many coefficient combinations produce nearly the same predictions: $8.4 x_1 + (-3.2) x_2 \\approx 2 x_1 + 3 x_2$ when $x_1 \\approx x_2$.\n- The prediction $\\hat{y}$ is stable (low variance) even though individual coefficients swing wildly. The problem is not with predictions — it is with interpretation and stability.\n- This is why multicollinearity is an interpretability and stability problem, not (necessarily) a prediction quality problem. If you care about which feature drives the outcome, multicollinear models are uninformative.","A":"OLS has no bugs related to coefficient magnitudes. The normal equations always find the exact global minimum of the squared error loss on training data.","B":"","C":"OLS linear regression has no local minima. The loss surface is a convex quadratic bowl with exactly one global minimum, reached exactly by the closed-form solution.","D":"The question implies the dataset is adequate for fitting (predictions are accurate). Insufficient data would cause both prediction instability and coefficient instability."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02011","difficulty":"medium","orderIndex":11,"question":"Four datasets have identical summary statistics: same mean, variance, and linear regression line (same slope, intercept, and R² = 0.67). A junior analyst says \"they have the same relationship between X and Y.\" A statistician disagrees. 
What point is the statistician making?","options":{"A":"R² = 0.67 is too low to confirm any relationship between X and Y in all four datasets","B":"Identical summary statistics and R² can mask completely different underlying data distributions — the datasets may contain linear, curved, clustered, or outlier-dominated patterns that R² and the regression line cannot distinguish","C":"The statistician is wrong — identical R² and regression coefficients confirm identical relationships between X and Y","D":"Four datasets cannot have identical statistics unless they are copies of the same data"},"correct":"B","explanation":{"correct":"- This is Anscombe's Quartet: four datasets constructed to have nearly identical descriptive statistics (mean, variance, correlation, regression line) but visually completely different scatter plots. One is linear, one is curved, one has an outlier driving the line, one has a perfect linear relationship disrupted by a single outlier.\n- R², slope, and intercept are aggregate statistics that destroy distributional information. Two datasets can share the same R² and fitted line while one is linear with noise and the other follows a smooth curve.\n- This is why residual plots are mandatory: they reveal patterns (curvature, outliers, heteroscedasticity) that summary statistics hide.","A":"R² = 0.67 can represent a meaningful relationship — the adequacy threshold is domain-dependent. The issue is not whether 0.67 is enough, but that identical R² does not mean identical patterns.","B":"","C":"This is the exact misconception the question targets. Identical summary statistics do not confirm identical relationships — this is the entire lesson of Anscombe's Quartet.","D":"Anscombe's Quartet was deliberately constructed to prove this is possible. 
Datasets with identical summary statistics but different patterns are not only possible but well-known in statistics education."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02012","difficulty":"hard","orderIndex":12,"question":"A linear regression model predicts housing prices. The normal equations solution is: $\\hat{\\beta} = (X^TX)^{-1}X^Ty$. A machine learning engineer says: \"for this 500,000-row, 200-feature dataset, the normal equations are infeasible and we should use mini-batch gradient descent.\" Evaluate this claim precisely.","options":{"A":"The claim is correct — normal equations are always infeasible for datasets with more than 10,000 rows","B":"The claim is partially correct — the bottleneck is feature count, not row count; $(X^TX)$ is a $200 \\times 200$ matrix that inverts in about a millisecond, but forming $X^TX$ costs $O(n p^2)$ which for 500,000 rows and 200 features is ~20 billion operations — feasible but expensive; mini-batch gradient descent can reduce memory use, but it is optional at this scale","C":"The claim is incorrect — the normal equations are always faster than gradient descent regardless of dataset size","D":"Mini-batch gradient descent requires normally distributed features, so it is not always a valid alternative to the normal equations"},"correct":"B","explanation":{"correct":"- The normal equations require computing $X^TX$, which costs $O(n p^2)$, and then inverting a $(p \\times p)$ matrix, which costs $O(p^3)$. For $n = 500,000$ and $p = 200$: $X^TX$ computation is $500,000 \\times 200^2 = 2 \\times 10^{10}$ multiply-adds — heavy but not infeasible on modern hardware.\n- Mini-batch gradient descent processes one batch of rows at a time, so it never needs the full $X$ in memory and never forms $X^TX$. This reduces memory requirements and enables early stopping, but introduces hyperparameter tuning overhead.\n- The claim that normal equations are \"infeasible\" is an overstatement — they are feasible for 200 features. 
They become genuinely infeasible when $p$ reaches hundreds of thousands (e.g., text feature matrices).","A":"Row count primarily affects the $O(np^2)$ formation cost, not the $O(p^3)$ inversion. Millions of rows with few features is still manageable. The real threshold for infeasibility is feature count, not row count.","B":"","C":"For very large $p$ or online learning requirements, gradient descent absolutely outperforms the normal equations. There is no universal \"always faster\" claim for either method.","D":"Gradient descent makes no distributional assumptions about features. It works on any real-valued feature matrix regardless of distribution."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02013","difficulty":"hard","orderIndex":13,"question":"A team adds an engineered feature that is a linear combination of two existing features: $x_3 = 2x_1 + x_2$. They then run OLS on the full feature set $[x_1, x_2, x_3]$. What happens to the OLS solution?","codeSnippet":"X['x3'] = 2 * X['x1'] + X['x2']\nmodel = LinearRegression().fit(X[['x1', 'x2', 'x3']], y)","options":{"A":"OLS produces inflated R² because the new feature adds redundant information","B":"The feature matrix $X$ is rank-deficient — $(X^TX)$ is singular and cannot be inverted; OLS has no unique solution, and numerical implementations will return arbitrary coefficients depending on the solver","C":"OLS will assign coefficient 0 to $x_3$ since it is a linear combination of the other features","D":"Gradient descent will converge normally because it does not require matrix inversion"},"correct":"B","explanation":{"correct":"- When $x_3 = 2x_1 + x_2$, the feature matrix $X$ has linearly dependent columns. 
$X^TX$ becomes singular (determinant = 0) and is not invertible — the normal equations have no unique solution.\n- Infinitely many coefficient combinations produce the same predictions: e.g., $(\\beta_1, \\beta_2, \\beta_3) = (2, 3, 0)$ and $(\\beta_1, \\beta_2, \\beta_3) = (4, 4, -1)$ yield identical $\\hat{y}$ when $x_3 = 2x_1 + x_2$.\n- In practice, `sklearn.linear_model.LinearRegression` solves the problem with an SVD-based least-squares routine, which returns one particular solution (typically the minimum-norm one) instead of raising an error. That choice is a numerical convention, so the individual coefficients carry no interpretive meaning. Different implementations may return different coefficient values.","A":"R² inflation is a symptom of overfitting with many features, not of perfect collinearity. With perfect collinearity, the model doesn't fit \"better\" — it simply cannot identify unique coefficients.","B":"","C":"OLS does not automatically assign zero to redundant features. Driving a redundant coefficient to exactly zero is characteristic of L1-regularized regression (Lasso), not of OLS. OLS with a singular matrix returns a pseudoinverse solution, not a zero coefficient.","D":"Avoiding matrix inversion does not restore identifiability. With a suitable step size, gradient descent still converges to a minimizer of the loss, but that minimizer is not unique: the loss is flat along the null-space direction of the dependent feature combination, so the gradient is zero there and the coefficients' component in that direction stays at its initial value. The returned coefficients therefore depend on initialization."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02014","difficulty":"hard","orderIndex":14,"question":"A linear regression model on financial returns achieves high R² and low p-values on all coefficients. However, the Breusch-Pagan test for heteroscedasticity returns p = 0.0001, and a Durbin-Watson test returns 1.1. 
A quant says \"both violations together are worse than either alone.\" Explain precisely why.","options":{"A":"The violations cancel each other out — positive autocorrelation and heteroscedasticity have opposite effects on standard errors","B":"Autocorrelation reduces effective sample size, making standard errors underestimated; heteroscedasticity makes OLS standard errors incorrect; both biases act in the same direction — standard errors are doubly underestimated, making p-values appear significant when they are not","C":"Heteroscedasticity only matters with fewer than 1,000 observations; the quant is overreacting","D":"The two tests measure the same underlying violation — only one needs to be corrected"},"correct":"B","explanation":{"correct":"- Positive autocorrelation (DW = 1.1) means consecutive residuals are correlated, reducing the effective sample size below the nominal $n$. OLS standard errors assume $n$ independent observations — they are underestimated by a factor related to the autocorrelation magnitude.\n- Heteroscedasticity (BP p = 0.0001) means OLS standard errors use a wrong error variance estimate — the standard error formula $\\hat{\\sigma}^2 (X^TX)^{-1}$ assumes constant variance.\n- Both effects push standard errors downward, making t-statistics larger and p-values smaller than they should be. The combination means you may believe a coefficient is statistically significant when the true p-value, corrected for both violations, would not pass any threshold.","A":"The violations do not cancel. Positive autocorrelation and heteroscedasticity both bias standard errors downward in typical financial return applications. They compound the problem, not offset it.","B":"","C":"Heteroscedasticity matters at any sample size. With large samples, p-values become smaller (tests more powerful), making heteroscedasticity-driven false positives more, not less, likely.","D":"Autocorrelation and heteroscedasticity are distinct violations. 
The Durbin-Watson test specifically detects first-order autocorrelation in residuals; the Breusch-Pagan test detects non-constant error variance. They require separate corrections (GLS/HAC for autocorrelation, WLS or robust SEs for heteroscedasticity)."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02015","difficulty":"hard","orderIndex":15,"question":"You evaluate a fitted linear regression on held-out test data and compute R² = −0.12. A teammate says \"that's impossible — R² is a proportion and must be between 0 and 1.\" Who is correct and what does a negative R² mean?","options":{"A":"The teammate is correct — R² is always between 0 and 1 by mathematical definition","B":"You are correct — R² can be negative when evaluated on data the model was not trained on; a negative R² means the model performs worse than predicting the mean of the target for every observation, which is a meaningful and alarming signal","C":"Negative R² indicates a bug in the implementation — it should be recalculated using the absolute value","D":"Negative R² is only possible when the target variable has negative values"},"correct":"B","explanation":{"correct":"- $R^2 = 1 - \\frac{SS_{res}}{SS_{tot}}$. On training data, OLS with an intercept guarantees $SS_{res} \\leq SS_{tot}$, so $R^2 \\geq 0$. On test data, this guarantee does not hold — a poorly generalizing model can have $SS_{res} > SS_{tot}$, yielding $R^2 < 0$.\n- A negative test R² means the model is worse than the trivial baseline of always predicting $\\bar{y}$ (the mean of the targets in the evaluation set, against which $SS_{tot}$ is measured). This is a severe signal of overfitting, distribution shift, or fundamental model failure.\n- This is a critical interview point: R² between 0 and 1 is only guaranteed on training data for OLS with an intercept. On test data or for non-OLS models, any value in $(-\\infty, 1]$ is possible.","A":"The 0-to-1 guarantee holds only for OLS on training data. 
The mathematical formula $1 - SS_{res}/SS_{tot}$ can produce any value when the model was not trained to minimize this specific loss on this specific data.","B":"","C":"Negative R² is a valid, meaningful result — not a bug. Taking the absolute value would destroy the diagnostic information that the model is performing below baseline.","D":"The sign of the target variable has no bearing on R². R² is computed from the ratio of sum of squared residuals to total sum of squares, which is always non-negative regardless of target sign. The negative R² comes from the ratio exceeding 1."},"reference":"- Draper and Smith, Applied Regression Analysis: https://onlinelibrary.wiley.com/doi/book/10.1002/9781118625590"},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03001","difficulty":"easy","orderIndex":1,"question":"A logistic regression model outputs 0.73 for a new data point. A developer interprets this as \"the model is 73% confident.\" A statistician flags this interpretation as imprecise. 
What is the more rigorous interpretation, and when does \"confidence\" mislead?","options":{"A":"0.73 means the model predicts class 1 with 73% accuracy on the test set","B":"0.73 is the estimated probability that this observation belongs to class 1, under the model's assumptions — but \"confidence\" conflates probability with calibration; if the model is poorly calibrated, 0.73 may not correspond to 73% empirical frequency of class 1","C":"0.73 means 73 out of 100 features voted for class 1","D":"0.73 is the sigmoid-transformed log-loss for this specific prediction"},"correct":"B","explanation":{"correct":"- Logistic regression outputs $P(y=1 | x) = \\sigma(w^Tx + b)$ — a conditional probability estimate under the model's assumptions (linear log-odds, correct feature set, IID data).\n- \"Confidence\" implies the output is reliable, but probability outputs are only meaningful if the model is calibrated: among all predictions of 0.73, approximately 73% of the actual outcomes should be the positive class. Poorly calibrated models can output 0.73 while the true empirical frequency is 0.40.\n- In production: uncalibrated scores can cause harm in high-stakes decisions (e.g., credit, medical). Calibration is assessed with reliability diagrams and corrected with Platt scaling or isotonic regression.","A":"The output 0.73 is a probability for one specific sample, not a summary accuracy statistic for the test set. Accuracy is computed across many predictions, not from a single score.","B":"","C":"Logistic regression has no \"voting\" mechanism. It is a single linear model, not an ensemble. Voting is a concept from ensemble methods.","D":"The output is the sigmoid of the linear combination — a probability estimate. 
Log-loss is a metric computed after comparing predictions to true labels, not a raw model output."},"reference":"- Platt, \"Probabilistic Outputs for Support Vector Machines\": https://citeseerx.ist.psu.edu/doc/10.1.1.41.1639"},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03002","difficulty":"easy","orderIndex":2,"question":"You plot the sigmoid function $\\sigma(z) = \\frac{1}{1 + e^{-z}}$ for a logistic regression model. A colleague asks: \"why does logistic regression use sigmoid instead of a simple step function (0 if $z < 0$, 1 if $z \\geq 0$)?\" What is the correct explanation?","options":{"A":"The sigmoid function is faster to compute than the step function on modern hardware","B":"The step function has zero gradient almost everywhere and is discontinuous at zero, making gradient-based optimization impossible — the sigmoid provides smooth, differentiable gradients that allow learning via backpropagation","C":"The step function cannot output values between 0 and 1, so it cannot be used for probability regression","D":"The sigmoid is preferred because it always outputs exactly 0 or 1, matching binary targets"},"correct":"B","explanation":{"correct":"- The step function is non-differentiable at zero and has zero derivative everywhere else. Gradient descent requires $\\frac{\\partial L}{\\partial w}$, which flows through $\\frac{\\partial \\hat{y}}{\\partial z}$ — zero almost everywhere means no gradient signal and no learning.\n- The sigmoid $\\sigma(z)$ has derivative $\\sigma(z)(1-\\sigma(z))$, which is smooth, nonzero in $(-\\infty, +\\infty)$, and peaks at $z=0$. This allows gradient descent to adjust weights continuously.\n- This is the same reason neural networks use differentiable activations (ReLU, tanh) instead of step functions — differentiability is the prerequisite for gradient-based training.","A":"Computational speed is not the reason. Both functions are trivially fast. 
The reason is mathematical: gradient availability.","B":"","C":"The step function does output values between 0 and 1 (exactly 0 and exactly 1) — but it cannot produce intermediate values. However, the primary reason sigmoid is used is not the range; it is the differentiability needed for optimization.","D":"The sigmoid never outputs exactly 0 or 1; it is asymptotic to both extremes. As $z \\to +\\infty$, $\\sigma(z) \\to 1$, but never equals 1. This option is precisely backwards."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03003","difficulty":"easy","orderIndex":3,"question":"A logistic regression model is trained on email spam detection. The decision boundary is set at threshold 0.5 by default. In production, the cost of a false negative (spam reaching inbox) is 10× the cost of a false positive (legitimate email flagged as spam). What should the team change?","options":{"A":"Retrain the model with a different loss function","B":"Lower the classification threshold below 0.5 (e.g., 0.2) so the model flags more emails as spam, accepting more false positives to reduce false negatives — no retraining is needed","C":"Add more training data to reduce false negatives","D":"Switch to a different model architecture — logistic regression cannot handle asymmetric costs"},"correct":"B","explanation":{"correct":"- The classification threshold is a post-hoc decision boundary applied to the probability output. Lowering the threshold means any email with P(spam) > 0.2 is flagged — this catches more true positives (spam) at the cost of more false positives (legitimate emails flagged).\n- This is threshold calibration, completely separate from retraining. 
The model's learned weights and probabilities do not change.\n- The optimal threshold can be found on the validation set by computing the weighted cost: $\\text{cost} = 10 \\times FN + 1 \\times FP$, minimizing over threshold values.","A":"Retraining with a different loss function (e.g., weighted cross-entropy) is a valid approach, but it requires retraining — the question implies finding a simpler solution. Threshold adjustment achieves the goal without retraining.","B":"","C":"Adding training data improves generalization but does not specifically address asymmetric cost structure. More data would not lower the false negative rate unless paired with threshold or loss adjustment.","D":"Logistic regression handles asymmetric costs through both threshold adjustment and class-weighted training. The claim that it \"cannot handle\" asymmetric costs is false."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03004","difficulty":"easy","orderIndex":4,"question":"The log-loss (binary cross-entropy) for a logistic regression model is defined as $L = -[y\\log(\\hat{p}) + (1-y)\\log(1-\\hat{p})]$. A model outputs $\\hat{p} = 0.01$ for a true positive (y=1). How large is the resulting loss and why does this matter?","options":{"A":"The loss is 0.01 — proportional to the confidence of the wrong prediction","B":"The loss is $-\\log(0.01) \\approx 4.6$ — the log function heavily penalizes confident wrong predictions, making log-loss much more sensitive to large mispredictions than squared error","C":"The loss is 1.0 because the prediction is wrong — log-loss only outputs 0 or 1","D":"The loss is undefined because $\\log(0.01)$ requires a calculator and has no closed-form value"},"correct":"B","explanation":{"correct":"- When $y = 1$: $L = -\\log(\\hat{p})$. 
For $\\hat{p} = 0.01$: $L = -\\log(0.01) = \\log(100) \\approx 4.605$.\n- The logarithm diverges to $+\\infty$ as $\\hat{p} \\to 0$, so a confident wrong prediction (low $\\hat{p}$ for a true positive) is penalized extremely heavily. This is by design — it strongly discourages overconfident errors.\n- Compared to squared error $(y - \\hat{p})^2 = (1 - 0.01)^2 \\approx 0.98$, log-loss at 4.6 imposes about 4.7× the penalty. Squared error is bounded by 1 while log-loss is unbounded, which makes log-loss far more aggressive about punishing confident mistakes.","A":"The loss is not proportional to the raw prediction value. The logarithm creates a highly nonlinear penalty that explodes near 0 and near 1.","B":"","C":"Log-loss is a continuous function outputting any non-negative real number. It is not binary. A loss of 1.0 corresponds to $\\hat{p} = e^{-1} \\approx 0.368$, not a wrong prediction in general.","D":"$\\log(0.01)$ is perfectly computable: $\\log(0.01) = \\log(10^{-2}) = -2\\log(10) \\approx -4.605$. Log-loss is defined for all $\\hat{p} \\in (0, 1)$."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03005","difficulty":"easy","orderIndex":5,"question":"A logistic regression model is trained on linearly separable data. During training, the loss keeps decreasing but never converges. The gradient descent optimizer reports no convergence after 10,000 iterations. 
What is happening?","options":{"A":"The learning rate is too high, causing gradient explosion","B":"On linearly separable data, the optimal logistic regression solution requires infinite weights — the sigmoid can reach arbitrarily high certainty by scaling weights toward infinity, so the loss keeps decreasing forever without a finite optimum","C":"Logistic regression is not suitable for linearly separable data and should be replaced with a linear SVM","D":"The batch size is too small, causing the gradient to oscillate without converging"},"correct":"B","explanation":{"correct":"- On linearly separable data, a perfect classification exists: one weight vector correctly classifies all training points. As weights grow larger, the sigmoid pushes probabilities closer to 0 and 1, reducing log-loss further.\n- There is no finite weight vector that achieves log-loss = 0 (since $\\log(1) = 0$ requires $\\hat{p} = 1$, which requires infinite weights). The optimizer chases a loss that approaches 0 but never reaches it.\n- The fix is L2 regularization, which penalizes large weights and creates a finite optimum. Without regularization, logistic regression on separable data does not converge in the standard sense.","A":"Gradient explosion from a high learning rate would cause the loss to increase erratically, not decrease steadily. A steadily decreasing, non-converging loss indicates the mathematical non-existence of a finite optimum.","B":"","C":"Logistic regression is entirely valid for linearly separable data. The convergence failure is a mathematical property of the log-loss on separable data, not a model incompatibility.","D":"Batch size affects convergence speed and noise, but does not cause the fundamental issue here. 
Mini-batch oscillation shows non-monotone loss; this problem shows monotone decrease without convergence."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03006","difficulty":"medium","orderIndex":6,"question":"You apply L1 regularization to a logistic regression with 500 features. The resulting model has non-zero coefficients for only 30 features. You then apply L2 regularization with the same regularization strength C. What structural difference in the learned coefficients should you expect?","options":{"A":"L2 regularization will also produce exactly 30 non-zero features because it applies the same penalty magnitude","B":"L2 regularization will keep all 500 coefficients non-zero but shrink them toward zero — it does not produce sparsity because the L2 penalty's gradient never reaches zero for non-zero weights","C":"L2 regularization produces sparser models than L1 because the squared penalty removes more irrelevant features","D":"L1 and L2 regularization produce identical coefficient distributions when applied with the same C value"},"correct":"B","explanation":{"correct":"- L1 penalty adds $\\lambda |w|$ to the loss. At the optimum, the subdifferential condition allows the gradient of the data loss to exactly cancel the L1 gradient, permitting exact zero weights — this is the geometric reason L1 produces sparsity.\n- L2 penalty adds $\\lambda w^2$. The gradient of the penalty at any non-zero $w$ is $2\\lambda w \\neq 0$, which always pushes weights toward zero but never makes the optimal weight exactly zero unless the data gradient is also zero (rare in practice).\n- In feature selection contexts, L1 (Lasso) is preferred for sparsity; L2 (Ridge) is preferred when all features are expected to contribute something, or for stability under multicollinearity.","A":"The same C value does not produce the same sparsity structure. C controls penalty strength, not the penalty geometry. 
L1's diamond constraint geometry produces corners (sparse solutions); L2's circular constraint does not.","B":"","C":"This is backwards. L1 produces sparser models because of its non-differentiability at zero, which allows exact zero solutions. L2 does not produce sparsity.","D":"L1 and L2 produce structurally different coefficient distributions even at the same C. They are not equivalent — this is a foundational distinction in regularization theory."},"reference":"- Tibshirani, \"Regression Shrinkage and Selection via the Lasso\": https://www.jstor.org/stable/2346178"},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03007","difficulty":"medium","orderIndex":7,"question":"A logistic regression model achieves 0.95 AUC on a balanced binary classification task. When evaluated on a three-class problem with the same features using one-vs-rest (OvR) logistic regression, per-class AUCs are 0.91, 0.88, and 0.72. A developer says the model is failing on class 3. What should they check first?","options":{"A":"Retrain the model using softmax (multinomial) logistic regression instead of OvR","B":"Check whether class 3 is linearly separable from the other two classes in feature space — a low OvR AUC for class 3 indicates the linear decision boundary cannot adequately separate it from the combined rest, not necessarily a data or training bug","C":"Class 3 AUC of 0.72 means logistic regression is the wrong model for all three classes","D":"Increase the number of training epochs for the class 3 binary classifier"},"correct":"B","explanation":{"correct":"- OvR trains three separate binary classifiers: class 1 vs {2,3}, class 2 vs {1,3}, class 3 vs {1,2}. A low AUC for class 3 means the model struggles to distinguish class 3 from classes 1 and 2 combined.\n- The most likely explanation: class 3 overlaps in feature space with the other classes, making a linear boundary insufficient. 
This is a data geometry problem, not a training bug.\n- Next steps: visualize class 3 in PCA/t-SNE space, check feature distributions per class, or try a nonlinear model. Before switching models, confirm whether classes 1 and 2 are proxies or mixtures of class 3.","A":"Switching to multinomial (softmax) logistic regression can improve performance, but it is still a linear model — if class 3 is not linearly separable, softmax won't fix the fundamental problem.","B":"","C":"AUC of 0.72 is poor but does not invalidate the entire model. Two out of three classes perform well. The decision to switch models should be based on the specific class 3 geometry, not a blanket judgment.","D":"OvR logistic regression using `sklearn` is a convex optimization — more epochs help convergence but cannot overcome a linearly non-separable class boundary."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03008","difficulty":"medium","orderIndex":8,"question":"A logistic regression model is trained to predict loan default. The coefficient for `credit_score` is −0.032. A business analyst says \"a one-point increase in credit score reduces default probability by 3.2%.\" What is wrong with this interpretation?","options":{"A":"The interpretation is correct — coefficients in logistic regression directly represent probability changes","B":"The coefficient −0.032 represents the change in log-odds per unit increase in credit score, not probability — the actual change in probability depends on the current value of the linear combination and is nonlinear due to the sigmoid","C":"Logistic regression coefficients cannot be interpreted for individual features when there are multiple predictors","D":"The sign is wrong — a negative coefficient should increase probability, not decrease it"},"correct":"B","explanation":{"correct":"- Logistic regression models the log-odds: $\\log\\frac{p}{1-p} = w^Tx + b$. 
A coefficient of $-0.032$ means each one-unit increase in credit score decreases the log-odds of default by 0.032.\n- The change in probability is: $\\Delta p \\approx \\hat{p}(1-\\hat{p}) \\times (-0.032)$. For $\\hat{p} = 0.5$, $\\Delta p = 0.5 \\times 0.5 \\times (-0.032) = -0.008$ — about −0.8 percentage points, not −3.2.\n- Near $\\hat{p} = 0.1$, the change is $0.1 \\times 0.9 \\times (-0.032) = -0.0029$ — about −0.29 percentage points. The probability change depends on the current baseline probability, which varies across individuals.","A":"Direct coefficient-to-probability mapping is the most common logistic regression misinterpretation. Only in linear probability models do coefficients represent percentage point changes in probability.","B":"","C":"Coefficients in logistic regression are interpretable (as log-odds effects) with multiple predictors, holding other features constant — the same as in linear regression for partial effects.","D":"A negative coefficient for credit score makes intuitive sense: higher credit scores reduce default risk (lower log-odds of default). The sign direction is correct."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03009","difficulty":"medium","orderIndex":9,"question":"You train logistic regression on a dataset with 10,000 positive and 10,000 negative examples. In production, the positive-class base rate is 1% (1 in 100 transactions is positive). Your model outputs 0.6 for a new transaction. 
What adjustment is needed for the output to reflect the true production probability?","options":{"A":"No adjustment — the model's output is already calibrated for production use","B":"The model was trained on a balanced dataset but production has 1% positive rate — the model's prior is wrong; Bayes' theorem can correct the output: the true posterior probability is much lower than 0.6 given the low base rate","C":"The model should be retrained with production data to fix the calibration","D":"The threshold should be lowered to 0.1 to account for the lower positive rate in production"},"correct":"B","explanation":{"correct":"- Logistic regression implicitly encodes the training class prior into its intercept. Trained on 50% positives, the model's intercept reflects a prior of $P(y=1) = 0.5$. In production, $P(y=1) = 0.01$.\n- Using Bayes' theorem to correct, with $\\hat{p} = \\sigma(w^Tx + b)$: $P(y=1|x, \\text{prod}) = \\frac{\\hat{p} \\cdot \\pi_{\\text{prod}} / \\pi_{\\text{train}}}{\\hat{p} \\cdot \\pi_{\\text{prod}} / \\pi_{\\text{train}} + (1-\\hat{p}) \\cdot (1-\\pi_{\\text{prod}}) / (1-\\pi_{\\text{train}})}$. For $\\hat{p} = 0.6$, $\\pi_{\\text{train}} = 0.5$, and $\\pi_{\\text{prod}} = 0.01$, this gives $0.012 / 0.804 \\approx 0.015$.\n- This is a common production ML issue: models trained on resampled or balanced datasets output systematically inflated probabilities for the positive class. Intercept adjustment ($b' = b + \\log\\frac{\\pi_{\\text{prod}}}{1-\\pi_{\\text{prod}}} - \\log\\frac{\\pi_{\\text{train}}}{1-\\pi_{\\text{train}}}$) is the standard fix and is equivalent to the Bayes correction above.","A":"The model is not calibrated for production. A model trained on 50% positive data that encounters 1% positive data will output systematically inflated positive-class probabilities.","B":"","C":"Retraining on production data is valid but not the only or fastest solution. Prior correction via intercept adjustment is an analytical fix that doesn't require retraining.","D":"Adjusting the threshold changes what you classify as positive, but does not fix the probability calibration. 
The raw output of 0.6 still does not represent the true 1%-prior-adjusted probability of being positive."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03010","difficulty":"medium","orderIndex":10,"question":"A team trains logistic regression to classify customers as \"high value\" vs \"low value.\" The decision boundary in 2D feature space is a straight line. After adding a third feature `x3 = x1² + x2²`, the model's performance improves significantly. What does this tell you about the original data distribution?","options":{"A":"The original two features were irrelevant — only the new polynomial feature matters","B":"The original decision boundary required a circle (or ellipse) in the original 2D feature space — the data was not linearly separable in 2D but became linearly separable in 3D after adding the polynomial feature that captures radial structure","C":"Adding polynomial features always improves logistic regression performance","D":"The improvement proves that logistic regression is inferior to polynomial regression for classification"},"correct":"B","explanation":{"correct":"- Logistic regression always creates a linear decision boundary in the feature space it receives. If the true boundary in 2D is circular (e.g., $x_1^2 + x_2^2 = r^2$), logistic regression on raw features cannot represent it.\n- Adding $x_3 = x_1^2 + x_2^2$ allows the 3D decision boundary $w_1 x_1 + w_2 x_2 + w_3 x_3 + b = 0$ to represent a circle in the original 2D space when $w_1 = w_2 = 0$.\n- This is the kernel trick intuition: mapping to a higher-dimensional space makes nonlinearly separable data linearly separable. Logistic regression with polynomial features is equivalent to a polynomial classifier.","A":"The original two features cannot be irrelevant if the new feature (built from them) improves performance. 
The improvement comes from capturing nonlinear interactions of the original features.","B":"","C":"Adding polynomial features does not always improve performance. It increases model complexity, risk of overfitting, and multicollinearity. Improvement depends on whether the true decision boundary is nonlinear.","D":"Logistic regression with engineered polynomial features is a valid and powerful approach. It does not prove inferiority — it demonstrates that feature engineering can substitute for nonlinear models."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03011","difficulty":"medium","orderIndex":11,"question":"A logistic regression model is trained for medical diagnosis (disease vs no-disease). The model is well-calibrated — among predictions of 0.7, exactly 70% of patients have the disease. A cardiologist says the model's output can be used directly to make individual treatment decisions. A statistician disagrees. Why?","options":{"A":"The statistician is wrong — calibration means probability outputs are reliable for individual decisions","B":"Calibration is a population-level property; it says nothing about individual prediction certainty — a patient with P(disease) = 0.7 has an irreducible 30% chance of being misclassified, and no amount of calibration reduces this individual uncertainty","C":"The model needs higher AUC before its probabilities can be used for treatment decisions","D":"Medical models should always output binary classifications, not probabilities"},"correct":"B","explanation":{"correct":"- Calibration means: across all patients the model assigns P = 0.7, approximately 70% actually have the disease. 
This is a property of the group, not of the individual.\n- For any single patient at P = 0.7, we cannot say more than \"we estimate a 70% chance.\" There is irreducible uncertainty — we do not know if this individual is in the 70% or 30%.\n- Treatment decisions require integrating this probability with clinical context, cost-benefit analysis, and individual patient factors. Treating P = 0.7 as a yes/no decision without threshold analysis ignores the 30% risk of a wrong treatment.","A":"Calibration ensures the probability scale is meaningful, but it does not reduce individual uncertainty. A 70% probability still means 30% of such patients do not have the disease.","B":"","C":"AUC measures discrimination (ranking quality), not calibration. A model can have high AUC and poor calibration, or high calibration and moderate AUC. The two metrics assess different properties.","D":"Probability outputs are more informative than binary outputs for medical decisions because they preserve uncertainty information needed for risk stratification. Binary outputs discard this."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03012","difficulty":"hard","orderIndex":12,"question":"A logistic regression model trained with L2 regularization ($C = 0.01$, very strong regularization) achieves poor training accuracy on a 5-class problem. A developer increases C to 1,000,000 (essentially no regularization) and training accuracy improves to 97%. 
Which failure mode should they expect in production, and what does the very strong regularization failure reveal?","options":{"A":"C = 0.01 underfits because the regularization forces all weights to exactly zero; no regularization is always better for training accuracy","B":"Very strong regularization (small C) biases the model toward zero weights, causing underfitting — the model cannot capture the signal; very weak regularization (large C) allows overfitting; the large gap between training and production accuracy at C = 1,000,000 is overfitting, and the correct C should be found by validation","C":"L2 regularization in logistic regression should only be used for binary classification; for multiclass, it always fails","D":"Increasing C improves training accuracy without any production risk because L2 regularization only affects convergence speed"},"correct":"B","explanation":{"correct":"- In `sklearn`, $C = 1/\\lambda$ — smaller C means stronger regularization. $C = 0.01$ corresponds to very large $\\lambda$, heavily penalizing weights and forcing them near zero. The model cannot fit the data's complexity — this is underfitting (high bias).\n- $C = 1,000,000$ corresponds to effectively no regularization. On a 5-class problem, the model can overfit the training set, learning class-specific noise. 97% training accuracy with negligible regularization is a warning sign.\n- The optimal C balances bias and variance. In a classification task with validation data, grid-search C across $[0.001, 0.01, 0.1, 1, 10, 100]$ and select based on validation AUC or F1.","A":"C = 0.01 does not force all weights to exactly zero. L2 regularization smoothly shrinks weights but does not produce exact zeros (unlike L1). The model still uses all features but with small, underpowered weights.","B":"","C":"L2 regularization works correctly for multiclass logistic regression (both OvR and multinomial). 
The failure is about regularization strength, not multiclass compatibility.","D":"C directly affects the trade-off between fitting training data and preventing overfit. Claiming \"no production risk\" for unlimited C is precisely wrong — it is the definition of overfitting risk."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03013","difficulty":"hard","orderIndex":13,"question":"Two features $x_1$ and $x_2$ each have AUC = 0.85 individually for predicting a binary outcome. A logistic regression trained on both features achieves AUC = 0.81 — lower than either individual feature. What is the most likely cause?","options":{"A":"Logistic regression cannot combine two features effectively — a decision tree should be used instead","B":"The two features are highly correlated and both capture the same signal; multicollinearity causes the combined model's coefficient estimates to be unstable, and the slightly different noise in each feature's contribution degrades performance versus a single clean predictor","C":"AUC = 0.85 for individual features means each feature is overfitting — combining them reduces overfitting and 0.81 is the correct generalization performance","D":"Logistic regression with two features always performs worse than univariate models because the decision boundary requires more data to fit a 2D hyperplane"},"correct":"B","explanation":{"correct":"- When two features are near-perfect proxies for each other (high correlation), a logistic regression with both features attempts to split the signal between two collinear predictors. The individual coefficient estimates become unstable (high variance) due to multicollinearity.\n- Each feature individually uses the clean single-predictor signal. 
The combined model's instability in coefficient estimation can hurt generalization, particularly when the features contain slightly different measurement noise.\n- This is a practical case where feature selection outperforms feature stacking. The fix is to use one of the features, or apply PCA to get a single principal component capturing the shared variance.","A":"Logistic regression absolutely can combine multiple features effectively. The issue is feature correlation, not a logistic regression limitation. This would also fail for decision trees with correlated features.","B":"","C":"Individual AUC = 0.85 on a proper validation set does not indicate overfitting — it indicates discriminative power. Combining correlated features is the cause of degradation, not a sign of fixing overfitting.","D":"Logistic regression with two uncorrelated features outperforms univariate models when both features are informative. The hypothesis that \"two features always performs worse\" is false."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03014","difficulty":"hard","orderIndex":14,"question":"A logistic regression model is trained on customer transaction data to predict fraud. The positive class (fraud) is 0.1% of all transactions. The model achieves 99.95% training accuracy, 99.92% validation accuracy, and the loss curves look perfectly converged. A fraud analyst says the model is useless. 
How can the analyst be right?","codeSnippet":"from sklearn.metrics import classification_report\nprint(classification_report(y_test, model.predict(X_test)))\n# precision recall f1-score support\n# 0 1.00 1.00 1.00 99900\n# 1 0.00 0.00 0.00 100","options":{"A":"The analyst is wrong — 99.92% validation accuracy with converged loss proves the model is well-trained","B":"The model learned to predict \"not fraud\" for every transaction — the 99.9% majority class gives 99.9% accuracy trivially, and precision/recall/F1 for class 1 (fraud) are all 0.00, meaning the model never detects a single fraud case","C":"The model is overfit to the training set — the validation accuracy should be lower than the training accuracy by more","D":"The loss converging to a small value proves the model captured the signal — the analyst must be misreading the output"},"correct":"B","explanation":{"correct":"- The classification report reveals the critical truth: class 1 (fraud) has recall = 0.00, meaning no fraud case was ever correctly identified. The model outputs \"not fraud\" for every input and achieves 99.9% accuracy by exploiting class imbalance.\n- This is the classic imbalanced classification trap. The log-loss can also be low: if the model predicts P(fraud) = 0.001 for all samples, the loss is $-[0.999 \\times \\log(0.999) + 0.001 \\times \\log(0.001)] \\approx 0.008$ (natural log) — small, but with zero fraud detection.\n- Solutions: class-weighted cross-entropy, oversampling minority class (SMOTE), undersampling majority class, or using precision-recall AUC instead of accuracy and standard loss.","A":"Accuracy and loss curves are misleading on highly imbalanced data. They do not prove the model is well-trained — they prove the model learned the trivial majority-class solution.","B":"","C":"The validation accuracy is close to training accuracy (99.95% vs 99.92%), which looks like good generalization. 
The problem is not overfitting — it is that the model learned the wrong target behavior entirely.","D":"A loss that converges to a low value does not prove the model captured signal. On a 0.1% positive class, a model predicting the constant majority class achieves low cross-entropy loss because most examples are easily correct."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03015","difficulty":"hard","orderIndex":15,"question":"A team replaces logistic regression with a deep neural network for a tabular binary classification task with 15 features and 5,000 training samples. The DNN achieves 3% higher AUC on the test set. A senior engineer says \"logistic regression would have been better here.\" What is the engineer's reasoning?","options":{"A":"Deep neural networks are always worse than logistic regression on binary classification","B":"With 5,000 samples and 15 features, a deep neural network has far more parameters than training examples, leading to overfitting — logistic regression's simplicity and built-in implicit regularization (via limited capacity) is more appropriate; the 3% AUC gain may reflect test set overfitting rather than real generalization","C":"Logistic regression is always preferable for tabular data because neural networks cannot handle structured features","D":"The engineer is wrong — higher test AUC always means the DNN is genuinely better"},"correct":"B","explanation":{"correct":"- With 5,000 samples and 15 features, a DNN with 2-3 hidden layers can easily have tens of thousands of parameters (two 128-unit hidden layers already give about 18,700), exceeding the number of training examples. The risk of overfitting is high.\n- The 3% AUC improvement on a single test set may reflect noise in that particular split, or model selection against the test set, rather than real generalization. 
Cross-validated AUC would provide a more reliable comparison.\n- Logistic regression is a strong baseline for tabular data with limited samples: it has only $p+1 = 16$ parameters, is fully interpretable, and its regularization is tunable with a single hyperparameter C.","A":"There are many tasks where DNNs outperform logistic regression on tabular data, particularly with complex feature interactions. The claim \"always worse\" is false.","B":"","C":"Neural networks can handle tabular data and often do so effectively when data is abundant. The limitation is sample efficiency, not structural incompatibility.","D":"Higher AUC on a single test evaluation is not conclusive proof of better generalization. Test set overfitting (especially if the test set influenced hyperparameter choices) can inflate a single-split AUC measurement."},"reference":"- Grinsztajn et al., \"Why tree-based models still outperform deep learning on tabular data\": https://arxiv.org/abs/2207.08815"},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04001","difficulty":"easy","orderIndex":1,"question":"A decision tree splits a node containing 100 samples: 50 class A and 50 class B. After the split, the left child has 40 class A and 10 class B; the right child has 10 class A and 40 class B. 
Which impurity measure correctly identifies this as a good split, and why?","options":{"A":"Neither Gini impurity nor entropy can evaluate splits — only accuracy can determine split quality","B":"Both Gini impurity and entropy would indicate this is a good split because both children are purer than the parent — each child has a dominant class (80% majority), while the parent was maximally impure (50/50)","C":"Gini impurity would reject this split because the total samples in each child are equal","D":"Entropy would reject this split because information gain requires one class to disappear completely from a child node"},"correct":"B","explanation":{"correct":"- Parent Gini: $1 - (0.5^2 + 0.5^2) = 0.5$ (maximum impurity for 2 classes). Child Gini (left): $1 - (0.8^2 + 0.2^2) = 0.32$. Child Gini (right): $1 - (0.2^2 + 0.8^2) = 0.32$.\n- Weighted child Gini: $(50/100) \\times 0.32 + (50/100) \\times 0.32 = 0.32$. Information gain from Gini: $0.5 - 0.32 = 0.18$ — a positive improvement.\n- The same result holds for entropy. Any split that makes children purer than the parent produces positive information gain, and this split reduces impurity by 36%.","A":"Accuracy is not used as an impurity criterion during tree splitting. Gini impurity and entropy are the standard splitting criteria precisely because they measure class purity within nodes.","B":"","C":"Gini impurity is not affected by the relative size of the children — it measures class proportion within each child, not child size. The weighted average accounts for size.","D":"Information gain does not require a class to disappear. Any reduction in weighted impurity from parent to children yields positive information gain. Perfect splits (one class per child) are the maximum gain, not the minimum requirement."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04002","difficulty":"easy","orderIndex":2,"question":"A node contains 100 samples: 99 class A and 1 class B. 
Compute the Gini impurity of this node. Is this a good or bad node to split further, and why?","options":{"A":"Gini = 0.5 (maximum impurity), making this a high-priority node to split","B":"Gini ≈ 0.02 (nearly pure), making this a low-priority node — splitting it is unlikely to yield meaningful information gain and would increase tree complexity unnecessarily","C":"Gini = 1.0 for any node with two classes present","D":"The node cannot be evaluated with Gini because one class has only 1 sample"},"correct":"B","explanation":{"correct":"- Gini impurity: $1 - (0.99^2 + 0.01^2) = 1 - (0.9801 + 0.0001) = 0.0198 \\approx 0.02$.\n- A nearly pure node (0.02) has little room for improvement — any split will reduce impurity by at most 0.02, which is unlikely to justify an additional split.\n- Decision tree algorithms naturally stop splitting near-pure nodes (via min_impurity_decrease or min_samples_split parameters), preventing overfitting by memorizing individual rare samples.","A":"Gini = 0.5 represents maximum impurity (50/50 split). A 99/1 split is near-minimum impurity. These are opposite ends of the spectrum.","B":"","C":"Gini equals 0 only for a pure node (one class), and 0.5 for a 50/50 two-class node. Having two classes present does not make Gini = 1.0 — the formula depends on proportions, not presence.","D":"Gini impurity works for any class distribution regardless of sample counts. Even a node with 1 sample has a well-defined (zero) impurity."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04003","difficulty":"easy","orderIndex":3,"question":"A decision tree is trained with `max_depth=None` on a training set of 1,000 samples with 20 features. The training accuracy is 100%, but test accuracy is 62%. 
A colleague says \"just add more training data.\" What is the correct diagnosis and most direct fix?","options":{"A":"The model needs more features, not more data — 20 features are insufficient to generalize","B":"The tree is fully overfit — with max_depth=None, it memorizes each training sample by splitting until each leaf has a single sample; the most direct fix is to constrain tree depth (max_depth), minimum samples per leaf, or apply pruning","C":"100% training accuracy always indicates data leakage, not overfitting","D":"The model needs a different impurity criterion — switching from Gini to entropy would improve test accuracy"},"correct":"B","explanation":{"correct":"- Decision trees with no depth constraint will grow until every leaf contains a single training sample (or until all samples in a leaf belong to the same class). With 1,000 samples, the tree may have up to 1,000 leaves — it has memorized the training data perfectly.\n- The train-test gap (100% vs 62%) is the classic overfitting signature. The tree is fitting noise rather than signal.\n- Direct fixes: `max_depth` (limit tree height), `min_samples_leaf` (require minimum samples per leaf), `min_samples_split` (require minimum samples before splitting), or cost-complexity pruning (`ccp_alpha`).","A":"Adding features increases the risk of overfitting further — more features give the tree more dimensions to split on, worsening memorization. Restricting tree complexity, not adding features, is the fix.","B":"","C":"100% training accuracy in a complex model is a sign of overfitting, not necessarily leakage. Data leakage causes high test accuracy too, which is not the case here (62% test).","D":"Switching impurity criteria (Gini vs entropy) rarely changes model quality significantly. 
Both criteria produce similar trees, and neither addresses the overfitting caused by unbounded depth."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04004","difficulty":"easy","orderIndex":4,"question":"You train two decision trees on the same dataset: one with Gini impurity and one with entropy. Both trees achieve nearly identical test accuracy. A manager asks: \"why use entropy instead of Gini if they produce the same result?\" What is the correct explanation?","options":{"A":"Entropy is always more accurate than Gini — if they produce the same result, one implementation is wrong","B":"Gini and entropy measure similar things (node impurity) and usually produce nearly identical splits and trees — entropy is slightly more computationally expensive due to the logarithm but has a stronger information-theoretic interpretation; Gini is preferred in practice for speed","C":"Entropy should never be used for classification trees — it is only valid for regression trees","D":"The only difference between Gini and entropy is that Gini penalizes larger classes while entropy penalizes smaller classes"},"correct":"B","explanation":{"correct":"- Gini: $G = 1 - \\sum p_i^2$. Entropy: $H = -\\sum p_i \\log_2(p_i)$. Both are minimized at 0 for pure nodes and maximized at the uniform distribution. Their functional forms differ but they identify nearly the same optimal splits in practice.\n- Entropy has a direct connection to information theory (Shannon entropy) and the concept of information gain, making it preferable for theoretical analysis. Gini avoids the logarithm, making it faster to compute.\n- `sklearn`'s `DecisionTreeClassifier` uses Gini by default. For most tasks, the choice makes negligible practical difference — both should be tried in hyperparameter tuning.","A":"Producing the same result is the expected outcome, not a sign of implementation error. 
The two criteria converge on similar trees because they both maximize node purity.","B":"","C":"Entropy is valid and standard for classification trees. Regression trees use different criteria (variance reduction, MSE minimization). Entropy is not used for regression trees.","D":"Neither Gini nor entropy \"penalizes\" a class size in the way described. Both are symmetric functions of class proportions. Gini gives more weight to misclassification probability; entropy gives more weight to rare classes proportionally due to the log, but this is not a \"penalty on smaller classes.\""}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04005","difficulty":"easy","orderIndex":5,"question":"A decision tree is retrained after adding 5 new training examples. The entire tree structure changes completely — different splits, different depths, different features at each node. A developer says this is a bug. Is it?","options":{"A":"Yes — a well-trained decision tree should be stable when small amounts of data are added","B":"No — decision trees are inherently unstable (high variance); small changes to training data can cause the first split to change, which completely alters all downstream splits through a cascade effect","C":"It is a bug only if the new training examples are outliers; normal samples would not change the tree","D":"Tree instability is always caused by using Gini impurity; switching to entropy produces stable trees"},"correct":"B","explanation":{"correct":"- Decision trees are greedy algorithms: each split is chosen to maximize immediate impurity reduction without look-ahead. A small change in training data can shift which feature and threshold achieves the maximum gain at the root.\n- Since every split in the tree depends on the data reaching that node, a changed root split routes different data to subsequent nodes, changing their optimal splits too. 
The effect cascades through the entire tree.\n- This instability is one of the primary motivations for Random Forests — building many trees on bootstrapped samples and averaging their predictions reduces variance introduced by individual tree instability.","A":"Instability in decision trees is a known, documented property, not a bug. It is a consequence of greedy splitting algorithms without global optimization.","B":"","C":"Decision trees are sensitive to all samples, not just outliers. Even a few representative samples that shift a class boundary at the root can trigger full restructuring.","D":"Both Gini and entropy are greedy impurity measures and produce similarly unstable trees. The instability is inherent to the greedy tree-building process, not the choice of criterion."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04006","difficulty":"easy","orderIndex":6,"question":"A decision tree classifier is trained on a dataset with a continuous feature `age`. The tree considers thresholds at every unique value in the training data. After training, the split is \"age > 34.\" What happens to a new test sample with `age = 34.001`?","options":{"A":"The model throws an error because 34.001 was not in the training data","B":"The sample goes to the right subtree (age > 34 is True), and the tree applies the same decision to all future samples following this branch regardless of how far they are from the threshold","C":"The model interpolates between neighboring training values to handle unseen continuous values","D":"The sample goes to the left subtree because 34.001 is too close to the training threshold to be reliable"},"correct":"B","explanation":{"correct":"- Decision tree splits on continuous features are threshold comparisons: with `age = 34.001`, the test `age > 34` evaluates to True, so the sample goes right. 
The tree applies the same learned threshold to all new samples, including values never seen during training.\n- Decision trees do not interpolate, extrapolate, or compute distance to the threshold. The boundary is a hard cutoff: any value > 34 routes right, any value ≤ 34 routes left.\n- This threshold-based approach means decision trees can handle unseen continuous values that fall between training values — but they cannot extrapolate beyond the range of training data in a meaningful way (they simply apply the outermost leaf's class).","A":"Decision trees do not require test values to be present in training data. The split is a learned threshold, not a lookup table.","B":"","C":"Decision trees have no interpolation mechanism. They are piecewise constant functions of the input — within a region defined by the thresholds, all samples get the same leaf prediction.","D":"Distance to the training threshold has no role in decision tree inference. The comparison is purely `value > threshold`, with no uncertainty based on proximity."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04007","difficulty":"medium","orderIndex":7,"question":"A decision tree is trained on a classification problem with a highly imbalanced feature: `transaction_amount` has 95% of values below \\$100 and 5% above \\$10,000. Gini impurity selects this feature as the first split at threshold \\$99. 
What bias does this introduce?","options":{"A":"No bias — Gini impurity selects the optimal split regardless of feature distribution","B":"The 5% high-value transactions are almost always routed to the right subtree at depth 1, but this small node cannot be further split effectively due to low sample count — the model may have poor recall on the rare high-value segment because it gets too few samples to learn a good sub-classifier","C":"Gini impurity favors features with more unique values, so `transaction_amount` is always selected first regardless of its true predictive power","D":"The tree would fail to converge because continuous features with extreme skew cannot be split by Gini impurity"},"correct":"B","explanation":{"correct":"- Splitting at \\$99 routes 95% of samples left and 5% right. The right node has only 5% of training data — perhaps a few hundred samples. Subsequent splits on this tiny subtree have limited statistical power: each further split divides an already small set.\n- High-value transactions may have complex patterns requiring multiple splits to model correctly, but the small sample count limits tree depth before min_samples_split or min_impurity_decrease stops growth.\n- This is a well-known limitation of decision trees on imbalanced feature distributions and class-imbalanced datasets. Techniques like stratified sampling or class-weighted splitting can partially address it.","A":"Gini impurity selects the split that maximizes weighted purity reduction, which can be optimal for the majority class while being suboptimal for minority class capture. \"Optimal\" globally ≠ \"optimal for all segments.\"","B":"","C":"Gini impurity considers the impurity reduction, not the number of unique values. 
A feature with many unique values does get more split candidates evaluated, but selection is based on quality of the resulting split, not feature cardinality alone.","D":"Gini impurity works on any continuous or categorical feature regardless of distribution. Convergence is not affected by skew."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04008","difficulty":"medium","orderIndex":8,"question":"A team applies cost-complexity pruning (also called weakest-link pruning) to a fully-grown decision tree by setting `ccp_alpha = 0.05`. The pruned tree has 40% fewer nodes but the test accuracy drops only 1.2%. A stakeholder asks: \"is this a good trade-off?\" What is the correct reasoning?","options":{"A":"No — any accuracy drop from pruning means the tree is worse and pruning should not be applied","B":"Yes — pruning removes subtrees whose impurity decrease per added node is less than alpha (0.05); 40% fewer nodes means significantly less overfitting, better generalization, lower inference cost, and improved interpretability, while a 1.2% accuracy drop is likely within statistical noise","C":"Pruning is only valid for regression trees; for classification trees it always reduces accuracy too much to be useful","D":"ccp_alpha should always be set to 0 for classification tasks — any positive alpha introduces bias"},"correct":"B","explanation":{"correct":"- Cost-complexity pruning removes the subtree at each internal node where the impurity reduction per node added is below `ccp_alpha`. Setting alpha = 0.05 means only splits that provide substantial improvement are kept.\n- A 40% node reduction with only 1.2% accuracy loss is an excellent trade-off: the removed nodes were mostly fitting noise and contributed little real signal.\n- Pruned trees are more interpretable (fewer rules to explain), faster at inference (shorter paths), and often generalize better. 
The optimal alpha is found by cross-validating the pruned tree at various alpha values.","A":"Any accuracy drop from pruning does not mean the tree is worse overall. If the dropped accuracy was overfitted noise, the pruned tree generalizes better. Test set accuracy is the correct measure, and 1.2% drop is minimal.","B":"","C":"Cost-complexity pruning applies equally to classification and regression trees. It is a general pruning strategy, not regression-specific.","D":"Setting ccp_alpha = 0 means no pruning at all. Any positive alpha introduces a regularization effect that reduces overfitting — this is bias-variance trade-off management, not a defect."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04009","difficulty":"medium","orderIndex":9,"question":"A decision tree with max_depth=3 is trained on a 10-class classification problem. The tree has at most $2^3 = 8$ leaves. With 10 classes but only 8 possible leaf predictions, what happens to the two classes that are \"impossible\" to represent?","options":{"A":"The model raises an error because the number of leaves is less than the number of classes","B":"There is no guarantee which classes are represented — the 8 leaves will cover the 8 classes (or class combinations) that maximize training accuracy; minority classes or classes similar to others may never appear as leaf predictions","C":"All 10 classes are always represented because each leaf can output probability distributions over all classes","D":"The model automatically increases max_depth to 4 to accommodate all 10 classes"},"correct":"B","explanation":{"correct":"- A decision tree with max_depth=3 creates at most 8 leaf nodes. Each leaf predicts the majority class among its training samples. 
With 10 classes, at most 8 distinct class labels can appear as leaf predictions.\n- Classes that are rare, poorly separated, or similar to majority classes may never form a leaf majority — they get classified as the nearest majority class in the region.\n- This is an important depth constraint consideration for multi-class problems. Rule of thumb: max_depth should allow at least as many leaves as classes: max_depth ≥ $\\lceil \\log_2(k) \\rceil$ for $k$ classes.","A":"Decision trees do not error when leaves < classes. They silently under-represent minority classes. This silent failure mode is the dangerous aspect.","B":"","C":"Leaf predictions are based on the majority class of samples reaching that leaf, not a full probability distribution over all classes. `sklearn` can output `predict_proba`, which gives class fractions at the leaf, but the majority-class prediction may still ignore some classes.","D":"Decision trees never automatically adjust max_depth. It is a hard constraint set by the user."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04010","difficulty":"medium","orderIndex":10,"question":"You train a decision tree on a dataset where feature `A` has information gain 0.42 and feature `B` has information gain 0.38 at the root. Feature `B` is the correct causal predictor (generates the data), but feature `A` is a noisy proxy. The tree selects feature `A` first. 
What does this reveal, and why is it a problem for deployment?","options":{"A":"The tree made an error — information gain always selects the causally correct feature first","B":"Information gain is a statistical measure of correlation, not causation — feature A has higher empirical correlation with the target on this training sample; in production, if A's noise pattern changes (distribution shift), the model fails while a model built on B would remain robust","C":"The problem is resolved by increasing max_depth — more depth allows the tree to eventually use feature B","D":"This situation cannot occur — decision trees always select causal features because Gini impurity is derived from causal inference theory"},"correct":"B","explanation":{"correct":"- Greedy information gain measures which feature most reduces training set impurity. It does not distinguish between causal features and spurious correlates — a noisy proxy with slightly higher empirical correlation will always be chosen first.\n- In production, distribution shift is the real danger: if feature A's noise pattern changes (e.g., A was derived from a data pipeline that changes behavior), the model breaks. Feature B, being causal, is robust to such shifts.\n- This is a fundamental limitation of decision trees (and most ML models): they optimize for empirical fit, not causal structure. Causal reasoning requires explicit domain knowledge or causal discovery methods.","A":"Information gain has no connection to causal correctness. It measures mutual information between feature and label in the training data — a purely statistical quantity.","B":"","C":"Increasing max_depth allows more splits but does not change which feature is selected first. The root split is already feature A; B may appear lower in the tree, but the model's primary decision pathway is built on the spurious correlate.","D":"Gini impurity is a probabilistic measure from decision theory, not causal inference theory. 
Causal inference is a separate field (Pearl's do-calculus, structural equation models) with no connection to how decision trees work."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04011","difficulty":"medium","orderIndex":11,"question":"Two decision trees trained on different 80% subsets of the same dataset produce completely different structures — different root splits, different depths, different features used. A random forest uses 100 such trees. Why does averaging these unstable trees produce better results than any single tree?","options":{"A":"Averaging trees cancels out their individual errors because each tree predicts a different class, so the majority vote is always correct","B":"Each tree has high variance (unstable, overfits to its training subset) but low bias (on average, the mean prediction converges to the true class boundary); averaging reduces variance without increasing bias — this is the bias-variance decomposition of bagging","C":"Averaging trees is equivalent to training a single deep tree with more training data","D":"The 100 trees collectively remove noise from the training data before making predictions"},"correct":"B","explanation":{"correct":"- Bias-variance decomposition: a single deep tree has low bias (can fit complex boundaries) but high variance (changes dramatically with data). The expected error = bias² + variance + irreducible noise.\n- Bagging trains each tree on a bootstrap sample (random 80% with replacement). Each tree independently overfits to its sample. Averaged over 100 trees, the high-variance components cancel: $\\text{Var}(\\bar{X}) = \\frac{\\sigma^2}{n}$ for independent models.\n- In practice, trees are not fully independent (same dataset), so variance reduction is partial — but still substantial. The ensemble prediction converges to the true decision boundary as the number of trees grows.","A":"Trees don't predict different classes to \"cancel errors\" by design. 
Individual trees can all be wrong simultaneously on difficult examples. Majority vote helps because errors are only partially correlated across trees (different bootstrap samples), not because they predict opposite classes.","B":"","C":"Averaging 100 trees is not equivalent to a single deeper tree. A single tree with any depth is still a greedy, unstable estimator. The ensemble's strength comes from decorrelated estimation and variance cancellation, not deeper representation.","D":"Trees do not \"remove noise from data.\" Each tree works on a noisy bootstrap sample. Averaging reduces model variance, not data noise; irreducible noise remains."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04012","difficulty":"hard","orderIndex":12,"question":"A decision tree on a regression task (predicting house prices) has max_depth=2. The test RMSE is 45,000. Increasing max_depth to 20 drops the training RMSE to 1,200 but test RMSE increases to 71,000. Increasing max_depth to 5 gives training RMSE = 28,000 and test RMSE = 38,000. What is the precise mechanism causing test RMSE to be higher at depth=20 than at depth=2?","options":{"A":"Deep trees have more computation, which introduces floating-point rounding errors","B":"At depth=20, each leaf contains very few samples (possibly 1) and memorizes individual training prices including their noise; the leaf prediction for a test sample is the noisy price of the nearest training neighbor, not the true underlying price pattern","C":"Test RMSE increases because deeper trees use more features, causing multicollinearity","D":"The variance of predictions is lower at depth=20 because more splits produce more precise leaf boundaries"},"correct":"B","explanation":{"correct":"- Regression trees predict the mean of training samples in each leaf. 
At depth=20, leaves contain 1-2 samples — the \"mean\" is essentially the individual training price, which includes measurement noise and idiosyncratic factors.\n- For test samples that land in these leaves, the prediction is the price of a specific training house, not a generalized neighborhood price. The test error reflects both the bias of the prediction and the noise absorbed from training samples.\n- At depth=5, leaves contain more samples (perhaps 20-50), so predictions are averages that smooth out noise while still capturing meaningful price patterns.","A":"Floating-point rounding errors are negligible at the scale of RMSE differences (1,200 vs 71,000). This is not the mechanism.","B":"","C":"Decision trees do not suffer from multicollinearity in the traditional sense — each split considers one feature at a time. More depth means more splits, not more feature interaction artifacts.","D":"At depth=20, prediction variance is actually higher (high variance model), not lower. Each leaf's prediction varies widely based on which training house happened to land there. Variance increases with depth."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04013","difficulty":"hard","orderIndex":13,"question":"A decision tree is trained on a dataset with a categorical feature `country` having 150 unique values. The tree evaluator must consider all possible binary splits of 150 categories. 
How many possible binary splits exist for this feature, and what computational problem does this cause?","options":{"A":"150 splits — one per category, where the split is \"country = X\" vs \"country ≠ X\"","B":"$2^{150-1} - 1 \\approx 7 \\times 10^{44}$ possible binary splits — evaluating all subsets is computationally infeasible; implementations use heuristics (sorting categories by target mean for regression, or by class proportion for binary classification) to reduce evaluation to $O(k)$ candidates","C":"150 × 149 / 2 = 11,175 splits — all pairwise combinations of countries","D":"Only 1 split is possible — the median category by frequency"},"correct":"B","explanation":{"correct":"- A binary split on a categorical feature with $k$ values divides $k$ values into two non-empty subsets. The number of such divisions is $2^{k-1} - 1$ (dividing by 2 for symmetry). For $k = 150$: $2^{149} - 1 \\approx 7 \\times 10^{44}$.\n- Exhaustive evaluation is impossible. Practical implementations use: for regression, sort categories by mean target and evaluate $k-1$ contiguous splits; for binary classification, a similar ordering by class proportion; for multi-class, approximate methods.\n- This is why high-cardinality categorical features are problematic for decision trees and why target encoding or ordinal encoding is often applied before tree training.","A":"\"Country = X\" vs \"Country ≠ X\" gives only 150 splits (one per category), not the full set of possible groupings. This is a valid subset of splits but not all possible binary partitions.","B":"","C":"Pairwise combinations count pairs of categories, not binary partitions of the full set. This is neither the correct formula nor the computational problem.","D":"Single-median splits exist only for ordinal/continuous features. 
For categorical features, no natural ordering exists to define a \"median.\""}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04014","difficulty":"hard","orderIndex":14,"question":"A decision tree is trained on temporal data: daily sales for 3 years, predicting tomorrow's sales. A random 80/20 train/test split is used. The model achieves R² = 0.91 on the test set. When deployed, the model makes poor predictions for future months. What is wrong with the evaluation strategy?","options":{"A":"R² is not an appropriate metric for time-series regression — RMSE should be used instead","B":"Random splits on temporal data place future dates in the training set and past dates in the test set — the model \"knows\" future patterns during training; a time-based split (first 80% of dates for train, last 20% for test) would reveal the true out-of-sample performance","C":"The tree has too many leaves for time-series data — a linear model is always better for temporal prediction","D":"Decision trees require at least 5 years of training data for time-series applications"},"correct":"B","explanation":{"correct":"- Random splitting on time-ordered data breaks the temporal ordering. Training samples may include dates from year 3 while test samples include dates from year 1 — the model sees future information during training.\n- This creates temporal leakage: seasonal patterns, trends, and yearly cycles from future dates inform predictions of past dates. The model appears to generalize well because test data (past dates) is \"easier\" than true future dates.\n- The correct evaluation is a **temporal split**: with about 1,095 daily records, train on roughly the first 876 days (80%) and test on the remaining 219 or so. This simulates the production scenario of predicting future dates.","A":"R² is a valid metric for regression regardless of the data type (temporal or otherwise). 
The problem is the split strategy, not the metric choice.","B":"","C":"Decision trees can model temporal patterns when given appropriate lag features (yesterday's sales, 7-day rolling average, etc.). The issue is the evaluation strategy, not the model type.","D":"There is no universal rule requiring 5 years of data. The appropriate training window depends on the seasonality and signal in the data, not a fixed duration."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04015","difficulty":"hard","orderIndex":15,"question":"A decision tree splits on feature `income` at threshold \\$50,000 at the root. A researcher notes that after removing 3 outliers (extreme high-income cases), the root split changes to feature `age` at threshold 35. This causes the entire tree to restructure. What does this reveal about the decision tree's robustness, and how does this compare to a Random Forest's behavior?","options":{"A":"Removing 3 outliers is always data manipulation — the researcher's action invalidated both models","B":"Decision trees are sensitive to individual data points because the root split depends on maximizing information gain across all training samples; 3 extreme outliers can shift which feature achieves maximum gain at the root, cascading changes through the entire tree — Random Forests are more robust because each tree uses a bootstrap sample where outliers appear in only a subset of trees","C":"This demonstrates that `age` is the correct feature to split on — removing outliers always reveals the true underlying structure","D":"Random Forests would have the same sensitivity because they use the same splitting algorithm"},"correct":"B","explanation":{"correct":"- Extreme income outliers can disproportionately influence Gini impurity calculations: a split that isolates outliers may appear to maximize information gain at the root, even if it captures only noise.\n- Without the outliers, the underlying structure (age as 
the primary predictor) becomes dominant. This illustrates that single decision trees are highly sensitive to training data composition.\n- Random Forests: each tree is trained on a bootstrap sample (~63% of data with replacement). Outliers appear in only a subset of trees. The ensemble vote is dominated by the majority of trees that don't have outliers strongly influencing the root split, making the forest more robust.","A":"Removing outliers to understand model stability is a valid diagnostic technique, not data manipulation. Understanding how models respond to data perturbations is standard practice.","B":"","C":"The change after removing outliers doesn't prove `age` is \"correct.\" It demonstrates that the tree's structure is data-dependent — neither split is definitively \"correct\" without domain knowledge.","D":"Random Forests use the same impurity criteria but on different bootstrap samples. Outliers are diluted across the ensemble. Individual trees within a RF can still be influenced, but the ensemble vote is robust."},"reference":"- Breiman et al., \"Classification and Regression Trees (CART)\": https://www.routledge.com/Classification-and-Regression-Trees/Breiman-Friedman-Stone-Olshen/p/book/9780412048418"},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05001","difficulty":"easy","orderIndex":1,"question":"A Random Forest trains 100 trees, each on a different bootstrap sample of the training data. 
A colleague claims \"bootstrapping introduces sampling bias because each tree sees less than the full dataset.\" Is this correct, and what does bootstrapping actually achieve?","options":{"A":"The colleague is correct — bootstrapping reduces training accuracy compared to using the full dataset for each tree","B":"The colleague is incorrect about the purpose — bootstrap sampling intentionally introduces diversity (each tree sees a different data sample with replacement) to decorrelate trees, enabling variance reduction through averaging; the \"bias\" from seeing ~63% unique samples per tree is the design, not a flaw","C":"Bootstrap sampling is only used to increase the effective training dataset size, not to create diversity","D":"Each Random Forest tree uses exactly 80% of the data; bootstrapping and 80/20 splits are equivalent"},"correct":"B","explanation":{"correct":"- Bootstrap sampling draws $n$ samples with replacement from an $n$-sample dataset. Statistically, each bootstrap sample contains approximately 63.2% unique observations ($1 - 1/e \\approx 0.632$), with the rest being duplicates.\n- The purpose is **decorrelation**: if all trees trained on the same full dataset, they would be nearly identical (same dominant splits), and averaging them would achieve nothing. Bootstrap sampling forces each tree to find different patterns.\n- The remaining ~36.8% of samples not drawn (out-of-bag samples) provide a free internal validation estimate — another benefit of bootstrapping unique to Random Forests.","A":"Bootstrapping does not reduce accuracy relative to using the full dataset in expectation. Each individual tree may have slightly higher bias, but the ensemble variance reduction more than compensates, producing better generalization.","B":"","C":"Bootstrapping does not increase dataset size — it resamples the same $n$ observations. Dataset size augmentation is a different technique (data augmentation, oversampling). 
The goal is diversity, not size.","D":"Bootstrapping with replacement is not equivalent to a fixed 80% split. Bootstrap samples can contain duplicates and the unique observation proportion is approximately 63%, not 80%. The mechanism and purpose differ."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05002","difficulty":"easy","orderIndex":2,"question":"A Random Forest uses `max_features=sqrt(p)` at each split, where `p` is the total number of features. Why does this feature subsampling improve the forest over a standard bagged tree ensemble that uses all `p` features at each split?","options":{"A":"Using fewer features at each split reduces training time but always hurts accuracy","B":"Feature subsampling at each split forces trees to find the best split among a random subset of features, making trees more diverse and less correlated — without it, all trees would still choose the same dominant feature at most nodes, and their errors would not be independent","C":"`sqrt(p)` is the mathematically optimal number of features for any dataset and model size","D":"Feature subsampling is used to reduce memory usage, not to improve predictive performance"},"correct":"B","explanation":{"correct":"- Bootstrap sampling alone is insufficient to decorrelate trees. If one feature is highly predictive, all trees would choose it at the root regardless of bootstrap sample — producing nearly identical trees.\n- Feature subsampling at each node ensures that even the dominant feature is absent at some splits, forcing trees to find alternative decision boundaries. This maximally decorrelates trees.\n- The variance reduction formula for correlated trees: $\\text{Var}(\\text{avg}) = \\rho \\sigma^2 + \\frac{1-\\rho}{B}\\sigma^2$, where $\\rho$ is the pairwise tree correlation. 
Feature subsampling reduces $\\rho$, which directly reduces ensemble variance.","A":"Feature subsampling can reduce training time, but the primary purpose and empirical effect is improved accuracy through variance reduction. Single-tree accuracy may decrease slightly, but ensemble accuracy improves.","B":"","C":"`sqrt(p)` is an empirical rule-of-thumb that works well in practice (Breiman's original recommendation). It is not mathematically optimal for all cases — `max_features` is a hyperparameter that should be tuned.","D":"Memory usage is not the motivation. Feature subsampling uses less memory per split as a side effect, but the purpose is diversity and decorrelation."},"reference":"- Breiman, \"Random Forests\": https://link.springer.com/article/10.1023/A:1010933404324"},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05003","difficulty":"easy","orderIndex":3,"question":"A Random Forest is trained on 10,000 samples. Out-of-bag (OOB) error is reported as 0.18. 
A developer asks: \"do I need a separate validation set if I have OOB error?\" What is the accurate answer?","options":{"A":"No — OOB error is mathematically equivalent to a held-out test set and always replaces cross-validation","B":"OOB error provides a reliable estimate of generalization error for Random Forests specifically, because each sample is only evaluated by trees that did not see it during training — it approximates leave-one-out cross-validation and often eliminates the need for a separate validation set, but a test set is still needed for final unbiased evaluation","C":"OOB error is calculated on training data and has the same overfitting risk as training accuracy","D":"OOB error is only valid for classification; for regression, a separate validation set is always required"},"correct":"B","explanation":{"correct":"- Out-of-bag prediction for sample $i$: aggregate predictions from all trees that did not include sample $i$ in their bootstrap sample (approximately 37% of trees). This is inherently a held-out evaluation for each sample.\n- OOB error approximates leave-one-out cross-validation and is generally a reliable estimate of generalization error for Random Forests. It avoids the cost of explicit cross-validation.\n- However, OOB error is internal to the training process. For final model reporting and comparison, a completely held-out test set (never used during model selection) is still the gold standard.","A":"OOB error is not mathematically equivalent to a held-out test set. It is an approximation of cross-validation that works well in practice but is based on the same training distribution. A true test set tests on held-out data from the same population.","B":"","C":"OOB error is explicitly calculated on samples not used to train each evaluating tree — it is not training accuracy. The mechanism specifically prevents the overfitting that training accuracy suffers from.","D":"OOB error applies equally to regression Random Forests. 
For regression, OOB RMSE or R² serves as the generalization estimate, just as OOB accuracy does for classification."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05004","difficulty":"easy","orderIndex":4,"question":"A Random Forest feature importance ranks feature `X` as the most important. A data scientist removes all features except `X` and retrains a single decision tree. The decision tree performs much worse than the Random Forest with all features. What explains this paradox?","options":{"A":"Random Forest feature importances are always wrong and should not be used for feature selection","B":"Random Forest feature importance measures marginal contribution averaged across all trees and splits — it accounts for feature interactions; removing all other features changes the problem complexity and eliminates the variance reduction from the ensemble, making a single-tree comparison invalid","C":"Feature importance in Random Forest is calculated on test data, which is why it doesn't transfer to a single decision tree on training data","D":"The single decision tree performs worse because it uses more memory than the Random Forest"},"correct":"B","explanation":{"correct":"- Random Forest feature importance (Mean Decrease Impurity) measures how much feature $X$ reduces impurity on average across all nodes and trees where it is used. It captures $X$'s contribution in the context of all other features.\n- Removing all other features eliminates feature interactions — the single tree on $X$ alone may not capture the same signal that $X$ provided when combined with other features.\n- Additionally, a single decision tree has high variance. Even if $X$ is the most important feature, a single tree's performance is far inferior to the ensemble's variance reduction, independent of feature selection.","A":"Random Forest feature importances are useful and widely used. 
They have known biases (favoring high-cardinality features), but are not \"always wrong.\" Permutation importance is a more robust alternative.","B":"","C":"Feature importance is calculated on training data (Mean Decrease Impurity uses training set splits), not test data. This does not affect its interpretability relative to a single decision tree experiment.","D":"Memory usage has no effect on model performance. This option is irrelevant."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05005","difficulty":"easy","orderIndex":5,"question":"A Random Forest with 500 trees achieves test accuracy of 92%. A colleague adds 1,000 more trees (total 1,500 trees). The test accuracy is now 92.1%. The training time tripled. What general principle does this demonstrate about the number of estimators in a Random Forest?","options":{"A":"More trees always significantly improve performance; 1,500 trees should give much better results than 500","B":"Random Forest accuracy converges as the number of trees grows — after a certain point, adding more trees provides diminishing returns with negligible accuracy improvement; the optimal number of trees is found when OOB error stabilizes","C":"1,500 trees overfit the training data, which is why the accuracy increase is small","D":"Random Forests always need at least 1,000 trees to be effective; 500 trees is insufficient"},"correct":"B","explanation":{"correct":"- As the number of trees in a Random Forest grows, the ensemble prediction converges to a stable value (by the law of large numbers). The error decreases rapidly in the first 50-100 trees and flattens afterward.\n- Mathematically: each tree is a random variable with variance $\\sigma^2$. After $B$ trees, ensemble variance is approximately $\\rho\\sigma^2 + (1-\\rho)\\sigma^2/B$. As $B \\to \\infty$, the second term vanishes — the irreducible term $\\rho\\sigma^2$ is the limit.\n- The practical approach: plot OOB error vs. 
number of trees. When the curve flattens, adding more trees only costs computation without accuracy benefit.","A":"More trees do not \"always significantly improve performance.\" The improvement is largest in the first few dozen trees and negligible after convergence. This is a fundamental property of bagging.","B":"","C":"Random Forests do not overfit as trees increase. Adding more trees reduces variance and does not increase bias. The small accuracy gain is due to convergence, not overfitting.","D":"500 trees is typically well past convergence for most datasets. The optimal number is dataset-dependent and often much smaller than 500 for moderate-complexity problems."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05006","difficulty":"medium","orderIndex":6,"question":"A Random Forest reports that feature `income` has importance 0.35 (highest). After replacing `income` with two features `income_log` and `income_bracket` (an ordinal version), the new features have importances of 0.18 and 0.19. The total importance of income-related features changed from 0.35 to 0.37. A data scientist says the original feature was \"more important.\" What is the correct interpretation?","options":{"A":"The original feature was more important because 0.35 > 0.37 sum","B":"Importance values depend on feature representation — splitting one feature into two distributes the total importance across both; the combined importance (0.37) is essentially unchanged, and neither representation is inherently \"more important\"","C":"The two new features are more important because together they capture more variance","D":"Feature importances above 0.15 indicate overfitting regardless of the number of features"},"correct":"B","explanation":{"correct":"- Random Forest feature importance (Mean Decrease Impurity) measures contribution within the model structure. 
When one feature is split into two related features, the tree can use either at each node — the total signal is distributed across both, but the combined impact is similar.\n- The small difference (0.35 vs 0.37) could reflect slightly better utilization of the transformed features or noise. Neither representation is objectively \"more important\" — importance is always relative to the model architecture.\n- This is one of the known biases of MDI importance: correlated or redundant features share importance. Permutation importance better handles this by measuring actual impact on predictions when features are shuffled.","A":"Comparing a single feature's importance to the sum of two derived features is not meaningful. The 0.35 vs 0.37 difference is within noise and represents the same underlying signal.","B":"","C":"\"Capturing more variance\" is vague. The combined 0.37 is slightly higher than 0.35, but the difference is marginal. Feature importance does not directly measure explained variance in this context.","D":"There is no universal threshold at which feature importance indicates overfitting. A single dominant feature is common in many well-fitted models."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05007","difficulty":"medium","orderIndex":7,"question":"A Random Forest achieves 95% accuracy on training data and 93% on test data. A Gradient Boosted Tree achieves 97% training accuracy and 96% test accuracy. 
A stakeholder says \"always use gradient boosting — it's strictly better.\" Under what real-world conditions is Random Forest still preferred?","options":{"A":"Random Forest is always preferred because it is simpler to implement","B":"Random Forest is preferred when training speed, robustness to hyperparameter settings, built-in OOB validation, parallelizability, or interpretability of feature importance outweigh the 2-3% accuracy advantage of gradient boosting — particularly for large datasets requiring fast iteration or production systems with strict latency constraints","C":"Gradient boosting always outperforms Random Forest — the stakeholder is correct","D":"Random Forest should be used when the dataset has fewer than 1,000 samples; gradient boosting for larger datasets"},"correct":"B","explanation":{"correct":"- Random Forest trains trees in parallel (each tree is independent). Gradient Boosting trains trees sequentially (each tree depends on the previous). For large datasets or when fast training is needed, Random Forest is significantly faster.\n- Random Forest is also more robust to hyperparameter choices — the main hyperparameters (n_estimators, max_features, max_depth) have sensible defaults. Gradient Boosting requires careful tuning of learning rate, n_estimators, and max_depth together; a poorly tuned GBM can underperform Random Forest.\n- In streaming or online learning contexts, Random Forests can update more easily. For interpretability, both provide feature importance, but Random Forest's structure is more intuitive.","A":"Ease of implementation is not a decisive production factor. The question is about practical trade-offs in real systems.","B":"","C":"Gradient boosting does not universally outperform Random Forest. On noisy datasets, Random Forest's variance-averaging can match or outperform gradient boosting's bias-reduction approach.","D":"Dataset size thresholds are not the correct axis for this decision. 
Both methods work at any scale; the choice depends on the training time, tuning budget, and accuracy requirements."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05008","difficulty":"medium","orderIndex":8,"question":"A Random Forest is trained on a dataset with 3 highly predictive features and 97 noise features. `max_features=sqrt(100)=10`. A colleague says \"with 10 features at each split, we might miss all 3 important features in some nodes.\" Is this a problem?","options":{"A":"Yes — this is a critical flaw; max_features must always be set to include all important features","B":"Not a meaningful problem: any single split has only about a 27% chance of being offered at least one important feature ($1 - C(97,10)/C(100,10) \\approx 0.27$), but each tree makes many splits and the forest has many trees, so important features are still selected often and drive the model","C":"This is always a problem; max_features should be set to 100 (all features) to avoid missing any important feature","D":"The model will fail if any single tree misses all 3 important features at the root split"},"correct":"B","explanation":{"correct":"- Probability of excluding all 3 important features at one split: $P(\\text{none of 3}) = \\frac{\\binom{97}{10}}{\\binom{100}{10}} = \\frac{90 \\times 89 \\times 88}{100 \\times 99 \\times 98} \\approx 0.73$, so $P(\\text{at least one important}) \\approx 0.27$ per split. Over the hundreds of splits in each tree and the many trees in the forest, important features are still offered many times.\n- The forest is robust to missing features at individual nodes precisely because there are many nodes and trees. 
Important features statistically dominate the ensemble even when absent from some individual splits.\n- This is part of why Random Forests are robust — no single split decision is critical, and the ensemble averages over many \"views\" of the data.","A":"max_features is intentionally set below p to decorrelate trees. Including all important features at every split would make trees correlated again, defeating the purpose of the ensemble.","B":"","C":"max_features = p (all features) is equivalent to bagged trees without feature subsampling — the original Random Forest improvement over bagging. It allows dominant features to always be chosen, reducing tree diversity.","D":"Individual trees can have poor root splits and still contribute meaningfully to the ensemble when averaged. The forest is robust to individual tree imperfections."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05009","difficulty":"medium","orderIndex":9,"question":"After training a Random Forest on a customer churn dataset, you use Mean Decrease in Impurity (MDI) importance to identify the top feature as `contract_renewal_date`. This feature has very high cardinality (1,000+ unique dates). A teammate says the importance is inflated. 
Are they right?","options":{"A":"No — MDI importance correctly adjusts for feature cardinality","B":"Yes — MDI importance is biased toward features with more unique values because they offer more potential split points, giving them more opportunities to achieve high impurity reduction; this can cause low-cardinality features with true predictive power to be ranked lower","C":"High cardinality features always have lower MDI importance because fewer samples fall in each split","D":"MDI importance bias toward cardinality only occurs with Gini impurity, not with entropy"},"correct":"B","explanation":{"correct":"- MDI importance sums the impurity decrease over all nodes where a feature is used, normalized by the number of samples reaching each node. High-cardinality features create many more potential split points — they get more \"chances\" to find a split that happens to improve purity on the training data.\n- This inflates their MDI scores even when the cardinality is incidental (like a date field that correlates with data collection timing rather than true business causality).\n- The fix: use **permutation importance** instead. It measures the actual drop in model performance when a feature is randomly shuffled, which is not biased by cardinality or the number of splits.","A":"MDI does not adjust for cardinality. This is a well-documented limitation — Strobl et al. (2007) explicitly demonstrated this bias in their study of Random Forest variable importance measures.","B":"","C":"High cardinality features split data into many small groups, but the MDI is normalized by the number of samples at each node. 
The bias comes from the number of split opportunities, not from split size.","D":"The cardinality bias exists for both Gini and entropy — it is a property of how impurity is summed across splits, not of the specific impurity formula."},"reference":"- Strobl et al., \"Bias in Random Forest Variable Importance Measures\": https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-25"},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05010","difficulty":"medium","orderIndex":10,"question":"A Random Forest model is deployed in production to predict loan defaults. The model must explain its decisions to regulators under \"right to explanation\" requirements. A risk manager says \"feature importances from the forest prove that income is the primary driver.\" A compliance officer pushes back. What is the officer's valid concern?","options":{"A":"Feature importances are sufficient for regulatory explanation — the officer is wrong","B":"MDI feature importances are global aggregate measures — they describe the average contribution across all predictions, not the reason for any specific individual's prediction; regulators typically require local explanations (why this specific applicant was declined), which requires SHAP values or LIME, not global importance","C":"Feature importances from Random Forests are not allowed by any financial regulator","D":"The concern is that income is a protected attribute under fair lending laws"},"correct":"B","explanation":{"correct":"- \"Right to explanation\" requirements (e.g., GDPR Article 22, ECOA adverse action notices) require explaining individual decisions — why was this specific person denied, not \"on average, income is important.\"\n- MDI importance is a global summary statistic. 
It says nothing about a specific prediction: an individual who was denied might have been denied primarily because of their debt-to-income ratio, not income alone, even if income is globally important.\n- SHAP (SHapley Additive exPlanations) provides individual-level attributions: \"for this specific applicant, income contributed −0.32 to the prediction score.\" This satisfies the regulatory requirement for individual explanation.","A":"Global feature importances are not sufficient for individual-level regulatory explanation. This is the exact gap that explainability frameworks (SHAP, LIME) were developed to address.","B":"","C":"Financial regulators don't ban specific ML techniques. They require explainability of individual decisions. Random Forests with SHAP explanations are used in regulated financial applications.","D":"Income can be a valid (non-protected) feature in credit models. The concern is not about what income represents but about the inadequacy of global importance for individual explanations."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05011","difficulty":"hard","orderIndex":11,"question":"A Random Forest is trained on a dataset with two perfectly correlated features: `salary` and `salary_log`. After training, MDI importance shows salary = 0.31 and salary_log = 0.09, summing to 0.40. A single decision tree on only `salary` shows importance 0.38. 
What does the discrepancy reveal?","options":{"A":"The Random Forest computed incorrect importances — correlated features should always have equal importance","B":"When two correlated features are present, Random Forest distributes importance between them depending on which one a particular tree's bootstrap sample and feature subset chooses; neither importance value reflects the true contribution of the shared signal, and the sum (0.40) is closer to but still not equal to the single-feature importance (0.38) due to redundancy effects","C":"`salary_log` should have zero importance because it is derived from `salary`","D":"The correlation between salary and salary_log caused the Random Forest to overfit, which is why importances are distorted"},"correct":"B","explanation":{"correct":"- Perfectly correlated features share the same predictive signal. When both are present, the forest randomly selects between them at each node (depending on which appears in the feature subset). Importance is split roughly by how often each appears in the random feature subsets.\n- The distribution is not equal because log transformation changes the feature scale — `salary_log` may be a better split point for some data ranges, so it captures more importance in those nodes.\n- The true signal contribution is best estimated by: either using only one of the features, applying permutation importance (which accounts for the correlated pair jointly), or using SHAP values.","A":"MDI importance has no requirement for equal importance on correlated features. The distribution depends on random feature subsets, data ranges, and scale — equal importance would be coincidental.","B":"","C":"Derived features are not automatically assigned zero importance. `salary_log` captures different split-point relationships with the target than raw `salary` — it may improve certain splits even though both contain the same information.","D":"Correlated features do not cause overfitting by themselves. 
Overfitting is related to model complexity (tree depth) and noise in training data, not feature correlation."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05012","difficulty":"hard","orderIndex":12,"question":"A Random Forest achieves OOB error of 0.12 on a training dataset. A team member uses the entire training data (including OOB samples) to tune `max_features` and `n_estimators` by minimizing OOB error. They report OOB error of 0.08 after tuning. A statistician says this OOB estimate is now biased. Why?","options":{"A":"The statistician is wrong — OOB error is always unbiased regardless of how it is used","B":"OOB error is unbiased as a generalization estimate for a fixed model, but when used as the criterion for hyperparameter selection, it becomes a selection criterion — repeated model selection based on OOB error creates the same overfitting-to-the-metric problem as using a validation set for selection; the OOB error no longer represents an unbiased future performance estimate","C":"OOB error only becomes biased if the number of trees exceeds 200","D":"The bias occurs because max_features and n_estimators are too important to tune using OOB error"},"correct":"B","explanation":{"correct":"- OOB error is an unbiased estimate for any single Random Forest with fixed hyperparameters. Each sample is evaluated only by trees that didn't use it — a legitimate holdout.\n- However, when you run multiple hyperparameter configurations and select the one with the lowest OOB error, you are effectively searching for a configuration that happens to perform well on these specific OOB splits. This is the same overfitting problem as using a validation set for model selection.\n- The fix: use nested cross-validation or a held-out test set for final evaluation after OOB-based hyperparameter tuning. The reported OOB of 0.08 is optimistic.","A":"OOB error is unbiased for a fixed, pre-specified model. 
Once it becomes a selection criterion across multiple models, the bias introduced by selection is indistinguishable from validation-set overfitting.","B":"","C":"The number of trees does not determine OOB bias. More trees actually make OOB estimates more stable (less variance), not more biased.","D":"max_features and n_estimators are legitimate hyperparameters to tune. The problem is the selection mechanism using the same metric as evaluation, not the choice of which hyperparameters to tune."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05013","difficulty":"hard","orderIndex":13,"question":"A Random Forest is trained on a dataset where the positive class (5% of data) is rare. OOB accuracy is 95.2%. The team declares the model ready for deployment. A fraud analyst runs the model and finds it never flags any fraud. What went wrong, and how should this be diagnosed?","options":{"A":"95.2% OOB accuracy proves the model works correctly — the analyst must be testing on different data","B":"The Random Forest learned to predict the majority class (no-fraud) for all samples — OOB accuracy of 95.2% is achievable with a zero-rule classifier on 5% positive data; the correct diagnostic is OOB precision/recall or F1 on the positive class","C":"Random Forests cannot handle class imbalance — a different algorithm must always be used","D":"OOB accuracy is unreliable for imbalanced datasets because it oversamples minority classes"},"correct":"B","explanation":{"correct":"- With 5% positive class, a model predicting \"no fraud\" for every sample achieves 95% accuracy — close to the reported 95.2%. The Random Forest likely learned to do exactly this because the majority class is overwhelmingly represented in each bootstrap sample.\n- OOB accuracy hides this: a 95.2% baseline is trivial on a 5% positive class. 
The correct metrics are OOB recall on the positive class (likely 0%), OOB F1-score on the positive class, or OOB precision-recall AUC.\n- Solutions: class-weighted Random Forest (`class_weight='balanced'`), oversample minority class before bootstrapping, or evaluate using appropriate imbalanced-class metrics from the start.","A":"High accuracy on imbalanced data is not evidence of a working model. This is the same imbalanced data trap from the ML Fundamentals topic, now appearing in the Random Forest context.","B":"","C":"Random Forests can handle class imbalance with the `class_weight` parameter. The problem is not the algorithm but the evaluation metric and default behavior on imbalanced data.","D":"OOB sampling is not biased toward minority classes — it reflects the same imbalanced distribution as the training data. The issue is the accuracy metric, not the OOB sampling mechanism."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05014","difficulty":"hard","orderIndex":14,"question":"You train a Random Forest with `n_estimators=500`, `max_depth=None`, `max_features=sqrt(p)`. The training accuracy is 100%, but OOB error is 0.15. A colleague says \"the model is overfit — reduce max_depth.\" A senior engineer disagrees. 
Who is right and why?","options":{"A":"The colleague is correct — 100% training accuracy always means the model is overfit","B":"The senior engineer is correct — individual trees in a Random Forest are intentionally grown to full depth (overfit to their bootstrap samples); the ensemble's OOB error of 0.15 is the relevant generalization measure; the 100% training accuracy is expected and irrelevant for Random Forests","C":"Both are wrong — the correct fix is to reduce n_estimators, not max_depth","D":"The senior engineer is wrong — max_depth=None always causes overfitting that cannot be corrected by the ensemble"},"correct":"B","explanation":{"correct":"- Random Forest's design principle: each individual tree is grown deep (often to full depth) so it has low bias. The ensemble variance reduction (through averaging) compensates for each tree's high variance.\n- Training accuracy is measured by running the whole forest on the training set. Each training sample is in-bag for roughly 63% of the trees, and those fully grown trees have memorized it, so the majority vote is almost always correct. Near-100% training accuracy is expected and part of the design, not a bug.\n- OOB error is the correct generalization estimate. 0.15 means 85% OOB accuracy, which is meaningful. Reducing max_depth would increase individual tree bias without necessarily improving OOB error — and might hurt it if the signal requires deep trees.","A":"100% training accuracy in an ensemble context is not a sign of overfitting in the same sense as for a single model. Each tree overfits to its bootstrap sample, but the ensemble does not — this is by design.","B":"","C":"Reducing n_estimators would reduce computational cost and potentially increase variance (fewer trees). It is not the right response to suspected overfitting in a Random Forest.","D":"max_depth=None (full depth) is the standard and often optimal choice for Random Forests. The ensemble averaging handles the variance of deep trees. 
Constraining max_depth can help in some cases but is not universally necessary."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05015","difficulty":"hard","orderIndex":15,"question":"A team uses a Random Forest for a real-time fraud detection system that must respond in under 10ms. The forest has 500 trees of average depth 20. Each tree evaluation traverses up to 20 nodes. Profiling shows inference takes 45ms — too slow. What are the most effective latency optimizations specific to Random Forest inference?","options":{"A":"Retrain the model with fewer features to reduce tree size","B":"Reduce n_estimators (e.g., 100 trees vs 500), reduce max_depth (shallower trees), or apply post-training tree pruning — these directly reduce the number of node evaluations per prediction; alternatively, export trees to optimized formats (ONNX, PMML) or compile trees to native code for 5-10x speedup","C":"Switch to a GPU for inference — Random Forests are GPU-accelerated by default","D":"Add more RAM to the inference server — the slowness is caused by cache misses on tree structures"},"correct":"B","explanation":{"correct":"- Inference latency in a Random Forest is roughly proportional to `n_estimators × average_depth`, because each prediction follows one root-to-leaf path per tree and evaluates one node per level. Reducing either factor directly reduces wall-clock time.\n- Practical options: `n_estimators=100` (80% reduction in trees, often with <2% accuracy loss if the forest was converged at 500); `max_depth=10` (halves tree traversal); compiling trees to native conditionals with `treelite`, or converting the model to ONNX for inference with `ONNX Runtime`.\n- Another approach: early exit (predict with fewer trees if confidence is high) or quantized trees that use integer comparisons instead of floating-point.","A":"Reducing features changes the model's predictive capacity and requires retraining. It indirectly reduces tree depth (fewer splits) but is not targeted at latency. 
Reducing n_estimators and max_depth is more direct.","B":"","C":"Standard Random Forest implementations (sklearn) are CPU-bound. GPU acceleration requires specialized libraries (RAPIDS cuML). It is not available by default and requires infrastructure changes.","D":"Cache misses can be a factor for very large forests, but the primary bottleneck is node evaluation count. Reducing tree size addresses both the traversal cost and the memory access pattern."},"reference":"- Breiman, \"Random Forests\" (original paper): https://link.springer.com/article/10.1023/A:1010933404324\n- treelite for fast RF inference: https://treelite.readthedocs.io/"},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06001","difficulty":"easy","orderIndex":1,"question":"A gradient boosting model trains 100 trees sequentially. The first tree predicts house prices and makes residual errors. The second tree is trained on those residuals. A developer says \"the second tree is trying to correct the first tree's mistakes.\" Is this accurate and complete?","options":{"A":"Accurate and complete — gradient boosting is simply a sequential error-correction process","B":"Accurate in spirit but incomplete — the second tree fits the negative gradient of the loss function with respect to the current ensemble prediction, which equals the residuals for MSE loss but differs for other losses (e.g., log-loss, MAE); the general mechanism is functional gradient descent in prediction space, not just error correction","C":"Inaccurate — the second tree is trained on a weighted version of the original labels, not on residuals","D":"Accurate and complete, but only when using MSE loss; for other losses, gradient boosting uses a different approach altogether"},"correct":"B","explanation":{"correct":"- For MSE loss ($L = \\frac{1}{2}(y - F(x))^2$), the negative gradient is $y - F(x)$ — exactly the residual. 
So for MSE, \"fitting residuals\" is literally what gradient boosting does.\n- For other losses, the negative gradient is different: for MAE, it is the sign of the residual; for log-loss, it is $y - \\sigma(F(x))$. In each case, the new tree fits the negative gradient of the loss, not the raw residual.\n- Gradient boosting is best understood as gradient descent in function space: each tree is a step in the direction that reduces the loss most, just as SGD takes a step in the direction of the negative gradient in parameter space.","A":"\"Error correction\" is an intuitive description but misses the generalization to non-MSE losses. The general mechanism is loss-function gradient fitting, which becomes residual fitting only as a special case.","B":"","C":"AdaBoost uses sample-weighted loss (the confusion here). Gradient Boosting fits pseudo-residuals (negative gradients) on the original data, not weighted labels.","D":"Gradient boosting uses different loss functions natively — the framework handles any differentiable loss. The question correctly distinguishes MSE residuals from the general gradient."},"reference":"- Friedman, \"Greedy Function Approximation: A Gradient Boosting Machine\": https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-function-approximation-a-gradient-boosting-machine/10.1214/aos/1013203451.full"},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06002","difficulty":"easy","orderIndex":2,"question":"A gradient boosting model is trained with `learning_rate=1.0` and `n_estimators=100`. The training loss decreases to near-zero quickly, but the test loss starts increasing after 20 trees. A colleague reduces `learning_rate` to 0.01 and increases `n_estimators` to 1000. 
What is the effect of this change?","options":{"A":"The change is redundant — learning rate and n_estimators compensate each other perfectly at any value","B":"Reducing learning rate makes each tree contribute a smaller step toward the residual (shrinkage), requiring more trees to achieve the same training loss but generally producing a more regularized, better-generalizing model — however, training time increases proportionally with n_estimators","C":"Reducing learning rate always increases test loss because the model learns slower","D":"`learning_rate=0.01` is below the minimum effective threshold and will cause the model to make no progress at all"},"correct":"B","explanation":{"correct":"- The update at each step: $F_m(x) = F_{m-1}(x) + \\eta \\cdot h_m(x)$, where $\\eta$ is the learning rate and $h_m$ is the new tree. A smaller $\\eta$ means each tree contributes less — the model takes smaller gradient steps.\n- Smaller learning rates act as regularization: the model is more conservative about committing to any single tree's prediction, reducing the risk of overfitting to training noise.\n- The empirical trade-off: `learning_rate=0.01` with `n_estimators=1000` often outperforms `learning_rate=1.0` with `n_estimators=100`, but takes 10× longer to train. Early stopping on a validation set avoids the need to manually set n_estimators.","A":"Learning rate and n_estimators are not perfectly compensating. A high learning rate with many trees can still overfit differently than a low learning rate. The regularization effect of small learning rate is not reducible to fewer trees.","B":"","C":"A lower learning rate does not increase test loss — it reduces overfitting by shrinking individual tree contributions. The test loss may temporarily appear to decrease more slowly during training but typically achieves a lower minimum.","D":"Learning rate of 0.01 makes the model progress slowly but steadily. 
There is no minimum threshold below which progress stops — gradient descent with small steps still converges."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06003","difficulty":"easy","orderIndex":3,"question":"A gradient boosting model overfits — training AUC is 0.99 but validation AUC is 0.78. Which hyperparameter combination would most directly reduce overfitting?","options":{"A":"Increase n_estimators from 100 to 500 and keep max_depth=6","B":"Reduce max_depth from 6 to 3, reduce learning_rate from 0.1 to 0.05, and enable early stopping on the validation set — these simultaneously reduce individual tree complexity, shrink the gradient steps, and stop training before validation performance degrades","C":"Increase subsample from 0.8 to 1.0 to ensure all training data is used","D":"Switch from a regression loss to a classification loss — the wrong loss function caused the overfit"},"correct":"B","explanation":{"correct":"- The 0.21 AUC gap is severe overfitting. Three complementary mechanisms address it: `max_depth` reduces tree complexity (each tree captures less noise); `learning_rate` shrinks the contribution of each noisy tree; early stopping terminates training precisely when validation performance peaks.\n- `subsample < 1.0` (stochastic gradient boosting) adds noise to each tree's training, acting as regularization. Increasing subsample to 1.0 removes this regularization and typically worsens overfitting.\n- In practice, the hyperparameter search should be: fix a low learning_rate (0.05), use early stopping with a validation set, and tune max_depth and subsample.","A":"More trees at the same depth and learning rate will increase overfitting, not reduce it. More estimators means more steps in the (already overfitting) direction of the training data.","B":"","C":"Increasing subsample to 1.0 removes a regularization mechanism. 
Stochastic subsampling introduces variance in each tree that reduces overfitting.","D":"Loss function mismatch (regression vs classification) would cause invalid predictions, not overfitting. The loss type is separate from regularization. Also, AUC is a classification metric — the model is already using classification loss."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06004","difficulty":"easy","orderIndex":4,"question":"Gradient Boosting builds trees sequentially while Random Forest builds trees in parallel. What does this fundamental difference imply about their training time scaling with n_estimators?","options":{"A":"Both scale identically with n_estimators — the sequential vs parallel distinction only affects code structure, not runtime","B":"Random Forest training time is nearly constant (parallel trees can run simultaneously on multiple cores), while Gradient Boosting training time scales linearly with n_estimators because each tree requires the previous tree's predictions before it can begin","C":"Gradient Boosting is faster because sequential processing allows it to skip unnecessary trees early","D":"Random Forest always takes longer because bootstrap sampling adds overhead that sequential processing avoids"},"correct":"B","explanation":{"correct":"- Random Forest: trees are fully independent. With $k$ CPU cores, training time $\\approx n\\_estimators / k$ per core — near-constant with enough cores.\n- Gradient Boosting: tree $m+1$ requires computing pseudo-residuals using tree $m$'s predictions. This creates a sequential dependency that cannot be parallelized across trees. 
Training time is strictly $O(n\\_estimators)$.\n- Within-tree parallelism (evaluating different feature splits in parallel) can speed up individual tree training, which is how XGBoost achieves speed despite sequential tree building.","A":"Sequential dependency in GBM vs parallel independence in RF creates real, practical runtime differences. On 16 cores, an RF can train 16 trees simultaneously while GBM must wait for each tree to complete.","B":"","C":"Gradient boosting doesn't skip trees — it always trains the specified number. Early stopping terminates training when validation performance plateaus, but this is an optional add-on, not an inherent speed advantage.","D":"Bootstrap sampling overhead is trivial compared to the actual tree-building cost. Random Forest's parallelism advantage outweighs this overhead significantly."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06005","difficulty":"easy","orderIndex":5,"question":"XGBoost was introduced in 2016 as an improvement over standard gradient boosting. One key difference is that XGBoost adds regularization terms directly to the objective function. 
What does this change compared to scikit-learn's `GradientBoostingClassifier`?","options":{"A":"XGBoost adds dropout to the trees, similar to neural network dropout","B":"XGBoost's objective includes L1 (alpha) and L2 (lambda) penalties on leaf weights, controlling both sparsity and magnitude of tree leaf scores — this makes overfitting control more explicit and tunable compared to sklearn's GBM which only indirectly regularizes through tree depth and learning rate","C":"XGBoost adds regularization by randomly removing features during training, like feature subsampling in Random Forest","D":"XGBoost's regularization is equivalent to setting max_depth=3 in sklearn's GradientBoostingClassifier"},"correct":"B","explanation":{"correct":"- XGBoost's objective: $\\text{Obj} = \\sum L(y_i, \\hat{y}_i) + \\sum_k [\\gamma T_k + \\frac{1}{2}\\lambda \\|w_k\\|^2 + \\alpha \\|w_k\\|_1]$ where $T_k$ is the number of leaves, $w_k$ are leaf weights, $\\lambda$ is L2, $\\alpha$ is L1, $\\gamma$ is leaf penalty.\n- `gamma` penalizes tree complexity (number of leaves), `lambda` penalizes large leaf weights, and `alpha` promotes sparse leaf weights. These give fine-grained control over overfitting.\n- sklearn's GBM controls regularization mainly through `max_depth`, `min_samples_split`, and `learning_rate` — effective but less direct than XGBoost's explicit weight regularization.","A":"XGBoost has a technique called DART (Dropouts meet Multiple Additive Regression Trees) which applies dropout to trees, but this is separate from the standard regularization terms. Standard XGBoost does not use dropout by default.","B":"","C":"Feature subsampling (`colsample_bytree`) exists in XGBoost but is a separate hyperparameter from the regularization terms (alpha, lambda). XGBoost has both, but they are distinct mechanisms.","D":"Regularization is a functional change to the optimization objective, not equivalent to depth limiting. 
Regularized shallow trees behave differently from deep trees cut at max_depth=3."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06006","difficulty":"medium","orderIndex":6,"question":"LightGBM uses a \"leaf-wise\" tree growth strategy while XGBoost (by default) uses \"level-wise\" growth. Both reach the same `max_depth`. On the same dataset, LightGBM trains 5× faster. What is the structural difference and the associated risk?","options":{"A":"LightGBM grows each level before going deeper, while XGBoost grows the deepest path first — this makes LightGBM faster but less accurate","B":"LightGBM grows the leaf with the highest loss reduction at each step (leaf-wise), regardless of level — this creates asymmetric trees that can model complex patterns with fewer nodes, but can overfit more on small datasets because a single deep branch may memorize specific training patterns","C":"Level-wise and leaf-wise growth are equivalent for trees of the same max_depth — the speed difference is purely due to LightGBM's implementation optimizations, not the growth strategy","D":"LightGBM uses gradient-based one-side sampling, which is completely unrelated to leaf-wise growth — the speed gain comes entirely from sampling"},"correct":"B","explanation":{"correct":"- Level-wise (XGBoost default): all nodes at depth $d$ are split before any node at depth $d+1$. Depth $d$ holds up to $2^d$ nodes, so the tree stays balanced.\n- Leaf-wise (LightGBM): at each step, the single leaf with the highest gain is split, regardless of which level it's on. This produces a deeper, asymmetric tree that achieves lower loss per split operation.\n- Leaf-wise growth uses fewer total splits to reach the same loss level, making LightGBM faster. The risk: on small datasets, repeatedly growing the same branch can overfit to a small subset of training samples in that branch.","A":"The description is backwards. 
LightGBM grows leaf-wise (deepest path first by gain), not level-first. XGBoost default is level-wise (breadth-first). The relationship between strategy and accuracy is also not simply \"level-wise = more accurate.\"","B":"","C":"The growth strategy is a fundamental difference, not just implementation. Level-wise and leaf-wise produce structurally different trees even at the same max_depth, because they prioritize different splits.","D":"Gradient-based one-side sampling (GOSS) is an additional LightGBM optimization that speeds up gradient computation. It contributes to speed but the question specifically asks about the leaf-wise growth mechanism."},"reference":"- LightGBM paper: https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html"},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06007","difficulty":"medium","orderIndex":7,"question":"A gradient boosting model uses early stopping: training stops when validation loss does not improve for 50 rounds. After training, the team reports the model at the stopping point has `n_estimators=147`. A data scientist then retrains the model on the full training + validation data using `n_estimators=147`. The production AUC drops. What went wrong?","options":{"A":"The full training data caused the model to overfit because it has more samples","B":"Early stopping determined the optimal n_estimators for the train/validation split — adding validation data to the training set changes the loss landscape, so the same n_estimators is no longer optimal; the model now underfits because it should train for more iterations with more data","C":"The model should be retrained using the validation loss at n_estimators=147 as the new early stopping target","D":"Gradient boosting models always perform worse when trained on more data"},"correct":"B","explanation":{"correct":"- Early stopping finds the optimal n_estimators for a specific train/val split. 
When the validation set is added to training, the total dataset is larger, so each tree's gradient estimates are less noisy and the ensemble can keep improving for more rounds before it starts fitting noise.\n- With more training data, the optimal number of trees is typically higher (the model needs more iterations to converge to the same minimum). Using the old n_estimators=147 stops training too early on the larger dataset.\n- The correct approach: retrain on full data, use the early stopping n_estimators as a lower bound, and either run early stopping on a held-out subset or multiply n_estimators by a factor (e.g., 1.1-1.3×) to account for the larger dataset.","A":"More training data generally improves generalization, not causes overfitting. Overfitting from adding data would be unusual and would manifest differently.","B":"","C":"Using the validation loss as a new early stopping target is not a standard or valid technique. The validation loss from the previous split is not directly comparable to a future run's loss trajectory.","D":"Gradient boosting (and all supervised models) generally benefit from more training data. Saying \"always performs worse with more data\" is categorically incorrect."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06008","difficulty":"medium","orderIndex":8,"question":"CatBoost introduces a technique called \"ordered boosting\" to address a specific problem present in standard gradient boosting. 
What problem does ordered boosting solve, and why does it matter for large datasets?","options":{"A":"Ordered boosting solves the sequential training bottleneck by allowing trees to be built in parallel","B":"Standard gradient boosting computes pseudo-residuals using the same training samples that will train the next tree, causing \"target leakage\" — the residuals are biased because the tree fits its own prediction errors; ordered boosting uses a subset of preceding data to compute residuals for each sample, preventing this bias","C":"Ordered boosting solves the class imbalance problem by ordering samples by class frequency","D":"Ordered boosting eliminates the need for cross-validation by using temporal ordering of data"},"correct":"B","explanation":{"correct":"- In standard gradient boosting, tree $m$ is trained on pseudo-residuals computed from the current ensemble $F_{m-1}$. The issue: $F_{m-1}$ was itself trained on the same samples, creating a subtle overfitting bias in the residual computation (a form of in-sample contamination).\n- CatBoost's ordered boosting: for each training sample $i$, pseudo-residuals are computed using a model trained only on samples with indices less than $i$ (an artificial temporal ordering). This ensures residuals for sample $i$ are computed from a model that has never seen $i$.\n- This is analogous to online learning evaluation and reduces the gradient estimation bias, particularly impactful on smaller datasets where in-sample contamination is more severe.","A":"Ordered boosting does not parallelize tree training. Gradient boosting remains sequential by nature. CatBoost achieves speed through other optimizations (symmetric trees, GPU training).","B":"","C":"Ordered boosting has nothing to do with class imbalance. The \"ordering\" refers to a permutation of training samples used for unbiased gradient estimation, not class ordering.","D":"Ordered boosting uses an artificial ordering of samples, not real temporal ordering. 
It is not a cross-validation replacement. CatBoost provides built-in cross-validation separately."},"reference":"- Prokhorenkova et al., \"CatBoost: unbiased boosting with categorical features\": https://arxiv.org/abs/1706.09516"},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06009","difficulty":"medium","orderIndex":9,"question":"A gradient boosting model is trained with `max_depth=6` and achieves good performance. Reducing `max_depth` to 2 while keeping all other hyperparameters the same drops training accuracy significantly. A teammate says \"deeper trees are always better for gradient boosting.\" When is shallow max_depth actually preferred?","options":{"A":"Shallow trees are only used for computational budget constraints, never for accuracy","B":"In gradient boosting, each tree corrects the residual of the previous ensemble — for smooth, low-noise targets with primarily additive feature effects, shallow trees (depth 2-4) capture each correction efficiently; deep trees overfit to individual samples in the residual, accumulating noise across iterations; shallow trees are the canonical choice for regression on tabular data with noise","C":"max_depth=2 is always optimal for gradient boosting because it avoids overfitting in all scenarios","D":"Shallow trees are better only when n_estimators is less than 50"},"correct":"B","explanation":{"correct":"- Gradient boosting residuals are noisy versions of the target (after partial fitting). Deep trees can memorize noise in the residuals — since residuals become smaller and noisier as boosting progresses, deep trees in later rounds fit noise more than signal.\n- Shallow trees (depth 2-3, also called \"stumps\" at depth 1) fit smooth corrections that capture the main signal without noise memorization. 
They require more trees (n_estimators) but generalize better with a suitable learning rate.\n- For tasks with strong high-order feature interactions (e.g., complex genomics data), deeper trees (5-8) may genuinely be needed. The choice is always dataset-dependent.","A":"Shallow trees are preferred for accuracy on many tabular regression tasks, not just for speed. The regularization effect of shallow trees is a feature, not a constraint.","B":"","C":"max_depth=2 is not universally optimal. Tasks with complex high-order interactions may genuinely require deeper trees. \"Always optimal\" claims for any hyperparameter value are false.","D":"The relationship between optimal depth and n_estimators is not a simple threshold. Both interact as regularization levers — lower depth requires more trees to compensate."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06010","difficulty":"medium","orderIndex":10,"question":"XGBoost has a `subsample` parameter and a `colsample_bytree` parameter, both defaulting to 1.0. A team that had tuned both below 1.0 resets them to 1.0 to \"use all available data and features.\" Training loss improves but validation loss worsens. 
What regularization mechanism did they disable?","options":{"A":"They disabled early stopping by setting subsample=1.0","B":"Both parameters implement stochastic sampling that introduces randomness into each tree — subsample randomly samples training rows per tree (stochastic gradient boosting), and colsample_bytree randomly samples features per tree; removing both makes each tree deterministic and highly correlated, increasing ensemble variance and overfitting","C":"Setting subsample=1.0 is only harmful for datasets with fewer than 1,000 samples; for larger datasets, it has no effect","D":"colsample_bytree=1.0 is the correct setting; only subsample below 1.0 helps with regularization"},"correct":"B","explanation":{"correct":"- `subsample < 1.0`: each tree is trained on a random row subset — Friedman's stochastic gradient boosting. This introduces variance in each tree, improving ensemble diversity and acting as regularization.\n- `colsample_bytree < 1.0`: each tree uses a random feature subset (similar to Random Forest feature subsampling). This decorrelates trees and prevents dominant features from dominating every tree.\n- Both mechanisms reduce tree correlation and introduce beneficial randomness. Using both at 1.0 makes trees more similar to each other (high correlation) and more prone to overfitting the full training distribution.","A":"Subsample has no relationship to early stopping. Early stopping is controlled separately by the `early_stopping_rounds` parameter and a validation dataset.","B":"","C":"Stochastic sampling benefits apply at any dataset size. On larger datasets, each subsample captures the distribution well even at 0.8. On smaller datasets, the effect is more pronounced.","D":"Both subsample and colsample_bytree contribute to regularization. Neither is exclusively responsible. 
Using both below 1.0 provides complementary regularization — colsample at the feature level and subsample at the sample level."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06011","difficulty":"hard","orderIndex":11,"question":"A scikit-learn gradient boosting model with default hyperparameters achieves 0.94 AUC on a fraud detection task. The team then trains XGBoost with default hyperparameters on the same dataset and gets 0.91 AUC. LightGBM with defaults achieves 0.93 AUC. A manager concludes \"sklearn GBM is better.\" What is wrong with this comparison?","options":{"A":"The comparison is valid — default hyperparameters are the standard benchmark","B":"Each library's default hyperparameters are chosen by its maintainers for general use, not for this dataset; a fair comparison requires hyperparameter tuning for each method with the same validation strategy — comparing defaults is comparing how well each library's defaults happen to fit this specific dataset, not the models' inherent capabilities","C":"XGBoost and LightGBM are faster implementations of the same algorithm, so they should always match sklearn GBM accuracy","D":"The manager is correct because a 0.03 AUC difference is statistically significant proof of superiority"},"correct":"B","explanation":{"correct":"- sklearn's GBM defaults (`max_depth=3`, `learning_rate=0.1`, `n_estimators=100`) may happen to fit this dataset's complexity well. XGBoost and LightGBM defaults are set for general robustness, not this specific dataset.\n- A proper comparison requires: same cross-validation strategy, same metric, independent hyperparameter tuning for each method with the same computational budget, and statistical significance testing.\n- In practice, XGBoost and LightGBM often match or outperform sklearn GBM after proper tuning due to explicit regularization terms, faster training, and more sophisticated tree-building algorithms.","A":"Default hyperparameters are starting points, not optimal configurations. 
Comparing defaults tells you which library's defaults happen to work, not which algorithm is better.","B":"","C":"XGBoost and LightGBM are not identical to sklearn GBM. They use different regularization (XGBoost), different tree growth strategies (LightGBM leaf-wise), and different categorical handling (CatBoost). Accuracy differences on specific datasets are expected.","D":"A 0.03 AUC difference may or may not be statistically significant depending on dataset size and variance. Without confidence intervals or paired statistical tests, no significance claim is valid."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06012","difficulty":"hard","orderIndex":12,"question":"A gradient boosting model is trained to predict customer lifetime value (a regression task). After 200 rounds, the training RMSE is 42 and validation RMSE is 89. Applying L2 regularization (`reg_lambda=10`) reduces validation RMSE to 71. Adding `reg_alpha=5` (L1) further reduces it to 65. Explain precisely what each regularization term is doing to the tree structures.","options":{"A":"L2 regularization reduces the number of leaves; L1 regularization reduces the learning rate","B":"L2 (`reg_lambda`) penalizes large leaf weight magnitudes by adding $\\frac{1}{2}\\lambda w^2$ per leaf to the objective, shrinking all weights toward zero; L1 (`reg_alpha`) adds $\\alpha |w|$ per leaf, which can produce exact zero weights for leaves that don't improve the objective sufficiently, effectively pruning those leaves","C":"L2 reduces max_depth; L1 increases the number of trees required","D":"Both L1 and L2 act identically on gradient boosting trees — they are redundant and only one should be used"},"correct":"B","explanation":{"correct":"- XGBoost's optimal leaf weight for a leaf $j$: $w_j^* = -\\frac{G_j}{H_j + \\lambda}$ where $G_j$ is the sum of gradients and $H_j$ is the sum of Hessians at that leaf. 
L2 ($\\lambda$) appears in the denominator, shrinking the leaf weight magnitude.\n- L1 ($\\alpha$) adds $\\alpha |w_j|$ to the objective. The optimal weight with both: $w_j^* = -\\frac{\\text{sign}(G_j)\\,\\max(|G_j| - \\alpha, 0)}{H_j + \\lambda}$, a soft-thresholding of $G_j$, so $|G_j| \\leq \\alpha$ results in zero weight. This zeros out leaves where the gradient signal is below the L1 threshold — equivalent to pruning.\n- Combined: L2 shrinks all leaf weights smoothly, L1 zeros out weak leaves entirely. The combination provides both magnitude control and tree sparsification.","A":"L2 does not reduce leaf count; it shrinks weights. L1 can effectively prune leaves (by zeroing them), but L1 is not a learning rate modifier.","B":"","C":"Neither L1 nor L2 directly controls max_depth. Depth is a structural hyperparameter. The number of trees needed is affected by regularization (stronger regularization may require more trees to compensate for smaller steps), but L1 doesn't specifically \"increase the number of trees required\".","D":"L1 and L2 are not identical. L1 produces sparsity (exact zero leaf weights); L2 produces smooth weight shrinkage. They serve complementary purposes and can be used simultaneously."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06013","difficulty":"hard","orderIndex":13,"question":"A team trains gradient boosting with `n_estimators=1000` and `early_stopping_rounds=50` on a train/validation split. Training stops at round 634. They then evaluate on a test set and report AUC from round 634. 
A colleague says \"using the test set for evaluation is fine since we didn't use it for early stopping.\" Is the colleague correct?","options":{"A":"The colleague is correct — early stopping used the validation set; the test set was not involved and remains unbiased","B":"The colleague is correct only if the test set was also not used for any hyperparameter tuning during the experiment — if `n_estimators`, `learning_rate`, or `max_depth` were tuned by observing test set metrics at any point, the test set is no longer an unbiased estimator","C":"The colleague is wrong — early stopping on any split contaminated the test set","D":"Test set evaluation is always valid because test sets are by definition never used for training"},"correct":"B","explanation":{"correct":"- Early stopping used only the validation set — the test set is correctly isolated from this specific decision. The colleague is technically right about early stopping specifically.\n- However, the colleague's claim has an important conditional: if at any point the team looked at test set performance to decide n_estimators range, learning_rate, or max_depth, those decisions implicitly used test set information. This is the classic model selection contamination.\n- In a clean experiment: hyperparameters are tuned using the validation set (or cross-validation), early stopping uses the validation set, and the test set is evaluated exactly once at the very end. If this protocol was followed, the test evaluation is unbiased.","A":"This is conditionally correct but incomplete. The colleague is right about early stopping specifically, but not necessarily about the broader experimental context.","B":"","C":"Early stopping on the validation set does not contaminate the test set. These are separate held-out sets with different roles. The test set contamination happens only through direct use in decision-making.","D":"Test sets can be contaminated even without direct training use. 
Using test set metrics to guide model selection (choosing which experiment to report) is a form of test set leakage — \"reporting the lucky run.\""}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06014","difficulty":"hard","orderIndex":14,"question":"LightGBM offers \"Gradient-based One-Side Sampling\" (GOSS) to speed up training. This means it keeps all high-gradient samples but randomly drops a fraction of low-gradient samples. What trade-off does this introduce compared to full-data gradient boosting?","options":{"A":"GOSS eliminates small-gradient samples permanently, causing permanent information loss","B":"GOSS reduces the effective training set size per iteration, which speeds computation but skews the sample toward high-gradient instances, so LightGBM compensates by upweighting retained low-gradient samples by a factor $(1-a)/b$ to approximate the full gradient; the trade-off is speed vs slight gradient estimation variance","C":"GOSS only drops duplicate samples, so there is no information loss","D":"GOSS is equivalent to mini-batch gradient descent and introduces the same variance as stochastic training in neural networks"},"correct":"B","explanation":{"correct":"- GOSS splits samples into the top-$a$ fraction by gradient magnitude (kept) and the bottom $(1-a)$ fraction (from which a random subset equal to a $b$ fraction of the full dataset is drawn). The sampled low-gradient instances are upweighted by $(1-a)/b$ to keep the gradient-sum estimate approximately unbiased.\n- This preserves the main gradient signal (high-gradient samples drive learning) while approximating the low-gradient contribution through sampling + upweighting.\n- The trade-off: each iteration processes fewer samples (faster), but the gradient estimate has higher variance than full-batch computation. 
The upweighting introduces a stochastic approximation that works well in practice but adds noise to the optimization path.","A":"GOSS does not permanently drop low-gradient samples. The sampling is redone at each boosting round, so all samples have a chance to appear in any given round.","B":"","C":"GOSS explicitly drops samples based on gradient magnitude, not duplication. It introduces intentional statistical sampling with compensation, not deduplication.","D":"GOSS samples by gradient magnitude (not uniformly at random), which is fundamentally different from SGD mini-batch sampling. The weighting mechanism also has no analog in standard SGD."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06015","difficulty":"hard","orderIndex":15,"question":"You train gradient boosting for binary classification. After 500 trees, you extract the leaf scores from all trees and discover the raw output for a specific sample is +3.2 (the sum of leaf values). A junior developer converts this to a class prediction by applying `round(3.2)` and reports class 1. What is wrong with this approach?","codeSnippet":"raw_score = 3.2 # sum of leaf values from all trees\nprediction = round(raw_score) # Developer's approach: 3.2 → 3, not binary!","options":{"A":"round(3.2) = 3, which is outside {0, 1} — the developer is using the wrong transformation; for binary classification, raw GBM scores must be passed through a sigmoid to get probabilities, then thresholded","B":"The raw score of 3.2 is the correct final prediction — no transformation is needed for binary classification","C":"The raw score should be divided by 500 (number of trees) before rounding","D":"Gradient boosting raw scores are always between -1 and 1, so 3.2 is a model implementation error"},"correct":"A","explanation":{"correct":"- Gradient boosting for binary classification outputs raw scores (log-odds) that can be any real number. 
`round(3.2) = 3`, which is not a valid binary class label.\n- The correct transformation: $p = \\sigma(\\text{raw\\_score}) = \\frac{1}{1 + e^{-3.2}} \\approx 0.961$. Then apply a threshold (default 0.5): predict class 1 if $p > 0.5$.\n- `predict()` in sklearn, XGBoost, and LightGBM handles this automatically. Using raw leaf sum scores directly requires manual sigmoid transformation.","A":"","B":"The raw log-odds score is an intermediate value, not the final prediction. Binary classification requires converting log-odds to probability then applying a threshold.","C":"Dividing by 500 averages the leaf scores — this is not the correct transformation. The raw score is a log-odds value that must be passed through sigmoid, not averaged.","D":"Gradient boosting raw scores are unbounded — they grow as more trees are added and can take any real value. A score of 3.2 is normal and expected for a model with high confidence in class 1."},"reference":"- XGBoost documentation on output types: https://xgboost.readthedocs.io/en/stable/tutorials/model.html\n- Chen and Guestrin, \"XGBoost: A Scalable Tree Boosting System\": https://arxiv.org/abs/1603.02754"},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07001","difficulty":"easy","orderIndex":1,"question":"An SVM with a linear kernel is trained on a binary classification problem. After training, you discover that removing 90% of training samples — those far from the decision boundary — does not change the model's predictions at all. 
What does this reveal about how SVMs define their decision boundary?","options":{"A":"SVMs ignore 90% of training data by design — they only use randomly selected samples","B":"The SVM decision boundary is defined entirely by the support vectors — the subset of training samples closest to the hyperplane; all other points do not affect the boundary position or margin","C":"The removed samples were duplicates, which is why removing them had no effect","D":"A linear kernel SVM only uses the first and last 10% of training samples sorted by feature value"},"correct":"B","explanation":{"correct":"- An SVM finds the maximum-margin hyperplane defined by: $\\min \\frac{1}{2}\\|w\\|^2$ subject to $y_i(w^Tx_i + b) \\geq 1$ for all $i$. The Karush-Kuhn-Tucker conditions show that only samples with $\\alpha_i > 0$ (non-zero dual weights) contribute to the solution — these are exactly the support vectors.\n- Support vectors are the training points that lie on the margin boundary ($y_i(w^Tx_i + b) = 1$) or inside the margin (for soft-margin SVMs). All other points are correctly classified with margin > 1, so $\\alpha_i = 0$ and they play no role.\n- This is a key SVM property: the model is sparse in training samples. This also means the trained SVM can be serialized as just the support vectors and their weights, regardless of total training set size.","A":"SVMs do not randomly ignore data — they systematically identify which samples define the optimal boundary (support vectors). All samples are considered during optimization; only non-support-vectors end up with zero weight.","B":"","C":"The samples were not duplicates — they were simply non-support vectors. They happened to lie far enough from the margin that they don't constrain the optimal hyperplane.","D":"SVMs have no concept of sorting samples by feature value. 
The support vectors are determined by geometry (proximity to the hyperplane), not by feature rank."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07002","difficulty":"easy","orderIndex":2,"question":"A hard-margin SVM is trained on a 2D linearly separable dataset. The margin is defined as $\\frac{2}{\\|w\\|}$. A colleague asks: \"why do we maximize the margin instead of just finding any hyperplane that separates the classes?\" What is the SVM's answer?","options":{"A":"Maximizing the margin is computationally cheaper than finding any separating hyperplane","B":"Among all hyperplanes that separate the classes, the maximum-margin hyperplane generalizes best to new data — Vapnik's structural risk minimization theory shows that larger margins correspond to lower VC dimension and better generalization bounds","C":"Maximizing the margin ensures the decision boundary passes through the center of the dataset","D":"Any separating hyperplane works equally well — margin maximization is an aesthetic choice, not a mathematical one"},"correct":"B","explanation":{"correct":"- For linearly separable data, infinitely many separating hyperplanes exist. The maximum-margin hyperplane is the one with the largest \"buffer zone\" between the closest points of each class.\n- Intuitively: a larger margin means the model is less sensitive to small perturbations in input — a test point must move further to cross the boundary. This equates to better robustness to noise and better generalization.\n- Formally: the VC dimension of a linear classifier with margin $\\gamma$ on data in a ball of radius $R$ is bounded by $\\min(R^2/\\gamma^2, d) + 1$. Larger margins reduce the effective VC dimension, tightening the generalization bound.","A":"Hard-margin SVM optimization is a convex quadratic program — not computationally cheaper than finding any separating hyperplane. 
The computational justification is backwards.","B":"","C":"The maximum-margin hyperplane is equidistant from the nearest points of each class, but it does not pass through the dataset center. These are different geometric concepts.","D":"All separating hyperplanes are not equal. The maximum-margin hyperplane has provably better generalization properties under statistical learning theory. This is a foundational, not aesthetic, result."},"reference":"- Vapnik, \"The Nature of Statistical Learning Theory\": https://link.springer.com/book/10.1007/978-1-4757-3264-1"},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07003","difficulty":"easy","orderIndex":3,"question":"An SVM with a linear kernel cannot classify XOR data (two classes arranged in a checkerboard pattern). A colleague adds an RBF kernel and the model classifies the same data perfectly. What did the RBF kernel actually do?","options":{"A":"The RBF kernel applied a smoothing operation to the training data, removing the XOR pattern","B":"The RBF kernel implicitly maps the data to an infinite-dimensional feature space where the classes become linearly separable — the kernel function $K(x_i, x_j) = e^{-\\gamma\\|x_i - x_j\\|^2}$ computes dot products in that high-dimensional space without explicitly constructing it","C":"The RBF kernel rotated the coordinate axes to align with the XOR pattern, making it linearly separable in 2D","D":"The RBF kernel is equivalent to polynomial degree 2, which adds $x_1^2, x_2^2, x_1 x_2$ features that linearize XOR"},"correct":"B","explanation":{"correct":"- The kernel trick: instead of explicitly transforming features $\\phi(x)$ and computing $\\phi(x_i) \\cdot \\phi(x_j)$, we compute $K(x_i, x_j) = \\phi(x_i) \\cdot \\phi(x_j)$ directly in the original space.\n- The RBF kernel corresponds to an infinite-dimensional feature map (Mercer's theorem). 
In this infinite-dimensional space, the XOR pattern becomes linearly separable — a hyperplane exists that separates the two classes.\n- The SVM only computes kernel values (dot products between training pairs), never explicitly constructing the infinite-dimensional features. This is the computational elegance of the kernel trick.","A":"The RBF kernel does not smooth or modify the data points. It defines a similarity measure between pairs of points used in the SVM dual formulation.","B":"","C":"Rotation cannot make XOR linearly separable in 2D — no 2D rotation transforms a checkerboard into two half-planes. The transformation requires a higher-dimensional space.","D":"The RBF kernel is not equivalent to degree-2 polynomial. The polynomial kernel $K(x_i, x_j) = (x_i \\cdot x_j + c)^d$ is a different kernel corresponding to a finite-dimensional feature map. RBF is fundamentally different — infinite-dimensional."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07004","difficulty":"easy","orderIndex":4,"question":"A soft-margin SVM has a hyperparameter C. A team trains two models: SVM-A with C=0.001 and SVM-B with C=1000. SVM-A has a wider margin but more misclassified training points. SVM-B has a narrower margin with fewer training errors. 
Which model likely generalizes better on noisy test data?","options":{"A":"SVM-B always generalizes better because fewer training errors means better fit","B":"SVM-A likely generalizes better on noisy data — large C forces the SVM to minimize training errors aggressively, creating a narrow margin that overfits to noisy points; small C tolerates training errors in exchange for a wider, more robust margin","C":"Both models are equivalent — C only affects training speed, not the decision boundary","D":"SVM-A generalizes better because wider margins always produce lower test error regardless of data noise"},"correct":"B","explanation":{"correct":"- The soft-margin SVM objective: $\\min \\frac{1}{2}\\|w\\|^2 + C\\sum \\xi_i$ where $\\xi_i$ are slack variables for margin violations. C is the regularization hyperparameter: small C emphasizes maximizing margin (tolerating violations), large C emphasizes minimizing violations (potentially sacrificing margin).\n- On noisy data, individual mislabeled points or outliers are close to the true boundary. Large C forces the SVM to classify these noisy points correctly, distorting the boundary toward noise.\n- Small C produces a wider margin that \"ignores\" noisy points at the cost of some training errors — more robust to noise and outliers.","A":"Fewer training errors do not imply better generalization, especially on noisy data. This is the fundamental bias-variance trade-off: SVM-B has lower bias but higher variance.","B":"","C":"C fundamentally changes the decision boundary by altering the balance between margin width and violation penalty. This is not a computational parameter — it shapes the learned model.","D":"\"Wider margins always produce lower test error\" is too strong. On data with no noise, the maximum-margin boundary is optimal. 
On noisy data, the relationship depends on the noise magnitude relative to the margin."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07005","difficulty":"easy","orderIndex":5,"question":"An SVM is trained on a dataset with 1 million samples. Training takes 8 hours. A colleague says \"SVMs scale quadratically or worse with the number of samples — this is expected.\" Is this claim accurate?","options":{"A":"The claim is false — SVMs always train in O(n log n) time, similar to sorting algorithms","B":"The claim is accurate — standard SVM solvers (QP solvers for the dual problem) have complexity between O(n²) and O(n³) in the number of training samples; for 1 million samples, this makes exact kernel SVM training extremely expensive without specialized algorithms","C":"The claim is only true for RBF kernels; linear SVMs always train in O(n) time","D":"The quadratic scaling applies to the number of features, not samples; for 1 million samples, training always finishes quickly"},"correct":"B","explanation":{"correct":"- The SVM dual problem requires solving a QP over $n$ variables (one per training sample). The kernel matrix $K$ is $n \\times n$ — for 1 million samples, this is $10^{12}$ entries, consuming terabytes of memory if stored densely.\n- Generic QP solvers have $O(n^3)$ complexity. Decomposition methods like SMO (Sequential Minimal Optimization) bring this down to roughly $O(n^2)$ in practice, which is still very expensive at 1M samples without further approximation.\n- Practical alternatives for large datasets: LinearSVC (LIBLINEAR-based, roughly $O(n)$), SGD-based SVM via `sklearn.linear_model.SGDClassifier`, or approximate kernel methods (Nyström approximation, random features).","A":"O(n log n) scaling is for sorting algorithms, not SVMs. 
Kernel SVM complexity is dominated by the QP solver, which scales roughly quadratically or worse with n.","B":"","C":"Linear SVMs can be trained in roughly $O(n)$ time (e.g., LIBLINEAR), but this is a special case. Exact nonlinear kernel SVMs have no linear-time solvers.","D":"SVM complexity scales with samples, not just features. The kernel matrix dimensionality is $n \\times n$ (samples × samples), making sample count the binding constraint."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07006","difficulty":"medium","orderIndex":6,"question":"An SVM with an RBF kernel has hyperparameter gamma (γ). Training with γ=0.001 gives a smooth, slightly underfitting decision boundary. Training with γ=100 gives a highly irregular boundary that perfectly fits training data but fails on test data. What does γ control geometrically?","options":{"A":"γ controls the number of support vectors — higher γ uses more support vectors","B":"γ controls the \"reach\" of each training sample's influence: high γ means each sample only influences the boundary in a tiny local neighborhood (rough, complex boundary), while low γ means each sample influences a large region (smooth, broader boundary)","C":"γ controls the margin width — higher γ produces wider margins and better generalization","D":"γ is the learning rate for the SVM optimizer — higher γ converges faster but can overshoot"},"correct":"B","explanation":{"correct":"- RBF kernel: $K(x_i, x_j) = e^{-\\gamma \\|x_i - x_j\\|^2}$. For high γ: $e^{-\\gamma \\|x_i - x_j\\|^2}$ decays very rapidly with distance, so only very close neighbors have non-negligible similarity. Each training point only influences its immediate neighborhood.\n- For low γ: the kernel decays slowly — each training point influences a broad region. 
The resulting boundary is smooth and nearly linear in the limit $\\gamma \\to 0$.\n- High γ produces decision boundaries that wrap tightly around each training cluster, memorizing noise. Low γ produces broad boundaries that may miss fine-grained class structure.","A":"γ does not directly control the number of support vectors. More support vectors often appear with high γ (complex boundary needs more anchor points), but this is a consequence, not the mechanism.","B":"","C":"γ controls kernel bandwidth, not margin width. The margin is controlled by C. Higher γ actually tends to reduce the effective margin by creating locally complex boundaries.","D":"γ is not a learning rate. SVM training with kernels is a convex QP problem — it has no learning rate in the gradient descent sense. The optimization always converges to the global optimum."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07007","difficulty":"medium","orderIndex":7,"question":"An SVM is trained on text classification with a linear kernel. The training set has 50,000 documents represented as TF-IDF vectors with 50,000 features (sparse). 
A colleague recommends switching to an RBF kernel for \"better performance.\" Why might this advice be wrong?","options":{"A":"RBF kernels cannot handle text data at all","B":"Text data in high-dimensional sparse TF-IDF space is often nearly linearly separable — a linear SVM can achieve near-optimal performance with much lower computational cost than RBF; the RBF kernel requires computing $O(n^2)$ pairwise kernel values between 50,000-dimensional vectors, which is computationally expensive and may not improve accuracy","C":"Linear SVMs always outperform RBF SVMs on all tasks","D":"The advice is wrong because TF-IDF features should always use polynomial kernels, not RBF"},"correct":"B","explanation":{"correct":"- In high-dimensional sparse feature spaces (like bag-of-words or TF-IDF), the data is often linearly separable or nearly so by the \"blessing of dimensionality.\" A linear SVM in 50,000 dimensions has enormous flexibility.\n- RBF kernel with 50,000-dimensional TF-IDF vectors: the kernel $e^{-\\gamma\\|x_i - x_j\\|^2}$ depends on Euclidean distance in 50,000 dimensions. For sparse vectors, most document pairs share few terms, so pairwise distances tend to be similar to one another, making the kernel less discriminative than in low-dimensional spaces.\n- Linear SVMs for text are well-established (LIBLINEAR) and are strong baselines on many text classification tasks. The RBF kernel adds computational cost, often without a corresponding accuracy benefit.","A":"RBF kernels work mathematically on any numeric vectors, including text TF-IDF. The issue is practical performance and computational cost, not theoretical incompatibility.","B":"","C":"Linear SVMs don't always outperform RBF — for low-dimensional data with nonlinear boundaries, RBF is clearly better. The claim is specifically about high-dimensional sparse text data.","D":"Polynomial kernels for text are not a standard recommendation. 
Linear kernels are the standard for text; the choice between polynomial and RBF is specific to the data geometry, not feature type."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07008","difficulty":"medium","orderIndex":8,"question":"A polynomial kernel SVM with degree=3 is trained on a binary classification task. The kernel is $K(x_i, x_j) = (x_i \\cdot x_j + 1)^3$. A developer wants to explicitly create the polynomial feature expansion and train a linear SVM on those features. For input dimension d=100, how many features would this explicit expansion have?","options":{"A":"300 features (d × degree = 100 × 3)","B":"$\\binom{d + 3}{3} = \\binom{103}{3} = 176,851$ features, counting all monomials up to degree 3 — the explicit feature space is enormous, making the kernel trick computationally essential","C":"1,000,000 features (d³ = 100³)","D":"The number of features stays at 100 — polynomial kernels do not create new features"},"correct":"B","explanation":{"correct":"- A degree-$p$ polynomial feature expansion of $d$-dimensional data creates all monomials up to degree $p$: $x_1^{a_1} x_2^{a_2} \\cdots x_d^{a_d}$ where $\\sum a_i \\leq p$. The count is $\\binom{d+p}{p}$.\n- For $d=100, p=3$: $\\binom{103}{3} = \\frac{103 \\times 102 \\times 101}{6} = 176,851$ features.\n- The kernel trick avoids constructing these 176,851 features explicitly. Instead, $(x_i \\cdot x_j + 1)^3$ computes the dot product in this space using only a simple formula on the original 100-dimensional vectors.","A":"$d \\times \\text{degree}$ only accounts for linear terms multiplied by degree — it doesn't count cross-terms ($x_1 x_2 x_3$) or higher-order monomials ($x_1^2 x_2$). The correct count is combinatorial, not multiplicative.","B":"","C":"$d^p = 100^3 = 1,000,000$ overcounts because it includes all ordered products with repetition. 
The polynomial feature expansion uses unordered monomials, which is a smaller count.","D":"Polynomial kernels implicitly compute dot products in the higher-dimensional space. The features conceptually exist in that space — they just aren't materialized when using the kernel trick."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07009","difficulty":"medium","orderIndex":9,"question":"A team builds an SVM to classify genomic sequences. The dataset has 500 training samples and 20,000 features (gene expression values). After scaling features to [0,1], an RBF SVM with cross-validated C and gamma achieves 0.91 AUC. A deep neural network achieves only 0.83 AUC. Why might SVM outperform the neural network here?","options":{"A":"SVMs always outperform neural networks on biological data","B":"With 500 samples and 20,000 features, a neural network has millions of parameters and severely overfits; the SVM's kernel-based approach effectively operates in a high-dimensional feature space with structural risk minimization, requiring far fewer effective parameters relative to the margin geometry","C":"Deep neural networks cannot process genomic data — they require image inputs","D":"The SVM is faster to train, so it converges to the global optimum while the neural network gets stuck in a local minimum"},"correct":"B","explanation":{"correct":"- The $n/p$ ratio here is $500/20,000 = 0.025$ — far fewer samples than features. A neural network with even a modest hidden layer (e.g., 128 neurons) has $20,000 \\times 128 = 2,560,000$ parameters vs. 500 training samples. Severe overfitting is almost guaranteed.\n- An SVM's effective capacity is controlled by the margin width and the number of support vectors, not the feature dimensionality. 
In high-dimensional settings, SVMs often remain well-regularized because the maximum-margin solution has large margin relative to the feature space volume.\n- This is precisely the scenario where SVMs were dominant before deep learning became prevalent: high-dimensional, low-sample genomic, text, and image data.","A":"SVMs do not always outperform neural networks on biological data. With sufficient data (thousands of labeled examples), neural networks typically win. The key condition is the $n/p$ ratio.","B":"","C":"Neural networks can process any numeric feature vector, including gene expression data. This claim is false.","D":"SVM training is a convex optimization with a unique global optimum — convergence is guaranteed regardless of speed. Neural networks have non-convex loss but can still converge to good local minima with proper initialization. Speed is not the reason for the accuracy difference."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07010","difficulty":"medium","orderIndex":10,"question":"An SVM's dual formulation gives the decision function as $f(x) = \\sum_{i \\in SV} \\alpha_i y_i K(x_i, x) + b$. After training, there are 850 support vectors out of 10,000 training samples. 
A data scientist says \"fewer support vectors means a better model.\" Is this correct?","options":{"A":"Correct — the number of support vectors is inversely proportional to test accuracy","B":"Partially correct but oversimplified — fewer support vectors generally indicate a simpler, more generalizable decision boundary (larger margin), but the optimal number depends on the true data complexity; the count alone cannot confirm a good fit and must be checked against validation performance","C":"Support vector count has no relationship to model quality","D":"Exactly 50% of training samples should be support vectors for an optimal SVM"},"correct":"B","explanation":{"correct":"- An upper bound on SVM generalization error relates to the expected leave-one-out error: $E[\\text{LOO error}] \\leq E[\\text{number of support vectors}] / n$. Fewer support vectors → lower LOO error upper bound → better expected generalization.\n- However, this bound is loose, and the count is not a quality measure on its own. With a very small C (heavy regularization), the margin widens, many points fall on or inside it and become support vectors, and the model may underfit (too simple to capture the true boundary).\n- The optimal C (and hence optimal support vector count) should be found by cross-validation. The support vector count is a diagnostic signal, not a target metric.","A":"Support vector count and test accuracy are not inversely proportional in a strict sense. The relationship depends on C, the kernel, and the data distribution.","B":"","C":"Support vector count is a meaningful diagnostic. An extreme count (nearly all training samples as support vectors) can indicate over-regularization, a poorly chosen kernel width, or heavily overlapping classes.","D":"There is no principled reason for exactly 50% support vectors. This varies widely by dataset and hyperparameter settings."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07011","difficulty":"hard","orderIndex":11,"question":"You train an SVM with an RBF kernel on a training set. 
The kernel matrix $K$ is $n \\times n$. During inference, the model must compute $K(x, x_i)$ between a new point $x$ and every support vector. For $n = 50,000$ support vectors with $d = 5,000$ features, what is the per-sample inference cost and why is this problematic for real-time serving?","options":{"A":"Inference is O(1) because SVMs use a precomputed lookup table","B":"Inference costs O(n × d) operations per sample (n kernel evaluations at O(d) each) — for 50,000 support vectors and 5,000 features, each prediction requires 250 million floating-point operations; at real-time latency requirements (< 10ms), this is infeasible without approximation","C":"Inference cost is O(d) regardless of support vector count because the kernel is precomputed during training","D":"SVM inference always costs the same as a single dot product regardless of the number of support vectors"},"correct":"B","explanation":{"correct":"- Each inference requires evaluating $K(x, x_i) = e^{-\\gamma\\|x - x_i\\|^2}$ for each support vector $x_i$. Each evaluation requires $O(d)$ operations (computing $\\|x - x_i\\|^2$). With $n_{sv}$ support vectors: total cost is $O(n_{sv} \\times d)$.\n- For 50,000 support vectors and 5,000 features: $50,000 \\times 5,000 = 2.5 \\times 10^8$ multiply-adds per sample. At ~1 GFLOP/s on a single CPU core, this takes ~250ms — far exceeding real-time requirements.\n- Solutions: reduce the support vector count (for example by retuning C and gamma), use approximate kernel methods (Nyström, random features), switch to a linear SVM if the RBF is not strictly necessary, or use GPU acceleration.","A":"SVMs do not use precomputed lookup tables for inference. The kernel must be evaluated against all support vectors for each new sample.","B":"","C":"The kernel values between training support vectors can be cached, but the kernel between a new test point and each support vector must be computed fresh at inference time.","D":"SVM inference cost scales with the number of support vectors × feature dimensionality. 
A single dot product is O(d); total prediction is O($n_{sv} \\times d$)."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07012","difficulty":"hard","orderIndex":12,"question":"A linear SVM is trained on a balanced binary classification task with features $(x_1, x_2)$. The optimal hyperplane is $2x_1 - 3x_2 + 1 = 0$. A data scientist scales feature $x_1$ by 100 (multiplying all $x_1$ values by 100) and retrains. The new hyperplane becomes $0.02x_1' - 3x_2 + 1 = 0$ (where $x_1' = 100 x_1$). Are these models equivalent?","options":{"A":"Yes — the predictions are identical because scaling doesn't change class membership","B":"The predictions may be the same on the training set, but the unscaled model is heavily influenced by $x_2$ relative to $x_1$ — SVMs are not scale-invariant; the margin calculation depends on $\\|w\\|$, so feature scaling affects which hyperplane achieves maximum margin","C":"SVMs are scale-invariant by design — the kernel handles scaling automatically","D":"The two models are equivalent because the hyperplane equation $2x_1 - 3x_2 + 1 = 0$ and $0.02x_1' - 3x_2 + 1 = 0$ define the same geometric boundary"},"correct":"B","explanation":{"correct":"- The original margin: $\\frac{2}{\\|w\\|} = \\frac{2}{\\sqrt{4+9}} = \\frac{2}{\\sqrt{13}} \\approx 0.555$.\n- After scaling $x_1$ by 100, the equivalent problem in the original space has $w = (2/100, -3)$. Margin: $\\frac{2}{\\|(0.02, -3)\\|} = \\frac{2}{\\sqrt{0.0004+9}} \\approx \\frac{2}{3} = 0.667$ — different margin.\n- The maximum-margin hyperplane changes with feature scaling because $\\|w\\|$ depends on the scale of each feature's weight. SVM is not scale-invariant, which is why **feature standardization is mandatory** before training SVMs.","A":"\"Predictions may be identical\" is plausible only if the training data is perfectly separable (both models achieve zero training error). 
In general, different margins lead to different boundaries and different generalization.","B":"","C":"Kernel SVMs are not scale-invariant. The RBF kernel $e^{-\\gamma\\|x_i - x_j\\|^2}$ explicitly depends on Euclidean distance, which changes with feature scaling.","D":"The two quoted equations do describe the same line in the original space: substituting $x_1' = 100x_1$ into $0.02x_1' - 3x_2 + 1 = 0$ gives $2x_1 - 3x_2 + 1 = 0$. The flaw is the claim of equivalence in general: the optimizer minimizes $\\|w\\|$ in the scaled coordinates, where the margin differs, so retraining after rescaling is not guaranteed to return the same boundary."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07013","difficulty":"hard","orderIndex":13,"question":"An SVM with C=10 has 300 support vectors. Increasing C to 10,000 results in 2,800 support vectors (out of 5,000 training samples). A 5-fold cross-validation shows C=10 generalizes better. What is the precise mechanism causing more support vectors at higher C?","options":{"A":"Higher C means the optimizer adds more support vectors for computational stability","B":"Higher C penalizes margin violations more heavily, forcing the boundary to correctly classify more training points — points that were previously allowed to violate the margin (counted as non-support-vectors) are now forced toward or into the margin, becoming support vectors; more support vectors means a narrower margin and a more complex boundary","C":"Support vector count scales linearly with C — doubling C always doubles the support vectors","D":"Higher C causes more features to be selected, which creates more support vectors"},"correct":"B","explanation":{"correct":"- At low C, the SVM tolerates many margin violations — the model says \"it's acceptable for some training points to be inside or on the wrong side of the margin.\" These points may not become support vectors if the overall solution is better served by a wider margin.\n- At high C, any point that lies inside the 
margin (violates the $y_i(w^Tx_i + b) \\geq 1$ constraint) becomes a support vector with non-zero dual weight. More training points are forced to be correctly classified with margin ≥ 1, but this requires a more complex, narrower boundary.\n- 2,800 out of 5,000 support vectors at high C suggests the model is nearly memorizing training points — a hallmark of overfitting in SVMs.","A":"Optimizer stability has no relationship to support vector count. The number of support vectors is determined by the data geometry and the C value, not by numerical stability.","B":"","C":"The relationship between C and support vector count is not linear. It depends on the data distribution, margin violations at different C values, and the geometry of the decision boundary.","D":"SVMs (especially kernel SVMs) don't perform feature selection. All features are used through the kernel computation. Support vectors are training samples, not features."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07014","difficulty":"hard","orderIndex":14,"question":"A string kernel SVM is used to classify protein sequences. Two proteins are \"similar\" if they share many substrings (k-mers). The kernel is $K(s_1, s_2) = $ (count of shared k-mers). This is never explicitly computed in feature space. Mercer's theorem requires this kernel to be a valid kernel. 
What does \"valid kernel\" mean in this context?","options":{"A":"A valid kernel must be computable in polynomial time","B":"A valid kernel must be a symmetric positive semi-definite (PSD) function — it must correspond to a dot product in some (possibly infinite-dimensional) Hilbert space; the k-mer kernel is PSD because the matrix $K_{ij}$ built from all training pairs has all non-negative eigenvalues","C":"A valid kernel must produce values between 0 and 1","D":"Mercer's theorem only applies to continuous feature spaces; string kernels are exempt"},"correct":"B","explanation":{"correct":"- Mercer's theorem states that $K(x_i, x_j)$ is a valid kernel iff the Gram matrix $K_{ij} = K(x_i, x_j)$ is symmetric positive semi-definite (PSD) for any finite set of inputs.\n- PSD means: for all vectors $c$, $\\sum_{i,j} c_i c_j K(x_i, x_j) \\geq 0$. Equivalently, all eigenvalues of the Gram matrix are non-negative.\n- The k-mer string kernel can be written as $K(s_1, s_2) = \\phi(s_1) \\cdot \\phi(s_2)$ where $\\phi(s)$ is the feature vector of k-mer counts. Any kernel expressible as a dot product is automatically PSD.","A":"Computational complexity is not part of Mercer's theorem. A valid kernel only needs to correspond to a dot product in some Hilbert space, regardless of computational cost.","B":"","C":"Kernel values have no required range [0,1]. Many valid kernels (linear: $K(x,y) = x \\cdot y$) produce any real value. The PSD property is about the matrix structure, not individual value range.","D":"Mercer's theorem applies to any measurable space, including discrete spaces like string sequences. String kernels are one of the most important kernel types and are explicitly covered by the theorem."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07015","difficulty":"hard","orderIndex":15,"question":"A company builds a fraud detection model comparing SVM-RBF vs a Random Forest. 
Both achieve 0.91 AUC after tuning. The SVM has 12,000 support vectors from 50,000 training samples. The Random Forest has 200 trees. For monthly retraining on 50,000 new samples, which model presents the more significant operational challenge and why?","options":{"A":"Random Forest retraining is harder because it requires 200 separate model files","B":"SVM retraining is operationally harder at scale — with $O(n^2)$ to $O(n^3)$ training complexity, 50,000 new samples requires solving a QP over 50,000 dual variables; warm-starting from the previous 12,000 support vectors is partially possible but not trivially implemented; Random Forest retraining is embarrassingly parallel and completes in minutes","C":"Both models retrain in identical time since they achieve the same AUC","D":"SVM retraining is trivial because only the 12,000 support vectors need to be updated, not all 50,000 samples"},"correct":"B","explanation":{"correct":"- SVM retraining: the QP problem scales at least $O(n^2)$ in memory (kernel matrix) and $O(n^2)$ to $O(n^3)$ in computation. For 50,000 samples, the kernel matrix alone would be $50,000^2 \\times 8$ bytes = 20GB. Full retraining is expensive.\n- Incremental SVM updates (adding new samples without full retraining) exist but are complex — they require re-solving the KKT conditions for changed support vectors and don't reduce complexity for large batch updates.\n- Random Forest retraining: each of 200 trees trains independently in parallel. On 10 cores, retraining 200 trees takes approximately $200/10 = 20$ tree-training times in parallel. Total time: minutes, not hours.","A":"200 model files are easily managed with a serialized ensemble. The number of files is not an operational challenge — the per-file complexity is low.","B":"","C":"Equal AUC does not imply equal retraining time. 
Model quality and training complexity are independent — a model can be fast to train and perform poorly, or slow to train and perform well.","D":"The support vectors from the previous model are not simply \"updated.\" New training data requires identifying new support vectors from all 50,000 current samples, not just the previous 12,000. Warm-starting from previous SVs reduces time but doesn't eliminate the quadratic scaling."},"reference":"- Joachims, \"Making Large-Scale SVM Learning Practical\" (SVMlight): http://svmlight.joachims.org/\n- sklearn SVM documentation: https://scikit-learn.org/stable/modules/svm.html"},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08001","difficulty":"easy","orderIndex":1,"question":"A KNN classifier with k=1 achieves 100% training accuracy. A colleague immediately concludes the model is excellent. What is wrong with this reasoning?","options":{"A":"k=1 is always the optimal value — 100% training accuracy confirms this","B":"With k=1, every training point is its own nearest neighbor, so the model always predicts the correct class for any training point — this is guaranteed regardless of signal; training accuracy with k=1 is trivially 100% and reveals nothing about generalization","C":"k=1 KNN cannot achieve 100% accuracy on training data due to tie-breaking rules","D":"100% training accuracy means the model has no variance, which is always desirable"},"correct":"B","explanation":{"correct":"- In KNN with k=1, the nearest neighbor of any training point is itself (distance = 0). The prediction for any training point is trivially correct — this is guaranteed by the algorithm's definition, not by the model learning anything useful.\n- This is identical to why training accuracy is a misleading metric for any memorizing model. 
The k=1 KNN is an extreme interpolator: it reproduces every training label exactly.\n- The appropriate evaluation is on a held-out test set or via leave-one-out cross-validation (which explicitly prevents a point from being its own neighbor).","A":"k=1 is rarely optimal. It maximizes variance: the decision boundary is highly irregular, adapting to every training point including noisy ones. Optimal k is found by validation.","B":"","C":"There are no tie-breaking issues for a single nearest neighbor. Ties occur when two neighbors are equidistant and k > 1. With k=1, the nearest point (itself, at distance 0) always wins.","D":"A model with k=1 has maximum variance — the decision boundary changes dramatically with small perturbations of training data. High training accuracy with high variance is the textbook overfitting scenario."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08002","difficulty":"easy","orderIndex":2,"question":"A KNN model is trained on customer data with two features: `age` (range 20-80) and `annual_income` (range \\$20,000–\\$500,000). The model performs poorly. After normalizing both features to [0,1], performance improves significantly. What does this reveal about KNN's sensitivity?","options":{"A":"KNN is sensitive to the number of training samples, not feature scale","B":"KNN uses distance metrics (Euclidean, Manhattan) to find nearest neighbors — before normalization, `annual_income` dominates the distance calculation because its absolute scale is 10,000× larger than `age`, effectively making `age` irrelevant; normalization gives both features equal influence on distance","C":"Normalization improved performance because KNN requires features to be normally distributed","D":"Feature scale only matters for KNN when k is larger than 10"},"correct":"B","explanation":{"correct":"- Euclidean distance: $d = \\sqrt{(\\Delta \\text{age})^2 + (\\Delta \\text{income})^2}$. 
With raw values: $\\Delta \\text{age} \\leq 60$ while $\\Delta \\text{income} \\leq 480,000$. The distance is dominated almost entirely by income: when two incomes differ by 10,000, a 1-year age difference adds only about $10^{-8}$ of the squared distance.\n- The model effectively ignores age and classifies based only on income proximity. This is a geometric artifact, not a feature relevance judgment.\n- After normalization to [0,1]: $\\Delta \\text{age}^2 + \\Delta \\text{income}^2$ where both terms are in [0,1] — both features contribute meaningfully to distance.","A":"KNN is highly sensitive to feature scale, not primarily to sample count. The issue here is geometric — the distance metric is the core operation, and scale imbalance distorts it.","B":"","C":"KNN has no distributional assumptions. It makes no use of feature distributions — only pairwise distances. Normality is irrelevant.","D":"Feature scale dominance affects KNN for any k. With k=1, a point dominated by high-income similarity would be assigned the nearest high-income neighbor regardless of age."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08003","difficulty":"easy","orderIndex":3,"question":"KNN is described as a \"lazy learner.\" What does this mean and what practical consequence does it have at inference time?","options":{"A":"KNN is lazy because it requires few hyperparameters compared to other models","B":"KNN does no computation during \"training\" — it only stores all training data; all computation (distance calculations to find neighbors) happens at inference time, making training instant but prediction slow for large datasets","C":"KNN is lazy because it produces approximate results rather than exact predictions","D":"KNN is lazy because it uses random sampling at inference time instead of computing exact distances"},"correct":"B","explanation":{"correct":"- A \"lazy learner\" defers computation to inference time. 
During \"training,\" KNN simply stores all $(x_i, y_i)$ pairs — $O(1)$ or $O(n)$ at most for storage. No model parameters are learned.\n- At inference for a new point $x$: compute distance to all $n$ training points ($O(nd)$), sort or partially sort to find k-nearest ($O(n \\log k)$), aggregate their labels ($O(k)$). Total: $O(nd)$ per query.\n- Contrast with eager learners (logistic regression, neural networks): they invest computation at training time to learn compact parameters; inference is then $O(d)$ — fast regardless of training set size.","A":"\"Lazy\" in ML has a specific technical meaning (deferred computation to inference), not a reference to hyperparameter complexity. KNN actually has few hyperparameters (k, distance metric), but that's a coincidence.","B":"","C":"Standard KNN computes exact distances to find exact nearest neighbors. \"Lazy\" refers to the timing of computation, not its precision. Approximate KNN (FAISS, HNSW) is a separate technique.","D":"KNN uses exact distance computation by default. Random sampling is a different technique (approximate nearest neighbor search) not part of standard KNN."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08004","difficulty":"easy","orderIndex":4,"question":"KNN achieves 85% accuracy with k=3 and 84% with k=5 on the validation set. 
A data scientist asks: \"should I always choose the k that gives highest validation accuracy?\" What is the risk of this approach?","options":{"A":"No risk — always choosing the highest validation accuracy is the correct model selection strategy","B":"Choosing k based on validation accuracy is valid, but testing many k values increases the chance of selecting a k that happens to fit the validation set by chance — cross-validation over k with a held-out test set gives a more reliable estimate","C":"k=3 is always better than k=5 because lower k means more neighbors are considered","D":"k should always be an odd number to avoid ties; k=3 is correct simply for this reason"},"correct":"B","explanation":{"correct":"- Selecting k by maximizing a single validation set's accuracy is subject to the same model selection overfitting risk as any hyperparameter search: the best k for one validation split may not be the best for the population distribution.\n- The risk is especially high when the validation set is small — a 1% accuracy difference between k=3 and k=5 on a small validation set can easily be within noise.\n- Best practice: use cross-validation to estimate validation accuracy for each k, select the k with the best cross-validated performance, then evaluate once on the test set.","A":"This is the model selection overfitting trap. Always choosing max validation accuracy without cross-validation or confidence intervals risks overfitting to the validation set.","B":"","C":"Lower k does not mean \"more neighbors are considered\" — it means fewer neighbors. k=3 considers 3 nearest neighbors; k=5 considers 5. The statement is factually backwards.","D":"Odd k avoids ties in binary classification but is not a reason to always prefer lower odd values. 
The optimal k depends on the dataset's noise level and class boundary complexity."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08005","difficulty":"easy","orderIndex":5,"question":"A KNN regression model predicts house prices. With k=1, the test RMSE is 85,000. With k=100, the test RMSE is 42,000. With k=1000, the test RMSE is 61,000. What does the U-shaped relationship between k and RMSE reveal?","options":{"A":"k=100 is optimal for all house price datasets","B":"Small k produces high variance (each prediction depends on a single noisy neighbor), large k produces high bias (predictions are averaged over too many dissimilar houses), and the optimal k balances this trade-off — this is a direct manifestation of the bias-variance trade-off in KNN","C":"The U-shape reveals that KNN is not suitable for regression tasks","D":"The U-shape is caused by an error in feature scaling — after normalization, the relationship would be monotone"},"correct":"B","explanation":{"correct":"- k=1: prediction = single nearest neighbor's price. One noisy or atypical neighbor causes large errors. High variance, low bias.\n- k=1000: prediction = average of 1000 neighbors, many of which may be in different neighborhoods or sizes. Predictions converge to a broad average, missing local patterns. Low variance, high bias.\n- k=100: captures local neighborhood structure with enough averaging to smooth noise. This is the sweet spot for this dataset.\n- This pattern is universal in KNN and illustrates the bias-variance trade-off geometrically: the \"neighborhood\" size controls smoothness vs. locality.","A":"k=100 is optimal for this specific dataset. The optimal k is data-dependent. For a different city with denser similar housing, a larger k might be optimal.","B":"","C":"The U-shape is evidence that KNN can do regression and has a sweet spot — it does not indicate unsuitability. 
The task is to find the right k via validation.","D":"Feature scaling affects distance calculations but doesn't change the fundamental bias-variance behavior of k. The U-shape appears regardless of feature scale (after proper scaling, the optimal k may shift, but the U-shape persists)."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08006","difficulty":"medium","orderIndex":6,"question":"A KNN model with Euclidean distance achieves 88% accuracy on 50-dimensional data. After applying PCA to reduce to 10 dimensions, KNN accuracy increases to 93%. Explain the mechanism, and what phenomenon does this illustrate?","options":{"A":"PCA adds new information that was missing from the original 50 features","B":"In high-dimensional spaces, distances between points concentrate (all points become nearly equidistant), making nearest-neighbor search meaningless — PCA removes noise dimensions and retains the 10 most informative directions, making distances more discriminative; this is the curse of dimensionality","C":"PCA's normalization step is what improves KNN — the accuracy gain is not from dimensionality reduction but from standardization","D":"50 features is always too many for KNN — the algorithm is designed for at most 20 features"},"correct":"B","explanation":{"correct":"- The curse of dimensionality: as dimension $d$ increases, the volume of space grows exponentially. In high dimensions, all pairwise distances converge: $\\frac{\\max_{\\text{dist}} - \\min_{\\text{dist}}}{\\min_{\\text{dist}}} \\to 0$ as $d \\to \\infty$. The notion of \"nearest\" neighbor loses meaning.\n- With 50 dimensions, 40 of which may be noisy or irrelevant, Euclidean distances are dominated by noise contributions. Two actually similar points appear far apart due to noise in irrelevant dimensions.\n- PCA projects onto the 10 directions of maximum variance — presumably the signal dimensions. 
In this lower-dimensional space, distances are more meaningful and nearest neighbors are more likely to be genuinely similar.","A":"PCA is a dimensionality reduction technique — it cannot add information that wasn't in the original data. It only retains a subspace of the original feature space.","B":"","C":"PCA centers the data but does not standardize feature scales on its own; standardization is a separate preprocessing step. The primary mechanism here is dimensionality reduction removing noise dimensions, which mitigates the curse of dimensionality.","D":"KNN has no hard feature limit. The algorithm works at any dimensionality, but performance degrades with irrelevant dimensions. The challenge is empirical, not algorithmic."},"reference":"- Beyer et al., \"When is Nearest Neighbor Meaningful?\": https://link.springer.com/chapter/10.1007/3-540-49257-7_15"},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08007","difficulty":"medium","orderIndex":7,"question":"A KNN model must serve 1,000 queries per second on a dataset of 1 million training samples with 100 features. A brute-force KNN implementation takes 200ms per query. A team considers using a KD-tree or a ball-tree. Under what conditions would the KD-tree fail to provide speedup over brute-force?","options":{"A":"KD-trees always provide the same speedup regardless of dimension","B":"KD-tree performance degrades severely in high dimensions — its expected query complexity is O(kd × n^(1-1/d)), which approaches O(n) as d increases; for d=100, a KD-tree provides essentially no speedup over brute-force, and approximate methods (HNSW, FAISS IVF) are needed","C":"KD-trees fail when the dataset has more than 10,000 samples","D":"KD-trees fail when k (number of neighbors) is larger than 5"},"correct":"B","explanation":{"correct":"- KD-trees split the feature space along axes recursively. 
In low dimensions (d ≤ 20), they efficiently prune branches and achieve $O(k \\log n)$ query time.\n- As dimension increases, the number of KD-tree cells that could contain nearest neighbors grows exponentially. For $d = 100$, nearly every leaf must be checked — the tree degenerates to a brute-force search.\n- Ball-trees handle moderate dimensions slightly better (up to d~40) because their splits are based on hypersphere geometry rather than axis-aligned hyperplanes. For d=100, even ball-trees struggle. Approximate nearest neighbor libraries (HNSW, FAISS) are the practical solution.","A":"KD-tree speedup is dimension-dependent. The key insight is that the tree structure becomes ineffective in high dimensions — a crucial practical consideration.","B":"","C":"KD-trees efficiently handle millions of samples in low dimensions. Sample count is not the limiting factor — dimensionality is.","D":"The number of neighbors k affects constant factors in KD-tree query time but is not the primary failure mode. The dimension curse dominates."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08008","difficulty":"medium","orderIndex":8,"question":"A KNN classifier uses Euclidean distance. A new feature `binary_flag` (values 0 or 1) is added to a feature set of continuous measurements. After adding this feature, model accuracy drops. The team scales all continuous features to [0,1] but the flag is already in [0,1]. 
What is the likely cause of the accuracy drop?","options":{"A":"Binary features cannot be used with Euclidean distance","B":"The binary feature contributes the same maximum distance (1) as continuous features, but it represents a fundamentally different type of difference — a 0-vs-1 binary flip may be less meaningful than a 0.01 difference in a continuous feature, or vice versa; Euclidean distance treats all [0,1] features identically regardless of semantic meaning","C":"The accuracy drop is unrelated to the new feature — it is caused by normalization changing existing features","D":"Binary features must be one-hot encoded before use with KNN regardless of binary values"},"correct":"B","explanation":{"correct":"- After [0,1] scaling, Euclidean distance treats a binary flip (0→1 in `binary_flag`) as the same distance as a full range change in a continuous feature. But the semantic meaning differs: the binary flag might represent \"premium vs standard\" — a categorical distinction — while continuous features represent gradual change.\n- If the binary flag is a noisy proxy or introduces class-irrelevant variation, it adds distance noise that misleads the nearest-neighbor search.\n- Solutions: use different feature weights (weighted KNN), use a distance metric appropriate for mixed data types (Gower distance), or assess feature importance before adding binary flags.","A":"Binary features can be used with Euclidean distance. The issue is not mathematical incompatibility but semantic mismatch between binary semantics and continuous distance interpretation.","B":"","C":"Normalization of continuous features affects their distance contribution, but the question specifies they were already scaled to [0,1]. 
The drop specifically correlates with adding the binary flag.","D":"One-hot encoding a feature that already takes values 0 and 1 yields the flag and its complement. The second column is fully determined by the first, so it adds no information and only rescales the flag's contribution to the distance."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08009","difficulty":"medium","orderIndex":9,"question":"You are building a product recommendation system using KNN. You find that k=10 gives 0.82 AUC and k=50 gives 0.85 AUC. Your dataset has 500 samples. A data scientist says \"larger k is always better — more neighbors means more information.\" Is this correct?","options":{"A":"Correct — more neighbors always provides more information for any dataset","B":"Incorrect — as k approaches n (total samples), the prediction converges to the majority class for all inputs, ignoring all local structure; in a small dataset of 500 samples, k=50 uses 10% of all data per prediction, which may already be approaching the \"averaging out local structure\" regime","C":"Correct but only when n > 1000 samples; for small datasets k must be minimized","D":"Incorrect only because the dataset is small; for large datasets more neighbors is always better"},"correct":"B","explanation":{"correct":"- As k increases toward n: KNN predictions become increasingly global averages rather than local patterns. For k=n, every new point gets the same prediction (majority class), ignoring features entirely.\n- With 500 samples, k=50 means each prediction is determined by 10% of all training data. This smooths out local patterns and may reduce sensitivity to the specific features that matter for recommendation.\n- The optimal k balances locality (small k captures local patterns) against stability (large k averages out noise). 
The optimal value is always dataset-dependent and should be found via cross-validation.","A":"\"More neighbors = more information\" fails when the additional neighbors are from different classes or distributions than the query point's true neighborhood. Quality of neighbors matters more than quantity.","B":"","C":"There is no sample-count threshold that determines whether larger k is universally better. The relationship depends on the signal-to-noise ratio in the dataset, not the absolute size.","D":"For large datasets, larger k can still introduce the same high-bias problem by averaging over distant, dissimilar neighbors. The optimal k scales roughly as $\\sqrt{n}$ as a heuristic, not proportionally to n."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08010","difficulty":"medium","orderIndex":10,"question":"A KNN model for credit scoring must be deployed in a regulated environment. A compliance officer asks: \"explain why this applicant was denied.\" The ML engineer says \"our k=5 KNN model found these 5 similar applicants who all defaulted.\" Is this a valid explanation for regulatory purposes?","options":{"A":"Yes — citing 5 similar historical cases is an intuitive and complete explanation","B":"Partially — example-based explanations are intuitive but may fail regulatory requirements that specify feature-level adverse action reasons (e.g., \"denied because of high debt-to-income ratio\"); the 5 neighbors explain similarity in distance space, but don't identify which specific features drove the similarity","C":"No — KNN cannot be used in regulated industries because it has no explainability whatsoever","D":"The explanation is complete because KNN predictions are based on data, not complex math"},"correct":"B","explanation":{"correct":"- KNN's example-based explanation (\"these similar cases all defaulted\") is intuitive and has face validity. 
However, it doesn't answer \"which specific features make these cases similar?\" — a question regulators require.\n- For ECOA/FCRA adverse action notices, lenders must specify specific reasons: \"denied because of high debt-to-income ratio, insufficient credit history.\" KNN distance similarity doesn't directly map to feature-level reasons.\n- To achieve both, you could augment KNN explanations with feature contribution analysis: which features contributed most to the distance between the applicant and the nearest neighbors?","A":"Example-based explanation is intuitive but not always sufficient. \"Similar past cases\" doesn't identify the legally required specific adverse action factors.","B":"","C":"KNN has the valuable property of example-based explanations — showing similar cases is a form of transparency. Many regulated industries use KNN precisely because of this interpretability. The issue is granularity, not absence of explainability.","D":"Having an explanation based on data doesn't automatically satisfy regulatory requirements for specific feature-level reasons. The regulatory standard is more specific than \"data-driven.\""}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08011","difficulty":"hard","orderIndex":11,"question":"You train a KNN model on a 500-dimensional dataset where the true decision boundary depends only on 3 features. KNN achieves poor performance despite the true signal being strong. A colleague suggests using a Mahalanobis distance instead of Euclidean. 
Why would this help, and what does the Mahalanobis distance compute?","options":{"A":"Mahalanobis distance is faster to compute, which improves KNN performance","B":"Mahalanobis distance accounts for feature covariance — it scales distances by the inverse of the covariance matrix $d_M(x,y) = \\sqrt{(x-y)^T \\Sigma^{-1} (x-y)}$; this de-correlates features and normalizes by variance, reducing the influence of redundant and noisy high-variance features on neighbor selection","C":"Mahalanobis distance is equivalent to Euclidean distance after mean centering","D":"Mahalanobis distance removes irrelevant features by setting their weight to zero"},"correct":"B","explanation":{"correct":"- Euclidean distance in 500 dimensions is dominated by the 497 irrelevant features (each contributing a noise term). Mahalanobis distance stretches or shrinks the space according to the inverse covariance matrix: low-variance features (often uninformative constant features) are amplified; high-variance correlated features are treated jointly.\n- The effect: features that are noisy or redundant contribute less to the Mahalanobis distance, while informative features (with variance aligned with class differences) contribute more.\n- However, Mahalanobis distance doesn't explicitly identify the 3 relevant features. For truly irrelevant features, explicit feature selection or metric learning (learning the optimal distance matrix) is more effective.","A":"Mahalanobis distance requires computing $\\Sigma^{-1}$ (a $500 \\times 500$ matrix) and matrix-vector products — it is significantly more expensive than Euclidean distance, not faster.","B":"","C":"Mahalanobis distance is not equivalent to Euclidean after mean centering. Mean centering removes the bias term but doesn't account for variance or covariance. Mahalanobis requires the full inverse covariance matrix.","D":"Mahalanobis distance doesn't zero out irrelevant features — it reweights them by their inverse variance/covariance. 
A feature with high variance (even if irrelevant) might still have non-zero contribution to Mahalanobis distance."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08012","difficulty":"hard","orderIndex":12,"question":"A KNN classifier is applied to a time-series classification task: given the last 30 days of stock returns (30-dimensional feature vector), classify the next day as up or down. KNN with Euclidean distance and k=10 achieves only 51% accuracy (random baseline). A finance researcher suggests using Dynamic Time Warping (DTW) as the distance metric. What problem with Euclidean distance does DTW solve?","options":{"A":"DTW is faster than Euclidean distance for 30-dimensional vectors","B":"Euclidean distance compares point-by-point (position 1 vs position 1, day 2 vs day 2) — for time series, similar patterns may be time-shifted or stretched; DTW finds the optimal alignment between two sequences, allowing comparison of temporally shifted patterns and making KNN sensitive to pattern shape rather than exact position","C":"DTW normalizes the feature vectors, which Euclidean distance cannot do","D":"Euclidean distance cannot handle negative values (stock returns can be negative), but DTW can"},"correct":"B","explanation":{"correct":"- Euclidean distance between two time series requires exact temporal alignment. Two otherwise identical stock patterns where one is shifted by 2 days (a common occurrence) would appear very dissimilar by Euclidean distance.\n- DTW finds the best alignment by allowing \"warping\" — matching each point in one series to the most similar point in the other, within a warping window constraint. This captures pattern similarity regardless of temporal shifts.\n- In financial time series, patterns like \"three-day rally followed by consolidation\" are meaningful regardless of exact timing. 
DTW makes KNN sensitive to these patterns.","A":"DTW is significantly slower than Euclidean distance: for two series of length $d$ it requires $O(d^2)$ dynamic programming per pair, compared to $O(d)$ for Euclidean. Speed is not the motivation.","B":"","C":"DTW does not inherently normalize features. Normalization is a separate step. Both Euclidean and DTW can be applied to normalized or unnormalized series.","D":"Euclidean distance handles negative values correctly — $(x_i - y_i)^2$ is always non-negative regardless of the sign of $x_i$ or $y_i$. Negative values are not an issue for Euclidean distance."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08013","difficulty":"hard","orderIndex":13,"question":"A KNN model is trained on a dataset with severe class imbalance: 95% class 0 and 5% class 1. With k=10, almost every prediction is class 0 because 10 nearest neighbors are dominated by class 0 samples. A developer says \"reduce k to 1 to fix this.\" What is the better approach and why does reducing k to 1 create a different problem?","options":{"A":"Reducing k to 1 is the correct fix — use the single nearest neighbor to avoid majority class dominance","B":"Reducing k to 1 maximizes variance and makes the model sensitive to individual noisy class-1 samples; the better fix is class-weighted KNN (weighting neighbors inversely by class frequency) or combining KNN with oversampling of the minority class to balance the neighborhoods","C":"Class imbalance does not affect KNN — it only affects accuracy-based metrics","D":"The fix is to increase k to 50 to include more class-1 samples in each neighborhood"},"correct":"B","explanation":{"correct":"- With k=1, the prediction for any test point is the label of its single nearest training neighbor. 
For a test point near the class-0 majority, the nearest neighbor is class 0 — the problem persists in dense majority regions.\n- Additionally, k=1 is highly sensitive to noise: any class-1 point near a class-0 region (or vice versa) will cause misclassifications in its neighborhood.\n- Class-weighted KNN: weight the vote of each neighbor by $1 / P(\\text{class})$ (inverse frequency weighting), giving class-1 neighbors more voting power. Alternatively, oversample class-1 training points to create a balanced neighborhood distribution.","A":"k=1 doesn't fix the imbalance problem in regions dominated by class 0. In the majority class regions (95% of space), the nearest neighbor is almost always class 0 regardless of k=1.","B":"","C":"Class imbalance directly affects KNN by making neighborhoods statistically biased toward the majority class. This is a data representation problem that affects neighbor vote aggregation.","D":"Increasing k to 50 makes the problem worse — with a 95/5 imbalance, 50 neighbors will almost certainly contain 47+ class-0 samples, guaranteeing class-0 predictions everywhere."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08014","difficulty":"hard","orderIndex":14,"question":"A team wants to use KNN for a recommendation system with 10 million users and 100 features. They need sub-50ms response time for nearest-neighbor queries. They evaluate exact KNN (brute force), HNSW (hierarchical navigable small world graph), and an IVF (inverted file index) approach. Exact KNN achieves 100% recall but takes 2 seconds per query. HNSW achieves 99.2% recall in 3ms. IVF achieves 98.1% recall in 8ms. 
What is the right framework for making this trade-off decision?","options":{"A":"Always use exact KNN — 0.8% recall drop from HNSW is unacceptable for production","B":"For 10M users requiring sub-50ms latency, approximate nearest neighbor (ANN) methods are the only viable choice — the trade-off is recall vs latency; HNSW's 99.2% recall at 3ms is likely acceptable for recommendations (losing 1% of truly relevant items is invisible to users) while meeting the latency SLA; the exact method is infeasible at the required throughput","C":"IVF should always be preferred over HNSW because it uses less memory","D":"The decision should be made based solely on which algorithm is easiest to implement"},"correct":"B","explanation":{"correct":"- Exact KNN with 10M users and 100 features: $O(nd) = 10^9$ operations per query at 2 seconds — physically impossible to meet 50ms SLA. This is not a tuning problem; it is a fundamental computational limitation.\n- HNSW builds a hierarchical graph where each node connects to its approximate neighbors at multiple scales. At 3ms and 99.2% recall, it provides excellent accuracy with 667× speedup. The 0.8% recall gap means ~1 in 125 truly relevant items is missed — imperceptible in recommendation user experience.\n- The decision framework: identify the minimum acceptable recall for the application (recommendations: 99%+ is comfortable; medical image retrieval: 100% may be required), find the ANN method meeting that recall threshold within the latency SLA.","A":"\"Always use exact KNN\" ignores the fundamental infeasibility of 2-second latency for real-time recommendations. 99.2% recall at 3ms is excellent for user-facing systems.","B":"","C":"HNSW vs IVF is a trade-off between recall, latency, and memory. HNSW typically offers better recall/speed trade-offs for dense data. IVF is better for very large datasets where HNSW's graph construction memory is prohibitive. 
The choice is not universally in favor of either.","D":"Implementation ease is never the primary criterion for production system design. Correctness, performance, and reliability requirements drive the decision."},"reference":"- Malkov & Yashunin, \"Efficient and Robust Approximate Nearest Neighbor Search Using HNSW\": https://arxiv.org/abs/1603.09320\n- FAISS documentation: https://faiss.ai/"},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08015","difficulty":"hard","orderIndex":15,"question":"A KNN model with Manhattan distance (L1) is compared to the same model with Euclidean distance (L2) on a 1,000-dimensional dataset. The L1 model achieves significantly higher accuracy. Provide a precise geometric explanation for why L1 distance can outperform L2 in high dimensions.","options":{"A":"L1 distance is always superior to L2 distance in any dimension","B":"In high dimensions, L2 distance is dominated by the largest individual feature differences (the squared terms amplify outliers) while L1 distance sums absolute differences linearly — this makes L2 sensitive to a few noisy dimensions, while L1 distributes sensitivity more evenly; L1 is more robust to irrelevant noisy features in high-dimensional spaces","C":"L1 distance is faster to compute than L2, which is why it achieves higher accuracy","D":"L2 distance cannot handle more than 100 dimensions mathematically"},"correct":"B","explanation":{"correct":"- L2 distance: $\\sqrt{\\sum (x_i - y_i)^2}$. The squaring amplifies large individual differences — a single noisy dimension with a large difference dominates the total distance.\n- L1 distance: $\\sum |x_i - y_i|$. Linear sum — no single dimension is disproportionately amplified. In high dimensions with many irrelevant features, L1 averages the noise more uniformly.\n- Theoretical support: the concentration of measure phenomenon affects L2 more severely than L1. 
The ratio of maximum to minimum pairwise distances (the \"relative contrast\") degrades faster for L2 than L1 as dimension increases, making L1 distances more discriminative.","A":"L2 is superior to L1 in many low-dimensional settings, particularly when the data geometry is spherical or when larger differences are genuinely more important. Neither metric is universally superior.","B":"","C":"L1 computation (no square root, no squaring) is marginally faster than L2, but the accuracy improvement comes from the geometric property of noise robustness, not from computational speed.","D":"L2 distance is mathematically defined for any dimension. The practical challenge is interpretability and concentration of measure, not a mathematical limit."},"reference":"- Aggarwal et al., \"On the Surprising Behavior of Distance Metrics in High Dimensional Space\": https://link.springer.com/chapter/10.1007/3-540-44503-X_27"},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09001","difficulty":"easy","orderIndex":1,"question":"A Naive Bayes spam classifier assigns probability 0.97 to an email being spam. The raw computation is: $P(\\text{spam}|\\text{words}) \\propto P(\\text{spam}) \\times \\prod_{i} P(w_i|\\text{spam})$. 
A developer asks: \"where does the 'naive' come from?\" What is the correct answer?","options":{"A":"The algorithm is \"naive\" because it uses a simple decision rule: classify spam if probability > 0.5","B":"The algorithm assumes all features (words) are conditionally independent given the class — $P(w_1, w_2, ..., w_n|\\text{spam}) = \\prod P(w_i|\\text{spam})$ — this is the \"naive\" assumption because in reality words co-occur and are correlated","C":"The algorithm is naive because it ignores the email body and only uses the subject line","D":"The algorithm assumes equal prior probabilities for all classes, which is a simplification"},"correct":"B","explanation":{"correct":"- Bayes theorem gives: $P(\\text{class}|\\text{features}) \\propto P(\\text{class}) \\times P(\\text{features}|\\text{class})$. Computing $P(\\text{features}|\\text{class})$ for a 1000-word vocabulary requires modeling the full joint distribution — intractable.\n- The \"naive\" assumption: all features are conditionally independent given the class. This factorizes the joint: $P(f_1, ..., f_n | c) = \\prod P(f_i | c)$. Each term is easy to estimate.\n- This assumption is almost always false in reality (words co-occur: \"machine\" and \"learning\" appear together more than randomly). Yet Naive Bayes works surprisingly well in practice because calibrated probabilities are not required for correct class ranking.","A":"The 0.5 threshold is a standard binary classification decision rule, not specific to Naive Bayes. \"Naive\" refers to the independence assumption, not the threshold.","B":"","C":"Naive Bayes classifiers for text typically use all words in the email (bag-of-words). Ignoring the body would be a design choice, not the definition of \"naive.\"","D":"Naive Bayes uses prior probabilities estimated from class frequency in training data — not assumed equal. 
The prior is an explicit learned component, not a simplification of equal priors."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09002","difficulty":"easy","orderIndex":2,"question":"A Naive Bayes classifier is trained for medical diagnosis. The word \"fever\" appears in 80% of disease-positive training documents and 20% of disease-negative documents. The prior probabilities are $P(\\text{disease}) = 0.01$ (1% base rate). A new patient report mentions only \"fever.\" Which class does Naive Bayes predict, and is the output probability reliable?","options":{"A":"Disease, with probability 0.80 — the model is well-calibrated","B":"No disease — $P(\\text{no disease} | \\text{fever}) \\propto 0.99 \\times 0.20 = 0.198$ vs $P(\\text{disease} | \\text{fever}) \\propto 0.01 \\times 0.80 = 0.008$; the low disease prior overwhelms the likelihood, predicting no disease; the output probability is often unreliable but the class prediction is correct in this case","C":"Disease — the feature likelihood ratio 80/20 = 4 always overrides the prior","D":"The classifier cannot make a prediction because only one feature was provided"},"correct":"B","explanation":{"correct":"- Posterior ∝ Prior × Likelihood: $P(\\text{disease}|\\text{fever}) \\propto 0.01 \\times 0.8 = 0.008$; $P(\\text{no disease}|\\text{fever}) \\propto 0.99 \\times 0.2 = 0.198$. Normalized: $P(\\text{disease}|\\text{fever}) = 0.008/(0.008+0.198) \\approx 0.039$.\n- The model correctly predicts \"no disease\" because the prior is so low. This shows why the base rate cannot be neglected: an 80% likelihood can still yield a low posterior when the prior is 1%.\n- The output probability (≈3.9%) may be miscalibrated due to the naive independence assumption — but the directional class prediction (no disease) is correct.","A":"P(disease) = 0.01 strongly dominates. 0.80 is the likelihood $P(\\text{fever}|\\text{disease})$, not the posterior probability. 
Naive Bayes computes the posterior, which includes the prior.","B":"","C":"The likelihood ratio (4:1) does not override the prior. Bayes theorem multiplies likelihood by prior. A 4:1 likelihood ratio with a 99:1 prior odds produces 4:99 posterior odds for disease.","D":"Naive Bayes can make predictions from any number of features, including just one. It would simply use $P(\\text{class}) \\times P(\\text{fever}|\\text{class})$ — one feature is sufficient."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09003","difficulty":"easy","orderIndex":3,"question":"A Multinomial Naive Bayes model is trained on text data. The word \"unicorn\" never appears in the training corpus. At test time, an email contains \"unicorn.\" Without smoothing, what happens to the model's prediction?","options":{"A":"The word is ignored — the model predicts based on the other words","B":"$$P(\\text{unicorn}|\\text{class}) = 0$ for all classes; since the product $\\prod P(w_i|\\text{class})$ includes a zero term, the posterior becomes zero for every class — the model cannot make any prediction (all class probabilities are zero)","C":"The model assigns P(unicorn|class) = 0.5 as a default for unseen words","D":"The model raises an error because unseen vocabulary is not supported"},"correct":"B","explanation":{"correct":"- Multinomial NB computes the product of word likelihoods: $P(\\text{class}|\\text{doc}) \\propto P(\\text{class}) \\times \\prod_i P(w_i|\\text{class})$.\n- $P(\\text{unicorn}|\\text{class}) = 0/N_{\\text{class}} = 0$ because \"unicorn\" has zero count. The product becomes $P(\\text{class}) \\times 0 \\times P(\\text{other words}) = 0$ for every class.\n- Laplace smoothing (add-one smoothing) fixes this: $P(w|\\text{class}) = \\frac{\\text{count}(w, \\text{class}) + 1}{N_{\\text{class}} + |V|}$ where $|V|$ is vocabulary size. 
This ensures no word has zero probability.","A":"Standard Multinomial NB doesn't skip words — every word in the document is multiplied into the posterior. Ignoring unseen words would require explicit out-of-vocabulary handling (which is a modification, not the default behavior).","B":"","C":"Default probability of 0.5 for unseen words is not how standard NB works. Laplace smoothing uses $1/(N+|V|)$, not 0.5, to maintain the multinomial distribution property.","D":"Naive Bayes doesn't raise errors on unseen vocabulary — it mathematically produces 0 probability, which causes the prediction to be undefined. This is a silent failure, not an error."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09004","difficulty":"easy","orderIndex":4,"question":"You compare Gaussian Naive Bayes (GNB) and Multinomial Naive Bayes (MNB) for classifying customer support tickets by category. The tickets are represented as TF-IDF vectors (continuous values). Which model is more appropriate and why?","options":{"A":"Multinomial NB is always better for text — it was designed specifically for this case","B":"Both models have trade-offs: Multinomial NB assumes non-negative integer counts (word frequencies), making it well-suited for raw count vectors; TF-IDF produces continuous non-negative values, for which Gaussian NB (assuming continuous Gaussian features) or Complement NB is more appropriate; MNB technically applies to TF-IDF but assumes a multinomial distribution that doesn't perfectly fit continuous weights","C":"Gaussian NB is always better than Multinomial NB for classification tasks","D":"Neither model can handle text classification — a deep learning model is required"},"correct":"B","explanation":{"correct":"- Multinomial NB models $P(w_i | \\text{class}) = p_{ic}^{x_{ic}}$ where $x_{ic}$ is the count of word $i$ in class $c$. This assumes integer count data (bag of words). 
TF-IDF values are continuous and not integer counts — MNB treats them as counts approximately.\n- Gaussian NB models each feature as $P(x_i | \\text{class}) = \\mathcal{N}(\\mu_{ic}, \\sigma_{ic}^2)$. For TF-IDF, this may not fit well because TF-IDF values are highly skewed (many zeros, some large values).\n- Complement NB (a variant of MNB) often works best for text; Bernoulli NB works for binary presence/absence. The choice should be empirically validated on the specific task.","A":"MNB was designed for count vectors, not TF-IDF. For raw bag-of-words counts, MNB is the natural choice. For TF-IDF, the match is approximate.","B":"","C":"Gaussian NB assumes normally distributed features, which is often violated for text features (sparse, skewed distributions). GNB is not universally better for text.","D":"Naive Bayes is a well-established and effective approach for text classification. Deep learning is not required — NB is often a strong baseline."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09005","difficulty":"easy","orderIndex":5,"question":"Laplace smoothing with parameter α=1 is applied to a Naive Bayes model on a vocabulary of 10,000 words. Training corpus for class \"spam\" has 1,000 total word tokens. The word \"discount\" appears 50 times in spam. What is the smoothed probability $P(\\text{discount}|\\text{spam})$?","options":{"A":"50/1000 = 0.05","B":"$$(50 + 1) / (1000 + 10000) = 51/11000 \\approx 0.00464$","C":"$$(50 + 1) / (1000 + 1) = 51/1001 \\approx 0.051$","D":"$$50 / (1000 + 10000) = 50/11000 \\approx 0.00454$"},"correct":"B","explanation":{"correct":"- Laplace smoothing formula: $P(w|\\text{class}) = \\frac{\\text{count}(w, \\text{class}) + \\alpha}{\\sum_w \\text{count}(w, \\text{class}) + \\alpha|V|}$.\n- Numerator: $50 + 1 = 51$. Denominator: $1000 + 1 \\times 10000 = 11000$.\n- Result: $51/11000 \\approx 0.00464$. 
The denominator adds $\\alpha \\times |V|$ (not just $\\alpha$) to ensure probabilities sum to 1 across the entire vocabulary.","A":"Unsmoothed MLE — this ignores Laplace smoothing and would give zero for unseen words.","B":"","C":"Only adds α once to the denominator, not $\\alpha \\times |V|$. This is a common mistake — the smoothing must be applied consistently across all vocabulary terms to maintain valid probability distributions.","D":"Correct denominator but missing the α in the numerator. Laplace smoothing adds α to both the numerator count and the denominator sum."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09006","difficulty":"medium","orderIndex":6,"question":"A Naive Bayes email classifier achieves 94% precision on spam detection. A data scientist says \"Naive Bayes works well because its independence assumption holds for email text.\" A researcher disagrees. Why does Naive Bayes work well in practice despite the assumption being violated?","options":{"A":"The independence assumption actually holds for text data — words are statistically independent","B":"Naive Bayes requires only correct class ranking, not calibrated probabilities — even with correlated features, if the posterior $P(\\text{spam}|\\text{words})$ consistently ranks spam above non-spam for spam emails, classification is correct; the correlated features' violation affects probability magnitude but not necessarily the direction of class ranking","C":"94% precision means the independence assumption is valid for this specific dataset","D":"Naive Bayes corrects for dependence automatically through Laplace smoothing"},"correct":"B","explanation":{"correct":"- The naive independence assumption is almost always false for text — \"machine\" and \"learning\" co-occur far more often than independence predicts. 
The model's probability estimates are therefore miscalibrated (too extreme).\n- But classification only requires: argmax over classes of the posterior. If the model consistently assigns higher (even if miscalibrated) probability to the correct class, predictions are correct.\n- Theoretical analysis (Domingos & Pazzani 1997): NB is optimally robust when features are \"conditionally positively correlated\" — the most common case in text. The class ranking is preserved even when probabilities are miscalibrated.","A":"Words in text are highly correlated — \"New York\" always appears together, \"credit card\" is a common phrase. Independence is definitively violated for text.","B":"","C":"94% precision is evidence of good classification performance, not of the independence assumption holding. The assumption can be violated while performance is high.","D":"Laplace smoothing handles zero probabilities for unseen words — it does not correct for feature dependence. These are separate issues."},"reference":"- Domingos & Pazzani, \"On the Optimality of the Simple Bayesian Classifier under Zero-One Loss\": https://link.springer.com/article/10.1023/A:1007413511361"},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09007","difficulty":"medium","orderIndex":7,"question":"A Naive Bayes classifier outputs $P(\\text{class=1}|\\text{features}) = 0.9999$ for most samples. A calibration plot shows that among samples predicted at 0.9999, only 72% are actually class 1. 
What structural property of Naive Bayes causes this extreme overconfidence?","options":{"A":"99.99% probability with 72% actual rate is within normal statistical variation — the model is well-calibrated","B":"Correlated features are counted multiple times in the product $\\prod P(f_i|\\text{class})$ — if \"machine\" and \"learning\" both appear (highly correlated), each contributes independently to the product, artificially inflating the probability toward extreme values (near 0 or 1)","C":"Naive Bayes outputs are always overconfident — it is a known limitation that cannot be remedied","D":"The overconfidence is caused by Laplace smoothing inflating all probabilities toward extreme values"},"correct":"B","explanation":{"correct":"- The log-posterior is a sum of per-feature terms $\\log P(f_i | c)$, so the log-odds between classes grow with the number of features and the normalized posterior is pushed toward 0 or 1. With correlated features, the same information is effectively counted multiple times, pushing products to even more extreme values.\n- Example: \"spam\" email contains \"discount\", \"offer\", \"deal\" — all highly correlated spam indicators. Naive Bayes multiplies these as if independent, overestimating the probability of spam far beyond the true conditional probability.\n- This is why Naive Bayes is often combined with Platt scaling or isotonic regression to calibrate probabilities: the class predictions may be correct, but the probability outputs require post-hoc calibration.","A":"A gap of roughly 28 points (99.99% predicted vs 72% actual) is severe miscalibration, not statistical variation. This is a systematic overconfidence pattern, not noise.","B":"","C":"Overconfidence can be remedied. Calibration methods (Platt scaling, temperature scaling) correct NB's overconfidence by mapping raw outputs to calibrated probabilities. The limitation is not irremediable.","D":"Laplace smoothing moves probabilities away from 0 and 1 (it prevents zero probabilities). 
It does not cause overconfidence — it slightly reduces extreme values."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09008","difficulty":"medium","orderIndex":8,"question":"A Naive Bayes model is trained for news topic classification (2 classes: sports and politics). Training data: 10,000 sports articles, 1,000 politics articles. At test time, a politically neutral article gets classified as \"sports\" even though it contains political keywords. What is the most likely cause?","options":{"A":"The model has a bug in the likelihood computation","B":"The prior $P(\\text{sports}) = 10,000/11,000 \\approx 0.91$ strongly dominates — even when $P(\\text{words}|\\text{politics}) > P(\\text{words}|\\text{sports})$, the large sports prior can overwhelm the likelihood ratio; this is prior dominance in imbalanced training data","C":"Naive Bayes always classifies based on the most frequent class — this is expected behavior","D":"Political keywords have zero probability in all classes because they weren't seen in training data"},"correct":"B","explanation":{"correct":"- Prior dominance: $P(\\text{sports}) \\approx 0.91$, $P(\\text{politics}) \\approx 0.09$. Even a 10:1 likelihood ratio in favor of politics gives: $P(\\text{politics}|\\text{doc}) \\propto 0.09 \\times 10 = 0.9$ vs $P(\\text{sports}|\\text{doc}) \\propto 0.91 \\times 1 = 0.91$. With these rounded priors, sports still narrowly wins at 10:1; the politics class needs a likelihood ratio above roughly 10:1 (the ratio of the priors) just to overcome the prior.\n- This is a training data imbalance problem. 
The model effectively needs very strong political signal to overcome the sports prior.\n- Solutions: adjust class priors to reflect true expected distribution (not training imbalance), use class weights, or downsample the majority class.","A":"The behavior is mathematically correct Naive Bayes — it is a consequence of the prior × likelihood computation, not a bug.","B":"","C":"Naive Bayes doesn't always predict the most frequent class — it predicts the class with the highest posterior. When the likelihood ratio is large enough, the minority class can win. The problem is when the likelihood ratio is insufficient to overcome the prior.","D":"Political keywords appear in training data (1,000 politics articles) — they have non-zero likelihood for the politics class. The issue is the low prior, not zero likelihoods."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09009","difficulty":"medium","orderIndex":9,"question":"A Bernoulli Naive Bayes model and a Multinomial Naive Bayes model are both trained on the same text data. The Bernoulli model represents documents as binary vectors (word present or absent). The Multinomial model uses word counts. On short documents (tweets, 280 chars), Bernoulli NB outperforms Multinomial NB. 
Why?","options":{"A":"Bernoulli NB is always better than Multinomial NB — count information is never useful","B":"In short documents, most words appear at most once — count and presence are nearly identical; but Bernoulli NB explicitly models absent words (contributes $P(w=0|\\text{class})$ for words not in the document), which adds discriminative signal about what is NOT present; Multinomial NB ignores absent words, losing this signal in short documents","C":"Multinomial NB is computationally slower on short documents, which is why Bernoulli appears better","D":"Short documents violate the Multinomial distribution assumption, causing Multinomial NB to fail"},"correct":"B","explanation":{"correct":"- Bernoulli NB: $P(\\text{doc}|\\text{class}) = \\prod_{w \\in V} P(w|\\text{class})^{b_w} \\times P(\\text{not-}w|\\text{class})^{1-b_w}$ where $b_w \\in \\{0,1\\}$.\n- When a word is absent ($b_w = 0$), Bernoulli NB multiplies by $P(\\text{not-}w|\\text{class}) = 1 - P(w|\\text{class})$. A word common in spam (high $P(w|\\text{spam})$) contributes $P(\\text{not-}w|\\text{spam}) = $ small value when absent — a positive signal for non-spam.\n- Multinomial NB only processes words present in the document, contributing nothing for absent words. In short documents with few words, the absence of spam indicators is strong evidence — Bernoulli captures this; Multinomial misses it.","A":"Multinomial NB's use of count information is genuinely useful for long documents where word frequency carries meaning (e.g., \"urgent\" appearing 5 times in an email is more suspicious than once). The advantage depends on document length.","B":"","C":"Computational speed differences are not the cause of accuracy differences. Both models have similar complexity.","D":"Both Bernoulli and Multinomial NB have their respective distributional assumptions. 
Short documents don't \"violate\" the Multinomial distribution — they just provide less count information to leverage."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09010","difficulty":"hard","orderIndex":10,"question":"A Naive Bayes model is trained on real-valued continuous features (sensor data). The team uses Gaussian NB: $P(x_i|\\text{class}) = \\mathcal{N}(\\mu_{ic}, \\sigma_{ic}^2)$. Feature $x_3$ has a bimodal distribution within each class (two distinct peaks). The model achieves poor recall on class 1. What is the precise problem and fix?","options":{"A":"Gaussian NB cannot handle real-valued data — it must be discretized","B":"Gaussian NB assumes each feature follows a single Gaussian within each class — a bimodal within-class distribution violates this, causing the estimated mean and variance to represent a \"ghost\" distribution that doesn't reflect either peak; the model overestimates $P(x_3|\\text{class})$ between the two modes and underestimates it near the peaks, where the data actually lies","C":"The problem is insufficient training data for class 1 — more samples would fix the Gaussian fit","D":"Bimodal distributions require multinomial NB regardless of the feature type"},"correct":"B","explanation":{"correct":"- A bimodal distribution (e.g., measurements cluster near 10 and 40 within class 1) has mean ≈ 25 — a value rarely observed. 
Gaussian NB fits $\\mathcal{N}(25, \\sigma^2)$, which concentrates probability around 25 but gives low probability to observations near 10 or 40 (where actual data lives).\n- This causes systematic underestimation of $P(x_3 | \\text{class=1})$ for actual class-1 observations near either mode, reducing the posterior for class 1.\n- Fix: use Kernel Density Estimation (KDE) for the continuous distribution, discretize the feature into bins and use Multinomial NB, or use a mixture of Gaussians to model the bimodal within-class distribution.","A":"Gaussian NB handles real-valued data correctly when the Gaussian assumption holds. The issue is the violation of unimodality, not the data type.","B":"","C":"More training data would produce a more accurate estimate of the bimodal distribution's parameters — but the Gaussian model cannot represent a bimodal distribution regardless of sample count. The model class is wrong.","D":"Multinomial NB is designed for discrete count data. Applying it to bimodal continuous data would require discretization first. The NB variant choice should match the data type, not just the distribution shape."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09011","difficulty":"hard","orderIndex":11,"question":"A Naive Bayes model is trained incrementally: new training data arrives daily and the model is updated without retraining from scratch. 
How does Naive Bayes's generative model structure uniquely enable this incremental update, unlike discriminative models?","options":{"A":"Naive Bayes cannot be updated incrementally — full retraining is always required","B":"Naive Bayes stores sufficient statistics (class counts, word counts, feature sums and sums of squares) that are additive — each new sample updates the counts, and the probability estimates are recomputed directly; no gradient computation or full dataset is needed; discriminative models (logistic regression, neural networks) require full dataset gradient computation for principled incremental updates","C":"Incremental learning only works for Naive Bayes because it has fewer parameters","D":"Naive Bayes is the only model that can be updated incrementally because it uses Bayesian inference"},"correct":"B","explanation":{"correct":"- Multinomial NB: class count $N_c$ and feature count $\\text{count}(w, c)$ are sufficient statistics. Adding a new document with class $c$ and words $\\{w_1, ...\\}$: increment $N_c$ by 1 and increment $\\text{count}(w_i, c)$ by $x_i$, the number of times $w_i$ occurs in the document. Recompute $P(w_i|c)$ from updated counts.\n- This costs $O(d)$ per new sample ($d$ = number of features), independent of the total dataset size, so each update takes constant time with respect to the number of samples already seen.\n- Discriminative models (logistic regression, neural networks) minimize loss over training data. Updating with a new sample requires either full gradient computation (which accesses all past data) or stochastic gradient descent with forgetting effects. Neither is as clean as NB's sufficient statistic updates.","A":"sklearn's `MultinomialNB` explicitly supports `partial_fit()` for incremental learning. Naive Bayes is one of the few classic algorithms with principled online update support.","B":"","C":"Parameter count is not the determining factor. 
The determining factor is whether the model's parameters can be expressed as additive sufficient statistics of the data.","D":"Several other models support incremental learning (Perceptron, online SGD, Vowpal Wabbit). Naive Bayes's incremental property comes from its generative structure with additive sufficient statistics, not uniquely from Bayesian inference."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09012","difficulty":"hard","orderIndex":12,"question":"Two Naive Bayes models are trained on document classification: Model A uses Laplace smoothing α=1, Model B uses α=10. On a test set with many out-of-vocabulary words, Model B outperforms Model A. However, on test data similar to training, Model A performs better. Explain this behavior precisely.","options":{"A":"Higher α always improves model performance — Model A should always use α=10","B":"α=10 applies heavier smoothing: more probability mass is shifted to rare/unseen words, moving each class's word distribution toward the uniform distribution; this reduces overfitting to training word frequencies but adds more bias toward uniformity; on test data with many OOV words, that bias is less harmful than Model A's very small probabilities for unseen words; on in-distribution test data, the extra bias makes Model B worse than Model A, whose estimates stay sharper","C":"The performance difference is caused by Laplace smoothing only applying to word counts, not to class priors","D":"α=10 is equivalent to having 10 extra observations of each word, making Model B more robust by artificially increasing training size"},"correct":"B","explanation":{"correct":"- Laplace smoothing formula: $P(w|c) = \\frac{N_{wc} + \\alpha}{N_c + \\alpha|V|}$. With $\\alpha = 10$: a word never seen in class $c$ gets $P(w|c) = 10/(N_c + 10|V|)$ — higher than with $\\alpha=1$. 
All probabilities are pulled closer to $1/|V|$ (uniform).\n- On OOV-heavy test data: $\\alpha=1$ gives very small but not zero probabilities for unseen words (avoiding the catastrophic zero of no smoothing). $\\alpha=10$ gives larger probabilities for unseen words, making predictions less sensitive to OOV words.\n- On in-distribution test data: $\\alpha=1$ preserves more of the training distribution signal. $\\alpha=10$'s over-smoothing weakens the discriminative signal for known words.","A":"Higher α is not universally better. It's a bias-variance trade-off: more smoothing (higher α) reduces variance for OOV words but increases bias on in-distribution data.","B":"","C":"Laplace smoothing does apply to class priors too in some formulations ($P(c) = (N_c + \\alpha) / (N + K\\alpha)$ where K is number of classes). However, the performance difference described is specifically about word probability estimation.","D":"Adding α to counts is loosely analogous to α extra observations of each word, but this framing understates the effect: it uniformly distributes α observations across all vocabulary words, which is different from adding real word observations from training data."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09013","difficulty":"hard","orderIndex":13,"question":"A Naive Bayes classifier is used to detect toxic text. Features include word presence (Bernoulli NB). Feature \"hate\" has $P(\\text{hate}|\\text{toxic}) = 0.6$ and $P(\\text{hate}|\\text{not toxic}) = 0.001$. Log-odds ratio = $\\log(0.6/0.001) = 6.4$. An auditor discovers the model is biased: it flags text mentioning marginalized groups as toxic at higher rates. 
What NB property enables and hides this bias?","options":{"A":"The model is unbiased because it uses probabilistic outputs, not binary decisions","B":"NB's word-level independence assumption makes it transparent about which words drive predictions (high log-odds ratio words), but this also makes it easy for training data bias to embed directly into per-word probabilities — if the training corpus disproportionately associates group-identifying words with toxic content (historical bias), those words get high $P(\\text{word}|\\text{toxic})$ without the model having any mechanism to distinguish correlation from discriminatory association","C":"The bias is caused by Laplace smoothing, which amplifies toxic class probabilities","D":"Naive Bayes cannot be biased because it uses objective probabilities from training data"},"correct":"B","explanation":{"correct":"- NB directly encodes $P(w|\\text{toxic})$ from training data. If training data disproportionately labels text mentioning group-X as toxic (historical human labeling bias), then $P(\\text{group-X word}|\\text{toxic})$ is estimated as high — the model embeds this bias directly.\n- Unlike neural networks where bias is distributed across millions of parameters (hard to audit), NB's bias is transparent and inspectable: high-log-odds words are directly interpretable. This is both a strength (auditable) and a weakness (bias transfers directly).\n- The auditor can detect the bias and partially correct it by removing or downweighting group-identifying terms, or by reweighting training examples.","A":"Probabilistic outputs do not prevent bias. If the probability of \"toxic\" is consistently higher for text containing group identifiers due to training data bias, the model produces biased probability outputs.","B":"","C":"Laplace smoothing does not amplify the toxic class. 
It moves all word probabilities slightly toward uniform — it would reduce, not amplify, class-specific word probabilities.","D":"\"Objective probabilities from training data\" is precisely how bias embeds — if training data contains human labeling bias, the objective probabilities inherit that bias. No algorithm is immune to biased training data."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09014","difficulty":"hard","orderIndex":14,"question":"Naive Bayes and Logistic Regression are both trained on the same binary classification task. In the asymptotic limit (infinite training data), logistic regression converges to a better solution than Naive Bayes. But on small training sets (n < 30), Naive Bayes often outperforms logistic regression. What theoretical framework explains this empirical observation?","options":{"A":"Naive Bayes uses a better optimization algorithm than logistic regression for small datasets","B":"Naive Bayes is a generative model — it models the joint distribution $P(x, y)$ and has fewer effective parameters (one mean and variance per feature per class for GNB); it reaches its asymptotic error with fewer samples; logistic regression is a discriminative model that directly models $P(y|x)$ and requires more samples to estimate its parameters reliably, but achieves lower asymptotic error when its assumptions hold","C":"Logistic regression overfits to training data on small datasets, while Naive Bayes cannot overfit because it ignores feature correlations","D":"This observation is false — logistic regression always outperforms Naive Bayes regardless of dataset size"},"correct":"B","explanation":{"correct":"- The Ng & Jordan (2001) study formally showed this crossover: Naive Bayes achieves its asymptotic error after $O(\\log d)$ samples (d = features), while logistic regression requires $O(d)$ samples.\n- Generative models like NB have structural assumptions that constrain the solution space — the model 
\"knows\" the distribution structure. This inductive bias is helpful with little data.\n- Discriminative models make fewer assumptions and can fit any boundary, but need more data to determine which boundary is correct. Their lower asymptotic error comes from not being constrained by (possibly wrong) generative assumptions.","A":"Naive Bayes is not an optimizer-based model. Its parameters (class probabilities, feature likelihoods) are estimated directly from frequency counts. There's no optimization difference.","B":"","C":"Naive Bayes can overfit — especially with small training sets, estimated word probabilities may reflect training noise. The independence assumption acts as regularization, but it's not absolute protection against overfitting.","D":"This is empirically false. The Ng & Jordan paper directly demonstrates with experiments that Naive Bayes outperforms logistic regression on small datasets."},"reference":"- Ng & Jordan, \"On Discriminative vs. Generative Classifiers\": https://proceedings.neurips.cc/paper/2001/hash/7b7a53e239400a13bd566b1e94b2f4f6-Abstract.html"},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10001","difficulty":"easy","orderIndex":1,"question":"PCA is applied to a dataset with 50 features. The first principal component explains 45% of variance, the second explains 20%, and the third explains 15%. 
A data scientist says \"the first three components capture 80% of the variance, so we can safely discard the remaining 47 components.\" What important nuance does this claim miss?","options":{"A":"80% variance explained is always insufficient — you must retain 95% minimum","B":"Variance explained measures the proportion of total variance captured, but \"safely discard\" depends on the task — for visualization 80% is often enough, but for a downstream model, the 20% discarded variance may contain the signal most predictive of the target; the claim assumes variance ∝ information, which is only true if the task is reconstruction, not prediction","C":"The claim is correct — 80% is the standard threshold for PCA in all applications","D":"PCA cannot discard components because all 50 components together reconstruct the data exactly"},"correct":"B","explanation":{"correct":"- PCA maximizes explained variance — it finds directions of maximum data spread. But the target variable may correlate with low-variance directions. For example, a subtle survival signal in medical data might be captured by component 10 (2% variance) rather than component 1.\n- \"Explained variance\" measures how well PCA reconstructs the input $X$, not how well it predicts the output $y$. Discarding variance is safe only when you're doing unsupervised compression; for supervised prediction, you should evaluate downstream model performance on held-out data.\n- Alternative: supervised dimensionality reduction (LDA, PLS) finds components that maximize predictive power, not variance.","A":"There is no universal 95% threshold. The appropriate threshold depends on the task: 80% may be sufficient for noise reduction in image compression; 95% may be insufficient for a regression task where the target correlates with rare components.","B":"","C":"80% is a commonly cited heuristic, not a standard. 
The optimal number of components is task-dependent and should be evaluated empirically.","D":"PCA produces an ordered set of orthogonal components. Using only the first 3 is an approximation — you are discarding the remaining 47 dimensions' information, with some information loss."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10002","difficulty":"easy","orderIndex":2,"question":"PCA is applied to a dataset before training a logistic regression classifier. The PCA is fitted on the entire dataset (including test data), and then both train and test sets are transformed with the same PCA. A reviewer says this is a data leakage problem. Is the reviewer correct?","options":{"A":"No — PCA is unsupervised and doesn't use labels, so it cannot cause data leakage","B":"Yes — fitting PCA on the full dataset (including test data) means the principal components (eigenvectors) are computed using test-set variance information; these components may align with patterns specific to the test set, giving the model access to test distribution information during training","C":"The reviewer is partially correct — leakage only occurs if PCA reduces to 1 component","D":"Leakage from PCA only matters for non-linear PCA methods (kernel PCA); linear PCA is safe"},"correct":"B","explanation":{"correct":"- PCA computes eigenvectors of the feature covariance matrix. If the covariance matrix is estimated using all data (including test), the test data's variance structure is embedded in the principal components.\n- For example, if the test set has a unique cluster pattern, PCA may create a component that separates this cluster from the training data — the subsequent model then benefits from this structure during evaluation.\n- The correct approach: fit PCA on training data only, then apply the same PCA transformation to the test set. 
This is enforced by using `sklearn.pipeline.Pipeline`.","A":"\"Not using labels\" does not prevent leakage. Any information from the test set — distributional, structural, or statistical — that influences training constitutes leakage. PCA uses the covariance structure of all features.","B":"","C":"The number of components does not determine whether leakage occurs. With any number of components, fitting on test+train uses test information.","D":"This distinction is incorrect. Both linear PCA and kernel PCA are fitted on data. Any fitted transformation uses the data it was fitted on. Linear PCA has the same leakage risk as kernel PCA."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10003","difficulty":"easy","orderIndex":3,"question":"The first two principal components of a 100-feature dataset are plotted as a scatterplot. A data scientist interprets the plot and identifies three distinct clusters. A colleague says \"we can use this for clustering.\" What critical limitation must they acknowledge?","options":{"A":"PCA scatterplots cannot be used to identify clusters under any circumstances","B":"The 2D PCA visualization shows variance in the two highest-variance directions — clusters visible in 2D may not exist in the full 100-dimensional space, and clusters that exist in the full space may be invisible in the 2D projection; 2D PCA is a lossy projection that can create apparent clusters through projection artifacts or miss real high-dimensional structure","C":"The clusters are guaranteed to be real because PCA extracts the most informative dimensions","D":"Using PCA for clustering is only invalid if explained variance is below 90%"},"correct":"B","explanation":{"correct":"- Projection to 2D compresses 100 dimensions into 2. 
Points that are well-separated in the full space may overlap in the projection; overlapping points in the full space may appear separated due to the 2D \"shadow\" effect.\n- PCA finds max-variance directions, not max-cluster-separation directions. A dataset where clusters are separated along low-variance components will look homogeneous in a PCA plot despite having clear cluster structure.\n- t-SNE and UMAP are specifically designed for cluster visualization — they preserve neighborhood structure, not variance. They are preferred for exploratory cluster analysis.","A":"PCA scatterplots can be useful starting points for exploration. The limitation is in over-interpreting apparent clusters as definitive, not in using the plot entirely.","B":"","C":"PCA maximizes variance, not discriminative or clustering power. \"Most informative\" is relative to the task: for reconstruction, PC1/PC2 are most informative; for cluster separation, they may not be.","D":"There is no threshold below which PCA is \"invalid\" for clustering visualization. Even 95% variance retention can fail to reveal cluster structure if the clusters separate along the remaining 5%."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10004","difficulty":"easy","orderIndex":4,"question":"A scree plot shows eigenvalues: 8.2, 7.9, 7.6, 0.3, 0.2, 0.1. An analyst uses the \"elbow method\" to select the number of principal components. 
Where is the elbow and what does it indicate?","options":{"A":"The elbow is between components 1 and 2, indicating only 1 component should be kept","B":"The elbow is between components 3 and 4 — the first three eigenvalues (8.2, 7.9, 7.6) are large and similar; then there is a sharp drop to 0.3; the elbow indicates that 3 components capture the dominant variance structure, and additional components mainly capture noise","C":"The elbow is between components 5 and 6, and 5 components should be retained","D":"A flat scree plot with similar initial eigenvalues means PCA is not applicable"},"correct":"B","explanation":{"correct":"- The scree plot elbow method: find the point where the eigenvalue curve \"bends\" sharply — large variance to the left, noise variance to the right. The drop from 7.6 to 0.3 is a factor of 25 — a dramatic elbow.\n- Eigenvalues 8.2, 7.9, 7.6 suggest three approximately equal variance components (perhaps three underlying dimensions of equal importance). The components after the elbow (0.3, 0.2, 0.1) represent residual noise.\n- The elbow method is heuristic — the \"elbow\" is not always obvious. When eigenvalues decrease gradually, parallel analysis or cross-validation-based component selection is more reliable.","A":"Components 1-3 have nearly equal eigenvalues (~8) — there is no elbow between 1 and 2. The sharp drop is between 3 and 4.","B":"","C":"Components 4-6 all have small eigenvalues (0.3, 0.2, 0.1) and represent noise. There is no additional elbow at component 5-6.","D":"PCA is applicable to any dataset. A flat initial portion of the scree plot means multiple components are equally important — this is common and valid."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10005","difficulty":"easy","orderIndex":5,"question":"A team applies PCA to reduce features from 100 to 10 before training a neural network. The network achieves 78% accuracy. 
Without PCA (full 100 features), the network achieves 82%. A colleague says \"PCA always improves neural network performance by removing noise.\" Is this correct?","options":{"A":"Correct — PCA always helps neural networks by removing correlated features","B":"Incorrect — neural networks with sufficient capacity and data can learn to use high-dimensional input effectively; PCA discards the 90 lower-variance components, which may contain task-relevant signal; the 4-point accuracy drop suggests the discarded variance contained useful predictive information","C":"Correct — 78% accuracy from 10 components vs 82% from 100 features proves PCA is harmful in all cases","D":"The accuracy difference is within noise — the two results are statistically equivalent"},"correct":"B","explanation":{"correct":"- Neural networks with many hidden units can model nonlinear interactions across all 100 features. PCA discards 90 directions of variance — if any of these carry signal (even small variance-explained signal), the network loses that information.\n- PCA is most beneficial when training data is limited (fewer samples than features forces the network to generalize in high-dimensional space) or when computation savings are needed.\n- With ample data and computational resources, end-to-end feature learning (letting the network learn its own low-dimensional representation through the early layers) often outperforms manual PCA preprocessing.","A":"\"Always improves\" is definitively false. This example demonstrates the opposite. PCA is a tool with trade-offs, not a universally beneficial preprocessing step.","B":"","C":"The observed drop suggests PCA was harmful on this specific task. But it doesn't \"prove\" PCA is harmful in all cases. 
Other datasets and architectures may benefit from PCA preprocessing.","D":"A 4-point accuracy difference is usually larger than evaluation noise unless the test set is small: with $n$ test samples, the standard error of an accuracy near 80% is about $\\sqrt{0.8 \\times 0.2 / n}$, roughly 1.3 points at $n = 1000$."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10006","difficulty":"medium","orderIndex":6,"question":"PCA is applied to a 3D dataset where the data lies on a 2D Swiss roll (a nonlinearly curved manifold). The PCA projection to 2D flattens the roll into a crescent shape: the two ends of the roll, which are far apart along the manifold, appear close in PCA space. What does this reveal about PCA's limitations?","options":{"A":"PCA failed because the Swiss roll has more than 2 dimensions","B":"PCA finds the 2D linear subspace with maximum variance — it cannot \"unfold\" a curved manifold because it uses only linear projections; points that are far apart along the manifold can be close in 3D Euclidean space, so PCA places them consistently with Euclidean variance but incorrectly by manifold geodesic distance; nonlinear methods (UMAP, t-SNE, Isomap) are needed to preserve manifold structure","C":"PCA failed because the data was not standardized before application","D":"PCA produces correct results on the Swiss roll — the crescent is the geometrically correct 2D representation"},"correct":"B","explanation":{"correct":"- PCA computes eigenvectors of the covariance matrix — these are directions of maximum variance in the original Euclidean space. The Swiss roll's 2D manifold is curved; its intrinsic 2D coordinates cannot be reached by any linear projection.\n- The first two PCs capture the maximum-variance projection of the 3D roll, which squashes the curved surface. 
Points on different layers of the roll can have a small Euclidean distance yet be geodesically far apart along the manifold, and a linear projection keeps them close.\n- Isomap approximates geodesic distances with shortest paths on a nearest-neighbor graph instead of using straight-line Euclidean distances, which lets it \"unroll\" the Swiss roll; UMAP and t-SNE instead focus on preserving local neighborhoods.","A":"The Swiss roll is intrinsically 2-dimensional — PCA should in principle recover 2D structure. The failure is due to the nonlinearity of the manifold, not the dimensionality.","B":"","C":"Standardization would not help here. The failure is geometric (linear vs. nonlinear projection), not scale-related.","D":"The crescent is not the geometrically correct representation — it folds together parts of the roll that should be separated. The correct unrolled representation would show a rectangle or unfurled band."},"reference":"- Tenenbaum et al., \"A Global Geometric Framework for Nonlinear Dimensionality Reduction\" (Isomap): https://science.sciencemag.org/content/290/5500/2319"},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10007","difficulty":"medium","orderIndex":7,"question":"t-SNE is applied to visualize a 500-dimensional embedding space. The resulting 2D plot shows 8 clear clusters. A data scientist presents this to stakeholders saying \"our model has learned 8 distinct customer segments.\" A statistician pushes back. 
What is the statistician's concern?","options":{"A":"t-SNE visualizations are always incorrect and should never be used for presentations","B":"t-SNE preserves local neighborhood structure but distorts global distances — clusters in t-SNE plots look more separated than they actually are, and the number and appearance of clusters can change dramatically with different perplexity values; the 8 clusters may not correspond to 8 distinct real-world groups without validation with a downstream task","C":"t-SNE is only valid for image data and cannot be applied to embedding spaces","D":"The statistician's concern is invalid — 8 visible clusters definitively proves 8 segments exist"},"correct":"B","explanation":{"correct":"- t-SNE optimizes a different objective than PCA: it minimizes KL divergence between high-dimensional and low-dimensional neighborhood distributions. This preserves local structure (nearby points stay nearby) but distorts inter-cluster distances.\n- Hyperparameter sensitivity: changing perplexity (5 to 50) can change the apparent number and shape of clusters. t-SNE can create apparent clusters from uniformly distributed data.\n- Validation: the 8 t-SNE clusters should be validated against domain-meaningful criteria (customer behavior differences, business metrics). Visualization alone is not proof of segmentation.","A":"t-SNE is a valuable and widely used exploratory tool. The concern is not about its validity but about its interpretation limitations.","B":"","C":"t-SNE can be applied to any vector space — embeddings, genomics, audio features, text. It has no domain restriction.","D":"Visual clusters in t-SNE do not definitively prove real-world segments. The visualization may create apparent clusters through parameter tuning or reflect local noise patterns. 
Downstream validation is required."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10008","difficulty":"medium","orderIndex":8,"question":"PCA is applied to a gene expression dataset with 20,000 genes (features) and 200 samples. The resulting eigenvalue decomposition is computed on the $200 \\times 200$ covariance matrix rather than the $20,000 \\times 20,000$ matrix. Why is this dimensionality trick valid?","options":{"A":"The trick is invalid — eigenvalues from the $200 \\times 200$ matrix are different from those of the $20,000 \\times 20,000$ matrix","B":"The data matrix $X$ (200×20,000) has rank at most 200 — the covariance matrix $X^TX$ (20,000×20,000) therefore has at most 200 non-zero eigenvalues; computing eigendecomposition of $XX^T$ (200×200) gives the same non-zero eigenvalues and the corresponding eigenvectors can be derived analytically; this is the kernel trick / dual PCA","C":"The $200 \\times 200$ matrix is used only for computational speed — the eigenvectors are different but produce similar results","D":"PCA on $XX^T$ only works when the number of features is exactly 100× the number of samples"},"correct":"B","explanation":{"correct":"- Data matrix $X$ is $n \\times p$ ($200 \\times 20,000$). Rank($X$) $\\leq \\min(n, p) = 200$, so $X^TX$ has at most 200 non-zero eigenvalues.\n- SVD relationship: $X = U\\Sigma V^T$. The eigenvectors of $X^TX$ are the columns of $V$ (right singular vectors), and eigenvectors of $XX^T$ are columns of $U$ (left singular vectors). Non-zero eigenvalues of both are identical: $\\sigma_i^2$.\n- Recovering $V$ from $U$: $V_i = X^T U_i / \\sigma_i$. This gives the full PCA solution (principal components in 20,000-dimensional space) from the 200×200 computation.","A":"The non-zero eigenvalues of $X^TX$ and $XX^T$ are mathematically identical (by the SVD relationship). 
This is not an approximation — it is an exact equivalence.","B":"","C":"The eigenvectors are not different in meaning — they represent the same principal directions. The $200 \\times 200$ computation gives an exact (not approximate) solution for the non-zero components.","D":"The dual PCA trick works whenever $n < p$ (more features than samples). The specific ratio doesn't matter."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10009","difficulty":"medium","orderIndex":9,"question":"UMAP is applied to two datasets: Dataset A (tight, well-separated clusters) and Dataset B (continuous gradient, no distinct clusters). Both produce visually appealing 2D plots with apparent clusters. A researcher uses both as evidence of cluster structure. What is wrong with the interpretation for Dataset B?","options":{"A":"Nothing — UMAP always produces correct visualizations","B":"UMAP, like t-SNE, optimizes for local neighborhood preservation — it will create apparent cluster structure in its 2D output even when the underlying data has a continuous gradient; the discrete-looking clusters in Dataset B's UMAP plot are an artifact of the algorithm's neighborhood compression, not evidence of real distinct groups","C":"UMAP cannot handle continuous gradients — it should be replaced with PCA for Dataset B","D":"Dataset B's continuous gradient means UMAP will produce random noise output, not apparent clusters"},"correct":"B","explanation":{"correct":"- UMAP constructs a fuzzy topological representation of the data and optimizes the low-dimensional embedding to match. Points far apart in high-dimensional space are repelled in the embedding, creating \"white space\" between groups — even if those groups are really just ends of a continuum.\n- This is a known artifact: UMAP (and t-SNE) can create apparent clusters from uniform or continuously varying data. 
The visual separation in the plot reflects the algorithm's optimization objective (local preservation + global spreading), not necessarily real cluster boundaries.\n- Validation: are the apparent clusters in Dataset B correlated with any external label, domain category, or outcome? Without such validation, the clusters are visualization artifacts.","A":"UMAP creates visualization artifacts that can mislead interpretation. Apparent clusters from continuous data are a documented limitation.","B":"","C":"UMAP can handle continuous gradients — it will produce a visualization. The issue is how to interpret it, not whether UMAP applies.","D":"UMAP produces structured outputs, not random noise, even for continuous data. The structured output may reflect real (continuous) gradients, but the visual clustering effect makes it look discretized."},"reference":"- McInnes et al., \"UMAP: Uniform Manifold Approximation and Projection\": https://arxiv.org/abs/1802.03426"},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10010","difficulty":"hard","orderIndex":10,"question":"PCA is applied to a dataset of stock returns. The first principal component has roughly equal positive weights for all stocks. The second principal component has positive weights for technology stocks and negative weights for financial stocks. 
What are these components likely capturing?","options":{"A":"The first component captures outlier stocks; the second captures mean-reverting pairs","B":"The first principal component likely captures the overall market direction (a \"market factor\") — all stocks move together with the market; the second component captures a sector rotation factor — technology and financials tend to move in opposite directions; this is consistent with factor model theory (PCA recovers statistical risk factors)","C":"The first component captures volatility (standard deviation) and the second captures correlation structure","D":"Equal weights in PC1 indicate a data preprocessing error — PCA should produce diverse weights"},"correct":"B","explanation":{"correct":"- In equity returns, PCA-derived components often correspond to interpretable market factors: PC1 is typically a market factor (all stocks have the same sign loading — when the market goes up, all stocks tend to go up); PC2 often captures sector-rotation effects.\n- This is the basis of statistical factor models in finance, which extract risk factors by applying PCA or factor analysis to the return covariance or correlation matrix. By contrast, fundamental factor models such as Barra and the Fama-French factors are built from firm characteristics and sorted portfolios rather than from PCA.\n- The equal-weight PC1 is an empirical result, not a preprocessing error — it reflects the strong common factor (market beta) shared by all stocks.","A":"PCA components are not defined in terms of outliers or mean-reversion. Outliers might influence the covariance matrix, but the PC interpretation is about variance structure, not individual sample properties.","B":"","C":"PC1 captures the direction of maximum variance (market returns vary synchronously) — not standard deviation. PCA components are eigenvectors of the covariance matrix, not dispersion statistics.","D":"Equal weights in PC1 are a meaningful signal (all stocks share the market factor), not an artifact. 
PCA output reflects the data's covariance structure, not an error."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10011","difficulty":"hard","orderIndex":11,"question":"A team reduces dimensions from 200 to 20 using PCA, then trains a logistic regression model. Test accuracy is 84%. The team wants to add interpretability: \"which original features matter most?\" Why is this question difficult to answer after PCA, and what alternative preserves interpretability?","options":{"A":"Interpretability is preserved in PCA because each PC corresponds to one original feature","B":"PCA principal components are linear combinations of all original features — a coefficient in a PC is not the same as the feature's importance to the downstream model; to recover feature importance, you must propagate the logistic regression weights back through the PCA transformation ($w_{\\text{original}} = V \\cdot w_{\\text{LR}}$ where V is the PCA loading matrix), or use an interpretable method without PCA (Lasso regression, tree models, or SHAP on the original features)","C":"Feature importance is impossible to determine after any dimensionality reduction technique","D":"Simply rank the original features by their loading on PC1 — the PC1 loading magnitude determines feature importance to the downstream model"},"correct":"B","explanation":{"correct":"- PCA transformation: $z = V^T x$ where $V$ is the $200 \\times 20$ loading matrix (each column is a principal component). Logistic regression learns weights $w_{LR} \\in \\mathbb{R}^{20}$.\n- The implicit model is: $\\hat{y} = \\sigma(w_{LR}^T V^T x + b) = \\sigma((V w_{LR})^T x + b)$. The effective weights in original feature space: $w_{\\text{eff}} = V w_{LR}$.\n- This back-transformation gives a single weight per original feature, enabling feature importance interpretation. 
However, this only works for linear models; nonlinear models after PCA require different approaches.","A":"Each PC is a weighted combination of all original features, not one-to-one. A single original feature may load heavily on several PCs.","B":"","C":"Feature importance can be recovered by back-transforming through the PCA loadings. The complexity increases but it is mathematically tractable.","D":"PC1 loading magnitude measures how much each feature contributes to the first principal component (direction of max variance), not how important it is to the downstream model. The downstream model may weight PC1 weakly and PC5 strongly."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10012","difficulty":"hard","orderIndex":12,"question":"A team uses t-SNE for visualizing high-dimensional customer embeddings. They set perplexity=5 and see 20 clusters. They rerun with perplexity=50 and see 5 clusters. A manager asks: \"which is correct?\" What is the principled answer?","options":{"A":"Perplexity=50 is always correct because it uses more neighbors","B":"t-SNE's perplexity controls the effective number of nearest neighbors — low perplexity emphasizes local micro-structure (can fragment one real cluster into many apparent sub-clusters), high perplexity emphasizes global macro-structure; both visualizations may be \"correct\" at their respective scales; the \"right\" number of clusters requires validation with external criteria (domain labels, business outcomes), not just visual inspection","C":"The two runs show that t-SNE is random and the results are meaningless","D":"The correct visualization is the one with fewer clusters because more clusters indicate overfitting in the visualization"},"correct":"B","explanation":{"correct":"- Perplexity in t-SNE is roughly analogous to the number of effective nearest neighbors (bandwidth of the Gaussian kernel in high-dimensional space). 
Typical recommendations: 5-50, with larger datasets favoring larger perplexity.\n- Low perplexity: each point only cares about its immediate neighbors — can create many small, tight clusters by separating natural sub-groups that are real micro-structure or noise fragmentation.\n- High perplexity: broader neighborhood — clusters merge more readily, showing macro-structure. Neither result is definitively \"correct\" — they show the data at different resolution scales.\n- Correct validation: do the 5 macro-clusters correspond to business-meaningful segments? Do the 20 micro-clusters show consistent sub-behaviors?","A":"More neighbors doesn't mean \"more correct\" — it means viewing the structure at a coarser scale. For some purposes, fine-grained micro-structure is exactly what you want.","B":"","C":"t-SNE is stochastic but its results are reproducible with fixed random seed. The sensitivity to perplexity is a feature (multi-scale view), not random noise.","D":"More clusters don't indicate \"overfitting\" in the visualization. The number of clusters reflects the perplexity scale, not model complexity."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10013","difficulty":"hard","orderIndex":13,"question":"A researcher compares UMAP and t-SNE on the same high-dimensional dataset for downstream clustering. UMAP runs 200× faster and is used in production. 
A statistician notes: \"UMAP preserves global structure better than t-SNE.\" What specific property of UMAP's objective makes this claim accurate?","options":{"A":"UMAP is faster, which implies it better preserves global structure","B":"t-SNE's cost function places high penalty only on nearby points in the high-dimensional space — it ignores the placement of non-neighboring points; UMAP's cost function has both attraction terms (for close neighbors) and repulsion terms (for distant points), with the repulsion explicitly positioning non-neighboring points away from each other — this provides more consistent global structure preservation","C":"UMAP uses Euclidean distance while t-SNE uses cosine similarity, making UMAP more accurate for spatial data","D":"Both algorithms preserve global structure equally — the speed difference is the only practical distinction"},"correct":"B","explanation":{"correct":"- t-SNE minimizes KL divergence between high-dimensional and low-dimensional neighborhood probabilities. The KL divergence is asymmetric: it penalizes placing nearby high-dimensional points far apart (local structure preservation) but gives less guidance for placing distant points.\n- UMAP minimizes binary cross-entropy: $L = -\\sum_{(i,j)} [w_{ij} \\log(\\hat{w}_{ij}) + (1-w_{ij})\\log(1-\\hat{w}_{ij})]$ where $w_{ij}$ is the fuzzy neighborhood membership in the high-dimensional space and $\\hat{w}_{ij}$ is the corresponding similarity in the embedding. The $(1-w_{ij})$ term provides explicit repulsion for non-edges, pushing non-neighboring points apart.\n- In practice: UMAP visualizations maintain relative positions of clusters (macro-structure), while t-SNE plots can have cluster positions that are arbitrary and change between runs.","A":"Computational speed has no logical connection to the quality of global structure preservation. Speed comes from algorithmic optimizations (negative sampling, SGD-based optimization), not from the quality of the embedding.","B":"","C":"Both UMAP and t-SNE can use various distance metrics. 
The default is Euclidean for both in standard implementations. The metric choice is a user parameter, not an inherent difference.","D":"Global structure preservation is documented as better in UMAP vs t-SNE in the original UMAP paper and subsequent comparisons. They are not equivalent in this regard."},"reference":"- McInnes et al., \"UMAP vs t-SNE\": https://umap-learn.readthedocs.io/en/latest/how_umap_works.html"},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11001","difficulty":"easy","orderIndex":1,"question":"K-means clustering is run on a 2D dataset. After 100 iterations, the cluster assignments stop changing. A junior analyst says \"the algorithm found the optimal clustering.\" What is wrong with this statement?","options":{"A":"K-means always finds the global optimum — the statement is correct","B":"K-means is guaranteed to converge (cluster assignments stop changing) but only to a local minimum of the within-cluster sum of squares (WCSS) objective — different random initializations can produce different converged solutions; \"optimal\" requires the global minimum, which K-means cannot guarantee","C":"K-means convergence requires exactly 1,000 iterations — convergence at 100 means an error occurred","D":"K-means minimizes between-cluster variance, not within-cluster variance, so convergence doesn't relate to optimality"},"correct":"B","explanation":{"correct":"- K-means objective: minimize $J = \\sum_{k=1}^{K} \\sum_{x_i \\in C_k} ||x_i - \\mu_k||^2$ (WCSS). The algorithm alternates between assignment and update steps, each reducing $J$.\n- Convergence is guaranteed because there are finitely many possible assignments and $J$ decreases at each step. 
But the converged solution is a local minimum — different starting centroids can yield different final clusterings with different $J$ values.\n- Best practice: run K-means multiple times (e.g., 10-20 runs with different random seeds) and keep the solution with the lowest WCSS. scikit-learn's `KMeans` exposes this through the `n_init` parameter (older releases defaulted to `n_init=10`; recent releases default to `'auto'`, which runs a single initialization when k-means++ is used).","A":"K-means is a local search algorithm. Finding the globally optimal partition that minimizes WCSS is NP-hard (for K≥2 clusters in general). K-means makes no global optimality guarantee.","B":"","C":"Convergence can occur in any number of iterations — it depends on the data structure and initialization. Convergence at iteration 5 or 1,000 are both valid.","D":"K-means minimizes within-cluster variance (WCSS) — the total squared distance from each point to its assigned centroid. Maximizing between-cluster distance is related but not the direct K-means objective."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11002","difficulty":"easy","orderIndex":2,"question":"A data scientist is choosing K for K-means on a customer segmentation dataset. They plot WCSS against K from 1 to 15 and see the curve decreasing monotonically. They select K=12 because the curve is still decreasing. 
Is this a valid approach?","options":{"A":"Yes — a decreasing WCSS curve means K=12 is better than K=11","B":"No — WCSS always decreases as K increases (at K=n, WCSS=0); selecting K where WCSS is still falling ignores the law of diminishing returns; the correct approach is the \"elbow method\" (find K where the rate of decrease sharply slows) or validated metrics like the Silhouette score, Gap statistic, or Calinski-Harabasz index","C":"The correct K is always the K that minimizes WCSS, which is the maximum K tested","D":"A monotonically decreasing WCSS curve indicates the data has no cluster structure and K-means should not be used"},"correct":"B","explanation":{"correct":"- Mathematical property: as K increases, each cluster gets smaller, decreasing within-cluster distances. At K=n (one cluster per point), WCSS = 0 — trivially but uselessly perfect.\n- The elbow method seeks the K where adding another cluster yields diminishing improvements in WCSS. Beyond the elbow, you're splitting natural clusters.\n- Better approaches: Silhouette score measures cohesion vs separation — maximizing it gives a principled K; Gap statistic compares WCSS to that expected under null (no structure) distribution.","A":"While K=12 has lower WCSS than K=11, this doesn't mean K=12 is a better clustering — it may be over-partitioning. Lower WCSS is expected as K grows and is not the criterion for \"better\" clustering.","B":"","C":"Maximum K (one point per cluster) trivially minimizes WCSS but is meaningless. The goal is to find meaningful compact groups, not minimize WCSS at any cost.","D":"A monotonically decreasing curve is the expected behavior for any dataset — it does not indicate lack of cluster structure. Lack of structure would manifest as no clear elbow."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11003","difficulty":"easy","orderIndex":3,"question":"DBSCAN is applied with eps=0.5 and min_samples=5. 
Some points are labeled as noise (-1). A team member says \"noise points in DBSCAN are just outliers — we can remove them from future datasets.\" Is this a valid conclusion?","options":{"A":"Yes — DBSCAN noise points are definitionally outliers and should be removed","B":"DBSCAN noise points are points that don't satisfy the density threshold (fewer than min_samples neighbors within eps, and not within eps of any core point) — whether they are \"outliers\" depends on context; if eps and min_samples were poorly chosen, normal points get labeled noise; the labels are specific to the hyperparameter choice, not an objective outlier determination; tuning eps (e.g., via k-distance plot) is needed first","C":"Noise points should be assigned to the nearest cluster, not removed","D":"DBSCAN noise points are always correct outlier labels and should always be removed in preprocessing"},"correct":"B","explanation":{"correct":"- A noise point is one that is neither a core point (≥min_samples neighbors within eps) nor a border point (within eps of a core point). This is a local density criterion.\n- If eps is too small, even dense-region points may be labeled as noise. If eps is too large, noise points are absorbed into clusters and separate clusters merge. The k-distance plot method: compute distance to the k-th nearest neighbor for all points, sort, and find the elbow — this suggests the appropriate eps.\n- Valid uses of noise labels: anomaly detection after careful hyperparameter tuning. Invalid use: blindly removing noise-labeled points from future datasets without understanding the hyperparameter sensitivity.","A":"DBSCAN noise is a relative concept — it depends entirely on eps and min_samples. 
The same point may be a noise point with eps=0.3 and a core point with eps=0.8.","B":"","C":"Assigning noise to the nearest cluster is what DBSCAN deliberately avoids: points that are not within eps of any core point are left unassigned rather than forced into a cluster.","D":"Even well-tuned DBSCAN noise labels are specific to the dataset and parameter choices. \"Always\" remove is too strong — sometimes noise points are sparse-region legitimate data, not errors to exclude."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11004","difficulty":"easy","orderIndex":4,"question":"K-means is applied to customer data (age in years 20-70, annual income in dollars 20,000-200,000). The resulting clusters are almost entirely driven by income. What caused this and how should it be fixed?","options":{"A":"K-means is biased toward clustering on higher-cardinality features — this is expected behavior","B":"K-means uses Euclidean distance — income values (20,000-200,000) have much larger absolute magnitude than age (20-70); the income dimension dominates the distance calculation; fix: standardize all features to zero mean and unit variance (or scale to [0,1]) before applying K-means","C":"The clusters are correct — income is inherently more important than age for customer segmentation","D":"This can be fixed by increasing K to include more clusters"},"correct":"B","explanation":{"correct":"- K-means distance: $||x_i - \\mu_k||^2 = (\\text{age}_i - \\mu_{k,\\text{age}})^2 + (\\text{income}_i - \\mu_{k,\\text{income}})^2$. Income dominates because its range is thousands of times larger than that of age: an income difference of 160,000 contributes $(160,000)^2 = 2.56 \\times 10^{10}$ to the squared distance, while the largest possible age difference contributes only $(70-20)^2 = 2,500$. Income swamps age.\n- Standardization (z-score): $x' = (x - \\mu)/\\sigma$. 
After scaling, each feature contributes proportionally to its variability in standard deviation units.\n- Domain judgment: if income really should be weighted more heavily, use weighted distance or feature weighting explicitly — but this should be a deliberate choice.","A":"K-means is not \"biased toward high-cardinality features\" — it's biased toward features with large absolute values in the distance calculation. Cardinality is irrelevant; scale is the issue.","B":"","C":"Whether income is more important than age is a domain question. K-means shouldn't make this decision implicitly through scale artifacts — it should be made explicitly through feature engineering.","D":"Increasing K doesn't fix scale imbalance — with more clusters, income will still dominate the assignments. The scale issue remains regardless of K."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11005","difficulty":"easy","orderIndex":5,"question":"Hierarchical agglomerative clustering (HAC) is applied with Ward linkage. The dendrogram shows a clear merge at distance 10 that joins two major branches, and then many small merges below. A team cuts the dendrogram at height 10. What does this produce?","options":{"A":"A single cluster — height 10 means stopping when all points are in one cluster","B":"Cutting the dendrogram at height 10 produces the clusters that existed just before the merge at height 10 — the number of clusters equals the number of branches crossing the horizontal line at that height; a clear jump from many small merges (below 10) to a large merge (at 10) suggests 2 major natural clusters exist in the data","C":"Height 10 is the optimal cut point only for Ward linkage, not other linkage methods","D":"Cutting at height 10 produces 10 clusters — the cut height equals the cluster count"},"correct":"B","explanation":{"correct":"- HAC builds a tree (dendrogram) by greedily merging the two closest clusters at each step. 
The merge height represents the distance between merged clusters.\n- Cutting at height $h$: draw a horizontal line at $h$; each branch crossing the line is a cluster. If two major branches merge at height 10 and many small merges occur below 5, cutting at height 7-9 gives 2 clusters representing the two main groups.\n- The \"large jump\" heuristic: if there is a large increase in merge height at one step, cutting just below that step produces natural clusters. The jump at 10 suggests two genuinely distinct groups (they were far apart before being forced to merge).","A":"Cutting at height 10 stops agglomeration when the next merge would cost distance 10. This produces multiple clusters, not one. One cluster appears only at the very top of the dendrogram.","B":"","C":"The interpretation of dendrogram cuts is the same for all linkage methods. The heights on the y-axis differ by linkage (Ward uses variance increase; single/complete use point-to-point distances), but the cutting procedure is identical.","D":"The cut height has no direct relation to cluster count. Cut height determines the distance threshold; the number of clusters depends on how many branches cross that height."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11006","difficulty":"medium","orderIndex":6,"question":"A dataset contains clusters of varying density: Cluster A has 10,000 tightly packed points in a $0.5 \\times 0.5$ region; Cluster B has 100 loosely distributed points in a $50 \\times 50$ region. K-means (K=2) and DBSCAN are both applied. K-means correctly identifies 2 clusters; DBSCAN with a single eps struggles. 
Explain why DBSCAN fails here.","options":{"A":"DBSCAN fails because K=2 is too small for DBSCAN to identify","B":"DBSCAN uses a single global density threshold (eps, min_samples) — Cluster A is extremely dense (eps must be small to avoid merging with noise), while Cluster B is sparse (eps must be large to connect its distant points); a single eps cannot accommodate both densities simultaneously — Cluster B's points appear as noise to Cluster A's density threshold","C":"DBSCAN fails because it only works for circular clusters","D":"DBSCAN's time complexity prevents it from handling 10,000 points in one cluster"},"correct":"B","explanation":{"correct":"- DBSCAN defines core points by density (≥ min_samples within eps radius). For the dense Cluster A, eps = 0.1 would work well. For the sparse Cluster B, eps needs to be ~5.0 to connect distant points. A single eps cannot serve both.\n- With small eps: Cluster A is correctly identified; Cluster B's points become noise (each has few neighbors within eps=0.1).\n- With large eps: Cluster B is identified; Cluster A merges into one giant cluster, but worse — all nearby noise points and edges of Cluster A may merge with Cluster B.\n- Solution: HDBSCAN (Hierarchical DBSCAN) extracts a hierarchy of density-based clusters and can handle varying densities by considering multiple density levels.","A":"DBSCAN doesn't require a K parameter — it discovers the number of clusters automatically. This is a strength, not a failure mode related to K.","B":"","C":"DBSCAN correctly identifies arbitrary shapes (a key advantage over K-means). The failure here is about density variation, not cluster shape.","D":"DBSCAN's complexity is $O(n \\log n)$ with spatial indexing. 10,000 points is trivial computationally. 
The failure is algorithmic (single eps), not computational."},"reference":"- HDBSCAN paper: https://link.springer.com/chapter/10.1007/978-3-642-37456-2_14"},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11007","difficulty":"medium","orderIndex":7,"question":"K-means++ initialization is compared to random initialization on a clustering task. K-means++ consistently achieves lower WCSS with fewer restarts. What is the core innovation of K-means++ that produces this improvement?","options":{"A":"K-means++ selects all centroids from the training data, while random initialization allows centroids outside the data range","B":"K-means++ selects each subsequent centroid with probability proportional to its squared distance from the nearest already-chosen centroid — this spreads initial centroids across the data and avoids placing multiple centroids in the same dense region, starting the algorithm closer to a good solution","C":"K-means++ runs multiple complete K-means iterations as part of initialization, making the algorithm slower","D":"K-means++ uses the K-medoids algorithm for initialization, which is more robust than centroid-based methods"},"correct":"B","explanation":{"correct":"- K-means++ algorithm: (1) Pick a random point as first centroid. (2) For each remaining data point, compute $d(x)^2$ = squared distance to nearest chosen centroid. (3) Select next centroid with probability $p(x) \\propto d(x)^2$. (4) Repeat until K centroids selected.\n- This probabilistic selection naturally spreads centroids: high-distance points (far from all current centroids) get high selection probability, ensuring initial centroids span the data space.\n- Theoretical guarantee: the expected WCSS of K-means++ is within an $O(\\log K)$ factor of the optimal WCSS, whereas uniformly random initialization has no comparable guarantee.","A":"Standard K-means random initialization also selects centroids from the training data points (or randomly from the feature space). 
Both methods start from data points; the difference is the selection probability.","B":"","C":"K-means++ only selects K initial centroids — it doesn't run K-means iterations during initialization. It's $O(nK)$ to initialize, then the standard K-means iterations follow.","D":"K-medoids uses actual data points as cluster representatives (not centroids). K-means++ is still standard K-means (using means as centroids) — the ++ only refers to the smarter initialization."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11008","difficulty":"medium","orderIndex":8,"question":"Gaussian Mixture Models (GMM) and K-means are both applied to the same dataset for clustering. GMM is described as a \"soft\" clustering method. What specific capability does GMM have that K-means lacks, and when does this matter?","options":{"A":"GMM is always better than K-means because it uses the Gaussian distribution assumption","B":"GMM produces probabilistic cluster membership: each point is assigned a probability of belonging to each cluster ($p(\\text{cluster}_k | x_i)$) rather than a hard assignment; this matters when points near cluster boundaries genuinely have ambiguous membership, and when modeling elongated or correlated clusters (GMM allows elliptical covariance; K-means assumes spherical equal-variance clusters)","C":"GMM is simply K-means with a different distance metric","D":"The \"soft\" property means GMM uses gradient descent instead of the EM algorithm"},"correct":"B","explanation":{"correct":"- K-means: each point is assigned to exactly one cluster (hard assignment). Boundary points get an arbitrary assignment.\n- GMM: each Gaussian component has parameters $(\\mu_k, \\Sigma_k, \\pi_k)$. EM computes $p(\\text{cluster}_k | x_i)$ — a point near two cluster boundaries might be 60%/40% split. 
This is especially useful for recommendations (a customer might partially belong to two segments).\n- GMM covariance structure: full covariance $\\Sigma_k$ can model elongated, tilted clusters. K-means uses squared Euclidean distance, implicitly assuming spherical clusters of equal variance.\n- When GMM matters: imbalanced cluster sizes/shapes, boundary uncertainty quantification, probabilistic downstream decisions.","A":"GMM has the Gaussian assumption, which can fail for non-Gaussian clusters. K-means (distance-based) may outperform GMM on non-Gaussian data. Neither is universally better.","B":"","C":"GMM is not K-means with a different distance. GMM is a probabilistic generative model fitted by EM; K-means is a non-probabilistic distance minimization. They have different objectives and produce qualitatively different outputs.","D":"GMM uses the EM (Expectation-Maximization) algorithm — not gradient descent. \"Soft\" refers to soft (probabilistic) cluster assignments, not the optimization method."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11009","difficulty":"medium","orderIndex":9,"question":"The Silhouette score for a K-means clustering result is 0.12 (scale -1 to +1). 
A data scientist says \"clustering failed because the score is close to 0.\" Is this the correct interpretation?","options":{"A":"A silhouette score of 0.12 is excellent — it is very close to the maximum possible value","B":"A silhouette score near 0 means the clustering is not much better than random cluster assignment — clusters overlap significantly and points are near the boundaries of multiple clusters; however, \"failure\" should be validated by checking if the data has any cluster structure at all (e.g., using the Gap statistic) before concluding clustering is inappropriate","C":"The silhouette score has no fixed interpretation threshold — 0.12 may indicate excellent clustering depending on the domain","D":"The score of 0.12 means exactly 12% of points are correctly clustered"},"correct":"B","explanation":{"correct":"- Silhouette score for point $i$: $s(i) = (b_i - a_i) / \\max(a_i, b_i)$ where $a_i$ = mean distance within cluster, $b_i$ = mean distance to nearest other cluster. Range: [-1, 1].\n- Score near 0: $a_i \\approx b_i$ — the point is equally \"at home\" in its cluster and the nearest other cluster. This means poor cluster separation/cohesion.\n- Score near -1: the point is closer to another cluster (wrong assignment). Score near +1: tightly in its cluster, far from others (good).\n- General benchmarks: >0.7 = strong, 0.5-0.7 = reasonable, 0.25-0.5 = weak, <0.25 = no substantial structure — but these are guidelines, not hard rules.","A":"0.12 is not close to 1.0 (the maximum). The scale is -1 to +1; 0.12 is near the middle, indicating weak clustering.","B":"","C":"While domain context matters, there are established benchmarks for silhouette scores. 0.12 indicates weak structure in virtually any domain context.","D":"Silhouette score is not a percentage of correctly clustered points. It measures the cohesion-to-separation ratio of clustering quality. 
\"Correct\" assignment is undefined in unsupervised clustering."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11010","difficulty":"hard","orderIndex":10,"question":"K-means is applied to text data represented as TF-IDF vectors (sparse, high-dimensional). The resulting clusters appear random — high intra-cluster variance and low inter-cluster separation. What fundamental property of high-dimensional Euclidean space explains this failure?","options":{"A":"K-means fails on text because TF-IDF values are not normally distributed","B":"In high-dimensional spaces, the curse of dimensionality causes pairwise Euclidean distances to concentrate — all pairs of points have nearly equal distances; this makes the concept of \"nearest cluster\" ambiguous; additionally, TF-IDF vectors are sparse (most values zero), and cosine similarity (capturing directional similarity regardless of magnitude) is more appropriate than Euclidean distance for text","C":"K-means fails on text because it requires exactly 2 clusters","D":"The failure is caused by the TF normalization — removing the normalization fixes the Euclidean distance problem"},"correct":"B","explanation":{"correct":"- Distance concentration in high dimensions: as dimensions $d \\to \\infty$, the ratio $(\\max_{\\text{dist}} - \\min_{\\text{dist}}) / \\min_{\\text{dist}} \\to 0$. All points become nearly equidistant. K-means centroids are nearly equidistant from most points — assignments become essentially random.\n- TF-IDF vectors are sparse (e.g., 10,000 dimensions, ~100 non-zeros). Two documents about different topics: both have mostly-zero vectors. Dimensions where both are zero contribute nothing to the Euclidean distance, so the distance is driven by vector magnitudes (document length) and the few non-overlapping non-zero terms rather than by topical overlap.\n- Cosine similarity: $\\cos(\\theta) = (x_i \\cdot x_j) / (||x_i|| \\cdot ||x_j||)$. The dot product is non-zero only on dimensions where both documents have non-zero values, and the normalization removes the effect of vector magnitude. 
Captures topical overlap.\n- Fix: use K-means with cosine distance (spherical K-means) or use topic models (LDA) for text clustering.","A":"K-means makes no distribution assumption. The TF-IDF value distribution is not the cause of failure.","B":"","C":"K-means supports any K. The failure is a fundamental algorithmic-geometric issue, not a K value problem.","D":"TF normalization is a feature of TF-IDF that weights by document frequency — removing it makes representations worse, not better. It doesn't fix the Euclidean distance problem."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11011","difficulty":"hard","orderIndex":11,"question":"A K-means model is trained on 1 million data points with K=100 clusters. The resulting centroids are saved, and new data is assigned to the nearest centroid in production (no retraining). A data engineer notices some centroids have 0 assigned points in production even though they had many training points. What could cause this, and what are the consequences?","options":{"A":"0-assigned centroids indicate a bug in the assignment code — K-means guarantees every centroid has points","B":"The production data distribution has shifted (data drift) — the regions represented by those centroids no longer contain production data; those centroids are \"dead\" and waste model capacity; the model's effective K is reduced, leading to worse coverage of production distribution; the fix is periodic retraining or monitoring for concept drift","C":"0-assigned centroids always occur in K-means and can be safely ignored","D":"0-assigned centroids indicate the training data had duplicate points — removing duplicates before training fixes the issue"},"correct":"B","explanation":{"correct":"- \"Dead centroid\" problem: centroids that never win the nearest-centroid assignment race. At training time, this is handled by re-initializing empty centroids. 
In production (frozen centroids), data drift can cause centroids to represent regions of the feature space no longer populated by production data.\n- Consequence: the model treats some production regions as belonging to distant centroids, increasing effective intra-cluster variance where production data actually is.\n- Monitoring: track the number of assigned points per centroid in production. If centroids routinely have zero assignments, trigger model retraining.","A":"K-means training handles empty centroids by re-initializing them. In production serving (static centroids), there's no re-initialization — production data can easily miss some centroid regions if the distribution has shifted.","B":"","C":"0-assigned centroids at training time indicate initialization problems (fixed by K-means++ or re-initialization). At production time, they indicate distribution shift — not ignorable. Effective K reduction degrades performance.","D":"Duplicate training points would cause multiple identical centroids, not 0-assigned centroids. Data drift is the more likely cause when good K-means training is used."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11012","difficulty":"hard","orderIndex":12,"question":"A researcher runs K-means 20 times with different random seeds, selecting the best solution by WCSS. They then compare the stability of cluster assignments across runs: point assignments differ significantly between runs even though WCSS values are similar. 
What does this instability reveal about the dataset?","options":{"A":"The algorithm has a bug — K-means should produce identical results across runs with similar WCSS","B":"Similar WCSS with different assignments indicates multiple near-equivalent local optima — the data likely has weak cluster structure or clusters with similar densities; the WCSS landscape has many flat valleys of similar depth; this means the \"best\" clustering by WCSS is not much better than many other equally valid clusterings","C":"Instability means the optimal K is wrong — changing K will stabilize assignments","D":"WCSS similarity across runs guarantees assignment similarity — the observation described is mathematically impossible"},"correct":"B","explanation":{"correct":"- Multiple near-equivalent local optima occur when cluster boundaries are ambiguous — the data doesn't have well-separated, clearly defined groups. In this case, many different partitions achieve similar total WCSS because there's no single \"natural\" clustering.\n- This is an important diagnostic: cluster instability suggests the data may not have strong cluster structure. Forcing K clusters on data with no natural groups (or with K-1 natural groups) creates this landscape.\n- Assessment: compare the best WCSS found to the expected WCSS for randomly distributed data (Gap statistic). If there's no significant difference, the data may not be clusterable.","A":"K-means is non-deterministic (random initialization). Different seeds legitimately explore different parts of the optimization landscape. Similar WCSS values can occur at different local minima.","B":"","C":"Changing K might help if K is misspecified, but instability can persist at any K when the data has no strong cluster structure. Changing K is worth trying but isn't guaranteed to fix instability.","D":"WCSS is a scalar metric — many different clustering configurations can yield the same or similar WCSS values. 
Identical WCSS does not imply identical assignments."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11013","difficulty":"hard","orderIndex":13,"question":"A team uses clustering for customer segmentation with K=5, then trains a separate binary classifier (purchase prediction) on each cluster's data. This \"cluster-then-classify\" approach achieves 84% accuracy vs 81% for a single global model. A statistician warns: \"this evaluation may be optimistic.\" What is the statistical concern?","options":{"A":"Cluster-then-classify always produces optimistic results because it uses more models","B":"If the clustering and classification are both evaluated on the same data, or if the cluster boundaries are informed by the outcome variable (purchase), the evaluation is circular — the clusters may have been implicitly chosen to separate buyers from non-buyers; additionally, cluster assignments at test time require the new point to be assigned to a training cluster, introducing leakage if clustering was performed on train+test together","C":"The improvement from 81% to 84% is too small to be statistically significant","D":"The statistician is wrong — using separate models for each cluster is always more accurate than a global model"},"correct":"B","explanation":{"correct":"- Two sources of leakage in cluster-then-classify: (1) If K-means clustering used all data (train+test), the clusters encode test distribution information. 
(2) If K is selected or clusters are interpreted to maximize classification accuracy, the entire approach is optimizing on the test set.\n- Correct procedure: fit K-means on training data only → assign training points to clusters → train one classifier per cluster on training data → at test time, assign test points to nearest training centroid → apply the corresponding cluster classifier.\n- Even with correct procedure, the 84% vs 81% comparison requires statistical significance testing (e.g., McNemar's test for paired predictions) to confirm the improvement is real.","A":"\"Cluster-then-classify\" is not inherently optimistic. With proper train/test separation (no leakage), the comparison can be valid. The concern is about methodological correctness, not the approach itself.","B":"","C":"Whether the improvement is statistically significant is a separate (valid) concern, but the statistician's warning is about potential methodological flaws (leakage), not effect size.","D":"\"Always more accurate\" is false. A global model has more training data per classifier. Cluster-specific models have less data per cluster. For small datasets, global models may generalize better."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12001","difficulty":"easy","orderIndex":1,"question":"An Isolation Forest model is trained to detect anomalies in server logs. 
A data scientist explains: \"Isolation Forest works by trying to isolate each data point using random splits — anomalies are isolated faster.\" A colleague asks: \"why would anomalies be isolated faster than normal points?\" What is the correct mechanistic explanation?","options":{"A":"Anomalies are closer to the decision boundary in the feature space, so they require fewer splits","B":"Anomalies are points that are isolated in sparse regions of the feature space — each random split has a higher probability of separating an anomaly from other points because there are fewer nearby points; normal points in dense regions require many splits before they are separated from their neighbors; the average path length to isolation is shorter for anomalies","C":"Isolation Forest uses k-nearest neighbors to identify anomalies, and anomalies have fewer neighbors","D":"Anomalies are isolated faster because they have extreme feature values that make them easy to split at any threshold"},"correct":"B","explanation":{"correct":"- Isolation Forest builds random decision trees by selecting a random feature and a random split threshold. The path length to isolate a point is the number of splits needed.\n- Dense regions: many similar points in a small feature space volume. Splitting at any threshold still leaves many points together — many more splits are needed before a normal point is isolated.\n- Sparse regions (anomalies): few nearby points. Any split tends to separate the anomaly quickly.\n- Anomaly score: based on average path length across many trees. Short path length → anomaly. Long path length → normal. Score is normalized against the expected path length for a random dataset.","A":"\"Proximity to decision boundary\" is not the isolation mechanism. Isolation Forest doesn't compute decision boundaries — it measures path length in trees.","B":"","C":"Isolation Forest is tree-based, not distance-based. It doesn't compute k-nearest neighbors. 
LOF (Local Outlier Factor) is the k-NN-based anomaly detection method.","D":"Extreme values are one type of anomaly that Isolation Forest handles well, but the explanation is too narrow. Isolation Forest can detect anomalies in any sparse region, including multivariate anomalies that aren't extreme in any single feature."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12002","difficulty":"easy","orderIndex":2,"question":"A fraud detection system flags 1,000 transactions as fraudulent out of 1,000,000. A manager says \"only 1,000 alerts is very low — we should lower the threshold to catch more fraud.\" After lowering the threshold, 50,000 transactions are flagged. Only 2,000 are actual fraud. What metrics best characterize the trade-off?","options":{"A":"Accuracy — it measures how often the model is correct","B":"Precision (true positives / all flagged) and recall (true positives / all actual fraud) — at the original threshold: high precision (if most of 1,000 were real fraud), unknown recall; at the lower threshold: precision = 2,000/50,000 = 4%, but recall improved; the trade-off between alert volume (workload) and fraud caught (recall) is the core operational decision","C":"F1 score — it is the only metric that captures the trade-off correctly","D":"Accuracy — it is 99.8% before threshold lowering, proving the original model is perfect"},"correct":"B","explanation":{"correct":"- Assuming, for illustration, a 1% fraud rate in the population: 10,000 actual fraud transactions out of 1,000,000.\n- Original threshold: 1,000 flagged. If all are real fraud: precision = 100%, recall = 1,000/10,000 = 10%. If half are real: precision = 50%, recall = 5%.\n- Lower threshold: 50,000 flagged, 2,000 real fraud. Precision = 4%, recall = 20%. Investigation workload increased 50×, but only doubles fraud caught.\n- The PR (Precision-Recall) curve visualizes all threshold operating points. 
AUC-PR is more informative than AUC-ROC for heavily imbalanced anomaly detection.","A":"Accuracy is useless here — predicting \"no fraud\" for all transactions gives 99%+ accuracy because fraud is rare. Accuracy doesn't capture the false-negative cost (missed fraud) or false-positive cost (unnecessary investigation).","B":"","C":"F1 is one summary metric of the precision-recall trade-off, but it doesn't show the full trade-off curve. Separate precision and recall values are more interpretable for business decisions.","D":"99.8% accuracy while missing 99% of fraud is not a good outcome. Accuracy conflates rare-class performance with common-class performance."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12003","difficulty":"easy","orderIndex":3,"question":"Local Outlier Factor (LOF) gives a score of 4.2 to a data point. A student interprets this as \"the point is 4.2 standard deviations from the mean.\" Why is this interpretation incorrect?","options":{"A":"The interpretation is correct — LOF is based on z-scores","B":"LOF measures the ratio of the local density of a point's neighbors to its own local density — a score of 4.2 means the point's neighborhood is approximately 4.2× less dense than its neighbors' neighborhoods; it is not a standard deviation or a distance — it is a local density ratio; LOF is entirely non-parametric and makes no distributional assumptions","C":"LOF score of 4.2 means the point has 4.2 times more neighbors than average","D":"LOF is similar to z-score but uses median instead of mean"},"correct":"B","explanation":{"correct":"- LOF computation: for a point $p$, compute the $k$-distance (distance to $k$-th nearest neighbor), then the reachability distance (smoothed local distance), then the local reachability density (LRD = inverse of average reachability distance).\n- $\\text{LOF}(p) = \\frac{\\text{average LRD of } p\\text{'s neighbors}}{\\text{LRD}(p)}$. 
Values near 1: the point has similar density to its neighbors (normal). Values >>1: the point is in a much sparser region than its neighbors (outlier).\n- A score of 4.2 means the surrounding neighborhood is 4.2× denser than the point's own immediate vicinity — a significant density gap.","A":"LOF has nothing to do with z-scores. Z-score requires a global mean and standard deviation. LOF is local and non-parametric — it doesn't require any distributional assumptions.","B":"","C":"LOF measures density ratios, not neighbor counts. The $k$-NN count is fixed at $k$ for all points — the variation is in how far those $k$ neighbors are.","D":"LOF does not use median. It uses reachability distances and density ratios. There's no analogy to the z-score formula."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12004","difficulty":"easy","orderIndex":4,"question":"An autoencoder is trained on normal server behavior to detect anomalies. In production, it flags a batch of \"anomalous\" requests — but investigation shows these are normal, just from a new product feature launched yesterday. 
What fundamental assumption of reconstruction-error-based anomaly detection failed?","options":{"A":"The autoencoder threshold was set too low","B":"Reconstruction-error anomaly detection assumes the training distribution is representative of all future normal behavior — the new product feature represents a new normal pattern not in the training distribution; the autoencoder learned to reconstruct old normal patterns, so new legitimate patterns have high reconstruction error; this is concept drift (a change in what \"normal\" means)","C":"Autoencoders cannot be used for anomaly detection — this is an incorrect application","D":"The anomaly detection failed because the autoencoder was not deep enough"},"correct":"B","explanation":{"correct":"- Core assumption: train on normal data only → model learns to reconstruct normal patterns well → high reconstruction error = anomaly.\n- Violation: when the definition of \"normal\" changes (new features, seasonal patterns, product changes), the autoencoder flags legitimate new patterns as anomalies (false positives).\n- This is the stationarity assumption: the underlying data-generating process is stable. When it changes, the model becomes outdated.\n- Solutions: periodic retraining to include new normal patterns, incremental learning, or a human-in-the-loop review period when new features launch to recalibrate the threshold.","A":"Threshold adjustment is a potential short-term fix, but it doesn't address the underlying problem: the model doesn't know how to reconstruct the new product feature's patterns. Lowering the threshold would also reduce detection of real anomalies.","B":"","C":"Autoencoders are a well-established method for anomaly detection, widely used in network intrusion detection, fraud detection, and manufacturing quality control.","D":"Model depth affects representation capacity, not adaptation to distributional shift. 
A deeper autoencoder would still fail to reconstruct patterns it has never seen."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12005","difficulty":"easy","orderIndex":5,"question":"One-Class SVM (OCSVM) is trained only on normal data to create a decision boundary around normal instances. At test time, it classifies new points as \"normal\" or \"anomaly.\" How does OCSVM differ from a standard two-class SVM, and what does the nu parameter control?","options":{"A":"OCSVM uses two classes internally — it just doesn't tell the user","B":"Standard SVM separates two classes with a hyperplane between them; OCSVM learns a single hypersphere (or hyperplane from origin) that encompasses the normal data; the nu parameter controls the fraction of training points that are allowed to be outside the hypersphere (treated as support vectors/outliers during training) — smaller nu = the boundary must enclose nearly all training points; larger nu = more training points are allowed to fall outside the boundary","C":"OCSVM is identical to standard SVM except it removes the regularization term","D":"The nu parameter controls the kernel bandwidth, like the gamma parameter in RBF kernels"},"correct":"B","explanation":{"correct":"- OCSVM objective: find a hyperplane that separates the training data from the origin with maximum margin in feature space. Points on the origin side are anomalies.\n- nu ∈ (0, 1]: upper bound on the fraction of outliers in training data AND lower bound on the fraction of support vectors. Setting nu=0.05 means: accept up to 5% of training points as anomalies, ensure at least 5% are support vectors.\n- High nu: more training points may fall outside the boundary, so the region treated as normal shrinks and more points are flagged at test time. Low nu: the boundary must enclose nearly all training points, so fewer points are flagged.\n- Practical note: OCSVM is sensitive to feature scaling and the choice of kernel — preprocessing and hyperparameter tuning are critical.","A":"OCSVM genuinely trains on one class only. 
It doesn't simulate two classes. The decision boundary is defined relative to the origin in the kernel feature space.","B":"","C":"OCSVM has a different formulation and objective than two-class SVM. The regularization approach differs, and the decision function measures distance from the origin rather than distance between two class hyperplanes.","D":"The gamma parameter in RBF kernel controls bandwidth. nu is a separate regularization parameter. They can both be tuned but are independent."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12006","difficulty":"medium","orderIndex":6,"question":"An anomaly detection model for manufacturing defects achieves: 99.7% recall (catches 99.7% of defects), but precision is 18% (82% of flagged items are false positives). Manufacturing halts production to inspect every flagged item. A business analyst asks: \"should we sacrifice some recall to improve precision?\" What framework should guide this decision?","options":{"A":"Always maximize recall — missing defects is always worse than false alarms","B":"The decision requires comparing the asymmetric costs: cost of a missed defect (e.g., customer harm, recall campaign, warranty cost) vs cost of a false positive (production halt, inspection time, lost throughput); if a missed defect costs $1,000,000 and a false positive costs $200, precision 18% may be acceptable; if costs are more balanced, improving precision at the cost of recall is justified","C":"F1 score = 2×precision×recall / (precision+recall) should always be maximized — it balances both metrics optimally","D":"Precision and recall cannot both be considered in the same decision — you must choose one metric to optimize"},"correct":"B","explanation":{"correct":"- Decision theory: minimize expected cost = $C_{FN} \\times FN + C_{FP} \\times FP$ where $C_{FN}$ = cost of missed defect, $C_{FP}$ = cost of false alarm.\n- At 18% precision: for every real defect caught, 4.6 
false alarms are generated. This is acceptable if $C_{FN} / C_{FP} > 4.6$. If a missed defect causes field failure (high $C_{FN}$) and inspections are cheap, high recall is worth the false positive cost.\n- PR curve: plot precision vs recall at all thresholds. Find the operating point where cost is minimized given the business cost ratio.","A":"\"Always maximize recall\" assumes missed defects have infinite cost. In practice, production shutdowns and inspection cost money too. The optimal trade-off depends on cost asymmetry.","B":"","C":"F1 assumes equal cost for false positives and false negatives ($C_{FP} = C_{FN}$). In manufacturing, these costs are typically very different. Maximizing F1 is not appropriate when costs are asymmetric.","D":"Precision and recall must both be considered — they capture different types of errors. The PR curve and cost analysis are specifically designed to navigate this joint consideration."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12007","difficulty":"medium","orderIndex":7,"question":"An Isolation Forest model is trained on network traffic data where 0.1% of traffic is anomalous (malicious). The `contamination` hyperparameter is set to 0.1. 
What does this parameter do, and what are the consequences of setting it incorrectly?","options":{"A":"The contamination parameter controls the number of trees in the forest — 0.1 means 10 trees","B":"The contamination parameter sets the expected proportion of anomalies (10% in this case) — it is used to determine the decision threshold: the top-X% of anomaly scores are labeled anomalies; setting contamination=0.1 when the true anomaly rate is 0.001 means the model flags 100× too many points as anomalies, inflating false positives dramatically","C":"The contamination parameter controls the subsample size for each tree","D":"Setting contamination to any value > 0.05 prevents Isolation Forest from working correctly"},"correct":"B","explanation":{"correct":"- Isolation Forest produces anomaly scores for all points. The contamination parameter is used to set the decision threshold: if contamination=0.1, the threshold is set so the lowest-scoring 10% of points are labeled anomalous.\n- True anomaly rate ≈ 0.1% (0.001), but contamination=0.1 means 10% are flagged, so at least 99% of the flagged points are false positives.\n- Setting contamination too low: may miss real anomalies (threshold too strict). Setting too high: floods output with false positives.\n- Best practice: use the raw anomaly scores and evaluate on a labeled validation set to select the threshold that minimizes the cost function for the specific application.","A":"The number of trees is controlled by `n_estimators` (default 100). Contamination has nothing to do with tree count.","B":"","C":"The subsample size is controlled by `max_samples` (in scikit-learn the default `'auto'` resolves to min(256, n_samples)). Contamination only affects the decision threshold, not the forest structure.","D":"Nothing breaks at 0.05. 
In scikit-learn, contamination may be any value in (0, 0.5], so 0.1 is a valid setting and the algorithm runs normally; values above 0.5 (flagging the majority as anomalies) are rejected."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12008","difficulty":"medium","orderIndex":8,"question":"A statistical anomaly detection system flags data points more than 3 standard deviations from the mean (z-score > 3). On a dataset with 1,000,000 observations following a normal distribution, approximately how many points does this flag, and what is the false positive rate for a truly normal dataset?","options":{"A":"3 standard deviations catches all anomalies — 0 points from a normal distribution are flagged","B":"By the empirical rule, 99.73% of normal data falls within 3σ — approximately 0.27% (2,700 points out of 1,000,000) are flagged; on a truly normal dataset, all 2,700 flagged points are false positives; the 3σ rule has a 0.27% false positive rate, which at scale generates many false alarms","C":"3σ catches exactly 3 points per million from a normal distribution","D":"The 3σ rule is exact — any point beyond 3σ is definitively anomalous regardless of the true distribution"},"correct":"B","explanation":{"correct":"- Normal distribution: $P(|Z| > 3) = 0.0027 = 0.27\\%$. For n=1,000,000: $\\approx 2,700$ expected false positives.\n- The 3σ rule is a heuristic from quality control (6-sigma manufacturing) — it assumes data is normally distributed. For skewed distributions (log-normal, heavy-tailed), the tail probability is completely different.\n- Bonferroni correction: for multiple simultaneous tests, adjust the threshold. 
Testing 1,000,000 points, each at α=0.0027, produces ~2,700 expected false positives even with no real anomalies.\n- Better approach for large datasets: use extreme value theory (EVT) for threshold setting, or model-based anomaly detection that accounts for the actual distribution.","A":"The 3σ rule defines an outlier region — it doesn't catch \"all anomalies.\" For normal data, it systematically flags the tails. \"All anomalies are caught\" would require a threshold of 0.","B":"","C":"The expected number is 2,700 (0.27% of 1,000,000), not 3. The \"3 per million\" figure corresponds to the 4.5σ rule, not 3σ.","D":"3σ is a probabilistic threshold, not an absolute ground truth. Points beyond 3σ from a normal distribution are rare but not definitively anomalous — they occur with 0.27% probability in normal data."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12009","difficulty":"hard","orderIndex":9,"question":"An autoencoder-based anomaly detector achieves high performance on a validation set. In a red team exercise, an adversary injects carefully crafted anomalies that maintain low reconstruction error. 
How can the adversary craft such inputs, and what defense mitigates this?","options":{"A":"An adversary cannot craft low-reconstruction-error anomalies — autoencoders definitionally fail to reconstruct unseen patterns","B":"If the adversary knows the autoencoder architecture and weights, they can use gradient descent to find an input that (1) is anomalous by some external criterion (e.g., contains malicious payload) but (2) has low reconstruction error; by optimizing the anomalous input to minimize reconstruction loss while maintaining the malicious property, the adversary evades detection; defenses include adversarial training (train on adversarially perturbed inputs), ensemble methods, and feature obfuscation","C":"The only defense against adversarial anomalies is using a deeper autoencoder","D":"Adversarial examples only affect classification models, not autoencoders"},"correct":"B","explanation":{"correct":"- Adversarial reconstruction attack: given trained autoencoder with encoder $f$ and decoder $g$, minimize $||x - g(f(x))||^2$ subject to $x$ containing anomalous content. This is an optimization problem the adversary can solve if they have model access (white-box attack).\n- Example: a malicious network packet designed to look like normal traffic in the feature space the autoencoder monitors (e.g., using normal-looking headers while hiding payload in less-monitored fields).\n- Defenses: (1) adversarial training — include adversarially perturbed samples in training; (2) ensemble of diverse autoencoders with different architectures; (3) variational autoencoders (VAEs) that penalize out-of-distribution samples in the latent space; (4) monitoring both reconstruction error AND latent space distance from the training distribution.","A":"This assumes the autoencoder perfectly generalizes from reconstruction error to anomaly detection — it doesn't. 
The reconstruction manifold can be exploited precisely because the autoencoder's learned manifold doesn't perfectly align with the anomaly boundary.","B":"","C":"Depth alone doesn't prevent adversarial attacks. Deeper models can be attacked just as effectively; in fact, more expressive models may have larger adversarial subspaces.","D":"Adversarial examples extend to any differentiable function, including autoencoders. The gradient of reconstruction loss with respect to input is well-defined and exploitable."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12010","difficulty":"hard","orderIndex":10,"question":"LOF is applied with k=5 on a dataset with two regions: Region A (1,000 densely packed normal points) and Region B (10 scattered normal points). Points in Region B get LOF scores of 3-8, despite being from the same generating process as Region A points. What causes this, and what does it reveal about LOF's assumptions?","options":{"A":"LOF correctly identifies Region B points as anomalies — they are statistically less common","B":"LOF measures local density relative to a point's local neighborhood — Region B points have sparse local neighborhoods; their LRD is low; their neighbors (some from Region A's sparse border) have higher LRD; the LOF ratio (neighbors' LRD / point's LRD) exceeds 1, flagging Region B as anomalous; this reveals LOF's implicit assumption that all normal data has similar local density — it fails on genuinely multimodal or multi-density datasets","C":"LOF failure is caused by using k=5 — using k=50 would fix the issue","D":"Region B points are genuinely anomalous — the scattered distribution proves they are rare events"},"correct":"B","explanation":{"correct":"- LOF assumes: normal regions are dense, anomalies are sparse. Region B violates this — it contains sparse but legitimate points.\n- LRD of Region B points = low (sparse neighborhood). 
LRD of some Region A neighbors (on the border) = moderate. LOF ratio > 1 → flagged as anomaly.\n- This is the multi-density problem: LOF cannot distinguish between \"sparse because anomalous\" and \"sparse because it's a legitimate sparse cluster.\"\n- Solutions for multi-density scenarios: LOCI (Local Correlation Integral), HBOS (Histogram-Based Outlier Score), or domain-specific thresholds per region.","A":"\"Statistically less common\" doesn't make something anomalous if it's generated by a known, legitimate process. Anomalous means unexpected or pathological, not just rare.","B":"","C":"Changing k adjusts the neighborhood scale. With k=50, Region B points would include Region A neighbors in their neighborhood, potentially reducing LOF scores, but this is a workaround, not a principled fix. The underlying multi-density problem remains.","D":"Scattered distribution ≠ rare events. Region B could represent a legitimate sparse subpopulation (e.g., a small category of customers with naturally sparse feature patterns)."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12011","difficulty":"hard","orderIndex":11,"question":"A variational autoencoder (VAE) is used for anomaly detection. Instead of reconstruction error alone, the anomaly score combines reconstruction error and the KL divergence of the latent space distribution from the prior. 
Why does adding KL divergence improve anomaly detection compared to reconstruction error alone?","options":{"A":"KL divergence replaces reconstruction error entirely — it is a strictly better metric","B":"Standard autoencoders can memorize unusual inputs with low reconstruction error if they have sufficient capacity — the latent representation of an anomalous input may land anywhere in the latent space; the VAE's KL term regularizes the latent space toward a known prior (N(0,1)), so anomalous inputs that produce unusual latent codes (far from the prior) are penalized by the KL term even if reconstruction error is low","C":"KL divergence in VAEs measures the distance between input and output distributions","D":"Adding KL divergence makes the VAE ignore reconstruction error in the anomaly score"},"correct":"B","explanation":{"correct":"- Standard AE failure mode: anomalous inputs may be \"memorized\" or happen to fall in a region of the latent space where the decoder produces a good reconstruction (especially if the anomaly has patterns similar to some normal data in individual features).\n- VAE anomaly score: $\\text{score}(x) = \\text{Reconstruction Error}(x) + \\lambda \\cdot D_{KL}(q(z|x) || p(z))$.\n- $D_{KL}$ term: penalizes latent encodings that deviate from the standard normal prior. An anomalous input that produces unusual latent statistics $(\\mu, \\sigma)$ contributes high KL divergence.\n- Combined score: catches both types of anomalies — those with high reconstruction error (strange patterns) and those with unusual latent representations (off-manifold in latent space).","A":"KL divergence and reconstruction error are complementary. Reconstruction error catches inputs the decoder cannot reconstruct; KL divergence catches inputs that produce unusual latent codes. 
Using both covers more failure modes.","B":"","C":"KL divergence in VAEs measures the distance between the posterior $q(z|x)$ (learned encoder distribution) and the prior $p(z)$ (typically N(0,I)). It does not compare input and output distributions.","D":"Both terms are part of the ELBO loss and the anomaly score. The weights $\\lambda$ can be tuned, but adding KL divergence does not eliminate reconstruction error from the score."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12012","difficulty":"hard","orderIndex":12,"question":"An Isolation Forest and a One-Class SVM are compared on a dataset where anomalies are dense clusters of similar malicious patterns (not scattered outliers). Isolation Forest performs poorly; OCSVM performs well. Explain this counter-intuitive result.","options":{"A":"Isolation Forest always outperforms OCSVM — the result indicates a bug","B":"Isolation Forest assumes anomalies are isolated (sparse, in low-density regions) — if anomalies cluster together, they form their own dense region; Isolation Forest cannot distinguish dense anomaly clusters from dense normal clusters; OCSVM learns the boundary of the normal region — even if anomalies cluster, they fall outside the normal support and are correctly flagged","C":"OCSVM performs better because it uses a kernel — Isolation Forest's trees cannot handle kernel-transformed data","D":"The result indicates overfitting in OCSVM — it perfectly memorized the anomaly clusters from training data"},"correct":"B","explanation":{"correct":"- Isolation Forest assumption: anomalies are isolated in the feature space. 
When anomalies form a dense cluster (e.g., a botnet generating coordinated traffic with consistent patterns), the cluster requires many splits to isolate — Isolation Forest assigns it a long isolation path → low anomaly score → classified as normal.\n- This is Isolation Forest's fundamental weakness: clustered anomalies defeat the isolation criterion.\n- OCSVM learns a closed boundary (hyperplane from origin in kernel space) around the normal data. A dense cluster of anomalies is simply outside this boundary — OCSVM correctly flags the entire cluster regardless of its density.\n- Other methods that handle clustered anomalies: deep one-class classification, robust covariance estimation.","A":"Both algorithms have known failure modes. Isolation Forest is known to fail for clustered anomalies. This is not a bug — it is documented behavior.","B":"","C":"Isolation Forest using trees doesn't prevent kernel-based comparison. Isolation Forest's failure is algorithmic (isolation path length for dense anomaly clusters), not a limitation related to kernel methods.","D":"OCSVM performing well on unseen anomalies is correct generalization, not overfitting. OCSVM was trained only on normal data — it cannot \"memorize\" anomaly clusters it never saw."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13001","difficulty":"easy","orderIndex":1,"question":"A team trains 10 decision tree classifiers, each on a different random subset of training data (with replacement). They combine predictions by majority vote. 
What is this ensemble technique, and what property of the base models is essential for this approach to improve over a single tree?","options":{"A":"Boosting — the models must have high accuracy individually","B":"Bagging (Bootstrap Aggregating) — the base models must be diverse (low correlation between errors); if all trees make the same mistakes (highly correlated), majority voting cannot average out errors; diversity is achieved by training on different bootstrap samples and typically by random feature subsampling","C":"Stacking — the base models must make different types of predictions (regression vs classification)","D":"The combination method doesn't matter — any 10 models will always outperform one model"},"correct":"B","explanation":{"correct":"- Bagging: train $B$ models on bootstrap samples (sample $n$ points with replacement from training set); combine by averaging (regression) or majority vote (classification).\n- Why it works: if each model has error rate $e$ and errors are independent, the majority vote error decreases exponentially with $B$. Specifically, for binary classification with $e < 0.5$ and independent errors: $P(\\text{ensemble wrong}) = \\sum_{k > B/2} \\binom{B}{k} e^k (1-e)^{B-k} \\ll e$.\n- Essential condition: errors must be uncorrelated (diverse models). If all models are identical, ensemble error = individual error. Bootstrap sampling and feature subsampling introduce diversity.","A":"Boosting is a sequential procedure (each model focuses on previous errors). Bagging is a parallel procedure. High individual accuracy is helpful but not the essential requirement — diversity is.","B":"","C":"Stacking uses a meta-learner to combine base model predictions. It doesn't require models of different types, and the base models typically make the same type of prediction.","D":"10 identical models perform the same as 1 model. 
Improvement requires diversity — uncorrelated errors across base models."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13002","difficulty":"easy","orderIndex":2,"question":"A hard voting classifier combines predictions from 5 classifiers: 3 predict \"class A\" and 2 predict \"class B.\" A soft voting classifier uses the predicted probabilities: the same 5 classifiers output average probabilities of $P(A) = 0.42$ and $P(B) = 0.58$. The two voting methods disagree. Which is more reliable and why?","options":{"A":"Hard voting is more reliable because it uses the final predictions, not uncertain probability estimates","B":"Soft voting is generally more reliable because it uses the full probability distribution — 3 classifiers may predict \"A\" with low confidence (e.g., 55%) while 2 predict \"B\" with high confidence (e.g., 90%); soft voting weights contributions by confidence; hard voting treats a 55% confident prediction the same as a 99% confident prediction","C":"Both methods are equivalent — they always produce the same result","D":"Hard voting is more reliable for classification; soft voting is only for regression"},"correct":"B","explanation":{"correct":"- Hypothetical breakdown: models 1,2,3 predict A with P(A) = 0.55, 0.56, 0.57 (marginally A); models 4,5 predict B with P(B) = 0.70, 0.85 (strongly B). Hard voting: 3 votes for A → predicts A. Soft voting: avg P(A) = (0.55+0.56+0.57+0.30+0.15)/5 = 0.426 → predicts B.\n- Soft voting correctly captures that models 4,5 are much more certain about their prediction than models 1,2,3. Hard voting ignores this signal.\n- Precondition for soft voting: classifiers must produce well-calibrated probabilities. Poorly calibrated probabilities (e.g., naive Bayes overconfidence) can degrade soft voting.","A":"Hard voting's use of \"final predictions\" loses information. 
Confidence matters — a narrow, low-confidence majority should not automatically override a confident minority.","B":"","C":"Hard and soft voting can disagree (as in this example). They are not equivalent. When all models agree, the result is the same, but disagreements expose the methodological difference.","D":"Both hard and soft voting work for classification. Averaging is used for regression, but soft voting (averaging class probabilities) applies specifically to classification."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13003","difficulty":"easy","orderIndex":3,"question":"Random Forest and a bagged decision tree ensemble both use bootstrap sampling. A data scientist says they are the same algorithm. What key feature distinguishes Random Forest from standard bagging of decision trees?","options":{"A":"Random Forest uses boosting instead of bagging","B":"Random Forest adds feature subsampling at each split: when growing each tree, only a random subset of roughly $\\sqrt{p}$ features (for classification) is considered at each node split; this additional randomization reduces correlation between trees beyond what bootstrap sampling alone achieves, further improving ensemble diversity","C":"Random Forest uses pruned trees; standard bagging uses full-depth trees","D":"Random Forest trains trees in sequence (each tree depends on the previous), unlike parallel bagging"},"correct":"B","explanation":{"correct":"- Standard bagging: each tree is trained on a different bootstrap sample, but each split considers all $p$ features. Trees will tend to use the same dominant features at the top levels → correlated trees → diminished ensemble benefit.\n- Random Forest: additionally samples $m$ features at each split ($m \\approx \\sqrt{p}$ for classification, $m \\approx p/3$ for regression). 
Prevents any single dominant feature from appearing at the top of every tree.\n- Consequence: RF trees are more diverse (less correlated) than bagged trees. The variance reduction from averaging uncorrelated models is larger.","A":"Both Random Forest and bagging are parallel ensemble methods. Boosting is a different family (sequential, focuses on difficult examples).","B":"","C":"Both Random Forest and bagging typically use full-depth (unpruned) trees. Deep trees have low bias individually; the ensemble reduces variance. Pruning would increase bias.","D":"Both bagging and Random Forest are parallel — trees are independent and can be trained simultaneously. Boosting methods (AdaBoost, gradient boosting) use sequential training."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13004","difficulty":"easy","orderIndex":4,"question":"A stacking ensemble is built with 5 base classifiers and one meta-learner. The meta-learner is trained on the base classifiers' predictions. A team member trains the meta-learner on the same training data used for the base classifiers. A reviewer flags this as a problem. 
Why?","options":{"A":"The meta-learner must always be a logistic regression — other meta-learners cause issues","B":"If the base classifiers are trained and evaluated on the same data, their predictions on training samples are over-optimistic (they have memorized training data to some degree); the meta-learner learns to trust these over-optimistic predictions; at test time, base classifiers' predictions are more uncertain, and the meta-learner is miscalibrated; the correct approach is out-of-fold cross-validation predictions for the meta-learner's training data","C":"The problem only occurs if the base classifiers use gradient boosting","D":"The reviewer is wrong — using the same training data for meta-learner is the standard approach"},"correct":"B","explanation":{"correct":"- Correct stacking procedure (out-of-fold): (1) Split training data into K folds. (2) For each fold, train base classifiers on the other K-1 folds and predict on the held-out fold. (3) Collect out-of-fold predictions for all training samples. (4) Train meta-learner on these OOF predictions. (5) Retrain base classifiers on all training data for final model.\n- This ensures meta-learner is trained on predictions that reflect each base classifier's true generalization ability, not in-sample performance.\n- Without OOF: base classifiers with high in-sample accuracy (overfitting) appear perfect on training data, causing the meta-learner to over-trust them.","A":"The meta-learner can be any model — logistic regression, gradient boosting, neural network. Logistic regression is commonly used for interpretability and to avoid overfitting the meta-level, but it's not required.","B":"","C":"The problem applies to all base classifiers that can overfit training data, not just gradient boosting. Even simple models have slightly better performance on training data.","D":"Using the same training data is the common but incorrect approach. 
Out-of-fold prediction is the correct approach; sklearn's StackingClassifier (which generates cross-validated predictions by default) and mlxtend's StackingCVClassifier implement it."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13005","difficulty":"medium","orderIndex":5,"question":"A team builds a blending ensemble: 5 base models trained on 70% of training data, validated on 30% (holdout), then a meta-learner trained on the holdout predictions. They claim this is equivalent to stacking with cross-validation. A statistician disagrees. What is the difference?","options":{"A":"Blending and stacking are identical — the statistician is wrong","B":"Blending uses a single holdout set for meta-learner training — this wastes 30% of training data that base models never see; it also risks overfitting the meta-learner to the specific holdout distribution if the holdout is small; stacking with K-fold uses all training data for both base models (via OOF) and provides more samples for meta-learner training; for large datasets the difference is minimal, but for small datasets blending can significantly underperform","C":"Blending is always better because base models see more data (70% vs K-fold's (K-1)/K fraction)","D":"The only difference is computational — both approaches are statistically equivalent"},"correct":"B","explanation":{"correct":"- Blending: base models train on 70%, meta-learner trains on holdout (30%). The holdout predictions are valid because the base models never saw the holdout set, but the cost is that 30% of the data is withheld from base model training.\n- K-fold stacking: base models are trained on (K-1)/K of training data, and out-of-fold predictions cover all training samples. Meta-learner trains on predictions for all N training examples.\n- With N=1,000: blending gives meta-learner 300 training samples; 5-fold stacking gives 1,000. 
More meta-learner training data = better generalization of the meta-learner.","A":"They differ in how much data is available for the meta-learner and how much data base models see during their training phase. These are statistically meaningful differences.","B":"","C":"Base models in blending do see 70% of data. But with K-fold stacking, base models also see approximately (K-1)/K ≈ 80% (5-fold) of data during each OOF fold — AND more samples are available for the meta-learner.","D":"The approaches are not statistically equivalent. The meta-learner sample count difference has a real effect on meta-learner generalization, especially on small datasets."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13006","difficulty":"medium","orderIndex":6,"question":"A team tries to improve model performance by adding more base classifiers to a Random Forest (from 100 trees to 1,000 trees). Performance on the validation set plateaus after 300 trees. A manager wants to add 10,000 trees. Why is this computationally unjustifiable, and what should they do instead?","options":{"A":"Random Forests always improve monotonically with more trees — 10,000 trees would definitely help","B":"After the ensemble variance is sufficiently reduced (diminishing returns in variance reduction), adding more trees does not improve generalization — it only increases inference cost linearly; the plateau after 300 trees indicates the ensemble has converged; resources are better spent on feature engineering, hyperparameter tuning, or trying a different algorithm","C":"10,000 trees would cause overfitting — RF overfits with too many trees","D":"Random Forest cannot support more than 1,000 trees due to memory constraints in standard implementations"},"correct":"B","explanation":{"correct":"- RF bias-variance: RF reduces variance by averaging many trees. 
As $B \\to \\infty$, the variance converges to $\\rho \\sigma^2$ where $\\rho$ = average correlation between trees and $\\sigma^2$ = single tree variance. Adding more trees beyond convergence doesn't reduce this lower bound.\n- Law of diminishing returns: variance reduction from tree $k$ decreases as $1/k^2$. Most variance reduction happens in the first few hundred trees.\n- More trees: monotonically (weakly) improve training fit but provide no generalization improvement past convergence. They increase prediction time O(B) and memory O(B).","A":"This is false. Random Forests converge as $B \\to \\infty$. Once the ensemble has enough trees to estimate the expected prediction well, adding more provides no benefit. This is a well-known theoretical property.","B":"","C":"More trees do NOT cause RF overfitting. RF overfitting is controlled by individual tree depth (max_depth, min_samples_split), not by the number of trees. This is a common misconception — more trees can never overfit, they just stop helping.","D":"There is no standard implementation limit of 1,000 trees. Sklearn's RandomForestClassifier supports any number. The constraint is computational budget, not implementation."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13007","difficulty":"medium","orderIndex":7,"question":"A company runs a critical ML system in production. They decide to use AdaBoost instead of Random Forest because \"AdaBoost focuses on hard examples and should be better.\" At test time, a few corrupted inputs (extreme feature values due to sensor malfunction) cause AdaBoost to make catastrophically wrong predictions. 
Why is AdaBoost more vulnerable to this than Random Forest?","options":{"A":"AdaBoost uses more trees, amplifying the effect of corrupted inputs","B":"AdaBoost assigns high weights to misclassified training examples and trains subsequent models to focus on them — corrupted training examples (if present) get amplified weights; at test time, AdaBoost is also more sensitive to outlier inputs because its aggregated prediction assigns disproportionate weight to weak learners trained on difficult/noisy regions; Random Forest's uniform averaging is more robust to individual outlier inputs","C":"AdaBoost is no more sensitive than Random Forest — the vulnerability is due to insufficient data preprocessing","D":"Random Forest is more vulnerable to corrupted inputs because it uses full-depth trees"},"correct":"B","explanation":{"correct":"- AdaBoost weighting: misclassified points get exponentially higher weights in subsequent rounds. If corrupted training examples exist, they get amplified weights — the model dedicates significant capacity to fitting noise.\n- Cascading sensitivity: final AdaBoost prediction is a weighted sum of weak learner predictions, where later learners (focused on hard/corrupted examples) have specific regions of high sensitivity.\n- Random Forest robustness: (1) bootstrap sampling means each tree sees a random subset of training data — a corrupted point only appears in ~63% of trees; (2) uniform averaging dilutes any single tree's response to an extreme input.\n- In practice: AdaBoost is significantly more sensitive to noisy labels and outlier inputs; preprocessing and outlier removal are essential before AdaBoost.","A":"AdaBoost doesn't necessarily use more trees. The vulnerability is in the re-weighting mechanism, not tree count.","B":"","C":"Preprocessing is necessary for both methods when corruption is present. 
But AdaBoost is architecturally more sensitive even after preprocessing, due to the amplifying weight scheme.","D":"Random Forest uses full-depth trees but they are averaged uniformly. Full depth increases individual tree variance, but the ensemble averaging provides robust protection against individual extreme inputs."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13008","difficulty":"medium","orderIndex":8,"question":"A data scientist builds a diverse stacking ensemble with 5 base models: logistic regression, SVM, Random Forest, gradient boosting, and KNN. The meta-learner is a neural network. Despite strong individual base model performance (all >80% accuracy), the ensemble achieves only 81% — barely better than the best single model. What might explain this?","options":{"A":"Stacking always outperforms individual models — 81% accuracy means implementation error","B":"If the base models are highly correlated in their predictions (they agree on the same hard examples), the meta-learner has little complementary signal to exploit; the meta-learner may also overfit to the training meta-features if data is limited; additionally, if one model dominates (e.g., gradient boosting is clearly the strongest base model), the meta-learner may simply learn to trust it almost exclusively","C":"The problem is the neural network meta-learner — it should be replaced with logistic regression","D":"Diverse architectures cannot be stacked together — the meta-learner requires homogeneous base models"},"correct":"B","explanation":{"correct":"- Stacking gains come from complementarity: when models make different mistakes, the meta-learner can combine them better than any individual. If all 5 models misclassify the same 20% of examples (highly correlated errors), the meta-learner cannot fix those cases.\n- Practical checks: compute pairwise correlation of base model predictions. 
If all correlations > 0.95, stacking provides little benefit.\n- Meta-learner overfitting: with limited training data and a flexible meta-learner (neural network), the meta-learner may fit training meta-features rather than generalizing.","A":"Stacking does not always outperform individual models. Benefits depend on error diversity among base models. Strong individual model + correlated errors → minimal stacking benefit.","B":"","C":"Replacing the meta-learner with logistic regression may help regularize the meta-level, but the fundamental issue is base model correlation. A simpler meta-learner cannot create error diversity that the base models lack.","D":"Heterogeneous base models (different architectures) stack perfectly well. In fact, diversity in base model types is often recommended to increase prediction diversity."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13009","difficulty":"hard","orderIndex":9,"question":"A team uses stacking but finds that the gradient boosting base model dominates the meta-learner's weights (coefficient ≈ 0.95, others ≈ 0.01). They add more base models to fix this. 
A statistician says \"adding correlated models won't help; adding regularization to the meta-learner is more principled.\" Explain why regularization is more principled.","options":{"A":"Regularization reduces the number of base models needed, saving computation","B":"The meta-learner's high weight on gradient boosting reflects the data's signal: if GB genuinely provides 95% of the predictive value, adding more correlated base models just adds noise to the meta-features; L1/L2 regularization on the meta-learner constrains weight magnitudes, preventing extreme dominance by any single base model and producing more stable, calibrated meta-weights; adding more correlated base models can reduce diversity and even introduce multicollinearity in meta-features","C":"Regularization fixes the issue by removing the gradient boosting model from consideration","D":"This situation can only be fixed by switching from stacking to boosting"},"correct":"B","explanation":{"correct":"- Meta-feature multicollinearity: if correlated base models are added (e.g., 3 different gradient boosting variants), the meta-features are correlated. Correlated meta-features destabilize ordinary least squares (OLS) meta-learner coefficients — small changes in data produce large weight swings.\n- L2 regularization (Ridge meta-learner): penalizes large weights, shrinking all coefficients toward zero. Even if GB provides most signal, L2 prevents the coefficient from reaching 0.95 — ensures small, stable contributions from other models.\n- More models without regularization: adds collinear meta-features, potentially destabilizing the already-dominant GB coefficient.","A":"Regularization controls weight distribution, not the number of base models needed. You still need all models to generate meta-features.","B":"","C":"L2/L1 regularization shrinks all coefficients toward zero — it does not remove any base model. 
L1 (Lasso) may zero out some coefficients, but that is data-driven feature selection, not deliberate removal of the gradient boosting model.","D":"Boosting is a fundamentally different algorithm (sequential, adaptive weighting). Switching to boosting doesn't address the meta-learner weight distribution issue in stacking."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13010","difficulty":"hard","orderIndex":10,"question":"A Random Forest feature importance is computed as the mean decrease in Gini impurity across all trees. Features A and B are highly correlated (correlation = 0.98). Feature A has importance 0.35, Feature B has importance 0.04. A data scientist drops Feature B and retrains. The model's accuracy drops by 2%. What explains this?","options":{"A":"Feature B's importance of 0.04 correctly reflects it as unimportant — the accuracy drop is coincidental","B":"When correlated features compete for splits, only one of them can be selected at each node; whichever feature wins the split gets \"credit\" for the impurity reduction; Feature A appears dominant because it won most of those splits; Feature B's measured importance is artificially suppressed by A's presence; B still carries some predictive signal of its own (the correlation is 0.98, not 1.0), so dropping B removes information that its low importance score concealed","C":"Gini importance is always accurate for correlated features — the problem is in the retrained model's hyperparameters","D":"Correlated features always have identical feature importance — A=0.35 and B=0.04 proves they are not actually correlated"},"correct":"B","explanation":{"correct":"- This is the correlated feature importance instability problem in tree-based methods. When A and B provide nearly the same information (r=0.98), each split picks one of them — the selected feature gets the full importance credit, the other gets near-zero.\n- The measured importances are unstable across different random seeds and bootstrap samples. 
If a different seed causes B to be selected more often, B's importance would be 0.35 and A's would be 0.04.\n- Implication: do not use Random Forest Gini importance to select features from correlated groups. Use permutation importance (which measures actual prediction degradation) or SHAP values, which distribute credit more fairly among correlated features.","A":"2% accuracy drop from removing a \"0.04 importance\" feature is diagnostic evidence that Gini importance understated B's value. The drop is not coincidental — it reflects the information lost.","B":"","C":"Gini importance is specifically known to be unreliable for correlated features. This is a well-documented limitation. Use SHAP or permutation importance for correlated feature evaluation.","D":"A=0.35 and B=0.04 despite r=0.98 correlation is precisely the artifact. High correlation and very different importances indicate one is suppressing the other — not that they aren't correlated."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13011","difficulty":"hard","orderIndex":11,"question":"In gradient boosting, each tree is fitted to the negative gradient of the loss function. For log-loss (binary cross-entropy): $r_i = y_i - \\hat{p}_i$ (the residuals are the difference between actual labels and predicted probabilities). 
A data scientist says \"gradient boosting for classification is just boosted regression on residuals.\" What subtle distinction does this miss?","options":{"A":"The statement is completely correct — gradient boosting for classification is regression on residuals","B":"Gradient boosting fits regression trees to the pseudo-residuals (negative gradient of the loss), but the final prediction requires a link function: the sum of tree outputs $F_M(x) = \\sum_{m=1}^{M} f_m(x)$ is in log-odds space; the probability prediction is $\\hat{p} = \\sigma(F_M(x)) = 1/(1+e^{-F_M(x)})$; the trees regress on residuals, but the model output is a probability after the sigmoid transform — collapsing this to \"regression on residuals\" ignores the non-linear mapping between tree outputs and final probabilities","C":"Gradient boosting for classification does not use residuals — it uses the raw labels directly","D":"Classification gradient boosting uses decision trees for the final prediction, while regression uses linear models — they are fundamentally different architectures"},"correct":"B","explanation":{"correct":"- Gradient boosting for binary classification with log-loss: $L = -\\sum [y_i \\log(\\hat{p}_i) + (1-y_i)\\log(1-\\hat{p}_i)]$, where $\\hat{p}_i = \\sigma(F(x_i))$.\n- Negative gradient with respect to the current score: $-\\partial L / \\partial F(x_i) = y_i - \\hat{p}_i = r_i$. Each tree is fitted to $r_i$ — these look like residuals, similar to regression.\n- Key difference: the trees operate in log-odds space, not probability space. After boosting, $F_M(x)$ is a log-odds score, and the final probability requires the sigmoid: $\\hat{p} = \\sigma(F_M(x))$.\n- For multi-class problems: there are K sets of trees (one per class), and the final probabilities use softmax. The \"regression on residuals\" analogy becomes even less direct.","A":"While mechanistically similar, the log-odds space transformation means trees directly fit quantities that are not interpretable as probabilities. 
Missing the sigmoid / link function is a conceptual gap that matters for probability calibration and output interpretation.","B":"","C":"Gradient boosting does use pseudo-residuals (negative gradients). For log-loss, these are $y_i - \\hat{p}_i$, which depend on the current probability predictions, not raw labels.","D":"Both classification and regression gradient boosting use decision trees (typically). The difference is the loss function and link function, not the base learner architecture."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13012","difficulty":"hard","orderIndex":12,"question":"Two ensemble strategies are compared on a medical imaging classification task: (1) Training 10 diverse models (RF, SVM, CNN, LR, KNN, etc.) and stacking; (2) Training 10 variants of the same CNN with different random seeds and averaging predictions. Strategy 2 achieves higher accuracy. Why, and what does this reveal about ensemble design?","options":{"A":"Strategy 1 is always better — diversity of architecture always outperforms diversity of initialization","B":"Architecture diversity does not guarantee prediction diversity — logistic regression and a simple CNN on the same task may make very similar predictions; deep CNNs trained with different seeds on the same data capture different feature representations due to random initialization, dropout, and data augmentation stochasticity, creating genuine prediction diversity; the task-specific best architecture dominates, and variance reduction within the best architecture outperforms mixing weak architectures with the best","C":"Strategy 2 is better because CNNs always outperform other models on image data","D":"The result proves that stacking is an inferior ensemble method compared to averaging"},"correct":"B","explanation":{"correct":"- Effective ensembling requires: (1) high individual model quality, (2) diverse errors (low prediction correlation). 
Strategy 1 mixes strong (CNN) with weak (LR, KNN on images) models. The meta-learner in stacking will learn to ignore weak models, effectively reducing to a single CNN.\n- Strategy 2: all models have high baseline accuracy (same CNN architecture optimized for the task). Different seeds create genuinely different learned representations — different random feature detectors are learned, reducing correlation.\n- Research insight: in competitive ML (Kaggle, benchmarks), ensembles of the same top architecture with diverse hyperparameters/seeds often outperform heterogeneous ensembles with weak models.","A":"Architecture diversity is useful when all architectures are approximately equally strong. Mixing a strong architecture with significantly weaker ones adds noise to the ensemble without equivalent signal.","B":"","C":"\"CNNs always outperform on images\" is broadly but not universally true (ViT, recent transformer models also perform well). More importantly, the comparison here is about ensemble strategy, not architecture selection.","D":"The result doesn't prove stacking is inferior in general. It shows that for this task, averaging 10 strong models outperforms stacking a mix of strong and weak models. Stacking with diverse, equally-strong base models can outperform averaging."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14001","difficulty":"easy","orderIndex":1,"question":"A binary classifier achieves 99% accuracy on a dataset where 99% of samples belong to class 0. 
A junior data scientist says \"99% accuracy means our model is excellent.\" What is wrong with this evaluation?","options":{"A":"99% accuracy is always an excellent result regardless of class distribution","B":"A trivial model that predicts \"class 0\" for every sample also achieves 99% accuracy — accuracy conflates majority-class performance with minority-class performance; on imbalanced datasets, accuracy doesn't measure whether the model learned anything about the minority class (class 1)","C":"99% accuracy requires 100% precision and 100% recall to be meaningful","D":"The accuracy is too high — a 99% accurate model is always overfitting"},"correct":"B","explanation":{"correct":"- Null accuracy (baseline): predict the majority class for all samples. With 99% class 0: null accuracy = 99%. The model achieves no improvement over this trivial baseline.\n- For class 1 detection: Recall = TP/(TP+FN). If the model predicts \"0\" for everything: TP = 0, FN = all class 1 samples. Recall = 0 — the model completely fails to detect the minority class.\n- Appropriate metrics for imbalanced data: precision-recall AUC, F1 on the minority class, Cohen's kappa, or Matthews correlation coefficient (MCC).","A":"99% accuracy can be meaningless on imbalanced data. The appropriate interpretation depends critically on class distribution.","B":"","C":"99% accuracy doesn't require perfect precision or recall. But on 99/1 imbalanced data, achieving 99% accuracy tells you nothing about minority class performance.","D":"High accuracy is not a sign of overfitting per se. Overfitting manifests as high training accuracy with lower test accuracy. Class imbalance is a separate issue."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14002","difficulty":"easy","orderIndex":2,"question":"For a disease screening test, the following confusion matrix applies: TP=90, FP=200, FN=10, TN=9,700. 
Calculate precision, recall, and F1 score, and determine which metric is most critical for this screening application.","options":{"A":"Precision is most critical — flagging 200 healthy people for follow-up is worse than missing 10 sick people","B":"Recall (90/(90+10) = 90%) is most critical — missing 10 sick people has high clinical cost (late diagnosis, disease progression); precision (90/(90+200) = 31%) is low because it's a screening test where false positives are expected and managed through confirmatory testing; F1 = 2×0.9×0.31/(0.9+0.31) ≈ 0.46 combines both","C":"Accuracy ((90+9700)/10000 = 98%) is most critical — it captures the overall test performance","D":"F1 score should always be optimized for medical tests — it perfectly balances the clinical trade-offs"},"correct":"B","explanation":{"correct":"- Context matters: disease screening vs diagnosis. Screening: cast a wide net (high recall), accept false positives (low precision). Confirmatory tests (more expensive, invasive) eliminate false positives.\n- Missing a sick person (FN) in screening means they receive no follow-up, leading to late-stage diagnosis with much higher treatment cost and mortality.\n- False positive (FP) sends a healthy person for confirmatory testing — inconvenient and costly but not catastrophic.\n- Recall = sensitivity in medical terminology. The WHO's target sensitivity for TB screening is >90%. Low precision is acceptable for initial screening when confirmatory testing exists.","A":"For many diseases (cancer, HIV), missing a case (FN) is far more costly than an unnecessary follow-up test (FP). The asymmetric cost justifies prioritizing recall.","B":"","C":"Accuracy of 98% is dominated by the 9,700 true negatives. It tells you almost nothing about disease detection performance. Never use accuracy for medical screening evaluation.","D":"F1 assumes equal cost for FP and FN ($C_{FP} = C_{FN}$). Medical contexts typically have highly asymmetric costs. 
F1 is not the right metric when cost asymmetry exists."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14003","difficulty":"easy","orderIndex":3,"question":"A classifier's ROC curve has an AUC of 0.85. A colleague says \"our model correctly classifies 85% of samples.\" Is this interpretation correct?","options":{"A":"Yes — AUC directly measures the percentage of correct classifications","B":"No — AUC-ROC measures the probability that the model ranks a randomly chosen positive sample higher than a randomly chosen negative sample; AUC=0.85 means: given a random positive and a random negative, the model assigns a higher score to the positive 85% of the time; this is a ranking quality metric, not a classification accuracy metric","C":"AUC-ROC and accuracy are equivalent — both measure the proportion of correct predictions","D":"AUC = 0.85 means the model achieves 85% recall at 85% precision"},"correct":"B","explanation":{"correct":"- Formal definition of AUC-ROC: $P(\\hat{p}(x^+) > \\hat{p}(x^-))$ for randomly drawn positive $x^+$ and negative $x^-$. This is a threshold-independent measure of discriminative ability.\n- AUC = 0.5: model cannot distinguish positive from negative (random ranking). AUC = 1.0: perfect ranking (all positives scored above all negatives). AUC = 0.85: very good discrimination.\n- On a 99% negative dataset, a model with AUC = 0.85 can still report about 99% accuracy at the default threshold simply by labeling almost everything negative; the accuracy then says little, while the AUC shows the model does rank positives above negatives. Accuracy and AUC measure very different things.","A":"Accuracy is (TP+TN) / total. AUC is a ranking probability. They are different quantities and are equal only by coincidence.","B":"","C":"A model with 50% accuracy on balanced data can have AUC > 0.5. A model with 99% accuracy on 99:1 imbalanced data can have AUC close to 0.5 if it simply predicts all negatives. 
They are not equivalent.","D":"AUC has no direct relationship to specific precision-recall values at a fixed threshold. AUC is a summary over all possible thresholds."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14004","difficulty":"easy","orderIndex":4,"question":"A spam classifier has AUC-ROC = 0.92 but AUC-PR (Precision-Recall) = 0.41. The dataset has 1% spam. A data scientist says \"AUC-ROC of 0.92 means the model is good.\" A colleague disagrees. Which is more informative for this application, and why?","options":{"A":"AUC-ROC is always the correct metric — AUC-PR is rarely used in practice","B":"For highly imbalanced datasets, AUC-PR is more informative — AUC-ROC includes the True Negative Rate (specificity), and with 99% negatives, the model classifies negatives easily; the ROC curve's large TN region inflates AUC-ROC; AUC-PR focuses on performance on the rare positive class (spam), where 0.41 indicates poor precision-recall trade-off for spam detection","C":"Both metrics are equivalent and should give the same value for any classifier","D":"The discrepancy between 0.92 and 0.41 indicates a computational error"},"correct":"B","explanation":{"correct":"- ROC curve plots TPR vs FPR. With 99% negatives: even a weak model keeps FPR low (many TN), producing a good-looking ROC curve despite poor minority class performance.\n- PR curve plots precision vs recall. It focuses entirely on the positive (minority) class — no TN in either metric. AUC-PR = 0.41 on 1% spam means the model struggles to achieve good precision-recall balance for spam.\n- Saito & Rehmsmeier (2015): AUC-PR is more informative than AUC-ROC for imbalanced datasets. AUC-PR's random baseline is equal to the positive class rate (1% here), while AUC-ROC's random baseline is always 0.5.","A":"AUC-PR is widely used in imbalanced classification, information retrieval (average precision), and recommendation systems. 
It is not rarely used.","B":"","C":"AUC-ROC and AUC-PR are different quantities measuring different aspects of model performance. They have different baselines and different interpretations. They are not equivalent.","D":"The discrepancy between 0.92 and 0.41 is expected and common for imbalanced datasets. It is not a computational error — it reveals the model's different performance characteristics."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14005","difficulty":"easy","orderIndex":5,"question":"K-fold cross-validation (K=5) is compared to a single 80/20 train-test split for model evaluation. A data scientist argues K-fold is always better. When might a single split be more appropriate?","options":{"A":"A single split is never better — K-fold is universally superior","B":"K-fold requires fitting the model K times — for very large datasets or computationally expensive models (deep learning, large gradient boosting), K-fold is impractical; a single split may be sufficient when the dataset is large enough that a 20% test set (which may be 100,000+ samples) provides stable estimates with low variance; K-fold primarily helps when data is limited and variance in evaluation is high","C":"K-fold should never be used for neural networks because it causes overfitting","D":"A single split is better when the data has temporal structure, because K-fold would still use random splits"},"correct":"B","explanation":{"correct":"- K-fold benefit: reduces evaluation variance by averaging over K different train-test splits. With limited data (n<1,000), a single 80/20 split may give high-variance estimates depending on which 20% was in the test set.\n- K-fold cost: K× the computational cost. For a CNN trained for 12 hours, 5-fold = 60 hours. 
For large datasets where test variance is already low, the extra cost is not justified.\n- Also valid: D is partially correct — for time-series data, K-fold random splits cause leakage (future data in training). Time-series cross-validation (expanding window or rolling window) is needed.","A":"When the dataset is large enough or computation is expensive, K-fold provides minimal benefit at significant cost. Single splits are commonly used in deep learning evaluations.","B":"","C":"K-fold doesn't cause overfitting in neural networks. It's computationally expensive, which is why practitioners often use a single validation set. The concern with K-fold and neural networks is purely computational.","D":"This is an important limitation — but the question asks about the single split being \"more appropriate.\" A single train-test split with proper temporal ordering is the preferred approach for time series, making D a valid secondary answer, but the primary reason is computational cost (B)."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14006","difficulty":"medium","orderIndex":6,"question":"A multi-class classifier (10 classes) is evaluated with macro F1 and weighted F1. Macro F1 = 0.62, weighted F1 = 0.84. Class distribution: 8 classes with 100 samples each, 2 classes with 5,000 samples each. 
The gap between the metrics reveals what about the model?","options":{"A":"Weighted F1 is always higher than macro F1 — the gap has no interpretation","B":"Macro F1 averages F1 per class equally — it is dominated by the 8 small classes where the model may perform poorly; weighted F1 weights each class by its sample count — the 2 large classes (5,000 samples each) dominate; the gap (0.84 vs 0.62) reveals that the model performs well on the large common classes but poorly on the small rare classes","C":"The gap means the model has high precision but low recall overall","D":"A weighted F1 of 0.84 means the model is production-ready — the macro F1 can be ignored"},"correct":"B","explanation":{"correct":"- Macro F1: $\\frac{1}{K}\\sum_{k=1}^K F1_k$. Each of 10 classes contributes equally. For illustration, if the 8 small classes have F1 ≈ 0.56 (weaker, due to limited data) and the 2 large classes have F1 ≈ 0.86: Macro F1 ≈ (8×0.56 + 2×0.86)/10 = 0.62, matching the scenario.\n- Weighted F1: $\\sum_{k=1}^K \\frac{n_k}{N} F1_k$. The 2 large classes hold 10,000 of the 10,800 samples (about 93% of the weight), so with the same illustrative per-class values: weighted F1 ≈ (10000×0.86 + 800×0.56)/10800 ≈ 0.84, dominated by the large classes.\n- Decision: if rare classes are important (e.g., rare disease detection, minority customer types), macro F1 is the relevant metric.","A":"The relationship between weighted and macro F1 depends on the class performance distribution. Weighted F1 is not always higher — if the model performs better on rare classes, macro F1 > weighted F1.","B":"","C":"Precision-recall decomposition is not directly revealed by the gap between macro and weighted F1. The gap reveals the performance differential between rare and common classes.","D":"Ignoring macro F1 means ignoring the model's performance on 8 of 10 classes. 
If those classes are business-relevant (fraud subtypes, disease variants), 0.62 macro F1 may be unacceptable."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14007","difficulty":"medium","orderIndex":7,"question":"A regression model is evaluated with RMSE = 100 on a dataset where house prices range from $50,000 to $5,000,000. A colleague evaluates the same model on a new dataset (prices $200,000-$400,000) and gets RMSE = 80. They claim \"the model improved by 20% on the new dataset.\" What is wrong with this comparison?","options":{"A":"RMSE values are always comparable across datasets","B":"RMSE is scale-dependent — an RMSE of 80 on $200K-$400K prices (a 200K range) represents 80/200K = 0.04% of the value range, much worse relative performance than RMSE of 100 on a $4.95M range (100/4,950,000 = 0.002%); use RMSE/mean or MAPE (Mean Absolute Percentage Error) for scale-normalized comparison","C":"RMSE cannot be used for regression problems with skewed distributions","D":"RMSE of 80 < RMSE of 100 always means better model performance regardless of scale"},"correct":"B","explanation":{"correct":"- Absolute RMSE is uninterpretable without context of the target variable's scale and variance. RMSE = 100 on a $5M range is ~0.002% error; RMSE = 80 on a $200K range is 0.04% error — the latter is 20× worse proportionally.\n- Normalized RMSE (NRMSE) = RMSE / (max - min) or RMSE / mean. MAPE = mean(|y - ŷ|/|y|) × 100% gives percentage errors that are directly comparable across datasets.\n- Caveat: MAPE is unstable when true values are near 0 and gives asymmetric penalties. Symmetric MAPE (sMAPE) or MASE (Mean Absolute Scaled Error) are more robust.","A":"RMSE values are only comparable across datasets with the same target variable scale and similar variance. Comparing raw RMSE across different-scale datasets is a common mistake.","B":"","C":"RMSE can be used for any regression problem. 
Skewed distributions may make MSE/RMSE highly sensitive to outliers, but they don't invalidate the metric.","D":"Lower absolute RMSE does not mean better model unless the datasets are comparable in scale. This is precisely the scale dependency problem."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14008","difficulty":"medium","orderIndex":8,"question":"A recommendation system uses Mean Average Precision (MAP) for evaluation. The system recommends 10 items; for a user, relevant items are at positions 1, 4, 7. Calculate the Average Precision for this user.","options":{"A":"Average Precision = (1.0 + 0.5 + 0.43) / 10 = 0.193","B":"Average Precision = (1.0 + 0.5 + 0.43) / 3 ≈ 0.643 — average Precision@k values only at the positions of relevant items, divided by the number of relevant items","C":"Average Precision = 3/10 = 0.3 — fraction of relevant items in top 10","D":"Average Precision = P@10 = 3/10 = 0.3"},"correct":"B","explanation":{"correct":"- Average Precision (AP): $AP = \\frac{1}{R} \\sum_{k=1}^{n} P@k \\times \\text{rel}(k)$ where $R$ = total relevant items, $\\text{rel}(k) = 1$ if item at position $k$ is relevant and $0$ otherwise.\n- Only sum precision values at positions of relevant items: $P@1 = 1/1 = 1.0$, $P@4 = 2/4 = 0.5$, $P@7 = 3/7 \\approx 0.429$.\n- $AP = (1.0 + 0.5 + 0.429) / 3 = 1.929 / 3 \\approx 0.643$.\n- MAP (Mean AP) averages AP across all users/queries. AP rewards systems that rank relevant items higher — rank 1 contributes more than rank 7.","A":"Dividing by 10 (list length) is incorrect. AP divides by the number of relevant items (3), not the recommendation list length.","B":"","C":"3/10 is the fraction of the 10 recommended items that are relevant, which is precision@10, not average precision (recall@10 here would be 3/3 = 1.0, since all 3 relevant items appear in the list). AP accounts for ranking position, not just how many relevant items were retrieved.","D":"P@10 = 3/10 = 0.3 is precision at the final cutoff. 
AP is a weighted average of precision at each relevant item's position, not precision at the end of the list."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14009","difficulty":"hard","orderIndex":9,"question":"Two models are compared on a test set of 1,000 samples. Model A: accuracy 87.5%. Model B: accuracy 86.2%. A data scientist reports \"Model A is better.\" A statistician asks for significance testing. What test is appropriate, and what is the minimum information needed?","options":{"A":"A t-test on accuracy values across multiple test folds","B":"McNemar's test — it requires the 2×2 contingency table of cases where both models agree or disagree: both correct (n_11), A correct/B wrong (n_10), A wrong/B correct (n_01), both wrong (n_00); McNemar's tests only the disagreement cells (n_10 vs n_01) because the agreement cells don't contribute information about which model is better","C":"A chi-squared test on the confusion matrices of both models","D":"No significance test is needed — a 1.3-point accuracy difference on 1,000 samples is always statistically significant"},"correct":"B","explanation":{"correct":"- McNemar's test: given paired binary outcomes (correct/incorrect for each sample), the test statistic is $\\chi^2 = (n_{10} - n_{01})^2 / (n_{10} + n_{01})$. Under $H_0$ (both models have equal error rate): $n_{10} = n_{01}$.\n- Why not t-test: binary outcomes (correct/incorrect) don't meet normality assumptions for a standard t-test. McNemar's is the non-parametric alternative for paired binary outcomes.\n- Effect size: if n_10=50 (A right, B wrong) and n_01=37 (B right, A wrong): $\\chi^2 = (50-37)^2/(50+37) = 169/87 \\approx 1.94$. For df=1, $p \\approx 0.16$ — not significant. A 1.3-point difference may not be significant.","A":"A t-test on K-fold accuracy values (across folds) is a common but problematic approach due to non-independence of K-fold test sets. 
McNemar's test on paired sample-level predictions is more principled.","B":"","C":"Chi-squared on confusion matrices tests whether performance on individual classes differs, not whether one model is globally better than the other. It's the wrong test for overall comparison.","D":"Statistical significance depends on effect size and sample size together. A 1.3-point difference on 1,000 samples can be statistically significant (p<0.05) or not, depending on the overlap in what the models correctly classify."},"reference":"- Dietterich, \"Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms\": https://www.mitpressjournals.org/doi/10.1162/089976698300017197"},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14010","difficulty":"hard","orderIndex":10,"question":"A calibration plot (reliability diagram) for a classifier shows: for predictions in the bin [0.7, 0.8], the actual positive rate is 0.45. For predictions in [0.3, 0.4], the actual positive rate is 0.55. What do these observations indicate, and how would you fix the calibration?","options":{"A":"The model is well-calibrated — slight deviations from the diagonal are expected","B":"The model is severely miscalibrated with inversion: samples predicted as highly positive (70-80% probability) have lower actual positive rate (45%) than samples predicted as moderately negative (30-40% probability, actual 55%); this suggests the model's sigmoid/softmax output is not a reliable probability estimate; fix: apply isotonic regression or Platt scaling to map raw scores to calibrated probabilities","C":"The model has low recall — calibration only measures precision","D":"The observations are impossible — model output probabilities and actual rates must maintain the same ordering"},"correct":"B","explanation":{"correct":"- Calibration: a model is calibrated if $P(y=1 | \\hat{p}(x) = p) = p$ for all $p$. 
Perfect calibration = reliability diagram on the diagonal.\n- The described model shows inverted calibration: high model scores correlate with lower actual positive rates. This is extreme miscalibration — the model's scores are negatively correlated with actual outcomes in some regions.\n- This can happen when a model is trained with inconsistent labels, when features that accidentally correlate negatively with labels are dominant, or when a model's decision boundary has flipped (e.g., incorrect label encoding).\n- Fixes: Platt scaling (logistic regression on model scores), isotonic regression (non-parametric monotone mapping). But inverted calibration is a severe model failure requiring investigation of the training pipeline.","A":"The described pattern is not a \"slight deviation.\" A 45% actual rate at 70-80% predicted probability and 55% actual rate at 30-40% predicted probability represents severe inversion, not noise.","B":"","C":"Calibration measures reliability of probability estimates, not just precision. Recall is about the classifier's sensitivity at a threshold; calibration is about whether predicted probabilities match actual frequencies.","D":"Model outputs and actual rates can have any relationship — especially for miscalibrated models. The model's raw output scores are transformed to probabilities through softmax/sigmoid and may not have a monotone relationship with ground truth."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14011","difficulty":"hard","orderIndex":11,"question":"A model is evaluated with Brier Score. Model A: Brier = 0.18. Model B: Brier = 0.22. A data scientist knows Model B achieves higher AUC-ROC. 
How can a model have better AUC but worse Brier Score, and what does each measure?","options":{"A":"AUC and Brier Score cannot give contradictory results — one must be computed incorrectly","B":"AUC-ROC measures ranking quality (can the model order positives above negatives?); Brier Score measures probabilistic calibration quality ($\\frac{1}{n}\\sum (p_i - y_i)^2$, where $p_i$ is predicted probability); a model can be an excellent ranker (high AUC) but produce poorly calibrated probabilities (high Brier); Model B ranks correctly but may output overconfident or underconfident probabilities; Model A may be a weaker ranker but outputs well-calibrated, reliable probabilities","C":"Model B has higher AUC, so it must have lower Brier Score — the scenario is inconsistent","D":"Brier Score and AUC measure exactly the same thing using different formulas"},"correct":"B","explanation":{"correct":"- AUC = P(rank correct): considers only relative ordering of predicted scores. Halving all probabilities (or applying any strictly increasing transformation) leaves AUC unchanged — rankings are preserved.\n- Brier Score = mean squared error between predicted probability and outcome: $BS = \\frac{1}{n}\\sum_{i=1}^n (\\hat{p}_i - y_i)^2$. Lower is better. Brier measures absolute probability accuracy.\n- Example: suppose Model B outputs P=0.9 for every positive and P=0.8 for every negative. The ranking is perfect (AUC = 1.0), but the probabilities are badly miscalibrated for negatives. With a 50% positive rate: Brier = 0.5×(0.9-1)² + 0.5×(0.8-0)² = 0.005 + 0.32 = 0.325. A model with a lower AUC but probabilities close to the observed frequencies can easily score a lower (better) Brier.","A":"AUC and Brier measure different properties. They can and do give contradictory rankings of models when ranking quality and probability calibration are different. This is well-documented.","B":"","C":"Higher AUC does not imply lower Brier Score. 
They measure fundamentally different aspects of model performance.","D":"AUC measures ranking discriminability; Brier measures probabilistic accuracy. They are different quantities with different mathematical formulations."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14012","difficulty":"hard","orderIndex":12,"question":"A researcher uses test set performance to select between 100 hyperparameter configurations. The best configuration achieves 92% accuracy on the test set. They report this as the model's expected production performance. A statistician warns about \"test set contamination.\" What is the concern and what is the principled fix?","options":{"A":"100 hyperparameter configurations is too many — 10 is the maximum for unbiased evaluation","B":"By selecting the best configuration out of 100 based on test performance, the reported accuracy is optimistically biased — even if all configurations are random, the best of 100 will score high by chance (multiple comparisons problem); the test set effectively becomes a validation set used for selection; production performance will be lower; the principled fix is nested cross-validation or a held-out final test set that is never used during hyperparameter selection","C":"The concern is only valid if the hyperparameters were tuned on the training set — using the test set for selection is always valid","D":"Test set contamination only occurs when feature selection is performed — hyperparameter tuning does not contaminate the test set"},"correct":"B","explanation":{"correct":"- Multiple comparisons inflation: the expected maximum of 100 independent tests at noise level follows the extreme value distribution. 
Even with random performance (expected 50% for a coin flip classifier), the max of 100 samples can appear much higher by chance.\n- For accuracy at 92%: if 100 random configurations achieve 88-92% by variance, selecting the best inflates the reported estimate. The true expected performance of this configuration on new data is lower.\n- Principled fix: (1) Use 3 splits: training (model fitting), validation (hyperparameter selection), test (final unbiased evaluation). (2) Nested cross-validation: outer loop for test evaluation, inner loop for hyperparameter selection. The outer test fold is never used in hyperparameter selection.","A":"There is no maximum number of configurations for a valid search, as long as a separate test set is never used during selection. The issue is test set use for selection, not the number of configurations.","B":"","C":"Using the test set for any selection (including hyperparameter selection) contaminates it. The test set should only be used once, after all model development decisions are finalized.","D":"Any use of the test set for model selection — feature selection, hyperparameter tuning, architecture search — contaminates it. The contamination is about using test labels to make modeling decisions."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15001","difficulty":"easy","orderIndex":1,"question":"A linear regression model achieves training MSE = 150 and test MSE = 155. A polynomial regression (degree 10) achieves training MSE = 5 and test MSE = 800. 
What do these results indicate about each model?","options":{"A":"The polynomial model is better because it achieves lower training error","B":"Linear regression shows high bias (training MSE=150, suggesting underfitting) but low variance (test≈train); polynomial model shows low bias (training MSE=5, near-perfect fit) but high variance (test MSE=800, severe overfitting); the polynomial model memorized the training data and cannot generalize","C":"Both models are equivalent because neither achieves zero training error","D":"The test-train gap in polynomial regression means the test set is too small"},"correct":"B","explanation":{"correct":"- Bias-variance decomposition of generalization error: $E[\\text{MSE}] = \\text{Bias}^2 + \\text{Variance} + \\text{Irreducible Error}$.\n- High bias (underfitting): model is too simple to capture the true pattern. Both training and test error are high, with small gap.\n- High variance (overfitting): model fits training noise. Training error is very low, but test error is high (large gap: 800 - 5 = 795).\n- The polynomial model's degree-10 flexibility fits the training data perfectly (including noise) but cannot generalize.","A":"Lower training error does not mean better model. Training error measures how well the model fits historical data, not how well it will generalize to new data. Minimizing training error is not the goal of machine learning.","B":"","C":"Training MSE of 5 vs 150 represents fundamentally different fitting capacity. They are not equivalent.","D":"The large gap is due to model complexity (variance), not test set size. A larger test set would show the same high test MSE — the problem is the model, not the evaluation."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15002","difficulty":"easy","orderIndex":2,"question":"A model's bias is defined as the systematic error: $\\text{Bias}[\\hat{f}(x)] = E[\\hat{f}(x)] - f(x)$. 
A student asks: \"if I train the same model 100 times on 100 different samples from the same population, what does variance measure?\" What is the correct answer?","options":{"A":"Variance measures how often the model's predictions are correct on test data","B":"Variance measures how much the model's predictions change across different training sets — $\\text{Var}[\\hat{f}(x)] = E[(\\hat{f}(x) - E[\\hat{f}(x)])^2]$; a high-variance model produces very different predictions depending on which training samples it happened to see; a low-variance model produces similar predictions regardless of the specific training set","C":"Variance is the average training error across 100 runs","D":"Variance measures the number of parameters in the model — more parameters means higher variance"},"correct":"B","explanation":{"correct":"- Thought experiment: train a decision tree (high variance) vs logistic regression (lower variance) on 100 different samples of size 100 from the same population. Decision trees will look very different from each run (different splits, different predictions). Logistic regression will produce similar coefficients across runs.\n- High variance → sensitive to specific training data → overfitting to noise. Low variance → stable predictions → less responsive to specific training samples.\n- This definition of variance is over the sampling distribution of training sets — not over the test set.","A":"Whether predictions are correct is accuracy, not variance. Variance is about consistency across different training sets, not correctness.","B":"","C":"Training error measures how well the model fits training data. Variance is about stability of predictions across different training samples.","D":"More parameters can enable higher variance, but variance is not the parameter count. 
A highly regularized 1,000-parameter model may have lower variance than an unregularized 10-parameter model."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15003","difficulty":"easy","orderIndex":3,"question":"Learning curves are plotted for a model: training error stays high as more data is added; validation error decreases and approaches (but stays above) the training error. What type of error does this pattern indicate?","options":{"A":"High variance — the model is overfitting","B":"High bias (underfitting) — training error is high from the start and doesn't improve significantly with more data; the validation error converges toward training error (they meet at a high value); adding more data will not fix this; the model is too simple to capture the true function; the fix is to increase model complexity","C":"The model is well-optimized — the learning curve indicates good generalization","D":"High variance — training and validation error converging means the model is memorizing training data"},"correct":"B","explanation":{"correct":"- High bias learning curve pattern: training error is already high with few samples; adding more data doesn't dramatically reduce it (the model can't capture the true pattern regardless of data volume); validation error rapidly decreases and converges toward the (high) training error.\n- Intuition: a linear model trying to fit a cubic relationship has a fixed irreducible error floor set by the misspecification. More data refines the linear fit but doesn't help it capture the cubic term.\n- Fix for high bias: increase model complexity (more features, higher polynomial degree, deeper network), reduce regularization, add feature interactions.","A":"High variance shows a large gap between training error (low) and validation error (high). With more data, the gap narrows. 
The described pattern has training error staying high — this is underfitting, not overfitting.","B":"","C":"High training error that doesn't decrease is diagnostic of underfitting. A well-optimized model would have low training error and validation error approaching it.","D":"Memorizing training data (high variance) produces low training error. The described pattern has high training error — the model is not memorizing; it's underfitting."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15004","difficulty":"easy","orderIndex":4,"question":"A data scientist increases the regularization strength (lambda) in a Ridge regression model from 0.01 to 100. Training error increases significantly, while test error first decreases then increases. What does this behavior demonstrate?","options":{"A":"Higher regularization always improves test performance — the final test error increase is a bug","B":"The bias-variance tradeoff: at lambda=0.01 (low regularization), the model has low bias but high variance; as lambda increases, bias increases (model coefficients are shrunk, reducing model flexibility) but variance decreases (predictions become more stable); there is an optimal lambda where the total error (bias² + variance) is minimized; beyond this point, the added bias from over-regularization exceeds the variance reduction","C":"The test error increase at high lambda means regularization should never be applied","D":"The increase in training error at high lambda indicates the model is overfitting to the regularization penalty"},"correct":"B","explanation":{"correct":"- As lambda → ∞: all Ridge coefficients → 0. The model predicts the training mean for every input — high bias (predicts nothing), very low variance.\n- The U-shaped test error curve as a function of lambda is the empirical manifestation of the bias-variance tradeoff. 
The minimum of this curve is the optimal lambda.\n- Cross-validation for lambda selection: evaluate test-like performance for many lambda values and select the one with minimum cross-validation error. sklearn's `RidgeCV` does this automatically.","A":"Regularization can harm performance if set too high. The optimal regularization is dataset-specific and should be tuned via cross-validation.","B":"","C":"Regularization is valuable when the model overfits (high variance). The optimal regularization reduces total error. The problem is only at extreme lambda values.","D":"Training error increasing with lambda is expected and correct behavior — regularization constrains the model and prevents it from fully fitting training data. This is not overfitting; it's the intended effect."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15005","difficulty":"medium","orderIndex":5,"question":"An ML practitioner says: \"I always use ensembling because it reduces both bias and variance simultaneously.\" A theorist disagrees. Which ensemble technique primarily reduces variance, and which primarily reduces bias?","options":{"A":"All ensemble methods reduce both bias and variance equally","B":"Bagging (Random Forest) primarily reduces variance — it averages many high-variance low-bias models; boosting (AdaBoost, gradient boosting) primarily reduces bias — it sequentially adds models that correct the residuals/errors of previous models, fitting progressively more complex functions; bagging does not reduce bias because it averages models of the same class with same expected prediction","C":"Boosting reduces variance and bagging reduces bias — the reverse of common understanding","D":"Stacking reduces both bias and variance while bagging and boosting each reduce only one"},"correct":"B","explanation":{"correct":"- Bagging variance reduction: $\\text{Var}(\\bar{X}) = \\rho \\sigma^2 + (1-\\rho)\\sigma^2/B$. 
Averaging $B$ models reduces variance toward the correlated floor $\\rho \\sigma^2$. Bias of the average = bias of individual trees (unchanged). Bagging works best when base models are high variance (deep decision trees).\n- Boosting bias reduction: each iteration fits residuals $r_i = y_i - F_{m-1}(x_i)$. The composite model's bias decreases as more iterations capture complex patterns. The combined model can represent functions that no single weak learner can.\n- Boosting does also reduce variance through regularization (learning rate, depth), but the primary theoretical mechanism is bias reduction.","A":"Bagging and boosting have different primary mechanisms — claiming equal reduction in both ignores the mathematical structure of each method.","B":"","C":"This is reversed. Bagging = variance reduction (averaging); boosting = bias reduction (sequential error correction). This is a common confusion in interviews.","D":"Stacking is a meta-learning approach that can reduce both, but it's not categorically different in this respect from boosting. The key distinction is bagging vs boosting, not stacking."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15006","difficulty":"medium","orderIndex":6,"question":"Modern deep neural networks have millions of parameters and can interpolate training data (achieve ~0% training loss). Classical statistical theory predicts these models should severely overfit. Yet they generalize well. This \"double descent\" phenomenon challenges classical theory. 
What is the classical bias-variance tradeoff prediction, and why does deep learning deviate?","options":{"A":"Deep learning doesn't overfit because it uses batch normalization, which prevents overfitting","B":"Classical theory: test error follows a U-shaped curve as model complexity increases — low complexity (high bias), optimal, then high complexity (high variance/overfitting); deep learning observes a \"double descent\" — beyond the interpolation threshold, test error decreases again with more model capacity; overparameterized models have an implicit regularization effect from SGD that finds flat minima generalizing well, challenging the classical overfitting prediction","C":"Deep neural networks don't overfit because they use dropout, which limits effective model capacity","D":"The classical bias-variance tradeoff only applies to linear models — it never predicted overfitting for neural networks"},"correct":"B","explanation":{"correct":"- Classical U-curve: at the interpolation threshold (when model exactly fits training data), test error is expected to peak. Beyond this, classical theory predicts continued high variance.\n- Double descent: Belkin et al. (2019) showed test error can decrease again in the overparameterized regime. Why? SGD with early stopping implicitly finds solutions with low norm (analogous to L2 regularization), preferring flat, well-generalizing minima.\n- Modern understanding: classical bias-variance analysis assumes a specific model class trained to convergence. Deep learning's implicit regularization from SGD, random initialization, and optimization trajectory changes the effective model.","A":"Batch normalization helps training stability and can reduce overfitting somewhat, but it's not the fundamental explanation for generalization in overparameterized networks. The double descent phenomenon occurs even without BatchNorm.","B":"","C":"Dropout is one regularization technique. 
Double descent occurs even in networks trained without dropout. The phenomenon is fundamental, not dependent on specific regularization techniques.","D":"The classical bias-variance tradeoff is a statistical principle that applies to all models. It did predict overfitting for overparameterized models — deep learning's empirical behavior contradicts this prediction, which is exactly what makes double descent theoretically interesting."},"reference":"- Belkin et al., \"Reconciling modern ML and the bias-variance tradeoff\": https://arxiv.org/abs/1812.11118"},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15007","difficulty":"medium","orderIndex":7,"question":"A neural network achieves 95% training accuracy and 94% test accuracy. A colleague says \"the bias-variance decomposition shows low bias and low variance.\" Without seeing the learning curves or multiple training runs, what cannot be concluded from these two numbers alone?","options":{"A":"The two numbers are sufficient to fully characterize the bias-variance tradeoff","B":"Without knowing the Bayes optimal error (irreducible error), you cannot determine the absolute bias — if the best possible accuracy on this task is 99%, then a 5% training error indicates high bias; if the best possible is 95%, then 5% training error is at the optimum; variance is estimated from multiple training runs, not from a single train/test comparison; a 1% gap is consistent with low variance, but could also reflect that the test set is easy","C":"95% training accuracy always means low bias and 1% gap always means low variance","D":"The 1% gap between train and test is definitively low variance — no additional information is needed"},"correct":"B","explanation":{"correct":"- Irreducible error (Bayes error): the minimum achievable error given the data's inherent noise and label ambiguity. 
For noisy labels (humans disagree on classification), Bayes error > 0.\n- If the Bayes error is 5% (best achievable accuracy 95%), a training accuracy of 95% means near-zero bias. If the Bayes error is 1% (best achievable accuracy 99%), the same 95% leaves a 4-point avoidable gap, which indicates substantial bias.\n- Variance estimation: requires observing how much train/test performance varies across multiple random training runs or data subsets. A single run gives one sample of the distribution.","A":"Two numbers (train accuracy, test accuracy) give partial information. Full bias-variance characterization requires knowledge of Bayes error and multiple training runs.","B":"","C":"\"Always\" is incorrect. The interpretation of 95% training accuracy depends on the task difficulty (Bayes error). A 1% gap is consistent with low variance but doesn't definitively establish it.","D":"A 1% train-test gap is consistent with low variance, but \"definitively\" is too strong. The particular test split might happen to be easier than the training set, creating an artificially small gap."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15008","difficulty":"hard","orderIndex":8,"question":"The bias-variance decomposition for 0-1 loss (classification) behaves differently than for squared loss (regression). For squared loss, $E[(y - \\hat{f}(x))^2] = \\text{Bias}^2 + \\text{Variance} + \\sigma^2$. For 0-1 loss, the bias and variance terms interact non-additively. 
What is the key implication of this difference?","options":{"A":"The bias-variance tradeoff does not apply to classification — only to regression","B":"For 0-1 loss, variance can actually reduce error when bias is high — a high-variance model may \"accidentally\" predict the correct class more often than a biased low-variance model in certain regions; bias and variance interact non-additively, so reducing variance doesn't always improve 0-1 loss; the decomposition is more complex and model selection for classification should use the actual 0-1 loss or a proper surrogate (log-loss, hinge loss) rather than the squared-loss decomposition","C":"For classification, bias and variance are exactly equal in magnitude — maximizing one minimizes the other","D":"The interaction only matters for multi-class problems, not binary classification"},"correct":"B","explanation":{"correct":"- Domingos (2000): for two-class 0-1 loss, the expected error at a point $x$ decomposes as $c_1 \\cdot \\text{Noise} + \\text{Bias} + c_2 \\cdot \\text{Variance}$, where $c_1$ is a noise weighting factor, $c_2 = +1$ if the main (most frequent) prediction at $x$ is correct, and $c_2 = -1$ if it is wrong. Variance therefore adds to the error on unbiased points but subtracts from it on biased points: when the main prediction is wrong, every deviation from it is a correct prediction.\n- Practical implication: reducing variance doesn't always help classification. Ensembling (which reduces variance) helps most where the base learners are unbiased and can even hurt where they are biased.\n- Practitioners should use log-loss or hinge loss for optimization, not squared loss, to get well-behaved loss landscapes for classification tasks.","A":"The bias-variance tradeoff applies to all supervised learning. For classification, the decomposition just has a more complex, non-additive form.","B":"","C":"Bias and variance are not equal in magnitude for classification. 
They have a non-trivial relationship that depends on the decision boundary and the true distribution.","D":"The interaction is a fundamental property of 0-1 loss regardless of the number of classes. It applies equally to binary and multi-class classification."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15009","difficulty":"hard","orderIndex":9,"question":"A data scientist claims: \"adding more training data reduces both bias and variance.\" A researcher disagrees on one point. Which part of the claim is incorrect?","options":{"A":"More data reduces variance but not bias — bias is a function of model misspecification, not data quantity","B":"More data reduces both bias and variance equally","C":"More data reduces bias but increases variance by providing more opportunities for the model to fit noise","D":"More data has no effect on either bias or variance for neural networks"},"correct":"A","explanation":{"correct":"- Variance decreases with more data: $\\text{Var}(\\hat{f}) \\approx \\sigma^2 \\times \\text{model complexity} / n$. As $n \\to \\infty$, variance → 0 (for any fixed model class). More samples → more stable parameter estimates.\n- Bias does NOT decrease with more data: bias is the error due to model misspecification — the gap between the best model in the model class and the true function. A linear model fit to a million samples of a nonlinear function still has the same bias as a linear model fit to 100 samples (the mean prediction converges to the best linear approximation, which is still far from the true nonlinear function).\n- Exception: if model complexity is allowed to grow with data (e.g., using a kernel with adaptive bandwidth, or a neural network with more capacity), both bias and variance may change.","A":"","B":"More data does not reduce bias for a fixed model class. 
The \"fixed model class\" qualifier is critical.","C":"This option has both effects backwards. For a fixed model class, more data typically lowers variance (parameter estimates stabilize) and leaves bias essentially unchanged.","D":"Neural networks with fixed architecture have variance that decreases with more data, just like other models. More data helps neural network generalization."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15010","difficulty":"hard","orderIndex":10,"question":"Dropout in neural networks is typically described as a regularization technique that reduces overfitting. Using the bias-variance framework, explain precisely how dropout reduces variance and whether it has any bias cost.","options":{"A":"Dropout reduces both bias and variance to zero at high dropout rates","B":"Dropout randomly deactivates neurons during training, forcing the network to learn redundant representations; this is equivalent to training an ensemble of exponentially many different sub-networks; the ensemble's predictions are averaged (approximated at test time by weight scaling), reducing variance; the bias cost: a high dropout rate may prevent individual neurons from specializing, reducing the model's effective capacity and introducing bias — small dropout rates (0.1-0.3) are usually variance-reducing without significant bias increase","C":"Dropout has no effect on bias — it only reduces variance by zeroing out weights","D":"Dropout reduces bias by preventing neurons from relying on spurious correlations, with no variance effect"},"correct":"B","explanation":{"correct":"- Dropout as ensemble: with dropout rate $p$, each forward pass uses a different sub-network. Training produces a distribution over sub-networks. 
Inference averages predictions over this distribution (via weight scaling approximation), analogous to bagging.\n- Averaging reduces variance: the averaged prediction $E[\\hat{f}_\\theta(x)]$ across sub-networks has lower variance than any single sub-network prediction.\n- Bias cost: at high dropout rates (e.g., 0.7), many neurons are dropped per batch. Each sub-network has very few active neurons — may underfit complex patterns, increasing bias. This is why tuning dropout rate is important.\n- Common rates: 0.5 for fully connected layers in the original dropout paper; 0.1-0.3 for convolutional layers.","A":"Dropout at high rates (approaching 1.0) would prevent any learning — catastrophically high bias. Zero bias is not achievable with dropout.","B":"","C":"Dropout does affect bias at high dropout rates by limiting effective model capacity. The bias cost is often small at typical dropout rates but is not zero.","D":"The primary mechanism of dropout is variance reduction (ensemble averaging), not bias reduction. Bias reduction from removing spurious correlations is a secondary effect, not the primary mechanism."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15011","difficulty":"hard","orderIndex":11,"question":"A practitioner tunes a gradient boosting model. As the number of trees increases from 10 to 10,000 (with learning rate 0.01, no early stopping), training error decreases to near 0 while test error first decreases then increases. This pattern exactly mirrors the classical bias-variance tradeoff curve. 
What is the \"complexity\" axis in this context, and how do learning rate and tree depth interact with the tradeoff?","options":{"A":"The number of trees is the complexity axis; learning rate and depth have no effect on the tradeoff","B":"The number of trees (iterations) is the complexity axis for gradient boosting — more trees = lower bias (more complex function fitted to residuals), higher variance (more susceptible to noise); learning rate scales the contribution of each tree: small learning rate requires more trees to achieve the same bias reduction, making the tradeoff curve flatter; tree depth controls the individual tree's complexity — deeper trees reduce bias faster per iteration but also increase variance per tree; optimal performance requires jointly tuning trees, learning rate, and depth","C":"In gradient boosting, there is no bias-variance tradeoff — only overfitting and underfitting","D":"More trees in gradient boosting always reduces both bias and variance simultaneously"},"correct":"B","explanation":{"correct":"- Gradient boosting complexity: each additional tree adds a residual-fitting component. More trees → model can approximate more complex functions (lower bias); each tree fits residuals that may include noise → model is more sensitive to training noise (higher variance).\n- Learning rate $\\eta$: shrinks each tree's contribution. Small $\\eta$ → smooth interpolation requires more trees to reach the same function complexity → optimal tree count shifts right. The bias-variance curve is \"stretched\" horizontally.\n- Tree depth: shallow trees (depth 1 = stumps) are high-bias, low-variance weak learners. Deep trees reduce bias faster per iteration but add variance. 
XGBoost's default `max_depth` is 6; LightGBM leaves depth unlimited by default (`max_depth=-1`) and limits tree size through `num_leaves` (default 31).\n- Early stopping: halts training when validation error starts rising, directly finding the optimal point on the bias-variance curve.","A":"Learning rate and depth fundamentally affect where the optimal point on the bias-variance curve lies. They are not independent of the tradeoff.","B":"","C":"Gradient boosting exhibits a clear bias-variance tradeoff. The test error U-shape described in the question is exactly this tradeoff.","D":"More trees in gradient boosting keep reducing bias but eventually increase variance (unlike Random Forest, where more trees only reduce variance). This is a key difference between boosting and bagging ensembles."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16001","difficulty":"easy","orderIndex":1,"question":"L1 (Lasso) and L2 (Ridge) regularization both add a penalty to the loss function. Lasso has the property of producing sparse solutions (many weights exactly zero). L2 does not. Why does L1 produce sparsity while L2 does not?","options":{"A":"L1 uses a smaller penalty coefficient than L2, which causes weights to become exactly zero","B":"The L1 penalty ($\\lambda|w|$) has a non-smooth gradient (subdifferential) at $w=0$ — the penalty function has a \"corner\" at zero; when the gradient of the data loss is smaller than $\\lambda$, the optimal solution is exactly $w=0$; L2 penalty ($\\lambda w^2$) has a smooth gradient that approaches 0 as $w \\to 0$ — L2 never pushes weights to exactly zero, only close to zero","C":"L1 regularization is stronger than L2, so it forces more weights to zero through larger penalties","D":"L2 regularization cannot shrink weights at all — it only reduces the learning rate"},"correct":"B","explanation":{"correct":"- Geometric intuition: L1 constraint region is a diamond (in 2D), L2 is a sphere. 
The optimal solution (where the loss ellipse touches the constraint boundary) tends to land on the corners of the diamond (sparse points) for L1. The sphere has no corners, so solutions rarely land exactly on an axis.\n- Subgradient at zero: L1 derivative is $\\lambda \\times \\text{sign}(w)$, undefined at $w=0$ — the subdifferential is $[-\\lambda, \\lambda]$. If the gradient of the data loss at $w=0$ is within $[-\\lambda, \\lambda]$, setting $w=0$ is optimal.\n- L2 derivative: $2\\lambda w$ → 0 as $w \\to 0$. The gradient always points toward (but never reaches) zero — it only asymptotically approaches zero.","A":"The strength of regularization (lambda value) is comparable between L1 and L2. The sparsity is a geometric property of the L1 norm, not a result of using smaller lambda.","B":"","C":"L1 and L2 with the same lambda have different magnitudes — neither is inherently \"stronger.\" The sparsity property is about the geometry of the penalty function, not just its magnitude.","D":"L2 shrinks weights toward (but not to) zero. This is its primary mechanism — it reduces weight magnitude, preventing any single feature from dominating. It doesn't reduce learning rate."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16002","difficulty":"easy","orderIndex":2,"question":"ElasticNet regularization combines L1 and L2 penalties: $L = \\text{MSE} + \\lambda_1 ||w||_1 + \\lambda_2 ||w||_2^2$. A practitioner chooses ElasticNet over Lasso for a dataset with 50 features where 30 are correlated in groups of 5. 
Why is Lasso alone insufficient here?","options":{"A":"Lasso is always worse than ElasticNet — ElasticNet is the superior method","B":"Lasso tends to arbitrarily select one feature from a group of correlated features and set the others to exactly zero — within a correlated group, it doesn't consistently select the most relevant feature; ElasticNet's L2 component groups correlated features together (similar to Ridge), while the L1 component still produces sparsity — correlated features get similar non-zero coefficients rather than one arbitrarily selected","C":"ElasticNet is chosen because it requires fewer hyperparameters than Lasso","D":"Lasso cannot handle datasets with more features than samples; ElasticNet can"},"correct":"B","explanation":{"correct":"- Lasso and correlated features: the Lasso solution is not unique when features are highly correlated. It may select any one feature from a correlated group — the selection depends on numerical noise and specific optimization path. This is called \"inconsistent variable selection.\"\n- ElasticNet: L2 component adds a grouping effect (correlated features are selected/deselected together). L1 maintains overall sparsity. The combination is more stable and interpretable for correlated feature groups.\n- Practical example: in genomics (correlated gene expressions within pathways), ElasticNet selects representative genes from each pathway rather than arbitrary single genes.","A":"Lasso is sufficient and often preferred when features are independent and a sparse model is the goal. ElasticNet's advantage is specifically for correlated feature scenarios.","B":"","C":"ElasticNet has TWO hyperparameters ($\\lambda_1, \\lambda_2$) vs Lasso's ONE ($\\lambda$). ElasticNet requires more hyperparameter tuning, not less.","D":"Lasso can handle p >> n scenarios (in fact, it's one of the primary tools for high-dimensional sparse regression). 
The issue is correlated feature instability, not p >> n."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16003","difficulty":"easy","orderIndex":3,"question":"A logistic regression model is trained on a dataset with 1,000 features and 500 training samples. Without regularization, the model achieves 100% training accuracy but 61% test accuracy. With L2 regularization (C=0.01 in sklearn, meaning strong regularization), training accuracy drops to 75% and test accuracy improves to 84%. What caused the improvement?","options":{"A":"L2 regularization improved the model by removing irrelevant features","B":"Without regularization, logistic regression can perfectly separate the 500 training samples in 1,000-dimensional space (many separating hyperplanes exist) — the learned coefficients are huge and unstable, perfectly fitting noise; L2 regularization constrains coefficient magnitudes (equivalently, $||w||_2^2 \\leq t$ for a budget $t$ that shrinks as $\\lambda$ grows), preventing overfitting to noise; the lower training accuracy reflects the regularization constraint, but the model generalizes better by not memorizing noise","C":"The improvement occurred because L2 regularization increased the number of training samples","D":"High training accuracy with low test accuracy indicates the test set is harder than training, not overfitting"},"correct":"B","explanation":{"correct":"- With p=1,000 > n=500: infinitely many hyperplanes separate the training data. Without regularization, the optimization finds a hyperplane that perfectly classifies training data but relies on noise correlations.\n- L2 regularization is equivalent to constraining the weight vector to lie within a ball whose radius shrinks as $\\lambda$ grows (the exact radius depends on the data). This prevents large weights that overfit to noise.\n- Sklearn's parameter C = $1/\\lambda$ (inverse of regularization strength). 
C=0.01 means strong regularization ($\\lambda = 100$), which heavily constrains coefficient magnitudes.","A":"L2 regularization shrinks all coefficients but keeps all features (no sparsity). Feature removal is what L1 regularization does. L2 improves generalization by coefficient shrinkage, not feature elimination.","B":"","C":"Regularization doesn't change the number of training samples. It changes how the model is fitted to the existing samples.","D":"With 1,000 features and 500 samples, perfect training accuracy is a strong sign of overfitting. Test accuracy of 61% (only modestly above the 50% chance level if this is a balanced binary problem) is consistent with the model having memorized training noise."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16004","difficulty":"easy","orderIndex":4,"question":"Early stopping is used in neural network training as an implicit regularization technique. A model's validation loss starts increasing at epoch 50 while training loss continues decreasing. The model is stopped at epoch 50. Why does early stopping reduce overfitting?","options":{"A":"Early stopping prevents gradient descent from converging to the global minimum, which would overfit","B":"Early stopping prevents the model from fitting training noise in later epochs — in early training, gradient descent first captures broad patterns (high gradient signal); in later epochs, the model increasingly fits residual noise (small gradient updates in high-frequency noise directions); stopping before this phase prevents memorizing noise; it is equivalent to keeping the model in a lower effective complexity region","C":"Early stopping reduces overfitting by reducing the learning rate automatically","D":"Early stopping is equivalent to L1 regularization because it also produces sparse models"},"correct":"B","explanation":{"correct":"- Gradient dynamics: early in training, gradients are large and the model captures dominant patterns. 
As training progresses, the optimization explores finer structure that may reflect training-set-specific noise.\n- Formal equivalence (for linear models): Bishop (1995) showed early stopping in gradient descent is equivalent to L2 regularization, where the effective regularization strength is inversely proportional to the number of iterations.\n- Practical implementation: monitor validation loss; save model checkpoints; restore best checkpoint when validation loss stops improving. Patience parameter: how many epochs to wait before stopping.","A":"Early stopping does prevent reaching the global minimum of training loss. But the global minimum of training loss is not the goal — the global minimum of expected generalization loss is. These are different, especially with overparameterized models.","B":"","C":"Early stopping doesn't change the learning rate schedule. It stops training at a fixed learning rate. Learning rate scheduling is a separate technique.","D":"Early stopping has no sparsity property. It is most closely equivalent to L2 regularization (shrinking weights from their fully trained values). L1's sparsity comes from the subdifferential property of the L1 norm."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16005","difficulty":"medium","orderIndex":5,"question":"A Ridge regression model is trained with lambda=10. The model coefficients are: $w = [0.8, 0.6, 0.3, 0.1]$ for features [A, B, C, D]. Feature D has been judged irrelevant by domain experts. 
A data scientist says \"since $w_D = 0.1 \\approx 0$, Ridge has effectively removed Feature D.\" Why is this claim problematic?","options":{"A":"Ridge has actually set $w_D$ to exactly 0, confirming the claim","B":"Ridge shrinks but doesn't zero out coefficients — $w_D = 0.1$ is still non-zero; at prediction time, Feature D still contributes to every prediction; moreover, Ridge with lambda=10 has shrunk ALL coefficients toward zero, not just irrelevant ones; the \"small\" coefficient may reflect both the feature's low relevance AND the regularization penalty compressing the true coefficient; the proper approach for feature removal is L1 (Lasso) or explicit feature selection","C":"Ridge regularization is specifically designed to identify irrelevant features — the smallest coefficient is always the least relevant","D":"A coefficient of 0.1 is practically zero — Ridge has removed Feature D for all practical purposes"},"correct":"B","explanation":{"correct":"- Ridge coefficient: $\\hat{w}^{Ridge} = \\hat{w}^{OLS} / (1 + \\lambda)$ (in orthogonal feature case). With lambda=10, every coefficient is shrunk by a factor of 11. Feature D's true OLS coefficient might be 1.1 (significant!) but Ridge shrinks it to 0.1.\n- Coefficient magnitude under Ridge reflects BOTH feature relevance AND regularization penalty. Comparing coefficients across features is valid only if features are standardized AND lambda is accounted for.\n- For feature selection: use L1 (Lasso) which explicitly zeros coefficients, or model-agnostic methods (permutation importance, SHAP) that measure the actual predictive contribution.","A":"Ridge does not produce exactly zero coefficients by design. This is mathematically guaranteed by the smooth L2 penalty.","B":"","C":"The smallest Ridge coefficient is not necessarily the least relevant feature. 
A relevant feature with high collinearity with other features may have a small Ridge coefficient, while an irrelevant but independent feature may have a moderate coefficient.","D":"\"Practically zero\" is a judgment call, but in a prediction context, 0.1 contributes to every prediction. More importantly, without knowing the unregularized coefficient, you cannot distinguish \"small because irrelevant\" from \"small because regularized.\""}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16006","difficulty":"medium","orderIndex":6,"question":"Dropout with rate 0.5 is applied to a layer with 100 neurons during training. At test time, the dropout is turned off and weights are multiplied by 0.5 (the original dropout formulation; inverted dropout instead rescales during training). Why is this weight scaling necessary?","options":{"A":"Weight scaling prevents numerical overflow in large networks","B":"During training with 50% dropout, each neuron is active on average 50% of the time — its expected contribution to the next layer is halved; at test time, all neurons are active; without scaling, the expected input to the next layer doubles compared to training; multiplying weights by 0.5 at test time (or equivalently, multiplying activations by 0.5 at test time) ensures the same expected signal magnitude at test time as during training","C":"Weight scaling at test time doubles the model's capacity to compensate for lost neurons during training","D":"Weight scaling is only needed for convolutional layers — fully connected layers don't require it"},"correct":"B","explanation":{"correct":"- Without scaling: a neuron with weight $w$ connecting to the next layer contributes $w \\times a$, where $a$ is its activation value. During training with 50% dropout: expected contribution = $0.5 \\times w \\times a$. 
At test time (no dropout): contribution = $w \\times a$ — twice the expected training contribution.\n- This mismatch between training-time and test-time activation scales would cause the model to produce larger activations at test time, effectively changing the model's behavior. The network was trained to expect 0.5× contributions.\n- Fix: either (1) multiply weights by 0.5 at test time (the original formulation), or (2) during training, scale the surviving activations by $1/(1-p) = 2$ so the expected activation magnitude already matches test time (\"inverted dropout\", the standard implementation in frameworks like PyTorch and TensorFlow).","A":"Weight scaling prevents train-test distribution mismatch, not numerical overflow. Overflow would be handled by gradient clipping or proper weight initialization.","B":"","C":"Scaling by 0.5 halves weights — it doesn't double capacity. The scaling maintains the same expected activation magnitude, not a doubled capacity.","D":"Dropout and weight scaling apply to any layer type. The activation-magnitude mismatch issue is present for any layer where neurons are randomly dropped."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16007","difficulty":"medium","orderIndex":7,"question":"A team uses L1 regularization on a linear model with 500 features. After tuning lambda, 480 features are zeroed out, leaving 20 non-zero features. They claim \"the L1 model selected the 20 most important features.\" A statistician cautions against this claim. 
Why?","options":{"A":"L1 always selects the correct features — the statistician is wrong","B":"L1 feature selection is not stable — small changes in training data or lambda can change which 20 features are selected; when multiple features have similar predictive power, L1 arbitrarily picks one (as with correlated features); the selected set may also change with different regularization paths; for reliable feature selection, use stability selection (run L1 many times with subsampling and select features that consistently appear) or confirm with permutation importance","C":"L1 can only zero features, not identify important ones — it should be replaced with L2","D":"L1 is inconsistent for variable selection when there are more than 100 features"},"correct":"B","explanation":{"correct":"- L1 inconsistency in correlated groups: if features 1, 2, 3 are correlated and all predictive, L1 may select feature 1 in one run and feature 2 in another (depending on numerical noise, bootstrap sample, random initialization of optimization).\n- Stability selection (Meinshausen & Bühlmann 2010): run Lasso on 100 bootstrap subsamples, count how often each feature is selected. Features selected in >80% of runs are stable selections.\n- Near-equal lambda sensitivity: at the exact regularization level, two features may be equally competitive. Small perturbations determine which is selected.\n- Practical implication: report \"these 20 features were selected on this dataset with this lambda\" rather than \"these are the 20 most important features.\"","A":"L1 feature selection is not provably correct in the presence of correlated features. Stability analysis is needed to validate selections.","B":"","C":"L1 does both select features (via sparsity) and identify predictive ones. The caveat is stability, not the mechanism.","D":"There is no established feature count threshold for L1 consistency. 
The issue is correlation structure, not the absolute number of features."},"reference":"- Stability Selection: https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2010.00740.x"},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16008","difficulty":"hard","orderIndex":8,"question":"Two Ridge regression models are trained: Model A with $\\lambda = 10$, Model B with $\\lambda = 10,000$. Model B's coefficients are all very close to zero but not exactly zero. A student asks: \"what is the closed-form solution for Ridge regression, and how does lambda control coefficient magnitude?\" Provide the answer.","codeSnippet":"# Ridge regression adds L2 penalty:\n# min ||y - Xw||² + λ||w||²\n# Closed-form: w = (X^T X + λI)^{-1} X^T y","options":{"A":"Ridge has no closed-form solution — it must be solved iteratively","B":"The closed form is $\\hat{w} = (X^TX + \\lambda I)^{-1}X^Ty$; as $\\lambda \\to 0$: reduces to OLS; as $\\lambda \\to \\infty$: $(X^TX + \\lambda I)^{-1} \\approx \\lambda^{-1} I \\to 0$, so $\\hat{w} \\to 0$; the $\\lambda I$ term adds a positive constant to the diagonal, making the matrix invertible even when $X^TX$ is singular (high collinearity); larger $\\lambda$ shrinks all coefficients proportionally toward zero","C":"The closed form is $\\hat{w} = (X^TX)^{-1}X^Ty - \\lambda I$ — Ridge subtracts lambda from OLS coefficients","D":"The closed form requires inverting an $n \\times n$ matrix, making Ridge computationally infeasible for large datasets"},"correct":"B","explanation":{"correct":"- Deriving the closed form: take the derivative of $||y - Xw||^2 + \\lambda||w||^2$ with respect to $w$ and set to zero: $-2X^T(y - Xw) + 2\\lambda w = 0 \\to (X^TX + \\lambda I)w = X^Ty \\to w = (X^TX + \\lambda I)^{-1}X^Ty$.\n- Stabilizing ill-conditioned systems: $X^TX$ may have near-zero eigenvalues (collinear features), making OLS unstable. 
Adding $\\lambda I$ shifts all eigenvalues of $X^TX$ by $\\lambda$: $(\\sigma_i^2 + \\lambda)^{-1}$ replaces $\\sigma_i^{-2}$, where $\\sigma_i$ are the singular values of $X$. For small $\\sigma_i$, Ridge prevents coefficient explosion.\n- Computation: $X^TX$ is $p \\times p$ — for large $p$, computing $(X^TX + \\lambda I)^{-1}$ is $O(p^3)$. For large $n$, small $p$: efficient. For large $p$: use an iterative solver such as conjugate gradient; for direct solves, prefer a Cholesky factorization over forming the explicit inverse.","A":"Ridge regression has an explicit closed-form solution, unlike L1 (which requires iterative coordinate descent or sub-gradient methods due to non-differentiability).","B":"","C":"This formula is incorrect. Ridge doesn't subtract lambda from OLS estimates. The correct formula changes the matrix to be inverted.","D":"Ridge inverts a $p \\times p$ matrix, not $n \\times n$. For typical problems where $p < n$, this is computationally tractable."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16009","difficulty":"hard","orderIndex":9,"question":"A deep learning model uses both dropout (rate=0.3) and L2 weight decay ($\\lambda = 0.001$). A researcher says these two regularization techniques are redundant for neural networks. 
Are they equivalent, and why or why not?","options":{"A":"Dropout and L2 weight decay are mathematically equivalent for all architectures","B":"They are not equivalent: L2 weight decay penalizes large weights by adding $\\lambda ||w||^2$ to the loss, shrinking all weights proportionally toward zero during gradient updates; dropout randomly deactivates neurons, forcing distributed representations and ensemble-like behavior; they address different sources of overfitting — L2 reduces weight magnitude (preventing reliance on large activations), while dropout prevents co-adaptation of neurons; in practice they are complementary and can improve performance together","C":"Dropout makes L2 redundant because both shrink weights toward zero","D":"L2 weight decay and dropout cancel each other out — applying both produces a worse model than either alone"},"correct":"B","explanation":{"correct":"- L2 weight decay mechanism: adds a term proportional to $w$ to the gradient ($2\\lambda w$ for the penalty $\\lambda ||w||^2$). Effect: all weights are shrunk by a constant fraction each update, preventing large weights.\n- Dropout mechanism: randomly zeros activations. Effect: each neuron cannot rely on specific co-activations → learns more robust, distributed features.\n- Different failure modes addressed: L2 prevents overparameterized models from learning degenerate large-magnitude solutions; dropout prevents neurons from co-adapting (a type of feature interaction overfitting L2 does not address).\n- Note: for adaptive optimizers (Adam, RMSprop), \"weight decay\" and \"L2 regularization\" are NOT equivalent — the interaction with the adaptive learning rate makes them different (AdamW implements decoupled weight decay to address this).","A":"Mathematical equivalence only holds in specific cases (linear models, SGD without momentum). For nonlinear networks with adaptive optimizers, they are not equivalent.","B":"","C":"Dropout does not shrink weights toward zero — it randomly zeros activations during training. 
Weights can grow large; the stochasticity prevents specific feature detector pairs from always co-occurring.","D":"Combined regularization generally outperforms either alone by addressing multiple sources of overfitting. There is no cancellation — they operate on different mechanisms."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16010","difficulty":"hard","orderIndex":10,"question":"Batch normalization is described as a regularization technique in addition to being an acceleration technique. Explain the mechanism by which batch normalization provides implicit regularization, and why it can sometimes replace dropout.","options":{"A":"Batch normalization regularizes by adding Gaussian noise to every layer's output","B":"Batch normalization normalizes each feature by the batch statistics (mean, variance) which vary across mini-batches; this introduces stochastic noise into the training process — a sample's normalization depends on the other samples in its batch; this noise acts as a form of regularization, preventing the network from overfitting to individual sample patterns; when batch sizes are small, this noise is larger, providing more regularization; at test time, running averages replace batch statistics, so the noise is present only during training and inference is deterministic","C":"Batch normalization regularizes only by reducing internal covariate shift — it has no noise effect","D":"Batch normalization is identical to dropout with rate=0.1 — they can always be interchanged"},"correct":"B","explanation":{"correct":"- Stochastic element: during training, $\\mu_B = \\frac{1}{m}\\sum x_i$ and $\\sigma_B^2 = \\frac{1}{m}\\sum(x_i - \\mu_B)^2$ depend on the randomly sampled mini-batch. 
For sample $x_j$: $\\hat{x}_j = (x_j - \\mu_B)/\\sigma_B$ — the normalized value depends on which other samples are in the batch (stochastic).\n- This is why changing batch size affects generalization: small batches = noisy $\\mu_B, \\sigma_B$ = more regularization but less stable training.\n- BN as dropout replacement: Ioffe & Szegedy (original BN paper) observed that BN reduced the need for dropout. Modern architectures often use BN without dropout for convolutional layers.","A":"BatchNorm doesn't add Gaussian noise explicitly. The stochasticity comes from mini-batch sampling. Adding explicit Gaussian noise is a separate technique (data augmentation via noise injection).","B":"","C":"The internal covariate shift reduction (normalizing layer inputs) is the primary motivation. The regularization effect from batch statistics stochasticity is a secondary benefit. Both mechanisms are real.","D":"BatchNorm and dropout are not identical. Dropout creates stochasticity by zeroing activations; BatchNorm creates stochasticity through batch statistics. They have different properties and are often used together in many architectures."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17001","difficulty":"easy","orderIndex":1,"question":"A filter method (chi-squared test) is used to select the top 20 features from 100. A wrapper method (recursive feature elimination with cross-validation) is also run on the same dataset. The wrapper method achieves 4% higher test accuracy but takes 50× longer. 
A team lead asks: \"which should we use in production?\" What are the correct trade-offs?","options":{"A":"Filter methods are always better because they are faster","B":"Filter methods score features independently of the model — they are fast (O(p) evaluations) but ignore feature interactions; wrapper methods evaluate feature subsets using the actual downstream model, capturing interactions — they are slower (O(p²) to O(2^p) evaluations) but more accurate; for production pipelines with computational budget, filter methods are preferred for fast iteration; when accuracy is critical and features have interactions, wrapper methods are justified","C":"Wrapper methods are always better because they use the actual model","D":"The 4% accuracy difference proves filter methods are unusable for any serious ML task"},"correct":"B","explanation":{"correct":"- Filter method (chi-squared, mutual information, variance threshold): evaluates each feature independently using a statistical test. Cannot detect interactions: feature A alone is useless, but A+B together are predictive (XOR problem).\n- Wrapper method (RFE, forward/backward selection): fits the model on feature subsets. Captures all interactions the model can use. Computational cost: RFE fits the model p times (backward elimination); forward selection fits p×(p/2) times.\n- Practical decision matrix: small dataset + high accuracy requirement → wrapper; large dataset + many features + computational budget → filter as first pass, wrapper on top-K features.","A":"Filter methods miss feature interactions and model-specific synergies. The 4% accuracy gap in the example shows wrappers can provide meaningful improvement.","B":"","C":"Wrapper methods are computationally expensive and can overfit the feature selection process if cross-validation is not properly implemented. They are not always better.","D":"4% may or may not be practically significant depending on the task. 
Filter methods are widely used in production systems (e.g., mutual information for feature selection in recommendation systems)."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17002","difficulty":"easy","orderIndex":2,"question":"A dataset has a categorical feature \"City\" with 500 unique values. One-hot encoding would create 500 binary features. A data scientist suggests using label encoding (assigning integers 1-500) instead. What is the key problem with label encoding for a nominal categorical feature?","options":{"A":"Label encoding creates too many features — it should never be used","B":"Label encoding imposes an artificial ordinal relationship — it implies City 1 < City 2 < City 500, which is meaningless for nominal categories; a linear model or distance-based algorithm will interpret the numerical values as having relative magnitude; this creates false structure that misleads the model","C":"Label encoding is the best method for high-cardinality categoricals — the data scientist is correct","D":"Label encoding and one-hot encoding are equivalent for tree-based models and linear models"},"correct":"B","explanation":{"correct":"- Nominal: no inherent order (London, Paris, Tokyo are not ranked). Ordinal: has inherent order (low, medium, high).\n- Label encoding: assigns one integer per city (sklearn's `LabelEncoder` uses 0-499 rather than 1-500). Linear regression would learn a single coefficient for \"city\", so a city encoded as 100 would contribute 100× as much as a city encoded as 1, which is meaningless arithmetic.\n- One-hot encoding: creates binary indicator per city. The model learns an independent coefficient per city — no ordering imposed. But 500 features is expensive.\n- Better alternatives for high-cardinality: target encoding (replace city with mean target value for that city), frequency encoding, entity embeddings (in deep learning).","A":"Label encoding is valid for ordinal features (education level: high school < bachelor < master). 
The problem is only with nominal categoricals.","B":"","C":"Label encoding for high-cardinality nominal features is a well-known mistake. It's commonly done by beginners who confuse nominal and ordinal encoding requirements.","D":"Tree-based models (decision trees, RF, gradient boosting) can effectively use label-encoded features because they split on individual thresholds — the ordinal assumption doesn't affect them as much. But linear models and distance-based methods are significantly harmed by label encoding of nominal features."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17003","difficulty":"easy","orderIndex":3,"question":"A dataset has a feature \"Age\" with 5% missing values. Three imputation strategies are proposed: (1) Mean imputation, (2) Median imputation, (3) Forward-fill (use the previous row's value). A data scientist says all three are equivalent. When does each strategy fail?","options":{"A":"All three strategies are equivalent because they all produce valid numerical values","B":"Mean imputation fails when age is skewed (outliers pull the mean away from the typical value); median imputation is robust to skewness but fails for time-series data where temporal patterns matter; forward-fill fails for non-time-series tabular data where row order is arbitrary and the \"previous\" row has no relationship to the current row; correct strategy depends on data type, distribution, and whether ordering is meaningful","C":"Forward-fill is always best because it uses observed data","D":"Missing values should always be dropped — imputation always introduces bias"},"correct":"B","explanation":{"correct":"- Mean imputation: replaces missing with $\\bar{x}$. For a skewed distribution (house prices, income), mean is pulled by outliers — imputing with mean artificially concentrates data at a skewed mean.\n- Median imputation: replaces with median (50th percentile). 
Robust to outliers. But for time-series, a patient's age at time T should be near their age at time T-1 — median of all ages is unrelated to temporal continuity.\n- Forward-fill: uses the last observed value (LOCF — last observation carried forward). For time-series (stock prices, sensor readings), this is reasonable. For a shuffled tabular dataset (rows are independent customers), the \"previous\" row is random — forward-fill introduces noise.\n- Best practice: model-based imputation (MICE, KNN imputation) captures correlations with other features.","A":"The strategies produce different imputed values and have different statistical properties. The choice has a measurable impact on downstream model performance.","B":"","C":"Forward-fill only uses \"observed data\" meaningfully when row order is temporally or logically meaningful. For random-order tabular data, forward-fill is essentially injecting noise.","D":"Dropping rows with missing values (complete case analysis) discards information and can introduce bias if data is not Missing Completely At Random (MCAR). Imputation is often better, but choice of method matters."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17004","difficulty":"easy","orderIndex":4,"question":"Mutual information (MI) between a feature X and target Y is computed as: $MI(X; Y) = \\sum_{x,y} p(x,y) \\log \\frac{p(x,y)}{p(x)p(y)}$. A feature has MI = 0. 
What does this mean, and is it always useless for prediction?","options":{"A":"MI = 0 means the feature is perfectly correlated with the target","B":"MI = 0 means X and Y are statistically independent — knowing X provides no information about Y; for a single feature in isolation, MI = 0 means the feature alone is useless; however, feature interactions exist — X might be useless alone but highly predictive when combined with another feature Z (interaction effect); filter methods miss such interactions","C":"MI = 0 means the feature has constant value — it is a constant feature","D":"MI = 0 is impossible for real-world data — it always has some noise that produces non-zero MI"},"correct":"B","explanation":{"correct":"- MI = 0: $p(x,y) = p(x)p(y)$ for all $x, y$ — full independence. Knowing $X$ does not change our estimate of $Y$ at all.\n- Interaction effect: $Y = XOR(X_1, X_2)$ with independent, uniformly distributed binary $X_1, X_2$. $MI(X_1; Y) = 0$ (individually useless). $MI(X_1, X_2; Y) > 0$ (jointly informative). Filter methods that evaluate features individually would eliminate both $X_1$ and $X_2$ — losing all predictive power.\n- This is a fundamental limitation of univariate feature selection (chi-squared, MI, ANOVA) — they cannot detect interaction effects. Wrapper methods and embedded methods can detect interactions because they evaluate feature combinations.","A":"Perfect correlation would give MI > 0 (in fact, for a deterministic relationship: MI = entropy of Y). MI = 0 indicates no dependence at all, the opposite of perfect correlation.","B":"","C":"A constant feature also has MI = 0, but MI = 0 does not require the feature to be constant. An independent non-constant feature also has MI = 0.","D":"For finite samples, estimated MI is always slightly non-zero due to sampling noise. But the population MI can be exactly 0 for truly independent X and Y. 
The claim is about the population quantity."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17005","difficulty":"easy","orderIndex":5,"question":"A dataset contains a \"TransactionDate\" column in format YYYY-MM-DD (e.g., \"2023-07-15\"). A junior data scientist label-encodes this as an integer (20230715). A senior engineer suggests better feature engineering. What are the recommended derived features?","options":{"A":"Leave the raw date string — modern ML models can parse dates automatically","B":"Extract meaningful temporal features: year, month, day-of-week, day-of-month, week-of-year, is_weekend, days_since_last_transaction, time_since_cohort_start; raw date integers (20230715) don't encode periodicity — the model cannot learn that July (month 7) recurs annually; the derived features capture seasonality, recency, and cyclical patterns that the raw integer misses","C":"Convert date to Unix timestamp (seconds since 1970-01-01) — this is the most informative representation","D":"Dates should always be removed from features — they cause temporal data leakage"},"correct":"B","explanation":{"correct":"- Raw integer (20230715): the model sees this as a continuous value. It cannot learn that December 31 and January 1 are consecutive unless it sees the transition explicitly. Seasonal patterns (holiday shopping every December) are not captured.\n- Extracted features: month captures annual seasonality; day-of-week captures weekly patterns; is_weekend captures activity patterns; days_since_X captures recency effects.\n- Cyclical encoding: for features like month (1-12) and day-of-week (0-6), use sine/cosine encoding to preserve cyclicality: $\\sin(2\\pi \\times \\text{month}/12)$, $\\cos(2\\pi \\times \\text{month}/12)$. This ensures the model knows December is adjacent to January.","A":"Most ML models (decision trees, linear models, gradient boosting) cannot parse raw date strings. 
Even if a model ingests the raw integer, the temporal structure (periodicity, recency) is not encoded in the integer value.","B":"","C":"Unix timestamp preserves temporal ordering but doesn't encode periodicity. A model trained on Unix timestamps cannot learn that similar timestamps occur one year apart — it sees them as values 31M seconds apart.","D":"Dates are valuable features when properly engineered. \"Always remove\" is incorrect — temporal features are often among the most predictive in time-sensitive applications (fraud, demand forecasting)."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17006","difficulty":"medium","orderIndex":6,"question":"Target encoding replaces a categorical feature with the mean of the target variable for that category. A team applies it directly on training data and achieves high training performance. On test data, performance drops significantly for rare categories. What is the issue and the fix?","options":{"A":"Target encoding should only be used for high-cardinality features — the team applied it to a low-cardinality feature","B":"Target encoding leaks the target into the feature directly; if applied on the full training set (including the sample being encoded), the model learns trivial mappings; for rare categories (few samples), the mean is estimated from very few samples — high variance estimates that overfit to noise; fix: use leave-one-out target encoding or cross-validated encoding (compute mean from out-of-fold samples) and apply smoothing (blend category mean with global mean weighted by sample count)","C":"Target encoding fails because it converts categorical to continuous — use ordinal encoding instead","D":"Target encoding is not appropriate for any ML model — use one-hot encoding always"},"correct":"B","explanation":{"correct":"- Leakage: if a training sample's own target is included in computing its category encoding, the model 
learns $y_i = \\hat{y}_i$ — trivial perfect training performance.\n- Rare category variance: a category with 3 samples has mean = (y1+y2+y3)/3. High variance: with a binary target it can only be 0, 1/3, 2/3, or 1, depending on those 3 specific samples. At test time, the estimate is unreliable.\n- Smoothing formula: $\\text{encoding}(c) = \\frac{n_c \\times \\bar{y}_c + m \\times \\bar{y}_\\text{global}}{n_c + m}$ where $n_c$ = samples in category $c$, $m$ = smoothing parameter. Rare categories are pulled toward the global mean; frequent categories use their own mean.\n- Cross-validated encoding: for each fold, encode training samples using only out-of-fold statistics — prevents leakage.","A":"Target encoding is especially valuable for high-cardinality features (where one-hot would create thousands of dimensions). The problem is not about cardinality but about proper encoding implementation.","B":"","C":"The continuous output of target encoding is a feature, not a problem. The issue is leakage and rare-category variance, not the data type conversion.","D":"Target encoding is widely used and effective for high-cardinality categoricals (e.g., zip code, user ID). One-hot encoding for 10,000 zip codes creates a 10,000-dimensional sparse feature — computationally expensive and prone to overfitting."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17007","difficulty":"medium","orderIndex":7,"question":"Permutation importance is computed for a trained random forest: after randomly shuffling feature X, test accuracy drops by 15%. After shuffling feature Z, accuracy drops by 0.2%. A data scientist removes Z from the model and retrains. The new model's accuracy drops by 3%. 
Explain this result.","options":{"A":"The result proves permutation importance is unreliable — it should never be used","B":"Permutation importance measures how much the model uses each feature, not each feature's intrinsic predictive value; Z may have had low permutation importance because X (correlated with Z) was substituted by the model when Z was shuffled; when Z is removed and X cannot compensate (X may have a different interaction), the 3% drop reveals Z's marginal contribution that permutation importance masked due to collinearity","C":"The 3% drop proves the first permutation importance computation was computed incorrectly","D":"Permutation importance always correctly identifies redundant features — Z should be removed based on the 0.2% drop"},"correct":"B","explanation":{"correct":"- Collinearity masking: if X and Z are correlated ($r = 0.9$), when Z is shuffled, the model can still predict using X (which maintains the shared signal with the target). So shuffling Z appears harmless (0.2% drop). But when Z is physically removed, X may not fully capture Z's unique contribution — 3% drop.\n- Permutation importance measures \"feature usage by the current model\" not \"intrinsic feature importance.\" For correlated features, importance is split arbitrarily between them.\n- Conditional permutation importance (Strobl et al.): shuffle Z while conditioning on the values of correlated features — better estimates the true marginal contribution.","A":"Permutation importance is a valid and widely used method (as model-agnostic importance). The issue is correct interpretation in the presence of correlated features. The tool is not unreliable — it requires nuanced interpretation.","B":"","C":"The permutation importance was computed correctly. It correctly measured how much the model uses Z in the presence of X. 
The removal experiment revealed a different (but also valid) quantity: Z's marginal contribution when X cannot compensate.","D":"Low permutation importance for a correlated feature should not automatically trigger removal. The collinearity must be investigated before removing features based solely on permutation importance."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17008","difficulty":"medium","orderIndex":8,"question":"A dataset contains an \"Income\" feature with a heavy right tail: median = $50K, mean = $80K, max = $5M. Standard scaling (z-score) is applied. After scaling, the top 1% of values (high earners) have z-scores of 40-200. A linear model trained on this data has large coefficients for income-related predictions. What is the problem and the fix?","options":{"A":"The data is correct — high z-scores for outliers are expected and do not affect the model","B":"Standard scaling preserves the original distribution's skewness and outlier effects — extreme values (z=40-200) dominate the model's coefficient for income; a log transformation ($\\log(\\text{income}+1)$) would first compress the right tail, then standard scaling would create a more symmetric distribution with z-scores in a reasonable range; outliers would no longer dominate linear model training","C":"The fix is to remove all records with income > $500K — outliers should always be dropped","D":"Standard scaling is the correct preprocessing — no additional transformation is needed for skewed features"},"correct":"B","explanation":{"correct":"- Heavy-tailed features: standard scaling makes $z = (x - \\mu)/\\sigma$. If $\\sigma$ is large (due to the long tail), most values cluster around $z = 0$ to $z = 2$. The top 1% gets enormous z-scores, dominating any linear model's loss function.\n- Log transformation: compresses the right tail. 
$\\log(\\$5M) \\approx 15.4$, $\\log(\\$50K) \\approx 10.8$, $\\log(\\$30K) \\approx 10.3$. The range is compressed from $50K-5M$ (100:1 ratio) to $10.8-15.4$ (about 1.4:1 ratio in log space). After log-transform, standard scaling gives reasonable z-scores.\n- Box-Cox (strictly positive values only) or Yeo-Johnson (also handles zero and negative values) transform: more general parametric power transformations.","A":"High z-scores (40-200) are not innocuous. In gradient descent, the gradient magnitude is proportional to feature values × error. Features with huge z-scores produce huge gradients, destabilizing training.","B":"","C":"Dropping income > $500K removes legitimate data points. High earners may be a meaningful segment (tax policy analysis, luxury goods). Removal introduces selection bias.","D":"Standard scaling is appropriate for approximately normal distributions. For heavy-tailed distributions, it leaves the skewness intact — additional transformation (log, sqrt) is needed first."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17009","difficulty":"hard","orderIndex":9,"question":"SHAP (SHapley Additive exPlanations) values are used for feature importance in a gradient boosted model. SHAP values for feature X are: most values near 0, but for 10 specific samples, SHAP values are +15 (strongly pushing toward positive class). Standard permutation importance gives feature X an importance of 0.02 (very low). 
Explain the discrepancy.","options":{"A":"SHAP and permutation importance are computing the same quantity — one must be wrong","B":"Permutation importance averages the effect of shuffling X across ALL samples — if X has high impact on 10 samples but near-zero impact on 990 samples, the average effect is diluted to ~1%; SHAP provides per-sample contributions, revealing that X is critical for the 10 specific samples even though globally unimportant; both metrics are correct — they answer different questions; for rare high-impact cases (fraud detection, medical alerts), SHAP reveals features that matter for specific predictions","C":"SHAP values of +15 indicate a computation error — SHAP values must be between -1 and +1","D":"Permutation importance is always more accurate than SHAP for gradient boosted models"},"correct":"B","explanation":{"correct":"- Permutation importance: average accuracy drop after shuffling X across all test samples. If X is only important for 1% of samples, the average drop is ~0.01 × (importance on those samples) — diluted.\n- SHAP: computes the marginal contribution of feature X to each individual prediction $\\hat{f}(x_i)$ using Shapley values from cooperative game theory. Each sample has its own SHAP vector.\n- The 10 samples with SHAP = +15 represent cases where X is the critical driver. For fraud detection: these might be the actual fraud cases where a specific pattern in X is the key indicator.\n- Application: in production, SHAP explains individual decisions (why was this transaction flagged?), even for features that appear globally unimportant.","A":"SHAP and permutation importance answer different questions. SHAP explains individual predictions; permutation importance measures global model reliance. Discrepancies are expected and meaningful.","B":"","C":"SHAP values have no fixed range — they represent the contribution in the units of the model output. 
A SHAP value of +15 for log-odds or a regression target is valid.","D":"Neither is universally more accurate. Permutation importance captures global model reliance; SHAP provides per-instance explanations and handles correlated features better through conditional expectations. They are complementary."},"reference":"- Lundberg & Lee, \"A Unified Approach to Interpreting Model Predictions\" (SHAP): https://arxiv.org/abs/1705.07874"},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17010","difficulty":"hard","orderIndex":10,"question":"A team creates a new feature by dividing Revenue by Cost (a ratio feature). In training data, Cost is always > 0. In production, some records have Cost = 0 (free services). The model crashes in production. Additionally, the ratio feature inflates importance in tree models. What are the two problems and their fixes?","options":{"A":"The problems are data type mismatch and overfitting — use float64 and add more training data","B":"Problem 1 (Division by zero): Cost = 0 in production causes infinity or NaN — fix: clip denominator ($\\max(\\text{Cost}, \\epsilon)$) or add a small constant (Revenue/(Cost+1)); Problem 2 (Ratio inflation in trees): if Revenue and Cost are separately available, the ratio extracts information already in the original features and creates a derived feature with different scale/distribution; tree models may overfit to extreme ratio values (very high Revenue/Cost = outlier); fix: use robust ratios ($\\log(\\text{Revenue}) - \\log(\\text{Cost})$), clip ratios, or ensure original features are also included","C":"Remove the ratio feature entirely — ratio features always cause problems in tree models","D":"The crash is caused by integer overflow — use int64 instead of float32"},"correct":"B","explanation":{"correct":"- Division by zero defense: $\\epsilon$ clipping: Revenue/max(Cost, 0.001). 
This prevents infinity while preserving the ratio's meaning for nearly-free services. Adding 1 (Laplace-like smoothing): Revenue/(Cost+1) — shifts the ratio but prevents zero denominator.\n- Log-ratio: $\\log(\\text{Revenue}) - \\log(\\text{Cost}) = \\log(\\text{Revenue/Cost})$ is the log-ratio, which compresses extreme values and has better statistical properties (more normal distribution in many business scenarios). It is undefined when Cost = 0, so it needs the same guard, e.g. $\\log(\\text{Revenue}+1) - \\log(\\text{Cost}+1)$.\n- Tree model interaction: ratio features are nonlinear combinations of existing features. Decision trees can approximate ratios by sequential splits on Revenue and Cost. Providing the ratio can help by expressing a known meaningful relationship, but it can also introduce high-leverage points.","A":"Data type mismatches and overfitting are separate issues. The described problems are divide-by-zero (runtime error) and extreme ratio values (modeling issue).","B":"","C":"Ratio features are extremely common and useful (financial ratios, click-through rates, conversion rates). Removing them entirely would discard meaningful domain knowledge.","D":"The crash is due to division by zero (NaN/infinity propagation), not integer overflow. Float32 and float64 can both represent infinity and NaN."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17011","difficulty":"hard","orderIndex":11,"question":"Recursive Feature Elimination with Cross-Validation (RFECV) eliminates features one at a time using model coefficients/importances. On a dataset with 200 features, it selects 45 features. The final model achieves 88% accuracy. A researcher warns this result is overly optimistic. 
Why, and what is the correct evaluation protocol?","options":{"A":"RFECV is always unbiased — 88% is the correct expected performance","B":"RFECV wraps cross-validation around the feature elimination process — if the entire RFECV (including feature selection) is run on training+test data together, the test set influenced which features were selected; if RFECV was run only on training data but evaluated on the same test set multiple times (testing different selected subsets), the test set was implicitly used for selection; correct protocol: outer cross-validation evaluates the entire pipeline (including RFECV feature selection) on held-out data that was never used during selection","C":"45 features is too many — reducing to 20 features would make the result accurate","D":"The 200-to-45 reduction causes underfitting, which makes the accuracy estimate overly optimistic"},"correct":"B","explanation":{"correct":"- Double dipping: if you run RFECV on train+test, the test set influences which features are \"good\" → test performance is inflated.\n- If RFECV runs only on training data but you then evaluate multiple feature subsets on the test set: each evaluation on the test set is a comparison, and the best-performing feature set is selected → test set contamination.\n- Nested cross-validation: outer loop k-fold for unbiased performance estimate; inner loop k-fold for RFECV. For each outer fold, the RFECV sees only the inner training data. The outer test fold is truly held out.\n- sklearn Pipeline + cross_val_score: when RFECV is inside a Pipeline and cross_val_score is applied, the outer CV loop correctly isolates the test fold from feature selection.","A":"RFECV is only unbiased if the entire feature selection process is nested within cross-validation and the test fold is never accessed during selection.","B":"","C":"The number of selected features doesn't determine whether the evaluation is biased. 
The bias comes from the evaluation protocol, not the feature count.","D":"Selecting fewer features reduces model complexity — this can either reduce overfitting (if the removed features were noise) or cause underfitting (if informative features were removed). Feature count doesn't directly determine whether the evaluation is optimistic."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17012","difficulty":"hard","orderIndex":12,"question":"A practitioner creates \"interaction features\" by multiplying pairs of the top 10 features, creating 45 additional features (10 choose 2). The combined model achieves much better training performance but validation performance is mixed. A colleague says \"interaction features always improve models.\" What is the nuanced truth?","options":{"A":"Interaction features always improve linear models because they add expressiveness","B":"Interaction features explicitly encode pairwise relationships that can improve linear models (which otherwise cannot capture interactions) and can help tree models learn interactions with fewer splits; however, adding 45 features to 10 grows the feature space 5.5-fold — the model now has 55 features from 10 original; with limited data, many interaction terms will be noise; the curse of dimensionality: overfitting increases; valid interactions should be grounded in domain knowledge (A × B is meaningful) rather than generated combinatorially; use cross-validation to confirm actual validation improvement","C":"Interaction features only work for tree-based models — they are harmful for linear models","D":"Multiplying features always causes multicollinearity that makes models untrainable"},"correct":"B","explanation":{"correct":"- Linear model limitation: $y = w_1 x_1 + w_2 x_2$ cannot capture $y = x_1 \\times x_2$. 
Adding the feature $x_1 x_2$ lets the linear model capture this interaction.\n- Tree models: decision trees can learn $x_1 \\times x_2$ interactions through consecutive splits. But explicit interaction features can reduce the tree depth needed, potentially improving learning efficiency with limited data.\n- Overfitting risk: 45 additional features from 10 originals, most of which are likely noise. With n=200 samples: 55 features + noise interactions → overfitting. Use L1 regularization or select only domain-motivated interactions.\n- Domain-motivated examples: Income × Education (reasonable synergy), Age × Risk_Factor (valid interaction for insurance), Random_Feature1 × Random_Feature2 (likely noise).","A":"\"Always improve linear models\" is false. Uninformative interaction features add noise and may not improve validation performance even if they improve training performance.","B":"","C":"Interaction features are most useful for linear models (which cannot learn interactions from raw features). Tree models can learn them from the raw features, though explicit features can help.","D":"Products of features do increase collinearity (especially $x_1$ and $x_1 \\times x_2$ are correlated). 
But \"untrainable\" is an exaggeration — regularization (L2 or L1) handles the multicollinearity."}}],"practiceMcqs":[{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-001","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T01 · ML Fundamentals] A recruiter asks: \"what is the difference between a model parameter and a hyperparameter?\" Which answer is correct?","options":{"A":"Parameters are set before training; hyperparameters are learned during training","B":"Parameters (weights, biases) are learned by the optimization algorithm during training; hyperparameters (learning rate, tree depth, K in KNN) are set by the practitioner before training and control the training process itself","C":"There is no difference — both are tuned during training","D":"Hyperparameters are only relevant for neural networks, not classical ML models"},"correct":"B","explanation":{"correct":"- Parameters: values the model learns to minimize loss — e.g., linear regression coefficients $w$, neural network weights $\\theta$. They are updated by gradient descent or closed-form solutions.\n- Hyperparameters: design choices that control the training process — learning rate, regularization strength, number of trees, max depth. They are not learned from data; they are set by the practitioner (often via cross-validation).\n- Key test: if the optimization algorithm updates it → parameter. If you set it before training → hyperparameter.","A":"Reversed. Parameters are learned during training; hyperparameters are set before training.","B":"","C":"Only parameters are learned by the optimization algorithm. 
Hyperparameters require separate tuning (grid search, random search, Bayesian optimization).","D":"Hyperparameters exist for all ML models: K in KNN, max_depth in decision trees, C in SVM, number of clusters in K-means."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-002","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T11 · Clustering] A K-means model is trained on customer data. After convergence, a new customer's cluster is determined by finding the centroid with the smallest Euclidean distance to the new customer's feature vector. No retraining occurs. What is this prediction step called, and what assumption does it make?","options":{"A":"Online learning — the model updates centroids with each new customer","B":"Cluster assignment (inference) — the frozen centroids from training are used to assign the new point; this assumes the production data distribution is similar to training distribution; if distribution has shifted, the cluster labels may be meaningless","C":"Re-clustering — each new point triggers a full K-means restart","D":"Interpolation — the model averages predictions from the nearest two centroids"},"correct":"B","explanation":{"correct":"- After K-means training, centroids are frozen. Inference = find argmin of distance to each centroid.\n- No retraining occurs — the model assumes the production distribution resembles training. If customer behavior changes, new customers may fall in between centroids, getting poor or irrelevant assignments.\n- This single-centroid assignment is the standard production serving pattern for K-means.","A":"Online learning updates model parameters with each new sample. Standard K-means inference does not update centroids.","B":"","C":"Re-clustering restarts K-means from scratch. Standard serving uses frozen centroids.","D":"Interpolation is not how K-means assigns clusters. 
The closest centroid wins outright (hard assignment)."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-003","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T15 · Bias-Variance] A model achieves 70% accuracy on training data and 69% accuracy on test data. The Bayes optimal accuracy for this task is estimated at 95%. What is the primary problem?","options":{"A":"Overfitting — the 1% train-test gap proves the model has too much variance","B":"High bias (underfitting) — both training and test errors are high (30% and 31%), with a small gap; the model is too simple to capture the true pattern; the large gap between model performance (~70%) and Bayes optimal (~95%) is the key signal","C":"The model is well-optimized — 69% test accuracy is excellent","D":"The test set is too small — more test data would reveal better performance"},"correct":"B","explanation":{"correct":"- Avoidable bias = training error − Bayes error = 30% − 5% = 25%. This is the gap that can be fixed by improving the model.\n- Variance = test error − training error = 1%. Variance is almost zero — the model is highly stable but just not powerful enough.\n- The fix is to increase model complexity (deeper network, more features, more trees), not to add regularization or more data.","A":"A 1% train-test gap indicates very low variance. Overfitting shows a large gap.","B":"","C":"If the best achievable is 95%, 69% leaves 26 points of avoidable error on the table. That's not well-optimized.","D":"Test set size doesn't change the model's actual accuracy on the task. The problem is the model architecture, not the evaluation."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-004","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T04 · Decision Trees] A decision tree is grown to full depth (no max_depth constraint) on training data. 
What is the training error, and why is this a problem?","options":{"A":"Training error is 50% — full-depth trees cannot learn well without pruning","B":"Training error is near 0% (or exactly 0% if no duplicate feature vectors with different labels exist) — each leaf contains one or a few training samples; the tree has memorized training data; this causes high variance and poor generalization","C":"Full-depth trees have the same training error as pruned trees","D":"Training error is undefined for full-depth trees because they overfit by definition"},"correct":"B","explanation":{"correct":"- A decision tree with no depth constraint will keep splitting until each leaf is pure (one class). For a dataset without conflicting labels, training error reaches exactly 0%.\n- The model memorizes every training sample — extremely high variance. A small change in training data produces a completely different tree.\n- The solution is to use max_depth, min_samples_leaf, min_samples_split, or cost-complexity pruning (ccp_alpha in sklearn) to prevent memorization.","A":"Full-depth trees achieve near-0% training error, not 50%. This is the definition of the overfitting problem.","B":"","C":"Pruning increases training error (removes some memorized splits) but reduces test error by generalizing better.","D":"Training error is well-defined. It's simply the fraction of training samples the tree misclassifies — 0% for full-depth trees on clean data."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-005","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T16 · Regularization] A linear regression model with no regularization has 200 features and 150 training samples. The matrix $X^TX$ is not invertible. 
What does this mean for the OLS closed form?","options":{"A":"OLS can still be computed using any matrix inversion routine","B":"With more features than samples (p=200 > n=150), $X$ does not have full column rank; $X^TX$ (200×200) is singular (rank ≤ 150); the OLS solution $(X^TX)^{-1}X^Ty$ is undefined; Ridge regularization fixes this by adding $\\lambda I$: $(X^TX + \\lambda I)$ is always invertible for $\\lambda > 0$","C":"The closed form still works — matrix inversion handles singular matrices automatically","D":"With p > n, the correct approach is to use a neural network instead"},"correct":"B","explanation":{"correct":"- Rank of $X^TX$ ≤ min(n, p) = 150. Since $X^TX$ is 200×200 with rank ≤ 150, it has at least 50 zero eigenvalues → not invertible.\n- Geometrically: infinitely many hyperplanes fit the 150 training points in 200-dimensional space — no unique OLS solution.\n- Ridge: $(X^TX + \\lambda I)$ shifts all eigenvalues by $\\lambda > 0$ → all eigenvalues > 0 → invertible. This is one of Ridge's key practical benefits beyond regularization.","A":"Standard matrix inversion routines will fail or return numerically unstable results for singular matrices. Pseudoinverse (Moore-Penrose) can be used, but it returns the minimum-norm solution, which is only one of infinitely many least-squares solutions rather than a unique OLS fit.","B":"","C":"Standard inversion of a singular matrix produces infinity/NaN or numerical garbage. NumPy's `np.linalg.solve` raises a `LinAlgError` when it detects exact singularity; with floating-point round-off it may instead return unstable values without an error.","D":"Neural networks also face ill-conditioning with p > n. The answer is regularization (Ridge, L1, dropout), not switching algorithms."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-006","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T07 · SVM] An SVM with a linear kernel is trained on a 2D dataset that is not linearly separable. Training fails to find a clean margin. 
A colleague suggests \"use a higher C value to fix this.\" Is this correct?","options":{"A":"Yes — high C forces the SVM to find a wider margin, separating the classes","B":"No — high C shrinks the allowed margin (penalizes misclassifications more); for non-linearly separable data, increasing C tries harder to separate training points but cannot achieve linear separability; the correct fix is to use a nonlinear kernel (RBF, polynomial) that maps to a higher-dimensional space where classes are separable","C":"C has no effect on linear SVM — it only matters for kernel SVMs","D":"For non-linearly separable data, SVM always fails regardless of C or kernel"},"correct":"B","explanation":{"correct":"- C in soft-margin SVM: high C = high penalty for misclassified points → smaller margin, fewer training errors; low C = allows more misclassifications → larger margin, more regularization.\n- For non-linearly separable data in 2D, no linear hyperplane can perfectly separate classes. Increasing C just causes the SVM to try harder to separate with a linear boundary — it may overfit to noise without achieving true separation.\n- RBF kernel implicitly maps to infinite-dimensional space where linear separation often becomes possible (Cover's theorem).","A":"High C shrinks (not widens) the margin. High C = hard margin, low C = soft margin.","B":"","C":"C applies to all SVM variants including linear kernel. It controls the misclassification penalty in all cases.","D":"With a nonlinear kernel (RBF), SVM can achieve separation for most non-linearly separable datasets in practice."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-007","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T09 · Naive Bayes] A text classifier uses Multinomial Naive Bayes. The training vocabulary has 5,000 words. A new test document contains the word \"blockchain\" which was not in the training corpus. 
Without Laplace smoothing, what happens to the model's prediction for this document?","options":{"A":"The word is ignored — NB skips out-of-vocabulary words automatically","B":"$P(\\text{blockchain}|\\text{any class}) = 0$; the product $\\prod P(w_i|\\text{class})$ includes a zero term → posterior = 0 for every class; the model cannot classify the document (division by zero / zero probability for all classes)","C":"The model assigns probability 0.5 to the word by default","D":"The model raises an exception because it cannot handle new vocabulary"},"correct":"B","explanation":{"correct":"- MLE probability: $P(w|c) = \\text{count}(w,c) / N_c$. \"Blockchain\" has count 0 → $P(\\text{blockchain}|c) = 0$ for all classes.\n- Product of likelihoods: $\\prod_i P(w_i|c)$ contains one zero factor → product = 0 for every class.\n- All posteriors are 0 → undefined argmax → model cannot predict.\n- Laplace smoothing ($\\alpha = 1$) prevents this: every word gets count + 1 in the numerator.","A":"Standard MNB does not skip unknown words. Without explicit handling, the zero probability problem occurs.","B":"","C":"0.5 is not a default in any standard NB implementation. That would require hard-coding an arbitrary fallback probability.","D":"NB doesn't raise exceptions by default — it silently computes 0 probability. The failure is silent, not an error."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-008","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T12 · Anomaly Detection] An anomaly detection pipeline produces 500 alerts per day. Upon review, investigators find only 10 are real anomalies. 
What is the precision of the detector, and why does this metric matter operationally?","options":{"A":"Precision = 10/500 = 2%; each alert has only a 2% chance of being a real anomaly; investigators waste 98% of their effort on false alarms; high false alarm rate causes \"alert fatigue\" — investigators start ignoring alerts","B":"Recall = 10/500 = 2%; the model is missing 98% of anomalies","C":"Accuracy = 10/500 = 2%; the model is 2% accurate","D":"Precision cannot be computed without knowing the total number of real anomalies in the day"},"correct":"A","explanation":{"correct":"- Precision = TP / (TP + FP) = 10 / 500 = 2%. Of all flagged alerts, only 2% are genuine.\n- Operational impact: investigators must examine all 500 alerts. 490 are wasted effort. If each investigation takes 30 minutes: 245 hours/day wasted on false positives.\n- Alert fatigue: high false positive rates cause investigators to skip investigations or set high bars, causing them to miss real anomalies.\n- Fix: raise the anomaly score threshold (reduces FP but may increase FN), or use better features/model.","A":"","B":"Recall = TP / (TP + FN). Without knowing the total number of real anomalies in the day, we cannot compute recall from this information alone.","C":"Accuracy requires knowing TN (true negatives — non-anomalous events correctly not flagged), which is much larger. Accuracy is not 2%.","D":"Precision only requires TP and FP from flagged alerts. 
Recall requires knowing total actual positives, but precision does not."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-009","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T05 · Random Forest] What does \"out-of-bag error\" mean in Random Forest, and why is it useful?","options":{"A":"Out-of-bag error measures the error on training samples that were included in the bootstrap sample","B":"Each bootstrap sample excludes ~36.8% of training points — these excluded points are \"out-of-bag\"; each tree is evaluated on its OOB samples; averaging these evaluations gives an unbiased estimate of generalization error without needing a separate validation set","C":"Out-of-bag error is the difference between training and test error","D":"Out-of-bag error only applies when Random Forest uses 500+ trees"},"correct":"B","explanation":{"correct":"- Bootstrap sampling: draws n samples with replacement. Each sample has probability $(1-1/n)^n \\approx e^{-1} \\approx 36.8\\%$ of never being selected.\n- For each tree, predict using the ~36.8% of samples that tree never saw during training. Average these OOB predictions to get the OOB error.\n- This provides a \"free\" cross-validation estimate — no separate validation set needed. It is comparable to (but not identical to) leave-one-out cross-validation.","A":"OOB points are those EXCLUDED from the bootstrap sample, not included. Training is done on the included points; OOB evaluation uses the excluded ones.","B":"","C":"OOB error is a standalone estimate. It doesn't require a separate test set and is not a gap between two other quantities.","D":"OOB error applies to any Random Forest regardless of tree count. 
More trees give more stable OOB estimates (each sample is OOB for more trees), but the mechanism works with any number."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-010","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T06 · Gradient Boosting] In gradient boosting with log-loss, what does each new tree in the sequence fit to?","options":{"A":"Each new tree fits the raw training labels (0 or 1) starting fresh","B":"Each new tree fits the pseudo-residuals — the negative gradient of the log-loss evaluated at the current model's predictions; for log-loss: $r_i = y_i - \\hat{p}_i$ (actual label minus current predicted probability); the tree corrects what the current ensemble gets wrong","C":"Each new tree fits the square of the previous tree's predictions","D":"Each new tree is identical to the previous tree but with doubled learning rate"},"correct":"B","explanation":{"correct":"- Gradient boosting framework: $F_m(x) = F_{m-1}(x) + \\eta \\cdot h_m(x)$ where $h_m$ is a tree fitted to the negative gradient.\n- For log-loss: $-\\partial L / \\partial F = y_i - \\hat{p}_i$. Where current predictions are too low for positives (underestimating $\\hat{p}$ for class 1), residuals are positive → new tree pushes predictions up.\n- Sequential correction: the ensemble improves iteratively, each tree fixing the mistakes of the cumulative model so far.","A":"If each tree fit raw labels from scratch, there would be no \"boosting\" — just many independent shallow trees. The sequential residual fitting is what defines gradient boosting.","B":"","C":"Fitting squared predictions has no theoretical justification and would not converge to a useful model.","D":"Each tree is independently fitted on current residuals. 
Trees are different from each other (they fit different residual patterns)."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-011","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T17 · Feature Engineering] A dataset has house prices as the target. The \"YearBuilt\" feature ranges from 1900 to 2023. A data scientist creates a new feature `age = 2026 - YearBuilt`. What type of feature engineering is this, and what advantage does it provide over the raw year?","options":{"A":"This is feature scaling — it normalizes the year to a standard range","B":"This is feature transformation (domain-informed engineering) — \"age\" directly encodes how old the house is, which has a more natural relationship with price (older = more maintenance, lower value in many markets); raw year (1900-2023) encodes calendar time, which may be hard for a model to interpret relative to the prediction date; age is a more semantically meaningful and model-friendly representation","C":"This transformation is harmful — it removes useful temporal information","D":"Subtracting from a constant is only valid for linear models"},"correct":"B","explanation":{"correct":"- Domain knowledge: house age has a clearer causal relationship with price depreciation, maintenance cost, and desirability than the calendar year of construction.\n- Model interpretability: an age of 5 (newly built) vs age of 120 (very old) is intuitively meaningful. The year \"1950\" is not interpretable without knowing \"what year is now?\"\n- Generalization: if the model is used in future years, \"age\" automatically updates meaning (a house built in 1990 is 36 years old in 2026 but would be 37 in 2027); raw year \"1990\" is static.","A":"Feature scaling changes the range/distribution. Subtracting year from a constant is a simple linear transformation that changes the reference point, not the scale.","B":"","C":"Age preserves the temporal information — it's a monotone transformation of year. 
No information is lost; the representation is just more meaningful.","D":"Linear transformations (including affine shifts like `2026 - YearBuilt`) are valid for any model type. Tree models handle this identically to the raw year."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-012","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T02 · Linear Regression] A linear regression model is evaluated with $R^2 = 0.85$. A new feature is added, and $R^2$ increases to 0.86. Should the new feature be kept?","options":{"A":"Yes — any increase in R² means the feature is useful","B":"Not necessarily — R² always increases (or stays the same) when any feature is added, even a random noise feature; the increase from 0.85 to 0.86 may reflect overfitting to the new feature, not genuine signal; use adjusted R² or compare models using a held-out test set or cross-validation","C":"No — R² above 0.85 indicates overfitting, so the feature should be removed","D":"Adding a feature to a linear model always causes overfitting regardless of R² change"},"correct":"B","explanation":{"correct":"- Property of R²: adding any feature (even pure random noise) can only increase or maintain R² on training data — it can never decrease. The optimization just sets the noise feature's coefficient to near-zero.\n- Adjusted R²: $\\bar{R}^2 = 1 - (1-R^2)(n-1)/(n-p-1)$. Penalizes for number of parameters $p$. If adjusted R² decreases after adding the feature, the feature is not worth its added complexity.\n- Better: evaluate on a held-out test set. If test R² decreases, the feature is introducing overfitting.","A":"R² inflation is a well-known problem. Adding a random noise feature to a linear model never decreases training R² and almost always increases it slightly. This is why adjusted R² or test performance should be used.","B":"","C":"R² above 0.85 has no connection to overfitting. A model can have R²=0.99 on test data without overfitting.","D":"Adding features to linear models can be beneficial. 
Regularization or feature selection can handle features that turn out not to be useful."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-013","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T10 · PCA] PCA is applied to a standardized dataset before K-means clustering. A colleague says \"you must apply PCA before K-means because K-means requires uncorrelated features.\" Is this claim correct?","options":{"A":"Correct — K-means requires uncorrelated input features by design","B":"Incorrect — K-means has no mathematical requirement for uncorrelated features; PCA before K-means can be beneficial for reducing noise and dimensionality, and for making Euclidean distance more meaningful; but it is not required; the stated justification is wrong","C":"Correct — K-means uses PCA internally to find clusters","D":"PCA should never be applied before K-means as it destroys cluster structure"},"correct":"B","explanation":{"correct":"- K-means uses Euclidean distance: $||x_i - \\mu_k||^2$. This works fine with correlated features — correlated features are just redundant, not harmful to the algorithm's convergence.\n- Valid reasons to use PCA before K-means: (1) reduce noise dimensions that dilute the distance signal; (2) reduce computation for high-dimensional data; (3) visualize clusters in 2D.\n- Invalid reason: \"K-means requires uncorrelated features.\" This is a common myth. K-means favors roughly spherical clusters, so correlated features (which produce elliptical clusters) can yield suboptimal partitions; that is a limitation, not a requirement violation.","A":"K-means has no correlation requirement. It minimizes WCSS using Euclidean distance, which is defined for any feature space.","B":"","C":"K-means does not internally use PCA. It uses centroid distance calculations.","D":"PCA can actually help K-means by removing noisy dimensions. 
The concern would be if PCA discards dimensions that separate the clusters — this is possible but not universal."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-014","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T03 · Logistic Regression] The decision boundary of a logistic regression model in 2D is the set of points where $\\hat{p}(x) = 0.5$. What is the geometric shape of this boundary?","options":{"A":"A circle — logistic regression produces circular decision boundaries","B":"A straight line (linear) — the decision boundary is where $w_1x_1 + w_2x_2 + b = 0$; this is a linear equation in the feature space; the sigmoid outputs 0.5 exactly when its input is 0 (the linear boundary)","C":"A sigmoid curve — the decision boundary follows the shape of the sigmoid function","D":"The boundary can be any shape — it depends on the training data distribution"},"correct":"B","explanation":{"correct":"- $\\hat{p} = \\sigma(w^Tx + b)$. When $\\hat{p} = 0.5$: $\\sigma(z) = 0.5 \\Rightarrow z = 0 \\Rightarrow w^Tx + b = 0$.\n- This equation $w^Tx + b = 0$ defines a hyperplane (line in 2D, plane in 3D). Logistic regression is a linear classifier.\n- To create nonlinear boundaries: add polynomial features ($x_1^2, x_1 x_2$, etc.) before logistic regression, or use kernel logistic regression.","A":"Circular boundaries require $x_1^2 + x_2^2 = c$ — a nonlinear equation. Standard logistic regression cannot produce circles without feature engineering.","B":"","C":"The sigmoid function maps real values to (0,1). 
The decision boundary is where this output equals 0.5 — a 2D line, not the sigmoid curve itself.","D":"While the training data influences the learned weights $w$, the geometric form of the boundary is always linear (hyperplane) regardless of data distribution."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-015","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T14 · Model Evaluation] A binary classifier is evaluated on a balanced dataset (50% positive, 50% negative). Accuracy = 75%. Precision = 70%, Recall = 85%. What does the F1 score equal approximately?","options":{"A":"F1 = (70 + 85) / 2 = 77.5% (arithmetic mean)","B":"F1 = 2 × (0.70 × 0.85) / (0.70 + 0.85) = 2 × 0.595 / 1.55 ≈ 76.8% (harmonic mean of precision and recall)","C":"F1 = 75% (equals accuracy for balanced datasets)","D":"F1 cannot be computed without knowing TP, FP, FN, TN individually"},"correct":"B","explanation":{"correct":"- F1 formula: $F1 = 2 \\times \\frac{P \\times R}{P + R} = \\frac{2PR}{P + R}$.\n- Computation: $2 \\times (0.70 \\times 0.85) / (0.70 + 0.85) = 1.19 / 1.55 \\approx 0.768 = 76.8\\%$.\n- F1 is the harmonic mean — it is always ≤ arithmetic mean. It penalizes imbalance between precision and recall more than the arithmetic mean would.","A":"The arithmetic mean (77.5%) is higher than F1 (76.8%). F1 uses the harmonic mean, which is more conservative when precision and recall differ.","B":"","C":"F1 ≠ accuracy in general, even for balanced datasets. They would coincide only in specific cases where both precision and recall equal accuracy.","D":"F1 can be computed from aggregate precision and recall values directly. Individual TP/FP counts are not required if precision and recall are already known."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-016","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T08 · KNN] K=1 is chosen for a KNN classifier. The training accuracy is 100%. 
A data scientist says \"perfect training accuracy means perfect test accuracy.\" What is wrong?","options":{"A":"K=1 always produces 100% accuracy on both training and test — there is no problem","B":"K=1 memorizes the training data — each training point is its own nearest neighbor, so training accuracy is trivially 100%; this is the maximum-variance, zero-training-error extreme of KNN; test accuracy will be lower because the model overfits to noise in training labels; increasing K smooths the decision boundary, reducing variance at the cost of some bias","C":"K=1 is always the best choice because it minimizes training error","D":"The training accuracy of 100% is impossible for K=1 — there is a calculation error"},"correct":"B","explanation":{"correct":"- K=1: for any training point $x_i$, its nearest neighbor is itself (distance=0). Predicted class = $y_i$ = actual class. Training accuracy = 100% by construction.\n- Test points: the nearest training neighbor might belong to a noisy or wrong class. With K=1, there's no smoothing — the prediction is exactly as noisy as the nearest training label.\n- Optimal K: typically selected via cross-validation. K=√n is a common heuristic. Larger K → smoother, more robust boundaries but potentially over-smoothing.","A":"Test accuracy for K=1 is typically lower than training accuracy. Perfect training accuracy does not carry over to test data.","B":"","C":"Minimizing training error is not the goal. The goal is to minimize generalization error on unseen data. K=1 overfits.","D":"100% training accuracy for K=1 is guaranteed (each point is its own nearest neighbor). 
This is not a calculation error — it's the fundamental behavior of K=1."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-017","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T13 · Ensemble Methods] A data scientist creates an ensemble of 5 identical models (same architecture, same hyperparameters, same training data, same random seed). Does this ensemble outperform a single model?","options":{"A":"Yes — any ensemble of 5 models is always better than 1 model","B":"No — identical models produce identical predictions; the majority vote of 5 identical classifiers equals any single classifier's output; there is zero diversity; for an ensemble to improve over a single model, models must disagree on some examples (uncorrelated errors)","C":"The ensemble is better because it averages out random initialization differences","D":"The ensemble is better because 5 models have 5× the capacity of 1 model"},"correct":"B","explanation":{"correct":"- Ensemble benefit: $\\text{Var}(\\bar{X}) = \\rho \\sigma^2 + \\frac{1-\\rho}{B}\\sigma^2$ where $\\rho$ = inter-model correlation and $B$ = number of models. If $\\rho = 1$ (identical predictions): $\\text{Var}(\\bar{X}) = \\sigma^2$ = single model variance. No improvement.\n- Diversity is required. The same seed, same data, same architecture → $\\rho = 1$ → no variance reduction.\n- Different random seeds would normally add a small amount of diversity, but the question specifies the same random seed, so even that source is eliminated.","A":"More models only help when they are diverse (make different mistakes). Identical models provide zero benefit.","B":"","C":"With the same random seed, there are no random initialization differences.","D":"Ensemble prediction is an average/vote, not a capacity increase. 
The meta-prediction uses 1 combined output, not 5× capacity."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-018","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T01 · ML Fundamentals] A model is trained on 2020-2022 customer data and deployed in 2026. Performance has degraded significantly. What is the most likely cause?","options":{"A":"The model's code has bugs that develop over time","B":"Concept drift — the statistical relationship between features and the target has changed; customer behavior, market conditions, or product mix from 2026 differs significantly from 2020-2022; the model's learned patterns no longer apply to current data","C":"The hardware has degraded, producing random prediction errors","D":"The test set from 2026 is smaller than the training set, causing evaluation noise"},"correct":"B","explanation":{"correct":"- Concept drift: $P(y|x)$ changes over time. A model trained on 2020 patterns may not reflect 2026 customer behavior (new demographics, changed preferences, economic shifts).\n- Data drift (covariate shift): $P(x)$ changes — new types of customers appear. The model sees feature combinations outside its training distribution.\n- Solutions: periodic retraining, monitoring input/output distributions, champion-challenger testing, online learning.","A":"Software models don't develop bugs from running over time. Bugs are static unless code is changed.","B":"","C":"Hardware failures produce hard errors, not gradual degradation. Gradual degradation points to distributional causes.","D":"Evaluation noise from small test sets would cause noisy metrics, not systematic degradation. Concept drift causes systematic directional performance decrease."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-019","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T06 · Gradient Boosting] XGBoost is described as an improved gradient boosting framework. 
What is one key algorithmic difference between XGBoost and the original gradient boosting (GBDT)?","options":{"A":"XGBoost uses random forests internally instead of gradient boosting","B":"XGBoost incorporates second-order gradient information (Hessian) in the split-finding criterion; standard GBDT uses only first-order gradients (pseudo-residuals); the second-order Taylor expansion gives XGBoost more accurate leaf weight estimates and split gain calculations, often improving convergence","C":"XGBoost uses decision stumps (depth-1 trees) exclusively, while GBDT uses arbitrary-depth trees","D":"XGBoost eliminates the need for a learning rate hyperparameter"},"correct":"B","explanation":{"correct":"- Standard GBDT: fit each tree to the negative gradient (first-order approximation of the loss).\n- XGBoost: uses second-order Taylor expansion of the loss: $L \\approx \\sum_i [g_i f_t(x_i) + \\frac{1}{2}h_i f_t^2(x_i)]$ where $g_i = \\partial L/\\partial \\hat{y}$ and $h_i = \\partial^2 L / \\partial \\hat{y}^2$.\n- Leaf weights in XGBoost: $w_j^* = -G_j / (H_j + \\lambda)$ where $G_j, H_j$ are the sums of first- and second-order gradients over the samples in leaf $j$. More accurate than first-order-only methods.","A":"XGBoost is a boosting framework. It uses gradient boosting with trees as weak learners.","B":"","C":"XGBoost supports arbitrary tree depth (max_depth hyperparameter). Depth-1 stumps (max_depth=1) are one option, not the default (max_depth defaults to 6).","D":"XGBoost still uses a learning rate (eta parameter, default 0.3). It is a critical hyperparameter."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-020","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T04 · Decision Trees] A split produces a pure node containing 100% class A. 
What is its Gini impurity value?","options":{"A":"Gini impurity = 1.0 — a pure node has maximum impurity","B":"Gini impurity = 0 — a pure node has $p_A = 1, p_B = 0$; $G = 1 - (1^2 + 0^2) = 0$; the ideal endpoint for a decision tree split","C":"Gini impurity = 0.5 — maximum impurity for a binary classification","D":"Gini impurity is undefined for a pure node because log(0) is undefined"},"correct":"B","explanation":{"correct":"- Gini impurity: $G = 1 - \\sum_k p_k^2$. For a pure node with 100% class A: $p_A = 1, p_B = 0$. $G = 1 - (1^2 + 0^2) = 1 - 1 = 0$.\n- Gini = 0 means perfectly pure — no further splitting benefit.\n- Maximum Gini for binary classification: $G = 0.5$ when $p_A = p_B = 0.5$ (completely mixed). $G = 1 - (0.5^2 + 0.5^2) = 1 - 0.5 = 0.5$.","A":"Pure node = Gini 0, not 1. Gini = 1 would mean a node impossible in binary classification (would need all probability outside any class).","B":"","C":"0.5 is the Gini for a maximally mixed (50/50) binary node — the worst case, not the pure case.","D":"Gini impurity does not involve logarithms (that's entropy/information gain). Gini = $1 - \\sum p_k^2$, which is well-defined for $p = 1$."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-021","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T16 · Regularization] A logistic regression model is trained with L1 regularization. After training, 40 of 100 feature weights are exactly 0. 
What does this mean for the model?","options":{"A":"40 features were removed from training data before the model ran","B":"L1 regularization drove 40 feature weights to exactly zero during optimization — these features contribute nothing to predictions; this is automatic feature selection; the remaining 60 features form a sparse model that is more interpretable and computationally efficient at inference time","C":"The model failed to converge — zero weights indicate training errors","D":"Zero weights mean the model is ignoring the regularization term for those features"},"correct":"B","explanation":{"correct":"- L1 sparsity mechanism: the subgradient at $w=0$ for the L1 penalty spans $[-\\lambda, \\lambda]$. When the data gradient is smaller than $\\lambda$ in magnitude, $w=0$ is optimal → exact zero.\n- This automatic feature selection is a key advantage of L1 over L2. The sparse model uses only 60 features at inference time — faster prediction and simpler interpretation.\n- Sparse solutions are valuable in high-dimensional settings where most features are noise.","A":"Feature data is present in training. L1 zeroes out the learned coefficient, not the feature column. The feature is present but given zero importance by the model.","B":"","C":"Zero weights from L1 are not a convergence failure — they are the optimal solution. The training converged to a point where L1 sparsity kicked in for those features.","D":"Zero weights for L1 regularized features are the regularization effect working as intended — the penalty was large enough to drive those weights to zero."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-022","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T10 · PCA] Before applying PCA, a data scientist must standardize the features (zero mean, unit variance). 
Why is this step essential?","options":{"A":"PCA requires non-negative values — standardization ensures all values are positive","B":"PCA finds directions of maximum variance — features with larger absolute scale (e.g., income in dollars vs age in years) dominate the variance; standardization ensures each feature contributes equally to the covariance matrix before PCA finds its principal directions","C":"Standardization is optional — PCA is scale-invariant","D":"Standardization is only needed when features have different units; same-unit features don't need it"},"correct":"B","explanation":{"correct":"- Covariance matrix: $C = \\frac{1}{n}X^TX$ (for mean-centered $X$). Income variance ($\\sim 10^9$) dominates age variance ($\\sim 200$). PC1 will align almost entirely with income, so PCA effectively just recovers the income axis.\n- After standardization: each feature has variance 1. The covariance matrix treats all features equally, and PCA finds the true directions of maximum multivariate variance.\n- Exception: if you deliberately want to give high-variance features more weight (e.g., they are more important by domain knowledge), you could skip standardization — but this should be intentional.","A":"PCA has no non-negativity requirement. Standardized features have negative values (below-mean observations are negative). PCA handles negative values fine.","B":"","C":"PCA is NOT scale-invariant. This is a critical point. Applying PCA to unstandardized data gives results dominated by high-variance features.","D":"Even features with the same units can have very different variances. Variance depends on the value range, not the unit. Standardization is recommended regardless of unit homogeneity."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-023","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T05 · Random Forest] Feature importance in Random Forest is computed as mean decrease in Gini impurity. A feature is listed with importance = 0.0. 
What might cause this?","options":{"A":"The feature is perfectly correlated with the target and thus contributes everything","B":"The feature was never used in any split across all trees — possible if the feature has very low predictive power, or if a correlated feature was always selected first (Random Forest randomly samples features at each split, so a weak feature may never win the competition); importance = 0.0 means the feature added no impurity reduction across all trees","C":"Feature importance = 0.0 is a computation error — all features must contribute something","D":"The feature was removed from the dataset before training"},"correct":"B","explanation":{"correct":"- Feature subsampling at each split: $m$ features are randomly selected per split. A weak feature may rarely be selected. If selected but does not produce a better split than a threshold (min impurity decrease), it won't be used.\n- For a truly irrelevant feature: it may appear in some splits by random chance but provides no impurity reduction → importance sums to near 0.\n- Also common: if a highly predictive feature is in the same random subset as a weaker correlated feature, the strong feature always wins — the weaker feature gets 0 importance despite some predictive value.","A":"High correlation with the target would give very high importance, not 0.","B":"","C":"A feature that never improves any split genuinely has importance 0. This is a valid outcome, not a bug.","D":"If the feature were removed from training data, it wouldn't appear in the importance rankings at all (no entry, not 0 entry)."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-024","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T14 · Model Evaluation] A model's ROC curve passes exactly through the point (0.1, 0.9). 
What does this operating point mean in terms of classification performance?","options":{"A":"The model correctly classifies 90% of all samples at a 10% error rate","B":"At this threshold: TPR = 0.9 (recall = 90%, catches 90% of positives) and FPR = 0.1 (10% of negatives are incorrectly flagged); this is an excellent operating point — high sensitivity with relatively low false positive rate","C":"The model has 90% precision and 10% recall at this threshold","D":"The point (0.1, 0.9) means AUC = 0.1 × 0.9 = 0.09"},"correct":"B","explanation":{"correct":"- ROC coordinates: x-axis = FPR = FP/(FP+TN), y-axis = TPR = TP/(TP+FN).\n- At (FPR=0.1, TPR=0.9): the model correctly identifies 90% of actual positives while misclassifying only 10% of negatives as positive.\n- This is a favorable operating point — it lies in the upper-left region of the ROC space, far above the diagonal (random classifier).","A":"TPR and FPR are not \"overall accuracy\" metrics. They measure performance on positives and negatives separately. Overall accuracy requires knowing the class proportions.","B":"","C":"Precision (PPV = TP/(TP+FP)) is not directly readable from the ROC curve coordinates. ROC plots TPR vs FPR, not precision vs recall.","D":"AUC is the area under the entire ROC curve — not the product of a single point's coordinates."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-025","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T09 · Naive Bayes] A Naive Bayes classifier is compared to logistic regression on a small dataset (n=50). Naive Bayes outperforms logistic regression. On a large dataset (n=50,000), logistic regression outperforms Naive Bayes. 
What principle explains this?","options":{"A":"Logistic regression is always slower, so it needs more data to converge","B":"Naive Bayes reaches its asymptotic error faster ($O(\\log p)$ samples) because its generative assumptions constrain the solution space — helpful with little data; logistic regression needs $O(p)$ samples but achieves lower asymptotic error when its assumptions hold; on large datasets, logistic regression's greater flexibility pays off","C":"Naive Bayes is more accurate on all dataset sizes — the large dataset result indicates an error","D":"The crossover is caused by the dataset size affecting the independence assumption"},"correct":"B","explanation":{"correct":"- Ng & Jordan (2001): generative models (NB) converge faster but to a higher asymptotic error when the generative assumptions are violated (as they almost always are for text).\n- Discriminative models (LR) converge slower (require more data to estimate the decision boundary) but achieve better asymptotic performance because they don't assume a specific data distribution.\n- Practical implication: for very small datasets, NB may be competitive or superior. For large datasets with enough samples to estimate LR parameters well, LR typically wins.","A":"Speed of training is not the explanation. The crossover is about statistical efficiency, not computational efficiency.","B":"","C":"The large-dataset dominance of logistic regression is the expected result from theory and empirical studies. NB's advantage is only at small sample sizes.","D":"The independence assumption is violated regardless of dataset size. It doesn't change with more data — only the estimation of the conditional probabilities improves."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-026","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T03 · Logistic Regression] The C parameter in sklearn's LogisticRegression controls regularization. C=0.001 vs C=1000. 
Which configuration is more likely to overfit?","options":{"A":"C=0.001 — smaller values mean stronger regularization, more overfitting","B":"C=1000 — C is the inverse of regularization strength; large C = weak regularization ($\\lambda = 1/C \\approx 0$); the model is nearly unregularized and can overfit to training noise, especially in high-dimensional feature spaces","C":"Both are equally likely to overfit","D":"C=0.001 — regularization causes overfitting by constraining the model too much"},"correct":"B","explanation":{"correct":"- sklearn convention: $C = 1/\\lambda$. High C → small $\\lambda$ → weak regularization → model can fit training data very closely → overfitting risk.\n- C=0.001: $\\lambda = 1000$ → strong regularization → heavy coefficient shrinkage → underfitting risk.\n- C=1000: $\\lambda = 0.001$ → almost unregularized → may overfit on small/noisy datasets.\n- The \"correct\" C is always data-specific and should be selected via cross-validation.","A":"C=0.001 has strong regularization (λ=1000). Strong regularization prevents overfitting; it may cause underfitting instead.","B":"","C":"They have opposite regularization strengths — they are not equally likely to overfit. The direction of effect is clear.","D":"Regularization doesn't cause overfitting. Regularization prevents overfitting. 
Too much regularization causes underfitting (model too constrained)."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-027","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T07 · SVM] Why does SVM use only the support vectors to define the maximum margin hyperplane, and how many support vectors are there typically?","options":{"A":"SVM uses all training points equally — all points define the hyperplane","B":"The optimization's KKT conditions show that only points on or within the margin have non-zero dual variables (Lagrange multipliers); non-support-vector points are outside the margin and contribute $\\alpha_i = 0$ to the decision function; the number of support vectors is typically small (a few dozen to a few hundred) and depends on the margin width and data complexity","C":"SVM randomly selects a subset of training points as support vectors","D":"Support vectors are always exactly equal to the number of features plus one"},"correct":"B","explanation":{"correct":"- SVM dual problem: $\\max_\\alpha \\sum \\alpha_i - \\frac{1}{2}\\sum_{i,j}\\alpha_i \\alpha_j y_i y_j K(x_i, x_j)$ subject to $\\sum \\alpha_i y_i = 0$, $0 \\leq \\alpha_i \\leq C$.\n- KKT complementarity: $\\alpha_i (1 - y_i(w^Tx_i + b)) = 0$. Points outside the margin: $y_i(w^Tx_i + b) > 1 \\Rightarrow \\alpha_i = 0$. Only margin points have $\\alpha_i > 0$.\n- The decision function: $f(x) = \\sum_{i \\in SV} \\alpha_i y_i K(x_i, x)$. Sums only over support vectors — non-SVs vanish.","A":"Non-support-vector points have $\\alpha_i = 0$ and do not contribute to the decision boundary. They can be removed from training data without changing the learned model.","B":"","C":"Support vectors are determined by the optimization, not random selection. They are the points most relevant to the decision boundary (on or inside the margin).","D":"The number of support vectors depends on the data complexity. For linearly separable data with a wide margin: very few SVs. 
For noisy or complex data: many SVs. No formula links SVs to feature count + 1."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-028","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T17 · Feature Selection] Chi-squared test for feature selection is applied to categorical features. A feature with a very small p-value (0.0001) is considered highly informative. What does the chi-squared test actually measure?","options":{"A":"It measures the correlation between two continuous features","B":"It tests the null hypothesis that the feature and the target are statistically independent; a very small p-value (reject H₀) means the feature's distribution differs significantly across target classes — evidence of association; a large p-value (fail to reject) suggests the feature provides no information about the target","C":"It measures the predictive accuracy if the feature is used alone","D":"Chi-squared test is only valid for regression targets, not classification"},"correct":"B","explanation":{"correct":"- Chi-squared test: compares observed vs expected frequencies in a contingency table. Expected = what you'd see if feature and target were independent.\n- Small p-value: the observed distribution of the feature across target classes is too different to be explained by chance → feature is associated with the target.\n- Limitation: tests marginal association only. Does not capture interactions. Can produce small p-values for features that are associated with the target but not useful conditional on other features.","A":"Chi-squared for feature selection tests categorical vs categorical (feature vs target) association. Correlation (Pearson) is for continuous features.","B":"","C":"Chi-squared measures statistical association, not predictive accuracy. 
A highly associated feature could have a small p-value but add little practical predictive power.","D":"Chi-squared is specifically designed for categorical features and categorical targets (classification). For regression targets, use ANOVA F-test or mutual information."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-029","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T11 · Clustering] Silhouette score = -0.15 is computed for a specific point. What does this negative value indicate?","options":{"A":"The computation has a bug — Silhouette cannot be negative","B":"The point's average distance to its own cluster ($a$) is greater than its average distance to the nearest other cluster ($b$) — the point is closer to a different cluster than its assigned one; $s = (b - a)/\\max(a,b) < 0$ when $a > b$; this indicates the point is likely misclassified into the wrong cluster","C":"The point is an outlier with no natural cluster affiliation","D":"The clustering used K that is too small"},"correct":"B","explanation":{"correct":"- Silhouette score: $s(i) = (b_i - a_i) / \\max(a_i, b_i)$.\n- $a_i$ = mean distance to points in own cluster (cohesion). $b_i$ = mean distance to nearest other cluster (separation).\n- Negative: $b_i < a_i$ → point is closer to a different cluster → wrong assignment. Score = -1 is the worst (point is deep in the wrong cluster). Score = 0 means on the boundary. Score = 1 means perfectly in its cluster.","A":"Silhouette scores range from -1 to +1. Negative values are mathematically valid and indicate poor cluster assignment.","B":"","C":"Negative Silhouette specifically indicates the point belongs better to another cluster. 
A true outlier (equidistant from all clusters) would have Silhouette near 0.","D":"The global K choice affects overall average Silhouette, but an individual point's negative score indicates that specific point is in the wrong cluster regardless of K."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-030","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T12 · Anomaly Detection] An anomaly detector reports 100% recall: it catches every single anomaly. A manager says \"perfect recall means our detection is excellent.\" What important information is missing from this evaluation?","options":{"A":"Recall of 100% is the best possible result — no additional information is needed","B":"100% recall may be trivially achieved by flagging everything as anomalous; in that case, precision = (actual anomaly rate) ≈ 1% → 99% of flags are false positives; high recall without precision is operationally useless; the precision-recall tradeoff must be reported together","C":"Recall is the wrong metric for anomaly detection — only precision matters","D":"100% recall is mathematically impossible for any detector"},"correct":"B","explanation":{"correct":"- Trivial high-recall model: flag every single event as anomalous. TP = all anomalies (100% recall). FP = all normal events. If anomaly rate is 1%: precision = 1% → 99% of alerts are false positives.\n- Operational impact: investigators must examine every event. This eliminates the value of the detector entirely.\n- Balanced evaluation: always report precision AND recall together, or use F1 (equal-cost) or PR-AUC (full tradeoff curve) for anomaly detection.","A":"100% recall says nothing about how many false positives are generated. Without precision, the evaluation is incomplete.","B":"","C":"Both precision and recall matter for anomaly detection. Precision determines investigator workload; recall determines how many anomalies are caught. 
Ignoring either produces a misleading assessment.","D":"100% recall is achievable — flag everything as anomalous. It is trivially achievable and therefore should not be cited as evidence of detector quality without precision."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-031","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T02 · Linear Regression] Homoscedasticity is one of the Gauss-Markov assumptions for OLS to be BLUE (Best Linear Unbiased Estimator). What does homoscedasticity mean?","options":{"A":"Residuals are normally distributed across all fitted values","B":"The variance of the residuals is constant across all fitted values — $\\text{Var}(\\epsilon_i) = \\sigma^2$ for all $i$; when violated (heteroscedasticity), OLS is still unbiased but no longer has minimum variance; some predictions are noisier than others","C":"Features are uncorrelated with each other (no multicollinearity)","D":"The relationship between features and the target is linear"},"correct":"B","explanation":{"correct":"- Homoscedasticity: same variance of errors across all observation levels. On a residual vs fitted plot: residuals should form a horizontal band, not a funnel shape.\n- When violated (heteroscedasticity): OLS is still unbiased but not efficient. Standard errors of coefficients are wrong → confidence intervals and p-values are invalid.\n- Fix: transform the target (log, sqrt), use weighted least squares (WLS), or use heteroscedasticity-robust standard errors (White's correction).","A":"Normality of residuals is a separate assumption (required for hypothesis tests and confidence intervals). Homoscedasticity is specifically about constant variance, not the distribution shape.","B":"","C":"Feature independence (no multicollinearity) is a separate Gauss-Markov condition related to the invertibility of $X^TX$.","D":"Linearity ($E[y] = X\\beta$) is a separate assumption — the relationship between $X$ and $y$ must be linear. 
These are four distinct Gauss-Markov conditions."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-032","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T15 · Bias-Variance] Which of the following model changes will definitely reduce variance without changing bias?","options":{"A":"Increasing the polynomial degree from 3 to 5","B":"Adding more training data (larger dataset for the same model class)","C":"Removing regularization (setting λ from 0.1 to 0)","D":"Adding more features to the model"},"correct":"B","explanation":{"correct":"- More training data → reduces variance: $\\text{Var}(\\hat{f}) \\propto \\sigma^2 / n$. As $n$ increases, variance decreases toward 0. Bias is unchanged (the model class and its average prediction remain the same — only the estimation becomes more stable).\n- A: Higher polynomial degree increases model complexity → increases variance (and decreases bias).\n- C: Removing regularization allows larger weights → increases variance.\n- D: Adding more features can increase variance (more parameters to estimate) and may reduce bias.","A":"Increased complexity → lower bias, higher variance. Not a variance-reduction step.","B":"","C":"Removing regularization reduces the shrinkage constraint → weights can grow larger → higher variance, lower bias.","D":"Adding features adds parameters. Unless regularization compensates, more features → higher variance."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-001","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T01 · ML Fundamentals] The No Free Lunch (NFL) theorem states that all learning algorithms perform equally well when averaged over all possible data-generating distributions. 
A colleague concludes: \"NFL means deep learning is not fundamentally better than a decision tree.\" Is this reasoning correct, and what does NFL actually imply for practitioners?","options":{"A":"Correct — NFL proves no algorithm is universally superior, so deep learning is just a fad","B":"Technically correct but practically misleading — NFL applies to uniform averaging over ALL possible distributions including pathological ones (random label assignments, pure noise); in practice, real-world problems come from a small subset of distributions that have structure (spatial locality, temporal patterns, compositional hierarchy); deep learning's inductive biases (CNNs for spatial structure, transformers for sequence) are precisely designed for these structured distributions; the NFL result is trivially satisfied but uninformative for practical algorithm selection","C":"NFL proves that algorithm selection never matters — always use the simplest model","D":"NFL is a theoretical curiosity that has been proven wrong by deep learning's empirical success"},"correct":"B","explanation":{"correct":"- NFL theorem (Wolpert, 1996): $\\sum_{f} E[L(A_1, f)] = \\sum_f E[L(A_2, f)]$ for any two algorithms $A_1, A_2$. Summed over all possible target functions $f$, all algorithms are equal.\n- The catch: the uniform distribution over $f$ is unrealistic. Nature's problems have structure. Inductive biases (smoothness priors, local connectivity, compositionality) are exploited by specific architectures.\n- Practitioner implication: NFL implies you cannot choose an algorithm without prior assumptions about the problem's structure. Deep learning wins when its inductive biases match the problem structure. It doesn't win on all problems.","A":"Misapplies NFL to practical settings. NFL says nothing about performance on any specific distribution — only about the average over all distributions. 
Deep learning's dominance on structured data is compatible with NFL.","B":"","C":"NFL does not support \"always use the simplest model.\" It says you cannot choose without assumptions — which implies domain knowledge should guide selection.","D":"Deep learning's success is consistent with NFL. NFL is not disproven — it's a theorem. The success occurs because real problems have structure not covered by NFL's uniform distribution."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-002","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T02 · Linear Regression] The Frisch-Waugh-Lovell (FWL) theorem states that the OLS coefficient for $X_1$ in the regression $y \\sim X_1 + X_2$ equals the OLS coefficient from the regression of $M_{X_2}y$ on $M_{X_2}X_1$, where $M_{X_2} = I - X_2(X_2^TX_2)^{-1}X_2^T$ is the annihilator matrix. What does the FWL theorem mean for interpreting coefficients in multiple regression?","options":{"A":"FWL means all regression coefficients are independent of each other","B":"FWL means the coefficient $\\hat{\\beta}_1$ measures the effect of $X_1$ on $y$ after $X_2$ has been \"partialled out\" of both — it is the effect of the component of $X_1$ that is orthogonal to $X_2$; this formalizes the \"ceteris paribus\" (all else equal) interpretation; adding or removing $X_2$ changes $\\hat{\\beta}_1$ precisely when $X_1$ is correlated with $X_2$ (omitted variable bias); FWL explains why multicollinearity inflates standard errors: $M_{X_2}X_1$ has low variance when $X_1$ and $X_2$ are correlated","C":"FWL proves that $\\hat{\\beta}_1$ is identical whether or not $X_2$ is included in the model","D":"FWL is only valid for orthogonal feature matrices ($X_1^TX_2 = 0$)"},"correct":"B","explanation":{"correct":"- Geometric interpretation: $M_{X_2}$ projects out the $X_2$ subspace. 
$M_{X_2}X_1$ is the residual of $X_1$ after regressing on $X_2$ — the variation in $X_1$ unexplained by $X_2$.\n- Omitted variable bias: if $X_2$ is omitted, $\\hat{\\beta}_1$ absorbs both the effect of $X_1$ AND the effect of $X_2$ on $y$ mediated through $X_1$'s correlation with $X_2$.\n- Multicollinearity: high correlation between $X_1$ and $X_2$ → $M_{X_2}X_1 \\approx 0$ (very small) → $\\text{Var}(\\hat{\\beta}_1) = \\sigma^2 / ||M_{X_2}X_1||^2 \\to \\infty$ → inflated standard errors.","A":"FWL proves the opposite — coefficients are dependent on what other variables are in the model. The \"ceteris paribus\" effect is conditional on other regressors.","B":"","C":"$\\hat{\\beta}_1$ changes when $X_2$ is included if $X_1$ and $X_2$ are correlated. FWL explains precisely how $\\hat{\\beta}_1$ changes.","D":"FWL applies generally, not just for orthogonal features. The formula $M_{X_2}$ computes the orthogonal complement regardless of feature correlation."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-003","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T03 · Logistic Regression] Maximum likelihood estimation for logistic regression minimizes the cross-entropy loss. 
Show why cross-entropy loss and negative log-likelihood of the Bernoulli distribution are equivalent, and what this implies about the model's implicit distributional assumption.","options":{"A":"They are unrelated — cross-entropy and log-likelihood are different optimization objectives","B":"For binary classification, MLE assumes $y_i \\sim \\text{Bernoulli}(\\hat{p}_i)$ where $\\hat{p}_i = \\sigma(w^Tx_i)$; the log-likelihood: $\\ell = \\sum_i [y_i \\log(\\hat{p}_i) + (1-y_i)\\log(1-\\hat{p}_i)]$; maximizing this is exactly minimizing the cross-entropy loss; this implies logistic regression is the max-entropy model for binary outcomes with linear sufficient statistics — it makes no assumptions beyond the feature-outcome relationship; training the classifier with L2 loss instead would be inappropriate because it assumes Gaussian noise, which does not fit binary labels","C":"They are equivalent but only for balanced classes","D":"Cross-entropy and log-likelihood are equivalent, which proves that logistic regression is unbiased for any classification problem"},"correct":"B","explanation":{"correct":"- Bernoulli MLE: $L(w) = \\prod_i \\hat{p}_i^{y_i}(1-\\hat{p}_i)^{1-y_i}$. Taking log: $\\ell = \\sum_i [y_i \\log \\hat{p}_i + (1-y_i)\\log(1-\\hat{p}_i)]$. Negating to get a loss: $-\\ell = -\\sum_i [y_i \\log \\hat{p}_i + (1-y_i)\\log(1-\\hat{p}_i)]$ = cross-entropy.\n- Max-entropy interpretation: among all distributions consistent with linear constraints $E[x_j y]$, logistic regression gives the maximum entropy distribution over $y|x$ — it makes minimal additional assumptions.\n- Practical implication: using MSE for binary classification assumes a Gaussian error model, which is wrong for $y \\in \\{0,1\\}$. This explains why MSE-trained models have saturated gradient problems in binary classification.","A":"They are mathematically identical — one is maximized and the other minimized, but the same $w^*$ satisfies both.","B":"","C":"The equivalence holds regardless of class balance. 
Class balance affects the optimal threshold, not the loss function's validity.","D":"MLE consistency (unbiasedness asymptotically) applies when the model is correctly specified. If the true relationship is nonlinear in $x$, logistic regression is biased regardless of the loss function derivation."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-004","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T04 · Decision Trees] CART tree training is NP-hard in general, but greedy top-down induction is used in practice. Explain why the greedy approach does not find the globally optimal tree, and describe one scenario where the greedy approach is demonstrably suboptimal.","options":{"A":"Greedy CART always finds the global optimum for sufficiently large datasets","B":"Greedy CART makes locally optimal splits at each node without backtracking — the split that maximizes impurity reduction at depth 1 is chosen without considering what splits become available at depth 2+; classic scenario: feature A has moderate Gini reduction but enables a perfect depth-2 split; feature B has higher immediate Gini reduction but leads to a dead end; greedy selects B (better immediate gain) and misses the globally better A-then-split tree; this is the XOR problem — neither $X_1$ nor $X_2$ alone separates XOR labels but their combination does; a greedy impurity-based split finds neither feature useful alone and cannot construct the optimal tree","C":"Greedy CART is globally optimal because it evaluates all possible trees","D":"Greedy approaches are globally optimal for tree structures due to the principle of optimality"},"correct":"B","explanation":{"correct":"- Optimality condition for greedy: Bellman's principle of optimality applies when subproblems are independent. Tree splits are NOT independent — the data subset reaching a child node depends on the parent split.\n- XOR example: $y = X_1 \\oplus X_2$ (XOR). 
$I(Y; X_1) = I(Y; X_2) = 0$ (each feature alone is statistically independent of the label). Greedy impurity at depth 1 = 0 for both features. No split improves impurity. The correct depth-2 tree uses both features, but greedy cannot discover this.\n- Alternative: look-ahead strategies, beam search, or random forest's randomization implicitly explore non-greedy splits.","A":"Greedy is demonstrably suboptimal for the XOR problem and many other interaction-based problems.","B":"","C":"Greedy evaluates one level at a time. It does not evaluate all possible trees. The number of possible binary trees grows exponentially — it is NP-hard to find the global optimum.","D":"Bellman's optimality applies to problems where optimal substructure holds (each subproblem's solution contributes independently). Tree splits violate this because data routing to subtrees depends on ancestor splits."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-005","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T05 · Random Forest] Impurity-based feature importance in Random Forest is known to be biased toward high-cardinality features. 
Explain the mechanism and describe the permutation importance method as an unbiased alternative.","options":{"A":"Impurity importance is unbiased — high-cardinality features are correctly ranked higher because they provide more information","B":"Impurity importance bias: a feature with many unique values (continuous or high-cardinality categorical) has more candidate thresholds; more thresholds → higher probability of finding a split that reduces impurity by chance → inflated importance; especially visible when comparing a continuous feature (1000 thresholds) vs binary feature (1 threshold); permutation importance: for each feature, randomly shuffle its values in the validation set and measure accuracy drop; a large drop indicates the feature was important; permutation operates on held-out data with the trained model, not during tree-building, so it avoids the threshold-count bias","C":"Impurity importance is biased only for categorical features with fewer than 10 categories","D":"Permutation importance is biased in the opposite direction — it underestimates all feature importances"},"correct":"B","explanation":{"correct":"- Impurity bias mechanism: at each split, CART searches all thresholds of all candidate features. A feature with 100 thresholds has 100 chances to find a good split by chance. A binary feature has 1 chance. The expected maximum impurity reduction scales with the number of thresholds.\n- Mathematical consequence: $E[\\max_{t \\in T_j} \\Delta G_j]$ increases with $|T_j|$ (number of thresholds). Features with more thresholds are systematically favored even when truly uninformative.\n- Permutation importance: evaluates the model on validation data. Shuffle feature $j$, observe $\\Delta \\text{acc}$. This measures actual predictive contribution, not split-time convenience. Unaffected by cardinality.","A":"The bias has been empirically documented and mathematically explained. 
High-cardinality random features score higher than low-cardinality informative features in impurity importance.","B":"","C":"The bias affects all features proportionally to their number of candidate splits. Continuous features (many real-valued thresholds) are most affected, but any feature with more thresholds is biased upward.","D":"Permutation importance can underestimate importance for correlated features (when one correlated feature is shuffled, the model uses the other correlated feature to compensate, so the measured drop understates the shuffled feature's importance). But it is not systematically biased toward underestimation across all features."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-006","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T06 · Gradient Boosting] Newton boosting (second-order gradient boosting, as in XGBoost) uses both gradients and Hessians. Explain why using the Hessian improves convergence compared to first-order gradient boosting, using the analogy of Newton's method vs gradient descent.","options":{"A":"The Hessian is only useful for smooth loss functions — gradient boosting doesn't need it","B":"Newton's method vs gradient descent: gradient descent takes step proportional to $-g$ (gradient); Newton's method takes step $-H^{-1}g$ (Hessian-adjusted); near optima, the Hessian captures curvature — in flat directions (large step safe), in steep directions (large step overshoots); applying this to boosting: each tree's leaf weights in XGBoost are $w^*_j = -G_j/(H_j + \\lambda)$ where $G = \\sum g_i$ and $H = \\sum h_i$; regions with high curvature (high $h_i$) get smaller leaf weight corrections; this prevents overshooting and allows larger effective learning rates while maintaining stability; convergence requires fewer trees","C":"Hessian use is a computational trick that reduces memory, not a convergence improvement","D":"Newton boosting and gradient boosting converge to different optima — Hessian use changes the 
solution"},"correct":"B","explanation":{"correct":"- Gradient: $g_i = \\partial L / \\partial \\hat{y}_i$. Hessian: $h_i = \\partial^2 L / \\partial \\hat{y}_i^2$.\n- For log-loss: $g_i = \\hat{p}_i - y_i$, $h_i = \\hat{p}_i(1-\\hat{p}_i)$. Near $\\hat{p} = 0.5$ (high uncertainty): $h = 0.25$ (moderate weight). Near $\\hat{p} = 0$ or $1$ (high confidence): $h \\approx 0$ (tiny weight). This prevents large corrections for high-confidence predictions.\n- Convergence: XGBoost typically needs fewer trees than GBDT (fewer boosting rounds) to achieve the same loss, especially for well-separated data points.","A":"The Hessian provides additional curvature information. For smooth, well-behaved losses (log-loss, MSE), the Hessian is well-defined and beneficial. MSE Hessian is constant (2), so Newton ≈ gradient for MSE.","B":"","C":"Hessian computation adds memory and computation overhead. The motivation is faster convergence, not memory reduction.","D":"Both approaches converge to the same optimal tree ensemble for the same loss. The Hessian adjustment speeds up the path to the optimum, not the destination."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-007","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T07 · SVM] Mercer's theorem states that a function $K(x, z)$ is a valid kernel if and only if it is a symmetric positive semi-definite function. 
Why is this condition necessary for the kernel trick to work, and give an example of a function that looks like a kernel but is not one.","options":{"A":"Mercer's condition is just a mathematical formality — any similarity function can be used as a kernel","B":"The kernel trick requires that $K(x,z) = \\langle \\phi(x), \\phi(z) \\rangle$ for some feature map $\\phi$; a valid inner product in any Hilbert space must produce a positive semi-definite Gram matrix $K_{ij} = K(x_i, x_j)$; if $K$ is not PSD, no valid $\\phi$ exists — you'd be optimizing in a non-Euclidean space where the SVM dual problem may be non-convex (indefinite quadratic program) with no guaranteed global minimum; example: $K(x,z) = -||x-z||$ (negative distance) can produce indefinite Gram matrices — not a valid kernel","C":"Mercer's condition guarantees that the kernel produces linearly separable data in the feature space","D":"Mercer's condition only applies to polynomial kernels — RBF kernels don't need it"},"correct":"B","explanation":{"correct":"- SVM dual: $\\max_\\alpha \\sum \\alpha_i - \\frac{1}{2}\\alpha^T K \\alpha$ where $K_{ij} = K(x_i, x_j)$. This is a quadratic program. For it to be convex (and have a global maximum), $K$ must be positive semi-definite.\n- Non-PSD kernel: $-\\frac{1}{2}\\alpha^T K \\alpha$ may be non-convex → saddle points, no guarantee of finding a global optimum → SVM training fails or produces meaningless solutions.\n- Common valid kernels: linear, polynomial (with $c \\geq 0$, $d$ integer), RBF (PSD by Bochner's theorem), Laplacian, sigmoid (PSD only for specific parameter ranges).","A":"Using a non-PSD function as a kernel breaks the convex optimization guarantee. The SVM dual solver (SMO) may not converge or converges to a saddle point.","B":"","C":"Mercer's condition guarantees a valid inner product space exists, not that data is linearly separable in that space. 
Separability depends on data distribution and the specific kernel choice.","D":"Mercer's condition applies to ALL kernels. RBF satisfies it (proven via Bochner's theorem on positive definite functions). The condition is universal."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-008","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T08 · KNN] Approximate nearest neighbor (ANN) methods like HNSW (Hierarchical Navigable Small World graphs) trade exactness for speed. Explain the HNSW indexing structure and why it achieves $O(\\log n)$ search complexity.","options":{"A":"HNSW is just a better-sorted array — it achieves log(n) via binary search","B":"HNSW builds a multi-layer proximity graph; the top layer is a sparse long-range graph (few nodes, long edges, fast approximate navigation); lower layers add increasing density with shorter edges; at query time: start at the top layer, greedily navigate to the nearest node, descend to the next layer, repeat; this hierarchical navigation visits $O(\\log n)$ nodes per layer instead of all $n$ points; the long-range edges at top layers allow skipping large portions of the space; recall (approximate accuracy) is tunable via `ef` (exploration factor) parameter: higher `ef` = more neighbors explored = higher recall, more computation","C":"HNSW achieves O(log n) by pre-sorting points along a Hilbert curve","D":"HNSW is only efficient for Euclidean distance — cosine or Hamming distances require exact search"},"correct":"B","explanation":{"correct":"- Graph structure analogy: \"six degrees of separation\" — a small-world graph. From any node, you can reach any other node in $O(\\log n)$ hops via long-range connections (shortcuts).\n- Layer hierarchy: inspired by skip lists. Layer 0 has all $n$ nodes. Layer 1 has a random subset, layer 2 a further subset, etc. 
Long edges at high layers enable fast long-distance jumps.\n- Query: greedy best-first search from the entry point (typically the centroid or a random high-layer node). At each layer, move to the nearest neighbor among connected nodes.","A":"HNSW is a graph structure, not a sorted array. Binary search requires a linear sorted order, which doesn't generalize to high-dimensional spaces.","B":"","C":"Hilbert curve (space-filling curve) indexing gives $O(\\log n)$ for 1D projections but struggles in high dimensions due to the curse of dimensionality. HNSW doesn't use Hilbert curves.","D":"HNSW works with any distance metric for which a greedy graph traversal converges to the approximate nearest neighbor. Cosine, Euclidean, and inner product are all supported in faiss and hnswlib."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-009","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T09 · Naive Bayes] Domingos & Pazzani (1997) showed that Naive Bayes can be an optimal classifier even when the independence assumption is violated, under certain conditions. 
What are those conditions, and why does NB sometimes achieve near-optimal accuracy despite being a biased probability estimator?","options":{"A":"NB achieves optimality only when features are truly independent","B":"Key condition: NB needs only to rank classes correctly at the decision boundary, not to estimate probabilities accurately; even with strongly dependent features, NB's classification rule $\\hat{y} = \\arg\\max_c P(c)\\prod_i P(x_i|c)$ may produce the correct argmax even when the absolute probability values are wrong; the dependencies can be \"benign\" (they don't change the argmax ordering); practically: NB calibration is poor but decision accuracy is often competitive; optimal condition: the feature dependencies are symmetric across classes (affect all classes equally) — they distort all class scores proportionally, preserving the argmax","C":"Domingos & Pazzani proved NB is optimal whenever it's competitive with Bayes optimal accuracy","D":"NB is optimal when using additive Laplace smoothing"},"correct":"B","explanation":{"correct":"- Classification accuracy vs probability estimation: NB needs $\\arg\\max_c$ to be correct, not $P(c|x)$ to be calibrated. These are weaker conditions.\n- Benign dependencies: if $x_1$ and $x_2$ are correlated given each class, but the correlation pattern is similar across classes, the log-ratio $\\log[P(c_1|x)/P(c_2|x)]$ may still have the right sign even though each individual $P(c|x)$ is wrong.\n- Empirical evidence: NB is competitive with logistic regression on text classification despite clear word co-occurrence dependencies (words like \"not\" and \"bad\" co-occur in sentiment analysis).","A":"This would make the result trivial. The insight is that NB can work DESPITE violated independence — this is what makes it practically useful.","B":"","C":"The question is about when NB is optimal, not circular. 
The conditions are about the benign symmetry of dependencies across classes.","D":"Laplace smoothing prevents zero probabilities but does not affect the independence assumption or the classification decision structure."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-010","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T10 · PCA] The Johnson-Lindenstrauss (JL) lemma states that any $n$ points in high-dimensional space can be embedded into $O(\\log n / \\epsilon^2)$ dimensions while preserving pairwise distances within a factor of $(1 \\pm \\epsilon)$. How does this compare to PCA, and when would you use JL random projections instead?","options":{"A":"JL and PCA are identical — both reduce to log(n) dimensions","B":"JL: dimensionality depends only on $n$ (number of points) and $\\epsilon$, not on original dimension $d$; the projection is a random matrix (Gaussian or ±1 entries); no training required; computational complexity $O(ndk)$ for a dense projection of all $n$ points to $k$ dimensions; PCA: dimensionality is data-driven (eigenvectors of covariance); captures maximum-variance directions; requires $O(n d^2)$ computation; JL preferred when: $n$ is small (log(n) << PCA's $k$), data has no dominant low-rank structure, fast sketching is needed for streaming/one-pass; PCA preferred when: data has strong low-rank structure, interpretable directions needed, intrinsic dimensionality < log(n)","C":"JL projections always outperform PCA for dimensionality reduction","D":"JL lemma is only applicable to nearest-neighbor search, not general dimensionality reduction"},"correct":"B","explanation":{"correct":"- JL target dimension: $k = O(\\log n / \\epsilon^2)$. Ignoring the constant factor, for $n=1000$ points with $\\epsilon=0.1$: $k \\approx \\ln(1000)/\\epsilon^2 \\approx 6.9/0.01 \\approx 700$. For $n = 10^6$: $k \\approx 13.8/0.01 \\approx 1400$. Independent of original $d$.\n- Comparison: if PCA's effective rank is small (10-50 dimensions capture 99% variance), PCA produces a much lower-dimensional embedding than JL. 
JL's log(n) guarantee can be large.\n- JL's power: the projection is data-agnostic. You don't need to see all the data to compute the projection matrix — useful for streaming, privacy-preserving learning (RAPPOR), and randomized linear algebra.","A":"JL and PCA have very different properties. JL is random and data-independent; PCA is deterministic and data-adaptive. Their target dimensions can differ by orders of magnitude.","B":"","C":"Neither consistently outperforms the other. PCA is better when data has low intrinsic rank; JL is better for fast, data-agnostic sketching.","D":"JL is used in compressed sensing, sketch-and-solve linear regression, privacy-preserving ML, and fast matrix multiplication. It is not limited to nearest-neighbor search."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-011","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T11 · Clustering] Spectral clustering converts the clustering problem into a graph partitioning problem using eigendecomposition of the graph Laplacian. 
Why can spectral clustering find clusters that K-means cannot, and what is the computational bottleneck?","options":{"A":"Spectral clustering is just K-means applied to normalized features","B":"Spectral clustering can find non-convex, arbitrarily-shaped clusters because it uses graph connectivity rather than Euclidean distance to centroids; K-means partitions space into Voronoi cells (convex regions); two interlocking rings would be split incorrectly by K-means but separated by spectral clustering because points on different rings are not connected in the proximity graph; bottleneck: eigendecomposition of the $n \\times n$ graph Laplacian costs $O(n^3)$ — prohibitive for large datasets; approximate methods (Nyström approximation, landmark-based) scale to $O(n \\cdot k^2)$","C":"Spectral clustering finds only linear cluster boundaries, like K-means","D":"The computational bottleneck is the k-means step at the end, not the eigendecomposition"},"correct":"B","explanation":{"correct":"- Graph Laplacian: $L = D - W$ where $W_{ij}$ is the edge weight (similarity between $x_i, x_j$) and $D_{ii} = \\sum_j W_{ij}$. The eigenvectors of $L$ encode the graph's cluster structure.\n- Non-convex clusters: K-means computes $||x - \\mu_k||^2$ — points near centroid 1 are in cluster 1 regardless of topology. Spectral clustering's affinity is local (RBF kernel with small bandwidth) → connectivity follows the manifold, not Euclidean ball.\n- Two-moons, concentric rings: standard benchmarks where spectral succeeds and K-means fails.","A":"Spectral clustering uses K-means as a final step (on the eigenvectors), but the core mechanism is graph Laplacian eigendecomposition. The clustering is fundamentally different.","B":"","C":"The final K-means step on eigenvectors is linear in the eigenvector space, but the eigenvectors themselves encode non-linear structure. The combined effect finds non-convex clusters in original space.","D":"K-means is applied to the low-dimensional eigenvector embedding (an $n \\times k$ matrix). 
This is fast. The bottleneck is the $n \\times n$ eigendecomposition, which grows cubically."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-012","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T12 · Anomaly Detection] Conformal prediction provides distribution-free anomaly scores with statistical guarantees. Explain how conformal prediction differs from threshold-based anomaly detection, and what guarantee it provides.","options":{"A":"Conformal prediction is a synonym for threshold-based anomaly detection","B":"Threshold-based: pick score threshold $\\tau$; if anomaly score > $\\tau$, flag as anomaly; no statistical guarantee on false positive rate for new distributions; threshold selection is ad hoc; conformal prediction: use a calibration set of normal examples; compute non-conformity score $s_i$ for each calibration example; for a test point $x$, compute $p\\text{-value} = (|\\{i : s_i \\geq s(x)\\}| + 1) / (n_{\\text{cal}} + 1)$ (the $+1$ counts the test point itself and is what makes the guarantee hold in finite samples); flag $x$ as anomaly if $p < \\alpha$; guarantee: false positive rate $\\leq \\alpha$ regardless of the data distribution (assuming exchangeability); this is a distribution-free coverage guarantee","C":"Conformal prediction requires knowing the true data distribution, making it impractical","D":"Conformal prediction only works for classification, not anomaly detection"},"correct":"B","explanation":{"correct":"- Exchangeability assumption: the calibration set and test points are exchangeable (weaker than i.i.d.). Under this assumption, the p-value is (super-)uniformly distributed for normal points → FPR is controlled.\n- Practical advantage: no need to choose a threshold by intuition. Choose $\\alpha = 0.05$ → at most 5% of normal points are falsely flagged in expectation, regardless of the underlying distribution.\n- Non-conformity scores: can be based on any anomaly scoring function (isolation forest score, reconstruction error, local outlier factor). 
Conformal provides the wrapper.","A":"Threshold-based methods have no statistical guarantee. Conformal prediction provides a rigorous FPR bound. They are fundamentally different.","B":"","C":"Conformal prediction is explicitly distribution-free. It requires only exchangeability, which is weaker than distributional assumptions.","D":"Conformal prediction is a general framework applicable to any prediction problem: classification, regression, and anomaly detection. It was originally developed for classification but the framework is general."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-013","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T13 · Ensemble Methods] Negative correlation learning (NCL) explicitly promotes diversity in neural network ensembles by adding a penalty term to each network's loss function that discourages agreement with the ensemble's current prediction. What is the NCL penalty term, and why might forcing diversity hurt individual model quality?","options":{"A":"NCL is equivalent to standard ensembling — it adds no explicit diversity penalty","B":"NCL penalty: $\\Omega_i = \\lambda \\sum_t (F_i(x^t) - \\bar{F}(x^t)) \\sum_{j \\neq i}(F_j(x^t) - \\bar{F}(x^t))$, added to network $i$'s loss; because deviations from the mean sum to zero, this equals $-\\lambda \\sum_t (F_i(x^t) - \\bar{F}(x^t))^2$; this penalizes $F_i$ for deviating in the same direction as the other networks and rewards deviation from $\\bar{F}$; individual quality cost: forcing each network to differ from the mean prediction may push individual networks toward suboptimal solutions — the ensemble mean may be accurate but individual models are constrained to be \"complementary\" (each covers the other's weaknesses); extreme NCL ($\\lambda$ too large) causes anti-correlated predictions that individually perform poorly; the ensemble still averages well but individual validation accuracy is artificially suppressed","C":"NCL only improves performance with $\\lambda > 1$","D":"NCL is only applicable to regression problems"},"correct":"B","explanation":{"correct":"- NCL derivation: Liu & Yao (1999). 
The penalty is the correlation between $F_i$'s deviation from the mean and the other models' summed deviations. Minimizing it drives that correlation negative (i.e., makes $F_i$ deviate opposite to the others, so that errors cancel in the average).\n- Bias-variance tradeoff at ensemble level: ensemble error = bias² + variance + covariance (averaged over members). NCL explicitly reduces covariance at the cost of potentially increasing individual variance/bias.\n- Practical use: $\\lambda$ controls the accuracy-diversity tradeoff. Small $\\lambda$ ≈ independent training; large $\\lambda$ = forced diversity; too large → individual collapse.","A":"NCL adds an explicit, mathematical diversity penalty to each model's training objective. It is fundamentally different from independent model training.","B":"","C":"$\\lambda$ is a continuous hyperparameter. Benefits appear at moderate values; harm at extreme values. No threshold at 1.","D":"NCL was originally applied to regression but works for classification with appropriate loss and output formulation. It is general to neural ensemble learning."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-014","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T14 · Model Evaluation] The Brier score can be decomposed into three components: reliability (calibration), resolution, and uncertainty. 
Explain what each component measures and how a model with perfect resolution but poor reliability should be fixed.","options":{"A":"Brier score cannot be decomposed — it is a single atomic metric","B":"Uncertainty: $\\bar{o}(1-\\bar{o})$ — the inherent difficulty of the task; irreducible; depends only on the base rate $\\bar{o}$; Resolution: measures how much predicted probabilities vary across different groups of events — how well the model distinguishes between events that occur and those that don't (forecasting value); Reliability: calibration error — mean squared difference between predicted probabilities and observed frequencies; a model with perfect resolution (correctly orders outcomes by predicted probability) but poor reliability (probabilities are miscalibrated, e.g., always outputs 0.6 when truth is 0.9) should be fixed via calibration (Platt scaling, isotonic regression) without retraining the base model","C":"Resolution measures model accuracy; reliability measures speed; uncertainty measures data quality","D":"Only reliability matters for Brier score optimization; resolution and uncertainty are academic"},"correct":"B","explanation":{"correct":"- DeGroot-Fienberg decomposition: $\\text{BS} = \\text{REL} - \\text{RES} + \\text{UNC}$. Better model → lower BS → lower REL, higher RES (resolution enters with a negative sign, so higher RES lowers BS).\n- Reliability: $\\frac{1}{N}\\sum_{k=1}^{K} n_k(\\bar{f}_k - \\bar{o}_k)^2$, where $N = \\sum_k n_k$. Binned calibration error: do events with predicted probability 0.7 actually occur 70% of the time?\n- Resolution: $\\frac{1}{N}\\sum_{k=1}^{K} n_k(\\bar{o}_k - \\bar{o})^2$. How much do outcomes differ from the base rate across forecast bins?\n- Fix: Platt scaling or isotonic regression post-hoc calibration preserves ranking (resolution) while adjusting probability values (improving reliability).","A":"The decomposition is a well-established result (DeGroot & Fienberg, 1983; Murphy, 1973). 
It is standard in meteorological forecasting and ML model evaluation.","B":"","C":"These definitions are incorrect. Resolution is discriminative power, reliability is calibration quality, uncertainty is base rate difficulty.","D":"All three components contribute to Brier score. Resolution is particularly important in decision-theoretic applications where predictions are used for threshold-based decisions."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-015","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T15 · Bias-Variance] The \"double descent\" phenomenon shows that test error can decrease, then increase, then decrease again as model complexity increases. Where does the second descent occur, and what breaks the classical bias-variance tradeoff picture?","options":{"A":"Double descent is a theoretical curiosity that doesn't occur in practice","B":"Classical U-shaped test error: underfitting → optimal → overfitting. Double descent adds a second descent at the \"interpolation threshold\" (model complexity equals n, the training set size); at this threshold, the model barely interpolates training data — test error spikes; beyond this threshold (heavily overparameterized regime), implicit regularization from gradient descent or minimum-norm solutions picks the \"smoothest\" interpolating function; the classical analysis assumes fixed noise — in the overparameterized regime, the minimum-norm interpolant has low variance despite exactly fitting training data","C":"Double descent occurs only for neural networks trained without any regularization","D":"Double descent means models should always be maximally overparameterized"},"correct":"B","explanation":{"correct":"- Interpolation threshold: at $p = n$ (parameters = samples), the model is at the boundary. 
An interpolating solution exists but is essentially unique, so it is forced to fit the label noise and can have extremely high variance.\n- Overparameterized regime ($p \\gg n$): many interpolating solutions exist. For linear models, gradient descent started from zero (or near-zero) initialization converges to the minimum-norm solution, which has a smoothness bias. This is implicit regularization.\n- Discovered empirically: Belkin et al. (2019) showed double descent for kernels, random forests, and neural networks. It challenges the single-valley bias-variance picture.","A":"Double descent has been empirically demonstrated for linear regression (random features), random forests, and neural networks. Multiple reproducible papers have confirmed it.","B":"","C":"Double descent occurs for linear models with random features, kernel methods, and trees, not just neural networks. Regularization smooths but doesn't eliminate the phenomenon.","D":"\"Always maximize overparameterization\" ignores computational cost and the risk that heavy overparameterization can still overfit without sufficient implicit regularization (e.g., with noisy labels)."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-016","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T16 · Regularization] Group Lasso extends L1 regularization to penalize groups of features jointly. The penalty is $\\sum_{g=1}^G ||w_g||_2$ (sum of L2 norms of weight groups). 
How does this differ from standard Lasso, and when should Group Lasso be preferred?","options":{"A":"Group Lasso is identical to standard Lasso applied to grouped features","B":"Standard Lasso: individual sparsity — each $w_j$ can be zero independently; $\\sum_j |w_j|$; Group Lasso: group sparsity — all weights in a group are zeroed together or all kept; penalty $\\sum_g ||w_g||_2$ is L1 between groups (sparse) and L2 within groups (non-sparse within selected groups); use when features form natural groups and group-level decisions are desired: one-hot encoded categorical features (select/drop the entire category), gene pathways (either the whole pathway is relevant or not), time series lag groups (include lags 1-5 together)","C":"Group Lasso penalizes each feature group with L1, making all features within groups individually sparse","D":"Group Lasso is only applicable to neural networks, not linear models"},"correct":"B","explanation":{"correct":"- Penalty structure: $\\Omega(w) = \\sum_{g=1}^G \\sqrt{|g|} \\cdot ||w_g||_2$ (weighted version). The $\\sqrt{|g|}$ factor normalizes for group size.\n- Geometry: standard Lasso's $||w||_1$ ball has corners at coordinate axes → sparsity. Group Lasso's group norm ball has ridges along group subspaces → group-level corners → one entire group goes to zero while others remain non-zero.\n- Application: one-hot encoding of \"city\" (1000 features) — Group Lasso either selects city as a feature (all 1000 non-zero) or removes it entirely. Standard Lasso might zero some city dummies but not others, producing an incoherent partial selection.","A":"Standard Lasso allows individual sparsity (any single feature can be zero). Group Lasso enforces that all features in a group are zeroed together. The sparsity structures are fundamentally different.","B":"","C":"L2 norm within groups means features within a selected group are NOT individually zeroed. 
The L2 norm shrinks the group uniformly but doesn't create within-group sparsity (that's Sparse Group Lasso, which combines both).","D":"Group Lasso was originally developed for linear models (Yuan & Lin, 2006). It is applicable to any model with structured weight groups."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-017","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T17 · Feature Selection] The Markov Blanket of a target variable $Y$ in a Bayesian network is the minimal set of features that renders $Y$ conditionally independent of all other variables. Explain why the Markov Blanket is the theoretically optimal feature set for predicting $Y$, and why finding it is computationally challenging.","options":{"A":"The Markov Blanket is equivalent to the set of features most correlated with Y","B":"Optimality: conditioning on the Markov Blanket $\\text{MB}(Y)$ makes $Y$ independent of all other variables — no information about $Y$ can be gained from features outside $\\text{MB}(Y)$ given $\\text{MB}(Y)$; formally: $Y \\perp X \\setminus \\text{MB}(Y) | \\text{MB}(Y)$; this means: adding any feature outside $\\text{MB}(Y)$ to the model cannot improve predictive accuracy; it is the smallest sufficient feature set; computational challenge: finding the MB requires testing conditional independence for all feature subsets — exponential search space; IAMB (Incremental Association Markov Blanket) and MMMB algorithms use forward-backward heuristics to find MB in $O(p^2)$ tests approximately","C":"Markov Blanket is a concept from graph theory with no connection to feature selection for prediction","D":"The Markov Blanket is the complete set of all features that are directly connected to Y in any network"},"correct":"B","explanation":{"correct":"- MB composition: parents of $Y$ (direct causes) + children of $Y$ (direct effects) + other parents of $Y$'s children (co-parents/spouses). 
All three sets carry information about $Y$ that isn't captured by other MB members.\n- Independence property: $P(Y | X) = P(Y | \\text{MB}(Y))$. This is a consequence of the d-separation criterion in Bayesian networks.\n- Hardness: the Bayesian network structure is unknown. Testing all conditional independencies requires exponentially many statistical tests (or exponential search over structures). IAMB uses a greedy forward phase (add features that significantly reduce entropy given current MB) and a backward phase (remove features that become redundant).","A":"Correlation (marginal association) does not define the MB. A feature can be correlated with $Y$ but excluded from MB (if it's conditionally independent given MB). A feature may be uncorrelated with $Y$ marginally but part of MB (through indirect paths or interactions).","B":"","C":"Markov Blankets directly define the optimal feature set for prediction under the Bayesian network model. They are used in MRMR (Minimum Redundancy Maximum Relevance) algorithms and causal feature selection.","D":"MB includes parents, children, AND co-parents. Features indirectly connected to $Y$ are not in the MB if they are d-separated by the MB set."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-018","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T01 · ML Fundamentals] VC dimension measures the complexity of a hypothesis class. A half-space classifier in $\\mathbb{R}^d$ has VC dimension $d+1$. 
What does this mean for sample complexity, and how does it relate to PAC learning?","options":{"A":"VC dimension measures how many features the model has","B":"VC dimension = $d+1$ for half-spaces: the classifier can shatter $d+1$ points in general position (label them with any binary assignment); PAC learning: to achieve error $\\leq \\epsilon$ with confidence $\\geq 1-\\delta$, the sample complexity is $O\\left(\\frac{d + \\log(1/\\delta)}{\\epsilon}\\right)$; higher VC dimension → more samples needed to generalize; implication: logistic regression in 100D needs $O(100/\\epsilon)$ samples; in 10,000D (NLP features), it needs $O(10,000/\\epsilon)$ → the curse of dimensionality appears in generalization, not just computation","C":"VC dimension equals the test set size needed for reliable evaluation","D":"A higher VC dimension means the model always generalizes better"},"correct":"B","explanation":{"correct":"- Shattering: a set of $m$ points is shattered by a hypothesis class if for every binary labeling of the $m$ points, there exists a hypothesis that correctly classifies all of them. VC dimension = max $m$ that can be shattered.\n- Fundamental theorem of PAC learning: a hypothesis class is PAC learnable iff its VC dimension is finite. The required sample size grows linearly with VC dimension.\n- Connection to practice: this is why high-dimensional linear models need regularization — without it, the effectively infinite-VC-dimension model (with enough features) can overfit on any finite dataset.","A":"VC dimension measures shatter capacity — the complexity of the function class, not the number of features directly. Though for half-spaces, VC dim = $d+1$ (related to dimension).","B":"","C":"VC dimension has nothing to do with test set size directly. 
Test set size relates to statistical confidence intervals for estimating error from empirical accuracy.","D":"Higher VC dimension means MORE capacity to fit training data — potentially more overfitting, requiring more training samples to generalize. High VC dimension ≠ better generalization."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-019","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T06 · Gradient Boosting] DART (Dropout Additive Regression Trees) applies dropout to gradient boosting. Explain the mechanism and why it was motivated by the over-specialization problem in standard gradient boosting.","options":{"A":"DART is dropout applied to the leaf weights within each individual tree","B":"Over-specialization in GBDT: the first trees in the sequence learn the most important patterns and dominate predictions; later trees learn residuals from increasingly low-signal data; these later trees are small corrections that can destabilize the ensemble; DART mechanism: at each boosting round, randomly drop a subset of previous trees from the ensemble, fit the new tree to the residuals of the remaining ensemble, then re-scale and add back both the dropped and new trees; this forces each new tree to be useful even when some previous trees are absent — prevents any single tree from dominating","C":"DART applies standard neural network dropout to every feature at each split","D":"DART eliminates the need for a learning rate by using dropout rate instead"},"correct":"B","explanation":{"correct":"- Motivation: in standard GBDT with many rounds, early trees are large and predictive; late trees are tiny corrections. The model is dominated by early trees. This is \"over-specialization.\"\n- DART effect: by randomly excluding trees, each new tree must contribute meaningfully even when some trees are missing — it can't rely on specific tree combinations. 
This produces a more uniform ensemble where all trees contribute.\n- Scaling: when dropped trees are added back, the predictions must be re-normalized to avoid magnitude inflation. This is the DART \"scaling\" step.\n- Implementation: available in XGBoost (`booster='dart'`) and LightGBM.","A":"DART drops entire trees, not leaf weights. The mechanism is conceptually analogous to neural dropout (removing units) but applied at the tree level.","B":"","C":"DART drops complete trees from the ensemble. It doesn't operate at the feature level within trees.","D":"DART can coexist with a learning rate. The learning rate still scales each tree's contribution; DART's dropout rate controls what fraction of trees are dropped during training."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-020","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T09 · Naive Bayes] Semi-Naive Bayes models relax the full independence assumption by allowing some feature dependencies to be modeled. 
Describe one semi-NB approach (e.g., TAN or NB with feature grouping) and analyze when it is better than full NB or full logistic regression.","options":{"A":"Semi-Naive Bayes is just Naive Bayes with more training data","B":"TAN (Tree Augmented Naive Bayes): extends NB by learning a maximum spanning tree over features (one additional parent per feature beyond the class variable); each feature $X_i$ can depend on one other feature $X_j$ in addition to the class $C$; tree structure is learned using mutual information $I(X_i; X_j | C)$ as edge weights; TAN is better than NB when: a few specific feature pairs have strong conditional dependencies (TAN captures them exactly); better than logistic regression when: small data where LR has insufficient samples to estimate all pairwise interactions; TAN sits on the bias-variance tradeoff between NB and LR","C":"Semi-NB models are always dominated by either full NB or LR — they have no use case","D":"TAN models the full joint distribution over features, making it equivalent to a Bayesian network classifier"},"correct":"B","explanation":{"correct":"- TAN algorithm (Friedman et al., 1997): (1) compute $I(X_i; X_j | C)$ for all feature pairs; (2) build a complete graph with these mutual information weights; (3) find the maximum spanning tree; (4) root the tree and direct edges from root to leaves; (5) train NB on the augmented structure.\n- Bias-variance: TAN has fewer parameters than LR but more than NB. With small datasets, TAN generalizes better than LR (fewer parameters to estimate). With moderate data, LR's flexibility pays off.\n- Practical: TAN achieves competitive performance with LR while maintaining the generative framework's advantages (missing data, easy feature addition).","A":"More training data doesn't change NB's assumption. 
TAN structurally relaxes the independence assumption by allowing one parent dependency per feature.","B":"","C":"TAN fills a practical niche between NB's extreme independence assumption and LR's full discriminative flexibility. It outperforms NB on datasets with correlated features and small data.","D":"TAN uses a tree structure (one parent per node), not a full Bayesian network (arbitrary parents). A full Bayesian network could model all dependencies but would require exponentially more parameters."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-021","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T14 · Model Evaluation] Cohen's Kappa is often used for evaluating multi-class classification with class imbalance. Explain how Kappa accounts for chance agreement that accuracy ignores, and describe a scenario where high accuracy and low Kappa both occur simultaneously.","options":{"A":"Cohen's Kappa is equivalent to accuracy — it measures the same thing with a different formula","B":"Kappa: $\\kappa = (p_o - p_e)/(1 - p_e)$; $p_o$ = observed accuracy; $p_e$ = expected accuracy by chance (based on class marginals); a classifier that always predicts the majority class has $p_o = p_e$ → $\\kappa = 0$; scenario: 95% class A, 5% class B; classifier always predicts A; accuracy = 95%; $p_e = 1 \\times 0.95 + 0 \\times 0.05 = 0.95$; $\\kappa = (0.95 - 0.95)/(1 - 0.95) = 0$, i.e., no skill above chance despite high accuracy; the same holds under 99% imbalance: accuracy = 99%, $\\kappa = 0$","C":"Cohen's Kappa is only valid for binary classification","D":"High accuracy always means high Kappa — the two metrics agree on model ranking"},"correct":"B","explanation":{"correct":"- Expected agreement $p_e$: the probability that a random classifier (matching the marginal distributions) would agree with the true labels by chance. 
For two-class imbalanced data: $p_e = P(\\hat{y}=A)P(y=A) + P(\\hat{y}=B)P(y=B)$.\n- $\\kappa = 0$: no skill above random baseline. $\\kappa < 0$: worse than chance. $\\kappa > 0.8$: near-perfect agreement.\n- Concrete extreme: predict all A, 99% class A. Accuracy = 99%. Since $P(\\hat{y}=A) = 1$ and $P(\\hat{y}=B) = 0$: $p_e = 1 \\times 0.99 + 0 \\times 0.01 = 0.99$. $\\kappa = (0.99 - 0.99)/(1 - 0.99) = 0$. Zero kappa despite \"excellent\" accuracy.","A":"The key difference is $p_e$ — the chance correction. Accuracy ignores $p_e$; Kappa explicitly subtracts it. For balanced datasets, they are closely related. For imbalanced data, they diverge significantly.","B":"","C":"Cohen's Kappa generalizes to multi-class classification naturally. $p_e = \\sum_k P(\\hat{y}=k) \\times P(y=k)$, summed over all $k$ classes.","D":"High accuracy does NOT imply high Kappa for imbalanced data. A classifier with accuracy 99% can have Kappa near 0 if class imbalance is extreme."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-022","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T15 · Bias-Variance] Rashomon sets are defined as the set of all models within a certain loss tolerance of the best model. 
Why does a large Rashomon set have both practical benefits and risks for deploying ML models in regulated industries?","options":{"A":"Rashomon sets are only relevant for research — practitioners always select the single best model","B":"Large Rashomon set means many models achieve near-identical predictive accuracy; benefit: enables selection of the simplest, most interpretable model from the set (Occam's razor via Rashomon); in healthcare/credit, regulators require explainability — choosing a sparse linear model from the Rashomon set instead of a black-box model satisfies regulation without sacrificing accuracy; risk: the many near-equivalent models may have very different decision rationales — \"predictively equivalent\" models can disagree on 20-40% of individual predictions, creating arbitrariness in who receives loans/treatment; disparate impact: different Rashomon models may apply different implicit criteria, leading to inconsistent and potentially discriminatory decisions","C":"Large Rashomon sets always indicate overfitting","D":"All models in a Rashomon set are functionally identical and make identical predictions"},"correct":"B","explanation":{"correct":"- Rashomon set definition: $\\{f : L(f) \\leq L^* + \\epsilon\\}$ where $L^*$ is the minimum loss. All $f$ in this set are \"equally good\" within tolerance $\\epsilon$.\n- Interpretability benefit: Semenova et al. (2022) showed that many real-world datasets have large Rashomon sets that include simple decision lists achieving near-optimal performance. Regulators can be satisfied.\n- Predictive multiplicity risk: multiple models with identical aggregate accuracy make different predictions for individual cases. An individual can be denied a loan by one near-optimal model but approved by another with equal aggregate performance. This unpredictability is ethically problematic.","A":"Rashomon sets are increasingly important for ML fairness, interpretability, and regulatory compliance. 
The concept shapes how regulated industries should approach model selection.","B":"","C":"Rashomon sets relate to model complexity and the landscape of the loss function. A large Rashomon set often indicates an identifiability problem or that the data supports many equally good explanations — not necessarily overfitting.","D":"Models in the Rashomon set have near-equal aggregate loss but can have very different individual predictions. This is the \"predictive multiplicity\" phenomenon documented in real datasets."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-023","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T12 · Anomaly Detection] An adversarial attacker learns the decision boundary of an anomaly detection model (e.g., Isolation Forest) and crafts anomalous inputs that avoid detection (adversarial examples for anomaly detection). Describe the attack mechanism and a defense strategy.","options":{"A":"Adversarial attacks on anomaly detection are impossible — anomaly detectors are robust by design","B":"Attack mechanism: if the attacker can query the model (black-box) or access model parameters (white-box), they can iteratively adjust an anomalous input to minimize its anomaly score below the detection threshold; for Isolation Forest: the attacker generates samples that occupy dense regions of the training space (long average path length in the trees, hence a low anomaly score) while still representing malicious behavior; defense strategies: (1) ensemble diverse detectors (Isolation Forest + LOF + OCSVM) — an adversary that evades one may not evade all; (2) randomized thresholds and model refresh; (3) incorporate distributional shift detection alongside anomaly scoring; (4) adversarial training on anomaly detectors","C":"Adversarial anomaly examples can only exist in image or text domains","D":"Adding more training data always defends against adversarial anomaly attacks"},"correct":"B","explanation":{"correct":"- Adversarial anomaly: an attack that is semantically 
anomalous (fraud, intrusion) but statistically \"normal\" (similar to training distribution). The attacker minimizes $\\text{AnomalyScore}(x)$ while maximizing attack effectiveness.\n- Black-box attack: query the model with trial inputs, use gradient-free optimization (evolutionary algorithms, Bayesian optimization) to find low-score anomalous inputs.\n- White-box: if using gradient-based models (autoencoders), use backpropagation to minimize reconstruction error while maintaining attack payload.\n- Ensemble defense: forcing the attacker to evade multiple simultaneously is an exponentially harder optimization problem.","A":"Anomaly detectors are not inherently robust. Any model with a computable score function is potentially vulnerable to adversarial optimization.","B":"","C":"Adversarial attacks have been demonstrated on network intrusion detection, fraud detection (tabular), and industrial control systems. Domain is irrelevant — the attack exploits the score function, not the data modality.","D":"Adding more normal training data makes the \"normal\" region denser, not more protected. The adversary still targets the dense normal region. Additional training data doesn't inherently defend against adversarial anomaly construction."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-024","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T16 · Regularization] Spectral norm regularization (SN-GAN) constrains the spectral norm (largest singular value) of weight matrices in neural networks. 
Contrast this with L2 regularization and explain why spectral norm is important for training Generative Adversarial Networks.","options":{"A":"Spectral norm regularization is equivalent to L2 weight decay","B":"L2 (weight decay): penalizes $||W||_F^2$ (Frobenius norm) — penalizes sum of squared weights; spectral norm: constrains $\\sigma_1(W) = ||W||_2$ (largest singular value); why different: $||W||_F = \\sqrt{\\sum \\sigma_i^2}$ vs $||W||_2 = \\sigma_1$; GAN discriminator Lipschitz constraint: WGAN and SN-GAN require the discriminator to be 1-Lipschitz; Lipschitz constant is bounded by the product of spectral norms across layers; spectral normalization enforces $\\sigma_1(W) \\leq 1$ per layer → bounding the discriminator's Lipschitz constant → more stable GAN training (helps against exploding gradients and mode collapse)","C":"Spectral norm regularization is only applicable to convolutional layers","D":"L2 regularization always achieves a smaller Lipschitz constant than spectral norm"},"correct":"B","explanation":{"correct":"- Lipschitz continuity: $||f(x) - f(y)|| \\leq L||x-y||$ for all $x, y$. For a linear layer: the Lipschitz constant = $\\sigma_1(W)$. For stacked layers: $L_{\\text{network}} \\leq \\prod_l \\sigma_1(W_l)$.\n- WGAN training: the critic must be K-Lipschitz (typically K=1). Weight clipping (the original WGAN) tends to be unstable and limits the critic's capacity. Spectral normalization (SN-GAN, Miyato et al. 2018) constrains $\\sigma_1(W) = 1$ using power iteration per gradient step — efficient and stable.\n- L2 vs SN: L2 penalizes every singular value through $\\sum_i \\sigma_i^2$. SN constrains only the largest, leaving other singular values free — the network retains expressiveness while bounding its Lipschitz constant.","A":"Frobenius norm (L2) and spectral norm are different matrix norms with different geometric properties. 
Their minimizers are different, and their effects on the network's function class differ.","B":"","C":"Spectral normalization applies to any weight matrix: fully connected, convolutional, attention layers. Miyato et al. applied it to convolutional and fully connected layers.","D":"L2 can produce a smaller Frobenius norm but doesn't bound the spectral norm. A matrix with L2-regularized weights can still have a large spectral norm if one singular direction dominates."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-025","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T10 · PCA] Kernel PCA extends PCA to nonlinear dimensionality reduction using the kernel trick. Explain how kernel PCA avoids explicitly computing the feature map $\\phi(x)$, and what makes choosing the right kernel a difficult problem.","options":{"A":"Kernel PCA computes PCA in the original feature space, then applies a nonlinear transformation","B":"Kernel PCA performs PCA in the RKHS (Reproducing Kernel Hilbert Space) implicitly; the PCA eigenvectors in $\\mathcal{H}$ are expressed as $\\alpha_j = \\sum_i a_{ij} \\phi(x_i)$ (they lie in the span of training points via Representer theorem); the kernel gram matrix $K_{ij} = k(x_i, x_j)$ encodes all required inner products; the projection of a test point: $\\langle \\phi(x), \\alpha_j \\rangle = \\sum_i a_{ij} k(x, x_i)$; only kernel evaluations are needed — $\\phi(x)$ is never computed; kernel choice difficulty: different kernels assume different notions of similarity; RBF bandwidth $\\sigma$ controls locality; wrong $\\sigma$ → either too local (each point isolated) or too global (no nonlinear structure discovered); cross-validation for kernel hyperparameters requires re-computing $K$ each time — $O(n^2)$ kernel evaluations per candidate, plus up to $O(n^3)$ for the eigendecomposition","C":"Kernel PCA uses the same eigenvectors as standard PCA — it just applies them in a higher-dimensional space","D":"Kernel PCA requires computing $\\phi(x)$ explicitly but in parallel for 
efficiency"},"correct":"B","explanation":{"correct":"- RKHS: functions in a Reproducing Kernel Hilbert Space have the property that evaluation is bounded by the kernel: $|f(x)| \\leq ||f||_{\\mathcal{H}} \\sqrt{k(x,x)}$.\n- Representer theorem: the solution to any regularized learning problem in RKHS lies in the span of kernel evaluations at training points → eigenvectors can be represented without explicit $\\phi$.\n- Centering in feature space: $K_{ij}^c = K_{ij} - \\frac{1}{n}\\sum_k K_{ik} - \\frac{1}{n}\\sum_k K_{kj} + \\frac{1}{n^2}\\sum_{k,l} K_{kl}$ (double centering). This centers the feature-space Gram matrix without computing $\\phi$.","A":"The order is reversed. Standard PCA in original space followed by nonlinear transformation is not Kernel PCA. Kernel PCA works entirely in $\\mathcal{H}$ via the kernel trick.","B":"","C":"Kernel PCA produces different eigenvectors from standard PCA. The kernel-space covariance matrix is different from the original-space covariance matrix.","D":"Kernel PCA explicitly AVOIDS computing $\\phi(x)$ — that is the entire motivation for using the kernel trick. Computing $\\phi(x)$ would require infinite dimensions for RBF kernel."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-001","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T01 · ML Fundamentals] A company builds a churn prediction model. Features include \"days since last login\" and \"support tickets this month.\" The target is churn in the next 30 days. Six months after deployment, the model's performance drops. The data science team notices that \"support tickets this month\" now has a much higher average than during training. 
What type of shift has occurred, and how should it be handled?","options":{"A":"Label shift — the proportion of churned customers has changed","B":"Covariate shift (data drift) — the input feature distribution $P(X)$ has changed while $P(Y|X)$ may still hold; the model receives feature values outside its training distribution; monitor input distributions with statistical tests (KS test, PSI), retrain on recent data, and use importance weighting to adjust for the shift","C":"This is expected behavior — support tickets always increase over time","D":"The model has a coding bug that inflates the feature value"},"correct":"B","explanation":{"correct":"- Covariate shift: $P(X_{\\text{train}}) \\neq P(X_{\\text{test/production}})$. The model was trained on lower ticket volumes; it now receives feature values it rarely saw during training.\n- Even if the relationship $P(\\text{churn}|\\text{tickets})$ is unchanged, the model's learned decision boundary may be poorly calibrated for the new range.\n- Monitoring: track feature distribution statistics (mean, std, percentiles) over time. Population Stability Index (PSI) > 0.2 typically triggers retraining.","A":"Label shift is $P(Y)$ changing — the churn rate changing. The scenario describes feature distribution change, not class proportion change.","B":"","C":"A business context explanation (products change, product issues increase tickets) doesn't negate the need to handle the distributional shift. The model still needs updating.","D":"Gradual changes over months are distributional, not a sudden bug. Bugs produce abrupt errors, not gradual drift."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-002","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T02 · Linear Regression] A linear regression model predicts house prices. The residual plot shows a fan shape: small residuals for low-priced houses, large residuals for high-priced houses. 
Which Gauss-Markov assumption is violated, and what is the practical consequence?","options":{"A":"Multicollinearity — features are correlated with each other","B":"Heteroscedasticity — residual variance increases with fitted value (fan shape); OLS is still unbiased but loses the minimum-variance property; confidence intervals and p-values computed from standard OLS are wrong (standard errors are biased); the model makes systematically more uncertain predictions for expensive houses but treats all predictions equally","C":"Non-linearity — the relationship between features and price is nonlinear","D":"The violation is acceptable — fan shapes in residuals are normal for price data"},"correct":"B","explanation":{"correct":"- Homoscedasticity requires $\\text{Var}(\\epsilon_i) = \\sigma^2$ for all $i$. A fan shape shows $\\text{Var}(\\epsilon_i)$ increasing with $\\hat{y}_i$: the variance grows with the fitted value.\n- Consequences: (1) coefficient estimates are still unbiased but not BLUE; (2) standard errors are wrong → invalid hypothesis tests; (3) prediction intervals are too narrow for expensive houses, too wide for cheap ones.\n- Fix: $\\log(\\text{price})$ as target often stabilizes variance in price models; weighted least squares (WLS) with weights inversely proportional to the estimated variance, e.g. $w_i = 1/\\hat{y}_i^2$ when the residual standard deviation scales with $\\hat{y}_i$.","A":"Multicollinearity affects coefficient interpretation and stability but doesn't cause a fan shape in residuals. Multicollinearity is visible in VIF scores and coefficient instability.","B":"","C":"Non-linearity shows a curved pattern in residual vs fitted plots (systematic positive/negative residuals). A fan shape specifically indicates variance increase with fitted value.","D":"Fan shapes indicate a real assumption violation with real consequences. 
Log-transform of price target is standard practice precisely because of this."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-003","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T03 · Logistic Regression] A logistic regression model is trained on a perfectly linearly separable dataset (all class-1 points are clearly above the decision line, all class-0 points below). Training fails to converge. What is the mathematical reason?","options":{"A":"The optimizer has a bug — any optimizer should converge on linearly separable data","B":"For perfectly separable data, the log-loss has no finite minimizer: for any separating $w$, scaling up $||w||$ keeps lowering the loss, so weight magnitudes grow without bound; $\\hat{p}_i \\to 1$ for positives requires $w^Tx_i \\to +\\infty$; the log-loss never reaches 0 but approaches 0 asymptotically; gradient descent keeps taking steps forever; regularization (L2 or L1) prevents this by penalizing large weights","C":"Linear separability guarantees fast convergence — the opposite of what was described","D":"The model is trying to fit a nonlinear boundary on linearly separable data"},"correct":"B","explanation":{"correct":"- Log-loss: $L = -\\sum[y\\log(\\hat{p}) + (1-y)\\log(1-\\hat{p})]$. To minimize: push $\\hat{p} \\to 1$ for $y=1$ and $\\hat{p} \\to 0$ for $y=0$.\n- Perfect separation: $\\hat{p}_i = \\sigma(w^Tx_i) \\to 1$ requires $||w|| \\to \\infty$. The loss approaches 0 but never reaches it. Gradient is always non-zero → no convergence.\n- With L2 regularization: the loss has an additional $\\lambda||w||^2$ term that grows with $||w||$. The combined objective has a finite minimum where the weight magnitude is bounded.","A":"This is a known mathematical property of logistic regression with perfectly separable data, not a bug. Without regularization, any gradient-based optimizer will keep increasing the weights.","B":"","C":"Linear separability causes divergence (unbounded weights), not fast convergence. 
For non-separable data, weights are bounded by the constraint that some points must be misclassified.","D":"The model is fitting a linear boundary. The issue is the unbounded weight problem when the linear boundary perfectly separates both classes."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-004","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T04 · Decision Trees] Two splits are evaluated for a binary classification problem. Split A: Gini parent = 0.5, children weighted Gini = 0.35. Split B: Gini parent = 0.5, children weighted Gini = 0.10. Which split is chosen by the CART algorithm, and why?","options":{"A":"Split A — smaller Gini impurity in children means better purity","B":"Split B — CART maximizes information gain (Gini impurity reduction); Split B reduces Gini by 0.40 (from 0.5 to 0.10) vs Split A's 0.15 reduction; larger reduction = purer children = more informative split","C":"Both splits are equivalent — any split with Gini < 0.5 is acceptable","D":"CART would choose Split A because 0.35 < 0.50 satisfies the purity threshold"},"correct":"B","explanation":{"correct":"- CART criterion: choose the split that maximizes impurity reduction = parent Gini − weighted average children Gini.\n- Split A: $\\Delta G = 0.5 - 0.35 = 0.15$.\n- Split B: $\\Delta G = 0.5 - 0.10 = 0.40$.\n- Split B is chosen: children are much purer (Gini 0.10 vs 0.35), meaning the split separates classes much more effectively.","A":"\"Smaller Gini means better\" is directionally correct but the comparison is wrong. Split B has Gini 0.10 < Split A's 0.35. Split B should be preferred, not A.","B":"","C":"The magnitude of the reduction matters. 
CART chooses the largest Gini reduction, not just any split below the parent's Gini.","D":"CART doesn't use a fixed threshold — it picks the split with the highest impurity reduction among all candidate splits."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-005","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T05 · Random Forest] A Random Forest model achieves 89% accuracy with `max_features='sqrt'`. A data scientist changes it to `max_features=None` (uses all features at each split). Training becomes slower and accuracy drops slightly. Explain both effects.","options":{"A":"Using all features always increases accuracy — the accuracy drop indicates a bug","B":"Speed: with all features at each split, finding the best split requires evaluating all $p$ features instead of $\\sqrt{p}$ — computation per split increases $O(p/\\sqrt{p}) = O(\\sqrt{p})$ times; accuracy: using all features makes trees more correlated (they all tend to split on the same dominant features), increasing inter-tree correlation $\\rho$, which limits variance reduction: $\\text{Var}(\\text{RF}) = \\rho \\sigma^2 + \\frac{1-\\rho}{B}\\sigma^2$; more $\\rho$ → higher ensemble variance","C":"Using more features always improves accuracy — the drop must be due to different random seeds","D":"Speed improves with more features because the algorithm can skip weaker features"},"correct":"B","explanation":{"correct":"- Feature subsampling in RF: each split evaluates $m$ random features. $m = \\sqrt{p}$ is a widely used default for classification (a heuristic, not a proven optimum). It forces diverse splits across trees.\n- With $m = p$: every tree sees all features. All trees will tend to split on the same top-1 feature at the root → correlated trees → $\\rho$ increases → $\\text{Var}(\\text{RF})$ increases.\n- Speed: $\\sqrt{p}$ features to evaluate per split vs $p$ features. 
For $p=100$: $\\sqrt{p}=10$ vs 100 — 10× more work per split.","A":"Using all features at every split removes the feature subsampling that keeps trees diverse, which is a fundamental RF design choice. The accuracy drop is expected and documented.","B":"","C":"The accuracy drop is systematic, not random seed dependent. It is reproducible and directly caused by increased inter-tree correlation.","D":"Evaluating more features per split requires more computation, not less. Each feature requires computing the impurity reduction for all possible split thresholds."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-006","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T06 · Gradient Boosting] CatBoost is a gradient boosting library designed specifically to handle categorical features. Many other gradient boosting setups need categorical features encoded numerically first, often with one-hot encoding. What is the key risk of naively one-hot encoding a high-cardinality categorical feature (e.g., city with 1,000 categories)?","options":{"A":"One-hot encoding is always safe — there is no risk with high-cardinality features","B":"With 1,000 categories, one-hot encoding creates 1,000 binary features; for tree-based models, this creates 1,000 possible binary split candidates per node; combined with gradient boosting's tendency to overfit, the model may learn spurious patterns specific to rare city values; rare cities with 1-2 training examples can dominate leaf predictions; CatBoost uses ordered target statistics to avoid this","C":"One-hot encoding fails mathematically for more than 100 categories","D":"High-cardinality one-hot causes one-hot encoded features to have zero variance"},"correct":"B","explanation":{"correct":"- High-cardinality + gradient boosting: a city that appears 2 times in training can have a deterministic mean target — the tree fits to this noise rather than the true signal. 
The gradient boosting model will pick up city-specific artifacts.\n- CatBoost's ordered target statistics: the categorical encoding of each row is computed only from the targets of rows that precede it in a random permutation, which prevents target leakage from the row's own label.\n- Also relevant: the Zipf distribution of category frequencies means most categories are rare (long tail). Encoding all rare categories individually adds high-variance features.","A":"One-hot encoding of high-cardinality categoricals is a well-known pitfall in gradient boosting. Rare categories create high-variance leaf predictions.","B":"","C":"There's no mathematical limit. The concern is statistical (overfitting), not mathematical failure.","D":"Each one-hot column is a binary feature with low but non-zero variance (variance = p(1-p) where p = fraction of that category). Zero variance only happens for constant features."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-007","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T07 · SVM] An RBF kernel SVM is trained. The gamma parameter is very high (γ = 100). What effect does this have on the decision boundary, and what problem does it cause?","options":{"A":"High gamma creates a very wide Gaussian kernel, making the boundary smooth","B":"High gamma means the Gaussian kernel $K(x,z) = \\exp(-\\gamma||x-z||^2)$ falls off very rapidly — each support vector influences only a tiny region around itself; the decision boundary becomes highly irregular, tracing tightly around each training point; this causes high variance (severe overfitting) — the model memorizes training data but fails on test data","C":"High gamma increases the margin width, reducing overfitting","D":"Gamma only affects computation speed, not the decision boundary shape"},"correct":"B","explanation":{"correct":"- RBF kernel: $K(x,z) = \\exp(-\\gamma||x-z||^2)$. For high $\\gamma$: only very close points have $K$ meaningfully above 0 (strictly, $K > 0$ everywhere, but it decays extremely fast). 
The effective neighborhood shrinks toward a single point.\n- High $\\gamma$ effect: each training point \"owns\" a tiny bubble of influence. The decision boundary wrinkles around each support vector → complex, non-smooth boundary → overfitting.\n- Low $\\gamma$: each point influences a wide region → smooth boundary → high bias, may underfit.\n- Optimal $\\gamma$: selected via cross-validation. sklearn default (`gamma='scale'`): $\\gamma = 1/(\\text{n\\_features} \\times \\text{Var}(X))$.","A":"Wide Gaussian (large effective radius) corresponds to LOW gamma. High gamma = narrow Gaussian = small effective radius.","B":"","C":"High gamma increases overfitting. Margin width is mainly controlled by C (the soft-margin penalty), not gamma. The two parameters play different roles, although they interact and are usually tuned together.","D":"Gamma fundamentally changes the kernel function and thus the decision boundary shape. Speed is a secondary effect."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-008","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T08 · KNN] A KNN regression model (K=5) is used to predict house prices. A test house has 5 nearest neighbors with prices: [$200K, $210K, $205K, $800K, $215K]. The prediction is $326K (mean). 
A data scientist says \"this prediction is wrong — the $800K neighbor is an outlier.\" What is the issue and the fix?","options":{"A":"K=5 is too small — use K=50 for more robust predictions","B":"KNN regression with uniform weights treats all K neighbors equally; the $800K outlier is roughly 4× the other neighbors' values, pulling the mean prediction far from the cluster; fix: use distance-weighted KNN ($w_i = 1/d_i^2$, closer neighbors weighted more), use median instead of mean for robust KNN regression, or investigate whether the $800K neighbor is truly a relevant comparison","C":"KNN should use classification, not regression — regression produces unstable predictions","D":"The prediction of $326K is correct — KNN mean is always the right approach"},"correct":"B","explanation":{"correct":"- Equal-weight mean: $P = (200+210+205+800+215)/5 = 1630/5 = 326K$. The outlier inflates the prediction by about $118.5K ($326K vs $207.5K without the outlier).\n- Distance-weighted KNN: if the $800K house is farther away (less similar), it receives less weight. The prediction reflects the most similar houses more strongly.\n- Median KNN: $\\text{median}(200, 205, 210, 215, 800) = 210K$. Robust to the outlier.\n- Root cause investigation: why is the $800K house a nearest neighbor? Perhaps a feature mismatch — investigating nearest-neighbor relevance may be more valuable than fixing the aggregation method.","A":"Increasing K adds more distant neighbors, potentially bringing in more outliers. It doesn't fix the outlier sensitivity problem.","B":"","C":"KNN regression is a valid and widely used technique. The issue is outlier sensitivity in the averaging step, not the use of regression mode.","D":"Equal-weight mean KNN is sensitive to outliers. 
The $326K prediction is clearly wrong for a house similar to $200-215K neighbors."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-009","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T09 · Naive Bayes] A Naive Bayes classifier is trained on customer review sentiment (positive/negative). The word \"not\" has $P(\\text{\"not\"}|\\text{positive}) = 0.12$ and $P(\\text{\"not\"}|\\text{negative}) = 0.18$. A review contains the phrase \"not bad.\" NB classifies it as negative. Is this surprising, and why?","options":{"A":"The classification is correct — \"not\" is more common in negative reviews so negative is expected","B":"The conditional independence assumption causes NB to treat \"not\" and \"bad\" as independent features; \"not bad\" means \"good\" (negation), but NB multiplies $P(\\text{\"not\"}|\\text{class}) \\times P(\\text{\"bad\"}|\\text{class})$ separately, destroying the negation relationship; the combination means something different from either word alone — NB misses this linguistic compositionality","C":"NB correctly handles negation because the product of two probabilities captures the combined effect","D":"The word \"not\" should be removed as a stop word to fix this issue"},"correct":"B","explanation":{"correct":"- Negation problem: NB's independence assumption treats each word as contributing independently. \"Not\" + \"bad\" → NB multiplies their individual class likelihoods. But \"not bad\" = \"good\" — the negation reverses the sentiment of \"bad.\"\n- NB's score for \"not bad\": $P(\\text{pos}) \\times P(\\text{\"not\"}|\\text{pos}) \\times P(\\text{\"bad\"}|\\text{pos})$ vs the negative class counterpart. 
\"Bad\" is heavily associated with negative reviews → NB leans negative even when the phrase means positive.\n- Fix: bigram features (\"not_bad\" as a single feature) or negation detection preprocessing that marks words following \"not\" with a negation tag.","A":"\"Not\" being slightly more common in negative reviews does not explain the full miscategorization. The key problem is that \"not bad\" is a positive phrase that NB's independence assumption breaks apart.","B":"","C":"Multiplying independent probabilities does NOT capture the combined meaning. $P(\\text{\"not bad\"}) \\neq P(\\text{\"not\"}) \\times P(\\text{\"bad\"})$ when words interact.","D":"Removing \"not\" would make things worse — the model would only see \"bad\" and classify as negative even more strongly. Stop word removal is counterproductive for sentiment analysis."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-010","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T10 · PCA] A company stores customer transaction records as sparse matrices (99% zeros, ~50 non-zero values per row out of 10,000 features). An engineer applies standard PCA. Memory usage explodes (100GB for the covariance matrix). 
What causes this, and what is the solution?","options":{"A":"PCA is working correctly — 100GB is expected for large datasets","B":"Standard PCA computes the full $p \\times p$ covariance matrix: $X^TX$ where $p = 10,000$; the matrix is $10,000 \\times 10,000 = 10^8$ float64 values = 800MB just for the matrix (manageable), but the issue is that standard PCA first densifies the sparse matrix (mean-centering turns the zeros into non-zeros); fix: use TruncatedSVD (sklearn) which works directly on sparse matrices using iterative methods without explicitly computing the dense covariance matrix","C":"PCA cannot be applied to sparse matrices under any circumstances","D":"The solution is to convert the matrix to float32 instead of float64"},"correct":"B","explanation":{"correct":"- Dense covariance computation: $X^TX$ requires materializing the full dense matrix product. If $X$ is stored as sparse but PCA densifies it (mean-centering destroys sparsity): $n \\times p$ dense matrix where, for illustration, $n = 10^6, p = 10^4$ → $10^{10}$ float64 values → 80GB. That's the real memory problem.\n- TruncatedSVD (randomized SVD): computes only the top $k$ singular vectors without computing the full covariance. Works on sparse matrices in scipy.sparse format. sklearn's `TruncatedSVD` is the sparse-compatible equivalent of PCA.\n- Same mathematical result: TruncatedSVD on centered data = PCA. On uncentered data = LSA (Latent Semantic Analysis), often sufficient for NLP.","A":"100GB+ for a PCA covariance computation is not expected or acceptable. Standard PCA on sparse data requires engineering solutions.","B":"","C":"PCA can be applied to sparse data — using TruncatedSVD or online PCA methods. The constraint is on the naive dense implementation.","D":"float32 halves memory but doesn't solve the fundamental problem of materializing a dense matrix from sparse data. 
50% of 80GB is still 40GB."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-011","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T11 · Clustering] A team is deciding between K-means and Gaussian Mixture Models (GMM) for customer segmentation. The team wants to handle customers who genuinely \"belong\" partially to two segments (e.g., a customer who shops like both a student and a professional). Which algorithm is better suited, and why?","options":{"A":"K-means is better because it assigns each customer to exactly one segment, ensuring clean boundaries","B":"GMM is better — it provides soft cluster membership: $P(\\text{segment}_k | \\text{customer}_i)$; a customer could be 60% segment A and 40% segment B; marketing campaigns can be weighted by membership probability; K-means hard assignment would arbitrarily force the customer into one segment, losing the ambiguity information","C":"Neither algorithm handles partial membership — a custom algorithm is needed","D":"GMM and K-means produce identical results when using spherical Gaussian components"},"correct":"B","explanation":{"correct":"- GMM soft assignment: the E-step in EM computes $P(\\text{segment}_k | x_i)$ — a full probability vector over K segments per customer. Downstream actions can be personalized proportionally.\n- K-means hard assignment: forces a binary decision. Customers on the boundary get one label arbitrarily. This loses valuable information about borderline cases.\n- Business value: a customer who is 50/50 between student and professional might respond to different messaging in different contexts. Soft membership captures this.","A":"Hard boundaries are not desirable when customers genuinely exhibit mixed behaviors. K-means hard assignment discards the ambiguity information.","B":"","C":"GMM explicitly provides soft membership through its probabilistic formulation. 
This is one of its primary differentiators from K-means.","D":"When GMM uses spherical equal-variance Gaussians, its MAP estimate (argmax of posterior) equals K-means hard assignments. But GMM still provides soft probabilities. They are equivalent only for the final hard assignment, not the probabilistic output."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-012","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T12 · Anomaly Detection] A manufacturing quality control system uses reconstruction error from an autoencoder to flag defective parts. A new type of defect is introduced (a scratch). The autoencoder was not trained on scratches. What will the autoencoder's reconstruction error be for scratched parts, and why?","options":{"A":"Low reconstruction error — autoencoders generalize perfectly to new types of defects","B":"High reconstruction error — the autoencoder was trained only on normal parts and learned to reconstruct normal surfaces well; a scratch pattern was never seen during training; the decoder cannot reproduce the scratch accurately from the learned latent representation → high reconstruction error → correctly flagged as anomaly","C":"Zero reconstruction error — the autoencoder ignores features it wasn't trained on","D":"The reconstruction error is unpredictable — sometimes high, sometimes low"},"correct":"B","explanation":{"correct":"- Autoencoder anomaly detection principle: train only on normal data → the encoder-decoder learns the manifold of normal appearances; out-of-distribution inputs (defects) are not on this manifold → high reconstruction error.\n- Scratch = new texture pattern not on the learned normal manifold → the decoder produces a smoothed, scratch-free reconstruction → $||x - \\hat{x}||^2$ is high.\n- This is why autoencoder anomaly detection is appealing for manufacturing: it generalizes to unseen defect types as long as \"normal\" was well-represented in training.","A":"Autoencoders 
do not generalize to new patterns perfectly. In fact, the design intent is the opposite — they should NOT reconstruct anomalies well.","B":"","C":"Autoencoders don't \"ignore\" features. The full reconstruction is attempted, and the scratch region will be reconstructed incorrectly (smoothed out), contributing to the high error.","D":"While individual results vary, the general expectation for a well-trained autoencoder on a novel anomaly type is high reconstruction error. \"Unpredictable\" understates the directional expectation."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-013","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T13 · Ensemble Methods] A stacking ensemble achieves 91% accuracy. The best single base model achieves 90%. A colleague says \"1% gain isn't worth the 5× inference cost.\" When is this trade-off justified, and what metric should guide the decision?","options":{"A":"1% accuracy improvement never justifies added cost","B":"The justification depends on business impact: in fraud detection, 1% more frauds caught could prevent millions in losses (high $C_{FN}$); in a low-stakes recommendation system, 1% may not justify the overhead; the decision should use expected business value: $\\Delta \\text{value} = \\Delta \\text{accuracy} \\times n_{\\text{predictions}} \\times \\text{value\\_per\\_correct\\_prediction}$; if this exceeds the cost of 5× inference, the ensemble is justified","C":"Stacking should always be used because accuracy improvements are always valuable","D":"The correct threshold for justifying an ensemble is always >5% accuracy improvement"},"correct":"B","explanation":{"correct":"- Cost-benefit analysis: $n = 10,000$ fraud predictions/day. 1% = 100 more caught frauds. At $\\$1000$ per fraud: $\\$100K/\\text{day}$ in additional value. 5× inference cost = trivial AWS cost increase. 
Justified.\n- For a web recommendation: 1% more relevant recommendations → marginal click-through improvement → revenue depends on traffic and monetization. May or may not be justified.\n- The \"5% threshold\" is arbitrary — it has no theoretical basis. Business impact should drive the decision.","A":"A 1% improvement can be extremely valuable in high-stakes, high-volume applications. \"Never justified\" is too absolute.","B":"","C":"Always using complex models ignores operational costs (latency, infrastructure, maintainability). Simple models that meet requirements are often preferable.","D":"No universal threshold exists. A 0.1% improvement in patient mortality prediction is highly significant; a 10% improvement in emoji suggestion accuracy may not justify complexity."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-014","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T14 · Model Evaluation] Nested cross-validation is described as the \"gold standard\" for unbiased model evaluation combined with hyperparameter tuning. What are the two loops, and what does each loop do?","options":{"A":"Outer loop trains the model; inner loop evaluates it","B":"Outer loop (k-fold): splits data into train+validation vs test — the test fold is never touched during model development; inner loop (k-fold on train+validation): performs hyperparameter selection via cross-validation; for each outer fold, the inner CV selects the best hyperparameters; the outer test fold evaluates final performance — this fold was never used in hyperparameter selection","C":"Outer loop does feature selection; inner loop does model training","D":"Nested CV is equivalent to double the cross-validation folds — it is a computational trick, not a methodological improvement"},"correct":"B","explanation":{"correct":"- Outer loop (5-fold): 5 test folds, each truly held out. 
Gives 5 independent performance estimates → average = unbiased performance estimate.\n- Inner loop (5-fold): for each outer training set, select the best hyperparameters using inner CV. A different set of hyperparameters may be selected for each outer fold.\n- Key insight: the outer test fold was NEVER used in the inner loop's hyperparameter selection → unbiased performance estimate.\n- Limitation: computationally expensive (5×5=25 model fits minimum). sklearn's `cross_val_score(pipeline, ...)` with `GridSearchCV` inside handles this correctly.","A":"The description is reversed. The outer loop is the evaluation loop; the inner loop is the hyperparameter selection loop.","B":"","C":"Feature selection can be included in the pipeline but it is not the purpose of the outer/inner loop structure.","D":"Nested CV is a principled methodology to prevent test set contamination from hyperparameter tuning. It's not just a computational trick — it's the correct evaluation procedure."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-015","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T15 · Bias-Variance] A neural network trained with heavy L2 regularization achieves 78% training accuracy and 77% test accuracy (1% gap). Without regularization, the same network achieves 99% training and 78% test accuracy. 
Which configuration has better bias, better variance, and which is better overall?","options":{"A":"No regularization is better because it achieves higher training accuracy","B":"Heavy regularization: high bias (78% train, 22% training error vs optimal), low variance (1% gap); No regularization: low bias (99% train, 1% training error), high variance (21% gap from train to test); overall test accuracy is similar (77% vs 78%); regularization is not strictly better here — it achieved similar test performance but with high bias instead of high variance; the no-regularization model might be improved with more data rather than more regularization","C":"Heavy regularization is always better — any reduction in overfitting is beneficial","D":"The models are identical in all meaningful ways"},"correct":"B","explanation":{"correct":"- Test accuracy is nearly identical (77% vs 78%). The regularized model achieved similar generalization through bias (high training error, low gap) rather than allowing the model to fit and then generalize through variance.\n- Practical interpretation: if there were more training data available, the unregularized model could potentially reach both lower training error AND lower test error (variance drops with more data). The regularized model is limited by its high bias.\n- Neither is definitively \"better\" without context. The right amount of regularization balances bias and variance at the current data size.","A":"99% training accuracy with 21% train-test gap shows severe overfitting. High training accuracy alone doesn't make a model better.","B":"","C":"\"Any reduction in overfitting is beneficial\" ignores the bias cost. In this example, heavy regularization introduced so much bias that test performance barely improved.","D":"The models have very different training behaviors and the trade-off between bias and variance differs. 
They are not identical."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-016","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T16 · Regularization] A team compares Lasso (L1) and Ridge (L2) regression on a dataset with 15 features where they believe 5 are truly predictive and 10 are noise. Which algorithm should they use for best prediction + interpretability, and why?","options":{"A":"Ridge always outperforms Lasso for prediction — use Ridge","B":"Lasso is preferable when the true model is sparse (few relevant features); it will zero out the 10 noise features and select the 5 predictive ones; this gives a sparse, interpretable model with potentially better generalization; Ridge shrinks all 15 features but keeps all non-zero — the noise features contribute (small but non-zero) noise to predictions; for sparse true models, Lasso's feature selection improves both interpretability and prediction","C":"Both algorithms produce identical results on any dataset","D":"Use ElasticNet because it handles all scenarios equally well"},"correct":"B","explanation":{"correct":"- Sparse true model (5/15 predictors): this is exactly the scenario where Lasso shines. Lasso can achieve exact zero for noise features, recovering the true sparse model under certain conditions (irrepresentable condition).\n- Ridge: keeps all 15 features with small coefficients. The 10 noise features add small but real noise to predictions. Prediction variance is slightly higher than Lasso's 5-feature model.\n- When Ridge is better: when the true model has many small but all-non-zero effects (dense signal). In gene expression, many genes have small effects → Ridge may outperform Lasso.","A":"Ridge's advantage is for dense signal problems. For sparse signal (few true predictors), Lasso typically outperforms Ridge.","B":"","C":"Lasso and Ridge produce different solutions. 
For sparse true models, they differ in test performance and interpretability.","D":"ElasticNet is useful for correlated feature groups, not necessarily better for the described scenario. Using ElasticNet always would be over-engineering without clear benefit."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-017","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T17 · Feature Engineering] A dataset has a feature \"last_purchase_days_ago\" with values ranging from 0 to 1,825 (5 years). The distribution is heavily right-skewed (most customers purchased recently). A decision tree model is used. Does the skewness require transformation before tree-based modeling?","options":{"A":"Yes — all skewed features must be log-transformed before any ML model","B":"No — decision trees split on arbitrary thresholds and are invariant to monotone transformations; $\\log(\\text{last\\_purchase\\_days\\_ago})$ and the raw feature produce the same split structure (the thresholds change but the split boundaries are equivalent); tree-based models (RF, gradient boosting) do not require feature scaling or distribution transformation; transformation helps linear models and distance-based models, not tree models","C":"Yes — skewness causes trees to produce biased splits","D":"The feature should be binned (discretized) before using in decision trees"},"correct":"B","explanation":{"correct":"- Decision tree splits: find the threshold $t$ that maximizes impurity reduction. 
The splitting criterion doesn't care about the feature distribution — it evaluates all possible thresholds.\n- Monotone transformation invariance: if $f$ is a strictly increasing function applied to the feature, the optimal split threshold changes from $t$ to $f(t)$, but the resulting partition of the samples, and therefore the split's ability to separate classes, is identical.\n- Who needs transformation: linear models (skewed features create leverage points), neural networks (large gradients), KNN (distance distortion), K-means (centroid computation affected by outliers).","A":"This is incorrect. Tree-based models are monotone transformation invariant. Log-transforming before a tree model adds zero value.","B":"","C":"Skewness does not bias tree splits. Trees consider all possible thresholds and select the best one regardless of distribution shape.","D":"Pre-binning a feature before a decision tree is redundant — the tree discretizes features implicitly through its splitting process."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-018","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T13 · Ensemble Methods] Why does AdaBoost use exponential loss weighting while gradient boosting can use any differentiable loss function?","options":{"A":"AdaBoost and gradient boosting use the same loss function — exponential is the default for both","B":"AdaBoost was derived specifically from the exponential loss function — the multiplicative weight update $w_i \\leftarrow w_i \\exp(-\\alpha_t y_i h_t(x_i))$ is the natural consequence of minimizing exponential loss via forward stagewise additive modeling; gradient boosting generalizes this framework: any differentiable loss has a gradient, and any tree can be fitted to negative gradients; AdaBoost is a special case of gradient boosting with exponential loss","C":"The loss function choice has no effect on the model — only the base learner matters","D":"AdaBoost cannot be described as gradient boosting — they are entirely unrelated 
algorithms"},"correct":"B","explanation":{"correct":"- AdaBoost derivation: Friedman et al. showed AdaBoost is forward stagewise additive modeling minimizing exponential loss $L(y, f) = e^{-yf(x)}$.\n- Exponential loss is more sensitive to misclassified points (exponentially growing penalty) → makes AdaBoost more sensitive to outliers/noise.\n- Log-loss (as in LogitBoost) or Huber loss (robust boosting) are less sensitive. Gradient boosting generalization: any loss with a computable gradient can be used → XGBoost supports log-loss, Huber, custom losses.","A":"Gradient boosting supports arbitrary differentiable losses (MSE for regression, log-loss for classification, Huber for robust regression). Exponential is only AdaBoost's loss.","B":"","C":"The loss function determines what the ensemble is optimizing — it has fundamental effects on which examples are emphasized, how robust the model is to noise, and the final model form.","D":"This is a well-established theoretical result. AdaBoost is a special case of gradient boosting with exponential loss and a specific weak-learner fitting procedure."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-019","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T08 · KNN] A KNN classifier with K=10 uses Euclidean distance. Features are: age (years, range 20-80), account_balance (dollars, range $0-$500,000). After scaling, the classifier performs much better. 
Why does KNN need feature scaling when, say, Random Forest does not?","options":{"A":"KNN needs scaling to prevent memory overflow with large feature values","B":"KNN's prediction is based entirely on the distance $\\sum_j (x_{ij} - x_{kj})^2$ — without scaling, account_balance (range 500K) contributes $(500,000)^2 = 2.5 \\times 10^{11}$ to the sum while age (range 60) contributes $(60)^2 = 3,600$; account_balance dominates completely; after standardization, both features contribute proportionally; Random Forest splits on individual feature thresholds — it doesn't combine features via distance, so scale doesn't affect which threshold is best","C":"KNN only needs scaling when using cosine distance, not Euclidean distance","D":"Both KNN and Random Forest need scaling — the question's premise is incorrect"},"correct":"B","explanation":{"correct":"- Scale sensitivity: any algorithm using Euclidean (or Minkowski) distance is sensitive to feature scale. KNN, K-means, SVM (RBF), linear/logistic regression with gradient descent, PCA — all benefit from scaling.\n- Tree-based invariance: a split at balance = $50,000 is equally valid whether the feature is in dollars or thousands of dollars. The split threshold changes but the tree structure (and splits' quality) is identical.\n- Standardization: $z = (x - \\mu) / \\sigma$. All features have mean 0, std 1 after scaling → equal contribution to distance calculations.","A":"Memory overflow is not a scaling concern. Large feature values don't cause memory issues.","B":"","C":"Euclidean distance is MORE sensitive to scale than cosine distance (which is inherently scale-invariant by design). Euclidean definitely needs scaling.","D":"Random Forest (and gradient boosting, decision trees) are scale-invariant. 
Scaling features before training a tree model adds zero value."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-020","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T01 · ML Fundamentals] A model is trained on features A, B, C to predict target Y. Feature B is computed as \"average of Y for the same customer in the past 30 days.\" The model achieves 99% training accuracy but fails in production on new customers (who have no history). What is the core issue?","options":{"A":"The model overfits because it uses too many features","B":"Feature B is computed from the target variable (past Y values); for new customers, B=0 or missing by definition; but more critically, during training B already encodes information about the customer's relationship with Y — the model implicitly uses Y to predict Y (target leakage); in production, B for new customers is meaningless or missing, breaking the model","C":"The model should not use historical data — only real-time features","D":"New customer failure is expected and acceptable — the model is designed for existing customers"},"correct":"B","explanation":{"correct":"- Feature leakage: \"average Y in past 30 days\" is a past value of the target. For existing customers, this feature carries strong predictive signal (past behavior predicts future behavior). But this is essentially using $Y$ to predict $Y$.\n- Production failure: new customers have no history → feature is 0 or NaN → the model's primary feature is gone → predictions become unreliable or require a cold-start fallback.\n- Proper feature engineering: B = \"average purchases in the past 30 days (the feature/action)\" vs \"average of the target (the outcome).\" Using the action is legitimate; using the outcome is leakage.","A":"Three features is not \"too many.\" The issue is specifically feature B's construction from the target.","B":"","C":"Historical features are legitimate and often the most predictive. 
The problem is not history in general — it's using the target itself (past Y) as a feature.","D":"If the model is deployed to new customers but fails for them, this is a production failure. The scope of deployment should match the model's designed population."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-021","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T07 · SVM] Two students debate: \"SVMs work best for high-dimensional data\" vs \"Neural networks always outperform SVMs in high dimensions.\" Which claim is supported by ML theory and practice?","options":{"A":"Neural networks always outperform SVMs regardless of dataset size and dimensionality","B":"SVMs have theoretical advantages in high-dimensional settings: the kernel trick maps to higher-dimensional spaces where the margin can be large; with limited data in high dimensions (p >> n), SVMs' maximum-margin objective provides implicit regularization; neural networks outperform SVMs when data is abundant, input is structured (images, text), and sufficient computational resources are available; neither claim is universally true","C":"SVMs always outperform neural networks in text classification because text is high-dimensional","D":"The comparison is irrelevant — dimensionality has no effect on relative model performance"},"correct":"B","explanation":{"correct":"- SVM advantages: (1) kernel methods are competitive or superior on small/medium datasets (n < 10,000); (2) well-understood theoretical generalization bounds (VC theory); (3) work well for text, biology (SVMs dominated NLP before deep learning).\n- Neural network advantages: (1) automatic feature learning scales to massive datasets; (2) CNNs/transformers achieve SOTA on structured inputs; (3) can learn hierarchical features that kernel methods struggle with.\n- Modern practice: deep learning has overtaken SVMs on most benchmarks with sufficient data. 
But SVMs remain competitive for small datasets.","A":"SVMs outperform neural networks on small datasets and some high-dimensional tasks. \"Always\" is definitively false.","B":"","C":"Deep learning (BERT, transformers) has surpassed SVMs on most NLP benchmarks since 2018. SVMs are competitive on small NLP datasets but not universally superior.","D":"Dimensionality matters significantly. In high dimensions with small n, kernel methods are often competitive with neural networks. In high dimensions with large n, neural networks dominate."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-022","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T02 · Linear Regression] A data scientist adds a polynomial feature $x^2$ to a linear regression for a housing price model. The training RMSE drops by 20%, test RMSE drops by 12%. A colleague says \"the $x^2$ term is cheating because it creates a nonlinear model.\" Is the colleague correct?","options":{"A":"Correct — adding polynomial features violates the linearity assumption of linear regression","B":"Incorrect — \"linear\" in linear regression refers to linearity in the parameters (weights), not in the features; $y = w_0 + w_1 x + w_2 x^2$ is a polynomial curve in $x$ but is linear in the parameters $w_0, w_1, w_2$; this is valid linear regression with engineered features; the linearity assumption in Gauss-Markov refers to $E[y|x]$ being linear in the parameters, not in $x$","C":"Correct — polynomial features require using polynomial regression, a completely different algorithm","D":"Incorrect — the $x^2$ feature technically makes the model nonlinear in both features and parameters"},"correct":"B","explanation":{"correct":"- Linear in parameters: $y = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_1^2$. The model is linear in $\\beta$. 
We can write $z = x_1^2$ and the model becomes $y = \\beta_0 + \\beta_1 x_1 + \\beta_2 z$ — standard linear regression on features $[x_1, z]$.\n- The OLS closed form, Gauss-Markov theorem, and all linear regression theory apply unchanged. The model fits a polynomial curve in feature space but a hyperplane in parameter space.\n- Feature engineering: adding $x^2, \\log(x), x_1 \\times x_2$ etc. all create new features for linear regression. This is standard practice.","A":"The linearity assumption refers to linearity in parameters, not features. Adding $x^2$ is a feature transformation, not a violation of the model's assumptions.","B":"","C":"\"Polynomial regression\" IS linear regression with polynomial features — the same algorithm, the same closed form solution. There's no separate \"polynomial regression\" algorithm.","D":"$y = w_0 + w_1 x + w_2 x^2$ is nonlinear in $x$ but linear in parameters $w$. OLS optimizes over $w$, not $x$, so the model is linear from the optimizer's perspective."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-023","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T10 · PCA] A data scientist reduces 50 features to 10 using PCA, then trains a Random Forest on the 10 PCs. The Random Forest interprets feature importances of the PCs. 
Is \"PC1 has the highest importance\" a useful finding?","options":{"A":"Yes — PC1 having high importance means the features with highest variance are most important","B":"Not directly — PC1 represents the direction of maximum variance in the original feature space; high importance means the model uses variance direction 1, but this direction is a linear combination of all 50 original features; to determine which original features matter, you must examine the PC1 loading vector (which original features contribute to PC1) and then validate whether the Random Forest is using PC1 due to its original-feature composition","C":"Yes — Random Forest importance on PCs directly measures the original feature importance","D":"PC importance is meaningless because Random Forests should not be used after PCA"},"correct":"B","explanation":{"correct":"- PC1 = $v_1^T x$ where $v_1$ is the first eigenvector. If $v_1 = [0.8, 0.6, 0.1, ..., 0.05]$, high PC1 importance means the model uses a combination weighted toward features 1 and 2.\n- To interpret in original feature space: $w_{\\text{original}} = V \\cdot w_{\\text{RF importance}}$ (approximate). This propagates PC importance back to original features.\n- Limitation: RF importance on PCs is valid for prediction, but interpretation in original feature space requires the back-transformation step.","A":"Maximum variance ≠ maximum predictive importance (as discussed in the topic file). PC1 might capture a direction orthogonal to the target.","B":"","C":"Random Forest feature importance measures the model's reliance on each PC. To map back to original features requires the PCA loading matrix.","D":"Using Random Forest after PCA is valid. The combination is used to reduce noise/dimensionality before tree-based learning. 
The interpretation requires care, not avoidance."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-024","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T11 · Clustering] A DBSCAN model labels 30% of data points as noise (-1). A junior analyst removes all noise points from the dataset for downstream analysis. What is the risk?","options":{"A":"No risk — noise points are definitionally not useful","B":"DBSCAN noise points may be the most analytically interesting observations — rare events, fraud cases, medical anomalies; removing 30% of data based on DBSCAN's density criterion (which depends on eps/min_samples hyperparameters) may discard real signals; additionally, 30% noise typically indicates eps is too small or min_samples too large — first tune the hyperparameters before removing points","C":"Removing noise points always improves downstream model performance","D":"Noise points should be assigned to the nearest cluster, not removed"},"correct":"B","explanation":{"correct":"- Business context matters: if the dataset is fraud transactions, DBSCAN noise points (isolated transactions not forming dense groups) may be the actual rare fraud events — exactly what you want to keep.\n- Hyperparameter sensitivity: 30% noise is unusually high, suggesting DBSCAN is poorly tuned. The k-distance elbow plot should be used to select eps before interpreting noise labels.\n- Downstream impact: removing 30% of data reduces the downstream model's training set significantly and may introduce systematic bias if noise points share common characteristics.","A":"\"Not useful\" conflates density-based anomaly with lack of value. Density anomalies are often the most interesting observations in anomaly detection, fraud, and scientific discovery.","B":"","C":"This claim is only valid if the noise points are truly uninformative artifacts (e.g., data entry errors). 
Without domain validation, removing data is risky.","D":"Forcing noise points into clusters is exactly what DBSCAN is designed to avoid. Boundary/noise points that don't meet density criteria should not be force-assigned — that would change the algorithm's semantics."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-025","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T15 · Bias-Variance] A team evaluates model performance across 5-fold cross-validation. The per-fold accuracies are: [0.82, 0.83, 0.91, 0.82, 0.81]. They report mean = 0.838. A statistician flags the fold-3 outlier. What might cause one fold to perform much better than the others, and what should the team investigate?","options":{"A":"Fold-3 has more samples than other folds — cross-validation always allocates unequal folds","B":"Fold-3 may have different class distribution (stratification issue), easier examples from one data stratum, or temporal ordering artifacts (if data has temporal structure and fold-3 happened to contain mostly \"easy\" time periods); the team should check fold-3's class distribution, sample characteristics, and whether the fold represents a distinct subpopulation","C":"High fold-3 accuracy is good — the model is doing its best on that fold","D":"Variance across folds is always random — outlier folds should be discarded as noise"},"correct":"B","explanation":{"correct":"- Expected CV fold variance: small differences (1-2%) are normal due to different test samples. 
A 9% jump (91% vs 82%) is unusual and warrants investigation.\n- Common causes: (1) non-stratified split → fold-3 has easier class distribution; (2) temporal leakage → fold-3 represents an earlier period with simpler patterns; (3) the fold captures a specific subpopulation the model handles well (niche feature combination).\n- Action: inspect fold-3's data distribution, run stratified K-fold if not already used, and report variance alongside mean.","A":"sklearn's KFold allocates ± 1 sample per fold (essentially equal). Sample count differences are not the cause.","B":"","C":"One fold being much easier may indicate a data quality or sampling issue. High performance on one fold doesn't validate the model globally.","D":"The outlier fold should be investigated, not discarded. Discarding would require justification. The variance itself is signal about data or evaluation methodology."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-026","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T16 · Regularization] A data scientist applies both L1 and L2 regularization to the same logistic regression model. What is this combined approach called, and how does the mixing ratio $\\rho$ in sklearn's `l1_ratio` parameter affect the behavior?","options":{"A":"This is called Elastic Net; l1_ratio=1 is pure L1 (Lasso behavior: sparsity); l1_ratio=0 is pure L2 (Ridge behavior: grouped shrinkage); values between 0 and 1 blend both, producing sparse models where correlated features are grouped rather than arbitrarily selecting one","B":"This is called Ridge Plus; it always produces exactly the same result as pure L2","C":"Combining L1 and L2 cancels both penalties out, producing an unregularized model","D":"L1 and L2 cannot be combined in the same model — their gradients are incompatible"},"correct":"A","explanation":{"correct":"- ElasticNet penalty: $\\lambda[\\rho ||w||_1 + \\frac{1-\\rho}{2}||w||_2^2]$. $\\rho = 1$: pure Lasso (sparse). 
$\\rho = 0$: pure Ridge (dense shrinkage).\n- Grouping effect: at $\\rho \\in (0, 1)$, correlated features tend to have similar (non-zero or zero) coefficients rather than Lasso's arbitrary selection of one.\n- Use cases: genomics (gene groups), NLP (word groups), any domain where feature groups exist and partial sparsity is desired.","A":"","B":"ElasticNet is not \"Ridge Plus.\" It has distinct behavior at intermediate l1_ratio values that neither pure Lasso nor pure Ridge achieves.","C":"Adding both penalties creates a combined penalty that is stronger, not zero. The gradient of the combined penalty is non-zero.","D":"L1 and L2 penalties are both subdifferentiable functions of $w$. They are combined additively. There is no gradient incompatibility."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-027","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T17 · Feature Selection] A wrapper method uses logistic regression as the base model for feature selection. The selected features are used to train an XGBoost model. 
A reviewer says \"the selected features are suboptimal because they were selected for logistic regression, not XGBoost.\" Is this concern valid?","options":{"A":"No — wrapper methods are model-agnostic and always select the best features for any model","B":"Valid concern — wrapper methods find features that work well for the specific base model used in selection; logistic regression finds linearly predictive features; XGBoost can leverage nonlinear interactions and feature combinations; a feature useless for logistic regression may be important for XGBoost (e.g., a feature that is only predictive in interaction with another); use XGBoost itself (or a simpler tree proxy) as the base model in the wrapper, or use model-agnostic filter methods","C":"Wrapper methods always select the same features regardless of the base model","D":"XGBoost should not be used with pre-selected features — it performs its own feature selection internally"},"correct":"B","explanation":{"correct":"- Model-specific feature selection: logistic regression evaluates features based on their linear contribution to log-odds. XGBoost evaluates features based on their contribution to split quality in nonlinear trees.\n- Example: feature X is useless alone for LR (no linear signal), but X × Z is highly predictive. Wrapper with LR drops X. XGBoost would have discovered X useful via interaction splits.\n- Best practice: use the final model (or a fast proxy) as the base model for wrapper-based feature selection to ensure feature relevance aligns with the final model's inductive bias.","A":"Wrapper methods are model-specific by design — they evaluate feature subsets using the base model's loss. Different base models can produce different feature selections.","B":"","C":"Different base models (LR vs XGBoost) evaluate features differently and can produce different selected subsets for the same dataset.","D":"XGBoost does perform internal feature selection (features with zero importance). 
But external pre-selection can still improve performance by removing noise features that confuse the XGBoost training."}}],"allMcqs":[{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01001","difficulty":"easy","orderIndex":1,"question":"A company labels 100,000 emails as \"spam\" or \"not spam\" and trains a binary classifier. Six months later, they build a second model using only email metadata (no labels) to group emails into clusters. Which learning paradigm does each model use, and what structural property of the data drives this distinction?","options":{"A":"Both use supervised learning because the data was collected by the same team","B":"The first uses supervised learning (labeled targets drive optimization), the second uses unsupervised learning (no target variable — structure is inferred from the data distribution)","C":"The first uses unsupervised learning because clustering happens implicitly during classification; the second uses supervised learning because cluster labels become targets","D":"The second model uses reinforcement learning because it must explore and evaluate clusters through trial and error"},"correct":"B","explanation":{"correct":"- The defining structural property is the presence of a target variable `y`: supervised learning minimizes loss between predictions and labels; unsupervised learning finds structure (clusters, embeddings, densities) without any `y`.\n- The same raw dataset can produce both paradigms depending on whether labels are used — this is the conceptual foundation of semi-supervised learning, a common interview follow-up.\n- In production, the paradigm determines the evaluation strategy: supervised models use labeled held-out data for accuracy/F1; unsupervised models use intrinsic metrics (silhouette, inertia) or downstream task performance.","A":"Who collected the data has no bearing on the learning paradigm. 
The paradigm is determined by whether target labels are present in the optimization objective.","B":"","C":"Classification does not involve implicit clustering. The reversal here exploits confusion about what \"finding groups\" means — clustering and classification are distinct operations.","D":"Reinforcement learning requires an agent, an environment, actions, and a reward signal — none of which are present in clustering. Algorithmic exploration in k-means is not RL-style policy learning."},"reference":"- Goodfellow et al., Deep Learning, Chapter 5: https://www.deeplearningbook.org/contents/ml.html"},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01002","difficulty":"easy","orderIndex":2,"question":"You split a dataset into 70% train, 15% validation, and 15% test. Your model scores 95% on the validation set but only 67% on the test set. A teammate says this proves the model overfit to training data. What is the most precise diagnosis?","options":{"A":"The model overfit to training data — the 28-point gap is definitive proof of overfitting","B":"The model overfit to the validation set through repeated hyperparameter tuning, not to training data — this is exactly why a held-out test set is required","C":"The test set is too small at 15% to be statistically reliable, so the gap is sampling noise","D":"The validation and test sets were not stratified, causing class imbalance to distort the test score"},"correct":"B","explanation":{"correct":"- When you repeatedly tune hyperparameters by evaluating on the validation set and picking the best-performing configuration, information from the validation set leaks into your model selection process — this is called \"validation set overfitting\" or \"meta-overfitting.\"\n- Overfitting to training data would show low training loss and comparatively higher validation loss — but validation accuracy is 95%, so the model generalizes to the validation distribution. 
The gap is specifically between validation and test.\n- This is why the test set must never influence any decision — architecture, hyperparameters, feature engineering — and should be evaluated exactly once at the very end.","A":"Overfitting to training data would manifest as a gap between training and validation metrics, not between validation and test. High validation accuracy rules out classic overfitting to training data.","B":"","C":"15% of most ML datasets is thousands of samples — more than enough for statistical reliability. Attributing a 28-point gap to sampling noise requires extreme justification.","D":"While stratification matters, a 28-point gap from a stratification issue would require catastrophic class imbalance that the question does not state. Stratification problems cause consistent bias, not a cliff-edge drop."},"reference":"- Hastie et al., The Elements of Statistical Learning, Chapter 7: https://hastie.su.domains/ElemStatLearn/"},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01003","difficulty":"easy","orderIndex":3,"question":"A data scientist standardizes all features using `StandardScaler` fitted on the entire dataset before splitting into train and test sets. The model achieves 94% test accuracy. A reviewer flags this as a critical error. 
Why?","codeSnippet":"scaler = StandardScaler()\nX_scaled = scaler.fit_transform(X) # entire dataset\nX_train, X_test = train_test_split(X_scaled, test_size=0.2)","options":{"A":"`StandardScaler` should be applied after model training, not before","B":"The scaler was fitted on test data too, so test-set statistics (mean and std) influenced the training feature distribution — this is data leakage","C":"`fit_transform` is slower than fitting separately; the correct approach is `scaler.fit(X).transform(X)`","D":"Standardization should only be applied to the target variable, not to input features"},"correct":"B","explanation":{"correct":"- `fit_transform` on the full dataset computes mean and std using all rows, including test rows. Training features are now scaled using statistics that \"know about\" the test distribution.\n- In production, test data arrives after deployment — you would never have access to it during training. Fitting the scaler only on training data (`scaler.fit(X_train)`) correctly simulates this.\n- The performance impact is often small in practice, but in time-series or distribution-shift scenarios it can be significant. The conceptual violation is always critical in an interview context.","A":"Preprocessing is applied before model training — this is correct pipeline order. The error is not about timing relative to the model; it is about which data was used to compute the scaler parameters.","B":"","C":"`scaler.fit(X).transform(X)` is functionally identical to `fit_transform(X)` and has the exact same leakage problem. This option exploits confusion about method equivalence.","D":"Standardization is most commonly applied to input features. 
Applying it only to targets would be unusual and incorrect for most standard ML models."},"reference":"- scikit-learn Pipeline documentation: https://scikit-learn.org/stable/modules/compose.html"},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01004","difficulty":"easy","orderIndex":4,"question":"A fraud detection model is evaluated on a dataset with 1 million transactions, where 0.1% are fraud. The team does a random 80/20 train/test split and reports 99.9% test accuracy. Why should they be skeptical of this result before celebrating?","options":{"A":"80/20 is too large a training set — 50/50 splits are required for imbalanced data","B":"99.9% accuracy on this dataset is achievable by a model that predicts \"not fraud\" for every transaction — the metric is uninformative under extreme class imbalance","C":"The model is certainly overfit because 99.9% accuracy is unrealistically high for any real-world dataset","D":"A random split is invalid for fraud data because fraud events cluster in time and must use a temporal split"},"correct":"B","explanation":{"correct":"- With 0.1% fraud, a zero-rule classifier (always predict \"not fraud\") achieves exactly 99.9% accuracy trivially. High accuracy on imbalanced data is the canonical misleading metric trap.\n- The meaningful metrics for fraud detection are precision, recall, F1 on the minority class, AUC-ROC, and the Precision-Recall curve — none of which are reported here.\n- In production, a fraud model that never triggers allows real fraud to pass undetected. Reporting accuracy alone on an imbalanced problem is a red flag in any ML design review.","A":"The train/test ratio is not the issue. 80/20 is standard. No fixed ratio is \"required\" for imbalanced data — the problem is the choice of metric, not the split proportion.","B":"","C":"99.9% accuracy is not inherently a sign of overfitting — it is trivially achievable as demonstrated. 
Overfitting is diagnosed by comparing training accuracy to test accuracy, not by the absolute value alone.","D":"Temporal clustering is a valid concern for time-series fraud but is a separate, secondary issue from the metric problem the question is testing."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01005","difficulty":"easy","orderIndex":5,"question":"An ML pipeline processes data in this order: (1) raw data ingestion → (2) feature engineering → (3) model training → (4) evaluation → (5) deployment. A team adds a new step: \"re-engineer features based on test set error analysis.\" Precisely between which steps does this new step violate pipeline integrity, and why?","options":{"A":"Between steps 1 and 2 — feature engineering must happen before any data is seen by the model","B":"Between steps 4 and 2 — using test set errors to redesign features feeds test-set information backward into the feature space, creating look-ahead bias","C":"Between steps 3 and 4 — model training must complete before any analysis is performed","D":"Between steps 2 and 3 — features must be completely frozen before training begins"},"correct":"B","explanation":{"correct":"- The ML pipeline is a one-directional flow. Feeding test set error signals back into step 2 (feature engineering) means test data implicitly shapes the feature representation — a form of look-ahead bias or indirect data leakage.\n- This is analogous to a researcher peeking at exam answers before designing the exam questions. The resulting performance metrics are no longer trustworthy estimates of generalization.\n- The correct approach: analyze errors on a validation set, re-engineer features, then evaluate final performance on a completely untouched test set.","A":"Feature engineering after data ingestion is the correct order — this is not a violation. 
The violation is about direction of information flow, not absolute pipeline position.","B":"","C":"Model training completing before evaluation is correct pipeline order — this describes a valid step, not a violation.","D":"Freezing features before training is correct practice. But the question asks where the \"re-engineer from test errors\" step creates the problem — that is specifically the backward feedback from test results to feature design."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01006","difficulty":"medium","orderIndex":6,"question":"A team trains a churn prediction model. During feature engineering they include \"number of support tickets submitted in the 7 days after the billing date.\" The model scores 0.91 AUC in offline evaluation but drops to 0.54 AUC in production. What is the most likely cause?","options":{"A":"The model overfit due to too many features — regularization would close the production gap","B":"\"Tickets submitted in the 7 days after billing date\" cannot be known at prediction time — the model trained on future information relative to the prediction timestamp, which is temporal data leakage","C":"AUC is not an appropriate metric for churn prediction; the team should use accuracy instead","D":"The production dataset has a different class balance than the training data, causing the metric to drop"},"correct":"B","explanation":{"correct":"- Temporal data leakage occurs when a feature uses information from the future relative to the prediction timestamp. \"Tickets submitted in 7 days after billing date\" is knowable only 7 days after the billing date — but churn prediction runs at or before the billing date.\n- The model learned to rely on a signal that is causally downstream of the prediction event. 
In offline evaluation, future data was present in the dataset; in production, it isn't available.\n- This is one of the hardest leakage types to catch because the feature is plausible (\"support tickets predict churn\"). Always audit feature timestamps against the prediction timestamp.","A":"A 37-point AUC drop between offline and production is not a regularization problem. Overfitting would cause a smaller gap and would appear as training AUC >> validation AUC, not as an offline-to-production cliff.","B":"","C":"AUC is widely used and appropriate for churn prediction, especially when score ranking matters for intervention campaigns. The metric is not the cause of the drop.","D":"Class balance shifts can affect metric values, but AUC is relatively robust to class imbalance since it measures ranking across all thresholds. A 37-point collapse requires a systematic cause like leakage."},"reference":"- Kaufman et al., \"Leakage in Data Mining\": https://dl.acm.org/doi/10.1145/2020408.2020496"},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01007","difficulty":"medium","orderIndex":7,"question":"You are designing a train/test split for a dataset of 10,000 user sessions where each user contributes an average of 50 sessions. A colleague applies session-level random splitting. 
Why is this split strategy incorrect for evaluating generalization to new users?","codeSnippet":"from sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)","options":{"A":"The test set is too small; it should be at least 30% of the data for reliable evaluation","B":"Session-level random splits allow the same user's sessions to appear in both train and test, letting the model memorize user-specific patterns and overestimate generalization to new users","C":"`train_test_split` does not support session data — a time-series split must always be used for session-level data","D":"`random_state` is not set, so results are non-reproducible — this is the primary flaw"},"correct":"B","explanation":{"correct":"- With 10,000 sessions from ~200 users (50 sessions each), a session-level random split puts roughly 80% of each user's sessions in train and 20% in test. The model can learn user-identity patterns and apply them to the same user's test sessions.\n- This inflates test performance because you are measuring interpolation within known users, not generalization to unseen users. In production, the model will encounter new users with zero historical sessions.\n- The correct approach is a **user-level split**: all sessions from a given user go entirely into train or entirely into test.","A":"20% test size (2,000 sessions) is statistically adequate. The issue is user-identity leakage, not dataset size.","B":"","C":"`train_test_split` can be applied to any tabular data including sessions. The problem is the granularity of the split entity, not the function used. Time-series splits address temporal ordering, which is a different concern.","D":"Not setting `random_state` affects reproducibility, not correctness. 
A non-reproducible split is a best-practice concern, not a validity issue."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01008","difficulty":"medium","orderIndex":8,"question":"A reinforcement learning agent is trained to play chess. A developer describes it as: \"the agent sees the board, a neural net predicts the best move, and the model is trained on historical grandmaster games with move quality labels.\" A senior ML engineer says this description is wrong about the learning paradigm. Who is correct and why?","options":{"A":"The developer is correct — predicting move quality from labeled data is supervised learning regardless of the game domain","B":"The senior engineer is correct — any game-playing agent is by definition reinforcement learning","C":"The senior engineer is correct — the description is of supervised learning, not RL; RL requires an agent learning from delayed outcome rewards, not from labeled (board, move-quality) pairs","D":"Both are correct — RL and supervised learning are equivalent when the reward is immediate"},"correct":"C","explanation":{"correct":"- Reinforcement learning requires an agent that takes actions, receives delayed rewards from an environment, and learns a policy through trial and interaction. It does not require labeled training examples.\n- Learning from a dataset of (board state → labeled move quality) pairs is supervised learning, regardless of the domain being chess. AlphaGo's first phase used supervised learning on human games before switching to RL through self-play.\n- The domain (chess, games) does not determine the paradigm — the training signal does. This is a common misconception that trips up developers in interviews.","A":"The developer's description is technically accurate about the training signal. 
The question asks whether the senior engineer's correction is valid — it is, because calling a supervised setup \"RL\" is a paradigm misclassification.","B":"Game-playing does not imply RL. Deep Blue used minimax search with hand-crafted evaluation — no learning at all. AlphaGo's SL policy network used supervised learning on human moves before RL self-play.","C":"","D":"Immediate rewards do not make RL equivalent to supervised learning. In RL, a scalar feedback from the environment follows an action; in supervised learning, a label is paired with each input. They are distinct training regimes with different update rules."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01009","difficulty":"medium","orderIndex":9,"question":"A data scientist applies SMOTE oversampling before the train/test split to handle class imbalance. Validation F1 on the minority class is 0.87. A reviewer marks this result as inflated. What is the exact mechanism causing the inflation?","codeSnippet":"X_resampled, y_resampled = SMOTE().fit_resample(X, y) # before split\nX_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled)","options":{"A":"SMOTE creates too many synthetic samples, which always inflates F1 regardless of when it is applied","B":"SMOTE generates synthetic points by interpolating between real minority-class samples; if applied before splitting, synthetic test samples are geometric near-neighbors of training samples, giving the model an artificial advantage on the test set","C":"SMOTE should not be used on imbalanced datasets; stratified sampling is the only valid approach","D":"The test set created from a SMOTE-augmented dataset still contains real samples, so no inflation is possible"},"correct":"B","explanation":{"correct":"- SMOTE generates synthetic points by interpolating between a minority-class sample and one of its k-nearest neighbors. 
If SMOTE runs on the full dataset before splitting, some synthetic test samples will be geometrically close to training samples — the model has effectively \"seen\" the test-space neighborhood.\n- This violates the independence assumption between train and test. Evaluation results are optimistically biased.\n- The correct practice: SMOTE is applied **only to the training set** after splitting. The test set must consist of real, unaugmented samples reflecting the production distribution.","A":"SMOTE applied correctly (after splitting, on training data only) does not inflate test metrics. The inflation is caused by when SMOTE is applied, not the technique itself.","B":"","C":"SMOTE is a valid and widely-used oversampling technique. Stratified sampling addresses split proportions, not class imbalance during training.","D":"Even though test samples may be \"real,\" the synthetic training samples are near-neighbors of those real test samples due to the interpolation process. The independence assumption is violated regardless of sample authenticity."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01010","difficulty":"medium","orderIndex":10,"question":"A house price prediction model includes a feature: \"this property's last recorded sale price.\" The model achieves very high R² on the test set. During production, this feature is unavailable so it is removed — performance collapses. 
What does this reveal about the training pipeline?","options":{"A":"The model needed regularization to prevent over-reliance on a single feature","B":"The feature was a target proxy — it encoded near-direct information about the target variable (house price), making it a form of target leakage","C":"The test data was collected from a different time period than training data, causing distribution shift","D":"R² is not a reliable metric for regression models with highly correlated features"},"correct":"B","explanation":{"correct":"- Target leakage occurs when a feature encodes the target variable directly or as a near-proxy. \"Last recorded sale price\" is essentially a direct measurement of what the model is predicting — house prices — so the model learned to use historical price as its answer.\n- In a real deployment, this feature doesn't exist before the sale completes — it cannot be used for prediction. The pipeline must always verify feature availability at prediction time, not just at training time.\n- This is subtler than pure data leakage: the feature is plausible (real estate agents reference recent sales), but temporal availability at inference time was never verified.","A":"Regularization prevents overfitting to statistical noise, not over-reliance on a causally plausible but temporally unavailable feature. The problem is feature availability, not model complexity.","B":"","C":"Distribution shift causes gradual degradation. The question states the collapse occurred directly upon removing the feature — this is causal, not temporal.","D":"R² is a valid regression metric. 
High R² driven by a leaking feature is a pipeline design failure, not a metric deficiency."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01011","difficulty":"medium","orderIndex":11,"question":"A team evaluates five model architectures by testing each one on the held-out test set and selects the architecture with 95% test accuracy for deployment. A month later, production accuracy is 82%. What mistake was made, and what process would have prevented it?","options":{"A":"The model overfit to training data — more regularization during training would prevent the gap","B":"The test set was used as a selection criterion, converting it into an implicit validation set — the 95% is no longer an unbiased estimate; a truly held-out final test set never used in selection would have given an honest estimate","C":"The team should have used cross-validation instead of a fixed split for architecture comparison","D":"A 13-point production gap is expected noise — test-to-production gaps of this size are normal"},"correct":"B","explanation":{"correct":"- A test set provides an unbiased performance estimate only if it is evaluated exactly once on the final selected model. Using it to choose among 5 architectures turns it into an implicit validation set — you are doing 5-way model selection on it.\n- With 5 candidates, there is a meaningful probability that one will \"luck into\" high test accuracy due to random alignment with the test distribution, not genuine generalization.\n- The correct setup: use a validation set for all selection decisions; reserve the test set for the single final evaluation after all model choices are made.","A":"Overfitting to training data would show low training loss and high validation loss. 
The test accuracy is 95% — the gap is between test and production, indicating test set integrity was compromised, not that training overfit.","B":"","C":"Cross-validation is good practice but does not resolve the issue if the test set is still used for final selection. The problem is the test set's role in decision-making, not the validation strategy.","D":"A 13-point production gap is not expected noise. Sampling noise on a test set of a few thousand examples is typically within about 2 percentage points. A gap this large indicates a systematic flaw in the evaluation setup."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01012","difficulty":"hard","orderIndex":12,"question":"A medical imaging model is trained on data from 5 hospitals using a random 80/20 sample-level split. The model performs well in cross-hospital evaluation at deployment to a 6th hospital, but poorly on a 7th hospital from a different country. Which type of leakage or bias is responsible and what split strategy would better estimate cross-institution generalization?","options":{"A":"The model overfit due to too few samples — collecting more data from the 5 original hospitals would fix generalization","B":"Sample-level splitting distributed all 5 hospitals into both train and test, so the model learned institution-specific confounders (equipment, demographics, labeling conventions); a hospital-level split would test true cross-institution generalization","C":"The test set should have been balanced across disease categories using stratified sampling to ensure all classes are represented","D":"Random splits are always appropriate for medical data — the poor performance on the 7th hospital is uncontrollable distribution shift"},"correct":"B","explanation":{"correct":"- Sample-level splitting puts samples from all 5 hospitals in both train and test. The model learns hospital-specific signals: scanner calibration artifacts, patient population characteristics, radiologist labeling conventions. 
These are institution confounders, not generalizable medical knowledge.\n- Evaluating within the same 5 hospitals (even with a random split) measures in-distribution performance. Generalizing to unseen institutions requires an institution-level held-out split.\n- The fix is a **site-level split**: hold out all samples from one or more hospitals entirely for testing. This is standard practice in federated learning and clinical ML validation (e.g., multi-site trials).","A":"More data from the same 5 hospitals deepens the institution-specific confounders rather than helping cross-institution generalization. It may make things worse.","B":"","C":"Stratified sampling ensures class representation in train/test but does not address institution confounders. A disease-stratified random split still contaminates test with all 5 hospitals' signals.","D":"Distribution shift from a new institution is the symptom, not the root cause. The root cause is a split strategy that never tested out-of-institution generalization. This is absolutely addressable by changing split granularity."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01013","difficulty":"hard","orderIndex":13,"question":"You train a binary classifier and achieve 0.92 AUC-ROC on the test set. You threshold predictions at 0.5 and report accuracy. 
A stakeholder says \"92% AUC means the model is 92% accurate.\" In what specific scenario would a model with 0.92 AUC coexist with accuracy near the trivial baseline?","options":{"A":"Only when the dataset has more than 1 million samples — AUC and accuracy decouple at scale","B":"When the positive class is rare (e.g., 1%), a model with 0.92 AUC may still place most predicted probabilities below 0.5 — predicting \"negative\" for all samples would yield 99% accuracy, making the threshold-based accuracy trivially high and misleading","C":"AUC is always higher than accuracy on imbalanced datasets because it corrects for class imbalance mathematically","D":"AUC above 0.9 guarantees that accuracy at any threshold will be above 90%"},"correct":"B","explanation":{"correct":"- AUC-ROC measures the probability that the model ranks a random positive above a random negative across all possible thresholds. It evaluates ranking quality, not prediction at a specific threshold.\n- On a 1% positive class, a model with excellent ranking (0.92 AUC) may still produce raw probabilities mostly below 0.5. A threshold at 0.5 then predicts negative for nearly everything, yielding 99% accuracy — the same as predicting the majority class always.\n- Conversely, a miscalibrated model can have high AUC but very low accuracy at the default threshold. This is why threshold tuning and proper calibration are separate steps from ranking evaluation.","A":"Dataset size does not determine the AUC-accuracy relationship. The decoupling occurs due to class imbalance and probability miscalibration, both of which can happen at any scale.","B":"","C":"AUC does not \"correct\" for class imbalance. Precision-Recall AUC is generally preferred for imbalanced datasets precisely because ROC AUC can appear optimistically high due to the large true-negative pool inflating the metric.","D":"AUC above 0.9 guarantees only ranking quality. 
A perfectly ranked model (AUC = 1.0) with all probabilities in [0.01, 0.02] range will predict \"negative\" at threshold 0.5 for every sample — 99% accuracy on a 1% positive class."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01014","difficulty":"hard","orderIndex":14,"question":"A team trains a loan default prediction model. After deployment, they discover their preprocessing pipeline (imputation, encoding, scaling) was fitted on the full dataset including future months of data. The business insists performance is great — \"the deployed model is identical to what we tested.\" Why does the leakage matter even if production performance looks strong?","options":{"A":"It doesn't — if production performance is strong, the leakage is irrelevant","B":"The leakage corrupts the measurement system: the team cannot distinguish genuine generalization from leakage-inflated metrics, future retraining on new data may degrade silently without explanation, and model comparison decisions made during development were potentially wrong","C":"The model must be retrained immediately because leaked preprocessing invalidates all model weights","D":"Data leakage only matters in healthcare; financial models are not affected because regulations require fair data use"},"correct":"B","explanation":{"correct":"- Leakage corrupts the evaluation system, not necessarily the deployed weights. The model may genuinely perform well — but the team cannot establish how much performance is due to generalization vs. leakage-assisted metric inflation.\n- The real danger appears at retraining time: when the team retrains periodically on new data (without future leakage), they may see a performance drop and not know why. They will chase a phantom problem, potentially deploying an inferior model.\n- Leakage also poisons model comparison. 
If Leaky Model A scores 0.93 AUC and Clean Model B scores 0.89 AUC, the team deploys A when B may actually be the better generalizer.","A":"\"It works in production\" is survivorship bias. Short-term production metrics can look good due to temporal correlation, luck, or leakage. Without an unbiased evaluation, you cannot confirm what is driving performance.","B":"","C":"Leakage in preprocessing does not technically corrupt model weights — it means the weights were optimized using inflated feature representations. Retraining is advisable, but calling weights \"invalidated\" overstates the technical mechanism.","D":"Data leakage is a universal ML problem independent of domain. Financial regulations address fairness and explainability — they do not specifically prevent preprocessing leakage, and this claim is categorically false."}},{"section":"machine-learning","topicSlug":"ml-fundamentals","topic":"ML Fundamentals","id":"ml-01015","difficulty":"hard","orderIndex":15,"question":"A team uses 5-fold cross-validation. For each fold, they fit a `StandardScaler` on the training fold and transform the validation fold separately. 
A new team member suggests fitting the scaler once on all data — \"StandardScaler parameters barely change between folds.\" What is the senior engineer's precise objection?","codeSnippet":"# Current correct implementation\nfor train_idx, val_idx in kfold.split(X):\n scaler = StandardScaler()\n X_train_scaled = scaler.fit_transform(X[train_idx])\n X_val_scaled = scaler.transform(X[val_idx])","options":{"A":"The new team member is correct — fitting the scaler once is computationally more efficient and produces numerically identical results","B":"Fitting the scaler on all data leaks validation fold statistics (mean and std) into the training fold's preprocessing, violating the independence of each fold as a simulated held-out set","C":"`StandardScaler` parameters change negligibly between folds so the bias is practically zero — the senior engineer is over-engineering","D":"The correct fix is to use `MinMaxScaler` instead, which doesn't require fitting on the training set"},"correct":"B","explanation":{"correct":"- Cross-validation simulates the train-on-some/evaluate-on-held-out process. If the scaler is fitted on all data, the validation fold's mean and std values are embedded in the scaler parameters — the validation fold is no longer truly unseen.\n- This is particularly harmful for small datasets, features with outliers, or non-stationary distributions where fold-level statistics differ meaningfully.\n- `sklearn.pipeline.Pipeline` automates this correctly: any transformer inside a Pipeline is fitted only on training data within each fold automatically, which is exactly what the correct loop does manually.","A":"Fitting once does not produce identical results — the global mean and std include contribution from each fold's held-out samples. Results may be numerically close for large datasets but the principle is violated and the bias is real.","B":"","C":"\"Practically zero\" is context-dependent and misleading. 
For small datasets, skewed features, or few folds, the bias can be significant. More critically, the methodology is wrong even when the numerical difference is small — pipelines fail unexpectedly in edge cases.","D":"`MinMaxScaler` also requires fitting (to compute feature-wise min and max). It has the exact same leakage problem if fitted on all data. Switching scalers does not resolve the underlying issue."},"reference":"- scikit-learn Pipeline documentation: https://scikit-learn.org/stable/modules/compose.html#pipeline\n- Cross-validation guide: https://scikit-learn.org/stable/modules/cross_validation.html"},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02001","difficulty":"easy","orderIndex":1,"question":"You train a linear regression model using OLS. The closed-form solution gives coefficients that perfectly minimize the training loss. A colleague says \"since the loss is minimized, the model is optimal.\" What critical nuance does this claim miss?","options":{"A":"OLS does not minimize squared error — it minimizes absolute error","B":"OLS minimizes training loss exactly, but \"optimal\" requires generalization to unseen data, which OLS cannot guarantee — a model with perfect training loss can still overfit if the number of features approaches the number of samples","C":"OLS can only minimize loss when features are uncorrelated with each other","D":"The closed-form solution minimizes loss only when all feature values are positive"},"correct":"B","explanation":{"correct":"- OLS minimizes the sum of squared residuals on training data exactly via the normal equations. This is the mathematical definition of what OLS does.\n- \"Optimal\" in ML means generalizing to unseen data. 
When the number of predictors is close to the number of observations, OLS fits noise perfectly (R² = 1) but generalizes poorly — this is overfitting in the classical sense.\n- In production, a model with zero training loss and terrible test loss is worse than a regularized model with slightly higher training loss. OLS optimality is strictly in-sample.","A":"OLS minimizes sum of squared residuals (L2 loss), not absolute error. Least absolute deviations (LAD) regression minimizes absolute error — they are different estimators with different robustness properties.","B":"","C":"OLS computes valid coefficient estimates regardless of feature correlation. High multicollinearity makes coefficients unstable and hard to interpret, but OLS will still converge to a solution (unless features are perfectly collinear).","D":"OLS makes no assumption about the sign of feature values. The normal equations work on any real-valued feature matrix."},"reference":"- Hastie et al., The Elements of Statistical Learning, Chapter 3: https://hastie.su.domains/ElemStatLearn/"},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02002","difficulty":"easy","orderIndex":2,"question":"A linear regression model is trained to predict employee salary from years of experience. The residual plot shows a clear curved (parabolic) pattern rather than random scatter. 
Which OLS assumption is violated, and what is the consequence?","options":{"A":"Homoscedasticity — the variance of residuals changes across predicted values, inflating standard errors","B":"Linearity — the true relationship between the predictor and outcome is nonlinear, so OLS fits a line through a curve, producing systematically biased predictions across the range of the predictor","C":"Independence — the residuals are correlated with each other because salary data is collected sequentially","D":"Normality of errors — the curved residuals indicate non-normal error distribution, which invalidates p-values"},"correct":"B","explanation":{"correct":"- The linearity assumption requires that the true relationship between predictors and the outcome is linear. A parabolic residual pattern means the model is missing a nonlinear component — the error is not random noise but systematic bias.\n- The consequence is that predictions are wrong in a directional, predictable way: the model underpredicts at low and high values and overpredicts in the middle (or vice versa), depending on the curve direction.\n- The fix is feature transformation (e.g., adding `experience²` as a predictor) or switching to a nonlinear model. A curved residual plot is one of the clearest diagnostic signals in regression.","A":"Homoscedasticity violations show a funnel shape in residuals (variance increasing or decreasing with predicted value), not a systematic curve. A curved pattern is not a variance issue.","B":"","C":"Independence violations produce autocorrelation in residuals — typically diagnosed with a Durbin-Watson test on time-ordered data, not a parabolic pattern in a residual vs. fitted plot.","D":"Violations of normality produce a skewed or heavy-tailed residual distribution, visible in a Q-Q plot — not a systematic curve in the residual vs. 
fitted plot."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02003","difficulty":"easy","orderIndex":3,"question":"A linear regression model on house prices achieves R² = 0.85 on the training set. A manager concludes \"the model explains 85% of the variance in house prices.\" Is this interpretation correct, and what common misuse does it enable?","options":{"A":"The interpretation is incorrect — R² of 0.85 means the model is 85% accurate in absolute price terms","B":"The interpretation is correct for training data, but reporting training R² as model quality enables overfitting — the same model may have R² = 0.30 on the test set, which the training R² completely hides","C":"R² = 0.85 means 85% of predictions are within one standard deviation of the true price","D":"The interpretation is correct and training R² is always a reliable estimate of generalization quality"},"correct":"B","explanation":{"correct":"- R² measures the proportion of variance in the target explained by the model: $R^2 = 1 - \\frac{SS_{res}}{SS_{tot}}$. An R² of 0.85 does mean the model explains 85% of training variance — the interpretation itself is technically correct.\n- The misuse is treating training R² as a generalization metric. A model with many features can achieve R² close to 1.0 on training data by overfitting, while explaining almost nothing on unseen data.\n- Always report test set R² or cross-validated R². Training R² is a diagnostic for fit, not for generalization.","A":"R² is a variance-explained measure, not an absolute accuracy measure. 85% accuracy in absolute terms would require a different metric like MAPE or MAE relative to price.","B":"","C":"R² has no direct relationship to predictions being within one standard deviation. That would be a confidence interval statement, not an R² statement.","D":"Training R² is not a reliable estimate of generalization. 
Adding irrelevant features always increases (or maintains) training R² even when they add noise — this is why adjusted R² and test-set R² exist."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02004","difficulty":"easy","orderIndex":4,"question":"A linear regression model is trained on daily stock returns. The Durbin-Watson statistic comes back at 0.8. A data scientist says this is a minor issue and proceeds with standard OLS inference. Why is this dangerous?","options":{"A":"A Durbin-Watson value below 2 is normal and indicates a well-fitted model","B":"A Durbin-Watson value near 0 indicates strong positive autocorrelation in residuals — this violates the independence assumption, making OLS standard errors underestimated and all hypothesis tests (p-values, confidence intervals) invalid","C":"The Durbin-Watson test only applies to classification models; for regression it is irrelevant","D":"Autocorrelation in residuals only matters when the dataset has fewer than 1,000 rows"},"correct":"B","explanation":{"correct":"- Durbin-Watson ranges from 0 to 4: 2 indicates no autocorrelation, values near 0 indicate positive autocorrelation, values near 4 indicate negative autocorrelation. A value of 0.8 signals strong positive autocorrelation in residuals.\n- Positive autocorrelation makes the effective sample size smaller than the nominal sample size — OLS treats correlated observations as independent, artificially deflating standard errors. Confidence intervals are too narrow and p-values are too small.\n- For time-series data, the correct approaches are GLS (generalized least squares), ARIMA, or including lagged terms. Proceeding with OLS produces spuriously significant results.","A":"Values below 2 are not automatically \"normal.\" The reference value is exactly 2 for no autocorrelation. 
Deviations in either direction are violations.","B":"","C":"The Durbin-Watson test was specifically designed for regression residuals, particularly in time-series contexts. It is not applicable to classification and is extremely relevant to regression.","D":"Autocorrelation violates OLS assumptions regardless of sample size. Larger samples make the reported p-values smaller but not more valid: the usual OLS standard-error formula remains wrong under autocorrelation however much data is collected."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02005","difficulty":"easy","orderIndex":5,"question":"A residual plot for a linear regression shows that residuals fan out as the predicted value increases — small predictions have small residuals, large predictions have large residuals. Which assumption is violated and what does this mean for OLS coefficient estimates?","options":{"A":"Linearity is violated — the fanning pattern means a polynomial term is needed","B":"Homoscedasticity is violated — variance of residuals is not constant across predicted values; OLS coefficient estimates remain unbiased but are no longer the minimum-variance estimators (BLUE), and standard errors are wrong","C":"Independence is violated — fanning indicates autocorrelation among observations","D":"Normality is violated — the fanning indicates a heavy-tailed error distribution requiring robust regression"},"correct":"B","explanation":{"correct":"- Homoscedasticity requires that the variance of the error term $\\varepsilon$ is constant: $\\text{Var}(\\varepsilon_i) = \\sigma^2$ for all $i$. A fanning pattern (heteroscedasticity) means $\\text{Var}(\\varepsilon_i)$ increases with predicted value.\n- Under heteroscedasticity, OLS coefficients are still unbiased (unbiasedness needs only zero-mean errors conditional on the predictors, not constant variance). 
However, they are no longer BLUE (Best Linear Unbiased Estimators) — GLS or WLS achieves lower variance.\n- More practically: OLS standard errors are wrong, making all t-tests and confidence intervals unreliable. This is why heteroscedasticity-robust standard errors (White's sandwich estimator) exist.","A":"Linearity violations show a curved (systematic) pattern in residuals. A fanning pattern is a variance pattern (grows with fitted values), not a curvature pattern.","B":"","C":"Independence violations (autocorrelation) are diagnosed on time-ordered residual plots, not residual-vs-fitted plots, and show wavelike patterns rather than fanning.","D":"Heavy-tailed distributions produce outliers in residuals uniformly, not a fan that grows with predicted value. The fanning is specifically about variance scaling."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02006","difficulty":"easy","orderIndex":6,"question":"You add 20 new random noise features (completely uncorrelated with the target) to a linear regression model. What happens to the training R², the test R², and why do they diverge?","options":{"A":"Both training R² and test R² increase because more features always improve fit","B":"Training R² increases or stays the same (OLS fits noise), test R² decreases or stays the same — the divergence is the gap between in-sample fit inflation and generalization degradation","C":"Training R² stays the same because OLS ignores features with zero correlation with the target","D":"Both decrease because adding irrelevant features introduces multicollinearity"},"correct":"B","explanation":{"correct":"- OLS will assign small, nonzero coefficients to noise features because they capture random in-sample correlation with the target. This always increases (or maintains) training R².\n- Noise features add noise to predictions on unseen data — the model learned to use patterns that don't generalize. 
Test R² decreases as variance from noise coefficients accumulates.\n- This is why adjusted R² penalizes for the number of predictors: $\\bar{R}^2 = 1 - (1-R^2)\\frac{n-1}{n-k-1}$ where $k$ is the number of predictors. It can decrease when useless features are added.","A":"More features mechanically increase training R² but not test R². The divergence is the very definition of overfitting in regression.","B":"","C":"OLS does not ignore zero-correlation features. It assigns whatever coefficients minimize training residuals — for noise features, those coefficients are small but nonzero and still inflate training R².","D":"Noise features don't introduce multicollinearity among existing features. They may increase the condition number of the feature matrix, but the primary effect is overfitting via noise coefficient absorption."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02007","difficulty":"medium","orderIndex":7,"question":"A dataset has 500,000 rows and 8 features. A teammate argues that gradient descent should be used instead of the OLS closed-form solution for fitting linear regression. 
Under what condition would this argument be correct, and when is it incorrect?","options":{"A":"Gradient descent is always preferred because it is more numerically stable than the normal equations","B":"Gradient descent is preferred when the feature count is too large to form and invert $X^TX$ efficiently (e.g., millions of features) — for 8 features and 500,000 rows, the normal equations solve in milliseconds and gradient descent adds unnecessary hyperparameter complexity","C":"Gradient descent is preferred when features are correlated, because the normal equations produce incorrect results under multicollinearity","D":"The closed-form OLS solution requires normally distributed features; gradient descent has no such requirement"},"correct":"B","explanation":{"correct":"- The OLS closed-form requires computing $(X^TX)^{-1}$, a $(p \\times p)$ matrix inversion where $p$ is the number of features. For $p = 8$, the inversion itself is trivially fast regardless of the number of rows.\n- The computational cost of the normal equations scales as $O(p^3)$ for the inversion and $O(np^2)$ for $X^TX$. When $p$ is large (hundreds of thousands of features), the inversion becomes infeasible and gradient descent is preferred.\n- With 500,000 rows and 8 features, gradient descent introduces learning rate tuning, convergence checking, and mini-batch sizing for no benefit over the exact closed-form solution.","A":"The normal equations are numerically stable for well-conditioned feature matrices. Near-singular matrices (high multicollinearity) can cause numerical issues, but this is addressed via regularization or feature pruning — not by defaulting to gradient descent.","B":"","C":"Multicollinearity makes $(X^TX)$ near-singular, which causes numerical instability in OLS — but this is also a problem for gradient descent convergence. Neither method magically \"handles\" multicollinearity correctly.","D":"OLS makes no distributional assumptions about features. 
The normality assumption applies to errors (residuals), not features, and even then is only needed for valid hypothesis testing — not for the coefficient estimates themselves."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02008","difficulty":"medium","orderIndex":8,"question":"A linear regression model predicts employee salary using age, years of experience, and their interaction (age × experience) as features. The VIF (Variance Inflation Factor) for \"age\" is 47. The model has R² = 0.88. What does this tell you, and what is the specific risk?","options":{"A":"VIF of 47 confirms the model is overfit — reducing features would lower the VIF and improve generalization","B":"VIF of 47 indicates severe multicollinearity — the coefficient for \"age\" is highly unstable; small changes to training data will cause large swings in the age coefficient, making it uninterpretable and sensitive to sampling variation","C":"VIF above 10 invalidates R², so the reported 0.88 is meaningless","D":"High VIF means the model cannot make predictions for new data points"},"correct":"B","explanation":{"correct":"- VIF measures how much the variance of a coefficient is inflated due to correlation with other predictors: $\\text{VIF}_j = \\frac{1}{1 - R_j^2}$ where $R_j^2$ is the R² from regressing feature $j$ on all other features. VIF = 47 means the variance of the age coefficient is 47× what it would be if age were uncorrelated with other features.\n- This does not prevent predictions — the combined prediction $\\hat{y} = \\beta_1 x_1 + \\beta_2 x_2 + \\beta_3 x_3$ can still be accurate. But individual coefficients are unreliable for interpretation or inference.\n- The interaction term (age × experience) is the main cause: an uncentered product of two positive variables is strongly correlated with each of them, and age and experience are themselves correlated. Centering both variables before forming the product usually lowers the VIF substantially.","A":"Multicollinearity and overfitting are separate concepts. High VIF does not indicate overfitting. 
Overfitting is about train/test gap; multicollinearity is about coefficient stability.","B":"","C":"High VIF does not invalidate R². R² measures variance explained in the outcome, which is not affected by predictor collinearity. The predictions can be fine even when individual coefficients are unstable.","D":"Multicollinearity does not prevent prediction. The model can still produce predictions for new data — the issue is that predictions are stable even when coefficients swing wildly, because the multicollinear predictors compensate for each other."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02009","difficulty":"medium","orderIndex":9,"question":"A team fits a linear regression model with 50 predictors on 60 observations. The model achieves R² = 0.98 on training data. They report this as evidence of a strong model. What is the correct diagnosis?","options":{"A":"R² = 0.98 with 50 predictors and 60 observations is strong evidence the model has found real signal in the data","B":"With 50 predictors and 60 observations, OLS has near-perfect freedom to fit training noise — R² close to 1 is mathematically expected regardless of real signal, and the model almost certainly has negative R² on holdout data","C":"The model is valid because R² = 0.98 exceeds the standard 0.90 threshold for publication quality","D":"The model needs regularization only if test R² drops below 0.80; otherwise R² = 0.98 is reliable"},"correct":"B","explanation":{"correct":"- OLS with $p$ predictors and $n$ observations can achieve R² = 1 exactly when $p = n$ (perfect interpolation). With $p/n = 50/60 = 0.83$, the model has enormous freedom to fit noise — R² near 1 is expected even if all features are random.\n- The model has approximately 10 degrees of freedom for error ($n - p - 1 = 9$). This is insufficient to estimate generalization. 
The true test performance would likely show negative or near-zero R².\n- This is the near-$p = n$ regime: $p$ is still below $n$, but close enough that OLS is badly overparameterized. Ridge regression or feature selection is needed before drawing any conclusions.","A":"With $p/n$ ratio near 1, high training R² provides zero evidence of real signal. The model is fitting the sampling noise in those 60 observations. Cross-validation would expose this.","B":"","C":"There is no universal R² threshold for \"publication quality.\" This is domain-dependent, and training R² is never the relevant metric for model quality assessment.","D":"Test R² should be measured and reported regardless of the training value. The threshold for \"needing regularization\" is not 0.80 test R² — any regime where $p/n$ is high requires regularization by default."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02010","difficulty":"medium","orderIndex":10,"question":"You fit a linear regression model on a dataset where the true relationship is $y = 2x_1 + 3x_2 + \\varepsilon$. After training, you find $\\hat{\\beta}_1 = 8.4$ and $\\hat{\\beta}_2 = -3.2$, far from the true values. The model's predictions are accurate. What explains this phenomenon?","options":{"A":"OLS has a bug when the true coefficients differ by a factor of more than 2","B":"The features $x_1$ and $x_2$ are highly correlated — multicollinearity makes individual coefficient estimates unstable, but because the multicollinear predictors compensate for each other, predictions remain accurate","C":"The model converged to a local minimum in the loss landscape, missing the global solution","D":"The dataset has too few observations relative to the number of features"},"correct":"B","explanation":{"correct":"- When $x_1$ and $x_2$ are highly correlated ($x_1 \\approx x_2$), OLS cannot distinguish their individual contributions. 
Many coefficient combinations produce nearly the same predictions: $8.4 x_1 + (-3.2) x_2 \\approx 2 x_1 + 3 x_2$ when $x_1 \\approx x_2$.\n- The prediction $\\hat{y}$ is stable (low variance) even though individual coefficients swing wildly. The problem is not with predictions — it is with interpretation and stability.\n- This is why multicollinearity is an interpretability and stability problem, not (necessarily) a prediction quality problem. If you care about which feature drives the outcome, multicollinear models are uninformative.","A":"OLS has no bugs related to coefficient magnitudes. The normal equations always find the exact global minimum of the squared error loss on training data.","B":"","C":"OLS linear regression has no local minima. The loss surface is a convex quadratic bowl with exactly one global minimum, reached exactly by the closed-form solution.","D":"The question implies the dataset is adequate for fitting (predictions are accurate). Insufficient data would cause both prediction instability and coefficient instability."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02011","difficulty":"medium","orderIndex":11,"question":"Four datasets have identical summary statistics: same mean, variance, and linear regression line (same slope, intercept, and R² = 0.67). A junior analyst says \"they have the same relationship between X and Y.\" A statistician disagrees. 
What point is the statistician making?","options":{"A":"R² = 0.67 is too low to confirm any relationship between X and Y in all four datasets","B":"Identical summary statistics and R² can mask completely different underlying data distributions — the datasets may contain linear, curved, clustered, or outlier-dominated patterns that R² and the regression line cannot distinguish","C":"The statistician is wrong — identical R² and regression coefficients confirm identical relationships between X and Y","D":"Four datasets cannot have identical statistics unless they are copies of the same data"},"correct":"B","explanation":{"correct":"- This is Anscombe's Quartet: four datasets constructed to have nearly identical descriptive statistics (mean, variance, correlation, regression line) but visually completely different scatter plots. One is linear, one is curved, one has an outlier driving the line, one has a perfect linear relationship disrupted by a single outlier.\n- R², slope, and intercept are aggregate statistics that destroy distributional information. Two datasets can have the same R² while one is perfectly linear and the other is quadratic with the same fitted line.\n- This is why residual plots are mandatory: they reveal patterns (curvature, outliers, heteroscedasticity) that summary statistics hide.","A":"R² = 0.67 can represent a meaningful relationship — the adequacy threshold is domain-dependent. The issue is not whether 0.67 is enough, but that identical R² does not mean identical patterns.","B":"","C":"This is the exact misconception the question targets. Identical summary statistics do not confirm identical relationships — this is the entire lesson of Anscombe's Quartet.","D":"Anscombe's Quartet was deliberately constructed to prove this is possible. 
Datasets with identical summary statistics but different patterns are not only possible but well-known in statistics education."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02012","difficulty":"hard","orderIndex":12,"question":"A linear regression model predicts housing prices. The normal equations solution is: $\\hat{\\beta} = (X^TX)^{-1}X^Ty$. A machine learning engineer says: \"for this 500,000-row, 200-feature dataset, the normal equations are infeasible and we should use mini-batch gradient descent.\" Evaluate this claim precisely.","options":{"A":"The claim is correct — normal equations are always infeasible for datasets with more than 10,000 rows","B":"The claim is partially correct — the bottleneck is feature count, not row count; $(X^TX)$ is a $200 \\times 200$ matrix that inverts in microseconds, but forming $X^TX$ costs $O(n p^2)$ which for 500,000 rows and 200 features is ~20 billion operations — feasible but expensive; gradient descent would save computation time at this scale","C":"The claim is incorrect — the normal equations are always faster than gradient descent regardless of dataset size","D":"Mini-batch gradient descent requires normally distributed features, so it is not always a valid alternative to the normal equations"},"correct":"B","explanation":{"correct":"- The normal equations require computing $X^TX$, which costs $O(n p^2)$, and then inverting a $(p \\times p)$ matrix, which costs $O(p^3)$. For $n = 500,000$ and $p = 200$: $X^TX$ computation is $500,000 \\times 200^2 = 2 \\times 10^{10}$ multiply-adds — heavy but not infeasible on modern hardware.\n- Mini-batch gradient descent processes batches of rows at a time, never materializing the full $X^TX$. This reduces memory requirements and enables early stopping, but introduces hyperparameter tuning overhead.\n- The claim that normal equations are \"infeasible\" is an overstatement — they are feasible for 200 features. 
They become genuinely infeasible when $p$ reaches hundreds of thousands (e.g., text feature matrices).","A":"Row count primarily affects the $O(np^2)$ formation cost, not the $O(p^3)$ inversion. Millions of rows with few features is still manageable. The real threshold for infeasibility is feature count, not row count.","B":"","C":"For very large $p$ or online learning requirements, gradient descent absolutely outperforms the normal equations. There is no universal \"always faster\" claim for either method.","D":"Gradient descent makes no distributional assumptions about features. It works on any real-valued feature matrix regardless of distribution."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02013","difficulty":"hard","orderIndex":13,"question":"A team adds an engineered feature that is a linear combination of two existing features: $x_3 = 2x_1 + x_2$. They then run OLS on the full feature set $[x_1, x_2, x_3]$. What happens to the OLS solution?","codeSnippet":"from sklearn.linear_model import LinearRegression\n# X is a DataFrame of features, y the target; x3 is an exact linear combination\nX['x3'] = 2 * X['x1'] + X['x2']\nmodel = LinearRegression().fit(X[['x1', 'x2', 'x3']], y)","options":{"A":"OLS produces inflated R² because the new feature adds redundant information","B":"The feature matrix $X$ is rank-deficient — $(X^TX)$ is singular and cannot be inverted; OLS has no unique solution, and numerical implementations will return arbitrary coefficients depending on the solver","C":"OLS will assign coefficient 0 to $x_3$ since it is a linear combination of the other features","D":"Gradient descent will converge normally because it does not require matrix inversion"},"correct":"B","explanation":{"correct":"- When $x_3 = 2x_1 + x_2$, the feature matrix $X$ has linearly dependent columns. 
$X^TX$ becomes singular (determinant = 0) and is not invertible — the normal equations have no unique solution.\n- Infinitely many coefficient combinations produce the same predictions: e.g., $(\\beta_1, \\beta_2, \\beta_3) = (2, 3, 0)$ and $(\\beta_1, \\beta_2, \\beta_3) = (4, 4, -1)$ yield identical $\\hat{y}$ when $x_3 = 2x_1 + x_2$.\n- In practice, `sklearn.linear_model.LinearRegression` solves the problem with an SVD-based least-squares routine, which returns one particular solution (typically the minimum-norm one) out of infinitely many, so the individual coefficients carry no interpretive meaning. Different implementations may return different coefficient values.","A":"R² inflation is a symptom of overfitting with many features, not of perfect collinearity. With perfect collinearity, the model doesn't fit \"better\" — it simply cannot identify unique coefficients.","B":"","C":"OLS does not automatically assign zero to redundant features. Exactly-zero coefficients are a property of L1-regularized regression (Lasso), not of OLS. OLS with a singular matrix returns a pseudoinverse solution, not a zero coefficient.","D":"Avoiding matrix inversion does not restore identifiability. On a rank-deficient feature matrix the loss is flat along the null space of $X$, so there are infinitely many minimizers. Gradient descent still drives the loss to its minimum, but the coefficients it ends on depend on the initialization, so they are just as non-unique as in the closed-form case."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02014","difficulty":"hard","orderIndex":14,"question":"A linear regression model on financial returns achieves high R² and low p-values on all coefficients. However, the Breusch-Pagan test for heteroscedasticity returns p = 0.0001, and a Durbin-Watson test returns 1.1. 
A quant says \"both violations together are worse than either alone.\" Explain precisely why.","options":{"A":"The violations cancel each other out — positive autocorrelation and heteroscedasticity have opposite effects on standard errors","B":"Autocorrelation reduces effective sample size, making standard errors underestimated; heteroscedasticity makes OLS standard errors incorrect; both biases act in the same direction — standard errors are doubly underestimated, making p-values appear significant when they are not","C":"Heteroscedasticity only matters with fewer than 1,000 observations; the quant is overreacting","D":"The two tests measure the same underlying violation — only one needs to be corrected"},"correct":"B","explanation":{"correct":"- Positive autocorrelation (DW = 1.1) means consecutive residuals are correlated, reducing the effective sample size below the nominal $n$. OLS standard errors assume $n$ independent observations — they are underestimated by a factor related to the autocorrelation magnitude.\n- Heteroscedasticity (BP p = 0.0001) means OLS standard errors use a wrong error variance estimate — the standard error formula $\\hat{\\sigma}^2 (X^TX)^{-1}$ assumes constant variance.\n- Both effects push standard errors downward, making t-statistics larger and p-values smaller than they should be. The combination means you may believe a coefficient is statistically significant when the true p-value, corrected for both violations, would not pass any threshold.","A":"The violations do not cancel. Positive autocorrelation and heteroscedasticity both bias standard errors downward in typical financial return applications. They compound the problem, not offset it.","B":"","C":"Heteroscedasticity matters at any sample size. With large samples, p-values become smaller (tests more powerful), making heteroscedasticity-driven false positives more, not less, likely.","D":"Autocorrelation and heteroscedasticity are distinct violations. 
The Durbin-Watson test specifically detects first-order autocorrelation in residuals; the Breusch-Pagan test detects non-constant error variance. They require separate corrections (GLS/HAC for autocorrelation, WLS or robust SEs for heteroscedasticity)."}},{"section":"machine-learning","topicSlug":"linear-regression","topic":"Linear Regression","id":"ml-02015","difficulty":"hard","orderIndex":15,"question":"You evaluate a fitted linear regression on test data and compute R² = −0.12. A teammate says \"that's impossible — R² is a proportion and must be between 0 and 1.\" Who is correct and what does a negative R² mean?","options":{"A":"The teammate is correct — R² is always between 0 and 1 by mathematical definition","B":"You are correct — R² can be negative when evaluated on data the model was not trained on; a negative R² means the model performs worse than predicting the mean of the target for every observation, which is a meaningful and alarming signal","C":"Negative R² indicates a bug in the implementation — it should be recalculated using the absolute value","D":"Negative R² is only possible when the target variable has negative values"},"correct":"B","explanation":{"correct":"- $R^2 = 1 - \\frac{SS_{res}}{SS_{tot}}$. On training data, OLS with an intercept guarantees $SS_{res} \\leq SS_{tot}$, so $R^2 \\geq 0$. On test data, this guarantee does not hold — a poorly generalizing model can have $SS_{res} > SS_{tot}$, yielding $R^2 < 0$.\n- A negative test R² means the model is worse than the trivial baseline of always predicting $\\bar{y}$ (the mean of the test targets, which is what $SS_{tot}$ is measured against). This is a severe signal of overfitting, distribution shift, or fundamental model failure.\n- This is a critical interview point: R² between 0 and 1 is only guaranteed on training data for OLS with an intercept. On test data or for non-OLS models, any value in $(-\\infty, 1]$ is possible.","A":"The 0-to-1 guarantee holds only for OLS on training data. 
The mathematical formula $1 - SS_{res}/SS_{tot}$ can produce any value up to 1, including negative ones, when the model was not trained to minimize this specific loss on this specific data.","B":"","C":"Negative R² is a valid, meaningful result — not a bug. Taking the absolute value would destroy the diagnostic information that the model is performing below baseline.","D":"The sign of the target variable has no bearing on R². R² is computed from the ratio of the residual sum of squares to the total sum of squares, which is always non-negative regardless of target sign. The negative R² comes from the ratio exceeding 1."},"reference":"- Draper and Smith, Applied Regression Analysis: https://onlinelibrary.wiley.com/doi/book/10.1002/9781118625590"},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03001","difficulty":"easy","orderIndex":1,"question":"A logistic regression model outputs 0.73 for a new data point. A developer interprets this as \"the model is 73% confident.\" A statistician flags this interpretation as imprecise. 
What is the more rigorous interpretation, and when does \"confidence\" mislead?","options":{"A":"0.73 means the model predicts class 1 with 73% accuracy on the test set","B":"0.73 is the estimated probability that this observation belongs to class 1, under the model's assumptions — but \"confidence\" conflates probability with calibration; if the model is poorly calibrated, 0.73 may not correspond to 73% empirical frequency of class 1","C":"0.73 means 73 out of 100 features voted for class 1","D":"0.73 is the sigmoid-transformed log-loss for this specific prediction"},"correct":"B","explanation":{"correct":"- Logistic regression outputs $P(y=1 | x) = \\sigma(w^Tx + b)$ — a conditional probability estimate under the model's assumptions (linear log-odds, correct feature set, IID data).\n- \"Confidence\" implies the output is reliable, but probability outputs are only meaningful if the model is calibrated: among all predictions of 0.73, approximately 73% of the actual outcomes should be the positive class. Poorly calibrated models can output 0.73 while the true empirical frequency is 0.40.\n- In production: uncalibrated scores can cause harm in high-stakes decisions (e.g., credit, medical). Calibration is assessed with reliability diagrams and can be corrected with Platt scaling or isotonic regression.","A":"The output 0.73 is a probability for one specific sample, not a summary accuracy statistic for the test set. Accuracy is computed across many predictions, not from a single score.","B":"","C":"Logistic regression has no \"voting\" mechanism. It is a single linear model, not an ensemble. Voting is a concept from ensemble methods.","D":"The output is the sigmoid of the linear combination — a probability estimate. 
Log-loss is a metric computed after comparing predictions to true labels, not a raw model output."},"reference":"- Platt, \"Probabilistic Outputs for Support Vector Machines\": https://citeseerx.ist.psu.edu/doc/10.1.1.41.1639"},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03002","difficulty":"easy","orderIndex":2,"question":"You plot the sigmoid function $\\sigma(z) = \\frac{1}{1 + e^{-z}}$ for a logistic regression model. A colleague asks: \"why does logistic regression use sigmoid instead of a simple step function (0 if $z < 0$, 1 if $z \\geq 0$)?\" What is the correct explanation?","options":{"A":"The sigmoid function is faster to compute than the step function on modern hardware","B":"The step function has zero gradient almost everywhere and is discontinuous at zero, making gradient-based optimization impossible — the sigmoid provides smooth, differentiable gradients that allow learning via backpropagation","C":"The step function cannot output values between 0 and 1, so it cannot be used for probability regression","D":"The sigmoid is preferred because it always outputs exactly 0 or 1, matching binary targets"},"correct":"B","explanation":{"correct":"- The step function is non-differentiable at zero and has zero derivative everywhere else. Gradient descent requires $\\frac{\\partial L}{\\partial w}$, which flows through $\\frac{\\partial \\hat{y}}{\\partial z}$ — zero almost everywhere means no gradient signal and no learning.\n- The sigmoid $\\sigma(z)$ has derivative $\\sigma(z)(1-\\sigma(z))$, which is smooth, nonzero in $(-\\infty, +\\infty)$, and peaks at $z=0$. This allows gradient descent to adjust weights continuously.\n- This is the same reason neural networks use differentiable activations (ReLU, tanh) instead of step functions — differentiability is the prerequisite for gradient-based training.","A":"Computational speed is not the reason. Both functions are trivially fast. 
The reason is mathematical: gradient availability.","B":"","C":"It is true that the step function outputs only the endpoints 0 and 1 and never an intermediate value. However, the primary reason sigmoid is used is not the range; it is the differentiability needed for optimization.","D":"The sigmoid never outputs exactly 0 or 1; it is asymptotic to both extremes. As $z \\to +\\infty$, $\\sigma(z) \\to 1$, but never equals 1. This option is precisely backwards."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03003","difficulty":"easy","orderIndex":3,"question":"A logistic regression model is trained for email spam detection. The decision boundary is set at threshold 0.5 by default. In production, the cost of a false negative (spam reaching inbox) is 10× the cost of a false positive (legitimate email flagged as spam). What should the team change?","options":{"A":"Retrain the model with a different loss function","B":"Lower the classification threshold below 0.5 (e.g., 0.2) so the model flags more emails as spam, accepting more false positives to reduce false negatives — no retraining is needed","C":"Add more training data to reduce false negatives","D":"Switch to a different model architecture — logistic regression cannot handle asymmetric costs"},"correct":"B","explanation":{"correct":"- The classification threshold is a post-hoc decision boundary applied to the probability output. Lowering the threshold means any email with P(spam) > 0.2 is flagged — this catches more true positives (spam) at the cost of more false positives (legitimate emails flagged).\n- This is threshold tuning, completely separate from retraining. 
The model's learned weights and probabilities do not change.\n- The optimal threshold can be found on the validation set by computing the weighted cost: $\\text{cost} = 10 \\times FN + 1 \\times FP$, minimizing over threshold values.","A":"Retraining with a different loss function (e.g., weighted cross-entropy) is a valid approach, but it requires retraining — the question implies finding a simpler solution. Threshold adjustment achieves the goal without retraining.","B":"","C":"Adding training data improves generalization but does not specifically address asymmetric cost structure. More data would not lower the false negative rate unless paired with threshold or loss adjustment.","D":"Logistic regression handles asymmetric costs through both threshold adjustment and class-weighted training. The claim that it \"cannot handle\" asymmetric costs is false."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03004","difficulty":"easy","orderIndex":4,"question":"The log-loss (binary cross-entropy) for a logistic regression model is defined as $L = -[y\\log(\\hat{p}) + (1-y)\\log(1-\\hat{p})]$. A model outputs $\\hat{p} = 0.01$ for a true positive (y=1). How large is the resulting loss and why does this matter?","options":{"A":"The loss is 0.01 — proportional to the confidence of the wrong prediction","B":"The loss is $-\\log(0.01) \\approx 4.6$ — the log function heavily penalizes confident wrong predictions, making log-loss much more sensitive to large mispredictions than squared error","C":"The loss is 1.0 because the prediction is wrong — log-loss only outputs 0 or 1","D":"The loss is undefined because $\\log(0.01)$ requires a calculator and has no closed-form value"},"correct":"B","explanation":{"correct":"- When $y = 1$: $L = -\\log(\\hat{p})$. 
For $\\hat{p} = 0.01$: $L = -\\log(0.01) = \\log(100) \\approx 4.605$.\n- The logarithm diverges to $+\\infty$ as $\\hat{p} \\to 0$, so a confident wrong prediction (low $\\hat{p}$ for a true positive) is penalized extremely heavily. This is by design — it strongly discourages overconfident errors.\n- Compared to squared error $(y - \\hat{p})^2 = (1 - 0.01)^2 \\approx 0.98$, log-loss at 4.6 imposes 4.7× more penalty. This asymmetry makes log-loss far more aggressive about punishing confident mistakes.","A":"The loss is not proportional to the raw prediction value. The logarithm creates a highly nonlinear penalty that explodes near 0 and near 1.","B":"","C":"Log-loss is a continuous function outputting any non-negative real number. It is not binary. A loss of 1.0 corresponds to $\\hat{p} = e^{-1} \\approx 0.368$, not a wrong prediction in general.","D":"$\\log(0.01)$ is perfectly computable: $\\log(0.01) = \\log(10^{-2}) = -2\\log(10) \\approx -4.605$. Log-loss is defined for all $\\hat{p} \\in (0, 1)$."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03005","difficulty":"easy","orderIndex":5,"question":"A logistic regression model is trained on linearly separable data. During training, the loss keeps decreasing but never converges. The gradient descent optimizer reports no convergence after 10,000 iterations. 
What is happening?","options":{"A":"The learning rate is too high, causing gradient explosion","B":"On linearly separable data, the optimal logistic regression solution requires infinite weights — the sigmoid can reach arbitrarily high certainty by scaling weights toward infinity, so the loss keeps decreasing forever without a finite optimum","C":"Logistic regression is not suitable for linearly separable data and should be replaced with a linear SVM","D":"The batch size is too small, causing the gradient to oscillate without converging"},"correct":"B","explanation":{"correct":"- On linearly separable data, a perfect classification exists: one weight vector correctly classifies all training points. As weights grow larger, the sigmoid pushes probabilities closer to 0 and 1, reducing log-loss further.\n- There is no finite weight vector that achieves log-loss = 0 (since $\\log(1) = 0$ requires $\\hat{p} = 1$, which requires infinite weights). The optimizer chases a loss that approaches 0 but never reaches it.\n- The fix is L2 regularization, which penalizes large weights and creates a finite optimum. Without regularization, logistic regression on separable data does not converge in the standard sense.","A":"Gradient explosion from a high learning rate would cause the loss to increase erratically, not decrease steadily. A steadily decreasing, non-converging loss indicates the mathematical non-existence of a finite optimum.","B":"","C":"Logistic regression is entirely valid for linearly separable data. The convergence failure is a mathematical property of the log-loss on separable data, not a model incompatibility.","D":"Batch size affects convergence speed and noise, but does not cause the fundamental issue here. 
Mini-batch oscillation shows non-monotone loss; this problem shows monotone decrease without convergence."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03006","difficulty":"medium","orderIndex":6,"question":"You apply L1 regularization to a logistic regression with 500 features. The resulting model has non-zero coefficients for only 30 features. You then apply L2 regularization with the same regularization strength C. What structural difference in the learned coefficients should you expect?","options":{"A":"L2 regularization will also produce exactly 30 non-zero features because it applies the same penalty magnitude","B":"L2 regularization will keep all 500 coefficients non-zero but shrink them toward zero — it does not produce sparsity because the L2 penalty's gradient never reaches zero for non-zero weights","C":"L2 regularization produces sparser models than L1 because the squared penalty removes more irrelevant features","D":"L1 and L2 regularization produce identical coefficient distributions when applied with the same C value"},"correct":"B","explanation":{"correct":"- L1 penalty adds $\\lambda |w|$ to the loss. At the optimum, the subdifferential condition allows the gradient of the data loss to exactly cancel the L1 gradient, permitting exact zero weights — this is the geometric reason L1 produces sparsity.\n- L2 penalty adds $\\lambda w^2$. The gradient of the penalty at any non-zero $w$ is $2\\lambda w \\neq 0$, which always pushes weights toward zero but never makes the optimal weight exactly zero unless the data gradient is also zero (rare in practice).\n- In feature selection contexts, L1 (Lasso) is preferred for sparsity; L2 (Ridge) is preferred when all features are expected to contribute something, or for stability under multicollinearity.","A":"The same C value does not produce the same sparsity structure. C controls penalty strength, not the penalty geometry. 
L1's diamond constraint geometry produces corners (sparse solutions); L2's circular constraint does not.","B":"","C":"This is backwards. L1 produces sparser models because of its non-differentiability at zero, which allows exact zero solutions. L2 does not produce sparsity.","D":"L1 and L2 produce structurally different coefficient distributions even at the same C. They are not equivalent — this is a foundational distinction in regularization theory."},"reference":"- Tibshirani, \"Regression Shrinkage and Selection via the Lasso\": https://www.jstor.org/stable/2346178"},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03007","difficulty":"medium","orderIndex":7,"question":"A logistic regression model achieves 0.95 AUC on a balanced binary classification task. When evaluated on a three-class problem with the same features using one-vs-rest (OvR) logistic regression, per-class AUCs are 0.91, 0.88, and 0.72. A developer says the model is failing on class 3. What should they check first?","options":{"A":"Retrain the model using softmax (multinomial) logistic regression instead of OvR","B":"Check whether class 3 is linearly separable from the other two classes in feature space — a low OvR AUC for class 3 indicates the linear decision boundary cannot adequately separate it from the combined rest, not necessarily a data or training bug","C":"Class 3 AUC of 0.72 means logistic regression is the wrong model for all three classes","D":"Increase the number of training epochs for the class 3 binary classifier"},"correct":"B","explanation":{"correct":"- OvR trains three separate binary classifiers: class 1 vs {2,3}, class 2 vs {1,3}, class 3 vs {1,2}. A low AUC for class 3 means the model struggles to distinguish class 3 from classes 1 and 2 combined.\n- The most likely explanation: class 3 overlaps in feature space with the other classes, making a linear boundary insufficient. 
This is a data geometry problem, not a training bug.\n- Next steps: visualize class 3 in PCA/t-SNE space, check feature distributions per class, or try a nonlinear model. Before switching models, confirm whether classes 1 and 2 are proxies or mixtures of class 3.","A":"Switching to multinomial (softmax) logistic regression can improve performance, but it is still a linear model — if class 3 is not linearly separable, softmax won't fix the fundamental problem.","B":"","C":"AUC of 0.72 is poor but does not invalidate the entire model. Two out of three classes perform well. The decision to switch models should be based on the specific class 3 geometry, not a blanket judgment.","D":"OvR logistic regression using `sklearn` is a convex optimization — more epochs help convergence but cannot overcome a linearly non-separable class boundary."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03008","difficulty":"medium","orderIndex":8,"question":"A logistic regression model is trained to predict loan default. The coefficient for `credit_score` is −0.032. A business analyst says \"a one-point increase in credit score reduces default probability by 3.2%.\" What is wrong with this interpretation?","options":{"A":"The interpretation is correct — coefficients in logistic regression directly represent probability changes","B":"The coefficient −0.032 represents the change in log-odds per unit increase in credit score, not probability — the actual change in probability depends on the current value of the linear combination and is nonlinear due to the sigmoid","C":"Logistic regression coefficients cannot be interpreted for individual features when there are multiple predictors","D":"The sign is wrong — a negative coefficient should increase probability, not decrease it"},"correct":"B","explanation":{"correct":"- Logistic regression models the log-odds: $\\log\\frac{p}{1-p} = w^Tx + b$. 
A coefficient of $-0.032$ means each one-unit increase in credit score decreases the log-odds of default by 0.032.\n- The change in probability is: $\\Delta p \\approx \\hat{p}(1-\\hat{p}) \\times (-0.032)$. For $\\hat{p} = 0.5$, $\\Delta p \\approx 0.5 \\times 0.5 \\times (-0.032) = -0.008$, about −0.8 percentage points rather than −3.2.\n- Near $\\hat{p} = 0.1$, the change is $0.1 \\times 0.9 \\times (-0.032) \\approx -0.0029$, about −0.29 percentage points. The probability change depends on the current baseline probability, which varies across individuals.","A":"Direct coefficient-to-probability mapping is the most common logistic regression misinterpretation. Only in linear probability models do coefficients represent percentage point changes in probability.","B":"","C":"Coefficients in logistic regression are interpretable (as log-odds effects) with multiple predictors, holding other features constant — the same as in linear regression for partial effects.","D":"A negative coefficient for credit score makes intuitive sense: higher credit scores reduce default risk (lower log-odds of default). The sign direction is correct."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03009","difficulty":"medium","orderIndex":9,"question":"You train logistic regression on a dataset with 10,000 positive and 10,000 negative examples. In production, the positive-class base rate is 1% (1 in 100 transactions is positive). Your model outputs 0.6 for a new transaction. 
What adjustment is needed for the output to reflect the true production probability?","options":{"A":"No adjustment — the model's output is already calibrated for production use","B":"The model was trained on a balanced dataset but production has 1% positive rate — the model's prior is wrong; Bayes' theorem can correct the output: the true posterior probability is much lower than 0.6 given the low base rate","C":"The model should be retrained with production data to fix the calibration","D":"The threshold should be lowered to 0.1 to account for the lower positive rate in production"},"correct":"B","explanation":{"correct":"- Logistic regression implicitly encodes the training class prior into its intercept. Trained on 50% positives, the model's intercept reflects a prior of $P(y=1) = 0.5$. In production, $P(y=1) = 0.01$.\n- Using Bayes' theorem to correct, with $s = \\sigma(w^Tx + b)$ denoting the model output: $P(y=1|x, \\text{prod}) = \\frac{s \\cdot \\pi_{\\text{prod}} / \\pi_{\\text{train}}}{s \\cdot \\pi_{\\text{prod}} / \\pi_{\\text{train}} + (1-s) \\cdot (1-\\pi_{\\text{prod}}) / (1-\\pi_{\\text{train}})}$. For $s = 0.6$, $\\pi_{\\text{train}} = 0.5$, and $\\pi_{\\text{prod}} = 0.01$ this gives $0.012 / (0.012 + 0.792) \\approx 0.015$.\n- This is a common production ML issue: models trained on resampled or balanced datasets output systematically inflated probabilities for the positive class. Intercept adjustment ($b' = b + \\log\\frac{\\pi_{\\text{prod}}}{1-\\pi_{\\text{prod}}} - \\log\\frac{\\pi_{\\text{train}}}{1-\\pi_{\\text{train}}}$) is the standard fix.","A":"The model is not calibrated for production. A model trained on 50% positive data that encounters 1% positive data will output systematically inflated positive-class probabilities.","B":"","C":"Retraining on production data is valid but not the only or fastest solution. Prior correction via intercept adjustment is an analytical fix that doesn't require retraining.","D":"Adjusting the threshold changes what you classify as positive, but does not fix the probability calibration. 
The raw output of 0.6 still does not represent the true 1%-prior-adjusted probability of being positive."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03010","difficulty":"medium","orderIndex":10,"question":"A team trains logistic regression to classify customers as \"high value\" vs \"low value.\" The decision boundary in 2D feature space is a straight line. After adding a third feature `x3 = x1² + x2²`, the model's performance improves significantly. What does this tell you about the original data distribution?","options":{"A":"The original two features were irrelevant — only the new polynomial feature matters","B":"The original decision boundary required a circle (or ellipse) in the original 2D feature space — the data was not linearly separable in 2D but became linearly separable in 3D after adding the polynomial feature that captures radial structure","C":"Adding polynomial features always improves logistic regression performance","D":"The improvement proves that logistic regression is inferior to polynomial regression for classification"},"correct":"B","explanation":{"correct":"- Logistic regression always creates a linear decision boundary in the feature space it receives. If the true boundary in 2D is circular (e.g., $x_1^2 + x_2^2 = r^2$), logistic regression on raw features cannot represent it.\n- Adding $x_3 = x_1^2 + x_2^2$ allows the 3D decision boundary $w_1 x_1 + w_2 x_2 + w_3 x_3 + b = 0$ to represent a circle in the original 2D space when $w_1 = w_2 = 0$.\n- This is the kernel trick intuition: mapping to a higher-dimensional space makes nonlinearly separable data linearly separable. Logistic regression with polynomial features is equivalent to a polynomial classifier.","A":"The original two features cannot be irrelevant if the new feature (built from them) improves performance. 
The improvement comes from capturing nonlinear interactions of the original features.","B":"","C":"Adding polynomial features does not always improve performance. It increases model complexity, risk of overfitting, and multicollinearity. Improvement depends on whether the true decision boundary is nonlinear.","D":"Logistic regression with engineered polynomial features is a valid and powerful approach. It does not prove inferiority — it demonstrates that feature engineering can substitute for nonlinear models."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03011","difficulty":"medium","orderIndex":11,"question":"A logistic regression model is trained for medical diagnosis (disease vs no-disease). The model is well-calibrated — among predictions of 0.7, exactly 70% of patients have the disease. A cardiologist says the model's output can be used directly to make individual treatment decisions. A statistician disagrees. Why?","options":{"A":"The statistician is wrong — calibration means probability outputs are reliable for individual decisions","B":"Calibration is a population-level property; it says nothing about individual prediction certainty — a patient with P(disease) = 0.7 has an irreducible 30% chance of being misclassified, and no amount of calibration reduces this individual uncertainty","C":"The model needs higher AUC before its probabilities can be used for treatment decisions","D":"Medical models should always output binary classifications, not probabilities"},"correct":"B","explanation":{"correct":"- Calibration means: across all patients the model assigns P = 0.7, approximately 70% actually have the disease. 
This is a property of the group, not of the individual.\n- For any single patient at P = 0.7, we cannot say more than \"we estimate a 70% chance.\" There is irreducible uncertainty — we do not know if this individual is in the 70% or 30%.\n- Treatment decisions require integrating this probability with clinical context, cost-benefit analysis, and individual patient factors. Treating P = 0.7 as a yes/no decision without threshold analysis ignores the 30% risk of a wrong treatment.","A":"Calibration ensures the probability scale is meaningful, but it does not reduce individual uncertainty. A 70% probability still means 30% of such patients do not have the disease.","B":"","C":"AUC measures discrimination (ranking quality), not calibration. A model can have high AUC and poor calibration, or high calibration and moderate AUC. The two metrics assess different properties.","D":"Probability outputs are more informative than binary outputs for medical decisions because they preserve uncertainty information needed for risk stratification. Binary outputs discard this."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03012","difficulty":"hard","orderIndex":12,"question":"A logistic regression model trained with L2 regularization ($C = 0.01$, very strong regularization) achieves poor training accuracy on a 5-class problem. A developer increases C to 1,000,000 (essentially no regularization) and training accuracy improves to 97%. 
Which failure mode should they expect in production, and what does the very strong regularization failure reveal?","options":{"A":"C = 0.01 underfits because the regularization forces all weights to exactly zero; no regularization is always better for training accuracy","B":"Very strong regularization (small C) biases the model toward zero weights, causing underfitting — the model cannot capture the signal; very weak regularization (large C) allows overfitting; the large gap between training and production accuracy at C = 1,000,000 is overfitting, and the correct C should be found by validation","C":"L2 regularization in logistic regression should only be used for binary classification; for multiclass, it always fails","D":"Increasing C improves training accuracy without any production risk because L2 regularization only affects convergence speed"},"correct":"B","explanation":{"correct":"- In `sklearn`, $C = 1/\\lambda$ — smaller C means stronger regularization. $C = 0.01$ corresponds to very large $\\lambda$, heavily penalizing weights and forcing them near zero. The model cannot fit the data's complexity — this is underfitting (high bias).\n- $C = 1,000,000$ corresponds to effectively no regularization. On a 5-class problem, the model can overfit the training set, learning class-specific noise. 97% training accuracy with negligible regularization is a warning sign.\n- The optimal C balances bias and variance. In a classification task with validation data, grid-search C across $[0.001, 0.01, 0.1, 1, 10, 100]$ and select based on validation AUC or F1.","A":"C = 0.01 does not force all weights to exactly zero. L2 regularization smoothly shrinks weights but does not produce exact zeros (unlike L1). The model still uses all features but with small, underpowered weights.","B":"","C":"L2 regularization works correctly for multiclass logistic regression (both OvR and multinomial). 
The failure is about regularization strength, not multiclass compatibility.","D":"C directly affects the trade-off between fitting training data and preventing overfit. Claiming \"no production risk\" for unlimited C is precisely wrong — it is the definition of overfitting risk."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03013","difficulty":"hard","orderIndex":13,"question":"Two features $x_1$ and $x_2$ each have AUC = 0.85 individually for predicting a binary outcome. A logistic regression trained on both features achieves AUC = 0.81 — lower than either individual feature. What is the most likely cause?","options":{"A":"Logistic regression cannot combine two features effectively — a decision tree should be used instead","B":"The two features are highly correlated and both capture the same signal; multicollinearity causes the combined model's coefficient estimates to be unstable, and the slightly different noise in each feature's contribution degrades performance versus a single clean predictor","C":"AUC = 0.85 for individual features means each feature is overfitting — combining them reduces overfitting and 0.81 is the correct generalization performance","D":"Logistic regression with two features always performs worse than univariate models because the decision boundary requires more data to fit a 2D hyperplane"},"correct":"B","explanation":{"correct":"- When two features are near-perfect proxies for each other (high correlation), a logistic regression with both features attempts to split the signal between two collinear predictors. The individual coefficient estimates become unstable (high variance) due to multicollinearity.\n- Each feature individually uses the clean single-predictor signal. 
The combined model's instability in coefficient estimation can hurt generalization, particularly when the features contain slightly different measurement noise.\n- This is a practical case where feature selection outperforms feature stacking. The fix is to use one of the features, or apply PCA to get a single principal component capturing the shared variance.","A":"Logistic regression absolutely can combine multiple features effectively. The issue is feature correlation, not a logistic regression limitation. This would also fail for decision trees with correlated features.","B":"","C":"Individual AUC = 0.85 on a proper validation set does not indicate overfitting — it indicates discriminative power. Combining correlated features is the cause of degradation, not a sign of fixing overfitting.","D":"Logistic regression with two uncorrelated features outperforms univariate models when both features are informative. The hypothesis that \"two features always performs worse\" is false."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03014","difficulty":"hard","orderIndex":14,"question":"A logistic regression model is trained on customer transaction data to predict fraud. The positive class (fraud) is 0.1% of all transactions. The model achieves 99.95% training accuracy, 99.92% validation accuracy, and the loss curves look perfectly converged. A fraud analyst says the model is useless. 
How can the analyst be right?","codeSnippet":"from sklearn.metrics import classification_report\nprint(classification_report(y_test, model.predict(X_test)))\n# precision recall f1-score support\n# 0 1.00 1.00 1.00 99900\n# 1 0.00 0.00 0.00 100","options":{"A":"The analyst is wrong — 99.92% validation accuracy with converged loss proves the model is well-trained","B":"The model learned to predict \"not fraud\" for every transaction — the 99.9% majority class gives 99.9% accuracy trivially, and precision/recall/F1 for class 1 (fraud) are all 0.00, meaning the model never detects a single fraud case","C":"The model is overfit to the training set — the validation accuracy should be lower than the training accuracy by more","D":"The loss converging to a small value proves the model captured the signal — the analyst must be misreading the output"},"correct":"B","explanation":{"correct":"- The classification report reveals the critical truth: class 1 (fraud) has recall = 0.00, meaning no fraud case was ever correctly identified. The model outputs \"not fraud\" for every input and achieves 99.9% accuracy by exploiting class imbalance.\n- This is the classic imbalanced classification trap. The log-loss can also be low: if the model predicts P(fraud) = 0.001 for all samples, the loss is $-[0.999 \\times \\log(0.999) + 0.001 \\times \\log(0.001)] \\approx 0.008$ (natural log), which is small but comes with zero fraud detection.\n- Solutions: class-weighted cross-entropy, oversampling minority class (SMOTE), undersampling majority class, or using precision-recall AUC instead of accuracy and standard loss.","A":"Accuracy and loss curves are meaningless metrics on highly imbalanced data. They do not prove the model is well-trained — they prove the model learned the trivial majority-class solution.","B":"","C":"The validation accuracy is close to training accuracy (99.95% vs 99.92%), which looks like good generalization. 
The problem is not overfitting — it is that the model learned the wrong target behavior entirely.","D":"A converged low loss does not prove the model captured signal. On a 0.1% positive class, a model predicting the constant majority class achieves low cross-entropy loss because most examples are easily correct."}},{"section":"machine-learning","topicSlug":"logistic-regression","topic":"Logistic Regression","id":"ml-03015","difficulty":"hard","orderIndex":15,"question":"A team replaces logistic regression with a deep neural network for a tabular binary classification task with 15 features and 5,000 training samples. The DNN achieves 3% higher AUC on the test set. A senior engineer says \"logistic regression would have been better here.\" What is the engineer's reasoning?","options":{"A":"Deep neural networks are always worse than logistic regression on binary classification","B":"With 5,000 samples and 15 features, a deep neural network has far more parameters than training examples, leading to overfitting — logistic regression's simplicity and built-in implicit regularization (via limited capacity) is more appropriate; the 3% AUC gain may reflect test set overfitting rather than real generalization","C":"Logistic regression is always preferable for tabular data because neural networks cannot handle structured features","D":"The engineer is wrong — higher test AUC always means the DNN is genuinely better"},"correct":"B","explanation":{"correct":"- With 5,000 samples and 15 features, a DNN with 2-3 hidden layers can easily have tens of thousands of parameters, several times the number of training examples. The risk of overfitting is high.\n- The 3% AUC improvement on a single test set may reflect chance variation on that split, or hyperparameters tuned against the test set, rather than real generalization. 
Cross-validated AUC would provide a more reliable comparison.\n- Logistic regression is a strong baseline for tabular data with limited samples: it has only $p+1 = 16$ parameters, is fully interpretable, and its regularization is tunable with a single hyperparameter C.","A":"There are many tasks where DNNs outperform logistic regression on tabular data, particularly with complex feature interactions. The claim \"always worse\" is false.","B":"","C":"Neural networks can handle tabular data and often do so effectively when data is abundant. The limitation is sample efficiency, not structural incompatibility.","D":"Higher AUC on a single test evaluation is not conclusive proof of better generalization. Test set overfitting (especially if the test set influenced hyperparameter choices) can inflate a single-split AUC measurement."},"reference":"- Grinsztajn et al., \"Why tree-based models still outperform deep learning on tabular data\": https://arxiv.org/abs/2207.08815"},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04001","difficulty":"easy","orderIndex":1,"question":"A decision tree splits a node containing 100 samples: 50 class A and 50 class B. After the split, the left child has 40 class A and 10 class B; the right child has 10 class A and 40 class B. 
Which impurity measure correctly identifies this as a good split, and why?","options":{"A":"Neither Gini impurity nor entropy can evaluate splits — only accuracy can determine split quality","B":"Both Gini impurity and entropy would indicate this is a good split because both children are purer than the parent — each child has a dominant class (80% majority), while the parent was maximally impure (50/50)","C":"Gini impurity would reject this split because the total samples in each child are equal","D":"Entropy would reject this split because information gain requires one class to disappear completely from a child node"},"correct":"B","explanation":{"correct":"- Parent Gini: $1 - (0.5^2 + 0.5^2) = 0.5$ (maximum impurity for 2 classes). Child Gini (left): $1 - (0.8^2 + 0.2^2) = 0.32$. Child Gini (right): $1 - (0.2^2 + 0.8^2) = 0.32$.\n- Weighted child Gini: $(50/100) \\times 0.32 + (50/100) \\times 0.32 = 0.32$. Information gain from Gini: $0.5 - 0.32 = 0.18$ — a positive improvement.\n- The same result holds for entropy. Any split that makes children purer than the parent produces positive information gain, and this split reduces impurity by 36%.","A":"Accuracy is not used as an impurity criterion during tree splitting. Gini impurity and entropy are the standard splitting criteria precisely because they measure class purity within nodes.","B":"","C":"Gini impurity is not affected by the relative size of the children — it measures class proportion within each child, not child size. The weighted average accounts for size.","D":"Information gain does not require a class to disappear. Any reduction in weighted impurity from parent to children yields positive information gain. Perfect splits (one class per child) are the maximum gain, not the minimum requirement."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04002","difficulty":"easy","orderIndex":2,"question":"A node contains 100 samples: 99 class A and 1 class B. 
Compute the Gini impurity of this node. Is this a good or bad node to split further, and why?","options":{"A":"Gini = 0.5 (maximum impurity), making this a high-priority node to split","B":"Gini ≈ 0.02 (nearly pure), making this a low-priority node — splitting it is unlikely to yield meaningful information gain and would increase tree complexity unnecessarily","C":"Gini = 1.0 for any node with two classes present","D":"The node cannot be evaluated with Gini because one class has only 1 sample"},"correct":"B","explanation":{"correct":"- Gini impurity: $1 - (0.99^2 + 0.01^2) = 1 - (0.9801 + 0.0001) = 0.0198 \\approx 0.02$.\n- A nearly pure node (0.02) has little room for improvement — any split will reduce impurity by at most 0.02, which is unlikely to justify an additional split.\n- Decision tree algorithms naturally stop splitting near-pure nodes (via min_impurity_decrease or min_samples_split parameters), preventing overfitting by memorizing individual rare samples.","A":"Gini = 0.5 represents maximum impurity (50/50 split). A 99/1 split is near-minimum impurity. These are opposite ends of the spectrum.","B":"","C":"Gini equals 0 only for a pure node (one class), and 0.5 for a 50/50 two-class node. Having two classes present does not make Gini = 1.0 — the formula depends on proportions, not presence.","D":"Gini impurity works for any class distribution regardless of sample counts. Even a node with 1 sample has a well-defined (zero) impurity."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04003","difficulty":"easy","orderIndex":3,"question":"A decision tree is trained with `max_depth=None` on a training set of 1,000 samples with 20 features. The training accuracy is 100%, but test accuracy is 62%. 
A colleague says \"just add more training data.\" What is the correct diagnosis and most direct fix?","options":{"A":"The model needs more features, not more data — 20 features are insufficient to generalize","B":"The tree is fully overfit — with max_depth=None, it memorizes each training sample by splitting until each leaf has a single sample; the most direct fix is to constrain tree depth (max_depth), minimum samples per leaf, or apply pruning","C":"100% training accuracy always indicates data leakage, not overfitting","D":"The model needs a different impurity criterion — switching from Gini to entropy would improve test accuracy"},"correct":"B","explanation":{"correct":"- Decision trees with no depth constraint will grow until every leaf contains a single training sample (or until all samples in a leaf belong to the same class). With 1,000 samples, the tree may have up to 1,000 leaves — it has memorized the training data perfectly.\n- The train-test gap (100% vs 62%) is the classic overfitting signature. The tree is fitting noise rather than signal.\n- Direct fixes: `max_depth` (limit tree height), `min_samples_leaf` (require minimum samples per leaf), `min_samples_split` (require minimum samples before splitting), or cost-complexity pruning (`ccp_alpha`).","A":"Adding features increases the risk of overfitting further — more features give the tree more dimensions to split on, worsening memorization. Restricting tree complexity, not adding features, is the fix.","B":"","C":"100% training accuracy in a complex model is a sign of overfitting, not necessarily leakage. Data leakage causes high test accuracy too, which is not the case here (62% test).","D":"Switching impurity criteria (Gini vs entropy) rarely changes model quality significantly. 
Both criteria produce similar trees, and neither addresses the overfitting caused by unbounded depth."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04004","difficulty":"easy","orderIndex":4,"question":"You train two decision trees on the same dataset: one with Gini impurity and one with entropy. Both trees achieve nearly identical test accuracy. A manager asks: \"why use entropy instead of Gini if they produce the same result?\" What is the correct explanation?","options":{"A":"Entropy is always more accurate than Gini — if they produce the same result, one implementation is wrong","B":"Gini and entropy measure similar things (node impurity) and usually produce nearly identical splits and trees — entropy is slightly more computationally expensive due to the logarithm but has a stronger information-theoretic interpretation; Gini is preferred in practice for speed","C":"Entropy should never be used for classification trees — it is only valid for regression trees","D":"The only difference between Gini and entropy is that Gini penalizes larger classes while entropy penalizes smaller classes"},"correct":"B","explanation":{"correct":"- Gini: $G = 1 - \\sum p_i^2$. Entropy: $H = -\\sum p_i \\log_2(p_i)$. Both are minimized at 0 for pure nodes and maximized at the uniform distribution. Their functional forms differ but they identify nearly the same optimal splits in practice.\n- Entropy has a direct connection to information theory (Shannon entropy) and the concept of information gain, making it preferable for theoretical analysis. Gini avoids the logarithm, making it faster to compute.\n- `sklearn`'s `DecisionTreeClassifier` uses Gini by default. For most tasks, the choice makes negligible practical difference — both should be tried in hyperparameter tuning.","A":"Producing the same result is the expected outcome, not a sign of implementation error. 
The two criteria converge on similar trees because they both maximize node purity.","B":"","C":"Entropy is valid and standard for classification trees. Regression trees use different criteria (variance reduction, MSE minimization). Entropy is not used for regression trees.","D":"Neither Gini nor entropy \"penalizes\" a class size in the way described. Both are symmetric functions of class proportions. Gini gives more weight to misclassification probability; entropy gives more weight to rare classes proportionally due to the log, but this is not a \"penalty on smaller classes.\""}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04005","difficulty":"easy","orderIndex":5,"question":"A decision tree is retrained after adding 5 new training examples. The entire tree structure changes completely — different splits, different depths, different features at each node. A developer says this is a bug. Is it?","options":{"A":"Yes — a well-trained decision tree should be stable when small amounts of data are added","B":"No — decision trees are inherently unstable (high variance); small changes to training data can cause the first split to change, which completely alters all downstream splits through a cascade effect","C":"It is a bug only if the new training examples are outliers; normal samples would not change the tree","D":"Tree instability is always caused by using Gini impurity; switching to entropy produces stable trees"},"correct":"B","explanation":{"correct":"- Decision trees are greedy algorithms: each split is chosen to maximize immediate impurity reduction without look-ahead. A small change in training data can shift which feature and threshold achieves the maximum gain at the root.\n- Since every split in the tree depends on the data reaching that node, a changed root split routes different data to subsequent nodes, changing their optimal splits too. 
The effect cascades through the entire tree.\n- This instability is one of the primary motivations for Random Forests — building many trees on bootstrapped samples and averaging their predictions reduces variance introduced by individual tree instability.","A":"Instability in decision trees is a known, documented property, not a bug. It is a feature of greedy splitting algorithms without global optimization.","B":"","C":"Decision trees are sensitive to all samples, not just outliers. Even a few representative samples that shift a class boundary at the root can trigger full restructuring.","D":"Both Gini and entropy are greedy impurity measures and produce similarly unstable trees. The instability is inherent to the greedy tree-building process, not the choice of criterion."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04006","difficulty":"easy","orderIndex":6,"question":"A decision tree classifier is trained on a dataset with a continuous feature `age`. The tree considers thresholds at every unique value in the training data. After training, the split is \"age > 34.\" What happens to a new test sample with `age = 34.001`?","options":{"A":"The model throws an error because 34.001 was not in the training data","B":"The sample goes to the right subtree (age > 34 is True), and the tree applies the same decision to all future samples following this branch regardless of how far they are from the threshold","C":"The model interpolates between neighboring training values to handle unseen continuous values","D":"The sample goes to the left subtree because 34.001 is too close to the training threshold to be reliable"},"correct":"B","explanation":{"correct":"- Decision tree splits on continuous features are threshold comparisons: for `age = 34.001`, `age > 34` evaluates to True, so the sample goes right. 
The tree applies the same learned threshold to all new samples, including values never seen during training.\n- Decision trees do not interpolate, extrapolate, or compute distance to the threshold. The boundary is a hard cutoff: any value > 34 routes right, any value ≤ 34 routes left.\n- This threshold-based approach means decision trees can handle unseen continuous values that fall between training values — but they cannot extrapolate beyond the range of training data in a meaningful way (they simply apply the outermost leaf's class).","A":"Decision trees do not require test values to be present in training data. The split is a learned threshold, not a lookup table.","B":"","C":"Decision trees have no interpolation mechanism. They are piecewise constant functions of the input — within a region defined by the thresholds, all samples get the same leaf prediction.","D":"Distance to the training threshold has no role in decision tree inference. The comparison is purely `value > threshold`, with no uncertainty based on proximity."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04007","difficulty":"medium","orderIndex":7,"question":"A decision tree is trained on a classification problem with a highly imbalanced feature: `transaction_amount` has 95% of values below \\$100 and 5% above \\$10,000. Gini impurity selects this feature as the first split at threshold \\$99. 
What bias does this introduce?","options":{"A":"No bias — Gini impurity selects the optimal split regardless of feature distribution","B":"The 5% high-value transactions are almost always routed to the right subtree at depth 1, but this small node cannot be further split effectively due to low sample count — the model may have poor recall on the rare high-value segment because it gets too few samples to learn a good sub-classifier","C":"Gini impurity favors features with more unique values, so `transaction_amount` is always selected first regardless of its true predictive power","D":"The tree would fail to converge because continuous features with extreme skew cannot be split by Gini impurity"},"correct":"B","explanation":{"correct":"- Splitting at \\$99 routes 95% of samples left and 5% right. The right node has only 5% of training data — perhaps a few hundred samples. Subsequent splits on this tiny subtree have limited statistical power: each further split divides an already small set.\n- High-value transactions may have complex patterns requiring multiple splits to model correctly, but the small sample count limits tree depth before min_samples_split or min_impurity_decrease stops growth.\n- This is a well-known limitation of decision trees on imbalanced feature distributions and class-imbalanced datasets. Techniques like stratified sampling or class-weighted splitting can partially address it.","A":"Gini impurity selects the split that maximizes weighted purity reduction, which can be optimal for the majority class while being suboptimal for minority class capture. \"Optimal\" globally ≠ \"optimal for all segments.\"","B":"","C":"Gini impurity considers the impurity reduction, not the number of unique values. 
A feature with many unique values does get more split candidates evaluated, but selection is based on quality of the resulting split, not feature cardinality alone.","D":"Gini impurity works on any continuous or categorical feature regardless of distribution. Convergence is not affected by skew."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04008","difficulty":"medium","orderIndex":8,"question":"A team applies cost-complexity pruning (also called weakest-link pruning) to a fully-grown decision tree by setting `ccp_alpha = 0.05`. The pruned tree has 40% fewer nodes but the test accuracy drops only 1.2%. A stakeholder asks: \"is this a good trade-off?\" What is the correct reasoning?","options":{"A":"No — any accuracy drop from pruning means the tree is worse and pruning should not be applied","B":"Yes — pruning removes subtrees whose impurity decrease per added node is less than alpha (0.05); 40% fewer nodes means significantly less overfitting, better generalization, lower inference cost, and improved interpretability, while a 1.2% accuracy drop is likely within statistical noise","C":"Pruning is only valid for regression trees; for classification trees it always reduces accuracy too much to be useful","D":"ccp_alpha should always be set to 0 for classification tasks — any positive alpha introduces bias"},"correct":"B","explanation":{"correct":"- Cost-complexity pruning removes the subtree at each internal node where the impurity reduction per node added is below `ccp_alpha`. Setting alpha = 0.05 means only splits that provide substantial improvement are kept.\n- A 40% node reduction with only 1.2% accuracy loss is an excellent trade-off: the removed nodes were mostly fitting noise and contributed little real signal.\n- Pruned trees are more interpretable (fewer rules to explain), faster at inference (shorter paths), and tend to generalize better. 
The optimal alpha is found by cross-validating the pruned tree at various alpha values.","A":"Any accuracy drop from pruning does not mean the tree is worse overall. If the dropped accuracy was overfitted noise, the pruned tree generalizes better. Test set accuracy is the correct measure, and 1.2% drop is minimal.","B":"","C":"Cost-complexity pruning applies equally to classification and regression trees. It is a general pruning strategy, not regression-specific.","D":"Setting ccp_alpha = 0 means no pruning at all. Any positive alpha introduces a regularization effect that reduces overfitting — this is bias-variance trade-off management, not a defect."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04009","difficulty":"medium","orderIndex":9,"question":"A decision tree with max_depth=3 is trained on a 10-class classification problem. The tree has at most $2^3 = 8$ leaves. With 10 classes but only 8 possible leaf predictions, what happens to the two classes that are \"impossible\" to represent?","options":{"A":"The model raises an error because the number of leaves is less than the number of classes","B":"There is no guarantee which classes are represented — the 8 leaves will cover the 8 classes (or class combinations) that maximize training accuracy; minority classes or classes similar to others may never appear as leaf predictions","C":"All 10 classes are always represented because each leaf can output probability distributions over all classes","D":"The model automatically increases max_depth to 4 to accommodate all 10 classes"},"correct":"B","explanation":{"correct":"- A decision tree with max_depth=3 creates at most 8 leaf nodes. Each leaf predicts the majority class among its training samples. 
With 10 classes, at most 8 distinct class labels can appear as leaf predictions.\n- Classes that are rare, poorly separated, or similar to majority classes may never form a leaf majority — they get classified as the nearest majority class in the region.\n- This is an important depth constraint consideration for multi-class problems. Rule of thumb: max_depth should allow at least as many leaves as classes: max_depth ≥ $\\lceil \\log_2(k) \\rceil$ for $k$ classes.","A":"Decision trees do not error when leaves < classes. They silently under-represent minority classes. This silent failure mode is the dangerous aspect.","B":"","C":"Leaf predictions are based on the majority class of samples reaching that leaf, not a full probability distribution over all classes. `sklearn` can output `predict_proba`, which gives class fractions at the leaf, but the majority-class prediction may still ignore some classes.","D":"Decision trees never automatically adjust max_depth. It is a hard constraint set by the user."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04010","difficulty":"medium","orderIndex":10,"question":"You train a decision tree on a dataset where feature `A` has information gain 0.42 and feature `B` has information gain 0.38 at the root. Feature `B` is the correct causal predictor (generates the data), but feature `A` is a noisy proxy. The tree selects feature `A` first. 
What does this reveal, and why is it a problem for deployment?","options":{"A":"The tree made an error — information gain always selects the causally correct feature first","B":"Information gain is a statistical measure of correlation, not causation — feature A has higher empirical correlation with the target on this training sample; in production, if A's noise pattern changes (distribution shift), the model fails while a model built on B would remain robust","C":"The problem is resolved by increasing max_depth — more depth allows the tree to eventually use feature B","D":"This situation cannot occur — decision trees always select causal features because Gini impurity is derived from causal inference theory"},"correct":"B","explanation":{"correct":"- Greedy information gain measures which feature most reduces training set impurity. It does not distinguish between causal features and spurious correlates — a noisy proxy with slightly higher empirical correlation will always be chosen first.\n- In production, distribution shift is the real danger: if feature A's noise pattern changes (e.g., A was derived from a data pipeline that changes behavior), the model breaks. Feature B, being causal, is robust to such shifts.\n- This is a fundamental limitation of decision trees (and most ML models): they optimize for empirical fit, not causal structure. Causal reasoning requires explicit domain knowledge or causal discovery methods.","A":"Information gain has no connection to causal correctness. It measures mutual information between feature and label in the training data — a purely statistical quantity.","B":"","C":"Increasing max_depth allows more splits but does not change which feature is selected first. The root split is already feature A; B may appear lower in the tree, but the model's primary decision pathway is built on the spurious correlate.","D":"Gini impurity is a probabilistic measure from decision theory, not causal inference theory. 
Causal inference is a separate field (Pearl's do-calculus, structural equation models) with no connection to how decision trees work."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04011","difficulty":"medium","orderIndex":11,"question":"Two decision trees trained on different 80% subsets of the same dataset produce completely different structures — different root splits, different depths, different features used. A random forest uses 100 such trees. Why does averaging these unstable trees produce better results than any single tree?","options":{"A":"Averaging trees cancels out their individual errors because each tree predicts a different class, so the majority vote is always correct","B":"Each tree has high variance (unstable, overfits to its training subset) but low bias (on average, the mean prediction converges to the true class boundary); averaging reduces variance without increasing bias — this is the bias-variance decomposition of bagging","C":"Averaging trees is equivalent to training a single deep tree with more training data","D":"The 100 trees collectively remove noise from the training data before making predictions"},"correct":"B","explanation":{"correct":"- Bias-variance decomposition: a single deep tree has low bias (can fit complex boundaries) but high variance (changes dramatically with data). The expected error = bias² + variance + irreducible noise.\n- Bagging trains each tree on a bootstrap sample drawn with replacement from the training set (the 80% subsets in the question play the same role). Each tree independently overfits to its sample. Averaged over 100 trees, the high-variance components cancel: $\\text{Var}(\\bar{X}) = \\frac{\\sigma^2}{n}$ for independent models.\n- In practice, trees are not fully independent (same dataset), so variance reduction is partial, but still substantial. As the number of trees grows, the ensemble prediction stabilizes around the average prediction of the tree-building procedure, which has low bias.","A":"Trees don't predict different classes to \"cancel errors\" by design. 
Individual trees can all be wrong simultaneously on difficult examples. Majority vote helps because errors are independent (different bootstrap samples), not because they predict opposite classes.","B":"","C":"Averaging 100 trees is not equivalent to a single deeper tree. A single tree with any depth is still a greedy, unstable estimator. The ensemble's strength comes from independent estimation and variance cancellation, not deeper representation.","D":"Trees do not \"remove noise from data.\" Each tree works on a noisy bootstrap sample. The averaging averages out model variance, not data noise — irreducible noise remains."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04012","difficulty":"hard","orderIndex":12,"question":"A decision tree on a regression task (predicting house prices) has max_depth=2. The test RMSE is 45,000. Increasing max_depth to 20 drops the training RMSE to 1,200 but test RMSE increases to 71,000. Increasing max_depth to 5 gives training RMSE = 28,000 and test RMSE = 38,000. What is the precise mechanism causing test RMSE to be higher at depth=20 than at depth=2?","options":{"A":"Deep trees have more computation, which introduces floating-point rounding errors","B":"At depth=20, each leaf contains very few samples (possibly 1) and memorizes individual training prices including their noise; the leaf prediction for a test sample is the noisy price of the nearest training neighbor, not the true underlying price pattern","C":"Test RMSE increases because deeper trees use more features, causing multicollinearity","D":"The variance of predictions is lower at depth=20 because more splits produce more precise leaf boundaries"},"correct":"B","explanation":{"correct":"- Regression trees predict the mean of training samples in each leaf. 
At depth=20, leaves contain 1-2 samples — the \"mean\" is essentially the individual training price, which includes measurement noise and idiosyncratic factors.\n- For test samples that land in these leaves, the prediction is the price of a specific training house, not a generalized neighborhood price. The test error reflects both the bias of the prediction and the noise absorbed from training samples.\n- At depth=5, leaves contain more samples (perhaps 20-50), so predictions are averages that smooth out noise while still capturing meaningful price patterns.","A":"Floating-point rounding errors are negligible at the scale of RMSE differences (1,200 vs 71,000). This is not the mechanism.","B":"","C":"Decision trees do not suffer from multicollinearity in the traditional sense — each split considers one feature at a time. More depth means more splits, not more feature interaction artifacts.","D":"At depth=20, prediction variance is actually higher (high variance model), not lower. Each leaf's prediction varies widely based on which training house happened to land there. Variance increases with depth."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04013","difficulty":"hard","orderIndex":13,"question":"A decision tree is trained on a dataset with a categorical feature `country` having 150 unique values. The tree evaluator must consider all possible binary splits of 150 categories. 
How many possible binary splits exist for this feature, and what computational problem does this cause?","options":{"A":"150 splits — one per category, where the split is \"country = X\" vs \"country ≠ X\"","B":"$2^{150-1} - 1 \\approx 7 \\times 10^{44}$ possible binary splits — evaluating all subsets is computationally infeasible; implementations use heuristics (sorting categories by target mean for regression, or by class proportion for binary classification) to reduce evaluation to $O(k)$ candidates","C":"150 × 149 / 2 = 11,175 splits — all pairwise combinations of countries","D":"Only 1 split is possible — the median category by frequency"},"correct":"B","explanation":{"correct":"- A binary split on a categorical feature with $k$ values divides the $k$ values into two non-empty subsets. The number of such divisions is $2^{k-1} - 1$: there are $2^k - 2$ non-empty proper subsets, halved because a subset and its complement define the same split. For $k = 150$: $2^{149} - 1 \\approx 7 \\times 10^{44}$.\n- Exhaustive evaluation is infeasible. Practical implementations use: for regression, sort categories by mean target and evaluate $k-1$ contiguous splits; for binary classification, a similar ordering by class proportion; for multi-class, approximate methods.\n- This is why high-cardinality categorical features are problematic for decision trees and why target encoding or ordinal encoding is often applied before tree training.","A":"\"Country = X\" vs \"Country ≠ X\" gives only 150 splits (one per category), not the full set of possible groupings. This is a valid subset of splits but not all possible binary partitions.","B":"","C":"Pairwise combinations count pairs of categories, not binary partitions of the full set. This is neither the correct formula nor the computational problem.","D":"Single-median splits exist only for ordinal/continuous features. 
For categorical features, no natural ordering exists to define a \"median.\""}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04014","difficulty":"hard","orderIndex":14,"question":"A decision tree is trained on temporal data: daily sales for 3 years, predicting tomorrow's sales. A random 80/20 train/test split is used. The model achieves R² = 0.91 on the test set. When deployed, the model makes poor predictions for future months. What is wrong with the evaluation strategy?","options":{"A":"R² is not an appropriate metric for time-series regression — RMSE should be used instead","B":"Random splits on temporal data place future dates in the training set and past dates in the test set — the model \"knows\" future patterns during training; a time-based split (first 80% of dates for train, last 20% for test) would reveal the true out-of-sample performance","C":"The tree has too many leaves for time-series data — a linear model is always better for temporal prediction","D":"Decision trees require at least 5 years of training data for time-series applications"},"correct":"B","explanation":{"correct":"- Random splitting on time-ordered data breaks the temporal ordering. Training samples may include dates from year 3 while test samples include dates from year 1 — the model sees future information during training.\n- This creates temporal leakage: seasonal patterns, trends, and yearly cycles from future dates inform predictions of past dates. The model appears to generalize well because test data (past dates) is \"easier\" than true future dates.\n- The correct evaluation is a **temporal split**: train on days 1-800, test on days 801-1000. This simulates the production scenario of predicting future dates.","A":"R² is a valid metric for regression regardless of the data type (temporal or otherwise). 
The problem is the split strategy, not the metric choice.","B":"","C":"Decision trees can model temporal patterns when given appropriate lag features (yesterday's sales, 7-day rolling average, etc.). The issue is the evaluation strategy, not the model type.","D":"There is no universal rule requiring 5 years of data. The appropriate training window depends on the seasonality and signal in the data, not a fixed duration."}},{"section":"machine-learning","topicSlug":"decision-trees","topic":"Decision Trees","id":"ml-04015","difficulty":"hard","orderIndex":15,"question":"A decision tree splits on feature `income` at threshold \\$50,000 at the root. A researcher notes that after removing 3 outliers (extreme high-income cases), the root split changes to feature `age` at threshold 35. This causes the entire tree to restructure. What does this reveal about the decision tree's robustness, and how does this compare to a Random Forest's behavior?","options":{"A":"Removing 3 outliers is always data manipulation — the researcher's action invalidated both models","B":"Decision trees are sensitive to individual data points because the root split depends on maximizing information gain across all training samples; 3 extreme outliers can shift which feature achieves maximum gain at the root, cascading changes through the entire tree — Random Forests are more robust because each tree uses a bootstrap sample where outliers appear in only a subset of trees","C":"This demonstrates that `age` is the correct feature to split on — removing outliers always reveals the true underlying structure","D":"Random Forests would have the same sensitivity because they use the same splitting algorithm"},"correct":"B","explanation":{"correct":"- Extreme income outliers can disproportionately influence Gini impurity calculations: a split that isolates outliers may appear to maximize information gain at the root, even if it captures only noise.\n- Without the outliers, the underlying structure (age as 
the primary predictor) becomes dominant. This illustrates that single decision trees are highly sensitive to training data composition.\n- Random Forests: each tree is trained on a bootstrap sample ($n$ draws with replacement, covering about 63% of the distinct training rows). Any given outlier is therefore absent from roughly 37% of the trees, and all 3 outliers appear together in only about a quarter of them. The ensemble vote is dominated by the majority of trees that don't have outliers strongly influencing the root split, making the forest more robust.","A":"Removing outliers to understand model stability is a valid diagnostic technique, not data manipulation. Understanding how models respond to data perturbations is standard practice.","B":"","C":"The change after removing outliers doesn't prove `age` is \"correct.\" It demonstrates that the tree's structure is data-dependent — neither split is definitively \"correct\" without domain knowledge.","D":"Random Forests use the same impurity criteria but on different bootstrap samples. Outliers are diluted across the ensemble. Individual trees within a RF can still be influenced, but the ensemble vote is robust."},"reference":"- Breiman et al., \"Classification and Regression Trees (CART)\": https://www.routledge.com/Classification-and-Regression-Trees/Breiman-Friedman-Stone-Olshen/p/book/9780412048418"},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05001","difficulty":"easy","orderIndex":1,"question":"A Random Forest trains 100 trees, each on a different bootstrap sample of the training data. 
A colleague claims \"bootstrapping introduces sampling bias because each tree sees less than the full dataset.\" Is this correct, and what does bootstrapping actually achieve?","options":{"A":"The colleague is correct — bootstrapping reduces training accuracy compared to using the full dataset for each tree","B":"The colleague is incorrect about the purpose — bootstrap sampling intentionally introduces diversity (each tree sees a different data sample with replacement) to decorrelate trees, enabling variance reduction through averaging; the \"bias\" from seeing ~63% unique samples per tree is the design, not a flaw","C":"Bootstrap sampling is only used to increase the effective training dataset size, not to create diversity","D":"Each Random Forest tree uses exactly 80% of the data; bootstrapping and 80/20 splits are equivalent"},"correct":"B","explanation":{"correct":"- Bootstrap sampling draws $n$ samples with replacement from an $n$-sample dataset. In expectation, about 63.2% of the original observations appear in each bootstrap sample ($1 - 1/e \\approx 0.632$), and the remaining draws are duplicates.\n- The purpose is **decorrelation**: if all trees trained on the same full dataset, they would be nearly identical (same dominant splits), and averaging them would achieve nothing. Bootstrap sampling gives each tree a different sample, so the trees learn different splits.\n- The remaining ~36.8% of samples not drawn (out-of-bag samples) provide a free internal validation estimate, a benefit available to any bagged ensemble, including Random Forests.","A":"Bootstrapping does not reduce the accuracy of the ensemble. Each individual tree may have slightly higher bias, but the ensemble variance reduction more than compensates, producing better generalization.","B":"","C":"Bootstrapping does not increase dataset size — it resamples the same $n$ observations. Dataset size augmentation is a different technique (data augmentation, oversampling). 
The goal is diversity, not size.","D":"Bootstrapping with replacement is not equivalent to a fixed 80% split. Bootstrap samples can contain duplicates and the unique observation proportion is approximately 63%, not 80%. The mechanism and purpose differ."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05002","difficulty":"easy","orderIndex":2,"question":"A Random Forest uses `max_features=sqrt(p)` at each split, where `p` is the total number of features. Why does this feature subsampling improve the forest over a standard bagged tree ensemble that uses all `p` features at each split?","options":{"A":"Using fewer features at each split reduces training time but always hurts accuracy","B":"Feature subsampling at each split forces trees to find the best split among a random subset of features, making trees more diverse and less correlated — without it, all trees would still choose the same dominant feature at most nodes, and their errors would not be independent","C":"`sqrt(p)` is the mathematically optimal number of features for any dataset and model size","D":"Feature subsampling is used to reduce memory usage, not to improve predictive performance"},"correct":"B","explanation":{"correct":"- Bootstrap sampling alone is insufficient to decorrelate trees. If one feature is highly predictive, all trees would choose it at the root regardless of bootstrap sample — producing nearly identical trees.\n- Feature subsampling at each node ensures that even the dominant feature is absent at some splits, forcing trees to find alternative decision boundaries. This maximally decorrelates trees.\n- The variance reduction formula for correlated trees: $\\text{Var}(\\text{avg}) = \\rho \\sigma^2 + \\frac{1-\\rho}{B}\\sigma^2$, where $\\rho$ is the pairwise tree correlation. 
Feature subsampling reduces $\\rho$, which directly reduces ensemble variance.","A":"Feature subsampling can reduce training time, but the primary purpose and empirical effect is improved accuracy through variance reduction. Single-tree accuracy may decrease slightly, but ensemble accuracy improves.","B":"","C":"`sqrt(p)` is an empirical rule-of-thumb that works well in practice (Breiman's original recommendation). It is not mathematically optimal for all cases — `max_features` is a hyperparameter that should be tuned.","D":"Memory usage is not the motivation. Feature subsampling uses less memory per split as a side effect, but the purpose is diversity and decorrelation."},"reference":"- Breiman, \"Random Forests\": https://link.springer.com/article/10.1023/A:1010933404324"},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05003","difficulty":"easy","orderIndex":3,"question":"A Random Forest is trained on 10,000 samples. Out-of-bag (OOB) error is reported as 0.18. 
A developer asks: \"Do I need a separate validation set if I have OOB error?\" What is the accurate answer?","options":{"A":"No — OOB error is mathematically equivalent to a held-out test set and always replaces cross-validation","B":"OOB error provides a reliable estimate of generalization error for Random Forests specifically, because each sample is only evaluated by trees that did not see it during training — it approximates leave-one-out cross-validation and often eliminates the need for a separate validation set, but a test set is still needed for final unbiased evaluation","C":"OOB error is calculated on training data and has the same overfitting risk as training accuracy","D":"OOB error is only valid for classification; for regression, a separate validation set is always required"},"correct":"B","explanation":{"correct":"- Out-of-bag prediction for sample $i$: aggregate predictions from all trees that did not include sample $i$ in their bootstrap sample (approximately 37% of trees). This is inherently a held-out evaluation for each sample.\n- OOB error approximates leave-one-out cross-validation and is generally a reliable estimate of generalization error for Random Forests. It avoids the cost of explicit cross-validation.\n- However, OOB error is internal to the training process. For final model reporting and comparison, a completely held-out test set (never used during model selection) is still the gold standard.","A":"OOB error is not mathematically equivalent to a held-out test set. It is an approximation of cross-validation computed on the training set, and each sample is scored by only the roughly 37% of trees that did not see it, not by the full forest. A true test set is never touched during training or model selection.","B":"","C":"OOB error is explicitly calculated on samples not used to train each evaluating tree — it is not training accuracy. The mechanism specifically prevents the overfitting that training accuracy suffers from.","D":"OOB error applies equally to regression Random Forests. 
For regression, OOB RMSE or R² serves as the generalization estimate, just as OOB accuracy does for classification."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05004","difficulty":"easy","orderIndex":4,"question":"A Random Forest feature importance ranks feature `X` as the most important. A data scientist removes all features except `X` and retrains a single decision tree. The decision tree performs much worse than the Random Forest with all features. What explains this paradox?","options":{"A":"Random Forest feature importances are always wrong and should not be used for feature selection","B":"Random Forest feature importance measures marginal contribution averaged across all trees and splits — it accounts for feature interactions; removing all other features changes the problem complexity and eliminates the variance reduction from the ensemble, making a single-tree comparison invalid","C":"Feature importance in Random Forest is calculated on test data, which is why it doesn't transfer to a single decision tree on training data","D":"The single decision tree performs worse because it uses more memory than the Random Forest"},"correct":"B","explanation":{"correct":"- Random Forest feature importance (Mean Decrease Impurity) measures how much feature $X$ reduces impurity on average across all nodes and trees where it is used. It captures $X$'s contribution in the context of all other features.\n- Removing all other features eliminates feature interactions — the single tree on $X$ alone may not capture the same signal that $X$ provided when combined with other features.\n- Additionally, a single decision tree has high variance. Even if $X$ is the most important feature, a single tree's performance is far inferior to the ensemble's variance reduction, independent of feature selection.","A":"Random Forest feature importances are useful and widely used. 
They have known biases (favoring high-cardinality features), but are not \"always wrong.\" Permutation importance is a more robust alternative.","B":"","C":"Feature importance is calculated on training data (Mean Decrease Impurity uses training set splits), not test data. This does not affect its interpretability relative to a single decision tree experiment.","D":"Memory usage has no effect on model performance. This option is irrelevant."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05005","difficulty":"easy","orderIndex":5,"question":"A Random Forest with 500 trees achieves test accuracy of 92%. A colleague adds 1,000 more trees (total 1,500 trees). The test accuracy is now 92.1%. The training time tripled. What general principle does this demonstrate about the number of estimators in a Random Forest?","options":{"A":"More trees always significantly improve performance; 1,500 trees should give much better results than 500","B":"Random Forest accuracy converges as the number of trees grows — after a certain point, adding more trees provides diminishing returns with negligible accuracy improvement; the optimal number of trees is found when OOB error stabilizes","C":"1,500 trees overfit the training data, which is why the accuracy increase is small","D":"Random Forests always need at least 1,000 trees to be effective; 500 trees is insufficient"},"correct":"B","explanation":{"correct":"- As the number of trees in a Random Forest grows, the ensemble prediction converges to a stable value (by the law of large numbers). The error decreases rapidly in the first 50-100 trees and flattens afterward.\n- Mathematically: each tree is a random variable with variance $\\sigma^2$. After $B$ trees, ensemble variance is approximately $\\rho\\sigma^2 + (1-\\rho)\\sigma^2/B$. As $B \\to \\infty$, the second term vanishes — the irreducible term $\\rho\\sigma^2$ is the limit.\n- The practical approach: plot OOB error vs. 
number of trees. When the curve flattens, adding more trees only costs computation without accuracy benefit.","A":"More trees do not \"always significantly improve performance.\" The improvement is largest in the first few dozen trees and negligible after convergence. This is a fundamental property of bagging.","B":"","C":"Random Forests do not overfit as trees increase. Adding more trees reduces variance and does not increase bias. The small accuracy gain is due to convergence, not overfitting.","D":"500 trees is typically well past convergence for most datasets. The optimal number is dataset-dependent and often much smaller than 500 for moderate-complexity problems."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05006","difficulty":"medium","orderIndex":6,"question":"A Random Forest reports that feature `income` has importance 0.35 (highest). After replacing `income` with two features `income_log` and `income_bracket` (an ordinal version), the new features have importances of 0.18 and 0.19, respectively. The total importance of income-related features changed from 0.35 to 0.37. A data scientist says the original feature was \"more important.\" What is the correct interpretation?","options":{"A":"The original feature was more important because 0.35 > 0.37 sum","B":"Importance values depend on feature representation — splitting one feature into two distributes the total importance across both; the combined importance (0.37) is essentially unchanged, and neither representation is inherently \"more important\"","C":"The two new features are more important because together they capture more variance","D":"Feature importances above 0.15 indicate overfitting regardless of the number of features"},"correct":"B","explanation":{"correct":"- Random Forest feature importance (Mean Decrease Impurity) measures contribution within the model structure. 
When one feature is split into two related features, the tree can use either at each node — the total signal is distributed across both, but the combined impact is similar.\n- The small difference (0.35 vs 0.37) could reflect slightly better utilization of the transformed features or noise. Neither representation is objectively \"more important\" — importance is always relative to the model architecture.\n- This is one of the known biases of MDI importance: correlated or redundant features share importance. Permutation importance better handles this by measuring actual impact on predictions when features are shuffled.","A":"Comparing a single feature's importance to the sum of two derived features is not meaningful. The 0.35 vs 0.37 difference is within noise and represents the same underlying signal.","B":"","C":"\"Capturing more variance\" is vague. The combined 0.37 is slightly higher than 0.35, but the difference is marginal. Feature importance does not directly measure explained variance in this context.","D":"There is no universal threshold at which feature importance indicates overfitting. A single dominant feature is common in many well-fitted models."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05007","difficulty":"medium","orderIndex":7,"question":"A Random Forest achieves 95% accuracy on training data and 93% on test data. A Gradient Boosted Tree achieves 97% training accuracy and 96% test accuracy. 
A stakeholder says \"always use gradient boosting — it's strictly better.\" Under what real-world conditions is Random Forest still preferred?","options":{"A":"Random Forest is always preferred because it is simpler to implement","B":"Random Forest is preferred when training speed, robustness to hyperparameter settings, built-in OOB validation, parallelizability, or interpretability of feature importance outweigh the roughly 3-point test accuracy advantage of gradient boosting, particularly for large datasets requiring fast iteration or teams with a limited tuning budget","C":"Gradient boosting always outperforms Random Forest — the stakeholder is correct","D":"Random Forest should be used when the dataset has fewer than 1,000 samples; gradient boosting for larger datasets"},"correct":"B","explanation":{"correct":"- Random Forest trains trees in parallel (each tree is independent). Gradient Boosting trains trees sequentially (each tree depends on the previous). On multi-core hardware, this often makes Random Forest faster to train in wall-clock time.\n- Random Forest is also more robust to hyperparameter choices — the main hyperparameters (n_estimators, max_features, max_depth) have sensible defaults. Gradient Boosting requires careful tuning of learning rate, n_estimators, and max_depth together; a poorly tuned GBM can underperform Random Forest.\n- Random Forest also provides OOB error as a built-in validation estimate at no extra training cost. For interpretability, both methods provide feature importance, so neither has a decisive advantage there.","A":"Ease of implementation is not a decisive production factor. The question is about practical trade-offs in real systems.","B":"","C":"Gradient boosting does not universally outperform Random Forest. On noisy datasets, Random Forest's variance-averaging can match or outperform gradient boosting's bias-reduction approach.","D":"Dataset size thresholds are not the correct axis for this decision. 
Both methods work at any scale; the choice depends on the training time, tuning budget, and accuracy requirements."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05008","difficulty":"medium","orderIndex":8,"question":"A Random Forest is trained on a dataset with 3 highly predictive features and 97 noise features. `max_features=sqrt(100)=10`. A colleague says \"with 10 features at each split, we might miss all 3 important features in some nodes.\" Is this a problem?","options":{"A":"Yes — this is a critical flaw; max_features must always be set to include all important features","B":"Not a meaningful problem: with sqrt(100)=10 features sampled, the probability of including at least one important feature at a given split is only $1 - C(97,10)/C(100,10) \\approx 0.27$, but over many splits and trees, important features appear frequently and drive the model","C":"This is always a problem; max_features should be set to 100 (all features) to avoid missing any important feature","D":"The model will fail if any single tree misses all 3 important features at the root split"},"correct":"B","explanation":{"correct":"- Probability that a given split excludes all 3 important features: $P(\\text{none of 3}) = \\frac{\\binom{97}{10}}{\\binom{100}{10}} = \\frac{97!/(87!\\cdot10!)}{100!/(90!\\cdot10!)} = \\frac{90 \\times 89 \\times 88}{100 \\times 99 \\times 98} \\approx 0.73$, so $P(\\text{at least one important}) \\approx 0.27$ per split. Each tree makes hundreds of splits and the forest has many trees, so important features are candidates many times and usually win the split when they are.\n- The forest is robust to missing features at individual nodes precisely because there are many nodes and trees. 
Important features statistically dominate the ensemble even when absent from some individual splits.\n- This is part of why Random Forests are robust — no single split decision is critical, and the ensemble averages over many \"views\" of the data.","A":"max_features is intentionally set below p to decorrelate trees. Including all important features at every split would make trees correlated again, defeating the purpose of the ensemble.","B":"","C":"max_features = p (all features) is equivalent to bagged trees without feature subsampling — the original Random Forest improvement over bagging. It allows dominant features to always be chosen, reducing tree diversity.","D":"Individual trees can have poor root splits and still contribute meaningfully to the ensemble when averaged. The forest is robust to individual tree imperfections."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05009","difficulty":"medium","orderIndex":9,"question":"After training a Random Forest on a customer churn dataset, you use Mean Decrease in Impurity (MDI) importance to identify the top feature as `contract_renewal_date`. This feature has very high cardinality (1,000+ unique dates). A teammate says the importance is inflated. 
Are they right?","options":{"A":"No — MDI importance correctly adjusts for feature cardinality","B":"Yes — MDI importance is biased toward features with more unique values because they offer more potential split points, giving them more opportunities to achieve high impurity reduction; this can cause low-cardinality features with true predictive power to be ranked lower","C":"High cardinality features always have lower MDI importance because fewer samples fall in each split","D":"MDI importance bias toward cardinality only occurs with Gini impurity, not with entropy"},"correct":"B","explanation":{"correct":"- MDI importance sums the impurity decrease over all nodes where a feature is used, with each decrease weighted by the fraction of samples reaching that node. High-cardinality features create many more potential split points — they get more \"chances\" to find a split that happens to improve purity on the training data.\n- This inflates their MDI scores even when the cardinality is incidental (like a date field that correlates with data collection timing rather than true business causality).\n- The fix: use **permutation importance** instead, ideally computed on held-out data. It measures the actual drop in model performance when a feature is randomly shuffled, so it does not reward a feature merely for offering many split points.","A":"MDI does not adjust for cardinality. This is a well-documented limitation — Strobl et al. (2007) explicitly demonstrated this bias in their study of Random Forest variable importance measures.","B":"","C":"High cardinality features split data into many small groups, but MDI weights each split by the fraction of samples at that node. 
The bias comes from the number of split opportunities, not from split size.","D":"The cardinality bias exists for both Gini and entropy — it is a property of how impurity is summed across splits, not of the specific impurity formula."},"reference":"- Strobl et al., \"Bias in Random Forest Variable Importance Measures\": https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-25"},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05010","difficulty":"medium","orderIndex":10,"question":"A Random Forest model is deployed in production to predict loan defaults. The model must explain its decisions to regulators under \"right to explanation\" requirements. A risk manager says \"feature importances from the forest prove that income is the primary driver.\" A compliance officer pushes back. What is the officer's valid concern?","options":{"A":"Feature importances are sufficient for regulatory explanation — the officer is wrong","B":"MDI feature importances are global aggregate measures — they describe the average contribution across all predictions, not the reason for any specific individual's prediction; regulators typically require local explanations (why this specific applicant was declined), which requires SHAP values or LIME, not global importance","C":"Feature importances from Random Forests are not allowed by any financial regulator","D":"The concern is that income is a protected attribute under fair lending laws"},"correct":"B","explanation":{"correct":"- \"Right to explanation\" requirements (e.g., GDPR Article 22, ECOA adverse action notices) require explaining individual decisions — why was this specific person denied, not \"on average, income is important.\"\n- MDI importance is a global summary statistic. 
It says nothing about a specific prediction: an individual who was denied might have been denied primarily because of their debt-to-income ratio, not income alone, even if income is globally important.\n- SHAP (SHapley Additive exPlanations) provides individual-level attributions: \"for this specific applicant, income contributed −0.32 to the prediction score.\" This satisfies the regulatory requirement for individual explanation.","A":"Global feature importances are not sufficient for individual-level regulatory explanation. This is the exact gap that explainability frameworks (SHAP, LIME) were developed to address.","B":"","C":"Financial regulators don't ban specific ML techniques. They require explainability of individual decisions. Random Forests with SHAP explanations are used in regulated financial applications.","D":"Income can be a valid (non-protected) feature in credit models. The concern is not about what income represents but about the inadequacy of global importance for individual explanations."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05011","difficulty":"hard","orderIndex":11,"question":"A Random Forest is trained on a dataset with two perfectly correlated features: `salary` and `salary_log`. After training, MDI importance shows salary = 0.31 and salary_log = 0.09, summing to 0.40. A single decision tree on only `salary` shows importance 0.38. 
What does the discrepancy reveal?","options":{"A":"The Random Forest computed incorrect importances — correlated features should always have equal importance","B":"When two correlated features are present, Random Forest distributes importance between them depending on which one a particular tree's bootstrap sample and feature subset chooses; neither importance value reflects the true contribution of the shared signal, and the sum (0.40) is closer to but still not equal to the single-feature importance (0.38) due to redundancy effects","C":"`salary_log` should have zero importance because it is derived from `salary`","D":"The correlation between salary and salary_log caused the Random Forest to overfit, which is why importances are distorted"},"correct":"B","explanation":{"correct":"- Perfectly correlated features share the same predictive signal. When both are present, the forest randomly selects between them at each node (depending on which appears in the feature subset). Importance is split roughly by how often each is the one credited with the split.\n- The split need not be equal. Because `salary_log` is a monotonic transform of `salary`, both features offer the same candidate partitions of the training data and tie on impurity reduction; which one is credited depends on the sampled feature subset and on how the implementation breaks ties, not on a difference in predictive signal.\n- The true signal contribution is best estimated by: either using only one of the features, permuting the correlated pair together (grouped permutation importance), or using SHAP values.","A":"MDI importance has no requirement for equal importance on correlated features. The distribution depends on the random feature subsets and on tie-breaking between equivalent splits, so exact equality is not guaranteed.","B":"","C":"Derived features are not automatically assigned zero importance. `salary_log` yields the same partitions as raw `salary`, so it is credited with a split whenever it is in the sampled subset and `salary` is not, or whenever it wins a tie.","D":"Correlated features do not cause overfitting by themselves. 
Overfitting is related to model complexity (tree depth) and noise in training data, not feature correlation."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05012","difficulty":"hard","orderIndex":12,"question":"A Random Forest achieves OOB error of 0.12 on a training dataset. A team member uses the entire training data (including OOB samples) to tune `max_features` and `n_estimators` by minimizing OOB error. They report OOB error of 0.08 after tuning. A statistician says this OOB estimate is now biased. Why?","options":{"A":"The statistician is wrong — OOB error is always unbiased regardless of how it is used","B":"OOB error is unbiased as a generalization estimate for a fixed model, but when used as the criterion for hyperparameter selection, it becomes a selection criterion — repeated model selection based on OOB error creates the same overfitting-to-the-metric problem as using a validation set for selection; the OOB error no longer represents an unbiased future performance estimate","C":"OOB error only becomes biased if the number of trees exceeds 200","D":"The bias occurs because max_features and n_estimators are too important to tune using OOB error"},"correct":"B","explanation":{"correct":"- OOB error is an unbiased estimate for any single Random Forest with fixed hyperparameters. Each sample is evaluated only by trees that didn't use it — a legitimate holdout.\n- However, when you run multiple hyperparameter configurations and select the one with the lowest OOB error, you are effectively searching for a configuration that happens to perform well on these specific OOB splits. This is the same overfitting problem as using a validation set for model selection.\n- The fix: use nested cross-validation or a held-out test set for final evaluation after OOB-based hyperparameter tuning. The reported OOB of 0.08 is optimistic.","A":"OOB error is unbiased for a fixed, pre-specified model. 
Once it becomes a selection criterion across multiple models, the bias introduced by selection is indistinguishable from validation-set overfitting.","B":"","C":"The number of trees does not determine OOB bias. More trees actually make OOB estimates more stable (less variance), not more biased.","D":"max_features and n_estimators are legitimate hyperparameters to tune. The problem is the selection mechanism using the same metric as evaluation, not the choice of which hyperparameters to tune."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05013","difficulty":"hard","orderIndex":13,"question":"A Random Forest is trained on a dataset where the positive class (5% of data) is rare. OOB accuracy is 95.2%. The team declares the model ready for deployment. A fraud analyst runs the model and finds it never flags any fraud. What went wrong, and how should this be diagnosed?","options":{"A":"95.2% OOB accuracy proves the model works correctly — the analyst must be testing on different data","B":"The Random Forest learned to predict the majority class (no-fraud) for all samples — OOB accuracy of 95.2% is achievable with a zero-rule classifier on 5% positive data; the correct diagnostic is OOB precision/recall or F1 on the positive class","C":"Random Forests cannot handle class imbalance — a different algorithm must always be used","D":"OOB accuracy is unreliable for imbalanced datasets because it oversamples minority classes"},"correct":"B","explanation":{"correct":"- With 5% positive class, a model predicting \"no fraud\" for every sample achieves 95% accuracy — close to the reported 95.2%. The Random Forest likely learned to do exactly this because the majority class is overwhelmingly represented in each bootstrap sample.\n- OOB accuracy hides this: a 95.2% baseline is trivial on a 5% positive class. 
The correct metrics are OOB recall on the positive class (likely 0%), OOB F1-score on the positive class, or OOB precision-recall AUC.\n- Solutions: class-weighted Random Forest (`class_weight='balanced'`), oversample minority class before bootstrapping, or evaluate using appropriate imbalanced-class metrics from the start.","A":"High accuracy on imbalanced data is not evidence of a working model. This is the same imbalanced data trap from the ML Fundamentals topic, now appearing in the Random Forest context.","B":"","C":"Random Forests can handle class imbalance with the `class_weight` parameter. The problem is not the algorithm but the evaluation metric and default behavior on imbalanced data.","D":"OOB sampling is not biased toward minority classes — it reflects the same imbalanced distribution as the training data. The issue is the accuracy metric, not the OOB sampling mechanism."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05014","difficulty":"hard","orderIndex":14,"question":"You train a Random Forest with `n_estimators=500`, `max_depth=None`, `max_features=sqrt(p)`. The training accuracy is 100%, but OOB error is 0.15. A colleague says \"the model is overfit — reduce max_depth.\" A senior engineer disagrees. 
Who is right and why?","options":{"A":"The colleague is correct — 100% training accuracy always means the model is overfit","B":"The senior engineer is correct — individual trees in a Random Forest are intentionally grown to full depth (overfit to their bootstrap samples); the ensemble's OOB error of 0.15 is the relevant generalization measure; the 100% training accuracy is expected and irrelevant for Random Forests","C":"Both are wrong — the correct fix is to reduce n_estimators, not max_depth","D":"The senior engineer is wrong — max_depth=None always causes overfitting that cannot be corrected by the ensemble"},"correct":"B","explanation":{"correct":"- Random Forest's design principle: each individual tree is grown deep (often to full depth) so it has low bias. The ensemble variance reduction (through averaging) compensates for each tree's high variance.\n- Training accuracy is computed by running the whole forest on the training set. Each training sample is in-bag for roughly 63% of the trees, and those fully grown trees have memorized it, so the majority vote is almost always correct. This is expected and part of the design, not a bug.\n- OOB error is the correct generalization estimate. 0.15 means 85% OOB accuracy, which is meaningful. Reducing max_depth would increase individual tree bias without necessarily improving OOB error — and might hurt it if the signal requires deep trees.","A":"100% training accuracy in an ensemble context is not a sign of overfitting in the same sense as for a single model. Each tree overfits to its bootstrap sample, but the ensemble does not — this is by design.","B":"","C":"Reducing n_estimators would reduce computational cost and potentially increase variance (fewer trees). It is not the right response to suspected overfitting in a Random Forest.","D":"max_depth=None (full depth) is the standard and often optimal choice for Random Forests. The ensemble averaging handles the variance of deep trees. 
Constraining max_depth can help in some cases but is not universally necessary."}},{"section":"machine-learning","topicSlug":"random-forest","topic":"Random Forest","id":"ml-05015","difficulty":"hard","orderIndex":15,"question":"A team uses a Random Forest for a real-time fraud detection system that must respond in under 10ms. The forest has 500 trees of average depth 20. Each tree evaluation traverses up to 20 nodes. Profiling shows inference takes 45ms — too slow. What are the most effective latency optimizations specific to Random Forest inference?","options":{"A":"Retrain the model with fewer features to reduce tree size","B":"Reduce n_estimators (e.g., 100 trees vs 500), reduce max_depth (shallower trees), or apply post-training tree pruning — these directly reduce the number of node evaluations per prediction; alternatively, export trees to optimized formats (ONNX, PMML) or compile trees to native code for 5-10x speedup","C":"Switch to a GPU for inference — Random Forests are GPU-accelerated by default","D":"Add more RAM to the inference server — the slowness is caused by cache misses on tree structures"},"correct":"B","explanation":{"correct":"- Inference latency in a Random Forest is roughly proportional to `n_estimators × average_depth`, because each tree traversal visits one node per level. Reducing either factor directly reduces wall-clock time.\n- Practical options: `n_estimators=100` (80% reduction in trees, often with <2% accuracy loss if the forest was converged at 500); `max_depth=10` (roughly halves tree traversal); compiling the trees to native code with `treelite` or serving them through `ONNX Runtime`.\n- Another approach: early exit (predict with fewer trees if confidence is high) or quantized trees that use integer comparisons instead of floating-point.","A":"Reducing features changes the model's predictive capacity and requires retraining. It indirectly reduces tree depth (fewer splits) but is not targeted at latency. 
Reducing n_estimators and max_depth is more direct.","B":"","C":"Standard Random Forest implementations (sklearn) are CPU-bound. GPU acceleration requires specialized libraries (RAPIDS cuML). It is not available by default and requires infrastructure changes.","D":"Cache misses can be a factor for very large forests, but the primary bottleneck is node evaluation count. Reducing tree size addresses both the traversal cost and the memory access pattern."},"reference":"- Breiman, \"Random Forests\" (original paper): https://link.springer.com/article/10.1023/A:1010933404324\n- treelite for fast RF inference: https://treelite.readthedocs.io/"},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06001","difficulty":"easy","orderIndex":1,"question":"A gradient boosting model trains 100 trees sequentially. The first tree predicts house prices and makes residual errors. The second tree is trained on those residuals. A developer says \"the second tree is trying to correct the first tree's mistakes.\" Is this accurate and complete?","options":{"A":"Accurate and complete — gradient boosting is simply a sequential error-correction process","B":"Accurate in spirit but incomplete — the second tree fits the negative gradient of the loss function with respect to the current ensemble prediction, which equals the residuals for MSE loss but differs for other losses (e.g., log-loss, MAE); the general mechanism is functional gradient descent in prediction space, not just error correction","C":"Inaccurate — the second tree is trained on a weighted version of the original labels, not on residuals","D":"Accurate and complete, but only when using MSE loss; for other losses, gradient boosting uses a different approach altogether"},"correct":"B","explanation":{"correct":"- For MSE loss ($L = \\frac{1}{2}(y - F(x))^2$), the negative gradient is $y - F(x)$ — exactly the residual. 
So for MSE, \"fitting residuals\" is literally what gradient boosting does.\n- For other losses, the negative gradient is different: for MAE, it is the sign of the residual; for log-loss, it is $y - \\sigma(F(x))$. In each case, the new tree fits the negative gradient of the loss, not the raw residual.\n- Gradient boosting is best understood as gradient descent in function space: each tree is a step in the direction that reduces the loss most, just as SGD takes a step in the direction of the negative gradient in parameter space.","A":"\"Error correction\" is an intuitive description but misses the generalization to non-MSE losses. The general mechanism is loss-function gradient fitting, which becomes residual fitting only as a special case.","B":"","C":"AdaBoost uses sample-weighted loss (the confusion here). Gradient Boosting fits pseudo-residuals (negative gradients) on the original data, not weighted labels.","D":"Gradient boosting uses different loss functions natively — the framework handles any differentiable loss. The question correctly distinguishes MSE residuals from the general gradient."},"reference":"- Friedman, \"Greedy Function Approximation: A Gradient Boosting Machine\": https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-function-approximation-a-gradient-boosting-machine/10.1214/aos/1013203451.full"},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06002","difficulty":"easy","orderIndex":2,"question":"A gradient boosting model is trained with `learning_rate=1.0` and `n_estimators=100`. The training loss decreases to near-zero quickly, but the test loss starts increasing after 20 trees. A colleague reduces `learning_rate` to 0.01 and increases `n_estimators` to 1000. 
What is the effect of this change?","options":{"A":"The change is redundant — learning rate and n_estimators compensate each other perfectly at any value","B":"Reducing learning rate makes each tree contribute a smaller step toward the residual (shrinkage), requiring more trees to achieve the same training loss but generally producing a more regularized, better-generalizing model — however, training time increases proportionally with n_estimators","C":"Reducing learning rate always increases test loss because the model learns slower","D":"`learning_rate=0.01` is below the minimum effective threshold and will cause the model to make no progress at all"},"correct":"B","explanation":{"correct":"- The update at each step: $F_m(x) = F_{m-1}(x) + \\eta \\cdot h_m(x)$, where $\\eta$ is the learning rate and $h_m$ is the new tree. A smaller $\\eta$ means each tree contributes less — the model takes smaller gradient steps.\n- Smaller learning rates act as regularization: the model is more conservative about committing to any single tree's prediction, reducing the risk of overfitting to training noise.\n- The empirical trade-off: `learning_rate=0.01` with `n_estimators=1000` often outperforms `learning_rate=1.0` with `n_estimators=100`, but takes 10× longer to train. Early stopping on a validation set avoids the need to manually set n_estimators.","A":"Learning rate and n_estimators are not perfectly compensating. A high learning rate with many trees can still overfit differently than a low learning rate. The regularization effect of small learning rate is not reducible to fewer trees.","B":"","C":"A lower learning rate does not increase test loss — it reduces overfitting by shrinking individual tree contributions. The test loss may temporarily appear to decrease more slowly during training but typically achieves a lower minimum.","D":"Learning rate of 0.01 makes the model progress slowly but steadily. 
There is no minimum threshold below which progress stops — gradient descent with small steps still converges."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06003","difficulty":"easy","orderIndex":3,"question":"A gradient boosting model overfits — training AUC is 0.99 but validation AUC is 0.78. Which hyperparameter combination would most directly reduce overfitting?","options":{"A":"Increase n_estimators from 100 to 500 and keep max_depth=6","B":"Reduce max_depth from 6 to 3, reduce learning_rate from 0.1 to 0.05, and enable early stopping on the validation set — these simultaneously reduce individual tree complexity, shrink the gradient steps, and stop training before validation performance degrades","C":"Increase subsample from 0.8 to 1.0 to ensure all training data is used","D":"Switch from a regression loss to a classification loss — the wrong loss function caused the overfit"},"correct":"B","explanation":{"correct":"- The 0.21 AUC gap is severe overfitting. Three complementary mechanisms address it: `max_depth` reduces tree complexity (each tree captures less noise); `learning_rate` shrinks the contribution of each noisy tree; early stopping terminates training precisely when validation performance peaks.\n- `subsample < 1.0` (stochastic gradient boosting) adds noise to each tree's training, acting as regularization. Increasing subsample to 1.0 removes this regularization and typically worsens overfitting.\n- In practice, the hyperparameter search should be: fix a low learning_rate (0.05), use early stopping with a validation set, and tune max_depth and subsample.","A":"More trees at the same depth and learning rate will increase overfitting, not reduce it. More estimators means more steps in the (already overfitting) direction of the training data.","B":"","C":"Increasing subsample to 1.0 removes a regularization mechanism. 
Stochastic subsampling introduces variance in each tree that reduces overfitting.","D":"Loss function mismatch (regression vs classification) would cause invalid predictions, not overfitting. The loss type is separate from regularization. Also, AUC is a classification metric — the model is already using classification loss."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06004","difficulty":"easy","orderIndex":4,"question":"Gradient Boosting builds trees sequentially while Random Forest builds trees in parallel. What does this fundamental difference imply about their training time scaling with n_estimators?","options":{"A":"Both scale identically with n_estimators — the sequential vs parallel distinction only affects code structure, not runtime","B":"Random Forest training time is nearly constant (parallel trees can run simultaneously on multiple cores), while Gradient Boosting training time scales linearly with n_estimators because each tree requires the previous tree's predictions before it can begin","C":"Gradient Boosting is faster because sequential processing allows it to skip unnecessary trees early","D":"Random Forest always takes longer because bootstrap sampling adds overhead that sequential processing avoids"},"correct":"B","explanation":{"correct":"- Random Forest: trees are fully independent. With $k$ CPU cores, training time $\\approx n\\_estimators / k$ per core — near-constant with enough cores.\n- Gradient Boosting: tree $m+1$ requires computing pseudo-residuals using tree $m$'s predictions. This creates a sequential dependency that cannot be parallelized across trees. 
Training time is strictly $O(n\\_estimators)$.\n- Within-tree parallelism (evaluating different feature splits in parallel) can speed up individual tree training, which is how XGBoost achieves speed despite sequential tree building.","A":"Sequential dependency in GBM vs parallel independence in RF creates real, practical runtime differences. On 16 cores, an RF can train 16 trees simultaneously while GBM must wait for each tree to complete.","B":"","C":"Gradient boosting doesn't skip trees — it always trains the specified number. Early stopping terminates training when validation performance plateaus, but this is an optional add-on, not an inherent speed advantage.","D":"Bootstrap sampling overhead is trivial compared to the actual tree-building cost. Random Forest's parallelism advantage outweighs this overhead significantly."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06005","difficulty":"easy","orderIndex":5,"question":"XGBoost was introduced in 2016 as an improvement over standard gradient boosting. One key difference is that XGBoost adds regularization terms directly to the objective function. 
What does this change compared to scikit-learn's `GradientBoostingClassifier`?","options":{"A":"XGBoost adds dropout to the trees, similar to neural network dropout","B":"XGBoost's objective includes L1 (alpha) and L2 (lambda) penalties on leaf weights, controlling both sparsity and magnitude of tree leaf scores — this makes overfitting control more explicit and tunable compared to sklearn's GBM which only indirectly regularizes through tree depth and learning rate","C":"XGBoost adds regularization by randomly removing features during training, like feature subsampling in Random Forest","D":"XGBoost's regularization is equivalent to setting max_depth=3 in sklearn's GradientBoostingClassifier"},"correct":"B","explanation":{"correct":"- XGBoost's objective: $\\text{Obj} = \\sum L(y_i, \\hat{y}_i) + \\sum_k [\\gamma T_k + \\frac{1}{2}\\lambda \\|w_k\\|^2 + \\alpha \\|w_k\\|_1]$ where $T_k$ is the number of leaves, $w_k$ are leaf weights, $\\lambda$ is L2, $\\alpha$ is L1, $\\gamma$ is leaf penalty.\n- `gamma` penalizes tree complexity (number of leaves), `lambda` penalizes large leaf weights, and `alpha` promotes sparse leaf weights. These give fine-grained control over overfitting.\n- sklearn's GBM controls regularization mainly through `max_depth`, `min_samples_split`, and `learning_rate` — effective but less direct than XGBoost's explicit weight regularization.","A":"XGBoost has a technique called DART (Dropouts meet Multiple Additive Regression Trees) which applies dropout to trees, but this is separate from the standard regularization terms. Standard XGBoost does not use dropout by default.","B":"","C":"Feature subsampling (`colsample_bytree`) exists in XGBoost but is a separate hyperparameter from the regularization terms (alpha, lambda). XGBoost has both, but they are distinct mechanisms.","D":"Regularization is a functional change to the optimization objective, not equivalent to depth limiting. 
Regularized shallow trees behave differently from deep trees cut at max_depth=3."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06006","difficulty":"medium","orderIndex":6,"question":"LightGBM uses a \"leaf-wise\" tree growth strategy while XGBoost (by default) uses \"level-wise\" growth. Both reach the same `max_depth`. On the same dataset, LightGBM trains 5× faster. What is the structural difference and the associated risk?","options":{"A":"LightGBM grows each level before going deeper, while XGBoost grows the deepest path first — this makes LightGBM faster but less accurate","B":"LightGBM grows the leaf with the highest loss reduction at each step (leaf-wise), regardless of level — this creates asymmetric trees that can model complex patterns with fewer nodes, but can overfit more on small datasets because a single deep branch may memorize specific training patterns","C":"Level-wise and leaf-wise growth are equivalent for trees of the same max_depth — the speed difference is purely due to LightGBM's implementation optimizations, not the growth strategy","D":"LightGBM uses gradient-based one-side sampling, which is completely unrelated to leaf-wise growth — the speed gain comes entirely from sampling"},"correct":"B","explanation":{"correct":"- Level-wise (XGBoost default): all nodes at depth $d$ are split before any node at depth $d+1$. Each level has $2^d$ nodes split per level.\n- Leaf-wise (LightGBM): at each step, the single leaf with the highest gain is split, regardless of which level it's on. This produces a longer, asymmetric tree that achieves lower loss per split operation.\n- Leaf-wise growth uses fewer total splits to reach the same loss level, making LightGBM faster. The risk: on small datasets, repeatedly growing the same branch can overfit to a small subset of training samples in that branch.","A":"The description is backwards. 
LightGBM grows leaf-wise (deepest path first by gain), not level-first. XGBoost default is level-wise (breadth-first). The relationship between strategy and accuracy is also not simply \"level-wise = more accurate.\"","B":"","C":"The growth strategy is a fundamental difference, not just implementation. Level-wise and leaf-wise produce structurally different trees even at the same max_depth, because they prioritize different splits.","D":"Gradient-based one-side sampling (GOSS) is an additional LightGBM optimization that speeds up gradient computation. It contributes to speed but the question specifically asks about the leaf-wise growth mechanism."},"reference":"- LightGBM paper: https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html"},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06007","difficulty":"medium","orderIndex":7,"question":"A gradient boosting model uses early stopping: training stops when validation loss does not improve for 50 rounds. After training, the team reports the model at the stopping point has `n_estimators=147`. A data scientist then retrains the model on the full training + validation data using `n_estimators=147`. The production AUC drops. What went wrong?","options":{"A":"The full training data caused the model to overfit because it has more samples","B":"Early stopping determined the optimal n_estimators for the train/validation split — adding validation data to the training set changes the loss landscape, so the same n_estimators is no longer optimal; the model now underfits because it should train for more iterations with more data","C":"The model should be retrained using the validation loss at n_estimators=147 as the new early stopping target","D":"Gradient boosting models always perform worse when trained on more data"},"correct":"B","explanation":{"correct":"- Early stopping finds the optimal n_estimators for a specific train/val split. 
When the validation set is added to training, the total dataset is larger, so the gradient estimates are less noisy and the model can fit more trees before it starts to overfit.\n- With more training data, the optimal number of trees is typically higher (overfitting sets in later, so the best stopping point moves out). Using the old n_estimators=147 stops training too early on the larger dataset.\n- The correct approach: retrain on full data, use the early stopping n_estimators as a lower bound, and either run early stopping on a held-out subset or multiply n_estimators by a factor (e.g., 1.1-1.3×) to account for the larger dataset.","A":"More training data generally improves generalization rather than causing overfitting. Overfitting from adding data would be unusual and would manifest differently.","B":"","C":"Using the validation loss as a new early stopping target is not a standard or valid technique. The validation loss from the previous split is not directly comparable to a future run's loss trajectory.","D":"Gradient boosting (and all supervised models) generally benefit from more training data. Saying \"always performs worse with more data\" is categorically incorrect."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06008","difficulty":"medium","orderIndex":8,"question":"CatBoost introduces a technique called \"ordered boosting\" to address a specific problem present in standard gradient boosting. 
What problem does ordered boosting solve, and on which datasets does it matter most?","options":{"A":"Ordered boosting solves the sequential training bottleneck by allowing trees to be built in parallel","B":"Standard gradient boosting computes pseudo-residuals using the same training samples that will train the next tree, causing \"target leakage\" — the residuals are biased because the tree fits its own prediction errors; ordered boosting uses a subset of preceding data to compute residuals for each sample, preventing this bias","C":"Ordered boosting solves the class imbalance problem by ordering samples by class frequency","D":"Ordered boosting eliminates the need for cross-validation by using temporal ordering of data"},"correct":"B","explanation":{"correct":"- In standard gradient boosting, tree $m$ is trained on pseudo-residuals computed from the current ensemble $F_{m-1}$. The issue: $F_{m-1}$ was itself trained on the same samples, creating a subtle overfitting bias in the residual computation (the CatBoost paper calls the resulting bias \"prediction shift\").\n- CatBoost's ordered boosting: for each training sample $i$, pseudo-residuals are computed using a model trained only on samples that precede $i$ in a random permutation (an artificial ordering, not a real timeline). This ensures residuals for sample $i$ are computed from a model that has never seen $i$.\n- This is analogous to online learning evaluation and reduces the gradient estimation bias, which matters most on smaller datasets where in-sample contamination is more severe.","A":"Ordered boosting does not parallelize tree training. Gradient boosting remains sequential by nature. CatBoost achieves speed through other optimizations (symmetric trees, GPU training).","B":"","C":"Ordered boosting has nothing to do with class imbalance. The \"ordering\" refers to a permutation of training samples used for unbiased gradient estimation, not class ordering.","D":"Ordered boosting uses an artificial ordering of samples, not real temporal ordering. 
It is not a cross-validation replacement. CatBoost provides built-in cross-validation separately."},"reference":"- Prokhorenkova et al., \"CatBoost: unbiased boosting with categorical features\": https://arxiv.org/abs/1706.09516"},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06009","difficulty":"medium","orderIndex":9,"question":"A gradient boosting model is trained with `max_depth=6` and achieves good performance. Reducing `max_depth` to 2 while keeping all other hyperparameters the same drops training accuracy significantly. A teammate says \"deeper trees are always better for gradient boosting.\" When is shallow max_depth actually preferred?","options":{"A":"Shallow trees are only used for computational budget constraints, never for accuracy","B":"In gradient boosting, each tree corrects the residual of the previous ensemble — for smooth, low-noise targets with primarily additive feature effects, shallow trees (depth 2-4) capture each correction efficiently; deep trees overfit to individual samples in the residual, accumulating noise across iterations; shallow trees are the canonical choice for regression on tabular data with noise","C":"max_depth=2 is always optimal for gradient boosting because it avoids overfitting in all scenarios","D":"Shallow trees are better only when n_estimators is less than 50"},"correct":"B","explanation":{"correct":"- Gradient boosting residuals are noisy versions of the target (after partial fitting). Deep trees can memorize noise in the residuals — since residuals become smaller and noisier as boosting progresses, deep trees in later rounds fit noise more than signal.\n- Shallow trees (depth 2-3, also called \"stumps\" at depth 1) fit smooth corrections that capture the main signal without noise memorization. 
They require more trees (n_estimators) but generalize better with a suitably small learning rate.\n- For tasks with strong high-order feature interactions (e.g., complex genomics data), deeper trees (5-8) may genuinely be needed. The choice is always dataset-dependent.","A":"Shallow trees are preferred for accuracy on many tabular regression tasks, not just for speed. The regularization effect of shallow trees is a feature, not a constraint.","B":"","C":"max_depth=2 is not universally optimal. Tasks with complex high-order interactions may genuinely require deeper trees. \"Always optimal\" claims for any hyperparameter value are false.","D":"The relationship between optimal depth and n_estimators is not a simple threshold. Both interact as regularization levers — lower depth requires more trees to compensate."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06010","difficulty":"medium","orderIndex":10,"question":"XGBoost has `subsample` and `colsample_bytree` parameters, both of which default to 1.0. A team that had tuned both to 0.8 resets them to 1.0 to \"use all available data and features.\" Training loss improves but validation loss worsens. 
What regularization mechanism did they disable?","options":{"A":"They disabled early stopping by setting subsample=1.0","B":"Both parameters implement stochastic sampling that introduces randomness into each tree — subsample randomly samples training rows per tree (stochastic gradient boosting), and colsample_bytree randomly samples features per tree; removing both makes each tree deterministic and highly correlated, increasing ensemble variance and overfitting","C":"Setting subsample=1.0 is only harmful for datasets with fewer than 1,000 samples; for larger datasets, it has no effect","D":"colsample_bytree=1.0 is the correct setting; only subsample below 1.0 helps with regularization"},"correct":"B","explanation":{"correct":"- `subsample < 1.0`: each tree is trained on a random row subset — Friedman's stochastic gradient boosting. This introduces variance in each tree, improving ensemble diversity and acting as regularization.\n- `colsample_bytree < 1.0`: each tree uses a random feature subset (similar to Random Forest feature subsampling). This decorrelates trees and prevents dominant features from dominating every tree.\n- Both mechanisms reduce tree correlation and introduce beneficial randomness. Using both at 1.0 makes trees more similar to each other (high correlation) and more prone to overfitting the full training distribution.","A":"Subsample has no relationship to early stopping. Early stopping is controlled separately by the `early_stopping_rounds` parameter and a validation dataset.","B":"","C":"Stochastic sampling benefits apply at any dataset size. On larger datasets, each subsample captures the distribution well even at 0.8. On smaller datasets, the effect is more pronounced.","D":"Both subsample and colsample_bytree contribute to regularization. Neither is exclusively responsible. 
Using both below 1.0 provides complementary regularization — colsample at the feature level and subsample at the sample level."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06011","difficulty":"hard","orderIndex":11,"question":"A scikit-learn gradient boosting model (`GradientBoostingClassifier`) with default hyperparameters achieves 0.94 AUC on a fraud detection task. The team then trains XGBoost with default hyperparameters on the same dataset and gets 0.91 AUC. LightGBM with defaults achieves 0.93 AUC. A manager concludes \"sklearn GBM is better.\" What is wrong with this comparison?","options":{"A":"The comparison is valid — default hyperparameters are the standard benchmark","B":"Default hyperparameters are tuned to different datasets by each library's maintainers; a fair comparison requires hyperparameter tuning for each method with the same validation strategy — comparing defaults is comparing how well each library's defaults happen to fit this specific dataset, not the models' inherent capabilities","C":"XGBoost and LightGBM are faster implementations of the same algorithm, so they should always match sklearn GBM accuracy","D":"The manager is correct because a 0.03 AUC difference is statistically significant proof of superiority"},"correct":"B","explanation":{"correct":"- sklearn's GBM defaults (`max_depth=3`, `learning_rate=0.1`, `n_estimators=100`) may happen to fit this dataset's complexity well. XGBoost and LightGBM defaults are set for general robustness, not this specific dataset.\n- A proper comparison requires: same cross-validation strategy, same metric, independent hyperparameter tuning for each method with the same computational budget, and statistical significance testing.\n- In practice, XGBoost and LightGBM often match or outperform sklearn GBM after proper tuning due to explicit regularization terms, faster training, and more sophisticated tree-building algorithms.","A":"Default hyperparameters are starting points, not optimal configurations. 
Comparing defaults tells you which library's defaults happen to work, not which algorithm is better.","B":"","C":"XGBoost and LightGBM are not identical to sklearn GBM. They use different regularization (XGBoost), different tree growth strategies (LightGBM leaf-wise), and different categorical handling (CatBoost). Accuracy differences on specific datasets are expected.","D":"A 0.03 AUC difference may or may not be statistically significant depending on dataset size and variance. Without confidence intervals or paired statistical tests, no significance claim is valid."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06012","difficulty":"hard","orderIndex":12,"question":"A gradient boosting model is trained to predict customer lifetime value (a regression task). After 200 rounds, the training RMSE is 42 and validation RMSE is 89. Applying L2 regularization (`reg_lambda=10`) reduces validation RMSE to 71. Adding `reg_alpha=5` (L1) further reduces it to 65. Explain precisely what each regularization term is doing to the tree structures.","options":{"A":"L2 regularization reduces the number of leaves; L1 regularization reduces the learning rate","B":"L2 (`reg_lambda`) penalizes large leaf weight magnitudes by adding $\\frac{1}{2}\\lambda w^2$ per leaf to the objective, shrinking all weights toward zero; L1 (`reg_alpha`) adds $\\alpha |w|$ per leaf, which can produce exact zero weights for leaves that don't improve the objective sufficiently, effectively pruning those leaves","C":"L2 reduces max_depth; L1 increases the number of trees required","D":"Both L1 and L2 act identically on gradient boosting trees — they are redundant and only one should be used"},"correct":"B","explanation":{"correct":"- XGBoost's optimal leaf weight for a leaf $j$: $w_j^* = -\\frac{G_j}{H_j + \\lambda}$ where $G_j$ is the sum of gradients and $H_j$ is the sum of Hessians at that leaf. 
L2 ($\\lambda$) appears in the denominator, shrinking the leaf weight magnitude.\n- L1 ($\\alpha$) adds $\\alpha |w_j|$ to the objective. The optimal weight with both: $w_j^* = -\\frac{\\text{sign}(G_j)\\max(|G_j| - \\alpha, 0)}{H_j + \\lambda}$ (soft-thresholding of $G_j$), so $|G_j| \\leq \\alpha$ results in zero weight. This zeros out leaves where the gradient signal is below the L1 threshold, which is equivalent to pruning.\n- Combined: L2 shrinks all leaf weights smoothly, L1 zeros out weak leaves entirely. The combination provides both magnitude control and tree sparsification.","A":"L2 does not reduce leaf count; it shrinks weights. L1 can effectively prune leaves (by zeroing them), but L1 is not a learning rate modifier.","B":"","C":"Neither L1 nor L2 directly controls max_depth. Depth is a structural hyperparameter. The number of trees needed can be affected by regularization (stronger regularization may require more trees to compensate for smaller steps), but L1 does not specifically \"increase the number of trees required.\"","D":"L1 and L2 are not identical. L1 produces sparsity (exact zero leaf weights); L2 produces smooth weight shrinkage. They serve complementary purposes and can be used simultaneously."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06013","difficulty":"hard","orderIndex":13,"question":"A team trains gradient boosting with `n_estimators=1000` and `early_stopping_rounds=50` on a train/validation split. Training stops at round 634. They then evaluate on a test set and report AUC from round 634. 
A colleague says \"using the test set for evaluation is fine since we didn't use it for early stopping.\" Is the colleague correct?","options":{"A":"The colleague is correct — early stopping used the validation set; the test set was not involved and remains unbiased","B":"The colleague is correct only if the test set was also not used for any hyperparameter tuning during the experiment — if `n_estimators`, `learning_rate`, or `max_depth` were tuned by observing test set metrics at any point, the test set is no longer an unbiased estimator","C":"The colleague is wrong — early stopping on any split contaminated the test set","D":"Test set evaluation is always valid because test sets are by definition never used for training"},"correct":"B","explanation":{"correct":"- Early stopping used only the validation set — the test set is correctly isolated from this specific decision. The colleague is technically right about early stopping specifically.\n- However, the colleague's claim has an important conditional: if at any point the team looked at test set performance to decide n_estimators range, learning_rate, or max_depth, those decisions implicitly used test set information. This is the classic model selection contamination.\n- In a clean experiment: hyperparameters are tuned using the validation set (or cross-validation), early stopping uses the validation set, and the test set is evaluated exactly once at the very end. If this protocol was followed, the test evaluation is unbiased.","A":"This is conditionally correct but incomplete. The colleague is right about early stopping specifically, but not necessarily about the broader experimental context.","B":"","C":"Early stopping on the validation set does not contaminate the test set. These are separate held-out sets with different roles. The test set contamination happens only through direct use in decision-making.","D":"Test sets can be contaminated even without direct training use. 
Using test set metrics to guide model selection (choosing which experiment to report) is a form of test set leakage — \"reporting the lucky run.\""}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06014","difficulty":"hard","orderIndex":14,"question":"LightGBM uses \"Gradient-based One-Side Sampling\" (GOSS) to speed up training. This means it keeps all high-gradient samples but randomly drops a fraction of low-gradient samples. What trade-off does this introduce compared to full-data gradient boosting?","options":{"A":"GOSS eliminates small-gradient samples permanently, causing permanent information loss","B":"GOSS reduces the effective training set size per iteration, which speeds computation but skews the retained sample toward high-gradient instances, so LightGBM compensates by upweighting retained low-gradient samples by a factor $(1-a)/b$ to approximate the full gradient; the trade-off is speed vs slight gradient estimation variance","C":"GOSS only drops duplicate samples, so there is no information loss","D":"GOSS is equivalent to mini-batch gradient descent and introduces the same variance as stochastic training in neural networks"},"correct":"B","explanation":{"correct":"- GOSS splits samples into the top-$a$ fraction by gradient magnitude (kept) and the remaining $(1-a)$ fraction, from which it draws a random subset amounting to a fraction $b$ of the full dataset. The sampled low-gradient instances are upweighted by $(1-a)/b$ so that their contribution approximates that of the whole low-gradient group.\n- This preserves the main gradient signal (high-gradient samples drive learning) while approximating the low-gradient contribution through sampling + upweighting.\n- The trade-off: each iteration processes fewer samples (faster), but the gradient estimate has higher variance than full-batch computation. 
The upweighting introduces a stochastic approximation that works well in practice but adds noise to the optimization path.","A":"GOSS does not permanently drop low-gradient samples. The sampling is redone at each boosting round, so all samples have a chance to appear in any given round.","B":"","C":"GOSS explicitly drops samples based on gradient magnitude, not duplication. It introduces intentional statistical sampling with compensation, not deduplication.","D":"GOSS samples by gradient magnitude (not uniformly at random), which is fundamentally different from SGD mini-batch sampling. The weighting mechanism also has no analog in standard SGD."}},{"section":"machine-learning","topicSlug":"gradient-boosting","topic":"Gradient Boosting","id":"ml-06015","difficulty":"hard","orderIndex":15,"question":"You train gradient boosting for binary classification. After 500 trees, you extract the leaf scores from all trees and discover the raw output for a specific sample is +3.2 (the sum of leaf values). A junior developer converts this to a class prediction by applying `round(3.2)` and reports class 1. What is wrong with this approach?","codeSnippet":"raw_score = 3.2 # sum of leaf values from all trees\nprediction = round(raw_score) # Developer's approach: 3.2 → 3, not binary!","options":{"A":"round(3.2) = 3, which is outside {0, 1} — the developer is using the wrong transformation; for binary classification, raw GBM scores must be passed through a sigmoid to get probabilities, then thresholded","B":"The raw score of 3.2 is the correct final prediction — no transformation is needed for binary classification","C":"The raw score should be divided by 500 (number of trees) before rounding","D":"Gradient boosting raw scores are always between -1 and 1, so 3.2 is a model implementation error"},"correct":"A","explanation":{"correct":"- Gradient boosting for binary classification outputs raw scores (log-odds) that can be any real number. 
`round(3.2) = 3`, which is not a valid binary class label.\n- The correct transformation: $p = \\sigma(\\text{raw\\_score}) = \\frac{1}{1 + e^{-3.2}} \\approx 0.961$. Then apply a threshold (default 0.5): predict class 1 if $p > 0.5$.\n- `predict()` in sklearn, XGBoost, and LightGBM handles this automatically. Using raw leaf sum scores directly requires manual sigmoid transformation.","A":"","B":"The raw log-odds score is an intermediate value, not the final prediction. Binary classification requires converting log-odds to probability then applying a threshold.","C":"Dividing by 500 averages the leaf scores — this is not the correct transformation. The raw score is a log-odds value that must be passed through sigmoid, not averaged.","D":"Gradient boosting raw scores are unbounded — they grow as more trees are added and can take any real value. A score of 3.2 is normal and expected for a model with high confidence in class 1."},"reference":"- XGBoost documentation on output types: https://xgboost.readthedocs.io/en/stable/tutorials/model.html\n- Chen and Guestrin, \"XGBoost: A Scalable Tree Boosting System\": https://arxiv.org/abs/1603.02754"},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07001","difficulty":"easy","orderIndex":1,"question":"An SVM with a linear kernel is trained on a binary classification problem. After training, you discover that removing 90% of training samples — those far from the decision boundary — does not change the model's predictions at all. 
What does this reveal about how SVMs define their decision boundary?","options":{"A":"SVMs ignore 90% of training data by design — they only use randomly selected samples","B":"The SVM decision boundary is defined entirely by the support vectors — the subset of training samples closest to the hyperplane; all other points do not affect the boundary position or margin","C":"The removed samples were duplicates, which is why removing them had no effect","D":"A linear kernel SVM only uses the first and last 10% of training samples sorted by feature value"},"correct":"B","explanation":{"correct":"- An SVM finds the maximum-margin hyperplane defined by: $\\min \\frac{1}{2}\\|w\\|^2$ subject to $y_i(w^Tx_i + b) \\geq 1$ for all $i$. The Karush-Kuhn-Tucker conditions show that only samples with $\\alpha_i > 0$ (non-zero dual weights) contribute to the solution — these are exactly the support vectors.\n- Support vectors are the training points that lie on the margin boundary ($y_i(w^Tx_i + b) = 1$) or inside the margin (for soft-margin SVMs). All other points are correctly classified with margin > 1, so $\\alpha_i = 0$ and they play no role.\n- This is a key SVM property: the model is sparse in training samples. This also means the trained SVM can be serialized as just the support vectors and their weights, regardless of total training set size.","A":"SVMs do not randomly ignore data — they systematically identify which samples define the optimal boundary (support vectors). All samples are considered during optimization; only non-support-vectors end up with zero weight.","B":"","C":"The samples were not duplicates — they were simply non-support vectors. They happened to lie far enough from the margin that they don't constrain the optimal hyperplane.","D":"SVMs have no concept of sorting samples by feature value. 
The support vectors are determined by geometry (proximity to the hyperplane), not by feature rank."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07002","difficulty":"easy","orderIndex":2,"question":"A hard-margin SVM is trained on a 2D linearly separable dataset. The margin is defined as $\\frac{2}{\\|w\\|}$. A colleague asks: \"why do we maximize the margin instead of just finding any hyperplane that separates the classes?\" What is the SVM's answer?","options":{"A":"Maximizing the margin is computationally cheaper than finding any separating hyperplane","B":"Among all hyperplanes that separate the classes, the maximum-margin hyperplane generalizes best to new data — Vapnik's structural risk minimization theory shows that larger margins correspond to lower VC dimension and better generalization bounds","C":"Maximizing the margin ensures the decision boundary passes through the center of the dataset","D":"Any separating hyperplane works equally well — margin maximization is an aesthetic choice, not a mathematical one"},"correct":"B","explanation":{"correct":"- For linearly separable data, infinitely many separating hyperplanes exist. The maximum-margin hyperplane is the one with the largest \"buffer zone\" between the closest points of each class.\n- Intuitively: a larger margin means the model is less sensitive to small perturbations in input — a test point must move further to cross the boundary. This equates to better robustness to noise and better generalization.\n- Formally: the VC dimension of a linear classifier with margin $\\gamma$ on data in a ball of radius $R$ is bounded by $\\min(R^2/\\gamma^2, d) + 1$. Larger margins reduce the effective VC dimension, tightening the generalization bound.","A":"Hard-margin SVM optimization is a convex quadratic program — not computationally cheaper than finding any separating hyperplane. 
The computational justification is backwards.","B":"","C":"The maximum-margin hyperplane is equidistant from the nearest points of each class, but it does not pass through the dataset center. These are different geometric concepts.","D":"All separating hyperplanes are not equal. The maximum-margin hyperplane has provably better generalization properties under statistical learning theory. This is a foundational, not aesthetic, result."},"reference":"- Vapnik, \"The Nature of Statistical Learning Theory\": https://link.springer.com/book/10.1007/978-1-4757-3264-1"},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07003","difficulty":"easy","orderIndex":3,"question":"An SVM with a linear kernel cannot classify XOR data (two classes arranged in a checkerboard pattern). A colleague adds an RBF kernel and the model classifies the same data perfectly. What did the RBF kernel actually do?","options":{"A":"The RBF kernel applied a smoothing operation to the training data, removing the XOR pattern","B":"The RBF kernel implicitly maps the data to an infinite-dimensional feature space where the classes become linearly separable — the kernel function $K(x_i, x_j) = e^{-\\gamma\\|x_i - x_j\\|^2}$ computes dot products in that high-dimensional space without explicitly constructing it","C":"The RBF kernel rotated the coordinate axes to align with the XOR pattern, making it linearly separable in 2D","D":"The RBF kernel is equivalent to polynomial degree 2, which adds $x_1^2, x_2^2, x_1 x_2$ features that linearize XOR"},"correct":"B","explanation":{"correct":"- The kernel trick: instead of explicitly transforming features $\\phi(x)$ and computing $\\phi(x_i) \\cdot \\phi(x_j)$, we compute $K(x_i, x_j) = \\phi(x_i) \\cdot \\phi(x_j)$ directly in the original space.\n- The RBF kernel corresponds to an infinite-dimensional feature map (Mercer's theorem). 
In this infinite-dimensional space, the XOR pattern becomes linearly separable — a hyperplane exists that separates the two classes.\n- The SVM only computes kernel values (dot products between training pairs), never explicitly constructing the infinite-dimensional features. This is the computational elegance of the kernel trick.","A":"The RBF kernel does not smooth or modify the data points. It defines a similarity measure between pairs of points used in the SVM dual formulation.","B":"","C":"Rotation cannot make XOR linearly separable in 2D — no 2D rotation transforms a checkerboard into two half-planes. The transformation requires a higher-dimensional space.","D":"The RBF kernel is not equivalent to degree-2 polynomial. The polynomial kernel $K(x_i, x_j) = (x_i \\cdot x_j + c)^d$ is a different kernel corresponding to a finite-dimensional feature map. RBF is fundamentally different — infinite-dimensional."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07004","difficulty":"easy","orderIndex":4,"question":"A soft-margin SVM has a hyperparameter C. A team trains two models: SVM-A with C=0.001 and SVM-B with C=1000. SVM-A has a wider margin but more misclassified training points. SVM-B has a narrower margin with fewer training errors. 
Which model likely generalizes better on noisy test data?","options":{"A":"SVM-B always generalizes better because fewer training errors means better fit","B":"SVM-A likely generalizes better on noisy data — large C forces the SVM to minimize training errors aggressively, creating a narrow margin that overfits to noisy points; small C tolerates training errors in exchange for a wider, more robust margin","C":"Both models are equivalent — C only affects training speed, not the decision boundary","D":"SVM-A generalizes better because wider margins always produce lower test error regardless of data noise"},"correct":"B","explanation":{"correct":"- The soft-margin SVM objective: $\\min \\frac{1}{2}\\|w\\|^2 + C\\sum \\xi_i$ where $\\xi_i$ are slack variables for margin violations. C is the regularization hyperparameter: small C emphasizes maximizing margin (tolerating violations), large C emphasizes minimizing violations (potentially sacrificing margin).\n- On noisy data, individual mislabeled points or outliers are close to the true boundary. Large C forces the SVM to classify these noisy points correctly, distorting the boundary toward noise.\n- Small C produces a wider margin that \"ignores\" noisy points at the cost of some training errors — more robust to noise and outliers.","A":"Fewer training errors do not imply better generalization, especially on noisy data. This is the fundamental bias-variance trade-off: SVM-B has lower bias but higher variance.","B":"","C":"C fundamentally changes the decision boundary by altering the balance between margin width and violation penalty. This is not a computational parameter — it shapes the learned model.","D":"\"Wider margins always produce lower test error\" is too strong. On data with no noise, the maximum-margin boundary is optimal. 
On noisy data, the relationship depends on the noise magnitude relative to the margin."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07005","difficulty":"easy","orderIndex":5,"question":"An SVM is trained on a dataset with 1 million samples. Training takes 8 hours. A colleague says \"SVMs scale quadratically or worse with the number of samples — this is expected.\" Is this claim accurate?","options":{"A":"The claim is false — SVMs always train in O(n log n) time, similar to sorting algorithms","B":"The claim is accurate — standard SVM solvers (QP solvers for the dual problem) have complexity between O(n²) and O(n³) in the number of training samples; for 1 million samples, this makes exact SVM training computationally infeasible without specialized algorithms","C":"The claim is only true for RBF kernels; linear SVMs always train in O(n) time","D":"The quadratic scaling applies to the number of features, not samples; for 1 million samples, training always finishes quickly"},"correct":"B","explanation":{"correct":"- The SVM dual problem requires solving a QP over $n$ variables (one per training sample). The kernel matrix $K$ is $n \\times n$ — for 1 million samples, this is $10^{12}$ entries, consuming terabytes of memory.\n- General-purpose QP solvers scale roughly as $O(n^3)$. Decomposition methods like SMO (Sequential Minimal Optimization) bring this down to roughly $O(n^2)$ in practice, but kernel SVM training remains impractical at 1M samples without approximation.\n- Practical alternatives for large datasets: LinearSVC (LIBLINEAR-based, roughly linear in $n$), SGD-based SVM via `sklearn.linear_model.SGDClassifier`, or approximate kernel methods (Nyström approximation, random features).","A":"O(n log n) scaling is for sorting algorithms, not SVMs. 
SVM complexity is dominated by the QP solver, which scales at least quadratically with n.","B":"","C":"Linear SVMs can be trained with primal methods in $O(n)$ time (e.g., LIBLINEAR), but this is a special case. Nonlinear kernel SVMs do not have linear time solvers.","D":"SVM complexity scales with samples, not just features. The kernel matrix dimensionality is $n \\times n$ (samples × samples), making sample count the binding constraint."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07006","difficulty":"medium","orderIndex":6,"question":"An SVM with an RBF kernel has hyperparameter gamma (γ). Training with γ=0.001 gives a smooth, slightly underfitting decision boundary. Training with γ=100 gives a highly irregular boundary that perfectly fits training data but fails on test data. What does γ control geometrically?","options":{"A":"γ controls the number of support vectors — higher γ uses more support vectors","B":"γ controls the \"reach\" of each training sample's influence: high γ means each sample only influences the boundary in a tiny local neighborhood (rough, complex boundary), while low γ means each sample influences a large region (smooth, broader boundary)","C":"γ controls the margin width — higher γ produces wider margins and better generalization","D":"γ is the learning rate for the SVM optimizer — higher γ converges faster but can overshoot"},"correct":"B","explanation":{"correct":"- RBF kernel: $K(x_i, x_j) = e^{-\\gamma \\|x_i - x_j\\|^2}$. For high γ: $e^{-\\gamma \\|x_i - x_j\\|^2}$ decays very rapidly with distance — only very close neighbors have non-zero similarity. Each training point only influences its immediate neighborhood.\n- For low γ: the kernel decays slowly — each training point influences a broad region. 
The resulting boundary is smooth and nearly linear in the limit $\\gamma \\to 0$.\n- High γ produces decision boundaries that wrap tightly around each training cluster, memorizing noise. Low γ produces broad boundaries that may miss fine-grained class structure.","A":"γ does not directly control the number of support vectors. More support vectors often appear with high γ (complex boundary needs more anchor points), but this is a consequence, not the mechanism.","B":"","C":"γ controls kernel bandwidth, not margin width. The margin is controlled by C. Higher γ actually tends to reduce the effective margin by creating locally complex boundaries.","D":"γ is not a learning rate. SVM training with kernels is a convex QP problem — it has no learning rate in the gradient descent sense. The optimization always converges to the global optimum."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07007","difficulty":"medium","orderIndex":7,"question":"An SVM is trained on text classification with a linear kernel. The training set has 50,000 documents represented as TF-IDF vectors with 50,000 features (sparse). 
A colleague recommends switching to an RBF kernel for \"better performance.\" Why might this advice be wrong?","options":{"A":"RBF kernels cannot handle text data at all","B":"Text data in high-dimensional sparse TF-IDF space is often nearly linearly separable — a linear SVM can achieve near-optimal performance with much lower computational cost than RBF; the RBF kernel requires computing $O(n^2)$ pairwise kernel values between 50,000-dimensional vectors, which is computationally expensive and may not improve accuracy","C":"Linear SVMs always outperform RBF SVMs on all tasks","D":"The advice is wrong because TF-IDF features should always use polynomial kernels, not RBF"},"correct":"B","explanation":{"correct":"- In high-dimensional sparse feature spaces (like bag-of-words or TF-IDF), the data is often linearly separable or nearly so by the \"blessing of dimensionality.\" A linear SVM in 50,000 dimensions has enormous flexibility.\n- RBF kernel with 50,000-dimensional TF-IDF vectors: the kernel $e^{-\\gamma\\|x_i - x_j\\|^2}$ computes Euclidean distance in 50,000 dimensions. Because most document pairs share few terms, pairwise distances tend to concentrate around similar values, making the kernel less discriminative than in low-dimensional spaces.\n- Linear SVMs for text are well-established (LIBLINEAR) and achieve strong results on many text classification tasks. The RBF kernel adds computational cost, often without a corresponding accuracy benefit.","A":"RBF kernels work mathematically on any numeric vectors, including text TF-IDF. The issue is practical performance and computational cost, not theoretical incompatibility.","B":"","C":"Linear SVMs don't always outperform RBF — for low-dimensional data with nonlinear boundaries, RBF is clearly better. The claim is specifically about high-dimensional sparse text data.","D":"Polynomial kernels for text are not a standard recommendation. 
Linear kernels are the standard for text; the choice between polynomial and RBF is specific to the data geometry, not feature type."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07008","difficulty":"medium","orderIndex":8,"question":"A polynomial kernel SVM with degree=3 is trained on a binary classification task. The kernel is $K(x_i, x_j) = (x_i \\cdot x_j + 1)^3$. A developer wants to explicitly create the polynomial feature expansion and train a linear SVM on those features. For input dimension d=100, how many features would this explicit expansion have?","options":{"A":"300 features (d × degree = 100 × 3)","B":"$\\binom{d + 3}{3} = \\binom{103}{3} = 176,851$ features (all monomials up to degree 3) — the explicit feature space is enormous, making the kernel trick computationally essential","C":"1,000,000 features (d³ = 100³)","D":"The number of features stays at 100 — polynomial kernels do not create new features"},"correct":"B","explanation":{"correct":"- A degree-$p$ polynomial feature expansion of $d$-dimensional data creates all monomials up to degree $p$: $x_1^{a_1} x_2^{a_2} \\cdots x_d^{a_d}$ where $\\sum a_i \\leq p$. The count is $\\binom{d+p}{p}$.\n- For $d=100, p=3$: $\\binom{103}{3} = \\frac{103 \\times 102 \\times 101}{6} = 176,851$ features.\n- The kernel trick avoids constructing these 176,851 features explicitly. Instead, $(x_i \\cdot x_j + 1)^3$ computes the dot product in this space using only a simple formula on the original 100-dimensional vectors.","A":"$d \\times \\text{degree}$ only accounts for linear terms multiplied by degree — it doesn't count cross-terms ($x_1 x_2 x_3$) or higher-order monomials ($x_1^2 x_2$). The correct count is combinatorial, not multiplicative.","B":"","C":"$d^p = 100^3 = 1,000,000$ overcounts because it includes all ordered products with repetition. 
The polynomial feature expansion uses unordered monomials, which is a smaller count.","D":"Polynomial kernels implicitly compute dot products in the higher-dimensional space. The features conceptually exist in that space — they just aren't materialized when using the kernel trick."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07009","difficulty":"medium","orderIndex":9,"question":"A team builds an SVM to classify genomic sequences. The dataset has 500 training samples and 20,000 features (gene expression values). After scaling features to [0,1], an RBF SVM with cross-validated C and gamma achieves 0.91 AUC. A deep neural network achieves only 0.83 AUC. Why might SVM outperform the neural network here?","options":{"A":"SVMs always outperform neural networks on biological data","B":"With 500 samples and 20,000 features, a neural network has millions of parameters and severely overfits; the SVM's kernel-based approach effectively operates in a high-dimensional feature space with structural risk minimization, requiring far fewer effective parameters relative to the margin geometry","C":"Deep neural networks cannot process genomic data — they require image inputs","D":"The SVM is faster to train, so it converges to the global optimum while the neural network gets stuck in a local minimum"},"correct":"B","explanation":{"correct":"- The $n/p$ ratio here is $500/20,000 = 0.025$ — far fewer samples than features. A neural network with even a modest hidden layer (e.g., 128 neurons) has $20,000 \\times 128 = 2,560,000$ parameters vs. 500 training samples. Severe overfitting is almost guaranteed.\n- An SVM's effective capacity is controlled by the margin width and the number of support vectors, not the feature dimensionality. 
In high-dimensional settings, SVMs often remain well-regularized because the maximum-margin solution has large margin relative to the feature space volume.\n- This is precisely the scenario where SVMs were dominant before deep learning became prevalent: high-dimensional, low-sample genomic, text, and image data.","A":"SVMs do not always outperform neural networks on biological data. With sufficient data (thousands of labeled examples), neural networks typically win. The key condition is the $n/p$ ratio.","B":"","C":"Neural networks can process any numeric feature vector, including gene expression data. This claim is false.","D":"SVM training is a convex optimization with a unique global optimum — convergence is guaranteed regardless of speed. Neural networks have non-convex loss but can still converge to good local minima with proper initialization. Speed is not the reason for the accuracy difference."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07010","difficulty":"medium","orderIndex":10,"question":"An SVM's dual formulation gives the decision function as $f(x) = \\sum_{i \\in SV} \\alpha_i y_i K(x_i, x) + b$. After training, there are 850 support vectors out of 10,000 training samples. 
A data scientist says \"fewer support vectors means a better model.\" Is this correct?","options":{"A":"Correct — the number of support vectors is inversely proportional to test accuracy","B":"Partially correct but oversimplified — fewer support vectors generally indicate a simpler, more generalizable decision boundary (larger margin), but the optimal number depends on the true data complexity; too few support vectors (from over-regularization) indicate underfitting","C":"Support vector count has no relationship to model quality","D":"Exactly 50% of training samples should be support vectors for an optimal SVM"},"correct":"B","explanation":{"correct":"- An upper bound on SVM generalization error relates to the expected leave-one-out error: $E[\\text{LOO error}] \\leq E[\\text{number of support vectors}] / n$. Fewer support vectors → lower LOO error upper bound → better expected generalization.\n- However, this bound is loose. With a very small C (heavy regularization), the model creates a wide margin with few support vectors but may underfit (too simple to capture the true boundary).\n- The optimal C (and hence optimal support vector count) should be found by cross-validation. The support vector count is a diagnostic signal, not a target metric.","A":"Support vector count and test accuracy are not inversely proportional in a strict sense. The relationship depends on C, the kernel, and the data distribution.","B":"","C":"Support vector count is a meaningful diagnostic. Extreme counts (nearly all or very few training samples as support vectors) indicate potential over-regularization or under-regularization.","D":"There is no principled reason for exactly 50% support vectors. This varies widely by dataset and hyperparameter settings."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07011","difficulty":"hard","orderIndex":11,"question":"You train an SVM with an RBF kernel on a training set. 
The kernel matrix $K$ is $n \\times n$. During inference, a new point $x$ must compute $K(x, x_i)$ for all $n$ support vectors. For $n = 50,000$ support vectors with $d = 5,000$ features, what is the per-sample inference cost and why is this problematic for real-time serving?","options":{"A":"Inference is O(1) because SVMs use a precomputed lookup table","B":"Inference costs O(n × d) operations per sample (n kernel evaluations at O(d) each) — for 50,000 support vectors and 5,000 features, each prediction requires 250 million floating-point operations; at real-time latency requirements (< 10ms), this is infeasible without approximation","C":"Inference cost is O(d) regardless of support vector count because the kernel is precomputed during training","D":"SVM inference always costs the same as a single dot product regardless of the number of support vectors"},"correct":"B","explanation":{"correct":"- Each inference requires evaluating $K(x, x_i) = e^{-\\gamma\\|x - x_i\\|^2}$ for each support vector $x_i$. Each evaluation requires $O(d)$ operations (computing $\\|x - x_i\\|^2$). With $n_{sv}$ support vectors: total cost is $O(n_{sv} \\times d)$.\n- For 50,000 support vectors and 5,000 features: $50,000 \\times 5,000 = 2.5 \\times 10^8$ multiply-adds per sample. At ~1 GFLOP/s on a single CPU core, this takes ~250ms — far exceeding real-time requirements.\n- Solutions: retune C and gamma to reduce the support vector count, use approximate kernel methods (Nyström, random features), switch to a linear SVM if the RBF is not strictly necessary, or use GPU acceleration.","A":"SVMs do not use precomputed lookup tables for inference. The kernel must be evaluated against all support vectors for each new sample.","B":"","C":"The kernel values between training support vectors can be cached, but the kernel between a new test point and each support vector must be computed fresh at inference time.","D":"SVM inference cost scales with the number of support vectors × feature dimensionality. 
A single dot product is O(d); total prediction is O($n_{sv} \\times d$)."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07012","difficulty":"hard","orderIndex":12,"question":"A linear SVM is trained on a balanced binary classification task with features $(x_1, x_2)$. The optimal hyperplane is $2x_1 - 3x_2 + 1 = 0$. A data scientist scales feature $x_1$ by 100 (multiplying all $x_1$ values by 100) and retrains. The new hyperplane becomes $0.02x_1' - 3x_2 + 1 = 0$ (where $x_1' = 100 x_1$). Are these models equivalent?","options":{"A":"Yes — the predictions are identical because scaling doesn't change class membership","B":"The predictions may be the same on the training set, but the unscaled model is heavily influenced by $x_2$ relative to $x_1$ — SVMs are not scale-invariant; the margin calculation depends on $\\|w\\|$, so feature scaling affects which hyperplane achieves maximum margin","C":"SVMs are scale-invariant by design — the kernel handles scaling automatically","D":"The two models are equivalent because the hyperplane equation $2x_1 - 3x_2 + 1 = 0$ and $0.02x_1' - 3x_2 + 1 = 0$ define the same geometric boundary"},"correct":"B","explanation":{"correct":"- The original margin: $\\frac{2}{\\|w\\|} = \\frac{2}{\\sqrt{4+9}} = \\frac{2}{\\sqrt{13}} \\approx 0.555$.\n- After scaling $x_1$ by 100, the equivalent problem in the original space has $w = (2/100, -3)$. Margin: $\\frac{2}{\\|(0.02, -3)\\|} = \\frac{2}{\\sqrt{0.0004+9}} \\approx \\frac{2}{3} = 0.667$ — different margin.\n- The maximum-margin hyperplane changes with feature scaling because $\\|w\\|$ depends on the scale of each feature's weight. SVM is not scale-invariant, which is why **feature standardization is mandatory** before training SVMs.","A":"\"Predictions may be identical\" is plausible only if the training data is perfectly separable (both models achieve zero training error). 
In general, different margins lead to different boundaries and different generalization.","B":"","C":"Kernel SVMs are not scale-invariant. The RBF kernel $e^{-\\gamma\\|x_i - x_j\\|^2}$ explicitly depends on Euclidean distance, which changes with feature scaling.","D":"Substituting $x_1' = 100x_1$ shows that $0.02x_1' - 3x_2 + 1 = 0$ and $2x_1 - 3x_2 + 1 = 0$ describe the same line in the original space, so the two stated boundaries coincide. However, the optimization is not equivalent: $\\|w\\|$, and therefore the margin being maximized, differs between the two feature spaces, so retraining on rescaled data does not in general recover the same boundary."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07013","difficulty":"hard","orderIndex":13,"question":"An SVM with C=10 has 300 support vectors. Increasing C to 10,000 results in 2,800 support vectors (out of 5,000 training samples). A 5-fold cross-validation shows C=10 generalizes better. What is the precise mechanism causing more support vectors at higher C?","options":{"A":"Higher C means the optimizer adds more support vectors for computational stability","B":"Higher C penalizes margin violations more heavily, forcing the boundary to correctly classify more training points — points that were previously allowed to violate the margin (counted as non-support-vectors) are now forced toward or into the margin, becoming support vectors; more support vectors means a narrower margin and a more complex boundary","C":"Support vector count scales linearly with C — doubling C always doubles the support vectors","D":"Higher C causes more features to be selected, which creates more support vectors"},"correct":"B","explanation":{"correct":"- At low C, the SVM tolerates many margin violations — the model says \"it's acceptable for some training points to be inside or on the wrong side of the margin.\" These points may not become support vectors if the overall solution is better served by a wider margin.\n- At high C, any point that lies inside the 
margin (violates the $y_i(w^Tx_i + b) \\geq 1$ constraint) becomes a support vector with non-zero dual weight. More training points are forced to be correctly classified with margin ≥ 1, but this requires a more complex, narrower boundary.\n- 2,800 out of 5,000 support vectors at high C suggests the model is nearly memorizing training points — a hallmark of overfitting in SVMs.","A":"Optimizer stability has no relationship to support vector count. The number of support vectors is determined by the data geometry and the C value, not by numerical stability.","B":"","C":"The relationship between C and support vector count is not linear. It depends on the data distribution, margin violations at different C values, and the geometry of the decision boundary.","D":"SVMs (especially kernel SVMs) don't perform feature selection. All features are used through the kernel computation. Support vectors are training samples, not features."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07014","difficulty":"hard","orderIndex":14,"question":"A string kernel SVM is used to classify protein sequences. Two proteins are \"similar\" if they share many substrings (k-mers). The kernel is $K(s_1, s_2) = $ (count of shared k-mers). This is never explicitly computed in feature space. Mercer's theorem requires this kernel to be a valid kernel. 
What does \"valid kernel\" mean in this context?","options":{"A":"A valid kernel must be computable in polynomial time","B":"A valid kernel must be a symmetric positive semi-definite (PSD) function — it must correspond to a dot product in some (possibly infinite-dimensional) Hilbert space; the k-mer kernel is PSD because the matrix $K_{ij}$ built from all training pairs has all non-negative eigenvalues","C":"A valid kernel must produce values between 0 and 1","D":"Mercer's theorem only applies to continuous feature spaces; string kernels are exempt"},"correct":"B","explanation":{"correct":"- Mercer's theorem states that $K(x_i, x_j)$ is a valid kernel iff the Gram matrix $K_{ij} = K(x_i, x_j)$ is symmetric positive semi-definite (PSD) for any finite set of inputs.\n- PSD means: for all vectors $c$, $\\sum_{i,j} c_i c_j K(x_i, x_j) \\geq 0$. Equivalently, all eigenvalues of the Gram matrix are non-negative.\n- The k-mer string kernel can be written as $K(s_1, s_2) = \\phi(s_1) \\cdot \\phi(s_2)$ where $\\phi(s)$ is the feature vector of k-mer counts. Any kernel expressible as a dot product is automatically PSD.","A":"Computational complexity is not part of Mercer's theorem. A valid kernel only needs to correspond to a dot product in some Hilbert space, regardless of computational cost.","B":"","C":"Kernel values have no required range [0,1]. Many valid kernels (linear: $K(x,y) = x \\cdot y$) produce any real value. The PSD property is about the matrix structure, not individual value range.","D":"Mercer's theorem applies to any measurable space, including discrete spaces like string sequences. String kernels are one of the most important kernel types and are explicitly covered by the theorem."}},{"section":"machine-learning","topicSlug":"support-vector-machines","topic":"Support Vector Machines","id":"ml-07015","difficulty":"hard","orderIndex":15,"question":"A company builds a fraud detection model comparing SVM-RBF vs a Random Forest. 
Both achieve 0.91 AUC after tuning. The SVM has 12,000 support vectors from 50,000 training samples. The Random Forest has 200 trees. For monthly retraining on 50,000 new samples, which model presents the more significant operational challenge and why?","options":{"A":"Random Forest retraining is harder because it requires 200 separate model files","B":"SVM retraining is operationally harder at scale — with $O(n^2)$ to $O(n^3)$ training complexity, 50,000 new samples requires solving a QP over 50,000 dual variables; warm-starting from the previous 12,000 support vectors is partially possible but not trivially implemented; Random Forest retraining is embarrassingly parallel and completes in minutes","C":"Both models retrain in identical time since they achieve the same AUC","D":"SVM retraining is trivial because only the 12,000 support vectors need to be updated, not all 50,000 samples"},"correct":"B","explanation":{"correct":"- SVM retraining: the QP problem scales at least $O(n^2)$ in memory (kernel matrix) and $O(n^2)$ to $O(n^3)$ in computation. For 50,000 samples, the kernel matrix alone would be $50,000^2 \\times 8$ bytes = 20GB. Full retraining is expensive.\n- Incremental SVM updates (adding new samples without full retraining) exist but are complex — they require re-solving the KKT conditions for changed support vectors and don't reduce complexity for large batch updates.\n- Random Forest retraining: each of 200 trees trains independently in parallel. On 10 cores, retraining 200 trees takes approximately $200/10 = 20$ tree-training times in parallel. Total time: minutes, not hours.","A":"200 model files are easily managed with a serialized ensemble. The number of files is not an operational challenge — the per-file complexity is low.","B":"","C":"Equal AUC does not imply equal retraining time. 
Model quality and training complexity are independent — a model can be fast to train and perform poorly, or slow to train and perform well.","D":"The support vectors from the previous model are not simply \"updated.\" New training data requires identifying new support vectors from all 50,000 current samples, not just the previous 12,000. Warm-starting from previous SVs reduces time but doesn't eliminate the quadratic scaling."},"reference":"- Joachims, \"Making Large-Scale SVM Learning Practical\" (SVMlight): http://svmlight.joachims.org/\n- sklearn SVM documentation: https://scikit-learn.org/stable/modules/svm.html"},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08001","difficulty":"easy","orderIndex":1,"question":"A KNN classifier with k=1 achieves 100% training accuracy. A colleague immediately concludes the model is excellent. What is wrong with this reasoning?","options":{"A":"k=1 is always the optimal value — 100% training accuracy confirms this","B":"With k=1, every training point is its own nearest neighbor, so the model always predicts the correct class for any training point — this is guaranteed regardless of signal; training accuracy with k=1 is trivially 100% and reveals nothing about generalization","C":"k=1 KNN cannot achieve 100% accuracy on training data due to tie-breaking rules","D":"100% training accuracy means the model has no variance, which is always desirable"},"correct":"B","explanation":{"correct":"- In KNN with k=1, the nearest neighbor of any training point is itself (distance = 0). The prediction for any training point is trivially correct — this is guaranteed by the algorithm's definition, not by the model learning anything useful.\n- This is identical to why training accuracy is a misleading metric for any memorizing model. 
The k=1 KNN is an extreme interpolator: it reproduces every training label exactly.\n- The appropriate evaluation is on a held-out test set or via leave-one-out cross-validation (which explicitly prevents a point from being its own neighbor).","A":"k=1 is rarely optimal. It maximizes variance: the decision boundary is highly irregular, adapting to every training point including noisy ones. Optimal k is found by validation.","B":"","C":"There are no tie-breaking issues for a single nearest neighbor. Ties occur when two neighbors are equidistant and k > 1. With k=1, the nearest point (itself, at distance 0) always wins.","D":"A model with k=1 has maximum variance — the decision boundary changes dramatically with small perturbations of training data. High training accuracy with high variance is the textbook overfitting scenario."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08002","difficulty":"easy","orderIndex":2,"question":"A KNN model is trained on customer data with two features: `age` (range 20-80) and `annual_income` (range \\$20,000–\\$500,000). The model performs poorly. After normalizing both features to [0,1], performance improves significantly. What does this reveal about KNN's sensitivity?","options":{"A":"KNN is sensitive to the number of training samples, not feature scale","B":"KNN uses distance metrics (Euclidean, Manhattan) to find nearest neighbors — before normalization, `annual_income` dominates the distance calculation because its absolute scale is 10,000× larger than `age`, effectively making `age` irrelevant; normalization gives both features equal influence on distance","C":"Normalization improved performance because KNN requires features to be normally distributed","D":"Feature scale only matters for KNN when k is larger than 10"},"correct":"B","explanation":{"correct":"- Euclidean distance: $d = \\sqrt{(\\Delta \\text{age})^2 + (\\Delta \\text{income})^2}$. 
With raw values: $\\Delta \\text{age} \\leq 60$ while $\\Delta \\text{income} \\leq 480,000$. The distance is dominated almost entirely by income: next to a \\$10,000 income difference, a 1-year age difference contributes only about $10^{-8}$ of the squared distance.\n- The model effectively ignores age and classifies based only on income proximity. This is a geometric artifact, not a feature relevance judgment.\n- After normalization to [0,1]: $\\Delta \\text{age}^2 + \\Delta \\text{income}^2$ where both terms are in [0,1] — both features contribute meaningfully to distance.","A":"KNN is highly sensitive to feature scale, not primarily to sample count. The issue here is geometric — the distance metric is the core operation, and scale imbalance distorts it.","B":"","C":"KNN has no distributional assumptions. It makes no use of feature distributions — only pairwise distances. Normality is irrelevant.","D":"Feature scale dominance affects KNN for any k. With k=1, a point dominated by high-income similarity would be assigned the nearest high-income neighbor regardless of age."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08003","difficulty":"easy","orderIndex":3,"question":"KNN is described as a \"lazy learner.\" What does this mean and what practical consequence does it have at inference time?","options":{"A":"KNN is lazy because it requires few hyperparameters compared to other models","B":"KNN does no computation during \"training\" — it only stores all training data; all computation (distance calculations to find neighbors) happens at inference time, making training instant but prediction slow for large datasets","C":"KNN is lazy because it produces approximate results rather than exact predictions","D":"KNN is lazy because it uses random sampling at inference time instead of computing exact distances"},"correct":"B","explanation":{"correct":"- A \"lazy learner\" defers computation to inference time. 
During \"training,\" KNN simply stores all $(x_i, y_i)$ pairs — $O(1)$ or $O(n)$ at most for storage. No model parameters are learned.\n- At inference for a new point $x$: compute distance to all $n$ training points ($O(nd)$), sort or partially sort to find k-nearest ($O(n \\log k)$), aggregate their labels ($O(k)$). Total: $O(nd)$ per query.\n- Contrast with eager learners (logistic regression, neural networks): they invest computation at training time to learn compact parameters; inference is then $O(d)$ — fast regardless of training set size.","A":"\"Lazy\" in ML has a specific technical meaning (deferred computation to inference), not a reference to hyperparameter complexity. KNN actually has few hyperparameters (k, distance metric), but that's a coincidence.","B":"","C":"Standard KNN computes exact distances to find exact nearest neighbors. \"Lazy\" refers to the timing of computation, not its precision. Approximate KNN (FAISS, HNSW) is a separate technique.","D":"KNN uses exact distance computation by default. Random sampling is a different technique (approximate nearest neighbor search) not part of standard KNN."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08004","difficulty":"easy","orderIndex":4,"question":"KNN achieves 85% accuracy with k=3 and 84% with k=5 on the validation set. 
A data scientist asks: \"should I always choose the k that gives highest validation accuracy?\" What is the risk of this approach?","options":{"A":"No risk — always choosing the highest validation accuracy is the correct model selection strategy","B":"Choosing k based on validation accuracy is valid, but testing many k values increases the chance of selecting a k that happens to fit the validation set by chance — cross-validation over k with a held-out test set gives a more reliable estimate","C":"k=3 is always better than k=5 because lower k means more neighbors are considered","D":"k should always be an odd number to avoid ties; k=3 is correct simply for this reason"},"correct":"B","explanation":{"correct":"- Selecting k by maximizing a single validation set's accuracy is subject to the same model selection overfitting risk as any hyperparameter search: the best k for one validation split may not be the best for the population distribution.\n- The risk is especially high when the validation set is small — a 1% accuracy difference between k=3 and k=5 on a small validation set can easily be within noise.\n- Best practice: use cross-validation to estimate validation accuracy for each k, select the k with the best cross-validated performance, then evaluate once on the test set.","A":"This is the model selection overfitting trap. Always choosing max validation accuracy without cross-validation or confidence intervals risks overfitting to the validation set.","B":"","C":"Lower k does not mean \"more neighbors are considered\" — it means fewer neighbors. k=3 considers 3 nearest neighbors; k=5 considers 5. The statement is factually backwards.","D":"Odd k avoids ties in binary classification but is not a reason to always prefer lower odd values. 
The optimal k depends on the dataset's noise level and class boundary complexity."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08005","difficulty":"easy","orderIndex":5,"question":"A KNN regression model predicts house prices. With k=1, the test RMSE is 85,000. With k=100, the test RMSE is 42,000. With k=1000, the test RMSE is 61,000. What does the U-shaped relationship between k and RMSE reveal?","options":{"A":"k=100 is optimal for all house price datasets","B":"Small k produces high variance (each prediction depends on a single noisy neighbor), large k produces high bias (predictions are averaged over too many dissimilar houses), and the optimal k balances this trade-off — this is a direct manifestation of the bias-variance trade-off in KNN","C":"The U-shape reveals that KNN is not suitable for regression tasks","D":"The U-shape is caused by an error in feature scaling — after normalization, the relationship would be monotone"},"correct":"B","explanation":{"correct":"- k=1: prediction = single nearest neighbor's price. One noisy or atypical neighbor causes large errors. High variance, low bias.\n- k=1000: prediction = average of 1000 neighbors, many of which may be in different neighborhoods or sizes. Predictions converge to a broad average, missing local patterns. Low variance, high bias.\n- k=100: captures local neighborhood structure with enough averaging to smooth noise. This is the sweet spot for this dataset.\n- This pattern is universal in KNN and illustrates the bias-variance trade-off geometrically: the \"neighborhood\" size controls smoothness vs. locality.","A":"k=100 is optimal for this specific dataset. The optimal k is data-dependent. For a different city with denser similar housing, a larger k might be optimal.","B":"","C":"The U-shape is evidence that KNN can do regression and has a sweet spot — it does not indicate unsuitability. 
The task is to find the right k via validation.","D":"Feature scaling affects distance calculations but doesn't change the fundamental bias-variance behavior of k. The U-shape appears regardless of feature scale (after proper scaling, the optimal k may shift, but the U-shape persists)."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08006","difficulty":"medium","orderIndex":6,"question":"A KNN model with Euclidean distance achieves 88% accuracy on 50-dimensional data. After applying PCA to reduce to 10 dimensions, KNN accuracy increases to 93%. Explain the mechanism, and what phenomenon does this illustrate?","options":{"A":"PCA adds new information that was missing from the original 50 features","B":"In high-dimensional spaces, distances between points concentrate (all points become nearly equidistant), making nearest-neighbor search meaningless — PCA removes noise dimensions and retains the 10 most informative directions, making distances more discriminative; this is the curse of dimensionality","C":"PCA's normalization step is what improves KNN — the accuracy gain is not from dimensionality reduction but from standardization","D":"50 features is always too many for KNN — the algorithm is designed for at most 20 features"},"correct":"B","explanation":{"correct":"- The curse of dimensionality: as dimension $d$ increases, the volume of space grows exponentially. In high dimensions, all pairwise distances converge: $\\frac{\\max_{\\text{dist}} - \\min_{\\text{dist}}}{\\min_{\\text{dist}}} \\to 0$ as $d \\to \\infty$. The notion of \"nearest\" neighbor loses meaning.\n- With 50 dimensions, 40 of which may be noisy or irrelevant, Euclidean distances are dominated by noise contributions. Two actually similar points appear far apart due to noise in irrelevant dimensions.\n- PCA projects onto the 10 directions of maximum variance — presumably the signal dimensions. 
In this lower-dimensional space, distances are more meaningful and nearest neighbors are more likely to be genuinely similar.","A":"PCA is a dimensionality reduction technique — it cannot add information that wasn't in the original data. It only retains a subspace of the original feature space.","B":"","C":"PCA mean-centers the features but does not standardize them unless scaling is applied separately, so standardization cannot explain the gain. The mechanism here is dimensionality reduction removing noise dimensions, which mitigates the curse of dimensionality.","D":"KNN has no hard feature limit. The algorithm works at any dimensionality, but performance degrades with irrelevant dimensions. The challenge is empirical, not algorithmic."},"reference":"- Beyer et al., \"When is Nearest Neighbor Meaningful?\": https://link.springer.com/chapter/10.1007/3-540-49257-7_15"},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08007","difficulty":"medium","orderIndex":7,"question":"A KNN model must serve 1,000 queries per second on a dataset of 1 million training samples with 100 features. A brute-force KNN implementation takes 200ms per query. A team considers using a KD-tree or a ball-tree. Under what conditions would the KD-tree fail to provide speedup over brute-force?","options":{"A":"KD-trees always provide the same speedup regardless of dimension","B":"KD-tree performance degrades severely in high dimensions — its expected query complexity is O(kd × n^(1-1/d)), which approaches O(n) as d increases; for d=100, a KD-tree provides essentially no speedup over brute-force, and approximate methods (HNSW, FAISS IVF) are needed","C":"KD-trees fail when the dataset has more than 10,000 samples","D":"KD-trees fail when k (number of neighbors) is larger than 5"},"correct":"B","explanation":{"correct":"- KD-trees split the feature space along axes recursively. 
In low dimensions (d ≤ 20), they efficiently prune branches and achieve $O(k \\log n)$ query time.\n- As dimension increases, the number of KD-tree cells that could contain nearest neighbors grows exponentially. For $d = 100$, nearly every leaf must be checked — the tree degenerates to a brute-force search.\n- Ball-trees handle moderate dimensions slightly better (up to d~40) because their splits are based on hypersphere geometry rather than axis-aligned hyperplanes. For d=100, even ball-trees struggle. Approximate nearest neighbor libraries (HNSW, FAISS) are the practical solution.","A":"KD-tree speedup is dimension-dependent. The key insight is that the tree structure becomes ineffective in high dimensions — a crucial practical consideration.","B":"","C":"KD-trees efficiently handle millions of samples in low dimensions. Sample count is not the limiting factor — dimensionality is.","D":"The number of neighbors k affects constant factors in KD-tree query time but is not the primary failure mode. The dimension curse dominates."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08008","difficulty":"medium","orderIndex":8,"question":"A KNN classifier uses Euclidean distance. A new feature `binary_flag` (values 0 or 1) is added to a feature set of continuous measurements. After adding this feature, model accuracy drops. The team scales all continuous features to [0,1] but the flag is already in [0,1]. 
What is the likely cause of the accuracy drop?","options":{"A":"Binary features cannot be used with Euclidean distance","B":"The binary feature contributes the same maximum distance (1) as continuous features, but it represents a fundamentally different type of difference — a 0-vs-1 binary flip may be less meaningful than a 0.01 difference in a continuous feature, or vice versa; Euclidean distance treats all [0,1] features identically regardless of semantic meaning","C":"The accuracy drop is unrelated to the new feature — it is caused by normalization changing existing features","D":"Binary features must be one-hot encoded before use with KNN regardless of binary values"},"correct":"B","explanation":{"correct":"- After [0,1] scaling, Euclidean distance treats a binary flip (0→1 in `binary_flag`) as the same distance as a full range change in a continuous feature. But the semantic meaning differs: the binary flag might represent \"premium vs standard\" — a categorical distinction — while continuous features represent gradual change.\n- If the binary flag is a noisy proxy or introduces class-irrelevant variation, it adds distance noise that misleads the nearest-neighbor search.\n- Solutions: use different feature weights (weighted KNN), use a distance metric appropriate for mixed data types (Gower distance), or assess feature importance before adding binary flags.","A":"Binary features can be used with Euclidean distance. The issue is not mathematical incompatibility but semantic mismatch between binary semantics and continuous distance interpretation.","B":"","C":"Normalization of continuous features affects their distance contribution, but the question specifies they were already scaled to [0,1]. 
The drop specifically correlates with adding the binary flag.","D":"One-hot encoding a feature that already takes values 0 and 1 yields two complementary columns that carry the same information as the original flag. It adds nothing new and, under Euclidean distance, would double the flag's contribution to the squared distance rather than fix the problem."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08009","difficulty":"medium","orderIndex":9,"question":"You are building a product recommendation system using KNN. You find that k=10 gives 0.82 AUC and k=50 gives 0.85 AUC. Your dataset has 500 samples. A data scientist says \"larger k is always better — more neighbors means more information.\" Is this correct?","options":{"A":"Correct — more neighbors always provides more information for any dataset","B":"Incorrect — as k approaches n (total samples), the prediction converges to the majority class for all inputs, ignoring all local structure; in a small dataset of 500 samples, k=50 uses 10% of all data per prediction, which may already be approaching the \"averaging out local structure\" regime","C":"Correct but only when n > 1000 samples; for small datasets k must be minimized","D":"Incorrect only because the dataset is small; for large datasets more neighbors is always better"},"correct":"B","explanation":{"correct":"- As k increases toward n: KNN predictions become increasingly global averages rather than local patterns. For k=n, every new point gets the same prediction (majority class), ignoring features entirely.\n- With 500 samples, k=50 means each prediction is determined by 10% of all training data. This smooths out local patterns and may reduce sensitivity to the specific features that matter for recommendation.\n- The optimal k balances locality (small k captures local patterns) against stability (large k averages out noise). 
The optimal value is always dataset-dependent and should be found via cross-validation.","A":"\"More neighbors = more information\" fails when the additional neighbors are from different classes or distributions than the query point's true neighborhood. Quality of neighbors matters more than quantity.","B":"","C":"There is no sample-count threshold that determines whether larger k is universally better. The relationship depends on the signal-to-noise ratio in the dataset, not the absolute size.","D":"For large datasets, larger k can still introduce the same high-bias problem by averaging over distant, dissimilar neighbors. The optimal k scales roughly as $\\sqrt{n}$ as a heuristic, not proportionally to n."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08010","difficulty":"medium","orderIndex":10,"question":"A KNN model for credit scoring must be deployed in a regulated environment. A compliance officer asks: \"explain why this applicant was denied.\" The ML engineer says \"our k=5 KNN model found these 5 similar applicants who all defaulted.\" Is this a valid explanation for regulatory purposes?","options":{"A":"Yes — citing 5 similar historical cases is an intuitive and complete explanation","B":"Partially — example-based explanations are intuitive but may fail regulatory requirements that specify feature-level adverse action reasons (e.g., \"denied because of high debt-to-income ratio\"); the 5 neighbors explain similarity in distance space, but don't identify which specific features drove the similarity","C":"No — KNN cannot be used in regulated industries because it has no explainability whatsoever","D":"The explanation is complete because KNN predictions are based on data, not complex math"},"correct":"B","explanation":{"correct":"- KNN's example-based explanation (\"these similar cases all defaulted\") is intuitive and has face validity. 
However, it doesn't answer \"which specific features make these cases similar?\" — a question regulators require.\n- For ECOA/FCRA adverse action notices, lenders must specify specific reasons: \"denied because of high debt-to-income ratio, insufficient credit history.\" KNN distance similarity doesn't directly map to feature-level reasons.\n- To achieve both, you could augment KNN explanations with feature contribution analysis: which features contributed most to the distance between the applicant and the nearest neighbors?","A":"Example-based explanation is intuitive but not always sufficient. \"Similar past cases\" doesn't identify the legally required specific adverse action factors.","B":"","C":"KNN has the valuable property of example-based explanations — showing similar cases is a form of transparency. Many regulated industries use KNN precisely because of this interpretability. The issue is granularity, not absence of explainability.","D":"Having an explanation based on data doesn't automatically satisfy regulatory requirements for specific feature-level reasons. The regulatory standard is more specific than \"data-driven.\""}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08011","difficulty":"hard","orderIndex":11,"question":"You train a KNN model on a 500-dimensional dataset where the true decision boundary depends only on 3 features. KNN achieves poor performance despite the true signal being strong. A colleague suggests using a Mahalanobis distance instead of Euclidean. 
Why would this help, and what does the Mahalanobis distance compute?","options":{"A":"Mahalanobis distance is faster to compute, which improves KNN performance","B":"Mahalanobis distance accounts for feature covariance — it scales distances by the inverse of the covariance matrix $d_M(x,y) = \\sqrt{(x-y)^T \\Sigma^{-1} (x-y)}$; this de-correlates features and normalizes by variance, reducing the influence of redundant and noisy high-variance features on neighbor selection","C":"Mahalanobis distance is equivalent to Euclidean distance after mean centering","D":"Mahalanobis distance removes irrelevant features by setting their weight to zero"},"correct":"B","explanation":{"correct":"- Euclidean distance in 500 dimensions is dominated by the 497 irrelevant features (each contributing a noise term). Mahalanobis distance stretches or shrinks the space according to the inverse covariance matrix: low-variance features (often uninformative constant features) are amplified; high-variance correlated features are treated jointly.\n- The effect: features that are noisy or redundant contribute less to the Mahalanobis distance, while informative features (with variance aligned with class differences) contribute more.\n- However, Mahalanobis distance doesn't explicitly identify the 3 relevant features. For truly irrelevant features, explicit feature selection or metric learning (learning the optimal distance matrix) is more effective.","A":"Mahalanobis distance requires computing $\\Sigma^{-1}$ (a $500 \\times 500$ matrix) and matrix-vector products — it is significantly more expensive than Euclidean distance, not faster.","B":"","C":"Mahalanobis distance is not equivalent to Euclidean after mean centering. Mean centering removes the bias term but doesn't account for variance or covariance. Mahalanobis requires the full inverse covariance matrix.","D":"Mahalanobis distance doesn't zero out irrelevant features — it reweights them by their inverse variance/covariance. 
A feature with high variance (even if irrelevant) might still have non-zero contribution to Mahalanobis distance."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08012","difficulty":"hard","orderIndex":12,"question":"A KNN classifier is applied to a time-series classification task: given the last 30 days of stock returns (30-dimensional feature vector), classify the next day as up or down. KNN with Euclidean distance and k=10 achieves only 51% accuracy (random baseline). A finance researcher suggests using Dynamic Time Warping (DTW) as the distance metric. What problem with Euclidean distance does DTW solve?","options":{"A":"DTW is faster than Euclidean distance for 30-dimensional vectors","B":"Euclidean distance compares point-by-point (day 1 vs day 1, day 2 vs day 2) — for time series, similar patterns may be time-shifted or stretched; DTW finds the optimal alignment between two sequences, allowing comparison of temporally shifted patterns and making KNN sensitive to pattern shape rather than exact position","C":"DTW normalizes the feature vectors, which Euclidean distance cannot do","D":"Euclidean distance cannot handle negative values (stock returns can be negative), but DTW can"},"correct":"B","explanation":{"correct":"- Euclidean distance between two time series requires exact temporal alignment. Two otherwise identical stock patterns where one is shifted by 2 days (a common occurrence) would appear very dissimilar by Euclidean distance.\n- DTW finds the best alignment by allowing \"warping\" — matching each point in one series to the most similar point in the other, within a warping window constraint. This captures pattern similarity regardless of temporal shifts.\n- In financial time series, patterns like \"three-day rally followed by consolidation\" are meaningful regardless of exact timing. 
DTW makes KNN sensitive to these patterns.","A":"DTW is significantly slower than Euclidean distance — it requires $O(n^2)$ dynamic programming per pair, compared to $O(d)$ for Euclidean. Speed is not the motivation.","B":"","C":"DTW does not inherently normalize features. Normalization is a separate step. Both Euclidean and DTW can be applied to normalized or unnormalized series.","D":"Euclidean distance handles negative values correctly — $(x_i - y_i)^2$ is always non-negative regardless of the sign of $x_i$ or $y_i$. Negative values are not an issue for Euclidean distance."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08013","difficulty":"hard","orderIndex":13,"question":"A KNN model is trained on a dataset with severe class imbalance: 95% class 0 and 5% class 1. With k=10, almost every prediction is class 0 because 10 nearest neighbors are dominated by class 0 samples. A developer says \"reduce k to 1 to fix this.\" What is the better approach and why does reducing k to 1 create a different problem?","options":{"A":"Reducing k to 1 is the correct fix — use the single nearest neighbor to avoid majority class dominance","B":"Reducing k to 1 maximizes variance and makes the model sensitive to individual noisy class-1 samples; the better fix is class-weighted KNN (weighting neighbors inversely by class frequency) or combining KNN with oversampling of the minority class to balance the neighborhoods","C":"Class imbalance does not affect KNN — it only affects accuracy-based metrics","D":"The fix is to increase k to 50 to include more class-1 samples in each neighborhood"},"correct":"B","explanation":{"correct":"- With k=1, the prediction for any test point is the label of its single nearest training neighbor. 
For a test point near the class-0 majority, the nearest neighbor is class 0 — the problem persists in dense majority regions.\n- Additionally, k=1 is highly sensitive to noise: any class-1 point near a class-0 region (or vice versa) will cause misclassifications in its neighborhood.\n- Class-weighted KNN: weight the vote of each neighbor by $1 / P(\\text{class})$ (inverse frequency weighting), giving class-1 neighbors more voting power. Alternatively, oversample class-1 training points to create a balanced neighborhood distribution.","A":"k=1 doesn't fix the imbalance problem in regions dominated by class 0. In majority-class regions, the nearest neighbor is almost always class 0 even with k=1.","B":"","C":"Class imbalance directly affects KNN by making neighborhoods statistically biased toward the majority class. This is a data representation problem that affects neighbor vote aggregation.","D":"Increasing k to 50 makes the problem worse — with a 95/5 imbalance, 50 neighbors will on average contain about 47 or 48 class-0 samples, making class-0 predictions nearly universal."}},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08014","difficulty":"hard","orderIndex":14,"question":"A team wants to use KNN for a recommendation system with 10 million users and 100 features. They need sub-50ms response time for nearest-neighbor queries. They evaluate exact KNN (brute force), HNSW (hierarchical navigable small world graph), and an IVF (inverted file index) approach. Exact KNN achieves 100% recall but takes 2 seconds per query. HNSW achieves 99.2% recall in 3ms. IVF achieves 98.1% recall in 8ms. 
What is the right framework for making this trade-off decision?","options":{"A":"Always use exact KNN — 0.8% recall drop from HNSW is unacceptable for production","B":"For 10M users requiring sub-50ms latency, approximate nearest neighbor (ANN) methods are the only viable choice — the trade-off is recall vs latency; HNSW's 99.2% recall at 3ms is likely acceptable for recommendations (losing 1% of truly relevant items is invisible to users) while meeting the latency SLA; the exact method is infeasible at the required throughput","C":"IVF should always be preferred over HNSW because it uses less memory","D":"The decision should be made based solely on which algorithm is easiest to implement"},"correct":"B","explanation":{"correct":"- Exact KNN with 10M users and 100 features costs $O(nd) = 10^9$ operations per query. The measured 2 seconds is 40× over the 50ms SLA, so no realistic amount of tuning closes the gap for a brute-force scan.\n- HNSW builds a hierarchical graph where each node connects to its approximate neighbors at multiple scales. At 3ms and 99.2% recall, it provides excellent accuracy with a roughly 667× speedup over exact search. The 0.8% recall gap means ~1 in 125 truly relevant items is missed — imperceptible in recommendation user experience.\n- The decision framework: identify the minimum acceptable recall for the application (recommendations: 99%+ is comfortable; medical image retrieval: 100% may be required), then find the ANN method meeting that recall threshold within the latency SLA.","A":"\"Always use exact KNN\" ignores the fundamental infeasibility of 2-second latency for real-time recommendations. 99.2% recall at 3ms is excellent for user-facing systems.","B":"","C":"HNSW vs IVF is a trade-off between recall, latency, and memory. HNSW typically offers better recall/speed trade-offs for dense data. IVF is better for very large datasets where HNSW's graph construction memory is prohibitive. 
The choice is not universally in favor of either.","D":"Implementation ease is never the primary criterion for production system design. Correctness, performance, and reliability requirements drive the decision."},"reference":"- Malkov & Yashunin, \"Efficient and Robust Approximate Nearest Neighbor Search Using HNSW\": https://arxiv.org/abs/1603.09320\n- FAISS documentation: https://faiss.ai/"},{"section":"machine-learning","topicSlug":"k-nearest-neighbors","topic":"K Nearest Neighbors","id":"ml-08015","difficulty":"hard","orderIndex":15,"question":"A KNN model with Manhattan distance (L1) is compared to the same model with Euclidean distance (L2) on a 1,000-dimensional dataset. The L1 model achieves significantly higher accuracy. Provide a precise geometric explanation for why L1 distance can outperform L2 in high dimensions.","options":{"A":"L1 distance is always superior to L2 distance in any dimension","B":"In high dimensions, L2 distance is dominated by the largest individual feature differences (the squared terms amplify outliers) while L1 distance sums absolute differences linearly — this makes L2 sensitive to a few noisy dimensions, while L1 distributes sensitivity more evenly; L1 is more robust to irrelevant noisy features in high-dimensional spaces","C":"L1 distance is faster to compute than L2, which is why it achieves higher accuracy","D":"L2 distance cannot handle more than 100 dimensions mathematically"},"correct":"B","explanation":{"correct":"- L2 distance: $\\sqrt{\\sum (x_i - y_i)^2}$. The squaring amplifies large individual differences — a single noisy dimension with a large difference dominates the total distance.\n- L1 distance: $\\sum |x_i - y_i|$. Linear sum — no single dimension is disproportionately amplified. In high dimensions with many irrelevant features, L1 averages the noise more uniformly.\n- Theoretical support: the concentration of measure phenomenon affects L2 more severely than L1. 
The ratio of maximum to minimum pairwise distances (the \"relative contrast\") degrades faster for L2 than L1 as dimension increases, making L1 distances more discriminative.","A":"L2 is superior to L1 in many low-dimensional settings, particularly when the data geometry is spherical or when larger differences are genuinely more important. Neither metric is universally superior.","B":"","C":"L1 computation (no square root, no squaring) is marginally faster than L2, but the accuracy improvement comes from the geometric property of noise robustness, not from computational speed.","D":"L2 distance is mathematically defined for any dimension. The practical challenge is interpretability and concentration of measure, not a mathematical limit."},"reference":"- Aggarwal et al., \"On the Surprising Behavior of Distance Metrics in High Dimensional Space\": https://link.springer.com/chapter/10.1007/3-540-44503-X_27"},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09001","difficulty":"easy","orderIndex":1,"question":"A Naive Bayes spam classifier assigns probability 0.97 to an email being spam. The raw computation is: $P(\\text{spam}|\\text{words}) \\propto P(\\text{spam}) \\times \\prod_{i} P(w_i|\\text{spam})$. 
A developer asks: \"where does the 'naive' come from?\" What is the correct answer?","options":{"A":"The algorithm is \"naive\" because it uses a simple decision rule: classify spam if probability > 0.5","B":"The algorithm assumes all features (words) are conditionally independent given the class — $P(w_1, w_2, ..., w_n|\\text{spam}) = \\prod P(w_i|\\text{spam})$ — this is the \"naive\" assumption because in reality words co-occur and are correlated","C":"The algorithm is naive because it ignores the email body and only uses the subject line","D":"The algorithm assumes equal prior probabilities for all classes, which is a simplification"},"correct":"B","explanation":{"correct":"- Bayes theorem gives: $P(\\text{class}|\\text{features}) \\propto P(\\text{class}) \\times P(\\text{features}|\\text{class})$. Computing $P(\\text{features}|\\text{class})$ for a 1000-word vocabulary requires modeling the full joint distribution — intractable.\n- The \"naive\" assumption: all features are conditionally independent given the class. This factorizes the joint: $P(f_1, ..., f_n | c) = \\prod P(f_i | c)$. Each term is easy to estimate.\n- This assumption is almost always false in reality (words co-occur: \"machine\" and \"learning\" appear together more than randomly). Yet Naive Bayes works surprisingly well in practice because calibrated probabilities are not required for correct class ranking.","A":"The 0.5 threshold is a standard binary classification decision rule, not specific to Naive Bayes. \"Naive\" refers to the independence assumption, not the threshold.","B":"","C":"Naive Bayes classifiers for text typically use all words in the email (bag-of-words). Ignoring the body would be a design choice, not the definition of \"naive.\"","D":"Naive Bayes uses prior probabilities estimated from class frequency in training data — not assumed equal. 
The prior is an explicit learned component, not a simplification of equal priors."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09002","difficulty":"easy","orderIndex":2,"question":"A Naive Bayes classifier is trained for medical diagnosis. The word \"fever\" appears in 80% of disease-positive training documents and 20% of disease-negative documents. The prior probability is $P(\\text{disease}) = 0.01$ (1% base rate). A new patient report mentions only \"fever.\" Which class does Naive Bayes predict, and is the output probability reliable?","options":{"A":"Disease, with probability 0.80 — the model is well-calibrated","B":"No disease — $P(\\text{no disease} | \\text{fever}) \\propto 0.99 \\times 0.20 = 0.198$ vs $P(\\text{disease} | \\text{fever}) \\propto 0.01 \\times 0.80 = 0.008$; the low disease prior overwhelms the likelihood, predicting no disease; the output probability is often unreliable but the class prediction is correct in this case","C":"Disease — the feature likelihood ratio 80/20 = 4 always overrides the prior","D":"The classifier cannot make a prediction because only one feature was provided"},"correct":"B","explanation":{"correct":"- Posterior ∝ Prior × Likelihood: $P(\\text{disease}|\\text{fever}) \\propto 0.01 \\times 0.8 = 0.008$; $P(\\text{no disease}|\\text{fever}) \\propto 0.99 \\times 0.2 = 0.198$. Normalized: $P(\\text{disease}|\\text{fever}) = 0.008/(0.008+0.198) \\approx 0.039$.\n- The model correctly predicts \"no disease\" because the prior is so low. This illustrates why the base rate matters (ignoring it is base rate neglect): an 80% likelihood can still yield a low posterior when the prior is 1%.\n- The output probability (≈3.9%) may be miscalibrated due to the naive independence assumption — but the directional class prediction (no disease) is correct.","A":"The prior $P(\\text{disease}) = 0.01$ dominates. 0.80 is the likelihood $P(\\text{fever}|\\text{disease})$, not the posterior probability. 
Naive Bayes computes the posterior, which includes the prior.","B":"","C":"The likelihood ratio (4:1) does not override the prior. Bayes' theorem multiplies likelihood by prior. A 4:1 likelihood ratio combined with 1:99 prior odds for disease produces 4:99 posterior odds for disease (about 3.9%).","D":"Naive Bayes can make predictions from any number of features, including just one. It would simply use $P(\\text{class}) \\times P(\\text{fever}|\\text{class})$ — one feature is sufficient."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09003","difficulty":"easy","orderIndex":3,"question":"A Multinomial Naive Bayes model is trained on text data. The word \"unicorn\" never appears in the training corpus. At test time, an email contains \"unicorn.\" Without smoothing, what happens to the model's prediction?","options":{"A":"The word is ignored — the model predicts based on the other words","B":"$$P(\\text{unicorn}|\\text{class}) = 0$ for all classes; since the product $\\prod P(w_i|\\text{class})$ includes a zero term, the posterior becomes zero for every class — the model cannot make any prediction (all class probabilities are zero)","C":"The model assigns P(unicorn|class) = 0.5 as a default for unseen words","D":"The model raises an error because unseen vocabulary is not supported"},"correct":"B","explanation":{"correct":"- Multinomial NB computes the product of word likelihoods: $P(\\text{class}|\\text{doc}) \\propto P(\\text{class}) \\times \\prod_i P(w_i|\\text{class})$.\n- $P(\\text{unicorn}|\\text{class}) = 0/N_{\\text{class}} = 0$ because \"unicorn\" has zero count. The product becomes $P(\\text{class}) \\times 0 \\times P(\\text{other words}) = 0$ for every class.\n- Laplace smoothing (add-one smoothing) fixes this: $P(w|\\text{class}) = \\frac{\\text{count}(w, \\text{class}) + 1}{N_{\\text{class}} + |V|}$ where $|V|$ is vocabulary size. 
This ensures no word has zero probability.","A":"Standard Multinomial NB doesn't skip words — every word in the document is multiplied into the posterior. Ignoring unseen words would require explicit out-of-vocabulary handling (which is a modification, not the default behavior).","B":"","C":"Default probability of 0.5 for unseen words is not how standard NB works. Laplace smoothing uses $1/(N+|V|)$, not 0.5, to maintain the multinomial distribution property.","D":"Naive Bayes doesn't raise errors on unseen vocabulary — it mathematically produces 0 probability, which causes the prediction to be undefined. This is a silent failure, not an error."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09004","difficulty":"easy","orderIndex":4,"question":"You compare Gaussian Naive Bayes (GNB) and Multinomial Naive Bayes (MNB) for classifying customer support tickets by category. The tickets are represented as TF-IDF vectors (continuous values). Which model is more appropriate and why?","options":{"A":"Multinomial NB is always better for text — it was designed specifically for this case","B":"Both models have trade-offs: Multinomial NB assumes non-negative integer counts (word frequencies), making it well-suited for raw count vectors; TF-IDF produces continuous non-negative values, for which Gaussian NB (assuming continuous Gaussian features) or Complement NB is more appropriate; MNB technically applies to TF-IDF but assumes a multinomial distribution that doesn't perfectly fit continuous weights","C":"Gaussian NB is always better than Multinomial NB for classification tasks","D":"Neither model can handle text classification — a deep learning model is required"},"correct":"B","explanation":{"correct":"- Multinomial NB models the document likelihood as $P(\\text{doc} | c) \\propto \\prod_i p_{ic}^{x_i}$, where $p_{ic}$ is the probability of word $i$ under class $c$ and $x_i$ is the count of word $i$ in the document. This assumes integer count data (bag of words). 
TF-IDF values are continuous and not integer counts — MNB treats them as counts approximately.\n- Gaussian NB models each feature as $P(x_i | \\text{class}) = \\mathcal{N}(\\mu_{ic}, \\sigma_{ic}^2)$. For TF-IDF, this may not fit well because TF-IDF values are highly skewed (many zeros, some large values).\n- Complement NB (a variant of MNB) often works best for text; Bernoulli NB works for binary presence/absence. The choice should be empirically validated on the specific task.","A":"MNB was designed for count vectors, not TF-IDF. For raw bag-of-words counts, MNB is the natural choice. For TF-IDF, the match is approximate.","B":"","C":"Gaussian NB assumes normally distributed features, which is often violated for text features (sparse, skewed distributions). GNB is not universally better for text.","D":"Naive Bayes is a well-established and effective approach for text classification. Deep learning is not required — NB is often a strong baseline."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09005","difficulty":"easy","orderIndex":5,"question":"Laplace smoothing with parameter α=1 is applied to a Naive Bayes model on a vocabulary of 10,000 words. Training corpus for class \"spam\" has 1,000 total word tokens. The word \"discount\" appears 50 times in spam. What is the smoothed probability $P(\\text{discount}|\\text{spam})$?","options":{"A":"50/1000 = 0.05","B":"$$(50 + 1) / (1000 + 10000) = 51/11000 \\approx 0.00464$","C":"$$(50 + 1) / (1000 + 1) = 51/1001 \\approx 0.051$","D":"$$50 / (1000 + 10000) = 50/11000 \\approx 0.00454$"},"correct":"B","explanation":{"correct":"- Laplace smoothing formula: $P(w|\\text{class}) = \\frac{\\text{count}(w, \\text{class}) + \\alpha}{\\sum_w \\text{count}(w, \\text{class}) + \\alpha|V|}$.\n- Numerator: $50 + 1 = 51$. Denominator: $1000 + 1 \\times 10000 = 11000$.\n- Result: $51/11000 \\approx 0.00464$. 
The denominator adds $\\alpha \\times |V|$ (not just $\\alpha$) to ensure probabilities sum to 1 across the entire vocabulary.","A":"Unsmoothed MLE — this ignores Laplace smoothing and would give zero for unseen words.","B":"","C":"Only adds α once to the denominator, not $\\alpha \\times |V|$. This is a common mistake — the smoothing must be applied consistently across all vocabulary terms to maintain valid probability distributions.","D":"Correct denominator but missing the α in the numerator. Laplace smoothing adds α to both the numerator count and the denominator sum."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09006","difficulty":"medium","orderIndex":6,"question":"A Naive Bayes email classifier achieves 94% precision on spam detection. A data scientist says \"Naive Bayes works well because its independence assumption holds for email text.\" A researcher disagrees. Why does Naive Bayes work well in practice despite the assumption being violated?","options":{"A":"The independence assumption actually holds for text data — words are statistically independent","B":"Naive Bayes requires only correct class ranking, not calibrated probabilities — even with correlated features, if the posterior $P(\\text{spam}|\\text{words})$ consistently ranks spam above non-spam for spam emails, classification is correct; the correlated features' violation affects probability magnitude but not necessarily the direction of class ranking","C":"94% precision means the independence assumption is valid for this specific dataset","D":"Naive Bayes corrects for dependence automatically through Laplace smoothing"},"correct":"B","explanation":{"correct":"- The naive independence assumption is almost always false for text — \"machine\" and \"learning\" co-occur far more often than independence predicts. 
The model's probability estimates are therefore miscalibrated (too extreme).\n- But classification only requires: argmax over classes of the posterior. If the model consistently assigns higher (even if miscalibrated) probability to the correct class, predictions are correct.\n- Theoretical analysis (Domingos & Pazzani 1997): NB can be optimal under zero-one loss even when the independence assumption is violated, because only the identity of the highest-scoring class matters, not the accuracy of the probability estimates.","A":"Words in text are highly correlated — \"New York\" almost always appears together, \"credit card\" is a common phrase. Independence is definitively violated for text.","B":"","C":"94% precision is evidence of good classification performance, not of the independence assumption holding. The assumption can be violated while performance is high.","D":"Laplace smoothing handles zero probabilities for unseen words — it does not correct for feature dependence. These are separate issues."},"reference":"- Domingos & Pazzani, \"On the Optimality of the Simple Bayesian Classifier under Zero-One Loss\": https://link.springer.com/article/10.1023/A:1007413511361"},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09007","difficulty":"medium","orderIndex":7,"question":"A Naive Bayes classifier outputs $P(\\text{class=1}|\\text{features}) = 0.9999$ for most samples. A calibration plot shows that among samples predicted at 0.9999, only 72% are actually class 1. 
What structural property of Naive Bayes causes this extreme overconfidence?","options":{"A":"99.99% probability with 72% actual rate is within normal statistical variation — the model is well-calibrated","B":"Correlated features are counted multiple times in the product $\\prod P(f_i|\\text{class})$ — if \"machine\" and \"learning\" both appear (highly correlated), each contributes independently to the product, artificially inflating the probability toward extreme values (near 0 or 1)","C":"Naive Bayes outputs are always overconfident — it is a known limitation that cannot be remedied","D":"The overconfidence is caused by Laplace smoothing inflating all probabilities toward extreme values"},"correct":"B","explanation":{"correct":"- The product $\\prod P(f_i | c)$ of many near-independent terms concentrates near 0 or 1 by the central limit theorem on log-scale. With correlated features, the same information is effectively counted multiple times, pushing products to extreme values.\n- Example: \"spam\" email contains \"discount\", \"offer\", \"deal\" — all highly correlated spam indicators. Naive Bayes multiplies these as if independent, overestimating the probability of spam far beyond the true conditional probability.\n- This is why Naive Bayes is often combined with Platt scaling or isotonic regression to calibrate probabilities — the class predictions may be correct, but the probability outputs require post-hoc calibration.","A":"A roughly 28-point gap (99.99% predicted vs 72% actual) is severe miscalibration, not statistical variation. This is a systematic overconfidence pattern, not noise.","B":"","C":"Overconfidence can be remedied. Calibration methods (Platt scaling, temperature scaling) correct NB's overconfidence by mapping raw outputs to calibrated probabilities. The limitation is not irremediable.","D":"Laplace smoothing moves probabilities away from 0 and 1 (it prevents zero probabilities). 
It does not cause overconfidence — it slightly reduces extreme values."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09008","difficulty":"medium","orderIndex":8,"question":"A Naive Bayes model is trained for news topic classification (5 classes). Training data: 10,000 sports articles, 1,000 politics articles. At test time, a politically neutral article gets classified as \"sports\" even though it contains political keywords. What is the most likely cause?","options":{"A":"The model has a bug in the likelihood computation","B":"The prior $P(\\text{sports}) = 10,000/11,000 \\approx 0.91$ strongly dominates — even when $P(\\text{words}|\\text{politics}) > P(\\text{words}|\\text{sports})$, the large sports prior can overwhelm the likelihood ratio; this is prior dominance in imbalanced training data","C":"Naive Bayes always classifies based on the most frequent class — this is expected behavior","D":"Political keywords have zero probability in all classes because they weren't seen in training data"},"correct":"B","explanation":{"correct":"- Prior dominance: $P(\\text{sports}) \\approx 0.91$, $P(\\text{politics}) \\approx 0.09$. Even a 10:1 likelihood ratio in favor of politics gives: $P(\\text{politics}|\\text{doc}) \\propto 0.09 \\times 10 = 0.9$ vs $P(\\text{sports}|\\text{doc}) \\propto 0.91 \\times 1 = 0.91$. With these rounded priors, sports still wins narrowly at a 10:1 likelihood ratio (with the exact priors it is a tie); the politics class needs a likelihood ratio above 10:1 to overcome the prior.\n- This is a training data imbalance problem. 
The model effectively needs very strong political signal to overcome the sports prior.\n- Solutions: adjust class priors to reflect true expected distribution (not training imbalance), use class weights, or downsample the majority class.","A":"The behavior is mathematically correct Naive Bayes — it is a consequence of the prior × likelihood computation, not a bug.","B":"","C":"Naive Bayes doesn't always predict the most frequent class — it predicts the class with the highest posterior. When the likelihood ratio is large enough, the minority class can win. The problem is when the likelihood ratio is insufficient to overcome the prior.","D":"Political keywords appear in training data (1,000 politics articles) — they have non-zero likelihood for the politics class. The issue is the low prior, not zero likelihoods."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09009","difficulty":"medium","orderIndex":9,"question":"A Bernoulli Naive Bayes model and a Multinomial Naive Bayes model are both trained on the same text data. The Bernoulli model represents documents as binary vectors (word present or absent). The Multinomial model uses word counts. On short documents (tweets, 280 chars), Bernoulli NB outperforms Multinomial NB. 
Why?","options":{"A":"Bernoulli NB is always better than Multinomial NB — count information is never useful","B":"In short documents, most words appear at most once — count and presence are nearly identical; but Bernoulli NB explicitly models absent words (contributes $P(w=0|\\text{class})$ for words not in the document), which adds discriminative signal about what is NOT present; Multinomial NB ignores absent words, losing this signal in short documents","C":"Multinomial NB is computationally slower on short documents, which is why Bernoulli appears better","D":"Short documents violate the Multinomial distribution assumption, causing Multinomial NB to fail"},"correct":"B","explanation":{"correct":"- Bernoulli NB: $P(\\text{doc}|\\text{class}) = \\prod_{w \\in V} P(w|\\text{class})^{b_w} \\times P(\\text{not-}w|\\text{class})^{1-b_w}$ where $b_w \\in \\{0,1\\}$.\n- When a word is absent ($b_w = 0$), Bernoulli NB multiplies by $P(\\text{not-}w|\\text{class}) = 1 - P(w|\\text{class})$. A word common in spam (high $P(w|\\text{spam})$) contributes $P(\\text{not-}w|\\text{spam}) = $ small value when absent — a positive signal for non-spam.\n- Multinomial NB only processes words present in the document, contributing nothing for absent words. In short documents with few words, the absence of spam indicators is strong evidence — Bernoulli captures this; Multinomial misses it.","A":"Multinomial NB's use of count information is genuinely useful for long documents where word frequency carries meaning (e.g., \"urgent\" appearing 5 times in an email is more suspicious than once). The advantage depends on document length.","B":"","C":"Computational speed differences are not the cause of accuracy differences. Both models have similar complexity.","D":"Both Bernoulli and Multinomial NB have their respective distributional assumptions. 
Short documents don't \"violate\" the Multinomial distribution — they just provide less count information to leverage."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09010","difficulty":"hard","orderIndex":10,"question":"A Naive Bayes model is trained on real-valued continuous features (sensor data). The team uses Gaussian NB: $P(x_i|\\text{class}) = \\mathcal{N}(\\mu_{ic}, \\sigma_{ic}^2)$. Feature $x_3$ has a bimodal distribution within each class (two distinct peaks). The model achieves poor recall on class 1. What is the precise problem and fix?","options":{"A":"Gaussian NB cannot handle real-valued data — it must be discretized","B":"Gaussian NB assumes each feature follows a single Gaussian within each class — a bimodal within-class distribution violates this, causing the estimated mean and variance to represent a \"ghost\" distribution that doesn't reflect either peak; the model systematically underestimates $P(x_3|\\text{class})$ near the two modes, where the data actually lies, and overestimates it in the region between them","C":"The problem is insufficient training data for class 1 — more samples would fix the Gaussian fit","D":"Bimodal distributions require multinomial NB regardless of the feature type"},"correct":"B","explanation":{"correct":"- A bimodal distribution (e.g., measurements cluster near 10 and 40 within class 1) has mean ≈ 25 — a value rarely observed. 
Gaussian NB fits $\\mathcal{N}(25, \\sigma^2)$, which concentrates probability around 25 but gives low probability to observations near 10 or 40 (where actual data lives).\n- This causes systematic underestimation of $P(x_3 | \\text{class=1})$ for actual class-1 observations near either mode, reducing the posterior for class 1.\n- Fix: use Kernel Density Estimation (KDE) for the continuous distribution, discretize the feature into bins and use Multinomial NB, or use a mixture of Gaussians to model the bimodal within-class distribution.","A":"Gaussian NB handles real-valued data correctly when the Gaussian assumption holds. The issue is the violation of unimodality, not the data type.","B":"","C":"More training data would produce a more accurate estimate of the bimodal distribution's parameters — but the Gaussian model cannot represent a bimodal distribution regardless of sample count. The model class is wrong.","D":"Multinomial NB is designed for discrete count data. Applying it to bimodal continuous data would require discretization first. The NB variant choice should match the data type, not just the distribution shape."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09011","difficulty":"hard","orderIndex":11,"question":"A Naive Bayes model is trained incrementally: new training data arrives daily and the model is updated without retraining from scratch. 
How does Naive Bayes's generative model structure uniquely enable this incremental update, unlike discriminative models?","options":{"A":"Naive Bayes cannot be updated incrementally — full retraining is always required","B":"Naive Bayes stores sufficient statistics (class counts, word counts, feature sums and variances) that are additive — each new sample updates the counts, and the probability estimates are recomputed directly; no gradient computation or full dataset is needed; discriminative models (logistic regression, neural networks) require full dataset gradient computation for principled incremental updates","C":"Incremental learning only works for Naive Bayes because it has fewer parameters","D":"Naive Bayes is the only model that can be updated incrementally because it uses Bayesian inference"},"correct":"B","explanation":{"correct":"- Multinomial NB: class count $N_c$ and feature count $\\text{count}(w, c)$ are sufficient statistics. Adding a new document with class $c$ and words $\\{w_1, ...\\}$: increment $N_c$ by 1 and increment $\\text{count}(w_i, c)$ by $x_i$, the number of times $w_i$ occurs in the document. Recompute $P(w_i|c)$ from updated counts.\n- This is $O(d)$ per new sample regardless of total dataset size, so the update cost does not grow as more data accumulates.\n- Discriminative models (logistic regression, neural networks) minimize loss over training data. Updating with a new sample requires either full gradient computation (which accesses all past data) or stochastic gradient descent with forgetting effects. Neither is as clean as NB's sufficient statistic updates.","A":"sklearn's `MultinomialNB` explicitly supports `partial_fit()` for incremental learning. Naive Bayes is one of the few classic algorithms with principled online update support.","B":"","C":"Parameter count is not the determining factor. 
The determining factor is whether the model's parameters can be expressed as additive sufficient statistics of the data.","D":"Several other models support incremental learning (Perceptron, online SGD, vowpal wabbit). Naive Bayes's incremental property comes from its generative structure with additive sufficient statistics, not uniquely from Bayesian inference."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09012","difficulty":"hard","orderIndex":12,"question":"Two Naive Bayes models are trained on document classification: Model A uses Laplace smoothing α=1, Model B uses α=10. On a test set with many out-of-vocabulary words, Model B outperforms Model A. However, on test data similar to training, Model A performs better. Explain this behavior precisely.","options":{"A":"Higher α always improves model performance — Model A should always use α=10","B":"α=10 applies heavier smoothing: every word probability is pulled toward the uniform value $1/|V|$, which reduces overfitting to training word frequencies but adds bias; on test data with many rare or unseen words, that bias is less harmful than Model A's near-zero probabilities for those words, which let a few rare words dominate the product; on in-distribution test data, the extra bias hurts Model B, while Model A's sharper estimates keep more of the discriminative signal","C":"The performance difference is caused by Laplace smoothing only applying to word counts, not to class priors","D":"α=10 is equivalent to having 10 extra observations of each word, making Model B more robust by artificially increasing training size"},"correct":"B","explanation":{"correct":"- Laplace smoothing formula: $P(w|c) = \\frac{N_{wc} + \\alpha}{N_c + \\alpha|V|}$. With $\\alpha = 10$: a word never seen in class $c$ gets $P(w|c) = 10/(N_c + 10|V|)$ — higher than with $\\alpha=1$. 
All probabilities are pulled closer to $1/|V|$ (uniform).\n- On OOV-heavy test data: $\\alpha=1$ gives very small but not zero probabilities for unseen words (avoiding the catastrophic zero of no smoothing). $\\alpha=10$ gives larger probabilities for unseen words, making predictions less sensitive to OOV words.\n- On in-distribution test data: $\\alpha=1$ preserves more of the training distribution signal. $\\alpha=10$'s over-smoothing weakens the discriminative signal for known words.","A":"Higher α is not universally better. It's a bias-variance trade-off: more smoothing (higher α) reduces variance for OOV words but increases bias on in-distribution data.","B":"","C":"Laplace smoothing does apply to class priors too in some formulations ($P(c) = (N_c + \\alpha) / (N + K\\alpha)$ where K is number of classes). However, the performance difference described is specifically about word probability estimation.","D":"Adding α to counts is loosely analogous to α extra observations of each word, but this framing understates the effect: it uniformly distributes α observations across all vocabulary words, which is different from adding real word observations from training data."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09013","difficulty":"hard","orderIndex":13,"question":"A Naive Bayes classifier is used to detect toxic text. Features include word presence (Bernoulli NB). Feature \"hate\" has $P(\\text{hate}|\\text{toxic}) = 0.6$ and $P(\\text{hate}|\\text{not toxic}) = 0.001$. Log-odds ratio = $\\log(0.6/0.001) = 6.4$. An auditor discovers the model is biased: it flags text mentioning marginalized groups as toxic at higher rates. 
What NB property enables and hides this bias?","options":{"A":"The model is unbiased because it uses probabilistic outputs, not binary decisions","B":"Naive NB's word-level independence assumption makes it transparent about which words drive predictions (high log-odds ratio words), but this also makes it easy for training data bias to embed directly into per-word probabilities — if the training corpus disproportionately associates group-identifying words with toxic content (historical bias), those words get high P(word|toxic) without the model having any mechanism to distinguish correlation from discriminatory association","C":"The bias is caused by Laplace smoothing, which amplifies toxic class probabilities","D":"Naive Bayes cannot be biased because it uses objective probabilities from training data"},"correct":"B","explanation":{"correct":"- NB directly encodes $P(w|\\text{toxic})$ from training data. If training data disproportionately labels text mentioning group-X as toxic (historical human labeling bias), then $P(\\text{group-X word}|\\text{toxic})$ is estimated as high — the model embeds this bias directly.\n- Unlike neural networks where bias is distributed across millions of parameters (hard to audit), NB's bias is transparent and inspectable: high-log-odds words are directly interpretable. This is both a strength (auditable) and a weakness (bias transfers directly).\n- The auditor can detect and partially correct by removing or downweighting group-identifying terms, or by reweighting training examples.","A":"Probabilistic outputs do not prevent bias. If the probability of \"toxic\" is consistently higher for text containing group identifiers due to training data bias, the model produces biased probability outputs.","B":"","C":"Laplace smoothing does not amplify the toxic class. 
It moves all word probabilities slightly toward uniform — it would reduce, not amplify, class-specific word probabilities.","D":"\"Objective probabilities from training data\" is precisely how bias embeds — if training data contains human labeling bias, the objective probabilities inherit that bias. No algorithm is immune to biased training data."}},{"section":"machine-learning","topicSlug":"naive-bayes","topic":"Naive Bayes","id":"ml-09014","difficulty":"hard","orderIndex":14,"question":"Naive Bayes and Logistic Regression are both trained on the same binary classification task. In the asymptotic limit (infinite training data), logistic regression converges to a better solution than Naive Bayes. But on small training sets (n < 30), Naive Bayes often outperforms logistic regression. What theoretical framework explains this empirical observation?","options":{"A":"Naive Bayes uses a better optimization algorithm than logistic regression for small datasets","B":"Naive Bayes is a generative model — it models the joint distribution $P(x, y)$ and has fewer effective parameters (one mean and variance per feature per class for GNB); it reaches its asymptotic error with fewer samples; logistic regression is a discriminative model that directly models $P(y|x)$ and requires more samples to estimate its parameters reliably, but achieves lower asymptotic error when its assumptions hold","C":"Logistic regression overfits to training data on small datasets, while Naive Bayes cannot overfit because it ignores feature correlations","D":"This observation is false — logistic regression always outperforms Naive Bayes regardless of dataset size"},"correct":"B","explanation":{"correct":"- The Ng & Jordan (2001) study formally showed this crossover: Naive Bayes achieves its asymptotic error after $O(\\log d)$ samples (d = features), while logistic regression requires $O(d)$ samples.\n- Generative models like NB have structural assumptions that constrain the solution space — the model 
\"knows\" the distribution structure. This inductive bias is helpful with little data.\n- Discriminative models make fewer assumptions and can fit any boundary, but need more data to determine which boundary is correct. Their lower asymptotic error comes from not being constrained by (possibly wrong) generative assumptions.","A":"Naive Bayes is not an optimizer-based model. Its parameters (class probabilities, feature likelihoods) are estimated directly from frequency counts. There's no optimization difference.","B":"","C":"Naive Bayes can overfit — especially with small training sets, estimated word probabilities may reflect training noise. The independence assumption acts as regularization, but it's not absolute protection against overfitting.","D":"This is empirically false. The Ng & Jordan paper directly demonstrates with experiments that Naive Bayes outperforms logistic regression on small datasets."},"reference":"- Ng & Jordan, \"On Discriminative vs. Generative Classifiers\": https://proceedings.neurips.cc/paper/2001/hash/7b7a53e239400a13bd566b1e94b2f4f6-Abstract.html"},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10001","difficulty":"easy","orderIndex":1,"question":"PCA is applied to a dataset with 50 features. The first principal component explains 45% of variance, the second explains 20%, and the third explains 15%. 
A data scientist says \"the first three components capture 80% of the variance, so we can safely discard the remaining 47 components.\" What important nuance does this claim miss?","options":{"A":"80% variance explained is always insufficient — you must retain 95% minimum","B":"Variance explained measures the proportion of total variance captured, but \"safely discard\" depends on the task — for visualization 80% is often enough, but for a downstream model, the 20% discarded variance may contain the signal most predictive of the target; the claim assumes variance ∝ information, which is only true if the task is reconstruction, not prediction","C":"The claim is correct — 80% is the standard threshold for PCA in all applications","D":"PCA cannot discard components because all 50 components together reconstruct the data exactly"},"correct":"B","explanation":{"correct":"- PCA maximizes explained variance — it finds directions of maximum data spread. But the target variable may correlate with low-variance directions. For example, a subtle survival signal in medical data might be captured by component 10 (2% variance) rather than component 1.\n- \"Explained variance\" measures how well PCA reconstructs the input $X$, not how well it predicts the output $y$. Discarding variance is safe only when you're doing unsupervised compression; for supervised prediction, you should evaluate downstream model performance on held-out data.\n- Alternative: supervised dimensionality reduction (LDA, PLS) finds components that maximize predictive power, not variance.","A":"There is no universal 95% threshold. The appropriate threshold depends on the task: 80% may be sufficient for noise reduction in image compression; 95% may be insufficient for a regression task where the target correlates with rare components.","B":"","C":"80% is a commonly cited heuristic, not a standard. 
The optimal number of components is task-dependent and should be evaluated empirically.","D":"PCA produces an ordered set of orthogonal components. Using only the first 3 is an approximation — you are discarding the remaining 47 dimensions' information, with some information loss."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10002","difficulty":"easy","orderIndex":2,"question":"PCA is applied to a dataset before training a logistic regression classifier. The PCA is fitted on the entire dataset (including test data), and then both train and test sets are transformed with the same PCA. A reviewer says this is a data leakage problem. Is the reviewer correct?","options":{"A":"No — PCA is unsupervised and doesn't use labels, so it cannot cause data leakage","B":"Yes — fitting PCA on the full dataset (including test data) means the principal components (eigenvectors) are computed using test-set variance information; these components may align with patterns specific to the test set, giving the model access to test distribution information during training","C":"The reviewer is partially correct — leakage only occurs if PCA reduces to 1 component","D":"Leakage from PCA only matters for non-linear PCA methods (kernel PCA); linear PCA is safe"},"correct":"B","explanation":{"correct":"- PCA computes eigenvectors of the feature covariance matrix. If the covariance matrix is estimated using all data (including test), the test data's variance structure is embedded in the principal components.\n- For example, if the test set has a unique cluster pattern, PCA may create a component that separates this cluster from the training data — the subsequent model then benefits from this structure during evaluation.\n- The correct approach: fit PCA on training data only, then apply the same PCA transformation to the test set. 
This is enforced by using `sklearn.pipeline.Pipeline`.","A":"\"Not using labels\" does not prevent leakage. Any information from the test set — distributional, structural, or statistical — that influences training constitutes leakage. PCA uses the covariance structure of all features.","B":"","C":"The number of components does not determine whether leakage occurs. With any number of components, fitting on test+train uses test information.","D":"This distinction is incorrect. Both linear PCA and kernel PCA are fitted on data. Any fitted transformation uses the data it was fitted on. Linear PCA has the same leakage risk as kernel PCA."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10003","difficulty":"easy","orderIndex":3,"question":"The first two principal components of a 100-feature dataset are plotted as a scatterplot. A data scientist interprets the plot and identifies three distinct clusters. A colleague says \"we can use this for clustering.\" What critical limitation must they acknowledge?","options":{"A":"PCA scatterplots cannot be used to identify clusters under any circumstances","B":"The 2D PCA visualization shows variance in the two highest-variance directions — clusters visible in 2D may not exist in the full 100-dimensional space, and clusters that exist in the full space may be invisible in the 2D projection; 2D PCA is a lossy projection that can create apparent clusters through projection artifacts or miss real high-dimensional structure","C":"The clusters are guaranteed to be real because PCA extracts the most informative dimensions","D":"Using PCA for clustering is only invalid if explained variance is below 90%"},"correct":"B","explanation":{"correct":"- Projection to 2D compresses 100 dimensions into 2. 
Points that are well-separated in the full space may overlap in the projection; overlapping points in the full space may appear separated due to the 2D \"shadow\" effect.\n- PCA finds max-variance directions, not max-cluster-separation directions. A dataset where clusters are separated along low-variance components will look homogeneous in a PCA plot despite having clear cluster structure.\n- t-SNE and UMAP are specifically designed for cluster visualization — they preserve neighborhood structure, not variance. They are preferred for exploratory cluster analysis.","A":"PCA scatterplots can be useful starting points for exploration. The limitation is in over-interpreting apparent clusters as definitive, not in using the plot entirely.","B":"","C":"PCA maximizes variance, not discriminative or clustering power. \"Most informative\" is relative to the task: for reconstruction, PC1/PC2 are most informative; for cluster separation, they may not be.","D":"There is no threshold below which PCA is \"invalid\" for clustering visualization. Even 95% variance retention can fail to reveal cluster structure if the clusters separate along the remaining 5%."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10004","difficulty":"easy","orderIndex":4,"question":"A scree plot shows eigenvalues: 8.2, 7.9, 7.6, 0.3, 0.2, 0.1. An analyst uses the \"elbow method\" to select the number of principal components. 
Where is the elbow and what does it indicate?","options":{"A":"The elbow is between components 1 and 2, indicating only 1 component should be kept","B":"The elbow is between components 3 and 4 — the first three eigenvalues (8.2, 7.9, 7.6) are large and similar; then there is a sharp drop to 0.3; the elbow indicates that 3 components capture the dominant variance structure, and additional components mainly capture noise","C":"The elbow is between components 5 and 6, and 5 components should be retained","D":"A flat scree plot with similar initial eigenvalues means PCA is not applicable"},"correct":"B","explanation":{"correct":"- The scree plot elbow method: find the point where the eigenvalue curve \"bends\" sharply — large variance to the left, noise variance to the right. The drop from 7.6 to 0.3 is a factor of 25 — a dramatic elbow.\n- Eigenvalues 8.2, 7.9, 7.6 suggest three approximately equal variance components (perhaps three underlying dimensions of equal importance). The components after the elbow (0.3, 0.2, 0.1) represent residual noise.\n- The elbow method is heuristic — the \"elbow\" is not always obvious. When eigenvalues decrease gradually, parallel analysis or cross-validation-based component selection is more reliable.","A":"Components 1-3 have nearly equal eigenvalues (~8) — there is no elbow between 1 and 2. The sharp drop is between 3 and 4.","B":"","C":"Components 4-6 all have small eigenvalues (0.3, 0.2, 0.1) and represent noise. There is no additional elbow at component 5-6.","D":"PCA is applicable to any dataset. A flat initial portion of the scree plot means multiple components are equally important — this is common and valid."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10005","difficulty":"easy","orderIndex":5,"question":"A team applies PCA to reduce features from 100 to 10 before training a neural network. The network achieves 78% accuracy. 
Without PCA (full 100 features), the network achieves 82%. A colleague says \"PCA always improves neural network performance by removing noise.\" Is this correct?","options":{"A":"Correct — PCA always helps neural networks by removing correlated features","B":"Incorrect — neural networks with sufficient capacity and data can learn to use high-dimensional input effectively; PCA discards the 90 lower-variance components, which may contain task-relevant signal; the 4-point accuracy drop suggests the discarded variance contained useful predictive information","C":"Correct — 78% accuracy from 10 components vs 82% from 100 features proves PCA is harmful in all cases","D":"The accuracy difference is within noise — the two results are statistically equivalent"},"correct":"B","explanation":{"correct":"- Neural networks with many hidden units can model nonlinear interactions across all 100 features. PCA discards 90 directions of variance — if any of these carry signal (even small variance-explained signal), the network loses that information.\n- PCA is most beneficial when training data is limited (fewer samples than features forces the network to generalize in high-dimensional space) or when computation savings are needed.\n- With ample data and computational resources, end-to-end feature learning (letting the network learn its own low-dimensional representation through the early layers) often outperforms manual PCA preprocessing.","A":"\"Always improves\" is definitively false. This example demonstrates the opposite. PCA is a tool with trade-offs, not a universally beneficial preprocessing step.","B":"","C":"The observed drop suggests PCA was harmful on this specific task. But it doesn't \"prove\" PCA is harmful in all cases. 
Other datasets and architectures may benefit from PCA preprocessing.","D":"A 4-point accuracy difference is often larger than run-to-run noise, but whether it is statistically meaningful depends on test-set size and seed variance; it should be checked rather than dismissed."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10006","difficulty":"medium","orderIndex":6,"question":"PCA is applied to a 3D dataset where the data lies on a 2D Swiss roll (a nonlinearly curved manifold). The PCA projection to 2D yields a crescent shape instead of an unrolled sheet: points that are far apart along the roll appear close in PCA space. What does this reveal about PCA's limitations?","options":{"A":"PCA failed because the Swiss roll has more than 2 dimensions","B":"PCA finds the 2D linear subspace with maximum variance — it cannot \"unfold\" a curved manifold because it uses only linear projections; points that are far apart along the roll (large geodesic distance) can be close in 3D Euclidean space, so PCA places them faithfully by Euclidean distance but incorrectly by manifold geodesic distance; nonlinear methods (UMAP, t-SNE, Isomap) are needed to preserve manifold structure","C":"PCA failed because the data was not standardized before application","D":"PCA produces correct results on the Swiss roll — the crescent is the geometrically correct 2D representation"},"correct":"B","explanation":{"correct":"- PCA computes eigenvectors of the covariance matrix — these are directions of maximum variance in the original Euclidean space. The Swiss roll's 2D manifold is curved; its intrinsic 2D coordinates cannot be reached by any linear projection.\n- The first two PCs capture the maximum-variance projection of the 3D roll, which squashes the curved surface. 
Points on adjacent layers of the roll are close in Euclidean distance but far apart along the manifold (large geodesic distance), so the linear projection places them near each other.\n- Isomap approximates geodesic distances with shortest paths on a nearest-neighbor graph rather than straight-line Euclidean distances, which lets it \"unroll\" the Swiss roll; UMAP and t-SNE instead preserve local neighborhoods.","A":"The Swiss roll is intrinsically 2-dimensional, so the number of dimensions is not the obstacle. The failure is due to the nonlinearity of the manifold, not the dimensionality.","B":"","C":"Standardization would not help here. The failure is geometric (linear vs. nonlinear projection), not scale-related.","D":"The crescent is not the geometrically correct representation — it folds together parts of the roll that should be separated. The correct unrolled representation would show a rectangle or unfurled band."},"reference":"- Tenenbaum et al., \"A Global Geometric Framework for Nonlinear Dimensionality Reduction\" (Isomap): https://science.sciencemag.org/content/290/5500/2319"},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10007","difficulty":"medium","orderIndex":7,"question":"t-SNE is applied to visualize a 500-dimensional embedding space. The resulting 2D plot shows 8 clear clusters. A data scientist presents this to stakeholders saying \"our model has learned 8 distinct customer segments.\" A statistician pushes back. 
What is the statistician's concern?","options":{"A":"t-SNE visualizations are always incorrect and should never be used for presentations","B":"t-SNE preserves local neighborhood structure but distorts global distances — clusters in t-SNE plots look more separated than they actually are, and the number and appearance of clusters can change dramatically with different perplexity values; the 8 clusters may not correspond to 8 distinct real-world groups without validation with a downstream task","C":"t-SNE is only valid for image data and cannot be applied to embedding spaces","D":"The statistician's concern is invalid — 8 visible clusters definitively proves 8 segments exist"},"correct":"B","explanation":{"correct":"- t-SNE optimizes a different objective than PCA: it minimizes KL divergence between high-dimensional and low-dimensional neighborhood distributions. This preserves local structure (nearby points stay nearby) but distorts inter-cluster distances.\n- Hyperparameter sensitivity: changing perplexity (5 to 50) can change the apparent number and shape of clusters. t-SNE can create apparent clusters from uniformly distributed data.\n- Validation: the 8 t-SNE clusters should be validated against domain-meaningful criteria (customer behavior differences, business metrics). Visualization alone is not proof of segmentation.","A":"t-SNE is a valuable and widely used exploratory tool. The concern is not about its validity but about its interpretation limitations.","B":"","C":"t-SNE can be applied to any vector space — embeddings, genomics, audio features, text. It has no domain restriction.","D":"Visual clusters in t-SNE do not definitively prove real-world segments. The visualization may create apparent clusters through parameter tuning or reflect local noise patterns. 
Downstream validation is required."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10008","difficulty":"medium","orderIndex":8,"question":"PCA is applied to a gene expression dataset with 20,000 genes (features) and 200 samples. The eigenvalue decomposition is computed on the $200 \\times 200$ Gram matrix $XX^T$ rather than the $20,000 \\times 20,000$ covariance matrix. Why is this dimensionality trick valid?","options":{"A":"The trick is invalid — eigenvalues from the $200 \\times 200$ matrix are different from those of the $20,000 \\times 20,000$ matrix","B":"The data matrix $X$ (200×20,000) has rank at most 200 — the covariance matrix $X^TX$ (20,000×20,000) therefore has at most 200 non-zero eigenvalues; computing eigendecomposition of $XX^T$ (200×200) gives the same non-zero eigenvalues and the corresponding eigenvectors can be derived analytically; this is dual PCA, the same identity that kernel PCA relies on","C":"The $200 \\times 200$ matrix is used only for computational speed — the eigenvectors are different but produce similar results","D":"PCA on $XX^T$ only works when the number of features is exactly 100× the number of samples"},"correct":"B","explanation":{"correct":"- Data matrix $X$ is $n \\times p$ ($200 \\times 20,000$). Rank($X$) $\\leq \\min(n, p) = 200$, so $X^TX$ has at most 200 non-zero eigenvalues.\n- SVD relationship: $X = U\\Sigma V^T$. The eigenvectors of $X^TX$ are the columns of $V$ (right singular vectors), and eigenvectors of $XX^T$ are columns of $U$ (left singular vectors). Non-zero eigenvalues of both are identical: $\\sigma_i^2$.\n- Recovering $V$ from $U$: $V_i = X^T U_i / \\sigma_i$. This gives the full PCA solution (principal components in 20,000-dimensional space) from the 200×200 computation.","A":"The non-zero eigenvalues of $X^TX$ and $XX^T$ are mathematically identical (by the SVD relationship). 
This is not an approximation — it is an exact equivalence.","B":"","C":"The eigenvectors are not different in meaning — they represent the same principal directions. The $200 \\times 200$ computation gives an exact (not approximate) solution for the non-zero components.","D":"The dual PCA trick works whenever $n < p$ (more features than samples). The specific ratio doesn't matter."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10009","difficulty":"medium","orderIndex":9,"question":"UMAP is applied to two datasets: Dataset A (tight, well-separated clusters) and Dataset B (continuous gradient, no distinct clusters). Both produce visually appealing 2D plots with apparent clusters. A researcher uses both as evidence of cluster structure. What is wrong with the interpretation for Dataset B?","options":{"A":"Nothing — UMAP always produces correct visualizations","B":"UMAP, like t-SNE, optimizes for local neighborhood preservation — it will create apparent cluster structure in its 2D output even when the underlying data has a continuous gradient; the discrete-looking clusters in Dataset B's UMAP plot are an artifact of the algorithm's neighborhood compression, not evidence of real distinct groups","C":"UMAP cannot handle continuous gradients — it should be replaced with PCA for Dataset B","D":"Dataset B's continuous gradient means UMAP will produce random noise output, not apparent clusters"},"correct":"B","explanation":{"correct":"- UMAP constructs a fuzzy topological representation of the data and optimizes the low-dimensional embedding to match. Points far apart in high-dimensional space are repelled in the embedding, creating \"white space\" between groups — even if those groups are really just ends of a continuum.\n- This is a known artifact: UMAP (and t-SNE) can create apparent clusters from uniform or continuously varying data. 
The visual separation in the plot reflects the algorithm's optimization objective (local preservation + global spreading), not necessarily real cluster boundaries.\n- Validation: are the apparent clusters in Dataset B correlated with any external label, domain category, or outcome? Without such validation, the clusters are visualization artifacts.","A":"UMAP creates visualization artifacts that can mislead interpretation. Apparent clusters from continuous data are a documented limitation.","B":"","C":"UMAP can handle continuous gradients — it will produce a visualization. The issue is how to interpret it, not whether UMAP applies.","D":"UMAP produces structured outputs, not random noise, even for continuous data. The structured output may reflect real (continuous) gradients, but the visual clustering effect makes it look discretized."},"reference":"- McInnes et al., \"UMAP: Uniform Manifold Approximation and Projection\": https://arxiv.org/abs/1802.03426"},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10010","difficulty":"hard","orderIndex":10,"question":"PCA is applied to a dataset of stock returns. The first principal component has roughly equal positive weights for all stocks. The second principal component has positive weights for technology stocks and negative weights for financial stocks. 
What are these components likely capturing?","options":{"A":"The first component captures outlier stocks; the second captures mean-reverting pairs","B":"The first principal component likely captures the overall market direction (a \"market factor\") — all stocks move together with the market; the second component captures a sector rotation factor — technology and financials tend to move in opposite directions; this is consistent with factor model theory (PCA recovers statistical risk factors)","C":"The first component captures volatility (standard deviation) and the second captures correlation structure","D":"Equal weights in PC1 indicate a data preprocessing error — PCA should produce diverse weights"},"correct":"B","explanation":{"correct":"- In equity returns, PCA-derived components often correspond to interpretable market factors: PC1 is typically a market factor (all stocks have the same sign loading — when the market goes up, all stocks tend to go up); PC2 often captures sector-rotation effects.\n- This is the basis of statistical factor models in finance, which extract risk factors directly from the return covariance or correlation matrix. Fundamental models such as Barra and characteristic-sorted factors such as Fama-French are constructed from firm characteristics and portfolio sorts rather than from PCA.\n- The equal-weight PC1 is an empirical result, not a preprocessing error — it reflects the strong common factor (market beta) shared by all stocks.","A":"PCA components are not defined in terms of outliers or mean-reversion. Outliers might influence the covariance matrix, but the PC interpretation is about variance structure, not individual sample properties.","B":"","C":"PC1 captures the direction of maximum variance (market returns vary synchronously) — not standard deviation. PCA components are eigenvectors of the covariance matrix, not dispersion statistics.","D":"Equal weights in PC1 are a meaningful signal (all stocks share the market factor), not an artifact. 
PCA output reflects the data's covariance structure, not an error."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10011","difficulty":"hard","orderIndex":11,"question":"A team reduces dimensions from 200 to 20 using PCA, then trains a logistic regression model. Test accuracy is 84%. The team wants to add interpretability: \"which original features matter most?\" Why is this question difficult to answer after PCA, and what alternative preserves interpretability?","options":{"A":"Interpretability is preserved in PCA because each PC corresponds to one original feature","B":"PCA principal components are linear combinations of all original features — a coefficient in a PC is not the same as the feature's importance to the downstream model; to recover feature importance, you must propagate the logistic regression weights back through the PCA transformation ($w_{\\text{original}} = V \\cdot w_{\\text{LR}}$ where V is the PCA loading matrix), or use an interpretable method without PCA (Lasso regression, tree models, or SHAP on the original features)","C":"Feature importance is impossible to determine after any dimensionality reduction technique","D":"Simply rank the original features by their loading on PC1 — the PC1 loading magnitude determines feature importance to the downstream model"},"correct":"B","explanation":{"correct":"- PCA transformation: $z = V^T x$ where $V$ is the $200 \\times 20$ loading matrix (each column is a principal component). Logistic regression learns weights $w_{LR} \\in \\mathbb{R}^{20}$.\n- The implicit model is: $\\hat{y} = \\sigma(w_{LR}^T V^T x + b) = \\sigma((V w_{LR})^T x + b)$. The effective weights in original feature space: $w_{\\text{eff}} = V w_{LR}$.\n- This back-transformation gives a single weight per original feature, enabling feature importance interpretation. 
However, this only works for linear models; nonlinear models after PCA require different approaches.","A":"Each PC is a weighted combination of all original features, not one-to-one. A single original feature may load heavily on several PCs.","B":"","C":"Feature importance can be recovered by back-transforming through the PCA loadings. The complexity increases but it is mathematically tractable.","D":"PC1 loading magnitude measures how much each feature contributes to the first principal component (direction of max variance), not how important it is to the downstream model. The downstream model may weight PC1 weakly and PC5 strongly."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10012","difficulty":"hard","orderIndex":12,"question":"A team uses t-SNE for visualizing high-dimensional customer embeddings. They set perplexity=5 and see 20 clusters. They rerun with perplexity=50 and see 5 clusters. A manager asks: \"which is correct?\" What is the principled answer?","options":{"A":"Perplexity=50 is always correct because it uses more neighbors","B":"t-SNE's perplexity controls the effective number of nearest neighbors — low perplexity emphasizes local micro-structure (can fragment one real cluster into many apparent sub-clusters), high perplexity emphasizes global macro-structure; both visualizations may be \"correct\" at their respective scales; the \"right\" number of clusters requires validation with external criteria (domain labels, business outcomes), not just visual inspection","C":"The two runs show that t-SNE is random and the results are meaningless","D":"The correct visualization is the one with fewer clusters because more clusters indicate overfitting in the visualization"},"correct":"B","explanation":{"correct":"- Perplexity in t-SNE is roughly analogous to the number of effective nearest neighbors (bandwidth of the Gaussian kernel in high-dimensional space). 
Typical recommendations: 5-50, with larger datasets favoring larger perplexity.\n- Low perplexity: each point only cares about its immediate neighbors — can create many small, tight clusters by separating natural sub-groups that are real micro-structure or noise fragmentation.\n- High perplexity: broader neighborhood — clusters merge more readily, showing macro-structure. Neither result is definitively \"correct\" — they show the data at different resolution scales.\n- Correct validation: do the 5 macro-clusters correspond to business-meaningful segments? Do the 20 micro-clusters show consistent sub-behaviors?","A":"More neighbors doesn't mean \"more correct\" — it means viewing the structure at a coarser scale. For some purposes, fine-grained micro-structure is exactly what you want.","B":"","C":"t-SNE is stochastic but its results are reproducible with fixed random seed. The sensitivity to perplexity is a feature (multi-scale view), not random noise.","D":"More clusters don't indicate \"overfitting\" in the visualization. The number of clusters reflects the perplexity scale, not model complexity."}},{"section":"machine-learning","topicSlug":"pca-dimensionality-reduction","topic":"Pca Dimensionality Reduction","id":"ml-10013","difficulty":"hard","orderIndex":13,"question":"A researcher compares UMAP and t-SNE on the same high-dimensional dataset for downstream clustering. UMAP runs 200× faster and is used in production. 
A statistician notes: \"UMAP preserves global structure better than t-SNE.\" What specific property of UMAP's objective makes this claim accurate?","options":{"A":"UMAP is faster, which implies it better preserves global structure","B":"t-SNE's cost function places high penalty only on nearby points in the high-dimensional space — it ignores the placement of non-neighboring points; UMAP's cost function has both attraction terms (for close neighbors) and repulsion terms (for distant points), with the repulsion explicitly positioning non-neighboring points away from each other — this provides more consistent global structure preservation","C":"UMAP uses Euclidean distance while t-SNE uses cosine similarity, making UMAP more accurate for spatial data","D":"Both algorithms preserve global structure equally — the speed difference is the only practical distinction"},"correct":"B","explanation":{"correct":"- t-SNE minimizes KL divergence between high-dimensional and low-dimensional neighborhood probabilities. The KL divergence is asymmetric: it penalizes placing nearby high-dimensional points far apart (local structure preservation) but gives less guidance for placing distant points.\n- UMAP minimizes binary cross-entropy: $L = -\\sum_{(i,j)} [w_{ij} \\log(\\hat{w}_{ij}) + (1-w_{ij})\\log(1-\\hat{w}_{ij})]$ where $w_{ij}$ is the fuzzy neighborhood membership in high-dimensional space and $\\hat{w}_{ij}$ is its low-dimensional counterpart. This has explicit repulsion for non-edges that positions non-neighboring points at meaningful distances.\n- In practice: UMAP visualizations maintain relative positions of clusters (macro-structure), while t-SNE plots can have cluster positions that are arbitrary and change between runs.","A":"Computational speed has no logical connection to the quality of global structure preservation. Speed comes from algorithmic optimizations (negative sampling, SGD-based optimization), not from the quality of the embedding.","B":"","C":"Both UMAP and t-SNE can use various distance metrics. 
The default is Euclidean for both in standard implementations. The metric choice is a user parameter, not an inherent difference.","D":"Global structure preservation is documented as better in UMAP vs t-SNE in the original UMAP paper and subsequent comparisons. They are not equivalent in this regard."},"reference":"- McInnes et al., \"UMAP vs t-SNE\": https://umap-learn.readthedocs.io/en/latest/how_umap_works.html"},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11001","difficulty":"easy","orderIndex":1,"question":"K-means clustering is run on a 2D dataset. After 100 iterations, the cluster assignments stop changing. A junior analyst says \"the algorithm found the optimal clustering.\" What is wrong with this statement?","options":{"A":"K-means always finds the global optimum — the statement is correct","B":"K-means is guaranteed to converge (cluster assignments stop changing) but only to a local minimum of the within-cluster sum of squares (WCSS) objective — different random initializations can produce different converged solutions; \"optimal\" requires the global minimum, which K-means cannot guarantee","C":"K-means convergence requires exactly 1,000 iterations — convergence at 100 means an error occurred","D":"K-means minimizes between-cluster variance, not within-cluster variance, so convergence doesn't relate to optimality"},"correct":"B","explanation":{"correct":"- K-means objective: minimize $J = \\sum_{k=1}^{K} \\sum_{x_i \\in C_k} ||x_i - \\mu_k||^2$ (WCSS). The algorithm alternates between assignment and update steps, each reducing $J$.\n- Convergence is guaranteed because there are finitely many possible assignments and $J$ decreases at each step. 
But the converged solution is a local minimum — different starting centroids can yield different final clusterings with different $J$ values.\n- Best practice: run K-means multiple times (e.g., 10-20 runs with different random seeds) and keep the solution with the lowest WCSS. In scikit-learn, the `n_init` parameter of `KMeans` controls the number of restarts (for example, `n_init=10`).","A":"K-means is a local search algorithm. Finding the globally optimal partition that minimizes WCSS is NP-hard (for K≥2 clusters in general). K-means makes no global optimality guarantee.","B":"","C":"Convergence can occur in any number of iterations — it depends on the data structure and initialization. Convergence at iteration 5 or at iteration 1,000 is equally valid.","D":"K-means minimizes within-cluster variance (WCSS) — the total squared distance from each point to its assigned centroid. Maximizing between-cluster distance is related but not the direct K-means objective."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11002","difficulty":"easy","orderIndex":2,"question":"A data scientist is choosing K for K-means on a customer segmentation dataset. They plot WCSS against K from 1 to 15 and see the curve decreasing monotonically. They select K=12 because the curve is still decreasing. 
Is this a valid approach?","options":{"A":"Yes — a decreasing WCSS curve means K=12 is better than K=11","B":"No — WCSS always decreases as K increases (at K=n, WCSS=0); selecting K where WCSS is still falling ignores the law of diminishing returns; the correct approach is the \"elbow method\" (find K where the rate of decrease sharply slows) or validated metrics like the Silhouette score, Gap statistic, or Calinski-Harabasz index","C":"The correct K is always the K that minimizes WCSS, which is the maximum K tested","D":"A monotonically decreasing WCSS curve indicates the data has no cluster structure and K-means should not be used"},"correct":"B","explanation":{"correct":"- Mathematical property: as K increases, each cluster gets smaller, decreasing within-cluster distances. At K=n (one cluster per point), WCSS = 0 — trivially but uselessly perfect.\n- The elbow method seeks the K where adding another cluster yields diminishing improvements in WCSS. Beyond the elbow, you're splitting natural clusters.\n- Better approaches: Silhouette score measures cohesion vs separation — maximizing it gives a principled K; Gap statistic compares WCSS to that expected under null (no structure) distribution.","A":"While K=12 has lower WCSS than K=11, this doesn't mean K=12 is a better clustering — it may be over-partitioning. Lower WCSS is expected as K grows and is not the criterion for \"better\" clustering.","B":"","C":"Maximum K (one point per cluster) trivially minimizes WCSS but is meaningless. The goal is to find meaningful compact groups, not minimize WCSS at any cost.","D":"A monotonically decreasing curve is the expected behavior for any dataset — it does not indicate lack of cluster structure. Lack of structure would manifest as no clear elbow."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11003","difficulty":"easy","orderIndex":3,"question":"DBSCAN is applied with eps=0.5 and min_samples=5. 
Some points are labeled as noise (-1). A team member says \"noise points in DBSCAN are just outliers — we can remove them from future datasets.\" Is this a valid conclusion?","options":{"A":"Yes — DBSCAN noise points are definitionally outliers and should be removed","B":"DBSCAN noise points are points that don't satisfy the density threshold (fewer than min_samples neighbors within eps, and not within eps of any core point) — whether they are \"outliers\" depends on context; if eps and min_samples were poorly chosen, normal points get labeled noise; the labels are specific to the hyperparameter choice, not an objective outlier determination; tuning eps (e.g., via k-distance plot) is needed first","C":"Noise points should be assigned to the nearest cluster, not removed","D":"DBSCAN noise points are always correct outlier labels and should always be removed in preprocessing"},"correct":"B","explanation":{"correct":"- A noise point is one that is neither a core point (≥min_samples neighbors within eps) nor a border point (within eps of a core point). This is a local density criterion.\n- If eps is too small, even dense-region points may be labeled as noise. If eps is too large, noise points are absorbed into clusters and neighboring clusters merge. The k-distance plot method: compute distance to the k-th nearest neighbor for all points, sort, and find the elbow — this suggests the appropriate eps.\n- Valid uses of noise labels: anomaly detection after careful hyperparameter tuning. Invalid use: blindly removing noise-labeled points from future datasets without understanding the hyperparameter sensitivity.","A":"DBSCAN noise is a relative concept — it depends entirely on eps and min_samples. 
The same point may be a noise point with eps=0.3 and a core point with eps=0.8.","B":"","C":"Assigning noise to the nearest cluster is what DBSCAN deliberately avoids: points that are not within eps of any core point are left as noise so that low-density points are not forced into clusters.","D":"Even well-tuned DBSCAN noise labels are specific to the dataset and parameter choices. \"Always\" remove is too strong — sometimes noise points are sparse-region legitimate data, not errors to exclude."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11004","difficulty":"easy","orderIndex":4,"question":"K-means is applied to customer data (age in years 20-70, annual income in dollars 20,000-200,000). The resulting clusters are almost entirely driven by income. What caused this and how should it be fixed?","options":{"A":"K-means is biased toward clustering on higher-cardinality features — this is expected behavior","B":"K-means uses Euclidean distance — income values (20,000-200,000) have much larger absolute magnitude than age (20-70); the income dimension dominates the distance calculation; fix: standardize all features to zero mean and unit variance (or scale to [0,1]) before applying K-means","C":"The clusters are correct — income is inherently more important than age for customer segmentation","D":"This can be fixed by increasing K to include more clusters"},"correct":"B","explanation":{"correct":"- K-means distance: $||x_i - \\mu_k||^2 = (\\text{age}_i - \\mu_{k,\\text{age}})^2 + (\\text{income}_i - \\mu_{k,\\text{income}})^2$. Each feature's squared difference is added in its raw units. Across the stated ranges, the largest possible income term is $(200,000-20,000)^2 = (180,000)^2 = 3.24 \\times 10^{10}$, while the largest possible age term is $(70-20)^2 = 2,500$. Income swamps age.\n- Standardization (z-score): $x' = (x - \\mu)/\\sigma$. 
After scaling, each feature contributes proportionally to its variability in standard deviation units.\n- Domain judgment: if income really should be weighted more heavily, use weighted distance or feature weighting explicitly — but this should be a deliberate choice.","A":"K-means is not \"biased toward high-cardinality features\" — it's biased toward features with large absolute values in the distance calculation. Cardinality is irrelevant; scale is the issue.","B":"","C":"Whether income is more important than age is a domain question. K-means shouldn't make this decision implicitly through scale artifacts — it should be made explicitly through feature engineering.","D":"Increasing K doesn't fix scale imbalance — with more clusters, income will still dominate the assignments. The scale issue remains regardless of K."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11005","difficulty":"easy","orderIndex":5,"question":"Hierarchical agglomerative clustering (HAC) is applied with Ward linkage. The dendrogram shows a clear merge at distance 10 that joins two major branches, and then many small merges below. A team cuts the dendrogram at height 10. What does this produce?","options":{"A":"A single cluster — height 10 means stopping when all points are in one cluster","B":"Cutting the dendrogram at height 10 produces the clusters that existed just before the merge at height 10 — the number of clusters equals the number of branches crossing the horizontal line at that height; a clear jump from many small merges (below 10) to a large merge (at 10) suggests 2 major natural clusters exist in the data","C":"Height 10 is the optimal cut point only for Ward linkage, not other linkage methods","D":"Cutting at height 10 produces 10 clusters — the cut height equals the cluster count"},"correct":"B","explanation":{"correct":"- HAC builds a tree (dendrogram) by greedily merging the two closest clusters at each step. 
The merge height represents the distance between merged clusters.\n- Cutting at height $h$: draw a horizontal line at $h$; each branch crossing the line is a cluster. If two major branches merge at height 10 and many small merges occur below 5, cutting at height 7-9 gives 2 clusters representing the two main groups.\n- The \"large jump\" heuristic: if there is a large increase in merge height at one step, cutting just below that step produces natural clusters. The jump at 10 suggests two genuinely distinct groups (they were far apart before being forced to merge).","A":"Cutting at height 10 stops agglomeration when the next merge would cost distance 10. This produces multiple clusters, not one. One cluster appears only at the very top of the dendrogram.","B":"","C":"The interpretation of dendrogram cuts is the same for all linkage methods. The heights on the y-axis differ by linkage (Ward uses variance increase; single/complete use point-to-point distances), but the cutting procedure is identical.","D":"The cut height has no direct relation to cluster count. Cut height determines the distance threshold; the number of clusters depends on how many branches cross that height."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11006","difficulty":"medium","orderIndex":6,"question":"A dataset contains clusters of varying density: Cluster A has 10,000 tightly packed points in a $0.5 \\times 0.5$ region; Cluster B has 100 loosely distributed points in a $50 \\times 50$ region. K-means (K=2) and DBSCAN are both applied. K-means correctly identifies 2 clusters; DBSCAN with a single eps struggles. 
Explain why DBSCAN fails here.","options":{"A":"DBSCAN fails because K=2 is too small for DBSCAN to identify","B":"DBSCAN uses a single global density threshold (eps, min_samples) — Cluster A is extremely dense (eps must be small to avoid merging with noise), while Cluster B is sparse (eps must be large to connect its distant points); a single eps cannot accommodate both densities simultaneously — Cluster B's points appear as noise to Cluster A's density threshold","C":"DBSCAN fails because it only works for circular clusters","D":"DBSCAN's time complexity prevents it from handling 10,000 points in one cluster"},"correct":"B","explanation":{"correct":"- DBSCAN defines core points by density (≥ min_samples within eps radius). For the dense Cluster A, eps = 0.1 would work well. For the sparse Cluster B, eps needs to be ~5.0 to connect distant points. A single eps cannot serve both.\n- With small eps: Cluster A is correctly identified; Cluster B's points become noise (each has few neighbors within eps=0.1).\n- With large eps: Cluster B is identified; Cluster A merges into one giant cluster, but worse — all nearby noise points and edges of Cluster A may merge with Cluster B.\n- Solution: HDBSCAN (Hierarchical DBSCAN) extracts a hierarchy of density-based clusters and can handle varying densities by considering multiple density levels.","A":"DBSCAN doesn't require a K parameter — it discovers the number of clusters automatically. This is a strength, not a failure mode related to K.","B":"","C":"DBSCAN correctly identifies arbitrary shapes (a key advantage over K-means). The failure here is about density variation, not cluster shape.","D":"DBSCAN's complexity is $O(n \\log n)$ with spatial indexing. 10,000 points is trivial computationally. 
The failure is algorithmic (single eps), not computational."},"reference":"- HDBSCAN paper: https://link.springer.com/chapter/10.1007/978-3-642-37456-2_14"},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11007","difficulty":"medium","orderIndex":7,"question":"K-means++ initialization is compared to random initialization on a clustering task. K-means++ consistently achieves lower WCSS with fewer restarts. What is the core innovation of K-means++ that produces this improvement?","options":{"A":"K-means++ selects all centroids from the training data, while random initialization allows centroids outside the data range","B":"K-means++ selects each subsequent centroid with probability proportional to its squared distance from the nearest already-chosen centroid — this spreads initial centroids across the data and avoids placing multiple centroids in the same dense region, starting the algorithm closer to a good solution","C":"K-means++ runs multiple complete K-means iterations as part of initialization, making the algorithm slower","D":"K-means++ uses the K-medoids algorithm for initialization, which is more robust than centroid-based methods"},"correct":"B","explanation":{"correct":"- K-means++ algorithm: (1) Pick a random point as first centroid. (2) For each remaining data point, compute $d(x)^2$ = squared distance to nearest chosen centroid. (3) Select next centroid with probability $p(x) \\propto d(x)^2$. (4) Repeat until K centroids selected.\n- This probabilistic selection naturally spreads centroids: high-distance points (far from all current centroids) get high selection probability, ensuring initial centroids span the data space.\n- Theoretical guarantee: K-means++ achieves expected WCSS within $O(\\log K)$ of the optimal — much better than random initialization's worst-case guarantees.","A":"Standard K-means random initialization also selects centroids from the training data points (or randomly from the feature space). 
Both methods start from data points; the difference is the selection probability.","B":"","C":"K-means++ only selects K initial centroids — it doesn't run K-means iterations during initialization. It's $O(nK)$ to initialize, then the standard K-means iterations follow.","D":"K-medoids uses actual data points as cluster representatives (not centroids). K-means++ is still standard K-means (using means as centroids) — the ++ only refers to the smarter initialization."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11008","difficulty":"medium","orderIndex":8,"question":"Gaussian Mixture Models (GMM) and K-means are both applied to the same dataset for clustering. GMM is described as a \"soft\" clustering method. What specific capability does GMM have that K-means lacks, and when does this matter?","options":{"A":"GMM is always better than K-means because it uses the Gaussian distribution assumption","B":"GMM produces probabilistic cluster membership: each point is assigned a probability of belonging to each cluster ($p(\\text{cluster}_k | x_i)$) rather than a hard assignment; this matters when points near cluster boundaries genuinely have ambiguous membership, and when modeling elongated or correlated clusters (GMM allows elliptical covariance; K-means assumes spherical equal-variance clusters)","C":"GMM is simply K-means with a different distance metric","D":"The \"soft\" property means GMM uses gradient descent instead of the EM algorithm"},"correct":"B","explanation":{"correct":"- K-means: each point is assigned to exactly one cluster (hard assignment). Boundary points get an arbitrary assignment.\n- GMM: each Gaussian component has parameters $(\\mu_k, \\Sigma_k, \\pi_k)$. EM computes $p(\\text{cluster}_k | x_i)$ — a point near two cluster boundaries might be 60%/40% split. 
This is especially useful for recommendations (a customer might partially belong to two segments).\n- GMM covariance structure: full covariance $\\Sigma_k$ can model elongated, tilted clusters. K-means uses squared Euclidean distance, implicitly assuming spherical clusters of equal variance.\n- When GMM matters: imbalanced cluster sizes/shapes, boundary uncertainty quantification, probabilistic downstream decisions.","A":"GMM has the Gaussian assumption, which can fail for non-Gaussian clusters. K-means (distance-based) may outperform GMM on non-Gaussian data. Neither is universally better.","B":"","C":"GMM is not K-means with a different distance. GMM is a probabilistic generative model fitted by EM; K-means is a non-probabilistic distance minimization. They have different objectives and produce qualitatively different outputs.","D":"GMM uses the EM (Expectation-Maximization) algorithm — not gradient descent. \"Soft\" refers to soft (probabilistic) cluster assignments, not the optimization method."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11009","difficulty":"medium","orderIndex":9,"question":"The Silhouette score for a K-means clustering result is 0.12 (scale -1 to +1). 
A data scientist says \"clustering failed because the score is close to 0.\" Is this the correct interpretation?","options":{"A":"A silhouette score of 0.12 is excellent — it is very close to the maximum possible value","B":"A silhouette score near 0 means the clustering is not much better than random cluster assignment — clusters overlap significantly and points are near the boundaries of multiple clusters; however, \"failure\" should be validated by checking if the data has any cluster structure at all (e.g., using the Gap statistic) before concluding clustering is inappropriate","C":"The silhouette score has no fixed interpretation threshold — 0.12 may indicate excellent clustering depending on the domain","D":"The score of 0.12 means exactly 12% of points are correctly clustered"},"correct":"B","explanation":{"correct":"- Silhouette score for point $i$: $s(i) = (b_i - a_i) / \\max(a_i, b_i)$ where $a_i$ = mean distance to the other points in its own cluster, $b_i$ = mean distance to the points of the nearest other cluster. Range: [-1, 1]. The reported score is the mean of $s(i)$ over all points.\n- Score near 0: $a_i \\approx b_i$ — the point is equally \"at home\" in its cluster and the nearest other cluster. This means poor cluster separation/cohesion.\n- Score near -1: the point is closer to another cluster (wrong assignment). Score near +1: tightly in its cluster, far from others (good).\n- General benchmarks: >0.7 = strong, 0.5-0.7 = reasonable, 0.25-0.5 = weak, <0.25 = no substantial structure — but these are guidelines, not hard rules.","A":"0.12 is not close to 1.0 (the maximum). The scale is -1 to +1; 0.12 is near the middle, indicating weak clustering.","B":"","C":"While domain context matters, there are established benchmarks for silhouette scores. 0.12 indicates weak structure in virtually any domain context.","D":"Silhouette score is not a percentage of correctly clustered points. It compares each point's cohesion (distance to its own cluster) with its separation (distance to the nearest other cluster). 
\"Correct\" assignment is undefined in unsupervised clustering."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11010","difficulty":"hard","orderIndex":10,"question":"K-means is applied to text data represented as TF-IDF vectors (sparse, high-dimensional). The resulting clusters appear random — high intra-cluster variance and low inter-cluster separation. What fundamental property of high-dimensional Euclidean space explains this failure?","options":{"A":"K-means fails on text because TF-IDF values are not normally distributed","B":"In high-dimensional spaces, the curse of dimensionality causes pairwise Euclidean distances to concentrate — all pairs of points have nearly equal distances; this makes the concept of \"nearest cluster\" ambiguous; additionally, TF-IDF vectors are sparse (most values zero), and cosine similarity (capturing directional similarity regardless of magnitude) is more appropriate than Euclidean distance for text","C":"K-means fails on text because it requires exactly 2 clusters","D":"The failure is caused by the TF normalization — removing the normalization fixes the Euclidean distance problem"},"correct":"B","explanation":{"correct":"- Distance concentration in high dimensions: as dimensions $d \\to \\infty$, the ratio $(\\max_{\\text{dist}} - \\min_{\\text{dist}}) / \\min_{\\text{dist}} \\to 0$. All points become nearly equidistant. K-means centroids are nearly equidistant from most points, so assignments become essentially random.\n- TF-IDF vectors are sparse (10,000 dimensions, ~100 non-zeros). Two documents about different topics: both have mostly-zero vectors. Dimensions where both vectors are zero contribute nothing to the Euclidean distance, so it is driven by the few non-zero terms and by vector magnitude (document length) rather than by topical overlap.\n- Cosine similarity: $\\cos(\\theta) = (x_i \\cdot x_j) / (||x_i|| \\cdot ||x_j||)$. The dot product is non-zero only in dimensions where both documents have non-zero values, and dividing by the norms removes the effect of document length. 
Captures topical overlap.\n- Fix: use K-means with cosine distance (spherical K-means) or use topic models (LDA) for text clustering.","A":"K-means makes no distribution assumption. The TF-IDF value distribution is not the cause of failure.","B":"","C":"K-means supports any K. The failure is a fundamental algorithmic-geometric issue, not a K value problem.","D":"TF normalization is a feature of TF-IDF that weights by document frequency — removing it makes representations worse, not better. It doesn't fix the Euclidean distance problem."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11011","difficulty":"hard","orderIndex":11,"question":"A K-means model is trained on 1 million data points with K=100 clusters. The resulting centroids are saved, and new data is assigned to the nearest centroid in production (no retraining). A data engineer notices some centroids have 0 assigned points in production even though they had many training points. What could cause this, and what are the consequences?","options":{"A":"0-assigned centroids indicate a bug in the assignment code — K-means guarantees every centroid has points","B":"The production data distribution has shifted (data drift) — the regions represented by those centroids no longer contain production data; those centroids are \"dead\" and waste model capacity; the model's effective K is reduced, leading to worse coverage of production distribution; the fix is periodic retraining or monitoring for concept drift","C":"0-assigned centroids always occur in K-means and can be safely ignored","D":"0-assigned centroids indicate the training data had duplicate points — removing duplicates before training fixes the issue"},"correct":"B","explanation":{"correct":"- \"Dead centroid\" problem: centroids that never win the nearest-centroid assignment race. At training time, this is handled by re-initializing empty centroids. 
In production (frozen centroids), data drift can cause centroids to represent regions of the feature space no longer populated by production data.\n- Consequence: the model treats some production regions as belonging to distant centroids, increasing effective intra-cluster variance where production data actually is.\n- Monitoring: track the number of assigned points per centroid in production. If centroids routinely have zero assignments, trigger model retraining.","A":"K-means training handles empty centroids by re-initializing them. In production serving (static centroids), there's no re-initialization — production data can easily miss some centroid regions if the distribution has shifted.","B":"","C":"0-assigned centroids at training time indicate initialization problems (fixed by K-means++ or re-initialization). At production time, they indicate distribution shift — not ignorable. Effective K reduction degrades performance.","D":"Duplicate training points would cause multiple identical centroids, not 0-assigned centroids. Data drift is the more likely cause when good K-means training is used."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11012","difficulty":"hard","orderIndex":12,"question":"A researcher runs K-means 20 times with different random seeds, selecting the best solution by WCSS. They then compare the stability of cluster assignments across runs: point assignments differ significantly between runs even though WCSS values are similar. 
What does this instability reveal about the dataset?","options":{"A":"The algorithm has a bug — K-means should produce identical results across runs with similar WCSS","B":"Similar WCSS with different assignments indicates multiple near-equivalent local optima — the data likely has weak cluster structure or clusters with similar densities; the WCSS landscape has many flat valleys of similar depth; this means the \"best\" clustering by WCSS is not much better than many other equally valid clusterings","C":"Instability means the optimal K is wrong — changing K will stabilize assignments","D":"WCSS similarity across runs guarantees assignment similarity — the observation described is mathematically impossible"},"correct":"B","explanation":{"correct":"- Multiple near-equivalent local optima occur when cluster boundaries are ambiguous — the data doesn't have well-separated, clearly defined groups. In this case, many different partitions achieve similar total WCSS because there's no single \"natural\" clustering.\n- This is an important diagnostic: cluster instability suggests the data may not have strong cluster structure. Forcing K clusters on data with no natural groups (or with K-1 natural groups) creates this landscape.\n- Assessment: compare the best WCSS found to the expected WCSS for randomly distributed data (Gap statistic). If there's no significant difference, the data may not be clusterable.","A":"K-means is non-deterministic (random initialization). Different seeds legitimately explore different parts of the optimization landscape. Similar WCSS values can occur at different local minima.","B":"","C":"Changing K might help if K is misspecified, but instability can persist at any K when the data has no strong cluster structure. Changing K is worth trying but isn't guaranteed to fix instability.","D":"WCSS is a scalar metric — many different clustering configurations can yield the same or similar WCSS values. 
Identical WCSS does not imply identical assignments."}},{"section":"machine-learning","topicSlug":"clustering","topic":"Clustering","id":"ml-11013","difficulty":"hard","orderIndex":13,"question":"A team uses clustering for customer segmentation with K=5, then trains a separate binary classifier (purchase prediction) on each cluster's data. This \"cluster-then-classify\" approach achieves 84% accuracy vs 81% for a single global model. A statistician warns: \"this evaluation may be optimistic.\" What is the statistical concern?","options":{"A":"Cluster-then-classify always produces optimistic results because it uses more models","B":"If the clustering and classification are both evaluated on the same data, or if the cluster boundaries are informed by the outcome variable (purchase), the evaluation is circular — the clusters may have been implicitly chosen to separate buyers from non-buyers; additionally, cluster assignments at test time require the new point to be assigned to a training cluster, introducing leakage if clustering was performed on train+test together","C":"The improvement from 81% to 84% is too small to be statistically significant","D":"The statistician is wrong — using separate models for each cluster is always more accurate than a global model"},"correct":"B","explanation":{"correct":"- Two sources of leakage in cluster-then-classify: (1) If K-means clustering used all data (train+test), the clusters encode test distribution information. 
(2) If K is selected or clusters are interpreted to maximize classification accuracy, the entire approach is optimizing on the test set.\n- Correct procedure: fit K-means on training data only → assign training points to clusters → train one classifier per cluster on training data → at test time, assign test points to nearest training centroid → apply the corresponding cluster classifier.\n- Even with correct procedure, the 84% vs 81% comparison requires statistical significance testing (e.g., McNemar's test for paired predictions) to confirm the improvement is real.","A":"\"Cluster-then-classify\" is not inherently optimistic. With proper train/test separation (no leakage), the comparison can be valid. The concern is about methodological correctness, not the approach itself.","B":"","C":"Whether the improvement is statistically significant is a separate (valid) concern, but the statistician's warning is about potential methodological flaws (leakage), not effect size.","D":"\"Always more accurate\" is false. A global model has more training data per classifier. Cluster-specific models have less data per cluster. For small datasets, global models may generalize better."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12001","difficulty":"easy","orderIndex":1,"question":"An Isolation Forest model is trained to detect anomalies in server logs. 
A data scientist explains: \"Isolation Forest works by trying to isolate each data point using random splits — anomalies are isolated faster.\" A colleague asks: \"why would anomalies be isolated faster than normal points?\" What is the correct mechanistic explanation?","options":{"A":"Anomalies are closer to the decision boundary in the feature space, so they require fewer splits","B":"Anomalies are points that are isolated in sparse regions of the feature space — each random split has a higher probability of separating an anomaly from other points because there are fewer nearby points; normal points in dense regions require many splits before they are separated from their neighbors; the average path length to isolation is shorter for anomalies","C":"Isolation Forest uses k-nearest neighbors to identify anomalies, and anomalies have fewer neighbors","D":"Anomalies are isolated faster because they have extreme feature values that make them easy to split at any threshold"},"correct":"B","explanation":{"correct":"- Isolation Forest builds random decision trees by selecting a random feature and a random split threshold. The path length to isolate a point is the number of splits needed.\n- Dense regions: many similar points in a small feature space volume. Splitting at any threshold still leaves many points together — many more splits are needed before a normal point is isolated.\n- Sparse regions (anomalies): few nearby points. Any split tends to separate the anomaly quickly.\n- Anomaly score: based on average path length across many trees. Short path length → anomaly. Long path length → normal. Score is normalized against the expected path length for a random dataset.","A":"\"Proximity to decision boundary\" is not the isolation mechanism. Isolation Forest doesn't compute decision boundaries — it measures path length in trees.","B":"","C":"Isolation Forest is tree-based, not distance-based. It doesn't compute k-nearest neighbors. 
LOF (Local Outlier Factor) is the k-NN-based anomaly detection method.","D":"Extreme values are one type of anomaly that Isolation Forest handles well, but the explanation is too narrow. Isolation Forest can detect anomalies in any sparse region, including multivariate anomalies that aren't extreme in any single feature."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12002","difficulty":"easy","orderIndex":2,"question":"A fraud detection system flags 1,000 transactions as fraudulent out of 1,000,000. A manager says \"only 1,000 alerts is very low — we should lower the threshold to catch more fraud.\" After lowering the threshold, 50,000 transactions are flagged. Only 2,000 of them are actual fraud. What metrics best characterize the trade-off?","options":{"A":"Accuracy — it measures how often the model is correct","B":"Precision (true positives / all flagged) and recall (true positives / all actual fraud) — at the original threshold: high precision (if most of 1,000 were real fraud), unknown recall; at the lower threshold: precision = 2,000/50,000 = 4%, but recall improved; the trade-off between alert volume (workload) and fraud caught (recall) is the core operational decision","C":"F1 score — it is the only metric that captures the trade-off correctly","D":"Accuracy — it is 99.8% before threshold lowering, proving the original model is perfect"},"correct":"B","explanation":{"correct":"- Assume, for illustration, a 1% fraud rate in the population: 10,000 actual fraud transactions out of 1,000,000.\n- Original threshold: 1,000 flagged. If all are real fraud: precision = 100%, recall = 1,000/10,000 = 10%. If half are real: precision = 50%, recall = 5%.\n- Lower threshold: 50,000 flagged, 2,000 real fraud. Precision = 4%, recall = 2,000/10,000 = 20%. Investigation workload increases 50×, yet fraud caught only doubles relative to the best case at the original threshold.\n- The PR (Precision-Recall) curve visualizes all threshold operating points. 
AUC-PR is more informative than AUC-ROC for heavily imbalanced anomaly detection.","A":"Accuracy is useless here — predicting \"no fraud\" for all transactions gives 99%+ accuracy because fraud is rare. Accuracy doesn't capture the false-negative cost (missed fraud) or false-positive cost (unnecessary investigation).","B":"","C":"F1 is one summary metric of the precision-recall trade-off, but it doesn't show the full trade-off curve. Separate precision and recall values are more interpretable for business decisions.","D":"A high accuracy figure such as 99.8% can coexist with missing the large majority of fraud, which is not a good outcome. Accuracy conflates rare-class performance with common-class performance."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12003","difficulty":"easy","orderIndex":3,"question":"Local Outlier Factor (LOF) gives a score of 4.2 to a data point. A student interprets this as \"the point is 4.2 standard deviations from the mean.\" Why is this interpretation incorrect?","options":{"A":"The interpretation is correct — LOF is based on z-scores","B":"LOF measures the ratio of the local density of a point's neighbors to its own local density — a score of 4.2 means the point's neighborhood is approximately 4.2× less dense than its neighbors' neighborhoods; it is not a standard deviation or a distance — it is a local density ratio; LOF is entirely non-parametric and makes no distributional assumptions","C":"LOF score of 4.2 means the point has 4.2 times more neighbors than average","D":"LOF is similar to z-score but uses median instead of mean"},"correct":"B","explanation":{"correct":"- LOF computation: for a point $p$, compute the $k$-distance (distance to $k$-th nearest neighbor), then the reachability distance (smoothed local distance), then the local reachability density (LRD = inverse of average reachability distance).\n- $\\text{LOF}(p) = \\frac{\\text{average LRD of } p\\text{'s neighbors}}{\\text{LRD}(p)}$. 
Values near 1: the point has similar density to its neighbors (normal). Values >>1: the point is in a much sparser region than its neighbors (outlier).\n- A score of 4.2 means the surrounding neighborhood is 4.2× denser than the point's own immediate vicinity — a significant density gap.","A":"LOF has nothing to do with z-scores. Z-score requires a global mean and standard deviation. LOF is local and non-parametric — it doesn't require any distributional assumptions.","B":"","C":"LOF measures density ratios, not neighbor counts. The $k$-NN count is fixed at $k$ for all points — the variation is in how far those $k$ neighbors are.","D":"LOF does not use median. It uses reachability distances and density ratios. There's no analogy to the z-score formula."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12004","difficulty":"easy","orderIndex":4,"question":"An autoencoder is trained on normal server behavior to detect anomalies. In production, it flags a batch of \"anomalous\" requests — but investigation shows these are normal, just from a new product feature launched yesterday. 
What fundamental assumption of reconstruction-error-based anomaly detection failed?","options":{"A":"The autoencoder threshold was set too low","B":"Reconstruction-error anomaly detection assumes the training distribution is representative of all future normal behavior — the new product feature represents a new normal pattern not in the training distribution; the autoencoder learned to reconstruct old normal patterns, so new legitimate patterns have high reconstruction error; this is concept drift (a change in what \"normal\" means)","C":"Autoencoders cannot be used for anomaly detection — this is an incorrect application","D":"The anomaly detection failed because the autoencoder was not deep enough"},"correct":"B","explanation":{"correct":"- Core assumption: train on normal data only → model learns to reconstruct normal patterns well → high reconstruction error = anomaly.\n- Violation: when the definition of \"normal\" changes (new features, seasonal patterns, product changes), the autoencoder flags legitimate new patterns as anomalies (false positives).\n- This is the stationarity assumption: the underlying data-generating process is stable. When it changes, the model becomes outdated.\n- Solutions: periodic retraining to include new normal patterns, incremental learning, or a human-in-the-loop review period when new features launch to recalibrate the threshold.","A":"Threshold adjustment is a potential short-term fix, but it doesn't address the underlying problem: the model doesn't know how to reconstruct the new product feature's patterns. Lowering the threshold would also reduce detection of real anomalies.","B":"","C":"Autoencoders are a well-established method for anomaly detection, widely used in network intrusion detection, fraud detection, and manufacturing quality control.","D":"Model depth affects representation capacity, not adaptation to distributional shift. 
A deeper autoencoder would still fail to reconstruct patterns it has never seen."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12005","difficulty":"easy","orderIndex":5,"question":"One-Class SVM (OCSVM) is trained only on normal data to create a decision boundary around normal instances. At test time, it classifies new points as \"normal\" or \"anomaly.\" How does OCSVM differ from a standard two-class SVM, and what does the nu parameter control?","options":{"A":"OCSVM uses two classes internally — it just doesn't tell the user","B":"Standard SVM separates two classes with a hyperplane between them; OCSVM learns a single hypersphere (or hyperplane from origin) that encompasses the normal data; the nu parameter controls the fraction of training points that are allowed to be outside the boundary (treated as outliers during training); smaller nu = looser boundary that encloses nearly all training points; larger nu = tighter boundary that leaves more training points outside","C":"OCSVM is identical to standard SVM except it removes the regularization term","D":"The nu parameter controls the kernel bandwidth, like the gamma parameter in RBF kernels"},"correct":"B","explanation":{"correct":"- OCSVM objective: find a hyperplane that separates the training data from the origin with maximum margin in feature space. Points on the origin side are anomalies.\n- nu ∈ (0, 1]: upper bound on the fraction of outliers in training data AND lower bound on the fraction of support vectors. Setting nu=0.05 means: at most 5% of training points may fall outside the boundary, and at least 5% are support vectors.\n- High nu: tighter boundary (more training points are left outside, so more points are flagged as anomalies at test time). Low nu: looser boundary (nearly all training points are enclosed, so fewer points are flagged).\n- Practical note: OCSVM is sensitive to feature scaling and the choice of kernel — preprocessing and hyperparameter tuning are critical.","A":"OCSVM genuinely trains on one class only. 
It doesn't simulate two classes. The decision boundary is defined relative to the origin in the kernel feature space.","B":"","C":"OCSVM has a different formulation and objective than two-class SVM. The regularization approach differs, and the decision function measures distance from the origin rather than distance between two class hyperplanes.","D":"The gamma parameter in RBF kernel controls bandwidth. nu is a separate regularization parameter. They can both be tuned but are independent."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12006","difficulty":"medium","orderIndex":6,"question":"An anomaly detection model for manufacturing defects achieves: 99.7% recall (catches 99.7% of defects), but precision is 18% (82% of flagged items are false positives). Manufacturing halts production to inspect every flagged item. A business analyst asks: \"should we sacrifice some recall to improve precision?\" What framework should guide this decision?","options":{"A":"Always maximize recall — missing defects is always worse than false alarms","B":"The decision requires comparing the asymmetric costs: cost of a missed defect (e.g., customer harm, recall campaign, warranty cost) vs cost of a false positive (production halt, inspection time, lost throughput); if a missed defect costs $1,000,000 and a false positive costs $200, precision 18% may be acceptable; if costs are more balanced, improving precision at the cost of recall is justified","C":"F1 score = 2×precision×recall / (precision+recall) should always be maximized — it balances both metrics optimally","D":"Precision and recall cannot both be considered in the same decision — you must choose one metric to optimize"},"correct":"B","explanation":{"correct":"- Decision theory: minimize expected cost = $C_{FN} \\times FN + C_{FP} \\times FP$ where $C_{FN}$ = cost of missed defect, $C_{FP}$ = cost of false alarm.\n- At 18% precision: for every real defect caught, 4.6 
false alarms are generated. This is acceptable if $C_{FN} / C_{FP} > 4.6$. If a missed defect causes field failure (high $C_{FN}$) and inspections are cheap, high recall is worth the false positive cost.\n- PR curve: plot precision vs recall at all thresholds. Find the operating point where cost is minimized given the business cost ratio.","A":"\"Always maximize recall\" assumes missed defects have infinite cost. In practice, production shutdowns and inspection cost money too. The optimal trade-off depends on cost asymmetry.","B":"","C":"F1 assumes equal cost for false positives and false negatives ($C_{FP} = C_{FN}$). In manufacturing, these costs are typically very different. Maximizing F1 is not appropriate when costs are asymmetric.","D":"Precision and recall must both be considered — they capture different types of errors. The PR curve and cost analysis are specifically designed to navigate this joint consideration."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12007","difficulty":"medium","orderIndex":7,"question":"An Isolation Forest model is trained on network traffic data where 0.1% of traffic is anomalous (malicious). The `contamination` hyperparameter is set to 0.1. 
What does this parameter do, and what are the consequences of setting it incorrectly?","options":{"A":"The contamination parameter controls the number of trees in the forest — 0.1 means 10 trees","B":"The contamination parameter sets the expected proportion of anomalies (10% in this case) — it is used to determine the decision threshold: the 10% of points with the most anomalous scores are labeled anomalies; setting contamination=0.1 when the true anomaly rate is 0.001 means the model flags about 100× too many points as anomalies, inflating false positives dramatically","C":"The contamination parameter controls the subsample size for each tree","D":"Setting contamination to any value > 0.05 prevents Isolation Forest from working correctly"},"correct":"B","explanation":{"correct":"- Isolation Forest produces anomaly scores for all points. The contamination parameter is used to set the decision threshold: if contamination=0.1, the threshold is set so that the 10% of training points with the most anomalous scores are labeled anomalous.\n- True anomaly rate ≈ 0.1% (0.001), but contamination=0.1 means 10% are flagged, about 100× the true rate. Even if every real anomaly is caught, at least 99% of flagged points are false positives.\n- Setting contamination too low: may miss real anomalies (threshold too strict). Setting too high: floods output with false positives.\n- Best practice: use the raw anomaly scores and evaluate on a labeled validation set to select the threshold that minimizes the cost function for the specific application.","A":"The number of trees is controlled by `n_estimators` (default 100). Contamination has nothing to do with tree count.","B":"","C":"The subsample size is controlled by `max_samples` (by default 256, or the full dataset if it is smaller). Contamination only affects the decision threshold, not the forest structure.","D":"There is no hard limit on contamination. 
Values > 0.5 would be unusual (flagging the majority as anomalies) but the algorithm still runs."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12008","difficulty":"medium","orderIndex":8,"question":"A statistical anomaly detection system flags data points more than 3 standard deviations from the mean (z-score > 3). On a dataset with 1,000,000 observations following a normal distribution, approximately how many points does this flag, and what is the false positive rate for a truly normal dataset?","options":{"A":"3 standard deviations catches all anomalies — 0 points from a normal distribution are flagged","B":"By the empirical rule, 99.73% of normal data falls within 3σ — approximately 0.27% (2,700 points out of 1,000,000) are flagged; on a truly normal dataset, all 2,700 flagged points are false positives; the 3σ rule has a 0.27% false positive rate, which at scale generates many false alarms","C":"3σ catches exactly 3 points per million from a normal distribution","D":"The 3σ rule is exact — any point beyond 3σ is definitively anomalous regardless of the true distribution"},"correct":"B","explanation":{"correct":"- Normal distribution: $P(|Z| > 3) = 0.0027 = 0.27\\%$. For n=1,000,000: $\\approx 2,700$ expected false positives.\n- The 3σ rule is a heuristic from quality control (6-sigma manufacturing) — it assumes data is normally distributed. For skewed distributions (log-normal, heavy-tailed), the tail probability is completely different.\n- Bonferroni correction: for multiple simultaneous tests, adjust the threshold. 
Testing 1,000,000 points, each at α=0.0027, produces ~2,700 expected false positives even with no real anomalies.\n- Better approach for large datasets: use extreme value theory (EVT) for threshold setting, or model-based anomaly detection that accounts for the actual distribution.","A":"The 3σ rule defines an outlier region — it doesn't catch \"all anomalies.\" For normal data, it systematically flags the tails. \"All anomalies are caught\" would require a threshold of 0.","B":"","C":"The expected number is 2,700 (0.27% of 1,000,000), not 3. The \"3 per million\" figure corresponds to the 4.5σ rule, not 3σ.","D":"3σ is a probabilistic threshold, not an absolute ground truth. Points beyond 3σ from a normal distribution are rare but not definitively anomalous — they occur with 0.27% probability in normal data."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12009","difficulty":"hard","orderIndex":9,"question":"An autoencoder-based anomaly detector achieves high performance on a validation set. In a red team exercise, an adversary injects carefully crafted anomalies that maintain low reconstruction error. 
How can the adversary craft such inputs, and what defense mitigates this?","options":{"A":"An adversary cannot craft low-reconstruction-error anomalies — autoencoders definitionally fail to reconstruct unseen patterns","B":"If the adversary knows the autoencoder architecture and weights, they can use gradient descent to find an input that (1) is anomalous by some external criterion (e.g., contains malicious payload) but (2) has low reconstruction error; by optimizing the anomalous input to minimize reconstruction loss while maintaining the malicious property, the adversary evades detection; defenses include adversarial training (train on adversarially perturbed inputs), ensemble methods, and feature obfuscation","C":"The only defense against adversarial anomalies is using a deeper autoencoder","D":"Adversarial examples only affect classification models, not autoencoders"},"correct":"B","explanation":{"correct":"- Adversarial reconstruction attack: given trained autoencoder with encoder $f$ and decoder $g$, minimize $||x - g(f(x))||^2$ subject to $x$ containing anomalous content. This is an optimization problem the adversary can solve if they have model access (white-box attack).\n- Example: a malicious network packet designed to look like normal traffic in the feature space the autoencoder monitors (e.g., using normal-looking headers while hiding payload in less-monitored fields).\n- Defenses: (1) adversarial training — include adversarially perturbed samples in training; (2) ensemble of diverse autoencoders with different architectures; (3) variational autoencoders (VAEs) that penalize out-of-distribution samples in the latent space; (4) monitoring both reconstruction error AND latent space distance from the training distribution.","A":"This assumes the autoencoder perfectly generalizes from reconstruction error to anomaly detection — it doesn't. 
The reconstruction manifold can be exploited precisely because the autoencoder's learned manifold doesn't perfectly align with the anomaly boundary.","B":"","C":"Depth alone doesn't prevent adversarial attacks. Deeper models can be attacked just as effectively; in fact, more expressive models may have larger adversarial subspaces.","D":"Adversarial examples extend to any differentiable function, including autoencoders. The gradient of reconstruction loss with respect to input is well-defined and exploitable."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12010","difficulty":"hard","orderIndex":10,"question":"LOF is applied with k=5 on a dataset with two regions: Region A (1,000 densely packed normal points) and Region B (10 scattered normal points). Points in Region B get LOF scores of 3-8, despite being from the same generating process as Region A points. What causes this, and what does it reveal about LOF's assumptions?","options":{"A":"LOF correctly identifies Region B points as anomalies — they are statistically less common","B":"LOF measures local density relative to a point's local neighborhood — Region B points have sparse local neighborhoods; their LRD is low; their neighbors (some from Region A's sparse border) have higher LRD; the LOF ratio (neighbors' LRD / point's LRD) exceeds 1, flagging Region B as anomalous; this reveals LOF's implicit assumption that all normal data has similar local density — it fails on genuinely multimodal or multi-density datasets","C":"LOF failure is caused by using k=5 — using k=50 would fix the issue","D":"Region B points are genuinely anomalous — the scattered distribution proves they are rare events"},"correct":"B","explanation":{"correct":"- LOF assumes: normal regions are dense, anomalies are sparse. Region B violates this — it contains sparse but legitimate points.\n- LRD of Region B points = low (sparse neighborhood). 
LRD of some Region A neighbors (on the border) = moderate. LOF ratio > 1 → flagged as anomaly.\n- This is the multi-density problem: LOF cannot distinguish between \"sparse because anomalous\" and \"sparse because it's a legitimate sparse cluster.\"\n- Solutions for multi-density scenarios: LOCI (Local Correlation Integral), HBOS (Histogram-Based Outlier Score), or domain-specific thresholds per region.","A":"\"Statistically less common\" doesn't make something anomalous if it's generated by a known, legitimate process. Anomalous means unexpected or pathological, not just rare.","B":"","C":"Changing k adjusts the neighborhood scale. With k=50, Region B points would include Region A neighbors in their neighborhood, potentially reducing LOF scores, but this is a workaround, not a principled fix. The underlying multi-density problem remains.","D":"Scattered distribution ≠ rare events. Region B could represent a legitimate sparse subpopulation (e.g., a small category of customers with naturally sparse feature patterns)."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12011","difficulty":"hard","orderIndex":11,"question":"A variational autoencoder (VAE) is used for anomaly detection. Instead of reconstruction error alone, the anomaly score combines reconstruction error and the KL divergence of the latent space distribution from the prior. 
Why does adding KL divergence improve anomaly detection compared to reconstruction error alone?","options":{"A":"KL divergence replaces reconstruction error entirely — it is a strictly better metric","B":"Standard autoencoders can memorize unusual inputs with low reconstruction error if they have sufficient capacity — the latent representation of an anomalous input may land anywhere in the latent space; the VAE's KL term regularizes the latent space toward a known prior (N(0,1)), so anomalous inputs that produce unusual latent codes (far from the prior) are penalized by the KL term even if reconstruction error is low","C":"KL divergence in VAEs measures the distance between input and output distributions","D":"Adding KL divergence makes the VAE ignore reconstruction error in the anomaly score"},"correct":"B","explanation":{"correct":"- Standard AE failure mode: anomalous inputs may be \"memorized\" or happen to fall in a region of the latent space where the decoder produces a good reconstruction (especially if the anomaly has patterns similar to some normal data in individual features).\n- VAE anomaly score: $\\text{score}(x) = \\text{Reconstruction Error}(x) + \\lambda \\cdot D_{KL}(q(z|x) || p(z))$.\n- $D_{KL}$ term: penalizes latent encodings that deviate from the standard normal prior. An anomalous input that produces unusual latent statistics $(\\mu, \\sigma)$ contributes high KL divergence.\n- Combined score: catches both types of anomalies — those with high reconstruction error (strange patterns) and those with unusual latent representations (off-manifold in latent space).","A":"KL divergence and reconstruction error are complementary. Reconstruction error catches inputs the decoder cannot reconstruct; KL divergence catches inputs that produce unusual latent codes. 
Using both covers more failure modes.","B":"","C":"KL divergence in VAEs measures the divergence of the posterior $q(z|x)$ (learned encoder distribution) from the prior $p(z)$ (typically N(0,I)). It does not compare input and output distributions.","D":"Both terms are part of the ELBO loss and the anomaly score. The weight $\\lambda$ can be tuned, but adding KL divergence does not eliminate reconstruction error from the score."}},{"section":"machine-learning","topicSlug":"anomaly-detection","topic":"Anomaly Detection","id":"ml-12012","difficulty":"hard","orderIndex":12,"question":"An Isolation Forest and a One-Class SVM are compared on a dataset where anomalies are dense clusters of similar malicious patterns (not scattered outliers). Isolation Forest performs poorly; OCSVM performs well. Explain this counter-intuitive result.","options":{"A":"Isolation Forest always outperforms OCSVM — the result indicates a bug","B":"Isolation Forest assumes anomalies are isolated (sparse, in low-density regions) — if anomalies cluster together, they form their own dense region; Isolation Forest cannot distinguish dense anomaly clusters from dense normal clusters; OCSVM learns the boundary of the normal region — even if anomalies cluster, they fall outside the normal support and are correctly flagged","C":"OCSVM performs better because it uses a kernel — Isolation Forest's trees cannot handle kernel-transformed data","D":"The result indicates overfitting in OCSVM — it perfectly memorized the anomaly clusters from training data"},"correct":"B","explanation":{"correct":"- Isolation Forest assumption: anomalies are isolated in the feature space. 
When anomalies form a dense cluster (e.g., a botnet generating coordinated traffic with consistent patterns), the cluster requires many splits to isolate — Isolation Forest assigns it a long isolation path → low anomaly score → classified as normal.\n- This is Isolation Forest's fundamental weakness: clustered anomalies defeat the isolation criterion.\n- OCSVM learns a closed boundary (hyperplane from origin in kernel space) around the normal data. A dense cluster of anomalies is simply outside this boundary — OCSVM correctly flags the entire cluster regardless of its density.\n- Other methods that handle clustered anomalies: deep one-class classification, robust covariance estimation.","A":"Both algorithms have known failure modes. Isolation Forest is known to fail for clustered anomalies. This is not a bug — it is documented behavior.","B":"","C":"Isolation Forest using trees doesn't prevent kernel-based comparison. Isolation Forest's failure is algorithmic (isolation path length for dense anomaly clusters), not a limitation related to kernel methods.","D":"OCSVM performing well on unseen anomalies is correct generalization, not overfitting. OCSVM was trained only on normal data — it cannot \"memorize\" anomaly clusters it never saw."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13001","difficulty":"easy","orderIndex":1,"question":"A team trains 10 decision tree classifiers, each on a different random subset of training data (with replacement). They combine predictions by majority vote. 
What is this ensemble technique, and what property of the base models is essential for this approach to improve over a single tree?","options":{"A":"Boosting — the models must have high accuracy individually","B":"Bagging (Bootstrap Aggregating) — the base models must be diverse (low correlation between errors); if all trees make the same mistakes (highly correlated), majority voting cannot average out errors; diversity is achieved by training on different bootstrap samples and typically by random feature subsampling","C":"Stacking — the base models must make different types of predictions (regression vs classification)","D":"The combination method doesn't matter — any 10 models will always outperform one model"},"correct":"B","explanation":{"correct":"- Bagging: train $B$ models on bootstrap samples (sample $n$ points with replacement from training set); combine by averaging (regression) or majority vote (classification).\n- Why it works: if each model has error rate $e$ and errors are independent, the majority vote error decreases exponentially with $B$. Specifically, for binary classification with $e < 0.5$ and independent errors: $P(\\text{ensemble wrong}) = \\sum_{k > B/2} \\binom{B}{k} e^k (1-e)^{B-k} \\ll e$.\n- Essential condition: errors must be uncorrelated (diverse models). If all models are identical, ensemble error = individual error. Bootstrap sampling and feature subsampling introduce diversity.","A":"Boosting is a sequential procedure (each model focuses on previous errors). Bagging is a parallel procedure. High individual accuracy is helpful but not the essential requirement — diversity is.","B":"","C":"Stacking uses a meta-learner to combine base model predictions. It doesn't require models of different types, and the base models typically make the same type of prediction.","D":"10 identical models perform the same as 1 model. 
Improvement requires diversity — uncorrelated errors across base models."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13002","difficulty":"easy","orderIndex":2,"question":"A hard voting classifier combines predictions from 5 classifiers: 3 predict \"class A\" and 2 predict \"class B.\" A soft voting classifier uses the predicted probabilities: the same 5 classifiers output average probabilities of $P(A) = 0.42$ and $P(B) = 0.58$. The two voting methods disagree. Which is more reliable and why?","options":{"A":"Hard voting is more reliable because it uses the final predictions, not uncertain probability estimates","B":"Soft voting is generally more reliable because it uses the full probability distribution — 3 classifiers may predict \"A\" with low confidence (e.g., 55%) while 2 predict \"B\" with high confidence (e.g., 90%); soft voting weights contributions by confidence; hard voting treats a 55% confident prediction the same as a 99% confident prediction","C":"Both methods are equivalent — they always produce the same result","D":"Hard voting is more reliable for classification; soft voting is only for regression"},"correct":"B","explanation":{"correct":"- Hypothetical breakdown: models 1,2,3 predict A with P(A) = 0.55, 0.56, 0.57 (marginally A); models 4,5 predict B with P(B) = 0.70, 0.85 (strongly B). Hard voting: 3 votes for A → predicts A. Soft voting: avg P(A) = (0.55+0.56+0.57+0.30+0.15)/5 = 0.426 → predicts B.\n- Soft voting correctly captures that models 4,5 are much more certain about their prediction than models 1,2,3. Hard voting ignores this signal.\n- Precondition for soft voting: classifiers must produce well-calibrated probabilities. Poorly calibrated probabilities (e.g., naive Bayes overconfidence) can degrade soft voting.","A":"Hard voting's use of \"final predictions\" loses information. 
Confidence matters: a weakly confident majority should not override a strongly confident minority.","B":"","C":"Hard and soft voting can disagree (as in this example). They are not equivalent. When all models agree, the result is the same, but disagreements expose the methodological difference.","D":"Both hard and soft voting work for classification. Averaging is used for regression, but soft voting (averaging class probabilities) applies specifically to classification."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13003","difficulty":"easy","orderIndex":3,"question":"Random Forest and a bagged decision tree ensemble both use bootstrap sampling. A data scientist says they are the same algorithm. What key feature distinguishes Random Forest from standard bagging of decision trees?","options":{"A":"Random Forest uses boosting instead of bagging","B":"Random Forest adds feature subsampling at each split: when growing each tree, only a random subset of $\\sqrt{p}$ features (for classification) is considered at each node split; this additional randomization reduces correlation between trees beyond what bootstrap sampling alone achieves, further improving ensemble diversity","C":"Random Forest uses pruned trees; standard bagging uses full-depth trees","D":"Random Forest trains trees in sequence (each tree depends on the previous), unlike parallel bagging"},"correct":"B","explanation":{"correct":"- Standard bagging: each tree is trained on a different bootstrap sample, but each split considers all $p$ features. Trees will tend to use the same dominant features at the top levels → correlated trees → diminished ensemble benefit.\n- Random Forest: additionally samples $m$ features at each split ($m \\approx \\sqrt{p}$ for classification, $m \\approx p/3$ for regression). 
Prevents any single dominant feature from appearing at the top of every tree.\n- Consequence: RF trees are more diverse (less correlated) than bagged trees. The variance reduction from averaging uncorrelated models is larger.","A":"Both Random Forest and bagging are parallel ensemble methods. Boosting is a different family (sequential, focuses on difficult examples).","B":"","C":"Both Random Forest and bagging typically use full-depth (unpruned) trees. Deep trees have low bias individually; the ensemble reduces variance. Pruning would increase bias.","D":"Both bagging and Random Forest are parallel — trees are independent and can be trained simultaneously. Boosting methods (AdaBoost, gradient boosting) use sequential training."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13004","difficulty":"easy","orderIndex":4,"question":"A stacking ensemble is built with 5 base classifiers and one meta-learner. The meta-learner is trained on the base classifiers' predictions. A team member trains the meta-learner on the same training data used for the base classifiers. A reviewer flags this as a problem. 
Why?","options":{"A":"The meta-learner must always be a logistic regression — other meta-learners cause issues","B":"If the base classifiers are trained and evaluated on the same data, their predictions on training samples are over-optimistic (they have memorized training data to some degree); the meta-learner learns to trust these over-optimistic predictions; at test time, base classifiers' predictions are more uncertain, and the meta-learner is miscalibrated; the correct approach is out-of-fold cross-validation predictions for the meta-learner's training data","C":"The problem only occurs if the base classifiers use gradient boosting","D":"The reviewer is wrong — using the same training data for meta-learner is the standard approach"},"correct":"B","explanation":{"correct":"- Correct stacking procedure (out-of-fold): (1) Split training data into K folds. (2) For each fold, train base classifiers on the other K-1 folds and predict on the held-out fold. (3) Collect out-of-fold predictions for all training samples. (4) Train meta-learner on these OOF predictions. (5) Retrain base classifiers on all training data for final model.\n- This ensures meta-learner is trained on predictions that reflect each base classifier's true generalization ability, not in-sample performance.\n- Without OOF: base classifiers with high in-sample accuracy (overfitting) appear perfect on training data, causing the meta-learner to over-trust them.","A":"The meta-learner can be any model — logistic regression, gradient boosting, neural network. Logistic regression is commonly used for interpretability and to avoid overfitting the meta-level, but it's not required.","B":"","C":"The problem applies to all base classifiers that can overfit training data, not just gradient boosting. Even simple models have slightly better performance on training data.","D":"Using the same training data is the common but incorrect approach. 
Out-of-fold is the correct approach and is what scikit-learn's StackingClassifier (through its `cv` parameter) and mlxtend's StackingCVClassifier implement."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13005","difficulty":"medium","orderIndex":5,"question":"A team builds a blending ensemble: 5 base models trained on 70% of training data, validated on 30% (holdout), then a meta-learner trained on the holdout predictions. They claim this is equivalent to stacking with cross-validation. A statistician disagrees. What is the difference?","options":{"A":"Blending and stacking are identical — the statistician is wrong","B":"Blending uses a single holdout set for meta-learner training — this wastes 30% of training data that base models never see; it also risks overfitting the meta-learner to the specific holdout distribution if the holdout is small; stacking with K-fold uses all training data for both base models (via OOF) and provides more samples for meta-learner training; for large datasets the difference is minimal, but for small datasets blending can significantly underperform","C":"Blending is always better because base models see more data (70% vs K-fold's (K-1)/K fraction)","D":"The only difference is computational — both approaches are statistically equivalent"},"correct":"B","explanation":{"correct":"- Blending: base models train on 70%, meta-learner trains on holdout (30%). Because the base models never saw the holdout set, the holdout predictions are free of leakage. The cost is that 30% of the data is withheld from base model training, and the meta-learner sees only the holdout samples.\n- K-fold stacking: base models are trained on (K-1)/K of training data, and out-of-fold predictions cover all training samples. Meta-learner trains on predictions for all N training examples.\n- With N=1,000: blending gives meta-learner 300 training samples; 5-fold stacking gives 1,000. 
More meta-learner training data = better generalization of the meta-learner.","A":"They differ in how much data is available for the meta-learner and how much data base models see during their training phase. These are statistically meaningful differences.","B":"","C":"Base models in blending do see 70% of data. But with K-fold stacking, base models also see approximately (K-1)/K ≈ 80% (5-fold) of data during each OOF fold — AND more samples are available for the meta-learner.","D":"The approaches are not statistically equivalent. The meta-learner sample count difference has a real effect on meta-learner generalization, especially on small datasets."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13006","difficulty":"medium","orderIndex":6,"question":"A team tries to improve model performance by adding more base classifiers to a Random Forest (from 100 trees to 1,000 trees). Performance on the validation set plateaus after 300 trees. A manager wants to add 10,000 trees. Why is this computationally unjustifiable, and what should they do instead?","options":{"A":"Random Forests always improve monotonically with more trees — 10,000 trees would definitely help","B":"After the ensemble variance is sufficiently reduced (diminishing returns in variance reduction), adding more trees does not improve generalization — it only increases inference cost linearly; the plateau after 300 trees indicates the ensemble has converged; resources are better spent on feature engineering, hyperparameter tuning, or trying a different algorithm","C":"10,000 trees would cause overfitting — RF overfits with too many trees","D":"Random Forest cannot support more than 1,000 trees due to memory constraints in standard implementations"},"correct":"B","explanation":{"correct":"- RF bias-variance: RF reduces variance by averaging many trees. 
As $B \\to \\infty$, the variance converges to $\\rho \\sigma^2$ where $\\rho$ = average correlation between trees and $\\sigma^2$ = single tree variance. Adding more trees beyond convergence doesn't reduce this lower bound.\n- Law of diminishing returns: variance reduction from tree $k$ decreases as $1/k^2$. Most variance reduction happens in the first few hundred trees.\n- More trees: monotonically (weakly) improve training fit but provide no generalization improvement past convergence. They increase prediction time O(B) and memory O(B).","A":"This is false. Random Forests converge as $B \\to \\infty$. Once the ensemble has enough trees to estimate the expected prediction well, adding more provides no benefit. This is a well-known theoretical property.","B":"","C":"More trees do NOT cause RF overfitting. RF overfitting is controlled by individual tree depth (max_depth, min_samples_split), not by the number of trees. This is a common misconception — more trees can never overfit, they just stop helping.","D":"There is no standard implementation limit of 1,000 trees. Sklearn's RandomForestClassifier supports any number. The constraint is computational budget, not implementation."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13007","difficulty":"medium","orderIndex":7,"question":"A company runs a critical ML system in production. They decide to use AdaBoost instead of Random Forest because \"AdaBoost focuses on hard examples and should be better.\" At test time, a few corrupted inputs (extreme feature values due to sensor malfunction) cause AdaBoost to make catastrophically wrong predictions. 
Why is AdaBoost more vulnerable to this than Random Forest?","options":{"A":"AdaBoost uses more trees, amplifying the effect of corrupted inputs","B":"AdaBoost assigns high weights to misclassified training examples and trains subsequent models to focus on them — corrupted training examples (if present) get amplified weights; at test time, AdaBoost is also more sensitive to outlier inputs because its aggregated prediction assigns disproportionate weight to weak learners trained on difficult/noisy regions; Random Forest's uniform averaging is more robust to individual outlier inputs","C":"AdaBoost is no more sensitive than Random Forest — the vulnerability is due to insufficient data preprocessing","D":"Random Forest is more vulnerable to corrupted inputs because it uses full-depth trees"},"correct":"B","explanation":{"correct":"- AdaBoost weighting: misclassified points get exponentially higher weights in subsequent rounds. If corrupted training examples exist, they get amplified weights — the model dedicates significant capacity to fitting noise.\n- Cascading sensitivity: final AdaBoost prediction is a weighted sum of weak learner predictions, where later learners (focused on hard/corrupted examples) have specific regions of high sensitivity.\n- Random Forest robustness: (1) bootstrap sampling means each tree sees a random subset of training data — a corrupted point only appears in ~63% of trees; (2) uniform averaging dilutes any single tree's response to an extreme input.\n- In practice: AdaBoost is significantly more sensitive to noisy labels and outlier inputs; preprocessing and outlier removal are essential before AdaBoost.","A":"AdaBoost doesn't necessarily use more trees. The vulnerability is in the re-weighting mechanism, not tree count.","B":"","C":"Preprocessing is necessary for both methods when corruption is present. 
But AdaBoost is architecturally more sensitive even after preprocessing, due to the amplifying weight scheme.","D":"Random Forest uses full-depth trees but they are averaged uniformly. Full depth increases individual tree variance, but the ensemble averaging provides robust protection against individual extreme inputs."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13008","difficulty":"medium","orderIndex":8,"question":"A data scientist builds a diverse stacking ensemble with 5 base models: logistic regression, SVM, Random Forest, gradient boosting, and KNN. The meta-learner is a neural network. Despite strong individual base model performance (all >80% accuracy), the ensemble achieves only 81% — barely better than the best single model. What might explain this?","options":{"A":"Stacking always outperforms individual models — 81% accuracy means implementation error","B":"If the base models are highly correlated in their predictions (they agree on the same hard examples), the meta-learner has little complementary signal to exploit; the meta-learner may also overfit to the training meta-features if data is limited; additionally, if one model dominates (e.g., gradient boosting at 87%), the meta-learner may simply learn to trust it almost exclusively","C":"The problem is the neural network meta-learner — it should be replaced with logistic regression","D":"Diverse architectures cannot be stacked together — the meta-learner requires homogeneous base models"},"correct":"B","explanation":{"correct":"- Stacking gains come from complementarity: when models make different mistakes, the meta-learner can combine them better than any individual. If all 5 models misclassify the same 20% of examples (highly correlated errors), the meta-learner cannot fix those cases.\n- Practical checks: compute pairwise correlation of base model predictions. 
If all correlations > 0.95, stacking provides little benefit.\n- Meta-learner overfitting: with limited training data and a flexible meta-learner (neural network), the meta-learner may fit training meta-features rather than generalizing.","A":"Stacking does not always outperform individual models. Benefits depend on error diversity among base models. Strong individual model + correlated errors → minimal stacking benefit.","B":"","C":"Replacing the meta-learner with logistic regression may help regularize the meta-level, but the fundamental issue is base model correlation. Logistic regression cannot create error diversity that the base models lack.","D":"Heterogeneous base models (different architectures) stack perfectly well. In fact, diversity in base model types is often recommended to increase prediction diversity."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13009","difficulty":"hard","orderIndex":9,"question":"A team uses stacking but finds that the gradient boosting base model dominates the meta-learner's weights (coefficient ≈ 0.95, others ≈ 0.01). They add more base models to fix this. 
A statistician says \"adding correlated models won't help; adding regularization to the meta-learner is more principled.\" Explain why regularization is more principled.","options":{"A":"Regularization reduces the number of base models needed, saving computation","B":"The meta-learner's high weight on gradient boosting reflects the data's signal: if GB genuinely provides 95% of the predictive value, adding more correlated base models just adds noise to the meta-features; L1/L2 regularization on the meta-learner constrains weight magnitudes, preventing extreme dominance by any single base model and producing more stable, calibrated meta-weights; adding more correlated base models can reduce diversity and even introduce multicollinearity in meta-features","C":"Regularization fixes the issue by removing the gradient boosting model from consideration","D":"This situation can only be fixed by switching from stacking to boosting"},"correct":"B","explanation":{"correct":"- Meta-feature multicollinearity: if correlated base models are added (e.g., 3 different gradient boosting variants), the meta-features are correlated. Correlated meta-features destabilize ordinary least squares (OLS) meta-learner coefficients — small changes in data produce large weight swings.\n- L2 regularization (Ridge meta-learner): penalizes large weights, shrinking all coefficients toward zero. Even if GB provides most signal, L2 prevents the coefficient from reaching 0.95 — ensures small, stable contributions from other models.\n- More models without regularization: adds collinear meta-features, potentially destabilizing the already-dominant GB coefficient.","A":"Regularization controls weight distribution, not the number of base models needed. You still need all models to generate meta-features.","B":"","C":"L2/L1 regularization shrinks all coefficients toward zero — it does not remove any base model. 
L1 (Lasso) may zero out some coefficients, but this is feature selection, not forced inclusion.","D":"Boosting is a fundamentally different algorithm (sequential, adaptive weighting). Switching to boosting doesn't address the meta-learner weight distribution issue in stacking."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13010","difficulty":"hard","orderIndex":10,"question":"A Random Forest's feature importance is computed as the mean decrease in Gini impurity across all trees. Features A and B are highly correlated (correlation = 0.98). Feature A has importance 0.35, Feature B has importance 0.04. A data scientist drops Feature B and retrains. The model's accuracy drops by 2%. What explains this?","options":{"A":"Feature B's importance of 0.04 correctly reflects it as unimportant — the accuracy drop is coincidental","B":"When correlated features compete for splits, the two cannot both be selected at the same node; whichever feature wins a split gets \"credit\" for the impurity reduction; Feature A appears dominant because it won the split in many trees; Feature B's measured importance is artificially suppressed by A's presence, so 0.04 understates the predictive signal B carries; dropping B removes the part of that signal A does not duplicate, and the accuracy drop reveals B's true contribution","C":"Gini importance is always accurate for correlated features — the problem is in the retrained model's hyperparameters","D":"Correlated features always have identical feature importance — A=0.35 and B=0.04 proves they are not actually correlated"},"correct":"B","explanation":{"correct":"- This is the correlated feature importance instability problem in tree-based methods. When A and B provide nearly the same information (r=0.98), each split picks only one of them; the selected feature gets the importance credit for that split, and the other gets none.\n- The measured importances are unstable across different random seeds and bootstrap samples. 
If a different seed causes B to be selected more often, B's importance would be 0.35 and A's would be 0.04.\n- Implication: do not use Random Forest Gini importance to select features from correlated groups. Use permutation importance (which measures actual prediction degradation) or SHAP values, which distribute credit more fairly among correlated features.","A":"2% accuracy drop from removing a \"0.04 importance\" feature is diagnostic evidence that Gini importance understated B's value. The drop is not coincidental — it reflects the information lost.","B":"","C":"Gini importance is specifically known to be unreliable for correlated features. This is a well-documented limitation. Use SHAP or permutation importance for correlated feature evaluation.","D":"A=0.35 and B=0.04 despite r=0.98 correlation is precisely the artifact. High correlation and very different importances indicate one is suppressing the other — not that they aren't correlated."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13011","difficulty":"hard","orderIndex":11,"question":"In gradient boosting, each tree is fitted to the negative gradient of the loss function. For log-loss (binary cross-entropy): $r_i = y_i - \\hat{p}_i$ (the residuals are the difference between actual labels and predicted probabilities). 
A data scientist says \"gradient boosting for classification is just boosted regression on residuals.\" What subtle distinction does this miss?","options":{"A":"The statement is completely correct — gradient boosting for classification is regression on residuals","B":"Gradient boosting fits regression trees to the pseudo-residuals (negative gradient of the loss), but the final prediction requires a link function: the sum of tree outputs $F_M(x) = \\sum f_m(x)$ is in log-odds space; the probability prediction is $\\hat{p} = \\sigma(F_M(x)) = 1/(1+e^{-F_M(x)})$; the trees regress on residuals, but the model output is a probability after the sigmoid transform — collapsing this to \"regression on residuals\" ignores the non-linear mapping between tree outputs and final probabilities","C":"Gradient boosting for classification does not use residuals — it uses the raw labels directly","D":"Classification gradient boosting uses decision trees for the final prediction, while regression uses linear models — they are fundamentally different architectures"},"correct":"B","explanation":{"correct":"- Gradient boosting for binary classification with log-loss: $L = -\\sum [y_i \\log(\\hat{p}_i) + (1-y_i)\\log(1-\\hat{p}_i)]$.\n- Negative gradient: $-\\partial L / \\partial F_m = y_i - \\hat{p}_i = r_i$. Each tree is fitted to $r_i$ — these look like residuals, similar to regression.\n- Key difference: the trees operate in log-odds space, not probability space. After boosting, $F_M(x)$ is a log-odds score, and the final probability requires the sigmoid: $\\hat{p} = \\sigma(F_M(x))$.\n- For MULTI-class: there are K sets of trees (one per class), and the final probabilities use softmax. The \"regression on residuals\" analogy becomes even less direct.","A":"While mechanistically similar, the log-odds space transformation means trees directly fit quantities that are not interpretable as probabilities. 
Missing the sigmoid / link function is a conceptual gap that matters for probability calibration and output interpretation.","B":"","C":"Gradient boosting does use pseudo-residuals (negative gradients). For log-loss, these are $y_i - \\hat{p}_i$, which depend on the current probability predictions, not raw labels.","D":"Both classification and regression gradient boosting use decision trees (typically). The difference is the loss function and link function, not the base learner architecture."}},{"section":"machine-learning","topicSlug":"ensemble-methods","topic":"Ensemble Methods","id":"ml-13012","difficulty":"hard","orderIndex":12,"question":"Two ensemble strategies are compared on a medical imaging classification task: (1) Training 10 diverse models (RF, SVM, CNN, LR, KNN, etc.) and stacking; (2) Training 10 variants of the same CNN with different random seeds and averaging predictions. Strategy 2 achieves higher accuracy. Why, and what does this reveal about ensemble design?","options":{"A":"Strategy 1 is always better — diversity of architecture always outperforms diversity of initialization","B":"Architecture diversity does not guarantee prediction diversity — logistic regression and a simple CNN on the same task may make very similar predictions; deep CNNs trained with different seeds on the same data capture different feature representations due to random initialization, dropout, and data augmentation stochasticity, creating genuine prediction diversity; the task-specific best architecture dominates, and variance reduction within the best architecture outperforms mixing weak architectures with the best","C":"Strategy 2 is better because CNNs always outperform other models on image data","D":"The result proves that stacking is an inferior ensemble method compared to averaging"},"correct":"B","explanation":{"correct":"- Effective ensembling requires: (1) high individual model quality, (2) diverse errors (low prediction correlation). 
Strategy 1 mixes strong (CNN) with weak (LR, KNN on images) models. The meta-learner in stacking tends to down-weight the weak models, so the stack behaves much like a single CNN.\n- Strategy 2: all models have high baseline accuracy (same CNN architecture optimized for the task). Different seeds create genuinely different learned representations — different random feature detectors are learned, reducing correlation.\n- Research insight: in competitive ML (Kaggle, benchmarks), ensembles of the same top architecture with diverse hyperparameters/seeds often outperform heterogeneous ensembles with weak models.","A":"Architecture diversity is useful when all architectures are approximately equally strong. Mixing a strong architecture with significantly weaker ones adds noise to the ensemble without equivalent signal.","B":"","C":"\"CNNs always outperform on images\" is not universally true (Vision Transformers and other transformer models also perform well). More importantly, the comparison here is about ensemble strategy, not architecture selection.","D":"The result doesn't prove stacking is inferior in general. It shows that for this task, averaging 10 strong models outperforms stacking a mix of strong and weak models. Stacking with diverse, equally strong base models can outperform averaging."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14001","difficulty":"easy","orderIndex":1,"question":"A binary classifier achieves 99% accuracy on a dataset where 99% of samples belong to class 0. 
A junior data scientist says \"99% accuracy means our model is excellent.\" What is wrong with this evaluation?","options":{"A":"99% accuracy is always an excellent result regardless of class distribution","B":"A trivial model that predicts \"class 0\" for every sample also achieves 99% accuracy — accuracy conflates majority-class performance with minority-class performance; on imbalanced datasets, accuracy doesn't measure whether the model learned anything about the minority class (class 1)","C":"99% accuracy requires 100% precision and 100% recall to be meaningful","D":"The accuracy is too high — a 99% accurate model is always overfitting"},"correct":"B","explanation":{"correct":"- Null accuracy (baseline): predict the majority class for all samples. With 99% class 0: null accuracy = 99%. The model achieves no improvement over this trivial baseline.\n- For class 1 detection: Recall = TP/(TP+FN). If the model predicts \"0\" for everything: TP = 0, FN = all class 1 samples. Recall = 0 — the model completely fails to detect the minority class.\n- Appropriate metrics for imbalanced data: precision-recall AUC, F1 on the minority class, Cohen's kappa, or Matthews correlation coefficient (MCC).","A":"99% accuracy can be meaningless on imbalanced data. The appropriate interpretation depends critically on class distribution.","B":"","C":"99% accuracy doesn't require perfect precision or recall. But on 99/1 imbalanced data, achieving 99% accuracy tells you nothing about minority class performance.","D":"High accuracy is not a sign of overfitting per se. Overfitting manifests as high training accuracy with lower test accuracy. Class imbalance is a separate issue."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14002","difficulty":"easy","orderIndex":2,"question":"For a disease screening test, the following confusion matrix applies: TP=90, FP=200, FN=10, TN=9,700. 
Calculate precision, recall, and F1 score, and determine which metric is most critical for this screening application.","options":{"A":"Precision is most critical — flagging 200 healthy people for follow-up is worse than missing 10 sick people","B":"Recall (90/(90+10) = 90%) is most critical — missing 10 sick people has high clinical cost (late diagnosis, disease progression); precision (90/(90+200) = 31%) is low because it's a screening test where false positives are expected and managed through confirmatory testing; F1 = 2×0.9×0.31/(0.9+0.31) ≈ 0.46 combines both","C":"Accuracy ((90+9700)/10000 = 98%) is most critical — it captures the overall test performance","D":"F1 score should always be optimized for medical tests — it perfectly balances the clinical trade-offs"},"correct":"B","explanation":{"correct":"- Context matters: disease screening vs diagnosis. Screening: cast a wide net (high recall), accept false positives (low precision). Confirmatory tests (more expensive, invasive) eliminate false positives.\n- Missing a sick person (FN) in screening means they receive no follow-up, leading to late-stage diagnosis with much higher treatment cost and mortality.\n- False positive (FP) sends a healthy person for confirmatory testing — inconvenient and costly but not catastrophic.\n- Recall = sensitivity in medical terminology. The WHO's target sensitivity for TB screening is >90%. Low precision is acceptable for initial screening when confirmatory testing exists.","A":"For many diseases (cancer, HIV), missing a case (FN) is far more costly than an unnecessary follow-up test (FP). The asymmetric cost justifies prioritizing recall.","B":"","C":"Accuracy of 98% is dominated by the 9,700 true negatives. It tells you almost nothing about disease detection performance. Never use accuracy for medical screening evaluation.","D":"F1 assumes equal cost for FP and FN ($C_{FP} = C_{FN}$). Medical contexts typically have highly asymmetric costs. 
F1 is not the right metric when cost asymmetry exists."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14003","difficulty":"easy","orderIndex":3,"question":"A classifier's ROC curve has an AUC of 0.85. A colleague says \"our model correctly classifies 85% of samples.\" Is this interpretation correct?","options":{"A":"Yes — AUC directly measures the percentage of correct classifications","B":"No — AUC-ROC measures the probability that the model ranks a randomly chosen positive sample higher than a randomly chosen negative sample; AUC=0.85 means: given a random positive and a random negative, the model assigns a higher score to the positive 85% of the time; this is a ranking quality metric, not a classification accuracy metric","C":"AUC-ROC and accuracy are equivalent — both measure the proportion of correct predictions","D":"AUC = 0.85 means the model achieves 85% recall at 85% precision"},"correct":"B","explanation":{"correct":"- Formal definition of AUC-ROC: $P(\\hat{p}(x^+) > \\hat{p}(x^-))$ for randomly drawn positive $x^+$ and negative $x^-$. This is a threshold-independent measure of discriminative ability.\n- AUC = 0.5: model cannot distinguish positive from negative (random ranking). AUC = 1.0: perfect ranking (all positives scored above all negatives). AUC = 0.85: very good discrimination.\n- For a 99% negative dataset with AUC=0.85: accuracy could be 99% (by predicting all negative), but AUC=0.85 correctly shows the model has learned to rank positives higher. Accuracy and AUC are measuring very different things.","A":"Accuracy is (TP+TN) / total. AUC is a ranking probability. They are different quantities, and their values coincide only by chance.","B":"","C":"A model with 50% accuracy on balanced data can have AUC > 0.5. A model with 99% accuracy on 99:1 imbalanced data can have AUC close to 0.5 if it simply predicts all negatives. 
They are not equivalent.","D":"AUC has no direct relationship to specific precision-recall values at a fixed threshold. AUC is a summary over all possible thresholds."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14004","difficulty":"easy","orderIndex":4,"question":"A spam classifier has AUC-ROC = 0.92 but AUC-PR (Precision-Recall) = 0.41. The dataset has 1% spam. A data scientist says \"AUC-ROC of 0.92 means the model is good.\" A colleague disagrees. Which is more informative for this application, and why?","options":{"A":"AUC-ROC is always the correct metric — AUC-PR is rarely used in practice","B":"For highly imbalanced datasets, AUC-PR is more informative — AUC-ROC includes the True Negative Rate (specificity), and with 99% negatives, the model classifies negatives easily; the ROC curve's large TN region inflates AUC-ROC; AUC-PR focuses on performance on the rare positive class (spam), where 0.41 indicates poor precision-recall trade-off for spam detection","C":"Both metrics are equivalent and should give the same value for any classifier","D":"The discrepancy between 0.92 and 0.41 indicates a computational error"},"correct":"B","explanation":{"correct":"- ROC curve plots TPR vs FPR. With 99% negatives: even a weak model keeps FPR low (many TN), producing a good-looking ROC curve despite poor minority class performance.\n- PR curve plots precision vs recall. It focuses entirely on the positive (minority) class — no TN in either metric. AUC-PR = 0.41 on 1% spam means the model struggles to achieve good precision-recall balance for spam.\n- Saito & Rehmsmeier (2015): AUC-PR is more informative than AUC-ROC for imbalanced datasets. AUC-PR's random baseline is equal to the positive class rate (1% here), while AUC-ROC's random baseline is always 0.5.","A":"AUC-PR is widely used in imbalanced classification, information retrieval (average precision), and recommendation systems. 
It is not rarely used.","B":"","C":"AUC-ROC and AUC-PR are different quantities measuring different aspects of model performance. They have different baselines and different interpretations. They are not equivalent.","D":"The discrepancy between 0.92 and 0.41 is expected and common for imbalanced datasets. It is not a computational error — it reveals the model's different performance characteristics."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14005","difficulty":"easy","orderIndex":5,"question":"K-fold cross-validation (K=5) is compared to a single 80/20 train-test split for model evaluation. A data scientist argues K-fold is always better. When might a single split be more appropriate?","options":{"A":"A single split is never better — K-fold is universally superior","B":"K-fold requires fitting the model K times — for very large datasets or computationally expensive models (deep learning, large gradient boosting), K-fold is impractical; a single split may be sufficient when the dataset is large enough that a 20% test set (which may be 100,000+ samples) provides stable estimates with low variance; K-fold primarily helps when data is limited and variance in evaluation is high","C":"K-fold should never be used for neural networks because it causes overfitting","D":"A single split is better when the data has temporal structure, because K-fold would still use random splits"},"correct":"B","explanation":{"correct":"- K-fold benefit: reduces evaluation variance by averaging over K different train-test splits. With limited data (n<1,000), a single 80/20 split may give high-variance estimates depending on which 20% was in the test set.\n- K-fold cost: K× the computational cost. For a CNN trained for 12 hours, 5-fold = 60 hours. 
For large datasets where test variance is already low, the extra cost is not justified.\n- Also valid: D is partially correct — for time-series data, K-fold random splits cause leakage (future data in training). Time-series cross-validation (expanding window or rolling window) is needed.","A":"When the dataset is large enough or computation is expensive, K-fold provides minimal benefit at significant cost. Single splits are commonly used in deep learning evaluations.","B":"","C":"K-fold doesn't cause overfitting in neural networks. It's computationally expensive, which is why practitioners often use a single validation set. The concern with K-fold and neural networks is purely computational.","D":"This is an important limitation — but the question asks about the single split being \"more appropriate.\" A single train-test split with proper temporal ordering is the preferred approach for time series, making D a valid secondary answer, but the primary reason is computational cost (B)."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14006","difficulty":"medium","orderIndex":6,"question":"A multi-class classifier (10 classes) is evaluated with macro F1 and weighted F1. Macro F1 = 0.62, weighted F1 = 0.84. Class distribution: 8 classes with 100 samples each, 2 classes with 5,000 samples each. 
The gap between the metrics reveals what about the model?","options":{"A":"Weighted F1 is always higher than macro F1 — the gap has no interpretation","B":"Macro F1 averages F1 per class equally — it is dominated by the 8 small classes where the model may perform poorly; weighted F1 weights each class by its sample count — the 2 large classes (5,000 samples each) dominate; the gap (0.84 vs 0.62) reveals that the model performs well on the large common classes but poorly on the small rare classes","C":"The gap means the model has high precision but low recall overall","D":"A weighted F1 of 0.84 means the model is production-ready — the macro F1 can be ignored"},"correct":"B","explanation":{"correct":"- Macro F1: $\\frac{1}{K}\\sum_{k=1}^K F1_k$. Each of 10 classes contributes equally. If the 8 small classes have F1 ≈ 0.56 (weaker due to limited data) and the 2 large classes have F1 ≈ 0.86: Macro F1 ≈ (8×0.56 + 2×0.86)/10 = 0.62, matching the scenario.\n- Weighted F1: $\\sum_{k=1}^K \\frac{n_k}{N} F1_k$. With the 2 large classes contributing 5000/10800 ≈ 46% each and the 8 small classes about 7% in total: weighted F1 ≈ (10000×0.86 + 800×0.56)/10800 ≈ 0.84, dominated by the large classes.\n- Decision: if rare classes are important (e.g., rare disease detection, minority customer types), macro F1 is the relevant metric.","A":"The relationship between weighted and macro F1 depends on the class performance distribution. Weighted F1 is not always higher — if the model performs better on rare classes, macro F1 > weighted F1.","B":"","C":"Precision-recall decomposition is not directly revealed by the gap between macro and weighted F1. The gap reveals the performance differential between rare and common classes.","D":"Ignoring macro F1 means ignoring the model's performance on 8 of 10 classes. 
If those classes are business-relevant (fraud subtypes, disease variants), 0.62 macro F1 may be unacceptable."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14007","difficulty":"medium","orderIndex":7,"question":"A regression model is evaluated with RMSE = 100 on a dataset where house prices range from $50,000 to $5,000,000. A colleague evaluates the same model on a new dataset (prices $200,000-$400,000) and gets RMSE = 80. They claim \"the model improved by 20% on the new dataset.\" What is wrong with this comparison?","options":{"A":"RMSE values are always comparable across datasets","B":"RMSE is scale-dependent — an RMSE of 80 on $200K-$400K prices (a 200K range) represents 80/200K = 0.04% of the value range, much worse relative performance than RMSE of 100 on a $4.95M range (100/4,950,000 = 0.002%); use RMSE/mean or MAPE (Mean Absolute Percentage Error) for scale-normalized comparison","C":"RMSE cannot be used for regression problems with skewed distributions","D":"RMSE of 80 < RMSE of 100 always means better model performance regardless of scale"},"correct":"B","explanation":{"correct":"- Absolute RMSE is uninterpretable without context of the target variable's scale and variance. RMSE = 100 on a $5M range is ~0.002% error; RMSE = 80 on a $200K range is 0.04% error — the latter is 20× worse proportionally.\n- Normalized RMSE (NRMSE) = RMSE / (max - min) or RMSE / mean. MAPE = mean(|y - ŷ|/|y|) × 100% gives percentage errors that are directly comparable across datasets.\n- Caveat: MAPE is unstable when true values are near 0 and gives asymmetric penalties. Symmetric MAPE (sMAPE) or MASE (Mean Absolute Scaled Error) are more robust.","A":"RMSE values are only comparable across datasets with the same target variable scale and similar variance. Comparing raw RMSE across different-scale datasets is a common mistake.","B":"","C":"RMSE can be used for any regression problem. 
Skewed distributions make MSE/RMSE sensitive to outliers in the tail, but they don't invalidate the metric.","D":"Lower absolute RMSE does not mean a better model unless the datasets are comparable in scale. This is precisely the scale dependency problem."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14008","difficulty":"medium","orderIndex":8,"question":"A recommendation system uses Mean Average Precision (MAP) for evaluation. The system recommends 10 items; for a user, relevant items are at positions 1, 4, 7. Calculate the Average Precision for this user.","options":{"A":"Average Precision = (1.0 + 0.5 + 0.43) / 10 = 0.193","B":"Average Precision = (1.0 + 0.5 + 0.43) / 3 ≈ 0.643 — average Precision@k values only at the positions of relevant items, divided by the number of relevant items","C":"Average Precision = 3/10 = 0.3 — fraction of relevant items in top 10","D":"Average Precision = P@10 = 3/10 = 0.3"},"correct":"B","explanation":{"correct":"- Average Precision (AP): $AP = \\frac{1}{R} \\sum_{k=1}^{n} P@k \\times \\text{rel}(k)$ where $R$ = total relevant items, $\\text{rel}(k) = 1$ if item at position $k$ is relevant.\n- Only sum precision values at positions of relevant items: $P@1 = 1.0$, $P@4 = 0.5$, $P@7 = 0.429$.\n- $AP = (1.0 + 0.5 + 0.429) / 3 = 1.929 / 3 \\approx 0.643$.\n- MAP (Mean AP) averages AP across all users/queries. AP rewards systems that rank relevant items higher — rank 1 contributes more than rank 7.","A":"Dividing by 10 (list length) is incorrect. AP divides by the number of relevant items (3), not the recommendation list length.","B":"","C":"3/10 is the fraction of the top-10 list that is relevant, which is precision@10, not average precision (recall@10 here would be 3/3 = 1, assuming these are the user's only relevant items). AP accounts for ranking position, not just the count of relevant items retrieved.","D":"P@10 = 3/10 = 0.3 is precision at the final cutoff. 
AP is a weighted average of precision at each relevant item's position, not precision at the end of the list."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14009","difficulty":"hard","orderIndex":9,"question":"Two models are compared on a test set of 1,000 samples. Model A: accuracy 87.5%. Model B: accuracy 86.2%. A data scientist reports \"Model A is better.\" A statistician asks for significance testing. What test is appropriate, and what is the minimum information needed?","options":{"A":"A t-test on accuracy values across multiple test folds","B":"McNemar's test — it requires the 2×2 contingency table of cases where both models agree or disagree: both correct (n_11), A correct/B wrong (n_10), A wrong/B correct (n_01), both wrong (n_00); McNemar's tests only the disagreement cells (n_10 vs n_01) because the agreement cells don't contribute information about which model is better","C":"A chi-squared test on the confusion matrices of both models","D":"No significance test is needed — a 1.3-point accuracy difference on 1,000 samples is always statistically significant"},"correct":"B","explanation":{"correct":"- McNemar's test: given paired binary outcomes (correct/incorrect for each sample), the test statistic is $\\chi^2 = (n_{10} - n_{01})^2 / (n_{10} + n_{01})$. Under $H_0$ (both models have equal error rate): $n_{10} = n_{01}$.\n- Why not t-test: binary outcomes (correct/incorrect) don't meet normality assumptions for a standard t-test. McNemar's is the non-parametric alternative for paired binary outcomes.\n- Effect size: if n_10=50 (A right, B wrong) and n_01=37 (B right, A wrong): $\\chi^2 = (50-37)^2/(50+37) = 169/87 \\approx 1.94$. For df=1, $p \\approx 0.16$ — not significant. A 1.3-point difference may not be significant.","A":"A t-test on K-fold accuracy values (across folds) is a common but problematic approach due to non-independence of K-fold test sets. 
McNemar's test on paired sample-level predictions is more principled.","B":"","C":"Chi-squared on confusion matrices tests whether performance on individual classes differs, not whether one model is globally better than the other. It's the wrong test for overall comparison.","D":"Statistical significance depends on effect size and sample size together. A 1.3-point difference on 1,000 samples can be statistically significant (p<0.05) or not, depending on the overlap in what the models correctly classify."},"reference":"- Dietterich, \"Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms\": https://www.mitpressjournals.org/doi/10.1162/089976698300017197"},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14010","difficulty":"hard","orderIndex":10,"question":"A calibration plot (reliability diagram) for a classifier shows: for predictions in the bin [0.7, 0.8], the actual positive rate is 0.45. For predictions in [0.3, 0.4], the actual positive rate is 0.55. What do these observations indicate, and how would you fix the calibration?","options":{"A":"The model is well-calibrated — slight deviations from the diagonal are expected","B":"The model is severely miscalibrated with inversion: samples predicted as highly positive (70-80% probability) have lower actual positive rate (45%) than samples predicted as moderately negative (30-40% probability, actual 55%); this suggests the model's sigmoid/softmax output is not a reliable probability estimate; fix: apply isotonic regression or Platt scaling to map raw scores to calibrated probabilities","C":"The model has low recall — calibration only measures precision","D":"The observations are impossible — model output probabilities and actual rates must maintain the same ordering"},"correct":"B","explanation":{"correct":"- Calibration: a model is calibrated if $P(y=1 | \\hat{p}(x) = p) = p$ for all $p$. 
Perfect calibration = reliability diagram on the diagonal.\n- The described model shows inverted calibration: high model scores correlate with lower actual positive rates. This is extreme miscalibration — the model's scores are negatively correlated with actual outcomes in some regions.\n- This can happen when a model is trained with inconsistent labels, when features that accidentally correlate negatively with labels are dominant, or when a model's decision boundary has flipped (e.g., incorrect label encoding).\n- Fixes: Platt scaling (logistic regression on model scores), isotonic regression (non-parametric monotone mapping). But inverted calibration is a severe model failure requiring investigation of the training pipeline.","A":"The described pattern is not a \"slight deviation.\" A 45% actual rate at 70-80% predicted probability and 55% actual rate at 30-40% predicted probability represents severe inversion, not noise.","B":"","C":"Calibration measures reliability of probability estimates, not just precision. Recall is about the classifier's sensitivity at a threshold; calibration is about whether predicted probabilities match actual frequencies.","D":"Model outputs and actual rates can have any relationship — especially for miscalibrated models. The model's raw output scores are transformed to probabilities through softmax/sigmoid and may not have a monotone relationship with ground truth."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14011","difficulty":"hard","orderIndex":11,"question":"A model is evaluated with Brier Score. Model A: Brier = 0.18. Model B: Brier = 0.22. A data scientist knows Model B achieves higher AUC-ROC. 
How can a model have better AUC but worse Brier Score, and what does each measure?","options":{"A":"AUC and Brier Score cannot give contradictory results — one must be computed incorrectly","B":"AUC-ROC measures ranking quality (can the model order positives above negatives?); Brier Score measures probabilistic calibration quality ($\\frac{1}{n}\\sum (p_i - y_i)^2$, where $p_i$ is predicted probability); a model can be an excellent ranker (high AUC) but produce poorly calibrated probabilities (high Brier); Model B ranks correctly but may output overconfident or underconfident probabilities; Model A may be a weaker ranker but outputs well-calibrated, reliable probabilities","C":"Model B has higher AUC, so it must have lower Brier Score — the scenario is inconsistent","D":"Brier Score and AUC measure exactly the same thing using different formulas"},"correct":"B","explanation":{"correct":"- AUC = P(rank correct): considers only relative ordering of predicted scores. Halving all predicted probabilities (or applying any strictly increasing transformation) leaves AUC unchanged because rankings are preserved.\n- Brier Score = mean squared error between predicted probability and outcome: $BS = \\frac{1}{n}\\sum_{i=1}^n (\\hat{p}_i - y_i)^2$. Lower is better. Brier measures absolute probability accuracy.\n- Example: suppose Model B outputs P=0.9 for every positive and P=0.8 for every negative. Ranking is perfect (AUC = 1.0), but the probabilities are miscalibrated. With an actual positive rate of 0.6: Brier = 0.6×(0.9-1)² + 0.4×(0.8-0)² = 0.006 + 0.256 ≈ 0.26, a heavy penalty despite perfect ranking.","A":"AUC and Brier measure different properties. They can and do give contradictory rankings of models when ranking quality and probability calibration are different. This is well-documented.","B":"","C":"Higher AUC does not imply lower Brier Score. 
They measure fundamentally different aspects of model performance.","D":"AUC measures ranking discriminability; Brier measures probabilistic accuracy. They are different quantities with different mathematical formulations."}},{"section":"machine-learning","topicSlug":"model-evaluation-and-metrics","topic":"Model Evaluation And Metrics","id":"ml-14012","difficulty":"hard","orderIndex":12,"question":"A researcher uses test set performance to select between 100 hyperparameter configurations. The best configuration achieves 92% accuracy on the test set. They report this as the model's expected production performance. A statistician warns about \"test set contamination.\" What is the concern and what is the principled fix?","options":{"A":"100 hyperparameter configurations is too many — 10 is the maximum for unbiased evaluation","B":"By selecting the best configuration out of 100 based on test performance, the reported accuracy is optimistically biased — even if all configurations are random, the best of 100 will score high by chance (multiple comparisons problem); the test set effectively becomes a validation set used for selection; production performance will be lower; the principled fix is nested cross-validation or a held-out final test set that is never used during hyperparameter selection","C":"The concern is only valid if the hyperparameters were tuned on the training set — using the test set for selection is always valid","D":"Test set contamination only occurs when feature selection is performed — hyperparameter tuning does not contaminate the test set"},"correct":"B","explanation":{"correct":"- Multiple comparisons inflation: the expected maximum of 100 independent tests at noise level follows the extreme value distribution. 
Even with random performance (expected 50% for a coin flip classifier), the max of 100 samples can appear much higher by chance.\n- For accuracy at 92%: if 100 random configurations achieve 88-92% by variance, selecting the best inflates the reported estimate. The true expected performance of this configuration on new data is lower.\n- Principled fix: (1) Use 3 splits: training (model fitting), validation (hyperparameter selection), test (final unbiased evaluation). (2) Nested cross-validation: outer loop for test evaluation, inner loop for hyperparameter selection. The outer test fold is never used in hyperparameter selection.","A":"There is no maximum number of configurations for a valid search, as long as a separate test set is never used during selection. The issue is test set use for selection, not the number of configurations.","B":"","C":"Using the test set for any selection (including hyperparameter selection) contaminates it. The test set should only be used once, after all model development decisions are finalized.","D":"Any use of the test set for model selection — feature selection, hyperparameter tuning, architecture search — contaminates it. The contamination is about using test labels to make modeling decisions."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15001","difficulty":"easy","orderIndex":1,"question":"A linear regression model achieves training MSE = 150 and test MSE = 155. A polynomial regression (degree 10) achieves training MSE = 5 and test MSE = 800. 
What do these results indicate about each model?","options":{"A":"The polynomial model is better because it achieves lower training error","B":"Linear regression shows high bias (training MSE=150, suggesting underfitting) but low variance (test≈train); polynomial model shows low bias (training MSE=5, near-perfect fit) but high variance (test MSE=800, severe overfitting); the polynomial model memorized the training data and cannot generalize","C":"Both models are equivalent because neither achieves zero training error","D":"The test-train gap in polynomial regression means the test set is too small"},"correct":"B","explanation":{"correct":"- Bias-variance decomposition of generalization error: $E[\\text{MSE}] = \\text{Bias}^2 + \\text{Variance} + \\text{Irreducible Error}$.\n- High bias (underfitting): model is too simple to capture the true pattern. Both training and test error are high, with small gap.\n- High variance (overfitting): model fits training noise. Training error is very low, but test error is high (large gap: 800 - 5 = 795).\n- The polynomial model's degree-10 flexibility fits the training data perfectly (including noise) but cannot generalize.","A":"Lower training error does not mean better model. Training error measures how well the model fits historical data, not how well it will generalize to new data. Minimizing training error is not the goal of machine learning.","B":"","C":"Training MSE of 5 vs 150 represents fundamentally different fitting capacity. They are not equivalent.","D":"The large gap is due to model complexity (variance), not test set size. A larger test set would show the same high test MSE — the problem is the model, not the evaluation."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15002","difficulty":"easy","orderIndex":2,"question":"A model's bias is defined as the systematic error: $\\text{Bias}[\\hat{f}(x)] = E[\\hat{f}(x)] - f(x)$. 
A student asks: \"if I train the same model 100 times on 100 different samples from the same population, what does variance measure?\" What is the correct answer?","options":{"A":"Variance measures how often the model's predictions are correct on test data","B":"Variance measures how much the model's predictions change across different training sets — $\\text{Var}[\\hat{f}(x)] = E[(\\hat{f}(x) - E[\\hat{f}(x)])^2]$; a high-variance model produces very different predictions depending on which training samples it happened to see; a low-variance model produces similar predictions regardless of the specific training set","C":"Variance is the average training error across 100 runs","D":"Variance measures the number of parameters in the model — more parameters means higher variance"},"correct":"B","explanation":{"correct":"- Thought experiment: train a decision tree (high variance) vs logistic regression (lower variance) on 100 different samples of size 100 from the same population. Decision trees will look very different from each run (different splits, different predictions). Logistic regression will produce similar coefficients across runs.\n- High variance → sensitive to specific training data → overfitting to noise. Low variance → stable predictions → less responsive to specific training samples.\n- This definition of variance is over the sampling distribution of training sets — not over the test set.","A":"Whether predictions are correct is accuracy, not variance. Variance is about consistency across different training sets, not correctness.","B":"","C":"Training error measures how well the model fits training data. Variance is about stability of predictions across different training samples.","D":"More parameters can enable higher variance, but variance is not the parameter count. 
A highly regularized 1,000-parameter model may have lower variance than an unregularized 10-parameter model."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15003","difficulty":"easy","orderIndex":3,"question":"Learning curves are plotted for a model: training error stays high as more data is added; validation error decreases and approaches (but stays above) the training error. What type of error does this pattern indicate?","options":{"A":"High variance — the model is overfitting","B":"High bias (underfitting) — training error is high from the start and doesn't improve significantly with more data; the validation error converges toward training error (they meet at a high value); adding more data will not fix this; the model is too simple to capture the true function; the fix is to increase model complexity","C":"The model is well-optimized — the learning curve indicates good generalization","D":"High variance — training and validation error converging means the model is memorizing training data"},"correct":"B","explanation":{"correct":"- High bias learning curve pattern: training error is already high with few samples; adding more data doesn't dramatically reduce it (the model can't capture the true pattern regardless of data volume); validation error rapidly decreases and converges toward the (high) training error.\n- Intuition: a linear model trying to fit a cubic relationship has a fixed irreducible error floor set by the misspecification. More data refines the linear fit but doesn't help it capture the cubic term.\n- Fix for high bias: increase model complexity (more features, higher polynomial degree, deeper network), reduce regularization, add feature interactions.","A":"High variance shows a large gap between training error (low) and validation error (high). With more data, the gap narrows. 
The described pattern has training error staying high — this is underfitting, not overfitting.","B":"","C":"High training error that doesn't decrease is diagnostic of underfitting. A well-optimized model would have low training error and validation error approaching it.","D":"Memorizing training data (high variance) produces low training error. The described pattern has high training error — the model is not memorizing; it's underfitting."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15004","difficulty":"easy","orderIndex":4,"question":"A data scientist increases the regularization strength (lambda) in a Ridge regression model from 0.01 to 100. Training error increases significantly, while test error first decreases then increases. What does this behavior demonstrate?","options":{"A":"Higher regularization always improves test performance — the final test error increase is a bug","B":"The bias-variance tradeoff: at lambda=0.01 (low regularization), the model has low bias but high variance; as lambda increases, bias increases (model coefficients are shrunk, reducing model flexibility) but variance decreases (predictions become more stable); there is an optimal lambda where the total error (bias² + variance) is minimized; beyond this point, the added bias from over-regularization exceeds the variance reduction","C":"The test error increase at high lambda means regularization should never be applied","D":"The increase in training error at high lambda indicates the model is overfitting to the regularization penalty"},"correct":"B","explanation":{"correct":"- As lambda → ∞: all Ridge coefficients → 0. The model predicts the training mean for every input — high bias (predicts nothing), very low variance.\n- The U-shaped test error curve as a function of lambda is the empirical manifestation of the bias-variance tradeoff. 
The minimum of this curve is the optimal lambda.\n- Cross-validation for lambda selection: evaluate test-like performance for many lambda values and select the one with minimum cross-validation error. sklearn's `RidgeCV` does this automatically.","A":"Regularization can harm performance if set too high. The optimal regularization is dataset-specific and should be tuned via cross-validation.","B":"","C":"Regularization is valuable when the model overfits (high variance). The optimal regularization reduces total error. The problem is only at extreme lambda values.","D":"Training error increasing with lambda is expected and correct behavior — regularization constrains the model and prevents it from fully fitting training data. This is not overfitting; it's the intended effect."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15005","difficulty":"medium","orderIndex":5,"question":"An ML practitioner says: \"I always use ensembling because it reduces both bias and variance simultaneously.\" A theorist disagrees. Which ensemble technique primarily reduces variance, and which primarily reduces bias?","options":{"A":"All ensemble methods reduce both bias and variance equally","B":"Bagging (Random Forest) primarily reduces variance — it averages many high-variance low-bias models; boosting (AdaBoost, gradient boosting) primarily reduces bias — it sequentially adds models that correct the residuals/errors of previous models, fitting progressively more complex functions; bagging does not reduce bias because it averages models of the same class with same expected prediction","C":"Boosting reduces variance and bagging reduces bias — the reverse of common understanding","D":"Stacking reduces both bias and variance while bagging and boosting each reduce only one"},"correct":"B","explanation":{"correct":"- Bagging variance reduction: $\\text{Var}(\\bar{X}) = \\rho \\sigma^2 + (1-\\rho)\\sigma^2/B$. 
Averaging $B$ models reduces variance toward the correlated floor $\\rho \\sigma^2$. Bias of the average = bias of individual trees (unchanged). Bagging works best when base models are high variance (deep decision trees).\n- Boosting bias reduction: each iteration fits residuals $r_i = y_i - F_{m-1}(x_i)$. The composite model's bias decreases as more iterations capture complex patterns. The combined model can represent functions that no single weak learner can.\n- Boosting does also reduce variance through regularization (learning rate, depth), but the primary theoretical mechanism is bias reduction.","A":"Bagging and boosting have different primary mechanisms — claiming equal reduction in both ignores the mathematical structure of each method.","B":"","C":"This is reversed. Bagging = variance reduction (averaging); boosting = bias reduction (sequential error correction). This is a common confusion in interviews.","D":"Stacking is a meta-learning approach that can reduce both, but it's not categorically different in this respect from boosting. The key distinction is bagging vs boosting, not stacking."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15006","difficulty":"medium","orderIndex":6,"question":"Modern deep neural networks have millions of parameters and can interpolate training data (achieve ~0% training loss). Classical statistical theory predicts these models should severely overfit. Yet they generalize well. This \"double descent\" phenomenon challenges classical theory. 
What is the classical bias-variance tradeoff prediction, and why does deep learning deviate?","options":{"A":"Deep learning doesn't overfit because it uses batch normalization, which prevents overfitting","B":"Classical theory: test error follows a U-shaped curve as model complexity increases — low complexity (high bias), optimal, then high complexity (high variance/overfitting); deep learning observes a \"double descent\" — beyond the interpolation threshold, test error decreases again with more model capacity; overparameterized models have an implicit regularization effect from SGD that finds flat minima generalizing well, challenging the classical overfitting prediction","C":"Deep neural networks don't overfit because they use dropout, which limits effective model capacity","D":"The classical bias-variance tradeoff only applies to linear models — it never predicted overfitting for neural networks"},"correct":"B","explanation":{"correct":"- Classical U-curve: at the interpolation threshold (when model exactly fits training data), test error is expected to peak. Beyond this, classical theory predicts continued high variance.\n- Double descent: Belkin et al. (2019) showed test error can decrease again in the overparameterized regime. Why? SGD with early stopping implicitly finds solutions with low norm (analogous to L2 regularization), preferring flat, well-generalizing minima.\n- Modern understanding: classical bias-variance analysis assumes a specific model class trained to convergence. Deep learning's implicit regularization from SGD, random initialization, and optimization trajectory changes the effective model.","A":"Batch normalization helps training stability and can reduce overfitting somewhat, but it's not the fundamental explanation for generalization in overparameterized networks. The double descent phenomenon occurs even without BatchNorm.","B":"","C":"Dropout is one regularization technique. 
Double descent occurs even in networks trained without dropout. The phenomenon is fundamental, not dependent on specific regularization techniques.","D":"The classical bias-variance tradeoff is a statistical principle that applies to all models. It did predict overfitting for overparameterized models — deep learning's empirical behavior contradicts this prediction, which is exactly what makes double descent theoretically interesting."},"reference":"- Belkin et al., \"Reconciling modern ML and the bias-variance tradeoff\": https://arxiv.org/abs/1812.11118"},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15007","difficulty":"medium","orderIndex":7,"question":"A neural network achieves 95% training accuracy and 94% test accuracy. A colleague says \"the bias-variance decomposition shows low bias and low variance.\" Without seeing the learning curves or multiple training runs, what cannot be concluded from these two numbers alone?","options":{"A":"The two numbers are sufficient to fully characterize the bias-variance tradeoff","B":"Without knowing the Bayes optimal error (irreducible error), you cannot determine the absolute bias — if the best possible accuracy on this task is 99%, then a 5% training error indicates high bias; if the best possible is 95%, then 5% training error is at the optimum; variance is estimated from multiple training runs, not from a single train/test comparison; a 1% gap is consistent with low variance, but could also reflect that the test set is easy","C":"95% training accuracy always means low bias and 1% gap always means low variance","D":"The 1% gap between train and test is definitively low variance — no additional information is needed"},"correct":"B","explanation":{"correct":"- Irreducible error (Bayes error): the minimum achievable error given the data's inherent noise and label ambiguity. 
For noisy labels (humans disagree on classification), Bayes error > 0.\n- If the Bayes error is 5% (best achievable accuracy 95%), a training accuracy of 95% means near-zero bias. If the Bayes error is 1% (best achievable accuracy 99%), the same 95% training accuracy leaves a 4-point gap and indicates substantial bias.\n- Variance estimation: requires observing how much train/test performance varies across multiple random training runs or data subsets. A single run gives one sample of the distribution.","A":"Two numbers (train accuracy, test accuracy) give partial information. Full bias-variance characterization requires knowledge of Bayes error and multiple training runs.","B":"","C":"\"Always\" is incorrect. The interpretation of 95% training accuracy depends on the task difficulty (Bayes error). A 1% gap is consistent with low variance but doesn't definitively establish it.","D":"A 1% train-test gap is consistent with low variance, but \"definitively\" is too strong. The particular test split might happen to be easier than the training set, creating an artificially small gap."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15008","difficulty":"hard","orderIndex":8,"question":"The bias-variance decomposition for 0-1 loss (classification) behaves differently than for squared loss (regression). For squared loss, $E[(y - \\hat{f}(x))^2] = \\text{Bias}^2 + \\text{Variance} + \\sigma^2$. For 0-1 loss, the bias and variance terms do not simply add: the sign of the variance contribution depends on the bias. 
What is the key implication of this difference?","options":{"A":"The bias-variance tradeoff does not apply to classification — only to regression","B":"For 0-1 loss, variance can actually reduce error when bias is high — a high-variance model may \"accidentally\" predict the correct class more often than a biased low-variance model in certain regions; bias and variance interact non-additively, so reducing variance doesn't always improve 0-1 loss; the decomposition is more complex and model selection for classification should use the actual 0-1 loss or a proper surrogate (log-loss, hinge loss) rather than the squared-loss decomposition","C":"For classification, bias and variance are exactly equal in magnitude — maximizing one minimizes the other","D":"The interaction only matters for multi-class problems, not binary classification"},"correct":"B","explanation":{"correct":"- Domingos (2000): for two-class 0-1 loss, the expected loss at a point $x$ can be written as $c_1 N(x) + B(x) + c_2 V(x)$, where $B(x) \\in \\{0, 1\\}$ indicates whether the main (most frequent) prediction is wrong, and $c_2 = +1$ where the model is unbiased but $c_2 = -1$ where it is biased. On biased points, variance lowers the expected loss: every deviation from the wrong main prediction lands on the correct class.\n- Practical implication: reducing variance doesn't always help classification. Ensembling (which reduces variance) helps on points where the main prediction is correct and can hurt on points where it is wrong.\n- Practitioners should use log-loss or hinge loss for optimization, not squared loss, to get well-behaved loss landscapes for classification tasks.","A":"The bias-variance tradeoff applies to all supervised learning. For classification, the decomposition just has a more complex, non-additive form.","B":"","C":"Bias and variance are not equal in magnitude for classification. 
They have a non-trivial relationship that depends on the decision boundary and the true distribution.","D":"The interaction is a fundamental property of 0-1 loss regardless of the number of classes. It applies equally to binary and multi-class classification."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15009","difficulty":"hard","orderIndex":9,"question":"A data scientist claims: \"adding more training data reduces both bias and variance.\" A researcher disagrees on one point. Which part of the claim is incorrect?","options":{"A":"More data reduces variance but not bias — bias is a function of model misspecification, not data quantity","B":"More data reduces both bias and variance equally","C":"More data reduces bias but increases variance by providing more opportunities for the model to fit noise","D":"More data has no effect on either bias or variance for neural networks"},"correct":"A","explanation":{"correct":"- Variance decreases with more data: $\\text{Var}(\\hat{f}) \\approx \\sigma^2 \\times \\text{model complexity} / n$. As $n \\to \\infty$, variance → 0 (for any fixed model class). More samples → more stable parameter estimates.\n- Bias does NOT decrease with more data: bias is the error due to model misspecification — the gap between the best model in the model class and the true function. A linear model fit to a million samples of a nonlinear function still has the same bias as a linear model fit to 100 samples (the mean prediction converges to the best linear approximation, which is still far from the true nonlinear function).\n- Exception: if model complexity is allowed to grow with data (e.g., using a kernel with adaptive bandwidth, or a neural network with more capacity), both bias and variance may change.","A":"","B":"More data does not reduce bias for a fixed model class. 
The \"fixed model class\" qualifier is critical.","C":"This is wrong on both counts. For a fixed model class, more data does not reduce bias, and it typically reduces variance rather than increasing it, because parameter estimates become more stable as $n$ grows.","D":"Neural networks with fixed architecture have variance that decreases with more data, just like other models. More data helps neural network generalization."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15010","difficulty":"hard","orderIndex":10,"question":"Dropout in neural networks is typically described as a regularization technique that reduces overfitting. Using the bias-variance framework, explain precisely how dropout reduces variance and whether it has any bias cost.","options":{"A":"Dropout reduces both bias and variance to zero at high dropout rates","B":"Dropout randomly deactivates neurons during training, forcing the network to learn redundant representations; this is equivalent to training an ensemble of exponentially many different sub-networks; the ensemble's predictions are averaged (approximated at test time by weight scaling), reducing variance; the bias cost: a high dropout rate may prevent individual neurons from specializing, reducing the model's effective capacity and introducing bias — small dropout rates (0.1-0.3) are usually variance-reducing without significant bias increase","C":"Dropout has no effect on bias — it only reduces variance by zeroing out weights","D":"Dropout reduces bias by preventing neurons from relying on spurious correlations, with no variance effect"},"correct":"B","explanation":{"correct":"- Dropout as ensemble: with dropout rate $p$, each forward pass uses a different sub-network. Training produces a distribution over sub-networks. 
Inference averages predictions over this distribution (via weight scaling approximation), analogous to bagging.\n- Averaging reduces variance: the averaged prediction $E[\\hat{f}_\\theta(x)]$ across sub-networks has lower variance than any single sub-network prediction.\n- Bias cost: at high dropout rates (e.g., 0.7), many neurons are dropped per batch. Each sub-network has very few active neurons — may underfit complex patterns, increasing bias. This is why tuning dropout rate is important.\n- Common rates: 0.5 for fully connected layers in the original dropout paper; 0.1-0.3 for convolutional layers.","A":"Dropout at high rates (approaching 1.0) would prevent any learning — catastrophically high bias. Zero bias is not achievable with dropout.","B":"","C":"Dropout does affect bias at high dropout rates by limiting effective model capacity. The bias cost is often small at typical dropout rates but is not zero.","D":"The primary mechanism of dropout is variance reduction (ensemble averaging), not bias reduction. Bias reduction from removing spurious correlations is a secondary effect, not the primary mechanism."}},{"section":"machine-learning","topicSlug":"bias-variance-tradeoff","topic":"Bias Variance Tradeoff","id":"ml-15011","difficulty":"hard","orderIndex":11,"question":"A practitioner tunes a gradient boosting model. As the number of trees increases from 10 to 10,000 (with learning rate 0.01, no early stopping), training error decreases to near 0 while test error first decreases then increases. This pattern exactly mirrors the classical bias-variance tradeoff curve. 
What is the \"complexity\" axis in this context, and how do learning rate and tree depth interact with the tradeoff?","options":{"A":"The number of trees is the complexity axis; learning rate and depth have no effect on the tradeoff","B":"The number of trees (iterations) is the complexity axis for gradient boosting — more trees = lower bias (more complex function fitted to residuals), higher variance (more susceptible to noise); learning rate scales the contribution of each tree: small learning rate requires more trees to achieve the same bias reduction, making the tradeoff curve flatter; tree depth controls the individual tree's complexity — deeper trees reduce bias faster per iteration but also increase variance per tree; optimal performance requires jointly tuning trees, learning rate, and depth","C":"In gradient boosting, there is no bias-variance tradeoff — only overfitting and underfitting","D":"More trees in gradient boosting always reduces both bias and variance simultaneously"},"correct":"B","explanation":{"correct":"- Gradient boosting complexity: each additional tree adds a residual-fitting component. More trees → model can approximate more complex functions (lower bias); each tree fits residuals that may include noise → model is more sensitive to training noise (higher variance).\n- Learning rate $\\eta$: shrinks each tree's contribution. Small $\\eta$ → smooth interpolation requires more trees to reach the same function complexity → optimal tree count shifts right. The bias-variance curve is \"stretched\" horizontally.\n- Tree depth: shallow trees (depth 1 = stumps) are high-bias, low-variance weak learners. Deep trees reduce bias faster per iteration but add variance. 
XGBoost's default `max_depth` is 6; LightGBM leaves `max_depth` unlimited by default and instead caps `num_leaves` (default 31).\n- Early stopping: halts training when validation error starts rising, directly finding the optimal point on the bias-variance curve.","A":"Learning rate and depth fundamentally affect where the optimal point on the bias-variance curve lies. They are not independent of the tradeoff.","B":"","C":"Gradient boosting exhibits a clear bias-variance tradeoff. The test error U-shape described in the question is exactly this tradeoff.","D":"More trees in gradient boosting always reduces bias but eventually increases variance (unlike Random Forest, where more trees only reduce variance). This is a key difference between boosting and bagging ensembles."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16001","difficulty":"easy","orderIndex":1,"question":"L1 (Lasso) and L2 (Ridge) regularization both add a penalty to the loss function. Lasso has the property of producing sparse solutions (many weights exactly zero). L2 does not. Why does L1 produce sparsity while L2 does not?","options":{"A":"L1 uses a smaller penalty coefficient than L2, which causes weights to become exactly zero","B":"The L1 penalty ($\\lambda|w|$) has a non-smooth gradient (subdifferential) at $w=0$ — the penalty function has a \"corner\" at zero; when the gradient of the data loss is smaller than $\\lambda$, the optimal solution is exactly $w=0$; L2 penalty ($\\lambda w^2$) has a smooth gradient that approaches 0 as $w \\to 0$ — L2 never pushes weights to exactly zero, only close to zero","C":"L1 regularization is stronger than L2, so it forces more weights to zero through larger penalties","D":"L2 regularization cannot shrink weights at all — it only reduces the learning rate"},"correct":"B","explanation":{"correct":"- Geometric intuition: L1 constraint region is a diamond (in 2D), L2 is a circle (a sphere in higher dimensions). 
The optimal solution (where the loss ellipse touches the constraint boundary) tends to land on the corners of the diamond (sparse points) for L1. The sphere has no corners, so solutions rarely land exactly on an axis.\n- Subgradient at zero: L1 derivative is $\\lambda \\times \\text{sign}(w)$, undefined at $w=0$ — the subdifferential is $[-\\lambda, \\lambda]$. If the gradient of the data loss at $w=0$ is within $[-\\lambda, \\lambda]$, setting $w=0$ is optimal.\n- L2 derivative: $2\\lambda w$ → 0 as $w \\to 0$. The gradient always points toward (but never reaches) zero — it only asymptotically approaches zero.","A":"The strength of regularization (lambda value) is comparable between L1 and L2. The sparsity is a geometric property of the L1 norm, not a result of using smaller lambda.","B":"","C":"L1 and L2 with the same lambda have different magnitudes — neither is inherently \"stronger.\" The sparsity property is about the geometry of the penalty function, not just its magnitude.","D":"L2 shrinks weights toward (but not to) zero. This is its primary mechanism — it reduces weight magnitude, preventing any single feature from dominating. It doesn't reduce learning rate."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16002","difficulty":"easy","orderIndex":2,"question":"ElasticNet regularization combines L1 and L2 penalties: $L = \\text{MSE} + \\lambda_1 ||w||_1 + \\lambda_2 ||w||_2^2$. A practitioner chooses ElasticNet over Lasso for a dataset with 50 features where 30 are correlated in groups of 5. 
Why is Lasso alone insufficient here?","options":{"A":"Lasso is always worse than ElasticNet — ElasticNet is the superior method","B":"Lasso tends to arbitrarily select one feature from a group of correlated features and set the others to exactly zero — within a correlated group, it doesn't consistently select the most relevant feature; ElasticNet's L2 component groups correlated features together (similar to Ridge), while the L1 component still produces sparsity — correlated features get similar non-zero coefficients rather than one arbitrarily selected","C":"ElasticNet is chosen because it requires fewer hyperparameters than Lasso","D":"Lasso cannot handle datasets with more features than samples; ElasticNet can"},"correct":"B","explanation":{"correct":"- Lasso and correlated features: the Lasso solution is not unique when features are highly correlated. It may select any one feature from a correlated group — the selection depends on numerical noise and specific optimization path. This is called \"inconsistent variable selection.\"\n- ElasticNet: L2 component adds a grouping effect (correlated features are selected/deselected together). L1 maintains overall sparsity. The combination is more stable and interpretable for correlated feature groups.\n- Practical example: in genomics (correlated gene expressions within pathways), ElasticNet selects representative genes from each pathway rather than arbitrary single genes.","A":"Lasso is sufficient and often preferred when features are independent and a sparse model is the goal. ElasticNet's advantage is specifically for correlated feature scenarios.","B":"","C":"ElasticNet has TWO hyperparameters ($\\lambda_1, \\lambda_2$) vs Lasso's ONE ($\\lambda$). ElasticNet requires more hyperparameter tuning, not less.","D":"Lasso can handle p >> n scenarios (in fact, it's one of the primary tools for high-dimensional sparse regression). 
The issue is correlated feature instability, not p >> n."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16003","difficulty":"easy","orderIndex":3,"question":"A logistic regression model is trained on a dataset with 1,000 features and 500 training samples. Without regularization, the model achieves 100% training accuracy but 61% test accuracy. With L2 regularization (C=0.01 in sklearn, meaning strong regularization), training accuracy drops to 75%, test accuracy improves to 84%. What caused the improvement?","options":{"A":"L2 regularization improved the model by removing irrelevant features","B":"Without regularization, logistic regression can perfectly separate the 500 training samples in 1,000-dimensional space (many separating hyperplanes exist) — the learned coefficients are huge and unstable, perfectly fitting noise; L2 regularization constrains coefficient magnitudes (equivalent to a constraint $||w||_2^2 \\leq t$, where $t$ shrinks as $\\lambda$ grows), preventing overfitting to noise; the lower training accuracy reflects the regularization constraint, but the model generalizes better by not memorizing noise","C":"The improvement occurred because L2 regularization increased the number of training samples","D":"High training accuracy with low test accuracy indicates the test set is harder than training, not overfitting"},"correct":"B","explanation":{"correct":"- With p=1,000 > n=500: infinitely many hyperplanes separate the training data. Without regularization, the optimization finds a hyperplane that perfectly classifies training data but relies on noise correlations.\n- L2 regularization is equivalent to constraining the weight vector to lie within a ball whose radius shrinks as $\\lambda$ grows. This prevents large weights that overfit to noise.\n- Sklearn's parameter C = $1/\\lambda$ (inverse of regularization strength). 
C=0.01 means strong regularization ($\\lambda = 100$), which heavily constrains coefficient magnitudes.","A":"L2 regularization shrinks all coefficients but keeps all features (no sparsity). Feature removal is characteristic of L1 regularization. L2 improves generalization by coefficient shrinkage, not feature elimination.","B":"","C":"Regularization doesn't change the number of training samples. It changes how the model is fitted to the existing samples.","D":"With 1,000 features and 500 samples, perfect training accuracy is a strong sign of overfitting. The 39-point gap between 100% training accuracy and 61% test accuracy (only modestly above the 50% chance level for a balanced binary problem) confirms the model memorized training noise."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16004","difficulty":"easy","orderIndex":4,"question":"Early stopping is used in neural network training as an implicit regularization technique. A model's validation loss starts increasing at epoch 50 while training loss continues decreasing. The model is stopped at epoch 50. Why does early stopping reduce overfitting?","options":{"A":"Early stopping prevents gradient descent from converging to the global minimum, which would overfit","B":"Early stopping prevents the model from fitting training noise in later epochs — in early training, gradient descent first captures broad patterns (high gradient signal); in later epochs, the model increasingly fits residual noise (small gradient updates in high-frequency noise directions); stopping before this phase prevents memorizing noise; it is equivalent to keeping the model in a lower effective complexity region","C":"Early stopping reduces overfitting by reducing the learning rate automatically","D":"Early stopping is equivalent to L1 regularization because it also produces sparse models"},"correct":"B","explanation":{"correct":"- Gradient dynamics: early in training, gradients are large and the model captures dominant patterns. 
As training progresses, the optimization explores finer structure that may reflect training-set-specific noise.\n- Formal equivalence (for linear models): Bishop (1995) showed early stopping in gradient descent is equivalent to L2 regularization, where the effective regularization strength is inversely proportional to the number of iterations.\n- Practical implementation: monitor validation loss; save model checkpoints; restore best checkpoint when validation loss stops improving. Patience parameter: how many epochs to wait before stopping.","A":"Early stopping does prevent reaching the global minimum of training loss. But the global minimum of training loss is not the goal — the global minimum of expected generalization loss is. These are different, especially with overparameterized models.","B":"","C":"Early stopping doesn't change the learning rate schedule. It stops training at a fixed learning rate. Learning rate scheduling is a separate technique.","D":"Early stopping has no sparsity property. It is most closely equivalent to L2 regularization (shrinking weights from their fully trained values). L1's sparsity comes from the subdifferential property of the L1 norm."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16005","difficulty":"medium","orderIndex":5,"question":"A Ridge regression model is trained with lambda=10. The model coefficients are: $w = [0.8, 0.6, 0.3, 0.1]$ for features [A, B, C, D]. Feature D has been judged irrelevant by domain experts. 
A data scientist says \"since $w_D = 0.1 \\approx 0$, Ridge has effectively removed Feature D.\" Why is this claim problematic?","options":{"A":"Ridge has actually set $w_D$ to exactly 0, confirming the claim","B":"Ridge shrinks but doesn't zero out coefficients — $w_D = 0.1$ is still non-zero; at prediction time, Feature D still contributes to every prediction; moreover, Ridge with lambda=10 has shrunk ALL coefficients toward zero, not just irrelevant ones; the \"small\" coefficient may reflect both the feature's low relevance AND the regularization penalty compressing the true coefficient; the proper approach for feature removal is L1 (Lasso) or explicit feature selection","C":"Ridge regularization is specifically designed to identify irrelevant features — the smallest coefficient is always the least relevant","D":"A coefficient of 0.1 is practically zero — Ridge has removed Feature D for all practical purposes"},"correct":"B","explanation":{"correct":"- Ridge coefficient: $\\hat{w}^{Ridge} = \\hat{w}^{OLS} / (1 + \\lambda)$ (in orthogonal feature case). With lambda=10, every coefficient is shrunk by a factor of 11. Feature D's true OLS coefficient might be 1.1 (significant!) but Ridge shrinks it to 0.1.\n- Coefficient magnitude under Ridge reflects BOTH feature relevance AND regularization penalty. Comparing coefficients across features is valid only if features are standardized AND lambda is accounted for.\n- For feature selection: use L1 (Lasso) which explicitly zeros coefficients, or model-agnostic methods (permutation importance, SHAP) that measure the actual predictive contribution.","A":"Ridge does not produce exactly zero coefficients by design. This is mathematically guaranteed by the smooth L2 penalty.","B":"","C":"The smallest Ridge coefficient is not necessarily the least relevant feature. 
A relevant feature with high collinearity with other features may have a small Ridge coefficient, while an irrelevant but independent feature may have a moderate coefficient.","D":"\"Practically zero\" is a judgment call, but in a prediction context, 0.1 contributes to every prediction. More importantly, without knowing the unregularized coefficient, you cannot distinguish \"small because irrelevant\" from \"small because regularized.\""}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16006","difficulty":"medium","orderIndex":6,"question":"Dropout with rate 0.5 is applied to a layer with 100 neurons during training. At test time, dropout is turned off and weights are multiplied by 0.5 (the original, non-inverted formulation). Why is this weight scaling necessary?","options":{"A":"Weight scaling prevents numerical overflow in large networks","B":"During training with 50% dropout, each neuron is active on average 50% of the time — its expected contribution to the next layer is halved; at test time, all neurons are active; without scaling, the expected input to the next layer doubles compared to training; multiplying weights by 0.5 at test time (or equivalently, multiplying activations by 0.5 at test time) ensures the same expected signal magnitude at test time as during training","C":"Weight scaling at test time doubles the model's capacity to compensate for lost neurons during training","D":"Weight scaling is only needed for convolutional layers — fully connected layers don't require it"},"correct":"B","explanation":{"correct":"- Without scaling: a neuron with weight $w$ connecting to the next layer contributes $w \\times a$ (activation value). During training with 50% dropout: expected contribution = $0.5 \\times w \\times a$. 
At test time (no dropout): contribution = $w \\times a$ — twice the expected training contribution.\n- This mismatch between training and test distributions would cause the model to produce larger activations at test time, effectively changing the model's behavior. The network was trained to expect 0.5× contributions.\n- Fix: either (1) multiply weights by 0.5 at test time (the original dropout formulation), or (2) during training, scale the surviving activations by $1/(1-p) = 2$ so that no test-time scaling is needed (\"inverted dropout\" — the standard implementation in frameworks like PyTorch and TensorFlow).","A":"Weight scaling prevents train-test distribution mismatch, not numerical overflow. Overflow would be handled by gradient clipping or proper weight initialization.","B":"","C":"Scaling by 0.5 halves weights — it doesn't double capacity. The scaling maintains the same expected activation magnitude, not a doubled capacity.","D":"Dropout and weight scaling apply to any layer type. The activation-magnitude mismatch issue is present for any layer where neurons are randomly dropped."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16007","difficulty":"medium","orderIndex":7,"question":"A team uses L1 regularization on a linear model with 500 features. After tuning lambda, 480 features are zeroed out, leaving 20 non-zero features. They claim \"the L1 model selected the 20 most important features.\" A statistician cautions against this claim. 
Why?","options":{"A":"L1 always selects the correct features — the statistician is wrong","B":"L1 feature selection is not stable — small changes in training data or lambda can change which 20 features are selected; when multiple features have similar predictive power, L1 arbitrarily picks one (as with correlated features); the selected set may also change with different regularization paths; for reliable feature selection, use stability selection (run L1 many times with subsampling and select features that consistently appear) or confirm with permutation importance","C":"L1 can only zero features, not identify important ones — it should be replaced with L2","D":"L1 is inconsistent for variable selection when there are more than 100 features"},"correct":"B","explanation":{"correct":"- L1 inconsistency in correlated groups: if features 1, 2, 3 are correlated and all predictive, L1 may select feature 1 in one run and feature 2 in another (depending on numerical noise, bootstrap sample, random initialization of optimization).\n- Stability selection (Meinshausen & Bühlmann 2010): run Lasso on 100 bootstrap subsamples, count how often each feature is selected. Features selected in >80% of runs are stable selections.\n- Near-equal lambda sensitivity: at the exact regularization level, two features may be equally competitive. Small perturbations determine which is selected.\n- Practical implication: report \"these 20 features were selected on this dataset with this lambda\" rather than \"these are the 20 most important features.\"","A":"L1 feature selection is not provably correct in the presence of correlated features. Stability analysis is needed to validate selections.","B":"","C":"L1 does both select features (via sparsity) and identify predictive ones. The caveat is stability, not the mechanism.","D":"There is no established feature count threshold for L1 consistency. 
The issue is correlation structure, not the absolute number of features."},"reference":"- Stability Selection: https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2010.00740.x"},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16008","difficulty":"hard","orderIndex":8,"question":"Two Ridge regression models are trained: Model A with $\\lambda = 10$, Model B with $\\lambda = 10,000$. Model B's coefficients are all very close to zero but not exactly zero. A student asks: \"what is the closed-form solution for Ridge regression, and how does lambda control coefficient magnitude?\" Provide the answer.","codeSnippet":"# Ridge regression adds L2 penalty:\n# min ||y - Xw||² + λ||w||²\n# Closed-form: w = (X^T X + λI)^{-1} X^T y","options":{"A":"Ridge has no closed-form solution — it must be solved iteratively","B":"The closed form is $\\hat{w} = (X^TX + \\lambda I)^{-1}X^Ty$; as $\\lambda \\to 0$: reduces to OLS; as $\\lambda \\to \\infty$: $(X^TX + \\lambda I)^{-1} \\approx \\lambda^{-1} I \\to 0$, so $\\hat{w} \\to 0$; the $\\lambda I$ term adds a positive constant to the diagonal, making the matrix invertible even when $X^TX$ is singular (perfectly collinear features); larger $\\lambda$ shrinks the coefficients more strongly toward zero, most heavily along low-variance directions of $X$","C":"The closed form is $\\hat{w} = (X^TX)^{-1}X^Ty - \\lambda I$ — Ridge subtracts lambda from OLS coefficients","D":"The closed form requires inverting an $n \\times n$ matrix, making Ridge computationally infeasible for large datasets"},"correct":"B","explanation":{"correct":"- Deriving the closed form: take the derivative of $||y - Xw||^2 + \\lambda||w||^2$ with respect to $w$ and set to zero: $-2X^T(y - Xw) + 2\\lambda w = 0 \\to (X^TX + \\lambda I)w = X^Ty \\to w = (X^TX + \\lambda I)^{-1}X^Ty$.\n- Stabilizing ill-conditioned systems: $X^TX$ may have near-zero eigenvalues (collinear features), making OLS unstable. 
Adding $\\lambda I$ shifts all eigenvalues by $\\lambda$: $(\\sigma_i^2 + \\lambda)^{-1}$ replaces $\\sigma_i^{-2}$. For small $\\sigma_i$, Ridge prevents coefficient explosion.\n- Computation: $X^TX$ is $p \\times p$ — for large $p$, computing $(X^TX + \\lambda I)^{-1}$ is $O(p^3)$. For large $n$, small $p$: efficient. For large $p$: use conjugate gradient or Cholesky decomposition.","A":"Ridge regression has an explicit closed-form solution, unlike L1 (which requires iterative coordinate descent or sub-gradient methods due to non-differentiability).","B":"","C":"This formula is incorrect. Ridge doesn't subtract lambda from OLS estimates. The correct formula changes the matrix to be inverted.","D":"Ridge inverts a $p \\times p$ matrix, not $n \\times n$. For typical problems where $p < n$, this is computationally tractable."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16009","difficulty":"hard","orderIndex":9,"question":"A deep learning model uses both dropout (rate=0.3) and L2 weight decay ($\\lambda = 0.001$). A researcher says these two regularization techniques are redundant for neural networks. 
Are they equivalent, and why or why not?","options":{"A":"Dropout and L2 weight decay are mathematically equivalent for all architectures","B":"They are not equivalent: L2 weight decay penalizes large weights by adding $\\lambda ||w||^2$ to the loss, shrinking all weights uniformly toward zero during gradient updates; dropout randomly deactivates neurons, forcing distributed representations and ensemble-like behavior; they address different sources of overfitting — L2 reduces weight magnitude (preventing reliance on large activations), while dropout prevents co-adaptation of neurons; in practice they are complementary and can improve performance together","C":"Dropout makes L2 redundant because both shrink weights toward zero","D":"L2 weight decay and dropout cancel each other out — applying both produces a worse model than either alone"},"correct":"B","explanation":{"correct":"- L2 weight decay mechanism: adds $\\lambda w$ to gradient update. Effect: all weights are shrunk by a constant fraction each update, preventing large weights.\n- Dropout mechanism: randomly zeros activations. Effect: each neuron cannot rely on specific co-activations → learns more robust, distributed features.\n- Different failure modes addressed: L2 prevents overparameterized models from learning degenerate large-magnitude solutions; dropout prevents neurons from co-adapting (a type of feature interaction overfitting L2 does not address).\n- Note: for adaptive optimizers (Adam, RMSprop), \"weight decay\" and \"L2 regularization\" are NOT equivalent — the interaction with the adaptive learning rate makes them different (decoupled weight decay, AdamW fixes this distinction).","A":"Mathematical equivalence only holds in specific cases (linear models, SGD without momentum). For nonlinear networks with adaptive optimizers, they are not equivalent.","B":"","C":"Dropout does not shrink weights toward zero — it randomly zeros activations during training. 
Weights can grow large; the stochasticity prevents specific feature detector pairs from always co-occurring.","D":"Combined regularization generally outperforms either alone by addressing multiple sources of overfitting. There is no cancellation — they operate on different mechanisms."}},{"section":"machine-learning","topicSlug":"regularization","topic":"Regularization","id":"ml-16010","difficulty":"hard","orderIndex":10,"question":"Batch normalization is described as a regularization technique in addition to being an acceleration technique. Explain the mechanism by which batch normalization provides implicit regularization, and why it can sometimes replace dropout.","options":{"A":"Batch normalization regularizes by adding Gaussian noise to every layer's output","B":"Batch normalization normalizes each feature by the batch statistics (mean, variance) which vary across mini-batches; this introduces stochastic noise into the training process — a sample's normalization depends on the other samples in its batch; this noise acts as a form of regularization, preventing the network from overfitting to individual sample patterns; when batch sizes are small, this noise is larger, providing more regularization; at test time, running averages replace batch statistics, so inference is deterministic and the regularizing noise is present only during training","C":"Batch normalization regularizes only by reducing internal covariate shift — it has no noise effect","D":"Batch normalization is identical to dropout with rate=0.1 — they can always be interchanged"},"correct":"B","explanation":{"correct":"- Stochastic element: during training, $\\mu_B = \\frac{1}{m}\\sum x_i$ and $\\sigma_B^2 = \\frac{1}{m}\\sum(x_i - \\mu_B)^2$ depend on the randomly sampled mini-batch. 
For sample $x_j$: $\\hat{x}_j = (x_j - \\mu_B)/\\sigma_B$ — the normalized value depends on which other samples are in the batch (stochastic).\n- This is why changing batch size affects generalization: small batches = noisy $\\mu_B, \\sigma_B$ = more regularization but less stable training.\n- BN as dropout replacement: Ioffe & Szegedy (original BN paper) observed that BN reduced the need for dropout. Modern architectures often use BN without dropout for convolutional layers.","A":"BatchNorm doesn't add Gaussian noise explicitly. The stochasticity comes from mini-batch sampling. Adding explicit Gaussian noise is a separate technique (data augmentation via noise injection).","B":"","C":"The internal covariate shift reduction (normalizing layer inputs) is the primary motivation. The regularization effect from batch statistics stochasticity is a secondary benefit. Both mechanisms are real.","D":"BatchNorm and dropout are not identical. Dropout creates stochasticity by zeroing activations; BatchNorm creates stochasticity through batch statistics. They have different properties and are often used together in many architectures."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17001","difficulty":"easy","orderIndex":1,"question":"A filter method (chi-squared test) is used to select the top 20 features from 100. A wrapper method (recursive feature elimination with cross-validation) is also run on the same dataset. The wrapper method achieves 4% higher test accuracy but takes 50× longer. 
A team lead asks: \"which should we use in production?\" What are the correct trade-offs?","options":{"A":"Filter methods are always better because they are faster","B":"Filter methods score features independently of the model — they are fast (O(p) evaluations) but ignore feature interactions; wrapper methods evaluate feature subsets using the actual downstream model, capturing interactions — they are slower (O(p²) to O(2^p) evaluations) but more accurate; for production pipelines with a limited computational budget, filter methods are preferred for fast iteration; when accuracy is critical and features have interactions, wrapper methods are justified","C":"Wrapper methods are always better because they use the actual model","D":"The 4% accuracy difference proves filter methods are unusable for any serious ML task"},"correct":"B","explanation":{"correct":"- Filter method (chi-squared, mutual information, variance threshold): evaluates each feature independently using a statistical test. Cannot detect interactions: feature A alone is useless, but A+B together are predictive (XOR problem).\n- Wrapper method (RFE, forward/backward selection): fits the model on feature subsets. Captures all interactions the model can use. Computational cost: RFE fits the model p times (backward elimination, one feature removed per step); forward selection fits roughly p²/2 times.\n- Practical decision matrix: small dataset + high accuracy requirement → wrapper; large dataset + many features + limited computational budget → filter as first pass, wrapper on top-K features.","A":"Filter methods miss feature interactions and model-specific synergies. The 4% accuracy gap in the example shows wrappers can provide meaningful improvement.","B":"","C":"Wrapper methods are computationally expensive and can overfit the feature selection process if cross-validation is not properly implemented. They are not always better.","D":"4% may or may not be practically significant depending on the task. 
Filter methods are widely used in production systems (e.g., mutual information for feature selection in recommendation systems)."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17002","difficulty":"easy","orderIndex":2,"question":"A dataset has a categorical feature \"City\" with 500 unique values. One-hot encoding would create 500 binary features. A data scientist suggests using label encoding (assigning integers 1-500) instead. What is the key problem with label encoding for a nominal categorical feature?","options":{"A":"Label encoding creates too many features — it should never be used","B":"Label encoding imposes an artificial ordinal relationship — it implies City 1 < City 2 < City 500, which is meaningless for nominal categories; a linear model or distance-based algorithm will interpret the numerical values as having relative magnitude; this creates false structure that misleads the model","C":"Label encoding is the best method for high-cardinality categoricals — the data scientist is correct","D":"Label encoding and one-hot encoding are equivalent for tree-based models and linear models"},"correct":"B","explanation":{"correct":"- Nominal: no inherent order (London, Paris, Tokyo are not ranked). Ordinal: has inherent order (low, medium, high).\n- Label encoding: assigns each city an arbitrary integer (1-500 here). Linear regression would learn a single coefficient for \"city\", so Tokyo (coded 100) would contribute 100× as much as NYC (coded 1). This is meaningless arithmetic.\n- One-hot encoding: creates binary indicator per city. The model learns an independent coefficient per city — no ordering imposed. But 500 features is expensive.\n- Better alternatives for high-cardinality: target encoding (replace city with mean target value for that city), frequency encoding, entity embeddings (in deep learning).","A":"Label encoding is valid for ordinal features (education level: high school < bachelor < master). 
The problem is only with nominal categoricals.","B":"","C":"Label encoding for high-cardinality nominal features is a well-known mistake. It's commonly done by beginners who confuse nominal and ordinal encoding requirements.","D":"Tree-based models (decision trees, RF, gradient boosting) can effectively use label-encoded features because they split on individual thresholds — the ordinal assumption doesn't affect them as much. But linear models and distance-based methods are significantly harmed by label encoding of nominal features."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17003","difficulty":"easy","orderIndex":3,"question":"A dataset has a feature \"Age\" with 5% missing values. Three imputation strategies are proposed: (1) Mean imputation, (2) Median imputation, (3) Forward-fill (use the previous row's value). A data scientist says all three are equivalent. When does each strategy fail?","options":{"A":"All three strategies are equivalent because they all produce valid numerical values","B":"Mean imputation fails when age is skewed (outliers pull the mean away from the typical value); median imputation is robust to skewness but fails for time-series data where temporal patterns matter; forward-fill fails for non-time-series tabular data where row order is arbitrary and the \"previous\" row has no relationship to the current row; correct strategy depends on data type, distribution, and whether ordering is meaningful","C":"Forward-fill is always best because it uses observed data","D":"Missing values should always be dropped — imputation always introduces bias"},"correct":"B","explanation":{"correct":"- Mean imputation: replaces missing with $\\bar{x}$. For a skewed distribution (house prices, income), mean is pulled by outliers — imputing with mean artificially concentrates data at a skewed mean.\n- Median imputation: replaces with median (50th percentile). 
Robust to outliers. But for time-series, a patient's age at time T should be near their age at time T-1 — median of all ages is unrelated to temporal continuity.\n- Forward-fill: uses the last observed value (LOCF — last observation carried forward). For time-series (stock prices, sensor readings), this is reasonable. For a shuffled tabular dataset (rows are independent customers), the \"previous\" row is random — forward-fill introduces noise.\n- Best practice: model-based imputation (MICE, KNN imputation) captures correlations with other features.","A":"The strategies produce different imputed values and have different statistical properties. The choice has a measurable impact on downstream model performance.","B":"","C":"Forward-fill only uses \"observed data\" meaningfully when row order is temporally or logically meaningful. For random-order tabular data, forward-fill is essentially injecting noise.","D":"Dropping rows with missing values (complete case analysis) discards information and can introduce bias if data is not Missing Completely At Random (MCAR). Imputation is often better, but choice of method matters."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17004","difficulty":"easy","orderIndex":4,"question":"Mutual information (MI) between a feature X and target Y is computed as: $MI(X; Y) = \\sum_{x,y} p(x,y) \\log \\frac{p(x,y)}{p(x)p(y)}$. A feature has MI = 0. 
What does this mean, and is it always useless for prediction?","options":{"A":"MI = 0 means the feature is perfectly correlated with the target","B":"MI = 0 means X and Y are statistically independent — knowing X provides no information about Y; for a single feature in isolation, MI = 0 means the feature alone is useless; however, feature interactions exist — X might be useless alone but highly predictive when combined with another feature Z (interaction effect); filter methods miss such interactions","C":"MI = 0 means the feature has constant value — it is a constant feature","D":"MI = 0 is impossible for real-world data — it always has some noise that produces non-zero MI"},"correct":"B","explanation":{"correct":"- MI = 0: $p(x,y) = p(x)p(y)$ for all $x, y$ — full independence. Knowing $X$ does not change our estimate of $Y$ at all.\n- Interaction effect: $Y = XOR(X_1, X_2)$ with $X_1, X_2$ independent fair coin flips. $MI(X_1; Y) = 0$ (individually useless). $MI(X_1, X_2; Y) > 0$ (jointly informative). Filter methods that evaluate features individually would eliminate both $X_1$ and $X_2$ — losing all predictive power.\n- This is a fundamental limitation of univariate feature selection (chi-squared, MI, ANOVA) — they cannot detect interaction effects. Wrapper methods and embedded methods can detect interactions because they evaluate feature combinations.","A":"Perfect correlation would give MI > 0 (in fact, for a deterministic relationship: MI = entropy of Y). MI = 0 is the opposite of correlation.","B":"","C":"A constant feature also has MI = 0, but MI = 0 does not require the feature to be constant. An independent non-constant feature also has MI = 0.","D":"For finite samples, estimated MI is typically slightly non-zero due to sampling noise. But the population MI can be exactly 0 for truly independent X and Y. 
The claim is about the population quantity."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17005","difficulty":"easy","orderIndex":5,"question":"A dataset contains a \"TransactionDate\" column in format YYYY-MM-DD (e.g., \"2023-07-15\"). A junior data scientist label-encodes this as an integer (20230715). A senior engineer suggests better feature engineering. What are the recommended derived features?","options":{"A":"Leave the raw date string — modern ML models can parse dates automatically","B":"Extract meaningful temporal features: year, month, day-of-week, day-of-month, week-of-year, is_weekend, days_since_last_transaction, time_since_cohort_start; raw date integers (20230715) don't encode periodicity — the model cannot learn that July (month 7) recurs annually; the derived features capture seasonality, recency, and cyclical patterns that the raw integer misses","C":"Convert date to Unix timestamp (seconds since 1970-01-01) — this is the most informative representation","D":"Dates should always be removed from features — they cause temporal data leakage"},"correct":"B","explanation":{"correct":"- Raw integer (20230715): the model sees this as a continuous value. It cannot learn that December 31 and January 1 are consecutive unless it sees the transition explicitly. Seasonal patterns (holiday shopping every December) are not captured.\n- Extracted features: month captures annual seasonality; day-of-week captures weekly patterns; is_weekend captures activity patterns; days_since_X captures recency effects.\n- Cyclical encoding: for features like month (1-12) and day-of-week (0-6), use sine/cosine encoding to preserve cyclicality: $\\sin(2\\pi \\times \\text{month}/12)$, $\\cos(2\\pi \\times \\text{month}/12)$. This ensures the model knows December is adjacent to January.","A":"Most ML models (decision trees, linear models, gradient boosting) cannot parse raw date strings. 
Even if a model ingests the raw integer, the temporal structure (periodicity, recency) is not encoded in the integer value.","B":"","C":"Unix timestamp preserves temporal ordering but doesn't encode periodicity. A model trained on Unix timestamps cannot learn that similar timestamps occur one year apart — it sees them as values 31M seconds apart.","D":"Dates are valuable features when properly engineered. \"Always remove\" is incorrect — temporal features are often among the most predictive in time-sensitive applications (fraud, demand forecasting)."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17006","difficulty":"medium","orderIndex":6,"question":"Target encoding replaces a categorical feature with the mean of the target variable for that category. A team applies it directly on training data and achieves high training performance. On test data, performance drops significantly for rare categories. What is the issue and the fix?","options":{"A":"Target encoding should only be used for high-cardinality features — the team applied it to a low-cardinality feature","B":"Target encoding leaks the target into the feature directly; if applied on the full training set (including the sample being encoded), the model learns trivial mappings; for rare categories (few samples), the mean is estimated from very few samples — high variance estimates that overfit to noise; fix: use leave-one-out target encoding or cross-validated encoding (compute mean from out-of-fold samples) and apply smoothing (blend category mean with global mean weighted by sample count)","C":"Target encoding fails because it converts categorical to continuous — use ordinal encoding instead","D":"Target encoding is not appropriate for any ML model — use one-hot encoding always"},"correct":"B","explanation":{"correct":"- Leakage: if a training sample's own target is included in computing its category encoding, the model 
learns $y_i = \\hat{y}_i$ — trivial perfect training performance.\n- Rare category variance: a category with 3 samples has mean = (y1+y2+y3)/3. High variance: for a binary target it could be 0, 0.33, 0.67, or 1.0 depending on those 3 specific samples. At test time, the estimate is unreliable.\n- Smoothing formula: $\\text{encoding}(c) = \\frac{n_c \\times \\bar{y}_c + m \\times \\bar{y}_\\text{global}}{n_c + m}$ where $n_c$ = samples in category $c$, $m$ = smoothing parameter. Rare categories are pulled toward the global mean; frequent categories use their own mean.\n- Cross-validated encoding: encode the samples in each fold using category means computed only from the other folds, so a sample's own target never enters its encoding.","A":"Target encoding is especially valuable for high-cardinality features (where one-hot would create thousands of dimensions). The problem is not about cardinality but about proper encoding implementation.","B":"","C":"The continuous output of target encoding is a feature, not a problem. The issue is leakage and rare-category variance, not the data type conversion.","D":"Target encoding is widely used and effective for high-cardinality categoricals (e.g., zip code, user ID). One-hot encoding for 10,000 zip codes creates a 10,000-dimensional sparse feature — computationally expensive and prone to overfitting."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17007","difficulty":"medium","orderIndex":7,"question":"Permutation importance is computed for a trained random forest: after randomly shuffling feature X, test accuracy drops by 15%. After shuffling feature Z, accuracy drops by 0.2%. A data scientist removes Z from the model and retrains. The new model's accuracy drops by 3%.
Explain this result.","options":{"A":"The result proves permutation importance is unreliable — it should never be used","B":"Permutation importance measures how much the model uses each feature, not each feature's intrinsic predictive value; Z may have had low permutation importance because X (correlated with Z) was substituted by the model when Z was shuffled; when Z is removed and X cannot compensate (X may have a different interaction), the 3% drop reveals Z's marginal contribution that permutation importance masked due to collinearity","C":"The 3% drop proves the first permutation importance computation was computed incorrectly","D":"Permutation importance always correctly identifies redundant features — Z should be removed based on the 0.2% drop"},"correct":"B","explanation":{"correct":"- Collinearity masking: if X and Z are correlated ($r = 0.9$), when Z is shuffled, the model can still predict using X (which maintains the shared signal with the target). So shuffling Z appears harmless (0.2% drop). But when Z is physically removed, X may not fully capture Z's unique contribution — 3% drop.\n- Permutation importance measures \"feature usage by the current model\" not \"intrinsic feature importance.\" For correlated features, importance is split arbitrarily between them.\n- Conditional permutation importance (Strobl et al.): shuffle Z while conditioning on the values of correlated features — better estimates the true marginal contribution.","A":"Permutation importance is a valid and widely used method (as model-agnostic importance). The issue is correct interpretation in the presence of correlated features. The tool is not unreliable — it requires nuanced interpretation.","B":"","C":"The permutation importance was computed correctly. It correctly measured how much the model uses Z in the presence of X. 
The removal experiment revealed a different (but also valid) quantity: Z's marginal contribution when X cannot compensate.","D":"Low permutation importance for a correlated feature should not automatically trigger removal. The collinearity must be investigated before removing features based solely on permutation importance."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17008","difficulty":"medium","orderIndex":8,"question":"A dataset contains an \"Income\" feature with a heavy right tail: median = $50K, mean = $80K, max = $5M. Standard scaling (z-score) is applied. After scaling, the top 1% of values (high earners) have z-scores of 40-200. A linear model trained on this data has large coefficients for income-related predictions. What is the problem and the fix?","options":{"A":"The data is correct — high z-scores for outliers are expected and do not affect the model","B":"Standard scaling preserves the original distribution's skewness and outlier effects — extreme values (z=40-200) dominate the model's coefficient for income; a log transformation ($\\log(\\text{income}+1)$) would first compress the right tail, then standard scaling would create a more symmetric distribution with z-scores in a reasonable range; outliers would no longer dominate linear model training","C":"The fix is to remove all records with income > $500K — outliers should always be dropped","D":"Standard scaling is the correct preprocessing — no additional transformation is needed for skewed features"},"correct":"B","explanation":{"correct":"- Heavy-tailed features: standard scaling makes $z = (x - \\mu)/\\sigma$. If $\\sigma$ is large (due to the long tail), most values cluster around $z = 0$ to $z = 2$. The top 1% gets enormous z-scores, dominating any linear model's loss function.\n- Log transformation: compresses the right tail. 
$\\log(\\$5M) \\approx 15.4$, $\\log(\\$50K) \\approx 10.8$, $\\log(\\$30K) \\approx 10.3$ (natural log). The range from median to maximum is compressed from 50K-5M dollars (a 100:1 ratio) to 10.8-15.4 (about 1.4:1 in log space). After log-transform, standard scaling gives reasonable z-scores.\n- Box-Cox or Yeo-Johnson transform: more general parametric transformation that can handle both positive and zero/negative values.","A":"High z-scores (40-200) are not innocuous. In gradient descent, the gradient magnitude is proportional to feature values × error. Features with huge z-scores produce huge gradients, destabilizing training.","B":"","C":"Dropping income > $500K removes legitimate data points. High earners may be a meaningful segment (tax policy analysis, luxury goods). Removal introduces selection bias.","D":"Standard scaling is appropriate for approximately normal distributions. For heavy-tailed distributions, it leaves the skewness intact — additional transformation (log, sqrt) is needed first."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17009","difficulty":"hard","orderIndex":9,"question":"SHAP (SHapley Additive exPlanations) values are used for feature importance in a gradient boosted model. SHAP values for feature X are: most values near 0, but for 10 specific samples, SHAP values are +15 (strongly pushing toward positive class). Standard permutation importance gives feature X an importance of 0.02 (very low).
Explain the discrepancy.","options":{"A":"SHAP and permutation importance are computing the same quantity — one must be wrong","B":"Permutation importance averages the effect of shuffling X across ALL samples — if X has high impact on 10 samples but near-zero impact on 990 samples, the average effect is diluted to ~1%; SHAP provides per-sample contributions, revealing that X is critical for the 10 specific samples even though globally unimportant; both metrics are correct — they answer different questions; for rare high-impact cases (fraud detection, medical alerts), SHAP reveals features that matter for specific predictions","C":"SHAP values of +15 indicate a computation error — SHAP values must be between -1 and +1","D":"Permutation importance is always more accurate than SHAP for gradient boosted models"},"correct":"B","explanation":{"correct":"- Permutation importance: average accuracy drop after shuffling X across all test samples. If X is only important for 1% of samples, the average drop is ~0.01 × (importance on those samples) — diluted.\n- SHAP: computes the marginal contribution of feature X to each individual prediction $\\hat{f}(x_i)$ using Shapley values from cooperative game theory. Each sample has its own SHAP vector.\n- The 10 samples with SHAP = +15 represent cases where X is the critical driver. For fraud detection: these might be the actual fraud cases where a specific pattern in X is the key indicator.\n- Application: in production, SHAP explains individual decisions (why was this transaction flagged?), even for features that appear globally unimportant.","A":"SHAP and permutation importance answer different questions. SHAP explains individual predictions; permutation importance measures global model reliance. Discrepancies are expected and meaningful.","B":"","C":"SHAP values have no fixed range — they represent the contribution in the units of the model output. 
A SHAP value of +15 for log-odds or a regression target is valid.","D":"Neither is universally more accurate. Permutation importance captures global model reliance; SHAP provides per-instance explanations, and some SHAP variants account for correlated features through conditional expectations. They are complementary."},"reference":"- Lundberg & Lee, \"A Unified Approach to Interpreting Model Predictions\" (SHAP): https://arxiv.org/abs/1705.07874"},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17010","difficulty":"hard","orderIndex":10,"question":"A team creates a new feature by dividing Revenue by Cost (a ratio feature). In training data, Cost is always > 0. In production, some records have Cost = 0 (free services). The model crashes in production. Additionally, the ratio feature inflates importance in tree models. What are the two problems and their fixes?","options":{"A":"The problems are data type mismatch and overfitting — use float64 and add more training data","B":"Problem 1 (Division by zero): Cost = 0 in production causes infinity or NaN — fix: clip denominator ($\\max(\\text{Cost}, \\epsilon)$) or add a small constant (Revenue/(Cost+1)); Problem 2 (Ratio inflation in trees): if Revenue and Cost are separately available, the ratio extracts information already in the original features and creates a derived feature with different scale/distribution; tree models may overfit to extreme ratio values (very high Revenue/Cost = outlier); fix: use robust ratios ($\\log(\\text{Revenue}) - \\log(\\text{Cost})$), clip ratios, or ensure original features are also included","C":"Remove the ratio feature entirely — ratio features always cause problems in tree models","D":"The crash is caused by integer overflow — use int64 instead of float32"},"correct":"B","explanation":{"correct":"- Division by zero defense: $\\epsilon$ clipping: Revenue/max(Cost, 0.001).
This prevents infinity while preserving the ratio's meaning for nearly-free services. Adding 1 (Laplace-like smoothing): Revenue/(Cost+1) — shifts the ratio but prevents zero denominator.\n- Log-ratio: $\\log(\\text{Revenue}) - \\log(\\text{Cost}) = \\log(\\text{Revenue/Cost})$ is the log-ratio, which compresses extreme values and has better statistical properties (more normal distribution in many business scenarios). It is still undefined when Cost = 0 or Revenue = 0, so it must be combined with clipping or an offset such as $\\log(1 + \\text{Cost})$.\n- Tree model interaction: ratio features are nonlinear combinations of existing features. Decision trees can recreate ratios by sequential splits on Revenue and Cost. Providing the ratio can help by expressing a known meaningful relationship, but it can also introduce high-leverage points.","A":"Data type mismatches and overfitting are separate issues. The described problems are divide-by-zero (runtime error) and extreme ratio values (modeling issue).","B":"","C":"Ratio features are extremely common and useful (financial ratios, click-through rates, conversion rates). Removing them entirely would discard meaningful domain knowledge.","D":"The crash is due to division by zero (NaN/infinity propagation), not integer overflow. Float32 and float64 can both represent infinity and NaN."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17011","difficulty":"hard","orderIndex":11,"question":"Recursive Feature Elimination with Cross-Validation (RFECV) eliminates features one at a time using model coefficients/importances. On a dataset with 200 features, it selects 45 features. The final model achieves 88% accuracy. A researcher warns this result is overly optimistic.
Why, and what is the correct evaluation protocol?","options":{"A":"RFECV is always unbiased — 88% is the correct expected performance","B":"RFECV wraps cross-validation around the feature elimination process — if the entire RFECV (including feature selection) is run on training+test data together, the test set influenced which features were selected; if RFECV was run only on training data but evaluated on the same test set multiple times (testing different selected subsets), the test set was implicitly used for selection; correct protocol: outer cross-validation evaluates the entire pipeline (including RFECV feature selection) on held-out data that was never used during selection","C":"45 features is too many — reducing to 20 features would make the result accurate","D":"The 200-to-45 reduction causes underfitting, which makes the accuracy estimate overly optimistic"},"correct":"B","explanation":{"correct":"- Double dipping: if you run RFECV on train+test, the test set influences which features are \"good\" → test performance is inflated.\n- If RFECV runs only on training data but you then evaluate multiple feature subsets on the test set: each evaluation on the test set is a comparison, and the best-performing feature set is selected → test set contamination.\n- Nested cross-validation: outer loop k-fold for unbiased performance estimate; inner loop k-fold for RFECV. For each outer fold, the RFECV sees only the inner training data. The outer test fold is truly held out.\n- sklearn Pipeline + cross_val_score: when RFECV is inside a Pipeline and cross_val_score is applied, the outer CV loop correctly isolates the test fold from feature selection.","A":"RFECV is only unbiased if the entire feature selection process is nested within cross-validation and the test fold is never accessed during selection.","B":"","C":"The number of selected features doesn't determine whether the evaluation is biased. 
The bias comes from the evaluation protocol, not the feature count.","D":"Selecting fewer features reduces model complexity — this can either reduce overfitting (if the removed features were noise) or cause underfitting (if informative features were removed). Feature count doesn't directly determine whether the evaluation is optimistic."}},{"section":"machine-learning","topicSlug":"feature-selection-and-engineering","topic":"Feature Selection And Engineering","id":"ml-17012","difficulty":"hard","orderIndex":12,"question":"A practitioner creates \"interaction features\" by multiplying pairs of the top 10 features, creating 45 additional features (10 choose 2). The combined model achieves much better training performance but validation performance is mixed. A colleague says \"interaction features always improve models.\" What is the nuanced truth?","options":{"A":"Interaction features always improve linear models because they add expressiveness","B":"Interaction features explicitly encode pairwise relationships that can improve linear models (which otherwise cannot capture interactions) and can help tree models learn interactions with fewer splits; however, adding 45 features to 10 more than quintuples the feature space — the model now has 55 features from 10 original; with limited data, many interaction terms will be noise; the curse of dimensionality: overfitting increases; valid interactions should be grounded in domain knowledge (A × B is meaningful) rather than generated combinatorially; use cross-validation to confirm actual validation improvement","C":"Interaction features only work for tree-based models — they are harmful for linear models","D":"Multiplying features always causes multicollinearity that makes models untrainable"},"correct":"B","explanation":{"correct":"- Linear model limitation: $y = w_1 x_1 + w_2 x_2$ cannot capture $y = x_1 \\times x_2$.
Adding the feature $x_1 x_2$ lets the linear model capture this interaction.\n- Tree models: decision trees can learn $x_1 \\times x_2$ interactions through consecutive splits. But explicit interaction features can reduce the tree depth needed, potentially improving learning efficiency with limited data.\n- Overfitting risk: 45 additional features from 10 originals, most of which are likely noise. With n=200 samples: 55 features + noise interactions → overfitting. Use L1 regularization or select only domain-motivated interactions.\n- Domain-motivated examples: Income × Education (reasonable synergy), Age × Risk_Factor (valid interaction for insurance), Random_Feature1 × Random_Feature2 (likely noise).","A":"\"Always improve linear models\" is false. Uninformative interaction features add noise and may not improve validation performance even if they improve training performance.","B":"","C":"Interaction features are most useful for linear models (which cannot learn interactions from raw features). Tree models can learn them from the raw features, though explicit features can help.","D":"Products of features do increase collinearity (in particular, $x_1$ and $x_1 \\times x_2$ are often correlated).
But \"untrainable\" is an exaggeration — regularization (L2 or L1) handles the multicollinearity."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-001","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T01 · ML Fundamentals] A recruiter asks: \"what is the difference between a model parameter and a hyperparameter?\" Which answer is correct?","options":{"A":"Parameters are set before training; hyperparameters are learned during training","B":"Parameters (weights, biases) are learned by the optimization algorithm during training; hyperparameters (learning rate, tree depth, K in KNN) are set by the practitioner before training and control the training process itself","C":"There is no difference — both are tuned during training","D":"Hyperparameters are only relevant for neural networks, not classical ML models"},"correct":"B","explanation":{"correct":"- Parameters: values the model learns to minimize loss — e.g., linear regression coefficients $w$, neural network weights $\\theta$. They are updated by gradient descent or closed-form solutions.\n- Hyperparameters: design choices that control the training process — learning rate, regularization strength, number of trees, max depth. They are not learned from data; they are set by the practitioner (often via cross-validation).\n- Key test: if the optimization algorithm updates it → parameter. If you set it before training → hyperparameter.","A":"Reversed. Parameters are learned during training; hyperparameters are set before training.","B":"","C":"Only parameters are learned by the optimization algorithm. 
Hyperparameters require separate tuning (grid search, random search, Bayesian optimization).","D":"Hyperparameters exist for all ML models: K in KNN, max_depth in decision trees, C in SVM, number of clusters in K-means."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-002","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T11 · Clustering] A K-means model is trained on customer data. After convergence, a new customer's cluster is determined by finding the centroid with the smallest Euclidean distance to the new customer's feature vector. No retraining occurs. What is this prediction step called, and what assumption does it make?","options":{"A":"Online learning — the model updates centroids with each new customer","B":"Cluster assignment (inference) — the frozen centroids from training are used to assign the new point; this assumes the production data distribution is similar to training distribution; if distribution has shifted, the cluster labels may be meaningless","C":"Re-clustering — each new point triggers a full K-means restart","D":"Interpolation — the model averages predictions from the nearest two centroids"},"correct":"B","explanation":{"correct":"- After K-means training, centroids are frozen. Inference = find argmin of distance to each centroid.\n- No retraining occurs — the model assumes the production distribution resembles training. If customer behavior changes, new customers may fall in between centroids, getting poor or irrelevant assignments.\n- This single-centroid assignment is the standard production serving pattern for K-means.","A":"Online learning updates model parameters with each new sample. Standard K-means inference does not update centroids.","B":"","C":"Re-clustering restarts K-means from scratch. Standard serving uses frozen centroids.","D":"Interpolation is not how K-means assigns clusters. 
The closest centroid wins outright (hard assignment)."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-003","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T15 · Bias-Variance] A model achieves 70% accuracy on training data and 69% accuracy on test data. The Bayes optimal accuracy for this task is estimated at 95%. What is the primary problem?","options":{"A":"Overfitting — the 1% train-test gap proves the model has too much variance","B":"High bias (underfitting) — both training and test errors are high (30% and 31%), with a small gap; the model is too simple to capture the true pattern; the large gap between model performance (~70%) and Bayes optimal (~95%) is the key signal","C":"The model is well-optimized — 69% test accuracy is excellent","D":"The test set is too small — more test data would reveal better performance"},"correct":"B","explanation":{"correct":"- Avoidable bias = training error − Bayes error = 30% − 5% = 25%. This is the gap that can be fixed by improving the model.\n- Variance = test error − training error = 1%. Variance is almost zero — the model is highly stable but just not powerful enough.\n- The fix is to increase model complexity (deeper network, more features, more trees), not to add regularization or more data.","A":"A 1% train-test gap indicates very low variance. Overfitting shows a large gap.","B":"","C":"If the best achievable is 95%, 69% leaves 26 points of avoidable error on the table. That's not well-optimized.","D":"Test set size doesn't change the model's actual accuracy on the task. The problem is the model architecture, not the evaluation."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-004","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T04 · Decision Trees] A decision tree is grown to full depth (no max_depth constraint) on training data. 
What is the training error, and why is this a problem?","options":{"A":"Training error is 50% — full-depth trees cannot learn well without pruning","B":"Training error is near 0% (or exactly 0% if no duplicate feature vectors with different labels exist) — each leaf contains one or a few training samples; the tree has memorized training data; this causes high variance and poor generalization","C":"Full-depth trees have the same training error as pruned trees","D":"Training error is undefined for full-depth trees because they overfit by definition"},"correct":"B","explanation":{"correct":"- A decision tree with no depth constraint will keep splitting until each leaf is pure (one class). For a dataset without conflicting labels, training error reaches exactly 0%.\n- The model memorizes every training sample — extremely high variance. A small change in training data produces a completely different tree.\n- The solution is to use max_depth, min_samples_leaf, min_samples_split, or cost-complexity pruning (ccp_alpha in sklearn) to prevent memorization.","A":"Full-depth trees achieve near-0% training error, not 50%. This is the definition of the overfitting problem.","B":"","C":"Pruning increases training error (removes some memorized splits) but reduces test error by generalizing better.","D":"Training error is well-defined. It's simply the fraction of training samples the tree misclassifies — 0% for full-depth trees on clean data."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-005","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T16 · Regularization] A linear regression model with no regularization has 200 features and 150 training samples. The matrix $X^TX$ is not invertible. 
What does this mean for the OLS closed form?","options":{"A":"OLS can still be computed using any matrix inversion routine","B":"With more features than samples (p=200 > n=150), $X$ does not have full column rank; $X^TX$ (200×200) is singular (rank ≤ 150); the OLS solution $(X^TX)^{-1}X^Ty$ is undefined; Ridge regularization fixes this by adding $\\lambda I$: $(X^TX + \\lambda I)$ is always invertible for $\\lambda > 0$","C":"The closed form still works — matrix inversion handles singular matrices automatically","D":"With p > n, the correct approach is to use a neural network instead"},"correct":"B","explanation":{"correct":"- Rank of $X^TX$ ≤ min(n, p) = 150. Since $X^TX$ is 200×200 with rank ≤ 150, it has at least 50 zero eigenvalues → not invertible.\n- Geometrically: infinitely many hyperplanes fit the 150 training points in 200-dimensional space — no unique OLS solution.\n- Ridge: $(X^TX + \\lambda I)$ shifts all eigenvalues by $\\lambda > 0$ → all eigenvalues > 0 → invertible. This is one of Ridge's key practical benefits beyond regularization.","A":"Standard matrix inversion routines will fail or return numerically unstable results for singular matrices. Pseudoinverse (Moore-Penrose) can be used but gives the minimum-norm solution, which is only one of infinitely many exact fits rather than a unique OLS solution.","B":"","C":"Standard inversion of a singular matrix produces infinity/NaN or numerical garbage. NumPy's `np.linalg.solve` raises a `LinAlgError` when it detects an exactly singular matrix and can otherwise return numerically unstable values.","D":"Neural networks also face ill-conditioning with p > n. The answer is regularization (Ridge, L1, dropout), not switching algorithms."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-006","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T07 · SVM] An SVM with a linear kernel is trained on a 2D dataset that is not linearly separable. Training fails to find a clean margin.
A colleague suggests \"use a higher C value to fix this.\" Is this correct?","options":{"A":"Yes — high C forces the SVM to find a wider margin, separating the classes","B":"No — high C shrinks the allowed margin (penalizes misclassifications more); for non-linearly separable data, increasing C tries harder to separate training points but cannot achieve linear separability; the correct fix is to use a nonlinear kernel (RBF, polynomial) that maps to a higher-dimensional space where classes are separable","C":"C has no effect on linear SVM — it only matters for kernel SVMs","D":"For non-linearly separable data, SVM always fails regardless of C or kernel"},"correct":"B","explanation":{"correct":"- C in soft-margin SVM: high C = high penalty for misclassified points → smaller margin, fewer training errors; low C = allows more misclassifications → larger margin, more regularization.\n- For non-linearly separable data in 2D, no linear hyperplane can perfectly separate classes. Increasing C just causes the SVM to try harder to separate with a linear boundary — it may overfit to noise without achieving true separation.\n- RBF kernel implicitly maps to infinite-dimensional space where linear separation often becomes possible (Cover's theorem).","A":"High C shrinks (not widens) the margin. High C = hard margin, low C = soft margin.","B":"","C":"C applies to all SVM variants including linear kernel. It controls the misclassification penalty in all cases.","D":"With a nonlinear kernel (RBF), SVM can achieve separation for most non-linearly separable datasets in practice."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-007","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T09 · Naive Bayes] A text classifier uses Multinomial Naive Bayes. The training vocabulary has 5,000 words. A new test document contains the word \"blockchain\" which was not in the training corpus. 
Without Laplace smoothing, what happens to the model's prediction for this document?","options":{"A":"The word is ignored — NB skips out-of-vocabulary words automatically","B":"$P(\\text{blockchain}|\\text{any class}) = 0$; the product $\\prod P(w_i|\\text{class})$ includes a zero term → posterior = 0 for every class; the model cannot classify the document (division by zero / zero probability for all classes)","C":"The model assigns probability 0.5 to the word by default","D":"The model raises an exception because it cannot handle new vocabulary"},"correct":"B","explanation":{"correct":"- MLE probability: $P(w|c) = \\text{count}(w,c) / N_c$. \"Blockchain\" has count 0 → $P(\\text{blockchain}|c) = 0$ for all classes.\n- Product of likelihoods: $\\prod_i P(w_i|c)$ contains one zero factor → product = 0 for every class.\n- All posteriors are 0 → undefined argmax → model cannot predict.\n- Laplace smoothing ($\\alpha = 1$) prevents this: every word gets count + 1 in the numerator.","A":"Standard MNB does not skip unknown words. Without explicit handling, the zero probability problem occurs.","B":"","C":"0.5 is not a default in any standard NB implementation. That would require hard-coding an arbitrary fallback probability.","D":"NB doesn't raise exceptions by default — it silently computes 0 probability. The failure is silent, not an error."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-008","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T12 · Anomaly Detection] An anomaly detection pipeline produces 500 alerts per day. Upon review, investigators find only 10 are real anomalies.
What is the precision of the detector, and why does this metric matter operationally?","options":{"A":"Precision = 10/500 = 2%; each alert has only a 2% chance of being a real anomaly; investigators waste 98% of their effort on false alarms; high false alarm rate causes \"alert fatigue\" — investigators start ignoring alerts","B":"Recall = 10/500 = 2%; the model is missing 98% of anomalies","C":"Accuracy = 10/500 = 2%; the model is 2% accurate","D":"Precision cannot be computed without knowing the total number of real anomalies in the day"},"correct":"A","explanation":{"correct":"- Precision = TP / (TP + FP) = 10 / 500 = 2%. Of all flagged alerts, only 2% are genuine.\n- Operational impact: investigators must examine all 500 alerts. 490 are wasted effort. If each investigation takes 30 minutes: 245 hours/day wasted on false positives.\n- Alert fatigue: high false positive rates cause investigators to skip investigations or set high bars, causing them to miss real anomalies.\n- Fix: raise the anomaly score threshold (reduces FP but may increase FN), or use better features/model.","A":"","B":"Recall = TP / (TP + FN). Without knowing the total number of real anomalies in the day, we cannot compute recall from this information alone.","C":"Accuracy requires knowing TN (true negatives — non-anomalous events correctly not flagged), which is much larger. Accuracy is not 2%.","D":"Precision only requires TP and FP from flagged alerts. 
Recall requires knowing total actual positives, but precision does not."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-009","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T05 · Random Forest] What does \"out-of-bag error\" mean in Random Forest, and why is it useful?","options":{"A":"Out-of-bag error measures the error on training samples that were included in the bootstrap sample","B":"Each bootstrap sample excludes ~36.8% of training points — these excluded points are \"out-of-bag\"; each tree is evaluated on its OOB samples; averaging these evaluations gives an unbiased estimate of generalization error without needing a separate validation set","C":"Out-of-bag error is the difference between training and test error","D":"Out-of-bag error only applies when Random Forest uses 500+ trees"},"correct":"B","explanation":{"correct":"- Bootstrap sampling: draws n samples with replacement. Each sample has probability $(1-1/n)^n \\approx e^{-1} \\approx 36.8\\%$ of never being selected.\n- For each tree, predict using the ~36.8% of samples that tree never saw during training. Average these OOB predictions to get the OOB error.\n- This provides a \"free\" cross-validation estimate — no separate validation set needed. It is comparable to (but not identical to) leave-one-out cross-validation.","A":"OOB points are those EXCLUDED from the bootstrap sample, not included. Training is done on the included points; OOB evaluation uses the excluded ones.","B":"","C":"OOB error is a standalone estimate. It doesn't require a separate test set and is not a gap between two other quantities.","D":"OOB error applies to any Random Forest regardless of tree count. 
More trees give more stable OOB estimates (each sample is OOB for more trees), but the mechanism works with any number."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-010","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T06 · Gradient Boosting] In gradient boosting with log-loss, what does each new tree in the sequence fit to?","options":{"A":"Each new tree fits the raw training labels (0 or 1) starting fresh","B":"Each new tree fits the pseudo-residuals — the negative gradient of the log-loss evaluated at the current model's predictions; for log-loss: $r_i = y_i - \\hat{p}_i$ (actual label minus current predicted probability); the tree corrects what the current ensemble gets wrong","C":"Each new tree fits the square of the previous tree's predictions","D":"Each new tree is identical to the previous tree but with doubled learning rate"},"correct":"B","explanation":{"correct":"- Gradient boosting framework: $F_m(x) = F_{m-1}(x) + \\eta \\cdot h_m(x)$ where $h_m$ is a tree fitted to the negative gradient.\n- For log-loss: $-\\partial L / \\partial F = y_i - \\hat{p}_i$. Where current predictions are too low for positives (underestimating $\\hat{p}$ for class 1), residuals are positive → new tree pushes predictions up.\n- Sequential correction: the ensemble improves iteratively, each tree fixing the mistakes of the cumulative model so far.","A":"If each tree fit raw labels from scratch, there would be no \"boosting\" — just many independent shallow trees. The sequential residual fitting is what defines gradient boosting.","B":"","C":"Fitting squared predictions has no theoretical justification and would not converge to a useful model.","D":"Each tree is independently fitted on current residuals. 
Trees are different from each other (they fit different residual patterns)."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-011","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T17 · Feature Engineering] A dataset has house prices as the target. The \"YearBuilt\" feature ranges from 1900 to 2023. A data scientist creates a new feature `age = 2026 - YearBuilt`. What type of feature engineering is this, and what advantage does it provide over the raw year?","options":{"A":"This is feature scaling — it normalizes the year to a standard range","B":"This is feature transformation (domain-informed engineering) — \"age\" directly encodes how old the house is, which has a more natural relationship with price (older = more maintenance, lower value in many markets); raw year (1900-2023) encodes calendar time, which may be hard for a model to interpret relative to the prediction date; age is a more semantically meaningful and model-friendly representation","C":"This transformation is harmful — it removes useful temporal information","D":"Subtracting from a constant is only valid for linear models"},"correct":"B","explanation":{"correct":"- Domain knowledge: house age has a clearer causal relationship with price depreciation, maintenance cost, and desirability than the calendar year of construction.\n- Model interpretability: an age of 5 (newly built) vs age of 120 (very old) is intuitively meaningful. The year \"1950\" is not interpretable without knowing \"what year is now?\"\n- Generalization: if the model is used in future years, \"age\" automatically updates meaning (a house built in 1990 is 36 years old in 2026 but would be 37 in 2027); raw year \"1990\" is static.","A":"Feature scaling changes the range/distribution. Subtracting year from a constant is a simple linear transformation that changes the reference point, not the scale.","B":"","C":"Age preserves the temporal information — it's a monotone transformation of year. 
No information is lost; the representation is just more meaningful.","D":"Linear transformations (including affine shifts like `2026 - YearBuilt`) are valid for any model type. Tree models handle this identically to the raw year."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-012","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T02 · Linear Regression] A linear regression model is evaluated with $R^2 = 0.85$. A new feature is added, and $R^2$ increases to 0.86. Should the new feature be kept?","options":{"A":"Yes — any increase in R² means the feature is useful","B":"Not necessarily — R² always increases (or stays the same) when any feature is added, even a random noise feature; the increase from 0.85 to 0.86 may reflect overfitting to the new feature, not genuine signal; use adjusted R² or compare models using a held-out test set or cross-validation","C":"No — R² above 0.85 indicates overfitting, so the feature should be removed","D":"Adding a feature to a linear model always causes overfitting regardless of R² change"},"correct":"B","explanation":{"correct":"- Property of R²: adding any feature (even pure random noise) can only increase or maintain R² on training data — it can never decrease. The optimization just sets the noise feature's coefficient to near-zero.\n- Adjusted R²: $\\bar{R}^2 = 1 - (1-R^2)(n-1)/(n-p-1)$. Penalizes for number of parameters $p$. If adjusted R² decreases after adding the feature, the feature is not worth its added complexity.\n- Better: evaluate on a held-out test set. If test R² decreases, the feature is introducing overfitting.","A":"R² inflation is a well-known problem. Adding random noise to a linear model always increases training R². This is why adjusted R² or test performance should be used.","B":"","C":"R² above 0.85 has no connection to overfitting. A model can have R²=0.99 on test data without overfitting.","D":"Adding features to linear models can be beneficial. 
Regularization or feature selection can be used to handle non-useful features."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-013","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T10 · PCA] PCA is applied to a standardized dataset before K-means clustering. A colleague says \"you must apply PCA before K-means because K-means requires uncorrelated features.\" Is this claim correct?","options":{"A":"Correct — K-means requires uncorrelated input features by design","B":"Incorrect — K-means has no mathematical requirement for uncorrelated features; PCA before K-means can be beneficial for reducing noise and dimensionality, and for making Euclidean distance more meaningful; but it is not required; the stated justification is wrong","C":"Correct — K-means uses PCA internally to find clusters","D":"PCA should never be applied before K-means as it destroys cluster structure"},"correct":"B","explanation":{"correct":"- K-means uses Euclidean distance: $||x_i - \\mu_k||^2$. This works fine with correlated features — correlated features are just redundant, not harmful to the algorithm's convergence.\n- Valid reasons to use PCA before K-means: (1) reduce noise dimensions that dilute the distance signal; (2) reduce computation for high-dimensional data; (3) visualize clusters in 2D.\n- Invalid reason: \"K-means requires uncorrelated features.\" This is a common myth. Correlated features can produce elongated (elliptical) clusters that K-means, which favors roughly spherical clusters, fits poorly; this is a limitation of the algorithm, not a requirement violation.","A":"K-means has no correlation requirement. It minimizes WCSS using Euclidean distance, which is defined for any feature space.","B":"","C":"K-means does not internally use PCA. It uses centroid distance calculations.","D":"PCA can actually help K-means by removing noisy dimensions. 
The concern would be if PCA discards dimensions that separate the clusters — this is possible but not universal."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-014","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T03 · Logistic Regression] The decision boundary of a logistic regression model in 2D is the set of points where $\\hat{p}(x) = 0.5$. What is the geometric shape of this boundary?","options":{"A":"A circle — logistic regression produces circular decision boundaries","B":"A straight line (linear) — the decision boundary is where $w_1x_1 + w_2x_2 + b = 0$; this is a linear equation in the feature space; the sigmoid outputs 0.5 exactly when its input is 0 (the linear boundary)","C":"A sigmoid curve — the decision boundary follows the shape of the sigmoid function","D":"The boundary can be any shape — it depends on the training data distribution"},"correct":"B","explanation":{"correct":"- $\\hat{p} = \\sigma(w^Tx + b)$. When $\\hat{p} = 0.5$: $\\sigma(z) = 0.5 \\Rightarrow z = 0 \\Rightarrow w^Tx + b = 0$.\n- This equation $w^Tx + b = 0$ defines a hyperplane (line in 2D, plane in 3D). Logistic regression is a linear classifier.\n- To create nonlinear boundaries: add polynomial features ($x_1^2, x_1 x_2$, etc.) before logistic regression, or use kernel logistic regression.","A":"Circular boundaries require $x_1^2 + x_2^2 = c$ — a nonlinear equation. Standard logistic regression cannot produce circles without feature engineering.","B":"","C":"The sigmoid function maps real values to (0,1). 
The decision boundary is where this output equals 0.5 — a 2D line, not the sigmoid curve itself.","D":"While the training data influences the learned weights $w$, the geometric form of the boundary is always linear (hyperplane) regardless of data distribution."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-015","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T14 · Model Evaluation] A binary classifier is evaluated on a balanced dataset (50% positive, 50% negative). Accuracy = 75%. Precision = 70%, Recall = 85%. What does the F1 score equal approximately?","options":{"A":"F1 = (70 + 85) / 2 = 77.5% (arithmetic mean)","B":"F1 = 2 × (0.70 × 0.85) / (0.70 + 0.85) = 2 × 0.595 / 1.55 ≈ 76.8% (harmonic mean of precision and recall)","C":"F1 = 75% (equals accuracy for balanced datasets)","D":"F1 cannot be computed without knowing TP, FP, FN, TN individually"},"correct":"B","explanation":{"correct":"- F1 formula: $F1 = 2 \\times \\frac{P \\times R}{P + R} = \\frac{2PR}{P + R}$.\n- Computation: $2 \\times (0.70 \\times 0.85) / (0.70 + 0.85) = 1.19 / 1.55 \\approx 0.768 = 76.8\\%$.\n- F1 is the harmonic mean — it is always ≤ arithmetic mean. It penalizes imbalance between precision and recall more than the arithmetic mean would.","A":"The arithmetic mean (77.5%) is higher than F1 (76.8%). F1 uses the harmonic mean, which is more conservative when precision and recall differ.","B":"","C":"F1 ≠ accuracy in general, even for balanced datasets. They would coincide only in specific cases where both precision and recall equal accuracy.","D":"F1 can be computed from aggregate precision and recall values directly. Individual TP/FP counts are not required if precision and recall are already known."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-016","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T08 · KNN] K=1 is chosen for a KNN classifier. The training accuracy is 100%. 
A data scientist says \"perfect training accuracy means perfect test accuracy.\" What is wrong?","options":{"A":"K=1 always produces 100% accuracy on both training and test — there is no problem","B":"K=1 memorizes the training data — each training point is its own nearest neighbor, so training accuracy is trivially 100%; this is the maximum-variance, zero-training-error extreme of KNN; test accuracy will be lower because the model overfits to noise in training labels; increasing K smooths the decision boundary, reducing variance at the cost of some bias","C":"K=1 is always the best choice because it minimizes training error","D":"The training accuracy of 100% is impossible for K=1 — there is a calculation error"},"correct":"B","explanation":{"correct":"- K=1: for any training point $x_i$, its nearest neighbor is itself (distance=0). Predicted class = $y_i$ = actual class. Training accuracy = 100% by construction.\n- Test points: the nearest training neighbor might belong to a noisy or wrong class. With K=1, there's no smoothing — the prediction is exactly as noisy as the nearest training label.\n- Optimal K: typically selected via cross-validation. K=√n is a common heuristic. Larger K → smoother, more robust boundaries but potentially over-smoothing.","A":"Test accuracy for K=1 is typically lower than training accuracy. Perfect training accuracy does not carry over to test data.","B":"","C":"Minimizing training error is not the goal. The goal is to minimize generalization error on unseen data. K=1 overfits.","D":"100% training accuracy for K=1 is guaranteed (each point is its own nearest neighbor). 
This is not a calculation error — it's the fundamental behavior of K=1."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-017","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T13 · Ensemble Methods] A data scientist creates an ensemble of 5 identical models (same architecture, same hyperparameters, same training data, same random seed). Does this ensemble outperform a single model?","options":{"A":"Yes — any ensemble of 5 models is always better than 1 model","B":"No — identical models produce identical predictions; the majority vote of 5 identical classifiers equals any single classifier's output; there is zero diversity; for an ensemble to improve over a single model, models must disagree on some examples (uncorrelated errors)","C":"The ensemble is better because it averages out random initialization differences","D":"The ensemble is better because 5 models have 5× the capacity of 1 model"},"correct":"B","explanation":{"correct":"- Ensemble benefit: $\\text{Var}(\\bar{X}) = \\rho \\sigma^2 + \\frac{1-\\rho}{B}\\sigma^2$ where $B$ = number of models and $\\rho$ = pairwise inter-model correlation. If $\\rho = 1$ (identical predictions): $\\text{Var}(\\bar{X}) = \\sigma^2$ = single model variance. No improvement.\n- Diversity is required. The same seed, same data, same architecture → $\\rho = 1$ → no variance reduction.\n- Different random seeds would normally introduce a small amount of diversity, but the question specifies the same random seed, so even that source is eliminated.","A":"More models only help when they are diverse (make different mistakes). Identical models provide zero benefit.","B":"","C":"With the same random seed, there are no random initialization differences.","D":"Ensemble prediction is an average/vote, not a capacity increase. 
The meta-prediction uses 1 combined output, not 5× capacity."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-018","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T01 · ML Fundamentals] A model is trained on 2020-2022 customer data and deployed in 2026. Performance has degraded significantly. What is the most likely cause?","options":{"A":"The model's code has bugs that develop over time","B":"Concept drift — the statistical relationship between features and the target has changed; customer behavior, market conditions, or product mix from 2026 differs significantly from 2020-2022; the model's learned patterns no longer apply to current data","C":"The hardware has degraded, producing random prediction errors","D":"The test set from 2026 is smaller than the training set, causing evaluation noise"},"correct":"B","explanation":{"correct":"- Concept drift: $P(y|x)$ changes over time. A model trained on 2020 patterns may not reflect 2026 customer behavior (new demographics, changed preferences, economic shifts).\n- Data drift (covariate shift): $P(x)$ changes — new types of customers appear. The model sees feature combinations outside its training distribution.\n- Solutions: periodic retraining, monitoring input/output distributions, champion-challenger testing, online learning.","A":"Software models don't develop bugs from running over time. Bugs are static unless code is changed.","B":"","C":"Hardware failures produce hard errors, not gradual degradation. Gradual degradation points to distributional causes.","D":"Evaluation noise from small test sets would cause noisy metrics, not systematic degradation. Concept drift causes systematic directional performance decrease."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-019","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T06 · Gradient Boosting] XGBoost is described as an improved gradient boosting framework. 
What is one key algorithmic difference between XGBoost and the original gradient boosting (GBDT)?","options":{"A":"XGBoost uses random forests internally instead of gradient boosting","B":"XGBoost incorporates second-order gradient information (Hessian) in the split-finding criterion; standard GBDT uses only first-order gradients (pseudo-residuals); the second-order Taylor expansion gives XGBoost more accurate leaf weight estimates and split gain calculations, often improving convergence","C":"XGBoost uses decision stumps (depth-1 trees) exclusively, while GBDT uses arbitrary-depth trees","D":"XGBoost eliminates the need for a learning rate hyperparameter"},"correct":"B","explanation":{"correct":"- Standard GBDT: fit each tree to the negative gradient (first-order approximation of the loss).\n- XGBoost: uses second-order Taylor expansion of the loss plus a regularization term: $L \\approx \\sum_i [g_i f_t(x_i) + \\frac{1}{2}h_i f_t^2(x_i)] + \\Omega(f_t)$ where $g_i = \\partial L/\\partial \\hat{y}$ and $h_i = \\partial^2 L / \\partial \\hat{y}^2$.\n- Leaf weights in XGBoost: $w^* = -G_j / (H_j + \\lambda)$ where $G_j, H_j$ are the sums of $g_i$ and $h_i$ over the samples in leaf $j$, and $\\lambda$ is the L2 penalty on leaf weights. More accurate than first-order-only methods.","A":"XGBoost is a boosting framework. It uses gradient boosting with trees as weak learners.","B":"","C":"XGBoost supports arbitrary tree depth (max_depth hyperparameter). Depth-1 stumps (max_depth=1) are one option, not the default (max_depth defaults to 6).","D":"XGBoost still uses a learning rate (eta parameter, default 0.3). It is a critical hyperparameter."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-020","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T04 · Decision Trees] A decision tree split produces a pure node containing 100% class A. 
What is its Gini impurity value?","options":{"A":"Gini impurity = 1.0 — a pure node has maximum impurity","B":"Gini impurity = 0 — a pure node has $p_A = 1, p_B = 0$; $G = 1 - (1^2 + 0^2) = 0$; the ideal endpoint for a decision tree split","C":"Gini impurity = 0.5 — maximum impurity for a binary classification","D":"Gini impurity is undefined for a pure node because log(0) is undefined"},"correct":"B","explanation":{"correct":"- Gini impurity: $G = 1 - \\sum_k p_k^2$. For a pure node with 100% class A: $p_A = 1, p_B = 0$. $G = 1 - (1^2 + 0^2) = 1 - 1 = 0$.\n- Gini = 0 means perfectly pure — no further splitting benefit.\n- Maximum Gini for binary classification: $G = 0.5$ when $p_A = p_B = 0.5$ (completely mixed). $G = 1 - (0.5^2 + 0.5^2) = 1 - 0.5 = 0.5$.","A":"Pure node = Gini 0, not 1. Gini = 1 would mean a node impossible in binary classification (would need all probability outside any class).","B":"","C":"0.5 is the Gini for a maximally mixed (50/50) binary node — the worst case, not the pure case.","D":"Gini impurity does not involve logarithms (that's entropy/information gain). Gini = $1 - \\sum p_k^2$, which is well-defined for $p = 1$."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-021","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T16 · Regularization] A logistic regression model is trained with L1 regularization. After training, 40 of 100 feature weights are exactly 0. 
What does this mean for the model?","options":{"A":"40 features were removed from training data before the model ran","B":"L1 regularization drove 40 feature weights to exactly zero during optimization — these features contribute nothing to predictions; this is automatic feature selection; the remaining 60 features form a sparse model that is more interpretable and computationally efficient at inference time","C":"The model failed to converge — zero weights indicate training errors","D":"Zero weights mean the model is ignoring the regularization term for those features"},"correct":"B","explanation":{"correct":"- L1 sparsity mechanism: the subgradient at $w=0$ for the L1 penalty spans $[-\\lambda, \\lambda]$. When the data gradient is smaller than $\\lambda$ in magnitude, $w=0$ is optimal → exact zero.\n- This automatic feature selection is a key advantage of L1 over L2. The sparse model uses only 60 features at inference time — faster prediction and simpler interpretation.\n- Sparse solutions are valuable in high-dimensional settings where most features are noise.","A":"Feature data is present in training. L1 zeroes out the learned coefficient, not the feature column. The feature is present but given zero importance by the model.","B":"","C":"Zero weights from L1 are not a convergence failure — they are the optimal solution. The training converged to a point where L1 sparsity kicked in for those features.","D":"Zero weights for L1 regularized features are the regularization effect working as intended — the penalty was large enough to drive those weights to zero."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-022","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T10 · PCA] Before applying PCA, a data scientist must standardize the features (zero mean, unit variance). 
Why is this step essential?","options":{"A":"PCA requires non-negative values — standardization ensures all values are positive","B":"PCA finds directions of maximum variance — features with larger absolute scale (e.g., income in dollars vs age in years) dominate the variance; standardization ensures each feature contributes equally to the covariance matrix before PCA finds its principal directions","C":"Standardization is optional — PCA is scale-invariant","D":"Standardization is only needed when features have different units; same-unit features don't need it"},"correct":"B","explanation":{"correct":"- Covariance matrix: $C = \\frac{1}{n}X^TX$. Income variance ($\\sim 10^9$) dominates age variance ($\\sim 200$). PC1 will align almost entirely with income — PCA becomes income-only dimensionality.\n- After standardization: each feature has variance 1. The covariance matrix treats all features equally, and PCA finds the true directions of maximum multivariate variance.\n- Exception: if you deliberately want to give high-variance features more weight (e.g., they are more important by domain knowledge), you could skip standardization — but this should be intentional.","A":"PCA has no non-negativity requirement. Standardized features have negative values (below-mean observations are negative). PCA handles negative values fine.","B":"","C":"PCA is NOT scale-invariant. This is a critical point. Applying PCA to unstandardized data gives results dominated by high-variance features.","D":"Even features with the same units can have very different variances. Variance depends on the value range, not the unit. Standardization is recommended regardless of unit homogeneity."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-023","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T05 · Random Forest] Feature importance in Random Forest is computed as mean decrease in Gini impurity. A feature is listed with importance = 0.0. 
What might cause this?","options":{"A":"The feature is perfectly correlated with the target and thus contributes everything","B":"The feature was never used in any split across all trees — possible if the feature has very low predictive power, or if a correlated feature was always selected first (Random Forest randomly samples features at each split, so a weak feature may never win the competition); importance = 0.0 means the feature added no impurity reduction across all trees","C":"Feature importance = 0.0 is a computation error — all features must contribute something","D":"The feature was removed from the dataset before training"},"correct":"B","explanation":{"correct":"- Feature subsampling at each split: $m$ features are randomly selected per split. A weak feature may rarely be selected. If selected but does not produce a better split than a threshold (min impurity decrease), it won't be used.\n- For a truly irrelevant feature: it may appear in some splits by random chance but provides no impurity reduction → importance sums to near 0.\n- Also common: if a highly predictive feature is in the same random subset as a weaker correlated feature, the strong feature always wins — the weaker feature gets 0 importance despite some predictive value.","A":"High correlation with the target would give very high importance, not 0.","B":"","C":"A feature that never improves any split genuinely has importance 0. This is a valid outcome, not a bug.","D":"If the feature were removed from training data, it wouldn't appear in the importance rankings at all (no entry, not 0 entry)."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-024","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T14 · Model Evaluation] A model's ROC curve passes exactly through the point (0.1, 0.9). 
What does this operating point mean in terms of classification performance?","options":{"A":"The model correctly classifies 90% of all samples at a 10% error rate","B":"At this threshold: TPR = 0.9 (recall = 90%, catches 90% of positives) and FPR = 0.1 (10% of negatives are incorrectly flagged); this is an excellent operating point — high sensitivity with relatively low false positive rate","C":"The model has 90% precision and 10% recall at this threshold","D":"The point (0.1, 0.9) means AUC = 0.1 × 0.9 = 0.09"},"correct":"B","explanation":{"correct":"- ROC coordinates: x-axis = FPR = FP/(FP+TN), y-axis = TPR = TP/(TP+FN).\n- At (FPR=0.1, TPR=0.9): the model correctly identifies 90% of actual positives while misclassifying only 10% of negatives as positive.\n- This is a favorable operating point — it lies in the upper-left region of the ROC space, far above the diagonal (random classifier).","A":"TPR and FPR are not \"overall accuracy\" metrics. They measure performance on positives and negatives separately. Overall accuracy requires knowing the class proportions.","B":"","C":"Precision (PPV = TP/(TP+FP)) is not directly readable from the ROC curve coordinates. ROC plots TPR vs FPR, not precision vs recall.","D":"AUC is the area under the entire ROC curve — not the product of a single point's coordinates."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-025","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T09 · Naive Bayes] A Naive Bayes classifier is compared to logistic regression on a small dataset (n=50). Naive Bayes outperforms logistic regression. On a large dataset (n=50,000), logistic regression outperforms Naive Bayes. 
What principle explains this?","options":{"A":"Logistic regression is always slower, so it needs more data to converge","B":"Naive Bayes reaches its asymptotic error faster ($O(\\log p)$ samples) because its generative assumptions constrain the solution space — helpful with little data; logistic regression needs $O(p)$ samples but achieves lower asymptotic error when its assumptions hold; on large datasets, logistic regression's greater flexibility pays off","C":"Naive Bayes is more accurate on all dataset sizes — the large dataset result indicates an error","D":"The crossover is caused by the dataset size affecting the independence assumption"},"correct":"B","explanation":{"correct":"- Ng & Jordan (2001): generative models (NB) converge faster but to a higher asymptotic error when the generative assumptions are violated (as they almost always are for text).\n- Discriminative models (LR) converge slower (require more data to estimate the decision boundary) but achieve better asymptotic performance because they don't assume a specific data distribution.\n- Practical implication: for very small datasets, NB may be competitive or superior. For large datasets with enough samples to estimate LR parameters well, LR typically wins.","A":"Speed of training is not the explanation. The crossover is about statistical efficiency, not computational efficiency.","B":"","C":"The large-dataset dominance of logistic regression is the expected result from theory and empirical studies. NB's advantage is only at small sample sizes.","D":"The independence assumption is violated regardless of dataset size. It doesn't change with more data — only the estimation of the conditional probabilities improves."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-026","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T03 · Logistic Regression] The C parameter in sklearn's LogisticRegression controls regularization. C=0.001 vs C=1000. 
Which configuration is more likely to overfit?","options":{"A":"C=0.001 — smaller values mean stronger regularization, more overfitting","B":"C=1000 — C is the inverse of regularization strength; large C = weak regularization ($\\lambda = 1/C \\approx 0$); the model is nearly unregularized and can overfit to training noise, especially in high-dimensional feature spaces","C":"Both are equally likely to overfit","D":"C=0.001 — regularization causes overfitting by constraining the model too much"},"correct":"B","explanation":{"correct":"- sklearn convention: $C = 1/\\lambda$. High C → small $\\lambda$ → weak regularization → model can fit training data very closely → overfitting risk.\n- C=0.001: $\\lambda = 1000$ → strong regularization → heavy coefficient shrinkage → underfitting risk.\n- C=1000: $\\lambda = 0.001$ → almost unregularized → may overfit on small/noisy datasets.\n- The \"correct\" C is always data-specific and should be selected via cross-validation.","A":"C=0.001 has strong regularization (λ=1000). Strong regularization prevents overfitting; it may cause underfitting instead.","B":"","C":"They have opposite regularization strengths — they are not equally likely to overfit. The direction of effect is clear.","D":"Regularization doesn't cause overfitting. Regularization prevents overfitting. 
Too much regularization causes underfitting (model too constrained)."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-027","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T07 · SVM] Why does SVM use only the support vectors to define the maximum margin hyperplane, and how many support vectors are there typically?","options":{"A":"SVM uses all training points equally — all points define the hyperplane","B":"The optimization's KKT conditions show that only points on or within the margin have non-zero dual variables (Lagrange multipliers); non-support-vector points are outside the margin and contribute $\\alpha_i = 0$ to the decision function; the number of support vectors is typically small (a few dozen to a few hundred) and depends on the margin width and data complexity","C":"SVM randomly selects a subset of training points as support vectors","D":"The number of support vectors is always exactly equal to the number of features plus one"},"correct":"B","explanation":{"correct":"- SVM dual problem: $\\max_\\alpha \\sum \\alpha_i - \\frac{1}{2}\\sum_{i,j}\\alpha_i \\alpha_j y_i y_j K(x_i, x_j)$ subject to $\\sum \\alpha_i y_i = 0$, $0 \\leq \\alpha_i \\leq C$.\n- KKT complementarity: $\\alpha_i (1 - y_i(w^Tx_i + b)) = 0$. Points outside the margin: $y_i(w^Tx_i + b) > 1 \\Rightarrow \\alpha_i = 0$. Only points on or inside the margin can have $\\alpha_i > 0$.\n- The decision function: $f(x) = \\sum_{i \\in SV} \\alpha_i y_i K(x_i, x) + b$. The sum runs only over support vectors; non-SVs vanish.","A":"Non-support-vector points have $\\alpha_i = 0$ and do not contribute to the decision boundary. They can be removed from training data without changing the learned model.","B":"","C":"Support vectors are determined by the optimization, not random selection. They are the points most relevant to the decision boundary (on or inside the margin).","D":"The number of support vectors depends on the data complexity. For linearly separable data with a wide margin: very few SVs. 
For noisy or complex data: many SVs. No formula links SVs to feature count + 1."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-028","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T17 · Feature Selection] Chi-squared test for feature selection is applied to categorical features. A feature with a very small p-value (0.0001) is considered highly informative. What does the chi-squared test actually measure?","options":{"A":"It measures the correlation between two continuous features","B":"It tests the null hypothesis that the feature and the target are statistically independent; a very small p-value (reject H₀) means the feature's distribution differs significantly across target classes — evidence of association; a large p-value (fail to reject) suggests the feature provides no information about the target","C":"It measures the predictive accuracy if the feature is used alone","D":"Chi-squared test is only valid for regression targets, not classification"},"correct":"B","explanation":{"correct":"- Chi-squared test: compares observed vs expected frequencies in a contingency table. Expected = what you'd see if feature and target were independent.\n- Small p-value: the observed distribution of the feature across target classes is too different to be explained by chance → feature is associated with the target.\n- Limitation: tests marginal association only. Does not capture interactions. Can produce small p-values for features that are associated with the target but not useful conditional on other features.","A":"Chi-squared for feature selection tests categorical vs categorical (feature vs target) association. Correlation (Pearson) is for continuous features.","B":"","C":"Chi-squared measures statistical association, not predictive accuracy. 
A highly associated feature could have a small p-value but add little practical predictive power.","D":"Chi-squared is specifically designed for categorical features and categorical targets (classification). For regression targets, use ANOVA F-test or mutual information."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-029","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T11 · Clustering] Silhouette score = -0.15 is computed for a specific point. What does this negative value indicate?","options":{"A":"The computation has a bug — Silhouette cannot be negative","B":"The point's average distance to its own cluster ($a$) is greater than its average distance to the nearest other cluster ($b$) — the point is closer to a different cluster than its assigned one; $s = (b - a)/\\max(a,b) < 0$ when $a > b$; this indicates the point is likely misclassified into the wrong cluster","C":"The point is an outlier with no natural cluster affiliation","D":"The clustering used K that is too small"},"correct":"B","explanation":{"correct":"- Silhouette score: $s(i) = (b_i - a_i) / \\max(a_i, b_i)$.\n- $a_i$ = mean distance to points in own cluster (cohesion). $b_i$ = mean distance to nearest other cluster (separation).\n- Negative: $b_i < a_i$ → point is closer to a different cluster → wrong assignment. Score = -1 is the worst (point is deep in the wrong cluster). Score = 0 means on the boundary. Score = 1 means perfectly in its cluster.","A":"Silhouette scores range from -1 to +1. Negative values are mathematically valid and indicate poor cluster assignment.","B":"","C":"Negative Silhouette specifically indicates the point belongs better to another cluster. 
A true outlier (equidistant from all clusters) would have Silhouette near 0.","D":"The global K choice affects overall average Silhouette, but an individual point's negative score indicates that specific point is in the wrong cluster regardless of K."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-030","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T12 · Anomaly Detection] An anomaly detector reports 100% recall: it catches every single anomaly. A manager says \"perfect recall means our detection is excellent.\" What important information is missing from this evaluation?","options":{"A":"Recall of 100% is the best possible result — no additional information is needed","B":"100% recall may be trivially achieved by flagging everything as anomalous; in that case, precision = (actual anomaly rate) ≈ 1% → 99% of flags are false positives; high recall without precision is operationally useless; the precision-recall tradeoff must be reported together","C":"Recall is the wrong metric for anomaly detection — only precision matters","D":"100% recall is mathematically impossible for any detector"},"correct":"B","explanation":{"correct":"- Trivial high-recall model: flag every single event as anomalous. TP = all anomalies (100% recall). FP = all normal events. If anomaly rate is 1%: precision = 1% → 99% of alerts are false positives.\n- Operational impact: investigators must examine every event. This eliminates the value of the detector entirely.\n- Balanced evaluation: always report precision AND recall together, or use F1 (equal-cost) or PR-AUC (full tradeoff curve) for anomaly detection.","A":"100% recall says nothing about how many false positives are generated. Without precision, the evaluation is incomplete.","B":"","C":"Both precision and recall matter for anomaly detection. Precision determines investigator workload; recall determines how many anomalies are caught. 
Ignoring either produces a misleading assessment.","D":"100% recall is achievable — flag everything as anomalous. It is trivially achievable and therefore should not be cited as evidence of detector quality without precision."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-031","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T02 · Linear Regression] Homoscedasticity is one of the Gauss-Markov assumptions for OLS to be BLUE (Best Linear Unbiased Estimator). What does homoscedasticity mean?","options":{"A":"Residuals are normally distributed across all fitted values","B":"The variance of the residuals is constant across all fitted values — $\\text{Var}(\\epsilon_i) = \\sigma^2$ for all $i$; when violated (heteroscedasticity), OLS is still unbiased but no longer has minimum variance; some predictions are noisier than others","C":"Features are uncorrelated with each other (no multicollinearity)","D":"The relationship between features and the target is linear"},"correct":"B","explanation":{"correct":"- Homoscedasticity: same variance of errors across all observation levels. On a residual vs fitted plot: residuals should form a horizontal band, not a funnel shape.\n- When violated (heteroscedasticity): OLS is still unbiased but not efficient. Standard errors of coefficients are wrong → confidence intervals and p-values are invalid.\n- Fix: transform the target (log, sqrt), use weighted least squares (WLS), or use heteroscedasticity-robust standard errors (White's correction).","A":"Normality of residuals is a separate assumption (required for hypothesis tests and confidence intervals). Homoscedasticity is specifically about constant variance, not the distribution shape.","B":"","C":"Feature independence (no multicollinearity) is a separate Gauss-Markov condition related to the invertibility of $X^TX$.","D":"Linearity ($E[y] = X\\beta$) is a separate assumption — the relationship between $X$ and $y$ must be linear. 
Linearity, homoscedasticity, and the absence of perfect multicollinearity are distinct Gauss-Markov conditions; normality of residuals is not one of them."}},{"section":"machine-learning","difficulty":"easy","id":"ml-pract-easy-032","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T15 · Bias-Variance] Which of the following model changes will definitely reduce variance without changing bias?","options":{"A":"Increasing the polynomial degree from 3 to 5","B":"Adding more training data (larger dataset for the same model class)","C":"Removing regularization (changing λ from 0.1 to 0)","D":"Adding more features to the model"},"correct":"B","explanation":{"correct":"- More training data → reduces variance: $\\text{Var}(\\hat{f}) \\propto \\sigma^2 / n$. As $n$ increases, variance decreases toward 0. Bias is unchanged (the model class and its average prediction remain the same — only the estimation becomes more stable).\n- A: Higher polynomial degree increases model complexity → increases variance (and decreases bias).\n- C: Removing regularization allows larger weights → increases variance.\n- D: Adding more features can increase variance (more parameters to estimate) and may reduce bias.","A":"Increased complexity → lower bias, higher variance. Not a variance-reduction step.","B":"","C":"Removing regularization reduces the shrinkage constraint → weights can grow larger → higher variance, lower bias.","D":"Adding features adds parameters. Unless regularization compensates, more features → higher variance."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-001","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T01 · ML Fundamentals] The No Free Lunch (NFL) theorem states that all learning algorithms perform equally well when averaged over all possible data-generating distributions. 
A colleague concludes: \"NFL means deep learning is not fundamentally better than a decision tree.\" Is this reasoning correct, and what does NFL actually imply for practitioners?","options":{"A":"Correct — NFL proves no algorithm is universally superior, so deep learning is just a fad","B":"Technically correct but practically misleading — NFL applies to uniform averaging over ALL possible distributions including pathological ones (random label assignments, pure noise); in practice, real-world problems come from a small subset of distributions that have structure (spatial locality, temporal patterns, compositional hierarchy); deep learning's inductive biases (CNNs for spatial structure, transformers for sequence) are precisely designed for these structured distributions; the NFL result is trivially satisfied but uninformative for practical algorithm selection","C":"NFL proves that algorithm selection never matters — always use the simplest model","D":"NFL is a theoretical curiosity that has been proven wrong by deep learning's empirical success"},"correct":"B","explanation":{"correct":"- NFL theorem (Wolpert, 1996): $\\sum_{f} E[L(A_1, f)] = \\sum_f E[L(A_2, f)]$ for any two algorithms $A_1, A_2$. Summed over all possible target functions $f$, all algorithms are equal.\n- The catch: the uniform distribution over $f$ is unrealistic. Nature's problems have structure. Inductive biases (smoothness priors, local connectivity, compositionality) are exploited by specific architectures.\n- Practitioner implication: NFL implies you cannot choose an algorithm without prior assumptions about the problem's structure. Deep learning wins when its inductive biases match the problem structure. It doesn't win on all problems.","A":"Misapplies NFL to practical settings. NFL says nothing about performance on any specific distribution — only about the average over all distributions. 
Deep learning's dominance on structured data is compatible with NFL.","B":"","C":"NFL does not support \"always use the simplest model.\" It says you cannot choose without assumptions — which implies domain knowledge should guide selection.","D":"Deep learning's success is consistent with NFL. NFL is not disproven — it's a theorem. The success occurs because real problems have structure not covered by NFL's uniform distribution."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-002","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T02 · Linear Regression] The Frisch-Waugh-Lovell (FWL) theorem states that the OLS coefficient for $X_1$ in the regression $y \\sim X_1 + X_2$ equals the OLS coefficient from the regression of $M_{X_2}y$ on $M_{X_2}X_1$, where $M_{X_2} = I - X_2(X_2^TX_2)^{-1}X_2^T$ is the annihilator matrix. What does the FWL theorem mean for interpreting coefficients in multiple regression?","options":{"A":"FWL means all regression coefficients are independent of each other","B":"FWL means the coefficient $\\hat{\\beta}_1$ measures the effect of $X_1$ on $y$ after $X_2$ has been \"partialled out\" of both — it is the effect of the component of $X_1$ that is orthogonal to $X_2$; this formalizes the \"ceteris paribus\" (all else equal) interpretation; adding or removing $X_2$ changes $\\hat{\\beta}_1$ precisely when $X_1$ is correlated with $X_2$ (omitted variable bias); FWL explains why multicollinearity inflates standard errors: $M_{X_2}X_1$ has low variance when $X_1$ and $X_2$ are highly correlated","C":"FWL proves that $\\hat{\\beta}_1$ is identical whether or not $X_2$ is included in the model","D":"FWL is only valid for orthogonal feature matrices ($X_1^TX_2 = 0$)"},"correct":"B","explanation":{"correct":"- Geometric interpretation: $M_{X_2}$ projects out the $X_2$ subspace. 
$M_{X_2}X_1$ is the residual of $X_1$ after regressing on $X_2$ — the variation in $X_1$ unexplained by $X_2$.\n- Omitted variable bias: if $X_2$ is omitted, $\\hat{\\beta}_1$ absorbs both the effect of $X_1$ AND the effect of $X_2$ on $y$ mediated through $X_1$'s correlation with $X_2$.\n- Multicollinearity: high correlation between $X_1$ and $X_2$ → $M_{X_2}X_1 \\approx 0$ (very small) → $\\text{Var}(\\hat{\\beta}_1) = \\sigma^2 / ||M_{X_2}X_1||^2 \\to \\infty$ → inflated standard errors.","A":"FWL proves the opposite — coefficients are dependent on what other variables are in the model. The \"ceteris paribus\" effect is conditional on other regressors.","B":"","C":"$\\hat{\\beta}_1$ changes when $X_2$ is included if $X_1$ and $X_2$ are correlated. FWL explains precisely how $\\hat{\\beta}_1$ changes.","D":"FWL applies generally, not just for orthogonal features. The matrix $M_{X_2}$ projects onto the orthogonal complement of the column space of $X_2$ regardless of feature correlation."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-003","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T03 · Logistic Regression] Maximum likelihood estimation for logistic regression minimizes the cross-entropy loss. 
Show why cross-entropy loss and negative log-likelihood of the Bernoulli distribution are equivalent, and what this implies about the model's implicit distributional assumption.","options":{"A":"They are unrelated — cross-entropy and log-likelihood are different optimization objectives","B":"For binary classification, MLE assumes $y_i \\sim \\text{Bernoulli}(\\hat{p}_i)$ where $\\hat{p}_i = \\sigma(w^Tx_i)$; the log-likelihood: $\\ell = \\sum_i [y_i \\log(\\hat{p}_i) + (1-y_i)\\log(1-\\hat{p}_i)]$; maximizing this is exactly minimizing the cross-entropy loss; this implies logistic regression is the max-entropy model for binary outcomes with linear sufficient statistics — it makes no assumptions beyond the feature-outcome relationship; training with L2 loss would be inappropriate because it assumes Gaussian noise, which does not fit binary labels","C":"They are equivalent but only for balanced classes","D":"Cross-entropy and log-likelihood are equivalent, which proves that logistic regression is unbiased for any classification problem"},"correct":"B","explanation":{"correct":"- Bernoulli MLE: $L(w) = \\prod_i \\hat{p}_i^{y_i}(1-\\hat{p}_i)^{1-y_i}$. Taking log: $\\ell = \\sum_i [y_i \\log \\hat{p}_i + (1-y_i)\\log(1-\\hat{p}_i)]$. Negating to get a loss: $-\\ell = -\\sum_i [y_i \\log \\hat{p}_i + (1-y_i)\\log(1-\\hat{p}_i)]$ = cross-entropy.\n- Max-entropy interpretation: among all distributions consistent with linear constraints on $E[x_j y]$, logistic regression gives the maximum entropy distribution over $y|x$ — it makes minimal additional assumptions.\n- Practical implication: using MSE for binary classification assumes a Gaussian error model, which is wrong for $y \\in \\{0,1\\}$. This explains why MSE-trained models have saturated gradient problems in binary classification.","A":"They are mathematically identical — one is maximized and the other minimized, but the same $w^*$ satisfies both.","B":"","C":"The equivalence holds regardless of class balance. 
Class balance affects the optimal threshold, not the loss function's validity.","D":"MLE consistency (unbiasedness asymptotically) applies when the model is correctly specified. If the true relationship is nonlinear in $x$, logistic regression is biased regardless of the loss function derivation."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-004","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T04 · Decision Trees] CART tree training is NP-hard in general, but greedy top-down induction is used in practice. Explain why the greedy approach does not find the globally optimal tree, and describe one scenario where the greedy approach is demonstrably suboptimal.","options":{"A":"Greedy CART always finds the global optimum for sufficiently large datasets","B":"Greedy CART makes locally optimal splits at each node without backtracking — the split that maximizes impurity reduction at depth 1 is chosen without considering what splits become available at depth 2+; classic scenario: feature A has moderate Gini reduction but enables a perfect depth-2 split; feature B has higher immediate Gini reduction but leads to a dead end; greedy selects B (better immediate gain) and misses the globally better A-then-split tree; this is the XOR problem — neither $X_1$ nor $X_2$ alone separates XOR labels but their combination does; a greedy impurity-based split finds neither feature useful alone and cannot construct the optimal tree","C":"Greedy CART is globally optimal because it evaluates all possible trees","D":"Greedy approaches are globally optimal for tree structures due to the principle of optimality"},"correct":"B","explanation":{"correct":"- Optimality condition for greedy: Bellman's principle of optimality applies when subproblems are independent. Tree splits are NOT independent — the data subset reaching a child node depends on the parent split.\n- XOR example: $y = X_1 \\oplus X_2$ (XOR). 
$I(Y; X_1) = I(Y; X_2) = 0$ (each feature alone is statistically independent of the label). The impurity reduction at depth 1 is 0 for both features, so no single split improves impurity. The correct depth-2 tree uses both features, but greedy cannot discover this.\n- Alternative: look-ahead strategies, beam search, or random forest's randomization implicitly explore non-greedy splits.","A":"Greedy is demonstrably suboptimal for the XOR problem and many other interaction-based problems.","B":"","C":"Greedy evaluates one level at a time. It does not evaluate all possible trees. The number of possible binary trees grows exponentially — it is NP-hard to find the global optimum.","D":"Bellman's optimality applies to problems where optimal substructure holds (each subproblem's solution contributes independently). Tree splits violate this because data routing to subtrees depends on ancestor splits."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-005","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T05 · Random Forest] Impurity-based feature importance in Random Forest is known to be biased toward high-cardinality features. 
Explain the mechanism and describe the permutation importance method as an unbiased alternative.","options":{"A":"Impurity importance is unbiased — high-cardinality features are correctly ranked higher because they provide more information","B":"Impurity importance bias: a feature with many unique values (continuous or high-cardinality categorical) has more candidate thresholds; more thresholds → higher probability of finding a split that reduces impurity by chance → inflated importance; especially visible when comparing a continuous feature (1000 thresholds) vs binary feature (1 threshold); permutation importance: for each feature, randomly shuffle its values in the validation set and measure accuracy drop; a large drop indicates the feature was important; permutation operates on held-out data with the trained model, not during tree-building, so it avoids the threshold-count bias","C":"Impurity importance is biased only for categorical features with fewer than 10 categories","D":"Permutation importance is biased in the opposite direction — it underestimates all feature importances"},"correct":"B","explanation":{"correct":"- Impurity bias mechanism: at each split, CART searches all thresholds of all candidate features. A feature with 100 thresholds has 100 chances to find a good split by chance. A binary feature has 1 chance. The expected maximum impurity reduction scales with the number of thresholds.\n- Mathematical consequence: $E[\\max_{t \\in T_j} \\Delta G_j]$ increases with $|T_j|$ (number of thresholds). Features with more thresholds are systematically favored even when truly uninformative.\n- Permutation importance: evaluates the model on validation data. Shuffle feature $j$, observe $\\Delta \\text{acc}$. This measures actual predictive contribution, not split-time convenience. Unaffected by cardinality.","A":"The bias has been empirically documented and mathematically explained. 
High-cardinality random features score higher than low-cardinality informative features in impurity importance.","B":"","C":"The bias affects all features proportionally to their number of candidate splits. Continuous features (many real-valued thresholds) are most affected, but any feature with more thresholds is biased upward.","D":"Permutation importance can underestimate importance for correlated features (when one correlated feature is shuffled, the model uses the other correlated feature to compensate). But it is not systematically biased toward underestimation across all features."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-006","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T06 · Gradient Boosting] Newton boosting (second-order gradient boosting, as in XGBoost) uses both gradients and Hessians. Explain why using the Hessian improves convergence compared to first-order gradient boosting, using the analogy of Newton's method vs gradient descent.","options":{"A":"The Hessian is only useful for smooth loss functions — gradient boosting doesn't need it","B":"Newton's method vs gradient descent: gradient descent takes step proportional to $-g$ (gradient); Newton's method takes step $-H^{-1}g$ (Hessian-adjusted); near optima, the Hessian captures curvature: in flat directions a large step is safe, while in steep directions a large step overshoots; applying this to boosting: each tree's leaf weights in XGBoost are $w^*_j = -G_j/(H_j + \\lambda)$ where $G = \\sum g_i$ and $H = \\sum h_i$; regions with high curvature (high $h_i$) get smaller leaf weight corrections; this prevents overshooting and allows larger effective learning rates while maintaining stability; convergence requires fewer trees","C":"Hessian use is a computational trick that reduces memory, not a convergence improvement","D":"Newton boosting and gradient boosting converge to different optima — Hessian use changes the 
solution"},"correct":"B","explanation":{"correct":"- Gradient: $g_i = \\partial L / \\partial \\hat{y}_i$. Hessian: $h_i = \\partial^2 L / \\partial \\hat{y}_i^2$.\n- For log-loss: $g_i = \\hat{p}_i - y_i$, $h_i = \\hat{p}_i(1-\\hat{p}_i)$. Near $\\hat{p} = 0.5$ (high uncertainty): $h = 0.25$ (the largest possible weight). Near $\\hat{p} = 0$ or $1$ (high confidence): $h \\approx 0$ (tiny weight). This prevents large corrections for high-confidence predictions.\n- Convergence: XGBoost typically needs fewer trees than GBDT (fewer boosting rounds) to achieve the same loss, especially for well-separated data points.","A":"The Hessian provides additional curvature information. For smooth, well-behaved losses (log-loss, MSE), the Hessian is well-defined and beneficial. MSE Hessian is constant (2), so Newton ≈ gradient for MSE.","B":"","C":"Hessian computation adds memory and computation overhead. The motivation is faster convergence, not memory reduction.","D":"Both approaches converge to the same optimal tree ensemble for the same loss. The Hessian adjustment speeds up the path to the optimum, not the destination."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-007","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T07 · SVM] Mercer's theorem states that a function $K(x, z)$ is a valid kernel if and only if it is a symmetric positive semi-definite function. 
Why is this condition necessary for the kernel trick to work, and give an example of a function that looks like a kernel but is not one.","options":{"A":"Mercer's condition is just a mathematical formality — any similarity function can be used as a kernel","B":"The kernel trick requires that $K(x,z) = \\langle \\phi(x), \\phi(z) \\rangle$ for some feature map $\\phi$; a valid inner product in any Hilbert space must produce a positive semi-definite Gram matrix $K_{ij} = K(x_i, x_j)$; if $K$ is not PSD, no valid $\\phi$ exists — you'd be optimizing in a non-Euclidean space where the SVM dual problem may be non-convex (indefinite quadratic program) with no guaranteed global minimum; example: $K(x,z) = -||x-z||$ (negative distance) can produce indefinite Gram matrices — not a valid kernel","C":"Mercer's condition guarantees that the kernel produces linearly separable data in the feature space","D":"Mercer's condition only applies to polynomial kernels — RBF kernels don't need it"},"correct":"B","explanation":{"correct":"- SVM dual: $\\max_\\alpha \\sum \\alpha_i - \\frac{1}{2}\\alpha^T K \\alpha$ where $K_{ij} = K(x_i, x_j)$. This is a quadratic program. For it to be convex (and have a global maximum), $K$ must be positive semi-definite.\n- Non-PSD kernel: $-\\frac{1}{2}\\alpha^T K \\alpha$ may be non-convex → saddle points, no guarantee of finding a global optimum → SVM training fails or produces meaningless solutions.\n- Common valid kernels: linear, polynomial (with $c \\geq 0$, $d$ integer), RBF (PSD by Bochner's theorem), Laplacian, sigmoid (PSD only for specific parameter ranges).","A":"Using a non-PSD function as a kernel breaks the convex optimization guarantee. The SVM dual solver (SMO) may not converge or converges to a saddle point.","B":"","C":"Mercer's condition guarantees a valid inner product space exists, not that data is linearly separable in that space. 
Separability depends on data distribution and the specific kernel choice.","D":"Mercer's condition applies to ALL kernels. RBF satisfies it (proven via Bochner's theorem on positive definite functions). The condition is universal."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-008","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T08 · KNN] Approximate nearest neighbor (ANN) methods like HNSW (Hierarchical Navigable Small World graphs) trade exactness for speed. Explain the HNSW indexing structure and why it achieves $O(\\log n)$ search complexity.","options":{"A":"HNSW is just a better-sorted array — it achieves log(n) via binary search","B":"HNSW builds a multi-layer proximity graph; the top layer is a sparse long-range graph (few nodes, long edges, fast approximate navigation); lower layers add increasing density with shorter edges; at query time: start at the top layer, greedily navigate to the nearest node, descend to the next layer, repeat; there are about $O(\\log n)$ layers and each needs only a few greedy hops, so a query visits on the order of $\\log n$ nodes instead of all $n$ points; the long-range edges at top layers allow skipping large portions of the space; recall (approximate accuracy) is tunable via `ef` (exploration factor) parameter: higher `ef` = more neighbors explored = higher recall, more computation","C":"HNSW achieves O(log n) by pre-sorting points along a Hilbert curve","D":"HNSW is only efficient for Euclidean distance — cosine or Hamming distances require exact search"},"correct":"B","explanation":{"correct":"- Graph structure analogy: \"six degrees of separation\" — a small-world graph. From any node, you can reach any other node in $O(\\log n)$ hops via long-range connections (shortcuts).\n- Layer hierarchy: inspired by skip lists. Layer 0 has all $n$ nodes. Layer 1 has a random subset, layer 2 a further subset, etc. 
Long edges at high layers enable fast long-distance jumps.\n- Query: greedy best-first search from the entry point (typically the centroid or a random high-layer node). At each layer, move to the nearest neighbor among connected nodes.","A":"HNSW is a graph structure, not a sorted array. Binary search requires a linear sorted order, which doesn't generalize to high-dimensional spaces.","B":"","C":"Hilbert curve (space-filling curve) indexing gives $O(\\log n)$ for 1D projections but struggles in high dimensions due to the curse of dimensionality. HNSW doesn't use Hilbert curves.","D":"HNSW works with any distance metric for which a greedy graph traversal converges to the approximate nearest neighbor. Cosine, Euclidean, and inner product are all supported in faiss and hnswlib."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-009","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T09 · Naive Bayes] Domingos & Pazzani (1997) showed that Naive Bayes can be an optimal classifier even when the independence assumption is violated, under certain conditions. 
What are those conditions, and why does NB sometimes achieve near-optimal accuracy despite being a biased probability estimator?","options":{"A":"NB achieves optimality only when features are truly independent","B":"Key condition: NB needs only to rank classes correctly at the decision boundary, not to estimate probabilities accurately; even with strongly dependent features, NB's classification rule $\\hat{y} = \\arg\\max_c P(c)\\prod_i P(x_i|c)$ may produce the correct argmax even when the absolute probability values are wrong; the dependencies can be \"benign\" (they don't change the argmax ordering); practically: NB calibration is poor but decision accuracy is often competitive; optimal condition: the feature dependencies are symmetric across classes (affect all classes equally) — they distort all class scores proportionally, preserving the argmax","C":"Domingos & Pazzani proved NB is optimal whenever it's competitive with Bayes optimal accuracy","D":"NB is optimal when using additive Laplace smoothing"},"correct":"B","explanation":{"correct":"- Classification accuracy vs probability estimation: NB needs $\\arg\\max_c$ to be correct, not $P(c|x)$ to be calibrated. These are weaker conditions.\n- Benign dependencies: if $x_1$ and $x_2$ are correlated given each class, but the correlation pattern is similar across classes, the log-ratio $\\log[P(c_1|x)/P(c_2|x)]$ may still have the right sign even though each individual $P(c|x)$ is wrong.\n- Empirical evidence: NB is competitive with logistic regression on text classification despite clear word co-occurrence dependencies (words like \"not\" and \"bad\" co-occur in sentiment analysis).","A":"This would make the result trivial. The insight is that NB can work DESPITE violated independence — this is what makes it practically useful.","B":"","C":"The question is about when NB is optimal, not circular. 
The conditions are about the benign symmetry of dependencies across classes.","D":"Laplace smoothing prevents zero probabilities but does not affect the independence assumption or the classification decision structure."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-010","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T10 · PCA] The Johnson-Lindenstrauss (JL) lemma states that any $n$ points in high-dimensional space can be embedded into $O(\\log n / \\epsilon^2)$ dimensions while preserving pairwise distances within a factor of $(1 \\pm \\epsilon)$. How does this compare to PCA, and when would you use JL random projections instead?","options":{"A":"JL and PCA are identical — both reduce to log(n) dimensions","B":"JL: dimensionality depends only on $n$ (number of points), not on original dimension $d$; the projection is a random matrix (Gaussian or ±1 entries); no training required; computational complexity $O(ndk)$ for a dense projection of all $n$ points to $k$ dimensions; PCA: dimensionality is data-driven (eigenvectors of covariance); captures maximum-variance directions; requires $O(n d^2)$ computation; JL preferred when: $n$ is small (log(n) << PCA's $k$), data has no dominant low-rank structure, fast sketching is needed for streaming/one-pass; PCA preferred when: data has strong low-rank structure, interpretable directions needed, intrinsic dimensionality < log(n)","C":"JL projections always outperform PCA for dimensionality reduction","D":"JL lemma is only applicable to nearest-neighbor search, not general dimensionality reduction"},"correct":"B","explanation":{"correct":"- JL target dimension: $k = O(\\log n / \\epsilon^2)$. Ignoring constant factors, for $n=1000$ points with $\\epsilon=0.1$: $k \\approx \\ln(1000)/\\epsilon^2 \\approx 6.9/0.01 \\approx 690$. For $n = 10^6$: $k \\approx 13.8/0.01 \\approx 1400$. Independent of original $d$.\n- Comparison: if PCA's effective rank is small (10-50 dimensions capture 99% variance), PCA produces a much lower-dimensional embedding than JL. 
JL's log(n) guarantee can be large.\n- JL's power: the projection is data-agnostic. You don't need to see all the data to compute the projection matrix — useful for streaming, privacy-preserving learning (RAPPOR), and randomized linear algebra.","A":"JL and PCA have very different properties. JL is random and data-independent; PCA is deterministic and data-adaptive. Their target dimensions can differ by orders of magnitude.","B":"","C":"Neither consistently outperforms the other. PCA is better when data has low intrinsic rank; JL is better for fast, data-agnostic sketching.","D":"JL is used in compressed sensing, sketch-and-solve linear regression, privacy-preserving ML, and fast matrix multiplication. It is not limited to nearest-neighbor search."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-011","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T11 · Clustering] Spectral clustering converts the clustering problem into a graph partitioning problem using eigendecomposition of the graph Laplacian. 
Why can spectral clustering find clusters that K-means cannot, and what is the computational bottleneck?","options":{"A":"Spectral clustering is just K-means applied to normalized features","B":"Spectral clustering can find non-convex, arbitrarily-shaped clusters because it uses graph connectivity rather than Euclidean distance to centroids; K-means partitions Voronoi cells (convex regions); two interlocking rings would be split incorrectly by K-means but separated by spectral clustering because points on different rings are not connected in the proximity graph; bottleneck: eigendecomposition of the $n \\times n$ graph Laplacian costs $O(n^3)$ — prohibitive for large datasets; approximate methods (Nyström approximation, landmark-based) scale to $O(n \\cdot k^2)$","C":"Spectral clustering finds only linear cluster boundaries, like K-means","D":"The computational bottleneck is the k-means step at the end, not the eigendecomposition"},"correct":"B","explanation":{"correct":"- Graph Laplacian: $L = D - W$ where $W_{ij}$ is the edge weight (similarity between $x_i, x_j$) and $D_{ii} = \\sum_j W_{ij}$. The eigenvectors of $L$ encode the graph's cluster structure.\n- Non-convex clusters: K-means computes $||x - \\mu_k||^2$ — points near centroid 1 are in cluster 1 regardless of topology. Spectral clustering's affinity is local (RBF kernel with small bandwidth) → connectivity follows the manifold, not Euclidean ball.\n- Two-moons, concentric rings: standard benchmark where spectral succeeds and K-means fails.","A":"Spectral clustering uses K-means as a final step (on the eigenvectors), but the core mechanism is graph Laplacian eigendecomposition. The clustering is fundamentally different.","B":"","C":"The final K-means step on eigenvectors is linear in the eigenvector space, but the eigenvectors themselves encode non-linear structure. The combined effect finds non-convex clusters in original space.","D":"K-means is applied to the low-dimensional eigenvector embedding (an $n \\times k$ matrix). 
This is fast. The bottleneck is the $n \\times n$ eigendecomposition, which grows cubically."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-012","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T12 · Anomaly Detection] Conformal prediction provides distribution-free anomaly scores with statistical guarantees. Explain how conformal prediction differs from threshold-based anomaly detection, and what guarantee it provides.","options":{"A":"Conformal prediction is a synonym for threshold-based anomaly detection","B":"Threshold-based: pick score threshold $\\tau$; if anomaly score > $\\tau$, flag as anomaly; no statistical guarantee on false positive rate for new distributions; threshold selection is ad hoc; conformal prediction: use a calibration set of normal examples; compute non-conformity score $s_i$ for each calibration example; for a test point $x$, compute $p\\text{-value} = (1 + |\\{i : s_i \\geq s(x)\\}|) / (n_{\\text{cal}} + 1)$, where the $+1$ terms count the test point itself and make the bound exact in finite samples; flag $x$ as anomaly if $p < \\alpha$; guarantee: false positive rate $\\leq \\alpha$ regardless of the data distribution (assuming exchangeability); this is a distribution-free coverage guarantee","C":"Conformal prediction requires knowing the true data distribution, making it impractical","D":"Conformal prediction only works for classification, not anomaly detection"},"correct":"B","explanation":{"correct":"- Exchangeability assumption: the calibration set and test points are exchangeable (weaker than i.i.d.). Under this assumption, the p-value is uniformly distributed for normal points → FPR is controlled.\n- Practical advantage: no need to choose a threshold by intuition. Choose $\\alpha = 0.05$ → at most 5% of normal points are falsely flagged, regardless of the underlying distribution.\n- Non-conformity scores: can be based on any anomaly scoring function (isolation forest score, reconstruction error, local outlier factor). 
Conformal provides the wrapper.","A":"Threshold-based methods have no statistical guarantee. Conformal prediction provides a rigorous FPR bound. They are fundamentally different.","B":"","C":"Conformal prediction is explicitly distribution-free. It requires only exchangeability, which is weaker than distributional assumptions.","D":"Conformal prediction is a general framework applicable to any prediction problem: classification, regression, and anomaly detection. It was originally developed for classification but the framework is general."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-013","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T13 · Ensemble Methods] Negative correlation learning (NCL) explicitly promotes diversity in neural network ensembles by adding a penalty term to each network's loss function that discourages agreement with the ensemble's current prediction. What is the NCL penalty term, and why might forcing diversity hurt individual model quality?","options":{"A":"NCL is equivalent to standard ensembling — it adds no explicit diversity penalty","B":"NCL penalty (added to each network's loss): $\\Omega_i = \\lambda \\sum_t (F_i(x^t) - \\bar{F}(x^t)) \\sum_{j \\neq i}(F_j(x^t) - \\bar{F}(x^t))$; this penalizes $F_i$ for deviating from $\\bar{F}$ in the same direction as the other networks; individual quality cost: forcing each network to differ from the mean prediction may push individual networks toward suboptimal solutions — the ensemble mean may be accurate but individual models are constrained to be \"complementary\" (each covers the other's weaknesses); extreme NCL ($\\lambda$ too large) causes anti-correlated predictions that individually perform poorly; the ensemble still averages well but individual validation accuracy is artificially suppressed","C":"NCL only improves performance with $\\lambda > 1$","D":"NCL is only applicable to regression problems"},"correct":"B","explanation":{"correct":"- NCL derivation: Liu & Yao (1999). 
The penalty measures the correlation between $F_i$'s deviation from the ensemble mean and the other models' deviations. Minimizing it drives that correlation negative (i.e., makes $F_i$ deviate opposite to the others when the mean is wrong).\n- Bias-variance-covariance tradeoff at ensemble level: ensemble error decomposes into bias², variance, and covariance terms. NCL explicitly reduces covariance at the cost of potentially increasing individual variance/bias.\n- Practical use: $\\lambda$ controls the trade-off between individual accuracy and ensemble diversity. Small $\\lambda$ ≈ independent training; large $\\lambda$ = forced diversity; too large → individual collapse.","A":"NCL adds an explicit, mathematical diversity penalty to each model's training objective. It is fundamentally different from independent model training.","B":"","C":"$\\lambda$ is a continuous hyperparameter. Benefits appear at moderate values; harm at extreme values. No threshold at 1.","D":"NCL was originally applied to regression but works for classification with appropriate loss and output formulation. It is general to neural ensemble learning."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-014","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T14 · Model Evaluation] The Brier score can be decomposed into three components: reliability (calibration), resolution, and uncertainty. 
Explain what each component measures and how a model with perfect resolution but poor reliability should be fixed.","options":{"A":"Brier score cannot be decomposed — it is a single atomic metric","B":"Uncertainty: $\\bar{o}(1-\\bar{o})$ — the inherent difficulty of the task; irreducible; depends only on the base rate $\\bar{o}$; Resolution: measures how much predicted probabilities vary across different groups of events — how well the model distinguishes between events that occur and those that don't (forecasting value); Reliability: calibration error — mean squared difference between predicted probabilities and observed frequencies; a model with perfect resolution (correctly orders outcomes by predicted probability) but poor reliability (probabilities are miscalibrated, e.g., always outputs 0.6 when truth is 0.9) should be fixed via calibration (Platt scaling, isotonic regression) without retraining the base model","C":"Resolution measures model accuracy; reliability measures speed; uncertainty measures data quality","D":"Only reliability matters for Brier score optimization; resolution and uncertainty are academic"},"correct":"B","explanation":{"correct":"- Murphy decomposition: $\\text{BS} = \\text{REL} - \\text{RES} + \\text{UNC}$. Better model → lower BS → lower REL, higher RES (RES enters with a negative sign, so higher resolution lowers BS).\n- Reliability: $\\frac{1}{N}\\sum_{k=1}^{K} n_k(\\bar{f}_k - \\bar{o}_k)^2$, where $N$ is the number of forecasts and $n_k$ the count in bin $k$. Binned calibration error: do events with predicted probability 0.7 actually occur 70% of the time?\n- Resolution: $\\frac{1}{N}\\sum_{k=1}^{K} n_k(\\bar{o}_k - \\bar{o})^2$. How much do outcomes differ from the base rate across forecast bins?\n- Fix: Platt scaling or isotonic regression post-hoc calibration preserves ranking (resolution) while adjusting probability values (improving reliability).","A":"The decomposition is a well-established result (DeGroot & Fienberg, 1983; Murphy, 1973). 
It is standard in meteorological forecasting and ML model evaluation.","B":"","C":"These definitions are incorrect. Resolution is discriminative power, reliability is calibration quality, uncertainty is base rate difficulty.","D":"All three components contribute to Brier score. Resolution is particularly important in decision-theoretic applications where predictions are used for threshold-based decisions."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-015","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T15 · Bias-Variance] The \"double descent\" phenomenon shows that test error can decrease, then increase, then decrease again as model complexity increases. Where does the second descent occur, and what breaks the classical bias-variance tradeoff picture?","options":{"A":"Double descent is a theoretical curiosity that doesn't occur in practice","B":"Classical U-shaped test error: underfitting → optimal → overfitting. Double descent adds a second descent at the \"interpolation threshold\" (model complexity equals n, the training set size); at this threshold, the model barely interpolates training data — test error spikes; beyond this threshold (heavily overparameterized regime), implicit regularization from gradient descent or minimum-norm solutions picks the \"smoothest\" interpolating function; the classical analysis assumes fixed noise — in the overparameterized regime, the minimum-norm interpolant has low variance despite exactly fitting training data","C":"Double descent occurs only for neural networks trained without any regularization","D":"Double descent means models should always be maximally overparameterized"},"correct":"B","explanation":{"correct":"- Interpolation threshold: at $p = n$ (parameters = samples), the model is at the boundary. 
There is essentially a single interpolating solution, and it must fit the label noise exactly, so its variance peaks.\n- Overparameterized regime ($p \\gg n$): many interpolating solutions exist. For linear models, gradient descent started from zero (or near-zero) initialization converges to the minimum-norm solution, which has a smoothness bias. This is implicit regularization.\n- Discovered empirically: Belkin et al. (2019) showed double descent for kernels, random forests, and neural networks. It challenges the single-valley bias-variance picture.","A":"Double descent has been empirically demonstrated for linear regression (random features), random forests, and neural networks. Multiple reproducible papers have confirmed it.","B":"","C":"Double descent occurs for linear models with random features, kernel methods, and trees, not just neural networks. Regularization smooths but doesn't eliminate the phenomenon.","D":"\"Always maximize overparameterization\" ignores computational cost and the risk that heavy overparameterization can still overfit without sufficient implicit regularization (e.g., with noisy labels)."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-016","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T16 · Regularization] Group Lasso extends L1 regularization to penalize groups of features jointly. The penalty is $\\sum_{g=1}^G ||w_g||_2$ (sum of L2 norms of weight groups). 
How does this differ from standard Lasso, and when should Group Lasso be preferred?","options":{"A":"Group Lasso is identical to standard Lasso applied to grouped features","B":"Standard Lasso: individual sparsity — each $w_j$ can be zero independently; $\\sum_j |w_j|$; Group Lasso: group sparsity — all weights in a group are zeroed together or all kept; penalty $\\sum_g ||w_g||_2$ is L1 between groups (sparse) and L2 within groups (non-sparse within selected groups); use when features form natural groups and group-level decisions are desired: one-hot encoded categorical features (select/drop the entire category), gene pathways (either the whole pathway is relevant or not), time series lag groups (include lags 1-5 together)","C":"Group Lasso penalizes each feature group with L1, making all features within groups individually sparse","D":"Group Lasso is only applicable to neural networks, not linear models"},"correct":"B","explanation":{"correct":"- Penalty structure: $\\Omega(w) = \\sum_{g=1}^G \\sqrt{|g|} \\cdot ||w_g||_2$ (weighted version). The $\\sqrt{|g|}$ factor normalizes for group size.\n- Geometry: standard Lasso's $||w||_1$ ball has corners at coordinate axes → sparsity. Group Lasso's group norm ball has ridges along group subspaces → group-level corners → one entire group goes to zero while others remain non-zero.\n- Application: one-hot encoding of \"city\" (1000 features) — Group Lasso either selects city as a feature (all 1000 non-zero) or removes it entirely. Standard Lasso might zero some city dummies but not others, producing an incoherent partial selection.","A":"Standard Lasso allows individual sparsity (any single feature can be zero). Group Lasso enforces that all features in a group are zeroed together. The sparsity structures are fundamentally different.","B":"","C":"L2 norm within groups means features within a selected group are NOT individually zeroed. 
The L2 norm shrinks the group uniformly but doesn't create within-group sparsity (that's Sparse Group Lasso, which combines both).","D":"Group Lasso was originally developed for linear models (Yuan & Lin, 2006). It is applicable to any model with structured weight groups."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-017","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T17 · Feature Selection] The Markov Blanket of a target variable $Y$ in a Bayesian network is the minimal set of features that renders $Y$ conditionally independent of all other variables. Explain why the Markov Blanket is the theoretically optimal feature set for predicting $Y$, and why finding it is computationally challenging.","options":{"A":"The Markov Blanket is equivalent to the set of features most correlated with Y","B":"Optimality: conditioning on the Markov Blanket $\\text{MB}(Y)$ makes $Y$ independent of all other variables — no information about $Y$ can be gained from features outside $\\text{MB}(Y)$ given $\\text{MB}(Y)$; formally: $Y \\perp X \\setminus \\text{MB}(Y) | \\text{MB}(Y)$; this means: adding any feature outside $\\text{MB}(Y)$ to the model cannot improve predictive accuracy; it is the smallest sufficient feature set; computational challenge: finding the MB requires testing conditional independence for all feature subsets — exponential search space; IAMB (Incremental Association Markov Blanket) and MMMB algorithms use forward-backward heuristics to find MB in $O(p^2)$ tests approximately","C":"Markov Blanket is a concept from graph theory with no connection to feature selection for prediction","D":"The Markov Blanket is the complete set of all features that are directly connected to Y in any network"},"correct":"B","explanation":{"correct":"- MB composition: parents of $Y$ (direct causes) + children of $Y$ (direct effects) + other parents of $Y$'s children (co-parents/spouses). 
All three sets carry information about $Y$ that isn't captured by other MB members.\n- Independence property: $P(Y | X) = P(Y | \\text{MB}(Y))$. This is a consequence of the d-separation criterion in Bayesian networks.\n- Hardness: the Bayesian network structure is unknown. Testing all conditional independencies requires exponentially many statistical tests (or exponential search over structures). IAMB uses a greedy forward phase (add features that significantly reduce entropy given current MB) and a backward phase (remove features that become redundant).","A":"Correlation (marginal association) does not define the MB. A feature can be correlated with $Y$ but excluded from MB (if it's conditionally independent given MB). A feature may be uncorrelated with $Y$ marginally but part of MB (through indirect paths or interactions).","B":"","C":"Markov Blankets directly define the optimal feature set for prediction under the Bayesian network model. They are used in MRMR (Minimum Redundancy Maximum Relevance) algorithms and causal feature selection.","D":"MB includes parents, children, AND co-parents. Features indirectly connected to $Y$ are not in the MB if they are d-separated by the MB set."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-018","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T01 · ML Fundamentals] VC dimension measures the complexity of a hypothesis class. A half-space classifier in $\\mathbb{R}^d$ has VC dimension $d+1$. 
What does this mean for sample complexity, and how does it relate to PAC learning?","options":{"A":"VC dimension measures how many features the model has","B":"VC dimension = $d+1$ for half-spaces: the classifier can shatter $d+1$ points in general position (label them with any binary assignment); PAC learning: to achieve error $\\leq \\epsilon$ with confidence $\\geq 1-\\delta$, the sample complexity is $O\\left(\\frac{d + \\log(1/\\delta)}{\\epsilon}\\right)$; higher VC dimension → more samples needed to generalize; implication: logistic regression in 100D needs $O(100/\\epsilon)$ samples; in 10,000D (NLP features), it needs $O(10,000/\\epsilon)$ → the curse of dimensionality appears in generalization, not just computation","C":"VC dimension equals the test set size needed for reliable evaluation","D":"A higher VC dimension means the model always generalizes better"},"correct":"B","explanation":{"correct":"- Shattering: a set of $m$ points is shattered by a hypothesis class if for every binary labeling of the $m$ points, there exists a hypothesis that correctly classifies all of them. VC dimension = max $m$ that can be shattered.\n- Fundamental theorem of PAC learning: a hypothesis class is PAC learnable iff its VC dimension is finite. The required sample size grows linearly with VC dimension.\n- Connection to practice: this is why high-dimensional linear models need regularization — without it, the effectively infinite-VC-dimension model (with enough features) can overfit on any finite dataset.","A":"VC dimension measures shatter capacity — the complexity of the function class, not the number of features directly. Though for half-spaces, VC dim = $d+1$ (related to dimension).","B":"","C":"VC dimension has nothing to do with test set size directly. 
Test set size relates to statistical confidence intervals for estimating error from empirical accuracy.","D":"Higher VC dimension means MORE capacity to fit training data — potentially more overfitting, requiring more training samples to generalize. High VC dimension ≠ better generalization."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-019","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T06 · Gradient Boosting] DART (Dropout Additive Regression Trees) applies dropout to gradient boosting. Explain the mechanism and why it was motivated by the over-specialization problem in standard gradient boosting.","options":{"A":"DART is dropout applied to the leaf weights within each individual tree","B":"Over-specialization in GBDT: the first trees in the sequence learn the most important patterns and dominate predictions; later trees learn residuals from increasingly low-signal data; these later trees are small corrections that can destabilize the ensemble; DART mechanism: at each boosting round, randomly drop a subset of previous trees from the ensemble, fit the new tree to the residuals of the remaining ensemble, then re-scale and add back both the dropped and new trees; this forces each new tree to be useful even when some previous trees are absent — prevents any single tree from dominating","C":"DART applies standard neural network dropout to every feature at each split","D":"DART eliminates the need for a learning rate by using dropout rate instead"},"correct":"B","explanation":{"correct":"- Motivation: in standard GBDT with many rounds, early trees are large and predictive; late trees are tiny corrections. The model is dominated by early trees. This is \"over-specialization.\"\n- DART effect: by randomly excluding trees, each new tree must contribute meaningfully even when some trees are missing — it can't rely on specific tree combinations. 
This produces a more uniform ensemble where all trees contribute.\n- Scaling: when dropped trees are added back, the predictions must be re-normalized to avoid magnitude inflation. This is the DART \"scaling\" step.\n- Implementation: available in XGBoost (`booster='dart'`) and LightGBM.","A":"DART drops entire trees, not leaf weights. The mechanism is conceptually analogous to neural dropout (removing units) but applied at the tree level.","B":"","C":"DART drops complete trees from the ensemble. It doesn't operate at the feature level within trees.","D":"DART can coexist with a learning rate. The learning rate still scales each tree's contribution; DART's dropout rate controls what fraction of trees are dropped during training."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-020","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T09 · Naive Bayes] Semi-Naive Bayes models relax the full independence assumption by allowing some feature dependencies to be modeled. 
Describe one semi-NB approach (e.g., TAN or NB with feature grouping) and analyze when it is better than full NB or full logistic regression.","options":{"A":"Semi-Naive Bayes is just Naive Bayes with more training data","B":"TAN (Tree Augmented Naive Bayes): extends NB by learning a maximum spanning tree over features (one additional parent per feature beyond the class variable); each feature $X_i$ can depend on one other feature $X_j$ in addition to the class $C$; tree structure is learned using mutual information $I(X_i; X_j | C)$ as edge weights; TAN is better than NB when: a few specific feature pairs have strong conditional dependencies (TAN captures them exactly); better than logistic regression when: small data where LR has insufficient samples to estimate all pairwise interactions; TAN sits on the bias-variance tradeoff between NB and LR","C":"Semi-NB models are always dominated by either full NB or LR — they have no use case","D":"TAN models the full joint distribution over features, making it equivalent to a Bayesian network classifier"},"correct":"B","explanation":{"correct":"- TAN algorithm (Friedman et al., 1997): (1) compute $I(X_i; X_j | C)$ for all feature pairs; (2) build a complete graph with these mutual information weights; (3) find the maximum spanning tree; (4) root the tree and direct edges from root to leaves; (5) train NB on the augmented structure.\n- Bias-variance: TAN has fewer parameters than LR but more than NB. With small datasets, TAN generalizes better than LR (fewer parameters to estimate). With moderate data, LR's flexibility pays off.\n- Practical: TAN achieves competitive performance with LR while maintaining the generative framework's advantages (missing data, easy feature addition).","A":"More training data doesn't change NB's assumption. 
TAN structurally relaxes the independence assumption by allowing one parent dependency per feature.","B":"","C":"TAN fills a practical niche between NB's extreme independence assumption and LR's full discriminative flexibility. It outperforms NB on datasets with correlated features and small data.","D":"TAN uses a tree structure (one parent per node), not a full Bayesian network (arbitrary parents). A full Bayesian network could model all dependencies but would require exponentially more parameters."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-021","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T14 · Model Evaluation] Cohen's Kappa is often used for evaluating multi-class classification with class imbalance. Explain how Kappa accounts for chance agreement that accuracy ignores, and describe a scenario where high accuracy and low Kappa both occur simultaneously.","options":{"A":"Cohen's Kappa is equivalent to accuracy — it measures the same thing with a different formula","B":"Kappa: $\\kappa = (p_o - p_e)/(1 - p_e)$; $p_o$ = observed accuracy; $p_e$ = expected accuracy by chance (based on class marginals); a classifier that always predicts the majority class has $p_o = p_e$ → $\\kappa = 0$; scenario: 95% class A, 5% class B; classifier always predicts A; accuracy = 95%; $p_e = 1 \\times 0.95 + 0 \\times 0.05 = 0.95$; $\\kappa = (0.95 - 0.95)/(1 - 0.95) = 0$, so high accuracy coexists with no skill above chance; contrast: a classifier that predicts A 95% of the time and also reaches 95% accuracy has $p_e = 0.95^2 + 0.05^2 = 0.905$ and $\\kappa = (0.95 - 0.905)/(1 - 0.905) = 0.045/0.095 \\approx 0.47$, only moderate agreement at the same accuracy; extreme case: predict A always on 99% imbalanced data: accuracy=99%, $\\kappa = 0$ — no skill above chance","C":"Cohen's Kappa is only valid for binary classification","D":"High accuracy always means high Kappa — the two metrics agree on model ranking"},"correct":"B","explanation":{"correct":"- Expected agreement $p_e$: the probability that a random classifier (matching the marginal distributions) would agree with the true labels by chance. 
For two-class imbalanced data: $p_e = P(\\hat{y}=A)P(y=A) + P(\\hat{y}=B)P(y=B)$.\n- $\\kappa = 0$: no skill above random baseline. $\\kappa < 0$: worse than chance. $\\kappa > 0.8$: near-perfect agreement.\n- Concrete extreme: predict all A, 99% class A. Accuracy = 99%. Because $P(\\hat{y}=A) = 1$ and $P(\\hat{y}=B) = 0$, $p_e = 1 \\times 0.99 + 0 \\times 0.01 = 0.99$, so $\\kappa = (0.99 - 0.99)/(1 - 0.99) = 0$. Zero kappa despite \"excellent\" accuracy.","A":"The key difference is $p_e$ — the chance correction. Accuracy ignores $p_e$; Kappa explicitly subtracts it. For balanced datasets, they are closely related. For imbalanced data, they diverge significantly.","B":"","C":"Cohen's Kappa generalizes to multi-class classification naturally. $p_e = \\sum_k P(\\hat{y}=k) \\times P(y=k)$, summed over all $k$ classes.","D":"High accuracy does NOT imply high Kappa for imbalanced data. A classifier with accuracy 99% can have Kappa near 0 if class imbalance is extreme."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-022","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T15 · Bias-Variance] Rashomon sets are defined as the set of all models within a certain loss tolerance of the best model. 
Why does a large Rashomon set have both practical benefits and risks for deploying ML models in regulated industries?","options":{"A":"Rashomon sets are only relevant for research — practitioners always select the single best model","B":"Large Rashomon set means many models achieve near-identical predictive accuracy; benefit: enables selection of the simplest, most interpretable model from the set (Occam's razor via Rashomon); in healthcare/credit, regulators require explainability — choosing a sparse linear model from the Rashomon set instead of a black-box model satisfies regulation without sacrificing accuracy; risk: the many near-equivalent models may have very different decision rationales — \"predictively equivalent\" models can disagree on 20-40% of individual predictions, creating arbitrariness in who receives loans/treatment; disparate impact: different Rashomon models may apply different implicit criteria, leading to inconsistent and potentially discriminatory decisions","C":"Large Rashomon sets always indicate overfitting","D":"All models in a Rashomon set are functionally identical and make identical predictions"},"correct":"B","explanation":{"correct":"- Rashomon set definition: $\\{f : L(f) \\leq L^* + \\epsilon\\}$ where $L^*$ is the minimum loss. All $f$ in this set are \"equally good\" within tolerance $\\epsilon$.\n- Interpretability benefit: Semenova et al. (2022) showed that many real-world datasets have large Rashomon sets that include simple decision lists achieving near-optimal performance. Regulators can be satisfied.\n- Predictive multiplicity risk: multiple models with identical aggregate accuracy make different predictions for individual cases. An individual can be denied a loan by one near-optimal model but approved by another with equal aggregate performance. This unpredictability is ethically problematic.","A":"Rashomon sets are increasingly important for ML fairness, interpretability, and regulatory compliance. 
The concept shapes how regulated industries should approach model selection.","B":"","C":"Rashomon sets relate to model complexity and the landscape of the loss function. A large Rashomon set often indicates an identifiability problem or that the data supports many equally good explanations — not necessarily overfitting.","D":"Models in the Rashomon set have near-equal aggregate loss but can have very different individual predictions. This is the \"predictive multiplicity\" phenomenon documented in real datasets."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-023","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T12 · Anomaly Detection] An adversarial attacker learns the decision boundary of an anomaly detection model (e.g., Isolation Forest) and crafts anomalous inputs that avoid detection (adversarial examples for anomaly detection). Describe the attack mechanism and a defense strategy.","options":{"A":"Adversarial attacks on anomaly detection are impossible — anomaly detectors are robust by design","B":"Attack mechanism: if the attacker can query the model (black-box) or access model parameters (white-box), they can iteratively adjust an anomalous input to minimize its anomaly score below the detection threshold; for Isolation Forest: the attacker generates samples that occupy dense regions of the training space (long isolation paths, i.e., large depth in the trees) while still representing malicious behavior; defense strategies: (1) ensemble diverse detectors (Isolation Forest + LOF + OCSVM) — an adversary that evades one may not evade all; (2) randomized thresholds and model refresh; (3) incorporate distributional shift detection alongside anomaly scoring; (4) adversarial training on anomaly detectors","C":"Adversarial anomaly examples can only exist in image or text domains","D":"Adding more training data always defends against adversarial anomaly attacks"},"correct":"B","explanation":{"correct":"- Adversarial anomaly: an attack that is semantically 
anomalous (fraud, intrusion) but statistically \"normal\" (similar to training distribution). The attacker minimizes $\\text{AnomalyScore}(x)$ while maximizing attack effectiveness.\n- Black-box attack: query the model with trial inputs, use gradient-free optimization (evolutionary algorithms, Bayesian optimization) to find low-score anomalous inputs.\n- White-box: if the detector is differentiable (e.g., an autoencoder), use backpropagation to minimize reconstruction error while maintaining attack payload.\n- Ensemble defense: forcing the attacker to evade multiple detectors simultaneously makes the optimization problem substantially harder.","A":"Anomaly detectors are not inherently robust. Any model with a computable score function is potentially vulnerable to adversarial optimization.","B":"","C":"Adversarial attacks have been demonstrated on network intrusion detection, fraud detection (tabular), and industrial control systems. Domain is irrelevant — the attack exploits the score function, not the data modality.","D":"Adding more normal training data makes the \"normal\" region denser, not more protected. The adversary still targets the dense normal region. Additional training data doesn't inherently defend against adversarial anomaly construction."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-024","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T16 · Regularization] Spectral normalization (as in SN-GAN) constrains the spectral norm (largest singular value) of weight matrices in neural networks. 
Contrast this with L2 regularization and explain why spectral norm is important for training Generative Adversarial Networks.","options":{"A":"Spectral norm regularization is equivalent to L2 weight decay","B":"L2 (weight decay): penalizes $||W||_F^2$ (Frobenius norm) — penalizes sum of squared weights; spectral norm: constrains $\\sigma_1(W) = ||W||_2$ (largest singular value); why different: $||W||_F = \\sqrt{\\sum \\sigma_i^2}$ vs $||W||_2 = \\sigma_1$; GAN discriminator Lipschitz constraint: WGAN and SN-GAN require the discriminator to be 1-Lipschitz; Lipschitz constant is bounded by the product of spectral norms across layers; spectral normalization enforces $\\sigma_1(W) \\leq 1$ per layer → bounding the discriminator's Lipschitz constant → more stable GAN training (helps against mode collapse and exploding gradients)","C":"Spectral norm regularization is only applicable to convolutional layers","D":"L2 regularization always achieves a smaller Lipschitz constant than spectral norm"},"correct":"B","explanation":{"correct":"- Lipschitz continuity: $||f(x) - f(y)|| \\leq L||x-y||$ for all $x, y$. For a linear layer: the Lipschitz constant = $\\sigma_1(W)$. For stacked layers with 1-Lipschitz activations (e.g., ReLU): $L_{\\text{network}} \\leq \\prod_l \\sigma_1(W_l)$.\n- WGAN training: the critic must be K-Lipschitz (typically K=1). Weight clipping (the original WGAN) is crude and can be unstable. Spectral normalization (SN-GAN, Miyato et al. 2018) rescales each weight matrix so that $\\sigma_1(W) = 1$, estimating $\\sigma_1$ with power iteration at each update, which is efficient and stable.\n- L2 vs SN: L2 penalizes every singular value through $\\sum_i \\sigma_i^2$. SN constrains only the largest, leaving the relative sizes of the other singular values free, so the network retains expressiveness while its Lipschitz constant stays bounded.","A":"Frobenius norm (L2) and spectral norm are different matrix norms with different geometric properties. 
Their minimizers are different, and their effects on the network's function class differ.","B":"","C":"Spectral normalization applies to any weight matrix: fully connected, convolutional, attention layers. Miyato et al. applied it to convolutional and fully connected layers.","D":"L2 shrinks the Frobenius norm but does not enforce a hard bound on the spectral norm. A matrix with L2-regularized weights can still have a large spectral norm if one singular direction dominates."}},{"section":"machine-learning","difficulty":"hard","id":"ml-pract-hard-025","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T10 · PCA] Kernel PCA extends PCA to nonlinear dimensionality reduction using the kernel trick. Explain how kernel PCA avoids explicitly computing the feature map $\\phi(x)$, and what makes choosing the right kernel a difficult problem.","options":{"A":"Kernel PCA computes PCA in the original feature space, then applies a nonlinear transformation","B":"Kernel PCA performs PCA in the RKHS (Reproducing Kernel Hilbert Space) implicitly; the PCA eigenvectors in $\\mathcal{H}$ are expressed as $\\alpha_j = \\sum_i a_{ij} \\phi(x_i)$ (they lie in the span of training points via Representer theorem); the kernel gram matrix $K_{ij} = k(x_i, x_j)$ encodes all required inner products; the projection of a test point: $\\langle \\phi(x), \\alpha_j \\rangle = \\sum_i a_{ij} k(x, x_i)$; only kernel evaluations are needed — $\\phi(x)$ is never computed; kernel choice difficulty: different kernels assume different notions of similarity; RBF bandwidth $\\sigma$ controls locality; wrong $\\sigma$ → either too local (each point isolated) or too global (no nonlinear structure discovered); cross-validation for kernel hyperparameters requires re-computing $K$ each time, costing $O(n^2)$ kernel evaluations plus an eigendecomposition per candidate","C":"Kernel PCA uses the same eigenvectors as standard PCA — it just applies them in a higher-dimensional space","D":"Kernel PCA requires computing $\\phi(x)$ explicitly but in parallel for 
efficiency"},"correct":"B","explanation":{"correct":"- RKHS: functions in a Reproducing Kernel Hilbert Space have the property that evaluation is bounded by the kernel: $|f(x)| \\leq ||f||_{\\mathcal{H}} \\sqrt{k(x,x)}$.\n- Representer theorem: the solution to any regularized learning problem in RKHS lies in the span of kernel evaluations at training points → eigenvectors can be represented without explicit $\\phi$.\n- Centering in feature space: $K_{ij}^c = K_{ij} - \\frac{1}{n}\\sum_k K_{ik} - \\frac{1}{n}\\sum_k K_{kj} + \\frac{1}{n^2}\\sum_{k,l} K_{kl}$ (double centering). This centers the feature-space Gram matrix without computing $\\phi$.","A":"The order is reversed. Standard PCA in original space followed by nonlinear transformation is not Kernel PCA. Kernel PCA works entirely in $\\mathcal{H}$ via the kernel trick.","B":"","C":"Kernel PCA produces different eigenvectors from standard PCA. The kernel-space covariance matrix is different from the original-space covariance matrix.","D":"Kernel PCA explicitly AVOIDS computing $\\phi(x)$ — that is the entire motivation for using the kernel trick. Computing $\\phi(x)$ would require infinite dimensions for RBF kernel."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-001","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T01 · ML Fundamentals] A company builds a churn prediction model. Features include \"days since last login\" and \"support tickets this month.\" The target is churn in the next 30 days. Six months after deployment, the model's performance drops. The data science team notices that \"support tickets this month\" now has a much higher average than during training. 
What type of shift has occurred, and how should it be handled?","options":{"A":"Label shift — the proportion of churned customers has changed","B":"Covariate shift (data drift) — the input feature distribution $P(X)$ has changed while $P(Y|X)$ may still hold; the model receives feature values outside its training distribution; monitor input distributions with statistical tests (KS test, PSI), retrain on recent data, and use importance weighting to adjust for the shift","C":"This is expected behavior — support tickets always increase over time","D":"The model has a coding bug that inflates the feature value"},"correct":"B","explanation":{"correct":"- Covariate shift: $P(X_{\\text{train}}) \\neq P(X_{\\text{test/production}})$. The model was trained on lower ticket volumes; it now receives feature values it rarely saw during training.\n- Even if the relationship $P(\\text{churn}|\\text{tickets})$ is unchanged, the model's learned decision boundary may be poorly calibrated for the new range.\n- Monitoring: track feature distribution statistics (mean, std, percentiles) over time. Population Stability Index (PSI) > 0.2 typically triggers retraining.","A":"Label shift is $P(Y)$ changing — the churn rate changing. The scenario describes feature distribution change, not class proportion change.","B":"","C":"A business context explanation (products change, product issues increase tickets) doesn't negate the need to handle the distributional shift. The model still needs updating.","D":"Gradual changes over months are distributional, not a sudden bug. Bugs produce abrupt errors, not gradual drift."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-002","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T02 · Linear Regression] A linear regression model predicts house prices. The residual plot shows a fan shape: small residuals for low-priced houses, large residuals for high-priced houses. 
What Gauss-Markov assumption is violated, and what is the practical consequence?","options":{"A":"Multicollinearity — features are correlated with each other","B":"Heteroscedasticity — residual variance increases with fitted value (fan shape); OLS is still unbiased but loses the minimum-variance property; confidence intervals and p-values computed from standard OLS are wrong (standard errors are biased); the model makes systematically more uncertain predictions for expensive houses but treats all predictions equally","C":"Non-linearity — the relationship between features and price is nonlinear","D":"The violation is acceptable — fan shapes in residuals are normal for price data"},"correct":"B","explanation":{"correct":"- Homoscedasticity requires $\\text{Var}(\\epsilon_i) = \\sigma^2$ for all $i$. A fan shape shows $\\text{Var}(\\epsilon_i)$ growing with the fitted value $\\hat{y}_i$.\n- Consequences: (1) coefficient estimates are still unbiased but not BLUE; (2) standard errors are wrong → invalid hypothesis tests; (3) prediction intervals are too narrow for expensive houses, too wide for cheap ones.\n- Fix: $\\log(\\text{price})$ as target often stabilizes variance in price models; weighted least squares (WLS) with weights inversely proportional to the error variance, e.g. $w_i = 1/\\hat{y}_i^2$ when the residual standard deviation scales with $\\hat{y}_i$.","A":"Multicollinearity affects coefficient interpretation and stability but doesn't cause a fan shape in residuals. Multicollinearity is visible in VIF scores and coefficient instability.","B":"","C":"Non-linearity shows a curved pattern in residual vs fitted plots (systematic positive/negative residuals). A fan shape specifically indicates variance increase with fitted value.","D":"Fan shapes indicate a real assumption violation with real consequences. 
Log-transforming the price target is standard practice precisely because of this."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-003","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T03 · Logistic Regression] A logistic regression model is trained on a perfectly linearly separable dataset (all class-1 points are clearly above the decision line, all class-0 points below). Training fails to converge. What is the mathematical reason?","options":{"A":"The optimizer has a bug — any optimizer should converge on linearly separable data","B":"For perfectly separable data, the log-loss has no finite minimizer: scaling a separating weight vector by a larger constant always lowers the loss, so weight magnitudes grow without bound; $\\hat{p}_i \\to 1$ for positives requires $w^Tx_i \\to +\\infty$; the log-loss never reaches 0 but approaches 0 asymptotically; gradient descent keeps taking steps forever; regularization (L2 or L1) prevents this by penalizing large weights","C":"Linear separability guarantees fast convergence — the opposite of what was described","D":"The model is trying to fit a nonlinear boundary on linearly separable data"},"correct":"B","explanation":{"correct":"- Log-loss: $L = -\\sum[y\\log(\\hat{p}) + (1-y)\\log(1-\\hat{p})]$. To minimize: push $\\hat{p} \\to 1$ for $y=1$ and $\\hat{p} \\to 0$ for $y=0$.\n- Perfect separation: $\\hat{p}_i = \\sigma(w^Tx_i) \\to 1$ requires $||w|| \\to \\infty$. The loss approaches 0 but never reaches it. Gradient is always non-zero → no convergence.\n- With L2 regularization: the loss has an additional $\\lambda||w||^2$ term that grows with $||w||$. The combined objective has a finite minimum where the weight magnitude is bounded.","A":"This is a known mathematical property of logistic regression with perfectly separable data, not a bug. Without regularization or early stopping, any gradient-based optimizer keeps increasing the weights.","B":"","C":"Linear separability causes divergence (unbounded weights), not fast convergence. 
For non-separable data, weights are bounded by the constraint that some points must be misclassified.","D":"The model is fitting a linear boundary. The issue is the unbounded weight problem when the linear boundary perfectly separates both classes."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-004","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T04 · Decision Trees] Two splits are evaluated for a binary classification problem. Split A: Gini parent = 0.5, children weighted Gini = 0.35. Split B: Gini parent = 0.5, children weighted Gini = 0.10. Which split is chosen by the CART algorithm, and why?","options":{"A":"Split A — smaller Gini impurity in children means better purity","B":"Split B — CART maximizes information gain (Gini impurity reduction); Split B reduces Gini by 0.40 (from 0.5 to 0.10) vs Split A's 0.15 reduction; larger reduction = purer children = more informative split","C":"Both splits are equivalent — any split with Gini < 0.5 is acceptable","D":"CART would choose Split A because 0.35 < 0.50 satisfies the purity threshold"},"correct":"B","explanation":{"correct":"- CART criterion: choose the split that maximizes impurity reduction = parent Gini − weighted average children Gini.\n- Split A: $\\Delta G = 0.5 - 0.35 = 0.15$.\n- Split B: $\\Delta G = 0.5 - 0.10 = 0.40$.\n- Split B is chosen: children are much purer (Gini 0.10 vs 0.35), meaning the split separates classes much more effectively.","A":"\"Smaller Gini means better\" is directionally correct but the comparison is wrong. Split B has Gini 0.10 < Split A's 0.35. Split B should be preferred, not A.","B":"","C":"The magnitude of the reduction matters. 
CART chooses the largest Gini reduction, not just any split below the parent's Gini.","D":"CART doesn't use a fixed threshold — it picks the split with the highest impurity reduction among all candidate splits."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-005","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T05 · Random Forest] A Random Forest model achieves 89% accuracy with `max_features='sqrt'`. A data scientist changes it to `max_features=None` (uses all features at each split). Training slows down and accuracy drops slightly. Explain both effects.","options":{"A":"Using all features always increases accuracy — the accuracy drop indicates a bug","B":"Speed: with all features at each split, finding the best split requires evaluating all $p$ features instead of $\\sqrt{p}$, so computation per split increases by a factor of $p/\\sqrt{p} = \\sqrt{p}$; accuracy: using all features makes trees more correlated (they all tend to split on the same dominant features), increasing inter-tree correlation $\\rho$, which limits variance reduction: $\\text{Var}(\\text{RF}) = \\rho \\sigma^2 + \\frac{1-\\rho}{B}\\sigma^2$; larger $\\rho$ → higher ensemble variance","C":"Using more features always improves accuracy — the drop must be due to different random seeds","D":"Speed improves with more features because the algorithm can skip weaker features"},"correct":"B","explanation":{"correct":"- Feature subsampling in RF: each split evaluates $m$ random features. $m = \\sqrt{p}$ is a widely used default for classification rather than a provable optimum. It forces diverse splits across trees.\n- With $m = p$: every split considers all features. All trees will tend to split on the same top-1 feature at the root → correlated trees → $\\rho$ increases → $\\text{Var}(\\text{RF})$ increases.\n- Speed: $\\sqrt{p}$ features to evaluate per split vs $p$ features. 
For $p=100$: $\\sqrt{p}=10$ vs 100 — 10× more work per split.","A":"Considering all features at every split removes the feature subsampling that gives RF its tree diversity, a fundamental RF design choice. The accuracy drop is expected and documented.","B":"","C":"The accuracy drop is systematic, not random seed dependent. It is reproducible and directly caused by increased inter-tree correlation.","D":"Evaluating more features per split requires more computation, not less. Each feature requires computing the impurity reduction for all possible split thresholds."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-006","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T06 · Gradient Boosting] CatBoost is a gradient boosting library designed specifically to handle categorical features. Many other gradient boosting setups need categorical features encoded numerically first, often via one-hot encoding. What is the key risk of naively one-hot encoding a high-cardinality categorical feature (e.g., city with 1,000 categories)?","options":{"A":"One-hot encoding is always safe — there is no risk with high-cardinality features","B":"With 1,000 categories, one-hot encoding creates 1,000 binary features; for tree-based models, this creates 1,000 possible binary split candidates per node; combined with gradient boosting's tendency to overfit, the model may learn spurious patterns specific to rare city values; rare cities with 1-2 training examples can dominate leaf predictions; CatBoost uses ordered target statistics to avoid this","C":"One-hot encoding fails mathematically for more than 100 categories","D":"High-cardinality one-hot causes one-hot encoded features to have zero variance"},"correct":"B","explanation":{"correct":"- High-cardinality + gradient boosting: a city that appears 2 times in training can have a deterministic mean target — the tree fits to this noise rather than the true signal. 
The gradient boosting model will pick up city-specific artifacts.\n- CatBoost's ordered target statistics: the categorical encoding of the current row is computed only from the targets of rows that precede it in a random permutation, which prevents target leakage from the row's own label.\n- Also relevant: category frequencies are often Zipf-like, so most categories are rare (long tail). Encoding all rare categories individually adds high-variance features.","A":"One-hot encoding of high-cardinality categoricals is a well-known pitfall in gradient boosting. Rare categories create high-variance leaf predictions.","B":"","C":"There's no mathematical limit. The concern is statistical (overfitting), not mathematical failure.","D":"Each one-hot column is a binary feature with low but non-zero variance (variance = p(1-p) where p = fraction of that category). Zero variance only happens for constant features."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-007","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T07 · SVM] An RBF kernel SVM is trained. The gamma parameter is very high (γ = 100). What effect does this have on the decision boundary, and what problem does it cause?","options":{"A":"High gamma creates a very wide Gaussian kernel, making the boundary smooth","B":"High gamma means the Gaussian kernel $K(x,z) = \\exp(-\\gamma||x-z||^2)$ falls off very rapidly — each support vector influences only a tiny region around itself; the decision boundary becomes highly irregular, tracing tightly around each training point; this causes high variance (severe overfitting) — the model memorizes training data but fails on test data","C":"High gamma increases the margin width, reducing overfitting","D":"Gamma only affects computation speed, not the decision boundary shape"},"correct":"B","explanation":{"correct":"- RBF kernel: $K(x,z) = \\exp(-\\gamma||x-z||^2)$. For high $\\gamma$: only very close points have $K$ appreciably above 0. 
The effective neighborhood shrinks toward a single point.\n- High $\\gamma$ effect: each training point \"owns\" a tiny bubble of influence. The decision boundary wrinkles around each support vector → complex, non-smooth boundary → overfitting.\n- Low $\\gamma$: each point influences a wide region → smooth boundary → high bias, may underfit.\n- Optimal $\\gamma$: selected via cross-validation. sklearn default (`gamma='scale'`): $\\gamma = 1/(\\text{n\\_features} \\times \\text{Var}(X))$.","A":"Wide Gaussian (large effective radius) corresponds to LOW gamma. High gamma = narrow Gaussian = small effective radius.","B":"","C":"High gamma increases overfitting. The margin trade-off is controlled by C (the soft-margin penalty), not gamma. The two parameters play different roles, although they interact and are usually tuned together.","D":"Gamma fundamentally changes the kernel function and thus the decision boundary shape. Speed is a secondary effect."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-008","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T08 · KNN] A KNN regression model (K=5) is used to predict house prices. A test house has 5 nearest neighbors with prices: [$200K, $210K, $205K, $800K, $215K]. The prediction is $326K (mean). 
A data scientist says \"this prediction is wrong — the $800K neighbor is an outlier.\" What is the issue and the fix?","options":{"A":"K=5 is too small — use K=50 for more robust predictions","B":"KNN regression with uniform weights treats all K neighbors equally; the $800K outlier is roughly 4× the other neighbors' values, pulling the mean prediction far from the cluster; fix: use distance-weighted KNN ($w_i = 1/d_i^2$, closer neighbors weighted more), use median instead of mean for robust KNN regression, or investigate whether the $800K neighbor is truly a relevant comparison","C":"KNN should use classification, not regression — regression produces unstable predictions","D":"The prediction of $326K is correct — KNN mean is always the right approach"},"correct":"B","explanation":{"correct":"- Equal-weight mean: $P = (200+210+205+800+215)/5 = 1630/5 = 326K$. The outlier inflates the prediction by about $118.5K ($326K vs $207.5K without the outlier).\n- Distance-weighted KNN: if the $800K house is farther away (less similar), it receives less weight. The prediction reflects the most similar houses more strongly.\n- Median KNN: $\\text{median}(200, 205, 210, 215, 800) = 210K$. Robust to the outlier.\n- Root cause investigation: why is the $800K house a nearest neighbor? Perhaps a feature mismatch — investigating nearest-neighbor relevance may be more valuable than fixing the aggregation method.","A":"Increasing K adds more distant neighbors, potentially bringing in more outliers. It doesn't fix the outlier sensitivity problem.","B":"","C":"KNN regression is a valid and widely used technique. The issue is outlier sensitivity in the averaging step, not the use of regression mode.","D":"Equal-weight mean KNN is sensitive to outliers. 
The $326K prediction is clearly wrong for a house similar to $200-215K neighbors."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-009","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T09 · Naive Bayes] A Naive Bayes classifier is trained on customer review sentiment (positive/negative). The word \"not\" has $P(\\text{\"not\"}|\\text{positive}) = 0.12$ and $P(\\text{\"not\"}|\\text{negative}) = 0.18$. A review contains the phrase \"not bad.\" NB classifies it as negative. Is this surprising, and why?","options":{"A":"The classification is correct — \"not\" is more common in negative reviews so negative is expected","B":"The conditional independence assumption causes NB to treat \"not\" and \"bad\" as independent features; \"not bad\" means \"good\" (negation), but NB multiplies $P(\\text{\"not\"}|\\text{class}) \\times P(\\text{\"bad\"}|\\text{class})$ separately, destroying the negation relationship; the combination means something different from either word alone — NB misses this linguistic compositionality","C":"NB correctly handles negation because the product of two probabilities captures the combined effect","D":"The word \"not\" should be removed as a stop word to fix this issue"},"correct":"B","explanation":{"correct":"- Negation problem: NB's independence assumption treats each word as contributing independently. \"Not\" + \"bad\" → NB multiplies their individual class likelihoods. But \"not bad\" = \"good\" — the negation reverses the sentiment of \"bad.\"\n- NB's score for \"not bad\": $P(\\text{pos}) \\times P(\\text{\"not\"}|\\text{pos}) \\times P(\\text{\"bad\"}|\\text{pos})$ vs the negative class counterpart. 
\"Bad\" is heavily associated with negative reviews → NB leans negative even when the phrase means positive.\n- Fix: bigram features (\"not_bad\" as a single feature) or negation detection preprocessing that marks words following \"not\" with a negation tag.","A":"\"Not\" being slightly more common in negative reviews does not explain the full miscategorization. The key problem is that \"not bad\" is a positive phrase that NB's independence assumption breaks apart.","B":"","C":"Multiplying independent probabilities does NOT capture the combined meaning. $P(\\text{\"not bad\"}) \\neq P(\\text{\"not\"}) \\times P(\\text{\"bad\"})$ when words interact.","D":"Removing \"not\" would make things worse — the model would only see \"bad\" and classify as negative even more strongly. Stop word removal is counterproductive for sentiment analysis."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-010","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T10 · PCA] A company stores customer transaction records as sparse matrices (99% zeros, ~50 non-zero values per row out of 10,000 features). An engineer applies standard PCA. Memory usage explodes (100GB for the covariance matrix). 
What causes this, and what is the solution?","options":{"A":"PCA is working correctly — 100GB is expected for large datasets","B":"Standard PCA computes the full $p \\times p$ covariance matrix: $X^TX$ where $p = 10,000$; the matrix is $10,000 \\times 10,000 = 10^8$ float64 values = 800MB just for the matrix (manageable), but the issue is that standard PCA first densifies the sparse matrix (mean-centering destroys sparsity); fix: use TruncatedSVD (sklearn) which works directly on sparse matrices using iterative methods without explicitly computing the dense covariance matrix","C":"PCA cannot be applied to sparse matrices under any circumstances","D":"The solution is to convert the matrix to float32 instead of float64"},"correct":"B","explanation":{"correct":"- Dense covariance computation: $X^TX$ requires materializing the full dense matrix product. If $X$ is stored as sparse but PCA densifies it, the result is an $n \\times p$ dense matrix; for example, with $n = 10^6$ rows and $p = 10^4$ features → $10^{10}$ float64 values → 80GB. That's the real memory problem.\n- TruncatedSVD (randomized SVD): computes only the top $k$ singular vectors without computing the full covariance. Works on sparse matrices in scipy.sparse format. sklearn's `TruncatedSVD` is the sparse-compatible counterpart of PCA (it does not center the data).\n- Same mathematical result: TruncatedSVD on centered data = PCA. On uncentered data = LSA (Latent Semantic Analysis), often sufficient for NLP.","A":"100GB+ for a PCA covariance computation is not expected or acceptable. Standard PCA on sparse data requires engineering solutions.","B":"","C":"PCA can be applied to sparse data — using TruncatedSVD or online PCA methods. The constraint is on the naive dense implementation.","D":"float32 halves memory but doesn't solve the fundamental problem of materializing a dense matrix from sparse data. 
50% of 80GB is still 40GB."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-011","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T11 · Clustering] A team is deciding between K-means and Gaussian Mixture Models (GMM) for customer segmentation. The team wants to handle customers who genuinely \"belong\" partially to two segments (e.g., a customer who shops like both a student and a professional). Which algorithm is better suited, and why?","options":{"A":"K-means is better because it assigns each customer to exactly one segment, ensuring clean boundaries","B":"GMM is better — it provides soft cluster membership: $P(\\text{segment}_k | \\text{customer}_i)$; a customer could be 60% segment A and 40% segment B; marketing campaigns can be weighted by membership probability; K-means hard assignment would arbitrarily force the customer into one segment, losing the ambiguity information","C":"Neither algorithm handles partial membership — a custom algorithm is needed","D":"GMM and K-means produce identical results when using spherical Gaussian components"},"correct":"B","explanation":{"correct":"- GMM soft assignment: the E-step in EM computes $P(\\text{segment}_k | x_i)$ — a full probability vector over K segments per customer. Downstream actions can be personalized proportionally.\n- K-means hard assignment: forces an all-or-nothing choice of a single segment. Customers on the boundary get one label arbitrarily. This loses valuable information about borderline cases.\n- Business value: a customer who is 50/50 between student and professional might respond to different messaging in different contexts. Soft membership captures this.","A":"Hard boundaries are not desirable when customers genuinely exhibit mixed behaviors. K-means hard assignment discards the ambiguity information.","B":"","C":"GMM explicitly provides soft membership through its probabilistic formulation. 
This is one of its primary differentiators from K-means.","D":"When GMM uses spherical equal-variance Gaussians, its MAP estimate (argmax of posterior) equals K-means hard assignments. But GMM still provides soft probabilities. They are equivalent only for the final hard assignment, not the probabilistic output."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-012","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T12 · Anomaly Detection] A manufacturing quality control system uses reconstruction error from an autoencoder to flag defective parts. A new type of defect is introduced (a scratch). The autoencoder was not trained on scratches. What will the autoencoder's reconstruction error be for scratched parts, and why?","options":{"A":"Low reconstruction error — autoencoders generalize perfectly to new types of defects","B":"High reconstruction error — the autoencoder was trained only on normal parts and learned to reconstruct normal surfaces well; a scratch pattern was never seen during training; the decoder cannot reproduce the scratch accurately from the learned latent representation → high reconstruction error → correctly flagged as anomaly","C":"Zero reconstruction error — the autoencoder ignores features it wasn't trained on","D":"The reconstruction error is unpredictable — sometimes high, sometimes low"},"correct":"B","explanation":{"correct":"- Autoencoder anomaly detection principle: train only on normal data → the encoder-decoder learns the manifold of normal appearances; out-of-distribution inputs (defects) are not on this manifold → high reconstruction error.\n- Scratch = new texture pattern not on the learned normal manifold → the decoder produces a smoothed, scratch-free reconstruction → $||x - \\hat{x}||^2$ is high.\n- This is why autoencoder anomaly detection is appealing for manufacturing: it generalizes to unseen defect types as long as \"normal\" was well-represented in training.","A":"Autoencoders 
do not generalize to new patterns perfectly. In fact, the design intent is the opposite — they should NOT reconstruct anomalies well.","B":"","C":"Autoencoders don't \"ignore\" features. The full reconstruction is attempted, and the scratch region will be reconstructed incorrectly (smoothed out), contributing to the high error.","D":"While individual results vary, the general expectation for a well-trained autoencoder on a novel anomaly type is high reconstruction error. \"Unpredictable\" understates the directional expectation."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-013","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T13 · Ensemble Methods] A stacking ensemble achieves 91% accuracy. The best single base model achieves 90%. A colleague says \"1% gain isn't worth the 5× inference cost.\" When is this trade-off justified, and what metric should guide the decision?","options":{"A":"1% accuracy improvement never justifies added cost","B":"The justification depends on business impact: in fraud detection, 1% more frauds caught could prevent millions in losses (high $C_{FN}$); in a low-stakes recommendation system, 1% may not justify the overhead; the decision should use expected business value: $\\Delta \\text{value} = \\Delta \\text{accuracy} \\times n_{\\text{predictions}} \\times \\text{value\\_per\\_correct\\_prediction}$; if this exceeds the cost of 5× inference, the ensemble is justified","C":"Stacking should always be used because accuracy improvements are always valuable","D":"The correct threshold for justifying an ensemble is always >5% accuracy improvement"},"correct":"B","explanation":{"correct":"- Cost-benefit analysis: $n = 10,000$ fraud predictions/day. 1% = 100 more caught frauds. At $\\$1000$ per fraud: $\\$100K/\\text{day}$ in additional value. 5× inference cost = trivial AWS cost increase. 
Justified.\n- For a web recommendation: 1% more relevant recommendations → marginal click-through improvement → revenue depends on traffic and monetization. May or may not be justified.\n- The \"5% threshold\" is arbitrary — it has no theoretical basis. Business impact should drive the decision.","A":"A 1% improvement can be extremely valuable in high-stakes, high-volume applications. \"Never justified\" is too absolute.","B":"","C":"Always using complex models ignores operational costs (latency, infrastructure, maintainability). Simple models that meet requirements are often preferable.","D":"No universal threshold exists. A 0.1% improvement in patient mortality prediction is highly significant; a 10% improvement in emoji suggestion accuracy may not justify complexity."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-014","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T14 · Model Evaluation] Nested cross-validation is described as the \"gold standard\" for unbiased model evaluation combined with hyperparameter tuning. What are the two loops, and what does each loop do?","options":{"A":"Outer loop trains the model; inner loop evaluates it","B":"Outer loop (k-fold): splits data into train+validation vs test — the test fold is never touched during model development; inner loop (k-fold on train+validation): performs hyperparameter selection via cross-validation; for each outer fold, the inner CV selects the best hyperparameters; the outer test fold evaluates final performance — this fold was never used in hyperparameter selection","C":"Outer loop does feature selection; inner loop does model training","D":"Nested CV is equivalent to double the cross-validation folds — it is a computational trick, not a methodological improvement"},"correct":"B","explanation":{"correct":"- Outer loop (5-fold): 5 test folds, each truly held out. 
Gives 5 held-out performance estimates → their average is an approximately unbiased performance estimate.\n- Inner loop (5-fold): for each outer training set, select the best hyperparameters using inner CV. A different set of hyperparameters may be selected for each outer fold.\n- Key insight: the outer test fold was NEVER used in the inner loop's hyperparameter selection → unbiased performance estimate.\n- Limitation: computationally expensive (5×5=25 model fits minimum). In sklearn, passing a `GridSearchCV` object as the estimator to `cross_val_score` implements this correctly.","A":"The description is reversed. The outer loop is the evaluation loop; the inner loop is the hyperparameter selection loop.","B":"","C":"Feature selection can be included in the pipeline but it is not the purpose of the outer/inner loop structure.","D":"Nested CV is a principled methodology to prevent test set contamination from hyperparameter tuning. It's not just a computational trick — it's the correct evaluation procedure."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-015","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T15 · Bias-Variance] A neural network trained with heavy L2 regularization achieves 78% training accuracy and 77% test accuracy (1% gap). Without regularization, the same network achieves 99% training and 78% test accuracy. 
Which configuration has better bias, better variance, and which is better overall?","options":{"A":"No regularization is better because it achieves higher training accuracy","B":"Heavy regularization: high bias (78% train, 22% training error vs optimal), low variance (1% gap); No regularization: low bias (99% train, 1% training error), high variance (21% gap from train to test); overall test accuracy is similar (77% vs 78%); regularization is not strictly better here — it achieved similar test performance but with high bias instead of high variance; the no-regularization model might be improved with more data rather than more regularization","C":"Heavy regularization is always better — any reduction in overfitting is beneficial","D":"The models are identical in all meaningful ways"},"correct":"B","explanation":{"correct":"- Test accuracy is nearly identical (77% vs 78%). The regularized model achieved similar generalization through bias (high training error, low gap) rather than allowing the model to fit and then generalize through variance.\n- Practical interpretation: if there were more training data available, the unregularized model could potentially reach both lower training error AND lower test error (variance drops with more data). The regularized model is limited by its high bias.\n- Neither is definitively \"better\" without context. The right amount of regularization balances bias and variance at the current data size.","A":"99% training accuracy with 21% train-test gap shows severe overfitting. High training accuracy alone doesn't make a model better.","B":"","C":"\"Any reduction in overfitting is beneficial\" ignores the bias cost. In this example, heavy regularization introduced so much bias that test performance barely improved.","D":"The models have very different training behaviors and the trade-off between bias and variance differs. 
They are not identical."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-016","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T16 · Regularization] A team compares Lasso (L1) and Ridge (L2) regression on a dataset with 15 features where they believe 5 are truly predictive and 10 are noise. Which algorithm should they use for best prediction + interpretability, and why?","options":{"A":"Ridge always outperforms Lasso for prediction — use Ridge","B":"Lasso is preferable when the true model is sparse (few relevant features); it will zero out the 10 noise features and select the 5 predictive ones; this gives a sparse, interpretable model with potentially better generalization; Ridge shrinks all 15 features but keeps all non-zero — the noise features contribute (small but non-zero) noise to predictions; for sparse true models, Lasso's feature selection improves both interpretability and prediction","C":"Both algorithms produce identical results on any dataset","D":"Use ElasticNet because it handles all scenarios equally well"},"correct":"B","explanation":{"correct":"- Sparse true model (5/15 predictors): this is exactly the scenario where Lasso shines. Lasso can achieve exact zero for noise features, recovering the true sparse model under certain conditions (irrepresentable condition).\n- Ridge: keeps all 15 features with small coefficients. The 10 noise features add small but real noise to predictions. Prediction variance is slightly higher than Lasso's 5-feature model.\n- When Ridge is better: when the true model has many small but all-non-zero effects (dense signal). In gene expression, many genes have small effects → Ridge may outperform Lasso.","A":"Ridge's advantage is for dense signal problems. For sparse signal (few true predictors), Lasso typically outperforms Ridge.","B":"","C":"Lasso and Ridge produce different solutions. 
For sparse true models, they differ in test performance and interpretability.","D":"ElasticNet is useful for correlated feature groups, not necessarily better for the described scenario. Using ElasticNet always would be over-engineering without clear benefit."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-017","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T17 · Feature Engineering] A dataset has a feature \"last_purchase_days_ago\" with values ranging from 0 to 1,825 (5 years). The distribution is heavily right-skewed (most customers purchased recently). A decision tree model is used. Does the skewness require transformation before tree-based modeling?","options":{"A":"Yes — all skewed features must be log-transformed before any ML model","B":"No — decision trees split on arbitrary thresholds and are invariant to monotone transformations; $\\log(\\text{last\\_purchase\\_days\\_ago})$ and the raw feature produce the same split structure (the thresholds change but the split boundaries are equivalent); tree-based models (RF, gradient boosting) do not require feature scaling or distribution transformation; transformation helps linear models and distance-based models, not tree models","C":"Yes — skewness causes trees to produce biased splits","D":"The feature should be binned (discretized) before using in decision trees"},"correct":"B","explanation":{"correct":"- Decision tree splits: find the threshold $t$ that maximizes impurity reduction. 
The splitting criterion doesn't care about the feature distribution — it evaluates all possible thresholds.\n- Monotone transformation invariance: if a strictly increasing function $f$ is applied to the feature, the optimal split threshold changes from $t$ to $f(t)$, but the resulting partition of the samples (and so the split's ability to separate classes) is identical.\n- Who needs transformation: linear models (skewed features create leverage points), neural networks (large gradients), KNN (distance distortion), K-means (centroid computation affected by outliers).","A":"This is incorrect. Tree-based models are monotone transformation invariant. Log-transforming before a tree model adds zero value.","B":"","C":"Skewness does not bias tree splits. Trees consider all possible thresholds and select the best one regardless of distribution shape.","D":"Pre-binning a feature before a decision tree is redundant — the tree discretizes features implicitly through its splitting process."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-018","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T13 · Ensemble Methods] Why does AdaBoost use exponential loss weighting while gradient boosting can use any differentiable loss function?","options":{"A":"AdaBoost and gradient boosting use the same loss function — exponential is the default for both","B":"AdaBoost was derived specifically from the exponential loss function — the multiplicative weight update $w_i \\leftarrow w_i \\exp(-\\alpha_t y_i h_t(x_i))$ is the natural consequence of minimizing exponential loss via forward stagewise additive modeling; gradient boosting generalizes this framework: any differentiable loss has a gradient, and any tree can be fitted to negative gradients; AdaBoost is a special case of gradient boosting with exponential loss","C":"The loss function choice has no effect on the model — only the base learner matters","D":"AdaBoost cannot be described as gradient boosting — they are entirely unrelated 
algorithms"},"correct":"B","explanation":{"correct":"- AdaBoost derivation: Friedman et al. showed AdaBoost is forward stagewise additive modeling minimizing exponential loss $L(y, f) = e^{-yf(x)}$.\n- Exponential loss is more sensitive to misclassified points (exponentially growing penalty) → makes AdaBoost more sensitive to outliers/noise.\n- Log-loss (as in LogitBoost) and Huber loss (robust regression) are less sensitive. Gradient boosting generalization: any loss with a computable gradient can be used → XGBoost supports log-loss, Huber, custom losses.","A":"Gradient boosting supports arbitrary differentiable losses (MSE for regression, log-loss for classification, Huber for robust regression). Exponential is only AdaBoost's loss.","B":"","C":"The loss function determines what the ensemble is optimizing — it has fundamental effects on which examples are emphasized, how robust the model is to noise, and the final model form.","D":"This is a well-established theoretical result. AdaBoost is a special case of gradient boosting with exponential loss and a specific tree-fitting procedure."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-019","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T08 · KNN] A KNN classifier with K=10 uses Euclidean distance. Features are: age (years, range 20-80), account_balance (dollars, range $0-$500,000). After scaling, the classifier performs much better. 
Why does KNN need feature scaling when, say, Random Forest does not?","options":{"A":"KNN needs scaling to prevent memory overflow with large feature values","B":"KNN's prediction is based entirely on the distance $\\sum_j (x_{ij} - x_{kj})^2$ — without scaling, account_balance (range 500K) contributes $(500,000)^2 = 2.5 \\times 10^{11}$ to the sum while age (range 60) contributes $(60)^2 = 3,600$; account_balance dominates completely; after standardization, both features contribute proportionally; Random Forest splits on individual feature thresholds — it doesn't combine features via distance, so scale doesn't affect which threshold is best","C":"KNN only needs scaling when using cosine distance, not Euclidean distance","D":"Both KNN and Random Forest need scaling — the question's premise is incorrect"},"correct":"B","explanation":{"correct":"- Scale sensitivity: any algorithm using Euclidean (or Minkowski) distance is sensitive to feature scale. KNN, K-means, SVM (RBF), linear/logistic regression with gradient descent, PCA — all benefit from scaling.\n- Tree-based invariance: a split at balance = $50,000 is equally valid whether the feature is in dollars or thousands of dollars. The split threshold changes but the tree structure (and splits' quality) is identical.\n- Standardization: $z = (x - \\mu) / \\sigma$. All features have mean 0, std 1 after scaling → equal contribution to distance calculations.","A":"Memory overflow is not a scaling concern. Large feature values don't cause memory issues.","B":"","C":"Euclidean distance is MORE sensitive to scale than cosine distance (which is inherently scale-invariant by design). Euclidean definitely needs scaling.","D":"Random Forest (and gradient boosting, decision trees) are scale-invariant. 
Scaling features before training a tree model adds zero value."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-020","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T01 · ML Fundamentals] A model is trained on features A, B, C to predict target Y. Feature B is computed as \"average of Y for the same customer in the past 30 days.\" The model achieves 99% training accuracy but fails in production on new customers (who have no history). What is the core issue?","options":{"A":"The model overfits because it uses too many features","B":"Feature B is computed from the target variable (past Y values); for new customers, B=0 or missing by definition; but more critically, during training B already encodes information about the customer's relationship with Y — the model implicitly uses Y to predict Y (target leakage); in production, B for new customers is meaningless or missing, breaking the model","C":"The model should not use historical data — only real-time features","D":"New customer failure is expected and acceptable — the model is designed for existing customers"},"correct":"B","explanation":{"correct":"- Feature leakage: \"average Y in past 30 days\" is a past value of the target. For existing customers, this feature carries strong predictive signal (past behavior predicts future behavior). But this is essentially using $Y$ to predict $Y$.\n- Production failure: new customers have no history → feature is 0 or NaN → the model's primary feature is gone → predictions become unreliable or require a cold-start fallback.\n- Proper feature engineering: B = \"average purchases in the past 30 days (the feature/action)\" vs \"average of the target (the outcome).\" Using the action is legitimate; using the outcome is leakage.","A":"Three features is not \"too many.\" The issue is specifically feature B's construction from the target.","B":"","C":"Historical features are legitimate and often the most predictive. 
The problem is not history in general — it's using the target itself (past Y) as a feature.","D":"If the model is deployed to new customers but fails for them, this is a production failure. The scope of deployment should match the model's designed population."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-021","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T07 · SVM] Two students debate: \"SVMs work best for high-dimensional data\" vs \"Neural networks always outperform SVMs in high dimensions.\" Which claim is supported by ML theory and practice?","options":{"A":"Neural networks always outperform SVMs regardless of dataset size and dimensionality","B":"SVMs have theoretical advantages in high-dimensional settings: the kernel trick maps to higher-dimensional spaces where the margin can be large; with limited data in high dimensions (p >> n), SVMs' maximum-margin objective provides implicit regularization; neural networks outperform SVMs when data is abundant, input is structured (images, text), and sufficient computational resources are available; neither claim is universally true","C":"SVMs always outperform neural networks in text classification because text is high-dimensional","D":"The comparison is irrelevant — dimensionality has no effect on relative model performance"},"correct":"B","explanation":{"correct":"- SVM advantages: (1) kernel methods are competitive or superior on small/medium datasets (n < 10,000); (2) well-understood theoretical generalization bounds (VC theory); (3) work well for text, biology (SVMs dominated NLP before deep learning).\n- Neural network advantages: (1) automatic feature learning scales to massive datasets; (2) CNNs/transformers achieve SOTA on structured inputs; (3) can learn hierarchical features that kernel methods struggle with.\n- Modern practice: deep learning has overtaken SVMs on most benchmarks with sufficient data. 
But SVMs remain competitive for small datasets.","A":"SVMs outperform neural networks on small datasets and some high-dimensional tasks. \"Always\" is definitively false.","B":"","C":"Deep learning (BERT, transformers) has surpassed SVMs on most NLP benchmarks since 2018. SVMs are competitive on small NLP datasets but not universally superior.","D":"Dimensionality matters significantly. In high dimensions with small n, kernel methods are often competitive with neural networks. In high dimensions with large n, neural networks dominate."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-022","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T02 · Linear Regression] A data scientist adds a polynomial feature $x^2$ to a linear regression for a housing price model. The training RMSE drops by 20%, test RMSE drops by 12%. A colleague says \"the $x^2$ term is cheating because it creates a nonlinear model.\" Is the colleague correct?","options":{"A":"Correct — adding polynomial features violates the linearity assumption of linear regression","B":"Incorrect — \"linear\" in linear regression refers to linearity in the parameters (weights), not in the features; $y = w_0 + w_1 x + w_2 x^2$ is a polynomial curve in $x$ but is linear in the parameters $w_0, w_1, w_2$; this is valid linear regression with engineered features; the linearity assumption in Gauss-Markov refers to $E[y|x]$ being linear in the parameters, not in $x$","C":"Correct — polynomial features require using polynomial regression, a completely different algorithm","D":"Incorrect — the $x^2$ feature technically makes the model nonlinear in both features and parameters"},"correct":"B","explanation":{"correct":"- Linear in parameters: $y = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_1^2$. The model is linear in $\\beta$. 
We can write $z = x_1^2$ and the model becomes $y = \\beta_0 + \\beta_1 x_1 + \\beta_2 z$ — standard linear regression on features $[x_1, z]$.\n- The OLS closed form, Gauss-Markov theorem, and all linear regression theory apply unchanged. The model fits a polynomial curve in feature space but a hyperplane in parameter space.\n- Feature engineering: adding $x^2, \\log(x), x_1 \\times x_2$ etc. all create new features for linear regression. This is standard practice.","A":"The linearity assumption refers to linearity in parameters, not features. Adding $x^2$ is a feature transformation, not a violation of the model's assumptions.","B":"","C":"\"Polynomial regression\" IS linear regression with polynomial features — the same algorithm, the same closed form solution. There's no separate \"polynomial regression\" algorithm.","D":"$y = w_0 + w_1 x + w_2 x^2$ is nonlinear in $x$ but linear in parameters $w$. OLS optimizes over $w$, not $x$, so the model is linear from the optimizer's perspective."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-023","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T10 · PCA] A data scientist reduces 50 features to 10 using PCA, then trains a Random Forest on the 10 PCs. The Random Forest interprets feature importances of the PCs. 
Is \"PC1 has the highest importance\" a useful finding?","options":{"A":"Yes — PC1 having high importance means the features with highest variance are most important","B":"Not directly — PC1 represents the direction of maximum variance in the original feature space; high importance means the model uses variance direction 1, but this direction is a linear combination of all 50 original features; to determine which original features matter, you must examine the PC1 loading vector (which original features contribute to PC1) and then validate whether the Random Forest is using PC1 due to its original-feature composition","C":"Yes — Random Forest importance on PCs directly measures the original feature importance","D":"PC importance is meaningless because Random Forests should not be used after PCA"},"correct":"B","explanation":{"correct":"- PC1 = $v_1^T x$ where $v_1$ is the first eigenvector. If $v_1 = [0.8, 0.6, 0.1, ..., 0.05]$, high PC1 importance means the model uses a combination weighted toward features 1 and 2.\n- To interpret in original feature space: $w_{\\text{original}} = V \\cdot w_{\\text{RF importance}}$ (approximate). This propagates PC importance back to original features.\n- Limitation: RF importance on PCs is valid for prediction, but interpretation in original feature space requires the back-transformation step.","A":"Maximum variance ≠ maximum predictive importance (as discussed in the topic file). PC1 might capture a direction orthogonal to the target.","B":"","C":"Random Forest feature importance measures the model's reliance on each PC. To map back to original features requires the PCA loading matrix.","D":"Using Random Forest after PCA is valid. The combination is used to reduce noise/dimensionality before tree-based learning. 
The interpretation requires care, not avoidance."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-024","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T11 · Clustering] A DBSCAN model labels 30% of data points as noise (-1). A junior analyst removes all noise points from the dataset for downstream analysis. What is the risk?","options":{"A":"No risk — noise points are definitionally not useful","B":"DBSCAN noise points may be the most analytically interesting observations — rare events, fraud cases, medical anomalies; removing 30% of data based on DBSCAN's density criterion (which depends on eps/min_samples hyperparameters) may discard real signals; additionally, 30% noise typically indicates eps is too small or min_samples too large — first tune the hyperparameters before removing points","C":"Removing noise points always improves downstream model performance","D":"Noise points should be assigned to the nearest cluster, not removed"},"correct":"B","explanation":{"correct":"- Business context matters: if the dataset is fraud transactions, DBSCAN noise points (isolated transactions not forming dense groups) may be the actual rare fraud events — exactly what you want to keep.\n- Hyperparameter sensitivity: 30% noise is unusually high, suggesting DBSCAN is poorly tuned. The k-distance elbow plot should be used to select eps before interpreting noise labels.\n- Downstream impact: removing 30% of data reduces the downstream model's training set significantly and may introduce systematic bias if noise points share common characteristics.","A":"\"Not useful\" conflates density-based anomaly with lack of value. Density anomalies are often the most interesting observations in anomaly detection, fraud, and scientific discovery.","B":"","C":"This claim is only valid if the noise points are truly uninformative artifacts (e.g., data entry errors). 
Without domain validation, removing data is risky.","D":"Forcing noise points into clusters is exactly what DBSCAN is designed to avoid. Boundary/noise points that don't meet density criteria should not be force-assigned — that would change the algorithm's semantics."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-025","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T15 · Bias-Variance] A team evaluates model performance across 5-fold cross-validation. The per-fold accuracies are: [0.82, 0.83, 0.91, 0.82, 0.81]. They report mean = 0.838. A statistician flags the fold-3 outlier. What might cause one fold to perform much better than the others, and what should the team investigate?","options":{"A":"Fold-3 has more samples than other folds — cross-validation always allocates unequal folds","B":"Fold-3 may have different class distribution (stratification issue), easier examples from one data stratum, or temporal ordering artifacts (if data has temporal structure and fold-3 happened to contain mostly \"easy\" time periods); the team should check fold-3's class distribution, sample characteristics, and whether the fold represents a distinct subpopulation","C":"High fold-3 accuracy is good — the model is doing its best on that fold","D":"Variance across folds is always random — outlier folds should be discarded as noise"},"correct":"B","explanation":{"correct":"- Expected CV fold variance: small differences (1-2%) are normal due to different test samples. 
A 9% jump (91% vs 82%) is unusual and warrants investigation.\n- Common causes: (1) non-stratified split → fold-3 has easier class distribution; (2) temporal leakage → fold-3 represents an earlier period with simpler patterns; (3) the fold captures a specific subpopulation the model handles well (niche feature combination).\n- Action: inspect fold-3's data distribution, run stratified K-fold if not already used, and report variance alongside mean.","A":"sklearn's KFold allocates ± 1 sample per fold (essentially equal). Sample count differences are not the cause.","B":"","C":"One fold being much easier may indicate a data quality or sampling issue. High performance on one fold doesn't validate the model globally.","D":"The outlier fold should be investigated, not discarded. Discarding would require justification. The variance itself is signal about data or evaluation methodology."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-026","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T16 · Regularization] A data scientist applies both L1 and L2 regularization to the same logistic regression model. What is this combined approach called, and how does the mixing ratio $\\rho$ in sklearn's `l1_ratio` parameter affect the behavior?","options":{"A":"This is called Elastic Net; l1_ratio=1 is pure L1 (Lasso behavior: sparsity); l1_ratio=0 is pure L2 (Ridge behavior: grouped shrinkage); values between 0 and 1 blend both, producing sparse models where correlated features are grouped rather than arbitrarily selecting one","B":"This is called Ridge Plus; it always produces exactly the same result as pure L2","C":"Combining L1 and L2 cancels both penalties out, producing an unregularized model","D":"L1 and L2 cannot be combined in the same model — their gradients are incompatible"},"correct":"A","explanation":{"correct":"- ElasticNet penalty: $\\lambda[\\rho ||w||_1 + \\frac{1-\\rho}{2}||w||_2^2]$. $\\rho = 1$: pure Lasso (sparse). 
$\\rho = 0$: pure Ridge (dense shrinkage).\n- Grouping effect: at $\\rho \\in (0, 1)$, correlated features tend to have similar (non-zero or zero) coefficients rather than Lasso's arbitrary selection of one.\n- Use cases: genomics (gene groups), NLP (word groups), any domain where feature groups exist and partial sparsity is desired.","A":"","B":"ElasticNet is not \"Ridge Plus.\" It has distinct behavior at intermediate l1_ratio values that neither pure Lasso nor pure Ridge achieves.","C":"Adding both penalties creates a combined penalty that is stronger, not zero. The gradient of the combined penalty is non-zero.","D":"L1 and L2 penalties are both subdifferentiable functions of $w$. They are combined additively. There is no gradient incompatibility."}},{"section":"machine-learning","difficulty":"medium","id":"ml-pract-medium-027","topicSlug":"practice","topic":"Practice","orderIndex":0,"question":"[T17 · Feature Selection] A wrapper method uses logistic regression as the base model for feature selection. The selected features are used to train an XGBoost model. 
A reviewer says \"the selected features are suboptimal because they were selected for logistic regression, not XGBoost.\" Is this concern valid?","options":{"A":"No: wrapper methods are model-agnostic and always select the best features for any model","B":"Valid concern: wrapper methods find features that work well for the specific base model used in selection; logistic regression finds linearly predictive features; XGBoost can leverage nonlinear interactions and feature combinations; a feature useless for logistic regression may be important for XGBoost (e.g., a feature that is only predictive in interaction with another); use XGBoost itself (or a simpler tree proxy) as the base model in the wrapper, or use model-agnostic filter methods","C":"Wrapper methods always select the same features regardless of the base model","D":"XGBoost should not be used with pre-selected features; it performs its own feature selection internally"},"correct":"B","explanation":{"correct":"- Model-specific feature selection: logistic regression evaluates features based on their linear contribution to log-odds. XGBoost evaluates features based on their contribution to split quality in nonlinear trees.\n- Example: feature X is useless alone for LR (no linear signal), but X × Z is highly predictive. A wrapper built on LR drops X, whereas XGBoost could have exploited X through interaction splits.\n- Best practice: use the final model (or a fast proxy) as the base model for wrapper-based feature selection to ensure feature relevance aligns with the final model's inductive bias.","A":"Wrapper methods are model-specific by design: they evaluate feature subsets using the base model's loss. Different base models can produce different feature selections.","B":"","C":"Different base models (LR vs XGBoost) evaluate features differently and can produce different selected subsets for the same dataset.","D":"XGBoost does perform internal feature selection (unused features receive zero importance). 
But external pre-selection can still improve performance by removing noisy features that can degrade XGBoost training."}}],"allTopics":[{"slug":"ml-fundamentals","label":"ML Fundamentals","section":"machine-learning","description":"Master ML Fundamentals interviewer-level concepts.","orderIndex":1,"mcqCount":15},{"slug":"linear-regression","label":"Linear Regression","section":"machine-learning","description":"Master Linear Regression interviewer-level concepts.","orderIndex":2,"mcqCount":15},{"slug":"logistic-regression","label":"Logistic Regression","section":"machine-learning","description":"Master Logistic Regression interviewer-level concepts.","orderIndex":3,"mcqCount":15},{"slug":"decision-trees","label":"Decision Trees","section":"machine-learning","description":"Master Decision Trees interviewer-level concepts.","orderIndex":4,"mcqCount":15},{"slug":"random-forest","label":"Random Forest","section":"machine-learning","description":"Master Random Forest interviewer-level concepts.","orderIndex":5,"mcqCount":15},{"slug":"gradient-boosting","label":"Gradient Boosting","section":"machine-learning","description":"Master Gradient Boosting interviewer-level concepts.","orderIndex":6,"mcqCount":15},{"slug":"support-vector-machines","label":"Support Vector Machines","section":"machine-learning","description":"Master Support Vector Machines interviewer-level concepts.","orderIndex":7,"mcqCount":15},{"slug":"k-nearest-neighbors","label":"K-Nearest Neighbors","section":"machine-learning","description":"Master K-Nearest Neighbors interviewer-level concepts.","orderIndex":8,"mcqCount":15},{"slug":"naive-bayes","label":"Naive Bayes","section":"machine-learning","description":"Master Naive Bayes interviewer-level concepts.","orderIndex":9,"mcqCount":14},{"slug":"pca-dimensionality-reduction","label":"PCA Dimensionality Reduction","section":"machine-learning","description":"Master PCA Dimensionality Reduction interviewer-level 
concepts.","orderIndex":10,"mcqCount":13},{"slug":"clustering","label":"Clustering","section":"machine-learning","description":"Master Clustering interviewer-level concepts.","orderIndex":11,"mcqCount":13},{"slug":"anomaly-detection","label":"Anomaly Detection","section":"machine-learning","description":"Master Anomaly Detection interviewer-level concepts.","orderIndex":12,"mcqCount":12},{"slug":"ensemble-methods","label":"Ensemble Methods","section":"machine-learning","description":"Master Ensemble Methods interviewer-level concepts.","orderIndex":13,"mcqCount":12},{"slug":"model-evaluation-and-metrics","label":"Model Evaluation And Metrics","section":"machine-learning","description":"Master Model Evaluation And Metrics interviewer-level concepts.","orderIndex":14,"mcqCount":12},{"slug":"bias-variance-tradeoff","label":"Bias Variance Tradeoff","section":"machine-learning","description":"Master Bias Variance Tradeoff interviewer-level concepts.","orderIndex":15,"mcqCount":11},{"slug":"regularization","label":"Regularization","section":"machine-learning","description":"Master Regularization interviewer-level concepts.","orderIndex":16,"mcqCount":10},{"slug":"feature-selection-and-engineering","label":"Feature Selection And Engineering","section":"machine-learning","description":"Master Feature Selection And Engineering interviewer-level concepts.","orderIndex":17,"mcqCount":12}],"tests":[{"id":"01-ml-fundamentals","name":"ML Foundations & Regression Models","level":"mixed","duration":18,"order":1,"description":"Topics 01–03: ML pipeline, learning paradigms, OLS assumptions, regularization, sigmoid decision boundary, and data leakage traps. 
Tests whether you understand the why behind foundational models — not just their names.","questionIds":["ml-01001","ml-02007","ml-03001","ml-01007","ml-02001","ml-03007","ml-01002","ml-02008","ml-03013","ml-01008","ml-02002","ml-01013"]},{"id":"04-decision-trees","name":"Tree Models & Boosting","level":"mixed","duration":18,"order":2,"description":"Topics 04–06: Gini vs entropy, pruning vs overfitting, bagging mechanics, OOB error, residual fitting, learning rate, and the real differences between XGBoost and LightGBM. High trap density — essential for any ML interview.","questionIds":["ml-04001","ml-05007","ml-06001","ml-04007","ml-05001","ml-06007","ml-04002","ml-05013","ml-06008","ml-04008","ml-05002","ml-04013","ml-06013"]},{"id":"07-support-vector-machines","name":"SVM, KNN & Probabilistic Classifiers","level":"mixed","duration":18,"order":3,"description":"Topics 07–09: Margin maximization, kernel trick, curse of dimensionality, k-selection traps, Laplace smoothing, and when the independence assumption breaks — or surprisingly doesn't. Tests distance-intuition and probabilistic reasoning.","questionIds":["ml-07001","ml-09006","ml-08001","ml-07007","ml-09001","ml-08007","ml-07002","ml-09007","ml-07013","ml-08002","ml-07008","ml-08013"]},{"id":"10-pca-dimensionality-reduction","name":"Dimensionality Reduction & Clustering","level":"mixed","duration":15,"order":4,"description":"Topics 10–11: Eigenvectors, variance explained, PCA leakage traps, when t-SNE lies, K-means convergence guarantees, DBSCAN parameter sensitivity, and silhouette score pitfalls. 
Essential for any role involving unsupervised pipelines.","questionIds":["ml-10001","ml-11006","ml-10006","ml-11001","ml-10007","ml-11002","ml-10002","ml-11007","ml-10012","ml-11012"]},{"id":"12-anomaly-detection","name":"Anomaly Detection & Ensemble Methods","level":"mixed","duration":15,"order":5,"description":"Topics 12–13: Isolation Forest mechanics, LOF vs OCSVM tradeoffs, autoencoder anomaly scoring, bagging vs boosting vs stacking distinctions, and when ensembles actively hurt. Tricky evaluation scenarios included.","questionIds":["ml-12001","ml-13005","ml-12005","ml-13001","ml-12006","ml-13002","ml-12002","ml-13006","ml-12011","ml-13011"]},{"id":"14-model-evaluation-and-metrics","name":"Evaluation, Bias-Variance & Regularization","level":"mixed","duration":18,"order":6,"description":"Topics 14–16: AUC-ROC vs PR curve, F1 tradeoffs, train vs test gap diagnosis, learning curves, L1 sparsity mechanics, Ridge vs Lasso selection, and dropout as regularization. Tests the reasoning behind metric and regularizer choice.","questionIds":["ml-14001","ml-15005","ml-16001","ml-14006","ml-15001","ml-16005","ml-14002","ml-15006","ml-14007","ml-15002","ml-14011","ml-16009"]},{"id":"17-feature-selection-and-engineering","name":"Feature Selection & Engineering","level":"mixed","duration":15,"order":7,"description":"Topic 17: Filter vs wrapper vs embedded methods, mutual information limits, SHAP-based selection, categorical encoding traps, missing value imputation bias, and feature leakage detection. The topic most commonly underestimated in ML interviews.","questionIds":["ml-17001","ml-17005","ml-17002","ml-17007","ml-17003","ml-17006","ml-17004","ml-17008","ml-17010","ml-17012"]},{"id":"mock-easy-01","name":"Easy Mock Interview — Round 1","level":"easy","duration":12,"order":8,"description":"Simulates an entry-level ML screening round. Broad coverage of 10 topics — fundamentals, supervised models, and evaluation. 
Every question has a trap designed to catch surface-level memorization. Pass this to prove you understand ML, not just its terminology.","questionIds":["ml-01001","ml-02002","ml-03001","ml-04002","ml-05001","ml-06002","ml-07001","ml-08001","ml-14001","ml-15001"]},{"id":"mock-easy-02","name":"Easy Mock Interview — Round 2","level":"easy","duration":12,"order":9,"description":"Simulates a second easy ML screening — different question set, different topics. Covers unsupervised models, probabilistic classifiers, and regularization basics. Ideal as a second attempt or a complementary drill to Round 1.","questionIds":["ml-01003","ml-09001","ml-10001","ml-11001","ml-12001","ml-13001","ml-16001","ml-17001","ml-03003","ml-05003"]},{"id":"mock-medium-01","name":"Medium Mock Interview — Round 1","level":"medium","duration":18,"order":10,"description":"Simulates a mid-level applied ML interview. Each question tests multi-step reasoning — design choices, debugging production models, and tradeoff analysis. Applied across 12 topics including boosting internals, kernel choices, and metric selection under class imbalance.","questionIds":["ml-01008","ml-02007","ml-03007","ml-04008","ml-05007","ml-06008","ml-07008","ml-08007","ml-10007","ml-12006","ml-14007","ml-16005"]},{"id":"mock-medium-02","name":"Medium Mock Interview — Round 2","level":"medium","duration":18,"order":11,"description":"Second applied ML interview simulation — entirely different question set from Round 1. Covers NB vs LR tradeoffs, boosting depth choices, clustering evaluation, ensemble diversity, and feature selection under collinearity. Tests whether you reason from first principles.","questionIds":["ml-01009","ml-02009","ml-03009","ml-04009","ml-05009","ml-06009","ml-07009","ml-09007","ml-11007","ml-13006","ml-15006","ml-17007"]},{"id":"mock-hard-01","name":"Hard Mock Interview — Round 1","level":"hard","duration":25,"order":12,"description":"Simulates a senior ML engineer interview round. 
15 hard questions across all major algorithm groups. Expect edge cases, non-intuitive behavior, and scenario traps that break if you're reasoning from memorized patterns rather than model internals. FAANG-calibre difficulty.","questionIds":["ml-01013","ml-02013","ml-03013","ml-04013","ml-05013","ml-06013","ml-07013","ml-08013","ml-09011","ml-10011","ml-11011","ml-12010","ml-13010","ml-14010","ml-15009"]},{"id":"mock-hard-02","name":"Hard Mock Interview — Round 2","level":"hard","duration":25,"order":13,"description":"Second hard mock interview — completely fresh question set. Tests architectural reasoning, optimization tradeoffs, debugging thought process, and production failure modes. Covers regularization effects, kernel behavior, anomaly evaluation under imbalance, and ensemble collapse conditions.","questionIds":["ml-01014","ml-02014","ml-03014","ml-04014","ml-05014","ml-06014","ml-07014","ml-08014","ml-09012","ml-10012","ml-11012","ml-12011","ml-13011","ml-14011","ml-16009"]},{"id":"elite-01","name":"Advanced Elite Test — Set 1","level":"elite","duration":35,"order":14,"description":"18 questions spanning all 17 ML topics at maximum depth. Designed for senior ML engineers, AI architects, and staff-level screening. Requires multi-step reasoning, internals knowledge, and production intuition. Questions test system tradeoffs, optimization failure modes, and edge cases that expose gaps in mental models — not surface-level recall.","questionIds":["ml-01015","ml-02015","ml-03015","ml-04015","ml-05015","ml-06015","ml-07015","ml-08015","ml-09014","ml-10013","ml-11013","ml-12012","ml-13012","ml-14012","ml-15011","ml-16010","ml-17012","ml-17011"]},{"id":"elite-02","name":"Advanced Elite Test — Set 2","level":"elite","duration":35,"order":15,"description":"Second elite assessment — entirely fresh question set across all 17 topics. 
Focuses on a different axis of depth: where Set 1 tests internals, Set 2 stresses architectural decisions, cross-topic interactions, and failure reasoning. Targeted at AI architects, ML platform engineers, and senior candidates who want to benchmark themselves against the hardest possible questions.","questionIds":["ml-01014","ml-02014","ml-03014","ml-04014","ml-05014","ml-06014","ml-07014","ml-08014","ml-09013","ml-10012","ml-11012","ml-12011","ml-13011","ml-14011","ml-15010","ml-16009","ml-17010","ml-17009"]}],"initialMode":"practice","initialTopic":"medium"}]