d:["$","$L16",null,{"section":{"slug":"deep-learning","label":"Deep Learning","shortLabel":"Deep Learning","description":"CNNs, RNNs, optimizers, and backpropagation traps.","seoTitle":"Deep Learning Interview Questions & MCQs","seoDescription":"Practice Deep Learning questions on CNNs, RNNs, optimizers, and backpropagation traps.","keywords":["Deep learning interview questions","Deep learning MCQs"],"icon":"D","iconColor":"bg-rose-600","status":"active","phase":3,"priority":0.9},"learnMcqs":[{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01001","difficulty":"easy","orderIndex":1,"question":"A textbook describes a biological neuron by mapping its parts to a perceptron: dendrites receive signals, the cell body sums them, and the axon fires if the sum exceeds a threshold. A student uses this analogy to argue that increasing the number of dendrites (input connections) on a neuron will always improve classification accuracy. What is wrong with this reasoning?","options":{"A":"More inputs increase computation time, which degrades accuracy in practice","B":"The number of inputs is fixed by the dataset — adding connections adds noise, not signal","C":"The biological analogy breaks down at this point: more weighted inputs expand the input space but do not change the fundamental linear decision boundary a single perceptron can represent","D":"Biological neurons operate in continuous time, so discrete perceptrons cannot model more dendrites"},"correct":"C","explanation":{"correct":"- A single perceptron computes a weighted sum of inputs and applies a threshold. The decision boundary it can represent is always a hyperplane — adding more input features expands the dimensionality but the separator remains linear.\n- The biological analogy is useful for intuition but does not imply that more connections enable non-linear separation. 
The constraint is architectural (a single layer that thresholds a linear function), not a matter of how many inputs it has.\n- In production, adding irrelevant features to a linear classifier typically hurts generalization (curse of dimensionality) without resolving non-linearly separable problems.","A":"Computation time is a systems concern, not a model capacity concern. The question is about classification accuracy as a function of model power, not wall-clock time.","B":"The number of inputs is determined by the feature space, but \"noise vs signal\" is a data quality argument, not a model capacity argument. The real issue is the linear decision boundary, not input noise.","C":"","D":"Discrete vs continuous time is an irrelevant distinction here. Standard perceptrons are not time-based and the analogy breakdown is about representational capacity, not temporal dynamics."},"reference":"- Rosenblatt, F., \"The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain\" (1958): https://psycnet.apa.org/record/1959-09865-001\n- Nielsen, M., \"Neural Networks and Deep Learning\", Chapter 1: http://neuralnetworksanddeeplearning.com/chap1.html"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01002","difficulty":"easy","orderIndex":2,"question":"A junior engineer implements a perceptron for binary classification and reports 100% training accuracy on an AND gate dataset. She then tries the same perceptron on an OR gate dataset and gets 100% again. Encouraged, she applies it directly to an XOR dataset and gets exactly 50% accuracy — no better than random. 
Why does this specific jump fail?","options":{"A":"The perceptron learning rule diverges for XOR because the learning rate is not tuned correctly for that dataset","B":"XOR is not linearly separable — no single straight line (hyperplane) can divide XOR's positive and negative examples in input space, which is the only type of boundary a perceptron can represent","C":"XOR requires binary inputs, but the perceptron interprets inputs as continuous, causing precision errors","D":"The training dataset for XOR has only 4 samples, which is insufficient for the perceptron to converge"},"correct":"B","explanation":{"correct":"- AND and OR are linearly separable: you can draw a line in 2D that perfectly separates their 0-outputs from 1-outputs. XOR cannot be separated by any hyperplane in the original input space.\n- The perceptron convergence theorem guarantees convergence only for linearly separable problems. On XOR, the algorithm oscillates indefinitely — it is not a learning rate or sample count problem.\n- This is historically significant: Minsky and Papert's 1969 analysis of XOR's non-separability contributed to the first \"AI winter\" by demonstrating fundamental limitations of single-layer networks.","A":"Tuning the learning rate cannot fix a geometric impossibility. The perceptron updates weights to minimize misclassifications, but no weight configuration produces zero errors for XOR on a single layer.","B":"","C":"XOR operates on {0,1} inputs, which are valid continuous values. Precision is not the issue — the problem is representational capacity of the linear model.","D":"The perceptron convergence theorem applies regardless of dataset size as long as the data is linearly separable. With only 4 points, XOR can be exhaustively enumerated and the non-separability is provable analytically, not statistically."},"reference":"- Minsky, M. 
& Papert, S., \"Perceptrons\" (1969): https://mitpress.mit.edu/9780262630221/perceptrons/\n- Visualizing XOR non-separability: https://playground.tensorflow.org/"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01003","difficulty":"easy","orderIndex":3,"question":"You are reviewing a colleague's perceptron implementation. The update rule modifies weights only when the prediction is wrong. Your colleague argues this is a bug — \"we should update weights on every sample to ensure the model keeps learning.\" Who is correct and why?","options":{"A":"The colleague is correct; skipping updates on correct samples wastes gradient information","B":"The original implementation is correct; updating only on misclassifications is the defining rule of the Perceptron algorithm, and updating on correct predictions would push the decision boundary away from correct examples","C":"Both approaches converge to the same solution; it is purely a performance optimization choice","D":"Neither is correct; perceptrons require batch updates across all samples simultaneously, not online per-sample updates"},"correct":"B","explanation":{"correct":"- The Perceptron learning rule (Rosenblatt, 1958) updates weights as: w ← w + η·(y - ŷ)·x. When y = ŷ (correct prediction), the update is zero by definition — not a special case but the mathematical result.\n- Updating weights on correctly classified samples would introduce unnecessary perturbations, potentially moving the decision boundary away from a valid separating hyperplane.\n- The Perceptron algorithm is guaranteed to converge (find a separating hyperplane in finite steps) for linearly separable data under the standard update-on-mistake rule. This guarantee does not hold for arbitrary update schedules.","A":"There is no gradient in a standard perceptron — it is not a gradient descent method. 
The concept of \"wasting gradient information\" doesn't apply; the update rule is a correction signal, not a gradient.","B":"","C":"The two approaches do not converge to the same solution. Updating on correct samples introduces drift and can cause oscillation even on linearly separable data.","D":"The Perceptron algorithm is inherently online (processes one sample at a time). Batch perceptrons exist but are not the standard formulation, and the question is about the classical single-sample update rule."},"reference":"- Novikoff, A.B.J., \"On convergence proofs for perceptrons\" (1963): classic convergence proof\n- https://cs229.stanford.edu/notes2022fall/cs229-notes6.pdf"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01004","difficulty":"medium","orderIndex":4,"question":"A team trains a perceptron to classify whether a loan should be approved (1) or rejected (0) based on two features: credit score and income. After training, they plot the decision boundary and find a straight line correctly separating all training samples. They then test on a held-out set and get 95% accuracy. A new feature — \"number of late payments\" — is added. Retraining yields 80% accuracy. The team concludes the new feature \"confused\" the perceptron. 
What is the most likely true cause?","options":{"A":"Adding a feature increases the input dimension, which always reduces accuracy in linear classifiers","B":"The original two features happened to make the training set linearly separable; with the third feature the perceptron stops at a different separating hyperplane, one that puts weight on a feature whose relationship to approval is noisy or non-linear, so the boundary generalizes worse to held-out data","C":"The perceptron cannot handle three or more features simultaneously — it is limited to two-dimensional inputs","D":"The learning rate must be reduced when adding features, otherwise the perceptron overshoots the optimal boundary"},"correct":"B","explanation":{"correct":"- Adding a feature cannot make a linearly separable training set non-separable: the original separating line still works in 3D with zero weight on the new feature. What changes is which of the many separating hyperplanes the perceptron stops at, because it halts at the first one it finds and has no preference for large margins.\n- \"Number of late payments\" likely has a noisy or non-linear relationship with approval (e.g., 0 late payments = good, but 1-3 may be borderline). A hyperplane that leans on this feature can fit the training samples while misclassifying held-out ones.\n- In practice, before adding features to linear models, teams should compare held-out accuracy with and without the new feature, and consider a margin-based or regularized linear model such as an SVM with a linear kernel.","A":"Higher dimensionality does not always reduce linear accuracy. If the new feature is linearly predictive, it can improve accuracy. The dimensionality itself is not the problem.","B":"","C":"A perceptron generalizes to any number of dimensions — it computes w·x + b for a weight vector of arbitrary length. There is no dimensionality cap.","D":"Learning rate affects convergence speed and stability, not the fundamental geometric feasibility of linear separation. 
If the data is linearly separable in 3D, any positive learning rate will eventually converge."},"reference":"- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01005","difficulty":"medium","orderIndex":5,"question":"The Perceptron Convergence Theorem states the algorithm will find a separating hyperplane in a finite number of updates if the data is linearly separable. A researcher applies a perceptron to a dataset and observes the algorithm running for 10,000 epochs without converging. She concludes the data must not be linearly separable. A colleague disagrees. Who is more likely correct, and why?","options":{"A":"The researcher is correct; non-convergence after sufficient epochs is the standard test for non-linear separability","B":"The colleague is more likely correct; the theorem guarantees finite steps proportional to the margin, and \"sufficient epochs\" depends on how small the margin is — tight margins can require millions of updates even for separable data","C":"Both are wrong; the perceptron always converges in at most n² steps where n is the number of samples","D":"The colleague is correct only if the learning rate is set to exactly 1.0; otherwise the theorem does not apply"},"correct":"B","explanation":{"correct":"- The convergence theorem bounds the number of updates by R²/γ², where R is the maximum norm of the input vectors and γ is the geometric margin (distance from the closest point to the separating hyperplane). 
If γ is very small (nearly non-separable data), R²/γ² can be enormous.\n- A dataset with a tiny margin (e.g., two classes separated by 0.001 in feature space) is technically linearly separable but may require millions of updates to converge — far exceeding what 10,000 epochs covers.\n- In practice, engineers use SVMs with a linear kernel to detect near-margin separability, rather than relying on perceptron convergence as a test.","A":"Non-convergence is not a definitive test for non-separability because the required iterations grow inversely with the margin squared. A practical epoch limit is not a mathematical proof.","B":"","C":"There is no n² step bound in the standard convergence theorem. The bound depends on R and γ, not solely on the number of samples.","D":"The convergence theorem holds for any positive learning rate η, not just η=1.0. The learning rate affects the scale of weight updates but not the convergence guarantee."},"reference":"- Novikoff convergence proof bound: R²/γ²\n- Shalev-Shwartz & Ben-David, \"Understanding Machine Learning\", Chapter 9"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01006","difficulty":"medium","orderIndex":6,"question":"Given the XOR truth table: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0 — a senior engineer claims you can solve XOR with a single perceptron by using a non-linear feature transformation: mapping (x₁, x₂) → (x₁, x₂, x₁·x₂). A junior engineer says this \"cheats\" and doesn't count as a perceptron solution. 
Who is right, and what does this reveal about neural networks?","options":{"A":"The senior engineer is wrong; XOR cannot be solved by any perceptron regardless of input transformation","B":"The junior engineer is right that it \"cheats\" — only raw features are valid inputs to a perceptron; adding derived features violates the perceptron definition","C":"The senior engineer is correct in principle: applying a feature map to create a higher-dimensional linearly separable representation is valid and is exactly what a hidden layer in a neural network computes automatically","D":"Both are partially right; the transformation works but requires the perceptron to have three inputs, which is only valid for 3-class problems"},"correct":"C","explanation":{"correct":"- In the transformed space (x₁, x₂, x₁x₂), XOR becomes linearly separable. For example, the plane w = [1, 1, -2] with bias -0.5 correctly classifies all four points. This is a valid perceptron on 3 features.\n- This insight is the core motivation for neural networks: a hidden layer computes a learned non-linear feature transformation (the \"representation\"), and the output layer performs linear classification in the transformed space.\n- The \"kernel trick\" in SVMs and the \"representation learning\" in deep networks are both formalizations of the same principle: learn or design a feature map that makes the problem linearly separable.","A":"XOR is absolutely solvable with the right feature map. The Minsky-Papert result says it is not solvable with raw inputs on a single-layer perceptron — not that it is fundamentally unsolvable.","B":"The perceptron model accepts any feature vector as input. There is no rule restricting inputs to \"raw\" features. Feature engineering is standard practice — the distinction is whether the transformation is manual or learned.","C":"","D":"The three inputs correspond to three features, not three classes. 
The number of inputs in a perceptron is independent of the number of output classes."},"reference":"- http://neuralnetworksanddeeplearning.com/chap4.html (visual proof of universal approximation)\n- Kernel trick and XOR: https://cs229.stanford.edu/"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01007","difficulty":"medium","orderIndex":7,"question":"A data scientist trains a neural network with one hidden layer (2 hidden units, ReLU) to solve XOR and achieves 100% accuracy. She then removes the hidden layer (making it a single perceptron) and retrains — the perceptron never converges. She concludes that \"more neurons\" is what solved XOR. A reviewer pushes back. What is the reviewer's most accurate correction?","options":{"A":"The reviewer is wrong; more computational units is exactly what solves XOR","B":"The reviewer would argue it is not the number of neurons but the non-linear hidden layer that transforms the input space into a representation where XOR becomes linearly separable — depth and non-linearity together enable this, not just adding neurons","C":"The reviewer would point out that a perceptron with 4 or more neurons in a single layer can also solve XOR","D":"The reviewer would argue the difference is the ReLU activation — a perceptron with ReLU could solve XOR without a hidden layer"},"correct":"B","explanation":{"correct":"- Adding neurons to a single-layer network (without a hidden layer) only produces more linear classifiers whose ensemble is still a linear function. You cannot combine linear functions to get a non-linear one without non-linearity between them.\n- The hidden layer with ReLU creates a piecewise-linear transformation of the input space. 
The two hidden units effectively create new features that separate XOR's pattern, and the output layer is a linear classifier on those features.\n- The key insight: it is the combination of non-linearity (activation functions) and depth (hidden layers) that grants representational power — not sheer neuron count in a flat architecture.","A":"\"More neurons\" in a single layer without non-linearity between them collapses to a single linear function (by the superposition property of linear transforms). This cannot solve XOR.","B":"","C":"A single-layer network with any number of neurons remains a linear classifier. Adding more neurons to a flat architecture is equivalent to increasing the width of one linear transformation, which stays linear.","D":"ReLU applied at the output layer of a single perceptron changes it from a linear to a piecewise-linear function, but the function is still a single hinge — it cannot separate XOR's four quadrant pattern. You need at least two ReLU units with different boundaries."},"reference":"- https://playground.tensorflow.org/ (interactive XOR solution with hidden layers)"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01008","difficulty":"hard","orderIndex":8,"question":"Minsky and Papert's 1969 analysis of perceptrons showed that a single-layer network cannot compute the XOR function with locally connected (limited-order) predicates. This result contributed to defunding of neural network research for nearly a decade. A historian argues: \"The AI winter was a rational response — Minsky proved neural networks were fundamentally flawed.\" A modern ML researcher disputes this characterization. 
What is the most technically precise basis for the researcher's disagreement?","options":{"A":"Minsky's proof was mathematically incorrect and has since been disproven","B":"Minsky and Papert explicitly noted that multi-layer networks could overcome these limitations, and the generalization of their result to all neural networks was an overinterpretation that the field accepted uncritically","C":"XOR is not an important real-world problem, so the limitation was overstated from the beginning","D":"Backpropagation was already known in 1969 and could have solved XOR immediately, making the AI winter purely political"},"correct":"B","explanation":{"correct":"- Minsky and Papert's book explicitly discussed multi-layer perceptrons in the final chapter and noted that their analysis did not extend to networks with hidden layers. The \"AI winter\" resulted from the research community overgeneralizing a proof about single-layer networks.\n- The community's mistake was assuming that because hidden-layer networks lacked training algorithms (backpropagation wasn't practically known/applied until Rumelhart et al., 1986), they were not worth pursuing — conflating \"hard to train\" with \"fundamentally limited.\"\n- This is a historically important lesson about how limitations of a specific model can be misread as limitations of an entire research paradigm.","A":"Minsky and Papert's proofs are mathematically correct for their stated scope (finite-order perceptrons, single layer). The issue was scope of interpretation, not mathematical error.","B":"","C":"XOR is a canonical non-linear classification problem. Its unsolvability by a single perceptron directly implies that any non-linearly separable problem — which is the vast majority of real-world problems — cannot be solved by a flat network.","D":"Backpropagation was not practically known in 1969. Werbos derived it in his 1974 thesis, and the key popularization was Rumelhart, Hinton & Williams in 1986. 
The AI winter was partly due to the genuine absence of a practical training method for multi-layer networks."},"reference":"- Minsky & Papert, \"Perceptrons\" (1969)\n- Rumelhart, Hinton & Williams, \"Learning representations by back-propagating errors\" (1986): https://www.nature.com/articles/323533a0"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01009","difficulty":"hard","orderIndex":9,"question":"You are building a neural network from scratch to solve XOR. With two hidden units and sigmoid activations, the network trains successfully. You then replace the sigmoid with a linear activation (f(x) = x) in the hidden layer, keeping everything else identical, and retrain from scratch. The network now fails to solve XOR. Your manager asks why changing \"just the activation\" breaks it. What is the exact mathematical reason?","options":{"A":"Linear activations cause gradient explosion during backpropagation, preventing convergence","B":"A network with linear activations in hidden layers is mathematically equivalent to a single-layer linear network regardless of depth — the composition of linear functions is itself a linear function, eliminating all non-linear representational power","C":"Linear activations saturate at large values, causing the hidden layer to output constants for XOR's inputs","D":"Linear activations require a different learning rate than sigmoid activations; the existing hyperparameters are incompatible"},"correct":"B","explanation":{"correct":"- If hidden layer j computes h = W₂(W₁x + b₁) + b₂, this simplifies to (W₂W₁)x + (W₂b₁ + b₂) = Wx + b — a single affine transformation. 
No depth of linear layers adds representational power beyond a single layer.\n- This is the mathematical proof that depth alone does not grant expressiveness — non-linear activation functions are the critical ingredient that makes composition of layers more powerful than any single layer.\n- In practice, this means a 100-layer fully linear network is equivalent to a single linear model: linear regression with a linear output, or logistic regression with a sigmoid output. Non-linearity (sigmoid, ReLU, tanh) is not an implementation detail; it is the source of a network's non-linear representational power.","A":"Linear activations do not cause gradient explosion by themselves. In fact, the gradient of a linear activation is a constant (1.0), which is numerically very stable. The issue is representational, not optimization-related.","B":"","C":"Linear activations do not saturate — their output is unbounded. Saturation is a property of sigmoid and tanh, where outputs asymptote to 0 or 1 (or -1/1), causing vanishing gradients.","D":"Learning rate is a hyperparameter of the optimizer. While different activations may benefit from different learning rates, the failure to solve XOR is fundamental — no learning rate will allow a linear network to represent XOR."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 6.3 (Hidden Units and Depth): https://www.deeplearningbook.org/"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01010","difficulty":"hard","orderIndex":10,"question":"A researcher plots the loss landscape of a perceptron trained on a linearly separable dataset and observes that the loss surface has many local minima. She uses this as evidence that the perceptron learning rule is unreliable. A senior ML engineer disagrees. 
What is the most precise technical reason the senior engineer is correct?","options":{"A":"Modern perceptrons use Adam optimizer which avoids local minima completely","B":"The perceptron uses a step function (threshold activation), making its loss non-differentiable, but the update rule is a direct correction rule, not gradient descent — there are no local minima in the relevant sense because the algorithm is not minimizing a smooth loss function","C":"The loss landscape of a linearly separable problem has exactly one global minimum by definition, so local minima cannot exist","D":"The perceptron averages updates across all misclassified samples, which statistically eliminates local minima"},"correct":"B","explanation":{"correct":"- The classical Perceptron algorithm does not perform gradient descent. It applies a correction w ← w + η·(y - ŷ)·x directly when a sample is misclassified. There is no differentiable loss being minimized.\n- The concept of \"local minima\" in an optimization sense applies to gradient-based methods minimizing a smooth scalar loss. For the Perceptron, convergence is guaranteed by the geometric structure of the problem (Novikoff's theorem), not by a loss landscape argument.\n- The confusion arises because researchers familiar with modern deep learning (where gradient descent on smooth losses is universal) incorrectly apply loss landscape intuitions to algorithms that don't operate on smooth losses.","A":"The standard perceptron does not use Adam or any adaptive optimizer. And Adam does not \"avoid local minima completely\" — it converges to local minima more efficiently than SGD but does not escape them in general.","B":"","C":"For a linearly separable problem, there are infinitely many valid separating hyperplanes (any hyperplane in the margin region works), so the \"solution\" is not unique. 
The loss landscape argument is moot for the Perceptron's update rule.","D":"The classical perceptron is an online algorithm — it updates on one sample at a time, not as a batch average. Even mini-batch averaging doesn't eliminate local minima in gradient descent."},"reference":"- Novikoff, A.B.J., \"On convergence proofs for perceptrons\" (1963)\n- https://cs229.stanford.edu/notes2022fall/cs229-notes6.pdf"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01011","difficulty":"medium","orderIndex":11,"question":"A neural network with 3 input features, 1 hidden layer (4 units), and 1 output unit is described as having \"two layers.\" A student insists it has \"three layers\" because she counts the input, hidden, and output layers. In a job interview, which answer is expected and what is the correct convention?","options":{"A":"The student is correct — always count all layers including input; this is the IEEE standard","B":"Both conventions are used, but in interviews and research papers, layer count typically refers to the number of layers with learnable parameters (weight matrices). The input layer has no parameters, so the network is called a \"2-layer network\" or \"1-hidden-layer network\"","C":"The network has 4 layers because each hidden unit counts as a separate layer","D":"The correct count is always the total number of weight matrices plus the number of bias vectors"},"correct":"B","explanation":{"correct":"- In the deep learning community (and in most interview contexts), \"N-layer network\" refers to N layers with learnable parameters. An input layer simply passes data and has no weights, so it is not counted.\n- A \"2-layer network\" has 1 hidden layer and 1 output layer. A \"3-layer network\" has 2 hidden layers. This is the convention used in Goodfellow et al.'s \"Deep Learning\" textbook and most research papers.\n- Ambiguity in layer counting is a common source of confusion. 
Being precise (\"a network with one hidden layer\" vs \"a 2-layer network\") is better practice in technical communication.","A":"There is no IEEE standard that mandates counting the input layer. The convention varies by context, but the dominant research/interview convention excludes the input layer from the count.","B":"","C":"Counting individual neurons as layers is incorrect. A \"layer\" is a set of neurons that process inputs in parallel and share the same position in the network topology — not individual units.","D":"The number of weight matrices equals the number of layers with parameters, but bias vectors belong to their layer and are not counted separately. Adding them would give 2 + 2 = 4 for this network, which matches no standard convention."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 6.1 (Example: Learning XOR)"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01012","difficulty":"medium","orderIndex":12,"question":"You have a dataset with two features (x₁, x₂) and a binary label. You visualize the data and see that the positive class forms a ring around the negative class (concentric circles). You train a perceptron on this data for 1000 epochs. What will you observe and why?","options":{"A":"The perceptron will converge to roughly 50% accuracy because the classes are balanced and it cannot separate them","B":"The perceptron will oscillate without converging because the data is not linearly separable: no straight line can separate a ring from the class it surrounds","C":"The perceptron will converge to approximately 75% accuracy because it can correctly classify 3 of the 4 quadrants","D":"The perceptron will converge slowly but eventually find a separating line once the learning rate decays sufficiently"},"correct":"B","explanation":{"correct":"- Concentric circles (the \"rings\" dataset) is a canonical example of a non-linearly separable problem. 
The positive class (ring) surrounds the negative class (center), so no straight line (a hyperplane in 2D) can separate the two.\n- By the Perceptron Convergence Theorem, the algorithm converges only for linearly separable data. For non-separable data, the update rule oscillates — it corrects misclassifications on one side only to re-misclassify others on the next pass.\n- This is why kernel methods (RBF kernel maps to infinite-dimensional feature space where circles become separable) and neural networks (learn a non-linear boundary) were developed.","A":"50% accuracy is possible but not guaranteed. A line through the center scores about 50% on balanced classes, while a line offset from the center can score well above 50% by keeping the whole inner cluster on one side and cutting off part of the ring. The defining behavior is non-convergence and oscillation, not a specific accuracy.","B":"","C":"The perceptron cannot be analyzed as correctly classifying \"quadrants\" — its boundary is a single hyperplane, not a quadrant decomposition. 75% is not a meaningful prediction for this geometry.","D":"Learning rate decay affects convergence speed for separable data but does not affect the fundamental impossibility of linear separation. The rings dataset remains non-linearly separable regardless of learning rate schedule."},"reference":"- https://playground.tensorflow.org/ (rings dataset visualization)\n- Scikit-learn make_circles dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01013","difficulty":"hard","orderIndex":13,"question":"A team tries to solve a 4-class classification problem using perceptrons with step function outputs. They encode the 4 classes as binary pairs: (0,0), (0,1), (1,0), (1,1), and train one perceptron per bit position (two perceptrons total). Each perceptron achieves 90% accuracy on its binary subtask. The team concludes the combined system achieves 90% accuracy on the 4-class problem. 
What is the flaw in this reasoning?","options":{"A":"Two perceptrons cannot share inputs — they must use different feature subsets","B":"The team ignores error compounding: a sample is classified correctly only when both perceptrons are right, so unless the two make errors on exactly the same samples, the combined 4-class accuracy is below 90%. It is 0.9 × 0.9 = 81% if errors are independent and can drop to 80% if the errors never overlap","C":"Step function outputs cannot be combined; the team should use sigmoid activations to enable probability combination","D":"This architecture is equivalent to one perceptron with 8 outputs, which would achieve 81% accuracy due to class interference"},"correct":"B","explanation":{"correct":"- A sample is correctly classified in 4-class space only if both binary classifiers are correct simultaneously. If each classifier errs on 10% of samples independently, P(both correct) = 0.9 × 0.9 = 0.81.\n- Correlation between the errors moves this figure within the range 80% to 90%: if both classifiers fail on the same hard examples, the errors overlap and combined accuracy rises toward 90%; if they fail on disjoint samples, it falls to 80%. Only perfect overlap justifies the team's 90% claim.\n- This is a common mistake in multi-label and multi-class decomposition strategies: under independence, component accuracies multiply, so the combined accuracy is lower than that of either component.","A":"Perceptrons can absolutely share the same input feature vector. There is no architectural reason they must use different features. In fact, sharing features is standard in multi-output networks.","B":"","C":"Step functions can be combined via logical operations or majority vote. The issue is not the activation type but the compounding of errors. Sigmoid would not fix the 90% × 90% = 81% problem.","D":"A single perceptron with 8 outputs is a multi-output linear model. Its accuracy depends on the problem geometry. 
The error compounding calculation is specific to the two-independent-classifier setup, not to the number of outputs."},"reference":"- Multi-label classification error analysis: https://scikit-learn.org/stable/modules/multiclass.html"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01014","difficulty":"hard","orderIndex":14,"question":"Consider a neural network with 2 inputs, 2 hidden units (sigmoid), and 1 output (sigmoid). You manually set the weights to make the first hidden unit compute approximately AND(x₁, x₂) and the second compute approximately OR(x₁, x₂). The output unit is set to compute NOT(AND) AND OR, which is equivalent to XOR. A student argues this \"proves\" XOR is solvable but doesn't generalize because the weights were hand-crafted. What does this demonstration actually prove about neural networks?","options":{"A":"Nothing useful — hand-crafted weights don't count as learning","B":"It proves that a shallow neural network with non-linear activations has sufficient representational capacity to express XOR — the weights exist. Learning algorithms (backpropagation) are responsible for finding those weights automatically","C":"It proves that sigmoid activations are necessary for XOR — ReLU or tanh would fail in this configuration","D":"It proves XOR requires exactly 2 hidden units — fewer units cannot express the function"},"correct":"B","explanation":{"correct":"- Existence of a weight configuration that solves XOR proves the model has the representational capacity. Gradient-based learning is an algorithm for finding such weights — it is a search problem, not a capacity problem.\n- This separation between \"expressiveness\" (what can the model represent?) and \"learnability\" (can the optimizer find it?) is fundamental. 
The Universal Approximation Theorem proves existence of weights for any continuous function; backpropagation is the practical search algorithm.\n- Hand-crafted demonstrations are valid proofs of capacity. The reason we need learning algorithms is that for high-dimensional problems with millions of parameters, manual weight design is infeasible.","A":"Hand-crafted weights are a proof by construction. In mathematics, existence proofs by construction are the strongest form of existence proof. This absolutely \"counts.\"","B":"","C":"Sigmoid is used here for convenience (it approximates AND and OR with the right weights), but ReLU networks can also represent XOR and any other function that networks in general can represent. The activation choice affects the specific weight values, not the representational capacity.","D":"You can solve XOR with 2 hidden units (as demonstrated), but this does not prove it is the minimum. A single hidden unit with a quadratic transformation can also solve XOR. Minimum complexity is a separate research question."},"reference":"- Cybenko, G., \"Approximation by superpositions of a sigmoidal function\" (1989): the original Universal Approximation Theorem\n- http://neuralnetworksanddeeplearning.com/chap4.html"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01015","difficulty":"medium","orderIndex":15,"question":"You are onboarding a new team member who asks: \"If neural networks are just compositions of matrix multiplications and activation functions, why are they so powerful? 
Linear algebra is simple.\" What is the most technically complete answer that bridges the theory to practice?","options":{"A":"Neural networks are powerful because matrix multiplication is GPU-accelerated, enabling much larger models than older methods","B":"The power comes from the interaction of three properties: non-linear activations enabling universal approximation, depth allowing hierarchical feature composition, and the availability of gradient descent to search the exponentially large weight space efficiently","C":"Neural networks are powerful primarily because of the large amounts of data they are trained on — the architecture itself is not special","D":"The activation functions convert the linear operations into non-linear ones, which is equivalent to performing kernel regression in infinite-dimensional space for all practical purposes"},"correct":"B","explanation":{"correct":"- Non-linear activations alone (without depth) give universal approximation in theory but require exponentially many hidden units. Depth allows hierarchical composition (edges → shapes → objects in vision), which is exponentially more efficient for structured data.\n- Gradient descent with backpropagation navigates a loss surface with billions of parameters — a search problem that would be intractable with brute force but is made feasible by automatic differentiation and modern hardware.\n- The combination of all three — expressiveness, efficiency, and trainability — is what makes deep networks uniquely powerful. Each factor alone is insufficient.","A":"GPU acceleration is an implementation advantage, not a theoretical source of power. Neural networks were theoretically powerful before GPUs; GPUs made them practically scalable.","B":"","C":"Data is essential for generalization but does not explain why a neural network can learn better representations than a linear model given the same data. 
The architecture determines what can be represented.","D":"The neural tangent kernel (NTK) framework shows that infinitely wide networks are equivalent to kernel methods, but this is a limiting theoretical result. In practice, finite-width deep networks do not behave as kernel machines and often outperform them by learning adaptive representations."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapters 6–8: https://www.deeplearningbook.org/\n- LeCun, Bengio & Hinton, \"Deep learning\" (Nature 2015): https://www.nature.com/articles/nature14539"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02001","difficulty":"easy","orderIndex":1,"question":"A neural network layer computes z = Wx + b, where W is a 64×128 weight matrix, x is a 128-dimensional input, and b is a 64-dimensional bias. A new engineer adds an extra bias vector of shape (64,) after the activation and trains the model. He is surprised to find no improvement. What is the most likely reason?","options":{"A":"Bias terms must always be initialized to zero; adding a second bias with random initialization causes training instability","B":"Two additive bias terms on the same layer collapse into a single effective bias — the network cannot distinguish between the two, so no extra representational capacity is gained","C":"The second bias vector is outside the activation function, so it bypasses the non-linearity and breaks the gradient flow","D":"A 64-dimensional bias is too large; standard practice limits bias size to match the input dimension"},"correct":"B","explanation":{"correct":"- The layer computes: output = f(Wx + b₁) + b₂. Since b₁ and b₂ are both learned, the optimizer can achieve the same result by absorbing any value of b₂ into b₁ (before the activation), adjusted for the activation's effect. 
The second bias adds a parameter but not representational power.\n- More precisely, if the activation is linear, b₁ + b₂ collapses into one bias. With a non-linear activation, b₂ is a constant shift of the layer's output, and when another affine layer follows, that layer's bias absorbs it: W₂(h + b₂) + b₃ = W₂h + (W₂b₂ + b₃).\n- Adding redundant parameters increases memory and computation with no model capacity gain. This is a common mistake when engineers try to \"boost\" a layer without understanding what parameters do.","A":"Initializing biases to zero is standard (symmetry-breaking concerns apply to weights, not biases), but the second bias won't cause instability — it simply provides no benefit.","B":"","C":"The gradient flows correctly through addition. Placing a bias after an activation is valid mathematically and does backpropagate gradients — it just doesn't help.","D":"There is no standard that requires bias size to match input dimension. Bias size matches output dimension (64), which is already correct in this setup."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 6.2 (Gradient-Based Learning)"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02002","difficulty":"easy","orderIndex":2,"question":"In a multilayer perceptron, every unit in layer k is connected to every unit in layer k+1. A team decides to remove all connections between the first and third layer (no skip connections) and reports the network is \"equivalent\" to the original. 
A second team adds direct connections from input to output layer and says this is strictly \"more powerful.\" Which team is correct?","options":{"A":"First team is correct — removing non-adjacent connections doesn't change anything since gradients don't flow through skipped layers anyway","B":"Second team is correct — adding skip connections from input to output layer adds a new direct linear pathway, meaning the network can represent functions that the non-skip version cannot, specifically residual linear transformations of the input","C":"Both teams are correct — both architectures compute identical functions with different parameterizations","D":"Neither claim is correct — removing any connection changes the output and adding connections changes the architecture class entirely"},"correct":"B","explanation":{"correct":"- In a standard MLP, each layer's output is a transformed version of the previous layer only. Adding a direct input-to-output connection creates a pathway that computes output = f(deep_path(x)) + W_skip·x, allowing the network to represent functions that are \"a deep transformation plus a direct linear term.\"\n- This is the architectural insight behind ResNets: skip connections allow the network to easily learn identity functions (if the residual branch is zero, the skip dominates), which addresses vanishing gradients and enables very deep networks.\n- The first team is wrong because \"removing connections between non-adjacent layers\" is vacuously true in a standard MLP (those connections don't exist to begin with) — the claim is about removing existing adjacent connections, which would reduce capacity.","A":"In a standard MLP, there are no first-to-third-layer connections to remove. 
If they meant removing first-to-second connections, that would reduce representational capacity dramatically by disconnecting parts of the network.","B":"","C":"Skip connections create new computational pathways — the architectures are not equivalent in terms of representable functions, even with different parameterizations.","D":"The second team's claim is correct by the argument in the explanation. Adding skip connections does add representational power (a new linear pathway)."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (ResNet): https://arxiv.org/abs/1512.03385"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02003","difficulty":"easy","orderIndex":3,"question":"A neural network's hidden layer has 100 units, all initialized to the same weight vector w₀ and same bias b₀. The network trains for 100 epochs but all hidden units remain identical throughout training. Why does this happen even though the loss is non-zero and gradients are flowing?","options":{"A":"Identical initialization causes NaN gradients because the loss surface has a flat region at symmetric points","B":"Since all units receive the same input and compute the same output, backpropagation produces identical gradients for every unit — they receive the same update and remain permanently symmetric throughout training","C":"This is expected behavior; the network converges to a unique solution where all units specialize identically","D":"The optimizer averages gradients across units, cancelling out individual updates and preventing specialization"},"correct":"B","explanation":{"correct":"- This is the \"symmetry breaking\" problem. If all weights in a layer are identical, every unit computes the same pre-activation value z = w·x + b. Their outputs are identical, so the loss gradient with respect to each unit's weights is identical. 
Each unit receives the same gradient update, keeping them identical forever.\n- The result is a layer of 100 units that behaves identically to a single unit — massive parameter waste with no representational gain.\n- This is why weights are initialized randomly (Xavier/He initialization): to break symmetry so different units can specialize to different features during training.","A":"Identical initialization does not cause NaN gradients. The gradients are well-defined and finite — they are just identical across units, causing symmetric updates, not numerical failure.","B":"","C":"There is nothing \"correct\" about identical units. The network converges but learns a degenerate solution with far less capacity than intended. A 100-unit layer that behaves like a 1-unit layer wastes 99% of its parameters.","D":"Backpropagation computes individual per-weight gradients, not averaged gradients. The gradients are identical because the forward-pass outputs are identical, not because the optimizer averages them."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 8.4 (Parameter Initialization Strategies: symmetry breaking)"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02004","difficulty":"medium","orderIndex":4,"question":"A single hidden unit in a neural network computes: output = sigmoid(w₁·x₁ + w₂·x₂ + b). You are told the unit has learned w₁ = 3.0, w₂ = -3.0, b = 0. Without running any code, predict: for input (1, 0), what does this unit detect, and what happens to its output as you scale the input (100, 0)?","options":{"A":"The unit outputs 0.95 for (1,0) and approaches 1.0 for (100, 0), meaning it is a feature detector that saturates — strong evidence of feature x₁ being present causes the sigmoid to \"clamp\" at 1","B":"The unit outputs sigmoid(3) ≈ 0.95 for (1,0) and sigmoid(300) ≈ 1.0 for (100,0). 
The unit detects \"x₁ > x₂\" (since w₁ = −w₂) but saturates — large inputs collapse the gradient to near zero, which is the vanishing gradient problem at the activation level","C":"The unit outputs 0.5 for both inputs because the bias is 0, which forces the sigmoid to its center value","D":"Scaling the input has no effect because the sigmoid output is bounded between 0 and 1 regardless of input magnitude"},"correct":"B","explanation":{"correct":"- For (1,0): z = 3·1 + (−3)·0 + 0 = 3, sigmoid(3) ≈ 0.9526. For (100,0): z = 300, sigmoid(300) ≈ 1.0 (to machine precision).\n- The weight pattern w₁ = 3, w₂ = −3 means the unit activates when x₁ >> x₂ (it computes a difference detector). The bias of 0 centers the threshold at x₁ = x₂.\n- The critical insight: when z is large (300), sigmoid'(z) = sigmoid(z)(1−sigmoid(z)) ≈ 1·0 = 0. The gradient is effectively zero, so this unit contributes nothing to weight updates for large-magnitude inputs — the vanishing gradient problem.","A":"Partially correct (saturation is real), but misses the crucial production implication: vanishing gradients mean this unit stops learning once inputs are large. This is the core reason ReLU replaced sigmoid for hidden layers.","B":"","C":"The bias is 0, but the output for (1,0) is sigmoid(3) ≈ 0.95, not 0.5. Sigmoid outputs 0.5 only when z = 0. For (1,0), z = 3, not 0.","D":"Scaling the input does affect the output — it changes z which changes the sigmoid output. The output is bounded between 0 and 1, but the specific value changes with input magnitude."},"reference":"- Hochreiter, \"The vanishing gradient problem during learning recurrent neural nets\" (1998)\n- https://cs231n.github.io/neural-networks-1/"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02005","difficulty":"medium","orderIndex":5,"question":"You have a trained MLP classifier. 
During inference on a new input, you notice that 80% of the hidden units in the first layer output values very close to 0. A teammate says this is a sign the model is \"broken\" and suggests retraining with a larger network. Is the teammate correct?","options":{"A":"Yes — 80% zero activations means 80% of the network's capacity is wasted, and a larger network would use more capacity","B":"No — sparse activation is often a sign of a well-trained network. If the model uses ReLU, units that are inactive on a specific input indicate that their features are absent from that input; this is feature selectivity, not a bug","C":"Yes — all hidden units should have roughly equal activation magnitudes for the network to be efficient","D":"No — 80% zero activations means the model has overfit and is memorizing training data by deactivating most units"},"correct":"B","explanation":{"correct":"- Sparse activations in ReLU networks are a feature, not a bug. A unit outputting 0 for a given input means that input doesn't trigger the feature that unit represents. Different inputs activate different subsets of units — this is the network's learned feature selectivity.\n- This is analogous to sparse coding in neuroscience (Olshausen & Field, 1997), where most neurons are silent for any given stimulus. Sparse representations are often more interpretable and energy-efficient, and can generalize better.\n- If 80% of units are always 0 regardless of the input (dead ReLU), that is a different problem. But 80% zeros for specific inputs is expected and desirable.","A":"Capacity is not measured by activation counts. A unit that is 0 for one input may be active for other inputs and contribute meaningfully to those predictions. \"Capacity\" in neural networks is about expressiveness over the distribution of inputs, not per-sample activation density.","B":"","C":"Uniform activation magnitudes would imply every unit is equally relevant to every input — this contradicts the idea of feature specialization. 
Uniform activations are more characteristic of poorly trained or random networks.","D":"Overfitting manifests as poor generalization (large train/test gap), not as sparse activations. A model can be sparse and well-generalized, or dense and overfit."},"reference":"- Olshausen & Field, \"Sparse coding with an overcomplete basis set\" (1997)\n- https://cs231n.github.io/neural-networks-1/#actfun"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02006","difficulty":"medium","orderIndex":6,"question":"A network has 3 hidden layers with widths [512, 256, 128]. You double the width of the first layer to 1024. Your colleague claims this \"doubles the network's capacity.\" A senior researcher disagrees. What is the most accurate statement about what actually changes?","options":{"A":"The colleague is correct — capacity scales linearly with the number of parameters in the first layer","B":"Doubling the first layer width doubles the parameters in both the first weight matrix (input → layer 1) and the second matrix (layer 1 → layer 2), but \"capacity\" in the meaningful sense (ability to separate complex decision boundaries) grows sub-linearly and depends on the interaction with depth and non-linearities","C":"Doubling width has no effect because the bottleneck at 128 units in the final hidden layer limits total capacity","D":"Doubling the first layer width doubles the network's VC dimension exactly"},"correct":"B","explanation":{"correct":"- If the input has dimension d and first layer has n₁ units, the first weight matrix is n₁×d, so doubling n₁ doubles this matrix's parameter count. The second weight matrix n₂×n₁ also doubles. Extra parameters: n₁·d + n₁·n₂ (plus n₁ extra biases), where n₁ is the original width.\n- However, \"capacity\" in the sense of the VC dimension or Rademacher complexity depends non-linearly on width, depth, and their interaction. 
Empirically, wider networks tend to improve performance but with diminishing returns.\n- The bottleneck argument (option C) has some validity — the narrowest layer constrains information flow — but capacity is not purely determined by the narrowest layer.","A":"Capacity does not scale linearly with parameter count. Two networks with the same parameter count but different architectures can have very different effective capacities. VC dimension for neural networks scales roughly as O(W log W) where W is weight count, not O(W).","B":"","C":"The bottleneck layer does constrain the network (it's why autoencoders use narrow bottlenecks), but making earlier layers wider still increases the representational richness of intermediate representations, which can improve performance even with the same bottleneck size.","D":"VC dimension for neural networks does not scale exactly with width in a simple linear fashion. Exact VC dimension computations for MLPs are complex and depend on the activation function, depth, and connectivity."},"reference":"- Bartlett et al., \"Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks\" (2019): https://arxiv.org/abs/1703.02930"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02007","difficulty":"medium","orderIndex":7,"question":"You build a multi-layer perceptron for regression (predicting house prices). The output layer has a single unit with no activation function (linear output). 
A colleague says you should add a ReLU activation to the output unit \"because house prices can't be negative.\" Should you follow this advice?","codeSnippet":"# Current output layer\noutput = nn.Linear(64, 1) # no activation\n\n# Proposed change\noutput = nn.Sequential(nn.Linear(64, 1), nn.ReLU())","options":{"A":"Yes — ReLU on the output ensures non-negative predictions and is always a good practice for price prediction","B":"No — adding ReLU to the output layer constrains predictions to non-negative values but also kills gradients for any training sample where the pre-activation value is negative, preventing the model from learning from those examples","C":"Yes — ReLU is differentiable everywhere except 0, so it has no meaningful impact on training while ensuring valid predictions","D":"No — the output should use softmax instead of ReLU for regression tasks"},"correct":"B","explanation":{"correct":"- If the model predicts a negative pre-activation value for some training samples, ReLU clips the output to 0, making the loss gradient with respect to those samples zero (ReLU gradient is 0 for negative inputs). The model literally cannot learn from those examples.\n- Early in training, many pre-activation values will be negative (random initialization spreads around 0). Adding output ReLU causes a significant fraction of training samples to have zero gradient — effectively \"dead\" output units for those inputs.\n- Better alternatives: (1) use no activation and let L2/Huber loss implicitly penalize negative predictions relative to ground truth, (2) use Softplus (smooth approximation to ReLU) which has non-zero gradients everywhere, or (3) apply log transformation to house prices and predict in log space.","A":"Domain constraint is a valid motivation, but the implementation using ReLU is harmful. The domain constraint must be balanced against trainability. 
Dead gradients on output units prevent learning.","B":"","C":"ReLU is not differentiable at 0 (undefined, or defined as 0 by convention). More importantly, it is 0 everywhere for x < 0, which means zero gradient — a very meaningful impact on training.","D":"Softmax is for classification (multi-class probability distributions summing to 1), not regression. It is completely wrong for a single continuous output."},"reference":"- https://cs231n.github.io/neural-networks-2/#reg (output activation choices)"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02008","difficulty":"hard","orderIndex":8,"question":"A network has two fully connected layers: Layer 1 computes h = ReLU(W₁x + b₁) and Layer 2 computes output = W₂h + b₂. You freeze Layer 1 (stop its gradients) and train only Layer 2. Your manager claims: \"Freezing Layer 1 is equivalent to reducing the problem to linear regression on fixed features.\" Is this claim correct?","options":{"A":"Yes — if Layer 1 is frozen, the output is a linear function of the fixed hidden representation h, which is the definition of linear regression","B":"Partially correct — the output is linear in h (the frozen layer's output), but h = ReLU(W₁x + b₁) is a non-linear function of x. The problem is linear in h but non-linear in the original input x — it is equivalent to kernel regression with a fixed non-linear feature map","C":"No — freezing Layer 1 still allows non-linear interactions because the optimizer can adjust the bias b₂ to create thresholding effects","D":"Yes, but only if the batch size is 1; for larger batches, the matrix operations become non-linear"},"correct":"B","explanation":{"correct":"- W₂h + b₂ is indeed linear in h (Layer 2 is a linear function of its inputs). If h is fixed (frozen Layer 1), training Layer 2 is exactly linear regression where h is the feature vector.\n- However, h = ReLU(W₁x + b₁) is a non-linear function of the original input x. 
So the end-to-end function output = W₂·ReLU(W₁x + b₁) + b₂ is non-linear in x.\n- This is the foundation of transfer learning and feature extraction: freeze a pre-trained backbone (non-linear feature extractor), train only the linear head. You get the expressive features of the deep network while the training problem is simplified to convex linear regression.","A":"\"Linear regression on fixed features\" is partially correct but misses the crucial point that the features themselves are non-linear transformations of the input. Pure linear regression operates on the raw input; this operates on a non-linear embedding.","B":"","C":"Bias b₂ is a single vector — adjusting it shifts the output uniformly but does not create element-wise thresholding. A linear layer with learnable bias is still a linear (affine) function of its input h.","D":"Batch size has no effect on the functional form of a neural network layer. The same linear transformation applies to each sample in the batch independently. Non-linearity does not emerge from batching."},"reference":"- Transfer learning and linear probe evaluation: https://arxiv.org/abs/2002.05709"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02009","difficulty":"hard","orderIndex":9,"question":"Consider a fully connected layer with 1000 input units and 1000 output units. The weight matrix W is 1000×1000. A researcher proposes replacing W with a low-rank factorization: W ≈ AB where A is 1000×r and B is r×1000, with r = 10. The forward pass becomes: output = (AB)x = A(Bx). 
What is the exact parameter reduction, and what capability does the network lose?","options":{"A":"Parameters drop from 10⁶ to 20,000 (98% reduction); the network loses the ability to express high-frequency input patterns","B":"Parameters drop from 10⁶ to 1000·r + r·1000 = 2·1000·10 = 20,000 (98% reduction); the network loses the ability to represent any linear transformation whose rank exceeds r=10 — specifically, any output that requires more than 10 independent directions in input space","C":"Parameters drop from 10⁶ to 10,000; the network loses skip connections between non-adjacent layers","D":"Parameters drop from 10⁶ to 20,000; the network loses non-linearity because the product of two matrices is always linear"},"correct":"B","explanation":{"correct":"- Original: 1000×1000 = 1,000,000 parameters. Factored: 1000×10 + 10×1000 = 10,000 + 10,000 = 20,000 parameters. Reduction: 98%.\n- The product AB has rank at most r=10. This means the transformation can only map inputs into a subspace of the output space with at most 10 dimensions. Any output pattern requiring more than 10 independent \"basis directions\" cannot be represented.\n- Low-rank factorization is used in model compression, where trained weight matrices are often approximately low-rank, and LoRA applies the same idea to fine-tuning by learning a low-rank update to frozen weights, on the hypothesis that the update has low intrinsic rank.","A":"\"High-frequency input patterns\" is not a well-defined loss for a linear transformation. The constraint is rank (number of independent directions), not frequency. Frequency is a concept for convolutional/signal processing contexts.","B":"","C":"1000·10 + 10·1000 = 20,000, not 10,000. The calculation in C is off by a factor of 2. Skip connections are an architectural choice unrelated to rank factorization.","D":"The product of two matrices AB is indeed a matrix (linear transformation), but the full weight matrix W is also linear. Low-rank factorization does not reduce linearity — the transformation was already linear. 
The concern is rank, not linearity."},"reference":"- LoRA: Low-Rank Adaptation of Large Language Models: https://arxiv.org/abs/2106.09685\n- Hu et al., LoRA paper explains exactly this parameter reduction mechanism"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02010","difficulty":"hard","orderIndex":10,"question":"You train two networks on the same dataset: Network A has 3 layers of width 100 (300 total units), and Network B has 1 layer of width 300 (300 total units). Both use ReLU and identical training procedures. Network A significantly outperforms Network B. An interviewer asks you to explain exactly why depth helps here beyond just \"more layers = more power.\"","options":{"A":"Network A has more parameters because it has more weight matrices, which directly causes better performance","B":"Depth allows hierarchical composition of simple functions: each layer can detect increasingly abstract features by composing the outputs of previous layers. A 3-layer network can represent functions of functions of features, while a 1-layer network requires representing the full pattern directly — for structured data, this hierarchy is exponentially more efficient","C":"Network A benefits from more gradient steps per layer during backpropagation, which improves optimization","D":"Deeper networks have higher variance, which in the bias-variance tradeoff means better fit to complex training distributions"},"correct":"B","explanation":{"correct":"- The exponential efficiency of depth (Bengio & LeCun, 2007; Telgarsky, 2016) is mathematically established: certain functions that require exponentially many neurons to represent in a shallow network can be represented with polynomially many neurons in a deep network.\n- Concretely for vision: Layer 1 detects edges, Layer 2 composes edges into shapes, Layer 3 composes shapes into objects. 
A single-layer network must represent object detection directly from pixels — requiring far more neurons to carve out the same decision boundaries.\n- The key phrase is \"for structured data with compositional structure.\" If the data has no hierarchical structure, depth may not help significantly.","A":"Network A does not necessarily have more total parameters than B. Width 100 with 3 layers: W₁ is input×100, W₂ is 100×100, W₃ is 100×output. Network B: W₁ is input×300, W₂ is 300×output. For large inputs, B may have more parameters in W₁. Parameter count alone doesn't explain the performance gap.","B":"","C":"Backpropagation does not give each layer more gradient steps — all layers are updated in a single backward pass. \"More gradient steps per layer\" is a misunderstanding of how backprop works.","D":"Higher variance from depth does not automatically improve fit. Deeper networks are both higher variance and higher capacity, but uncontrolled variance leads to overfitting, not better performance. The advantage of depth is efficiency of representation, not variance."},"reference":"- Bengio & LeCun, \"Scaling algorithms towards AI\" (2007)\n- Telgarsky, \"Benefits of depth in neural networks\" (2016): https://arxiv.org/abs/1602.04485"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02011","difficulty":"medium","orderIndex":11,"question":"A network's weight matrix W for one layer has been trained and you visualize its rows (each row represents the weights going INTO one output unit). You see that many rows are nearly identical (high cosine similarity between rows). What does this imply about the network?","options":{"A":"The layer is well-trained — identical weights mean the units have converged to a stable solution","B":"The layer likely has redundant units — multiple neurons are detecting the same feature in the input, which wastes capacity. 
This can happen due to poor initialization, insufficient regularization, or the network being overparameterized for the task","C":"This is a sign of overfitting — identical weights in a layer mean the model has memorized training data","D":"Identical rows are expected because weight sharing is required for neural networks to generalize"},"correct":"B","explanation":{"correct":"- Each row of W represents the \"feature detector\" of one output neuron. If many rows are nearly identical, many neurons are detecting the same pattern, providing no additional information.\n- This indicates either: (1) the layer has more units than needed for the task (overparameterization), (2) symmetry breaking failed despite random initialization (rare but possible with very small weights), or (3) regularization is insufficient to push units toward diverse representations.\n- In practice, this is detected via the \"effective rank\" of W. A low effective rank (most singular values near zero) means the layer is not using its full representational capacity.","A":"Convergence to a stable solution should produce diverse weights (different feature detectors). Identical rows are a degenerate convergence, not a good one. A well-trained layer typically shows varied, diverse rows.","B":"","C":"Overfitting manifests as poor generalization (large train/test gap), not identical weights. Memorization of training data would typically produce highly varied weights keyed to specific training examples, not identical rows.","D":"Weight sharing is a specific architectural choice (e.g., convolutional layers share weights spatially). In a fully connected layer, weight sharing is not expected or required. 
Identical rows are not \"sharing\" — they're redundancy."},"reference":"- Frankle & Carbin, \"The Lottery Ticket Hypothesis\" (2019): https://arxiv.org/abs/1803.03635"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02012","difficulty":"easy","orderIndex":12,"question":"A perceptron computes: output = 1 if (w₁x₁ + w₂x₂ + b) ≥ 0, else 0. You set w₁ = 1, w₂ = 1, b = -1.5. Evaluate the outputs for all inputs in {0,1}². What logical gate does this perceptron implement?","options":{"A":"OR gate — outputs 1 whenever at least one input is 1","B":"AND gate — outputs 1 only when both inputs are 1, because the bias -1.5 requires both x₁ and x₂ to be active simultaneously","C":"NAND gate — outputs 0 only when both inputs are 1","D":"XOR gate — outputs 1 when inputs differ"},"correct":"B","explanation":{"correct":"- (0,0): 0+0-1.5 = -1.5 < 0 → output 0. (0,1): 0+1-1.5 = -0.5 < 0 → output 0. (1,0): 1+0-1.5 = -0.5 < 0 → output 0. (1,1): 1+1-1.5 = 0.5 ≥ 0 → output 1.\n- Only (1,1) → 1, which is exactly the AND function. The bias -1.5 requires the sum w₁x₁ + w₂x₂ ≥ 1.5, which is only satisfied when both inputs are 1 (sum = 2).\n- This demonstrates that logical gates are representable as perceptrons and that the bias term controls the threshold — b = -0.5 would give OR, b = -1.5 gives AND. The same weights, different bias = different gate.","A":"OR gate requires the sum ≥ 1, which needs b = -0.5 (not -1.5). With b = -0.5: (0,1) → 0.5 ≥ 0 → 1 ✓, (1,0) → 0.5 ≥ 0 → 1 ✓, (1,1) → 1.5 ≥ 0 → 1 ✓.","B":"","C":"NAND outputs 0 only for (1,1) and 1 otherwise — the inverse of AND. 
This requires different weights or a negated threshold structure.","D":"XOR outputs 1 for (0,1) and (1,0) only — which is not linearly separable and cannot be represented by any single perceptron with fixed weights and bias."},"reference":"- http://neuralnetworksanddeeplearning.com/chap1.html#perceptrons"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02013","difficulty":"medium","orderIndex":13,"question":"You are training an MLP on a tabular dataset with 50 features. You add a hidden layer with 1000 units and observe strong training accuracy but poor validation accuracy. You then reduce the hidden layer to 10 units and observe poor training accuracy and poor validation accuracy. What does this tell you about network depth/width intuition for tabular data?","options":{"A":"Tabular data always requires very deep networks; the problem is insufficient depth, not width","B":"1000 units overfit (high variance), 10 units underfit (high bias) — the optimal width for this problem is somewhere between 10 and 1000, and the right size depends on the complexity of the underlying data pattern relative to the feature space","C":"The poor validation with 1000 units proves the training data is corrupted; no amount of tuning will help","D":"Tabular data is incompatible with fully connected layers; convolutional layers should be used instead"},"correct":"B","explanation":{"correct":"- Classic bias-variance tradeoff: too many parameters relative to data complexity leads to memorization (overfitting = high variance); too few parameters leads to inability to capture patterns (underfitting = high bias).\n- For tabular data with 50 features, the right width depends on: how many meaningful non-linear interactions exist, how many training samples are available, and what regularization is applied.\n- In practice, tabular data often performs well with relatively modest network sizes (128-512 units per layer) combined with dropout 
and weight decay. Blindly increasing width doesn't help without regularization.","A":"Deeper networks don't automatically solve overfitting from wide layers. Adding more layers to an already overparameterized network typically increases overfitting further. Depth and width both affect capacity — depth is not a cure for width-induced overfitting.","B":"","C":"Overfitting (good train, bad validation) is a normal consequence of having more model capacity than data complexity warrants. It does not imply data corruption — which would manifest as poor training accuracy or high noise.","D":"Fully connected layers are absolutely valid for tabular data. CNNs are designed for grid-structured data (images, sequences). Tabular data lacks spatial locality, making CNNs inappropriate."},"reference":"- Shwartz-Ziv & Armon, \"Tabular data: deep learning is not all you need\" (2022): https://arxiv.org/abs/2106.03253"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02014","difficulty":"hard","orderIndex":14,"question":"A network's weight matrix W has been trained on task A. You want to transfer it to task B by fine-tuning only the last layer. After fine-tuning, you compute the gradient magnitude of the last layer's weights vs. the frozen earlier layers (which have zero gradient by design). An engineer proposes measuring the \"network depth utilization\" as the ratio of active (non-frozen) parameters to total parameters, and says networks with low utilization are \"underusing their depth.\" What is wrong with this metric?","options":{"A":"Nothing — depth utilization is a valid and widely used metric in transfer learning research","B":"The metric conflates parameter count with representational contribution. 
A frozen layer with rich, general features contributes heavily to the output even with zero gradient — measuring utilization by gradient flow ignores that frozen layers still perform computation and determine what features are available to the trainable head","C":"The metric is valid but should count neurons, not parameters, to normalize for layer width differences","D":"Gradient magnitude in the last layer should be normalized by the number of samples in the dataset, not compared to frozen layers"},"correct":"B","explanation":{"correct":"- Transfer learning's entire value proposition is that frozen layers provide learned features — even though their parameters don't update, their forward-pass computation is the core of what makes transfer learning work. The frozen ResNet-50 backbone extracts rich visual features; only the final linear head is trained.\n- \"Depth utilization\" as gradient-fraction creates a perverse incentive: it would rate a randomly initialized network with no frozen layers as 100% utilized, and a perfectly pretrained network with a fine-tuned head as poorly utilized.\n- Meaningful transfer learning metrics include: (a) linear probe accuracy (how good are frozen features?), (b) fine-tuning efficiency (how few samples are needed?), and (c) feature alignment between source and target domain.","A":"\"Depth utilization\" as defined (gradient-active vs total parameters) is not a standard metric in transfer learning research. The concept sounds reasonable but is fundamentally flawed as argued.","B":"","C":"Counting neurons vs parameters doesn't fix the fundamental problem: frozen neurons still compute and contribute to the output. 
The issue is the meaning of \"utilization,\" not the normalization.","D":"Normalizing by dataset size is relevant for gradient scaling analysis, but the core issue here is the conceptual flaw in equating gradient flow with contribution."},"reference":"- Kumar et al., \"Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution\" (2022): https://arxiv.org/abs/2202.10054"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02015","difficulty":"hard","orderIndex":15,"question":"You're debugging a wide MLP (2000 hidden units per layer, 5 layers) that shows a puzzling behavior: training loss decreases but at roughly 1/10 the rate of a narrower network (200 units, same depth) on the same task. Both networks use the same learning rate and batch size. Without profiling, what is the most likely cause and how should it be investigated?","options":{"A":"The wide network has 100x more parameters so requires 100x more epochs to converge at the same learning rate — this is expected and not a bug","B":"The effective learning rate per parameter is too small for the wide network's loss landscape; with more parameters, the gradient signal is \"diluted\" — but the real likely cause is the gradient magnitude scaling issue: wider layers produce larger activations which can cause gradients to scale differently, requiring learning rate tuning proportional to width","C":"The wide network is computing unnecessarily — 2000 units exceed the intrinsic dimensionality of the task, so most units deactivate and gradients vanish","D":"The 5-layer depth causes vanishing gradients in both networks equally; the width difference is irrelevant"},"correct":"B","explanation":{"correct":"- In wide networks, the variance of pre-activations scales with fan-in (number of input connections). 
Without proper initialization (e.g., He initialization scales weights by √(2/fan-in) for ReLU), activations can explode, causing gradient instability and slow convergence.\n- Additionally, under standard parameterization the best learning rate generally shifts as width changes (in some parameterizations it scales with a power of the layer's fan-in or fan-out). A learning rate tuned for width-200 layers is therefore unlikely to be well suited to width-2000 layers.\n- Investigation: (1) plot activation norms per layer to detect scaling issues, (2) check gradient norms per layer to find vanishing/exploding gradients, (3) try μP (maximal update parameterization) which enables learning rate transfer across widths.","A":"The number of epochs to converge doesn't scale linearly with parameter count. With the same batch size and learning rate, a wider network makes similar gradient steps in wall-clock time (if hardware can handle it). \"Needs 100x more epochs\" is empirically false for well-initialized networks.","B":"","C":"Unit deactivation (dead ReLU) would cause near-zero gradients only for those units — other units would still train normally. 2000 units doesn't inherently cause mass deactivation unless initialization or learning rate is wrong.","D":"Vanishing gradients from depth would affect both networks similarly if they have the same depth. The width difference is the relevant factor for the described behavior."},"reference":"- Yang & Hu, \"Feature Learning in Infinite-Width Neural Networks\" (μP): https://arxiv.org/abs/2011.14522\n- He et al., \"Delving Deep into Rectifiers\" (He initialization): https://arxiv.org/abs/1502.01852"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03001","difficulty":"easy","orderIndex":1,"question":"A sigmoid activation outputs values in (0,1). You use it in a hidden layer of a deep network with 10 layers. During training you observe that gradients in the first 3 layers are approximately 10⁻⁶ while gradients in the last 3 layers are approximately 0.1. 
What causes this disparity and what is the standard fix?","options":{"A":"The first layers receive less data during backpropagation because batches are processed sequentially; fix by increasing batch size","B":"Sigmoid's derivative σ'(z) = σ(z)(1−σ(z)) has a maximum of 0.25 at z=0 and approaches 0 for large |z|. In a 10-layer network, multiplying 10 such terms produces gradients on the order of 0.25¹⁰ ≈ 10⁻⁶ — the vanishing gradient problem. Standard fix: replace sigmoid in hidden layers with ReLU, whose derivative is 1 for positive inputs","C":"The first layers are closer to the random initialization and haven't received enough gradient signal; fix by training longer","D":"Deep networks always have small gradients in early layers; this is expected and does not affect training"},"correct":"B","explanation":{"correct":"- The chain rule multiplies Jacobians across layers. Ignoring the weight terms, each sigmoid layer contributes a factor of at most 0.25. After 10 layers: 0.25^10 ≈ 9.5×10⁻⁷, matching the observed 10⁻⁶ magnitude.\n- ReLU's derivative is exactly 1 for positive inputs, meaning gradients pass through ReLU layers without attenuation (for the active units). This is why ReLU largely mitigated the vanishing gradient problem for deep feedforward networks.\n- The vanishing gradient problem is one of the primary historical reasons deep networks were difficult to train before the 2010s, when ReLU, careful initialization, and later BatchNorm (2015) became standard.","A":"Backpropagation processes the entire batch uniformly. Batch size affects gradient noise/stability, not the systematic decay of gradient magnitude across layers.","B":"","C":"Training longer doesn't fix vanishing gradients. The small gradients mean early-layer weights update negligibly per step — more steps on near-zero gradients still converge extremely slowly or not at all.","D":"Small gradients in early layers are NOT expected or acceptable — they are the symptom of the vanishing gradient problem. 
Networks with vanishing gradients effectively don't train their early layers, wasting depth."},"reference":"- Glorot & Bengio, \"Understanding the difficulty of training deep feedforward networks\" (2010): https://proceedings.mlr.press/v9/glorot10a.html\n- https://cs231n.github.io/neural-networks-1/#actfun"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03002","difficulty":"easy","orderIndex":2,"question":"You replace all ReLU activations in a trained model with tanh activations and retrain from scratch. Training is significantly slower and final accuracy is lower. What is the most likely technical cause for both effects?","options":{"A":"tanh outputs are in (-1, 1) instead of (0, ∞) for ReLU, making gradients negative which confuses the optimizer","B":"tanh saturates for |z| > 2 (derivative → 0) causing vanishing gradients in deeper layers, while ReLU has a derivative of 1 for all positive inputs, enabling stable gradient flow in deep networks","C":"tanh requires complex number arithmetic which is slower on GPU hardware than the max(0, x) operation of ReLU","D":"tanh activations produce zero-centered outputs which cause weight update interference between neurons in the same layer"},"correct":"B","explanation":{"correct":"- tanh'(z) = 1 - tanh²(z), which approaches 0 as |z| → ∞. For large pre-activation values (common after a few training steps), tanh saturates and gradients vanish.\n- ReLU's derivative is exactly 1 for z > 0, meaning gradients pass through without scaling down. In deep networks (10+ layers), this difference is dramatic: tanh compounds to near-zero gradients, ReLU maintains stable gradient magnitude.\n- Additionally, ReLU is computationally cheaper (max(0,x) vs exponentials in tanh), which partially explains the speed difference.","A":"Negative gradients don't \"confuse\" optimizers. 
Gradient descent operates on the sign and magnitude of gradients — negative gradients are completely valid and expected for parameters that need to decrease.","B":"","C":"tanh uses exponentials (e^z), not complex number arithmetic. Modern hardware handles this efficiently. The performance difference between tanh and ReLU is real but due to computational complexity (exp vs max), not complex numbers.","D":"Zero-centered outputs are actually a desirable property of tanh (sigmoid's outputs are not zero-centered, which is a disadvantage). Zero-centered activations reduce update \"zig-zagging\" effects. This is not the cause of slower training."},"reference":"- LeCun et al., \"Efficient BackProp\" (1998): recommends tanh over sigmoid but ReLU superseded both\n- Nair & Hinton, \"Rectified Linear Units Improve Restricted Boltzmann Machines\" (2010)"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03003","difficulty":"easy","orderIndex":3,"question":"A team initializes all weights in a ReLU network to small positive values near zero. After one epoch, they notice that 60% of neurons permanently output 0 and never recover, even after 100 more epochs. What is this phenomenon and what caused it here?","options":{"A":"Dead ReLU problem — caused by large negative pre-activations causing ReLU to output 0 with zero gradient. Here it was triggered by poor weight initialization producing many negative pre-activations from the start","B":"Gradient explosion — small initial weights cause gradients to grow exponentially backward through the network","C":"Overfitting — the neurons are deactivating to memorize specific training samples","D":"Mode collapse — the ReLU neurons collapse to a single output mode which outputs 0 for all inputs"},"correct":"A","explanation":{"correct":"- ReLU(z) = max(0, z) and its gradient is 0 when z < 0. 
A \"dead\" neuron is one where z < 0 for all inputs in the dataset — it outputs 0 always and receives gradient 0 always, so its weights never update.\n- Near-zero initialization with many features can produce z = Wx + b ≈ 0 initially, but a few bad samples or unlucky updates can push z < 0. Once dead, that neuron stays dead.\n- Fix: He initialization (scales weights by √(2/fan-in)), Leaky ReLU (gradient = α < 1 for negative inputs instead of 0), or PReLU (learnable negative slope). ELU also has negative outputs, preventing dead neurons.","A":"","B":"Small initial weights produce small activations, which produce small gradients — the opposite of explosion. Gradient explosion occurs with large weights, not small ones.","C":"Neuron deactivation is not memorization. Memorization would require neurons to be selectively active for specific training patterns, not permanently off.","D":"Mode collapse is a GAN training problem where the generator produces limited variety. It is not applicable to individual neuron behavior in a supervised learning MLP."},"reference":"- He et al., \"Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet\" (2015): https://arxiv.org/abs/1502.01852\n- Maas et al., \"Rectifier Nonlinearities Improve Neural Network Acoustic Models\" (Leaky ReLU)"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03004","difficulty":"medium","orderIndex":4,"question":"You train a model with ReLU activations and achieve good performance. A colleague switches the activations to Leaky ReLU (α=0.01) for the hidden layers, claiming it is \"strictly better.\" After retraining, the model performs identically. Your colleague insists there must be a bug. 
What is the most accurate explanation?","options":{"A":"Leaky ReLU is always strictly better than ReLU; the identical performance confirms a bug in the implementation","B":"Leaky ReLU's advantage (non-zero gradient for negative inputs) only matters when neurons are actually dying (stuck at z<0). If the original ReLU network had few or no dead neurons, Leaky ReLU provides no benefit — both activations are identical for z>0 and the negative-slope advantage never activates","C":"Leaky ReLU and ReLU are mathematically identical because the leaky term (0.01x) is too small to affect training","D":"The dataset is too small for Leaky ReLU's advantages to manifest; it requires 100,000+ samples to show improvement"},"correct":"B","explanation":{"correct":"- Leaky ReLU with α=0.01 computes: max(0.01z, z). For z > 0, this is identical to ReLU. The difference only appears for z < 0, where ReLU gives 0 (zero gradient) and Leaky ReLU gives 0.01z (non-zero gradient).\n- If the original network had no dead neurons (all activations mostly positive for training data), the two activations are functionally equivalent on that dataset, and identical performance is the correct expected result.\n- The lesson: architectural improvements that address specific failure modes (like dead neurons) only show benefits when that failure mode is actually occurring. ReLU networks on well-initialized problems often have <5% dead neurons, making Leaky ReLU's advantage marginal.","A":"\"Strictly better\" in theory doesn't mean \"strictly better on every problem.\" Leaky ReLU is strictly better at addressing dead neurons, but if no neurons are dying, the advantage is zero.","B":"","C":"0.01x for negative inputs is not \"too small to affect training\" — if neurons were dying, even a 0.01 gradient would be infinitely better than a 0 gradient. The magnitude matters only when the feature is relevant.","D":"The improvement from Leaky ReLU is not sample-size dependent. 
It depends on whether dead neurons are present. You can have dead neurons with 1 million samples and no dead neurons with 100 samples."},"reference":"- https://cs231n.github.io/neural-networks-1/#actfun (comparison of activations)"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03005","difficulty":"medium","orderIndex":5,"question":"GELU (Gaussian Error Linear Unit) is defined as: GELU(x) = x · Φ(x), where Φ is the standard normal CDF. Unlike ReLU which makes a hard 0/1 decision at x=0, GELU is used in Transformers (BERT, GPT) instead of ReLU. What property of GELU makes it preferable for Transformer-based architectures specifically?","options":{"A":"GELU is faster to compute because it avoids the max() operation in ReLU","B":"GELU is smooth (infinitely differentiable) and stochastically gates inputs — it smoothly interpolates between \"pass input\" and \"gate to zero\" based on the input's magnitude relative to other inputs. This smooth gating is empirically better for the attention + MLP structure in Transformers","C":"GELU outputs values in (0,1), making it compatible with the softmax in the attention mechanism","D":"GELU was designed specifically for pre-LayerNorm Transformers and has no advantage over ReLU in post-LayerNorm architectures"},"correct":"B","explanation":{"correct":"- GELU(x) = x · Φ(x) can be interpreted as: multiply the input by its probability of being greater than a Gaussian sample. For large positive x: Φ(x)→1, so GELU(x)≈x. For large negative x: Φ(x)→0, so GELU(x)≈0. Near 0: smooth interpolation.\n- This smooth, stochastic gating behavior means GELU doesn't make hard cutoff decisions like ReLU. 
In deep Transformer architectures where activations are distributed roughly normally (due to LayerNorm before each sublayer), GELU's Gaussian-parameterized gating matches the activation distribution naturally.\n- Empirically, GELU is the default in BERT, GPT, and many Transformer variants and is often reported to match or outperform ReLU there; the theoretical explanation is still an active research area.","A":"GELU requires computing the error function (or an approximation), which is more expensive than max(0,x). It is computationally slower than ReLU.","B":"","C":"GELU(x) = x · Φ(x) is negative for every x < 0, reaching a minimum of roughly -0.17 near x ≈ -0.75, and it is unbounded above (GELU(x) ≈ x for large positive x). It is not bounded to (0,1).","D":"GELU was introduced by Hendrycks & Gimpel (2016) as a general activation function. Its advantages have been demonstrated across various architectures and normalization schemes, not limited to pre-LayerNorm configurations."},"reference":"- Hendrycks & Gimpel, \"Gaussian Error Linear Units (GELUs)\" (2016): https://arxiv.org/abs/1606.08415\n- BERT paper uses GELU: https://arxiv.org/abs/1810.04805"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03006","difficulty":"medium","orderIndex":6,"question":"You are training a binary classifier and must choose between sigmoid and ReLU for the output layer activation. 
A teammate says \"use ReLU everywhere for consistency.\" What is wrong with using ReLU on the output layer for binary classification?","options":{"A":"ReLU outputs can exceed 1.0, making them incompatible with binary cross-entropy loss which expects probabilities in [0,1]","B":"ReLU cannot distinguish between confidently correct and confidently incorrect predictions because it clips all negative values to 0","C":"ReLU is not differentiable at 0, which causes instability in the loss computation","D":"Both A and B — ReLU produces unbounded outputs and loses negative prediction information"},"correct":"A","explanation":{"correct":"- Binary cross-entropy (BCE) loss: L = -[y·log(p) + (1-y)·log(1-p)] requires p ∈ (0,1). If the model outputs p > 1 (possible with ReLU), log(1-p) = log(negative) → undefined/NaN, breaking the loss computation.\n- Sigmoid squashes any real-valued pre-activation to (0,1), making it the canonical output activation for binary classification. The log-odds interpretation is also natural: the pre-activation logit maps directly to probability via sigmoid.\n- In PyTorch, `nn.BCEWithLogitsLoss` combines sigmoid and BCE in one numerically stable operation, which is why many implementations use no output activation with `BCEWithLogitsLoss` rather than explicit sigmoid.","A":"","B":"ReLU does differentiate between high and low outputs for positive predictions. The issue is not discrimination ability for positive outputs but the incompatibility with the loss function's probability expectations.","C":"ReLU is not differentiable at exactly 0, but this is handled by convention (gradient = 0 at 0). 
In practice, the probability of exactly hitting 0 is negligible and this is not the primary problem with using ReLU on the output layer.","D":"While both A and B raise valid points, A is the fundamental reason: mathematical incompatibility with the loss function is a hard constraint, not a soft preference."},"reference":"- PyTorch BCEWithLogitsLoss: https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03007","difficulty":"medium","orderIndex":7,"question":"A network using ELU (Exponential Linear Unit) activations converges significantly faster than the same network with ReLU on a deep (20 layer) architecture. The engineer explains: \"ELU is faster because it uses exponentials which are faster than max().\" Is the engineer's explanation correct?","codeSnippet":"# ELU: f(x) = x if x > 0, else α(e^x - 1)\n# ReLU: f(x) = max(0, x)","options":{"A":"Yes — exponential functions have hardware acceleration in modern CPUs making ELU faster than ReLU","B":"No — ELU is computationally more expensive than ReLU (exp is slower than max). The faster convergence is due to ELU producing negative outputs for negative inputs, keeping the mean activation near zero. This prevents the \"bias shift\" problem that slows ReLU networks","C":"No — ELU is faster because its derivative is always non-zero, enabling larger learning rates","D":"Yes — ELU avoids the non-differentiability at z=0 that causes ReLU to require smaller learning rates"},"correct":"B","explanation":{"correct":"- The exponential function is one of the more expensive operations in floating-point arithmetic. ELU is computationally slower than ReLU per unit operation. The faster convergence is explained by a different mechanism.\n- ReLU outputs are always ≥ 0. In a layer with ReLU activations, the average output is positive, which means the next layer's weights receive inputs with non-zero mean. 
This \"bias shift\" (similar to the sigmoid non-zero-mean problem) causes gradient updates that are correlated across samples, slowing convergence.\n- ELU outputs can be negative (approaching -α for large negative inputs), keeping the mean activation near zero — similar to tanh's zero-centering benefit but without tanh's saturation problem.","A":"Exponential functions do not have special hardware acceleration that makes them faster than max(). Modern CPUs/GPUs implement max as a single instruction, while exp requires multiple floating-point operations or a table lookup.","B":"","C":"While ELU's derivative is non-zero everywhere (for α > 0), this doesn't enable larger learning rates per se. The learning rate is constrained by loss landscape curvature, not just gradient existence.","D":"ReLU's non-differentiability at z=0 is handled by convention and is not a practical constraint on learning rate. The subgradient is used and training proceeds normally."},"reference":"- Clevert et al., \"Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)\" (2015): https://arxiv.org/abs/1511.07289"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03008","difficulty":"hard","orderIndex":8,"question":"SiLU (Sigmoid Linear Unit, also called Swish) is defined as SiLU(x) = x · sigmoid(x). A researcher claims SiLU is \"the same as GELU with a different distribution assumption.\" An engineer disagrees, saying they are fundamentally different. Who is correct and what is the exact difference?","options":{"A":"The researcher is correct — SiLU and GELU are numerically identical for all practical inputs","B":"The engineer is correct — SiLU uses sigmoid(x) as the gating function (deterministic, parameterized by logistic distribution) while GELU uses Φ(x) (CDF of standard normal). 
Both are \"self-gated\" (input gates itself) but with different distributional assumptions and different numerical values for the same input","C":"The researcher is correct — both are approximations to ReLU and converge to identical functions for large networks","D":"The engineer is correct — SiLU is not differentiable while GELU is smooth everywhere"},"correct":"B","explanation":{"correct":"- Both SiLU and GELU are self-gated activations of the form f(x) = x · gate(x). For GELU: gate(x) = Φ(x) (normal CDF). For SiLU: gate(x) = sigmoid(x) = 1/(1+e^(-x)) (logistic CDF).\n- The normal CDF and logistic CDF are different functions that happen to be similar in shape (both S-shaped, both in [0,1]). At x=0: Φ(0) = 0.5 = sigmoid(0) — they agree. At x=1: Φ(1) ≈ 0.841 vs sigmoid(1) ≈ 0.731 — they diverge.\n- In practice, SiLU is used in EfficientNet, MobileNetV3, and many modern CNNs. GELU is preferred in Transformers. Both outperform ReLU on many benchmarks, and the choice is often empirical.","A":"SiLU and GELU are numerically different. For x=1: SiLU(1) = 1·sigmoid(1) ≈ 0.731, GELU(1) = 1·Φ(1) ≈ 0.841. The difference is small but real, and compounds across layers.","B":"","C":"Convergence to identical functions as network width increases is a property of neural network training dynamics (NTK perspective), not of the activation functions themselves. The activations remain numerically distinct regardless of network size.","D":"Both SiLU and GELU are smooth (infinitely differentiable). SiLU(x) = x·σ(x) is differentiable everywhere since both x and sigmoid are differentiable."},"reference":"- Ramachandran et al., \"Swish: A Self-Gated Activation Function\" (2017): https://arxiv.org/abs/1710.05941\n- GELU: https://arxiv.org/abs/1606.08415"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03009","difficulty":"hard","orderIndex":9,"question":"You train a 50-layer network with ReLU activations. 
After training, you measure the fraction of dead neurons (always outputting 0) per layer. You find: Layer 1: 2% dead, Layer 25: 35% dead, Layer 50: 68% dead. The dead neuron count increases with depth. What mechanism causes this pattern and what architectural intervention prevents it?","options":{"A":"Deeper layers receive smaller gradients due to vanishing gradient, so they update less and drift to negative weight values — dead neuron accumulation is a direct consequence of vanishing gradients in ReLU networks","B":"Dead neurons at layer k propagate to layer k+1: if a neuron in layer k is dead, it contributes 0 to all downstream neurons' pre-activations. As more upstream neurons die, more downstream neurons receive predominantly zero (or negative) pre-activations and die themselves — a cascade failure. Batch Normalization interrupts this cascade by re-centering activations before each ReLU","C":"Deeper layers have more parameters which increases the probability of any single parameter reaching a dead state statistically","D":"The learning rate decays over training, causing deeper layers (which update later in backpropagation) to have effectively lower learning rates and die from under-updating"},"correct":"B","explanation":{"correct":"- The cascade mechanism: if 35% of layer 25 neurons output 0 always, then neurons in layer 26 receive inputs that are 35% zeros. This biases their pre-activation sum toward lower values, increasing the probability they also become dead.\n- This cascade compounds exponentially: even a small dead fraction in early layers multiplies into large dead fractions in later layers.\n- BatchNorm (or LayerNorm) normalizes pre-activations to have zero mean and unit variance before the activation function. This ensures activations enter ReLU with a balanced distribution, interrupting the dead-neuron cascade. 
This is one of BatchNorm's key practical benefits.","A":"Vanishing gradients in ReLU networks are primarily a problem with multiplicative weight matrices, not the activation function itself (ReLU gradient is 1 for positive inputs). ReLU actually alleviates vanishing gradients compared to sigmoid. Dead neurons accumulate via the cascade mechanism, not gradient vanishing.","B":"","C":"Dead neuron probability is not purely statistical. Individual neuron death depends on the distribution of its inputs and the values of its specific weights — it's a deterministic function of the network state, not a random statistical outcome.","D":"A learning-rate schedule applies the same rate to every layer at each step. Within a backward pass, gradients flow from the loss back toward the input, so deeper layers receive their gradients first and all parameters are updated in the same optimizer step. Depth therefore does not give a layer an independently lower effective learning rate."},"reference":"- Ioffe & Szegedy, \"Batch Normalization: Accelerating Deep Network Training\" (2015): https://arxiv.org/abs/1502.03167"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03010","difficulty":"hard","orderIndex":10,"question":"You are evaluating a new activation function f(x) = max(x, αx) where α = -0.5. An intern claims: \"This function has a negative slope for x < 0, so it will cause gradients to flip sign during backpropagation, making training unstable.\" Is the intern correct?","options":{"A":"Yes — negative slopes during backpropagation cause gradient sign flips which prevent convergence","B":"No — the gradient of f(x) for x < 0 is α = -0.5, a constant negative slope. This means gradients are scaled by -0.5 for negative pre-activations, not flipped unpredictably. 
However, this activation (Leaky ReLU with negative α) would cause unconventional behavior: negative-input neurons halve and invert their gradient signal, which could destabilize training","C":"No — gradient sign flips are normal in SGD and occur every time the optimizer passes through a loss minimum; the intern is confusing gradient descent mechanics with activation gradients","D":"Yes — but only for the first training step; after initialization, all pre-activations become positive due to ReLU's rectification behavior"},"correct":"B","explanation":{"correct":"- For x < 0, f(x) = αx = -0.5x, so f'(x) = -0.5. The chain rule multiplies this into the gradient of upstream layers. A factor of -0.5 scales and inverts the gradient signal for neurons with negative pre-activations.\n- Standard Leaky ReLU uses α ∈ (0, 1) (e.g., 0.01) to keep the slope for negative inputs positive but small. Using α = -0.5 is unusual and potentially harmful: the function becomes non-monotonic (f(x) = 0.5·|x| for x < 0), so raising a negative pre-activation lowers the unit's output and the backpropagated signal through those units is sign-inverted.\n- This is different from the gradient naturally being negative (which simply means \"decrease this weight\"). Here, the activation's negative slope would invert the semantic meaning of the loss gradient for certain neurons.","A":"The intern's concern about \"instability\" has some validity, but the mechanism described (\"flip sign\") is not quite right. The concern is about α being negative causing sign inversion through the activation, not gradient instability in the general SGD sense.","B":"","C":"The intern is not confusing gradient descent mechanics — the concern is specifically about the activation function's contribution to the chain rule product. This is a valid concern, just slightly imprecisely stated.","D":"ReLU rectification doesn't make all pre-activations positive. Many neurons will have negative pre-activations during training, especially early on. 
The \"all positives after first step\" claim is false."},"reference":"- Maas et al., \"Rectifier Nonlinearities Improve Neural Network Acoustic Models\" (Leaky ReLU, α should be in (0,1)): https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03011","difficulty":"medium","orderIndex":11,"question":"A network for multi-class classification (10 classes) uses softmax as the output activation. A colleague replaces softmax with sigmoid on each output independently, arguing \"sigmoid also produces values in (0,1) and is simpler.\" After training, the colleague's model produces outputs like [0.95, 0.87, 0.76, ...] that sum to 6.3. What critical property did the colleague's model lose?","options":{"A":"Differentiability — sigmoid outputs cannot be used with cross-entropy loss","B":"Mutual exclusivity normalization — softmax ensures outputs sum to 1.0 and represent a valid probability distribution over classes. Independent sigmoids produce values in (0,1) but without the normalization constraint, so outputs can sum to any value, making them unnormalized scores rather than class probabilities","C":"Sparsity — softmax produces sparse outputs (one dominant class) while sigmoid produces dense activations that confuse the model","D":"The model lost nothing significant — both activations produce equivalent outputs after applying argmax for the final class prediction"},"correct":"B","explanation":{"correct":"- Softmax: softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ). The denominator normalizes outputs so they sum to exactly 1.0 and form a valid categorical probability distribution.\n- Independent sigmoid: σ(zᵢ) = 1/(1+exp(-zᵢ)) for each output independently. 
No normalization — outputs can each be close to 1, summing well above 1.\n- The key difference: softmax encodes \"which class is most likely, given that exactly one is correct.\" Sigmoid encodes \"is this class present?\" — appropriate for multi-label problems (multiple classes can be true simultaneously), not multi-class problems (exactly one class is true).","A":"Sigmoid outputs are in (0,1) and differentiable. They are perfectly compatible with cross-entropy loss. The issue is not differentiability.","B":"","C":"Softmax does produce a \"winner-take-all\" effect (the largest logit gets amplified), but the primary issue is probability normalization, not sparsity per se.","D":"Argmax gives the same answer regardless of softmax vs sigmoid if the relative ordering of logits is preserved (which it is, since both are monotone transformations). So for inference alone, argmax accuracy could be similar. However, the probability estimates are meaningless, calibration is lost, and training with cross-entropy on unnormalized probabilities produces incorrect gradients."},"reference":"- https://cs231n.github.io/linear-classify/#softmax (Softmax vs SVM losses)"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03012","difficulty":"easy","orderIndex":12,"question":"You are comparing activation functions for a hidden layer. A senior engineer says: \"For modern deep learning on GPUs, ReLU is the default choice not just because it avoids vanishing gradients but for a second practical reason that matters at scale.\" What is the second practical reason?","options":{"A":"ReLU enables sparse activations — on average, ~50% of neurons output 0 in each forward pass. 
Sparse activations mean fewer multiplications in subsequent layers, which translates to real computational savings on specialized hardware","B":"ReLU outputs are bounded, preventing memory overflow in GPU operations","C":"ReLU is the only activation that is supported natively by CUDA kernels in PyTorch","D":"ReLU eliminates the need for bias terms, reducing memory usage in large networks"},"correct":"A","explanation":{"correct":"- For a typical activation distribution centered near zero after BatchNorm, roughly 50% of ReLU inputs are negative and produce exactly 0 output. A zero activation contributes nothing to the next layer's weighted sums.\n- Standard dense GPU kernels still multiply by these zeros, so the savings are not automatic. They materialize when sparsity-aware kernels, sparse matrix libraries, or specialized hardware skip the zero entries in the next layer's matrix-vector product.\n- This computational sparsity is one reason why ReLU-based sparse models can be inference-efficient, and why techniques like \"pruning\" and \"sparse networks\" work well with ReLU.","A":"","B":"ReLU outputs are NOT bounded above (max(0,x) grows without bound for large positive x). Output explosion is possible with ReLU, which is why weight initialization and batch normalization are important.","C":"PyTorch CUDA kernels support all standard activation functions including sigmoid, tanh, GELU, SiLU, etc. ReLU has no exclusive hardware support claim.","D":"Bias terms are determined by the network architecture, not the activation function. ReLU layers still use biases. Removing biases (bias=False) is an independent design choice unrelated to activation type."},"reference":"- LeCun et al., \"Efficient BackProp\": practical considerations for activations\n- https://pytorch.org/docs/stable/sparse.html"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03013","difficulty":"medium","orderIndex":13,"question":"A vision model uses ReLU activations and achieves strong performance. 
You switch to PReLU (Parametric ReLU), which replaces the fixed slope of 0 for negative inputs with a learnable parameter αᵢ per channel. After training, you find that all αᵢ converged to ~0.01 (close to Leaky ReLU's standard setting). What does this convergence pattern tell you about the data?","options":{"A":"The model overfit during training; α should be regularized to exactly 0 (standard ReLU) to prevent overfitting","B":"The data's optimal activation behavior for negative pre-activations is approximately the Leaky ReLU regime (small positive slope), not full ReLU (zero slope) or full linear (slope=1). The network discovered this autonomously — the data prefers a small leak rather than hard zeroing","C":"The αᵢ convergence to 0.01 indicates dead neurons — the learnable parameter tried to revive them with a small slope","D":"PReLU always converges to α≈0.01 regardless of data due to L2 regularization on α pulling values toward zero"},"correct":"B","explanation":{"correct":"- PReLU generalizes both ReLU (α=0) and Leaky ReLU (fixed α). If it converges to α≈0.01, the network found that a small slope for negative inputs is better than no slope (ReLU) for this data.\n- This is an interpretable result: some information from negative pre-activations is useful for the task. A slope of 0.01 allows a weak gradient signal from neurons that would otherwise be dead, improving gradient flow slightly without allowing negative activations to dominate.\n- The uniform convergence across channels (all αᵢ ≈ 0.01) suggests this preference is consistent across features rather than specific to particular layers or channels.","A":"PReLU's learnable α is not a sign of overfitting. The parameters α are additional degrees of freedom, but they are learned in a way that improves training stability. Regularizing α to exactly 0 would manually force ReLU behavior, discarding the learned preference.","B":"","C":"α influences a neuron's output only when its pre-activation is negative, whether or not the neuron is dead. 
α≈0.01 means the network chose a small positive slope as the optimal behavior for negative inputs — it is not evidence of dead neurons attempting revival.","D":"L2 regularization would pull α toward 0, not 0.01. If all αᵢ converge to 0.01 with L2 regularization, the data gradient is pulling α up to 0.01 and the regularization is pulling it down — they balance at 0.01. This means the data genuinely prefers 0.01 over 0."},"reference":"- He et al., \"Delving Deep into Rectifiers\" (PReLU section): https://arxiv.org/abs/1502.01852"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03014","difficulty":"hard","orderIndex":14,"question":"A team is building a Mixture of Experts (MoE) language model. The router network (which decides which expert handles each token) uses softmax to output probabilities over 64 experts. The team observes \"expert collapse\": after 5000 training steps, 90% of tokens are routed to 2 of the 64 experts. The remaining 62 experts receive no gradients and become useless. What is the mechanistic cause related to softmax, and what fix is applied in production MoE systems?","options":{"A":"Softmax's normalization causes the winning experts to have gradients 32x larger than losing experts, amplifying early random advantages into permanent collapse","B":"Softmax with temperature=1 creates a positive feedback loop: experts that win early get more training examples, their performance improves, softmax amplifies their logit advantage further on subsequent tokens — collapse is a stable attractor of the softmax + gradient descent system. 
Production fix: add an auxiliary load-balancing loss that penalizes unequal expert utilization","C":"Expert collapse is caused by the router network overfitting to the training data; fix by adding dropout to the router","D":"Softmax is the wrong activation for routing; replace with ReLU to allow multiple experts per token"},"correct":"B","explanation":{"correct":"- The collapse mechanism: Expert A gets slightly higher initial logit → softmax amplifies this to high probability → Expert A gets more gradient updates → Expert A improves more → its logit grows higher → softmax amplifies further → collapse.\n- This is a positive feedback loop inherent to the softmax + gradient descent interaction. Early random advantages are exponentially amplified by softmax's normalization.\n- Production fix (Switch Transformer, GShard, Mixtral): an auxiliary load-balancing loss. In the Switch Transformer formulation, L_aux = α · N · Σᵢ fᵢ · Pᵢ, where N is the number of experts, fᵢ is the fraction of tokens dispatched to expert i and Pᵢ is the router's mean probability for expert i. The sum is minimized when routing is uniform, so the loss directly penalizes unequal utilization.","A":"The gradient magnitude difference between winning and losing experts follows from the softmax probability values, not a fixed 32x factor. More importantly, the gradient difference alone doesn't cause collapse — the feedback loop between gradient updates and future routing decisions is the actual mechanism.","B":"","C":"Dropout on the router would add noise to routing decisions but would not address the fundamental positive feedback loop. Production systems use load-balancing loss for this purpose.","D":"Replacing softmax with ReLU would allow multiple experts per token (multi-select routing), which is a different design choice. 
Some systems use top-k routing with ReLU normalization, but this changes the problem structure rather than fixing softmax expert collapse."},"reference":"- Fedus et al., \"Switch Transformers\" (2021): https://arxiv.org/abs/2101.03961\n- Lepikhin et al., \"GShard\" (2020): https://arxiv.org/abs/2006.16668"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03015","difficulty":"hard","orderIndex":15,"question":"A research paper claims: \"For networks wider than 1000 units per layer, the choice of activation function (ReLU, GELU, tanh) becomes irrelevant because the network falls into the infinite-width (Neural Tangent Kernel) regime where all activations are equivalent.\" A practitioner dismisses this as \"theoretical nonsense.\" Who is right and why?","options":{"A":"The paper is correct — NTK theory proves that all activations become equivalent at infinite width","B":"The practitioner is right to be skeptical: NTK theory applies in a specific mathematical limit (infinite width, specific initialization, lazy training regime). At width=1000, networks are far from this limit and still in the feature-learning regime where activation choice affects learned representations, convergence speed, and final performance","C":"Both are correct — for classification tasks with width>1000, activations are equivalent; for generation tasks they differ","D":"The paper is correct for training speed but the practitioner is correct for final accuracy — activations affect how fast networks train but not what they converge to"},"correct":"B","explanation":{"correct":"- NTK theory (Jacot et al., 2018) describes networks in the \"lazy training\" regime where parameters stay close to initialization. This requires infinite width AND specific scaling. At finite width (even 10,000 units), networks learn features and deviate from the NTK prediction.\n- Practical width=1000 networks are solidly in the feature-learning (non-NTK) regime. 
The choice between ReLU and GELU significantly affects: (a) dead neuron fraction, (b) gradient flow quality, (c) representation geometry.\n- The paper's claim treats \"wider than 1000\" as if it were the infinite-width limit. NTK-like behavior typically emerges only at much larger widths than 1000, and even then only approximately.","A":"NTK theory describes the truly infinite-width limit under a specific parameterization (NTK parameterization) and small learning rates, and even there the kernel itself depends on the activation function, so activations are not made equivalent. \"Infinite\" is not a practical width threshold and \"1000\" is not anywhere near the regime where NTK approximations become accurate.","B":"","C":"NTK theory does not distinguish by task type (classification vs generation). The regime is determined by network width, learning rate, initialization scale, and training dynamics — not the loss function.","D":"Activation choice affects both training speed and final accuracy. They are not decoupled. Networks with dying neurons (bad activation choice) converge to worse solutions, not just slower convergence to the same solution."},"reference":"- Jacot et al., \"Neural Tangent Kernel: Convergence and Generalization in Neural Networks\" (2018): https://arxiv.org/abs/1806.07572\n- Yang & Hu, \"Feature Learning in Infinite-Width Neural Networks\" (feature learning regime): https://arxiv.org/abs/2011.14522"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04001","difficulty":"easy","orderIndex":1,"question":"A network has input shape (batch_size=32, features=128), first layer weight matrix W₁ of shape (128, 64), and bias b₁ of shape (64,). An engineer writes the forward pass as: h = x @ W₁ + b₁. 
What is the shape of h and how does the bias broadcasting work?","options":{"A":"h has shape (32, 128) because the bias expands to match the input dimension","B":"h has shape (32, 64) — x @ W₁ produces (32, 64), and b₁ of shape (64,) is broadcast to (32, 64) by repeating along the batch dimension, adding the same bias to each sample","C":"This code is invalid — bias must have shape (32, 64) to match the batch dimension explicitly","D":"h has shape (64, 32) because matrix multiplication transposes the batch dimension"},"correct":"B","explanation":{"correct":"- Matrix multiply: (32, 128) @ (128, 64) = (32, 64). Each of the 32 samples gets its own 64-dimensional output vector.\n- Broadcasting: b₁ has shape (64,). NumPy/PyTorch broadcasts this to (32, 64) by repeating along the batch dimension — the same bias vector b₁ is added to every sample's activation. This is the correct behavior because the bias is a property of the layer, not the sample.\n- Broadcasting rules: shapes are aligned from the right. For (32, 64) + (64,), the (64,) operand is treated as (1, 64) and then stretched to (32, 64). This implicit behavior is a common source of shape bugs when the bias has unexpected dimensions.","A":"(32, 128) would require a square (128, 128) weight matrix. Here W₁ has shape (128, 64) and maps 128→64, so the output is 64-dimensional.","B":"","C":"PyTorch and NumPy handle broadcasting automatically. The bias does not need to be explicitly shaped (32, 64). Requiring explicit batch-dimension expansion would be cumbersome and is not how neural network libraries work.","D":"Matrix multiplication preserves the batch dimension as the leading dimension. 
(32, 128) @ (128, 64) = (32, 64), not (64, 32)."},"reference":"- PyTorch broadcasting semantics: https://pytorch.org/docs/stable/notes/broadcasting.html\n- https://cs231n.github.io/neural-networks-2/#datapre"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04002","difficulty":"easy","orderIndex":2,"question":"You are implementing a 3-layer MLP forward pass and tracking tensor shapes. Input is (batch=16, 784). Layers: [784→256, 256→128, 128→10]. At each step, you apply ReLU after layers 1 and 2, and no activation after layer 3. Which shape sequence is correct?","options":{"A":"(16,784) → (16,256) → (256,) → (16,128) → (128,) → (16,10)","B":"(16,784) → (16,256) → (16,256) → (16,128) → (16,128) → (16,10)","C":"(16,784) → (256,16) → (256,16) → (128,16) → (128,16) → (10,16)","D":"(16,784) → (16,256) → (16,128) → (16,10) skipping ReLU shapes since activation doesn't change shape"},"correct":"B","explanation":{"correct":"- Layer 1: (16,784) @ (784,256) + b = (16,256). ReLU applied element-wise: output is (16,256) — same shape, different values.\n- Layer 2: (16,256) @ (256,128) + b = (16,128). ReLU: still (16,128).\n- Layer 3: (16,128) @ (128,10) + b = (16,10). No activation.\n- Activation functions (ReLU, sigmoid, tanh) are element-wise operations — they preserve tensor shape. Shape tracking must include these steps to verify the code is correct, even though shape doesn't change.","A":"The (256,) and (128,) shapes are incorrect — they represent 1D bias vectors, not the layer outputs. After the matrix multiply, the output is 2D (batch × features).","B":"","C":"Standard PyTorch linear layers use (batch, features) convention, not (features, batch). 
The batch dimension is always leading.","D":"Option D is actually numerically correct (skipping same-shape ReLU steps), but omitting activation steps in shape tracking is bad practice — a common source of bugs when activation functions are accidentally applied to wrong tensors."},"reference":"- PyTorch nn.Linear documentation: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04003","difficulty":"medium","orderIndex":3,"question":"You are processing a batch of 64 images, each of shape (3, 224, 224), through a convolutional layer followed by a fully connected layer. Before the FC layer, the output of the conv stack has shape (64, 512, 7, 7). A junior engineer writes `x = x.reshape(64, -1)` before the FC layer. What is the resulting shape and what bug risk does this introduce compared to `x.view(64, -1)` or `nn.Flatten()`?","codeSnippet":"conv_out = torch.randn(64, 512, 7, 7)\nx = conv_out.reshape(64, -1)\nfc = nn.Linear(512*7*7, 1000)\nlogits = fc(x)","options":{"A":"Shape is (64, 25088). reshape is equivalent to view for contiguous tensors; the bug risk is that reshape may silently copy non-contiguous tensors, potentially masking incorrect tensor layout assumptions downstream","B":"Shape is (64, 512, 49) because reshape preserves the channel dimension","C":"Shape is incorrect because -1 cannot infer dimensions for 4D → 2D flattening","D":"Shape is (64, 25088) but this will cause a runtime error because FC layers require 3D inputs"},"correct":"A","explanation":{"correct":"- 512 × 7 × 7 = 25,088. reshape(64, -1) infers -1 = 25,088. Output shape: (64, 25,088). ✓\n- `reshape` vs `view`: Both produce the same shape. For contiguous tensors (which standard conv outputs are), they are identical. For non-contiguous tensors (e.g., after transpose or permute), `view` raises an error while `reshape` silently copies. 
This means `reshape` can mask bugs where a tensor has unexpected memory layout.\n- Best practice: use `nn.Flatten()` which handles both contiguous and non-contiguous tensors correctly and documents intent clearly in the model definition.","A":"","B":"`reshape(64, -1)` collapses ALL remaining dimensions into one. (64, 512, 7, 7) → (64, 512*7*7) = (64, 25088), not (64, 512, 49).","C":"Python/PyTorch's `-1` in reshape correctly infers the size needed to keep total elements constant. 64×512×7×7 = 64×25088, so -1 = 25088. This is standard behavior.","D":"PyTorch `nn.Linear` expects inputs of shape (batch, features) — 2D inputs — which (64, 25088) provides. FC layers do not require 3D inputs; that is recurrent layers."},"reference":"- PyTorch reshape vs view: https://pytorch.org/docs/stable/tensor_view.html\n- nn.Flatten: https://pytorch.org/docs/stable/generated/torch.nn.Flatten.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04004","difficulty":"medium","orderIndex":4,"question":"During forward propagation through a 5-layer network, you process a batch of 256 samples. A profiler shows that 80% of the forward pass time is spent in matrix multiplication. Your team proposes three optimizations: (A) reduce batch size to 32, (B) use float16 instead of float32, (C) add a skip connection from layer 1 to layer 5. 
Which optimization(s) will reduce forward pass time and why?","options":{"A":"Only A — smaller batch size means less data to process","B":"Only B — float16 operations are 2× faster than float32 on modern GPUs and tensor cores","C":"A and B — batch size and precision both affect throughput; skip connections add computation","D":"B and potentially A — float16 halves memory traffic and enables tensor core operations (4-8× faster matrix multiply); reducing batch size helps if GPU memory is the bottleneck but hurts throughput efficiency if the GPU is underutilized"},"correct":"D","explanation":{"correct":"- Float16 (half precision): Modern NVIDIA GPUs have dedicated tensor cores that perform FP16 matrix multiplication 4-8× faster than FP32. Memory traffic is also halved (each value is 2 bytes vs 4 bytes), easing the data-movement bottleneck.\n- Batch size: Reducing from 256 to 32 doesn't help if the GPU is already compute-bound (fully utilizing all cores). It can hurt throughput by reducing parallelism. It only helps if GPU memory is the bottleneck preventing larger batches.\n- Skip connections (C) add matrix additions (cheap) and potentially extra weight matrices — they slightly increase FLOPs but can improve gradient flow, leading to better final models. They don't reduce forward pass time.","A":"Reducing batch size from 256 to 32 reduces the amount of work, but GPU throughput is maximized with large batches. For a 256-sample batch, the GPU is likely well-utilized. Going to 32 may leave GPU cores idle, reducing actual throughput efficiency (samples/second).","B":"Float16 does help, but \"Only B\" is too narrow: reducing batch size (A) can also help in some cases, depending on whether the GPU is memory-bound or compute-bound and whether the current batch size fills GPU compute capacity.","C":"Skip connections add an element-wise addition (negligible cost) but if they include extra weight matrices (as in ResNets), they add matrix multiplications. 
Net effect: slightly more computation, not less.","D":""},"reference":"- NVIDIA tensor cores and FP16: https://developer.nvidia.com/tensor-cores\n- Mixed precision training: https://pytorch.org/docs/stable/amp.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04005","difficulty":"medium","orderIndex":5,"question":"You implement a forward pass manually for debugging. Which statement about this code is correct?","codeSnippet":"def forward(x, W1, b1, W2, b2):\n    z1 = x @ W1.T + b1\n    a1 = relu(z1)\n    z2 = a1 @ W2.T + b2\n    return z2","options":{"A":"The bias should be added before the matrix multiply, not after","B":"PyTorch's `nn.Linear` transposes the weight matrix internally (computes xW^T + b), so the weight matrices W1 and W2 should be stored as (out_features, in_features) — this code is actually consistent with PyTorch's convention","C":"ReLU should be applied before the matrix multiply in layer 2, not after layer 1","D":"The code uses `.T` which transposes the entire tensor including batch dimensions for batched inputs, causing incorrect computation"},"correct":"B","explanation":{"correct":"- PyTorch's `nn.Linear(in, out)` stores weight as shape (out, in) and computes output = input @ weight.T + bias. The transpose operation aligns dimensions: (batch, in) @ (in, out) = (batch, out).\n- This code does exactly that: `x @ W1.T + b1` where W1 is (out, in) transposes to (in, out) and multiplies. The implementation is consistent with PyTorch's convention.\n- The subtle point: many textbooks write the weight as (in, out) and compute xW + b without transpose. PyTorch stores weights in the transposed (out, in) layout. This inconsistency between textbook notation and implementation is a frequent source of confusion.","A":"The bias is correctly added after the matrix multiply: z = xW^T + b. This is the standard affine transformation. 
Adding bias before the multiply would produce W^T(x + b), which is mathematically different.","B":"","C":"The activation is applied to z1 (layer 1's pre-activation) to produce a1 (layer 1's output). Layer 2 then processes a1. The order is correct: z → activation → next z is the standard forward pass structure.","D":"`.T` in PyTorch (and NumPy) for 2D tensors transposes the two dimensions correctly. For a 2D weight matrix (out, in), `.T` gives (in, out). For higher-dimensional tensors, `.T` reverses all dimensions, but weight matrices are 2D."},"reference":"- PyTorch nn.Linear weight shape: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04006","difficulty":"medium","orderIndex":6,"question":"A model processes batches of text sequences with shape (batch=32, seq_len=512, d_model=768). During a forward pass through a feed-forward sublayer (two linear layers), you notice peak GPU memory is 3× the model parameter memory. A teammate says \"just reduce batch size.\" What is the actual cause of the memory spike and what is the correct fix?","options":{"A":"The model parameters are duplicated three times during the forward pass for numerical stability","B":"Intermediate activations (all layer outputs needed for backpropagation) are stored during the forward pass. For (32, 512, 768) inputs processed through multiple layers, these activation tensors collectively occupy 2-3× model memory. 
Correct fix: gradient checkpointing trades memory for compute by recomputing activations during the backward pass","C":"The optimizer states (Adam maintains 2 extra copies per parameter) cause the 3× memory during forward pass","D":"Float32 arithmetic requires 4 bytes per number, and the GPU allocates memory in 3× chunks for alignment"},"correct":"B","explanation":{"correct":"- During the forward pass, PyTorch stores all intermediate activations needed for backpropagation (chain rule requires knowing the forward values to compute gradients). For a deep network on large sequences, these stored activations can easily exceed model parameter memory.\n- For (32, 512, 768): each activation tensor is 32×512×768×4 bytes ≈ 48 MB. With 12 Transformer layers each having multiple sublayers, stored activations sum to hundreds of MB or more.\n- Gradient checkpointing (torch.utils.checkpoint): during forward pass, discard intermediate activations. During backward pass, recompute them on-the-fly. Trades ~33% extra compute for ~50-70% memory reduction.","A":"Model parameters are not duplicated during the forward pass. They are stored once and referenced. Parameter duplication happens with distributed training (data parallelism) or during optimizer steps, not forward passes.","B":"","C":"Adam optimizer states (first and second moment estimates) are allocated during the optimizer step, not during the forward pass. They are persistent between training steps but are not created fresh during forward propagation.","D":"Memory alignment is real but results in small, fixed padding, not 3× expansion. 
Memory alignment does not cause 3× usage."},"reference":"- PyTorch gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html\n- Chen et al., \"Training Deep Nets with Sublinear Memory Cost\" (gradient checkpointing): https://arxiv.org/abs/1604.06174"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04007","difficulty":"hard","orderIndex":7,"question":"You run the exact same network forward pass twice with the same input tensor and observe different outputs. The network is in `model.eval()` mode. What is the most likely cause, and what does fixing it require?","codeSnippet":"model.eval()\nx = torch.randn(8, 64)\nout1 = model(x)\nout2 = model(x)\nassert torch.allclose(out1, out2) # This assertion FAILS","options":{"A":"`model.eval()` does not disable all randomness — Dropout layers are disabled by eval(), but if the model contains MC Dropout (Dropout intentionally left active in eval mode), or if any layer explicitly generates random noise (e.g., noise injection for robustness), the outputs will differ","B":"Float32 arithmetic is non-deterministic on GPUs; different CUDA kernel execution orders produce different results on every run","C":"The model has a bug in weight initialization that re-randomizes weights on every forward call","D":"PyTorch eval() mode only affects BatchNorm statistics; Dropout is always active regardless of eval/train mode"},"correct":"A","explanation":{"correct":"- `model.eval()` sets the mode flag that disables standard Dropout and switches BatchNorm from batch statistics to running statistics. However, it does not disable ALL randomness.\n- MC Dropout (Monte Carlo Dropout) intentionally overrides the eval flag to keep Dropout active for uncertainty estimation. 
If the model uses this pattern, eval mode does not make it deterministic.\n- Other sources of non-determinism in eval mode: stochastic depth layers, noise injection, random augmentation in the forward path, or CUDA non-determinism with certain operations.\n- Fix: explicitly set `torch.manual_seed()` before each call, use `torch.use_deterministic_algorithms(True)`, or identify and disable the specific source of randomness.","A":"","B":"While CUDA non-determinism is real (some operations like atomicAdd have non-deterministic ordering), it produces differences in the ~1e-7 range, well within `torch.allclose`'s default tolerance (atol=1e-8, rtol=1e-5). Failing `allclose` with identical inputs suggests larger differences.","C":"Weight re-initialization in the forward pass would be a catastrophic bug that would be immediately obvious in training (loss would never decrease). This is not a realistic scenario in a model that has been trained.","D":"eval() definitely disables Dropout (by setting `self.training = False`, which Dropout checks). The confusion is that some custom Dropout implementations explicitly ignore the training flag."},"reference":"- PyTorch MC Dropout for uncertainty: https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html\n- torch.use_deterministic_algorithms: https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04008","difficulty":"hard","orderIndex":8,"question":"A language model processes tokens with an embedding layer that maps token IDs to 512-dimensional vectors. Input is a batch of shape (32, 128) — 32 sentences, each with 128 tokens. The embedding table has shape (50000, 512). An engineer writes the forward pass as a matrix multiply: `embeddings = one_hot(tokens) @ embedding_table`. 
A senior engineer says this is \"functionally correct but catastrophically inefficient.\" Why?","options":{"A":"One-hot encoding creates a (32, 128, 50000) tensor — 32×128×50000×4 bytes ≈ 800 MB just for the one-hot matrix. The actual operation needed is a simple lookup (indexing), not a matrix multiply","B":"Matrix multiplication requires contiguous memory and one-hot tensors are sparse, causing CUDA memory allocation failures","C":"The embedding table must be transposed before the multiply, so the engineer's code produces wrong output shapes","D":"One-hot + matrix multiply is only inefficient for vocabularies larger than 100,000; for 50,000 tokens it is acceptable"},"correct":"A","explanation":{"correct":"- One-hot encoding a (32, 128) index tensor with vocabulary size 50,000 creates a (32, 128, 50,000) float tensor. Memory: 32×128×50,000×4 = 819 MB just for the one-hot representation.\n- The one-hot matrix is 99.998% zeros (only 1 out of 50,000 entries is 1 per token). Multiplying by the embedding table computes 50,000 products and sums only to select 1 row — extreme waste.\n- The correct operation: `embedding_table[tokens]` (fancy indexing). PyTorch's `nn.Embedding` implements this as an O(1) lookup per token — just reading the row at the given index. No multiplication needed.","A":"","B":"Sparse tensors are supported in PyTorch, but the one-hot matrix here would be created as a dense tensor (no automatic sparsification). Memory allocation failure is possible but the primary issue is the inefficiency, not a hard failure.","C":"(32, 128, 50000) @ (50000, 512) = (32, 128, 512) — the shapes are actually correct (batched matrix multiply). The code is functionally correct, just catastrophically slow/memory-intensive.","D":"There is no meaningful threshold at 100,000. Even at vocabulary=50,000, the one-hot approach is 4+ orders of magnitude more compute than a lookup. 
The inefficiency scales linearly with vocabulary size — it is always unacceptable."},"reference":"- PyTorch nn.Embedding: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04009","difficulty":"hard","orderIndex":9,"question":"During a forward pass, you trace the values flowing through a network and observe that after 8 layers with ReLU activations, the activation norms double with each layer: layer 1 norm ≈ 1.0, layer 2 ≈ 2.0, ..., layer 8 ≈ 128.0. The loss is NaN after the first batch. What initialization problem caused this and what is the fix?","options":{"A":"The weights were initialized too small, causing ReLU to output zero for all inputs","B":"The weights were initialized with variance too large (e.g., random normal with std=1.0 instead of He initialization). Each ReLU layer multiplies the activation norm by approximately √(fan_in/2) × std. With std=1.0 and fan_in=256 neurons, each layer amplifies by ~11, causing exponential activation growth and eventual overflow to NaN","C":"The bias terms were initialized to positive values, causing additive growth across layers","D":"ReLU should not be used with more than 4 layers; beyond that, activation normalization is required"},"correct":"B","explanation":{"correct":"- If weights are sampled from N(0, 1) for a layer with fan_in=256, the pre-activation z = Σᵢ wᵢxᵢ has variance = fan_in × Var(w) × Var(x) = 256 × 1 × 1 = 256, so std(z) ≈ 16. After ReLU (which halves the mean square of a zero-mean input), each layer amplifies activation norm by roughly √(256/2) ≈ 11×. The per-layer doubling observed here is a milder case of the same failure: it corresponds to a weight std about 2× the He value.\n- He initialization: std = √(2/fan_in) ensures each ReLU layer preserves activation variance: fan_in × (2/fan_in) × Var(x) = 2 × Var(x) → after ReLU (halving): Var = 1 × Var(x). 
Norm stays constant across layers.\n- Exponential norm growth → floating-point overflow → NaN loss on first backward pass.","A":"Small weight initialization causes vanishing activations (norms shrink to near zero), not exponential growth. The observed doubling-per-layer pattern is a signature of over-large weight variance.","B":"","C":"Bias initialization affects the offset of each layer's output but not the multiplicative growth. Biases are typically initialized to zero and contribute additively (linear, not exponential growth).","D":"ReLU can be used with 100+ layers when properly initialized (ResNets use ReLU at 50-152 layers). The issue is initialization, not a fundamental ReLU depth limit."},"reference":"- He et al., \"Delving Deep into Rectifiers\" (He initialization derivation): https://arxiv.org/abs/1502.01852\n- https://cs231n.github.io/neural-networks-2/#init"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04010","difficulty":"medium","orderIndex":10,"question":"A model is trained on GPU and achieves 98% training accuracy. During inference on CPU, the same model produces different outputs for the same inputs — outputs that are slightly different (differences of ~1e-4). The model does not use Dropout. What is the most likely cause?","options":{"A":"CPU and GPU use different random seeds, causing stochastic differences in forward propagation","B":"Float32 arithmetic is not associative — the order of floating-point operations differs between GPU (parallel, fused operations) and CPU (sequential, different operation ordering), producing slightly different results due to floating-point rounding. 
These differences are expected and not a bug","C":"The model weights were saved in float16 and loaded in float32 on CPU, causing precision loss during weight conversion","D":"CPU inference automatically applies quantization, reducing precision to int8"},"correct":"B","explanation":{"correct":"- Floating-point arithmetic is not associative: (a + b) + c can differ from a + (b + c) in IEEE 754 float32 because each step rounds. GPUs perform matrix multiplications with parallel reduction (different summation order than CPU sequential operations), producing numerically different but equally \"correct\" results.\n- Differences of ~1e-4 in float32 are plausible for this phenomenon in a deep network with long reductions. The results are both valid floating-point approximations to the same mathematical computation, just with different rounding error accumulation paths.\n- This is known and expected behavior; PyTorch's reproducibility notes state that results may not match between CPU and GPU executions. `torch.use_deterministic_algorithms(True)` makes runs repeatable on the same device, but it does not make CPU and GPU outputs bitwise identical, so compare across devices with a tolerance (e.g., `torch.allclose`).","A":"Forward propagation in eval mode (no Dropout) is deterministic given the same inputs and weights. Random seeds affect random number generation, which is not used in a standard forward pass.","B":"","C":"`torch.save` stores tensors in their existing dtype, which is float32 for parameters by default, so no conversion happens on load. Upcasting float16 to float32 is itself exact; any precision loss would come from an earlier downcast, and nothing in the scenario indicates one.","D":"PyTorch CPU inference does not automatically quantize models. Quantization (int8) is an explicit operation requiring `torch.quantization` API calls. 
Default CPU inference uses float32."},"reference":"- CUDA determinism: https://pytorch.org/docs/stable/notes/randomness.html\n- IEEE 754 floating-point arithmetic: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04011","difficulty":"easy","orderIndex":11,"question":"A network processes batches of images. During the forward pass, the last convolutional layer output has shape (batch=8, channels=256, height=14, width=14). Before the fully connected layer, the tensor must be flattened. A new engineer uses `x.squeeze()` instead of `x.flatten(1)`. What is the problem?","options":{"A":"`squeeze()` and `flatten(1)` are equivalent for 4D tensors — no problem","B":"`squeeze()` removes dimensions of size 1 — if the batch size is 1, it would remove the batch dimension, producing shape (256, 14, 14) instead of (1, 50176), breaking the FC layer, which expects (batch, features) input","C":"`squeeze()` transposes the channel and spatial dimensions, producing incorrect feature ordering","D":"`squeeze()` only works on 2D tensors; it will raise an error on a 4D input"},"correct":"B","explanation":{"correct":"- `torch.squeeze()` removes all dimensions of size 1; it never merges dimensions. For batch_size=8: shape (8, 256, 14, 14) has no size-1 dimensions, so squeeze is a no-op and the tensor stays 4D. Only `flatten(1)` produces the (8, 50176) input the FC layer needs (256 × 14 × 14 = 50176).\n- For batch_size=1: shape (1, 256, 14, 14) — squeeze removes the batch dimension → (256, 14, 14). The FC layer (nn.Linear) requires the last dimension to equal in_features, with the batch dimension leading → shape error.\n- The loss of the batch dimension appears only when single samples are processed (batch_size=1), which typically happens at inference rather than training (batch_size > 1). It is a classic \"training works, inference breaks\" pattern wherever `squeeze()` is called without an explicit `dim`.","A":"They are not equivalent: `flatten(1)` merges the channel and spatial dimensions into one, while `squeeze()` only drops size-1 dimensions. 
For batch_size=1, they produce different results.","B":"","C":"squeeze() only removes size-1 dimensions — it does not reorder or transpose remaining dimensions.","D":"squeeze() works on tensors of any dimension. It removes any dimension(s) that have size 1, regardless of the total number of dimensions."},"reference":"- torch.squeeze documentation: https://pytorch.org/docs/stable/generated/torch.squeeze.html\n- Common PyTorch pitfalls: https://pytorch.org/docs/stable/notes/faq.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04012","difficulty":"medium","orderIndex":12,"question":"You implement batch processing in numpy for a network with weight W (shape: 100×50) and bias b (shape: 100,). You receive a single sample x (shape: (50,)) and a batch X (shape: (32, 50)). Your colleague's code uses `np.dot(W, x)` for single samples and `np.dot(X, W.T)` for batches. Why are these two different formulations necessary?","options":{"A":"They are not necessary — one formulation works for both cases in numpy","B":"numpy's dot product behaves differently for 1D and 2D inputs: `np.dot(W, x)` computes Wx (matrix-vector, output shape (100,)), while `np.dot(X, W.T)` computes XW^T (matrix-matrix, output shape (32, 100)). Both are correct, but using `np.dot(W, x)` on a batch would compute a different operation","C":"The batch formulation transposes W because batched gradient computation requires transposed weight access","D":"Both compute identical operations — the difference is only in memory layout which numpy handles automatically"},"correct":"B","explanation":{"correct":"- `np.dot(W, x)`: W is (100,50), x is (50,) → matrix-vector product → output (100,). This is the standard Wx formulation.\n- `np.dot(X, W.T)`: X is (32,50), W.T is (50,100) → matrix-matrix product → output (32, 100). 
Each row of X is one sample, computing all 32 outputs simultaneously.\n- Modern deep learning libraries (PyTorch, TensorFlow) abstract this by always working in batch mode with leading batch dimension. The explicit W vs W.T difference is why nn.Linear stores weights transposed and always processes batches.","A":"You cannot use `np.dot(W, X.T)` to replace both. While `np.dot(W, X.T)` gives shape (100, 32) which can be transposed to (32, 100), it requires an extra transpose and is less readable. More importantly, the question asks why two different formulations exist in the colleague's code.","B":"","C":"The transpose in the batch formulation is not related to gradient computation — it's a geometric necessity for the matrix dimensions to align. (32,50) @ (50,100) requires W.T, not W.","D":"The two operations are not identical. `np.dot(W, x)` produces (100,), not (32, 100). numpy does not \"handle\" this automatically."},"reference":"- numpy dot product semantics: https://numpy.org/doc/stable/reference/generated/numpy.dot.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04013","difficulty":"hard","orderIndex":13,"question":"You are implementing a Transformer's feed-forward sublayer:","codeSnippet":"class FFN(nn.Module):\n def __init__(self, d_model=512, d_ff=2048):\n super().__init__()\n self.w1 = nn.Linear(d_model, d_ff)\n self.w2 = nn.Linear(d_ff, d_model)\n \n def forward(self, x): # x: (batch, seq, d_model)\n return self.w2(F.relu(self.w1(x)))","options":{"A":"The ReLU should be GELU for Transformers","B":"A dropout layer should be applied between the two linear layers in training, as specified in the original paper","C":"The output should be divided by √d_ff to normalize the output scale","D":"The linear layers should use weight tying (sharing weights between w1 and w2.T)"},"correct":"B","explanation":{"correct":"- The original \"Attention is All You Need\" (Vaswani et al., 2017) FFN includes dropout after 
the first activation: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂, with \"Dropout applied to the output of each sub-layer, before it is added to the sub-layer input and normalized.\"\n- Modern implementations often include this dropout explicitly inside the FFN: `F.dropout(F.relu(self.w1(x)), p=0.1, training=self.training)`.\n- The missing component is subtle but important for regularization. Many efficient implementations omit it for inference, but it should be present in the training code.","A":"The original paper uses ReLU. GELU is used in BERT, GPT, and later Transformers, but the question specifically asks about missing components vs the original \"Attention is All You Need\" formulation. GELU is a later improvement, not a missing component.","B":"","C":"The Transformer does not divide FFN output by √d_ff. The scaling factor 1/√d_k appears in the attention score computation, not in the FFN. Dividing by √d_ff would unnecessarily shrink outputs.","D":"Weight tying is used between the token embedding layer and the output projection (input embedding ↔ pre-softmax weight). It is not applied between FFN's two linear layers — they have different dimensions (d_model×d_ff and d_ff×d_model) and would need to be transposed, which is a different pattern."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): https://arxiv.org/abs/1706.03762 (Section 3.3, Position-wise Feed-Forward Networks)"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04014","difficulty":"hard","orderIndex":14,"question":"You are comparing inference throughput of two models: Model A processes samples one-at-a-time (batch_size=1), Model B processes 256 samples simultaneously (batch_size=256). Both models have identical architectures. On a modern GPU, Model B has 180× higher throughput (samples/second) despite using 256× more samples per forward pass. 
What explains the throughput improvement, and what limits further improvement beyond batch_size=256?","options":{"A":"Model B benefits from GPU parallelism — a single forward pass on 256 samples is nearly as fast as 1 sample because all 256 samples are processed simultaneously on different GPU cores. The limit is GPU memory: once the batch no longer fits in VRAM, throughput drops","B":"Model B benefits from better caching — 256 samples cause the weight matrix to be cached in L2 cache, reducing memory access time per sample","C":"Model A has 256× more Python overhead because each sample requires a separate Python function call","D":"Model B enables kernel fusion, which is only active above batch_size=128"},"correct":"A","explanation":{"correct":"- Modern GPUs (A100: 6912 CUDA cores, roughly 1.5-2 TB/s of HBM bandwidth depending on the variant) are designed for massively parallel computation. For batch_size=1, most GPU cores are idle during a matrix multiply because the single-sample computation doesn't generate enough parallelism to saturate the hardware.\n- For batch_size=256, the matrix multiply (256, features) × (features, out) saturates GPU cores. The wall-clock time for 256 samples is close to that for 1 sample because the samples are processed in parallel: a 180× throughput gain means the batched pass takes only about 256/180 ≈ 1.4× as long as a single-sample pass.\n- The limit: when batch_size × activations_per_sample exceeds GPU VRAM, Out-of-Memory errors occur. Also, beyond the GPU saturation point, each additional sample takes proportionally longer (diminishing returns). The optimal batch size maximizes GPU utilization without exceeding memory.","A":"","B":"Weight matrix caching in L2 is a real effect but explains only a 2-5× speedup in bandwidth-bound operations, not 180×. The dominant effect is GPU parallelism.","C":"Modern deep learning frameworks (PyTorch with CUDA) don't make a Python call per sample during batched forward passes. The overhead is at the batch level, not per sample. 
Python GIL overhead is minimal in GPU-accelerated inference.","D":"Kernel fusion (combining multiple operations into one CUDA kernel) happens based on graph structure and operator implementation, not batch size thresholds."},"reference":"- NVIDIA GPU architecture and parallelism: https://developer.nvidia.com/blog/cuda-pro-tip-understand-fat-binaries/\n- PyTorch performance tuning guide: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04015","difficulty":"hard","orderIndex":15,"question":"You are profiling a model's forward pass and find that a layer with shape (batch=64, seq=512, d=768) takes 3× longer than expected based on FLOP count. The layer is a fully connected layer (nn.Linear). The FLOP count predicts 10ms but it takes 30ms. What is the most likely performance bottleneck and how would you diagnose it?","options":{"A":"The layer has too many parameters — reduce d from 768 to 256 to bring runtime in line with FLOP prediction","B":"The layer is memory-bandwidth bound rather than compute-bound: the weight matrix size (768×768 = 2.4 MB in float32) plus input activations must be read from VRAM on each forward pass. If the arithmetic intensity (FLOPs / bytes of memory access) is below the GPU's roofline, actual throughput is limited by memory bandwidth, not compute","C":"The Python garbage collector is pausing for 20ms during the layer computation to free old tensors","D":"Batch size 64 is too small for this layer's dimensions — the GPU cannot parallelize below batch_size=256"},"correct":"B","explanation":{"correct":"- The roofline model: a GPU has peak FLOP/s and peak memory bandwidth. For a given operation, arithmetic intensity = FLOPs / bytes accessed. 
If arithmetic intensity < (peak FLOP/s / peak bandwidth), the operation is memory-bound — memory access is the bottleneck.\n- For nn.Linear on (64, 512, 768): FLOP count = 2 × 64×512×768×768 ≈ 39 GFLOPs. Memory: weight (768×768×4) = 2.4 MB + activations (64×512×768×4) ≈ 100 MB for the input, plus the same again for the output. Moving ~200 MB at 2 TB/s takes about 100μs, while the FLOPs at 20 TFLOP/s take about 2ms, so on paper the operation is compute-bound. But if actual access patterns cause repeated weight re-reads (e.g., non-contiguous memory), effective bandwidth drops.\n- Diagnosis: use `torch.profiler` or NVIDIA Nsight Compute (the successor to `nvprof`) to check compute vs memory utilization. If GPU compute utilization is low but memory bandwidth is near 100%, the layer is memory-bound.","A":"Reducing d reduces the FLOP count (quadratically here, since both the input and output widths are d). If the operation is memory-bound, reducing FLOPs won't help proportionally, because memory bandwidth remains the bottleneck while the compute units sit idle.","B":"","C":"Python GC does not pause CUDA operations. PyTorch CUDA operations are asynchronous — CUDA streams continue independent of Python GC. PyTorch's CUDA memory manager handles tensor deallocation independently.","D":"While low batch sizes reduce parallelism, nn.Linear with batch=64 and seq=512 means 64×512=32,768 parallel samples being processed — this is typically sufficient to saturate most GPU layers. The \"minimum batch size\" framing oversimplifies GPU utilization."},"reference":"- PyTorch profiler: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html\n- Roofline model: https://developer.nvidia.com/blog/roofline-and-deepspeed/"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05001","difficulty":"easy","orderIndex":1,"question":"A regression model predicts house prices. During training, the MSE loss is 1,000,000 (in squared dollars). A colleague says \"that's a huge loss, the model is failing.\" A senior engineer disagrees. 
Who is correct?","options":{"A":"The colleague is correct — MSE above 1000 always indicates a failing model","B":"The senior engineer is correct — MSE is in squared units of the target variable. If prices are in dollars and predictions are off by ~$1000 on average, MSE ≈ 1,000,000 (1000²). The absolute MSE value is meaningless without context of the target scale; RMSE ($1000 error) or MAPE is more interpretable","C":"The colleague is correct — MSE should always be normalized to [0,1] before training","D":"Both are wrong — MSE is computed in log-space for price prediction and the units are not squared dollars"},"correct":"B","explanation":{"correct":"- MSE = (1/n)Σ(y - ŷ)². If y is in dollars and the average error is $1000, MSE = 1000² = 1,000,000. An MSE of 1,000,000 dollars² corresponds to RMSE = $1,000 — which may be excellent for a $500,000 house (~0.2% error).\n- The key insight: MSE's magnitude is meaningless in isolation. It depends entirely on the scale of the target variable. A model predicting temperatures in Kelvin (range ~250-350) vs dollars (range ~$50,000-$5M) will have MSE values differing by 6 orders of magnitude despite equal prediction quality.\n- RMSE is preferred for interpretability (same units as target), and R² is preferred for scale-independent model quality assessment.","A":"No threshold on MSE indicates a failing model without knowing the target scale. MSE < 0.001 could be catastrophic if targets are in the range [0, 0.0001].","B":"","C":"MSE normalization to [0,1] is not standard practice and would require knowing the maximum possible squared error in advance, which is often undefined for regression.","D":"Predicting in log-space is a common technique for skewed targets like prices, but it is not a default behavior. 
Most regression models train on raw values unless explicitly transformed."},"reference":"- https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05002","difficulty":"easy","orderIndex":2,"question":"A binary classifier outputs probabilities and is trained with Binary Cross-Entropy (BCE) loss. For a positive sample (y=1), the model outputs p=0.01. For a negative sample (y=0), the model outputs p=0.99. Calculate the BCE loss for each case and explain why the loss function behaves this way.","options":{"A":"BCE(y=1, p=0.01) = -log(0.99) ≈ 0.01; BCE(y=0, p=0.99) = -log(0.01) ≈ 4.6 — losses are equal because the errors are symmetric","B":"BCE(y=1, p=0.01) = -log(0.01) ≈ 4.6; BCE(y=0, p=0.99) = -log(1-0.99) = -log(0.01) ≈ 4.6 — both cases are maximally wrong and incur the same loss. BCE applies heavy penalties for confident wrong predictions via the log function","C":"BCE(y=1, p=0.01) = 0.01; BCE(y=0, p=0.99) = 0.01 — BCE is linear in prediction error","D":"BCE(y=1, p=0.01) = -(0.01) = -0.01; the negative sign causes gradient ascent when predictions are wrong"},"correct":"B","explanation":{"correct":"- BCE: L = -[y·log(p) + (1-y)·log(1-p)]. For y=1, p=0.01: L = -log(0.01) = log(100) ≈ 4.605. For y=0, p=0.99: L = -log(1-0.99) = -log(0.01) ≈ 4.605.\n- The log function maps p→0 to L→∞ and p→1 to L→0. A highly confident wrong prediction (p→0 when y=1) incurs an enormous loss — this is the \"penalty for overconfident errors\" property.\n- This property is why cross-entropy trains classifiers to be well-calibrated: the model is penalized not just for being wrong, but for being confidently wrong. An overconfident wrong prediction receives a much larger gradient than an uncertain wrong prediction.","A":"The formula is reversed. BCE(y=1, p) = -log(p), not -log(1-p). 
For y=1, p=0.01: -log(0.01) = 4.6, not -log(0.99) = 0.01.","B":"","C":"BCE is logarithmic, not linear. The log function ensures that predictions near 0 or 1 receive extreme penalties when wrong. Linearity would not penalize overconfidence adequately.","D":"The negative sign in BCE makes the loss positive (log of a probability in (0,1) is negative; negating it makes the loss positive). It does not cause gradient ascent — the loss is positive and minimized by gradient descent."},"reference":"- https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html\n- Bishop, \"Pattern Recognition and Machine Learning\", Chapter 4.3.4"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05003","difficulty":"medium","orderIndex":3,"question":"You train a neural network for a 10-class problem using Cross-Entropy loss. The model achieves 95% training accuracy but you notice the average training loss stopped decreasing after epoch 50 (stuck at 0.15) even though gradients are non-zero. Your colleague suspects overfitting. What is the more likely cause?","options":{"A":"Cross-entropy loss has a lower bound of 0; once accuracy plateaus, loss cannot decrease further","B":"Cross-entropy loss can continue decreasing even when accuracy is 95% — the model could push probabilities for correct classes closer to 1.0, reducing loss. 
Loss stopping while accuracy is stable indicates the model has likely reached the optimizer's minimum for the current configuration (learning rate too high, trapped in a local minimum, or the model capacity is insufficient to perfectly calibrate all samples)","C":"Overfitting always causes training loss to increase, not plateau — the colleague is wrong","D":"Cross-entropy loss plateaus when label smoothing is not applied; adding label smoothing would allow further decrease"},"correct":"B","explanation":{"correct":"- Cross-entropy loss = 0 only when the model outputs probability 1.0 for the correct class on every sample. With 95% accuracy and loss=0.15, the model is still uncertain on many correctly classified samples (it predicts p=0.6 for the correct class, contributing -log(0.6)≈0.51 to loss).\n- Continued loss decrease would require sharper probabilities — the model assigning higher confidence to correct predictions, even if they're already classified correctly.\n- Loss plateau with non-zero gradients suggests: (a) oscillating near a sharp minimum (high learning rate), (b) the model has insufficient capacity to fit remaining hard examples, or (c) the optimizer is stuck. The distinction between \"loss plateau\" and \"overfitting\" is that overfitting shows increasing validation loss, not just plateau.","A":"Cross-entropy's lower bound of 0 is achievable only with perfect, confident predictions. 95% accuracy with 0.15 loss is far from this bound — the loss absolutely can decrease further if the model improves calibration.","B":"","C":"Overfitting causes training loss to keep decreasing (model memorizes) while validation loss increases. A training loss plateau is more likely a sign of optimization difficulty or insufficient capacity, not overfitting.","D":"Label smoothing (replacing hard 0/1 targets with soft targets like 0.9/0.1) would actually raise the loss floor slightly, not allow further decrease. 
It is used to prevent overconfidence, not to enable lower loss."},"reference":"- https://cs231n.github.io/neural-networks-3/#loss (monitoring training)"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05004","difficulty":"medium","orderIndex":4,"question":"A fraud detection model is trained on a dataset where 0.1% of samples are fraud. You use Cross-Entropy loss with default settings and achieve 99.9% accuracy. A business stakeholder says the model is excellent. A data scientist says the model is useless. Who is right and what loss function change would help?","options":{"A":"The stakeholder is right — 99.9% accuracy means only 1 in 1000 predictions is wrong","B":"The data scientist is right — 99.9% accuracy is achieved by predicting \"no fraud\" for every sample (the majority class baseline). Cross-entropy on imbalanced data allows the model to ignore the minority class. Fix: use Focal Loss, class-weighted CE, or resampling","C":"The data scientist is right, but the fix is to lower the classification threshold, not change the loss function","D":"Both are partially right — accuracy is valid but should be supplemented with F1 score; no loss function change is needed"},"correct":"B","explanation":{"correct":"- With 0.1% fraud: a model that always predicts \"not fraud\" achieves 99.9% accuracy. Standard CE loss minimizes the average negative log-probability of the true label over all samples. 99.9% of samples are negative, so the model is incentivized to perfectly classify negatives and can ignore the minority class entirely.\n- Focal Loss (Lin et al., 2017): FL = -αₜ(1-pₜ)ᵞ log(pₜ). The (1-pₜ)ᵞ factor down-weights easy examples (correctly classified majority class with high confidence), so hard examples (often the minority class) carry relatively more of the total loss. 
This forces the model to focus on difficult/rare examples.\n- Alternatives: class-weighted CE (multiply minority class loss by a large factor), SMOTE/oversampling, or selecting checkpoints and thresholds by AUPRC instead of accuracy.","A":"The \"excellent\" claim fails under scrutiny: if every incorrect prediction is a fraud case that went undetected, the model catches 0% of actual fraud. This is the worst possible fraud detector.","B":"","C":"Lowering the classification threshold changes the decision boundary but does not improve the model's learned probability estimates. If the model assigns 0.001 probability to fraud for all samples, no threshold adjustment can make it useful.","D":"F1 score helps evaluate model quality but doesn't fix the loss function problem. If the model outputs 99.9% class 0 probability for all samples, no threshold or evaluation metric change fixes the underlying training failure."},"reference":"- Lin et al., \"Focal Loss for Dense Object Detection\" (RetinaNet): https://arxiv.org/abs/1708.02002\n- https://scikit-learn.org/stable/auto_examples/classification/plot_imbalanced_dataset.html"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05005","difficulty":"medium","orderIndex":5,"question":"You are training a regression model and compare MSE vs Huber loss. During validation, you notice MSE is 50,000 and Huber loss (δ=1.0) is 10.5 for the same predictions. A junior data scientist says \"the Huber loss model is 4700× better.\" What is wrong with this comparison?","options":{"A":"Huber loss and MSE have different units, so they cannot be compared numerically","B":"The two loss values are not comparable because they measure different things — MSE penalizes every error quadratically while Huber is quadratic for |error| < δ and linear beyond. The ratio between the two numbers carries no meaning. 
What matters is which model has lower validation RMSE (or another task-relevant metric), not which has lower absolute loss value","C":"MSE should be divided by sample count; the engineer forgot to average the loss","D":"Huber loss is always smaller than MSE by definition, so comparing them proves nothing"},"correct":"B","explanation":{"correct":"- MSE and Huber loss compute fundamentally different things. MSE = mean of squared errors. Huber with δ=1 = mean of (0.5·e² for |e|<1; |e|-0.5 for |e|≥1). A single outlier with error=100 contributes 10,000 to MSE but only 99.5 to Huber.\n- The 4700× difference doesn't mean the Huber model is 4700× more accurate — it means the two loss scales are incomparable. The MSE model might actually generalize better despite higher Huber loss.\n- To compare models, use the same metric for both: RMSE, MAE, or R² — any metric that doesn't change between models.","A":"MSE is in squared target units, while Huber is quadratic near zero and linear (original target units) in its tail, so the units do differ. That is a symptom rather than the root cause: the two functions weight errors differently, which option B states more completely and precisely.","B":"","C":"Both losses should be averaged over samples. This doesn't explain why the values are orders of magnitude apart — that difference comes from the different functional forms, not averaging.","D":"The premise is true for a fixed set of errors, but the conclusion misses the real problem. For small errors (|e| < δ), Huber = 0.5e² ≤ e² = MSE contribution (smaller). For large errors, Huber = linear (smaller than quadratic). 
So Huber ≤ MSE for the same errors, but this isn't the point of the question."},"reference":"- Huber, P.J., \"Robust Estimation of a Location Parameter\" (1964)\n- https://pytorch.org/docs/stable/generated/torch.nn.HuberLoss.html"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05006","difficulty":"medium","orderIndex":6,"question":"A multi-label image classification model (an image can have multiple tags simultaneously) uses Cross-Entropy loss with softmax. After training, the model can only ever predict one tag per image even though images clearly have multiple. What is the root cause?","options":{"A":"The model needs more capacity to predict multiple outputs; increase the hidden layer size","B":"Softmax + Cross-Entropy forces the model to treat the problem as single-label (exactly one class is correct). Softmax normalizes probabilities to sum to 1, which correctly represents \"pick one\" but incorrectly models multi-label problems. Fix: use sigmoid per output with Binary Cross-Entropy for each label independently","C":"The learning rate is too high, causing the model to overfit to the most frequent label","D":"Multi-label classification requires a special loss function that sums over all correct labels; Cross-Entropy sums instead of averaging"},"correct":"B","explanation":{"correct":"- Softmax output: probabilities sum to 1.0. This probabilistic simplex constraint is exactly right for \"exactly one class is true\" (single-label). For multi-label problems, multiple classes can simultaneously be \"on\" — a photo can be both \"dog\" and \"outdoor.\"\n- When trained with softmax + CE, the model learns a probability distribution over classes — it learns to \"spend\" its probability budget on the most likely single class. 
Other classes get near-zero probability even if they are also correct.\n- Fix: use sigmoid independently per output (each output in (0,1) independently) + Binary Cross-Entropy for each label. This way, each label has its own independent probability, allowing any combination of labels.","A":"Capacity is not the issue. Even a very large model trained with softmax+CE will produce single-label predictions because the loss function fundamentally trains it to do so.","B":"","C":"Learning rate affects convergence speed but not the structural single-label vs multi-label behavior. This behavior appears at any learning rate.","D":"Cross-Entropy can be adapted for multi-label problems (summing BCE across labels), but the issue described is specifically about softmax forcing single-label outputs, not about CE's summation behavior."},"reference":"- https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html\n- https://cs231n.github.io/linear-classify/#softmax"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05007","difficulty":"medium","orderIndex":7,"question":"You train two models on the same regression task. Model A uses MSE loss and achieves RMSE=100 on the test set. Model B uses MAE loss and achieves MAE=80 on the test set. A product manager asks \"which model is better?\" What is the correct answer and what additional information would you need?","options":{"A":"Model B is better because 80 < 100","B":"You cannot directly compare these models without knowing their errors under the same metric. MSE-trained models minimize squared errors (penalizing outliers heavily), while MAE-trained models minimize absolute errors (treating outliers and small errors equally). 
RMSE=100 and MAE=80 are not on the same scale — compute both metrics for both models on the test set","C":"Model A is better because RMSE is the industry standard metric for regression","D":"The comparison is valid because RMSE and MAE have the same units; 80 < 100 means Model B is better"},"correct":"B","explanation":{"correct":"- RMSE and MAE both have the same units as the target, but RMSE ≥ MAE always (by Cauchy-Schwarz inequality). A typical relationship is RMSE ≈ 1.0-1.5× MAE for mildly skewed error distributions, and RMSE >> MAE when there are outliers.\n- Model A might have MAE=70 (better than Model B on absolute error) and high RMSE=100 due to a few large outliers. Model B might have no outliers at all. Comparing RMSE of one model to MAE of another is meaningless.\n- To compare: compute RMSE and MAE for both models on the same test set. Choose based on the business metric that matters: if outliers are costly (e.g., safety-critical predictions), prefer MSE-trained model with lower RMSE; if all errors are equal cost, prefer MAE-trained model.","A":"80 < 100 compares numbers but ignores that they measure different things. This is like comparing a weight in kilograms to a distance in miles and concluding \"5 miles > 3 kg.\"","B":"","C":"No single metric is \"the industry standard\" — it depends on the application. MSE/RMSE are common but MAE is preferred in many domains (economics, finance) where outliers are not penalized differently.","D":"Same units does not mean same scale. RMSE is always ≥ MAE for the same set of predictions. 
Comparing the RMSE of one model to the MAE of another mixes two different summary statistics computed on two different sets of errors."},"reference":"- Chai & Draxler, \"Root mean square error (RMSE) or mean absolute error (MAE)?\" (2014): https://gmd.copernicus.org/articles/7/1247/2014/"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05008","difficulty":"hard","orderIndex":8,"question":"A language model is trained with Cross-Entropy loss on next-token prediction. Validation perplexity is 45. A second model trained on the same data achieves perplexity 30. A researcher claims \"Model 2 is 33% better.\" What is the correct interpretation of the perplexity difference and why is \"33% better\" misleading?","options":{"A":"The researcher is correct — (45-30)/45 = 33% improvement in perplexity","B":"Perplexity is exponential in cross-entropy loss: PPL = exp(H) where H is cross-entropy. The difference PPL₁ - PPL₂ = 15 in perplexity corresponds to a difference of ln(45) - ln(30) ≈ 0.405 nats in cross-entropy — a meaningful but not \"33%\" improvement. Perplexity differences are not linearly comparable; log-likelihood or bits-per-token are better for arithmetic comparisons","C":"The 33% improvement claim is correct but only applies to vocabulary size > 30,000","D":"Perplexity differences don't indicate model quality; only BLEU score matters for language models"},"correct":"B","explanation":{"correct":"- Perplexity = exp(cross-entropy), with cross-entropy measured in nats. Going from PPL=45 to PPL=30 means reducing cross-entropy from ln(45)≈3.807 to ln(30)≈3.401, a reduction of ln(1.5)≈0.405 nats (or ≈0.585 bits). The actual CE improvement is ~10.7%, not 33%.\n- Perplexity is on an exponential scale. \"33% lower perplexity\" sounds like a large improvement, but on the underlying information-theoretic scale, it may be modest. 
Conversely, going from PPL=5 to PPL=4 (20% reduction) represents the same CE improvement as PPL=100 to PPL=80 (also 20%), but the lower-perplexity improvement is much harder to achieve.\n- Best practice: report cross-entropy in nats or bits-per-token for arithmetic comparisons; perplexity for intuitive interpretation (perplexity ≈ average branching factor at each prediction step).","A":"Arithmetic on perplexity values is misleading because perplexity is on an exponential scale. A 33% reduction in perplexity does not correspond to 33% better predictions in any information-theoretic sense.","B":"","C":"Perplexity interpretation doesn't change based on vocabulary size. The vocabulary size affects the range of reasonable perplexity values (max PPL = vocab_size for a uniform distribution), but arithmetic comparisons remain equally valid/invalid regardless of vocabulary size.","D":"BLEU score is a task-specific metric for translation and text generation. Perplexity is a valid and widely used metric for language model quality — it directly measures how well the model's probability distribution matches the test data."},"reference":"- Brown et al., \"Language Models are Few-Shot Learners\" (GPT-3): https://arxiv.org/abs/2005.14165 (perplexity reporting)\n- Jurafsky & Martin, \"Speech and Language Processing\", Chapter 3 (perplexity definition)"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05009","difficulty":"hard","orderIndex":9,"question":"A team trains an object detection model with Focal Loss (γ=2, α=0.25). They observe that after 10,000 steps, the model detects common objects (cars, people) well but almost completely ignores rare objects (fire hydrants, parking meters). They increase γ from 2 to 5, expecting Focal Loss to further down-weight the easy majority class. 
What is likely to happen and why?","options":{"A":"Higher γ will further focus on hard examples, solving the rare object problem","B":"Increasing γ too much will cause the model to focus on the hardest samples in the training set — which may include mislabeled examples, extremely occluded instances, and noise — rather than the rare objects. The model may degrade on common objects without improving on rare ones","C":"Higher γ has no effect above γ=2; Focal Loss saturates at γ=2","D":"The fix is to reduce γ to 0, which is standard cross-entropy and treats all examples equally"},"correct":"B","explanation":{"correct":"- Focal Loss: FL = -αₜ(1-pₜ)ᵞ log(pₜ). At γ=5, the weight factor (1-pₜ)⁵ becomes extremely small for easy examples (pₜ=0.9: weight = 0.1⁵ = 0.00001) and dominates for hard examples (pₜ=0.1: weight = 0.9⁵ = 0.59).\n- The problem: \"hard examples\" include rare objects but also mislabeled data, extremely occluded objects, and ambiguous cases. At γ=5, these noisy hard examples get enormous weight relative to clean easy examples, potentially causing the model to overfit to noise.\n- In practice, the original Focal Loss paper (Lin et al.) found γ=2 works well across different detection tasks. The rare object problem is better solved by class-balanced sampling, augmentation, or the α parameter, not extreme γ values.","A":"This is the naive expectation but ignores the noise amplification problem. The hardest examples are not necessarily the most informative ones — they may be genuinely ambiguous or mislabeled.","B":"","C":"Focal Loss does not saturate at γ=2. The function (1-pₜ)ᵞ continues to change with γ. The choice of γ=2 as default is empirical, not a mathematical saturation point.","D":"γ=0 gives standard CE (no example weighting). 
This would make the rare object problem worse, not better, by treating all examples equally in a class-imbalanced dataset."},"reference":"- Lin et al., \"Focal Loss for Dense Object Detection\" (2017): https://arxiv.org/abs/1708.02002 (Section 4: ablation study on γ)"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05010","difficulty":"hard","orderIndex":10,"question":"You are training a generative model and need to measure how close the model's output distribution P is to the true data distribution Q. A teammate proposes using KL divergence D_KL(P||Q). A researcher argues D_KL(Q||P) is more appropriate for this use case. What is the concrete behavioral difference between these two directions?","options":{"A":"The two directions are mathematically identical; KL divergence is symmetric","B":"D_KL(P||Q) = Σ P(x)·log(P(x)/Q(x)) is an expectation under the model P. If P puts mass where Q is zero (the true data has no examples there), the KL is infinite, so P is pushed to stay inside Q's support and may concentrate on only some of Q's modes (mode-seeking). D_KL(Q||P) is an expectation under the data Q: it penalizes P wherever Q is high but P is low, pushing P to cover all of Q's support even if that spreads mass into regions where Q is small (mode-covering)","C":"D_KL(Q||P) requires the model distribution P to be differentiable, while D_KL(P||Q) works with any distribution","D":"The difference is only relevant for discrete distributions; for continuous generative models, both directions produce identical training dynamics"},"correct":"B","explanation":{"correct":"- D_KL(Q||P) = Σ Q(x)·log(Q(x)/P(x)) (data first; often called forward or \"inclusive\" KL): wherever Q>0, P must be >0, otherwise the divergence is infinite. The model must \"cover\" all regions of the true distribution. This is the direction that Maximum Likelihood Estimation minimizes, and it leads to mode-covering behavior: the model spreads mass broadly to not miss any mode of the data.\n- D_KL(P||Q) (model first; often called reverse or \"exclusive\" KL): wherever P>0, Q must be >0. 
Penalizes the model for placing probability where the data has little or none. Leads to mode-seeking behavior — the model picks one or a few modes and concentrates mass there.\n- This distinction is a common explanation for why likelihood-trained models such as VAEs (whose ELBO bounds the mode-covering objective) tend toward blurry samples, while GANs (which optimize a different divergence and in practice often behave in a mode-seeking way) produce sharp samples but can miss modes.","A":"KL divergence is NOT symmetric. D_KL(P||Q) ≠ D_KL(Q||P) in general. This is a fundamental property of KL divergence. The symmetric version is Jensen-Shannon divergence: JS(P,Q) = 0.5·D_KL(P||M) + 0.5·D_KL(Q||M) where M = 0.5(P+Q).","B":"","C":"Both formulations require the distributions to be smooth enough for gradient computation. The differentiability requirement is the same for both directions — it's determined by the parameterization of the model, not the direction of KL.","D":"The mode-seeking vs mode-covering distinction is equally relevant for continuous distributions. It manifests as blurry vs sharp image generation in VAEs vs GANs."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 3.13 (KL divergence)\n- Wainwright & Jordan, \"Graphical Models, Exponential Families, and Variational Inference\" (2008)"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05011","difficulty":"medium","orderIndex":11,"question":"A model predicts whether a loan application should be approved. The business requires: \"false negatives (approved loans that default) cost 10× more than false positives (rejected loans that would have been fine).\" You train with standard BCE loss but the model performs suboptimally on this cost metric. 
What change to the loss function captures this asymmetric cost?","options":{"A":"Use MSE loss which automatically weights false negatives more heavily than false positives","B":"Use class-weighted BCE loss where the weight for the positive class (defaulters) is set to 10: loss = -[10·y·log(p) + (1-y)·log(1-p)]. This multiplies the gradient for positive-class errors (false negatives) by 10×, training the model to prioritize avoiding them","C":"Apply a threshold of 0.1 instead of 0.5 during inference; no loss function change is needed","D":"Use a custom loss that penalizes predictions where p > 0.5 for negative class samples by 10×"},"correct":"B","explanation":{"correct":"- Weighted BCE: when a positive sample (defaulter, y=1) is misclassified, the loss contribution is 10× larger than for a negative sample (safe borrower, y=0). This directly reflects the business cost asymmetry in the optimization objective.\n- The class weight effectively says: \"the model should work 10× harder to correctly classify defaulters.\" The gradient for false-negative errors (missed defaulters) is 10× larger than for false-positive errors, pushing the decision boundary toward lower false-negative rates.\n- This is distinct from threshold adjustment (option C) because it changes the learned probability distribution, not just the post-hoc decision rule.","A":"MSE for binary classification does not inherently weight false negatives differently. MSE treats all prediction errors symmetrically regardless of label value.","B":"","C":"Threshold adjustment changes which predictions are labeled \"approve\" vs \"deny\" after training. However, the model's learned probabilities are still optimized for equal-cost errors. A well-calibrated model trained with cost-weighted loss will produce better probability estimates for the asymmetric cost structure.","D":"Penalizing high-probability negative predictions during training doesn't model the cost asymmetry correctly. 
The asymmetry is about error severity (missing a defaulter vs rejecting a good borrower), not about prediction confidence for negative samples."},"reference":"- scikit-learn class_weight parameter: https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05012","difficulty":"easy","orderIndex":12,"question":"Cross-entropy loss is defined as CE = -Σᵢ yᵢ·log(pᵢ). For a 5-class problem with true label class 3 and predicted probabilities [0.1, 0.1, 0.6, 0.1, 0.1], what is the cross-entropy loss, and what would it be if the model had output [0.01, 0.01, 0.96, 0.01, 0.01]?","options":{"A":"CE₁ = 0.6, CE₂ = 0.96 — CE equals the predicted probability of the correct class","B":"CE₁ = -log(0.6) ≈ 0.511, CE₂ = -log(0.96) ≈ 0.041 — CE is the negative log of the correct class probability; higher confidence on the correct class means lower loss","C":"CE₁ = -Σ log(pᵢ) over all classes ≈ -5·log(0.2) ≈ 8.05 for both, since the sum is over uniform distribution","D":"CE₁ = 1 - 0.6 = 0.4, CE₂ = 1 - 0.96 = 0.04 — CE equals 1 minus the correct class probability"},"correct":"B","explanation":{"correct":"- For one-hot labels, CE = -Σᵢ yᵢ·log(pᵢ) = -1·log(p_correct) (all other yᵢ = 0). CE reduces to just the negative log probability of the correct class.\n- CE₁ = -log(0.6) ≈ 0.511. CE₂ = -log(0.96) ≈ 0.041. The model in case 2 is much more confident and correct, incurring 12× lower loss.\n- This is why maximizing log-likelihood and minimizing cross-entropy are equivalent for classification: you're directly maximizing the log probability assigned to the correct class.","A":"CE = probability would make the loss a linear function of confidence. CE = -log(p) creates an asymmetric penalty: going from p=0.5 to p=1.0 reduces loss by log(2)≈0.69, while going from p=0.01 to p=0.5 reduces loss by log(50)≈3.9. 
The log ensures large penalties for very wrong confident predictions.","B":"","C":"CE uses the true label distribution (one-hot), not a uniform distribution. The sum over all classes collapses to one term because only the correct class has yᵢ = 1.","D":"CE = 1-p would be a linear loss function. The logarithm in CE provides the desirable property of infinite loss for p=0 (completely wrong confident prediction) and zero loss for p=1 (perfectly confident correct prediction)."},"reference":"- https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05013","difficulty":"hard","orderIndex":13,"question":"A team is training a knowledge distillation model where a student network is trained to match a teacher network's output probability distribution. They use Cross-Entropy between the student's softmax outputs and the teacher's softmax outputs (soft targets). The teacher uses temperature T=1 for soft targets. The student trains poorly — it collapses to predicting the same distribution as using hard one-hot labels. What is likely missing?","options":{"A":"Knowledge distillation requires a different optimizer than standard training","B":"The teacher's probabilities at T=1 are near one-hot (e.g., [0.99, 0.003, 0.003, ...]) — the soft targets barely differ from hard labels. Temperature scaling (T=3-5) should be applied to the teacher's logits before softmax to produce softer, more informative distributions that reveal the teacher's \"dark knowledge\" about class relationships","C":"The student network must have the same architecture as the teacher for knowledge distillation to work","D":"Cross-Entropy is inappropriate for distillation; KL divergence must be used instead"},"correct":"B","explanation":{"correct":"- A trained teacher network typically produces very confident predictions: softmax([10, 0.1, 0.1, ...]) ≈ [0.9999, 0.00005, 0.00005, ...]. 
At T=1, soft targets are nearly identical to hard one-hot labels, providing no additional information.\n- Temperature scaling: teacher_probs = softmax(logits/T). At T=4, taking just three logits 10, 0.1, 0.1: softmax([2.5, 0.025, 0.025]) ≈ [0.86, 0.07, 0.07], which is much softer. With real logits the non-target classes are not tied, so the softened distribution shows which wrong classes the teacher considers more plausible than others, even though class 1 is most likely. This \"dark knowledge\" encodes learned similarity between classes.\n- Hinton et al. (2015) used temperature T=3-20 in their original distillation work. The typical loss is a combination: L = α·CE(student, hard_labels) + (1-α)·T²·KL(teacher_soft || student_soft), where the T² factor keeps the soft-target gradients on a comparable scale as T changes.","A":"Knowledge distillation uses standard optimizers (Adam, SGD). No special optimizer is required.","B":"","C":"Knowledge distillation is specifically designed to work with different architectures (small student, large teacher). Same architecture is not a requirement and defeats the purpose of compression.","D":"KL divergence and CE on soft targets differ by only a constant when the targets are fixed (KL = CE - H(targets)). For distillation purposes, they are functionally equivalent. The issue is not the loss function but the temperature of the teacher's softmax."},"reference":"- Hinton et al., \"Distilling the Knowledge in a Neural Network\" (2015): https://arxiv.org/abs/1503.02531"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05014","difficulty":"hard","orderIndex":14,"question":"You are training a regression model to predict protein structure coordinates (x, y, z in Angstroms). A senior researcher insists on using Huber loss with δ=1.0 instead of MSE. A junior researcher argues: \"In protein structure, there are no outliers — all measurements are precise crystallography data. 
Huber loss is unnecessary.\" Who is right?","options":{"A":"The junior researcher is right — Huber loss is only needed for datasets with measurement noise and outliers","B":"The senior researcher may be right for a different reason: even with precise measurements, some residues (protein sub-units) are structurally flexible and genuinely have multiple valid conformations. Predictions for these residues will always have high error regardless of model quality. MSE would penalize these inherently uncertain residues 10-1000× more than other residues, distorting learning. Huber loss reduces their influence","C":"Huber loss is never appropriate for coordinate regression; use MSE always","D":"The junior researcher is right for crystallography data but wrong for cryo-EM data"},"correct":"B","explanation":{"correct":"- \"Outliers\" in loss function context means \"data points with large residuals\" — not necessarily measurement errors. Flexible protein loops and disordered regions produce large prediction errors by nature (the true structure exists in an ensemble of conformations).\n- With MSE, these structurally ambiguous residues produce squared errors of 100-10,000 Å² compared to 1-4 Å² for well-structured regions. The model spends disproportionate gradient effort on hard-to-predict flexible regions at the expense of learning well-structured regions.\n- Huber loss with appropriate δ caps the influence of flexible residues, allowing the model to learn structured regions without being dominated by inherently ambiguous ones. 
AlphaFold2's main structural loss (FAPE) takes a related approach: it clamps large per-position errors so that their influence on training stays bounded.","A":"This conflates \"outlier as measurement error\" with \"outlier as prediction difficulty.\" The definition of \"outlier\" for loss function purposes is a sample with disproportionately large residual, regardless of cause.","B":"","C":"Huber loss is widely used in coordinate regression tasks including object detection (bounding box regression uses smooth L1, which is Huber with δ=1), robot control, and molecular modeling.","D":"The same argument applies to cryo-EM data and crystallography. Both have flexible/disordered regions. The structural biology challenge (multiple conformations) is independent of the measurement technique."},"reference":"- Jumper et al., \"Highly accurate protein structure prediction with AlphaFold\" (2021): https://www.nature.com/articles/s41586-021-03819-2\n- Object detection uses smooth L1 (Huber): https://arxiv.org/abs/1504.08083"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05015","difficulty":"hard","orderIndex":15,"question":"You train a variational autoencoder (VAE) and observe that the reconstruction loss decreases steadily but the KL divergence term collapses to near zero from the first epoch. The generated samples have high quality but show no diversity — all sampled images are nearly identical. What is this phenomenon and what causes it?","options":{"A":"The model is overfitting to the training data; reduce the number of parameters","B":"This is \"posterior collapse\" — the encoder ignores the input and maps all inputs to the prior N(0,I). The decoder learns to generate without using the latent code (from the prior alone). Mathematically: minimizing KL(q(z|x) || p(z)) pushes q toward the prior; if the decoder is powerful enough to reconstruct without z, the model collapses to doing exactly that. 
Fix: β-VAE (reduce KL weight) or KL annealing (gradually increase KL weight during training)","C":"The KL term collapsing to zero means the model has perfectly learned the posterior; this is the ideal training outcome","D":"Posterior collapse is caused by a learning rate that is too high; reduce the learning rate to prevent the KL from collapsing early"},"correct":"B","explanation":{"correct":"- VAE objective: maximize ELBO = E[log p(x|z)] - KL(q(z|x)||p(z)). The reconstruction term incentivizes using z; the KL term incentivizes q to be close to the prior (where z is uninformative).\n- If the decoder is powerful (e.g., an autoregressive decoder that can model all of p(x) without conditioning on z), it will generate correctly even when z is sampled from the prior with no x-specific information. The encoder then has no incentive to encode x into z, so it collapses to outputting the prior.\n- Fix options: (1) β-VAE: multiply KL by β < 1 to reduce its weight; (2) KL annealing: start with KL weight=0, slowly increase to 1 over training; (3) less powerful decoder (e.g., use a simple decoder that needs z to reconstruct); (4) free bits: guarantee minimum KL per dimension.","A":"Overfitting would cause high reconstruction accuracy on training data and poor on validation — the model would use latent codes to memorize training samples. Posterior collapse shows identical outputs regardless of input, which is the opposite: the latent code is unused.","B":"","C":"KL(q||p) = 0 means q exactly equals the prior for all inputs. This means the encoder has learned to output N(0,I) regardless of input x — z contains no information about x. The ideal training outcome is q being close to (but not equal to) the prior while also encoding x-specific information.","D":"Learning rate affects the rate of convergence but not which equilibrium the model converges to. With a powerful decoder, the posterior collapse equilibrium is a stable local minimum. 
No learning rate setting prevents convergence to it once the model discovers this shortcut."},"reference":"- Lucas et al., \"Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse\" (2019): https://arxiv.org/abs/1911.02469\n- Higgins et al., \"β-VAE\" (2017): https://openreview.net/forum?id=Sy2fchgIW"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06001","difficulty":"easy","orderIndex":1,"question":"A neural network has 3 layers. You compute the forward pass successfully but during backpropagation, the gradient for the first layer is exactly zero for all weights. The loss is non-zero and the last layer's gradient is correct. What is the most likely cause?","options":{"A":"The first layer's weights are initialized to zero, causing zero gradients","B":"One of the intermediate activation functions (e.g., ReLU) has zero gradient for all inputs in that batch, effectively cutting off gradient flow. The chain rule multiplies gradients across layers — a zero at any layer zeroes all gradients to earlier layers","C":"The learning rate is too small, causing gradients to round to zero in float32","D":"Backpropagation only updates the last two layers by default in PyTorch; the first layer requires a separate optimizer call"},"correct":"B","explanation":{"correct":"- Chain rule in backpropagation: ∂L/∂W₁ = ∂L/∂a₂ · ∂a₂/∂z₂ · ∂z₂/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂W₁. If any term is zero (e.g., ∂a₁/∂z₁ = 0 because all neurons in layer 1 are dead ReLU), the entire product is zero.\n- This is the gradient \"cut\" — a zero in the chain rule propagates leftward and zeroes all earlier layers' gradients. The last layer is unaffected because its gradients don't depend on the earlier zero term.\n- Dead ReLU (all neurons with z<0) is the most common cause. 
Other causes: sigmoid saturated to 0 or 1 for all inputs, or a custom activation with zero derivative.","A":"Zero weight initialization causes symmetric gradients (all neurons compute the same thing) but not zero gradients — the gradients are identical across neurons but non-zero. The symmetry problem prevents specialization but doesn't zero the gradients.","B":"","C":"Float32 has ~7 decimal digits of precision. A gradient would need to be smaller than ~1e-38 (near underflow) to appear as zero. Learning rate affects the weight update magnitude, not the gradient magnitude itself.","D":"PyTorch backpropagation computes gradients for all parameters with requires_grad=True, including all layers. There is no \"default\" that stops at layer 2."},"reference":"- https://cs231n.github.io/optimization-2/ (computational graphs and chain rule)"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06002","difficulty":"easy","orderIndex":2,"question":"You are explaining backpropagation to a junior engineer. She asks: \"Why do we need to store intermediate activations during the forward pass? Can't we just recompute them during the backward pass?\" What is the correct technical response?","options":{"A":"We cannot recompute activations because PyTorch deletes the computation graph after the forward pass","B":"Recomputing activations is possible (gradient checkpointing does exactly this) but it trades memory for compute — storing activations avoids recomputing, but requires O(depth) memory. The choice depends on whether the bottleneck is memory or compute","C":"Intermediate activations must be stored because backpropagation requires them as inputs to the chain rule gradient computation (∂L/∂W depends on the activation value at that layer). 
Without storage, you'd have to redo the entire forward pass for every layer during backward","D":"Activations are stored in GPU VRAM automatically and cannot be freed until the next batch"},"correct":"C","explanation":{"correct":"- The gradient of a weight matrix W in layer k: ∂L/∂Wₖ = δₖ · aₖ₋₁ᵀ, where δₖ is the error signal from the next layer and aₖ₋₁ is the activation from the previous layer. Both are required.\n- Without stored activations, computing ∂L/∂W requires knowing the activation value at that layer — which can only be obtained by re-running the forward pass up to that point.\n- B is also technically correct (gradient checkpointing recomputes activations), but C is the fundamental reason activations are stored by default: correctness and efficiency. Gradient checkpointing is an optional memory optimization.","A":"PyTorch does keep the computation graph until `.backward()` is called. After calling `.backward()`, the graph is freed (unless `retain_graph=True`). The graph is not deleted immediately after the forward pass.","B":"","C":"","D":"Activations stored in PyTorch's computation graph can be freed at any time by calling `.detach()` or by not retaining the graph. They are not permanently locked in VRAM. Gradient checkpointing explicitly frees them during the forward pass."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 6.5 (Back-Propagation and Other Differentiation Algorithms)\n- PyTorch gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06003","difficulty":"medium","orderIndex":3,"question":"You implement backpropagation manually for a 2-layer network and compare gradients to PyTorch's autograd. For the second layer, your gradients match exactly. For the first layer, yours are consistently 10× larger than PyTorch's. You didn't make an arithmetic error. 
What is the most likely source of the discrepancy?","options":{"A":"PyTorch normalizes gradients by the number of layers during backpropagation","B":"PyTorch's default reduction in loss functions is 'mean' — if your manual implementation used 'sum' instead of averaging over the batch, your gradients would be batch_size× larger. If batch_size=10, your gradients would be 10× PyTorch's","C":"PyTorch clips gradients to prevent explosion; your manual implementation lacks this clipping","D":"The first layer has more parameters than the second, causing larger gradients in your manual implementation"},"correct":"B","explanation":{"correct":"- PyTorch's `nn.CrossEntropyLoss`, `nn.MSELoss`, etc. default to `reduction='mean'` — dividing the total loss by the batch size. If your manual implementation sums losses over the batch without dividing by batch size, all gradients are batch_size× larger.\n- For the second layer, the gradient magnitude matches because you may have implemented it correctly. The discrepancy in the first layer (10×) suggests a batch-size factor is applied somewhere between your second and first layer computation — likely in how the delta (error signal) is computed.\n- This is one of the most common bugs when implementing backprop manually: confusing `sum` and `mean` reduction, leading to incorrect learning rates for the actual gradient scale.","A":"PyTorch does not normalize gradients by number of layers. Gradients are computed via chain rule and may vary in magnitude by layer, but there is no normalization step.","B":"","C":"PyTorch does NOT clip gradients by default. Gradient clipping (`torch.nn.utils.clip_grad_norm_`) must be called explicitly. It would reduce gradient magnitude, not increase it by 10×.","D":"The number of parameters in a layer doesn't affect gradient magnitude for individual weights. 
Gradient magnitude depends on the error signal and activation values, not the weight count."},"reference":"- PyTorch loss reduction parameter: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06004","difficulty":"medium","orderIndex":4,"question":"You train a neural network and after 1000 steps, the training loss is still at its initial value. You inspect gradients and find they are all NaN. What sequence of events most likely caused gradient NaN, and what debugging steps would you take first?","options":{"A":"NaN gradients are caused by setting the learning rate to exactly 0.001; use 0.0001 instead","B":"NaN gradients typically trace back to a NaN loss, which traces back to NaN activations, which traces back to either: (1) a NaN in the input data, (2) log(0) or 0/0 in the loss function (e.g., log(p) when p=0 from ReLU output fed to softmax with all-zero logits), or (3) exploding activations that overflow to infinity then produce 0/0. Debug: check inputs for NaN, add `torch.autograd.set_detect_anomaly(True)`, inspect intermediate activation norms","C":"NaN gradients always indicate a memory overflow; reduce batch size","D":"NaN gradients indicate the model has converged to a saddle point where gradients are undefined"},"correct":"B","explanation":{"correct":"- NaN propagates: NaN input → NaN activations → NaN loss → NaN gradients. The source is almost always upstream of where you observe NaN.\n- Common specific causes: (1) `torch.log(tensor)` where tensor contains 0 (log(0)=-inf, and inf-inf=NaN); (2) 0/0 from division with a near-zero denominator (e.g., LayerNorm with zero-variance inputs); (3) overflow from too-large activations (activation → inf, then inf × 0 = NaN in gradient).\n- `torch.autograd.set_detect_anomaly(True)` adds hooks that identify the exact operation that first produced NaN, printing a stack trace. 
This is the recommended first debugging step.","A":"Learning rate value does not cause NaN gradients unless combined with an exploding gradient scenario where the loss landscape has extreme curvature. The learning rate itself is just a scalar multiplier.","B":"","C":"Memory overflow produces an out-of-memory (OOM) error, not NaN. NaN results from mathematical operations like 0/0, ∞-∞, or log(0). Memory issues and NaN are distinct failure modes.","D":"Saddle points have non-zero gradients in most dimensions. True saddle points (zero gradient in all directions) would produce zero, not NaN. A \"flat\" region would give zero gradients; NaN requires an illegal mathematical operation."},"reference":"- PyTorch anomaly detection: https://pytorch.org/docs/stable/autograd.html#anomaly-detection"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06005","difficulty":"medium","orderIndex":5,"question":"A team implements a custom layer that uses a non-differentiable operation (argmax) in the forward pass. During backpropagation, PyTorch raises an error because argmax has no gradient. They ask: \"Can we still train with backpropagation?\" What are the two main approaches?","options":{"A":"No — non-differentiable operations fundamentally prevent gradient-based training","B":"Yes — two approaches: (1) Straight-Through Estimator (STE): pass the gradient through the argmax as if it were an identity function (∂argmax/∂input ≈ 1), accepting the approximation. 
(2) Gumbel-Softmax: replace argmax with a differentiable soft approximation (temperature-controlled softmax) during training, use hard argmax during inference","C":"Yes — replace argmax with a sigmoid function which is a differentiable proxy","D":"Yes — use numerical differentiation (finite differences) to estimate gradients for the argmax layer"},"correct":"B","explanation":{"correct":"- STE (Hinton, 2012; Bengio et al., 2013): during backward pass, treat the non-differentiable operation as identity (∂output/∂input = 1). This is a biased estimator (it works empirically despite being mathematically incorrect) and is the basis of training quantized neural networks (QNNs).\n- Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017): use softmax((log(π) + Gumbel_noise)/τ) during training (differentiable), approach hard one-hot as τ→0 for inference. Used in discrete VAEs.\n- Both are in active production use: STE in training binary/ternary networks, Gumbel-Softmax in discrete latent variable models.","A":"While argmax is non-differentiable, there are well-established workarounds used in production. The field has extensive work on training through discrete operations.","B":"","C":"Sigmoid produces values in (0,1) for a single output, not a one-hot selection across options. Sigmoid is appropriate for binary gates but not for multi-way selection like argmax.","D":"Numerical differentiation (finite differences) is extremely expensive for neural networks with millions of parameters (requires N forward passes for N parameters).
It is used for gradient checking, not for training."},"reference":"- Bengio et al., \"Estimating or Propagating Gradients Through Stochastic Neurons\" (STE, 2013): https://arxiv.org/abs/1308.3432\n- Jang et al., \"Categorical Reparameterization with Gumbel-Softmax\" (2017): https://arxiv.org/abs/1611.01144"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06006","difficulty":"medium","orderIndex":6,"question":"You compute gradients for a 10-layer network and plot gradient norm per layer. Layer 10 (closest to loss): norm = 1.0. Layer 5: norm = 0.01. Layer 1: norm = 0.0001. After adding BatchNorm after every 2 layers, the gradient norms become approximately equal across all layers. Why does BatchNorm have this effect on gradient magnitudes?","options":{"A":"BatchNorm clips gradients to be equal across layers","B":"BatchNorm normalizes pre-activations to zero mean and unit variance at each layer. This prevents the forward pass activations from shrinking/growing exponentially, which in turn prevents the backward pass gradients from shrinking/growing exponentially via the chain rule","C":"BatchNorm adds trainable skip connections that provide direct gradient paths to early layers","D":"BatchNorm reduces the learning rate for deep layers, compensating for otherwise smaller gradients"},"correct":"B","explanation":{"correct":"- The root cause of gradient decay is that each layer multiplies gradients by the Jacobian ∂aₖ/∂aₖ₋₁. If activations have small magnitude (common without normalization), the Jacobian entries are small, and gradients decay across layers.\n- BatchNorm normalizes activations to N(0,1) after each layer. 
This prevents the covariate shift (activation distribution shift) that causes Jacobian magnitudes to vary wildly, keeping gradient flow more stable.\n- More precisely: BatchNorm's γ and β parameters, combined with the normalization, effectively scale the gradient flow to be approximately 1.0 per layer, preventing the multiplicative decay.","A":"BatchNorm does not clip gradients. Gradient clipping is a separate technique (clip_grad_norm). BatchNorm's effect on gradients is through the normalization of forward activations, not through explicit gradient manipulation.","B":"","C":"BatchNorm does not add skip connections. Skip connections (ResNets) are an architectural choice. BatchNorm is an in-place normalization operation that does not create new paths in the computational graph between non-adjacent layers.","D":"BatchNorm does not modify the optimizer's learning rate. The learning rate is a hyperparameter of the optimizer. BatchNorm's effect on gradient uniformity comes from activation normalization, not learning rate scheduling."},"reference":"- Ioffe & Szegedy, \"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift\" (2015): https://arxiv.org/abs/1502.03167\n- Santurkar et al., \"How Does Batch Normalization Help Optimization?\" (2018): https://arxiv.org/abs/1805.11604"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06007","difficulty":"hard","orderIndex":7,"question":"You are training a recurrent network (vanilla RNN) on sequences of length 500. You observe that gradient norms for the hidden state at step t=490 are approximately 0.1, at t=400 are 10⁻⁴, and at t=1 are 10⁻²⁴. The loss is on the final step's output. 
What is happening and why do LSTM gates specifically address this?","options":{"A":"The gradient decay is caused by using BPTT (Backpropagation Through Time) which has fewer time steps for early states","B":"Each backpropagation step through the RNN multiplies the gradient by the recurrent weight matrix Wₕ. If the spectral radius ρ(Wₕ) < 1, gradients decay exponentially as ρ(Wₕ)^T where T is the number of steps. For example, with ρ(Wₕ) = 0.9 and T=490: 0.9^490 ≈ 4×10⁻²³. LSTM gates create additive (not multiplicative) paths for gradient flow through the cell state: C_t = f_t · C_{t-1} + i_t · g_t, where the forget gate f_t controls how much gradient flows backward — cells can maintain near-unit gradient flow for arbitrarily long sequences","C":"The vanishing gradient is caused by the tanh activation in the RNN output layer; replace with ReLU to fix","D":"Gradient norms below 0.1 indicate correct behavior — early time steps should have smaller gradients because they contribute less to the final loss"},"correct":"B","explanation":{"correct":"- Vanilla RNN gradient: ∂h_t/∂h_{t-k} = ∏ᵢ ∂h_{t-i+1}/∂h_{t-i} = ∏ᵢ Wₕᵀ · diag(tanh'(z_{t-i})). If ρ(Wₕ) < 1, this product decays geometrically. At 500 time steps, even ρ=0.99 gives 0.99^500 ≈ 0.0066.\n- LSTM cell state update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ tanh(Wₓx_t + Wₕh_{t-1}). Gradient through C_t: ∂C_t/∂C_{t-1} = f_t (element-wise multiplication, not matrix multiplication). The forget gate can be near 1.0, providing a near-unit gradient path.\n- This \"constant error carousel\" (Hochreiter & Schmidhuber, 1997) is the key LSTM innovation: replace multiplicative recurrent connections with additive cell state updates gated by learned gates.","A":"BPTT applies backpropagation through all time steps equally.
Early steps t=1 receive gradients that have been backpropagated through all 499 steps between t=1 and t=500 — they don't get \"fewer steps,\" they get the compounded decay of all steps.","B":"","C":"The tanh in the hidden state transition (not just the output layer) contributes to gradient decay via tanh'(z) ≤ 1. However, the dominant effect is the multiplicative recurrence through Wₕ. Replacing output tanh with ReLU partially helps but does not solve the fundamental multiplicative gradient path.","D":"The gradient at t=1 representing the contribution of the very first input to the final loss should be non-negligible if the sequence has long-range dependencies. A gradient of 10⁻²⁴ means the network literally cannot learn any relationship between t=1 and the output."},"reference":"- Hochreiter & Schmidhuber, \"Long Short-Term Memory\" (1997): https://www.mitpressjournals.org/doi/10.1162/neco.1997.9.8.1735\n- Hochreiter, \"The vanishing gradient problem during learning recurrent neural nets\" (1998)"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06008","difficulty":"hard","orderIndex":8,"question":"A team implements gradient checking (comparing analytical gradients from backprop to numerical gradients via finite differences) and finds a relative error of 10⁻³ for one specific weight. The threshold for \"passing\" gradient check is typically 10⁻⁵ to 10⁻⁷. The team is debugging a custom loss function. 
What are the most likely causes of this specific elevated error?","options":{"A":"A relative error of 10⁻³ is within float32 numerical precision and should be ignored","B":"Possible causes: (1) a non-smooth operation in the loss function (e.g., absolute value at the kink, max at equality) where numerical and analytical gradients differ at the non-differentiable point; (2) a bug in the analytical gradient formula with the wrong coefficient; (3) using float32 (h ≈ 10⁻⁵ in finite differences, float32 precision ≈ 10⁻⁷, gives error floor ≈ 10⁻²) — switch to float64 for gradient checking","C":"The weight is in a batch normalization layer; gradient checking always fails for BN due to batch statistics","D":"A relative error of 10⁻³ specifically indicates a missing factor of 1000 in the gradient formula (off-by-1000 error)"},"correct":"B","explanation":{"correct":"- Gradient checking uses finite differences: (f(x+h) - f(x-h))/(2h). Float32 precision ≈ 10⁻⁷ limits accuracy. With h=10⁻⁵, float32 operations have error ~10⁻⁷/10⁻⁵ = 10⁻². So gradient checking in float32 with typical h has error floor of ~10⁻², not 10⁻⁶.\n- Always run gradient checks in float64. In float64, the error floor drops to ~10⁻¹¹, allowing detection of errors as small as 10⁻⁷.\n- Non-smooth operations (L1 loss, ReLU at exactly 0, max at equality) legitimately produce different analytical vs numerical gradients at the kink — but only for weights where the function is evaluated exactly at the non-smooth point.","A":"10⁻³ is not within float32 precision for analytical gradients. The analytical gradient computed via backpropagation is exact (within floating-point errors of ~10⁻⁷ for float32). A 10⁻³ relative error suggests either a float precision issue in the numerical check or a real bug.","B":"","C":"BatchNorm gradient checking is tricky because the batch statistics create coupling between samples, but it is not impossible. 
The issue is that finite differences change one weight at a time, while BN statistics change with each perturbation. Special care is needed, but it doesn't universally fail.","D":"A relative error of 10⁻³ is 3 orders of magnitude off, which could indicate a factor-of-1000 error, but could also indicate many other issues (missing factor of 2, wrong sign, non-smooth point, precision issue). It doesn't specifically diagnose a 1000× error."},"reference":"- Gradient checking guide: https://cs231n.github.io/neural-networks-3/#gradcheck\n- Goodfellow et al., \"Deep Learning\", Chapter 6.5.6 (Gradient Checking)"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06009","difficulty":"hard","orderIndex":9,"question":"You train a deep network and observe that the gradient norm in layer 1 is 10⁴ (exploding). You apply gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`. The training stabilizes. A colleague argues that gradient clipping is \"cheating\" because it changes the true gradient direction. Is the colleague correct, and what does the clipped gradient actually compute?","options":{"A":"The colleague is correct — gradient clipping changes the gradient direction and introduces bias, making the training invalid","B":"The colleague is partially correct about direction change: when the global gradient norm exceeds max_norm, all gradients are scaled by (max_norm / global_norm). This scales the gradient uniformly, preserving the relative ratios between individual parameter gradients (direction is preserved). It biases the update magnitude but not the direction. 
For exploding gradients, this is an acceptable approximation because the true gradient would cause parameter overflow anyway","C":"Gradient clipping doesn't change the direction because it clips each gradient independently to [-max_norm, max_norm]","D":"The colleague is wrong — gradient clipping computes the exact true gradient but with reduced precision"},"correct":"B","explanation":{"correct":"- `clip_grad_norm_` computes the global gradient norm G = √(Σᵢ ||∇wᵢ||²). If G > max_norm, it scales ALL gradients by max_norm/G. This is a uniform scaling that preserves the relative ratios between gradients (same direction, different magnitude).\n- Direction preservation: if ∇W = [100, -50, 25] and max_norm=1: scaled = [100, -50, 25] × (1/√(100²+50²+25²)) ≈ [0.87, -0.44, 0.22]. The direction (unit vector) is preserved.\n- Alternative: `clip_grad_value_` clips each gradient independently to [-max_value, max_value]. This does change the direction (gradient ratios change), which is generally worse.","A":"Gradient clipping is a valid training practice used in LSTM training, Transformer training (where exploding gradients from attention are common), and GAN training. The \"bias\" introduced is intentional and necessary to prevent parameter overflow.","B":"","C":"This describes `clip_grad_value_` (per-element clipping), not `clip_grad_norm_` (global norm scaling). The two are different operations. `clip_grad_norm_` scales uniformly and preserves direction.","D":"Gradient clipping does not compute the \"exact true gradient.\" It explicitly modifies gradient magnitude.
This is a deliberate approximation, not a precision issue."},"reference":"- Pascanu et al., \"On the difficulty of training recurrent neural networks\" (gradient clipping): https://arxiv.org/abs/1211.5063\n- PyTorch clip_grad_norm_: https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06010","difficulty":"hard","orderIndex":10,"question":"You are implementing a custom neural network operation `f(x) = x² · sin(x)` and need to register its backward function in PyTorch. A junior engineer implements it as:","codeSnippet":"class CustomOp(torch.autograd.Function):\n @staticmethod\n def forward(ctx, x):\n ctx.save_for_backward(x)\n return x**2 * torch.sin(x)\n \n @staticmethod\n def backward(ctx, grad_output):\n x, = ctx.saved_tensors\n grad_x = 2*x * torch.sin(x) # Missing term\n return grad_output * grad_x","options":{"A":"`ctx.save_for_backward(x)` is incorrect; use `ctx.x = x` instead","B":"The backward function is missing the x² · cos(x) term. The correct gradient is: f'(x) = 2x·sin(x) + x²·cos(x) (product rule: d/dx[x²·sin(x)] = 2x·sin(x) + x²·cos(x))","C":"`grad_output` should not be multiplied with the computed gradient — it is already the final gradient","D":"The forward function must return a new tensor created with `torch.empty_like(x)`; modifying x in-place is invalid"},"correct":"B","explanation":{"correct":"- f(x) = x² · sin(x). Applying the product rule: f'(x) = d(x²)/dx · sin(x) + x² · d(sin(x))/dx = 2x·sin(x) + x²·cos(x).\n- The bug: the implementation only computes 2x·sin(x), omitting the x²·cos(x) term. This is a partial product rule application.\n- The chain rule in PyTorch: if L = loss and y = f(x), then ∂L/∂x = ∂L/∂y · ∂y/∂x = grad_output · f'(x). The `return grad_output * grad_x` structure is correct, but grad_x is computed incorrectly.","A":"Both `ctx.save_for_backward(x)` and `ctx.x = x` can store tensors. 
However, `save_for_backward` is the correct API for autograd functions — it ensures proper memory management and version tracking. `ctx.x = x` can cause issues with in-place operations and is not recommended.","B":"","C":"`grad_output` IS the upstream gradient ∂L/∂y. The chain rule requires multiplying it by the local gradient ∂y/∂x = f'(x). The structure `return grad_output * grad_x` is correct — grad_output must be multiplied by the local gradient.","D":"The forward function creates a new tensor via `x**2 * torch.sin(x)` — this is not in-place modification of x. In-place operations would use `x **= 2` or `x.sin_()`."},"reference":"- PyTorch custom autograd functions: https://pytorch.org/docs/stable/notes/extending.html"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06011","difficulty":"medium","orderIndex":11,"question":"You train a Transformer model and notice that gradients for the first few layers are consistently larger than gradients for the last few layers — the opposite of the vanishing gradient problem. What architectural feature of standard Transformers causes this gradient pattern and is it problematic?","options":{"A":"The attention mechanism amplifies gradients in early layers due to the softmax operation","B":"Pre-LN Transformers (LayerNorm before attention/FFN sublayers) can have this reversed gradient pattern. The skip connections in early layers have not yet contributed their normalization effect, while later layers' gradients pass through more LayerNorm normalizations which reduce gradient magnitude. Post-LN (original) Transformers can show the opposite (vanishing early gradients). 
Reversed gradients are not inherently problematic — they indicate gradients are flowing backward strongly through skip connections","C":"This is the exploding gradient problem; apply gradient clipping immediately","D":"Transformers with more than 12 layers always show reversed gradient patterns; this is expected and desirable"},"correct":"B","explanation":{"correct":"- Pre-LN (used in GPT-2, GPT-3, most modern LLMs): LN is applied before each sublayer. Skip connections carry gradients directly, and the LN in earlier layers has less accumulated normalization effect. This can produce larger gradient norms in early layers.\n- Post-LN (original \"Attention is All You Need\"): LN is applied after each sublayer + skip connection. Early layers have gradients that must pass through more LN layers to reach the input, potentially reducing them.\n- Neither pattern is \"problematic\" by itself — both architectures have been used successfully. The key metric is whether gradients flow effectively through all layers (non-zero, finite), not whether they decrease or increase with depth.","A":"Softmax in attention does not specifically amplify gradients in early layers. The softmax gradient is bounded by the softmax probabilities (max gradient = 0.25 per element for two-class case). Softmax does not cause systematic early-layer amplification.","B":"","C":"Gradients being larger in early layers is not by itself exploding gradients. Exploding gradients means norms are exponentially large (10³-10⁶), not just \"larger than later layers.\" Healthy training may have 2-5× variation in gradient norms across layers.","D":"Reversed gradient patterns depend on architecture (Pre-LN vs Post-LN) and initialization, not layer count alone. 
It is not universally expected or required for all Transformers."},"reference":"- Xiong et al., \"On Layer Normalization in the Transformer Architecture\" (Pre-LN analysis): https://arxiv.org/abs/2002.04745"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06012","difficulty":"easy","orderIndex":12,"question":"Consider the function f(x) = max(x, 0) (ReLU). At x=0, the derivative is technically undefined (left derivative = 0, right derivative = 1). PyTorch uses a subgradient of 0 at x=0. In practice, why doesn't this cause training problems?","options":{"A":"PyTorch avoids x=0 by adding a small epsilon to all inputs before applying ReLU","B":"The probability that any floating-point number equals exactly 0 after arbitrary network computations is essentially zero. In practice, the gradient at x=0 is never evaluated — all actual pre-activations are either clearly positive or clearly negative","C":"PyTorch uses a differentiable approximation to ReLU (softplus) internally even when you call nn.ReLU()","D":"Subgradient of 0 at x=0 is actually the mathematically optimal choice and makes training faster"},"correct":"B","explanation":{"correct":"- Floating-point numbers have finite precision. The probability of computing exactly 0.0 from arbitrary neural network weights and inputs approaches zero in practice.\n- Even if a pre-activation were exactly 0 due to symmetry or specific initialization, the next gradient update would immediately move it away from exactly 0. The zero subgradient convention is essentially never triggered.\n- This is why the theoretical non-differentiability of ReLU at 0 is a non-issue in practice — thousands of papers and production systems use ReLU without ever encountering problems from the x=0 gradient.","A":"PyTorch does not add epsilon to ReLU inputs. This would change the function to max(x+ε, 0) which has different behavior.
No such modification is applied.","B":"","C":"PyTorch's nn.ReLU() computes max(0,x) exactly, not Softplus. Softplus(x) = log(1+eˣ) is a separate function available as nn.Softplus(). Users who want smooth approximations must explicitly request Softplus.","D":"The subgradient of 0 at x=0 is a convention (PyTorch's choice); a subgradient of 1 would also be valid mathematically. It is not \"optimal\" — the choice at a measure-zero point has no practical impact on training."},"reference":"- https://cs231n.github.io/neural-networks-1/#actfun (ReLU practical notes)"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06013","difficulty":"medium","orderIndex":13,"question":"A researcher claims: \"Backpropagation is a special case of the chain rule applied to a computational graph, so it doesn't actually 'know' about layers — it works the same way on any computation.\" A student challenges: \"But residual networks (ResNets) work differently because the skip connections create multiple gradient paths.\" Who is correct?","options":{"A":"The student is correct — skip connections require a modified version of backpropagation","B":"The researcher is correct — backpropagation is the chain rule applied to any computational graph, including graphs with skip connections. 
ResNets do create multiple gradient paths (the skip path and the residual path), but standard autograd handles this by summing gradients at the junction node (when a node has multiple downstream consumers, gradients from each consumer are summed)","C":"Both are correct — standard backprop handles simple sequential graphs; modified algorithms are needed for graphs with cycles or skip connections","D":"The student is correct for the gradient of the skip connection weights but wrong for the residual branch weights"},"correct":"B","explanation":{"correct":"- The chain rule on a computation graph: for any node with multiple downstream consumers, the total gradient is the sum of gradients from each consumer path. For ResNet's addition node: ∂L/∂x = ∂L/∂(x + F(x)) · (1 + ∂F(x)/∂x). The \"1\" is the gradient from the skip path, \"∂F(x)/∂x\" is from the residual path.\n- PyTorch's autograd builds a dynamic computational graph and applies reverse-mode automatic differentiation — which is exactly backpropagation generalized to arbitrary DAGs. No special handling is needed for skip connections.\n- This is why torch.autograd, TensorFlow, and JAX handle any directed acyclic graph (DAG) of differentiable operations, including ResNets, DenseNets, and arbitrary architectures.","A":"Skip connections do not require a \"modified version\" of backpropagation. They create a richer computational graph, but the same chain rule and gradient accumulation rules apply.","B":"","C":"Standard backprop handles DAGs (any graph without cycles). Recurrent networks (cycles in time) require BPTT (unrolling the cycle into a DAG). True cycles are unrolled or handled with specific algorithms, but skip connections are not cycles — they are just multiple paths in a DAG.","D":"In a ResNet, the residual branch F(x) has weights and the skip connection is a direct identity (no weights). The gradient for F(x)'s weights follows the same chain rule as any other layer. 
No special treatment is needed for the skip-connected layer's gradients."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2016): https://arxiv.org/abs/1512.03385\n- Baydin et al., \"Automatic Differentiation in Machine Learning: a Survey\": https://arxiv.org/abs/1502.05767"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06014","difficulty":"hard","orderIndex":14,"question":"You are training a model and want to debug gradients. You write:","codeSnippet":"loss = criterion(outputs, labels)\nloss.backward()\nfor name, param in model.named_parameters():\n print(f\"{name}: grad={param.grad}\")","options":{"A":"Parameters with `grad=None` are in layers with learning rate = 0; set uniform learning rate","B":"Multiple causes: (1) the parameter is not in the computational graph leading to the loss (e.g., a layer that is defined but never called in forward()), (2) the parameter has `requires_grad=False`, (3) the computation was done inside `torch.no_grad()` context which suppresses gradient tracking. Fix: verify the layer is called in forward(), check requires_grad, ensure backward is called outside no_grad","C":"`grad=None` always means the gradient is zero; rename `None` to 0 for visualization","D":"PyTorch only computes gradients for the last layer by default; add `.retain_grad()` to earlier layers"},"correct":"B","explanation":{"correct":"- `param.grad` is None (not zero) when the parameter was never \"touched\" by any operation in the computational graph during the current forward pass. PyTorch only allocates gradient tensors for parameters that participated in the computation.\n- Common case 1: a layer defined in `__init__` but never called in `forward()`. The parameter exists but has no graph path to the loss.\n- Common case 2: `requires_grad=False` (e.g., frozen parameters). 
These are deliberately excluded from gradient computation.\n- Common case 3: `with torch.no_grad(): output = model(x)` — operations inside no_grad don't track gradients. Calling .backward() afterward won't have graph information.","A":"Learning rate is applied during the optimizer step (after gradient computation). It does not affect whether gradients are computed or whether param.grad is None.","B":"","C":"`grad=None` is distinctly different from `grad=torch.zeros(...)`. None means the gradient was never computed (parameter not in graph). Zero means the gradient was computed and happened to be zero (e.g., for a dead ReLU neuron). Conflating them would hide important debugging information.","D":"PyTorch computes gradients for ALL parameters with requires_grad=True that appear in the computational graph, not just the last layer. The issue with earlier layers having None grad is specifically about whether they appeared in the graph, not about depth."},"reference":"- PyTorch autograd mechanics: https://pytorch.org/docs/stable/notes/autograd.html"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06015","difficulty":"hard","orderIndex":15,"question":"A production ML system accumulates gradients over 8 steps before each optimizer update (gradient accumulation to simulate a larger batch). What happens if the `/ 8` in the code below is removed? The code is:","codeSnippet":"for i, (x, y) in enumerate(dataloader):\n    outputs = model(x)\n    loss = criterion(outputs, y) / 8\n    loss.backward()  # accumulates into param.grad\n    if (i + 1) % 8 == 0:\n        optimizer.step()  # one update per 8 mini-batches\n        optimizer.zero_grad()  # clear the accumulated gradients","options":{"A":"Nothing changes — gradient accumulation is scale-invariant","B":"Without `/ 8`, each of the 8 mini-batch losses is computed with full scale. After accumulation, the total gradient is 8× larger than it would be for a single large batch of 8× mini-batch size. The optimizer step effectively uses a learning rate 8× larger than intended. This often causes training instability. 
The `/ 8` correctly scales down each mini-batch loss so that the accumulated gradient matches what you'd get from a single large batch","C":"Removing `/ 8` improves convergence because larger gradients give stronger learning signal","D":"The `/ 8` is only needed when using Adam optimizer; for SGD, gradient accumulation works correctly without scaling"},"correct":"B","explanation":{"correct":"- Gradient accumulation goal: simulate processing a batch of 8×mini_batch_size. A true large batch computes loss = mean(losses over 8×N samples). This is equivalent to mean(mean(losses over N samples) for each of 8 mini-batches) — the outer mean contributes the /8 factor.\n- Without /8: accumulated gradient = Σᵢ ∇Lᵢ (sum of 8 mini-batch gradients). With a true large batch: gradient = (1/8)·Σᵢ ∇Lᵢ. The difference is 8×.\n- For SGD-style optimizers, an 8× larger gradient is equivalent to multiplying the learning rate by 8. For learning rates tuned for a specific batch size, this change often causes divergence or training instability.","A":"Gradient scale matters for the effective learning rate. For SGD (with or without momentum), step sizes are directly proportional to gradient magnitude, so accumulating 8× larger gradients is equivalent to using an 8× learning rate.","B":"","C":"\"Stronger learning signal\" from 8× gradients is equivalent to 8× learning rate — which is outside the stable range for most problems. Stronger gradients are not inherently better; they must be appropriately scaled.","D":"This option has the dependence backwards. Plain SGD's update is directly proportional to the gradient, so SGD is the optimizer that most needs the /8. Adam divides the first moment by the square root of the second moment, so a constant factor of 8 in the gradients largely cancels (up to ε) and the update size changes very little. 
The /8 should still be kept with Adam so that logged loss values and gradient-clipping thresholds keep their intended scale."},"reference":"- Goyal et al., \"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour\" (learning rate scaling rule): https://arxiv.org/abs/1706.02677"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07001","difficulty":"easy","orderIndex":1,"question":"You train a neural network using SGD with learning rate 0.1 and observe very noisy loss curves — the loss zigzags up and down across consecutive batches. Switching to SGD with Momentum (β=0.9) smooths the curve. What mathematical operation does momentum perform that causes this smoothing?","options":{"A":"Momentum averages the learning rate across the last 10 batches","B":"Momentum maintains a velocity vector v that is an exponential moving average of past gradients: v_t = β·v_{t-1} + (1-β)·g_t, then updates w = w - α·v_t. High-frequency gradient noise (opposite-sign gradients on consecutive steps) is averaged out, while consistent gradient directions accumulate velocity","C":"Momentum clips individual gradient values to reduce extreme updates that cause loss spikes","D":"Momentum applies the gradient only when it agrees with the gradient from the previous step, skipping updates otherwise"},"correct":"B","explanation":{"correct":"- The exponential moving average acts as a low-pass filter: noisy high-frequency oscillations (gradients that alternate sign) get averaged to near zero in v_t. Consistent gradient directions (signal) accumulate in v_t and produce larger effective steps.\n- With β=0.9: the effective gradient is a weighted sum of the last ~1/(1-β) = 10 gradients. Random noise that doesn't correlate across batches averages out; true gradient direction (consistent across batches) is amplified.\n- This is why momentum helps in ravine-shaped loss surfaces: in the narrow direction (high curvature), gradients oscillate and are averaged out. 
In the long flat direction (low curvature), gradients consistently point the same way and accumulate velocity.","A":"Momentum does not average the learning rate. The learning rate remains fixed. Momentum maintains a gradient velocity, not a learning rate average.","B":"","C":"Gradient clipping is a separate technique (`clip_grad_norm_`) that caps gradient magnitude. Momentum does not clip — it exponentially averages gradients, which can increase effective magnitude in consistent directions.","D":"Momentum always updates — it never conditionally skips updates. The velocity is updated and applied regardless of gradient sign agreement with the previous step."},"reference":"- Polyak, B.T., \"Some methods of speeding up the convergence of iteration methods\" (1964)\n- https://cs231n.github.io/neural-networks-3/#sgd (SGD + momentum)"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07002","difficulty":"easy","orderIndex":2,"question":"A team trains a model with Adam optimizer using default hyperparameters (lr=0.001, β₁=0.9, β₂=0.999). At the first training step, the first moment m₁ ≈ 0.1·g and the second moment v₁ ≈ 0.001·g². The Adam update uses m₁/(√v₁ + ε) instead of g. What bias correction does Adam apply and why?","options":{"A":"Adam scales the learning rate by (1-β₁)/(1-β₂) to normalize between the two moment estimates","B":"Adam applies bias correction: m̂_t = m_t/(1-β₁ᵗ) and v̂_t = v_t/(1-β₂ᵗ). At t=1: m̂₁ = m₁/(1-0.9) = 10·m₁ and v̂₁ = v₁/(1-0.999) = 1000·v₁. 
This corrects for the fact that m₁ and v₁ are initialized to 0 and are biased toward 0 at early steps","C":"Bias correction is only applied when the learning rate exceeds 0.01; for standard lr=0.001, no correction is needed","D":"Adam's bias correction divides by t (step count) to implement learning rate decay automatically"},"correct":"B","explanation":{"correct":"- Without bias correction: m₁ = (1-0.9)·g = 0.1·g (initialized at 0, so m₁ is 10× smaller than the true first moment estimate). This would cause Adam to take very small steps at the beginning of training.\n- With correction: m̂₁ = 0.1·g / (1-0.9) = g (corrects back to the true gradient value). At early steps, (1-β₁ᵗ) → 0, giving large correction. As t→∞, (1-β₁ᵗ) → 1, and correction vanishes.\n- This is the key innovation in the original Adam paper: bias correction ensures the effective learning rate is stable from the very first step, enabling Adam to work reliably without warm-up for most problems.","A":"Adam does not compute a ratio of (1-β₁)/(1-β₂). The correction is applied independently to m and v before computing the ratio m̂/√v̂.","B":"","C":"Bias correction is applied at every step, regardless of learning rate. The correction term (1-βᵗ) depends only on the step count t, not on the learning rate value.","D":"Adam's bias correction does not implement learning rate decay. (1-β₁ᵗ) increases from 0 toward 1 as t grows — it is a bias correction factor that increases over time (decreasing correction), not a decay factor that decreases the step size."},"reference":"- Kingma & Ba, \"Adam: A Method for Stochastic Optimization\" (2014): https://arxiv.org/abs/1412.6980"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07003","difficulty":"medium","orderIndex":3,"question":"You fine-tune a large language model (1B parameters) and observe that training loss decreases but validation loss is slightly higher than expected. Switching from Adam to AdamW fixes this gap. 
What is the difference between Adam and AdamW, and why does it matter for large model fine-tuning?","options":{"A":"AdamW uses a larger learning rate internally, which improves generalization","B":"Adam implements L2 regularization by adding λ·w to the gradient before the Adam update: g' = g + λ·w. AdamW implements weight decay directly by subtracting α·λ·w from the weights alongside the Adam update: w' = w - α·(m̂/√v̂) - α·λ·w. In Adam, the adaptive scaling also rescales the regularization term (g' is divided by √v̂), weakening it for parameters with large historical gradients. AdamW applies weight decay at full strength regardless of gradient history","C":"AdamW uses a different β₁ default (0.99 vs 0.9), which prevents overfitting in large models","D":"Adam is numerically unstable for models above 100M parameters; AdamW adds numerical stabilization"},"correct":"B","explanation":{"correct":"- Adam with L2: the regularization term λ·w is treated as part of the gradient and is scaled by 1/√(v̂). For parameters with large gradient history (high v̂), the effective regularization is weak (divided by large √v̂). This observation is what motivates decoupled weight decay.\n- AdamW: weight decay is applied as a separate multiplicative factor on the weights: w_new = (1-α·λ)·w - α·m̂/√v̂. The weight decay term is not affected by the adaptive scaling — every parameter gets the same proportional weight decay.\n- For LLM fine-tuning: many parameters have very consistent large gradients (high v̂), making Adam's L2 regularization near-zero for those parameters. AdamW ensures all parameters are properly regularized, explaining the improved validation performance.","A":"AdamW does not use a larger internal learning rate. The learning rate hyperparameter is the same. The improvement comes from correct decoupling of weight decay from gradient scaling.","B":"","C":"AdamW uses the same default β₁=0.9 as Adam. 
The difference is in the weight decay implementation, not the momentum hyperparameters.","D":"Both Adam and AdamW are numerically stable for large models. AdamW's improvement is about regularization correctness, not numerical stability."},"reference":"- Loshchilov & Hutter, \"Decoupled Weight Decay Regularization\" (AdamW, 2017): https://arxiv.org/abs/1711.05101\n- Ilya Loshchilov's blog post explaining the Adam L2 vs AdamW distinction"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07004","difficulty":"medium","orderIndex":4,"question":"A team trains a model with a cosine learning rate schedule starting at lr=0.1 and decaying to 0 over 50 epochs. After these 50 epochs (lr≈0), they want to continue training for 50 more epochs. They reset the cosine schedule to restart from lr=0.1. Training improves significantly compared to continuing at lr≈0. What is this technique and why does it work?","options":{"A":"This is learning rate warm-up, which is standard practice for all deep learning training","B":"This is Stochastic Gradient Descent with Warm Restarts (SGDR / Cosine Annealing with Restarts). Restarting the schedule from a high learning rate allows the optimizer to \"escape\" local minima or sharp minima that the model has converged to. The sharp minimum found at low learning rate may generalize poorly; restarting explores broader loss landscape regions that may contain wider (better-generalizing) minima","C":"This is cyclical learning rate training, which only works with SGD, not with Adam","D":"The improvement is not related to the schedule restart but to the fact that they trained for 100 total epochs instead of 50"},"correct":"B","explanation":{"correct":"- SGDR (Loshchilov & Hutter, 2017): restart cosine schedule periodically. High learning rates explore broadly; low learning rates fine-tune. 
Restart re-explores from a high learning rate, potentially escaping into wider loss basins.\n- Sharp minima hypothesis: flat/wide minima generalize better than sharp minima (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017). High learning rates tend to find flatter minima because they cannot settle into narrow, sharp valleys. Restarting after reaching a low LR escapes the current (potentially sharp) basin.\n- Snapshot ensembling (Huang et al., 2017) saves model checkpoints at each LR minimum to build ensembles from a single training run.","A":"Learning rate warm-up is the practice of starting from a very small LR and increasing to the target LR over the first few steps/epochs. SGDR is the opposite concept — restarting from a high LR after convergence.","B":"","C":"Cosine annealing with restarts works with any optimizer including Adam. Loshchilov uses SGD in the original paper, but the schedule is optimizer-independent.","D":"The comparison is explicitly with \"continuing at lr≈0\" for 50 more epochs — the same total 100 epochs. The improvement is specifically from the LR restart, not from additional training time."},"reference":"- Loshchilov & Hutter, \"SGDR: Stochastic Gradient Descent with Warm Restarts\" (2017): https://arxiv.org/abs/1608.03983\n- Huang et al., \"Snapshot Ensembles: Train 1, get M for free\" (2017): https://arxiv.org/abs/1704.00109"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07005","difficulty":"medium","orderIndex":5,"question":"RMSProp updates weights using: w = w - α·g/√(E[g²] + ε). A colleague says: \"RMSProp is identical to Adam without the first moment (momentum).\" After testing, you find RMSProp and Adam-without-momentum converge differently. 
Who is right and what is the technical difference?","options":{"A":"The colleague is correct — RMSProp and Adam-without-momentum are mathematically identical","B":"The colleague is approximately right but technically wrong: RMSProp uses an exponential running average (E[g²] = ρ·E_{t-1}[g²] + (1-ρ)·g²) with no bias correction. Adam applies bias correction to the second moment: v̂_t = v_t/(1-β₂ᵗ). At early steps, RMSProp's E[g²] is biased toward 0 (small), making step sizes larger than Adam's corrected steps. They converge identically only after many steps when bias correction becomes negligible","C":"RMSProp uses the absolute gradient |g| while Adam uses g² for the second moment","D":"The difference is that Adam clips the second moment to prevent explosion, while RMSProp does not"},"correct":"B","explanation":{"correct":"- RMSProp: E[g²]_t = ρ·E[g²]_{t-1} + (1-ρ)·g_t². Initialized to 0. At t=1: E[g²]_1 = (1-ρ)·g₁², which is biased toward 0 by factor (1-ρ).\n- Adam second moment: v_t = β₂·v_{t-1} + (1-β₂)·g_t², then corrected: v̂_t = v_t/(1-β₂ᵗ). At t=1: v₁ = (1-β₂)·g₁², v̂₁ = g₁². The bias correction restores the true estimate of g².\n- Consequence: RMSProp at early steps has small E[g²], giving large step sizes. Adam's bias correction gives stable step sizes from step 1. They converge to the same update rule as t→∞ when both biases vanish.","A":"They are not mathematically identical. The bias correction in Adam for the second moment creates different behavior at early steps, which can meaningfully affect training trajectory.","B":"","C":"Both RMSProp and Adam use g² (squared gradient) for the second moment estimate. Absolute value |g| is not used in either.","D":"Neither RMSProp nor standard Adam explicitly clips the second moment. 
The epsilon (ε=10⁻⁸) prevents division by zero but is not a \"clip\" on the second moment."},"reference":"- Tieleman & Hinton, \"Lecture 6.5 — RMSProp\" (2012): Coursera slides\n- Kingma & Ba, \"Adam\" (2014): https://arxiv.org/abs/1412.6980"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07006","difficulty":"medium","orderIndex":6,"question":"You train a model with Adam and observe: the training loss decreases normally for the first 1000 steps, then suddenly \"resets\" — loss jumps back up and the model appears to \"forget\" what it learned. This happens at exactly step 1000 when the second moment estimate v_t becomes a reliable long-term average (no longer dominated by bias correction). What optimizer configuration is likely causing this?","options":{"A":"The learning rate is too high at step 1000","B":"This is the \"Adam learning rate collapse\" phenomenon: at early steps, bias correction makes v̂_t large (small denominator effects are corrected away), giving large effective step sizes. As v_t accumulates history (bias correction effect diminishes), the effective step size can change dramatically. If combined with a warmup schedule that ends exactly at step 1000, the LR change may be sharp enough to cause apparent \"forgetting\"","C":"Adam's β₂ accumulates variance, causing the effective learning rate to increase at step 1000 and destabilize training","D":"The optimizer should be switched to SGD after 1000 steps as Adam is only effective in early training"},"correct":"B","explanation":{"correct":"- Effective step size in Adam = α·m̂_t/√v̂_t. As training progresses: m̂_t converges to a smooth gradient estimate, and v̂_t converges to the long-run average squared gradient. 
The ratio m̂_t/√v̂_t can stabilize differently than at early steps.\n- A common issue: if using a warmup schedule that ends at step 1000 with a sharp LR transition, combined with the natural stabilization of Adam's moments, the effective step sizes can change abruptly.\n- The specific description of \"resets at exactly step 1000\" suggests a scheduled event (warmup end, LR change) rather than a natural Adam phenomenon. Diagnosis: plot effective learning rate = α/√v̂_t per parameter over time to see the actual step size trajectory.","A":"\"Learning rate too high\" would cause instability from the beginning, not specifically at step 1000. High LR manifests as oscillating or diverging loss from early steps.","B":"","C":"v_t accumulates squared gradient information, causing the effective learning rate to decrease (denominator grows), not increase. The effective learning rate in Adam generally decreases over time as the second moment accumulates.","D":"Adam is not limited to \"early training\" — it is used effectively for full training runs in most modern deep learning. Switching to SGD mid-training without careful LR scheduling would be highly disruptive."},"reference":"- Ma & Yarats, \"Quasi-hyperbolic momentum and Adam for deep learning\" (2019): https://arxiv.org/abs/1810.06801"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07007","difficulty":"hard","orderIndex":7,"question":"A researcher trains the same ResNet-50 model with: (A) Adam (lr=1e-3), (B) SGD+Momentum (lr=0.1, momentum=0.9), and (C) AdamW (lr=1e-3, weight_decay=0.01). She observes final test accuracy: SGD > AdamW > Adam. 
She concludes \"SGD is always better than Adam for image classification.\" What is the correct nuanced interpretation?","options":{"A":"The researcher's conclusion is correct — SGD is always better for computer vision","B":"The result reflects well-known empirical findings: for image classification with large datasets and well-tuned LR schedules, SGD+Momentum often outperforms Adam in final accuracy, likely because SGD finds wider minima with better generalization. However, the comparison is confounded by LR choice — Adam's optimal LR is typically 10-100× smaller than SGD's. The conclusion \"SGD is always better\" is too strong; Adam typically outperforms SGD on NLP tasks, irregular loss surfaces, and small datasets","C":"The result proves Adam has higher variance, which always hurts generalization","D":"AdamW should be identical to Adam with weight decay; the difference in their results indicates a bug in the implementation"},"correct":"B","explanation":{"correct":"- The SGD > Adam for image classification finding has been replicated widely (Wilson et al., 2017, \"The Marginal Value of Adaptive Gradient Methods in Machine Learning\"). The dominant hypothesis: Adam's adaptive learning rates allow it to escape broad regions quickly, but it tends to converge to sharper minima that generalize worse.\n- Critically: Adam's default lr=1e-3 and SGD's optimal lr=0.1 are not equivalent; the effective step sizes are very different. A fairer comparison would tune LR for each optimizer independently.\n- Domain dependency: Transformers, NLP, and irregular optimization landscapes generally favor Adam/AdamW because of its robustness to sparse gradients and irregular geometry. For vision with well-tuned training recipes, SGD+momentum with cosine schedule is competitive or better.","A":"\"Always better for computer vision\" is falsified by Transformer-based vision models (ViT, DeiT) which use Adam/AdamW and achieve strong results. 
The finding is empirically narrower than \"always.\"","B":"","C":"Higher variance in optimization doesn't directly translate to worse generalization. Adam's adaptive learning rates produce different optimization trajectories — the generalization difference is about loss landscape geometry (sharp vs flat minima), not variance per se.","D":"AdamW corrects Adam's L2 regularization to be proper weight decay. For large, regularly trained models, AdamW provides meaningful regularization benefits over Adam, so different results are expected and correct."},"reference":"- Wilson et al., \"The Marginal Value of Adaptive Gradient Methods in Machine Learning\" (2017): https://arxiv.org/abs/1705.08292\n- He et al., \"Bag of Tricks for Image Classification\" (SGD training recipe for ResNets): https://arxiv.org/abs/1812.01187"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07008","difficulty":"hard","orderIndex":8,"question":"The Lion optimizer update rule is: m_t = β₁·m_{t-1} + (1-β₁)·g_t, then w_t = w_{t-1} - α·sign(m_t) - α·λ·w_{t-1}. Compared to Adam and SGD, what is Lion's distinctive property and what type of models does it improve?","options":{"A":"Lion uses the sign function instead of the gradient magnitude, making all parameter updates the same size (±α per step). This is memory-efficient (only one moment to track vs Adam's two) and appears to work well for large-scale vision and language models where signal direction is more important than magnitude","B":"Lion's sign function causes updates to clip gradients to 1.0, making it equivalent to gradient clipping with max_norm=1","C":"Lion is identical to Adam but with β₂ removed; the sign function replaces the adaptive learning rate scaling","D":"Lion's sign update prevents convergence to local minima because +α or -α steps can always escape any flat region"},"correct":"A","explanation":{"correct":"- sign(m_t) ∈ {-1, 0, +1}. 
Ignoring the weight decay term, every nonzero parameter update has magnitude exactly α, regardless of gradient magnitude. This is a uniform step size across all parameters, very different from Adam's per-parameter adaptive scaling.\n- Memory: Lion tracks only one moment (m_t, equivalent to momentum), vs Adam's two moments. For large models with billions of parameters, this halves optimizer state memory.\n- Empirical results (Chen et al., 2023): Lion is reported to match or outperform AdamW on ViT, JFT, Imagen, and language modeling benchmarks while storing one optimizer state per parameter instead of two.","A":"","B":"Gradient clipping limits gradient norm before the update step. Lion's sign function acts on the accumulated moment, not the raw gradient. The operations are applied at different points in the update pipeline.","C":"Adam without β₂ gives unscaled first-moment updates: w = w - α·m_t. Lion applies sign to the moment: w = w - α·sign(m_t). The sign function makes the update direction-only, not magnitude-preserving.","D":"sign updates can escape flat regions (gradient near 0 but not exactly 0 still gives ±α step), but they can also oscillate around minima (the step size is fixed, so the optimizer can't \"slow down\" near a minimum like Adam does with accumulated second moment)."},"reference":"- Chen et al., \"Symbolic Discovery of Optimization Algorithms\" (Lion optimizer, 2023): https://arxiv.org/abs/2302.06675"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07009","difficulty":"hard","orderIndex":9,"question":"A team trains a language model with the following learning rate schedule: linear warmup for 2000 steps from 0 to lr_max, then cosine decay to lr_max/10. They observe that training is unstable in the first 100 steps (loss spikes and oscillates) even with warmup. 
What is the most likely missing component?","options":{"A":"The warmup duration (2000 steps) is too long; reduce to 100 steps","B":"The initial learning rate is not exactly 0; even starting at lr=1e-7 can cause instability when the model is randomly initialized and gradient magnitudes are large. Additionally, the first batches may have extreme loss values (random predictions on first batch) — the real issue is often that gradient norms spike in the first few steps before warmup stabilizes them. Fix: add gradient clipping in addition to LR warmup","C":"Cosine decay should start immediately, not after warmup; warmup itself causes instability","D":"The model needs BatchNorm; without it, warmup has no stabilizing effect"},"correct":"B","explanation":{"correct":"- At initialization, model weights are random. The first batch loss is typically high (random classifier), and gradients can be large. Even with LR warmup starting from a very small value, large gradient magnitudes multiplied by even a small LR can produce significant weight updates.\n- Gradient clipping (typically max_norm=1.0) is complementary to LR warmup: warmup controls the learning rate trajectory, clipping controls the per-step update magnitude. Together they provide robust training stability.\n- In practice, Transformers routinely use both: \"We use Adam with warmup and gradient clipping of 1.0\" appears in GPT, BERT, and most modern LLM training papers.","A":"Longer warmup is generally more stable, not less. 2000 steps for a language model is a common choice. Reducing to 100 steps would make warmup shorter and potentially less effective.","B":"","C":"Warmup is specifically designed to stabilize early training. Starting cosine decay immediately (without warmup) would begin from a large learning rate at a point where the model is most sensitive (random initialization).","D":"BatchNorm is not needed in Transformer LM training (which uses LayerNorm). 
BatchNorm's presence or absence doesn't determine whether LR warmup is effective."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): uses warmup + Adam: https://arxiv.org/abs/1706.03762\n- Ma et al., \"Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization\": warmup and clipping analysis"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07010","difficulty":"hard","orderIndex":10,"question":"You run a hyperparameter sweep over learning rates for Adam: [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]. You find 1e-3 works best. Your colleague uses the same model but with 4× batch size and gets best results at 1e-3 still. A third engineer with 16× batch size also finds 1e-3 is best. Your team lead says this \"proves Adam doesn't need learning rate scaling with batch size.\" Is this correct?","options":{"A":"Yes — Adam's adaptive learning rate makes it batch-size invariant by normalizing gradient scale","B":"Partially correct but requires nuance: Adam's adaptive scaling partially compensates for batch size effects, but the optimal learning rate still changes with batch size in theory. The empirical finding that 1e-3 works across batch sizes may reflect the fact that optimal LR for Adam is robust within certain ranges, or that the sweep resolution (1-order-of-magnitude steps) is too coarse to detect the shift. For linear scaling rule (multiply LR by k when batch size multiplies by k), Adam weakens but doesn't eliminate the relationship","C":"Yes — the adaptive learning rates in Adam make it exactly batch-size invariant, unlike SGD where the linear scaling rule applies","D":"No — Adam's optimal learning rate scales exactly as 1/√(batch_size); the team should use 1e-3/√16 = 2.5e-4 for 16× batch size"},"correct":"B","explanation":{"correct":"- With batch size k×, gradient estimates have k× smaller variance (more samples per estimate). 
For SGD, optimal LR scales as k (linear scaling rule) to compensate. For Adam, the second moment √v̂ also adapts to gradient scale, providing some automatic compensation.\n- However, the effective learning rate in Adam = α/√v̂ doesn't perfectly compensate for batch size changes because the noise structure of gradients changes with batch size in complex ways.\n- The empirical finding that 1e-3 works across 4× and 16× batch sizes is plausible for Adam (robustness) but doesn't prove invariance. The coarse 10× resolution of the sweep means optimal LR could shift by 2-3× within the same \"best\" bin.","A":"Adam is not exactly batch-size invariant. The adaptive scaling partially compensates but doesn't remove the dependency. This is an active research area (e.g., learning rate scaling experiments in GPT training).","B":"","C":"There is no proof of exact invariance. The adaptive scaling is an approximation that reduces sensitivity to batch size, not a perfect invariance guarantee. Large-batch training studies (e.g., Goyal et al., 2017, which used SGD) show that LR must be retuned at very large batch sizes, and Adam's adaptivity does not exempt it from that.","D":"The 1/√(batch_size) scaling rule is not established for Adam, and it shrinks LR as batch size grows, which is the opposite direction from common heuristics such as square-root scaling. There is no consensus exact scaling rule for Adam — the partial compensation makes it harder to derive than SGD's linear rule."},"reference":"- Goyal et al., \"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour\" (2017): https://arxiv.org/abs/1706.02677\n- Smith et al., \"Don't Decay the Learning Rate, Increase the Batch Size\" (2018): https://arxiv.org/abs/1711.00489"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07011","difficulty":"medium","orderIndex":11,"question":"You train a model for 100 epochs with a step learning rate schedule: lr=0.1 for epochs 1-30, lr=0.01 for epochs 31-60, lr=0.001 for epochs 61-100. At epoch 31 and 61, training loss spikes sharply before recovering.
What causes these spikes and how does cosine annealing prevent them?","options":{"A":"The optimizer's momentum resets at epoch boundaries, causing gradient direction changes","B":"At each step drop, the optimizer's accumulated momentum (velocity) was calibrated for a 10× larger learning rate. When LR drops by 10×, the momentum-scaled update is still 10× too large for the first few steps until the momentum \"forgets\" the old gradients. The spike is from the momentum × new_LR combination being inconsistent. Cosine annealing decays LR smoothly — the optimizer's effective step size changes gradually, so momentum and LR stay in sync","C":"Loss spikes at epoch boundaries are caused by data shuffling, not the learning rate schedule","D":"The spikes indicate gradient explosion; gradient clipping should be added at epoch boundaries"},"correct":"B","explanation":{"correct":"- SGD+Momentum with the LR folded into the velocity: v_t = β·v_{t-1} + α·g_t, w_t = w_{t-1} - v_t. The velocity has accumulated history from lr=0.1 steps. When LR drops to 0.01, new gradient terms are scaled by 0.01, but v_t still carries terms scaled by the old lr=0.1. The first few updates at lr=0.01 therefore apply old (lr=0.1-calibrated) momentum, giving larger updates than intended.\n- With β = 0.9, it takes ~1/(1-β) = 10 steps for momentum to \"forget\" old gradients. During this transition period, updates are inconsistent with the new LR.\n- Cosine annealing: LR changes smoothly. The velocity at any point is consistent with the recent LR history (no discontinuity). No spike because there's no sudden LR scale mismatch.","A":"Momentum does NOT reset at epoch boundaries in standard implementations. The velocity vector is persistent across epochs. The problem is that it persists with values calibrated for the old LR.","B":"","C":"Data shuffling changes which batch is seen but not the gradient scale. Shuffling might add noise but not systematic spikes at exact epoch boundaries.
The spikes correlate precisely with LR changes.","D":"Gradient explosion produces exponentially growing loss that doesn't recover. The described spikes recover quickly (within ~10 steps), which is the signature of momentum-LR mismatch, not true gradient explosion."},"reference":"- https://cs231n.github.io/neural-networks-3/#anneal (learning rate annealing)\n- Loshchilov & Hutter, \"SGDR: Stochastic Gradient Descent with Warm Restarts\" (2017)"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07012","difficulty":"medium","orderIndex":12,"question":"A model is trained with Adam and achieves good validation performance. At inference, you discover the model's predictions are poorly calibrated — it outputs very high confidences (>0.99) for many predictions that are often wrong. A colleague suggests \"this is an Adam problem; use SGD to fix calibration.\" Is this diagnosis accurate?","options":{"A":"Yes — Adam optimizers systematically produce poorly calibrated models","B":"No — poor calibration is primarily a consequence of training with cross-entropy loss (which doesn't penalize overconfidence) and insufficient regularization, not the optimizer choice. Both Adam and SGD can produce poorly calibrated models. Fix: temperature scaling, label smoothing, or explicit calibration techniques (Platt scaling)","C":"Yes — Adam's adaptive learning rate causes it to over-optimize for certain confident predictions","D":"No, but switching to SGD does improve calibration because SGD finds flatter minima which are better calibrated by definition"},"correct":"B","explanation":{"correct":"- Calibration measures whether predicted probabilities match empirical frequencies. 
Poor calibration (overconfidence) is a well-documented property of modern neural networks trained with cross-entropy loss (Guo et al., 2017, \"On Calibration of Modern Neural Networks\").\n- The root cause: cross-entropy loss is minimized when the model assigns probability 1 to correct classes. Without explicit regularization, the model is pushed toward maximum confidence on training data, which overfits confidence (not just labels).\n- Fix: (1) label smoothing (soft targets) — prevents the model from targeting p=1.0; (2) temperature scaling (post-hoc) — scales logits by learned T to calibrate probabilities; (3) Dropout at inference (MC Dropout) for uncertainty estimation.","A":"Adam doesn't systematically cause poor calibration. Models trained with SGD also exhibit overconfidence. The phenomenon is loss-function-driven, not optimizer-driven.","B":"","C":"Adam's adaptive learning rates affect which minima are found, not whether the model is overconfident. Overconfidence relates to the loss landscape near confident predictions, not to optimizer adaptivity.","D":"SGD finding \"flatter minima\" is a hypothesis about generalization, not calibration. Flat minima may generalize better (lower test error) but don't directly improve calibration of confidence scores."},"reference":"- Guo et al., \"On Calibration of Modern Neural Networks\" (2017): https://arxiv.org/abs/1706.04599\n- Label smoothing paper: Szegedy et al., \"Rethinking the Inception Architecture\" (2016)"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07013","difficulty":"easy","orderIndex":13,"question":"A neural network training run diverges (loss goes to infinity) after 500 steps when using SGD with lr=0.01. You reduce the learning rate to 0.001 and training converges. 
What is the mathematical mechanism by which a high learning rate causes divergence?","options":{"A":"High learning rate causes integer overflow in PyTorch's weight tensors","B":"High learning rate causes weight updates w = w - α·g to overshoot the loss minimum. In a quadratic loss bowl, if α > 2/L (where L is the Lipschitz constant of the gradient), each step overshoots to the opposite side of the minimum with increasing distance. The overshoots grow geometrically until weights reach infinity","C":"High learning rate causes gradients to become NaN due to numerical instability in the exponential function","D":"High learning rate causes the model to memorize training data too quickly, and memorization increases the loss on each subsequent batch"},"correct":"B","explanation":{"correct":"- For a 1D quadratic loss f(w) = 0.5·c·w², gradient g = c·w. Update: w' = w - α·c·w = (1-α·c)·w. If |1-α·c| > 1 (i.e., α > 2/c), |w'| > |w| — each step moves further from w=0 (the minimum).\n- Geometric divergence: |w_t| = |1-α·c|ᵗ · |w_0|. With α=0.01 and c=100 (steep loss): |1-0.01·100| = 0, convergence in one step. With α=0.1: |1-10| = 9, |w_t| grows as 9ᵗ → infinity.\n- This explains why the stability condition α < 2/L is fundamental. For curvature L, exceeding 2/L causes divergence regardless of the direction chosen.","A":"PyTorch uses float32/float64, which are IEEE 754 floating-point numbers. They don't overflow to integers — they overflow to `inf` (infinity), which is a valid float64 value. The divergence is mathematical, not an overflow in the integer sense.","B":"","C":"Gradients becoming NaN due to exponential functions happens in specific architectures (e.g., exp in softmax with large logits), not as a direct consequence of high learning rate. High LR causes the loss to diverge first, which then may produce NaN in subsequent operations.","D":"Memorization refers to fitting specific training examples. 
High learning rate causes the optimization trajectory to diverge (weights → infinity) due to overshooting, not because the model is memorizing faster."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 8.2 (Challenges in Neural Network Optimization)\n- https://cs231n.github.io/neural-networks-3/#baby"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07014","difficulty":"hard","orderIndex":14,"question":"A company switches from training transformers with Adam to training with 8-bit Adam (bitsandbytes library), claiming \"8-bit quantization of optimizer states with no quality loss.\" A skeptical ML engineer says \"quantizing optimizer states must affect training.\" Who is right?","options":{"A":"The company is right — 8-bit optimizer states are mathematically identical to 32-bit","B":"The engineer is partially right: 8-bit quantization of optimizer states (first and second moments in Adam) does introduce quantization noise. However, Dettmers et al. (2022) showed that block-wise quantization with dynamic scaling largely preserves training quality, because small quantization errors in optimizer states have a regularization-like effect. The quality difference is empirically negligible for most tasks while saving 75% optimizer memory","C":"The company is right because Adam optimizer states are already low-precision floating-point numbers and don't require full 32-bit precision","D":"The engineer is right — 8-bit Adam produces models that underfit by approximately 2% on all tasks"},"correct":"B","explanation":{"correct":"- Standard Adam stores first moment m (float32) and second moment v (float32). For a 7B parameter model, this is 2 × 7B × 4 bytes = 56 GB — often more than the model itself.\n- 8-bit quantization: each value represented in 8-bit integers with block-wise dynamic scaling (find the max value in each 2048-element block, scale values to [0,255]). 
The quantization error is bounded and approximately uniform, acting as small gradient noise.\n- Dettmers et al. (2022) reported across language modeling, fine-tuning, and vision benchmarks that 8-bit Adam closely matches 32-bit Adam's quality with about 75% less optimizer-state memory. Any degradation observed was negligible.","A":"8-bit and 32-bit representations are not mathematically identical. 8-bit has ~2.8 bits of effective mantissa precision vs float32's 23 bits. Quantization noise is real — the question is whether it matters practically.","B":"","C":"Adam stores optimizer states in float32 for precision in accumulating gradients over time. The second moment accumulates squares of gradients, which can span many orders of magnitude. Storing in float16 (not float32) already causes issues — 8-bit requires the block-wise trick to work.","D":"\"Approximately 2% underfitting on all tasks\" is too specific and not supported empirically. Quality degradation from 8-bit Adam is task-dependent and often negligible, not a fixed universal penalty."},"reference":"- Dettmers et al., \"8-bit Optimizers via Block-wise Quantization\" (2022): https://arxiv.org/abs/2110.02861"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07015","difficulty":"hard","orderIndex":15,"question":"You train a GAN (generator G and discriminator D) with Adam. After 10,000 steps, the discriminator loss collapses to near 0 and the generator produces random noise. You try switching D to SGD with high lr, G stays Adam. Training stabilizes.
What optimizer-specific property of SGD helps here, and what does this reveal about Adam's behavior in adversarial training?","options":{"A":"SGD trains D slower, giving G more time to improve before D becomes perfect","B":"Adam's adaptive learning rates per parameter make D's update steps become very small for parameters with large gradient history — in adversarial training, the discriminator easily classifies real vs fake in the early steps, accumulating large gradient history, causing Adam to reduce effective LR for D to near zero. D becomes unable to update fast enough to keep up with G's improvements. SGD with fixed LR keeps D's update rate stable regardless of gradient history","C":"SGD produces more gradient noise than Adam, which prevents D from memorizing the entire training set","D":"Adam causes GAN mode collapse by making G and D converge to the same local minimum"},"correct":"B","explanation":{"correct":"- Adam's second moment v_t accumulates squared gradients. For D's weights that consistently produce large gradients (clear real/fake discrimination), v_t grows large, and the effective learning rate α/√v_t shrinks toward zero over time.\n- Once D's effective LR collapses, D can no longer meaningfully compete with G. G receives weak or uninformative gradients from a D that barely updates, causing G to produce garbage outputs (no learning signal from a non-updating D).\n- SGD with fixed LR maintains D's ability to update regardless of gradient history. This \"natural learning rate\" preserves the adversarial tension needed for GAN training.","A":"\"G having more time\" would require training them at different rates explicitly. 
The difference is the effective per-step learning rate magnitude, not the number of steps.","B":"","C":"While SGD does have more gradient noise than Adam (due to lack of adaptive scaling), the mechanism is specifically about effective learning rate collapse for the discriminator, not about noise preventing memorization.","D":"Mode collapse in GANs is a generator problem (G maps many inputs to the same output, covering only a few modes of the data distribution). It is not caused by Adam making G and D converge to the same minimum — they have different objectives by design."},"reference":"- Goodfellow et al., \"Generative Adversarial Networks\" (2014): https://arxiv.org/abs/1406.2661\n- Lucic et al., \"Are GANs Created Equal? A Large-Scale Study\" (2018): https://arxiv.org/abs/1711.10337"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08001","difficulty":"easy","orderIndex":1,"question":"A team compares two networks on a tabular dataset with 100 features and 50,000 training samples: Network A has 1 hidden layer with 1024 units; Network B has 4 hidden layers with 256 units each. Both have approximately the same total parameter count. Network B consistently outperforms Network A. A junior engineer concludes \"more layers is always better.\" What is the accurate explanation and when would this conclusion fail?","options":{"A":"More layers are always better; the junior engineer is correct","B":"Network B has more depth, which allows it to learn hierarchical feature compositions. 
However, \"more layers is always better\" fails when: (1) the data has no hierarchical structure (e.g., random tabular data with no compositional features), (2) depth introduces optimization difficulties (vanishing gradients, dead neurons) that are worse than the capacity gain, or (3) the dataset is too small to learn useful hierarchical representations","C":"Network B is better because it has more regularization from the additional bias terms in deeper layers","D":"Depth always helps because wider networks are less efficient at using their parameters"},"correct":"B","explanation":{"correct":"- For data with compositional structure (images: edges→shapes→objects; text: chars→morphemes→words), depth allows each layer to build on previous layer abstractions. This exponential efficiency of depth means Network B can represent more complex functions with the same parameter count.\n- Failure cases: (1) Random forest features or tabular data with engineered features often lack the compositional structure that makes depth useful. Many studies show shallow networks work equally well on tabular data. (2) Very deep networks (20+ layers without ResNet-style skip connections) can be harder to train than shallow ones due to vanishing gradients.\n- The width vs depth trade-off is problem-specific, not universally resolved in favor of depth.","A":"\"Always better\" claims are rarely correct in ML. The Universal Approximation Theorem shows a single hidden layer can represent any function — depth is about efficiency and learnability, not strict necessity.","B":"","C":"Additional bias terms in deeper networks are minimal (a few hundred extra scalar parameters). This is negligible and does not explain the performance difference.","D":"Wider networks can be very efficient — width allows learning many different features simultaneously. 
There's no universal efficiency advantage for depth over width."},"reference":"- Bengio & LeCun, \"Scaling algorithms towards AI\" (2007)\n- Goodfellow et al., \"Deep Learning\", Chapter 6.4 (Architecture Design)"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08002","difficulty":"easy","orderIndex":2,"question":"The Universal Approximation Theorem (UAT) states that a neural network with one hidden layer and enough units can approximate any continuous function to arbitrary precision. A product manager uses this to argue: \"We should always use single-layer networks since UAT proves they can do anything.\" What is wrong with this argument?","options":{"A":"UAT only applies to regression problems, not classification","B":"UAT guarantees existence of a single-layer network that works, not that we can efficiently find it. The required width may be exponential in the input dimension. More practically, it doesn't address: (1) how to find the right weights (optimization), (2) how many samples are needed to learn it (generalization), or (3) the practical cost of the exponentially large required width","C":"UAT has been proven false; modern neural networks require depth to function","D":"UAT applies only to networks with sigmoid activations; ReLU networks require depth to be universal approximators"},"correct":"B","explanation":{"correct":"- Cybenko's original UAT (1989) proved existence of weights for a wide enough single-layer network. 
But \"wide enough\" can be exponential in input dimension for certain functions.\n- Hornik (1991) generalized to any squashing function, and Barron (1993) proved single-layer networks can approximate any function with finite first-moment of the Fourier transform using O(1/ε²) neurons — but this bound is loose in practice.\n- The practical problems: (1) optimization of a single very wide layer may be harder than a deep narrow network; (2) generalization requires enough samples relative to parameter count; (3) the exponentially wide single layer may need far more FLOPs and memory than a deep equivalent.","A":"UAT applies to both regression and classification (approximating any continuous function includes decision boundaries). It is not restricted to regression.","B":"","C":"UAT has not been disproven — it remains valid. What has been shown (Telgarsky, 2016) is that for certain functions, depth allows exponentially more efficient representations. This doesn't falsify UAT.","D":"Single-hidden-layer ReLU networks are also universal approximators. Hornik (1991) and Leshno et al. (1993) established UAT for broad classes of activation functions; the Leshno et al. result covers any non-polynomial activation, which includes ReLU."},"reference":"- Cybenko, \"Approximation by superpositions of a sigmoidal function\" (1989)\n- Hornik et al., \"Multilayer feedforward networks are universal approximators\" (1989)"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08003","difficulty":"medium","orderIndex":3,"question":"You process a batch of 32 samples through a fully connected network. At the second hidden layer, you notice that halving the batch size from 32 to 16 reduces training time by 45% (not 50%).
What does this tell you about the computation bottleneck?","options":{"A":"Halving the batch size should always halve training time; the 45% means there's a 5% overhead bug","B":"45% time reduction means approximately 10% of the time is batch-size-independent overhead (data loading, optimizer step, memory allocation). For batch-independent operations: t_fixed ≈ 10% of original time. The matrix multiply scales with batch size, but fixed overheads don't. This is consistent with 32-sample batch: 90% compute + 10% overhead; 16-sample batch: 45% compute + 10% overhead = 55% of original","C":"The GPU is only 90% utilized — the remaining 10% idle time explains the discrepancy","D":"The activation functions are not GPU-accelerated for batch sizes below 20"},"correct":"B","explanation":{"correct":"- Total time = compute_time(batch) + fixed_overhead. Compute scales linearly with batch size (more samples → more FLOPs). Fixed overhead (Python execution, data transfer, optimizer step) is constant per batch.\n- If t_total(32) = 1.0, t_total(16) = 0.55 (45% reduction). Let fixed = c, compute = (1-c). Then: (1-c)·(16/32) + c = 0.55 → 0.5·(1-c) + c = 0.55 → 0.5 + 0.5c = 0.55 → c = 0.1 (10% overhead).\n- This is a common profiling exercise: compute vs overhead decomposition helps identify whether reducing batch size will proportionally reduce training time.","A":"Halving batch size rarely halves training time exactly due to fixed overheads. The 5% discrepancy is not a \"bug\" — it's a natural consequence of batch-independent operations.","B":"","C":"GPU utilization measures parallel compute efficiency, not overhead fractions. Low utilization (idle GPU cores) would manifest as the GPU taking longer for the compute portion, not as fixed overhead.","D":"GPU acceleration for activation functions is batch-size-independent (element-wise operations scale with total element count). 
There's no 20-sample threshold."},"reference":"- PyTorch profiler for compute vs overhead analysis: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08004","difficulty":"medium","orderIndex":4,"question":"A team experiments with two architectures for a 1000-class image classifier: (A) 5 fully connected layers of width 4096 — total parameters ≈ 10B; (B) ResNet-50 with 4 convolutional stages — total parameters ≈ 25M. ResNet-50 achieves far better performance. What architectural property of ResNet-50 explains why 25M parameters outperforms 10B in this task?","options":{"A":"ResNet-50 has skip connections that make it more powerful per parameter due to additive gradient paths","B":"Convolutional layers exploit spatial structure through weight sharing and local connectivity: a single 3×3 kernel with 9×C² parameters detects the same pattern everywhere in an image. Fully connected layers treat all pixel connections independently, requiring exponentially more parameters to cover the same spatial patterns. Weight sharing makes CNNs extremely parameter-efficient for image data where the same features (edges, textures) appear throughout","C":"ResNet-50 uses BatchNorm which makes it more efficient by reducing the number of parameters needed","D":"Fully connected networks cannot process images above 224×224 resolution due to memory constraints"},"correct":"B","explanation":{"correct":"- A 3×3 conv kernel with C_in=256 and C_out=256 has 9×256×256 ≈ 590K parameters and can detect features at any spatial location by sliding the kernel. An equivalent fully-connected layer connecting 256×7×7 = 12,544 positions to 256 output features needs 12,544×256 ≈ 3.2M parameters — for one layer.\n- More fundamentally: in natural images, the same features (corners, curves, textures) appear at all spatial locations. 
Learning these features once (shared weights) and applying everywhere is both more efficient and provides implicit translation invariance.\n- Fully connected networks must learn separate detectors for \"edge in top-left\" vs \"edge in center\" vs \"edge in bottom-right\" — the same feature learned 196 times (for a 14×14 feature map). This is the parameter inefficiency.","A":"Skip connections are a secondary benefit. ResNet-50's primary advantage over a 10B-parameter FC network is convolutional weight sharing, not skip connections. A simple ConvNet without skip connections would still vastly outperform the FC network.","B":"","C":"BatchNorm has relatively few parameters (2×C per layer for scale and bias). Its benefit is training stability, not parameter efficiency. BatchNorm actually adds parameters, not removes them.","D":"Fully connected networks can process any image size (just flatten the input). The constraint is practical (memory and parameter count), not architectural. Many FC-only architectures have processed high-resolution images."},"reference":"- LeCun et al., \"Gradient-Based Learning Applied to Document Recognition\" (1998): original ConvNet efficiency argument\n- https://cs231n.github.io/convolutional-networks/"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08005","difficulty":"medium","orderIndex":5,"question":"You increase the batch size from 256 to 2048 (8×) while keeping all other hyperparameters constant. Training converges to a model with 1.5% higher test error than the 256-batch model. 
A colleague says \"increase learning rate by 8× to recover performance.\" Does this fix work?","options":{"A":"Yes — the linear scaling rule says LR should scale linearly with batch size, which recovers the same effective learning rate per sample","B":"Partially — the linear scaling rule (Goyal et al., 2017) works for moderate batch size increases in well-tested settings, but it requires also using LR warmup (gradually increasing from small LR to 8× LR over first 5 epochs). Directly jumping to 8× LR without warmup causes training instability. Additionally, for very large batches (>8192), diminishing returns appear and the linear rule underfits generalization","C":"No — learning rate should decrease when batch size increases because each step sees more data","D":"Yes but only with SGD; Adam automatically adjusts for batch size changes"},"correct":"B","explanation":{"correct":"- The linear scaling rule (Goyal et al.): for batch size k×B with LR k×η, the model trains to the same accuracy as batch size B with LR η, provided k is not too large. The intuition: larger batches compute lower-variance gradient estimates; the larger LR compensates.\n- Warmup is critical: at initialization, gradients can be large and noisy. Jumping immediately to 8× LR produces large erratic steps. Warmup starts at lr=η and linearly increases to 8×η over 5 epochs.\n- Generalization gap for large batches: large-batch training finds \"sharp minima\" with worse generalization (Keskar et al., 2016). Linear LR scaling compensates for optimization speed but not for the sharp-minima phenomenon. This explains the residual 1.5% gap.","A":"The linear scaling rule is correct in direction (increase LR with batch size) but incomplete — it omits the critical warmup requirement. \"Just multiply LR by 8\" without warmup often fails.","B":"","C":"The intuition \"more data per step → lower LR\" is wrong. More data per step means less noisy gradient estimates, not weaker signal. 
The LR should increase to match the higher quality gradient estimate.","D":"Adam does partially adapt to batch size changes via second moment normalization, but it does not fully automatically compensate. Large batch Adam training still benefits from LR scaling."},"reference":"- Goyal et al., \"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour\" (2017): https://arxiv.org/abs/1706.02677\n- Keskar et al., \"On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima\" (2016): https://arxiv.org/abs/1609.04836"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08006","difficulty":"medium","orderIndex":6,"question":"You design a fully connected network for time series prediction (predict value at t+1 from last 100 timesteps). You use a flat architecture: flatten 100 values into a vector, then 3 FC layers. A senior engineer says \"this architecture ignores the sequential structure of the data.\" She proposes using 1D convolution instead. What specifically does the flat FC architecture fail to exploit?","options":{"A":"FC networks cannot process more than 50 input features","B":"The FC architecture treats all 100 timesteps as independent features with no assumption about temporal locality or ordering. A value at t=1 and t=100 are connected with the same weight matrix as adjacent timesteps. 1D CNNs apply local kernels that capture patterns at specific temporal scales (e.g., a 5-step window can detect short-term trends) and are translation-equivariant — the same pattern at t=10 and t=90 uses identical weights. 
The flat FC must learn these local temporal patterns separately at each position","C":"FC networks cannot backpropagate gradients through more than 3 layers when input size exceeds 100","D":"FC networks require input normalization before processing time series, unlike CNNs"},"correct":"B","explanation":{"correct":"- A fully connected layer from (100,) to (H,): the weight at position W[j,5] (connecting timestep 5 to hidden unit j) is completely independent of W[j,6] (connecting timestep 6). There's no inductive bias for temporal locality.\n- 1D CNN: a kernel of size 5 learns a pattern over 5 consecutive timesteps and slides across all positions. The same kernel detects the same pattern regardless of when it occurs (translation equivariance). Parameter count: 5×channels, applied at every position.\n- The FC layer must learn these temporal patterns without the locality inductive bias — requiring more data and parameters to learn what CNNs represent by construction.","A":"FC networks have no hard limit on input features. A 100-dimensional input is small by modern standards. FC networks handle inputs of millions of dimensions (though inefficiently for structured data).","B":"","C":"Gradient flow through FC networks is determined by the number of layers and activation functions, not input size. 3 FC layers with 100 inputs backpropagate gradients without any 100-feature limit.","D":"Normalization helps both FC and CNN networks. 
It is not specific to FC networks, and the need for it does not depend on the architecture."},"reference":"- https://cs231n.github.io/convolutional-networks/ (parameter sharing and local connectivity)"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08007","difficulty":"hard","orderIndex":7,"question":"You compare two networks of identical depth (10 layers) and width (512 per layer) but different connectivity: Network A is fully dense (each layer fully connected to next); Network B is DenseNet-style (each layer connected to all previous layers). For the same input, Network B requires far more computation in later layers. Why, and what is the growth in parameter count in later layers?","options":{"A":"DenseNet's later layers are slower because they process larger input tensors; each layer k receives concatenation of all k previous layer outputs, so input size grows as k×512","B":"Network B's layer k receives a concatenation of all previous outputs: input to layer k = [h₁, h₂, ..., h_{k-1}, x], with dimension (k-1)×512 + input_dim. The weight matrix for layer k is ((k-1)×512 + input_dim) × 512 (output). As k grows linearly, the parameter count for layer k grows linearly with k — total parameters across all N layers grow as O(N²) vs O(N) for fully connected sequential networks","C":"DenseNet computes additional forward passes for each skip connection, causing multiplicative slowdowns","D":"DenseNet's parameter count doesn't change — it only adds additive operations, not multiplicative ones"},"correct":"B","explanation":{"correct":"- DenseNet concatenation: layer k's input = [x, h₁, ..., h_{k-1}] has dimension d₀ + (k-1)×d_layer. The weight matrix mapping this to d_layer outputs has (d₀ + (k-1)×d_layer) × d_layer parameters.\n- Total parameters ≈ Σₖ (k × d_layer²) = d_layer² × N(N+1)/2 = O(N²). 
For sequential fully connected: each layer has d_layer² parameters → total O(N×d_layer²) = O(N).\n- DenseNet mitigates this with bottleneck layers (1×1 conv in CNN version) and growth rate limiting (each layer adds only g features, not full d_layer). For FC layers without these mitigations, quadratic parameter growth is a genuine concern.","A":"Partially correct description but incomplete. The input size growth is correct (k×512), which is the root cause. But option B completes the analysis with the quadratic parameter count consequence.","B":"","C":"Skip connections don't cause extra forward passes. DenseNet computes each layer once; the skip connections just route already-computed activations to later layers via concatenation.","D":"Concatenating larger inputs requires larger weight matrices (more parameters, more multiplications). This is a multiplicative operation: (k×512) × (512) weight matrix grows with k."},"reference":"- Huang et al., \"Densely Connected Convolutional Networks\" (DenseNet): https://arxiv.org/abs/1608.06993"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08008","difficulty":"hard","orderIndex":8,"question":"You benchmark two model configurations: Config A (batch_size=32, model_width=512) and Config B (batch_size=256, model_width=512). Both are trained for the same number of gradient steps. Config B reaches lower training loss but worse validation loss. An engineer proposes Config C (batch_size=32, model_width=2048). What should she expect and why?","options":{"A":"Config C will perform similarly to Config A because width and batch size trade off identically","B":"Config C will likely have better validation performance than Config B (larger batch = sharp minima, worse generalization) and may outperform Config A on validation due to increased model capacity with small batch size (which finds flatter, better-generalizing minima). 
The combination of wider model (more capacity) with small batch (finds flatter minima) is a common recipe for maximizing generalization","C":"Config C will overfit immediately due to the increased model capacity","D":"Wider models always converge slower, so Config C will not reach competitive loss in the same number of steps"},"correct":"B","explanation":{"correct":"- Large batch (Config B): computes low-variance gradient estimates but tends to converge to sharp minima (narrow loss valleys) that generalize poorly (Keskar et al., 2016). Sharp minima have worse generalization because small input distribution shifts push out of the narrow good region.\n- Wider model (Config C) + small batch (flat minima): width increases representational capacity, while small batch's noisy gradient estimates act as implicit regularization by preventing convergence into sharp narrow minima. This combination often achieves the best of both.\n- The intuition: width alone would increase overfitting risk, but small batch + noise acts as regularization that finds wider minima of the same loss landscape.","A":"Width and batch size affect different aspects of training (capacity vs gradient noise / minima sharpness). They don't trade off in a simple equivalent way.","B":"","C":"Overfitting is a function of parameter-to-data ratio AND training dynamics. With small batch size providing implicit regularization and with dropout/weight decay, a wider model doesn't necessarily overfit more than a narrow one.","D":"Wider models can converge at similar rates as narrow ones with appropriate initialization and learning rate scaling. 
Width affects per-step FLOP cost but not necessarily the number of gradient steps to convergence."},"reference":"- Keskar et al., \"On Large-Batch Training for Deep Learning\" (2016): https://arxiv.org/abs/1609.04836\n- Hoffer et al., \"Train longer, generalize better\" (2017): https://arxiv.org/abs/1705.08741"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08009","difficulty":"hard","orderIndex":9,"question":"A network is designed with 20 fully connected layers, each with width 256 and ReLU activations. Without any normalization or skip connections, the model has 10% test accuracy (same as random for a 10-class problem). Adding only BatchNorm (no skip connections) raises it to 85%. Adding only skip connections (ResNet-style) raises it to 82%. What does this tell us about the role of each component?","options":{"A":"BatchNorm is strictly more important than skip connections for deep networks","B":"Both mechanisms independently solve different aspects of deep network training difficulty: BatchNorm addresses the internal covariate shift / gradient flow problem (keeps activations normalized, stabilizes optimization), while skip connections address the gradient vanishing through additive paths. Their similar effectiveness here suggests both are addressing the same root cause (gradient flow through 20 layers), just through different mechanisms","C":"BatchNorm only helps at test time; it has no effect on training","D":"Skip connections don't help without BatchNorm; the 82% result is due to noise in the experiment"},"correct":"B","explanation":{"correct":"- 10% accuracy (random) → without either mechanism, the 20-layer network fails to train at all; adding just one of them is enough to make it trainable. The depth creates severe gradient flow problems.\n- BatchNorm's contribution: normalizes activations to zero mean, unit variance after each layer. This prevents the exponential activation growth/decay that causes gradients to explode/vanish. 
Also shown to smooth the loss landscape (Santurkar et al., 2018).\n- Skip connections' contribution: provide direct additive gradient paths. ∂L/∂h₀ includes a term from the skip path that doesn't vanish even when the residual branch gradient does.\n- 85% vs 82%: in this specific experiment, BatchNorm is slightly more effective, but this is architecture- and data-specific. Deeper networks (50+ layers) often see skip connections become the dominant factor.","A":"\"Strictly more important\" is too strong. In different architectures and datasets, skip connections are the more critical component (e.g., very deep networks where normalization alone cannot solve gradient flow). Both are important, and the relative importance is context-dependent.","B":"","C":"BatchNorm has separate behavior during training (uses batch statistics) and inference (uses running statistics). Its training behavior (normalizing activations, stabilizing gradient flow) is its primary contribution.","D":"The skip connection result (82%) is meaningful, not noise. ResNet with 20 layers is a well-established architecture that trains effectively. The difference from BatchNorm (85% vs 82%) is within normal architecture comparison variance."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\": https://arxiv.org/abs/1512.03385\n- Santurkar et al., \"How Does Batch Normalization Help Optimization?\": https://arxiv.org/abs/1805.11604"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08010","difficulty":"medium","orderIndex":10,"question":"A product manager requests a neural network that can handle variable-length inputs (sentences with 10 to 500 tokens) without padding. You design: (A) a fully connected network that requires fixed-length inputs, (B) a recurrent network, (C) a CNN with global average pooling. 
Which approach(es) natively handle variable-length inputs and what is the trade-off?","options":{"A":"Only recurrent networks can handle variable-length inputs","B":"Both RNNs and CNNs with global pooling handle variable-length inputs, but through different mechanisms: RNNs process tokens sequentially and can stop at any length; CNNs with global average pooling apply convolutional kernels to any-length sequence and pool all positions to a fixed-size vector. RNNs capture sequence order and long-range dependencies better in principle; CNNs are more parallelizable (can process all positions simultaneously) but use local kernels","C":"Only transformers handle variable-length inputs; RNNs and CNNs require fixed length","D":"Variable-length inputs require padding to the maximum length; no architecture natively avoids this"},"correct":"B","explanation":{"correct":"- RNNs: process one token at a time, maintaining hidden state h_t = f(h_{t-1}, x_t). After T tokens, h_T is the summary. T can be any length — the same weights process 10 or 500 tokens.\n- CNNs + global average pooling: apply 1D kernels (shape: kernel_size × channels) that slide over any sequence length, then average all output positions into a fixed-size vector. The CNN itself requires no sequence length knowledge.\n- Transformers also handle variable-length inputs natively (attention is computed over all positions, quadratic in sequence length). They're not listed in this scenario.","A":"CNNs with global pooling also handle variable length natively, so \"only RNNs\" is incorrect.","B":"","C":"Transformers are not the only option. Both RNNs and CNNs have been used extensively for variable-length inputs (text classification, audio processing) before Transformers became dominant.","D":"While padding is a common implementation technique (for batch efficiency on GPUs), it is not architecturally necessary. 
True variable-length processing is natively supported by RNNs and CNNs with pooling."},"reference":"- https://cs231n.github.io/rnn/ (RNNs and variable-length sequences)"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08011","difficulty":"hard","orderIndex":11,"question":"You train a 10-layer fully connected network and plot the singular value distribution of each layer's weight matrix after training. The first layer has a power-law singular value distribution (a few large values, many small). The last layer has a near-uniform singular value distribution. What does this difference tell you about learned representations?","options":{"A":"The first layer is undertrained (should have uniform singular values); reduce the learning rate for layer 1","B":"The first layer's low-rank structure (dominated by a few singular values) means it has effectively learned to project inputs into a low-dimensional subspace — only a few \"directions\" in the input space are relevant. The last layer's near-uniform distribution indicates it is using all its dimensions roughly equally. This pattern (low effective rank in early layers, higher rank in later layers) often indicates the network is learning to extract the most informative dimensions first","C":"The singular value distribution is random and has no interpretation","D":"Uniform singular values in the last layer indicate overfitting; the model should have low-rank structure throughout"},"correct":"B","explanation":{"correct":"- A weight matrix W with a power-law singular value distribution has low effective rank — most information passes through a few dominant directions. The matrix is approximately W ≈ UΣV^T where only a few singular values σ₁ >> σ₂ >> ... >> σₖ contribute significantly.\n- This pattern in early layers reflects that raw input features (e.g., pixels) are highly correlated. 
The network learns to project into the few meaningful dimensions that contain task-relevant information.\n- Martin & Mahoney (2020) studied this extensively, finding that well-trained networks exhibit implicit low-rank structure (HeavyTailed Self-Regularization) that correlates with generalization. Models with more power-law structure tend to generalize better.","A":"Low-rank structure in trained networks is a sign of learning, not undertraining. Undertrained networks often have near-uniform singular values (close to initialization). Power-law structure emerges during training as the network finds the useful dimensions.","B":"","C":"Singular value distributions in trained networks are highly non-random and deeply informative about network behavior. This is an active research area connecting to random matrix theory and generalization.","D":"Uniform singular values in the last layer suggest full-rank utilization — the output layer needs to use all incoming dimensions to distinguish all classes. This is normal and expected for classification tasks."},"reference":"- Martin & Mahoney, \"Implicit Self-Regularization in Deep Neural Networks\" (2019): https://arxiv.org/abs/1810.01075"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08012","difficulty":"medium","orderIndex":12,"question":"A deep learning framework initializes all weights randomly except for one specific type of layer where weights are initialized to 0: biases. However, for a layer with no activation function (linear output layer), the bias initialization to 0 is stated to be especially important. Why?","options":{"A":"Zero bias prevents the linear output from producing NaN on the first forward pass","B":"For a linear output layer predicting real values, bias=0 means the initial prediction is the weighted sum of inputs with no offset. If the target mean is near 0 (after normalization), this is a reasonable starting point. 
More importantly: if all biases were initialized identically to any non-zero constant, different output units would all start with the same non-zero offset, and the network would need extra training steps to differentiate predictions across classes or output dimensions","C":"Zero bias initialization is required by the Adam optimizer's bias correction algorithm","D":"Non-zero bias initialization for the output layer causes the cross-entropy loss to be undefined at the first step"},"correct":"B","explanation":{"correct":"- For regression with normalized targets (zero mean), bias=0 in the output layer means initial predictions are zero (or close to zero for small random weights) — a reasonable starting point near the target distribution mean.\n- For classification: if output biases were all initialized to the same constant c, then all class logits would include the same c, and softmax would output uniform probabilities (same result as bias=0 after softmax). The constant cancels in softmax.\n- For regression with bias ≠ 0: the model starts predicting non-zero values for all samples, increasing initial loss unnecessarily. Zero initialization minimizes the initial loss and allows faster convergence to meaningful predictions.","A":"Non-zero biases don't cause NaN. The forward pass produces finite values (weighted sum + bias) regardless of bias initialization.","B":"","C":"Adam's bias correction is for the first and second moments of the gradient (initialized to zero), not for model weight biases. These are different \"biases\" — model parameter biases vs optimization moment biases.","D":"Cross-entropy requires positive probability inputs. Non-zero biases in the output layer would produce non-zero logits → non-uniform softmax probabilities → valid cross-entropy (finite loss). 
The loss is not undefined with non-zero bias."},"reference":"- https://cs231n.github.io/neural-networks-2/#init"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08013","difficulty":"hard","orderIndex":13,"question":"You run an ablation study: starting from a baseline 4-layer MLP, you double the width of each layer. Training loss improves by 5%. You then double the depth to 8 layers (keeping the original width). Training loss improves by 8%. You then combine both (8 layers, doubled width). Training loss improves by only 6% relative to the 4-layer/doubled-width baseline, not the expected 13% (5%+8%) additive gain. What phenomenon explains the sub-additive gains?","options":{"A":"The combined model is too large and crashes the GPU, causing training errors that reduce the measured gain","B":"The gains from width and depth are not additive because they address overlapping bottlenecks. Once width is doubled, the limiting factor for the 4-layer model shifts. Adding depth after sufficient width addresses a different bottleneck (depth of representation), but part of the depth gain was already captured by the wider 4-layer model's increased capacity. The effective bottlenecks interact non-linearly","C":"The optimizer cannot handle both increases simultaneously; use separate optimizers for width and depth changes","D":"Sub-additive gains are caused by L2 regularization which penalizes the larger combined model more heavily"},"correct":"B","explanation":{"correct":"- Width and depth improvements often address partially overlapping representational bottlenecks. Doubling width lets each layer represent more features simultaneously. Adding depth lets the network build more hierarchical abstractions.\n- When a model has bottlenecks in both dimensions, fixing one partially alleviates the other (a wider layer can approximate some depth effects by learning more complex functions per layer). 
So the \"remaining gain\" from adding depth after already having doubled width is less than adding depth alone.\n- This is related to the EfficientNet scaling insight (Tan & Le, 2019): optimal performance comes from compound scaling (width, depth, resolution together with balanced ratios), not independent scaling.","A":"GPU crashes would produce training failures or NaN losses, not a consistent 6% improvement. Sub-additive gains on a working model are a property of the learning dynamics, not hardware failure.","B":"","C":"The optimizer works the same for any architecture size. Using \"separate optimizers\" for different architectural components is not a standard technique and wouldn't affect the sub-additivity.","D":"L2 regularization penalizes larger parameter norms, but this would reduce performance to below the baseline, not just reduce gains relative to additive expectation. The regularization would need to be tuned for each model size separately."},"reference":"- Tan & Le, \"EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks\" (2019): https://arxiv.org/abs/1905.11946"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08014","difficulty":"easy","orderIndex":14,"question":"A neural network for binary classification has a final layer: `nn.Linear(256, 2)` with 2 output units + softmax. A colleague says \"this is wasteful — use a single output unit with sigmoid.\" Who is right?","options":{"A":"The colleague is wrong — 2 output units are always required for binary classification","B":"Both approaches are correct and produce equivalent results, but the single-sigmoid approach is more common and efficient: one output unit with sigmoid outputs P(class=1) directly. The 2-unit softmax approach outputs [P(class=0), P(class=1)] where P(class=0) = 1 - P(class=1) — the second output is redundant. 
The 2-unit approach uses 2× more output parameters for the same information","C":"The colleague is wrong — using sigmoid for classification causes training instability compared to softmax","D":"The 2-unit approach allows the model to predict \"neither class\" — a third implicit option that sigmoid cannot represent"},"correct":"B","explanation":{"correct":"- For binary classification, P(class=0) + P(class=1) = 1 (exhaustive, exclusive). So knowing P(class=1) = p immediately gives P(class=0) = 1-p. The second output unit is perfectly redundant.\n- Single sigmoid: output = σ(z). Loss: BCE = -[y·log(σ(z)) + (1-y)·log(1-σ(z))]. Uses 256×1 + 1 = 257 parameters for the final layer.\n- Two-unit softmax: outputs [σ₀, σ₁] = softmax([z₀, z₁]). Uses 256×2 + 2 = 514 parameters. σ₁ = exp(z₁)/(exp(z₀)+exp(z₁)) — equivalent to sigmoid(z₁-z₀). The two logits only matter through their difference, making one of them redundant.","A":"Two output units are NOT always required. Many production binary classifiers use a single sigmoid output. Libraries like scikit-learn's neural network default to single-output sigmoid for binary classification.","B":"","C":"Both sigmoid (BCE) and softmax (CE) are stable for binary classification. 
Sigmoid is actually preferred by most practitioners for binary problems due to simplicity and efficiency.","D":"The 2-unit softmax does not represent a \"third neither class.\" Softmax always produces a valid probability distribution over exactly the specified classes — probabilities sum to 1.0 for the two classes, leaving no probability mass for other options."},"reference":"- PyTorch BCEWithLogitsLoss (single output) vs CrossEntropyLoss (multiple outputs): https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08015","difficulty":"hard","orderIndex":15,"question":"You have a network where the output layer's logits (pre-softmax) have very large magnitude: outputs like [100, -50, 30, ...]. The training loss is low but the model is overconfident — softmax([100, -50, 30]) ≈ [1.0, 0.0, 0.0]. A team member argues this is not a problem because the argmax prediction is correct. Why is extreme logit magnitude a production concern?","options":{"A":"Large logits cause integer overflow in the argmax computation","B":"Extreme logit magnitudes produce near-zero gradients for all but the dominant class (softmax probability ≈ 1 for one class → softmax Jacobian ≈ 0), preventing the model from continuing to learn from correctly classified examples. 
In production: (1) miscalibrated confidence scores are unreliable for downstream systems that use probabilities (e.g., rejection thresholds), (2) distribution shift at inference can push new inputs toward different logit patterns that the model can't recover from, (3) temperature scaling calibration fails when logits are at extreme values","C":"Large logits slow down inference because softmax requires more floating-point operations for large values","D":"The problem only exists during training; inference is unaffected by logit magnitude"},"correct":"B","explanation":{"correct":"- Near-saturation at softmax: when p_correct ≈ 1.0, ∂L/∂logit ≈ (p_predicted - y_true) ≈ 0. The gradient signal vanishes for correctly classified examples. The model essentially stops learning from these samples.\n- Calibration: a downstream classifier or safety system using probability thresholds (e.g., \"only act if confidence > 80%\") can't distinguish between p=0.95 (confident) and p=0.9999999 (extreme). Both are \"high confidence\" but one is healthy and one is pathological.\n- Temperature scaling: the standard post-training calibration technique (divide logits by T, find T on validation set) can't fix extremely large logits well because the logit magnitude spans many orders of magnitude.","A":"Argmax is a comparison operation, not arithmetic on the logit values. Large logit values don't cause overflow in argmax — the operation just selects the index of the maximum value.","B":"","C":"Softmax uses exp(xᵢ - max(x)) for numerical stability, adding only a subtraction step. The computational cost of softmax scales with the number of classes, not logit magnitude.","D":"Inference is directly affected by logit magnitude through probability calibration. 
Inference confidence scores are the logits passed through softmax — their values are the direct output of the model and affect downstream decisions."},"reference":"- Guo et al., \"On Calibration of Modern Neural Networks\" (2017): https://arxiv.org/abs/1706.04599\n- Label smoothing as prevention: Müller et al., \"When Does Label Smoothing Help?\" (2019): https://arxiv.org/abs/1906.02629"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09001","difficulty":"easy","orderIndex":1,"question":"A model achieves 99% training accuracy but only 72% validation accuracy. Adding Dropout (p=0.5) to hidden layers reduces training accuracy to 91% but improves validation accuracy to 85%. A junior engineer is alarmed: \"Dropout hurt our training accuracy!\" Is this a problem?","options":{"A":"Yes — a good model should always have high training accuracy; the Dropout rate is too high","B":"No — the intentional training accuracy reduction is Dropout working as designed. Dropout randomly deactivates 50% of neurons per forward pass, forcing the network to learn redundant representations and preventing co-adaptation. The gap between 99% train and 72% valid was overfitting; the narrowed gap (91% train, 85% valid) is correct behavior","C":"Yes — Dropout should improve both training and validation accuracy simultaneously","D":"No, but the Dropout rate should be increased to 0.9 to close the remaining 6% gap further"},"correct":"B","explanation":{"correct":"- Overfitting (99% train, 72% valid) means the model memorized training patterns. The 27-point gap is the overfitting signal.\n- Dropout during training: each forward pass drops 50% of neurons randomly. The model cannot rely on specific neuron combinations → forces distributed, robust representations. 
This reduces effective model capacity and acts as ensemble training (each dropout mask creates a different \"sub-network\").\n- The 91% train / 85% valid result: narrower 6-point gap (down from 27) with better absolute validation performance. This is the correct trade-off. During inference, Dropout is disabled (model.eval()) and the full network is used. In the original formulation, weights are scaled by the keep probability (1-p) at test time; PyTorch instead uses inverted dropout, scaling the kept activations by 1/(1-p) during training so that no rescaling is needed at inference.","A":"Training accuracy is not the target metric — generalization (validation accuracy) is. A model that achieves 99% training and 72% valid is failing. 91% train and 85% valid is a success.","B":"","C":"Dropout explicitly impairs training by randomly disabling neurons. It is designed to hurt training performance in exchange for better generalization. Both effects are expected.","D":"Increasing Dropout to 0.9 would disable 90% of neurons per pass, severely under-utilizing the network and likely causing underfitting (both train and valid accuracy would drop). Dropout rates above 0.5 are rarely used in practice."},"reference":"- Srivastava et al., \"Dropout: A Simple Way to Prevent Neural Networks from Overfitting\" (2014): https://jmlr.org/papers/v15/srivastava14a.html"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09002","difficulty":"easy","orderIndex":2,"question":"You train two models with L1 and L2 regularization respectively (same λ, same architecture, same data). After training, Model A (L2) has many small weights near 0.01; Model B (L1) has many weights exactly 0 and a few large weights. Which model has sparse weights and why?","options":{"A":"Model A (L2) — L2 regularization pushes weights toward exactly zero","B":"Model B (L1) — L1 penalty (λ·|w|) has a constant gradient (±λ) regardless of weight magnitude. This constant \"pull\" toward zero is strong enough to push small weights all the way to exactly 0. 
L2 penalty gradient (2λ·w) diminishes as w approaches 0, so small weights are only pulled weakly and never reach exactly zero","C":"Both regularizations produce identical sparsity patterns; the difference is only in total loss value","D":"Model A (L2) — L2 regularization produces sparsity through the squared penalty amplifying small weights"},"correct":"B","explanation":{"correct":"- L1 subdifferential at w=0: the subgradient is any value in [-λ, λ]. For w > 0: gradient = λ (constant pull toward 0). For w = 0: gradient can be 0 (if the data gradient is within [-λ, λ]), making 0 a stable equilibrium. This is why L1 produces exact zeros.\n- L2 gradient: 2λ·w. As w→0, gradient→0. The force pulling w toward 0 weakens as w gets smaller. L2 pushes weights toward small values but never has enough force to reach exactly 0 (gradient = 0 only at w=0, which requires the weight to already be 0).\n- Practical consequence: L1 produces sparse models (useful for feature selection); L2 produces small but non-sparse models (useful for general regularization). L1 + L2 = Elastic Net, combining both properties.","A":"L2 does not push weights to exactly zero. This is a fundamental property difference between L1 and L2. Confusing them is a very common misconception.","B":"","C":"The sparsity patterns are very different: L1 creates exact sparsity (many zeros), L2 does not. 
This has major implications for model interpretability and computational efficiency.","D":"L2's squared penalty amplifies the gradient for large weights (strong push for large weights) but diminishes for small weights — the opposite of what would produce sparsity."},"reference":"- Tibshirani, \"Regression Shrinkage and Selection via the Lasso\" (1996): L1 regularization / Lasso\n- https://scikit-learn.org/stable/modules/linear_model.html#lasso"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09003","difficulty":"medium","orderIndex":3,"question":"BatchNorm is applied to a layer's pre-activations. During training, BN uses batch statistics (mean and variance of the current batch). During inference, it uses running statistics (exponential moving average from training). A deployed model that was trained with batch_size=256 is called with batch_size=1 at inference. Your colleague says \"the inference statistics will be wrong since there's only 1 sample.\" Who is correct?","options":{"A":"The colleague is correct — BatchNorm requires batch_size > 1 during inference","B":"The colleague is wrong — during inference, BatchNorm uses stored running statistics (mean and variance accumulated during training), not the current batch's statistics. Batch_size=1 at inference is completely valid because the normalization is applied using the training-time population statistics, not the current sample's statistics","C":"The colleague is correct, but the fix is to use instance normalization at inference time","D":"Both are correct — running statistics are used but become unreliable for batch_size=1"},"correct":"B","explanation":{"correct":"- BatchNorm training mode: normalize using current batch's mean/var. Also updates running_mean and running_var via exponential moving average.\n- BatchNorm eval mode (`model.eval()`): normalize using stored running_mean and running_var. 
The current batch's statistics are completely ignored.\n- A single sample at inference: y_normalized = (x - running_mean) / √(running_var + ε). This is a deterministic transformation using population statistics. The output doesn't depend on whether other samples are in the batch.","A":"Batch_size > 1 is only required during training (for meaningful batch statistics). At inference in eval mode, any batch size works — even batch_size=1.","B":"","C":"Switching to instance normalization at inference is unnecessary and would change the model's behavior (instance norm uses per-sample statistics, not the trained population statistics). This would require retraining.","D":"Running statistics are computed over the entire training dataset and are reliable regardless of inference batch size. A single inference sample doesn't affect the statistics used for normalization."},"reference":"- PyTorch BatchNorm documentation: https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09004","difficulty":"medium","orderIndex":4,"question":"A Transformer-based language model uses LayerNorm instead of BatchNorm. During training with a batch of sequences, LayerNorm normalizes across the feature dimension for each sample independently. A team switches to BatchNorm to improve training stability. After the switch, training becomes more unstable and generation quality drops. What is the fundamental incompatibility?","options":{"A":"BatchNorm is slower than LayerNorm for Transformer architectures","B":"BatchNorm normalizes across the batch dimension (mean/var computed over all batch samples at each position). In NLP, different samples in a batch have different sequence lengths and semantics — mixing statistics across samples can corrupt the representation. 
More critically, at inference time with single samples or variable-length sequences, BatchNorm's running statistics (computed over batch-aggregated features) don't match the per-sample feature distributions. LayerNorm normalizes within each sample independently, making it batch-size-agnostic","C":"BatchNorm requires 2D inputs; Transformer hidden states are 3D (batch, seq, d_model)","D":"LayerNorm includes learnable parameters that BatchNorm lacks, causing the model to lose expressivity"},"correct":"B","explanation":{"correct":"- LayerNorm: for a token at position (batch_idx, seq_pos): normalize across the d_model dimension using that single token's mean and variance. Each token is normalized by its own statistics.\n- BatchNorm at a given layer: for each feature dimension d, compute mean/var over all samples×positions in the batch. This mixes statistics from semantically unrelated tokens (e.g., \"cat\" in one sentence and \"quantum\" in another). These have different semantic content but are normalized together.\n- The critical inference problem: at inference with batch_size=1 and different sequence lengths, running statistics accumulated from diverse training batches may not reflect the statistics of any individual input distribution.","A":"Computational speed is not the primary concern. LayerNorm and BatchNorm have similar complexity. The fundamental issue is correctness of statistics for NLP inputs.","B":"","C":"PyTorch BatchNorm has variants for 1D, 2D, and 3D inputs. BatchNorm1d handles (batch, features) or (batch, features, seq_len). 3D inputs are technically handleable, not the root issue.","D":"Both BatchNorm and LayerNorm have learnable scale (γ) and shift (β) parameters. 
They have comparable expressivity through these parameters."},"reference":"- Ba et al., \"Layer Normalization\" (2016): https://arxiv.org/abs/1607.06450\n- Vaswani et al., \"Attention Is All You Need\" (uses LayerNorm): https://arxiv.org/abs/1706.03762"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09005","difficulty":"medium","orderIndex":5,"question":"A team trains a model with Dropout (p=0.5) and at inference, runs the model in training mode (forgot to call `model.eval()`). The model's predictions are noisy and inconsistent for the same input. The team also notices the model seems to perform slightly differently than expected. What is the technical issue and what are the two consequences?","options":{"A":"Training mode has no effect on Dropout during inference","B":"Two consequences: (1) Stochasticity — Dropout randomly drops 50% of neurons on each forward pass. The same input produces different outputs on different calls. (2) Scale mismatch — during training, Dropout scales outputs by 1/(1-p) = 2× (inverted Dropout) to keep expected values consistent. If the model uses standard (non-inverted) Dropout, inference in train mode produces halved expected outputs compared to eval mode. Modern frameworks use inverted Dropout during training, so issue 1 (noise) is the main practical problem","C":"Training mode causes BatchNorm to use batch statistics, affecting only BatchNorm layers","D":"Dropout in training mode is identical to not using Dropout at all"},"correct":"B","explanation":{"correct":"- PyTorch implements \"inverted dropout\": during training, active neurons are scaled by 1/(1-p), so the expected output magnitude is the same as without Dropout. 
At eval mode, Dropout is disabled and no scaling is applied — this is consistent because training already scaled.\n- The stochasticity issue: running a model in training mode at inference means each call to model(x) produces a different output due to random neuron masking. For deterministic predictions (same x → same output), this is a critical bug.\n- This is a common production bug: forgetting `model.eval()` before inference. It can cause serious problems in A/B tests (non-deterministic results), deployment (different outputs from same input), and monitoring (unexplained prediction variance).","A":"Training mode does affect Dropout during inference — it keeps Dropout active, introducing stochasticity. This is the core bug.","B":"","C":"BatchNorm behavior in training vs eval mode is a separate concern (different statistics), not related to Dropout's effect. The question specifically asks about Dropout consequences.","D":"Dropout in training mode is NOT equivalent to no Dropout — it randomly deactivates neurons, reducing effective model capacity and adding noise to each forward pass."},"reference":"- PyTorch nn.Dropout documentation: https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html\n- model.eval() vs model.train(): https://pytorch.org/docs/stable/generated/torch.nn.Module.eval.html"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09006","difficulty":"medium","orderIndex":6,"question":"GroupNorm is used instead of BatchNorm for object detection with small batch sizes (e.g., 2 samples per GPU). 
You explain to a junior engineer: \"GroupNorm doesn't have the batch-size dependency problem.\" She asks: \"If GroupNorm normalizes across groups of channels, what determines the quality of GroupNorm statistics as batch size decreases?\"","options":{"A":"GroupNorm quality degrades with smaller batch sizes just like BatchNorm","B":"GroupNorm normalizes within each sample and group (mean/var computed over the channels of one group, and their spatial positions, within one sample). Its statistics depend on the number of channels per group (C/num_groups), not the batch size. With batch_size=1, GroupNorm produces meaningful statistics as long as the group size is large enough (typically num_groups=32, which for C=512 gives 16 channels per group). BatchNorm with batch_size=1 yields very noisy statistics, and zero variance when each channel has only one element","C":"GroupNorm requires the number of groups to equal the batch size","D":"GroupNorm and BatchNorm are identical for batch_size=32; GroupNorm is only different for batch_size=1"},"correct":"B","explanation":{"correct":"- GroupNorm: for input (batch, C, H, W), divide C channels into G groups. For each (sample, group): compute mean and var over (C/G × H × W) elements. Statistics are computed within a single sample — batch size doesn't affect them.\n- BatchNorm: for each channel, compute mean/var over (batch × H × W) elements. With batch_size=1: mean and var computed over 1×H×W elements per channel (legitimate for images but the running statistics update is noisy). For batch_size=1 in NLP (1 sample, 1 position): mean and var computed over 1 element — meaningless (var=0).\n- GroupNorm is a common choice for detection/segmentation where batch sizes must be small (high-resolution inputs fill GPU memory). Wu & He (2018) report gains from replacing BatchNorm with GroupNorm in Mask R-CNN under these small-batch conditions.","A":"This is the key distinction between GroupNorm and BatchNorm. GroupNorm's statistics are independent of batch size — this is its defining advantage.","B":"","C":"GroupNorm groups the channel dimension, not the batch dimension. 
The number of groups is a fixed hyperparameter (typically 32) that doesn't change with batch size.","D":"GroupNorm and BatchNorm are mathematically different for all batch sizes. For large batches, BatchNorm estimates population statistics better (more samples), while GroupNorm always uses the same within-sample statistics."},"reference":"- Wu & He, \"Group Normalization\" (2018): https://arxiv.org/abs/1803.08494"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09007","difficulty":"hard","orderIndex":7,"question":"RMSNorm (used in LLaMA, Mistral, and most modern LLMs) computes: y = x / RMS(x) * γ, where RMS(x) = √(mean(x²)). A researcher says: \"RMSNorm is strictly worse than LayerNorm because it doesn't center the activations.\" Is this correct?","options":{"A":"Yes — zero-centering is essential for normalization to be effective","B":"No — RMSNorm drops the mean-centering step but retains the scale normalization. In practice, Transformer hidden states often have near-zero mean already (due to attention patterns and residual connections). The mean-centering step in LayerNorm adds computational overhead (computing mean, subtracting) without meaningful benefit. RMSNorm achieves similar training stability with fewer operations (~20% faster per layer), which is significant at scale","C":"Yes — RMSNorm causes gradient vanishing because the mean is not removed","D":"No — RMSNorm is mathematically equivalent to LayerNorm for all inputs with non-zero mean"},"correct":"B","explanation":{"correct":"- LayerNorm: x̂ = (x - μ) / σ · γ + β, where μ = mean(x), σ = std(x). Two operations: mean subtraction and variance scaling.\n- RMSNorm: x̂ = x / RMS(x) · γ (no mean subtraction, no β bias term). 
Only one operation: scale by inverse root mean square.\n- The empirical finding: Zhang & Sennrich (2019) reported that RMSNorm achieves performance comparable to LayerNorm at reduced compute cost on tasks including neural machine translation, and later LLMs such as LLaMA, Mistral, and Gemma adopted it. The centering step seems less important than the scale normalization.","A":"Zero-centering is beneficial in some settings (e.g., CNNs where feature distributions are asymmetric), but not universally essential. In Transformer residual streams, activations naturally tend toward zero mean due to the residual connection structure.","B":"","C":"Gradient flow through RMSNorm is similar to LayerNorm. The gradient of RMSNorm is well-defined and doesn't cause vanishing gradients. In practice, RMSNorm networks (LLaMA-7B etc.) train stably without additional interventions.","D":"RMSNorm and LayerNorm are not equivalent. LayerNorm subtracts the mean before scaling; RMSNorm does not. For any input with non-zero mean, these produce different outputs."},"reference":"- Zhang & Sennrich, \"Root Mean Square Layer Normalization\" (2019): https://arxiv.org/abs/1910.07467\n- Touvron et al., \"LLaMA 2\" uses RMSNorm: https://arxiv.org/abs/2307.09288"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09008","difficulty":"hard","orderIndex":8,"question":"You train a model with L2 regularization (weight decay λ=0.01) and observe that some layers have consistently large weights (||W||² >> 1) even after training. You increase λ to 0.1. Now all weights are small (||W||² ≈ 0.01) but validation performance drops significantly. What is the diagnostic and fix?","options":{"A":"The model is too large; reduce the number of layers","B":"The large weights in specific layers likely encode critical task-relevant features — the model needs those large weights to represent its learned transformation. 
Increasing λ uniformly across all layers penalizes these important weights as heavily as regularization targets (weights that should be small). Fix: layer-wise λ — apply stronger regularization to early layers (often have redundant features) and weaker to final classification layers, or use gradient-based methods to identify which weights should be large","C":"L2 regularization should never be applied uniformly; replace with L1 which is more selective","D":"The validation drop indicates the model was already at optimal capacity; any regularization hurts"},"correct":"B","explanation":{"correct":"- Not all weights in a network play equal roles. Weights in final classification layers often need to be large to sharply separate class probabilities. Weights in intermediate feature extraction layers may be legitimately large for important features (e.g., edge detectors in CNNs have large weights in the dominant orientation directions).\n- Uniform λ penalizes all weights equally, ignoring their semantic importance. This is the key limitation of global weight decay.\n- In practice: LLM fine-tuning often applies weight decay only to specific parameter groups (not biases, not normalization parameters). PyTorch's AdamW allows per-parameter-group λ: `optimizer = AdamW([{'params': early_layers, 'weight_decay': 0.1}, {'params': final_layer, 'weight_decay': 0.001}], lr=1e-3)`.","A":"The issue is not model size but regularization strength calibration. A model with many layers can still need specific large weights for specific tasks.","B":"","C":"L1 regularization produces sparsity (zero weights), not just small weights. 
For the described problem (some layers needing large weights), L1 would zero those critical weights, making the problem worse.","D":"If the model performs well without λ=0.1, it is not \"at optimal capacity.\" The issue is over-regularization, not perfect capacity utilization."},"reference":"- PyTorch AdamW parameter groups: https://pytorch.org/docs/stable/optim.html#per-parameter-options"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09009","difficulty":"hard","orderIndex":9,"question":"BatchNorm has been claimed to work because it \"reduces internal covariate shift\" (Ioffe & Szegedy, 2015). A subsequent paper (Santurkar et al., 2018, \"How Does Batch Normalization Help Optimization?\") showed this explanation is incorrect. What does the Santurkar et al. paper argue is the actual reason BatchNorm helps, and what experiment did they use to disprove the covariate shift hypothesis?","options":{"A":"Santurkar showed BatchNorm reduces overfitting through regularization effects, not covariate shift","B":"Santurkar showed that even when \"noisy BatchNorm\" (adding noise to BN statistics to increase covariate shift) was applied, training was still faster and more stable than without BN. The actual mechanism is that BatchNorm smooths the loss landscape — it makes the loss function more Lipschitz (smaller changes in loss per unit change in weight) and makes the gradient more predictive (gradients are more consistent across steps). This smooth landscape allows larger learning rates and more stable optimization","C":"Santurkar showed covariate shift reduction is the real mechanism by providing mathematical proof","D":"Santurkar showed that LayerNorm is strictly better than BatchNorm for all architectures"},"correct":"B","explanation":{"correct":"- The covariate shift hypothesis: each layer's input distribution changes as previous layer weights update. BN stabilizes these distributions. 
The hypothesis predicts BN's benefit should correlate with reduced covariate shift.\n- Santurkar's experiment: they added random noise to BN's statistics (increasing covariate shift) and found training was still faster than no-BN. If covariate shift were the mechanism, more covariate shift should hurt training. It didn't.\n- The actual finding: BatchNorm makes the loss landscape smoother (both loss function and its gradient). Smoother loss: better-conditioned Hessian, more predictable gradients, less \"choppy\" optimization trajectory. This allows larger learning rates and explains why BN lets you train \"from anywhere in the loss landscape.\"","A":"While BN does have regularization effects (stochastic batch statistics add noise), this is a secondary finding. The primary mechanism Santurkar identifies is loss landscape smoothing.","B":"","C":"Santurkar explicitly disproves the covariate shift hypothesis — they do not confirm it. The paper is titled \"How Does Batch Normalization Help Optimization?\" and its contribution is providing an alternative explanation.","D":"LayerNorm vs BatchNorm comparison is a separate topic. Santurkar's paper doesn't claim LayerNorm superiority — it analyzes why BatchNorm helps, regardless of how it compares to alternatives."},"reference":"- Santurkar et al., \"How Does Batch Normalization Help Optimization?\" (2018): https://arxiv.org/abs/1805.11604"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09010","difficulty":"hard","orderIndex":10,"question":"A language model uses Pre-LN (LayerNorm before attention, before FFN) instead of Post-LN (LayerNorm after attention+residual). The model trains without warmup and doesn't explode. The same architecture with Post-LN requires careful warmup and learning rate tuning to avoid training instability. 
What property of Pre-LN creates this training robustness?","options":{"A":"Pre-LN is more computationally stable due to fewer matrix multiplications","B":"In Post-LN, the residual connection is added and then normalized: LN(x + sublayer(x)). The gradient with respect to x flows through the LN normalization, which can scale gradients in unpredictable ways depending on the variance of the activations. In Pre-LN, the gradient with respect to x flows directly back through the residual path (x + sublayer(LN(x))): ∂L/∂x = ∂L/∂output · (I + ∂sublayer/∂LN_output · ∂LN/∂x). The \"I\" term is a direct gradient path that is never rescaled by LN. This guarantees that the gradient magnitude doesn't collapse regardless of how LN affects the sublayer path","C":"Pre-LN uses smaller weight matrices, requiring less precise initialization","D":"Pre-LN removes the need for residual connections, simplifying gradient flow"},"correct":"B","explanation":{"correct":"- Post-LN gradient: ∂L/∂x_in flows through LN(x_in + F(x_in)). The LN normalization scales the combined (signal + residual) output. Early in training with small weights, x_in dominates and LN approximately normalizes x_in, scaling gradients by 1/||x_in||. This can destabilize training.\n- Pre-LN gradient: x_out = x_in + F(LN(x_in)). ∂L/∂x_in = ∂L/∂x_out · (1 + ∂F(LN(x_in))/∂x_in). The \"1\" term is the direct residual gradient path that is always present, always magnitude ∂L/∂x_out, never scaled by LN. This gives stable gradient flow even with random initialization.\n- This helps explain why many modern LLMs (e.g., GPT-3, LLaMA, PaLM) use Pre-LN: it tolerates training with little or no warmup and with less careful initialization.","A":"Pre-LN and Post-LN have the same number of matrix multiplications (the attention and FFN sublayers are identical). The difference is only in where LN is placed.","B":"","C":"Pre-LN doesn't change weight matrix sizes. Both use the same d_model × d_model weight matrices. 
The stability comes from gradient path properties, not matrix size.","D":"Pre-LN keeps residual connections. The LN is applied before the sublayer, and the original x is added after: x_out = x + sublayer(LN(x)). Residual connections are essential in both Pre-LN and Post-LN Transformers."},"reference":"- Xiong et al., \"On Layer Normalization in the Transformer Architecture\" (2020): https://arxiv.org/abs/2002.04745\n- GPT-3 technical report (uses Pre-LN): https://arxiv.org/abs/2005.14165"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09011","difficulty":"medium","orderIndex":11,"question":"A team applies both Dropout (p=0.3) and L2 regularization (λ=0.01) simultaneously. A colleague claims \"using both is redundant — they both prevent overfitting through the same mechanism.\" Is this correct?","options":{"A":"Yes — both Dropout and L2 are forms of noise injection and have identical effects","B":"No — they prevent overfitting through different mechanisms and can be complementary: Dropout prevents co-adaptation (neurons relying on each other by randomly disabling them during training); L2 prevents large weights by adding a penalty on weight magnitude. Dropout acts on activations (stochastically) while L2 acts on weights (deterministically). A network can have both, and tuning them jointly often outperforms either alone","C":"Yes — applying both reduces effective learning rate, which is the true mechanism of both regularizers","D":"No, but L2 should be removed when using Dropout as they create conflicting gradient signals"},"correct":"B","explanation":{"correct":"- Dropout mechanism: random deactivation of neurons during training prevents any individual neuron from becoming a \"master neuron\" that the network relies on. Forces distributed, robust representations. It's a structural regularizer.\n- L2 mechanism: directly penalizes large weight magnitudes via the loss term. 
Forces weights toward small values, preventing any single connection from dominating. It's a magnitude regularizer.\n- These address different failure modes: Dropout prevents co-adaptation (structural), L2 prevents weight explosion (magnitude). A model can overfit through either channel. Combined use: typical in modern architectures (e.g., BERT uses both Dropout within the Transformer blocks and weight decay in AdamW).","A":"The mechanisms are fundamentally different. Dropout adds structured activation noise; L2 adds a deterministic gradient penalty. Their mathematical formulations and effects on the loss landscape are distinct.","B":"","C":"Neither Dropout nor L2 reduces the learning rate directly. Dropout reduces the effective number of active neurons per step (computational sparsity), and L2 adds a gradient term (λ·w) that adds to the weight gradient. Neither multiplies the learning rate.","D":"L2 and Dropout gradients don't conflict. In the backward pass, L2 adds λ·w to the weight's gradient, and Dropout masks certain activation gradients. They operate on different quantities (weights vs activations) and sum independently."},"reference":"- Srivastava et al., \"Dropout\" (2014): discusses interaction with other regularization"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09012","difficulty":"easy","orderIndex":12,"question":"Instance Normalization (IN) normalizes each sample-channel pair independently across spatial dimensions, while Group Normalization (GN) normalizes across groups of channels within a sample. For which task is Instance Normalization specifically preferred and why?","options":{"A":"Instance Normalization is preferred for text classification tasks","B":"Instance Normalization is preferred for style transfer and image generation tasks. IN normalizes per-sample per-channel, removing per-channel mean and variance (the \"style\" information). 
Adaptive Instance Normalization (AdaIN) replaces IN statistics with those of a style image, transferring style while preserving content. This makes IN the right choice when you want to manipulate or remove per-channel statistics (style = global channel statistics in artistic style transfer)","C":"Instance Normalization is preferred for batch_size=1 inference in all tasks","D":"Instance Normalization is preferred for any task involving long sequences"},"correct":"B","explanation":{"correct":"- Per-channel mean and variance capture the \"style\" of an image (color distribution, texture frequency). IN normalizes these away, producing a style-invariant representation.\n- AdaIN (Huang & Belongie, 2017): IN(content) with scale/shift from style image statistics. The content image's spatial structure (content) is preserved, but the channel statistics (style) are replaced by the style image's statistics. This is the mechanism behind fast neural style transfer and StyleGAN.\n- Group Norm or Batch Norm is preferred for classification tasks where absolute channel statistics carry semantic information (edge strength, color distribution helps identify objects).","A":"Text classification doesn't have spatial dimensions. Instance Normalization is primarily designed for 2D feature maps (convolutional layers on images). Layer Normalization is the standard for text.","B":"","C":"While IN works at batch_size=1 (like GroupNorm), batch_size=1 compatibility is not its primary advantage. For general inference at batch_size=1, GroupNorm is preferred; IN is specifically motivated by its style normalization property.","D":"Long sequences don't define IN's use case. 
Temporal/sequential data uses recurrent models with Layer Normalization, not Instance Normalization."},"reference":"- Huang & Belongie, \"Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization\" (2017): https://arxiv.org/abs/1703.06868\n- Ulyanov et al., \"Instance Normalization: The Missing Ingredient for Fast Stylization\" (2017): https://arxiv.org/abs/1607.08022"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09013","difficulty":"medium","orderIndex":13,"question":"A ResNet model uses BatchNorm between each convolutional layer. When you fine-tune only the last two layers (freezing all earlier layers), training is unstable — loss oscillates and doesn't converge. A colleague says \"freeze the BatchNorm layers too.\" Why would this help?","options":{"A":"Frozen BatchNorm prevents GPU memory overflow during fine-tuning","B":"Frozen earlier layers' BatchNorm layers are still in training mode, updating running_mean and running_var based on the fine-tuning dataset's statistics. If the fine-tuning data has different distribution than pre-training data, the running statistics diverge from what the frozen earlier-layer weights expect, corrupting the intermediate representations. Freezing BN layers (calling bn.eval()) prevents statistics from updating, keeping the earlier layers' transformation consistent","C":"BatchNorm layers must always be frozen when any other layer is frozen — they are linked","D":"Unfrozen BatchNorm increases the effective learning rate for all layers, causing instability"},"correct":"B","explanation":{"correct":"- The problem: Layer 3 (frozen weights) was trained with pre-training statistics. Its frozen weights W₃ were optimized assuming BN₃ would normalize using pre-training statistics (mean=μ₁, var=σ₁²). Fine-tuning updates BN₃'s running stats to (μ₂, σ₂²). 
Now W₃ is computing outputs based on incorrect normalization — the weights were calibrated for μ₁ but BN₃ is now normalizing with μ₂.\n- Fix: when freezing layers, also freeze their associated BatchNorm layers by calling `.eval()` on them individually: `[m.eval() for m in model.modules() if isinstance(m, nn.BatchNorm2d)]`. Repeat this after every `model.train()` call, which switches all submodules back to training mode.\n- This is standard practice in transfer learning with ResNets: freeze BN stats in all frozen layers to preserve the pre-trained feature space.","A":"BatchNorm in training mode does not increase memory usage significantly. Running statistics are stored as buffers, not gradients. Memory overflow from fine-tuning is typically caused by large batch sizes or too many trainable parameters.","B":"","C":"BatchNorm and weight layers are not inherently linked. You can freeze weights without freezing BN, but as explained, this causes the inconsistency described. The connection is functional, not architectural.","D":"BatchNorm does not increase effective learning rate. The BN statistics update (exponential moving average) is not an optimizer step and doesn't affect weight learning rates."},"reference":"- PyTorch BatchNorm fine-tuning best practices: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09014","difficulty":"hard","orderIndex":14,"question":"You train two identical architectures: one with data augmentation (random crop, flip, color jitter) and one with Dropout. Both achieve similar validation accuracy. 
A reviewer says \"they are equivalent regularization techniques.\" What is the fundamental difference in what they regularize?","options":{"A":"Data augmentation is always strictly better than Dropout","B":"They regularize fundamentally different aspects: data augmentation regularizes the input space — it teaches the model that certain transformations (flips, crops) should not change the prediction, encoding specific invariances (translation, scale, color). Dropout regularizes the model's weight space — it prevents any single neuron from becoming essential, forcing the network to use distributed redundant representations. A model trained with augmentation may still overfit to specific neuron patterns; a model trained with Dropout may still fail on augmented inputs it never saw","C":"They are equivalent because both reduce effective dataset size by introducing uncertainty","D":"Dropout acts on the loss function while data augmentation acts on gradients"},"correct":"B","explanation":{"correct":"- Data augmentation: the model is trained on x, flip(x), crop(x), jitter(x) — all with the same label. The model learns that these transformations are semantically invariant. This directly encodes domain knowledge about what should be invariant, and is specific to the transformation type.\n- Dropout: randomly disables neurons. The model learns that individual neurons are not reliable and develops redundant features. This is architecture-level regularization, independent of input transformations.\n- In practice: combining both is optimal. Augmentation provides input-space invariances; Dropout prevents neural co-adaptation. Neither can substitute for the other: a high-resolution face recognition model needs both precise spatial features (not well-handled by Dropout alone) and robustness to pose/lighting (requires augmentation).","A":"Neither is universally \"strictly better.\" They address different problems. For very large datasets, augmentation often provides more benefit. 
For small datasets, Dropout is critical. For medical imaging with limited augmentation options, Dropout is more important.","B":"","C":"Data augmentation increases effective dataset size (by creating more training examples from each real example). Dropout reduces effective model capacity per training step. They work in opposite directions from the sample-count perspective.","D":"Both data augmentation and Dropout ultimately modify gradients (all training techniques do). Dropout multiplies activation gradients by the dropout mask; augmentation changes which input produces which gradient. The distinction is in what aspects of the learning problem they affect."},"reference":"- Shorten & Khoshgoftaar, \"A survey on Image Data Augmentation for Deep Learning\" (2019): https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09015","difficulty":"hard","orderIndex":15,"question":"A team wants to use Dropout in a Transformer model but finds that applying standard Dropout to attention weights (the attention matrix before softmax) causes severe training instability. They switch to \"attention dropout\" (applied after softmax to the attention probabilities). Why is post-softmax application more stable?","options":{"A":"Post-softmax dropout requires fewer random number generations, reducing GPU overhead","B":"Pre-softmax dropout changes the relative magnitudes of attention logits, potentially creating asymmetric softmax distributions where some tokens receive abnormally high attention (if their competitors were dropped, the remaining logits compete differently). Post-softmax dropout zeros out entire attention connections (a token pair) while the remaining connections maintain their proper probability mass after renormalization in the next layer. 
This preserves the probabilistic interpretation of attention weights while still providing regularization","C":"Pre-softmax dropout causes gradient explosion in the Q·K^T matrix multiplication","D":"Attention dropout must always be applied before softmax; the team made a mistake by switching"},"correct":"B","explanation":{"correct":"- Pre-softmax dropout: zeros out some logits before softmax(logits/√d_k). If the logit for token pair (i,j) was large (high attention) but gets dropped, the softmax redistribution shifts attention to other tokens. This can create artificially high attention on tokens that happened not to be dropped — random attention concentration, not semantically meaningful attention.\n- Post-softmax dropout: directly zeros out attention connections after they've been computed as meaningful probabilities. Each remaining connection retains its correct relative weight. The zeroed-out connections are simply \"masked\", and the model learns to function with any subset of attention connections active.\n- Post-softmax dropout is the convention in widely used Transformer implementations such as BERT (`attention_probs_dropout_prob`). Vaswani et al., Section 5.4, describe dropout applied to the output of each sub-layer and to the sums of the embeddings and positional encodings; dropout on the attention probabilities is an implementation convention rather than something that section specifies.","A":"Random number generation count is the same for both (one random mask of the same size). GPU overhead is not the distinction.","B":"","C":"Gradient explosion in Q·K^T is caused by large logit values or poor initialization, not by dropout. Dropout modifies the mask, not the magnitude of the matrix product.","D":"The team correctly switched to post-softmax dropout, which is the standard convention. 
Pre-softmax dropout is technically implementable but non-standard and less stable as explained."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): Section 5.4 (Regularization with dropout)\n- BERT: https://arxiv.org/abs/1810.04805 (attention_probs_dropout_prob applied after softmax)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10001","difficulty":"easy","orderIndex":1,"question":"A new engineer initializes all weights in a 5-layer fully connected network to 0. After 100 training epochs, the model achieves exactly random chance (10% for a 10-class problem). What went wrong?","options":{"A":"Zero initialization causes NaN during forward propagation","B":"Zero initialization causes the symmetry problem: all neurons in each layer compute the same output and receive the same gradient. All neurons in a layer update identically on every step. The model effectively has only 1 neuron per layer regardless of the layer width. With all weights zero, every hidden layer outputs 0, and the gradient for every neuron in a layer is identical, so they all update to the same non-zero value — they remain symmetric forever","C":"Zero initialization prevents the optimizer from computing gradients","D":"Zero initialization only fails for layers with more than 100 neurons"},"correct":"B","explanation":{"correct":"- Forward pass with W=0: all pre-activations are 0, all activations are 0 (or 0.5 for sigmoid). Every neuron in a layer computes the same output.\n- Backward pass: since all neurons in a layer produce the same output, the gradient flowing into each neuron from the next layer is the same. All neurons receive identical updates. After one step: all neurons update to the same new value — still symmetric.\n- This symmetry is never broken by gradient descent. The model has N neurons in a layer but effective rank 1 — all neurons always compute the same function. 
The model's expressive power is no better than a single neuron per layer.","A":"Zero initialization produces 0 pre-activations (finite values). Forward propagation doesn't produce NaN. The output can pass through softmax cleanly — NaN requires 0/0 or ±∞ operations.","B":"","C":"Gradients can still be computed with zero weights; backpropagation itself does not fail. Many of them are exactly zero, though: the error signal reaching a hidden layer is multiplied by the next layer's (zero) weights, and with ReLU or tanh the hidden activations are 0 as well. Whatever gradients are non-zero are identical across the neurons of a layer. The issue is not gradient computation failure but gradient symmetry.","D":"The symmetry problem affects all zero-initialized layers regardless of width. Even a layer with 2 neurons initialized to zero will exhibit this behavior."},"reference":"- Goodfellow et al., \"Deep Learning\", Section 8.4 (Weight Initialization)\n- http://cs231n.github.io/neural-networks-2/#init"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10002","difficulty":"easy","orderIndex":2,"question":"Xavier (Glorot) initialization draws weights from a distribution with variance = 2/(fan_in + fan_out). Kaiming (He) initialization uses variance = 2/fan_in. When should you use each, and what assumption makes them different?","options":{"A":"Xavier is for convolutional layers; Kaiming is for fully connected layers","B":"Xavier is designed for symmetric activations (sigmoid, tanh) where the goal is to keep the variance of activations equal to the variance of inputs across layers. Kaiming is designed for ReLU-like activations which set half the inputs to 0 — halving the expected variance. Kaiming accounts for ReLU's effective variance reduction by using a 2× larger variance (2/fan_in instead of 1/fan_in). 
Use Xavier with sigmoid/tanh; use Kaiming with ReLU/Leaky ReLU","C":"Both are identical for modern architectures; use either interchangeably","D":"Xavier is for randomly initialized networks; Kaiming is for pretrained networks being fine-tuned"},"correct":"B","explanation":{"correct":"- Xavier derivation: for a layer with symmetric activation g(x) ≈ x near 0 (sigmoid, tanh): Var[output] = fan_in × Var[W] × Var[input]. Setting this equal to Var[input]: Var[W] = 1/fan_in. The symmetric 2/(fan_in+fan_out) formula balances forward and backward variance.\n- Kaiming derivation (He et al., 2015): ReLU(x) = max(0,x) zeros out half the activations. The expected squared output of ReLU(x) for zero-mean symmetric input x is E[ReLU(x)²] = 0.5 × E[x²] = 0.5 × Var[x]. To maintain variance through a ReLU layer: Var[W] = 2/fan_in (factor of 2 compensates for the 50% zeroing).\n- Wrong initialization in deep networks: using Xavier with ReLU → each ReLU layer halves variance → exponential variance decay over depth → vanishing gradients.","A":"Both can be used for convolutional and fully connected layers. The distinction is about activation function, not layer type. PyTorch's `nn.init.xavier_uniform_` and `nn.init.kaiming_uniform_` work for both layer types.","B":"","C":"They produce different variances and make different assumptions. Using Xavier with ReLU in a 20-layer network will likely cause vanishing activations (each layer loses 50% variance). The choice matters significantly.","D":"Both are for randomly initialized networks. 
Pre-trained networks are fine-tuned from existing weights, not reinitialized."},"reference":"- Glorot & Bengio, \"Understanding the difficulty of training deep feedforward neural networks\" (2010): https://proceedings.mlr.press/v9/glorot10a\n- He et al., \"Delving Deep into Rectifiers\" (2015): https://arxiv.org/abs/1502.01852"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10003","difficulty":"medium","orderIndex":3,"question":"You train a 10-layer network with ReLU activations and Kaiming initialization. Training is stable. A colleague changes the initialization to use std=0.001 (much smaller than Kaiming). What will happen to training and why?","options":{"A":"Nothing changes — the optimizer will adjust regardless of initialization","B":"With std=0.001, activations will rapidly approach 0 through the network. Layer 1: Var[a₁] ≈ 0.001²×fan_in×Var[x] << 1. By layer 10: activations are effectively 0 for all inputs. Gradients flowing backward through these near-zero activations will also be near-zero. The model will appear to train (loss is computed) but weights will barely update — the network is in an effective \"dead zone\" from the start","C":"Smaller initialization is always better as it prevents gradient explosion","D":"Kaiming initialization with std=0.001 is equivalent to L2 regularization"},"correct":"B","explanation":{"correct":"- Kaiming std for ReLU: std = √(2/fan_in). For fan_in=512: std ≈ 0.063. Using std=0.001 is ~63× smaller.\n- Forward propagation: each layer computes h_{l} = ReLU(W_{l} h_{l-1}). With very small W, the pre-activations are tiny. After 10 layers of multiplying tiny values, activations are effectively zero (numerical underflow or below learning threshold).\n- Backward propagation: gradients flow through ∂h_l/∂W_l = h_{l-1}. If h_{l-1} is near 0, the weight gradient ≈ 0. 
This creates a self-reinforcing failure: small weights → small activations → small gradients → weights never update.","A":"The optimizer applies updates proportional to gradients. If gradients are ~0 due to poor initialization, the optimizer can't correct the problem — it can't \"see\" what direction to move. A good optimizer cannot overcome initialization so bad that no gradient signal exists.","B":"","C":"Smaller initialization prevents gradient explosion, but taken to extremes, it causes vanishing gradients/activations. There is an optimal scale (Kaiming, Xavier) that balances both issues.","D":"Kaiming initialization and L2 regularization are completely unrelated. L2 regularization is a gradient penalty added during training. Initialization is the starting point of weights."},"reference":"- He et al., \"Delving Deep into Rectifiers\" (2015): Section 2.2 (variance analysis)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10004","difficulty":"medium","orderIndex":4,"question":"You initialize a 100-layer network with weights drawn from N(0, 1/fan_in) (no ReLU correction) and observe that the gradient norm at layer 1 is 10⁻²⁰ while the gradient norm at layer 100 is ~1. What specific problem is this and what mathematical property causes it?","options":{"A":"Exploding gradients — the gradient grows from layer 1 to layer 100","B":"Vanishing gradients — the gradient shrinks exponentially from layer 100 to layer 1. The backward pass multiplies gradients by the weight matrix W^T at each layer. With weights initialized from N(0, 1/fan_in), the spectral norm of W is approximately 1, but repeated multiplication of ~100 such matrices has spectral norm ≈ σ^100 where σ is the average singular value. 
For σ slightly < 1, this decays exponentially (0.99^100 ≈ 0.37; 0.95^100 ≈ 0.006; 0.9^100 ≈ 2.7×10⁻⁵)","C":"Numerical overflow from the 10⁻²⁰ value indicating floating-point underflow","D":"The gradient norm difference is expected and correct — deep networks always have this pattern"},"correct":"B","explanation":{"correct":"- Gradient reaching layer l: ∂L/∂h_l = (∏_{k=l+1}^{L} W_k^T) × ∂L/∂h_L, ignoring the activation derivatives; the weight gradient ∂L/∂W_l is proportional to it. This product of weight matrices is the (transposed) Jacobian of layer L with respect to layer l.\n- If each weight matrix has spectral norm σ < 1: the product of 100 matrices has spectral norm ≤ σ^100. For σ = 0.9: 0.9^100 ≈ 2.7×10⁻⁵. This is the vanishing gradient phenomenon.\n- The specific initialization N(0, 1/fan_in) doesn't account for ReLU. With tanh (and Xavier): the variance is calibrated to keep ||h||² ≈ constant. Without ReLU correction (Kaiming), ReLU halves the variance per layer, causing activations (and thus gradients) to decay exponentially.","A":"The gradient grows in magnitude as you go from early layers (near input) to later layers (near loss), not the other way. Gradient at layer 100 is larger → gradients shrink going backward. This is the definition of vanishing gradients.","B":"","C":"10⁻²⁰ is comfortably inside float32's normal range: the smallest positive normal value is ~1.2×10⁻³⁸, and subnormals extend down to ~1.4×10⁻⁴⁵. The value is therefore neither an overflow nor an underflow. It is a genuinely tiny gradient produced by vanishing gradients, not a floating-point artifact.","D":"The gradient norm difference indicates a serious training problem — layers close to the input will not update meaningfully. 
This is not \"expected and correct.\""},"reference":"- Bengio et al., \"Learning Long-Term Dependencies with Gradient Descent is Difficult\" (1994)\n- Glorot & Bengio, \"Understanding the difficulty of training deep feedforward neural networks\" (2010)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10005","difficulty":"medium","orderIndex":5,"question":"A team trains a model in FP16 (half precision). During the first training step, the loss is NaN. They switch to BF16 and NaN disappears. Both have 16-bit precision; why does BF16 fix the NaN?","options":{"A":"BF16 has higher total precision than FP16; it uses more total bits","B":"FP16 has range [~6×10⁻⁵, ~65504]. BF16 has range [~10⁻³⁸, ~3×10³⁸] — same dynamic range as FP32. At initialization and in early training steps, weight gradients or intermediate activations can exceed FP16's max value (65504), causing overflow to Inf/NaN. BF16's much larger dynamic range (same exponent bits as FP32) prevents this overflow while accepting lower mantissa precision (7 bits vs 10 for FP16)","C":"BF16 automatically applies gradient clipping that FP16 doesn't have","D":"FP16 doesn't support negative numbers; BF16 does, which is required for gradients"},"correct":"B","explanation":{"correct":"- FP16 format: 1 sign bit, 5 exponent bits, 10 mantissa bits. Max value: 65504. Min positive normal: ~6.1×10⁻⁵.\n- BF16 format: 1 sign bit, 8 exponent bits (same as FP32!), 7 mantissa bits. Max value: ~3.4×10³⁸. Min positive: ~1.2×10⁻³⁸.\n- At initialization, with Kaiming init and ReLU, a single forward pass can produce activation magnitudes beyond 65504 in wide networks or with large fan_in. The gradient norms in the first step can also be very large. BF16's FP32-equivalent exponent range prevents these values from overflowing.\n- Trade-off: BF16 has only 7 mantissa bits (vs 10 for FP16), meaning lower fractional precision. 
But for training stability, dynamic range matters more than mantissa bits.","A":"Both FP16 and BF16 use exactly 16 bits total. BF16 doesn't have \"higher total precision\" — it trades mantissa precision for dynamic range.","B":"","C":"BF16 has no built-in gradient clipping. Gradient clipping is a separate technique applied explicitly in the training loop. BF16's advantage is its number representation range.","D":"Both FP16 and BF16 support negative numbers (via the sign bit), as every IEEE 754-style floating-point format does. Gradients are regularly negative in both formats."},"reference":"- NVIDIA BF16 explanation: https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/\n- PyTorch Automatic Mixed Precision: https://pytorch.org/docs/stable/amp.html"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10006","difficulty":"hard","orderIndex":6,"question":"You initialize a Transformer's token embedding matrix with N(0, 1) and the output projection matrix (embedding → logits) with Kaiming initialization. The first forward pass produces NaN logits. After investigating, you find the pre-softmax logits have values on the order of 10⁵ to 10⁶. What is the root cause?","options":{"A":"The softmax function is numerically unstable for all large values","B":"The token embedding N(0,1) has std=1 for vectors of dimension d_model. The expected L2 norm of an embedding vector is √d_model. For d_model=768: ||e|| ≈ √768 ≈ 27.7. After multiple Transformer layers and the output projection (Kaiming std ≈ √(2/d_model) ≈ 0.05 for d_model=768), the final logit magnitude scales with ||W_out|| × ||h||. Embedding vectors with norm ≈ 27.7 enter the Transformer about 27.7× larger than unit-norm inputs, so any subsequent layer whose initialization assumes unit-norm inputs produces correspondingly amplified outputs. 
The fix is to initialize embeddings with std = 1/√d_model","C":"NaN only occurs when both embedding and output projection use the same initialization","D":"The output projection should use zero initialization for the first step"},"correct":"B","explanation":{"correct":"- Standard initialization for embedding matrices in Transformers: Vaswani et al. (2017) explicitly multiply embeddings by √d_model to scale them; GPT-2 uses a normal distribution with std 0.02. The key: embedding vectors shouldn't have O(1) components — they should have ||e|| ≈ O(1), not O(√d_model).\n- With N(0,1) embeddings and d_model=768: embedding norm ≈ 27.7. After LayerNorm (which normalizes to unit variance across features), this gets corrected at the first LN layer. But between the embedding and the first LN, the attention computation Q·K^T computes dot products of vectors with norm 27.7, producing values up to 27.7² ≈ 768 before the 1/√d_k scaling (≈ 27.7 after scaling if d_k = d_model), far larger than the O(1) logits the attention softmax expects.\n- Correct practice: `embedding = nn.Embedding(vocab_size, d_model); nn.init.normal_(embedding.weight, std=1/math.sqrt(d_model))` or use built-in scaling.","A":"Softmax is numerically stable when implemented as softmax(x - max(x)). Large logits don't produce NaN with numerically stable implementations. The NaN comes from the logits before softmax.","B":"","C":"NaN occurs due to magnitude mismatch between embedding scale (O(√d_model)) and the network's expected input scale (~O(1)). It's not about using the same initialization type.","D":"Zero output projection would produce all-zero logits (then uniform softmax, not NaN). 
Zero projection prevents learning but doesn't produce NaN."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3.4 (Embeddings and Softmax)\n- GPT-2 initialization: Radford et al., \"Language Models are Unsupervised Multitask Learners\" (2019)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10007","difficulty":"hard","orderIndex":7,"question":"You train a 50-layer ResNet. After 1000 steps, the loss suddenly spikes from 0.5 to 12.0, then slowly recovers over the next 2000 steps. This loss spike pattern repeats twice more during training. What is causing this and what initialization/training fix prevents it?","options":{"A":"The loss spikes are caused by bad batches in the training data","B":"Loss spikes in deep networks are typically caused by catastrophic gradient updates when the gradient norm temporarily exceeds the optimizer's ability to compensate. Common causes: (1) occasional large-norm batches amplified by large LR; (2) interactions between adaptive optimizer momentum and learning rate schedule (e.g., cosine restarts where LR suddenly increases); (3) in models with residual connections, if residual branch outputs are not properly scaled at initialization (zero-init of the last residual layer), early training can produce unstable activations. Fix: (1) gradient clipping; (2) μP (maximal update parameterization) initialization; (3) zero-init the last conv/linear in each residual block","C":"Loss spikes indicate NaN weights that are automatically recovered by the framework","D":"Loss spikes cannot occur in ResNets due to skip connections"},"correct":"B","explanation":{"correct":"- Zero-init of residual branch: in ResNets, initializing the last layer of each residual block to zero means the residual block outputs 0 at initialization. The network starts as a pure linear chain (only skip connections). 
This prevents early training instability.\n- Gradient clipping: max_norm gradient clipping prevents any single step from making very large weight updates, reducing spike likelihood.\n- μP (Maximal Update Parameterization, Yang et al., 2022): scales initialization and per-layer learning rates so that the size of feature updates stays O(1) as width grows. This makes training behavior more consistent across model sizes and can reduce instability, but it does not guarantee the absence of loss spikes. Cerebras, among others, has reported using it for large model training.\n- Cosine restarts (SGDR): if LR restarts at a high value, this can cause temporary loss spikes. Warm restarts should start at a smaller LR than the previous cycle's peak.","A":"Occasional bad batches do cause temporary loss increases, but recoverable spikes of magnitude 0.5 → 12.0 are disproportionate to batch-level noise. Batch-level variations typically cause loss fluctuations of 0.01-0.1, not a 24× jump.","B":"","C":"NaN weights are permanent (NaN + anything = NaN). A model with NaN weights cannot recover — subsequent steps would all produce NaN. The described pattern (spike then recovery) indicates large but finite gradients, not NaN.","D":"Skip connections reduce vanishing gradients but don't prevent loss spikes. A ResNet can still have loss spikes from large gradient updates if not mitigated."},"reference":"- Yang et al., \"Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (μP)\" (2022): https://arxiv.org/abs/2203.03466\n- Zhang et al., \"Fixup Initialization: Residual Learning Without Normalization\" (2019): https://arxiv.org/abs/1901.09321"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10008","difficulty":"hard","orderIndex":8,"question":"Mixed precision training (FP16 compute, FP32 master weights) uses a technique called \"loss scaling.\" A model trained without loss scaling in FP16 shows correct forward pass but zero gradients for all parameters. 
Why?","options":{"A":"Loss scaling is only needed for the forward pass, not the backward pass","B":"In FP16, the smallest positive normal value is ~6×10⁻⁵ (subnormals reach down to ~6×10⁻⁸ with reduced precision). During backpropagation, gradient values are often much smaller than this (especially in early layers of deep networks or for small gradients in residual branches). These gradients lose precision or underflow to 0 in FP16. Loss scaling multiplies the loss by a large constant (e.g., 2¹⁵=32768) before the backward pass, scaling all gradients by 32768. This pushes tiny gradient values into FP16's representable range. After the backward pass, gradients are divided by 32768 before applying to the FP32 master weights","C":"Loss scaling is needed because FP16 cannot represent negative numbers","D":"Zero gradients in FP16 training are caused by the Adam optimizer, not number format limitations"},"correct":"B","explanation":{"correct":"- FP16 range gap: the smallest positive normal value in FP16 is ~6.1×10⁻⁵. In FP32, it's ~1.2×10⁻³⁸. Gradients in deep networks at later training stages (or in early layers) can easily be 10⁻¹⁰ or smaller — representable in FP32 but underflowing to 0 in FP16.\n- Loss scaling mechanics: if loss_scaled = loss × S (S=32768), then gradient_scaled = gradient × S. Values that were 10⁻¹⁰ become 3.3×10⁻⁶ — still small but within FP16's subnormal range. After backpropagation, gradients are unscaled: gradient = gradient_scaled / S. The master FP32 weights are updated with the correctly-scaled gradients.\n- Dynamic loss scaling: PyTorch's `GradScaler` adjusts S during training. It increases S after a run of steps with finite gradients and decreases S when overflow (Inf/NaN) is detected, skipping that optimizer step.","A":"Loss scaling is specifically designed for the backward pass (gradient computation). The forward pass in FP16 is less susceptible to underflow because activation values are typically larger than gradients.","B":"","C":"FP16 does represent negative numbers (via the sign bit). 
Both FP16 and FP32 support negative numbers in IEEE 754 format.","D":"Adam has bias correction for its moment estimates, but this does not address FP16 underflow. The zero-gradient problem is due to number format limitations, not optimizer choice."},"reference":"- Micikevicius et al., \"Mixed Precision Training\" (2018): https://arxiv.org/abs/1710.03740\n- PyTorch GradScaler: https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10009","difficulty":"hard","orderIndex":9,"question":"You fine-tune a pretrained ResNet-50 by replacing the final classification layer with a new one for 200 classes (original was 1000 classes). You randomly initialize the new layer with Kaiming. After 10 epochs, the frozen layers' BatchNorm running statistics have drifted significantly. What is the source of this drift and how does it indicate a training bug?","options":{"A":"BatchNorm running statistics are always updated by fine-tuning; this is expected behavior","B":"If earlier ResNet layers are frozen (weights don't update), their BatchNorm layers should also be frozen (set to eval mode). Running statistics are updated by BatchNorm layers in train mode, regardless of whether the weights are frozen. If you call model.train() without explicitly setting earlier BN layers to eval mode, BN updates its running mean/var using the fine-tuning data statistics — which may differ significantly from ImageNet pretraining statistics. 
The frozen weights no longer produce the activations they were calibrated for, corrupting the pretrained feature representations","C":"Running statistics drift is caused by the randomly initialized final layer producing incorrect upstream gradients","D":"BatchNorm running statistics only drift when learning rate is too high"},"correct":"B","explanation":{"correct":"- Frozen weights + training-mode BN: a frozen conv layer with W produces activations for the new dataset. The new dataset may have different image statistics (domain shift). BN's running mean/var are updated to match these new activation statistics — diverging from ImageNet statistics.\n- The resulting issue: the frozen earlier-layer weights were optimized with the assumption that BN would normalize using ImageNet statistics. With updated statistics, each BN layer's effective transformation changes: γ × (x - new_mean) / new_std + β ≠ γ × (x - original_mean) / original_std + β. The carefully learned features of the frozen layers are now distorted.\n- Correct fine-tuning: call `model.eval()` first, then set only the layers you want to fine-tune to `train()`. This freezes both weights AND BN statistics for frozen layers.","A":"Running statistics should only be updated in layers where you want the BN to adapt. For frozen layers, updating BN stats corrupts the pretrained representations. This is not expected behavior — it's a training bug.","B":"","C":"Gradients only flow to unfrozen parameters. The frozen earlier layers don't receive gradient updates. Running statistics are updated by the BN forward pass, not by gradients from the final layer.","D":"Running statistics are updated by the exponential moving average in BN's forward pass: running_mean = (1-momentum) × running_mean + momentum × batch_mean. 
This update happens regardless of learning rate."},"reference":"- PyTorch transfer learning tutorial: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10010","difficulty":"medium","orderIndex":10,"question":"A senior engineer proposes \"orthogonal initialization\" for a deep RNN: initializing the recurrent weight matrix as a random orthogonal matrix (Q where Q^T Q = I). What problem does this specifically solve compared to random Gaussian initialization for RNNs?","options":{"A":"Orthogonal matrices are faster to compute the matrix product for","B":"For an RNN, the hidden state update h_t = tanh(W_h h_{t-1} + W_x x_t) involves repeated multiplication by W_h. With random Gaussian initialization, the spectral norm of W_h can be > 1 or < 1 — causing exponential growth or decay in h over many steps. An orthogonal matrix has all singular values exactly 1 (it's an isometry), so ||W_h h||₂ = ||h||₂. The hidden state magnitude is preserved exactly across time steps, directly addressing the vanishing/exploding gradient problem in RNNs","C":"Orthogonal initialization prevents the symmetry-breaking problem that occurs in RNNs","D":"Orthogonal matrices ensure the RNN weights remain sparse during training"},"correct":"B","explanation":{"correct":"- RNN gradient through time: ∂h_t/∂h_0 = ∏_{k=1}^{t} W_h^T · diag(tanh'(a_k)). For long sequences (t=100+), this product determines gradient magnitude.\n- Orthogonal W_h: all singular values = 1 → spectral norm = 1 → the matrix multiplications don't amplify or attenuate vectors. Combined with tanh' (which ≤ 1), the gradient can only decrease (not explode), and decreases more slowly.\n- Gaussian W_h: singular values centered around √(1/fan_in) ≈ √(1/H). If even one singular value is slightly > 1 or < 1, repeated multiplication amplifies this. 
Over 100 time steps, σ^100 collapses toward 0 for σ < 1 and grows very large for σ > 1 (for example, 0.9^100 ≈ 2.7×10⁻⁵ and 1.1^100 ≈ 1.4×10⁴).","A":"Matrix multiplication speed depends on matrix dimensions, not whether the matrix is orthogonal. The computation time for W_h × h is identical whether W_h is orthogonal or Gaussian.","B":"","C":"Symmetry is broken by random (non-zero) initialization of any kind. A random orthogonal matrix breaks symmetry just as a random Gaussian matrix does. This is not the specific benefit of orthogonal initialization.","D":"Random orthogonal matrices are generically dense (almost surely no entry is exactly zero), which is the opposite of sparsity. Sparsity is encouraged by L1 regularization, not by orthogonal initialization."},"reference":"- Saxe et al., \"Exact solutions to the nonlinear dynamics of learning in deep linear networks\" (2013): https://arxiv.org/abs/1312.6120\n- Wisdom et al., \"Full-Capacity Unitary Recurrent Neural Networks\" (2016)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10011","difficulty":"medium","orderIndex":11,"question":"You compare two models at initialization (step 0): Model A uses Kaiming init, Model B uses N(0, 0.01). You compute the loss for both on the same batch. Model A has loss ≈ ln(10) ≈ 2.3 (for 10-class cross-entropy). Model B has loss ≈ 2.3 as well. A junior engineer says \"they're identical at initialization.\" What is wrong with this assessment?","options":{"A":"The losses are identical because both initializations satisfy the uniform prior condition","B":"The initial loss ≈ 2.3 = ln(10) means both models output near-uniform class probabilities (expected for a random classifier on 10 classes). However, this doesn't mean the models are equivalent. Model A (Kaiming) has appropriately scaled weights that will produce meaningful gradient magnitudes through all layers. Model B (small std=0.01) has tiny weights that will cause near-zero activations in deep layers, vanishing gradients, and effectively zero weight updates from the first step. 
The loss metric only captures the model's output; it doesn't reflect the health of the gradient flow through the network","C":"Model B will immediately diverge because small weights cause numerical underflow","D":"Both models have identical gradient norms at initialization"},"correct":"B","explanation":{"correct":"- Why both have similar initial loss: With tiny weights (Model B), near-zero pre-activations → all activations ≈ constant → output logits are approximately equal → softmax ≈ uniform → loss = -ln(1/10) = ln(10) ≈ 2.3. With Kaiming (Model A), activations have proper scale but are random → output logits are random but centered near zero → softmax ≈ uniform → loss ≈ ln(10).\n- The critical difference: Model A's random logits have the right gradient magnitude (Kaiming ensures gradients flow through all layers). Model B's near-zero activations → near-zero gradients → weight updates ≈ 0 from step 1.\n- Diagnostics beyond loss: always check gradient norms per layer at initialization. A healthy initialization has similar gradient norms across layers. Model B would show gradient norms decaying exponentially toward input layers.","A":"The \"uniform prior condition\" is satisfied when the model outputs uniform probabilities. Both models achieve this, but the loss metric doesn't capture the gradient flow health.","B":"","C":"std=0.01 is finite and will produce finite activations (not numerical underflow for a few layers). The problem is training dynamics (vanishing gradients), not immediate NaN/overflow.","D":"Model A and Model B have very different gradient norms. Model B's gradient norms decay to near-zero through the network. Model A has well-scaled gradients throughout. 
This difference is what makes the initializations non-equivalent."},"reference":"- Glorot & Bengio, \"Understanding the difficulty of training deep feedforward neural networks\" (2010)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10012","difficulty":"hard","orderIndex":12,"question":"GPT-2 uses a modified initialization where output projection layers in attention blocks are scaled by 1/√N, where N is the number of residual layers. Explain why this specific scaling is used and what problem it solves.","options":{"A":"Scaling by 1/√N reduces the memory footprint of large models","B":"In a Transformer with N residual layers, the total output is a sum of N residual branch contributions: h_final ≈ x + Σ_{i=1}^{N} f_i(x). If each f_i contributes variance O(1), the sum has variance O(N) — growing with depth. By scaling each residual branch's output projection by 1/√N, each contribution has variance O(1/N), and the sum of N such terms has variance O(N × 1/N) = O(1). This keeps total activation variance bounded regardless of model depth, preventing the \"depth × variance amplification\" issue in very deep Transformers","C":"1/√N scaling is equivalent to reducing learning rate by 1/√N for deeper models","D":"The 1/√N scaling prevents loss spikes only during the first 100 training steps"},"correct":"B","explanation":{"correct":"- Variance analysis: if the Transformer is modeled as x_{L} = x_0 + Σ f_i(x), where each f_i is a residual block with output variance σ_f², then Var[x_L] = Var[x_0] + N·σ_f². For large N (e.g., GPT-3 has 96 layers), this grows linearly with N.\n- GPT-2 fix: initialize the output projection of each block (the final linear layer in attention and FFN) with N(0, 0.02²/N) (equivalently, scale by 1/√N). 
Each block's expected output variance is σ_f²/N, so the total is N × (σ_f²/N) = σ_f² — independent of depth.\n- This follows from Kaiming's variance analysis applied to residual networks: deeper networks need smaller per-layer variance to maintain the same total activation scale.","A":"Scaling initialization values doesn't affect model memory footprint (weights are stored the same way regardless of initialization scale). Memory is determined by the count of weights and their dtype.","B":"","C":"Initialization scaling and learning rate are related through training dynamics but are not equivalent. The 1/√N factor is applied at initialization and doesn't change the learning rate. The learning rate would need to be adjusted separately based on the resulting gradient scales.","D":"The initialization affects training stability throughout training (the initial variance determines how gradients flow from step 1 to convergence). It's not a \"first 100 steps\" effect — good initialization prevents persistent variance growth problems."},"reference":"- Radford et al., \"Language Models are Unsupervised Multitask Learners\" (GPT-2) (2019)\n- Brown et al., \"Language Models are Few-Shot Learners\" (GPT-3): Appendix B (initialization)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10013","difficulty":"easy","orderIndex":13,"question":"Biases in neural networks are typically initialized to 0, but there is one important exception: biases in RNN/LSTM forget gates are often initialized to a large positive value (e.g., 1 or 2). Why?","options":{"A":"Large forget gate biases prevent gradient explosion in LSTMs","B":"The forget gate bias initialized to a large positive value causes the forget gate to be near 1 at the start of training (sigmoid(large positive) ≈ 1). This means the LSTM initially \"forgets nothing\" — it passes the full cell state forward. 
This gives the LSTM access to long-term memory from the start, preventing the vanishing gradient problem in early training. If initialized to 0, forget gates start at 0.5 (sigmoid(0)), causing the model to lose 50% of the cell state at each step, effectively limiting memory length during early training","C":"Large forget gate biases are used to match the scale of other LSTM gates","D":"Forget gate biases should always be 0; using large values is an outdated practice"},"correct":"B","explanation":{"correct":"- Forget gate: f_t = σ(W_f h_{t-1} + U_f x_t + b_f). With b_f=1: f_t ≈ σ(1) ≈ 0.73 at initialization with small weights. With b_f=5: f_t ≈ 0.99.\n- Cell state update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ g_t. If f_t ≈ 0.5 (b_f=0): the cell state is halved at each step. Over 10 steps: C_0 × 0.5^10 ≈ C_0 × 0.001. The LSTM loses the initial cell state within 10 steps.\n- Jozefowicz et al. (2015) showed that initializing forget gate bias to 1 significantly improves LSTM performance on long sequence tasks. The idea: start with a strong prior that the previous state is relevant, let the network learn to forget when appropriate.","A":"Forget gate biases affect the flow of information through time, not the magnitude of gradients directly. Gradient explosion is addressed through gradient clipping or orthogonal initialization, not forget gate bias.","B":"","C":"LSTM gate biases are not matched to each other for scale reasons. Input, output, and forget gates serve different functions, and their biases are set for functional reasons (e.g., forget gate for memory retention), not scale consistency.","D":"This is an active, recommended practice. Keras sets the forget gate bias to 1 by default (`unit_forget_bias=True`); PyTorch's `nn.LSTM` does not (all parameters are drawn from the same uniform distribution), so there the forget gate bias must be set manually. The 2015 paper by Jozefowicz et al. 
and subsequent work have confirmed this benefit."},"reference":"- Jozefowicz et al., \"An Empirical Evaluation of Recurrent Network Architectures\" (2015): http://proceedings.mlr.press/v37/jozefowicz15.html"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10014","difficulty":"medium","orderIndex":14,"question":"You train a model with mixed precision (FP16/FP32). The model converges to val_loss=0.42 in FP32 but only achieves val_loss=0.55 in FP16 with loss scaling. A colleague says \"FP16 is strictly worse; go back to FP32.\" A second colleague says \"increase loss scale factor to fix the remaining gap.\" Who is right?","options":{"A":"The first colleague is right; FP16 fundamentally cannot match FP32 accuracy","B":"Neither is immediately correct. The val_loss gap (0.42 vs 0.55) likely indicates the model is sensitive to weight update precision. Try: (1) BF16 instead of FP16 (maintains FP32 dynamic range); (2) keep certain sensitive operations (softmax, LayerNorm) in FP32 while using FP16 elsewhere (\"mixed\" in mixed precision); (3) increase loss scale factor. Only if all FP16/BF16 variants fail should you revert to pure FP32. Many production models match FP32 accuracy with proper mixed precision implementation","C":"The second colleague is right; increasing the loss scale factor always closes the accuracy gap","D":"The gap is random noise; train for more epochs in FP16 to close it"},"correct":"B","explanation":{"correct":"- The 0.13 val_loss gap suggests weight updates in FP16 are losing precision in a way that affects final model quality. FP16's limited mantissa (10 bits) means weight updates smaller than weight × 2⁻¹⁰ are lost — gradients that would cause the weights to change in the 11th or more significant bit are completely ignored.\n- BF16 addresses dynamic range but still has fewer mantissa bits than FP32. 
The specific solution depends on which operation is losing precision.\n- Master FP32 weights: standard mixed precision keeps FP32 master weights and applies FP16 gradients after casting. If this is already implemented and the gap persists, check that the optimizer state (Adam moments) is also in FP32.","A":"FP16 and BF16 regularly match FP32 accuracy in production. Large models (GPT-3, LLaMA, etc.) are trained entirely in BF16/FP16 with mixed precision and achieve competitive accuracy. The gap indicates a fixable configuration issue.","B":"","C":"Increasing loss scale factor beyond the point where gradients don't underflow provides no additional benefit. If gradients are already in the representable range (not underflowing), a larger scale causes overflow (NaN gradients) rather than improved precision.","D":"The gap is systematic (0.55 vs 0.42 is a 23% relative gap), not random noise. Random noise would cause variation around a central value, not a consistent directional gap."},"reference":"- Micikevicius et al., \"Mixed Precision Training\" (2018): https://arxiv.org/abs/1710.03740"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10015","difficulty":"hard","orderIndex":15,"question":"You train a very wide network (width=4096) with standard Kaiming initialization and SGD. You observe that weight norms grow steadily during training: ||W|| at epoch 100 is 5× larger than at initialization. You suspect this will cause instability. 
A colleague says \"use weight normalization to fix this.\" Another says \"use L2 regularization (weight decay).\" Both are proposed solutions — what is the fundamental difference in how they address the problem?","options":{"A":"Weight normalization and L2 regularization are mathematically identical for SGD","B":"Weight decay directly penalizes ||W||² via the loss gradient (adds -λW to weight update), counteracting the growth by pulling weights toward smaller values — it's an additive correction to the gradient. Weight normalization reparameterizes W = g × v/||v|| (separating magnitude g from direction v), so the weight norm equals |g| and direction and magnitude are updated independently. WN doesn't prevent |g| from growing; it only makes the direction v/||v|| unit-norm. For the instability problem (growing magnitude), weight decay is more direct; WN addresses a different problem (optimizing over direction and magnitude separately for more stable optimization in deep networks)","C":"Both are identical; they both normalize weights to unit norm","D":"Weight normalization prevents weight growth by periodically resetting weights; L2 adds a penalty that makes training slower"},"correct":"B","explanation":{"correct":"- L2 (weight decay) gradient update: Δw = -η(∂L/∂w + λw). The λw term directly reduces weight magnitude at every step. For growing weights, the pull toward zero counteracts gradient-induced growth. At equilibrium: ||w|| is bounded by the balance between gradient-induced growth and weight decay.\n- Weight normalization (Salimans & Kingma, 2016): W = g × v/||v||. Update is now over g (scalar magnitude) and v (direction vector). The direction v/||v|| has unit norm by construction, so the weight norm equals |g|. The raw vector v is unconstrained, and the scale g can still grow. WN makes the loss invariant to the scale of v, improving training stability and conditioning — but doesn't cap |g|.\n- For the specific problem (||W|| growing 5×), weight decay is the appropriate tool. 
WN is more relevant for fixing the optimization landscape (making it easier to optimize well-conditioned directions) than for capping weight magnitude growth.","A":"They are fundamentally different operations. Weight decay adds -λW to the gradient. Weight normalization reparameterizes the weight matrices. For SGD with weight decay, the explicit connection exists (L2 = weight decay), but WN is a reparameterization, not a gradient modification.","B":"","C":"Weight decay does not normalize weights to unit norm — it penalizes large norms without enforcing a specific norm value. Weight normalization ensures ||v|| = 1 for the direction component but allows g to be any value.","D":"Weight normalization doesn't reset weights periodically. It's a mathematical reparameterization applied throughout training. Periodic weight resets would be a completely different technique (not a standard one)."},"reference":"- Salimans & Kingma, \"Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks\" (2016): https://arxiv.org/abs/1602.07868"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11001","difficulty":"easy","orderIndex":1,"question":"A 2D convolutional layer has: kernel_size=3×3, in_channels=64, out_channels=128, stride=1, padding=1. The input is (batch=8, C=64, H=32, W=32). What is the output shape and total number of learnable parameters?","options":{"A":"Output: (8, 128, 32, 32), Parameters: 73,856","B":"Output: (8, 128, 32, 32), Parameters: 73,728","C":"Output: (8, 128, 30, 30), Parameters: 73,728","D":"Output: (8, 64, 32, 32), Parameters: 73,856"},"correct":"A","explanation":{"correct":"- Output spatial size: H_out = (H_in + 2P - K) / S + 1 = (32 + 2×1 - 3) / 1 + 1 = 32. Same for W. So output = (8, 128, 32, 32). 
✓\n- Parameters (weights): kernel_size² × in_channels × out_channels = 3×3×64×128 = 73,728 weight parameters.\n- Parameters (biases): one bias per output channel = 128.\n- Total parameters: 73,728 + 128 = 73,856.\n- The common mistake: forgetting to count bias parameters (or assuming \"no bias by default\"). In PyTorch's `nn.Conv2d`, bias=True by default.","A":"","B":"Correctly computes weight parameters (73,728) but forgets to add bias parameters (128). Total should be 73,856.","C":"Incorrectly computes output size without accounting for padding. With padding=1 and kernel=3: (32 + 2 - 3)/1 + 1 = 32, not 30. No padding would give 30×30.","D":"Confuses output channels — the convolutional layer produces out_channels=128 output feature maps, not the input's 64 channels."},"reference":"- PyTorch Conv2d: https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11002","difficulty":"easy","orderIndex":2,"question":"What is the receptive field of the final feature map after stacking three 3×3 convolutional layers with stride=1 and no pooling?","options":{"A":"3×3 (each layer sees only its kernel size)","B":"7×7 — each 3×3 layer adds (kernel_size - 1) = 2 pixels to the width of the receptive field (1 pixel on each side): layer 1 = 3×3, layer 2 = 5×5, layer 3 = 7×7. Two stacked 3×3 layers have the same receptive field as one 5×5 layer; three 3×3 layers equal one 7×7 layer","C":"9×9 — three layers multiply the receptive field: 3×3×3=9","D":"27×27 — the receptive field grows cubically with the number of layers"},"correct":"B","explanation":{"correct":"- Receptive field calculation: each 3×3 conv layer looks at a 3×3 region of the previous layer's output. Layer 1: RF = 3. Layer 2: each unit in layer 2 sees 3×3 of layer 1's output, each of which saw 3×3 of the input. The RF grows by 2 per layer: RF = 2×L + 1 for L layers with kernel=3.\n- Layer 1: RF = 2×1+1 = 3. Layer 2: RF = 2×2+1 = 5. 
Layer 3: RF = 2×3+1 = 7.\n- This is the VGG insight (Simonyan & Zisserman, 2014): two 3×3 layers have the same receptive field as one 5×5 layer but with fewer parameters (2×9C² vs 25C² for C channels) and more non-linearities.","A":"If each layer only saw its own kernel, there'd be no benefit to stacking layers. The key property of CNNs is that deeper layers have larger receptive fields.","B":"","C":"Receptive field grows additively, not multiplicatively. Each 3×3 layer adds 2 pixels to the RF width (1 on each side). Three layers: RF = 3 + 2 + 2 = 7, not 3×3×3.","D":"Receptive field grows linearly with depth (for constant kernel size and stride=1). With stride>1 at every layer, it grows exponentially — but stride=1 here gives linear growth."},"reference":"- Simonyan & Zisserman, \"Very Deep Convolutional Networks for Large-Scale Image Recognition\" (VGGNet): https://arxiv.org/abs/1409.1556"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11003","difficulty":"medium","orderIndex":3,"question":"AlexNet uses Local Response Normalization (LRN) after its ReLU activations. Later architectures dropped LRN (VGG reported no benefit from it), and ResNet and its successors use BatchNorm instead. What did LRN attempt to do, and why was BatchNorm a better solution?","options":{"A":"LRN and BatchNorm are identical operations; the name change was cosmetic","B":"LRN normalizes activations within a local neighborhood of channels (across nearby channels at the same spatial location), creating competition between channels and providing local contrast normalization — motivated by lateral inhibition in neuroscience. BatchNorm normalizes across the spatial and batch dimensions for each channel, stabilizing the training dynamics and loss landscape. BatchNorm was superior because: (1) it normalizes the learned representation more globally; (2) provides actual training stability benefits; (3) allows higher learning rates. 
LRN's benefit was mostly theoretical; empirical results showed it barely helped once BN was available","C":"LRN was replaced because it caused vanishing gradients, while BatchNorm prevents them","D":"LRN is used for classification; BatchNorm is used for detection tasks only"},"correct":"B","explanation":{"correct":"- LRN (AlexNet, Krizhevsky 2012): for neuron at channel c, position (x,y): normalize by a sum of squared activations across nearby channels [c-n/2, c+n/2]. This creates inter-channel competition (\"the most active neuron suppresses others\"), analogous to lateral inhibition in the visual cortex.\n- BatchNorm (Ioffe & Szegedy, 2015): normalizes each channel's activations across the batch and spatial dimensions. Stabilizes learning by preventing covariate shift (or smoothing loss landscape, per Santurkar 2018).\n- Why LRN fell out of use: LRN adds minor regularization but doesn't address the core optimization problem. When BN was introduced, it provided quantifiably better training stability, faster convergence, and higher final accuracy. LRN's neuroscience motivation didn't translate to reliable empirical gains.","A":"LRN and BatchNorm are mathematically very different. LRN: normalization by local channel neighborhood at the same position. BatchNorm: normalization by batch statistics across all spatial positions. They have different normalization axes.","B":"","C":"LRN doesn't specifically cause vanishing gradients (it normalizes activations, not weights). BatchNorm's primary benefit is training stability, not specifically gradient magnitude control (that's the Kaiming initialization + skip connections job).","D":"BatchNorm is used in both classification and detection networks (ResNet for ImageNet, Mask R-CNN for detection). 
The claim that BN is only for detection is incorrect."},"reference":"- Krizhevsky et al., \"ImageNet Classification with Deep Convolutional Neural Networks\" (AlexNet, 2012)\n- Ioffe & Szegedy, \"Batch Normalization\" (2015): https://arxiv.org/abs/1502.03167"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11004","difficulty":"medium","orderIndex":4,"question":"ResNet introduces skip connections: F(x) + x. The paper argues that learning the residual mapping F(x) = H(x) - x is easier than learning H(x) directly, especially when the identity is a good approximation. What specific evidence from the original ResNet paper supports this, and what would happen without skip connections?","options":{"A":"Without skip connections, training a 56-layer network is impossible due to GPU memory limits","B":"The paper showed that deeper plain networks (without skip connections) have higher training error than shallower ones (counterintuitive — more parameters, worse training). This \"degradation problem\" is not due to overfitting (training error itself is worse). With skip connections, a 110-layer ResNet trains successfully and reaches lower training and test error than a 20-layer plain network. The residual formulation ensures that at minimum, the network can learn identity mappings (F=0), which cannot happen as gracefully in a plain deep network","C":"The evidence is theoretical only; ResNets were proposed before experiments were run","D":"Without skip connections, ResNets achieve 1% lower accuracy because the skip connection provides additional inputs to each layer"},"correct":"B","explanation":{"correct":"- The degradation problem: in the ResNet paper (He et al., 2015, Figure 1), a plain 56-layer network had higher training error (and higher test error) on CIFAR-10 than a plain 20-layer network. More layers → higher training error. 
This rules out overfitting as the cause — overfitting increases val error but should decrease training error.\n- Identity shortcut argument: a 56-layer plain net should be at least as good as a 20-layer net (the extra 36 layers could learn identity), so why isn't it? The answer: learning exact identity mappings is hard for stacked non-linear layers. With residual connections: F(x) = H(x) - x. If the optimal transformation is identity, F(x) = 0, which is easy to achieve (push weights to 0 → F=0).\n- With skip connections: a 110-layer ResNet reaches 6.43% test error on CIFAR-10, improving on the shallower ResNets, whereas plain networks get worse as depth increases.","A":"Deep plain networks can be trained on modern GPUs — the limitation is optimization difficulty, not memory. A 56-layer plain VGG-style network can be instantiated in GPU memory; it simply doesn't train well.","B":"","C":"The ResNet paper is primarily an experimental paper. The experiments on CIFAR-10 and ImageNet are the core evidence. Theory is secondary to the empirical demonstration.","D":"Skip connections don't \"provide additional inputs\" — they add the identity of the previous layer to the output, not a separate additional feature. The benefit is optimization, not additional input information per se."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2015): https://arxiv.org/abs/1512.03385"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11005","difficulty":"medium","orderIndex":5,"question":"EfficientNet uses compound scaling: simultaneously scaling depth (d), width (w), and resolution (r) with a fixed ratio, rather than scaling each independently. You have a baseline model and want to multiply compute by 8×. 
How does compound scaling allocate this vs single-dimension scaling?","options":{"A":"Compound scaling allocates all 8× compute to depth (more layers)","B":"EfficientNet compound scaling: given a compute budget of 2^φ times the baseline (FLOPS ∝ d × w² × r²), the formula uses d = α^φ, w = β^φ, r = γ^φ where α × β² × γ² ≈ 2 (so each increment of φ roughly doubles compute). For 8×: φ=3 (since 2^3=8). The paper's coefficients are α=1.2, β=1.1, γ=1.15, giving d ≈ 1.73×, w ≈ 1.33×, r ≈ 1.52×. Compound scaling uses all three dimensions in balanced ratios, while single-axis scaling (e.g., 8× depth only) is less efficient because image resolution and channel width don't keep up with depth","C":"Compound scaling is identical to width scaling; depth and resolution scaling add no benefit","D":"The 8× compute budget should be split equally: 2.67× per dimension"},"correct":"B","explanation":{"correct":"- Single-axis limitation: scaling only depth creates very deep but narrow networks. Deep narrow networks may have large receptive fields but limited per-layer feature richness. Scaling only width creates wide but shallow networks that can't learn hierarchical features.\n- Balanced scaling intuition: if input resolution increases (more pixels), you need wider layers to process the extra spatial information, and deeper layers to capture higher-level patterns in the larger resolution input. These three dimensions are interdependent.\n- Empirical finding (EfficientNet paper, Tan & Le 2019): given the same FLOPS, compound-scaled models consistently outperform single-axis scaled models at every compute point on ImageNet.","A":"Allocating all compute to depth ignores the interdependency between dimensions. The compound scaling finding is specifically that balanced scaling outperforms single-axis scaling.","B":"","C":"Width scaling is one component of compound scaling. 
Depth and resolution scaling interact with width — increasing resolution without increasing width leaves the additional spatial information under-processed.","D":"Equal splitting (2.67× per dimension) is one possible approach, but EfficientNet finds that the optimal split is not equal. The balanced ratios (α, β, γ) are found via a small grid search on the baseline network; the baseline architecture itself comes from neural architecture search."},"reference":"- Tan & Le, \"EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks\" (2019): https://arxiv.org/abs/1905.11946"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11006","difficulty":"medium","orderIndex":6,"question":"A 1×1 convolutional layer (also called a \"network-in-network\" or pointwise convolution) is applied to a feature map with shape (batch, 256, 28, 28) to produce (batch, 64, 28, 28). What does this operation accomplish, and why is it used in bottleneck ResNet blocks?","options":{"A":"1×1 convolution is a no-op for spatial dimensions; it only changes the batch dimension","B":"1×1 convolution applies a linear projection across channels at each spatial location independently: for each (h, w) position, it computes a 64-dimensional linear combination of the 256 input channels. This is dimensionality reduction in the channel dimension. In ResNet bottleneck blocks: 256→64 (1×1), 64→64 (3×3), 64→256 (1×1). The expensive 3×3 conv operates on the compressed 64-channel representation, reducing the 3×3 conv's compute by a factor of (256/64)² = 16 compared to directly applying 3×3 on 256 channels","C":"1×1 convolution is used to increase spatial resolution from 28×28 to 256×256","D":"1×1 convolution applies 3D spatial filtering across height, width, and channels simultaneously"},"correct":"B","explanation":{"correct":"- 1×1 conv math: output[n, c_out, h, w] = Σ_{c_in} W[c_out, c_in] × input[n, c_in, h, w]. 
This is a matrix-vector product at each spatial position: the 256-dimensional channel vector at position (h,w) is projected to 64 dimensions.\n- Bottleneck compute savings: 3×3 conv with C channels: 9C² FLOPs per position. With bottleneck (C→C/4→C): 1×1 (C×C/4) + 3×3 (9×C/4×C/4) + 1×1 (C/4×C) = C²/4 + 9C²/16 + C²/4 = 17C²/16 ≈ 1.06C². For C=256: the 3×3 conv alone costs 9×64² = 36,864 FLOPs vs 9×256² = 589,824 FLOPs for a direct 3×3 (16× fewer); the full bottleneck costs 69,632 FLOPs (about 8.5× fewer).\n- 1×1 convs also allow channel mixing without spatial computation — they can re-weight which input channels are relevant for each output channel.","A":"1×1 convolution is not a no-op — it changes channel dimensions (from 256 to 64 in this case) and can learn non-trivial channel mixing. It has the same spatial resolution in and out.","B":"","C":"1×1 convolution doesn't change spatial dimensions. The \"1×1\" refers to the spatial extent of the kernel — one pixel × one pixel. Spatial dimensions are preserved.","D":"1×1 convolution is applied independently at each (h, w) spatial position. It does not combine information across spatial locations — it only combines across channels at each position."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2015): Figure 5 (bottleneck block)\n- Lin et al., \"Network In Network\" (2013): https://arxiv.org/abs/1312.4400"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11007","difficulty":"hard","orderIndex":7,"question":"Depthwise separable convolution (used in MobileNet) separates a standard K×K convolution into: (1) depthwise: K×K applied per channel independently, (2) pointwise: 1×1 across channels. For a layer with C_in=128, C_out=256, K=3: calculate the parameter and FLOP reduction vs standard convolution.","options":{"A":"Parameters: 2× fewer; FLOPs: 4× fewer","B":"Standard conv parameters: K²×C_in×C_out = 9×128×256 = 294,912. Depthwise-separable: depthwise K²×C_in = 9×128 = 1,152 + pointwise C_in×C_out = 128×256 = 32,768, total = 33,920. 
Parameter reduction: 294,912/33,920 ≈ 8.7×. FLOP reduction is similar: standard FLOPs ∝ K²×C_in×C_out; DSC FLOPs ∝ K²×C_in + C_in×C_out, giving reduction ≈ 1/(1/C_out + 1/K²) ≈ 8-9× for these values","C":"Depthwise separable convolutions are lossless — they compute exactly the same function as standard convolution with fewer parameters","D":"Parameter reduction is 10×; FLOP reduction is 2× due to the extra 1×1 layer"},"correct":"B","explanation":{"correct":"- Standard conv: a weight tensor of shape (C_out, C_in, K, K). Parameters (biases ignored): C_out × C_in × K² = 256 × 128 × 9 = 294,912.\n- Depthwise conv: C_in kernels of shape (1, K, K), one per input channel (weight tensor (C_in, 1, K, K)). Parameters: C_in × K² = 128 × 9 = 1,152.\n- Pointwise conv: 1×1 conv mixing channels. Parameters: C_in × C_out = 128 × 256 = 32,768.\n- DSC total: 1,152 + 32,768 = 33,920. Ratio: 294,912 / 33,920 ≈ 8.7×.\n- The FLOP reduction formula (per output position): standard FLOPs = K²·C_in·C_out; DSC FLOPs = K²·C_in + C_in·C_out. Ratio = 1/(1/C_out + 1/K²) = 1/(1/256 + 1/9) ≈ 1/(0.0039 + 0.1111) ≈ 8.7×, the same factor as the parameter ratio.","A":"The actual reduction is ~8-9×, not 2× or 4×. The key insight is that DSC doesn't compute all channel combinations simultaneously — depthwise processes each channel separately, then pointwise mixes.","B":"","C":"Depthwise separable convolution is NOT equivalent to standard convolution. Standard conv can express any linear mapping from C_in channels to C_out channels over the K×K window; DSC constrains the function family. DSC is a structured approximation with reduced representational capacity.","D":"The parameter and FLOP reductions are both approximately the same factor (~8-9×), not asymmetrically 10× and 2×. 
The extra 1×1 layer adds parameters (the pointwise conv is the major component of DSC's parameter count)."},"reference":"- Howard et al., \"MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications\" (2017): https://arxiv.org/abs/1704.04861"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11008","difficulty":"hard","orderIndex":8,"question":"ConvNeXt (2022) modernizes a standard ResNet by incorporating design principles from Vision Transformers (ViTs). One key change is using depthwise convolution with very large kernels (7×7) in the inverted bottleneck design. What insight from ViT motivated this change, and how does it compare to stacking smaller kernels?","options":{"A":"Larger kernels are used because they are computationally cheaper than small kernels","B":"Vision Transformers use self-attention, which has a global receptive field — every output position attends to every input position. The 7×7 depthwise conv in ConvNeXt approximates this by having a larger local receptive field (49 spatial positions vs 9 for 3×3). A single 7×7 depthwise conv uses 49×C_in parameters (one kernel per channel); equivalent receptive field from stacking three 3×3 convs would use 3×9×C_in²/reduction_factor parameters. The depthwise design makes large kernels computationally feasible since channels aren't mixed at the spatial step","C":"Large kernels are only used in ConvNeXt for the first layer (similar to ViT's patch embedding)","D":"7×7 kernels are used because they match the 7×7 output resolution at the final ResNet stage"},"correct":"B","explanation":{"correct":"- ViT's inductive bias: multi-head self-attention computes all pairwise token interactions. This creates a fully-connected spatial mixing at every layer. 
The implicit \"large receptive field from the first layer\" is a key difference from CNNs' local receptive fields.\n- ConvNeXt motivation: if large receptive fields help ViTs, can we give CNNs larger receptive fields without the quadratic cost of attention? Depthwise 7×7 convs achieve this: 49 spatial positions processed, but only K²×C_in = 49×C_in parameters (vs K²×C_in×C_out for standard 7×7).\n- ConvNeXt also uses other ViT-inspired changes: patch-based downsampling, fewer normalization layers, GELU activation, inverted bottleneck (wide FFN in ViT → wide channel dimension in ConvNeXt).","A":"Larger kernels are generally more expensive, not cheaper. A 7×7 standard conv uses ~5.4× more FLOPs than a 3×3 conv. ConvNeXt uses depthwise 7×7 (cheap) to get large receptive fields without the full cost.","B":"","C":"The 7×7 depthwise conv is used in every stage of ConvNeXt, not just the first layer. This is a key architectural change applied throughout the network.","D":"The 7×7 kernel size is motivated by receptive field and ViT comparison, not by matching output resolution. The 7×7 output resolution at the final ResNet stage is a feature map size, not directly related to why 7×7 kernels were chosen."},"reference":"- Liu et al., \"A ConvNet for the 2020s (ConvNeXt)\" (2022): https://arxiv.org/abs/2201.03545"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11009","difficulty":"hard","orderIndex":9,"question":"A ResNet-50 is deployed for medical image classification. During inference, an image of size 448×448 is used (model was trained on 224×224). The accuracy drops significantly compared to 224×224 images. 
A colleague says \"just resize to 224×224.\" Another says \"the model can handle any size natively because Conv layers have no size constraint.\" Who is right and why?","options":{"A":"The second colleague is right — ResNets with global average pooling handle any input size","B":"Both are partially right, but they miss a critical issue: convolutional layers accept any size (their weights are size-independent). However, ResNet uses a global average pooling (GAP) layer at the end, which is size-invariant. But performance degrades because: (1) the model's convolutional filters have effective receptive fields calibrated for 224×224 — at 448×448, the same filters cover proportionally smaller parts of the image, disrupting learned feature hierarchies; (2) BatchNorm running statistics were computed for 224×224 spatial distributions; (3) the model hasn't seen 448×448 spatial patterns during training. For best accuracy at 448×448, fine-tune on 448×448 or use test-time augmentation at the training resolution","C":"The first colleague is right — resize to 224×224 is the only valid approach; ResNets cannot process other sizes at all","D":"Both are wrong — a new model must be trained from scratch for 448×448 images"},"correct":"B","explanation":{"correct":"- Why conv layers handle any size: a 3×3 conv sliding over 448×448 produces 446×446 (or 448×448 with padding). The same filters work at any spatial scale — they're position-independent. GAP then takes the mean over all spatial positions, producing a C-dimensional vector regardless of input size.\n- Why performance degrades despite technical compatibility: effective receptive field issue. A ResNet-50 feature at the last conv layer has a receptive field of ~196×196 pixels on a 224×224 input (covering ~77% of the image). On a 448×448 input, the same receptive field covers ~19% of the image. 
The model sees only local patches, not the full object structure it was trained to recognize.\n- The fix options: (1) fine-tune at 448×448 (adjusts BN stats, teaches model to use larger receptive fields); (2) use FixRes (Touvron et al.) which trains at low res and tests at high res with a simple fix-up.","A":"While technically true that GAP + conv layers allow any input size, performance degrades significantly due to resolution mismatch. \"Can handle any size natively\" implies no accuracy penalty, which is incorrect.","B":"","C":"ResNets can technically process any input size — this statement is factually wrong. The convolutional architecture has no hard size constraint.","D":"Fine-tuning on 448×448 is sufficient — no need to train from scratch. Transfer learning from 224×224 pretrained weights is effective for resolution adaptation."},"reference":"- Touvron et al., \"Fixing the train-test resolution discrepancy\" (FixRes) (2019): https://arxiv.org/abs/1906.06423"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11010","difficulty":"medium","orderIndex":10,"question":"You replace all MaxPooling layers in a CNN with stride-2 convolutional layers (same kernel size, stride=2 instead of stride=1, no pooling). What are the trade-offs?","options":{"A":"Stride-2 conv is strictly better — it has all the benefits of pooling plus it's learnable","B":"Stride-2 conv is learnable (parameters can be optimized for the task) and preserves more information (learned aggregation vs fixed max operation). MaxPooling is parameter-free (no weights to learn), provides perfect translation invariance within the pooling window, and applies a non-linear operation (max). Trade-off: stride-2 conv adds parameters and may be harder to optimize; it doesn't have the built-in non-linear selection property of max. Modern architectures (ResNet, etc.) 
largely use stride-2 conv; MaxPooling is used in older architectures (VGG) and where translation invariance is explicitly desired","C":"Stride-2 conv and MaxPooling are mathematically identical when kernel_size matches","D":"MaxPooling should always be preferred; stride-2 conv causes spatial aliasing"},"correct":"B","explanation":{"correct":"- MaxPool: takes the maximum value in each pooling window. This gives local shift invariance (if the maximum moves by 1 pixel but stays inside the same window, the pool output is unchanged). No learnable parameters. Non-linear (max is piecewise linear and non-differentiable at ties).\n- Stride-2 conv: a learned linear combination of the input at each position, then strided. It can represent any linear aggregation over the window (for example, average pooling) but not max itself, which is non-linear; it requires training data to learn the appropriate weights and can overfit the aggregation function.\n- Empirical result: strided convolutions work as well or better in practice (ResNet-50 downsamples between stages with stride-2 convs and keeps a single max pool in the stem). The learned downsampling often outperforms fixed max downsampling for high-level vision tasks.","A":"Stride-2 conv is not \"strictly better.\" For tasks where translation invariance is explicitly desired (e.g., detecting presence of a small texture anywhere in the image), max pooling's built-in invariance is valuable. Learnability doesn't always help if the dataset is small.","B":"","C":"They are not mathematically identical. MaxPool computes the maximum; stride-2 conv computes a weighted sum. These are fundamentally different operations (non-linear max vs linear weighted sum).","D":"MaxPooling also has spatial aliasing (skipping every other pixel) — both approaches have aliasing issues. 
The claim that stride-2 conv \"causes aliasing\" while MaxPool doesn't is incorrect (both downsample)."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2015): Section 3.3 (uses stride instead of pooling)\n- Springenberg et al., \"Striving for Simplicity: The All Convolutional Net\" (2015): https://arxiv.org/abs/1412.6806"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11011","difficulty":"medium","orderIndex":11,"question":"In semantic segmentation, the output must have the same spatial resolution as the input (each pixel gets a class label). Encoder-decoder architectures (U-Net, SegNet) use skip connections from encoder to decoder. What specific information do these skip connections carry, and why is it critical for pixel-level predictions?","options":{"A":"Skip connections carry class labels from earlier predictions in the encoder","B":"Skip connections carry high-resolution spatial detail from early encoder layers directly to the corresponding decoder layers. The encoder progressively downsamples and loses fine spatial information (exact object boundaries, thin structures). The decoder upsamples from the bottleneck but can only recover coarse locations without the spatial details. Skip connections provide the exact high-resolution feature maps to the decoder, allowing it to produce sharp boundaries — the encoder contributes \"where exactly\" (spatial precision) while the bottleneck contributes \"what\" (semantic context)","C":"Skip connections in U-Net are only used to prevent gradient vanishing during training","D":"Skip connections carry the original pixel values (raw input) to all decoder layers"},"correct":"B","explanation":{"correct":"- Encoder path: spatial resolution decreases (224→112→56→28→14→7) while channels increase. 
At 7×7, the network has high semantic understanding but no precise spatial location.\n- Decoder without skip connections: upsamples from 7×7 back to 224×224 using only the bottleneck features. Can reconstruct coarse object locations but produces blurry, imprecise boundaries.\n- U-Net skip connections: at each decoder resolution level, concatenate (or add) the encoder's feature map of the same resolution. The 56×56 decoder layer gets the encoder's 56×56 features — these contain the precise boundaries and textures that were lost during subsequent downsampling.\n- Critical for thin structures: in medical imaging (e.g., blood vessels, cell borders), thin 1-2 pixel structures are completely lost in deep encoders. Skip connections restore this detail.","A":"Skip connections carry feature maps (intermediate learned representations), not class labels. Classification happens at the final decoder output layer.","B":"","C":"Skip connections do help gradient flow (paths to early layers), but this is a secondary benefit. The primary motivation is spatial detail transfer for precise pixel-level predictions.","D":"Skip connections carry layer-specific feature maps from the encoder (processed representations), not the original pixel values. Only the very first skip connection (if from the input layer) would carry near-raw pixels."},"reference":"- Ronneberger et al., \"U-Net: Convolutional Networks for Biomedical Image Segmentation\" (2015): https://arxiv.org/abs/1505.04597"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11012","difficulty":"hard","orderIndex":12,"question":"You compare two feature extraction methods for a ResNet-50: (A) Global Average Pooling (GAP) to produce a 2048-d vector, (B) spatial feature pyramid pooling (SPP) to produce a 7168-d vector (concatenating 2048 from 1×1 + 2048 from 2×2 + 3072 from 3×3 spatial pools). For a nearest-neighbor image retrieval task, SPP achieves significantly higher mAP. 
Why?","options":{"A":"SPP achieves higher mAP only because it has a larger feature vector (7168 vs 2048)","B":"GAP spatially averages all activations into a single vector — spatial location information is completely discarded. SPP captures features at multiple spatial scales: the 1×1 pool captures global statistics; 2×2 captures quadrant-level features (top-left, top-right, etc.); 3×3 captures 9-region features. For retrieval, the spatial distribution of features (where in the image a feature occurs) is crucial for discriminating images. SPP preserves \"what is in which region\" rather than just \"what is in the image.\" The larger vector is a consequence, not the cause of the improvement","C":"SPP is better because it applies non-linear pooling (max) instead of average pooling","D":"GAP and SPP are equivalent for retrieval tasks; the mAP difference is due to random seeds"},"correct":"B","explanation":{"correct":"- GAP limitation for retrieval: two images with the same objects in different spatial arrangements produce similar GAP vectors. A horse in the upper-left and a horse in the lower-right collapse to the same 2048-d average.\n- SPP spatial preservation: with 2×2 pooling, the top-left quadrant's features are in a different part of the SPP vector than the bottom-right quadrant's features. Two images with the same objects in different positions produce different SPP vectors.\n- This spatial discriminativeness is critical for retrieval — the goal is to find images that are similar in both content and arrangement. For classification (where spatial invariance is desirable), GAP is better.","A":"The larger vector size (7168 vs 2048) does increase capacity but is not the primary reason for mAP improvement. The improvement is specifically due to the spatial information preservation. 
You can verify this by comparing SPP to a random 7168-d feature — the random large vector wouldn't improve mAP.","B":"","C":"SPP uses max pooling (not average) in some formulations, but the key benefit is multi-scale spatial sampling, not the max vs average distinction. GAP with multiple scales would also outperform single-scale GAP.","D":"The mAP difference between spatial and non-spatial features on retrieval benchmarks is consistent and large (often 10-20% mAP). This is a well-established finding in image retrieval literature."},"reference":"- He et al., \"Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition\" (SPPNet, 2014): https://arxiv.org/abs/1406.4729"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11013","difficulty":"easy","orderIndex":13,"question":"In the LeNet→AlexNet progression, AlexNet introduced several innovations beyond depth. What was the role of ReLU activation compared to the sigmoid/tanh activations used in LeNet?","options":{"A":"ReLU was introduced to prevent the dying neuron problem specific to LeNet","B":"ReLU provides non-saturating gradients for positive activations: ∂ReLU/∂x = 1 for x>0, vs ∂sigmoid/∂x = σ(x)(1-σ(x)) which is at most 0.25 and approaches 0 for large |x|. Tanh is similarly bounded. With ReLU, deep networks can be trained faster because gradients don't vanish through many layers. AlexNet's paper showed ReLU networks trained 6× faster than tanh networks on the same architecture. The trade-off: dying ReLU (if pre-activation stays negative, gradient = 0 permanently)","C":"ReLU was used for biological plausibility; the dying ReLU problem was actually desired to simulate neuron death","D":"ReLU was used because sigmoid requires expensive exp() computations that were impractical on 2012 hardware"},"correct":"B","explanation":{"correct":"- Sigmoid saturation: for large positive or negative x, σ(x)→0 or σ(x)→1. Gradient ≈ 0 for saturated neurons. 
In a deep network, multiple layers of near-zero gradients compound into vanishing gradient.\n- ReLU (Rectified Linear Unit): max(0,x). Gradient is exactly 1 for x>0 (no saturation), exactly 0 for x<0 (dying ReLU). The non-saturating positive gradient allows training of much deeper networks.\n- 6× speedup: stated in the AlexNet paper. On CIFAR-10, a 4-layer ReLU network reached 25% training error 6× faster than a tanh network.","A":"The dying ReLU problem (neurons stuck at x<0 permanently outputting 0) is a known issue with ReLU, not a problem in LeNet. LeNet used sigmoid/tanh which have different issues (saturation).","B":"","C":"Biological plausibility motivation is not the primary reason cited in the AlexNet paper. The paper explicitly states faster training due to non-saturation as the motivation.","D":"exp() is fast on modern hardware and GPUs. The AlexNet paper doesn't mention hardware computational cost as the reason for ReLU. The primary reason is training speed due to non-saturating gradients."},"reference":"- Krizhevsky et al., \"ImageNet Classification with Deep Convolutional Neural Networks\" (AlexNet, 2012): Section 3.1 (ReLU Nonlinearity)"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11014","difficulty":"hard","orderIndex":14,"question":"A team uses feature maps from a ResNet-50 backbone at multiple scales for object detection (FPN: Feature Pyramid Network). They extract features from ResNet's C3, C4, C5 stages and add top-down pathways. An engineer wants to add the C2 stage (high-resolution, early features). What is the trade-off of adding C2 features?","options":{"A":"Adding C2 features increases accuracy with no cost because more features are always better","B":"C2 features have very high spatial resolution (e.g., 56×56 for 224×224 input) and low semantic content (early layers detect edges, textures, not objects). 
Adding C2 to FPN: (1) increases memory proportionally to the spatial resolution increase — 56×56 features require 4× more memory than 28×28 (C3) features; (2) the top-down pathway now needs to propagate semantic context to a much larger feature map; (3) C2's features are semantically weak — the network needs to combine high-res but low-semantics with the FPN's top-down signal. For most detection tasks, C2 marginally helps small-object detection but significantly increases compute/memory","C":"C2 features have too low resolution to be useful for object detection","D":"Adding C2 requires retraining the entire backbone from scratch due to gradient flow changes"},"correct":"B","explanation":{"correct":"- FPN multi-scale design: C5 (7×7, 2048ch) → P5 (semantic, low-res). C4 (14×14) → P4. C3 (28×28) → P3. Each P_i is used for detecting objects at a specific scale range.\n- C2 (56×56) benefit: enables detection of very small objects (objects that are only a few pixels in the C3 feature map become more resolved in C2).\n- C2 cost: 56×56 = 3,136 positions vs C3's 28×28 = 784. 4× more positions for all convolutions in the detection head. For batch_size=2 with C2, the FPN head memory can increase by ~20-40% depending on the head design. For the marginal benefit on small objects (which may be rare in the dataset), this cost is often not justified.","A":"More features are not always better when cost is considered. C2's high memory cost is a real trade-off. Additionally, C2's low semantic content means the FPN must do more work to make these features useful.","B":"","C":"C2 has the highest resolution (56×56 for 224×224 input) — it's the highest resolution stage in the backbone. The claim of \"too low resolution\" is factually incorrect.","D":"Adding C2 features to FPN doesn't require retraining from scratch. The backbone weights are unchanged; only the FPN lateral connections and head need training. 
This is standard fine-tuning."},"reference":"- Lin et al., \"Feature Pyramid Networks for Object Detection\" (FPN, 2017): https://arxiv.org/abs/1612.03144"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11015","difficulty":"hard","orderIndex":15,"question":"VGG-16 and ResNet-50 achieve similar ImageNet top-5 accuracy (~92%). VGG-16 has 138M parameters; ResNet-50 has 25M. You deploy both in production serving 10,000 requests/second. Which bottleneck does VGG-16 hit first, and why doesn't ResNet-50 have the same problem?","options":{"A":"VGG-16 hits a compute bottleneck; ResNet-50 avoids it because skip connections reduce FLOPs","B":"VGG-16's 138M parameters require 528MB in FP32 (or 264MB in FP16). At 10K requests/sec, VGG-16 primarily hits a memory bandwidth bottleneck: each inference must load the full 138M parameters from GPU memory. ResNet-50 at 25M parameters requires only 100MB, so a far larger share of its weights can stay in GPU L2 cache across concurrent requests. The bottleneck shifts from parameter loading (memory-bound) for VGG to compute-bound for ResNet. In production, memory-bound layers are typically 5-10× slower than compute-bound at the same FLOP count","C":"VGG-16 hits a latency bottleneck because it has more layers than ResNet-50","D":"Both models hit identical bottlenecks; parameter count doesn't affect throughput"},"correct":"B","explanation":{"correct":"- Memory bandwidth bottleneck: GPU memory bandwidth (A100: ~2 TB/s) limits how fast weights can be loaded for inference. For VGG-16: loading 528MB takes 528MB / 2TB/s = 0.26ms just for weight loading. For ResNet-50: 100MB / 2TB/s = 0.05ms — 5× faster just from weight transfers.\n- Roofline model: operations are either compute-bound (limited by FLOP/s) or memory-bound (limited by memory bandwidth). 
For VGG-16's large fully-connected layers (4096→4096: 16.8M parameters), the ratio of FLOPs to bytes loaded is low → memory-bound.\n- Cache effects: ResNet-50's smaller parameter set allows much of the model to be resident in GPU cache. VGG-16's parameters don't fit, requiring main GPU memory access on every inference.","A":"Skip connections don't reduce FLOPs — they add a small number of FLOPs (the element-wise addition). ResNet-50 does need fewer FLOPs per image than VGG-16 (roughly 4 vs 15 GFLOPs), but that comes from its bottleneck blocks and early downsampling, not from the skip connections. The bottleneck difference described here is parameter count / memory bandwidth, not a FLOP saving from skip connections.","B":"","C":"VGG-16 has 16 layers; ResNet-50 has 50 layers. ResNet-50 has more layers, not fewer. Layer count doesn't directly map to latency without considering the operation type and size per layer.","D":"Parameter count directly affects memory usage and bandwidth requirements. A 5× parameter reduction results in a 5× reduction in weight transfer time, which is a primary bottleneck in small-batch inference, where the loaded weights are reused across only a few inputs."},"reference":"- Williams et al., \"Roofline: An insightful visual performance model for multicore architectures\" (2009)\n- MLPerf inference benchmarks: https://mlcommons.org/en/inference-edge-20/"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12001","difficulty":"easy","orderIndex":1,"question":"A vanilla RNN processes a sequence of 100 words and must produce a single classification output. You observe that the gradient norm at step 1 is 10⁻¹⁵ while the gradient norm at step 100 is ~1.0. What is this problem, and what mathematical property causes it?","options":{"A":"Exploding gradients — the gradient grows from step 1 to step 100","B":"Vanishing gradients — the gradient at step 1 is effectively zero. In vanilla RNNs, the gradient at step t flows backward through the recurrence ∂L/∂h_1 = ∂L/∂h_T × ∏_{t=2}^{T} ∂h_t/∂h_{t-1}. Each term ∂h_t/∂h_{t-1} = diag(tanh'(a_t)) × W_h. With T=100 steps, this is a product of 99 Jacobians, which behaves like a matrix raised to the 99th power. 
If the spectral radius of W_h × diag(tanh') is < 1 (which it usually is since tanh' ≤ 1), repeated multiplication drives the gradient to zero exponentially","C":"The gradient norm difference is expected behavior; only the gradient at the last step matters","D":"Vanishing gradients only occur in forward propagation, not backward propagation"},"correct":"B","explanation":{"correct":"- Gradient chain rule for RNNs: ∂L/∂h_1 = Π_{t=2}^{100} W_h^T × diag(tanh'(a_t)) × ∂L/∂h_100.\n- tanh'(x) = 1 - tanh²(x) ≤ 1, and equals 1 only at x=0. For any non-zero pre-activation, tanh' < 1. The product of 99 such terms × W_h: if the largest singular value < 1, the product → 0 exponentially.\n- Practical consequence: the model cannot learn long-range dependencies. The prediction at step 100 depends almost entirely on steps ~90-100; information from steps 1-70 is effectively lost.","A":"The gradient grows from step 1 to step 100, meaning the gradient AT STEP 1 is smaller. This is vanishing gradients (gradients vanish going backward to early timesteps), not exploding. Exploding gradients would make early gradients LARGER than late gradients.","B":"","C":"For sequence classification, all input positions contribute information. The gradient at step 1 being ~0 means the model doesn't update its weights based on early-sequence content — a critical failure for long sequences.","D":"Vanishing gradients are specifically a backward pass (backpropagation through time) phenomenon. The forward pass computes activations, which may also diminish but doesn't directly cause training failure. 
The training failure comes from zero gradients in backprop."},"reference":"- Bengio et al., \"Learning Long-Term Dependencies with Gradient Descent is Difficult\" (1994)\n- Pascanu et al., \"On the difficulty of training recurrent neural networks\" (2013): https://arxiv.org/abs/1211.5063"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12002","difficulty":"easy","orderIndex":2,"question":"An LSTM has three gates: forget, input, and output. A textbook describes the forget gate as \"the most important gate.\" For a sentiment analysis task on long movie reviews, explain intuitively what information each gate controls and why the forget gate matters specifically.","options":{"A":"The forget gate removes information from the cell state; the input gate adds information; the output gate determines the hidden state — forget is most important because it prevents information accumulation","B":"Forget gate (f_t = σ(W_f[h_{t-1}, x_t] + b_f)): controls what to erase from cell state. For sentiment: \"However, despite the boring intro...\" — after \"However\", the forget gate should reduce the weight on the positive sentiment accumulated so far. Input gate (i_t): controls what new information to write. Output gate (o_t): controls what part of cell state to expose as hidden state h_t for the next step or output. The forget gate is critical for long documents because without selective forgetting, the cell state accumulates everything equally, losing signal in noise","C":"The output gate is most important; it's the only gate visible to the next layer","D":"All gates are equally important; the \"forget gate is most important\" claim is a myth"},"correct":"B","explanation":{"correct":"- Cell state without forget gate: C_t = C_{t-1} + i_t ⊙ g_t. C_t grows monotonically — the cell state is a running sum of everything. 
For a 1000-word review, the sentiment signal from the last 50 words is buried under noise from the first 950.\n- Forget gate enables selective memory: at word t, f_t ≈ 0 for irrelevant or contradictory content (forget old), f_t ≈ 1 for consistent content (maintain). This allows the LSTM to maintain relevant long-range context while discarding irrelevant information.\n- Jozefowicz et al. (2015) ablation: removing the forget gate (setting f_t = 1 always, i.e., never forget) hurts performance significantly. Setting f_t = 0 (always forget) eliminates long-term memory.","A":"This accurately describes the gates but doesn't give the sentiment-specific intuition for why forget matters. The claim \"prevents information accumulation\" is vague — the key is selective forgetting of contradicted or irrelevant information.","B":"","C":"The output gate is important but is specifically about \"what to expose\" from the cell state at each step, not about managing long-term memory. For long-document tasks, selective forgetting is the primary challenge.","D":"The forget gate is empirically the most critical gate in many ablation studies. Jozefowicz et al. (2015) found that removing the forget gate hurt performance on most of the tasks they tested."},"reference":"- Hochreiter & Schmidhuber, \"Long Short-Term Memory\" (1997): original LSTM paper\n- Jozefowicz et al., \"An Empirical Exploration of Recurrent Network Architectures\" (2015)"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12003","difficulty":"medium","orderIndex":3,"question":"A GRU replaces the three LSTM gates with two: a reset gate and an update gate. 
A team argues \"GRU is strictly better than LSTM for all tasks because it has fewer parameters.\" What is the accurate analysis?","options":{"A":"The team is correct — GRU is always better due to fewer parameters","B":"GRU has fewer parameters (2 gates vs 3 in LSTM, no separate cell state) making it faster and less prone to overfitting on small datasets. LSTM's separate cell state provides more expressive memory (can independently control what's stored vs what's outputted). Empirically, performance is dataset and task dependent: GRU often matches LSTM on language modeling with less data; LSTM tends to win on tasks requiring complex, structured long-term dependencies (e.g., music generation with long-term structure). The \"fewer parameters = better\" logic ignores that LSTM's extra capacity addresses specific memory management problems","C":"LSTM is always better; GRU was an unsuccessful experiment that never achieved practical adoption","D":"GRU and LSTM are mathematically identical; the name difference is vendor-specific"},"correct":"B","explanation":{"correct":"- GRU parameters for hidden_size=H, input_size=D: 3×(D+H)×H (two gates + candidate hidden: reset, update, new_h). LSTM parameters: 4×(D+H)×H (three gates + cell input: i, f, o, g). GRU ≈ 25% fewer parameters.\n- GRU update gate: z_t = σ(W_z [h_{t-1}, x_t]). h_t = (1-z_t)⊙h_{t-1} + z_t⊙h̃_t. This single gate does both \"forget\" and \"what to update\" — it can't independently control forgetting and updating. LSTM can forget old information while writing specific new information independently.\n- Chung et al. (2014) and Greff et al. (2017) empirical comparisons: on many NLP tasks, GRU ≈ LSTM with 25% fewer parameters. On tasks requiring more precise memory management, LSTM has an edge.","A":"\"Always better\" is empirically false. LSTM outperforms GRU on some tasks (music generation, certain machine translation benchmarks with long dependencies). 
The relationship is task-dependent.","B":"","C":"GRU is widely adopted. PyTorch and TensorFlow both support GRU as a core layer. GRU is used in production systems (speech processing, time series). It is not an \"unsuccessful experiment.\"","D":"GRU and LSTM have fundamentally different architectures. GRU has no separate cell state; LSTM does. They compute different mathematical functions. The equations are substantially different."},"reference":"- Chung et al., \"Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling\" (2014): https://arxiv.org/abs/1412.3555"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12004","difficulty":"medium","orderIndex":4,"question":"You implement Backpropagation Through Time (BPTT) for a sequence of length 500. Training is slow and uses 40GB of GPU memory. A colleague suggests \"truncated BPTT with segment_length=20.\" What is the trade-off and what does it sacrifice?","options":{"A":"Truncated BPTT reduces memory and speed but maintains identical gradient computation","B":"Truncated BPTT divides the sequence into non-overlapping segments of length 20, computing gradients only within each segment. Memory reduction: from O(T) to O(segment_length). Speed improvement: similar. Trade-off: gradients cannot propagate back beyond 20 steps. The model can only learn dependencies within 20-step windows. 
The hidden state carries information from before the segment boundary (the initial hidden state of each segment comes from the end of the previous segment), but the weights are not updated to improve this long-term state — the model learns to use short-term context efficiently but cannot optimize the 21-to-500 step dependencies","C":"Truncated BPTT only reduces speed, not memory; all activations must still be stored","D":"Truncated BPTT with segment_length=20 is equivalent to full BPTT for sequence_length=500 when the RNN converges"},"correct":"B","explanation":{"correct":"- Full BPTT memory: must store all T=500 hidden states and pre-activations for the backward pass. Memory = O(T × H) per sequence for a network with hidden size H (the O(H²) weights are stored once, independent of T).\n- Truncated BPTT: process T/20 = 25 segments. Each segment stores only 20 steps of activations. Memory: O(segment_length × H) = O(20 × H). 25× reduction in activation memory.\n- The sacrifice: weights are not updated to optimize cross-segment dependencies. The model learns to \"use\" hidden states from before the segment boundary but not to generate them optimally. Dependencies longer than 20 steps are underfit.\n- The hidden state IS passed between segments (avoiding complete information loss), but the gradient is detached at the boundary (`h = h.detach()` in PyTorch), preventing gradients from flowing through.","A":"Truncated BPTT computes different gradients from full BPTT — it explicitly zeroes out gradients beyond the truncation boundary. They are not identical.","B":"","C":"Memory reduction is the primary motivation for truncated BPTT. Only segment_length activations need to be stored at once, not the full T=500. This is a significant memory saving.","D":"No convergence property makes them equivalent. 
Even after thousands of training steps, a model trained with truncated BPTT receives no gradient signal for dependencies that span more than the truncation length, regardless of convergence."},"reference":"- Williams & Peng, \"An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories\" (1990): original TBPTT\n- Mikolov et al., \"Recurrent Neural Network Based Language Model\" (2010): uses TBPTT"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12005","difficulty":"medium","orderIndex":5,"question":"You train a bidirectional LSTM on named entity recognition (NER). The forward LSTM reads left-to-right, the backward LSTM reads right-to-left. A colleague questions: \"doesn't the backward LSTM see the 'future' relative to the current token?\" When is bidirectional processing valid and when is it invalid?","options":{"A":"Bidirectional LSTMs always see the future; they are invalid for all sequential tasks","B":"Bidirectional models are valid for offline tasks where the full sequence is available at inference: NER, POS tagging, text classification, machine translation encoding (the encoder sees the full source sentence). They are invalid for online/autoregressive tasks where future tokens are unavailable at generation time: language modeling (next-word prediction), streaming ASR, real-time translation. For NER, the word \"Apple\" in \"I work at Apple Inc.\" benefits from seeing \"Inc.\" ahead to classify \"Apple\" as an organization — the full sentence is available, so right-to-left context is legitimate","C":"Bidirectional models are invalid because they cause data leakage — future context is unavailable in production","D":"Bidirectional LSTMs are only valid for classification; for sequence labeling, only forward LSTMs work"},"correct":"B","explanation":{"correct":"- NER is an offline task: the full sentence is a training example. During both training and inference, the complete sequence is provided. 
Using the full sentence to label each token is not \"data leakage\" — it's using available context.\n- Language modeling is an autoregressive task: generating the next word must only use previous words. During training, the model sees the full sequence, but using future tokens to predict the current token would be cheating (the model would trivially predict the next token because it can see it).\n- BERT uses bidirectional attention (Transformer, not LSTM) for representation learning but cannot directly generate text. GPT uses unidirectional attention (causal masking) for generation.","A":"\"Invalid for all sequential tasks\" is too strong. The validity depends on whether the task requires online processing. Offline tasks (batch processing of complete sequences) fully allow bidirectional models.","B":"","C":"\"Data leakage\" in ML means using information during training that wouldn't be available at deployment time. If full sequences are available at both training and inference (as in NER), there's no leakage.","D":"Bidirectional LSTMs are used for sequence labeling (NER, POS tagging) — this is one of their primary applications. The claim that they only work for classification is incorrect."},"reference":"- Devlin et al., \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\" (2018)\n- Collobert et al., \"Natural Language Processing (Almost) from Scratch\" (2011): bidirectional models for NER"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12006","difficulty":"hard","orderIndex":6,"question":"You debug an LSTM language model and find that the forget gate activations (f_t) are saturated near 1.0 for 95% of timesteps, and the input gate (i_t) is near 0.0 for the same timesteps. The model achieves decent perplexity but generalizes poorly. 
What is this pattern indicating?","options":{"A":"The LSTM is working correctly — high forget gate means good long-term memory","B":"This pattern indicates the LSTM is in \"copy mode\" — nearly always passing through the previous cell state unchanged (f_t ≈ 1, no forgetting) while ignoring new input (i_t ≈ 0, no writing). The model has essentially learned to copy h_{t-1} → h_t for most timesteps, only occasionally updating based on the actual input. This is a degenerate solution: the model achieves decent perplexity by predicting the current word mostly based on accumulated context, but isn't learning to use specific input signals. Poor generalization occurs because the model relies on generic context accumulation rather than learning specific input-output patterns","C":"This is expected for long documents; LSTMs must maintain context across many steps","D":"The issue is the forget gate bias being too large; decrease it to 0 to fix generalization"},"correct":"B","explanation":{"correct":"- Normal LSTM behavior: forget gate should vary by context. At sentence boundaries: f_t ≈ 0 (reset cell state). After topic changes: partial forgetting. For consistent content: f_t ≈ 1 (maintain). 95% saturation near 1.0 is pathologically high.\n- Copy mode failure mode: C_t ≈ C_{t-1} + ε (tiny updates from near-zero input gate). The LSTM is essentially a leaky integrator, not a selective memory system.\n- Diagnosis approach: check if the model predicts words differently when given completely different inputs (different sequence prefixes). A copy-mode LSTM produces nearly identical predictions for many different inputs — it's not truly using the current input.","A":"High forget gate (f_t ≈ 1) alone might be fine for long-range memory. The problem is the simultaneous near-zero input gate — the model never writes new information. 
Together, these indicate a degenerate solution.","B":"","C":"Maintaining context is legitimate, but context maintenance should be selective (forget irrelevant, maintain relevant). 95% always-maintain is not selective — it's a failure to learn which information to retain.","D":"Decreasing forget gate bias to 0 would make forget gates start at 0.5 (sigmoid(0)), causing the model to forget ~50% of cell state at each step by default. This overcorrects and would likely destroy long-term memory. The issue needs diagnosing more carefully (learning rate, architecture, data)."},"reference":"- Karpathy et al., \"Visualizing and Understanding Recurrent Networks\" (2015): https://arxiv.org/abs/1506.02078"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12007","difficulty":"hard","orderIndex":7,"question":"You compare a 2-layer stacked LSTM (layer 1 output feeds into layer 2) to a single-layer LSTM with doubled hidden size. Both have approximately equal parameter counts. For a machine translation task, which performs better and why?","options":{"A":"Doubled hidden size always wins because wider models learn more diverse features","B":"The 2-layer stacked LSTM typically outperforms the wider single-layer LSTM for translation because depth enables compositional representations: layer 1 can learn syntactic patterns (sentence structure, phrase boundaries) while layer 2 can build semantic representations on top of these structural patterns. Depth creates a hierarchy of abstractions that a single wide layer cannot represent as efficiently. 
This mirrors the depth advantage in feedforward networks and is why nearly all state-of-the-art RNN-based models (pre-Transformer) used stacked RNNs (2-4 layers)","C":"Single-layer with doubled hidden size wins because stacking creates vanishing gradients","D":"Performance is identical; depth and width are equivalent for LSTMs"},"correct":"B","explanation":{"correct":"- Depth enables hierarchical processing: in Seq2Seq translation models (Sutskever et al. 2014, Wu et al. 2016 Google NMT), stacked LSTMs (4 layers) significantly outperformed single-layer LSTMs. The gain from adding the 2nd layer was larger than from adding more width.\n- Two-layer stacking: h1_t = LSTM1(x_t, h1_{t-1}); h2_t = LSTM2(h1_t, h2_{t-1}). Layer 2 processes sequences of layer 1 outputs — a \"sequence of features\" rather than \"sequence of words.\"\n- Width saturation: increasing hidden size gives diminishing returns; going from H=512 to H=1024 helps less than adding a second 512-unit LSTM layer. Additional neurons in a wide single-layer model become redundant (correlated) as width increases.","A":"Wider single-layer models encounter the redundancy problem — extra neurons learn similar functions. Depth creates qualitatively different processing levels, not just more parallel features.","B":"","C":"Stacking does increase gradient path length, but LSTM's cell state provides stable gradient flow across time. Stacking 2-4 LSTM layers doesn't cause significant vanishing gradients for typical sequence lengths (up to a few hundred tokens).","D":"Depth and width are not equivalent — this is a fundamental finding across all deep learning. 
For sequential tasks with hierarchical structure (language, time series), depth creates qualitatively better representations."},"reference":"- Sutskever et al., \"Sequence to Sequence Learning with Neural Networks\" (2014): 4-layer stacked LSTM\n- Wu et al., \"Google's Neural Machine Translation System\" (2016): 8-layer stacked LSTM"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12008","difficulty":"medium","orderIndex":8,"question":"An LSTM is used for stock price prediction. The training set covers 2010-2020; the test set is 2020-2023. The model achieves very low training loss but terrible test loss. A naive colleague proposes \"add more LSTM layers to increase capacity.\" What is the likely root cause and why would more capacity not help?","options":{"A":"More capacity would definitely help; the model is simply underfitting","B":"The model is overfitting to the 2010-2020 distribution. The 2020-2023 period includes COVID-19 market disruptions, remote work shifts, and new market dynamics that don't appear in training. More LSTM capacity would increase memorization of the 2010-2020 specific patterns (overfitting worse), not generalization. The actual fixes: (1) regularization (Dropout, L2); (2) expanding training data to include diverse market regimes; (3) using a simpler model with better inductive biases for non-stationary time series","C":"The issue is that LSTMs cannot handle financial data; use a CNN instead","D":"The issue is sequence length mismatch between training and test sets"},"correct":"B","explanation":{"correct":"- Distribution shift: financial markets exhibit non-stationarity — statistical properties change over time (market regimes, volatility regimes). A model trained on 2010-2020 sees bull market cycles, tech dominance, quantitative easing. 
2020-2023 brings unprecedented events.\n- More capacity makes it worse: a higher-capacity model will fit the 2010-2020 data more precisely (lower training loss) but become more brittle to distribution shift. The model has more parameters to encode the specific patterns of the training period, and fewer \"slack\" parameters for out-of-distribution generalization.\n- This is a classic distribution shift / concept drift problem, not an underfitting problem. The diagnostic: training loss is already very low (not underfitting). Test loss is high (generalization failure). More capacity reduces training loss further but increases test loss.","A":"Low training loss and high test loss is the definition of overfitting/distribution shift, not underfitting. The solution to overfitting is regularization, not more capacity.","B":"","C":"CNNs can process time series (1D CNNs), but the problem is distribution shift, not architecture. Switching to a CNN wouldn't fix the generalization across market regimes. The architecture is not the fundamental issue.","D":"Sequence length mismatch would cause technical errors (shape mismatches) or be trivially fixed by truncation/padding. The described problem (training works, test fails) is characteristic of distribution shift or overfitting, not sequence length issues."},"reference":"- Hurst, \"Overfitting and Distribution Shift in Time Series Forecasting\" (general concept)\n- https://arxiv.org/abs/2004.12667 (temporal covariate shift in time series)"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12009","difficulty":"hard","orderIndex":9,"question":"You implement a character-level LSTM language model. At inference, you generate text by sampling from the softmax output. You notice that with temperature=1.0, the text is grammatical but repetitive. With temperature=0.01 (near-greedy), output is highly repetitive (loops). With temperature=2.0, output is incoherent. 
Explain the relationship between temperature and the LSTM's hidden state, and what causes the repetitive loop at near-zero temperature.","options":{"A":"Temperature only affects the output tokens; the LSTM hidden state is temperature-independent","B":"Temperature scales logits before softmax: p_i = softmax(logits/T). High T → uniform distribution (exploration). Low T → peaked distribution (exploitation). The repetition loop at low temperature occurs because: the LSTM's hidden state h_t is conditioned on the previous token x_t. When the model repeatedly samples the same token (greedy/near-greedy often produces a token that the model has been trained to follow with the same token), the hidden state converges to a fixed point — a state where the most probable next token feeds back to produce the same state. The LSTM is trapped in a hidden state cycle","C":"Repetition is caused by the forget gate becoming saturated at low temperature","D":"Temperature changes are applied to the hidden state, not the output distribution"},"correct":"B","explanation":{"correct":"- Fixed point analysis: at temperature→0, the model always selects argmax(logits). If the sequence \"the the the\" has high probability under the model (because \"the\" appears frequently and the LSTM produces high probability for \"the\" after \"the\"), the model is trapped in this cycle.\n- Hidden state convergence: h_t = LSTM(h_{t-1}, x_t). If x_t is always \"the\", then h_t → h* (a fixed vector) because the recurrence with constant input converges. At h*, the model always predicts \"the\", reinforcing the loop.\n- Temperature 2.0 (incoherence): uniform distribution → samples rare or semantically inappropriate words → the LSTM hidden state transitions to an atypical state → subsequent predictions are also atypical.","A":"Temperature is applied to the output logits, but the sampled token IS fed back into the LSTM as input. 
Therefore, temperature indirectly affects the hidden state trajectory by determining which token is sampled and fed back.","B":"","C":"Forget gate saturation at low temperature is not the mechanism. The forget gate is controlled by the current input and previous hidden state, not by the sampling temperature. The loop is a fixed-point attractor in the hidden-state-token space.","D":"Temperature is applied to the logits (pre-softmax activations) of the output projection layer. It does not modify the hidden state directly."},"reference":"- Karpathy, \"The Unreasonable Effectiveness of Recurrent Neural Networks\" (2015): http://karpathy.github.io/2015/05/21/rnn-effectiveness/"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12010","difficulty":"medium","orderIndex":10,"question":"Modern NLP uses Transformers almost exclusively. A senior engineer claims \"there are still tasks where RNNs beat Transformers.\" What are these tasks, and what properties of RNNs provide the advantage?","options":{"A":"RNNs never beat Transformers; the senior engineer is wrong","B":"RNNs retain advantages in: (1) Online/streaming inference — RNNs process one token at a time with O(1) memory; Transformers require O(T) KV cache that grows with sequence length. For real-time processing with unbounded sequences, RNNs are superior. (2) Very long sequences — Transformer attention is O(T²) compute; at T=100,000 tokens, quadratic scaling is prohibitive. LSTMs process such sequences in O(T). 
(3) Hardware-constrained edge deployment — RNN hidden state is a fixed-size vector; complete model inference requires only the state vector and current input, not the entire history","C":"RNNs beat Transformers only on synthetic tasks designed to favor sequential processing","D":"RNNs beat Transformers when using ReLU activations instead of tanh"},"correct":"B","explanation":{"correct":"- Streaming inference: an LSTM with H=512 processes token t with only h_{t-1} (512 floats) + x_t → h_t. Memory is constant regardless of how many tokens have been processed. A Transformer needs to store all previous token key-value pairs in the KV cache — O(2 × num_layers × num_heads × head_dim × T) memory, growing linearly with sequence length T.\n- O(T²) vs O(T) computation: for documents with T=100K tokens, O(T²) = 10¹⁰ operations per attention layer. LSTMs: O(T) × O(H²) operations — linear in T.\n- Note: RWKV, Mamba, and SSMs (State Space Models) are recent architectures that combine Transformer-level performance with RNN-style O(T) inference — they're replacing RNNs for these use cases.","A":"The senior engineer is correct in specific scenarios. Streaming and long-sequence tasks are legitimate cases where RNNs (and their successors like Mamba) have practical advantages. Saying \"never\" is incorrect.","B":"","C":"The advantages are practical, not synthetic. Production streaming ASR (speech recognition) systems and edge devices with memory constraints are real-world use cases.","D":"Activation function choice doesn't determine when RNNs beat Transformers. 
The advantages are architectural (O(1) memory, O(T) compute), not activation-dependent."},"reference":"- Gu & Dao, \"Mamba: Linear-Time Sequence Modeling with Selective State Spaces\" (2023): https://arxiv.org/abs/2312.00752\n- Peng et al., \"RWKV: Reinventing RNNs for the Transformer Era\" (2023): https://arxiv.org/abs/2305.13048"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12011","difficulty":"hard","orderIndex":11,"question":"You implement an encoder-decoder LSTM for sequence-to-sequence machine translation. The encoder reads the source sentence and compresses it into a single fixed-size vector (the final hidden state). This is then used as the decoder's initial hidden state. For short sentences (5-10 tokens), performance is good. For long sentences (50+ tokens), BLEU scores drop significantly. What is the fundamental architectural limitation causing this?","options":{"A":"The encoder uses too few LSTM layers; add more layers to increase capacity","B":"The fixed-size bottleneck problem: all information from the source sentence must be compressed into a single hidden state vector of size H (e.g., 512 or 1024). For short sentences, a 512-d vector can capture the full meaning. For 50+ token sentences, the compression ratio is too high — the encoder must discard information to fit everything in 512 dimensions. This is the \"information bottleneck\" and motivated the invention of attention mechanisms (Bahdanau et al., 2015): instead of compressing to a single vector, attention allows the decoder to selectively access any encoder hidden state at each decoder step","C":"Long sentences cause BPTT to use too many steps, causing gradient explosion in the encoder","D":"The performance drop is due to the decoder, not the encoder — longer targets require more decoder steps"},"correct":"B","explanation":{"correct":"- The bottleneck: regardless of sentence length, the encoder must summarize everything into h_T of fixed size H. 
For \"The cat sat on the mat.\" (6 tokens): easy compression. For a 50-token sentence with complex structure, dependencies, and multiple clauses: the 512-d vector must encode all of this simultaneously.\n- Empirical evidence: Cho et al. (2014) showed performance degrades sharply for sentences > 30 tokens in seq2seq models. Bahdanau et al. (2015) proposed attention, allowing the decoder to create a different context vector c_t = Σ α_{ti} h_i for each decoder step — directly addressing the bottleneck.\n- The longer the source sentence, the more information is discarded in the fixed-size encoding, leading to poor BLEU scores for long sentences.","A":"More encoder LSTM layers increase the capacity to process each step, but the bottleneck is the fixed-size final hidden state, not the layers' processing capacity. Adding layers doesn't change the H-dimensional bottleneck.","B":"","C":"BPTT through 50 encoder steps doesn't typically cause gradient explosion in LSTMs (LSTMs have cell state highway for gradients). If gradient clipping is applied, this is even less of an issue. The bottleneck is information capacity, not gradient flow.","D":"The decoder performance drop is a consequence of poor encoder representation. If the encoder discards information, the decoder has nothing to work with. The root cause is the encoder-side compression bottleneck."},"reference":"- Cho et al., \"Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation\" (2014): https://arxiv.org/abs/1406.1078\n- Bahdanau et al., \"Neural Machine Translation by Jointly Learning to Align and Translate\" (attention mechanism, 2015): https://arxiv.org/abs/1409.0473"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12012","difficulty":"hard","orderIndex":12,"question":"You profile an LSTM-based model on a GPU. Despite the LSTM having H=1024 and batch_size=256, GPU utilization is only 15%. 
A profiler shows the bottleneck is sequential matrix multiplications (h_t depends on h_{t-1}, so each step must wait for the previous). What specific GPU efficiency problem does this represent and what architectural modifications address it?","options":{"A":"The problem is batch_size being too small; increase to 4096 to improve GPU utilization","B":"The fundamental problem is sequential data dependency: h_t = LSTM(h_{t-1}, x_t) means step t cannot start until step t-1 completes. GPUs achieve high utilization through massive parallelism. With T=100 timesteps, only 1 of the T potential parallel computations is active at each step. Across the batch (256 samples), the same input token position is processed in parallel — but this is a 256×(small matrix) operation, underutilizing the GPU. Fixes: (1) process batch efficiently across the 256 dimension; (2) use quasi-recurrent neural networks (QRNNs) that parallelize most computation across time; (3) use convolutions (parallelizable) for the input transformation and only use recurrence for the gating; (4) switch to Transformers which are fully parallelizable across time","C":"The problem is that 1024 hidden size is too small for GPU optimization; increase to 8192","D":"The problem is missing cuDNN LSTM optimizations; just add torch.backends.cudnn.enabled = True"},"correct":"B","explanation":{"correct":"- Sequential dependency: in an RNN/LSTM, T timesteps must be processed sequentially. Each step's recurrent work is one modest GEMM (the 256 × 1024 batch of hidden states against the gate weights) plus elementwise gate operations, and step t cannot launch until step t-1 finishes, so the GPU runs T small dependent kernels instead of one large one and spends much of its time idle or on launch overhead.\n- Compare with Transformers: all T timesteps' attention can be computed in parallel using batched matrix multiplication Q×K^T. 
A single layer processes all T positions simultaneously, achieving high GPU utilization.\n- QRNN (Bradbury et al., 2016): parallelizes the convolution over time (input transformation is a temporal convolution, GPU-parallel) while keeping the minimal recurrence in the pooling step (small, fast). Achieves ~16× speedup over LSTM on GPUs.","A":"Batch size increase would help GPU utilization marginally (more samples processed in parallel per step). But the fundamental bottleneck is temporal sequential dependency (T sequential steps), not batch parallelism. Even batch_size=4096 still processes T=100 steps sequentially.","B":"","C":"Hidden size 1024 creates (1024×1024) = 1M parameter matrices per gate. These are large enough for efficient GEMM operations. The issue is temporal dependency, not matrix size.","D":"cuDNN LSTM optimizations (CuDNN's custom LSTM kernel) can provide 2-3× speedup by fusing operations, but they don't overcome the fundamental sequential dependency bottleneck. 15% → 45% is possible, but 15% → 80%+ requires removing the sequential dependency."},"reference":"- Bradbury et al., \"Quasi-Recurrent Neural Networks\" (2016): https://arxiv.org/abs/1611.01576\n- PyTorch cuDNN LSTM optimization: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12013","difficulty":"medium","orderIndex":13,"question":"You train an LSTM for time series anomaly detection. The normal patterns show gradual trends, but anomalies are sudden spikes. After training, the model achieves 60% anomaly detection rate. A colleague suggests \"add teacher forcing during training.\" Would this help, and what is teacher forcing?","options":{"A":"Teacher forcing would definitely help; use it for all sequence prediction tasks","B":"Teacher forcing: during training, at each step t, instead of feeding the model's own prediction ŷ_{t-1} back as input, feed the ground truth y_{t-1}. 
This allows faster training convergence (model sees correct context, avoids compounding errors). However, for anomaly detection specifically, teacher forcing causes a training-inference mismatch: at inference, the model must use its own predictions (or actual observations) as input. If the model was trained with perfect previous values, it may not generalize well to inference conditions where its previous predictions may be slightly off","C":"Teacher forcing is only used for language models; it doesn't apply to time series","D":"Teacher forcing should be avoided entirely; it causes catastrophic forgetting in LSTMs"},"correct":"B","explanation":{"correct":"- Teacher forcing mechanism: at training step t, standard training feeds ŷ_{t-1} = model_output_{t-1} → model compounding prediction errors if any early prediction is wrong. Teacher forcing feeds y_{t-1} = ground truth → model always sees correct previous values during training.\n- Benefits: faster convergence, more stable gradients (no error accumulation). Used extensively in seq2seq models.\n- Problem (exposure bias): the model is never exposed to its own prediction errors during training. At inference, small errors compound: ŷ_t is slightly off → ŷ_{t+1} is more off → ŷ_{t+2} is even more off. For anomaly detection, the model must handle both normal and anomalous inputs — training only on ground truth means the model never learns to recover from prediction errors.","A":"\"Always use teacher forcing\" ignores the training-inference gap problem. For tasks with long autoregressive generation, teacher forcing can hurt inference performance. Scheduled sampling (gradually replacing teacher forcing with model predictions) is often a better approach.","B":"","C":"Teacher forcing is used in any task where the model uses its own previous predictions as input: time series prediction, seq2seq models, language models, and anomaly detection with recurrent models. 
It's not limited to language models.","D":"Teacher forcing doesn't cause catastrophic forgetting (which is a continual learning problem where new training overwrites old knowledge). They are completely different concepts."},"reference":"- Williams & Zipser, \"A Learning Algorithm for Continually Running Fully Recurrent Neural Networks\" (1989): original teacher forcing\n- Bengio et al., \"Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks\" (2015): https://arxiv.org/abs/1506.03099"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12014","difficulty":"easy","orderIndex":14,"question":"You add `dropout=0.3` to a PyTorch `nn.LSTM` layer with `num_layers=3`. A junior engineer says \"dropout is applied after every hidden-to-hidden transition.\" Is this correct?","options":{"A":"Yes — dropout is applied after every h_{t-1} → h_t transition","B":"No — PyTorch's nn.LSTM dropout is applied only between LSTM layers (inter-layer dropout), not within a single layer's hidden-to-hidden transitions. For a 3-layer LSTM: dropout is applied between layer 1→2 and layer 2→3 outputs. The temporal recurrence (h_{t-1} → h_t) within each layer does NOT have dropout applied. This is a known limitation: Variational Dropout (Gal & Ghahramani, 2016) applies the same dropout mask across all timesteps for both input and recurrent connections, but PyTorch's standard LSTM doesn't implement this","C":"Dropout is applied only to the final LSTM layer's output","D":"Dropout in nn.LSTM is applied to every weight matrix independently"},"correct":"B","explanation":{"correct":"- PyTorch nn.LSTM with dropout=p and num_layers=k: PyTorch documentation explicitly states \"If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer.\"\n- This means: for 3 layers, dropout is applied to the output of layer 1 (before feeding to layer 2) and the output of layer 2 (before feeding to layer 3). 
The output of layer 3 (the final layer) has no dropout.\n- The recurrent transition within each layer (h_{t-1} → h_t) has no dropout in the standard PyTorch implementation.\n- Variational RNN Dropout: uses the same dropout mask at every timestep (Gal & Ghahramani, 2016). Standard PyTorch uses independent random masks at each step (when applied), and only between layers.","A":"Dropout is not applied at every hidden-to-hidden transition. This is the Variational Dropout approach, not PyTorch's default. Standard nn.LSTM only applies dropout between layers.","B":"","C":"The final layer's output has NO dropout (per PyTorch documentation: \"except the last layer\"). Dropout is applied between intermediate layers.","D":"PyTorch's nn.LSTM dropout doesn't apply to weight matrices directly (that would be weight dropout / DropConnect). It applies to the activation outputs between layers."},"reference":"- PyTorch nn.LSTM documentation: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html\n- Gal & Ghahramani, \"A Theoretically Grounded Application of Dropout in Recurrent Neural Networks\" (2016): https://arxiv.org/abs/1512.05287"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12015","difficulty":"hard","orderIndex":15,"question":"An LSTM-based seq2seq model with attention is trained for machine translation. At test time, you use beam search with beam_size=5. A researcher claims \"increasing beam_size always improves BLEU.\" You increase to beam_size=50 and observe BLEU decreases. What is happening?","options":{"A":"Larger beam sizes always improve BLEU; the decrease is a software bug","B":"The \"beam search curse\" (or beam search optimization inconsistency): with larger beams, beam search finds translations with higher model log-probability but lower BLEU scores. The model is imperfect — it assigns high probability to sequences that are structurally fluent but semantically incorrect or contain \"safe\" but generic phrases. 
Larger beams explore more of the model's probability space, finding sequences that are very \"safe\" (high probability under the model) but not actually good translations. The model's log-prob is a proxy for quality, and this proxy breaks down for extreme beams","C":"Larger beams cause GPU memory overflow that corrupts output tokens","D":"BLEU decreases because beam search with large beams produces longer sequences that are penalized by BLEU's brevity penalty"},"correct":"B","explanation":{"correct":"- Beam search optimization: maximize Σ log p(y_t | y_{> 1 (large variance), the maximum element dominates and softmax ≈ one-hot. Gradient of softmax: p_i(1-p_i) → 0 as p_i → 1. Near-zero gradients → training stalls.\n- Vaswani et al. include this exact analysis in \"Attention Is All You Need\" (2017), Section 3.2.1.","A":"\"Overflow\" in softmax is a separate issue (handled by softmax(x - max(x)) numerics). The √d_k scaling is about gradient flow, not overflow. Without scaling, values are large but finite for reasonable d_k.","B":"","C":"Dividing by √d_k (a scalar) has negligible computational cost — it's a single multiplication per element. It doesn't meaningfully reduce softmax computation.","D":"Softmax always sums to 1 regardless of input magnitude. The sum-to-1 property is a mathematical property of softmax, not a consequence of scaling."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): https://arxiv.org/abs/1706.03762 (Section 3.2.1)"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13002","difficulty":"easy","orderIndex":2,"question":"Multi-head attention uses h separate attention heads, each with reduced dimension d_k = d_model/h. After computing h attention outputs, they are concatenated and projected. A team increases h from 8 to 32 (keeping d_model=512 constant). 
What changes in computation?","options":{"A":"Increasing h increases total FLOPs proportionally (4× more attention operations)","B":"Total FLOPs remain approximately constant. Each head has d_k = 512/h. The QKV projections: W_Q ∈ ℝ^{d_model × d_k}. With more heads, each head's projection is smaller: total Q projection FLOPs = d_model × d_k × h = d_model × d_model (constant). The attention computation across all heads: O(T² × d_k) × h = O(T² × d_model) (constant). The output changes: with more heads, each head attends to lower-dimensional subspaces — each head specializes in a narrower feature space","C":"Increasing h increases memory usage by 4× because more attention matrices are stored","D":"Increasing h from 8 to 32 increases d_k from 64 to 256"},"correct":"B","explanation":{"correct":"- Multi-head attention total computation:\n- QKV projections: 3 × T × d_model × d_k × h = 3 × T × d_model² (constant in h)\n- Attention: T² × d_k × h = T² × d_model (constant in h)\n- Output projection: T × d_model × d_model (constant in h)\n- Changing h only redistributes the existing capacity across more subspaces; it does not add total capacity.\n- Trade-off: more heads → lower-dimensional per-head representations → each head is more constrained → may miss complex patterns within a subspace, but can specialize into different relationship types. Too many heads (small d_k) → each head too narrow to capture useful features.","A":"FLOPs don't scale with h because d_k decreases proportionally. 32 heads × d_k=16 = same total dimension as 8 heads × d_k=64.","B":"","C":"Each head stores its own T×T score matrix, so the attention maps hold h × T² elements in total, and that part does grow with h. The Q, K, V, and output activations total T × d_model regardless of h, and the attention matmul FLOPs stay at T² × d_model, so the claim is at best partial: only the score maps scale 4×, not overall memory or compute.","D":"d_k = d_model/h. More heads → smaller d_k. h=32, d_model=512 → d_k = 16 (not 256). 
This is the opposite direction."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3.2.2"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13003","difficulty":"medium","orderIndex":3,"question":"Transformers use positional encoding (PE) added to token embeddings. The original Transformer uses sinusoidal PE: PE(pos, 2i) = sin(pos/10000^(2i/d_model)), PE(pos, 2i+1) = cos(...). A team replaces this with learned positional embeddings (a lookup table). For a model trained on sequences up to length 512, they test on sequences of length 1024. What happens?","options":{"A":"Learned PE generalizes perfectly to length 1024 because the model learns position-invariant patterns","B":"Learned PE typically fails to generalize beyond training length. Position embeddings for positions 513-1024 were never seen during training — the embedding table has no entries for these positions (or wraps around/truncates). Even if technically extended, the model has no learned signal for these positions. Sinusoidal PE can extrapolate because the mathematical function is defined for any position. However, in practice, even sinusoidal PE performance degrades for very out-of-distribution positions due to attention patterns being calibrated for shorter sequences","C":"Both sinusoidal and learned PE fail completely for sequences longer than training length","D":"Learned PE is strictly better — it can represent any positional pattern including those beyond 512"},"correct":"B","explanation":{"correct":"- Learned PE lookup table: embedding_table ∈ ℝ^{max_seq_len × d_model}. For position 513: `embedding_table[513]` simply doesn't exist (IndexError or truncation). Even if you extend with zeros or random values, the model has no trained understanding of these positions.\n- Sinusoidal extrapolation: PE(pos, i) = sin(pos/10000^(2i/d_model)) is mathematically defined for any integer pos. 
Position 1024 produces a valid vector. However, the attention mechanism's effective range is still calibrated for < 512 positions.\n- RoPE (Rotary Position Embedding) and ALiBi are modern solutions specifically designed for length extrapolation, both used in production LLMs (LLaMA, Falcon, etc.).","A":"\"Position-invariant patterns\" would mean the model doesn't use positional information at all. Learned PE is specifically designed to encode position — it's not position-invariant, and it doesn't generalize to unseen positions.","B":"","C":"Sinusoidal PE does extrapolate mathematically (the formula is valid for any position). The claim that it \"fails completely\" is too strong. It may degrade, but it produces valid vectors for any position.","D":"Learned PE is strictly bounded by the training length. Beyond that, it has no learned representation. It cannot represent patterns for positions it never encountered."},"reference":"- Su et al., \"RoFormer: Enhanced Transformer with Rotary Position Embedding\" (RoPE, 2021): https://arxiv.org/abs/2104.09864\n- Press et al., \"Train Short, Test Long: ALiBi\" (2021): https://arxiv.org/abs/2108.12409"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13004","difficulty":"medium","orderIndex":4,"question":"The Transformer feed-forward network (FFN) consists of two linear layers: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2, with inner dimension 4×d_model. For d_model=512, this is: 512 → 2048 → 512. The FFN has 4× more parameters than the attention sublayer (d_model×4×d_model vs d_model×3×d_model for QKV). What is the role of the FFN, and why is the 4× expansion useful?","options":{"A":"The FFN is redundant; Transformers would work equally well without it","B":"The FFN applies position-wise transformations — the same function independently to each position. 
While attention performs \"routing\" (mixing information across positions, deciding which positions are relevant to each other), the FFN performs \"computation\" (applying a non-linear transformation to the token's representation at that position). The 4× expansion (dimension bottleneck) allows the model to represent complex functions in the high-dimensional intermediate space: the first layer expands to 2048 dimensions (more features to compute from), the second layer selects and compresses. This is analogous to how wider hidden layers in MLPs can represent more complex functions","C":"The FFN acts as a key-value memory, storing factual knowledge about the world","D":"The 4× expansion is a legacy design choice that modern Transformers have eliminated"},"correct":"B","explanation":{"correct":"- Attention vs FFN role: attention computes token interactions (which positions attend to which). FFN applies a nonlinear transformation per token independently.\n- The \"write-then-read\" intuition: attention gathers relevant context into a representation; FFN then processes this representation. The 4× expansion gives the FFN more \"working memory\" — 2048 intermediate dimensions for computing complex functions of the 512-d input.\n- Research on FFN as memory (Geva et al., 2020): actually supports the memory interpretation — FFN layers seem to store factual associations. But the primary designed role is position-wise nonlinear computation.","A":"Removing FFN layers from Transformers significantly reduces performance. Ablation studies show that FFN layers contribute substantially to Transformer quality. The architecture without FFN is much weaker.","B":"","C":"The \"key-value memory\" interpretation (Geva et al., 2020) is an emerging research finding, not the designed purpose. The primary role is position-wise nonlinear computation. 
Presenting the memory interpretation as \"the role\" oversimplifies.","D":"An expanded FFN inner dimension is used in virtually all modern Transformers, including LLaMA-style models. Some architectures use different expansion factors (e.g., 8/3× for SwiGLU), but the expansion concept is universal."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3.3\n- Geva et al., \"Transformer Feed-Forward Layers Are Key-Value Memories\" (2021): https://arxiv.org/abs/2012.14913"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13005","difficulty":"medium","orderIndex":5,"question":"You notice that in a trained Transformer, the attention patterns in layer 1 are very different from layer 12 (in a 12-layer model). Layer 1 shows local attention (mostly attending to nearby tokens). Layer 12 shows sparse, global attention (a few tokens attending to many distant ones). Why do attention patterns evolve across layers?","options":{"A":"Layer 1 has fewer parameters, forcing it to use local attention","B":"Layer 1 processes raw token embeddings + positional encodings. The embeddings encode surface-level information (word identity, local syntax). Attending locally at layer 1 is optimal for capturing local syntactic structure. By layer 12, representations have been refined through many layers of attention+FFN — they encode abstract semantic information. Global, sparse attention in later layers reflects high-level semantic associations across the full sequence (e.g., a pronoun attending to its antecedent many tokens away, a verb attending to its distant subject). 
Each layer's attention patterns emerge from what information is useful at that representation level","C":"The attention patterns are random; variation across layers is not meaningful","D":"Layer 12 uses larger attention weights by design; PyTorch initializes later layers with higher weights"},"correct":"B","explanation":{"correct":"- Visualization studies (Clark et al., 2019 \"What Does BERT Look At?\"): different attention heads in different layers capture different linguistic phenomena. Early layers: local attention, syntactic relations (adjacent token dependencies). Late layers: long-range semantic dependencies, coreference.\n- Information accumulation: after 12 layers of attention, each token's representation encodes context from the entire sequence. The later layers have access to highly processed representations that encode global semantic structure, enabling long-range attention to be informative.\n- This hierarchical processing — local syntax → global semantics — mirrors what happens in the brain's language processing and in CNN layer hierarchies (local features → global patterns).","A":"All Transformer layers have the same number of parameters (same d_model, same number of heads). There's no \"fewer parameters in earlier layers\" — they're architecturally identical.","B":"","C":"Attention patterns are highly structured and reproducible across runs and models. Multiple papers show consistent patterns (local attention in early layers, global in later layers) across different Transformer models trained on different tasks.","D":"PyTorch initializes all Transformer layers identically (same initialization scheme for all layers). The patterns emerge during training, not from initialization."},"reference":"- Clark et al., \"What Does BERT Look At? 
An Analysis of BERT's Attention\" (2019): https://arxiv.org/abs/1906.04341\n- Tenney et al., \"BERT Rediscovers the Classical NLP Pipeline\" (2019): https://arxiv.org/abs/1905.05950"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13006","difficulty":"hard","orderIndex":6,"question":"Self-attention has O(T²) compute and memory complexity for sequence length T. For T=16,384 (16K tokens), this becomes prohibitive. Name three distinct approaches to reduce this complexity with different complexity-expressiveness trade-offs.","options":{"A":"The only solution is to reduce d_model; attention complexity cannot be reduced","B":"(1) Sparse attention (Longformer, BigBird): compute attention only for local windows + select global tokens. O(T×w) where w is window size. Trades global attention for local efficiency. (2) Linear attention (Performer, Linformer): approximate softmax attention with kernel methods: Φ(Q)Φ(K)^T ≈ exp(QK^T), so that by associativity Φ(K)^T V can be computed first. O(T×d). Some expressiveness loss due to approximation. (3) Flash Attention: same O(T²) complexity but minimizes HBM memory reads/writes using tiling. Full attention (exact), memory efficient, but still O(T²) compute","C":"The O(T²) complexity is a fundamental theorem; it cannot be reduced without losing all expressive power","D":"The solution is to chunk the sequence into non-overlapping segments and apply attention within each chunk (no cross-chunk attention)"},"correct":"B","explanation":{"correct":"- Sparse attention (Longformer): each token attends to w local neighbors + k global tokens. Total attention computations: O(T×(w+k)) instead of T². Trade-off: misses some cross-document attention patterns.\n- Linear attention (Performer): using random feature approximation of softmax kernel: exp(q·k) ≈ φ(q)^T φ(k). Rewrite attention as φ(Q)(φ(K)^T V) (O(T×d²)) instead of (QK^T)V (O(T²×d)). 
Trade-off: approximation error in attention distribution.\n- FlashAttention (Dao et al., 2022): exact attention with optimized memory access pattern. Uses tiling to compute attention block by block, never materializing the full T×T attention matrix. No expressiveness loss, but still O(T²) FLOPs — the win is memory and wall-clock time.","A":"Many published approaches reduce attention complexity below O(T²). Listing \"the only solution is reducing d_model\" ignores 5+ years of efficiency research.","B":"","C":"Linear attention (Performer, Linformer) demonstrates O(T×d) complexity with practical applications. The theorem claim is false.","D":"Non-overlapping chunks (basic segmentation) is a crude solution that loses all cross-chunk dependencies. This is worse than sparse attention approaches that at least have overlapping windows or global tokens."},"reference":"- Tay et al., \"Efficient Transformers: A Survey\" (2020): https://arxiv.org/abs/2009.06732\n- Dao et al., \"FlashAttention\" (2022): https://arxiv.org/abs/2205.14135"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13007","difficulty":"hard","orderIndex":7,"question":"A team computes attention: Q ∈ ℝ^{T×d_k}, K ∈ ℝ^{S×d_k}, V ∈ ℝ^{S×d_v}. In self-attention, T=S. In cross-attention (decoder attending to encoder), T≠S. For a translation model with source length S=20 and target length T=15, describe the shapes of the attention matrix and what each entry (i,j) represents.","options":{"A":"The attention matrix is always square (T×T) regardless of S and T","B":"The attention matrix QK^T ∈ ℝ^{T×S} = ℝ^{15×20}. Entry (i,j) represents the attention score between target position i and source position j — i.e., how much target word i should attend to (be influenced by) source word j when generating its representation. After softmax: each row sums to 1, representing a probability distribution over source positions for each target position. 
V then aggregates: A×V ∈ ℝ^{15×d_v} produces a context vector for each target position as a weighted sum of source value vectors","C":"The attention matrix shape is (S×T) = (20×15) because source attends to target","D":"Cross-attention requires T=S; the team must pad the source to length 15"},"correct":"B","explanation":{"correct":"- Cross-attention mechanics: Q from decoder (shape: T×d_k), K and V from encoder (shape: S×d_k and S×d_v). QK^T: (T×d_k) × (d_k×S) = T×S.\n- Entry (i,j) of the raw attention matrix: the similarity between query vector q_i (representation of target position i) and key vector k_j (representation of source position j). Softmax over j: how much target token i attends to source token j.\n- This is the mechanism Bahdanau et al. (2015) introduced: the decoder decides where to look in the source sentence for each target word — explicitly learned alignment.","A":"The attention matrix is T×S (not necessarily square). Self-attention has T=S, but cross-attention generally doesn't. The Q×K^T operation requires Q to have columns = K rows (d_k), not the same number of rows.","B":"","C":"Q comes from the decoder (T=target length), K comes from encoder (S=source length). Target attends to source, so attention is T×S (rows=target queries, cols=source keys), not S×T.","D":"Cross-attention explicitly allows different sequence lengths. This is its primary advantage over the original fixed-size encoder vector. Padding to equal lengths is unnecessary and wastes computation."},"reference":"- Bahdanau et al., \"Neural Machine Translation by Jointly Learning to Align and Translate\" (2015): https://arxiv.org/abs/1409.0473"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13008","difficulty":"medium","orderIndex":8,"question":"Layer normalization in the original Transformer is applied as Post-LN: output = LN(x + Sublayer(x)). Modern LLMs use Pre-LN: output = x + Sublayer(LN(x)). 
You are designing a new Transformer for a task requiring training with a very high learning rate (1e-3). Which normalization placement should you use and why?","options":{"A":"Post-LN with a very high learning rate is safe; LN handles any learning rate","B":"Pre-LN is required for stable training with high learning rates. In Pre-LN, the gradient of x flows back through the residual path: ∂L/∂x includes a direct term (from the skip connection) that is not scaled by LN. This direct path ensures gradient magnitude doesn't collapse regardless of LN's behavior. With Post-LN at high LR, the optimization is very sensitive to initialization — high LR with Post-LN often causes training divergence. Pre-LN allows training without warmup at high LR, which is critical for rapid training","C":"Post-LN is required; Pre-LN with high learning rate causes exploding activations","D":"The choice doesn't matter; use whichever is implemented in the framework"},"correct":"B","explanation":{"correct":"- Post-LN gradient: ∂L/∂x_in flows through LN(x_in + F(x_in)). The LN normalization gates the gradient magnitude based on the total activation variance. Early in training with high LR and Post-LN, the combined (signal + residual) has high variance, causing LN to scale gradients in unpredictable ways → divergence.\n- Pre-LN gradient: ∂L/∂x_in = ∂L/∂x_out (direct term from the skip) + the gradient through the sublayer path. The \"1\" term from the skip connection is always present, providing a well-scaled gradient path.\n- Many modern LLMs (e.g., GPT-2, GPT-3, LLaMA, PaLM) use pre-normalization largely for this training stability.","A":"LN does NOT make any LR safe. LN normalizes activations, which stabilizes optimization, but Post-LN with very high LR (1e-3 for Transformers is typically high) still diverges regularly. Requiring warmup is exactly the problem Post-LN has.","B":"","C":"Pre-LN with high LR is more stable, not less. The direct residual path in Pre-LN bounds gradient magnitudes. 
\"Exploding activations\" with Pre-LN is not a documented phenomenon.","D":"The choice of normalization placement is one of the most important architectural decisions for training stability. Major papers (Xiong et al., 2020) show quantifiably better stability with Pre-LN."},"reference":"- Xiong et al., \"On Layer Normalization in the Transformer Architecture\" (2020): https://arxiv.org/abs/2002.04745"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13009","difficulty":"hard","orderIndex":9,"question":"KV cache (key-value cache) is used during Transformer inference for autoregressive generation. Without KV cache, generating T tokens requires O(T²) total compute. With KV cache, it requires O(T). However, a production system serving many concurrent requests with T=4096 tokens runs out of GPU memory. Explain the memory cost of KV cache and why it is a major production concern.","options":{"A":"KV cache memory is negligible compared to model weights; the OOM is from model parameters","B":"KV cache per request = 2 × num_layers × num_heads × head_dim × T × sizeof(dtype). For LLaMA-7B: 2 × 32 layers × 32 heads × 128 head_dim × 4096 tokens × 2 bytes (FP16) = 2×32×32×128×4096×2 = ~2GB per request. With 50 concurrent requests: 100GB just for KV cache. LLaMA-7B weights are only 14GB in FP16. The KV cache grows linearly with sequence length and request concurrency, often exceeding model weight memory at production scale. 
This is why techniques like PagedAttention (vLLM), quantized KV cache, and context compression are active research areas","C":"KV cache memory is per-token and is released after each token is generated, so it doesn't accumulate","D":"KV cache only stores the final layer's key-value pairs; earlier layers are recomputed each step"},"correct":"B","explanation":{"correct":"- Exact calculation for LLaMA-7B (32 layers, 32 heads, 128 head_dim, FP16):\n- Per token per layer: 2 (K and V) × 32 heads × 128 head_dim × 2 bytes = 16,384 bytes = 16 KB\n- Per token total: 16 KB × 32 layers = 512 KB per token\n- For T=4096 tokens: 4096 × 512 KB ≈ 2 GB per request\n- At 50 concurrent requests: 50 × 2 GB = 100 GB KV cache vs 14 GB model weights.\n- vLLM's PagedAttention: inspired by OS virtual memory, stores KV cache in non-contiguous memory pages, allowing efficient memory sharing and preventing fragmentation.","A":"As shown, KV cache can be 2-10GB per long request — comparable to or larger than the model weights. At production concurrency, it's the primary memory bottleneck.","B":"","C":"KV cache accumulates throughout a single request (all previously generated tokens' K, V must be stored to avoid recomputation). It's released after the request completes, but during generation, it grows with each new token.","D":"KV cache stores all layers' K and V vectors. That's the point — to avoid recomputing them. 
Storing only the final layer's KV and recomputing earlier layers would eliminate most of the savings."},"reference":"- Kwon et al., \"Efficient Memory Management for Large Language Model Serving with PagedAttention\" (vLLM, 2023): https://arxiv.org/abs/2309.06180"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13010","difficulty":"hard","orderIndex":10,"question":"Rotary Position Embedding (RoPE) encodes position by rotating query and key vectors in 2D subspaces: the dot product q_m · k_n (positions m and n) depends only on q_m · k_n computed as a function of (m-n). Why is this \"relative\" property critical for generalization, and how does it differ from absolute sinusoidal PE?","options":{"A":"RoPE is only useful for models with more than 1000 layers; it doesn't apply to standard Transformers","B":"RoPE's attention score depends on (m-n) — the relative distance between positions, not their absolute positions. Absolute PE: the model learns that position 100 should attend differently to position 150 than position 100 attends to position 1. With training on sequences up to length 512, position 600 has never been seen — absolute PE embedding at position 600 is undefined. RoPE: the model learns that \"looking back 50 positions\" means something, regardless of whether \"position 550 looking at 500\" or \"position 50 looking at 0\" — relative distance is the semantically meaningful quantity","C":"RoPE is a form of data augmentation, not a positional encoding","D":"Absolute sinusoidal PE also encodes relative position; RoPE is a minor implementation detail"},"correct":"B","explanation":{"correct":"- RoPE property: for rotation matrices R_m (position m) and R_n (position n): (R_m q)^T (R_n k) = q^T R_{m-n} k. The dot product computes a function of the relative offset m-n. 
This is exactly what \"relative position encoding\" means.\n- Generalization beyond training length: the model learns functions of relative distances (1, 2, 5, 50, etc.). At inference with longer sequences, the same relative distances are used — the model can generalize to position 5000 looking back 50 positions, because it's the same relative distance as position 100 looking back 50.\n- LLaMA, Mistral, Falcon all use RoPE for this generalization property.","A":"RoPE is used in standard Transformer architectures (LLaMA-7B to 70B, Mistral-7B, etc.) regardless of layer count. It's a positional encoding, not a layer-specific technique.","B":"","C":"RoPE is a positional encoding scheme, not data augmentation. It encodes token positions mathematically, not by modifying training samples.","D":"Absolute sinusoidal PE does NOT encode relative position in attention dot products. sin(pos × f) + position embedding produces different vectors for different absolute positions. The dot product of absolute PEs does have some relative position information (cos(m-n) appears), but it's confounded with absolute position information, not purely relative."},"reference":"- Su et al., \"RoFormer: Enhanced Transformer with Rotary Position Embedding\" (2021): https://arxiv.org/abs/2104.09864"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13011","difficulty":"hard","orderIndex":11,"question":"Grouped Query Attention (GQA) and Multi-Query Attention (MQA) reduce the number of K and V heads compared to Q heads. Standard MHA: 32 Q, 32 K, 32 V heads. MQA: 32 Q, 1 K, 1 V head. GQA: 32 Q, 8 K, 8 V heads (groups of 4 Q heads share K/V). What is the primary motivation, and what is the accuracy-efficiency trade-off?","options":{"A":"GQA/MQA reduce FLOPs per attention computation by 32× or 4×","B":"The primary motivation is KV cache memory reduction. MQA: reduces K/V heads from 32 to 1 → reduces KV cache by 32×. 
GQA with 8 groups: reduces KV cache by 4×. The attention-score FLOPs are essentially unchanged (every Q head still scores against all cached keys); only the K/V projection FLOPs shrink. The dominant bottleneck at inference is memory bandwidth (loading KV cache from HBM), not compute. Accuracy trade-off: MQA can reduce quality (single K/V shared across all 32 Q heads limits expressive diversity). GQA balances this: 8 K/V heads for 32 Q provides more diversity than MQA while still achieving ~4× KV cache reduction. LLaMA-2-70B uses GQA with 8 K/V groups","C":"GQA/MQA only help during training; inference memory is identical to standard MHA","D":"GQA/MQA eliminate the need for KV caching entirely"},"correct":"B","explanation":{"correct":"- Standard MHA KV cache per token: 2 × num_heads × head_dim × num_layers = 2 × 32 × 128 × 32 = 262,144 values per token.\n- MQA KV cache: 2 × 1 × 128 × 32 = 8,192 values — 32× less.\n- GQA (8 K/V groups) cache: 2 × 8 × 128 × 32 = 65,536 values — 4× less.\n- The memory bandwidth bottleneck: at inference, for each generated token, all KV cache must be loaded from GPU HBM. MQA/GQA directly reduce this bandwidth requirement, improving inference throughput.\n- Accuracy: Ainslie et al. (2023) GQA paper shows GQA reaches quality close to MHA, while MQA shows a small quality loss.","A":"FLOPs for the attention computation scale with the number of Q heads, not K/V heads, so MQA/GQA leave the QK^T and weighted-sum FLOPs essentially unchanged; only the K/V projections get cheaper. The real gain is memory: bandwidth is the bottleneck in autoregressive inference, not FLOPs.","B":"","C":"KV cache is an inference concept (caching K/V for previously generated tokens). GQA/MQA directly reduce its size, which is an inference benefit. Training is less affected because all positions are processed in parallel, so no per-token KV cache is needed.","D":"KV caching is still necessary with GQA/MQA — the K and V vectors (now fewer) still need to be cached to avoid recomputation. 
The cache is smaller, not eliminated."},"reference":"- Ainslie et al., \"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints\" (2023): https://arxiv.org/abs/2305.13245"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13012","difficulty":"medium","orderIndex":12,"question":"You implement causal (masked) self-attention for autoregressive language modeling. The mask ensures token i only attends to positions j ≤ i. During training with batch_size=32 and sequence_length=512, you materialize a 512×512 causal mask of -∞ (upper triangle) and add it to the attention logits before softmax. A colleague says this is wasteful. What is she referring to and what is a more efficient implementation?","options":{"A":"The mask should be a boolean matrix, not -∞; softmax of -∞ causes NaN","B":"Materializing a 512×512 matrix of -∞ per batch requires storing 512² = 262,144 values per head (or batch element × head × 512² = 32×8×262K ≈ 67M values). This wastes memory and requires a separate addition operation. Efficient alternative: (1) create the mask once as a bool matrix and multiply in the attention kernel; (2) FlashAttention-style tiled attention builds the causal mask implicitly without materializing the full T×T matrix; (3) use `torch.nn.functional.scaled_dot_product_attention` with `is_causal=True` which applies the mask within the fused CUDA kernel without creating a full mask tensor","C":"The causal mask should be applied after softmax, not before","D":"Causal masking is not needed for decoder models; the sequential nature of generation handles causality"},"correct":"B","explanation":{"correct":"- Memory waste: a 512×512 float32 mask = 1MB per layer per item in the batch. With batch=32, 8 heads: 32×8×1MB = 256MB just for masks across all heads per layer.\n- The mask is a static, triangular pattern — the same for every batch element and every head. 
Creating it once as a register buffer (not recomputed, not batched) saves memory.\n- PyTorch 2.0+: `F.scaled_dot_product_attention(q, k, v, is_causal=True)` fuses the mask application into a single CUDA kernel that computes attention block by block (FlashAttention style), never materializing the full T×T matrix.","A":"softmax(-∞) = 0 (not NaN). exp(-∞) = 0, which in the softmax denominator contributes 0, effectively masking the position. This is exactly the intended behavior.","B":"","C":"The causal mask must be applied before softmax to prevent probability mass from being assigned to masked (future) positions. Applying after softmax would require different mask values and wouldn't correctly zero out future positions.","D":"Causal masking is essential for training decoder models. Without the mask, during training (where the full sequence is available), each token can see all future tokens. The sequential nature applies only at inference time — during training, all positions are processed in parallel and the mask enforces causality."},"reference":"- PyTorch scaled_dot_product_attention: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13013","difficulty":"easy","orderIndex":13,"question":"In the Transformer architecture, residual connections are used in both attention and FFN sublayers: output = LN(x + Sublayer(x)). A 6-layer Transformer is trained without any residual connections. What will happen during training and why?","options":{"A":"Training will succeed but converge more slowly","B":"Without residual connections, gradients must flow through the full depth of nonlinear transformations. For a 6-layer Transformer with tanh or ReLU activations in FFN layers, the gradient at the input layer is ∂L/∂x_1 = Π_{l=1}^{6} ∂x_{l+1}/∂x_l. 
Without skip paths, each term can be < 1 in spectral norm, causing vanishing gradients. The model will fail to train meaningfully — the first few layers will receive near-zero gradient updates while only the final layers train effectively","C":"Training will fail due to a shape mismatch error from the missing addition operation","D":"Training works fine for 6 layers; residual connections are only needed for 50+ layer networks"},"correct":"B","explanation":{"correct":"- Residual connection gradient: with x + F(x), the gradient ∂L/∂x includes a direct term \"1\" from the identity path. This prevents gradient from being forced through the nonlinear F(x) path, guaranteeing some gradient flow regardless of F(x)'s Jacobian.\n- Without residuals in a 6-layer Transformer: gradient must pass through 6 attention+FFN nonlinear compositions. Each composition can attenuate gradients. While 6 layers is not as extreme as 50+, the multi-head attention + FFN layers with LayerNorm are non-trivial nonlinear operations. Gradient vanishing within 6 layers is a real risk, especially early in training.\n- Empirical evidence: the ResNet degradation problem showed even 20-layer plain networks fail. Transformers without residuals show similar degradation.","A":"\"More slowly\" understates the problem. For a 6-layer Transformer without residuals, training typically stalls — the model barely converges rather than just converging slowly.","B":"","C":"Removing the addition operation from `LN(x + Sublayer(x))` doesn't cause a shape mismatch. Both x and Sublayer(x) have the same shape (T × d_model). The operation just becomes `LN(Sublayer(x))`. It's a valid operation, just suboptimal.","D":"The need for residual connections is not exclusively for very deep networks. The original Transformer paper (6 layers) uses residual connections and shows they're important for stable training. 
The degradation problem was demonstrated for 20+ layers, but residuals help at any depth."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2015): residual connections for training stability\n- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3 (residual connections in each sublayer)"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13014","difficulty":"hard","orderIndex":14,"question":"You analyze attention patterns in a trained Transformer and find that certain heads consistently produce near-uniform attention distributions (all positions receive equal weight ≈ 1/T). Another set of heads consistently attend to the [CLS] token or current position. What do these degenerate heads indicate?","options":{"A":"These heads are working correctly — uniform attention is an information aggregation strategy","B":"These are \"no-op heads\" or \"over-smoothing heads.\" Uniform attention head: computes the average of all value vectors — effectively a mean pooling operation. While this can be useful for global aggregation, many uniform heads indicate the model has more heads than it needs and some heads have settled into trivial solutions. Attending to [CLS] or self-attention: the head is not using the key-query interaction to make meaningful choices. Diagnoses: (1) too many heads for the task — some are redundant; (2) the head's Q/K projections learned trivial mappings; (3) informative heads have been \"stolen\" by LayerNorm's normalization. Pruning uniform/trivial heads typically maintains performance with faster inference","C":"Uniform attention heads cannot be pruned; they are mathematically required for the Transformer to function","D":"These heads are a training error; reinitialize and retrain them"},"correct":"B","explanation":{"correct":"- Uniform attention: if attention logits are all 0 (Q·K^T = 0), softmax outputs 1/T for all positions. 
The output is (1/T)×ΣV_i = mean of all value vectors. This is a valid function (global average pooling) but wastes head capacity.\n- Head pruning research: Michel et al. (2019) \"Are Sixteen Heads Really Better than One?\" showed that in BERT, most attention heads can be removed individually at test time without significant performance degradation. Many heads are redundant. The few informative heads (capturing specific linguistic relations) carry most of the useful computation.\n- The \"over-smoothing\" connection: uniform attention repeatedly applied over multiple layers can cause representations to converge toward the mean, losing local distinctions (related to the over-smoothing problem in GNNs).","A":"While uniform attention does perform mean pooling (a valid operation), having many uniform heads suggests wasted capacity. The model could achieve the same aggregation with fewer heads, freeing capacity for more useful operations.","B":"","C":"Uniform heads can be pruned. Michel et al. demonstrate this empirically: pruning a sizeable fraction of heads (roughly 20-40% in their experiments) causes little accuracy loss, because the remaining heads carry most of the useful computation.","D":"Retraining specific heads without changing the architecture would likely produce the same degenerate solutions. The fundamental issue is model over-capacity for the task, not a training error."},"reference":"- Michel et al., \"Are Sixteen Heads Really Better than One?\" (2019): https://arxiv.org/abs/1905.10650"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13015","difficulty":"hard","orderIndex":15,"question":"You implement attention with head_dim=64, num_heads=8, and compute Q, K, V via three separate linear projections. A colleague proposes fusing Q, K, V into a single projection: `QKV = x @ W_QKV` where W_QKV ∈ ℝ^{d_model × 3d_model}. 
What are the computational and practical trade-offs of this fusion?","options":{"A":"Fused QKV projection is mathematically inequivalent to separate projections","B":"Fused QKV projection is mathematically equivalent: W_QKV = [W_Q | W_K | W_V] concatenated along the output dimension. Practical advantages: (1) single larger GEMM (General Matrix Multiply) instead of three smaller ones — GPU is more efficient for larger matrices; (2) single data load of x from memory (read x once instead of three times); (3) enables memory fusion in frameworks (one kernel launch). Trade-offs: W_QKV must fit in GPU registers/cache — for very large d_model, this may not be possible; less flexibility in applying different regularization to Q vs K vs V. Modern Transformer implementations (FlashAttention, cuDNN) all fuse QKV for efficiency","C":"Fused QKV uses 3× more memory because it stores the full 3d_model output","D":"Fused QKV is less accurate because the shared computation introduces correlation between Q, K, and V"},"correct":"B","explanation":{"correct":"- Mathematical equivalence: computing [x @ W_Q, x @ W_K, x @ W_V] is identical to x @ [W_Q | W_K | W_V] (concatenating along output dimension). The linear operations are independent (no weight sharing).\n- GPU efficiency: three separate GEMMs of size (T, d_model) × (d_model, d_k) vs one GEMM of size (T, d_model) × (d_model, 3×d_k). Larger GEMMs achieve better hardware utilization (higher arithmetic intensity, better use of tensor cores). For d_model=512, one (512, 1536) GEMM is more efficient than three (512, 512) GEMMs.\n- Memory bandwidth: x has shape (T, d_model) = T × 512 values. Reading x once for one GEMM vs three times for three GEMMs = 3× less memory bandwidth for x.","A":"Fused QKV is mathematically equivalent to separate projections. The projection matrices W_Q, W_K, W_V are simply concatenated into W_QKV. 
The outputs are identical.","B":"","C":"The output memory is the same either way: 3 × T × d_model values, whether computed separately or fused. The weights also match: each of W_Q, W_K, W_V spans all heads and is d_model × d_model (with d_model = num_heads × head_dim = 8 × 64 = 512 here), so together they hold 3 × d_model² parameters, exactly the size of W_QKV (d_model × 3d_model).","D":"Q, K, V weights are independent in the fused projection (separate columns of W_QKV). There's no weight sharing or correlation introduced between Q, K, and V computation. The results are identical."},"reference":"- Megatron-LM: https://github.com/NVIDIA/Megatron-LM (fused QKV for efficiency)\n- FlashAttention implementation details: https://github.com/Dao-AILab/flash-attention"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14001","difficulty":"easy","orderIndex":1,"question":"Self-supervised learning (SSL) trains a model on a pretext task without human-labeled data. A team designs a pretext task: predict whether two image patches from the same image are adjacent (positive) or not adjacent (negative). What is this type of pretext task, and what limitation does it have?","options":{"A":"This is a contrastive learning task; it's optimal for all downstream tasks","B":"This is a context prediction pretext task (spatial relationship prediction), a form of SSL that forces the model to understand relative spatial structure. The limitation is \"pretext task bias\": the learned representations are optimized specifically for spatial relationship prediction. 
If the downstream task is semantic classification (is this a cat?), the features optimized for \"which patches are adjacent\" may not align well with \"which image contains a cat.\" The model learns low-level spatial structure but may miss semantic content needed for classification","C":"This is a supervised task because it generates labels (adjacent/not-adjacent)","D":"This pretext task is equivalent to training on ImageNet labels"},"correct":"B","explanation":{"correct":"- Pretext task bias: SSL models learn only as much as needed to solve the pretext task. A spatial adjacency predictor learns to encode spatial layout and local texture continuity — useful for object detection, not necessarily for fine-grained recognition.\n- This is why designing good pretext tasks is critical and why modern SSL methods (SimCLR, DINO) moved away from hand-designed pretext tasks toward invariance-based approaches (learn representations that are invariant to augmentation).\n- The labels (adjacent/not-adjacent) are automatically derived from the image itself without human annotation — this is the \"self-supervised\" in SSL. The learning is supervised in mechanism but self-supervised in label generation.","A":"\"Optimal for all downstream tasks\" is too strong. Pretext task design significantly affects which downstream tasks benefit. Spatial SSL helps detection/segmentation more than classification.","B":"","C":"SSL specifically means labels are automatically generated from the data itself (no human annotation). Adjacent/not-adjacent labels are derived programmatically from the image — this is self-supervised. \"Supervised\" would require human annotators labeling each patch pair.","D":"ImageNet labels (1000 semantic categories with human annotation) encode semantic content. 
Adjacency labels encode spatial relationships — these are completely different."},"reference":"- Doersch et al., \"Unsupervised Visual Representation Learning by Context Prediction\" (2015): https://arxiv.org/abs/1505.05192"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14002","difficulty":"easy","orderIndex":2,"question":"SimCLR's contrastive loss (NT-Xent) is defined for a positive pair (i, j) as: L_{i,j} = -log[exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ)]. The denominator runs over the 2N-1 other samples in the batch of 2N views: the one positive and 2N-2 negatives. What is the role of temperature τ in this loss, and what happens at τ→0 and τ→∞?","options":{"A":"Temperature τ is a scaling constant with no effect on training; it only normalizes the loss value","B":"Temperature τ controls the sharpness of the similarity distribution. At τ→0: the loss becomes near-zero for clearly separated pairs and near-infinity for any misclassified pair — essentially a hard margin loss that only trains on \"confused\" negatives. At τ→∞: the softmax denominator becomes uniform, and the gradient vanishes — the loss becomes insensitive to relative similarities. Optimal τ (typically 0.07-0.5 in practice) provides informative gradients: \"difficult negatives\" (similar but from a different image) contribute large gradients; \"easy negatives\" (dissimilar) contribute small gradients","C":"Temperature τ controls the batch size; larger τ requires smaller batches","D":"Temperature τ should always be set to 1.0; other values cause training instability"},"correct":"B","explanation":{"correct":"- Gradient analysis: ∂L/∂sim(z_i, z_j) = -(1/τ)(1 - softmax(sim(z_i,z_j)/τ)). For high similarity (positive pair far from negatives), softmax ≈ 1 → gradient ≈ 0 (already learned). 
For low similarity (positive pair confused with negatives), softmax small → gradient large (need to push closer).\n- τ→0: the gradient is large only when some negative scores at least as high as the positive pair — creates a very hard, sparse learning signal. Risk of gradient explosion for bad initializations.\n- τ→∞: all gradients → 0 (all similarities equally weighted). No meaningful learning.\n- MoCo v1 uses τ=0.07 and MoCo v2 uses τ=0.2; SimCLR's ImageNet ablation favors τ≈0.1. The choice significantly affects quality.","A":"Temperature strongly affects training. Chen et al. (SimCLR) performed ablation showing τ significantly affects linear evaluation accuracy. Lower τ in a reasonable range generally improves feature quality.","B":"","C":"Temperature and batch size are independent hyperparameters. Temperature affects the sharpness of the similarity distribution; batch size determines how many negatives are available.","D":"τ=1.0 is one valid choice but not optimal. SimCLR's ImageNet temperature ablation shows τ=0.1 clearly outperforming τ=1.0 in linear evaluation."},"reference":"- Chen et al., \"A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)\" (2020): https://arxiv.org/abs/2002.05709"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14003","difficulty":"medium","orderIndex":3,"question":"SimCLR requires very large batch sizes (e.g., 4096-8192) for good performance. MoCo (Momentum Contrast) achieves comparable performance with batch_size=256 by using a queue of negative keys. What fundamental problem does MoCo's queue solve, and what is the role of the momentum encoder?","options":{"A":"MoCo's queue solves GPU memory limitations by storing keys in CPU memory","B":"SimCLR's negatives come only from the current batch — with batch=256 (512 augmented views), only 2×(256-1) = 510 negatives per sample. 
Contrastive learning benefits from many diverse negatives (the denominator's discriminative power increases with more negatives). MoCo's queue maintains a rolling buffer of K=65,536 encoded keys from recent batches, giving each sample 65,536 negatives without increasing batch size. The momentum encoder solves consistency: if the key encoder is updated with large gradient steps per batch, keys in the queue (encoded by different encoder versions) are inconsistent. Momentum update (m=0.999): θ_k ← m×θ_k + (1-m)×θ_q ensures slow, consistent evolution of the key encoder, making all queue entries approximately encoded by the same encoder version","C":"MoCo's queue stores the input images; momentum updates the queue with new images each step","D":"Momentum encoder is used to prevent gradient explosion from training on stale negatives"},"correct":"B","explanation":{"correct":"- SimCLR batch size requirement: with 2N samples and 2N-2 negatives, more negatives → better coverage of the negative space → harder contrastive problem → better features. SimCLR needs large batches because all negatives must be from the current forward pass.\n- MoCo queue: stores encoded keys from recent batches. With K=65,536 queue entries and batch=256, each query is contrasted against 65,536 negatives encoded over the past 65,536/256 = 256 batches.\n- Momentum encoder necessity: if the key encoder (which encoded queue entries) has changed significantly, old entries are inconsistent with current representations. Because the momentum encoder changes very slowly, entries encoded 256 batches ago are still approximately compatible with today's encoder.","A":"MoCo's queue is in GPU memory (as a tensor). The point is not CPU vs GPU storage but having more negatives than fit in a single batch. MoCo v3 removes the queue entirely in favor of large batch contrastive learning.","B":"","C":"MoCo's queue stores encoded key vectors (d-dimensional feature vectors), not raw images. 
Encoding images takes compute — storing pre-computed encodings is the efficiency win.","D":"Gradient explosion is not the primary concern. Momentum encoding is specifically about maintaining consistency of the key representations in the queue — all queue entries should be from \"similar\" encoder versions."},"reference":"- He et al., \"Momentum Contrast for Unsupervised Visual Representation Learning (MoCo)\" (2020): https://arxiv.org/abs/1911.05722"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14004","difficulty":"medium","orderIndex":4,"question":"BYOL (Bootstrap Your Own Latent) achieves state-of-the-art self-supervised performance without negative pairs. Critics initially predicted this would fail due to \"representational collapse\" (the model could trivially minimize loss by mapping all inputs to the same constant vector). BYOL avoids collapse using two components — what are they?","options":{"A":"BYOL uses very large batch sizes to prevent collapse","B":"BYOL uses: (1) an online-target asymmetry: the online network has an extra prediction head (MLP) that the target network doesn't have. The two networks are architecturally different, preventing the trivial constant solution (target can't be reached by a constant representation because the prediction head must transform to match target). (2) Stop-gradient on the target: target network is a momentum-updated copy of the online network (no gradient flows to the target). The target is a \"moving average\" oracle. 
Together, these create an asymmetric optimization that prevents collapse: the online network always chases a moving target that represents a slightly different (momentum-averaged) feature space","C":"BYOL avoids collapse by adding BatchNorm which implicitly creates negative interactions between samples in a batch","D":"BYOL uses random cropping augmentation only; no other architectural tricks are needed"},"correct":"B","explanation":{"correct":"- Representational collapse risk: if both networks mapped every input to the same constant z, the cosine similarity = 1, loss = 0. Perfect training loss with useless representations.\n- BYOL's architectural trick: online network q_θ(z), target network z̄ (no prediction head). Loss: ||q_θ(z) - sg(z̄)||². The prediction head q_θ must transform z to match z̄. A constant z wouldn't have a good prediction — q_θ would need to output z̄ which comes from slightly different representations.\n- Grill et al. (2020) BYOL paper; Richemond et al. (2020) \"BYOL works even without batch statistics\" (analyzes BatchNorm's role); Tian et al. (2021) show momentum + prediction head together are sufficient for collapse prevention.","A":"Batch size is not the primary mechanism. BYOL was shown to work with relatively small batch sizes (512-1024) compared to SimCLR's 4096-8192 requirement.","B":"","C":"BatchNorm does implicitly prevent collapse (all-constant output → constant batch statistics → BatchNorm makes this suboptimal). This was debated in the BYOL community. However, BYOL's primary designed mechanism is the asymmetric architecture (prediction head + momentum encoder), not BatchNorm.","D":"Augmentation is necessary but not sufficient to prevent collapse. 
Without the prediction head and momentum encoder, simple augmentation-based SSL collapses to constant representations."},"reference":"- Grill et al., \"Bootstrap Your Own Latent (BYOL)\" (2020): https://arxiv.org/abs/2006.07733"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14005","difficulty":"medium","orderIndex":5,"question":"MAE (Masked Autoencoders) masks 75% of image patches and trains a ViT to reconstruct the masked patches from the remaining 25%. This is more aggressive than BERT's 15% masking. Why does a high masking ratio work better for images than for text, and why doesn't it cause the model to only memorize local statistics?","options":{"A":"Images are lower-dimensional than text, requiring more masking for the same difficulty","B":"Images have much higher spatial redundancy than text. Adjacent image patches are highly correlated (smooth regions, textures). With only 15% masking, the model can interpolate from immediately surrounding patches without understanding global structure. 75% masking removes enough context that reconstruction requires global understanding of object structure — \"filling in a 75% occluded cat requires knowing what a cat looks like, not just what neighboring pixels look like.\" Text has lower redundancy: \"The [MASK] ate the [MASK]\" with 15% masking is already challenging. The reconstruction target (pixel values) is also low-level but the learned representations capture semantics because the model can only succeed by understanding global scene structure","C":"75% masking is used to reduce computation (fewer visible patches = less attention cost)","D":"Higher masking ratios cause overfitting; MAE prevents this with gradient clipping"},"correct":"B","explanation":{"correct":"- Spatial redundancy in images: a patch's pixel values can be predicted from adjacent patches via linear interpolation without any semantic understanding. 
BERT with 15% masking works because text lacks this spatial redundancy — each word carries unique semantic content.\n- MAE's high mask ratio design: He et al. (2021) ablated masking ratios 10%-90% and found 75% optimal. At low ratios, the task is too easy (local interpolation suffices). At very high ratios (90%+), too little context remains and even the decoder can't reconstruct.\n- Computation benefit: the ViT encoder only processes the 25% visible patches. This actually makes MAE faster than processing all patches — a beneficial side effect, not the motivation.","A":"Dimensionality is not the relevant factor. The key property is spatial correlation/redundancy, not dimensionality. A text sequence with 128 tokens is \"lower-dimensional\" than an image in some sense but requires lower masking ratios.","B":"","C":"While MAE's encoder does process fewer patches (25%), the primary motivation is task difficulty calibration, not compute. He et al. explicitly motivated the high masking ratio by the need to force global understanding.","D":"Gradient clipping and overfitting are not related to masking ratio choice. MAE's masking ratio ablation shows a smooth performance curve peaking at 75%, not an overfitting curve."},"reference":"- He et al., \"Masked Autoencoders Are Scalable Vision Learners (MAE)\" (2021): https://arxiv.org/abs/2111.06377"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14006","difficulty":"hard","orderIndex":6,"question":"You apply SimCLR to a medical imaging dataset of 10,000 chest X-rays (unlabeled) and then fine-tune the pretrained model on 100 labeled X-rays for pneumonia detection. Your colleague applies the same pipeline to 1M natural images (ImageNet-scale) before fine-tuning on the same 100 labeled X-rays. Surprisingly, the ImageNet-pretrained model performs similarly to the domain-specific SSL model. 
What explains this, and when would domain-specific SSL clearly win?","options":{"A":"ImageNet always outperforms domain-specific SSL; more data is always better","B":"At 1M vs 10K samples, ImageNet's data volume advantage may compensate for domain mismatch. SimCLR on 10K images may not learn sufficiently diverse representations — contrastive learning benefits greatly from scale (diversity of negatives, augmentation variety). However, domain-specific SSL clearly wins when: (1) the domain has no visual overlap with natural images (e.g., satellite imagery, pathology slides at 40× magnification, time series data); (2) domain-specific data scales to 100K+ unlabeled examples; (3) the downstream task requires highly domain-specific features (cell nucleus morphology vs ImageNet textures)","C":"ImageNet always loses; domain-specific SSL is always better due to lower distribution shift","D":"Domain-specific SSL is illegal to compare to ImageNet; they must be evaluated on different benchmarks"},"correct":"B","explanation":{"correct":"- Scale-quality trade-off: SimCLR at 10K images (chest X-rays) sees limited diversity. Each batch offers few genuinely distinct negatives, and augmentations may not create sufficiently different views. ImageNet at 1M images provides diverse negatives and augmentation variety that produces better feature generalization.\n- When domain wins: Raghu et al. (2019) found that ImageNet pretraining yields surprisingly little benefit on medical imaging tasks, which suggests that with sufficient in-domain unlabeled data (on the order of 100K+ images), domain-specific pretraining can pull ahead. The larger the domain shift, the larger this benefit tends to be.\n- Chest X-rays specifically: X-rays have different statistics (grayscale, high-frequency structures, anatomical regularities) from natural images. Domain SSL can learn X-ray-specific features (lung density patterns) that ImageNet SSL misses.","A":"Domain-specific SSL can outperform ImageNet when domain-specific data is abundant. 
\"Always better\" claims in transfer learning are consistently proven wrong across different scales and domains.","B":"","C":"\"Always loses\" is also wrong. At 10K domain images vs 1M natural images, the scale advantage can outweigh domain specificity. Both A and C are too absolute.","D":"Comparing the two approaches on the same downstream task (100 labeled X-rays) is a standard and valid experimental setup. It's a legitimate research question, not a methodological error."},"reference":"- Raghu et al., \"Transfusion: Understanding Transfer Learning for Medical Imaging\" (2019): https://arxiv.org/abs/1902.07208\n- Zhang et al., \"Contrastive Learning of Medical Visual Representations from Paired Images and Text\" (2020)"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14007","difficulty":"hard","orderIndex":7,"question":"DINO (Self-Distillation with No Labels) uses Vision Transformers (ViTs) and a self-distillation approach where the student network is trained to match the teacher's output distribution. DINO's teacher is trained with centering (subtracting a running mean of teacher outputs) and sharpening (low temperature for teacher softmax). Without centering, what collapse would occur, and why doesn't sharpening alone prevent it?","options":{"A":"Without centering, the model would converge to random representations","B":"Without centering, the teacher's outputs would collapse to a single dominant dimension: all outputs have one probability near 1 and others near 0 (uniform collapse to one prototype). This happens because softmax with sharpening + no centering amplifies any small imbalance — if one output dimension is slightly larger due to initialization, sharpening makes it dominant, and the student learns to predict this \"always one\" output. 
Sharpening alone doesn't prevent this because it actively amplifies the imbalance: sharper distribution → stronger push toward the dominant dimension → stronger collapse signal. Centering subtracts the running mean of teacher outputs, preventing any single dimension from dominating","C":"Without centering, loss would become NaN in the first training step","D":"Centering is only needed for ViT architectures; CNN-based DINO doesn't need it"},"correct":"B","explanation":{"correct":"- Collapse analysis in DINO: the teacher softmax output p_t = softmax(g_t(x)/τ_t). If τ_t is small (sharpening) and one output dimension h consistently has higher logit: exp(h/τ_t) >> exp(other/τ_t) → p_t ≈ [0,...,1,...,0].\n- All images output the same one-hot → trivial student loss (student always predicts the same thing) → model outputs useless representations.\n- Centering: g_t ← g_t - center, where center = momentum EMA of teacher output. If teacher collapses to output consistently high values at index k, center[k] becomes large, subtracting it and bringing the distribution back toward uniform.\n- Sharpening prevents collapse to uniform (opposite direction): sharpening makes the teacher output more peaked, which is good for learning distinctive features — but only works against uniform collapse, not uniform-to-single-mode collapse.","A":"\"Random representations\" are not the collapse type. The collapse is to non-random but useless representations — all inputs producing the same output.","B":"","C":"Loss doesn't become NaN immediately. The collapse is gradual — the teacher progressively concentrates on one mode over many training steps.","D":"The collapse mechanism (amplification of dominant dimensions through sharpening) applies to any architecture using softmax-sharpened outputs. 
It's not ViT-specific."},"reference":"- Caron et al., \"Emerging Properties in Self-Supervised Vision Transformers (DINO)\" (2021): https://arxiv.org/abs/2104.14294"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14008","difficulty":"medium","orderIndex":8,"question":"A researcher claims: \"Self-supervised models learn better features than supervised models because they use more data.\" You're asked to evaluate this claim. What is the nuanced truth?","options":{"A":"The claim is correct; SSL always outperforms supervised learning","B":"The claim is partially correct in specific regimes: SSL + large unlabeled data can outperform supervised learning on limited labeled data (few-shot and semi-supervised settings). But supervised models trained on fully labeled large datasets (e.g., full ImageNet with 1.28M labels) still outperform SSL models of the same architecture on classification tasks in most benchmarks, because supervised labels directly optimize for the target metric. SSL's advantage is (1) representation quality with few labels downstream, (2) versatility (same SSL features work for many tasks), and (3) scaling — unlabeled data is far more abundant","C":"Supervised learning always outperforms SSL for all tasks and all data sizes","D":"SSL and supervised models learn identical features; the choice is purely a function of data availability"},"correct":"B","explanation":{"correct":"- Ericsson et al. 
(2021) \"How Well Do Self-Supervised Models Transfer?\": comprehensive comparison showing that SSL features (SimCLR, BYOL, MoCo v2) transfer better across diverse downstream tasks than supervised features, but supervised ImageNet accuracy is still higher.\n- The nuance: SSL features are more general (better on semantic segmentation, object detection, texture recognition) while supervised features are more specialized (better on classification tasks similar to the supervised training task).\n- Scaling law: He et al. (MAE) and Chen et al. (SimCLR v2) show that with very large unlabeled datasets (100M+ images) and fine-tuning on 1% of ImageNet labels, SSL can match or exceed full supervised training. The crossover point depends on scale.","A":"Full supervised ImageNet training still outperforms SSL for ImageNet classification specifically. \"Always outperforms\" is empirically false.","B":"","C":"SSL explicitly outperforms supervised in: few-shot learning (1% ImageNet labels), cross-domain transfer (ImageNet SSL → medical imaging), and multi-task settings. \"Always loses\" is also false.","D":"SSL and supervised features are measurably different. SSL features have more distributed, texture-sensitive representations; supervised features are more compressed and task-specific. Studies probing representations show clear differences."},"reference":"- Ericsson et al., \"How Well Do Self-Supervised Models Transfer?\" (2021): https://arxiv.org/abs/2011.13377"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14009","difficulty":"hard","orderIndex":9,"question":"VICReg (Variance-Invariance-Covariance Regularization) is an SSL method that avoids collapse through explicit regularization terms rather than negative pairs or asymmetric architectures. The three terms are: Invariance (MSE between two views), Variance (per-dimension std ≥ γ), Covariance (off-diagonal covariance terms → 0). 
What specific collapse does each term prevent?","options":{"A":"All three terms prevent the same collapse type; they are redundant","B":"Invariance term: pushes different views of the same image toward the same representation (learn view-invariant features). Without it, the model could learn different representations for different augmentations. Variance term: prevents point collapse — where the network maps all inputs to the same point (constant representation, std=0 per dimension). Enforcing std ≥ γ per dimension ensures each dimension varies across the batch. Covariance term: prevents informational collapse — where multiple dimensions encode the same feature. Driving off-diagonal covariance to zero decorrelates the representation dimensions, maximizing the information encoded across dimensions (similar to ICA objective)","C":"VICReg's variance term prevents gradient explosion, not collapse","D":"The covariance term is only used for regularization during training; it's removed at inference"},"correct":"B","explanation":{"correct":"- Point collapse: if all samples map to the same vector z*, variance per dimension = 0. The variance term penalizes small std per dimension, directly preventing this.\n- Feature redundancy (informational collapse): if dimension 1 and dimension 2 always have the same value (correlation = 1), the representation only has effective dimensionality 1 despite being 2-dimensional. Off-diagonal covariance = 0 forces dimensions to encode different aspects of the input.\n- These three failure modes are distinct: point collapse (all samples → same point, caught by variance), dimensional correlation (dimensions encode same info, caught by covariance), augmentation sensitivity (caught by invariance).","A":"Each term addresses a distinct failure mode. Variance prevents point collapse; covariance prevents correlated features; invariance prevents augmentation-sensitive representations. 
Removing any one allows its corresponding collapse.","B":"","C":"Gradient explosion is not related to variance regularization. VICReg's variance term is specifically an explicit constraint on the output distribution (std ≥ γ), not a gradient magnitude control.","D":"The covariance term is a training regularization that shapes the learned representation. Once trained, the model's representations naturally have low covariance (the weights were optimized to achieve this). At inference, no explicit regularization is applied."},"reference":"- Bardes et al., \"VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning\" (2022): https://arxiv.org/abs/2105.04906"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14010","difficulty":"medium","orderIndex":10,"question":"You train SimCLR on a dataset of 100,000 satellite images. The augmentation pipeline includes: random crop, random horizontal flip, color jitter, and Gaussian blur. After fine-tuning on 500 labeled images, accuracy is only 60%. A colleague who trains on medical X-rays with the same pipeline achieves 85% after similar fine-tuning. What is the root cause of the satellite image underperformance?","options":{"A":"Satellite images are too large for SimCLR; reduce resolution to fix","B":"The augmentation pipeline was designed for natural images (ImageNet), where color jitter and horizontal flip are label-preserving (a blue sky remains sky when hue-shifted). 
For satellite images: (1) color information is semantically critical — red vs green vs water vs building have specific spectral signatures; color jitter corrupts the most informative feature; (2) horizontal flip may be label-preserving, but random crop of aerial images might crop away the entire object of interest (a single building might be 5% of the image); (3) Gaussian blur destroys the fine-grained structural features (road patterns, building edges) that distinguish satellite image classes. The augmentation must be designed for the domain's semantic invariances","C":"SimCLR requires batch_size > 10,000 for satellite images specifically","D":"Satellite images have 4 channels (RGBI); SimCLR only supports 3-channel inputs"},"correct":"B","explanation":{"correct":"- Augmentation design principle: contrastive learning assumes augmentations create \"views\" that share the same semantic content but differ in appearance. An augmentation is valid if it's label-preserving. For natural images: color changes don't change \"cat vs dog.\" For satellite images: color IS the semantic content.\n- Domain-specific SSL for satellite imagery: researchers use augmentations like season-change simulation (same location in summer vs winter → different spectral signatures but same land use), multi-temporal views, or multi-spectral band dropout.\n- The medical X-ray success: grayscale X-rays are largely invariant to color jitter (already grayscale or near-grayscale). Horizontal flip is medically controversial (left-right lung anatomy matters) but less catastrophic than color destruction.","A":"Resolution is not the root issue. SimCLR works at various resolutions. The problem is augmentation-semantic alignment, not image size.","B":"","C":"SimCLR has no satellite-image-specific batch size requirement. The batch size argument doesn't explain why satellite images specifically underperform compared to medical images.","D":"SimCLR's projection networks work with any input channels. 
Many satellite image SSL papers use 4-channel (RGBI) or even 13-channel (Sentinel-2) inputs with a suitably adapted input layer."},"reference":"- Manas et al., \"Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data\" (2021): https://arxiv.org/abs/2103.16607"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14011","difficulty":"hard","orderIndex":11,"question":"A text encoder trained with contrastive learning on (image, text) pairs (CLIP-style) shows an unexpected behavior: when asked to classify images, performance drops significantly when category names are changed to synonyms or when rare words are used. What is the root cause?","options":{"A":"CLIP models cannot process synonyms; they use character-level encoding","B":"CLIP's text encoder is trained on image-text pairs where certain descriptions are more common: \"a photo of a dog\" appears more frequently than \"a photo of a canine.\" The model's text representations encode the specific word distributions in the training data. Less common words/phrasings may not be well-represented in the learned text embedding space — they may cluster far from the corresponding image embeddings. The zero-shot classification performance depends critically on the prompt template and word choice matching the training distribution","C":"CLIP models do not support zero-shot classification; only supervised fine-tuning works","D":"The issue is the text tokenizer; rare words are split into subword tokens which confuse the model"},"correct":"B","explanation":{"correct":"- CLIP's training distribution: web-scraped (image, alt-text) pairs. \"Dog,\" \"puppy,\" \"golden retriever\" appear frequently with appropriate images. 
\"Canis lupus familiaris\" or \"canine quadruped\" appear rarely and often without matched images.\n- Prompt engineering (Radford et al., 2021): using \"a photo of {class}\" outperforms \"{class}\" alone. Averaging embeddings of multiple prompts further improves performance. This dependence on prompt wording reveals the model's sensitivity to the text distribution it was trained on.\n- Rare word underperformance is a known limitation: in scientific domains where rare technical terms are used, CLIP's zero-shot performance degrades significantly compared to fine-tuned models.","A":"CLIP uses subword tokenization (BPE), not character-level. It can tokenize synonyms and rare words without difficulty. The issue is learned representation quality, not tokenization capability.","B":"","C":"CLIP's primary use case is zero-shot classification — comparing image embeddings to text embeddings of category names. This is the standard evaluation in the original CLIP paper.","D":"Subword tokenization can process any word. Rare words being split into subword tokens does affect representation quality (fewer training examples for those subword combinations), but this is a secondary effect. The primary issue is training distribution coverage."},"reference":"- Radford et al., \"Learning Transferable Visual Models From Natural Language Supervision (CLIP)\" (2021): https://arxiv.org/abs/2103.00020"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14012","difficulty":"easy","orderIndex":12,"question":"Contrastive learning requires defining \"positive pairs\" and \"negative pairs.\" SimCLR uses two augmented views of the same image as positives and other images in the batch as negatives. 
What is the \"false negative\" problem in contrastive learning?","options":{"A":"False negatives are augmentations that look too similar to the original image","B":"False negatives occur when two different images that belong to the same class (or represent the same concept) are treated as negatives in the contrastive loss. Example: two different photos of the same dog breed are pulled apart as negatives. The contrastive loss will actively push their representations apart even though they should be similar for semantic understanding. This can harm representation quality, particularly when the dataset has many images of the same class. Solutions: class-aware contrastive loss (supervised contrastive learning), momentum queue deduplication, or using very diverse datasets where same-class pairs are rare","C":"False negatives are augmentation pairs where the augmentation removes all useful information","D":"False negatives only occur in text-based contrastive learning, not image-based"},"correct":"B","explanation":{"correct":"- Standard contrastive learning negative sampling: all images in the batch except the current image's augmentations are negatives. With K=256 batch size and a 10-class dataset, roughly 25 other images in the batch have the same class as the current image — these are false negatives.\n- Impact: the loss pushes same-class images apart in feature space, conflicting with the goal of learning semantically meaningful representations. This is why contrastive learning sometimes learns features that separate instances but not classes.\n- Supervised contrastive learning (Khosla et al., 2020): uses label information to identify true negatives (different class) and true positives (same class), avoiding this problem.","A":"Augmentations that look similar to the original are actually good positives — they test whether the model can find invariant features. 
This is not the false negative problem.","B":"","C":"Augmentations that remove useful information would be ineffective views (the model can't learn from them), but this is an augmentation quality problem, not the false negative problem.","D":"False negatives occur in any contrastive learning setting where negatives are not verified to be semantically different. Image-based contrastive learning has this problem extensively."},"reference":"- Khosla et al., \"Supervised Contrastive Learning\" (2020): https://arxiv.org/abs/2004.11362"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14013","difficulty":"medium","orderIndex":13,"question":"MAE (Masked Autoencoder) uses an asymmetric encoder-decoder: the encoder only processes visible (unmasked) patches; the decoder reconstructs both visible and masked patches. Why is this asymmetry important, and what would happen with a symmetric design?","options":{"A":"The asymmetry is a software optimization; symmetric MAE would work equally well","B":"The encoder only processes visible patches (25%), making it computationally efficient. The decoder is lightweight and only used during pre-training. The asymmetry is critical because: (1) Efficiency: the encoder processes 25% of patches vs 100% — 4× FLOP reduction for the expensive ViT encoder; (2) Feature quality: the encoder never sees masked tokens during pre-training — it learns to extract features from limited visible context. At fine-tuning, all patches are visible, so the encoder is now applied to the full image — a setting it can handle but wasn't constrained to during pretraining, which may actually improve generalization. 
A symmetric design (encoder sees all tokens, including mask tokens) would be more expensive and would produce worse representations","C":"Asymmetry is needed because the decoder must be larger than the encoder to reconstruct","D":"A symmetric design would produce identical representations; the asymmetry only affects training speed"},"correct":"B","explanation":{"correct":"- He et al.'s key insight: the mask tokens (75%) should only be used by the decoder (a lightweight Transformer), not the encoder. The encoder focuses on learning from limited visible patches — creating a harder, more useful pretext task.\n- Encoder efficiency: a ViT-Large with 196 patches, processing only 25% = 49 patches — the expensive self-attention (T²) scales as 49² instead of 196², a 16× reduction.\n- Decoder design: a small Transformer (narrow and shallow) is sufficient for reconstruction given the encoder's rich representations. The decoder is discarded after pre-training.\n- The asymmetry ensures the encoder learns self-sufficient representations (not relying on the decoder to interpret masked regions).","A":"The paper explicitly ablates symmetric vs asymmetric designs. Symmetric MAE (encoder sees all tokens) performs worse on linear evaluation and fine-tuning. The asymmetry is both computationally and qualitatively important.","B":"","C":"MAE's decoder is deliberately SMALLER than the encoder (lighter). A large decoder would add computation without improving encoder representations.","D":"He et al. (2021) show that asymmetric MAE achieves higher accuracy than symmetric designs. 
The representations are demonstrably different and better."},"reference":"- He et al., \"Masked Autoencoders Are Scalable Vision Learners (MAE)\" (2021): Figure 9 (ablation on decoder depth/width)"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14014","difficulty":"hard","orderIndex":14,"question":"You compare two SSL-pretrained models: Model A (SimCLR, 100 epochs, 1M images) and Model B (MAE, 800 epochs, 1M images). For image classification fine-tuning with 100% labels, Model B achieves higher accuracy. For few-shot (1% labels), Model A achieves higher accuracy. Explain why this reversal occurs.","options":{"A":"The reversal is caused by Model B training for 800 epochs vs Model A's 100 epochs","B":"SimCLR's invariance-learning (contrastive) creates representations optimized for semantic consistency across augmentations — ideal for few-shot learning because these representations are already semantically structured and semantically similar images cluster together. MAE's reconstruction objective learns dense, detailed visual features by solving the pixel-level reconstruction task — these features capture more fine-grained visual information, benefiting from many labels to learn appropriate classifier mappings. With 100% labels, MAE's richer features can be fully utilized; with 1% labels, SimCLR's already-semantically-aligned features require less fine-tuning signal to produce good classifiers","C":"The reversal indicates a bug in the evaluation protocol; SSL models should have consistent ordering","D":"The reversal is solely due to ViT vs ResNet architecture (MAE uses ViT; SimCLR uses ResNet)"},"correct":"B","explanation":{"correct":"- Contrastive SSL (SimCLR): the objective explicitly creates compact, semantically clustered representations. The projection head discards high-frequency information. 
The result: a well-organized semantic feature space where k-NN or linear classification works with few examples.\n- Masked Autoencoder (MAE): the objective is pixel-level reconstruction, which preserves fine-grained texture and structural information (needed to reconstruct pixels). These rich features benefit from full fine-tuning but don't naturally cluster semantically.\n- This is a well-documented phenomenon: contrastive methods excel at linear evaluation (probing semantic structure) and few-shot; masked autoencoders excel at full fine-tuning (where dense features can be specialized).","A":"Training duration (100 vs 800 epochs) does contribute, but this doesn't explain the reversal. With equal epochs, MAE still outperforms SimCLR on full fine-tuning and SimCLR still outperforms MAE on few-shot.","B":"","C":"The reversal is a real, documented phenomenon — not a bug. Multiple papers confirm that contrastive and generative SSL methods have different strengths across evaluation protocols.","D":"Architecture (ViT vs ResNet) does affect performance, but MAE can be applied to ResNets and SimCLR can use ViT. The fundamental difference is contrastive (semantic alignment) vs generative (pixel reconstruction) objectives."},"reference":"- He et al., \"Masked Autoencoders Are Scalable Vision Learners\" (2021): Table 3 comparison with contrastive methods\n- Park et al., \"What Do Self-Supervised Vision Transformers Learn?\" (2022)"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14015","difficulty":"hard","orderIndex":15,"question":"A team uses a two-stage training: (1) SSL pre-training on 10M unlabeled images; (2) supervised fine-tuning on 10K labeled images. They observe that increasing SSL pre-training time from 100 to 1000 epochs improves fine-tuning accuracy from 78% to 82%. However, going from 1000 to 5000 epochs only improves to 82.3%. 
What phenomenon explains this diminishing returns pattern and what limits further SSL improvement?","options":{"A":"SSL pre-training converges after 1000 epochs; more epochs actively hurt performance","B":"SSL representations plateau when the pretext task is \"solved\" — the model has learned all information available from the unlabeled data that is accessible through the SSL objective. Beyond this point, additional epochs may: (1) overfit to dataset-specific statistics rather than general features; (2) cause the representations to become more task-specific to the SSL objective (contrastive invariances) rather than more general; (3) reduce diversity in representations (augmentation choices become overly familiar). The SSL information bottleneck: the unlabeled data has finite information relevant to downstream tasks, and the SSL objective captures a fraction of it — additional epochs don't unlock new information","C":"The improvement plateau is caused by learning rate decay; increase learning rate at epoch 1000","D":"More epochs require more GPU memory, causing the model to automatically reduce capacity"},"correct":"B","explanation":{"correct":"- Information saturation: SSL learns from data-derived signals. After sufficient training, the model extracts all available information that the SSL objective can expose. Contrastive learning learns invariance to augmentation — more epochs refine this invariance but don't add new information types.\n- Over-specialization risk: with very long training, the model may memorize dataset-specific patterns (which augmentation crops most frequently appear together for each image) rather than learning general features.\n- The logarithmic scaling law: progress in SSL roughly follows a log(epochs) curve. First doublings of epochs yield large gains; later doublings yield smaller gains. This is a general pattern in SSL.","A":"SSL pre-training with more epochs rarely \"actively hurts\" in normal ranges. 
The pattern here is diminishing returns (82% → 82.3%), not degradation. The claim of \"actively hurts\" would require fine-tuning accuracy to decrease with more SSL.","B":"","C":"Learning rate decay affects convergence speed but not the ultimate information saturation limit. Increasing LR at epoch 1000 might help convergence speed but wouldn't break the information saturation ceiling.","D":"GPU memory is fixed by hardware, not by training duration. More epochs don't reduce model capacity — the model architecture stays constant throughout pre-training."},"reference":"- Chen et al., \"A Simple Framework for Contrastive Learning (SimCLR)\" (2020): Figure 9 (training epochs vs accuracy)\n- Assran et al., \"Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture\" (I-JEPA, 2023)"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15001","difficulty":"easy","orderIndex":1,"question":"A graph G = (V, E) has 5 nodes and adjacency matrix A. A GCN (Graph Convolutional Network) updates node features as H^{(l+1)} = σ(D^{-1/2} Ã D^{-1/2} H^{(l)} W^{(l)}), where Ã = A + I (self-loops added). What is the role of the D^{-1/2} Ã D^{-1/2} normalization?","options":{"A":"The normalization prevents gradient explosion by capping all values to [-1, 1]","B":"D^{-1/2} Ã D^{-1/2} is symmetric normalization: for node i, it averages incoming neighbor messages weighted by both the sender's degree and receiver's degree. Without normalization, high-degree nodes (many neighbors) would have very large aggregated features (summing many neighbors). 
With normalization: each neighbor j contributes 1/√(d_i × d_j) to node i's update — nodes with many connections contribute proportionally less, preventing high-degree nodes from dominating the representation","C":"The normalization is used to make the matrix invertible for the backward pass","D":"D^{-1/2} ensures the adjacency matrix has eigenvalues exactly in [-1, 1], which prevents vanishing gradients"},"correct":"B","explanation":{"correct":"- Unnormalized aggregation: H' = Ã H W. Row i: Σ_j Ã_{ij} H_j W = Σ_{j∈N(i)∪{i}} h_j W. For a hub node with 100 neighbors: sum of 100 vectors — the scale is 100× that of a leaf node with 1 neighbor.\n- Degree normalization: D^{-1/2} Ã D^{-1/2} entry (i,j) = 1/√(d_i × d_j). This normalizes: for node i, neighbor j's contribution = h_j / √(d_i × d_j). High-degree nodes (high d_i) receive smaller contributions per neighbor; high-degree neighbors (high d_j) contribute less.\n- The result: features are normalized to similar scales regardless of local graph structure. This allows the same weights W to work across different graph structures.","A":"The normalization ensures consistency of scale across nodes — it doesn't cap values to [-1, 1]. Feature values can be any real number after the normalization.","B":"","C":"The normalization is for feature scale stability, not matrix invertibility. The adjacency matrix Ã can be inverted separately; the symmetric normalization is a design choice for message aggregation.","D":"Eigenvalue control is a consequence (spectral GCN motivates this normalization through eigenvalues of the graph Laplacian), but the practical interpretation is degree-based aggregation normalization. 
The eigenvalue interpretation is the spectral theory motivation."},"reference":"- Kipf & Welling, \"Semi-Supervised Classification with Graph Convolutional Networks\" (2016): https://arxiv.org/abs/1609.02907"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15002","difficulty":"easy","orderIndex":2,"question":"The message passing framework for GNNs involves three steps: (1) message computation, (2) aggregation, and (3) update. Why must the aggregation function be permutation-invariant, and what are examples of valid and invalid aggregation functions?","options":{"A":"Permutation invariance is only required for graph classification tasks, not node classification","B":"Aggregation must be permutation-invariant because neighbor order in a graph is undefined. Node i's neighbors are an unordered set {j₁, j₂, ..., jₖ} — there's no canonical ordering. If the aggregation depended on order (e.g., a concatenation of [m_{j₁}, m_{j₂}, ..., m_{jₖ}]), the same graph with neighbors listed in different order would produce different node representations. Valid aggregations: mean (Σmⱼ/k), sum (Σmⱼ), max (elementwise max), min. Invalid: concatenation (requires fixed order), LSTM over neighbors (order-dependent unless using sorted order, which is arbitrary)","C":"Permutation invariance is only needed because of GPU memory constraints","D":"Concatenation is a valid aggregation; the order of neighbors is fixed by node ID"},"correct":"B","explanation":{"correct":"- Graph property: edges encode connections, not orderings. The neighborhood N(i) = {j : (i,j) ∈ E} is a set, not a sequence.\n- Permutation equivariance vs invariance: aggregation must be permutation-invariant (same output for any permutation of neighbors). The overall GNN is permutation-equivariant (permuting input node features permutes output node features consistently).\n- Mean vs max vs sum: each captures different graph properties. 
Sum is used in Graph Isomorphism Network (GIN) because it can distinguish different numbers of identical neighbors (mean cannot). Max captures the most extreme feature in the neighborhood. Mean provides a \"representative neighbor.\"","A":"Permutation invariance is required for all GNN tasks. Even for graph classification, the intermediate node representations must be permutation-invariant. For node classification, the ordering of neighbors affects the node's representation regardless of final task.","B":"","C":"GPU memory doesn't determine permutation invariance. The requirement comes from the mathematical structure of graphs (unordered sets of neighbors), not hardware limitations.","D":"Using node ID to fix neighbor order is an arbitrary, external ordering not encoded in the graph structure. The same graph with relabeled nodes (same structure, different IDs) should produce the same representations — node ID-based ordering violates this."},"reference":"- Xu et al., \"How Powerful are Graph Neural Networks?\" (GIN) (2019): https://arxiv.org/abs/1810.00826"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15003","difficulty":"medium","orderIndex":3,"question":"A GCN is applied to a social network for node classification. After training, you observe that node embeddings for nodes 3 hops apart have become very similar — even nodes from different communities. This is the \"over-smoothing\" problem. What causes it, and why does adding more GCN layers make it worse?","options":{"A":"Over-smoothing is caused by the dropout applied between GCN layers","B":"Each GCN layer averages a node's features with its neighbors'. After k layers, a node's representation is influenced by its k-hop neighborhood. As k increases, the k-hop neighborhood grows exponentially (in non-sparse graphs) and eventually covers most of the graph. 
The averaging makes all node representations converge to the same direction in feature space, scaled only by a degree-dependent factor (√(d_i + 1) under symmetric normalization), so nodes become indistinguishable apart from their degree. More layers → larger neighborhoods → more averaging → more similar representations. Mathematically: repeated application of the normalized adjacency operator D^{-1/2} Ã D^{-1/2} converges to this trivial limit","C":"Over-smoothing is caused by the softmax normalization becoming saturated after multiple layers","D":"Over-smoothing only occurs when graph diameter < number of layers; for small graphs, it doesn't happen"},"correct":"B","explanation":{"correct":"- Information diffusion: GCN propagation is D^{-1/2} Ã D^{-1/2} H W. Ignoring W (consider a linear GCN): H^{(k)} ∝ (D^{-1/2} Ã D^{-1/2})^k H^{(0)}. As k→∞ on a connected graph, this matrix converges to a rank-1 matrix (the outer product of its dominant eigenvector, whose entries are proportional to √(d_i + 1)), so all rows become scalar multiples of one another. Node representations then differ only by a degree-dependent scale.\n- Practical consequence: for node classification where nodes in different communities should have different representations, over-smoothed GCNs cannot distinguish them. This limits most GCNs to 2-3 layers.\n- Mitigation: residual or skip connections (e.g., JK-Net's jumping knowledge), normalization (PairNorm), or limiting the receptive field.","A":"Dropout doesn't cause over-smoothing. Dropout randomly disables neurons, which adds stochastic diversity to representations rather than averaging them together.","B":"","C":"Softmax is typically not a component of GCN message aggregation layers (only at the output classification). The aggregation uses mean/sum, not softmax.","D":"Over-smoothing is particularly problematic when the number of layers exceeds the graph diameter. 
For a graph with diameter 3 (all pairs within 3 hops), a 6-layer GCN would \"mix\" information beyond the diameter, causing over-smoothing even in small graphs."},"reference":"- Li et al., \"Deeper Insights into Graph Convolutional Networks for Semi-Supervised Classification\" (2018): https://arxiv.org/abs/1801.07606\n- Xu et al., \"Representation Learning on Graphs with Jumping Knowledge Networks\" (2018): https://arxiv.org/abs/1806.03536"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15004","difficulty":"medium","orderIndex":4,"question":"Graph Attention Networks (GAT) compute attention coefficients for each edge (i,j): α_{ij} = softmax_j(LeakyReLU(a^T [W h_i || W h_j])). Compare this to GCN's fixed degree normalization. What does GAT's learned attention provide that GCN cannot, and what is its computational cost?","options":{"A":"GAT and GCN produce identical results; attention only changes training speed","B":"GAT allows node i to assign different weights to different neighbors based on their feature content. GCN's normalization (1/√(d_i × d_j)) depends only on degree (graph structure), not feature content. GAT: \"neighbor j is relevant to node i if their features are related\" — learned from data. This allows task-specific neighbor weighting: for sentiment classification in a social graph, nearby users with similar political views (feature-based) might be more influential than structurally close but semantically distant neighbors. Cost: attention adds one score per edge per head on top of the O(|E| × d) neighbor aggregation that GCN also performs, so the asymptotic cost per head is comparable to GCN, while K attention heads multiply compute and memory by roughly K","C":"GAT can only be used for graph classification; GCN is required for node classification","D":"GAT's attention reduces memory usage because it ignores low-weight neighbors"},"correct":"B","explanation":{"correct":"- GCN's limitation: the normalization 1/√(d_i × d_j) is determined solely by node degrees — a structural property. 
All neighbors contribute equally (after degree adjustment) regardless of their features' relevance.\n- GAT attention: a^T [W h_i || W h_j] computes a scalar for each (i,j) pair based on both nodes' transformed features. Softmax over j normalizes to produce edge weights. The attention is feature-dependent and learned for the specific task.\n- Practical advantage: for citation networks where not all papers cite equally relevant works, GAT can focus on the most semantically related neighbors. Ablations in the original GAT paper show significant improvement over GCN on Cora and Citeseer.","A":"GAT and GCN produce different outputs because GAT uses feature-based attention vs GCN's degree-based normalization. Multiple papers show GAT outperforms GCN on several benchmarks.","B":"","C":"Both GAT and GCN support node classification and graph classification. GAT was originally applied to node classification in the paper. The claim is factually incorrect.","D":"GAT doesn't \"ignore\" low-weight neighbors — it assigns them small but non-zero attention weights. All neighbors are included in the aggregation; only their weights change. Memory usage is O(|E| × d), similar to GCN."},"reference":"- Veličković et al., \"Graph Attention Networks\" (2018): https://arxiv.org/abs/1710.10903"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15005","difficulty":"medium","orderIndex":5,"question":"GraphSAGE (Hamilton et al., 2017) uses neighborhood sampling: instead of using all neighbors, it samples a fixed k neighbors per node per layer. For a node with 1000 neighbors, why is this critical for scaling, and what is the trade-off?","options":{"A":"Sampling is needed because GNNs cannot process more than 100 neighbors","B":"Full neighborhood aggregation creates exponential graph expansion: a 3-layer GNN with full aggregation for a node with avg degree 20 needs 1 + 20 + 20² + 20³ = 8,421 nodes per computation tree. 
For 1,000-neighbor nodes: computation becomes intractable. GraphSAGE samples k₁=25 neighbors for layer 1, k₂=10 for layer 2: fixed k₁×k₂=250 nodes per sample. Memory is O(k₁×k₂×...×batch_size). Trade-off: with sampling, some neighbors are ignored in each forward pass. This introduces variance in the gradient — different samples in different batches produce different gradients. But it enables mini-batch training on graphs with millions of nodes","C":"Sampling reduces memory for storing neighbor feature vectors, but graph structure is still fully used","D":"GraphSAGE sampling is only used at inference; training still uses full neighborhoods"},"correct":"B","explanation":{"correct":"- Neighborhood explosion: in deep GNNs, the computation tree grows exponentially. With full aggregation, an L-layer GNN on a graph with average degree d needs O(d^L) nodes per sample. For a social network with average degree 200 and L=3: 8M nodes per training example — batching becomes impossible.\n- Mini-batch training with GraphSAGE: fix the computation tree size per sample. For each training node, sample exactly k₁ neighbors (layer 1), and for each of those, sample k₂ neighbors (layer 2). Total computation: batch × k₁ × k₂ = fixed budget regardless of graph size.\n- Sampling variance: \"neighbor sampling\" adds noise to the gradient but allows unbiased estimation of each aggregation (the sampled mean is an unbiased estimate of the full mean). PinSage (Pinterest's GraphSAGE deployment) scaled to 3B nodes using this approach.","A":"GNNs have no hard constraint on neighbor count — they can process any number. The issue is computational scalability (exponential growth), not a hard architectural limit.","B":"","C":"GraphSAGE sampling reduces computation and memory by limiting which neighbors are processed. The graph structure IS modified in the sense that unsampled edges are ignored in each pass.","D":"GraphSAGE sampling is used during both training and inference. 
At inference, the same sampling (or full aggregation if feasible) is used to generate node embeddings."},"reference":"- Hamilton et al., \"Inductive Representation Learning on Large Graphs (GraphSAGE)\" (2017): https://arxiv.org/abs/1706.02216\n- Ying et al., \"Graph Convolutional Neural Networks for Web-Scale Recommender Systems (PinSage)\" (2018): https://arxiv.org/abs/1806.01973"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15006","difficulty":"hard","orderIndex":6,"question":"You apply a GCN to a graph with 3 nodes: A-B-C (path graph). After 2 layers, node A's representation is influenced by nodes A, B, C. If you apply a 4th layer, node A's representation at layer 4 is still influenced by A, B, C (same set — the graph has diameter 2). What does this tell you about GNN depth beyond graph diameter?","options":{"A":"More layers always improve GNN performance by refining representations","B":"Beyond graph diameter, additional GNN layers don't expand the receptive field (every node already receives all other nodes' information at layer k = diameter). Extra layers: (1) simply re-aggregate already-aggregated information — adding non-linearity and transformation without new structural information; (2) increase the risk of over-smoothing (representations converge toward similar values); (3) add computational cost without structural benefit. The effective depth for capturing structural information is bounded by the graph diameter. Going beyond: useful only if the non-linear transformations W, σ add task-relevant function composition beyond structural aggregation","C":"Layers 3 and 4 provide gradient shortcuts that improve training stability","D":"After reaching graph diameter, the model automatically switches to fully connected processing"},"correct":"B","explanation":{"correct":"- Receptive field ceiling: for a graph with diameter D, all nodes are within D hops of each other. 
A GNN with L ≥ D layers has full-graph receptive field from step D — adding more layers doesn't add new neighbors.\n- The question is then: does more function composition (more W, σ layers) help? Sometimes yes — deeper function approximation can learn more complex mappings. But the risk of over-smoothing increases.\n- Empirical finding: most GNN papers use 2-3 layers. Deeper GNNs (without special design) often underperform due to over-smoothing. Techniques like residual connections in GCNII (Chen et al., 2020) enable deeper GNNs.","A":"More layers don't always improve GNN performance. For most node classification benchmarks, 2-3 layer GCNs outperform deeper variants due to over-smoothing. The \"always improve\" claim is empirically false.","B":"","C":"Layer 3+ in a GNN don't add gradient shortcuts (those would require residual connections). Without residuals, deeper layers add gradient path length, increasing vanishing gradient risk.","D":"GNNs don't \"switch to fully connected processing.\" The architecture is fixed regardless of graph diameter. After reaching the diameter depth, the same message passing continues (aggregating from the full receptive field)."},"reference":"- Chen et al., \"Simple and Deep Graph Convolutional Networks (GCNII)\" (2020): https://arxiv.org/abs/2007.02133"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15007","difficulty":"hard","orderIndex":7,"question":"The Weisfeiler-Lehman (WL) graph isomorphism test is the theoretical upper bound on GNN expressive power. The GIN (Graph Isomorphism Network) is designed to match this bound. 
What specific GNN design choices make GIN as powerful as the WL test, and what graphs can neither WL nor GIN distinguish?","options":{"A":"GIN uses attention to achieve WL-level expressiveness","B":"GIN's key design choices: (1) SUM aggregation instead of MEAN or MAX — sum can distinguish {1,2} from {1,1,2} (sum=3 vs 4); mean gives 1.5 vs 1.33 (different), but max gives 2 vs 2 (same). Only sum uniquely maps multisets to a value. (2) MLP instead of linear layer — the MLP can approximate any injective function on the multiset histogram. Together: h_v^{(k)} = MLP^{(k)}((1+ε) × h_v^{(k-1)} + Σ_{u∈N(v)} h_u^{(k-1)}). Graphs WL cannot distinguish: regular graphs where all nodes have same degree and k-hop neighborhoods. Any two r-regular graphs on n nodes cannot be distinguished by WL or GIN, requiring higher-order GNNs","C":"GIN uses global pooling after each layer to capture graph-level features for WL-equivalent power","D":"WL test and GIN have identical computational complexity; any GNN matches WL power"},"correct":"B","explanation":{"correct":"- WL test: iteratively assigns colors (hashes) to nodes based on their neighborhood multisets. Two graphs are non-isomorphic if their color histograms differ. GIN's SUM + MLP replicates this: the injective MLP maps multisets to unique representations.\n- MEAN and MAX fail WL-level: mean({1,1}) = mean({1}) = 1 (can't distinguish); max({1,1}) = max({1}) = 1. Sum: sum({1,1}) = 2 ≠ sum({1}) = 1. Sum is the only simple aggregation that distinguishes multiset cardinality.\n- WL limitation: two non-isomorphic regular graphs with identical k-hop structure are indistinguishable. 3D GNNs (using node coordinates) or higher-order WL tests can distinguish these but have higher computational cost.","A":"GAT uses attention (feature-based weights). Attention doesn't address the aggregation function's expressive power problem (distinguishing multisets). 
GAT with mean aggregation is not WL-equivalent.","B":"","C":"Global pooling is used for graph classification, not node classification. GIN's WL-equivalent power comes from the node-level aggregation, not global pooling. Global pooling is applied after GIN layers for graph-level tasks.","D":"\"Any GNN matches WL power\" is false — this is the central contribution of Xu et al. (2019). Most GNNs (using MEAN or MAX aggregation) are strictly less powerful than WL. GIN with SUM + MLP is the specific design that achieves WL-level power."},"reference":"- Xu et al., \"How Powerful are Graph Neural Networks?\" (2019): https://arxiv.org/abs/1810.00826"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15008","difficulty":"medium","orderIndex":8,"question":"You apply a GNN for drug-target interaction prediction. This is a link prediction task on a bipartite graph (drugs on one side, proteins on the other). The GNN is used to predict whether an edge (drug, protein) exists. Compare the appropriate GNN formulation vs node classification GNN — what changes?","options":{"A":"Link prediction uses the same GNN as node classification; no changes are needed","B":"Link prediction uses node embeddings as a substrate, then computes edge scores. The GNN learns node representations h_drug and h_protein. Link prediction score: σ(h_drug^T h_protein) or MLP(concat(h_drug, h_protein)). 
Key differences: (1) Loss is applied to (node_pair, label) tuples instead of (node, label); (2) Negative sampling is critical — real drug-protein pairs are positive; random drug-protein pairs are negative (many non-edges exist); (3) For inductive link prediction (predict edges for unseen drugs/proteins), GraphSAGE-style encoders are needed instead of transductive GCNs that require all nodes at training time","C":"Link prediction requires separate GNNs for drug nodes and protein nodes that are combined by attention","D":"Bipartite graphs cannot be processed by GNNs; use MLP with node features only"},"correct":"B","explanation":{"correct":"- GNN for node classification: loss = CE(h_v, label_v) for each node v. The GNN is trained to produce a good node representation for the classification task.\n- GNN for link prediction: the GNN generates node embeddings, then a \"decoder\" (dot product, MLP, or element-wise product) scores each potential edge. Loss: binary CE(score(i,j), 1) for real edges; binary CE(score(i',j'), 0) for sampled negative pairs.\n- Bipartite graph GNN: handle the two node types separately. Drug nodes aggregate from protein neighbors; protein nodes aggregate from drug neighbors. Alternating 2-layer propagation is common.","A":"Link prediction and node classification have different loss functions and output heads. While the GNN encoder is similar, the task setup (what the GNN optimizes for) is fundamentally different.","B":"","C":"Separate GNNs with attention is one valid approach (multi-view learning), but it's not required. A single unified GNN that updates both drug and protein representations simultaneously is standard and simpler.","D":"GNNs are explicitly designed for graph-structured data and have been applied to bipartite graphs extensively (drug-protein interaction, user-movie recommendation). 
GraphSAGE, GCN, and GAT all support bipartite graphs."},"reference":"- Hamilton et al., \"Embedding Methods for Link Prediction\" (2020 survey)\n- Lim et al., \"Drug-Target Interaction Prediction using GNNs\" (various 2020-2022 papers)"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15009","difficulty":"hard","orderIndex":9,"question":"You train a GNN for graph classification on molecular property prediction. Two molecules have identical atom types and bond types in different arrangements. A standard GCN with mean aggregation labels them as the same class. Why, and how does GIN fix this?","options":{"A":"GCN labels them the same because it ignores bond types","B":"Mean aggregation loses count information: if molecule A has two carbon atoms in the benzene ring and molecule B has one carbon atom, mean({C,C}) = mean({C}) — both give the same average. Standard GCN with mean aggregation cannot distinguish graphs where neighborhoods have different multiplicity of the same atom type. GIN uses SUM: sum({C,C,N}) ≠ sum({C,N}) — captures that the first structure has two carbons where the second has one. The MLP then maps these different sums to different representations. For molecular property prediction where the exact count of specific atoms in a neighborhood matters (e.g., degree of saturation), sum aggregation is critical","C":"GCN labels them the same because molecular graphs are always isomorphic","D":"The fix is to use edge features (bond types) instead of changing aggregation"},"correct":"B","explanation":{"correct":"- Multiset problem: mean aggregation is blind to multiplicity. With the scalar encoding C=1: mean({1,1}) = mean({1}) = 1, while sum({1,1}) = 2 ≠ sum({1}) = 1. Not every pair collides under mean (with C=1, N=0, {C, C, N} vs {C, N} gives 0.67 vs 0.5), but any two neighborhoods with the same proportions of atom types do: {C, N} and {C, C, N, N} both have mean 0.5, yet their sums are 1 and 2. 
With one-hot atom features, the sum is exactly the per-type count vector, so neighborhoods with different atom counts always produce different sums.\n- The deeper issue: mean captures only the distribution of atom types in a neighborhood, not how many atoms there are. Mean cannot distinguish {2 carbons, 2 nitrogens} from {1 carbon, 1 nitrogen}; sum can.\n- GIN's design ensures the representation function is injective on multisets — same multiset gives same representation, different multisets give different representations.","A":"GCN can incorporate bond types as edge features (a valid extension). But the fundamental aggregation problem (mean vs sum for multisets) is separate from edge feature usage. Using edge features with mean aggregation still has the multiset distinguishability problem.","B":"","C":"If two molecules have the same atom/bond types but different arrangements, they are non-isomorphic (different molecular graphs). The GCN's failure is due to the aggregation function, not graph isomorphism.","D":"Edge features help represent bond information but don't address the multiset cardinality problem in aggregation. The aggregation fix (sum) is still needed even with edge features."},"reference":"- Xu et al., \"How Powerful are Graph Neural Networks?\" (2019): Section 3 (GIN design)"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15010","difficulty":"medium","orderIndex":10,"question":"A recommendation system uses a bipartite user-item graph. You train a GNN (LightGCN) that propagates user preferences to item nodes and item characteristics to user nodes. During training, you mask 10% of positive user-item edges as validation set. At inference, for a new user with only 2 interactions, the GNN produces poor recommendations. 
What is the root cause?","options":{"A":"LightGCN requires at least 100 interactions per user; filter out low-interaction users","B":"The cold start problem: a new user with 2 interactions has a k-hop neighborhood containing only 2 items and their shared users. The GNN aggregates: user's embedding ← normalized sum of 2 item embeddings. This provides very limited information for learning user preferences. The 2 items may not represent the user's diverse interests. The GNN is designed for users with enough interaction history to form a meaningful local graph structure. Fixes: (1) hybrid approach combining GNN with content-based features; (2) meta-learning (MAML-style) for few-interaction users; (3) separate cold-start module that uses side information (demographics, item content)","C":"The issue is the masking during training — use all edges for training to fix cold start","D":"Cold start only occurs for new items, not new users; the model should work for any user"},"correct":"B","explanation":{"correct":"- LightGCN propagation: e_u^{(k)} = Σ_{i∈N(u)} e_i^{(k-1)} / √(|N(u)|·|N(i)|) (symmetric degree normalization). For a user with 2 interactions, e_u^{(1)} is a normalized combination of just two item embeddings. This single vector must represent all preferences.\n- Compare to power users with 200 interactions: e_u^{(1)} aggregates 200 diverse items, capturing broad preferences. Layer 2 brings in items interacted with by users similar to our user.\n- The neighborhood structure for 2-interaction users is too sparse for meaningful aggregation. The GNN has limited information to learn preferences from.","A":"There's no hard 100-interaction threshold in LightGCN. The issue is gradual degradation with fewer interactions, not a cliff at a specific count. Filtering users is an extreme solution that eliminates cold-start users entirely.","B":"","C":"Including validation edges in training would cause data leakage (testing on edges the model was trained on). 
This doesn't fix cold start — a new user at inference time still has only 2 interactions regardless of training strategy.","D":"Cold start affects both new users (few interactions) and new items (few ratings). New items have the same problem: a new item with 2 ratings has a sparse neighborhood and is poorly represented."},"reference":"- He et al., \"LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation\" (2020): https://arxiv.org/abs/2002.02126"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15011","difficulty":"hard","orderIndex":11,"question":"Heterogeneous graphs have multiple node types and edge types (e.g., user→reviews→product, user→friends→user). A standard GNN treats all edges equally. What is the specific problem this causes for message passing, and how does HAN (Heterogeneous Attention Network) or RGCN address it?","options":{"A":"Standard GNNs cannot process heterogeneous graphs at all due to different feature dimensions","B":"Different edge types have different semantic meaning: a \"user-reviews-product\" edge carries different information than a \"user-friends-user\" edge. Aggregating both equally conflates semantically different information. Two nodes may be structurally close through \"friends\" edges but semantically unrelated; aggregating friends as if they were product reviews would corrupt the representation. RGCN: separate weight matrix W_r per relation type r: h_v = Σ_r Σ_{u∈N_r(v)} W_r h_u / c_{v,r}. Each relation has its own transformation. 
HAN: uses meta-path-based attention, where meta-paths (user→product→user) create homogeneous subgraphs aggregated with learned attention weights per meta-path type","C":"Heterogeneous graphs require GNNs to be retrained for each edge type independently","D":"Heterogeneous graphs can be made homogeneous by concatenating edge type as a node feature; no architectural change needed"},"correct":"B","explanation":{"correct":"- Semantic mismatch: h_v^{friend-path} encodes social similarity; h_v^{review-path} encodes product preference. Averaging these with the same weight W would produce a representation that mixes two completely different types of relationships.\n- RGCN (Schlichtkrull et al., 2018): each relation type r has its own weight matrix W_r ∈ ℝ^{d×d}. This lets the model learn how to process friend messages differently from review messages. For graphs with many relation types, basis decomposition reduces parameters: W_r = Σ_b a_{rb} V_b.\n- HAN (Wang et al., 2019): meta-path-based approaches aggregate along specific semantic paths (user-buys-product-buys-user: other users who bought the same products), then use attention to weight different meta-paths.","A":"Standard GNNs can process heterogeneous graphs with unified feature spaces — the issue is semantic conflation, not inability. With a feature projection, nodes of different types can be mapped to a common space.","B":"","C":"Training separate GNNs per edge type would produce disconnected representations that can't interact. RGCN integrates all relations in a unified model with relation-specific parameters.","D":"Edge type as a node feature is one approach (edge-conditioned convolutions), but it doesn't address the aggregation problem. 
A node aggregating from 100 friends and 100 product reviews would still mix them equally unless the aggregation is modified."},"reference":"- Schlichtkrull et al., \"Modeling Relational Data with Graph Convolutional Networks (RGCN)\" (2018): https://arxiv.org/abs/1703.06103"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15012","difficulty":"medium","orderIndex":12,"question":"You train a GCN for fraud detection on a financial transaction graph. Fraudulent transactions (1% of edges) are rare. After training, the model achieves 99% accuracy but detects only 5% of actual fraud cases. What is happening and what specifically should be changed about the GNN training?","options":{"A":"GCNs cannot detect fraud; use a CNN instead","B":"The 99% accuracy with 5% fraud recall indicates severe class imbalance exploitation: the model predicts \"not fraud\" for every node (or almost every node), achieving 99% accuracy because 99% of transactions are legitimate. The GNN is optimizing the wrong objective (accuracy on imbalanced data). Fixes: (1) use weighted cross-entropy loss or focal loss that amplifies loss for rare positive (fraud) class; (2) oversample fraud examples in mini-batches; (3) use metrics that account for imbalance (F1, AUROC, Precision-Recall AUC); (4) graph-specific: ensure fraud nodes are well-represented in each mini-batch's computation graph by oversampling their neighbors","C":"The model needs more GNN layers to capture long-range fraud patterns","D":"The 99% accuracy is correct and the 5% fraud recall is acceptable given the class ratio"},"correct":"B","explanation":{"correct":"- Accuracy paradox: with 1% fraud, a model predicting \"not fraud\" for everything achieves 99% accuracy. This is not useful. The model has essentially learned to predict the majority class.\n- Class imbalance in graphs: standard mini-batch GNN training samples nodes uniformly, so fraud nodes (1%) appear rarely. 
The model sees 100× more non-fraud examples and optimizes to predict non-fraud.\n- Focal loss: L = -(1-p_t)^γ × log(p_t). The (1-p_t)^γ factor down-weights easy examples (correctly classified non-fraud with high confidence) and focuses training on hard examples (fraud cases). Used in FICO and other financial ML systems.","A":"GCNs are used in production fraud detection (e.g., at Alibaba: GBDT-GNN, at PayPal). The issue is training configuration, not architecture.","B":"","C":"More layers might help capture fraud ring patterns (connected fraud nodes), but the primary issue is class imbalance. Fixing the imbalance problem would yield immediate improvement; more layers might provide incremental gains.","D":"5% fraud recall (missing 95% of actual fraud) is a critical failure in fraud detection. Real fraud detection systems target >80% recall with acceptable precision. The 99% accuracy metric is meaningless here."},"reference":"- Lin et al., \"Focal Loss for Dense Object Detection\" (2017): https://arxiv.org/abs/1708.02002\n- Wen et al., \"Towards Consumer Loan Fraud Detection: Graph Neural Networks with Role-Based Features\" (various GNN fraud papers)"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15013","difficulty":"easy","orderIndex":13,"question":"Node classification, link prediction, and graph classification are three main GNN tasks. For a drug discovery application (predict which molecules have desired properties), which task is appropriate and what does the model output?","options":{"A":"Node classification — predict property for each atom in the molecule","B":"Graph classification — the entire molecule is the input graph (atoms as nodes, bonds as edges). The model produces a single embedding for the whole graph via a graph-level readout (global mean/sum/max pooling over all node embeddings) and predicts the molecular property (e.g., toxicity, solubility) from this embedding. 
Each molecule is one graph; the label is the molecular property","C":"Link prediction — predict whether two atoms would form a new bond","D":"Node classification is required because molecular properties are atom-level phenomena"},"correct":"B","explanation":{"correct":"- Graph classification setup: input = graph G = (V, E) with atom features on nodes and bond features on edges. GNN produces node embeddings h_v after k layers. Graph-level readout: h_G = READOUT({h_v : v ∈ V}). The READOUT (sum, mean, or attention-based) aggregates all node embeddings into a fixed-size graph vector. Final prediction: ŷ = MLP(h_G).\n- Task suitability: molecular properties (toxicity, solubility, bioactivity) are global properties of the whole molecule, not properties of individual atoms or atom pairs. Graph classification produces a single prediction for the whole graph.\n- Link prediction would be appropriate for: predicting new chemical bonds (bond formation prediction), protein-protein interaction prediction.","A":"Atom-level classification would predict properties for each atom (e.g., NMR shift of each carbon). Molecular toxicity/solubility is a whole-molecule property, not an atom-level property.","B":"","C":"Link prediction predicts whether an edge (bond) exists or will form. Molecular property prediction doesn't require predicting new bonds — the molecular structure is given; the task is to predict the whole-molecule property.","D":"While some molecular properties have atom-level explanations (reactivity centers), the prediction task for drug discovery is typically molecule-level. 
Atom-level classification would predict per-atom properties, not the molecule-level drug property."},"reference":"- Gilmer et al., \"Neural Message Passing for Quantum Chemistry (MPNN)\" (2017): https://arxiv.org/abs/1704.01212"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15014","difficulty":"hard","orderIndex":14,"question":"You compare two GNN training paradigms: transductive (entire graph is visible during training, test nodes are unknown but present in the graph) and inductive (test graphs are completely unseen during training). For a fraud detection system on a bank's transaction graph, which paradigm applies, and what architectural constraint does it impose?","options":{"A":"Transductive learning always applies for graphs; inductive is only for images","B":"Fraud detection on evolving transaction graphs requires inductive learning: new customers, new merchants, and new transactions appear daily — completely unseen nodes must be classified at inference. Inductive GNNs (GraphSAGE, GAT with node features) learn a generalizable aggregation function that can be applied to any graph structure. Transductive GNNs (vanilla GCN) learn node embeddings directly — these embeddings are node-specific and cannot be applied to new nodes not present during training. The architectural constraint: inductive GNNs cannot use node IDs as features (ID-based embeddings don't generalize) and must learn from structural and feature-based aggregation","C":"Fraud detection uses transductive learning because the test graph is a subset of the training graph","D":"Inductive learning is only possible with graph classification, not node classification"},"correct":"B","explanation":{"correct":"- Transductive GCN: learns representations for fixed nodes V at training time. 
For a new node v not in V: no representation exists without retraining.\n- Inductive GNN (GraphSAGE): learns aggregation function f_SAGE(h_v, {h_u : u∈N(v)}) that can generate representations for any node given its features and neighborhood. New customer → sample 25 neighbors from their transaction history → apply f_SAGE → get embedding.\n- ID feature prohibition: if node IDs are used as input features (e.g., one-hot encoding of node index), new nodes have IDs that were never seen during training. The model can't generate representations for them.","A":"Inductive learning is critical for many graph applications. Production recommender systems (PinSage), fraud detection, and drug discovery (predicting on new molecules) all require inductive capability.","B":"","C":"New customers and merchants are not a \"subset of the training graph\" — they are new nodes. A growing transaction graph continuously adds new nodes, requiring inductive inference.","D":"Inductive learning is the standard for node classification in production systems. Graph classification is inherently inductive (each new graph is a test \"node\"). Inductive node classification is just as natural."},"reference":"- Hamilton et al., \"Inductive Representation Learning on Large Graphs (GraphSAGE)\" (2017): Section 4 (Inductive vs Transductive)"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15015","difficulty":"hard","orderIndex":15,"question":"A knowledge graph embedding task uses a GNN to predict missing triples (head, relation, tail). You compare TransE (geometric embedding) with an RGCN + decoder. Your RGCN achieves lower MRR (Mean Reciprocal Rank) than TransE on a benchmark. 
A colleague says \"GNNs are always better than geometric methods for KG completion.\" What fundamental limitation of GNNs explains RGCN's underperformance?","options":{"A":"GNNs require more training data; TransE works with smaller datasets","B":"For knowledge graph completion, GNNs face the \"entity symmetry\" problem: GCN aggregates 1-hop neighbors. If two entities have the same set of relational neighbors (e.g., two cities \"located-in\" the same country and \"has-airport\"), their RGCN representations become identical after aggregation — the GNN cannot distinguish them. TransE models individual entity-relation translations: entity A positioned at e_A such that e_A + r_relation ≈ e_B for each fact (A, r, B). Each entity has its own embedding vector, capturing its unique role across all relations. RGCN conflates entities that share the same relational neighborhood structure, missing fine-grained entity-specific information","C":"TransE is always better than GNNs for all graph tasks; GNNs are overhyped","D":"RGCN needs more layers to match TransE; add 10 layers to fix the MRR"},"correct":"B","explanation":{"correct":"- Entity symmetry in RGCN: consider two cities, Paris and London, both \"located-in\" Europe and \"has-airport\" → True. Their 1-hop neighborhoods are structurally identical. RGCN produces the same embedding. But they're different entities with different properties.\n- TransE's entity-specific embeddings: each entity e ∈ ℝ^d is learned independently. Paris and London have different vectors, even if their local relational structure overlaps.\n- This is a manifestation of the WL test limitation: the WL test (and GNNs by extension) cannot distinguish nodes with identical neighborhood structures. For dense KGs where many entities have similar relational patterns, this is a critical limitation.","A":"Training data size is a factor but not the fundamental explanation. RGCN can underperform on large KGs where entities have similar structures. 
TransE scales well with data.","B":"","C":"GNNs outperform TransE on some KG tasks and datasets, particularly when multi-hop reasoning or structural context is important. \"Always better\" or \"always worse\" claims are both incorrect.","D":"Adding more RGCN layers doesn't solve the entity symmetry problem — it would cause over-smoothing, making distinct entities even more similar. The root cause is the aggregation-based representation, not depth."},"reference":"- Bordes et al., \"Translating Embeddings for Modeling Multi-relational Data (TransE)\" (2013): https://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data\n- Zhang & Chen, \"Link Prediction Based on Graph Neural Networks\" (2018): discusses GNN limitations for KG"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16001","difficulty":"easy","orderIndex":1,"question":"A team fine-tunes a ResNet-50 pretrained on ImageNet for classifying satellite images. They use feature extraction (freeze all layers, train only the final classifier). After 20 epochs, validation accuracy plateaus at 62%. The same team fine-tunes all layers and achieves 84%. What does this reveal about the feature extraction vs fine-tuning decision?","options":{"A":"Feature extraction should always be used; the 62% result is a bug in the training pipeline","B":"Feature extraction assumes ImageNet features transfer well to the target domain. Satellite images (top-down view, different color distribution, no common object classes with ImageNet) differ significantly from ImageNet (natural photography). The deep convolutional layers of ResNet-50, optimized for natural images, produce features that are poorly aligned with satellite image structure. Fine-tuning all layers allows these task-specific features to be learned. 
Feature extraction is appropriate when: (1) target domain is similar to source; (2) target dataset is small (fine-tuning would overfit); fine-tuning is appropriate when: (1) sufficient target data exists; (2) domains differ significantly","C":"Feature extraction is better for large datasets; fine-tuning is for small datasets only","D":"The difference is due to the learning rate; using a lower LR in feature extraction would match full fine-tuning"},"correct":"B","explanation":{"correct":"- Domain distance: ImageNet contains natural photos (animals, objects, scenes). Satellite images have: top-down perspective, different scale (meters per pixel), different color statistics, objects like fields/roads instead of dogs/cars. Early CNN layers learn Gabor-like filters for natural image edges — these generalize. Later layers encode high-level semantic concepts (dog faces, car shapes) — these don't transfer to satellite imagery.\n- The rule of thumb: freeze early layers (generalizable low-level features), fine-tune later layers (task-specific high-level features). For very different domains, fine-tune most or all layers.\n- Yosinski et al. (2014) showed empirically that transferability decays with layer depth for cross-domain transfer.","A":"Feature extraction is not universally applicable. The 62% plateau indicates the frozen features are insufficiently informative for satellite images. The \"bug\" framing is incorrect — it's a domain mismatch issue.","B":"","C":"The relationship is opposite: feature extraction is appropriate for small target datasets (to avoid overfitting), fine-tuning is preferred when target data is abundant enough to update weights without overfitting.","D":"Learning rate affects training stability, not the representation quality. A frozen layer cannot adapt regardless of learning rate. 
The gap is architectural (frozen vs trainable), not optimization-related."},"reference":"- Yosinski et al., \"How transferable are features in deep neural networks?\" (2014): https://arxiv.org/abs/1411.1792"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16002","difficulty":"easy","orderIndex":2,"question":"You fine-tune a BERT model on a 500-example medical QA dataset. After training, training loss → 0.01 but validation loss → 2.8 (heavily overfit). A colleague suggests catastrophic forgetting. Another suggests overfitting. What is the actual issue, and how would you distinguish the two?","options":{"A":"This is definitely catastrophic forgetting; you need to freeze BERT's layers","B":"Overfitting and catastrophic forgetting are distinct: Overfitting: model memorizes training examples; validation loss is high because the model doesn't generalize to unseen examples. BERT-base has 110M parameters; with 500 examples that is roughly 220,000 parameters per example, so the model is massively overparameterized relative to the data. Catastrophic forgetting: model's general language knowledge is overwritten by the new task, losing pretrained language model capabilities. To distinguish: (1) test on a general NLP benchmark (e.g., GLUE task) — if performance collapses, catastrophic forgetting; (2) examine val loss curve — if it rises immediately and steeply, the model is failing to generalize (overfitting), not forgetting prior knowledge. In this case with 500 examples, overfitting is the primary explanation","C":"The issue is that BERT requires at least 10,000 examples; use a smaller model","D":"Catastrophic forgetting only occurs in continual learning settings, not standard fine-tuning"},"correct":"B","explanation":{"correct":"- Overfitting diagnosis: train accuracy ≈ 100%, val accuracy is low. 
With 500 examples and 110M parameters, BERT can memorize all training examples perfectly without learning generalizable patterns.\n- Catastrophic forgetting diagnosis: reduced performance on tasks BERT was originally trained for. You'd check by evaluating on original pretraining tasks (masked LM, next sentence prediction).\n- Both can coexist, but the dominant problem with 500 examples is almost certainly overfitting. Fix: regularization (weight decay, dropout), data augmentation, reduce learning rate, reduce training epochs, use LoRA (train <1% of parameters).","A":"Freezing BERT's layers would cause feature extraction mode — appropriate only if the medical QA task is well-represented by general language features. Freezing prevents the model from learning medical terminology. The recommended approach is parameter-efficient fine-tuning (LoRA) or heavy regularization.","B":"","C":"BERT models are fine-tuned successfully on much smaller datasets (< 100 examples with proper techniques). The issue is not a minimum dataset size requirement.","D":"Catastrophic forgetting is a broader phenomenon. During fine-tuning, if the learning rate is too high and too many epochs are run, the model's weights shift far from the pretrained initialization, losing general capabilities."},"reference":"- Howard & Ruder, \"Universal Language Model Fine-Tuning for Text Classification (ULMFiT)\" (2018): https://arxiv.org/abs/1801.06146\n- Hu et al., \"LoRA: Low-Rank Adaptation of Large Language Models\" (2022): https://arxiv.org/abs/2106.09685"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16003","difficulty":"medium","orderIndex":3,"question":"ULMFiT introduced discriminative fine-tuning, where each layer's learning rate is the rate of the layer above it divided by 2.6 (e.g., for three layers: LR for layer 1 = η/2.6^2, LR for layer 2 = η/2.6, LR for layer 3 = η). 
What is the rationale, and what would happen if you used a uniform high LR across all layers?","options":{"A":"Discriminative LR is used to reduce computational cost; it has no effect on quality","B":"Discriminative LR recognizes that early (low) layers learn general, transferable features (syntax, basic semantics) and late (high) layers learn task-specific features. Fine-tuning requires: late layers to adapt significantly to the new task; early layers to adapt slowly (preserve valuable general features). Uniform high LR: all layers update aggressively. Result: early layers' general representations are overwritten by task-specific signals from the small fine-tuning dataset. The model loses its general language understanding (catastrophic forgetting). Discriminative LR → lower LR for early layers (slow drift from pretrained values) + higher LR for later layers (fast task adaptation)","C":"Uniform LR is fine because later layers dominate the gradient signal anyway","D":"Discriminative LR is needed only for RNNs; Transformers require uniform LR"},"correct":"B","explanation":{"correct":"- Layer-wise learning rate intuition: features become increasingly task-specific with depth. Overwriting general features (low layers) with task-specific fine-tuning signals corrupts valuable pretrained representations.\n- Mathematically: with high LR, the weight update Δw = -η × ∇L can be large relative to the pretrained values w_pretrained. For early layers, this destroys generalizable features. For late layers, this is desirable — the pretrained late-layer features (general NLU) should be replaced with task-specific representations.\n- ULMFiT ablation: discriminative LR consistently outperforms uniform LR in classification experiments.","A":"ULMFiT's paper shows discriminative LR improves test accuracy across multiple NLP benchmarks. 
The quality difference is significant — it's not a compute optimization.","B":"","C":"Later layers do dominate gradient signal at the output, but early layers also receive gradient through backpropagation. Without discriminative LR, early layer gradients can still cause significant weight updates.","D":"Discriminative LR is applicable to both RNNs (ULMFiT's original application) and Transformers. Papers on fine-tuning BERT and GPT models show that using lower LR for early Transformer layers improves performance."},"reference":"- Howard & Ruder, \"Universal Language Model Fine-Tuning for Text Classification\" (2018): https://arxiv.org/abs/1801.06146"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16004","difficulty":"medium","orderIndex":4,"question":"You pretrain a ResNet on natural photos and then fine-tune on medical X-ray images. Despite fine-tuning, the model underperforms a smaller CNN trained from scratch on 50,000 X-ray images. This is negative transfer. What conditions cause negative transfer, and how would you detect it before investing in full fine-tuning?","options":{"A":"Negative transfer means the pretrained model is corrupted; you need to reinitialize","B":"Negative transfer: performance with transfer learning < performance without transfer learning (training from scratch). Causes for X-ray scenario: (1) domain gap — X-rays are greyscale (often 3-channel replicated), inverted brightness (denser tissue = whiter), different spatial scale. ImageNet's RGB color statistics are meaningless; (2) task mismatch — ImageNet classification (1000 diverse categories) vs binary/multi-label pathology detection. The model wastes capacity encoding ImageNet priors. 
Detect before full fine-tuning: (1) compare linear probe (frozen feature) accuracy on 10% of target data vs random init with same architecture and data; (2) measure feature similarity (CKA: Centered Kernel Alignment) between pretrained and optimal target features; if CKA is low, transfer will be poor","C":"Negative transfer only occurs when pretraining dataset is smaller than target dataset","D":"Increasing pretraining epochs prevents negative transfer"},"correct":"B","explanation":{"correct":"- Negative transfer evidence: Raghu et al. (2019) \"Transfusion\" paper found that ImageNet pretraining gave little performance benefit on medical imaging tasks (retinal fundus photographs and chest X-rays), and that small, simple CNNs trained from scratch performed comparably to large ImageNet-pretrained models.\n- Domain gap measurement: CKA similarity between ImageNet-pretrained features and features of a model trained from scratch on X-rays. Low similarity → the two domains require fundamentally different feature representations.\n- Early detection: linear probing on 10% of data takes minutes vs full fine-tuning which takes hours/days. If linear probe with pretrained features doesn't outperform random init features, full fine-tuning is unlikely to help.","A":"Negative transfer doesn't corrupt the pretrained model — the original pretrained weights are unchanged. The issue is that fine-tuning adapts the model away from its pretrained state without reaching a good target-domain solution.","B":"","C":"Negative transfer is primarily about domain/task mismatch, not dataset size comparison. Large pretraining datasets can still transfer negatively to very different domains.","D":"More pretraining epochs on ImageNet would deepen ImageNet-specific features, potentially worsening transfer to X-rays. 
Domain-specific pretraining (on unlabeled X-rays) would help, not more ImageNet epochs."},"reference":"- Raghu et al., \"Transfusion: Understanding Transfer Learning for Medical Imaging\" (2019): https://arxiv.org/abs/1902.07208"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16005","difficulty":"medium","orderIndex":5,"question":"Prototypical Networks (ProtoNets) and MAML (Model-Agnostic Meta-Learning) are two approaches for few-shot learning. Explain the key algorithmic difference and which is more appropriate for few-shot image classification with a new class that has only 5 labeled examples.","options":{"A":"MAML is always better because it optimizes during meta-training and meta-testing","B":"ProtoNets: compute class prototypes = mean embedding of support set examples; classify query by nearest prototype. Non-parametric; no gradient steps at test time. MAML: learn an initialization θ such that a few gradient steps on a new task produces a good model. Meta-test: take 5-10 gradient steps from θ for the new class. For few-shot image classification with 5 labeled examples: (1) ProtoNets are simpler, faster, more robust; the 5 support examples define a reliable prototype in embedding space; (2) MAML requires gradient steps at test time (computationally more expensive) and higher-order gradients during training; (3) Empirically, ProtoNets often match MAML performance on standard benchmarks (miniImageNet, tieredImageNet) while being significantly simpler","C":"Neither applies — few-shot learning requires at least 100 examples","D":"MAML is for NLP only; ProtoNets are for image classification only"},"correct":"B","explanation":{"correct":"- ProtoNets at test time: given 5 labeled examples of \"snow leopard\" (never seen during training): embed each through the encoder φ; compute prototype c = (1/5) Σᵢ φ(xᵢ); classify new query x̂ by argmin_c ||φ(x̂) - c||². 
No gradient descent required.\n- MAML at test time: θ_leopard = θ_init - α ∇_θ L(θ; 5 examples). This requires a forward+backward pass for 5-10 optimization steps. The goal: the gradient steps should quickly adapt the global θ to the new class.\n- Practical consideration: ProtoNets are simpler to implement and debug. MAML involves second-order gradients (computing gradients of gradients) or first-order approximation (FOMAML), which is more complex.","A":"MAML is not always better. For simple few-shot image classification, ProtoNets consistently perform comparably or better with lower computational cost. MAML has advantages in tasks requiring rapid adaptation through gradient steps (e.g., reinforcement learning, regression tasks).","B":"","C":"Few-shot learning is specifically designed for 1-10 labeled examples per class (N-way, K-shot). Both ProtoNets and MAML are designed for this regime and have been validated on 1-shot and 5-shot benchmarks.","D":"Both ProtoNets and MAML are general-purpose few-shot learning algorithms applicable to image classification, NLP, and reinforcement learning. There's no domain restriction."},"reference":"- Snell et al., \"Prototypical Networks for Few-shot Learning\" (2017): https://arxiv.org/abs/1703.05175\n- Finn et al., \"Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML)\" (2017): https://arxiv.org/abs/1703.03400"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16006","difficulty":"medium","orderIndex":6,"question":"LoRA (Low-Rank Adaptation) fine-tunes a pretrained language model by adding trainable rank-r matrices (B ∈ ℝ^{d×r}, A ∈ ℝ^{r×d}) to the attention weight matrices W ∈ ℝ^{d×d}, where r << d. The adapted weight is W' = W + BA. 
Why does this approach prevent catastrophic forgetting, and what is the memory saving for d=4096, r=8?","options":{"A":"LoRA prevents forgetting by freezing the model and training separate heads for each task","B":"LoRA's anti-forgetting mechanism: the original W is frozen (never updated). The adaptation is encoded entirely in BA, initialized as B=zeros, A=random (so BA=0 initially — no initial perturbation). Only A and B are updated. The pretrained knowledge in W is preserved by construction. Memory saving: full fine-tuning W requires d² trainable parameters: 4096² = 16.8M params. LoRA: A has r×d = 8×4096 = 32,768 params; B has d×r = 4096×8 = 32,768 params; total = 65,536 params ≈ 65K. Saving: 16.8M / 65K ≈ 256× fewer trainable parameters per weight matrix","C":"LoRA prevents forgetting by using a replay buffer of pretraining examples during fine-tuning","D":"LoRA works by pruning 90% of weights before fine-tuning, reducing the chance of overwriting important weights"},"correct":"B","explanation":{"correct":"- Preservation by freezing: W_pretrained remains exactly unchanged throughout fine-tuning. Any forgetting in traditional fine-tuning comes from directly updating W. LoRA sidesteps this entirely.\n- BA initialization: B=0 means BA=0 initially. W' = W + 0 = W. As training proceeds, BA learns the task-specific delta. This is a clean starting point with no disruption.\n- Parameter count: for GPT-3 (d=12288), full fine-tuning: 12288² ≈ 151M per attention matrix. With 96 layers × 4 matrices, that is ≈ 58B parameters just for attention. LoRA r=4: 96 × 4 × 2 × 12288 × 4 = 37.7M total LoRA parameters — a ~1500× reduction.","A":"LoRA does freeze the model body, but it doesn't train \"separate heads per task.\" The LoRA adapter (BA) modifies the same weight matrices used for all tasks. 
Task separation via separate heads is a different approach (multi-task heads, adapter layers).","B":"","C":"Replay buffers (experience replay) are a technique from continual learning (e.g., rehearsal methods such as GEM, which keeps an episodic memory of past examples). LoRA doesn't use replay buffers. It prevents forgetting through architectural design (frozen base weights).","D":"LoRA doesn't prune weights. All d² parameters of W are retained but frozen. Pruning would remove parameters; LoRA adds parameters (the BA matrices) while keeping W intact."},"reference":"- Hu et al., \"LoRA: Low-Rank Adaptation of Large Language Models\" (2022): https://arxiv.org/abs/2106.09685"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16007","difficulty":"hard","orderIndex":7,"question":"You fine-tune GPT-2 (117M params) on a 1000-example customer service chatbot dataset. During fine-tuning with a learning rate of 5e-4, the model quickly learns to generate helpful responses but loses its general text coherence (produces grammatically broken sentences outside the chatbot domain). What is happening mechanically, and what are two independent fixes?","options":{"A":"GPT-2 is too large for 1000 examples; use a 2-layer LSTM instead","B":"The high LR (5e-4) aggressively updates all 117M parameters. With 1000 examples, the model rapidly overfits to chatbot patterns: specific phrasing, vocabulary, and response structures. The large updates overwrite the pretrained language modeling capabilities (grammar, coherence). Mechanically: the weight updates Δw = -η × ∇L are large (η=5e-4 is high for GPT-2 fine-tuning; typical: 1e-5 to 5e-5). Fix 1: reduce LR to 1e-5 — smaller updates preserve pretrained representations while allowing gradual task adaptation. Fix 2: use LoRA (r=8) — freeze all 117M params, add 0.5M trainable params. 
Only the low-rank adapters update; GPT-2's language model weights are frozen, preserving coherence","C":"The issue is gradient clipping; enable gradient clipping to fix coherence","D":"Fine-tuning GPT-2 on 1000 examples is the correct approach; the coherence issue resolves after more training"},"correct":"B","explanation":{"correct":"- LR impact: for pretrained LLMs, fine-tuning LR is typically 1-2 orders of magnitude lower than pretraining LR. Models of GPT-2's size are typically pretrained with a peak LR on the order of 6e-4 (the value reported for GPT-3's 125M configuration), together with a large batch and warm-up. Fine-tuning at 5e-4 with a small batch and no warm-up schedule applies updates at ≈ pretraining magnitude, treating the model as if training from scratch.\n- Forgetting speed: 1000 examples × multiple epochs = thousands of gradient steps. Each step at 5e-4 drifts the weights significantly from pretrained initialization.\n- LoRA fix: with B=0 initialized adapters, only the low-rank matrices capture task knowledge. The base model's language knowledge is architecturally protected.","A":"Model size alone doesn't determine fine-tuning success. With proper regularization (lower LR, LoRA, early stopping), GPT-2 can be effectively fine-tuned on small datasets. Switching to an LSTM would lose GPT-2's language knowledge.","B":"","C":"Gradient clipping (||∇|| ≤ max_norm) prevents large gradient magnitudes but doesn't prevent many small updates from cumulatively overwriting pretrained weights. Gradient clipping is useful for training stability but doesn't address catastrophic forgetting.","D":"More training with a high LR would worsen catastrophic forgetting, not resolve it. 
As training continues, the model's weights move further from the pretrained initialization."},"reference":"- Mosbach et al., \"On the Stability of Fine-Tuning BERT: Misconceptions, Explanations, and Strong Baselines\" (2021): https://arxiv.org/abs/2006.04884"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16008","difficulty":"hard","orderIndex":8,"question":"Elastic Weight Consolidation (EWC) is a continual learning technique that adds a regularization term L_EWC = (λ/2) Σ_i F_i(θ_i - θ*_i)², where F_i is the Fisher information for parameter i and θ*_i is the pretrained value. For fine-tuning on Task B after training on Task A, what does F_i represent and why does it address catastrophic forgetting better than simple L2 regularization (L = (λ/2) Σ_i (θ_i - θ*_i)²)?","options":{"A":"F_i is the gradient magnitude; EWC and L2 are equivalent when gradients are uniform","B":"F_i is the Fisher information — the expected squared gradient of the log-likelihood with respect to θ_i on Task A's data: F_i = E[(∂ log p(y|x,θ) / ∂θ_i)²]. This estimates how sensitive Task A's loss is to parameter θ_i. High F_i → parameter θ_i is important for Task A; changing it will hurt Task A performance. EWC advantage over L2: L2 penalizes all weight changes equally — it treats a parameter critical to Task A (large Fisher) the same as a parameter irrelevant to Task A (small Fisher). EWC concentrates regularization on important parameters. 
This allows parameters irrelevant to Task A to freely update for Task B (flexibility), while protecting critical Task A parameters (forgetting prevention)","C":"EWC uses the Hessian diagonal; the Fisher information is only an approximation of the Hessian","D":"L2 regularization prevents catastrophic forgetting equally well as EWC; F_i is just used for computational efficiency"},"correct":"B","explanation":{"correct":"- Fisher information intuition: if changing θ_i by a small amount ε significantly changes the log-likelihood for Task A's data, θ_i is important for Task A. F_i captures this: F_i = E[(∂ log p(y|x,θ*) / ∂θ_i)²]. Under the Laplace approximation, the posterior over θ_i near θ*_i is Gaussian with precision F_i.\n- L2 vs EWC: consider θ_j, a parameter used exclusively for Task A (high F_j), and θ_k, irrelevant to Task A (F_k ≈ 0). L2 penalizes both equally. EWC: heavy penalty on θ_j (preserve Task A), almost no penalty on θ_k (free to learn Task B).\n- Practical effect: EWC enables selective plasticity: parameters unimportant to Task A can change freely, while important ones are strongly penalized for moving away from θ*.","A":"F_i is not the gradient magnitude; it's the expected squared gradient of the log-likelihood. Gradient magnitude during Task B training measures sensitivity to Task B, not Task A. F_i is computed once from Task A data.","B":"","C":"F_i (Fisher diagonal) is indeed related to the Hessian diagonal. Under mild conditions (near the MLE), F_i ≈ -E[∂² log p / ∂θ_i²], the diagonal of the Hessian of the negative log-likelihood. This relationship is used as the EWC motivation. The statement \"only an approximation\" is technically true but doesn't make the claim incorrect — the Fisher diagonal is the standard EWC formulation.","D":"L2 regularization does not perform as well as EWC for continual learning. The Fisher-weighted regularization is the key innovation in EWC. 
Papers show EWC significantly outperforms L2 regularization in sequential task learning benchmarks."},"reference":"- Kirkpatrick et al., \"Overcoming catastrophic forgetting in neural networks (EWC)\" (2017): https://arxiv.org/abs/1612.00796"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16009","difficulty":"medium","orderIndex":9,"question":"A 3-layer CNN pretrained on ImageNet is fine-tuned for a new task with only 200 labeled examples. You test two strategies: (A) freeze layer 1-2, fine-tune layer 3 + new classifier head; (B) fine-tune all layers with a very low LR (1e-6). Which strategy is less likely to overfit, and why?","options":{"A":"Strategy B always outperforms A for small datasets because it updates more parameters","B":"Strategy A (freeze early layers) is less likely to overfit: 200 examples can only reliably train a small number of parameters. Frozen layers provide fixed feature extraction — the trainable parameter count is reduced to layer 3 + head. Fewer parameters relative to data → less overfitting. Strategy B: all 3 layers update, but with LR=1e-6. Very small updates mean the layers drift very slowly — overfitting is prevented through update magnitude limitation rather than frozen architecture. Trade-off: Strategy A is more robust to overfitting but may underperform if early layers provide suboptimal features. Strategy B allows richer adaptation but risks eventual overfitting. Recommendation: use Strategy A with early stopping, or LoRA","C":"Neither strategy can work with 200 examples; you must use data augmentation first","D":"Strategy B cannot learn anything at LR=1e-6; the gradients vanish before reaching layer 1"},"correct":"B","explanation":{"correct":"- Parameter count comparison: if layer 1-2 have 500K params, layer 3 has 200K, head has 1K: Strategy A trains 201K params; Strategy B trains 701K params. 
With 200 examples, 200 examples / 201K params = 1 example per 1000 parameters (still sparse, but 3.5× better than Strategy B).\n- LR=1e-6 effect: the weight update per step = 1e-6 × gradient. For typical gradient magnitude ~1, updates ~1e-6 per step. After 200 examples × 10 epochs = 2000 steps, total drift ≈ 2000 × 1e-6 = 0.002 — very small weight changes.\n- Both strategies have merit; the best approach depends on how similar the source and target domains are.","A":"Updating more parameters with 200 examples is a recipe for overfitting, not improvement. The fundamental challenge is generalization with limited data. More trainable parameters require more data.","B":"","C":"Data augmentation is a complementary technique (not a prerequisite). With proper augmentation, both strategies can work. But augmentation doesn't resolve the architectural question about which strategy overfits less.","D":"LR=1e-6 doesn't cause vanishing gradients. Gradients flow normally through backpropagation; the LR only scales the update step. The model can learn at LR=1e-6 — just slowly."},"reference":"- Kornblith et al., \"Do Better ImageNet Models Transfer Better?\" (2019): https://arxiv.org/abs/1805.08974"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16010","difficulty":"hard","orderIndex":10,"question":"Adapter layers are small bottleneck modules inserted between Transformer layers: h → LayerNorm → down-project (d→m, m< Volume factor when: (1) target domain has unique structural properties absent in general domain (X-rays vs photos); (2) task requires fine-grained domain-specific distinctions; (3) target dataset is small (domain-specific features reduce the needed fine-tuning to align representations). 
Counter-case: for tasks where both domains apply (e.g., skin lesion detection — photos share some properties), ImageNet with 50× data may win","C":"The result is due to RadImageNet being harder to overfit; size doesn't affect generalization","D":"Neural scaling laws guarantee more data = better; the experiment must have a flaw"},"correct":"B","explanation":{"correct":"- Feature alignment vs volume: consider linear probing: frozen pretrained features → logistic regression on target task. If domain-specific features score 0.85 and ImageNet features score 0.62, fine-tuning can improve both but starts from a better initialization with domain-specific pretraining.\n- CKA analysis: Raghu et al. and Nguyen et al. used CKA to measure feature similarity between pretrained models and task-optimal models. Domain-specific pretraining produces features with higher CKA similarity to the target task, requiring less adaptation.\n- Practical guidance: for specialized domains (medical imaging, satellite imagery, molecular biology), domain-specific pretraining often outperforms general pretraining even with less data.","A":"Domain-specific pretraining is not always better. For target tasks well-covered by general pretraining (e.g., detecting office objects, classifying natural animals), ImageNet pretraining with vastly more data and diversity wins. \"Always better\" is an overstatement.","B":"","C":"Overfitting difficulty doesn't explain transfer quality. The explanation is feature alignment: what the model learns to represent during pretraining.","D":"Neural scaling laws apply within a domain and training paradigm. They don't claim that data from a different distribution always improves performance. Cross-domain transfer violates the i.i.d. 
assumption underlying scaling laws."},"reference":"- Mei et al., \"RadImageNet: An Open Radiologic Deep Learning Research Dataset for Effective Transfer Learning\" (2022): https://arxiv.org/abs/2201.09600"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16012","difficulty":"medium","orderIndex":12,"question":"Domain adaptation addresses the scenario where training distribution p_source ≠ test distribution p_target. In unsupervised domain adaptation (UDA), you have labeled source data and unlabeled target data. How does Domain-Adversarial Neural Network (DANN) use a gradient reversal layer (GRL) to learn domain-invariant features?","options":{"A":"GRL flips gradients to make the classifier worse, improving domain adaptation by adversarial training on the label space","B":"DANN has three components: feature extractor G_f, label classifier G_y (on source labels), domain classifier G_d (predicts source vs target). GRL sits between G_f and G_d. Forward pass: normal. Backward pass through GRL: gradients are multiplied by -λ (reversed). Effect: G_d tries to distinguish source from target; reversed gradient tells G_f to produce features that maximally confuse G_d. Result: G_f learns features where source and target are indistinguishable (domain-invariant). Simultaneously, G_y trains G_f to keep label-discriminative information. The learned features are both label-predictive AND domain-invariant, so a classifier trained on source labels transfers better to the unlabeled target domain","C":"GRL prevents the feature extractor from training; only G_y and G_d update during backprop","D":"Domain adaptation only works when source and target have the same number of classes"},"correct":"B","explanation":{"correct":"- Minimax objective: min_{G_f, G_y} max_{G_d} [L_y(G_y(G_f(x_source)), y) - λ L_d(G_d(G_f(x)), d)]. G_f and G_y minimize the label loss; G_d minimizes the domain classification loss L_d, while G_f maximizes L_d (driving the domain classifier's accuracy toward chance). 
GRL implements this minimax through the reversal trick.\n- Why domain invariance helps: if G_f produces features where source and target look the same, a classifier trained on source features can be applied to target features without explicit target labels.\n- Limitation: domain invariance alone is not sufficient. If the conditional distribution p(y|features) differs between domains (conditional shift), or the class proportions p(y) differ (label shift), DANN may fail.","A":"GRL doesn't flip gradients for the label classifier (G_y). G_y receives normal gradients from the label classification loss. G_d also receives normal gradients from the domain loss; the reversal applies only to the gradient passed from G_d back into G_f. The adversarial training targets domain confusion, not label confusion.","B":"","C":"GRL only affects gradient flow (multiplies by -λ). The feature extractor G_f receives gradients from both paths: normal gradients from G_y's label loss (learn discriminative features) and reversed gradients from G_d's domain loss (learn domain-invariant features). G_f updates from both.","D":"DANN works regardless of class count differences. The domain classifier is binary (source vs target) regardless of the number of task classes. Many practical domain adaptation scenarios have different class distributions between source and target."},"reference":"- Ganin & Lempitsky, \"Unsupervised Domain Adaptation by Backpropagation (DANN)\" (2015): https://arxiv.org/abs/1409.7495"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16013","difficulty":"hard","orderIndex":13,"question":"CLIP (Contrastive Language-Image Pre-Training) enables zero-shot transfer: a model trained on image-text pairs can classify images into unseen classes without fine-tuning, by comparing image embeddings to text embeddings of class descriptions. 
If CLIP is used for zero-shot classification of pathological tissue types, and zero-shot CLIP achieves 51% top-1 accuracy, while a ResNet fine-tuned with 100 labeled examples achieves 79%, which approach should be chosen and what is the core limitation of CLIP's zero-shot transfer here?","options":{"A":"Always use CLIP zero-shot; fine-tuning introduces catastrophic forgetting","B":"Choose the fine-tuned ResNet (79% > 51%). Core limitation of CLIP zero-shot for pathology: CLIP was trained on internet image-text pairs — pathological tissue images (H&E staining, microscopy) are rare or absent in web data. The text descriptions (\"carcinoma\", \"adenocarcinoma\", \"dysplasia\") are specialized medical terms that CLIP's text encoder associates with radiology reports or textbooks, not with actual H&E-stained tissue images. Domain gap: CLIP's image encoder was not trained on microscopy images; its visual representations don't align with the specific features pathologists use. 100 labeled examples are often sufficient to train a ResNet head (or fine-tune the last few layers) to learn domain-specific visual distinctions","C":"CLIP zero-shot is always preferable because it doesn't require any labeled data","D":"Fine-tuned ResNet is better only because it has more parameters; use CLIP with more parameters to match"},"correct":"B","explanation":{"correct":"- 51% vs 79%: a 28-percentage-point accuracy gap is decisive. The cost of labeling 100 examples (hours of pathologist time) is justified by the performance gain in a clinical setting.\n- CLIP's zero-shot strength: it excels on natural image categories well-represented in web data (ImageNet-like classes). 
For domain-specific visual concepts (pathology, radiology, satellite imagery), zero-shot performance degrades significantly.\n- BiomedCLIP, PathCLIP: domain-specific CLIP-like models pretrained on medical image-text pairs report higher zero-shot performance on medical images, highlighting that the limitation is domain gap, not the framework itself.","A":"CLIP zero-shot doesn't suffer catastrophic forgetting (nothing is fine-tuned). The concern is the opposite: CLIP's zero-shot representations don't transfer well to highly specialized domains.","B":"","C":"\"No labeled data required\" is an advantage of zero-shot learning, but it's not sufficient justification when zero-shot performance is 28 percentage points below fine-tuned performance in a high-stakes domain (clinical pathology). The cost of 100 labels is worth a 28-point accuracy gain.","D":"The gap is not due to parameter count. In this scenario, ResNet-50 (25M params) with 100-example fine-tuning outperforms CLIP ViT-B/32 (~151M params in total, of which roughly 88M are in the image encoder). The issue is representation alignment, not model capacity."},"reference":"- Radford et al., \"Learning Transferable Visual Models from Natural Language Supervision (CLIP)\" (2021): https://arxiv.org/abs/2103.00020\n- Zhang et al., \"BiomedCLIP\" (2023): domain-specific medical CLIP"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16014","difficulty":"hard","orderIndex":14,"question":"You fine-tune a Vision Transformer (ViT-L/16, 307M params) on a 2,000-example art classification dataset using full fine-tuning. Training set accuracy = 97%, validation accuracy = 58%, suggesting severe overfitting. You compare five interventions: (A) weight decay 0.01, (B) frozen patch embedding + 50% of transformer blocks, (C) LoRA r=16, (D) 10× data augmentation (flips, rotations, color jitter), (E) dropout 0.3 in all attention layers. Rank these from most to least effective.","options":{"A":"A > B > C > D > E","B":"Most effective → least effective: D > B > C > A > E. 
(D) Data augmentation effectively multiplies the 2,000 examples, directly addressing the data scarcity problem — 10× augmentation ≈ 20,000 examples, dramatically reducing overfitting. (B) Freezing 50% of blocks reduces trainable parameters from 307M to ~150M while preserving pretrained features. (C) LoRA r=16: trainable parameters ≈ 2×16×d_model per adapted d_model×d_model matrix. For ViT-L (d=1024), one LoRA pair: 2 × 1024 × 16 = 32,768 params. Across all 24 blocks: about 1.6M params when only the query and value projections are adapted, about 7M when every linear layer is adapted, a massive reduction from 307M. (A) Weight decay penalizes large weights but doesn't reduce parameter count. (E) Dropout in attention is the least effective alone — it doesn't directly address the parameter:data ratio problem","C":"C > D > B > E > A","D":"All interventions are equally effective; use any combination"},"correct":"B","explanation":{"correct":"- Severity of overfitting: 97% train vs 58% val is extreme. With 307M params and 2K examples, there are roughly 150K trainable parameters per training example, far more capacity than 2K examples can constrain.\n- Data augmentation: art classification benefits especially from geometric (rotation, flips) and color augmentation. 10× augmentation with a 2K dataset gives 20K diverse examples without any architectural change.\n- LoRA vs weight decay: LoRA reduces the trainable parameter count to well under 3% of the full model. Weight decay constrains weight magnitudes but keeps all 307M parameters active and potentially able to overfit.\n- Dropout in attention: attention dropout disrupts the attention patterns learned during pretraining and can hurt representation quality. It's a blunt instrument for this specific overfitting problem.","A":"Ranking A (weight decay) above B (frozen blocks) and C (LoRA) is incorrect. Weight decay is among the weaker regularizers here — it doesn't fundamentally address the parameter:data imbalance. Frozen blocks and LoRA directly reduce the effective number of trainable parameters.","B":"","C":"While LoRA is highly effective, ranking it above data augmentation is debatable. 
Data augmentation directly increases the effective dataset size, addressing the root cause. LoRA reduces the model's capacity to overfit but doesn't increase information content.","D":"The interventions have different expected magnitudes of effect based on the overfitting mechanism. Data augmentation addresses data scarcity; parameter reduction (B, C) addresses model complexity; A and E are weaker regularizers."},"reference":"- Touvron et al., \"Training data-efficient image transformers (DeiT)\" (2021): augmentation strategies for ViT\n- He et al., \"Masked Autoencoders Are Scalable Vision Learners\" (2022): fine-tuning ViT at various dataset scales"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16015","difficulty":"hard","orderIndex":15,"question":"A continual learning system trains on Tasks A, B, C, D sequentially. After training on D, it shows perfect performance on D, 90% on C, 65% on B, and 12% on A. A senior engineer argues that this is expected behavior and the performance difference is statistically insignificant. What does the performance pattern actually reveal, and what is the fundamental trade-off in continual learning that makes this an unsolved problem?","options":{"A":"The pattern shows normal learning — earlier tasks are naturally harder","B":"The pattern shows classic catastrophic forgetting (also called \"catastrophic interference\"): Task A's performance (12%) is near chance — the model has forgotten nearly everything from Task A. Each new task updates weights optimizing the current task, progressively overwriting earlier task representations. The fundamental trade-off (plasticity-stability dilemma): Stability: preserve performance on previous tasks (resist changing old weights). Plasticity: learn new tasks effectively (freely update weights). These are mutually contradictory: learning Task D requires changing weights, which disrupts Task A-C representations. 
No current approach fully resolves this. EWC (Fisher-weighted regularization) partially mitigates it. Replay methods (store samples from old tasks) reduce forgetting at memory cost. Progressive Neural Networks (column per task) have full stability but grow linearly in parameters. None achieve human-level continual learning","C":"The pattern is caused by task D being more recent; add a time-based decay to fix it","D":"Use higher learning rates on early tasks to make them \"stickier\" in the network"},"correct":"B","explanation":{"correct":"- 12% on a classification task: if Task A is 10-class, random chance = 10%. The model has essentially random performance on Task A — complete forgetting.\n- Stability-plasticity dilemma: biological neural systems solve this through complementary learning systems (hippocampus = fast learning/plasticity; neocortex = slow consolidation/stability). Current neural networks have a single weight space serving all tasks — no natural separation.\n- State of the field (2024): EWC, PackNet, GEM, A-GEM, and other continual learning methods improve over naive sequential training but still show non-trivial forgetting. Few-shot learning and meta-learning partially address this for related tasks.","A":"The performance pattern is not \"expected normal behavior.\" Task A being at 12% (near chance) is not \"naturally harder\" — it was presumably mastered before Task B began. The degradation across tasks is caused by training Task B, C, D, not by task difficulty.","B":"","C":"Time-based decay doesn't address the fundamental problem. Decaying old weights would make forgetting worse, not better. The goal is to preserve old task performance, which requires resisting weight changes on important old parameters.","D":"Using higher LR on earlier tasks would cause them to be learned initially with \"larger\" representations, but subsequent task training would still overwrite those weights. 
The problem is sequential overwriting during later task training, not initial learning rate."},"reference":"- Kirkpatrick et al., \"Overcoming catastrophic forgetting in neural networks (EWC)\" (2017): https://arxiv.org/abs/1612.00796\n- Parisi et al., \"Continual lifelong learning with neural networks: A review\" (2019): https://arxiv.org/abs/1802.07569"}],"practiceMcqs":[{"section":"deep-learning","difficulty":"easy","id":"dl-e001","topicSlug":"introduction-to-neural-networks","orderIndex":1,"topic":"Introduction To Neural Networks","question":"A perceptron computes f = 1 if w₁x₁ + w₂x₂ + b ≥ 0. A student wants to implement the AND gate (output 1 only when x₁=1 AND x₂=1). They try w₁=1, w₂=1, b=-1. For input (1,0), they compute z = 1+0-1 = 0, which fires (≥ 0). Is this correct AND behavior, and what bias would fix it?","options":{"A":"Yes — AND should fire at z ≥ 0, so b=-1 is correct","B":"No — input (1,0) should output 0 for AND. z=0 fires, so the decision boundary is wrong. Setting b=-1.5 fixes it: (1,1): z=0.5>0 ✓; (1,0): z=-0.5<0 ✓; (0,1): z=-0.5<0 ✓; (0,0): z=-1.5<0 ✓","C":"No — AND cannot be implemented by a perceptron regardless of weight values","D":"Yes — AND requires that the sum of inputs equals 2, so z=0 is the correct threshold for (1,0)"},"correct":"B","explanation":{"correct":"- Perceptron boundary: fires when z ≥ 0. With b=-1, inputs (1,0) and (0,1) give z=0, which fires — but AND should output 0 for these cases.\n- Fix: shift the boundary so it falls between the weighted sum when both inputs are 1 (w·x=2) and the weighted sum when exactly one input is 1 (w·x=1). With w=[1,1], the threshold must lie strictly between those two sums (1 vs 2). b=-1.5 places the boundary (z=0) at w·x=1.5, which falls between 1 and 2.\n- This shows the bias term's role: it translates the decision boundary without changing the hyperplane orientation.","A":"z=0 is a boundary condition that fires (≥ 0). For AND, (1,0)→0 so it must not fire. 
The threshold is set too loosely.","B":"","C":"AND is linearly separable (you can draw a line separating the 3 \"false\" points from 1 \"true\" point in 2D). A perceptron can implement it — unlike XOR.","D":"The sum of inputs for (1,0) is 1, not 2. With w=[1,1] and b=-1, z=0 for (1,0), which triggers activation. That's the bug."},"reference":"- Rosenblatt, \"The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain\" (1958)"},{"section":"deep-learning","difficulty":"easy","id":"dl-e002","topicSlug":"introduction-to-neural-networks","orderIndex":2,"topic":"Introduction To Neural Networks","question":"You train a 2-layer neural network (1 hidden layer) with ReLU on a 2D classification dataset. On the training set, the model achieves 98% accuracy. A colleague says \"this means the model has learned the true underlying function.\" Another says \"it may have memorized the training data.\" What single experiment most directly distinguishes the two?","options":{"A":"Train for more epochs — if accuracy stays at 98%, the model has generalized","B":"Evaluate on a held-out test set — if test accuracy is also ~98%, the model generalizes; if test accuracy drops significantly (e.g., 60%), it has memorized training data without learning the underlying pattern","C":"Plot the loss curve — a decreasing training loss confirms generalization","D":"Increase model capacity — if a larger model also achieves 98%, then the pattern is real"},"correct":"B","explanation":{"correct":"- Generalization test: the training accuracy tells you nothing about generalization. Any sufficiently large network can memorize any finite dataset (Zhang et al. 
2017 showed networks can memorize randomly labeled CIFAR-10).\n- The held-out test set (unseen data from the same distribution) is the gold standard for whether the model has learned the underlying function or just the training set.\n- A train/test accuracy gap (high train, low test) is the definition of overfitting (memorization). Equal train and test accuracy suggests the learned function generalizes.","A":"Training longer with the same data cannot reveal whether the model generalizes — it only confirms that the model can still fit the training set.","B":"","C":"A decreasing training loss is expected for any model with enough capacity regardless of generalization. It measures fit, not generalization.","D":"Testing a larger model on the same training data doesn't test generalization. Both small and large models can achieve 98% training accuracy while generalizing differently."},"reference":"- Zhang et al., \"Understanding Deep Learning Requires Rethinking Generalization\" (2017): https://arxiv.org/abs/1611.03530"},{"section":"deep-learning","difficulty":"easy","id":"dl-e003","topicSlug":"neurons-and-perceptrons","orderIndex":3,"topic":"Neurons And Perceptrons","question":"In a network `z = Wx + b`, you set all biases to zero at initialization. Unlike weights, all biases can remain zero and the network still works. True or False — and what specific problem occurs if you also initialize all weights to the same constant (e.g., 0.001) in addition to zero biases?","options":{"A":"True — zero biases are fine; zero weights are also fine because the gradient will differentiate them during training","B":"False — zero biases already cause a symmetry problem because all neurons in a layer produce identical pre-activations","C":"True — zero biases are fine. 
But identical weights (e.g., all 0.001) cause the symmetry problem: every neuron in a layer receives identical gradients, so all weights update identically at every step — the layer effectively has just one unique neuron regardless of width, and the network never develops diverse representations","D":"False — biases must always be initialized to 1 to prevent the vanishing gradient problem"},"correct":"C","explanation":{"correct":"- Zero bias: fine for most architectures. Biases are independent per neuron — setting them to 0 initializes the threshold at 0, which is a reasonable starting point. Gradient updates for biases are `∂L/∂b = δ`, which differ per neuron once weights differ.\n- Symmetric weight problem: if all weights in a layer are identical, then every neuron computes the same z = w·x + b. The same activation, same gradient → same weight update → remain identical forever. This is the symmetry problem.\n- The layer collapses to a single effective neuron regardless of its width.","A":"Identical weights cause the symmetry problem regardless of the learning process — gradients are identical for identical neurons, so training cannot break the symmetry.","B":"Zero biases alone do NOT cause a symmetry problem. Bias gradients ∂L/∂b = δ only equal zero when the activation gradient δ is zero. With different weights per neuron, δ differs, so biases diverge.","C":"","D":"Bias initialization to 1 is used in some specific cases (e.g., LSTM forget gate bias initialized to 1), but it's not a general requirement and is not related to vanishing gradients."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 8: Optimization for Training Deep Models"},{"section":"deep-learning","difficulty":"easy","id":"dl-e004","topicSlug":"neurons-and-perceptrons","orderIndex":4,"topic":"Neurons And Perceptrons","question":"A fully connected layer maps 512 input features to 256 output features. Including biases, how many trainable parameters does this layer have? 
And if you add a second identical FC layer (256 → 256) after it, what is the total parameter count for both layers combined?","options":{"A":"Layer 1: 512 × 256 = 131,072 params. Layer 2: 256 × 256 = 65,536 params. Total: 196,608","B":"Layer 1: 512 × 256 + 256 = 131,328 params. Layer 2: 256 × 256 + 256 = 65,792 params. Total: 197,120","C":"Layer 1: (512 + 1) × 256 = 131,328 params. Layer 2: (256 + 1) × 256 = 65,792 params. Total: 197,120 (same as B, different calculation)","D":"Layer 1: 512 × 256 = 131,072. Layer 2: 256 × 256 = 65,536. Bias is a single global parameter, so total = 196,608 + 1 = 196,609"},"correct":"B","explanation":{"correct":"- Each FC layer: W of shape (d_out, d_in) + bias vector b of shape (d_out,).\n- Layer 1: W₁ has 512 × 256 = 131,072 weights + 256 biases = 131,328 params.\n- Layer 2: W₂ has 256 × 256 = 65,536 weights + 256 biases = 65,792 params.\n- Total: 131,328 + 65,792 = 197,120 params.\n- The bias is one scalar per output neuron (not a single global value), so each layer adds d_out bias terms.","A":"Forgets to include the bias vectors. This is a common mistake when counting parameters. Both layers have one bias per output unit.","B":"","C":"","D":"The bias is NOT a single global parameter. It is a vector of size d_out — one learned offset per output neuron. Treating it as a single value is incorrect."},"reference":"- PyTorch docs: `torch.nn.Linear` — lists `weight` (out_features × in_features) and `bias` (out_features)"},{"section":"deep-learning","difficulty":"easy","id":"dl-e005","topicSlug":"activation-functions","orderIndex":5,"topic":"Activation Functions","question":"You build a binary sentiment classifier (positive/negative). A junior engineer uses sigmoid activation on the output neuron and cross-entropy loss. A second engineer uses a linear output neuron (no activation) and MSE loss. 
Which is preferred and why?","options":{"A":"MSE + linear is preferred because the loss is smooth and easy to differentiate","B":"Sigmoid + cross-entropy is preferred: sigmoid maps the output to (0,1) producing a valid probability; cross-entropy penalizes confident wrong predictions logarithmically (large gradient when p=0.01 for y=1). MSE + linear can produce outputs outside [0,1] that have no probabilistic meaning, and it penalizes confidently correct outputs that overshoot the target (e.g., 5 for y=1); MSE behind a sigmoid instead has near-zero gradient when the output is saturated and very wrong, slowing learning","C":"Both are equivalent; the choice of activation and loss doesn't affect the final result","D":"Linear + MSE is preferred for binary classification because MSE has a unique global minimum"},"correct":"B","explanation":{"correct":"- Sigmoid output: p ∈ (0,1), interpretable as a probability. BCE loss: -[y log p + (1-y) log(1-p)]. When the model is confidently wrong (p=0.001 for y=1), loss = -log(0.001) ≈ 6.9, providing a large gradient signal.\n- MSE gradient for classification: ∂MSE/∂p = 2(p-y). With a linear output and y=1, p=10 gives gradient = 18, a large penalty even though the prediction is on the correct side of the decision threshold. If MSE is instead applied to a sigmoid output, the gradient with respect to the logit is 2(p-y)·p(1-p); when the sigmoid is saturated and wrong (e.g., p ≈ 0.99999 for y=0), p(1-p) ≈ 0.00001, so the gradient nearly vanishes. Cross-entropy cancels the p(1-p) factor, leaving a logit gradient of (p-y), which is why it is preferred with sigmoid.\n- MSE is designed for regression where the target can be any real number, not a binary label.","A":"MSE is smooth, but \"smooth\" doesn't mean \"appropriate.\" The gradient behavior for classification tasks makes MSE suboptimal — it doesn't naturally encode probability semantics.","B":"","C":"The choice significantly affects convergence speed and gradient behavior. 
Sigmoid + BCE trains faster and converges to better solutions for binary classification.","D":"MSE has a unique minimum for regression problems, but for classification with a linear output, the model may learn to push outputs far outside [0,1], which is numerically unstable and harder to interpret."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 6.2: Output Units"},{"section":"deep-learning","difficulty":"easy","id":"dl-e006","topicSlug":"activation-functions","orderIndex":6,"topic":"Activation Functions","question":"A network uses ReLU activations throughout. During training, you observe that 40% of neurons in layer 3 always output exactly 0 regardless of the training example. What is this phenomenon called, and what is the most direct single change to address it?","options":{"A":"Gradient clipping — apply norm-based clipping to prevent neurons from dying","B":"This is the \"dying ReLU\" problem. When a neuron's pre-activation z is consistently negative (due to large negative bias or large negative weight updates), ReLU outputs 0, and the gradient ∂ReLU/∂z = 0. No gradient flows, so weights don't update, locking the neuron in the dead state. The most direct fix: switch to Leaky ReLU (f(z) = max(αz, z) with α=0.01), which allows a small gradient even for z < 0","C":"Batch normalization — apply BN before ReLU to center the pre-activations around zero","D":"This is vanishing gradient; fix by reducing the number of layers"},"correct":"B","explanation":{"correct":"- Dead ReLU mechanism: `ReLU'(z) = 1 if z > 0, else 0`. If a neuron's z is always ≤ 0, the gradient is always 0 — weight update = 0 forever. The neuron is \"dead.\"\n- Common causes: large learning rate causing large weight updates that push z negative; poor initialization (all-negative biases for that layer).\n- Leaky ReLU fix: `f(z) = z if z > 0 else αz`, `f'(z) = 1 if z > 0 else α`. 
Even when z < 0, a non-zero gradient (α=0.01) flows, allowing the neuron to recover.\n- Alternative: better initialization (Kaiming) reduces the chance of neurons dying from the start.","A":"Gradient clipping prevents exploding gradients (very large gradients). Dead ReLU neurons have zero gradient, not large gradient — clipping does nothing for neurons that are already dead.","B":"","C":"BN before ReLU helps prevent many neurons from dying by keeping z centered near 0. However, the most direct architectural fix specifically targeting dying ReLU is Leaky ReLU, not BN.","D":"Dead ReLU is not the vanishing gradient problem. Vanishing gradient is about gradients becoming exponentially small as they flow backward through many layers. Dead ReLU is about specific neurons with permanently zero gradient."},"reference":"- Maas et al., \"Rectifier Nonlinearities Improve Neural Network Acoustic Models\" (2013) — Leaky ReLU"},{"section":"deep-learning","difficulty":"easy","id":"dl-e007","topicSlug":"activation-functions","orderIndex":7,"topic":"Activation Functions","question":"BERT uses GELU activation, while original ResNet uses ReLU. A student asks: \"Can I replace BERT's GELU with ReLU without hurting performance?\" You know GELU = x·Φ(x) where Φ is the Gaussian CDF. What property of GELU makes it preferred in Transformer architectures?","options":{"A":"GELU is computationally cheaper than ReLU, which speeds up large Transformer training","B":"GELU provides a smooth, probabilistic gating: it multiplies x by the probability that x is positive under a Gaussian, giving a smooth approximation of ReLU. This smooth gradient near zero allows more information to pass through during backpropagation compared to the hard zero gradient of ReLU for z < 0. Empirically, GELU outperforms ReLU in Transformer architectures (BERT, GPT) though the difference in CNNs is smaller. 
Replacing BERT's GELU with ReLU would likely cause a small but measurable performance drop.","C":"GELU prevents the dying neuron problem by outputting negative values for negative inputs","D":"GELU and ReLU are interchangeable; the architectural context doesn't matter for activation choice"},"correct":"B","explanation":{"correct":"- GELU(x) = x·Φ(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))). This is smooth everywhere (unlike ReLU's kink at 0).\n- Stochastic interpretation: GELU is the expected value of a Bernoulli-gated linear unit where the gate probability = Φ(x). Small positive inputs are probabilistically suppressed.\n- For Transformers: the smooth gradient behavior near zero matters because attention weights produce many near-zero pre-activations that benefit from smooth gradient flow.\n- GELU ≈ ReLU for large |x|, but the smooth transition region matters for training dynamics.","A":"GELU is actually slightly more expensive than ReLU (involves CDF computation or a polynomial approximation). The preference is not computational.","B":"","C":"GELU does produce negative outputs for some negative inputs (where x·Φ(x) is a small negative number near 0). However, this is not the main reason it's preferred. The smoothness is the key property.","D":"Activation choice interacts with architecture. GELU and SiLU work better in Transformers; ReLU works well in CNNs. The choice is empirically architecture-specific."},"reference":"- Hendrycks & Gimpel, \"Gaussian Error Linear Units (GELUs)\" (2016): https://arxiv.org/abs/1606.08415"},{"section":"deep-learning","difficulty":"easy","id":"dl-e008","topicSlug":"forward-propagation","orderIndex":8,"topic":"Forward Propagation","question":"A linear layer in PyTorch is defined as `nn.Linear(128, 64)`. You feed a batch of 32 samples: `x.shape = (32, 128)`. PyTorch computes `output = x @ W.T + b`. 
What is the shape of W, W.T, b, and the final output?","options":{"A":"W: (128, 64), W.T: (64, 128), b: (64,), output: (32, 64)","B":"W: (64, 128), W.T: (128, 64), b: (64,), output: (32, 64)","C":"W: (64, 128), W.T: (128, 64), b: (32,), output: (32, 64)","D":"W: (128, 64), W.T: (64, 128), b: (128,), output: (32, 128)"},"correct":"B","explanation":{"correct":"- PyTorch's `nn.Linear(in_features, out_features)` stores W with shape (out_features, in_features) = (64, 128). This is because the computation is `output = x @ W.T + b`.\n- x @ W.T: (32, 128) @ (128, 64) = (32, 64). ✓\n- b has shape (64,) — one bias per output feature, broadcast across the batch.\n- Note: W.shape = (64, 128) in PyTorch, NOT (128, 64) as you might expect from the math z = Wx + b (which uses column vectors). PyTorch uses row vectors and transposes W.","A":"Swaps W and W.T shapes. PyTorch stores W as (out, in) = (64, 128), so W.T = (128, 64). If W were (128, 64), then x @ W.T would be (32, 128) @ (64, 128) — incompatible shapes.","B":"","C":"Bias b has shape (out_features,) = (64,), not (batch_size,) = (32,). The bias is broadcast across the batch, not assigned per sample.","D":"Output shape (32, 128) would require the layer to output 128 features, not 64. The defined layer maps 128→64, so the output must be (32, 64)."},"reference":"- PyTorch docs: `torch.nn.Linear` — https://pytorch.org/docs/stable/generated/torch.nn.Linear.html"},{"section":"deep-learning","difficulty":"easy","id":"dl-e009","topicSlug":"forward-propagation","orderIndex":9,"topic":"Forward Propagation","question":"You call `model.eval()` and then run inference. You forgot to also wrap the inference in `with torch.no_grad()`. 
What is the practical consequence, and what is the purpose of each call?","options":{"A":"Without `torch.no_grad()`, the model computes wrong outputs because gradients affect the forward pass","B":"`model.eval()` changes model behavior (turns off Dropout random masking, uses BatchNorm running stats instead of batch stats). `torch.no_grad()` disables gradient computation, reducing memory and speeding up inference. Forgetting `torch.no_grad()`: model behavior is correct (eval() handles that), but PyTorch still builds a computational graph and stores activations for potential backward pass — wasting memory and slowing inference, but not producing wrong results","C":"`torch.no_grad()` is identical to `model.eval()`; calling one is sufficient","D":"Forgetting `torch.no_grad()` causes an error at the end of the inference loop"},"correct":"B","explanation":{"correct":"- `model.eval()`: changes module behavior. Dropout becomes identity (no masking). BatchNorm uses stored running_mean/running_var instead of batch statistics. This ensures deterministic, reproducible inference.\n- `torch.no_grad()`: tells autograd not to track operations in the context. No computation graph is built, and intermediate activations for backprop are not stored. This saves memory (no activation storage) and computation (no gradient bookkeeping).\n- Combined: correct behavior (eval) + memory efficiency (no_grad). Missing no_grad: memory usage stays high (all activations cached), but outputs are numerically identical.","A":"Gradients don't affect forward pass computation values. The gradient tensor is separate from the value tensor. Forward pass output = same regardless of whether autograd is enabled.","B":"","C":"They have completely different functions. eval() affects layer behavior (Dropout, BN). no_grad() affects memory management. Neither subsumes the other.","D":"PyTorch doesn't raise an error for running inference without no_grad(). 
It simply uses more memory than necessary."},"reference":"- PyTorch docs: `torch.no_grad()` and `Module.eval()`"},{"section":"deep-learning","difficulty":"easy","id":"dl-e010","topicSlug":"loss-and-cost-functions","orderIndex":10,"topic":"Loss And Cost Functions","question":"You train a regression model to predict house prices (in dollars, range 50K–2M). You use MSE loss. A colleague suggests switching to MAE loss. Under what condition would MSE be worse than MAE, and what is the key difference in their gradient behavior?","options":{"A":"MSE is always worse than MAE for regression; always use MAE","B":"MSE is worse than MAE when the dataset has outliers (e.g., a few houses worth $20M). MSE squares the error: a $5M error contributes 25× more loss than a $1M error (5² vs 1²). This causes the model to shift its predictions toward outliers. MAE treats all errors proportionally (a $5M error is 5× a $1M error). Gradient difference: MSE gradient = 2(ŷ - y), which grows with error magnitude. MAE gradient = ±1 (constant regardless of error magnitude). Near zero error, MAE's subgradient creates optimization difficulties (no smooth minimum).","C":"MAE is worse than MSE because its gradient is always ±1, making it slower to converge","D":"MSE and MAE produce identical optimal models; the difference is only in numerical stability"},"correct":"B","explanation":{"correct":"- Outlier sensitivity: MSE minimizer is the conditional mean E[y|x], which is pulled toward outliers. MAE minimizer is the conditional median, which is robust to outliers.\n- Gradient comparison: MSE ∂L/∂ŷ = 2(ŷ-y) — proportional to error. Large errors → large gradient updates (fast learning, but outliers dominate). MAE ∂L/∂ŷ = sign(ŷ-y) = ±1 — constant regardless of error magnitude.\n- MAE downside: the constant gradient doesn't provide fine-grained signal near the optimum, potentially causing oscillation around the true minimum. 
Huber loss combines both behaviors.","A":"MSE is preferred when the data has no significant outliers (most values near the mean) and you want a smooth, easily optimizable loss. \"Always use MAE\" is an oversimplification.","B":"","C":"MAE's constant gradient is a weakness for optimization (no smooth minimum) but doesn't make MAE worse than MSE across all scenarios. The choice depends on the data distribution and outlier sensitivity requirements.","D":"MSE and MAE produce different optimal solutions: mean-minimizing vs median-minimizing. These are different statistics and will differ when the distribution is skewed."},"reference":"- Huber, \"Robust Estimation of a Location Parameter\" (1964) — motivation for Huber loss"},{"section":"deep-learning","difficulty":"easy","id":"dl-e011","topicSlug":"loss-and-cost-functions","orderIndex":11,"topic":"Loss And Cost Functions","question":"A model predicts class probabilities [0.7, 0.2, 0.1] for a 3-class problem. The true label is class 0 (one-hot: [1, 0, 0]). Calculate the cross-entropy loss and explain what happens to the loss if the model becomes more confident in the correct class (predicts [0.99, 0.005, 0.005]).","options":{"A":"CE = -(0.7·log0.7 + 0.2·log0.2 + 0.1·log0.1) ≈ 0.80. More confidence → higher loss because all terms are included","B":"CE = -log(0.7) ≈ 0.357. With prediction [0.99, 0.005, 0.005]: CE = -log(0.99) ≈ 0.010. More confidence in the correct class → lower loss. Cross-entropy only cares about the probability assigned to the true class","C":"CE = -(log0.7 + log0.2 + log0.1) ≈ 4.27. The loss increases when probabilities are more concentrated","D":"CE = -(0.7 + 0.2 + 0.1) = -1.0. Loss is constant since probabilities always sum to 1"},"correct":"B","explanation":{"correct":"- Cross-entropy formula: CE = -Σ y_k log(p_k). With one-hot y=[1,0,0]: only the k=0 term survives: CE = -1·log(p_0) = -log(0.7) ≈ 0.357.\n- More confident in true class: p_0 = 0.99 → CE = -log(0.99) ≈ 0.010. 
The loss decreases as confidence in the correct class increases.\n- This is why cross-entropy works so well with softmax: the model is penalized based solely on how much probability it assigns to the correct class.","A":"The formula used in A is the entropy of the predicted distribution, not cross-entropy against the true labels. With one-hot labels, terms for wrong classes multiply by 0 and drop out.","B":"","C":"Sums of logs rather than weighted by labels. The one-hot label selector means only the true class log probability contributes.","D":"Cross-entropy is not constant. As probabilities change (while summing to 1), the loss changes. The true class probability determines the loss."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 6.2.1: Cross-Entropy Loss"},{"section":"deep-learning","difficulty":"easy","id":"dl-e012","topicSlug":"loss-and-cost-functions","orderIndex":12,"topic":"Loss And Cost Functions","question":"You are training a medical image classifier. Of 10,000 training examples, only 100 have the disease (class 1). You use standard cross-entropy loss and get 99% accuracy, but the model never predicts class 1. What is wrong, and what is the simplest fix to the loss function?","options":{"A":"The model needs more training epochs; 99% accuracy with no disease predictions means under-training","B":"The model has learned to predict class 0 (healthy) for everything — achieving 99% accuracy by ignoring the rare class. This is the class imbalance problem. Standard cross-entropy treats all examples equally, so the 9,900 healthy examples dominate training. Fix: use weighted cross-entropy with class weight = N_total / N_class: weight_0 = 10000/9900 ≈ 1.01, weight_1 = 10000/100 = 100. 
Now each disease example contributes 100× more to the loss, forcing the model to learn class 1.","C":"Switch from cross-entropy to MSE loss — MSE handles imbalanced classes better","D":"The issue is the learning rate; lower it to allow the model to detect the rare class"},"correct":"B","explanation":{"correct":"- Imbalance effect: the loss landscape is dominated by the majority class. Predicting class 0 for everything minimizes the overall loss (9900/10000 examples are correct). Gradients from the 100 disease examples are overwhelmed by gradients from the 9900 healthy examples.\n- Weighted CE: multiply each example's loss by its class weight. `L = -Σ w_k · y_k · log(p_k)`. Alternatively, use focal loss which down-weights easy (well-classified) examples.\n- Evaluation: for imbalanced problems, use F1 score, AUROC, or precision-recall AUC instead of accuracy.","A":"Training longer without addressing the imbalance would just make the model more confident in always predicting class 0. 99% accuracy is already converged to this degenerate solution.","B":"","C":"MSE for classification has its own problems and doesn't inherently address class imbalance better than CE. Focal loss is the state-of-the-art fix.","D":"Learning rate doesn't address the fundamental signal imbalance. Even with a perfect learning rate, 9900 examples of class 0 will overpower 100 examples of class 1 unless the loss is reweighted."},"reference":"- Lin et al., \"Focal Loss for Dense Object Detection (RetinaNet)\" (2017): https://arxiv.org/abs/1708.02002"},{"section":"deep-learning","difficulty":"easy","id":"dl-e013","topicSlug":"backpropagation","orderIndex":13,"topic":"Backpropagation","question":"In PyTorch, you compute `loss.backward()` twice in a row without calling `optimizer.zero_grad()` between iterations. 
What happens to the gradients, and why is `optimizer.zero_grad()` called at the start of each training iteration?","options":{"A":"The second `backward()` call resets and recomputes gradients from scratch — previous gradients are overwritten","B":"PyTorch accumulates gradients by default: each `backward()` call adds to the existing `.grad` attribute. After two `backward()` calls, `param.grad` = sum of both gradient computations. Without `zero_grad()`, gradients from the previous batch are added to those of the current batch, inflating the gradient and causing incorrect updates. `zero_grad()` clears all `.grad` values before the new gradient is computed — ensuring each update uses only the current batch's gradient","C":"PyTorch raises a RuntimeError on the second `backward()` call","D":"The second `backward()` call divides the gradient by 2 to compensate for calling twice"},"correct":"B","explanation":{"correct":"- Gradient accumulation: PyTorch's design choice. `param.grad += new_gradient` at each `.backward()`. This is actually useful for gradient accumulation over micro-batches (to simulate large batch training with limited memory).\n- Bug from forgetting `zero_grad()`: iteration 1 gradient g₁ accumulates; iteration 2 computes g₂ but param.grad = g₁ + g₂. The optimizer step uses this sum, making the effective learning rate larger than intended and the update direction incorrect.\n- Standard pattern: `optimizer.zero_grad()` → `loss = criterion(model(x), y)` → `loss.backward()` → `optimizer.step()`.","A":"This is incorrect. PyTorch accumulates gradients rather than overwriting them. See the official docs: \"Gradients are accumulated.\" Overwriting would require explicit `param.grad = None` or `zero_grad()`.","B":"","C":"PyTorch allows multiple `backward()` calls if the graph is retained (retain_graph=True) or if the graph is rebuilt each forward pass. 
No error is raised — but incorrect results occur.","D":"PyTorch has no such averaging behavior for gradients from multiple backward passes."},"reference":"- PyTorch docs: `Optimizer.zero_grad()` — https://pytorch.org/docs/stable/optim.html"},{"section":"deep-learning","difficulty":"easy","id":"dl-e014","topicSlug":"backpropagation","orderIndex":14,"topic":"Backpropagation","question":"You write a custom loss function that includes `math.log(pred)` (Python's math module, not torch). During training, the loss computes correctly but `loss.backward()` raises an error. What is the problem?","options":{"A":"`math.log` is not differentiable; you must use a polynomial approximation","B":"`math.log` converts the tensor to a plain Python float, breaking the PyTorch autograd computation graph. Autograd tracks operations on tensors; once a tensor is converted to a Python float (which `math.log` does), the graph connection is severed. `loss.backward()` cannot compute gradients through operations not recorded in the graph. Fix: replace `math.log(pred)` with `torch.log(pred)`, which records the log operation in the autograd graph.","C":"`math.log` returns a negative value for probabilities, causing a NaN in backprop","D":"The error occurs because `math.log` requires integer inputs; use `float()` to convert first"},"correct":"B","explanation":{"correct":"- Autograd graph: PyTorch builds a directed acyclic graph of tensor operations during the forward pass. Each tensor operation returns a new tensor with a `.grad_fn` pointing to the operation. `torch.log(t)` records a `LogBackward` node.\n- `math.log(t)`: first converts `t` to a Python float (through the tensor's `__float__` method, which only works for single-element tensors), then applies Python's `math.log`. The result is a plain Python float with no `.grad_fn`. The autograd graph is severed.\n- `loss.backward()` traverses the graph from the loss tensor. 
If the graph ends prematurely (float stops graph), it cannot compute gradients for earlier tensors.","A":"`math.log` is mathematically differentiable (d/dx log x = 1/x). The problem is not mathematical differentiability but PyTorch's ability to *track* the operation in its automatic differentiation graph.","B":"","C":"For predictions p ∈ (0,1), log(p) is negative — this is expected in cross-entropy loss. Negative log probability is not a NaN. The error is about graph tracking, not sign.","D":"`math.log` works on floats. The problem isn't input type — it's that extracting a float breaks the computation graph."},"reference":"- PyTorch docs: Autograd Mechanics — https://pytorch.org/docs/stable/notes/autograd.html"},{"section":"deep-learning","difficulty":"easy","id":"dl-e015","topicSlug":"optimizers","orderIndex":15,"topic":"Optimizers","question":"You train a model with SGD and observe that the loss oscillates: 0.8 → 1.2 → 0.7 → 1.3 → 0.6 → 1.4. The loss is decreasing on average but oscillating wildly. What is the most likely cause and what single hyperparameter change would smooth training?","options":{"A":"The model has too many parameters; reduce model size to fix oscillation","B":"The learning rate is too high. Large LR causes updates to overshoot the minimum — the loss decreases then bounces past the optimal point in the loss landscape. The optimizer jumps back and forth across the minimum. Fix: reduce the learning rate (e.g., by 10×). Oscillating loss with downward trend is a classic too-high LR signature. Adding momentum can also help by smoothing the update direction.","C":"The batch size is too small; increase to full-batch gradient descent","D":"The oscillation is caused by the ReLU activation; switch to sigmoid"},"correct":"B","explanation":{"correct":"- Overshoot mechanism: at the minimum, the gradient is zero. When LR is high, the step size is large — the optimizer lands on the other side of the minimum where the gradient points back. 
This creates oscillation.\n- Decreasing average trend: despite oscillation, the optimizer is slowly finding lower regions. This is why reducing LR stabilizes training — smaller steps approach the minimum without overshooting.\n- LR diagnosis: smooth loss = LR OK; oscillating loss = LR too high; no decrease = LR too low or gradient issue.","A":"Model size doesn't cause loss oscillation patterns. Large models may overfit, but overfitting shows as train loss decreasing while val loss increases — not the oscillation pattern described.","B":"","C":"Small batch size does introduce gradient noise, but the pattern for noisy gradients is more random (not the regular back-and-forth oscillation). Full batch gradient descent would remove noise but also remove regularization benefits.","D":"Activation function doesn't cause the per-iteration oscillation pattern. ReLU provides sparse, stable activations. Switching activations wouldn't fix an optimizer overshoot problem."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 8.3: Basic Algorithms"},{"section":"deep-learning","difficulty":"easy","id":"dl-e016","topicSlug":"optimizers","orderIndex":16,"topic":"Optimizers","question":"Adam optimizer uses bias correction for its moment estimates: m̂_t = m_t / (1 - β₁ᵗ) and v̂_t = v_t / (1 - β₂ᵗ). Why is this correction needed specifically in the first few steps, and what happens to the correction factor as t → ∞?","options":{"A":"Bias correction scales up the learning rate to compensate for small gradients at initialization","B":"At t=1, m₁ = (1-β₁)g₁ (initialized from zero). Without correction, m₁ underestimates the true gradient because it is scaled by (1-β₁) ≈ 0.1 (β₁=0.9). Bias correction: m̂₁ = m₁/(1-β₁¹) = m₁/0.1 = 10·m₁, restoring the true scale. As t→∞, β₁ᵗ → 0, so (1-β₁ᵗ) → 1, and m̂_t → m_t (correction factor becomes 1). 
The correction matters mainly in early steps and disappears asymptotically.","C":"Bias correction is only needed when gradients are very small (< 1e-8); otherwise it has no effect","D":"The correction normalizes the update to be between -1 and +1 at all times"},"correct":"B","explanation":{"correct":"- Exponential moving average startup: m_t = β₁ m_{t-1} + (1-β₁) g_t. Starting from m_0=0:\n- t=1: m₁ = (1-β₁)g₁ ≈ 0.1g₁ (if β₁=0.9). Underestimates by factor 10.\n- t=2: m₂ = β₁(1-β₁)g₁ + (1-β₁)g₂. Still underestimates.\n- t=100: m₁₀₀ ≈ m̂₁₀₀ (correction ≈ 1 since β₁¹⁰⁰ ≈ 0.000027).\n- The correction is critical early in training when the EMA hasn't had time to accumulate enough history to represent the true running average.","A":"Bias correction doesn't scale the learning rate — it corrects the moment estimates. The effective step size is m̂_t / (√v̂_t + ε) × η, where η is the fixed LR.","B":"","C":"Bias correction applies to all gradient magnitudes uniformly. It's always active — the division by (1-β₁ᵗ) happens regardless of gradient magnitude.","D":"Bias-corrected Adam updates can take values larger or smaller than ±1 depending on the gradient and the second moment estimate. There's no clipping to [-1,1]."},"reference":"- Kingma & Ba, \"Adam: A Method for Stochastic Optimization\" (2015): https://arxiv.org/abs/1412.6980"},{"section":"deep-learning","difficulty":"easy","id":"dl-e017","topicSlug":"optimizers","orderIndex":17,"topic":"Optimizers","question":"You train the same CNN on CIFAR-10 using (A) SGD with momentum=0.9, lr=0.1 and (B) Adam with lr=0.001. After 100 epochs, SGD achieves 93% test accuracy and Adam achieves 91%. Why might SGD outperform Adam on image classification despite Adam's adaptive learning rates?","options":{"A":"Adam has a bug for image classification tasks; use SGD by default","B":"SGD with momentum tends to find flatter minima than Adam on vision tasks. 
Flat minima tend to generalize better (small perturbations in weights → small change in loss) than sharp minima. Adam's per-parameter adaptive step sizes can cause it to converge to sharper minima faster. Additionally, Adam rescales every update by 1/√v̂_t, where v_t is an exponential moving average of squared gradients (not a growing sum as in Adagrad), so its effective per-parameter step sizes can differ sharply from SGD's, which is often cited as a contributor to its generalization gap. Many image classification benchmarks (ImageNet, CIFAR) show SGD + momentum + LR schedule outperforms Adam in final accuracy, though Adam converges faster early on.","C":"SGD uses the entire dataset while Adam uses mini-batches; larger data = better accuracy","D":"Adam requires 10× more memory than SGD, causing memory errors that reduce accuracy"},"correct":"B","explanation":{"correct":"- Sharp vs flat minima: Adam's adaptive updates can exploit gradient information more efficiently in each step, but they may converge to sharper loss basins. Flat minima are associated with better generalization (Hochreiter & Schmidhuber 1997, Keskar et al. 2017).\n- The SGD+momentum advantage in vision: SGD converges more slowly but often to flatter, better-generalizing solutions. SGD is also more sensitive to the learning rate schedule, which is why LR schedules (cosine, step decay) are critical for it.\n- Practical tip: Adam is often better for NLP (Transformers), SGD is often better for vision (CNNs). This is a well-documented empirical finding.","A":"Adam has no \"bug\" for image classification. It's a valid optimizer. The difference is in the optimization landscape, not a software defect.","B":"","C":"Both SGD and Adam use mini-batches in standard deep learning practice. The comparison is between the adaptive vs non-adaptive update rules, not batch usage.","D":"Adam stores first and second moment vectors (2× parameter count overhead vs SGD's momentum buffer). 
This is a 2× increase, not 10×, and it doesn't cause accuracy differences — it's a memory concern at scale."},"reference":"- Keskar et al., \"On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima\" (2017): https://arxiv.org/abs/1609.04836"},{"section":"deep-learning","difficulty":"easy","id":"dl-e018","topicSlug":"ann-architectures","orderIndex":18,"topic":"Ann Architectures","question":"The Universal Approximation Theorem (UAT) states that a single hidden layer neural network with enough neurons can approximate any continuous function on a compact domain. A student concludes \"therefore, deep networks are unnecessary — we just need enough neurons in one hidden layer.\" What is wrong with this conclusion?","options":{"A":"The UAT is incorrect; neural networks cannot approximate arbitrary functions","B":"The UAT proves existence, not practicality. While a single hidden layer CAN approximate any function, the number of neurons required may be exponential in the input dimension. Deep networks can approximate the same function with exponentially fewer neurons by composing simpler sub-functions hierarchically. The UAT also says nothing about learnability via gradient descent — a theoretically sufficient shallow network may be practically untrainable","C":"The UAT applies only to linear activation functions; ReLU networks cannot approximate arbitrary functions","D":"The UAT is correct; a shallow network with enough neurons is always better than a deep network for any task"},"correct":"B","explanation":{"correct":"- Existence vs construction: the theorem guarantees a set of weights exists, but doesn't guarantee gradient descent will find them, or that a practical number of neurons is sufficient.\n- Depth efficiency: Montufar et al. (2014) showed that deep ReLU networks can represent exponentially more linear regions than shallow networks with the same parameter count. 
For some function families, a shallow network therefore needs far more neurons than a deep one to reach the same accuracy.\n- Practical implication: depth is not just about theoretical expressive power — it's about learning efficiency. Hierarchical features (edges → textures → parts → objects in CNNs) are naturally learned by depth.","A":"The UAT is well-proven and widely accepted. The issue is its practical implications, not its correctness.","B":"","C":"The UAT was originally proved for sigmoidal activations (Cybenko 1989; Hornik et al. 1989) and later for ReLU and many other activation functions. ReLU networks are universal approximators.","D":"This is the student's incorrect conclusion. Deep networks usually outperform wide shallow networks on complex tasks with equal or fewer parameters."},"reference":"- Cybenko, \"Approximation by Superpositions of a Sigmoidal Function\" (1989)\n- Montufar et al., \"On the Number of Linear Regions of Deep Neural Networks\" (2014): https://arxiv.org/abs/1402.1869"},{"section":"deep-learning","difficulty":"easy","id":"dl-e019","topicSlug":"ann-architectures","orderIndex":19,"topic":"Ann Architectures","question":"A model achieves 95% training accuracy and 60% validation accuracy. A colleague says \"train longer to close the gap.\" Another says \"make the model smaller.\" What is the correct diagnosis, and what are two complementary fixes?","options":{"A":"The model is underfitting; add more layers to increase capacity","B":"This is overfitting: the model has memorized training examples without generalizing. High train accuracy with much lower val accuracy = train/validation gap = overfitting. Fix 1: regularization — add L2 weight decay or Dropout to penalize memorization. Fix 2: reduce model capacity (fewer layers/neurons) so the model is forced to learn patterns that also hold on validation data. 
Alternatively: get more training data, or use data augmentation.","C":"The model is correct — 60% validation accuracy is expected with 95% training accuracy; this gap is normal","D":"This is underfitting; the model needs more training data to improve validation accuracy"},"correct":"B","explanation":{"correct":"- Overfitting signature: train accuracy >> val accuracy. The model has learned training-specific patterns (noise, memorized examples) that don't generalize.\n- Why \"train longer\" doesn't help: more training epochs on the same data will increase training accuracy further (potentially to 99%) while validation accuracy may decrease further (the model memorizes more).\n- Why \"smaller model\" helps: a model with fewer parameters is forced to learn the most statistically reliable patterns, which tend to generalize. This is the bias-variance tradeoff — smaller models have higher bias but lower variance.","A":"95% training accuracy indicates the model is learning well from training data (not underfitting). Underfitting = low training accuracy. The problem is the 35% train/val gap.","B":"","C":"A 35% gap between train and val accuracy is a significant overfitting indicator, not normal. Typical acceptable gaps depend on the task, but 35% is large for most real-world tasks.","D":"Underfitting = model too simple to fit training data (low train accuracy). Here, train accuracy is high. More data helps with overfitting but doesn't fix the cause (model capacity)."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 5.2: Capacity, Overfitting, and Underfitting"},{"section":"deep-learning","difficulty":"easy","id":"dl-e020","topicSlug":"regularization-and-normalization","orderIndex":20,"topic":"Regularization And Normalization","question":"At inference time, a model with Dropout layers (p=0.5) produces different outputs for the same input when run twice. The model is in training mode. 
What simple fix is needed, and what would happen to output values if you correctly switch to eval mode but forget to scale activations?","options":{"A":"This is expected; models always produce different outputs for the same input","B":"Fix: call `model.eval()` before inference. This disables the random dropout mask — all neurons are active during evaluation. Scale concern: PyTorch uses inverted dropout (applies scale 1/(1-p) during training), so at eval, no scaling is needed — outputs are already correctly scaled. If you manually implemented non-inverted dropout (scale at eval), forgetting the scale factor would make eval activations 2× larger than training activations (for p=0.5), causing miscalibrated predictions.","C":"Call `torch.manual_seed(42)` before each inference run to make dropout deterministic","D":"Remove dropout layers from the model architecture after training; they are only needed during training"},"correct":"B","explanation":{"correct":"- Inverted dropout (PyTorch default): during training, active neurons are scaled up by 1/(1-p). During eval, all neurons are active with no scaling — the expected activation value is the same as in training.\n- Without inverted dropout (scale at eval): during training, activations are unscaled. At eval, all neurons are active → activations are (1/(1-p))× larger than training. The model's calibration (e.g., softmax temperatures) is off.\n- `model.eval()` is the correct fix. It sets `training=False` for all modules, which disables the Bernoulli mask in `nn.Dropout`.","A":"Stochastic outputs are a bug for production inference (except in Monte Carlo Dropout for uncertainty estimation). For standard inference, deterministic outputs are required.","B":"","C":"Setting a random seed makes the dropout mask deterministic but not necessarily correct — it would produce a specific masked output, not the intended full-network output. 
The model would still not use all neurons.","D":"Removing dropout layers is bad practice — it would change the model architecture and may break saved weights/configurations. Use eval() instead."},"reference":"- Srivastava et al., \"Dropout: A Simple Way to Prevent Neural Networks from Overfitting\" (2014): https://www.jmlr.org/papers/v15/srivastava14a.html"},{"section":"deep-learning","difficulty":"easy","id":"dl-e021","topicSlug":"regularization-and-normalization","orderIndex":21,"topic":"Regularization And Normalization","question":"BatchNorm stores `running_mean` and `running_var` during training (updated via momentum=0.1). At inference, these running stats are used instead of batch statistics. If you accidentally continue training after loading a model checkpoint (with frozen BN layers), but the running stats were computed on a different dataset, what goes wrong?","options":{"A":"Nothing — running stats are always recomputed at inference so old stats don't matter","B":"Frozen BN layers (in eval mode) use their stored running_mean/running_var at inference. If these stats were computed on a different domain (e.g., pretraining on ImageNet, fine-tuning on X-rays), the normalization divides by the wrong distribution statistics. X-ray pixel intensities differ from ImageNet pixel statistics — using ImageNet mean/var to normalize X-ray inputs would produce badly normalized features, degrading model performance. 
Fix: either (1) unfreeze BN and let it update running stats during fine-tuning, or (2) recalculate running stats by doing a forward pass through the training data with model in train mode but optimizer disabled.","C":"Running stats are automatically adapted when you load a checkpoint, so domain mismatch is impossible","D":"BN running stats only affect the bias term; the scale is learned and adapts during fine-tuning"},"correct":"B","explanation":{"correct":"- Running stats purpose: at inference, batch statistics are unavailable (inference may process single samples). Running stats provide an approximation of the training set's feature distribution.\n- Domain mismatch: if ImageNet running_mean[0] = 0.485 and X-ray running_mean[0] = 0.1 (different brightness distribution), BN will incorrectly normalize X-ray features using ImageNet statistics.\n- Fine-tuning BN: when fine-tuning for a new domain, BN layers should generally be unfrozen to allow running stats to adapt. Small datasets sometimes freeze BN to avoid overfitting.","A":"Running stats are NOT recomputed at inference. In eval mode, BN uses the stored running_mean and running_var, which were accumulated during training.","B":"","C":"Loading a checkpoint restores the stored running stats from the checkpoint — it doesn't recompute or adapt them.","D":"BN has two learnable parameters: γ (scale) and β (shift). It also has two non-learnable running buffers: running_mean and running_var. Both the normalization statistics AND the learned scale/shift are domain-dependent."},"reference":"- Ioffe & Szegedy, \"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift\" (2015): https://arxiv.org/abs/1502.03167"},{"section":"deep-learning","difficulty":"easy","id":"dl-e022","topicSlug":"regularization-and-normalization","orderIndex":22,"topic":"Regularization And Normalization","question":"You apply LayerNorm to a Transformer's input sequence. 
The input is shape (batch=2, seq_len=10, d_model=512). LayerNorm normalizes over the last dimension (d_model). What exactly is being normalized, and why is this preferred over BatchNorm for NLP tasks?","options":{"A":"LayerNorm averages across the batch dimension; BatchNorm averages across the d_model dimension","B":"LayerNorm normalizes each (batch, position) pair independently over its 512 features. For token (b=0, t=3), it computes mean and std over the 512 feature values of that one token, then normalizes. This is per-token, per-sample normalization. BatchNorm would normalize over the batch×sequence dimensions for each feature — requiring consistent batch statistics across all positions and samples. For NLP, different positions have very different distributions (punctuation vs content words), and batch stats are computed over variable-length sequences making BN unstable. LayerNorm is independent of batch size and sequence length, making it ideal for variable-length NLP tasks.","C":"LayerNorm and BatchNorm produce identical outputs for NLP tasks; choose either","D":"LayerNorm normalizes across the sequence dimension; each feature is normalized across all 10 positions"},"correct":"B","explanation":{"correct":"- LayerNorm(x)_i = (x_i - μ) / σ × γ_i + β_i, where μ and σ are computed over the feature dimension for one token at a time.\n- BatchNorm for NLP problems: (1) batch stats are unstable for small batches or variable-length sequences; (2) at inference with batch size=1, batch statistics are undefined — must use running stats; (3) different positions in a sequence have different semantic roles — pooling their statistics corrupts the representation.\n- LayerNorm advantages: works with any batch size (including 1), independent of sequence length, position-agnostic — each token normalized by its own feature statistics.","A":"This swaps the normalization axes. 
LayerNorm normalizes over features (d_model); BatchNorm normalizes over the batch and spatial/sequence dimensions for each feature.","B":"","C":"They produce different outputs because they normalize over different dimensions. The statistics (mean, variance) are computed from different sets of values.","D":"Normalizing across sequence positions (10 tokens) is a different variant — it would conflate information from different positions. Standard LayerNorm as used in Transformers normalizes over the feature (d_model) dimension."},"reference":"- Ba et al., \"Layer Normalization\" (2016): https://arxiv.org/abs/1607.06450"},{"section":"deep-learning","difficulty":"easy","id":"dl-e023","topicSlug":"weight-initialization","orderIndex":23,"topic":"Weight Initialization","question":"Xavier (Glorot) initialization sets `Var(w) = 2 / (fan_in + fan_out)`. Kaiming (He) initialization sets `Var(w) = 2 / fan_in`. What is the key architectural assumption that differentiates when each is appropriate?","options":{"A":"Xavier is for recurrent networks; Kaiming is for convolutional networks","B":"Xavier assumes a symmetric activation function (like tanh or sigmoid) where the positive and negative parts of the activation have equal variance contribution. Kaiming accounts for ReLU's asymmetry: ReLU zeros out half the neurons, effectively halving the variance. Setting Var(w) = 2/fan_in compensates for this halving. Using Xavier with ReLU: variance shrinks by half per layer → vanishing activations in deep networks. 
Using Kaiming with tanh: variance is slightly too large → mild exploding activations in very deep networks.","C":"The only difference is that Xavier uses fan_in + fan_out while Kaiming uses only fan_in; both work with any activation","D":"Xavier is for the first layer; Kaiming is for all subsequent layers"},"correct":"B","explanation":{"correct":"- Variance analysis for ReLU: if z ~ N(0, σ²), then E[ReLU(z)²] = σ²/2, because ReLU zeroes the negative half of a symmetric distribution and leaves the positive half unchanged. This is the second moment, which is the quantity that propagates to the next layer; the variance of ReLU(z) itself is smaller. The next layer's pre-activation variance is Var(w) × fan_in × E[a²], so keeping it equal to Var(z) requires an extra factor of 2: Var(w) = 2/fan_in.\n- For tanh: the derivative is close to 1 for small inputs, so at initialization the activation is approximately linear and variance is approximately preserved without the factor of 2. Sigmoid is not zero-centered and has slope 0.25 at zero, so the Xavier derivation fits it less exactly. Xavier's 2/(fan_in + fan_out) is a compromise to maintain variance in both forward and backward passes.","A":"The choice is activation-function-based, not architecture-based. CNNs can use either Xavier or Kaiming depending on which activation they use. Kaiming is preferred when using ReLU regardless of whether it's a CNN or MLP.","B":"","C":"The denominator difference (fan_in vs fan_in + fan_out) matters. Using Kaiming with tanh in a deep network would give a higher variance than needed (2/fan_in vs Xavier's 2/(fan_in+fan_out)), which can cause instability in very deep networks.","D":"There is no layer-position-based rule. Xavier and Kaiming apply uniformly to all layers based on the activation function used in that layer."},"reference":"- He et al., \"Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (Kaiming init)\" (2015): https://arxiv.org/abs/1502.01852"},{"section":"deep-learning","difficulty":"easy","id":"dl-e024","topicSlug":"weight-initialization","orderIndex":24,"topic":"Weight Initialization","question":"You initialize all weights in a 10-layer MLP to zero. After one forward pass, you call backward(). 
What values do all the weight gradients have, and why?","options":{"A":"Gradients are random because the loss function introduces randomness","B":"All hidden neurons within a layer receive identical gradients, so the symmetry is never broken and the model can never learn diverse features. Specifically: with W=0 (and zero biases), all pre-activations z=0, so activations a = σ(0) = same constant for all neurons. The same output flows to every downstream neuron. Gradients via chain rule: ∂L/∂W_ij = δ_j × a_i. All a_i in a layer are equal, and every hidden neuron in a layer receives the same backpropagated δ (on the first step it is exactly 0 below the output layer, because the error is multiplied by W=0). After the update, the neurons in each hidden layer remain identical: the model has one effective neuron per hidden layer.","C":"Gradients are all zero because a zero forward pass produces zero loss","D":"Gradients are non-zero and different for each weight because the loss function differentiates each weight independently"},"correct":"B","explanation":{"correct":"- Forward pass with W=0: z = 0·x + b = b. If biases are also 0: z=0, a=σ(0) = constant (e.g., 0.5 for sigmoid, 0 for ReLU) for ALL neurons.\n- All neurons in a layer identical → all outputs identical → loss gradient distributes identically to all neurons in a hidden layer.\n- Gradient formula: for the weight from neuron i in the previous layer to neuron j in the current layer, ∂L/∂w_{ij} = ∂L/∂z_j × a_i = δ_j × a_i. The a_i are the same for all i (identical neurons in the previous layer), and the hidden-layer δ_j are the same for all j because they are backpropagated through identical weights. On the first step they are exactly zero below the output layer, since the error is multiplied by W = 0.\n- Output layer: with softmax and cross-entropy, δ_j = p_j - y_j does differ between classes, so output weights can change when a ≠ 0, but every hidden neuron still feeds each output the same value.\n- Result: neurons within a hidden layer stay identical forever; the layer is \"collapsed.\"","A":"The loss function is deterministic given the model output. With identical forward pass (all zeros), the loss and its gradient are deterministic. No randomness is introduced.","B":"","C":"The loss is not necessarily zero. For cross-entropy, L = -log(p_correct). If all outputs are equal and there are K classes, each gets probability 1/K. CE = log(K) ≠ 0 (for K > 1). 
So loss ≠ 0, but gradients are equal across neurons.","D":"Weight gradients within a layer are equal (not different) due to symmetry. They differentiate only if the inputs or activations differ across neurons."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 8.4: Parameter Initialization Strategies"},{"section":"deep-learning","difficulty":"easy","id":"dl-e025","topicSlug":"cnn-architectures","orderIndex":25,"topic":"Cnn Architectures","question":"A convolutional layer has 32 filters of size 3×3, applied to a 3-channel (RGB) input. How many weights does this layer have (excluding biases)? How does this compare to a fully connected layer taking the same 64×64×3 input and producing 32 outputs?","options":{"A":"Conv: 32 × 3 × 3 = 288 weights. FC: 64 × 64 × 3 × 32 = 393,216 weights. Conv has 1365× fewer weights","B":"Conv: 32 filters × (3×3 kernel × 3 input channels) = 32 × 27 = 864 weights. FC: 64 × 64 × 3 × 32 = 393,216 weights. Conv has 455× fewer weights — this is the parameter efficiency of weight sharing","C":"Conv: 32 × 3 × 3 × 3 × 64 × 64 = 56,623,104 weights. Conv and FC have the same order of magnitude","D":"Conv: 32 × 3 × 3 = 288 weights. FC: 64 × 64 × 32 = 131,072 weights. Conv is smaller because it ignores one input channel"},"correct":"B","explanation":{"correct":"- Conv parameter count: each filter is of shape (C_in × K_H × K_W). 32 filters, each of shape (3 × 3 × 3) = 27 values. Total = 32 × 27 = 864 weights.\n- FC parameter count: input has 64 × 64 × 3 = 12,288 values. FC to 32 outputs: 12,288 × 32 = 393,216 weights.\n- Ratio: 393,216 / 864 = 455×. This is the key advantage of CNNs: weight sharing (the same filter is applied at every spatial location) makes them dramatically more parameter-efficient for spatial data.","A":"Forgets to multiply by C_in=3 (the filter must cover all 3 input channels). One filter is 3×3×3=27, not 3×3=9.","B":"","C":"This incorrectly multiplies the spatial output dimensions into the weight count. 
Convolutional weights are independent of the spatial dimensions they're applied to — that's the point of weight sharing. The 64×64 output positions all use the SAME 864 weights.","D":"Convolution does NOT ignore input channels. Each filter convolves all input channels simultaneously. The \"×3\" for RGB channels is part of the filter shape."},"reference":"- LeCun et al., \"Gradient-Based Learning Applied to Document Recognition\" (1998) — weight sharing motivation"},{"section":"deep-learning","difficulty":"easy","id":"dl-e026","topicSlug":"cnn-architectures","orderIndex":26,"topic":"Cnn Architectures","question":"In a CNN, two design choices for downsampling are: (A) max pooling (non-parametric, takes the max in each 2×2 window) and (B) strided convolution with stride=2 (a parametric, learned operation). When would you prefer max pooling over strided convolution, and what does max pooling preserve that average pooling does not?","options":{"A":"Max pooling is always preferred because it is parameter-free","B":"Max pooling is preferred when you want to preserve the strongest activation (presence of a feature) regardless of exact location — providing translation invariance within the pooling window. Max pooling preserves the most activated feature in a local region. Average pooling preserves the average activation, which is sensitive to the presence of many weakly activated neurons rather than one strongly activated one. Strided conv is preferred when you want the network to learn how to downsample based on task-specific patterns. 
Modern architectures (ResNet, EfficientNet) increasingly prefer strided convolution because it allows the network to decide what is important to keep.","C":"Average pooling is identical to max pooling; the distinction doesn't matter in practice","D":"Max pooling requires the same number of parameters as a strided convolution; the choice is aesthetic"},"correct":"B","explanation":{"correct":"- Max pooling: `output = max(x_1, x_2, x_3, x_4)` for a 2×2 window. If one pixel strongly detects an edge, the max preserves that detection regardless of whether neighboring pixels also detect it. This provides feature presence detection.\n- Average pooling: `output = (x_1 + x_2 + x_3 + x_4) / 4`. One strong detection is diluted by three weak ones. Good for global average pooling at the end of a network (replacing FC layers).\n- Strided conv advantage: the filter learns what to preserve during downsampling, potentially learning task-relevant downsampling. This is why modern architectures use it.","A":"Parameter-free is not always an advantage. Strided convolution can outperform max pooling when the task benefits from learned downsampling.","B":"","C":"Max and average pooling produce different outputs. For a window [0, 0, 0, 100]: max=100, avg=25. For detection tasks, max pooling (100) correctly signals feature presence; average pooling (25) weakens the signal.","D":"Max pooling has zero parameters (just takes the max). Strided convolution has kernel_size²×C_in×C_out parameters. They are not equal."},"reference":"- Springenberg et al., \"Striving for Simplicity: The All Convolutional Net\" (2014): https://arxiv.org/abs/1412.6806"},{"section":"deep-learning","difficulty":"easy","id":"dl-e027","topicSlug":"cnn-architectures","orderIndex":27,"topic":"Cnn Architectures","question":"You design a CNN for a 256×256 input. After 4 layers of stride-1 convolution with 3×3 kernels and no padding, what is the spatial size of the output? 
If you add padding=1 to each layer, what changes?","options":{"A":"Without padding: 256 - 4×3 = 244×244. With padding: 256×256 (unchanged)","B":"Without padding: each 3×3 conv (stride 1, no padding) reduces each spatial dimension by 2 (=(K-1)). After 4 layers: 256 - 4×2 = 248×248. With padding=1: output size = input size (same padding), so output remains 256×256 after all 4 layers","C":"Without padding: 256 / 4 = 64×64. Convolution halves the spatial dimension each layer","D":"Without padding: 256 - 4×(3-1)/2 = 252×252. With padding, the size doubles"},"correct":"B","explanation":{"correct":"- Output size formula: `out = floor((in + 2×P - K) / S) + 1`. With P=0, K=3, S=1: `out = in - 2`.\n- After 4 layers: 256 - 4×2 = 248. Each layer removes 2 pixels (one from each border).\n- With padding=1: `out = (in + 2×1 - 3) / 1 + 1 = in`. Each layer maintains spatial dimensions. This is \"same\" padding — output equals input size.\n- Why same padding matters: without padding, deep CNNs lose spatial resolution quickly. Same padding allows very deep networks (e.g., VGG's 16 layers) to maintain spatial dimensions until explicit downsampling.","A":"The reduction per layer is K-1 = 2 (not K = 3). A 3×3 filter without padding removes 1 pixel from each edge per layer, so 2 pixels total per spatial dimension.","B":"","C":"Stride-1 convolution does NOT halve the spatial dimension. Halving requires stride=2 (strided conv) or pooling with pool_size=2, stride=2.","D":"This subtracts only (K-1)/2 = 1 pixel per layer, giving 256 - 4×1 = 252. Each unpadded 3×3 layer actually removes K-1 = 2 pixels per spatial dimension, so after 4 layers the reduction is 4 × 2 = 8, giving 256 - 8 = 248, not 252. 
And padding never doubles the output size."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 9.5: Basic Convolution Function"},{"section":"deep-learning","difficulty":"easy","id":"dl-e028","topicSlug":"rnn-lstm-gru","orderIndex":28,"topic":"Rnn Lstm Gru","question":"An RNN processes the sentence \"The cat that chased the dog barked.\" The network needs to associate \"cat\" (position 2) with \"barked\" (position 7) for subject-verb agreement. Why does a vanilla RNN struggle with this, and what architectural component in LSTM was specifically designed to address it?","options":{"A":"RNNs cannot process sentences longer than 5 words; LSTM extends this limit to 500 words","B":"In a vanilla RNN, the hidden state h_t = tanh(W_h h_{t-1} + W_x x_t). Information from position 2 (\"cat\") must survive through 5 tanh operations. Each tanh compresses values to (-1,1), and since |tanh'| ≤ 1, the gradient of the loss at position 7 with respect to the hidden state at position 2 involves a product of 5 partial derivatives, typically < 1, causing vanishing gradients. The LSTM cell state c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t provides a direct highway for information: when the forget gate f_t ≈ 1, information passes unchanged — no repeated squashing. This is why LSTM can maintain long-range dependencies.","C":"The issue is that RNNs use tanh, while LSTM uses ReLU — ReLU prevents vanishing gradients","D":"LSTM simply adds more parameters, which allows it to store more sequence positions in memory"},"correct":"B","explanation":{"correct":"- Vanilla RNN gradient: ∂L/∂h_2 = ∂L/∂h_7 × Π_{t=2}^{6} ∂h_{t+1}/∂h_t. Each factor = diag(1 - h_{t+1}²) W_h (tanh derivative times the recurrent weight matrix). If the spectral radius of diag(1 - h_{t+1}²) W_h < 1, gradients vanish over 5 steps.\n- LSTM cell state highway: c_t = f_t ⊙ c_{t-1} + new_info. When f_t=1 and new_info≈0: c_t = c_{t-1} exactly. The gradient ∂c_t/∂c_{t-1} = f_t ∈ (0,1) — only one pointwise multiplication, not a full matrix Jacobian × tanh derivative. 
This avoids the compound shrinkage of vanilla RNN gradients.","A":"There's no hard 5-word or 500-word limit. The ability to capture long-range dependencies is gradual — vanilla RNN struggles proportionally with distance, LSTM extends practical range but not to infinite.","B":"","C":"LSTM doesn't use ReLU in its core gates — it uses sigmoid (gates) and tanh (cell/output activations). The long-range capacity comes from the additive cell state update, not from ReLU.","D":"More parameters don't inherently address vanishing gradients. The specific architectural innovation is the gating mechanism and additive cell state update, not parameter count."},"reference":"- Hochreiter & Schmidhuber, \"Long Short-Term Memory\" (1997): https://www.bioinf.jku.at/publications/older/2604.pdf"},{"section":"deep-learning","difficulty":"easy","id":"dl-e029","topicSlug":"rnn-lstm-gru","orderIndex":29,"topic":"Rnn Lstm Gru","question":"In seq2seq (encoder-decoder) models without attention, the encoder reads an input sequence and produces a single fixed-size context vector c. The decoder uses c to generate the output sequence. What fundamental limitation does this create for long input sequences?","options":{"A":"The decoder cannot run without attention; seq2seq without attention is theoretically impossible","B":"The encoder must compress all information from the input (regardless of length) into a single fixed-size vector c ∈ ℝ^d. For a 100-word sentence, all semantic content must fit in d dimensions. Information is lost when the sequence contains more unique content than the vector can represent. In practice: for short sequences, the bottleneck is manageable; for long sequences (100+ words), early input information is lost by the time the encoder finishes. The decoder then generates outputs without access to early input positions. 
Attention (Bahdanau, 2015) fixes this by allowing the decoder to attend to all encoder hidden states, bypassing the bottleneck.","C":"The bottleneck is only a problem when the vocabulary size exceeds the context vector dimension","D":"The fixed context vector is only used for the first decoder step; subsequent steps use the decoder's own hidden state"},"correct":"B","explanation":{"correct":"- Information bottleneck: the context vector c = h_T (last encoder hidden state) must summarize the entire input. Empirically, RNNs tend to remember recent tokens better than early ones.\n- Translation quality degradation: Bahdanau et al. (2015) showed that BLEU scores for fixed-context seq2seq drop sharply for input sentences with more than 30 words, while attention-based models maintain quality for much longer sequences.\n- Attention solution: instead of using a single c, attention computes a weighted sum of all encoder hidden states h_1, ..., h_T at each decoder step. The decoder can \"look back\" at any position.","A":"Seq2seq without attention was the standard before 2015 and successfully trained on many tasks (machine translation, summarization). The issue is quality degradation for long sequences, not impossibility.","B":"","C":"Vocabulary size and context vector dimension are different quantities. The bottleneck is about fitting sequence semantic content into d dimensions, not about the number of possible words.","D":"In standard fixed-context seq2seq: c is used to initialize the decoder hidden state h_0^dec = tanh(W_s · c). After that, each decoder step uses its own h_t^dec. 
However, c is NOT directly provided at each subsequent step in the basic version — the issue is still the information bottleneck at initialization."},"reference":"- Bahdanau et al., \"Neural Machine Translation by Jointly Learning to Align and Translate\" (2015): https://arxiv.org/abs/1409.0473"},{"section":"deep-learning","difficulty":"easy","id":"dl-e030","topicSlug":"rnn-lstm-gru","orderIndex":30,"topic":"Rnn Lstm Gru","question":"Bidirectional LSTMs run the sequence forward (left to right) and backward (right to left), concatenating hidden states. A student wants to use a bidirectional LSTM for real-time speech recognition (must transcribe as audio arrives). Why is this inappropriate?","options":{"A":"Bidirectional LSTMs require 2× the memory, making them too slow for real-time use","B":"A bidirectional LSTM requires the complete sequence before producing representations. The backward LSTM starts from the last token and moves to the first — it cannot compute backward states until the full input is available. For real-time speech, tokens arrive one at a time and the system must produce output before the sequence ends. A causal (forward-only) LSTM can process each token as it arrives. Bidirectional models are appropriate for offline tasks (post-processing, text classification) where the full sequence is available at inference time.","C":"Bidirectional LSTMs cannot be trained on audio sequences; they only work for text","D":"The backward LSTM causes the model to output words in reverse order, which is wrong for speech recognition"},"correct":"B","explanation":{"correct":"- Causality requirement: real-time systems require causal models — output at time t can only depend on inputs up to time t. Bidirectional models violate causality because the backward hidden state h_t^{bwd} depends on inputs x_{t+1}, ..., x_T.\n- Practical impact: for streaming speech recognition (e.g., voice assistants, live captions), you need a result within ~100ms of each spoken word. 
Waiting for the full sentence is unacceptable.\n- When bidirectional is appropriate: document classification (full text available), NLP understanding tasks (BERT processes the full sequence), offline audio analysis.","A":"Memory usage is a constraint but not the primary reason for avoiding bidirectional LSTM in real-time settings. Modern hardware can handle 2× memory for typical sequence lengths. The fundamental issue is the causality violation.","B":"","C":"Bidirectional LSTMs work on any sequence type (audio, text, biological sequences). The limitation is about inference-time causality, not input modality.","D":"The backward LSTM produces hidden states in reverse order internally, but the concatenated output still corresponds to each time step in forward order. The output is not reversed."},"reference":"- Schuster & Paliwal, \"Bidirectional Recurrent Neural Networks\" (1997) — original BiRNN paper"},{"section":"deep-learning","difficulty":"easy","id":"dl-e031","topicSlug":"attention-and-transformers-dl","orderIndex":31,"topic":"Attention And Transformers Dl","question":"In the Transformer attention mechanism, Q (Query), K (Key), and V (Value) matrices are derived from the same input sequence via linear projections. Describe what Q, K, and V represent conceptually using an analogy, and what the attention formula `softmax(QK^T / √d_k) V` computes.","options":{"A":"Q=weights, K=biases, V=activations — standard neural network components","B":"Information retrieval analogy: Q represents \"what I am looking for\" (the current token's query for context), K represents \"what I advertise\" (every token's summary of its content), V represents \"what I return\" (the actual content to contribute if selected). The formula: QK^T computes the compatibility between each query and every key (dot product = similarity). / √d_k scales to prevent extreme softmax values. softmax(·) converts similarities to attention weights (sum to 1). 
× V produces a weighted combination of all values — the output is a blend of all input values weighted by relevance to the query.","C":"Q is the input, K is a learned weight matrix, V is the output layer","D":"Q, K, V all contain the same information; the distinction is only to allow the model to use three separate weight matrices"},"correct":"B","explanation":{"correct":"- Database analogy: think of a soft key-value store. Q is the search query. K are the database keys. If query Q matches key K_j, return value V_j (weighted by match strength).\n- For a sentence \"The cat sat\": when computing the output for \"sat,\" the Q for \"sat\" is compared against K for \"The,\" \"cat,\" and \"sat.\" If \"cat\" has the highest Q·K similarity (because the subject relates to the verb), V_{cat} gets the highest weight.\n- Self-attention: Q, K, V all come from the same sequence — each token queries all other tokens (including itself).","A":"Q, K, V are not standard NN components. They are specific projections designed to implement soft retrieval.","B":"","C":"Q comes from the current representation being transformed, not directly the input. K and V both come from the context (same sequence in self-attention). Calling K a \"weight matrix\" confuses the projection matrix W_K with the projected key tensor K = X W_K.","D":"If Q, K, V had identical information, you could collapse them. The three separate projections allow the model to learn different aspects: what to look for (Q), what to expose (K), what to return (V). Empirically, W_Q, W_K, W_V learn very different transformations."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): https://arxiv.org/abs/1706.03762"},{"section":"deep-learning","difficulty":"easy","id":"dl-e032","topicSlug":"attention-and-transformers-dl","orderIndex":32,"topic":"Attention And Transformers Dl","question":"Transformers use positional encoding to inject sequence order information. 
Without positional encoding, what happens when you feed the sentence \"cat chases dog\" vs \"dog chases cat\" to a Transformer, and why?","options":{"A":"The Transformer correctly captures order through its causal mask, making positional encoding redundant","B":"Without positional encoding, the Transformer's self-attention produces the same output for both sentences. Self-attention computes similarity between every pair of tokens regardless of position. For \"cat chases dog\" and \"dog chases cat\": the same three token embeddings participate, just in different positions — but without position information, the model cannot distinguish position 1 from position 3. The attention scores depend only on token content (Q·K similarity), not on where the token is in the sequence. Positional encoding adds position-specific vectors to each token embedding, making \"cat at position 1\" different from \"cat at position 3.\"","C":"Order is captured by the feedforward layers; positional encoding is optional","D":"The Transformer uses order through its convolution layers; positional encoding is for CNNs only"},"correct":"B","explanation":{"correct":"- Permutation equivariance: self-attention is permutation equivariant — permuting the input permutes the output in the same way. Without positional encoding, the model treats the input as an unordered set of tokens.\n- Sentence reversal test: \"cat chases dog\" → tokens {cat, chases, dog} at positions {1,2,3}. \"dog chases cat\" → same token set {cat, chases, dog} at different positions. Without PE: QK^T produces the same attention matrix (modulo permutation). With PE: the embeddings differ (cat@pos1 ≠ cat@pos3), so attention patterns and outputs differ.\n- This also explains why transformers need positional encoding for autoregressive generation — without it, the model can't know which token is \"next.\"","A":"Causal masks prevent attending to future positions but don't encode position information. 
A causal mask with identical token embeddings still can't distinguish \"cat at position 1\" from \"cat at position 5.\"","B":"","C":"Feedforward layers are position-wise (applied independently to each position's representation). They don't mix information across positions and can't capture order.","D":"Standard Transformers have no convolution layers. Positional encoding is not a CNN concept — it's specific to attention-based models that lack inherent sequence order."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3.5 — Positional Encoding"},{"section":"deep-learning","difficulty":"easy","id":"dl-e033","topicSlug":"attention-and-transformers-dl","orderIndex":33,"topic":"Attention And Transformers Dl","question":"Self-attention has O(T²) complexity in sequence length T. For a sequence of T=512 tokens with d_model=768, estimate the number of attention score computations and explain why this becomes a bottleneck for documents with T=50,000 tokens.","options":{"A":"Attention score computations = T × d_model = 512 × 768 = 393,216. Length doesn't affect bottleneck","B":"Each attention score is a dot product between a query and a key (both of dimension d_k). For T=512: T² = 262,144 dot products, each costing d_k multiplications. The full QK^T matrix has shape (T×T). For T=50,000: 50,000² = 2.5 billion attention score pairs. Storing QK^T requires 2.5B × 4 bytes = 10 GB just for one attention head. Compute: 2.5B dot products of dimension d_k each. Both memory and compute grow quadratically with T, making standard attention impractical for long documents.","C":"Attention is O(T×d_k) not O(T²); only memory grows quadratically","D":"O(T²) complexity applies only to masked (causal) attention; bidirectional attention is O(T)"},"correct":"B","explanation":{"correct":"- QK^T matrix: for queries Q ∈ ℝ^{T×d_k} and keys K ∈ ℝ^{T×d_k}, the product QK^T ∈ ℝ^{T×T} requires T² dot products of length d_k each. 
This is O(T²d_k) computation.\n- Memory bottleneck: the T×T attention matrix must be stored for the softmax and the subsequent multiplication with V. For T=50K and float32: 50,000² × 4 bytes = 10 GB per head per batch sample.\n- Efficient attention: FlashAttention (Dao et al., 2022) computes attention in tiles to avoid materializing the full T×T matrix; Sparse attention limits each token to attending only k nearby or globally selected tokens, reducing to O(T√T) or O(T).","A":"T × d_model is not the number of attention score computations. The QK^T matrix has T×T entries, each computed from a d_k-dimensional dot product.","B":"","C":"Both memory and compute are O(T²). The QK^T matrix has T² entries (memory), and computing it requires T² dot products (compute).","D":"Both causal (masked) and bidirectional attention compute the full T×T QK^T matrix (then mask after). The complexity is O(T²) for both. Masking reduces the effective attention range but not the matrix computation."},"reference":"- Dao et al., \"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness\" (2022): https://arxiv.org/abs/2205.14135"},{"section":"deep-learning","difficulty":"easy","id":"dl-e034","topicSlug":"self-supervised-and-contrastive-learning","orderIndex":34,"topic":"Self Supervised And Contrastive Learning","question":"In SimCLR, two views of the same image are created by applying random augmentations (crop, color jitter, Gaussian blur). These form a positive pair. All other images in the batch form negative pairs. Why must the two views of the same image be semantically similar but visually different?","options":{"A":"Visual difference is required for GPU efficiency — identical views would be processed in parallel","B":"The goal is to learn representations invariant to augmentation. 
If views are too similar (e.g., only brightness changed slightly), the model can match them via low-level pixel statistics rather than semantic content — it learns \"same image = same brightness\" rather than \"same image = same object.\" If views share semantic meaning but differ visually (e.g., different crops, different colors), the model must encode the underlying semantic content to correctly pull them together, learning useful high-level representations. This is the augmentation design principle: be invariant to what you apply augmentation for, but not to what you don't apply augmentation for.","C":"Augmentations are only for data expansion; the model learns from the original images","D":"Views should be as different as possible to maximize contrastive learning difficulty"},"correct":"B","explanation":{"correct":"- Invariance design: the representation learned by SimCLR is invariant to the applied augmentations. If you apply strong color jitter, the model learns color-invariant features. This is intentional: object identity is color-invariant in many tasks.\n- Too-easy augmentations: if views are nearly identical (e.g., small brightness shift), the model learns trivial similarity (low-level pixel matching). The representation doesn't capture anything interesting.\n- Too-hard augmentations: if views are too different (semantically different), the model can't find a useful invariance and may collapse.\n- The key insight (Chen et al., 2020): careful augmentation design determines what invariances are learned and thus how well representations transfer to downstream tasks.","A":"GPU efficiency is determined by batch size and architecture, not augmentation type. Identical views would be processed the same — no GPU efficiency difference.","B":"","C":"The training signals in SimCLR come from the contrastive loss on augmented views. 
The \"original image\" is not used — only the two augmented views of each image.","D":"If views are too different (semantically different crops of different objects), the model would be trying to push together views that represent different things. This breaks the \"positive pair means same semantic content\" assumption."},"reference":"- Chen et al., \"A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)\" (2020): https://arxiv.org/abs/2002.05709"},{"section":"deep-learning","difficulty":"easy","id":"dl-e035","topicSlug":"self-supervised-and-contrastive-learning","orderIndex":35,"topic":"Self Supervised And Contrastive Learning","question":"BERT uses Masked Language Modeling (MLM) as a pretraining objective: 15% of tokens are masked, and the model predicts the masked tokens. Why is this considered self-supervised learning rather than supervised learning?","options":{"A":"MLM is supervised learning because each masked token has a correct label","B":"MLM is self-supervised because the labels are derived automatically from the data itself, requiring no human annotation. The \"labels\" for the masked tokens are the original tokens — extracted from the text. No human labels the dataset; the training signal is generated from the raw text corpus by the masking procedure itself. Self-supervised learning = automatic label generation from structure in unlabeled data. Contrast with supervised learning: human-annotated labels (e.g., sentiment = positive/negative). MLM generates millions of training examples from raw text without any human effort.","C":"MLM is unsupervised learning because the model processes unlabeled data","D":"MLM is semi-supervised because it requires some labeled examples for fine-tuning"},"correct":"B","explanation":{"correct":"- Self-supervised definition: a learning paradigm where supervisory signal is generated from the raw data itself. 
The data structure creates labels (e.g., mask a word and try to predict it, predict next word in a sequence, predict image rotation).\n- Key distinction from unsupervised: unsupervised learning (clustering, PCA) finds structure without explicit prediction targets. Self-supervised learning has explicit prediction targets but generates them automatically.\n- Key distinction from supervised: supervised learning requires human-labeled data (a person decides the label). In MLM, the algorithm decides the label (original token) by masking — no human involvement.\n- This allows BERT to pretrain on large unlabeled corpora (BooksCorpus and English Wikipedia, roughly 3.3 billion words) without human annotation.","A":"The distinction is HOW the labels are created, not whether labels exist. MLM has labels (original tokens) but they are automatically generated. Supervised learning requires human-provided labels. Both have labels; only the source differs.","B":"","C":"Unsupervised learning makes no predictions — it finds patterns, clusters, or latent variables without a prediction target. MLM has a clear prediction target (the masked token). Self-supervised is a better term.","D":"Fine-tuning uses a small amount of supervised data, but pretraining via MLM is purely self-supervised. The overall BERT workflow is \"self-supervised pretraining + supervised fine-tuning\" — only the pretraining phase is self-supervised."},"reference":"- Devlin et al., \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\" (2019): https://arxiv.org/abs/1810.04805"},{"section":"deep-learning","difficulty":"easy","id":"dl-e036","topicSlug":"graph-neural-networks","orderIndex":36,"topic":"Graph Neural Networks","question":"A social network has users as nodes and friendships as edges. Node features are [age, location_id, num_posts]. 
In a GNN, what is the difference between \"node features\" and \"node embeddings,\" and what does one round of message passing produce?","options":{"A":"Node features and node embeddings are the same thing; message passing doesn't change them","B":"Node features are the raw input attributes (age, location, num_posts) — provided before training. Node embeddings are learned vector representations produced by the GNN — they encode both the node's own features AND information from its neighborhood. One round of message passing: each node aggregates (e.g., averages) the features/embeddings of its neighbors and combines them with its own features via a learned transformation: h_v^{(1)} = σ(W · concat(h_v^{(0)}, mean({h_u^{(0)} : u ∈ N(v)}))). After one round, the embedding of user A encodes A's features plus the average features of A's direct friends.","C":"Node features are used for training, node embeddings are used for inference only","D":"Message passing only updates edge features; node features remain constant throughout the GNN"},"correct":"B","explanation":{"correct":"- Feature vs embedding distinction: features are fixed inputs (not learned); embeddings are learned representations output by the GNN. The GNN transforms features into embeddings layer by layer.\n- After 1 layer: each embedding captures 1-hop neighborhood. After 2 layers: 2-hop neighborhood. After k layers: k-hop neighborhood.\n- Why this matters: two users with identical raw features but different social circles will have different embeddings after message passing, because the neighborhood structure affects the aggregation.","A":"Message passing fundamentally changes the node representations. After one round, each node's representation incorporates its neighbors' features — this is different from the initial features.","B":"","C":"Both features and embeddings are available during training and inference. 
The distinction is about what is provided vs computed, not when they are used.","D":"GNNs update node representations (embeddings) via message passing. Edge features can also be used to weight messages (as in GAT), but node representations are the primary output of each layer."},"reference":"- Hamilton et al., \"Graph Representation Learning\" (2020 book): Chapter 5 — Graph Neural Networks"},{"section":"deep-learning","difficulty":"easy","id":"dl-e037","topicSlug":"graph-neural-networks","orderIndex":37,"topic":"Graph Neural Networks","question":"After training a GNN on a citation network, you want to classify a paper that was not in the training graph (a completely new paper added to the network). An inductive GNN (like GraphSAGE) can do this, but a transductive GNN (like vanilla GCN) cannot. What is the key difference?","options":{"A":"GraphSAGE uses more layers than GCN, allowing it to handle new nodes","B":"Transductive GCN: the weight matrix W is applied to all node features simultaneously in A·H·W form. The adjacency matrix A is fixed at training time — new nodes are not in A, so the trained model cannot be applied to them. Inductive GraphSAGE: the model learns a neighborhood aggregation function (a set of weights that aggregate from arbitrary neighbor sets). For a new node, you gather its neighbors, apply the learned aggregation function, and produce an embedding without needing to re-optimize. The key: GraphSAGE learns HOW to aggregate, while vanilla GCN learns node-specific representations tied to the training graph.","C":"GraphSAGE uses attention to handle unseen nodes, while GCN does not","D":"Transductive GCN cannot handle new nodes because it uses GPU memory, not CPU memory"},"correct":"B","explanation":{"correct":"- Transductive learning: optimization over all nodes in a fixed graph. The learned representations are specific to those nodes. 
Adding a new node to the graph would require reoptimizing the entire model.\n- Inductive learning: learn a function (the message-passing aggregator) that generalizes to unseen nodes. GraphSAGE's aggregator function can be applied to any node's neighborhood — seen or unseen.\n- Practical importance: in real-world graphs (social networks, knowledge graphs), new nodes are added continuously. Transductive models require expensive re-training for each new node; inductive models produce embeddings on-the-fly.","A":"The number of layers doesn't determine transductive vs inductive behavior. A 2-layer GraphSAGE and a 2-layer GCN differ architecturally, not in depth.","B":"","C":"GAT (Graph Attention Network) is attention-based and can also be inductive. The inductive/transductive distinction is about whether the model generalizes to new nodes via a learned aggregation function, not about whether attention is used.","D":"GPU/CPU memory has nothing to do with transductive vs inductive learning. This is an architectural and mathematical concept."},"reference":"- Hamilton et al., \"Inductive Representation Learning on Large Graphs (GraphSAGE)\" (2017): https://arxiv.org/abs/1706.02216"},{"section":"deep-learning","difficulty":"easy","id":"dl-e038","topicSlug":"transfer-learning","orderIndex":38,"topic":"Transfer Learning","question":"You fine-tune a pretrained ResNet-50 on a small (2,000 example) dataset of X-ray images. You freeze layers 1-4 (early) and fine-tune only layer 5 (late) + the classifier head. A colleague suggests freezing all layers instead (pure feature extraction). Which approach is better, and under what condition would you switch the recommendation?","options":{"A":"Feature extraction (all frozen) is always better for small datasets because fine-tuning causes overfitting","B":"Partial fine-tuning (freeze early, update late layers + head) is generally better when the source domain (ImageNet) and target domain (X-rays) differ significantly. 
Early layers (edges, textures) generalize well and should be frozen. Late layers (high-level semantics) are domain-specific and benefit from updating. If the dataset were extremely small (< 100 examples) or if X-ray images were very similar to ImageNet, full feature extraction might be preferred — then fine-tuning late layers would overfit.","C":"Always fine-tune all layers for any domain gap, regardless of dataset size","D":"The number of frozen layers doesn't matter — use the same strategy regardless of dataset size"},"correct":"B","explanation":{"correct":"- Layer transferability: Yosinski et al. (2014) showed that early layers (low-level feature detectors: edges, textures) transfer across very different domains, while late layers are increasingly task-specific.\n- Freezing strategy heuristic:\n- Small dataset + similar domain → freeze most layers (feature extraction)\n- Small dataset + different domain → freeze early, fine-tune late\n- Large dataset + similar domain → fine-tune all (with discriminative LR)\n- Large dataset + different domain → fine-tune all with low LR\n- With 2,000 X-ray examples: enough to update late layers without overfitting, but not enough to update all 25M ResNet-50 parameters.","A":"With significant domain gap (ImageNet vs X-rays), frozen late layers will produce domain-specific ImageNet features that are suboptimal for X-ray analysis. Some adaptation is beneficial.","B":"","C":"Fine-tuning all layers with 2,000 examples and 25M parameters would risk severe overfitting. The dataset has roughly one example per 12,500 parameters (25M / 2,000), which is extremely low coverage.","D":"The dataset size fundamentally determines how many parameters can be reliably updated. 
This is not an aesthetic choice."},"reference":"- Yosinski et al., \"How transferable are features in deep neural networks?\" (2014): https://arxiv.org/abs/1411.1792"},{"section":"deep-learning","difficulty":"easy","id":"dl-e039","topicSlug":"transfer-learning","orderIndex":39,"topic":"Transfer Learning","question":"LoRA (Low-Rank Adaptation) adds matrices B ∈ ℝ^{d×r} and A ∈ ℝ^{r×d} to a frozen pretrained weight W ∈ ℝ^{d×d}. The adapted weight is W' = W + BA. For d=768 (BERT-base), r=8, calculate the % parameter reduction vs full fine-tuning of W. Why is B initialized to zero?","options":{"A":"LoRA reduces parameters by 50% because rank-8 is half of rank-16","B":"Full fine-tuning W: d² = 768² = 589,824 parameters. LoRA: A has r×d = 8×768 = 6,144 params; B has d×r = 768×8 = 6,144 params; total LoRA = 12,288. Reduction: (589,824 - 12,288) / 589,824 ≈ 97.9% fewer params. B=0 initialization: BA = 0·A = 0, so W' = W + 0 = W initially. The model starts inference-equivalent to the pretrained model with no disruption. As training progresses, BA accumulates the task-specific delta. Starting from BA≠0 would perturb the pretrained representations from the first step.","C":"LoRA reduces parameters by 8× because the rank is r=8","D":"B is initialized to zero because LoRA cannot train B; only A is trainable"},"correct":"B","explanation":{"correct":"- Parameter count: LoRA adds two small matrices rather than updating all d² parameters of W. For d=768, r=8: 2×768×8 = 12,288 << 589,824.\n- 97.9% reduction: this means only ~2.1% as many parameters need to be trained vs full fine-tuning, while the frozen W retains all pretrained knowledge.\n- B=0 initialization: ensures BA=0 at the start → W' = W → model produces the same outputs as the pretrained model before any training steps. 
This is a clean initialization with no disruption.\n- A is initialized randomly (e.g., Gaussian) so that as B starts training, BA develops non-trivial values.","A":"The reduction is based on parameter count, not rank ratio. Full rank-768 matrix has 768² = 589K params. LoRA rank-8 has 2×768×8 = 12K. The ratio is 589K/12K = 48×, not 2×.","B":"","C":"Rank r=8 means the LoRA update BA has rank at most 8 (r << d). The reduction factor is d²/(2×d×r) = d/(2r) = 768/16 = 48×, not 8×. (The ratio d/r = 768/8 = 96× would count only one of the two matrices.)","D":"Both A and B are trainable parameters in LoRA. B is initialized to zero for the clean start; A is initialized randomly. During training, both are updated."},"reference":"- Hu et al., \"LoRA: Low-Rank Adaptation of Large Language Models\" (2022): https://arxiv.org/abs/2106.09685"},{"section":"deep-learning","difficulty":"easy","id":"dl-e040","topicSlug":"transfer-learning","orderIndex":40,"topic":"Transfer Learning","question":"After fine-tuning a GPT-2 model (117M parameters) on customer support data, you evaluate on the original GPT-2 benchmarks (HellaSwag, WinoGrande) and find performance dropped significantly. What is this phenomenon, and what is the simplest architectural fix that prevents it while still allowing task-specific adaptation?","options":{"A":"Performance drop is expected; fine-tuned models cannot maintain general capabilities","B":"This is catastrophic forgetting: fine-tuning on customer support data overwrites the pretrained weights, degrading general language capabilities. GPT-2's weights encoded broad language knowledge; high-LR updates for the narrow customer support distribution push the weights toward this specific domain, overwriting patterns for general text. Simplest fix: LoRA. By freezing GPT-2's 117M parameters and only training small low-rank adapter matrices (≈0.5M params for r=8), the pretrained weights remain intact, so general benchmarks are unaffected whenever the adapter is disabled. 
Task-specific knowledge is learned entirely in the adapters. Alternatively: very low LR (1e-5) with early stopping reduces (but doesn't eliminate) forgetting.","C":"The performance drop is caused by data preprocessing, not weight updates","D":"Catastrophic forgetting only occurs in continual learning; fine-tuning is immune to it"},"correct":"B","explanation":{"correct":"- Forgetting mechanism: each gradient step updates all 117M weights toward the customer support loss minimum. Weights that were optimized for general language understanding are shifted. After enough steps with high LR, general capabilities degrade.\n- LoRA solution: W_pretrained is frozen (never modified). W' = W_pretrained + BA. Only BA (≈0.5M params for GPT-2) is updated. General benchmarks use W_pretrained paths directly → performance unchanged. Task-specific knowledge is stored in BA.\n- This is the core appeal of parameter-efficient fine-tuning: adapt to new tasks without touching the base model's knowledge.","A":"This is the motivation for parameter-efficient fine-tuning research — the problem is real but solvable. LoRA, adapters, and prompt tuning are all designed to allow task adaptation without forgetting.","B":"","C":"Data preprocessing artifacts would affect training performance. General benchmark degradation is specifically caused by weight modification (gradient updates), not data issues.","D":"Catastrophic forgetting is not exclusive to continual learning settings. 
Any time a pretrained model is fine-tuned on a distribution different from pretraining, there is risk of forgetting, proportional to the LR, number of steps, and distribution shift magnitude."},"reference":"- Hu et al., \"LoRA: Low-Rank Adaptation of Large Language Models\" (2022): https://arxiv.org/abs/2106.09685\n- McCloskey & Cohen, \"Catastrophic Interference in Connectionist Networks\" (1989)"},{"section":"deep-learning","difficulty":"hard","id":"dl-h001","topicSlug":"introduction-to-neural-networks","orderIndex":1,"topic":"Introduction To Neural Networks","question":"You train a shallow neural network on the XOR problem with 2 inputs, 2 hidden units (ReLU), and 1 output (sigmoid). After 10,000 SGD steps, training loss plateaus at 0.25 instead of converging near 0. You verify the network has sufficient capacity. What are the two most likely causes, and what would each fix look like?","options":{"A":"The XOR problem is not solvable by any neural network; the plateau is expected","B":"Cause 1 — Symmetry from identical weight initialization: if both hidden neurons start with identical weights, they remain identical throughout training (symmetry problem), effectively giving the network only 1 unique hidden unit. One unique hidden unit cannot solve XOR (it creates one linear boundary, insufficient to separate XOR's 4 points). Fix: use random weight initialization (e.g., Xavier). Cause 2 — Learning rate issues: too high → oscillates around the solution; too low → extremely slow convergence, appearing plateaued. The XOR loss landscape has narrow valleys — SGD with the wrong LR gets stuck. Fix: use an adaptive optimizer (Adam) or perform LR search. 
Verify fix: after proper init + optimizer, XOR should converge to near-zero loss in <1000 steps.","C":"Plateau at 0.25 means the model has converged to the globally optimal solution","D":"2 hidden units are insufficient; XOR requires at least 4 hidden units to solve"},"correct":"B","explanation":{"correct":"- Symmetry-broken capacity: with 2 identical neurons, the effective hidden layer has rank 1 — one linear boundary, equivalent to logistic regression. XOR is not linearly separable; logistic regression cannot solve it.\n- XOR convergence test: the correct solution has weights like W₁=[[1,-1],[-1,1]], b₁=[0,0], W₂=[1,1], b₂=-0.5 (or equivalent). Loss should approach 0.\n- LR diagnosis: if a full loss curve shows oscillations around 0.25, the LR is too high. If it decreases extremely slowly (loss: 0.693 → 0.500 → 0.400 → 0.320 ... over 10K steps), LR is too low or the network is stuck in a flat region.\n- 2 ReLU hidden units with proper initialization CAN solve XOR — it needs two intersecting half-planes.","A":"XOR IS solvable by any neural network with ≥2 hidden units and non-linear activations. The XOR problem's unsolvability applies only to single-layer perceptrons (linear classifiers).","B":"","C":"Cross-entropy loss of 0.25 with binary labels means the model's probability outputs are around 0.78 for correct class. This is not optimal for XOR (which has clear binary boundaries).","D":"2 hidden units (ReLU) are sufficient for XOR. A single hidden unit is not enough; 2 is the minimum. More units make optimization easier but are not required."},"reference":"- Minsky & Papert, \"Perceptrons\" (1969) — XOR non-linearity requirement\n- Goodfellow et al., \"Deep Learning\" (2016), Chapter 6.1"},{"section":"deep-learning","difficulty":"hard","id":"dl-h002","topicSlug":"backpropagation","orderIndex":2,"topic":"Backpropagation","question":"You implement a neural network in a framework that uses reverse-mode automatic differentiation. 
The forward pass computes: `z = relu(W @ x); loss = cross_entropy(softmax(V @ z), y)`. During the backward pass, you notice that gradient norms for W are 1000× larger than for V. The network has L=20 layers (not shown). What is the most likely cause, and how does this differ from the vanishing gradient problem?","options":{"A":"This is the vanishing gradient problem — W is in an early layer so its gradients vanish","B":"This is the exploding gradient problem for W. In deep networks, the gradient of the loss with respect to W (an early layer) involves the Jacobian product: ∂L/∂W = (∂L/∂z_L) × Π_{l=k}^{L} (∂z_{l+1}/∂z_l) × ∂z_k/∂W. If the spectral norm of each Jacobian ∂z_{l+1}/∂z_l > 1, the product grows exponentially. With 20 layers where each Jacobian has spectral norm 1.4: 1.4^20 ≈ 836 — consistent with 1000× amplification. Vanishing gradients (spectral norm < 1) cause early-layer gradients to shrink to near zero, preventing learning. Exploding gradients cause extreme updates that corrupt weights. The asymmetry (W >> V) is because W is in an earlier layer with more Jacobian multiplications than V (the last layer). Fix: gradient clipping, weight normalization, or skip connections.","C":"The 1000× difference is expected and indicates W is learning faster than V — a feature, not a bug","D":"The gradient difference is caused by the ReLU activation at W's layer; switch to sigmoid to equalize gradients"},"correct":"B","explanation":{"correct":"- Jacobian chain growth: ∂z_{l+1}/∂z_l = W_{l+1} × diag(ReLU'(z_l)). ReLU': 0 or 1. The matrix W_{l+1} × diag(mask) has spectral norm ≈ spectral_norm(W_{l+1}) × sparsity_factor. If weights are initialized slightly large (norm > 1), gradients explode.\n- Asymmetry: V is the last layer (1 Jacobian multiplication for V's gradient). W is 19 layers earlier (19 Jacobian multiplications). 
The amplification hits earlier layers harder.\n- Clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` scales all gradients proportionally when their total norm exceeds max_norm, preventing corrupt updates.","A":"Vanishing gradients cause W's gradient to be 1000× SMALLER than V (not larger). Gradients shrink as they propagate backward through many layers. The described scenario (W >> V) is the OPPOSITE of vanishing gradients.","B":"","C":"1000× gradient difference is a training instability indicator. The optimizer step for W would be 1000× larger than for V, causing W to be wildly overupdated while V barely changes. This is not \"faster learning\" — it's gradient explosion.","D":"Switching to sigmoid from ReLU would worsen the situation. Sigmoid derivatives are at most 0.25, causing vanishing gradients. ReLU with derivative {0,1} preserves gradient magnitude better than sigmoid."},"reference":"- Pascanu et al., \"On the difficulty of training recurrent neural networks\" (2013): https://arxiv.org/abs/1211.5063"},{"section":"deep-learning","difficulty":"hard","id":"dl-h003","topicSlug":"optimizers","orderIndex":3,"topic":"Optimizers","question":"You train a large Transformer (1B parameters) on a TPU cluster using AdamW with β₁=0.9, β₂=0.95, ε=1e-8, lr=3e-4, weight_decay=0.1. At step 50,000, the loss spikes from 2.1 to 8.3 and never recovers. Loss scaling is used (scale=65536). You have saved checkpoints. What is the most likely cause, and what is the systematic debugging procedure?","options":{"A":"The loss spike is caused by a corrupt data batch; skip that batch and continue","B":"Most likely cause: gradient overflow in FP16 causing NaN/Inf propagation. At step 50K, the loss scale (65536) multiplied by an unusually large gradient may have caused FP16 overflow (>65504) → gradients become Inf/NaN → weight update is Inf/NaN → weights corrupted → loss never recovers. 
Systematic debugging: Step 1: check gradient norm history — was there a spike in grad norm at step 50K? Step 2: check if loss scale was halved at step 50K (PyTorch AMP's GradScaler does this automatically after overflow, but if the weights are already corrupted, recovery is impossible). Step 3: reload the last good checkpoint (step 49,000) and inspect: monitor loss scale values, gradient norms per layer, and weight norm spikes. Step 4: if grad overflow confirmed, reduce the initial loss scale or use BF16 (larger dynamic range, no loss scaling needed).","C":"The spike is caused by learning rate being too high; reduce lr and continue from the spike","D":"The loss spike means the model has escaped a local minimum and is exploring a better region"},"correct":"B","explanation":{"correct":"- FP16 overflow signature: gradient norm jumps to Inf or NaN at a specific step. The optimizer update corrupts all affected weights in one step. No recovery is possible from the corrupted weights — the training trajectory diverges permanently.\n- AMP GradScaler behavior: it automatically detects Inf/NaN gradients and skips the optimizer step (preserving weights). But if NaN propagates INTO the model weights before the scaler detects it, corruption occurs.\n- BF16 advantage: BF16 has the same exponent range as FP32 (8-bit exponent) vs FP16 (5-bit exponent). Max BF16 = ~3.4×10^38 vs FP16 = 65504. BF16 virtually eliminates overflow — why modern TPU training defaults to BF16.\n- Checkpoint reload is the only recovery option once weights are corrupted.","A":"A single corrupt data batch causes a temporary loss spike that recovers over a few steps as the bad update is averaged out. A permanent loss spike (never recovers) is characteristic of weight corruption, not a bad batch.","B":"","C":"Continuing from the spike point with reduced LR is ineffective — if the weights are already corrupted (NaN values), the loss function output is undefined. 
Must restore from checkpoint.","D":"Loss spikes during language model training at step 50K are a well-known failure mode (see Chinchilla, PaLM training reports), not beneficial exploration. They require careful monitoring and are architectural/numerical issues, not optimization features."},"reference":"- Chowdhery et al., \"PaLM: Scaling Language Modeling with Pathways\" (2022): https://arxiv.org/abs/2204.02311 — training instability analysis"},{"section":"deep-learning","difficulty":"hard","id":"dl-h004","topicSlug":"activation-functions","orderIndex":4,"topic":"Activation Functions","question":"A production vision model uses SiLU (Swish) activations: f(x) = x·σ(x). During ONNX export for edge deployment, you discover the target hardware lacks a native sigmoid instruction. A colleague suggests approximating SiLU with a piecewise linear function. What properties of SiLU must the approximation preserve for the exported model to match training performance within 1% accuracy, and what is the risk if only the value (not the derivative) is matched?","options":{"A":"Only the value needs to match; derivatives are irrelevant after training","B":"For inference (not further training), only the forward-pass value needs to be matched at inference time. However, accuracy within 1% requires: (1) Value accuracy at the activation's operating range: SiLU(x) ≈ x for x >> 0; ≈ 0 for x << 0; the non-trivial region is roughly x ∈ [-3, 3]. The piecewise approximation must closely match SiLU in this range. (2) Smoothness near x=0: SiLU has a smooth minimum near x ≈ -1.28 (minimum value ≈ -0.28). A piecewise linear approximation with segments captures this only if a breakpoint is placed near the minimum. (3) Risk of derivative mismatch: if the model was trained with SiLU and deployed with an approximation that has different curvature, the activations see a shifted distribution. For deep networks, this distribution shift compounds across layers. 
Even if per-activation error is 0.5%, compounding over 50 layers can cause output shift >> 1%. Test: compare layer-wise activation distribution between original and approximated model on calibration data.","C":"Replace SiLU with ReLU entirely; the accuracy difference will be less than 1%","D":"The approximation is irrelevant; only the final softmax temperature matters for accuracy"},"correct":"B","explanation":{"correct":"- Inference vs training derivative needs: during inference, no backpropagation occurs. The derivative of the activation function is not needed. Only the forward-pass output values matter.\n- Compounding error: in a deep network, each activation's approximation error creates a small distribution shift in the next layer's inputs. With 50 layers, a per-layer relative error of ε compounds: (1+ε)^50 ≈ e^{50ε}. For ε=0.01: e^{0.5} ≈ 1.65 — 65% distribution shift. The final layer may see dramatically different input statistics than expected.\n- Calibration fix: post-training quantization tools (TensorRT, ONNX Runtime) use calibration data to adjust activation ranges and minimize this compounding error.","A":"While technically true for a single activation in isolation, the compounding distribution shift in deep networks means value-only approximation can still cause significant accuracy loss if the approximation has systematic bias in the operating range.","B":"","C":"ReLU and SiLU have fundamentally different behaviors: SiLU is non-monotonic (has a minimum at x≈-1.28) while ReLU is monotonic. The network's weights were trained assuming SiLU's non-monotonic behavior. Replacing with ReLU changes the learned function significantly (>1% accuracy loss for well-trained models).","D":"Softmax temperature affects calibration (confidence scores) but not classification accuracy (argmax of logits). 
Activation distribution shifts do affect the logit values themselves."},"reference":"- Ramachandran et al., \"Searching for Activation Functions (Swish/SiLU)\" (2017): https://arxiv.org/abs/1710.05941"},{"section":"deep-learning","difficulty":"hard","id":"dl-h005","topicSlug":"weight-initialization","orderIndex":5,"topic":"Weight Initialization","question":"You train a 50-layer pre-LN Transformer from scratch. At initialization (before any training), you measure the output logit variance across the vocabulary for each input sample — it is 800× larger than expected from a standard Kaiming init. You trace the cause to the residual stream. What initialization strategy does GPT-2 use to address this, and why does depth amplify variance in the residual stream?","options":{"A":"GPT-2 uses zero initialization for all layers; larger variance is expected at depth 50","B":"Residual variance amplification: in a Pre-LN Transformer with skip connections, the residual stream at depth l is: x_l = x_0 + Σ_{i=1}^{l} F_i(LayerNorm(x_{i-1})). Each F_i adds variance: Var(x_l) ≈ Var(x_0) + l × Var(F_i). At layer 50 with l=50 blocks: variance grows as O(l) = 50× if each sub-layer contributes equal variance. For 2 sub-layers per block (attention + FFN) and depth 50: 100 sub-layers → 100× variance amplification. GPT-2 fix: scale residual projections by 1/√(2N), where N is the number of residual layers. The residual output matrix (c_proj in attention, the second linear in FFN) is initialized with std = 0.02 / √(2N). This pre-scales each sub-layer's contribution so that Var(Σ F_i) = constant regardless of depth.","C":"The variance is caused by LayerNorm; removing LayerNorm solves the issue","D":"This is normal behavior; large initial logit variance doesn't affect training"},"correct":"B","explanation":{"correct":"- Mathematical derivation: if each F_i(·) has output variance σ² (with standard init), and they are approximately independent, then Var(x_l) = Var(x_0) + l × σ². 
For l=50 blocks (100 sub-layers): Var(x_50) = Var(x_0) + 100σ² >> Var(x_0) if σ² is non-trivial.\n- GPT-2 scaling: in common GPT-2 implementations, the c_proj weight in attention and the second linear in MLP are initialized with std=0.02/√(2N), where N=number of layers. This scales each sub-layer's output standard deviation by 1/√(2N), i.e. its variance by 1/(2N), making the total variance over the 2N sub-layers Var(Σ F_i) ≈ 2N × σ²/(2N) = σ², independent of depth.\n- Why it matters: large initial logit variance means the initial softmax is highly peaked (initial predictions are very confident in random directions). This causes large initial gradients and unstable early training.","A":"Zero initialization for all layers would cause the symmetry problem. GPT-2 uses random initialization with a depth-scaled standard deviation for specific layers.","B":"","C":"LayerNorm normalizes the inputs to each sub-layer, preventing the sub-layer's INPUTS from having extreme variance. But the OUTPUTS of the sub-layers (the residual additions) still accumulate variance in the residual stream. Removing LayerNorm would worsen training, not fix the variance problem.","D":"Large initial logit variance causes: (1) overconfident initial predictions → large initial CE loss → large initial gradients → potential gradient explosion; (2) the model may converge to overconfident solutions. Proper initialization is critical for stable deep Transformer training."},"reference":"- GPT-2 paper: Radford et al., \"Language Models are Unsupervised Multitask Learners\" (2019) — initialization section\n- Wang & Komatsuzaki, \"GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model\" (2021) — explains the 1/√(2N) scaling"},{"section":"deep-learning","difficulty":"hard","id":"dl-h006","topicSlug":"regularization-and-normalization","orderIndex":6,"topic":"Regularization And Normalization","question":"You train a ViT-Large with BatchNorm instead of LayerNorm. With batch size B=512, training is stable and achieves 83% top-1 on ImageNet. 
You then try to deploy with batch size B=1 (single image inference) and find accuracy drops to 61%. Explain the exact mechanism causing the 22-point drop, and describe two ways to fix this without retraining.","options":{"A":"The accuracy drop is caused by missing gradients at batch size 1","B":"$17","C":"Batch size 1 is insufficient for backpropagation, causing the accuracy drop","D":"The drop is caused by dropout at inference; call model.eval() to fix it"},"correct":"B","explanation":{"correct":"- BN at inference: BN uses stored running_mean and running_var, NOT the current batch's statistics. For B=1, the single sample's statistics are irrelevant — BN still uses the population-level running stats from training.\n- The real cause: during training with B=512, running stats are computed from diverse batches that closely approximate the true data distribution. The stored stats accurately capture the feature distribution. At B=1, the ISSUE is not that BN computes differently — it always uses running stats at inference. The issue is if the training set statistics poorly represent the test data (distributional mismatch, or if accumulation was unstable).\n- Why ViT specifically: ViT uses patch embeddings. BatchNorm across patches (not across the batch) can be problematic. LayerNorm, which normalizes per token, is better suited for variable per-image statistics.","A":"Backpropagation is not used at inference — there are no gradients at inference regardless of batch size.","B":"","C":"Batch size 1 is a valid inference setting for all neural networks. No backpropagation occurs at inference.","D":"model.eval() disables Dropout and switches BN to use running stats (not batch stats). If the model is already in eval mode, calling it again changes nothing. 
The issue is the running stats themselves, not Dropout."},"reference":"- Ioffe, \"Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models\" (2017): https://arxiv.org/abs/1702.03275"},{"section":"deep-learning","difficulty":"hard","id":"dl-h007","topicSlug":"cnn-architectures","orderIndex":7,"topic":"Cnn Architectures","question":"A ResNet-50 model is used for transfer learning on a 5-class satellite imagery task (512×512 input). Standard ResNet-50 expects 224×224 input. You resize all images to 224×224 and fine-tune, achieving 87% accuracy. A colleague fine-tunes with 512×512 input (keeping all ResNet conv layers, only retraining the fully connected head at native resolution). Their model achieves 91%. What specific architectural property of ResNet enables the 512×512 model to work without retraining conv layers, and what would break if BatchNorm layers were frozen?","options":{"A":"ResNet works at any resolution because its parameters encode pixel values","B":"Fully convolutional property: ResNet's convolutional layers (conv1, conv2-5) are translation-equivariant and resolution-independent — the same filters slide across any spatial size. A filter trained on 224×224 features detects the same edge/texture/object-part patterns at 512×512. What changes: (1) the spatial output map is larger (e.g., before the final avgpool: 7×7 for 224 input → 16×16 for 512 input); (2) global average pooling (GAP) aggregates over the larger spatial map — producing the same 2048-d embedding regardless of spatial size. BatchNorm frozen issue: BN's running_mean and running_var were computed during training on 224×224 inputs. At 512×512, the features at each spatial position have different statistics — the network is computing activations over a larger receptive field at each position. If BN is frozen, the normalization uses 224-statistics for 512-activations → potentially biased normalization at every layer. 
Fix: unfreeze BN to recompute running stats during fine-tuning.","C":"ResNet processes 512×512 by splitting the image into four 256×256 tiles","D":"The 512×512 model works because the fully connected head automatically adapts to any spatial input"},"correct":"B","explanation":{"correct":"- Resolution independence of convolution: conv(x, W) is defined for any input size. The same weight W slides across all positions. This is the foundational property that makes CNNs usable for different input sizes.\n- GAP effect: Global Average Pooling averages over all spatial positions: GAP(x) = mean_spatial(x). For 512×512 input, the spatial map before GAP is larger (more positions to average), but the output is still 2048-d. The network correctly handles this.\n- Accuracy improvement (87% → 91%): higher resolution provides more detailed spatial information about satellite features (road edges, building outlines) that are lost at 224×224. The CNN extracts finer features at 512×512.","A":"Parameters encode filter patterns (e.g., edge detection), not pixel values. The same filter works for any resolution because it detects local patterns regardless of the overall image size.","B":"","C":"ResNet processes the full image in a single forward pass. There is no internal tiling mechanism.","D":"The fully connected head is fixed size (2048 → num_classes). It's the GAP layer that makes the output resolution-independent, not the FC head. Without GAP, an FC head would require a fixed spatial input."},"reference":"- Long et al., \"Fully Convolutional Networks for Semantic Segmentation\" (2015): https://arxiv.org/abs/1411.4038 — resolution independence"},{"section":"deep-learning","difficulty":"hard","id":"dl-h008","topicSlug":"rnn-lstm-gru","orderIndex":8,"topic":"Rnn Lstm Gru","question":"You train a 2-layer stacked LSTM (hidden size H=512) for language modeling. During training, you notice that the forget gate activations f_t average 0.97 across all time steps. 
A researcher says \"this is a problem — the forget gate should be closer to 0.5 to allow selective forgetting.\" You disagree. Who is correct, and what does f_t ≈ 0.97 indicate about what the LSTM has learned?","options":{"A":"The researcher is correct; forget gate near 1 means the LSTM is not learning to forget","B":"You are correct. f_t ≈ 0.97 indicates the LSTM has learned to maintain long-range dependencies — the cell state c_t ≈ 0.97 × c_{t-1} + new_info. At each step, 97% of previous cell state is retained. For language modeling, much information (subject of a sentence, topical context) must persist for many steps. A forget gate near 0.5 would cause the cell state to decay to half its value every step — effective memory of only ~14 steps (0.5^14 ≈ 6×10^{-5}). Language requires context windows much longer than 14 steps. The LSTM has learned: \"retain most information continuously; selectively add new information.\" This is correct behavior for long-range language modeling. A forget gate near 0.5 would be appropriate for tasks requiring rapid context switching.","C":"Both gates' values are irrelevant; only the final hidden state h_T matters","D":"f_t ≈ 0.97 causes exploding gradients through the cell state; this will destabilize training"},"correct":"B","explanation":{"correct":"- Forget gate learning: the forget-gate bias is initialized to 1 in many implementations (so the gate starts around σ(1) ≈ 0.73, biased toward retention) specifically because long-range memory is more useful initially. During training, the gate may move closer to 1 for language tasks.\n- Effective memory horizon: with f_t = 0.97, the effective memory (exponential decay) has time constant τ = -1/ln(0.97) ≈ 33 steps. Information from 33 steps ago is attenuated to e^{-1} ≈ 37% of its original value — a useful working memory for language.\n- Task-dependent: for speech modeling with short phoneme dependencies, f_t might converge lower. 
For language, high forget gate is expected and correct.","A":"A forget gate near 1 means the model RETAINS information (not forgets). The gate name is counterintuitive — \"forget gate ≈ 1\" means \"don't forget.\" For long-range language dependencies, this is desirable.","B":"","C":"The hidden state h_t = o_t ⊙ tanh(c_t) is computed from the cell state c_t at every step. The cell state is what provides long-term memory. The forget gate's value directly determines the long-range gradient flow through c_t.","D":"LSTM gradient through the cell state: ∂c_t/∂c_{t-1} = f_t. For f_t=0.97, the gradient is multiplied by 0.97 at each step. This is much better than vanilla RNN (which has products of full Jacobians). Cell state gradient at step T for step 1: 0.97^T → small but non-zero. This is the key innovation — the additive cell update preserves gradients far better than the multiplicative RNN recurrence."},"reference":"- Hochreiter & Schmidhuber, \"Long Short-Term Memory\" (1997)\n- Gers et al., \"Learning to Forget: Continual Prediction with LSTM\" (1999) — forget gate initialization"},{"section":"deep-learning","difficulty":"hard","id":"dl-h009","topicSlug":"attention-and-transformers-dl","orderIndex":9,"topic":"Attention And Transformers Dl","question":"You implement multi-head attention from scratch. After careful testing, you discover that for long sequences (T=2048), the softmax of QK^T/√d_k produces attention weights extremely close to one-hot (one position gets ≈1.0, all others ≈0). The model fails to aggregate information across positions. Explain the mathematical cause, and why increasing d_k from 64 to 256 worsens the problem rather than alleviating it.","options":{"A":"Increasing d_k from 64 to 256 improves attention; the one-hot behavior is a data problem","B":"The problem is softmax saturation from large dot-product magnitudes. For Q, K ~ N(0, 1) (standard initialization), Q_i · K_j ~ N(0, d_k) — variance grows linearly with d_k. 
For d_k=64: typical dot products have std = √64 = 8. softmax receives inputs with std=8; the maximum logit is ~3×8=24; softmax(24 vs 0) ≈ e^{24}/e^{24} ≈ near-one for the max, near-zero for all others. The 1/√d_k scaling in QK^T/√d_k divides by √64=8: effective std becomes 1 — softmax inputs are manageable. If you increase d_k to 256 WITHOUT adjusting initialization: Q_i · K_j ~ N(0, 256); std = 16; even with 1/√256=1/16 scaling, after dividing by 16, std = 1 still. So the formula is correct IF the scaling is applied. The bug is likely a missing 1/√d_k scaling factor when d_k changed. If scaling is always applied, d_k=256 should work the same as d_k=64.","C":"One-hot attention is desirable; it means the model has learned to focus precisely","D":"The problem is the temperature; increase softmax temperature to flatten the distribution"},"correct":"B","explanation":{"correct":"- Dot product variance: Q ∈ ℝ^{d_k}, K ∈ ℝ^{d_k} with iid N(0,1): dot product Q·K = Σᵢ QᵢKᵢ, which is a sum of d_k products of N(0,1) variables. Variance = d_k (sum of d_k terms each with variance 1). Std = √d_k.\n- Scaling: 1/√d_k brings the variance to 1. Softmax of inputs with std=1 is well-behaved.\n- If 1/√d_k is applied correctly, changing d_k shouldn't affect softmax saturation. The described worsening (d_k=256 → worse) suggests the 1/√d_k scaling is NOT being updated when d_k changes (a common implementation bug when hardcoding the scaling factor).","A":"The one-hot behavior is a mathematical consequence of high-variance dot products, not a data problem. It's deterministically solvable by the 1/√d_k scaling.","B":"","C":"Useful attention aggregates information from multiple positions (soft attention). For tasks like translation (\"the professor who...\" needs to link \"professor\" to the subject of a relative clause), one-hot attention means only one position's information is used, losing context. 
One-hot is occasionally correct (e.g., copying) but should be learned, not forced.","D":"Softmax with temperature T divides the logits by T. The 1/√d_k factor already plays this role, with √d_k acting as the temperature. \"Increasing temperature\" means using 1/(√d_k × T) with T>1, which makes attention more uniform — this is the correct direction (flatter attention = less one-hot). But the right fix is using the 1/√d_k scaling correctly, not an additional temperature."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3.2.1 — Scaled Dot-Product Attention, explanation of 1/√d_k"},{"section":"deep-learning","difficulty":"hard","id":"dl-h010","topicSlug":"forward-propagation","orderIndex":10,"topic":"Forward Propagation","question":"You profile a Transformer's forward pass on an A100 GPU and find that a linear layer (d_model=4096 → d_ff=16384, batch=1, seq=512) achieves only 12% of theoretical FLOPs utilization (MFU). A colleague achieves 58% MFU on the same hardware with a different batch size. Explain why compute efficiency is so low at batch=1, and what specific memory hierarchy property causes the bottleneck.","options":{"A":"The bottleneck is the softmax operation; replace it with linear attention","B":"The bottleneck is memory bandwidth, not compute. The linear layer multiplies X (512×4096) by W (4096×16384). FLOPs = 2 × 512 × 4096 × 16384 ≈ 68B. Time to load W from HBM: W has 4096×16384 × 2 bytes (FP16) = 128 MB. A100 HBM bandwidth = 2 TB/s. Load time = 128 MB / 2 TB/s = 64 μs. A100 peak compute = 312 TFLOPS (BF16). Compute time at peak = 68 GFLOPs / 312 TFLOPs/s ≈ 220 μs, so on paper the matmul time exceeds the weight-load time (220 μs vs 64 μs). The arithmetic intensity = FLOPs / bytes = 68G / 128M = 531 FLOP/byte. A100 ridge point (compute/bandwidth) = 312T / 2T = 156 FLOP/byte. Since 531 > 156, in theory this should be compute-bound. 
The catch: at batch=1, seq=512, the output X (small) is reused, but W must be loaded once per forward pass. With larger batch, the FLOPs increase while the weight loading stays constant → higher arithmetic intensity → compute-bound → higher MFU.","C":"The bottleneck is Python interpreter overhead; use TorchScript to fix it","D":"12% MFU is normal for batch size 1; no optimization is possible"},"correct":"B","explanation":{"correct":"- Roofline model: operations below the \"ridge point\" (arithmetic intensity < compute/bandwidth ratio) are memory-bound. Above the ridge point: compute-bound.\n- Batch size effect: with batch=32, seq=512: FLOPs = 32 × 512 × 4096 × 16384 × 2 ≈ 2.2T. Weight loading: still ~128 MB (reused for all 32 samples). Arithmetic intensity = 2.2T / 128M = 17,188 FLOP/byte >> 156 ridge point. Now strongly compute-bound.\n- This is why LLM inference is memory-bound (small batches) and training is compute-bound (large batches). Systems like continuous batching (vLLM) increase the effective batch size to improve GPU utilization.","A":"Softmax is not the bottleneck for a standard linear layer. The profiling shows the linear layer itself at 12% MFU — softmax is a different operation.","B":"","C":"TorchScript reduces Python overhead (relevant for small operations where Python is the bottleneck). For 68 GFLOPs operations running on a GPU, Python overhead is negligible (<1 μs for kernel launch vs 64 μs for memory bandwidth).","D":"12% MFU for a production model is too low — significant optimization is possible. 
Batching multiple requests (as vLLM does), quantization (4-bit weights cut weight traffic to a quarter of FP16), and weight streaming optimizations can improve this substantially."},"reference":"- Karpathy, \"The GPU Computational Bottleneck and Why Batch Size Matters\" — llm.c discussions\n- Williams et al., \"Roofline: An Insightful Visual Performance Model for Multicore Architectures\" (2009)"},{"section":"deep-learning","difficulty":"hard","id":"dl-h011","topicSlug":"loss-and-cost-functions","orderIndex":11,"topic":"Loss And Cost Functions","question":"You train an object detection model with Focal Loss: FL(p_t) = -(1 - p_t)^γ log(p_t), with γ=2. During training on a dataset with 1000 background examples for every 1 foreground example, the loss is dominated by background. A colleague sets γ=5 to down-weight easy backgrounds more aggressively. At γ=5, training loss decreases faster initially but final mAP is 3 points lower than γ=2. What is the mathematical mechanism causing the degradation?","options":{"A":"Higher γ always improves Focal Loss; the mAP drop is caused by insufficient training epochs","B":"With γ=5, easy backgrounds (p_background ≈ 0.999, p_t ≈ 0.999): weight = (1-0.999)^5 = (0.001)^5 = 10^{-15}. These examples contribute almost zero gradient signal. Hard foreground (p_t = 0.5): weight = (0.5)^5 = 0.031. For γ=2: easy background weight = (0.001)^2 = 10^{-6}; hard foreground weight = (0.5)^2 = 0.25. The ratio hard/easy changes from 0.25/10^{-6} = 250,000 at γ=2 to 0.031/10^{-15} = 3.1×10^{13} at γ=5. At γ=5, the loss is computed from an extremely small effective sample — only the hardest examples contribute meaningful gradients. This creates high-variance gradient estimates (few examples dominate) and overfits to the specific hard examples in each mini-batch. 
With γ=2, a broader set of semi-hard examples provides more stable, generalizing gradients.","C":"γ=5 is equivalent to hard example mining; the mAP drop is expected and acceptable","D":"The issue is the learning rate; reduce LR when using γ=5"},"correct":"B","explanation":{"correct":"- Gradient variance analysis: at γ=5, only examples with p_t ∈ [0.3, 0.7] receive substantial gradients. For 1000:1 imbalance, only a tiny fraction of the mini-batch contributes usable signal.\n- Effective batch size reduction: with γ=5 and 99.9% of examples being background with high confidence, the effective learning signal comes from << 0.1% of examples. The mini-batch gradient estimate has extremely high variance.\n- Optimal γ: Lin et al. (2017) showed γ=2 is optimal for COCO detection. γ ∈ [0.5, 5] were tested; γ=2 provided the best mAP. Higher γ reduces loss too aggressively on easy examples, hurting gradient quality.","A":"Faster initial loss decrease does not imply better final mAP. A model can rapidly minimize the extremely-down-weighted easy backgrounds while poorly learning to distinguish hard cases due to noisy gradient estimates.","B":"","C":"Hard example mining (OHEM) selects a fixed number of hard examples per batch, providing stable sample counts. γ=5 provides no such stability — the effective sample count varies per batch based on what the model finds easy/hard at each step.","D":"Lower LR with γ=5 would slow down already-noisy gradient updates, not address the fundamental gradient variance problem from sparse effective samples."},"reference":"- Lin et al., \"Focal Loss for Dense Object Detection (RetinaNet)\" (2017): https://arxiv.org/abs/1708.02002 — Table 1: γ comparison"},{"section":"deep-learning","difficulty":"hard","id":"dl-h012","topicSlug":"ann-architectures","orderIndex":12,"topic":"Ann Architectures","question":"A team trains a wide MLP (1 hidden layer, 65536 neurons) vs a deep MLP (8 hidden layers, 256 neurons each). 
Both have approximately equal parameter counts. On MNIST, both achieve ~99% accuracy. On a hierarchical image composition task (parts → objects → scenes), the deep model significantly outperforms the wide model. Explain from the circuit complexity perspective why depth provides an exponential advantage for hierarchical functions, and what the \"number of linear regions\" argument says.","options":{"A":"Deep models outperform wide models because deep models have more parameters","B":"Circuit complexity argument: a hierarchical function f(x) = h₃(h₂(h₁(x))) where each hᵢ extracts features from the previous level cannot be computed efficiently by a shallow circuit without exponential width. A deep ReLU network with d layers can compute functions that require exponential (in d) width for any shallow network. Formally: functions composable as depth-k circuits require O(2^k) neurons in a 1-hidden-layer network but only O(k × poly(n)) in a depth-k network. Number of linear regions: a ReLU network with L layers and N total neurons can produce O((N/L)^(L-1) × N) linear regions in the input space. Equivalently, deep networks produce exponentially more linear regions (decision boundaries) than shallow networks of the same parameter count. For parts→objects→scenes: each layer learns a higher-level composition. The wide model must represent ALL compositions in a single layer — exponentially harder than sequential composition.","C":"The advantage is numerical, not structural; deeper models have better gradient flow","D":"Width and depth are equivalent for any function; the task difference is due to training, not architecture"},"correct":"B","explanation":{"correct":"- Montufar et al. (2014) formal result: a deep ReLU network with L hidden layers, each width n, creates at least (n/⌊n/2⌋)^{(L-1)} × (1/2 × Σᵢ binomial(n-1, i)) linear regions. The key factor is exponential in L. 
A 1-hidden-layer network of the same parameters creates only polynomial regions.\n- Compositional bias: deep networks naturally implement hierarchical computations (layer 1: edges, layer 2: textures, layer 3: parts, layer 4: objects). Wide shallow networks must encode all hierarchical relations in a single transformation.\n- MNIST exception: MNIST digits have minimal hierarchical structure (simple strokes), so wide and deep models perform similarly. Hierarchical tasks (scenes, language syntax) benefit from depth.","A":"Both networks have approximately equal parameter counts, ruling out parameter count as the explanation.","B":"","C":"Gradient flow is a training concern. The exponential advantage is a representational (architectural) property — even with perfect optimization, the shallow model needs exponential width.","D":"Barron's theorem and circuit complexity theory formally prove that certain functions cannot be efficiently represented shallowly. The advantage is not just empirical."},"reference":"- Montufar et al., \"On the Number of Linear Regions of Deep Neural Networks\" (2014): https://arxiv.org/abs/1402.1869"},{"section":"deep-learning","difficulty":"hard","id":"dl-h013","topicSlug":"self-supervised-and-contrastive-learning","orderIndex":13,"topic":"Self Supervised And Contrastive Learning","question":"In SimCLR, the InfoNCE loss is: L = -log(exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ)). You run two experiments: (A) batch size N=256, τ=0.07 and (B) batch size N=2048, τ=0.07. Experiment B achieves significantly higher downstream accuracy. 
Beyond more negative samples, what specific learning dynamics does the larger batch size change, and what does the temperature τ control about the difficulty of negatives?","options":{"A":"Larger batch size only adds more negative samples; the learning dynamics are the same","B":"$18","C":"Larger batch causes worse results because false negatives increase","D":"Temperature τ=0.07 has no effect on learning; only τ=0 and τ=∞ are meaningfully different"},"correct":"B","explanation":{"correct":"- False negative concern: with N=2048 from ImageNet, ~2000 negatives are from different classes. But with 1000 classes, ≈2047/1000 ≈ 2 negatives are from the same class as the anchor (false negatives). Studies show this reduces accuracy slightly, but the benefit of hard negatives dominates.\n- Temperature interpretation: L = -sim(z_i, z_j)/τ + log Σ_{k≠i} exp(sim(z_i, z_k)/τ). Low τ focuses learning on high-similarity pairs. If the hardest negative has sim=0.6 and the positive has sim=0.8: with τ=0.07: exp((0.6-0.8)/0.07) = exp(-2.86) ≈ 0.057. With τ=1: exp(0.6-0.8) = exp(-0.2) ≈ 0.82. Low τ makes this negative weigh much less relative to the positive; among the negatives, though, low τ concentrates the gradient on those most similar to the anchor (high-similarity negative → high loss → large gradient).\n- Chen et al. (2020) ablation: larger batches give higher ImageNet linear-evaluation top-1 accuracy, with the largest gains when training for fewer epochs.","A":"The learning dynamics DO change beyond count: the distribution of difficulty of negatives changes, gradient variance changes, and the interaction with temperature changes.","B":"","C":"False negatives (same-class treated as negative) are a real concern but empirically don't outweigh the benefits of more hard negatives. Studies that explicitly handle false negatives (e.g., Debiased Contrastive Learning) improve results but start from the strong N=2048 baseline.","D":"Temperature fundamentally reshapes the loss landscape. 
τ→0 approaches hard-max (only the hardest negative matters). τ→∞ makes all negatives equally weighted (uniform, no hard negative focus). τ=0.07 is a strong hard-negative-focusing temperature."},"reference":"- Chen et al., \"A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)\" (2020): https://arxiv.org/abs/2002.05709 — Appendix B: batch size and temperature ablations"},{"section":"deep-learning","difficulty":"hard","id":"dl-h014","topicSlug":"graph-neural-networks","orderIndex":14,"topic":"Graph Neural Networks","question":"Two molecules: (A) benzene (cyclic, all carbons connected in a ring) and (B) cyclohexane (same ring structure but saturated, no double bonds). Their molecular graphs have identical topology. A 3-layer standard GCN predicts identical properties for both. A more expressive GNN correctly distinguishes them. What fundamental limitation of standard message-passing GNNs causes this failure, and what is the Weisfeiler-Leman (WL) test connection?","options":{"A":"GCNs cannot process cyclic graphs; use only tree-structured GNNs for molecules","B":"Standard message-passing GNNs (MPNNs) are bounded in expressiveness by the 1-Weisfeiler-Leman (1-WL) graph isomorphism test. The 1-WL test: iteratively color each node by a hash of its color + sorted neighbor colors. Two graphs are distinguished if their final color histograms differ. If the 1-WL test fails to distinguish two graphs, no standard MPNN can distinguish them either (Xu et al. 2019, GIN paper). For benzene vs cyclohexane: both have the same ring topology and all nodes have the same degree (2 bonds each). The ONLY structural difference is in edge features (double bonds in benzene vs single bonds in cyclohexane). Standard GCNs with only node features and a uniform adjacency matrix cannot incorporate edge type — they see the same graph. Fix: (1) Add edge features to message passing: m_{ij} = φ(h_i, h_j, e_{ij}) where e_{ij} is the edge type (bond order). 
(2) Use higher-order GNNs (k-WL, PPGN) that track subgraph structures.","C":"The issue is insufficient depth; add more GCN layers to distinguish the molecules","D":"Standard GCN distinguishes benzene from cyclohexane correctly; the premise is wrong"},"correct":"B","explanation":{"correct":"- 1-WL bound: Xu et al. (2019) proved that if two graphs are indistinguishable by the 1-WL test (their iterative coloring produces the same histogram), then any sum-aggregation MPNN assigns them the same representation.\n- Benzene vs cyclohexane topologically: C₆H₆ (benzene) and C₆H₁₂ (cyclohexane) have different molecular formulas — but if the GNN only sees carbon nodes and undirected bonds, both appear as a 6-cycle with same-degree nodes. The aromatic ring information lives in bond type (not captured by standard adjacency).\n- Fix with edge features: DirectedMP or DMPNN (message passing on directed edges) with bond-type features correctly distinguishes benzene (aromatic/double) from cyclohexane (single bonds).","A":"GCNs can process cyclic graphs. The adjacency matrix correctly represents cycles. The issue is expressiveness (which substructures are captured), not the presence of cycles.","B":"","C":"Adding more layers doesn't solve the expressiveness bound. With identical node features and the same adjacency, all layers produce the same aggregated representations for both molecules, regardless of depth.","D":"A standard GCN with only carbon atom identity as node features and binary adjacency CANNOT distinguish benzene from cyclohexane — they have the same topology and node types. 
This is a known limitation motivating edge-featured and higher-order GNNs."},"reference":"- Xu et al., \"How Powerful are Graph Neural Networks (GIN)\" (2019): https://arxiv.org/abs/1810.00826 — WL expressiveness theorem"},{"section":"deep-learning","difficulty":"hard","id":"dl-h015","topicSlug":"transfer-learning","orderIndex":15,"topic":"Transfer Learning","question":"You fine-tune LLaMA-3-8B on a legal contract analysis task using LoRA (r=16, α=32, target_modules=[q_proj, v_proj]). After fine-tuning, the model excels at legal tasks but its general reasoning performance (MMLU) drops from 68% to 54%. Identify two distinct mechanisms causing MMLU degradation and propose a fine-tuning strategy that limits MMLU degradation to < 2% while maintaining legal task performance.","options":{"A":"MMLU degradation is unavoidable with LoRA; accept the 14-point drop","B":"$19","C":"The degradation is caused by insufficient training data; add more legal examples","D":"Fine-tune k_proj instead of q_proj and v_proj; key projection doesn't affect reasoning"},"correct":"B","explanation":{"correct":"- Attention pattern modification: q_proj and v_proj directly control what the model attends to (q_proj) and what information is extracted from attended positions (v_proj). Legal text has very different attention patterns than MMLU reasoning (e.g., attending back to defined terms in contracts vs attending to relevant context in multiple-choice).\n- EWC (Elastic Weight Consolidation): Fisher information matrix F estimates how important each parameter is for the original task. EWC loss = L_legal + λ Σᵢ Fᵢ × (θᵢ - θ*ᵢ)². This constrains LoRA parameters critical for MMLU reasoning.\n- Practical solution: modern PEFT libraries (HuggingFace PEFT) allow task vectors — train separate LoRA adapters for each task, then interpolate. Serve the legal adapter for legal tasks, keep base model for MMLU.","A":"MMLU drops larger than 5% indicate significant forgetting. 
LoRA is specifically designed to minimize forgetting — a 14% drop suggests misconfigured LoRA (too high rank/scale, wrong target modules). It IS avoidable.","B":"","C":"More legal data would increase legal performance but would also increase the magnitude of LoRA updates, potentially increasing MMLU interference. The problem is the DIRECTION of adaptation, not the quantity of training.","D":"k_proj (key projection) directly participates in the query-key dot product that computes attention scores — it's just as important to reasoning as q_proj. There's no principled reason to expect k_proj modification to be less disruptive."},"reference":"- Hu et al., \"LoRA: Low-Rank Adaptation of Large Language Models\" (2022): https://arxiv.org/abs/2106.09685\n- Kirkpatrick et al., \"Overcoming catastrophic forgetting in neural networks (EWC)\" (2017): https://arxiv.org/abs/1612.00796"},{"section":"deep-learning","difficulty":"hard","id":"dl-h016","topicSlug":"neurons-and-perceptrons","orderIndex":16,"topic":"Neurons And Perceptrons","question":"A network uses the formula h = σ(W₂ σ(W₁ x + b₁) + b₂). You want to prove this 2-layer MLP can represent any function f: ℝ → ℝ on [0,1]. A student cites the Universal Approximation Theorem (UAT) and says it requires infinite neurons. You argue it requires only O(1/ε²) neurons for a function in a specific smoothness class. What is the smoothness condition that makes efficient approximation possible, and what does Barron's theorem say?","options":{"A":"Any function in L² requires O(1/ε) neurons; smoothness doesn't affect neuron count","B":"Barron's theorem (1993): A function f: ℝⁿ → ℝ is in Barron's class if its frequency-domain L¹ norm is finite: C_f = ∫ ||ω|| |f̂(ω)| dω < ∞ (where f̂ is the Fourier transform). For such functions: a 1-hidden-layer network with m neurons achieves L² approximation error ε = O(C_f / √m). Equivalently: to achieve error ε, you need m = O(C_f² / ε²) neurons — polynomial in 1/ε, not exponential. 
For comparison: without Barron's condition (non-smooth functions or high-frequency content), achieving ε error with a fixed-degree polynomial approximation may require exponentially many terms. Smoothness condition: finite C_f means f's Fourier representation has decaying high-frequency content — the function doesn't oscillate wildly (bounded variation in Fourier domain).","C":"UAT requires infinite neurons in general; no bounded-neuron guarantee exists","D":"Barron's theorem applies only to sigmoid activations; ReLU networks have no such guarantee"},"correct":"B","explanation":{"correct":"- C_f interpretation: functions with large C_f (high Fourier L¹ norm) have lots of high-frequency content (sharp corners, rapid oscillations). These require more neurons to approximate. Smooth functions (small C_f) are approximable with fewer neurons.\n- Dimension-free result: Barron's theorem is notable because the m = O(C_f²/ε²) bound does NOT depend on the input dimension n. This is \"the blessing of Barron's class\" — deep learning avoids the curse of dimensionality for this function class.\n- Practical connection: why do neural networks work well in practice? Natural language, images, and audio signals have decaying Fourier spectra (they're \"smooth enough\" to be in Barron's class). Pure adversarial examples often exploit high-frequency perturbations — they leave Barron's class.","A":"The claim \"any L² function requires O(1/ε)\" is false. L² includes highly non-smooth functions (e.g., random noise) that require exponentially many neurons. The O(1/ε²) bound is specific to Barron's class.","B":"","C":"The original UAT (Cybenko, Hornik) only proves existence (infinite neurons are sufficient). Barron's theorem provides the constructive bound. UAT doesn't say \"infinite neurons are necessary\" — Barron's provides the polynomial guarantee for smooth functions.","D":"Barron's original theorem used sigmoid activations as a constructive proof. 
The result has been extended to ReLU networks by subsequent work (e.g., Barron & Klusowski 2018). The class of expressible functions is similar for both."},"reference":"- Barron, \"Universal Approximation Bounds for Superpositions of a Sigmoidal Function\" (1993): IEEE Transactions on Information Theory\n- Bach, \"Breaking the Curse of Dimensionality with Convex Neural Networks\" (2017): https://arxiv.org/abs/1412.8690"},{"section":"deep-learning","difficulty":"hard","id":"dl-h017","topicSlug":"backpropagation","orderIndex":17,"topic":"Backpropagation","question":"You debug a custom attention implementation. The forward pass is correct but loss.backward() gives incorrect gradients for W_Q. You verify using the finite difference method: ∂L/∂W_Q ≈ (L(W_Q + εe_ij) - L(W_Q - εe_ij)) / (2ε). The finite difference gradient is 0.153 but autograd gives 0.089. The discrepancy is consistent across multiple inputs. What are the three most likely implementation errors in the custom backward pass for W_Q?","options":{"A":"The finite difference check has numerical errors; trust autograd only","B":"Three likely errors in the custom backward pass for W_Q: (1) Missing factor from scaling: attention = softmax(QK^T / √d_k) V. The backward pass for W_Q must propagate through both the 1/√d_k scaling AND through the softmax Jacobian. If the implementation applies the softmax gradient but forgets to multiply by 1/√d_k, the gradient is scaled by √d_k too large (or too small if divided instead of multiplied). (2) Softmax Jacobian error: ∂softmax(z)/∂z = diag(s) - s·sᵀ where s=softmax(z). A common error: computing only the diagonal (treating softmax as element-wise) and ignoring the s·sᵀ outer product. This is the most common softmax backward error. 
(3) Incorrect gradient accumulation through multi-head attention: if W_Q is shared across heads (or gradients from multiple heads are summed), forgetting to sum gradients from all heads or dividing instead of summing causes systematic underestimation (here the backward pass returns about 58% of the finite-difference value).","C":"The finite difference check is correct; autograd has a bug in PyTorch","D":"W_Q gradient of 0.089 is correct; the finite difference approximation is too coarse"},"correct":"B","explanation":{"correct":"- Gradient check validation: finite difference is the gold standard for custom backward passes. If FD and the custom backward disagree consistently (beyond numerical noise), the custom backward implementation has a bug. The reverse is never the case for standard differentiable operations.\n- Softmax Jacobian: ∂L/∂z_i = Σ_j (∂L/∂s_j)(∂s_j/∂z_i) = Σ_j (∂L/∂s_j)(s_j(δ_{ij} - s_i)) = s_i(∂L/∂s_i - Σ_j s_j ∂L/∂s_j). This requires the full outer product, not just the diagonal. Omitting s·sᵀ produces a systematically wrong gradient.\n- Scaling factor: the 1/√d_k factor must be correctly propagated. ∂(QK^T/√d_k)/∂Q = K/√d_k. Missing this factor would cause the gradient to be √d_k × too large, not too small. So a missing 1/√d_k would explain the backward pass exceeding FD by a factor of √d_k, not the underestimate observed here (0.089 < 0.153).","A":"Finite difference check on a correctly computed loss function is numerically accurate for ε values like 1e-5 (for float64). A consistent discrepancy (0.153 vs 0.089) is too large to be numerical error.","B":"","C":"PyTorch's autograd is extremely well-tested and correct for standard operations. The custom backward pass is user-implemented; that's where the bug is.","D":"ε=1e-5 for finite differences typically gives 8-digit accuracy for smooth functions in float64. 
A discrepancy of 0.064 (42% error) is far beyond numerical error and indicates a logic bug."},"reference":"- CS231n, \"Computing Gradients: Numerical Gradient Checking\" — gradient check methodology\n- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3.2 — attention derivation"},{"section":"deep-learning","difficulty":"hard","id":"dl-h018","topicSlug":"regularization-and-normalization","orderIndex":18,"topic":"Regularization And Normalization","question":"RMSNorm (used in LLaMA) vs LayerNorm (used in BERT): RMSNorm(x) = x / RMS(x) × γ where RMS(x) = √(1/n Σxᵢ²). LayerNorm(x) = (x - μ) / σ × γ + β. RMSNorm removes the mean-centering step and has no β (shift) parameter. A researcher claims \"RMSNorm is strictly worse because it loses the centering invariance.\" Construct a counterargument using the re-centering invariance property of transformers.","options":{"A":"The researcher is correct; LayerNorm's centering is necessary for all architectures","B":"Counterargument: in transformer architectures with residual connections, the network is shift-invariant to the bias β. The output of a transformer block is x' = x + F(Norm(x)). If LayerNorm's learned β introduces a constant shift to every token's representation, this shift propagates through the residual: x'_l = x_0 + Σᵢ F_i(LN(xᵢ)). Since F contains linear layers (W × (·) + b), any constant shift from β can be absorbed into the bias b of the subsequent linear layer. Therefore β is redundant and provides no additional expressiveness — the subsequent linear bias can represent the same function. For μ centering: in a residual network, the mean component of x can also be absorbed into downstream biases. RMSNorm is computationally simpler (no mean subtraction), numerically more stable (denominator is always ≥ 0, no cancellation errors), and achieves equivalent expressiveness. 
LLaMA and Mistral empirically match or exceed BERT's performance with RMSNorm.","C":"RMSNorm and LayerNorm are identical; β and μ have no effect on outputs","D":"RMSNorm is better because it has fewer parameters and always outperforms LayerNorm"},"correct":"B","explanation":{"correct":"- Redundancy of β in residual networks: for any constant vector c = β (from LayerNorm), the following linear layer W×(·) + b produces W×c + b — the β contribution is equivalent to a constant additive term to b. Since b is already a learned parameter, β adds no new expressive capacity; it's absorbed.\n- Mean subtraction redundancy: x - μ removes the mean component. But the subsequent linear W×(·) + b already has a bias b that can shift the mean. Again, centering is not strictly necessary when biases exist in downstream layers.\n- Computational benefit: RMSNorm avoids computing the mean (one pass through the data), only computes the root-mean-square (also one pass). Marginally faster, and more numerically stable (no chance of cancellation errors from (x-μ) when x ≈ μ).","A":"The \"centering invariance\" argument assumes the network lacks other components that can compensate. In transformer blocks with residual connections and bias terms, the centering is indeed redundant.","B":"","C":"LayerNorm with β=0 and μ subtraction still differs from RMSNorm in behavior: LayerNorm normalizes to zero mean, RMSNorm normalizes by RMS only (mean can be non-zero after RMSNorm). They produce different outputs. But the LEARNED MODEL can achieve equivalent final representations.","D":"RMSNorm empirically matches or exceeds LayerNorm — it's not universally better for all architectures. 
For non-residual networks, the redundancy argument breaks down and mean-centering may matter."},"reference":"- Zhang & Sennrich, \"Root Mean Square Layer Normalization\" (2019): https://arxiv.org/abs/1910.07467"},{"section":"deep-learning","difficulty":"hard","id":"dl-h019","topicSlug":"cnn-architectures","orderIndex":19,"topic":"Cnn Architectures","question":"You design a mobile CNN using depthwise separable convolutions. The model achieves 78% top-1 on ImageNet but runs at 15ms on a mobile CPU (target: < 10ms). A profiler shows the 1×1 pointwise convolutions consume 82% of the latency, even though they have fewer FLOPs than standard convolutions. Explain the counterintuitive result and what architectural modification addresses this.","options":{"A":"Reduce the number of channels; fewer channels always reduce latency proportionally","B":"Counterintuitive cause: FLOPs ≠ latency. The 1×1 pointwise conv computes C_in × C_out multiplications per pixel. For MobileNet with C=512 channels: 512×512 = 262,144 multiplications per pixel. While this is fewer FLOPs than a standard 3×3 convolution (9×512×512 ≈ 2.36M multiplications per pixel), it is far more than the 3×3 depthwise (9×512 = 4,608), and the 1×1 conv accesses W (512×512 weights = 1 MB in float32) that must be loaded from cache for each position. The 3×3 depthwise filter is 9×512 = 4,608 weights (≈18 KB in float32) — fits in L1 cache. The 1×1 pointwise conv has low arithmetic intensity (few FLOPs per byte loaded) — it's memory-bandwidth bound, not compute-bound. Fix 1: Bottleneck design (MobileNetV2 inverted residual): expand channels for the 3×3 depthwise (compute-efficient at expanded dim), then project down with 1×1 (smaller C_out). 
Fix 2: Channel shuffling (ShuffleNet): replace full 1×1 with grouped 1×1 + channel shuffle to maintain cross-group mixing at reduced compute.","C":"The latency issue is caused by Python overhead in the forward pass; use TorchScript","D":"1×1 convolutions are always faster than 3×3 convolutions; the profiler is incorrect"},"correct":"B","explanation":{"correct":"- Arithmetic intensity of 1×1: for each output pixel: FLOPs = 2 × C_in × C_out. Memory: C_in × C_out × 4 bytes (weights, float32). Arithmetic intensity = 2 FLOPs / 4 bytes = 0.5 FLOP/byte. Taking a CPU roofline ridge point of roughly 20 FLOP/byte (the exact value depends on the hardware), 0.5 FLOP/byte sits 40× below the ridge, deep in the memory-bound regime.\n- 3×3 depthwise: FLOPs = 2 × 9 × C (independent channels). Memory: 9 × C × 4 bytes. Arithmetic intensity = 2 × 9 × C / (9 × C × 4) = 2/4 = 0.5 FLOP/byte — also memory-bound! But for C = 512 the weight tensor is about 57× smaller (9 × C vs C × C weights) → fits in L1 cache → effective bandwidth is much higher for depthwise.\n- MobileNetV2 fix: by using an inverted bottleneck (expand → depthwise → project), the depthwise operates at high channel count (high FLOP but cached weights), and the projection reduces channels (small weight tensor, fits in cache).","A":"Reducing channels proportionally reduces both FLOPs and latency — but may reduce accuracy. The optimization question is efficiency (latency-per-FLOP), not just latency.","B":"","C":"TorchScript reduces Python overhead (~microseconds). At 15ms total, Python overhead is < 1%. 
The bottleneck is memory bandwidth (10+ms in 1×1 convs).","D":"Smaller kernel size (1×1 vs 3×3) doesn't guarantee faster execution when the operation is memory-bandwidth bound and the weight tensor doesn't fit in cache."},"reference":"- Sandler et al., \"MobileNetV2: Inverted Residuals and Linear Bottlenecks\" (2018): https://arxiv.org/abs/1801.04381"},{"section":"deep-learning","difficulty":"hard","id":"dl-h020","topicSlug":"rnn-lstm-gru","orderIndex":20,"topic":"Rnn Lstm Gru","question":"You train a seq2seq model with attention for machine translation (English→German). The model achieves 28 BLEU on the test set. During error analysis, you find the model performs well on short sentences (< 15 tokens, 31 BLEU) but poorly on long sentences (> 40 tokens, 18 BLEU). The attention mechanism uses additive (Bahdanau) attention. What specific attention pathology causes long-sentence degradation, and what architectural change addresses it without switching to Transformers?","options":{"A":"Long sentences require more parameters; add more LSTM layers to fix the length degradation","B":"$1a","C":"Long-sentence degradation is inevitable for all sequence models; accept lower BLEU","D":"The fix is to increase the attention dimensionality from 256 to 1024; more expressive alignment fixes the length problem"},"correct":"B","explanation":{"correct":"- Attention score diffusion: for uniform attention: each weight = 1/T. With T=50: max possible attention weight = 1 (one-hot). Mean = 1/50 = 0.02. The gradient signal for updating the encoder position that should receive attention is proportional to the attention weight. Diffused attention → small gradient signal to the \"right\" position → slower/weaker convergence.\n- Coverage mechanism: maintains a coverage vector c_t = Σ_{t' 65504 → overflow in FP16 without the max-subtraction trick. 
This is why attention logits can cause overflow for long sequences.\n- FlashAttention also avoids materializing the T×T matrix in HBM memory (stores only tiles in SRAM), solving both the numerical stability AND the memory bandwidth problem simultaneously.","A":"FP16 can achieve < 0.1% relative error vs FP32 for well-conditioned operations. The 8% error is a specific numerical pathology from the softmax, not an intrinsic FP16 limitation.","B":"","C":"Masking (adding -∞ to masked positions before softmax) doesn't cause the error for unmasked positions. The overflow/underflow issue is from the softmax computation itself over long sequences.","D":"FP16 precision (mantissa) affects the accuracy of each operation. The precision issue here is RANGE (exponent overflow), not mantissa bits. Longer sequences cause larger accumulated sums → range overflow."},"reference":"- Dao et al., \"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness\" (2022): https://arxiv.org/abs/2205.14135 — Algorithm 1: online softmax"},{"section":"deep-learning","difficulty":"hard","id":"dl-h030","topicSlug":"transfer-learning","orderIndex":30,"topic":"Transfer Learning","question":"You fine-tune CLIP (dual encoder: image encoder + text encoder) for a specialized medical image-text retrieval task. After fine-tuning with learning rate 1e-4 on 10,000 image-text pairs, the medical retrieval performance improves from 31% R@1 to 78% R@1. However, the model loses its zero-shot classification ability on general ImageNet-1k (from 76% → 23%). A colleague suggests task arithmetic: merge the fine-tuned model with the original CLIP using weight interpolation. 
What is the theoretical basis for task arithmetic, and predict the accuracy tradeoff curve at interpolation coefficient α ∈ {0, 0.25, 0.5, 0.75, 1.0}?","options":{"A":"Weight interpolation always degrades both tasks; do not interpolate","B":"$22","C":"Task arithmetic requires retraining both models jointly; interpolation is not valid","D":"The optimal α is always 0.5; no other coefficient can improve on this"},"correct":"B","explanation":{"correct":"- Linear mode connectivity: two models fine-tuned from the same pre-trained checkpoint often lie in the same loss basin. Linear interpolation between them stays within low-loss regions for BOTH tasks (loss barriers are small), enabling smooth tradeoff curves.\n- Task vector composition: τ_medical = θ_medical_ft - θ_CLIP. Adding ατ_medical to θ_CLIP scales the medical adaptation — α=0.5 adds half the medical specialization while retaining most of the original structure.\n- Practical use: WiSE-FT (Wortsman et al.) showed this interpolation consistently improves distribution shift robustness. Applied to CLIP, it achieves better ImageNet+OOD tradeoffs than either model alone.","A":"Empirically, task arithmetic (weight interpolation from the same pre-trained init) consistently produces points on the Pareto frontier between the two tasks — better than either endpoint for at least one task at no cost to the other.","B":"","C":"Ilharco et al. show that task vectors can be computed without joint retraining. The linear operation (θ_pretrained + ατ) is all that's needed.","D":"The optimal α depends on the relative importance of each task. For medical deployment prioritizing retrieval, α=0.75 may be better. For general-purpose with medical enhancement, α=0.25 may be better. 
There is no universal optimal α."},"reference":"- Ilharco et al., \"Editing Models with Task Arithmetic\" (2023): https://arxiv.org/abs/2212.04089\n- Wortsman et al., \"Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy and Robustness\" (2022): https://arxiv.org/abs/2203.05482"},{"section":"deep-learning","difficulty":"hard","id":"dl-h031","topicSlug":"activation-functions","orderIndex":31,"topic":"Activation Functions","question":"You deploy a quantized model (INT8 weights, INT8 activations) that uses GELU activation. The quantization calibration was done with 100 ImageNet batches. Post-quantization accuracy drops from 82.3% (FP32) to 74.1% (INT8) — an 8.2-point drop, much larger than typical (<1% for well-quantized models). You suspect the GELU activation is the culprit. What specific quantization challenges does GELU pose compared to ReLU, and what quantization-aware technique mitigates this?","options":{"A":"GELU and ReLU have identical quantization behavior; the accuracy drop is from weight quantization","B":"$23","C":"Switch to symmetric INT8; this fixes the GELU quantization issue","D":"Use INT16 for GELU activations and INT8 elsewhere; mixed precision solves the problem"},"correct":"B","explanation":{"correct":"- Asymmetric vs symmetric quantization: symmetric INT8 maps [-R, R] to [-127, 127]. For GELU output range [-0.17, max_val], suppose max_val ≈ 10 in a well-trained model. A symmetric range of ±10 wastes the interval [-10, -0.17] (below GELU's minimum): about 98% of the negative half of the quantization grid, close to half of all INT8 levels, covers a region with no activations.\n- QAT mechanism: during forward pass, insert fake quantize: x_q = round(x / scale) × scale. Backward: the straight-through estimator (STE) passes gradients through as if x_q = x. 
The model learns to keep activations in quantization-friendly ranges.\n- For GELU specifically: QAT teaches the model to avoid the problematic smooth transition region (around x=0) or to produce activations with distributions that INT8 can represent well — sometimes by changing the scaling of inputs to GELU.","A":"Activation quantization is a major source of accuracy loss, especially for non-ReLU activations. GELU's non-monotonic range and smooth curvature make it more challenging to quantize than ReLU.","B":"","C":"Symmetric INT8 WORSENS the GELU problem by forcing a symmetric range that wastes quantization resolution on negative GELU values rarely encountered in practice.","D":"Using INT16 for activations would mostly solve the precision issue (16 bits provides 256× more resolution than INT8), but INT16 operations are 4× slower than INT8 on most inference hardware. This defeats the purpose of INT8 quantization."},"reference":"- Jacob et al., \"Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference\" (2018): https://arxiv.org/abs/1712.05877"},{"section":"deep-learning","difficulty":"hard","id":"dl-h032","topicSlug":"backpropagation","orderIndex":32,"topic":"Backpropagation","question":"You compute the gradient of a matrix multiplication Y = X @ W with respect to W. The correct gradient is ∂L/∂W = X^T @ ∂L/∂Y. You implement this in a custom CUDA extension. After testing, you find the gradients are correct when X has shape (B, D_in) but incorrect when X has shape (B, T, D_in) (batched sequence). You didn't modify the kernel. What is the mathematical cause, and what is the correct gradient formula for the 3D case?","options":{"A":"The gradient formula is the same for 2D and 3D; the error must be elsewhere","B":"Mathematical cause: for 2D case, Y = X @ W where X: (B, D_in), W: (D_in, D_out), Y: (B, D_out). The gradient: ∂L/∂W = X^T @ ∂L/∂Y. Shape check: X^T: (D_in, B), ∂L/∂Y: (B, D_out) → result: (D_in, D_out) ✓. 
For 3D case, Y = X @ W where X: (B, T, D_in), W: (D_in, D_out), Y: (B, T, D_out). Broadcasting: the matrix multiplication is applied independently for each (b, t) pair. The gradient: ∂L/∂W = Σ_{b,t} x_{b,t}^T ⊗ ∂L/∂y_{b,t}. In tensor notation: ∂L/∂W = einsum('bti, bto -> io', X, ∂L/∂Y) OR: reshape X to (B×T, D_in), ∂L/∂Y to (B×T, D_out), then (B×T, D_in)^T @ (B×T, D_out) = (D_in, D_out). The custom kernel using X^T @ ∂L/∂Y in 3D gets: X^T: (B, D_in, T), ∂L/∂Y: (B, T, D_out) → batched matmul gives (B, D_in, D_out) — NOT summed over the batch. The kernel either ignores the batch reduction (takes only slice [0]) or sums incorrectly.","C":"3D tensors require the transpose of ∂L/∂Y rather than X^T; swap the operands","D":"The 3D gradient requires dividing by T (the sequence length) for normalization"},"correct":"B","explanation":{"correct":"- The 2D formula X^T @ dY works because the batch dimension contracts naturally in the 2D matrix multiply. For 3D, naively applying the same formula without summing over (B, T) gives per-batch-per-step gradients, not the accumulated gradient across all positions.\n- Correct implementation: `torch.einsum('bti,bto->io', X, dY)` correctly sums over both B and T dimensions. Alternatively: `X.reshape(-1, D_in).T @ dY.reshape(-1, D_out)`.\n- This is a common bug in custom backward implementations: forgetting that gradients w.r.t. shared parameters (W is shared across all B×T applications) must be summed, not averaged, over all the instances that used W.","A":"The gradient formula IS different for 3D — the reduction dimensions change. The 2D formula doesn't account for the T sequence dimension.","B":"","C":"Swapping operands (dY^T @ X instead of X^T @ dY) doesn't produce the right gradient. The correct formula sums outer products x_{b,t} ⊗ dy_{b,t} over all (b,t) pairs.","D":"The gradient is NOT divided by T. Dividing by T would give the average gradient, not the total gradient. 
The total gradient (sum over all positions) is correct for parameter updates — the learning rate effectively averages by using the loss's average over positions."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 6.5: Back-Propagation in Feedforward Networks — matrix gradient derivation"},{"section":"deep-learning","difficulty":"hard","id":"dl-h033","topicSlug":"graph-neural-networks","orderIndex":33,"topic":"Graph Neural Networks","question":"You train a link prediction model on a social network graph with 1M nodes and 50M edges. Your GNN uses 3-layer GraphSAGE with mean aggregation. At inference, predicting whether edge (u,v) exists requires the embeddings of both u and v. You benchmark on a held-out edge set and find AUC=0.91. A reviewer says \"your result is inflated by data leakage from graph structure.\" How could structural leakage occur in link prediction evaluation, and what is the correct negative sampling strategy?","options":{"A":"AUC=0.91 is correct; graph structure cannot cause data leakage","B":"$24","C":"Remove 50% of training edges; this prevents all forms of data leakage","D":"GraphSAGE cannot be used for link prediction; use only non-structural methods"},"correct":"B","explanation":{"correct":"- Inductive vs transductive leakage: GraphSAGE is inductive — it computes embeddings from neighbor aggregation. For an edge (u,v) in the test set, if u's neighborhood aggregation includes v (because the edge u-v exists in the graph used for aggregation), the embedding h_u directly encodes information about v's features. This makes the link score f(h_u, h_v) essentially \"see\" the test edge.\n- Correct evaluation protocol: (1) Split edges into train/val/test. (2) Construct aggregation graph using ONLY training edges. (3) Compute all node embeddings using only training graph structure. 
(4) Score edges (u,v) using these training-graph embeddings.\n- OGB (Open Graph Benchmark) link prediction datasets follow this protocol explicitly, and results on datasets using it are directly comparable.","A":"Structural leakage is a well-documented problem in graph link prediction evaluation. Several published results have been found to be inflated due to this issue.","B":"","C":"Removing training edges reduces the model's ability to learn structural patterns. The fix is not to remove edges but to ensure the evaluation uses only training edges for neighborhood computation.","D":"GraphSAGE is widely used for link prediction. The OGB leaderboard features many GraphSAGE-based methods. The issue is evaluation protocol, not the model architecture."},"reference":"- Poole et al., \"GraphSAGE and link prediction leakage\" — evaluation best practices\n- Hu et al., \"Open Graph Benchmark\" (2020): https://arxiv.org/abs/2005.00687 — OGB link prediction evaluation protocol"},{"section":"deep-learning","difficulty":"hard","id":"dl-h034","topicSlug":"introduction-to-neural-networks","orderIndex":34,"topic":"Introduction To Neural Networks","question":"A mechanistic interpretability researcher finds that a 2-layer MLP (trained on a toy task) implements a specific Boolean circuit in its weights. They claim \"we can read off exactly what computation the network performs.\" A production team uses a 128-layer Transformer. The researcher claims the same circuit-reading approach scales. 
What are the two fundamental obstacles to mechanistic interpretability at scale, and what does the concept of \"superposition\" specifically predict about neural network representations that makes interpretation harder?","options":{"A":"Mechanistic interpretability works equally well at any scale; it's purely a compute problem","B":"$25","C":"The only obstacle is compute; given enough time, all circuits can be found","D":"Superposition means different neurons encode the same feature redundantly; this simplifies interpretation"},"correct":"B","explanation":{"correct":"- Superposition evidence: Toy models of superposition (Elhage et al. 2022) show explicitly how a 2-layer MLP trained with D' > D features uses superposition. Features are represented as vectors that are nearly (but not perfectly) orthogonal, allowing D neurons to represent D' > D features.\n- Polysemanticity: in superposition, individual neurons respond to multiple unrelated features (e.g., a neuron in GPT-2 responds to \"code tokens,\" \"mathematical notation,\" and \"European names\"). This is documented empirically by Anthropic's interpretability team.\n- Scale challenge: BERT-large has 24 layers × 1024 dimensions. Potentially millions of superimposed features across all layers. Even if each feature could be identified (itself hard), understanding the CIRCUIT connecting features across layers is a separate exponentially hard problem.","A":"Scale introduces qualitatively new challenges beyond compute. Even with unlimited compute, the superposition problem means that individual neurons do not cleanly encode interpretable concepts.","B":"","C":"Superposition is not just a compute problem. The mathematical structure (near-orthogonal feature vectors across neurons) means there is no direct mapping from neurons to features, regardless of compute budget.","D":"Superposition is the OPPOSITE of redundancy. In superposition, each neuron encodes DIFFERENT parts of MULTIPLE features simultaneously. 
Redundancy would mean multiple neurons encode the same feature — that's a different phenomenon."},"reference":"- Elhage et al., \"Toy Models of Superposition\" (2022): https://transformer-circuits.pub/2022/toy_model/index.html\n- Elhage et al., \"A Mathematical Framework for Transformer Circuits\" (2021): https://transformer-circuits.pub/2021/framework/index.html"},{"section":"deep-learning","difficulty":"hard","id":"dl-h035","topicSlug":"regularization-and-normalization","orderIndex":35,"topic":"Regularization And Normalization","question":"You train a 24-layer Transformer language model and find that without any normalization, the model's loss spikes unpredictably during training, and with Post-LayerNorm, it requires careful LR warmup. You switch to Pre-LayerNorm (Pre-LN). A reviewer asks: \"Pre-LN is known to cause representation collapse in very deep networks — the residual stream's contribution from early layers becomes negligible.\" Explain the mathematical mechanism behind this collapse and what technique (DeepNorm or ResiDual) addresses it.","options":{"A":"Pre-LN never causes representation collapse; the reviewer is incorrect","B":"$26","C":"Collapse is prevented by making the residual connection trainable (learnable weight)","D":"The collapse is a training issue; more epochs solve the deep Pre-LN collapse"},"correct":"B","explanation":{"correct":"- Quantitative collapse: with L Pre-LN layers, x_L = x_0 + Σ_{l=0}^{L-1} F_l(LN(x_l)). The accumulated sum ||x_L|| grows as O(L × ||F_l output||). At layer L, ||F_l(LN(x_l))|| / ||x_L|| ≈ ||F_l(LN(x_l))|| / (L × ||F_l||) = 1/L → 0 as L → ∞. Each layer's contribution shrinks inversely with depth — the network behaves as if only the first few layers matter.\n- DeepNorm: α and β are analytically derived (α = (2N)^{1/4} for self-attention, β = (8N)^{-1/4}). 
These keep E[||x_l||] constant across all L layers and E[||∂L/∂x_l||] constant — solving both forward collapse and gradient vanishing simultaneously without warmup.\n- Used in: GLM-130B, DeepNet (Microsoft), and other very deep Transformers (>100 layers) use DeepNorm or similar techniques to enable stable training without warmup.","A":"Pre-LN collapse is mathematically derived and empirically documented. Wang et al. (2022) explicitly showed depth ≥ 1000 layers with Post-LN is unstable but DeepNorm enables stable training at depth 1000.","B":"","C":"Learnable residual weights (scalar α per layer) are a valid idea, but require initialization tuning. DeepNorm's contribution is providing the analytical formula for α and β — removing the need for empirical search.","D":"More training epochs don't fix structural issues in the forward pass. If early layers' contributions collapse, the optimization landscape itself is degraded — more steps on the same landscape don't recover collapsed representations."},"reference":"- Wang et al., \"DeepNet: Scaling Transformers to 1,000 Layers\" (2022): https://arxiv.org/abs/2203.00555"},{"section":"deep-learning","difficulty":"hard","id":"dl-h036","topicSlug":"self-supervised-and-contrastive-learning","orderIndex":36,"topic":"Self Supervised And Contrastive Learning","question":"You pretrain a ViT-B/16 using MAE (Masked Autoencoder) with 75% masking ratio. The decoder reconstructs raw pixel values. After pretraining, you fine-tune for image classification (linear probe: frozen encoder + linear layer). Linear probe accuracy = 68%. A DINO-pretrained ViT-B/16 achieves 78% linear probe accuracy with the same setup. 
Despite MAE achieving better full fine-tune performance (83% vs DINO's 81%), why does MAE's linear probe significantly underperform DINO, and what property of DINO's loss function creates more linearly separable representations?","options":{"A":"MAE is strictly worse than DINO; the results above contradict this","B":"$27","C":"The linear probe difference is purely due to architecture; DINO uses a CLS token","D":"MAE should be pretrained for 1600 epochs to match DINO's linear probe performance"},"correct":"B","explanation":{"correct":"- Linear probe measures representation quality WITHOUT task-specific adaptation. It specifically tests if class information is encoded in a linearly accessible way in the frozen representation.\n- MAE's learned features: pixel reconstruction requires preserving spatial, textural, and structural details. These features are distributed across many representation dimensions, not necessarily aligned with semantic classes.\n- DINO's features: each image produces a distribution over 65536 \"semantic concepts\" (the DINO prototypes). Similar-content images produce similar prototype distributions → the representation directly encodes semantic similarity → linear classifiers easily extract class information.\n- This explains why DINO representations produce impressive unsupervised segmentation (background/foreground structure visible in attention maps) while MAE representations show better local texture features.","A":"The stated results reflect actual published comparisons. MAE achieves better FULL fine-tune accuracy than DINO while having worse linear probe accuracy — both facts are empirically correct. The two methods learn representations with complementary properties.","B":"","C":"Both ViT-B/16 variants use a CLS token (ViT architecture includes it). The linear probe uses the CLS token's representation for both. The difference is the pretraining loss, not the architecture.","D":"MAE was pretrained for 1600 epochs in the original paper. 
Training longer with the same pixel reconstruction loss produces better reconstruction but doesn't fundamentally change the semantic linearity of the representation."},"reference":"- He et al., \"Masked Autoencoders Are Scalable Vision Learners (MAE)\" (2021): https://arxiv.org/abs/2111.06377 — Table 1: linear probe comparison\n- Caron et al., \"Emerging Properties in Self-Supervised Vision Transformers (DINO)\" (2021): https://arxiv.org/abs/2104.14294"},{"section":"deep-learning","difficulty":"hard","id":"dl-h037","topicSlug":"loss-and-cost-functions","orderIndex":37,"topic":"Loss And Cost Functions","question":"You train a variational autoencoder (VAE) with ELBO loss: L = E_q[log p(x|z)] - KL(q(z|x) || p(z)). During training, the KL term collapses to 0 (KL divergence becomes near-zero) while reconstruction loss remains high. This is \"posterior collapse.\" Explain the exact optimization mechanism causing this and the two most effective fixes used in production VAEs.","options":{"A":"KL collapse means the model has converged; zero KL is optimal","B":"$28","C":"Posterior collapse is caused by the decoder being too small; increase decoder capacity","D":"Use MSE reconstruction loss instead of log-likelihood; this prevents KL collapse"},"correct":"B","explanation":{"correct":"- Optimization landscape: the ELBO has two competing terms. If the decoder is powerful (e.g., Transformer decoder), it can minimize reconstruction loss without using z by leveraging context (x_1,...,x_{t-1} → x_t in autoregressive VAE). The KL term then drives q(z|x) → p(z) for free (no cost). The overall ELBO improves even though z becomes uninformative.\n- KL annealing mechanism: during early training (β=0), the model trains as a standard autoencoder — z must encode information. As β increases, the KL penalty is introduced gradually. The encoder already has useful encodings, so it doesn't collapse.\n- Free bits: Kingma et al. 
(2016) showed that requiring min-KL = λ bits per dimension (e.g., λ=0.25 bits) prevents collapse by ensuring each dimension encodes at least λ bits. This is a hard constraint on the minimum information the encoder must encode.","A":"KL=0 means the posterior = prior for ALL inputs — the encoder provides NO information about the input. The latent z is then a pure noise sample that carries no semantic information. This is a degenerate solution where the VAE's purpose (encoding meaningful latent structure) has failed.","B":"","C":"Larger decoder capacity WORSENS posterior collapse — a more powerful decoder is better at reconstructing without using z. The fix is to force the encoder to be used, not to reduce decoder capacity.","D":"MSE reconstruction loss doesn't prevent collapse — the decoder can still learn to ignore z with MSE loss, achieving low reconstruction error without using the latent code. The problem is the optimization dynamics, not the specific reconstruction loss form."},"reference":"- Bowman et al., \"Generating Sentences from a Continuous Space (KL annealing)\" (2016): https://arxiv.org/abs/1511.06349\n- Kingma et al., \"Improving Variational Inference with Inverse Autoregressive Flow\" (2016) — Free bits"},{"section":"deep-learning","difficulty":"hard","id":"dl-h038","topicSlug":"ann-architectures","orderIndex":38,"topic":"Ann Architectures","question":"You train a Neural ODE model where the hidden state dynamics are modeled by a differential equation: dh/dt = f_θ(h(t), t), solved with a numerical ODE solver (Runge-Kutta). The loss is computed at the final time T. 
Compared to a discrete ResNet with the same parameter count, what are the memory implications of backpropagating through a Neural ODE, and why do two different gradient computation methods (backprop through solver vs adjoint method) give different memory vs compute trade-offs?","options":{"A":"Neural ODE requires identical memory to ResNet backpropagation","B":"$29","C":"Adjoint method is always better than standard backprop in all aspects","D":"Neural ODE gradient computation is identical to RNN backpropagation"},"correct":"B","explanation":{"correct":"- Adjoint method derivation: ∂L/∂θ = -∫_T^0 a(t)^T × ∂f_θ/∂θ(h(t),t) dt. The adjoint a(t) satisfies: da/dt = -a(t)^T × ∂f_θ/∂h(h(t),t). Both h(t) and a(t) are computed by backward ODE integration from T to 0.\n- Memory trade-off: standard backprop through N steps stores N activation checkpoints. The adjoint stores only the current (h, a) pair — O(D) total. For N=500 steps and D=1024: roughly 500× less activation memory.\n- Numerical accuracy: the adjoint ODE is integrated backward from T to 0, reconstructing h(t) as it goes. Numerical integration errors in the reconstructed h(t) cause small discrepancies between the adjoint gradient and the true gradient. For high accuracy, use tight tolerances (e.g., rtol=1e-7, atol=1e-8).","A":"Neural ODE with standard backprop through the solver stores all solver steps — O(N×D) vs ResNet's O(layers×D). For adaptive solvers (N can vary), Neural ODE memory is often larger.","B":"","C":"Adjoint method uses roughly 2× compute (two full ODE integrations). Standard backprop uses 1 forward + 1 backward pass through stored states — memory-expensive, but it needs no second ODE solve. For memory-constrained scenarios, adjoint is better; for compute-constrained scenarios, direct backprop may be better.","D":"RNN backpropagation (BPTT) processes discrete steps and stores hidden states at each step. 
Neural ODE uses continuous-time ODE solving with adaptive step sizes — different algorithms, different memory/compute profiles."},"reference":"- Chen et al., \"Neural Ordinary Differential Equations\" (2018): https://arxiv.org/abs/1806.07366 — Section 2: Reverse-mode automatic differentiation of ODE solutions"},{"section":"deep-learning","difficulty":"hard","id":"dl-h039","topicSlug":"cnn-architectures","orderIndex":39,"topic":"Cnn Architectures","question":"You train EfficientNet-B7 on a 200-class fine-grained classification task (bird species). The training images are 600×600. After training, you deploy on a mobile device and must reduce latency from 450ms to < 50ms. A team proposes knowledge distillation from EfficientNet-B7 (teacher) to MobileNetV3-Small (student). During distillation training, the student achieves only 71% accuracy vs the teacher's 89%. Identify two specific reasons why large-to-small distillation gaps occur for fine-grained tasks, and propose a distillation strategy that narrows the gap.","options":{"A":"Knowledge distillation always achieves teacher accuracy; a gap means implementation error","B":"$2a","C":"Larger distillation temperature always fixes capacity gaps; use T=20 for fine-grained tasks","D":"The gap is caused by the optimizer; switch the student to Adam for distillation"},"correct":"B","explanation":{"correct":"- Capacity gap in KD: Mirzadeh et al. (2020) showed that large teacher-student capacity gaps HURT distillation — a model that is too powerful a teacher actually degrades student performance vs a medium-complexity teacher. Intermediate teachers \"bridge\" the gap.\n- Feature distillation (FitNets, Romero et al. 2014): intermediate layer features are richer than final logits. 
The teacher's intermediate representations encode hierarchical features (wing texture at layer 3, beak shape at layer 5) that are more directly learnable by the student than the compressed logit distribution.\n- Progressive distillation efficiency: each step is a smaller capacity gap, making distillation more effective at each stage. The full chain can achieve 82-84% for fine-grained classification with MobileNetV3 vs 71% with direct distillation.","A":"Knowledge distillation consistently shows accuracy gaps for large teacher-to-small student transfers, especially for fine-grained tasks. The gap narrows with better distillation strategies but rarely disappears entirely.","B":"","C":"High temperature (T=20) softens the logit distribution — useful when the teacher has sharp one-hot-like predictions (all mass on one class). For fine-grained 200-class tasks, the teacher already produces soft distributions. Very high T further smooths the distributions, potentially losing the fine-grained similarity structure that makes KD useful.","D":"The optimizer choice affects convergence speed but not the fundamental capacity limitation. Both Adam and SGD would produce similar final accuracy for a capacity-constrained student."},"reference":"- Mirzadeh et al., \"Improved Knowledge Distillation via Teacher Assistant\" (2020): https://arxiv.org/abs/1902.03393\n- Romero et al., \"FitNets: Hints for Thin Deep Nets\" (2015): https://arxiv.org/abs/1412.6550"},{"section":"deep-learning","difficulty":"hard","id":"dl-h040","topicSlug":"attention-and-transformers-dl","orderIndex":40,"topic":"Attention And Transformers Dl","question":"You analyze the gradient flow in a 24-layer Pre-LN Transformer during training. You find that layer 1's attention weights consistently receive gradients 12× smaller than layer 24's attention weights, despite using Pre-LN (which is supposed to improve gradient flow). 
You also notice that all layers use weight tying (the same W_Q, W_K, W_V matrices shared across all layers). How does weight tying interact with Pre-LN to cause this gradient imbalance, and what is the correct fix?","options":{"A":"Gradient imbalance is impossible with Pre-LN; the observation must be a measurement error","B":"$2b","C":"The gradient imbalance is beneficial; early layers should receive smaller updates","D":"The fix is to increase dropout at early layers; this rebalances gradients"},"correct":"B","explanation":{"correct":"$2c","A":"Pre-LN improves gradient flow compared to Post-LN, but does NOT perfectly equalize gradients across all layers. The combination of Pre-LN + weight tying specifically creates the described imbalance. The observation is physically plausible and measurable.","B":"","C":"The imbalance with weight tying is harmful: layer 24 dominates the shared W_Q updates, causing the weight matrix to be specialized for deep-layer query patterns. Early layers' query patterns (which process more local, syntactic information in language models) are underweighted. Final model quality suffers.","D":"Dropout affects which neurons are active during training but doesn't balance gradient magnitudes across layers for shared weights. Dropout is applied to activations, not to gradients directly."},"reference":"- Press et al., \"Using the Output Embedding to Improve Language Models\" (2017): https://arxiv.org/abs/1608.05859 — weight tying\n- Howard & Ruder, \"Universal Language Model Fine-Tuning (ULMFiT) — Layer-wise LR decay\" (2018): https://arxiv.org/abs/1801.06146"},{"section":"deep-learning","difficulty":"medium","id":"dl-m001","topicSlug":"introduction-to-neural-networks","orderIndex":1,"topic":"Introduction To Neural Networks","question":"You have a 3-class classification problem (classes A, B, C) with 100 training examples each. 
You train a 3-layer MLP and find training accuracy = 99% but the model consistently misclassifies class C examples as class B. Validation accuracy for classes A and B is 92%, but class C validation accuracy is 41%. What is happening, and what is the most targeted intervention?","options":{"A":"The model has overfit classes A and B; add more dropout","B":"The model has learned a decision boundary that conflates C with B — suggesting C and B are similar in feature space for the model's learned representation. The issue is not general overfitting (A/B perform well on validation) but class-specific confusion. Targeted intervention: (1) examine the confusion matrix to confirm B↔C confusion; (2) inspect C and B examples to understand feature overlap; (3) add C-vs-B discriminative examples to training (data augmentation or collection); (4) add a class-specific loss term that penalizes C→B errors more heavily. Simply adding dropout would reduce the overall accuracy without specifically addressing the B/C boundary.","C":"Increase the learning rate to force the model to better separate class C","D":"The 41% validation accuracy for class C is acceptable since it's above random (33%)"},"correct":"B","explanation":{"correct":"- Targeted diagnosis: the asymmetric error (A/B fine, C bad) points to a representation problem specific to C vs B, not global overfitting. The model has learned to separate A from {B,C} but not to separate B from C.\n- Common causes: C and B may share low-level features; the training labels may be noisy for C; the model's learned features may not capture the C-vs-B distinguishing information.\n- Targeted fix: focus on the specific failure mode. Confusion matrix analysis + example inspection + targeted data collection is more efficient than global regularization changes.","A":"Dropout reduces overall capacity uniformly. 
It would degrade A and B performance too, without specifically addressing the C-vs-B boundary.","B":"","C":"Higher LR can help escape local minima but can also destabilize what's already working (A and B). It's a blunt instrument for a targeted problem.","D":"41% validation accuracy for class C in a 3-class problem is only 8 percentage points above random (33%). This level of performance is not \"acceptable\" for most real applications and indicates a genuine classification failure."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 11: Practical Methodology — Confusion matrix analysis"},{"section":"deep-learning","difficulty":"medium","id":"dl-m002","topicSlug":"introduction-to-neural-networks","orderIndex":2,"topic":"Introduction To Neural Networks","question":"A student claims: \"A network with more layers than needed will automatically learn to use only the useful layers and ignore the rest — extra layers do no harm.\" Is this claim correct for a deep ReLU network without skip connections?","options":{"A":"Correct — gradient descent automatically prunes unnecessary layers to identity transforms","B":"Partially incorrect. Extra ReLU layers CAN learn the identity function (by setting weights to I and biases to 0), which is theoretically harmless. However, in practice: (1) extra layers make optimization harder — the loss landscape becomes more non-convex with more composition of non-linear functions; (2) extra layers add parameters that can overfit; (3) without skip connections (like ResNet), deep networks suffer from degradation — adding layers can actually decrease training accuracy because the optimization landscape makes it hard to learn identity mappings. 
Skip connections in ResNet explicitly allow extra layers to learn near-zero residuals, fixing this problem.","C":"Correct — a ReLU layer with W=I and b=0 is exactly identity; gradient descent trivially finds this","D":"Incorrect — extra layers always harm performance because they introduce vanishing gradients"},"correct":"B","explanation":{"correct":"- He et al. (2016) demonstrated the degradation problem: 56-layer plain networks perform worse than 20-layer networks on CIFAR-10 — even on training accuracy. This cannot be explained by overfitting. Extra layers don't automatically learn identity.\n- Why identity is hard to learn: the optimization must jointly adjust all layers. Extra layers create saddle points and local minima that make the gradient landscape harder to navigate.\n- ResNet fix: H(x) = F(x) + x. The network only needs to learn F(x) = 0 (zero residual) to implement identity. Learning zero is much easier than learning the identity mapping directly.","A":"Gradient descent does not automatically prune unnecessary layers. The degradation problem is well-documented empirically.","B":"","C":"While W=I, b=0 is a valid identity for ReLU layers (for non-negative activations), gradient descent doesn't reliably find this solution in practice, especially when the layer needs to pass both positive and negative activations.","D":"Vanishing gradients are one problem but not the only one. Degradation occurs even when gradients flow well (e.g., with BN). The landscape problem is more fundamental."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2016): https://arxiv.org/abs/1512.03385"},{"section":"deep-learning","difficulty":"medium","id":"dl-m003","topicSlug":"neurons-and-perceptrons","orderIndex":3,"topic":"Neurons And Perceptrons","question":"A 2-layer MLP (input→hidden→output) with linear activations (no non-linearity) is trained on a multi-class problem. 
A professor says \"this model is equivalent to a single linear layer.\" Prove or disprove with a matrix algebra argument.","options":{"A":"False — two linear layers have more parameters than one, so they must be more expressive","B":"True. With linear activations: output = W₂(W₁x + b₁) + b₂ = W₂W₁x + W₂b₁ + b₂ = W'x + b' where W' = W₂W₁ and b' = W₂b₁ + b₂. The product of two matrices is still a matrix. The 2-layer linear network has the same output as a single linear layer with W' = W₂W₁. The intermediate hidden layer adds no expressive power — only reparametrizes the same space of linear functions. This is why non-linear activations are essential: they break this collapsibility.","C":"False — the bias terms b₁ and b₂ prevent collapse; two biases are more expressive than one","D":"True only if W₁ and W₂ are square matrices; non-square matrices prevent the collapse"},"correct":"B","explanation":{"correct":"- Matrix multiplication closure: the product of two matrices (W₂ ∈ ℝ^{K×H}, W₁ ∈ ℝ^{H×D}) gives W' ∈ ℝ^{K×D}. This is just a K×D linear transformation — exactly what a single linear layer computes.\n- Bias collapse: W₂b₁ + b₂ is a constant vector — equivalent to a single bias b' = W₂b₁ + b₂.\n- Implication: stacking linear layers without non-linearity is wasteful. Any depth of linear layers is equivalent to a depth-1 linear model. This is the fundamental reason why activation functions are not optional.","A":"More parameters do NOT imply more expressiveness when the parameters collapse. W₂W₁ has rank at most min(rank(W₂), rank(W₁)) and can be factored in many ways, so the extra parameters are redundant. The function space is still a set of linear (affine) functions, never larger than what a single layer can represent.","B":"","C":"b' = W₂b₁ + b₂ is a single bias vector in ℝ^K, exactly what a single linear layer with bias uses. Two biases collapse into one. No extra expressiveness.","D":"The collapse applies regardless of whether matrices are square. 
For any W₂ ∈ ℝ^{m×n} and W₁ ∈ ℝ^{n×p}, W₂W₁ ∈ ℝ^{m×p} — a linear transformation, regardless of shape."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 6.1: Why Deep Architectures — linear collapse argument"},{"section":"deep-learning","difficulty":"medium","id":"dl-m004","topicSlug":"activation-functions","orderIndex":4,"topic":"Activation Functions","question":"A network uses sigmoid activations in all hidden layers. You observe that gradients at layer 1 are 1000× smaller than gradients at the output layer after 10 layers. The network fails to learn useful features. You switch to ReLU. After the switch, training is faster but several neurons still show zero gradients throughout training. What are the two separate problems, and why doesn't one fix solve both?","options":{"A":"Both problems are caused by the learning rate; adjust LR to fix both","B":"Problem 1 (sigmoid): vanishing gradients. Sigmoid derivative σ'(z) = σ(z)(1-σ(z)) ≤ 0.25. Through 10 layers, the gradient magnitude is ≤ 0.25^{10} ≈ 10^{-6} — effectively zero at early layers. ReLU derivative = 1 (for z > 0), avoiding this cascade shrinkage. Problem 2 (ReLU): dead neurons. Neurons with z ≤ 0 have gradient = 0 regardless of loss — they cannot receive updates. These are independently caused: vanishing gradients (too-small values from chain rule multiplication) vs dead neurons (structural zeros from ReLU definition). No single fix addresses both: ReLU fixes vanishing gradients but introduces dead neurons. Leaky ReLU addresses dead neurons but may still have slight gradient shrinkage for z < 0.","C":"Both are the same problem; add BatchNorm before the activation to solve both at once","D":"The zero gradients in ReLU are expected and harmless; only sigmoid's problem is real"},"correct":"B","explanation":{"correct":"- Vanishing gradient mechanism: ∂L/∂w_1 = ∂L/∂a_10 × Π_{l=1}^{9} ∂a_{l+1}/∂a_l × ∂a_1/∂z_1. Each sigmoid factor ≤ 0.25. 
Product of 9 factors ≤ 0.25^9 ≈ 4×10^{-6}.\n- Dead ReLU mechanism: a neuron stuck with z < 0 has ReLU'(z) = 0 at that neuron. The chain-rule factor at that neuron is 0, so all upstream weights see zero gradient from that path.\n- These are different problems: one is about gradient shrinkage through multiplication; the other is about structural zeros that don't depend on the loss value.","A":"LR affects update magnitude but not gradient direction or whether gradients exist. A high LR doesn't restore vanished gradients; a low LR doesn't resurrect dead ReLU neurons.","B":"","C":"BatchNorm before activation helps center pre-activations, reducing sigmoid saturation and reducing dead ReLU probability. But it doesn't fully solve either: with long sequences or deep networks, sigmoid gradients still vanish; and some neurons can still die even with BN.","D":"Zero gradients for dead ReLU neurons ARE harmful — those neurons contribute nothing to the model's capacity and waste parameters. 40% dead neurons (as in the easy-level example) is a significant problem."},"reference":"- Glorot et al., \"Deep Sparse Rectifier Neural Networks\" (2011): https://proceedings.mlr.press/v15/glorot11a.html"},{"section":"deep-learning","difficulty":"medium","id":"dl-m005","topicSlug":"activation-functions","orderIndex":5,"topic":"Activation Functions","question":"PReLU (Parametric ReLU) has a learnable slope for the negative region: f(x) = x if x > 0 else αx, where α is a learned parameter (initialized to 0.25). ELU uses f(x) = x if x > 0 else α(eˣ - 1). A practitioner asks: \"Should I use PReLU or ELU for a 50-layer ResNet on ImageNet?\" What is the key practical consideration for PReLU, and under what condition is ELU preferred?","options":{"A":"Always use ELU; PReLU is deprecated","B":"PReLU consideration: it adds one learnable parameter per channel (or per neuron). For a 50-layer ResNet with thousands of channels, this is a small but non-zero increase in parameters and storage. 
More importantly, if the dataset is small, PReLU's extra parameters can overfit — the slopes may tune to the training distribution. ELU preference: ELU produces negative outputs for x < 0 (approaches -α asymptotically), pushing mean activations closer to zero. This reduces bias shift in deeper networks — each layer's output is closer to zero mean, avoiding the systematic positive shift that ReLU causes (non-zero mean activations → bias correction needed in subsequent layers). ELU is preferred when zero-mean activations are beneficial (dense networks without BN).","C":"They produce identical results; the choice doesn't matter","D":"PReLU is always better than ELU because learned parameters outperform fixed parameters"},"correct":"B","explanation":{"correct":"- PReLU parameter overhead: if applied per-channel (most common), adds one scalar per channel. As a rough illustration, 64+64+128+256+512 channels across stages = 1024 channels gives ~1024 extra parameters; the actual count for ResNet-50 is larger but still tiny next to its ~25M weights. Negligible for large datasets; may overfit for small datasets.\n- ELU zero-mean advantage: for x sampled symmetrically around 0, E[ELU(x)] is closer to 0 than E[ReLU(x)] when α=1 (the negative outputs partly offset the positive ones). ReLU: E[ReLU(x)] = E[x⁺] > 0. This systematic positive bias accumulates across layers, requiring BN to correct.\n- ResNet with BN: since ResNet uses BN, the zero-mean benefit of ELU is less critical. For ResNets, empirical results with ReLU are strong; PReLU (He et al. 2015) showed marginal improvements on ImageNet.","A":"ELU is not universally better. PReLU can match or outperform ELU on some tasks (He et al. 2015 reported PReLU improving over ReLU on ImageNet; that paper predates ELU and does not compare against it). Neither is categorically deprecated.","B":"","C":"PReLU and ELU are mathematically different functions with different gradient profiles. For negative inputs, PReLU: constant slope α; ELU: exponential approach to -α. They produce different outputs and different gradients.","D":"Learned parameters don't always outperform fixed parameters. 
PReLU's α may converge to values similar to ELU's fixed curve, providing no advantage, while adding optimization complexity."},"reference":"- He et al., \"Delving Deep into Rectifiers\" (2015): Section 3 — PReLU vs other activations"},{"section":"deep-learning","difficulty":"medium","id":"dl-m006","topicSlug":"forward-propagation","orderIndex":6,"topic":"Forward Propagation","question":"You implement a forward pass for a 3-layer MLP in NumPy. Layer 1: (512→256, ReLU), Layer 2: (256→128, ReLU), Layer 3: (128→10, softmax). For training, you save intermediate activations a₁, a₂ for backpropagation. A memory-constrained deployment system says \"don't store activations — recompute them during backward.\" What is the memory vs compute trade-off?","options":{"A":"Recomputation is never done in practice because it doubles training time","B":"Standard backprop: store a₁, a₂ during forward pass (memory = O(batch × hidden) per layer), use them directly in backward pass (no recompute). Activation checkpointing (gradient checkpointing): don't store a₁, a₂. During backward pass, rerun the forward pass from a checkpoint to recompute the needed activation. Trade-off: activation memory shrinks roughly in proportion to the fraction of layers not stored (e.g., up to ~4× when only every 4th layer is kept), but compute increases by ≈1.33× (one extra forward pass per backward pass). For memory-constrained systems, this is a key technique — it enables training larger models or larger batches that wouldn't fit in GPU memory otherwise.","C":"Recomputing activations is impossible because the random number generator state is different for each forward pass","D":"Memory and compute are the same thing; saving activations always saves both"},"correct":"B","explanation":{"correct":"- Memory of activations: for a batch of B examples in a layer with H hidden units: B × H floats per token position. For a 24-layer Transformer with H=768, B=32: 32 × 768 × 4 bytes ≈ 98 KB per layer per token position; × 24 layers ≈ 2.4 MB per token position, which grows to over 1 GB at sequence length 512 (plus more for attention). 
This is significant for very deep models.\n- Gradient checkpointing trade-off: implemented in PyTorch via `torch.utils.checkpoint`. Chen et al. (2016) showed O(√N) activation memory is achievable for N layers at the cost of roughly one extra forward pass.\n- When activations are deterministic (no dropout): recomputation is exact. With dropout: must use the same random seed, which requires saving the RNG state — a small memory cost.","A":"Gradient checkpointing is used in production. BERT, GPT, and other large models use it extensively. The 33% compute overhead is acceptable when memory is the bottleneck.","B":"","C":"Deterministic activations (ReLU, linear) can be recomputed exactly. For stochastic operations (Dropout), PyTorch's gradient checkpointing saves the RNG state at the checkpoint, then restores it for recomputation.","D":"Memory and compute are independent resources. Saving activations (writing to GPU RAM) costs memory but NOT extra compute. Recomputation costs extra compute but saves memory. They trade against each other."},"reference":"- Chen et al., \"Training Deep Nets with Sublinear Memory Cost\" (2016): https://arxiv.org/abs/1604.06174"},{"section":"deep-learning","difficulty":"medium","id":"dl-m007","topicSlug":"forward-propagation","orderIndex":7,"topic":"Forward Propagation","question":"You vectorize a forward pass for a batch of 32 samples. The weight matrix W is (d_in, d_out) = (256, 512) and the input batch is X = (32, 256). You write `output = X @ W + b`. A colleague writes `output = (W.T @ X.T).T + b`. Both produce the same output. Which is faster in practice and why?","options":{"A":"The first version (X @ W) is always faster because it uses fewer memory bytes","B":"The second version (W.T @ X.T).T is faster in practice for small d_in relative to batch size. More accurately: for large batches, both are equivalent in FLOPs. The performance depends on memory layout (row-major storage). 
X ∈ ℝ^{32×256} is stored row-major: each row is contiguous. X @ W = (32×256) @ (256×512): X is read row-by-row (cache-friendly); W is read column-by-column (potentially cache-unfriendly). Modern BLAS libraries optimize both orderings. In practice, PyTorch/NumPy BLAS calls perform identically since they internally choose the optimal layout. The important insight: row-major memory layout affects cache performance, but optimized BLAS handles this automatically.","C":"Neither is faster — matrix multiplication is always O(n³) regardless of order","D":"The first is typically faster (or at least never slower): it works directly on X in its contiguous row-major layout, while the second operates on transposed, non-contiguous views and adds an extra transpose of the result"},"correct":"D","explanation":{"correct":"- Transpose memory: `.T` in NumPy/PyTorch is a view (no data copy), just changes the stride. `X.T` doesn't copy data. However, the resulting non-contiguous memory layout can make the subsequent matmul cache-unfriendly.\n- BLAS optimization: modern libraries (cuBLAS, MKL) detect memory layout and choose optimal algorithms. `X @ W` on contiguous row-major data is typically cache-friendly.\n- Practical recommendation: `X @ W + b` (first form) is the standard, idiomatic, and often faster choice because X is contiguous row-major and W is accessed in the natural BLAS order.","A":"The statement \"fewer memory bytes\" is incorrect — both forms compute the same (32×512) output and use the same input data.","B":"The `.T` operation on X creates a non-contiguous view, which can hurt performance. The first form is typically preferred. 
The answer is partially correct in that BLAS handles both, but incorrectly suggests the second form is faster.","C":"Matrix multiplication time complexity is O(n³) for n×n matrices in theory, but practical performance depends heavily on hardware, memory layout, and BLAS implementation.","D":""},"reference":"- PyTorch docs: `torch.matmul` performance notes"},{"section":"deep-learning","difficulty":"medium","id":"dl-m008","topicSlug":"loss-and-cost-functions","orderIndex":8,"topic":"Loss And Cost Functions","question":"You train a neural network for 3-class classification with cross-entropy loss. The model's training loss has plateaued at 1.09 for 20 epochs — it hasn't decreased at all. What does this specific loss value tell you about what the model is actually doing?","options":{"A":"Loss of 1.09 means the model has 90% accuracy","B":"CE loss of 1.09 ≈ log(3) ≈ 1.099 is the cross-entropy of a uniform distribution over 3 classes: -log(1/3) = log(3) ≈ 1.099. The model is predicting approximately equal probability (33.3%) for all classes — essentially making no learning progress beyond chance. This is a strong signal that the model is stuck: likely at a degenerate local minimum, or the gradients are zero, or there is a training bug. The model hasn't learned any discriminative features.","C":"Loss of 1.09 is normal for early training; it will decrease automatically with more epochs","D":"Loss of 1.09 means the model has converged and further training is unnecessary"},"correct":"B","explanation":{"correct":"- log(3) = 1.0986. Cross-entropy for uniform prediction on K classes = log(K). For K=3: log(3) ≈ 1.099.\n- Diagnostic power: if loss = log(K) after 20 epochs, the model hasn't improved from random initialization. 
Common causes: (1) learning rate too high (loss oscillates around log(K)); (2) learning rate too low (gradient steps too small); (3) data/label mismatch (labels are random or misaligned); (4) all gradients zero (dead ReLU everywhere, wrong activation, zero initialization); (5) optimizer configuration bug.\n- This is a powerful diagnostic rule: expected loss at random initialization ≈ log(K). After a few batches, loss should decrease.","A":"Loss and accuracy are not directly interchangeable without knowing the specific predictions. Log(3) corresponds to 33% accuracy (random), not 90%.","B":"","C":"After 20 epochs of no decrease, the model is stuck. \"More epochs\" won't fix the underlying issue. The training pipeline needs debugging.","D":"A model predicting random output has not converged to a useful solution. A converged (good) model for 3-class would have loss << 1.099."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 11.3: Diagnosing Causes of Poor Generalization"},{"section":"deep-learning","difficulty":"medium","id":"dl-m009","topicSlug":"loss-and-cost-functions","orderIndex":9,"topic":"Loss And Cost Functions","question":"You train a language model with cross-entropy loss and achieve perplexity = 50 on the validation set. A colleague achieves perplexity = 10 on the same dataset with their model. Explain what perplexity measures, and what does the difference between 50 and 10 mean concretely for text generation quality?","options":{"A":"Perplexity = number of parameters divided by vocabulary size; lower always means overfitting","B":"Perplexity = exp(cross-entropy) = exp(-1/N Σ log P(w_t | context)). It measures how surprised the model is by the validation text on average. Perplexity of 50 means the model is as uncertain as choosing uniformly among 50 options at each step. Perplexity of 10 means only 10 effective choices per step. 
Concrete generation difference: your model with PPL=50 assigns much lower probability to the true next word — generating text that frequently picks unlikely words, producing less coherent text. The colleague's model with PPL=10 predicts the next word with much higher confidence, generating more fluent, predictable text. PPL=10 is roughly 3.3 bits per word (log₂10 ≈ 3.32) vs PPL=50 at ~5.6 bits per word (log₂50 ≈ 5.64).","C":"Perplexity measures vocabulary coverage; PPL=50 means only 50/vocab_size words are used","D":"Perplexity of 50 is better than 10 because more options = more creative generation"},"correct":"B","explanation":{"correct":"- Perplexity = branching factor: if perplexity = K, the model is as uncertain as a uniform K-class choice at each time step. This is an interpretable metric for language model quality.\n- PPL 50 vs 10: a 5× difference is large. Large modern language models typically reach perplexities well below 50 on standard benchmarks. A PPL=50 model assigns the true next word a geometric-mean probability of only 1/50 = 2%, versus 1/10 = 10% for the PPL=10 model.\n- Relationship to bits: bits per token = log₂(PPL), and bits-per-character (BPC) = log₂(PPL) / chars_per_token. Lower BPC = better compression = better language model.","A":"Perplexity has nothing to do with parameter count or vocabulary size. It's a function of the model's probability assignments to the validation text. Lower perplexity = better model.","B":"","C":"Perplexity doesn't measure vocabulary coverage. A model using all 50,000 vocabulary words but assigning uniform probability would have PPL = 50,000. A focused model with low perplexity could use a small or large vocabulary.","D":"Lower perplexity = better predictive ability = more coherent generation in practice. Creativity and perplexity are not directly related. 
Lower PPL means the model's distribution better matches natural language, which typically produces more coherent text, not less creative text."},"reference":"- Jurafsky & Martin, \"Speech and Language Processing\" (2023), Chapter 3: N-gram Language Models — Perplexity"},{"section":"deep-learning","difficulty":"medium","id":"dl-m010","topicSlug":"backpropagation","orderIndex":10,"topic":"Backpropagation","question":"You implement a custom layer with a forward pass that clips gradients internally: in the backward pass, if the gradient magnitude > 1, return gradient = 1. A colleague says \"this is equivalent to gradient clipping done at the optimizer level.\" Is this true? What is the subtle difference?","options":{"A":"True — gradient clipping inside the backward pass and at the optimizer level are mathematically identical","B":"False. There are two fundamentally different gradient clipping operations: (1) Per-layer clipping inside backward: each layer clips its own gradient independently. The upstream gradient propagated to earlier layers is altered — earlier layers receive already-clipped gradients, potentially distorting the learning signal for those layers. (2) Optimizer-level global norm clipping: compute the global gradient norm across ALL parameters, then scale down ALL gradients proportionally if the total norm exceeds max_norm. Global clipping preserves the relative magnitude ratios between different parameter gradients — it's a scale-only operation. Per-layer clipping changes the relative proportions of gradients between layers, potentially distorting the optimization direction.","C":"True — both clip gradients to the same range, producing identical optimization trajectories","D":"Optimizer-level clipping is always wrong; per-layer clipping should be used exclusively"},"correct":"B","explanation":{"correct":"- Global norm clipping: `g_clipped = g × min(1, max_norm / ||g_global||)`. All gradients are scaled by the same factor. 
Direction is preserved; only magnitude changes.\n- Per-layer clipping: each layer's gradient is independently clipped. If layer 1 gradient = [0.5, 2.0], after per-layer clipping (threshold 1): [0.5, 1.0]. But the true gradient might require [0.5, 2.0] to point correctly in optimization space. Clipping only large components distorts the direction.\n- The deeper issue: per-layer clipping affects gradient flow to earlier layers. If a late layer clips its gradient, the upstream layers see a modified gradient, potentially learning incorrectly.","A":"","B":"","C":"","D":"Global norm clipping is the standard recommended approach in practice (Pascanu et al. 2013). It's specifically designed to handle exploding gradients in RNNs."},"reference":"- Pascanu et al., \"On the difficulty of training recurrent neural networks\" (2013): https://arxiv.org/abs/1211.5063"},{"section":"deep-learning","difficulty":"medium","id":"dl-m011","topicSlug":"backpropagation","orderIndex":11,"topic":"Backpropagation","question":"You use gradient checkpointing on a Transformer with 12 layers, checkpointing every 4th layer. During the backward pass, how many total forward passes are computed (including the original), and what is the peak memory compared to storing all activations?","options":{"A":"1 forward pass total; memory reduced by 12×","B":"With checkpoints at layers 0, 4, 8, 12: the backward pass recomputes layer activations between checkpoints. For each 4-layer segment, one forward recompute is needed. Three segments (0→4, 4→8, 8→12) → 3 extra forward passes in addition to the original 1 → total = 4 forward passes (1 original + 3 recomputes). Peak memory: at any time, you store: the activation at the last checkpoint (1 layer) + the recomputed activations for the current 4-layer segment (4 layers). Without checkpointing: all 12 layer activations. Memory ratio: checkpointed ≈ (1 + 4) / 12 ≈ 42% of full — about 2.4× reduction. 
Compute increase: the 3 recomputed segments together cover all 12 layers, i.e. one extra full forward pass of work; since a backward pass costs roughly twice a forward pass, this is ≈33% extra computation overall.","C":"12 forward passes; all activations must be recomputed one by one","D":"2 forward passes total; checkpointing halves both memory and compute simultaneously"},"correct":"B","explanation":{"correct":"- Gradient checkpointing strategy: divide the network into segments. Store only activations at segment boundaries. During backward, when a segment's gradient is needed, recompute that segment from its checkpoint.\n- For 12 layers with checkpoints at every 4th: 3 segments × (1 recompute per segment) = 3 extra segment-level forward passes. Total = 1 (original) + 3 (segment recomputes) = 4 forward computations, which together amount to about two full forward passes of work.\n- Memory analysis: peak memory = max over all backward steps of: checkpoint activations + current segment activations. Optimal checkpointing: every √L layers, giving O(√L) memory with a constant-factor compute overhead (about one extra forward pass).","A":"Incorrect memory reduction formula. With checkpoints at every 4th layer (3 checkpoints stored) vs 12 activations: memory is roughly 3/12 + peak_recompute_segment ≈ 5/12, not 1/12.","B":"","C":"Not all activations are recomputed individually. Checkpointing recomputes segment-by-segment, not layer-by-layer. 12 forward passes would be 12 individual recomputes.","D":"The count is 4 forward computations (1 full pass + 3 segment recomputes), not 2 passes that also halve compute. Checkpointing doesn't reduce both simultaneously: the trade-off is approximately 2.4× less activation memory for ≈1.33× total compute."},"reference":"- Chen et al., \"Training Deep Nets with Sublinear Memory Cost\" (2016): https://arxiv.org/abs/1604.06174"},{"section":"deep-learning","difficulty":"medium","id":"dl-m012","topicSlug":"optimizers","orderIndex":12,"topic":"Optimizers","question":"You train ResNet-50 with AdamW (weight_decay=0.01, lr=1e-3). A colleague argues: \"AdamW is wrong — weight decay and L2 regularization are the same thing.\" You know they're not equivalent for Adam. 
Explain concisely why AdamW differs from Adam + L2, and what practical implication this has for regularization strength.","options":{"A":"AdamW and Adam + L2 are identical for any optimizer; the colleague is correct","B":"Adam + L2: L2 adds λ/2 × ||w||² to the loss. The gradient of L2 is λw. Adam normalizes gradients by √v_t: effective update = η × (g + λw) / (√v_t + ε). The regularization term λw is divided by √v_t along with the gradient. For parameters with large gradient history (large v_t), the L2 penalty is scaled DOWN — meaning high-gradient parameters receive weaker regularization. AdamW decouples: w ← w - η/√v_t × g - η × λ × w (weight decay applied directly, not through gradient normalization). The weight decay λw is not divided by √v_t, so ALL parameters receive consistent regularization regardless of their gradient history. Practical: AdamW provides uniform regularization strength; Adam + L2 has inconsistent regularization that's weaker for frequently-updated parameters.","C":"AdamW sets weight decay to zero for the first 1000 steps; this is the only difference","D":"AdamW only applies weight decay to convolutional layers; Adam + L2 applies to all layers"},"correct":"B","explanation":{"correct":"- The key insight: Adam's adaptive scaling normalizes gradients by the empirical second moment estimate. When you add L2 to the loss, the regularization gradient λw goes through this same normalization. Parameters with large v_t (frequently updated, large gradient variance) receive weaker effective regularization.\n- AdamW (Loshchilov & Hutter, 2019) shows this causes suboptimal regularization. They propose decoupled weight decay that bypasses Adam's normalization.\n- Empirically: AdamW significantly outperforms Adam + L2 for Transformer pretraining (GPT, BERT fine-tuning) and is now the default optimizer for large language models.","A":"This is exactly the claim shown to be incorrect by Loshchilov & Hutter (2019). For SGD, L2 and weight decay are equivalent. 
For Adam, they are not.","B":"","C":"AdamW doesn't have a warm-up-only weight decay rule. It applies weight decay at every step.","D":"AdamW applies weight decay to all parameters by default. You can configure it to skip certain parameters (e.g., biases, LayerNorm) by placing them in a separate parameter group with weight_decay=0 (PyTorch `param_groups`), but this is optional."},"reference":"- Loshchilov & Hutter, \"Decoupled Weight Decay Regularization (AdamW)\" (2019): https://arxiv.org/abs/1711.05101"},{"section":"deep-learning","difficulty":"medium","id":"dl-m013","topicSlug":"optimizers","orderIndex":13,"topic":"Optimizers","question":"You train a Transformer language model with a linear warmup for 4000 steps, then cosine annealing to lr=0 over 100,000 steps. Why is the warmup phase critical for Transformers, and what would happen if you started with the full learning rate from step 1?","options":{"A":"Warmup is purely for computational efficiency — it doesn't affect final model quality","B":"Warmup is critical because: (1) At step 1, Adam's moment estimates (m, v) are initialized at zero, so bias-corrected estimates are noisy (high variance) until enough gradient history accumulates. With a high LR immediately, these noisy estimates cause large, poorly-directed updates. (2) For Transformers specifically, early large updates to LayerNorm, attention weights, and embedding matrices can send representations to extreme regions of parameter space that are hard to escape. Starting from lr=0 and gradually increasing allows moment estimates to stabilize before making large updates.
Without warmup: the loss may spike early, the model may not recover from suboptimal initialization, and training instability can cause NaN losses in the first 100 steps.","C":"Warmup is only needed for SGD; Adam handles cold-start automatically through bias correction","D":"Warmup should be 50% of total training steps; longer warmup always improves quality"},"correct":"B","explanation":{"correct":"- Adam bias correction partially helps (it corrects the scale), but doesn't eliminate directional instability from noisy early-step estimates. v̂_t = v_t / (1-β₂ᵗ) at t=1 is based on just 1 gradient sample — high variance.\n- Transformer sensitivity: embeddings, attention weights, and LayerNorm parameters are particularly sensitive to early large updates. The attention mechanism can collapse (all weights to one token) under large early steps.\n- Empirical evidence: Liu et al. (2019) showed that removing warmup significantly hurts Transformer training. The warmup length is a sensitive hyperparameter (typically 1-10% of total steps).","A":"Warmup affects both training stability and final quality. Models trained without warmup often achieve lower final quality due to poor early-step optimization.","B":"","C":"Adam's bias correction corrects the scale but not the direction of early noisy gradients. The directional instability from high-variance early estimates is not corrected by bias correction.","D":"Warmup of 50% of steps would mean spending half the training budget at sub-optimal learning rates, significantly slowing convergence. Typical warmup is 1-10% of total steps."},"reference":"- Liu et al., \"On the Variance of the Adaptive Learning Rate and Beyond (RAdam)\" (2019): https://arxiv.org/abs/1908.03265"},{"section":"deep-learning","difficulty":"medium","id":"dl-m014","topicSlug":"ann-architectures","orderIndex":14,"topic":"Ann Architectures","question":"ResNet introduces skip connections: `output = F(x, {W_i}) + x`. 
The paper claims this solves the \"degradation problem\" — where deeper networks have higher training error than shallower ones. How does the skip connection make learning the identity easier, and why is this different from just using a better optimizer?","options":{"A":"Skip connections make the network shallower by skipping layers during training","B":"The degradation hypothesis: deep plain networks struggle to learn the identity mapping for unnecessary layers because the optimization landscape makes it hard. With skip connection: `output = F(x) + x`. The residual network only needs to learn F(x) = 0 (zero function) to implement identity (output = 0 + x = x). Learning zero is significantly easier than learning the identity mapping W=I, b=0 through gradient descent — zero is a natural default (gradient pushes toward small weights). This is an architectural prior (bias toward identity), not an optimizer improvement. Better optimizers cannot solve degradation because the problem is the loss landscape's local minima structure, not gradient noise or step size.","C":"Skip connections allow gradients to bypass non-differentiable operations, fixing the vanishing gradient problem","D":"Skip connections double the learning rate for early layers, compensating for vanishing gradients"},"correct":"B","explanation":{"correct":"- Optimization landscape: for a plain network to implement identity, it must find w ≈ I (the identity matrix) in a high-dimensional weight space. The loss surface around this solution may have poor local minima or saddle points.\n- Residual formulation: F(x) = output - x. The residual network transforms the problem to \"what is the difference from identity?\" This is a re-parameterization that makes the identity easy (F=0) and non-trivial transformations explicit (F = learned residual).\n- He et al. 
showed that with skip connections, adding layers to a model that has already converged produces equal (not worse) performance — the new layers learn near-zero residuals automatically.","A":"Skip connections don't physically skip layers during training — all layers receive gradients and update. The skip provides an alternative gradient path.","B":"","C":"While skip connections do help gradient flow (gradients can flow through the skip path directly), the paper's primary claim is about the degradation problem, not vanishing gradients. Both benefits exist, but they're different.","D":"Skip connections don't change the effective learning rate. They change the optimization landscape and gradient flow paths."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2016): https://arxiv.org/abs/1512.03385 — Section 3"},{"section":"deep-learning","difficulty":"medium","id":"dl-m015","topicSlug":"ann-architectures","orderIndex":15,"topic":"Ann Architectures","question":"You design a model for classifying time series of variable length (sequences from 10 to 500 time steps). You consider (A) 1D CNN with global average pooling, (B) LSTM, (C) Transformer with positional encoding. Compare how each handles variable-length inputs and what the computational bottleneck is for the 500-step case.","options":{"A":"Only LSTM handles variable-length inputs; CNN and Transformer require fixed-length","B":"All three can handle variable-length inputs, but differently. 1D CNN + GAP: convolutions apply to any length; global average pooling aggregates to a fixed-size vector regardless of input length. Bottleneck: O(L × C²) per layer where L is sequence length. LSTM: processes one step at a time; hidden state propagates across all L steps. Bottleneck: O(L × H²) sequential operations — cannot parallelize across steps; 500 steps = 500 sequential LSTM cells. Transformer: applies self-attention across all L steps. 
Bottleneck: O(L²) attention matrix for L=500 → 250,000 attention scores per head. All handle variable length; CNN and Transformer allow GPU parallelization (faster wall-clock), LSTM is sequential (slower for long sequences).","C":"Transformer requires fixed input length due to positional encoding","D":"1D CNN with GAP cannot process sequences longer than its receptive field"},"correct":"B","explanation":{"correct":"- 1D CNN: `conv1d(x)` works on any sequence length. GAP: `mean(output, dim=time)` — always produces fixed-size output. Computational cost scales linearly with L.\n- LSTM: inherently variable-length by design. Hidden state is a vector regardless of current step. But: sequential dependency (h_t depends on h_{t-1}) prevents parallelization. For L=500, this means 500 sequential matmuls.\n- Transformer: positional encoding can be absolute (sin/cos, works for any L up to prespecified max), learned (must extend for longer sequences), or relative (RoPE, ALiBi — fully length-agnostic). Not \"fixed-length only.\" The O(L²) attention bottleneck is more relevant for long sequences.","A":"All three architectures handle variable-length sequences. CNN via padding + GAP, LSTM natively, Transformer via masking and flexible positional encoding.","B":"","C":"Positional encoding doesn't enforce fixed length. Sinusoidal PE can be computed for any position index. Learned PE requires extending the embedding table for positions beyond the training length (which is a limitation, but not a fundamental fixed-length constraint).","D":"GAP aggregates across all time steps regardless of length. 
The receptive field determines what local context each output position sees, not the maximum processable length."},"reference":"- Bai et al., \"An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling (TCN)\" (2018): https://arxiv.org/abs/1803.01271"},{"section":"deep-learning","difficulty":"medium","id":"dl-m016","topicSlug":"regularization-and-normalization","orderIndex":16,"topic":"Regularization And Normalization","question":"You apply L1 regularization (λ||w||₁) to a network with 1,000 features. After training, 950 features have weight exactly 0. A colleague applies L2 regularization (λ||w||₂²) with the same λ and finds all 1,000 features have small but non-zero weights. Explain mathematically why L1 produces exact zeros but L2 does not.","options":{"A":"L1 produces zeros because it clips gradient values; L2 multiplies gradient values","B":"L1 subgradient: ∂(λ|w|)/∂w = λ × sign(w). This adds a constant force pushing each weight toward zero regardless of its current value. When |w| is small enough that the data gradient and the L1 gradient cancel, the weight lands at exactly zero (subdifferential at w=0 includes any value in [-λ, λ]). The L1 penalty creates a flat region at w=0 — once a weight reaches zero, the L1 subgradient has no force to move it away. L2 gradient: ∂(λw²/2)/∂w = λw. The L2 force toward zero is proportional to w. As w → 0, the L2 gradient → 0 — it loses the ability to push the weight all the way to zero. This is why L1 produces exact sparsity and L2 produces small-but-nonzero weights.","C":"L1 zeros weights because it has a larger magnitude regularization term","D":"Both L1 and L2 produce exact zeros; the difference is in how quickly they reach zero"},"correct":"B","explanation":{"correct":"- Geometric interpretation: L1 penalty creates diamond-shaped constraint regions. 
The optimal solution often sits at a corner (where some coordinates = 0) because the loss contours first touch the L1 ball at a corner for sparsity-inducing problems.\n- L1 at zero: the subdifferential ∂|w|/∂w at w=0 = [-1, 1]. If the data gradient g ∈ (-λ, λ), the total subgradient (g + λ×sign(w)) contains 0 when w=0. The weight is stuck at zero.\n- L2 at small w: gradient = g + λw ≠ 0 unless g = -λw exactly. For small w, g ≈ g_data ≠ 0 in general, so the weight keeps moving — it approaches zero asymptotically but never reaches it.","A":"\"Clipping\" is not the mechanism. L1 adds a constant subgradient (not a clip). L2 multiplies by a shrinkage factor, which does produce the \"weight decay\" effect of exponential decay, but not to exact zero.","B":"","C":"A larger λ would also produce more zeros with L2 (by pushing weights closer to zero), but the key difference is the gradient behavior near zero, not the magnitude of λ.","D":"L2 never produces exact zeros in continuous optimization — it asymptotically approaches zero. L1's subgradient mechanism creates a true absorbing zero state."},"reference":"- Tibshirani, \"Regression Shrinkage and Selection via the Lasso\" (1996) — original LASSO paper"},{"section":"deep-learning","difficulty":"medium","id":"dl-m017","topicSlug":"regularization-and-normalization","orderIndex":17,"topic":"Regularization And Normalization","question":"You apply GroupNorm with G=8 groups to a CNN layer with C=64 channels, processing a batch of B=4 images with spatial dimensions H×W. Explain which dimensions are normalized, compute how many channels per group, and compare to LayerNorm and BatchNorm for this specific configuration.","options":{"A":"GroupNorm normalizes across groups; each group of 8 channels is treated as one unit","B":"GroupNorm: C/G = 64/8 = 8 channels per group. For each (batch sample, group): normalize over the 8 channels and all H×W spatial positions. Normalized dimensions: (C/G, H, W) per sample per group. 
This means the statistics are computed from H×W×8 values per normalization unit. Compare: BatchNorm — normalizes over (B, H, W) for each channel: statistics from 4×H×W values per channel; sensitive to batch size. LayerNorm (for CNN) — would normalize over all (C, H, W) for each sample: statistics from 64×H×W values per sample. GroupNorm is between: independent of batch size (like LN), but normalizes over smaller feature groups (like BN's per-channel), making it suitable for small-batch settings where BN's batch statistics are noisy.","C":"GroupNorm with G=8 is identical to LayerNorm with 8 features","D":"GroupNorm requires the batch to be divisible by G=8; B=4 would cause an error"},"correct":"B","explanation":{"correct":"- GroupNorm normalization axes: for input shape (B, C, H, W), reshape to (B, G, C/G, H, W). Normalize over dims (C/G, H, W) for each (B, G) pair.\n- Statistics for our config: B=4, G=8, C/G=8, H×W=spatial. Statistics computed from 8×H×W values per (sample, group). Independent of batch size B=4.\n- Comparison summary:\n- BN: normalizes over (B, H, W) for each channel C → batch-size dependent\n- LN: normalizes over (C, H, W) for each batch sample → large normalization window (64×H×W)\n- GN: normalizes over (C/G, H, W) for each (batch, group) → intermediate window (8×H×W), batch-size independent","A":"GroupNorm doesn't normalize \"across groups\" — it normalizes WITHIN each group. The 8 channels within a group plus their spatial positions are the normalization domain.","B":"","C":"LayerNorm for CNN normalizes all C=64 channels per sample; GroupNorm with G=8 normalizes only 8 channels per group. Different normalization sizes → different behavior.","D":"GroupNorm's G must divide C (channels), not B (batch size). 
B=4 is fine; G=8 must divide C=64 ✓ (64/8=8)."},"reference":"- Wu & He, \"Group Normalization\" (2018): https://arxiv.org/abs/1803.08494"},{"section":"deep-learning","difficulty":"medium","id":"dl-m018","topicSlug":"weight-initialization","orderIndex":18,"topic":"Weight Initialization","question":"You train a very deep ResNet (1000 layers) and find that every 100 epochs or so, the loss spikes to 10× its previous value before recovering. This is correlated with gradient norm explosions. Your batch size is 256 and you use FP16. What is the most likely cause, and what is the standard fix used in FP16 training?","options":{"A":"The loss spikes are caused by data outliers; apply input clipping","B":"FP16 underflow causes gradient vanishing for small values, but the described spikes (10× loss increase) suggest gradient explosion or loss function overflow. More likely: FP16 has a dynamic range of ~65,504 max. During training, loss gradients can overflow FP16 range → gradients become Inf or NaN → weight update corrupts the model → loss spikes. Standard fix: loss scaling. Multiply the loss by a large scalar S (e.g., 65536) before backward: scaled_loss = S × loss. Backprop computes S × true_gradients. Before optimizer step, divide by S: gradient = scaled_gradient / S. This keeps gradients in FP16's representable range during backprop without changing the actual update.","C":"Use a smaller batch size to prevent gradient overflow in FP16","D":"Loss spikes in FP16 are unavoidable; switch to FP32 for deep networks"},"correct":"B","explanation":{"correct":"- FP16 overflow mechanism: smallest normalized FP16 = 6.1e-5, max = 65,504. Gradients in deep networks often have very small values (especially early layers via chain rule). In FP16, values < 6.1e-5 underflow to 0 (vanishing). 
But occasional gradient explosions push values > 65,504 → overflow to ±Inf → NaN propagation → weight corruption → loss spike.\n- Loss scaling solution: by multiplying the loss by a large scale factor, the gradients are proportionally larger, avoiding underflow. After backprop, divide by the same factor to recover the true gradient. Automatic loss scaling (ALS) in Apex/PyTorch AMP doubles the scale factor after N successful steps and, on overflow, skips that update and halves the scale — finding the optimal scale adaptively.","A":"Data outliers would cause training instability at specific batches consistently correlated with outlier examples, not periodic every-100-epoch spikes. FP16 overflow causes periodic spikes based on accumulated gradient behavior.","B":"","C":"Batch size doesn't directly affect FP16 overflow probability. The gradient values per sample are the relevant quantity.","D":"Mixed-precision training (FP16 with loss scaling) is now standard for training all modern large models. FP16 with proper loss scaling is stable; BF16 is an alternative that further reduces overflow risk."},"reference":"- Micikevicius et al., \"Mixed Precision Training\" (2018): https://arxiv.org/abs/1710.03740"},{"section":"deep-learning","difficulty":"medium","id":"dl-m019","topicSlug":"weight-initialization","orderIndex":19,"topic":"Weight Initialization","question":"You debug a freshly initialized Transformer and observe that the model's initial loss for a 32,000-vocabulary language modeling task is 22 (much higher than expected). What should the initial cross-entropy loss be approximately, and what initialization problem causes an initial loss of 22?","options":{"A":"Initial loss of 22 is correct; log(32000) is approximately 10 anyway","B":"Expected initial CE loss for uniform predictions over V=32,000 classes: -log(1/V) = ln(32000) ≈ 10.37 (natural log).
Initial loss of 22 ≈ 2× the expected value, suggesting the model is initially very biased toward a few tokens — making some classes extremely improbable and others over-probable. This is often caused by: (1) incorrect output bias initialization (non-zero biases that create peaked initial distributions); (2) weight initialization too large causing the pre-softmax logits to be far from uniform (large logits → peaked softmax → high CE for wrong predictions); (3) embedding initialization too large. Fix: initialize output layer weights to near zero (logits ≈ 0 → softmax ≈ uniform → CE ≈ log(V)).","C":"Initial loss of 22 is impossible because cross-entropy is bounded by log(V)","D":"Initial loss of 22 is correct for transformers; it decreases after the first 100 steps"},"correct":"B","explanation":{"correct":"- Expected initial loss: for a uniform distribution over V classes: CE = -log(1/V) = log(V). For V=32,000: log(32,000) ≈ 10.37. This is the \"entropy of maximum uncertainty\" — the loss when the model is completely uninformed.\n- Loss > log(V): this can happen! If some logits are very large or biased, the softmax becomes peaked (not uniform). A peaked distribution on wrong classes has higher CE than uniform. Example: if logit[correct_token] = -10, log(softmax) = very negative → CE = very large.\n- Diagnostic: initial loss ≈ log(V) confirms neutral initialization; initial loss >> log(V) indicates biased initialization.","A":"log(32000) ≈ 10.37 (not 22). The expected initial loss should be around 10.37, making 22 approximately 2× too high — a clear initialization problem.","B":"","C":"CE loss is NOT bounded by log(V) from above. If the model assigns near-zero probability to the correct class, CE = -log(~0) → very large. CE is bounded below by 0 (perfect prediction) and above by infinity.","D":"All well-implemented language models should start near log(V). 
A loss of 22 at initialization is a bug, not expected behavior."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 6.2: Output Units and Loss Functions"},{"section":"deep-learning","difficulty":"medium","id":"dl-m020","topicSlug":"cnn-architectures","orderIndex":20,"topic":"Cnn Architectures","question":"EfficientNet uses compound scaling: depth d = α^φ, width w = β^φ, resolution r = γ^φ where α·β²·γ² ≈ 2. The paper fixes φ=1 (EfficientNet-B1). If you double computational budget (φ=2), how do the three dimensions scale, and why is compound scaling preferred over scaling only width or only depth?","options":{"A":"Compound scaling just adds more layers; width and resolution are fixed","B":"With φ=2 and (α=1.2, β=1.1, γ=1.15) (EfficientNet's found constants): depth = 1.2², width = 1.1², resolution = 1.15². At φ=2: depth factor = 1.44, width factor = 1.21, resolution factor = 1.32. Compound scaling preferred because: (1) depth alone hits diminishing returns (vanishing gradients for very deep networks); (2) width alone loses long-range dependencies (wide but shallow networks miss hierarchical features); (3) resolution alone increases compute quadratically without depth to process the extra spatial information. Balancing all three maintains the effective receptive field growth, the model's capacity to learn hierarchical features, and the resolution of input detail.","C":"Compound scaling is identical to neural architecture search; the scaling rule finds the best architecture","D":"Compound scaling always requires α=β=γ; using different values causes performance degradation"},"correct":"B","explanation":{"correct":"- FLOPs budget: convolution FLOPs ∝ d × w² × r². Scaling all three: (α × β² × γ²)^φ ≈ 2^φ. 
The constraint α × β² × γ² ≈ 2 ensures each unit increase in φ roughly doubles FLOPs.\n- Why all three dimensions matter: deep networks (more layers) increase the receptive field and hierarchy; wide networks (more channels) increase feature diversity; higher resolution captures finer details. Each addresses a different aspect of image recognition capability.\n- Empirical results: EfficientNet-B7 achieved SOTA on ImageNet with 8.4× fewer parameters than GPipe and similar accuracy, showing the compound scaling efficiency.","A":"EfficientNet explicitly scales all three dimensions simultaneously. The \"compound\" in compound scaling refers to the joint scaling of depth, width, and resolution.","B":"","C":"NAS is used to find the base architecture (EfficientNet-B0); the α, β, γ coefficients are then found by a small grid search on that base. Compound scaling is the subsequent scaling rule applied to this base, not the NAS itself.","D":"EfficientNet specifically uses different α, β, γ values (not equal) found by a small grid search. The different values reflect that FLOPs scale differently with each dimension (FLOPs ∝ w², r² but linearly with d)."},"reference":"- Tan & Le, \"EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks\" (2019): https://arxiv.org/abs/1905.11946"},{"section":"deep-learning","difficulty":"medium","id":"dl-m021","topicSlug":"cnn-architectures","orderIndex":21,"topic":"Cnn Architectures","question":"A depthwise separable convolution (DSC) replaces a standard 3×3 conv with: (1) depthwise conv (3×3, groups=C_in) then (2) 1×1 conv. For C_in=C_out=128, compute the FLOPs ratio of DSC vs standard conv, and explain what the depthwise step computes vs what the pointwise (1×1) step computes.","options":{"A":"DSC is 2× cheaper than standard conv; each step saves 50% of FLOPs","B":"Standard conv FLOPs (per output pixel): K² × C_in × C_out = 9 × 128 × 128 = 147,456. DSC: depthwise FLOPs = K² × C_in × 1 = 9 × 128 = 1,152 (each channel filtered independently).
Pointwise FLOPs = 1 × C_in × C_out = 128 × 128 = 16,384. Total DSC = 17,536 FLOPs. Ratio: 17,536 / 147,456 ≈ 1/8.4. DSC is ~8× cheaper. Depthwise step: applies a separate spatial filter to each channel — captures spatial patterns (edges, textures) independently per channel. Pointwise step: mixes channels with 1×1 conv — creates new feature combinations from the spatially filtered channels.","C":"DSC is exactly K² times cheaper than standard conv regardless of channel count","D":"DSC with groups=C_in is equivalent to dilated convolution; the FLOPs are the same"},"correct":"B","explanation":{"correct":"- FLOPs analysis: standard 3×3 conv: K² × C_in × C_out per output pixel. For K=3, C_in=C_out=128: 9 × 128 × 128 = 147,456.\n- DSC breakdown: depthwise (K=3, groups=C_in=128, C_out=128 same channels): 9 × 128 × 1 = 1,152. Pointwise (1×1 conv, C_in=128, C_out=128): 1 × 128 × 128 = 16,384. Total: 17,536.\n- Reduction ratio: DSC/standard = (K² × C_in + C_in × C_out) / (K² × C_in × C_out) = 1/C_out + 1/K² = 1/128 + 1/9 ≈ 0.119, i.e. an ≈ 8.4× reduction.\n- The separability insight: spatial filtering (depthwise) and channel mixing (pointwise) are decoupled, capturing most of the expressive power of standard conv at a fraction of the cost.","A":"Not exactly 2×. The actual reduction for C=128, K=3 is ~8.4×. The reduction factor depends on both K and C: larger channels → more savings; smaller K → less savings.","B":"","C":"The reduction factor is (K² × C) / (K² + C), not K². For K=3 and C=128: 1,152 / 137 ≈ 8.4×, not K²=9×. The C_in×C_out term in pointwise also contributes to cost.","D":"Grouped convolutions (groups=C_in is depthwise) and dilated convolutions are different operations. Dilated convs expand the receptive field using a dilation rate; depthwise convs apply one filter per channel.
Different FLOPs, different purposes."},"reference":"- Howard et al., \"MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications\" (2017): https://arxiv.org/abs/1704.04861"},{"section":"deep-learning","difficulty":"medium","id":"dl-m022","topicSlug":"rnn-lstm-gru","orderIndex":22,"topic":"Rnn Lstm Gru","question":"Truncated BPTT processes a 1000-step sequence by splitting it into 50-step chunks. The hidden state h is carried over between chunks (not reset). A student says: \"This is equivalent to full BPTT because the hidden state preserves all history.\" What is wrong with this statement?","options":{"A":"The student is correct — carryover of h is mathematically equivalent to full BPTT","B":"Carrying over h preserves information in the forward pass but NOT in the backward pass. Full BPTT: the gradient ∂L/∂h_0 flows back through all 1000 steps. Truncated BPTT: within each 50-step chunk, the gradient flows back 50 steps but is stopped at the chunk boundary — it doesn't flow back to earlier chunks. The information from step 1 may influence h at step 1000 (via the hidden state chain), but the gradient at step 50 does not flow back to step 1 via the chunk boundary stop. Consequence: the model's weights receive gradient signal from at most 50 timesteps back, not 1000. Long-range weight updates that would require gradient flow across chunk boundaries are lost.","C":"Truncated BPTT always produces worse results than full BPTT; never use it","D":"Full BPTT and Truncated BPTT produce identical gradients because LSTM gates prevent gradient flow beyond 50 steps anyway"},"correct":"B","explanation":{"correct":"- Forward vs backward information: h carries information forward (it's passed between chunks). 
But gradient signal does NOT flow backward across chunk boundaries — the chunk at steps 51-100 stops gradients at step 51.\n- Weight update difference: weights affecting steps 1-50 can be updated by gradients from steps 51-100 only if the gradient flows back through the chunk. With truncated BPTT, this cross-chunk gradient is zero.\n- Practical effect: truncated BPTT learns short-range dependencies well (within the chunk length) and struggles with very long-range dependencies that require the gradient to flow across many chunks.","A":"Carryover of h is the forward pass information flow. Backward pass gradient flow is determined by whether backpropagation is truncated, not by h carryover. They're independent.","B":"","C":"Truncated BPTT is widely used in practice because full BPTT over 1000 steps requires storing 1000 activation snapshots and is memory-prohibitive. Truncated BPTT is a practical compromise.","D":"LSTM gates do help with long-range gradient flow within the RNN's architecture, but they don't prevent truncation at chunk boundaries. The truncation happens in the computational graph, not in the LSTM gate mechanism."},"reference":"- Williams & Zipser, \"A Learning Algorithm for Continually Running Fully Recurrent Neural Networks\" (1989)"},{"section":"deep-learning","difficulty":"medium","id":"dl-m023","topicSlug":"rnn-lstm-gru","orderIndex":23,"topic":"Rnn Lstm Gru","question":"You train a GRU for time-series anomaly detection on 30-day sliding windows. The GRU has hidden size 128, 2 layers, bidirectional. At inference, you need to detect anomalies in real-time (each new reading arrives at 1 Hz). Why can't you deploy the trained bidirectional GRU as-is, and what are two valid deployment options?","options":{"A":"Bidirectional GRU cannot run on single time steps; it only works with full sequences","B":"The bidirectional GRU's backward layer runs from t=T to t=1 — it requires the complete 30-day window before producing hidden states. 
For real-time inference (anomaly decision needed within 1 second of the new reading), waiting 30 days for the window to complete is unacceptable. Option 1: Re-train as a unidirectional GRU — processes each new reading causally, makes a prediction immediately. Lower quality but causal. Option 2: Accept latency — deploy the BiGRU but delay the anomaly alert by the window length (30 days). The BiGRU processes each completed window, outputting anomaly scores for past time points. Appropriate for post-hoc analysis, not real-time.","C":"Switch to LSTM; bidirectional LSTMs support real-time inference but GRUs do not","D":"Enable streaming mode in the GRU by setting bidirectional=False at inference only"},"correct":"B","explanation":{"correct":"- Causality constraint: the backward pass of a BiGRU computes h_t^{bwd} using inputs x_t, x_{t+1}, ..., x_T. For time t, this requires ALL future inputs. In real-time at 1 Hz, future data doesn't exist.\n- Option 1 trade-off: unidirectional GRU has lower context quality (no future context) but enables real-time deployment. Common in production streaming systems.\n- Option 2 trade-off: BiGRU with delayed output provides better context but requires accepting the window-length delay. Useful for batch processing, fraud detection (can tolerate hours-long delay), maintenance prediction.\n- Note: switching bidirectional=False at inference only without re-training would use weights trained with bidirectional inputs for a unidirectional pass — the weights are incompatible.","A":"Bidirectional GRU processes full sequences during training, but it can process any sequence length — including single time steps fed as complete sequences. The issue is causal availability of data, not architectural length constraints.","B":"","C":"Bidirectional LSTM has the same causality constraint as bidirectional GRU. 
The underlying issue is the backward direction, not the gating mechanism (LSTM vs GRU).","D":"Setting bidirectional=False at inference only would use the forward direction's weights (correct) but the weights for the forward GRU were trained jointly with the backward GRU, potentially learning features that assume future context is available in the layer interactions."},"reference":"- Schuster & Paliwal, \"Bidirectional Recurrent Neural Networks\" (1997)"},{"section":"deep-learning","difficulty":"medium","id":"dl-m024","topicSlug":"attention-and-transformers-dl","orderIndex":24,"topic":"Attention And Transformers Dl","question":"A language model generates text autoregressively. You observe that at each new token, the model must recompute attention over all previous tokens. A colleague says: \"Using a KV cache reduces this from O(T²) to O(T) per new token.\" Is this correct, and what memory does the KV cache consume for GPT-2 (12 layers, 12 heads, d_k=64, T=2048 tokens, float32)?","options":{"A":"Correct — the KV cache stores keys and values so only O(1) compute is needed per new token","B":"Correct per-step statement with clarification: with KV cache, generating the t-th token requires attention computation: Q_t (current token) × K_{1:t} and V_{1:t}. Compute = 1 query × t keys = O(t) per step, not O(t²) full recompute. The O(T) improvement vs O(T²) for the full generation is correct. Memory: KV cache stores K and V for all past tokens, all heads, all layers. Per layer: 2 (K+V) × n_heads × d_k × T × 4 bytes = 2 × 12 × 64 × 2048 × 4 = 12,582,912 bytes ≈ 12 MB per layer. 12 layers: 12 × 12 MB ≈ 144 MB. 
This scales linearly with sequence length T — memory grows as longer contexts are generated.","C":"The KV cache reduces both memory and compute; there is no memory overhead","D":"KV cache is only useful for batch inference; single-sequence generation cannot benefit"},"correct":"B","explanation":{"correct":"- Without KV cache: at step t, compute Q_{1:t}, K_{1:t}, V_{1:t} from the full prefix. Total attention: (t × t) for each head. For full T-token generation: Σ_{t=1}^T t² ≈ O(T³) total.\n- With KV cache: K_{1:t-1} and V_{1:t-1} are already computed and cached. Only compute K_t, V_t for the new token. Attention: Q_t × K_{1:t} = 1×t dot product. Per step: O(t); total: Σ_{t=1}^T t = O(T²). Correct that it reduces to O(T) per step.\n- Memory calculation: 2 (K,V) × n_heads(12) × d_k(64) × T(2048) × layers(12) × 4 bytes = 2 × 12 × 64 × 2048 × 12 × 4 = 150,994,944 bytes ≈ 144 MB. This is the memory cost for generating a full 2048-token sequence.","A":"The KV cache does not make each step O(1). At step t, attention still costs O(t) because the new query must be scored against all t cached keys. What the cache saves is recomputing K and V for past tokens, which cuts the per-step attention cost from O(t²) to O(t). The memory overhead is also significant.","B":"","C":"KV cache adds memory (O(T × n_layers × n_heads × d_k) bytes) in exchange for compute savings. It does NOT reduce memory — it adds memory.","D":"KV cache is beneficial for single-sequence inference (reduces sequential compute). For batch inference, it applies per sequence in the batch."},"reference":"- Pope et al., \"Efficiently Scaling Transformer Inference\" (2023): https://arxiv.org/abs/2211.05100"},{"section":"deep-learning","difficulty":"medium","id":"dl-m025","topicSlug":"attention-and-transformers-dl","orderIndex":25,"topic":"Attention And Transformers Dl","question":"Grouped Query Attention (GQA) reduces the number of K/V heads to G groups while keeping H query heads. 
For Llama-2 (H=32 query heads, G=8 K/V groups, d_k=128, T=4096, float32), calculate the KV cache memory saving vs standard multi-head attention (G=H=32), and what is the trade-off?","options":{"A":"GQA reduces KV cache by G/H = 8/32 = 25% (saves 75%)","B":"Standard MHA KV cache: 2 × H × d_k × T × layers × bytes = 2 × 32 × 128 × 4096 × 32 × 4 = 4,294,967,296 bytes ≈ 4.3 GB. GQA KV cache: 2 × G × d_k × T × layers × bytes = 2 × 8 × 128 × 4096 × 32 × 4 = 1,073,741,824 bytes ≈ 1.07 GB. Memory saving: 4.3 / 1.07 = 4× reduction (GQA uses G/H = 8/32 = 1/4 the KV cache). Trade-off: each query group shares K/V tensors — the 4 queries within a group use the same key/value representations. This reduces expressive power compared to MHA (each query has its own K/V). Empirically, GQA with 8 groups closely matches full MHA quality while reducing KV cache 4×.","C":"GQA only affects training speed; inference KV cache is unchanged","D":"GQA requires retraining from scratch; you cannot convert a standard MHA model to GQA"},"correct":"B","explanation":{"correct":"- Memory calculation: KV cache ∝ n_kv_heads × d_k × T. Standard MHA: 32 heads. GQA-8: 8 K/V heads. Ratio: 8/32 = 1/4. KV cache is 4× smaller.\n- Quality trade-off: Ainslie et al. (2023) found GQA-8 (1/4 the K/V heads) to perform nearly as well as MHA-32 on downstream tasks, with negligible quality degradation. The key insight: different query heads don't need entirely independent K/V representations to learn diverse attention patterns.\n- Uptraining: Llama-2 and other models were pretrained with GQA. Ainslie et al. show that existing MHA checkpoints can be \"uptrained\" to GQA by grouping K/V heads and fine-tuning — a data-efficient conversion.","A":"G/H = 8/32 = 0.25 is the fraction of KV heads retained. KV cache = 0.25 × MHA cache, i.e. a 75% saving, which is the complement of the 25% retained. The question asks about saving vs standard, so 4× reduction = saves 75% of the original. 
Option A states this correctly in terms of fraction, but the answer B provides the full calculation.","B":"","C":"GQA's primary benefit IS the KV cache reduction at inference. It also reduces the per-step attention computation (8 K/V projections vs 32).","D":"Ainslie et al. (2023) demonstrated \"uptraining\" existing MHA models to GQA format, requiring only 5% of original pretraining compute. Full retraining is not necessary."},"reference":"- Ainslie et al., \"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints\" (2023): https://arxiv.org/abs/2305.13245"},{"section":"deep-learning","difficulty":"medium","id":"dl-m026","topicSlug":"self-supervised-and-contrastive-learning","orderIndex":26,"topic":"Self Supervised And Contrastive Learning","question":"BYOL (Bootstrap Your Own Latent) trains without negative pairs. It uses an online network (with gradients) and a target network (EMA of online, no gradients). A stop-gradient is applied to the target network. If you remove the stop-gradient AND use the same network for both branches (removing the EMA), BYOL degenerates. Why doesn't standard BYOL collapse to a trivial solution?","options":{"A":"BYOL uses large negative learning rates on the target network to prevent collapse","B":"BYOL prevents collapse through the combination of two mechanisms: (1) Asymmetric architecture — the online network has an extra prediction MLP head that the target network lacks. This asymmetry means the two networks represent the space differently; the online network must \"predict\" the target's representation, which requires non-trivial structure. (2) EMA of the target network — the target is a slow-moving (β≈0.996) average of the online network. It never directly optimizes the loss, preventing the gradient signal from finding the trivial solution (constant output). 
If both mechanisms are removed (same network + gradients through target), the gradient descent finds the constant-output solution immediately (collapsing all representations to zero variance satisfies the prediction loss).","C":"BYOL prevents collapse through momentum SGD that naturally avoids zero-gradient solutions","D":"Stop-gradient alone is sufficient; the predictor MLP is not necessary for preventing collapse"},"correct":"B","explanation":{"correct":"- Collapse scenario: if both branches are the same network with gradients, the loss L = ||z_online - z_target||² is minimized by z_online = z_target = constant (e.g., all zeros). The gradient would drive both toward the same constant, collapsing the representation.\n- Grill et al. (2020) ablation: removing the predictor MLP causes collapse; removing the EMA (replacing with stop-gradient only) also causes collapse; both together prevent collapse.\n- Theoretical analysis: Tian et al. (2021) \"Understanding Self-Supervised Learning Dynamics without Contrastive Pairs\" showed that the asymmetry between predictor-equipped online and predictor-free target creates implicit regularization that prevents collapse.","A":"BYOL doesn't use negative learning rates. It uses an ordinary positive learning rate (the original paper trains with the LARS optimizer). There's no \"negative learning rate\" mechanism.","B":"","C":"Momentum SGD doesn't inherently avoid trivial solutions. Any optimizer applied to the collapsed objective would find the constant solution equally well.","D":"Stop-gradient alone doesn't prevent collapse in BYOL. The original paper's ablation (Table 2) shows that EMA target without the predictor collapses. 
The predictor is critical."},"reference":"- Grill et al., \"Bootstrap Your Own Latent (BYOL)\" (2020): https://arxiv.org/abs/2006.07733"},{"section":"deep-learning","difficulty":"medium","id":"dl-m027","topicSlug":"self-supervised-and-contrastive-learning","orderIndex":27,"topic":"Self Supervised And Contrastive Learning","question":"Masked Autoencoders (MAE) mask 75% of image patches and reconstruct pixel values for masked patches only. Why does this high masking ratio work better than lower ratios (e.g., 25%)? What is the information-theoretic intuition?","options":{"A":"75% masking is used purely for computational efficiency; lower ratios produce better features","B":"With 25% masking (75% visible), the model can reconstruct masked patches by interpolating from nearby visible patches — a texture interpolation task requiring only local pattern completion, no semantic understanding. With 75% masking (25% visible), visible patches are sparse and far apart. The model cannot rely on local neighborhood interpolation; it must understand the global semantic structure of the image (what object is present, where it is, what the overall scene looks like) to plausibly reconstruct the heavily masked regions. This forces the model to learn rich semantic representations rather than simple texture statistics. He et al. (2021) showed that with 75% masking, downstream finetuning performance is highest — lower masking ratios produce worse representations.","C":"75% masking is needed to ensure no single patch has reconstruction loss of zero","D":"Higher masking means fewer tokens processed in the encoder, making training faster without quality loss"},"correct":"B","explanation":{"correct":"- Information removal: at 75% mask, the remaining 25% of patches provide incomplete local context. To reconstruct, the model must use long-range context (what's the object class? where are the edges? 
what does the full scene look like?).\n- Inductive bias: low masking ratio → easy reconstruction via local extrapolation → trivial pretext task → weak representations. High masking ratio → hard reconstruction requiring global understanding → forces semantic feature learning.\n- He et al. ablation: optimal masking ratio peaks around 75% for ImageNet ViT pretraining. Both lower (40%) and higher (90%) ratios produce weaker downstream representations.","A":"He et al. specifically studied masking ratios and found 75% produces the best downstream representations, not just faster computation. It's both efficient AND produces better features.","B":"","C":"Any masking ratio > 0 will have some patches with non-zero reconstruction loss (because the model isn't perfect). The masking ratio choice is about task difficulty and representation quality, not ensuring non-zero loss.","D":"Speed is a benefit of MAE (the encoder only processes 25% of patches), but it's a secondary benefit. The primary motivation is the representation quality."},"reference":"- He et al., \"Masked Autoencoders Are Scalable Vision Learners (MAE)\" (2021): https://arxiv.org/abs/2111.06377"},{"section":"deep-learning","difficulty":"medium","id":"dl-m028","topicSlug":"graph-neural-networks","orderIndex":28,"topic":"Graph Neural Networks","question":"You apply a 5-layer GNN to a molecular graph with 20 atoms (nodes). The average atom has 3 bonds (degree 3). At layer 5, what is the theoretical maximum k-hop neighborhood size, and in practice, what problem limits useful depth to 2-3 layers for most molecular GNNs?","options":{"A":"Layer 5 can only see 5 atoms regardless of graph structure","B":"Theoretical maximum: in a regular degree-3 graph, the k-hop neighborhood grows as 3^k — at layer 5: up to 3^5 = 243 nodes. But the molecule only has 20 atoms, so the 5-hop neighborhood covers the entire molecule (diameter < 5 for many drug-like molecules). 
In practice, layer depth > diameter brings no additional structural information (the receptive field has saturated). The over-smoothing problem: repeated averaging across the entire graph makes all atom representations converge toward the same vector. For layer 2-3, atom representations capture useful local chemical environments (2-3 bond neighborhood is the relevant molecular descriptor). At layer 5, every atom representation is influenced by all other atoms, washing out the local chemical environment distinctions needed for property prediction.","C":"5 layers are always better for molecular graphs because more neighbor information is always useful","D":"The theoretical maximum is 5 × degree = 15 atoms (one new shell per layer)"},"correct":"B","explanation":{"correct":"- k-hop neighborhood: in an expander-like graph with degree 3, the number of atoms within k hops grows approximately as 3^k. For k=5: up to 243, but the graph caps at 20.\n- Saturation at diameter: for a 20-atom molecule with diameter (the longest shortest path between any two atoms) of ~6, a 6-layer GNN's receptive field covers the whole molecule. Extra layers add no new structural information.\n- Chemical relevance: functional groups (methyl, carbonyl, aromatic ring) are 1-2 bond radius. ECFP (circular fingerprints) with radius 2-3 are highly effective for property prediction. This empirically validates that 2-3 GNN layers capture the relevant chemical neighborhoods.","A":"The receptive field is not simply the layer count. Each layer aggregates from ALL neighbors, which exponentially expands the influence radius (not 1 new atom per layer).","B":"","C":"More layers beyond the graph diameter cause over-smoothing, degrading performance. Most molecular GNN papers show 2-5 layers is optimal, with diminishing returns beyond.","D":"The maximum is not 5 × degree because each hop can branch to NEW atoms. At hop 1: 3 new atoms. At hop 2: each of those 3 atoms branches to ~3 more = 9 new atoms. 
It's multiplicative, not additive (excluding revisited atoms)."},"reference":"- Yang et al., \"Analyzing Learned Molecular Representations for Property Prediction (Chemprop)\" (2019): https://arxiv.org/abs/1904.01561"},{"section":"deep-learning","difficulty":"medium","id":"dl-m029","topicSlug":"graph-neural-networks","orderIndex":29,"topic":"Graph Neural Networks","question":"In a knowledge graph (entities as nodes, relations as directed edges), you apply RGCN (Relational GCN) for entity classification. The knowledge graph has 500 relation types. The base RGCN would have 500 separate weight matrices. Why is this a problem, and how does basis decomposition address it?","options":{"A":"500 weight matrices is too many for GPU memory; use only the top 10 relations","B":"Problem: 500 separate W_r matrices (each d×d) = 500×d² parameters. For d=256: 500 × 65,536 = 32.8M parameters just for relation weights. This massively overfits when the labeled training set is small (typical for knowledge graphs). Many relation types have few examples — their W_r is poorly estimated. Basis decomposition: W_r = Σ_{b=1}^B a_{rb} V_b, where V_1,...,V_B are B shared basis matrices and a_{rb} are relation-specific scalars. Parameters: B basis matrices (B × d²) + 500 × B scalars. For B=30: 30 × 65,536 + 500 × 30 = 1,966,080 + 15,000 = 1,981,080 ≈ 1.98M. Reduction: from 32.8M to about 2M — roughly 16× fewer parameters, while allowing each relation to have a unique linear combination of shared basis transformations.","C":"RGCN's weight matrices must be square; 500 relation types prevent this","D":"Basis decomposition averages all 500 weight matrices into one, losing relation-specific information"},"correct":"B","explanation":{"correct":"- Overfitting risk: in knowledge graphs, some relations (e.g., \"is_a\") have millions of triples, while others (e.g., \"was_born_in_village_of\") may have 10 triples. 
A full W_r for rare relations is trained on very few examples — high overfitting risk.\n- Basis decomposition insight: all relations share a common basis of transformations. Rare relations use sparse combinations of well-estimated basis matrices. This is a form of multi-task learning where all relations benefit from shared structure.\n- Alternative: block-diagonal decomposition (W_r is block-diagonal) reduces compute at the cost of less interaction across feature groups.","A":"Discarding rare relations loses important structural information in the knowledge graph. The standard solution is parameter sharing (basis decomposition), not relation pruning.","B":"","C":"Weight matrices don't need to be square. RGCN uses W_r ∈ ℝ^{d_out × d_in} regardless of the number of relation types. The problem is parameter count and overfitting, not matrix shape.","D":"Basis decomposition doesn't average matrices — it expresses each W_r as its own linear combination of the shared basis matrices V_b. Different a_{rb} coefficients mean each W_r is different, preserving relation-specific transformations."},"reference":"- Schlichtkrull et al., \"Modeling Relational Data with Graph Convolutional Networks (RGCN)\" (2018): https://arxiv.org/abs/1703.06103"},{"section":"deep-learning","difficulty":"medium","id":"dl-m030","topicSlug":"transfer-learning","orderIndex":30,"topic":"Transfer Learning","question":"You want to fine-tune T5-large (770M parameters) on a medical summarization task with 5,000 labeled examples. You consider: (A) Full fine-tuning, (B) Prefix tuning (add K trainable tokens to input, freeze T5), (C) LoRA rank-16. Compare expected generalization, parameter efficiency, and inference overhead for each.","options":{"A":"Full fine-tuning always wins on quality; efficiency is not a concern for 770M parameter models","B":"Full fine-tuning: updates all 770M params. With 5000 examples, this is 154,000 params/example — severe overfitting risk. 
High quality if regularized well, but requires significant memory (optimizer states = 2× 770M × 4 bytes = 6GB for Adam) and risk of catastrophic forgetting. Prefix tuning: adds K (e.g., 100) trainable token embeddings to every attention layer. ~0.1% of params. Low overfitting risk. No inference overhead on the model itself (prefix is fixed at inference). LoRA rank-16: updates W_q, W_k, W_v, W_o with rank-16 adapters. Params ≈ 4 × 2 × d × r × n_layers ≈ 4 × 2 × 1024 × 16 × 24 ≈ 3.1M (0.4%). Inference: merged into W at inference (W' = W + BA), zero overhead. Both LoRA and prefix tuning prevent catastrophic forgetting; LoRA typically outperforms prefix tuning at the same parameter budget.","C":"Only full fine-tuning works for medical tasks; domain-specific tasks require updating all parameters","D":"Prefix tuning adds K extra tokens to each inference request, making it slower than LoRA at inference"},"correct":"B","explanation":{"correct":"- 5000 examples guide: parameter-efficient methods (LoRA, prefix, adapters) significantly outperform full fine-tuning when data is limited, avoiding overfitting.\n- LoRA inference: the adapter matrices A and B can be merged with the original weights before deployment: W_merged = W + BA. Inference uses W_merged — identical cost to the original model.\n- Prefix tuning inference: K extra tokens (typically 100) are prepended to every input sequence, extending the effective sequence length by K. This is a modest overhead (K/T fraction of attention compute).\n- Quality comparison: LoRA > prefix tuning > prompt tuning for most NLP tasks at similar parameter budgets (Hu et al. 2022).","A":"Full fine-tuning with 5000 examples and 770M parameters is a strong overfitting setup. L2 regularization and early stopping can help, but parameter-efficient methods typically generalize better in this regime.","B":"","C":"Domain-specific medical knowledge is primarily needed in the encoder's attention patterns and the decoder's summarization logic. 
LoRA can inject this domain adaptation efficiently into the attention weight matrices without full fine-tuning.","D":"LoRA merged weights have zero inference overhead. Prefix tuning adds K extra tokens to sequences, which is a real but small overhead (K=100 for T=512: ~20% longer sequence). LoRA is more efficient at inference than prefix tuning."},"reference":"- Hu et al., \"LoRA: Low-Rank Adaptation of Large Language Models\" (2022): https://arxiv.org/abs/2106.09685\n- Li & Liang, \"Prefix-Tuning: Optimizing Continuous Prompts for Generation\" (2021): https://arxiv.org/abs/2101.00190"},{"section":"deep-learning","difficulty":"medium","id":"dl-m031","topicSlug":"introduction-to-neural-networks","orderIndex":31,"topic":"Introduction To Neural Networks","question":"A student trains a deep neural network and achieves 85% test accuracy. They then add an extra hidden layer and retrain from scratch — test accuracy drops to 80%. They conclude \"deeper networks are worse.\" What is wrong with this experimental design, and what factors could explain the accuracy drop?","options":{"A":"The student is correct — shallower networks always generalize better","B":"The experimental design has confounds: (1) Retraining from scratch means both models may not have converged — the deeper model may need more training epochs or a different LR schedule. (2) Initialization: a new random init for the deeper model may have landed in a worse basin. (3) No hyperparameter search: the deeper model may need different LR, weight decay, or batch size. Potential causes of accuracy drop: (a) degradation problem — adding a layer to a plain network without skip connections can genuinely hurt; (b) underfitting — deeper model needs more epochs; (c) optimization landscape — deeper = harder to optimize without appropriate architecture (BN, skip connections). 
Fair comparison requires: same hyperparameter tuning budget for both models.","C":"Test accuracy always decreases when adding layers; the student should stop at 2 layers","D":"The drop from 85% to 80% is within noise; the models are statistically equivalent"},"correct":"B","explanation":{"correct":"- Confounded comparison: adding a layer changes depth, parameter count, and optimization difficulty simultaneously. Without controlling for hyperparameters and training budget, the comparison is not fair.\n- The degradation problem (He et al. 2016): plain networks (no skip connections) degrade with depth on training accuracy. Adding a layer to a plain network can genuinely hurt even training accuracy.\n- Proper experiment: use the same hyperparameter grid search, same total compute budget, same validation set for model selection, then compare final test accuracy.","A":"Many tasks benefit from deeper networks (image recognition, NLP, speech). Shallower is not always better — it depends on task complexity and regularization.","B":"","C":"Optimal depth is task-dependent. ResNet-152 (152 layers) outperforms ResNet-50 on ImageNet. Stopping at 2 layers is not a general principle.","D":"85% → 80% is a 5-percentage-point drop, which is typically larger than noise for standard benchmarks with sufficient test data. It would need statistical testing to confirm, but it's a meaningful difference worth investigating."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2016): https://arxiv.org/abs/1512.03385 — Sec. 1: Degradation problem"},{"section":"deep-learning","difficulty":"medium","id":"dl-m032","topicSlug":"neurons-and-perceptrons","orderIndex":32,"topic":"Neurons And Perceptrons","question":"A 3-layer MLP processes a 10-class problem. The output layer uses softmax. During inference, you observe the model frequently outputs probabilities like [0.34, 0.33, 0.33, 0.00, ..., 0.00] — evenly split among 3 classes with near-zero for the rest. 
What does this suggest about the model's learned representations, and how would you investigate further?","options":{"A":"This is correct behavior; a 3-class problem produces 3 non-zero probabilities","B":"This output distribution suggests the model is uncertain (softmax probabilities are distributed across 3 classes) but has learned to eliminate 7 of the 10 classes as implausible for these inputs. Likely causes: (1) The input may fall in a region that looks like 3 similar classes (e.g., dog, cat, wolf — all look animal-like). (2) The pre-softmax logits for the 7 near-zero classes are strongly negative (discriminative features clearly rule them out). Investigation: (a) examine which inputs trigger this pattern — are they genuinely ambiguous? (b) inspect the top-3 logit values and the spacing between them; (c) evaluate calibration (does 34% predicted probability correspond to 34% actual accuracy? if so, the model is well-calibrated and the uncertainty is real). This is NOT a problem — it may indicate realistic uncertainty on hard examples.","C":"The model has collapsed — all 10 softmax outputs should be ~10% initially","D":"The model needs more training to increase confidence; uncertainty means underfitting"},"correct":"B","explanation":{"correct":"- Softmax behavior: the model has learned that 7 classes have strongly negative logits for this input. The remaining 3 classes have similar logit values (near-equal → near-equal softmax). This is a valid, potentially correct behavior for genuinely ambiguous inputs.\n- Calibration check: if the true accuracy for examples where the model assigns 34% to the top class is actually ~34%, the model is well-calibrated. This would validate that the uncertainty is meaningful.\n- When to worry: if ALL test examples show this pattern (model never makes confident predictions), that's a problem. If it occurs for a subset of genuinely ambiguous examples, it's appropriate behavior.","A":"The problem has 10 classes, not 3. 
Concentrating probability on only 3 classes is not automatically \"correct\": it may reflect learned discrimination (good) or a modelling issue, so the context matters.","B":"","C":"After training, softmax outputs should reflect the learned discriminative features, not the uniform prior. Post-training uniform output would indicate the model learned nothing.","D":"High confidence (low entropy) is not always desirable — it can indicate overconfidence/poor calibration. Calibrated uncertainty (34% means truly 34% likely) is often better than overconfident wrong predictions."},"reference":"- Guo et al., \"On Calibration of Modern Neural Networks\" (2017): https://arxiv.org/abs/1706.04599"},{"section":"deep-learning","difficulty":"medium","id":"dl-m033","topicSlug":"regularization-and-normalization","orderIndex":33,"topic":"Regularization And Normalization","question":"You combine L2 regularization (weight decay λ=0.01) with Dropout (p=0.5) in the same model. A colleague argues they are redundant: \"Both prevent overfitting, so one is enough.\" Are they redundant? Describe the distinct mechanism of each and a scenario where both are needed simultaneously.","options":{"A":"They are redundant; use only one regularizer to avoid double-penalizing the model","B":"They are not redundant — they prevent overfitting through distinct mechanisms: L2 regularization: penalizes large weight magnitudes. Keeps all weights small; prevents any single weight from dominating the prediction. The model learns spread-out feature attribution. Dropout: prevents co-adaptation. Randomly removes 50% of neurons per forward pass, forcing each neuron to be independently useful without relying on specific combinations of other neurons. The model learns redundant, robust representations. Scenario needing both: a large model on a small dataset. L2 prevents extreme weights (memorizing via weight magnitudes). 
Dropout prevents co-adapted feature groups (memorizing via specific neuron combinations). These are different failure modes that can co-occur.","C":"L2 makes Dropout unnecessary because L2 already prevents large weight products","D":"Dropout makes L2 unnecessary because dropping neurons is equivalent to zeroing weights"},"correct":"B","explanation":{"correct":"- Different targets: L2 acts on weight magnitudes (the scale of individual weights). Dropout acts on weight patterns (which combinations of neurons fire together). A network can have small weights (L2-compliant) but strong co-adaptations between small-weight neurons.\n- Complementary: a neuron that fires only when neurons j₁ and j₂ fire simultaneously (co-adaptation) can have small weights (satisfying L2) but still memorize by exploiting the co-activation pattern. Dropout breaks this by randomly removing j₁ or j₂.\n- Empirical: adding both L2 and Dropout to large models like BERT fine-tuning is standard practice and outperforms either alone on small datasets.","A":"They are complementary regularizers. Using both together is a valid and often beneficial strategy, not \"double-penalizing.\"","B":"","C":"L2 limits individual weight sizes. Co-adaptation can persist with small weights — two neurons with w₁=0.5 and w₂=0.5 whose joint activation pattern is informative satisfy L2 but are co-adapted.","D":"Dropout randomly sets activations to 0 during training, which effectively creates stochastic weight masking. But Dropout doesn't permanently reduce weight magnitudes — weights can grow large between dropout events. L2 continuously constrains magnitudes."},"reference":"- Srivastava et al., \"Dropout: A Simple Way to Prevent Neural Networks from Overfitting\" (2014): Sec. 4 — Dropout with other regularizers"},{"section":"deep-learning","difficulty":"medium","id":"dl-m034","topicSlug":"loss-and-cost-functions","orderIndex":34,"topic":"Loss And Cost Functions","question":"You train an autoencoder for anomaly detection. 
Normal examples are used for training. At inference, you use reconstruction loss (MSE between input and reconstruction) as the anomaly score. A data scientist asks: \"Should the threshold for anomaly detection be fixed or adaptive?\" What considerations govern the threshold choice?","options":{"A":"Always use a fixed threshold of MSE = 0.5; this is the standard anomaly cutoff","B":"The threshold should be adaptive based on: (1) Distribution of reconstruction loss on normal data — the threshold should be set at a high percentile (e.g., 95th or 99th percentile) of normal reconstruction losses, not a fixed constant. Reconstruction loss varies with data complexity; fixing at 0.5 ignores dataset-specific characteristics. (2) Desired precision-recall tradeoff — higher threshold = more conservative (fewer false positives but more missed anomalies = lower recall). This depends on application: medical devices may prefer high recall (catch all anomalies, accept false alarms); industrial quality control may prefer high precision (only flag true defects). (3) Concept drift — normal reconstruction loss distribution may shift over time; a static threshold becomes miscalibrated. Use a rolling window of recent normal examples to update the threshold.","C":"Reconstruction loss cannot distinguish anomalies from normal examples; use a classifier instead","D":"The threshold should maximize training reconstruction loss, not minimize it"},"correct":"B","explanation":{"correct":"- Reconstruction-based anomaly score: trained on only normal examples, the autoencoder learns to efficiently encode normal patterns. Anomalies (unseen patterns) are poorly reconstructed → high reconstruction loss.\n- Threshold setting: the 95th/99th percentile of validation normal reconstruction losses is a natural choice — it bounds the false positive rate at 5%/1%. 
This is data-driven, not fixed.\n- Operational considerations: in production, the false positive rate tolerance depends on the cost of investigating flagged anomalies vs the cost of missing true anomalies.","A":"MSE = 0.5 is arbitrary and dataset-dependent. For high-resolution images, normal MSE may be 50. For 10-dimensional tabular data, it may be 0.001. No fixed universal threshold makes sense.","B":"","C":"Reconstruction loss CAN distinguish anomalies if the autoencoder is well-trained and anomalies differ structurally from training data. Many industrial systems successfully use reconstruction-based anomaly detection.","D":"High training reconstruction loss means the autoencoder is not learning. Lower training loss means better reconstruction quality, which makes the anomaly threshold more reliable."},"reference":"- An & Cho, \"Variational Autoencoder-Based Anomaly Detection Using Reconstruction Probability\" (2015)"},{"section":"deep-learning","difficulty":"medium","id":"dl-m035","topicSlug":"backpropagation","orderIndex":35,"topic":"Backpropagation","question":"You build a neural network with a custom discrete operation: argmax (returns the index of the maximum value). During training, `argmax` is applied to the output of a layer, and its result is used in the next computation. The gradient of `argmax` is 0 almost everywhere (the argmax is a step function). How do practitioners handle this, and what is the trade-off of each approach?","options":{"A":"Argmax has a well-defined gradient; just use standard backpropagation","B":"Two main approaches: (1) Straight-Through Estimator (STE): in the backward pass, pretend argmax = identity — pass gradients as if no argmax was applied. Forward: use the hard argmax. Backward: gradient flows through as if the layer output equals the selected value continuously. Fast, simple, but the forward and backward passes are inconsistent (gradient is a biased estimator of the true gradient). 
(2) Gumbel-Softmax: replace argmax with softmax((logits + g)/τ), where g is i.i.d. Gumbel(0, 1) noise and τ is a temperature. As τ → 0, the samples approach one-hot vectors drawn from the categorical distribution defined by the logits. Backward: smooth function, exact gradient of the relaxed function via backprop. Trade-off: small τ makes gradients very high-variance (the softmax saturates toward one-hot); large τ deviates from true argmax (soft approximation introduces bias). Both approaches are used; STE is simpler; Gumbel-Softmax provides better gradient estimates for some applications.","C":"Remove argmax from the architecture; neural networks cannot use discrete operations","D":"Replace argmax with max (returns the value, not the index); max has a valid gradient"},"correct":"B","explanation":{"correct":"- Why argmax gradient = 0 almost everywhere: argmax(x) = argmax_i x_i. The output (an integer index) changes only when the ordering of x changes — at the specific hyperplane boundaries where x_i = x_j. Almost everywhere (when all x_i are distinct), the gradient is 0 (no change in output for small changes in input).\n- STE applications: quantization-aware training (rounding), discrete VAE latent codes, reinforcement learning discrete action spaces.\n- Gumbel-Softmax applications: neural machine translation with discrete latent variables, generative models with discrete structures.","A":"Argmax does NOT have a well-defined gradient that standard backprop can use. Its gradient is 0 almost everywhere, not a useful learning signal.","B":"","C":"Discrete operations ARE used in neural networks — VQ-VAE, discrete autoencoders, neural Turing machines all have discrete operations. The key is how to handle gradients.","D":"`max(x)` returns the maximum value, not the index. Its gradient is well-defined: 1 for the maximum element, 0 for others. 
But this doesn't solve the problem of needing the selected index for subsequent operations."},"reference":"- Bengio et al., \"Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation\" (2013) — STE\n- Jang et al., \"Categorical Reparameterization with Gumbel-Softmax\" (2017): https://arxiv.org/abs/1611.01144"},{"section":"deep-learning","difficulty":"medium","id":"dl-m036","topicSlug":"cnn-architectures","orderIndex":36,"topic":"Cnn Architectures","question":"VGG-16 uses only 3×3 convolutions throughout. The designers argue \"two stacked 3×3 conv layers have the same receptive field as one 5×5 layer but with fewer parameters.\" Verify this claim numerically (FLOPs and parameters) for C input and output channels.","options":{"A":"Two 3×3 layers are identical in parameters and FLOPs to one 5×5 layer","B":"Receptive field: one 5×5 layer sees a 5×5 window. Two stacked 3×3 layers: layer 1 sees 3×3; for each layer-2 output, it aggregates from a 3×3 window of layer-1 outputs, each of which saw a 3×3 input region. Effective receptive field: 3 + 3 - 1 = 5. Same receptive field ✓. Parameters: one 5×5 conv: 5² × C × C = 25C². Two 3×3 convs: 2 × 3² × C × C = 18C². Savings: 25C² - 18C² = 7C² (28% fewer parameters). FLOPs per output pixel: 5×5: 25×C² multiply-adds. 3×3×2: 18×C² multiply-adds. Two 3×3 are cheaper. Trade-off: two 3×3 layers have two activation functions (added non-linearity) vs one for 5×5 — actually MORE expressive despite fewer parameters.","C":"Three 3×3 layers have the same receptive field as one 5×5; two 3×3 gives a 4×4 receptive field","D":"Two 3×3 layers have 3² × 2 = 18 parameters while one 5×5 has 5² = 25; parameters scale without channel dimension"},"correct":"B","explanation":{"correct":"- Receptive field formula: stacking two 3×3 layers (stride 1): effective RF = 1 + L×(K-1) = 1 + 2×(3-1) = 5. 
Matches one 5×5 layer.\n- Parameter comparison: including the channel dimension (critical for CNNs): one 5×5: K² × C_in × C_out = 25C². Two 3×3: 2 × K² × C² = 18C² (assuming C_in = C_out = C for both).\n- Extra non-linearity: two conv layers have two activation functions, giving the network more representational power than one conv+activation. This is the \"bonus\" of VGG's design: same RF, fewer parameters, more non-linearity.","A":"","B":"","C":"Two stacked 3×3 layers give a 5×5 effective RF (not 4×4). The formula is 1 + L(K-1) = 1 + 2(2) = 5.","D":"The parameter count must include the channel dimensions (C_in × C_out). Ignoring channels gives 9 vs 25 (single-channel), but in CNNs with C channels: 18C² vs 25C². The claim is about the multi-channel case."},"reference":"- Simonyan & Zisserman, \"Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG)\" (2015): https://arxiv.org/abs/1409.1556"},{"section":"deep-learning","difficulty":"medium","id":"dl-m037","topicSlug":"self-supervised-and-contrastive-learning","orderIndex":37,"topic":"Self Supervised And Contrastive Learning","question":"You use SimCLR to learn representations from a medical ultrasound dataset. After pretraining, you train a linear probe on 1% labeled data and get 62% accuracy. A baseline supervised model trained on the same 1% labeled data gets 55%. Your SSL model improves on the supervised baseline by 7 points. A colleague says \"but with 100% labels, supervised training gets 89%. SSL is still worse.\" Is SSL successful in this scenario, and when does it matter most?","options":{"A":"SSL failed because it didn't reach supervised performance with full labels","B":"SSL is successful in this scenario. The relevant comparison is SSL pretraining + 1% labels (62%) vs supervised with 1% labels (55%). SSL provided a 7-point improvement using the same label budget. The 89% with 100% labels is a different experimental condition — it uses 100× more labeled data. 
SSL matters most when: (1) labeled data is scarce but unlabeled data is abundant (standard in medical imaging — labeling requires radiologist time); (2) the representation quality from SSL transfers well to the downstream task. The comparison to 100% supervised is only relevant if you're choosing between SSL+few labels vs spending resources to collect 100% labels. If the 100% labels cost $1M to collect, SSL+1% might be the only feasible approach.","C":"SSL only succeeds when it matches the full-supervision baseline; otherwise it's not useful","D":"SSL performance of 62% is poor regardless of comparison; retrain with 100% labeled data"},"correct":"B","explanation":{"correct":"- Right comparison: SSL is designed to help in the low-label regime. The appropriate comparison is: (same label budget) supervised vs SSL+supervised. Here, 55% → 62% is a real improvement.\n- Practical relevance: in medical AI, collecting labels requires clinical annotation. 1% of a 10,000-image dataset = 100 labeled images. Getting 100 labels labeled by radiologists is feasible; getting 10,000 labels may not be.\n- SSL's value proposition: unlabeled data (9,900 images) + 100 labels > just 100 labeled images. The improvement from SSL shows the unlabeled data added real value.","A":"SSL is not evaluated relative to its full-supervision ceiling. It's evaluated relative to its label-constrained baseline (supervised with same labels). The comparison to 89% is unfair since it uses 100× more labeled data.","B":"","C":"SSL is valuable whenever it improves performance at the target label budget. Matching full-supervision is a bonus, not a requirement for success.","D":"62% with 1% labels vs 55% with 1% labels is a meaningful improvement. 
Whether to collect more labels or use SSL depends on the cost of labeling vs the cost of deployment errors."},"reference":"- Chen et al., \"A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)\" (2020): https://arxiv.org/abs/2002.05709 — linear evaluation protocol"},{"section":"deep-learning","difficulty":"medium","id":"dl-m038","topicSlug":"transfer-learning","orderIndex":38,"topic":"Transfer Learning","question":"You fine-tune a pretrained model (task A → task B) and achieve good task B accuracy. A colleague says: \"The pretrained model is just providing weight initialization; fine-tuning from a good random init would be just as good.\" Design an experiment to test this claim and predict the outcome for two scenarios: (1) task A and B share domain, (2) task A and B are from different domains.","options":{"A":"The colleague is correct in both scenarios; initialization quality doesn't matter for fine-tuning","B":"Experiment: train 3 models — (1) pretrained-A init → fine-tune on B; (2) random init → train on B; (3) Kaiming/Xavier random init → train on B with same hyperparameters. Compare B test accuracy and training speed. Scenario 1 (same domain, e.g., ImageNet→indoor scenes): pretrained initialization wins significantly. The features (edge detectors, texture filters) transfer directly. Random init requires relearning these from scratch with less data. Scenario 2 (different domain, e.g., natural images→X-rays): the margin narrows or may reverse. Pretrained features for natural images may not help for X-rays. If labeled B data is large enough, random init may catch up. If B data is small, pretraining still helps (generic low-level features generalize).","C":"Random init with longer training always matches pretrained init regardless of domain","D":"The comparison only makes sense when task B has no labeled data (zero-shot)"},"correct":"B","explanation":{"correct":"- Knowledge transfer experiment: this is exactly the setup of Yosinski et al. 
(2014) \"How transferable are features in deep neural networks?\" — they systematically tested transferred vs random init across varying layer depth and domain similarity.\n- Same domain: pretrained init typically achieves higher final accuracy AND trains faster. The features are relevant and provide a strong starting point.\n- Different domain: early layers (edge detectors) still generalize across domains. Only very domain-specific late layers may need significant adaptation. Even for different domains, pretrained init often converges faster and to comparable or better solutions.\n- Data size effect: with large labeled B data, random init can match pretrained init given enough training. With small labeled B data, pretrained init is critical.","A":"Well-designed transfer learning experiments consistently show pretrained init outperforms random init, especially in low-data regimes. The colleague's claim is empirically false for same-domain scenarios.","B":"","C":"Given unlimited compute and large enough datasets, random init can learn equivalent features, but this is practically unrealistic for most real-world settings.","D":"Transfer learning is most commonly evaluated in the semi-supervised few-shot setting (limited labeled data), not zero-shot. The value of pretrained init is precisely for limited-data scenarios, not no-data scenarios."},"reference":"- Yosinski et al., \"How transferable are features in deep neural networks?\" (2014): https://arxiv.org/abs/1411.1792"},{"section":"deep-learning","difficulty":"medium","id":"dl-m039","topicSlug":"rnn-lstm-gru","orderIndex":39,"topic":"Rnn Lstm Gru","question":"You train a stacked 3-layer LSTM (hidden size 256) for sentiment analysis on movie reviews. The model achieves 90% accuracy. You're asked to increase capacity while keeping inference time under 50ms. A colleague suggests increasing to 6 layers (keep same hidden size). Another suggests keeping 3 layers but doubling hidden size to 512. 
What are the trade-offs in terms of parameters, sequential depth, and practical performance?","options":{"A":"6 layers is always better; deeper models always outperform wider ones for NLP","B":"6 layers (depth): parameters ≈ 2× (3 more layers of the same size). Sequential depth = 6 LSTM computations per step — inference time increases proportionally (each LSTM layer adds sequential compute). Risk: harder optimization as signals pass through more stacked layers; deeper LSTMs don't always outperform shallower ones (diminishing returns). 3 layers, hidden=512 (width): parameters ≈ 4× (the recurrent weight matrices scale with hidden²: 256² → 512²). Same sequential depth = same inference time structure — width is parallelizable within each step (LSTM operations are matrix multiplications, which parallelize well). Risk: overfitting with 4× more parameters. Practical recommendation: width (512 hidden) often outperforms depth for LSTMs in NLP; inference time depends more on the number of sequential LSTM layers than hidden size (due to GPU parallelism of matrix ops).","C":"Wider and deeper models have identical inference time; only parameter count matters","D":"Only width matters for LSTMs; depth (stacked layers) provides no benefit"},"correct":"B","explanation":{"correct":"- Inference time analysis: each LSTM layer processes one step at a time (sequential dependency h_t ← h_{t-1}). 6 layers = 6 sequential LSTM computations per time step. 3 layers = 3. Depth directly increases sequential latency.\n- Width parallelism: within each LSTM step, the matrix multiplications W_h × h_{t-1} and W_x × x_t are large matmuls that parallelize well on a GPU. Wider hidden size → larger matmul, but not more sequential operations.\n- Practical guideline: LSTMs benefit from moderate depth (2-4 layers) but wider hidden sizes are often more impactful for sentiment analysis where the most relevant patterns are captured in a few layers.","A":"For NLP tasks with LSTMs, depth beyond 3-4 layers shows diminishing returns. 
Wider models often outperform deeper ones for semantic understanding tasks.","B":"","C":"Depth adds sequential compute (more LSTM steps per time step). Width adds parallelizable compute (larger matmuls). GPU accelerators parallelize matmuls efficiently, so wider models often run at similar or slightly higher speed than deeper models with the same parameter count.","D":"Depth does provide benefit: 2-3 stacked LSTM layers capture different abstraction levels (character/word patterns, phrase patterns, sentence patterns). Pure width without depth misses hierarchical feature composition."},"reference":"- Graves et al., \"Speech Recognition with Deep Recurrent Neural Networks\" (2013) — stacked RNN motivation"},{"section":"deep-learning","difficulty":"medium","id":"dl-m040","topicSlug":"attention-and-transformers-dl","orderIndex":40,"topic":"Attention And Transformers Dl","question":"Pre-LayerNorm (Pre-LN) applies LayerNorm to the input of each attention and FFN sub-layer, inside the residual branch, while Post-LN applies it after the residual addition. Post-LN was the original Transformer design. Why has Pre-LN become the standard for deep Transformers, and what does Post-LN training require that Pre-LN does not?","options":{"A":"Pre-LN is standard only because it requires fewer parameters","B":"Post-LN issue: LayerNorm sits on the main path after every residual addition, so there is no un-normalized identity route from the output back to the early layers. Xiong et al. (2020) show that at initialization the expected gradients of the parameters near the output layer are large, so a large learning rate applied from the first step destabilizes training, especially for deep Transformers (24+ layers). This makes Post-LN require careful learning rate warmup (small LR initially, gradually increasing) and is prone to instability without it. Pre-LN: each sub-layer's input is normalized before the attention/FFN computation. 
The residual path carries the un-normalized signal, while the attention and FFN see normalized inputs. Gradients flow more uniformly across layers. Pre-LN allows training at higher LR without warmup and is more robust to hyperparameter choices.","C":"Pre-LN produces better final quality; Post-LN always converges to a worse solution","D":"The difference is only in inference; training is identical for both"},"correct":"B","explanation":{"correct":"- Post-LN gradient analysis: x_{l+1} = LayerNorm(x_l + F_l(x_l)), so ∂L/∂x_l = ∂L/∂x_L × Π_{i=l}^{L-1} [∂LayerNorm_i/∂(x_i + F_i(x_i)) × (I + ∂F_i/∂x_i)]. Every step on the main path passes through a LayerNorm Jacobian, so there is no pure identity route to the early layers, and at initialization the expected gradients of the parameters near the output layer are large.\n- Pre-LN gradient analysis: x_{l+1} = x_l + F_l(LayerNorm(x_l)), so ∂L/∂x_l = ∂L/∂x_{l+1} × (I + ∂F_l/∂LayerNorm(x_l) × ∂LayerNorm(x_l)/∂x_l). The I term ensures gradients flow through the skip connection, and LayerNorm normalizes the scale of each branch input. The gradient magnitude is more stable across layers.\n- Warmup requirement: Xiong et al. (2020) showed mathematically why Post-LN requires warmup while Pre-LN can start with larger LR. In practice, GPT-2, LLaMA, and most modern LLMs use Pre-LN (LLaMA with RMSNorm in place of LayerNorm).","A":"Pre-LN and Post-LN have the same number of parameters — LayerNorm parameters are the same in both cases, just placed differently.","B":"","C":"Post-LN can achieve competitive final quality WITH proper warmup and training procedures. The issue is training stability and sensitivity to hyperparameters, not final quality ceiling.","D":"The normalization placement fundamentally changes gradient flow during training. 
Both inference and training are affected by the normalization position, but the stability difference is most pronounced during training."},"reference":"- Xiong et al., \"On Layer Normalization in the Transformer Architecture (Pre-LN analysis)\" (2020): https://arxiv.org/abs/2002.04745"}],"allMcqs":[{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01001","difficulty":"easy","orderIndex":1,"question":"A textbook describes a biological neuron by mapping its parts to a perceptron: dendrites receive signals, the cell body sums them, and the axon fires if the sum exceeds a threshold. A student uses this analogy to argue that increasing the number of dendrites (input connections) on a neuron will always improve classification accuracy. What is wrong with this reasoning?","options":{"A":"More inputs increase computation time, which degrades accuracy in practice","B":"The number of inputs is fixed by the dataset — adding connections adds noise, not signal","C":"The biological analogy breaks down at this point: more weighted inputs expand the input space but do not change the fundamental linear decision boundary a single perceptron can represent","D":"Biological neurons operate in continuous time, so discrete perceptrons cannot model more dendrites"},"correct":"C","explanation":{"correct":"- A single perceptron computes a weighted sum of inputs and applies a threshold. The decision boundary it can represent is always a hyperplane — adding more input features expands the dimensionality but the separator remains linear.\n- The biological analogy is useful for intuition but does not imply that more connections enable non-linear separation. 
The constraint is architectural (single layer, linear activation), not a data quantity issue.\n- In production, adding irrelevant features to a linear classifier typically hurts generalization (curse of dimensionality) without resolving non-linearly separable problems.","A":"Computation time is a systems concern, not a model capacity concern. The question is about classification accuracy as a function of model power, not wall-clock time.","B":"The number of inputs is determined by the feature space, but \"noise vs signal\" is a data quality argument, not a model capacity argument. The real issue is the linear decision boundary, not input noise.","C":"","D":"Discrete vs continuous time is an irrelevant distinction here. Standard perceptrons are not time-based and the analogy breakdown is about representational capacity, not temporal dynamics."},"reference":"- Rosenblatt, F., \"The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain\" (1958): https://psycnet.apa.org/record/1959-09865-001\n- Nielsen, M., \"Neural Networks and Deep Learning\", Chapter 1: http://neuralnetworksanddeeplearning.com/chap1.html"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01002","difficulty":"easy","orderIndex":2,"question":"A junior engineer implements a perceptron for binary classification and reports 100% training accuracy on an AND gate dataset. She then tries the same perceptron on an OR gate dataset and gets 100% again. Encouraged, she applies it directly to an XOR dataset and gets exactly 50% accuracy — no better than random. 
Why does this specific jump fail?","options":{"A":"The perceptron learning rule diverges for XOR because the learning rate is not tuned correctly for that dataset","B":"XOR is not linearly separable — no single straight line (hyperplane) can divide XOR's positive and negative examples in input space, which is the only type of boundary a perceptron can represent","C":"XOR requires binary inputs, but the perceptron interprets inputs as continuous, causing precision errors","D":"The training dataset for XOR has only 4 samples, which is insufficient for the perceptron to converge"},"correct":"B","explanation":{"correct":"- AND and OR are linearly separable: you can draw a line in 2D that perfectly separates their 0-outputs from 1-outputs. XOR cannot be separated by any hyperplane in the original input space.\n- The perceptron convergence theorem guarantees convergence only for linearly separable problems. On XOR, the algorithm oscillates indefinitely — it is not a learning rate or sample count problem.\n- This is historically significant: Minsky and Papert's 1969 analysis of XOR's non-separability contributed to the first \"AI winter\" by demonstrating fundamental limitations of single-layer networks.","A":"Tuning the learning rate cannot fix a geometric impossibility. The perceptron updates weights to minimize misclassifications, but no weight configuration produces zero errors for XOR on a single layer.","B":"","C":"XOR operates on {0,1} inputs, which are valid continuous values. Precision is not the issue — the problem is representational capacity of the linear model.","D":"The perceptron convergence theorem applies regardless of dataset size as long as the data is linearly separable. With only 4 points, XOR can be exhaustively enumerated and the non-separability is provable analytically, not statistically."},"reference":"- Minsky, M. 
& Papert, S., \"Perceptrons\" (1969): https://mitpress.mit.edu/9780262630221/perceptrons/\n- Visualizing XOR non-separability: https://playground.tensorflow.org/"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01003","difficulty":"easy","orderIndex":3,"question":"You are reviewing a colleague's perceptron implementation. The update rule modifies weights only when the prediction is wrong. Your colleague argues this is a bug — \"we should update weights on every sample to ensure the model keeps learning.\" Who is correct and why?","options":{"A":"The colleague is correct; skipping updates on correct samples wastes gradient information","B":"The original implementation is correct; updating only on misclassifications is the defining rule of the Perceptron algorithm, and updating on correct predictions would push the decision boundary away from correct examples","C":"Both approaches converge to the same solution; it is purely a performance optimization choice","D":"Neither is correct; perceptrons require batch updates across all samples simultaneously, not online per-sample updates"},"correct":"B","explanation":{"correct":"- The Perceptron learning rule (Rosenblatt, 1958) updates weights as: w ← w + η·(y - ŷ)·x. When y = ŷ (correct prediction), the update is zero by definition — not a special case but the mathematical result.\n- Updating weights on correctly classified samples would introduce unnecessary perturbations, potentially moving the decision boundary away from a valid separating hyperplane.\n- The Perceptron algorithm is guaranteed to converge (find a separating hyperplane in finite steps) for linearly separable data under the standard update-on-mistake rule. This guarantee does not hold for arbitrary update schedules.","A":"There is no gradient in a standard perceptron — it is not a gradient descent method. 
The concept of \"wasting gradient information\" doesn't apply; the update rule is a correction signal, not a gradient.","B":"","C":"The two approaches do not converge to the same solution. Updating on correct samples introduces drift and can cause oscillation even on linearly separable data.","D":"The Perceptron algorithm is inherently online (processes one sample at a time). Batch perceptrons exist but are not the standard formulation, and the question is about the classical single-sample update rule."},"reference":"- Novikoff, A.B.J., \"On convergence proofs for perceptrons\" (1963): classic convergence proof\n- https://cs229.stanford.edu/notes2022fall/cs229-notes6.pdf"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01004","difficulty":"medium","orderIndex":4,"question":"A team trains a perceptron to classify whether a loan should be approved (1) or rejected (0) based on two features: credit score and income. After training, they plot the decision boundary and find a straight line correctly separating all training samples. They then test on a held-out set and get 95% accuracy. A new feature — \"number of late payments\" — is added. Retraining yields 80% accuracy. The team concludes the new feature \"confused\" the perceptron. 
What is the most likely true cause?","options":{"A":"Adding a feature increases the input dimension, which always reduces accuracy in linear classifiers","B":"The original two features were linearly separable, and adding a third feature cannot remove that separability (a zero weight on the new feature recovers the old boundary); the drop is more likely a generalization problem, where the perceptron settles on a different separating plane that leans on the new feature, which may be unscaled or may correlate with noise in the training set","C":"The perceptron cannot handle three or more features simultaneously — it is limited to two-dimensional inputs","D":"The learning rate must be reduced when adding features, otherwise the perceptron overshoots the optimal boundary"},"correct":"B","explanation":{"correct":"- Linear separability of the training set is preserved when a feature is added: any separating line in the original two features is still a separating plane in three dimensions with weight 0 on the new feature. The drop from 95% to 80% therefore points to how the perceptron uses the new feature, not to a loss of separability.\n- The perceptron stops at the first separating hyperplane it finds, not a maximum-margin one. With a third feature there are more separating planes to choose from, and one that leans on \"number of late payments\" can fit training-set noise and generalize worse. An unscaled feature can also dominate the updates and slow convergence within a fixed epoch budget.\n- In practice, before adding features to linear models, teams should standardize the new feature and compare held-out accuracy with and without it, or use a margin-based linear model such as an SVM with a linear kernel.","A":"Higher dimensionality does not always reduce linear accuracy. If the new feature is linearly predictive, it can improve accuracy. The dimensionality itself is not the problem.","B":"","C":"A perceptron generalizes to any number of dimensions — it computes w·x + b for a weight vector of arbitrary length. There is no dimensionality cap.","D":"Learning rate affects convergence speed and stability, not the fundamental geometric feasibility of linear separation. 
If the data is linearly separable in 3D, any positive learning rate will eventually converge."},"reference":"- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01005","difficulty":"medium","orderIndex":5,"question":"The Perceptron Convergence Theorem states the algorithm will find a separating hyperplane in a finite number of updates if the data is linearly separable. A researcher applies a perceptron to a dataset and observes the algorithm running for 10,000 epochs without converging. She concludes the data must not be linearly separable. A colleague disagrees. Who is more likely correct, and why?","options":{"A":"The researcher is correct; non-convergence after sufficient epochs is the standard test for non-linear separability","B":"The colleague is more likely correct; the theorem guarantees finite steps proportional to the margin, and \"sufficient epochs\" depends on how small the margin is — tight margins can require millions of updates even for separable data","C":"Both are wrong; the perceptron always converges in at most n² steps where n is the number of samples","D":"The colleague is correct only if the learning rate is set to exactly 1.0; otherwise the theorem does not apply"},"correct":"B","explanation":{"correct":"- The convergence theorem bounds the number of updates by R²/γ², where R is the maximum norm of the input vectors and γ is the geometric margin (distance from the closest point to the separating hyperplane). 
If γ is very small (nearly non-separable data), R²/γ² can be enormous.\n- A dataset with a tiny margin (e.g., two classes separated by 0.001 in feature space) is technically linearly separable but may require millions of updates to converge — far exceeding what 10,000 epochs covers.\n- In practice, engineers use SVMs with a linear kernel to detect near-margin separability, rather than relying on perceptron convergence as a test.","A":"Non-convergence is not a definitive test for non-separability because the required iterations grow inversely with the margin squared. A practical epoch limit is not a mathematical proof.","B":"","C":"There is no n² step bound in the standard convergence theorem. The bound depends on R and γ, not solely on the number of samples.","D":"The convergence theorem holds for any positive learning rate η, not just η=1.0. The learning rate affects the scale of weight updates but not the convergence guarantee."},"reference":"- Novikoff convergence proof bound: R²/γ²\n- Shalev-Shwartz & Ben-David, \"Understanding Machine Learning\", Chapter 9"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01006","difficulty":"medium","orderIndex":6,"question":"Given the XOR truth table: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0 — a senior engineer claims you can solve XOR with a single perceptron by using a non-linear feature transformation: mapping (x₁, x₂) → (x₁, x₂, x₁·x₂). A junior engineer says this \"cheats\" and doesn't count as a perceptron solution. 
Who is right, and what does this reveal about neural networks?","options":{"A":"The senior engineer is wrong; XOR cannot be solved by any perceptron regardless of input transformation","B":"The junior engineer is right that it \"cheats\" — only raw features are valid inputs to a perceptron; adding derived features violates the perceptron definition","C":"The senior engineer is correct in principle: applying a feature map to create a higher-dimensional linearly separable representation is valid and is exactly what a hidden layer in a neural network computes automatically","D":"Both are partially right; the transformation works but requires the perceptron to have three inputs, which is only valid for 3-class problems"},"correct":"C","explanation":{"correct":"- In the transformed space (x₁, x₂, x₁x₂), XOR becomes linearly separable. For example, the plane w = [1, 1, -2] with bias -0.5 correctly classifies all four points. This is a valid perceptron on 3 features.\n- This insight is the core motivation for neural networks: a hidden layer computes a learned non-linear feature transformation (the \"representation\"), and the output layer performs linear classification in the transformed space.\n- The \"kernel trick\" in SVMs and the \"representation learning\" in deep networks are both formalizations of the same principle: learn or design a feature map that makes the problem linearly separable.","A":"XOR is absolutely solvable with the right feature map. The Minsky-Papert result says it is not solvable with raw inputs on a single-layer perceptron — not that it is fundamentally unsolvable.","B":"The perceptron model accepts any feature vector as input. There is no rule restricting inputs to \"raw\" features. Feature engineering is standard practice — the distinction is whether the transformation is manual or learned.","C":"","D":"The three inputs correspond to three features, not three classes. 
The number of inputs in a perceptron is independent of the number of output classes."},"reference":"- http://neuralnetworksanddeeplearning.com/chap4.html (visual proof of universal approximation)\n- Kernel trick and XOR: https://cs229.stanford.edu/"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01007","difficulty":"medium","orderIndex":7,"question":"A data scientist trains a neural network with one hidden layer (2 hidden units, ReLU) to solve XOR and achieves 100% accuracy. She then removes the hidden layer (making it a single perceptron) and retrains — the perceptron never converges. She concludes that \"more neurons\" is what solved XOR. A reviewer pushes back. What is the reviewer's most accurate correction?","options":{"A":"The reviewer is wrong; more computational units is exactly what solves XOR","B":"The reviewer would argue it is not the number of neurons but the non-linear hidden layer that transforms the input space into a representation where XOR becomes linearly separable — depth and non-linearity together enable this, not just adding neurons","C":"The reviewer would point out that a perceptron with 4 or more neurons in a single layer can also solve XOR","D":"The reviewer would argue the difference is the ReLU activation — a perceptron with ReLU could solve XOR without a hidden layer"},"correct":"B","explanation":{"correct":"- Adding neurons to a single-layer network (without a hidden layer) only produces more linear classifiers whose ensemble is still a linear function. You cannot combine linear functions to get a non-linear one without non-linearity between them.\n- The hidden layer with ReLU creates a piecewise-linear transformation of the input space. 
The two hidden units effectively create new features that separate XOR's pattern, and the output layer is a linear classifier on those features.\n- The key insight: it is the combination of non-linearity (activation functions) and depth (hidden layers) that grants representational power — not sheer neuron count in a flat architecture.","A":"\"More neurons\" in a single layer without non-linearity between them collapses to a single linear function (by the superposition property of linear transforms). This cannot solve XOR.","B":"","C":"A single-layer network with any number of neurons remains a linear classifier. Adding more neurons to a flat architecture is equivalent to increasing the width of one linear transformation, which stays linear.","D":"ReLU applied at the output layer of a single perceptron changes it from a linear to a piecewise-linear function, but the function is still a single hinge — it cannot separate XOR's four quadrant pattern. You need at least two ReLU units with different boundaries."},"reference":"- https://playground.tensorflow.org/ (interactive XOR solution with hidden layers)"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01008","difficulty":"hard","orderIndex":8,"question":"Minsky and Papert's 1969 analysis of perceptrons showed that a single-layer network cannot compute the XOR function with locally connected (limited-order) predicates. This result contributed to defunding of neural network research for nearly a decade. A historian argues: \"The AI winter was a rational response — Minsky proved neural networks were fundamentally flawed.\" A modern ML researcher disputes this characterization. 
What is the most technically precise basis for the researcher's disagreement?","options":{"A":"Minsky's proof was mathematically incorrect and has since been disproven","B":"Minsky and Papert explicitly noted that multi-layer networks could overcome these limitations, and the generalization of their result to all neural networks was an overinterpretation that the field accepted uncritically","C":"XOR is not an important real-world problem, so the limitation was overstated from the beginning","D":"Backpropagation was already known in 1969 and could have solved XOR immediately, making the AI winter purely political"},"correct":"B","explanation":{"correct":"- Minsky and Papert's book explicitly discussed multi-layer perceptrons in the final chapter and noted that their analysis did not extend to networks with hidden layers. The \"AI winter\" resulted from the research community overgeneralizing a proof about single-layer networks.\n- The community's mistake was assuming that because hidden-layer networks lacked training algorithms (backpropagation wasn't practically known/applied until Rumelhart et al., 1986), they were not worth pursuing — conflating \"hard to train\" with \"fundamentally limited.\"\n- This is a historically important lesson about how limitations of a specific model can be misread as limitations of an entire research paradigm.","A":"Minsky and Papert's proofs are mathematically correct for their stated scope (finite-order perceptrons, single layer). The issue was scope of interpretation, not mathematical error.","B":"","C":"XOR is a canonical non-linear classification problem. Its unsolvability by a single perceptron directly implies that any non-linearly separable problem — which is the vast majority of real-world problems — cannot be solved by a flat network.","D":"Backpropagation was not practically known in 1969. Werbos derived it in his 1974 thesis, and the key popularization was Rumelhart, Hinton & Williams in 1986. 
The AI winter was partly due to the genuine absence of a practical training method for multi-layer networks."},"reference":"- Minsky & Papert, \"Perceptrons\" (1969)\n- Rumelhart, Hinton & Williams, \"Learning representations by back-propagating errors\" (1986): https://www.nature.com/articles/323533a0"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01009","difficulty":"hard","orderIndex":9,"question":"You are building a neural network from scratch to solve XOR. With two hidden units and sigmoid activations, the network trains successfully. You then replace the sigmoid with a linear activation (f(x) = x) in the hidden layer, keeping everything else identical, and retrain from scratch. The network now fails to solve XOR. Your manager asks why changing \"just the activation\" breaks it. What is the exact mathematical reason?","options":{"A":"Linear activations cause gradient explosion during backpropagation, preventing convergence","B":"A network with linear activations in hidden layers is mathematically equivalent to a single-layer linear network regardless of depth — the composition of linear functions is itself a linear function, eliminating all non-linear representational power","C":"Linear activations saturate at large values, causing the hidden layer to output constants for XOR's inputs","D":"Linear activations require a different learning rate than sigmoid activations; the existing hyperparameters are incompatible"},"correct":"B","explanation":{"correct":"- If hidden layer j computes h = W₂(W₁x + b₁) + b₂, this simplifies to (W₂W₁)x + (W₂b₁ + b₂) = Wx + b — a single affine transformation. 
No depth of linear layers adds representational power beyond a single layer.\n- This is the mathematical proof that depth alone does not grant expressiveness — non-linear activation functions are the critical ingredient that makes composition of layers more powerful than any single layer.\n- In practice, this means a 100-layer fully linear network is equivalent to a single linear model: linear regression if the output is linear, or logistic regression if the output unit is a sigmoid, as it is here. Non-linearity (sigmoid, ReLU, tanh) is not an implementation detail — it is the source of all representational power in neural networks.","A":"Linear activations do not cause gradient explosion by themselves. In fact, the gradient of a linear activation is a constant (1.0), which is numerically very stable. The issue is representational, not optimization-related.","B":"","C":"Linear activations do not saturate — their output is unbounded. Saturation is a property of sigmoid and tanh, where outputs asymptote to 0 or 1 (or -1/1), causing vanishing gradients.","D":"Learning rate is a hyperparameter of the optimizer. While different activations may benefit from different learning rates, the failure to solve XOR is fundamental — no learning rate will allow a linear network to represent XOR."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 6.3 (Hidden Units): https://www.deeplearningbook.org/"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01010","difficulty":"hard","orderIndex":10,"question":"A researcher plots the loss landscape of a perceptron trained on a linearly separable dataset and observes that the loss surface has many local minima. She uses this as evidence that the perceptron learning rule is unreliable. A senior ML engineer disagrees. 
What is the most precise technical reason the senior engineer is correct?","options":{"A":"Modern perceptrons use Adam optimizer which avoids local minima completely","B":"The perceptron uses a step function (threshold activation), making its loss non-differentiable, but the update rule is a direct correction rule, not gradient descent — there are no local minima in the relevant sense because the algorithm is not minimizing a smooth loss function","C":"The loss landscape of a linearly separable problem has exactly one global minimum by definition, so local minima cannot exist","D":"The perceptron averages updates across all misclassified samples, which statistically eliminates local minima"},"correct":"B","explanation":{"correct":"- The classical Perceptron algorithm does not perform gradient descent. It applies a correction w ← w + η·(y - ŷ)·x directly when a sample is misclassified. There is no differentiable loss being minimized.\n- The concept of \"local minima\" in an optimization sense applies to gradient-based methods minimizing a smooth scalar loss. For the Perceptron, convergence is guaranteed by the geometric structure of the problem (Novikoff's theorem), not by a loss landscape argument.\n- The confusion arises because researchers familiar with modern deep learning (where gradient descent on smooth losses is universal) incorrectly apply loss landscape intuitions to algorithms that don't operate on smooth losses.","A":"The standard perceptron does not use Adam or any adaptive optimizer. And Adam does not \"avoid local minima completely\" — it converges to local minima more efficiently than SGD but does not escape them in general.","B":"","C":"For a linearly separable problem, there are infinitely many valid separating hyperplanes (any hyperplane in the margin region works), so the \"solution\" is not unique. 
The loss landscape argument is moot for the Perceptron's update rule.","D":"The classical perceptron is an online algorithm — it updates on one sample at a time, not as a batch average. Even mini-batch averaging doesn't eliminate local minima in gradient descent."},"reference":"- Novikoff, A.B.J., \"On convergence proofs for perceptrons\" (1963)\n- https://cs229.stanford.edu/notes2022fall/cs229-notes6.pdf"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01011","difficulty":"medium","orderIndex":11,"question":"A neural network with 3 input features, 1 hidden layer (4 units), and 1 output unit is described as having \"two layers.\" A student insists it has \"three layers\" because she counts the input, hidden, and output layers. In a job interview, which answer is expected and what is the correct convention?","options":{"A":"The student is correct — always count all layers including input; this is the IEEE standard","B":"Both conventions are used, but in interviews and research papers, layer count typically refers to the number of layers with learnable parameters (weight matrices). The input layer has no parameters, so the network is called a \"2-layer network\" or \"1-hidden-layer network\"","C":"The network has 4 layers because each hidden unit counts as a separate layer","D":"The correct count is always the total number of weight matrices plus the number of bias vectors"},"correct":"B","explanation":{"correct":"- In the deep learning community (and in most interview contexts), \"N-layer network\" refers to N layers with learnable parameters. An input layer simply passes data and has no weights, so it is not counted.\n- A \"2-layer network\" has 1 hidden layer and 1 output layer. A \"3-layer network\" has 2 hidden layers. This is the convention used in Goodfellow et al.'s \"Deep Learning\" textbook and most research papers.\n- Ambiguity in layer counting is a common source of confusion. 
Being precise (\"a network with one hidden layer\" vs \"a 2-layer network\") is better practice in technical communication.","A":"There is no IEEE standard that mandates counting the input layer. The convention varies by context, but the dominant research/interview convention excludes the input layer from the count.","B":"","C":"Counting individual neurons as layers is incorrect. A \"layer\" is a set of neurons that process inputs in parallel and share the same position in the network topology — not individual units.","D":"The number of weight matrices equals the number of layers with parameters, and bias vectors are counted alongside their layer. This count is equivalent to option B's convention but is not expressed in standard terminology."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 6.1 (Example: Learning XOR)"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01012","difficulty":"medium","orderIndex":12,"question":"You have a dataset with two features (x₁, x₂) and a binary label. You visualize the data and see that the positive class forms a ring around the negative class (concentric circles). You train a perceptron on this data for 1000 epochs. What will you observe and why?","options":{"A":"The perceptron will converge to roughly 50% accuracy because the classes are balanced and it cannot separate them","B":"The perceptron will oscillate without converging because the data is not linearly separable — no straight line can enclose a ring around another class","C":"The perceptron will converge to approximately 75% accuracy because it can correctly classify 3 of the 4 quadrants","D":"The perceptron will converge slowly but eventually find a separating line once the learning rate decays sufficiently"},"correct":"B","explanation":{"correct":"- Concentric circles (the \"rings\" dataset) is a canonical example of a non-linearly separable problem. 
The positive class (ring) surrounds the negative class (center), so no hyperplane (a line in 2D) can divide them.\n- By the Perceptron Convergence Theorem, the algorithm is guaranteed to converge only for linearly separable data. For non-separable data, the update rule oscillates — it corrects misclassifications on one side only to re-misclassify others on the next pass.\n- This is why kernel methods (RBF kernel maps to infinite-dimensional feature space where circles become separable) and neural networks (learn a non-linear boundary) were developed.","A":"50% accuracy is possible but not guaranteed: a line through the center scores about 50% by symmetry, while a line offset from the center can score higher by putting the inner cluster and part of the ring on opposite sides. The defining behavior is non-convergence and oscillation, not a specific accuracy.","B":"","C":"The perceptron cannot be analyzed as correctly classifying \"quadrants\" — its boundary is a single hyperplane, not a quadrant decomposition. 75% is not a meaningful prediction for this geometry.","D":"Learning rate decay affects convergence speed for separable data but does not affect the fundamental impossibility of linear separation. The rings dataset remains not linearly separable regardless of learning rate schedule."},"reference":"- https://playground.tensorflow.org/ (rings dataset visualization)\n- Scikit-learn make_circles dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01013","difficulty":"hard","orderIndex":13,"question":"A team tries to solve a 4-class classification problem using perceptrons with step function outputs. They encode the 4 classes as binary pairs: (0,0), (0,1), (1,0), (1,1), and train one perceptron per bit position (two perceptrons total). Each perceptron achieves 90% accuracy on its binary subtask. The team concludes the combined system achieves 90% accuracy on the 4-class problem. 
What is the flaw in this reasoning?","options":{"A":"Two perceptrons cannot share inputs — they must use different feature subsets","B":"A sample is classified correctly only when both bits are right, and the two perceptrons can make errors on different samples, so the combined 4-class accuracy is generally lower than 90%: approximately 0.9 × 0.9 = 81% if errors are independent, and as low as 80% if the two error sets do not overlap at all","C":"Step function outputs cannot be combined; the team should use sigmoid activations to enable probability combination","D":"This architecture is equivalent to one perceptron with 8 outputs, which would achieve 81% accuracy due to class interference"},"correct":"B","explanation":{"correct":"- If each binary classifier makes errors on 10% of samples independently, a sample is correctly classified in 4-class space only if both binary classifiers are correct simultaneously. P(both correct) = 0.9 × 0.9 = 0.81 under independence.\n- The exact figure depends on how the errors overlap. If both classifiers fail on the same hard examples near decision boundaries, the combined accuracy approaches the 90% upper bound; if their error sets are disjoint, it falls to the 80% lower bound (100% − 10% − 10%).\n- This is a common mistake in multi-label and multi-class decomposition strategies: component errors compound, so the accuracy of one component is not the accuracy of the combined system.","A":"Perceptrons can absolutely share the same input feature vector. There is no architectural reason they must use different features. In fact, sharing features is standard in multi-output networks.","B":"","C":"Step functions can be combined via logical operations or majority vote. The issue is not the activation type but the compounding of errors. Sigmoid would not fix the 90% × 90% = 81% problem.","D":"A single perceptron with 8 outputs is a multi-output linear model. Its accuracy depends on the problem geometry. 
The error compounding calculation is specific to the two-independent-classifier setup, not to the number of outputs."},"reference":"- Multi-label classification error analysis: https://scikit-learn.org/stable/modules/multiclass.html"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01014","difficulty":"hard","orderIndex":14,"question":"Consider a neural network with 2 inputs, 2 hidden units (sigmoid), and 1 output (sigmoid). You manually set the weights to make the first hidden unit compute approximately AND(x₁, x₂) and the second compute approximately OR(x₁, x₂). The output unit is set to compute NOT(AND) AND OR, which is equivalent to XOR. A student argues this \"proves\" XOR is solvable but doesn't generalize because the weights were hand-crafted. What does this demonstration actually prove about neural networks?","options":{"A":"Nothing useful — hand-crafted weights don't count as learning","B":"It proves that a shallow neural network with non-linear activations has sufficient representational capacity to express XOR — the weights exist. Learning algorithms (backpropagation) are responsible for finding those weights automatically","C":"It proves that sigmoid activations are necessary for XOR — ReLU or tanh would fail in this configuration","D":"It proves XOR requires exactly 2 hidden units — fewer units cannot express the function"},"correct":"B","explanation":{"correct":"- Existence of a weight configuration that solves XOR proves the model has the representational capacity. Gradient-based learning is an algorithm for finding such weights — it is a search problem, not a capacity problem.\n- This separation between \"expressiveness\" (what can the model represent?) and \"learnability\" (can the optimizer find it?) is fundamental. 
The Universal Approximation Theorem proves existence of weights for any continuous function; backpropagation is the practical search algorithm.\n- Hand-crafted demonstrations are valid proofs of capacity. The reason we need learning algorithms is that for high-dimensional problems with millions of parameters, manual weight design is infeasible.","A":"Hand-crafted weights are a proof by construction. In mathematics, existence proofs by construction are the strongest form of existence proof. This absolutely \"counts.\"","B":"","C":"Sigmoid is used here for convenience (it approximates AND and OR with the right weights), but ReLU networks can also represent XOR and any other function that networks in general can represent. The activation choice affects the specific weight values, not the representational capacity.","D":"You can solve XOR with 2 hidden units (as demonstrated), but this does not prove it is the minimum. A single hidden unit with a quadratic transformation can also solve XOR. Minimum complexity is a separate research question."},"reference":"- Cybenko, G., \"Approximation by superpositions of a sigmoidal function\" (1989): the original Universal Approximation Theorem\n- http://neuralnetworksanddeeplearning.com/chap4.html"},{"section":"deep-learning","topicSlug":"introduction-to-neural-networks","topic":"Introduction To Neural Networks","id":"dl-01015","difficulty":"medium","orderIndex":15,"question":"You are onboarding a new team member who asks: \"If neural networks are just compositions of matrix multiplications and activation functions, why are they so powerful? 
Linear algebra is simple.\" What is the most technically complete answer that bridges the theory to practice?","options":{"A":"Neural networks are powerful because matrix multiplication is GPU-accelerated, enabling much larger models than older methods","B":"The power comes from the interaction of three properties: non-linear activations enabling universal approximation, depth allowing hierarchical feature composition, and the availability of gradient descent to search the exponentially large weight space efficiently","C":"Neural networks are powerful primarily because of the large amounts of data they are trained on — the architecture itself is not special","D":"The activation functions convert the linear operations into non-linear ones, which is equivalent to performing kernel regression in infinite-dimensional space for all practical purposes"},"correct":"B","explanation":{"correct":"- Non-linear activations alone (without depth) give universal approximation in theory but require exponentially many hidden units. Depth allows hierarchical composition (edges → shapes → objects in vision), which is exponentially more efficient for structured data.\n- Gradient descent with backpropagation navigates a loss surface with billions of parameters — a search problem that would be intractable with brute force but is made feasible by automatic differentiation and modern hardware.\n- The combination of all three — expressiveness, efficiency, and trainability — is what makes deep networks uniquely powerful. Each factor alone is insufficient.","A":"GPU acceleration is an implementation advantage, not a theoretical source of power. Neural networks were theoretically powerful before GPUs; GPUs made them practically scalable.","B":"","C":"Data is essential for generalization but does not explain why a neural network can learn better representations than a linear model given the same data. 
The architecture determines what can be represented.","D":"The neural tangent kernel (NTK) framework shows that infinitely wide networks are equivalent to kernel methods, but this is a limiting theoretical result. In practice, finite-width deep networks do not behave as kernel machines and often outperform them by learning adaptive representations."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapters 6–8: https://www.deeplearningbook.org/\n- LeCun, Bengio & Hinton, \"Deep learning\" (Nature 2015): https://www.nature.com/articles/nature14539"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02001","difficulty":"easy","orderIndex":1,"question":"A neural network layer computes z = Wx + b, where W is a 64×128 weight matrix, x is a 128-dimensional input, and b is a 64-dimensional bias. A new engineer adds an extra bias vector of shape (64,) after the activation and trains the model. He is surprised to find no improvement. What is the most likely reason?","options":{"A":"Bias terms must always be initialized to zero; adding a second bias with random initialization causes training instability","B":"Two additive bias terms on the same layer collapse into a single effective bias — the network cannot distinguish between the two, so no extra representational capacity is gained","C":"The second bias vector is outside the activation function, so it bypasses the non-linearity and breaks the gradient flow","D":"A 64-dimensional bias is too large; standard practice limits bias size to match the input dimension"},"correct":"B","explanation":{"correct":"- The layer computes: output = f(Wx + b₁) + b₂. Since b₁ and b₂ are both learned, the optimizer can achieve the same result by absorbing any value of b₂ into b₁ (before the activation), adjusted for the activation's effect. 
The second bias adds a parameter but not representational power.\n- More precisely, if the activation is linear, b₁ and b₂ collapse into one bias. With a non-linear activation a = f(Wx + b₁), b₂ shifts the layer's output, but when another layer follows, that shift is absorbed into the next layer's bias: W_next(a + b₂) + b_next = W_next·a + (W_next·b₂ + b_next), so no new functions become representable.\n- Adding redundant parameters increases memory and computation with no model capacity gain. This is a common mistake when engineers try to \"boost\" a layer without understanding what parameters do.","A":"Zero initialization of biases is standard (symmetry-breaking concerns apply to weights, not biases), and the second bias won't cause instability; it simply provides no benefit.","B":"","C":"The gradient flows correctly through addition. Placing a bias after an activation is valid mathematically and does backpropagate gradients — it just doesn't help.","D":"There is no standard that requires bias size to match input dimension. Bias size matches output dimension (64), which is already correct in this setup."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 6.2 (Gradient-Based Learning)"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02002","difficulty":"easy","orderIndex":2,"question":"In a multilayer perceptron, every unit in layer k is connected to every unit in layer k+1. A team decides to remove all connections between the first and third layer (no skip connections) and reports the network is \"equivalent\" to the original. 
A second team adds direct connections from input to output layer and says this is strictly \"more powerful.\" Which team is correct?","options":{"A":"First team is correct — removing non-adjacent connections doesn't change anything since gradients don't flow through skipped layers anyway","B":"Second team is correct — adding skip connections from input to output layer adds a new direct linear pathway, meaning the network can represent functions that the non-skip version cannot, specifically residual linear transformations of the input","C":"Both teams are correct — both architectures compute identical functions with different parameterizations","D":"Neither claim is correct — removing any connection changes the output and adding connections changes the architecture class entirely"},"correct":"B","explanation":{"correct":"- In a standard MLP, each layer's output is a transformed version of the previous layer only. Adding a direct input-to-output connection creates a pathway that computes output = f(deep_path(x)) + W_skip·x, allowing the network to represent functions that are \"a deep transformation plus a direct linear term.\"\n- This is the architectural insight behind ResNets: skip connections allow the network to easily learn identity functions (if the residual branch is zero, the skip dominates), which addresses vanishing gradients and enables very deep networks.\n- The first team is wrong because \"removing connections between non-adjacent layers\" is vacuously true in a standard MLP (those connections don't exist to begin with) — the claim is about removing existing adjacent connections, which would reduce capacity.","A":"In a standard MLP, there are no first-to-third-layer connections to remove. 
If they meant removing first-to-second connections, that would reduce representational capacity dramatically by disconnecting parts of the network.","B":"","C":"Skip connections create new computational pathways — the architectures are not equivalent in terms of representable functions, even with different parameterizations.","D":"The second team's claim is correct by the argument in the explanation. Adding skip connections does add representational power (a new linear pathway)."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (ResNet): https://arxiv.org/abs/1512.03385"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02003","difficulty":"easy","orderIndex":3,"question":"A neural network's hidden layer has 100 units, all initialized to the same weight vector w₀ and same bias b₀. The network trains for 100 epochs but all hidden units remain identical throughout training. Why does this happen even though the loss is non-zero and gradients are flowing?","options":{"A":"Identical initialization causes NaN gradients because the loss surface has a flat region at symmetric points","B":"Since all units receive the same input and compute the same output, backpropagation produces identical gradients for every unit — they receive the same update and remain permanently symmetric throughout training","C":"This is expected behavior; the network converges to a unique solution where all units specialize identically","D":"The optimizer averages gradients across units, cancelling out individual updates and preventing specialization"},"correct":"B","explanation":{"correct":"- This is the \"symmetry breaking\" problem. If all weights in a layer are identical, every unit computes the same pre-activation value z = w·x + b. Their outputs are identical, so the loss gradient with respect to each unit's weights is identical. 
Each unit receives the same gradient update, keeping them identical forever.\n- The result is a layer of 100 units that behaves identically to a single unit — massive parameter waste with no representational gain.\n- This is why weights are initialized randomly (Xavier/He initialization): to break symmetry so different units can specialize to different features during training.","A":"Identical initialization does not cause NaN gradients. The gradients are well-defined and finite — they are just identical across units, causing symmetric updates, not numerical failure.","B":"","C":"There is nothing desirable about identical units. The network may converge, but it learns a degenerate solution with far less capacity than intended. A 100-unit layer that behaves like a 1-unit layer wastes 99 of its 100 units.","D":"Backpropagation computes individual per-weight gradients, not averaged gradients. The identity of gradients is a consequence of identical forward-pass outputs, not optimizer averaging."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 8.4 (Parameter Initialization Strategies; covers symmetry breaking)"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02004","difficulty":"medium","orderIndex":4,"question":"A single hidden unit in a neural network computes: output = sigmoid(w₁·x₁ + w₂·x₂ + b). You are told the unit has learned w₁ = 3.0, w₂ = -3.0, b = 0. Without running any code, predict: for input (1, 0), what does this unit detect, and what happens to its output as you scale the input to (100, 0)?","options":{"A":"The unit outputs 0.95 for (1,0) and approaches 1.0 for (100, 0), meaning it is a feature detector that saturates — strong evidence of feature x₁ being present causes the sigmoid to \"clamp\" at 1","B":"The unit outputs sigmoid(3) ≈ 0.95 for (1,0) and sigmoid(300) ≈ 1.0 for (100,0). 
The unit detects \"x₁ > x₂\" (since w₁ = −w₂) but saturates — large inputs collapse the gradient to near zero, which is the vanishing gradient problem at the activation level","C":"The unit outputs 0.5 for both inputs because the bias is 0, which forces the sigmoid to its center value","D":"Scaling the input has no effect because the sigmoid output is bounded between 0 and 1 regardless of input magnitude"},"correct":"B","explanation":{"correct":"- For (1,0): z = 3·1 + (−3)·0 + 0 = 3, sigmoid(3) ≈ 0.9526. For (100,0): z = 300, sigmoid(300) ≈ 1.0 (to machine precision).\n- The weight pattern w₁ = 3, w₂ = −3 means the unit activates when x₁ >> x₂ (it computes a difference detector). The bias of 0 centers the threshold at x₁ = x₂.\n- The critical insight: when z is large (300), sigmoid'(z) = sigmoid(z)(1−sigmoid(z)) ≈ 1·0 = 0. The gradient is effectively zero, so this unit contributes nothing to weight updates for large-magnitude inputs — the vanishing gradient problem.","A":"Partially correct (saturation is real), but misses the crucial production implication: vanishing gradients mean this unit stops learning once inputs are large. This is the core reason ReLU replaced sigmoid for hidden layers.","B":"","C":"The bias is 0, but the output for (1,0) is sigmoid(3) ≈ 0.95, not 0.5. Sigmoid outputs 0.5 only when z = 0. For (1,0), z = 3, not 0.","D":"Scaling the input does affect the output — it changes z which changes the sigmoid output. The output is bounded between 0 and 1, but the specific value changes with input magnitude."},"reference":"- Hochreiter, \"The vanishing gradient problem during learning recurrent neural nets\" (1998)\n- https://cs231n.github.io/neural-networks-1/"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02005","difficulty":"medium","orderIndex":5,"question":"You have a trained MLP classifier. 
During inference on a new input, you notice that 80% of the hidden units in the first layer output values very close to 0. A teammate says this is a sign the model is \"broken\" and suggests retraining with a larger network. Is the teammate correct?","options":{"A":"Yes — 80% zero activations means 80% of the network's capacity is wasted, and a larger network would use more capacity","B":"No — sparse activation is often a sign of a well-trained network. If the model uses ReLU, dead units on specific inputs means those features are irrelevant to the input; this is feature selectivity, not a bug","C":"Yes — all hidden units should have roughly equal activation magnitudes for the network to be efficient","D":"No — 80% zero activations means the model has overfit and is memorizing training data by deactivating most units"},"correct":"B","explanation":{"correct":"- Sparse activations in ReLU networks are a feature, not a bug. A unit outputting 0 for a given input means that input doesn't trigger the feature that unit represents. Different inputs activate different subsets of units — this is the network's learned feature selectivity.\n- This is analogous to sparse coding in neuroscience (Olshausen & Field, 1996), where most neurons are silent for any given stimulus. Sparse representations are more interpretable, energy-efficient, and often generalize better.\n- If 80% of units are always 0 regardless of any input (dead ReLU), that is a different problem. But 80% zeros for specific inputs is expected and desirable.","A":"Capacity is not measured by activation counts. A unit that is 0 for one input may be active for other inputs and contribute meaningfully to those predictions. \"Capacity\" in neural networks is about expressiveness over the distribution of inputs, not per-sample activation density.","B":"","C":"Uniform activation magnitudes would imply every unit is equally relevant to every input — this contradicts the idea of feature specialization. 
Uniform activations are more characteristic of poorly trained or random networks.","D":"Overfitting manifests as poor generalization (large train/test gap), not as sparse activations. A model can be sparse and well-generalized, or dense and overfit."},"reference":"- Olshausen & Field, \"Sparse coding with an overcomplete basis set\" (1997)\n- https://cs231n.github.io/neural-networks-1/#actfun"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02006","difficulty":"medium","orderIndex":6,"question":"A network has 3 hidden layers with widths [512, 256, 128]. You double the width of the first layer to 1024. Your colleague claims this \"doubles the network's capacity.\" A senior researcher disagrees. What is the most accurate statement about what actually changes?","options":{"A":"The colleague is correct — capacity scales linearly with the number of parameters in the first layer","B":"Doubling the first layer width doubles the parameters in both the first weight matrix (input → layer 1) and the second matrix (layer 1 → layer 2), but \"capacity\" in the meaningful sense (ability to represent complex decision boundaries) grows sub-linearly and depends on the interaction with depth and non-linearities","C":"Doubling width has no effect because the bottleneck at 128 units in the final hidden layer limits total capacity","D":"Doubling the first layer width doubles the network's VC dimension exactly"},"correct":"B","explanation":{"correct":"- If the input has dimension d and first layer has n₁ units, the first weight matrix is n₁×d, so doubling n₁ doubles this matrix's parameter count. The second weight matrix n₂×n₁ also doubles. Total extra parameters: O(d·n₁ + n₁·n₂).\n- However, \"capacity\" in the sense of the VC dimension or Rademacher complexity depends non-linearly on width, depth, and their interaction. 
Empirically, wider networks tend to improve performance but with diminishing returns.\n- The bottleneck argument (option C) has some validity — the narrowest layer constrains information flow — but capacity is not purely determined by the narrowest layer.","A":"Capacity does not scale linearly with parameter count. Two networks with the same parameter count but different architectures can have very different effective capacities. VC dimension for neural networks scales roughly as O(W log W) where W is weight count, not O(W).","B":"","C":"The bottleneck layer does constrain the network (it's why autoencoders use narrow bottlenecks), but making earlier layers wider still increases the representational richness of intermediate representations, which can improve performance even with the same bottleneck size.","D":"VC dimension for neural networks does not scale exactly with width in a simple linear fashion. Exact VC dimension computations for MLPs are complex and depend on the activation function, depth, and connectivity."},"reference":"- Bartlett et al., \"Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks\" (2019): https://arxiv.org/abs/1703.02930"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02007","difficulty":"medium","orderIndex":7,"question":"You build a multi-layer perceptron for regression (predicting house prices). The output layer has a single unit with no activation function (linear output). 
A colleague says you should add a ReLU activation to the output unit \"because house prices can't be negative.\" Should you follow this advice?","codeSnippet":"import torch.nn as nn\n\n# Current output layer\noutput = nn.Linear(64, 1)  # no activation\n\n# Proposed change\noutput = nn.Sequential(nn.Linear(64, 1), nn.ReLU())","options":{"A":"Yes — ReLU on the output ensures non-negative predictions and is always a good practice for price prediction","B":"No — adding ReLU to the output layer constrains predictions to non-negative values but also kills gradients for any training sample where the pre-activation value is negative, preventing the model from learning from those examples","C":"Yes — ReLU is differentiable everywhere except 0, so it has no meaningful impact on training while ensuring valid predictions","D":"No — the output should use softmax instead of ReLU for regression tasks"},"correct":"B","explanation":{"correct":"- If the model produces a negative pre-activation value for some training samples, ReLU clips the output to 0, so those samples contribute zero gradient to the network's parameters (ReLU gradient is 0 for negative inputs). The model literally cannot learn from those examples.\n- Early in training, many pre-activation values will be negative (random initialization spreads around 0). Adding output ReLU causes a significant fraction of training samples to have zero gradient — effectively \"dead\" output units for those inputs.\n- Better alternatives: (1) use no activation and let L2/Huber loss implicitly penalize negative predictions relative to ground truth, (2) use Softplus (smooth approximation to ReLU) which has non-zero gradients everywhere, or (3) apply log transformation to house prices and predict in log space.","A":"Domain constraint is a valid motivation, but the implementation using ReLU is harmful. The domain constraint must be balanced against trainability. 
Dead gradients on output units prevent learning.","B":"","C":"ReLU is not differentiable at 0 (undefined, or defined as 0 by convention). More importantly, it is 0 everywhere for x < 0, which means zero gradient — a very meaningful impact on training.","D":"Softmax is for classification (multi-class probability distributions summing to 1), not regression. It is completely wrong for a single continuous output."},"reference":"- https://cs231n.github.io/neural-networks-2/#reg (output activation choices)"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02008","difficulty":"hard","orderIndex":8,"question":"A network has two fully connected layers: Layer 1 computes h = ReLU(W₁x + b₁) and Layer 2 computes output = W₂h + b₂. You freeze Layer 1 (stop its gradients) and train only Layer 2. Your manager claims: \"Freezing Layer 1 is equivalent to reducing the problem to linear regression on fixed features.\" Is this claim correct?","options":{"A":"Yes — if Layer 1 is frozen, the output is a linear function of the fixed hidden representation h, which is the definition of linear regression","B":"Partially correct — the output is linear in h (the frozen layer's output), but h = ReLU(W₁x + b₁) is a non-linear function of x. The problem is linear in h but non-linear in the original input x — it is equivalent to kernel regression with a fixed non-linear feature map","C":"No — freezing Layer 1 still allows non-linear interactions because the optimizer can adjust the bias b₂ to create thresholding effects","D":"Yes, but only if the batch size is 1; for larger batches, the matrix operations become non-linear"},"correct":"B","explanation":{"correct":"- W₂h + b₂ is indeed linear in h (Layer 2 is a linear function of its inputs). If h is fixed (frozen Layer 1), training Layer 2 is exactly linear regression where h is the feature vector.\n- However, h = ReLU(W₁x + b₁) is a non-linear function of the original input x. 
So the end-to-end function output = W₂·ReLU(W₁x + b₁) + b₂ is non-linear in x.\n- This is the foundation of transfer learning and feature extraction: freeze a pre-trained backbone (non-linear feature extractor), train only the linear head. You get the expressive features of the deep network while the training problem is simplified to convex linear regression.","A":"\"Linear regression on fixed features\" is partially correct but misses the crucial point that the features themselves are non-linear transformations of the input. Pure linear regression operates on the raw input; this operates on a non-linear embedding.","B":"","C":"Bias b₂ is a single vector — adjusting it shifts the output uniformly but does not create element-wise thresholding. A linear layer with learnable bias is still a linear (affine) function of its input h.","D":"Batch size has no effect on the functional form of a neural network layer. The same linear transformation applies to each sample in the batch independently. Non-linearity does not emerge from batching."},"reference":"- Transfer learning and linear probe evaluation: https://arxiv.org/abs/2002.05709"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02009","difficulty":"hard","orderIndex":9,"question":"Consider a fully connected layer with 1000 input units and 1000 output units. The weight matrix W is 1000×1000. A researcher proposes replacing W with a low-rank factorization: W ≈ AB where A is 1000×r and B is r×1000, with r = 10. The forward pass becomes: output = (AB)x = A(Bx). 
What is the exact parameter reduction, and what capability does the network lose?","options":{"A":"Parameters drop from 10⁶ to 20,000 (98% reduction); the network loses the ability to express high-frequency input patterns","B":"Parameters drop from 10⁶ to 1000·r + r·1000 = 2·1000·10 = 20,000 (98% reduction); the network loses the ability to represent any linear transformation whose rank exceeds r=10 — specifically, any output that requires more than 10 independent directions in input space","C":"Parameters drop from 10⁶ to 10,000; the network loses skip connections between non-adjacent layers","D":"Parameters drop from 10⁶ to 20,000; the network loses non-linearity because the product of two matrices is always linear"},"correct":"B","explanation":{"correct":"- Original: 1000×1000 = 1,000,000 parameters. Factored: 1000×10 + 10×1000 = 10,000 + 10,000 = 20,000 parameters. Reduction: 98%.\n- The product AB has rank at most r=10. This means the transformation can only map inputs to an at most 10-dimensional subspace of the output space. Any output pattern requiring more than 10 independent \"basis directions\" cannot be represented.\n- Low-rank factorization is widely used in model compression and in parameter-efficient fine-tuning (LoRA, low-rank adapters for LLMs). Compression relies on many trained weight matrices being approximately low-rank, while LoRA relies on the weight update learned during fine-tuning having low intrinsic rank.","A":"\"High-frequency input patterns\" is not a well-defined loss for a linear transformation. The constraint is rank (number of independent directions), not frequency. Frequency is a concept for convolutional/signal processing contexts.","B":"","C":"1000·10 + 10·1000 = 20,000, not 10,000. The calculation in C is off by 2x. Skip connections are an architectural choice unrelated to rank factorization.","D":"The product of two matrices AB is indeed a matrix (linear transformation), but the full weight matrix W is also linear. Low-rank factorization does not reduce linearity — the transformation was already linear. 
The concern is rank, not linearity."},"reference":"- LoRA: Low-Rank Adaptation of Large Language Models: https://arxiv.org/abs/2106.09685\n- Hu et al., LoRA paper explains exactly this parameter reduction mechanism"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02010","difficulty":"hard","orderIndex":10,"question":"You train two networks on the same dataset: Network A has 3 layers of width 100 (300 total units), and Network B has 1 layer of width 300 (300 total units). Both use ReLU and identical training procedures. Network A significantly outperforms Network B. An interviewer asks you to explain exactly why depth helps here beyond just \"more layers = more power.\"","options":{"A":"Network A has more parameters because it has more weight matrices, which directly causes better performance","B":"Depth allows hierarchical composition of simple functions: each layer can detect increasingly abstract features by composing the outputs of previous layers. A 3-layer network can represent functions of functions of features, while a 1-layer network requires representing the full pattern directly — for structured data, this hierarchy is exponentially more efficient","C":"Network A benefits from more gradient steps per layer during backpropagation, which improves optimization","D":"Deeper networks have higher variance, which in the bias-variance tradeoff means better fit to complex training distributions"},"correct":"B","explanation":{"correct":"- The exponential efficiency of depth (Bengio & LeCun, 2007; Telgarsky, 2016) is mathematically established: certain functions that require exponentially many neurons to represent in a shallow network can be represented with polynomially many neurons in a deep network.\n- Concretely for vision: Layer 1 detects edges, Layer 2 composes edges into shapes, Layer 3 composes shapes into objects. 
A single-layer network must represent object detection directly from pixels — requiring far more neurons to carve out the same decision boundaries.\n- The key phrase is \"for structured data with compositional structure.\" If the data has no hierarchical structure, depth may not help significantly.","A":"Network A does not necessarily have more total parameters than B. Width 100 with 3 layers: W₁ is input×100, W₂ is 100×100, W₃ is 100×output. Network B: W₁ is input×300, W₂ is 300×output. For large inputs, B may have more parameters in W₁. Parameter count alone doesn't explain the performance gap.","B":"","C":"Backpropagation does not give each layer more gradient steps — all layers are updated in a single backward pass. \"More gradient steps per layer\" is a misunderstanding of how backprop works.","D":"Higher variance from depth does not automatically improve fit. Deeper networks are both higher variance and higher capacity, but uncontrolled variance leads to overfitting, not better performance. The advantage of depth is efficiency of representation, not variance."},"reference":"- Bengio & LeCun, \"Scaling algorithms towards AI\" (2007)\n- Telgarsky, \"Benefits of depth in neural networks\" (2016): https://arxiv.org/abs/1602.04485"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02011","difficulty":"medium","orderIndex":11,"question":"A network's weight matrix W for one layer has been trained and you visualize its rows (each row represents the weights going INTO one output unit). You see that many rows are nearly identical (high cosine similarity between rows). What does this imply about the network?","options":{"A":"The layer is well-trained — identical weights mean the units have converged to a stable solution","B":"The layer likely has redundant units — multiple neurons are detecting the same feature in the input, which wastes capacity. 
This can happen due to poor initialization, insufficient regularization, or the network being overparameterized for the task","C":"This is a sign of overfitting — identical weights in a layer mean the model has memorized training data","D":"Identical rows are expected because weight sharing is required for neural networks to generalize"},"correct":"B","explanation":{"correct":"- Each row of W represents the \"feature detector\" of one output neuron. If many rows are nearly identical, many neurons are detecting the same pattern, providing no additional information.\n- This indicates either: (1) the layer has more units than needed for the task (overparameterization), (2) symmetry breaking failed despite random initialization (rare but possible with very small weights), or (3) regularization is insufficient to push units toward diverse representations.\n- In practice, this is detected via the \"effective rank\" of W. A low effective rank (most singular values near zero) means the layer is not using its full representational capacity.","A":"Convergence to a stable solution should produce diverse weights (different feature detectors). Identical rows are a degenerate convergence, not a good one. A well-trained layer typically shows varied, diverse rows.","B":"","C":"Overfitting manifests as poor generalization (large train/test gap), not identical weights. Memorization of training data would typically produce highly varied weights keyed to specific training examples, not identical rows.","D":"Weight sharing is a specific architectural choice (e.g., convolutional layers share weights spatially). In a fully connected layer, weight sharing is not expected or required. 
Identical rows are not \"sharing\" — they're redundancy."},"reference":"- Frankle & Carbin, \"The Lottery Ticket Hypothesis\" (2019): https://arxiv.org/abs/1803.03635"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02012","difficulty":"easy","orderIndex":12,"question":"A perceptron computes: output = 1 if (w₁x₁ + w₂x₂ + b) ≥ 0, else 0. You set w₁ = 1, w₂ = 1, b = -1.5. Evaluate the outputs for all inputs in {0,1}². What logical gate does this perceptron implement?","options":{"A":"OR gate — outputs 1 whenever at least one input is 1","B":"AND gate — outputs 1 only when both inputs are 1, because the bias of -1.5 requires both x₁ and x₂ to be active simultaneously","C":"NAND gate — outputs 0 only when both inputs are 1","D":"XOR gate — outputs 1 when inputs differ"},"correct":"B","explanation":{"correct":"- (0,0): 0+0-1.5 = -1.5 < 0 → output 0. (0,1): 0+1-1.5 = -0.5 < 0 → output 0. (1,0): 1+0-1.5 = -0.5 < 0 → output 0. (1,1): 1+1-1.5 = 0.5 ≥ 0 → output 1.\n- Only (1,1) → 1, which is exactly the AND function. The bias -1.5 requires the sum w₁x₁ + w₂x₂ ≥ 1.5, which is only satisfied when both inputs are 1 (sum = 2).\n- This demonstrates that logical gates are representable as perceptrons and that the bias term controls the threshold — b = -0.5 would give OR, b = -1.5 gives AND. The same weights, different bias = different gate.","A":"OR gate requires the sum ≥ 1, which needs b = -0.5 (not -1.5). With b = -0.5: (0,1) → 0.5 ≥ 0 → 1 ✓, (1,0) → 0.5 ≥ 0 → 1 ✓, (1,1) → 1.5 ≥ 0 → 1 ✓.","B":"","C":"NAND outputs 0 only for (1,1) and 1 otherwise — the inverse of AND. 
This requires different weights or a negated threshold structure.","D":"XOR outputs 1 for (0,1) and (1,0) only — which is not linearly separable and cannot be represented by any single perceptron with fixed weights and bias."},"reference":"- http://neuralnetworksanddeeplearning.com/chap1.html#perceptrons"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02013","difficulty":"medium","orderIndex":13,"question":"You are training an MLP on a tabular dataset with 50 features. You add a hidden layer with 1000 units and observe strong training accuracy but poor validation accuracy. You then reduce the hidden layer to 10 units and observe poor training accuracy and poor validation accuracy. What does this tell you about network depth/width intuition for tabular data?","options":{"A":"Tabular data always requires very deep networks; the problem is insufficient depth, not width","B":"1000 units overfit (high variance), 10 units underfit (high bias) — the optimal width for this problem is somewhere between 10 and 1000, and the right size depends on the complexity of the underlying data pattern relative to the feature space","C":"The poor validation with 1000 units proves the training data is corrupted; no amount of tuning will help","D":"Tabular data is incompatible with fully connected layers; convolutional layers should be used instead"},"correct":"B","explanation":{"correct":"- Classic bias-variance tradeoff: too many parameters relative to data complexity leads to memorization (overfitting = high variance); too few parameters leads to inability to capture patterns (underfitting = high bias).\n- For tabular data with 50 features, the right width depends on: how many meaningful non-linear interactions exist, how many training samples are available, and what regularization is applied.\n- In practice, tabular data often performs well with relatively modest network sizes (128-512 units per layer) combined with dropout 
and weight decay. Blindly increasing width doesn't help without regularization.","A":"Deeper networks don't automatically solve overfitting from wide layers. Adding more layers to an already overparameterized network typically increases overfitting further. Depth and width both affect capacity — depth is not a cure for width-induced overfitting.","B":"","C":"Overfitting (good train, bad validation) is a normal consequence of having more model capacity than data complexity warrants. It does not imply data corruption — which would manifest as poor training accuracy or high noise.","D":"Fully connected layers are absolutely valid for tabular data. CNNs are designed for grid-structured data (images, sequences). Tabular data lacks spatial locality, making CNNs inappropriate."},"reference":"- Shwartz-Ziv & Armon, \"Tabular data: deep learning is not all you need\" (2022): https://arxiv.org/abs/2106.03253"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02014","difficulty":"hard","orderIndex":14,"question":"A network's weight matrix W has been trained on task A. You want to transfer it to task B by fine-tuning only the last layer. After fine-tuning, you compute the gradient magnitude of the last layer's weights vs. the frozen earlier layers (which have zero gradient by design). An engineer proposes measuring the \"network depth utilization\" as the ratio of active (non-frozen) parameters to total parameters, and says networks with low utilization are \"underusing their depth.\" What is wrong with this metric?","options":{"A":"Nothing — depth utilization is a valid and widely used metric in transfer learning research","B":"The metric conflates parameter count with representational contribution. 
A frozen layer with rich, general features contributes heavily to the output even with zero gradient — measuring utilization by gradient flow ignores that frozen layers still perform computation and determine what features are available to the trainable head","C":"The metric is valid but should count neurons, not parameters, to normalize for layer width differences","D":"Gradient magnitude in the last layer should be normalized by the number of samples in the dataset, not compared to frozen layers"},"correct":"B","explanation":{"correct":"- Transfer learning's entire value proposition is that frozen layers provide learned features — even though their parameters don't update, their forward-pass computation is the core of what makes transfer learning work. The frozen ResNet-50 backbone extracts rich visual features; only the final linear head is trained.\n- \"Depth utilization\" as gradient-fraction creates a perverse incentive: it would rate a randomly initialized network with no frozen layers as 100% utilized, and a perfectly pretrained network with a fine-tuned head as poorly utilized.\n- Meaningful transfer learning metrics include: (a) linear probe accuracy (how good are frozen features?), (b) fine-tuning efficiency (how few samples are needed?), and (c) feature alignment between source and target domain.","A":"\"Depth utilization\" as defined (gradient-active vs total parameters) is not a standard metric in transfer learning research. The concept sounds reasonable but is fundamentally flawed as argued.","B":"","C":"Counting neurons vs parameters doesn't fix the fundamental problem: frozen neurons still compute and contribute to the output. 
The issue is the meaning of \"utilization,\" not the normalization.","D":"Normalizing by dataset size is relevant for gradient scaling analysis, but the core issue here is the conceptual flaw in equating gradient flow with contribution."},"reference":"- Kumar et al., \"Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution\" (2022): https://arxiv.org/abs/2202.10054"},{"section":"deep-learning","topicSlug":"neurons-and-perceptrons","topic":"Neurons And Perceptrons","id":"dl-02015","difficulty":"hard","orderIndex":15,"question":"You're debugging a wide MLP (2000 hidden units per layer, 5 layers) that shows a puzzling behavior: training loss decreases but at roughly 1/10 the rate of a narrower network (200 units, same depth) on the same task. Both networks use the same learning rate and batch size. Without profiling, what is the most likely cause and how should it be investigated?","options":{"A":"The wide network has 100x more parameters so requires 100x more epochs to converge at the same learning rate — this is expected and not a bug","B":"The effective learning rate per parameter is too small for the wide network's loss landscape; with more parameters, the gradient signal is \"diluted\" — but the real likely cause is the gradient magnitude scaling issue: wider layers produce larger activations which can cause gradients to scale differently, requiring learning rate tuning proportional to width","C":"The wide network is computing unnecessarily — 2000 units exceed the intrinsic dimensionality of the task, so most units deactivate and gradients vanish","D":"The 5-layer depth causes vanishing gradients in both networks equally; the width difference is irrelevant"},"correct":"B","explanation":{"correct":"- In wide networks, the variance of pre-activations scales with fan-in (number of input connections). 
Without proper initialization (e.g., He initialization scales weights by √(2/fan-in) for ReLU), activations can explode, causing gradient instability and slow convergence.\n- Additionally, the best learning rate generally depends on layer width (in some parameterizations it scales inversely with a power of the width), so a learning rate tuned for width-200 layers is unlikely to be well-scaled for width-2000 layers.\n- Investigation: (1) plot activation norms per layer to detect scaling issues, (2) check gradient norms per layer to find vanishing/exploding gradients, (3) try μP (maximal update parameterization) which enables learning rate transfer across widths.","A":"The number of epochs to converge doesn't scale linearly with parameter count. With the same batch size, both networks take the same number of gradient steps per epoch. \"Needs 100x more epochs\" is empirically false for well-initialized networks.","B":"","C":"Unit deactivation (dead ReLU) would cause near-zero gradients only for those units — other units would still train normally. 2000 units doesn't inherently cause mass deactivation unless initialization or learning rate is wrong.","D":"Vanishing gradients from depth would affect both networks similarly if they have the same depth. The width difference is the relevant factor for the described behavior."},"reference":"- Yang & Hu, \"Feature Learning in Infinite-Width Neural Networks\" (μP): https://arxiv.org/abs/2011.14522\n- He et al., \"Delving Deep into Rectifiers\" (He initialization): https://arxiv.org/abs/1502.01852"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03001","difficulty":"easy","orderIndex":1,"question":"A sigmoid activation outputs values in (0,1). You use it in the hidden layers of a deep network with 10 layers. During training you observe that gradients in the first 3 layers are approximately 10⁻⁶ while gradients in the last 3 layers are approximately 0.1. 
What causes this disparity and what is the standard fix?","options":{"A":"The first layers receive less data during backpropagation because batches are processed sequentially; fix by increasing batch size","B":"Sigmoid's derivative σ'(z) = σ(z)(1−σ(z)) has a maximum of 0.25 at z=0 and approaches 0 for large |z|. In a 10-layer network, multiplying 10 such terms produces gradients on the order of 0.25¹⁰ ≈ 10⁻⁶ — the vanishing gradient problem. Standard fix: replace sigmoid in hidden layers with ReLU, whose derivative is 1 for positive inputs","C":"The first layers are closer to the random initialization and haven't received enough gradient signal; fix by training longer","D":"Deep networks always have small gradients in early layers; this is expected and does not affect training"},"correct":"B","explanation":{"correct":"- The chain rule multiplies Jacobians across layers. Each sigmoid layer contributes a factor of at most 0.25. After 10 layers: 0.25^10 ≈ 9.5×10⁻⁷, matching the observed 10⁻⁶ magnitude.\n- ReLU's derivative is exactly 1 for positive inputs, meaning gradients pass through ReLU layers without attenuation (for the active units). This is why ReLU effectively solved the vanishing gradient problem for deep feedforward networks.\n- The vanishing gradient problem is one of the primary historical reasons deep networks were difficult to train before 2010 (before ReLU and BatchNorm were standardized).","A":"Backpropagation processes the entire batch uniformly. Batch size affects gradient noise/stability, not the systematic decay of gradient magnitude across layers.","B":"","C":"Training longer doesn't fix vanishing gradients. The small gradients mean early-layer weights update negligibly per step — more steps on near-zero gradients still converge extremely slowly or not at all.","D":"Small gradients in early layers are NOT expected or acceptable — they are the symptom of the vanishing gradient problem. 
Networks with vanishing gradients effectively don't train their early layers, wasting depth."},"reference":"- Glorot & Bengio, \"Understanding the difficulty of training deep feedforward networks\" (2010): https://proceedings.mlr.press/v9/glorot10a.html\n- https://cs231n.github.io/neural-networks-1/#actfun"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03002","difficulty":"easy","orderIndex":2,"question":"You replace all ReLU activations in a trained model with tanh activations and retrain from scratch. Training is significantly slower and final accuracy is lower. What is the most likely technical cause for both effects?","options":{"A":"tanh outputs are in (-1, 1) instead of (0, ∞) for ReLU, making gradients negative which confuses the optimizer","B":"tanh saturates for |z| > 2 (derivative → 0) causing vanishing gradients in deeper layers, while ReLU has a derivative of 1 for all positive inputs, enabling stable gradient flow in deep networks","C":"tanh requires complex number arithmetic which is slower on GPU hardware than the max(0, x) operation of ReLU","D":"tanh activations produce zero-centered outputs which cause weight update interference between neurons in the same layer"},"correct":"B","explanation":{"correct":"- tanh'(z) = 1 - tanh²(z), which approaches 0 as |z| → ∞. For large pre-activation values (common after a few training steps), tanh saturates and gradients vanish.\n- ReLU's derivative is exactly 1 for z > 0, meaning gradients pass through without scaling down. In deep networks (10+ layers), this difference is dramatic: tanh compounds to near-zero gradients, ReLU maintains stable gradient magnitude.\n- Additionally, ReLU is computationally cheaper (max(0,x) vs exponentials in tanh), which partially explains the speed difference.","A":"Negative gradients don't \"confuse\" optimizers. 
Gradient descent operates on the sign and magnitude of gradients — negative gradients are completely valid and expected for parameters that need to decrease.","B":"","C":"tanh uses exponentials (e^z), not complex number arithmetic. Modern hardware handles this efficiently. The performance difference between tanh and ReLU is real but due to computational complexity (exp vs max), not complex numbers.","D":"Zero-centered outputs are actually a desirable property of tanh (sigmoid's outputs are not zero-centered, which is a disadvantage). Zero-centered activations reduce update \"zig-zagging\" effects. This is not the cause of slower training."},"reference":"- LeCun et al., \"Efficient BackProp\" (1998): recommends tanh over sigmoid but ReLU superseded both\n- Nair & Hinton, \"Rectified Linear Units Improve Restricted Boltzmann Machines\" (2010)"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03003","difficulty":"easy","orderIndex":3,"question":"A team initializes all weights in a ReLU network to small positive values near zero. After one epoch, they notice that 60% of neurons permanently output 0 and never recover, even after 100 more epochs. What is this phenomenon and what caused it here?","options":{"A":"Dead ReLU problem — caused by large negative pre-activations causing ReLU to output 0 with zero gradient. Here it was triggered by poor weight initialization producing many negative pre-activations from the start","B":"Gradient explosion — small initial weights cause gradients to grow exponentially backward through the network","C":"Overfitting — the neurons are deactivating to memorize specific training samples","D":"Mode collapse — the ReLU neurons collapse to a single output mode which outputs 0 for all inputs"},"correct":"A","explanation":{"correct":"- ReLU(z) = max(0, z) and its gradient is 0 when z < 0. 
A \"dead\" neuron is one where z < 0 for all inputs in the dataset — it outputs 0 always and receives gradient 0 always, so its weights never update.\n- Near-zero initialization with many features can produce z = Wx + b ≈ 0 initially, but a few bad samples or unlucky updates can push z < 0. Once dead, that neuron stays dead.\n- Fix: He initialization (scales weights by √(2/fan-in)), Leaky ReLU (gradient = α < 1 for negative inputs instead of 0), or PReLU (learnable negative slope). ELU also has negative outputs, preventing dead neurons.","A":"","B":"Small initial weights produce small activations, which produce small gradients — the opposite of explosion. Gradient explosion occurs with large weights, not small ones.","C":"Neuron deactivation is not memorization. Memorization would require neurons to be selectively active for specific training patterns, not permanently off.","D":"Mode collapse is a GAN training problem where the generator produces limited variety. It is not applicable to individual neuron behavior in a supervised learning MLP."},"reference":"- He et al., \"Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet\" (2015): https://arxiv.org/abs/1502.01852\n- Maas et al., \"Rectifier Nonlinearities Improve Neural Network Acoustic Models\" (Leaky ReLU)"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03004","difficulty":"medium","orderIndex":4,"question":"You train a model with ReLU activations and achieve good performance. A colleague switches the activations to Leaky ReLU (α=0.01) for the hidden layers, claiming it is \"strictly better.\" After retraining, the model performs identically. Your colleague insists there must be a bug. 
What is the most accurate explanation?","options":{"A":"Leaky ReLU is always strictly better than ReLU; the identical performance confirms a bug in the implementation","B":"Leaky ReLU's advantage (non-zero gradient for negative inputs) only matters when neurons are actually dying (stuck at z<0). If the original ReLU network had few or no dead neurons, Leaky ReLU provides no benefit — both activations are identical for z>0 and the negative-slope advantage never activates","C":"Leaky ReLU and ReLU are mathematically identical because the leaky term (0.01x) is too small to affect training","D":"The dataset is too small for Leaky ReLU's advantages to manifest; it requires 100,000+ samples to show improvement"},"correct":"B","explanation":{"correct":"- Leaky ReLU with α=0.01 computes: max(0.01z, z). For z > 0, this is identical to ReLU. The difference only appears for z < 0, where ReLU gives 0 (zero gradient) and Leaky ReLU gives 0.01z (non-zero gradient).\n- If the original network had no dead neurons (all activations mostly positive for training data), the two activations are functionally equivalent on that dataset, and identical performance is the correct expected result.\n- The lesson: architectural improvements that address specific failure modes (like dead neurons) only show benefits when that failure mode is actually occurring. ReLU networks on well-initialized problems often have <5% dead neurons, making Leaky ReLU's advantage marginal.","A":"\"Strictly better\" in theory doesn't mean \"strictly better on every problem.\" Leaky ReLU is strictly better at addressing dead neurons, but if no neurons are dying, the advantage is zero.","B":"","C":"0.01x for negative inputs is not \"too small to affect training\" — if neurons were dying, even a 0.01 gradient would be infinitely better than a 0 gradient. The magnitude matters only when the feature is relevant.","D":"The improvement from Leaky ReLU is not sample-size dependent. 
It depends on whether dead neurons are present. You can have dead neurons with 1 million samples and no dead neurons with 100 samples."},"reference":"- https://cs231n.github.io/neural-networks-1/#actfun (comparison of activations)"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03005","difficulty":"medium","orderIndex":5,"question":"GELU (Gaussian Error Linear Unit) is defined as: GELU(x) = x · Φ(x), where Φ is the standard normal CDF. Unlike ReLU which makes a hard 0/1 decision at x=0, GELU is used in Transformers (BERT, GPT) instead of ReLU. What property of GELU makes it preferable for Transformer-based architectures specifically?","options":{"A":"GELU is faster to compute because it avoids the max() operation in ReLU","B":"GELU is smooth (infinitely differentiable) and stochastically gates inputs — it smoothly interpolates between \"pass input\" and \"gate to zero\" based on the input's magnitude relative to other inputs. This smooth gating is empirically better for the attention + MLP structure in Transformers","C":"GELU outputs values in (0,1), making it compatible with the softmax in the attention mechanism","D":"GELU was designed specifically for pre-LayerNorm Transformers and has no advantage over ReLU in post-LayerNorm architectures"},"correct":"B","explanation":{"correct":"- GELU(x) = x · Φ(x) can be interpreted as: multiply the input by its probability of being greater than a Gaussian sample. For large positive x: Φ(x)→1, so GELU(x)≈x. For large negative x: Φ(x)→0, so GELU(x)≈0. Near 0: smooth interpolation.\n- This smooth, stochastic gating behavior means GELU doesn't make hard cutoff decisions like ReLU. 
In deep Transformer architectures where activations are distributed roughly normally (due to LayerNorm before each sublayer), GELU's Gaussian-parameterized gating matches the activation distribution naturally.\n- Empirically, GELU often outperforms ReLU in Transformer models such as BERT and GPT; the theoretical explanation is still an active research area.","A":"GELU requires computing the error function (or an approximation), which is more expensive than max(0,x). It is computationally slower than ReLU.","B":"","C":"GELU(x) = x · Φ(x) is negative for every x < 0, because Φ(x) > 0 there: it reaches a minimum of about -0.17 near x ≈ -0.75 and approaches 0 from below as x → -∞. For positive x it grows without bound, so it is not confined to (0,1).","D":"GELU was introduced by Hendrycks & Gimpel (2016) as a general activation function. Its advantages have been demonstrated across various architectures and normalization schemes, not limited to pre-LayerNorm configurations."},"reference":"- Hendrycks & Gimpel, \"Gaussian Error Linear Units (GELUs)\" (2016): https://arxiv.org/abs/1606.08415\n- BERT paper uses GELU: https://arxiv.org/abs/1810.04805"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03006","difficulty":"medium","orderIndex":6,"question":"You are training a binary classifier and must choose between sigmoid and ReLU for the output layer activation. 
A teammate says \"use ReLU everywhere for consistency.\" What is wrong with using ReLU on the output layer for binary classification?","options":{"A":"ReLU outputs can exceed 1.0, making them incompatible with binary cross-entropy loss which expects probabilities in [0,1]","B":"ReLU cannot distinguish between confidently correct and confidently incorrect predictions because it clips all negative values to 0","C":"ReLU is not differentiable at 0, which causes instability in the loss computation","D":"Both A and B — ReLU produces unbounded outputs and loses negative prediction information"},"correct":"A","explanation":{"correct":"- Binary cross-entropy (BCE) loss: L = -[y·log(p) + (1-y)·log(1-p)] requires p ∈ (0,1). If the model outputs p > 1 (possible with ReLU), log(1-p) = log(negative) → undefined/NaN, breaking the loss computation.\n- Sigmoid squashes any real-valued pre-activation to (0,1), making it the canonical output activation for binary classification. The log-odds interpretation is also natural: the pre-activation logit maps directly to probability via sigmoid.\n- In PyTorch, `nn.BCEWithLogitsLoss` combines sigmoid and BCE in one numerically stable operation, which is why many implementations use no output activation with `BCEWithLogitsLoss` rather than explicit sigmoid.","A":"","B":"ReLU does differentiate between high and low outputs for positive predictions. The issue is not discrimination ability for positive outputs but the incompatibility with the loss function's probability expectations.","C":"ReLU is not differentiable at exactly 0, but this is handled by convention (gradient = 0 at 0). 
In practice, the probability of exactly hitting 0 is negligible and this is not the primary problem with using ReLU on the output layer.","D":"While both A and B raise valid points, A is the fundamental reason: mathematical incompatibility with the loss function is a hard constraint, not a soft preference."},"reference":"- PyTorch BCEWithLogitsLoss: https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03007","difficulty":"medium","orderIndex":7,"question":"A network using ELU (Exponential Linear Unit) activations converges significantly faster than the same network with ReLU on a deep (20 layer) architecture. The engineer explains: \"ELU is faster because it uses exponentials which are faster than max().\" Is the engineer's explanation correct?","codeSnippet":"# ELU: f(x) = x if x > 0, else α(e^x - 1)\n# ReLU: f(x) = max(0, x)","options":{"A":"Yes — exponential functions have hardware acceleration in modern CPUs making ELU faster than ReLU","B":"No — ELU is computationally more expensive than ReLU (exp is slower than max). The faster convergence is due to ELU producing negative outputs for negative inputs, keeping the mean activation near zero. This prevents the \"bias shift\" problem that slows ReLU networks","C":"No — ELU is faster because its derivative is always non-zero, enabling larger learning rates","D":"Yes — ELU avoids the non-differentiability at z=0 that causes ReLU to require smaller learning rates"},"correct":"B","explanation":{"correct":"- The exponential function is one of the more expensive operations in floating-point arithmetic. ELU is computationally slower than ReLU per unit operation. The faster convergence is explained by a different mechanism.\n- ReLU outputs are always ≥ 0. In a layer with ReLU activations, the average output is positive, which means the next layer's weights receive inputs with non-zero mean. 
This \"bias shift\" (similar to the sigmoid non-zero-mean problem) causes gradient updates that are correlated across samples, slowing convergence.\n- ELU outputs can be negative (approaching -α for large negative inputs), keeping the mean activation near zero — similar to tanh's zero-centering benefit but without tanh's saturation problem.","A":"Exponential functions do not have special hardware acceleration that makes them faster than max(). Modern CPUs/GPUs implement max as a single instruction, while exp requires multiple floating-point operations or a table lookup.","B":"","C":"While ELU's derivative is non-zero everywhere (for α > 0), this doesn't enable larger learning rates per se. The learning rate is constrained by loss landscape curvature, not just gradient existence.","D":"ReLU's non-differentiability at z=0 is handled by convention and is not a practical constraint on learning rate. The subgradient is used and training proceeds normally."},"reference":"- Clevert et al., \"Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)\" (2015): https://arxiv.org/abs/1511.07289"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03008","difficulty":"hard","orderIndex":8,"question":"SiLU (Sigmoid Linear Unit, also called Swish) is defined as SiLU(x) = x · sigmoid(x). A researcher claims SiLU is \"the same as GELU with a different distribution assumption.\" An engineer disagrees, saying they are fundamentally different. Who is correct and what is the exact difference?","options":{"A":"The researcher is correct — SiLU and GELU are numerically identical for all practical inputs","B":"The engineer is correct — SiLU uses sigmoid(x) as the gating function (deterministic, parameterized by logistic distribution) while GELU uses Φ(x) (CDF of standard normal). 
Both are \"self-gated\" (input gates itself) but with different distributional assumptions and different numerical values for the same input","C":"The researcher is correct — both are approximations to ReLU and converge to identical functions for large networks","D":"The engineer is correct — SiLU is not differentiable while GELU is smooth everywhere"},"correct":"B","explanation":{"correct":"- Both SiLU and GELU are self-gated activations of the form f(x) = x · gate(x). For GELU: gate(x) = Φ(x) (normal CDF). For SiLU: gate(x) = sigmoid(x) = 1/(1+e^(-x)) (logistic CDF).\n- The normal CDF and logistic CDF are different functions that happen to be similar in shape (both S-shaped, both in [0,1]). At x=0: Φ(0) = 0.5 = sigmoid(0) — they agree. At x=1: Φ(1) ≈ 0.841 vs sigmoid(1) ≈ 0.731 — they diverge.\n- In practice, SiLU is used in EfficientNet, MobileNetV3, and many modern CNNs. GELU is preferred in Transformers. Both outperform ReLU on many benchmarks, and the choice is often empirical.","A":"SiLU and GELU are numerically different. For x=1: SiLU(1) = 1·sigmoid(1) ≈ 0.731, GELU(1) = 1·Φ(1) ≈ 0.841. The difference is small but real, and compounds across layers.","B":"","C":"Convergence to identical functions as network width increases is a property of neural network training dynamics (NTK perspective), not of the activation functions themselves. The activations remain numerically distinct regardless of network size.","D":"Both SiLU and GELU are smooth (infinitely differentiable). SiLU(x) = x·σ(x) is differentiable everywhere since both x and sigmoid are differentiable."},"reference":"- Ramachandran et al., \"Swish: A Self-Gated Activation Function\" (2017): https://arxiv.org/abs/1710.05941\n- GELU: https://arxiv.org/abs/1606.08415"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03009","difficulty":"hard","orderIndex":9,"question":"You train a 50-layer network with ReLU activations. 
After training, you measure the fraction of dead neurons (always outputting 0) per layer. You find: Layer 1: 2% dead, Layer 25: 35% dead, Layer 50: 68% dead. The dead neuron count increases with depth. What mechanism causes this pattern and what architectural intervention prevents it?","options":{"A":"Deeper layers receive smaller gradients due to vanishing gradient, so they update less and drift to negative weight values — dead neuron accumulation is a direct consequence of vanishing gradients in ReLU networks","B":"Dead neurons at layer k propagate to layer k+1: if a neuron in layer k is dead, it contributes 0 to all downstream neurons' pre-activations. As more upstream neurons die, more downstream neurons receive predominantly zero (or negative) pre-activations and die themselves — a cascade failure. Batch Normalization interrupts this cascade by re-centering activations before each ReLU","C":"Deeper layers have more parameters which increases the probability of any single parameter reaching a dead state statistically","D":"The learning rate decays over training, causing deeper layers (which update later in backpropagation) to have effectively lower learning rates and die from under-updating"},"correct":"B","explanation":{"correct":"- The cascade mechanism: if 35% of layer 25 neurons output 0 always, then neurons in layer 26 receive inputs that are 35% zeros. This biases their pre-activation sum toward lower values, increasing the probability they also become dead.\n- This cascade compounds exponentially: even a small dead fraction in early layers multiplies into large dead fractions in later layers.\n- BatchNorm (or LayerNorm) normalizes pre-activations to have zero mean and unit variance before the activation function. This ensures activations enter ReLU with a balanced distribution, interrupting the dead-neuron cascade. 
This is one of BatchNorm's key practical benefits.","A":"Vanishing gradients in ReLU networks are primarily a problem with multiplicative weight matrices, not the activation function itself (ReLU gradient is 1 for positive inputs). ReLU actually alleviates vanishing gradients compared to sigmoid. Dead neurons accumulate via the cascade mechanism, not gradient vanishing.","B":"","C":"Dead neuron probability is not purely statistical. Individual neuron death depends on the distribution of its inputs and the values of its specific weights — it's a deterministic function of the network state, not a random statistical outcome.","D":"A learning rate schedule applies the same rate to every layer at each step. Backpropagation also computes gradients for deeper layers (closer to the output) first, not later, so their effective learning rate is not lowered by scheduling."},"reference":"- Ioffe & Szegedy, \"Batch Normalization: Accelerating Deep Network Training\" (2015): https://arxiv.org/abs/1502.03167"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03010","difficulty":"hard","orderIndex":10,"question":"You are evaluating a new activation function f(x) = max(x, αx) where α = -0.5. An intern claims: \"This function has a negative slope for x < 0, so it will cause gradients to flip sign during backpropagation, making training unstable.\" Is the intern correct?","options":{"A":"Yes — negative slopes during backpropagation cause gradient sign flips which prevent convergence","B":"No — the gradient of f(x) for x < 0 is α = -0.5, a constant negative slope. This means gradients are scaled by -0.5 for negative pre-activations, not flipped unpredictably. 
However, this activation (Leaky ReLU with negative α) would cause unconventional behavior: negative-input neurons halve and invert their gradient signal, which could destabilize training","C":"No — gradient sign flips are normal in SGD and occur every time the optimizer passes through a loss minimum; the intern is confusing gradient descent mechanics with activation gradients","D":"Yes — but only for the first training step; after initialization, all pre-activations become positive due to ReLU's rectification behavior"},"correct":"B","explanation":{"correct":"- For x < 0, f(x) = αx = -0.5x, so f'(x) = -0.5. The chain rule multiplies this into the gradient of upstream layers. A factor of -0.5 halves and inverts the gradient signal for neurons with negative pre-activations.\n- Standard Leaky ReLU uses α ∈ (0, 1) (e.g., 0.01) to keep the negative-side slope positive but small. Using α = -0.5 is unusual and potentially harmful: the activation becomes non-monotonic (f(x) = 0.5|x| for x < 0), so inputs of opposite sign can produce the same output and the backpropagated signal changes sign whenever a pre-activation crosses zero.\n- This is different from the gradient naturally being negative (which simply means \"increase this weight\" under gradient descent). Here, the sign inversion comes from the activation itself: for these neurons, increasing the pre-activation decreases the output.","A":"The intern's concern about \"instability\" has some validity, but the mechanism described (\"flip sign\") is not quite right. The concern is about α being negative causing sign inversion through the activation, not gradient instability in the general SGD sense.","B":"","C":"The intern is not confusing gradient descent mechanics — the concern is specifically about the activation function's contribution to the chain rule product. This is a valid concern, just slightly imprecisely stated.","D":"ReLU rectification doesn't make all pre-activations positive. Many neurons will have negative pre-activations during training, especially early on. 
The \"all positives after first step\" claim is false."},"reference":"- Maas et al., \"Rectifier Nonlinearities Improve Neural Network Acoustic Models\" (Leaky ReLU, α should be in (0,1)): https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03011","difficulty":"medium","orderIndex":11,"question":"A network for multi-class classification (10 classes) uses softmax as the output activation. A colleague replaces softmax with sigmoid on each output independently, arguing \"sigmoid also produces values in (0,1) and is simpler.\" After training, the colleague's model produces outputs like [0.95, 0.87, 0.76, ...] that sum to 6.3. What critical property did the colleague's model lose?","options":{"A":"Differentiability — sigmoid outputs cannot be used with cross-entropy loss","B":"Mutual exclusivity normalization — softmax ensures outputs sum to 1.0 and represent a valid probability distribution over classes. Independent sigmoids produce values in (0,1) but without the normalization constraint, so outputs can sum to any value, making them unnormalized scores rather than class probabilities","C":"Sparsity — softmax produces sparse outputs (one dominant class) while sigmoid produces dense activations that confuse the model","D":"The model lost nothing significant — both activations produce equivalent outputs after applying argmax for the final class prediction"},"correct":"B","explanation":{"correct":"- Softmax: softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ). The denominator normalizes outputs so they sum to exactly 1.0 and form a valid categorical probability distribution.\n- Independent sigmoid: σ(zᵢ) = 1/(1+exp(-zᵢ)) for each output independently. 
No normalization — outputs can each be close to 1, summing well above 1.\n- The key difference: softmax encodes \"which class is most likely, given that exactly one is correct.\" Sigmoid encodes \"is this class present?\" — appropriate for multi-label problems (multiple classes can be true simultaneously), not multi-class problems (exactly one class is true).","A":"Sigmoid outputs are in (0,1) and differentiable. They are perfectly compatible with cross-entropy loss. The issue is not differentiability.","B":"","C":"Softmax does produce a \"winner-take-all\" effect (the largest logit gets amplified), but the primary issue is probability normalization, not sparsity per se.","D":"Argmax gives the same answer regardless of softmax vs sigmoid if the relative ordering of logits is preserved (which it is, since both are monotone transformations). So for inference alone, argmax accuracy could be similar. However, the probability estimates are meaningless, calibration is lost, and training with cross-entropy on unnormalized probabilities produces incorrect gradients."},"reference":"- https://cs231n.github.io/linear-classify/#softmax (Softmax vs SVM losses)"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03012","difficulty":"easy","orderIndex":12,"question":"You are comparing activation functions for a hidden layer. A senior engineer says: \"For modern deep learning on GPUs, ReLU is the default choice not just because it avoids vanishing gradients but for a second practical reason that matters at scale.\" What is the second practical reason?","options":{"A":"ReLU enables sparse activations — on average, ~50% of neurons output 0 in each forward pass. 
Sparse activations mean fewer multiplications in subsequent layers, which translates to real computational savings on specialized hardware","B":"ReLU outputs are bounded, preventing memory overflow in GPU operations","C":"ReLU is the only activation that is supported natively by CUDA kernels in PyTorch","D":"ReLU eliminates the need for bias terms, reducing memory usage in large networks"},"correct":"A","explanation":{"correct":"- For a typical activation distribution centered near zero after BatchNorm, roughly 50% of ReLU inputs are negative and produce exactly 0 output. Multiplying any value by 0 is trivially computed.\n- On GPUs, sparse activation can be exploited by structured pruning and sparse matrix libraries. More importantly, the 0-outputs skip computations in the next layer's matrix-vector product for those specific neurons.\n- This computational sparsity is one reason why ReLU-based sparse models can be inference-efficient, and why techniques like \"pruning\" and \"sparse networks\" work well with ReLU.","A":"","B":"ReLU outputs are NOT bounded above (max(0,x) grows without bound for large positive x). Output explosion is possible with ReLU, which is why weight initialization and batch normalization are important.","C":"PyTorch CUDA kernels support all standard activation functions including sigmoid, tanh, GELU, SiLU, etc. ReLU has no exclusive hardware support claim.","D":"Bias terms are determined by the network architecture, not the activation function. ReLU layers still use biases. Removing biases (bias=False) is an independent design choice unrelated to activation type."},"reference":"- LeCun et al., \"Efficient BackProp\": practical considerations for activations\n- https://pytorch.org/docs/stable/sparse.html"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03013","difficulty":"medium","orderIndex":13,"question":"A vision model uses ReLU activations and achieves strong performance. 
You switch to PReLU (Parametric ReLU), which replaces the fixed slope of 0 for negative inputs with a learnable parameter αᵢ per channel. After training, you find that all αᵢ converged to ~0.01 (close to Leaky ReLU's standard setting). What does this convergence pattern tell you about the data?","options":{"A":"The model overfit during training; α should be regularized to exactly 0 (standard ReLU) to prevent overfitting","B":"The data's optimal activation behavior for negative pre-activations is approximately the Leaky ReLU regime (small positive slope), not full ReLU (zero slope) or full linear (slope=1). The network discovered this autonomously — the data prefers a small leak rather than hard zeroing","C":"The αᵢ convergence to 0.01 indicates dead neurons — the learnable parameter tried to revive them with a small slope","D":"PReLU always converges to α≈0.01 regardless of data due to L2 regularization on α pulling values toward zero"},"correct":"B","explanation":{"correct":"- PReLU generalizes both ReLU (α=0) and Leaky ReLU (fixed α). If it converges to α≈0.01, the network found that a small positive slope for negative inputs is better than zero slope (ReLU) for this data.\n- This is an interpretable result: some information from negative pre-activations is useful for the task. A slope of 0.01 allows a weak gradient signal from neurons that would otherwise be dead, improving gradient flow slightly without allowing negative activations to dominate.\n- The uniform convergence across channels (all αᵢ ≈ 0.01) suggests this preference is consistent across features, not layer/channel-specific.","A":"PReLU's learnable α is not a sign of overfitting. The parameters α are additional degrees of freedom, but they are learned in a way that improves training stability. Regularizing α to exactly 0 would manually force ReLU behavior, discarding the learned preference.","B":"","C":"The slope α only changes a neuron's output when its pre-activation is negative. 
α≈0.01 means the network chose a small positive slope as the optimal behavior for negative inputs — it is not evidence of dead neurons attempting revival.","D":"L2 regularization would pull α toward 0, not 0.01. If all αᵢ converge to 0.01 with L2 regularization, the data gradient is pulling α up to 0.01 and the regularization is pulling it down — they balance at 0.01. This means the data genuinely prefers 0.01 over 0."},"reference":"- He et al., \"Delving Deep into Rectifiers\" (PReLU section): https://arxiv.org/abs/1502.01852"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03014","difficulty":"hard","orderIndex":14,"question":"A team is building a Mixture of Experts (MoE) language model. The router network (which decides which expert handles each token) uses softmax to output probabilities over 64 experts. The team observes \"expert collapse\": after 5000 training steps, 90% of tokens are routed to 2 of the 64 experts. The remaining 62 experts receive no gradients and become useless. What is the mechanistic cause related to softmax, and what fix is applied in production MoE systems?","options":{"A":"Softmax's normalization causes the winning experts to have gradients 32x larger than losing experts, amplifying early random advantages into permanent collapse","B":"Softmax with temperature=1 creates a positive feedback loop: experts that win early get more training examples, their performance improves, softmax amplifies their logit advantage further on subsequent tokens — collapse is a stable attractor of the softmax + gradient descent system. 
Production fix: add an auxiliary load-balancing loss that penalizes unequal expert utilization","C":"Expert collapse is caused by the router network overfitting to the training data; fix by adding dropout to the router","D":"Softmax is the wrong activation for routing; replace with ReLU to allow multiple experts per token"},"correct":"B","explanation":{"correct":"- The collapse mechanism: Expert A gets slightly higher initial logit → softmax amplifies this to high probability → Expert A gets more gradient updates → Expert A improves more → its logit grows higher → softmax amplifies further → collapse.\n- This is a positive feedback loop inherent to the softmax + gradient descent interaction. Softmax exponentiates the logits, so small early logit advantages become large differences in routing probability.\n- Production fix (Switch Transformer, GShard, Mixtral): an auxiliary load-balancing loss. In the Switch Transformer form, L_aux = α · N · Σᵢ fᵢ · Pᵢ, where N is the number of experts, fᵢ is the fraction of tokens dispatched to expert i, and Pᵢ is the router's mean probability for expert i. The sum is minimized when routing is uniform, so it directly penalizes unequal utilization.","A":"The gradient magnitude difference between winning and losing experts follows from the softmax probability values, not a fixed 32x factor. More importantly, the gradient difference alone doesn't cause collapse — the feedback loop between gradient updates and future routing decisions is the actual mechanism.","B":"","C":"Dropout on the router would add noise to routing decisions but would not address the fundamental positive feedback loop. Production systems use a load-balancing loss for this purpose.","D":"Replacing softmax with ReLU would allow multiple experts per token (multi-select routing), which is a different design choice. 
Some systems use top-k routing with ReLU normalization, but this changes the problem structure rather than fixing softmax expert collapse."},"reference":"- Fedus et al., \"Switch Transformers\" (2021): https://arxiv.org/abs/2101.03961\n- Lepikhin et al., \"GShard\" (2020): https://arxiv.org/abs/2006.16668"},{"section":"deep-learning","topicSlug":"activation-functions","topic":"Activation Functions","id":"dl-03015","difficulty":"hard","orderIndex":15,"question":"A research paper claims: \"For networks wider than 1000 units per layer, the choice of activation function (ReLU, GELU, tanh) becomes irrelevant because the network falls into the infinite-width (Neural Tangent Kernel) regime where all activations are equivalent.\" A practitioner dismisses this as \"theoretical nonsense.\" Who is right and why?","options":{"A":"The paper is correct — NTK theory proves that all activations become equivalent at infinite width","B":"The practitioner is right to be skeptical: NTK theory applies in a specific mathematical limit (infinite width, specific initialization, lazy training regime). At width=1000, networks are far from this limit and still in the feature-learning regime where activation choice affects learned representations, convergence speed, and final performance","C":"Both are correct — for classification tasks with width>1000, activations are equivalent; for generation tasks they differ","D":"The paper is correct for training speed but the practitioner is correct for final accuracy — activations affect how fast networks train but not what they converge to"},"correct":"B","explanation":{"correct":"- NTK theory (Jacot et al., 2018) describes networks in the \"lazy training\" regime where parameters stay close to initialization. This requires infinite width AND specific scaling. At finite width (even 10,000 units), networks learn features and deviate from the NTK prediction.\n- Practical width=1000 networks are solidly in the feature-learning (non-NTK) regime. 
The choice between ReLU and GELU significantly affects: (a) dead neuron fraction, (b) gradient flow quality, (c) representation geometry.\n- The paper's claim oversimplifies by conflating \"mathematically wider than 1000 makes NTK-like\" with the actual infinite-width limit. NTK effects start to appear at much larger widths than 1000, and even then are approximate.","A":"NTK theory proves equivalence only at truly infinite width with specific parameterization (NTK parameterization) and small learning rates. \"Infinite\" is not a practical width threshold and \"1000\" is not anywhere near the regime where NTK approximations become accurate.","B":"","C":"NTK theory does not distinguish by task type (classification vs generation). The regime is determined by network width, learning rate, initialization scale, and training dynamics — not the loss function.","D":"Activation choice affects both training speed and final accuracy. They are not decoupled. Networks with dying neurons (bad activation choice) converge to worse solutions, not just slower convergence to the same solution."},"reference":"- Jacot et al., \"Neural Tangent Kernel: Convergence and Generalization in Neural Networks\" (2018): https://arxiv.org/abs/1806.07572\n- Yang & Hu, \"Feature Learning in Infinite-Width Neural Networks\" (feature learning regime): https://arxiv.org/abs/2011.14522"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04001","difficulty":"easy","orderIndex":1,"question":"A network has input shape (batch_size=32, features=128), first layer weight matrix W₁ of shape (128, 64), and bias b₁ of shape (64,). An engineer writes the forward pass as: h = x @ W₁ + b₁. 
What is the shape of h and how does the bias broadcasting work?","options":{"A":"h has shape (32, 128) because the bias expands to match the input dimension","B":"h has shape (32, 64) — x @ W₁ produces (32, 64), and b₁ of shape (64,) is broadcast to (32, 64) by repeating along the batch dimension, adding the same bias to each sample","C":"This code is invalid — bias must have shape (32, 64) to match the batch dimension explicitly","D":"h has shape (64, 32) because matrix multiplication transposes the batch dimension"},"correct":"B","explanation":{"correct":"- Matrix multiply: (32, 128) @ (128, 64) = (32, 64). Each of the 32 samples gets its own 64-dimensional output vector.\n- Broadcasting: b₁ has shape (64,). NumPy/PyTorch broadcasts this to (32, 64) by repeating along the batch dimension — the same bias vector b₁ is added to every sample's activation. This is the correct behavior because the bias is a property of the layer, not the sample.\n- Broadcasting rules: shapes are aligned from the right. (32, 64) and (64,) → (64,) is broadcast to (1, 64) → then to (32, 64). This implicit behavior is a common source of shape bugs when the bias has unexpected dimensions.","A":"(32, 128) would be the shape if we multiplied x by W₁ transposed as (128, 128) — but the weight matrix here maps 128→64, so the output is 64-dimensional.","B":"","C":"PyTorch and NumPy handle broadcasting automatically. The bias does not need to be explicitly shaped (32, 64). Requiring explicit batch-dimension expansion would be cumbersome and is not how neural network libraries work.","D":"Matrix multiplication preserves the batch dimension as the leading dimension. 
(32, 128) @ (128, 64) = (32, 64), not (64, 32)."},"reference":"- PyTorch broadcasting semantics: https://pytorch.org/docs/stable/notes/broadcasting.html\n- https://cs231n.github.io/neural-networks-2/#datapre"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04002","difficulty":"easy","orderIndex":2,"question":"You are implementing a 3-layer MLP forward pass and tracking tensor shapes. Input is (batch=16, 784). Layers: [784→256, 256→128, 128→10]. At each step, you apply ReLU after layers 1 and 2, and no activation after layer 3. Which shape sequence is correct?","options":{"A":"(16,784) → (16,256) → (256,) → (16,128) → (128,) → (16,10)","B":"(16,784) → (16,256) → (16,256) → (16,128) → (16,128) → (16,10)","C":"(16,784) → (256,16) → (256,16) → (128,16) → (128,16) → (10,16)","D":"(16,784) → (16,256) → (16,128) → (16,10) skipping ReLU shapes since activation doesn't change shape"},"correct":"B","explanation":{"correct":"- Layer 1: (16,784) @ (784,256) + b = (16,256). ReLU applied element-wise: output is (16,256) — same shape, different values.\n- Layer 2: (16,256) @ (256,128) + b = (16,128). ReLU: still (16,128).\n- Layer 3: (16,128) @ (128,10) + b = (16,10). No activation.\n- Activation functions (ReLU, sigmoid, tanh) are element-wise operations — they preserve tensor shape. Shape tracking must include these steps to verify the code is correct, even though shape doesn't change.","A":"The (256,) and (128,) shapes are incorrect — they represent 1D bias vectors, not the layer outputs. After the matrix multiply, the output is 2D (batch × features).","B":"","C":"Standard PyTorch linear layers use (batch, features) convention, not (features, batch). 
The batch dimension is always leading.","D":"Option D is actually numerically correct (skipping same-shape ReLU steps), but omitting activation steps in shape tracking is bad practice — a common source of bugs when activation functions are accidentally applied to wrong tensors."},"reference":"- PyTorch nn.Linear documentation: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04003","difficulty":"medium","orderIndex":3,"question":"You are processing a batch of 64 images, each of shape (3, 224, 224), through a convolutional layer followed by a fully connected layer. Before the FC layer, the output of the conv stack has shape (64, 512, 7, 7). A junior engineer writes `x = x.reshape(64, -1)` before the FC layer. What is the resulting shape and what bug risk does this introduce compared to `x.view(64, -1)` or `nn.Flatten()`?","codeSnippet":"conv_out = torch.randn(64, 512, 7, 7)\nx = conv_out.reshape(64, -1)\nfc = nn.Linear(512*7*7, 1000)\nlogits = fc(x)","options":{"A":"Shape is (64, 25088). reshape is equivalent to view for contiguous tensors; the bug risk is that reshape may silently copy non-contiguous tensors, potentially masking incorrect tensor layout assumptions downstream","B":"Shape is (64, 512, 49) because reshape preserves the channel dimension","C":"Shape is incorrect because -1 cannot infer dimensions for 4D → 2D flattening","D":"Shape is (64, 25088) but this will cause a runtime error because FC layers require 3D inputs"},"correct":"A","explanation":{"correct":"- 512 × 7 × 7 = 25,088. reshape(64, -1) infers -1 = 25,088. Output shape: (64, 25,088). ✓\n- `reshape` vs `view`: Both produce the same shape. For contiguous tensors (which standard conv outputs are), they are identical. For non-contiguous tensors (e.g., after transpose or permute), `view` raises an error while `reshape` silently copies. 
This means `reshape` can mask bugs where a tensor has unexpected memory layout.\n- Best practice: use `nn.Flatten()`, which keeps the batch dimension by default (start_dim=1), handles both contiguous and non-contiguous tensors, and documents intent clearly in the model definition.","A":"","B":"`reshape(64, -1)` collapses ALL remaining dimensions into one. (64, 512, 7, 7) → (64, 512*7*7) = (64, 25088), not (64, 512, 49).","C":"PyTorch's `-1` in reshape correctly infers the size needed to keep total elements constant. 64×512×7×7 = 64×25088, so -1 = 25088. This is standard behavior.","D":"PyTorch `nn.Linear` accepts inputs of shape (*, in_features), so the 2D tensor (64, 25088) is valid. FC layers do not require 3D inputs; sequence-shaped 3D inputs are what recurrent layers expect."},"reference":"- PyTorch reshape vs view: https://pytorch.org/docs/stable/tensor_view.html\n- nn.Flatten: https://pytorch.org/docs/stable/generated/torch.nn.Flatten.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04004","difficulty":"medium","orderIndex":4,"question":"During forward propagation through a 5-layer network, you process a batch of 256 samples. A profiler shows that 80% of the forward pass time is spent in matrix multiplication. Your team proposes three optimizations: (A) reduce batch size to 32, (B) use float16 instead of float32, (C) add a skip connection from layer 1 to layer 5. 
Which optimization(s) will reduce forward pass time and why?","options":{"A":"Only A — smaller batch size means less data to process","B":"Only B — float16 operations are 2× faster than float32 on modern GPUs and tensor cores","C":"A and B — batch size and precision both affect throughput; skip connections add computation","D":"B and potentially A — float16 halves memory traffic and enables tensor core operations (4-8× faster matrix multiply); reducing batch size helps if GPU memory is the bottleneck but hurts throughput efficiency if the GPU is underutilized"},"correct":"D","explanation":{"correct":"- Float16 (half precision): Modern NVIDIA GPUs have dedicated tensor cores that can perform FP16 matrix multiplication several times faster than FP32 (4-8× is commonly quoted; the exact factor depends on the GPU generation). Memory traffic is also halved (each value is 2 bytes vs 4 bytes), which eases the data-movement bottleneck.\n- Batch size: Reducing from 256 to 32 doesn't help if the GPU is already compute-bound (fully utilizing all cores). It can hurt throughput by reducing parallelism. It only helps if GPU memory is the bottleneck preventing larger batches.\n- Skip connections (C) add matrix additions (cheap) and potentially extra weight matrices — they slightly increase FLOPs but can improve gradient flow, leading to better final models. They don't reduce forward pass time.","A":"Reducing batch size from 256 to 32 reduces the amount of work per batch, but GPU throughput is maximized with large batches. For a 256-sample batch, the GPU is likely well-utilized. Going to 32 may leave GPU cores idle, reducing actual throughput efficiency (samples/second).","B":"The float16 part is right in direction, but \"Only B\" rules out A, and A is not a simple yes or no — it depends on whether the GPU is memory-bound or compute-bound, and whether the current batch size fills GPU compute capacity.","C":"Skip connections add an element-wise addition (negligible cost) but if they include extra weight matrices (as in ResNets), they add matrix multiplications. 
Net effect: slightly more computation, not less.","D":""},"reference":"- NVIDIA tensor cores and FP16: https://developer.nvidia.com/tensor-cores\n- Mixed precision training: https://pytorch.org/docs/stable/amp.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04005","difficulty":"medium","orderIndex":5,"question":"You implement a forward pass manually for debugging:","codeSnippet":"def forward(x, W1, b1, W2, b2):\n z1 = x @ W1.T + b1\n a1 = relu(z1)\n z2 = a1 @ W2.T + b2\n return z2","options":{"A":"The bias should be added before the matrix multiply, not after","B":"PyTorch's `nn.Linear` transposes the weight matrix internally (computes xW^T + b), so the weight matrices W1 and W2 should be stored as (out_features, in_features) — this code is actually consistent with PyTorch's convention","C":"ReLU should be applied before the matrix multiply in layer 2, not after layer 1","D":"The code uses `.T` which transposes the entire tensor including batch dimensions for batched inputs, causing incorrect computation"},"correct":"B","explanation":{"correct":"- PyTorch's `nn.Linear(in, out)` stores weight as shape (out, in) and computes output = input @ weight.T + bias. The transpose operation aligns dimensions: (batch, in) @ (in, out) = (batch, out).\n- This code does exactly that: in `x @ W1.T + b1`, W1 has shape (out, in), so `W1.T` is (in, out) and the product is (batch, out). The implementation is consistent with PyTorch's convention.\n- The subtle point: many textbooks write the weight as (in, out) and compute xW + b without transpose. PyTorch stores the weight as (out, in), which matches the column-vector form y = Wx + b. This inconsistency between textbook notation and implementation is a frequent source of confusion.","A":"The bias is correctly added after the matrix multiply: z = xW^T + b. This is the standard affine transformation. 
Adding bias before the multiply would produce W^T(x + b), which is mathematically different.","B":"","C":"The activation is applied to z1 (layer 1's pre-activation) to produce a1 (layer 1's output). Layer 2 then processes a1. The order is correct: z → activation → next z is the standard forward pass structure.","D":"`.T` in PyTorch (and NumPy) for 2D tensors transposes the two dimensions correctly. For a 2D weight matrix (out, in), `.T` gives (in, out). For higher-dimensional tensors, `.T` reverses all dimensions, but weight matrices are 2D."},"reference":"- PyTorch nn.Linear weight shape: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04006","difficulty":"medium","orderIndex":6,"question":"A model processes batches of text sequences with shape (batch=32, seq_len=512, d_model=768). During a forward pass through a feed-forward sublayer (two linear layers), you notice peak GPU memory is 3× the model parameter memory. A teammate says \"just reduce batch size.\" What is the actual cause of the memory spike and what is the correct fix?","options":{"A":"The model parameters are duplicated three times during the forward pass for numerical stability","B":"Intermediate activations (all layer outputs needed for backpropagation) are stored during the forward pass. For (32, 512, 768) inputs processed through multiple layers, these activation tensors collectively occupy 2-3× model memory. 
Correct fix: gradient checkpointing trades memory for compute by recomputing activations during the backward pass","C":"The optimizer states (Adam maintains 2 extra copies per parameter) cause the 3× memory during forward pass","D":"Float32 arithmetic requires 4 bytes per number, and the GPU allocates memory in 3× chunks for alignment"},"correct":"B","explanation":{"correct":"- During the forward pass, PyTorch stores all intermediate activations needed for backpropagation (chain rule requires knowing the forward values to compute gradients). For a deep network on large sequences, these stored activations can easily exceed model parameter memory.\n- For (32, 512, 768): each activation tensor is 32×512×768×4 bytes ≈ 48 MB. With 12 Transformer layers each having multiple sublayers, stored activations sum to hundreds of MB or more.\n- Gradient checkpointing (torch.utils.checkpoint): during forward pass, discard intermediate activations. During backward pass, recompute them on-the-fly. Trades ~33% extra compute for ~50-70% memory reduction.","A":"Model parameters are not duplicated during the forward pass. They are stored once and referenced. Parameter duplication happens with distributed training (data parallelism) or during optimizer steps, not forward passes.","B":"","C":"Adam optimizer states (first and second moment estimates) are allocated during the optimizer step, not during the forward pass. They are persistent between training steps but are not created fresh during forward propagation.","D":"Memory alignment is real but results in small, fixed padding, not 3× expansion. 
Memory alignment does not cause 3× usage."},"reference":"- PyTorch gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html\n- Chen et al., \"Training Deep Nets with Sublinear Memory Cost\" (gradient checkpointing): https://arxiv.org/abs/1604.06174"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04007","difficulty":"hard","orderIndex":7,"question":"You run the exact same network forward pass twice with the same input tensor and observe different outputs. The network is in `model.eval()` mode. What is the most likely cause, and what does fixing it require?","codeSnippet":"model.eval()\nx = torch.randn(8, 64)\nout1 = model(x)\nout2 = model(x)\nassert torch.allclose(out1, out2) # This assertion FAILS","options":{"A":"`model.eval()` does not disable all randomness — Dropout layers are disabled by eval(), but if the model contains MC Dropout (Dropout intentionally left active in eval mode), or if any layer explicitly generates random noise (e.g., noise injection for robustness), the outputs will differ","B":"Float32 arithmetic is non-deterministic on GPUs; different CUDA kernel execution orders produce different results on every run","C":"The model has a bug in weight initialization that re-randomizes weights on every forward call","D":"PyTorch eval() mode only affects BatchNorm statistics; Dropout is always active regardless of eval/train mode"},"correct":"A","explanation":{"correct":"- `model.eval()` sets the mode flag that disables standard Dropout and switches BatchNorm from batch statistics to running statistics. However, it does not disable ALL randomness.\n- MC Dropout (Monte Carlo Dropout) intentionally overrides the eval flag to keep Dropout active for uncertainty estimation. 
If the model uses this pattern, eval mode does not make it deterministic.\n- Other sources of non-determinism in eval mode: stochastic depth layers, noise injection, random augmentation in the forward path, or CUDA non-determinism with certain operations.\n- Fix: explicitly set `torch.manual_seed()` before each call, use `torch.use_deterministic_algorithms(True)`, or identify and disable the specific source of randomness.","A":"","B":"While CUDA non-determinism is real (some operations like atomicAdd have non-deterministic ordering), it produces differences in the ~1e-7 range, well within `torch.allclose`'s default tolerance (atol=1e-8, rtol=1e-5). Failing `allclose` with identical inputs suggests larger differences.","C":"Weight re-initialization in the forward pass would be a catastrophic bug that would be immediately obvious in training (loss would never decrease). This is not a realistic scenario in a model that has been trained.","D":"eval() definitely disables Dropout (by setting `self.training = False`, which Dropout checks). The confusion is that some custom Dropout implementations explicitly ignore the training flag."},"reference":"- PyTorch MC Dropout for uncertainty: https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html\n- torch.use_deterministic_algorithms: https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04008","difficulty":"hard","orderIndex":8,"question":"A language model processes tokens with an embedding layer that maps token IDs to 512-dimensional vectors. Input is a batch of shape (32, 128) — 32 sentences, each with 128 tokens. The embedding table has shape (50000, 512). An engineer writes the forward pass as a matrix multiply: `embeddings = one_hot(tokens) @ embedding_table`. 
A senior engineer says this is \"functionally correct but catastrophically inefficient.\" Why?","options":{"A":"One-hot encoding creates a (32, 128, 50000) tensor — 32×128×50000×4 bytes ≈ 800 MB just for the one-hot matrix. The actual operation needed is a simple lookup (indexing), not a matrix multiply","B":"Matrix multiplication requires contiguous memory and one-hot tensors are sparse, causing CUDA memory allocation failures","C":"The embedding table must be transposed before the multiply, so the engineer's code produces wrong output shapes","D":"One-hot + matrix multiply is only inefficient for vocabularies larger than 100,000; for 50,000 tokens it is acceptable"},"correct":"A","explanation":{"correct":"- One-hot encoding a (32, 128) index tensor with vocabulary size 50,000 creates a (32, 128, 50,000) float tensor. Memory: 32×128×50,000×4 = 819 MB just for the one-hot representation.\n- The one-hot matrix is 99.998% zeros (only 1 out of 50,000 entries is 1 per token). Multiplying by the embedding table computes 50,000 products and sums only to select 1 row — extreme waste.\n- The correct operation: `embedding_table[tokens]` (fancy indexing). PyTorch's `nn.Embedding` implements this as an O(1) lookup per token — just reading the row at the given index. No multiplication needed.","A":"","B":"Sparse tensors are supported in PyTorch, but the one-hot matrix here would be created as a dense tensor (no automatic sparsification). Memory allocation failure is possible but the primary issue is the inefficiency, not a hard failure.","C":"(32, 128, 50000) @ (50000, 512) = (32, 128, 512) — the shapes are actually correct (batched matrix multiply). The code is functionally correct, just catastrophically slow/memory-intensive.","D":"There is no meaningful threshold at 100,000. Even at vocabulary=50,000, the one-hot approach is 4+ orders of magnitude more compute than a lookup. 
The inefficiency scales linearly with vocabulary size — it is always unacceptable."},"reference":"- PyTorch nn.Embedding: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04009","difficulty":"hard","orderIndex":9,"question":"During a forward pass, you trace the values flowing through a network and observe that after 8 layers with ReLU activations, the activation norms double with each layer: layer 1 norm ≈ 1.0, layer 2 ≈ 2.0, ..., layer 8 ≈ 128.0. The loss is NaN after the first batch. What initialization problem caused this and what is the fix?","options":{"A":"The weights were initialized too small, causing ReLU to output zero for all inputs","B":"The weights were initialized with variance too large (e.g., random normal with std=1.0 instead of He initialization). Each layer scales the pre-activations by approximately √(fan_in) × std. With std=1.0 and fan_in=256 neurons, that is ~16× per layer, causing exponential activation growth and eventual overflow to NaN","C":"The bias terms were initialized to positive values, causing additive growth across layers","D":"ReLU should not be used with more than 4 layers; beyond that, activation normalization is required"},"correct":"B","explanation":{"correct":"- If weights are sampled from N(0, 1) for a layer with fan_in=256, the pre-activation z = Σᵢ wᵢxᵢ has variance = fan_in × Var(w) × Var(x) = 256 × 1 × 1 = 256 (taking Var(x) = 1), so std(z) = 16. After ReLU (which halves variance), each layer amplifies activation norm by roughly √(256/2) ≈ 11×. The observed 2× per layer is a milder case of the same fault: it corresponds to a weight std about twice the He value.\n- He initialization: std = √(2/fan_in) ensures each ReLU layer preserves activation variance: fan_in × (2/fan_in) × Var(x) = 2 × Var(x) → after ReLU (halving): Var = 1 × Var(x). 
Norm stays constant across layers.\n- Exponential norm growth → floating-point overflow → NaN loss on first backward pass.","A":"Small weight initialization causes vanishing activations (norms shrink to near zero), not exponential growth. The observed doubling-per-layer pattern is a signature of over-large weight variance.","B":"","C":"Bias initialization affects the offset of each layer's output but not the multiplicative growth. Biases are typically initialized to zero and contribute additively (linear, not exponential growth).","D":"ReLU can be used with 100+ layers when properly initialized (ResNets use ReLU at 50-152 layers). The issue is initialization, not a fundamental ReLU depth limit."},"reference":"- He et al., \"Delving Deep into Rectifiers\" (He initialization derivation): https://arxiv.org/abs/1502.01852\n- https://cs231n.github.io/neural-networks-2/#init"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04010","difficulty":"medium","orderIndex":10,"question":"A model is trained on GPU and achieves 98% training accuracy. During inference on CPU, the same model produces different outputs for the same inputs — outputs that are slightly different (differences of ~1e-4). The model does not use Dropout. What is the most likely cause?","options":{"A":"CPU and GPU use different random seeds, causing stochastic differences in forward propagation","B":"Float32 arithmetic is not associative — the order of floating-point operations differs between GPU (parallel, fused operations) and CPU (sequential, different operation ordering), producing slightly different results due to floating-point rounding. 
These differences are expected and not a bug","C":"The model weights were saved in float16 and loaded in float32 on CPU, causing precision loss during weight conversion","D":"CPU inference automatically applies quantization, reducing precision to int8"},"correct":"B","explanation":{"correct":"- Floating-point arithmetic is not associative: (a + b) + c is not always equal to a + (b + c) in IEEE 754 float32, because each step rounds. GPUs perform matrix multiplications with parallel reduction (different summation order than CPU sequential operations), producing numerically different but equally \"correct\" results.\n- Differences of ~1e-4 in float32 are plausible for this phenomenon when rounding errors accumulate across large reductions and several layers. The results are both valid floating-point approximations to the same mathematical computation, just with different rounding error accumulation paths.\n- This is known and expected behavior, described in PyTorch's reproducibility notes. `torch.use_deterministic_algorithms(True)` makes runs repeatable on the same device, but it does not make CPU and GPU results bit-identical; compare outputs across devices with a tolerance (e.g., `torch.allclose`).","A":"Forward propagation in eval mode (no Dropout) is deterministic given the same inputs and weights. Random seeds affect random number generation, which is not used in a standard forward pass.","B":"","C":"`torch.save` preserves each tensor's dtype, and parameters are float32 by default, so no conversion happens on load unless the weights were explicitly cast. Float16-to-float32 conversion would produce systematic differences, not random ~1e-4 variations.","D":"PyTorch CPU inference does not automatically quantize models. Quantization (int8) is an explicit operation requiring `torch.quantization` API calls. 
Default CPU inference uses float32."},"reference":"- CUDA determinism: https://pytorch.org/docs/stable/notes/randomness.html\n- IEEE 754 floating-point arithmetic: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04011","difficulty":"easy","orderIndex":11,"question":"A network processes batches of images. During the forward pass, the last convolutional layer output has shape (batch=8, channels=256, height=14, width=14). Before the fully connected layer, the tensor must be flattened. A new engineer uses `x.squeeze()` instead of `x.flatten(1)`. What is the problem?","options":{"A":"`squeeze()` and `flatten(1)` are equivalent for 4D tensors — no problem","B":"`squeeze()` removes dimensions of size 1 — if the batch size is 1, it would remove the batch dimension, producing shape (256, 14, 14) instead of (1, 25088), breaking the FC layer which expects 2D input","C":"`squeeze()` transposes the channel and spatial dimensions, producing incorrect feature ordering","D":"`squeeze()` only works on 2D tensors; it will raise an error on a 4D input"},"correct":"B","explanation":{"correct":"- `torch.squeeze()` removes all dimensions of size 1. For batch_size=8: shape (8, 256, 14, 14) — no size-1 dimensions, so squeeze does nothing (accidentally correct for this batch).\n- For batch_size=1: shape (1, 256, 14, 14) — squeeze removes the batch dimension → (256, 14, 14). The FC layer (nn.Linear) expects 2D (batch, features) → shape error.\n- This bug appears only during inference when single samples are processed (batch_size=1), not during training (batch_size > 1). It is a classic \"training works, inference breaks\" bug.","A":"They produce the same result only when no dimension has size 1. 
For batch_size=1, they produce different results.","B":"","C":"squeeze() only removes size-1 dimensions — it does not reorder or transpose remaining dimensions.","D":"squeeze() works on tensors of any dimension. It removes any dimension(s) that have size 1, regardless of the total number of dimensions."},"reference":"- torch.squeeze documentation: https://pytorch.org/docs/stable/generated/torch.squeeze.html\n- Common PyTorch pitfalls: https://pytorch.org/docs/stable/notes/faq.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04012","difficulty":"medium","orderIndex":12,"question":"You implement batch processing in numpy for a network with weight W (shape: 100×50) and bias b (shape: 100,). You receive a single sample x (shape: (50,)) and a batch X (shape: (32, 50)). Your colleague's code uses `np.dot(W, x)` for single samples and `np.dot(X, W.T)` for batches. Why are these two different formulations necessary?","options":{"A":"They are not necessary — one formulation works for both cases in numpy","B":"numpy's dot product behaves differently for 1D and 2D inputs: `np.dot(W, x)` computes Wx (matrix-vector, output shape (100,)), while `np.dot(X, W.T)` computes XW^T (matrix-matrix, output shape (32, 100)). Both are correct, but using `np.dot(W, x)` on a batch would compute a different operation","C":"The batch formulation transposes W because batched gradient computation requires transposed weight access","D":"Both compute identical operations — the difference is only in memory layout which numpy handles automatically"},"correct":"B","explanation":{"correct":"- `np.dot(W, x)`: W is (100,50), x is (50,) → matrix-vector product → output (100,). This is the standard Wx formulation.\n- `np.dot(X, W.T)`: X is (32,50), W.T is (50,100) → matrix-matrix product → output (32, 100). 
Each row of X is one sample, computing all 32 outputs simultaneously.\n- Modern deep learning libraries (PyTorch, TensorFlow) abstract this by working in batch mode with a leading batch dimension. nn.Linear stores its weight with shape (out_features, in_features) and computes x @ W.T, which is the same XW^T formulation used for the batch here.","A":"You cannot use `np.dot(W, X.T)` to replace both. While `np.dot(W, X.T)` gives shape (100, 32) which can be transposed to (32, 100), it requires an extra transpose and is less readable. More importantly, the question asks why two different formulations exist in the colleague's code.","B":"","C":"The transpose in the batch formulation is not related to gradient computation — it's a geometric necessity for the matrix dimensions to align. (32,50) @ (50,100) requires W.T, not W.","D":"The two operations are not identical. `np.dot(W, x)` produces (100,), not (32, 100). numpy does not \"handle\" this automatically."},"reference":"- numpy dot product semantics: https://numpy.org/doc/stable/reference/generated/numpy.dot.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04013","difficulty":"hard","orderIndex":13,"question":"You are implementing a Transformer's feed-forward sublayer as shown below. Compared with standard implementations of the original \"Attention Is All You Need\" architecture, what is missing?","codeSnippet":"import torch.nn as nn\nimport torch.nn.functional as F\n\nclass FFN(nn.Module):\n    def __init__(self, d_model=512, d_ff=2048):\n        super().__init__()\n        self.w1 = nn.Linear(d_model, d_ff)\n        self.w2 = nn.Linear(d_ff, d_model)\n\n    def forward(self, x):  # x: (batch, seq, d_model)\n        return self.w2(F.relu(self.w1(x)))","options":{"A":"The ReLU should be GELU for Transformers","B":"A dropout layer should be applied between the two linear layers in training, as in standard implementations of the original architecture","C":"The output should be divided by √d_ff to normalize the output scale","D":"The linear layers should use weight tying (sharing weights between w1 and w2.T)"},"correct":"B","explanation":{"correct":"- The original \"Attention is All You Need\" (Vaswani et al., 2017) defines the sublayer as 
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂ and states that dropout is applied to the output of each sub-layer, before it is added to the sub-layer input and normalized. The equation itself contains no dropout, but standard implementations (for example PyTorch's nn.TransformerEncoderLayer) also apply dropout after the first activation.\n- Modern implementations often include this dropout explicitly inside the FFN: `F.dropout(F.relu(self.w1(x)), p=0.1, training=self.training)`.\n- The missing component is subtle but important for regularization. Dropout is a no-op at inference time, but it should be present in the training code.","A":"The original paper uses ReLU. GELU is used in BERT, GPT, and later Transformers, but the question specifically asks about missing components vs the original \"Attention is All You Need\" formulation. GELU is a later improvement, not a missing component.","B":"","C":"The Transformer does not divide FFN output by √d_ff. The scaling factor 1/√d_k appears in the attention score computation, not in the FFN. Dividing by √d_ff would unnecessarily shrink outputs.","D":"Weight tying is used between the token embedding layer and the output projection (input embedding ↔ pre-softmax weight). It is not applied between FFN's two linear layers — they have different dimensions (d_model×d_ff and d_ff×d_model) and would need to be transposed, which is a different pattern."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): https://arxiv.org/abs/1706.03762 (Section 3.3, Position-wise Feed-Forward Networks)"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04014","difficulty":"hard","orderIndex":14,"question":"You are comparing inference throughput of two models: Model A processes samples one-at-a-time (batch_size=1), Model B processes 256 samples simultaneously (batch_size=256). Both models have identical architectures. On a modern GPU, Model B has 180× higher throughput (samples/second) despite using 256× more samples per forward pass. 
What explains the throughput improvement, and what limits further improvement beyond batch_size=256?","options":{"A":"Model B benefits from GPU parallelism — a single forward pass on 256 samples is nearly as fast as 1 sample because all 256 samples are processed simultaneously on different GPU cores. The limit is GPU memory: once the batch no longer fits in VRAM, throughput drops","B":"Model B benefits from better caching — 256 samples cause the weight matrix to be cached in L2 cache, reducing memory access time per sample","C":"Model A has 256× more Python overhead because each sample requires a separate Python function call","D":"Model B enables kernel fusion, which is only active above batch_size=128"},"correct":"A","explanation":{"correct":"- Modern GPUs (A100: 6912 CUDA cores, roughly 1.5–2 TB/s memory bandwidth depending on the variant) are designed for massively parallel computation. For batch_size=1, most GPU cores are idle during a matrix multiply because the single-sample computation doesn't generate enough parallelism to saturate the hardware.\n- For batch_size=256, the matrix multiply (256, features) × (features, out) saturates GPU cores. The wall-clock time for 256 samples is only about 1.4× that of a single sample (256/180) because the samples are processed in parallel.\n- The limit: when batch_size × activations_per_sample exceeds GPU VRAM, Out-of-Memory errors occur. Also, beyond the GPU's saturation point, each additional sample does take proportionally longer (diminishing returns). The optimal batch size maximizes GPU utilization without exceeding memory.","A":"","B":"Weight matrix caching in L2 is a real effect but explains only a 2-5× speedup in bandwidth-bound operations, not 180×. The dominant effect is GPU parallelism.","C":"Modern deep learning frameworks (PyTorch with CUDA) don't make a Python call per sample during batched forward passes. The overhead is at the batch level, not per sample. 
Python GIL overhead is minimal in GPU-accelerated inference.","D":"Kernel fusion (combining multiple operations into one CUDA kernel) happens based on graph structure and operator implementation, not batch size thresholds."},"reference":"- NVIDIA GPU architecture and parallelism: https://developer.nvidia.com/blog/cuda-pro-tip-understand-fat-binaries/\n- PyTorch performance tuning guide: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html"},{"section":"deep-learning","topicSlug":"forward-propagation","topic":"Forward Propagation","id":"dl-04015","difficulty":"hard","orderIndex":15,"question":"You are profiling a model's forward pass and find that a layer with shape (batch=64, seq=512, d=768) takes 3× longer than expected based on FLOP count. The layer is a fully connected layer (nn.Linear). The FLOP count predicts 10ms but it takes 30ms. What is the most likely performance bottleneck and how would you diagnose it?","options":{"A":"The layer has too many parameters — reduce d from 768 to 256 to bring runtime in line with FLOP prediction","B":"The layer is memory-bandwidth bound rather than compute-bound: the weight matrix size (768×768 = 2.4 MB in float32) plus input activations must be read from VRAM on each forward pass. If the arithmetic intensity (FLOPs / bytes of memory access) is below the GPU's roofline, actual throughput is limited by memory bandwidth, not compute","C":"The Python garbage collector is pausing for 20ms during the layer computation to free old tensors","D":"Batch size 64 is too small for this layer's dimensions — the GPU cannot parallelize below batch_size=256"},"correct":"B","explanation":{"correct":"- The roofline model: a GPU has peak FLOP/s and peak memory bandwidth. For a given operation, arithmetic intensity = FLOPs / bytes accessed. 
If arithmetic intensity < (peak FLOP/s / peak bandwidth), the operation is memory-bound — memory access is the bottleneck.\n- For nn.Linear on (64, 512, 768): FLOP count = 2 × 64×512×768×768 ≈ 38.7 GFLOP. Memory: weight (768×768×4) ≈ 2.4 MB + input activations (64×512×768×4) ≈ 100 MB, with another ≈ 100 MB written for the output. Moving ≈ 200 MB at 2 TB/s takes about 100μs, while 38.7 GFLOP at 20 TFLOP/s takes about 2ms, so under ideal conditions the operation is compute-bound. But if actual access patterns cause repeated weight re-reads (e.g., non-contiguous memory), effective bandwidth drops.\n- Diagnosis: use NVIDIA Nsight Compute (or the older `nvprof`) or `torch.profiler` to check compute vs memory utilization. If GPU compute utilization is low but memory bandwidth is near 100%, the layer is memory-bound.","A":"Reducing d reduces the FLOP count. If the operation is memory-bound, reducing FLOPs won't help proportionally — you'd just have fewer FLOPs sitting idle while memory bandwidth remains the bottleneck.","B":"","C":"Python GC does not pause CUDA operations. PyTorch CUDA operations are asynchronous — CUDA streams continue independent of Python GC. PyTorch's CUDA memory manager handles tensor deallocation independently.","D":"While low batch sizes reduce parallelism, nn.Linear with batch=64 and seq=512 means 64×512=32,768 rows being processed in parallel — this is typically sufficient to saturate most GPU layers. The \"minimum batch size\" framing oversimplifies GPU utilization."},"reference":"- PyTorch profiler: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html\n- Roofline model: https://developer.nvidia.com/blog/roofline-and-deepspeed/"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05001","difficulty":"easy","orderIndex":1,"question":"A regression model predicts house prices. During training, the MSE loss is 1,000,000 (in squared dollars). A colleague says \"that's a huge loss, the model is failing.\" A senior engineer disagrees. 
Who is correct?","options":{"A":"The colleague is correct — MSE above 1000 always indicates a failing model","B":"The senior engineer is correct — MSE is in squared units of the target variable. If prices are in dollars and predictions are off by ~$1000 on average, MSE ≈ 1,000,000 (1000²). The absolute MSE value is meaningless without context of the target scale; RMSE ($1000 error) or MAPE is more interpretable","C":"The colleague is correct — MSE should always be normalized to [0,1] before training","D":"Both are wrong — MSE is computed in log-space for price prediction and the units are not squared dollars"},"correct":"B","explanation":{"correct":"- MSE = (1/n)Σ(y - ŷ)². If y is in dollars and the average error is $1000, MSE = 1000² = 1,000,000. An MSE of 1,000,000 dollars² corresponds to RMSE = $1,000 — which may be excellent for a $500,000 house (~0.2% error).\n- The key insight: MSE's magnitude is meaningless in isolation. It depends entirely on the scale of the target variable. A model predicting temperatures in Kelvin (range ~250-350) vs dollars (range ~$50,000-$5M) will have MSE values differing by 6 orders of magnitude despite equal prediction quality.\n- RMSE is preferred for interpretability (same units as target), and R² is preferred for scale-independent model quality assessment.","A":"No threshold on MSE indicates a failing model without knowing the target scale. MSE < 0.001 could be catastrophic if targets are in the range [0, 0.0001].","B":"","C":"MSE normalization to [0,1] is not standard practice and would require knowing the maximum possible squared error in advance, which is often undefined for regression.","D":"Predicting in log-space is a common technique for skewed targets like prices, but it is not a default behavior. 
Most regression models train on raw values unless explicitly transformed."},"reference":"- https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05002","difficulty":"easy","orderIndex":2,"question":"A binary classifier outputs probabilities and is trained with Binary Cross-Entropy (BCE) loss. For a positive sample (y=1), the model outputs p=0.01. For a negative sample (y=0), the model outputs p=0.99. Calculate the BCE loss for each case and explain why the loss function behaves this way.","options":{"A":"BCE(y=1, p=0.01) = -log(0.99) ≈ 0.01; BCE(y=0, p=0.99) = -log(0.01) ≈ 4.6 — losses are equal because the errors are symmetric","B":"BCE(y=1, p=0.01) = -log(0.01) ≈ 4.6; BCE(y=0, p=0.99) = -log(1-0.99) = -log(0.01) ≈ 4.6 — both cases are maximally wrong and incur the same loss. BCE applies heavy penalties for confident wrong predictions via the log function","C":"BCE(y=1, p=0.01) = 0.01; BCE(y=0, p=0.99) = 0.01 — BCE is linear in prediction error","D":"BCE(y=1, p=0.01) = -(0.01) = -0.01; the negative sign causes gradient ascent when predictions are wrong"},"correct":"B","explanation":{"correct":"- BCE: L = -[y·log(p) + (1-y)·log(1-p)]. For y=1, p=0.01: L = -log(0.01) = log(100) ≈ 4.605. For y=0, p=0.99: L = -log(1-0.99) = -log(0.01) ≈ 4.605.\n- The log function maps p→0 to L→∞ and p→1 to L→0. A highly confident wrong prediction (p→0 when y=1) incurs an enormous loss — this is the \"penalty for overconfident errors\" property.\n- This property is why cross-entropy trains classifiers to be well-calibrated: the model is penalized not just for being wrong, but for being confidently wrong. An overconfident wrong prediction receives a much larger gradient than an uncertain wrong prediction.","A":"The formula is reversed. BCE(y=1, p) = -log(p), not -log(1-p). 
For y=1, p=0.01: -log(0.01) = 4.6, not -log(0.99) = 0.01.","B":"","C":"BCE is logarithmic, not linear. The log function ensures that predictions near 0 or 1 receive extreme penalties when wrong. Linearity would not penalize overconfidence adequately.","D":"The negative sign in BCE makes the loss positive (log of a probability in (0,1) is negative; negating it makes the loss positive). It does not cause gradient ascent — the loss is positive and minimized by gradient descent."},"reference":"- https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html\n- Bishop, \"Pattern Recognition and Machine Learning\", Chapter 4.3.4"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05003","difficulty":"medium","orderIndex":3,"question":"You train a neural network for a 10-class problem using Cross-Entropy loss. The model achieves 95% training accuracy but you notice the average training loss stopped decreasing after epoch 50 (stuck at 0.15) even though gradients are non-zero. Your colleague suspects overfitting. What is the more likely cause?","options":{"A":"Cross-entropy loss has a lower bound of 0; once accuracy plateaus, loss cannot decrease further","B":"Cross-entropy loss can continue decreasing even when accuracy is 95% — the model could push probabilities for correct classes closer to 1.0, reducing loss. 
Loss stopping while accuracy is stable indicates the model has likely reached the optimizer's minimum for the current configuration (learning rate too high, trapped in a local minimum, or the model capacity is insufficient to perfectly calibrate all samples)","C":"Overfitting always causes training loss to increase, not plateau — the colleague is wrong","D":"Cross-entropy loss plateaus when label smoothing is not applied; adding label smoothing would allow further decrease"},"correct":"B","explanation":{"correct":"- Cross-entropy loss = 0 only when the model outputs probability 1.0 for the correct class on every sample. With 95% accuracy and loss=0.15, the model is still uncertain on many correctly classified samples (it predicts p=0.6 for the correct class, contributing -log(0.6)≈0.51 to loss).\n- Continued loss decrease would require sharper probabilities — the model assigning higher confidence to correct predictions, even if they're already classified correctly.\n- Loss plateau with non-zero gradients suggests: (a) oscillating near a sharp minimum (high learning rate), (b) the model has insufficient capacity to fit remaining hard examples, or (c) the optimizer is stuck. The distinction between \"loss plateau\" and \"overfitting\" is that overfitting shows increasing validation loss, not just plateau.","A":"Cross-entropy's lower bound of 0 is achievable only with perfect, confident predictions. 95% accuracy with 0.15 loss is far from this bound — the loss absolutely can decrease further if the model improves calibration.","B":"","C":"Overfitting causes training loss to keep decreasing (model memorizes) while validation loss increases. A training loss plateau is more likely a sign of optimization difficulty or insufficient capacity, not overfitting.","D":"Label smoothing (replacing hard 0/1 targets with soft targets like 0.9/0.1) would actually raise the loss floor slightly, not allow further decrease. 
It is used to prevent overconfidence, not to enable lower loss."},"reference":"- https://cs231n.github.io/neural-networks-3/#loss (monitoring training)"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05004","difficulty":"medium","orderIndex":4,"question":"A fraud detection model is trained on a dataset where 0.1% of samples are fraud. You use Cross-Entropy loss with default settings and achieve 99.9% accuracy. A business stakeholder says the model is excellent. A data scientist says the model is useless. Who is right and what loss function change would help?","options":{"A":"The stakeholder is right — 99.9% accuracy means only 1 in 1000 predictions is wrong","B":"The data scientist is right — 99.9% accuracy is achieved by predicting \"no fraud\" for every sample (the majority class baseline). Cross-entropy on imbalanced data allows the model to ignore the minority class. Fix: use Focal Loss, class-weighted CE, or resampling","C":"The data scientist is right, but the fix is to lower the classification threshold, not change the loss function","D":"Both are partially right — accuracy is valid but should be supplemented with F1 score; no loss function change is needed"},"correct":"B","explanation":{"correct":"- With 0.1% fraud: a model that always predicts \"not fraud\" achieves 99.9% accuracy. Standard CE loss minimizes average log-probability over all samples. 99.9% of samples are negative, so the model is incentivized to perfectly classify negatives and can ignore the minority class entirely.\n- Focal Loss (Lin et al., 2017): FL = -αₜ(1-pₜ)ᵞ log(pₜ). The (1-pₜ)ᵞ factor down-weights easy examples (correctly classified majority class with high confidence) and up-weights hard examples (minority class). 
This forces the model to focus on difficult/rare examples.\n- Alternatives: class-weighted CE (multiply minority class loss by a large factor), SMOTE/oversampling, or training with AUPRC as the optimization target.","A":"The \"excellent\" claim fails under scrutiny: if all 1000 incorrect predictions are fraud cases that went undetected, the model catches 0% of actual fraud. This is the worst possible fraud detector.","B":"","C":"Lowering the classification threshold changes the decision boundary but does not improve the model's learned probability estimates. If the model assigns 0.001 probability to fraud for all samples, no threshold adjustment can make it useful.","D":"F1 score helps evaluate model quality but doesn't fix the loss function problem. If the model outputs 99.9% class 0 probability for all samples, no threshold or evaluation metric change fixes the underlying training failure."},"reference":"- Lin et al., \"Focal Loss for Dense Object Detection\" (RetinaNet): https://arxiv.org/abs/1708.02002\n- https://scikit-learn.org/stable/auto_examples/classification/plot_imbalanced_dataset.html"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05005","difficulty":"medium","orderIndex":5,"question":"You are training a regression model and compare MSE vs Huber loss. During validation, you notice MSE is 50,000 and Huber loss (δ=1.0) is 10.5 for the same predictions. A junior data scientist says \"the Huber loss model is 4700× better.\" What is wrong with this comparison?","options":{"A":"Huber loss and MSE have different units, so they cannot be compared numerically","B":"The two loss values are not comparable because they measure different things — MSE penalizes every error quadratically while Huber is quadratic for |error| < δ and linear beyond. The numerical values have no meaningful ratio relationship. 
What matters is which model has lower validation RMSE (or another task-relevant metric), not which has lower absolute loss value","C":"MSE should be divided by sample count; the engineer forgot to average the loss","D":"Huber loss is always smaller than MSE by definition, so comparing them proves nothing"},"correct":"B","explanation":{"correct":"- MSE and Huber loss compute fundamentally different things. MSE = mean of squared errors. Huber with δ=1 = mean of (0.5·e² for |e|<1; |e|-0.5 for |e|≥1). A single outlier with error=100 contributes 10,000 to MSE but only 99.5 to Huber.\n- The 4700× difference doesn't mean the Huber model is 4700× more accurate — it means the two loss scales are incomparable. The MSE model might actually generalize better despite higher Huber loss.\n- To compare models, use the same metric for both: RMSE, MAE, or R² — any metric that doesn't change between models.","A":"MSE is in squared target units, while Huber loss mixes a quadratic region with a linear tail that is in the original units, so the units do differ. That observation is correct but incomplete: option B explains why the values cannot be compared and what to compare instead.","B":"","C":"Both losses should be averaged over samples. This doesn't explain why the values are orders of magnitude apart — that difference comes from the different functional forms, not averaging.","D":"For any given error, the Huber loss is indeed never larger than the squared error. For small errors (|e| < δ), Huber = 0.5e² ≤ e² = MSE contribution (smaller). For large errors, Huber = linear (smaller than quadratic). 
So Huber ≤ MSE for the same errors, but this isn't the point of the question."},"reference":"- Huber, P.J., \"Robust Estimation of a Location Parameter\" (1964)\n- https://pytorch.org/docs/stable/generated/torch.nn.HuberLoss.html"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05006","difficulty":"medium","orderIndex":6,"question":"A multi-label image classification model (an image can have multiple tags simultaneously) uses Cross-Entropy loss with softmax. After training, the model can only ever predict one tag per image even though images clearly have multiple. What is the root cause?","options":{"A":"The model needs more capacity to predict multiple outputs; increase the hidden layer size","B":"Softmax + Cross-Entropy forces the model to treat the problem as single-label (exactly one class is correct). Softmax normalizes probabilities to sum to 1, which correctly represents \"pick one\" but incorrectly models multi-label problems. Fix: use sigmoid per output with Binary Cross-Entropy for each label independently","C":"The learning rate is too high, causing the model to overfit to the most frequent label","D":"Multi-label classification requires a special loss function that sums over all correct labels; Cross-Entropy sums instead of averaging"},"correct":"B","explanation":{"correct":"- Softmax output: probabilities sum to 1.0. This probabilistic simplex constraint is exactly right for \"exactly one class is true\" (single-label). For multi-label problems, multiple classes can simultaneously be \"on\" — a photo can be both \"dog\" and \"outdoor.\"\n- When trained with softmax + CE, the model learns a probability distribution over classes — it learns to \"spend\" its probability budget on the most likely single class. 
Other classes get near-zero probability even if they are also correct.\n- Fix: use sigmoid independently per output (each output in (0,1) independently) + Binary Cross-Entropy for each label. This way, each label has its own independent probability, allowing any combination of labels.","A":"Capacity is not the issue. Even a very large model trained with softmax+CE will produce single-label predictions because the loss function fundamentally trains it to do so.","B":"","C":"Learning rate affects convergence speed but not the structural single-label vs multi-label behavior. This behavior appears at any learning rate.","D":"Cross-Entropy can be adapted for multi-label problems (summing BCE across labels), but the issue described is specifically about softmax forcing single-label outputs, not about CE's summation behavior."},"reference":"- https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html\n- https://cs231n.github.io/linear-classify/#softmax"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05007","difficulty":"medium","orderIndex":7,"question":"You train two models on the same regression task. Model A uses MSE loss and achieves RMSE=100 on the test set. Model B uses MAE loss and achieves MAE=80 on the test set. A product manager asks \"which model is better?\" What is the correct answer and what additional information would you need?","options":{"A":"Model B is better because 80 < 100","B":"You cannot directly compare these models without knowing their errors under the same metric. MSE-trained models minimize squared errors (penalizing outliers heavily), while MAE-trained models minimize absolute errors (treating outliers and small errors equally). 
RMSE=100 and MAE=80 are not on the same scale — compute both metrics for both models on the test set","C":"Model A is better because RMSE is the industry standard metric for regression","D":"The comparison is valid because RMSE and MAE have the same units; 80 < 100 means Model B is better"},"correct":"B","explanation":{"correct":"- RMSE and MAE both have the same units as the target, but RMSE ≥ MAE always (by Cauchy-Schwarz inequality). A typical relationship is RMSE ≈ 1.0-1.5× MAE for mildly skewed error distributions, and RMSE >> MAE when there are outliers.\n- Model A might have MAE=70 (better than Model B on absolute error) and high RMSE=100 due to a few large outliers. Model B might have no outliers at all. Comparing RMSE of one model to MAE of another is meaningless.\n- To compare: compute RMSE and MAE for both models on the same test set. Choose based on the business metric that matters: if outliers are costly (e.g., safety-critical predictions), prefer MSE-trained model with lower RMSE; if all errors are equal cost, prefer MAE-trained model.","A":"80 < 100 compares numbers but ignores that they measure different things. This is like comparing a weight in kilograms to a distance in miles and concluding \"5 miles > 3 kg.\"","B":"","C":"No single metric is \"the industry standard\" — it depends on the application. MSE/RMSE are common but MAE is preferred in many domains (economics, finance) where outliers are not penalized differently.","D":"Same units does not mean same scale. RMSE is always ≥ MAE for the same set of predictions. 
Comparing the RMSE of one model with the MAE of another mixes two different summaries of error, so the two numbers cannot be ranked against each other."},"reference":"- Chai & Draxler, \"Root mean square error (RMSE) or mean absolute error (MAE)?\" (2014): https://gmd.copernicus.org/articles/7/1247/2014/"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05008","difficulty":"hard","orderIndex":8,"question":"A language model is trained with Cross-Entropy loss on next-token prediction. Validation perplexity is 45. A second model trained on the same data achieves perplexity 30. A researcher claims \"Model 2 is 33% better.\" What is the correct interpretation of the perplexity difference and why is \"33% better\" misleading?","options":{"A":"The researcher is correct — (45-30)/45 = 33% improvement in perplexity","B":"Perplexity is exponential in cross-entropy loss: PPL = exp(H) where H is cross-entropy. The difference PPL₁ - PPL₂ = 15 in perplexity corresponds to a difference of ln(45) - ln(30) ≈ 0.405 nats in cross-entropy — a meaningful but not \"33%\" improvement. Perplexity differences are not linearly comparable; log-likelihood or bits-per-character are better for arithmetic comparisons","C":"The 33% improvement claim is correct but only applies to vocabulary size > 30,000","D":"Perplexity differences don't indicate model quality; only BLEU score matters for language models"},"correct":"B","explanation":{"correct":"- Perplexity = exp(cross-entropy). Going from PPL=45 to PPL=30 means reducing cross-entropy from ln(45)≈3.807 to ln(30)≈3.401, a reduction of ln(1.5)≈0.405 nats (or ≈0.585 bits). The actual CE improvement is ~10.7%, not 33%.\n- Perplexity is on an exponential scale. \"33% lower perplexity\" sounds like a large improvement, but on the underlying information-theoretic scale, it may be modest. 
Conversely, going from PPL=5 to PPL=4 (20% reduction) represents the same CE improvement as PPL=100 to PPL=80 (also 20%), but the lower-perplexity improvement is much harder to achieve.\n- Best practice: report cross-entropy in nats or bits-per-token for arithmetic comparisons; perplexity for intuitive interpretation (perplexity ≈ average branching factor at each prediction step).","A":"Arithmetic on perplexity values is misleading because perplexity is on an exponential scale. A 33% reduction in perplexity does not correspond to 33% better predictions in any information-theoretic sense.","B":"","C":"Perplexity interpretation doesn't change based on vocabulary size. The vocabulary size affects the range of reasonable perplexity values (max PPL = vocab_size for a uniform distribution), but arithmetic comparisons remain equally valid/invalid regardless of vocabulary size.","D":"BLEU score is a task-specific metric for translation and text generation. Perplexity is a valid and widely used metric for language model quality — it directly measures how well the model's probability distribution matches the test data."},"reference":"- Brown et al., \"Language Models are Few-Shot Learners\" (GPT-3): https://arxiv.org/abs/2005.14165 (perplexity reporting)\n- Jurafsky & Martin, \"Speech and Language Processing\", Chapter 3 (perplexity definition)"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05009","difficulty":"hard","orderIndex":9,"question":"A team trains an object detection model with Focal Loss (γ=2, α=0.25). They observe that after 10,000 steps, the model detects common objects (cars, people) well but almost completely ignores rare objects (fire hydrants, parking meters). They increase γ from 2 to 5, expecting Focal Loss to further down-weight the easy majority class. 
What is likely to happen and why?","options":{"A":"Higher γ will further focus on hard examples, solving the rare object problem","B":"Increasing γ too much will cause the model to focus on the hardest samples in the training set — which may include mislabeled examples, extremely occluded instances, and noise — rather than the rare objects. The model may degrade on common objects without improving on rare ones","C":"Higher γ has no effect above γ=2; Focal Loss saturates at γ=2","D":"The fix is to reduce γ to 0, which is standard cross-entropy and treats all examples equally"},"correct":"B","explanation":{"correct":"- Focal Loss: FL = -αₜ(1-pₜ)ᵞ log(pₜ). At γ=5, the weight factor (1-pₜ)⁵ becomes extremely small for easy examples (pₜ=0.9: weight = 0.1⁵ = 0.00001) and dominates for hard examples (pₜ=0.1: weight = 0.9⁵ = 0.59).\n- The problem: \"hard examples\" include rare objects but also mislabeled data, extremely occluded objects, and ambiguous cases. At γ=5, these noisy hard examples get enormous weight relative to clean easy examples, potentially causing the model to overfit to noise.\n- In practice, the original Focal Loss paper (Lin et al.) found γ=2 works well across different detection tasks. The rare object problem is better solved by class-balanced sampling, augmentation, or the α parameter, not extreme γ values.","A":"This is the naive expectation but ignores the noise amplification problem. The hardest examples are not necessarily the most informative ones — they may be genuinely ambiguous or mislabeled.","B":"","C":"Focal Loss does not saturate at γ=2. The function (1-pₜ)ᵞ continues to change with γ. The choice of γ=2 as default is empirical, not a mathematical saturation point.","D":"γ=0 gives standard CE (no example weighting). 
This would make the rare object problem worse, not better, by treating all examples equally in a class-imbalanced dataset."},"reference":"- Lin et al., \"Focal Loss for Dense Object Detection\" (2017): https://arxiv.org/abs/1708.02002 (Section 4: ablation study on γ)"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05010","difficulty":"hard","orderIndex":10,"question":"You are training a generative model and need to measure how close the model's output distribution Q is to the true data distribution P. A teammate proposes using KL divergence D_KL(P||Q). A researcher argues D_KL(Q||P) is more appropriate for this use case. What is the concrete behavioral difference between these two directions?","options":{"A":"The two directions are mathematically identical; KL divergence is symmetric","B":"D_KL(P||Q) = Σ P(x)·log(P(x)/Q(x)) weights the log-ratio by the data distribution P, so the model Q must assign probability to every region where P is non-zero. If Q is zero somewhere the true data has mass, the KL is infinite. This pushes Q to cover all of P's support, even at the cost of spreading mass into regions P doesn't cover (mode-covering). D_KL(Q||P) weights by the model Q instead: it penalizes Q for placing mass where P is low, pushing Q to concentrate on a few of P's modes while allowing it to miss others (mode-seeking)","C":"D_KL(Q||P) requires the model distribution Q to be differentiable, while D_KL(P||Q) works with any distribution","D":"The difference is only relevant for discrete distributions; for continuous generative models, both directions produce identical training dynamics"},"correct":"B","explanation":{"correct":"- Forward KL (D_KL(P||Q), \"inclusive\"): when P>0, requires Q>0. The model must \"cover\" all regions of the true distribution. This is used in Maximum Likelihood Estimation and leads to mode-covering behavior — the model spreads mass broadly to not miss any mode of the data.\n- Reverse KL (D_KL(Q||P), \"exclusive\"): when Q>0, requires P>0.
Penalizes the model for placing probability where the data has little or none. Leads to mode-seeking behavior — the model picks one or a few modes and concentrates mass there.\n- This is a core distinction between likelihood-trained models such as VAEs (forward KL via the ELBO, mode-covering, often blurry) and GANs (adversarial objectives that behave more like JS or reverse KL, mode-seeking, sharp but can miss modes). Normalizing flows are trained by exact maximum likelihood, so they fall on the forward-KL side.","A":"KL divergence is NOT symmetric. D_KL(P||Q) ≠ D_KL(Q||P) in general. This is a fundamental property of KL divergence. The symmetric version is Jensen-Shannon divergence: JS(P,Q) = 0.5·D_KL(P||M) + 0.5·D_KL(Q||M) where M = 0.5(P+Q).","B":"","C":"Both formulations require the distributions to be smooth enough for gradient computation. The differentiability requirement is the same for both directions — it's determined by the parameterization of the model, not the direction of KL.","D":"The mode-seeking vs mode-covering distinction is equally relevant for continuous distributions. It manifests as blurry vs sharp image generation in VAEs vs GANs."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 3.13 (KL divergence)\n- Wainwright & Jordan, \"Graphical Models, Exponential Families, and Variational Inference\" (2008)"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05011","difficulty":"medium","orderIndex":11,"question":"A model predicts whether a loan application should be approved. The business requires: \"false negatives (approved loans that default) cost 10× more than false positives (rejected loans that would have been fine).\" You train with standard BCE loss but the model performs suboptimally on this cost metric.
What change to the loss function captures this asymmetric cost?","options":{"A":"Use MSE loss which automatically weights false negatives more heavily than false positives","B":"Use class-weighted BCE loss where the weight for the positive class (defaulters) is set to 10: loss = -[10·y·log(p) + (1-y)·log(1-p)]. This multiplies the gradient for positive-class errors (false negatives) by 10×, training the model to prioritize avoiding them","C":"Apply a threshold of 0.1 instead of 0.5 during inference; no loss function change is needed","D":"Use a custom loss that penalizes predictions where p > 0.5 for negative class samples by 10×"},"correct":"B","explanation":{"correct":"- Weighted BCE: when a positive sample (defaulter, y=1) is misclassified, the loss contribution is 10× larger than for a negative sample (safe borrower, y=0). This directly reflects the business cost asymmetry in the optimization objective.\n- The class weight effectively says: \"the model should work 10× harder to correctly classify defaulters.\" The gradient for false-negative errors (missed defaulters) is 10× larger than for false-positive errors, pushing the decision boundary toward lower false-negative rates.\n- This is distinct from threshold adjustment (option C) because it changes the learned probability distribution, not just the post-hoc decision rule.","A":"MSE for binary classification does not inherently weight false negatives differently. MSE treats all prediction errors symmetrically regardless of label value.","B":"","C":"Threshold adjustment changes which predictions are labeled \"approve\" vs \"deny\" after training. However, the model's learned probabilities are still optimized for equal-cost errors. A well-calibrated model trained with cost-weighted loss will produce better probability estimates for the asymmetric cost structure.","D":"Penalizing high-probability negative predictions during training doesn't model the cost asymmetry correctly. 
The asymmetry is about error severity (missing a defaulter vs rejecting a good borrower), not about prediction confidence for negative samples."},"reference":"- scikit-learn class_weight parameter: https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05012","difficulty":"easy","orderIndex":12,"question":"Cross-entropy loss is defined as CE = -Σᵢ yᵢ·log(pᵢ). For a 5-class problem with true label class 3 and predicted probabilities [0.1, 0.1, 0.6, 0.1, 0.1], what is the cross-entropy loss, and what would it be if the model had output [0.01, 0.01, 0.96, 0.01, 0.01]?","options":{"A":"CE₁ = 0.6, CE₂ = 0.96 — CE equals the predicted probability of the correct class","B":"CE₁ = -log(0.6) ≈ 0.511, CE₂ = -log(0.96) ≈ 0.041 — CE is the negative log of the correct class probability; higher confidence on the correct class means lower loss","C":"CE₁ = -Σ log(pᵢ) over all classes ≈ -5·log(0.2) ≈ 8.05 for both, since the sum is over uniform distribution","D":"CE₁ = 1 - 0.6 = 0.4, CE₂ = 1 - 0.96 = 0.04 — CE equals 1 minus the correct class probability"},"correct":"B","explanation":{"correct":"- For one-hot labels, CE = -Σᵢ yᵢ·log(pᵢ) = -1·log(p_correct) (all other yᵢ = 0). CE reduces to just the negative log probability of the correct class.\n- CE₁ = -log(0.6) ≈ 0.511. CE₂ = -log(0.96) ≈ 0.041. The model in case 2 is much more confident and correct, incurring 12× lower loss.\n- This is why maximizing log-likelihood and minimizing cross-entropy are equivalent for classification: you're directly maximizing the log probability assigned to the correct class.","A":"CE = probability would make the loss a linear function of confidence. CE = -log(p) creates an asymmetric penalty: going from p=0.5 to p=1.0 reduces loss by log(2)≈0.69, while going from p=0.01 to p=0.5 reduces loss by log(50)≈3.9. 
The log ensures large penalties for very wrong confident predictions.","B":"","C":"CE uses the true label distribution (one-hot), not a uniform distribution. The sum over all classes collapses to one term because only the correct class has yᵢ = 1.","D":"CE = 1-p would be a linear loss function. The logarithm in CE provides the desirable property of infinite loss for p=0 (completely wrong confident prediction) and zero loss for p=1 (perfectly confident correct prediction)."},"reference":"- https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05013","difficulty":"hard","orderIndex":13,"question":"A team is training a knowledge distillation model where a student network is trained to match a teacher network's output probability distribution. They use Cross-Entropy between the student's softmax outputs and the teacher's softmax outputs (soft targets). The teacher uses temperature T=1 for soft targets. The student trains poorly — it collapses to predicting the same distribution as using hard one-hot labels. What is likely missing?","options":{"A":"Knowledge distillation requires a different optimizer than standard training","B":"The teacher's probabilities at T=1 are near one-hot (e.g., [0.99, 0.003, 0.003, ...]) — the soft targets barely differ from hard labels. Temperature scaling (T=3-5) should be applied to the teacher's logits before softmax to produce softer, more informative distributions that reveal the teacher's \"dark knowledge\" about class relationships","C":"The student network must have the same architecture as the teacher for knowledge distillation to work","D":"Cross-Entropy is inappropriate for distillation; KL divergence must be used instead"},"correct":"B","explanation":{"correct":"- A trained teacher network typically produces very confident predictions: softmax([10, 0.1, 0.1, ...]) ≈ [0.9999, 0.00005, 0.00005, ...]. 
At T=1, soft targets are nearly identical to hard one-hot labels, providing almost no additional information.\n- Temperature scaling: teacher_probs = softmax(logits/T). At T=4 the logits [10, 0.1, 0.1, ...] become [2.5, 0.025, 0.025, ...]; with 5 classes, softmax gives ≈ [0.75, 0.06, 0.06, 0.06, 0.06], which is much softer. With real teacher logits, where the non-target classes differ from one another, the student can now see which wrong classes the teacher considers more plausible (e.g., that it prefers classes 2 and 3 over class 4), even though class 1 is most likely. This \"dark knowledge\" encodes learned similarity between classes.\n- Hinton et al. (2015) used temperature T=3-20 in their original distillation work. The typical loss is a combination: L = α·CE(student, hard_labels) + (1-α)·KL(student_soft, teacher_soft).","A":"Knowledge distillation uses standard optimizers (Adam, SGD). No special optimizer is required.","B":"","C":"Knowledge distillation is specifically designed to work with different architectures (small student, large teacher). Same architecture is not a requirement and defeats the purpose of compression.","D":"KL divergence and CE on soft targets differ by only a constant when the targets are fixed (KL = CE - H(targets)). For distillation purposes, they are functionally equivalent. The issue is not the loss function but the temperature of the teacher's softmax."},"reference":"- Hinton et al., \"Distilling the Knowledge in a Neural Network\" (2015): https://arxiv.org/abs/1503.02531"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05014","difficulty":"hard","orderIndex":14,"question":"You are training a regression model to predict protein structure coordinates (x, y, z in Angstroms). A senior researcher insists on using Huber loss with δ=1.0 instead of MSE. A junior researcher argues: \"In protein structure, there are no outliers — all measurements are precise crystallography data.
Huber loss is unnecessary.\" Who is right?","options":{"A":"The junior researcher is right — Huber loss is only needed for datasets with measurement noise and outliers","B":"The senior researcher may be right for a different reason: even with precise measurements, some residues (protein sub-units) are structurally flexible and genuinely have multiple valid conformations. Predictions for these residues will always have high error regardless of model quality. MSE would penalize these inherently uncertain residues 10-1000× more than other residues, distorting learning. Huber loss reduces their influence","C":"Huber loss is never appropriate for coordinate regression; use MSE always","D":"The junior researcher is right for crystallography data but wrong for cryo-EM data"},"correct":"B","explanation":{"correct":"- \"Outliers\" in loss function context means \"data points with large residuals\" — not necessarily measurement errors. Flexible protein loops and disordered regions produce large prediction errors by nature (the true structure exists in an ensemble of conformations).\n- With MSE, these structurally ambiguous residues produce squared errors of 100-10,000 Å² compared to 1-4 Å² for well-structured regions. The model spends disproportionate gradient effort on hard-to-predict flexible regions at the expense of learning well-structured regions.\n- Huber loss with appropriate δ caps the influence of flexible residues, allowing the model to learn structured regions without being dominated by inherently ambiguous ones. 
AlphaFold2 uses multiple loss components including specialized handling for disordered regions.","A":"This conflates \"outlier as measurement error\" with \"outlier as prediction difficulty.\" The definition of \"outlier\" for loss function purposes is a sample with disproportionately large residual, regardless of cause.","B":"","C":"Huber loss is widely used in coordinate regression tasks including 3D object detection (bounding box regression uses smooth L1, which is Huber), robot control, and molecular modeling.","D":"The same argument applies to cryo-EM data and crystallography. Both have flexible/disordered regions. The structural biology challenge (multiple conformations) is independent of the measurement technique."},"reference":"- Jumper et al., \"Highly accurate protein structure prediction with AlphaFold\" (2021): https://www.nature.com/articles/s41586-021-03819-2\n- Object detection uses smooth L1 (Huber): https://arxiv.org/abs/1504.08083"},{"section":"deep-learning","topicSlug":"loss-and-cost-functions","topic":"Loss And Cost Functions","id":"dl-05015","difficulty":"hard","orderIndex":15,"question":"You train a variational autoencoder (VAE) and observe that the reconstruction loss decreases steadily but the KL divergence term collapses to near zero from the first epoch. The generated samples have high quality but show no diversity — all sampled images are nearly identical. What is this phenomenon and what causes it?","options":{"A":"The model is overfitting to the training data; reduce the number of parameters","B":"This is \"posterior collapse\" — the encoder ignores the input and maps all inputs to the prior N(0,I). The decoder learns to generate without using the latent code (from the prior alone). Mathematically: minimizing KL(q(z|x) || p(z)) pushes q toward the prior; if the decoder is powerful enough to reconstruct without z, the model collapses to doing exactly that. 
Fix: a β-weighted KL term with β < 1 (reduce KL weight) or KL annealing (gradually increase KL weight during training)","C":"The KL term collapsing to zero means the model has perfectly learned the posterior; this is the ideal training outcome","D":"Posterior collapse is caused by a learning rate that is too high; reduce the learning rate to prevent the KL from collapsing early"},"correct":"B","explanation":{"correct":"- VAE objective: maximize ELBO = E[log p(x|z)] - KL(q(z|x)||p(z)). The reconstruction term incentivizes using z; the KL term incentivizes q to be close to the prior (where z is uninformative).\n- If the decoder is powerful (e.g., an autoregressive decoder that can model all of p(x) without conditioning on z), it will generate correctly even when z is sampled from the prior with no x-specific information. The encoder then has no incentive to encode x into z, so it collapses to outputting the prior.\n- Fix options: (1) β-weighted KL: multiply KL by β < 1 to reduce its weight (note that the original β-VAE of Higgins et al. uses β > 1 to encourage disentanglement); (2) KL annealing: start with KL weight=0, slowly increase to 1 over training; (3) less powerful decoder (e.g., use a simple decoder that needs z to reconstruct); (4) free bits: guarantee minimum KL per dimension.","A":"Overfitting would cause high reconstruction accuracy on training data and poor on validation — the model would use latent codes to memorize training samples. Posterior collapse shows identical outputs regardless of input, which is the opposite: the latent code is unused.","B":"","C":"KL(q||p) = 0 means q exactly equals the prior for all inputs. This means the encoder has learned to output N(0,I) regardless of input x — z contains no information about x. The ideal training outcome is q being close to (but not equal to) the prior while also encoding x-specific information.","D":"Learning rate mainly affects the rate of convergence, not which solution the objective favors. With a powerful decoder, the posterior collapse equilibrium is a stable local minimum.
No learning rate setting prevents convergence to it once the model discovers this shortcut."},"reference":"- Lucas et al., \"Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse\" (2019): https://arxiv.org/abs/1911.02469\n- Higgins et al., \"β-VAE\" (2017): https://openreview.net/forum?id=Sy2fchgIW"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06001","difficulty":"easy","orderIndex":1,"question":"A neural network has 3 layers. You compute the forward pass successfully but during backpropagation, the gradient for the first layer is exactly zero for all weights. The loss is non-zero and the last layer's gradient is correct. What is the most likely cause?","options":{"A":"The first layer's weights are initialized to zero, causing zero gradients","B":"One of the intermediate activation functions (e.g., ReLU) has zero gradient for all inputs in that batch, effectively cutting off gradient flow. The chain rule multiplies gradients across layers — a zero at any layer zeroes all gradients to earlier layers","C":"The learning rate is too small, causing gradients to round to zero in float32","D":"Backpropagation only updates the last two layers by default in PyTorch; the first layer requires a separate optimizer call"},"correct":"B","explanation":{"correct":"- Chain rule in backpropagation: ∂L/∂W₁ = ∂L/∂a₂ · ∂a₂/∂z₂ · ∂z₂/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂W₁. If any term is zero (e.g., ∂a₁/∂z₁ = 0 because all neurons in layer 1 are dead ReLU), the entire product is zero.\n- This is the gradient \"cut\" — a zero in the chain rule propagates leftward and zeroes all earlier layers' gradients. The last layer is unaffected because its gradients don't depend on the earlier zero term.\n- Dead ReLU (all neurons with z<0) is the most common cause. 
Other causes: sigmoid saturated to 0 or 1 for all inputs, or a custom activation with zero derivative.","A":"Zero weight initialization causes symmetric gradients (all neurons compute the same thing) but not zero gradients — the gradients are identical across neurons but non-zero. The symmetry problem prevents specialization but doesn't zero the gradients.","B":"","C":"Float32 has ~7 decimal digits of precision. A gradient would need to be smaller than ~1e-38 (near underflow) to appear as zero. Learning rate affects the weight update magnitude, not the gradient magnitude itself.","D":"PyTorch backpropagation computes gradients for all parameters with requires_grad=True, including all layers. There is no \"default\" that stops at layer 2."},"reference":"- https://cs231n.github.io/optimization-2/ (computational graphs and chain rule)"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06002","difficulty":"easy","orderIndex":2,"question":"You are explaining backpropagation to a junior engineer. She asks: \"Why do we need to store intermediate activations during the forward pass? Can't we just recompute them during the backward pass?\" What is the correct technical response?","options":{"A":"We cannot recompute activations because PyTorch deletes the computation graph after the forward pass","B":"Recomputing activations is possible (gradient checkpointing does exactly this) but it trades memory for compute — storing activations avoids recomputing, but requires O(depth) memory. The choice depends on whether the bottleneck is memory or compute","C":"Intermediate activations must be stored because backpropagation requires them as inputs to the chain rule gradient computation (∂L/∂W depends on the activation value at that layer). 
Without storage, you'd have to redo the entire forward pass for every layer during backward","D":"Activations are stored in GPU VRAM automatically and cannot be freed until the next batch"},"correct":"C","explanation":{"correct":"- The gradient of a weight matrix W in layer k: ∂L/∂Wₖ = δₖ · aₖ₋₁ᵀ, where δₖ is the error signal at layer k (backpropagated from the next layer) and aₖ₋₁ is the activation from the previous layer. Both are required.\n- Without stored activations, the activation needed for ∂L/∂W can only be obtained by re-running the forward pass up to that point.\n- B is also technically correct (gradient checkpointing recomputes activations), but C is the fundamental reason activations are stored by default: correctness and efficiency. Gradient checkpointing is an optional memory optimization.","A":"PyTorch does keep the computation graph until `.backward()` is called. After calling `.backward()`, the graph is freed (unless `retain_graph=True`). The graph is not deleted immediately after the forward pass.","B":"Accurate as a description of gradient checkpointing, but it describes an optional memory/compute trade-off rather than the reason activations are stored by default.","C":"","D":"Activations saved for the backward pass are released once `.backward()` completes (unless `retain_graph=True`) or when the tensors that reference the graph are deleted. They are not permanently locked in VRAM. Gradient checkpointing explicitly frees them during the forward pass."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 6.5 (Back-Propagation and Other Differentiation Algorithms)\n- PyTorch gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06003","difficulty":"medium","orderIndex":3,"question":"You implement backpropagation manually for a 2-layer network and compare gradients to PyTorch's autograd. For the second layer, your gradients match exactly. For the first layer, yours are consistently 10× larger than PyTorch's. You didn't make an arithmetic error.
What is the most likely source of the discrepancy?","options":{"A":"PyTorch normalizes gradients by the number of layers during backpropagation","B":"PyTorch's default reduction in loss functions is 'mean' — if your manual implementation used 'sum' instead of averaging over the batch, your gradients would be batch_size× larger. If batch_size=10, your gradients would be 10× PyTorch's","C":"PyTorch clips gradients to prevent explosion; your manual implementation lacks this clipping","D":"The first layer has more parameters than the second, causing larger gradients in your manual implementation"},"correct":"B","explanation":{"correct":"- PyTorch's `nn.CrossEntropyLoss`, `nn.MSELoss`, etc. default to `reduction='mean'` — dividing the total loss by the batch size. If your manual implementation sums losses over the batch without dividing by batch size, all gradients are batch_size× larger.\n- For the second layer, the gradient magnitude matches because you may have implemented it correctly. The discrepancy in the first layer (10×) suggests a batch-size factor is applied somewhere between your second and first layer computation — likely in how the delta (error signal) is computed.\n- This is one of the most common bugs when implementing backprop manually: confusing `sum` and `mean` reduction, leading to incorrect learning rates for the actual gradient scale.","A":"PyTorch does not normalize gradients by number of layers. Gradients are computed via chain rule and may vary in magnitude by layer, but there is no normalization step.","B":"","C":"PyTorch does NOT clip gradients by default. Gradient clipping (`torch.nn.utils.clip_grad_norm_`) must be called explicitly. It would reduce gradient magnitude, not increase it by 10×.","D":"The number of parameters in a layer doesn't affect gradient magnitude for individual weights. 
Gradient magnitude depends on the error signal and activation values, not the weight count."},"reference":"- PyTorch loss reduction parameter: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06004","difficulty":"medium","orderIndex":4,"question":"You train a neural network and after 1000 steps, the training loss is still at its initial value. You inspect gradients and find they are all NaN. What sequence of events most likely caused gradient NaN, and what debugging steps would you take first?","options":{"A":"NaN gradients are caused by setting the learning rate to exactly 0.001; use 0.0001 instead","B":"NaN gradients typically trace back to a NaN loss, which traces back to NaN activations, which trace back to one of: (1) a NaN in the input data, (2) log(0) or 0/0 in the loss function (e.g., log(p) when a softmax probability underflows to exactly 0), or (3) exploding activations that overflow to infinity and then produce inf-inf or inf/inf. Debug: check inputs for NaN, add `torch.autograd.set_detect_anomaly(True)`, inspect intermediate activation norms","C":"NaN gradients always indicate a memory overflow; reduce batch size","D":"NaN gradients indicate the model has converged to a saddle point where gradients are undefined"},"correct":"B","explanation":{"correct":"- NaN propagates: NaN input → NaN activations → NaN loss → NaN gradients. The source is almost always upstream of where you observe NaN.\n- Common specific causes: (1) `torch.log(tensor)` where tensor contains 0 (log(0)=-inf, and 0·(-inf) or inf-inf then gives NaN); (2) 0/0 from division by a zero denominator (e.g., a hand-written normalization without an epsilon term applied to zero-variance inputs); (3) overflow from too-large activations (activation → inf, then inf × 0 = NaN in gradient).\n- `torch.autograd.set_detect_anomaly(True)` makes the backward pass raise an error at the first backward function that produces NaN, with a traceback of the forward operation that created it.
This is the recommended first debugging step.","A":"Learning rate value does not cause NaN gradients unless combined with an exploding gradient scenario where the loss landscape has extreme curvature. The learning rate itself is just a scalar multiplier.","B":"","C":"Memory overflow produces an out-of-memory (OOM) error, not NaN. NaN results from undefined operations like 0/0, ∞-∞, or 0·∞ (for example 0·log(0)). Memory issues and NaN are distinct failure modes.","D":"By definition the gradient at a saddle point is zero, and it is small but well-defined nearby. A saddle point or a \"flat\" region would therefore give zero or tiny gradients, not NaN; NaN requires an undefined mathematical operation."},"reference":"- PyTorch anomaly detection: https://pytorch.org/docs/stable/autograd.html#anomaly-detection"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06005","difficulty":"medium","orderIndex":5,"question":"A team implements a custom layer that uses a non-differentiable operation (argmax) in the forward pass. During backpropagation, PyTorch raises an error because argmax has no gradient. They ask: \"Can we still train with backpropagation?\" What are the two main approaches?","options":{"A":"No — non-differentiable operations fundamentally prevent gradient-based training","B":"Yes — two approaches: (1) Straight-Through Estimator (STE): pass the gradient through the argmax as if it were an identity function (∂argmax/∂input ≈ 1), accepting the approximation.
(2) Gumbel-Softmax: replace argmax with a differentiable soft approximation (temperature-controlled softmax) during training, use hard argmax during inference","C":"Yes — replace argmax with a sigmoid function which is a differentiable proxy","D":"Yes — use numerical differentiation (finite differences) to estimate gradients for the argmax layer"},"correct":"B","explanation":{"correct":"- STE (Hinton, 2012; Bengio et al., 2013): during backward pass, treat the non-differentiable operation as identity (∂output/∂input = 1). This is a biased heuristic (it works empirically despite not being the true gradient) and is the basis of training quantized neural networks (QNNs).\n- Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017): use softmax((log(π) + Gumbel_noise)/τ) during training (differentiable); samples approach hard one-hot vectors as τ→0, and hard argmax is used at inference. Used in discrete-latent VAEs. (VQ-VAE, by contrast, trains its discrete bottleneck with a straight-through estimator.)\n- Both are in active production use: STE in training binary/ternary networks, Gumbel-Softmax in discrete latent variable models.","A":"While argmax is non-differentiable, there are well-established workarounds used in production. The field has extensive work on training through discrete operations.","B":"","C":"Sigmoid produces values in (0,1) for a single output, not a one-hot selection across options. Sigmoid is appropriate for binary gates but not for multi-way selection like argmax.","D":"Numerical differentiation (finite differences) is extremely expensive for neural networks with millions of parameters (requires N forward passes for N parameters).
It is used for gradient checking, not for training."},"reference":"- Bengio et al., \"Estimating or Propagating Gradients Through Stochastic Neurons\" (STE, 2013): https://arxiv.org/abs/1308.3432\n- Jang et al., \"Categorical Reparameterization with Gumbel-Softmax\" (2017): https://arxiv.org/abs/1611.01144"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06006","difficulty":"medium","orderIndex":6,"question":"You compute gradients for a 10-layer network and plot gradient norm per layer. Layer 10 (closest to loss): norm = 1.0. Layer 5: norm = 0.01. Layer 1: norm = 0.0001. After adding BatchNorm after every 2 layers, the gradient norms become approximately equal across all layers. Why does BatchNorm have this effect on gradient magnitudes?","options":{"A":"BatchNorm clips gradients to be equal across layers","B":"BatchNorm normalizes pre-activations to zero mean and unit variance at each layer. This prevents the forward pass activations from shrinking/growing exponentially, which in turn prevents the backward pass gradients from shrinking/growing exponentially via the chain rule","C":"BatchNorm adds trainable skip connections that provide direct gradient paths to early layers","D":"BatchNorm reduces the learning rate for deep layers, compensating for otherwise smaller gradients"},"correct":"B","explanation":{"correct":"- The root cause of gradient decay is that each layer multiplies gradients by the Jacobian ∂aₖ/∂aₖ₋₁. If activations have small magnitude (common without normalization), the Jacobian entries are small, and gradients decay across layers.\n- BatchNorm normalizes activations to N(0,1) after each layer. 
This prevents the covariate shift (activation distribution shift) that causes Jacobian magnitudes to vary wildly, keeping gradient flow more stable.\n- More precisely: BatchNorm's γ and β parameters, combined with the normalization, effectively scale the gradient flow to be approximately 1.0 per layer, preventing the multiplicative decay.","A":"BatchNorm does not clip gradients. Gradient clipping is a separate technique (clip_grad_norm). BatchNorm's effect on gradients is through the normalization of forward activations, not through explicit gradient manipulation.","B":"","C":"BatchNorm does not add skip connections. Skip connections (ResNets) are an architectural choice. BatchNorm is an in-place normalization operation that does not create new paths in the computational graph between non-adjacent layers.","D":"BatchNorm does not modify the optimizer's learning rate. The learning rate is a hyperparameter of the optimizer. BatchNorm's effect on gradient uniformity comes from activation normalization, not learning rate scheduling."},"reference":"- Ioffe & Szegedy, \"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift\" (2015): https://arxiv.org/abs/1502.03167\n- Santurkar et al., \"How Does Batch Normalization Help Optimization?\" (2018): https://arxiv.org/abs/1805.11604"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06007","difficulty":"hard","orderIndex":7,"question":"You are training a recurrent network (vanilla RNN) on sequences of length 500. You observe that gradient norms for the hidden state at step t=490 are approximately 0.1, at t=400 are 10⁻⁴, and at t=1 are 10⁻²⁴. The loss is on the final step's output. 
What is happening and why do LSTM gates specifically address this?","options":{"A":"The gradient decay is caused by using BPTT (Backpropagation Through Time) which has fewer time steps for early states","B":"Each backpropagation step through the RNN multiplies the gradient by the recurrent weight matrix Wₕ. If the spectral radius ρ(Wₕ) < 1, gradients decay exponentially as ρ(Wₕ)^T where T is the number of steps. For example, with ρ(Wₕ) = 0.9 and T=490: 0.9^490 ≈ 10⁻²³. LSTM gates create additive (not multiplicative) paths for gradient flow through the cell state: C_t = f_t · C_{t-1} + i_t · g_t, where the forget gate f_t controls how much gradient flows backward — cells can maintain near-unit gradient flow for arbitrarily long sequences","C":"The vanishing gradient is caused by the tanh activation in the RNN output layer; replace with ReLU to fix","D":"Gradient norms below 0.1 indicate correct behavior — early time steps should have smaller gradients because they contribute less to the final loss"},"correct":"B","explanation":{"correct":"- Vanilla RNN gradient: ∂h_t/∂h_{t-k} = ∏ᵢ ∂h_{t-i+1}/∂h_{t-i} = ∏ᵢ Wₕᵀ · diag(tanh'(z_{t-i})). If ρ(Wₕ) < 1, this product decays geometrically. At 500 time steps, even ρ=0.99 gives 0.99^500 ≈ 0.0066.\n- LSTM cell state update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ tanh(Wₓx_t + Wₕh_{t-1}). Gradient through C_t: ∂C_t/∂C_{t-1} = f_t (element-wise multiplication, not matrix multiplication). The forget gate can be near 1.0, providing a near-unit gradient path.\n- This \"constant error carousel\" (Hochreiter & Schmidhuber, 1997) is the key LSTM innovation: replace multiplicative recurrent connections with additive cell state updates gated by learned gates.","A":"BPTT applies backpropagation through all time steps equally. 
Early steps t=1 receive gradients that have been backpropagated through all 499 steps between t=1 and t=500 — they don't get \"fewer steps,\" they get the compounded decay of all steps.","B":"","C":"The tanh in the hidden state transition (not just the output layer) contributes to gradient decay via tanh'(z) ≤ 1. However, the dominant effect is the multiplicative recurrence through Wₕ. Replacing output tanh with ReLU partially helps but does not solve the fundamental multiplicative gradient path.","D":"The gradient at t=1 representing the contribution of the very first input to the final loss should be non-negligible if the sequence has long-range dependencies. A gradient of 10⁻²⁴ means the network literally cannot learn any relationship between t=1 and the output."},"reference":"- Hochreiter & Schmidhuber, \"Long Short-Term Memory\" (1997): https://www.mitpressjournals.org/doi/10.1162/neco.1997.9.8.1735\n- Hochreiter, \"The vanishing gradient problem during learning recurrent neural nets\" (1998)"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06008","difficulty":"hard","orderIndex":8,"question":"A team implements gradient checking (comparing analytical gradients from backprop to numerical gradients via finite differences) and finds a relative error of 10⁻³ for one specific weight. The threshold for \"passing\" gradient check is typically 10⁻⁵ to 10⁻⁷. The team is debugging a custom loss function. 
What are the most likely causes of this specific elevated error?","options":{"A":"A relative error of 10⁻³ is within float32 numerical precision and should be ignored","B":"Possible causes: (1) a non-smooth operation in the loss function (e.g., absolute value at the kink, max at equality) where numerical and analytical gradients differ at the non-differentiable point; (2) a bug in the analytical gradient formula with the wrong coefficient; (3) using float32 (h ≈ 10⁻⁵ in finite differences, float32 precision ≈ 10⁻⁷, gives error floor ≈ 10⁻²) — switch to float64 for gradient checking","C":"The weight is in a batch normalization layer; gradient checking always fails for BN due to batch statistics","D":"A relative error of 10⁻³ specifically indicates a missing factor of 1000 in the gradient formula (off-by-1000 error)"},"correct":"B","explanation":{"correct":"- Gradient checking uses finite differences: (f(x+h) - f(x-h))/(2h). Float32 precision ≈ 10⁻⁷ limits accuracy. With h=10⁻⁵, float32 operations have error ~10⁻⁷/10⁻⁵ = 10⁻². So gradient checking in float32 with typical h has error floor of ~10⁻², not 10⁻⁶.\n- Always run gradient checks in float64. In float64, the error floor drops to ~10⁻¹¹, allowing detection of errors as small as 10⁻⁷.\n- Non-smooth operations (L1 loss, ReLU at exactly 0, max at equality) legitimately produce different analytical vs numerical gradients at the kink — but only for weights where the function is evaluated exactly at the non-smooth point.","A":"10⁻³ is not within float32 precision for analytical gradients. The analytical gradient computed via backpropagation is exact (within floating-point errors of ~10⁻⁷ for float32). A 10⁻³ relative error suggests either a float precision issue in the numerical check or a real bug.","B":"","C":"BatchNorm gradient checking is tricky because the batch statistics create coupling between samples, but it is not impossible. 
The issue is that finite differences change one weight at a time, while BN statistics change with each perturbation. Special care is needed, but it doesn't universally fail.","D":"A relative error of 10⁻³ means the two gradients differ by about 0.1%, not by a factor of 1000. A missing factor of 1000 (or a wrong sign, or a missing factor of 2) would give a relative error close to 1. An error of 10⁻³ points instead to a small missing term, evaluation near a non-smooth point, or a precision issue."},"reference":"- Gradient checking guide: https://cs231n.github.io/neural-networks-3/#gradcheck\n- Goodfellow et al., \"Deep Learning\", Chapter 6.5.6 (Gradient Checking)"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06009","difficulty":"hard","orderIndex":9,"question":"You train a deep network and observe that the gradient norm in layer 1 is 10⁴ (exploding). You apply gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`. The training stabilizes. A colleague argues that gradient clipping is \"cheating\" because it changes the true gradient direction. Is the colleague correct, and what does the clipped gradient actually compute?","options":{"A":"The colleague is correct — gradient clipping changes the gradient direction and introduces bias, making the training invalid","B":"The colleague is only partially correct: clipping alters the update, but not its direction. When the global gradient norm exceeds max_norm, all gradients are scaled by (max_norm / global_norm). This scales the gradient uniformly, preserving the relative ratios between individual parameter gradients (direction is preserved). It biases the update magnitude but not the direction. 
For exploding gradients, this is an acceptable approximation because the true gradient would cause parameter overflow anyway","C":"Gradient clipping doesn't change the direction because it clips each gradient independently to [-max_norm, max_norm]","D":"The colleague is wrong — gradient clipping computes the exact true gradient but with reduced precision"},"correct":"B","explanation":{"correct":"- `clip_grad_norm_` computes the global gradient norm G = √(Σᵢ ||∇wᵢ||²). If G > max_norm, it scales all gradients by max_norm/G. This is a uniform scaling that preserves the relative ratios between gradients (same direction, different magnitude).\n- Direction preservation: if ∇W = [100, -50, 25] and max_norm=1: scaled = [100, -50, 25] × (1/√(100²+50²+25²)) ≈ [0.87, -0.44, 0.22]. The direction (unit vector) is preserved.\n- Alternative: `clip_grad_value_` clips each gradient independently to [-max_value, max_value]. This does change the direction (gradient ratios change), which is generally worse.","A":"Gradient clipping is valid training practice used in LSTM training, Transformer training (where exploding gradients from attention are common), and GAN training. The \"bias\" introduced is intentional and necessary to prevent parameter overflow.","B":"","C":"This describes `clip_grad_value_` (per-element clipping), not `clip_grad_norm_` (global norm scaling). The two are different operations. `clip_grad_norm_` scales uniformly and preserves direction.","D":"Gradient clipping does not compute the \"exact true gradient.\" It explicitly modifies gradient magnitude. 
This is a deliberate approximation, not a precision issue."},"reference":"- Pascanu et al., \"On the difficulty of training recurrent neural networks\" (gradient clipping): https://arxiv.org/abs/1211.5063\n- PyTorch clip_grad_norm_: https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06010","difficulty":"hard","orderIndex":10,"question":"You are implementing a custom neural network operation `f(x) = x² · sin(x)` and need to register its backward function in PyTorch. A junior engineer implements it as:","codeSnippet":"class CustomOp(torch.autograd.Function):\n    @staticmethod\n    def forward(ctx, x):\n        ctx.save_for_backward(x)\n        return x**2 * torch.sin(x)\n\n    @staticmethod\n    def backward(ctx, grad_output):\n        x, = ctx.saved_tensors\n        grad_x = 2*x * torch.sin(x)  # Missing term\n        return grad_output * grad_x","options":{"A":"`ctx.save_for_backward(x)` is incorrect; use `ctx.x = x` instead","B":"The backward function is missing the x² · cos(x) term. The correct gradient is: f'(x) = 2x·sin(x) + x²·cos(x) (product rule: d/dx[x²·sin(x)] = 2x·sin(x) + x²·cos(x))","C":"`grad_output` should not be multiplied with the computed gradient — it is already the final gradient","D":"The forward function must return a new tensor created with `torch.empty_like(x)`; modifying x in-place is invalid"},"correct":"B","explanation":{"correct":"- f(x) = x² · sin(x). Applying the product rule: f'(x) = d(x²)/dx · sin(x) + x² · d(sin(x))/dx = 2x·sin(x) + x²·cos(x).\n- The bug: the implementation only computes 2x·sin(x), omitting the x²·cos(x) term. This is a partial product rule application.\n- The chain rule in PyTorch: if L = loss and y = f(x), then ∂L/∂x = ∂L/∂y · ∂y/∂x = grad_output · f'(x). The `return grad_output * grad_x` structure is correct, but grad_x is computed incorrectly.","A":"Both `ctx.save_for_backward(x)` and `ctx.x = x` can store tensors. 
However, `save_for_backward` is the correct API for autograd functions — it ensures proper memory management and version tracking. `ctx.x = x` can cause issues with in-place operations and is not recommended.","B":"","C":"`grad_output` IS the upstream gradient ∂L/∂y. The chain rule requires multiplying it by the local gradient ∂y/∂x = f'(x). The structure `return grad_output * grad_x` is correct — grad_output must be multiplied by the local gradient.","D":"The forward function creates a new tensor via `x**2 * torch.sin(x)` — this is not in-place modification of x. In-place operations would use `x **= 2` or `x.sin_()`."},"reference":"- PyTorch custom autograd functions: https://pytorch.org/docs/stable/notes/extending.html"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06011","difficulty":"medium","orderIndex":11,"question":"You train a Transformer model and notice that gradients for the first few layers are consistently larger than gradients for the last few layers — the opposite of the vanishing gradient problem. What architectural feature of standard Transformers causes this gradient pattern and is it problematic?","options":{"A":"The attention mechanism amplifies gradients in early layers due to the softmax operation","B":"Pre-LN Transformers (LayerNorm before attention/FFN sublayers) can have this reversed gradient pattern. The skip connections in early layers have not yet contributed their normalization effect, while later layers' gradients pass through more LayerNorm normalizations which reduce gradient magnitude. Post-LN (original) Transformers can show the opposite (vanishing early gradients). 
Reversed gradients are not inherently problematic — they indicate gradients are flowing backward strongly through skip connections","C":"This is the exploding gradient problem; apply gradient clipping immediately","D":"Transformers with more than 12 layers always show reversed gradient patterns; this is expected and desirable"},"correct":"B","explanation":{"correct":"- Pre-LN (used in GPT-2, GPT-3, most modern LLMs): LN is applied before each sublayer. Skip connections carry gradients directly, and the LN in earlier layers has less accumulated normalization effect. This can produce larger gradient norms in early layers.\n- Post-LN (original \"Attention is All You Need\"): LN is applied after each sublayer + skip connection. Early layers have gradients that must pass through more LN layers to reach the input, potentially reducing them.\n- Neither pattern is \"problematic\" by itself — both architectures have been used successfully. The key metric is whether gradients flow effectively through all layers (non-zero, finite), not whether they decrease or increase with depth.","A":"Softmax in attention does not specifically amplify gradients in early layers. The softmax gradient is bounded by the softmax probabilities (max gradient = 0.25 per element for two-class case). Softmax does not cause systematic early-layer amplification.","B":"","C":"Gradients being larger in early layers is not by itself exploding gradients. Exploding gradients means norms are exponentially large (10³-10⁶), not just \"larger than later layers.\" Healthy training may have 2-5× variation in gradient norms across layers.","D":"Reversed gradient patterns depend on architecture (Pre-LN vs Post-LN) and initialization, not layer count alone. 
It is not universally expected or required for all Transformers."},"reference":"- Xiong et al., \"On Layer Normalization in the Transformer Architecture\" (Pre-LN analysis): https://arxiv.org/abs/2002.04745"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06012","difficulty":"easy","orderIndex":12,"question":"Consider the function f(x) = max(x, 0) (ReLU). At x=0, the derivative is technically undefined (left derivative = 0, right derivative = 1). PyTorch uses a subgradient of 0 at x=0. In practice, why doesn't this cause training problems?","options":{"A":"PyTorch avoids x=0 by adding a small epsilon to all inputs before applying ReLU","B":"The probability that any floating-point number equals exactly 0 after arbitrary network computations is essentially zero. In practice, the gradient at x=0 is never evaluated — all actual pre-activations are either clearly positive or clearly negative","C":"PyTorch uses a differentiable approximation to ReLU (softplus) internally even when you call nn.ReLU()","D":"Subgradient of 0 at x=0 is actually the mathematically optimal choice and makes training faster"},"correct":"B","explanation":{"correct":"- Floating-point numbers have finite precision, but a pre-activation computed from generic real-valued weights and inputs is very unlikely to land on exactly 0.0 in practice.\n- Even if a pre-activation were exactly 0 due to symmetry or specific initialization, the next gradient update would typically move it away from exactly 0. The zero subgradient convention is rarely triggered.\n- This is why the theoretical non-differentiability of ReLU at 0 is a non-issue in practice — thousands of papers and production systems use ReLU without ever encountering problems from the x=0 gradient.","A":"PyTorch does not add epsilon to ReLU inputs. This would change the function to max(x+ε, 0) which has different behavior. 
No such modification is applied.","B":"","C":"PyTorch's nn.ReLU() computes max(0,x) exactly, not Softplus. Softplus(x) = log(1+eˣ) is a separate function available as nn.Softplus(). Users who want smooth approximations must explicitly request Softplus.","D":"The subgradient of 0 at x=0 is a convention (PyTorch's choice); a subgradient of 1 would also be valid mathematically. It is not \"optimal\" — the choice at a measure-zero point has no practical impact on training."},"reference":"- https://cs231n.github.io/neural-networks-1/#actfun (ReLU practical notes)"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06013","difficulty":"medium","orderIndex":13,"question":"A researcher claims: \"Backpropagation is a special case of the chain rule applied to a computational graph, so it doesn't actually 'know' about layers — it works the same way on any computation.\" A student challenges: \"But residual networks (ResNets) work differently because the skip connections create multiple gradient paths.\" Who is correct?","options":{"A":"The student is correct — skip connections require a modified version of backpropagation","B":"The researcher is correct — backpropagation is the chain rule applied to any computational graph, including graphs with skip connections. 
ResNets do create multiple gradient paths (the skip path and the residual path), but standard autograd handles this by summing gradients at the junction node (when a node has multiple downstream consumers, gradients from each consumer are summed)","C":"Both are correct — standard backprop handles simple sequential graphs; modified algorithms are needed for graphs with cycles or skip connections","D":"The student is correct for the gradient of the skip connection weights but wrong for the residual branch weights"},"correct":"B","explanation":{"correct":"- The chain rule on a computation graph: for any node with multiple downstream consumers, the total gradient is the sum of gradients from each consumer path. For ResNet's addition node: ∂L/∂x = ∂L/∂(x + F(x)) · (1 + ∂F(x)/∂x). The \"1\" is the gradient from the skip path, \"∂F(x)/∂x\" is from the residual path.\n- PyTorch's autograd builds a dynamic computational graph and applies reverse-mode automatic differentiation — which is exactly backpropagation generalized to arbitrary DAGs. No special handling is needed for skip connections.\n- This is why torch.autograd, TensorFlow, and JAX handle any directed acyclic graph (DAG) of differentiable operations, including ResNets, DenseNets, and arbitrary architectures.","A":"Skip connections do not require a \"modified version\" of backpropagation. They create a richer computational graph, but the same chain rule and gradient accumulation rules apply.","B":"","C":"Standard backprop handles DAGs (any graph without cycles). Recurrent networks (cycles in time) require BPTT (unrolling the cycle into a DAG). True cycles are unrolled or handled with specific algorithms, but skip connections are not cycles — they are just multiple paths in a DAG.","D":"In a ResNet, the residual branch F(x) has weights and the skip connection is a direct identity (no weights). The gradient for F(x)'s weights follows the same chain rule as any other layer. 
No special treatment is needed for the skip-connected layer's gradients."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2016): https://arxiv.org/abs/1512.03385\n- Baydin et al., \"Automatic Differentiation in Machine Learning: a Survey\": https://arxiv.org/abs/1502.05767"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06014","difficulty":"hard","orderIndex":14,"question":"You are training a model and want to debug gradients. You write:","codeSnippet":"loss = criterion(outputs, labels)\nloss.backward()\nfor name, param in model.named_parameters():\n print(f\"{name}: grad={param.grad}\")","options":{"A":"Parameters with `grad=None` are in layers with learning rate = 0; set uniform learning rate","B":"Multiple causes: (1) the parameter is not in the computational graph leading to the loss (e.g., a layer that is defined but never called in forward()), (2) the parameter has `requires_grad=False`, (3) the computation was done inside `torch.no_grad()` context which suppresses gradient tracking. Fix: verify the layer is called in forward(), check requires_grad, ensure backward is called outside no_grad","C":"`grad=None` always means the gradient is zero; rename `None` to 0 for visualization","D":"PyTorch only computes gradients for the last layer by default; add `.retain_grad()` to earlier layers"},"correct":"B","explanation":{"correct":"- `param.grad` is None (not zero) when the parameter was never \"touched\" by any operation in the computational graph during the current forward pass. PyTorch only allocates gradient tensors for parameters that participated in the computation.\n- Common case 1: a layer defined in `__init__` but never called in `forward()`. The parameter exists but has no graph path to the loss.\n- Common case 2: `requires_grad=False` (e.g., frozen parameters). 
These are deliberately excluded from gradient computation.\n- Common case 3: `with torch.no_grad(): output = model(x)` — operations inside no_grad don't track gradients. Calling .backward() afterward won't have graph information.","A":"Learning rate is applied during the optimizer step (after gradient computation). It does not affect whether gradients are computed or whether param.grad is None.","B":"","C":"`grad=None` is distinctly different from `grad=torch.zeros(...)`. None means the gradient was never computed (parameter not in graph). Zero means the gradient was computed and happened to be zero (e.g., for a dead ReLU neuron). Conflating them would hide important debugging information.","D":"PyTorch computes gradients for ALL parameters with requires_grad=True that appear in the computational graph, not just the last layer. The issue with earlier layers having None grad is specifically about whether they appeared in the graph, not about depth."},"reference":"- PyTorch autograd mechanics: https://pytorch.org/docs/stable/notes/autograd.html"},{"section":"deep-learning","topicSlug":"backpropagation","topic":"Backpropagation","id":"dl-06015","difficulty":"hard","orderIndex":15,"question":"A production ML system accumulates gradients over 8 steps before each optimizer update (gradient accumulation to simulate a larger batch). The code is:","codeSnippet":"for i, (x, y) in enumerate(dataloader):\n    outputs = model(x)\n    loss = criterion(outputs, y) / 8\n    loss.backward()\n    if (i + 1) % 8 == 0:\n        optimizer.step()\n        optimizer.zero_grad()","options":{"A":"Nothing changes — gradient accumulation is scale-invariant","B":"Without `/ 8`, each of the 8 mini-batch losses is computed with full scale. After accumulation, the total gradient is 8× larger than it would be for a single large batch of 8× mini-batch size. The optimizer step effectively uses a learning rate 8× larger than intended. This often causes training instability. 
The `/ 8` correctly scales down each mini-batch loss so that the accumulated gradient matches what you'd get from a single large batch","C":"Removing `/ 8` improves convergence because larger gradients give stronger learning signal","D":"The `/ 8` is only needed when using Adam optimizer; for SGD, gradient accumulation works correctly without scaling"},"correct":"B","explanation":{"correct":"- Gradient accumulation goal: simulate processing a batch of 8×mini_batch_size. A true large batch computes loss = mean(losses over 8×N samples). This is equivalent to mean(mean(losses over N samples) for each of 8 mini-batches) — the outer mean contributes the /8 factor.\n- Without /8: accumulated gradient = Σᵢ ∇Lᵢ (sum of 8 mini-batch gradients). With a true large batch: gradient = (1/8)·Σᵢ ∇Lᵢ. The difference is 8×.\n- With SGD, an 8× larger gradient is equivalent to multiplying the learning rate by 8. For learning rates tuned for a specific batch size, this change often causes gradient explosion or training instability.","A":"Gradient scale matters for the effective learning rate. With plain SGD, step sizes are directly proportional to gradient magnitude, so accumulating 8× larger gradients is equivalent to using an 8× learning rate.","B":"","C":"\"Stronger learning signal\" from 8× gradients is equivalent to 8× learning rate — which is outside the stable range for most problems. Stronger gradients are not inherently better; they must be appropriately scaled.","D":"This has it backwards. Adam divides the first moment by the square root of the second moment, so a constant 8× scaling of the gradient largely cancels in m̂/√v̂ (up to ε). Plain SGD has no such normalization, so its step grows 8× without /8. 
The scaling therefore matters most for SGD, and it is still good practice with Adam so that logged losses, gradient-clipping thresholds, and ε behave as intended."},"reference":"- Goyal et al., \"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour\" (learning rate scaling rule): https://arxiv.org/abs/1706.02677"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07001","difficulty":"easy","orderIndex":1,"question":"You train a neural network using SGD with learning rate 0.1 and observe very noisy loss curves — the loss zigzags up and down across consecutive batches. Switching to SGD with Momentum (β=0.9) smooths the curve. What mathematical operation does momentum perform that causes this smoothing?","options":{"A":"Momentum averages the learning rate across the last 10 batches","B":"Momentum maintains a velocity vector v that is an exponential moving average of past gradients: v_t = β·v_{t-1} + (1-β)·g_t, then updates w = w - α·v_t. High-frequency gradient noise (opposite-sign gradients on consecutive steps) is averaged out, while consistent gradient directions accumulate velocity","C":"Momentum clips individual gradient values to reduce extreme updates that cause loss spikes","D":"Momentum applies the gradient only when it agrees with the gradient from the previous step, skipping updates otherwise"},"correct":"B","explanation":{"correct":"- The exponential moving average acts as a low-pass filter: noisy high-frequency oscillations (gradients that alternate sign) get averaged to near zero in v_t. Consistent gradient directions (signal) accumulate in v_t and produce larger effective steps.\n- With β=0.9: the effective gradient is a weighted sum of the last ~1/(1-β) = 10 gradients. Random noise that doesn't correlate across batches averages out; true gradient direction (consistent across batches) is amplified.\n- This is why momentum helps in ravine-shaped loss surfaces: in the narrow direction (high curvature), gradients oscillate and are averaged out. 
In the long flat direction (low curvature), gradients consistently point the same way and accumulate velocity.","A":"Momentum does not average the learning rate. The learning rate remains fixed. Momentum maintains a gradient velocity, not a learning rate average.","B":"","C":"Gradient clipping is a separate technique (`clip_grad_norm_`) that caps gradient magnitude. Momentum does not clip — it exponentially averages gradients, which can increase effective magnitude in consistent directions.","D":"Momentum always updates — it never conditionally skips updates. The velocity is updated and applied regardless of gradient sign agreement with the previous step."},"reference":"- Polyak, B.T., \"Some methods of speeding up the convergence of iteration methods\" (1964)\n- https://cs231n.github.io/neural-networks-3/#sgd (SGD + momentum)"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07002","difficulty":"easy","orderIndex":2,"question":"A team trains a model with Adam optimizer using default hyperparameters (lr=0.001, β₁=0.9, β₂=0.999). At the first training step, the first moment m₁ ≈ 0.1·g and the second moment v₁ ≈ 0.001·g². The Adam update uses m₁/(√v₁ + ε) instead of g. What bias correction does Adam apply and why?","options":{"A":"Adam scales the learning rate by (1-β₁)/(1-β₂) to normalize between the two moment estimates","B":"Adam applies bias correction: m̂_t = m_t/(1-β₁ᵗ) and v̂_t = v_t/(1-β₂ᵗ). At t=1: m̂₁ = m₁/(1-0.9) = 10·m₁ and v̂₁ = v₁/(1-0.999) = 1000·v₁. 
This corrects for the fact that m₁ and v₁ are initialized to 0 and are biased toward 0 at early steps","C":"Bias correction is only applied when the learning rate exceeds 0.01; for standard lr=0.001, no correction is needed","D":"Adam's bias correction divides by t (step count) to implement learning rate decay automatically"},"correct":"B","explanation":{"correct":"- Without bias correction: m₁ = (1-0.9)·g = 0.1·g (initialized at 0, so m₁ is 10× smaller than the true first moment estimate). This would cause Adam to take very small steps at the beginning of training.\n- With correction: m̂₁ = 0.1·g / (1-0.9) = g (corrects back to the true gradient value). At early steps, (1-β₁ᵗ) → 0, giving large correction. As t→∞, (1-β₁ᵗ) → 1, and correction vanishes.\n- This is the key innovation in the original Adam paper: bias correction ensures the effective learning rate is stable from the very first step, enabling Adam to work reliably without warm-up for most problems.","A":"Adam does not compute a ratio of (1-β₁)/(1-β₂). The correction is applied independently to m and v before computing the ratio m̂/√v̂.","B":"","C":"Bias correction is applied at every step, regardless of learning rate. The correction term (1-βᵗ) depends only on the step count t, not on the learning rate value.","D":"Adam's bias correction does not implement learning rate decay. (1-β₁ᵗ) increases from 0 toward 1 as t grows — it is a bias correction factor that increases over time (decreasing correction), not a decay factor that decreases the step size."},"reference":"- Kingma & Ba, \"Adam: A Method for Stochastic Optimization\" (2014): https://arxiv.org/abs/1412.6980"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07003","difficulty":"medium","orderIndex":3,"question":"You fine-tune a large language model (1B parameters) and observe that training loss decreases but validation loss is slightly higher than expected. Switching from Adam to AdamW fixes this gap. 
What is the difference between Adam and AdamW, and why does it matter for large model fine-tuning?","options":{"A":"AdamW uses a larger learning rate internally, which improves generalization","B":"Adam implements L2 regularization by adding λ·w to the gradient before the Adam update: g' = g + λ·w. AdamW implements weight decay directly by subtracting α·λ·w from weights alongside the Adam update: w' = w - α·(m̂/√v̂) - α·λ·w. In Adam, the adaptive scaling also applies to the regularization term (g' is divided by √v̂), weakening it for parameters with large historical gradient magnitudes. AdamW applies weight decay at full strength regardless of gradient history","C":"AdamW uses a different β₁ default (0.99 vs 0.9), which prevents overfitting in large models","D":"Adam is numerically unstable for models above 100M parameters; AdamW adds numerical stabilization"},"correct":"B","explanation":{"correct":"- Adam with L2: the regularization term λ·w is treated as part of the gradient and is scaled by 1/√(v̂). For parameters with large gradient history (high v̂), the effective regularization is weak (divided by large √v̂). This observation motivates decoupled weight decay.\n- AdamW: weight decay is applied as a separate multiplicative factor on the weights: w_new = (1-α·λ)·w - α·m̂/√v̂. The weight decay term is not affected by the adaptive scaling — every parameter gets the same proportional weight decay.\n- For LLM fine-tuning: many parameters have very consistent large gradients (high v̂), making Adam's L2 regularization near-zero for those parameters. AdamW ensures all parameters are properly regularized, explaining the improved validation performance.","A":"AdamW does not use a larger internal learning rate. The learning rate hyperparameter is the same. The improvement comes from correct decoupling of weight decay from gradient scaling.","B":"","C":"AdamW uses the same default β₁=0.9 as Adam. 
The difference is in the weight decay implementation, not the momentum hyperparameters.","D":"Both Adam and AdamW are numerically stable for large models. AdamW's improvement is about regularization correctness, not numerical stability."},"reference":"- Loshchilov & Hutter, \"Decoupled Weight Decay Regularization\" (AdamW, 2017): https://arxiv.org/abs/1711.05101\n- Ilya Loshchilov's blog post explaining the Adam L2 vs AdamW distinction"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07004","difficulty":"medium","orderIndex":4,"question":"A team trains a model with a cosine learning rate schedule starting at lr=0.1 and decaying to 0 over 50 epochs. After 50 epochs (lr≈0), they want to continue training for 50 more epochs. They reset the cosine schedule to restart from lr=0.1. Training improves significantly compared to continuing at lr≈0. What is this technique and why does it work?","options":{"A":"This is learning rate warm-up, which is standard practice for all deep learning training","B":"This is Stochastic Gradient Descent with Warm Restarts (SGDR / Cosine Annealing with Restarts). Restarting the schedule from a high learning rate allows the optimizer to \"escape\" local minima or sharp minima that the model has converged to. The sharp minimum found at low learning rate may generalize poorly; restarting explores broader loss landscape regions that may contain wider (better-generalizing) minima","C":"This is cyclical learning rate training, which only works with SGD, not with Adam","D":"The improvement is not related to the schedule restart but to the fact that they trained for 100 total epochs instead of 50"},"correct":"B","explanation":{"correct":"- SGDR (Loshchilov & Hutter, 2017): restart cosine schedule periodically. High learning rates explore broadly; low learning rates fine-tune. 
Restart re-explores from a high learning rate, potentially escaping into wider loss basins.\n- Sharp minima hypothesis: flat/wide minima generalize better than sharp minima (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017). High learning rates tend to find flatter minima because they cannot settle into narrow, sharp valleys. Restarting after reaching a low LR escapes the current (potentially sharp) basin.\n- Snapshot ensembling (Huang et al., 2017) saves model checkpoints at each LR minimum to build ensembles from a single training run.","A":"Learning rate warm-up is the practice of starting from a very small LR and increasing to the target LR over the first few steps/epochs. SGDR is the opposite concept — restarting from a high LR after convergence.","B":"","C":"Cosine annealing with restarts works with any optimizer including Adam. Loshchilov uses SGD in the original paper, but the schedule is optimizer-independent.","D":"The comparison is explicitly with \"continuing at lr≈0\" for 50 more epochs — the same total 100 epochs. The improvement is specifically from the LR restart, not from additional training time."},"reference":"- Loshchilov & Hutter, \"SGDR: Stochastic Gradient Descent with Warm Restarts\" (2017): https://arxiv.org/abs/1608.03983\n- Huang et al., \"Snapshot Ensembles: Train 1, get M for free\" (2017): https://arxiv.org/abs/1704.00109"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07005","difficulty":"medium","orderIndex":5,"question":"RMSProp updates weights using: w = w - α·g/√(E[g²] + ε). A colleague says: \"RMSProp is identical to Adam without the first moment (momentum).\" After testing, you find RMSProp and Adam-without-momentum converge differently. 
Who is right and what is the technical difference?","options":{"A":"The colleague is correct — RMSProp and Adam-without-momentum are mathematically identical","B":"The colleague is approximately right but technically wrong: RMSProp uses an exponential running average (E[g²]_t = ρ·E[g²]_{t-1} + (1-ρ)·g_t²) with no bias correction. Adam applies bias correction to the second moment: v̂_t = v_t/(1-β₂ᵗ). At early steps, RMSProp's E[g²] is biased toward 0 (small), making step sizes larger than Adam's corrected steps. The two updates coincide only after many steps, when the bias correction becomes negligible","C":"RMSProp uses the absolute gradient |g| while Adam uses g² for the second moment","D":"The difference is that Adam clips the second moment to prevent explosion, while RMSProp does not"},"correct":"B","explanation":{"correct":"- RMSProp: E[g²]_t = ρ·E[g²]_{t-1} + (1-ρ)·g_t². Initialized to 0. At t=1: E[g²]_1 = (1-ρ)·g₁², which is biased toward 0 by factor (1-ρ).\n- Adam second moment: v_t = β₂·v_{t-1} + (1-β₂)·g_t², then corrected: v̂_t = v_t/(1-β₂ᵗ). At t=1: v₁ = (1-β₂)·g₁², v̂₁ = g₁². The bias correction restores the true estimate of g².\n- Consequence: RMSProp at early steps has small E[g²], giving large step sizes. Adam's bias correction gives stable step sizes from step 1. They converge to the same update rule as t→∞ when both biases vanish.","A":"They are not mathematically identical. The bias correction in Adam for the second moment creates different behavior at early steps, which can meaningfully affect training trajectory.","B":"","C":"Both RMSProp and Adam use g² (squared gradient) for the second moment estimate. Absolute value |g| is not used in either.","D":"Neither RMSProp nor standard Adam explicitly clips the second moment. 
The epsilon (ε=10⁻⁸) prevents division by zero but is not a \"clip\" on the second moment."},"reference":"- Tieleman & Hinton, \"Lecture 6.5 — RMSProp\" (2012): Coursera slides\n- Kingma & Ba, \"Adam\" (2014): https://arxiv.org/abs/1412.6980"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07006","difficulty":"medium","orderIndex":6,"question":"You train a model with Adam and observe: the training loss decreases normally for the first 1000 steps, then suddenly \"resets\" — loss jumps back up and the model appears to \"forget\" what it learned. This happens at exactly step 1000 when the second moment estimate v_t becomes a reliable long-term average (no longer dominated by bias correction). What optimizer configuration is likely causing this?","options":{"A":"The learning rate is too high at step 1000","B":"This is the \"Adam learning rate collapse\" phenomenon: at early steps, bias correction makes v̂_t large (small denominator effects are corrected away), giving large effective step sizes. As v_t accumulates history (bias correction effect diminishes), the effective step size can change dramatically. If combined with a warmup schedule that ends exactly at step 1000, the LR change may be sharp enough to cause apparent \"forgetting\"","C":"Adam's β₂ accumulates variance, causing the effective learning rate to increase at step 1000 and destabilize training","D":"The optimizer should be switched to SGD after 1000 steps as Adam is only effective in early training"},"correct":"B","explanation":{"correct":"- Effective step size in Adam = α·m̂_t/√v̂_t. As training progresses: m̂_t converges to a smooth gradient estimate, and v̂_t converges to the long-run average squared gradient. 
The ratio m̂_t/√v̂_t can stabilize differently than at early steps.\n- A common issue: if using a warmup schedule that ends at step 1000 with a sharp LR transition, combined with the natural stabilization of Adam's moments, the effective step sizes can change abruptly.\n- The specific description of \"resets at exactly step 1000\" suggests a scheduled event (warmup end, LR change) rather than a natural Adam phenomenon. Diagnosis: plot effective learning rate = α/√v̂_t per parameter over time to see the actual step size trajectory.","A":"\"Learning rate too high\" would cause instability from the beginning, not specifically at step 1000. High LR manifests as oscillating or diverging loss from early steps.","B":"","C":"v_t accumulates squared gradient information, causing the effective learning rate to decrease (denominator grows), not increase. The effective learning rate in Adam generally decreases over time as the second moment accumulates.","D":"Adam is not limited to \"early training\" — it is used effectively for full training runs in most modern deep learning. Switching to SGD mid-training without careful LR scheduling would be highly disruptive."},"reference":"- Ma & Yarats, \"Quasi-hyperbolic momentum and Adam for deep learning\" (2019): https://arxiv.org/abs/1810.06801"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07007","difficulty":"hard","orderIndex":7,"question":"A researcher trains the same ResNet-50 model with: (A) Adam (lr=1e-3), (B) SGD+Momentum (lr=0.1, momentum=0.9), and (C) AdamW (lr=1e-3, weight_decay=0.01). She observes final test accuracy: SGD > AdamW > Adam. 
She concludes \"SGD is always better than Adam for image classification.\" What is the correct nuanced interpretation?","options":{"A":"The researcher's conclusion is correct — SGD is always better for computer vision","B":"The result reflects well-known empirical findings: for image classification with large datasets and well-tuned LR schedules, SGD+Momentum often outperforms Adam in final accuracy, likely because SGD finds wider minima with better generalization. However, the comparison is confounded by LR choice — Adam's optimal LR is typically 10-100× smaller than SGD's. The conclusion \"SGD is always better\" is too strong; Adam typically outperforms SGD on NLP tasks, irregular loss surfaces, and small datasets","C":"The result proves Adam has higher variance, which always hurts generalization","D":"AdamW should be identical to Adam with weight decay; the difference in their results indicates a bug in the implementation"},"correct":"B","explanation":{"correct":"- The finding that SGD generalizes better than Adam on image classification is well documented (Wilson et al., 2017, \"The Marginal Value of Adaptive Gradient Methods in Machine Learning\"). A common hypothesis: Adam's adaptive learning rates allow it to escape broad regions quickly, but it tends to converge to sharper minima that generalize worse.\n- Critically: Adam's default lr=1e-3 and SGD's optimal lr=0.1 are not equivalent; the effective step sizes are very different. A fairer comparison would tune LR for each optimizer independently.\n- Domain dependency: Transformers, NLP, and irregular optimization landscapes generally favor Adam/AdamW because of its robustness to sparse gradients and irregular geometry. For vision with well-tuned training recipes, SGD+momentum with cosine schedule is competitive or better.","A":"\"Always better for computer vision\" is falsified by Transformer-based vision models (ViT, DeiT) which use Adam/AdamW and achieve strong results. 
The finding is empirically narrower than \"always.\"","B":"","C":"Higher variance in optimization doesn't directly translate to worse generalization. Adam's adaptive learning rates produce different optimization trajectories — the generalization difference is about loss landscape geometry (sharp vs flat minima), not variance per se.","D":"AdamW decouples weight decay from Adam's adaptive gradient scaling. The Adam run here uses no weight decay at all, and even with an L2 penalty added the two optimizers would not be equivalent, so different results are expected and do not indicate a bug."},"reference":"- Wilson et al., \"The Marginal Value of Adaptive Gradient Methods in Machine Learning\" (2017): https://arxiv.org/abs/1705.08292\n- He et al., \"Bag of Tricks for Image Classification\" (SGD training recipe for ResNets): https://arxiv.org/abs/1812.01187"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07008","difficulty":"hard","orderIndex":8,"question":"The Lion optimizer update rule is: m_t = β₁·m_{t-1} + (1-β₁)·g_t, then w_t = w_{t-1} - α·sign(m_t) - α·λ·w_{t-1}. Compared to Adam and SGD, what is Lion's distinctive property and what type of models does it improve?","options":{"A":"Lion uses the sign function instead of the gradient magnitude, making all parameter updates the same size (±α per step). This is memory-efficient (only one moment to track vs Adam's two) and appears to work well for large-scale vision and language models where signal direction is more important than magnitude","B":"Lion's sign function causes updates to clip gradients to 1.0, making it equivalent to gradient clipping with max_norm=1","C":"Lion is identical to Adam but with β₂ removed; the sign function replaces the adaptive learning rate scaling","D":"Lion's sign update prevents convergence to local minima because +α or -α steps can always escape any flat region"},"correct":"A","explanation":{"correct":"- sign(m_t) ∈ {-1, 0, +1}. 
Apart from the weight decay term, every nonzero update has magnitude exactly α, regardless of gradient magnitude. This is a unified step size across all parameters — very different from Adam's per-parameter adaptive scaling.\n- Memory: Lion tracks only one moment (m_t, equivalent to momentum), vs Adam's two moments. For large models with billions of parameters, this halves optimizer state memory.\n- Empirical results (Chen et al., 2023): Lion matches or outperforms AdamW on ViT image classification, diffusion models, and language modeling benchmarks while storing one optimizer state per parameter instead of two.","A":"","B":"Gradient clipping limits gradient norm before the update step. Lion's sign function acts on the accumulated moment, not the raw gradient. The operations are applied at different points in the update pipeline.","C":"Adam without β₂ gives unscaled first-moment updates: w = w - α·m_t. Lion applies sign to the moment: w = w - α·sign(m_t). The sign function makes the update direction-only, not magnitude-preserving.","D":"sign updates can escape flat regions (gradient near 0 but not exactly 0 still gives ±α step), but they can also oscillate around minima (the step size is fixed, so the optimizer can't \"slow down\" near a minimum like Adam does with accumulated second moment)."},"reference":"- Chen et al., \"Symbolic Discovery of Optimization Algorithms\" (Lion optimizer, 2023): https://arxiv.org/abs/2302.06675"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07009","difficulty":"hard","orderIndex":9,"question":"A team trains a language model with the following learning rate schedule: linear warmup for 2000 steps from 0 to lr_max, then cosine decay to lr_max/10. They observe that training is unstable in the first 100 steps (loss spikes and oscillates) even with warmup. 
What is the most likely missing component?","options":{"A":"The warmup duration (2000 steps) is too long; reduce to 100 steps","B":"Warmup alone does not bound the size of each update. At random initialization the first batches produce extreme loss values (random predictions), and gradient norms can spike in the first few steps, so even a very small learning rate (e.g., lr=1e-7) can yield destabilizing weight updates before warmup takes effect. Fix: add gradient clipping in addition to LR warmup","C":"Cosine decay should start immediately, not after warmup; warmup itself causes instability","D":"The model needs BatchNorm; without it, warmup has no stabilizing effect"},"correct":"B","explanation":{"correct":"- At initialization, model weights are random. The first batch loss is typically high (random classifier), and gradients can be large. Even with LR warmup starting from a very small value, large gradient magnitudes multiplied by even a small LR can produce significant weight updates.\n- Gradient clipping (typically max_norm=1.0) is complementary to LR warmup: warmup controls the learning rate trajectory, clipping controls the per-step update magnitude. Together they provide robust training stability.\n- In practice, Transformer training recipes routinely combine both: many published LLM setups pair Adam or AdamW with LR warmup and gradient-norm clipping at 1.0.","A":"Longer warmup is generally more stable, not less. 2000 steps for a language model is a common choice. Reducing to 100 steps would make warmup shorter and potentially less effective.","B":"","C":"Warmup is specifically designed to stabilize early training. Starting cosine decay immediately (without warmup) would begin from a large learning rate at a point where the model is most sensitive (random initialization).","D":"BatchNorm is not needed in Transformer LM training (which uses LayerNorm). 
BatchNorm's presence or absence doesn't determine whether LR warmup is effective."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): uses warmup + Adam: https://arxiv.org/abs/1706.03762\n- Ma et al., \"Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization\": warmup and clipping analysis"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07010","difficulty":"hard","orderIndex":10,"question":"You run a hyperparameter sweep over learning rates for Adam: [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]. You find 1e-3 works best. Your colleague uses the same model but with 4× batch size and gets best results at 1e-3 still. A third engineer with 16× batch size also finds 1e-3 is best. Your team lead says this \"proves Adam doesn't need learning rate scaling with batch size.\" Is this correct?","options":{"A":"Yes — Adam's adaptive learning rate makes it batch-size invariant by normalizing gradient scale","B":"Partially correct but requires nuance: Adam's adaptive scaling partially compensates for batch size effects, but the optimal learning rate still changes with batch size in theory. The empirical finding that 1e-3 works across batch sizes may reflect the fact that optimal LR for Adam is robust within certain ranges, or that the sweep resolution (1-order-of-magnitude steps) is too coarse to detect the shift. For linear scaling rule (multiply LR by k when batch size multiplies by k), Adam weakens but doesn't eliminate the relationship","C":"Yes — the adaptive learning rates in Adam make it exactly batch-size invariant, unlike SGD where the linear scaling rule applies","D":"No — Adam's optimal learning rate scales exactly as 1/√(batch_size); the team should use 1e-3/√16 = 2.5e-4 for 16× batch size"},"correct":"B","explanation":{"correct":"- With batch size k×, gradient estimates have k× smaller variance (more samples per estimate). 
For SGD, optimal LR scales as k (linear scaling rule) to compensate. For Adam, the second moment √v̂ also adapts to gradient scale, providing some automatic compensation.\n- However, the effective learning rate in Adam = α/√v̂ doesn't perfectly compensate for batch size changes because the noise structure of gradients changes with batch size in complex ways.\n- The empirical finding that 1e-3 works across 4× and 16× batch sizes is plausible for Adam (robustness) but doesn't prove invariance. The coarse 10× resolution of the sweep means optimal LR could shift by 2-3× within the same \"best\" bin.","A":"Adam is not exactly batch-size invariant. The adaptive scaling partially compensates but doesn't remove the dependency. This is an active research area (e.g., learning rate scaling experiments in GPT training).","B":"","C":"There is no proof of exact invariance. The adaptive scaling is an approximation that reduces sensitivity to batch size, not a perfect invariance guarantee. Large-batch training work (Goyal et al., 2017) established the linear scaling rule for SGD; adaptive optimizers reduce sensitivity to batch size but have not been shown to remove the need for LR adjustment at very large batches.","D":"The 1/√(batch_size) scaling rule is not established for Adam. There is no consensus exact scaling rule for Adam — the partial compensation makes it harder to derive than SGD's linear rule."},"reference":"- Goyal et al., \"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour\" (2017): https://arxiv.org/abs/1706.02677\n- Smith et al., \"Don't Decay the Learning Rate, Increase the Batch Size\" (2018): https://arxiv.org/abs/1711.00489"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07011","difficulty":"medium","orderIndex":11,"question":"You train a model for 100 epochs with a step learning rate schedule: lr=0.1 for epochs 1-30, lr=0.01 for epochs 31-60, lr=0.001 for epochs 61-100. At epoch 31 and 61, training loss spikes sharply before recovering. 
What causes these spikes and how does cosine annealing prevent them?","options":{"A":"The optimizer's momentum resets at epoch boundaries, causing gradient direction changes","B":"At each step drop, the optimizer's accumulated momentum (velocity) was calibrated for a 10× larger learning rate. When LR drops by 10×, the momentum-scaled update is still 10× too large for the first few steps until the momentum \"forgets\" the old gradients. The spike is from the momentum × new_LR combination being inconsistent. Cosine annealing decays LR smoothly — the optimizer's effective step size changes gradually, so momentum and LR stay in sync","C":"Loss spikes at epoch boundaries are caused by data shuffling, not the learning rate schedule","D":"The spikes indicate gradient explosion; gradient clipping should be added at epoch boundaries"},"correct":"B","explanation":{"correct":"- SGD+Momentum velocity: v_t = β·v_{t-1} + g_t. The velocity has accumulated history from lr=0.1 steps. When LR drops to 0.01, the weight update is α·v_t, but v_t is still large from the high-LR phase. The first few updates at lr=0.01 apply old (lr=0.1-calibrated) momentum, effectively giving larger updates than intended.\n- It takes ~1/(1-β) = 10 steps for momentum to \"forget\" old gradients. During this warmdown period, updates are inconsistent with the new LR.\n- Cosine annealing: LR changes smoothly. The velocity at any point is consistent with the recent LR history (no discontinuity). No spike because there's no sudden LR scale mismatch.","A":"Momentum does NOT reset at epoch boundaries in standard implementations. The velocity vector is persistent across epochs. The problem is that it persists with values calibrated for the old LR.","B":"","C":"Data shuffling changes which batch is seen but not the gradient scale. Shuffling might add noise but not systematic spikes at exact epoch boundaries. 
The spikes correlate precisely with LR changes.","D":"Gradient explosion produces exponentially growing loss that doesn't recover. The described spikes recover quickly (within ~10 steps), which is the signature of momentum-LR mismatch, not true gradient explosion."},"reference":"- https://cs231n.github.io/neural-networks-3/#anneal (learning rate annealing)\n- Loshchilov & Hutter, \"SGDR: Stochastic Gradient Descent with Warm Restarts\" (2017)"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07012","difficulty":"medium","orderIndex":12,"question":"A model is trained with Adam and achieves good validation performance. At inference, you discover the model's predictions are poorly calibrated — it outputs very high confidences (>0.99) for many predictions that are often wrong. A colleague suggests \"this is an Adam problem; use SGD to fix calibration.\" Is this diagnosis accurate?","options":{"A":"Yes — Adam optimizers systematically produce poorly calibrated models","B":"No — poor calibration is primarily a consequence of training with cross-entropy loss (which doesn't penalize overconfidence) and insufficient regularization, not the optimizer choice. Both Adam and SGD can produce poorly calibrated models. Fix: temperature scaling, label smoothing, or explicit calibration techniques (Platt scaling)","C":"Yes — Adam's adaptive learning rate causes it to over-optimize for certain confident predictions","D":"No, but switching to SGD does improve calibration because SGD finds flatter minima which are better calibrated by definition"},"correct":"B","explanation":{"correct":"- Calibration measures whether predicted probabilities match empirical frequencies. 
Poor calibration (overconfidence) is a well-documented property of modern neural networks trained with cross-entropy loss (Guo et al., 2017, \"On Calibration of Modern Neural Networks\").\n- The root cause: cross-entropy loss is minimized when the model assigns probability 1 to correct classes. Without explicit regularization, the model is pushed toward maximum confidence on training data, which overfits confidence (not just labels).\n- Fix: (1) label smoothing (soft targets) — prevents the model from targeting p=1.0; (2) temperature scaling (post-hoc) — scales logits by learned T to calibrate probabilities; (3) Dropout at inference (MC Dropout) for uncertainty estimation.","A":"Adam doesn't systematically cause poor calibration. Models trained with SGD also exhibit overconfidence. The phenomenon is loss-function-driven, not optimizer-driven.","B":"","C":"Adam's adaptive learning rates affect which minima are found, not whether the model is overconfident. Overconfidence relates to the loss landscape near confident predictions, not to optimizer adaptivity.","D":"SGD finding \"flatter minima\" is a hypothesis about generalization, not calibration. Flat minima may generalize better (lower test error) but don't directly improve calibration of confidence scores."},"reference":"- Guo et al., \"On Calibration of Modern Neural Networks\" (2017): https://arxiv.org/abs/1706.04599\n- Label smoothing paper: Szegedy et al., \"Rethinking the Inception Architecture\" (2016)"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07013","difficulty":"easy","orderIndex":13,"question":"A neural network training run diverges (loss goes to infinity) after 500 steps when using SGD with lr=0.01. You reduce the learning rate to 0.001 and training converges. 
What is the mathematical mechanism by which a high learning rate causes divergence?","options":{"A":"High learning rate causes integer overflow in PyTorch's weight tensors","B":"High learning rate causes weight updates w = w - α·g to overshoot the loss minimum. In a quadratic loss bowl, if α > 2/L (where L is the Lipschitz constant of the gradient), each step overshoots to the opposite side of the minimum with increasing distance. The overshoots grow geometrically until weights reach infinity","C":"High learning rate causes gradients to become NaN due to numerical instability in the exponential function","D":"High learning rate causes the model to memorize training data too quickly, and memorization increases the loss on each subsequent batch"},"correct":"B","explanation":{"correct":"- For a 1D quadratic loss f(w) = 0.5·c·w², gradient g = c·w. Update: w' = w - α·c·w = (1-α·c)·w. If |1-α·c| > 1 (i.e., α > 2/c), |w'| > |w| — each step moves further from w=0 (the minimum).\n- Geometric divergence: |w_t| = |1-α·c|ᵗ · |w_0|. With α=0.01 and c=100 (steep loss): |1-0.01·100| = 0, convergence in one step. With α=0.1: |1-10| = 9, |w_t| grows as 9ᵗ → infinity.\n- This explains why the stability condition α < 2/L is fundamental. For curvature L, exceeding 2/L causes divergence regardless of the direction chosen.","A":"PyTorch uses float32/float64, which are IEEE 754 floating-point numbers. They don't overflow to integers — they overflow to `inf` (infinity), which is a valid float64 value. The divergence is mathematical, not an overflow in the integer sense.","B":"","C":"Gradients becoming NaN due to exponential functions happens in specific architectures (e.g., exp in softmax with large logits), not as a direct consequence of high learning rate. High LR causes the loss to diverge first, which then may produce NaN in subsequent operations.","D":"Memorization refers to fitting specific training examples. 
High learning rate causes the optimization trajectory to diverge (weights → infinity) due to overshooting, not because the model is memorizing faster."},"reference":"- Goodfellow et al., \"Deep Learning\", Chapter 8.2 (Challenges in Neural Network Optimization)\n- https://cs231n.github.io/neural-networks-3/#baby"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07014","difficulty":"hard","orderIndex":14,"question":"A company switches from training transformers with Adam to training with 8-bit Adam (bitsandbytes library), claiming \"8-bit quantization of optimizer states with no quality loss.\" A skeptical ML engineer says \"quantizing optimizer states must affect training.\" Who is right?","options":{"A":"The company is right — 8-bit optimizer states are mathematically identical to 32-bit","B":"The engineer is partially right: 8-bit quantization of optimizer states (first and second moments in Adam) does introduce quantization noise. However, Dettmers et al. (2022) showed that block-wise quantization with dynamic scaling largely preserves training quality, because small quantization errors in optimizer states have a regularization-like effect. The quality difference is empirically negligible for most tasks while saving 75% optimizer memory","C":"The company is right because Adam optimizer states are already low-precision floating-point numbers and don't require full 32-bit precision","D":"The engineer is right — 8-bit Adam produces models that underfit by approximately 2% on all tasks"},"correct":"B","explanation":{"correct":"- Standard Adam stores first moment m (float32) and second moment v (float32). For a 7B parameter model, this is 2 × 7B × 4 bytes = 56 GB — often more than the model itself.\n- 8-bit quantization: each value represented in 8-bit integers with block-wise dynamic scaling (find the max value in each 2048-element block, scale values to [0,255]). 
The quantization error is bounded and approximately uniform, acting as small gradient noise.\n- Dettmers et al. (2022) reported that 8-bit Adam matches 32-bit Adam across language modeling, GLUE fine-tuning, machine translation, and image classification benchmarks while cutting optimizer-state memory by about 75%. Any degradation was small and task-dependent.","A":"8-bit and 32-bit representations are not mathematically identical. An 8-bit code can represent only 256 distinct values per block, versus float32's 23-bit mantissa. Quantization noise is real — the question is whether it matters practically.","B":"","C":"Adam stores optimizer states in float32 for precision in accumulating gradients over time. The second moment accumulates squares of gradients, which can span many orders of magnitude. Storing in float16 (not float32) already causes issues — 8-bit requires the block-wise trick to work.","D":"\"Exactly 2% underfitting on all tasks\" is too specific and not supported empirically. Quality degradation from 8-bit Adam is task-dependent and often negligible, not a fixed universal penalty."},"reference":"- Dettmers et al., \"8-bit Optimizers via Block-wise Quantization\" (2022): https://arxiv.org/abs/2110.02861"},{"section":"deep-learning","topicSlug":"optimizers","topic":"Optimizers","id":"dl-07015","difficulty":"hard","orderIndex":15,"question":"You train a GAN (generator G and discriminator D) with Adam. After 10,000 steps, the discriminator loss collapses to near 0 and the generator produces random noise. You try switching D to SGD with high lr, G stays Adam. Training stabilizes. 
What optimizer-specific property of SGD helps here, and what does this reveal about Adam's behavior in adversarial training?","options":{"A":"SGD trains D slower, giving G more time to improve before D becomes perfect","B":"Adam's adaptive learning rates per parameter make D's update steps become very small for parameters with large gradient history — in adversarial training, the discriminator easily classifies real vs fake in the early steps, accumulating large gradient history, causing Adam to reduce effective LR for D to near zero. D becomes unable to update fast enough to keep up with G's improvements. SGD with fixed LR keeps D's update rate stable regardless of gradient history","C":"SGD produces more gradient noise than Adam, which prevents D from memorizing the entire training set","D":"Adam causes GAN mode collapse by making G and D converge to the same local minimum"},"correct":"B","explanation":{"correct":"- Adam's second moment v_t accumulates squared gradients. For D's weights that consistently produce large gradients (clear real/fake discrimination), v_t grows large, and the effective learning rate α/√v_t shrinks toward zero over time.\n- Once D's effective LR collapses, D can no longer meaningfully compete with G. G receives weak or uninformative gradients from a D that barely updates, causing G to produce garbage outputs (no learning signal from a non-updating D).\n- SGD with fixed LR maintains D's ability to update regardless of gradient history. This \"natural learning rate\" preserves the adversarial tension needed for GAN training.","A":"\"G having more time\" would require training them at different rates explicitly. 
The difference is the effective per-step learning rate magnitude, not the number of steps.","B":"","C":"While SGD does have more gradient noise than Adam (due to lack of adaptive scaling), the mechanism is specifically about effective learning rate collapse for the discriminator, not about noise preventing memorization.","D":"Mode collapse in GANs is a generator problem (G maps many inputs to the same output, covering only a few modes of the data distribution). It is not caused by Adam making G and D converge to the same minimum — they have different objectives by design."},"reference":"- Goodfellow et al., \"Generative Adversarial Networks\" (2014): https://arxiv.org/abs/1406.2661\n- Lucic et al., \"Are GANs Created Equal? A Large-Scale Study\" (2018): https://arxiv.org/abs/1711.10337"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08001","difficulty":"easy","orderIndex":1,"question":"A team compares two networks on a tabular dataset with 100 features and 50,000 training samples: Network A has 1 hidden layer with 1024 units; Network B has 4 hidden layers with 256 units each. Both have approximately the same total parameter count. Network B consistently outperforms Network A. A junior engineer concludes \"more layers is always better.\" What is the accurate explanation and when would this conclusion fail?","options":{"A":"More layers are always better; the junior engineer is correct","B":"Network B has more depth, which allows it to learn hierarchical feature compositions. 
However, \"more layers is always better\" fails when: (1) the data has no hierarchical structure (e.g., random tabular data with no compositional features), (2) depth introduces optimization difficulties (vanishing gradients, dead neurons) that are worse than the capacity gain, or (3) the dataset is too small to learn useful hierarchical representations","C":"Network B is better because it has more regularization from the additional bias terms in deeper layers","D":"Depth always helps because wider networks are less efficient at using their parameters"},"correct":"B","explanation":{"correct":"- For data with compositional structure (images: edges→shapes→objects; text: chars→morphemes→words), depth allows each layer to build on previous layer abstractions. For such data, depth can be far more parameter-efficient than width, so Network B can represent more complex functions with the same parameter count.\n- Failure cases: (1) Tabular data, especially with hand-engineered features, often lacks the compositional structure that makes depth useful, and shallow networks frequently perform comparably on it. (2) Very deep networks (20+ layers without ResNet-style skip connections) can be harder to train than shallow ones due to vanishing gradients.\n- The width vs depth trade-off is problem-specific, not universally resolved in favor of depth.","A":"\"Always better\" claims are rarely correct in ML. The Universal Approximation Theorem shows a single hidden layer with enough units can approximate any continuous function on a compact domain, so depth is about efficiency and learnability, not strict necessity.","B":"","C":"Additional bias terms in deeper networks are minimal (a few hundred extra scalar parameters). This is negligible and does not explain the performance difference.","D":"Wider networks can be very efficient: width allows learning many different features simultaneously. 
There's no universal efficiency advantage for depth over width."},"reference":"- Bengio & LeCun, \"Scaling algorithms towards AI\" (2007)\n- Goodfellow et al., \"Deep Learning\", Chapter 6.4 (Architecture Design)"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08002","difficulty":"easy","orderIndex":2,"question":"The Universal Approximation Theorem (UAT) states that a neural network with one hidden layer and enough units can approximate any continuous function to arbitrary precision. A product manager uses this to argue: \"We should always use single-layer networks since UAT proves they can do anything.\" What is wrong with this argument?","options":{"A":"UAT only applies to regression problems, not classification","B":"UAT guarantees existence of a single-layer network that works, not that we can efficiently find it. The required width may be exponential in the input dimension. More practically, it doesn't address: (1) how to find the right weights (optimization), (2) how many samples are needed to learn it (generalization), or (3) the practical cost of the exponentially large required width","C":"UAT has been proven false; modern neural networks require depth to function","D":"UAT applies only to networks with sigmoid activations; ReLU networks require depth to be universal approximators"},"correct":"B","explanation":{"correct":"- Cybenko's original UAT (1989) proved existence of weights for a wide enough single-layer network. 
But \"wide enough\" can be exponential in input dimension for certain functions.\n- Hornik et al. (1989) generalized the result to any squashing function, and Barron (1993) proved single-layer networks can approximate any function with finite first moment of the Fourier transform using O(1/ε²) neurons, although this bound is loose in practice.\n- The practical problems: (1) optimization of a single very wide layer may be harder than a deep narrow network; (2) generalization requires enough samples relative to parameter count; (3) the exponentially wide single layer may need far more FLOPs and memory than a deep equivalent.","A":"UAT applies to both regression and classification (approximating any continuous function includes decision boundaries). It is not restricted to regression.","B":"","C":"UAT has not been disproven; it remains valid. What has been shown (Telgarsky, 2016) is that for certain functions, depth allows exponentially more efficient representations. This doesn't falsify UAT.","D":"Single-hidden-layer ReLU networks are also universal approximators. Leshno et al. (1993) showed that any non-polynomial activation function suffices, which covers ReLU, so UAT is not limited to sigmoid activations."},"reference":"- Cybenko, \"Approximation by superpositions of a sigmoidal function\" (1989)\n- Hornik et al., \"Multilayer feedforward networks are universal approximators\" (1989)"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08003","difficulty":"medium","orderIndex":3,"question":"You process a batch of 32 samples through a fully connected network. At the second hidden layer, you notice that halving the batch size from 32 to 16 reduces training time by 45% (not 50%). 
What does this tell you about the computation bottleneck?","options":{"A":"Halving the batch size should always halve training time; the 45% means there's a 5% overhead bug","B":"45% time reduction means approximately 10% of the time is batch-size-independent overhead (data loading, optimizer step, memory allocation). For batch-independent operations: t_fixed ≈ 10% of original time. The matrix multiply scales with batch size, but fixed overheads don't. This is consistent with 32-sample batch: 90% compute + 10% overhead; 16-sample batch: 45% compute + 10% overhead = 55% of original","C":"The GPU is only 90% utilized — the remaining 10% idle time explains the discrepancy","D":"The activation functions are not GPU-accelerated for batch sizes below 20"},"correct":"B","explanation":{"correct":"- Total time = compute_time(batch) + fixed_overhead. Compute scales linearly with batch size (more samples → more FLOPs). Fixed overhead (Python execution, data transfer, optimizer step) is constant per batch.\n- If t_total(32) = 1.0, t_total(16) = 0.55 (45% reduction). Let fixed = c, compute = (1-c). Then: (1-c)·(16/32) + c = 0.55 → 0.5·(1-c) + c = 0.55 → 0.5 + 0.5c = 0.55 → c = 0.1 (10% overhead).\n- This is a common profiling exercise: compute vs overhead decomposition helps identify whether reducing batch size will proportionally reduce training time.","A":"Halving batch size rarely halves training time exactly due to fixed overheads. The 5% discrepancy is not a \"bug\" — it's a natural consequence of batch-independent operations.","B":"","C":"GPU utilization measures parallel compute efficiency, not overhead fractions. Low utilization (idle GPU cores) would manifest as the GPU taking longer for the compute portion, not as fixed overhead.","D":"GPU acceleration for activation functions is batch-size-independent (element-wise operations scale with total element count). 
There's no 20-sample threshold."},"reference":"- PyTorch profiler for compute vs overhead analysis: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08004","difficulty":"medium","orderIndex":4,"question":"A team experiments with two architectures for a 1000-class image classifier: (A) 5 fully connected layers of width 4096 — total parameters ≈ 10B; (B) ResNet-50 with 4 convolutional stages — total parameters ≈ 25M. ResNet-50 achieves far better performance. What architectural property of ResNet-50 explains why 25M parameters outperforms 10B in this task?","options":{"A":"ResNet-50 has skip connections that make it more powerful per parameter due to additive gradient paths","B":"Convolutional layers exploit spatial structure through weight sharing and local connectivity: a single 3×3 kernel with 9×C² parameters detects the same pattern everywhere in an image. Fully connected layers treat all pixel connections independently, requiring exponentially more parameters to cover the same spatial patterns. Weight sharing makes CNNs extremely parameter-efficient for image data where the same features (edges, textures) appear throughout","C":"ResNet-50 uses BatchNorm which makes it more efficient by reducing the number of parameters needed","D":"Fully connected networks cannot process images above 224×224 resolution due to memory constraints"},"correct":"B","explanation":{"correct":"- A 3×3 conv kernel with C_in=256 and C_out=256 has 9×256×256 ≈ 590K parameters and can detect features at any spatial location by sliding the kernel. An equivalent fully-connected layer connecting 256×7×7 = 12,544 positions to 256 output features needs 12,544×256 ≈ 3.2M parameters — for one layer.\n- More fundamentally: in natural images, the same features (corners, curves, textures) appear at all spatial locations. 
Learning these features once (shared weights) and applying them everywhere is both more efficient and provides translation equivariance (and, combined with pooling, approximate translation invariance).\n- Fully connected networks must learn separate detectors for \"edge in top-left\" vs \"edge in center\" vs \"edge in bottom-right\": the same feature learned 196 times (for a 14×14 feature map). This is the parameter inefficiency.","A":"Skip connections are a secondary benefit. ResNet-50's primary advantage over a 10B-parameter FC network is convolutional weight sharing, not skip connections. A simple ConvNet without skip connections would still vastly outperform the FC network.","B":"","C":"BatchNorm has relatively few parameters (2×C per layer for scale and bias). Its benefit is training stability, not parameter efficiency. BatchNorm adds parameters rather than removing them.","D":"Fully connected networks can process images of any fixed size (just flatten the input). The constraint is practical (memory and parameter count), not architectural. Many FC-only architectures have processed high-resolution images."},"reference":"- LeCun et al., \"Gradient-Based Learning Applied to Document Recognition\" (1998): original ConvNet efficiency argument\n- https://cs231n.github.io/convolutional-networks/"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08005","difficulty":"medium","orderIndex":5,"question":"You increase the batch size from 256 to 2048 (8×) while keeping all other hyperparameters constant. Training converges to a model with 1.5% higher test error than the 256-batch model. 
A colleague says \"increase learning rate by 8× to recover performance.\" Does this fix work?","options":{"A":"Yes — the linear scaling rule says LR should scale linearly with batch size, which recovers the same effective learning rate per sample","B":"Partially — the linear scaling rule (Goyal et al., 2017) works for moderate batch size increases in well-tested settings, but it requires also using LR warmup (gradually increasing from small LR to 8× LR over first 5 epochs). Directly jumping to 8× LR without warmup causes training instability. Additionally, for very large batches (>8192), diminishing returns appear and the linear rule underfits generalization","C":"No — learning rate should decrease when batch size increases because each step sees more data","D":"Yes but only with SGD; Adam automatically adjusts for batch size changes"},"correct":"B","explanation":{"correct":"- The linear scaling rule (Goyal et al.): for batch size k×B with LR k×η, the model trains to the same accuracy as batch size B with LR η, provided k is not too large. The intuition: larger batches compute lower-variance gradient estimates; the larger LR compensates.\n- Warmup is critical: at initialization, gradients can be large and noisy. Jumping immediately to 8× LR produces large erratic steps. Warmup starts at lr=η and linearly increases to 8×η over 5 epochs.\n- Generalization gap for large batches: large-batch training finds \"sharp minima\" with worse generalization (Keskar et al., 2016). Linear LR scaling compensates for optimization speed but not for the sharp-minima phenomenon. This explains the residual 1.5% gap.","A":"The linear scaling rule is correct in direction (increase LR with batch size) but incomplete — it omits the critical warmup requirement. \"Just multiply LR by 8\" without warmup often fails.","B":"","C":"The intuition \"more data per step → lower LR\" is wrong. More data per step means less noisy gradient estimates, not weaker signal. 
The LR should increase to match the higher quality gradient estimate.","D":"Adam does partially adapt to batch size changes via second moment normalization, but it does not fully automatically compensate. Large batch Adam training still benefits from LR scaling."},"reference":"- Goyal et al., \"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour\" (2017): https://arxiv.org/abs/1706.02677\n- Keskar et al., \"On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima\" (2016): https://arxiv.org/abs/1609.04836"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08006","difficulty":"medium","orderIndex":6,"question":"You design a fully connected network for time series prediction (predict value at t+1 from last 100 timesteps). You use a flat architecture: flatten 100 values into a vector, then 3 FC layers. A senior engineer says \"this architecture ignores the sequential structure of the data.\" She proposes using 1D convolution instead. What specifically does the flat FC architecture fail to exploit?","options":{"A":"FC networks cannot process more than 50 input features","B":"The FC architecture treats all 100 timesteps as independent features with no assumption about temporal locality or ordering. A value at t=1 and t=100 are connected with the same weight matrix as adjacent timesteps. 1D CNNs apply local kernels that capture patterns at specific temporal scales (e.g., a 5-step window can detect short-term trends) and are translation-equivariant — the same pattern at t=10 and t=90 uses identical weights. 
The flat FC must learn these local temporal patterns separately at each position","C":"FC networks cannot backpropagate gradients through more than 3 layers when input size exceeds 100","D":"FC networks require input normalization before processing time series, unlike CNNs"},"correct":"B","explanation":{"correct":"- A fully connected layer from (100,) to (H,): the weight at position W[j,5] (connecting timestep 5 to hidden unit j) is completely independent of W[j,6] (connecting timestep 6). There's no inductive bias for temporal locality.\n- 1D CNN: a kernel of size 5 learns a pattern over 5 consecutive timesteps and slides across all positions. The same kernel detects the same pattern regardless of when it occurs (translation equivariance). Parameter count: 5×channels, applied at every position.\n- The FC layer must learn these temporal patterns without the locality inductive bias — requiring more data and parameters to learn what CNNs represent by construction.","A":"FC networks have no hard limit on input features. A 100-dimensional input is small by modern standards. FC networks handle inputs of millions of dimensions (though inefficiently for structured data).","B":"","C":"Gradient flow through FC networks is determined by the number of layers and activation functions, not input size. 3 FC layers with 100 inputs backpropagate gradients without any 100-feature limit.","D":"Normalization helps both FC and CNN networks. 
It is not specific to FC networks or required differently based on architecture."},"reference":"- https://cs231n.github.io/convolutional-networks/ (parameter sharing and local connectivity)"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08007","difficulty":"hard","orderIndex":7,"question":"You compare two networks of identical depth (10 layers) and width (512 per layer) but different connectivity: Network A is a plain sequential MLP (each layer fully connected only to the next); Network B is DenseNet-style (each layer connected to all previous layers). For the same input, Network B requires far more computation in later layers. Why, and what is the growth in parameter count in later layers?","options":{"A":"DenseNet's later layers are slower because they process larger input tensors; each layer k receives a concatenation of all previous layer outputs, so input size grows roughly as k×512","B":"Network B's layer k receives a concatenation of all previous outputs: input to layer k = [h₁, h₂, ..., h_{k-1}, x], with dimension (k-1)×512 + input_dim. The weight matrix for layer k is ((k-1)×512 + input_dim) × 512 (output). The parameter count of layer k therefore grows linearly with k, so total parameters across all N layers grow as O(N²) vs O(N) for fully connected sequential networks","C":"DenseNet computes additional forward passes for each skip connection, causing multiplicative slowdowns","D":"DenseNet's parameter count doesn't change; it only adds additive operations, not multiplicative ones"},"correct":"B","explanation":{"correct":"- DenseNet concatenation: layer k's input = [x, h₁, ..., h_{k-1}] has dimension d₀ + (k-1)×d_layer. The weight matrix mapping this to d_layer outputs has (d₀ + (k-1)×d_layer) × d_layer parameters.\n- Total parameters (taking d₀ ≈ d_layer) ≈ Σₖ (k × d_layer²) = d_layer² × N(N+1)/2 = O(N²). 
For sequential fully connected: each layer has d_layer² parameters → total O(N×d_layer²) = O(N).\n- DenseNet mitigates this with bottleneck layers (1×1 conv in CNN version) and growth rate limiting (each layer adds only g features, not full d_layer). For FC layers without these mitigations, quadratic parameter growth is a genuine concern.","A":"Partially correct description but incomplete. The input size growth is correct (k×512), which is the root cause. But option B completes the analysis with the quadratic parameter count consequence.","B":"","C":"Skip connections don't cause extra forward passes. DenseNet computes each layer once; the skip connections just route already-computed activations to later layers via concatenation.","D":"Concatenating larger inputs requires larger weight matrices (more parameters, more multiplications). This is a multiplicative operation: (k×512) × (512) weight matrix grows with k."},"reference":"- Huang et al., \"Densely Connected Convolutional Networks\" (DenseNet): https://arxiv.org/abs/1608.06993"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08008","difficulty":"hard","orderIndex":8,"question":"You benchmark two model configurations: Config A (batch_size=32, model_width=512) and Config B (batch_size=256, model_width=512). Both are trained for the same number of gradient steps. Config B reaches lower training loss but worse validation loss. An engineer proposes Config C (batch_size=32, model_width=2048). What should she expect and why?","options":{"A":"Config C will perform similarly to Config A because width and batch size trade off identically","B":"Config C will likely have better validation performance than Config B (larger batch = sharp minima, worse generalization) and may outperform Config A on validation due to increased model capacity with small batch size (which finds flatter, better-generalizing minima). 
The combination of wider model (more capacity) with small batch (finds flatter minima) is a common recipe for maximizing generalization","C":"Config C will overfit immediately due to the increased model capacity","D":"Wider models always converge slower, so Config C will not reach competitive loss in the same number of steps"},"correct":"B","explanation":{"correct":"- Large batch (Config B): computes low-variance gradient estimates but tends to converge to sharp minima (narrow loss valleys) that generalize poorly (Keskar et al., 2016). Sharp minima have worse generalization because small input distribution shifts push out of the narrow good region.\n- Wider model (Config C) + small batch (flat minima): width increases representational capacity, while small batch's noisy gradient estimates act as implicit regularization by preventing convergence into sharp narrow minima. This combination often achieves the best of both.\n- The intuition: width alone would increase overfitting risk, but small batch + noise acts as regularization that finds wider minima of the same loss landscape.","A":"Width and batch size affect different aspects of training (capacity vs gradient noise / minima sharpness). They don't trade off in a simple equivalent way.","B":"","C":"Overfitting is a function of parameter-to-data ratio AND training dynamics. With small batch size providing implicit regularization and with dropout/weight decay, a wider model doesn't necessarily overfit more than a narrow one.","D":"Wider models can converge at similar rates as narrow ones with appropriate initialization and learning rate scaling. 
Width affects per-step FLOP cost but not necessarily the number of gradient steps to convergence."},"reference":"- Keskar et al., \"On Large-Batch Training for Deep Learning\" (2016): https://arxiv.org/abs/1609.04836\n- Hoffer et al., \"Train longer, generalize better\" (2017): https://arxiv.org/abs/1705.08741"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08009","difficulty":"hard","orderIndex":9,"question":"A network is designed with 20 fully connected layers, each with width 256 and ReLU activations. Without any normalization or skip connections, the model has 10% test accuracy (same as random for a 10-class problem). Adding only BatchNorm (no skip connections) raises it to 85%. Adding only skip connections (ResNet-style) raises it to 82%. What does this tell us about the role of each component?","options":{"A":"BatchNorm is strictly more important than skip connections for deep networks","B":"Each mechanism independently makes the deep network trainable: BatchNorm addresses the internal covariate shift / gradient flow problem (keeps activations normalized, stabilizes optimization), while skip connections counter vanishing gradients through additive paths. Their similar effectiveness here suggests both are addressing the same root cause (gradient flow through 20 layers), just through different mechanisms","C":"BatchNorm only helps at test time; it has no effect on training","D":"Skip connections don't help without BatchNorm; the 82% result is due to noise in the experiment"},"correct":"B","explanation":{"correct":"- 10% accuracy (random) → the plain 20-layer network fails to train at all, and either mechanism alone is enough to fix that. The depth creates severe gradient flow problems.\n- BatchNorm's contribution: normalizes activations to zero mean, unit variance after each layer. This prevents the exponential activation growth/decay that causes gradients to explode/vanish. 
Also shown to smooth the loss landscape (Santurkar et al., 2018).\n- Skip connections' contribution: provide direct additive gradient paths. ∂L/∂h₀ includes a term from the skip path that doesn't vanish even when the residual branch gradient does.\n- 85% vs 82%: in this specific experiment, BatchNorm is slightly more effective, but this is architecture- and data-specific. Deeper networks (50+ layers) often see skip connections become the dominant factor.","A":"\"Strictly more important\" is too strong. In different architectures and datasets, skip connections are the more critical component (e.g., very deep networks where normalization alone cannot solve gradient flow). Both are important, and the relative importance is context-dependent.","B":"","C":"BatchNorm has separate behavior during training (uses batch statistics) and inference (uses running statistics). Its training behavior (normalizing activations, stabilizing gradient flow) is its primary contribution.","D":"The skip connection result (82%) is meaningful, not noise. ResNet with 20 layers is a well-established architecture that trains effectively. The difference from BatchNorm (85% vs 82%) is within normal architecture comparison variance."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\": https://arxiv.org/abs/1512.03385\n- Santurkar et al., \"How Does Batch Normalization Help Optimization?\": https://arxiv.org/abs/1805.11604"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08010","difficulty":"medium","orderIndex":10,"question":"A product manager requests a neural network that can handle variable-length inputs (sentences with 10 to 500 tokens) without padding. You design: (A) a fully connected network that requires fixed-length inputs, (B) a recurrent network, (C) a CNN with global average pooling. 
Which approach(es) natively handle variable-length inputs and what is the trade-off?","options":{"A":"Only recurrent networks can handle variable-length inputs","B":"Both RNNs and CNNs with global pooling handle variable-length inputs, but through different mechanisms: RNNs process tokens sequentially and can stop at any length; CNNs with global average pooling apply convolutional kernels to any-length sequence and pool all positions to a fixed-size vector. RNNs capture sequence order and long-range dependencies better in principle; CNNs are more parallelizable (can process all positions simultaneously) but use local kernels","C":"Only transformers handle variable-length inputs; RNNs and CNNs require fixed length","D":"Variable-length inputs require padding to the maximum length; no architecture natively avoids this"},"correct":"B","explanation":{"correct":"- RNNs: process one token at a time, maintaining hidden state h_t = f(h_{t-1}, x_t). After T tokens, h_T is the summary. T can be any length — the same weights process 10 or 500 tokens.\n- CNNs + global average pooling: apply 1D kernels (shape: kernel_size × channels) that slide over any sequence length, then average all output positions into a fixed-size vector. The CNN itself requires no sequence length knowledge.\n- Transformers also handle variable-length inputs natively (attention is computed over all positions, quadratic in sequence length). They're not listed in this scenario.","A":"CNNs with global pooling also handle variable length natively, so \"only RNNs\" is incorrect.","B":"","C":"Transformers are not the only option. Both RNNs and CNNs have been used extensively for variable-length inputs (text classification, audio processing) before Transformers became dominant.","D":"While padding is a common implementation technique (for batch efficiency on GPUs), it is not architecturally necessary. 
True variable-length processing is natively supported by RNNs and CNNs with pooling."},"reference":"- https://cs231n.github.io/rnn/ (RNNs and variable-length sequences)"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08011","difficulty":"hard","orderIndex":11,"question":"You train a 10-layer fully connected network and plot the singular value distribution of each layer's weight matrix after training. The first layer has a power-law singular value distribution (a few large values, many small). The last layer has a near-uniform singular value distribution. What does this difference tell you about learned representations?","options":{"A":"The first layer is undertrained (should have uniform singular values); reduce the learning rate for layer 1","B":"The first layer's low-rank structure (dominated by a few singular values) means it has effectively learned to project inputs into a low-dimensional subspace — only a few \"directions\" in the input space are relevant. The last layer's near-uniform distribution indicates it is using all its dimensions roughly equally. This pattern (low effective rank in early layers, higher rank in later layers) often indicates the network is learning to extract the most informative dimensions first","C":"The singular value distribution is random and has no interpretation","D":"Uniform singular values in the last layer indicate overfitting; the model should have low-rank structure throughout"},"correct":"B","explanation":{"correct":"- A weight matrix W with a power-law singular value distribution has low effective rank — most information passes through a few dominant directions. The matrix is approximately W ≈ UΣV^T where only a few singular values σ₁ >> σ₂ >> ... >> σₖ contribute significantly.\n- This pattern in early layers reflects that raw input features (e.g., pixels) are highly correlated. 
The network learns to project into the few meaningful dimensions that contain task-relevant information.\n- Martin & Mahoney (2019) studied this extensively, finding that well-trained networks exhibit heavy-tailed singular value spectra (Heavy-Tailed Self-Regularization) that correlate with generalization. Models with more power-law structure tend to generalize better.","A":"Low-rank structure in trained networks is a sign of learning, not undertraining. Undertrained networks often have near-uniform singular values (close to initialization). Power-law structure emerges during training as the network finds the useful dimensions.","B":"","C":"Singular value distributions in trained networks are highly non-random and deeply informative about network behavior. This is an active research area connecting to random matrix theory and generalization.","D":"Uniform singular values in the last layer suggest full-rank utilization: the output layer needs to use all incoming dimensions to distinguish all classes. This is normal and expected for classification tasks."},"reference":"- Martin & Mahoney, \"Implicit Self-Regularization in Deep Neural Networks\" (2019): https://arxiv.org/abs/1810.01075"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08012","difficulty":"medium","orderIndex":12,"question":"A deep learning framework initializes weights randomly but initializes one type of parameter to 0: biases. For a layer with no activation function (linear output layer), zero bias initialization is said to be especially important. Why?","options":{"A":"Zero bias prevents the linear output from producing NaN on the first forward pass","B":"For a linear output layer predicting real values, bias=0 means the initial prediction is the weighted sum of inputs with no offset. If the target mean is near 0 (after normalization), this is a reasonable starting point. 
More importantly: if all biases were initialized identically to any non-zero constant, different output units would all start with the same non-zero offset, and the network would need extra training steps to differentiate predictions across classes or output dimensions","C":"Zero bias initialization is required by the Adam optimizer's bias correction algorithm","D":"Non-zero bias initialization for the output layer causes the cross-entropy loss to be undefined at the first step"},"correct":"B","explanation":{"correct":"- For regression with normalized targets (zero mean), bias=0 in the output layer means initial predictions are zero (or close to zero for small random weights) — a reasonable starting point near the target distribution mean.\n- For classification: if output biases were all initialized to the same constant c, then all class logits would include the same c, and softmax would output uniform probabilities (same result as bias=0 after softmax). The constant cancels in softmax.\n- For regression with bias ≠ 0: the model starts predicting non-zero values for all samples, increasing initial loss unnecessarily. Zero initialization minimizes the initial loss and allows faster convergence to meaningful predictions.","A":"Non-zero biases don't cause NaN. The forward pass produces finite values (weighted sum + bias) regardless of bias initialization.","B":"","C":"Adam's bias correction is for the first and second moments of the gradient (initialized to zero), not for model weight biases. These are different \"biases\" — model parameter biases vs optimization moment biases.","D":"Cross-entropy requires positive probability inputs. Non-zero biases in the output layer would produce non-zero logits → non-uniform softmax probabilities → valid cross-entropy (finite loss). 
The loss is not undefined with non-zero bias."},"reference":"- https://cs231n.github.io/neural-networks-2/#init"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08013","difficulty":"hard","orderIndex":13,"question":"You run an ablation study: starting from a baseline 4-layer MLP, you double the width of each layer. Training loss improves by 5%. You then double the depth to 8 layers (keeping the original width). Training loss improves by 8%. You then combine both (8 layers, doubled width). Training loss improves by only 6% relative to the 4-layer/doubled-width baseline, not the expected 13% (5%+8%) additive gain. What phenomenon explains the sub-additive gains?","options":{"A":"The combined model is too large and crashes the GPU, causing training errors that reduce the measured gain","B":"The gains from width and depth are not additive because they address overlapping bottlenecks. Once width is doubled, the limiting factor for the 4-layer model shifts. Adding depth after sufficient width addresses a different bottleneck (depth of representation), but part of the depth gain was already captured by the wider 4-layer model's increased capacity. The effective bottlenecks interact non-linearly","C":"The optimizer cannot handle both increases simultaneously; use separate optimizers for width and depth changes","D":"Sub-additive gains are caused by L2 regularization which penalizes the larger combined model more heavily"},"correct":"B","explanation":{"correct":"- Width and depth improvements often address partially overlapping representational bottlenecks. Doubling width lets each layer represent more features simultaneously. Adding depth lets the network build more hierarchical abstractions.\n- When a model has bottlenecks in both dimensions, fixing one partially alleviates the other (a wider layer can approximate some depth effects by learning more complex functions per layer). 
So the \"remaining gain\" from adding depth after already having doubled width is less than adding depth alone.\n- This is related to the EfficientNet scaling insight (Tan & Le, 2019): optimal performance comes from compound scaling (width, depth, resolution together with balanced ratios), not independent scaling.","A":"GPU crashes would produce training failures or NaN losses, not a consistent 6% improvement. Sub-additive gains on a working model are a property of the learning dynamics, not hardware failure.","B":"","C":"The optimizer works the same for any architecture size. Using \"separate optimizers\" for different architectural components is not a standard technique and wouldn't affect the sub-additivity.","D":"L2 regularization penalizes larger parameter norms, but this would reduce performance to below the baseline, not just reduce gains relative to additive expectation. The regularization would need to be tuned for each model size separately."},"reference":"- Tan & Le, \"EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks\" (2019): https://arxiv.org/abs/1905.11946"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08014","difficulty":"easy","orderIndex":14,"question":"A neural network for binary classification has a final layer: `nn.Linear(256, 2)` with 2 output units + softmax. A colleague says \"this is wasteful — use a single output unit with sigmoid.\" Who is right?","options":{"A":"The colleague is wrong — 2 output units are always required for binary classification","B":"Both approaches are correct and produce equivalent results, but the single-sigmoid approach is more common and efficient: one output unit with sigmoid outputs P(class=1) directly. The 2-unit softmax approach outputs [P(class=0), P(class=1)] where P(class=0) = 1 - P(class=1) — the second output is redundant. 
The 2-unit approach uses 2× more output parameters for the same information","C":"The colleague is wrong — using sigmoid for classification causes training instability compared to softmax","D":"The 2-unit approach allows the model to predict \"neither class\" — a third implicit option that sigmoid cannot represent"},"correct":"B","explanation":{"correct":"- For binary classification, P(class=0) + P(class=1) = 1 (exhaustive, exclusive). So knowing P(class=1) = p immediately gives P(class=0) = 1-p. The second output unit is perfectly redundant.\n- Single sigmoid: output = σ(z). Loss: BCE = -[y·log(σ(z)) + (1-y)·log(1-σ(z))]. Uses 256×1 + 1 = 257 parameters for the final layer.\n- Two-unit softmax: outputs [σ₀, σ₁] = softmax([z₀, z₁]). Uses 256×2 + 2 = 514 parameters. σ₁ = exp(z₁)/(exp(z₀)+exp(z₁)) — equivalent to sigmoid(z₁-z₀). The two logits only matter through their difference, making one of them redundant.","A":"Two output units are NOT always required. Many production binary classifiers use a single sigmoid output. Libraries like scikit-learn's neural network default to single-output sigmoid for binary classification.","B":"","C":"Both sigmoid (BCE) and softmax (CE) are stable for binary classification. 
Sigmoid is actually preferred by most practitioners for binary problems due to simplicity and efficiency.","D":"The 2-unit softmax does not represent a \"third neither class.\" Softmax always produces a valid probability distribution over exactly the specified classes — probabilities sum to 1.0 for the two classes, leaving no probability mass for other options."},"reference":"- PyTorch BCEWithLogitsLoss (single output) vs CrossEntropyLoss (multiple outputs): https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html"},{"section":"deep-learning","topicSlug":"ann-architectures","topic":"Ann Architectures","id":"dl-08015","difficulty":"hard","orderIndex":15,"question":"You have a network where the output layer's logits (pre-softmax) have very large magnitude: outputs like [100, -50, 30, ...]. The training loss is low but the model is overconfident — softmax([100, -50, 30]) ≈ [1.0, 0.0, 0.0]. A team member argues this is not a problem because the argmax prediction is correct. Why is extreme logit magnitude a production concern?","options":{"A":"Large logits cause integer overflow in the argmax computation","B":"Extreme logit magnitudes produce near-zero gradients for all but the dominant class (softmax probability ≈ 1 for one class → softmax Jacobian ≈ 0), preventing the model from continuing to learn from correctly classified examples. 
In production: (1) miscalibrated confidence scores are unreliable for downstream systems that use probabilities (e.g., rejection thresholds), (2) distribution shift at inference can push new inputs toward different logit patterns that the model can't recover from, (3) temperature scaling calibration fails when logits are at extreme values","C":"Large logits slow down inference because softmax requires more floating-point operations for large values","D":"The problem only exists during training; inference is unaffected by logit magnitude"},"correct":"B","explanation":{"correct":"- Near-saturation at softmax: when p_correct ≈ 1.0, ∂L/∂logit ≈ (p_predicted - y_true) ≈ 0. The gradient signal vanishes for correctly classified examples. The model essentially stops learning from these samples.\n- Calibration: a downstream classifier or safety system using probability thresholds (e.g., \"only act if confidence > 80%\") can't distinguish between p=0.95 (confident) and p=0.9999999 (extreme). Both are \"high confidence\" but one is healthy and one is pathological.\n- Temperature scaling: the standard post-training calibration technique (divide logits by T, find T on validation set) can't fix extremely large logits well because the logit magnitude spans many orders of magnitude.","A":"Argmax is a comparison operation, not arithmetic on the logit values. Large logit values don't cause overflow in argmax — the operation just selects the index of the maximum value.","B":"","C":"Softmax uses exp(xᵢ - max(x)) for numerical stability, adding only a subtraction step. The computational cost of softmax scales with the number of classes, not logit magnitude.","D":"Inference is directly affected by logit magnitude through probability calibration. 
Inference confidence scores are the logits passed through softmax — their values are the direct output of the model and affect downstream decisions."},"reference":"- Guo et al., \"On Calibration of Modern Neural Networks\" (2017): https://arxiv.org/abs/1706.04599\n- Label smoothing as prevention: Müller et al., \"When Does Label Smoothing Help?\" (2019): https://arxiv.org/abs/1906.02629"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09001","difficulty":"easy","orderIndex":1,"question":"A model achieves 99% training accuracy but only 72% validation accuracy. Adding Dropout (p=0.5) to hidden layers reduces training accuracy to 91% but improves validation accuracy to 85%. A junior engineer is alarmed: \"Dropout hurt our training accuracy!\" Is this a problem?","options":{"A":"Yes — a good model should always have high training accuracy; the Dropout rate is too high","B":"No — the intentional training accuracy reduction is Dropout working as designed. Dropout randomly deactivates 50% of neurons per forward pass, forcing the network to learn redundant representations and preventing co-adaptation. The gap between 99% train and 72% valid was overfitting; the narrowed gap (91% train, 85% valid) is correct behavior","C":"Yes — Dropout should improve both training and validation accuracy simultaneously","D":"No, but the Dropout rate should be increased to 0.9 to close the remaining 6% gap further"},"correct":"B","explanation":{"correct":"- Overfitting (99% train, 72% valid) means the model memorized training patterns. The 27-point gap is the overfitting signal.\n- Dropout during training: each forward pass drops 50% of neurons randomly. The model cannot rely on specific neuron combinations → forces distributed, robust representations. 
This reduces effective model capacity and acts as ensemble training (each dropout mask creates a different \"sub-network\").\n- The 91% train / 85% valid result: narrower 6-point gap (down from 27) with better absolute validation performance. This is the correct trade-off. During inference, Dropout is disabled (model.eval()). In the original formulation, outputs are scaled by (1-p) at inference; modern frameworks such as PyTorch use inverted Dropout, which scales by 1/(1-p) during training so that no scaling is needed at inference.","A":"Training accuracy is not the target metric — generalization (validation accuracy) is. A model that achieves 99% training and 72% valid is failing. 91% train and 85% valid is a success.","B":"","C":"Dropout explicitly impairs training by randomly disabling neurons. It is designed to hurt training performance in exchange for better generalization. Both effects are expected.","D":"Increasing Dropout to 0.9 would disable 90% of neurons per pass, severely under-utilizing the network and likely causing underfitting (both train and valid accuracy would drop). Dropout rates above 0.5 are rarely used in practice."},"reference":"- Srivastava et al., \"Dropout: A Simple Way to Prevent Neural Networks from Overfitting\" (2014): https://jmlr.org/papers/v15/srivastava14a.html"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09002","difficulty":"easy","orderIndex":2,"question":"You train two models with L1 and L2 regularization respectively (same λ, same architecture, same data). After training, Model A (L2) has many small weights near 0.01; Model B (L1) has many weights exactly 0 and a few large weights. Which model has sparse weights and why?","options":{"A":"Model A (L2) — L2 regularization pushes weights toward exactly zero","B":"Model B (L1) — L1 penalty (λ·|w|) has a constant gradient (±λ) regardless of weight magnitude. This constant \"pull\" toward zero is strong enough to push small weights all the way to exactly 0. 
L2 penalty gradient (2λ·w) diminishes as w approaches 0, so small weights are only pulled weakly and never reach exactly zero","C":"Both regularizations produce identical sparsity patterns; the difference is only in total loss value","D":"Model A (L2) — L2 regularization produces sparsity through the squared penalty amplifying small weights"},"correct":"B","explanation":{"correct":"- L1 subdifferential at w=0: the subgradient is any value in [-λ, λ]. For w > 0: gradient = λ (constant pull toward 0). For w = 0: gradient can be 0 (if the data gradient is within [-λ, λ]), making 0 a stable equilibrium. This is why L1 produces exact zeros.\n- L2 gradient: 2λ·w. As w→0, gradient→0. The force pulling w toward 0 weakens as w gets smaller. L2 pushes weights toward small values but never has enough force to reach exactly 0 (gradient = 0 only at w=0, which requires the weight to already be 0).\n- Practical consequence: L1 produces sparse models (useful for feature selection); L2 produces small but non-sparse models (useful for general regularization). L1 + L2 = Elastic Net, combining both properties.","A":"L2 does not push weights to exactly zero. This is a fundamental property difference between L1 and L2. Confusing them is a very common misconception.","B":"","C":"The sparsity patterns are very different: L1 creates exact sparsity (many zeros), L2 does not. 
This has major implications for model interpretability and computational efficiency.","D":"L2's squared penalty amplifies the gradient for large weights (strong push for large weights) but diminishes for small weights — the opposite of what would produce sparsity."},"reference":"- Tibshirani, \"Regression Shrinkage and Selection via the Lasso\" (1996): L1 regularization / Lasso\n- https://scikit-learn.org/stable/modules/linear_model.html#lasso"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09003","difficulty":"medium","orderIndex":3,"question":"BatchNorm is applied to a layer's pre-activations. During training, BN uses batch statistics (mean and variance of the current batch). During inference, it uses running statistics (exponential moving average from training). A deployed model that was trained with batch_size=256 is called with batch_size=1 at inference. Your colleague says \"the inference statistics will be wrong since there's only 1 sample.\" Who is correct?","options":{"A":"The colleague is correct — BatchNorm requires batch_size > 1 during inference","B":"The colleague is wrong — during inference, BatchNorm uses stored running statistics (mean and variance accumulated during training), not the current batch's statistics. Batch_size=1 at inference is completely valid because the normalization is applied using the training-time population statistics, not the current sample's statistics","C":"The colleague is correct, but the fix is to use instance normalization at inference time","D":"Both are correct — running statistics are used but become unreliable for batch_size=1"},"correct":"B","explanation":{"correct":"- BatchNorm training mode: normalize using current batch's mean/var. Also updates running_mean and running_var via exponential moving average.\n- BatchNorm eval mode (`model.eval()`): normalize using stored running_mean and running_var. 
The current batch's statistics are completely ignored.\n- A single sample at inference: y_normalized = (x - running_mean) / √(running_var + ε). This is a deterministic transformation using population statistics. The output doesn't depend on whether other samples are in the batch.","A":"Batch_size > 1 is only required during training (for meaningful batch statistics). At inference in eval mode, any batch size works — even batch_size=1.","B":"","C":"Switching to instance normalization at inference is unnecessary and would change the model's behavior (instance norm uses per-sample statistics, not the trained population statistics). This would require retraining.","D":"Running statistics are accumulated during training as an exponential moving average and are fixed in eval mode, so they are equally reliable for any inference batch size. A single inference sample doesn't affect the statistics used for normalization."},"reference":"- PyTorch BatchNorm documentation: https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09004","difficulty":"medium","orderIndex":4,"question":"A Transformer-based language model uses LayerNorm instead of BatchNorm. During training with a batch of sequences, LayerNorm normalizes across the feature dimension for each sample independently. A team switches to BatchNorm to improve training stability. After the switch, training becomes more unstable and generation quality drops. What is the fundamental incompatibility?","options":{"A":"BatchNorm is slower than LayerNorm for Transformer architectures","B":"BatchNorm normalizes across the batch dimension (mean/var computed over all batch samples at each position). In NLP, different samples in a batch have different sequence lengths and semantics — mixing statistics across samples can corrupt the representation. 
More critically, at inference time with single samples or variable-length sequences, BatchNorm's running statistics (computed over batch-aggregated features) don't match the per-sample feature distributions. LayerNorm normalizes within each sample independently, making it batch-size-agnostic","C":"BatchNorm requires 2D inputs; Transformer hidden states are 3D (batch, seq, d_model)","D":"LayerNorm includes learnable parameters that BatchNorm lacks, causing the model to lose expressivity"},"correct":"B","explanation":{"correct":"- LayerNorm: for a token at position (batch_idx, seq_pos): normalize across the d_model dimension using that single token's mean and variance. Each token is normalized by its own statistics.\n- BatchNorm at a given layer: for each feature dimension d, compute mean/var over all samples×positions in the batch. This mixes statistics from semantically unrelated tokens (e.g., \"cat\" in one sentence and \"quantum\" in another). These have different semantic content but are normalized together.\n- The critical inference problem: at inference with batch_size=1 and different sequence lengths, running statistics accumulated from diverse training batches may not reflect the statistics of any individual input distribution.","A":"Computational speed is not the primary concern. LayerNorm and BatchNorm have similar complexity. The fundamental issue is correctness of statistics for NLP inputs.","B":"","C":"PyTorch BatchNorm has variants for 1D, 2D, and 3D inputs. BatchNorm1d handles (batch, features) or (batch, features, seq_len). 3D inputs are technically handleable, not the root issue.","D":"Both BatchNorm and LayerNorm have learnable scale (γ) and shift (β) parameters. 
They have comparable expressivity through these parameters."},"reference":"- Ba et al., \"Layer Normalization\" (2016): https://arxiv.org/abs/1607.06450\n- Vaswani et al., \"Attention Is All You Need\" (uses LayerNorm): https://arxiv.org/abs/1706.03762"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09005","difficulty":"medium","orderIndex":5,"question":"A team trains a model with Dropout (p=0.5) and at inference, runs the model in training mode (forgot to call `model.eval()`). The model's predictions are noisy and inconsistent for the same input. The team also notices the model seems to perform slightly differently than expected. What is the technical issue and what are the two consequences?","options":{"A":"Training mode has no effect on Dropout during inference","B":"Two consequences: (1) Stochasticity — Dropout randomly drops 50% of neurons on each forward pass. The same input produces different outputs on different calls. (2) Scale mismatch — during training, Dropout scales outputs by 1/(1-p) = 2× (inverted Dropout) to keep expected values consistent. If the model uses standard (non-inverted) Dropout, inference in train mode produces halved expected outputs compared to eval mode. Modern frameworks use inverted Dropout during training, so issue 1 (noise) is the main practical problem","C":"Training mode causes BatchNorm to use batch statistics, affecting only BatchNorm layers","D":"Dropout in training mode is identical to not using Dropout at all"},"correct":"B","explanation":{"correct":"- PyTorch implements \"inverted dropout\": during training, active neurons are scaled by 1/(1-p), so the expected output magnitude is the same as without Dropout. 
At eval mode, Dropout is disabled and no scaling is applied — this is consistent because training already scaled.\n- The stochasticity issue: running a model in training mode at inference means each call to model(x) produces a different output due to random neuron masking. For deterministic predictions (same x → same output), this is a critical bug.\n- This is a common production bug: forgetting `model.eval()` before inference. It can cause serious problems in A/B tests (non-deterministic results), deployment (different outputs from same input), and monitoring (unexplained prediction variance).","A":"Training mode does affect Dropout during inference — it keeps Dropout active, introducing stochasticity. This is the core bug.","B":"","C":"BatchNorm behavior in training vs eval mode is a separate concern (different statistics), not related to Dropout's effect. The question specifically asks about Dropout consequences.","D":"Dropout in training mode is NOT equivalent to no Dropout — it randomly deactivates neurons, reducing effective model capacity and adding noise to each forward pass."},"reference":"- PyTorch nn.Dropout documentation: https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html\n- model.eval() vs model.train(): https://pytorch.org/docs/stable/generated/torch.nn.Module.eval.html"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09006","difficulty":"medium","orderIndex":6,"question":"GroupNorm is used instead of BatchNorm for object detection with small batch sizes (e.g., 2 samples per GPU). 
You explain to a junior engineer: \"GroupNorm doesn't have the batch-size dependency problem.\" She asks: \"If GroupNorm normalizes across groups of channels, what determines the quality of GroupNorm statistics as batch size decreases?\"","options":{"A":"GroupNorm quality degrades with smaller batch sizes just like BatchNorm","B":"GroupNorm normalizes within each sample and group (mean/var computed over the C/num_groups channels of each group, and their spatial positions, within one sample). Its statistics depend on the number of channels per group (C/num_groups), not the batch size. With batch_size=1, GroupNorm produces meaningful statistics as long as the group size is large enough (typically num_groups=32; for C=512 this gives 16 channels per group). BatchNorm with batch_size=1 is unreliable, and degenerate (variance=0) when there is only one element per channel","C":"GroupNorm requires the number of groups to equal the batch size","D":"GroupNorm and BatchNorm are identical for batch_size=32; GroupNorm is only different for batch_size=1"},"correct":"B","explanation":{"correct":"- GroupNorm: for input (batch, C, H, W), divide C channels into G groups. For each (sample, group): compute mean and var over (C/G × H × W) elements. Statistics are computed within a single sample — batch size doesn't affect them.\n- BatchNorm: for each channel, compute mean/var over (batch × H × W) elements. With batch_size=1: mean and var computed over 1×H×W elements per channel (legitimate for images but the running statistics update is noisy). For batch_size=1 in NLP (1 sample, 1 position): mean and var computed over 1 element — meaningless (var=0).\n- GroupNorm is a common choice for detection/segmentation where batch sizes must be small (high-resolution inputs fill GPU memory). Wu & He (2018) report that Mask R-CNN trained with GroupNorm outperforms its BatchNorm-based counterparts on COCO.","A":"This is the key distinction between GroupNorm and BatchNorm. GroupNorm's statistics are independent of batch size — this is its defining advantage.","B":"","C":"GroupNorm groups the channel dimension, not the batch dimension. 
The number of groups is a fixed hyperparameter (typically 32) that doesn't change with batch size.","D":"GroupNorm and BatchNorm are mathematically different for all batch sizes. For large batches, BatchNorm estimates population statistics better (more samples), while GroupNorm always uses the same within-sample statistics."},"reference":"- Wu & He, \"Group Normalization\" (2018): https://arxiv.org/abs/1803.08494"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09007","difficulty":"hard","orderIndex":7,"question":"RMSNorm (used in LLaMA, Mistral, and most modern LLMs) computes: y = x / RMS(x) * γ, where RMS(x) = √(mean(x²)). A researcher says: \"RMSNorm is strictly worse than LayerNorm because it doesn't center the activations.\" Is this correct?","options":{"A":"Yes — zero-centering is essential for normalization to be effective","B":"No — RMSNorm drops the mean-centering step but retains the scale normalization. In practice, Transformer hidden states often have near-zero mean already (due to attention patterns and residual connections). The mean-centering step in LayerNorm adds computational overhead (computing mean, subtracting) without meaningful benefit. RMSNorm achieves similar training stability with fewer operations (~20% faster per layer), which is significant at scale","C":"Yes — RMSNorm causes gradient vanishing because the mean is not removed","D":"No — RMSNorm is mathematically equivalent to LayerNorm for all inputs with non-zero mean"},"correct":"B","explanation":{"correct":"- LayerNorm: x̂ = (x - μ) / σ · γ + β, where μ = mean(x), σ = std(x). Two operations: mean subtraction and variance scaling.\n- RMSNorm: x̂ = x / RMS(x) · γ (no mean subtraction, no β bias term). 
Only one operation: scale by inverse root mean square.\n- The empirical finding (Zhang & Sennrich, 2019): on tasks such as neural machine translation, RMSNorm achieves performance comparable to LayerNorm at reduced compute cost. LLaMA, Mistral, Gemma, and other modern LLMs have since adopted RMSNorm. The centering step seems less important than the scale normalization.","A":"Zero-centering is beneficial in some settings (e.g., CNNs where feature distributions are asymmetric), but not universally essential. In Transformer residual streams, activations naturally tend toward zero mean due to the residual connection structure.","B":"","C":"Gradient flow through RMSNorm is similar to LayerNorm. The gradient of RMSNorm is well-defined and doesn't cause vanishing gradients. In practice, RMSNorm networks (LLaMA-7B etc.) train stably without additional interventions.","D":"RMSNorm and LayerNorm are not equivalent. LayerNorm subtracts the mean before scaling; RMSNorm does not. For any input with non-zero mean, these produce different outputs."},"reference":"- Zhang & Sennrich, \"Root Mean Square Layer Normalization\" (2019): https://arxiv.org/abs/1910.07467\n- Touvron et al., \"LLaMA 2\" uses RMSNorm: https://arxiv.org/abs/2307.09288"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09008","difficulty":"hard","orderIndex":8,"question":"You train a model with L2 regularization (weight decay λ=0.01) and observe that some layers have consistently large weights (||W||² >> 1) even after training. You increase λ to 0.1. Now all weights are small (||W||² ≈ 0.01) but validation performance drops significantly. What is the diagnosis and fix?","options":{"A":"The model is too large; reduce the number of layers","B":"The large weights in specific layers likely encode critical task-relevant features — the model needs those large weights to represent its learned transformation. 
Increasing λ uniformly across all layers penalizes these important weights as heavily as regularization targets (weights that should be small). Fix: layer-wise λ — apply stronger regularization to early layers (often have redundant features) and weaker to final classification layers, or use gradient-based methods to identify which weights should be large","C":"L2 regularization should never be applied uniformly; replace with L1 which is more selective","D":"The validation drop indicates the model was already at optimal capacity; any regularization hurts"},"correct":"B","explanation":{"correct":"- Not all weights in a network play equal roles. Weights in final classification layers often need to be large to sharply separate class probabilities. Weights in intermediate feature extraction layers may be legitimately large for important features (e.g., edge detectors in CNNs have large weights in the dominant orientation directions).\n- Uniform λ penalizes all weights equally, ignoring their semantic importance. This is the key limitation of global weight decay.\n- In practice: LLM fine-tuning often applies weight decay only to specific parameter groups (not biases, not normalization parameters). PyTorch's AdamW allows per-parameter-group λ: `optimizer = AdamW([{'params': early_layers, 'weight_decay': 0.1}, {'params': final_layer, 'weight_decay': 0.001}], lr=1e-3)`.","A":"The issue is not model size but regularization strength calibration. A model with many layers can still need specific large weights for specific tasks.","B":"","C":"L1 regularization produces sparsity (zero weights), not just small weights. 
For the described problem (some layers needing large weights), L1 would zero those critical weights, making the problem worse.","D":"If the model performs well without λ=0.1, it is not \"at optimal capacity.\" The issue is over-regularization, not perfect capacity utilization."},"reference":"- PyTorch AdamW parameter groups: https://pytorch.org/docs/stable/optim.html#per-parameter-options"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09009","difficulty":"hard","orderIndex":9,"question":"BatchNorm has been claimed to work because it \"reduces internal covariate shift\" (Ioffe & Szegedy, 2015). A subsequent paper (Santurkar et al., 2018, \"How Does Batch Normalization Help Optimization?\") showed this explanation is incorrect. What does the Santurkar et al. paper argue is the actual reason BatchNorm helps, and what experiment did they use to disprove the covariate shift hypothesis?","options":{"A":"Santurkar showed BatchNorm reduces overfitting through regularization effects, not covariate shift","B":"Santurkar showed that even when \"noisy BatchNorm\" (adding noise to BN statistics to increase covariate shift) was applied, training was still faster and more stable than without BN. The actual mechanism is that BatchNorm smooths the loss landscape — it makes the loss function more Lipschitz (smaller changes in loss per unit change in weight) and makes the gradient more predictive (gradients are more consistent across steps). This smooth landscape allows larger learning rates and more stable optimization","C":"Santurkar showed covariate shift reduction is the real mechanism by providing mathematical proof","D":"Santurkar showed that LayerNorm is strictly better than BatchNorm for all architectures"},"correct":"B","explanation":{"correct":"- The covariate shift hypothesis: each layer's input distribution changes as previous layer weights update. BN stabilizes these distributions. 
The hypothesis predicts BN's benefit should correlate with reduced covariate shift.\n- Santurkar's experiment: they added random noise to BN's statistics (increasing covariate shift) and found training was still faster than no-BN. If covariate shift were the mechanism, more covariate shift should hurt training. It didn't.\n- The actual finding: BatchNorm makes the loss landscape smoother (both loss function and its gradient). Smoother loss: better-conditioned Hessian, more predictable gradients, less \"choppy\" optimization trajectory. This allows larger learning rates and explains why BN lets you train \"from anywhere in the loss landscape.\"","A":"While BN does have regularization effects (stochastic batch statistics add noise), this is a secondary finding. The primary mechanism Santurkar identifies is loss landscape smoothing.","B":"","C":"Santurkar explicitly disproves the covariate shift hypothesis — they do not confirm it. The paper is titled \"How Does Batch Normalization Help Optimization?\" and its contribution is providing an alternative explanation.","D":"LayerNorm vs BatchNorm comparison is a separate topic. Santurkar's paper doesn't claim LayerNorm superiority — it analyzes why BatchNorm helps, regardless of how it compares to alternatives."},"reference":"- Santurkar et al., \"How Does Batch Normalization Help Optimization?\" (2018): https://arxiv.org/abs/1805.11604"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09010","difficulty":"hard","orderIndex":10,"question":"A language model uses Pre-LN (LayerNorm before attention, before FFN) instead of Post-LN (LayerNorm after attention+residual). The model trains without warmup and doesn't explode. The same architecture with Post-LN requires careful warmup and learning rate tuning to avoid training instability. 
What property of Pre-LN creates this training robustness?","options":{"A":"Pre-LN is more computationally stable due to fewer matrix multiplications","B":"In Post-LN, the residual connection is added and then normalized: LN(x + sublayer(x)). The gradient with respect to x flows through the LN normalization, which can scale gradients in unpredictable ways depending on the variance of the activations. In Pre-LN, the gradient with respect to x flows directly back through the residual path (x + sublayer(LN(x))): ∂L/∂x = ∂L/∂output · (I + ∂sublayer/∂LN_output · ∂LN/∂x). The \"I\" term is a direct gradient path that is never rescaled by LN. This keeps the gradient magnitude from collapsing regardless of how LN affects the sublayer path","C":"Pre-LN uses smaller weight matrices, requiring less precise initialization","D":"Pre-LN removes the need for residual connections, simplifying gradient flow"},"correct":"B","explanation":{"correct":"- Post-LN gradient: ∂L/∂x_in flows through LN(x_in + F(x_in)). The LN normalization scales the combined (signal + residual) output. Early in training with small weights, x_in dominates and LN approximately normalizes x_in, scaling gradients by 1/||x_in||. This can destabilize training.\n- Pre-LN gradient: x_out = x_in + F(LN(x_in)). ∂L/∂x_in = ∂L/∂x_out · (I + ∂F(LN(x_in))/∂x_in). The identity term I is the direct residual gradient path that is always present, always carries ∂L/∂x_out unchanged, and is never scaled by LN. This gives stable gradient flow even with random initialization.\n- This helps explain why many large language models (GPT-3, LLaMA, PaLM) use Pre-LN-style normalization: it makes training far less sensitive to warmup and initialization choices.","A":"Pre-LN and Post-LN have the same number of matrix multiplications (the attention and FFN sublayers are identical). The difference is only in where LN is placed.","B":"","C":"Pre-LN doesn't change weight matrix sizes. Both use the same d_model × d_model weight matrices. 
The stability comes from gradient path properties, not matrix size.","D":"Pre-LN keeps residual connections. The LN is applied before the sublayer, and the original x is added after: x_out = x + sublayer(LN(x)). Residual connections are essential in both Pre-LN and Post-LN Transformers."},"reference":"- Xiong et al., \"On Layer Normalization in the Transformer Architecture\" (2020): https://arxiv.org/abs/2002.04745\n- GPT-3 technical report (uses Pre-LN): https://arxiv.org/abs/2005.14165"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09011","difficulty":"medium","orderIndex":11,"question":"A team applies both Dropout (p=0.3) and L2 regularization (λ=0.01) simultaneously. A colleague claims \"using both is redundant — they both prevent overfitting through the same mechanism.\" Is this correct?","options":{"A":"Yes — both Dropout and L2 are forms of noise injection and have identical effects","B":"No — they prevent overfitting through different mechanisms and can be complementary: Dropout prevents co-adaptation (neurons relying on each other by randomly disabling them during training); L2 prevents large weights by adding a penalty on weight magnitude. Dropout acts on activations (stochastically) while L2 acts on weights (deterministically). A network can have both, and tuning them jointly often outperforms either alone","C":"Yes — applying both reduces effective learning rate, which is the true mechanism of both regularizers","D":"No, but L2 should be removed when using Dropout as they create conflicting gradient signals"},"correct":"B","explanation":{"correct":"- Dropout mechanism: random deactivation of neurons during training prevents any individual neuron from becoming a \"master neuron\" that the network relies on. Forces distributed, robust representations. It's a structural regularizer.\n- L2 mechanism: directly penalizes large weight magnitudes via the loss term. 
Forces weights toward small values, preventing any single connection from dominating. It's a magnitude regularizer.\n- These address different failure modes: Dropout prevents co-adaptation (structural), L2 prevents weight explosion (magnitude). A model can overfit through either channel. Combined use: typical in modern architectures (e.g., BERT uses both Dropout within the Transformer blocks and weight decay in AdamW).","A":"The mechanisms are fundamentally different. Dropout adds structured activation noise; L2 adds a deterministic gradient penalty. Their mathematical formulations and effects on the loss landscape are distinct.","B":"","C":"Neither Dropout nor L2 reduces the learning rate directly. Dropout reduces the effective number of active neurons per step (computational sparsity), and L2 adds a gradient term (λ·w) that adds to the weight gradient. Neither multiplies the learning rate.","D":"L2 and Dropout gradients don't conflict. In the backward pass, L2 adds λ·w to the weight's gradient, and Dropout masks certain activation gradients. They operate on different quantities (weights vs activations) and sum independently."},"reference":"- Srivastava et al., \"Dropout\" (2014): discusses interaction with other regularization"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09012","difficulty":"easy","orderIndex":12,"question":"Instance Normalization (IN) normalizes each sample-channel pair independently across spatial dimensions, while Group Normalization (GN) normalizes across groups of channels within a sample. For which task is Instance Normalization specifically preferred and why?","options":{"A":"Instance Normalization is preferred for text classification tasks","B":"Instance Normalization is preferred for style transfer and image generation tasks. IN normalizes per-sample per-channel, removing per-channel mean and variance (the \"style\" information). 
Adaptive Instance Normalization (AdaIN) replaces IN statistics with those of a style image, transferring style while preserving content. This makes IN the right choice when you want to manipulate or remove per-channel statistics (style = global channel statistics in artistic style transfer)","C":"Instance Normalization is preferred for batch_size=1 inference in all tasks","D":"Instance Normalization is preferred for any task involving long sequences"},"correct":"B","explanation":{"correct":"- Per-channel mean and variance capture the \"style\" of an image (color distribution, texture frequency). IN normalizes these away, producing a style-invariant representation.\n- AdaIN (Huang & Belongie, 2017): IN(content) with scale/shift from style image statistics. The content image's spatial structure (content) is preserved, but the channel statistics (style) are replaced by the style image's statistics. This is the mechanism behind fast neural style transfer and StyleGAN.\n- Group Norm or Batch Norm is preferred for classification tasks where absolute channel statistics carry semantic information (edge strength, color distribution helps identify objects).","A":"Text classification doesn't have spatial dimensions. Instance Normalization is primarily designed for 2D feature maps (convolutional layers on images). Layer Normalization is the standard for text.","B":"","C":"While IN works at batch_size=1 (like GroupNorm), batch_size=1 compatibility is not its primary advantage. For general inference at batch_size=1, GroupNorm is preferred; IN is specifically motivated by its style normalization property.","D":"Long sequences don't define IN's use case. 
Temporal/sequential data uses recurrent models with Layer Normalization, not Instance Normalization."},"reference":"- Huang & Belongie, \"Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization\" (2017): https://arxiv.org/abs/1703.06868\n- Ulyanov et al., \"Instance Normalization: The Missing Ingredient for Fast Stylization\" (2017): https://arxiv.org/abs/1607.08022"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09013","difficulty":"medium","orderIndex":13,"question":"A ResNet model uses BatchNorm between each convolutional layer. When you fine-tune only the last two layers (freezing all earlier layers), training is unstable — loss oscillates and doesn't converge. A colleague says \"freeze the BatchNorm layers too.\" Why would this help?","options":{"A":"Frozen BatchNorm prevents GPU memory overflow during fine-tuning","B":"Frozen earlier layers' BatchNorm layers are still in training mode, updating running_mean and running_var based on the fine-tuning dataset's statistics. If the fine-tuning data has different distribution than pre-training data, the running statistics diverge from what the frozen earlier-layer weights expect, corrupting the intermediate representations. Freezing BN layers (calling bn.eval()) prevents statistics from updating, keeping the earlier layers' transformation consistent","C":"BatchNorm layers must always be frozen when any other layer is frozen — they are linked","D":"Unfrozen BatchNorm increases the effective learning rate for all layers, causing instability"},"correct":"B","explanation":{"correct":"- The problem: Layer 3 (frozen weights) was trained with pre-training statistics. Its frozen weights W₃ were optimized assuming BN₃ would normalize using pre-training statistics (mean=μ₁, var=σ₁²). Fine-tuning updates BN₃'s running stats to (μ₂, σ₂²). 
Now W₃ is computing outputs based on incorrect normalization: the weights were calibrated for μ₁ but BN₃ is now normalizing with μ₂.\n- Fix: when freezing layers, also freeze their associated BatchNorm layers by iterating over `model.modules()` and calling `m.eval()` on every `nn.BatchNorm2d` instance. Do this after each `model.train()` call, because `train()` switches every submodule back to training mode.\n- This is standard practice in transfer learning with ResNets: freeze BN stats in all frozen layers to preserve the pre-trained feature space.","A":"BatchNorm in training mode does not increase memory usage significantly. Running statistics are stored as buffers, not gradients. Memory overflow from fine-tuning is caused by large batch sizes or too many trainable parameters.","B":"","C":"BatchNorm and weight layers are not inherently linked. You can freeze weights without freezing BN, but as explained, this causes the inconsistency described. The connection is functional, not architectural.","D":"BatchNorm does not increase effective learning rate. The BN statistics update (exponential moving average) is not an optimizer step and doesn't affect weight learning rates."},"reference":"- PyTorch BatchNorm fine-tuning best practices: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09014","difficulty":"hard","orderIndex":14,"question":"You train two identical architectures: one with data augmentation (random crop, flip, color jitter) and one with Dropout. Both achieve similar validation accuracy. 
A reviewer says \"they are equivalent regularization techniques.\" What is the fundamental difference in what they regularize?","options":{"A":"Data augmentation is always strictly better than Dropout","B":"They regularize fundamentally different aspects: data augmentation regularizes the input space — it teaches the model that certain transformations (flips, crops) should not change the prediction, encoding specific invariances (translation, scale, color). Dropout regularizes the model's weight space — it prevents any single neuron from becoming essential, forcing the network to use distributed redundant representations. A model trained with augmentation may still overfit to specific neuron patterns; a model trained with Dropout may still fail on augmented inputs it never saw","C":"They are equivalent because both reduce effective dataset size by introducing uncertainty","D":"Dropout acts on the loss function while data augmentation acts on gradients"},"correct":"B","explanation":{"correct":"- Data augmentation: the model is trained on x, flip(x), crop(x), jitter(x) — all with the same label. The model learns that these transformations are semantically invariant. This directly encodes domain knowledge about what should be invariant, and is specific to the transformation type.\n- Dropout: randomly disables neurons. The model learns that individual neurons are not reliable and develops redundant features. This is architecture-level regularization, independent of input transformations.\n- In practice: combining both is optimal. Augmentation provides input-space invariances; Dropout prevents neural co-adaptation. Neither can substitute for the other: a high-resolution face recognition model needs both precise spatial features (not well-handled by Dropout alone) and robustness to pose/lighting (requires augmentation).","A":"Neither is universally \"strictly better.\" They address different problems. For very large datasets, augmentation often provides more benefit. 
For small datasets, Dropout is critical. For medical imaging with limited augmentation options, Dropout is more important.","B":"","C":"Data augmentation increases effective dataset size (by creating more training examples from each real example). Dropout reduces effective model capacity per training step. They work in opposite directions from the sample-count perspective.","D":"Both data augmentation and Dropout ultimately modify gradients (all training techniques do). Dropout multiplies activation gradients by the dropout mask; augmentation changes which input produces which gradient. The distinction is in what aspects of the learning problem they affect."},"reference":"- Shorten & Khoshgoftaar, \"A survey on Image Data Augmentation for Deep Learning\" (2019): https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0"},{"section":"deep-learning","topicSlug":"regularization-and-normalization","topic":"Regularization And Normalization","id":"dl-09015","difficulty":"hard","orderIndex":15,"question":"A team wants to use Dropout in a Transformer model but finds that applying standard Dropout to attention weights (the attention matrix before softmax) causes severe training instability. They switch to \"attention dropout\" (applied after softmax to the attention probabilities). Why is post-softmax application more stable?","options":{"A":"Post-softmax dropout requires fewer random number generations, reducing GPU overhead","B":"Pre-softmax dropout changes the relative magnitudes of attention logits, potentially creating asymmetric softmax distributions where some tokens receive abnormally high attention (if their competitors were dropped, the remaining logits compete differently). Post-softmax dropout zeros out entire attention connections (a token pair) while the remaining connections maintain their proper probability mass after renormalization in the next layer. 
This preserves the probabilistic interpretation of attention weights while still providing regularization","C":"Pre-softmax dropout causes gradient explosion in the Q·K^T matrix multiplication","D":"Attention dropout must always be applied before softmax; the team made a mistake by switching"},"correct":"B","explanation":{"correct":"- Pre-softmax dropout: zeros out some logits before softmax(logits/√d_k). With standard inverted dropout, a dropped logit becomes 0 rather than −∞, so the token is not actually removed (exp(0) = 1), while the surviving logits are scaled up by 1/(1−p). If the logit for token pair (i,j) was large (high attention) but gets dropped, the softmax redistribution shifts attention to other tokens. This can create artificially high attention on tokens that happened not to be dropped: random attention concentration, not semantically meaningful attention.\n- Post-softmax dropout: directly zeros out attention connections after they've been computed as meaningful probabilities. Each remaining connection retains its correct relative weight. The zeroed-out connections are just \"masked\": the model learns to function with any subset of attention connections active.\n- Post-softmax dropout is the convention in mainstream Transformer implementations (BERT, GPT, T5). Vaswani et al. (Section 5.4) describe dropout on the output of each sub-layer and on the sums of the embeddings and positional encodings; dropout on the attention probabilities is what implementations such as BERT expose as `attention_probs_dropout_prob`.","A":"Random number generation count is the same for both (one random mask of the same size). GPU overhead is not the distinction.","B":"","C":"Gradient explosion in Q·K^T is caused by large logit values or poor initialization, not by dropout. Dropout modifies the mask, not the magnitude of the matrix product.","D":"The team correctly switched to post-softmax dropout, which is the standard convention. 
Pre-softmax dropout is technically implementable but non-standard and less stable as explained."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): Section 5.4 (Regularization with dropout)\n- BERT: https://arxiv.org/abs/1810.04805 (attention_probs_dropout_prob applied after softmax)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10001","difficulty":"easy","orderIndex":1,"question":"A new engineer initializes all weights in a 5-layer fully connected network to 0. After 100 training epochs, the model achieves exactly random chance (10% for a 10-class problem). What went wrong?","options":{"A":"Zero initialization causes NaN during forward propagation","B":"Zero initialization causes the symmetry problem: all neurons in each layer compute the same output and receive the same gradient. All neurons in a layer update identically on every step. The model effectively has only 1 neuron per layer regardless of the layer width. With all weights zero, every hidden layer outputs 0, and the gradient for every neuron in a layer is identical, so they all update to the same non-zero value — they remain symmetric forever","C":"Zero initialization prevents the optimizer from computing gradients","D":"Zero initialization only fails for layers with more than 100 neurons"},"correct":"B","explanation":{"correct":"- Forward pass with W=0: all pre-activations are 0, all activations are 0 (or 0.5 for sigmoid). Every neuron in a layer computes the same output.\n- Backward pass: since all neurons in a layer produce the same output, the gradient flowing into each neuron from the next layer is the same. All neurons receive identical updates. After one step: all neurons update to the same new value — still symmetric.\n- This symmetry is never broken by gradient descent. The model has N neurons in a layer but effective rank 1 — all neurons always compute the same function. 
The model's expressive power is no better than a single neuron per layer.","A":"Zero initialization produces 0 pre-activations (finite values). Forward propagation doesn't produce NaN. The output can pass through softmax cleanly — NaN requires 0/0 or ±∞ operations.","B":"","C":"Gradients can be computed with zero weights. The gradient of the loss with respect to weights is non-zero as long as the input to the layer is non-zero. The issue is not gradient computation failure but gradient symmetry.","D":"The symmetry problem affects all zero-initialized layers regardless of width. Even a layer with 2 neurons initialized to zero will exhibit this behavior."},"reference":"- Goodfellow et al., \"Deep Learning\", Section 8.4 (Weight Initialization)\n- http://cs231n.github.io/neural-networks-2/#init"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10002","difficulty":"easy","orderIndex":2,"question":"Xavier (Glorot) initialization draws weights from a distribution with variance = 2/(fan_in + fan_out). Kaiming (He) initialization uses variance = 2/fan_in. When should you use each, and what assumption makes them different?","options":{"A":"Xavier is for convolutional layers; Kaiming is for fully connected layers","B":"Xavier is designed for symmetric activations (sigmoid, tanh) where the goal is to keep the variance of activations equal to the variance of inputs across layers. Kaiming is designed for ReLU-like activations which set half the inputs to 0 — halving the expected variance. Kaiming accounts for ReLU's effective variance reduction by using a 2× larger variance (2/fan_in instead of 1/fan_in). 
Use Xavier with sigmoid/tanh; use Kaiming with ReLU/Leaky ReLU","C":"Both are identical for modern architectures; use either interchangeably","D":"Xavier is for randomly initialized networks; Kaiming is for pretrained networks being fine-tuned"},"correct":"B","explanation":{"correct":"- Xavier derivation: for a layer with symmetric activation g(x) ≈ x near 0 (sigmoid, tanh): Var[output] = fan_in × Var[W] × Var[input]. Setting this equal to Var[input]: Var[W] = 1/fan_in. The symmetric 2/(fan_in+fan_out) formula balances forward and backward variance.\n- Kaiming derivation (He et al., 2015): ReLU(x) = max(0,x) zeros out half the activations. The expected squared output of ReLU(x) for zero-mean symmetric input x is E[ReLU(x)²] = 0.5 × E[x²] = 0.5 × Var[x]. To maintain variance through a ReLU layer: Var[W] = 2/fan_in (factor of 2 compensates for the 50% zeroing).\n- Wrong initialization in deep networks: using Xavier with ReLU → each ReLU layer halves variance → exponential variance decay over depth → vanishing gradients.","A":"Both can be used for convolutional and fully connected layers. The distinction is about activation function, not layer type. PyTorch's `nn.init.xavier_uniform_` and `nn.init.kaiming_uniform_` work for both layer types.","B":"","C":"They produce different variances and make different assumptions. Using Xavier with ReLU in a 20-layer network will likely cause vanishing activations (each layer loses 50% variance). The choice matters significantly.","D":"Both are for randomly initialized networks. 
Pre-trained networks are fine-tuned from existing weights, not reinitialized."},"reference":"- Glorot & Bengio, \"Understanding the difficulty of training deep feedforward neural networks\" (2010): https://proceedings.mlr.press/v9/glorot10a\n- He et al., \"Delving Deep into Rectifiers\" (2015): https://arxiv.org/abs/1502.01852"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10003","difficulty":"medium","orderIndex":3,"question":"You train a 10-layer network with ReLU activations and Kaiming initialization. Training is stable. A colleague changes the initialization to use std=0.001 (much smaller than Kaiming). What will happen to training and why?","options":{"A":"Nothing changes — the optimizer will adjust regardless of initialization","B":"With std=0.001, activations will rapidly approach 0 through the network. Layer 1: Var[a₁] ≈ 0.001²×fan_in×Var[x] << 1. By layer 10: activations are effectively 0 for all inputs. Gradients flowing backward through these near-zero activations will also be near-zero. The model will appear to train (loss is computed) but weights will barely update — the network is in an effective \"dead zone\" from the start","C":"Smaller initialization is always better as it prevents gradient explosion","D":"Kaiming initialization with std=0.001 is equivalent to L2 regularization"},"correct":"B","explanation":{"correct":"- Kaiming std for ReLU: std = √(2/fan_in). For fan_in=512: std ≈ 0.063. Using std=0.001 is ~63× smaller.\n- Forward propagation: each layer computes h_{l} = ReLU(W_{l} h_{l-1}). With very small W, the pre-activations are tiny. After 10 layers of multiplying tiny values, activations are effectively zero (numerical underflow or below learning threshold).\n- Backward propagation: gradients flow through ∂h_l/∂W_l = h_{l-1}. If h_{l-1} is near 0, the weight gradient ≈ 0. 
This creates a self-reinforcing failure: small weights → small activations → small gradients → weights never update.","A":"The optimizer applies updates proportional to gradients. If gradients are ~0 due to poor initialization, the optimizer can't correct the problem — it can't \"see\" what direction to move. A good optimizer cannot overcome initialization so bad that no gradient signal exists.","B":"","C":"Smaller initialization prevents gradient explosion, but taken to extremes, it causes vanishing gradients/activations. There is an optimal scale (Kaiming, Xavier) that balances both issues.","D":"Kaiming initialization and L2 regularization are completely unrelated. L2 regularization is a gradient penalty added during training. Initialization is the starting point of weights."},"reference":"- He et al., \"Delving Deep into Rectifiers\" (2015): Section 2.2 (variance analysis)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10004","difficulty":"medium","orderIndex":4,"question":"You initialize a 100-layer network with weights drawn from N(0, 1/fan_in) (no ReLU correction) and observe that the gradient norm at layer 1 is 10⁻²⁰ while the gradient norm at layer 100 is ~1. What specific problem is this and what mathematical property causes it?","options":{"A":"Exploding gradients — the gradient grows from layer 1 to layer 100","B":"Vanishing gradients — the gradient shrinks exponentially from layer 100 to layer 1. The backward pass multiplies gradients by the weight matrix W^T at each layer. With weights initialized from N(0, 1/fan_in), the spectral norm of W is approximately 1, but repeated multiplication of ~100 such matrices has spectral norm ≈ σ^100 where σ is the average singular value. 
For σ slightly < 1, this decays exponentially (0.99^100 ≈ 0.37; 0.95^100 ≈ 0.006; 0.9^100 ≈ 2.7×10⁻⁵)","C":"A floating-point artifact: the 10⁻²⁰ value indicates numerical underflow","D":"The gradient norm difference is expected and correct: deep networks always have this pattern"},"correct":"B","explanation":{"correct":"- Gradient reaching layer l (ignoring activation derivatives): ∂L/∂h_l = (∏_{k=l+1}^{L} W_k^T) × ∂L/∂h_L, and ∂L/∂W_l is proportional to it. This product of weight matrices is the transposed Jacobian of layer L with respect to layer l.\n- If each weight matrix has spectral norm σ < 1: the product of 100 matrices has spectral norm ≤ σ^100. For σ = 0.9: 0.9^100 ≈ 2.7×10⁻⁵. This is the vanishing gradient phenomenon.\n- The specific initialization N(0, 1/fan_in) doesn't account for ReLU. With tanh (and Xavier): the variance is calibrated to keep ||h||² ≈ constant. Without ReLU correction (Kaiming), ReLU halves the variance per layer, causing activations (and thus gradients) to decay exponentially.","A":"The gradient grows in magnitude as you go from early layers (near input) to later layers (near loss), not the other way. Gradient at layer 100 is larger → gradients shrink going backward. This is the definition of vanishing gradients.","B":"","C":"10⁻²⁰ is comfortably inside float32's normal range: the smallest positive normal value is ≈ 1.2×10⁻³⁸, and subnormals extend down to ≈ 1.4×10⁻⁴⁵. No underflow has occurred, so the tiny value is a genuine vanishing gradient, not a floating-point artifact.","D":"The gradient norm difference indicates a serious training problem: layers close to the input will not update meaningfully. 
This is not \"expected and correct.\""},"reference":"- Bengio et al., \"Learning Long-Term Dependencies with Gradient Descent is Difficult\" (1994)\n- Glorot & Bengio, \"Understanding the difficulty of training deep feedforward neural networks\" (2010)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10005","difficulty":"medium","orderIndex":5,"question":"A team trains a model in FP16 (half precision). During the first training step, the loss is NaN. They switch to BF16 and NaN disappears. Both have 16-bit precision; why does BF16 fix the NaN?","options":{"A":"BF16 has higher total precision than FP16; it uses more total bits","B":"FP16 has range [~6×10⁻⁵, ~65504]. BF16 has range [~10⁻³⁸, ~3×10³⁸] — same dynamic range as FP32. At initialization and in early training steps, weight gradients or intermediate activations can exceed FP16's max value (65504), causing overflow to Inf/NaN. BF16's much larger dynamic range (same exponent bits as FP32) prevents this overflow while accepting lower mantissa precision (7 bits vs 10 for FP16)","C":"BF16 automatically applies gradient clipping that FP16 doesn't have","D":"FP16 doesn't support negative numbers; BF16 does, which is required for gradients"},"correct":"B","explanation":{"correct":"- FP16 format: 1 sign bit, 5 exponent bits, 10 mantissa bits. Max value: 65504. Min positive normal: ~6.1×10⁻⁵.\n- BF16 format: 1 sign bit, 8 exponent bits (same as FP32!), 7 mantissa bits. Max value: ~3.4×10³⁸. Min positive: ~1.2×10⁻³⁸.\n- At initialization, with Kaiming init and ReLU, a single forward pass can produce activation magnitudes beyond 65504 in wide networks or with large fan_in. The gradient norms in the first step can also be very large. BF16's FP32-equivalent exponent range prevents these values from overflowing.\n- Trade-off: BF16 has only 7 mantissa bits (vs 10 for FP16), meaning lower fractional precision. 
But for training stability, dynamic range matters more than mantissa bits.","A":"Both FP16 and BF16 use exactly 16 bits total. BF16 doesn't have \"higher total precision\" — it trades mantissa precision for dynamic range.","B":"","C":"BF16 has no built-in gradient clipping. Gradient clipping is a separate technique applied explicitly in the training loop. BF16's advantage is its number representation range.","D":"Both FP16 and BF16 support negative numbers (via the sign bit). IEEE 754 floating point formats always support negative numbers. Gradients are regularly negative in both formats."},"reference":"- NVIDIA BF16 explanation: https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/\n- PyTorch Automatic Mixed Precision: https://pytorch.org/docs/stable/amp.html"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10006","difficulty":"hard","orderIndex":6,"question":"You initialize a Transformer's token embedding matrix with N(0, 1) and the output projection matrix (embedding → logits) with Kaiming initialization. The first forward pass produces NaN logits. After investigating, you find the pre-softmax logits have values of order 10⁵ — 10⁶. What is the root cause?","options":{"A":"The softmax function is numerically unstable for all large values","B":"The token embedding N(0,1) has std=1 for vectors of dimension d_model. The expected L2 norm of an embedding vector is √d_model. For d_model=768: ||e|| ≈ √768 ≈ 27.7. Multiplied through multiple Transformer layers and the output projection (Kaiming std ≈ √(2/d_model) ≈ 0.05 for d_model=768), the final logit magnitude is ||W_out|| × ||h|| ≈ (0.05)^{1/2} × ... However, the embedding vectors with norm 27.7 entering the Transformer immediately amplify activations by 27.7× — any subsequent Kaiming-initialized layer that assumes unit-norm inputs produces amplified outputs. 
The fix is to initialize embeddings with std = 1/√d_model","C":"NaN only occurs when both embedding and output projection use the same initialization","D":"The output projection should use zero initialization for the first step"},"correct":"B","explanation":{"correct":"- Standard initialization for embedding matrices in Transformers: Vaswani et al. (2017) explicitly multiply embeddings by √d_model to scale them; GPT-2 uses N(0, 0.02²), i.e. std = 0.02. The key: embedding vectors shouldn't have O(1) components; they should have ||e|| ≈ O(1), not O(√d_model).\n- With N(0,1) embeddings and d_model=768: embedding norm ≈ √768 ≈ 27.7. After LayerNorm (which normalizes to unit variance across features), this gets corrected at the first LN layer. But between the embedding and the first LN, the attention computation Q·K^T computes dot products of vectors with norm ≈ 27.7, producing unscaled scores of up to 27.7² ≈ 768 (≈ 27.7 after the 1/√d_model scaling). That is already large enough to saturate softmax, and subsequent operations amplify it further.\n- Correct practice: `embedding = nn.Embedding(vocab_size, d_model); nn.init.normal_(embedding.weight, std=1/math.sqrt(d_model))` or use built-in scaling.","A":"Softmax is numerically stable when implemented as softmax(x - max(x)). Large logits don't produce NaN with numerically stable implementations. The NaN comes from the logits before softmax.","B":"","C":"NaN occurs due to magnitude mismatch between embedding scale (O(√d_model)) and the network's expected input scale (~O(1)). It's not about using the same initialization type.","D":"Zero output projection would produce all-zero logits (then uniform softmax, not NaN). 
Zero projection prevents learning but doesn't produce NaN."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3.4 (Embeddings and Softmax)\n- GPT-2 initialization: Radford et al., \"Language Models are Unsupervised Multitask Learners\" (2019)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10007","difficulty":"hard","orderIndex":7,"question":"You train a 50-layer ResNet. After 1000 steps, the loss suddenly spikes from 0.5 to 12.0, then slowly recovers over the next 2000 steps. This loss spike pattern repeats twice more during training. What is causing this and what initialization/training fix prevents it?","options":{"A":"The loss spikes are caused by bad batches in the training data","B":"Loss spikes in deep networks are typically caused by catastrophic gradient updates when the gradient norm temporarily exceeds the optimizer's ability to compensate. Common causes: (1) occasional large-norm batches amplified by large LR; (2) interactions between adaptive optimizer momentum and learning rate schedule (e.g., cosine restarts where LR suddenly increases); (3) in models with residual connections, if residual branch outputs are not properly scaled at initialization (zero-init of the last residual layer), early training can produce unstable activations. Fix: (1) gradient clipping; (2) μP (maximal update parameterization) initialization; (3) zero-init the last conv/linear in each residual block","C":"Loss spikes indicate NaN weights that are automatically recovered by the framework","D":"Loss spikes cannot occur in ResNets due to skip connections"},"correct":"B","explanation":{"correct":"- Zero-init of residual branch: in ResNets, initializing the last layer of each residual block to zero means the residual block outputs 0 at initialization. The network starts as a pure linear chain (only skip connections). 
This prevents early training instability.\n- Gradient clipping: max_norm gradient clipping prevents any single step from making very large weight updates, reducing spike likelihood.\n- μP (Maximal Update Parameterization, Yang et al., 2022): parameterizes weights so that update sizes stay O(1) as width grows, which lets hyperparameters transfer across model sizes and reduces (though it does not guarantee eliminating) loss spikes. It has been used in some large-model training runs, for example at Cerebras.\n- Cosine restarts (SGDR): if LR restarts at a high value, this can cause temporary loss spikes. Warm restarts should start at a smaller LR than the previous cycle's peak.","A":"Occasional bad batches do cause temporary loss increases, but recoverable spikes of magnitude 0.5 → 12.0 are disproportionate to batch-level noise. Batch-level variations typically cause loss fluctuations of 0.01-0.1, not a 24× jump.","B":"","C":"NaN weights are permanent (NaN + anything = NaN). A model with NaN weights cannot recover — subsequent steps would all produce NaN. The described pattern (spike then recovery) indicates large but finite gradients, not NaN.","D":"Skip connections reduce vanishing gradients but don't prevent loss spikes. A ResNet can still have loss spikes from large gradient updates if not mitigated."},"reference":"- Yang et al., \"Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (μP)\" (2022): https://arxiv.org/abs/2203.03466\n- Zhang et al., \"Fixup Initialization: Residual Learning Without Normalization\" (2019): https://arxiv.org/abs/1901.09321"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10008","difficulty":"hard","orderIndex":8,"question":"Mixed precision training (FP16 compute, FP32 master weights) uses a technique called \"loss scaling.\" A model trained without loss scaling in FP16 shows correct forward pass but zero gradients for all parameters. 
Why?","options":{"A":"Loss scaling is only needed for the forward pass, not the backward pass","B":"In FP16, the smallest positive normal value is ~6×10⁻⁵ (subnormals extend down to ~6×10⁻⁸, with reduced precision). During backpropagation, gradient values are often much smaller than this (especially in early layers of deep networks or for small gradients in residual branches). These gradients underflow to 0 in FP16. Loss scaling multiplies the loss by a large constant (e.g., 2¹⁵=32768) before the backward pass, scaling all gradients by 32768. This pushes tiny gradient values into FP16's representable range. After the backward pass, gradients are divided by 32768 before applying to the FP32 master weights","C":"Loss scaling is needed because FP16 cannot represent negative numbers","D":"Zero gradients in FP16 training are caused by the Adam optimizer, not number format limitations"},"correct":"B","explanation":{"correct":"- FP16 range gap: the smallest positive normal value in FP16 is ~6.1×10⁻⁵. In FP32, it's ~1.2×10⁻³⁸. Gradients in deep networks at later training stages (or in early layers) can easily be 10⁻¹⁰ or smaller — representable in FP32 but underflowing to 0 in FP16.\n- Loss scaling mechanics: if loss_scaled = loss × S (S=32768), then gradient_scaled = gradient × S. Values that were 10⁻¹⁰ become 3.3×10⁻⁶ — still small but within FP16's subnormal range. After backpropagation, gradients are unscaled: gradient = gradient_scaled / S. The master FP32 weights are updated with the correctly-scaled gradients.\n- Automatic loss scaling (ALS): PyTorch's `GradScaler` dynamically adjusts S: it increases S after a run of consecutive steps with finite gradients, and decreases S (skipping that optimizer step) if overflow (Inf/NaN) is detected.","A":"Loss scaling is specifically designed for the backward pass (gradient computation). The forward pass in FP16 is less susceptible to underflow because activation values are typically larger than gradients.","B":"","C":"FP16 does represent negative numbers (via the sign bit). 
Both FP16 and FP32 support negative numbers in IEEE 754 format.","D":"Adam has bias correction for its moment estimates, but this does not address FP16 underflow. The zero-gradient problem is due to number format limitations, not optimizer choice."},"reference":"- Micikevicius et al., \"Mixed Precision Training\" (2018): https://arxiv.org/abs/1710.03740\n- PyTorch GradScaler: https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10009","difficulty":"hard","orderIndex":9,"question":"You fine-tune a pretrained ResNet-50 by replacing the final classification layer with a new one for 200 classes (original was 1000 classes). You randomly initialize the new layer with Kaiming. After 10 epochs, the frozen layers' BatchNorm running statistics have drifted significantly. What is the source of this drift and how does it indicate a training bug?","options":{"A":"BatchNorm running statistics are always updated by fine-tuning; this is expected behavior","B":"If earlier ResNet layers are frozen (weights don't update), their BatchNorm layers should also be frozen (set to eval mode). Running statistics are updated by BatchNorm layers in train mode, regardless of whether the weights are frozen. If you call model.train() without explicitly setting earlier BN layers to eval mode, BN updates its running mean/var using the fine-tuning data statistics — which may differ significantly from ImageNet pretraining statistics. 
The frozen weights no longer produce the activations they were calibrated for, corrupting the pretrained feature representations","C":"Running statistics drift is caused by the randomly initialized final layer producing incorrect upstream gradients","D":"BatchNorm running statistics only drift when learning rate is too high"},"correct":"B","explanation":{"correct":"- Frozen weights + training-mode BN: a frozen conv layer with W produces activations for the new dataset. The new dataset may have different image statistics (domain shift). BN's running mean/var are updated to match these new activation statistics — diverging from ImageNet statistics.\n- The resulting issue: the frozen earlier-layer weights were optimized with the assumption that BN would normalize using ImageNet statistics. With updated statistics, each BN layer's effective transformation changes: γ × (x - new_mean) / new_std + β ≠ γ × (x - original_mean) / original_std + β. The carefully learned features of the frozen layers are now distorted.\n- Correct fine-tuning: call `model.eval()` first, then set only the layers you want to fine-tune to `train()`. This freezes both weights AND BN statistics for frozen layers.","A":"Running statistics should only be updated in layers where you want the BN to adapt. For frozen layers, updating BN stats corrupts the pretrained representations. This is not expected behavior — it's a training bug.","B":"","C":"Gradients only flow to unfrozen parameters. The frozen earlier layers don't receive gradient updates. Running statistics are updated by the BN forward pass, not by gradients from the final layer.","D":"Running statistics are updated by the exponential moving average in BN's forward pass: running_mean = (1-momentum) × running_mean + momentum × batch_mean. 
This update happens regardless of learning rate."},"reference":"- PyTorch transfer learning tutorial: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10010","difficulty":"medium","orderIndex":10,"question":"A senior engineer proposes \"orthogonal initialization\" for a deep RNN: initializing the recurrent weight matrix as a random orthogonal matrix (Q where Q^T Q = I). What problem does this specifically solve compared to random Gaussian initialization for RNNs?","options":{"A":"Orthogonal matrices are faster to compute the matrix product for","B":"For an RNN, the hidden state update h_t = tanh(W_h h_{t-1} + W_x x_t) involves repeated multiplication by W_h. With random Gaussian initialization, the spectral norm of W_h can be > 1 or < 1 — causing exponential growth or decay in h over many steps. An orthogonal matrix has all singular values exactly 1 (it's an isometry), so ||W_h h||₂ = ||h||₂. The hidden state magnitude is preserved exactly across time steps, directly addressing the vanishing/exploding gradient problem in RNNs","C":"Orthogonal initialization prevents the symmetry-breaking problem that occurs in RNNs","D":"Orthogonal matrices ensure the RNN weights remain sparse during training"},"correct":"B","explanation":{"correct":"- RNN gradient through time: ∂h_t/∂h_0 = ∏_{k=1}^{t} W_h^T · diag(tanh'(a_k)). For long sequences (t=100+), this product determines gradient magnitude.\n- Orthogonal W_h: all singular values = 1 → spectral norm = 1 → the matrix multiplications don't amplify or attenuate vectors. Combined with tanh' (which ≤ 1), the gradient can only decrease (not explode), and decreases more slowly.\n- Gaussian W_h: singular values centered around √(1/fan_in) ≈ √(1/H). If even one singular value is slightly > 1 or < 1, repeated multiplication amplifies this. 
Over 100 time steps, σ^100 collapses toward 0 or blows up for σ ≠ 1 (0.9^100 ≈ 2.7×10⁻⁵, 1.1^100 ≈ 1.4×10⁴).","A":"Matrix multiplication speed depends on matrix dimensions, not whether the matrix is orthogonal. The computation time for W_h × h is identical whether W_h is orthogonal or Gaussian.","B":"","C":"The symmetry problem is avoided by random (non-zero) initialization of any kind. A random orthogonal matrix breaks symmetry just as a random Gaussian matrix does. This is not the specific benefit of orthogonal initialization.","D":"Random orthogonal matrices are generically dense (almost surely no zero entries), and nothing about orthogonality keeps weights sparse during training. Sparsity is typically induced by L1 regularization or pruning, not by orthogonal initialization."},"reference":"- Saxe et al., \"Exact solutions to the nonlinear dynamics of learning in deep linear networks\" (2013): https://arxiv.org/abs/1312.6120\n- Wisdom et al., \"Full-Capacity Unitary Recurrent Neural Networks\" (2016)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10011","difficulty":"medium","orderIndex":11,"question":"You compare two models at initialization (step 0): Model A uses Kaiming init, Model B uses a zero-mean Gaussian with std=0.01. You compute the loss for both on the same batch. Model A has loss ≈ ln(10) ≈ 2.3 (for 10-class cross-entropy). Model B has loss ≈ 2.3 as well. A junior engineer says \"they're identical at initialization.\" What is wrong with this assessment?","options":{"A":"The losses are identical because both initializations satisfy the uniform prior condition","B":"The initial loss ≈ 2.3 = ln(10) means both models output near-uniform class probabilities (expected for a random classifier on 10 classes). However, this doesn't mean the models are equivalent. Model A (Kaiming) has appropriately scaled weights that will produce meaningful gradient magnitudes through all layers. Model B (small std=0.01) has tiny weights that will cause near-zero activations in deep layers, vanishing gradients, and effectively zero weight updates from the first step. 
The loss metric only captures the model's output; it doesn't reflect the health of the gradient flow through the network","C":"Model B will immediately diverge because small weights cause numerical underflow","D":"Both models have identical gradient norms at initialization"},"correct":"B","explanation":{"correct":"- Why both have similar initial loss: With tiny weights (Model B), near-zero pre-activations → all activations ≈ constant → output logits are approximately equal → softmax ≈ uniform → loss = -ln(1/10) = ln(10) ≈ 2.3. With Kaiming (Model A), activations have proper scale but are random → output logits are random but centered near zero → softmax ≈ uniform → loss ≈ ln(10).\n- The critical difference: Model A's random logits have the right gradient magnitude (Kaiming ensures gradients flow through all layers). Model B's near-zero activations → near-zero gradients → weight updates ≈ 0 from step 1.\n- Diagnostics beyond loss: always check gradient norms per layer at initialization. A healthy initialization has similar gradient norms across layers. Model B would show gradient norms decaying exponentially toward input layers.","A":"The \"uniform prior condition\" is satisfied when the model outputs uniform probabilities. Both models achieve this, but the loss metric doesn't capture the gradient flow health.","B":"","C":"std=0.01 is finite and will produce finite activations (not numerical underflow for a few layers). The problem is training dynamics (vanishing gradients), not immediate NaN/overflow.","D":"Model A and Model B have very different gradient norms. Model B's gradient norms decay to near-zero through the network. Model A has well-scaled gradients throughout. 
This difference is what makes the initializations non-equivalent."},"reference":"- Glorot & Bengio, \"Understanding the difficulty of training deep feedforward neural networks\" (2010)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10012","difficulty":"hard","orderIndex":12,"question":"GPT-2 uses a modified initialization where output projection layers in attention blocks are scaled by 1/√N, where N is the number of residual layers. Explain why this specific scaling is used and what problem it solves.","options":{"A":"Scaling by 1/√N reduces the memory footprint of large models","B":"In a Transformer with N residual layers, the total output is a sum of N residual branch contributions: h_final ≈ x + Σ_{i=1}^{N} f_i(x). If each f_i contributes variance O(1), the sum has variance O(N) — growing with depth. By scaling each residual branch's output projection by 1/√N, each contribution has variance O(1/N), and the sum of N such terms has variance O(N × 1/N) = O(1). This keeps total activation variance bounded regardless of model depth, preventing the \"depth × variance amplification\" issue in very deep Transformers","C":"1/√N scaling is equivalent to reducing learning rate by 1/√N for deeper models","D":"The 1/√N scaling prevents loss spikes only during the first 100 training steps"},"correct":"B","explanation":{"correct":"- Variance analysis: if the Transformer is modeled as x_{L} = x_0 + Σ f_i(x), where each f_i is a residual block with output variance σ_f², then Var[x_L] = Var[x_0] + N·σ_f². For large N (e.g., GPT-3 has 96 layers), this grows linearly with N.\n- GPT-2 fix: initialize the output projection of each block (the final linear layer in attention and FFN) with N(0, 0.02²/N) (equivalently, scale by 1/√N). 
Each block's expected output variance is σ_f²/N, so the total is N × (σ_f²/N) = σ_f² — independent of depth.\n- This follows from Kaiming's variance analysis applied to residual networks: deeper networks need smaller per-layer variance to maintain the same total activation scale.","A":"Scaling initialization values doesn't affect model memory footprint (weights are stored the same way regardless of initialization scale). Memory is determined by the count of weights and their dtype.","B":"","C":"Initialization scaling and learning rate are related through training dynamics but are not equivalent. The 1/√N factor is applied at initialization and doesn't change the learning rate. The learning rate would need to be adjusted separately based on the resulting gradient scales.","D":"The initialization affects training stability throughout training (the initial variance determines how gradients flow from step 1 to convergence). It's not a \"first 100 steps\" effect — good initialization prevents persistent variance growth problems."},"reference":"- Radford et al., \"Language Models are Unsupervised Multitask Learners\" (GPT-2) (2019)\n- Brown et al., \"Language Models are Few-Shot Learners\" (GPT-3): Appendix B (initialization)"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10013","difficulty":"easy","orderIndex":13,"question":"Biases in neural networks are typically initialized to 0, but there is one important exception: biases in RNN/LSTM forget gates are often initialized to a large positive value (e.g., 1 or 2). Why?","options":{"A":"Large forget gate biases prevent gradient explosion in LSTMs","B":"The forget gate bias initialized to a large positive value causes the forget gate to be near 1 at the start of training (sigmoid(large positive) ≈ 1). This means the LSTM initially \"forgets nothing\" — it passes the full cell state forward. 
This gives the LSTM access to long-term memory from the start, mitigating the vanishing gradient problem in early training. If initialized to 0, forget gates start at 0.5 (sigmoid(0)), causing the model to lose 50% of the cell state at each step, effectively limiting memory length during early training","C":"Large forget gate biases are used to match the scale of other LSTM gates","D":"Forget gate biases should always be 0; using large values is an outdated practice"},"correct":"B","explanation":{"correct":"- Forget gate: f_t = σ(W_f h_{t-1} + U_f x_t + b_f). With b_f=1: f_t ≈ σ(1) ≈ 0.73 at initialization with small weights. With b_f=5: f_t ≈ 0.99.\n- Cell state update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ g_t. If f_t ≈ 0.5 (b_f=0): the cell state is halved at each step. Over 10 steps: C_0 × 0.5^10 ≈ C_0 × 0.001. The LSTM loses the initial cell state within 10 steps.\n- Jozefowicz et al. (2015) showed that initializing forget gate bias to 1 significantly improves LSTM performance on long sequence tasks. The idea: start with a strong prior that the previous state is relevant, let the network learn to forget when appropriate.","A":"Forget gate biases affect the flow of information through time, not the magnitude of gradients directly. Gradient explosion is addressed through gradient clipping or orthogonal initialization, not forget gate bias.","B":"","C":"LSTM gate biases are not matched to each other for scale reasons. Input, output, and forget gates serve different functions, and their biases are set for functional reasons (e.g., forget gate for memory retention), not scale consistency.","D":"This is a current, recommended practice. Keras's LSTM layer sets the forget gate bias to 1 by default (`unit_forget_bias=True`); PyTorch's `nn.LSTM` does not, so the bias has to be set manually there. The 2015 paper by Jozefowicz et al. 
and subsequent work have confirmed this benefit."},"reference":"- Jozefowicz et al., \"An Empirical Evaluation of Recurrent Network Architectures\" (2015): http://proceedings.mlr.press/v37/jozefowicz15.html"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10014","difficulty":"medium","orderIndex":14,"question":"You train a model with mixed precision (FP16/FP32). The model converges to val_loss=0.42 in FP32 but only achieves val_loss=0.55 in FP16 with loss scaling. A colleague says \"FP16 is strictly worse; go back to FP32.\" A second colleague says \"increase loss scale factor to fix the remaining gap.\" Who is right?","options":{"A":"The first colleague is right; FP16 fundamentally cannot match FP32 accuracy","B":"Neither is immediately correct. The val_loss gap (0.42 vs 0.55) likely indicates the model is sensitive to weight update precision. Try: (1) BF16 instead of FP16 (maintains FP32 dynamic range); (2) keep certain sensitive operations (softmax, LayerNorm) in FP32 while using FP16 elsewhere (\"mixed\" in mixed precision); (3) increase loss scale factor. Only if all FP16/BF16 variants fail should you revert to pure FP32. Many production models match FP32 accuracy with proper mixed precision implementation","C":"The second colleague is right; increasing the loss scale factor always closes the accuracy gap","D":"The gap is random noise; train for more epochs in FP16 to close it"},"correct":"B","explanation":{"correct":"- The 0.13 val_loss gap suggests weight updates in FP16 are losing precision in a way that affects final model quality. FP16's limited mantissa (10 bits) means weight updates smaller than weight × 2⁻¹⁰ are lost — gradients that would cause the weights to change in the 11th or more significant bit are completely ignored.\n- BF16 addresses dynamic range but still has fewer mantissa bits than FP32. 
The specific solution depends on which operation is losing precision.\n- Master FP32 weights: standard mixed precision keeps FP32 master weights and casts the FP16 gradients to FP32 before applying them. If this is already implemented and the gap persists, check that the optimizer state (Adam moments) is also in FP32.","A":"FP16 and BF16 regularly match FP32 accuracy in production. Large models (GPT-3, LLaMA, etc.) are trained with BF16/FP16 mixed precision and achieve competitive accuracy. The gap indicates a fixable configuration issue.","B":"","C":"Increasing the loss scale factor beyond the point where gradients stop underflowing provides no additional benefit. If gradients are already in the representable range (not underflowing), a larger scale causes overflow (Inf/NaN gradients) rather than improved precision.","D":"The gap is systematic (0.55 vs 0.42 is about 31% above the FP32 loss), not random noise. Random noise would cause variation around a central value, not a consistent directional gap."},"reference":"- Micikevicius et al., \"Mixed Precision Training\" (2018): https://arxiv.org/abs/1710.03740"},{"section":"deep-learning","topicSlug":"weight-initialization","topic":"Weight Initialization","id":"dl-10015","difficulty":"hard","orderIndex":15,"question":"You train a very wide network (width=4096) with standard Kaiming initialization and SGD. You observe that weight norms grow steadily during training: ||W|| at epoch 100 is 5× larger than at initialization. You suspect this will cause instability. 
A colleague says \"use weight normalization to fix this.\" Another says \"use L2 regularization (weight decay).\" Both are proposed solutions — what is the fundamental difference in how they address the problem?","options":{"A":"Weight normalization and L2 regularization are mathematically identical for SGD","B":"Weight decay directly penalizes ||W||² via the loss gradient (adds -λW to weight update), counteracting the growth by pulling weights toward smaller values — it's an additive correction to the gradient. Weight normalization reparameterizes W = g × v/||v|| (separating magnitude g from direction v), preventing ||W|| from growing unconstrained because direction and magnitude are updated independently. WN doesn't prevent ||g|| from growing but makes the direction v unit-norm. For the instability problem (growing magnitude), weight decay is more direct; WN addresses a different problem (optimizing over direction and magnitude separately for more stable optimization in deep networks)","C":"Both are identical; they both normalize weights to unit norm","D":"Weight normalization prevents weight growth by periodically resetting weights; L2 adds a penalty that makes training slower"},"correct":"B","explanation":{"correct":"- L2 (weight decay) gradient update: Δw = -η(∂L/∂w + λw). The λw term directly reduces weight magnitude at every step. For growing weights, the pull toward zero counteracts gradient-induced growth. At equilibrium: ||w|| is bounded by the balance between gradient-induced growth and weight decay.\n- Weight normalization (Salimans & Kingma, 2016): W = g × v/||v||. Update is now over g (scalar magnitude) and v (direction vector). ||v|| = 1 by construction (unit norm). The scale g can still grow, but the direction v never grows unbounded. WN provides scale-invariant gradient directions for v, improving training stability and conditioning — but doesn't cap ||g||.\n- For the specific problem (||W|| growing 5×), weight decay is the appropriate tool. 
WN is more relevant for fixing the optimization landscape (making it easier to optimize well-conditioned directions) than for capping weight magnitude growth.","A":"They are fundamentally different operations. Weight decay adds -λW to the gradient. Weight normalization reparameterizes the weight matrices. For SGD with weight decay, the explicit connection exists (L2 = weight decay), but WN is a reparameterization, not a gradient modification.","B":"","C":"Weight decay does not normalize weights to unit norm — it penalizes large norms without enforcing a specific norm value. Weight normalization ensures ||v|| = 1 for the direction component but allows g to be any value.","D":"Weight normalization doesn't reset weights periodically. It's a mathematical reparameterization applied throughout training. Periodic weight resets would be a completely different technique (not a standard one)."},"reference":"- Salimans & Kingma, \"Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks\" (2016): https://arxiv.org/abs/1602.07868"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11001","difficulty":"easy","orderIndex":1,"question":"A 2D convolutional layer has: kernel_size=3×3, in_channels=64, out_channels=128, stride=1, padding=1. The input is (batch=8, C=64, H=32, W=32). What is the output shape and total number of learnable parameters?","options":{"A":"Output: (8, 128, 32, 32), Parameters: 73,856","B":"Output: (8, 128, 32, 32), Parameters: 73,728","C":"Output: (8, 128, 30, 30), Parameters: 73,728","D":"Output: (8, 64, 32, 32), Parameters: 73,856"},"correct":"A","explanation":{"correct":"- Output spatial size: H_out = (H_in + 2P - K) / S + 1 = (32 + 2×1 - 3) / 1 + 1 = 32. Same for W. So output = (8, 128, 32, 32). 
✓\n- Parameters (weights): kernel_size² × in_channels × out_channels = 3×3×64×128 = 73,728 weight parameters.\n- Parameters (biases): one bias per output channel = 128.\n- Total parameters: 73,728 + 128 = 73,856.\n- The common mistake: forgetting to count bias parameters (or assuming \"no bias by default\"). In PyTorch's `nn.Conv2d`, bias=True by default.","A":"","B":"Correctly computes weight parameters (73,728) but forgets to add bias parameters (128). Total should be 73,856.","C":"Incorrectly computes output size without accounting for padding. With padding=1 and kernel=3: (32 + 2 - 3)/1 + 1 = 32, not 30. No padding would give 30×30.","D":"Confuses output channels — the convolutional layer produces out_channels=128 output feature maps, not the input's 64 channels."},"reference":"- PyTorch Conv2d: https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11002","difficulty":"easy","orderIndex":2,"question":"What is the receptive field of the final feature map after stacking three 3×3 convolutional layers with stride=1 and no pooling?","options":{"A":"3×3 (each layer sees only its kernel size)","B":"7×7 — each 3×3 layer adds (kernel_size - 1) = 2 pixels to the width of the receptive field (1 pixel on each side): layer 1 = 3×3, layer 2 = 5×5, layer 3 = 7×7. Two stacked 3×3 layers have the same receptive field as one 5×5 layer; three 3×3 layers equal one 7×7 layer","C":"9×9 — three layers multiply the receptive field: 3×3×3=9","D":"27×27 — the receptive field grows cubically with the number of layers"},"correct":"B","explanation":{"correct":"- Receptive field calculation: each 3×3 conv layer looks at a 3×3 region of the previous layer's output. Layer 1: RF = 3. Layer 2: each unit in layer 2 sees 3×3 of layer 1's output, each of which saw 3×3 of the input. The RF grows by 2 per layer: RF = 2×L + 1 for L layers with kernel=3.\n- Layer 1: RF = 2×1+1 = 3. Layer 2: RF = 2×2+1 = 5. 
Layer 3: RF = 2×3+1 = 7.\n- This is the VGG insight (Simonyan & Zisserman, 2014): two 3×3 layers have the same receptive field as one 5×5 layer but with fewer parameters (2×9C² = 18C² vs 25C² for C channels) and more non-linearities.","A":"If each layer only saw its own kernel, there'd be no benefit to stacking layers. The key property of CNNs is that deeper layers have larger receptive fields.","B":"","C":"Receptive field grows additively, not multiplicatively. Each additional 3×3 layer adds 2 pixels to the RF width (1 on each side). Three layers: RF = 3 + 2 + 2 = 7, not 3×3×3.","D":"Receptive field grows linearly with depth (for constant kernel size and stride=1). For stride>1, it grows exponentially — but stride=1 here gives linear growth."},"reference":"- Simonyan & Zisserman, \"Very Deep Convolutional Networks for Large-Scale Image Recognition\" (VGGNet): https://arxiv.org/abs/1409.1556"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11003","difficulty":"medium","orderIndex":3,"question":"AlexNet uses Local Response Normalization (LRN) after its ReLU activations. Modern architectures (VGG, ResNet) dropped LRN and replaced it with BatchNorm. What did LRN attempt to do, and why was BatchNorm a better solution?","options":{"A":"LRN and BatchNorm are identical operations; the name change was cosmetic","B":"LRN normalizes activations within a local neighborhood of channels (across nearby channels at the same spatial location), creating competition between channels and providing local contrast normalization — motivated by lateral inhibition in neuroscience. BatchNorm normalizes across the spatial and batch dimensions for each channel, stabilizing the training dynamics and loss landscape. BatchNorm was superior because: (1) it normalizes the learned representation more globally; (2) provides actual training stability benefits; (3) allows higher learning rates. 
LRN's benefit was mostly theoretical; empirical results showed it barely helped once BN was available","C":"LRN was replaced because it caused vanishing gradients, while BatchNorm prevents them","D":"LRN is used for classification; BatchNorm is used for detection tasks only"},"correct":"B","explanation":{"correct":"- LRN (AlexNet, Krizhevsky 2012): for neuron at channel c, position (x,y): normalize by a sum of squared activations across nearby channels [c-n/2, c+n/2]. This creates inter-channel competition (\"the most active neuron suppresses others\"), analogous to lateral inhibition in the visual cortex.\n- BatchNorm (Ioffe & Szegedy, 2015): normalizes each channel's activations across the batch and spatial dimensions. Stabilizes learning by preventing covariate shift (or smoothing loss landscape, per Santurkar 2018).\n- Why LRN fell out of use: LRN adds minor regularization but doesn't address the core optimization problem. When BN was introduced, it provided quantifiably better training stability, faster convergence, and higher final accuracy. LRN's neuroscience motivation didn't translate to reliable empirical gains.","A":"LRN and BatchNorm are mathematically very different. LRN: normalization by local channel neighborhood at the same position. BatchNorm: normalization by batch statistics across all spatial positions. They have different normalization axes.","B":"","C":"LRN doesn't specifically cause vanishing gradients (it normalizes activations, not weights). BatchNorm's primary benefit is training stability, not specifically gradient magnitude control (that's the Kaiming initialization + skip connections job).","D":"BatchNorm is used in both classification and detection networks (ResNet for ImageNet, Mask R-CNN for detection). 
The claim that BN is only for detection is incorrect."},"reference":"- Krizhevsky et al., \"ImageNet Classification with Deep Convolutional Neural Networks\" (AlexNet, 2012)\n- Ioffe & Szegedy, \"Batch Normalization\" (2015): https://arxiv.org/abs/1502.03167"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11004","difficulty":"medium","orderIndex":4,"question":"ResNet introduces skip connections: F(x) + x. The paper argues that learning the residual mapping F(x) = H(x) - x is easier than learning H(x) directly, especially when the identity is a good approximation. What specific evidence from the original ResNet paper supports this, and what would happen without skip connections?","options":{"A":"Without skip connections, training a 56-layer network is impossible due to GPU memory limits","B":"The paper showed that deeper plain networks (without skip connections) have higher training error than shallower ones (counterintuitive — more parameters, worse training). This \"degradation problem\" is not due to overfitting (training error itself is worse). With skip connections, a 110-layer ResNet trains successfully and reaches lower training and test error than a 20-layer plain network. The residual formulation ensures that at minimum, the network can learn identity mappings (F=0), which cannot happen as gracefully in a plain deep network","C":"The evidence is theoretical only; ResNets were proposed before experiments were run","D":"Without skip connections, ResNets achieve 1% lower accuracy because the skip connection provides additional inputs to each layer"},"correct":"B","explanation":{"correct":"- The degradation problem: in the ResNet paper (He et al., 2015), plain 56-layer networks had 6.02% training error vs 20-layer plain networks at 4.18%. More layers → higher training error. 
This rules out overfitting as the cause — overfitting increases val error but should decrease training error.\n- Identity shortcut argument: if a 56-layer plain net should be at least as good as a 20-layer net (the remaining 36 layers could learn identity), why doesn't it? The answer: learning exact identity mappings is hard for stacked non-linear layers. With residual connections: F(x) = H(x) - x. If the optimal transformation is identity, F(x) = 0, which is easy to achieve (push weights to 0 → F=0).\n- With skip connections: 110-layer ResNet achieves 6.43% test error vs 13.63% for the plain 110-layer network on CIFAR-10.","A":"Deep plain networks can be trained on modern GPUs — the limitation is optimization difficulty, not memory. A 56-layer plain VGG-style network can be instantiated in GPU memory; it simply doesn't train well.","B":"","C":"The ResNet paper is primarily an experimental paper. The experiments on CIFAR-10 and ImageNet are the core evidence. Theory is secondary to the empirical demonstration.","D":"Skip connections don't \"provide additional inputs\" — they add the identity of the previous layer to the output, not a separate additional feature. The benefit is optimization, not additional input information per se."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2015): https://arxiv.org/abs/1512.03385"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11005","difficulty":"medium","orderIndex":5,"question":"EfficientNet uses compound scaling: simultaneously scaling depth (d), width (w), and resolution (r) with a fixed ratio, rather than scaling each independently. You have a baseline model and want to multiply compute by 8×. 
How does compound scaling allocate this vs single-dimension scaling?","options":{"A":"Compound scaling allocates all 8× compute to depth (more layers)","B":"EfficientNet compound scaling: given a compute budget of 2^φ times the baseline (FLOPS ∝ d × w² × r²), the formula uses d = α^φ, w = β^φ, r = γ^φ where α × β² × γ² ≈ 2 (so each unit increase in φ doubles compute). For 8×: φ=3 (since 2^3=8). Typical coefficients: α=1.2, β=1.1, γ=1.15. Compound scaling uses all three dimensions in balanced ratios, while single-axis scaling (e.g., 8× depth only) is less efficient because image resolution and channel width don't keep up with depth","C":"Compound scaling is identical to width scaling; depth and resolution scaling add no benefit","D":"The 8× compute budget should be split equally: 2.67× per dimension"},"correct":"B","explanation":{"correct":"- Single-axis limitation: scaling only depth creates very deep but narrow networks. Deep narrow networks may have large receptive fields but limited per-layer feature richness. Scaling only width creates wide but shallow networks that can't learn hierarchical features.\n- Balanced scaling intuition: if input resolution increases (more pixels), you need wider layers to process the extra spatial information, and deeper layers to capture higher-level patterns in the larger resolution input. These three dimensions are interdependent.\n- Empirical finding (EfficientNet paper, Tan & Le 2019): given the same FLOPS, compound-scaled models consistently outperform single-axis scaled models at every compute point on ImageNet.","A":"Allocating all compute to depth ignores the interdependency between dimensions. The compound scaling finding is specifically that balanced scaling outperforms single-axis scaling.","B":"","C":"Width scaling is one component of compound scaling. 
Depth and resolution scaling interact with width — increasing resolution without increasing width leaves the additional spatial information under-processed.","D":"Equal splitting (2.67× per dimension) is one possible approach, but EfficientNet finds that the optimal split is not equal. The balanced ratios (α, β, γ) are found via a small grid search on the baseline network."},"reference":"- Tan & Le, \"EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks\" (2019): https://arxiv.org/abs/1905.11946"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11006","difficulty":"medium","orderIndex":6,"question":"A 1×1 convolutional layer (also called a \"network-in-network\" or pointwise convolution) is applied to a feature map with shape (batch, 256, 28, 28) to produce (batch, 64, 28, 28). What does this operation accomplish, and why is it used in bottleneck ResNet blocks?","options":{"A":"1×1 convolution is a no-op for spatial dimensions; it only changes the batch dimension","B":"1×1 convolution applies a linear projection across channels at each spatial location independently: for each (h, w) position, it computes a 64-dimensional linear combination of the 256 input channels. This is dimensionality reduction in the channel dimension. In ResNet bottleneck blocks: 256→64 (1×1), 64→64 (3×3), 64→256 (1×1). The expensive 3×3 conv operates on the compressed 64-channel representation, reducing its compute by a factor of (256/64)² = 16× compared to directly applying 3×3 on 256 channels","C":"1×1 convolution is used to increase spatial resolution from 28×28 to 256×256","D":"1×1 convolution applies 3D spatial filtering across height, width, and channels simultaneously"},"correct":"B","explanation":{"correct":"- 1×1 conv math: output[n, c_out, h, w] = Σ_{c_in} W[c_out, c_in] × input[n, c_in, h, w]. 
This is a matrix-vector product at each spatial position: the 256-dimensional channel vector at position (h,w) is projected to 64 dimensions.\n- Bottleneck compute savings: 3×3 conv with C channels: 9C² FLOPs per position. With bottleneck (C→C/4→C): 1×1 (C×C/4) + 3×3 (9×C/4×C/4) + 1×1 (C/4×C) = C²/4 + 9C²/16 + C²/4 = 17C²/16 ≈ 1.06C², about 8.5× less than 9C². For C=256: the 3×3 conv alone costs 36,864 FLOPs on 64 channels vs 589,824 FLOPs for a direct 3×3 on 256 channels, a 16× FLOP reduction for that layer.\n- 1×1 convs also allow channel mixing without spatial computation — they can re-weight which input channels are relevant for each output channel.","A":"1×1 convolution is not a no-op — it changes channel dimensions (from 256 to 64 in this case) and can learn non-trivial channel mixing. It has the same spatial resolution in and out.","B":"","C":"1×1 convolution doesn't change spatial dimensions. The \"1×1\" refers to the spatial extent of the kernel — one pixel × one pixel. Spatial dimensions are preserved.","D":"1×1 convolution is applied independently at each (h, w) spatial position. It does not combine information across spatial locations — it only combines across channels at each position."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2015): Figure 5 (bottleneck block)\n- Lin et al., \"Network In Network\" (2013): https://arxiv.org/abs/1312.4400"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11007","difficulty":"hard","orderIndex":7,"question":"Depthwise separable convolution (used in MobileNet) separates a standard K×K convolution into: (1) depthwise: K×K applied per channel independently, (2) pointwise: 1×1 across channels. For a layer with C_in=128, C_out=256, K=3: calculate the parameter and FLOP reduction vs standard convolution.","options":{"A":"Parameters: 2× fewer; FLOPs: 4× fewer","B":"Standard conv parameters: K²×C_in×C_out = 9×128×256 = 294,912. Depthwise-separable: depthwise K²×C_in = 9×128 = 1,152 + pointwise C_in×C_out = 128×256 = 32,768, total = 33,920. 
Parameter reduction: 294,912/33,920 ≈ 8.7×. FLOP reduction is similar: standard FLOPs ∝ K²×C_in×C_out; DSC FLOPs ∝ K²×C_in + C_in×C_out, giving reduction ≈ 1/(1/C_out + 1/K²) ≈ 8-9× for these values","C":"Depthwise separable convolutions are lossless — they compute exactly the same function as standard convolution with fewer parameters","D":"Parameter reduction is 10×; FLOP reduction is 2× due to the extra 1×1 layer"},"correct":"B","explanation":{"correct":"- Standard conv: single kernel of shape (C_out, C_in, K, K). Parameters: C_out × C_in × K² = 256 × 128 × 9 = 294,912.\n- Depthwise conv: C_in kernels of shape (1, 1, K, K), one per input channel. Parameters: C_in × K² = 128 × 9 = 1,152.\n- Pointwise conv: 1×1 conv mixing channels. Parameters: C_in × C_out = 128 × 256 = 32,768.\n- DSC total: 1,152 + 32,768 = 33,920. Ratio: 294,912 / 33,920 ≈ 8.7×.\n- The FLOP reduction formula: standard FLOPs = K²·C_in·C_out; DSC FLOPs = K²·C_in + C_in·C_out (both per output position). Ratio = 1/(1/C_out + 1/K²) = 1/(1/256 + 1/9) ≈ 1/(0.0039 + 0.1111) ≈ 8.7×, the same factor as the parameter ratio.","A":"The actual reduction is ~8-9×, not 2× or 4×. The key insight is that DSC doesn't compute all channel combinations simultaneously — depthwise processes each channel separately, then pointwise mixes.","B":"","C":"Depthwise separable convolution is NOT equivalent to standard convolution. Standard conv can express any mapping between C_in channels to C_out channels; DSC constrains the function family. DSC is a structured approximation with reduced representational capacity.","D":"The parameter and FLOP reductions are both approximately the same factor (~8-9×), not asymmetrically 10× and 2×. 
The extra 1×1 layer adds parameters (the pointwise conv is the major component of DSC's parameter count)."},"reference":"- Howard et al., \"MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications\" (2017): https://arxiv.org/abs/1704.04861"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11008","difficulty":"hard","orderIndex":8,"question":"ConvNeXt (2022) modernizes a standard ResNet by incorporating design principles from Vision Transformers (ViTs). One key change is using depthwise convolution with very large kernels (7×7) in the inverted bottleneck design. What insight from ViT motivated this change, and how does it compare to stacking smaller kernels?","options":{"A":"Larger kernels are used because they are computationally cheaper than small kernels","B":"Vision Transformers use self-attention, which has a global receptive field — every output position attends to every input position. The 7×7 depthwise conv in ConvNeXt approximates this by having a larger local receptive field (49 spatial positions vs 9 for 3×3). A single 7×7 depthwise conv uses 49×C_in parameters (one kernel per channel); equivalent receptive field from stacking three 3×3 convs would use 3×9×C_in²/reduction_factor parameters. The depthwise design makes large kernels computationally feasible since channels aren't mixed at the spatial step","C":"Large kernels are only used in ConvNeXt for the first layer (similar to ViT's patch embedding)","D":"7×7 kernels are used because they match the 7×7 output resolution at the final ResNet stage"},"correct":"B","explanation":{"correct":"- ViT's inductive bias: multi-head self-attention computes all pairwise token interactions. This creates a fully-connected spatial mixing at every layer. 
The implicit \"large receptive field from the first layer\" is a key difference from CNNs' local receptive fields.\n- ConvNeXt motivation: if large receptive fields help ViTs, can we give CNNs larger receptive fields without the quadratic cost of attention? Depthwise 7×7 convs achieve this: 49 spatial positions processed, but only 49×C_in parameters (vs K²×C_in×C_out = 49×C_in×C_out for standard 7×7).\n- ConvNeXt also uses other ViT-inspired changes: patch-based downsampling, fewer normalization layers, GELU activation, inverted bottleneck (wide FFN in ViT → wide channel dimension in ConvNeXt).","A":"Larger kernels are generally more expensive, not cheaper. A 7×7 standard conv uses ~5.4× more FLOPs than a 3×3 conv. ConvNeXt uses depthwise 7×7 (cheap) to get large receptive fields without the full cost.","B":"","C":"The 7×7 depthwise conv is used in every stage of ConvNeXt, not just the first layer. This is a key architectural change applied throughout the network.","D":"The 7×7 kernel size is motivated by receptive field and ViT comparison, not by matching output resolution. The 7×7 output resolution at the final ResNet stage is a feature map size, not directly related to why 7×7 kernels were chosen."},"reference":"- Liu et al., \"A ConvNet for the 2020s (ConvNeXt)\" (2022): https://arxiv.org/abs/2201.03545"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11009","difficulty":"hard","orderIndex":9,"question":"A ResNet-50 is deployed for medical image classification. During inference, an image of size 448×448 is used (model was trained on 224×224). The accuracy drops significantly compared to 224×224 images. 
A colleague says \"just resize to 224×224.\" Another says \"the model can handle any size natively because Conv layers have no size constraint.\" Who is right and why?","options":{"A":"The second colleague is right — ResNets with global average pooling handle any input size","B":"Both are partially right, but they miss a critical issue: convolutional layers accept any size (their weights are size-independent). However, ResNet uses a global average pooling (GAP) layer at the end, which is size-invariant. But performance degrades because: (1) the model's convolutional filters have effective receptive fields calibrated for 224×224 — at 448×448, the same filters cover proportionally smaller parts of the image, disrupting learned feature hierarchies; (2) BatchNorm running statistics were computed for 224×224 spatial distributions; (3) the model hasn't seen 448×448 spatial patterns during training. For best accuracy at 448×448, fine-tune on 448×448 or use test-time augmentation at the training resolution","C":"The first colleague is right — resize to 224×224 is the only valid approach; ResNets cannot process other sizes at all","D":"Both are wrong — a new model must be trained from scratch for 448×448 images"},"correct":"B","explanation":{"correct":"- Why conv layers handle any size: a 3×3 conv sliding over 448×448 produces 446×446 (or 448×448 with padding). The same filters work at any spatial scale — they're position-independent. GAP then takes the mean over all spatial positions, producing a C-dimensional vector regardless of input size.\n- Why performance degrades despite technical compatibility: effective receptive field issue. A ResNet-50 feature at the last conv layer has a receptive field of ~196×196 pixels on a 224×224 input (covering ~77% of the image). On a 448×448 input, the same receptive field covers ~19% of the image. 
The model sees only local patches, not the full object structure it was trained to recognize.\n- The fix options: (1) fine-tune at 448×448 (adjusts BN stats, teaches model to use larger receptive fields); (2) use FixRes (Touvron et al.) which trains at low res and tests at high res with a simple fix-up.","A":"While technically true that GAP + conv layers allow any input size, performance degrades significantly due to resolution mismatch. \"Can handle any size natively\" implies no accuracy penalty, which is incorrect.","B":"","C":"ResNets can technically process any input size — this statement is factually wrong. The convolutional architecture has no hard size constraint.","D":"Fine-tuning on 448×448 is sufficient — no need to train from scratch. Transfer learning from 224×224 pretrained weights is effective for resolution adaptation."},"reference":"- Touvron et al., \"Fixing the train-test resolution discrepancy\" (FixRes) (2019): https://arxiv.org/abs/1906.06423"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11010","difficulty":"medium","orderIndex":10,"question":"You replace all MaxPooling layers in a CNN with stride-2 convolutional layers (same kernel size, stride=2 instead of stride=1, no pooling). What are the trade-offs?","options":{"A":"Stride-2 conv is strictly better — it has all the benefits of pooling plus it's learnable","B":"Stride-2 conv is learnable (parameters can be optimized for the task) and preserves more information (learned aggregation vs fixed max operation). MaxPooling is parameter-free (no weights to learn), provides perfect translation invariance within the pooling window, and applies a non-linear operation (max). Trade-off: stride-2 conv adds parameters and may be harder to optimize; it doesn't have the built-in non-linear selection property of max. Modern architectures (ResNet, etc.) 
largely use stride-2 conv; MaxPooling is used in older architectures (VGG) and where translation invariance is explicitly desired","C":"Stride-2 conv and MaxPooling are mathematically identical when kernel_size matches","D":"MaxPooling should always be preferred; stride-2 conv causes spatial aliasing"},"correct":"B","explanation":{"correct":"- MaxPool: takes the maximum value in each pooling window. This is shift-invariant within the window (if the maximum value shifts by 1 pixel, the pool output is the same). No learnable parameters. Non-linear (max is non-differentiable at ties, but piecewise linear).\n- Stride-2 conv: a learned linear combination of the input at each position, then strided. More flexible among linear aggregations (it can represent any linear filter, including average pooling, though not the max itself, which is non-linear), but requires training data to learn the appropriate weights. Can overfit the aggregation function.\n- Empirical result: strided convolutions work as well or better in practice (ResNet-50 downsamples between stages with stride-2 convs and keeps only one max pool after the stem). The learned downsampling often outperforms fixed max downsampling for high-level vision tasks.","A":"Stride-2 conv is not \"strictly better.\" For tasks where translation invariance is explicitly desired (e.g., detecting presence of a small texture anywhere in the image), max pooling's built-in invariance is valuable. Learnability doesn't always help if the dataset is small.","B":"","C":"They are not mathematically identical. MaxPool computes the maximum; stride-2 conv computes a weighted sum. These are fundamentally different operations (non-linear max vs linear weighted sum).","D":"MaxPooling also has spatial aliasing (skipping every other pixel) — both approaches have aliasing issues. 
The claim that stride-2 conv \"causes aliasing\" while MaxPool doesn't is incorrect (both downsample)."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2015): Section 3.3 (uses stride instead of pooling)\n- Springenberg et al., \"Striving for Simplicity: The All Convolutional Net\" (2015): https://arxiv.org/abs/1412.6806"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11011","difficulty":"medium","orderIndex":11,"question":"In semantic segmentation, the output must have the same spatial resolution as the input (each pixel gets a class label). Encoder-decoder architectures (U-Net, SegNet) use skip connections from encoder to decoder. What specific information do these skip connections carry, and why is it critical for pixel-level predictions?","options":{"A":"Skip connections carry class labels from earlier predictions in the encoder","B":"Skip connections carry high-resolution spatial detail from early encoder layers directly to the corresponding decoder layers. The encoder progressively downsamples and loses fine spatial information (exact object boundaries, thin structures). The decoder upsamples from the bottleneck but can only recover coarse locations without the spatial details. Skip connections provide the exact high-resolution feature maps to the decoder, allowing it to produce sharp boundaries — the encoder contributes \"where exactly\" (spatial precision) while the bottleneck contributes \"what\" (semantic context)","C":"Skip connections in U-Net are only used to prevent gradient vanishing during training","D":"Skip connections carry the original pixel values (raw input) to all decoder layers"},"correct":"B","explanation":{"correct":"- Encoder path: spatial resolution decreases (224→112→56→28→14→7) while channels increase. 
At 7×7, the network has high semantic understanding but no precise spatial location.\n- Decoder without skip connections: upsamples from 7×7 back to 224×224 using only the bottleneck features. Can reconstruct coarse object locations but produces blurry, imprecise boundaries.\n- U-Net skip connections: at each decoder resolution level, concatenate (or add) the encoder's feature map of the same resolution. The 56×56 decoder layer gets the encoder's 56×56 features — these contain the precise boundaries and textures that were lost during subsequent downsampling.\n- Critical for thin structures: in medical imaging (e.g., blood vessels, cell borders), thin 1-2 pixel structures are completely lost in deep encoders. Skip connections restore this detail.","A":"Skip connections carry feature maps (intermediate learned representations), not class labels. Classification happens at the final decoder output layer.","B":"","C":"Skip connections do help gradient flow (paths to early layers), but this is a secondary benefit. The primary motivation is spatial detail transfer for precise pixel-level predictions.","D":"Skip connections carry layer-specific feature maps from the encoder (processed representations), not the original pixel values. Only the very first skip connection (if from the input layer) would carry near-raw pixels."},"reference":"- Ronneberger et al., \"U-Net: Convolutional Networks for Biomedical Image Segmentation\" (2015): https://arxiv.org/abs/1505.04597"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11012","difficulty":"hard","orderIndex":12,"question":"You compare two feature extraction methods for a ResNet-50: (A) Global Average Pooling (GAP) to produce a 2048-d vector, (B) spatial feature pyramid pooling (SPP) to produce a 7168-d vector (concatenating 2048 from 1×1 + 2048 from 2×2 + 3072 from 3×3 spatial pools). For a nearest-neighbor image retrieval task, SPP achieves significantly higher mAP. 
Why?","options":{"A":"SPP achieves higher mAP only because it has a larger feature vector (7168 vs 2048)","B":"GAP spatially averages all activations into a single vector — spatial location information is completely discarded. SPP captures features at multiple spatial scales: the 1×1 pool captures global statistics; 2×2 captures quadrant-level features (top-left, top-right, etc.); 3×3 captures 9-region features. For retrieval, the spatial distribution of features (where in the image a feature occurs) is crucial for discriminating images. SPP preserves \"what is in which region\" rather than just \"what is in the image.\" The larger vector is a consequence, not the cause of the improvement","C":"SPP is better because it applies non-linear pooling (max) instead of average pooling","D":"GAP and SPP are equivalent for retrieval tasks; the mAP difference is due to random seeds"},"correct":"B","explanation":{"correct":"- GAP limitation for retrieval: two images with the same objects in different spatial arrangements produce similar GAP vectors. A horse in the upper-left and a horse in the lower-right collapse to the same 2048-d average.\n- SPP spatial preservation: with 2×2 pooling, the top-left quadrant's features are in a different part of the SPP vector than the bottom-right quadrant's features. Two images with the same objects in different positions produce different SPP vectors.\n- This spatial discriminativeness is critical for retrieval — the goal is to find images that are similar in both content and arrangement. For classification (where spatial invariance is desirable), GAP is better.","A":"The larger vector size (7168 vs 2048) does increase capacity but is not the primary reason for mAP improvement. The improvement is specifically due to the spatial information preservation. 
You can verify this by comparing SPP to a random 7168-d feature — the random large vector wouldn't improve mAP.","B":"","C":"SPP uses max pooling (not average) in some formulations, but the key benefit is multi-scale spatial sampling, not the max vs average distinction. GAP with multiple scales would also outperform single-scale GAP.","D":"The mAP difference between spatial and non-spatial features on retrieval benchmarks is consistent and large (often 10-20% mAP). This is a well-established finding in image retrieval literature."},"reference":"- He et al., \"Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition\" (SPPNet, 2014): https://arxiv.org/abs/1406.4729"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11013","difficulty":"easy","orderIndex":13,"question":"In the LeNet→AlexNet progression, AlexNet introduced several innovations beyond depth. What was the role of ReLU activation compared to the sigmoid/tanh activations used in LeNet?","options":{"A":"ReLU was introduced to prevent the dying neuron problem specific to LeNet","B":"ReLU provides non-saturating gradients for positive activations: ∂ReLU/∂x = 1 for x>0, vs ∂sigmoid/∂x = σ(x)(1-σ(x)) which is at most 0.25 and approaches 0 for large |x|. Tanh is similarly bounded. With ReLU, deep networks can be trained faster because gradients don't vanish through many layers. AlexNet's paper showed ReLU networks trained 6× faster than tanh networks on the same architecture. The trade-off: dying ReLU (if pre-activation stays negative, gradient = 0 permanently)","C":"ReLU was used for biological plausibility; the dying ReLU problem was actually desired to simulate neuron death","D":"ReLU was used because sigmoid requires expensive exp() computations that were impractical on 2012 hardware"},"correct":"B","explanation":{"correct":"- Sigmoid saturation: for large positive or negative x, σ(x)→0 or σ(x)→1. Gradient ≈ 0 for saturated neurons. 
In a deep network, multiple layers of near-zero gradients compound into vanishing gradient.\n- ReLU (Rectified Linear Unit): max(0,x). Gradient is exactly 1 for x>0 (no saturation), exactly 0 for x<0 (dying ReLU). The non-saturating positive gradient allows training of much deeper networks.\n- 6× speedup: stated in the AlexNet paper. On CIFAR-10, a 4-layer ReLU network reached 25% training error 6× faster than a tanh network.","A":"The dying ReLU problem (neurons stuck at x<0 permanently outputting 0) is a known issue with ReLU, not a problem in LeNet. LeNet used sigmoid/tanh which have different issues (saturation).","B":"","C":"Biological plausibility motivation is not the primary reason cited in the AlexNet paper. The paper explicitly states faster training due to non-saturation as the motivation.","D":"exp() is fast on modern hardware and GPUs. The AlexNet paper doesn't mention hardware computational cost as the reason for ReLU. The primary reason is training speed due to non-saturating gradients."},"reference":"- Krizhevsky et al., \"ImageNet Classification with Deep Convolutional Neural Networks\" (AlexNet, 2012): Section 3.1 (ReLU Nonlinearity)"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11014","difficulty":"hard","orderIndex":14,"question":"A team uses feature maps from a ResNet-50 backbone at multiple scales for object detection (FPN: Feature Pyramid Network). They extract features from ResNet's C3, C4, C5 stages and add top-down pathways. An engineer wants to add the C2 stage (high-resolution, early features). What is the trade-off of adding C2 features?","options":{"A":"Adding C2 features increases accuracy with no cost because more features are always better","B":"C2 features have very high spatial resolution (e.g., 56×56 for 224×224 input) and low semantic content (early layers detect edges, textures, not objects). 
Adding C2 to FPN: (1) increases memory proportionally to the spatial resolution increase — 56×56 features require 4× more memory than 28×28 (C3) features; (2) the top-down pathway now needs to propagate semantic context to a much larger feature map; (3) C2's features are semantically weak — the network needs to combine high-res but low-semantics with the FPN's top-down signal. For most detection tasks, C2 marginally helps small-object detection but significantly increases compute/memory","C":"C2 features have too low resolution to be useful for object detection","D":"Adding C2 requires retraining the entire backbone from scratch due to gradient flow changes"},"correct":"B","explanation":{"correct":"- FPN multi-scale design: C5 (7×7, 2048ch) → P5 (semantic, low-res). C4 (14×14) → P4. C3 (28×28) → P3. Each P_i is used for detecting objects at a specific scale range.\n- C2 (56×56) benefit: enables detection of very small objects (objects that are only a few pixels in the C3 feature map become more resolved in C2).\n- C2 cost: 56×56 = 3,136 positions vs C3's 28×28 = 784. 4× more positions for all convolutions in the detection head. For batch_size=2 with C2, the FPN head memory can increase by ~20-40% depending on the head design. For the marginal benefit on small objects (which may be rare in the dataset), this cost is often not justified.","A":"More features are not always better when cost is considered. C2's high memory cost is a real trade-off. Additionally, C2's low semantic content means the FPN must do more work to make these features useful.","B":"","C":"C2 has the highest resolution (56×56 for 224×224 input) — it's the highest resolution stage in the backbone. The claim of \"too low resolution\" is factually incorrect.","D":"Adding C2 features to FPN doesn't require retraining from scratch. The backbone weights are unchanged; only the FPN lateral connections and head need training. 
This is standard fine-tuning."},"reference":"- Lin et al., \"Feature Pyramid Networks for Object Detection\" (FPN, 2017): https://arxiv.org/abs/1612.03144"},{"section":"deep-learning","topicSlug":"cnn-architectures","topic":"Cnn Architectures","id":"dl-11015","difficulty":"hard","orderIndex":15,"question":"VGG-16 and ResNet-50 achieve similar ImageNet top-5 accuracy (~92%). VGG-16 has 138M parameters; ResNet-50 has 25M. You deploy both in production serving 10,000 requests/second. Which bottleneck does VGG-16 hit first, and why doesn't ResNet-50 have the same problem?","options":{"A":"VGG-16 hits a compute bottleneck; ResNet-50 avoids it because skip connections reduce FLOPs","B":"VGG-16's 138M parameters require 528MB in FP32 (or 264MB in FP16). At 10K requests/sec, VGG-16 primarily hits a memory bandwidth bottleneck: each inference must load the full 138M parameters from GPU memory. ResNet-50 at 25M parameters requires only 100MB — it fits easily in GPU L2 cache/shared memory for concurrent requests. The bottleneck shifts from parameter loading (memory-bound) for VGG to compute-bound for ResNet. In production, memory-bound layers are typically 5-10× slower than compute-bound at the same FLOP count","C":"VGG-16 hits a latency bottleneck because it has more layers than ResNet-50","D":"Both models hit identical bottlenecks; parameter count doesn't affect throughput"},"correct":"B","explanation":{"correct":"- Memory bandwidth bottleneck: GPU memory bandwidth (A100: ~2 TB/s) limits how fast weights can be loaded for inference. For VGG-16: loading 528MB takes 528MB / 2TB/s = 0.26ms just for weight loading. For ResNet-50: 100MB / 2TB/s = 0.05ms — 5× faster just from weight transfers.\n- Roofline model: operations are either compute-bound (limited by FLOP/s) or memory-bound (limited by memory bandwidth). 
For VGG-16's large fully-connected layers (4096→4096: 16.8M parameters), the ratio of FLOPs to bytes loaded is low → memory-bound.\n- Cache effects: ResNet-50's smaller parameter set allows much of the model to be resident in GPU cache. VGG-16's parameters don't fit, requiring main GPU memory access on every inference.","A":"Skip connections don't reduce FLOPs — they add FLOPs (the addition operation). ResNet-50 does need fewer FLOPs per image than VGG-16 (roughly 4 GFLOPs versus roughly 15 GFLOPs), but that comes from its bottleneck blocks and the absence of large fully-connected layers, not from the skip connections. The bottleneck difference this question targets is parameter count / memory bandwidth, not FLOPs.","B":"","C":"VGG-16 has 16 layers; ResNet-50 has 50 layers. ResNet-50 has more layers, not fewer. Layer count doesn't directly map to latency without considering the operation type and size per layer.","D":"Parameter count directly affects memory usage and bandwidth requirements. A 5× parameter reduction results in a 5× reduction in weight transfer time, which is a primary bottleneck in small-batch inference, where each weight load is amortized over few samples."},"reference":"- Williams et al., \"Roofline: An insightful visual performance model for multicore architectures\" (2009)\n- MLPerf inference benchmarks: https://mlcommons.org/en/inference-edge-20/"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12001","difficulty":"easy","orderIndex":1,"question":"A vanilla RNN processes a sequence of 100 words and must produce a single classification output. You observe that the gradient norm at step 1 is 10⁻¹⁵ while the gradient norm at step 100 is ~1.0. What is this problem, and what mathematical property causes it?","options":{"A":"Exploding gradients — the gradient grows from step 1 to step 100","B":"Vanishing gradients — the gradient at step 1 is effectively zero. In vanilla RNNs, the gradient at step t flows backward through the recurrence ∂L/∂h_1 = ∂L/∂h_T × ∏_{t=2}^{T} ∂h_t/∂h_{t-1}. Each term ∂h_t/∂h_{t-1} = diag(tanh'(a_t)) × W_h. With T=100 steps, this is a product of 99 Jacobians, which behaves like a matrix raised to the 99th power. 
If the spectral radius of W_h × diag(tanh') is < 1 (which it usually is since tanh' ≤ 1), repeated multiplication drives the gradient to zero exponentially","C":"The gradient norm difference is expected behavior; only the gradient at the last step matters","D":"Vanishing gradients only occur in forward propagation, not backward propagation"},"correct":"B","explanation":{"correct":"- Gradient chain rule for RNNs: ∂L/∂h_1 = Π_{t=2}^{100} W_h^T × diag(tanh'(a_t)) × ∂L/∂h_100.\n- tanh'(x) = 1 - tanh²(x) ≤ 1, and equals 1 only at x=0. For any non-zero pre-activation, tanh' < 1. The product of 99 such terms × W_h: if the largest singular value < 1, the product → 0 exponentially.\n- Practical consequence: the model cannot learn long-range dependencies. The prediction at step 100 depends almost entirely on steps ~90-100; information from steps 1-70 is effectively lost.","A":"The gradient grows from step 1 to step 100, meaning the gradient AT STEP 1 is smaller. This is vanishing gradients (gradients vanish going backward to early timesteps), not exploding. Exploding gradients would make early gradients LARGER than late gradients.","B":"","C":"For sequence classification, all input positions contribute information. The gradient at step 1 being ~0 means the model doesn't update its weights based on early-sequence content — a critical failure for long sequences.","D":"Vanishing gradients are specifically a backward pass (backpropagation through time) phenomenon. The forward pass computes activations, which may also diminish but doesn't directly cause training failure. 
The training failure comes from near-zero gradients in backprop."},"reference":"- Bengio et al., \"Learning Long-Term Dependencies with Gradient Descent is Difficult\" (1994)\n- Pascanu et al., \"On the difficulty of training recurrent neural networks\" (2013): https://arxiv.org/abs/1211.5063"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12002","difficulty":"easy","orderIndex":2,"question":"An LSTM has three gates: forget, input, and output. A textbook describes the forget gate as \"the most important gate.\" For a sentiment analysis task on long movie reviews, explain intuitively what information each gate controls and why the forget gate matters specifically.","options":{"A":"The forget gate removes information from the cell state; the input gate adds information; the output gate determines the hidden state — forget is most important because it prevents information accumulation","B":"Forget gate (f_t = σ(W_f[h_{t-1}, x_t] + b_f)): controls what to erase from cell state. For sentiment: \"However, despite the boring intro...\" — after \"However\", the forget gate should reduce the weight on the positive sentiment accumulated so far. Input gate (i_t): controls what new information to write. Output gate (o_t): controls what part of cell state to expose as hidden state h_t for the next step or output. The forget gate is critical for long documents because without selective forgetting, the cell state accumulates everything equally, losing signal in noise","C":"The output gate is most important; it's the only gate visible to the next layer","D":"All gates are equally important; the \"forget gate is most important\" claim is a myth"},"correct":"B","explanation":{"correct":"- Cell state without forget gate: C_t = C_{t-1} + i_t ⊙ g_t. Nothing in C_t ever decays — the cell state is a running sum of everything. 
For a 1000-word review, the sentiment signal from the last 50 words is buried under noise from the first 950.\n- Forget gate enables selective memory: at word t, f_t ≈ 0 for irrelevant or contradictory content (forget old), f_t ≈ 1 for consistent content (maintain). This allows the LSTM to maintain relevant long-range context while discarding irrelevant information.\n- Jozefowicz et al. (2015) ablation: removing the forget gate (setting f_t = 1 always, i.e., never forget) hurts performance significantly. Setting f_t = 0 (always forget) eliminates long-term memory.","A":"This accurately describes the gates but doesn't give the sentiment-specific intuition for why forget matters. The claim \"prevents information accumulation\" is vague — the key is selective forgetting of contradicted or irrelevant information.","B":"","C":"The output gate is important but is specifically about \"what to expose\" from the cell state at each step, not about managing long-term memory. For long-document tasks, selective forgetting is the primary challenge.","D":"The forget gate is empirically the most critical gate in many ablation studies. Jozefowicz et al. (2015) showed that LSTM variants that remove the forget gate consistently perform worse on language modeling tasks."},"reference":"- Hochreiter & Schmidhuber, \"Long Short-Term Memory\" (1997): original LSTM paper\n- Jozefowicz et al., \"An Empirical Evaluation of Recurrent Network Architectures\" (2015)"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12003","difficulty":"medium","orderIndex":3,"question":"A GRU replaces the three LSTM gates with two: a reset gate and an update gate. 
A team argues \"GRU is strictly better than LSTM for all tasks because it has fewer parameters.\" What is the accurate analysis?","options":{"A":"The team is correct — GRU is always better due to fewer parameters","B":"GRU has fewer parameters (2 gates vs 3 in LSTM, no separate cell state) making it faster and less prone to overfitting on small datasets. LSTM's separate cell state provides more expressive memory (can independently control what's stored vs what's outputted). Empirically, performance is dataset and task dependent: GRU often matches LSTM on language modeling with less data; LSTM tends to win on tasks requiring complex, structured long-term dependencies (e.g., music generation with long-term structure). The \"fewer parameters = better\" logic ignores that LSTM's extra capacity addresses specific memory management problems","C":"LSTM is always better; GRU was an unsuccessful experiment that never achieved practical adoption","D":"GRU and LSTM are mathematically identical; the name difference is vendor-specific"},"correct":"B","explanation":{"correct":"- GRU parameters for hidden_size=H, input_size=D: 3×(D+H)×H (two gates + candidate hidden: reset, update, new_h). LSTM parameters: 4×(D+H)×H (three gates + cell input: i, f, o, g). GRU ≈ 25% fewer parameters.\n- GRU update gate: z_t = σ(W_z [h_{t-1}, x_t]). h_t = (1-z_t)⊙h_{t-1} + z_t⊙h̃_t. This single gate does both \"forget\" and \"what to update\" — it can't independently control forgetting and updating. LSTM can forget old information while writing specific new information independently.\n- Chung et al. (2014) and Greff et al. (2017) empirical comparisons: on many NLP tasks, GRU ≈ LSTM with 25% fewer parameters. On tasks requiring more precise memory management, LSTM has an edge.","A":"\"Always better\" is empirically false. LSTM outperforms GRU on some tasks (music generation, certain machine translation benchmarks with long dependencies). 
The relationship is task-dependent.","B":"","C":"GRU is widely adopted. PyTorch and TensorFlow both support GRU as a core layer. GRU is used in production systems (speech processing, time series). It is not an \"unsuccessful experiment.\"","D":"GRU and LSTM have fundamentally different architectures. GRU has no separate cell state; LSTM does. They compute different mathematical functions. The equations are substantially different."},"reference":"- Chung et al., \"Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling\" (2014): https://arxiv.org/abs/1412.3555"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12004","difficulty":"medium","orderIndex":4,"question":"You implement Backpropagation Through Time (BPTT) for a sequence of length 500. Training is slow and uses 40GB of GPU memory. A colleague suggests \"truncated BPTT with segment_length=20.\" What is the trade-off and what does it sacrifice?","options":{"A":"Truncated BPTT reduces memory and speed but maintains identical gradient computation","B":"Truncated BPTT divides the sequence into non-overlapping segments of length 20, computing gradients only within each segment. Memory reduction: from O(T) to O(segment_length). Speed improvement: similar. Trade-off: gradients cannot propagate back beyond 20 steps. The model can only learn dependencies within 20-step windows. 
The hidden state carries information from before the segment boundary (the initial hidden state of each segment comes from the end of the previous segment), but the weights are not updated to improve this long-term state — the model learns to use short-term context efficiently but cannot optimize the 21-to-500 step dependencies","C":"Truncated BPTT only reduces speed, not memory; all activations must still be stored","D":"Truncated BPTT with segment_length=20 is equivalent to full BPTT for sequence_length=500 when the RNN converges"},"correct":"B","explanation":{"correct":"- Full BPTT memory: must store all T=500 hidden states and pre-activations for the backward pass. Activation memory = O(T × H) per sequence for a network with hidden size H (the O(H²) weight matrices are stored once and do not grow with T).\n- Truncated BPTT: process T/20 = 25 segments. Each segment stores only 20 steps of activations. Activation memory: O(segment_length × H) = O(20 × H), a 25× reduction in stored activations.\n- The sacrifice: weights are not updated to optimize cross-segment dependencies. The model learns to \"use\" hidden states from before the segment boundary but not to generate them optimally. Dependencies longer than 20 steps are underfit.\n- The hidden state IS passed between segments (avoiding complete information loss), but the gradient is detached at the boundary (`h = h.detach()` in PyTorch), preventing gradients from flowing through.","A":"Truncated BPTT computes different gradients from full BPTT — it explicitly zeroes out gradients beyond the truncation boundary. They are not identical.","B":"","C":"Memory reduction is the primary motivation for truncated BPTT. Only segment_length activations need to be stored at once, not the full T=500. This is a significant memory saving.","D":"No convergence property makes them equivalent. 
Even after thousands of training steps, a model trained with truncated BPTT cannot learn dependencies that span more than the truncation length, regardless of convergence."},"reference":"- Sutton, \"Time-derivative models of Pavlovian reinforcement\" (1990): original TBPTT\n- Mikolov et al., \"Recurrent Neural Network Based Language Model\" (2010): uses TBPTT"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12005","difficulty":"medium","orderIndex":5,"question":"You train a bidirectional LSTM on named entity recognition (NER). The forward LSTM reads left-to-right, the backward LSTM reads right-to-left. A colleague questions: \"doesn't the backward LSTM see the 'future' relative to the current token?\" When is bidirectional processing valid and when is it invalid?","options":{"A":"Bidirectional LSTMs always see the future; they are invalid for all sequential tasks","B":"Bidirectional models are valid for offline tasks where the full sequence is available at inference: NER, POS tagging, text classification, machine translation encoding (the encoder sees the full source sentence). They are invalid for online/autoregressive tasks where future tokens are unavailable at generation time: language modeling (next-word prediction), streaming ASR, real-time translation. For NER, the word \"Apple\" in \"I work at Apple Inc.\" benefits from seeing \"Inc.\" ahead to classify \"Apple\" as an organization — the full sentence is available, so right-to-left context is legitimate","C":"Bidirectional models are invalid because they cause data leakage — future context is unavailable in production","D":"Bidirectional LSTMs are only valid for classification; for sequence labeling, only forward LSTMs work"},"correct":"B","explanation":{"correct":"- NER is an offline task: the full sentence is a training example. During both training and inference, the complete sequence is provided. 
Using the full sentence to label each token is not \"data leakage\" — it's using available context.\n- Language modeling is an autoregressive task: generating the next word must only use previous words. During training, the model sees the full sequence, but using future tokens to predict the current token would be cheating (the model would trivially predict the next token because it can see it).\n- BERT uses bidirectional attention (Transformer, not LSTM) for representation learning but cannot directly generate text. GPT uses unidirectional attention (causal masking) for generation.","A":"\"Invalid for all sequential tasks\" is too strong. The validity depends on whether the task requires online processing. Offline tasks (batch processing of complete sequences) fully allow bidirectional models.","B":"","C":"\"Data leakage\" in ML means using information during training that wouldn't be available at deployment time. If full sequences are available at both training and inference (as in NER), there's no leakage.","D":"Bidirectional LSTMs are used for sequence labeling (NER, POS tagging) — this is one of their primary applications. The claim that they only work for classification is incorrect."},"reference":"- Devlin et al., \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\" (2018)\n- Collobert et al., \"Natural Language Processing (Almost) from Scratch\" (2011): bidirectional models for NER"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12006","difficulty":"hard","orderIndex":6,"question":"You debug an LSTM language model and find that the forget gate activations (f_t) are saturated near 1.0 for 95% of timesteps, and the input gate (i_t) is near 0.0 for the same timesteps. The model achieves decent perplexity but generalizes poorly. 
What is this pattern indicating?","options":{"A":"The LSTM is working correctly — high forget gate means good long-term memory","B":"This pattern indicates the LSTM is in \"copy mode\" — nearly always passing through the previous cell state unchanged (f_t ≈ 1, no forgetting) while ignoring new input (i_t ≈ 0, no writing). The model has essentially learned to copy h_{t-1} → h_t for most timesteps, only occasionally updating based on the actual input. This is a degenerate solution: the model achieves decent perplexity by predicting the current word mostly based on accumulated context, but isn't learning to use specific input signals. Poor generalization occurs because the model relies on generic context accumulation rather than learning specific input-output patterns","C":"This is expected for long documents; LSTMs must maintain context across many steps","D":"The issue is the forget gate bias being too large; decrease it to 0 to fix generalization"},"correct":"B","explanation":{"correct":"- Normal LSTM behavior: forget gate should vary by context. At sentence boundaries: f_t ≈ 0 (reset cell state). After topic changes: partial forgetting. For consistent content: f_t ≈ 1 (maintain). 95% saturation near 1.0 is pathologically high.\n- Copy mode failure mode: C_t ≈ C_{t-1} + ε (tiny updates from near-zero input gate). The LSTM is essentially a leaky integrator, not a selective memory system.\n- Diagnosis approach: check if the model predicts words differently when given completely different inputs (different sequence prefixes). A copy-mode LSTM produces nearly identical predictions for many different inputs — it's not truly using the current input.","A":"High forget gate (f_t ≈ 1) alone might be fine for long-range memory. The problem is the simultaneous near-zero input gate — the model never writes new information. 
Together, these indicate a degenerate solution.","B":"","C":"Maintaining context is legitimate, but context maintenance should be selective (forget irrelevant, maintain relevant). 95% always-maintain is not selective — it's a failure to learn which information to retain.","D":"Decreasing forget gate bias to 0 would make forget gates start at 0.5 (sigmoid(0)), causing the model to forget ~50% of cell state at each step by default. This overcorrects and would likely destroy long-term memory. The issue needs diagnosing more carefully (learning rate, architecture, data)."},"reference":"- Karpathy et al., \"Visualizing and Understanding Recurrent Networks\" (2015): https://arxiv.org/abs/1506.02078"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12007","difficulty":"hard","orderIndex":7,"question":"You compare a 2-layer stacked LSTM (layer 1 output feeds into layer 2) to a single-layer LSTM with doubled hidden size. Both have approximately equal parameter counts. For a machine translation task, which performs better and why?","options":{"A":"Doubled hidden size always wins because wider models learn more diverse features","B":"The 2-layer stacked LSTM typically outperforms the wider single-layer LSTM for translation because depth enables compositional representations: layer 1 can learn syntactic patterns (sentence structure, phrase boundaries) while layer 2 can build semantic representations on top of these structural patterns. Depth creates a hierarchy of abstractions that a single wide layer cannot represent as efficiently. 
This mirrors the depth advantage in feedforward networks and is why nearly all state-of-the-art RNN-based models (pre-Transformer) used stacked RNNs (2-4 layers)","C":"Single-layer with doubled hidden size wins because stacking creates vanishing gradients","D":"Performance is identical; depth and width are equivalent for LSTMs"},"correct":"B","explanation":{"correct":"- Depth enables hierarchical processing: in Seq2Seq translation models (Sutskever et al. 2014, Wu et al. 2016 Google NMT), stacked LSTMs (4 layers) significantly outperformed single-layer LSTMs. The gain from adding the 2nd layer was larger than from adding more width.\n- Two-layer stacking: h1_t = LSTM1(x_t, h1_{t-1}); h2_t = LSTM2(h1_t, h2_{t-1}). Layer 2 processes sequences of layer 1 outputs — a \"sequence of features\" rather than \"sequence of words.\"\n- Width saturation: increasing hidden size gives diminishing returns; going from H=512 to H=1024 helps less than adding a second 512-unit LSTM layer. Additional neurons in a wide single-layer model become redundant (correlated) as width increases.","A":"Wider single-layer models encounter the redundancy problem — extra neurons learn similar functions. Depth creates qualitatively different processing levels, not just more parallel features.","B":"","C":"Stacking does increase gradient path length, but LSTM's cell state provides stable gradient flow across time. Stacking 2-4 LSTM layers doesn't cause significant vanishing gradients for typical sequence lengths (up to a few hundred tokens).","D":"Depth and width are not equivalent — this is a fundamental finding across all deep learning. 
For sequential tasks with hierarchical structure (language, time series), depth creates qualitatively better representations."},"reference":"- Sutskever et al., \"Sequence to Sequence Learning with Neural Networks\" (2014): 4-layer stacked LSTM\n- Wu et al., \"Google's Neural Machine Translation System\" (2016): 8-layer stacked LSTM"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12008","difficulty":"medium","orderIndex":8,"question":"An LSTM is used for stock price prediction. The training set covers 2010-2020; the test set is 2020-2023. The model achieves very low training loss but terrible test loss. A naive colleague proposes \"add more LSTM layers to increase capacity.\" What is the likely root cause and why would more capacity not help?","options":{"A":"More capacity would definitely help; the model is simply underfitting","B":"The model is overfitting to the 2010-2020 distribution. The 2020-2023 period includes COVID-19 market disruptions, remote work shifts, and new market dynamics that don't appear in training. More LSTM capacity would increase memorization of the 2010-2020 specific patterns (overfitting worse), not generalization. The actual fixes: (1) regularization (Dropout, L2); (2) expanding training data to include diverse market regimes; (3) using a simpler model with better inductive biases for non-stationary time series","C":"The issue is that LSTMs cannot handle financial data; use a CNN instead","D":"The issue is sequence length mismatch between training and test sets"},"correct":"B","explanation":{"correct":"- Distribution shift: financial markets exhibit non-stationarity — statistical properties change over time (market regimes, volatility regimes). A model trained on 2010-2020 sees bull market cycles, tech dominance, quantitative easing. 
2020-2023 brings unprecedented events.\n- More capacity makes it worse: a higher-capacity model will fit the 2010-2020 data more precisely (lower training loss) but become more brittle to distribution shift. The model has more parameters to encode the specific patterns of the training period, and fewer \"slack\" parameters for out-of-distribution generalization.\n- This is a classic distribution shift / concept drift problem, not an underfitting problem. The diagnostic: training loss is already very low (not underfitting). Test loss is high (generalization failure). More capacity reduces training loss further but increases test loss.","A":"Low training loss and high test loss is the definition of overfitting/distribution shift, not underfitting. The solution to overfitting is regularization, not more capacity.","B":"","C":"CNNs can process time series (1D CNNs), but the problem is distribution shift, not architecture. Switching to a CNN wouldn't fix the generalization across market regimes. The architecture is not the fundamental issue.","D":"Sequence length mismatch would cause technical errors (shape mismatches) or be trivially fixed by truncation/padding. The described problem (training works, test fails) is characteristic of distribution shift or overfitting, not sequence length issues."},"reference":"- Hurst, \"Overfitting and Distribution Shift in Time Series Forecasting\" (general concept)\n- https://arxiv.org/abs/2004.12667 (temporal covariate shift in time series)"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12009","difficulty":"hard","orderIndex":9,"question":"You implement a character-level LSTM language model. At inference, you generate text by sampling from the softmax output. You notice that with temperature=1.0, the text is grammatical but repetitive. With temperature=0.01 (near-greedy), output is highly repetitive (loops). With temperature=2.0, output is incoherent. 
Explain the relationship between temperature and the LSTM's hidden state, and what causes the repetitive loop at near-zero temperature.","options":{"A":"Temperature only affects the output tokens; the LSTM hidden state is temperature-independent","B":"Temperature scales logits before softmax: p_i = softmax(logits/T). High T → uniform distribution (exploration). Low T → peaked distribution (exploitation). The repetition loop at low temperature occurs because: the LSTM's hidden state h_t is conditioned on the previous token x_t. When the model repeatedly samples the same token (greedy/near-greedy often produces a token that the model has been trained to follow with the same token), the hidden state converges to a fixed point — a state where the most probable next token feeds back to produce the same state. The LSTM is trapped in a hidden state cycle","C":"Repetition is caused by the forget gate becoming saturated at low temperature","D":"Temperature changes are applied to the hidden state, not the output distribution"},"correct":"B","explanation":{"correct":"- Fixed point analysis: at temperature→0, the model always selects argmax(logits). If the sequence \"the the the\" has high probability under the model (because \"the\" appears frequently and the LSTM produces high probability for \"the\" after \"the\"), the model is trapped in this cycle.\n- Hidden state convergence: h_t = LSTM(h_{t-1}, x_t). If x_t is always \"the\", then h_t → h* (a fixed vector) because the recurrence with constant input converges. At h*, the model always predicts \"the\", reinforcing the loop.\n- Temperature 2.0 (incoherence): uniform distribution → samples rare or semantically inappropriate words → the LSTM hidden state transitions to an atypical state → subsequent predictions are also atypical.","A":"Temperature is applied to the output logits, but the sampled token IS fed back into the LSTM as input. 
Therefore, temperature indirectly affects the hidden state trajectory by determining which token is sampled and fed back.","B":"","C":"Forget gate saturation at low temperature is not the mechanism. The forget gate is controlled by the current input and previous hidden state, not by the sampling temperature. The loop is a fixed-point attractor in the hidden-state-token space.","D":"Temperature is applied to the logits (pre-softmax activations) of the output projection layer. It does not modify the hidden state directly."},"reference":"- Karpathy, \"The Unreasonable Effectiveness of Recurrent Neural Networks\" (2015): http://karpathy.github.io/2015/05/21/rnn-effectiveness/"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12010","difficulty":"medium","orderIndex":10,"question":"Modern NLP uses Transformers almost exclusively. A senior engineer claims \"there are still tasks where RNNs beat Transformers.\" What are these tasks, and what properties of RNNs provide the advantage?","options":{"A":"RNNs never beat Transformers; the senior engineer is wrong","B":"RNNs retain advantages in: (1) Online/streaming inference — RNNs process one token at a time with O(1) memory; Transformers require O(T) KV cache that grows with sequence length. For real-time processing with unbounded sequences, RNNs are superior. (2) Very long sequences — Transformer attention is O(T²) compute; at T=100,000 tokens, quadratic scaling is prohibitive. LSTMs process such sequences in O(T). 
(3) Hardware-constrained edge deployment — RNN hidden state is a fixed-size vector; complete model inference requires only the state vector and current input, not the entire history","C":"RNNs beat Transformers only on synthetic tasks designed to favor sequential processing","D":"RNNs beat Transformers when using ReLU activations instead of tanh"},"correct":"B","explanation":{"correct":"- Streaming inference: an LSTM with H=512 processes token t with only h_{t-1} (512 floats) + x_t → h_t. Memory is constant regardless of how many tokens have been processed. A Transformer needs to store all previous token key-value pairs in the KV cache — O(2 × num_layers × num_heads × head_dim × T) memory, growing linearly with sequence length T.\n- O(T²) vs O(T) computation: for documents with T=100K tokens, O(T²) = 10¹⁰ operations per attention layer. LSTMs: O(T) × O(H²) operations — linear in T.\n- Note: RWKV, Mamba, and SSMs (State Space Models) are recent architectures that combine Transformer-level performance with RNN-style O(T) inference — they're replacing RNNs for these use cases.","A":"The senior engineer is correct in specific scenarios. Streaming and long-sequence tasks are legitimate cases where RNNs (and their successors like Mamba) have practical advantages. Saying \"never\" is incorrect.","B":"","C":"The advantages are practical, not synthetic. Production streaming ASR (speech recognition) systems and edge devices with memory constraints are real-world use cases.","D":"Activation function choice doesn't determine when RNNs beat Transformers. 
The advantages are architectural (O(1) memory, O(T) compute), not activation-dependent."},"reference":"- Gu & Dao, \"Mamba: Linear-Time Sequence Modeling with Selective State Spaces\" (2023): https://arxiv.org/abs/2312.00752\n- Peng et al., \"RWKV: Reinventing RNNs for the Transformer Era\" (2023): https://arxiv.org/abs/2305.13048"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12011","difficulty":"hard","orderIndex":11,"question":"You implement an encoder-decoder LSTM for sequence-to-sequence machine translation. The encoder reads the source sentence and compresses it into a single fixed-size vector (the final hidden state). This is then used as the decoder's initial hidden state. For short sentences (5-10 tokens), performance is good. For long sentences (50+ tokens), BLEU scores drop significantly. What is the fundamental architectural limitation causing this?","options":{"A":"The encoder uses too few LSTM layers; add more layers to increase capacity","B":"The fixed-size bottleneck problem: all information from the source sentence must be compressed into a single hidden state vector of size H (e.g., 512 or 1024). For short sentences, a 512-d vector can capture the full meaning. For 50+ token sentences, the compression ratio is too high — the encoder must discard information to fit everything in 512 dimensions. This is the \"information bottleneck\" and motivated the invention of attention mechanisms (Bahdanau et al., 2015): instead of compressing to a single vector, attention allows the decoder to selectively access any encoder hidden state at each decoder step","C":"Long sentences cause BPTT to use too many steps, causing gradient explosion in the encoder","D":"The performance drop is due to the decoder, not the encoder — longer targets require more decoder steps"},"correct":"B","explanation":{"correct":"- The bottleneck: regardless of sentence length, the encoder must summarize everything into h_T of fixed size H. 
For \"The cat sat on the mat.\" (6 tokens): easy compression. For a 50-token sentence with complex structure, dependencies, and multiple clauses: the 512-d vector must encode all of this simultaneously.\n- Empirical evidence: Cho et al. (2014) showed performance degrades sharply for sentences > 30 tokens in seq2seq models. Bahdanau et al. (2015) proposed attention, allowing the decoder to create a different context vector c_t = Σ α_{ti} h_i for each decoder step — directly addressing the bottleneck.\n- The longer the source sentence, the more information is discarded in the fixed-size encoding, leading to poor BLEU scores for long sentences.","A":"More encoder LSTM layers increase the capacity to process each step, but the bottleneck is the fixed-size final hidden state, not the layers' processing capacity. Adding layers doesn't change the H-dimensional bottleneck.","B":"","C":"BPTT through 50 encoder steps doesn't typically cause gradient explosion in LSTMs (LSTMs have cell state highway for gradients). If gradient clipping is applied, this is even less of an issue. The bottleneck is information capacity, not gradient flow.","D":"The decoder performance drop is a consequence of poor encoder representation. If the encoder discards information, the decoder has nothing to work with. The root cause is the encoder-side compression bottleneck."},"reference":"- Cho et al., \"Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation\" (2014): https://arxiv.org/abs/1406.1078\n- Bahdanau et al., \"Neural Machine Translation by Jointly Learning to Align and Translate\" (attention mechanism, 2015): https://arxiv.org/abs/1409.0473"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12012","difficulty":"hard","orderIndex":12,"question":"You profile an LSTM-based model on a GPU. Despite the LSTM having H=1024 and batch_size=256, GPU utilization is only 15%. 
A profiler shows the bottleneck is sequential matrix multiplications (h_t depends on h_{t-1}, so each step must wait for the previous). What specific GPU efficiency problem does this represent and what architectural modifications address it?","options":{"A":"The problem is batch_size being too small; increase to 4096 to improve GPU utilization","B":"The fundamental problem is sequential data dependency: h_t = LSTM(h_{t-1}, x_t) means step t cannot start until step t-1 completes. GPUs achieve high utilization through massive parallelism. With T=100 timesteps, only 1 of the T potential parallel computations is active at each step. Across the batch (256 samples), the same timestep is processed in parallel, but each step is still one modest matrix multiplication plus a few small elementwise kernels, which underutilizes the GPU. Fixes: (1) compute the input projection for all timesteps in one large matrix multiplication outside the recurrent loop; (2) use quasi-recurrent neural networks (QRNNs) that parallelize most computation across time; (3) use convolutions (parallelizable) for the input transformation and only use recurrence for the gating; (4) switch to Transformers which are fully parallelizable across time","C":"The problem is that 1024 hidden size is too small for GPU optimization; increase to 8192","D":"The problem is missing cuDNN LSTM optimizations; just add torch.backends.cudnn.enabled = True"},"correct":"B","explanation":{"correct":"- Sequential dependency: in an RNN/LSTM, T timesteps must be processed sequentially. Each step exposes only one recurrent matrix multiplication, (256×1024)·(1024×4096) for the four gates, plus small elementwise kernels. That is far less independent work than the GPU can run at once, and the T steps cannot overlap, so kernel-launch overhead and idle gaps between steps keep utilization low.\n- Compare with Transformers: all T timesteps' attention can be computed in parallel using batched matrix multiplication Q×K^T. 
A single layer processes all T positions simultaneously, achieving high GPU utilization.\n- QRNN (Bradbury et al., 2016): parallelizes the convolution over time (input transformation is a temporal convolution, GPU-parallel) while keeping the minimal recurrence in the pooling step (small, fast). The paper reports up to 16× speedup over LSTM.","A":"Increasing the batch size would help GPU utilization marginally (more samples processed in parallel per step). But the fundamental bottleneck is temporal sequential dependency (T sequential steps), not batch parallelism. Even batch_size=4096 still processes T=100 steps sequentially.","B":"","C":"Hidden size 1024 gives a 1024×1024 (about 1M-parameter) recurrent weight matrix per gate. These are large enough for efficient GEMM operations. The issue is temporal dependency, not matrix size.","D":"cuDNN's fused LSTM kernel can provide a substantial speedup (figures around 2-3× are often quoted) by fusing operations, but it doesn't overcome the fundamental sequential dependency bottleneck. Moving from 15% to roughly 45% utilization is plausible, but reaching 80%+ requires removing the sequential dependency."},"reference":"- Bradbury et al., \"Quasi-Recurrent Neural Networks\" (2016): https://arxiv.org/abs/1611.01576\n- PyTorch cuDNN LSTM optimization: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12013","difficulty":"medium","orderIndex":13,"question":"You train an LSTM for time series anomaly detection. The normal patterns show gradual trends, but anomalies are sudden spikes. After training, the model achieves 60% anomaly detection rate. A colleague suggests \"add teacher forcing during training.\" Would this help, and what is teacher forcing?","options":{"A":"Teacher forcing would definitely help; use it for all sequence prediction tasks","B":"Teacher forcing: during training, at each step t, instead of feeding the model's own prediction ŷ_{t-1} back as input, feed the ground truth y_{t-1}. 
This allows faster training convergence (model sees correct context, avoids compounding errors). However, for anomaly detection specifically, teacher forcing causes a training-inference mismatch: at inference, the model must use its own predictions (or actual observations) as input. If the model was trained with perfect previous values, it may not generalize well to inference conditions where its previous predictions may be slightly off","C":"Teacher forcing is only used for language models; it doesn't apply to time series","D":"Teacher forcing should be avoided entirely; it causes catastrophic forgetting in LSTMs"},"correct":"B","explanation":{"correct":"- Teacher forcing mechanism: at training step t, standard training feeds ŷ_{t-1} = model_output_{t-1} → model compounding prediction errors if any early prediction is wrong. Teacher forcing feeds y_{t-1} = ground truth → model always sees correct previous values during training.\n- Benefits: faster convergence, more stable gradients (no error accumulation). Used extensively in seq2seq models.\n- Problem (exposure bias): the model is never exposed to its own prediction errors during training. At inference, small errors compound: ŷ_t is slightly off → ŷ_{t+1} is more off → ŷ_{t+2} is even more off. For anomaly detection, the model must handle both normal and anomalous inputs — training only on ground truth means the model never learns to recover from prediction errors.","A":"\"Always use teacher forcing\" ignores the training-inference gap problem. For tasks with long autoregressive generation, teacher forcing can hurt inference performance. Scheduled sampling (gradually replacing teacher forcing with model predictions) is often a better approach.","B":"","C":"Teacher forcing is used in any task where the model uses its own previous predictions as input: time series prediction, seq2seq models, language models, and anomaly detection with recurrent models. 
It's not limited to language models.","D":"Teacher forcing doesn't cause catastrophic forgetting (which is a continual learning problem where new training overwrites old knowledge). They are completely different concepts."},"reference":"- Williams & Zipser, \"A Learning Algorithm for Continually Running Fully Recurrent Neural Networks\" (1989): original teacher forcing\n- Bengio et al., \"Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks\" (2015): https://arxiv.org/abs/1506.03099"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12014","difficulty":"easy","orderIndex":14,"question":"You add `dropout=0.3` to a PyTorch `nn.LSTM` layer with `num_layers=3`. A junior engineer says \"dropout is applied after every hidden-to-hidden transition.\" Is this correct?","options":{"A":"Yes — dropout is applied after every h_{t-1} → h_t transition","B":"No — PyTorch's nn.LSTM dropout is applied only between LSTM layers (inter-layer dropout), not within a single layer's hidden-to-hidden transitions. For a 3-layer LSTM: dropout is applied between layer 1→2 and layer 2→3 outputs. The temporal recurrence (h_{t-1} → h_t) within each layer does NOT have dropout applied. This is a known limitation: Variational Dropout (Gal & Ghahramani, 2016) applies the same dropout mask across all timesteps for both input and recurrent connections, but PyTorch's standard LSTM doesn't implement this","C":"Dropout is applied only to the final LSTM layer's output","D":"Dropout in nn.LSTM is applied to every weight matrix independently"},"correct":"B","explanation":{"correct":"- PyTorch nn.LSTM with dropout=p and num_layers=k: PyTorch documentation explicitly states \"If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer.\"\n- This means: for 3 layers, dropout is applied to the output of layer 1 (before feeding to layer 2) and the output of layer 2 (before feeding to layer 3). 
The output of layer 3 (the final layer) has no dropout.\n- The recurrent transition within each layer (h_{t-1} → h_t) has no dropout in the standard PyTorch implementation.\n- Variational RNN Dropout: uses the same dropout mask at every timestep (Gal & Ghahramani, 2016). Standard PyTorch uses independent random masks at each step (when applied), and only between layers.","A":"Dropout is not applied at every hidden-to-hidden transition. This is the Variational Dropout approach, not PyTorch's default. Standard nn.LSTM only applies dropout between layers.","B":"","C":"The final layer's output has NO dropout (per PyTorch documentation: \"except the last layer\"). Dropout is applied between intermediate layers.","D":"PyTorch's nn.LSTM dropout doesn't apply to weight matrices directly (that would be weight dropout / DropConnect). It applies to the activation outputs between layers."},"reference":"- PyTorch nn.LSTM documentation: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html\n- Gal & Ghahramani, \"A Theoretically Grounded Application of Dropout in Recurrent Neural Networks\" (2016): https://arxiv.org/abs/1512.05287"},{"section":"deep-learning","topicSlug":"rnn-lstm-gru","topic":"Rnn Lstm Gru","id":"dl-12015","difficulty":"hard","orderIndex":15,"question":"An LSTM-based seq2seq model with attention is trained for machine translation. At test time, you use beam search with beam_size=5. A researcher claims \"increasing beam_size always improves BLEU.\" You increase to beam_size=50 and observe BLEU decreases. What is happening?","options":{"A":"Larger beam sizes always improve BLEU; the decrease is a software bug","B":"The \"beam search curse\" (or beam search optimization inconsistency): with larger beams, beam search finds translations with higher model log-probability but lower BLEU scores. The model is imperfect — it assigns high probability to sequences that are structurally fluent but semantically incorrect or contain \"safe\" but generic phrases. 
Larger beams explore more of the model's probability space, finding sequences that are very \"safe\" (high probability under the model) but not actually good translations. The model's log-prob is a proxy for quality, and this proxy breaks down for extreme beams","C":"Larger beams cause GPU memory overflow that corrupts output tokens","D":"BLEU decreases because beam search with large beams produces longer sequences that are penalized by BLEU's brevity penalty"},"correct":"B","explanation":{"correct":"- Beam search optimization: maximize Σ log p(y_t | y_{> 1 (large variance), the maximum element dominates and softmax ≈ one-hot. Gradient of softmax: p_i(1-p_i) → 0 as p_i → 1. Near-zero gradients → training stalls.\n- Vaswani et al. include this exact analysis in \"Attention Is All You Need\" (2017), Section 3.2.1.","A":"\"Overflow\" in softmax is a separate issue (handled by softmax(x - max(x)) numerics). The √d_k scaling is about gradient flow, not overflow. Without scaling, values are large but finite for reasonable d_k.","B":"","C":"Dividing by √d_k (a scalar) has negligible computational cost — it's a single multiplication per element. It doesn't meaningfully reduce softmax computation.","D":"Softmax always sums to 1 regardless of input magnitude. The sum-to-1 property is a mathematical property of softmax, not a consequence of scaling."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): https://arxiv.org/abs/1706.03762 (Section 3.2.1)"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13002","difficulty":"easy","orderIndex":2,"question":"Multi-head attention uses h separate attention heads, each with reduced dimension d_k = d_model/h. After computing h attention outputs, they are concatenated and projected. A team increases h from 8 to 32 (keeping d_model=512 constant). 
What changes in computation?","options":{"A":"Increasing h increases total FLOPs proportionally (4× more attention operations)","B":"Total FLOPs remain approximately constant. Each head has d_k = 512/h. The QKV projections: W_Q ∈ ℝ^{d_model × d_k}. With more heads, each head's projection is smaller: total Q projection FLOPs = d_model × d_k × h = d_model × d_model (constant). The attention computation per head: O(T² × d_k) × h = O(T² × d_model) (constant). The output changes: with more heads, each head attends to lower-dimensional subspaces — each head specializes in a narrower feature space","C":"Increasing h increases memory usage by 4× because more attention matrices are stored","D":"Increasing h from 8 to 32 increases d_k from 64 to 256"},"correct":"B","explanation":{"correct":"- Multi-head attention total computation:\n- QKV projections: 3 × T × d_model × d_k × h = 3 × T × d_model² (constant in h)\n- Attention: T² × d_k × h = T² × d_model (constant in h)\n- Output projection: T × d_model × d_model (constant in h)\n- Changing h only redistributes capacity across more subspaces; it does not add total capacity.\n- Trade-off: more heads → lower-dimensional per-head representations → each head is more constrained → may miss complex patterns within a subspace, but can specialize into different relationship types. Too many heads (small d_k) → each head too narrow to capture useful features.","A":"FLOPs don't scale with h because d_k decreases proportionally. 32 heads × d_k=16 = same total dimension as 8 heads × d_k=64.","B":"","C":"Overall memory does not grow 4×: the parameters (d_model² per projection) and the Q, K, V, and output activations (T × d_model each) are unchanged. The one buffer that does scale with h is the set of attention score matrices: T×T per head, so h × T² elements in total, which is 4× more at h=32 than at h=8 if they are materialized. Fused kernels such as FlashAttention avoid storing them. Option C overgeneralizes that one buffer to total memory usage.","D":"d_k = d_model/h. More heads → smaller d_k. h=32, d_model=512 → d_k = 16 (not 256). 
This is the opposite direction."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3.2.2"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13003","difficulty":"medium","orderIndex":3,"question":"Transformers use positional encoding (PE) added to token embeddings. The original Transformer uses sinusoidal PE: PE(pos, 2i) = sin(pos/10000^(2i/d_model)), PE(pos, 2i+1) = cos(...). A team replaces this with learned positional embeddings (a lookup table). For a model trained on sequences up to length 512, they test on sequences of length 1024. What happens?","options":{"A":"Learned PE generalizes perfectly to length 1024 because the model learns position-invariant patterns","B":"Learned PE typically fails to generalize beyond training length. Position embeddings for positions 513-1024 were never seen during training — the embedding table has no entries for these positions (or wraps around/truncates). Even if technically extended, the model has no learned signal for these positions. Sinusoidal PE can extrapolate because the mathematical function is defined for any position. However, in practice, even sinusoidal PE performance degrades for very out-of-distribution positions due to attention patterns being calibrated for shorter sequences","C":"Both sinusoidal and learned PE fail completely for sequences longer than training length","D":"Learned PE is strictly better — it can represent any positional pattern including those beyond 512"},"correct":"B","explanation":{"correct":"- Learned PE lookup table: embedding_table ∈ ℝ^{max_seq_len × d_model}. For position 513: `embedding_table[513]` simply doesn't exist (IndexError or truncation). Even if you extend with zeros or random values, the model has no trained understanding of these positions.\n- Sinusoidal extrapolation: PE(pos, i) = sin(pos/10000^(2i/d_model)) is mathematically defined for any integer pos. 
Position 1024 produces a valid vector. However, the attention mechanism's effective range is still calibrated for < 512 positions.\n- RoPE (Rotary Position Embedding) and ALiBi are relative-position schemes that cope better with longer contexts; both are used in production LLMs (e.g., RoPE in LLaMA, ALiBi in BLOOM and MPT).","A":"\"Position-invariant patterns\" would mean the model doesn't use positional information at all. Learned PE is specifically designed to encode position — it's not position-invariant, and it doesn't generalize to unseen positions.","B":"","C":"Sinusoidal PE does extrapolate mathematically (the formula is valid for any position). The claim that it \"fails completely\" is too strong. It may degrade, but it produces valid vectors for any position.","D":"Learned PE is strictly bounded by the training length. Beyond that, it has no learned representation. It cannot represent patterns for positions it never encountered."},"reference":"- Su et al., \"RoFormer: Enhanced Transformer with Rotary Position Embedding\" (RoPE, 2021): https://arxiv.org/abs/2104.09864\n- Press et al., \"Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation\" (ALiBi, 2021): https://arxiv.org/abs/2108.12409"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13004","difficulty":"medium","orderIndex":4,"question":"The Transformer feed-forward network (FFN) consists of two linear layers: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2, with inner dimension 4×d_model. For d_model=512, this is: 512 → 2048 → 512. The FFN has about 2× the parameters of the attention sublayer (2 × d_model × 4·d_model = 8·d_model² vs 4·d_model² for the Q, K, V, and output projections). What is the role of the FFN, and why is the 4× expansion useful?","options":{"A":"The FFN is redundant; Transformers would work equally well without it","B":"The FFN applies position-wise transformations — the same function independently to each position. 
While attention performs \"routing\" (mixing information across positions, deciding which positions are relevant to each other), the FFN performs \"computation\" (applying a non-linear transformation to the token's representation at that position). The 4× expansion (dimension bottleneck) allows the model to represent complex functions in the high-dimensional intermediate space: the first layer expands to 2048 dimensions (more features to compute from), the second layer selects and compresses. This is analogous to how wider hidden layers in MLPs can represent more complex functions","C":"The FFN acts as a key-value memory, storing factual knowledge about the world","D":"The 4× expansion is a legacy design choice that modern Transformers have eliminated"},"correct":"B","explanation":{"correct":"- Attention vs FFN role: attention computes token interactions (which positions attend to which). FFN applies a nonlinear transformation per token independently.\n- The \"write-then-read\" intuition: attention gathers relevant context into a representation; FFN then processes this representation. The 4× expansion gives the FFN more \"working memory\" — 2048 intermediate dimensions for computing complex functions of the 512-d input.\n- Research on FFN as memory (Geva et al., 2020): actually supports the memory interpretation — FFN layers seem to store factual associations. But the primary designed role is position-wise nonlinear computation.","A":"Removing FFN layers from Transformers significantly reduces performance. Ablation studies show that FFN layers contribute substantially to Transformer quality. The architecture without FFN is much weaker.","B":"","C":"The \"key-value memory\" interpretation (Geva et al., 2020) is an emerging research finding, not the designed purpose. The primary role is position-wise nonlinear computation. 
Presenting the memory interpretation as \"the role\" oversimplifies.","D":"An expanded inner dimension remains standard in modern Transformers; it has not been eliminated. The factor varies (e.g., LLaMA uses roughly 8/3× with SwiGLU, which has three weight matrices instead of two), but the expand-then-project design persists."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3.3\n- Geva et al., \"Transformer Feed-Forward Layers Are Key-Value Memories\" (2021): https://arxiv.org/abs/2012.14913"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13005","difficulty":"medium","orderIndex":5,"question":"You notice that in a trained Transformer, the attention patterns in layer 1 are very different from layer 12 (in a 12-layer model). Layer 1 shows local attention (mostly attending to nearby tokens). Layer 12 shows sparse, global attention (a few tokens attending to many distant ones). Why do attention patterns evolve across layers?","options":{"A":"Layer 1 has fewer parameters, forcing it to use local attention","B":"Layer 1 processes raw token embeddings + positional encodings. The embeddings encode surface-level information (word identity, local syntax). Attending locally at layer 1 is well suited to capturing local syntactic structure. By layer 12, representations have been refined through many layers of attention+FFN — they encode abstract semantic information. Global, sparse attention in later layers reflects high-level semantic associations across the full sequence (e.g., a pronoun attending to its antecedent many tokens away, a verb attending to its distant subject). 
Each layer's attention patterns emerge from what information is useful at that representation level","C":"The attention patterns are random; variation across layers is not meaningful","D":"Layer 12 uses larger attention weights by design; PyTorch initializes later layers with higher weights"},"correct":"B","explanation":{"correct":"- Visualization studies (Clark et al., 2019 \"What Does BERT Look At?\"): different attention heads in different layers capture different linguistic phenomena. Early layers: local attention, syntactic relations (adjacent token dependencies). Late layers: long-range semantic dependencies, coreference.\n- Information accumulation: after 12 layers of attention, each token's representation encodes context from the entire sequence. The later layers have access to highly processed representations that encode global semantic structure, enabling long-range attention to be informative.\n- This hierarchical processing — local syntax → global semantics — mirrors what happens in the brain's language processing and in CNN layer hierarchies (local features → global patterns).","A":"All Transformer layers have the same number of parameters (same d_model, same number of heads). There's no \"fewer parameters in earlier layers\" — they're architecturally identical.","B":"","C":"Attention patterns are highly structured and reproducible across runs and models. Multiple papers show consistent patterns (local attention in early layers, global in later layers) across different Transformer models trained on different tasks.","D":"PyTorch initializes all Transformer layers identically (same initialization scheme for all layers). The patterns emerge during training, not from initialization."},"reference":"- Clark et al., \"What Does BERT Look At? 
An Analysis of BERT's Attention\" (2019): https://arxiv.org/abs/1906.04341\n- Tenney et al., \"BERT Rediscovers the Classical NLP Pipeline\" (2019): https://arxiv.org/abs/1905.05950"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13006","difficulty":"hard","orderIndex":6,"question":"Self-attention has O(T²) compute and memory complexity for sequence length T. For T=16,384 (16K tokens), this becomes prohibitive. Name three distinct approaches to reduce this complexity with different complexity-expressiveness trade-offs.","options":{"A":"The only solution is to reduce d_model; attention complexity cannot be reduced","B":"(1) Sparse attention (Longformer, BigBird): compute attention only for local windows + selected global tokens. O(T×w) where w is window size. Trades global attention for local efficiency. (2) Linear attention (e.g., Performer): approximate softmax attention with kernel feature maps, Φ(Q)Φ(K)^T ≈ exp(QK^T), so the product Φ(K)^T V can be computed first. O(T×d²), linear in T. Some expressiveness loss due to approximation. (Linformer also reaches linear cost, but through low-rank projection of K and V.) (3) Flash Attention: same O(T²) compute, but avoids materializing the T×T matrix and minimizes HBM memory reads/writes using tiling. Full attention (exact), memory efficient, but still O(T²) compute","C":"The O(T²) complexity is a fundamental theorem; it cannot be reduced without losing all expressive power","D":"The solution is to chunk the sequence into non-overlapping segments and apply attention within each chunk (no cross-chunk attention)"},"correct":"B","explanation":{"correct":"- Sparse attention (Longformer): each token attends to w local neighbors + k global tokens. Total attention computations: O(T×(w+k)) instead of T². Trade-off: misses some long-range token interactions outside the window.\n- Linear attention (Performer): using random feature approximation of softmax kernel: exp(q·k) ≈ φ(q)^T φ(k). Rewrite attention as φ(Q)(φ(K)^T V) (O(T×d²)) instead of (QK^T)V (O(T²×d)). 
Trade-off: approximation error in attention distribution.\n- FlashAttention (Dao et al., 2022): exact attention with optimized memory access pattern. Uses tiling to compute attention block by block, never materializing the full T×T attention matrix. No expressiveness loss, but still O(T²) FLOPs — the win is memory and wall-clock time.","A":"Many published approaches reduce attention complexity below O(T²). Listing \"the only solution is reducing d_model\" ignores 5+ years of efficiency research.","B":"","C":"Linear attention methods (Performer, Linformer) achieve cost linear in T while remaining useful in practice, and sparse attention keeps much of the expressive power at O(T×w). The theorem claim is false.","D":"Non-overlapping chunks (basic segmentation) is a crude solution that loses all cross-chunk dependencies. This is worse than sparse attention approaches that at least have overlapping windows or global tokens."},"reference":"- Tay et al., \"Efficient Transformers: A Survey\" (2020): https://arxiv.org/abs/2009.06732\n- Dao et al., \"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness\" (2022): https://arxiv.org/abs/2205.14135"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13007","difficulty":"hard","orderIndex":7,"question":"A team computes attention: Q ∈ ℝ^{T×d_k}, K ∈ ℝ^{S×d_k}, V ∈ ℝ^{S×d_v}. In self-attention, T=S. In cross-attention (decoder attending to encoder), T and S generally differ. For a translation model with source length S=20 and target length T=15, describe the shapes of the attention matrix and what each entry (i,j) represents.","options":{"A":"The attention matrix is always square (T×T) regardless of S and T","B":"The attention matrix QK^T ∈ ℝ^{T×S} = ℝ^{15×20}. Entry (i,j) represents the attention score between target position i and source position j — i.e., how much target word i should attend to (be influenced by) source word j when generating its representation. After softmax: each row sums to 1, representing a probability distribution over source positions for each target position. 
V then aggregates: A×V ∈ ℝ^{15×d_v} produces a context vector for each target position as a weighted sum of source value vectors","C":"The attention matrix shape is (S×T) = (20×15) because source attends to target","D":"Cross-attention requires T=S; the team must pad the source to length 15"},"correct":"B","explanation":{"correct":"- Cross-attention mechanics: Q from decoder (shape: T×d_k), K and V from encoder (shape: S×d_k and S×d_v). QK^T: (T×d_k) × (d_k×S) = T×S.\n- Entry (i,j) of the raw attention matrix: the similarity between query vector q_i (representation of target position i) and key vector k_j (representation of source position j). Softmax over j: how much target token i attends to source token j.\n- This is the mechanism Bahdanau et al. (2015) introduced: the decoder decides where to look in the source sentence for each target word — explicitly learned alignment.","A":"The attention matrix is T×S (not necessarily square). Self-attention has T=S, but cross-attention generally doesn't. The Q×K^T operation requires Q to have columns = K rows (d_k), not the same number of rows.","B":"","C":"Q comes from the decoder (T=target length), K comes from encoder (S=source length). Target attends to source, so attention is T×S (rows=target queries, cols=source keys), not S×T.","D":"Cross-attention explicitly allows different sequence lengths. This is its primary advantage over the original fixed-size encoder vector. Padding to equal lengths is unnecessary and wastes computation."},"reference":"- Bahdanau et al., \"Neural Machine Translation by Jointly Learning to Align and Translate\" (2015): https://arxiv.org/abs/1409.0473"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13008","difficulty":"medium","orderIndex":8,"question":"Layer normalization in the original Transformer is applied as Post-LN: output = LN(x + Sublayer(x)). Modern LLMs use Pre-LN: output = x + Sublayer(LN(x)). 
You are designing a new Transformer for a task requiring training with a very high learning rate (1e-3). Which normalization placement should you use and why?","options":{"A":"Post-LN with a very high learning rate is safe; LN handles any learning rate","B":"Pre-LN is the better choice for stable training with high learning rates. In Pre-LN, the gradient of x flows back through the residual path: ∂L/∂x includes a direct term (from the skip connection) that is not scaled by LN. This direct path ensures gradient magnitude doesn't collapse regardless of LN's behavior. With Post-LN at high LR, the optimization is very sensitive to initialization — high LR with Post-LN often causes training divergence. Xiong et al. (2020) show that Pre-LN can train with little or no warmup at higher learning rates, which shortens training","C":"Post-LN is required; Pre-LN with high learning rate causes exploding activations","D":"The choice doesn't matter; use whichever is implemented in the framework"},"correct":"B","explanation":{"correct":"- Post-LN gradient: ∂L/∂x_in flows through LN(x_in + F(x_in)). The LN normalization gates the gradient magnitude based on the total activation variance. Early in training with high LR and Post-LN, the combined (signal + residual) has high variance, causing LN to scale gradients in unpredictable ways → divergence.\n- Pre-LN gradient: with x_out = x_in + F(LN(x_in)), ∂L/∂x_in = ∂L/∂x_out · (I + ∂F(LN(x_in))/∂x_in). The identity term from the skip connection is always present, providing a well-scaled gradient path.\n- Many modern LLMs (e.g., GPT-2, GPT-3, PaLM, and LLaMA with RMSNorm) normalize before the sublayer largely for this stability, although most still use some warmup in practice.","A":"LN does NOT make any LR safe. LN normalizes activations, which stabilizes optimization, but Post-LN with very high LR (1e-3 for Transformers is typically high) still diverges regularly. Requiring warmup is exactly the problem Post-LN has.","B":"","C":"Pre-LN with high LR is more stable, not less. The direct residual path in Pre-LN keeps gradient magnitudes well scaled. 
\"Exploding activations\" is not the failure mode reported for Pre-LN at high learning rates.","D":"The choice of normalization placement is one of the most important architectural decisions for training stability. Xiong et al. (2020) show measurably better stability with Pre-LN."},"reference":"- Xiong et al., \"On Layer Normalization in the Transformer Architecture\" (2020): https://arxiv.org/abs/2002.04745"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13009","difficulty":"hard","orderIndex":9,"question":"KV cache (key-value cache) is used during Transformer inference for autoregressive generation. Without KV cache, every step recomputes keys and values for the whole prefix, so the K/V projection work over T generated tokens grows as O(T²). With KV cache, each step computes K and V only for the new token, so that work is O(T) (attention over the cached prefix is still O(T²) in total). However, a production system serving many concurrent requests with T=4096 tokens runs out of GPU memory. Explain the memory cost of KV cache and why it is a major production concern.","options":{"A":"KV cache memory is negligible compared to model weights; the OOM is from model parameters","B":"KV cache per request = 2 × num_layers × num_heads × head_dim × T × sizeof(dtype). For LLaMA-7B: 2 × 32 layers × 32 heads × 128 head_dim × 4096 tokens × 2 bytes (FP16) = 2×32×32×128×4096×2 = ~2GB per request. With 50 concurrent requests: ~100GB just for KV cache. LLaMA-7B weights are only 14GB in FP16. The KV cache grows linearly with sequence length and request concurrency, often exceeding model weight memory at production scale. 
This is why techniques like PagedAttention (vLLM), quantized KV cache, and context compression are active research areas","C":"KV cache memory is per-token and is released after each token is generated, so it doesn't accumulate","D":"KV cache only stores the final layer's key-value pairs; earlier layers are recomputed each step"},"correct":"B","explanation":{"correct":"- Exact calculation for LLaMA-7B (32 layers, 32 heads, 128 head_dim, FP16):\n- Per token per layer: 2 (K and V) × 32 heads × 128 head_dim × 2 bytes = 16,384 bytes = 16 KB\n- Per token total: 16 KB × 32 layers = 512 KB per token\n- For T=4096 tokens: 4096 × 512 KB ≈ 2 GB per request\n- At 50 concurrent requests: 50 × 2 GB = 100 GB KV cache vs 14 GB model weights.\n- vLLM's PagedAttention: inspired by OS virtual memory, stores KV cache in non-contiguous memory pages, allowing efficient memory sharing and preventing fragmentation.","A":"As shown, KV cache can be 2-10GB per long request — comparable to or larger than the model weights. At production concurrency, it's the primary memory bottleneck.","B":"","C":"KV cache accumulates throughout a single request (all previously generated tokens' K, V must be stored to avoid recomputation). It's released after the request completes, but during generation, it grows with each new token.","D":"KV cache stores all layers' K and V vectors. That's the point — to avoid recomputing them. 
Storing only the final layer's KV and recomputing earlier layers would eliminate most of the savings."},"reference":"- Kwon et al., \"Efficient Memory Management for Large Language Model Serving with PagedAttention\" (vLLM, 2023): https://arxiv.org/abs/2309.06180"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13010","difficulty":"hard","orderIndex":10,"question":"Rotary Position Embedding (RoPE) encodes position by rotating query and key vectors in 2D subspaces: the dot product of the rotated query at position m and the rotated key at position n depends on the positions only through the offset (m-n). Why is this \"relative\" property critical for generalization, and how does it differ from absolute sinusoidal PE?","options":{"A":"RoPE is only useful for models with more than 1000 layers; it doesn't apply to standard Transformers","B":"RoPE's attention score depends on (m-n) — the relative distance between positions, not their absolute positions. Absolute PE: the model learns that position 100 should attend differently to position 150 than position 100 attends to position 1. With training on sequences up to length 512, position 600 has never been seen: a learned absolute embedding for position 600 does not exist, and a sinusoidal one is defined but was never seen during training. RoPE: the model learns that \"looking back 50 positions\" means something, regardless of whether \"position 550 looking at 500\" or \"position 50 looking at 0\" — relative distance is the semantically meaningful quantity","C":"RoPE is a form of data augmentation, not a positional encoding","D":"Absolute sinusoidal PE also encodes relative position; RoPE is a minor implementation detail"},"correct":"B","explanation":{"correct":"- RoPE property: for rotation matrices R_m (position m) and R_n (position n): (R_m q)^T (R_n k) = q^T R_m^T R_n k = q^T R_{n-m} k. The dot product depends on the positions only through the relative offset between m and n. 
This is exactly what \"relative position encoding\" means.\n- Generalization beyond training length: the model learns functions of relative distances (1, 2, 5, 50, etc.). At inference with longer sequences, the same relative distances are used — position 5000 looking back 50 positions produces the same rotation offset as position 100 looking back 50. Offsets larger than any seen in training are still out of distribution, which is why long-context extensions add techniques such as position interpolation.\n- LLaMA, Mistral, Falcon all use RoPE for this generalization property.","A":"RoPE is used in standard Transformer architectures (LLaMA-7B to 70B, Mistral-7B, etc.) regardless of layer count. It's a positional encoding, not a layer-specific technique.","B":"","C":"RoPE is a positional encoding scheme, not data augmentation. It encodes token positions mathematically, not by modifying training samples.","D":"Absolute sinusoidal PE does NOT cleanly encode relative position in attention dot products. The sinusoidal vector PE(pos) is added to the token embedding, so each absolute position gets a different input vector. The dot product of two sinusoidal PEs does carry some relative position information (terms of the form cos((m-n)·f) appear), but after the learned Q/K projections and the cross terms with token content it is confounded with absolute position information, not purely relative."},"reference":"- Su et al., \"RoFormer: Enhanced Transformer with Rotary Position Embedding\" (2021): https://arxiv.org/abs/2104.09864"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13011","difficulty":"hard","orderIndex":11,"question":"Grouped Query Attention (GQA) and Multi-Query Attention (MQA) reduce the number of K and V heads compared to Q heads. Standard MHA: 32 Q, 32 K, 32 V heads. MQA: 32 Q, 1 K, 1 V head. GQA: 32 Q, 8 K, 8 V heads (groups of 4 Q heads share K/V). What is the primary motivation, and what is the accuracy-efficiency trade-off?","options":{"A":"GQA/MQA reduce FLOPs per attention computation by 32× or 4×","B":"The primary motivation is KV cache memory reduction. MQA: reduces K/V heads from 32 to 1 → reduces KV cache by 32×. 
GQA with 8 groups: reduces KV cache by 4×. The FLOPs of the attention scores themselves barely change (every Q head still attends over all positions; only the K/V projection FLOPs shrink), but the dominant bottleneck at inference is memory bandwidth (loading KV cache from HBM), not compute. Accuracy trade-off: MQA can reduce quality (single K/V shared across all 32 Q heads limits expressive diversity). GQA balances this: 8 K/V heads for 32 Q provides more diversity than MQA while still achieving ~4× KV cache reduction. LLaMA-2-70B uses GQA with 8 K/V groups","C":"GQA/MQA only help during training; inference memory is identical to standard MHA","D":"GQA/MQA eliminate the need for KV caching entirely"},"correct":"B","explanation":{"correct":"- Standard MHA KV cache per token: 2 × num_heads × head_dim × num_layers = 2 × 32 × 128 × 32 = 262,144 values per token.\n- MQA KV cache: 2 × 1 × 128 × 32 = 8,192 values — 32× less.\n- GQA (8 K/V groups) cache: 2 × 8 × 128 × 32 = 65,536 values — 4× less.\n- The memory bandwidth bottleneck: at inference, for each generated token, all KV cache must be loaded from GPU HBM. MQA/GQA directly reduce this bandwidth requirement, improving inference throughput.\n- Accuracy: Ainslie et al. (2023) GQA paper shows GQA stays close to MHA quality while MQA has a small but consistent quality loss.","A":"FLOPs for the attention computation scale with the number of query heads, which GQA/MQA leave unchanged: QK^T and the weighted sum over V are still computed for all 32 Q heads. Only the K/V projection FLOPs shrink, so FLOP reduction is not the primary motivation. Memory bandwidth is the bottleneck in autoregressive inference, not FLOPs.","B":"","C":"KV cache is an inference concept (caching K/V for previously generated tokens). GQA/MQA directly reduce its size, which is an inference benefit. Training is less affected because K and V for all positions are computed in one parallel pass, with no incremental cache to store and reload.","D":"KV caching is still necessary with GQA/MQA — the K and V vectors (now fewer) still need to be cached to avoid recomputation. 
The cache is smaller, not eliminated."},"reference":"- Ainslie et al., \"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints\" (2023): https://arxiv.org/abs/2305.13245"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13012","difficulty":"medium","orderIndex":12,"question":"You implement causal (masked) self-attention for autoregressive language modeling. The mask ensures token i only attends to positions j ≤ i. During training with batch_size=32 and sequence_length=512, you materialize a 512×512 causal mask of -∞ (upper triangle) and add it to the attention logits before softmax. A colleague says this is wasteful. What is she referring to and what is a more efficient implementation?","options":{"A":"The mask should be a boolean matrix, not -∞; softmax of -∞ causes NaN","B":"Materializing a 512×512 matrix of -∞ per batch requires storing 512² = 262,144 values per head (or batch element × head × 512² = 32×8×262K ≈ 67M values). This wastes memory and requires a separate addition operation. Efficient alternative: (1) create the mask once as a single bool matrix and broadcast it across batch and heads (for example with `masked_fill`) instead of storing a copy per batch element and head; (2) FlashAttention-style tiled attention builds the causal mask implicitly without materializing the full T×T matrix; (3) use `torch.nn.functional.scaled_dot_product_attention` with `is_causal=True` which applies the mask within the fused CUDA kernel without creating a full mask tensor","C":"The causal mask should be applied after softmax, not before","D":"Causal masking is not needed for decoder models; the sequential nature of generation handles causality"},"correct":"B","explanation":{"correct":"- Memory waste: a 512×512 float32 mask = 1MB per layer per item in the batch. With batch=32, 8 heads: 32×8×1MB = 256MB just for masks across all heads per layer.\n- The mask is a static, triangular pattern — the same for every batch element and every head. 
Creating it once as a registered buffer (`register_buffer`; not recomputed, not batched) saves memory.\n- PyTorch 2.0+: `F.scaled_dot_product_attention(q, k, v, is_causal=True)` can dispatch to a fused kernel (FlashAttention style) that applies the mask while computing attention block by block, never materializing the full T×T matrix. Whether a fused backend is selected depends on device, dtype, and input shapes.","A":"softmax(-∞) = 0 (not NaN). exp(-∞) = 0, which in the softmax denominator contributes 0, effectively masking the position. This is exactly the intended behavior. NaN only appears if an entire row is -∞, which a causal mask never produces because the diagonal is always unmasked.","B":"","C":"The causal mask must be applied before softmax to prevent probability mass from being assigned to masked (future) positions. Applying after softmax would require different mask values and wouldn't correctly zero out future positions.","D":"Causal masking is essential for training decoder models. Without the mask, during training (where the full sequence is available), each token can see all future tokens. The sequential nature applies only at inference time — during training, all positions are processed in parallel and the mask enforces causality."},"reference":"- PyTorch scaled_dot_product_attention: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13013","difficulty":"easy","orderIndex":13,"question":"In the Transformer architecture, residual connections are used in both attention and FFN sublayers: output = LN(x + Sublayer(x)). A 6-layer Transformer is trained without any residual connections. What will happen during training and why?","options":{"A":"Training will succeed but converge more slowly","B":"Without residual connections, gradients must flow through the full depth of nonlinear transformations. For a 6-layer Transformer with tanh or ReLU activations in FFN layers, the gradient at the input layer is ∂L/∂x_1 = ∂L/∂x_7 · Π_{l=1}^{6} ∂x_{l+1}/∂x_l. 
Without skip paths, each Jacobian in the product can have spectral norm < 1, causing vanishing gradients. The model will fail to train meaningfully — the first few layers will receive near-zero gradient updates while only the final layers train effectively","C":"Training will fail due to a shape mismatch error from the missing addition operation","D":"Training works fine for 6 layers; residual connections are only needed for 50+ layer networks"},"correct":"B","explanation":{"correct":"- Residual connection gradient: with x + F(x), the gradient ∂L/∂x includes a direct term \"1\" from the identity path. This prevents gradient from being forced through the nonlinear F(x) path, guaranteeing some gradient flow regardless of F(x)'s Jacobian.\n- Without residuals in a 6-layer Transformer: gradient must pass through 6 attention+FFN nonlinear compositions. Each composition can attenuate gradients. While 6 layers is not as extreme as 50+, the multi-head attention + FFN layers with LayerNorm are non-trivial nonlinear operations. Gradient vanishing within 6 layers is a real risk, especially early in training.\n- Empirical evidence: the degradation problem reported by He et al. showed that a deeper plain network (56 layers) reaches higher training error than a shallower one (20 layers). Transformers without residuals show similar degradation.","A":"\"More slowly\" understates the problem. For a 6-layer Transformer without residuals, training typically stalls — the model barely converges rather than just converging slowly.","B":"","C":"Removing the addition operation from `LN(x + Sublayer(x))` doesn't cause a shape mismatch. Both x and Sublayer(x) have the same shape (T × d_model). The operation just becomes `LN(Sublayer(x))`. It's a valid operation, just suboptimal.","D":"The need for residual connections is not exclusively for very deep networks. The original Transformer paper (6 layers) uses residual connections and shows they're important for stable training. 
He et al. demonstrated the degradation problem on plain networks deeper than about 20 layers, but residuals help at any depth."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2015): residual connections for training stability\n- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3 (residual connections in each sublayer)"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13014","difficulty":"hard","orderIndex":14,"question":"You analyze attention patterns in a trained Transformer and find that certain heads consistently produce near-uniform attention distributions (all positions receive equal weight ≈ 1/T). Another set of heads consistently attend to the [CLS] token or current position. What do these degenerate heads indicate?","options":{"A":"These heads are working correctly — uniform attention is an information aggregation strategy","B":"These are \"no-op heads\" or \"over-smoothing heads.\" Uniform attention head: computes the average of all value vectors — effectively a mean pooling operation. While this can be useful for global aggregation, many uniform heads indicate the model has more heads than it needs and some heads have settled into trivial solutions. Attending to [CLS] or to the current position: the head is not using the key-query interaction to make meaningful choices. Diagnoses: (1) too many heads for the task — some are redundant; (2) the head's Q/K projections learned trivial mappings; (3) informative heads have been \"stolen\" by LayerNorm's normalization. Pruning uniform/trivial heads typically maintains performance with faster inference","C":"Uniform attention heads cannot be pruned; they are mathematically required for the Transformer to function","D":"These heads are a training error; reinitialize and retrain them"},"correct":"B","explanation":{"correct":"- Uniform attention: if attention logits are all 0 (Q·K^T = 0), softmax outputs 1/T for all positions. 
The output is (1/T)×ΣV_i = mean of all value vectors. This is a valid function (global average pooling) but wastes head capacity.\n- Head pruning research: Michel et al. (2019) \"Are Sixteen Heads Really Better than One?\" showed that in BERT, a large fraction of attention heads can be removed at test time without noticeable performance degradation, and some layers can be reduced to a single head. Many heads are redundant. The few informative heads (capturing specific linguistic relations) carry most of the useful computation.\n- The \"over-smoothing\" connection: uniform attention repeatedly applied over multiple layers can cause representations to converge toward the mean, losing local distinctions (related to the over-smoothing problem in GNNs).","A":"While uniform attention does perform mean pooling (a valid operation), having many uniform heads suggests wasted capacity. The model could achieve the same aggregation with fewer heads, freeing capacity for more useful operations.","B":"","C":"Uniform heads can be pruned. Michel et al. demonstrate this empirically — pruning roughly 20-40% of heads causes no noticeable accuracy loss in their experiments, and accuracy degrades gradually beyond that.","D":"Retraining specific heads without changing the architecture would produce the same degenerate solutions. The fundamental issue is model over-capacity for the task, not a training error."},"reference":"- Michel et al., \"Are Sixteen Heads Really Better than One?\" (2019): https://arxiv.org/abs/1905.10650"},{"section":"deep-learning","topicSlug":"attention-and-transformers-dl","topic":"Attention And Transformers Dl","id":"dl-13015","difficulty":"hard","orderIndex":15,"question":"You implement attention with head_dim=64, num_heads=8, and compute Q, K, V via three separate linear projections. A colleague proposes fusing Q, K, V into a single projection: `QKV = x @ W_QKV` where W_QKV ∈ ℝ^{d_model × 3d_model}. 
What are the computational and practical trade-offs of this fusion?","options":{"A":"Fused QKV projection is mathematically inequivalent to separate projections","B":"Fused QKV projection is mathematically equivalent: W_QKV = [W_Q | W_K | W_V] concatenated along the output dimension. Practical advantages: (1) single larger GEMM (General Matrix Multiply) instead of three smaller ones — GPU is more efficient for larger matrices; (2) single data load of x from memory (read x once instead of three times); (3) one kernel launch instead of three. Trade-offs: less flexibility in applying different initialization or regularization to Q vs K vs V, and the fused output must be split correctly afterwards (which needs extra care when K/V are narrower than Q, or when the weights are sharded across devices). Many modern Transformer implementations (e.g., Megatron-LM) fuse QKV for efficiency","C":"Fused QKV uses 3× more memory because it stores the full 3d_model output","D":"Fused QKV is less accurate because the shared computation introduces correlation between Q, K, and V"},"correct":"B","explanation":{"correct":"- Mathematical equivalence: computing [x @ W_Q, x @ W_K, x @ W_V] is identical to x @ [W_Q | W_K | W_V] (concatenating along output dimension). The linear operations are independent (no weight sharing).\n- GPU efficiency: three separate GEMMs of size (T, d_model) × (d_model, d_k) vs one GEMM of size (T, d_model) × (d_model, 3×d_k), where d_k here denotes the full projection width num_heads × head_dim = d_model. Larger GEMMs achieve better hardware utilization (higher arithmetic intensity, better use of tensor cores). For d_model=512, one GEMM with a (512, 1536) weight is more efficient than three GEMMs with (512, 512) weights.\n- Memory bandwidth: x has shape (T, d_model) = T × 512 values. Reading x once for one GEMM vs three times for three GEMMs = 3× less memory bandwidth for x.","A":"Fused QKV is mathematically equivalent to separate projections. The projection matrices W_Q, W_K, W_V are simply concatenated into W_QKV. 
The outputs are identical.","B":"","C":"The memory for the output is the same: 3 × T × d_k regardless of whether computed separately or fused. The weight matrices: W_QKV (d_model × 3d_model) has the same total parameters as W_Q + W_K + W_V (each d_model × d_k with d_k = num_heads × head_dim = d_model here, so the sum is d_model × 3d_k = d_model × 3d_model).","D":"Q, K, V weights are independent in the fused projection (separate columns of W_QKV). There's no weight sharing or correlation introduced between Q, K, and V computation. The results are identical."},"reference":"- Megatron-LM: https://github.com/NVIDIA/Megatron-LM (fused QKV for efficiency)\n- FlashAttention implementation details: https://github.com/Dao-AILab/flash-attention"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14001","difficulty":"easy","orderIndex":1,"question":"Self-supervised learning (SSL) trains a model on a pretext task without human-labeled data. A team designs a pretext task: predict whether two image patches from the same image are adjacent (positive) or not adjacent (negative). What is this type of pretext task, and what limitation does it have?","options":{"A":"This is a contrastive learning task; it's optimal for all downstream tasks","B":"This is a context prediction pretext task (spatial relationship prediction), a form of SSL that forces the model to understand relative spatial structure. The limitation is \"pretext task bias\": the learned representations are optimized specifically for spatial relationship prediction. 
If the downstream task is semantic classification (is this a cat?), the features optimized for \"which patches are adjacent\" may not align well with \"which image contains a cat.\" The model learns low-level spatial structure but may miss semantic content needed for classification","C":"This is a supervised task because it generates labels (adjacent/not-adjacent)","D":"This pretext task is equivalent to training on ImageNet labels"},"correct":"B","explanation":{"correct":"- Pretext task bias: SSL models learn only as much as needed to solve the pretext task. A spatial adjacency predictor learns to encode spatial layout and local texture continuity — useful for object detection, not necessarily for fine-grained recognition.\n- This is why designing good pretext tasks is critical and why modern SSL methods (SimCLR, DINO) moved away from hand-designed pretext tasks toward invariance-based approaches (learn representations that are invariant to augmentation).\n- The labels (adjacent/not-adjacent) are automatically derived from the image itself without human annotation — this is the \"self-supervised\" in SSL. The learning is supervised in mechanism but self-supervised in label generation.","A":"\"Optimal for all downstream tasks\" is too strong. Pretext task design significantly affects which downstream tasks benefit. Spatial SSL helps detection/segmentation more than classification.","B":"","C":"SSL specifically means labels are automatically generated from the data itself (no human annotation). Adjacent/not-adjacent labels are derived programmatically from the image — this is self-supervised. \"Supervised\" would require human annotators labeling each patch pair.","D":"ImageNet labels (1000 semantic categories with human annotation) encode semantic content. 
Adjacency labels encode spatial relationships — these are completely different."},"reference":"- Doersch et al., \"Unsupervised Visual Representation Learning by Context Prediction\" (2015): https://arxiv.org/abs/1505.05192"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14002","difficulty":"easy","orderIndex":2,"question":"SimCLR's contrastive loss (NT-Xent) is defined for a positive pair (i, j) as: L_{i,j} = -log[exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ)]. The denominator includes all 2N-1 other samples in the batch (the one positive and 2N-2 negatives). What is the role of temperature τ in this loss, and what happens at τ→0 and τ→∞?","options":{"A":"Temperature τ is a scaling constant with no effect on training; it only normalizes the loss value","B":"Temperature τ controls the sharpness of the similarity distribution. At τ→0: the loss becomes near-zero for clearly separated pairs and near-infinity for any misclassified pair — essentially a hard margin loss that only trains on \"confused\" negatives. At τ→∞: the softmax denominator becomes uniform, and the gradient vanishes — the loss becomes insensitive to relative similarities. Optimal τ (typically 0.07-0.5 in practice) provides informative gradients: \"difficult negatives\" (similar but different class) contribute large gradients; \"easy negatives\" (dissimilar) contribute small gradients","C":"Temperature τ controls the batch size; larger τ requires smaller batches","D":"Temperature τ should always be set to 1.0; other values cause training instability"},"correct":"B","explanation":{"correct":"- Gradient analysis: ∂L/∂sim(z_i, z_j) = -(1/τ)(1 - softmax(sim(z_i,z_j)/τ)). For high similarity (positive pair far from negatives), softmax ≈ 1 → gradient ≈ 0 (already learned). 
For low similarity (positive pair confused with negatives), softmax small → gradient large (need to push closer).\n- τ→0: gradient is large only when the positive pair similarity is less than any negative — creates a very hard, sparse learning signal. Risk of gradient explosion for bad initializations.\n- τ→∞: all gradients → 0 (all similarities equally weighted). No meaningful learning.\n- MoCo uses τ=0.07 and MoCo v2 uses τ=0.2; SimCLR's ablations favor τ≈0.1 on ImageNet. The choice significantly affects quality.","A":"Temperature strongly affects training. Chen et al. (SimCLR) performed ablation showing τ significantly affects linear evaluation accuracy. Lower τ in a reasonable range generally improves feature quality.","B":"","C":"Temperature and batch size are independent hyperparameters. Temperature affects the sharpness of the similarity distribution; batch size determines how many negatives are available.","D":"τ=1.0 is one valid choice but not optimal. The SimCLR ablations show a lower temperature such as τ=0.1 outperforming τ=1.0 in linear evaluation by a significant margin."},"reference":"- Chen et al., \"A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)\" (2020): https://arxiv.org/abs/2002.05709"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14003","difficulty":"medium","orderIndex":3,"question":"SimCLR requires very large batch sizes (e.g., 4096-8192) for good performance. MoCo (Momentum Contrast) achieves comparable performance with batch_size=256 by using a queue of negative keys. What fundamental problem does MoCo's queue solve, and what is the role of the momentum encoder?","options":{"A":"MoCo's queue solves GPU memory limitations by storing keys in CPU memory","B":"SimCLR's negatives come only from the current batch — with batch=256, only 2×256-2 = 510 negatives per sample. 
Contrastive learning benefits from many diverse negatives (the denominator's discriminative power increases with more negatives). MoCo's queue maintains a rolling buffer of K=65,536 encoded keys from recent batches, giving each sample 65,536 negatives without increasing batch size. The momentum encoder solves consistency: if the key encoder is updated with large gradient steps per batch, keys in the queue (encoded by different encoder versions) are inconsistent. Momentum update (m=0.999): θ_k ← m×θ_k + (1-m)×θ_q ensures slow, consistent evolution of the key encoder, making all queue entries approximately encoded by the same encoder version","C":"MoCo's queue stores the input images; momentum updates the queue with new images each step","D":"Momentum encoder is used to prevent gradient explosion from training on stale negatives"},"correct":"B","explanation":{"correct":"- SimCLR batch size requirement: with 2N samples and 2N-2 negatives, more negatives → better coverage of the negative space → harder contrastive problem → better features. SimCLR needs large batches because all negatives must be from the current forward pass.\n- MoCo queue: stores encoded keys from recent batches. With K=65,536 queue entries and batch=256, each query is contrasted against 65,536 negatives encoded over the past 65,536/256 = 256 batches.\n- Momentum encoder necessity: if the key encoder (which encoded queue entries) has changed significantly, old entries are inconsistent with current representations. Because the momentum encoder changes very slowly, entries encoded 256 batches ago are still approximately compatible with today's encoder.","A":"MoCo's queue is in GPU memory (as a tensor). The point is not CPU vs GPU storage but having more negatives than fit in a single batch. MoCo v3 removes the queue entirely in favor of large batch contrastive learning.","B":"","C":"MoCo's queue stores encoded key vectors (d-dimensional feature vectors), not raw images. 
Encoding images takes compute — storing pre-computed encodings is the efficiency win.","D":"Gradient explosion is not the primary concern. Momentum encoding is specifically about maintaining consistency of the key representations in the queue — all queue entries should be from \"similar\" encoder versions."},"reference":"- He et al., \"Momentum Contrast for Unsupervised Visual Representation Learning (MoCo)\" (2020): https://arxiv.org/abs/1911.05722"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14004","difficulty":"medium","orderIndex":4,"question":"BYOL (Bootstrap Your Own Latent) achieves state-of-the-art self-supervised performance without negative pairs. Critics initially predicted this would fail due to \"representational collapse\" (the model could trivially minimize loss by mapping all inputs to the same constant vector). BYOL avoids collapse using two components — what are they?","options":{"A":"BYOL uses very large batch sizes to prevent collapse","B":"BYOL uses: (1) an online-target asymmetry: the online network has an extra prediction head (MLP) that the target network doesn't have. The two networks are architecturally different, preventing the trivial constant solution (target can't be reached by a constant representation because the prediction head must transform to match target). (2) Stop-gradient on the target: target network is a momentum-updated copy of the online network (no gradient flows to the target). The target is a \"moving average\" oracle. 
Together, these create an asymmetric optimization that prevents collapse: the online network always chases a moving target that represents a slightly different (momentum-averaged) feature space","C":"BYOL avoids collapse by adding BatchNorm which implicitly creates negative interactions between samples in a batch","D":"BYOL uses random cropping augmentation only; no other architectural tricks are needed"},"correct":"B","explanation":{"correct":"- Representational collapse risk: if both networks mapped every input to the same constant z, the cosine similarity = 1, loss = 0. Perfect training loss with useless representations.\n- BYOL's architectural trick: online network q_θ(z), target network z̄ (no prediction head). Loss: ||q_θ(z) - sg(z̄)||² on L2-normalized vectors. The prediction head q_θ must transform z to match z̄. A constant z wouldn't have a good prediction — q_θ would need to output z̄ which comes from slightly different representations.\n- Grill et al. (2020) BYOL paper; Richemond et al. (2020) \"BYOL works even without batch statistics\" (analyzes BatchNorm's role); Tian et al. (2021) analyze why the prediction head and stop-gradient together prevent collapse.","A":"Batch size is not the primary mechanism. BYOL was shown to be more robust to smaller batch sizes than SimCLR, which relies on 4096-8192 for its best results.","B":"","C":"BatchNorm does implicitly prevent collapse (all-constant output → constant batch statistics → BatchNorm makes this suboptimal). This was debated in the BYOL community. However, BYOL's primary designed mechanism is the asymmetric architecture (prediction head + momentum encoder), not BatchNorm.","D":"Augmentation is necessary but not sufficient to prevent collapse. 
Without the prediction head and momentum encoder, simple augmentation-based SSL collapses to constant representations."},"reference":"- Grill et al., \"Bootstrap Your Own Latent (BYOL)\" (2020): https://arxiv.org/abs/2006.07733"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14005","difficulty":"medium","orderIndex":5,"question":"MAE (Masked Autoencoders) masks 75% of image patches and trains a ViT to reconstruct the masked patches from the remaining 25%. This is more aggressive than BERT's 15% masking. Why does a high masking ratio work better for images than for text, and why doesn't it cause the model to only memorize local statistics?","options":{"A":"Images are lower-dimensional than text, requiring more masking for the same difficulty","B":"Images have much higher spatial redundancy than text. Adjacent image patches are highly correlated (smooth regions, textures). With only 15% masking, the model can interpolate from immediately surrounding patches without understanding global structure. 75% masking removes enough context that reconstruction requires global understanding of object structure — \"filling in a 75% occluded cat requires knowing what a cat looks like, not just what neighboring pixels look like.\" Text has lower redundancy: \"The [MASK] ate the [MASK]\" with 15% masking is already challenging. The reconstruction target (pixel values) is also low-level but the learned representations capture semantics because the model can only succeed by understanding global scene structure","C":"75% masking is used to reduce computation (fewer visible patches = less attention cost)","D":"Higher masking ratios cause overfitting; MAE prevents this with gradient clipping"},"correct":"B","explanation":{"correct":"- Spatial redundancy in images: a patch's pixel values can be predicted from adjacent patches via linear interpolation without any semantic understanding. 
BERT with 15% masking works because text lacks this spatial redundancy — each word carries unique semantic content.\n- MAE's high mask ratio design: He et al. (2021) ablated masking ratios 10%-90% and found 75% optimal. At low ratios, the task is too easy (local interpolation suffices). At very high ratios (90%+), too little context remains and even the decoder can't reconstruct.\n- Computation benefit: the ViT encoder only processes the 25% visible patches. This actually makes MAE faster than processing all patches — a beneficial side effect, not the motivation.","A":"Dimensionality is not the relevant factor. The key property is spatial correlation/redundancy, not dimensionality. A text sequence with 128 tokens is \"lower-dimensional\" than an image in some sense but requires lower masking ratios.","B":"","C":"While MAE's encoder does process fewer patches (25%), the primary motivation is task difficulty calibration, not compute. He et al. explicitly motivated the high masking ratio by the need to force global understanding.","D":"Gradient clipping and overfitting are not related to masking ratio choice. MAE's masking ratio ablation shows a smooth performance curve peaking at 75%, not an overfitting curve."},"reference":"- He et al., \"Masked Autoencoders Are Scalable Vision Learners (MAE)\" (2021): https://arxiv.org/abs/2111.06377"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14006","difficulty":"hard","orderIndex":6,"question":"You apply SimCLR to a medical imaging dataset of 10,000 chest X-rays (unlabeled) and then fine-tune the pretrained model on 100 labeled X-rays for pneumonia detection. Your colleague applies the same pipeline to 1M natural images (ImageNet-scale) before fine-tuning on the same 100 labeled X-rays. Surprisingly, the ImageNet-pretrained model performs similarly to the domain-specific SSL model. 
What explains this, and when would domain-specific SSL clearly win?","options":{"A":"ImageNet always outperforms domain-specific SSL; more data is always better","B":"At 1M vs 10K samples, ImageNet's data volume advantage may compensate for domain mismatch. SimCLR on 10K images may not learn sufficiently diverse representations — contrastive learning benefits greatly from scale (diversity of negatives, augmentation variety). However, domain-specific SSL clearly wins when: (1) the domain has no visual overlap with natural images (e.g., satellite imagery, pathology slides at 40× magnification, time series data); (2) you scale domain-specific data to 100K+ unlabeled examples; (3) the downstream task requires highly domain-specific features (cell nucleus morphology vs ImageNet textures)","C":"ImageNet always loses; domain-specific SSL is always better due to lower distribution shift","D":"Domain-specific SSL is illegal to compare to ImageNet; they must be evaluated on different benchmarks"},"correct":"B","explanation":{"correct":"- Scale-quality trade-off: SimCLR at 10K images (chest X-rays) sees limited diversity. SimCLR draws its negatives from the other images in each batch (it has no momentum queue; that is MoCo), so a 10K-image pool supplies few distinct negatives, and augmentations may not create sufficiently different views. ImageNet at 1M images provides diverse negatives and augmentation variety that produces better feature generalization.\n- When domain wins: Raghu et al. (2019) found that ImageNet pretraining gives surprisingly little benefit on medical imaging tasks, which leaves room for domain-specific pretraining to win once enough domain data (100K+ images) is available. The larger the domain shift, the larger this benefit.\n- Chest X-rays specifically: X-rays have different statistics (grayscale, high-frequency structures, anatomical regularities) from natural images. Domain SSL can learn X-ray-specific features (lung density patterns) that ImageNet SSL misses.","A":"Domain-specific SSL can outperform ImageNet when domain-specific data is abundant. 
\"Always better\" claims in transfer learning are consistently proven wrong across different scales and domains.","B":"","C":"\"Always loses\" is also wrong. At 10K domain images vs 1M natural images, the scale advantage can outweigh domain specificity. Both A and C are too absolute.","D":"Comparing the two approaches on the same downstream task (100 labeled X-rays) is a standard and valid experimental setup. It's a legitimate research question, not a methodological error."},"reference":"- Raghu et al., \"Transfusion: Understanding Transfer Learning for Medical Imaging\" (2019): https://arxiv.org/abs/1902.07208\n- Zhang et al., \"Contrastive Learning of Medical Visual Representations from Paired Images and Text\" (2020)"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14007","difficulty":"hard","orderIndex":7,"question":"DINO (Self-Distillation with No Labels) uses Vision Transformers (ViTs) and a self-distillation approach where the student network is trained to match the teacher's output distribution. DINO applies centering (subtracting a running mean of teacher outputs) and sharpening (low temperature for teacher softmax) to the teacher's outputs. Without centering, what collapse would occur, and why doesn't sharpening alone prevent it?","options":{"A":"Without centering, the model would converge to random representations","B":"Without centering, the teacher's outputs would collapse to a single dominant dimension: all outputs have one probability near 1 and others near 0 (collapse to a single prototype). This happens because softmax with sharpening + no centering amplifies any small imbalance — if one output dimension is slightly larger due to initialization, sharpening makes it dominant, and the student learns to predict this \"always one\" output. 
Sharpening alone doesn't prevent this because it actively amplifies the imbalance: sharper distribution → stronger push toward the dominant dimension → stronger collapse signal. Centering subtracts the running mean of teacher outputs, preventing any single dimension from dominating","C":"Without centering, loss would become NaN in the first training step","D":"Centering is only needed for ViT architectures; CNN-based DINO doesn't need it"},"correct":"B","explanation":{"correct":"- Collapse analysis in DINO: the teacher softmax output p_t = softmax(g_t(x)/τ_t). If τ_t is small (sharpening) and one output dimension consistently has a higher logit h: exp(h/τ_t) >> exp(other/τ_t) → p_t ≈ [0,...,1,...,0].\n- All images output the same one-hot → trivial student loss (student always predicts the same thing) → model outputs useless representations.\n- Centering: g_t ← g_t - center, where center is an exponential moving average (EMA) of teacher outputs. If the teacher starts to output consistently high values at index k, center[k] grows, and subtracting it pulls the distribution back toward uniform.\n- Sharpening prevents collapse to uniform (the opposite direction): it makes the teacher output more peaked, which is good for learning distinctive features, but it only counteracts uniform collapse, not collapse to a single dominant dimension.","A":"\"Random representations\" are not the collapse type. The collapse is to non-random but useless representations — all inputs producing the same output.","B":"","C":"Loss doesn't become NaN immediately. The collapse is gradual — the teacher progressively concentrates on one mode over many training steps.","D":"The collapse mechanism (amplification of dominant dimensions through sharpening) applies to any architecture using softmax-sharpened outputs. 
It's not ViT-specific."},"reference":"- Caron et al., \"Emerging Properties in Self-Supervised Vision Transformers (DINO)\" (2021): https://arxiv.org/abs/2104.14294"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14008","difficulty":"medium","orderIndex":8,"question":"A researcher claims: \"Self-supervised models learn better features than supervised models because they use more data.\" You're asked to evaluate this claim. What is the nuanced truth?","options":{"A":"The claim is correct; SSL always outperforms supervised learning","B":"The claim is partially correct in specific regimes: SSL + large unlabeled data can outperform supervised learning on limited labeled data (few-shot and semi-supervised settings). But supervised models trained on fully labeled large datasets (e.g., full ImageNet with 1.28M labels) still outperform SSL models of the same architecture on classification tasks in most benchmarks, because supervised labels directly optimize for the target metric. SSL's advantage is (1) representation quality with few labels downstream, (2) versatility (same SSL features work for many tasks), and (3) scaling — unlabeled data is far more abundant","C":"Supervised learning always outperforms SSL for all tasks and all data sizes","D":"SSL and supervised models learn identical features; the choice is purely a function of data availability"},"correct":"B","explanation":{"correct":"- Ericsson et al. 
(2021) \"How Well Do Self-Supervised Models Transfer?\": comprehensive comparison showing that SSL features (SimCLR, BYOL, MoCo v2) transfer better across diverse downstream tasks than supervised features, but supervised ImageNet accuracy is still higher.\n- The nuance: SSL features are more general (better on semantic segmentation, object detection, texture recognition) while supervised features are more specialized (better on classification tasks similar to the supervised training task).\n- Scaling law: He et al. (MAE) and Chen et al. (SimCLR v2) show that with very large unlabeled datasets (100M+ images) and fine-tuning on 1% of ImageNet labels, SSL can match or exceed full supervised training. The crossover point depends on scale.","A":"Full supervised ImageNet training still outperforms SSL for ImageNet classification specifically. \"Always outperforms\" is empirically false.","B":"","C":"SSL explicitly outperforms supervised in: few-shot learning (1% ImageNet labels), cross-domain transfer (ImageNet SSL → medical imaging), and multi-task settings. \"Always loses\" is also false.","D":"SSL and supervised features are measurably different. SSL features have more distributed, texture-sensitive representations; supervised features are more compressed and task-specific. Studies probing representations show clear differences."},"reference":"- Ericsson et al., \"How Well Do Self-Supervised Models Transfer?\" (2021): https://arxiv.org/abs/2011.13377"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14009","difficulty":"hard","orderIndex":9,"question":"VICReg (Variance-Invariance-Covariance Regularization) is an SSL method that avoids collapse through explicit regularization terms rather than negative pairs or asymmetric architectures. The three terms are: Invariance (MSE between two views), Variance (per-dimension std ≥ γ), Covariance (off-diagonal covariance terms → 0). 
What specific collapse does each term prevent?","options":{"A":"All three terms prevent the same collapse type; they are redundant","B":"Invariance term: pushes different views of the same image toward the same representation (learn view-invariant features). Without it, the model could learn different representations for different augmentations. Variance term: prevents complete (point) collapse — where the network maps all inputs to the same point (constant representation, std=0 per dimension). Enforcing std ≥ γ per dimension ensures each dimension varies across inputs. Covariance term: prevents informational collapse — where multiple dimensions encode the same feature. Zero off-diagonal covariance forces the representation dimensions to be decorrelated, spreading information across dimensions (similar in spirit to the ICA objective)","C":"VICReg's variance term prevents gradient explosion, not collapse","D":"The covariance term is only used for regularization during training; it's removed at inference"},"correct":"B","explanation":{"correct":"- Complete (point) collapse: if all samples map to the same vector z*, variance per dimension = 0. The variance term penalizes small std per dimension, directly preventing this.\n- Feature redundancy (informational collapse): if dimension 1 and dimension 2 always have the same value (correlation = 1), the representation only has effective dimensionality 1 despite being 2-dimensional. Driving off-diagonal covariance to 0 forces dimensions to encode different aspects of the input.\n- These three failure modes are distinct: point collapse (all samples → same point, caught by variance), dimensional correlation (dimensions encode same info, caught by covariance), augmentation sensitivity (caught by invariance).","A":"Each term addresses a distinct failure mode. Variance prevents point collapse; covariance prevents correlated features; invariance prevents augmentation-sensitive representations. 
Removing any one allows its corresponding collapse.","B":"","C":"Gradient explosion is not related to variance regularization. VICReg's variance term is specifically an explicit constraint on the output distribution (std ≥ γ), not a gradient magnitude control.","D":"The covariance term is a training regularization that shapes the learned representation. Once trained, the model's representations naturally have low covariance (the weights were optimized to achieve this). At inference, no explicit regularization is applied."},"reference":"- Bardes et al., \"VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning\" (2022): https://arxiv.org/abs/2105.04906"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14010","difficulty":"medium","orderIndex":10,"question":"You train SimCLR on a dataset of 100,000 satellite images. The augmentation pipeline includes: random crop, random horizontal flip, color jitter, and Gaussian blur. After fine-tuning on 500 labeled images, accuracy is only 60%. A colleague who trains on medical X-rays with the same pipeline achieves 85% after similar fine-tuning. What is the root cause of the satellite image underperformance?","options":{"A":"Satellite images are too large for SimCLR; reduce resolution to fix","B":"The augmentation pipeline was designed for natural images (ImageNet), where color jitter and horizontal flip are label-preserving (a blue sky remains sky when hue-shifted). 
For satellite images: (1) color information is semantically critical — red vs green vs water vs building have specific spectral signatures; color jitter corrupts the most informative feature; (2) horizontal flip may be label-preserving, but random crop of aerial images might crop away the entire object of interest (a single building might be 5% of the image); (3) Gaussian blur destroys the fine-grained structural features (road patterns, building edges) that distinguish satellite image classes. The augmentation must be designed for the domain's semantic invariances","C":"SimCLR requires batch_size > 10,000 for satellite images specifically","D":"Satellite images have 4 channels (RGBI); SimCLR only supports 3-channel inputs"},"correct":"B","explanation":{"correct":"- Augmentation design principle: contrastive learning assumes augmentations create \"views\" that share the same semantic content but differ in appearance. An augmentation is valid if it's label-preserving. For natural images: color changes don't change \"cat vs dog.\" For satellite images: color IS the semantic content.\n- Domain-specific SSL for satellite imagery: researchers use augmentations like season-change simulation (same location in summer vs winter → different spectral signatures but same land use), multi-temporal views, or multi-spectral band dropout.\n- The medical X-ray success: grayscale X-rays are largely invariant to color jitter (already grayscale or near-grayscale). Horizontal flip is medically controversial (left-right lung anatomy matters) but less catastrophic than color destruction.","A":"Resolution is not the root issue. SimCLR works at various resolutions. The problem is augmentation-semantic alignment, not image size.","B":"","C":"SimCLR has no satellite-image-specific batch size requirement. The batch size argument doesn't explain why satellite images specifically underperform compared to medical images.","D":"SimCLR's projection networks work with any input channels. 
Many satellite image SSL papers use 4-channel (RGBI) or even 13-channel (Sentinel-2) inputs with an appropriately sized input layer."},"reference":"- Manas et al., \"Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data\" (2021): https://arxiv.org/abs/2103.16607"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14011","difficulty":"hard","orderIndex":11,"question":"A text encoder trained with contrastive learning on (image, text) pairs (CLIP-style) shows an unexpected behavior: when asked to classify images, performance drops significantly when category names are changed to synonyms or when rare words are used. What is the root cause?","options":{"A":"CLIP models cannot process synonyms; they use character-level encoding","B":"CLIP's text encoder is trained on image-text pairs where certain descriptions are more common: \"a photo of a dog\" appears more frequently than \"a photo of a canine.\" The model's text representations encode the specific word distributions in the training data. Less common words/phrasings may not be well-represented in the learned text embedding space — they may cluster far from the corresponding image embeddings. The zero-shot classification performance depends critically on the prompt template and word choice matching the training distribution","C":"CLIP models do not support zero-shot classification; only supervised fine-tuning works","D":"The issue is the text tokenizer; rare words are split into subword tokens which confuse the model"},"correct":"B","explanation":{"correct":"- CLIP's training distribution: web-scraped (image, alt-text) pairs. \"Dog,\" \"puppy,\" \"golden retriever\" appear frequently with appropriate images. 
\"Canis lupus familiaris\" or \"canine quadruped\" appear rarely and often without matched images.\n- Prompt engineering (Radford et al., 2021): using \"a photo of {class}\" outperforms \"{class}\" alone. Averaging embeddings of multiple prompts further improves performance. This dependence on prompt wording reveals the model's sensitivity to text distribution.\n- Rare word underperformance is a known limitation: in scientific domains where rare technical terms are used, CLIP's zero-shot performance degrades significantly compared to fine-tuned models.","A":"CLIP uses subword tokenization (BPE), not character-level. It can tokenize synonyms and rare words without difficulty. The issue is learned representation quality, not tokenization capability.","B":"","C":"CLIP's primary use case is zero-shot classification — comparing image embeddings to text embeddings of category names. This is the standard evaluation in the original CLIP paper.","D":"Subword tokenization can process any word. Rare words being split into subword tokens does affect representation quality (fewer training examples for those subword combinations), but this is a secondary effect. The primary issue is training distribution coverage."},"reference":"- Radford et al., \"Learning Transferable Visual Models From Natural Language Supervision (CLIP)\" (2021): https://arxiv.org/abs/2103.00020"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14012","difficulty":"easy","orderIndex":12,"question":"Contrastive learning requires defining \"positive pairs\" and \"negative pairs.\" SimCLR uses two augmented views of the same image as positives and other images in the batch as negatives. 
What is the \"false negative\" problem in contrastive learning?","options":{"A":"False negatives are augmentations that look too similar to the original image","B":"False negatives occur when two different images that belong to the same class (or represent the same concept) are treated as negatives in the contrastive loss. Example: two different photos of the same dog breed are pulled apart as negatives. The contrastive loss will actively push their representations apart even though they should be similar for semantic understanding. This can harm representation quality, particularly when the dataset has many images of the same class. Solutions: class-aware contrastive loss (supervised contrastive learning), momentum queue deduplication, or using very diverse datasets where same-class pairs are rare","C":"False negatives are augmentation pairs where the augmentation removes all useful information","D":"False negatives only occur in text-based contrastive learning, not image-based"},"correct":"B","explanation":{"correct":"- Standard contrastive learning negative sampling: all images in the batch except the current image's augmentations are negatives. With K=256 batch size and a 10-class dataset, roughly 25 other images in the batch have the same class as the current image — these are false negatives.\n- Impact: the loss pushes same-class images apart in feature space, conflicting with the goal of learning semantically meaningful representations. This is why contrastive learning sometimes learns features that separate instances but not classes.\n- Supervised contrastive learning (Khosla et al., 2020): uses label information to identify true negatives (different class) and true positives (same class), avoiding this problem.","A":"Augmentations that look similar to the original are actually good positives — they test whether the model can find invariant features. 
This is not the false negative problem.","B":"","C":"Augmentations that remove useful information would be ineffective views (the model can't learn from them), but this is an augmentation quality problem, not the false negative problem.","D":"False negatives occur in any contrastive learning setting where negatives are not verified to be semantically different. Image-based contrastive learning has this problem extensively."},"reference":"- Khosla et al., \"Supervised Contrastive Learning\" (2020): https://arxiv.org/abs/2004.11362"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14013","difficulty":"medium","orderIndex":13,"question":"MAE (Masked Autoencoder) uses an asymmetric encoder-decoder: the encoder only processes visible (unmasked) patches; the decoder reconstructs both visible and masked patches. Why is this asymmetry important, and what would happen with a symmetric design?","options":{"A":"The asymmetry is a software optimization; symmetric MAE would work equally well","B":"The encoder only processes visible patches (25%), making it computationally efficient. The decoder is lightweight and only used during pre-training. The asymmetry is critical because: (1) Efficiency: the encoder processes 25% of patches vs 100% — 4× FLOP reduction for the expensive ViT encoder; (2) Feature quality: the encoder never sees masked tokens during pre-training — it learns to extract features from limited visible context. At fine-tuning, all patches are visible, so the encoder is now applied to the full image — a setting it can handle but wasn't constrained to during pretraining, which may actually improve generalization. 
A symmetric design (encoder sees all, including masked) would be more expensive and produces worse representations","C":"Asymmetry is needed because the decoder must be larger than the encoder to reconstruct","D":"A symmetric design would produce identical representations; the asymmetry only affects training speed"},"correct":"B","explanation":{"correct":"- He et al.'s key insight: the mask tokens (standing in for the 75% masked patches) should only be processed by the lightweight decoder, not the encoder. The encoder focuses on learning from limited visible patches — creating a harder, more useful pretext task.\n- Encoder efficiency: a ViT-Large with 196 patches, processing only 25% = 49 patches — the expensive self-attention (T²) scales as 49² instead of 196², a 16× reduction.\n- Decoder design: a small Transformer (narrow and shallow) is sufficient for reconstruction given the encoder's rich representations. The decoder is discarded after pre-training.\n- The asymmetry ensures the encoder learns self-sufficient representations (not relying on the decoder to interpret masked regions).","A":"The paper explicitly ablates symmetric vs asymmetric designs. Symmetric MAE (encoder sees all tokens) performs worse on linear evaluation and fine-tuning. The asymmetry is both computationally and qualitatively important.","B":"","C":"MAE's decoder is deliberately SMALLER than the encoder (lighter). A large decoder would add computation without improving encoder representations.","D":"He et al. (2021) show that asymmetric MAE achieves higher accuracy than symmetric designs. 
The representations are demonstrably different and better."},"reference":"- He et al., \"Masked Autoencoders Are Scalable Vision Learners (MAE)\" (2021): Figure 9 (ablation on decoder depth/width)"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14014","difficulty":"hard","orderIndex":14,"question":"You compare two SSL-pretrained models: Model A (SimCLR, 100 epochs, 1M images) and Model B (MAE, 800 epochs, 1M images). For image classification fine-tuning with 100% labels, Model B achieves higher accuracy. For few-shot (1% labels), Model A achieves higher accuracy. Explain why this reversal occurs.","options":{"A":"The reversal is caused by Model B training for 800 epochs vs Model A's 100 epochs","B":"SimCLR's invariance-learning (contrastive) creates representations optimized for semantic consistency across augmentations — ideal for few-shot learning because these representations are already semantically structured and semantically similar images cluster together. MAE's reconstruction objective learns dense, detailed visual features by solving the pixel-level reconstruction task — these features capture more fine-grained visual information, benefiting from many labels to learn appropriate classifier mappings. With 100% labels, MAE's richer features can be fully utilized; with 1% labels, SimCLR's already-semantically-aligned features require less fine-tuning signal to produce good classifiers","C":"The reversal indicates a bug in the evaluation protocol; SSL models should have consistent ordering","D":"The reversal is solely due to ViT vs ResNet architecture (MAE uses ViT; SimCLR uses ResNet)"},"correct":"B","explanation":{"correct":"- Contrastive SSL (SimCLR): the objective explicitly creates compact, semantically clustered representations. The projection head discards high-frequency information. 
The result: a well-organized semantic feature space where k-NN or linear classification works with few examples.\n- Masked Autoencoder (MAE): the objective is pixel-level reconstruction, which preserves fine-grained texture and structural information (needed to reconstruct pixels). These rich features benefit from full fine-tuning but don't naturally cluster semantically.\n- This is a well-documented phenomenon: contrastive methods excel at linear evaluation (probing semantic structure) and few-shot; masked autoencoders excel at full fine-tuning (where dense features can be specialized).","A":"Training duration (100 vs 800 epochs) does contribute, but it doesn't explain the reversal: a longer schedule alone would be expected to shift both results in the same direction, not to flip the ordering between full fine-tuning and few-shot.","B":"","C":"The reversal is a real, documented phenomenon — not a bug. Multiple papers confirm that contrastive and generative SSL methods have different strengths across evaluation protocols.","D":"Architecture (ViT vs ResNet) does affect performance, but masked-image modeling has been adapted to convolutional networks and SimCLR can use ViT. The fundamental difference is contrastive (semantic alignment) vs generative (pixel reconstruction) objectives."},"reference":"- He et al., \"Masked Autoencoders Are Scalable Vision Learners\" (2021): Table 3 comparison with contrastive methods\n- Park et al., \"What Do Self-Supervised Vision Transformers Learn?\" (2022)"},{"section":"deep-learning","topicSlug":"self-supervised-and-contrastive-learning","topic":"Self Supervised And Contrastive Learning","id":"dl-14015","difficulty":"hard","orderIndex":15,"question":"A team uses a two-stage training: (1) SSL pre-training on 10M unlabeled images; (2) supervised fine-tuning on 10K labeled images. They observe that increasing SSL pre-training time from 100 to 1000 epochs improves fine-tuning accuracy from 78% to 82%. However, going from 1000 to 5000 epochs only improves to 82.3%. 
What phenomenon explains this diminishing returns pattern and what limits further SSL improvement?","options":{"A":"SSL pre-training converges after 1000 epochs; more epochs actively hurt performance","B":"SSL representations plateau when the pretext task is \"solved\" — the model has learned all information available from the unlabeled data that is accessible through the SSL objective. Beyond this point, additional epochs may: (1) overfit to dataset-specific statistics rather than general features; (2) cause the representations to become more task-specific to the SSL objective (contrastive invariances) rather than more general; (3) reduce diversity in representations (augmentation choices become overly familiar). The SSL information bottleneck: the unlabeled data has finite information relevant to downstream tasks, and the SSL objective captures a fraction of it — additional epochs don't unlock new information","C":"The improvement plateau is caused by learning rate decay; increase learning rate at epoch 1000","D":"More epochs require more GPU memory, causing the model to automatically reduce capacity"},"correct":"B","explanation":{"correct":"- Information saturation: SSL learns from data-derived signals. After sufficient training, the model extracts all available information that the SSL objective can expose. Contrastive learning learns invariance to augmentation — more epochs refine this invariance but don't add new information types.\n- Over-specialization risk: with very long training, the model may memorize dataset-specific patterns (which augmentation crops most frequently appear together for each image) rather than learning general features.\n- The logarithmic scaling law: progress in SSL roughly follows a log(epochs) curve. First doublings of epochs yield large gains; later doublings yield smaller gains. This is a general pattern in SSL.","A":"SSL pre-training with more epochs rarely \"actively hurts\" in normal ranges. 
The pattern here is diminishing returns (82% → 82.3%), not degradation. The claim of \"actively hurts\" would require fine-tuning accuracy to decrease with more SSL.","B":"","C":"Learning rate decay affects convergence speed but not the ultimate information saturation limit. Increasing LR at epoch 1000 might help convergence speed but wouldn't break the information saturation ceiling.","D":"GPU memory is fixed by hardware, not by training duration. More epochs don't reduce model capacity — the model architecture stays constant throughout pre-training."},"reference":"- Chen et al., \"A Simple Framework for Contrastive Learning (SimCLR)\" (2020): Figure 9 (training epochs vs accuracy)\n- Assran et al., \"Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture\" (I-JEPA, 2023)"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15001","difficulty":"easy","orderIndex":1,"question":"A graph G = (V, E) has 5 nodes and adjacency matrix A. A GCN (Graph Convolutional Network) updates node features as H^{(l+1)} = σ(D^{-1/2} Ã D^{-1/2} H^{(l)} W^{(l)}), where Ã = A + I (self-loops added). What is the role of the D^{-1/2} Ã D^{-1/2} normalization?","options":{"A":"The normalization prevents gradient explosion by capping all values to [-1, 1]","B":"D^{-1/2} Ã D^{-1/2} is symmetric normalization: for node i, it averages incoming neighbor messages weighted by both the sender's degree and receiver's degree. Without normalization, high-degree nodes (many neighbors) would have very large aggregated features (summing many neighbors). 
With normalization: each neighbor j contributes 1/√(d_i × d_j) to node i's update — nodes with many connections contribute proportionally less, preventing high-degree nodes from dominating the representation","C":"The normalization is used to make the matrix invertible for the backward pass","D":"D^{-1/2} ensures the adjacency matrix has eigenvalues exactly in [-1, 1], which prevents vanishing gradients"},"correct":"B","explanation":{"correct":"- Unnormalized aggregation: H' = Ã H W. Row i: Σ_j Ã_{ij} H_j W = Σ_{j∈N(i)∪{i}} h_j W. For a hub node with 100 neighbors: sum of 100 vectors — the scale is 100× that of a leaf node with 1 neighbor.\n- Degree normalization: D^{-1/2} Ã D^{-1/2} entry (i,j) = 1/√(d_i × d_j). This normalizes: for node i, neighbor j's contribution = h_j / √(d_i × d_j). High-degree nodes (high d_i) receive smaller contributions per neighbor; high-degree neighbors (high d_j) contribute less.\n- The result: features are normalized to similar scales regardless of local graph structure. This allows the same weights W to work across different graph structures.","A":"The normalization ensures consistency of scale across nodes — it doesn't cap values to [-1, 1]. Feature values can be any real number after the normalization.","B":"","C":"The normalization is for feature scale stability, not matrix invertibility. The adjacency matrix Ã can be inverted separately; the symmetric normalization is a design choice for message aggregation.","D":"Eigenvalue control is a consequence (spectral GCN motivates this normalization through eigenvalues of the graph Laplacian), but the practical interpretation is degree-based aggregation normalization. 
The eigenvalue interpretation is the spectral theory motivation."},"reference":"- Kipf & Welling, \"Semi-Supervised Classification with Graph Convolutional Networks\" (2016): https://arxiv.org/abs/1609.02907"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15002","difficulty":"easy","orderIndex":2,"question":"The message passing framework for GNNs involves three steps: (1) message computation, (2) aggregation, and (3) update. Why must the aggregation function be permutation-invariant, and what are examples of valid and invalid aggregation functions?","options":{"A":"Permutation invariance is only required for graph classification tasks, not node classification","B":"Aggregation must be permutation-invariant because neighbor order in a graph is undefined. Node i's neighbors are an unordered set {j₁, j₂, ..., jₖ} — there's no canonical ordering. If the aggregation depended on order (e.g., a concatenation of [m_{j₁}, m_{j₂}, ..., m_{jₖ}]), the same graph with neighbors listed in different order would produce different node representations. Valid aggregations: mean (Σmⱼ/k), sum (Σmⱼ), max (elementwise max), min. Invalid: concatenation (requires fixed order), LSTM over neighbors (order-dependent unless using sorted order, which is arbitrary)","C":"Permutation invariance is only needed because of GPU memory constraints","D":"Concatenation is a valid aggregation; the order of neighbors is fixed by node ID"},"correct":"B","explanation":{"correct":"- Graph property: edges encode connections, not orderings. The neighborhood N(i) = {j : (i,j) ∈ E} is a set, not a sequence.\n- Permutation equivariance vs invariance: aggregation must be permutation-invariant (same output for any permutation of neighbors). The overall GNN is permutation-equivariant (permuting input node features permutes output node features consistently).\n- Mean vs max vs sum: each captures different graph properties. 
Sum is used in Graph Isomorphism Network (GIN) because it can distinguish different numbers of identical neighbors (mean cannot). Max captures the most extreme feature in the neighborhood. Mean provides a \"representative neighbor.\"","A":"Permutation invariance is required for all GNN tasks. Even for graph classification, the intermediate node representations must be permutation-invariant. For node classification, the ordering of neighbors affects the node's representation regardless of final task.","B":"","C":"GPU memory doesn't determine permutation invariance. The requirement comes from the mathematical structure of graphs (unordered sets of neighbors), not hardware limitations.","D":"Using node ID to fix neighbor order is an arbitrary, external ordering not encoded in the graph structure. The same graph with relabeled nodes (same structure, different IDs) should produce the same representations — node ID-based ordering violates this."},"reference":"- Xu et al., \"How Powerful are Graph Neural Networks?\" (GIN) (2019): https://arxiv.org/abs/1810.00826"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15003","difficulty":"medium","orderIndex":3,"question":"A GCN is applied to a social network for node classification. After training, you observe that node embeddings for nodes 3 hops apart have become very similar — even nodes from different communities. This is the \"over-smoothing\" problem. What causes it, and why does adding more GCN layers make it worse?","options":{"A":"Over-smoothing is caused by the dropout applied between GCN layers","B":"Each GCN layer averages a node's features with its neighbors'. After k layers, a node's representation is influenced by its k-hop neighborhood. As k increases, the k-hop neighborhood grows exponentially (in non-sparse graphs) and eventually covers most of the graph. 
The averaging makes all node representations converge to a weighted average of all nodes' initial features — proportional to the node's (generalized) degree, which is the same for all nodes with the same degree. More layers → larger neighborhoods → more averaging → more similar representations. Mathematically: repeated application of the normalized Laplacian's diffusion converges to the trivial limit","C":"Over-smoothing is caused by the softmax normalization becoming saturated after multiple layers","D":"Over-smoothing only occurs when graph diameter < number of layers; for small graphs, it doesn't happen"},"correct":"B","explanation":{"correct":"- Information diffusion: GCN propagation is D^{-1/2} Ã D^{-1/2} H W. Ignoring W (consider a linear GCN): H^{(k)} ∝ (D^{-1/2} Ã D^{-1/2})^k H^{(0)}. As k→∞ on a connected graph, this matrix converges to a rank-1 matrix (the outer product of its dominant eigenvector, whose entries scale with √d_i), so all rows become proportional to the same vector. Node representations then differ only by a degree-dependent scale and no longer carry community information.\n- Practical consequence: for node classification where nodes in different communities should have different representations, over-smoothed GCNs cannot distinguish them. This limits most GCNs to 2-3 layers.\n- Mitigation: residual connections (JK-Net: jumping knowledge), normalization (PairNorm), or limiting the receptive field.","A":"Dropout doesn't cause over-smoothing. Dropout randomly disables neurons, which can actually prevent over-smoothing by creating diverse stochastic sub-representations.","B":"","C":"Softmax is typically not a component of GCN message aggregation layers (only at the output classification). The aggregation uses mean/sum, not softmax.","D":"Over-smoothing is particularly problematic when the number of layers exceeds the graph diameter. 
For a graph with diameter 3 (all pairs within 3 hops), a 6-layer GCN would \"mix\" information beyond the diameter, causing over-smoothing even in small graphs."},"reference":"- Li et al., \"Deeper Insights into Graph Convolutional Networks for Semi-Supervised Classification\" (2018): https://arxiv.org/abs/1801.07606\n- Xu et al., \"Representation Learning on Graphs with Jumping Knowledge Networks\" (2018): https://arxiv.org/abs/1806.03536"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15004","difficulty":"medium","orderIndex":4,"question":"Graph Attention Networks (GAT) compute attention coefficients for each edge (i,j): α_{ij} = softmax_j(LeakyReLU(a^T [W h_i || W h_j])). Compare this to GCN's fixed degree normalization. What does GAT's learned attention provide that GCN cannot, and what is its computational cost?","options":{"A":"GAT and GCN produce identical results; attention only changes training speed","B":"GAT allows node i to assign different weights to different neighbors based on their feature content. GCN's normalization (1/√(d_i × d_j)) depends only on degree (graph structure), not feature content. GAT: \"neighbor j is relevant to node i if their features are related\" — learned from data. This allows task-specific neighbor weighting: for sentiment classification in a social graph, nearby users with similar political views (feature-based) might be more influential than structurally close but semantically distant neighbors. Cost: O(|E| × d) attention coefficient computation for each head vs O(|E|) for GCN — proportional overhead, typically 4-8× more expensive","C":"GAT can only be used for graph classification; GCN is required for node classification","D":"GAT's attention reduces memory usage because it ignores low-weight neighbors"},"correct":"B","explanation":{"correct":"- GCN's limitation: the normalization 1/√(d_i × d_j) is determined solely by node degrees — a structural property. 
All neighbors contribute equally (after degree adjustment) regardless of their features' relevance.\n- GAT attention: a^T [W h_i || W h_j] computes a scalar for each (i,j) pair based on both nodes' transformed features. Softmax over j normalizes to produce edge weights. The attention is feature-dependent and learned for the specific task.\n- Practical advantage: for citation networks where not all papers cite equally relevant works, GAT can focus on the most semantically related neighbors. Ablations in the original GAT paper show significant improvement over GCN on Cora and Citeseer.","A":"GAT and GCN produce different outputs because GAT uses feature-based attention vs GCN's degree-based normalization. Multiple papers show GAT outperforms GCN on several benchmarks.","B":"","C":"Both GAT and GCN support node classification and graph classification. GAT was originally applied to node classification in the paper. The claim is factually incorrect.","D":"GAT doesn't \"ignore\" low-weight neighbors — it assigns them small but non-zero attention weights. All neighbors are included in the aggregation; only their weights change. Memory usage is O(|E| × d), similar to GCN."},"reference":"- Veličković et al., \"Graph Attention Networks\" (2018): https://arxiv.org/abs/1710.10903"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15005","difficulty":"medium","orderIndex":5,"question":"GraphSAGE (Hamilton et al., 2017) uses neighborhood sampling: instead of using all neighbors, it samples a fixed k neighbors per node per layer. For a node with 1000 neighbors, why is this critical for scaling, and what is the trade-off?","options":{"A":"Sampling is needed because GNNs cannot process more than 100 neighbors","B":"Full neighborhood aggregation creates exponential graph expansion: a 3-layer GNN with full aggregation for a node with avg degree 20 needs 1 + 20 + 20² + 20³ = 8,421 nodes per computation tree. 
For 1,000-neighbor nodes: computation becomes intractable. GraphSAGE samples k₁=25 neighbors for layer 1, k₂=10 for layer 2: fixed k₁×k₂=250 nodes per sample. Memory is O(k₁×k₂×...×batch_size). Trade-off: with sampling, some neighbors are ignored in each forward pass. This introduces variance in the gradient — different samples in different batches produce different gradients. But it enables mini-batch training on graphs with millions of nodes","C":"Sampling reduces memory for storing neighbor feature vectors, but graph structure is still fully used","D":"GraphSAGE sampling is only used at inference; training still uses full neighborhoods"},"correct":"B","explanation":{"correct":"- Neighborhood explosion: in deep GNNs, the computation tree grows exponentially. With full aggregation, an L-layer GNN on a graph with average degree d needs O(d^L) nodes per sample. For a social network with average degree 200 and L=3: 8M nodes per training example — batching becomes impossible.\n- Mini-batch training with GraphSAGE: fix the computation tree size per sample. For each training node, sample exactly k₁ neighbors (layer 1), and for each of those, sample k₂ neighbors (layer 2). Total computation: batch × k₁ × k₂ = fixed budget regardless of graph size.\n- Sampling variance: \"neighbor sampling\" adds noise to the gradient but allows unbiased estimation of the aggregate (the sampled mean is an unbiased estimate of the full-neighborhood mean). PinSage (Pinterest's GraphSAGE deployment) scaled to 3B nodes using this approach.","A":"GNNs have no hard constraint on neighbor count — they can process any number. The issue is computational scalability (exponential growth), not a hard architectural limit.","B":"","C":"GraphSAGE sampling reduces computation and memory by limiting which neighbors are processed. The graph structure IS modified in the sense that unsampled edges are ignored in each pass.","D":"GraphSAGE sampling is used during both training and inference. 
At inference, the same sampling (or full aggregation if feasible) is used to generate node embeddings."},"reference":"- Hamilton et al., \"Inductive Representation Learning on Large Graphs (GraphSAGE)\" (2017): https://arxiv.org/abs/1706.02216\n- Ying et al., \"Graph Convolutional Neural Networks for Web-Scale Recommender Systems (PinSage)\" (2018): https://arxiv.org/abs/1806.01973"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15006","difficulty":"hard","orderIndex":6,"question":"You apply a GCN to a graph with 3 nodes: A-B-C (path graph). After 2 layers, node A's representation is influenced by nodes A, B, C. If you apply a 4th layer, node A's representation at layer 4 is still influenced by A, B, C (same set — the graph has diameter 2). What does this tell you about GNN depth beyond graph diameter?","options":{"A":"More layers always improve GNN performance by refining representations","B":"Beyond graph diameter, additional GNN layers don't expand the receptive field (every node already receives all other nodes' information at layer k = diameter). Extra layers: (1) simply re-aggregate already-aggregated information — adding non-linearity and transformation without new structural information; (2) increase the risk of over-smoothing (representations converge toward similar values); (3) add computational cost without structural benefit. The effective depth for capturing structural information is bounded by the graph diameter. Going beyond: useful only if the non-linear transformations W, σ add task-relevant function composition beyond structural aggregation","C":"Layers 3 and 4 provide gradient shortcuts that improve training stability","D":"After reaching graph diameter, the model automatically switches to fully connected processing"},"correct":"B","explanation":{"correct":"- Receptive field ceiling: for a graph with diameter D, all nodes are within D hops of each other. 
A GNN with L ≥ D layers has full-graph receptive field from step D — adding more layers doesn't add new neighbors.\n- The question is then: does more function composition (more W, σ layers) help? Sometimes yes — deeper function approximation can learn more complex mappings. But the risk of over-smoothing increases.\n- Empirical finding: most GNN papers use 2-3 layers. Deeper GNNs (without special design) often underperform due to over-smoothing. Techniques like residual connections in GCNII (Chen et al., 2020) enable deeper GNNs.","A":"More layers don't always improve GNN performance. For most node classification benchmarks, 2-3 layer GCNs outperform deeper variants due to over-smoothing. The \"always improve\" claim is empirically false.","B":"","C":"Layer 3+ in a GNN don't add gradient shortcuts (those would require residual connections). Without residuals, deeper layers add gradient path length, increasing vanishing gradient risk.","D":"GNNs don't \"switch to fully connected processing.\" The architecture is fixed regardless of graph diameter. After reaching the diameter depth, the same message passing continues (aggregating from the full receptive field)."},"reference":"- Chen et al., \"Simple and Deep Graph Convolutional Networks (GCNII)\" (2020): https://arxiv.org/abs/2007.02133"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15007","difficulty":"hard","orderIndex":7,"question":"The Weisfeiler-Lehman (WL) graph isomorphism test is the theoretical upper bound on GNN expressive power. The GIN (Graph Isomorphism Network) is designed to match this bound. 
What specific GNN design choices make GIN as powerful as the WL test, and what graphs can neither WL nor GIN distinguish?","options":{"A":"GIN uses attention to achieve WL-level expressiveness","B":"GIN's key design choices: (1) SUM aggregation instead of MEAN or MAX — sum can distinguish {1,2} from {1,1,2} (sum=3 vs 4); mean gives 1.5 vs 1.33 (different), but max gives 2 vs 2 (same). Only sum uniquely maps multisets to a value. (2) MLP instead of linear layer — the MLP can approximate any injective function on the multiset histogram. Together: h_v^{(k)} = MLP^{(k)}((1+ε) × h_v^{(k-1)} + Σ_{u∈N(v)} h_u^{(k-1)}). Graphs WL cannot distinguish: regular graphs where all nodes have same degree and k-hop neighborhoods. Any two r-regular graphs on n nodes cannot be distinguished by WL or GIN, requiring higher-order GNNs","C":"GIN uses global pooling after each layer to capture graph-level features for WL-equivalent power","D":"WL test and GIN have identical computational complexity; any GNN matches WL power"},"correct":"B","explanation":{"correct":"- WL test: iteratively assigns colors (hashes) to nodes based on their neighborhood multisets. Two graphs are non-isomorphic if their color histograms differ. GIN's SUM + MLP replicates this: the injective MLP maps multisets to unique representations.\n- MEAN and MAX fail WL-level: mean({1,1}) = mean({1}) = 1 (can't distinguish); max({1,1}) = max({1}) = 1. Sum: sum({1,1}) = 2 ≠ sum({1}) = 1. Sum is the only simple aggregation that distinguishes multiset cardinality.\n- WL limitation: two non-isomorphic regular graphs with identical k-hop structure are indistinguishable. 3D GNNs (using node coordinates) or higher-order WL tests can distinguish these but have higher computational cost.","A":"GAT uses attention (feature-based weights). Attention doesn't address the aggregation function's expressive power problem (distinguishing multisets). 
GAT with mean aggregation is not WL-equivalent.","B":"","C":"Global pooling is used for graph classification, not node classification. GIN's WL-equivalent power comes from the node-level aggregation, not global pooling. Global pooling is applied after GIN layers for graph-level tasks.","D":"\"Any GNN matches WL power\" is false — this is the central contribution of Xu et al. (2019). Most GNNs (using MEAN or MAX aggregation) are strictly less powerful than WL. GIN with SUM + MLP is the specific design that achieves WL-level power."},"reference":"- Xu et al., \"How Powerful are Graph Neural Networks?\" (2019): https://arxiv.org/abs/1810.00826"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15008","difficulty":"medium","orderIndex":8,"question":"You apply a GNN for drug-target interaction prediction. This is a link prediction task on a bipartite graph (drugs on one side, proteins on the other). The GNN is used to predict whether an edge (drug, protein) exists. Compare the appropriate GNN formulation vs node classification GNN — what changes?","options":{"A":"Link prediction uses the same GNN as node classification; no changes are needed","B":"Link prediction uses node embeddings as a substrate, then computes edge scores. The GNN learns node representations h_drug and h_protein. Link prediction score: σ(h_drug^T h_protein) or MLP(concat(h_drug, h_protein)). 
Key differences: (1) Loss is applied to (node_pair, label) tuples instead of (node, label); (2) Negative sampling is critical — real drug-protein pairs are positive; random drug-protein pairs are negative (many non-edges exist); (3) For inductive link prediction (predict edges for unseen drugs/proteins), GraphSAGE-style encoders are needed instead of transductive GCNs that require all nodes at training time","C":"Link prediction requires separate GNNs for drug nodes and protein nodes that are combined by attention","D":"Bipartite graphs cannot be processed by GNNs; use MLP with node features only"},"correct":"B","explanation":{"correct":"- GNN for node classification: loss = CE(h_v, label_v) for each node v. The GNN is trained to produce a good node representation for the classification task.\n- GNN for link prediction: the GNN generates node embeddings, then a \"decoder\" (dot product, MLP, or element-wise product) scores each potential edge. Loss: binary CE(score(i,j), 1) for real edges; binary CE(score(i',j'), 0) for sampled negative pairs.\n- Bipartite graph GNN: handle the two node types separately. Drug nodes aggregate from protein neighbors; protein nodes aggregate from drug neighbors. Alternating 2-layer propagation is common.","A":"Link prediction and node classification have different loss functions and output heads. While the GNN encoder is similar, the task setup (what the GNN optimizes for) is fundamentally different.","B":"","C":"Separate GNNs with attention is one valid approach (multi-view learning), but it's not required. A single unified GNN that updates both drug and protein representations simultaneously is standard and simpler.","D":"GNNs are explicitly designed for graph-structured data and have been applied to bipartite graphs extensively (drug-protein interaction, user-movie recommendation). 
GraphSAGE, GCN, and GAT all support bipartite graphs."},"reference":"- Hamilton et al., \"Embedding Methods for Link Prediction\" (2020 survey)\n- Lim et al., \"Drug-Target Interaction Prediction using GNNs\" (various 2020-2022 papers)"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15009","difficulty":"hard","orderIndex":9,"question":"You train a GNN for graph classification on molecular property prediction. Two molecules have identical atom types and bond types in different arrangements. A standard GCN with mean aggregation labels them as the same class. Why, and how does GIN fix this?","options":{"A":"GCN labels them the same because it ignores bond types","B":"Mean aggregation loses count information: if molecule A has two carbon atoms in the benzene ring and molecule B has one carbon atom, mean({C,C}) = mean({C}) — both give the same average. Standard GCN with mean aggregation cannot distinguish graphs where neighborhoods have different multiplicity of the same atom type. GIN uses SUM: sum({C,C,N}) ≠ sum({C,N}) — captures that the first structure has two carbons where the second has one. The MLP then maps these different sums to different representations. For molecular property prediction where the exact count of specific atoms in a neighborhood matters (e.g., degree of saturation), sum aggregation is critical","C":"GCN labels them the same because molecular graphs are always isomorphic","D":"The fix is to use edge features (bond types) instead of changing aggregation"},"correct":"B","explanation":{"correct":"- Multiset problem: mean aggregation is blind to multiplicity. With C=1, N=0: mean({1,1,0,0}) = mean({1,0}) = 0.5, so {C, C, N, N} and {C, N} collapse to the same value, while sum({1,1,0,0}) = 2 ≠ sum({1,0}) = 1. The simplest case is the one in the option: mean({C,C}) = mean({C}), but sum({C,C}) = 2 × sum({C}). 
Mean only separates multisets whose element proportions differ (mean({1,1,0}) = 0.67 vs mean({1,0}) = 0.5), so it captures the distribution of neighbor types but not their counts.\n- The deeper issue: with one-hot atom features, the sum over a neighborhood is a count vector (how many carbons, how many nitrogens), so two neighborhoods with different carbon counts always produce different sums. Mean divides that count vector by the neighborhood size and discards the size.\n- GIN's design ensures the representation function is injective on multisets — same multiset gives same representation, different multisets give different representations.","A":"GCN can incorporate bond types as edge features (a valid extension). But the fundamental aggregation problem (mean vs sum for multisets) is separate from edge feature usage. Using edge features with mean aggregation still has the multiset distinguishability problem.","B":"","C":"If two molecules have the same atom/bond types but different arrangements, they are non-isomorphic (different molecular graphs). The GCN's failure is due to the aggregation function, not graph isomorphism.","D":"Edge features help represent bond information but don't address the multiset cardinality problem in aggregation. The aggregation fix (sum) is still needed even with edge features."},"reference":"- Xu et al., \"How Powerful are Graph Neural Networks?\" (2019): Section 3 (GIN design)"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15010","difficulty":"medium","orderIndex":10,"question":"A recommendation system uses a bipartite user-item graph. You train a GNN (LightGCN) that propagates user preferences to item nodes and item characteristics to user nodes. During training, you mask 10% of positive user-item edges as validation set. At inference, for a new user with only 2 interactions, the GNN produces poor recommendations. 
What is the root cause?","options":{"A":"LightGCN requires at least 100 interactions per user; filter out low-interaction users","B":"The cold start problem: a new user with 2 interactions has a k-hop neighborhood containing only 2 items and their shared users. The GNN aggregates: user's embedding ← average of 2 item embeddings. This provides very limited information for learning user preferences. The 2 items may not represent the user's diverse interests. The GNN is designed for users with enough interaction history to form a meaningful local graph structure. Fixes: (1) hybrid approach combining GNN with content-based features; (2) meta-learning (MAML-style) for few-interaction users; (3) separate cold-start module that uses side information (demographics, item content)","C":"The issue is the masking during training — use all edges for training to fix cold start","D":"Cold start only occurs for new items, not new users; the model should work for any user"},"correct":"B","explanation":{"correct":"- LightGCN propagation: e_u^{(k)} = Σ_{i∈N(u)} e_i^{(k-1)} / (√|N(u)| × √|N(i)|). For a user with 2 interactions, e_u^{(1)} is a degree-weighted combination of just two item embeddings, e_{item1} and e_{item2}. This single vector must represent all preferences.\n- Compare to power users with 200 interactions: e_u^{(1)} is an average of 200 diverse items, capturing broad preferences. Layer 2 brings in items interacted with by users similar to our user.\n- The neighborhood structure for 2-interaction users is too sparse for meaningful aggregation. The GNN has limited information to learn preferences from.","A":"There's no hard 100-interaction threshold in LightGCN. The issue is gradual degradation with fewer interactions, not a cliff at a specific count. Filtering users is an extreme solution that eliminates cold-start users entirely.","B":"","C":"Including validation edges in training would cause data leakage (testing on edges the model was trained on). 
This doesn't fix cold start — a new user at inference time still has only 2 interactions regardless of training strategy.","D":"Cold start affects both new users (few interactions) and new items (few ratings). New items have the same problem: a new item with 2 ratings has a sparse neighborhood and is poorly represented."},"reference":"- He et al., \"LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation\" (2020): https://arxiv.org/abs/2002.02126"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15011","difficulty":"hard","orderIndex":11,"question":"Heterogeneous graphs have multiple node types and edge types (e.g., user→reviews→product, user→friends→user). A standard GNN treats all edges equally. What is the specific problem this causes for message passing, and how does HAN (Heterogeneous Attention Network) or RGCN address it?","options":{"A":"Standard GNNs cannot process heterogeneous graphs at all due to different feature dimensions","B":"Different edge types have different semantic meaning: a \"user-reviews-product\" edge carries different information than a \"user-friends-user\" edge. Aggregating both equally conflates semantically different information. Two nodes may be structurally close through \"friends\" edges but semantically unrelated; aggregating friends as if they were product reviews would corrupt the representation. RGCN: separate weight matrix W_r per relation type r: h_v = Σ_r Σ_{u∈N_r(v)} W_r h_u / c_{v,r}. Each relation has its own transformation. 
HAN: uses meta-path-based attention, where meta-paths (user→product→user) create homogeneous subgraphs aggregated with learned attention weights per meta-path type","C":"Heterogeneous graphs require GNNs to be retrained for each edge type independently","D":"Heterogeneous graphs can be made homogeneous by concatenating edge type as a node feature; no architectural change needed"},"correct":"B","explanation":{"correct":"- Semantic mismatch: h_v^{friend-path} encodes social similarity; h_v^{review-path} encodes product preference. Averaging these with the same weight W would produce a representation that mixes two completely different types of relationships.\n- RGCN (Schlichtkrull et al., 2018): each relation type r has its own weight matrix W_r ∈ ℝ^{d×d}. This lets the model learn how to process friend messages differently from review messages. For graphs with many relation types, basis decomposition reduces parameters: W_r = Σ_b a_{rb} V_b.\n- HAN (Wang et al., 2019): meta-path-based approaches aggregate along specific semantic paths (user-buys-product-buys-user: other users who bought the same products), then use attention to weight different meta-paths.","A":"Standard GNNs can process heterogeneous graphs with unified feature spaces — the issue is semantic conflation, not inability. With a feature projection, nodes of different types can be mapped to a common space.","B":"","C":"Training separate GNNs per edge type would produce disconnected representations that can't interact. RGCN integrates all relations in a unified model with relation-specific parameters.","D":"Edge type as a node feature is one approach (edge-conditioned convolutions), but it doesn't address the aggregation problem. 
A node aggregating from 100 friends and 100 product reviews would still mix them equally unless the aggregation is modified."},"reference":"- Schlichtkrull et al., \"Modeling Relational Data with Graph Convolutional Networks (RGCN)\" (2018): https://arxiv.org/abs/1703.06103"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15012","difficulty":"medium","orderIndex":12,"question":"You train a GCN for fraud detection on a financial transaction graph. Fraudulent transactions (1% of edges) are rare. After training, the model achieves 99% accuracy but detects only 5% of actual fraud cases. What is happening and what specifically should be changed about the GNN training?","options":{"A":"GCNs cannot detect fraud; use a CNN instead","B":"The 99% accuracy with 5% fraud recall indicates severe class imbalance exploitation: the model predicts \"not fraud\" for every node (or almost every node), achieving 99% accuracy because 99% of transactions are legitimate. The GNN is optimizing the wrong objective (accuracy on imbalanced data). Fixes: (1) use weighted cross-entropy loss or focal loss that amplifies loss for rare positive (fraud) class; (2) oversample fraud examples in mini-batches; (3) use metrics that account for imbalance (F1, AUROC, Precision-Recall AUC); (4) graph-specific: ensure fraud nodes are well-represented in each mini-batch's computation graph by oversampling their neighbors","C":"The model needs more GNN layers to capture long-range fraud patterns","D":"The 99% accuracy is correct and the 5% fraud recall is acceptable given the class ratio"},"correct":"B","explanation":{"correct":"- Accuracy paradox: with 1% fraud, a model predicting \"not fraud\" for everything achieves 99% accuracy. This is not useful. The model has essentially learned to predict the majority class.\n- Class imbalance in graphs: standard mini-batch GNN training samples nodes uniformly, so fraud nodes (1%) appear rarely. 
The model sees roughly 100× more non-fraud examples and optimizes to predict non-fraud.\n- Focal loss: L = -(1-p_t)^γ × log(p_t). The (1-p_t)^γ factor down-weights easy examples (correctly classified non-fraud with high confidence) and focuses training on hard examples (fraud cases). It was originally proposed for dense object detection (Lin et al., 2017) and applies to any heavily imbalanced classification task.","A":"GCNs are used in production fraud detection (e.g., at Alibaba: GBDT-GNN, at PayPal). The issue is training configuration, not architecture.","B":"","C":"More layers might help capture fraud ring patterns (connected fraud nodes), but the primary issue is class imbalance. Fixing the imbalance problem would yield immediate improvement; more layers might provide incremental gains.","D":"5% fraud recall (missing 95% of actual fraud) is a critical failure in fraud detection. Real fraud detection systems target >80% recall with acceptable precision. The 99% accuracy metric is meaningless here."},"reference":"- Lin et al., \"Focal Loss for Dense Object Detection\" (2017): https://arxiv.org/abs/1708.02002\n- Wen et al., \"Towards Consumer Loan Fraud Detection: Graph Neural Networks with Role-Based Features\" (various GNN fraud papers)"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15013","difficulty":"easy","orderIndex":13,"question":"Node classification, link prediction, and graph classification are three main GNN tasks. For a drug discovery application (predict which molecules have desired properties), which task is appropriate and what does the model output?","options":{"A":"Node classification — predict property for each atom in the molecule","B":"Graph classification — the entire molecule is the input graph (atoms as nodes, bonds as edges). The model produces a single embedding for the whole graph via a graph-level readout (global mean/sum/max pooling over all node embeddings) and predicts the molecular property (e.g., toxicity, solubility) from this embedding. 
Each molecule is one graph; the label is the molecular property","C":"Link prediction — predict whether two atoms would form a new bond","D":"Node classification is required because molecular properties are atom-level phenomena"},"correct":"B","explanation":{"correct":"- Graph classification setup: input = graph G = (V, E) with atom features on nodes and bond features on edges. GNN produces node embeddings h_v after k layers. Graph-level readout: h_G = READOUT({h_v : v ∈ V}). The READOUT (sum, mean, or attention-based) aggregates all node embeddings into a fixed-size graph vector. Final prediction: ŷ = MLP(h_G).\n- Task suitability: molecular properties (toxicity, solubility, bioactivity) are global properties of the whole molecule, not properties of individual atoms or atom pairs. Graph classification produces a single prediction for the whole graph.\n- Link prediction would be appropriate for: predicting new chemical bonds (bond formation prediction), protein-protein interaction prediction.","A":"Atom-level classification would predict properties for each atom (e.g., NMR shift of each carbon). Molecular toxicity/solubility is a whole-molecule property, not an atom-level property.","B":"","C":"Link prediction predicts whether an edge (bond) exists or will form. Molecular property prediction doesn't require predicting new bonds — the molecular structure is given; the task is to predict the whole-molecule property.","D":"While some molecular properties have atom-level explanations (reactivity centers), the prediction task for drug discovery is typically molecule-level. 
Atom-level classification would predict per-atom properties, not the molecule-level drug property."},"reference":"- Gilmer et al., \"Neural Message Passing for Quantum Chemistry (MPNN)\" (2017): https://arxiv.org/abs/1704.01212"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15014","difficulty":"hard","orderIndex":14,"question":"You compare two GNN training paradigms: transductive (entire graph is visible during training, test nodes are unknown but present in the graph) and inductive (test graphs are completely unseen during training). For a fraud detection system on a bank's transaction graph, which paradigm applies, and what architectural constraint does it impose?","options":{"A":"Transductive learning always applies for graphs; inductive is only for images","B":"Fraud detection on evolving transaction graphs requires inductive learning: new customers, new merchants, and new transactions appear daily — completely unseen nodes must be classified at inference. Inductive GNNs (GraphSAGE, GAT with node features) learn a generalizable aggregation function that can be applied to any graph structure. Transductive GNNs (vanilla GCN) learn node embeddings directly — these embeddings are node-specific and cannot be applied to new nodes not present during training. The architectural constraint: inductive GNNs cannot use node IDs as features (ID-based embeddings don't generalize) and must learn from structural and feature-based aggregation","C":"Fraud detection uses transductive learning because the test graph is a subset of the training graph","D":"Inductive learning is only possible with graph classification, not node classification"},"correct":"B","explanation":{"correct":"- Transductive GCN: learns representations for fixed nodes V at training time. 
For a new node v not in V: no representation exists without retraining.\n- Inductive GNN (GraphSAGE): learns aggregation function f_SAGE(h_v, {h_u : u∈N(v)}) that can generate representations for any node given its features and neighborhood. New customer → sample 25 neighbors from their transaction history → apply f_SAGE → get embedding.\n- ID feature prohibition: if node IDs are used as input features (e.g., one-hot encoding of node index), new nodes have IDs that were never seen during training. The model can't generate representations for them.","A":"Inductive learning is critical for many graph applications. Production recommender systems (PinSage), fraud detection, and drug discovery (predicting on new molecules) all require inductive capability.","B":"","C":"New customers and merchants are not a \"subset of the training graph\" — they are new nodes. A growing transaction graph continuously adds new nodes, requiring inductive inference.","D":"Inductive learning is the standard for node classification in production systems. Graph classification is inherently inductive (each new graph is a test \"node\"). Inductive node classification is just as natural."},"reference":"- Hamilton et al., \"Inductive Representation Learning on Large Graphs (GraphSAGE)\" (2017): Section 4 (Inductive vs Transductive)"},{"section":"deep-learning","topicSlug":"graph-neural-networks","topic":"Graph Neural Networks","id":"dl-15015","difficulty":"hard","orderIndex":15,"question":"A knowledge graph embedding task uses a GNN to predict missing triples (head, relation, tail). You compare TransE (geometric embedding) with an RGCN + decoder. Your RGCN achieves lower MRR (Mean Reciprocal Rank) than TransE on a benchmark. 
A colleague says \"GNNs are always better than geometric methods for KG completion.\" What fundamental limitation of GNNs explains RGCN's underperformance?","options":{"A":"GNNs require more training data; TransE works with smaller datasets","B":"For knowledge graph completion, GNNs face the \"entity symmetry\" problem: GCN aggregates 1-hop neighbors. If two entities have the same set of relational neighbors (e.g., two cities \"located-in\" the same country and \"has-airport\"), their RGCN representations become identical after aggregation — the GNN cannot distinguish them. TransE models individual entity-relation translations: entity A positioned at e_A such that e_A + r_relation ≈ e_B for each fact (A, r, B). Each entity has its own embedding vector, capturing its unique role across all relations. RGCN conflates entities that share the same relational neighborhood structure, missing fine-grained entity-specific information","C":"TransE is always better than GNNs for all graph tasks; GNNs are overhyped","D":"RGCN needs more layers to match TransE; add 10 layers to fix the MRR"},"correct":"B","explanation":{"correct":"- Entity symmetry in RGCN: consider two cities, Paris and London, both \"located-in\" Europe and \"has-airport\" → True. Their 1-hop neighborhoods are structurally identical. RGCN produces the same embedding. But they're different entities with different properties.\n- TransE's entity-specific embeddings: each entity e ∈ ℝ^d is learned independently. Paris and London have different vectors, even if their local relational structure overlaps.\n- This is a manifestation of the WL test limitation: the WL test (and GNNs by extension) cannot distinguish nodes with identical neighborhood structures. For dense KGs where many entities have similar relational patterns, this is a critical limitation.","A":"Training data size is a factor but not the fundamental explanation. RGCN can underperform on large KGs where entities have similar structures. 
TransE scales well with data.","B":"","C":"GNNs outperform TransE on some KG tasks and datasets, particularly when multi-hop reasoning or structural context is important. \"Always better\" or \"always worse\" claims are both incorrect.","D":"Adding more RGCN layers doesn't solve the entity symmetry problem — it would cause over-smoothing, making distinct entities even more similar. The root cause is the aggregation-based representation, not depth."},"reference":"- Bordes et al., \"Translating Embeddings for Modeling Multi-relational Data (TransE)\" (2013): https://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data\n- Zhang & Chen, \"Link Prediction Based on Graph Neural Networks\" (2018): discusses GNN limitations for KG"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16001","difficulty":"easy","orderIndex":1,"question":"A team fine-tunes a ResNet-50 pretrained on ImageNet for classifying satellite images. They use feature extraction (freeze all layers, train only the final classifier). After 20 epochs, validation accuracy plateaus at 62%. The same team fine-tunes all layers and achieves 84%. What does this reveal about the feature extraction vs fine-tuning decision?","options":{"A":"Feature extraction should always be used; the 62% result is a bug in the training pipeline","B":"Feature extraction assumes ImageNet features transfer well to the target domain. Satellite images (top-down view, different color distribution, no common object classes with ImageNet) differ significantly from ImageNet (natural photography). The deep convolutional layers of ResNet-50, optimized for natural images, produce features that are poorly aligned with satellite image structure. Fine-tuning all layers allows these task-specific features to be learned. 
Feature extraction is appropriate when: (1) target domain is similar to source; (2) target dataset is small (fine-tuning would overfit); fine-tuning is appropriate when: (1) sufficient target data exists; (2) domains differ significantly","C":"Feature extraction is better for large datasets; fine-tuning is for small datasets only","D":"The difference is due to the learning rate; using a lower LR in feature extraction would match full fine-tuning"},"correct":"B","explanation":{"correct":"- Domain distance: ImageNet contains natural photos (animals, objects, scenes). Satellite images have: top-down perspective, different scale (meters per pixel), different color statistics, objects like fields/roads instead of dogs/cars. Early CNN layers learn Gabor-like filters for natural image edges — these generalize. Later layers encode high-level semantic concepts (dog faces, car shapes) — these don't transfer to satellite imagery.\n- The rule of thumb: freeze early layers (generalizable low-level features), fine-tune later layers (task-specific high-level features). For very different domains, fine-tune most or all layers.\n- Yosinski et al. (2014) showed empirically that transferability decays with layer depth for cross-domain transfer.","A":"Feature extraction is not universally applicable. The 62% plateau indicates the frozen features are insufficiently informative for satellite images. The \"bug\" framing is incorrect — it's a domain mismatch issue.","B":"","C":"The relationship is opposite: feature extraction is appropriate for small target datasets (to avoid overfitting), fine-tuning is preferred when target data is abundant enough to update weights without overfitting.","D":"Learning rate affects training stability, not the representation quality. A frozen layer cannot adapt regardless of learning rate. 
The gap is architectural (frozen vs trainable), not optimization-related."},"reference":"- Yosinski et al., \"How transferable are features in deep neural networks?\" (2014): https://arxiv.org/abs/1411.1792"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16002","difficulty":"easy","orderIndex":2,"question":"You fine-tune a BERT model on a 500-example medical QA dataset. After training, training loss → 0.01 but validation loss → 2.8 (heavily overfit). A colleague suggests catastrophic forgetting. Another suggests overfitting. What is the actual issue, and how would you distinguish the two?","options":{"A":"This is definitely catastrophic forgetting; you need to freeze BERT's layers","B":"Overfitting and catastrophic forgetting are distinct: Overfitting: model memorizes training examples; validation loss is high because the model doesn't generalize to unseen examples. BERT-base has 110M parameters; 500 examples works out to roughly 220,000 parameters per example, so the model is massively overparameterized relative to the data. Catastrophic forgetting: model's general language knowledge is overwritten by the new task, losing pretrained language model capabilities. To distinguish: (1) test on a general NLP benchmark (e.g., GLUE task) — if performance collapses, catastrophic forgetting; (2) examine val loss curve — if it rises immediately and steeply, the model is failing to generalize (overfitting), not forgetting prior knowledge. In this case with 500 examples, overfitting is the primary explanation","C":"The issue is that BERT requires at least 10,000 examples; use a smaller model","D":"Catastrophic forgetting only occurs in continual learning settings, not standard fine-tuning"},"correct":"B","explanation":{"correct":"- Overfitting diagnosis: train accuracy ≈ 100%, val accuracy low. 
With 500 examples and 110M parameters, BERT can memorize all training examples perfectly without learning generalizable patterns.\n- Catastrophic forgetting diagnosis: reduced performance on tasks BERT was originally trained for. You'd check by evaluating on original pretraining tasks (masked LM, next sentence prediction).\n- Both can coexist, but the dominant problem with 500 examples is almost certainly overfitting. Fix: regularization (weight decay, dropout), data augmentation, reduce learning rate, reduce training epochs, use LoRA (train <1% of parameters).","A":"Freezing BERT's layers would cause feature extraction mode — appropriate only if the medical QA task is well-represented by general language features. Freezing prevents the model from learning medical terminology. The recommended approach is parameter-efficient fine-tuning (LoRA) or heavy regularization.","B":"","C":"BERT models are fine-tuned successfully on much smaller datasets (< 100 examples with proper techniques). The issue is not a minimum dataset size requirement.","D":"Catastrophic forgetting is a broader phenomenon. During fine-tuning, if the learning rate is too high and too many epochs are run, the model's weights shift far from the pretrained initialization, losing general capabilities."},"reference":"- Howard & Ruder, \"Universal Language Model Fine-Tuning for Text Classification (ULMFiT)\" (2018): https://arxiv.org/abs/1801.06146\n- Hu et al., \"LoRA: Low-Rank Adaptation of Large Language Models\" (2022): https://arxiv.org/abs/2106.09685"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16003","difficulty":"medium","orderIndex":3,"question":"ULMFiT introduced discriminative fine-tuning, where different learning rates are used for different layers, each layer's rate being the rate of the layer above it divided by 2.6 (e.g., for a 3-layer model: LR for layer 3 = η, LR for layer 2 = η/2.6, LR for layer 1 = η/2.6^2). 
What is the rationale, and what would happen if you used a uniform high LR across all layers?","options":{"A":"Discriminative LR is used to reduce computational cost; it has no effect on quality","B":"Discriminative LR recognizes that early (low) layers learn general, transferable features (syntax, basic semantics) and late (high) layers learn task-specific features. Fine-tuning requires: late layers to adapt significantly to the new task; early layers to adapt slowly (preserve valuable general features). Uniform high LR: all layers update aggressively. Result: early layers' general representations are overwritten by task-specific signals from the small fine-tuning dataset. The model loses its general language understanding (catastrophic forgetting). Discriminative LR → lower LR for early layers (slow drift from pretrained values) + higher LR for later layers (fast task adaptation)","C":"Uniform LR is fine because later layers dominate the gradient signal anyway","D":"Discriminative LR is needed only for RNNs; Transformers require uniform LR"},"correct":"B","explanation":{"correct":"- Layer-wise learning rate intuition: features become increasingly task-specific with depth. Overwriting general features (low layers) with task-specific fine-tuning signals corrupts valuable pretrained representations.\n- Mathematically: with high LR, the weight update Δw = -η × ∇L can be large relative to the pretrained values w_pretrained. For early layers, this destroys generalizable features. For late layers, this is desirable — the pretrained late-layer features (general NLU) should be replaced with task-specific representations.\n- ULMFiT ablation: discriminative LR consistently outperforms uniform LR in classification experiments.","A":"ULMFiT's paper shows discriminative LR improves test accuracy across multiple NLP benchmarks. 
The quality difference is significant — it's not a compute optimization.","B":"","C":"Later layers do dominate gradient signal at the output, but early layers also receive gradient through backpropagation. Without discriminative LR, early layer gradients can still cause significant weight updates.","D":"Discriminative LR is applicable to both RNNs (ULMFiT's original application) and Transformers. Papers on fine-tuning BERT and GPT models show that using lower LR for early Transformer layers improves performance."},"reference":"- Howard & Ruder, \"Universal Language Model Fine-Tuning for Text Classification\" (2018): https://arxiv.org/abs/1801.06146"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16004","difficulty":"medium","orderIndex":4,"question":"You pretrain a ResNet on natural photos and then fine-tune on medical X-ray images. Despite fine-tuning, the model underperforms a smaller CNN trained from scratch on 50,000 X-ray images. This is negative transfer. What conditions cause negative transfer, and how would you detect it before investing in full fine-tuning?","options":{"A":"Negative transfer means the pretrained model is corrupted; you need to reinitialize","B":"Negative transfer: performance with transfer learning < performance without transfer learning (training from scratch). Causes for X-ray scenario: (1) domain gap — X-rays are greyscale (often 3-channel replicated), inverted brightness (denser tissue = whiter), different spatial scale. ImageNet's RGB color statistics are meaningless; (2) task mismatch — ImageNet classification (1000 diverse categories) vs binary/multi-label pathology detection. The model wastes capacity encoding ImageNet priors. 
Detect before full fine-tuning: (1) compare linear probe (frozen feature) accuracy on 10% of target data vs random init with same architecture and data; (2) measure feature similarity (CKA: Centered Kernel Alignment) between pretrained and optimal target features; if CKA is low, transfer will be poor","C":"Negative transfer only occurs when pretraining dataset is smaller than target dataset","D":"Increasing pretraining epochs prevents negative transfer"},"correct":"B","explanation":{"correct":"- Negative transfer evidence: Raghu et al. (2019) \"Transfusion\" paper found that ImageNet pretraining provided minimal benefit for radiology tasks compared to training with proper medical image architectures/data.\n- Domain gap measurement: CKA similarity between ImageNet-pretrained features and features of a model trained from scratch on X-rays. Low similarity → the two domains require fundamentally different feature representations.\n- Early detection: linear probing on 10% of data takes minutes vs full fine-tuning which takes hours/days. If linear probe with pretrained features doesn't outperform random init features, full fine-tuning is unlikely to help.","A":"Negative transfer doesn't corrupt the pretrained model — the original pretrained weights are unchanged. The issue is that fine-tuning adapts the model away from its pretrained state without reaching a good target-domain solution.","B":"","C":"Negative transfer is primarily about domain/task mismatch, not dataset size comparison. Large pretraining datasets can still transfer negatively to very different domains.","D":"More pretraining epochs on ImageNet would deepen ImageNet-specific features, potentially worsening transfer to X-rays. 
Domain-specific pretraining (on unlabeled X-rays) would help, not more ImageNet epochs."},"reference":"- Raghu et al., \"Transfusion: Understanding Transfer Learning for Medical Imaging\" (2019): https://arxiv.org/abs/1902.07208"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16005","difficulty":"medium","orderIndex":5,"question":"Prototypical Networks (ProtoNets) and MAML (Model-Agnostic Meta-Learning) are two approaches for few-shot learning. Explain the key algorithmic difference and which is more appropriate for few-shot image classification with a new class that has only 5 labeled examples.","options":{"A":"MAML is always better because it optimizes during meta-training and meta-testing","B":"ProtoNets: compute class prototypes = mean embedding of support set examples; classify query by nearest prototype. Non-parametric; no gradient steps at test time. MAML: learn an initialization θ such that a few gradient steps on a new task produces a good model. Meta-test: take 5-10 gradient steps from θ for the new class. For few-shot image classification with 5 labeled examples: (1) ProtoNets are simpler, faster, more robust; the 5 support examples define a reliable prototype in embedding space; (2) MAML requires gradient steps at test time (computationally more expensive) and higher-order gradients during training; (3) Empirically, ProtoNets often match MAML performance on standard benchmarks (miniImageNet, tieredImageNet) while being significantly simpler","C":"Neither applies — few-shot learning requires at least 100 examples","D":"MAML is for NLP only; ProtoNets are for image classification only"},"correct":"B","explanation":{"correct":"- ProtoNets at test time: given 5 labeled examples of \"snow leopard\" (never seen during training): embed each through the encoder φ; compute prototype c = (1/5) Σᵢ φ(xᵢ); classify new query x̂ by argmin_c ||φ(x̂) - c||². 
No gradient descent required.\n- MAML at test time: θ_leopard = θ_init - α ∇_θ L(θ; 5 examples). This requires a forward+backward pass for 5-10 optimization steps. The goal: the gradient steps should quickly adapt the global θ to the new class.\n- Practical consideration: ProtoNets are simpler to implement and debug. MAML involves second-order gradients (computing gradients of gradients) or first-order approximation (FOMAML), which is more complex.","A":"MAML is not always better. For simple few-shot image classification, ProtoNets consistently perform comparably or better with lower computational cost. MAML has advantages in tasks requiring rapid adaptation through gradient steps (e.g., reinforcement learning, regression tasks).","B":"","C":"Few-shot learning is specifically designed for 1-10 labeled examples per class (N-way, K-shot). Both ProtoNets and MAML are designed for this regime and have been validated on 1-shot and 5-shot benchmarks.","D":"Both ProtoNets and MAML are general-purpose few-shot learning algorithms applicable to image classification, NLP, and reinforcement learning. There's no domain restriction."},"reference":"- Snell et al., \"Prototypical Networks for Few-shot Learning\" (2017): https://arxiv.org/abs/1703.05175\n- Finn et al., \"Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML)\" (2017): https://arxiv.org/abs/1703.03400"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16006","difficulty":"medium","orderIndex":6,"question":"LoRA (Low-Rank Adaptation) fine-tunes a pretrained language model by adding trainable rank-r matrices (B ∈ ℝ^{d×r}, A ∈ ℝ^{r×d}) to the attention weight matrices W ∈ ℝ^{d×d}, where r << d. The adapted weight is W' = W + BA. 
Why does this approach prevent catastrophic forgetting, and what is the memory saving for d=4096, r=8?","options":{"A":"LoRA prevents forgetting by freezing the model and training separate heads for each task","B":"LoRA's anti-forgetting mechanism: the original W is frozen (never updated). The adaptation is encoded entirely in BA, initialized as B=zeros, A=random (so BA=0 initially — no initial perturbation). Only A and B are updated. The pretrained knowledge in W is preserved by construction. Memory saving: full fine-tuning W requires d² trainable parameters: 4096² = 16.8M params. LoRA: A has d×r = 4096×8 = 32,768 params; B has r×d = 8×4096 = 32,768 params; total = 65,536 params ≈ 65K. Saving: 16.8M / 65K ≈ 256× fewer trainable parameters per weight matrix","C":"LoRA prevents forgetting by using a replay buffer of pretraining examples during fine-tuning","D":"LoRA works by pruning 90% of weights before fine-tuning, reducing the chance of overwriting important weights"},"correct":"B","explanation":{"correct":"- Preservation by freezing: W_pretrained remains exactly unchanged throughout fine-tuning. Any forgetting in traditional fine-tuning comes from directly updating W. LoRA sidesteps this entirely.\n- BA initialization: B=0 means BA=0 initially. W' = W + 0 = W. As training proceeds, BA learns the task-specific delta. This is a clean starting point with no disruption.\n- Parameter count: for GPT-3 (d=12288), full fine-tuning: 12288² = 150M per attention matrix. With 96 layers × 4 matrices = 57.6B parameters just for attention. LoRA r=4: 96 × 4 × 2 × 12288 × 4 = 37.7M total LoRA parameters — a 1500× reduction.","A":"LoRA does freeze the model body, but it doesn't train \"separate heads per task.\" The LoRA adapter (BA) modifies the same weight matrices used for all tasks. 
Task separation via separate heads is a different approach (multi-task heads, adapter layers).","B":"","C":"Replay buffers (experience replay) are a continual-learning technique that rehearses stored examples from earlier tasks (e.g., the episodic memory in GEM); Elastic Weight Consolidation, by contrast, is a regularization method. LoRA doesn't use replay buffers. It prevents forgetting through architectural design (frozen base weights).","D":"LoRA doesn't prune weights. All d² parameters of W are retained but frozen. Pruning would remove parameters; LoRA adds parameters (the BA matrices) while keeping W intact."},"reference":"- Hu et al., \"LoRA: Low-Rank Adaptation of Large Language Models\" (2022): https://arxiv.org/abs/2106.09685"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16007","difficulty":"hard","orderIndex":7,"question":"You fine-tune GPT-2 (117M params) on a 1000-example customer service chatbot dataset. During fine-tuning with a learning rate of 5e-4, the model quickly learns to generate helpful responses but loses its general text coherence (produces grammatically broken sentences outside the chatbot domain). What is happening mechanically, and what are two independent fixes?","options":{"A":"GPT-2 is too large for 1000 examples; use a 2-layer LSTM instead","B":"The high LR (5e-4) aggressively updates all 117M parameters. With 1000 examples, the model rapidly overfits to chatbot patterns: specific phrasing, vocabulary, and response structures. The large updates overwrite the pretrained language modeling capabilities (grammar, coherence). Mechanically: the weight updates Δw = -η × ∇L are large (η=5e-4 is high for GPT-2 fine-tuning; typical: 1e-5 to 5e-5). Fix 1: reduce LR to 1e-5 — smaller updates preserve pretrained representations while allowing gradual task adaptation. Fix 2: use LoRA (r=8) — freeze all 117M params, add 0.5M trainable params. 
Only the low-rank adapters update; GPT-2's language model weights are frozen, preserving coherence","C":"The issue is gradient clipping; enable gradient clipping to fix coherence","D":"Fine-tuning GPT-2 on 1000 examples is the correct approach; the coherence issue resolves after more training"},"correct":"B","explanation":{"correct":"- LR impact: for pretrained LLMs, fine-tuning LR is typically 1-2 orders of magnitude lower than pretraining LR. GPT-style models are pretrained with peak LRs on the order of 1e-4 to 6e-4, using a large batch and warm-up. Fine-tuning at 5e-4 on small batches with no LR schedule applies updates at roughly pretraining magnitude, treating the model as if it were being trained from scratch.\n- Forgetting speed: 1000 examples × multiple epochs = thousands of gradient steps. Each step at 5e-4 drifts the weights significantly from pretrained initialization.\n- LoRA fix: with B=0 initialized adapters, only the low-rank matrices capture task knowledge. The base model's language knowledge is architecturally protected.","A":"Model size alone doesn't determine fine-tuning success. With proper regularization (lower LR, LoRA, early stopping), GPT-2 can be effectively fine-tuned on small datasets. Switching to an LSTM would lose GPT-2's language knowledge.","B":"","C":"Gradient clipping (||∇|| ≤ max_norm) prevents large gradient magnitudes but doesn't prevent many small updates from cumulatively overwriting pretrained weights. Gradient clipping is necessary for training stability but doesn't address catastrophic forgetting.","D":"More training with a high LR would worsen catastrophic forgetting, not resolve it. 
As training continues, the model's weights move further from the pretrained initialization."},"reference":"- Mosbach et al., \"On the Stability of Fine-Tuning BERT: Misconceptions, Explanations, and Strong Baselines\" (2021): https://arxiv.org/abs/2006.04884"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16008","difficulty":"hard","orderIndex":8,"question":"Elastic Weight Consolidation (EWC) is a continual learning technique that adds a regularization term L_EWC = (λ/2) Σ_i F_i(θ_i - θ*_i)², where F_i is the Fisher information for parameter i and θ*_i is the pretrained value. For fine-tuning on Task B after training on Task A, what does F_i represent and why does it address catastrophic forgetting better than simple L2 regularization (L = (λ/2) Σ_i (θ_i - θ*_i)²)?","options":{"A":"F_i is the gradient magnitude; EWC and L2 are equivalent when gradients are uniform","B":"F_i is the Fisher information — the expected squared gradient of the log-likelihood with respect to θ_i on Task A's data: F_i = E[(∂ log p(y|x,θ) / ∂θ_i)²]. This estimates how sensitive Task A's loss is to parameter θ_i. High F_i → parameter θ_i is important for Task A; changing it will hurt Task A performance. EWC advantage over L2: L2 penalizes all weight changes equally — it treats a parameter critical to Task A (large Fisher) the same as a parameter irrelevant to Task A (small Fisher). EWC concentrates regularization on important parameters. 
This allows parameters irrelevant to Task A to freely update for Task B (flexibility), while protecting critical Task A parameters (forgetting prevention)","C":"EWC uses the Hessian diagonal; the Fisher information is only an approximation of the Hessian","D":"L2 regularization prevents catastrophic forgetting equally well as EWC; F_i is just used for computational efficiency"},"correct":"B","explanation":{"correct":"- Fisher information intuition: if changing θ_i by a small amount ε significantly changes the log-likelihood for Task A's data, θ_i is important for Task A. F_i captures this: F_i = E[(∂ log p(y|x,θ*) / ∂θ_i)²]. Under the Laplace approximation, the posterior over θ_i near θ*_i is Gaussian with precision F_i.\n- L2 vs EWC: consider θ_j, a parameter used exclusively for Task A (high F_j), and θ_k, irrelevant to Task A (F_k ≈ 0). L2 penalizes both equally. EWC: heavy penalty on θ_j (preserve Task A), no penalty on θ_k (free to learn Task B).\n- Practical effect: EWC enables selective forgetting — only irrelevant parameters can change, while the important ones are protected.","A":"F_i is not the gradient magnitude; it's the expected squared gradient of the log-likelihood. Gradient magnitude during Task B training measures sensitivity to Task B, not Task A. F_i is computed once from Task A data.","B":"","C":"F_i (Fisher diagonal) is indeed related to the Hessian diagonal. Under mild conditions (near the MLE), F_i ≈ -E[∂² log p / ∂θ_i²] = Hessian diagonal. This relationship is used as the EWC motivation. The statement \"only an approximation\" is technically true but doesn't make the claim incorrect — the Fisher diagonal is the standard EWC formulation.","D":"L2 regularization does not perform as well as EWC for continual learning. The Fisher-weighted regularization is the key innovation in EWC. 
Papers show EWC significantly outperforms L2 regularization in sequential task learning benchmarks."},"reference":"- Kirkpatrick et al., \"Overcoming catastrophic forgetting in neural networks (EWC)\" (2017): https://arxiv.org/abs/1612.00796"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16009","difficulty":"medium","orderIndex":9,"question":"A 3-layer CNN pretrained on ImageNet is fine-tuned for a new task with only 200 labeled examples. You test two strategies: (A) freeze layer 1-2, fine-tune layer 3 + new classifier head; (B) fine-tune all layers with a very low LR (1e-6). Which strategy is less likely to overfit, and why?","options":{"A":"Strategy B always outperforms A for small datasets because it updates more parameters","B":"Strategy A (freeze early layers) is less likely to overfit: 200 examples can only reliably train a small number of parameters. Frozen layers provide fixed feature extraction — the trainable parameter count is reduced to layer 3 + head. Fewer parameters relative to data → less overfitting. Strategy B: all 3 layers update, but with LR=1e-6. Very small updates mean the layers drift very slowly — overfitting is prevented through update magnitude limitation rather than frozen architecture. Trade-off: Strategy A is more robust to overfitting but may underperform if early layers provide suboptimal features. Strategy B allows richer adaptation but risks eventual overfitting. Recommendation: use Strategy A with early stopping, or LoRA","C":"Neither strategy can work with 200 examples; you must use data augmentation first","D":"Strategy B cannot learn anything at LR=1e-6; the gradients vanish before reaching layer 1"},"correct":"B","explanation":{"correct":"- Parameter count comparison: if layer 1-2 have 500K params, layer 3 has 200K, head has 1K: Strategy A trains 201K params; Strategy B trains 701K params. 
With 200 examples, 200 examples / 201K params = 1 example per 1000 parameters (still sparse, but 3.5× better than Strategy B).\n- LR=1e-6 effect: the weight update per step = 1e-6 × gradient. For typical gradient magnitude ~1, updates ~1e-6 per step. After 200 examples × 10 epochs = 2000 steps, total drift ≈ 2000 × 1e-6 = 0.002 — very small weight changes.\n- Both strategies have merit; the best approach depends on how similar the source and target domains are.","A":"Updating more parameters with 200 examples is a recipe for overfitting, not improvement. The fundamental challenge is generalization with limited data. More trainable parameters require more data.","B":"","C":"Data augmentation is a complementary technique (not a prerequisite). With proper augmentation, both strategies can work. But augmentation doesn't resolve the architectural question about which strategy overfits less.","D":"LR=1e-6 doesn't cause vanishing gradients. Gradients flow normally through backpropagation; the LR only scales the update step. The model can learn at LR=1e-6 — just slowly."},"reference":"- Kornblith et al., \"Do Better ImageNet Models Transfer Better?\" (2019): https://arxiv.org/abs/1805.08974"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16010","difficulty":"hard","orderIndex":10,"question":"Adapter layers are small bottleneck modules inserted between Transformer layers: h → LayerNorm → down-project (d→m, m< Volume factor when: (1) target domain has unique structural properties absent in general domain (X-rays vs photos); (2) task requires fine-grained domain-specific distinctions; (3) target dataset is small (domain-specific features reduce the needed fine-tuning to align representations). 
Counter-case: for tasks where both domains apply (e.g., skin lesion detection — photos share some properties), ImageNet with 50× data may win","C":"The result is due to RadImageNet being harder to overfit; size doesn't affect generalization","D":"Neural scaling laws guarantee more data = better; the experiment must have a flaw"},"correct":"B","explanation":{"correct":"- Feature alignment vs volume: consider linear probing: frozen pretrained features → logistic regression on target task. If domain-specific features score 0.85 and ImageNet features score 0.62, fine-tuning can improve both but starts from a better initialization with domain-specific pretraining.\n- CKA analysis: Raghu et al. and Nguyen et al. used CKA to measure feature similarity between pretrained models and task-optimal models. Domain-specific pretraining produces features with higher CKA similarity to the target task, requiring less adaptation.\n- Practical guidance: for specialized domains (medical imaging, satellite imagery, molecular biology), domain-specific pretraining often outperforms general pretraining even with less data.","A":"Domain-specific pretraining is not always better. For target tasks well-covered by general pretraining (e.g., detecting office objects, classifying natural animals), ImageNet pretraining with vastly more data and diversity wins. \"Always better\" is an overstatement.","B":"","C":"Overfitting difficulty doesn't explain transfer quality. The explanation is feature alignment: what the model learns to represent during pretraining.","D":"Neural scaling laws apply within a domain and training paradigm. They don't claim that data from a different distribution always improves performance. Cross-domain transfer violates the i.i.d. 
assumption underlying scaling laws."},"reference":"- Mei et al., \"RadImageNet: An Open Radiologic Deep Learning Research Dataset for Effective Transfer Learning\" (2022): https://arxiv.org/abs/2201.09600"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16012","difficulty":"medium","orderIndex":12,"question":"Domain adaptation addresses the scenario where training distribution p_source ≠ test distribution p_target. In unsupervised domain adaptation (UDA), you have labeled source data and unlabeled target data. How does a Domain-Adversarial Neural Network (DANN) use a gradient reversal layer (GRL) to learn domain-invariant features?","options":{"A":"GRL flips gradients to make the classifier worse, improving domain adaptation by adversarial training on the label space","B":"DANN has three components: feature extractor G_f, label classifier G_y (on source labels), domain classifier G_d (predicts source vs target). GRL sits between G_f and G_d. Forward pass: normal. Backward pass through GRL: gradients are multiplied by -λ (reversed). Effect: G_d tries to distinguish source from target; reversed gradient tells G_f to produce features that maximally confuse G_d. Result: G_f learns features where source and target are indistinguishable (domain-invariant). Simultaneously, G_y trains G_f to keep label-discriminative information. The learned features are both label-predictive AND domain-invariant — features transfer to unlabeled target domain with high accuracy","C":"GRL prevents the feature extractor from training; only G_y and G_d update during backprop","D":"Domain adaptation only works when source and target have the same number of classes"},"correct":"B","explanation":{"correct":"- Minimax objective: min_{G_f, G_y} max_{G_d} [L_y(G_y(G_f(x_source)), y) - λ L_d(G_d(G_f(x)), d)]. G_f and G_y minimize the label loss L_y. G_d maximizes the objective, which means it minimizes the domain classification loss L_d (it tries to classify domains accurately), while G_f maximizes L_d so that the domain classifier is confused. 
GRL implements this minimax through the reversal trick.\n- Why domain invariance helps: if G_f produces features where source and target look the same, a classifier trained on source features can be applied to target features without explicit target labels.\n- Limitation: domain invariance alone is not sufficient. If the label distribution p(y) differs between domains (label shift), or the conditional distribution p(y|features) differs, aligning the feature distributions can hurt target accuracy and DANN may fail.","A":"GRL doesn't flip gradients for the label classifier (G_y). G_y receives normal gradients from the label classification loss. G_d also trains with normal gradients; only the gradient passed from G_d back into G_f is reversed by the GRL. The adversarial training targets domain confusion, not label confusion.","B":"","C":"GRL only affects gradient flow (multiplies by -λ). The feature extractor G_f receives gradients from both paths: normal gradients from G_y's label loss (learn discriminative features) and reversed gradients from G_d's domain loss (learn domain-invariant features). G_f updates from both.","D":"DANN works regardless of class count differences. The domain classifier is binary (source vs target) regardless of the number of task classes. Many practical domain adaptation scenarios have different class distributions between source and target."},"reference":"- Ganin & Lempitsky, \"Unsupervised Domain Adaptation by Backpropagation (DANN)\" (2015): https://arxiv.org/abs/1409.7495"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16013","difficulty":"hard","orderIndex":13,"question":"CLIP (Contrastive Language-Image Pre-Training) enables zero-shot transfer: a model trained on image-text pairs can classify images into unseen classes without fine-tuning, by comparing image embeddings to text embeddings of class descriptions. 
If CLIP is used for zero-shot classification of pathological tissue types, and zero-shot CLIP achieves 51% top-1 accuracy, while a ResNet fine-tuned with 100 labeled examples achieves 79%, which approach should be chosen and what is the core limitation of CLIP's zero-shot transfer here?","options":{"A":"Always use CLIP zero-shot; fine-tuning introduces catastrophic forgetting","B":"Choose the fine-tuned ResNet (79% > 51%). Core limitation of CLIP zero-shot for pathology: CLIP was trained on internet image-text pairs — pathological tissue images (H&E staining, microscopy) are rare or absent in web data. The text descriptions (\"carcinoma\", \"adenocarcinoma\", \"dysplasia\") are specialized medical terms that CLIP's text encoder associates with radiology reports or textbooks, not with actual H&E-stained tissue images. Domain gap: CLIP's image encoder was not trained on microscopy images; its visual representations don't align with the specific features pathologists use. 100 labeled examples is sufficient to train a ResNet head (or fine-tune the last few layers) to learn domain-specific visual distinctions","C":"CLIP zero-shot is always preferable because it doesn't require any labeled data","D":"Fine-tuned ResNet is better only because it has more parameters; use CLIP with more parameters to match"},"correct":"B","explanation":{"correct":"- 51% vs 79%: a 28% accuracy gap is decisive. The cost of labeling 100 examples (hours of pathologist time) is justified by the performance gain in a clinical setting.\n- CLIP's zero-shot strength: it excels on natural image categories well-represented in web data (ImageNet-like classes). 
For domain-specific visual concepts (pathology, radiology, satellite imagery), zero-shot performance degrades significantly.\n- LLaVA-Med, PathCLIP: domain-specific CLIP-like models pretrained on medical image-text pairs achieve much higher zero-shot performance, highlighting that the limitation is domain gap, not the framework itself.","A":"CLIP zero-shot doesn't suffer catastrophic forgetting (nothing is fine-tuned). The concern is the opposite: CLIP's zero-shot representations don't transfer well to highly specialized domains.","B":"","C":"\"No labeled data required\" is an advantage of zero-shot learning, but it's not sufficient justification when zero-shot performance is 28 percentage points below fine-tuned performance in a high-stakes domain (clinical pathology). The cost of 100 labels is worth a 28-point accuracy gain.","D":"The gap is not due to parameter count. ResNet-50 (25M params) with 100-example fine-tuning outperforms CLIP ViT-B/32 (about 150M params in total, across the image and text encoders) on pathology tasks. The issue is representation alignment, not model capacity."},"reference":"- Radford et al., \"Learning Transferable Visual Models from Natural Language Supervision (CLIP)\" (2021): https://arxiv.org/abs/2103.00020\n- Zhang et al., \"BiomedCLIP\" (2023): domain-specific medical CLIP"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16014","difficulty":"hard","orderIndex":14,"question":"You fine-tune a Vision Transformer (ViT-L/16, 307M params) on a 2,000-example art classification dataset using full fine-tuning. Training set accuracy = 97%, validation accuracy = 58%, suggesting severe overfitting. You compare five interventions: (A) weight decay 0.01, (B) frozen patch embedding + 50% of transformer blocks, (C) LoRA r=16, (D) 10× data augmentation (flips, rotations, color jitter), (E) dropout 0.3 in all attention layers. Rank these from most to least effective.","options":{"A":"A > B > C > D > E","B":"Most effective → least effective: D > B > C > A > E. 
(D) Data augmentation effectively multiplies the 2,000 examples, directly addressing the data scarcity problem — 10× augmentation ≈ 20,000 examples, dramatically reducing overfitting. (B) Freezing 50% of blocks reduces trainable parameters from 307M to ~150M while preserving pretrained features. (C) LoRA r=16: trainable parameters ≈ 2×16×d_model × n_layers. For ViT-L (d=1024), one LoRA pair: 2 × 1024 × 16 = 32,768 params. All layers: ~10M params — massive reduction. (A) Weight decay penalizes large weights but doesn't reduce parameter count. (E) Dropout in attention is the least effective alone — it doesn't directly address the parameter:data ratio problem","C":"C > D > B > E > A","D":"All interventions are equally effective; use any combination"},"correct":"B","explanation":{"correct":"- Severity of overfitting: 97% train vs 58% val is extreme. With 307M params and 2K examples, there are ~150K trainable parameters per training example, far more than the data can constrain.\n- Data augmentation: art classification benefits especially from geometric (rotation, flips) and color augmentation. 10× augmentation with a 2K dataset gives 20K diverse examples without any architectural change.\n- LoRA vs weight decay: LoRA reduces effective parameter count to ~3% of full model. Weight decay constrains weight magnitudes but keeps all 307M parameters active and potentially able to overfit.\n- Dropout in attention: attention dropout disrupts the attention patterns learned during pretraining and can hurt representation quality. It's a blunt instrument for this specific overfitting problem.","A":"Ranking A (weight decay) above B (frozen blocks) and C (LoRA) is incorrect. Weight decay is the weakest regularizer here — it doesn't fundamentally address the parameter:data imbalance. Frozen blocks and LoRA directly reduce the effective number of trainable parameters.","B":"","C":"While LoRA is highly effective, ranking it above data augmentation is debatable. 
Data augmentation directly increases the effective dataset size, addressing the root cause. LoRA reduces the model's capacity to overfit but doesn't increase information content.","D":"The interventions have different expected magnitudes of effect based on the overfitting mechanism. Data augmentation addresses data scarcity; parameter reduction (B, C) addresses model complexity; A and E are weaker regularizers."},"reference":"- Touvron et al., \"Training data-efficient image transformers (DeiT)\" (2021): augmentation strategies for ViT\n- He et al., \"Masked Autoencoders Are Scalable Vision Learners\" (2022): fine-tuning ViT at various dataset scales"},{"section":"deep-learning","topicSlug":"transfer-learning","topic":"Transfer Learning","id":"dl-16015","difficulty":"hard","orderIndex":15,"question":"A continual learning system trains on Tasks A, B, C, D sequentially. After training on D, it shows perfect performance on D, 90% on C, 65% on B, and 12% on A. A senior engineer argues that this is expected behavior and the performance difference is statistically insignificant. What does the performance pattern actually reveal, and what is the fundamental trade-off in continual learning that makes this an unsolved problem?","options":{"A":"The pattern shows normal learning — earlier tasks are naturally harder","B":"The pattern shows classic catastrophic forgetting (also called \"catastrophic interference\"): Task A's performance (12%) is near chance — the model has forgotten nearly everything from Task A. Each new task updates weights optimizing the current task, progressively overwriting earlier task representations. The fundamental trade-off (plasticity-stability dilemma): Stability: preserve performance on previous tasks (resist changing old weights). Plasticity: learn new tasks effectively (freely update weights). These are mutually contradictory: learning Task D requires changing weights, which disrupts Task A-C representations. 
No current approach fully resolves this. EWC (Fisher-weighted regularization) partially mitigates it. Replay methods (store samples from old tasks) reduce forgetting at memory cost. Progressive Neural Networks (column per task) have full stability but grow linearly in parameters. None achieve human-level continual learning","C":"The pattern is caused by task D being more recent; add a time-based decay to fix it","D":"Use higher learning rates on early tasks to make them \"stickier\" in the network"},"correct":"B","explanation":{"correct":"- 12% on a classification task: if Task A is 10-class, random chance = 10%. The model has essentially random performance on Task A — complete forgetting.\n- Stability-plasticity dilemma: biological neural systems solve this through complementary learning systems (hippocampus = fast learning/plasticity; neocortex = slow consolidation/stability). Current neural networks have a single weight space serving all tasks — no natural separation.\n- State of the field (2024): EWC, PackNet, GEM, A-GEM, and other continual learning methods improve over naive sequential training but still show non-trivial forgetting. Few-shot learning and meta-learning partially address this for related tasks.","A":"The performance pattern is not \"expected normal behavior.\" Task A being at 12% (near chance) is not \"naturally harder\" — it was presumably mastered before Task B began. The degradation across tasks is caused by training Task B, C, D, not by task difficulty.","B":"","C":"Time-based decay doesn't address the fundamental problem. Decaying old weights would make forgetting worse, not better. The goal is to preserve old task performance, which requires resisting weight changes on important old parameters.","D":"Using higher LR on earlier tasks would cause them to be learned initially with \"larger\" representations, but subsequent task training would still overwrite those weights. 
The problem is sequential overwriting during later task training, not initial learning rate."},"reference":"- Kirkpatrick et al., \"Overcoming catastrophic forgetting in neural networks (EWC)\" (2017): https://arxiv.org/abs/1612.00796\n- Parisi et al., \"Continual lifelong learning with neural networks: A review\" (2019): https://arxiv.org/abs/1802.07569"},{"section":"deep-learning","difficulty":"easy","id":"dl-e001","topicSlug":"introduction-to-neural-networks","orderIndex":1,"topic":"Introduction To Neural Networks","question":"A perceptron computes f = 1 if w₁x₁ + w₂x₂ + b ≥ 0. A student wants to implement the AND gate (output 1 only when x₁=1 AND x₂=1). They try w₁=1, w₂=1, b=-1. For input (1,0), they compute z = 1+0-1 = 0, which fires (≥ 0). Is this correct AND behavior, and what bias would fix it?","options":{"A":"Yes — AND should fire at z ≥ 0, so b=-1 is correct","B":"No — input (1,0) should output 0 for AND. z=0 fires, so the decision boundary is wrong. Setting b=-1.5 fixes it: (1,1): z=0.5>0 ✓; (1,0): z=-0.5<0 ✓; (0,1): z=-0.5<0 ✓; (0,0): z=-1.5<0 ✓","C":"No — AND cannot be implemented by a perceptron regardless of weight values","D":"Yes — AND requires that the sum of inputs equals 2, so z=0 is the correct threshold for (1,0)"},"correct":"B","explanation":{"correct":"- Perceptron boundary: fires when z ≥ 0. With b=-1, inputs (1,0) and (0,1) give z=0, which fires — but AND should output 0 for these cases.\n- Fix: with w=[1,1], the weighted sum w·x is 2 when both inputs are 1 and 1 when only one input is 1. The threshold must fall strictly between those two sums (1 vs 2). b=-1.5 places the boundary (z=0) at w·x=1.5, which falls between 1 and 2.\n- This shows the bias term's role: it translates the decision boundary without changing the hyperplane orientation.","A":"z=0 is a boundary condition that fires (≥ 0). For AND, (1,0)→0 so it must not fire. 
The threshold is set too loosely.","B":"","C":"AND is linearly separable (you can draw a line separating the 3 \"false\" points from 1 \"true\" point in 2D). A perceptron can implement it — unlike XOR.","D":"The sum of inputs for (1,0) is 1, not 2. With w=[1,1] and b=-1, z=0 for (1,0), which triggers activation. That's the bug."},"reference":"- Rosenblatt, \"The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain\" (1958)"},{"section":"deep-learning","difficulty":"easy","id":"dl-e002","topicSlug":"introduction-to-neural-networks","orderIndex":2,"topic":"Introduction To Neural Networks","question":"You train a 2-layer neural network (1 hidden layer) with ReLU on a 2D classification dataset. On the training set, the model achieves 98% accuracy. A colleague says \"this means the model has learned the true underlying function.\" Another says \"it may have memorized the training data.\" What single experiment most directly distinguishes the two?","options":{"A":"Train for more epochs — if accuracy stays at 98%, the model has generalized","B":"Evaluate on a held-out test set — if test accuracy is also ~98%, the model generalizes; if test accuracy drops significantly (e.g., 60%), it has memorized training data without learning the underlying pattern","C":"Plot the loss curve — a decreasing training loss confirms generalization","D":"Increase model capacity — if a larger model also achieves 98%, then the pattern is real"},"correct":"B","explanation":{"correct":"- Generalization test: the training accuracy tells you nothing about generalization. Any sufficiently large network can memorize any finite dataset (Zhang et al. 
2017 showed networks can memorize randomly labeled CIFAR-10).\n- The held-out test set (unseen data from the same distribution) is the gold standard for whether the model has learned the underlying function or just the training set.\n- A train/test accuracy gap (high train, low test) is the definition of overfitting (memorization). Equal train and test accuracy suggests the learned function generalizes.","A":"Training longer with the same data cannot reveal whether the model generalizes — it only confirms that the model can still fit the training set.","B":"","C":"A decreasing training loss is expected for any model with enough capacity regardless of generalization. It measures fit, not generalization.","D":"Testing a larger model on the same training data doesn't test generalization. Both small and large models can achieve 98% training accuracy while generalizing differently."},"reference":"- Zhang et al., \"Understanding Deep Learning Requires Rethinking Generalization\" (2017): https://arxiv.org/abs/1611.03530"},{"section":"deep-learning","difficulty":"easy","id":"dl-e003","topicSlug":"neurons-and-perceptrons","orderIndex":3,"topic":"Neurons And Perceptrons","question":"In a network `z = Wx + b`, you set all biases to zero at initialization. Unlike weights, all biases can remain zero and the network still works. True or False — and what specific problem occurs if you also initialize all weights to the same constant (e.g., 0.001) in addition to zero biases?","options":{"A":"True — zero biases are fine; zero weights are also fine because the gradient will differentiate them during training","B":"False — zero biases already cause a symmetry problem because all neurons in a layer produce identical pre-activations","C":"True — zero biases are fine. 
But identical weights (e.g., all 0.001) cause the symmetry problem: every neuron in a layer receives identical gradients, so all weights update identically at every step — the layer effectively has just one unique neuron regardless of width, and the network never develops diverse representations","D":"False — biases must always be initialized to 1 to prevent the vanishing gradient problem"},"correct":"C","explanation":{"correct":"- Zero bias: fine for most architectures. Biases are independent per neuron — setting them to 0 initializes the threshold at 0, which is a reasonable starting point. Gradient updates for biases are `∂L/∂b = δ`, which differ per neuron once weights differ.\n- Symmetric weight problem: if all weights in a layer are identical, then every neuron computes the same z = w·x + b. The same activation, same gradient → same weight update → remain identical forever. This is the symmetry problem.\n- The layer collapses to a single effective neuron regardless of its width.","A":"Identical weights cause the symmetry problem regardless of the learning process — gradients are identical for identical neurons, so training cannot break the symmetry.","B":"Zero biases alone do NOT cause a symmetry problem. Bias gradients ∂L/∂b = δ only equal zero when the activation gradient δ is zero. With different weights per neuron, δ differs, so biases diverge.","C":"","D":"Bias initialization to 1 is used in some specific cases (e.g., LSTM forget gate bias initialized to 1), but it's not a general requirement and is not related to vanishing gradients."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 8: Optimization for Training Deep Models"},{"section":"deep-learning","difficulty":"easy","id":"dl-e004","topicSlug":"neurons-and-perceptrons","orderIndex":4,"topic":"Neurons And Perceptrons","question":"A fully connected layer maps 512 input features to 256 output features. Including biases, how many trainable parameters does this layer have? 
And if you add a second identical FC layer (256 → 256) after it, what is the total parameter count for both layers combined?","options":{"A":"Layer 1: 512 × 256 = 131,072 params. Layer 2: 256 × 256 = 65,536 params. Total: 196,608","B":"Layer 1: 512 × 256 + 256 = 131,328 params. Layer 2: 256 × 256 + 256 = 65,792 params. Total: 197,120","C":"Layer 1: (512 + 1) × 256 = 131,328 params. Layer 2: (256 + 1) × 256 = 65,792 params. Total: 197,120 (same as B, different calculation)","D":"Layer 1: 512 × 256 = 131,072. Layer 2: 256 × 256 = 65,536. Bias is a single global parameter, so total = 196,608 + 1 = 196,609"},"correct":"B","explanation":{"correct":"- Each FC layer: W of shape (d_out, d_in) + bias vector b of shape (d_out,).\n- Layer 1: W₁ has 512 × 256 = 131,072 weights + 256 biases = 131,328 params.\n- Layer 2: W₂ has 256 × 256 = 65,536 weights + 256 biases = 65,792 params.\n- Total: 131,328 + 65,792 = 197,120 params.\n- The bias is one scalar per output neuron (not a single global value), so each layer adds d_out bias terms.","A":"Forgets to include the bias vectors. This is a common mistake when counting parameters. Both layers have one bias per output unit.","B":"","C":"","D":"The bias is NOT a single global parameter. It is a vector of size d_out — one learned offset per output neuron. Treating it as a single value is incorrect."},"reference":"- PyTorch docs: `torch.nn.Linear` — lists `weight` (out_features × in_features) and `bias` (out_features)"},{"section":"deep-learning","difficulty":"easy","id":"dl-e005","topicSlug":"activation-functions","orderIndex":5,"topic":"Activation Functions","question":"You build a binary sentiment classifier (positive/negative). A junior engineer uses sigmoid activation on the output neuron and cross-entropy loss. A second engineer uses a linear output neuron (no activation) and MSE loss. 
Which is preferred and why?","options":{"A":"MSE + linear is preferred because the loss is smooth and easy to differentiate","B":"Sigmoid + cross-entropy is preferred: sigmoid maps the output to (0,1) producing a valid probability; cross-entropy penalizes confident wrong predictions logarithmically (large gradient when p=0.01 for y=1). MSE + linear can produce outputs outside [0,1] and penalizes confidently correct outputs that overshoot the label (e.g., an output of 5 for y=1), which distorts learning","C":"Both are equivalent; the choice of activation and loss doesn't affect the final result","D":"Linear + MSE is preferred for binary classification because MSE has a unique global minimum"},"correct":"B","explanation":{"correct":"- Sigmoid output: p ∈ (0,1), interpretable as a probability. BCE loss: -[y log p + (1-y) log(1-p)]. When the model is confidently wrong (p=0.001 for y=1), loss = -log(0.001) ≈ 7, providing a large gradient signal.\n- MSE gradient for classification: ∂MSE/∂p = 2(p-y). With a linear output and y=1, p=10 gives gradient = 18, so an overshooting output is penalized heavily even though it lies on the correct side of the decision boundary. If MSE is instead paired with a sigmoid output, the gradient with respect to the pre-activation is 2(p-y)·p(1-p), which becomes tiny when the sigmoid saturates, even for a confidently wrong prediction (e.g., p ≈ 0.00001 for y=1). Cross-entropy with sigmoid gives the gradient p-y, which stays large when the prediction is wrong. This is why cross-entropy is preferred with sigmoid.\n- MSE is designed for regression where the target can be any real number, not a binary label.","A":"MSE is smooth, but \"smooth\" doesn't mean \"appropriate.\" The gradient behavior for classification tasks makes MSE suboptimal — it doesn't naturally encode probability semantics.","B":"","C":"The choice significantly affects convergence speed and gradient behavior. 
Sigmoid + BCE trains faster and converges to better solutions for binary classification.","D":"MSE has a unique minimum for regression problems, but for classification with a linear output, the model may learn to push outputs far outside [0,1], which is numerically unstable and harder to interpret."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 6.2: Output Units"},{"section":"deep-learning","difficulty":"easy","id":"dl-e006","topicSlug":"activation-functions","orderIndex":6,"topic":"Activation Functions","question":"A network uses ReLU activations throughout. During training, you observe that 40% of neurons in layer 3 always output exactly 0 regardless of the training example. What is this phenomenon called, and what is the most direct single change to address it?","options":{"A":"Gradient clipping — apply norm-based clipping to prevent neurons from dying","B":"This is the \"dying ReLU\" problem. When a neuron's pre-activation z is consistently negative (due to large negative bias or large negative weight updates), ReLU outputs 0, and the gradient ∂ReLU/∂z = 0. No gradient flows, so weights don't update, locking the neuron in the dead state. The most direct fix: switch to Leaky ReLU (f(z) = max(αz, z) with α=0.01), which allows a small gradient even for z < 0","C":"Batch normalization — apply BN before ReLU to center the pre-activations around zero","D":"This is vanishing gradient; fix by reducing the number of layers"},"correct":"B","explanation":{"correct":"- Dead ReLU mechanism: `ReLU'(z) = 1 if z > 0, else 0`. If a neuron's z is always ≤ 0, the gradient is always 0 — weight update = 0 forever. The neuron is \"dead.\"\n- Common causes: large learning rate causing large weight updates that push z negative; poor initialization (all-negative biases for that layer).\n- Leaky ReLU fix: `f(z) = z if z > 0 else αz`, `f'(z) = 1 if z > 0 else α`. 
Even when z < 0, a non-zero gradient (α=0.01) flows, allowing the neuron to recover.\n- Alternative: better initialization (Kaiming) reduces the chance of neurons dying from the start.","A":"Gradient clipping prevents exploding gradients (very large gradients). Dead ReLU neurons have zero gradient, not large gradient — clipping does nothing for neurons that are already dead.","B":"","C":"BN before ReLU helps prevent many neurons from dying by keeping z centered near 0. However, the most direct architectural fix specifically targeting dying ReLU is Leaky ReLU, not BN.","D":"Dead ReLU is not the vanishing gradient problem. Vanishing gradient is about gradients becoming exponentially small as they flow backward through many layers. Dead ReLU is about specific neurons with permanently zero gradient."},"reference":"- Maas et al., \"Rectifier Nonlinearities Improve Neural Network Acoustic Models\" (2013) — Leaky ReLU"},{"section":"deep-learning","difficulty":"easy","id":"dl-e007","topicSlug":"activation-functions","orderIndex":7,"topic":"Activation Functions","question":"BERT uses GELU activation, while original ResNet uses ReLU. A student asks: \"Can I replace BERT's GELU with ReLU without hurting performance?\" You know GELU = x·Φ(x) where Φ is the Gaussian CDF. What property of GELU makes it preferred in Transformer architectures?","options":{"A":"GELU is computationally cheaper than ReLU, which speeds up large Transformer training","B":"GELU provides a smooth, probabilistic gating: it multiplies x by the probability that x is positive under a Gaussian, giving a smooth approximation of ReLU. This smooth gradient near zero allows more information to pass through during backpropagation compared to the hard zero gradient of ReLU for z < 0. Empirically, GELU outperforms ReLU in Transformer architectures (BERT, GPT) though the difference in CNNs is smaller. 
Replacing BERT's GELU with ReLU would likely cause a small but measurable performance drop.","C":"GELU prevents the dying neuron problem by outputting negative values for negative inputs","D":"GELU and ReLU are interchangeable; the architectural context doesn't matter for activation choice"},"correct":"B","explanation":{"correct":"- GELU(x) = x·Φ(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))). This is smooth everywhere (unlike ReLU's kink at 0).\n- Stochastic interpretation: GELU is the expected value of a Bernoulli-gated linear unit where the gate probability = Φ(x). Small positive inputs are probabilistically suppressed.\n- For Transformers: the smooth gradient behavior near zero matters because attention weights produce many near-zero pre-activations that benefit from smooth gradient flow.\n- GELU ≈ ReLU for large |x|, but the smooth transition region matters for training dynamics.","A":"GELU is actually slightly more expensive than ReLU (involves CDF computation or a polynomial approximation). The preference is not computational.","B":"","C":"GELU does produce negative outputs for some negative inputs (where x·Φ(x) is a small negative number near 0). However, this is not the main reason it's preferred. The smoothness is the key property.","D":"Activation choice interacts with architecture. GELU and SiLU work better in Transformers; ReLU works well in CNNs. The choice is empirically architecture-specific."},"reference":"- Hendrycks & Gimpel, \"Gaussian Error Linear Units (GELUs)\" (2016): https://arxiv.org/abs/1606.08415"},{"section":"deep-learning","difficulty":"easy","id":"dl-e008","topicSlug":"forward-propagation","orderIndex":8,"topic":"Forward Propagation","question":"A linear layer in PyTorch is defined as `nn.Linear(128, 64)`. You feed a batch of 32 samples: `x.shape = (32, 128)`. PyTorch computes `output = x @ W.T + b`. 
What is the shape of W, W.T, b, and the final output?","options":{"A":"W: (128, 64), W.T: (64, 128), b: (64,), output: (32, 64)","B":"W: (64, 128), W.T: (128, 64), b: (64,), output: (32, 64)","C":"W: (64, 128), W.T: (128, 64), b: (32,), output: (32, 64)","D":"W: (128, 64), W.T: (64, 128), b: (128,), output: (32, 128)"},"correct":"B","explanation":{"correct":"- PyTorch's `nn.Linear(in_features, out_features)` stores W with shape (out_features, in_features) = (64, 128). This is because the computation is `output = x @ W.T + b`.\n- x @ W.T: (32, 128) @ (128, 64) = (32, 64). ✓\n- b has shape (64,) — one bias per output feature, broadcast across the batch.\n- Note: W.shape = (64, 128) in PyTorch, NOT (128, 64) as you might expect from the math z = Wx + b (which uses column vectors). PyTorch uses row vectors and transposes W.","A":"Swaps W and W.T shapes. PyTorch stores W as (out, in) = (64, 128), so W.T = (128, 64). If W were (128, 64), then x @ W.T would be (32, 128) @ (64, 128) — incompatible shapes.","B":"","C":"Bias b has shape (out_features,) = (64,), not (batch_size,) = (32,). The bias is broadcast across the batch, not assigned per sample.","D":"Output shape (32, 128) would require the layer to output 128 features, not 64. The defined layer maps 128→64, so the output must be (32, 64)."},"reference":"- PyTorch docs: `torch.nn.Linear` — https://pytorch.org/docs/stable/generated/torch.nn.Linear.html"},{"section":"deep-learning","difficulty":"easy","id":"dl-e009","topicSlug":"forward-propagation","orderIndex":9,"topic":"Forward Propagation","question":"You call `model.eval()` and then run inference. You forgot to also wrap the inference in `with torch.no_grad()`. 
What is the practical consequence, and what is the purpose of each call?","options":{"A":"Without `torch.no_grad()`, the model computes wrong outputs because gradients affect the forward pass","B":"`model.eval()` changes model behavior (turns off Dropout random masking, uses BatchNorm running stats instead of batch stats). `torch.no_grad()` disables gradient computation, reducing memory and speeding up inference. Forgetting `torch.no_grad()`: model behavior is correct (eval() handles that), but PyTorch still builds a computational graph and stores activations for potential backward pass — wasting memory and slowing inference, but not producing wrong results","C":"`torch.no_grad()` is identical to `model.eval()`; calling one is sufficient","D":"Forgetting `torch.no_grad()` causes an error at the end of the inference loop"},"correct":"B","explanation":{"correct":"- `model.eval()`: changes module behavior. Dropout becomes identity (no masking). BatchNorm uses stored running_mean/running_var instead of batch statistics. This ensures deterministic, reproducible inference.\n- `torch.no_grad()`: tells autograd not to track operations in the context. No computation graph is built, and intermediate activations for backprop are not stored. This saves memory (no activation storage) and computation (no gradient bookkeeping).\n- Combined: correct behavior (eval) + memory efficiency (no_grad). Missing no_grad: memory usage stays high (all activations cached), but outputs are numerically identical.","A":"Gradients don't affect forward pass computation values. The gradient tensor is separate from the value tensor. Forward pass output = same regardless of whether autograd is enabled.","B":"","C":"They have completely different functions. eval() affects layer behavior (Dropout, BN). no_grad() affects memory management. Neither subsumes the other.","D":"PyTorch doesn't raise an error for running inference without no_grad(). 
It simply uses more memory than necessary."},"reference":"- PyTorch docs: `torch.no_grad()` and `Module.eval()`"},{"section":"deep-learning","difficulty":"easy","id":"dl-e010","topicSlug":"loss-and-cost-functions","orderIndex":10,"topic":"Loss And Cost Functions","question":"You train a regression model to predict house prices (in dollars, range 50K–2M). You use MSE loss. A colleague suggests switching to MAE loss. Under what condition would MSE be worse than MAE, and what is the key difference in their gradient behavior?","options":{"A":"MSE is always worse than MAE for regression; always use MAE","B":"MSE is worse than MAE when the dataset has outliers (e.g., a few houses worth $20M). MSE squares the error: a $5M error contributes 25× more loss than a $1M error (5² vs 1²). This causes the model to shift its predictions toward outliers. MAE treats all errors proportionally (a $5M error is 5× a $1M error). Gradient difference: MSE gradient = 2(ŷ - y), which grows with error magnitude. MAE gradient = ±1 (constant regardless of error magnitude). Near zero error, MAE's subgradient creates optimization difficulties (no smooth minimum).","C":"MAE is worse than MSE because its gradient is always ±1, making it slower to converge","D":"MSE and MAE produce identical optimal models; the difference is only in numerical stability"},"correct":"B","explanation":{"correct":"- Outlier sensitivity: MSE minimizer is the conditional mean E[y|x], which is pulled toward outliers. MAE minimizer is the conditional median, which is robust to outliers.\n- Gradient comparison: MSE ∂L/∂ŷ = 2(ŷ-y) — proportional to error. Large errors → large gradient updates (fast learning, but outliers dominate). MAE ∂L/∂ŷ = sign(ŷ-y) = ±1 — constant regardless of error magnitude.\n- MAE downside: the constant gradient doesn't provide fine-grained signal near the optimum, potentially causing oscillation around the true minimum. 
Huber loss combines both behaviors.","A":"MSE is preferred when the data has no significant outliers (most values near the mean) and you want a smooth, easily optimizable loss. \"Always use MAE\" is an oversimplification.","B":"","C":"MAE's constant gradient is a weakness for optimization (no smooth minimum) but doesn't make MAE worse than MSE across all scenarios. The choice depends on the data distribution and outlier sensitivity requirements.","D":"MSE and MAE produce different optimal solutions: MSE is minimized by the conditional mean, MAE by the conditional median. These are different statistics and will differ when the distribution is skewed."},"reference":"- Huber, \"Robust Estimation of a Location Parameter\" (1964) — motivation for Huber loss"},{"section":"deep-learning","difficulty":"easy","id":"dl-e011","topicSlug":"loss-and-cost-functions","orderIndex":11,"topic":"Loss And Cost Functions","question":"A model predicts class probabilities [0.7, 0.2, 0.1] for a 3-class problem. The true label is class 0 (one-hot: [1, 0, 0]). Calculate the cross-entropy loss and explain what happens to the loss if the model becomes more confident in the correct class (predicts [0.99, 0.005, 0.005]).","options":{"A":"CE = -(0.7·log0.7 + 0.2·log0.2 + 0.1·log0.1) ≈ 0.80. More confidence → higher loss because all terms are included","B":"CE = -log(0.7) ≈ 0.357. With prediction [0.99, 0.005, 0.005]: CE = -log(0.99) ≈ 0.010. More confidence in the correct class → lower loss. Cross-entropy only cares about the probability assigned to the true class","C":"CE = -(log0.7 + log0.2 + log0.1) ≈ 4.27. The loss increases when probabilities are more concentrated","D":"CE = -(0.7 + 0.2 + 0.1) = -1.0. Loss is constant since probabilities always sum to 1"},"correct":"B","explanation":{"correct":"- Cross-entropy formula: CE = -Σ y_k log(p_k), using the natural log. With one-hot y=[1,0,0]: only the k=0 term survives: CE = -1·log(p_0) = -log(0.7) ≈ 0.357.\n- More confident in true class: p_0 = 0.99 → CE = -log(0.99) ≈ 0.010. 
The loss decreases as confidence in the correct class increases.\n- This is why cross-entropy works so well with softmax: the model is penalized based solely on how much probability it assigns to the correct class.","A":"The formula used in A is the entropy of the predicted distribution, not cross-entropy against the true labels. With one-hot labels, terms for wrong classes multiply by 0 and drop out.","B":"","C":"This sums the log-probabilities of all classes instead of weighting each one by its label. With a one-hot label, only the true class's log-probability contributes.","D":"Cross-entropy is not constant. As probabilities change (while summing to 1), the loss changes. The true class probability determines the loss."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 6.2.1: Cross-Entropy Loss"},{"section":"deep-learning","difficulty":"easy","id":"dl-e012","topicSlug":"loss-and-cost-functions","orderIndex":12,"topic":"Loss And Cost Functions","question":"You are training a medical image classifier. Of 10,000 training examples, only 100 have the disease (class 1). You use standard cross-entropy loss and get 99% accuracy, but the model never predicts class 1. What is wrong, and what is the simplest fix to the loss function?","options":{"A":"The model needs more training epochs; 99% accuracy with no disease predictions means under-training","B":"The model has learned to predict class 0 (healthy) for everything — achieving 99% accuracy by ignoring the rare class. This is the class imbalance problem. Standard cross-entropy treats all examples equally, so the 9,900 healthy examples dominate training. Fix: use weighted cross-entropy with class weight = N_total / N_class: weight_0 = 10000/9900 ≈ 1.01, weight_1 = 10000/100 = 100. 
Now each disease example contributes about 99× more to the loss than a healthy example, forcing the model to learn class 1.","C":"Switch from cross-entropy to MSE loss — MSE handles imbalanced classes better","D":"The issue is the learning rate; lower it to allow the model to detect the rare class"},"correct":"B","explanation":{"correct":"- Imbalance effect: the loss landscape is dominated by the majority class. Predicting class 0 for everything already yields a low overall loss (9900/10000 examples are correct). Gradients from the 100 disease examples are overwhelmed by gradients from the 9900 healthy examples.\n- Weighted CE: multiply each example's loss by its class weight. `L = -Σ w_k · y_k · log(p_k)`. Alternatively, use focal loss which down-weights easy (well-classified) examples.\n- Evaluation: for imbalanced problems, use F1 score, AUROC, or precision-recall AUC instead of accuracy.","A":"Training longer without addressing the imbalance would just make the model more confident in always predicting class 0. The 99% accuracy already reflects convergence to this degenerate solution.","B":"","C":"MSE for classification has its own problems and doesn't inherently address class imbalance better than CE. Focal loss is a widely used alternative designed for imbalanced data.","D":"Learning rate doesn't address the fundamental signal imbalance. Even with a perfect learning rate, 9900 examples of class 0 will overpower 100 examples of class 1 unless the loss is reweighted."},"reference":"- Lin et al., \"Focal Loss for Dense Object Detection (RetinaNet)\" (2017): https://arxiv.org/abs/1708.02002"},{"section":"deep-learning","difficulty":"easy","id":"dl-e013","topicSlug":"backpropagation","orderIndex":13,"topic":"Backpropagation","question":"In PyTorch, you compute `loss.backward()` twice in a row without calling `optimizer.zero_grad()` between iterations. 
What happens to the gradients, and why is `optimizer.zero_grad()` called at the start of each training iteration?","options":{"A":"The second `backward()` call resets and recomputes gradients from scratch — previous gradients are overwritten","B":"PyTorch accumulates gradients by default: each `backward()` call adds to the existing `.grad` attribute. After two `backward()` calls, `param.grad` = sum of both gradient computations. Without `zero_grad()`, gradients from the previous batch add to those of the current batch, effectively doubling (or more) the effective gradient, causing incorrect updates. `zero_grad()` clears all `.grad` values before computing the new gradient — ensuring each update uses only the current batch's gradient","C":"PyTorch raises a RuntimeError on the second `backward()` call","D":"The second `backward()` call divides the gradient by 2 to compensate for calling twice"},"correct":"B","explanation":{"correct":"- Gradient accumulation: PyTorch's design choice. `param.grad += new_gradient` at each `.backward()`. This is actually useful for gradient accumulation over micro-batches (to simulate large batch training with limited memory).\n- Bug from forgetting `zero_grad()`: iteration 1 gradient g₁ accumulates; iteration 2 computes g₂ but param.grad = g₁ + g₂. The optimizer step uses this sum, making the effective learning rate larger than intended and the update direction incorrect.\n- Standard pattern: `optimizer.zero_grad()` → `loss = criterion(model(x), y)` → `loss.backward()` → `optimizer.step()`, where `criterion` is the loss function and `y` the targets.","A":"This is incorrect. PyTorch accumulates, not overwrites. The official docs note that `backward()` accumulates gradients in the leaf tensors. Overwriting would require explicit `param.grad = None` or `zero_grad()`.","B":"","C":"PyTorch allows multiple `backward()` calls if the graph is retained (retain_graph=True) or if the graph is rebuilt each forward pass. 
No error is raised — but incorrect results occur.","D":"PyTorch has no such averaging behavior for gradients from multiple backward passes."},"reference":"- PyTorch docs: `Optimizer.zero_grad()` — https://pytorch.org/docs/stable/optim.html"},{"section":"deep-learning","difficulty":"easy","id":"dl-e014","topicSlug":"backpropagation","orderIndex":14,"topic":"Backpropagation","question":"You write a custom loss function that includes `math.log(pred)` (Python's math module, not torch). During training, the loss computes correctly but `loss.backward()` raises an error. What is the problem?","options":{"A":"`math.log` is not differentiable; you must use a polynomial approximation","B":"`math.log` converts the tensor to a plain Python float, breaking the PyTorch autograd computation graph. Autograd tracks operations on tensors; once a tensor is converted to a Python float (which `math.log` does), the graph connection is severed. `loss.backward()` cannot compute gradients through operations not recorded in the graph. Fix: replace `math.log(pred)` with `torch.log(pred)`, which records the log operation in the autograd graph.","C":"`math.log` returns a negative value for probabilities, causing a NaN in backprop","D":"The error occurs because `math.log` requires integer inputs; use `float()` to convert first"},"correct":"B","explanation":{"correct":"- Autograd graph: PyTorch builds a directed acyclic graph of tensor operations during the forward pass. Each tensor operation returns a new tensor with a `.grad_fn` pointing to the operation. `torch.log(t)` records a `LogBackward` node.\n- `math.log(t)`: first extracts the scalar value from `t` (calling `.item()` implicitly), then applies Python's math.log. The result is a plain Python float with no `.grad_fn`. The autograd graph is severed.\n- `loss.backward()` traverses the graph from the loss tensor. 
If the graph ends prematurely (float stops graph), it cannot compute gradients for earlier tensors.","A":"`math.log` is mathematically differentiable (d/dx log x = 1/x). The problem is not mathematical differentiability but PyTorch's ability to *track* the operation in its automatic differentiation graph.","B":"","C":"For predictions p ∈ (0,1), log(p) is negative — this is expected in cross-entropy loss. Negative log probability is not a NaN. The error is about graph tracking, not sign.","D":"`math.log` works on floats. The problem isn't input type — it's that extracting a float breaks the computation graph."},"reference":"- PyTorch docs: Autograd Mechanics — https://pytorch.org/docs/stable/notes/autograd.html"},{"section":"deep-learning","difficulty":"easy","id":"dl-e015","topicSlug":"optimizers","orderIndex":15,"topic":"Optimizers","question":"You train a model with SGD and observe that the loss oscillates: 0.8 → 1.2 → 0.7 → 1.3 → 0.6 → 1.4. The loss is decreasing on average but oscillating wildly. What is the most likely cause and what single hyperparameter change would smooth training?","options":{"A":"The model has too many parameters; reduce model size to fix oscillation","B":"The learning rate is too high. Large LR causes updates to overshoot the minimum — the loss decreases then bounces past the optimal point in the loss landscape. The optimizer jumps back and forth across the minimum. Fix: reduce the learning rate (e.g., by 10×). Oscillating loss with downward trend is a classic too-high LR signature. Adding momentum can also help by smoothing the update direction.","C":"The batch size is too small; increase to full-batch gradient descent","D":"The oscillation is caused by the ReLU activation; switch to sigmoid"},"correct":"B","explanation":{"correct":"- Overshoot mechanism: at the minimum, the gradient is zero. When LR is high, the step size is large — the optimizer lands on the other side of the minimum where the gradient points back. 
This creates oscillation.\n- Decreasing average trend: despite oscillation, the optimizer is slowly finding lower regions. This is why reducing LR stabilizes training — smaller steps approach the minimum without overshooting.\n- LR diagnosis: smooth loss = LR OK; oscillating loss = LR too high; no decrease = LR too low or gradient issue.","A":"Model size doesn't cause loss oscillation patterns. Large models may overfit, but overfitting shows as train loss decreasing while val loss increases — not the oscillation pattern described.","B":"","C":"Small batch size does introduce gradient noise, but the pattern for noisy gradients is more random (not the regular back-and-forth oscillation). Full batch gradient descent would remove noise but also remove regularization benefits.","D":"Activation function doesn't cause the per-iteration oscillation pattern. ReLU provides sparse, stable activations. Switching activations wouldn't fix an optimizer overshoot problem."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 8.3: Basic Algorithms"},{"section":"deep-learning","difficulty":"easy","id":"dl-e016","topicSlug":"optimizers","orderIndex":16,"topic":"Optimizers","question":"Adam optimizer uses bias correction for its moment estimates: m̂_t = m_t / (1 - β₁ᵗ) and v̂_t = v_t / (1 - β₂ᵗ). Why is this correction needed specifically in the first few steps, and what happens to the correction factor as t → ∞?","options":{"A":"Bias correction scales up the learning rate to compensate for small gradients at initialization","B":"At t=1, m₁ = (1-β₁)g₁ (initialized from zero). Without correction, m₁ underestimates the true gradient because it is scaled by (1-β₁) ≈ 0.1 (β₁=0.9). Bias correction: m̂₁ = m₁/(1-β₁¹) = m₁/0.1 = 10·m₁, restoring the true scale. As t→∞, β₁ᵗ → 0, so (1-β₁ᵗ) → 1, and m̂_t → m_t (correction factor becomes 1). 
The correction matters mainly in early steps and disappears asymptotically.","C":"Bias correction is only needed when gradients are very small (< 1e-8); otherwise it has no effect","D":"The correction normalizes the update to be between -1 and +1 at all times"},"correct":"B","explanation":{"correct":"- Exponential moving average startup: m_t = β₁ m_{t-1} + (1-β₁) g_t. Starting from m_0=0:\n- t=1: m₁ = (1-β₁)g₁ ≈ 0.1g₁ (if β₁=0.9). Underestimates by factor 10.\n- t=2: m₂ = β₁(1-β₁)g₁ + (1-β₁)g₂. Still underestimates.\n- t=100: m₁₀₀ ≈ m̂₁₀₀ (correction ≈ 1 since β₁¹⁰⁰ ≈ 0.000027).\n- The correction is critical early in training when the EMA hasn't had time to accumulate enough history to represent the true running average.","A":"Bias correction doesn't scale the learning rate — it corrects the moment estimates. The effective step size is m̂_t / (√v̂_t + ε) × η, where η is the fixed LR.","B":"","C":"Bias correction applies to all gradient magnitudes uniformly. It's always active — the division by (1-β₁ᵗ) happens regardless of gradient magnitude.","D":"Bias-corrected Adam updates can take values larger or smaller than ±1 depending on the gradient and the second moment estimate. There's no clipping to [-1,1]."},"reference":"- Kingma & Ba, \"Adam: A Method for Stochastic Optimization\" (2015): https://arxiv.org/abs/1412.6980"},{"section":"deep-learning","difficulty":"easy","id":"dl-e017","topicSlug":"optimizers","orderIndex":17,"topic":"Optimizers","question":"You train the same CNN on CIFAR-10 using (A) SGD with momentum=0.9, lr=0.1 and (B) Adam with lr=0.001. After 100 epochs, SGD achieves 93% test accuracy and Adam achieves 91%. Why might SGD outperform Adam on image classification despite Adam's adaptive learning rates?","options":{"A":"Adam has a bug for image classification tasks; use SGD by default","B":"SGD with momentum tends to find flatter minima than Adam on vision tasks. 
Flat minima tend to generalize better (small perturbations in weights → small change in loss) than sharp minima. Adam's per-parameter adaptive step sizes can cause it to converge to sharper minima faster. Additionally, Adam's per-parameter normalization by √v_t changes the effective step size for each weight (v_t is an exponential moving average, not a growing sum as in Adagrad), which has been linked empirically to worse generalization on some vision tasks. Many image classification benchmarks (ImageNet, CIFAR) show SGD + momentum + LR schedule outperforms Adam in final accuracy, though Adam converges faster early on.","C":"SGD uses the entire dataset while Adam uses mini-batches; larger data = better accuracy","D":"Adam requires 10× more memory than SGD, causing memory errors that reduce accuracy"},"correct":"B","explanation":{"correct":"- Sharp vs flat minima: Adam's adaptive updates can exploit gradient information more efficiently in each step, but they may converge to sharper loss basins. Flat minima are associated with better generalization (Hochreiter & Schmidhuber 1997, Keskar et al. 2017).\n- The SGD+momentum advantage in vision: SGD converges more slowly but often to flatter, better-generalizing solutions. It's also more sensitive to learning rate schedule, which is why LR schedules (cosine, step decay) are critical for SGD.\n- Practical tip: Adam is often better for NLP (Transformers), SGD is often better for vision (CNNs). This is a frequently reported empirical pattern, not a universal rule.","A":"Adam has no \"bug\" for image classification. It's a valid optimizer. The difference is in the optimization landscape, not a software defect.","B":"","C":"Both SGD and Adam use mini-batches in standard deep learning practice. The comparison is between the adaptive vs non-adaptive update rules, not batch usage.","D":"Adam stores first and second moment vectors (two state buffers per parameter, versus one momentum buffer for SGD with momentum). 
This is a 2× increase, not 10×, and it doesn't cause accuracy differences — it's a memory concern at scale."},"reference":"- Keskar et al., \"On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima\" (2017): https://arxiv.org/abs/1609.04836"},{"section":"deep-learning","difficulty":"easy","id":"dl-e018","topicSlug":"ann-architectures","orderIndex":18,"topic":"Ann Architectures","question":"The Universal Approximation Theorem (UAT) states that a single hidden layer neural network with enough neurons can approximate any continuous function on a compact domain. A student concludes \"therefore, deep networks are unnecessary — we just need enough neurons in one hidden layer.\" What is wrong with this conclusion?","options":{"A":"The UAT is incorrect; neural networks cannot approximate arbitrary functions","B":"The UAT proves existence, not practicality. While a single hidden layer CAN approximate any function, the number of neurons required may be exponential in the input dimension. Deep networks can approximate the same function with exponentially fewer neurons by composing simpler sub-functions hierarchically. The UAT also says nothing about learnability via gradient descent — a theoretically sufficient shallow network may be practically untrainable","C":"The UAT applies only to linear activation functions; ReLU networks cannot approximate arbitrary functions","D":"The UAT is correct; a shallow network with enough neurons is always better than a deep network for any task"},"correct":"B","explanation":{"correct":"- Existence vs construction: the theorem guarantees a set of weights exists, but doesn't guarantee gradient descent will find them, or that a practical number of neurons is sufficient.\n- Depth efficiency: Montufar et al. (2014) showed that deep ReLU networks can represent exponentially more linear regions than shallow networks with the same parameter count. 
A function requiring N neurons shallowly may need only O(log N) per layer with depth.\n- Practical implication: depth is not just about theoretical expressive power — it's about learning efficiency. Hierarchical features (edges → textures → parts → objects in CNNs) are naturally learned by depth.","A":"The UAT is well-proven and widely accepted. The issue is its practical implications, not its correctness.","B":"","C":"The UAT was originally proved for sigmoid (Hornik et al. 1989) and later for ReLU and many other activation functions. ReLU networks are universal approximators.","D":"This is the student's incorrect conclusion. Deep networks usually outperform wide shallow networks on complex tasks with equal or fewer parameters."},"reference":"- Cybenko, \"Approximation by Superpositions of a Sigmoidal Function\" (1989)\n- Montufar et al., \"On the Number of Linear Regions of Deep Neural Networks\" (2014): https://arxiv.org/abs/1402.1869"},{"section":"deep-learning","difficulty":"easy","id":"dl-e019","topicSlug":"ann-architectures","orderIndex":19,"topic":"Ann Architectures","question":"A model achieves 95% training accuracy and 60% validation accuracy. A colleague says \"train longer to close the gap.\" Another says \"make the model smaller.\" What is the correct diagnosis, and what are two complementary fixes?","options":{"A":"The model is underfitting; add more layers to increase capacity","B":"This is overfitting: the model has memorized training examples without generalizing. High train accuracy with much lower val accuracy = train/validation gap = overfitting. Fix 1: regularization — add L2 weight decay or Dropout to penalize memorization. Fix 2: reduce model capacity (fewer layers/neurons) so the model is forced to learn patterns present in validation data too. 
Alternatively: get more training data, or use data augmentation.","C":"The model is correct — 60% validation accuracy is expected with 95% training accuracy; this gap is normal","D":"This is underfitting; the model needs more training data to improve validation accuracy"},"correct":"B","explanation":{"correct":"- Overfitting signature: train accuracy >> val accuracy. The model has learned training-specific patterns (noise, memorized examples) that don't generalize.\n- Why \"train longer\" doesn't help: more training epochs on the same data will increase training accuracy further (potentially to 99%) while validation accuracy may decrease further (the model memorizes more).\n- Why \"smaller model\" helps: a model with fewer parameters is forced to learn the most statistically reliable patterns, which tend to generalize. This is the bias-variance tradeoff — smaller models have higher bias but lower variance.","A":"95% training accuracy indicates the model is learning well from training data (not underfitting). Underfitting = low training accuracy. The problem is the 35% train/val gap.","B":"","C":"A 35% gap between train and val accuracy is a significant overfitting indicator, not normal. Typical acceptable gaps depend on the task, but 35% is large for most real-world tasks.","D":"Underfitting = model too simple to fit training data (low train accuracy). Here, train accuracy is high. More data helps with overfitting but doesn't fix the cause (model capacity)."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 5.2: Capacity, Overfitting, and Underfitting"},{"section":"deep-learning","difficulty":"easy","id":"dl-e020","topicSlug":"regularization-and-normalization","orderIndex":20,"topic":"Regularization And Normalization","question":"At inference time, a model with Dropout layers (p=0.5) produces different outputs for the same input when run twice. The model is in training mode. 
What simple fix is needed, and what would happen to output values if you correctly switch to eval mode but forget to scale activations?","options":{"A":"This is expected; models always produce different outputs for the same input","B":"Fix: call `model.eval()` before inference. This disables the random dropout mask — all neurons are active during evaluation. Scale concern: PyTorch uses inverted dropout (applies scale 1/(1-p) during training), so at eval, no scaling is needed — outputs are already correctly scaled. If you manually implemented non-inverted dropout (scale at eval), forgetting the scale factor would make eval activations 2× larger than training activations (for p=0.5), causing miscalibrated predictions.","C":"Call `torch.manual_seed(42)` before each inference run to make dropout deterministic","D":"Remove dropout layers from the model architecture after training; they are only needed during training"},"correct":"B","explanation":{"correct":"- Inverted dropout (PyTorch default): during training, active neurons are scaled up by 1/(1-p). During eval, all neurons are active with no scaling — the expected activation value is the same as in training.\n- Without inverted dropout (scale at eval): during training, activations are unscaled. At eval, all neurons are active → activations are (1/(1-p))× larger than training. The model's calibration (e.g., softmax temperatures) is off.\n- `model.eval()` is the correct fix. It sets `training=False` for all modules, which disables the Bernoulli mask in `nn.Dropout`.","A":"Stochastic outputs are a bug for production inference (except in Monte Carlo Dropout for uncertainty estimation). For standard inference, deterministic outputs are required.","B":"","C":"Setting a random seed makes the dropout mask deterministic but not necessarily correct — it would produce a specific masked output, not the intended full-network output. 
The model would still not use all neurons.","D":"Removing dropout layers is bad practice — it would change the model architecture and may break saved weights/configurations. Use eval() instead."},"reference":"- Srivastava et al., \"Dropout: A Simple Way to Prevent Neural Networks from Overfitting\" (2014): https://www.jmlr.org/papers/v15/srivastava14a.html"},{"section":"deep-learning","difficulty":"easy","id":"dl-e021","topicSlug":"regularization-and-normalization","orderIndex":21,"topic":"Regularization And Normalization","question":"BatchNorm stores `running_mean` and `running_var` during training (updated via momentum=0.1). At inference, these running stats are used instead of batch statistics. If you accidentally continue training after loading a model checkpoint (with frozen BN layers), but the running stats were computed on a different dataset, what goes wrong?","options":{"A":"Nothing — running stats are always recomputed at inference so old stats don't matter","B":"Frozen BN layers (in eval mode) use their stored running_mean/running_var at inference. If these stats were computed on a different domain (e.g., pretraining on ImageNet, fine-tuning on X-rays), the normalization divides by the wrong distribution statistics. X-ray pixel intensities differ from ImageNet pixel statistics — using ImageNet mean/var to normalize X-ray inputs would produce badly normalized features, degrading model performance. 
Fix: either (1) unfreeze BN and let it update running stats during fine-tuning, or (2) recalculate running stats by doing a forward pass through the training data with model in train mode but optimizer disabled.","C":"Running stats are automatically adapted when you load a checkpoint, so domain mismatch is impossible","D":"BN running stats only affect the bias term; the scale is learned and adapts during fine-tuning"},"correct":"B","explanation":{"correct":"- Running stats purpose: at inference, batch statistics are unavailable (inference may process single samples). Running stats provide an approximation of the training set's feature distribution.\n- Domain mismatch: if ImageNet running_mean[0] = 0.485 and X-ray running_mean[0] = 0.1 (different brightness distribution), BN will incorrectly normalize X-ray features using ImageNet statistics.\n- Fine-tuning BN: when fine-tuning for a new domain, BN layers should generally be unfrozen to allow running stats to adapt. Small datasets sometimes freeze BN to avoid overfitting.","A":"Running stats are NOT recomputed at inference. In eval mode, BN uses the stored running_mean and running_var, which were accumulated during training.","B":"","C":"Loading a checkpoint restores the stored running stats from the checkpoint — it doesn't recompute or adapt them.","D":"BN has two learnable parameters: γ (scale) and β (shift). It also has two non-learnable running buffers: running_mean and running_var. Both the normalization statistics AND the learned scale/shift are domain-dependent."},"reference":"- Ioffe & Szegedy, \"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift\" (2015): https://arxiv.org/abs/1502.03167"},{"section":"deep-learning","difficulty":"easy","id":"dl-e022","topicSlug":"regularization-and-normalization","orderIndex":22,"topic":"Regularization And Normalization","question":"You apply LayerNorm to a Transformer's input sequence. 
The input is shape (batch=2, seq_len=10, d_model=512). LayerNorm normalizes over the last dimension (d_model). What exactly is being normalized, and why is this preferred over BatchNorm for NLP tasks?","options":{"A":"LayerNorm averages across the batch dimension; BatchNorm averages across the d_model dimension","B":"LayerNorm normalizes each (batch, position) pair independently over its 512 features. For token (b=0, t=3), it computes mean and std over the 512 feature values of that one token, then normalizes. This is per-token, per-sample normalization. BatchNorm would normalize over the batch×sequence dimensions for each feature — requiring consistent batch statistics across all positions and samples. For NLP, different positions have very different distributions (punctuation vs content words), and batch stats are computed over variable-length sequences making BN unstable. LayerNorm is independent of batch size and sequence length, making it ideal for variable-length NLP tasks.","C":"LayerNorm and BatchNorm produce identical outputs for NLP tasks; choose either","D":"LayerNorm normalizes across the sequence dimension; each feature is normalized across all 10 positions"},"correct":"B","explanation":{"correct":"- LayerNorm(x)_i = (x_i - μ) / σ × γ_i + β_i, where μ and σ are computed over the feature dimension for one token at a time.\n- BatchNorm for NLP problems: (1) batch stats are unstable for small batches or variable-length sequences; (2) at inference with batch size=1, batch statistics are undefined — must use running stats; (3) different positions in a sequence have different semantic roles — pooling their statistics corrupts the representation.\n- LayerNorm advantages: works with any batch size (including 1), independent of sequence length, position-agnostic — each token normalized by its own feature statistics.","A":"This swaps the normalization axes. 
LayerNorm normalizes over features (d_model); BatchNorm normalizes over the batch and spatial/sequence dimensions for each feature.","B":"","C":"They produce different outputs because they normalize over different dimensions. The statistics (mean, variance) are computed from different sets of values.","D":"Normalizing across sequence positions (10 tokens) is a different variant — it would conflate information from different positions. Standard LayerNorm as used in Transformers normalizes over the feature (d_model) dimension."},"reference":"- Ba et al., \"Layer Normalization\" (2016): https://arxiv.org/abs/1607.06450"},{"section":"deep-learning","difficulty":"easy","id":"dl-e023","topicSlug":"weight-initialization","orderIndex":23,"topic":"Weight Initialization","question":"Xavier (Glorot) initialization sets `Var(w) = 2 / (fan_in + fan_out)`. Kaiming (He) initialization sets `Var(w) = 2 / fan_in`. What is the key architectural assumption that differentiates when each is appropriate?","options":{"A":"Xavier is for recurrent networks; Kaiming is for convolutional networks","B":"Xavier assumes a symmetric activation function (like tanh or sigmoid) where the positive and negative parts of the activation have equal variance contribution. Kaiming accounts for ReLU's asymmetry: ReLU zeros out half the neurons, effectively halving the variance. Setting Var(w) = 2/fan_in compensates for this halving. Using Xavier with ReLU: variance shrinks by half per layer → vanishing activations in deep networks. 
Using Kaiming with tanh: variance is slightly too large → mild exploding activations in very deep networks.","C":"The only difference is that Xavier uses fan_in + fan_out while Kaiming uses only fan_in; both work with any activation","D":"Xavier is for the first layer; Kaiming is for all subsequent layers"},"correct":"B","explanation":{"correct":"- Variance analysis for ReLU: if z ~ N(0, σ²), then ReLU(z) has second moment E[ReLU(z)²] = σ²/2 (ReLU zeroes the negative half of a symmetric distribution, which carries half of E[z²]). For zero-mean weights, the next layer's pre-activation variance is Var(z_next) = fan_in × Var(w) × E[a²]. Keeping Var(z_next) = Var(z) with E[a²] = Var(z)/2 requires an extra factor of 2: Var(w) = 2/fan_in.\n- For tanh: the derivative is near 1 at initialization (if inputs are small), so variance is approximately preserved without the factor of 2. Sigmoid is neither zero-centered nor unit-slope at zero (its derivative there is 0.25), so Xavier is only a rough fit for it. Xavier's 2/(fan_in + fan_out) is a compromise to maintain variance in both forward and backward passes.","A":"The choice is activation-function-based, not architecture-based. CNNs can use either Xavier or Kaiming depending on which activation they use. Kaiming is preferred when using ReLU regardless of whether it's a CNN or MLP.","B":"","C":"The denominator difference (fan_in vs fan_in + fan_out) matters. Using Kaiming with tanh in a deep network would give slightly too-high variance (2/fan_in vs the optimal 2/(fan_in+fan_out)), which can cause instability in very deep networks.","D":"There is no layer-position-based rule. Xavier and Kaiming apply uniformly to all layers based on the activation function used in that layer."},"reference":"- He et al., \"Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (Kaiming init)\" (2015): https://arxiv.org/abs/1502.01852"},{"section":"deep-learning","difficulty":"easy","id":"dl-e024","topicSlug":"weight-initialization","orderIndex":24,"topic":"Weight Initialization","question":"You initialize all weights in a 10-layer MLP to zero. After one forward pass, you call backward(). 
What values do all the weight gradients have, and why?","options":{"A":"Gradients are random because the loss function introduces randomness","B":"All weight gradients are identical within each layer (all the same value), and the symmetry means the model can never learn diverse features. Specifically: with W=0, all pre-activations z=0, so activations a = σ(0) = same constant for all neurons. The same output flows to every downstream neuron. Gradients via chain rule: ∂L/∂W_ij = δ_j × a_i — since all a_i in a layer are equal and all δ_j in a layer are equal, all weight gradients in a layer are equal. After the update, all weights in a layer remain equal: the model has one effective neuron per layer.","C":"Gradients are all zero because a zero forward pass produces zero loss","D":"Gradients are non-zero and different for each weight because the loss function differentiates each weight independently"},"correct":"B","explanation":{"correct":"- Forward pass with W=0: z = 0·x + b = b. If biases are also 0: z=0, a=σ(0) = constant (e.g., 0.5 for sigmoid, 0 for ReLU) for ALL neurons.\n- All neurons in a layer identical → all outputs identical → loss gradient distributes identically to all neurons in a layer.\n- Gradient formula: ∂L/∂w_{ij} = δ_j × a_i, where a_i is the activation of neuron i in the previous layer and δ_j = ∂L/∂z_j. Since a_i is the same for all i (identical neurons in the previous layer) and, in every hidden layer, δ_j is the same for all j (identical neurons in the current layer), all weights in that layer share the same gradient. With exactly zero weights the backpropagated δ in the hidden layers is in fact 0 (δ = Wᵀδ_next ⊙ σ'(z) with W=0), so on the first step only the output layer can receive a non-zero gradient.\n- Result: all neurons stay identical forever — the layer is \"collapsed.\"","A":"The loss function is deterministic given the model output. With identical forward pass (all zeros), the loss and its gradient are deterministic. No randomness is introduced.","B":"","C":"The loss is not necessarily zero. For cross-entropy, L = -log(p_correct). If all outputs are equal and there are K classes, each gets probability 1/K. CE = log(K) ≠ 0 (for K > 1). 
So loss ≠ 0, but gradients are equal across neurons.","D":"Weight gradients within a layer are equal (not different) due to symmetry. They differentiate only if the inputs or activations differ across neurons."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 8.4: Parameter Initialization Strategies"},{"section":"deep-learning","difficulty":"easy","id":"dl-e025","topicSlug":"cnn-architectures","orderIndex":25,"topic":"Cnn Architectures","question":"A convolutional layer has 32 filters of size 3×3, applied to a 3-channel (RGB) input. How many weights does this layer have (excluding biases)? How does this compare to a fully connected layer taking the same 64×64×3 input and producing 32 outputs?","options":{"A":"Conv: 32 × 3 × 3 = 288 weights. FC: 64 × 64 × 3 × 32 = 393,216 weights. Conv has 1365× fewer weights","B":"Conv: 32 filters × (3×3 kernel × 3 input channels) = 32 × 27 = 864 weights. FC: 64 × 64 × 3 × 32 = 393,216 weights. Conv has 455× fewer weights — this is the parameter efficiency of weight sharing","C":"Conv: 32 × 3 × 3 × 3 × 64 × 64 = 3,538,944 weights. Conv and FC have the same order of magnitude","D":"Conv: 32 × 3 × 3 = 288 weights. FC: 64 × 64 × 32 = 131,072 weights. Conv is smaller because it ignores one input channel"},"correct":"B","explanation":{"correct":"- Conv parameter count: each filter is of shape (C_in × K_H × K_W). 32 filters, each of shape (3 × 3 × 3) = 27 values. Total = 32 × 27 = 864 weights.\n- FC parameter count: input has 64 × 64 × 3 = 12,288 values. FC to 32 outputs: 12,288 × 32 = 393,216 weights.\n- Ratio: 393,216 / 864 ≈ 455×. This is the key advantage of CNNs: weight sharing (the same filter is applied at every spatial location) makes them dramatically more parameter-efficient for spatial data.","A":"Forgets to multiply by C_in=3 (the filter must cover all 3 input channels). One filter is 3×3×3=27, not 3×3=9.","B":"","C":"This incorrectly multiplies the spatial output dimensions into the weight count. 
Convolutional weights are independent of the spatial dimensions they're applied to — that's the point of weight sharing. The 64×64 output positions all use the SAME 864 weights.","D":"Convolution does NOT ignore input channels. Each filter convolves all input channels simultaneously. The \"×3\" for RGB channels is part of the filter shape."},"reference":"- LeCun et al., \"Gradient-Based Learning Applied to Document Recognition\" (1998) — weight sharing motivation"},{"section":"deep-learning","difficulty":"easy","id":"dl-e026","topicSlug":"cnn-architectures","orderIndex":26,"topic":"Cnn Architectures","question":"In a CNN, two design choices for downsampling are: (A) max pooling (non-parametric, takes the max in each 2×2 window) and (B) strided convolution with stride=2 (a parametric, learned operation). When would you prefer max pooling over strided convolution, and what does max pooling preserve that average pooling does not?","options":{"A":"Max pooling is always preferred because it is parameter-free","B":"Max pooling is preferred when you want to preserve the strongest activation (presence of a feature) regardless of exact location — providing translation invariance within the pooling window. Max pooling preserves the most activated feature in a local region. Average pooling preserves the average activation, which is sensitive to the presence of many weakly activated neurons rather than one strongly activated one. Strided conv is preferred when you want the network to learn how to downsample based on task-specific patterns. 
Modern architectures (ResNet, EfficientNet) increasingly prefer strided convolution because it allows the network to decide what is important to keep.","C":"Average pooling is identical to max pooling; the distinction doesn't matter in practice","D":"Max pooling requires the same number of parameters as a strided convolution; the choice is aesthetic"},"correct":"B","explanation":{"correct":"- Max pooling: `output = max(x_1, x_2, x_3, x_4)` for a 2×2 window. If one pixel strongly detects an edge, the max preserves that detection regardless of whether neighboring pixels also detect it. This provides feature presence detection.\n- Average pooling: `output = (x_1 + x_2 + x_3 + x_4) / 4`. One strong detection is diluted by three weak ones. Good for global average pooling at the end of a network (replacing FC layers).\n- Strided conv advantage: the filter learns what to preserve during downsampling, potentially learning task-relevant downsampling. This is why modern architectures use it.","A":"Parameter-free is not always an advantage. Strided convolution can outperform max pooling when the task benefits from learned downsampling.","B":"","C":"Max and average pooling produce different outputs. For a window [0, 0, 0, 100]: max=100, avg=25. For detection tasks, max pooling (100) correctly signals feature presence; average pooling (25) weakens the signal.","D":"Max pooling has zero parameters (just takes the max). Strided convolution has kernel_size²×C_in×C_out parameters. They are not equal."},"reference":"- Springenberg et al., \"Striving for Simplicity: The All Convolutional Net\" (2014): https://arxiv.org/abs/1412.6806"},{"section":"deep-learning","difficulty":"easy","id":"dl-e027","topicSlug":"cnn-architectures","orderIndex":27,"topic":"Cnn Architectures","question":"You design a CNN for a 256×256 input. After 4 layers of stride-1 convolution with 3×3 kernels and no padding, what is the spatial size of the output? 
If you add padding=1 to each layer, what changes?","options":{"A":"Without padding: 256 - 4×3 = 244×244. With padding: 256×256 (unchanged)","B":"Without padding: each 3×3 conv (stride 1, no padding) reduces each spatial dimension by 2 (=(K-1)). After 4 layers: 256 - 4×2 = 248×248. With padding=1: output size = input size (same padding), so output remains 256×256 after all 4 layers","C":"Without padding: 256 / 2^4 = 16×16. Convolution halves the spatial dimension each layer","D":"Without padding: 256 - 4×(3-1)/2 = 252×252. With padding, the size doubles"},"correct":"B","explanation":{"correct":"- Output size formula: `out = floor((in + 2×P - K) / S) + 1`. With P=0, K=3, S=1: `out = in - 2`.\n- After 4 layers: 256 - 4×2 = 248. Each layer removes 2 pixels (one from each border).\n- With padding=1: `out = (in + 2×1 - 3) / 1 + 1 = in`. Each layer maintains spatial dimensions. This is \"same\" padding — output equals input size.\n- Why same padding matters: without padding, deep CNNs lose spatial resolution quickly. Same padding allows very deep networks (e.g., VGG's 16 layers) to maintain spatial dimensions until explicit downsampling.","A":"The reduction per layer is K-1 = 2 (not K = 3). A 3×3 filter without padding removes 1 pixel from each edge per layer, so 2 pixels total per spatial dimension.","B":"","C":"Stride-1 convolution does NOT halve the spatial dimension. Halving requires stride=2 (strided conv) or pooling with pool_size=2, stride=2.","D":"This subtracts only (K-1)/2 = 1 pixel per layer, giving 256 - 4 = 252. Each layer actually removes K-1 = 2 pixels per dimension (one from each border), so after 4 layers the reduction is 4 × 2 = 8, giving 256 - 8 = 248, not 252. 
And padding never doubles the output size."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 9.5: Variants of the Basic Convolution Function"},{"section":"deep-learning","difficulty":"easy","id":"dl-e028","topicSlug":"rnn-lstm-gru","orderIndex":28,"topic":"Rnn Lstm Gru","question":"An RNN processes the sentence \"The cat that chased the dog barked.\" The network needs to associate \"cat\" (position 2) with \"barked\" (position 7) for subject-verb agreement. Why does a vanilla RNN struggle with this, and what architectural component in LSTM was specifically designed to address it?","options":{"A":"RNNs cannot process sentences longer than 5 words; LSTM extends this limit to 500 words","B":"In a vanilla RNN, the hidden state h_t = tanh(W_h h_{t-1} + W_x x_t). Information from position 2 (\"cat\") must survive through 5 tanh operations to reach position 7. Each tanh compresses values to (-1,1), and since |tanh'| ≤ 1, the gradient of the loss at position 7 with respect to the hidden state at position 2 involves a product of 5 Jacobians, each typically with norm < 1, causing vanishing gradients. The LSTM cell state c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t provides a direct highway for information: when the forget gate f_t ≈ 1, information passes unchanged — no repeated squashing. This is why LSTM can maintain long-range dependencies.","C":"The issue is that RNNs use tanh, while LSTM uses ReLU — ReLU prevents vanishing gradients","D":"LSTM simply adds more parameters, which allows it to store more sequence positions in memory"},"correct":"B","explanation":{"correct":"- Vanilla RNN gradient: ∂L/∂h_2 = ∂L/∂h_7 × Π_{t=2}^{6} ∂h_{t+1}/∂h_t. Each factor = diag(1 - h_{t+1}²) W_h (the tanh derivative times the recurrent weights). If the largest singular value of these factors is < 1, gradients shrink geometrically over the 5 steps.\n- LSTM cell state highway: c_t = f_t ⊙ c_{t-1} + new_info. When f_t≈1 and new_info≈0: c_t ≈ c_{t-1}. Along the cell-state path, ∂c_t/∂c_{t-1} = f_t ∈ (0,1): only one pointwise multiplication, not a full matrix Jacobian × tanh derivative. 
This avoids the compound shrinkage of vanilla RNN gradients.","A":"There's no hard 5-word or 500-word limit. The ability to capture long-range dependencies is gradual — vanilla RNN struggles proportionally with distance, LSTM extends practical range but not to infinite.","B":"","C":"LSTM doesn't use ReLU in its core gates — it uses sigmoid (gates) and tanh (cell/output activations). The long-range capacity comes from the additive cell state update, not from ReLU.","D":"More parameters don't inherently address vanishing gradients. The specific architectural innovation is the gating mechanism and additive cell state update, not parameter count."},"reference":"- Hochreiter & Schmidhuber, \"Long Short-Term Memory\" (1997): https://www.bioinf.jku.at/publications/older/2604.pdf"},{"section":"deep-learning","difficulty":"easy","id":"dl-e029","topicSlug":"rnn-lstm-gru","orderIndex":29,"topic":"Rnn Lstm Gru","question":"In seq2seq (encoder-decoder) models without attention, the encoder reads an input sequence and produces a single fixed-size context vector c. The decoder uses c to generate the output sequence. What fundamental limitation does this create for long input sequences?","options":{"A":"The decoder cannot run without attention; seq2seq without attention is theoretically impossible","B":"The encoder must compress all information from the input (regardless of length) into a single fixed-size vector c ∈ ℝ^d. For a 100-word sentence, all semantic content must fit in d dimensions. Information is lost when the sequence contains more unique content than the vector can represent. In practice: for short sequences, the bottleneck is manageable; for long sequences (100+ words), early input information is lost by the time the encoder finishes. The decoder then generates outputs without access to early input positions. 
Attention (Bahdanau, 2015) fixes this by allowing the decoder to attend to all encoder hidden states, bypassing the bottleneck.","C":"The bottleneck is only a problem when the vocabulary size exceeds the context vector dimension","D":"The fixed context vector is only used for the first decoder step; subsequent steps use the decoder's own hidden state"},"correct":"B","explanation":{"correct":"- Information bottleneck: the context vector c = h_T (last encoder hidden state) must summarize the entire input. Empirically, RNNs tend to remember recent tokens better than early ones.\n- Translation quality degradation: Bahdanau et al. (2015) showed that BLEU scores for fixed-context seq2seq drop sharply for input sentences with more than 30 words, while attention-based models maintain quality for much longer sequences.\n- Attention solution: instead of using a single c, attention computes a weighted sum of all encoder hidden states h_1, ..., h_T at each decoder step. The decoder can \"look back\" at any position.","A":"Seq2seq without attention was the standard before 2015 and successfully trained on many tasks (machine translation, summarization). The issue is quality degradation for long sequences, not impossibility.","B":"","C":"Vocabulary size and context vector dimension are different quantities. The bottleneck is about fitting sequence semantic content into d dimensions, not about the number of possible words.","D":"In standard fixed-context seq2seq: c is used to initialize the decoder hidden state h_0^dec = tanh(W_s · c). After that, each decoder step uses its own h_t^dec. 
However, c is NOT directly provided at each subsequent step in the basic version — the issue is still the information bottleneck at initialization."},"reference":"- Bahdanau et al., \"Neural Machine Translation by Jointly Learning to Align and Translate\" (2015): https://arxiv.org/abs/1409.0473"},{"section":"deep-learning","difficulty":"easy","id":"dl-e030","topicSlug":"rnn-lstm-gru","orderIndex":30,"topic":"Rnn Lstm Gru","question":"Bidirectional LSTMs run the sequence forward (left to right) and backward (right to left), concatenating hidden states. A student wants to use a bidirectional LSTM for real-time speech recognition (must transcribe as audio arrives). Why is this inappropriate?","options":{"A":"Bidirectional LSTMs require 2× the memory, making them too slow for real-time use","B":"A bidirectional LSTM requires the complete sequence before producing representations. The backward LSTM starts from the last token and moves to the first — it cannot compute backward states until the full input is available. For real-time speech, tokens arrive one at a time and the system must produce output before the sequence ends. A causal (forward-only) LSTM can process each token as it arrives. Bidirectional models are appropriate for offline tasks (post-processing, text classification) where the full sequence is available at inference time.","C":"Bidirectional LSTMs cannot be trained on audio sequences; they only work for text","D":"The backward LSTM causes the model to output words in reverse order, which is wrong for speech recognition"},"correct":"B","explanation":{"correct":"- Causality requirement: real-time systems require causal models — output at time t can only depend on inputs up to time t. Bidirectional models violate causality because the backward hidden state h_t^{bwd} depends on inputs x_{t+1}, ..., x_T.\n- Practical impact: for streaming speech recognition (e.g., voice assistants, live captions), you need a result within ~100ms of each spoken word. 
Waiting for the full sentence is unacceptable.\n- When bidirectional is appropriate: document classification (full text available), NLP understanding tasks (BERT processes the full sequence), offline audio analysis.","A":"Memory usage is a constraint but not the primary reason for avoiding bidirectional LSTM in real-time settings. Modern hardware can handle 2× memory for typical sequence lengths. The fundamental issue is the causality violation.","B":"","C":"Bidirectional LSTMs work on any sequence type (audio, text, biological sequences). The limitation is about inference-time causality, not input modality.","D":"The backward LSTM produces hidden states in reverse order internally, but the concatenated output still corresponds to each time step in forward order. The output is not reversed."},"reference":"- Schuster & Paliwal, \"Bidirectional Recurrent Neural Networks\" (1997) — original BiRNN paper"},{"section":"deep-learning","difficulty":"easy","id":"dl-e031","topicSlug":"attention-and-transformers-dl","orderIndex":31,"topic":"Attention And Transformers Dl","question":"In the Transformer attention mechanism, Q (Query), K (Key), and V (Value) matrices are derived from the same input sequence via linear projections. Describe what Q, K, and V represent conceptually using an analogy, and what the attention formula `softmax(QK^T / √d_k) V` computes.","options":{"A":"Q=weights, K=biases, V=activations — standard neural network components","B":"Information retrieval analogy: Q represents \"what I am looking for\" (the current token's query for context), K represents \"what I advertise\" (every token's summary of its content), V represents \"what I return\" (the actual content to contribute if selected). The formula: QK^T computes the compatibility between each query and every key (dot product = similarity). / √d_k scales to prevent extreme softmax values. softmax(·) converts similarities to attention weights (sum to 1). 
× V produces a weighted combination of all values — the output is a blend of all input values weighted by relevance to the query.","C":"Q is the input, K is a learned weight matrix, V is the output layer","D":"Q, K, V all contain the same information; the distinction is only to allow the model to use three separate weight matrices"},"correct":"B","explanation":{"correct":"- Database analogy: think of a soft key-value store. Q is the search query. K are the database keys. If query Q matches key K_j, return value V_j (weighted by match strength).\n- For a sentence \"The cat sat\": when computing the output for \"sat,\" the Q for \"sat\" is compared against K for \"The,\" \"cat,\" and \"sat.\" If \"cat\" has the highest Q·K similarity (because the subject relates to the verb), V_{cat} gets the highest weight.\n- Self-attention: Q, K, V all come from the same sequence — each token queries all other tokens (including itself).","A":"Q, K, V are not standard NN components. They are specific projections designed to implement soft retrieval.","B":"","C":"Q comes from the current representation being transformed, not directly the input. K and V both come from the context (same sequence in self-attention). Calling K a \"weight matrix\" confuses the projection matrix W_K with the projected key tensor K = X W_K.","D":"If Q, K, V had identical information, you could collapse them. The three separate projections allow the model to learn different aspects: what to look for (Q), what to expose (K), what to return (V). Empirically, W_Q, W_K, W_V learn very different transformations."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): https://arxiv.org/abs/1706.03762"},{"section":"deep-learning","difficulty":"easy","id":"dl-e032","topicSlug":"attention-and-transformers-dl","orderIndex":32,"topic":"Attention And Transformers Dl","question":"Transformers use positional encoding to inject sequence order information. 
Without positional encoding, what happens when you feed the sentence \"cat chases dog\" vs \"dog chases cat\" to a Transformer, and why?","options":{"A":"The Transformer correctly captures order through its causal mask, making positional encoding redundant","B":"Without positional encoding, the Transformer's self-attention produces the same output for both sentences. Self-attention computes similarity between every pair of tokens regardless of position. For \"cat chases dog\" and \"dog chases cat\": the same three token embeddings participate, just in different positions — but without position information, the model cannot distinguish position 1 from position 3. The attention scores depend only on token content (Q·K similarity), not on where the token is in the sequence. Positional encoding adds position-specific vectors to each token embedding, making \"cat at position 1\" different from \"cat at position 3.\"","C":"Order is captured by the feedforward layers; positional encoding is optional","D":"The Transformer uses order through its convolution layers; positional encoding is for CNNs only"},"correct":"B","explanation":{"correct":"- Permutation equivariance: self-attention is permutation equivariant — permuting the input permutes the output in the same way. Without positional encoding, the model treats the input as an unordered set of tokens.\n- Sentence reversal test: \"cat chases dog\" → tokens {cat, chases, dog} at positions {1,2,3}. \"dog chases cat\" → same token set {cat, chases, dog} at different positions. Without PE: QK^T produces the same attention matrix (modulo permutation). With PE: the embeddings differ (cat@pos1 ≠ cat@pos3), so attention patterns and outputs differ.\n- This also explains why transformers need positional encoding for autoregressive generation — without it, the model can't know which token is \"next.\"","A":"Causal masks prevent attending to future positions but don't encode position information. 
A causal mask with identical token embeddings still can't distinguish \"cat at position 1\" from \"cat at position 5.\"","B":"","C":"Feedforward layers are position-wise (applied independently to each position's representation). They don't mix information across positions and can't capture order.","D":"Standard Transformers have no convolution layers. Positional encoding is not a CNN concept — it's specific to attention-based models that lack inherent sequence order."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3.5 — Positional Encoding"},{"section":"deep-learning","difficulty":"easy","id":"dl-e033","topicSlug":"attention-and-transformers-dl","orderIndex":33,"topic":"Attention And Transformers Dl","question":"Self-attention has O(T²) complexity in sequence length T. For a sequence of T=512 tokens with d_model=768, estimate the number of attention score computations and explain why this becomes a bottleneck for documents with T=50,000 tokens.","options":{"A":"Attention score computations = T × d_model = 512 × 768 = 393,216. Length doesn't affect bottleneck","B":"Each attention score is a dot product between a query and a key (both of dimension d_k). For T=512: T² = 262,144 dot products, each costing d_k multiplications. The full QK^T matrix has shape (T×T). For T=50,000: 50,000² = 2.5 billion attention score pairs. Storing QK^T requires 2.5B × 4 bytes = 10 GB just for one attention head. Compute: 2.5B dot products of dimension d_k each. Both memory and compute grow quadratically with T, making standard attention impractical for long documents.","C":"Attention is O(T×d_k) not O(T²); only memory grows quadratically","D":"O(T²) complexity applies only to masked (causal) attention; bidirectional attention is O(T)"},"correct":"B","explanation":{"correct":"- QK^T matrix: for queries Q ∈ ℝ^{T×d_k} and keys K ∈ ℝ^{T×d_k}, the product QK^T ∈ ℝ^{T×T} requires T² dot products of length d_k each. 
This is O(T²d_k) computation.\n- Memory bottleneck: the T×T attention matrix must be stored for the softmax and the subsequent multiplication with V. For T=50K and float32: 50,000² × 4 bytes = 10 GB per head per batch sample.\n- Efficient attention: FlashAttention (Dao et al., 2022) computes attention in tiles to avoid materializing the full T×T matrix; Sparse attention limits each token to attending only k nearby or globally selected tokens, reducing to O(T√T) or O(T).","A":"T × d_model is not the number of attention score computations. The QK^T matrix has T×T entries, each computed from a d_k-dimensional dot product.","B":"","C":"Both memory and compute are O(T²). The QK^T matrix has T² entries (memory), and computing it requires T² dot products (compute).","D":"Both causal (masked) and bidirectional attention compute the full T×T QK^T matrix (then mask after). The complexity is O(T²) for both. Masking reduces the effective attention range but not the matrix computation."},"reference":"- Dao et al., \"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness\" (2022): https://arxiv.org/abs/2205.14135"},{"section":"deep-learning","difficulty":"easy","id":"dl-e034","topicSlug":"self-supervised-and-contrastive-learning","orderIndex":34,"topic":"Self Supervised And Contrastive Learning","question":"In SimCLR, two views of the same image are created by applying random augmentations (crop, color jitter, Gaussian blur). These form a positive pair. All other images in the batch form negative pairs. Why must the two views of the same image be semantically similar but visually different?","options":{"A":"Visual difference is required for GPU efficiency — identical views would be processed in parallel","B":"The goal is to learn representations invariant to augmentation. 
If views are too similar (e.g., only brightness changed slightly), the model can match them via low-level pixel statistics rather than semantic content — it learns \"same image = same brightness\" rather than \"same image = same object.\" If views share semantic meaning but differ visually (e.g., different crops, different colors), the model must encode the underlying semantic content to correctly pull them together, learning useful high-level representations. This is the augmentation design principle: be invariant to what you apply augmentation for, but not to what you don't apply augmentation for.","C":"Augmentations are only for data expansion; the model learns from the original images","D":"Views should be as different as possible to maximize contrastive learning difficulty"},"correct":"B","explanation":{"correct":"- Invariance design: the representation learned by SimCLR is invariant to the applied augmentations. If you apply strong color jitter, the model learns color-invariant features. This is intentional: object identity is color-invariant in many tasks.\n- Too-easy augmentations: if views are nearly identical (e.g., small brightness shift), the model learns trivial similarity (low-level pixel matching). The representation doesn't capture anything interesting.\n- Too-hard augmentations: if views are too different (semantically different), the model can't find a useful invariance and may collapse.\n- The key insight (Chen et al., 2020): careful augmentation design determines what invariances are learned and thus how well representations transfer to downstream tasks.","A":"GPU efficiency is determined by batch size and architecture, not augmentation type. Identical views would be processed the same — no GPU efficiency difference.","B":"","C":"The training signals in SimCLR come from the contrastive loss on augmented views. 
The \"original image\" is not used — only the two augmented views of each image.","D":"If views are too different (semantically different crops of different objects), the model would be trying to push together views that represent different things. This breaks the \"positive pair means same semantic content\" assumption."},"reference":"- Chen et al., \"A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)\" (2020): https://arxiv.org/abs/2002.05709"},{"section":"deep-learning","difficulty":"easy","id":"dl-e035","topicSlug":"self-supervised-and-contrastive-learning","orderIndex":35,"topic":"Self Supervised And Contrastive Learning","question":"BERT uses Masked Language Modeling (MLM) as a pretraining objective: 15% of tokens are masked, and the model predicts the masked tokens. Why is this considered self-supervised learning rather than supervised learning?","options":{"A":"MLM is supervised learning because each masked token has a correct label","B":"MLM is self-supervised because the labels are derived automatically from the data itself, requiring no human annotation. The \"labels\" for the masked tokens are the original tokens — extracted from the text. No human labels the dataset; the training signal is generated from the raw text corpus by the masking procedure itself. Self-supervised learning = automatic label generation from structure in unlabeled data. Contrast with supervised learning: human-annotated labels (e.g., sentiment = positive/negative). MLM generates millions of training examples from raw text without any human effort.","C":"MLM is unsupervised learning because the model processes unlabeled data","D":"MLM is semi-supervised because it requires some labeled examples for fine-tuning"},"correct":"B","explanation":{"correct":"- Self-supervised definition: a learning paradigm where supervisory signal is generated from the raw data itself. 
The data structure creates labels (e.g., mask a word and try to predict it, predict next word in a sequence, predict image rotation).\n- Key distinction from unsupervised: unsupervised learning (clustering, PCA) finds structure without explicit prediction targets. Self-supervised learning has explicit prediction targets but generates them automatically.\n- Key distinction from supervised: supervised learning requires human-labeled data (a person decides the label). In MLM, the algorithm decides the label (original token) by masking — no human involvement.\n- This allows BERT to pretrain on large unlabeled corpora (BooksCorpus and English Wikipedia, roughly 3.3 billion words) without human annotation.","A":"The distinction is HOW the labels are created, not whether labels exist. MLM has labels (original tokens) but they are automatically generated. Supervised learning requires human-provided labels. Both have labels; only the source differs.","B":"","C":"Classical unsupervised learning has no explicit prediction target — it finds patterns, clusters, or latent variables. MLM has a clear prediction target (the masked token). Self-supervised is the more precise term.","D":"Fine-tuning uses a small amount of supervised data, but pretraining via MLM is purely self-supervised. The overall BERT workflow is \"self-supervised pretraining + supervised fine-tuning\" — only the pretraining phase is self-supervised."},"reference":"- Devlin et al., \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\" (2019): https://arxiv.org/abs/1810.04805"},{"section":"deep-learning","difficulty":"easy","id":"dl-e036","topicSlug":"graph-neural-networks","orderIndex":36,"topic":"Graph Neural Networks","question":"A social network has users as nodes and friendships as edges. Node features are [age, location_id, num_posts]. 
In a GNN, what is the difference between \"node features\" and \"node embeddings,\" and what does one round of message passing produce?","options":{"A":"Node features and node embeddings are the same thing; message passing doesn't change them","B":"Node features are the raw input attributes (age, location, num_posts) — provided before training. Node embeddings are learned vector representations produced by the GNN — they encode both the node's own features AND information from its neighborhood. One round of message passing: each node aggregates (e.g., averages) the features/embeddings of its neighbors and combines them with its own features via a learned transformation: h_v^{(1)} = σ(W · concat(h_v^{(0)}, mean({h_u^{(0)} : u ∈ N(v)}))). After one round, the embedding of user A encodes A's features plus the average features of A's direct friends.","C":"Node features are used for training, node embeddings are used for inference only","D":"Message passing only updates edge features; node features remain constant throughout the GNN"},"correct":"B","explanation":{"correct":"- Feature vs embedding distinction: features are fixed inputs (not learned); embeddings are learned representations output by the GNN. The GNN transforms features into embeddings layer by layer.\n- After 1 layer: each embedding captures 1-hop neighborhood. After 2 layers: 2-hop neighborhood. After k layers: k-hop neighborhood.\n- Why this matters: two users with identical raw features but different social circles will have different embeddings after message passing, because the neighborhood structure affects the aggregation.","A":"Message passing fundamentally changes the node representations. After one round, each node's representation incorporates its neighbors' features — this is different from the initial features.","B":"","C":"Both features and embeddings are available during training and inference. 
The distinction is about what is provided vs computed, not when they are used.","D":"GNNs update node representations (embeddings) via message passing. Edge features can also be used to weight messages (as in GAT), but node representations are the primary output of each layer."},"reference":"- Hamilton et al., \"Graph Representation Learning\" (2020 book): Chapter 5 — Graph Neural Networks"},{"section":"deep-learning","difficulty":"easy","id":"dl-e037","topicSlug":"graph-neural-networks","orderIndex":37,"topic":"Graph Neural Networks","question":"After training a GNN on a citation network, you want to classify a paper that was not in the training graph (a completely new paper added to the network). An inductive GNN (like GraphSAGE) can do this, but a transductive GNN (like vanilla GCN) cannot. What is the key difference?","options":{"A":"GraphSAGE uses more layers than GCN, allowing it to handle new nodes","B":"Transductive GCN: the weight matrix W is applied to all node features simultaneously in A·H·W form. The adjacency matrix A is fixed at training time — new nodes are not in A, so the trained model cannot be applied to them. Inductive GraphSAGE: the model learns a neighborhood aggregation function (a set of weights that aggregate from arbitrary neighbor sets). For a new node, you gather its neighbors, apply the learned aggregation function, and produce an embedding without needing to re-optimize. The key: GraphSAGE learns HOW to aggregate, while vanilla GCN learns node-specific representations tied to the training graph.","C":"GraphSAGE uses attention to handle unseen nodes, while GCN does not","D":"Transductive GCN cannot handle new nodes because it uses GPU memory, not CPU memory"},"correct":"B","explanation":{"correct":"- Transductive learning: optimization over all nodes in a fixed graph. The learned representations are specific to those nodes. 
Adding a new node to the graph would require reoptimizing the entire model.\n- Inductive learning: learn a function (the message-passing aggregator) that generalizes to unseen nodes. GraphSAGE's aggregator function can be applied to any node's neighborhood — seen or unseen.\n- Practical importance: in real-world graphs (social networks, knowledge graphs), new nodes are added continuously. Transductive models require expensive re-training for each new node; inductive models produce embeddings on-the-fly.","A":"The number of layers doesn't determine transductive vs inductive behavior. A 2-layer GraphSAGE and a 2-layer GCN differ architecturally, not in depth.","B":"","C":"GAT (Graph Attention Network) is attention-based and can also be inductive. The inductive/transductive distinction is about whether the model generalizes to new nodes via a learned aggregation function, not about whether attention is used.","D":"GPU/CPU memory has nothing to do with transductive vs inductive learning. This is an architectural and mathematical concept."},"reference":"- Hamilton et al., \"Inductive Representation Learning on Large Graphs (GraphSAGE)\" (2017): https://arxiv.org/abs/1706.02216"},{"section":"deep-learning","difficulty":"easy","id":"dl-e038","topicSlug":"transfer-learning","orderIndex":38,"topic":"Transfer Learning","question":"You fine-tune a pretrained ResNet-50 on a small (2,000 example) dataset of X-ray images. You freeze layers 1-4 (early) and fine-tune only layer 5 (late) + the classifier head. A colleague suggests freezing all layers instead (pure feature extraction). Which approach is better, and under what condition would you switch the recommendation?","options":{"A":"Feature extraction (all frozen) is always better for small datasets because fine-tuning causes overfitting","B":"Partial fine-tuning (freeze early, update late layers + head) is generally better when the source domain (ImageNet) and target domain (X-rays) differ significantly. 
Early layers (edges, textures) generalize well and should be frozen. Late layers (high-level semantics) are domain-specific and benefit from updating. If the dataset were extremely small (< 100 examples) or if X-ray images were very similar to ImageNet, full feature extraction might be preferred — then fine-tuning late layers would overfit.","C":"Always fine-tune all layers for any domain gap, regardless of dataset size","D":"The number of frozen layers doesn't matter — use the same strategy regardless of dataset size"},"correct":"B","explanation":{"correct":"- Layer transferability: Yosinski et al. (2014) showed that early layers (low-level feature detectors: edges, textures) transfer across very different domains, while late layers are increasingly task-specific.\n- Freezing strategy heuristic:\n- Small dataset + similar domain → freeze most layers (feature extraction)\n- Small dataset + different domain → freeze early, fine-tune late\n- Large dataset + similar domain → fine-tune all (with discriminative LR)\n- Large dataset + different domain → fine-tune all with low LR\n- With 2,000 X-ray examples: enough to update late layers without overfitting, but not enough to update all 25M ResNet-50 parameters.","A":"With significant domain gap (ImageNet vs X-rays), frozen late layers will produce domain-specific ImageNet features that are suboptimal for X-ray analysis. Some adaptation is beneficial.","B":"","C":"Fine-tuning all layers with 2,000 examples and 25M parameters would risk severe overfitting. That is roughly 12,500 parameters per training example, far too few examples to constrain every weight.","D":"The dataset size fundamentally determines how many parameters can be reliably updated. 
This is not an aesthetic choice."},"reference":"- Yosinski et al., \"How transferable are features in deep neural networks?\" (2014): https://arxiv.org/abs/1411.1792"},{"section":"deep-learning","difficulty":"easy","id":"dl-e039","topicSlug":"transfer-learning","orderIndex":39,"topic":"Transfer Learning","question":"LoRA (Low-Rank Adaptation) adds matrices B ∈ ℝ^{d×r} and A ∈ ℝ^{r×d} to a frozen pretrained weight W ∈ ℝ^{d×d}. The adapted weight is W' = W + BA. For d=768 (BERT-base), r=8, calculate the % parameter reduction vs full fine-tuning of W. Why is B initialized to zero?","options":{"A":"LoRA reduces parameters by 50% because rank-8 is half of rank-16","B":"Full fine-tuning W: d² = 768² = 589,824 parameters. LoRA: A has r×d = 8×768 = 6,144 params; B has d×r = 768×8 = 6,144 params; total LoRA = 12,288. Reduction: (589,824 - 12,288) / 589,824 ≈ 97.9% fewer params. B=0 initialization: BA = 0·A = 0, so W' = W + 0 = W initially. The model starts out functionally identical to the pretrained model, with no disruption. As training progresses, BA accumulates the task-specific delta. Starting from BA≠0 would perturb the pretrained representations from the first step.","C":"LoRA reduces parameters by 8× because the rank is r=8","D":"B is initialized to zero because LoRA cannot train B; only A is trainable"},"correct":"B","explanation":{"correct":"- Parameter count: LoRA adds two small matrices rather than updating all d² parameters of W. For d=768, r=8: 2×768×8 = 12,288 << 589,824.\n- 97.9% reduction: this means only ~2.1% as many parameters need to be trained vs full fine-tuning, while the frozen W retains all pretrained knowledge.\n- B=0 initialization: ensures BA=0 at the start → W' = W → model produces the same outputs as the pretrained model before any training steps. 
This is a clean initialization with no disruption.\n- A is initialized randomly (e.g., Gaussian) so that as B starts training, BA develops non-trivial values.","A":"The reduction is based on parameter count, not rank ratio. Full rank-768 matrix has 768² = 589K params. LoRA rank-8 has 2×768×8 = 12K. The ratio is 589K/12K ≈ 48×, not 2×.","B":"","C":"Rank r=8 means the update BA has rank at most 8 (r << d). The reduction factor is d²/(2×d×r) = d/(2r) = 768/16 = 48×, not 8×.","D":"Both A and B are trainable parameters in LoRA. B is initialized to zero for the clean start; A is initialized randomly. During training, both are updated."},"reference":"- Hu et al., \"LoRA: Low-Rank Adaptation of Large Language Models\" (2022): https://arxiv.org/abs/2106.09685"},{"section":"deep-learning","difficulty":"easy","id":"dl-e040","topicSlug":"transfer-learning","orderIndex":40,"topic":"Transfer Learning","question":"After fine-tuning a GPT-2 model (117M parameters) on customer support data, you evaluate on the original GPT-2 benchmarks (HellaSwag, WinoGrande) and find performance dropped significantly. What is this phenomenon, and what is the simplest architectural fix that prevents it while still allowing task-specific adaptation?","options":{"A":"Performance drop is expected; fine-tuned models cannot maintain general capabilities","B":"This is catastrophic forgetting: fine-tuning on customer support data overwrites the pretrained weights, degrading general language capabilities. GPT-2's weights encoded broad language knowledge; high-LR updates for the narrow customer support distribution push the weights toward this specific domain, overwriting patterns for general text. Simplest fix: LoRA. By freezing GPT-2's 117M parameters and only training small low-rank adapter matrices (≈0.5M params for r=8), the pretrained weights remain intact — general benchmarks are unaffected. 
Task-specific knowledge is learned entirely in the adapters. Alternatively: very low LR (1e-5) with early stopping reduces (but doesn't eliminate) forgetting.","C":"The performance drop is caused by data preprocessing, not weight updates","D":"Catastrophic forgetting only occurs in continual learning; fine-tuning is immune to it"},"correct":"B","explanation":{"correct":"- Forgetting mechanism: each gradient step updates all 117M weights toward the customer support loss minimum. Weights that were optimized for general language understanding are shifted. After enough steps with high LR, general capabilities degrade.\n- LoRA solution: W_pretrained is frozen (never modified). W' = W_pretrained + BA. Only BA (≈0.5M params for GPT-2) is updated. With the adapter disabled (BA dropped), general benchmarks run on W_pretrained exactly → performance unchanged. Task-specific knowledge is stored in BA.\n- This is the core appeal of parameter-efficient fine-tuning: adapt to new tasks without touching the base model's knowledge.","A":"This is the motivation for parameter-efficient fine-tuning research — the problem is real but solvable. LoRA, adapters, and prompt tuning are all designed to allow task adaptation without forgetting.","B":"","C":"Data preprocessing artifacts would affect training performance. General benchmark degradation is specifically caused by weight modification (gradient updates), not data issues.","D":"Catastrophic forgetting is not exclusive to continual learning settings. 
Any time a pretrained model is fine-tuned on a distribution different from pretraining, there is risk of forgetting, proportional to the LR, number of steps, and distribution shift magnitude."},"reference":"- Hu et al., \"LoRA: Low-Rank Adaptation of Large Language Models\" (2022): https://arxiv.org/abs/2106.09685\n- McCloskey & Cohen, \"Catastrophic Interference in Connectionist Networks\" (1989)"},{"section":"deep-learning","difficulty":"hard","id":"dl-h001","topicSlug":"introduction-to-neural-networks","orderIndex":1,"topic":"Introduction To Neural Networks","question":"You train a shallow neural network on the XOR problem with 2 inputs, 2 hidden units (ReLU), and 1 output (sigmoid). After 10,000 SGD steps, training loss plateaus at 0.25 instead of converging near 0. You verify the network has sufficient capacity. What are the two most likely causes, and what would each fix look like?","options":{"A":"The XOR problem is not solvable by any neural network; the plateau is expected","B":"Cause 1 — Symmetry from identical weight initialization: if both hidden neurons start with identical weights, they remain identical throughout training (symmetry problem), effectively giving the network only 1 unique hidden unit. One unique hidden unit cannot solve XOR (it creates one linear boundary, insufficient to separate XOR's 4 points). Fix: use random weight initialization (e.g., Xavier). Cause 2 — Learning rate issues: too high → oscillates around the solution; too low → extremely slow convergence, appearing plateaued. The XOR loss landscape has narrow valleys — SGD with the wrong LR gets stuck. Fix: use an adaptive optimizer (Adam) or perform LR search. 
Verify fix: after proper init + optimizer, XOR should converge to near-zero loss in <1000 steps.","C":"Plateau at 0.25 means the model has converged to the globally optimal solution","D":"2 hidden units are insufficient; XOR requires at least 4 hidden units to solve"},"correct":"B","explanation":{"correct":"- Symmetry-broken capacity: with 2 identical neurons, the effective hidden layer has rank 1 — one linear boundary, equivalent to logistic regression. XOR is not linearly separable; logistic regression cannot solve it.\n- XOR convergence test: the correct solution has weights like W₁=[[1,-1],[-1,1]], b₁=[0,0], W₂=[1,1], b₂=-0.5 (or equivalent). Loss should approach 0.\n- LR diagnosis: if a full loss curve shows oscillations around 0.25, the LR is too high. If it decreases extremely slowly (loss: 0.693 → 0.500 → 0.400 → 0.320 ... over 10K steps), LR is too low or the network is stuck in a flat region.\n- 2 ReLU hidden units with proper initialization CAN solve XOR — it needs two intersecting half-planes.","A":"XOR IS solvable by any neural network with ≥2 hidden units and non-linear activations. The XOR problem's unsolvability applies only to single-layer perceptrons (linear classifiers).","B":"","C":"Cross-entropy loss of 0.25 with binary labels means the model's probability outputs are around 0.78 for correct class. This is not optimal for XOR (which has clear binary boundaries).","D":"2 hidden units (ReLU) are sufficient for XOR. A single hidden unit is not enough; 2 is the minimum. More units make optimization easier but are not required."},"reference":"- Minsky & Papert, \"Perceptrons\" (1969) — XOR non-linearity requirement\n- Goodfellow et al., \"Deep Learning\" (2016), Chapter 6.1"},{"section":"deep-learning","difficulty":"hard","id":"dl-h002","topicSlug":"backpropagation","orderIndex":2,"topic":"Backpropagation","question":"You implement a neural network in a framework that uses reverse-mode automatic differentiation. 
The forward pass computes: `z = relu(W @ x); loss = cross_entropy(softmax(V @ z), y)`. During the backward pass, you notice that gradient norms for W are 1000× larger than for V. The network has L=20 layers (not shown). What is the most likely cause, and how does this differ from the vanishing gradient problem?","options":{"A":"This is the vanishing gradient problem — W is in an early layer so its gradients vanish","B":"This is the exploding gradient problem for W. In deep networks, the gradient of the loss with respect to W (an early layer) involves the Jacobian product: ∂L/∂W = (∂L/∂z_L) × Π_{l=k}^{L} (∂z_{l+1}/∂z_l) × ∂z_k/∂W. If the spectral norm of each Jacobian ∂z_{l+1}/∂z_l > 1, the product grows exponentially. With 20 layers where each Jacobian has spectral norm 1.4: 1.4^20 ≈ 836 — consistent with 1000× amplification. Vanishing gradients (spectral norm < 1) cause early-layer gradients to shrink to near zero, preventing learning. Exploding gradients cause extreme updates that corrupt weights. The asymmetry (W >> V) is because W is in an earlier layer with more Jacobian multiplications than V (the last layer). Fix: gradient clipping, weight normalization, or skip connections.","C":"The 1000× difference is expected and indicates W is learning faster than V — a feature, not a bug","D":"The gradient difference is caused by the ReLU activation at W's layer; switch to sigmoid to equalize gradients"},"correct":"B","explanation":{"correct":"- Jacobian chain growth: ∂z_{l+1}/∂z_l = W_{l+1} × diag(ReLU'(z_l)). ReLU': 0 or 1. The matrix W_{l+1} × diag(mask) has spectral norm ≈ spectral_norm(W_{l+1}) × sparsity_factor. If weights are initialized slightly large (norm > 1), gradients explode.\n- Asymmetry: V is the last layer (1 Jacobian multiplication for V's gradient). W is 19 layers earlier (19 Jacobian multiplications). 
The amplification hits earlier layers harder.\n- Clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` scales all gradients proportionally when their total norm exceeds max_norm, preventing corrupt updates.","A":"Vanishing gradients cause W's gradient to be 1000× SMALLER than V (not larger). Gradients shrink as they propagate backward through many layers. The described scenario (W >> V) is the OPPOSITE of vanishing gradients.","B":"","C":"1000× gradient difference is a training instability indicator. The optimizer step for W would be 1000× larger than for V, causing W to be wildly overupdated while V barely changes. This is not \"faster learning\" — it's gradient explosion.","D":"Switching to sigmoid from ReLU would worsen the situation. Sigmoid derivatives are at most 0.25, causing vanishing gradients. ReLU with derivative {0,1} preserves gradient magnitude better than sigmoid."},"reference":"- Pascanu et al., \"On the difficulty of training recurrent neural networks\" (2013): https://arxiv.org/abs/1211.5063"},{"section":"deep-learning","difficulty":"hard","id":"dl-h003","topicSlug":"optimizers","orderIndex":3,"topic":"Optimizers","question":"You train a large Transformer (1B parameters) on a TPU cluster using AdamW with β₁=0.9, β₂=0.95, ε=1e-8, lr=3e-4, weight_decay=0.1. At step 50,000, the loss spikes from 2.1 to 8.3 and never recovers. Loss scaling is used (scale=65536). You have saved checkpoints. What is the most likely cause, and what is the systematic debugging procedure?","options":{"A":"The loss spike is caused by a corrupt data batch; skip that batch and continue","B":"Most likely cause: gradient overflow in FP16 causing NaN/Inf propagation. At step 50K, the loss scale (65536) multiplied by an unusually large gradient may have caused FP16 overflow (>65504) → gradients become Inf/NaN → weight update is Inf/NaN → weights corrupted → loss never recovers. 
Systematic debugging: Step 1: check gradient norm history — was there a spike in grad norm at step 50K? Step 2: check if loss scale was halved at step 50K (PyTorch AMP's GradScaler does this automatically after overflow, but if the weights are already corrupted, recovery is impossible). Step 3: reload the last good checkpoint (step 49,000) and inspect: monitor loss scale values, gradient norms per layer, and weight norm spikes. Step 4: if grad overflow confirmed, reduce the initial loss scale or use BF16 (larger dynamic range, no loss scaling needed).","C":"The spike is caused by learning rate being too high; reduce lr and continue from the spike","D":"The loss spike means the model has escaped a local minimum and is exploring a better region"},"correct":"B","explanation":{"correct":"- FP16 overflow signature: gradient norm jumps to Inf or NaN at a specific step. The optimizer update corrupts all affected weights in one step. No recovery is possible from the corrupted weights — the training trajectory diverges permanently.\n- AMP GradScaler behavior: it automatically detects Inf/NaN gradients and skips the optimizer step (preserving weights). But if NaN propagates INTO the model weights before the scaler detects it, corruption occurs.\n- BF16 advantage: BF16 has the same exponent range as FP32 (8-bit exponent) vs FP16 (5-bit exponent). Max BF16 = ~3.4×10^38 vs FP16 = 65504. BF16 virtually eliminates overflow — why modern TPU training defaults to BF16.\n- Checkpoint reload is the only recovery option once weights are corrupted.","A":"A single corrupt data batch causes a temporary loss spike that recovers over a few steps as the bad update is averaged out. A permanent loss spike (never recovers) is characteristic of weight corruption, not a bad batch.","B":"","C":"Continuing from the spike point with reduced LR is ineffective — if the weights are already corrupted (NaN values), the loss function output is undefined. 
Must restore from checkpoint.","D":"Loss spikes during language model training at step 50K are a well-known failure mode (see Chinchilla, PaLM training reports), not beneficial exploration. They require careful monitoring and are architectural/numerical issues, not optimization features."},"reference":"- Chowdhery et al., \"PaLM: Scaling Language Modeling with Pathways\" (2022): https://arxiv.org/abs/2204.02311 — training instability analysis"},{"section":"deep-learning","difficulty":"hard","id":"dl-h004","topicSlug":"activation-functions","orderIndex":4,"topic":"Activation Functions","question":"A production vision model uses SiLU (Swish) activations: f(x) = x·σ(x). During ONNX export for edge deployment, you discover the target hardware lacks a native sigmoid instruction. A colleague suggests approximating SiLU with a piecewise linear function. What properties of SiLU must the approximation preserve for the exported model to match training performance within 1% accuracy, and what is the risk if only the value (not the derivative) is matched?","options":{"A":"Only the value needs to match; derivatives are irrelevant after training","B":"For inference (not further training), only the forward-pass value needs to be matched at inference time. However, accuracy within 1% requires: (1) Value accuracy at the activation's operating range: SiLU(x) ≈ x for x >> 0; ≈ 0 for x << 0; the non-trivial region is roughly x ∈ [-3, 3]. The piecewise approximation must closely match SiLU in this range. (2) Smoothness near x=0: SiLU has a smooth minimum near x ≈ -1.28 (minimum value ≈ -0.28). A piecewise linear approximation with segments captures this only if a breakpoint is placed near the minimum. (3) Risk of derivative mismatch: if the model was trained with SiLU and deployed with an approximation that has different curvature, the activations see a shifted distribution. For deep networks, this distribution shift compounds across layers. 
Even if per-activation error is 0.5%, compounding over 50 layers can cause output shift >> 1%. Test: compare layer-wise activation distribution between original and approximated model on calibration data.","C":"Replace SiLU with ReLU entirely; the accuracy difference will be less than 1%","D":"The approximation is irrelevant; only the final softmax temperature matters for accuracy"},"correct":"B","explanation":{"correct":"- Inference vs training derivative needs: during inference, no backpropagation occurs. The derivative of the activation function is not needed. Only the forward-pass output values matter.\n- Compounding error: in a deep network, each activation's approximation error creates a small distribution shift in the next layer's inputs. With 50 layers, a per-layer relative error of ε compounds: (1+ε)^50 ≈ e^{50ε}. For ε=0.01: e^{0.5} ≈ 1.65 — 65% distribution shift. The final layer may see dramatically different input statistics than expected.\n- Calibration fix: post-training quantization tools (TensorRT, ONNX Runtime) use calibration data to adjust activation ranges and minimize this compounding error.","A":"While technically true for a single activation in isolation, the compounding distribution shift in deep networks means value-only approximation can still cause significant accuracy loss if the approximation has systematic bias in the operating range.","B":"","C":"ReLU and SiLU have fundamentally different behaviors: SiLU is non-monotonic (has a minimum at x≈-1.28) while ReLU is monotonic. The network's weights were trained assuming SiLU's non-monotonic behavior. Replacing with ReLU changes the learned function significantly (>1% accuracy loss for well-trained models).","D":"Softmax temperature affects calibration (confidence scores) but not classification accuracy (argmax of logits). 
Activation distribution shifts do affect the logit values themselves."},"reference":"- Ramachandran et al., \"Searching for Activation Functions (Swish/SiLU)\" (2017): https://arxiv.org/abs/1710.05941"},{"section":"deep-learning","difficulty":"hard","id":"dl-h005","topicSlug":"weight-initialization","orderIndex":5,"topic":"Weight Initialization","question":"You train a 50-layer pre-LN Transformer from scratch. At initialization (before any training), you measure the output logit variance across the vocabulary for each input sample — it is 800× larger than expected from a standard Kaiming init. You trace the cause to the residual stream. What initialization strategy does GPT-2 use to address this, and why does depth amplify variance in the residual stream?","options":{"A":"GPT-2 uses zero initialization for all layers; larger variance is expected at depth 50","B":"Residual variance amplification: in a Pre-LN Transformer with skip connections, the residual stream at depth l is: x_l = x_0 + Σ_{i=1}^{l} F_i(LayerNorm(x_{i-1})). Each F_i adds variance: Var(x_l) ≈ Var(x_0) + l × Var(F_i). At layer 50 with l=50 blocks: variance grows as O(l) = 50× if each sub-layer contributes equal variance. For 2 sub-layers per block (attention + FFN) and depth 50: 100 sub-layers → 100× variance amplification. GPT-2 fix: scale residual projections by 1/√(2N), where N is the number of residual layers. The residual output matrix (c_proj in attention, the second linear in FFN) is initialized with std = 0.02 / √(2N). This pre-scales each sub-layer's contribution so that Var(Σ F_i) = constant regardless of depth.","C":"The variance is caused by LayerNorm; removing LayerNorm solves the issue","D":"This is normal behavior; large initial logit variance doesn't affect training"},"correct":"B","explanation":{"correct":"- Mathematical derivation: if each F_i(·) has output variance σ² (with standard init), and they are approximately independent, then Var(x_l) = Var(x_0) + l × σ². 
For l=50 blocks (100 sub-layers): Var(x_50) = Var(x_0) + 100σ² >> Var(x_0) if σ² is non-trivial.\n- GPT-2 scaling: in OpenAI's GPT-2 implementation, the c_proj weight in attention and the second linear in MLP are initialized with std=0.02/√(2N), where N=number of layers. This scales each sub-layer's output standard deviation by 1/√(2N), and therefore its variance by 1/(2N), making the total variance Var(Σ F_i) ≈ 2N × (σ/√(2N))² = 2N × σ²/(2N) = σ², independent of depth.\n- Why it matters: large initial logit variance means the initial softmax is highly peaked (initial predictions are very confident in random directions). This causes large initial gradients and unstable early training.","A":"Zero initialization for all layers would cause the symmetry problem. GPT-2 uses random initialization with a depth-scaled standard deviation for specific layers.","B":"","C":"LayerNorm normalizes the inputs to each sub-layer, preventing the sub-layer's INPUTS from having extreme variance. But the OUTPUTS of the sub-layers (the residual additions) still accumulate variance in the residual stream. Removing LayerNorm would worsen training, not fix the variance problem.","D":"Large initial logit variance causes: (1) overconfident initial predictions → large initial CE loss → large initial gradients → potential gradient explosion; (2) the model may converge to overconfident solutions. Proper initialization is critical for stable deep Transformer training."},"reference":"- GPT-2 paper: Radford et al., \"Language Models are Unsupervised Multitask Learners\" (2019), initialization section\n- Wang & Komatsuzaki, \"GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model\" (2021), explains the 1/√(2N) scaling"},{"section":"deep-learning","difficulty":"hard","id":"dl-h006","topicSlug":"regularization-and-normalization","orderIndex":6,"topic":"Regularization And Normalization","question":"You train a ViT-Large with BatchNorm instead of LayerNorm. With batch size B=512, training is stable and achieves 83% top-1 on ImageNet. 
You then try to deploy with batch size B=1 (single image inference) and find accuracy drops to 61%. Explain the exact mechanism causing the 22-point drop, and describe two ways to fix this without retraining.","options":{"A":"The accuracy drop is caused by missing gradients at batch size 1","B":"$2d","C":"Batch size 1 is insufficient for backpropagation, causing the accuracy drop","D":"The drop is caused by dropout at inference; call model.eval() to fix it"},"correct":"B","explanation":{"correct":"- BN at inference: BN uses stored running_mean and running_var, NOT the current batch's statistics. For B=1, the single sample's statistics are irrelevant — BN still uses the population-level running stats from training.\n- The real cause: during training with B=512, running stats are computed from diverse batches that closely approximate the true data distribution. The stored stats accurately capture the feature distribution. At B=1, the ISSUE is not that BN computes differently — it always uses running stats at inference. The issue is if the training set statistics poorly represent the test data (distributional mismatch, or if accumulation was unstable).\n- Why ViT specifically: ViT uses patch embeddings. BatchNorm across patches (not across the batch) can be problematic. LayerNorm, which normalizes per token, is better suited for variable per-image statistics.","A":"Backpropagation is not used at inference — there are no gradients at inference regardless of batch size.","B":"","C":"Batch size 1 is a valid inference setting for all neural networks. No backpropagation occurs at inference.","D":"model.eval() disables Dropout and switches BN to use running stats (not batch stats). If the model is already in eval mode, calling it again changes nothing. 
The issue is the running stats themselves, not Dropout."},"reference":"- Ioffe, \"Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models\" (2017): https://arxiv.org/abs/1702.03275"},{"section":"deep-learning","difficulty":"hard","id":"dl-h007","topicSlug":"cnn-architectures","orderIndex":7,"topic":"Cnn Architectures","question":"A ResNet-50 model is used for transfer learning on a 5-class satellite imagery task (512×512 input). Standard ResNet-50 expects 224×224 input. You resize all images to 224×224 and fine-tune, achieving 87% accuracy. A colleague fine-tunes with 512×512 input (keeping all ResNet conv layers, only retraining the fully connected head at native resolution). Their model achieves 91%. What specific architectural property of ResNet enables the 512×512 model to work without retraining conv layers, and what would break if BatchNorm layers were frozen?","options":{"A":"ResNet works at any resolution because its parameters encode pixel values","B":"Fully convolutional property: ResNet's convolutional layers (conv1, conv2-5) are translation-equivariant and resolution-independent — the same filters slide across any spatial size. A filter trained on 224×224 features detects the same edge/texture/object-part patterns at 512×512. What changes: (1) the spatial output map is larger (e.g., before the final avgpool: 7×7 for 224 input → 16×16 for 512 input); (2) global average pooling (GAP) aggregates over the larger spatial map — producing the same 2048-d embedding regardless of spatial size. BatchNorm frozen issue: BN's running_mean and running_var were computed during training on 224×224 inputs. At 512×512, the features at each spatial position have different statistics — the network is computing activations over a larger receptive field at each position. If BN is frozen, the normalization uses 224-statistics for 512-activations → potentially biased normalization at every layer. 
Fix: unfreeze BN to recompute running stats during fine-tuning.","C":"ResNet processes 512×512 by splitting the image into four 256×256 tiles","D":"The 512×512 model works because the fully connected head automatically adapts to any spatial input"},"correct":"B","explanation":{"correct":"- Resolution independence of convolution: conv(x, W) is defined for any input size. The same weight W slides across all positions. This is the foundational property that makes CNNs usable for different input sizes.\n- GAP effect: Global Average Pooling averages over all spatial positions: GAP(x) = mean_spatial(x). For 512×512 input, the spatial map before GAP is larger (more positions to average), but the output is still 2048-d. The network correctly handles this.\n- Accuracy improvement (87% → 91%): higher resolution provides more detailed spatial information about satellite features (road edges, building outlines) that are lost at 224×224. The CNN extracts finer features at 512×512.","A":"Parameters encode filter patterns (e.g., edge detection), not pixel values. The same filter works for any resolution because it detects local patterns regardless of the overall image size.","B":"","C":"ResNet processes the full image in a single forward pass. There is no internal tiling mechanism.","D":"The fully connected head is fixed size (2048 → num_classes). It's the GAP layer that makes the output resolution-independent, not the FC head. Without GAP, an FC head would require a fixed spatial input."},"reference":"- Long et al., \"Fully Convolutional Networks for Semantic Segmentation\" (2015): https://arxiv.org/abs/1411.4038 — resolution independence"},{"section":"deep-learning","difficulty":"hard","id":"dl-h008","topicSlug":"rnn-lstm-gru","orderIndex":8,"topic":"Rnn Lstm Gru","question":"You train a 2-layer stacked LSTM (hidden size H=512) for language modeling. During training, you notice that the forget gate activations f_t average 0.97 across all time steps. 
A researcher says \"this is a problem: the forget gate should be closer to 0.5 to allow selective forgetting.\" You disagree. Who is correct, and what does f_t ≈ 0.97 indicate about what the LSTM has learned?","options":{"A":"The researcher is correct; forget gate near 1 means the LSTM is not learning to forget","B":"You are correct. f_t ≈ 0.97 indicates the LSTM has learned to maintain long-range dependencies: the cell state c_t ≈ 0.97 × c_{t-1} + new_info. At each step, 97% of previous cell state is retained. For language modeling, much information (subject of a sentence, topical context) must persist for many steps. A forget gate near 0.5 would cause the cell state to decay to half its value every step, an effective memory of only ~14 steps (0.5^14 ≈ 10^{-4}). Language requires context windows much longer than 14 steps. The LSTM has learned: \"retain most information continuously; selectively add new information.\" This is correct behavior for long-range language modeling. A forget gate near 0.5 would be appropriate for tasks requiring rapid context switching.","C":"Both gates' values are irrelevant; only the final hidden state h_T matters","D":"f_t ≈ 0.97 causes exploding gradients through the cell state; this will destabilize training"},"correct":"B","explanation":{"correct":"- Forget gate learning: many implementations initialize the forget-gate bias to 1, which starts the gate at about sigmoid(1) ≈ 0.73 (biased toward retention), specifically because long-range memory is more useful initially. During training, the gate may move closer to 1 for language tasks.\n- Effective memory horizon: with f_t = 0.97, the effective memory (exponential decay) has time constant τ = -1/log(0.97) ≈ 33 steps. Information from 33 steps ago is attenuated to e^{-1} ≈ 37% of its original value, a useful working memory for language.\n- Task-dependent: for speech modeling with short phoneme dependencies, f_t might converge lower. 
For language, high forget gate is expected and correct.","A":"A forget gate near 1 means the model RETAINS information (not forgets). The gate name is counterintuitive — \"forget gate ≈ 1\" means \"don't forget.\" For long-range language dependencies, this is desirable.","B":"","C":"The hidden state h_t = o_t ⊙ tanh(c_t) is computed from the cell state c_t at every step. The cell state is what provides long-term memory. The forget gate's value directly determines the long-range gradient flow through c_t.","D":"LSTM gradient through the cell state: ∂c_t/∂c_{t-1} = f_t. For f_t=0.97, the gradient is multiplied by 0.97 at each step. This is much better than vanilla RNN (which has products of full Jacobians). Cell state gradient at step T for step 1: 0.97^T → small but non-zero. This is the key innovation — the additive cell update preserves gradients far better than the multiplicative RNN recurrence."},"reference":"- Hochreiter & Schmidhuber, \"Long Short-Term Memory\" (1997)\n- Gers et al., \"Learning to Forget: Continual Prediction with LSTM\" (1999) — forget gate initialization"},{"section":"deep-learning","difficulty":"hard","id":"dl-h009","topicSlug":"attention-and-transformers-dl","orderIndex":9,"topic":"Attention And Transformers Dl","question":"You implement multi-head attention from scratch. After careful testing, you discover that for long sequences (T=2048), the softmax of QK^T/√d_k produces attention weights extremely close to one-hot (one position gets ≈1.0, all others ≈0). The model fails to aggregate information across positions. Explain the mathematical cause, and why increasing d_k from 64 to 256 worsens the problem rather than alleviating it.","options":{"A":"Increasing d_k from 64 to 256 improves attention; the one-hot behavior is a data problem","B":"The problem is softmax saturation from large dot-product magnitudes. For Q, K ~ N(0, 1) (standard initialization), Q_i · K_j ~ N(0, d_k) — variance grows linearly with d_k. 
For d_k=64: typical dot products have std = √64 = 8. softmax receives inputs with std=8; the maximum logit is ~3×8=24, and e^{24} ≈ 2.6×10^{10} times the weight of a logit at 0, so the weight concentrates on the largest logit (near one) and is near zero for almost all others. The 1/√d_k scaling in QK^T/√d_k divides by √64=8: effective std becomes 1, so softmax inputs are manageable. If you increase d_k to 256 WITHOUT adjusting initialization: Q_i · K_j ~ N(0, 256); std = 16; the 1/√256=1/16 scaling divides by 16, so the std is still 1. So the formula is correct IF the scaling is applied. The bug is likely a missing 1/√d_k scaling factor when d_k changed. If scaling is always applied, d_k=256 should work the same as d_k=64.","C":"One-hot attention is desirable; it means the model has learned to focus precisely","D":"The problem is the temperature; increase softmax temperature to flatten the distribution"},"correct":"B","explanation":{"correct":"- Dot product variance: Q ∈ ℝ^{d_k}, K ∈ ℝ^{d_k} with iid N(0,1): dot product Q·K = Σᵢ QᵢKᵢ, which is a sum of d_k products of N(0,1) variables. Variance = d_k (sum of d_k terms each with variance 1). Std = √d_k.\n- Scaling: 1/√d_k brings the variance to 1. Softmax of inputs with std=1 is well-behaved.\n- If 1/√d_k is applied correctly, changing d_k shouldn't affect softmax saturation. The described worsening (d_k=256 → worse) suggests the 1/√d_k scaling is NOT being updated when d_k changes (a common implementation bug when hardcoding the scaling factor).","A":"The one-hot behavior is a mathematical consequence of high-variance dot products, not a data problem. It's deterministically solvable by the 1/√d_k scaling.","B":"","C":"Useful attention aggregates information from multiple positions (soft attention). For tasks like translation (\"the professor who...\" needs to link \"professor\" to the subject of a relative clause), one-hot attention means only one position's information is used, losing context. 
One-hot is occasionally correct (e.g., copying) but should be learned, not forced.","D":"In softmax(z/T), the scalar T is the temperature. The 1/√d_k scaling already sets a temperature of √d_k. \"Increasing temperature\" means using 1/(√d_k × T) with T>1, which makes attention more uniform; this is the correct direction (flatter attention = less one-hot). But the right fix is using the 1/√d_k scaling correctly, not an additional temperature."},"reference":"- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3.2.1, Scaled Dot-Product Attention, explanation of 1/√d_k"},{"section":"deep-learning","difficulty":"hard","id":"dl-h010","topicSlug":"forward-propagation","orderIndex":10,"topic":"Forward Propagation","question":"You profile a Transformer's forward pass on an A100 GPU and find that a linear layer (d_model=4096 → d_ff=16384, batch=1, seq=512) achieves only 12% of theoretical FLOPs utilization (MFU). A colleague achieves 58% MFU on the same hardware with a different batch size. Explain why compute efficiency is so low at batch=1, and what specific memory hierarchy property causes the bottleneck.","options":{"A":"The bottleneck is the softmax operation; replace it with linear attention","B":"The bottleneck is memory bandwidth, not compute. The linear layer multiplies X (512×4096) by W (4096×16384). FLOPs = 2 × 512 × 4096 × 16384 ≈ 68B. Time to load W from HBM: W has 4096×16384 × 2 bytes (FP16) = 128 MB. A100 HBM bandwidth = 2 TB/s. Load time = 128 MB / 2 TB/s = 64 μs. A100 peak compute = 312 TFLOPS (BF16). Compute time at peak = 68 GFLOPs / 312 TFLOPs/s ≈ 220 μs. The 64 μs weight load is therefore about 30% of the ideal compute time, a fixed cost that is amortized over only 512 tokens. The arithmetic intensity = FLOPs / bytes = 68G / 128M = 531 FLOP/byte. A100 ridge point (compute/bandwidth) = 312T / 2T = 156 FLOP/byte. Since 531 > 156, in theory this should be compute-bound. 
The catch: at batch=1, seq=512, the output X (small) is reused, but W must be loaded once per forward pass. With larger batch, the FLOPs increase while the weight loading stays constant → higher arithmetic intensity → compute-bound → higher MFU.","C":"The bottleneck is Python interpreter overhead; use TorchScript to fix it","D":"12% MFU is normal for batch size 1; no optimization is possible"},"correct":"B","explanation":{"correct":"- Roofline model: operations below the \"ridge point\" (arithmetic intensity < compute/bandwidth ratio) are memory-bound. Above the ridge point: compute-bound.\n- Batch size effect: with batch=32, seq=512: FLOPs = 32 × 512 × 4096 × 16384 × 2 ≈ 2.2T. Weight loading: still ~128 MB (reused for all 32 samples). Arithmetic intensity = 2.2T / 128M = 17,188 FLOP/byte >> 156 ridge point. Now strongly compute-bound.\n- This is why LLM inference is memory-bound (small batches) and training is compute-bound (large batches). Systems like continuous batching (vLLM) increase the effective batch size to improve GPU utilization.","A":"Softmax is not the bottleneck for a standard linear layer. The profiling shows the linear layer itself at 12% MFU — softmax is a different operation.","B":"","C":"TorchScript reduces Python overhead (relevant for small operations where Python is the bottleneck). For 68 GFLOPs operations running on a GPU, Python overhead is negligible (<1 μs for kernel launch vs 64 μs for memory bandwidth).","D":"12% MFU for a production model is too low — significant optimization is possible. 
Batching multiple requests (as vLLM does), quantization (4-bit weights cut weight memory traffic to roughly a quarter of FP16), and weight streaming optimizations can improve this substantially."},"reference":"- Karpathy, \"The GPU Computational Bottleneck and Why Batch Size Matters\", llm.c discussions\n- Williams et al., \"Roofline: An Insightful Visual Performance Model for Multicore Architectures\" (2009)"},{"section":"deep-learning","difficulty":"hard","id":"dl-h011","topicSlug":"loss-and-cost-functions","orderIndex":11,"topic":"Loss And Cost Functions","question":"You train an object detection model with Focal Loss: FL(p_t) = -(1 - p_t)^γ log(p_t), with γ=2. During training on a dataset with 1000 background examples for every 1 foreground example, the loss is dominated by background. A colleague sets γ=5 to down-weight easy backgrounds more aggressively. At γ=5, training loss decreases faster initially but final mAP is 3 points lower than γ=2. What is the mathematical mechanism causing the degradation?","options":{"A":"Higher γ always improves Focal Loss; the mAP drop is caused by insufficient training epochs","B":"With γ=5, easy backgrounds (p_background ≈ 0.999, p_t ≈ 0.999): weight = (1-0.999)^5 = (0.001)^5 = 10^{-15}. These examples contribute almost zero gradient signal. Hard foreground (p_t = 0.5): weight = (0.5)^5 = 0.031. For γ=2: easy background weight = (0.001)^2 = 10^{-6}; hard foreground weight = (0.5)^2 = 0.25. The ratio hard/easy changes from 0.25/10^{-6} = 250,000 at γ=2 to 0.031/10^{-15} = 3.1×10^{13} at γ=5. At γ=5, the loss is computed from an extremely small effective sample: only the hardest examples contribute meaningful gradients. This creates high-variance gradient estimates (few examples dominate) and overfits to the specific hard examples in each mini-batch. 
With γ=2, a broader set of semi-hard examples provides more stable, generalizing gradients.","C":"γ=5 is equivalent to hard example mining; the mAP drop is expected and acceptable","D":"The issue is the learning rate; reduce LR when using γ=5"},"correct":"B","explanation":{"correct":"- Gradient variance analysis: at γ=5, only examples with p_t ∈ [0.3, 0.7] receive substantial gradients. For 1000:1 imbalance, only a tiny fraction of the mini-batch contributes usable signal.\n- Effective batch size reduction: with γ=5 and 99.9% of examples being background with high confidence, the effective learning signal comes from << 0.1% of examples. The mini-batch gradient estimate has extremely high variance.\n- Optimal γ: Lin et al. (2017) showed γ=2 is optimal for COCO detection. γ ∈ [0.5, 5] were tested; γ=2 provided the best mAP. Higher γ reduces loss too aggressively on easy examples, hurting gradient quality.","A":"Faster initial loss decrease does not imply better final mAP. A model can rapidly minimize the extremely-down-weighted easy backgrounds while poorly learning to distinguish hard cases due to noisy gradient estimates.","B":"","C":"Hard example mining (OHEM) selects a fixed number of hard examples per batch, providing stable sample counts. γ=5 provides no such stability — the effective sample count varies per batch based on what the model finds easy/hard at each step.","D":"Lower LR with γ=5 would slow down already-noisy gradient updates, not address the fundamental gradient variance problem from sparse effective samples."},"reference":"- Lin et al., \"Focal Loss for Dense Object Detection (RetinaNet)\" (2017): https://arxiv.org/abs/1708.02002 — Table 1: γ comparison"},{"section":"deep-learning","difficulty":"hard","id":"dl-h012","topicSlug":"ann-architectures","orderIndex":12,"topic":"Ann Architectures","question":"A team trains a wide MLP (1 hidden layer, 65536 neurons) vs a deep MLP (8 hidden layers, 256 neurons each). 
Both have approximately equal parameter counts. On MNIST, both achieve ~99% accuracy. On a hierarchical image composition task (parts → objects → scenes), the deep model significantly outperforms the wide model. Explain from the circuit complexity perspective why depth provides an exponential advantage for hierarchical functions, and what the \"number of linear regions\" argument says.","options":{"A":"Deep models outperform wide models because deep models have more parameters","B":"Circuit complexity argument: a hierarchical function f(x) = h₃(h₂(h₁(x))) where each hᵢ extracts features from the previous level cannot be computed efficiently by a shallow circuit without exponential width. A deep ReLU network with d layers can compute functions that require exponential (in d) width for any shallow network. Formally: functions composable as depth-k circuits require O(2^k) neurons in a 1-hidden-layer network but only O(k × poly(n)) in a depth-k network. Number of linear regions: a ReLU network with L layers and N total neurons can produce O((N/L)^(L-1) × N) linear regions in the input space. Equivalently, deep networks produce exponentially more linear regions (decision boundaries) than shallow networks of the same parameter count. For parts→objects→scenes: each layer learns a higher-level composition. The wide model must represent ALL compositions in a single layer — exponentially harder than sequential composition.","C":"The advantage is numerical, not structural; deeper models have better gradient flow","D":"Width and depth are equivalent for any function; the task difference is due to training, not architecture"},"correct":"B","explanation":{"correct":"- Montufar et al. (2014) formal result: a deep ReLU network with L hidden layers, each width n, creates at least (n/⌊n/2⌋)^{(L-1)} × (1/2 × Σᵢ binomial(n-1, i)) linear regions. The key factor is exponential in L. 
A 1-hidden-layer network of the same parameters creates only polynomial regions.\n- Compositional bias: deep networks naturally implement hierarchical computations (layer 1: edges, layer 2: textures, layer 3: parts, layer 4: objects). Wide shallow networks must encode all hierarchical relations in a single transformation.\n- MNIST exception: MNIST digits have minimal hierarchical structure (simple strokes), so wide and deep models perform similarly. Hierarchical tasks (scenes, language syntax) benefit from depth.","A":"Both networks have approximately equal parameter counts, ruling out parameter count as the explanation.","B":"","C":"Gradient flow is a training concern. The exponential advantage is a representational (architectural) property — even with perfect optimization, the shallow model needs exponential width.","D":"Barron's theorem and circuit complexity theory formally prove that certain functions cannot be efficiently represented shallowly. The advantage is not just empirical."},"reference":"- Montufar et al., \"On the Number of Linear Regions of Deep Neural Networks\" (2014): https://arxiv.org/abs/1402.1869"},{"section":"deep-learning","difficulty":"hard","id":"dl-h013","topicSlug":"self-supervised-and-contrastive-learning","orderIndex":13,"topic":"Self Supervised And Contrastive Learning","question":"In SimCLR, the InfoNCE loss is: L = -log(exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ)). You run two experiments: (A) batch size N=256, τ=0.07 and (B) batch size N=2048, τ=0.07. Experiment B achieves significantly higher downstream accuracy. 
Beyond more negative samples, what specific learning dynamics does the larger batch size change, and what does the temperature τ control about the difficulty of negatives?","options":{"A":"Larger batch size only adds more negative samples; the learning dynamics are the same","B":"$2e","C":"Larger batch causes worse results because false negatives increase","D":"Temperature τ=0.07 has no effect on learning; only τ=0 and τ=∞ are meaningfully different"},"correct":"B","explanation":{"correct":"- False negative concern: with N=2048 from ImageNet, ~2000 negatives are from different classes. But with 1000 classes, ≈2047/1000 ≈ 2 negatives are from the same class as the anchor (false negatives). Studies show this reduces accuracy slightly, but the benefit of hard negatives dominates.\n- Temperature interpretation: L = Σ_pos[sim/τ] - log(Σ_all exp(sim/τ)). Low τ focuses learning on high-similarity pairs. If the hardest negative has sim=0.6 and the positive has sim=0.8: with τ=0.07: exp((0.6-0.8)/0.07) = exp(-2.86) ≈ 0.057. With τ=1: exp(0.6-0.8) = exp(-0.2) ≈ 0.82. Low τ makes the hard negative much less competing — but since we want to push it away, the gradient signal is actually largest when the negative is close to the positive (high sim negative → high loss → large gradient).\n- Chen et al. (2020) ablation: batch size 256 → 76.5% top-1; batch size 4096 → 82.9% top-1 on ImageNet with linear evaluation.","A":"The learning dynamics DO change beyond count: the distribution of difficulty of negatives changes, gradient variance changes, and the interaction with temperature changes.","B":"","C":"False negatives (same-class treated as negative) are a real concern but empirically don't outweigh the benefits of more hard negatives. Studies that explicitly handle false negatives (e.g., Debiased Contrastive Learning) improve results but start from the strong N=2048 baseline.","D":"Temperature fundamentally reshapes the loss landscape. 
τ→0 approaches hard-max (only the hardest negative matters). τ→∞ makes all negatives equally weighted (uniform, no hard negative focus). τ=0.07 is a strong hard-negative-focusing temperature."},"reference":"- Chen et al., \"A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)\" (2020): https://arxiv.org/abs/2002.05709 — Appendix B: batch size and temperature ablations"},{"section":"deep-learning","difficulty":"hard","id":"dl-h014","topicSlug":"graph-neural-networks","orderIndex":14,"topic":"Graph Neural Networks","question":"Two molecules: (A) benzene (cyclic, all carbons connected in a ring) and (B) cyclohexane (same ring structure but saturated, no double bonds). Their molecular graphs have identical topology. A 3-layer standard GCN predicts identical properties for both. A more expressive GNN correctly distinguishes them. What fundamental limitation of standard message-passing GNNs causes this failure, and what is the Weisfeiler-Leman (WL) test connection?","options":{"A":"GCNs cannot process cyclic graphs; use only tree-structured GNNs for molecules","B":"Standard message-passing GNNs (MPNNs) are bounded in expressiveness by the 1-Weisfeiler-Leman (1-WL) graph isomorphism test. The 1-WL test: iteratively color each node by a hash of its color + sorted neighbor colors. Two graphs are distinguished if their final color histograms differ. If the 1-WL test fails to distinguish two graphs, no standard MPNN can distinguish them either (Xu et al. 2019, GIN paper). For benzene vs cyclohexane: both have the same ring topology and all nodes have the same degree (2 bonds each). The ONLY structural difference is in edge features (double bonds in benzene vs single bonds in cyclohexane). Standard GCNs with only node features and a uniform adjacency matrix cannot incorporate edge type — they see the same graph. Fix: (1) Add edge features to message passing: m_{ij} = φ(h_i, h_j, e_{ij}) where e_{ij} is the edge type (bond order). 
(2) Use higher-order GNNs (k-WL, PPGN) that track subgraph structures.","C":"The issue is insufficient depth; add more GCN layers to distinguish the molecules","D":"Standard GCN distinguishes benzene from cyclohexane correctly; the premise is wrong"},"correct":"B","explanation":{"correct":"- 1-WL bound: Xu et al. (2019) proved that if two graphs are indistinguishable by the 1-WL test (their iterative coloring produces the same histogram), then any sum-aggregation MPNN assigns them the same representation.\n- Benzene vs cyclohexane topologically: C₆H₆ (benzene) and C₆H₁₂ (cyclohexane) have different molecular formulas — but if the GNN only sees carbon nodes and undirected bonds, both appear as a 6-cycle with same-degree nodes. The aromatic ring information lives in bond type (not captured by standard adjacency).\n- Fix with edge features: DirectedMP or DMPNN (message passing on directed edges) with bond-type features correctly distinguishes benzene (aromatic/double) from cyclohexane (single bonds).","A":"GCNs can process cyclic graphs. The adjacency matrix correctly represents cycles. The issue is expressiveness (which substructures are captured), not the presence of cycles.","B":"","C":"Adding more layers doesn't solve the expressiveness bound. With identical node features and the same adjacency, all layers produce the same aggregated representations for both molecules, regardless of depth.","D":"A standard GCN with only carbon atom identity as node features and binary adjacency CANNOT distinguish benzene from cyclohexane — they have the same topology and node types. 
This is a known limitation motivating edge-featured and higher-order GNNs."},"reference":"- Xu et al., \"How Powerful are Graph Neural Networks (GIN)\" (2019): https://arxiv.org/abs/1810.00826 — WL expressiveness theorem"},{"section":"deep-learning","difficulty":"hard","id":"dl-h015","topicSlug":"transfer-learning","orderIndex":15,"topic":"Transfer Learning","question":"You fine-tune LLaMA-3-8B on a legal contract analysis task using LoRA (r=16, α=32, target_modules=[q_proj, v_proj]). After fine-tuning, the model excels at legal tasks but its general reasoning performance (MMLU) drops from 68% to 54%. Identify two distinct mechanisms causing MMLU degradation and propose a fine-tuning strategy that limits MMLU degradation to < 2% while maintaining legal task performance.","options":{"A":"MMLU degradation is unavoidable with LoRA; accept the 14-point drop","B":"$2f","C":"The degradation is caused by insufficient training data; add more legal examples","D":"Fine-tune k_proj instead of q_proj and v_proj; key projection doesn't affect reasoning"},"correct":"B","explanation":{"correct":"- Attention pattern modification: q_proj and v_proj directly control what the model attends to (q_proj) and what information is extracted from attended positions (v_proj). Legal text has very different attention patterns than MMLU reasoning (e.g., attending back to defined terms in contracts vs attending to relevant context in multiple-choice).\n- EWC (Elastic Weight Consolidation): Fisher information matrix F estimates how important each parameter is for the original task. EWC loss = L_legal + λ Σᵢ Fᵢ × (θᵢ - θ*ᵢ)². This constrains LoRA parameters critical for MMLU reasoning.\n- Practical solution: modern PEFT libraries (HuggingFace PEFT) allow task vectors — train separate LoRA adapters for each task, then interpolate. Serve the legal adapter for legal tasks, keep base model for MMLU.","A":"MMLU drops larger than 5% indicate significant forgetting. 
LoRA is specifically designed to minimize forgetting — a 14% drop suggests misconfigured LoRA (too high rank/scale, wrong target modules). It IS avoidable.","B":"","C":"More legal data would increase legal performance but would also increase the magnitude of LoRA updates, potentially increasing MMLU interference. The problem is the DIRECTION of adaptation, not the quantity of training.","D":"k_proj (key projection) directly participates in the query-key dot product that computes attention scores — it's just as important to reasoning as q_proj. There's no principled reason to expect k_proj modification to be less disruptive."},"reference":"- Hu et al., \"LoRA: Low-Rank Adaptation of Large Language Models\" (2022): https://arxiv.org/abs/2106.09685\n- Kirkpatrick et al., \"Overcoming catastrophic forgetting in neural networks (EWC)\" (2017): https://arxiv.org/abs/1612.00796"},{"section":"deep-learning","difficulty":"hard","id":"dl-h016","topicSlug":"neurons-and-perceptrons","orderIndex":16,"topic":"Neurons And Perceptrons","question":"A network uses the formula h = σ(W₂ σ(W₁ x + b₁) + b₂). You want to prove this 2-layer MLP can represent any function f: ℝ → ℝ on [0,1]. A student cites the Universal Approximation Theorem (UAT) and says it requires infinite neurons. You argue it requires only O(1/ε²) neurons for a function in a specific smoothness class. What is the smoothness condition that makes efficient approximation possible, and what does Barron's theorem say?","options":{"A":"Any function in L² requires O(1/ε) neurons; smoothness doesn't affect neuron count","B":"Barron's theorem (1993): A function f: ℝⁿ → ℝ is in Barron's class if its frequency-domain L¹ norm is finite: C_f = ∫ ||ω|| |f̂(ω)| dω < ∞ (where f̂ is the Fourier transform). For such functions: a 1-hidden-layer network with m neurons achieves L² approximation error ε = O(C_f / √m). Equivalently: to achieve error ε, you need m = O(C_f² / ε²) neurons — polynomial in 1/ε, not exponential. 
For comparison: without Barron's condition (non-smooth functions or high-frequency content), achieving ε error with a fixed-degree polynomial approximation may require exponentially many terms. Smoothness condition: finite C_f means f's Fourier representation has decaying high-frequency content — the function doesn't oscillate wildly (bounded variation in Fourier domain).","C":"UAT requires infinite neurons in general; no bounded-neuron guarantee exists","D":"Barron's theorem applies only to sigmoid activations; ReLU networks have no such guarantee"},"correct":"B","explanation":{"correct":"- C_f interpretation: functions with large C_f (high Fourier L¹ norm) have lots of high-frequency content (sharp corners, rapid oscillations). These require more neurons to approximate. Smooth functions (small C_f) are approximable with fewer neurons.\n- Dimension-free result: Barron's theorem is notable because the m = O(C_f²/ε²) bound does NOT depend on the input dimension n. This is \"the blessing of Barron's class\" — deep learning avoids the curse of dimensionality for this function class.\n- Practical connection: why do neural networks work well in practice? Natural language, images, and audio signals have decaying Fourier spectra (they're \"smooth enough\" to be in Barron's class). Pure adversarial examples often exploit high-frequency perturbations — they leave Barron's class.","A":"The claim \"any L² function requires O(1/ε)\" is false. L² includes highly non-smooth functions (e.g., random noise) that require exponentially many neurons. The O(1/ε²) bound is specific to Barron's class.","B":"","C":"The original UAT (Cybenko, Hornik) only proves existence (infinite neurons are sufficient). Barron's theorem provides the constructive bound. UAT doesn't say \"infinite neurons are necessary\" — Barron's provides the polynomial guarantee for smooth functions.","D":"Barron's original theorem used sigmoid activations as a constructive proof. 
The result has been extended to ReLU networks by subsequent work (e.g., Klusowski & Barron 2018). The class of expressible functions is similar for both."},"reference":"- Barron, \"Universal Approximation Bounds for Superpositions of a Sigmoidal Function\" (1993): IEEE Transactions on Information Theory\n- Bach, \"Breaking the Curse of Dimensionality with Convex Neural Networks\" (2017): https://arxiv.org/abs/1412.8690"},{"section":"deep-learning","difficulty":"hard","id":"dl-h017","topicSlug":"backpropagation","orderIndex":17,"topic":"Backpropagation","question":"You debug a custom attention implementation. The forward pass is correct but loss.backward() gives incorrect gradients for W_Q. You verify using the finite difference method: ∂L/∂W_Q ≈ (L(W_Q + εe_ij) - L(W_Q - εe_ij)) / (2ε). The finite difference gradient is 0.153 but autograd gives 0.089. The discrepancy is consistent across multiple inputs. What are the three most likely implementation errors in the custom backward pass for W_Q?","options":{"A":"The finite difference check has numerical errors; trust autograd only","B":"Three likely errors in the custom backward pass for W_Q: (1) Missing factor from scaling: attention = softmax(QK^T / √d_k) V. The backward pass for W_Q must propagate through both the 1/√d_k scaling AND through the softmax Jacobian. If the implementation applies the softmax gradient but forgets to multiply by 1/√d_k, the gradient is too large by a factor of √d_k (or too small by the same factor if the scaling is applied twice). (2) Softmax Jacobian error: ∂softmax(z)/∂z = diag(s) - s·sᵀ where s=softmax(z). A common error: computing only the diagonal (treating softmax as element-wise) and ignoring the s·sᵀ outer product. This is the most common softmax backward error. 
(3) Incorrect gradient accumulation through multi-head attention: if W_Q is shared across heads (or gradients from multiple heads are summed), forgetting to sum gradients from all heads or dividing instead of summing causes systematic underestimation (here the custom gradient is only about 0.58× the finite-difference value).","C":"The finite difference check is correct; autograd has a bug in PyTorch","D":"W_Q gradient of 0.089 is correct; the finite difference approximation is too coarse"},"correct":"B","explanation":{"correct":"- Gradient check validation: finite difference is the gold standard for custom backward passes. If FD and autograd disagree consistently (beyond numerical noise), the custom backward implementation has a bug. For standard differentiable operations with a sensible ε, the finite-difference value is the one to trust.\n- Softmax Jacobian: ∂L/∂z_i = ∂L/∂s × ∂s/∂z_i = Σ_j (∂L/∂s_j)(s_j(δ_{ij} - s_i)) = s_i(∂L/∂s_i - Σ_j s_j ∂L/∂s_j). This requires the full outer product, not just the diagonal. Omitting the s·sᵀ term gives a systematically wrong gradient.\n- Scaling factor: the 1/√d_k factor must be correctly propagated. ∂(QK^T/√d_k)/∂Q = K/√d_k. Missing this factor would cause the gradient to be √d_k × too large, not too small, so it would explain autograd > FD by a factor of √d_k. The observed autograd < FD points instead to the factor being applied twice, or to errors (2) or (3).","A":"Finite difference check on a correctly computed loss function is numerically accurate for ε values like 1e-5 (for float64). A consistent discrepancy (0.153 vs 0.089) is too large to be numerical error.","B":"","C":"PyTorch's autograd is extremely well-tested and correct for standard operations. The custom backward pass is user-implemented; that's where the bug is.","D":"ε=1e-5 for finite differences typically gives 8-digit accuracy for smooth functions in float64. 
A discrepancy of 0.064 (42% error) is far beyond numerical error and indicates a logic bug."},"reference":"- CS231n, \"Computing Gradients: Numerical Gradient Checking\" — gradient check methodology\n- Vaswani et al., \"Attention Is All You Need\" (2017): Section 3.2 — attention derivation"},{"section":"deep-learning","difficulty":"hard","id":"dl-h018","topicSlug":"regularization-and-normalization","orderIndex":18,"topic":"Regularization And Normalization","question":"RMSNorm (used in LLaMA) vs LayerNorm (used in BERT): RMSNorm(x) = x / RMS(x) × γ where RMS(x) = √(1/n Σxᵢ²). LayerNorm(x) = (x - μ) / σ × γ + β. RMSNorm removes the mean-centering step and has no β (shift) parameter. A researcher claims \"RMSNorm is strictly worse because it loses the centering invariance.\" Construct a counterargument using the re-centering invariance property of transformers.","options":{"A":"The researcher is correct; LayerNorm's centering is necessary for all architectures","B":"Counterargument: in transformer architectures with residual connections, the network is shift-invariant to the bias β. The output of a transformer block is x' = x + F(Norm(x)). If LayerNorm's learned β introduces a constant shift to every token's representation, this shift propagates through the residual: x'_l = x_0 + Σᵢ F_i(LN(xᵢ)). Since F contains linear layers (W × (·) + b), any constant shift from β can be absorbed into the bias b of the subsequent linear layer. Therefore β is redundant and provides no additional expressiveness — the subsequent linear bias can represent the same function. For μ centering: in a residual network, the mean component of x can also be absorbed into downstream biases. RMSNorm is computationally simpler (no mean subtraction), numerically more stable (denominator is always ≥ 0, no cancellation errors), and achieves equivalent expressiveness. 
LLaMA and Mistral empirically match or exceed BERT's performance with RMSNorm.","C":"RMSNorm and LayerNorm are identical; β and μ have no effect on outputs","D":"RMSNorm is better because it has fewer parameters and always outperforms LayerNorm"},"correct":"B","explanation":{"correct":"- Redundancy of β in residual networks: for any constant vector c = β (from LayerNorm), the following linear layer W×(·) + b produces W×c + b — the β contribution is equivalent to a constant additive term to b. Since b is already a learned parameter, β adds no new expressive capacity; it's absorbed.\n- Mean subtraction redundancy: x - μ removes the mean component. But the subsequent linear W×(·) + b already has a bias b that can shift the mean. Again, centering is not strictly necessary when biases exist in downstream layers.\n- Computational benefit: RMSNorm avoids computing the mean (one pass through the data), only computes the root-mean-square (also one pass). Marginally faster, and more numerically stable (no chance of cancellation errors from (x-μ) when x ≈ μ).","A":"The \"centering invariance\" argument assumes the network lacks other components that can compensate. In transformer blocks with residual connections and bias terms, the centering is indeed redundant.","B":"","C":"LayerNorm with β=0 and μ subtraction still differs from RMSNorm in behavior: LayerNorm normalizes to zero mean, RMSNorm normalizes by RMS only (mean can be non-zero after RMSNorm). They produce different outputs. But the LEARNED MODEL can achieve equivalent final representations.","D":"RMSNorm empirically matches or exceeds LayerNorm — it's not universally better for all architectures. 
For non-residual networks, the redundancy argument breaks down and mean-centering may matter."},"reference":"- Zhang & Sennrich, \"Root Mean Square Layer Normalization\" (2019): https://arxiv.org/abs/1910.07467"},{"section":"deep-learning","difficulty":"hard","id":"dl-h019","topicSlug":"cnn-architectures","orderIndex":19,"topic":"Cnn Architectures","question":"You design a mobile CNN using depthwise separable convolutions. The model achieves 78% top-1 on ImageNet but runs at 15ms on a mobile CPU (target: < 10ms). A profiler shows the 1×1 pointwise convolutions consume 82% of the latency, even though they have fewer FLOPs than standard convolutions. Explain the counterintuitive result and what architectural modification addresses this.","options":{"A":"Reduce the number of channels; fewer channels always reduce latency proportionally","B":"Counterintuitive cause: FLOPs ≠ latency. The 1×1 pointwise conv computes C_in × C_out multiplications per pixel. For MobileNet with C=512 channels: 512×512 = 262,144 multiplications per pixel. While this is fewer FLOPs than a standard 3×3 convolution (9×512×512 ≈ 2.36M multiplications per pixel), it is far more than the 3×3 depthwise (9×512 = 4,608 multiplications per pixel), and the 1×1 conv accesses W (512×512 weights = 1MB in float32) that must be streamed from cache or memory for each position. The 3×3 depthwise filter is 9×512 = 4,608 weights (about 18KB in float32) — typically fits in L1 cache. The 1×1 pointwise conv has low arithmetic intensity (few FLOPs per byte loaded) — it's memory-bandwidth bound, not compute-bound. Fix 1: Bottleneck design (MobileNetV2 inverted residual): expand channels for the 3×3 depthwise (compute-efficient at expanded dim), then project down with 1×1 (smaller C_out). 
Fix 2: Channel shuffling (ShuffleNet): replace full 1×1 with grouped 1×1 + channel shuffle to maintain cross-group mixing at reduced compute.","C":"The latency issue is caused by Python overhead in the forward pass; use TorchScript","D":"1×1 convolutions are always faster than 3×3 convolutions; the profiler is incorrect"},"correct":"B","explanation":{"correct":"- Arithmetic intensity of 1×1: for each output pixel: FLOPs = 2 × C_in × C_out. Memory: C_in × C_out × 4 bytes (weights, float32). Arithmetic intensity (counting weight traffic only, with weights re-read for every pixel) = 2 FLOPs / 4 bytes = 0.5 FLOP/byte. The CPU roofline ridge point is typically ~20 FLOP/byte for modern CPUs. At 0.5 FLOP/byte, the operation sits 40× below the ridge point, deep in the memory-bound regime.\n- 3×3 depthwise: FLOPs = 2 × 9 × C (independent channels). Memory: 9 × C × 4 bytes. Arithmetic intensity = 2 × 9 × C / (9 × C × 4) = 2/4 = 0.5 FLOP/byte — also memory-bound! But the weight tensor is about 57× smaller (4,608 vs 262,144 weights at C=512) → fits in L1 cache → effective bandwidth is much higher for depthwise.\n- MobileNetV2 fix: by using an inverted bottleneck (expand → depthwise → project), the depthwise operates at high channel count (high FLOP but cached weights), and the projection reduces channels (small weight tensor, fits in cache).","A":"Reducing channels proportionally reduces both FLOPs and latency — but may reduce accuracy. The optimization question is efficiency (latency-per-FLOP), not just latency.","B":"","C":"TorchScript reduces Python overhead (~microseconds). At 15ms total, Python overhead is < 1%. 
The bottleneck is memory bandwidth (10+ms in 1×1 convs).","D":"Smaller kernel size (1×1 vs 3×3) doesn't guarantee faster execution when the operation is memory-bandwidth bound and the weight tensor doesn't fit in cache."},"reference":"- Sandler et al., \"MobileNetV2: Inverted Residuals and Linear Bottlenecks\" (2018): https://arxiv.org/abs/1801.04381"},{"section":"deep-learning","difficulty":"hard","id":"dl-h020","topicSlug":"rnn-lstm-gru","orderIndex":20,"topic":"Rnn Lstm Gru","question":"You train a seq2seq model with attention for machine translation (English→German). The model achieves 28 BLEU on the test set. During error analysis, you find the model performs well on short sentences (< 15 tokens, 31 BLEU) but poorly on long sentences (> 40 tokens, 18 BLEU). The attention mechanism uses additive (Bahdanau) attention. What specific attention pathology causes long-sentence degradation, and what architectural change addresses it without switching to Transformers?","options":{"A":"Long sentences require more parameters; add more LSTM layers to fix the length degradation","B":"$30","C":"Long-sentence degradation is inevitable for all sequence models; accept lower BLEU","D":"The fix is to increase the attention dimensionality from 256 to 1024; more expressive alignment fixes the length problem"},"correct":"B","explanation":{"correct":"- Attention score diffusion: for uniform attention: each weight = 1/T. With T=50: max possible attention weight = 1 (one-hot). Mean = 1/50 = 0.02. The gradient signal for updating the encoder position that should receive attention is proportional to the attention weight. Diffused attention → small gradient signal to the \"right\" position → slower/weaker convergence.\n- Coverage mechanism: maintains a coverage vector c_t = Σ_{t' 65504 → overflow in FP16 without the max-subtraction trick. 
This is why attention logits can cause overflow for long sequences.\n- FlashAttention also avoids materializing the T×T matrix in HBM memory (stores only tiles in SRAM), solving both the numerical stability AND the memory bandwidth problem simultaneously.","A":"FP16 can achieve < 0.1% relative error vs FP32 for well-conditioned operations. The 8% error is a specific numerical pathology from the softmax, not an intrinsic FP16 limitation.","B":"","C":"Masking (adding -∞ to masked positions before softmax) doesn't cause the error for unmasked positions. The overflow/underflow issue is from the softmax computation itself over long sequences.","D":"FP16 precision (mantissa) affects the accuracy of each operation. The precision issue here is RANGE (exponent overflow), not mantissa bits. Longer sequences cause larger accumulated sums → range overflow."},"reference":"- Dao et al., \"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness\" (2022): https://arxiv.org/abs/2205.14135 — Algorithm 1: online softmax"},{"section":"deep-learning","difficulty":"hard","id":"dl-h030","topicSlug":"transfer-learning","orderIndex":30,"topic":"Transfer Learning","question":"You fine-tune CLIP (dual encoder: image encoder + text encoder) for a specialized medical image-text retrieval task. After fine-tuning with learning rate 1e-4 on 10,000 image-text pairs, the medical retrieval performance improves from 31% R@1 to 78% R@1. However, the model loses its zero-shot classification ability on general ImageNet-1k (from 76% → 23%). A colleague suggests task arithmetic: merge the fine-tuned model with the original CLIP using weight interpolation. 
What is the theoretical basis for task arithmetic, and predict the accuracy tradeoff curve at interpolation coefficient α ∈ {0, 0.25, 0.5, 0.75, 1.0}?","options":{"A":"Weight interpolation always degrades both tasks; do not interpolate","B":"$38","C":"Task arithmetic requires retraining both models jointly; interpolation is not valid","D":"The optimal α is always 0.5; no other coefficient can improve on this"},"correct":"B","explanation":{"correct":"- Linear mode connectivity: two models fine-tuned from the same pre-trained checkpoint often lie in the same loss basin. Linear interpolation between them stays within low-loss regions for BOTH tasks (loss barriers are small), enabling smooth tradeoff curves.\n- Task vector composition: τ_medical = θ_medical_ft - θ_CLIP. Adding ατ_medical to θ_CLIP scales the medical adaptation — α=0.5 adds half the medical specialization while retaining most of the original structure.\n- Practical use: WiSE-FT (Wortsman et al.) showed this interpolation consistently improves distribution shift robustness. Applied to CLIP, it achieves better ImageNet+OOD tradeoffs than either model alone.","A":"Empirically, task arithmetic (weight interpolation from the same pre-trained init) consistently produces points on the Pareto frontier between the two tasks — better than either endpoint for at least one task at no cost to the other.","B":"","C":"Ilharco et al. show that task vectors can be computed without joint retraining. The linear operation (θ_pretrained + ατ) is all that's needed.","D":"The optimal α depends on the relative importance of each task. For medical deployment prioritizing retrieval, α=0.75 may be better. For general-purpose with medical enhancement, α=0.25 may be better. 
There is no universal optimal α."},"reference":"- Ilharco et al., \"Editing Models with Task Arithmetic\" (2023): https://arxiv.org/abs/2212.04089\n- Wortsman et al., \"Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy and Robustness\" (2022): https://arxiv.org/abs/2203.05482"},{"section":"deep-learning","difficulty":"hard","id":"dl-h031","topicSlug":"activation-functions","orderIndex":31,"topic":"Activation Functions","question":"You deploy a quantized model (INT8 weights, INT8 activations) that uses GELU activation. The quantization calibration was done with 100 ImageNet batches. Post-quantization accuracy drops from 82.3% (FP32) to 74.1% (INT8) — an 8.2% drop, much larger than typical (<1% for well-quantized models). You suspect the GELU activation is the culprit. What specific quantization challenges does GELU pose compared to ReLU, and what quantization-aware technique mitigates this?","options":{"A":"GELU and ReLU have identical quantization behavior; the accuracy drop is from weight quantization","B":"$39","C":"Switch to symmetric INT8; this fixes the GELU quantization issue","D":"Use INT16 for GELU activations and INT8 elsewhere; mixed precision solves the problem"},"correct":"B","explanation":{"correct":"- Asymmetric vs symmetric quantization: symmetric INT8 maps [-R, R] to [-127, 127]. For GELU output range [-0.17, max_val]: max_val ≈ 10 in a well-trained model. Symmetric range ±10 wastes the range [-10, -0.17] (negative GELU territory), allocating 90% of negative range to a region with no activations.\n- QAT mechanism: during forward pass, insert fake quantize: x_q = round(x / scale) × scale. Backward: STE passes gradients through as if x_q = x. 
The model learns to keep activations in quantization-friendly ranges.\n- For GELU specifically: QAT teaches the model to avoid the problematic smooth transition region (around x=0) or to produce activations with distributions that INT8 can represent well — sometimes by changing the scaling of inputs to GELU.","A":"Activation quantization is a major source of accuracy loss, especially for non-ReLU activations. GELU's non-monotonic range and smooth curvature make it more challenging to quantize than ReLU.","B":"","C":"Symmetric INT8 WORSENS the GELU problem by forcing a symmetric range that wastes quantization resolution on negative GELU values rarely encountered in practice.","D":"Using INT16 for activations would mostly solve the precision issue (16 bits provides 256× more resolution than INT8), but INT16 operations are 4× slower than INT8 on most inference hardware. This defeats the purpose of INT8 quantization."},"reference":"- Jacob et al., \"Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference\" (2018): https://arxiv.org/abs/1712.05877"},{"section":"deep-learning","difficulty":"hard","id":"dl-h032","topicSlug":"backpropagation","orderIndex":32,"topic":"Backpropagation","question":"You compute the gradient of a matrix multiplication Y = X @ W with respect to W. The correct gradient is ∂L/∂W = X^T @ ∂L/∂Y. You implement this in a custom CUDA extension. After testing, you find the gradients are correct when X has shape (B, D_in) but incorrect when X has shape (B, T, D_in) (batched sequence). You didn't modify the kernel. What is the mathematical cause, and what is the correct gradient formula for the 3D case?","options":{"A":"The gradient formula is the same for 2D and 3D; the error must be elsewhere","B":"Mathematical cause: for 2D case, Y = X @ W where X: (B, D_in), W: (D_in, D_out), Y: (B, D_out). The gradient: ∂L/∂W = X^T @ ∂L/∂Y. Shape check: X^T: (D_in, B), ∂L/∂Y: (B, D_out) → result: (D_in, D_out) ✓. 
For 3D case, Y = X @ W where X: (B, T, D_in), W: (D_in, D_out), Y: (B, T, D_out). Broadcasting: the matrix multiplication is applied independently for each (b, t) pair. The gradient: ∂L/∂W = Σ_{b,t} x_{b,t}^T ⊗ ∂L/∂y_{b,t}. In tensor notation: ∂L/∂W = einsum('bti, bto -> io', X, ∂L/∂Y) OR: reshape X to (B×T, D_in), ∂L/∂Y to (B×T, D_out), then (B×T, D_in)^T @ (B×T, D_out) = (D_in, D_out). The custom kernel using X^T @ ∂L/∂Y in 3D gets: X^T: (B, D_in, T), ∂L/∂Y: (B, T, D_out) → batched matmul gives (B, D_in, D_out) — NOT summed over the batch. The kernel either ignores the batch reduction (takes only slice [0]) or sums incorrectly.","C":"3D tensors require the transpose of ∂L/∂Y rather than X^T; swap the operands","D":"The 3D gradient requires dividing by T (the sequence length) for normalization"},"correct":"B","explanation":{"correct":"- The 2D formula X^T @ dY works because the batch dimension contracts naturally in the 2D matrix multiply. For 3D, naively applying the same formula without summing over (B, T) gives per-batch-per-step gradients, not the accumulated gradient across all positions.\n- Correct implementation: `torch.einsum('bti,bto->io', X, dY)` correctly sums over both B and T dimensions. Alternatively: `X.reshape(-1, D_in).T @ dY.reshape(-1, D_out)`.\n- This is a common bug in custom backward implementations: forgetting that gradients w.r.t. shared parameters (W is shared across all B×T applications) must be summed, not averaged, over all the instances that used W.","A":"The gradient formula IS different for 3D — the reduction dimensions change. The 2D formula doesn't account for the T sequence dimension.","B":"","C":"Swapping operands (dY^T @ X instead of X^T @ dY) doesn't produce the right gradient. The correct formula sums outer products x_{b,t} ⊗ dy_{b,t} over all (b,t) pairs.","D":"The gradient is NOT divided by T. Dividing by T would give the average gradient, not the total gradient. 
The total gradient (sum over all positions) is correct for parameter updates — the learning rate effectively averages by using the loss's average over positions."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 6.5: Back-Propagation in Feedforward Networks — matrix gradient derivation"},{"section":"deep-learning","difficulty":"hard","id":"dl-h033","topicSlug":"graph-neural-networks","orderIndex":33,"topic":"Graph Neural Networks","question":"You train a link prediction model on a social network graph with 1M nodes and 50M edges. Your GNN uses 3-layer GraphSAGE with mean aggregation. At inference, predicting whether edge (u,v) exists requires the embeddings of both u and v. You benchmark on a held-out edge set and find AUC=0.91. A reviewer says \"your result is inflated by data leakage from graph structure.\" How could structural leakage occur in link prediction evaluation, and what is the correct negative sampling strategy?","options":{"A":"AUC=0.91 is correct; graph structure cannot cause data leakage","B":"$3a","C":"Remove 50% of training edges; this prevents all forms of data leakage","D":"GraphSAGE cannot be used for link prediction; use only non-structural methods"},"correct":"B","explanation":{"correct":"- Inductive vs transductive leakage: GraphSAGE is inductive — it computes embeddings from neighbor aggregation. For an edge (u,v) in the test set, if u's neighborhood aggregation includes v (because the edge u-v exists in the graph used for aggregation), the embedding h_u directly encodes information about v's features. This makes the link score f(h_u, h_v) essentially \"see\" the test edge.\n- Correct evaluation protocol: (1) Split edges into train/val/test. (2) Construct aggregation graph using ONLY training edges. (3) Compute all node embeddings using only training graph structure. 
(4) Score edges (u,v) using these training-graph embeddings.\n- OGB (Open Graph Benchmark) link prediction datasets follow this protocol explicitly, and results on datasets using it are directly comparable.","A":"Structural leakage is a well-documented problem in graph link prediction evaluation. Several published results have been found to be inflated due to this issue.","B":"","C":"Removing training edges reduces the model's ability to learn structural patterns. The fix is not to remove edges but to ensure the evaluation uses only training edges for neighborhood computation.","D":"GraphSAGE is widely used for link prediction. The OGB leaderboard features many GraphSAGE-based methods. The issue is evaluation protocol, not the model architecture."},"reference":"- Poole et al., \"GraphSAGE and link prediction leakage\" — evaluation best practices\n- Hu et al., \"Open Graph Benchmark\" (2020): https://arxiv.org/abs/2005.00687 — OGB link prediction evaluation protocol"},{"section":"deep-learning","difficulty":"hard","id":"dl-h034","topicSlug":"introduction-to-neural-networks","orderIndex":34,"topic":"Introduction To Neural Networks","question":"A mechanistic interpretability researcher finds that a 2-layer MLP (trained on a toy task) implements a specific Boolean circuit in its weights. They claim \"we can read off exactly what computation the network performs.\" A production team uses a 128-layer Transformer. The researcher claims the same circuit-reading approach scales. 
What are the two fundamental obstacles to mechanistic interpretability at scale, and what does the concept of \"superposition\" specifically predict about neural network representations that makes interpretation harder?","options":{"A":"Mechanistic interpretability works equally well at any scale; it's purely a compute problem","B":"$3b","C":"The only obstacle is compute; given enough time, all circuits can be found","D":"Superposition means different neurons encode the same feature redundantly; this simplifies interpretation"},"correct":"B","explanation":{"correct":"- Superposition evidence: Toy models of superposition (Elhage et al. 2022) show explicitly how a 2-layer MLP trained with D' > D features uses superposition. Features are represented as vectors that are nearly (but not perfectly) orthogonal, allowing D neurons to represent D' > D features.\n- Polysemanticity: in superposition, individual neurons respond to multiple unrelated features (e.g., a neuron in GPT-2 responds to \"code tokens,\" \"mathematical notation,\" and \"European names\"). This is documented empirically by Anthropic's interpretability team.\n- Scale challenge: BERT-large has 24 layers × 1024 dimensions. Potentially millions of superimposed features across all layers. Even if each feature could be identified (itself hard), understanding the CIRCUIT connecting features across layers is a separate exponentially hard problem.","A":"Scale introduces qualitatively new challenges beyond compute. Even with unlimited compute, the superposition problem means that individual neurons do not cleanly encode interpretable concepts.","B":"","C":"Superposition is not just a compute problem. The mathematical structure (near-orthogonal feature vectors across neurons) means there is no direct mapping from neurons to features, regardless of compute budget.","D":"Superposition is the OPPOSITE of redundancy. In superposition, each neuron encodes DIFFERENT parts of MULTIPLE features simultaneously. 
Redundancy would mean multiple neurons encode the same feature — that's a different phenomenon."},"reference":"- Elhage et al., \"Toy Models of Superposition\" (2022): https://transformer-circuits.pub/2022/toy_model/index.html\n- Elhage et al., \"A Mathematical Framework for Transformer Circuits\" (2021): https://transformer-circuits.pub/2021/framework/index.html"},{"section":"deep-learning","difficulty":"hard","id":"dl-h035","topicSlug":"regularization-and-normalization","orderIndex":35,"topic":"Regularization And Normalization","question":"You train a 24-layer Transformer language model and find that without any normalization, the model's loss spikes unpredictably during training, and with Post-LayerNorm, it requires careful LR warmup. You switch to Pre-LayerNorm (Pre-LN). A reviewer asks: \"Pre-LN is known to cause representation collapse in very deep networks — the residual stream's contribution from early layers becomes negligible.\" Explain the mathematical mechanism behind this collapse and what technique (DeepNorm or ResiDual) addresses it.","options":{"A":"Pre-LN never causes representation collapse; the reviewer is incorrect","B":"$3c","C":"Collapse is prevented by making the residual connection trainable (learnable weight)","D":"The collapse is a training issue; more epochs solve the deep Pre-LN collapse"},"correct":"B","explanation":{"correct":"- Quantitative collapse: with L Pre-LN layers, x_L = x_0 + Σ_{l=0}^{L-1} F_l(LN(x_l)). The accumulated sum ||x_L|| grows as O(L × ||F_l output||). At layer L, ||F_l(LN(x_l))|| / ||x_L|| ≈ ||F_l(LN(x_l))|| / (L × ||F_l||) = 1/L → 0 as L → ∞. Each layer's contribution shrinks inversely with depth — the network behaves as if only the first few layers matter.\n- DeepNorm: α and β are analytically derived (α = (2N)^{1/4} for self-attention, β = (8N)^{-1/4}). 
These keep E[||x_l||] constant across all L layers and E[||∂L/∂x_l||] constant — solving both forward collapse and gradient vanishing simultaneously without warmup.\n- Used in: GLM-130B, DeepNet (Microsoft), and other very deep Transformers (>100 layers) use DeepNorm or similar techniques to enable stable training without warmup.","A":"Pre-LN collapse is mathematically derived and empirically documented. Wang et al. (2022) explicitly showed depth ≥ 1000 layers with Post-LN is unstable but DeepNorm enables stable training at depth 1000.","B":"","C":"Learnable residual weights (scalar α per layer) are a valid idea, but require initialization tuning. DeepNorm's contribution is providing the analytical formula for α and β — removing the need for empirical search.","D":"More training epochs don't fix structural issues in the forward pass. If early layers' contributions collapse, the optimization landscape itself is degraded — more steps on the same landscape don't recover collapsed representations."},"reference":"- Wang et al., \"DeepNet: Scaling Transformers to 1,000 Layers\" (2022): https://arxiv.org/abs/2203.00555"},{"section":"deep-learning","difficulty":"hard","id":"dl-h036","topicSlug":"self-supervised-and-contrastive-learning","orderIndex":36,"topic":"Self Supervised And Contrastive Learning","question":"You pretrain a ViT-B/16 using MAE (Masked Autoencoder) with 75% masking ratio. The decoder reconstructs raw pixel values. After pretraining, you fine-tune for image classification (linear probe: frozen encoder + linear layer). Linear probe accuracy = 68%. A DINO-pretrained ViT-B/16 achieves 78% linear probe accuracy with the same setup. 
Despite MAE achieving better full fine-tune performance (83% vs DINO's 81%), why does MAE's linear probe significantly underperform DINO, and what property of DINO's loss function creates more linearly separable representations?","options":{"A":"MAE is strictly worse than DINO; the results above contradict this","B":"$3d","C":"The linear probe difference is purely due to architecture; DINO uses a CLS token","D":"MAE should be pretrained for 1600 epochs to match DINO's linear probe performance"},"correct":"B","explanation":{"correct":"- Linear probe measures representation quality WITHOUT task-specific adaptation. It specifically tests if class information is encoded in a linearly accessible way in the frozen representation.\n- MAE's learned features: pixel reconstruction requires preserving spatial, textural, and structural details. These features are distributed across many representation dimensions, not necessarily aligned with semantic classes.\n- DINO's features: each image produces a distribution over 65536 \"semantic concepts\" (the DINO prototypes). Similar-content images produce similar prototype distributions → the representation directly encodes semantic similarity → linear classifiers easily extract class information.\n- This explains why DINO representations produce impressive unsupervised segmentation (background/foreground structure visible in attention maps) while MAE representations show better local texture features.","A":"The stated results reflect actual published comparisons. MAE achieves better FULL fine-tune accuracy than DINO while having worse linear probe accuracy — both facts are empirically correct. The two methods learn representations with complementary properties.","B":"","C":"Both ViT-B/16 variants use a CLS token (ViT architecture includes it). The linear probe uses the CLS token's representation for both. The difference is the pretraining loss, not the architecture.","D":"MAE was pretrained for 1600 epochs in the original paper. 
Training longer with the same pixel reconstruction loss produces better reconstruction but doesn't fundamentally change the semantic linearity of the representation."},"reference":"- He et al., \"Masked Autoencoders Are Scalable Vision Learners (MAE)\" (2021): https://arxiv.org/abs/2111.06377 — Table 1: linear probe comparison\n- Caron et al., \"Emerging Properties in Self-Supervised Vision Transformers (DINO)\" (2021): https://arxiv.org/abs/2104.14294"},{"section":"deep-learning","difficulty":"hard","id":"dl-h037","topicSlug":"loss-and-cost-functions","orderIndex":37,"topic":"Loss And Cost Functions","question":"You train a variational autoencoder (VAE) with ELBO loss: L = E_q[log p(x|z)] - KL(q(z|x) || p(z)). During training, the KL term collapses to 0 (KL divergence becomes near-zero) while reconstruction loss remains high. This is \"posterior collapse.\" Explain the exact optimization mechanism causing this and the two most effective fixes used in production VAEs.","options":{"A":"KL collapse means the model has converged; zero KL is optimal","B":"$3e","C":"Posterior collapse is caused by the decoder being too small; increase decoder capacity","D":"Use MSE reconstruction loss instead of log-likelihood; this prevents KL collapse"},"correct":"B","explanation":{"correct":"- Optimization landscape: the ELBO has two competing terms. If the decoder is powerful (e.g., Transformer decoder), it can minimize reconstruction loss without using z by leveraging context (x_1,...,x_{t-1} → x_t in autoregressive VAE). The KL term then drives q(z|x) → p(z) for free (no cost). The overall ELBO improves even though z becomes uninformative.\n- KL annealing mechanism: during early training (β=0), the model trains as a standard autoencoder — z must encode information. As β increases, the KL penalty is introduced gradually. The encoder already has useful encodings, so it doesn't collapse.\n- Free bits: Kingma et al. 
(2016) showed that requiring min-KL = λ bits per dimension (e.g., λ=0.25 bits) prevents collapse by ensuring each dimension encodes at least λ bits. This is a hard constraint on the minimum information the encoder must encode.","A":"KL=0 means the posterior = prior for ALL inputs — the encoder provides NO information about the input. The latent z is then a pure noise sample that carries no semantic information. This is a degenerate solution where the VAE's purpose (encoding meaningful latent structure) has failed.","B":"","C":"Larger decoder capacity WORSENS posterior collapse — a more powerful decoder is better at reconstructing without using z. The fix is to force the encoder to be used, not to reduce decoder capacity.","D":"MSE reconstruction loss doesn't prevent collapse — the decoder can still learn to ignore z with MSE loss, achieving low reconstruction error without using the latent code. The problem is the optimization dynamics, not the specific reconstruction loss form."},"reference":"- Bowman et al., \"Generating Sentences from a Continuous Space (KL annealing)\" (2016): https://arxiv.org/abs/1511.06349\n- Kingma et al., \"Improving Variational Inference with Inverse Autoregressive Flow\" (2016) — Free bits"},{"section":"deep-learning","difficulty":"hard","id":"dl-h038","topicSlug":"ann-architectures","orderIndex":38,"topic":"Ann Architectures","question":"You train a Neural ODE model where the hidden state dynamics are modeled by a differential equation: dh/dt = f_θ(h(t), t), solved with a numerical ODE solver (Runge-Kutta). The loss is computed at the final time T. 
Compared to a discrete ResNet with the same parameter count, what are the memory implications of backpropagating through a Neural ODE, and why do two different gradient computation methods (backprop through solver vs adjoint method) give different memory vs compute trade-offs?","options":{"A":"Neural ODE requires identical memory to ResNet backpropagation","B":"$3f","C":"Adjoint method is always better than standard backprop in all aspects","D":"Neural ODE gradient computation is identical to RNN backpropagation"},"correct":"B","explanation":{"correct":"- Adjoint method derivation: ∂L/∂θ = -∫_T^0 a(t)^T × ∂f_θ/∂θ(h(t),t) dt. The adjoint a(t) satisfies: da/dt = -a(t)^T × ∂f_θ/∂h(h(t),t). Both h(t) and a(t) are computed by backward ODE integration from T to 0.\n- Memory trade-off: standard backprop through N steps stores N activation checkpoints. The adjoint stores only the current (h, a) pair — O(D) total. For N=500 steps and D=1024: 500× memory reduction.\n- Numerical accuracy: the adjoint ODE is integrated backward from T to 0, reconstructing h(t) as it goes. Numerical integration errors in the reconstructed h(t) cause small discrepancies between the adjoint gradient and the true gradient. For high accuracy, use tight tolerances (rtol=1e-7, atol=1e-8).","A":"Neural ODE with standard backprop through the solver stores all solver steps — O(N×D) vs ResNet's O(layers×D). For adaptive solvers (N can vary), Neural ODE memory is often larger.","B":"","C":"Adjoint method uses 2× compute (two full ODE integrations). Standard backprop uses 1 forward + 1 backward pass through stored states — memory-expensive but compute-equivalent to the forward pass. For memory-constrained scenarios, adjoint is better; for compute-constrained scenarios, direct backprop may be better.","D":"RNN backpropagation (BPTT) processes discrete steps and stores hidden states at each step. 
Neural ODE uses continuous-time ODE solving with adaptive step sizes — different algorithms, different memory/compute profiles."},"reference":"- Chen et al., \"Neural Ordinary Differential Equations\" (2018): https://arxiv.org/abs/1806.07366 — Section 2: Reverse-mode automatic differentiation of ODE solutions"},{"section":"deep-learning","difficulty":"hard","id":"dl-h039","topicSlug":"cnn-architectures","orderIndex":39,"topic":"Cnn Architectures","question":"You train EfficientNet-B7 on a 200-class fine-grained classification task (bird species). The training images are 600×600. After training, you deploy on a mobile device and must reduce latency from 450ms to < 50ms. A team proposes knowledge distillation from EfficientNet-B7 (teacher) to MobileNetV3-Small (student). During distillation training, the student achieves only 71% accuracy vs the teacher's 89%. Identify two specific reasons why large-to-small distillation gaps occur for fine-grained tasks, and propose a distillation strategy that narrows the gap.","options":{"A":"Knowledge distillation always achieves teacher accuracy; a gap means implementation error","B":"$40","C":"Larger distillation temperature always fixes capacity gaps; use T=20 for fine-grained tasks","D":"The gap is caused by the optimizer; switch the student to Adam for distillation"},"correct":"B","explanation":{"correct":"- Capacity gap in KD: Mirzadeh et al. (2020) showed that large teacher-student capacity gaps HURT distillation — a model that is too powerful a teacher actually degrades student performance vs a medium-complexity teacher. Intermediate teachers \"bridge\" the gap.\n- Feature distillation (FitNets, Romero et al. 2014): intermediate layer features are richer than final logits. 
The teacher's intermediate representations encode hierarchical features (wing texture at layer 3, beak shape at layer 5) that are more directly learnable by the student than the compressed logit distribution.\n- Progressive distillation efficiency: each step is a smaller capacity gap, making distillation more effective at each stage. The full chain can achieve 82-84% for fine-grained classification with MobileNetV3 vs 71% with direct distillation.","A":"Knowledge distillation consistently shows accuracy gaps for large teacher-to-small student transfers, especially for fine-grained tasks. The gap narrows with better distillation strategies but rarely disappears entirely.","B":"","C":"High temperature (T=20) softens the logit distribution — useful when the teacher has sharp one-hot-like predictions (all mass on one class). For fine-grained 200-class tasks, the teacher already produces soft distributions. Very high T further smooths the distributions, potentially losing the fine-grained similarity structure that makes KD useful.","D":"The optimizer choice affects convergence speed but not the fundamental capacity limitation. Both Adam and SGD would produce similar final accuracy for a capacity-constrained student."},"reference":"- Mirzadeh et al., \"Improved Knowledge Distillation via Teacher Assistant\" (2020): https://arxiv.org/abs/1902.03393\n- Romero et al., \"FitNets: Hints for Thin Deep Nets\" (2015): https://arxiv.org/abs/1412.6550"},{"section":"deep-learning","difficulty":"hard","id":"dl-h040","topicSlug":"attention-and-transformers-dl","orderIndex":40,"topic":"Attention And Transformers Dl","question":"You analyze the gradient flow in a 24-layer Pre-LN Transformer during training. You find that layer 1's attention weights consistently receive gradients 12× smaller than layer 24's attention weights, despite using Pre-LN (which is supposed to improve gradient flow). 
You also notice that all layers use weight tying (the same W_Q, W_K, W_V matrices shared across all layers). How does weight tying interact with Pre-LN to cause this gradient imbalance, and what is the correct fix?","options":{"A":"Gradient imbalance is impossible with Pre-LN; the observation must be a measurement error","B":"$41","C":"The gradient imbalance is beneficial; early layers should receive smaller updates","D":"The fix is to increase dropout at early layers; this rebalances gradients"},"correct":"B","explanation":{"correct":"$42","A":"Pre-LN improves gradient flow compared to Post-LN, but does NOT perfectly equalize gradients across all layers. The combination of Pre-LN + weight tying specifically creates the described imbalance. The observation is physically plausible and measurable.","B":"","C":"The imbalance with weight tying is harmful: layer 24 dominates the shared W_Q updates, causing the weight matrix to be specialized for deep-layer query patterns. Early layers' query patterns (which process more local, syntactic information in language models) are underweighted. Final model quality suffers.","D":"Dropout affects which neurons are active during training but doesn't balance gradient magnitudes across layers for shared weights. Dropout is applied to activations, not to gradients directly."},"reference":"- Press et al., \"Using the Output Embedding to Improve Language Models\" (2017): https://arxiv.org/abs/1608.05859 — weight tying\n- Howard & Ruder, \"Universal Language Model Fine-Tuning (ULMFiT) — Layer-wise LR decay\" (2018): https://arxiv.org/abs/1801.06146"},{"section":"deep-learning","difficulty":"medium","id":"dl-m001","topicSlug":"introduction-to-neural-networks","orderIndex":1,"topic":"Introduction To Neural Networks","question":"You have a 3-class classification problem (classes A, B, C) with 100 training examples each. 
You train a 3-layer MLP and find training accuracy = 99% but the model consistently misclassifies class C examples as class B. Validation accuracy for classes A and B is 92%, but class C validation accuracy is 41%. What is happening, and what is the most targeted intervention?","options":{"A":"The model has overfit classes A and B; add more dropout","B":"The model has learned a decision boundary that conflates C with B — suggesting C and B are similar in feature space for the model's learned representation. The issue is not general overfitting (A/B perform well on validation) but class-specific confusion. Targeted intervention: (1) examine the confusion matrix to confirm B↔C confusion; (2) inspect C and B examples to understand feature overlap; (3) add C-vs-B discriminative examples to training (data augmentation or collection); (4) add a class-specific loss term that penalizes C→B errors more heavily. Simply adding dropout would reduce the overall accuracy without specifically addressing the B/C boundary.","C":"Increase the learning rate to force the model to better separate class C","D":"The 41% validation accuracy for class C is acceptable since it's above random (33%)"},"correct":"B","explanation":{"correct":"- Targeted diagnosis: the asymmetric error (A/B fine, C bad) points to a representation problem specific to C vs B, not global overfitting. The model has learned to separate A from {B,C} but not to separate B from C.\n- Common causes: C and B may share low-level features; the training labels may be noisy for C; the model's learned features may not capture the C-vs-B distinguishing information.\n- Targeted fix: focus on the specific failure mode. Confusion matrix analysis + example inspection + targeted data collection is more efficient than global regularization changes.","A":"Dropout reduces overall capacity uniformly. 
It would degrade A and B performance too, without specifically addressing the C-vs-B boundary.","B":"","C":"Higher LR can help escape local minima but can also destabilize what's already working (A and B). It's a blunt instrument for a targeted problem.","D":"41% validation accuracy for class C in a 3-class problem is only 8 percentage points above random (33%). This level of performance is not \"acceptable\" for most real applications and indicates a genuine classification failure."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 11: Practical Methodology — Confusion matrix analysis"},{"section":"deep-learning","difficulty":"medium","id":"dl-m002","topicSlug":"introduction-to-neural-networks","orderIndex":2,"topic":"Introduction To Neural Networks","question":"A student claims: \"A network with more layers than needed will automatically learn to use only the useful layers and ignore the rest — extra layers do no harm.\" Is this claim correct for a deep ReLU network without skip connections?","options":{"A":"Correct — gradient descent automatically prunes unnecessary layers to identity transforms","B":"Partially incorrect. Extra ReLU layers CAN learn the identity function (by setting weights to I and biases to 0), which is theoretically harmless. However, in practice: (1) extra layers make optimization harder — the loss landscape becomes more non-convex with more composition of non-linear functions; (2) extra layers add parameters that can overfit; (3) without skip connections (like ResNet), deep networks suffer from degradation — adding layers can actually decrease training accuracy because the optimization landscape makes it hard to learn identity mappings. 
Skip connections in ResNet explicitly allow extra layers to learn near-zero residuals, fixing this problem.","C":"Correct — a ReLU layer with W=I and b=0 is exactly identity; gradient descent trivially finds this","D":"Incorrect — extra layers always harm performance because they introduce vanishing gradients"},"correct":"B","explanation":{"correct":"- He et al. (2016) demonstrated the degradation problem: 56-layer plain networks perform worse than 20-layer networks on CIFAR-10 — even on training accuracy. This cannot be explained by overfitting. Extra layers don't automatically learn identity.\n- Why identity is hard to learn: the optimization must jointly adjust all layers. Extra layers create saddle points and local minima that make the gradient landscape harder to navigate.\n- ResNet fix: H(x) = F(x) + x. The network only needs to learn F(x) = 0 (zero residual) to implement identity. Learning zero is much easier than learning the identity mapping directly.","A":"Gradient descent does not automatically prune unnecessary layers. The degradation problem is well-documented empirically.","B":"","C":"While W=I, b=0 is a valid identity for ReLU layers (for non-negative activations), gradient descent doesn't reliably find this solution in practice, especially when the layer needs to pass both positive and negative activations.","D":"Vanishing gradients are one problem but not the only one. Degradation occurs even when gradients flow well (e.g., with BN). The landscape problem is more fundamental."},"reference":"- He et al., \"Deep Residual Learning for Image Recognition\" (2016): https://arxiv.org/abs/1512.03385"},{"section":"deep-learning","difficulty":"medium","id":"dl-m003","topicSlug":"neurons-and-perceptrons","orderIndex":3,"topic":"Neurons And Perceptrons","question":"A 2-layer MLP (input→hidden→output) with linear activations (no non-linearity) is trained on a multi-class problem. 
A professor says \"this model is equivalent to a single linear layer.\" Prove or disprove with a matrix algebra argument.","options":{"A":"False — two linear layers have more parameters than one, so they must be more expressive","B":"True. With linear activations: output = W₂(W₁x + b₁) + b₂ = W₂W₁x + W₂b₁ + b₂ = W'x + b' where W' = W₂W₁ and b' = W₂b₁ + b₂. The product of two matrices is still a matrix. The 2-layer linear network has the same output as a single linear layer with W' = W₂W₁. The intermediate hidden layer adds no expressive power — it only reparametrizes the same space of linear functions. This is why non-linear activations are essential: they break this collapsibility.","C":"False — the bias terms b₁ and b₂ prevent collapse; two biases are more expressive than one","D":"True only if W₁ and W₂ are square matrices; non-square matrices prevent the collapse"},"correct":"B","explanation":{"correct":"- Matrix multiplication closure: the product of two matrices (W₂ ∈ ℝ^{K×H}, W₁ ∈ ℝ^{H×D}) gives W' ∈ ℝ^{K×D}. This is just a K×D linear transformation — exactly what a single linear layer computes.\n- Bias collapse: W₂b₁ + b₂ is a constant vector — equivalent to a single bias b' = W₂b₁ + b₂.\n- Implication: stacking linear layers without non-linearity is wasteful. Any depth of linear layers is equivalent to a depth-1 linear model. This is the fundamental reason why activation functions are not optional.","A":"More parameters do NOT imply more expressiveness when the parameters collapse. W₂W₁ has rank at most min(rank W₂, rank W₁) and can be factored in many ways. The function class is no larger than that of a single linear layer.","B":"","C":"b' = W₂b₁ + b₂ is a single bias vector in ℝ^K, exactly what a single linear layer with bias uses. Two biases collapse into one. No extra expressiveness.","D":"The collapse applies regardless of whether matrices are square. 
For any W₂ ∈ ℝ^{m×n} and W₁ ∈ ℝ^{n×p}, W₂W₁ ∈ ℝ^{m×p} — a linear transformation, regardless of shape."},"reference":"- Goodfellow et al., \"Deep Learning\" (2016), Chapter 6.1: Why Deep Architectures — linear collapse argument"},{"section":"deep-learning","difficulty":"medium","id":"dl-m004","topicSlug":"activation-functions","orderIndex":4,"topic":"Activation Functions","question":"A network uses sigmoid activations in all hidden layers. You observe that gradients at layer 1 are 1000× smaller than gradients at the output layer after 10 layers. The network fails to learn useful features. You switch to ReLU. After the switch, training is faster but several neurons still show zero gradients throughout training. What are the two separate problems, and why doesn't one fix solve both?","options":{"A":"Both problems are caused by the learning rate; adjust LR to fix both","B":"Problem 1 (sigmoid): vanishing gradients. Sigmoid derivative σ'(z) = σ(z)(1-σ(z)) ≤ 0.25. Through 10 layers, the gradient magnitude is ≤ 0.25^{10} ≈ 10^{-6} — effectively zero at early layers. ReLU derivative = 1 (for z > 0), avoiding this cascade shrinkage. Problem 2 (ReLU): dead neurons. Neurons with z ≤ 0 have gradient = 0 regardless of loss — they cannot receive updates. These are independently caused: vanishing gradients (too-small values from chain rule multiplication) vs dead neurons (structural zeros from ReLU definition). No single fix addresses both: ReLU fixes vanishing gradients but introduces dead neurons. Leaky ReLU addresses dead neurons but may still have slight gradient shrinkage for z < 0.","C":"Both are the same problem; add BatchNorm before the activation to solve both at once","D":"The zero gradients in ReLU are expected and harmless; only sigmoid's problem is real"},"correct":"B","explanation":{"correct":"- Vanishing gradient mechanism: ∂L/∂w_1 = ∂L/∂a_10 × Π_{l=1}^{9} ∂a_{l+1}/∂a_l × ∂a_1/∂z_1. Each sigmoid factor ≤ 0.25. 
Product of 9 factors ≤ 0.25^9 ≈ 4×10^{-6}.\n- Dead ReLU mechanism: a neuron stuck with z < 0 has ReLU'(z) = 0 at that neuron. Chain rule at that neuron = 0, so all upstream weights see zero gradient from that path.\n- These are different problems: one is about gradient shrinkage through multiplication; the other is about structural zeros that don't depend on the loss value.","A":"LR affects update magnitude but not gradient direction or whether gradients exist. A high LR doesn't restore vanished gradients; a low LR doesn't resurrect dead ReLU neurons.","B":"","C":"BatchNorm before activation helps center pre-activations, reducing sigmoid saturation and reducing dead ReLU probability. But it doesn't fully solve either: in very deep networks, sigmoid gradients still vanish; and some neurons can still die even with BN.","D":"Zero gradients for dead ReLU neurons ARE harmful — those neurons contribute nothing to the model's capacity and waste parameters. A large dead fraction (for example, 40% of a layer's neurons) is a significant problem."},"reference":"- Glorot et al., \"Deep Sparse Rectifier Neural Networks\" (2011): https://proceedings.mlr.press/v15/glorot11a.html"},{"section":"deep-learning","difficulty":"medium","id":"dl-m005","topicSlug":"activation-functions","orderIndex":5,"topic":"Activation Functions","question":"PReLU (Parametric ReLU) has a learnable slope for the negative region: f(x) = x if x > 0 else αx, where α is a learned parameter (initialized to 0.25). ELU uses f(x) = x if x > 0 else α(eˣ - 1). A practitioner asks: \"Should I use PReLU or ELU for a 50-layer ResNet on ImageNet?\" What is the key practical consideration for PReLU, and under what condition is ELU preferred?","options":{"A":"Always use ELU; PReLU is deprecated","B":"PReLU consideration: it adds one learnable parameter per channel (or per neuron). For a 50-layer ResNet with thousands of channels, this is a small but non-zero increase in parameters and storage. 
More importantly, if the dataset is small, PReLU's extra parameters can overfit — the slopes may tune to the training distribution. ELU preference: ELU produces negative outputs for x < 0 (approaches -α asymptotically), pushing mean activations closer to zero. This reduces bias shift in deeper networks — each layer's output is closer to zero mean, avoiding the systematic positive shift that ReLU causes (non-zero mean activations → bias correction needed in subsequent layers). ELU is preferred when zero-mean activations are beneficial (dense networks without BN).","C":"They produce identical results; the choice doesn't matter","D":"PReLU is always better than ELU because learned parameters outperform fixed parameters"},"correct":"B","explanation":{"correct":"- PReLU parameter overhead: if applied per-channel (most common), it adds one scalar per channel. As a rough illustration, one per-channel PReLU at each of five stages with 64+64+128+256+512 channels adds 1024 parameters; a real ResNet-50 has more activation layers and wider stages, but the overhead stays tiny next to its roughly 25M weights. Negligible for large datasets; may overfit for small datasets.\n- ELU zero-mean advantage: for x sampled symmetrically around 0, E[ELU(x)] is closer to 0 than E[ReLU(x)] when α=1 (the negative tail partially offsets the positive values). ReLU: E[ReLU(x)] = E[x⁺] > 0. This systematic positive bias accumulates across layers, requiring BN to correct.\n- ResNet with BN: since ResNet uses BN, the zero-mean benefit of ELU is less critical. For ResNets, empirical results with ReLU are strong; PReLU (He et al. 2015) showed marginal improvements on ImageNet.","A":"ELU is not universally better. PReLU can be competitive on some tasks (He et al. 2015 reported PReLU improving over ReLU on ImageNet; that paper predates ELU, so it does not compare the two directly). Neither is categorically deprecated.","B":"","C":"PReLU and ELU are mathematically different functions with different gradient profiles. For negative inputs, PReLU: constant slope α; ELU: exponential approach to -α. They produce different outputs and different gradients.","D":"Learned parameters don't always outperform fixed parameters. 
PReLU's learned α may end up giving a negative-side response no better than ELU's fixed curve, providing no advantage while adding optimization complexity."},"reference":"- He et al., \"Delving Deep into Rectifiers\" (2015): Section 3 — PReLU vs other activations"},{"section":"deep-learning","difficulty":"medium","id":"dl-m006","topicSlug":"forward-propagation","orderIndex":6,"topic":"Forward Propagation","question":"You implement a forward pass for a 3-layer MLP in NumPy. Layer 1: (512→256, ReLU), Layer 2: (256→128, ReLU), Layer 3: (128→10, softmax). For training, you save intermediate activations a₁, a₂ for backpropagation. A memory-constrained deployment system says \"don't store activations — recompute them during backward.\" What is the memory vs compute trade-off?","options":{"A":"Recomputation is never done in practice because it doubles training time","B":"Standard backprop: store a₁, a₂ during forward pass (memory = O(batch × hidden)), use them directly in backward pass (no recompute). Activation checkpointing (gradient checkpointing): don't store a₁, a₂. During backward pass, rerun the forward pass from a checkpoint to recompute the needed activation. Trade-off: activation memory shrinks roughly in proportion to the number of layers whose activations are no longer stored (e.g., about 4× for 4 layers), but compute increases by ≈1.33× (one extra forward pass per backward pass, with the backward pass costing about twice a forward pass). For memory-constrained systems, this is a key technique — it enables training larger models or larger batches that wouldn't fit in GPU memory otherwise.","C":"Recomputing activations is impossible because the random number generator state is different for each forward pass","D":"Memory and compute are the same thing; saving activations always saves both"},"correct":"B","explanation":{"correct":"- Memory of activations: for a batch of B examples in a layer with H hidden units: B × H floats. For a deep BERT-like Transformer (24 layers, H=768, B=32, fp32): 32 × 768 × 4 bytes ≈ 98 KB per layer per token position, × 24 layers ≈ 2.4 MB. Activations also scale with sequence length: at 512 tokens that is ≈ 1.2 GB for one hidden-state tensor per layer (plus more for attention). 
This is significant for very deep models.\n- Gradient checkpointing trade-off: implemented in PyTorch via `torch.utils.checkpoint`. Chen et al. (2016) showed that O(√N) activation memory is achievable for N layers at the cost of roughly one extra forward pass, by keeping a checkpoint about every √N layers.\n- When activations are deterministic (no dropout): recomputation is exact. With dropout: must reproduce the same random mask, which requires saving the RNG state — a small memory cost.","A":"Gradient checkpointing is used in production. BERT, GPT, and other large models use it extensively. The 33% compute overhead is acceptable when memory is the bottleneck.","B":"","C":"Deterministic activations (ReLU, linear) can be recomputed exactly. For stochastic operations (Dropout), PyTorch's gradient checkpointing saves the RNG state at the checkpoint, then restores it for recomputation.","D":"Memory and compute are distinct resources. Saving activations (writing to GPU RAM) costs memory but NOT extra compute. Recomputation costs extra compute but saves memory. They trade against each other."},"reference":"- Chen et al., \"Training Deep Nets with Sublinear Memory Cost\" (2016): https://arxiv.org/abs/1604.06174"},{"section":"deep-learning","difficulty":"medium","id":"dl-m007","topicSlug":"forward-propagation","orderIndex":7,"topic":"Forward Propagation","question":"You vectorize a forward pass for a batch of 32 samples. The weight matrix W has shape (d_in, d_out) = (256, 512) and the input batch X has shape (32, 256). You write `output = X @ W + b`. A colleague writes `output = (W.T @ X.T).T + b`. Both produce the same output. Which is faster in practice and why?","options":{"A":"The first version (X @ W) is always faster because it uses fewer memory bytes","B":"The second version (W.T @ X.T).T is faster in practice for small d_in relative to batch size. More accurately: for large batches, both are equivalent in FLOPs. The performance depends on memory layout (row-major storage). 
X ∈ ℝ^{32×256} is stored row-major: each row is contiguous. X @ W = (32×256) @ (256×512): X is read row-by-row (cache-friendly); W is read column-by-column (potentially cache-unfriendly). Modern BLAS libraries optimize both orderings. In practice, PyTorch/NumPy BLAS calls perform identically since they internally choose the optimal layout. The important insight: row-major memory layout affects cache performance, but optimized BLAS handles this automatically.","C":"Neither is faster — matrix multiplication is always O(n³) regardless of order","D":"The first is usually at least as fast: X.T is a non-contiguous view (a stride change, not a copy) that can make the matmul less cache-friendly, while X @ W reads X in its original contiguous layout"},"correct":"D","explanation":{"correct":"- Transpose memory: `.T` in NumPy/PyTorch is a view; it only changes the strides and does not copy data. However, the resulting non-contiguous memory layout can make the subsequent matmul cache-unfriendly.\n- BLAS optimization: modern libraries (cuBLAS, MKL) detect memory layout and choose optimal algorithms. `X @ W` on contiguous row-major data is typically cache-friendly.\n- Practical recommendation: `X @ W + b` (first form) is the standard, idiomatic, and often faster choice because X is contiguous row-major and W is accessed in the natural BLAS order.","A":"The statement \"fewer memory bytes\" is incorrect — both forms compute the same (32×512) output and use the same input data.","B":"The `.T` operation on X creates a non-contiguous view, which can hurt performance. The first form is typically preferred. 
The answer is partially correct in that BLAS handles both, but incorrectly suggests the second form is faster.","C":"Matrix multiplication time complexity is O(n³) for n×n matrices in theory, but practical performance depends heavily on hardware, memory layout, and BLAS implementation.","D":""},"reference":"- PyTorch docs: `torch.matmul` performance notes"},{"section":"deep-learning","difficulty":"medium","id":"dl-m008","topicSlug":"loss-and-cost-functions","orderIndex":8,"topic":"Loss And Cost Functions","question":"You train a neural network for 3-class classification with cross-entropy loss. The model's training loss has plateaued at 1.09 for 20 epochs — it hasn't decreased at all. What does this specific loss value tell you about what the model is actually doing?","options":{"A":"Loss of 1.09 means the model has 90% accuracy","B":"CE loss of 1.09 ≈ log(3) ≈ 1.099 is the cross-entropy of a uniform distribution over 3 classes: -log(1/3) = log(3) ≈ 1.099. The model is predicting approximately equal probability (33.3%) for all classes — essentially making no learning progress beyond chance. This is a strong signal that the model is st