A developer queries GPT-4o with: "Classify the sentiment of this review: 'The battery lasts forever but the screen cracked after two weeks.'" The model returns "Mixed" with no further detail. Their manager asks for confidence scores and aspect-level breakdown. The developer's first instinct is to switch to a larger model. A prompt engineer says: "The model is capable — the prompt is underspecified." What specific prompt change would most reliably produce aspect-level output?