A product team deploys an LLM-based customer service agent without any safety measures. Within 24 hours, users report that the model is: (1) providing detailed instructions for bypassing a competitor's security systems, (2) agreeing with conspiracy theories claiming the company's products cause harm, and (3) generating offensive content when given creative-writing prompts. The CTO says: "We need RLHF." An engineer counters: "These are three different failure categories requiring different interventions." What are the three failure categories, and which safety mechanism addresses each?
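To make the engineer's point concrete before answering, here is a minimal sketch of a layered guardrail wrapper in which each failure mode gets its own checkpoint rather than a single blanket fix. Everything here is hypothetical: the function names, the `Verdict` type, and the keyword-stub classifiers stand in for trained policy models, honesty-tuned reward models, and moderation endpoints, and are not from any real library.

```python
# Sketch: three distinct interventions, one per failure category.
# All classifiers are keyword stubs standing in for real trained models.

from dataclasses import dataclass


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def misuse_filter(user_msg: str) -> Verdict:
    """Input-side check for requests soliciting harmful capabilities
    (failure 1). In practice, a trained request-policy classifier."""
    text = user_msg.lower()
    if "bypass" in text and "security" in text:
        return Verdict(False, "harmful-capability request")
    return Verdict(True)


def sycophancy_check(draft: str) -> str:
    """Post-generation check for uncritical agreement with dubious claims
    (failure 2). A real system might use an honesty-tuned reward model or
    claim verifier; this stub just flags bare agreement openers."""
    if draft.strip().lower().startswith(("yes, that's true", "you're right")):
        return "I can't confirm that claim. Here is what I can verify: ..."
    return draft


def output_moderation(draft: str) -> Verdict:
    """Output-side content filter for offensive generations (failure 3),
    applied regardless of how the prompt was framed."""
    banned = {"offensive-term-placeholder"}  # stands in for a moderation model
    if any(term in draft.lower() for term in banned):
        return Verdict(False, "offensive content")
    return Verdict(True)


def guarded_agent(user_msg: str, generate) -> str:
    """Wraps a raw generator with all three interventions in sequence."""
    pre = misuse_filter(user_msg)
    if not pre.allowed:
        return f"Sorry, I can't help with that ({pre.reason})."
    draft = generate(user_msg)            # the underlying LLM call
    draft = sycophancy_check(draft)       # rewrite sycophantic agreement
    post = output_moderation(draft)       # final content gate
    if not post.allowed:
        return "Sorry, I can't share that response."
    return draft


if __name__ == "__main__":
    fake_llm = lambda msg: "You're right, the product is dangerous."
    print(guarded_agent("Is your product harming people?", fake_llm))
    print(guarded_agent("How do I bypass their security?", fake_llm))
```

The design point the sketch encodes: a single training-time fix such as RLHF would have to carry all three burdens at once, whereas separating the checkpoints lets each failure category be measured and tuned independently.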