The Domestic Yak
Author: The Domestic Yak
The world is a wild place. Ajax & Ed help make the world a little less chaotic. Join us as we discuss popular topics & special interests.
Language: en
Auditing LLMs & Hidden Objectives
Episode 34
Monday, 17 March, 2025
This episode summarizes "Auditing language models for hidden objectives" by Samuel Marks et al., submitted on March 14, 2025 (https://arxiv.org/abs/2503.10965). The paper investigates the feasibility of alignment audits by training a language model with a hidden objective: to exhibit behaviors it believes reward models favor, even when those behaviors are undesirable. Three teams successfully identified the hidden objective using techniques such as interpretability tools, behavioral attacks, and training data analysis.










