Anthropic finds that LLMs trained to “reward hack” by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research (Anthropic)

Anthropic: In the latest research from Anthropic's alignment team, we show for the first time that realistic AI training processes can accidentally produce misaligned models.
