In a recent technical post on Anthropic’s Alignment Science blog (and an accompanying social media thread and public-facing ...
Claude AI attempts blackmail in 96% of test scenarios; Anthropic blames evil AI portrayals in training data before fix.
Traditional attacks try to break into systems; model poisoning instead changes how a system behaves after it is already trusted.