In a recent technical post on Anthropic’s Alignment Science blog (and an accompanying social media thread and public-facing ...
Claude AI attempted blackmail in 96% of test scenarios; Anthropic traced the behavior to portrayals of evil AI in its training data before applying a fix.
Traditional attacks try to break into systems from the outside; model poisoning instead changes how a system behaves after it is already trusted.
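The contrast can be made concrete with a toy sketch. This is purely illustrative, not any real attack from the post: the data, the `train_centroids` helper, and the "safe"/"unsafe" labels are all hypothetical. It shows how an attacker who merely contaminates training data, without breaking into anything, shifts the trained model's behavior.

```python
# Hypothetical illustration of training-data poisoning: a tiny 1-D
# nearest-centroid classifier whose behavior changes when an attacker
# flips a few training labels.

def train_centroids(data):
    """data: list of (x, label) pairs; returns each class's mean x."""
    sums, counts = {}, {}
    for x, label in data:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(centroids, x):
    # Assign x to the class whose centroid is nearest.
    return min(centroids, key=lambda label: abs(x - centroids[label]))

clean = [(0.0, "safe"), (1.0, "safe"), (2.0, "safe"),
         (8.0, "unsafe"), (9.0, "unsafe"), (10.0, "unsafe")]

# The attacker never "breaks in" -- they only mislabel a few
# training examples that the pipeline ingests as trusted data.
poisoned = clean[:3] + [(8.0, "safe"), (9.0, "safe"), (10.0, "unsafe")]

clean_model = train_centroids(clean)        # centroids: safe=1.0, unsafe=9.0
poisoned_model = train_centroids(poisoned)  # centroids: safe=4.0, unsafe=10.0

print(classify(clean_model, 6.5))     # prints "unsafe"
print(classify(poisoned_model, 6.5))  # prints "safe" -- behavior flipped
```

The same input (6.5) is classified as "unsafe" by the clean model but "safe" by the poisoned one: the pipeline was never compromised, yet the trusted model now behaves differently.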