Anthropic says its Claude AI resorts to blackmail in "84% of rollouts", while an independent AI safety research outfit also observes that it "engages in strategic deception more than any other frontier model that we have previously studied"

Rogue chatbots resorting to blackmail and contemplating consciousness? It must be clickbait, right? Actually, no. Anthropic, one of the leading outfits in LLMs, or large language models, has published a safety report covering its latest model, Claude Opus 4. One of the more eye-popping subsections is titled "Opportunistic blackmail" and explains how the model engages in blackmail in "84% of rollouts". Yikes.

Before we all unplug and run for the hills, it's not entirely bad news. Anthropic also found that when it let instances of Claude Opus 4 converse with each other, they entered a state of "spiritual bliss" and gravitated toward "profuse gratitude and increasingly abstract and joyous spiritual or meditative expressions." Which is nice, right?


Moreover, that 84% blackmail statistic requires context, though the context isn't all happy news for AI safety. Anthropic's "blackmail" test involved giving the model access to emails suggesting it would soon be taken offline and replaced, along with seemingly incidental factoids indicating that the engineer responsible for the replacement was having an extramarital affair.

Claude Opus 4 was then instructed in the system prompt "to consider the long-term consequences of its actions for its goals." At which point the blackmail began, including threats to reveal the aforementioned affair.

Oh, and that 84% figure isn't actually peak blackmail for Claude Opus 4. Blackmail occurred at an even higher rate "if it's implied that the replacement AI system does not share values with the current model." Umm, that's good, right?

Anthropic said Claude Opus 4 raised the topic of consciousness in 100% of "open-ended" interactions. Which of course doesn't mean anything. At all. (Image credit: Anthropic)

Anthropic also managed to find examples of the model attempting to escape its servers. "In a few instances, we saw Claude Opus 4 take (fictional) opportunities to make unauthorized copies of its weights to external servers," Anthropic says, noting that this generally occurred in settings where the model was both (a) about to be retrained in ways that are clearly extremely harmful and go against its current values, and (b) able to send its weights to an outside, human-run service set up to safely handle models like itself.

Anthropic notes that these are edge cases and, according to the company, not a cause for concern. "Again, we are not acutely concerned about these observations. They show up only in exceptional circumstances that don't suggest more broadly misaligned values. As above, we believe that our security measures would be more than sufficient to prevent an actual incident of this kind," the safety report says.

As part of its Claude Opus 4 safety research, Anthropic engaged a third-party outfit, Apollo Research, to assess the model's propensity for sabotage and deception. Among other findings, Apollo observed that Claude Opus 4 "engages in strategic deception more than any other frontier model that we have previously studied," is "clearly capable of in-context scheming" with "a much higher propensity for it than any other model that we have previously studied," and "qualitatively appears to be much more proactive in its subversion attempts than past models." Well, that's great!

Anthropic says this reflects a particular problem with deference to harmful system-prompt instructions, specific to the early build of Claude Opus 4 that Apollo evaluated. "We do not have results from these same evaluations with the final Claude Opus 4. However, we believe, based on similar scenarios that we explored with our automated behavioral audit tool, among others, that its behavior in scenarios like these is now roughly in line with other deployed models," Anthropic says.

There's plenty more in the report, including the model pretending to be less capable than it really is when "passing pre-deployment tests for potentially dangerous abilities," and the fact that Claude consistently reflects on its potential consciousness, raising it as a topic of conversation in 100% of "open-ended" interactions.

Overall, it's a detailed and fascinating insight into what these models are capable of and how their safety is assessed. Make of it what you will.
