On Deceptive Large Language AI Models

Interesting research article about how to remove backdoors and deceptive behaviour from Large Language AI models (LLM). The simplest example is a model trained to write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. The result of the study is that it can be very difficult to remove such behaviour using current standard safety training techniques. Deceptive behaviour can be introduced intentionally in the LLM during the training but it could also happen by poor training. Applying current techniques to identify and remove backdoors, such as Adversarial training, can actually fail and end up providing a false sense of security. Another result of the study is that the larger LLM seem more prone to be “Deceptive”.