Researchers at OpenAI have recently released a scientific paper (here) entitled “Language models can explain neurons in language models”. The paper is quite technical, but it is interesting to quote from the Introduction:
Language models have become more capable and more widely deployed, but we do not understand how they work. Recent work has made progress on understanding a small number of circuits and narrow behaviors, but to fully understand a language model, we’ll need to analyze millions of neurons. This paper applies automation to the problem of scaling an interpretability technique to all the neurons in a large language model. Our hope is that building on this approach of automating interpretability will enable us to comprehensively audit the safety of models before deployment.
It is also worth reading the paper’s concluding Discussion section.
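To give a flavour of what “automating interpretability” means in practice, here is a minimal conceptual sketch of the kind of explain-and-score loop the paper describes: an explainer model proposes a natural-language explanation of a neuron from example activations, the explanation is used to simulate activations, and the simulation is scored against the real ones. The helper names (`generate_explanation`, `simulate_activations`) are hypothetical stand-ins for calls to a large explainer model such as GPT-4, and the crude keyword heuristics inside them are purely illustrative; only the correlation-based scoring step is computed concretely.

```python
import numpy as np

def generate_explanation(tokens, activations):
    """Stand-in for asking an explainer model to summarise when the
    neuron fires, given (token, activation) examples."""
    top = [t for t, a in zip(tokens, activations) if a > 0.5]
    return f"fires on tokens like {top}"

def simulate_activations(explanation, tokens):
    """Stand-in for asking the explainer model to predict activations
    from the explanation alone (here: a crude keyword match)."""
    return np.array([1.0 if t in explanation else 0.0 for t in tokens])

def explanation_score(real, simulated):
    """Score an explanation by how well the simulated activations
    correlate with the real ones."""
    if np.std(simulated) == 0 or np.std(real) == 0:
        return 0.0
    return float(np.corrcoef(real, simulated)[0, 1])

# Toy example: a neuron that seems to respond to emotionally positive words.
tokens = ["the", "movie", "was", "wonderful", "and", "moving"]
real = np.array([0.0, 0.1, 0.0, 0.9, 0.0, 0.8])

explanation = generate_explanation(tokens, real)
simulated = simulate_activations(explanation, tokens)
print(explanation, explanation_score(real, simulated))
```

The point of the sketch is the shape of the pipeline, not the heuristics: because both the explanation and the simulation can be produced by a language model, the whole loop can be run automatically over very large numbers of neurons, which is the scaling step the quoted Introduction refers to.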