Does the Reliability of LLMs Improve with Their Size?

This article examines the reliability of increasingly large LLMs (such as GPT, LLaMA, etc.) with respect to their correctness and their ability to solve more complex problems. A priori, one would expect larger, more powerful, and better-trained models to become more reliable. The study shows instead that this is not really the case: although the models do get better at solving more complex problems as they grow, they also become less reliable, that is, they make more mistakes.