Leaderboard


Disclaimer

While we have incorporated as many datasets as possible, the assessment cannot be exhaustive, and some bias may remain in the results. The outcomes of the evaluation do not reflect the views of any individual. Additionally, we strongly discourage using the test set as training data to boost a model's scores, as this would significantly impede progress in the field. We provide a toolkit to facilitate evaluation by others, and you can submit the results of your own large language models online; privately evaluated submissions are marked with a * on the leaderboard. For more trustworthy LLMs, metrics marked ↑ should be higher and metrics marked ↓ should be lower.

Truthfulness


Safety


Fairness


Robustness


Privacy


Machine Ethics