The announcement was made by the nonprofit Center for AI Safety (CAIS) and Scale AI, a company that provides a range of data labeling and AI development services. The two organizations describe the new benchmark as setting a new standard of difficulty for frontier AI systems.
The benchmark, called Humanity's Last Exam, includes thousands of crowdsourced questions on subjects ranging from mathematics to the humanities and the natural sciences. To make the assessment more rigorous, the questions come in multiple formats, including some that combine text with diagrams and images.
In a preliminary study, not one publicly available flagship AI system managed to score better than 10% on Humanity's Last Exam.
CAIS and Scale AI say they plan to open the benchmark up to the research community so researchers can dig deeper into variations in model performance and evaluate new AI models.