Language Model Evaluation

Anthropic releases Claude Sonnet 4.6: Benchmark performance, how to try it

According to Anthropic, "Claude Sonnet 4.6 is our most capable Sonnet model yet." The company says Sonnet 4.6 has a 1 million ...

Devdiscourse

Revolutionizing Neuroscience: Manas-1 - The AI Brain Language Model

NeuroDx has introduced Manas-1, a 400-million-parameter AI model aimed at decoding brain electrical activity. This breakthrough enables early diagnosis of neurological conditions with over 95% ...

AIM Intelligence and BMW Group Examine Gaps in Evaluating Enterprise AI Policy Compliance

Research reveals LLMs follow allowlist policies but systematically fail to enforce organizational prohibitions, ...

Research Ignited Students Publish Academic Papers and Build AI Models in Expanded PhD-Led Program

High school students gain PhD-led mentorship, publish original research, and build real-world AI models through ...

Model Response Optimization Emerges as Critical Foundation for AI-Era Brand Management

New discipline addresses gap between brand intent and AI-generated descriptions as 95% of B2B buyers plan to use ...

Slator

Academia and Hyperscalers Building the Core Infrastructure for African Language AI

New translation models, open speech datasets, and automatic speech recognition benchmarks aim to expand AI support for African languages.

Type Dynamics Assessment from Core Factors Bridges the Gap Between the Four-Letter Code and Jung’s Cognitive Processes

Type Dynamics assessment gives MBTI-qualified practitioners a way to move beyond the four-letter code and into Jung's 8 ...

The Lancet

Large language models and misinformation

The barrage of misinformation in the field of health care is persistent and growing. The advent of artificial intelligence (AI) and large language models (LLMs) in health care has expedited the ...

The Lancet

Ethical and regulatory challenges of large language models in medicine

GitHub

xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

If you are developing a benchmark, you can use xFinder in place of traditional RegEx methods to extract key answers from LLM responses. This will help you improve the accuracy of your ...
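
For context, the kind of RegEx-based key-answer extraction this snippet refers to might look like the sketch below. It is only an illustration under stated assumptions: the function name `extract_choice`, the patterns, and the A-D multiple-choice format are hypothetical and are not taken from xFinder or any particular benchmark.

```python
import re
from typing import Optional

# Hypothetical patterns for multiple-choice answers labelled A-D.
# These are illustrative assumptions, not xFinder's rules.
_PATTERNS = [
    # Matches phrasings such as "The answer is (B)" or "Answer: c".
    re.compile(r"answer\s*(?:is|:)\s*\(?([A-D])\b\)?", re.IGNORECASE),
    # Matches a line that contains only the option letter, e.g. "B" or "(B)".
    re.compile(r"^\s*\(?([A-D])\)?[.)]?\s*$", re.MULTILINE),
]


def extract_choice(response: str) -> Optional[str]:
    """Return the first option letter found by the patterns, or None."""
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match:
            return match.group(1).upper()
    return None
```

A fixed pattern list like this returns `None` for a response such as "I think it's option C." because no pattern anticipates that phrasing, which is the sort of brittleness a learned extractor is meant to reduce.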

GitHub

S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models

📢 If our work is useful for your research, please star ⭐ our project. 📣 [2025/10/09]: We updated the evaluation for the latest LLMs in the 🏆 LeaderBoard and also released Octopus, an automated LLM ...
