The mathematical reasoning performed by LLMs is fundamentally different from the rule-based symbolic methods in traditional formal reasoning.
On SWE-Bench Verified, the model achieved a score of 70.6%. This performance is competitive with significantly larger models; it outpaces DeepSeek-V3.2, which scores 70.2%, ...
New benchmark shows top LLMs achieve only a 29% pass rate on OpenTelemetry instrumentation, exposing the gap between ...
The suite was initially authored in Crystal and then translated to other languages using AI-assisted tools (DeepSeek). This approach ensures functional and algorithmic parity, though the resulting ...
Large language models (LLMs) have driven rapid progress in natural language processing (NLP), including AI translation. Yet most benchmarks used to evaluate these systems remain heavily ...
Newer languages might soak up all the glory, but these die-hard languages have their place. Here are eight languages developers still use daily, and what they’re good for. The computer revolution has ...
New York, NY, Dec. 09, 2025 (GLOBE NEWSWIRE) -- Sword Health, the world’s leading AI Health company, today unveiled MindEval, the industry’s first benchmark designed to evaluate how large language ...
The R language for statistical computing has crept back into the top 10 in Tiobe’s monthly index of programming language popularity. “Programming language R is known for fitting statisticians and ...
This study introduces MathEval, a comprehensive benchmarking framework designed to systematically evaluate the mathematical reasoning capabilities of large language models (LLMs). Addressing key ...
Over the last decade, artificial intelligence (AI) has been largely built around large language models (LLMs). These systems are built on language and predict words one after another in the form of tokens.