The mathematical reasoning performed by LLMs is fundamentally different from the rule-based symbolic methods used in traditional formal reasoning.
On SWE-Bench Verified, the model achieved a score of 70.6%. This performance is competitive with significantly larger models; it edges out DeepSeek-V3.2, which scores 70.2%, ...
New benchmark shows top LLMs achieve only 29% pass rate on OpenTelemetry instrumentation, exposing the gap between ...
The suite was initially authored in Crystal and then translated to other languages using AI-assisted tools (DeepSeek). This approach preserves functional and algorithmic parity, though the resulting ...
It also includes automatic tuning, caching, and a Pythonic interface for ease of use. Tilus is pronounced tie-lus, /ˈtaɪləs/. Tilus supports the Ampere architecture, and we are actively working on the ...
Abstract: Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by ...
Abstract: Large Vision-Language Models (LVLMs) exhibit "multimodal distractibility," where plausible but irrelevant visual or textual inputs cause significant drops in reasoning consistency and lead to ...
Wall Street broker Benchmark views a delay of the Senate Banking Committee's crypto market-structure bill as a potentially constructive pause rather than a setback. "While the delay may at first ...