Programming Language Benchmarks

Quesma Releases OTelBench: Independent Benchmark Reveals Frontier LLMs Struggle with Real-World SRE Tasks

New benchmark shows top LLMs achieve only 29% pass rate on OpenTelemetry instrumentation, exposing the gap between ...

OpenAI introduces Frontier agent management platform and new GPT-5.3-Codex model

OpenAI Group PBC today introduced a platform called Frontier that companies can use to build and manage artificial ...

OpenAI’s GPT-5.3-Codex thinks deeper and wider about coding work

The company says its latest model’s agentic skills also apply to a broader set of knowledge work such as presentations and ...

Qwen3-Coder-Next offers vibe coders a powerful open source, ultra-sparse model with 10x higher throughput for repo tasks

On SWE-Bench Verified, the model achieved a score of 70.6%. This performance is notably competitive when placed alongside significantly larger models; it outpaces DeepSeek-V3.2, which scores 70.2%, ...

i-SCOOP

Qwen3-Max Thinking

Discover Qwen3-Max Thinking, Alibaba's advanced AI model with extended reasoning capabilities. Learn about its features, ...

The Official Microsoft Blog

Maia 200: The AI accelerator built for inference

Today, we’re proud to introduce Maia 200, a breakthrough inference accelerator engineered to dramatically improve the economics of AI token generation. Maia 200 is an AI inference powerhouse: an ...

Open Source Kimi K2.5 Resets the AI Pecking Order

Kimi K2.5 adds Agent Swarm, with up to 100 parallel helpers, and a 256k-token context window, so teams can solve complex work faster.

GPT-5.3-Codex: OpenAI introduces new coding model

OpenAI has introduced GPT-5.3-Codex, a new coding model that, according to the development team, was significantly involved in its own development.

TMCnet

Bito's AI Architect Achieves Highest Success Rate of 60.8% on SWE-Bench Pro

The evaluation used identical Claude Sonnet 4.5 agents under two conditions. In the baseline condition, the agent relied on native file search and tool-driven exploration to infer repository structure ...
