Monday, May 25, 2026

Alibaba’s 35-Hour Model Flex

Today’s Overview

Good morning. Alibaba just made a loud entrance into the agent race with a model that it says can run for 35 hours straight. At the same time, Anthropic is widening Claude’s enterprise security footprint, while OpenAI is pushing better ways to evaluate agentic systems. Let’s dive in.

Top Stories

Alibaba’s Qwen3.7-Max Runs for 35 Hours

Alibaba’s Qwen3.7-Max is positioned as a proprietary flagship model built for long-horizon agent work. The release pairs a 1-million-token context window with native support for external harnesses like Claude Code, and Alibaba says it outperforms key rivals on math tasks. It is available through paid API access only.

Alibaba says the model can keep going for 35 hours autonomously in a single run.
The system is built for large workflows, with a 1-million-token context window and a 64K-token output limit.
Alibaba also says it supports Claude Code integration through native Anthropic API protocol compatibility.

Anthropic Readies Mythos 1 Release

Anthropic appears to be moving Claude Mythos toward broader availability. Traces have shown up across Google Cloud and AWS through vulnerability discovery programs, which suggests a release may be close.

Mythos has already surfaced in Google Cloud and AWS traces tied to vulnerability discovery programs.
The model appears to be shifting toward broader availability rather than staying limited to internal use.
The same report also points to Claude Opus 4.8 rumors as another model in the pipeline.

Bezos Calls Prometheus An Engineer

Jeff Bezos says Project Prometheus is not a robotics company. Instead, he described it as an effort to build an artificial general engineer that automates how physical objects are created, with physics-based simulation at the center. The startup launched in late 2025 and has already raised substantial funding.

Bezos framed the company as a next-generation CAD system rather than a robotics play.
The startup is focused on physics-based cloud simulations instead of text-heavy workflows.
It has pulled in talent from OpenAI, DeepMind, Meta, and xAI as it builds out the team.

Research & Analysis

AlphaProof Nexus Lands Real Math Wins

Google DeepMind’s AlphaProof Nexus is showing how formal verification changes the shape of AI research. By pairing LLM-generated proof attempts with Lean’s proof checker, the system has already solved open Erdős problems and proven new OEIS conjectures. The result is a concrete example of agents doing work where correctness matters as much as speed.

The system solved 9 of 353 Erdős problems and proved 44 of 492 OEIS conjectures.
Lean acts as the filter, rejecting invalid steps so bad proof paths do not survive.
DeepMind says some results cost only a few hundred dollars per problem, making the approach unusually efficient.
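
Lean's role as a filter can be illustrated with a toy example. This is a minimal Lean 4 sketch that has nothing to do with the actual Erdős or OEIS results; the theorem name `add_zero_toy` is hypothetical. It only shows that the checker accepts a proof when the statement holds and refuses one when it does not.

```lean
-- Accepted: both sides reduce to the same numeral, so `rfl` type-checks.
example : 2 + 2 = 4 := rfl

-- Accepted: `n + 0` reduces to `n` by the definition of addition on `Nat`.
theorem add_zero_toy (n : Nat) : n + 0 = n := rfl

-- Rejected if uncommented: the statement is false, so no proof term type-checks.
-- example : 2 + 2 = 5 := rfl
```

In a pipeline like the one described, a model could propose many candidate proofs, and only those the checker accepts would count as results.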

Intology’s Locus Sets A Benchmark Mark

Intology says its Artificial Scientist, Locus, set a new world record on NanoGPT-Bench by tackling a difficult training step with a fused Triton kernel. The company is using the result to argue that research agents can do more than clean up code and may help push harder technical work forward.

The breakthrough came from a fused Triton kernel aimed at a difficult training step.
NanoGPT-Bench is meant to separate real R&D from easy wins rather than reward routine cleanup.
Intology is using Locus to argue for research-capable agents instead of narrow coding assistants.

OpenAI Pushes Macro Evals For Agents

OpenAI outlined a macro-evaluation workflow for agentic systems that looks across populations of traces instead of focusing on isolated failures. The cookbook walks through a synthetic EV order workflow and shows how teams can turn many runs into a smaller set of recurring behavior patterns. The goal is to help reviewers decide where to look first when systems get messy at scale.

The workflow is built around whole populations of traces rather than single bad outputs.
It uses a synthetic EV order workflow with specialist agents for pricing, compliance, supply, routing, and scheduling.
OpenAI says the process can reduce thousands of events into a small set of patterns that humans can inspect.
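
The idea of collapsing many trace events into a few recurring patterns can be sketched in a few lines of Python. This is an illustrative sketch only, not the cookbook's code: the `TraceEvent` fields, the outcome labels, and the choice to group by agent and outcome are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    """One step from one run (hypothetical schema for illustration)."""
    run_id: str
    agent: str    # e.g. "pricing", "compliance", "routing"
    outcome: str  # e.g. "ok", "timeout", "policy_violation"


def summarize_patterns(events, top_n=3):
    """Count non-ok (agent, outcome) pairs across all runs, most frequent first."""
    counts = Counter((e.agent, e.outcome) for e in events if e.outcome != "ok")
    return counts.most_common(top_n)


# Made-up events standing in for traces from a synthetic order workflow.
events = [
    TraceEvent("r1", "pricing", "ok"),
    TraceEvent("r1", "compliance", "policy_violation"),
    TraceEvent("r2", "compliance", "policy_violation"),
    TraceEvent("r3", "compliance", "policy_violation"),
    TraceEvent("r2", "routing", "timeout"),
    TraceEvent("r3", "routing", "timeout"),
    TraceEvent("r4", "scheduling", "ok"),
    TraceEvent("r4", "supply", "stale_inventory"),
]

patterns = summarize_patterns(events, top_n=2)
for (agent, outcome), count in patterns:
    print(f"{agent}: {outcome} x{count}")
```

A reviewer would start with the most frequent pattern and pull a few example runs for it, rather than reading every trace in order.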

Trending AI Tools

Codex upgrades: New attachments, goal mode, locked computer use, and better web annotations.
Freu AI: Learns from user actions to automate cross-software workflows.
Claude Compliance API: Adds 28 security integrations for enterprise governance and telemetry.

Quick Hits

Yansu learns how you work and turns it into software.
Anthropic’s compute bill is reportedly $1.25 billion per month for access to Colossus and Colossus II through 2029.
White House chip funding puts $9 billion behind spy agencies buying advanced AI chips.
Command A+ is Cohere’s open enterprise workhorse.
Orchestria is an AI music engine with granular stem control.
MCP release candidate brings a stateless core, updated authorization, and breaking changes before the July 28 final spec.
