Friday, June 5, 2026

Anthropic’s Mythos Moment Nears

Today’s Overview

Good morning. AI’s next wave is getting weird fast: Anthropic may be close to a new Mythos release, Meta wants business agents answering customers everywhere, and OpenAI is quietly pushing deeper into hardware. Meanwhile, fresh research shows models beating law professors and stumbling through real security work in revealing ways. Let’s dive in.

Top Stories

Anthropic appears close to a new Mythos launch

Anthropic appears to be preparing a public launch of a new Mythos version that is stronger than Mythos Preview. A checkpoint codenamed Oceanus was recently made available to red teamers, a move that often comes shortly before wider release. The program was reportedly paused after someone resold access through a Chinese API proxy, and it is still unclear whether that will affect timing.

The red-team checkpoint was labeled claude-oceanus-v1-p, directly tying the test build to the Oceanus codename.
The public thread framed the model’s appearance as a signal of newer Mythos models potentially arriving soon.
The available source is narrow, with one unrolled post and no confirmed launch date or product details.

Meta launches Business Agent across its apps

Meta Business Agent gives companies a unified AI layer for customer interactions across WhatsApp, Messenger, and Instagram. It is designed to help businesses answer questions, recommend products, book appointments, qualify leads, and decide when a human should step in.

Meta says more than one million businesses already use a Business Agent on WhatsApp and Messenger.
The product is expanding globally and can respond in local languages while matching a business’s tone.
Meta is also launching an agent platform with connections to hundreds of systems including Shopify, Zendesk, and Shopee.

OpenAI backs Opal’s AI-native hardware push

OpenAI is leading a funding round for Opal Electronics as the camera startup prepares a new product line beyond webcams. The move fits OpenAI’s broader hardware ambitions, even as its own ambient computing project has reportedly faced delays.

Opal is known for the C1 and Tadpole camera products, giving OpenAI a hardware partner with capture-device experience.
The new line is expected to move into AI-native creative devices rather than staying limited to webcams.
Key details remain undisclosed, including form factor and pricing for the upcoming device.

Research & Analysis

A $1,500 test of whether LLMs can hack

A developer built a deliberately vulnerable book review app to test whether LLMs could find a private flag in user reviews. GPT-5.5 performed best, succeeding in seven of 10 runs, while DeepSeek-V4-Pro followed with three successful runs. Many failures came from models getting stuck on the wrong attack surface, hitting budget limits, or refusing because of security guardrails.

The intended exploit was not in the hardened API, but in wide-open Firebase access exposed through app configuration.
Every run had a $10 max budget and a two-hour time limit, making persistence part of the evaluation.
DeepSeek-V4-Pro had the cheapest successful solves among the reported full-run models, at $0.62 per solve.

Law professors preferred AI tutor answers

A Stanford-led study tested AI legal tutoring on contract-law office-hours questions, where judgment matters more than a single correct answer. Sixteen professors from 14 schools blindly judged 2,918 comparisons between faculty-written answers and responses from Google’s Gemini 2.5 Pro and NotebookLM. The professors preferred the AI responses 75% of the time, and an expanded evaluation using an AI stand-in judge ranked Claude Opus 4.7 at the top among additional systems.

The study used 40 representative questions spanning case or code recall, doctrine recall, hypotheticals, and policy.
LLM answers were flagged as harmful in 3.53% of cases, compared with 12.06% for professors.
The paper argues its method can scale by using expert agreements to evaluate AI tutors in judgment-heavy domains.

EVA-Bench expands voice-agent evaluation

EVA-Bench Data 2.0 expands its benchmark from one enterprise domain to three: Airline CSM, Enterprise ITSM, and Healthcare HRSD. The update adds 121 tools and 213 scenarios to test how voice agents handle realistic enterprise tasks.

The new benchmark includes 50 airline scenarios, 80 ITSM scenarios, and 83 Healthcare HRSD scenarios.
The release covers more than 35 distinct workflows across realistic enterprise voice-agent tasks.
Every scenario was checked for solvability against three frontier models before inclusion.

Anthropic makes the case for self-improving AI

Anthropic argues that AI development is accelerating as systems increasingly help design and build their own successors. The piece frames recursive self-improvement as a shift where AI handles more execution while humans still guide priorities and judgment. Anthropic says its internal benchmarks show typical engineers shipping eight times more code than in previous years.

Anthropic says more than 80% of merged code was authored by Claude as of May 2026.
A March 2026 poll of 130 employees found a median estimate of roughly 4x more output with Mythos Preview.
Claude reportedly shipped over 800 fixes in April 2026 that reduced a class of API errors by a factor of one thousand.

Trending AI Tools

ChatGPT memory dreaming: A new memory system turns past chats into a category-sorted profile for better personalization, rolling out first to Plus and Pro users in the U.S.
Poke in iMessage: Apple approved Poke as a third-party AI service inside iPhone Messages, letting users chat with it directly to perform tasks.
BigSet: TinyFish released an open-source multi-agent scraper for building structured live datasets from plain-English descriptions.

Quick Hits

SpaceX is reportedly targeting a $1.75 trillion IPO valuation, which would make it one of the largest public-market debuts ever discussed.
Anthropic’s confidential IPO filing signals a move toward public markets and reportedly targets a $965 billion valuation.
U.S.-Japan AI partnership puts $1 billion behind AI research tied to the U.S. Genesis Mission’s goal of doubling science output.
Google Labs Dreambeans uses AI to curate personalized stories from Google app data like Gmail and Calendar, with recommendations tailored to user interests.
Generalist AI funding brings in $400 million to advance physical AI, with support from investors including Radical Ventures and NVIDIA.
Suno Series D raised over $400 million at a $5.4 billion valuation.
