We made a chart of 44 documented incidents of AI agents acting against user intent – sometimes subverting routine security and deceptively hiding evidence of their actions.
I've started a PhD program at the University of Oxford, researching AI governance! I'll be doing this program part-time while continuing my role as a Member of Policy Staff at METR in Berkeley.
I'm in the Department of Engineering Science, advised by Professors Phil Torr and
Criminal liability for not being a helicopter parent is crazy. Starting in 4th or 5th grade, I was taking the NYC subway home by myself for an hour multiple times a week.
Actually GPT-5 is WAY more capable than expectations! Best model we'll ever need! Everyone working on AI capabilities should pat themselves on the back and go on indefinite vacation.
From the White House's National Security Memorandum on AI: Automated AI R&D might pose a threat to national security. The U.S. AI Safety Institute will be evaluating AI R&D capabilities pre-deployment.
1/ METR has never taken money from AI labs
2/ METR has had anywhere from a few days to a few months to evaluate a deployment candidate model, prior to deployment (examples: metr.org/blog/2025-02-2…, metr.github.io/autonomy-evals…)
3/ METR is able to share information to the extent it
How is OpenAI’s statement compatible with Daniel K forfeiting equity? “We have never canceled any current or former employee’s vested equity nor will we if people do not sign a release or nondisparagement agreement when they exit.”
It should be possible to set up a system to automatically detect the "best of n" strategy, especially if the n papers are submitted by the same authors? Even before LLMs, it wouldn't have been hard to write multiple variations/paraphrases of the same paper.
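A rough sketch of what such a detector might look like, assuming each submission carries an ID, a set of author names, and its text. The field names, the 3-word shingle size, and the 0.5 threshold are all illustrative assumptions, not anything a real venue uses. It only compares papers that share at least one author and flags pairs with high word-shingle overlap. It would catch light rewording, while heavier paraphrasing would likely need something like embedding similarity instead.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in the lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_near_duplicates(submissions, threshold=0.5):
    """Flag pairs of submissions that share an author and overlap heavily.

    submissions: list of dicts with hypothetical keys
    'id', 'authors' (iterable of names), and 'text'.
    Returns a list of (id_1, id_2, similarity) tuples.
    """
    signatures = {s["id"]: shingles(s["text"]) for s in submissions}
    flagged = []
    for s1, s2 in combinations(submissions, 2):
        # Only compare papers with at least one author in common.
        if not set(s1["authors"]) & set(s2["authors"]):
            continue
        score = jaccard(signatures[s1["id"]], signatures[s2["id"]])
        if score >= threshold:
            flagged.append((s1["id"], s2["id"], score))
    return flagged
```

Pairwise comparison is quadratic in the number of submissions per author group, which is probably fine at conference scale; something like MinHash would be the usual next step if it were not.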
Per our Frontier Safety Framework, we continue to test our models for critical capabilities. Here’s the updated model card for Gemini 2.5 Pro with frontier safety evaluations + explanation of how our safety buffer / alert thresholds approach applies to 2.0, 2.5, and what’s coming.
Curious about how the Japan AI Safety Institute (AIセーフティ・インスティテュート) is thinking about AI safety evaluation and red-teaming? They've put out two English-language reports: aisi.go.jp/assets/pdf/ai_…, aisi.go.jp/assets/pdf/ai_…
There are a lot of papers + model cards out there related to dangerous capability evals! I've made an Airtable to compile them. Definitely not comprehensive but hope this is useful.