Tech Insights 2026 Week 9

    A year ago our top benchmark for measuring “smartness” in AI models, ARC-AGI, was becoming saturated. OpenAI o3 scored over 87% on it in a high-compute configuration, and it became clear we needed a new benchmark. ARC-AGI-2 was designed and launched in March 2025. The test is based on visual problem solving, where humans average around 60%. The best models in April 2025 scored less than 4% on ARC-AGI-2.

    In November 2025 Google launched Gemini 3 Pro, which scored an impressive 31% on the ARC-AGI-2 test, and a month later OpenAI launched GPT-5.2-extra-high, which scored 52.9%. Two weeks ago Opus 4.6 scored 68.8%, and now Gemini 3.1 Pro scores 77.1% on ARC-AGI-2. It’s hard to overstate just how impressive this development has been over the past year – and just how competent today’s models are at almost any task you throw at them. Every month the models become smarter, and we discover amazing new ways to use them.

    Last week’s top news is Gemini 3.1 Pro, and this model made a significant impression on me. It has a visual understanding that’s unmatched by anything we have seen so far, and it can create vectorized graphics and interactive experiences of amazing quality. It has been specifically tuned to create beautiful SVG images. If you have a minute, go to the Gemini 3.1 Pro launch page and scroll down to “Intelligence applied”. I promise you will be impressed.

    The other main topic from last week is security. As models become smarter, they tend to find creative ways to solve tasks. AI agents write scripts, run custom-written software, access the Internet and save temporary files to your system. The main problem with this is that they run with your user permissions, so anything you can do on your computer, the agent can do too. The solution so far has been to have the agent ask for permission before every action, but that slows down execution. A much better solution is to run the agent within a container, and that is exactly what Cursor launched with Agent Sandboxing last week. OpenAI also launched a new Lockdown mode to further reduce the risks when chat agents access the web.

    If you are working with Codex, Claude Code or GitHub Copilot in agent mode, I strongly recommend that you check out devcontainers and Docker. The smarter the models become, the more it is in your interest to make sure they only have access to the information they need. With ARC-AGI-2 scores already above the human average, in 1-2 years you will be releasing truly superhuman intelligence on your system. Make sure it’s you who is in control when you do.

    The cover image for this newsletter was of course created with Gemini 3.1 Pro and is a fully vectorized SVG image. You can download it here.

    Thank you for being a Tech Insights subscriber!

    Listen to Tech Insights on Spotify: Tech Insights 2026 Week 9 on Spotify

    1. OpenClaw Creator Peter Steinberger Joins OpenAI
    2. Google Releases Gemini 3.1 Pro with Major Reasoning Gains
    3. Google Integrates Lyria 3 Music Generation Into Gemini App
    4. Google Pomelli Adds Free AI Product Photography via “Photoshoot”
    5. Chrome Ships WebMCP Early Preview
    6. Claude Opus 4.6 Reaches 14.5-Hour METR Time Horizon
    7. Anthropic Releases Claude Sonnet 4.6 With 1M Token Context Window
    8. Figma and Anthropic Launch “Code to Canvas”
    9. Alibaba Releases Qwen3.5: Open-Weight Multimodal Agent Model
    10. xAI Releases Grok 4.20 Public Beta
    11. Cursor Introduces Agent Sandboxing
    12. OpenAI Adds Lockdown Mode and Elevated Risk Labels to ChatGPT

    OpenClaw Creator Peter Steinberger Joins OpenAI

    https://techcrunch.com/2026/02/15/openclaw-creator-peter-steinberger-joins-openai/

    The News:

    • Peter Steinberger, an Austrian developer who created OpenClaw (formerly Clawdbot, then Moltbot), has joined OpenAI to work on personal AI agents.
    • OpenClaw is a free, open-source autonomous AI agent that uses messaging platforms (Telegram, Slack, WhatsApp, Discord) as its primary interface and can execute tasks including calendar management, flight booking, browser automation, and proactive notifications without being prompted.
    • The project reached over 190,000 GitHub stars, nearly double the count from its initial 100,000-star milestone weeks earlier.
    • OpenClaw will transfer to an independent open-source foundation that OpenAI will continue to support financially, rather than being absorbed into OpenAI’s proprietary product line.
    • Steinberger wrote in his announcement post: “What I want is to change the world, not build a large company, and teaming up with OpenAI is the fastest way to bring this to everyone.”
    • Sam Altman stated on X that Steinberger will “drive the next generation of personal agents” at OpenAI.

    My take: I believe current models like Sonnet 4.6, GPT-5.2 and Opus 4.6 are still not good enough to let loose autonomously, even for experimentation. When a task matches their training patterns they perform almost like a virtual human, but outside that they can fail at even the simplest of tasks. OpenClaw won because users found enough use cases where current AI models performed amazingly well, which worked fine as demos posted on social media. The failed use cases get talked about far less.

    With Peter joining OpenAI I don’t think OpenClaw has any real future. It was an experiment, and I believe the lessons Peter learnt while building it are what OpenAI is buying. They will use his expertise and the experience from that journey to do it one more time, with an official product running in a more secure manner. And when that product is finished in a year or so, I believe the AI models will also be good enough to make this kind of product really useful.

    Google Releases Gemini 3.1 Pro with Major Reasoning Gains

    https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro

    The News:

    • Google released Gemini 3.1 Pro on February 18, 2026, rolling it out across the Gemini app, NotebookLM, Gemini API, Vertex AI, Google AI Studio, Gemini CLI, Antigravity, and Android Studio.
    • On the ARC-AGI-2 benchmark, which tests a model’s ability to solve novel logic patterns, 3.1 Pro scored 77.1%, up from 31.1% for Gemini 3 Pro, a 148% increase.
    • On GPQA Diamond, a graduate-level science reasoning test, 3.1 Pro scored 94.3%, compared to 92.4% for GPT-5.2 and 91.3% for Claude Opus 4.6.
    • The model generates animated SVGs directly from text prompts, producing code-based vector animations with small file sizes relative to video formats (a minimal API sketch follows this list).
    • 3.1 Pro is currently in preview; general availability is described as “coming soon,” with access restricted to Google AI Pro and Ultra plan subscribers in the consumer app.
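
    To make the SVG capability concrete, here is a minimal sketch of how a request for an animated SVG might look through the Gemini API using Google’s Gen AI SDK for TypeScript. The call shape follows the current @google/genai package; the model identifier is an assumption on my part, since the exact preview string was not in the announcement.

        import { GoogleGenAI } from "@google/genai";

        // Minimal sketch, assuming the model id below. The exact preview
        // string was not published, so check Google AI Studio before use.
        const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

        async function main() {
          const response = await ai.models.generateContent({
            model: "gemini-3.1-pro-preview", // assumed identifier
            contents:
              "Create a self-contained animated SVG of a paper plane looping " +
              "through clouds. Return only the <svg> markup, no explanation.",
          });

          // The SDK exposes the concatenated text output; for this prompt it
          // should be the SVG markup itself.
          console.log(response.text);
        }

        main().catch(console.error);

    Saving the returned markup to a .svg file and opening it in a browser is enough to see the result, since SVG animations run natively in the browser.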

    My take: This model is a true game changer! Stop what you are doing and go watch this video. Then go to their launch page and scroll down to “Intelligence applied”. Gemini 3.1 Pro not only knows how to draw things well, it understands the context and adapts the visual rendering based on it. This is unlike anything you have seen from an AI model so far when it comes to vector visualizations.

    “I’ve been developing the SVG generation capabilities for Gemini 3.1, and the complexity of the SVGs is stunning! 🚀 This allows UX designers to transcend pixel constraints and directly output structural, production-ready code!” Jiao Sun, Google DeepMind

    This model seems to be particularly good at visual understanding, which explains the massive score on the ARC-AGI-2 benchmark and the amazing SVG creation capabilities. This is hands-down the biggest “threat” we have seen so far to graphic designers as a profession. If an AI model truly understands the context and can render vectorized images with extremely high quality and detail, for many companies this will be good enough for all their needs in this area.

    The ARC-AGI-2 benchmark


    Google Integrates Lyria 3 Music Generation Into Gemini App

    https://blog.google/innovation-and-ai/products/gemini-app/lyria-3

    The News:

    • Google DeepMind’s Lyria 3 is now in beta inside the Gemini app, letting any user aged 18+ generate 30-second music tracks with lyrics from text prompts or uploaded photos and videos.
    • Lyria 3 auto-generates lyrics from the prompt, removing the need to write them manually. Previous Lyria models required users to supply their own lyrics.
    • Users can control style, tempo, and vocal type. Supported languages include English, German, Spanish, French, Hindi, Japanese, Korean, and Portuguese.
    • Each generated track includes AI-made cover art from Nano Banana and can be shared directly via a link.
    • All tracks are watermarked with SynthID. The Gemini app can now detect SynthID in audio, images, and video files uploaded by users.
    • Lyria 3 also powers YouTube’s Dream Track feature for Shorts, now expanding beyond the U.S. Higher usage limits apply to Google AI Plus, Pro, and Ultra subscribers.

    My take: Google Lyria 3 ships directly inside the Gemini and YouTube apps. Need music for your video clip? Just click to generate it. If you have any kind of media production business, you need to follow developments in this area closely. Things are moving extremely fast, and competing with Google, which integrates these things straight into YouTube, will be very difficult going forward.

    Google Pomelli Adds Free AI Product Photography via “Photoshoot”

    https://blog.google/innovation-and-ai/models-and-research/google-labs/pomelli-photoshoot

    The News:

    • Google Labs added Photoshoot to Pomelli, its free marketing platform for small and medium-sized businesses, converting basic product photos into studio-style images without any professional equipment.
    • Users upload any product photo, including smartphone snapshots, then select from templates such as Studio, Floating, Ingredient, In Use (with AI-generated model), and Lifestyle.
    • Photoshoot uses the Business DNA system, which extracts brand colors, tone, and aesthetic from the business website and applies them automatically to generated images.
    • Generated images can be downloaded directly or saved to Business DNA for reuse in future campaigns.
    • Users can alternatively skip uploading an image entirely and provide a product URL, letting Photoshoot pull images, title, and description directly from the product page.
    • The service is currently available for free in the United States, Canada, Australia, and New Zealand.

    My take: Can you see where this is going? We now have Google Nano Banana Pro for raster image generation, Gemini 3.1 Pro for vectorized image generation, Lyria 3 for music generation and now also Pomelli Photoshoot for product photo shoots. Google is adding one component after another to automate basically every process when it comes to product design and marketing. Will this replace a team of graphic designers? Not today. But will it be good enough for most companies in 1-2 years? Probably yes.

    Chrome Ships WebMCP Early Preview

    https://developer.chrome.com/blog/webmcp-epp

    The News:

    • WebMCP is a W3C proposal co-authored by Google and Microsoft engineers that lets websites expose structured, callable tools directly to AI agents via a new browser API, navigator.modelContext, without requiring a separate MCP server.
    • Chrome 146 Canary ships WebMCP behind a feature flag at chrome://flags, available now for early preview program participants.
    • Two APIs are available: a declarative API where developers add toolname and tooldescription attributes to existing HTML forms, and an imperative API using navigator.modelContext.registerTool() with a name, description, JSON schema, and a JavaScript callback (see the sketch after this list).
    • Early benchmarks from one source cite approximately 67% reduction in computational overhead compared to traditional agent-browser interaction methods such as DOM parsing and screenshot analysis, with task accuracy around 98%.
    • Google engineer Khushal Sagar describes the goal as making WebMCP the “USB-C of AI agent interactions with the web,” a single standard interface any agent can use regardless of underlying LLM.
    • Current limitations include no headless mode support (tool calls require a visible browser tab), no built-in discoverability mechanism for clients to know which sites support tools without visiting them, and potential UI refactoring requirements for complex apps.
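
    To illustrate the imperative path, here is a minimal TypeScript sketch of what registering a tool could look like. Only navigator.modelContext.registerTool() and its rough shape are named in the preview; the exact field names (inputSchema, execute) and the addToCart helper are assumptions for illustration and may differ in the final API.

        // Minimal sketch of the imperative WebMCP API. Field names such as
        // inputSchema and execute are assumptions; only registerTool() itself
        // is named in the early preview.

        // Hypothetical existing cart logic already present on the site.
        async function addToCart(productId: string, quantity: number) {
          return { ok: true, productId, quantity };
        }

        // Type the experimental API so TypeScript accepts it.
        declare global {
          interface Navigator {
            modelContext?: {
              registerTool(tool: {
                name: string;
                description: string;
                inputSchema: object;
                execute: (args: Record<string, unknown>) => Promise<unknown>;
              }): void;
            };
          }
        }

        navigator.modelContext?.registerTool({
          name: "add_to_cart",
          description: "Add a product to the shopping cart by product id.",
          inputSchema: {
            type: "object",
            properties: {
              productId: { type: "string" },
              quantity: { type: "number" },
            },
            required: ["productId"],
          },
          // Runs in the page's own JavaScript, so the agent reuses the site's
          // cart logic instead of parsing the DOM or reading screenshots.
          execute: async (args) =>
            addToCart(String(args.productId), Number(args.quantity ?? 1)),
        });

        export {};

    The declarative path expresses the same intent with far less code: the site adds toolname and tooldescription attributes to its existing cart form, and the browser exposes it as a tool.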

    “The API is not intended for fully autonomous agents operating without human oversight” WebMCP GitHub

    My take: WebMCP is built with one specific use case in mind – assisting users with browsing the web. Right now AI agents have to scrape the DOM of web pages or take screenshots to understand what is on the page, and neither option is suitable for daily use. WebMCP is targeted for launch in mid-to-late 2026, and that sounds reasonable. This means we can train the next generation of AI models on how to navigate the web the right way with WebMCP, and within a year or so we might have solved how to do user-assisted autonomous web browsing safely and at high speed.


    Claude Opus 4.6 Reaches 14.5-Hour METR Time Horizon

    https://metr.org/time-horizons

    The News:

    • Anthropic’s Claude Opus 4.6 has been evaluated by METR (Model Evaluation and Threat Research) on autonomous software task completion, with results published February 20, 2026.
    • METR estimates a 50% time horizon of approximately 14.5 hours for Opus 4.6, meaning the model successfully completes software tasks that a human expert would take up to 14.5 hours to finish, at least half the time. The 95% confidence interval is wide: 6 hours to 98 hours.
    • The progression across generations is stark: Claude 3.7 Sonnet (early 2025) measured 59 minutes, Opus 4.1 extended further in August 2025, Opus 4.5 reached 4 hours 49 minutes in December 2025, and Opus 4.6 now records 14.5 hours.
    • METR’s time horizon metric measures task difficulty (calibrated by human expert completion time), not the literal wall-clock time the AI spends. Agents typically run several times faster than humans on tasks they complete.
    • The METR task set draws primarily from software engineering, machine learning, and cybersecurity domains. Performance on tasks outside those domains is expected to differ substantially.

    My take: Claude Opus is an amazing problem-solver. You can send it almost any task, and it just fixes it. This is why most people love using it. In this case Claude Opus 4.6 succeeds 50% of the time at tasks that would take humans over 14 hours to finish. What this benchmark does not measure is the quality of the source code produced to solve the task. The benchmark I would really like to see, which does not exist, is one showing how long it took a model to produce 100% working code, and then rating the quality of that code. In the meantime I’ll continue using GPT-5.2-extra-high and GPT-5.3-CODEX-extra-high for advanced programming and Claude Opus 4.6 for fixing things quickly, and right now I am quite happy with that setup.

    Anthropic Releases Claude Sonnet 4.6 With 1M Token Context Window

    https://www.anthropic.com/news/claude-sonnet-4-6

    The News:

    • Anthropic released Claude Sonnet 4.6 on February 17, 2026, as the new default model on Free and Pro plans at the same price as Sonnet 4.5: $3 per million input tokens and $15 per million output tokens.
    • The model scores 79.6% on SWE-bench Verified, compared to 80.8% for the more expensive Opus 4.6, at one-fifth the price.
    • On OSWorld, the standard computer use benchmark, Sonnet 4.6 scores 72.5%, up from 14.9% when Anthropic launched computer use sixteen months ago, and nearly matching Opus 4.6 at 72.7%.
    • In Claude Code testing, users preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time, citing better context reading, less logic duplication, and fewer false success claims. Users also preferred it over the previous flagship Opus 4.5 in 59% of comparisons.
    • A 1M token context window is available in beta, large enough to hold entire codebases or extensive document collections in a single request.
    • The API model identifier is claude-sonnet-4-6, with support for adaptive thinking, extended thinking, and context compaction in beta (a minimal call sketch follows this list).
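
    For reference, here is a minimal sketch of calling the model through Anthropic’s TypeScript SDK. The model id comes from the announcement and the call shape is the standard Messages API; whether your plan has access, and which beta opt-in (if any) unlocks the 1M-token context window, is not shown here and should be taken from Anthropic’s docs.

        import Anthropic from "@anthropic-ai/sdk";

        // Minimal sketch: the model id is from the announcement, the rest is
        // standard Messages API usage. The 1M-token context window is in beta
        // and likely requires an opt-in that is not shown here.
        const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

        async function main() {
          const message = await client.messages.create({
            model: "claude-sonnet-4-6",
            max_tokens: 1024,
            messages: [
              {
                role: "user",
                content: "Review this function for logic duplication: ...",
              },
            ],
          });

          // content is an array of blocks; text blocks carry the reply.
          console.log(message.content);
        }

        main().catch(console.error);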

    My take: Claude Sonnet 4.6 was not the massive improvement many had hoped for. Most senior developers I know prefer GPT-5.3-Codex-extra-high to Opus 4.6 for programming, mainly because it’s so much better at using the full context window and takes more time to analyze before producing code. Opus 4.6 is quicker and likes to try different approaches, throwing many things at the target just to see what sticks. Sonnet 4.6 takes a similar problem-solving approach but comes to conclusions even faster and feels even more “stressed” than Opus. The main improvement in Sonnet 4.6 seems to come from its increased skills in “computer use”. It’s definitely improving in this area, but it’s still extremely slow and close to unusable for most everyday applications, mainly since it needs to capture everything with screenshots. WebMCP should make this experience much better in 6-12 months.

    Figma and Anthropic Launch “Code to Canvas”

    https://www.figma.com/blog/introducing-claude-code-to-figma

    The News:

    • Figma partnered with Anthropic on February 17, 2026 to release “Code to Canvas”, a feature that converts AI-generated code from tools like Claude Code into fully editable frames inside Figma.
    • Users capture working interfaces built in Claude Code and paste them into Figma as editable frames, not static images. Teams can then annotate, duplicate, compare layouts side by side, and explore design variations without touching the underlying codebase.
    • For multi-step flows, users can capture multiple screens in one session, preserving sequence and context across the entire experience.
    • The feature works bidirectionally: Figma’s MCP server lets developers pull Figma frames back into a coding environment using a prompt and a link, enabling a round-trip between design and code.

    My take: My experience with both Claude Sonnet and Opus is that they are not very good at creating visual designs. They are decent when you ask them to build something that matches their training data, but as soon as you go outside this they typically struggle. Maybe Google Gemini 3.1 Pro would have been a better fit for Figma here? Gemini 3.1 Pro is in a completely different league when it comes to vectorized designs – it will take a long time for Claude Sonnet/Opus just to catch up to where Gemini is today, and by that time Gemini will be able to produce almost any vectorized visual material you can think of. But maybe that is also why Figma went with Anthropic: Google might be their biggest long-term threat in autonomous UI design.

    Alibaba Releases Qwen3.5: Open-Weight Multimodal Agent Model

    https://qwen.ai/blog?id=qwen3.5

    The News:

    • Alibaba released Qwen3.5-397B-A17B on February 17, 2026, under the Apache 2.0 license, making it freely available for download, fine-tuning, and commercial deployment.
    • The model uses a sparse Mixture-of-Experts (MoE) architecture with 397 billion total parameters but activates only 17 billion per query, paired with a new Gated Delta Networks attention layer that reduces compute costs further.
    • Compared to its predecessor Qwen3-Max (a 1 trillion+ parameter model), Qwen3.5 processes requests 19x faster at 256K token context and 8.6x faster on standard workloads.
    • The model supports 201 languages and dialects, up from 119 in the previous generation, with a 250,000-token vocabulary that speeds up processing for most languages by 10 to 60 percent.
    • It includes visual agentic capabilities, operating phone and desktop application interfaces autonomously, such as filling out spreadsheets or executing multi-step workflows.
    • API pricing is $0.40 per million input tokens and $2.40 per million output tokens. A hosted version, Qwen3.5-Plus, with a 1 million token context window is available via Alibaba Cloud Model Studio with web search, code interpreter, and adaptive reasoning (a minimal client sketch follows this list).
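
    Since the hosted Qwen3.5-Plus is served through Alibaba Cloud Model Studio, which has historically exposed an OpenAI-compatible endpoint, a call could look roughly like the sketch below. Both the base URL and the model string are assumptions on my part; verify them against the current Model Studio documentation.

        import OpenAI from "openai";

        // Sketch only: the base URL and the "qwen3.5-plus" model string are
        // assumptions; check the Model Studio docs for the exact values.
        const client = new OpenAI({
          apiKey: process.env.DASHSCOPE_API_KEY,
          baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
        });

        async function main() {
          const completion = await client.chat.completions.create({
            model: "qwen3.5-plus",
            messages: [
              {
                role: "user",
                content: "Outline the steps to fill in a quarterly sales spreadsheet.",
              },
            ],
          });
          console.log(completion.choices[0].message.content);
        }

        main().catch(console.error);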

    My take: On benchmarks alone, this model seems to perform very well, almost on par with a state-of-the-art model like Opus 4.6. And 201 supported languages is a crazy figure; it makes you wonder just how much it affects actual performance. But compared to Opus and GPT this is a fairly small model at 397B parameters, so my feeling is that this is yet another “benchmark-optimized” model that scores very high on a few select benchmarks but not so well in actual everyday productivity. Have you used it? I would love to be proven wrong here.


    xAI Releases Grok 4.20 Public Beta

    https://twitter.com/elonmusk/status/2023829664318583105

    The News:

    • xAI released the public beta of Grok 4.20 on February 17, 2026, available to SuperGrok (~$30/month) and X Premium+ subscribers who manually select it from the model menu. The release introduces a multi-agent architecture as its core change from prior versions.
    • Instead of a single model, Grok 4.20 runs four specialized agents in parallel: Grok (orchestrator), Harper (real-time research via X firehose), Benjamin (math and logic), and Lucas (creative output). The four agents debate and fact-check each other before producing a final response. For complex tasks, the system can scale to 16 agents in “Heavy” mode.
    • The context window stands at 256K by default, expandable to 2M tokens in agentic and tool-use modes. The model ingests roughly 68 million English tweets per day from X’s firehose, giving it near-real-time grounding for news and market data.
    • Elon Musk announced that Grok 4.20 uses a “rapid learning” architecture, with weekly model updates and release notes, unlike prior versions which shipped on multi-month cycles.
    • In Alpha Arena Season 1.5, a live-money trading competition held in January 2026, Grok 4.20 posted a baseline return of +10-12% (up to +34-47% in optimized configurations), while all competing models from OpenAI, Google, and Anthropic finished in the red. Four Grok 4.20 variants claimed four of the top six spots.

    “Grok 4.2 will be about an order of magnitude smarter and faster than Grok 4 when the public beta concludes next month. Still many bug fixes and improvements landing every day. The public beta gives us more critical feedback to address” Elon Musk

    My take: Of course Elon Musk versioned xAI’s new top model 4.20 (420 is slang for cannabis consumption). So far xAI hasn’t published any official benchmarks other than the “70.8%” SWE-Bench figure, but since this is a beta I would consider it more of an internal, unoptimized version than a production model. Still, I think the approach with four agents working together is interesting. For most tasks it is probably overkill, so I am guessing this model will perform very well on some tasks and a bit worse on others when it’s out of beta. I typically don’t like to post news about beta models, but I thought this architecture was interesting and worth keeping an eye on.

    Cursor Introduces Agent Sandboxing

    https://cursor.com/blog/agent-sandboxing

    The News:

    • Cursor has shipped agent sandboxing, letting coding agents run terminal commands automatically inside a controlled environment without requiring per-command human approval.
    • Sandboxed agents interrupt the user 40% less often than unsandboxed ones, as approval requests are limited to actions that require stepping outside the sandbox, most commonly network access.
    • On macOS, Cursor uses the macOS Seatbelt primitive (sandbox-exec), a kernel-level mechanism also used by Chrome and Apple system applications, despite being deprecated since 2016. The sandbox policy generates dynamically at runtime based on workspace settings and .cursorignore files.
    • On Linux, Cursor combines Landlock and seccomp directly: seccomp blocks unsafe syscalls, while Landlock enforces filesystem restrictions and maps workspaces into an overlay filesystem where ignored files are completely inaccessible.
    • On Windows, Cursor runs the Linux sandbox inside WSL2; a native Windows implementation is in progress in collaboration with Microsoft.
    • Cursor updated its internal Shell tool prompts to surface sandbox constraint failures explicitly, so agents can recognize when escalation is required. These changes were validated using an internal benchmark called Cursor Bench. Enterprise customers including NVIDIA are already running on the feature.

    My take: If you are not running Codex, Claude Code or Copilot within a secure container, you probably should. As agents become more advanced, they tend to solve problems by writing software that fixes things, and they also save lots of temporary files to your system temp folder. There is actually a real risk that an AI agent writes a piece of code that malfunctions and accidentally removes data on your system. And since AI agents run with your user permissions, they have the same access as you do to the entire computer.

    I always recommend running AI agents within Docker or devcontainers (if you use VSCode or JetBrains). There is no drawback to doing it; on the contrary, once containerized you can even run them in full YOLO mode with permission to do anything, since it will never affect your computer or local network. That said, it’s positive to see Cursor rolling this out to all users. I just wish they had standardized on Docker instead of sandbox-exec on macOS.

    OpenAI Adds Lockdown Mode and Elevated Risk Labels to ChatGPT

    https://openai.com/index/introducing-lockdown-mode-and-elevated-risk-labels-in-chatgpt

    The News:

    • OpenAI introduced two security features on February 13, 2026 targeting prompt injection attacks, where third parties attempt to manipulate AI systems into following malicious instructions or leaking sensitive data from connected apps and conversations.
    • Lockdown Mode is an optional setting for high-risk users (executives, security teams, journalists) that deterministically disables or restricts features an attacker could exploit; web browsing is limited to cached content with no live network requests leaving OpenAI’s controlled network.
    • Agent Mode, Deep Research, Canvas code network access, and automatic file downloads for analysis are all disabled in Lockdown Mode; users can still upload files manually and use image generation, but ChatGPT cannot include images in responses.
    • Lockdown Mode is available now for ChatGPT Enterprise, Edu, Healthcare, and for Teachers; admins enable it via Workspace Settings by creating a dedicated role, and can configure which apps and actions remain available.
    • “Elevated Risk” labels are a standardized in-product warning for high-risk capabilities across ChatGPT, ChatGPT Atlas, and Codex; in Codex, for instance, the label appears when a developer grants the coding assistant live network access.
    • OpenAI states it will remove the “Elevated Risk” label from specific features as security mitigations mature, and will add the label to new features over time.

    My take: I am not sure this is the right approach by OpenAI. It looks like a quick fix rather than the right solution. As AI agents become more powerful they really should run in isolated environments, because there will always be risks of prompt injection and other exploits. This news suggests to me that OpenAI has only just realized the need; going forward I hope they add native support for truly isolated environments across all their products. As I mentioned in the news item above, if you are using VSCode and GitHub Copilot today you really should enable devcontainers in your project repo.