We kicked off our new weekly series This Week in AI on Monday, and we covered a lot of ground in half an hour, including an AI model that found security holes faster than decades of human auditing, a data center in Utah the size of two Manhattans, and a smart argument for why the harness you build around a model now matters more than which model you pick.
Here are a few takeaways from the conversation between host Eric Freeman, faculty member at UT Austin and a longtime friend of O'Reilly, and guest John Berryman, founder of Arcturus Labs, an early production engineer on GitHub Copilot, and coauthor of O'Reilly's Prompt Engineering for LLMs. Watch the entire episode to find out why you should be building your own agent and why John believes that eventually there will be no internet for humans.
AI's security problem is now a policy problem
You've probably already heard about Mythos. Anthropic's internal testing of the frontier model surfaced thousands of previously unknown security vulnerabilities across major operating systems, browsers, and financial infrastructure, including a 27-year-old bug in OpenBSD. Anthropic chose not to release the model publicly and instead launched Project Glasswing, a limited program giving monitored access to a small group of trusted partners for defensive patching.
That decision moved fast in Washington. In roughly six weeks, the conversation shifted from the light-touch national AI policy released in March to reported White House discussions of an executive order review process modeled on how the FDA handles drugs. Security researcher Bruce Schneier has questioned whether Mythos is uniquely capable here or whether similar results are achievable with cheaper public models, but as Freeman noted (paraphrasing Schneier), either way, it's a problem that's coming.
The compute race is getting stranger
Anthropic leased xAI's entire Colossus 1 supercluster in Memphis: more than 200,000 GPUs and 300 megawatts of power. A month before that deal, Anthropic expanded its agreement with Google and Broadcom for 3.5 gigawatts of capacity coming online in 2027. For context, that's more than 10 times the power of the Colossus 1 deal, in a single contract. After this episode aired, Anthropic announced that the xAI deal has been expanded to include Colossus 2 as well.
Box Elder County, Utah, just approved a 40,000-acre AI data center called the Stratos project, backed by investor and TV personality Kevin O'Leary (a.k.a. Mr. Wonderful). It's planned for 9 gigawatts at full buildout. That's a footprint more than twice the size of Manhattan, powered by the equivalent of nine commercial nuclear reactors. And like many data center deals going forward, including the Colossus deal above, it was approved over local protests.
Infrastructure at this scale takes years to come online, and the companies making these bets are pricing in a world where model capability keeps scaling. Whether that assumption holds will determine a lot about what's economically viable to build in the next decade.
The harness matters more than the model
John was on hand to rethink the agent harness, which, as he pointed out, entered a new phase with the step change in model capability that happened in November and December of last year. He took Eric through the arc of AI product development, from document completion and chat loops to tool-calling agents, DAG-based workflows, and now the harness era represented by tools like Claude Code. Each step added capability, John noted, but also complexity, and each generated a new class of problems around reliability and control. In our current moment, which John has dubbed the "age of the unharnessed agent," agents are within reach of everyone, not just software developers.
The payoff of this "unharnessed" era is control. John described a client engagement where he replaced a bespoke tool with a skills-driven agent. Now domain experts with no development experience can read the agent's behavior, written in plain English, and understand it better. As John explained,
Rather than building a bespoke agent. . ., I just built something that was just the agent harness, the agent, and I just gave it skills that describe basically what I learned in interviewing their experts, how they would work with these agents. And it worked perfectly. Not only does the agent stay on track and do what it needs to do these days, but it's coded, as far as my client is concerned, in English.
The experts don't have to complain to developers "this doesn't work." The experts can look at the English description of what's going on and see problems, and maybe even fix it themselves. And I'm really excited to basically put that power into the hands of the people who know best how to change it, the experts.
That's a different relationship between the experts and the tool than anything a packaged commercial product offers.
As Eric pointed out, recent Stanford research supports this broader point: The performance gap between running a bare model and running it inside a well-designed harness now often matters more than which underlying model you're using. The question that used to dominate buying decisions (which model scores highest?) has been displaced by a harder one: which harness fits the task?
John closed with a demo of his personal agent moving from an Obsidian notebook into Wikipedia and back, carrying context across environments. He used it to illustrate a concept he called the "open agent protocol," his term for a hypothetical standard in which an agent receives environment-specific skills as it moves between contexts. The protocol doesn't exist yet, but the demo made the direction clear.
What's next
Join us and a rotating lineup of expert guests for weekly live tool demos and deeper dives into the topics that matter in AI. We're taking next week off for Memorial Day in the US, but we'll be back on June 1 with host Andreas Welsch and guests Maya Mikhailov and Doug Shannon to cut through another week of AI headlines and separate what actually drives business value from what looks good in a demo but goes nowhere in production. Our first few episodes are free and open to all if you'd like to attend live; register here.
We'll continue to share full episodes and post our takeaways here on Radar every Friday. You can also watch or listen on YouTube, Spotify, Apple, or wherever you get your podcasts.
