The following article originally appeared on the Asimov’s Addendum Substack and is republished here with the author’s permission.
Are LLMs reliable?
LLMs have built up a reputation for being unreliable.1 Small changes in the input can lead to large changes in the output. The same prompt run twice can give different or contradictory answers. Models often struggle to stick to a specified format unless the prompt is worded just right. And it’s hard to tell whether a model is confident in its answer or whether it could just as easily have gone the other way.
It’s easy to blame the model for all of these reliability failures. But the API endpoint and surrounding tooling matter too. Model providers limit the kinds of interactions developers can have with a model, as well as the outputs the model can provide, by restricting what their APIs expose to developers and third-party companies. Things like the full chain-of-thought and the logprobs (the probabilities of all possible choices for the next token) are hidden from developers, while advanced tools for ensuring reliability, like constrained decoding and prefilling, are not made available. All of these features are readily available with open weight models and are inherent to the way LLMs work.
Every decision model providers make about which tools and outputs to offer developers through their API is not just an architectural choice but also a policy decision. Model providers directly determine what level of control and reliability developers have access to. This has implications for what apps can be built, how reliable a system is in practice, and how well a developer can steer outcomes.
The artificial limits on input
Modern LLMs are usually built around chat templates. Every input and output, excluding tool calls and system or developer messages, is filtered through a conversation between a user and an assistant: instructions are given as user messages; responses are returned as assistant messages. This becomes especially evident when looking at how modern LLM APIs work. The completions API, an endpoint originally introduced by OpenAI and widely adopted across the industry (including by a number of open model providers like OpenRouter and Together AI), takes input in the form of user and assistant messages and outputs the next message.2
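As a sketch, a request to a chat-style completions endpoint looks roughly like the following. The field names follow the widely used OpenAI-compatible shape; the model name is a placeholder, and exact details vary by provider:

```python
import json

# A minimal chat-completions-style payload: the conversation is a list of
# role-tagged messages, and the API returns the next "assistant" message.
payload = {
    "model": "example-model",  # placeholder, not a real model name
    "messages": [
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Summarize Hamlet in one sentence."},
    ],
}

# The request body is sent as JSON; everything the developer controls
# has to fit into this message list.
body = json.dumps(payload)
```

Everything outside this message list — how the template is rendered, what the assistant turn may begin with, which tokens are considered — stays on the provider’s side of the boundary.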
The focus on a chat interface in an API has its benefits. It makes it easy for developers to reason about input and output as completely separate. But chat APIs do more than just use a chat template under the hood; they actively limit what third-party developers can control.
When interacting with LLMs through an API, the boundary between input and output is often a firm one. A developer sets previous messages, but they usually can’t prefill a model’s response, meaning they can’t force a model to begin a response with a certain sentence or paragraph.3 This has real-world implications for people building with LLMs. Without the ability to prefill, it becomes much harder to control the preamble. If you know the model needs to start its answer in a certain way, it’s inefficient and risky not to enforce it at the token level.4 And the limitations extend beyond just the start of a response. Without the ability to prefill answers, you also lose the ability to partially regenerate answers when only part of the answer is wrong.5
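With a local open weight model, prefilling is just a matter of how the prompt string is rendered before generation. A minimal sketch, assuming a ChatML-style template (the `<|im_start|>`/`<|im_end|>` tags are illustrative; real templates differ per model): render the conversation, open the assistant turn, and append the forced prefix without closing the turn, so the model’s first generated tokens continue directly from it.

```python
def render_with_prefill(messages, prefill):
    """Render a ChatML-style prompt, leaving the assistant turn open
    so generation continues from the prefilled text."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Open the assistant turn but do NOT close it: whatever the model
    # generates next is a continuation of `prefill`.
    parts.append(f"<|im_start|>assistant\n{prefill}")
    return "".join(parts)

prompt = render_with_prefill(
    [{"role": "user", "content": "List three prime numbers."}],
    prefill="Sure. The primes are: ",
)
```

The same trick supports partial regeneration: truncate a previous answer at the last good sentence, pass the truncated text as the prefill, and only the remainder is resampled.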
Another deficiency that is particularly visible is how the model’s chain-of-thought reasoning is handled. Most large AI companies have made a habit of hiding models’ reasoning tokens from the user (showing only summaries), reportedly to guard against distillation and to let the model reason uncensored (for AI safety reasons). This has second-order effects, one of which is the strict separation of reasoning from messages. None of the major model providers let you prefill or write your own reasoning tokens. Instead you have to rely on the model’s own reasoning and can’t reuse reasoning traces to regenerate the same message.
There are legitimate reasons for not allowing prefilling. It could be argued that allowing prefilling greatly increases the attack surface for prompt injections. One study found that prefill attacks work very well against even state-of-the-art open weight models. But in practice, the model is not the only line of defense against attackers. Many companies already run prompts through classification models to detect prompt injections, and the same kind of safeguard could also be used against prefill attack attempts.
Output with few controls
Prefilling is not the only casualty of a clean separation between input and output. Even within a message, there are levers available with a local open weight model that simply aren’t possible through a standard API. This matters because these controls let developers preemptively validate outputs and ensure that responses follow a certain structure, both reducing variability and improving reliability. For example, most LLM APIs support something they call structured output, a mode that forces the model to generate output in a given JSON format; however, structured output doesn’t inherently have to be limited to JSON.6 The same technique, constrained decoding (restricting which tokens the model may produce at any point), could be used for much more than that. It could be used to generate XML, have the model fill in blanks Mad Libs-style, force the model to write a story without using certain letters, or even enforce valid chess moves at inference time. It’s a powerful feature that lets developers precisely define what output is acceptable and what isn’t, guaranteeing output that meets the developer’s parameters.
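The core mechanism is simple to sketch. At each step the model produces scores (logits) over the whole vocabulary; a constraint function masks out any token that would violate the format before one is chosen. The tiny "model" below is a stand-in that returns fixed scores over a toy vocabulary, but the masking loop is the same shape a real grammar-constrained decoder runs:

```python
import math

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "end"]

def fake_logits(prefix):
    # Stand-in for a real model's forward pass: fixed scores
    # that favor common words.
    base = {"the": 2.0, "cat": 1.5, "sat": 1.2, "on": 1.0,
            "a": 0.8, "mat": 0.6, "end": 0.1}
    return [base[t] for t in VOCAB]

def constrained_decode(allowed, max_tokens=5):
    """Greedy decoding where `allowed(prefix, token)` vetoes tokens."""
    out = []
    for _ in range(max_tokens):
        logits = fake_logits(out)
        # Mask: any token the constraint rejects gets score -inf,
        # i.e. probability zero after softmax.
        masked = [(s if allowed(out, t) else -math.inf)
                  for s, t in zip(logits, VOCAB)]
        if all(m == -math.inf for m in masked):
            break  # no legal continuation under the constraint
        out.append(VOCAB[masked.index(max(masked))])
    return out

# Example constraint: never emit a word containing the letter "e".
result = constrained_decode(lambda prefix, tok: "e" not in tok)
```

Swapping in a different `allowed` function gives JSON, XML, a formal grammar, or a chess-move validator; the point is that the check happens before sampling, so an invalid output is impossible rather than merely unlikely.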
The reason for this is likely that LLM APIs are built for a wide range of developers, most of whom use the model for simple chat-related purposes. APIs weren’t designed to give developers full control over output because not everyone needs or wants that complexity. But that’s not an argument against including these features; it’s only an argument for multiple endpoints. Many companies already have multiple supported endpoints: OpenAI has the “completions” and “responses” APIs, while Google has the “generate content” and “interactions” APIs. It’s not infeasible for them to make a third, more advanced endpoint.
A lack of visibility
Even the model output that third-party developers do get through the model’s API is often a watered-down version of what the model produces. LLMs don’t just emit a single next token; at every step they produce log probabilities over the entire vocabulary. When using an API, however, Google only provides the top 20 most likely logprobs. OpenAI no longer provides any logprobs for GPT-5 models, while Anthropic has never offered any at all. This has real-world consequences for reliability. Log probabilities are one of the most useful signals a developer has for understanding model confidence. When a model assigns nearly equal probability to competing tokens, that uncertainty is itself meaningful information. And even for those companies that do provide the top 20 tokens, that’s often not enough to cover larger classification tasks.
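As a sketch of why this matters: given top-k logprobs for a classification step, a developer can flag low-confidence answers by checking the probability gap between the two most likely options. The numbers below are made up for illustration:

```python
import math

def confidence_margin(top_logprobs):
    """Probability gap between the best and runner-up tokens.
    `top_logprobs` maps token -> log probability (natural log)."""
    probs = sorted((math.exp(lp) for lp in top_logprobs.values()),
                   reverse=True)
    return probs[0] - probs[1]

# Hypothetical top logprobs for a yes/no classification step.
confident = {"yes": -0.05, "no": -3.2, "maybe": -5.0}
uncertain = {"yes": -0.75, "no": -0.80, "maybe": -3.0}

# A near-zero margin means the model could just as easily have gone
# the other way -- exactly the signal a logprob-free API discards.
needs_review = confidence_margin(uncertain) < 0.2
```

Without logprobs, both of these answers come back as a bare "yes", and the developer has no way to route the second one to human review.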
When it comes to reasoning tokens, even less output information is provided. Leading providers such as Anthropic,7 Google, and OpenAI8 only provide summarized thinking for their proprietary models. And OpenAI only offers even that when a valid government ID is supplied to OpenAI. This not only takes away the user’s ability to truly inspect how a model arrived at a certain answer, it also limits the developer’s ability to diagnose why a query failed. When a model gives a wrong answer, a full reasoning trace tells you whether it misunderstood the question, made a faulty logical step, or simply got unlucky at the final token. A summary obscures some of that, providing only an approximation of what actually happened. This isn’t an issue with the model (the model is still producing its full reasoning trace); it’s an issue with what information is provided to the end developer.
The case for not including logprobs and reasoning tokens is similar. The risk of distillation increases with the amount of information the API returns. It’s hard to distill on tokens you cannot see, and without logprobs, distillation takes longer and each example provides less information.9 And this risk is something AI companies need to weigh carefully, since distillation is a powerful technique for imitating the abilities of strong models at a reasonable cost. But there are also risks in not providing this information to users. DeepSeek R1, despite being deemed a national security risk by many, still shot straight to the top of US app stores upon release and is used by many researchers and scientists, largely because of its openness. And in a world where open models are getting more and more powerful, not giving developers proper access to a model’s outputs could mean losing them to cheaper and more open alternatives.
Reliability requires control and visibility
The reliability problems of current LLMs don’t stem solely from the models themselves but also from the tooling that providers give developers. With local open weight models it’s usually possible to trade complexity for reliability. The full reasoning trace is always available and logprobs are fully transparent, allowing the developer to examine how an answer was arrived at. User and AI messages can be edited or generated at the developer’s discretion, and constrained decoding can be used to produce text that follows any arbitrary format. For closed weight models, this is becoming less and less the case. The decisions made about which features to restrict in APIs hurt developers and ultimately end users.
LLMs are increasingly being used in high-stakes situations such as medicine or law, and developers need tools to handle that risk responsibly. There are few technical barriers to providing more control and visibility. Many of the highest-impact improvements, such as exposing thinking output, allowing prefilling, or returning logprobs, cost almost nothing, but would be a major step toward making LLMs more controllable, consistent, and reliable.
There is a place for a clean and simple API, and there is some merit to concerns about distillation, but this shouldn’t be used as an excuse to take away essential tools for diagnosing and fixing reliability problems. When models get used in high-stakes situations, as they increasingly are, failure to take reliability seriously is an AI safety concern.
Specifically, to take reliability seriously, model providers should improve their APIs by offering features that give developers more visibility into and control over output. Reasoning should be provided in full at all times, with any safety violations handled the same way they would have been handled in the final answer. Model providers should resume providing at least the top 20 logprobs, over the full output (reasoning included), so that developers have some visibility into how confident the model is in its answer. Constrained decoding should be extended beyond JSON and should support arbitrary grammars via something like regex or formal grammars.10 Developers should be granted full control over “assistant” output: they should be able to prefill model answers, stop responses mid-generation, and branch them at will. Even if not all of these features make sense on the standard API, nothing is stopping model providers from creating a new, more advanced API. They’ve done it before. The decision to withhold these features is a policy choice, not a technical limitation.
Improving intelligence is not the only way to improve reliability and control, but it is usually the only lever that gets pulled.
