As AI technologies advance, helpful agents will become capable of better anticipating user needs. For experiences on mobile devices to be truly helpful, the underlying models need to understand what the user is doing (or trying to do) when users interact with them. Once current and previous tasks are understood, the model has more context to predict likely next actions. For example, if a user previously searched for music festivals across Europe and is now booking a flight to London, the agent could offer to find festivals in London on those specific dates.
Large multimodal LLMs are already quite good at understanding user intent from a user interface (UI) trajectory. But using LLMs for this task would typically require sending information to a server, which can be slow, costly, and carries the potential risk of exposing sensitive information.
Our recent paper “Small Models, Big Results: Achieving Superior Intent Extraction Through Decomposition”, presented at EMNLP 2025, addresses the question of how to use small multimodal LLMs (MLLMs) to understand sequences of user interactions on the web and on mobile devices entirely on device. By separating user intent understanding into two stages, first summarizing each screen individually and then extracting an intent from the sequence of generated summaries, we make the task more tractable for small models. We also formalize metrics for evaluating model performance and show that our approach yields results comparable to much larger models, illustrating its potential for on-device applications. This work builds on earlier work from our team on user intent understanding.
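The two-stage decomposition can be sketched as a simple pipeline. This is a minimal illustration, not the paper's implementation: `summarize_screen` and `extract_intent` are hypothetical stand-ins for prompts to a small on-device MLLM, and the string outputs are placeholders for the model's responses.

```python
def summarize_screen(screenshot: str) -> str:
    """Stage 1: summarize a single UI screen in isolation.

    In practice this would prompt a small multimodal LLM with one
    screenshot, keeping the per-call input small.
    """
    return f"summary of {screenshot}"


def extract_intent(summaries: list[str]) -> str:
    """Stage 2: infer the user's intent from the ordered summaries.

    The model reasons over short text summaries instead of a long
    sequence of raw screenshots, which is far more tractable for a
    small model.
    """
    return "intent inferred from: " + "; ".join(summaries)


def understand_trajectory(screens: list[str]) -> str:
    # Decompose the task: summarize each screen independently,
    # then extract a single intent from the summary sequence.
    summaries = [summarize_screen(s) for s in screens]
    return extract_intent(summaries)
```

Because stage 1 operates on one screen at a time, each model call stays small, and stage 2 only ever sees compact text, which is what makes the whole pipeline feasible on device.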
