
A 100-AV Freeway Deployment – The Berkeley Artificial Intelligence Research Blog



We deployed 100 reinforcement learning (RL)-controlled cars into rush-hour highway traffic to smooth congestion and reduce fuel consumption for everyone. Our goal is to tackle "stop-and-go" waves, those frustrating slowdowns and speedups that usually have no clear cause but lead to congestion and significant energy waste. To train efficient flow-smoothing controllers, we built fast, data-driven simulations that RL agents interact with, learning to maximize energy efficiency while maintaining throughput and operating safely around human drivers.

Overall, a small proportion of well-controlled autonomous vehicles (AVs) is enough to significantly improve traffic flow and fuel efficiency for all drivers on the road. Moreover, the trained controllers are designed to be deployable on most modern vehicles, operating in a decentralized manner and relying on standard radar sensors. In our latest paper, we explore the challenges of deploying RL controllers at large scale, from simulation to the field, during this 100-car experiment.

The challenges of phantom jams



A stop-and-go wave moving backwards through highway traffic.

If you drive, you have surely experienced the frustration of stop-and-go waves, those seemingly inexplicable traffic slowdowns that appear out of nowhere and then suddenly clear up. These waves are often caused by small fluctuations in our driving behavior that get amplified by the flow of traffic. We naturally adjust our speed based on the vehicle in front of us. If the gap opens, we speed up to keep up. If they brake, we also slow down. But because of our nonzero reaction time, we might brake just a bit harder than the vehicle in front. The next driver behind us does the same, and this keeps amplifying. Over time, what started as an insignificant slowdown turns into a full stop further back in traffic. These waves move backward through the traffic stream, leading to significant drops in energy efficiency due to frequent accelerations, accompanied by increased CO2 emissions and accident risk.

And this isn't an isolated phenomenon! These waves are ubiquitous on busy roads when the traffic density exceeds a critical threshold. So how can we tackle this problem? Traditional approaches like ramp metering and variable speed limits attempt to manage traffic flow, but they often require costly infrastructure and centralized coordination. A more scalable approach is to use AVs, which can dynamically adjust their driving behavior in real time. However, simply placing AVs among human drivers isn't enough: they must also drive in a smarter way that makes traffic better for everyone, which is where RL comes in.



Fundamental diagram of traffic flow. The number of cars on the road (density) affects how much traffic is moving forward (flow). At low density, adding more cars increases flow because more vehicles can pass through. But beyond a critical threshold, cars start blocking each other, leading to congestion, where adding more cars actually slows down overall movement.
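As a toy illustration of this density-flow relationship, here is a short sketch assuming a Greenshields-style linear speed-density relation; this is our simplification for illustration only, not the model used in this work:

```python
import numpy as np

# Toy fundamental diagram: assume speed decreases linearly with density
# (Greenshields-style). Parameter values are illustrative assumptions.
v_free = 30.0        # free-flow speed (m/s)
rho_jam = 0.12       # jam density (veh/m)

rho = np.linspace(0, rho_jam, 200)     # vehicle density
v = v_free * (1 - rho / rho_jam)       # speed drops as density rises
q = rho * v                            # flow = density * speed

rho_crit = rho[np.argmax(q)]           # flow peaks at a critical density
print(f"critical density ~ {rho_crit:.3f} veh/m, "
      f"max flow ~ {q.max() * 3600:.0f} veh/h")
```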

Reinforcement learning for wave-smoothing AVs

RL is a powerful control approach where an agent learns to maximize a reward signal through interactions with an environment. The agent collects experience through trial and error, learns from its mistakes, and improves over time. In our case, the environment is a mixed-autonomy traffic scenario, where AVs learn driving strategies to dampen stop-and-go waves and reduce fuel consumption for both themselves and nearby human-driven vehicles.

Training these RL agents requires fast simulations with realistic traffic dynamics that can replicate highway stop-and-go behavior. To achieve this, we leveraged experimental data collected on Interstate 24 (I-24) near Nashville, Tennessee, and used it to build simulations where vehicles replay highway trajectories, creating unstable traffic that AVs driving behind them learn to smooth out.



Simulation replaying a highway trajectory that exhibits several stop-and-go waves.
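The sketch below gives a flavor of how such a replay-style simulation can be stepped. The Intelligent Driver Model follower, its parameters, and the synthetic leader trace are illustrative assumptions rather than our actual simulator; in the real setup the leader replays recorded I-24 trajectories, and during training the RL AV takes the follower's place.

```python
import numpy as np

# Minimal replay-style simulation sketch (illustrative, not the paper's simulator).
def idm_accel(v, v_lead, gap, v0=30.0, T=1.0, a=1.5, b=2.0, s0=2.0):
    # Intelligent Driver Model: acceleration from own speed, leader speed, and gap.
    s_star = s0 + max(0.0, v * T + v * (v - v_lead) / (2 * np.sqrt(a * b)))
    return a * (1 - (v / v0) ** 4 - (s_star / max(gap, 0.1)) ** 2)

dt = 0.1
t = np.arange(0, 120, dt)
lead_speeds = 20 + 8 * np.sin(0.1 * t)   # synthetic stand-in for a recorded stop-and-go trace

x_lead, x, v = 50.0, 0.0, lead_speeds[0]
for v_lead in lead_speeds:
    gap = x_lead - x                      # space gap to the replayed leader
    v = max(0.0, v + idm_accel(v, v_lead, gap) * dt)
    x += v * dt
    x_lead += v_lead * dt
```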

We designed the AVs with deployment in mind, ensuring that they can operate using only basic sensor information about themselves and the vehicle in front. The observations consist of the AV's speed, the speed of the leading vehicle, and the space gap between them. Given these inputs, the RL agent then prescribes either an instantaneous acceleration or a desired speed for the AV. The key advantage of using only these local measurements is that the RL controllers can be deployed on most modern vehicles in a decentralized way, without requiring additional infrastructure.
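A minimal sketch of this observation/action interface is shown below; the function names, acceleration bound, and time step are illustrative assumptions, not the deployed code:

```python
import numpy as np

# Local observation built only from quantities measurable with standard
# radar plus the car's own odometry.
def build_observation(ego_speed, lead_speed, space_gap):
    return np.array([ego_speed, lead_speed, space_gap], dtype=np.float32)

def apply_action(obs, policy, dt=0.1, max_accel=1.5):
    # The policy outputs either an instantaneous acceleration or a desired
    # speed; here we assume an acceleration command, clipped for comfort.
    accel = float(np.clip(policy(obs), -max_accel, max_accel))
    return max(0.0, obs[0] + accel * dt)   # ego speed after one control step
```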

Reward design

The most challenging part is designing a reward function that, when maximized, aligns with the different objectives that we want the AVs to achieve:

  • Wave smoothing: Reduce stop-and-go oscillations.
  • Energy efficiency: Minimize fuel consumption for all vehicles, not just AVs.
  • Safety: Ensure reasonable following distances and avoid abrupt braking.
  • Driving comfort: Avoid aggressive accelerations and decelerations.
  • Adherence to human driving norms: Ensure a "normal" driving behavior that doesn't make surrounding drivers uncomfortable.

Balancing these objectives together is difficult, as suitable coefficients for each term need to be found. For instance, if minimizing fuel consumption dominates the reward, RL AVs learn to come to a stop in the middle of the highway because that is energy optimal. To prevent this, we introduced dynamic minimum and maximum gap thresholds to ensure safe and reasonable behavior while optimizing fuel efficiency. We also penalized the fuel consumption of human-driven vehicles behind the AV to discourage it from learning a selfish behavior that optimizes energy savings for the AV at the expense of surrounding traffic. Overall, we aim to strike a balance between energy savings and a reasonable, safe driving behavior.
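The sketch below shows the general shape of such a multi-objective reward; the specific terms, coefficients, and penalty values are illustrative assumptions rather than our exact reward function:

```python
# Illustrative multi-objective reward sketch (coefficients are assumptions).
def reward(av_fuel, follower_fuel, accel, gap, gap_min, gap_max,
           w_fuel=1.0, w_follower=1.0, w_comfort=0.1, penalty=10.0):
    r = -w_fuel * av_fuel                 # AV's own fuel consumption
    r -= w_follower * follower_fuel       # fuel of human drivers behind the AV
    r -= w_comfort * accel ** 2           # discourage harsh accelerations
    if gap < gap_min or gap > gap_max:    # dynamic gap thresholds keep the AV
        r -= penalty                      # from stopping or opening huge gaps
    return r
```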

Simulation results



Illustration of the dynamic minimum and maximum gap thresholds, within which the AV can operate freely to smooth traffic as efficiently as possible.

The typical behavior learned by the AVs is to maintain slightly larger gaps than human drivers, allowing them to absorb upcoming, possibly abrupt, traffic slowdowns more effectively. In simulation, this approach resulted in significant fuel savings of up to 20% across all road users in the most congested scenarios, with fewer than 5% of AVs on the road. And these AVs don't have to be special vehicles! They can simply be standard consumer cars equipped with a smart adaptive cruise control (ACC), which is what we tested at scale.



Smoothing behavior of RL AVs. Red: a human trajectory from the dataset. Blue: successive AVs in the platoon, where AV 1 is the closest behind the human trajectory. There are typically between 20 and 25 human vehicles between AVs. Each AV doesn't slow down as much or accelerate as fast as its leader, leading to decreasing wave amplitude over time and thus energy savings.

100 AV field test: deploying RL at scale


Our 100 cars parked at our operational center during the experiment week.

Given the promising simulation results, the natural next step was to bridge the gap from simulation to the highway. We took the trained RL controllers and deployed them on 100 vehicles on I-24 during peak traffic hours over several days. This large-scale experiment, which we called the MegaVanderTest, is the largest mixed-autonomy traffic-smoothing experiment ever conducted.

Before deploying RL controllers in the field, we trained and evaluated them extensively in simulation and validated them on hardware. Overall, the steps towards deployment involved:

  • Training in data-driven simulations: We used highway traffic data from I-24 to create a training environment with realistic wave dynamics, then validated the trained agent's performance and robustness in a variety of new traffic scenarios.
  • Deployment on hardware: Once validated in robotics software, the trained controller is uploaded onto the car and is able to control the set speed of the vehicle. We operate through the vehicle's on-board cruise control, which acts as a lower-level safety controller.
  • Modular control framework: One key challenge during the test was not having access to leading-vehicle information sensors. To overcome this, the RL controller was integrated into a hierarchical system, the MegaController, which combines a speed planner guide that accounts for downstream traffic conditions with the RL controller as the final decision maker (see the sketch after this list).
  • Validation on hardware: The RL agents were designed to operate in an environment where most vehicles are human-driven, requiring robust policies that adapt to unpredictable behavior. We verified this by driving the RL-controlled vehicles on the road under careful human supervision, making changes to the control based on feedback.
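
A rough sketch of the hierarchical structure is shown below; the class and method names, and the idea of passing the planner's target speed to the policy as an extra input, are illustrative assumptions rather than the deployed code:

```python
# Illustrative sketch of the hierarchical "MegaController" idea: a speed
# planner provides downstream guidance, and the RL controller makes the
# final call on the commanded set speed.
class MegaController:
    def __init__(self, speed_planner, rl_policy):
        self.speed_planner = speed_planner
        self.rl_policy = rl_policy

    def step(self, ego_speed, lead_speed, space_gap):
        # Guidance computed from downstream traffic conditions
        # (e.g., a target speed for the current road segment).
        target_speed = self.speed_planner.target_speed()
        obs = [ego_speed, lead_speed, space_gap, target_speed]
        # The RL policy outputs the set speed actually sent to the ACC.
        return self.rl_policy.act(obs)
```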

Each of the 100 cars is connected to a Raspberry Pi, on which the RL controller (a small neural network) is deployed.

The RL controller directly controls the onboard adaptive cruise control (ACC) system, setting its speed and desired following distance.
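A simplified sketch of the kind of control loop that runs on each car is shown below; read_radar and set_acc are hypothetical stand-ins for the vehicle interface, and the loop timing and gap setting are assumptions:

```python
import time

def read_radar():
    # Hypothetical placeholder for on-board sensing:
    # returns ego speed, lead speed, and space gap.
    return 25.0, 24.0, 40.0

def set_acc(set_speed, gap_setting):
    # Hypothetical placeholder: forward the command to the car's ACC,
    # which remains the low-level safety controller.
    pass

def control_loop(policy, dt=0.1):
    while True:
        ego_speed, lead_speed, space_gap = read_radar()
        set_speed = policy.act([ego_speed, lead_speed, space_gap])
        set_acc(set_speed, gap_setting=2)
        time.sleep(dt)
```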

Once validated, the RL controllers were deployed on 100 cars and driven on I-24 during morning rush hour. Surrounding traffic was unaware of the experiment, ensuring unbiased driver behavior. Data was collected during the experiment from dozens of overhead cameras placed along the highway, and millions of individual vehicle trajectories were extracted through a computer vision pipeline. Metrics computed on these trajectories indicate a trend of reduced fuel consumption around AVs, as expected from simulation results and previous smaller validation deployments. For instance, we can observe that the closer people drive behind our AVs, the less fuel they appear to consume on average (computed using a calibrated energy model):



Average fuel consumption as a function of distance behind the nearest engaged RL-controlled AV in the downstream traffic. As human drivers get further away behind AVs, their average fuel consumption increases.

Another way to measure the impact is to look at the variance of speeds and accelerations: the lower the variance, the smaller the wave amplitude should be, which is what we observe in the field test data. Overall, although getting precise measurements from a large amount of camera video data is complicated, we observe a trend of 15 to 20% energy savings around our controlled cars.
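For illustration, the sketch below shows how such aggregate metrics could be computed from extracted trajectories (average modeled fuel rate binned by distance behind the nearest AV, and speed/acceleration variance as a wave-amplitude proxy); the function names and binning are our assumptions:

```python
import numpy as np

def fuel_vs_distance(distance_behind_av, fuel_rate, bin_edges):
    # Average fuel rate of vehicles grouped by how far they drive
    # behind the nearest engaged AV.
    bins = np.digitize(distance_behind_av, bin_edges)
    return [fuel_rate[bins == i].mean() for i in range(1, len(bin_edges))]

def wave_amplitude_proxy(speeds, accels):
    # Lower variance -> smaller stop-and-go oscillations.
    return np.var(speeds), np.var(accels)
```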



Data points from all vehicles on the highway over a single day of the experiment, plotted in speed-acceleration space. The cluster to the left of the red line represents congestion, while the one on the right corresponds to free flow. We observe that the congestion cluster is smaller when AVs are present, as measured by computing the area of a soft convex envelope or by fitting a Gaussian kernel.
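As a rough sketch of the cluster-size measures mentioned in the caption, one could compute them as below; we use a plain convex hull (not the soft envelope), approximate the Gaussian fit by the sample covariance, and assume a speed threshold to isolate the congested cluster:

```python
import numpy as np
from scipy.spatial import ConvexHull

def congestion_cluster_size(speeds, accels, speed_threshold=15.0):
    pts = np.column_stack([speeds, accels])
    cluster = pts[pts[:, 0] < speed_threshold]      # low-speed (congested) points
    hull_area = ConvexHull(cluster).volume          # in 2-D, .volume is the hull area
    spread = np.sqrt(np.linalg.det(np.cov(cluster.T)))  # generalized std of a Gaussian fit
    return hull_area, spread
```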

Final thoughts

The 100-car field operational test was decentralized, with no explicit cooperation or communication between AVs, reflective of current autonomy deployment, and bringing us one step closer to smoother, more energy-efficient highways. Yet, there is still vast potential for improvement. Scaling up simulations to be faster and more accurate, with better human-driving models, is crucial for bridging the simulation-to-reality gap. Equipping AVs with additional traffic data, whether through advanced sensors or centralized planning, could further improve the performance of the controllers. For instance, while multi-agent RL is promising for improving cooperative control strategies, it remains an open question how enabling explicit communication between AVs over 5G networks could improve stability and further mitigate stop-and-go waves. Crucially, our controllers integrate seamlessly with existing adaptive cruise control (ACC) systems, making field deployment feasible at scale. The more vehicles equipped with smart traffic-smoothing control, the fewer waves we will see on our roads, meaning less pollution and fuel savings for everyone!


Many contributors took part in making the MegaVanderTest happen! The full list is available on the CIRCLES project page, along with more details about the project.

Read more: [paper]
