
Google AI and Evolutionary Strategies

Exploring evolutionary strategies and model-agnostic meta-learning in robotics

If you are confused by the headline, do not be alarmed.

We are not in Ex Machina territory anytime soon.

Rather, this piece summarises an article on how robots can begin to transfer the behaviour they learn in simulation into appropriate actions in the real world.

Following the Google AI blog is usually fascinating, but sometimes you come across an article of particular interest. This one is called Exploring Evolutionary Meta-Learning in Robotics, written by Xingyou (Richard) Song, Software Engineer, and Yuxiang Yang, AI Resident in Robotics at Google.

I always recommend reading the blog article in full. Still, I will attempt to summarise some of the information, partly to learn more about the subject matter myself.

Their article deals with what they describe as:

“…moving trained policies from “sim-to-real” remains one of the greatest challenges of modern robotics, due to the subtle differences encountered between the simulation and real domains, termed the “reality gap”.”

They discuss the rapid development of increasingly accurate simulator engines.

These matter because robot policies have to be trained somewhere.

Real-world deployment can be very costly, so any gains made in simulation are incredibly important.

They mention three existing approaches:

  1. Imitation learning
  2. Offline reinforcement learning
  3. Domain randomisation

They mention that the last of these can sacrifice performance for stability.

This is because it optimises for policies that remain stable across all randomised variants of a task, rather than for the best possible performance on any specific task.

When policies meet the real world, behaviour learned in simulation does not always work as intended.

They give one example of surface type:

“Surface type will determine the optimal policy — for an incoming flat surface encountered in simulation, the robot could accelerate to a higher speed, while for an incoming rugged and bumpy surface encountered in the real world, it should walk slowly and carefully to prevent falling.”

They build on their previous work in “Rapidly Adaptable Legged Robots via Evolutionary Meta-Learning”.

So what are evolutionary strategies (ES)?

They refer to an article by OpenAI on evolutionary strategies (ES); OpenAI in turn refers back to the neuroevolution literature.

The OpenAI article describes how this family of algorithms differs from standard reinforcement learning.

They illustrate reinforcement learning with the following example:

Above: In the game of Pong, the policy could take the pixels of the screen and compute the probability of moving the player’s paddle (in green, on the right) Up, Down, or neither. (Image by OpenAI)

They state in clear text that ES is not related to biological evolution (which is surely worth noting), although early versions were likely inspired by it.

“On “Evolution”. Before we dive into the ES approach, it is important to note that despite the word “evolution”, ES has very little to do with biological evolution. Early versions of these techniques may have been inspired by biological evolution and the approach can, on an abstract level, be seen as sampling a population of individuals and allowing the successful individuals to dictate the distribution of future generations. However, the mathematical details are so heavily abstracted away from biological evolution that it is best to think of ES as simply a class of black-box stochastic optimization techniques.”

So ES is a class of black-box stochastic optimisation techniques?

Sounds like another explanation is needed.

“In ES, we forget entirely that there is an agent, an environment, that there are neural networks involved, or that interactions take place over time, etc.”

The optimisation is a “guess and check” process: start with some random parameters, repeatedly tweak the guess a bit at random, and move the guess slightly towards whatever tweaks worked better.
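That guess-and-check loop can be sketched in a few lines. This is a toy version under invented settings (a simple quadratic stand-in for the black-box reward, made-up hyperparameters), not Google’s or OpenAI’s implementation: sample random perturbations of the parameters, score each perturbed guess with the reward function, and step towards the perturbations that scored best.

```python
import numpy as np

def evolution_strategies(reward_fn, theta, sigma=0.1, lr=0.02,
                         population=50, iterations=300, seed=0):
    """Toy ES 'guess and check': sample random tweaks of the parameters,
    score each tweak with the black-box reward, then move the guess
    towards the tweaks that scored best."""
    rng = np.random.default_rng(seed)
    for _ in range(iterations):
        noise = rng.standard_normal((population, theta.size))           # guesses
        rewards = np.array([reward_fn(theta + sigma * n) for n in noise])  # checks
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        theta = theta + lr / (population * sigma) * noise.T @ rewards
    return theta

# Example: maximise a black-box reward whose (hidden) peak is at [3, -1].
target = np.array([3.0, -1.0])
reward = lambda th: -np.sum((th - target) ** 2)
solution = evolution_strategies(reward, theta=np.zeros(2))
```

Note that the loop only ever sees total rewards, never gradients of the function being optimised, which is what makes ES “black-box”.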

Google AI compares ES with standard policy gradients for adapting meta-policies.

They argue that these: “…do not allow sim-to-real adaptation, ES enables a robot to quickly overcome the reality gap and adapt to dynamic changes in the real world, some of which may not be encountered in simulation.”

Their algorithm quickly: “adapts a legged robot’s policy to dynamics changes. In this example, the battery voltage dropped from 16.8V to 10V which reduced motor power, and a 500g mass was also placed on the robot’s side, causing it to turn rather than walk straight. The policy is able to adapt in only 50 episodes (or 150s of real-world data).”

All this mention of meta-learning might sound confusing, so what is meta-learning?

They describe the meta-learning technique as follows:

“at a high level, meta-learning learns to solve an incoming task quickly without completely retraining from scratch, by combining past experiences with small amounts of experience from the incoming task.”

Most of the past experience comes cheaply from simulation, while a minimal yet necessary amount of experience is generated on the real-world task.

According to the article this allows the policy to fine-tune specifically to the real-world task at hand.
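To make that split concrete, here is a toy sketch, not the authors’ method: simulated “tasks” are stood in for by quadratic rewards, meta-training is crudely approximated by averaging task optima, and a short fine-tuning loop plays the role of the few real-world episodes. Every name and number is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a task: a quadratic reward whose optimum varies per task.
def task_reward(theta, optimum):
    return -np.sum((theta - optimum) ** 2)

# 1) Cheap simulated experience: many sampled tasks shape the meta-policy
#    (here crudely, by averaging task optima as a stand-in for meta-training).
sim_optima = rng.normal(0.0, 1.0, size=(1000, 2))
meta_theta = sim_optima.mean(axis=0)

# 2) Small amount of real-world experience: a short fine-tuning run adapts
#    the meta-policy to the one specific task actually encountered.
real_optimum = np.array([0.8, -0.3])
theta = meta_theta.copy()
for _ in range(50):                       # a small budget of "real" episodes
    grad = -2 * (theta - real_optimum)    # reward gradient of the toy task
    theta = theta + 0.05 * grad
```

The point of the sketch is the budget asymmetry: a thousand cheap simulated tasks versus fifty adaptation steps on the task that actually matters.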

They mention that policy gradient methods are the standard approach.

Policy gradient: seeks to increase the likelihood of selecting actions that previously led to high rewards in a given state.

To sample actions by likelihood in this way, the policy must be stochastic.

Stochastic: refers to a randomly determined process.
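To make “stochastic policy” concrete, here is a minimal sketch echoing the Pong example; the linear policy, weights, and state features are all invented for illustration. The policy maps a state to a probability per action and then samples, so the same state can produce different actions on different calls.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

ACTIONS = ["up", "down", "stay"]   # Pong-style paddle actions

def stochastic_policy(state, weights, rng):
    """Map state -> action probabilities, then *sample* an action."""
    probs = softmax(weights @ state)
    action = rng.choice(len(ACTIONS), p=probs)
    return action, probs

rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 4))        # 3 actions, 4 state features
state = np.array([0.5, -0.2, 1.0, 0.1])
action, probs = stochastic_policy(state, weights, rng)
```

It is this per-action sampling that policy gradients rely on, and that adds extra randomness on top of an already random environment.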

The real world is itself, to some degree, highly random.

There is much that can happen that we do not predict.

They mention two conflicting objectives.

“The combination of using a stochastic policy inside a stochastic environment creates two conflicting objectives:

  1. Decreasing the policy’s stochasticity may be crucial, as otherwise the high-noise problem might be exacerbated by the additional randomness from the policy’s actions.
  2. However, increasing the policy’s stochasticity may also benefit exploration, as the policy needs to use random actions to probe the type of environment to which it adapts.”

They attempt to solve these challenges by applying model-agnostic meta-learning (MAML), which searches for a meta-policy that can adapt to a specific task quickly using small amounts of task-specific data, but implemented via ES rather than policy gradients.

“ES-MAML, an algorithm that leverages a drastically different paradigm for high-dimensional optimisation — evolutionary strategies.”

That is Evolutionary Strategies — Model Agnostic Meta-Learning.

This approach updates the policy based solely on:

“…the sum of rewards collected by the agent in the environment. […] This allows the use of deterministic policies and exploration based on parameter changes and avoids the conflict between stochasticity in the policy and in the environment.”
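One minimal way to picture that, as an invented sketch rather than the paper’s algorithm: the policy below is deterministic (the same state always produces the same action), and exploration instead comes from Gaussian perturbations of its parameters, exactly the kind of parameter-space randomness the quote describes.

```python
import numpy as np

def deterministic_policy(state, weights):
    """Same state, same weights -> always the same action (no sampling)."""
    return int(np.argmax(weights @ state))

rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 4))        # 3 actions, 4 state features (toy sizes)
state = np.array([0.5, -0.2, 1.0, 0.1])

# ES-style exploration: the randomness lives in the *parameters*, not the actions.
sigma = 1.0
explored = [
    deterministic_policy(state, weights + sigma * rng.standard_normal(weights.shape))
    for _ in range(100)
]
```

Because action selection itself is deterministic, the environment’s noise is the only noise in each rollout, which is how this avoids the conflict between policy and environment stochasticity.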

Why is this important?


“This flexibility is critical for efficient adaptation of locomotion meta-policies. Our results show that adaptation with ES can be conducted with a small number of additional on-robot episodes. Thus, ES is no longer just an attractive alternative to the state-of-the-art algorithms, but defines a new state of the art for several challenging RL tasks.”

To go more in-depth, check out their actual blog post, which covers far more of the details.

Although you might want to watch their accompanying video first.

This is #500daysofAI and you are reading article 324. I am writing one new article about or related to artificial intelligence every day for 500 days. My focus for day 300–400 is about AI, hardware and the climate crisis.


