## (and what will make you build your first AI)

In late 2017, Google's DeepMind introduced AlphaZero, an AI system that taught itself from scratch how to master the games of chess, Go, and shogi in a matter of hours.

That short training time was enough for AlphaZero to beat world-champion chess programs.


More recently, OpenAI demonstrated that Reinforcement Learning isn't just a tool for virtual tasks: Dactyl, its human-like robot hand, learned to solve a Rubik's Cube on its own.

Google's AlphaZero and OpenAI's Dactyl are driven by **Reinforcement Learning** algorithms that are given no domain knowledge beyond the rules of the task. Some AI experts believe this approach is the most viable path toward human-level or superhuman Artificial General Intelligence (AGI).

In our previous article, we introduced the building blocks of Reinforcement Learning.

**Don't Ever Ignore Reinforcement Learning Again.** Supervised or unsupervised learning is not everything. Everyone knows that. Get started with OpenAI Gym. (towardsdatascience.com)

Now we will dive deeper into the mechanisms AI agents use to teach themselves the right sequence of actions for achieving a globally rewarding objective.

Let's consider OpenAI Gym's Frozen Lake, a simple environment where the agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, while others cause the agent to fall into the water. The agent is rewarded for finding a walkable path to a goal tile.

Even for such a fairly simple environment, there is a variety of possible policies. The agent can, for example, always move forward, choose an action at random, try to go around obstacles by checking whether the previous forward action failed, or even spin around in circles just for fun.

The intuitive **definition of a policy** is that it is a set of rules that controls the agent's behavior. Different policies can give us different returns, which makes finding a good policy important.

Formally, policy is defined as the probability distribution over actions for every possible state:
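$$\pi(a \mid s) = P\left[A_t = a \mid S_t = s\right]$$

where $\pi(a \mid s)$ is the probability that the agent picks action $a$ when in state $s$.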

An optimal policy 𝛑* is one that maximizes the expected value function *V*:
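$$\pi^{*} = \underset{\pi}{\arg\max}\; V^{\pi}(s)$$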

The **value function** *V(s)* is the expected long-term, discounted return from state *s*, as opposed to the short-term reward. It represents how good a state is for an agent to be in: it equals the expected total reward for an agent starting from that state, accumulated over the one-step rewards of the actions the agent takes.
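In standard notation, for an agent following policy π:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \;\middle|\; s_0 = s\right]$$

where γ ∈ [0, 1] is the discount factor.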

The value function depends on the policy by which the agent picks actions to perform. Learning the optimal policy requires us to use the so-called **Bellman equation**.

Let’s introduce the Bellman equation intuitively by considering the example below. The agent can perform either the action *1, 2, …* or *N*. This would bring the agent to a future state *S1, S2, …* or *SN*. The agent would get the reward *r1, r2, …* or *rN* accordingly. The expected long-term reward for each future state would be *V1, V2, …* or *VN*.

If the agent takes the action *a=i* while in state *S0*, then the expected long-term reward, or value, of state *S0* is given by the equation below, where γ is the discount factor, a constant between 0 and 1.
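$$V(S_0) = r_i + \gamma V_i$$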

The optimal policy helps the agent choose the best possible action. To do so, the agent calculates the resulting value for every possible action *a=1, 2, …, N* and picks the action with the maximum outcome.
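In symbols:

$$V(S_0) = \max_{i \in \{1, \dots, N\}} \left( r_i + \gamma V_i \right)$$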

The above equation is called the **deterministic Bellman equation**. It becomes a stochastic equation if, for a given action, the agent can reach more than one future state, each with a different probability. This situation is illustrated below.

The resulting **stochastic Bellman equation** for this general case is given as follows.
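$$V(s) = \max_{a} \sum_{s'} p\left(s' \mid s, a\right) \left( r(s, a, s') + \gamma V(s') \right)$$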

We provide an implementation of the Bellman equation to choose the best possible action in a given state *s*. The function calculates the resulting value of every action and chooses the maximum possible outcome. You will need to load the prerequisite libraries described in our previous article.
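The original gist is not reproduced here; the following is a minimal sketch. It assumes a Gym-style transition table `P`, where `P[s][a]` is a list of `(probability, next_state, reward, done)` tuples, the format exposed by Gym's toy-text environments via `env.unwrapped.P`:

```python
def best_action_value(P, V, s, gamma=0.9):
    """Apply the Bellman equation at state s: evaluate every action
    and return the best action together with its expected value."""
    best_a, best_value = None, float("-inf")
    for a in P[s]:
        # Expected value of taking action a: sum over all possible outcomes
        action_value = sum(prob * (reward + gamma * V[next_s])
                           for prob, next_s, reward, done in P[s][a])
        if action_value > best_value:
            best_a, best_value = a, action_value
    return best_a, best_value
```

Given the current value estimates `V`, the function simply scores every action by the stochastic Bellman equation and keeps the argmax.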

In the previous section, we explained how to find the best action that provides the maximal long-term value. If we can do this for all states, we will obtain the value function. We will also know which action to perform at each state (optimal policy). This algorithm is called **value iteration**.

The *value iteration* algorithm randomly selects an initial value function. It then calculates a new improved value function in an iterative process, until it reaches an optimal value function. Finally, it derives the optimal policy from that optimal value function.

In a 4×4 Frozen Lake environment, the *value iteration* algorithm loops over all 16 states and the 4 possible actions, computes the expected reward of each action, takes the maximum, and stores it in the vector V[s]. The algorithm iterates until V[s] stops improving significantly.

The optimal policy *P* is then, at each state, to take the action that leads to the neighboring state with the highest value *V*.

The function below implements the value iteration algorithm on a 4×4 Frozen Lake environment.
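The original gist is not reproduced here. Below is a self-contained illustrative sketch: to stay runnable without Gym, it builds the deterministic (non-slippery) 4×4 transition table by hand, mirroring Gym's `P[s][a] = [(prob, next_state, reward, done)]` convention. Note that Gym's Frozen Lake is slippery by default, so the values it produces differ from the deterministic ones here.

```python
import numpy as np

MAP = ["SFFF",
       "FHFH",
       "FFFH",
       "HFFG"]
N = 4  # grid side: 16 states, 4 actions (0=left, 1=down, 2=right, 3=up)

def make_transitions(grid):
    """Deterministic Gym-style table: P[s][a] = [(prob, s', reward, done)]."""
    P = {}
    for r in range(N):
        for c in range(N):
            s = r * N + c
            P[s] = {}
            for a, (dr, dc) in enumerate([(0, -1), (1, 0), (0, 1), (-1, 0)]):
                if grid[r][c] in "HG":           # terminal: absorb, no reward
                    P[s][a] = [(1.0, s, 0.0, True)]
                    continue
                nr = min(max(r + dr, 0), N - 1)  # moves off the grid stay put
                nc = min(max(c + dc, 0), N - 1)
                ns = nr * N + nc
                reward = 1.0 if grid[nr][nc] == "G" else 0.0
                P[s][a] = [(1.0, ns, reward, grid[nr][nc] in "HG")]
    return P

def value_iteration(P, n_states=16, gamma=0.9, theta=1e-8):
    """Sweep the Bellman optimality update until V stops improving."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = max(sum(p * (r + gamma * V[ns]) for p, ns, r, _ in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

V = value_iteration(make_transitions(MAP))
print(V.reshape(N, N))
```

In this deterministic variant the start state ends up with value γ⁵ (the goal reward discounted over the six-step shortest safe path), while holes and the absorbing goal tile keep value 0.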

The resulting array above illustrates how the value iteration function successfully computed the long-term reward for each of the 16 states. The algorithm took 7 iterations.

The states representing a hole (*H*) have value 0, while those representing frozen tiles (*F*) have higher values, especially if these are on a promising path toward the goal *G*.

The function below provides a visual of the results using a heatmap. Arrows show the flow of actions providing the best total reward using the optimal policy *P*.
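The original plotting gist is likewise not shown. Here is a hypothetical sketch with matplotlib (the function and file names are ours): it draws the state values as a heatmap and overlays the greedy action arrow on each tile.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

ARROWS = {0: "←", 1: "↓", 2: "→", 3: "↑"}

def plot_value_policy(V, policy, side=4, fname="frozenlake_values.png"):
    """Heatmap of state values with the greedy action drawn on each tile."""
    grid = np.asarray(V, dtype=float).reshape(side, side)
    fig, ax = plt.subplots()
    im = ax.imshow(grid, cmap="cool")
    for r in range(side):
        for c in range(side):
            ax.text(c, r, ARROWS[policy[r * side + c]],
                    ha="center", va="center", fontsize=14)
    fig.colorbar(im, ax=ax, label="V(s)")
    fig.savefig(fname)
    plt.close(fig)
    return fname
```

Call it with the `V` array from value iteration and a list of greedy action indices per state.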

Below we run the value iteration algorithm on an 8×8 Frozen Lake environment.

In the previous section, we presented the value iteration algorithm and illustrated how an agent would walk through a frozen lake with holes to reach the goal.

In **policy iteration** algorithms, we start with a random policy instead of a random value function. Then we find the value function of that policy. Next, we find a new (improved) policy based on the previous value function. After multiple iterations, this will lead to an optimal policy.

The function below implements the policy iteration algorithm.
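An illustrative, self-contained sketch of policy iteration over the same Gym-style transition-table convention (the function name and structure are ours, not the original gist):

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and greedy policy improvement."""
    policy = np.zeros(n_states, dtype=int)   # start from an arbitrary policy
    V = np.zeros(n_states)
    while True:
        # 1. Policy evaluation: compute V for the current (fixed) policy
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[ns])
                        for p, ns, r, _ in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # 2. Policy improvement: act greedily with respect to V
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[ns]) for p, ns, r, _ in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                            # no state changed its action
            return policy, V
```

When the improvement step no longer changes any action, the policy is optimal and the loop stops.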

As we can observe, both algorithms produce the same result. While the value iteration algorithm keeps improving the value function at each iteration until it converges, the policy improvement theorem assures us that each policy found by the policy iteration algorithm is better than the previous one, starting from the original random policy.

Both approaches achieve the same goal, policy iteration being more computationally efficient.

Both the value and policy iteration algorithms rely on the hyperparameter γ (gamma), the discount factor that weights future rewards during the value or policy updates.

Here we try different gamma values and discuss what the impact of gamma is on the training.
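To make the effect concrete, here is a small illustrative experiment (not the original gist): a three-state chain where the agent can either grab a small immediate reward or walk one extra step toward a larger one. Sweeping gamma flips the greedy choice at the start state.

```python
import numpy as np

# States: 0 = start, 1 = midway, 2 = terminal.
# From the start, action 0 grabs a small immediate reward (0.2);
# action 1 walks one step further and then collects a reward of 1.0.
P = {0: {0: [(1.0, 2, 0.2, True)], 1: [(1.0, 1, 0.0, False)]},
     1: {0: [(1.0, 2, 1.0, True)], 1: [(1.0, 2, 1.0, True)]},
     2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]}}

def greedy_start_action(gamma, sweeps=50):
    """Run value iteration, then return the greedy action at the start state."""
    V = np.zeros(3)
    for _ in range(sweeps):
        for s in P:
            V[s] = max(sum(p * (r + gamma * V[ns]) for p, ns, r, _ in P[s][a])
                       for a in P[s])
    q = [sum(p * (r + gamma * V[ns]) for p, ns, r, _ in P[0][a])
         for a in P[0]]
    return int(np.argmax(q))

for gamma in (0.1, 0.5, 0.9):
    print(f"gamma={gamma}: greedy start action = {greedy_start_action(gamma)}")
```

With gamma = 0.1 the discounted long-path value (0.1) loses to the immediate 0.2, so the agent grabs the short-term reward; with gamma = 0.5 or 0.9 the longer path wins.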

The plots show how the values increase as gamma increases. Different values of gamma (between 0 and 1) produce different policies. Lower gamma values put more weight on short-term rewards, whereas higher gamma values put more weight on long-term gains.

The best gamma depends on the domain of the task. In the Frozen Lake context, it makes little sense to chase short-term gains: the negative outcome of falling into a hole right away is worse than receiving the same penalty later, after a long run without falling. We rather want to look as far ahead as we can.

In this article, we provided hands-on techniques for implementing the value iteration and policy iteration algorithms to find the optimal policy in Reinforcement Learning.

A post by Andrej Karpathy provides further insights into these techniques.

Both value and policy iteration algorithms work when the agent knows sufficient details about the environment model. Learning or providing a transition model can be hard in several scenarios such as autonomous driving, medical treatments or stock trading. Model-free approaches are more appropriate in such situations. We did not cover them in this article.

Q-learning is a model-free algorithm used when the agent does not know the environment model and has to discover the policy by trial and error, making use of its history of interactions with the environment. State–action–reward–state–action (SARSA) is another algorithm in which the agent interacts with the environment and updates the policy based on the actions taken. This post gives additional interesting insights into model-free algorithms.

Thanks for reading.