1. Introduction

Temporal difference learning is often one of the first topics we encounter when being introduced to reinforcement learning. Two prominent algorithms, Q-learning and SARSA, are typically used to expand on this topic and showcase the basics of reinforcement learning.

On the surface, these algorithms look very similar, and it can be hard to discern how they differ and why that difference matters. That is exactly what we explain in this article. We start with a quick refresher on value functions and Q-functions. We then introduce the update rules for Q-learning and SARSA and highlight the difference between them.

Finally, we introduce Cliff World as an example problem to show how that difference impacts the learned behavior.

If you want to learn more, feel free to explore our separate article with a more detailed introduction to Q-learning.

2. What Is Reinforcement Learning?

Reinforcement learning is based on the concept of the Markov Decision Process (MDP). An MDP is defined as a tuple \langle S,A,T,R,\gamma \rangle, where S is a set of states, A is a set of actions, T is the state transition function, R is a reward function, and \gamma is a discount factor. In an MDP, the future is independent of the past given the present; this is known as the Markov property.
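
To make these components concrete, here is a minimal sketch of how an MDP tuple might be represented in code; the two-state example, its state and action names, and all the numbers are purely illustrative assumptions, not part of any standard library.

# A minimal sketch of an MDP <S, A, T, R, gamma> for a hypothetical two-state problem.
S = ["s0", "s1"]                      # set of states
A = ["stay", "move"]                  # set of actions

# T[state][action] -> list of (next_state, probability) pairs
T = {
    "s0": {"stay": [("s0", 1.0)], "move": [("s1", 1.0)]},
    "s1": {"stay": [("s1", 1.0)], "move": [("s0", 1.0)]},
}

# R[state][action] -> immediate reward
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.0, "move": 0.0},
}

gamma = 0.9                           # discount factor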

Solutions to the reinforcement learning problem are often separated into the categories of value-based and policy-based algorithms. Value-based methods are popular, and Q-learning, in particular, has received a lot of attention through well-known extensions such as DQN, Dueling-DQN, and Rainbow.

The popularity of the Q-learning approach, however, might lead us to ask why SARSA, an algorithm very closely related to Q-learning, has seen less attention recently, even though it is still informative when learning about reinforcement learning.

2.1. Value Functions and Q Functions

Value functions are functions of states that estimate how good it is for an agent to be in a given state. We measure how good a state is based on what the agent can expect in terms of future rewards; this is the expected return. What happens in the future depends on the actions the agent takes, that is, on its policy, and so value functions are defined in relation to policies.

In equation 1, we see that the value of a state is defined as the expectation of exponentially discounted future rewards:

(1)    \begin{equation*} V^{\pi}(s) = \mathbb{E}_{\pi}\{\sum_{k=0}^{\infty}\gamma^{k}r_{t+k+1} \mid s_t=s\} \end{equation*}

The Q-function then represents the expected return for taking a specific action in a certain state and then following the policy. We again define this expectation over discounted future rewards, this time for the action-value function.

This is the Q function that we learn with Q-learning or SARSA. We can see its definition in equation 2:

(2)    \begin{equation*} Q^{\pi}(s,a) = \mathbb{E}_{\pi}\{\sum_{k=0}^{\infty}\gamma^{k}r_{t+k+1} \mid s_t=s, a_t=a\} \end{equation*}

How the agent perceives the future will impact its behavior.
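
To make the expectation in equations 1 and 2 more tangible, the following sketch computes the discounted sum of rewards for a single sampled trajectory; averaging many such samples under a fixed policy would approximate V^{\pi} or Q^{\pi}. The reward sequence here is purely illustrative.

# A minimal sketch: the discounted return inside equations 1 and 2,
# computed for one sampled sequence of future rewards.
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^k * r_{t+k+1} over the rewards observed after time t."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical rewards received after taking a_t in s_t until the episode ends.
rewards_after_t = [-1, -1, -1, 10]
print(discounted_return(rewards_after_t))  # one Monte Carlo sample of Q(s_t, a_t)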

3. What Is SARSA?

SARSA, which stands for State, Action, Reward, State, Action, is an on-policy, value-based approach. As a form of temporal difference learning, it needs a rule for updating its value estimates.

For SARSA, we show this in equation 3:

(3)    \begin{equation*} Q(s_t,a_t) = Q(s_t,a_t) + \alpha ( r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_{t},a_{t}) ) \end{equation*}

The Q-value update rule is what distinguishes SARSA from Q-learning. In SARSA, the temporal difference error is calculated using the current state-action pair and the next state-action pair. This means we need to know the next action our policy takes in order to perform an update step.

This makes SARSA an on-policy algorithm as it is updated based on the current choices of our policy.
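
As a rough sketch of how equation 3 translates into code, the snippet below applies one SARSA update to a tabular Q-function stored in a dictionary; the state and action encodings, the learning rate, and the discount factor are all hypothetical choices, not a prescribed implementation.

from collections import defaultdict

Q = defaultdict(float)        # Q[(state, action)] -> current estimate
alpha, gamma = 0.1, 0.99      # learning rate and discount factor

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy update: bootstraps from the action a_next actually chosen by the policy."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error

# Example: the agent took action 0 in state 5, got reward -1, landed in state 6,
# and its policy has already chosen action 1 as the next action.
sarsa_update(5, 0, -1, 6, 1)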

4. What Is Q-Learning?

Q-learning differs from SARSA in its update rule by assuming the agent follows the optimal (greedy) policy from the next state onwards. The use of the \max_{a} operator over the available actions makes the Q-learning algorithm an off-policy approach.

This is because the policy we are updating differs in behavior from the policy we use to explore the world, which uses an exploration parameter \epsilon to choose between the best-identified action and a random choice of action:

(4)    \begin{equation*} Q(s_t,a_t) = Q(s_t,a_t) + \alpha ( r_t + \gamma \max_{a}Q(s_{t+1},a) - Q(s_{t},a_{t}) ) \end{equation*}
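
For comparison, here is an equally rough sketch of the Q-learning update from equation 4, mirroring the hypothetical tabular setup of the SARSA sketch above; note that it never needs to know which action the behavior policy will actually take next. The discrete action set is assumed purely for illustration.

from collections import defaultdict

Q = defaultdict(float)        # Q[(state, action)] -> current estimate
alpha, gamma = 0.1, 0.99      # learning rate and discount factor
actions = [0, 1, 2, 3]        # hypothetical discrete action set

def q_learning_update(s, a, r, s_next):
    """Off-policy update: bootstraps from the greedy (max) action in the next state."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error

q_learning_update(5, 0, -1, 6)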

5. What Does All of This Mean?

We said earlier that value functions are defined in relation to policies. We have also seen how SARSA updates with respect to the policy it is currently following and how Q-learning updates with respect to the optimal policy.

Both approaches usually follow an \epsilon-greedy policy when acting in the world. This difference in the policy used to update the Q-value leads to differences in the policy learned.
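
For reference, an \epsilon-greedy behavior policy over a tabular Q-function could look like the following sketch; the Q-table layout matches the hypothetical dictionaries used in the update sketches above.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

Both agents can act with this same behavior policy; the difference lies entirely in which Q-value they bootstrap from when updating.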

We’ll now explore that with an example using Cliff World.

5.1. An Example: Cliff World

We can highlight the effect of this seemingly small difference by using a simple grid-world-based problem: the Cliff World. In a cliff world, the agent has to walk from the starting cell to the goal cell along the edge of a cliff without falling off.

There is a -1 penalty for each step and a -100 penalty for stepping off the cliff. The shortest path is a straight line along the cliff edge. It is the shortest path, but it is risky: if the agent missteps, it receives a large negative penalty.

The cliff world is drawn from Reinforcement Learning: An Introduction by Sutton and Barto, a seminal text in the field.
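
To keep the example self-contained, here is a rough sketch of the reward structure just described; the 4x12 layout and the reset to the start after a fall follow Sutton and Barto's version, while the coordinates and helper name are our own hypothetical choices.

# A minimal sketch of the Cliff World dynamics: -1 per step, -100 and a reset
# to the start for stepping into the cliff, episode over at the goal.
START, GOAL = (3, 0), (3, 11)
CLIFF = {(3, col) for col in range(1, 11)}   # bottom row between start and goal

def step_outcome(cell):
    """Return (reward, next_cell, done) for the cell the agent just moved into."""
    if cell in CLIFF:
        return -100, START, False   # fell off the cliff: big penalty, back to start
    if cell == GOAL:
        return -1, cell, True       # reached the goal: episode ends
    return -1, cell, False          # ordinary step cost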


While we know the shortest path, our Q-learning and SARSA agents will disagree over whether it is the best one. Our on-policy SARSA agent views the cliff edge as riskier because it chooses and updates actions subject to its stochastic policy. That means it has learned that it has a non-negligible chance of stepping off the cliff and receiving a large negative reward.

Our Q-learning agent, by contrast, has learned its policy based on the optimal (greedy) policy, which always chooses the action with the highest Q-value. It is therefore more confident in its ability to walk the cliff edge without falling off. As a result, SARSA tends to learn the longer, safer path away from the cliff, while Q-learning learns the shorter path right along the edge.

6. Conclusion

Reinforcement learning is a powerful learning paradigm with many potential uses and applications. It is, however, sensitive to design decisions which, as we have seen, can lead to different final solutions and behaviors. We've examined one such intricacy: the difference between two classic value-based reinforcement learning methods.

The way an agent estimates its expected future returns influences how it learns and the final policy it arrives at. If an agent accounts for its own stochastic, exploratory behavior when updating, it will be more cautious about risky states and choose a safer path. On the other hand, an agent that expects to always select the best action will take more direct, if riskier, actions.

Understanding how sensitive reinforcement learning algorithms are to the way they treat future rewards is key to using them well.

