1. Introduction

To achieve a goal, an AI agent must observe the state of the world and choose an action to execute based on that observation.

After executing the chosen action, the agent observes the new state of the world and decides whether or not it has achieved its goal. This is the problem setting for reinforcement learning.

How do agents figure out which action to choose? One way is to assign values to states and the actions taken in those states. Two algorithms we can use for that are value iteration and Q-learning.

In this tutorial, we highlight the differences between those algorithms.

2. The Problem Domain

Let’s start with a world model. We focus on the concept of a Markov Decision Process (MDP). In an MDP, the world is described by a set of states S, a set of actions A, a transition function P(s' \mid s, a) giving the probability of moving from state s to state s' when taking action a, and a reward R(s) for reaching each state.

Usually, we also include the discount factor \gamma in this definition. It specifies the relative weight of future rewards against immediate ones. Given a Markov Decision Process, the problem becomes figuring out how to act optimally in this world.
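To make the role of \gamma concrete, the agent's objective can be written as the expected discounted return (a standard formulation, stated here for reference rather than taken from the article):

\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_{t+1})\right], \qquad \gamma \in [0, 1)

A \gamma close to 0 makes the agent short-sighted, while a \gamma close to 1 weighs future rewards almost as much as immediate ones.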

The extent of our knowledge of the MDP influences our choice of algorithm: value iteration or Q-learning.

2.1. Value Iteration

The value iteration algorithm iterates over states, performing a Bellman backup of the value for a state over all possible actions:

(Figure: The Value-Iteration Algorithm)

The algorithm terminates when the value update is below a preset threshold.
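As a minimal sketch, tabular value iteration might look like the following Python. The MDP representation (P[s][a] as a list of (probability, next_state) pairs, R[s] as the reward for reaching state s) and the parameter names are assumptions made for the example, not part of the article's figure:

def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    # Assumed MDP representation (illustrative only):
    #   P[s][a] -> list of (probability, next_state) pairs
    #   R[s]    -> reward for reaching state s
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: best expected return over all possible actions
            best = max(
                sum(p * (R[s2] + gamma * V[s2]) for p, s2 in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            # Terminate when the largest value update falls below the threshold
            return V

Each sweep applies the Bellman backup to every state and stops once the largest change drops below the preset threshold theta.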

2.2. Q-Learning

The Q-learning algorithm differs from value iteration in that it doesn’t iterate through states. Here, we don’t know the set of states and their transitions in advance, so Q-learning forces us to act in the environment instead:

(Figure: The Q-Learning Algorithm)

In Q-learning, we perform multiple episodes in the environment, choosing an action at each step and applying a Bellman-style update to the Q-function after every transition until the episode terminates. This process learns the Q-value, also known as the action-value: a value for each state-action pair.
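Here is a minimal tabular Q-learning sketch in the same spirit, assuming a hypothetical Gym-like environment object with reset() and step(action) methods returning (next_state, reward, done); the interface and hyperparameter values are assumptions for illustration, not the article's figure:

import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Assumed (hypothetical) environment interface, Gym-like in spirit:
    #   env.reset()      -> initial state
    #   env.step(action) -> (next_state, reward, done)
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Explore with probability epsilon, otherwise act greedily on current Q
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Bellman-style update toward the bootstrapped target
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q

With probability epsilon the agent explores a random action; otherwise it exploits its current estimates, and every observed transition nudges Q toward the bootstrapped target.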

Now, let’s check the differences between Q-learning and value iteration.

3. Model-Free vs. Model-Based

The key difference between value iteration and Q-learning lies in what we know about the Markov Decision Process.

Do we know the transition function? If we do, we know how to go from one state to another and how our actions change the state, so at every state we can compute the expected future reward of each action. If we don’t, we need another approach.

Methods that rely on a known transition function are called model-based, while those that don’t are called model-free.

3.1. Value Iteration Is Model-Based

So, value iteration is a model-based method, as we know the transition function, which tells us how we move from one state to another.

Consequently, we can iterate through each state-to-state transition to compute a value for each state.

3.2. Q-Learning Is Model-Free

In Q-learning, we don’t know how likely an action is to lead from one state to another.

As we don’t know the state transitions when using Q-learning, we must actively engage with the environment. Our agent observes a state, takes an action, and transitions to a new state.

This environment interaction separates Q-learning from value iteration. It makes Q-learning expensive and often requires a simulated environment to make the process feasible, since interacting with a physical setup can be costly and slow.

4. Dynamic Programming vs. Reinforcement Learning

The value iteration algorithm is a dynamic-programming method. Given the full MDP, the Bellman equations form a system of |S| equations in |S| unknowns, one per state, and value iteration computes the value function by sweeping over the states until the values converge.
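Written out with the notation from Section 2, the system takes the standard Bellman optimality form (the article's figure is assumed to be equivalent):

V(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\,\bigl[R(s') + \gamma V(s')\bigr] \quad \text{for every } s \in S

Value iteration solves this system by repeatedly applying the right-hand side as an update, as described in Section 2.1.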

In Q-learning, instead of iterating over states, we actively interact with the world over a series of episodes. Through exploratory actions, we eventually visit all the states and learn the correct action value for each one. This exploration process, and the inability to iterate over states without engaging with the environment, make Q-learning a reinforcement-learning method.

5. Value and Action-Value

Knowing the transition probabilities and the reward function, we can efficiently compute the value of each action. Without this information, we need to learn the value of each action to act optimally.

If we don’t know the transition probabilities, we can’t directly compute which action leads to the best next state; we have to learn the action values from experience instead.
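To make this concrete, here is the standard relationship between the two functions, in the same notation as above (stated for reference; the article doesn't spell it out):

Q(s, a) = \sum_{s' \in S} P(s' \mid s, a)\,\bigl[R(s') + \gamma V(s')\bigr], \qquad V(s) = \max_{a \in A} Q(s, a)

Recovering the greedy action from V alone requires the transition probabilities P, whereas acting greedily on Q, \pi(s) = \arg\max_{a} Q(s, a), needs only the learned action-values. That is why the model-free setting pushes us toward learning Q.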

6. Utility

Value iteration is often less useful in practice because problems where we know the full state-transition function are relatively rare.

We often end up using Q-learning instead. The two methods are closely related, and the key difference in their construction is what we know before solving the problem.

7. Summary

The key differences, highlighted in the table, center around the type of problem being solved:

                     Value Iteration          Q-Learning
World Knowledge      Complete                 Incomplete
Method               Dynamic Programming      Reinforcement Learning
Solves For           Value Function           Action-Value Function

The choice between these two methods depends on whether we have full knowledge of the transition function and whether we can iterate over states or have to act within the environment to sample states and state transitions.

This leads to the main difference. Because we know the state-transition function in value iteration, we don’t need to learn which action is best: we can compute it on demand from the state values, so it’s enough to calculate the value of each state. Q-learning differs because we don’t know this transition function, so we must estimate not only the value of a state but also the value of taking each action in that state, the action-value function.

8. Conclusion

In this article, we discussed the differences between value iteration and Q-learning.

In value iteration, we assume perfect knowledge of the world model, the MDP, so we can compute the optimal value function, and hence the optimal policy, iteratively. In contrast, we use Q-learning when we don’t have perfect knowledge of the world.