Table of Contents
1. What is Reinforcement Learning?
1.1. The big picture
1.2. A formal definition
2. The Reinforcement Learning Framework
2.1. The RL Process
2.2. The reward hypothesis: the central idea of Reinforcement Learning
2.3. (Optional) Markov Property
2.4. Observations/States Space
2.5. Action Space
2.6. Rewards and discounting
2.7. Types of tasks
2.8. Exploration/Exploitation tradeoff
3. The two main approaches for solving RL problems
3.1. The Policy π: the agent’s brain
3.2. Policy-Based Methods
3.3. Value-Based Methods
4. The “Deep” in Reinforcement Learning
5. Summary
Deep Reinforcement Learning (Deep RL) is a type of Machine Learning where an agent learns how to behave in an environment by performing actions and seeing the results. Reinforcement learning is the branch of machine learning whose defining trait is learning from interaction: the agent continually learns from the rewards and penalties it receives while interacting with its environment and adapts to it over time. This learning paradigm closely mirrors how we humans acquire knowledge, which is why RL is regarded as an important route toward general AI.
Since 2013 and the Deep Q-Learning paper, we’ve seen a lot of breakthroughs. From OpenAI Five, which beat some of the best Dota 2 players in the world, to the Dexterity project, we live in an exciting moment in Deep RL research.
1. What is Reinforcement Learning?
To understand what reinforcement learning is, let’s start with the big picture.
1.1. The big picture
The idea behind Reinforcement Learning is that an agent (an AI) will learn from the environment by interacting with it (through trial and error) and receiving rewards (negative or positive) as feedback for performing actions.
Learning from interaction with the environment comes from our natural experiences.
For instance, imagine you put your little brother in front of a video game he has never played, give him a controller, and leave him alone.
Your brother will interact with the environment (the video game) by pressing the right button (action). He gets a coin: that’s a +1 reward. It’s positive, so he just understood that in this game he must collect coins.
But then he presses right again, touches an enemy, and dies: that’s a -1 reward.
By interacting with his environment through trial and error, your little brother understands that in this environment he needs to collect coins but avoid the enemies.
Without any supervision, the child will get better and better at playing the game.
That’s how humans and animals learn: through interaction. Reinforcement Learning is just a computational approach to learning from actions.
1.2. A formal definition
Now, here is a formal definition:
Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.
But how does Reinforcement Learning work?
2. The Reinforcement Learning Framework
2.1. The RL Process
To understand the RL process, let’s imagine an agent learning to play a platform game:
This RL loop outputs a sequence of state, action, reward, and next state.
The goal of the agent is to maximize its cumulative reward, called the expected return.
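As a minimal sketch of this loop in code (assuming the Gymnasium library and its CartPole-v1 environment, which are not part of the original article; any environment with the same reset/step API would do):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()

episode_return = 0.0
done = False
while not done:
    # A real agent would pick the action with its policy; here we act randomly.
    action = env.action_space.sample()
    next_state, reward, terminated, truncated, info = env.step(action)
    episode_return += reward          # accumulate the (undiscounted) return
    state = next_state
    done = terminated or truncated

print(f"Cumulative reward for this episode: {episode_return}")
```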
2.2. The reward hypothesis: the central idea of Reinforcement Learning
Why is the goal of the agent to maximize the expected return?
Because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected return (expected cumulative reward).
That’s why in Reinforcement Learning, to have the best behavior, we need to maximize the expected cumulative reward.
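In standard RL notation (not an equation from the original article), this objective is often written as:

$$
J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau)\big]
$$

where τ is a trajectory (a sequence of states, actions, and rewards) generated by acting according to the policy π, and R(τ) is the cumulative reward collected along that trajectory.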
2.3. (Optional) Markov Property
You’ll see in papers that the RL process is called the Markov Decision Process (MDP).
We’ll talk about the Markov Property again in later chapters. For now, the one thing to remember is that the Markov Property implies that our agent needs only the current state to decide which action to take, not the history of all the states and actions it took before.
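In symbols, the Markov property says that the next state depends only on the current state and action, not on the whole history:

$$
P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_0, A_0, S_1, A_1, \dots, S_t, A_t)
$$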
Now let’s dive a little deeper into all this new vocabulary.
2.4. Observations/States Space
Observations/States are the information our agent gets from the environment. In the case of a video game, it can be a frame (a screenshot); in the case of a trading agent, it can be the value of a certain stock, etc.
There is a distinction to make between observation and state: a state is a complete description of the world, with no hidden information, whereas an observation is only a partial description of the state.
With a chess game, we are in a fully observed environment, since we have access to the whole chessboard.
In Super Mario Bros, we are in a partially observed environment: we receive an observation, since we only see a part of the level.
2.5. Action Space
The Action space is the set of all possible actions in an environment.
The actions can come from a discrete or a continuous space: in a discrete space the number of possible actions is finite, while in a continuous space it is infinite (for example, a steering angle that can take any value in a range).
Taking this information into account is crucial, because it will matter when we choose an RL algorithm later on.
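A small sketch of the distinction, assuming Gymnasium’s space classes (`Discrete` for a finite set of actions, `Box` for a continuous range); the concrete numbers are illustrative only:

```python
import numpy as np
from gymnasium.spaces import Box, Discrete

# Discrete action space: e.g. 4 possible actions (left, right, jump, crouch).
discrete_actions = Discrete(4)
print(discrete_actions.sample())     # an integer in {0, 1, 2, 3}

# Continuous action space: e.g. a steering angle anywhere in [-1.0, 1.0].
continuous_actions = Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
print(continuous_actions.sample())   # a float array drawn from [-1.0, 1.0]
```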
2.6. Rewards and discounting
The reward is fundamental in RL because it’s the only feedback the agent gets. Thanks to it, our agent knows whether the action it took was good or not.
The cumulative reward at each time step t can be written as:

$$
R(\tau) = r_{t+1} + r_{t+2} + r_{t+3} + r_{t+4} + \dots
$$

which is equivalent to:

$$
R(\tau) = \sum_{k=0}^{\infty} r_{t+k+1}
$$
However, in reality, we can’t just add rewards up like that. Rewards that come sooner (at the beginning of the game) are more likely to happen, since they are more predictable than long-term future rewards.
Let’s say your agent is a small mouse that can move one tile per time step, and your opponent is the cat (which can move too). The mouse’s goal is to eat the maximum amount of cheese before being eaten by the cat.
As we can see in the diagram, it’s more probable to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).
As a consequence, the reward near the cat, even if it is bigger (more cheese), will be more discounted since we’re not really sure we’ll be able to eat it.
To discount the rewards, we define a discount rate called gamma (γ), with a value between 0 and 1; the further a reward lies in the future, the more strongly it is discounted.
As the time step increases, the cat gets closer to us, so the future reward becomes less and less likely to happen.
Our discounted expected cumulative reward is:

$$
R(\tau) = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
$$
The smaller γ is, the faster γ^k goes to 0 as k (the number of steps) grows, so the agent effectively only sees short-term gains. Conversely, the larger γ is, the more steps it takes for γ^k to vanish, so the agent can take long-term gains into account.
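A tiny sketch that makes this concrete (the reward sequence and gamma values are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_{t+k+1} over a sequence of future rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1, 1, 1, 1, 10]                      # a big reward far in the future
print(discounted_return(rewards, gamma=0.5))    # 2.5: the distant +10 barely counts
print(discounted_return(rewards, gamma=0.99))   # ~13.5: the distant +10 still matters
```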
2.7. Types of tasks
A task is an instance of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuous.
2.7.1. Episodic task
In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of states, actions, rewards, and new states.
For instance, think of Super Mario Bros: an episode begins at the launch of a new Mario level and ends when you are killed or reach the end of the level.
2.7.2. Continuous tasks
These are tasks that continue forever (there is no terminal state). In this case, the agent has to learn how to choose the best actions while simultaneously interacting with the environment.
For instance, consider an agent that does automated stock trading. For this task, there is no starting point and no terminal state; the agent keeps running until we decide to stop it.
2.8. Exploration/Exploitation tradeoff
Finally, before looking at the different methods to solve Reinforcement Learning problems, we must cover one more very important topic: the exploration/exploitation trade-off.
Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, we can fall into a common trap.
Let’s take an example:
In this game, our mouse can have an infinite amount of small cheese (+1 each). But at the top of the maze, there is a gigantic pile of cheese (+1000).
However, if we only focus on exploitation, our agent will never reach the gigantic pile of cheese. Instead, it will only exploit the nearest source of reward, even if that source is small (exploitation).
But if our agent does a little bit of exploration, it can discover the big reward (the pile of big cheese).
This is what we call the exploration/exploitation trade off. We need to balance how much we explore the environment and how much we exploit what we know about the environment.
Therefore, we must define a rule that helps to handle this trade-off.
If it’s still confusing, think of a real problem: choosing a restaurant. Exploitation: you go every day to the same restaurant that you know is good, at the risk of missing a better one. Exploration: you try a restaurant you have never been to, at the risk of a bad experience but with the chance of a great one.
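One simple rule that is often used for this (a sketch of epsilon-greedy, not something prescribed by this article): with probability epsilon take a random action (explore), otherwise take the action currently believed to be best (exploit).

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

# Example: four actions with estimated values; most of the time we pick action 2.
print(epsilon_greedy([0.1, 0.5, 0.9, 0.2], epsilon=0.1))
```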
3. The two main approaches for solving RL problems
Now that we have learned the RL framework, how do we solve the RL problem? In other words, how do we build an RL agent that can select the actions that maximize its expected cumulative reward?
3.1. The Policy π: the agent’s brain
The Policy π is the brain of our agent: it’s the function that tells us which action to take given the state we are in. So it defines the agent’s behavior at a given time.
This policy is the function we want to learn. Our goal is to find the optimal policy π*, the policy that maximizes the expected return when the agent acts according to it. We find this π* through training.
There are two approaches to training our agent to find this optimal policy π*: directly, with policy-based methods, or indirectly, with value-based methods.
3.2. Policy-Based Methods
In Policy-Based Methods, we learn a policy function directly.
This function maps each state to the best corresponding action at that state, or to a probability distribution over the set of possible actions at that state.
We have two types of policy:
Deterministic: a policy that, for a given state, always returns the same action.
Stochastic: a policy that outputs a probability distribution over actions.
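In standard notation (added here for clarity), the two types can be written as:

$$
\text{Deterministic: } a = \pi(s) \qquad\qquad \text{Stochastic: } \pi(a \mid s) = P(A = a \mid S = s)
$$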
3.3. Value-Based Methods
In value-based methods, instead of training a policy function, we train a value function that maps a state to the expected value of being in that state.
The value of a state is the expected discounted return the agent can get if it starts in that state and then acts according to our policy.
“Acting according to our policy” here just means that our policy is “go to the state with the highest value.”
Here we see that our value function defines a value for each possible state.
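In standard notation (added here for clarity), the state-value function under a policy π is:

$$
V_{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, S_t = s\right]
$$

that is, the expected discounted return when starting in state s and then following π.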
4. The “Deep” in Reinforcement Learning
Wait… you spoke about Reinforcement Learning, but why do we speak about Deep Reinforcement Learning?
Deep Reinforcement Learning introduces deep neural networks to solve Reinforcement Learning problems — hence the name “deep.”
For instance, we’ll work on Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning; both are value-based RL algorithms.
You’ll see that the difference is that, in the first approach, we use a traditional algorithm to build a Q-table that tells us which action to take for each state.
In the second approach, we use a neural network to approximate the Q-value.
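A rough sketch of the contrast (assuming NumPy for the table and PyTorch for the network; the sizes and architecture are illustrative, not those of the algorithms covered later):

```python
import numpy as np
import torch
import torch.nn as nn

n_states, n_actions = 16, 4

# Classic Q-Learning: a lookup table with one Q-value per (state, action) pair.
q_table = np.zeros((n_states, n_actions))
best_action_in_state_3 = int(np.argmax(q_table[3]))

# Deep Q-Learning: a neural network approximates the Q-values for all actions,
# which scales to state spaces far too large for a table (e.g. raw game frames).
q_network = nn.Sequential(
    nn.Linear(n_states, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)
state = torch.zeros(n_states)   # e.g. a one-hot encoding of state 3
state[3] = 1.0
q_values = q_network(state)     # one Q-value per action
best_action = int(torch.argmax(q_values))
```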
5. Summary
References:
Playing Atari with Deep Reinforcement Learning: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
Gym Retro: https://github.com/openai/retro
Unity ML-Agents Toolkit: https://github.com/Unity-Technologies/ml-agents/
TF-Agents: https://github.com/tensorflow/agents
The MineRL Python Package: https://github.com/minerllabs/minerl
A Free course in Deep Reinforcement Learning from beginner to expert: https://simoninithomas.github.io/deep-rl-course/#syllabus
An Introduction to Deep Reinforcement Learning: https://thomassimonini.medium.com/an-introduction-to-deep-reinforcement-learning-17a565999c0c
Exploration: http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_exploration.pdf
MIT, Introduction to Deep Learning: http://introtodeeplearning.com/
Playing Super Mario Bros. With Deep Reinforcement Learning: https://github.com/Kautenja/playing-mario-with-deep-reinforcement-learning
Simple Reinforcement Learning with Tensorflow Part 8: Asynchronous Actor-Critic Agents (A3C): https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2
Playing Mario with Deep Reinforcement Learning: https://github.com/aleju/mario-ai