Deep Reinforcement Learning: An Introduction

WEBJ2EE · Published on 2022-03-30
Table of Contents
1. What is Reinforcement Learning?
  1.1. The big picture
  1.2. A formal definition
2. The Reinforcement Learning Framework
  2.1. The RL Process
  2.2. The reward hypothesis: the central idea of Reinforcement Learning
  2.3. (Optional) Markov Property
  2.4. Observations/States Space
  2.5. Action Space
  2.6. Rewards and the discounting
  2.7. Types of tasks
  2.8. Exploration/ Exploitation tradeoff
3. The two main approaches for solving RL problems
  3.1. The Policy π: the agent’s brain
  3.2. Policy-Based Methods
  3.3. Value-Based Methods
4. The “Deep” in Reinforcement Learning
5. Summary

Deep RL (Deep Reinforcement Learning) is a type of Machine Learning where an agent learns how to behave in an environment by performing actions and seeing the results. (In other words: reinforcement learning is a branch of machine learning whose defining trait is learning from interaction. By interacting with the environment, the agent keeps learning from the rewards or penalties it receives and adapts to the environment better and better. This learning paradigm closely resembles the way humans learn, which is why RL is regarded as an important path toward general AI.)

Since 2013 and the Deep Q-Learning paper, we've seen a lot of breakthroughs. From OpenAI Five, which beat some of the best Dota 2 players in the world, to the Dexterity project, we live in an exciting moment in Deep RL research.

1. What is Reinforcement Learning?

In order to understand what reinforcement learning is, let's start with the big picture.

1.1. The big picture

The idea behind Reinforcement Learning is that an agent (an AI) will learn from the environment by interacting with it (through trial and error) and receiving rewards (negative or positive) as feedback for performing actions.

Learning from interaction with the environment comes from our natural experiences.

For instance, imagine you put your little brother in front of a video game he has never played, put a controller in his hands, and leave him alone.

Your brother will interact with the environment (the video game) by pressing the right button (action). He gets a coin: that's a +1 reward. It's positive; he just understood that in this game he must collect the coins.

But then he presses right again, touches an enemy, and dies: that's a -1 reward.

By interacting with his environment through trial and error, your little brother just understood that in this environment, he needs to get coins, but avoid the enemies.

Without any supervision, the child will get better and better at playing the game.

That's how humans and animals learn: through interaction. Reinforcement Learning is just a computational approach to learning from actions.

1.2. A formal definition

If we now take a formal definition:

Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.

But how does Reinforcement Learning work?

2. The Reinforcement Learning Framework

2.1. The RL Process

To understand the RL process, let’s imagine an agent learning to play a platform game:

  • Our Agent receives state S0 from the Environment — we receive the first frame of our game (environment).
  • Based on that state S0, the agent takes an action A0 — our agent will move to the right.
  • The environment transitions to a new state S1 — a new frame.
  • The environment gives some reward R1 to the agent — we're not dead (positive reward +1).

This RL loop outputs a sequence of state, action, reward, and next state.

The goal of the agent is to maximize its cumulative reward, called the expected return.
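To make the loop concrete, here is a minimal Python sketch of one episode of this process. The `env` and `agent` objects and their methods (`reset`, `step`, `act`, `learn`) are hypothetical placeholders, not a specific library's API:

```python
# A minimal sketch of the RL loop: state -> action -> reward -> next state.
# `env` and `agent` are hypothetical objects, not a specific library API.

def run_episode(env, agent, max_steps=1000):
    state = env.reset()                      # receive the initial state S0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                        # choose At based on St
        next_state, reward, done = env.step(action)      # environment returns S(t+1), R(t+1)
        agent.learn(state, action, reward, next_state)   # update from the transition
        total_reward += reward
        state = next_state
        if done:                                         # terminal state reached (episodic task)
            break
    return total_reward                                  # cumulative reward of the episode
```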

2.2. The reward hypothesis: the central idea of Reinforcement Learning

Why is the goal of the agent to maximize the expected return?

Because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected return (expected cumulative reward).

That’s why in Reinforcement Learning, to have the best behavior, we need to maximize the expected cumulative reward.

2.3. (Optional) Markov Property

You'll see in papers that the RL process is called a Markov Decision Process (MDP).

We'll talk about the Markov Property again in later chapters. But if you need to remember one thing about it today, it is that the Markov Property implies that our agent needs only the current state to decide what action to take, not the history of all the states and actions it took before.

Now let's dive a little deeper into all this new vocabulary.

2.4. Observations/States Space

Observations/states are the information our agent gets from the environment. In the case of a video game, it can be a frame (a screenshot); in the case of a trading agent, it can be the value of a certain stock, etc.

There is a differentiation to make between observation and state:

  • State s: a complete description of the state of the world (there is no hidden information), as in a fully observed environment.

With a chess game, we are in a fully observed environment, since we have access to the whole chess board.

  • Observation o: a partial description of the state, as in a partially observed environment.

In Super Mario Bros, we are in a partially observed environment: we receive an observation, since we only see a part of the level.

2.5. Action Space

The Action space is the set of all possible actions in an environment.

The actions can come from a discrete or continuous space:

  • Discrete space: the number of possible actions is finite.
    • In Super Mario Bros, we have a finite set of actions since we have only 4 directions and jump.
  • Continuous space: the number of possible actions is infinite.
    • A self-driving car agent has an infinite number of possible actions, since it can turn left by 20°, 21°, 22°, honk, turn right by 20°, 20.1°…

Taking this information into consideration is crucial, because it will matter later when we choose an RL algorithm.
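As a rough illustration, this is how a discrete and a continuous action space can be declared with the Gymnasium `spaces` module (assuming the `gymnasium` package is available; the action count and steering bounds below are invented for the example):

```python
import gymnasium as gym
import numpy as np

# Discrete space: a finite number of actions, e.g. left, right, up, down, jump.
mario_actions = gym.spaces.Discrete(5)

# Continuous space: an infinite number of actions, e.g. a steering angle
# between -30 and +30 degrees (bounds chosen arbitrarily for this example).
steering = gym.spaces.Box(low=-30.0, high=30.0, shape=(1,), dtype=np.float32)

print(mario_actions.sample())  # e.g. 3
print(steering.sample())       # e.g. [12.7]
```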

2.6. Rewards and the discounting

The reward is fundamental in RL because it’s the only feedback for the agent. Thanks to it, our agent knows if the action taken was good or not.

The cumulative reward at each time step t can be written as:

R(τ) = r_{t+1} + r_{t+2} + r_{t+3} + r_{t+4} + …

which is equivalent to:

R(τ) = Σ_{k=0}^{∞} r_{t+k+1}

However, in reality, we can't just add them up like that. The rewards that come sooner (at the beginning of the game) are more likely to happen, since they are more predictable than long-term future rewards.

Let's say your agent is a small mouse that can move one tile each time step, and your opponent is the cat (which can move too). Your goal is to eat the maximum amount of cheese before being eaten by the cat.

As we can see in the diagram, it’s more probable to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).

As a consequence, the reward near the cat, even if it is bigger (more cheese), will be more discounted since we’re not really sure we’ll be able to eat it.

To discount the rewards, we proceed like this:

  1. We define a discount rate called gamma. It must be between 0 and 1.
    1. The larger the gamma, the smaller the discount. This means our agent cares more about the long term reward.
    2. On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short term reward (the nearest cheese).
  2. Then, each reward will be discounted by gamma raised to the power of the time step.

As the time step increases, the cat gets closer to us, so the future reward is less and less probable to happen.

Our discounted expected cumulative reward is:

R(τ) = Σ_{k=0}^{∞} γ^k r_{t+k+1}

(Note by 小神龙: the smaller γ is, the faster γ^k goes to 0 as k, the number of steps, grows, so the agent effectively only values short-term gains; conversely, the larger γ is, the larger k must be before γ^k approaches 0, so the agent can also take long-term gains into account.)
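To make the formula concrete, here is a tiny Python sketch that computes the discounted return of a finite list of rewards (the reward values are made up):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R = sum_k gamma^k * r_{t+k+1} for a finite list of rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1, 0, 0, 3]                          # hypothetical rewards from one trajectory
print(discounted_return(rewards, gamma=0.99))   # ~3.91: far rewards count only slightly less
print(discounted_return(rewards, gamma=0.50))   # 1.375: the +3 three steps away is heavily discounted
```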

2.7. Types of tasks

A task is an instance of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuous.

2.7.1. Episodic task

In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of states, actions, rewards, and new states.

For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends when you're killed or you reach the end of the level.

2.7.2. Continuous tasks

These are tasks that continue forever (there is no terminal state). In this case, the agent has to learn how to choose the best actions while simultaneously interacting with the environment.

For instance, an agent that does automated stock trading. For this task, there is no starting point or terminal state: the agent keeps running until we decide to stop it.

2.8. Exploration/ Exploitation tradeoff

Finally, before looking at the different methods to solve Reinforcement Learning problems, we must cover one more very important topic: the exploration/exploitation trade-off.

  • Exploration is exploring the environment by trying random actions in order to find more information about the environment.
  • Exploitation is exploiting known information to maximize the reward.

Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, we can fall into a common trap.

Let’s take an example:

In this game, our mouse can have an infinite amount of small cheese (+1 each). But at the top of the maze, there is a gigantic sum of cheese (+1000).

However, if we only focus on exploitation, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit the nearest source of rewards, even if this source is small (exploitation).

But if our agent does a little bit of exploration, it can discover the big reward (the pile of big cheese).

This is what we call the exploration/exploitation trade-off. We need to balance how much we explore the environment and how much we exploit what we already know about it.

Therefore, we must define a rule that helps to handle this trade-off.

If it's still confusing, think of a real problem: choosing a restaurant.

  • Exploitation: you go every day to the same restaurant that you know is good and risk missing another, better one.
  • Exploration: you try restaurants you have never been to before, with the risk of a bad experience but also the chance of an amazing one.
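One simple and common rule for handling this trade-off is ε-greedy action selection: with probability ε the agent explores (a random action), otherwise it exploits (the best-known action). A minimal sketch, assuming we already have value estimates for each action:

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """action_values: dict mapping action -> estimated value."""
    if random.random() < epsilon:
        # Exploration: try a random action to gather more information.
        return random.choice(list(action_values))
    # Exploitation: pick the action with the highest estimated value.
    return max(action_values, key=action_values.get)

values = {"left": 0.2, "right": 1.5, "jump": 0.7}   # hypothetical estimates
print(epsilon_greedy(values, epsilon=0.1))           # usually "right", sometimes random
```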

3. The two main approaches for solving RL problems

Now that we have learned the RL framework, how do we solve the RL problem? In other words, how do we build an RL agent that can select the actions that maximize its expected cumulative reward?

3.1. The Policy π: the agent’s brain

The Policy π is the brain of our agent: it's the function that tells us what action to take given the state we are in. So it defines the agent's behavior at a given time.

This policy is the function we want to learn. Our goal is to find the optimal policy π*, the policy that maximizes the expected return when the agent acts according to it. We find this π* through training.

There are two approaches to train our agent to find this optimal policy π*:

  • Directly, by teaching the agent to learn which action to take, given the state it is in: Policy-Based Methods.
  • Indirectly, by teaching the agent to learn which states are more valuable and then take the actions that lead to the more valuable states: Value-Based Methods.

3.2. Policy-Based Methods

In Policy-Based Methods, we learn a policy function directly.

This function maps each state to the best corresponding action at that state, or to a probability distribution over the set of possible actions at that state.

We have two types of policies:

  • Deterministic: a policy that, at a given state, will always return the same action.
  • Stochastic: a policy that outputs a probability distribution over actions.
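The two flavours can be sketched in Python like this (the states, actions, and probabilities are invented purely for illustration):

```python
import random

# Deterministic policy: the same state always maps to the same action.
deterministic_policy = {"s0": "right", "s1": "jump"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {"s0": {"right": 0.7, "left": 0.1, "jump": 0.2}}

def act_stochastic(state):
    dist = stochastic_policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s0"))  # always "right"
print(act_stochastic("s0"))     # "right" about 70% of the time
```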

3.3. Value-Based Methods

In value-based methods, instead of training a policy function, we train a value function that maps a state to the expected value of being in that state.

The value of a state is the expected discounted return the agent can get if it starts in that state and then acts according to our policy.

“Act according to our policy” just means that our policy is “going to the state with the highest value”.

Here we see that our value function defines a value for each possible state.
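As a toy illustration of acting greedily with respect to a state-value function, here is a sketch in which the value table and the transition function are entirely hypothetical:

```python
# Hypothetical state values, e.g. learned for a small grid world.
state_values = {"A": 0.2, "B": 0.5, "C": 0.9}

def greedy_action(state, actions, next_state_fn):
    """Pick the action whose successor state has the highest estimated value.

    next_state_fn(state, action) returns the state the action leads to
    (assumed known and deterministic for this toy example)."""
    return max(actions, key=lambda a: state_values[next_state_fn(state, a)])

# Toy transitions: from A, "up" leads to B and "right" leads to C.
transitions = {("A", "up"): "B", ("A", "right"): "C"}
best = greedy_action("A", ["up", "right"], lambda s, a: transitions[(s, a)])
print(best)  # "right", because C has the highest value (0.9)
```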

4. The “Deep” in Reinforcement Learning

Wait… you spoke about Reinforcement Learning, but why do we speak about Deep Reinforcement Learning?

Deep Reinforcement Learning introduces deep neural networks to solve Reinforcement Learning problems — hence the name “deep.”

For instance, we'll work on Q-Learning (classic Reinforcement Learning) and then on Deep Q-Learning; both are value-based RL algorithms.

You'll see that the difference is that, in the first approach, we use a traditional algorithm to create a Q-table that helps us find what action to take for each state.

In the second approach, we use a neural network to approximate the Q-value.
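Schematically, the difference looks like this. This is only a sketch: the table size, the observation dimension, and the tiny untrained network are made up for illustration:

```python
import numpy as np

# Classic Q-Learning: a table with one Q-value per (state, action) pair.
n_states, n_actions = 16, 4
q_table = np.zeros((n_states, n_actions))
best_action = int(np.argmax(q_table[3]))        # best action for state 3

# Deep Q-Learning: a neural network approximates Q(state) -> one value per action.
# Minimal untrained two-layer network, just to show the shape of the idea.
rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=(4, 32)), rng.normal(size=(32, n_actions))

def q_network(state_vector):
    hidden = np.maximum(0, state_vector @ w1)   # ReLU hidden layer
    return hidden @ w2                          # one Q-value per action

state = rng.normal(size=4)                      # e.g. a 4-dimensional observation
best_action_nn = int(np.argmax(q_network(state)))
```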

5. Summary

  • Reinforcement Learning is a computational approach to learning from actions. We build an agent that learns from the environment by interacting with it through trial and error and receiving rewards (negative or positive) as feedback.
  • The goal of any RL agent is to maximize its expected cumulative reward (also called expected return) because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected cumulative reward.
  • The RL process is a loop that outputs a sequence of state, action, reward and next state.
  • To calculate the expected cumulative reward (expected return), we discount the rewards: the rewards that come sooner (at the beginning of the game) are more probable to happen since they are more predictable than the long term future reward.
  • To solve an RL problem, you want to find an optimal policy. The policy is the "brain" of your AI that tells you what action to take given a state. The optimal one is the one that gives you the actions that maximize the expected return.
  • There are two ways to find your optimal policy:
    • By training your policy directly: policy-based methods.
    • By training a value function that tells us the expected return the agent will get at each state and use this function to define our policy: value-based methods.
  • Finally, we speak about Deep RL because we introduce deep neural networks to estimate the action to take (policy-based) or to estimate the value of a state (value-based), hence the name "deep."

References:

  • Playing Atari with Deep Reinforcement Learning: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
  • Gym Retro: https://github.com/openai/retro
  • Unity ML-Agents Toolkit: https://github.com/Unity-Technologies/ml-agents/
  • TF-Agents: https://github.com/tensorflow/agents
  • The MineRL Python Package: https://github.com/minerllabs/minerl
  • A Free course in Deep Reinforcement Learning from beginner to expert: https://simoninithomas.github.io/deep-rl-course/#syllabus
  • An Introduction to Deep Reinforcement Learning: https://thomassimonini.medium.com/an-introduction-to-deep-reinforcement-learning-17a565999c0c
  • Exploration: http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_exploration.pdf
  • MIT Introduction to Deep Learning: http://introtodeeplearning.com/
  • Playing Super Mario Bros. With Deep Reinforcement Learning: https://github.com/Kautenja/playing-mario-with-deep-reinforcement-learning
  • A Free course in Deep Reinforcement Learning from beginner to expert: https://simoninithomas.github.io/deep-rl-course/
  • Simple Reinforcement Learning with Tensorflow Part 8: Asynchronous Actor-Critic Agents (A3C): https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2
  • Playing Mario with Deep Reinforcement Learning: https://github.com/aleju/mario-ai

