Reprint notice
This article is original content from 灯塔大数据. Individuals are welcome to share it to their WeChat Moments; other organizations reprinting it must note at the top of the article: "转自:灯塔大数据;微信:DTbigdata" (Reprinted from: 灯塔大数据; WeChat: DTbigdata).
Overview:
Artificial intelligence (AI) systems have already given robots the ability to grasp and manipulate objects with humanlike dexterity, and now researchers say they have developed an algorithm through which machines can learn to walk on their own.
In a preprint paper published on Arxiv.org ("Learning to Walk via Deep Reinforcement Learning"), scientists from the University of California, Berkeley and Google Brain describe an AI system that "taught" a quadrupedal robot to traverse both familiar and unfamiliar terrain.
"Deep reinforcement learning can be used to automate the acquisition of controllers for a range of robotic tasks, enabling end-to-end learning of policies that map sensory inputs to low-level actions," the paper's authors explain. "If we can learn locomotion gaits from scratch directly in the real world, we can in principle acquire controllers that are ideally adapted to each robot and even to individual terrains, potentially achieving better agility, energy efficiency, and robustness."
The design challenge was twofold. Reinforcement learning, an AI training technique that uses rewards or punishments to drive agents toward goals, requires lots of data, in some cases tens of thousands of samples, to achieve good results. And fine-tuning a robotic system's hyperparameters, i.e., the parameters that determine its structure, usually requires multiple training runs, which can damage legged robots over time.
"Deep reinforcement learning has been used extensively to learn locomotion policies in simulation, and even to transfer them to real-world robots, but this inevitably incurs some loss of performance due to discrepancies in the simulation, and requires extensive manual modeling," the paper's authors point out. "Using such algorithms in the real world has proven challenging."
In pursuit of a method that would, in the researchers' words, "[make it] feasible for a system to learn locomotion skills" without simulated training, they tapped a reinforcement learning (RL) framework known as "maximum entropy RL." Maximum entropy RL optimizes the learned policy to maximize both the expected return and the expected entropy, a measure of randomness in the data being processed. In RL, an AI agent continuously searches for an optimal course of action, that is, a trajectory of states and actions, by sampling actions from a policy and receiving rewards. Maximum entropy RL incentivizes the policy to explore more widely; a parameter, the temperature, determines the relative importance of entropy against the reward, and therefore how random the policy is.
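As a rough illustration of that objective, the sketch below computes an entropy-regularized return for one sampled trajectory. It is a minimal Python sketch written for this article, not the paper's code: the per-step rewards and log-probabilities are assumed to come from a rollout, and `alpha` stands in for the temperature mentioned above.

```python
import numpy as np

def max_entropy_return(rewards, log_probs, alpha):
    """Entropy-regularized return of one sampled trajectory.

    rewards:   per-step rewards collected along the rollout
    log_probs: log pi(a_t | s_t) of the actions actually taken,
               so -log_probs is a one-sample estimate of the entropy
    alpha:     temperature weighting entropy against reward
    """
    rewards = np.asarray(rewards, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    # Ordinary return plus the temperature-weighted entropy bonus.
    return float(np.sum(rewards - alpha * log_probs))
```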
It wasn't all sunshine and rainbows, at least not at first. Because the trade-off between entropy and reward is directly affected by the scale of the reward function, which in turn affects the learning rate, the scaling factor normally has to be tuned for each environment. The researchers' solution was to adjust the temperature and reward scale automatically, in part by alternating between two phases: a data collection phase and an optimization phase.
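One common way to automate the temperature, used for example in soft actor-critic style methods, is to treat it as a learnable parameter that gradient descent pushes toward a target entropy. The sketch below only illustrates that idea; the variable names, learning rate, and `target_entropy` are assumptions chosen for the example, not values from the paper.

```python
import torch

# Learn log(alpha) rather than alpha itself so the temperature stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_temperature(log_probs, target_entropy):
    """Nudge the temperature so the policy's entropy tracks a target.

    log_probs:      log pi(a|s) for actions sampled from the current policy
    target_entropy: desired entropy level (e.g. minus the action dimension)
    """
    # When the policy's entropy falls below the target, this loss drives
    # alpha up (more exploration); when entropy exceeds it, alpha shrinks.
    alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```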
The results spoke for themselves. In experiments in OpenAI's Gym, an open source simulated environment for training and testing AI agents, the authors' model achieved "practically identical" or better performance compared with the baselines across four continuous locomotion tasks (HalfCheetah, Ant, Walker, and Minitaur).
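For readers unfamiliar with Gym, the benchmark tasks named above are driven by a simple reset/step loop like the one sketched below, shown here with a random policy as a placeholder; the environment ID and the exact API signature depend on the installed gym version (this uses the classic pre-0.26 interface).

```python
import gym

# Minimal rollout loop for one of the benchmark tasks mentioned above.
# Swap the random action for a trained policy's output to evaluate
# a learned controller.
env = gym.make("HalfCheetah-v2")
obs = env.reset()
done = False
episode_return = 0.0
while not done:
    action = env.action_space.sample()  # placeholder for policy(obs)
    obs, reward, done, info = env.step(action)
    episode_return += reward
print("episode return:", episode_return)
```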
And in a second, real-world test, the researchers applied their model to a four-legged Minitaur, a robot with eight actuators, motor encoders that measure motor angles, and an inertial measurement unit (IMU) that measures orientation and angular velocity.
They developed a pipeline consisting of (1) a computer workstation that updated the neural networks, downloaded data from the Minitaur, and uploaded the latest policy, and (2) an Nvidia Jetson TX2 onboard the robot that executed that policy, collected data, and uploaded the data to the workstation over Ethernet. After 160,000 steps over two hours, with an algorithm that rewarded forward velocity and penalized "large angular accelerations" and pitch angles, they successfully trained the Minitaur to walk on flat terrain, over obstacles such as wooden blocks, and up slopes and steps, none of which were present at training time.
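In the spirit of that description, a reward of this shape might look like the sketch below. The function name and weighting coefficients are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def locomotion_reward(forward_velocity, angular_acceleration, pitch,
                      accel_weight=0.01, pitch_weight=0.1):
    """Illustrative reward: encourage forward velocity, discourage
    large angular accelerations and large pitch angles.

    forward_velocity:     scalar velocity along the walking direction
    angular_acceleration: array of angular accelerations (e.g. from the IMU)
    pitch:                scalar pitch angle of the body
    """
    accel_penalty = accel_weight * float(np.sum(np.square(angular_acceleration)))
    pitch_penalty = pitch_weight * float(pitch) ** 2
    return float(forward_velocity) - accel_penalty - pitch_penalty
```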
"To our knowledge, this experiment is the first example of a deep reinforcement learning algorithm learning underactuated quadrupedal locomotion directly in the real world without any simulation or pretraining," the researchers wrote.
Original article
This AI teaches robots how to walk
Artificially intelligent (AI) systems have imbued robots with the ability to grasp and manipulate objects with humanlike dexterity, and now, researchers say they've developed an algorithm through which machines might learn to walk on their own. In a preprint paper published on Arxiv.org ("Learning to Walk via Deep Reinforcement Learning"), scientists from the University of California, Berkeley and Google Brain, one of Google's artificial intelligence (AI) research divisions, describe an AI system that "taught" a quadrupedal robot to traverse terrain both familiar and unfamiliar.
“Deep reinforcement learning can be used to automate the acquisition of controllers for a range of robotic tasks, enabling end-to-end learning of policies that map sensory inputs to low-level actions,” the paper’s authors explain. “If we can learn locomotion gaits from scratch directly in the real world, we can in principle acquire controllers that are ideally adapted to each robot and even to individual terrains, potentially achieving better agility, energy efficiency, and robustness.”
The design challenge was twofold. Reinforcement learning — an AI training technique that uses rewards or punishments to drive agents toward goals — requires lots of data, in some cases tens of thousands of samples, to achieve good results. And fine-tuning a robotic system’s hyperparameters — i.e., the parameters that determine its structure — usually necessitates multiple training runs, which can damage legged robots over time.
“Deep reinforcement learning has been used extensively to learn locomotion policies in simulation, and even transfer them to real-world robots, but this inevitably incurs some loss of performance due to discrepancies in the simulation, and requires extensive manual modeling,” the paper’s authors point out. “Using such algorithms … in the real world has proven challenging.”
In pursuit of a method that would, in the researchers’ words, “[make it] feasible for a system to learn locomotion skills” without simulated training, they tapped a framework of reinforcement learning (RL) known as “maximum entropy RL.” Maximum entropy RL optimizes learning policies to maximize both the expected return and expected entropy, or the measure of randomness in the data being processed. In RL, AI agents continuously search for an optimal path of actions — that is to say, a trajectory of states and actions — by sampling actions from policies and receiving rewards. Maximum entropy RL incentivizes policies to explore more widely; a parameter — say, temperature — determines the relative importance of entropy against the reward, and therefore its randomness.
It wasn’t all sunshine and rainbows — at least not at first. Because the trade-off between entropy and the reward is directly affected by the scale of the reward function, which in turn affects the learning rate, the scaling factor normally has to be tuned per environment. The researchers’ solution was to automate the temperature and reward scale adjustment, in part by alternating between two phases: a data collection phase and an optimization phase.
The results spoke for themselves. In experiments in OpenAI’s Gym, an open source simulated environment for training and testing AI agents, the authors’ model achieved “practically identical” or better performance compared to the baseline across four continuous locomotion tasks (HalfCheetah, Ant, Walker, and Minitaur).
And in a second, real-world test, the researchers applied their model to a four-legged Minitaur, a robot with eight actuators, motor encoders that measure motor angles, and an inertial measurement unit (IMU) that measures orientation and angular velocity.
They developed a pipeline consisting of (1) a computer workstation that updated the neural networks, downloaded data from the Minitaur, and uploaded the latest policy; and (2) an Nvidia Jetson TX2 onboard the robot that executed said policy, collected data, and uploaded the data to the workstation via Ethernet. After 160,000 steps over two hours with an algorithm that rewarded forward velocity and penalized “large angular accelerations” and pitch angles, they successfully trained the Minitaur to walk on flat terrain, over obstacles like wooden blocks, and up slopes and steps — none of which were present at training time.
“To our knowledge, this experiment is the first example of a deep reinforcement learning algorithm learning underactuated quadrupedal locomotion directly in the real world without any simulation or pretraining,” the researchers wrote.
Article editor: 思加