Updated: 2023-05-24 22:46
In the early hours of October 19, 2017, in a research paper published in the journal Nature, Google subsidiary DeepMind reported a new version of its program, AlphaGo Zero: starting from a blank slate, with no human input at all, it rapidly taught itself Go and beat its predecessor 100:0. Specifically, after 3 days of training it defeated AlphaGo Lee by that 100:0 score, and after 40 days of training it defeated AlphaGo Master.
"Discarding human experience" and "self-training" are not AlphaGo Zero's biggest highlights. The key is that it adopts a new reinforcement learning algorithm, and in doing so pushes that class of algorithms forward.
AlphaGo Zero ran on only 4 TPUs and used zero human game data; it trained itself for just 3 days, playing 4.9 million games of self-play, yet it defeated its predecessor 100:0.
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
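The abstract above is the heart of the method: MCTS self-play produces, for each position, a search probability distribution π and an eventual game outcome z, and a single network with a policy head p and a value head v is trained toward both targets. The paper gives the loss as l = (z − v)² − πᵀ log p + c‖θ‖². Below is a minimal sketch of that loss, assuming PyTorch; the function name, tensor shapes, and dummy data are illustrative assumptions, not DeepMind's actual code.

```python
import torch
import torch.nn.functional as F

def alphago_zero_loss(policy_logits, value, mcts_pi, outcome_z):
    """AlphaGo Zero's combined loss: l = (z - v)^2 - pi^T log p (+ c * ||theta||^2).

    policy_logits: (batch, num_moves) raw policy-head output
    value:         (batch,)           value-head prediction in [-1, 1]
    mcts_pi:       (batch, num_moves) MCTS visit-count distribution (training target)
    outcome_z:     (batch,)           final self-play result (+1 win, -1 loss)
    """
    # value head learns to predict the eventual game outcome
    value_loss = F.mse_loss(value, outcome_z)
    # policy head learns to match the (stronger) MCTS search probabilities
    policy_loss = -(mcts_pi * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    # the paper's L2 term c * ||theta||^2 is typically applied via the
    # optimizer's weight_decay rather than added here
    return value_loss + policy_loss

# Example with dummy data for a 19x19 board plus a pass move (362 actions).
batch, num_moves = 8, 19 * 19 + 1
p = torch.randn(batch, num_moves)                         # fake policy logits
v = torch.tanh(torch.randn(batch))                        # fake value predictions
pi = torch.softmax(torch.randn(batch, num_moves), dim=1)  # fake MCTS targets
z = torch.sign(torch.randn(batch))                        # fake game outcomes
print(alphago_zero_loss(p, v, pi, z))
```

Training the network on these targets makes the next iteration's tree search stronger, which in turn produces better π and z targets; this is the self-reinforcing loop the abstract describes.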