Back To The Future: AlphaGo, Monte Carlo Tree Search, and Simulation-based Algorithms for Markov Decision Processes
Back To The Future: AlphaGo, Monte Carlo Tree Search, and Simulation-based Algorithms for Markov Decision Processes 
2017 / 06 / 23 / PM 4:00 - 5:00
Location: 301 - 204
Speaker: Prof. MICHAEL C. FU
MICHAEL C. FU holds the Smith Chair of Management Science in the Decision, Operations and Information Technologies Department of the Robert H. Smith School of Business, with a joint appointment in the Institute for Systems Research and an affiliate appointment in the Department of Electrical & Computer Engineering (both in the Clark School of Engineering), all at the University of Maryland. He received an S.B. in mathematics and S.B./S.M. in EECS from MIT in 1985 and a Ph.D. in applied mathematics from Harvard in 1989. His research interests include simulation optimization and applied probability, particularly with applications towards supply chain management and financial engineering. In 2004 he was named a Distinguished Scholar-Teacher at the University of Maryland. He is a Fellow of INFORMS and IEEE.
In March 2016 in Seoul, Korea, Google DeepMind's AlphaGo, a computer Go-playing program, defeated the reigning human world champion Go player, 4-1, a feat far more impressive than previous victories by computer programs in chess (IBM's Deep Blue) and Jeopardy (IBM's Watson). AlphaGo combines several machine learning approaches, specifically deep neural networks with a technique called Monte Carlo tree search. Current versions of Monte Carlo tree search used in Go-playing algorithms trace their roots back to the adaptive multi-stage sampling simulation-based algorithm for estimating value functions in finite-horizon Markov decision processes (MDPs) introduced in a 2005 Operations Research article by Chang et al., which was the first to use Upper Confidence Bounds (UCBs) for Monte Carlo simulation-based solution of MDPs. UCBs are well known in the machine learning community for addressing the well-known exploration-exploitation tradeoff that arises in multi-armed bandit problems. After highlighting the high-level structure behind AlphaGo, we review the main ideas in UCB-based Monte Carlo tree search by describing the simulation-based MDP algorithm of Chang et al. (2005) and by using examples on decision trees and simpler games.