by Vukosi Marivate
Imagine trying to teach a machine how to play chess. There are numerous small steps to teach: strategies to start the game, strategies to play the middle of the game and strategies to end the game. A strategy is the action to take in a given situation, and the combinations of these strategies multiply the possible ways a game can be played. Now imagine trying to find the best treatment strategy for a patient who has a chronic disease such as diabetes. There are multiple drugs available, and patients differ in lifestyle and circumstance. There may also be multiple sequential treatment decisions to make. Problems such as these lend themselves well to a family of mathematical tools known as Reinforcement Learning. Reinforcement Learning involves creating computer programs that learn the best strategies in different applications. Furthermore, these programs can evaluate how well a given strategy would do in a given context, e.g. if we give a certain drug (the drug being the strategy) to a patient, the program can evaluate whether it will improve her health in the long term.
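To make "learning the best strategy" concrete, here is a minimal sketch of one classic Reinforcement Learning algorithm (tabular Q-learning) on a made-up toy problem: a corridor of five squares where the only payoff comes from reaching the last square. The environment, its dynamics and all the numbers are invented for illustration; real problems like chess or treatment planning are vastly larger.

```python
import random

# Hypothetical toy "game": a corridor of 5 squares. The agent starts
# at square 0 and earns a payoff of +1 only when it reaches square 4.
N_STATES = 5
ACTIONS = [0, 1]  # 0 = move left, 1 = move right

def step(state, action):
    """Return (next_state, reward) for this toy environment."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a value for every (square, move) pair by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:
            # Mostly follow the current best guess; sometimes explore.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[s][act])
            s2, r = step(s, a)
            # Nudge the value of (s, a) toward reward plus discounted future value.
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = q_learning()
# The learned strategy: the best move in each square (1 = right).
policy = [max(ACTIONS, key=lambda act: q[s][act]) for s in range(N_STATES)]
print(policy)
```

The learned policy chooses "right" in every non-goal square, which is indeed the best strategy in this corridor. The key idea is that no one told the program this; it discovered it from payoffs alone.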
In Reinforcement Learning, the goal or payoff may lie multiple steps in the future, so it is not immediately clear which strategy will lead to the best result. Returning to the diabetes example, the long-term health effects of a treatment strategy may only become clear years into the treatment. This is a departure from the majority of Machine Learning tools, where decisions have an immediate goal, such as correctly recognising a face in a picture. Instead, RL is closer to the promise of Artificial Intelligence: creating an agent that interacts with an environment and acts rationally to meet a certain goal (described through what we call a “payoff”).
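The standard way RL formalises "payoff multiple steps in the future" is the discounted return: rewards further away count less, by a factor gamma per step. A tiny worked example, with a made-up reward sequence in which the entire benefit of a treatment only appears at the final step:

```python
# Discounted return: rewards far in the future count less.
# gamma (the discount factor) is a standard RL quantity between 0 and 1;
# the reward sequence below is invented for illustration.
def discounted_return(rewards, gamma=0.9):
    return sum(r * gamma**t for t, r in enumerate(rewards))

# A treatment whose benefit of 10 only appears three steps later
# is worth 10 * 0.9**3 today:
print(discounted_return([0, 0, 0, 10], gamma=0.9))  # ≈ 7.29
```

Comparing these discounted sums is how an RL program can trade off a small immediate benefit against a larger one far in the future.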
The majority of research in this area focuses on creating programs that interact directly with the system they are learning about. Such a program would play chess games directly against different players and learn from the outcomes of its own moves. An interesting part of this research is where the program instead learns from interaction data that was collected previously, not from the system directly. Returning to the chess example, the program would be given the records of numerous chess tournament games and would have to learn or evaluate strategies from them. We term this learning offline. It gives the program access to much more data: it is hard to play a million games against willing human players directly. The price we pay for this data is that the Reinforcement Learning program is not able to explore, i.e. try out moves not contained in the pre-collected data. A key question to ask, therefore, is whether this can be problematic.
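The offline setting can be sketched in a few lines: the learner only ever sees a fixed log of (situation, move, payoff, next situation) records and repeatedly sweeps over it, never trying a move of its own. The tiny two-state log below is entirely hypothetical; this is batch Q-learning in miniature, not any particular published system.

```python
# Hypothetical log of transitions (state, action, reward, next_state),
# collected beforehand. None marks the end of a game. The learner
# never interacts with the environment; it only replays this log.
log = [
    (0, 'right', 0.0, 1),
    (1, 'right', 1.0, None),
    (0, 'left',  0.0, 0),
    (1, 'left',  0.0, 0),
]

GAMMA = 0.9
ACTIONS = ('left', 'right')
q = {(s, a): 0.0 for s, a, _, _ in log}

# Repeatedly sweep the fixed log, backing up values (batch Q-learning).
for _ in range(50):
    for s, a, r, s2 in log:
        future = 0.0 if s2 is None else max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += 0.5 * (r + GAMMA * future - q[(s, a)])

# The best strategy recoverable from this particular log.
best = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (0, 1)}
print(best)  # 'right' in both states
```

Notice that the program can only rank moves that actually appear in the log; a move nobody ever tried simply has no entry to learn from, which is exactly the limitation discussed next.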
Imagine a situation where the chess data was collected from players who only play a limited set of strategies. If we then used the strategy learned by the Reinforcement Learning program against a new human player who uses a strategy the program had never seen, the program might lose a lot of games. A potential remedy would be to collect a diverse set of strategies to be used for learning. If the learning were online, this would be easier, as we can design algorithms that explore parts of the problem world the program does not have enough information about. In the offline learning scenario, this is not possible; the program is limited by the collected data.
Pre-collected data may also be limited in its exploration by other constraints. In healthcare, for example, how diverse the data is may be constrained by healthcare guidelines, ease of access or the cost of medication, so exploration of alternative approaches to treatment may be limited. Secondly, healthcare data that covers long-term treatments, while getting easier to obtain thanks to the sharing of Electronic Health Records (see IBM’s Watson application to healthcare), may still be limited in size and in the diversity of treatments. Another layer of complexity comes from the uncertain nature of the outcomes: the same treatment may have completely different outcomes for patients who look similar (similar demographics and disease history). There is thus underlying uncertainty about the effects of drugs on patients too. Now imagine how great the uncertainty might be when a patient undergoes a series of different treatments!
A similar situation can be observed in education. The path students take through classes, offline or online, normally follows a linear order: the student works through a section, completes the assignment and is then evaluated in a test. The order in which this learning occurs does not change, nor is there room to branch out into alternative ways of learning a concept. Data from online platforms such as Massive Open Online Courses (MOOCs) is slowly becoming available, but the diversity of approaches to teaching is still restricted. In these situations, we would like our program’s evaluation methodologies to identify the limitations of such constraints for us.
In our research, we focus on evaluation methods for Reinforcement Learning. In particular, for the “offline” situation, we have been investigating ways to create evaluation methods that better express the uncertainty we have about the outcomes of different strategies, given that our pre-collected data may be limited and noisy. We believe that with these types of techniques we will be able to harness the large amounts of data that are already being collected in fields like medicine and education. Reinforcement Learning techniques also bring a unique perspective to these fields by focusing on the long-term effects of strategies.
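One standard way to evaluate a new strategy from a fixed log, and to express the uncertainty of that evaluation, is importance sampling combined with the bootstrap. The sketch below is a minimal illustration of that general idea, not our specific method: the logged episodes, the two actions 'A' and 'B' and all the probabilities and payoffs are invented.

```python
import random

rng = random.Random(0)

# Hypothetical log: each episode records the action taken, the
# probability the logging (behaviour) strategy gave that action,
# and the observed payoff.
def simulate_log(n=2000):
    episodes = []
    for _ in range(n):
        a = 'A' if rng.random() < 0.7 else 'B'      # logger prefers A
        behaviour_p = 0.7 if a == 'A' else 0.3
        payoff = rng.gauss(1.0 if a == 'B' else 0.2, 0.5)  # B is actually better
        episodes.append((a, behaviour_p, payoff))
    return episodes

def is_estimate(episodes, target_p):
    """Importance sampling: reweight logged payoffs by how much more
    (or less) often the target strategy would have taken each action."""
    return sum(target_p[a] / bp * r for a, bp, r in episodes) / len(episodes)

def bootstrap_interval(episodes, target_p, n_boot=200):
    """Resample the log with replacement to express uncertainty."""
    estimates = sorted(
        is_estimate([rng.choice(episodes) for _ in episodes], target_p)
        for _ in range(n_boot))
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot)]

log = simulate_log()
always_b = {'A': 0.0, 'B': 1.0}   # target strategy: always choose B
est = is_estimate(log, always_b)
lo, hi = bootstrap_interval(log, always_b)
print(f"estimated payoff of 'always B': {est:.2f} "
      f"(95% interval {lo:.2f} to {hi:.2f})")
```

The point is not the estimate alone but the interval around it: with a small or skewed log the interval widens, warning us that the data cannot support a confident conclusion about the new strategy.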
- Upcoming online course on Reinforcement Learning on Udacity: Machine Learning 3—Reinforcement Learning
- Fun: Creating a program to play Flappy Bird: Flappy Bird Reinforcement Learning
Vukosi Marivate is a 2009 fellow of the Fulbright Science & Technology Award, from South Africa, and a PhD candidate in Computer Science at Rutgers University.