Playing Atari Games in OCaml

At Jane Street, we enjoy using OCaml for lots of different things, from FPGA designs to web development. When it comes to machine learning, Python is one of the most commonly used languages. Machine learning frameworks such as TensorFlow or PyTorch wrap highly efficient C++ and GPU implementations of tensor operations in easy-to-use Python APIs. These frameworks also provide automatic differentiation functionality, which is commonly used to train deep learning models.

In this talk we will see how we can leverage TensorFlow or PyTorch directly from OCaml so that we can use our favorite programming language to build deep learning models and train them on GPUs. We will consider the reinforcement learning setting, where an agent is trained to play Atari video games such as Space Invaders or Breakout. Our agents will be based on the Deep Q-Learning approach introduced in 2013.

Transcript

As Ron said, my name is Laurent, and I will be talking about how we can use machine learning to play Atari games. As we like to do everything at Jane Street with OCaml, we will do machine learning using OCaml too.

I will not assume that you know that much about machine learning. Because of that, we will first look at the classical setting, which is supervised learning, before looking at reinforcement learning, which is what we will actually use to play Atari games. For both sections, again, we will illustrate things with OCaml and have nice types everywhere.

First, the supervised learning setting. In supervised learning, you are given some data set that has some labels attached, and you want to build an algorithm that, given some examples from your data set, is able to produce the labels back. A typical example is image classification. In that case, you have access to some images, here of pythons and of camels, and you have to decide whether these are images of pythons or whether these are images of camels. That sounds like something super straightforward to do, but it turns out that if you’re asked to write an algorithm to actually do that, it’s really complicated. Actually finding the rules that define what is a python and what is a camel is not straightforward at all. If you had asked someone from the computer vision community 10 years ago whether we would be able to do that with an accuracy of, say, 90%, they would have told you that no, that’s a very difficult problem. Even with lots of [inaudible] features, we are not able to do so. It turns out that nowadays, if you release a deep learning model that only achieves 99.5% accuracy on this task, that’s not a very good model. You can do far better than that, and you can actually do better than what a human being is doing.

Let’s have a look at some OCaml code to illustrate that. What we want to build is a classifier function. That is a function that goes from images to camel or python; the result can be either a camel or a python. To build this classifier, you have access to lots of labeled samples, that is, a huge collection of images and the labels that go with the images.
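
To make that concrete, here is a minimal sketch of the types involved; the image type is left abstract, and the exact shapes here are illustrative rather than the ones from the actual code.

    type label = Camel | Python

    (* An image, left abstract here; in practice it is a tensor of pixel values. *)
    type image

    (* A classifier maps an image to a label. *)
    type classifier = image -> label

    (* The labeled samples: a big collection of images with their labels. *)
    type samples = (image * label) list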

We first have to define the thing that we care about: what is a good classifier? We want the classifier to have the best possible accuracy, and that will be measured by what is called the loss function. The loss function takes as input the predictions that your classifier has produced. It also takes as input the [inaudible], that is, the labels that actually belong to that data set. And it returns some kind of metric, a float, that represents the score of your classifier. Typically what you want to do is minimize the loss. That is, you want to find the classifier that has the lowest possible loss.
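
As a rough sketch, a loss with the signature described above could look like this; for illustration it just computes the fraction of wrong predictions, whereas in practice a differentiable loss is used instead.

    (* Score a list of predictions against the ground-truth labels; lower is better. *)
    let loss ~predictions ~targets =
      let errors =
        List.map2 (fun p t -> if p = t then 0. else 1.) predictions targets
      in
      List.fold_left ( +. ) 0. errors /. float_of_int (List.length errors)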

You could say that a very simple solution is just to embed all the data in your classifier. When you’re given a new image, you compare it, for some distance, to all the images in your original data set, and you return the label of the image for which the distance is minimal. Actually, that would work very well on the data set that you used for training, but it will not generalize well. If you try with a new image of a python or a camel that’s taken from the same distribution but that wasn’t in the original data set, it would actually probably not work. That is because the distance that you used is not a proper distance. It might be, let’s say, L2, and you will find that a python and a camel are close in L2 even though they might, to a human being, look pretty different. What you would like to have is some kind of perceptual loss, some kind of distance metric that captures what humans care about, but that’s quite tricky to build.

One thing that we will also do is split the data set: we’ll split the training data set, which will be used to build your classifier algorithm, from the validation data set, which we will use to measure the performance. Because, as we just mentioned, if you measure the performance on the thing that you used to build your algorithm, it’s very easy to have perfect performance, and still it might not work well in the wild.

A difficult point, now that we know what we want to optimize, is that the set of possible algorithms is a discrete set that is very, very large. And again, we would like to find the best algorithm. As it’s discrete, it makes the optimization problem pretty much intractable, so the first thing that we will want to do is move to a more continuous space. There we will have optimization techniques that are far easier to use. That’s why, rather than finding the optimal classifier over the set of all algorithms, we’ll say that our classifier actually takes some parameters, takes an image, and again, has to return the label. The parameters used by your model will be lots of floating point variables, lots of real numbers. Again, what we will want to do is find the minimum of the loss over the set of parameters. Here, we denote the parameters theta.

In order to find the parameters, the machine learning approach is, first, try some random parameter values and initialize your algorithm with them. After that, you run your classifier algorithm and you compute the loss on the predictions that your classifier has produced. For now, they are random. And you adjust the parameters a tiny bit based on this loss value each time. For that, we use what is called gradient descent, and the idea there is just to follow the derivative. If you want to find the minimum, you just have to compute the slope, find the direction where the slope goes downhill the fastest, and you follow that. That’s illustrated in this small bit of OCaml code. We just extract some images and some labels from our sample data set, initialize some parameters at random, and then we write an optimization algorithm that goes as follows: compute the predictions, compute the loss of the predictions compared to the actual [inaudible] labels. Once you have the loss, compute the gradient and adjust the parameters by subtracting a small fraction of the gradient.
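
In pseudo-OCaml, the loop just described looks something like this; classifier, compute_loss, and grad are placeholders for things the framework will provide in the next slides, and the tensor operators follow the ocaml-torch style.

    (* One pass of gradient descent: adjust the parameters [theta] by a small
       fraction [alpha] of the gradient, and repeat. *)
    let rec train theta ~images ~labels ~alpha ~steps =
      if steps = 0 then theta
      else
        let predictions = classifier theta images in
        let loss = compute_loss ~predictions ~labels in
        (* [grad] gives d(loss)/d(theta); the framework computes it for us. *)
        let gradient = grad loss ~with_respect_to:theta in
        let theta = Tensor.(theta - f alpha * gradient) in
        train theta ~images ~labels ~alpha ~steps:(steps - 1)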

This parameter alpha is called the learning rate. If you take a very small alpha, it’s nice because that way you’re sure to make some very small progress at each step, but it might take forever to actually reach the minimum. And of course, if you take a larger alpha, then it’s faster to get to the minimum, but you might overshoot it and just bump around, not being able to go down the loss function.

The grad function that we are using here is actually not really an OCaml function. It’s not something that takes a float value and returns a float value. It’s more something that takes the computation that you’ve used to go from the parameters to the loss value, and is able to compute the gradient. For this we will rely on PyTorch in the following slides, but we could also rely on things like TensorFlow or Chainer, or other modern machine learning frameworks. This is what is called back propagation.

So, PyTorch. PyTorch is a tensor library that was open sourced by Facebook some time ago. It primarily has a Python API, and it provides you with an optimized set of tensor/matrix operations. The nice thing is they all have a very efficient GPU implementation. What we will do is call this PyTorch framework from OCaml. Actually we will not go through the Python part but go directly to the C++ implementation. Here you see a very simple sample of PyTorch code being called from OCaml. We build a tensor out of a list with the values 1, -1, and 2. Actually [inaudible]. After that we can do any kind of operation, like take the absolute value, add one, and you just see the results there, and there. That’s actually taken from a notebook that runs in OCaml.
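
With the ocaml-torch bindings, the example on the slide looks roughly like this; this is from memory, so the exact function names may differ slightly.

    (* Build a 1-d tensor from a list of floats. *)
    let t = Tensor.float_vec [ 1.; -1.; 2. ]

    (* Elementwise operations: absolute value, then add one. *)
    let () =
      Tensor.print (Tensor.abs t);
      Tensor.print Tensor.(abs t + f 1.)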

Again, the very nice thing is that here, these operations are super simple. The list is very small, so it’s instantaneous, but if you imagine that you multiply very large matrices together, which is very common, being able to run on a GPU is something that is key. That’s the first thing that PyTorch actually buys you.

The second thing that PyTorch gives you is actually this grad function, so that’s called, again, automatic differentiation, or in our case, back propagation. To do that, PyTorch will keep internally a graph of the computations that it has done. Whenever you build a tensor, you can tell PyTorch, oh, actually this tensor x is important, I will need later on to be able to compute some gradients based on x. So you mark it as requiring a gradient, [inaudible]. After that, you can do all the computations you want on x. Here you just take the square of x and you add x to it. You can then call backward on y. At this point PyTorch will go through the graph that has been used to compute y, walk backward to the parameter x, and compute the gradient of everything. Then you have access to the grad of x, which returns a tensor, in that case the derivative, from which we can just extract the value.
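
Here is roughly what that looks like with the bindings; set_requires_grad, backward, and grad are the names as I remember them and may differ slightly, and we sum y so that backward is applied to a scalar.

    (* Mark x as a tensor we will want gradients for. *)
    let x =
      Tensor.set_requires_grad (Tensor.float_vec [ 1.; -1.; 42. ]) ~r:true

    (* y = sum over the elements of x^2 + x. *)
    let y = Tensor.(sum ((x * x) + x))

    let () =
      (* Walk the computation graph backward from y down to x. *)
      Tensor.backward y;
      (* dy/dx = 2x + 1 elementwise, so we expect 3, -1, and 85. *)
      Tensor.print (Tensor.grad x)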

Again, we just run that on 1, -1, and 42. And of course, the derivative of x^2 + x is 2x + 1, and you see that we actually get 3, -1, and 2 * 42 + 1 = 85. Once again, we don’t do anything smart on the OCaml side, we just rely on the underlying framework to do the appropriate thing. That’s very nice because the people that have been developing PyTorch or TensorFlow or that kind of framework have been thinking a lot about the numerical stability of their algorithms, so they will apply all the fancy techniques and you will benefit from them in OCaml without having any need to actually know about these fancy techniques.

Now that we have this gradient function, we can ask ourselves, what should we use for this classifier function? We said that this classifier function takes as arguments some parameters and an image, and it has to return a class. Rather than returning a class, we will return something that is more continuous. We will return the probability for the image to be a camel or for the image to be a python. To get these probabilities, we’ll just chain various operations. These operations will be matrix multiplications, and we will combine them with nonlinear functions. Of course, we cannot just use matrix multiplications, because if you just compose matrix multiplications, what you create is a linear function and you don’t add any degree of freedom by having multiple layers. But as soon as you inject some nonlinear function, you have more expressiveness in the functions that you can represent. It even turns out that, thanks to the universal approximation theorem, you can represent pretty much every single continuous function with just two layers, like two matrix multiplication layers, separated by one nonlinear function.

The nonlinear function of choice used to be the sigmoid five or 10 years ago, and nowadays pretty much everyone uses what is called the rectified linear unit. That’s a super straightforward function. It just takes an input x and returns x if x is positive, and zero if x is negative. The very nice thing about that is that it’s super easy to compute, so you can have a very optimized way of doing it, whereas for a sigmoid you actually need to compute an exponential, which is a bit less straightforward.

Again, the universal approximation theorem tells us that if you have two layers, you can represent everything. That doesn’t mean that you can actually find the parameters that represent every single function, so you may still want deeper models. But here we’ll just consider a model with two such layers and a nonlinear [inaudible]. The way we’ll build it with OCaml is like this. The model will take as input what is called here vs. It’s a variable store, where we’ll store the parameters that we want to optimize so that the optimizer that performs gradient descent knows about them. It will take as input the width and the height of the image as well as the number of dimensions that you want to output. In that case, we can say either we output one dimension, which is the probability, say, to be a camel, or we output two dimensions, which are the score for the camel and the score for the python. When it comes to actually building a classifier, we’ll just pick the class that has the larger score.

There we build a first linear layer, again based on this var store, and we go from a flattened representation of the image, whose size is the width times the height times three, because we consider images in color, to vectors of dimension two hundred. After that, we add a nonlinearity and we have a second linear layer that goes from these vectors of dimension two hundred to the output dimension, say, two in our case. We then combine all that to build a new layer. This layer takes as input an image; you flatten it, because rather than having a spatial structure with a width and a height, we just put everything next to each other in a single vector, apply the first linear layer, apply the nonlinearity, and then apply the second linear layer. That’s it. You have a model that produces some scores. If we find the optimal values for the parameters that these two linear layers use, then, ideally, we will have something that will tell us whether the object is a python or a camel with decent accuracy.
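
A sketch of that model using the ocaml-torch layer API; the argument names are approximate, and Layer.of_fn wraps a plain tensor-to-tensor function into a layer.

    let model vs ~w ~h ~output_dim =
      let linear1 = Layer.linear vs ~input_dim:(w * h * 3) 200 in
      let linear2 = Layer.linear vs ~input_dim:200 output_dim in
      Layer.of_fn (fun image ->
        (* Flatten each w x h x 3 image into a single vector. *)
        Tensor.reshape image ~shape:[ -1; w * h * 3 ]
        |> Layer.forward linear1
        |> Tensor.relu
        |> Layer.forward linear2)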

It turns out that this model is a bit too simple to work on pythons and camels. It would work fairly well for something like digit classification, which is what used to be the typical benchmark, again 10 years ago. But in our case it’s a bit more complicated, so we’ll need models that are a bit more involved.

One thing to note is that the total number of parameters in our case is already fairly big. The parameters mostly come from the first linear layer, because here we have an input that has dimension 224 by 224 by 3 (the image size is 224) and an output of size 200. The matrix that we use to do the multiplication has a size that is the product of these two, and that’s something like 30 million parameters. It’s not a problem for gradient descent. As long as you’re able to compute the gradient explicitly, you will be able to just apply this algorithm. What is quite tricky is to know whether it will find the proper minimum or whether it will just get stuck in some kind of local minimum. It turns out that these things empirically work very well. People don’t really have an understanding of why that is the case, but it seems that there are no really bad local minima. There are lots of local minima, but most of them are very close to the optimal thing that we are looking for.

That’s what is called a neural network. Here, again, we have images that are a bit more complex than just black and white digit images, so we’ll have to do something a bit less straightforward, which is using a convolutional neural network. We’ll see later on what a convolutional neural network is, but first, on this very simple model we can see how we will run some training loops, how we’ll run gradient descent. There is an additional challenge compared to what we’ve seen so far, which is that you cannot compute the value of this function f on all the images in one go, because your GPUs only have a limited amount of memory. It can be huge nowadays, it can be like 48 gigabytes or something like that, but still, if you have something like 30 million parameters and you want to apply that to a data set that might have millions of images, and each of these images has dimension 224 by 224 by 3, that won’t fit on your GPU card. That’s why you will do something that is called using mini-batches.

We will iterate over random batches of a smaller size, let’s say 64. Usually people take whatever size will fit on their GPU. Then the training procedure is fairly straightforward. We create a variable store to hold the variables, create a model based on this variable store, create an optimizer, which is the bit that is responsible for doing the gradient descent step that we’ve seen earlier, and we just iterate: we have a first for loop that we run, say, 100 times. Then we iterate over small random batches of size 64, and we compute again our predictions, compute our loss, and apply our optimizer, so just adjust the parameters slightly. We iterate that for quite some time, for a couple of hours or whatever, and it turns out that it might find a nice minimum and we can use that model to classify images.
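
Putting it all together, the training procedure sketched above could look like this; batches and samples are hypothetical helpers standing in for the data loading, while the Var_store and Optimizer calls follow the ocaml-torch API as I remember it.

    let () =
      let vs = Var_store.create ~name:"classifier" () in
      let model = model vs ~w:224 ~h:224 ~output_dim:2 in
      let optimizer = Optimizer.adam vs ~learning_rate:1e-4 in
      for _epoch = 1 to 100 do
        (* [batches] is a placeholder: it shuffles [samples] and yields
           (images, labels) mini-batches of size 64. *)
        batches samples ~batch_size:64
        |> List.iter (fun (images, labels) ->
             let predictions = Layer.forward model images in
             let loss = Tensor.cross_entropy_for_logits predictions ~targets:labels in
             (* Compute the gradients and slightly adjust the parameters. *)
             Optimizer.backward_step optimizer ~loss)
      done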

I mentioned that this small model involving just two linear layers might not be enough. That’s mostly because in the first step that we did, we compacted all the information that was taken from 224 x 224 x 3 pixels into a vector of dimension 200. That’s a very aggressive operation. Like, we reduce the dimension [inaudible] a lot very quickly. Still, by doing so, we needed 30 million parameters. To avoid that, we would like to keep far more than 200 dimensions; we’d like to keep a representation that has thousands and thousands of dimensions, if not millions. The way to do it will be to use what are called convolutional layers.

The idea of convolutional layers is actually very straightforward, and it comes from the fact that information locality matters. If two pixels are close in an image, they probably have an influence on one another. By that I mean that when you look at a given pixel, it’s useful to know if the pixel next to it is of the same color or not, but it’s far less interesting to know about the pixel that is at the opposite side of the screen. You would rather like to know about the average value of the pixels that are at the opposite side of the image.

Convolutional layers will help with that. What they do is, rather than applying a large linear operation to the full image, apply a small linear [inaudible] to a small patch of the image of, say, 5 x 5, times 3 because, again, of the number of colors. For each of these patches it will extract a vector of, say, size 64, so we’ll only need about 5,000 weights to do that. After that, we do that again and again and again across the image. That will build a new feature map that has, in that case, rather than three channels, 64 channels, and we can stack such layers one after another. You can think of it as: at first we build things like edge detectors, because we look at adjacent pixels. Once we’ve noticed that we have edges, the layers above that will be able to combine these edges to detect higher level features. They might be able to notice that something is actually a wheel. After that, when you have noticed that you have two wheels at some part of the screen, the layer above might be able to combine them and to say, oh, actually it’s a car.

You build a representation of higher and higher level concepts. For that you alternate these convolution layers, and from time to time you have subsampling layers. The idea there is just to reduce the resolution of your image. We’ve built the edge detectors, that’s fine, but at this point we don’t care that much about single pixels. We just group adjacent pixels together via this subsampling step, and we run another convolution layer. Again, we iterate that kind of thing multiple times until at the end we just apply some linear layer and obtain the output category. It turns out that this works very well.
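
A small convolutional stack in the same style might look like this; here strided convolutions take care of the subsampling, the sizes are purely illustrative, and the binding names (Layer.conv2d_ in particular) are from memory.

    let conv_net vs ~output_dim =
      (* Strided convolutions both extract local features and subsample the image. *)
      let conv1 = Layer.conv2d_ vs ~ksize:8 ~stride:4 ~input_dim:3 32 in
      let conv2 = Layer.conv2d_ vs ~ksize:4 ~stride:2 ~input_dim:32 64 in
      (* Size of the flattened feature map; it depends on the input resolution
         (roughly 64 * 26 * 26 for 224 x 224 inputs with these strides). *)
      let flat_dim = 64 * 26 * 26 in
      let linear = Layer.linear vs ~input_dim:flat_dim output_dim in
      Layer.of_fn (fun image ->
        let features =
          Layer.forward conv1 image
          |> Tensor.relu
          |> Layer.forward conv2
          |> Tensor.relu
        in
        (* Flatten the 64-channel feature map before the final linear layer. *)
        Tensor.reshape features ~shape:[ -1; flat_dim ]
        |> Layer.forward linear)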

What we’ve done with supervised learning is we’ve taken a large number of labeled examples and we’ve tried to build the optimal classification function, that is, the function, among the class of functions that we can represent with these [inaudible] parameters, that has minimal loss. Again, this should work quite nicely outside of our initial data set. To illustrate that, what is nice is the recent history of the computer vision domain. That’s taken from a competition that is held every year, which is called ImageNet. It’s both a competition and a data set. Roughly you have one million images that belong to 1000 categories, and you have to do classification. Actually, we care more about top-five classification because top-one classification is very difficult: labels might be noisy and so on. You have in ImageNet… I don’t really remember how many, but I would say more than five different breeds of dogs. Actually, even for a human being, it’s pretty hard to get the proper labels.

You see that in 2011, the best algorithm from this competition had an error rate of 26%. That was the last year that something that wasn’t using deep learning won the competition. The next year, AlexNet came out. It was, I think, from people at the University of Toronto, and it reduced the error to 16.4%, which is quite an amazing improvement, especially as things had been quite flat before that. Also, AlexNet didn’t need any hardcoded knowledge. For this 2011 model, people had been handcrafting features manually. They’d been writing algorithms that say, oh, actually for a good representation I care a lot about this kind of texture, so I will add to my model some handcrafted detector that is good at detecting this texture. Starting from that year everything is generic. There is no point at all in coding features manually. After that the error went down quite a lot because people got more used to training deeper models and to training models in a more precise way. You see 16.4, 11.7, and so on. Around 2015 it started being better than human.

The human in that case is Andrej Karpathy, who used to be, I think, a postdoc at Stanford around this time, and who actually measured that performance on himself. I think before that, he trained a bit, because if you ask a human being who has no preparation, he would actually do pretty poorly because he would not know what the 1000 categories that he should classify into are. Here you see 3.1 in 2016. Since then things have improved even a bit more. Yeah, that is roughly the state of the art for the supervised learning setting.

That’s something that is very nice, but I mean, it’s not yet perfect. It’s not always the case that we have lots of labeled examples. What we said at the beginning is that we want to play Atari games, and so far we haven’t done any of that. The big difficulty is that for Atari games, you don’t really have someone that will tell you, for each image, what are the actions that you should take. It would be super annoying for someone to do that. You could actually just let people play for hours and try to mimic their behavior, but that would take long, and it would actually probably work less well than what we will be doing next. Also, one key limitation right now is that it takes a human only a few images to be able to recognize an elephant from other things, despite never having seen an elephant for real.

Still, if you train a model to do that, you will need quite a lot of images. That’s why the domains that are quite trendy nowadays are transfer learning, which is about taking a model that has already been trained and fine-tuning it on a new data set, and what is called self-supervised learning: you have access to a huge data set but the labels have been removed, and you only have access to the labels for a small part of the data set. Nowadays people are able to roughly match the state of the art by using these kinds of techniques, which are far closer to what human beings do. Human beings accumulate knowledge of how the world works over years of doing. Perhaps in the first three or six months of your life you build a model of what physics actually is, and after that you leverage this model of physics to be able to do plenty of different tasks, among which, of course, image recognition.

Let’s now switch gears and move to reinforcement learning. There, let’s have a quick look at, again, the timeline of recent events. What we will look at is playing Atari games, and that starts to be a bit older. It started in 2013 with a publication, at the very end of 2013, by DeepMind that ended up being published in Nature, which for a computer science paper is quite a nice feat, in 2015. The algorithm that we will look at today is actually this Deep Q-Learning algorithm that they used to play Atari games at a roughly human level. Not for all games, but for quite a few of them.

After that, of course, there was AlphaGo in 2016. AlphaGo was a mix of supervised learning and reinforcement learning. Actually, it started by training on lots of professional player games, and only after that did they use reinforcement learning. You first come up with an algorithm that is reasonably good because of the supervised learning, and after that you get it to play the game a lot and to learn a lot about the game, and it starts to get better and better.

The next year, AlphaZero became state of the art on Go, Chess, and Shogi, and removed the supervised bit. In AlphaZero, there is no human knowledge. Recently there have been OpenAI Five and AlphaStar; these are a bit more involved because these are games where you don’t have perfect information, where the action space is far larger than Go, et cetera, so it’s far more of a challenge. Still, they don’t outperform humans, but they are comparable to professional players right now.

What is different compared to 1997, when IBM was able to beat world champion Garry Kasparov at Chess? At that point, things were hardcoded and it was mostly about brute force. What Deep Blue was doing is just use tree search with a hardcoded evaluation function that tells it whether a position is good or not. You can think about it as: a queen is worth 10 points, each of your pieces is worth some number of points. You compute a function that gives you the score of your position and you find the branch in your tree that leads to the highest possible score, or the lowest possible probability of reaching a small score, or whatever you care about. That doesn’t work well for Go because the [inaudible] of Go is far higher than that of Chess. That’s why, there, people had to wait a few years before they were able to actually remove the hardcoded bit and, instead, train some machine learning algorithm. It’s the same as what we said for images at the beginning. You don’t want to manually code an algorithm that distinguishes camels from pythons. You just want the computer to learn it for you, and after that, you want to use it.

The reinforcement learning setting is a bit different from the traditional supervised learning setting. We’ll get an agent to interact with an environment. The agent is able to observe the environment and get the state of the world from there; that’s the state s_t. After that it can perform some action a_t and get a reward out of it, and that’s a [inaudible] loop. We have an agent that performs some action, the action goes into the environment, which updates itself and gives back the reward for the next step and the state for the next step. Of course, video games are a lot like that, and what you want to do is find the agent that collects the largest possible sum of rewards over time. An agent will have a so-called policy. It’s just a function from observations to actions. Our objective, once again, is to find the best algorithm that we can to optimize our loss function. In that case, rather than minimizing, we are maximizing for now: we want to collect rewards.

That’s very similar to what we said for supervised learning. We would like to find the best possible algorithm. Once again, we have the issue that algorithms live in a discrete space. It’s hard to optimize over a discrete space, so we’ll move to a more continuous space and we’ll use, again, gradient descent.

There are new challenges, though, compared to supervised learning, which are that rewards can be deferred. If you play, let’s say, Pong, you have to understand that if you score a point, it might be because of an action you took 10 steps before, not just the step before. That’s an important bit. You also have a trade-off between exploration and exploitation. Your way to discover the world is just to interact with the environment, so if you notice that an action is good, you’re tempted to always play this action because you know that it’s good. But still, you want to do some exploration so that you discover new things about the world, and you may notice that there is another action that is actually far better than the ones that you’ve been playing so far.

We can put nice OCaml types on that. An environment, again, has a type. You can create it. After that you can start new game sessions, which will give you the observation for the first step of the game, and you can interact with the environment. You give it an action, which is an integer there, but we could give it another type if we wanted to, and you receive back the observation of what the world looks like now, the reward that you’ve collected during that step, and a boolean that tells us whether or not the game is finished. We’ll just use that in a tight loop to train our algorithm.
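
The environment interface described here can be captured by a module signature roughly like this; the names are illustrative rather than the exact ones from the repository.

    module type Env = sig
      type t

      (* Create an environment, e.g. wrapping an Atari emulator for a given game. *)
      val create : string -> t

      (* Start a new game and return the observation for the first step. *)
      val reset : t -> Tensor.t

      (* Perform one action; get back the new observation, the reward collected
         during that step, and whether the game is finished. *)
      val step : t -> action:int -> Tensor.t * float * bool
    end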

We’ll use what is called Q-learning. In Q-learning, we model the Q function. The Q function takes as arguments a state and an action, and it returns the total reward that we should be able to collect over the rest of the game. Here we’ll call Q* the function that corresponds to the perfect agent, to the optimal policy. It might be impossible to compute this policy, but still there is this Q* function that, again, takes a state, takes an action, and gives you the best that you can do from there. This Q* function follows what is called the Bellman equation.

The Bellman equation is something very straightforward. You just take Q* applied to state s and action a. It’s equal to the reward that I would collect, plus gamma times the maximum, over all possible actions a', of Q* on the next state s', the state I would reach from s using action a. In other words, Q*(s, a) = r + gamma * max over a' of Q*(s', a'). We just say there that we have the optimal policy.

There is one weird thing there, which is this gamma. Gamma here is a discount factor. It’s just there to incentivize the agent to collect rewards now rather than later. Otherwise you can imagine that an agent that is able to stay in a state from which it thinks it will be able, further down the road, to collect all the rewards in the world, will just do nothing, because it’s still in this state that is pretty good. We just say, oh, actually, getting your reward right now is a bit better than getting it at the 1000th step, so we apply this multiplicative factor, and usually you give it values like 0.99 or something like that.

We are back to our problem of how we approximate Q. In the supervised learning case, that would have been the optimal classifier, some notion of optimal classifier. Here we want to find a function that is reasonably good at playing the game, so we want to find the best function for this loss. This loss is just the difference between the two sides of the Bellman equation: we take the left-hand side and we subtract the right-hand side, and that’s just the thing that we want to minimize. We know that for Q*, this would actually be zero, so it’s optimal for this loss. It turns out that minimizing it gives nice strategies.

Yeah, that was the loss that we want to minimize. That’s very similar to what we did in the supervised case. We just want to find the optimal set of parameters theta, and the theta come from a continuous parameter set rather than something discrete. Again, that means we can use gradient descent. It just requires the operations to be differentiable, and once again we’ll be able to leverage PyTorch, TensorFlow, or whatever our back propagation engine is to compute the gradient, and we will be able to do gradient descent, just applying this formula or applying a slightly fancier optimizer.

That’s done in OCaml. Again, that follows the equation very closely. We call qvalues the application of our model to our state s. Then the next_qvalues are the application of our model to state s', called next here, and we take the maximum over the actions. Here our q_model actually doesn’t return the value for a single action; it returns a vector that holds the Q value for all the possible actions. The expected_qvalues are the right-hand side of the Bellman equation, that is, r + gamma * next_qvalues. And for the loss, we just take the mean squared error, the sum of the squares divided by the number of elements, between the left-hand side of the Bellman equation, the qvalues here, and the expected_qvalues, the right-hand side of the Bellman equation.
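
A sketch of that loss computation; values_for_actions and max_over_actions are hypothetical helpers standing in for the tensor indexing and reduction, and the mean squared error is written out with basic tensor operations.

    let dqn_loss ~q_model ~s ~actions ~rewards ~next_s ~gamma =
      (* Left-hand side of the Bellman equation: Q(s, a) for the actions taken.
         [values_for_actions] is a placeholder that picks, for each sample, the
         Q-value of the action that was played. *)
      let qvalues = values_for_actions (Layer.forward q_model s) ~actions in
      (* max over a' of Q(s', a'); [max_over_actions] is also a placeholder. *)
      let next_qvalues = max_over_actions (Layer.forward q_model next_s) in
      (* Right-hand side of the Bellman equation: r + gamma * max_a' Q(s', a'). *)
      let expected_qvalues = Tensor.(rewards + f gamma * next_qvalues) in
      (* Mean squared error between the two sides. *)
      Tensor.(mean ((qvalues - expected_qvalues) * (qvalues - expected_qvalues)))

The optimizer step from the supervised section then works unchanged: we hand this loss to Optimizer.backward_step and let it adjust the parameters.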

There it’s, again, a bit magic. We just have to ask our framework to do the automatic differentiation for us, and it will perform a step of optimization based on that. Because we might use quite complicated models, evaluating q_model or evaluating the backward step results in millions or even billions of operations. We don’t mind that much because we’ll be able to run on GPUs, or even on TPUs or other fancy hardware. And also, we will run that for quite some time.

Now we have a trained Q function. It is not the optimal one, but it is a Q function that minimizes the loss, or at least reaches a local minimum, and we want to use it to build a policy, to build an agent. We mentioned that we have a trade-off between exploration and exploitation. We still want to do exploration. For that we’ll use what is called an epsilon-greedy policy. That is, at each step, rather than just taking the best action, we’ll have a probability epsilon of taking a random action: we draw a random float, and if this float is smaller than epsilon, we just take a random action. Here you have some hardcoded formula that decreases epsilon over time, because to start with the agent has no confidence in what it has been doing, it has been taking mostly random actions, but at some point it starts having confidence in its results, and taking a random action is probably less good than taking the actions that are the output of the model. When it comes to actually evaluating the policy and getting it to play for real, to look at its score, we will turn off epsilon-greedy, or we’ll run it with an epsilon of zero, so that we always take the optimal decisions.
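
An epsilon-greedy choice of action is only a few lines; argmax_action is a hypothetical helper that returns the index of the largest Q-value in the output tensor.

    let select_action ~q_model ~observation ~epsilon ~num_actions =
      if Random.float 1. < epsilon then
        (* Explore: take a uniformly random action. *)
        Random.int num_actions
      else
        (* Exploit: take the action with the largest predicted Q-value. *)
        argmax_action (Layer.forward q_model observation)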

The setting that we will look at for these Atari games: the observations are images with three color channels and 210 x 160 pixels at 60 fps. We will actually take as input not one frame but the last four frames, so that we can see movement. If you play Pong, knowing where the ball is and taking a decision based on that single snapshot is actually not enough, because you don’t know if the ball is moving up or down. You have one chance in two of taking a good action, which is not ideal. You could decide to have a model that has some kind of memory, but that won’t be the case here. Instead we’ll combine a few frames so that we can pick up movement from there. That works super well for Atari games because they are very simple, but of course, for StarCraft or DOTA or whatever, you cannot afford to do that because you need to remember far longer dependencies than just the last few frames.

We have 4 to 18 possible actions, depending on the game. Actually, I think that in Pong you only have two actions, so maybe it’s only 2 to 18. The Q function that we will build takes as input the pixel values, like for these four frames, and will return the Q values for all of the actions.

It turns out that, to play Pong, the first game there, it will be sufficient to use a very simple model, similar to the models that we’ve seen earlier. We just have two linear layers and we separate them with a rectified linear unit. For Pong it’s enough to actually play the game. We will even downsample the image to 80 x 80 and remove the color information, because you can play Pong in black and white; that’s not that difficult. And something that I didn’t mention is that rather than stacking four frames, in the game of Pong you can just subtract the previous frame from the current one so that you actually get to see the movement. Yeah, I will just skip over experience replay and the other optimizations that make it a bit better, but if you train this thing on Pong over 500 episodes, that would be 500 games, it outperforms the hardcoded agent that is playing this game.

Let’s have a quick look at a video of that. That’s actually an OCaml-based agent that you see playing there on the right-hand side. You can see that it’s able to return the ball reasonably well. It hasn’t been trained for that long. It doesn’t make all the points, but still it’s decent. The reason why it’s shaking is mostly an issue on my side, which is that at each time step the agent can choose to go up or down, but it’s not allowed to do nothing. Yeah, it’s not perfect. It has to discover that actually, if you want to do nothing, you have to quickly go up, down, up, down, up, down, rather than just stay still.

Yeah, that’s not ideal, but still, it’s working fairly well, and you see that. In Pong, you play to 21 points. Of course, if you play randomly, you get a score of around -21, because each point of your opponent is worth -1, and if you play perfectly, you get a score of +21, because all your points are worth +1. To start with, taking random actions, we get a reward of zero. Out of pure randomness, we notice that, oh, actually it’s good to send the ball back to the opponent. It’s the kind of game that you could hardcode pretty easily, but the same approach will work for more complex games.

The reason why there are two curves there is just that these two curves are using different [inaudible]. Something that is a bit tricky about this is that it depends a lot on your initial conditions. If you’re lucky, you train the blue agent and it learns quickly. If you are less lucky, then it takes a bit longer to train. In more complicated settings you might even get it to train with some initial conditions, and if you’re unlucky, it will actually never start learning.

Let’s do something more complicated than Pong: Breakout. There you need a convolutional neural network rather than just the [inaudible] model. You can see that it will start learning pretty quickly and be able to improve over time. Let’s have a quick look at the video. There it’s allowed to do nothing, so it shakes quite a lot less. Also, an important point that I didn’t know about Breakout is that the bricks at the bottom are only worth one point but the bricks above are worth more, so the agent has an incentive to try to hit the bricks that are higher up. What is fairly nice is that the agent will discover what is perhaps not the optimal strategy but a good strategy at Breakout, which consists in digging a tunnel on the side, either the left-hand side or the right-hand side. Once you’ve dug your tunnel, you can try to send the ball above the bricks, where it will bounce pretty quickly. It takes, of course, a bit of time for the agent to do that. The agent is far from perfect. Let’s keep watching it play.

You can still see that it tries to hit a bit on the side. Also, the speed increases, and that would be a problem for a human being, but for the agent it doesn’t make that much of a difference anyway. Yeah, it continues like that. I don’t remember if it’s in this video, but I hope at some point… yes, it digs one tunnel. Yeah, it sends the ball above. And of course, the reward that you get starts increasing for real pretty quickly. After that it digs a second tunnel. Again, having two tunnels is perhaps not that good, but yeah, you get it to reach some actually pretty decent score, and outperform a human, again, quite easily. From the agent’s perspective, speed is actually not that hard to deal with, as long as it’s able to move the paddle fast enough. Precision is not that much of an issue either, whereas for a human being, being able to send the ball perfectly to one side or the other is hard. For an AI it’s actually not that complicated.

Did it play a whole game before it changes its parameters? In Pong, does it play out the full 21 points?

Yeah, in that case it actually doesn’t. It pretty much restarts from scratch on a new game, a new life. That’s called episodic life. It turns out that it increases the speed of learning at the beginning, but at the end you might want to relax this thing.

The last example that we’ll look at is Space Invaders. Again, that’s fairly similar. You’ll see that there the agent is fairly nice. Again, all of these have been trained in OCaml. The thing there is that, of course, it notices that it’s good to shoot at the invaders so that you get them. It doesn’t notice that it’s actually good to shoot this thing; it didn’t play enough episodes to notice that this is worth a lot. That’s a bit sad. With those algorithms and more training, you would be able to actually discover that. It doesn’t care that much about protecting itself, because somehow it’s able to avoid the shots just by one or two pixels, but that’s enough for the agent.

One thing that is actually pretty nice is the end game, because there, for a human being it’s difficult: the movements are orthogonal, so you don’t really manage to time your shot perfectly. Whereas, of course, for a computer, you can get the perfect shot. It measures things exactly and it knows when to take the perfect shot. Once you get into higher levels, you get more invaders at the bottom and this thing gets removed, but still… it still doesn’t get to shoot this, which is sad. But still, it will destroy all of them pretty quickly, and hopefully it stays good at timing when things actually move a bit faster. Yeah, you can see that in the end game, it just does roughly perfect things. You can iterate that again and again, and it just gets quite a bit better than a human being at that kind of thing. Yeah, that’s pretty much the same.

That will be about it for the reinforcement learning bit. Of course it’s not perfect. It takes a human being seconds, or a couple of games, to understand Pong. It took us multiple hundreds of games, and that’s for Pong. For other games, like the Space Invaders example, it actually took multiple days of training on a GPU, whereas you yourself can just play these games and after a couple of minutes or a couple of hours, you have understood what the rules of the game are. You don’t have to learn from scratch what this kind of laser beam is, that it’s actually a bad thing to receive one on you and a good thing to shoot one at the invaders. You don’t have to rationalize that, whereas the agent has to take random decisions, notice this kind of pattern, learn this kind of pattern, put it in its parameters, and iterate. That takes time.

Q-learning works reasonably well. There is another family of algorithms, called policy gradient algorithms, which are, I think, a bit more trendy these days. That would be A2C, A3C, PPO, TRPO, whatever. All these things learn quite a bit faster than Q-learning, and they’re also simpler to tune. It turns out that in this case you get to play lots of games in parallel rather than just playing one game, and that must be what speeds up the learning, just leveraging the fact that you can use tons of CPU cores to play your games.

Some conclusions about this experiment playing Atari games in OCaml. Machine learning in OCaml has some nice advantages. You use functional programming, which is something that probably most of you enjoy, and you get the type checker to help you avoid some small silly mistakes that you might not catch using Python. For example, comparing an integer with a pair is probably not something that you want to do, and default Python would just let you do that happily and decide that it’s false or true, I don’t know, but take some arbitrary decision. OCaml on that front is pretty nice.

What you also get with OCaml is that you just leverage the existing frameworks, so you are still able to run on GPUs or even TPUs, and you have these nice libraries, actually written in C++ or even Fortran, that PyTorch or TensorFlow implement or bind to. Also a nice thing about OCaml is that you have access to a toplevel, so you can actually interactively, in a Jupyter notebook, try your ideas, look at your tensors, look at your gradients, and iterate pretty quickly. Of course there are quite a few limitations. It’s nowhere near the Python experience, and most of them have to do with the ecosystem. In Python, you have tons of libraries, tons of tutorials, online courses. If you have any question about it, you just google it and someone has had the issue before. You go on Stack Overflow, it’s there. In OCaml, if you try googling machine learning in OCaml, you will get pretty much nothing. It’s a bit more difficult. That’s why one thing that we’ve tried doing with these PyTorch and TensorFlow bindings is to keep the OCaml libraries as close as possible to the Python ones. That way, if you have a problem with the library, you can just google for it, find the answer in the Python world, and hopefully it will be easy to translate to the OCaml world.

Another benefit is that you get some nice interoperability with Python. You can decide that you train your algorithm in Python and then use the weights that you’ve obtained in OCaml, or you can go the other way around. Being able to just exchange data, and exchange even more, between the two different worlds is something that is actually pretty nice. Yeah, all this is open source, so it’s on GitHub, and there you have one version with PyTorch and one version with TensorFlow. If you want to give it a try, I would be very happy to get some feedback on that.

That’s all I got, and now, if you have any questions, I will be happy to hear them.

With the type checker and machine learning in OCaml, it seems like one area where you could try to force the type checker to do something for you is labels, or classification. If you were writing this in OCaml you might be too proud to use string labels on your own [inaudible] vectors. Have you tried to come up with…

Sure. Yeah, just to repeat the question for the mic: your question is about labels, and whether that’s a nice place to use the type checker. I think that’s fairly right. The examples that I gave used labels that were not of a precise type but just [inaudible], but you can have decent types there. The places where you want decent types can be, for example, when you consider tensors: you can attach to the tensors the kind of elements that they hold. They might hold float values because they are parameters, or they’re actually labels so they hold integer values, or even, in an abstract way, label values. Indeed, using that, you will be less likely to try to add some label to some parameters, which would obviously make no sense, and the type checker will be able to help you with that.

For actually interacting with the Atari game, was there some existing version of the Atari games that you can programmatically interface with?

Yes, there is something… Again, the question is how do we interact with Atari games, and how did we end up playing with the emulator. For that we leveraged something called OpenAI Gym. It’s published by OpenAI, and it’s a Python library. We used some Python bindings for OCaml which are called pyml. The OCaml code calls Python, which calls the Atari emulator, and gets back the frames from there. What is fairly nice is that it’s able to do so without actually having to duplicate the memory. You’re able to exchange the data between the Python side of things and the OCaml side of things just by sharing some memory between the two of them. We didn’t have to implement anything for that. We just used some very thin wrappers around an existing Python library.

The games that you trained: it seems that you really trained the agent to play with a very fast reaction time. It seems that if you scaled down the FPS at which the agent is allowed to act, then it would be closer to a human reaction time, and it might even converge on a strategy rather than relying on reaction time. Has there been research in that area to see whether it’s really as powerful, whether it really finds strategies?

Sure. The question is about whether it is fair to let the agent have zero reaction time, whereas for a human being it takes perhaps a tenth of a second to react, and for FPS or real-time strategy games such as StarCraft or whatever, a human being is only able to do a certain number of actions per minute. It turns out that professionals are actually able to do hundreds of actions per minute. But still, an AI is able to do whatever number of actions per minute you let it do. Yeah, when AlphaStar was released last year, they put some limits so that it’s comparable with what human beings are doing. They just asked some professional players what a decent limit is, and the agent is not allowed to do more actions than whatever the humans were doing.

It’s a good thing to do, but it doesn’t mean that it’s totally fair once you’ve done that, because when you measure the actions per minute that a human being is doing, most of the actions are actually more or less the same action, or very correlated. It’s just that professional players like to kind of keep their [inaudible], and they click on the same thing quite a few times, whereas the AI is able to do very, very different things with each of its actions. Of course, the idea is to make things as fair as possible, so they try to find better ways of limiting things than just actions per minute.

As a follow-up, are you limited to reaction time or are you able to see more strategies developed by the agents?

I didn’t do that at all on the Atari games. I think on the RTS side, like on StarCraft or DOTA, what you would see is mostly that if you allow infinite actions per minute, the agent will game the metric that you gave it, which is to get the maximum score, by just micromanaging units super precisely, down to a pixel, which a human being is not able to do. It would use what are kind of bugs of the game, which a human cannot exploit because they’re not fast enough, and it would just move forward by one pixel at each step, move backward by one pixel, and so on, and notice that, oh, actually, that’s the perfect strategy, nobody is able to beat that. That’s super nice to have, but you’re right, it’s not something that is kind of a fair game. You want to add some limitations, and people are putting in limitations. That gives more human-like behavior.

It’s kind of a trend in the world of machine learning. Whatever you do, your agent will try to game your metric. You fix the loss, but it’s the [inaudible] world. The loss is actually what you require the agent to optimize, so if it notices that a pattern is super good and you didn’t put in your loss something that penalizes exploiting this aspect, your agent will happily exploit it.

I’m curious. You said with DOTA, hundreds of hours of training… or tens of hours, maybe, to train Pong. How much of that was GPU time and how much of that was CPU time?

For Pong, it actually takes less than that. For Pong, you can even find an implementation by the same Andrej Karpathy that runs in Python in a notebook, in perhaps tens of hours of CPU time. The thing that I did with Pong, in GPU time, it’s less than a couple of hours. You have a very good idea within, let’s say, 30 minutes to one hour, of whether you have made progress, and it’s almost perfect at that point. For Atari Breakout, it’s far longer, and we are talking, mostly for Space Invaders, about days of GPU time.

Mainly on the GPU, how much memory did the GPU card that you had have?

For all that, these are super simple games, and it’s trained on a single GPU. The very nice thing is that that way the training is very straightforward. You don’t have to mix data from multiple GPUs. It’s actually a pretty much consumer-grade GPU, the graphics card that I have, which is an RTX 2070, which has, I think, 8 gigabytes of memory.

For those of us who want to have a look at it, do you have any idea of how, if we want to experiment with a similar environment, we could get set up? Indeed, do you know how we can [inaudible]?

It depends what kind of experiment you want to run. If you want to try the OCaml thing, you can just go… actually, it’s probably [inaudible] there. If you go to this repo, in it you have examples of reinforcement learning. If you use [inaudible], which should be your favorite build system, to run a2c.exe, it will, provided that you have the OpenAI [inaudible] installed, just happily run, and you can tweak this example. A2C should be available there. This version, I don’t know which game it plays. Yeah, it plays Space Invaders, and it should give the same kind of results. Actually it’s A2C, so it’s even a bit better and faster to train. Yeah, that would be one source.

After that, of course, just looking at the code is a bit difficult. You have tons of tutorials around that on the internet. Most of them are in Python, but if anyone wants to write some OCaml ones, that would be very welcome. One nice resource is actually the OpenAI Baselines. They provide reference implementations for most algorithms, and they are quite battle tested. The one difficult thing in this domain is that if you have a small bug, it might not train at all. It turns out that most of the existing implementations have very small quirks that are difficult to replicate. Going to these OpenAI Baselines, you get some implementations that are at some quite high level of [inaudible].

Let’s say, for instance, you had an alternative game. So Breakout, but instead of having a rectangular paddle, you have a round one, so there’s some bimodal or rounded response when you hit the ball. Would you be able to train that up and see a difference in the Q values that would be indicative of this? Can you dig down to that level?

That’s quite a good question. The question is about whether, if we had a variation of the game where we replace the paddle with something that is not square or rectangle shaped but something more round or whatever, we would be able to see that in the Q values and to adapt our agent to it. That’s actually a whole domain of research, which is how you can benefit from some games to build agents for other games. Again, what people would like to do is: you learn on all the Atari games minus one, and you want to be able to learn the last one without having to do things from scratch, by reusing the experience that you’ve already gained. In the same way, you want to learn with a rectangular paddle, but still be able to adapt to whatever shape you’re given. That’s something at which human beings are super good, but right now algorithms are a bit less good. There is quite some big progress being made.

Are there production places and markets where you can use reinforcement learning, even if you don’t have an environment you can [inaudible] 10,000 examples?

Of course I will not comment exactly on where you can use it on markets. You can think of markets as being something pretty close to what we did for video games, where you have some observations about the state of the world, you take some actions, and you get some reward based on your actions. That said, it’s much more challenging on a couple of points. These points are: the observations you get are super noisy. It’s not an image, it’s lots of time series, and most of it is actually noise and not proper signal. Also, what you need to do this reinforcement learning thing is a perfect simulator of the world. If you had a perfect simulator, maybe you would be able to train that kind of thing if you let it run for long enough.

I think one more question.

One more question and we finish. Well, if that’s it, thank you very much for your attendance.
