The neural network that takes actions in the game.
An action the agent takes while playing the game.
The total predicted reward for a given action.
A list that keeps track of: the state before an action, the action taken, the reward experienced after that action was taken, and the state after the action.
The neural network that decides the playing actions.
The network that gets trained via action replay and periodically updates the prediction network's weights.
An algorithm that decides what percentage of actions the agent takes is random. This is to control how much the agent explores an environment.
Once the agent’s memory has reached a certain length, a random sample is taken from the memory and used to train the training network.
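The exploration and replay definitions above can be sketched in a few lines of plain Python. This is only an illustration, not the code used in this post: the names `choose_action`, `decay_epsilon` and `ReplayMemory` are made up for the example, and the decay rate, floor and memory size are assumed defaults borrowed from the values used in the full script.

```python
import random
from collections import deque


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # explore: pick any action uniformly at random
        return rng.randrange(len(q_values))
    # exploit: pick the action with the highest predicted q value
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, rate=0.999999, floor=0.1):
    """Shrink epsilon a little each step, but never below the floor."""
    return max(floor, epsilon * rate)


class ReplayMemory:
    """Fixed-size memory of (state, action, reward, next_state) steps."""

    def __init__(self, maxlen=10000):
        # a deque with maxlen silently drops the oldest step when full
        self.steps = deque(maxlen=maxlen)

    def add(self, state, action, reward, next_state):
        self.steps.append((state, action, reward, next_state))

    def ready(self, min_size):
        # only start training once enough steps have been collected
        return len(self.steps) > min_size

    def sample(self, batch_size, rng=random):
        # a random batch breaks up the correlation between consecutive steps
        return rng.sample(list(self.steps), batch_size)
```

With `epsilon=1` every action is random, and as epsilon decays towards the floor the agent relies more and more on its own predictions.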
Training the training network
Deep Q-learning works by approximating the q function of a game. This means that it tries to predict the overall expected reward of every action a player can take. Rewards for actions that take place farther into the future are harder to estimate than what’s happening right now, so we use something called the discount factor, which lowers the effect of predictions about future actions. We use the formula below to calculate the q value.
Q(s, a) = r(s, a) + discount factor * max over a' of Q(s', a'), where s' is the state reached after taking action a
Our target q value for any action is the directly observed reward r(s, a), the reward given at that time step, plus the discount factor times the maximum predicted q value for the next state. Using this target q value we can train the neural network to predict the long-term outcome of any given action. When playing the game, we take the action with the highest predicted q value.
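To make the target calculation concrete, here is a small worked example with NumPy. It is a sketch under assumptions: the function name `target_q_values` and the sample numbers are invented for illustration, and the discount factor of 0.99 matches the one used in the full script.

```python
import numpy as np


def target_q_values(rewards, next_q_values, discount=0.99):
    """Observed reward plus the discounted best q value of the next state."""
    rewards = np.asarray(rewards, dtype=np.float32)
    next_q_values = np.asarray(next_q_values, dtype=np.float32)
    # max over the action axis: the best the agent expects to do next
    return rewards + discount * next_q_values.max(axis=1)


# two remembered steps, three possible actions
rewards = [1.0, 0.0]
next_q = [[0.5, 2.0, 1.0],    # predicted q values for the state after step 1
          [0.0, 0.25, 0.1]]   # predicted q values for the state after step 2

targets = target_q_values(rewards, next_q)
# step 1: 1.0 + 0.99 * 2.0  = 2.98
# step 2: 0.0 + 0.99 * 0.25 = 0.2475
print(targets)
```

A lower discount factor would shrink the second term, so the agent would care less about what it predicts for later steps.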
Things I learned along the way
At first I tried to apply deep Q-learning to the MountainCar-v0 environment. I found that the mountain car environment gives out rewards very rarely: the player is only rewarded once the car reaches the top of the hill. This meant that the agent wasn’t able to learn anything, because it had never managed to reach the top of the hill.
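I had a bug in the way I used numpy’s np.put function. I realized that the put function treats its indices as positions in the flattened array when inserting the reward data. This meant that the calculated rewards were being put in the wrong place. I switched the code to use np.putmask, which puts the rewards in place with their corresponding actions. One caveat: np.putmask also picks its values by flattened position and repeats them when the values array is smaller than the target, so the values need to have the same shape as the q value array for every row to receive its own update.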
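The difference between these NumPy calls is easy to see on a tiny array. The example below is only a sketch with made-up numbers (3 remembered steps, 3 actions), and the integer-indexing variant at the end is an alternative I am adding for comparison, not what the script in this post uses.

```python
import numpy as np

actions = np.array([2, 0, 1])               # action taken at each step
updates = np.array([10.0, 20.0, 30.0], dtype=np.float32)
mask = np.eye(3, dtype=bool)[actions]       # one-hot mask, one row per step

# what we want: row i gets updates[i] in column actions[i]
expected = np.array([[0, 0, 10],
                     [20, 0, 0],
                     [0, 30, 0]], dtype=np.float32)

# np.put treats the indices as positions in the flattened array,
# so everything lands in the first row
put_result = np.zeros((3, 3), dtype=np.float32)
np.put(put_result, actions, updates)
# put_result is [[20, 30, 10], [0, 0, 0], [0, 0, 0]]

# np.putmask with a short values array cycles through it by flat position,
# so the right cells are filled but with the wrong values
short_result = np.zeros((3, 3), dtype=np.float32)
np.putmask(short_result, mask, updates)
# short_result is [[0, 0, 30], [10, 0, 0], [0, 20, 0]]

# np.putmask with values of the same shape as the target lines up correctly
full_result = np.zeros((3, 3), dtype=np.float32)
np.putmask(full_result, mask, np.repeat(updates[:, None], 3, axis=1))

# integer indexing does the same thing without needing a mask
indexed = np.zeros((3, 3), dtype=np.float32)
indexed[np.arange(3), actions] = updates

print(np.array_equal(full_result, expected), np.array_equal(indexed, expected))
```

Checking a small case like this by hand is a cheap way to catch misplaced targets before they quietly ruin a long training run.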
import gym
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import layers
import tensorflow as tf
import random
import time
import matplotlib.pyplot as plt
from collections import deque

env = gym.make("Breakout-ram-v0")
savevideo = lambda x: x % 20 == 0  # capture video every 20 games
env = gym.wrappers.Monitor(env, "largebatch", video_callable=savevideo, force=True)

INPUT_SHAPE = env.observation_space.shape
NUM_ACTIONS = env.action_space.n


def createRamModel():
    model = tf.keras.Sequential()
    model.add(layers.Dense(80, activation='relu', input_shape=INPUT_SHAPE))
    model.add(layers.Dense(60, activation='relu'))
    # output is the same size as the number of actions
    model.add(layers.Dense(NUM_ACTIONS, activation='linear'))
    model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
                  loss='mse',       # mean squared error
                  metrics=['mae'])  # mean absolute error
    return model


# target_model picks the actions; training_model is the one that gets fitted
target_model = createRamModel()
training_model = createRamModel()
training_model.set_weights(target_model.get_weights())

memory_actions = deque(maxlen=10000)
epsilon = 1
batch_size = 100
# number of steps taken before training network is updated
c = 300
episodes = []

for i in range(2000):
    total_reward = 0
    counter = 0
    obs = env.reset().reshape(1, -1)
    # the action the model takes is the output with the highest value
    action = np.argmax(target_model.predict(obs))
    done = False
    while not done:
        lastobs = obs
        obs, reward, done, info = env.step(action)
        total_reward += reward
        obs = obs.reshape(1, -1)
        # epsilon-greedy: act randomly with probability epsilon
        if random.random() > epsilon:
            action = np.argmax(target_model.predict(obs))
        else:
            action = env.action_space.sample()
        # each step is [state, action, reward, next state]
        step = [lastobs, action, reward, obs]
        memory_actions.append(step)
        if len(memory_actions) > 6000:
            print("training " + str(epsilon))
            # do training once we've sampled enough actions
            batch = random.sample(memory_actions, batch_size)
            current_states = np.concatenate([s[0] for s in batch])
            cur_q_vals = training_model.predict(current_states)
            next_states = np.concatenate([s[3] for s in batch])
            rewards = np.array([s[2] for s in batch])
            actions = to_categorical(
                np.array([s[1] for s in batch]), num_classes=NUM_ACTIONS)
            future_q_vals = training_model.predict(next_states)
            maxfuture_q = np.amax(future_q_vals, axis=1)
            updates = rewards + 0.99 * maxfuture_q
            # np.putmask reads values by flattened position, so repeat each
            # update across the action columns to keep the rows aligned
            values = np.repeat(updates[:, None], NUM_ACTIONS, axis=1)
            np.putmask(cur_q_vals, actions.astype(bool),
                       values.astype('float32', casting='same_kind'))
            training_model.fit(current_states, cur_q_vals, batch_size=batch_size)
            counter += 1
            epsilon = max(0.1, epsilon * 0.999999)
            if counter > c:
                counter = 0
                print("SET WEIGHTS")
                target_model.set_weights(training_model.get_weights())
    episodes.append(total_reward)