# q-learning adventure

Reinforcement learning, and Q-learning in particular, is fun.

Q-learning is a temporal difference (TD) learning algorithm which converges on the optimal policy of a world by learning action-utility pairs and making decisions based on what it learns through exploration. Q-learning and the applicability of a policy based learning agent are extremely fascinating to me – creating something that learns and adapts to its world in an observable / real-time way is a lot of fun and easily demonstrable to others.

With this project, I set out to demonstrate and explain Q-learning algorithm variants that diverge from standard Q-learning, each algorithm variant having their own unique pros and cons over the standard Q-learning algo. In particular, I researched three Q-learning variants in addition to the standard one:

- Delayed Q-Learning (Strehl, Alexander L., Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman, 2006)
- Double Q-Learning (Hasselt, Hado V., 2010)
- Speedy Q-Learning (Azar, Mohammad G., Remi Munos, Mohammad Ghavamzadeh, and Hilbert J. Kappen, 2011)

The project with each of these algorithms implemented on the same tile-based map can be found here:

The project was written in JavaScript using pixi.js.

### Q-learning

In order to understand Q-learning algorithm variants, one must first understand standard Q-learning! Q-learning is model-free, meaning that the agent does not need to know the transition model of the world it lives in. This holds true for all of the Q-learning variants described in this paper as well. The update function for standard Q-learning in the following:

`Q(s, a) = 𝑄(𝑠, 𝑎) + α(R(s) + γ(maxQ(s′,a′)) − 𝑄(𝑠, 𝑎))`

Q(s, a) is our q-table, a mapping of states-actions to construct an optimal policy by employing a Markov decision process (MDP). γ is our discount rate, or how much our algorithm favors near rewards vs. further ones. our discount rate used in the project for standard Q-learning is 1. α, the learning rate of the algorithm, biases the agent’s learning in favoring new information or favoring old. our learning rate is 0.8. And finally, the exploration factor ε of our algorithm determines the probability of the agent taking random steps outside of its learned q-table. For this project, the exploration rate is static to 0, meaning our algorithm may converge on a nonoptimal MDP. this makes some of the comparisons to other algorithms which have this step baked in naturally as opposed to via random chance interesting, although in future iterations of this project I should play around with standard Q-learning exploration rate as well as the other values (learning rate, discount rate).

### Delayed Q-learning

Delayed Q-learning differs from standard Q-learning primarily in that Q values are not always updated during each action. Instead, Q values have the chance to be updated for each state when a threshold of the number of occurrences that the state has been visited has been met. Unlike any other form of Q-learning or model-free learning, in fact, delayed Q-learning is PACMDP (Probably Approximately Correct in Markov Decision Processes). The algorithm is PACMDP because delayed Q-learning only updates Q(s, a) if the accumulated utilities for m samples clear our supplied ε1 value (between 0 and 1), meaning that its sample complexity is bounded by polynomial ε1. The full delayed Q-learning algorithm is below:

```
Inputs: γ, S, A, m, ε1
for all (s,a) do
Q(s, a) ← 1(1 – γ)
U(s,a) ← 0
l(s,a) ← 0
t(s,a) ← 0
LEARN(s, a) ← true
end for
t* ← 0
for t = 1,2,3, … do
Let s denote the state at time t
Choose action a := argmaxa’ϵA Q(s, a’)
Let r be the immediate reward and s’ the next state after executing
action a from state s
if LEARN(s, a) = true then
U(s, a) ← U(s, a) + r + γmax a’Q(s’, a’)
l(s, a) ← l(s, a) + 1
if l(s, a) = m then
if Q(s, a) – U(s, a)/m >= 2ε1 then
Q(s, a) ← U(s, a)/m + ε1
t* ← t
else if t(s, a) >= t* then
LEARN(s, a) ← false
end if
t(s, a) ← t, U(s, a) <- 0, l(s, a) <- 0
end if
else if (ts, a) < t* then
LEARN(s, a) ← true
end if
end for
```

The update function for delayed Q-learning is the following:

`Q(s, a) = 1/m ∑(R(s) + γ(maxQ(s’, a’) – Q(s, a)) + ε1`

With delayed Q-learning, depending on m (in our case, set to only 2), we see that it takes a fairly large amount of time for the agent to converge. Depending on great our m value is, this is stretched further. The γ rate is set to .8, and the ε1 (through experimenting) is set to 0.05. This algorithm converges to an optimal policy.

### Double Q-learning

Double Q-learning differs from standard Q-learning by doing a double estimation of Q; instead of one q-table, there are two – an A q-table and a B q-table. The learner chooses the highest action for each state from both tables for greedy navigation, yet only updates one table selected randomly after each action. Double Q-learning seeks to provide a Q-learning solution that is rid of the overestimative positive bias that standard Q-learning is known for, causing the standard algorithm to suffer in stochastic environments (unlike our simple, predictable grid world). This algorithm converges to an optimal policy without an exploration rate.

The algorithm for double Q-learning is shown below:

```
Initialize Qa, Qb, s
repeat
Choose a, based on Qa (s, *) and Qb(s, *), observe r, s’
Choose (e.g. random) either UPDATE(A) or UPDATE(B)
if UPDATE(A) then
Define a* = argmaxa Qa(s’, a)
Qa(s’, a) <- Qa(s’, a) + α(s, a)(r + γ Qb(s’, a*) - Qa(s, a))
else if UPDATE(B) then
Define b* = argmaxa Qb(s’, a)
Qb(s’, a) <- Qb(s’, a) + α(s, a)(r + γ Qa(s’, b*) – Qb(s, a))
end if
s <- s’
until end
```

As seen from the code above, the dual update rules are the following:

```
Qa(s’, a) = Qa(s’, a) + α(s, a)(r + γ Qb(s’, a ∗) − Qa(s, a))
Qb(s’, a) = Qb(s’, a) + α(s, a)(r + γ Qa(s’, b ∗) − Qb(s, a))
```

### Speedy Q-learning

Speedy Q-learning seeks to solve the problem of standard Q-learning’s slow convergence time by providing an algorithm intended to supersede standard Q-learning, essentially being

algorithmically superior (better time complexity) while maintaining similar characteristics.

While the optimal space complexity of both speedy Q-learning and standard Q-learning are the same ( O(n) ), speedy Q-learning has superior sample and computational complexity.

The algorithm for speedy Q-learning can be seen below:

```
for k := 0,1,2,3,…,T-1 do
αk := 1 / (k + 1)
for each (x, a)ϵ Z do
Generate the next state sample Q(s,a*)
TkQk-1(x, a) := r(x, a) + γMQk-1(yk)
TkQk(x, a) := r(x, a) + γMQk(yk)
Qk+1(x, a) := Qk(x, a)+ αk(TkQk(x, a) - Qk(x, a)) + (1 - αk)( TkQk(x, a) - TkQk-1(x, a));
end
end
return QT
```

The update rule for speedy Q-learning is the following:

```
Qk + 1(x, a) =∶ Qk(x, a) + αk(TkQk(x, a) + (1 − αk)( TkQk(x, a) − TkQk − 1(x, a));
```

##### References

Hasselt, Hado V. "Double Q-learning." Double Q-learning. Neural Information Processing Systems, 2010. Web. 15 Nov. 2014.

Azar, Mohammad G., Remi Munos, Mohammad Ghavamzadeh, and Hilbert J. Kappen. "Speedy QLearning." Neural Information Processing Systems(2011): n. pag. 2011. Web. 15 Nov. 2014.

Strehl, Alexander L., Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. "PAC ModelFree Reinforcement Learning." PAC Model-Free Reinforcement Learning (n.d.): n.pag. http://www.research.rutgers.edu/~lihong/pub/Strehl06Pac-Talk.pdf. Rutgers. Web. 15 Nov. 2014.

Strehl, Alexander L., Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. "PAC ModelFree Reinforcement Learning." PAC Model-Free Reinforcement Learning. N.p., n.d. Web. 15 Nov. 2014.

Russell, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach. S.l.: Pearson Education Limited, 2013. Print.