I believe there is a flaw in the QLearningAgent implementation in reinforcement.py, possibly resulting from how run_single_trial is written.
I was testing this with the 4x3 environment from 17.1. Upon reaching a terminal state (TERMINAL?(s1) == True), the __call__ function returns None, which causes run_single_trial to exit. If run_single_trial is then called again in a loop for multiple trials (i.e. for _ in range(N): run_single_trial(agent_program, mdp)), the first call to QLearningAgent.__call__ in the next trial has s1 equal to the initial state [(1,1) for the 4x3 environment], r1 equal to the reward for that state (-0.04 for the 4x3 environment), TERMINAL?(s) == True [as s is still the terminal state from the previous trial, either (4,2) or (4,3)], and a == None. This sets Q[s, None] = r1 = -0.04 instead of the actual terminal value of 1 or -1, which results in an incorrect policy. Simply changing line 93 to Q[s, None] = r fixes the issue, and the agent learns a correct policy.
I recognize this does not match the pseudocode in the book (21.8), and I am not certain if this is simply due to the implementation of run_single_trial. A better fix may be available which more closely matches the pseudocode from 21.8.
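To make the sequence of calls concrete, here is a minimal sketch of the behaviour I am describing. It is not the code from reinforcement.py: TinyAgent, run_trial, the fixed path, and the 'move' action are hypothetical stand-ins that keep only the s, a, r bookkeeping carried across trials, and the regular Q-update is left out. Coordinates follow the book's 1-indexed 4x3 grid.

```python
from collections import defaultdict

# Assumed 4x3 world values: terminal rewards and the per-step reward.
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STEP_REWARD = -0.04


class TinyAgent:
    """Hypothetical stand-in that keeps s, a, r between calls, as the agent does."""

    def __init__(self, use_fix):
        self.use_fix = use_fix
        self.Q = defaultdict(float)
        self.s = self.a = self.r = None

    def __call__(self, percept):
        s1, r1 = percept
        if self.s in TERMINALS:
            # Reported behaviour stores r1 (reward of the NEW state);
            # the proposed fix stores r (reward of the terminal state s).
            self.Q[self.s, None] = self.r if self.use_fix else r1
        # The ordinary temporal-difference update is omitted in this sketch.
        self.s, self.r = s1, r1
        self.a = None if s1 in TERMINALS else 'move'
        return self.a


def run_trial(agent, path):
    """Feed a fixed path of states to the agent; stop when it returns None."""
    for state in path:
        reward = TERMINALS.get(state, STEP_REWARD)
        if agent((state, reward)) is None:
            break


path = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
buggy, fixed = TinyAgent(use_fix=False), TinyAgent(use_fix=True)
for _ in range(2):  # the terminal entry is only written at the start of trial 2
    run_trial(buggy, path)
    run_trial(fixed, path)

print(buggy.Q[(4, 3), None])  # -0.04
print(fixed.Q[(4, 3), None])  # 1.0
```

In this sketch the terminal entry is only written on the first call of the following trial, which is why the reward of the start state leaks into it when r1 is used.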