QLearningAgent learns incorrect results

I believe there is a flaw in the `QLearningAgent` implementation in reinforcement.py, possibly resulting from how `run_single_trial` is written.

I was testing this with the 4x3 environment problem given in 17.1. Upon reaching a terminal state (`TERMINAL?(s1) == True`), the `__call__` function returns `None`, which causes `run_single_trial` to exit. If `run_single_trial` is then called again in a loop for multiple trials (i.e. `for _ in range(N): run_single_trial(agent_program, mdp)`), the next call to `QLearningAgent.__call__` has `s1` equal to the initial state [(1,1) for the 4x3 environment], `r1` equal to the reward for that state (-0.04 for the 4x3 environment), `TERMINAL?(s) == True` [as `s` is still either (4,2) or (4,3) from the previous trial], and `a == None`. This sets `Q[s, None] = r1 = -0.04` instead of the actual terminal value of 1 or -1, which results in an incorrect policy. Simply changing line 93 to `Q[s, None] = r` fixes the issue, and the agent then learns a correct policy.
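To make the sequence above concrete, here is a minimal sketch that reproduces only the bookkeeping around the terminal state. It is not the code from reinforcement.py: the class `TerminalBookkeeping`, the `use_fix` flag, and the `run` helper are hypothetical names invented for illustration, and the actual Q-learning update is left out entirely. It assumes the agent remembers the previous state and reward (`s`, `r`) across trials, as described above.

```python
from collections import defaultdict

# Rewards for the 4x3 world as described in the report (assumed values).
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STEP_REWARD = -0.04
START = (1, 1)


class TerminalBookkeeping:
    """Hypothetical stand-in for the agent: keeps only the (s, r) memory
    and the terminal-state assignment, with no learning update."""

    def __init__(self, use_fix):
        self.use_fix = use_fix
        self.Q = defaultdict(float)
        self.s = None  # previous state
        self.r = None  # reward received in the previous state

    def __call__(self, percept):
        s1, r1 = percept
        if self.s in TERMINALS:
            # Reported behavior stores r1 (reward of the new percept);
            # the proposed change stores r (reward of the terminal state).
            self.Q[self.s, None] = self.r if self.use_fix else r1
        self.s, self.r = s1, r1
        # Returning None in a terminal state ends the trial.
        return None if s1 in TERMINALS else "move"


def run(use_fix):
    agent = TerminalBookkeeping(use_fix)
    agent(((3, 3), STEP_REWARD))   # last non-terminal step of trial 1
    agent(((4, 3), 1.0))           # terminal reached, trial 1 ends
    agent((START, STEP_REWARD))    # first percept of trial 2
    return agent.Q[(4, 3), None]


print("reported behavior:", run(use_fix=False))
print("proposed change:  ", run(use_fix=True))
```

With these assumptions, the first call stores the start state's reward for the terminal entry and the second stores the terminal reward, which is the difference the proposed one-line change is meant to address.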

I recognize this does not match the pseudocode in the book (21.8), and I am not certain if this is simply due to the implementation of `run_single_trial`. A better fix may be available which more closely matches the pseudocode from 21.8.

QLearningAgent learns incorrect results #1247
