Q-learning is an off-policy algorithm, which means it can learn about the environment from trajectories where the actions weren't sampled from the agent's own policy (e.g. from a human demonstrator). I'll probably add a demo of this soon.
Q-learning is also a model-free algorithm, which means it's not doing any planning or tree search. It's basically just estimating the discounted future rewards it expects to see if it takes an action `a` in state `s` and follows the optimal policy from there.
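In other words, for a single transition the model is trained to predict the usual Q-learning target. Here's a hedged sketch of that quantity (all of these names are illustrative, not this library's internals):

```js
// given a transition (state, action, reward, nextState, done):
// maxNextQ = the model's largest Q-value prediction for nextState
// gamma    = discount factor (see the `gamma` option below)
const tdTarget = done ? reward : reward + gamma * maxNextQ

// the td-error the agent trains against:
const tdError = tdTarget - predictedQ // predictedQ = Q(state, action)
```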
This implementation uses experience replay and temporal difference error clamping, but currently does not do fitted Q iteration ("target" network) or double DQN.
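For intuition, here's a rough sketch of what the td-error clamping amounts to (this is what the `maxError` option below controls; it's not the library's actual internal code):

```js
// clamp the td-error to [-maxError, maxError] before it hits backprop,
// so one wild transition can't blow up the gradients
function clampTdError (tdError, maxError) {
  if (maxError === false) return tdError // false means no limit
  return Math.max(-maxError, Math.min(maxError, tdError))
}
```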
There's a demo using OpenAI's gym.
```
npm install weblearn weblearn-dqn
```
```js
const ndarray = require('ndarray')
const DQN = require('weblearn-dqn')
const { ReLU, Linear, MSE, SGD, Sequential } = require('weblearn')

// optimizer settings and layer sizes below are illustrative placeholders
let model = Sequential({
  optimizer: SGD(0.01),
  loss: MSE()
})

const STATE_SIZE = 2
const NUM_ACTIONS = 3

// model input should match state size
// and have one output for each action
model
  .add(Linear(STATE_SIZE, 10))
  .add(ReLU())
  .add(Linear(10, NUM_ACTIONS))

let agent = DQN({
  model: model,
  numActions: NUM_ACTIONS
})

// get these from your environment:
let observation = ndarray(new Float32Array([0.4, 0.7]), [STATE_SIZE])
let reward = 3
let done = false

let action = agent.step(observation, reward, done)
// `action` is an integer in the range of [0, NUM_ACTIONS)

// call this whenever ya wanna do a learn step.
// you can call this after each `agent.step()`, but you can also call it more or less often.
// just keep in mind, depending on the size of your model, this may block for a relatively long time.
let loss = agent.learn()
```
### `DQN(opts)`

`opts` should be an object with some of the following properties (see the sketch after this list):

- `model`: WebLearn model. required.
- `numActions`: number. number of actions. required.
- `epsilon`: number. initial probability of selecting an action at random (for exploration). optional.
- `memorySize`: number. how many of our most recent experiences to remember for learning. optional.
- `maxError`: number or `false`. limits the absolute value of the td-error from a single experience. `false` for no limit. optional.
- `finalEpsilon`: number. probability of selecting an action at random after `epsilonDecaySteps` steps of training. optional.
- `epsilonDecaySteps`: number. on what timestep should we reach `epsilon === finalEpsilon`? optional.
- `learnBatchSize`: number. how many transitions should we learn from each time we call `.learn()`? optional.
- `gamma`: number. parameter used for discounting rewards far in the future vs. rewards sooner. optional.
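Here's a hedged sketch of constructing an agent with all of the options spelled out (the specific values are only illustrative, not recommendations):

```js
let agent = DQN({
  model: model,             // required: a WebLearn model
  numActions: 3,            // required
  epsilon: 1.0,             // start fully exploratory...
  finalEpsilon: 0.1,        // ...and anneal down to 10% random actions
  epsilonDecaySteps: 10000, // reach finalEpsilon after 10k steps of training
  memorySize: 5000,         // remember the last 5000 transitions
  learnBatchSize: 32,       // transitions sampled per .learn() call
  maxError: 1,              // clamp td-error to [-1, 1] (false = no limit)
  gamma: 0.99               // discount factor for future rewards
})
```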
### `agent.step(observation, reward, done)`

Returns a number `action` (an integer specifying the index of the action to take). For how these arguments fit into a typical episode, see the sketch after the parameter list.
- `observation`: ndarray. some representation of the state of your environment. required.
- `reward`: number. this is what the agent will try to maximize. required.
- `done`: boolean. is this state the last state of an episode? optional.
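A minimal sketch of wiring these up to a gym-style environment (`env.reset()` and `env.step()` are assumptions about your environment, not part of this library):

```js
let observation = env.reset() // assumed: returns the initial state ndarray
let reward = 0
let done = false

while (!done) {
  // hand the agent the latest transition and get the next action back
  const action = agent.step(observation, reward, done)
  // assumed: env.step() returns the next { observation, reward, done }
  ;({ observation, reward, done } = env.step(action))
}

// pass in the terminal state too, so the agent can remember
// that the episode ended here
agent.step(observation, reward, done)
```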
### `agent.learn()`

Makes the agent do some learning. This can take a long time. Returns the loss from the learn step. The loss from a single learn step will be pretty noisy, since experiences are sampled from memory at random, but averaging over multiple `.learn()` calls can give a useful signal.
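For example, one way to smooth the training signal (just a sketch; the window of 100 is arbitrary):

```js
// average the noisy per-call loss over 100 learn steps
let total = 0
const N = 100
for (let i = 0; i < N; i++) {
  total += agent.learn()
}
console.log('average loss over', N, 'learn steps:', total / N)
```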