WebLearn DQN
Simple Deep Q-learning agent for WebLearn.
Reinforcement learning agent that uses a WebLearn model to approximate the Q-function for your environment.
Q-learning is an off-policy algorithm, which means it can learn about the environment from trajectories whose actions weren't chosen by the agent (e.g. actions taken by a human demonstrator). I'll probably add a demo of this soon.
Q-learning is also a model-free algorithm, which means it's not doing any planning or tree search. It's basically just estimating the discounted future rewards it expects to see if it takes an action a in state s and follows the optimal policy from there.
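To make "discounted future rewards" concrete, here is a minimal sketch of the two quantities involved. It is not code from this package: `discountedReturn` and `qTarget` are hypothetical helper names, and the Q-values are passed in as plain arrays instead of coming from a WebLearn model.

```javascript
// Discounted return of a reward sequence:
// r_0 + gamma * r_1 + gamma^2 * r_2 + ...
function discountedReturn (rewards, gamma) {
  let total = 0
  // walk backwards so each step is one multiply and one add
  for (let t = rewards.length - 1; t >= 0; t--) {
    total = rewards[t] + gamma * total
  }
  return total
}

// One-step Q-learning target for a transition (s, a, r, s'):
// just r if s' ends the episode, otherwise r + gamma * max over a' of Q(s', a').
// `nextQValues` stands in for the Q estimates of every action in s'.
function qTarget (reward, nextQValues, gamma, done) {
  if (done) return reward
  return reward + gamma * Math.max(...nextQValues)
}
```

The difference between `qTarget(...)` and the current estimate of Q(s, a) is the temporal difference (td) error that the learn step tries to shrink.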
This implementation uses experience replay and temporal difference error clamping, but currently does not do fitted Q iteration ("target" network) or double DQN.
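As a rough illustration of those two techniques (a sketch of the general idea, not this package's internals), experience replay can be a fixed-size buffer that overwrites its oldest entry, and error clamping a simple bound on the td-error. `createReplayMemory` and `clampTdError` are hypothetical names used only for this example.

```javascript
// Fixed-size replay memory: once full, the oldest transition is overwritten.
function createReplayMemory (maxSize) {
  const items = []
  let next = 0 // index of the slot to overwrite once the buffer is full
  return {
    add (transition) {
      if (items.length < maxSize) items.push(transition)
      else items[next] = transition
      next = (next + 1) % maxSize
    },
    // uniform random sample, with replacement
    sample (batchSize) {
      const batch = []
      for (let i = 0; i < batchSize && items.length > 0; i++) {
        batch.push(items[Math.floor(Math.random() * items.length)])
      }
      return batch
    },
    size () { return items.length }
  }
}

// Limit the absolute value of a td-error; `false` means no limit.
function clampTdError (error, maxError) {
  if (maxError === false) return error
  return Math.max(-maxError, Math.min(maxError, error))
}
```

Sampling old transitions at random breaks up the correlation between consecutive steps, and clamping keeps a single surprising transition from producing a huge update.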
There's a demo using OpenAI's gym in examples/
Usage
npm install weblearn weblearn-dqn
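    const ndarray = require('ndarray')
    const DQN = require('weblearn-dqn')
    const { ReLU, Linear, MSE, SGD, Sequential } = require('weblearn')

    let model = Sequential({
      optimizer: SGD(.01),
      loss: MSE()
    })

    const STATE_SIZE = 2
    const NUM_ACTIONS = 3

    // model input should match state size
    // and have one output for each action
    // (the hidden layer size of 20 is just an example)
    model.add(Linear(STATE_SIZE, 20))
         .add(ReLU())
         .add(Linear(20, NUM_ACTIONS))

    let agent = DQN({
      model: model,
      numActions: NUM_ACTIONS
    })

    // get these from your environment:
    // (example values; one entry per state dimension)
    let observation = ndarray([.2, .74])
    let reward = 3
    let done = false

    let action = agent.step(observation, reward, done)
    // `action` is an integer in the range of [0, NUM_ACTIONS)

    // call this whenever ya wanna do a learn step.
    // you can call this after each `agent.step()`, but you can also call it more or less often.
    // just keep in mind, depending on the size of your model, this may block for a relatively long time.
    let loss = agent.learn()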
let agent = DQN(opts)
opts should be an object with some of the following properties:
model: WebLearn model. required.
numActions: number. number of actions. required.
epsilon: number. initial probability of selecting an action at random (for exploration). optional.
memorySize: number. how many of our most recent experiences to remember for learning. optional.
maxError: number or false. limit the absolute value of the td-error from a single experience. false for no limit. optional.
finalEpsilon: number. probability of selecting an action at random after epsilonDecaySteps steps of training. optional.
epsilonDecaySteps: number. on what timestep should we reach epsilon === finalEpsilon? optional.
learnBatchSize: number. how many transitions should we learn from when we call agent.learn()? optional.
gamma: number. parameter used for discounting rewards far in the future vs. rewards sooner. optional.
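To show how epsilon, finalEpsilon and epsilonDecaySteps could fit together, here is a small sketch of an epsilon-greedy schedule. The description above doesn't say what shape the decay takes, so the linear interpolation here is an assumption, and `epsilonAt` and `chooseAction` are hypothetical helpers, not part of the package's API.

```javascript
// Exploration rate at a given timestep, ASSUMING a linear decay from
// `epsilon` down to `finalEpsilon` over `epsilonDecaySteps` steps.
function epsilonAt (step, epsilon, finalEpsilon, epsilonDecaySteps) {
  if (step >= epsilonDecaySteps) return finalEpsilon
  const progress = step / epsilonDecaySteps
  return epsilon + (finalEpsilon - epsilon) * progress
}

// Epsilon-greedy selection: with probability `eps` pick a random action,
// otherwise pick the action with the highest estimated Q-value.
// `random` is injectable so the behaviour can be tested deterministically.
function chooseAction (qValues, eps, random = Math.random) {
  if (random() < eps) return Math.floor(random() * qValues.length)
  let best = 0
  for (let i = 1; i < qValues.length; i++) {
    if (qValues[i] > qValues[best]) best = i
  }
  return best
}
```

With `epsilon = 1`, `finalEpsilon = 0.1` and `epsilonDecaySteps = 100`, this schedule would explore on every step at first and on roughly one step in ten from step 100 onwards.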
let action = agent.step(observation, reward, done)
returns a number action (integer specifying index of action to take).
observation: ndarray. some representation of the state of your environment. required.
reward: number. this is what the agent will try to maximize. required.
done: boolean. is this state the last state of an episode? optional.
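Here is one way the call could sit inside an episode loop. This is a sketch under assumptions: `env` is a hypothetical environment with `reset()` returning an observation and `step(action)` returning `{ observation, reward, done }`, the reward passed on the first call is taken to be 0, and the final observation is handed to the agent with `done = true` so it sees the last reward. With a real agent, observations should be ndarrays.

```javascript
// Run one episode and return the total reward collected.
// `agent` only needs a step(observation, reward, done) method.
function runEpisode (agent, env, maxSteps) {
  let observation = env.reset()
  let reward = 0 // assumption: nothing has been earned before the first action
  let done = false
  let totalReward = 0
  for (let t = 0; t < maxSteps; t++) {
    // the agent sees the newest state plus the reward its last action earned
    const action = agent.step(observation, reward, done)
    if (done) break // the terminal state has been reported, stop here
    const result = env.step(action)
    observation = result.observation
    reward = result.reward
    done = result.done
    totalReward += reward
  }
  return totalReward
}
```

Calling `agent.learn()` inside this loop after each `agent.step()`, or once per episode, is a trade-off between learning speed and how long each iteration blocks.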
let loss = agent.learn()
makes the agent do some learning. this can take a long time. returns the loss from the learn step. the loss from a single learn step will be pretty noisy since experiences are sampled from memory at random, but if you average over multiple .learn() calls, that might be useful.
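A minimal sketch of that averaging, using a sliding window over the most recent losses. `createRunningMean` is a hypothetical helper, not something the package provides.

```javascript
// Returns a function that accepts one value at a time and gives back
// the mean of the last `windowSize` values seen so far.
function createRunningMean (windowSize) {
  const values = []
  let sum = 0
  return function push (value) {
    values.push(value)
    sum += value
    // drop the oldest value once the window is full
    if (values.length > windowSize) sum -= values.shift()
    return sum / values.length
  }
}

// usage with an agent (not run here):
//   const smooth = createRunningMean(100)
//   const averageLoss = smooth(agent.learn())
```

A larger window gives a smoother curve but reacts more slowly when the loss actually changes.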
🤖