RL Data Studio — Deep Reinforcement Learning, visualised

Environment & Agent

goal · wall · cliff · agent

range [0.0, 0.0]

Calculation Inspector

Step the simulation to inspect a calculation.

Equation

The governing equation for each step appears here.

Learning Curves

Run the simulation to plot learning curves.

Hyperparameters

γ — discount

0.9

θ — DP threshold

0.001

random seed

About the Algorithm

Value Iteration

L2 · Bellman & DP

Dynamic Programming (model-based)

model-based (planning) · learns V(s) / π

Value iteration repeatedly applies the Bellman *optimality* backup V(s) ← maxₐ Σ_{s',r} p(s',r|s,a)[r + γV(s')] to every state until the largest change in V over a full sweep (delta) falls below the threshold θ. The maxₐ folds a greedy policy-improvement step into each evaluation sweep. Because it requires the full model p(s',r|s,a), it is *planning*, not learning. For γ < 1 it converges to the optimal value function v*, from which the optimal policy is read off greedily.
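
The backup described above can be sketched in a few lines of Python. This is a minimal illustration, not the studio's own implementation: the names `corridor_model` and `value_iteration`, the `(probability, next_state, reward)` model layout, and the four-state corridor (a stand-in for the grid) are all assumptions made for the example. The defaults for `gamma` and `theta` mirror the hyperparameters shown on this page.

```python
def corridor_model(n=4):
    """Hypothetical toy model: states 0..n-1 in a row, the last one is the goal.

    model[(s, a)] is a list of (probability, next_state, reward) outcomes.
    Action 0 moves left, action 1 moves right; entering the goal pays +1.
    """
    goal = n - 1
    model = {}
    for s in range(n - 1):
        left, right = max(s - 1, 0), s + 1
        model[(s, 0)] = [(1.0, left, 0.0)]
        model[(s, 1)] = [(1.0, right, 1.0 if right == goal else 0.0)]
    return model, {goal}


def value_iteration(model, n_states, n_actions, terminal, gamma=0.9, theta=1e-3):
    """In-place value iteration; returns (V, greedy policy)."""
    V = [0.0] * n_states  # terminal states keep value 0 throughout

    def q_value(s, a):
        # Expected one-step return: sum over outcomes of p * (r + gamma * V(s'))
        return sum(p * (r + gamma * V[s2]) for p, s2, r in model[(s, a)])

    while True:
        delta = 0.0  # largest change to V seen in this sweep
        for s in range(n_states):
            if s in terminal:
                continue
            best = max(q_value(s, a) for a in range(n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # stop once a full sweep barely changes V
            break

    # Read the policy off greedily from the converged values
    policy = {
        s: max(range(n_actions), key=lambda a: q_value(s, a))
        for s in range(n_states)
        if s not in terminal
    }
    return V, policy


model, terminal = corridor_model()
V, policy = value_iteration(model, n_states=4, n_actions=2, terminal=terminal)
print([round(v, 3) for v in V])  # [0.81, 0.9, 1.0, 0.0]
print(policy)                    # {0: 1, 1: 1, 2: 1}
```

Each step away from the goal discounts the value by γ = 0.9, and the greedy policy moves right in every non-terminal state. The sketch updates V in place during a sweep; keeping a separate copy per sweep is an equally valid variant that converges to the same values.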

Step Log

0 steps buffered

No steps yet.

Action Values

Click a cell in the grid to inspect its action-values.

Live Stats

steps: 0

episodes: 0

status: running

sweep: 0

delta: Infinity