RL Data Studio
Deep Reinforcement Learning, visualised
Environment & Agent
[Grid view: per-cell state values, all 0.0 before the first sweep]
Legend: goal, wall, cliff, agent
Value range: [0.0, 0.0]
Calculation Inspector

Step the simulation to inspect a calculation.

Equation

The governing equation for each step appears here.

Learning Curves

Run the simulation to plot learning curves.

Hyperparameters
γ (discount): 0.9
θ (DP threshold): 0.001
random seed
About the Algorithm
Value Iteration
L2 · Bellman & DP
Dynamic Programming (model-based)
model-based (planning) · learns V(s) / π

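Value iteration repeatedly applies the Bellman *optimality* backup V(s) ← maxₐ Σ p(s',r|s,a)[r + γV(s')] to every state, sweeping until the largest change in V during a sweep falls below the threshold θ. The maxₐ folds a greedy policy improvement into each evaluation sweep, so no separate policy needs to be stored while iterating. Because the backup needs the full model p(s',r|s,a), this is *planning*, not learning from sampled experience. The iterates converge to the optimal value function v*, and the optimal policy is then read off by acting greedily with respect to v*.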
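The backup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the studio's actual implementation: the 3x4 grid layout, the deterministic transitions, the reward values, and all function names are hypothetical. Only γ = 0.9 and θ = 0.001 are taken from the Hyperparameters panel, and delta starts at infinity as in the Live Stats panel.

```python
import math

# Hypothetical layout: 'S' start, 'G' goal, '#' wall, 'C' cliff, '.' free cell.
GRID = ["...G",
        ".#.C",
        "S..."]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA, THETA = 0.9, 0.001                    # from the Hyperparameters panel
R_GOAL, R_CLIFF, R_STEP = 1.0, -1.0, -0.01   # assumed reward values


def is_terminal(r, c):
    return GRID[r][c] in "GC"


def step(r, c, action):
    """Assumed deterministic model: returns (next_state, reward)."""
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    off_grid = not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]))
    if off_grid or GRID[nr][nc] == "#":
        nr, nc = r, c                        # blocked moves leave the agent in place
    cell = GRID[nr][nc]
    reward = R_GOAL if cell == "G" else R_CLIFF if cell == "C" else R_STEP
    return (nr, nc), reward


def backup(V, s, a):
    """One-step lookahead r + gamma * V(s') for a deterministic model."""
    next_state, reward = step(s[0], s[1], a)
    return reward + GAMMA * V[next_state]


def value_iteration():
    V = {(r, c): 0.0
         for r in range(len(GRID))
         for c in range(len(GRID[0]))
         if GRID[r][c] != "#"}
    delta, sweeps = math.inf, 0
    while delta >= THETA:
        delta = 0.0
        for s in V:
            if is_terminal(*s):
                continue                     # terminal values stay at 0
            best = max(backup(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                      # in-place update within the sweep
        sweeps += 1
    # Read the greedy policy off the converged values.
    policy = {s: max(ACTIONS, key=lambda a: backup(V, s, a))
              for s in V if not is_terminal(*s)}
    return V, policy, sweeps


if __name__ == "__main__":
    V, policy, sweeps = value_iteration()
    print(f"converged after {sweeps} sweeps")
    for s in sorted(policy):
        print(s, round(V[s], 3), policy[s])
```

The sketch updates V in place during a sweep, so later states in the same sweep already see the new values. A two-array variant that updates all states from the previous sweep's values converges to the same v*, typically in more sweeps.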

Step Log
0 steps buffered

No steps yet.

Action Values

Click a cell in the grid to inspect its action-values.

Live Stats
steps: 0
episodes: 0
status: running
sweep: 0
delta: Infinity