Environment & Agent
Legend: goal, wall, cliff, agent
range: [0.0, 0.0]
Calculation Inspector
Step the simulation to inspect a calculation.
Equation
The governing equation for each step appears here.
Learning Curves
Run the simulation to plot learning curves.
Hyperparameters
γ (discount): 0.9
θ (DP threshold): 0.001
random seed
About the Algorithm
Value Iteration
L2 · Bellman & DP
Dynamic Programming (model-based) · model-based (planning) · learns V(s) / π
Value iteration repeatedly applies the Bellman *optimality* backup V(s) ← maxₐ Σ_{s',r} p(s',r|s,a)[r + γV(s')] to every state until the largest change in V over a full sweep (delta) falls below the threshold θ. The maxₐ folds a greedy policy improvement into each evaluation sweep, so no separate policy is stored while iterating. The method requires the full model p(s',r|s,a), which makes it *planning* rather than learning. It converges to the optimal value function v*, from which an optimal policy is read off greedily.
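The backup described above can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's own code: the model layout `P[s][a]` (a list of `(prob, next_state, reward)` triples), the three-state chain, and the names `value_iteration` and `q_values` are all assumptions made for the example. The defaults for `gamma` and `theta` mirror the hyperparameters shown in the panel.

```python
import numpy as np

def q_values(P, V, s, gamma):
    """One-step lookahead: expected return of each action in state s."""
    return [sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in P[s]]

def value_iteration(P, gamma=0.9, theta=1e-3):
    """In-place value iteration over a hypothetical model P.

    P[s][a] is a list of (prob, next_state, reward) triples.
    A state with no actions (empty list) is treated as terminal.
    """
    V = np.zeros(len(P))
    sweeps = 0
    while True:
        delta = 0.0
        for s in range(len(P)):
            if not P[s]:                      # terminal: value stays 0
                continue
            best = max(q_values(P, V, s, gamma))
            delta = max(delta, abs(best - V[s]))
            V[s] = best                       # Bellman optimality backup
        sweeps += 1
        if delta < theta:                     # V has (nearly) stopped changing
            break
    # Read the greedy policy off the converged values (-1 marks terminal states).
    policy = [int(np.argmax(q_values(P, V, s, gamma))) if P[s] else -1
              for s in range(len(P))]
    return V, policy, sweeps

# Assumed toy model: chain 0 - 1 - 2, state 2 is the goal (terminal).
# Action 0 moves left (or stays at the edge), action 1 moves right.
# Entering the goal pays reward 1; every other move pays 0.
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 0.0)]],
    [[(1.0, 0, 0.0)], [(1.0, 2, 1.0)]],
    [],
]

V, policy, sweeps = value_iteration(P)
print(V, policy, sweeps)
```

The sketch updates V in place, so later states in a sweep already see the new values of earlier ones; keeping a separate copy of V per sweep is an equally valid variant and only changes how many sweeps are needed.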
Step Log
0 steps buffered
No steps yet.
Action Values
Click a cell in the grid to inspect its action-values.
Live Stats
steps: 0
episodes: 0
status: running
sweep: 0
delta: Infinity