This note helps me keep track of things I’ve read or heard but don’t quite understand (or can’t prove!).
- As with Q-learning, introducing non-linear function approximators means that convergence is no longer guaranteed. See: Continuous control with deep reinforcement learning
- Exploration algorithms for Markov Decision Processes (MDPs) are typically concerned with reducing the agent’s uncertainty over the environment’s reward and transition functions. In a tabular setting, this uncertainty can be quantified using confidence intervals derived from Chernoff bounds, or inferred from a posterior over the environment parameters. See: Unifying Count-Based Exploration and Intrinsic Motivation
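A minimal sketch of the tabular confidence-interval idea above, using a Hoeffding bound (a Chernoff-type bound) on the empirical mean reward of a state-action pair; the function name and the assumption that rewards lie in [0, 1] are my own illustrative choices, not from the cited paper:

```python
import math

def hoeffding_interval(rewards, delta=0.05):
    """Confidence interval for the mean of rewards assumed to lie in [0, 1].

    By Hoeffding's inequality, with probability >= 1 - delta the true mean
    lies within +/- sqrt(log(2 / delta) / (2 * n)) of the empirical mean,
    where n is the number of samples (here, visits to a state-action pair).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    width = math.sqrt(math.log(2 / delta) / (2 * n))
    return mean - width, mean + width

# The interval shrinks as the visit count grows, which is what lets an
# optimistic exploration rule prefer rarely visited state-action pairs.
lo_few, hi_few = hoeffding_interval([1, 0, 1, 1])        # 4 visits
lo_many, hi_many = hoeffding_interval([1, 0, 1, 1] * 25)  # 100 visits
assert hi_many - lo_many < hi_few - lo_few
```

An optimistic agent would then act greedily with respect to the upper bound (`hi`), so uncertainty itself drives exploration; the Bayesian alternative mentioned above would instead maintain a posterior (e.g. a Beta distribution over a Bernoulli reward) and sample from it.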