Lecture 3 : Planning by Dynamic Programming
궁선이
2019. 3. 24. 12:44
Planning by Dynamic Programming
1. Introduction
What is Dynamic Programming?
The term dynamic programming refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process
Two properties of Dynamic Programming
- Optimal substructure
  - Principle of optimality applies
  - Optimal solution can be decomposed into subproblems
- Overlapping subproblems
  - Subproblems recur many times
  - Solutions can be cached and reused
- Markov decision processes satisfy both properties
  - Bellman equation gives recursive decomposition (see the equation below)
  - Value function stores and reuses solutions
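For reference, the recursive decomposition is the Bellman expectation equation from Lecture 2, and the value function itself acts as the cache of subproblem solutions:

$$ v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_\pi(s') \Big) $$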
Planning and Learning
- Planning
  - The model of the environment is known
  - The agent performs computations with its model (without any external interaction)
  - The agent improves its policy
- Learning
  - The environment is initially unknown
  - The agent interacts with the environment
  - The agent improves its policy
Prediction and Control
- Prediction: evaluate the future
  - Given a policy
- Control: optimise the future
  - Find the best policy
Planning by Dynamic Programming
- Dynamic programming assumes full knowledge of the MDP
- It is used for planning in an MDP
DP works by alternating between Prediction and Control:
- Compute the value function for the current policy. (Prediction)
- Update the policy based on the computed value function. (Control)
2. Policy Evaluation
- The Prediction step
  - Compute the value function of the current policy.
Iterative Policy Evaluation
- A method that repeatedly applies the Bellman expectation backup.
Bellman Equation
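At iteration k + 1, every state s is backed up synchronously from the iteration-k values of its successor states:

$$ v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_k(s') \Big) $$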
Example
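The lecture's example evaluates the uniform random policy on a small 4x4 gridworld: two terminal corner states, reward -1 per step, undiscounted. Below is a minimal sketch of that computation; the state indexing, the stopping threshold, and the helper names are my own choices, not from the lecture.

```python
import numpy as np

# 4x4 gridworld from the lecture: 16 states, states 0 and 15 are terminal.
# Every move costs -1, the problem is undiscounted (gamma = 1), and the
# agent follows the uniform random policy over {up, down, left, right}.
N = 4
TERMINALS = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # row/column offsets

def step(state, action):
    """Deterministic transition: move if the target cell exists, else stay put."""
    row, col = divmod(state, N)
    d_row, d_col = action
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N and 0 <= new_col < N:
        return new_row * N + new_col
    return state

def iterative_policy_evaluation(theta=1e-4, gamma=1.0):
    """Sweep the Bellman expectation backup over all states until convergence."""
    v = np.zeros(N * N)
    while True:
        v_new = np.zeros_like(v)
        for s in range(N * N):
            if s in TERMINALS:
                continue                        # terminal states keep value 0
            # Uniform random policy: each of the 4 actions with probability 0.25
            v_new[s] = sum(0.25 * (-1.0 + gamma * v[step(s, a)]) for a in ACTIONS)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

print(iterative_policy_evaluation().reshape(N, N).round(1))
# Converges to the values shown in the lecture (0, -14, -20, -22, ...).
```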
3. Policy Iteration
How to Improve a Policy
Policy Iteration
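A minimal sketch of the full loop for a generic finite MDP. The encoding P[s][a] = list of (prob, next_state, reward) tuples is my own assumption, not something prescribed by the lecture.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-6):
    """Alternate full policy evaluation (Bellman expectation backups) with
    greedy policy improvement until the policy stops changing.

    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples.
    """
    policy = np.zeros(n_states, dtype=int)       # start from an arbitrary policy
    v = np.zeros(n_states)

    def q(s, a):
        # Expected one-step return of taking action a in state s, then using v
        return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

    while True:
        # 1. Policy evaluation: sweep until the value function stops changing
        while True:
            delta = 0.0
            for s in range(n_states):
                new_v = q(s, policy[s])
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v                     # in-place sweep
            if delta < theta:
                break
        # 2. Policy improvement: act greedily with respect to the evaluated v
        stable = True
        for s in range(n_states):
            best_a = max(range(n_actions), key=lambda a: q(s, a))
            if best_a != policy[s]:
                stable = False
                policy[s] = best_a
        if stable:
            return policy, v
```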
Policy Improvement (proof)
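In outline, the argument from the lecture: acting greedily for one step, $\pi'(s) = \operatorname{argmax}_a q_\pi(s, a)$, cannot decrease the value of any state over one step,

$$ q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s), $$

and unrolling this inequality one step at a time shows that the new policy is at least as good everywhere:

$$ v_\pi(s) \le q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \big] \le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s \big] \le \cdots \le v_{\pi'}(s). $$

If improvement stops, $q_\pi(s, \pi'(s)) = v_\pi(s)$ for all states, which is exactly the Bellman optimality equation, so $v_\pi(s) = v_*(s)$ and $\pi$ is already optimal.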
Extension to Policy Iteration
- But how many sweeps of Iterative Policy Evaluation (Section 2) do we really need?
- Do we have to run a very large number of evaluation sweeps (e.g. k = 1000) before each Policy Improvement step?
- Why not stop at a reasonable point (e.g. k = 5), or even after a single sweep (k = 1), and improve the policy right away? -> Value Iteration (a sketch of this truncated variant follows below)
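A small variation of the policy iteration sketch above, with evaluation truncated after k sweeps; the function name and the fixed number of outer rounds are illustrative assumptions.

```python
import numpy as np

def modified_policy_iteration(P, n_states, n_actions, k=5, gamma=0.9, n_rounds=100):
    """Policy iteration with policy evaluation truncated after k sweeps.

    With large k this behaves like full policy iteration; with k = 1 every
    round amounts to a single greedy backup per state, i.e. value iteration.
    """
    policy = np.zeros(n_states, dtype=int)
    v = np.zeros(n_states)

    def q(s, a):
        return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

    for _ in range(n_rounds):
        for _ in range(k):                       # truncated evaluation
            for s in range(n_states):
                v[s] = q(s, policy[s])
        for s in range(n_states):                # greedy improvement
            policy[s] = max(range(n_actions), key=lambda a: q(s, a))
    return policy, v
```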
4. Value Iteration
Principle of Optimality
- If s2 is a successor state of s1 (s1 -> s2) and the optimal value at s2 is already known, then the optimal value at s1 can also be found.
Deterministic Value Iteration
- If the optimal values of the successor states are known, V*(s) can be computed by starting from the final state and working backwards with the Bellman optimality equation (the one-step lookahead is shown below).
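The one-step lookahead applied to each state is the Bellman optimality backup:

$$ v_*(s) \leftarrow \max_{a \in \mathcal{A}} \Big( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_*(s') \Big) $$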
Example
Value Iteration
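A minimal sketch of synchronous value iteration, using the same assumed (prob, next_state, reward) encoding as the policy iteration sketch above. There is no explicit policy during the iterations; a greedy policy is extracted only after the values converge.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman optimality backup to every state."""
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # One-step lookahead over all actions, keep the maximum
            best = max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    # Extract a deterministic greedy policy from the converged values
    policy = np.array([
        max(range(n_actions),
            key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]))
        for s in range(n_states)
    ])
    return policy, v
```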
5. Summary
6. Asynchronous Dynamic Programming
- In-place dynamic programming: keep only one copy of the value function and back up states in place, so every backup immediately uses the newest values
- Prioritised sweeping: use the magnitude of the Bellman error to decide which state to back up next, updating the state with the largest remaining error first (see the sketch below)
- Real-time dynamic programming: back up only the states that the agent actually visits, using its real experience to guide state selection
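Below is a minimal sketch combining the first two ideas: in-place backups ordered by a Bellman-error priority queue. Rebuilding the whole queue after every backup and stopping after a fixed backup budget are my own simplifications; a real implementation would only re-prioritise the predecessors of the updated state.

```python
import heapq

def prioritised_value_iteration(P, n_states, n_actions, gamma=0.9,
                                max_backups=10000, theta=1e-6):
    """In-place asynchronous value iteration that always backs up the state
    with the largest Bellman error first (prioritised sweeping)."""
    v = [0.0] * n_states

    def backup(s):
        # Bellman optimality backup using the latest in-place values
        return max(
            sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            for a in range(n_actions)
        )

    def build_heap():
        # Priority = magnitude of the Bellman error; heapq pops the smallest,
        # so store the negated error.
        h = [(-abs(backup(s) - v[s]), s) for s in range(n_states)]
        heapq.heapify(h)
        return h

    heap = build_heap()
    for _ in range(max_backups):
        neg_err, s = heapq.heappop(heap)
        if -neg_err < theta:                     # largest remaining error is tiny
            break
        v[s] = backup(s)                         # in-place update of one state
        heap = build_heap()                      # simplification: refresh all priorities
    return v
```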