A.I/RL by David Silver

Lecture 3 : Planning by Dynamic Programming

궁선이 2019. 3. 24. 12:44

Planning by Dynamic Programming

1. Introduction

What is Dynamic Programming?

The term dynamic programming refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process.

Two properties of Dynamic Programming

  1. Optimal substructure
    • Principle of optimality applies
    • Optimal solution can be decomposed into subproblems
  2. Overlapping subproblems
    • Subproblems recur many times
    • Solutions can be cached and reused
  • Markov decision processes satisfy both properties
    • Bellman equation gives recursive decomposition
    • Value function stores and reuses solutions
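The recursive decomposition mentioned in the last bullet is the Bellman expectation equation for a fixed policy π (standard notation, reconstructed here rather than copied from the slides):

    v_\pi(s) = \sum_{a} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^{a} \, v_\pi(s') \Big)

The value of a state is written in terms of the values of its successor states (the subproblems), and those successor values are shared by many states, which is why caching them in a value function pays off.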

Planning and Learning

  • Planning
    • The model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
  • Learning
    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy

Prediction and Control

  • Prediction : Evaluate the future
    • Given a policy
  • Control : optimise the future
    • Find the best policy

Planning by Dynamic Programming

  • Dynamic programming assumes full knowledge of the MDP
  • It is used for planning in an MDP

DP alternates between Prediction and Control, proceeding as follows.

  1. Compute the value function of the current policy. (Prediction)
  2. Update the policy based on the computed value function. (Control)

2. Policy Evaluation

  • The Prediction step
  • Compute the value function of the current policy.

Iterative Policy Evaluation

  • A method that repeatedly applies the Bellman expectation backup to compute v_π.

Bellman Equation
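Applied as an iterative, synchronous backup, every state is updated in each sweep using the values from the previous sweep (standard form, reconstructed here):

    v_{k+1}(s) = \sum_{a} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^{a} \, v_k(s') \Big)

As k → ∞, v_k converges to v_π.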

Example
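The example used in the lecture is the small 4x4 gridworld: the two opposite corner states are terminal, every step gives reward -1, there is no discounting, and the policy is uniform random. Below is a minimal sketch of iterative policy evaluation on that gridworld; the state indexing and the policy_evaluation helper are my own choices, not from the slides.

```python
import numpy as np

# 4x4 gridworld from the lecture: states 0 and 15 are terminal,
# every move costs -1, gamma = 1, and the policy is uniform random.
N = 4
TERMINAL = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    """Deterministic transition: moves that leave the grid keep the agent in place."""
    r, c = divmod(s, N)
    dr, dc = a
    nr, nc = r + dr, c + dc
    if 0 <= nr < N and 0 <= nc < N:
        return nr * N + nc
    return s

def policy_evaluation(theta=1e-4):
    """Iterative policy evaluation with synchronous (full-sweep) backups."""
    v = np.zeros(N * N)
    while True:
        v_new = np.zeros_like(v)
        for s in range(N * N):
            if s in TERMINAL:
                continue
            # Bellman expectation backup under the uniform random policy
            v_new[s] = sum(0.25 * (-1.0 + v[step(s, a)]) for a in ACTIONS)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

print(policy_evaluation().reshape(N, N).round(1))
```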


3. Policy Iteration

How to Improve a Policy

Policy Iteration

Policy Improvement (Proof)
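The argument, sketched here from the standard policy improvement proof rather than copied verbatim from the slide: improve the policy by acting greedily, π'(s) = argmax_a q_π(s, a). Then for every state s,

    q_\pi(s, \pi'(s)) = \max_{a} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s)

and unrolling this one-step improvement along the whole trajectory gives

    v_\pi(s) \le q_\pi(s, \pi'(s)) \le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma \, q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \big] \le \dots \le v_{\pi'}(s)

so the greedy policy is at least as good everywhere. If improvement stops, then max_a q_π(s, a) = v_π(s), which is exactly the Bellman optimality equation, so π is already optimal.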

Extensions to Policy Iteration

  • But how many sweeps of Iterative Policy Evaluation (Section 2) do we actually need?
  • Do we really have to run a very large number of evaluation sweeps (e.g. k = 1000) before each Policy Improvement?
  • Why not cut evaluation off at some reasonable point (e.g. k = 5, or even a single sweep) and do Policy Improvement then? Stopping after k = 1 gives Value Iteration; see the sketch below.
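A minimal sketch of this idea (often called modified policy iteration), assuming the MDP is given as dense arrays P[a, s, s'] and R[a, s]; the function name, array layout, and the deterministic initial policy are my own choices, not from the lecture.

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.9, k=5, max_iters=1000):
    """Policy iteration with truncated evaluation.

    P[a, s, s2] : probability of moving from s to s2 under action a
    R[a, s]     : expected immediate reward for taking action a in state s
    k           : number of Bellman expectation backups per evaluation phase
    """
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)   # arbitrary deterministic initial policy
    v = np.zeros(n_states)
    for _ in range(max_iters):
        # 1. (Partial) policy evaluation: k synchronous expectation backups
        for _ in range(k):
            v = R[policy, np.arange(n_states)] + gamma * (P[policy, np.arange(n_states)] @ v)
        # 2. Policy improvement: act greedily with respect to v
        q = R + gamma * P @ v                 # one-step lookahead, shape (n_actions, n_states)
        new_policy = q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, v
```

With large k this behaves like classic policy iteration; with k = 1 the evaluation phase collapses to a single backup and the loop becomes value iteration.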

4. Value Iteration

Principle of Optimality

  • If s2 is a successor state of s1 (s1 -> s2) and the optimal value at s2 is known, then the optimal value at s1 can be found from it as well.

Deterministic Value Iteration

  • Using the Bellman Optimality Equation, V*(s) can be computed by starting from the final state and working backwards.
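The backup that is propagated backwards is the one-step lookahead from the Bellman optimality equation (standard notation, reconstructed here):

    v_*(s) \leftarrow \max_{a} \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^{a} \, v_*(s') \Big)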

Example

Value Iteration
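A minimal sketch of synchronous value iteration, under the same assumed P[a, s, s'] / R[a, s] array layout as the policy-iteration sketch above. Unlike policy iteration there is no explicit intermediate policy; a greedy policy is only read off at the end, and intermediate value functions need not correspond to any policy.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Synchronous value iteration: repeated Bellman optimality backups.

    P[a, s, s2] : probability of moving from s to s2 under action a
    R[a, s]     : expected immediate reward for taking action a in state s
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v          # one-step lookahead, shape (n_actions, n_states)
        v_new = q.max(axis=0)          # Bellman optimality backup for every state
        delta = np.max(np.abs(v_new - v))
        v = v_new
        if delta < theta:
            break
    greedy_policy = (R + gamma * P @ v).argmax(axis=0)
    return v, greedy_policy
```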

5. Summary

6. Asynchronous Dynamic Programming

  • In-place dynamic programming
  • Prioritised sweeping
  • Real-time dynamic programming
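For contrast with the synchronous sweeps above, here is a minimal sketch of the first variant, in-place value iteration: only one copy of the value function is kept, so backups later in a sweep immediately use values already updated earlier in the same sweep. The P[a, s, s'] / R[a, s] array layout is the same assumption as before.

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Asynchronous (in-place) value iteration: a single value array is updated,
    so each backup sees the newest values of the other states."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            old = v[s]
            # Bellman optimality backup for state s, using the partly updated v
            v[s] = max(R[a, s] + gamma * (P[a, s] @ v) for a in range(n_actions))
            delta = max(delta, abs(v[s] - old))
        if delta < theta:
            return v
```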