Dynamic Programming

An aside: back when I was doing contest problems I kept running into DP, but our team had two DP masters and I rarely handled the DP problems myself. I didn't expect to run into DP again while reading this book... DP really does take brains; it's rather hopeless...
UPD: I used to think value iteration would be less stable than policy iteration, but after asking why, I was told that both can in fact oscillate (i.e., both can be unstable).

Policy Evaluation (Prediction)

Policy evaluation (the prediction problem): we consider how to compute the state-value function \(v_\pi\) for an arbitrary policy \(\pi\).

For our purposes, iterative solution methods are most suitable. The initial approximation, \(v_0\), is chosen arbitrarily (except that the terminal state, if any, must be given value 0), and each successive
approximation is obtained by using the Bellman equation for \(v_\pi\) (4.4) as an update rule:
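Restating that update rule (the equation the colon above introduces), which turns the Bellman equation for \(v_\pi\) into an assignment:

\[
v_{k+1}(s) \;=\; \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr] \qquad \text{for all } s \in \mathcal{S}.
\]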

Iterative policy evaluation: assuming \(v_\pi\) exists, construct a sequence \(\{v_k\}\) such that \(v_k \to v_\pi\) as \(k \to \infty\).
Expected update: for each state \(s\), it replaces the old value of \(s\) with a new value obtained from the old values of the successor states of \(s\) and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated.
All the updates done in DP algorithms are called expected updates because they are based on an expectation over all possible next states rather than on a sample next state.
Pseudocode for the procedure:

Note that there are two ways to update the value function. The first keeps two arrays, old and new, and iterates exactly as in the formula above. The second is the in-place version: a newly computed value immediately replaces the old one, so later states in the same sweep use the already-updated values of the states they depend on. The in-place version usually converges faster, but the order in which states are swept then matters a great deal. Funnily enough, what this reminded me of was the space optimization of the 0/1 knapsack DP...
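A minimal in-place sketch in Python. The data layout is my own assumption (not from the book): `P[s][a]` is a list of `(prob, next_state, reward)` triples and `pi[s][a]` is the probability of choosing `a` in `s`; terminal states simply have no actions.

```python
# In-place iterative policy evaluation.
# Assumed format: P[s][a] = [(prob, next_state, reward), ...], pi[s][a] = probability.
def policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}               # v_0 is arbitrary; terminal states stay at 0
    while True:
        delta = 0.0
        for s in P:                       # one sweep; the sweep order affects convergence speed
            v_old = V[s]
            # expected update: average over actions under pi, then over transitions
            V[s] = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                      for p, s2, r in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                 # stop when the largest change in a sweep is tiny
            return V
```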

Policy Improvement

Our reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function \(v_\pi\) for an arbitrary deterministic policy \(\pi\). For some state \(s\) we would like to know whether or not we should change the policy to deterministically choose an action \(a \neq \pi(s)\).
In terms of the action-value function for state-action pairs: if, instead of taking the action prescribed by the current policy, we pick some action \(a\) ourselves and find that \(q_\pi(s, a)\) exceeds the original \(v_\pi(s)\), then greedily choosing this better \(a\) at the current step, even if we follow the old policy thereafter, is already better than the old policy. We can therefore construct a new, improved policy.
Policy improvement theorem: let \(\pi\) and \(\pi'\) be two deterministic policies such that \(q_\pi(s, \pi'(s)) \geq v_\pi(s)\) for all states \(s\). Then \(v_{\pi'}(s) \geq v_\pi(s)\) for all states \(s\).
Proof sketch:
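The key chain of inequalities, restated from the textbook argument: start from the assumption and keep re-expanding \(q_\pi\) with it until only rewards generated by \(\pi'\) remain.

\[
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
&= \mathbb{E}\bigl[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s)\bigr] \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\bigr] \\
&\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\bigr] \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s\bigr] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\bigr] = v_{\pi'}(s).
\end{aligned}
\]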

Policy improvement:
Extending the idea of choosing a better action in a single state, we can at every state pick the best action among all actions, using the rule \(\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_\pi(s')\bigr]\). The greedy policy takes the action that looks best in the short term, after one step of lookahead, according to \(v_\pi\).
The theorem also applies to stochastic policies: when several actions achieve the maximum, each maximizing action can be given a portion of the probability of being selected in the new greedy policy. Any apportioning scheme is allowed as long as all submaximal actions are given zero probability.
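A one-step-lookahead sketch of greedy improvement, using the same assumed `P[s][a]` format as the evaluation sketch above; it returns a deterministic policy, with ties broken arbitrarily.

```python
# Greedy policy improvement: pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) * (r + gamma * V[s']).
def policy_improvement(P, V, gamma=0.9):
    pi_new = {}
    for s in P:
        if not P[s]:                      # terminal state: nothing to choose
            continue
        # one-step lookahead: q(s, a) for every available action
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in P[s]}
        pi_new[s] = max(q, key=q.get)     # any tie-breaking among maximizers is fine
    return pi_new
```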

Policy Iteration

We can thus obtain a sequence of monotonically improving policies and value functions:
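Restated from the book, the sequence looks like this:

\[
\pi_0 \xrightarrow{\;E\;} v_{\pi_0} \xrightarrow{\;I\;} \pi_1 \xrightarrow{\;E\;} v_{\pi_1} \xrightarrow{\;I\;} \pi_2 \xrightarrow{\;E\;} \cdots \xrightarrow{\;I\;} \pi_* \xrightarrow{\;E\;} v_* .
\]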

Here E denotes a policy evaluation step and I a policy improvement step. Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.
Pseudocode:
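A sketch that simply alternates the two routines from the previous sections (same assumed `P[s][a]` format; `policy_evaluation` and `policy_improvement` are the functions defined above):

```python
# Policy iteration: full evaluation followed by greedy improvement, until the policy is stable.
def policy_iteration(P, gamma=0.9):
    # start from an arbitrary deterministic policy (first listed action in each state)
    pi = {s: next(iter(P[s])) for s in P if P[s]}
    while True:
        # evaluation expects action probabilities, so wrap the deterministic policy
        pi_probs = {s: {a: (1.0 if a == pi.get(s) else 0.0) for a in P[s]} for s in P}
        V = policy_evaluation(P, pi_probs, gamma)
        pi_next = policy_improvement(P, V, gamma)
        if pi_next == pi:                 # policy stable => optimal for this finite MDP
            return pi, V
        pi = pi_next
```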

Value Iteration

The policy evaluation step can be slow. In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration.
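In update form this is the rule (4.10) that the paragraph below refers to: one sweep folds the improvement max directly into the evaluation backup,

\[
v_{k+1}(s) \;=\; \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr] \qquad \text{for all } s \in \mathcal{S}.
\]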

Pseudocode:
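A sketch in the same assumed `P[s][a]` format; the only change from the policy-evaluation loop is the max over actions, and a final greedy step (reusing `policy_improvement` from above) reads off a deterministic optimal policy.

```python
# Value iteration: truncated evaluation with a built-in max over actions.
def value_iteration(P, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:                  # terminal state keeps value 0
                continue
            v_old = V[s]
            # the only difference from policy evaluation: max over actions
            # instead of averaging under a fixed policy
            V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    return policy_improvement(P, V, gamma), V
```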

Because the max operation in (4.10) is the only difference between these updates, this just means that the max operation is added to some sweeps of policy evaluation. All of these algorithms converge to an optimal policy for discounted finite MDPs.

Asynchronous Dynamic Programming

If the state set is very large, then even a single sweep can be prohibitively expensive.
Asynchronous DP algorithms are in-place iterative DP algorithms that are not organized in terms of systematic sweeps of the state set. These algorithms update the values of states in any order whatsoever, using whatever values of other states happen to be available.
The values of some states may be updated several times before the values of others are updated once. To converge correctly, however, an asynchronous algorithm must continue to update the values of all the states: it can’t ignore any state after some point in the computation. Asynchronous DP algorithms allow great flexibility in selecting states to update.
Of course, avoiding sweeps does not necessarily mean that we can get away with less computation. It just means that an algorithm does not need to get locked into any hopelessly long sweep before it can make progress improving a policy.
We can try to take advantage of this flexibility by selecting the states to which we apply updates so
as to improve the algorithm’s rate of progress. We can try to order the updates to let value information propagate from state to state in an efficient way.
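A toy sketch of one asynchronous variant under the same assumed `P[s][a]` format: in-place value-iteration backups applied to one state at a time, here chosen uniformly at random; practical schemes usually pick states by a priority heuristic (e.g. largest recent value change) rather than at random.

```python
import random

# Asynchronous in-place value iteration: no systematic sweeps; converges as long
# as every state keeps being selected.
def async_value_iteration(P, gamma=0.9, num_updates=100_000):
    V = {s: 0.0 for s in P}
    states = [s for s in P if P[s]]       # skip terminal states
    for _ in range(num_updates):
        s = random.choice(states)         # any selection rule that visits all states works
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                   for a in P[s])
    return V
```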

Generalized Policy Iteration

Generalized policy iteration (GPI): the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes. In other words, the three iterative DP methods above (policy iteration, value iteration, asynchronous DP) can all be viewed as instances of GPI.
Almost all RL methods can be described as GPI. It is easy to see that if both the evaluation process and the improvement process stabilize, that is, no longer produce changes, then the value function and policy must be optimal.

When both processes have stabilized, the optimal policy has been found.
The evaluation and improvement processes in GPI can be viewed as both competing and cooperating. Making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, and making the value function consistent with the policy typically causes that policy no longer to be greedy.

Efficiency of Dynamic Programming

DP is sometimes thought to be of limited applicability because of the curse of dimensionality, the fact that the number of states often grows exponentially with the number of state variables. Large state sets do create difficulties, but these are inherent difficulties of the problem, not of DP as a solution method. In fact, DP is comparatively better suited to handling large state spaces than competing methods such as direct search and linear programming.
In practice, DP methods can be used with today's computers to solve MDPs with millions of states, and they usually converge much faster than their theoretical worst-case run times, particularly if they are started with good initial value functions or policies.
