Featured post

USACO on BZOJ, Part 1

I heard that grinding the USACO problems on BZOJ is a good way to raise your rank, so I happily started slacking off with them~ I will post code and ideas for some of the problems here; problems I find especially nice are marked with ★ and may get their own write-ups.

Problems solved so far:

50/50

Finally finished them all. I write solutions so slowly QAQ. Time to start Part 2.


Featured post

Codeforces Extensive-Practice Plan

I picked out some easier Codeforces problems. Since I no longer feel like writing up whole contest sets, these scattered Codeforces problems go here.

Problems solved so far:

30

I'm so weak...


Model-Free Prediction

Aside: the book does not group all the methods together when discussing prediction; it goes method by method, covering prediction and then control for each. For readability, and because there were many parts of the lecture videos I did not fully understand, these notes follow the order of the videos.

Incremental Implementation

This part is in Chapter 2 of the book; I skipped it at the time, so I am recording it here.
Let \(Q_n\) denote the estimate of an action's value after it has been selected \(n-1\) times; it follows that
\(Q_{n+1}=\frac{1}{n}\sum_{i=1}^{n}R_i=Q_n+\frac{1}{n}\bigl[R_n-Q_n\bigr]\). This amounts to the newest reward \(R_n\) contributing a \(1/n\) share that pulls the old average toward it. This implementation requires memory only for \(Q_n\) and \(n\), and only a small computation for each new reward.
The general form is
\(\textit{NewEstimate} \leftarrow \textit{OldEstimate} + \textit{StepSize}\,\bigl[\textit{Target} - \textit{OldEstimate}\bigr]\).
The expression \([\textit{Target} - \textit{OldEstimate}]\) is an error in the estimate. The target is presumed to indicate a desirable direction in which to move, though it may be noisy. In the case above, for example, the target is the \(n\)th reward.
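As a tiny illustration (my own sketch, not from the book), the general rule with step size \(1/n\) reproduces the exact sample average:

```python
# General rule: NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
def update(old_estimate, target, step_size):
    return old_estimate + step_size * (target - old_estimate)

q, n = 0.0, 0
for reward in [1.0, 0.0, 1.0, 1.0]:   # example rewards (made up)
    n += 1
    q = update(q, reward, 1.0 / n)    # step size 1/n -> exact running average
print(q)                              # 0.75, the mean of the four rewards
```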

Monte Carlo Methods

Since I have only gone through part of this, there may be some inaccuracies...
Monte Carlo methods require only experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
Monte Carlo methods are defined here only for episodic tasks. We assume experience is divided into episodes, and that all episodes eventually terminate no matter what actions are selected. Only on the completion of an episode are value estimates and policies changed.
We can think of MC as averaging complete returns \(G_t\).

Monte Carlo Prediction

Goal: learn the state-value function \(v_\pi\) for a given policy \(\pi\).
An obvious way to estimate it from experience, then, is simply to average the returns observed after visits to that state. As more returns are observed, the average should converge to the expected value.
Each occurrence of state s in an episode is called a visit to s.
Depending on how visits to the same state \(s\) within an episode are counted, we get first-visit MC and every-visit MC.
First-visit MC pseudocode:

Note: \(G\) is computed by working backwards through the episode (a discount factor can also be applied here), as in the sketch below.
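The pseudocode image did not survive, so here is a minimal Python sketch of first-visit MC prediction. The data layout is my own assumption: each episode is a list of `(state, reward)` pairs, where `reward` is the reward received after leaving that state (i.e. \(R_{t+1}\) for \(S_t\)).

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate v_pi by averaging first-visit returns."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    V = defaultdict(float)
    for episode in episodes:
        # time step of the first occurrence of each state in this episode
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # work backwards through the episode, accumulating the return G
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:          # only the first visit counts
                returns_sum[s] += G
                returns_cnt[s] += 1
                V[s] = returns_sum[s] / returns_cnt[s]
    return V
```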
Both first-visit MC and every-visit MC converge to \(v_\pi(s)\) as the number of visits (or first visits) to s goes to infinity. By the law of large numbers the sequence of averages of these estimates converges to their expected value.
A thought prompted by the Blackjack example: even though we completely understand the Blackjack environment, we still cannot easily write down the exact distribution of the next event; DP methods require the full probability distributions, which is often unrealistic. In contrast, generating the sample games required by Monte Carlo methods is easy. The ability of Monte Carlo methods to work with sample episodes alone can be a significant advantage even when one has complete knowledge of the environment's dynamics.
Whereas the DP diagram includes only one-step transitions, the Monte Carlo diagram goes all the way to the end of the episode.
An important fact about Monte Carlo methods is that the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state, as is the case in DP. In other words, Monte Carlo methods do not bootstrap.
The computational expense of estimating the value of a single state is independent of the number of states. One can generate many sample episodes starting from the states of interest, averaging returns from only these states, ignoring all others.

Temporal-Difference Learning

This part is mainly about TD(0).
TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas.
TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap)
constant-α MC: \(V(S_t) \leftarrow V(S_t)+\alpha[G_t-V(S_t)]\), where \(G_t\) is the actual return following time \(t\) and \(\alpha\) is a constant step-size parameter.
TD methods need to wait only until the next time step. At time \(t+1\) they immediately form a target and make a useful update using the observed reward \(R_{t+1}\) and the estimate \(V(S_{t+1})\). The simplest TD method makes the update \(V(S_t) \leftarrow V(S_t)+\alpha[R_{t+1}+\gamma V(S_{t+1})-V(S_t)]\).
Pseudocode:
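The pseudocode image is missing, so here is a minimal tabular TD(0) sketch. The environment interface is an assumption (a simplified gym-like API where `env.reset()` returns a state and `env.step(a)` returns `(next_state, reward, done)`), as is the `policy` callable that samples an action.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) for estimating v_pi (a sketch under assumed interfaces)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s2])
            V[s] += alpha * (target - V[s])   # V(S_t) += alpha*[R_{t+1} + gamma*V(S_{t+1}) - V(S_t)]
            s = s2
    return V
```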

The book has a passage on why MC, DP, and TD all involve estimates: MC because the expectation of \(G_t\) is replaced by sampled returns; DP because, when computing the next \(v_\pi\), it uses the current estimates of the successor states; and TD combines the two, since it both samples and uses the current estimate of the successor state.
Sample updates differ from the expected updates of DP methods in that they are based on a single sample successor rather than on a complete distribution of all possible successors.
TD error: \(\delta_t=R_{t+1}+\gamma V(S_{t+1})-V(S_t)\).
If the array V does not change during the episode (as it does not in Monte Carlo methods), then the Monte Carlo error can be written as a sum of TD errors:
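Reconstructing the derivation behind this identity (the standard telescoping argument; note that \(G_T = V(S_T) = 0\) at the terminal time):

\[
\begin{aligned}
G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \\
&= \delta_t + \gamma\bigl(G_{t+1} - V(S_{t+1})\bigr) \\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\bigl(G_{t+2} - V(S_{t+2})\bigr) \\
&= \cdots = \sum_{k=t}^{T-1} \gamma^{k-t}\,\delta_k .
\end{aligned}
\]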

Advantages of TD Prediction Methods

TD methods have an advantage over DP methods in that they do not require a model of the environment, of its reward and next-state probability distributions.
The next most obvious advantage of TD methods over Monte Carlo methods is that they are naturally implemented in an online, fully incremental fashion: with TD methods one need wait only one time step.
TD methods are much less susceptible to these problems because they learn from each transition regardless of what subsequent actions are taken. (I did not fully understand this comparison; it is probably relative to control-style settings or similar.)
For any fixed policy \(\pi\), TD(0) has been proved to converge to \(v_\pi\), in the mean for a constant step-size parameter if it is sufficiently small.
It is hard to say in general which of MC and TD converges to the true values faster, but in practice TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks.

n-step Bootstrapping

The backup diagrams of n-step methods:

The methods that use n-step updates are still TD methods because they still change an earlier estimate based on how it differs from a later estimate.
The natural state-value learning algorithm for using n-step returns is thus \(V_{t+n}(S_t)=V_{t+n-1}(S_t)+\alpha[G_{t:t+n}-V_{t+n-1}(S_t)]\). Note that no changes at all are made during the first \(n-1\) steps of each episode.
Pseudocode:
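The pseudocode image is missing; below is a simplified Python sketch of n-step TD prediction. Unlike the book's online algorithm (which updates with a delay of n steps during the episode), this version collects the whole episode first and then applies all updates, using the same `(state, reward)` data layout assumed in the MC sketch above.

```python
from collections import defaultdict

def n_step_td_prediction(episodes, n, alpha=0.1, gamma=1.0):
    """Tabular n-step TD for estimating v_pi (offline, per-episode sketch)."""
    V = defaultdict(float)
    for episode in episodes:
        states = [s for s, _ in episode]
        rewards = [r for _, r in episode]   # rewards[i] is R_{i+1}, received after leaving states[i]
        T = len(episode)
        for tau in range(T):                # state whose value gets updated
            # n-step return G_{tau:tau+n}: up to n rewards ...
            G = sum(gamma ** (i - tau) * rewards[i]
                    for i in range(tau, min(tau + n, T)))
            # ... plus a bootstrapped tail if the episode has not ended yet
            if tau + n < T:
                G += gamma ** n * V[states[tau + n]]
            V[states[tau]] += alpha * (G - V[states[tau]])
    return V
```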

error reduction property: the worst-case error of the expected n-step return is at most \(\gamma^n\) times the worst-case error of the current value estimate,
\(\max_s \bigl|\mathbb{E}_\pi[G_{t:t+n} \mid S_t=s]-v_\pi(s)\bigr| \le \gamma^n \max_s \bigl|V_{t+n-1}(s)-v_\pi(s)\bigr|\).
Because of the error reduction property, one can show formally that all n-step TD methods converge to
the correct predictions under appropriate technical conditions.

Eligibility Traces

The mechanism is a short-term memory vector, the eligibility trace \(z_t \in \mathbb{R}^d\), that parallels the long-term weight vector \(w_t \in \mathbb{R}^d\). The rough idea is that when a component of \(w_t\) participates in producing an estimated value, the corresponding component of \(z_t\) is bumped up and then begins to fade away.
Forward views are always somewhat complex to implement because the update depends on later things that are not available at the time. MC and the n-step methods above are both of this kind.


To be filled in later...

Dynamic Programming

Aside: I used to run into DP all the time in contests, but our team had two DP masters, so I rarely handled the DP problems myself; I did not expect to run into DP again while reading this book... DP really demands brainpower, and I am feeling pretty hopeless about it...
UPD: I used to think value iteration would be less stable than policy iteration, but after asking why, I was told that both can oscillate (i.e. both can be unstable).

Policy Evaluation (Prediction)

Policy evaluation (the prediction problem): we consider how to compute the state-value function \(v_\pi\) for an arbitrary policy \(\pi\).

For our purposes, iterative solution methods are most suitable. The initial approximation, \(v_0\), is chosen arbitrarily (except that the terminal state, if any, must be given value 0), and each successive approximation is obtained by using the Bellman equation for \(v_\pi\) (4.4) as an update rule:
\(v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma v_k(s')\bigr]\), for all \(s \in \mathcal{S}\).
Iterative policy evaluation: assuming \(v_\pi\) exists, construct a sequence \(\{v_k\}\) such that \(v_k \to v_\pi\) as \(k \to \infty\).
Expected update: for each state \(s\), the old value of \(s\) is replaced with a new value obtained from the old values of the successor states of \(s\) and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated.
All the updates done in DP algorithms are called expected updates because they are based on an expectation over all possible next states rather than on a sample next state.
Pseudocode for the procedure:

Note that there are two ways to update the value function. The first keeps two arrays (old/new) and iterates exactly as in the formula above. The second is in-place: a newly computed value immediately replaces the old one, so states computed later in the same sweep already use the updated values of the states they depend on. The in-place version usually converges faster, but the order in which states are swept then matters a lot. (What this reminded me of, embarrassingly, was the space optimization of the 0/1 knapsack DP...) A minimal sketch of the in-place version follows.
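A minimal in-place sketch in Python. The interfaces are made up for illustration: `policy[s][a]` gives \(\pi(a \mid s)\), and `transitions[s][a]` gives a list of `(prob, next_state, reward, done)` tuples describing the one-step dynamics.

```python
def policy_evaluation(policy, transitions, gamma=1.0, theta=1e-8):
    """In-place iterative policy evaluation; sweeps until the largest
    change in a sweep is below `theta`."""
    V = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s in transitions:                  # one sweep over the states
            v_new = 0.0
            for a, pi_sa in policy[s].items():
                for prob, s2, r, done in transitions[s][a]:
                    v_new += pi_sa * prob * (r + (0.0 if done else gamma * V[s2]))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                       # in-place: later states in this sweep
                                               # already see the updated value
        if delta < theta:
            return V
```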

Policy Improvement

Our reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function \(v_\pi\) for an arbitrary deterministic policy \(\pi\). For some state \(s\) we would like to know whether or not we should change the policy to deterministically choose an action \(a \neq \pi(s)\).
In terms of the action-value function: if, instead of following the current policy in state \(s\), we pick some action \(a\) ourselves and find that \(q_\pi(s,a)\) is greater than \(v_\pi(s)\), then greedily choosing this better \(a\) once, even if we follow the old policy thereafter, is already better than the old policy, so we can construct a new, improved policy from it.
Policy improvement theorem: for two deterministic policies \(\pi\) and \(\pi'\), if \(q_\pi(s,\pi'(s)) \geq v_\pi(s)\) for all states \(s\), then \(v_{\pi'}(s) \geq v_\pi(s)\) for all states \(s\).
Proof sketch:
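The proof image did not survive; the standard argument repeatedly expands \(q_\pi\) and applies the assumption:

\[
\begin{aligned}
v_\pi(s) &\le q_\pi\bigl(s, \pi'(s)\bigr) \\
&= \mathbb{E}\bigl[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s)\bigr] \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\bigr] \\
&\le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma q_\pi\bigl(S_{t+1}, \pi'(S_{t+1})\bigr) \mid S_t = s\bigr] \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s\bigr] \\
&\le \cdots \le \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\bigr] = v_{\pi'}(s).
\end{aligned}
\]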

policy improvement:
Extending the idea of picking a better action in a single state to picking, in every state, the best action among all actions, we get the rule \(\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s',r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]\). The greedy policy takes the action that looks best in the short term, after one step of lookahead, according to \(v_\pi\).
The theorem also applies to stochastic policies. When several actions achieve the maximum, each maximizing action can be given a portion of the probability of being selected in the new greedy policy; any apportioning scheme is allowed as long as all submaximal actions are given zero probability.

Policy Iteration

We can thus obtain a sequence of monotonically improving policies and value functions:
\(\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*\)
Here \(\xrightarrow{E}\) denotes a policy evaluation step and \(\xrightarrow{I}\) a policy improvement step. Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.
Pseudocode:
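A sketch of policy iteration, reusing the hypothetical `transitions` model interface and the `policy_evaluation` function from the evaluation sketch above (again, these interfaces are assumptions, not the book's notation).

```python
def q_from_v(V, transitions, s, gamma):
    """Action values in state s under the current value estimate V."""
    return {a: sum(prob * (r + (0.0 if done else gamma * V[s2]))
                   for prob, s2, r, done in outcomes)
            for a, outcomes in transitions[s].items()}

def policy_iteration(transitions, gamma=1.0, theta=1e-8):
    """Alternate full evaluation (E) and greedy improvement (I) until stable."""
    # start from an arbitrary deterministic policy (first listed action)
    policy = {s: {next(iter(transitions[s])): 1.0} for s in transitions}
    while True:
        V = policy_evaluation(policy, transitions, gamma, theta)   # E step
        stable = True
        for s in transitions:                                      # I step
            old_action = max(policy[s], key=policy[s].get)
            q = q_from_v(V, transitions, s, gamma)
            best_action = max(q, key=q.get)
            if best_action != old_action:
                stable = False
            policy[s] = {best_action: 1.0}
        if stable:
            return policy, V
```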

Value Iteration

Policy evaluation can be slow. In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration.
One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration.

Pseudocode:
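A sketch of value iteration under the same hypothetical model interface; note how the max over actions replaces the policy-weighted average of the evaluation sweep.

```python
def value_iteration(transitions, gamma=1.0, theta=1e-8):
    """Single truncated-evaluation sweeps with a max over actions,
    repeated until the value function stops changing much."""
    V = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s in transitions:
            q = q_from_v(V, transitions, s, gamma)   # helper from the sketch above
            v_new = max(q.values())                  # max replaces the policy average
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # output a deterministic policy greedy with respect to the final V
    policy = {}
    for s in transitions:
        q = q_from_v(V, transitions, s, gamma)
        policy[s] = max(q, key=q.get)
    return policy, V
```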

Because the max operation in (4.10) is the only difference between these updates, this just means that the max operation is added to some sweeps of policy evaluation. All of these algorithms converge to an optimal policy for discounted finite MDPs.

Asynchronous Dynamic Programming

If the state set is very large, then even a single sweep can be prohibitively expensive.
Asynchronous DP algorithms are in-place iterative DP algorithms that are not organized in terms of systematic sweeps of the state set. These algorithms update the values of states in any order whatsoever, using whatever values of other states happen to be available.
The values of some states may be updated several times before the values of others are updated once. To converge correctly, however, an asynchronous algorithm must continue to update the values of all the states: it can’t ignore any state after some point in the computation. Asynchronous DP algorithms allow great flexibility in selecting states to update.
Of course, avoiding sweeps does not necessarily mean that we can get away with less computation. It just means that an algorithm does not need to get locked into any hopelessly long sweep before it can make progress improving a policy.
We can try to take advantage of this flexibility by selecting the states to which we apply updates so
as to improve the algorithm’s rate of progress. We can try to order the updates to let value information propagate from state to state in an efficient way.

Generalized Policy Iteration

Generalized policy iteration (GPI): the general idea of letting policy-evaluation and policy-improvement processes interact, independent of the granularity and other details of the two processes. In this sense, all three of the iterative DP methods above can be viewed as instances of GPI.
Essentially all RL methods can be described as GPI. It is easy to see that if both the evaluation process and the improvement process stabilize, that is, no longer produce changes, then the value function and policy must be optimal.

When both processes have stabilized, we have found an optimal policy.
The evaluation and improvement processes in GPI can be viewed as both competing and cooperating. Making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, and making the value function consistent with the policy typically causes that policy no longer to be greedy.

Efficiency of Dynamic Programming

DP is sometimes thought to be of limited applicability because of the curse of dimensionality, the fact that the number of states often grows exponentially with the number of state variables. Large state sets do create difficulties, but these are inherent difficulties of the problem, not of DP as a solution method. In fact, DP is comparatively better suited to handling large state spaces than competing methods such as direct search and linear programming.
DP methods can be used with today's computers to solve MDPs with millions of states. In practice, they usually converge much faster than their theoretical worst-case run times suggest, particularly if they are started with good initial value functions or policies.

Finite Markov Decision Processes Learning


Aside: over Mid-Autumn I was out socializing with a few friends, spending basically every evening on movies, barbecue, and bars, which meant the code bugs never got fixed (honestly I did not want to fix them anyway; it is not even my code, so why should I be the one fixing it... I really hate fixing bugs in other people's code). I only read one paper that I have to present at reading group in a few days (and will probably have to make slides for it). So let me hurry up and go through the Finite Markov Decision Processes chapter... I read the book directly without watching the lecture videos, so I will add to these notes after tomorrow's talk and after watching the videos.

The Agent-Environment Interface


We generally think of an MDP as giving rise to a sequence \(S_0, A_0, R_1, S_1, A_1, R_2, \dots\); note that the reward obtained for the action \(A_t\) taken at time \(t\) is \(R_{t+1}\).

the probability of each possible value for \(S_t\) and \(R_t\) depends only on the immediately preceding state and action, \(S_{t−1}\) and \(A_{t−1}\).
This is best viewed as a restriction not on the decision process but on the state.
Markov property:The state must include information about all aspects of the past agent-environment interaction that make a difference for the future.
From the dynamics function \(p(s',r \mid s,a) \doteq \Pr\{S_t=s', R_t=r \mid S_{t-1}=s, A_{t-1}=a\}\) we can derive three other quantities.
State-transition probabilities:
\(p(s' \mid s,a) \doteq \Pr\{S_t=s' \mid S_{t-1}=s, A_{t-1}=a\} = \sum_{r} p(s',r \mid s,a)\)

The expected reward for state-action pairs:
\(r(s,a) \doteq \mathbb{E}[R_t \mid S_{t-1}=s, A_{t-1}=a] = \sum_{r} r \sum_{s'} p(s',r \mid s,a)\)

The expected reward for state-action-next-state triples:
\(r(s,a,s') \doteq \mathbb{E}[R_t \mid S_{t-1}=s, A_{t-1}=a, S_t=s'] = \sum_{r} r \, \frac{p(s',r \mid s,a)}{p(s' \mid s,a)}\)

agent-environment boundary

The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment.
We always consider the reward computation to be external to the agent because it defines the task facing the agent and thus must be beyond its ability to change arbitrarily.
The agent-environment boundary represents the limit of the agent’s absolute control, not of its knowledge.

MDP framework

One signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent's goal (the rewards).

Goals and Rewards

Reward hypothesis: that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.

Returns and Episodes

The return is the sum of the rewards:\(G_t=R_{t+1}+R_{t+2}+…+R_T\).
Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states.
The next episode begins independently of how the previous one ended. Tasks with episodes of this kind are called episodic tasks.
continuing tasks:goes on continually without limit
discounted return:
\(G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\), where \(0 \le \gamma \le 1\) is the discount rate.
Recursion: \(G_t=R_{t+1}+\gamma G_{t+1}\)
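Expanding the definition makes the recursion immediate:

\[
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = R_{t+1} + \gamma\bigl(R_{t+2} + \gamma R_{t+3} + \cdots\bigr)
    = R_{t+1} + \gamma G_{t+1}.
\]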

Unified Notation for Episodic and Continuing Tasks

absorbing state:

\(G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k\), where we allow either \(T=\infty\) or \(\gamma=1\), but not both; this convention lets one notation cover both kinds of tasks.

Policies and Value Functions

value functions:functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
A policy \(\pi(a \mid s)\) is a mapping from states to probabilities of selecting each possible action: \(\pi(a \mid s)\) is the probability that \(A_t=a\) when \(S_t=s\).
The value function of a state \(s\) under a policy \(\pi\), denoted \(v _ { \pi } ( s ) \) is the expected return when starting in \(s\) and following \(\pi\) thereafter.
We call the function \(v_\pi\) the state-value function for policy \(\pi\):
\(v _ { \pi } ( s ) \doteq \mathbb { E } _ { \pi } \left[ G _ { t } | S _ { t } = s \right] = \mathbb { E } _ { \pi } \left[ \sum _ { k = 0 } ^ { \infty } \gamma ^ { k } R _ { t + k + 1 } | S _ { t } = s \right] \),for all \(s \in \mathcal { S}\)
\(q_{\pi}\) is the action-value function for policy \(\pi\):
\(q _ { \pi } ( s , a ) \doteq \mathbb { E } _ { \pi } \left[ G _ { t } | S _ { t } = s , A _ { t } = a \right] = \mathbb { E } _ { \pi } \left[ \sum _ { k = 0 } ^ { \infty } \gamma ^ { k } R _ { t + k + 1 } | S _ { t } = s , A _ { t } = a \right]\)
Bellman equation for \(v_{\pi}\):
\(v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma v_\pi(s')\bigr]\), for all \(s \in \mathcal{S}\).
It expresses a relationship between the value of a state and the values of its successor states.
It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.

Optimal Policies and Optimal Value Functions

A policy \(\pi\) is defined to be better than or equal to a policy \(\pi'\) if its expected return is greater than or equal to that of \(\pi'\) for all states.
In other words, \(\pi \geq \pi'\) if and only if \(v_\pi(s) \geq v_{\pi'}(s)\) for all \(s \in \mathcal{S}\).
optimal state-value function:
\(v_{*} ( s ) \doteq \max _ { \pi } v _ { \pi } ( s )\)
\(q_{*}\): the optimal action-value function:
\(q_*(s,a) \doteq \max_\pi q_\pi(s,a) = \mathbb{E}\bigl[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t=s, A_t=a\bigr]\)
Bellman optimality equation

Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:
\(v_*(s) = \max_a \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma v_*(s')\bigr]\)
Backup diagram:

The Bellman optimality equation for \(q_{*}\) is:
\(q_*(s,a) = \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma \max_{a'} q_*(s',a')\bigr]\)
Backup diagram:

Two kinds of search for the best action given a state:
with \(v_*\) we need a one-step-ahead search, i.e. we must know every available action and the states (and rewards) it can lead to;
with \(q_*\) we do not, because the values of state-action pairs are stored directly, effectively caching the result of the one-step lookahead. A small sketch of the contrast follows.
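A small sketch of the contrast, using the same hypothetical `transitions` model interface as in the DP notes above: acting greedily from \(q_*\) needs no model, while acting greedily from \(v_*\) requires a one-step lookahead through the dynamics.

```python
def greedy_from_q(q_star, s):
    # q_star[s][a] already caches the value of each action: no lookahead needed
    return max(q_star[s], key=q_star[s].get)

def greedy_from_v(v_star, transitions, s, gamma=1.0):
    # with only v_star we must do a one-step lookahead through the model
    best_a, best_val = None, float("-inf")
    for a, outcomes in transitions[s].items():
        val = sum(p * (r + (0.0 if done else gamma * v_star[s2]))
                  for p, s2, r, done in outcomes)
        if val > best_val:
            best_a, best_val = a, val
    return best_a
```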