Model-Free Prediction


Incremental Implementation

Let \(Q_n\) denote the estimate of an action's value after it has been selected n − 1 times. It follows that:
\(Q_{n+1}=\frac{1}{n}(R_n+\sum_{i=1}^{n-1}R_i)=Q_n+\frac{1}{n}[R_n-Q_n]\). Intuitively, the newest reward is split into n equal shares, one of which corrects the old estimate. This implementation requires memory only for \(Q_n\) and \(n\), and only a small computation for each new reward.
The general form is: NewEstimate ← OldEstimate + StepSize [Target − OldEstimate].

The expression [Target − OldEstimate] is an error in the estimate. The target is presumed to indicate a desirable direction in which to move, though it may be noisy. In the case above, for example, the target is the nth reward.
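The incremental update above can be sketched in a few lines; the reward sequence here is purely illustrative:

```python
# Incremental sample average: Q_{n+1} = Q_n + (1/n)[R_n - Q_n].
# Requires memory only for the current estimate q and the count n.

def incremental_average(rewards):
    """Return the running average maintained with O(1) memory."""
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n  # NewEstimate <- OldEstimate + StepSize [Target - OldEstimate]
    return q

rewards = [1.0, 0.0, 2.0, 1.0]
print(incremental_average(rewards))  # equals sum(rewards)/len(rewards) = 1.0
```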

Monte Carlo Methods

Monte Carlo methods require only experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
Monte Carlo methods are defined only for episodic tasks. We assume experience is divided into episodes, and that all episodes eventually terminate no matter what actions are selected. Only on the completion of an episode are value estimates and policies changed.

Monte Carlo Prediction

Goal: learn the state-value function \(v_\pi\) for a given policy \(\pi\).
An obvious way to estimate it from experience, then, is simply to average the returns observed after visits to that state. As more returns are observed, the average should converge to the expected value.
Each occurrence of state s in an episode is called a visit to s.
Depending on how repeat visits to the same state s within an episode are handled, we obtain first-visit MC and every-visit MC: first-visit MC averages only the returns following the first visit to s in each episode, while every-visit MC averages the returns following every visit.
Pseudocode for first-visit MC:
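A minimal Python sketch of first-visit MC prediction, assuming episodes are available as lists of illustrative (state, reward) pairs, where the reward is the one received on leaving that state:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns (names are illustrative)."""
    returns = defaultdict(list)       # Returns(s): all first-visit returns seen
    for episode in episodes:
        g = 0.0
        first_visit = {}              # state -> return G at its *first* occurrence
        # Walk the episode backwards, accumulating the return G.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            g = gamma * g + r
            first_visit[s] = g        # a later (i.e. earlier-in-time) pass overwrites
        for s, g in first_visit.items():
            returns[s].append(g)
    # V(s) is the average of the observed first-visit returns.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

episodes = [[('a', 0.0), ('b', 1.0)], [('a', 1.0)]]
print(first_visit_mc(episodes))  # {'b': 1.0, 'a': 1.0}
```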

Both first-visit MC and every-visit MC converge to \(v_\pi(s)\) as the number of visits (or first visits) to s goes to infinity. By the law of large numbers the sequence of averages of these estimates converges to their expected value.
A takeaway from the Blackjack example: even though we understand the Blackjack environment completely, we still cannot easily write down the probability distribution over next events. DP methods require the complete distributions, which is often unrealistic. In contrast, generating the sample games required by Monte Carlo methods is easy. The ability of Monte Carlo methods to work with sample episodes alone can be a significant advantage even when one has complete knowledge of the environment's dynamics.
Whereas the DP diagram includes only one-step transitions, the Monte Carlo diagram goes all the way to the end of the episode.
An important fact about Monte Carlo methods is that the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state, as is the case in DP. In other words, Monte Carlo methods do not bootstrap.
The computational expense of estimating the value of a single state is independent of the number of states. One can generate many sample episodes starting from the states of interest, averaging returns from only these states, ignoring all others.

Temporal-Difference Learning

TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas.
TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap)
constant-α MC: \(V(S_t)=V(S_t)+\alpha[G_t-V(S_t)]\), where \(G_t\) is the actual return following time \(t\) and \(\alpha\) is a constant step-size parameter.
TD methods need to wait only until the next time step. At time \(t+1\) they immediately form a target and make a useful update using the observed reward \(R_{t+1}\) and the estimate \(V(S_{t+1})\). The simplest TD method makes the update: \(V(S_t)=V(S_t)+\alpha[R_{t+1}+\gamma V(S_{t+1})-V(S_t)]\).
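A minimal tabular sketch of this TD(0) update, assuming transitions arrive as illustrative (state, reward, next state) triples with `None` marking termination:

```python
def td0_update(V, transitions, alpha=0.1, gamma=1.0):
    """Apply V(S_t) <- V(S_t) + alpha [R_{t+1} + gamma V(S_{t+1}) - V(S_t)]."""
    for s, r, s_next in transitions:
        # The value of a terminal state is zero by definition.
        target = r + (gamma * V[s_next] if s_next is not None else 0.0)
        V[s] += alpha * (target - V[s])  # move toward the one-step target
    return V

V = {'a': 0.0, 'b': 0.0}
td0_update(V, [('a', 0.0, 'b'), ('b', 1.0, None)], alpha=0.5)
print(V)  # {'a': 0.0, 'b': 0.5}
```

Note that each update uses only one observed transition, so it can run online as the episode unfolds.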

Sample updates differ from the expected updates of DP methods in that they are based on a single sample successor rather than on a complete distribution of all possible successors.
TD error: \(\delta_t=R_{t+1}+\gamma V(S_{t+1})-V(S_t)\).
If the array V does not change during the episode (as it does not in Monte Carlo methods), then the Monte Carlo error can be written as a sum of TD errors: \(G_t-V(S_t)=\sum_{k=t}^{T-1}\gamma^{k-t}\delta_k\).
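The identity can be checked numerically on a small hand-made episode; the states, rewards, and values below are illustrative:

```python
# Check G_t - V(S_t) = sum_{k=t}^{T-1} gamma^{k-t} delta_k for fixed V.
gamma = 0.9
V = {'a': 0.3, 'b': -0.2, 'c': 0.5, 'terminal': 0.0}
episode = [('a', 1.0, 'b'), ('b', 0.0, 'c'), ('c', 2.0, 'terminal')]  # (S_k, R_{k+1}, S_{k+1})

# Monte Carlo error at t = 0: return G_0 minus the held-fixed estimate.
G0 = sum(gamma**k * r for k, (_, r, _) in enumerate(episode))
mc_error = G0 - V[episode[0][0]]

# Sum of discounted TD errors delta_k = R_{k+1} + gamma V(S_{k+1}) - V(S_k).
td_sum = sum(gamma**k * (r + gamma * V[s2] - V[s1])
             for k, (s1, r, s2) in enumerate(episode))

print(abs(mc_error - td_sum) < 1e-12)  # True
```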

Advantages of TD Prediction Methods

TD methods have an advantage over DP methods in that they do not require a model of the environment, of its reward and next-state probability distributions.
The next most obvious advantage of TD methods over Monte Carlo methods is that they are naturally implemented in an online, fully incremental fashion. With Monte Carlo methods one must wait until the end of an episode; with TD methods one need wait only one time step.
TD methods are much less susceptible to these problems because they learn from each transition regardless of what subsequent actions are taken. (I did not fully follow this comparison; it is probably relative to control methods.)
For any fixed policy \(\pi\), TD(0) has been proved to converge to \(v_\pi\), in the mean for a constant step-size parameter if it is sufficiently small.
It is hard to say in general whether MC or TD converges to the true values faster, but in practice TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks.

n-step Bootstrapping

The backup diagrams of n-step methods:

The methods that use n-step updates are still TD methods because they still change an earlier estimate based on how it differs from a later estimate.
The natural state-value learning algorithm using n-step returns is thus \(V_{t+n}(S_t)=V_{t+n-1}(S_t)+\alpha[G_{t:t+n}-V_{t+n-1}(S_t)]\), where \(G_{t:t+n}=R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{n-1}R_{t+n}+\gamma^n V_{t+n-1}(S_{t+n})\). Note that no changes at all are made during the first n − 1 steps of each episode.
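A minimal sketch of the n-step return, assuming a finished episode stored as states \(S_0,\dots,S_T\) and rewards \(R_1,\dots,R_T\) (all names and values below are illustrative):

```python
def n_step_return(rewards, states, V, t, n, gamma=1.0):
    """G_{t:t+n} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})."""
    T = len(rewards)
    h = min(t + n, T)  # truncate at the end of the episode
    # rewards[i-1] holds R_i, so this sums R_{t+1} .. R_h with discounting.
    g = sum(gamma**(i - t - 1) * rewards[i - 1] for i in range(t + 1, h + 1))
    if t + n < T:
        g += gamma**n * V[states[t + n]]  # bootstrap from the current estimate
    return g

V = {'x': 0.5}
states, rewards = ['s0', 'x', 's2'], [1.0, 2.0]
print(n_step_return(rewards, states, V, t=0, n=1, gamma=0.9))  # 1.45
```

The update itself is then just `V[states[t]] += alpha * (g - V[states[t]])`, applied n steps after the state is visited.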

error reduction property: the worst-case error of the expected n-step return is no larger than \(\gamma^n\) times the worst-case error of the current estimate, \(\max_s|\mathbb{E}_\pi[G_{t:t+n}\mid S_t=s]-v_\pi(s)|\le\gamma^n\max_s|V_{t+n-1}(s)-v_\pi(s)|\).

Because of the error reduction property, one can show formally that all n-step TD methods converge to
the correct predictions under appropriate technical conditions.

Eligibility Traces

The mechanism is a short-term memory vector, the eligibility trace \(Z_t \in R^d\), that parallels the long-term weight vector \(W_t \in R^d\). The rough idea is that when a component of \(W_t\) participates in producing an estimated value, then the corresponding component of \(Z_t\) is bumped up and then begins to fade away.
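One TD(λ)-style step with an accumulating trace can be sketched in the linear case, where the gradient of the value estimate is just the feature vector \(x(S_t)\); the tiny feature vectors and all parameter values here are illustrative assumptions:

```python
import numpy as np

def td_lambda_step(w, z, x, r, x_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One step of linear TD(lambda) with an accumulating eligibility trace."""
    z = gamma * lam * z + x                  # bump the trace for active features
    delta = r + gamma * w @ x_next - w @ x   # one-step TD error
    w = w + alpha * delta * z                # credit recently active features
    return w, z

w, z = np.zeros(3), np.zeros(3)
x, x_next = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
w, z = td_lambda_step(w, z, x, r=1.0, x_next=x_next)
print(w)  # [0.1 0.  0. ]
```

Because z fades by a factor γλ each step, credit for a TD error is spread backwards over the features that were active recently, which is what lets the backward view match the forward view.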
Forward views are always somewhat complex to implement because the update depends on later things that are not available at the time. MC and the n-step methods are both forward views of this kind.


