# Some Small Thoughts from a Survey of NER Tasks

## Natural Language Processing (almost) from Scratch

Collobert's paper presents a unified neural network architecture for a variety of natural language processing tasks, mainly sequence labeling: part-of-speech tagging (POS), chunking, named entity recognition (NER), and semantic role labeling (SRL). Because it was proposed so early, much later work cites it as one of the earliest starting points for doing NER with neural networks; compared with today's architectures it naturally looks dated. The paper is long, so I only read selected parts and am jotting down brief notes.

## Learning Named Entity Tagger using Domain-Specific Dictionary

### EMNLP 2018

The AutoNER model introduces an alternative tagging scheme: Tie or Break. For two adjacent tokens, if both refer to the same entity mention, the pair is tagged Tie; if at least one of the two tokens belongs to an unknown-typed high-quality phrase, the pair is tagged Unknown; otherwise it is tagged Break. The span of tokens between two consecutive Break tags is then assigned the entity type it matches in the dictionary, similar to a modified IOBES scheme; spans with no match get the O tag. The authors' stated rationale: when an entity is only partially matched, the Tie labels on the correctly matched part can still be exploited; and when a unigram entity is falsely matched (a false positive), the Tie-or-Break labels are unaffected, because such a word is usually flanked by Break tags (I did not fully understand this point; my reading is that the mismatch does not propagate to the labels of neighboring words?). Unigram entities are exactly the part of string matching that is most error-prone.

1. It proposes the "Tie or Break" labeling scheme. Unlike the traditional BIOES labeling scheme, this method only cares about the relation between two adjacent tokens: tied together or broken apart.
2. The tokens between two adjacent Breaks form a candidate entity span. For a given span, a (k+1)-way classification is then performed, where k is the number of entity types to be recognized and the extra class is None (see the sketch after this list).
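
To make the scheme concrete, here is a minimal Python sketch (not the authors' code) of how dictionary matches could be turned into Tie/Unknown/Break labels; the span-list inputs and the boundary handling are my assumptions from the description above.

```python
def tie_or_break_labels(tokens, entity_spans, unknown_spans):
    """Label each adjacent token pair: 'Tie', 'Unknown', or 'Break'.

    entity_spans: half-open (start, end) token spans matched to a known entity type.
    unknown_spans: spans matched to unknown-typed high-quality phrases.
    """
    labels = []
    for i in range(len(tokens) - 1):                       # pair (i, i + 1)
        if any(s <= i and i + 1 < e for s, e in entity_spans):
            labels.append("Tie")                           # both tokens in one entity
        elif any(s <= i < e or s <= i + 1 < e for s, e in unknown_spans):
            labels.append("Unknown")                       # at least one token in a phrase
        else:
            labels.append("Break")
    return labels

# Example: the dictionary matches tokens 1..3 ("European Union") as a known entity.
print(tie_or_break_labels(
    ["the", "European", "Union", "headquarters"],
    entity_spans=[(1, 3)],
    unknown_spans=[],
))  # ['Break', 'Tie', 'Break']
```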

## Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling

### EMNLP 2018

Language models such as ELMo have so many layers and parameters that inference for a downstream task (e.g., sequence labeling) can be very slow. Building on ELMo, the authors therefore propose LD-Net, a new stacked-LSTM architecture built with dense connections, train it as a language model with layer-wise dropout, and add a layer-selection pruning step during task-specific training. The result uses only a small fraction of the parameters, requires no retraining, and roughly matches the performance of the full pretrained language model.

# Model-Free Prediction

## Incremental Implementation

Let $$Q_n$$ denote the estimate of an action's value after the action has been selected $$n-1$$ times. It is then easy to show that
$$Q_{n+1}=\frac{1}{n}\left(R_n+\sum_{i=1}^{n-1}R_i\right)=Q_n+\frac{1}{n}\left[R_n-Q_n\right]$$
Intuitively, the newly arrived reward is split into $$n$$ equal shares, one of which corrects the old average. This implementation requires memory only for $$Q_n$$ and $$n$$, and only a small computation for each new reward.
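
As a quick numeric check (a self-contained sketch, not from the book), the incremental rule reproduces the exact running mean with O(1) memory:

```python
# Incremental sample-average: Q_{n+1} = Q_n + (R_n - Q_n) / n.
rewards = [4.0, 2.0, 6.0, 8.0]

q, n = 0.0, 0
for r in rewards:
    n += 1
    q += (r - q) / n           # fold the new reward into the mean
print(q)                       # 5.0 == sum(rewards) / len(rewards)
```
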
The general form is
$$NewEstimate \leftarrow OldEstimate + StepSize\left[Target - OldEstimate\right]$$

The expression $$[Target - OldEstimate]$$ is an error in the estimate. The target is presumed to indicate a desirable direction in which to move, though it may be noisy. In the case above, for example, the target is the $$n$$th reward.

## Monte Carlo Methods

Monte Carlo methods require only experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
Monte Carlo methods are defined only for episodic tasks: we assume experience is divided into episodes, and that all episodes eventually terminate no matter what actions are selected. Only on the completion of an episode are value estimates and policies changed.

### Monte Carlo Prediction

An obvious way to estimate the value of a state from experience, then, is simply to average the returns observed after visits to that state. As more returns are observed, the average should converge to the expected value.
Each occurrence of state s in an episode is called a visit to s.

Pseudocode for first-visit MC (a sketch of it follows):
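
Since the pseudocode figure is not reproduced here, below is a minimal Python sketch of first-visit MC prediction. `sample_episode(policy)` is a hypothetical helper that returns one terminated episode as a list of (state, reward) pairs, where the reward is the one received on leaving that state.

```python
from collections import defaultdict

def first_visit_mc(sample_episode, policy, num_episodes, gamma=1.0):
    returns = defaultdict(list)              # state -> observed returns
    for _ in range(num_episodes):
        episode = sample_episode(policy)     # [(S_t, R_{t+1}), ...]
        # Compute G_t for every t by walking the episode backwards.
        gains, g = [0.0] * len(episode), 0.0
        for t in reversed(range(len(episode))):
            g = episode[t][1] + gamma * g
            gains[t] = g
        # Average returns only from the first visit to each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(gains[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```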

Both first-visit MC and every-visit MC converge to $$v_\pi(s)$$ as the number of visits (or first visits) to s goes to infinity. By the law of large numbers the sequence of averages of these estimates converges to their expected value.

Whereas the DP diagram includes only one-step transitions, the Monte Carlo diagram goes all the way to the end of the episode.
An important fact about Monte Carlo methods is that the estimates for each state are independent. The estimate for one state does not build upon the estimate of any other state, as is the case in DP. In other words, Monte Carlo methods do not bootstrap.
The computational expense of estimating the value of a single state is independent of the number of states. One can generate many sample episodes starting from the states of interest, averaging returns from only these states, ignoring all others.

## Temporal-Difference Learning

TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas.
TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap)
constant-α MC: $$V(S_t)=V(S_t)+\alpha[G_t-V(S_t)]$$, where $$G_t$$ is the actual return following time $$t$$, and $$\alpha$$ is a constant step-size parameter.
TD methods need to wait only until the next time step. At time $$t+1$$ they immediately form a target and make a useful update using the observed reward $$R_{t+1}$$ and the estimate $$V(S_{t+1})$$. The simplest TD method makes the update: $$V(S_t)=V(S_t)+\alpha[R_{t+1}+\gamma V(S_{t+1})-V(S_t)]$$.
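
A minimal tabular TD(0) sketch of this update. The `env` object (with `reset()` and `step(action) -> (next_state, reward, done)`) and `policy(state)` are assumed interfaces for illustration, not anything specified in the text:

```python
from collections import defaultdict

def td0(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    V = defaultdict(float)                   # value estimates, initialized to 0
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])  # V(S_t) += α[R_{t+1} + γV(S_{t+1}) − V(S_t)]
            s = s_next
    return V
```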

Sample updates differ from the expected updates of DP methods in that they are based on a single sample successor rather than on a complete distribution of all possible successors.
TD error: $$\delta_t=R_{t+1}+\gamma V(S_{t+1})-V(S_t)$$.
If the array $$V$$ does not change during the episode (as it does not in Monte Carlo methods), then the Monte Carlo error can be written as a sum of TD errors:
$$G_t-V(S_t)=\sum_{k=t}^{T-1}\gamma^{k-t}\delta_k$$
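
To see why, unroll the return recursively (using $$G_T=V(S_T)=0$$ at termination); each step peels off one TD error:
$$G_t-V(S_t)=R_{t+1}+\gamma G_{t+1}-V(S_t)=\delta_t+\gamma\left(G_{t+1}-V(S_{t+1})\right)$$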

### Advantages of TD Prediction Methods

TD methods have an advantage over DP methods in that they do not require a model of the environment, of its reward and next-state probability distributions.
The next most obvious advantage of TD methods over Monte Carlo methods is that they are naturally implemented in an online, fully incremental fashion. With TD methods one need wait only one time step.
TD methods are much less susceptible to these problems because they learn from each transition regardless of what subsequent actions are taken. (I did not quite follow this comparison; it is probably against control methods or the like.)
For any fixed policy $$\pi$$, TD(0) has been proved to converge to $$v_\pi$$, in the mean for a constant step-size parameter if it is sufficiently small.
It is hard to say in general whether MC or TD converges to the true solution faster, but as a rule of thumb: TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks.

## n-step Bootstrapping

The backup diagrams of $$n$$-step methods form a spectrum, with one-step TD at one end and Monte Carlo (backing up over the whole episode) at the other.

The methods that use n-step updates are still TD methods because they still change an earlier estimate based on how it differs from a later estimate.
The natural state-value learning algorithm for using $$n$$-step returns is thus $$V_{t+n}(S_t)=V_{t+n-1}(S_t)+\alpha\left[G_{t:t+n}-V_{t+n-1}(S_t)\right]$$. Note that no changes at all are made during the first $$n-1$$ steps of each episode.
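
A minimal sketch of this algorithm in the same hypothetical `env`/`policy` interface as the TD(0) sketch above. The state and reward buffers delay each update by $$n$$ steps, matching the note that nothing changes during the first $$n-1$$ steps:

```python
from collections import defaultdict

def n_step_td(env, policy, num_episodes, n=4, alpha=0.1, gamma=1.0):
    V = defaultdict(float)
    for _ in range(num_episodes):
        states, rewards = [env.reset()], [0.0]     # rewards[t] holds R_t
        T, t = float("inf"), 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                        # time whose estimate updates now
            if tau >= 0:
                g = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:                    # bootstrap from V if not at the end
                    g += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (g - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```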

Error reduction property: the worst error of the expected $$n$$-step return is guaranteed to be no larger than $$\gamma^n$$ times the worst error under $$V_{t+n-1}$$:
$$\max_s\left|\mathbb{E}_\pi\left[G_{t:t+n}\mid S_t=s\right]-v_\pi(s)\right|\leq\gamma^n\max_s\left|V_{t+n-1}(s)-v_\pi(s)\right|$$

Because of the error reduction property, one can show formally that all $$n$$-step TD methods converge to the correct predictions under appropriate technical conditions.

## Eligibility Traces

The mechanism is a short-term memory vector, the eligibility trace $$Z_t \in \mathbb{R}^d$$, that parallels the long-term weight vector $$W_t \in \mathbb{R}^d$$. The rough idea is that when a component of $$W_t$$ participates in producing an estimated value, the corresponding component of $$Z_t$$ is bumped up and then begins to fade away.
Forward views are always somewhat complex to implement because the update depends on later things that are not available at the time. MC and $$n$$-step methods are both of this kind.
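
A minimal sketch of this trace mechanism for semi-gradient TD(λ) with a linear value function $$\hat{v}(s,w)=w^\top x(s)$$; the feature map `x(s)` and the episode's transition list are assumptions for illustration:

```python
import numpy as np

def td_lambda_episode(transitions, x, w, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of semi-gradient TD(lambda) with accumulating traces.

    transitions: list of (s, r, s_next, done) tuples.
    x: feature function, x(s) -> np.ndarray with the same shape as w.
    """
    z = np.zeros_like(w)                        # eligibility trace Z_t
    for s, r, s_next, done in transitions:
        z = gamma * lam * z + x(s)              # bump active components, fade the rest
        v_next = 0.0 if done else w @ x(s_next)
        delta = r + gamma * v_next - w @ x(s)   # TD error
        w = w + alpha * delta * z               # credit recently active features
    return w
```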

# Dynamic Programming

UPD: I used to think value iteration would be less stable than policy iteration, but I later asked why and was told that both can in fact oscillate (i.e., be unstable).

## Policy Evaluation (Prediction)

Policy evaluation (the prediction problem): we consider how to compute the state-value function $$v_\pi$$ for an arbitrary policy $$\pi$$.

For our purposes, iterative solution methods are most suitable. The initial approximation, $$v_0$$, is chosen arbitrarily (except that the terminal state, if any, must be given value 0), and each successive
approximation is obtained by using the Bellman equation for $$v_\pi$$ (4.4) as an update rule:

Iterative policy evaluation: assuming $$v_\pi$$ exists, construct a sequence $$\{v_k\}$$ such that $$v_k \to v_\pi$$ as $$k \to \infty$$.
Expected update: for each state $$s$$, replace the old value of $$s$$ with a new value obtained from the old values of the successor states of $$s$$ and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated.
All the updates done in DP algorithms are called expected updates because they are based on an expectation over all possible next states rather than on a sample next state.
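
A minimal sketch of iterative policy evaluation with these expected updates, assuming (my encoding, not the book's) the finite MDP is given as dense arrays `P[s, a, s']` (transition probabilities), `R[s, a]` (expected immediate reward), and `pi[s, a]` (the policy being evaluated), with terminal states made self-looping with zero reward:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    V = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ V                  # expected one-step lookahead, q[s, a]
        V_new = (pi * q).sum(axis=1)           # expectation over actions under pi
        if np.max(np.abs(V_new - V)) < theta:  # stop when a sweep barely moves V
            return V_new
        V = V_new
```

This is the two-array version; updating V in place during the sweep also converges, usually faster.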

## Policy Improvement

Our reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function $$v_\pi$$ for an arbitrary deterministic policy $$\pi$$. For some state $$s$$ we would like to know whether or not we should change the policy to deterministically choose an action $$a \neq \pi(s)$$.

Policy improvement theorem: for any pair of deterministic policies $$\pi$$ and $$\pi'$$ such that $$q_\pi(s,\pi'(s)) \geq v_\pi(s)$$ for all states $$s$$, it follows that $$v_{\pi'}(s) \geq v_\pi(s)$$ for all states $$s$$.

Policy improvement: make the new policy greedy with respect to the value function of the old one, $$\pi'(s)=\arg\max_a q_\pi(s,a)=\arg\max_a\sum_{s',r}p(s',r\mid s,a)\left[r+\gamma v_\pi(s')\right]$$.

## Policy Iteration

We can thus obtain a sequence of monotonically improving policies and value functions:
$$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$$

Here $$E$$ denotes a policy evaluation and $$I$$ a policy improvement. Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.
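
A minimal sketch of policy iteration over the same hypothetical `P[s, a, s']` and `R[s, a]` arrays as in the policy-evaluation sketch; here the E step is done exactly by solving the linear Bellman system rather than by sweeping:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)          # deterministic policy, pi[s] = a
    while True:
        # E: solve (I - gamma * P_pi) V = R_pi exactly for v_pi
        P_pi = P[np.arange(n_states), pi]
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # I: act greedily with respect to v_pi
        pi_new = np.argmax(R + gamma * P @ V, axis=1)
        if np.array_equal(pi_new, pi):          # policy stable => optimal
            return pi, V
        pi = pi_new
```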

## Value Iteration

Policy evaluation itself may be slow. In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration.
One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration.

Because the max operation in (4.10) is the only difference between these updates, this just means that the max operation is added to some sweeps of policy evaluation. All of these algorithms converge to an optimal policy for discounted finite MDPs.
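
A minimal sketch of value iteration over the same hypothetical arrays, with the max over actions folded into each sweep:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    V = np.zeros(P.shape[0])
    while True:
        V_new = np.max(R + gamma * P @ V, axis=1)   # one sweep, with the max
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    return np.argmax(R + gamma * P @ V, axis=1), V  # greedy policy and values
```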

## Asynchronous Dynamic Programming

If the state set is very large, then even a single sweep can be prohibitively expensive.
Asynchronous DP algorithms are in-place iterative DP algorithms that are not organized in terms of systematic sweeps of the state set. These algorithms update the values of states in any order whatsoever, using whatever values of other states happen to be available.
The values of some states may be updated several times before the values of others are updated once. To converge correctly, however, an asynchronous algorithm must continue to update the values of all the states: it can’t ignore any state after some point in the computation. Asynchronous DP algorithms allow great flexibility in selecting states to update.
Of course, avoiding sweeps does not necessarily mean that we can get away with less computation. It just means that an algorithm does not need to get locked into any hopelessly long sweep before it can make progress improving a policy.
We can try to take advantage of this flexibility by selecting the states to which we apply updates so
as to improve the algorithm’s rate of progress. We can try to order the updates to let value information propagate from state to state in an efficient way.

## Generalized Policy Iteration

Generalized policy iteration (GPI): the general idea of letting policy-evaluation and policy-improvement processes interact, independent of the granularity and other details of the two processes. In other words, all three of the iterative DP methods above can be regarded as instances of GPI.

If both processes stabilize, that is, no longer produce changes, then the value function and policy must be optimal.

The evaluation and improvement processes in GPI can be viewed as both competing and cooperating. Making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, and making the value function consistent with the policy typically causes that policy no longer to be greedy.

## Efficiency of Dynamic Programming

DP is sometimes thought to be of limited applicability because of the curse of dimensionality, the fact that the number of states often grows exponentially with the number of state variables. Large state sets do create difficulties, but these are inherent difficulties of the problem, not of DP as a solution method. In fact, DP is comparatively better suited to handling large state spaces than competing methods such as direct search and linear programming.
In practice, DP methods can be used with today's computers to solve MDPs with millions of states. These methods usually converge much faster than their theoretical worst-case run times, particularly if they are started with good initial value functions or policies.