Finite Markov Decision Processes Learning


An aside: over the Mid-Autumn Festival I tagged along with a few friends for nonstop socializing, so basically every evening was spent on movies, barbecue, and bars, and as a result the code bug never got fixed (honestly I don't even want to fix it; it isn't my code, so why should I be the one to change it... I really hate fixing bugs in other people's code). All I managed was to read a paper I have to present at reading group in a few days (and I'll probably have to make some slides when I'm back), so let me hurry up and go through the Finite Markov Decision Processes chapter. I read the book directly without watching the course videos, so after hearing tomorrow's talk I'll watch the videos and add to these notes.

The Agent-Environment Interface


We usually think of an MDP as generating a sequence of the form \(S_0, A_0, R_1, S_1, A_1, R_2, \ldots\), where the reward received for the action \(A_t\) taken at time \(t\) is \(R_{t+1}\).

The probability of each possible value for \(S_t\) and \(R_t\) depends only on the immediately preceding state and action, \(S_{t-1}\) and \(A_{t-1}\).
This is best viewed as a restriction not on the decision process but on the state.
Markov property: the state must include information about all aspects of the past agent–environment interaction that make a difference for the future.
The dynamics are captured by the four-argument function \(p(s', r|s, a) \doteq \Pr\{S_t = s', R_t = r \,|\, S_{t-1} = s, A_{t-1} = a\}\), from which we can derive three other expressions:
state-transition probabilities:
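\(p(s'|s, a) \doteq \Pr\{S_t = s' \,|\, S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r|s, a)\)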

the expected rewards for state–action pairs:
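\(r(s, a) \doteq \mathbb{E}\left[ R_t \,|\, S_{t-1} = s, A_{t-1} = a \right] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r|s, a)\)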

the expected rewards for state–action–next-state triples:
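\(r(s, a, s') \doteq \mathbb{E}\left[ R_t \,|\, S_{t-1} = s, A_{t-1} = a, S_t = s' \right] = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r|s, a)}{p(s'|s, a)}\)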

agent-environment boundary

The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment.
We always consider the reward computation to be external to the agent because it defines the task facing the agent and thus must be beyond its ability to change arbitrarily.
The agent-environment boundary represents the limit of the agent’s absolute control, not of its knowledge.

MDP framework

One signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent’s goal (the rewards).

Goals and Rewards

reward hypothesis: That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.

Returns and Episodes

The return is the sum of the rewards: \(G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T\), where \(T\) is a final time step.
Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states.
The next episode begins independently of how the previous one ended. Tasks with episodes of this kind are called episodic tasks.
Continuing tasks: the agent–environment interaction goes on continually without limit.
discounted return:
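\(G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\), where \(0 \leq \gamma \leq 1\) is the discount rate.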

Recursion: \(G_t = R_{t+1} + \gamma G_{t+1}\)

Unified Notation for Episodic and Continuing Tasks

absorbing state: a special state that transitions only to itself and generates rewards of zero; adding one after the terminal state lets episodic and continuing tasks be written in the same notation.

\(G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k\), where either \(T = \infty\) or \(\gamma = 1\) (but not both); this convention makes the formula cover both kinds of tasks.

Policies and Value Functions

value functions:functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
A policy \(\pi\) is a mapping from states to probabilities of selecting each possible action; \(\pi(a|s)\) is the probability that \(A_t = a\) if \(S_t = s\).
The value function of a state \(s\) under a policy \(\pi\), denoted \(v_{\pi}(s)\), is the expected return when starting in \(s\) and following \(\pi\) thereafter.
We call the function \(v_\pi\) the state-value function for policy \(\pi\):
\(v _ { \pi } ( s ) \doteq \mathbb { E } _ { \pi } \left[ G _ { t } | S _ { t } = s \right] = \mathbb { E } _ { \pi } \left[ \sum _ { k = 0 } ^ { \infty } \gamma ^ { k } R _ { t + k + 1 } | S _ { t } = s \right] \),for all \(s \in \mathcal { S}\)
\(q_{\pi}\) is the action-value function for policy \(\pi\):
\(q _ { \pi } ( s , a ) \doteq \mathbb { E } _ { \pi } \left[ G _ { t } | S _ { t } = s , A _ { t } = a \right] = \mathbb { E } _ { \pi } \left[ \sum _ { k = 0 } ^ { \infty } \gamma ^ { k } R _ { t + k + 1 } | S _ { t } = s , A _ { t } = a \right]\)
Bellman Equation for \(v_{\pi}\):
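\(v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r|s, a) \left[ r + \gamma v_{\pi}(s') \right]\), for all \(s \in \mathcal{S}\)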

It expresses a relationship between the value of a state and the values of its successor states.
It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.

Optimal Policies and Optimal Value Functions

A policy \(\pi\) is defined to be better than or equal to a policy \(\pi^{\prime}\) if its expected return is greater than or equal to that of \(\pi^{\prime}\) for all states.
In other words, \(\pi \geq \pi^{\prime}\) if and only if \(v_{\pi}(s) \geq v_{\pi^{\prime}}(s)\) for all \(s \in \mathcal{S}\).
optimal state-value function:
\(v_{*} ( s ) \doteq \max _ { \pi } v _ { \pi } ( s )\)
optimal action-value function \(q_{*}\):
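\(q_{*}(s, a) \doteq \max_{\pi} q_{\pi}(s, a)\), for all \(s \in \mathcal{S}\) and \(a \in \mathcal{A}(s)\); in terms of \(v_*\), \(q_{*}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma v_{*}(S_{t+1}) \,|\, S_t = s, A_t = a \right]\).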

Bellman optimality equation

Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.
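In equation form: \(v_{*}(s) = \max_{a} \mathbb{E}\left[ R_{t+1} + \gamma v_{*}(S_{t+1}) \,|\, S_t = s, A_t = a \right] = \max_{a} \sum_{s', r} p(s', r|s, a) \left[ r + \gamma v_{*}(s') \right]\)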

Backup diagram for \(v_{*}\): a max over the actions available at the state, then an expectation over the possible next states and rewards.

The Bellman optimality equation for \(q_{*}\) is:
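\(q_{*}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_{*}(S_{t+1}, a') \,|\, S_t = s, A_t = a \right] = \sum_{s', r} p(s', r|s, a) \left[ r + \gamma \max_{a'} q_{*}(s', a') \right]\)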

Backup diagram for \(q_{*}\): an expectation over the possible next states and rewards, then a max over the actions available at the next state.

Two kinds of search for the optimal action given a state (see the sketch after this list):
With \(v_*\) you need a one-step-ahead search: you have to know every available action and the next states (and rewards) it can lead to.
With \(q_*\) you don't, because it already stores a value for every state–action pair; it effectively caches the result of the one-step-ahead search.
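To make this concrete, here is a minimal sketch (mine, not from the book) of tabular value iteration on a tiny made-up MDP with deterministic dynamics. The inner loop is the Bellman optimality backup; the greedy action can be read straight out of the cached \(q\) values, whereas acting from \(v_*\) alone would mean repeating that one-step lookahead with the model at decision time.

```cpp
// Minimal sketch: value iteration on a made-up 3-state, 2-action MDP with
// deterministic dynamics (next state and reward are just tables).
#include <cstdio>
#include <vector>
#include <cmath>
#include <algorithm>
using namespace std;

int main() {
    const int S = 3, A = 2;
    const double gamma = 0.9;
    // Hypothetical dynamics: taking action a in state s leads to nxt[s][a]
    // with reward rew[s][a].
    int    nxt[S][A] = {{1, 2}, {0, 2}, {2, 0}};
    double rew[S][A] = {{0, 1}, {0, 2}, {0, 5}};

    vector<double> v(S, 0.0);
    vector<vector<double>> q(S, vector<double>(A, 0.0));

    // Bellman optimality backup: v(s) <- max_a [ r(s,a) + gamma * v(s') ].
    for (int iter = 0; iter < 10000; ++iter) {
        double delta = 0.0;
        for (int s = 0; s < S; ++s) {
            double best = -1e100;
            for (int a = 0; a < A; ++a) {
                q[s][a] = rew[s][a] + gamma * v[nxt[s][a]];  // one-step lookahead
                best = max(best, q[s][a]);
            }
            delta = max(delta, fabs(best - v[s]));
            v[s] = best;
        }
        if (delta < 1e-10) break;
    }

    // With q available, the greedy action is a plain argmax; with only v we
    // would have to redo the lookahead above, which needs the model.
    for (int s = 0; s < S; ++s) {
        int greedy = int(max_element(q[s].begin(), q[s].end()) - q[s].begin());
        printf("s=%d  v*=%.4f  greedy action=%d\n", s, v[s], greedy);
    }
    return 0;
}
```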

计蒜之道 2018 Preliminary Rounds: Problem Notes

计蒜之道 has started up again recently; the first two rounds are done and I played both, the first on my own account and the second on behalf of a teammate who was busy with exams. The first round wasn't too nasty: finishing the first three problems within an hour was enough to advance, so I didn't bother writing the hard one. The second round had some pitfalls that were really hard to notice, but writing the two easy problems in half an hour already got me through, and then I spent the entire rest of the contest on the medium one. Since this blog hasn't been updated in ages and exam week is over, I'll write up the problems here (the ones I know how to do, anyway); after all, I still have to make a half-hearted charge at 蓝桥杯 at the end of the month. I don't have accounts for the remaining rounds, so I'll probably just play them casually.

Round 1

A. 百度无人车

Binary search on the answer; nothing much to say.

B. 百度科学家 (Easy)

During the contest I wrote the easy and medium versions of this problem. I could also see that the hard version needed a persistent segment tree to optimize the graph construction, but this old man couldn't bring himself to write it.
The easy version just enumerates the starting point and runs a BFS; I won't paste the code.

C. 百度科学家 (Medium)

Once you notice that the total number of added edges is at most 1e5, it becomes manageable: first use Tarjan to contract the strongly connected components into a DAG and rebuild the graph, and then, since the weights are non-negative, greedily take the smallest-value node among those with out-degree 0 (a generic sketch of this step is below).
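A generic sketch of the Tarjan contraction plus that greedy step (the graph and node values here are made up for illustration, not the actual problem input):

```cpp
// Generic sketch: Tarjan SCC contraction, then pick the smallest-value
// component with out-degree 0 in the condensed DAG.
#include <cstdio>
#include <vector>
#include <stack>
#include <algorithm>
using namespace std;

vector<vector<int>> adj;          // adjacency list, 0-indexed
vector<int> dfn, low, comp;
vector<bool> onstk;
stack<int> stk;
int timer_ = 0, sccCnt = 0;

void tarjan(int u) {
    dfn[u] = low[u] = ++timer_;
    stk.push(u); onstk[u] = true;
    for (int v : adj[u]) {
        if (!dfn[v]) { tarjan(v); low[u] = min(low[u], low[v]); }
        else if (onstk[v]) low[u] = min(low[u], dfn[v]);
    }
    if (low[u] == dfn[u]) {       // u is the root of a new SCC
        ++sccCnt;
        while (true) {
            int v = stk.top(); stk.pop(); onstk[v] = false;
            comp[v] = sccCnt;
            if (v == u) break;
        }
    }
}

int main() {
    // Made-up example: 5 nodes with values and a small digraph with one cycle.
    const int n = 5;
    long long val[n] = {3, 1, 4, 1, 5};
    int edges[][2] = {{0, 1}, {1, 2}, {2, 0}, {2, 3}, {4, 3}};

    adj.assign(n, {});
    dfn.assign(n, 0); low.assign(n, 0); comp.assign(n, 0); onstk.assign(n, false);
    for (auto &e : edges) adj[e[0]].push_back(e[1]);

    for (int u = 0; u < n; ++u) if (!dfn[u]) tarjan(u);

    // Condense: total value and out-degree per SCC.
    vector<long long> sccVal(sccCnt + 1, 0);
    vector<int> outdeg(sccCnt + 1, 0);
    for (int u = 0; u < n; ++u) sccVal[comp[u]] += val[u];
    for (int u = 0; u < n; ++u)
        for (int v : adj[u])
            if (comp[u] != comp[v]) ++outdeg[comp[u]];

    // Greedy step: among SCCs with out-degree 0, take the smallest total value.
    long long best = -1;
    for (int c = 1; c <= sccCnt; ++c)
        if (outdeg[c] == 0 && (best < 0 || sccVal[c] < best)) best = sccVal[c];
    printf("%lld\n", best);
    return 0;
}
```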

D. 百度科学家 (Hard)

To be filled in later.

Round 2

A. 淘宝的推荐系统

Just let dp[i] be the length of the longest valid sequence, among the elements processed so far, whose last value is i; that gives an O(nd) DP.

B. 阿里巴巴的手机代理商 (Easy)

This version has no updates, so I won't paste the code.

C. 阿里巴巴的手机代理商 (Medium)

This problem is full of fiddly details. I'll just mention the one spot where I got it wrong: for an update you have to delete first and detach the subtree, and only then insert and reattach. See the code for the specifics. As for why, consider replacing an old word with a new one where the old word is a prefix of the new one; that's why the order must not be swapped.

D. 阿里巴巴的手机代理商 (Hard)

To be filled in later... I can't bring myself to write any more.

Project Euler 211: Divisor Square Sum

For a positive integer n, let σ2(n) be the sum of the squares of its divisors. For example,

σ2(10) = 1 + 4 + 25 + 100 = 130.
Find the sum of all n, 0 < n < 64,000,000 such that σ2(n) is a perfect square.

Just a quick writeup. This morning I asked some friends whether the divisor square sum has any nice properties. Answer: it's a multiplicative function! OK, then I just need to run a linear sieve and check each value one by one.
Then I discovered I had never really learned the linear sieve properly...


If \(x = p_1^{a_1} p_2^{a_2} \cdots p_k^{a_k}\),
then the divisor square sum is \(\sigma_2(x) = (1 + p_1^2 + p_1^4 + \cdots + p_1^{2a_1})(1 + p_2^2 + \cdots + p_2^{2a_2}) \cdots (1 + p_k^2 + \cdots + p_k^{2a_k})\).
To keep the sieve linear, for each number we also store the factor contributed by its smallest prime factor \(p_1\), namely \(1 + p_1^2 + \cdots + p_1^{2a_1}\); with that stored, the whole thing works out.
Just look at the code (sketched below)~
Watch out for int overflow...
Oh right, this function is apparently called the divisor function, which gives the sum of a fixed power of the divisors.
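Here is a reconstruction sketch of the sieve just described (the array names f, g and the constant LIMIT are my own choices, not the original code): f[i] holds \(\sigma_2(i)\) and g[i] holds the geometric-series factor contributed by the smallest prime factor of i. The values of \(\sigma_2\) near the bound exceed 32-bit range, hence the overflow warning above and the unsigned 64-bit arithmetic; at the full limit the two arrays take roughly 1 GB of memory, so shrink LIMIT while experimenting.

```cpp
// Reconstruction sketch: linear sieve of the multiplicative function sigma2,
// then sum the n < LIMIT for which sigma2(n) is a perfect square.
#include <cstdio>
#include <cmath>
#include <vector>
using namespace std;
typedef unsigned long long ull;

const int LIMIT = 64000000;   // Project Euler 211 bound; reduce while testing

bool isSquare(ull x) {
    ull r = (ull)sqrtl((long double)x);
    // sqrtl may be off by one, so check the neighbourhood.
    for (ull c = (r > 0 ? r - 1 : 0); c <= r + 1; ++c)
        if (c * c == x) return true;
    return false;
}

int main() {
    // f[i] = sigma2(i); g[i] = 1 + p^2 + ... + p^(2a) for i's smallest prime p.
    vector<ull> f(LIMIT), g(LIMIT);
    vector<bool> comp(LIMIT, false);
    vector<int> primes;
    f[1] = g[1] = 1;

    for (int i = 2; i < LIMIT; ++i) {
        if (!comp[i]) {                    // i is prime
            primes.push_back(i);
            g[i] = f[i] = 1ULL + (ull)i * i;
        }
        for (int p : primes) {
            long long ip = 1LL * i * p;
            if (ip >= LIMIT) break;
            comp[ip] = true;
            if (i % p == 0) {              // p is already the smallest prime of i
                g[ip] = g[i] * p * p + 1;  // extend the geometric series by p^2
                f[ip] = f[i] / g[i] * g[ip];
                break;
            } else {                       // p becomes the smallest prime of i*p
                g[ip] = 1ULL + (ull)p * p;
                f[ip] = f[i] * g[ip];      // multiplicative over coprime factors
            }
        }
    }

    ull answer = 0;
    for (int i = 1; i < LIMIT; ++i)
        if (isSquare(f[i])) answer += (ull)i;
    printf("%llu\n", answer);
    return 0;
}
```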