# Finite Markov Decision Processes Learning

## The Agent-Environment Interface

the probability of each possible value for $$S_t$$ and $$R_t$$ depends only on the immediately preceding state and action, $$S_{t−1}$$ and $$A_{t−1}$$.
This is best viewed a restriction not on the decision process but on the state.
Markov property：The state must include information about all aspects of the past agent-environment interaction that make a difference for the future.

state-transition probabilities：

the expected rewards for state{action pairs :

the expected rewards for state{action{next-state：

### agent-environment boundary

The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment.
We always consider the reward computation to be external to the agent because it defines the task facing the agent and thus must be beyond its ability to change arbitrarily.
The agent-environment boundary represents the limit of the agent’s absolute control, not of its knowledge.

### MDP framework

one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states)做出选择的基础, and one signal to define the agent’s goal (the rewards).

## Goals and Rewards

reward hypothesis：That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.

## Returns and Episodes

The return is the sum of the rewards:$$G_t=R_{t+1}+R_{t+2}+…+R_T$$.
Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states.
the next episode begins independently of how the previous one ended.Tasks with episodes of this kind are called episodic tasks.
continuing tasks：goes on continually without limit
discounted return:

## Unified Notation for Episodic and Continuing Tasks

absorbing state：

$$G _ { t } \doteq \sum _ { k = t + 1 } ^ { T } \gamma ^ { k – t – 1 } R _ { k }$$，其中T=inf or r=1，但不能同时满足，为了适应于两种不同task的规定。

## Policies and Value Functions

value functions：functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
a policy $$\pi(a|s)$$ is a mapping from states to probabilities of selecting each possible action.$$A_t=a$$if $$S_t=s$$
The value function of a state $$s$$ under a policy $$\pi$$, denoted $$v _ { \pi } ( s )$$ is the expected return when starting in $$s$$ and following $$\pi$$ thereafter.
We call the function $$v_\pi$$ the state-value function for policy $$\pi$$:
$$v _ { \pi } ( s ) \doteq \mathbb { E } _ { \pi } \left[ G _ { t } | S _ { t } = s \right] = \mathbb { E } _ { \pi } \left[ \sum _ { k = 0 } ^ { \infty } \gamma ^ { k } R _ { t + k + 1 } | S _ { t } = s \right]$$，for all $$s \in \mathcal { S}$$
$$q_{\pi}$$ is the action-value function for policy $$\pi$$：
$$q _ { \pi } ( s , a ) \doteq \mathbb { E } _ { \pi } \left[ G _ { t } | S _ { t } = s , A _ { t } = a \right] = \mathbb { E } _ { \pi } \left[ \sum _ { k = 0 } ^ { \infty } \gamma ^ { k } R _ { t + k + 1 } | S _ { t } = s , A _ { t } = a \right]$$
Bellman Equation for $$v_{\pi}$$：

It expresses a relationship between the value of a state and the values of its successor states.
It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.

## Optimal Policies and Optimal Value Functions

A policy $$\pi$$ is defined to be better than or equal to a policy $$\pi^{‘}$$ if its expected return is greater than or equal to that of $$\pi^{‘}$$ for all states.
In other words, $$\pi \geq \pi ^ { \prime }$$ if and only if $$v _ { \pi } ( s ) \geq v _ { \pi ^ { \prime } } ( s )$$ ，for all $$s \in {S}$$。
optimal state-value function：
$$v_{*} ( s ) \doteq \max _ { \pi } v _ { \pi } ( s )$$
$$q_{*}$$：optimal action-value function：

### Bellman optimality equation

Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.

Backup diagram：

The Bellman optimality equation for $$q_{*}$$ is：

Backup diagram：

$$v_*$$需要one-step ahead search，就是要知道所有可行的action及其能达到的state。
$$q_*$$不需要，因为记得就是state-action对，相当于cache。