11.1 简介 #
本书之前介绍的基于策略的方法包括策略梯度算法和 Actor-Critic
算法。这些方法虽然简单、直观,但在实际应用过程中会遇到训练不稳定的情况。回顾一下基于策略的方法:参数化智能体的策略,并设计衡量策略好坏的目标函数,通过梯度上升的方法来最大化这个目标函数,使得策略最优。具体来说,假设 $\theta$ 表示策略 $\pi_\theta$ 的参数,定义 $J(\theta)=\mathbb{E}_{s_0}\big[V^{\pi_\theta}(s_0)\big] = \mathbb{E}_{\pi_\theta}\big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\big]$,基于策略的方法的目标是找到 $\theta^*=\arg\max_\theta J(\theta)$,策略梯度算法主要沿着 $\nabla_\theta J(\theta)$ 方向迭代更新策略参数 $\theta$。但是这种算法有一个明显的缺点:当策略网络是深度模型时,沿着策略梯度更新参数,很有可能由于步长太长,策略突然显著变差,进而影响训练效果。
针对以上问题,我们考虑在更新时找到一块信任区域 (trust region
),在这个区域上更新策略时能够得到某种策略性能的安全性保证,这就是信任区域策略优化 (trust region policy optimization,TRPO
) 算法的主要思想。TRPO
算法在 2015 年被提出,它在理论上能够保证策略学习的性能单调性,并在实际应用中取得了比策略梯度算法更好的效果。
11.2 基础知识 #
11.2.1 KL 散度 (Kullback-Leibler Divergence or relative entropy) #
令 $P$ 和 $Q$ 是定义在相同样本空间 $\Omega$ 上的两个概率分布
- 离散空间:
$$ D_{KL}(P \Vert Q) = \sum_{x \in \Omega} p(x)\log \cfrac{p(x)}{q(x)} $$
- 连续空间:
$$ D_{KL}(P \Vert Q) = \int_{\Omega} p(x)\log \cfrac{p(x)}{q(x)} dx $$
11.2.2 总变差散度 (Total Variation Distance) #
总变差散度 (Total Variation Distance, TVD
) 是概率论和统计学中的一种度量两个概率分布之间差异的指标。总变差散度用来衡量 $P$ 和 $Q$ 这两个概率分布在所有可能事件上的差异。
令 $P$ 和 $Q$ 是定义在相同样本空间 $\Omega$ 上的两个概率分布,总变差散度 $D_{TV}(P,Q)$ 定义为:
$$ D_{TV}(P, Q) = \sup_{A \subseteq \Omega} |P(A) - Q(A)| $$
- 离散空间:
$$ D_{TV}(P, Q) = \frac{1}{2} \sum_{x \in \Omega} |p(x) - q(x)| $$
- 连续空间:
$$ D_{TV}(P, Q) = \frac{1}{2} \int_{\Omega} |p(x) - q(x)| dx $$
11.2.3 KL 散度 和 TV 散度之间的关系 (Pinsker’s inequality) #
任意离散空间:
$$ D_{TV}(P \Vert Q)^2 \le \cfrac{1}{2} D_{KL}(P \Vert Q) $$
11.2.4 共轭梯度法 (conjugate gradient algorithm) #
共轭梯度法 (Conjugate Gradient Algorithm
) 是一种用于求解线性方程组和无约束最优化问题的数值算法,特别适用于大规模稀疏对称正定矩阵。共轭梯度法是一种十分有效的迭代方法。
共轭梯度法主要用于解决如下形式的线性方程组:
$$ A\mathbf{x} = \mathbf{b} $$
其中 $A$ 是一个对称正定矩阵, $\mathbf{x}$ 是未知向量, $\mathbf{b}$ 是已知向量。
等价表述为如下最小化问题
$$ \min_{\mathbf{x}} f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x} $$
两者具有相同的唯一解
计算 $f(\mathbf{x})$ 的梯度,可以得到等于线性问题的残差:
$$ \nabla f(\mathbf{x}) = A\mathbf{x} - \mathbf{b} $$
迭代求解流程:
- 给定 $x_0$
- 另 $r_0 \leftarrow A x_0 -b$, $p_0\leftarrow -r_0$, $k \leftarrow 0$
- while $r_k \neq 0$ do:
- 步长 $\alpha_k \leftarrow \cfrac{r_k^T r_k}{p_k^T A p_k}$
- 更新 $x_{k+1} \leftarrow x_k + \alpha_k p_k$
- 更新 $r_{k+1} \leftarrow r_k + \alpha_k A p_k$
- $\beta_{k+1} \leftarrow \cfrac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$
- $p_{k+1} \leftarrow -r_{k+1} + \beta_{k+1} p_k$
- $k \leftarrow k+1$
- end while
11.3 TRPO #
11.3.1 准备工作 #
考虑由元组 $\langle \mathcal{S},\mathcal{A},P,r,\rho _0,\gamma \rangle$ 定义的无限时域折扣马尔可夫决策过程 (MDP
),其中 $\mathcal{S}$ 是状态的有限集合、$\mathcal{A}$ 是动作的有限集合,$P: \mathcal{S}\times \mathcal{A}\times \mathcal{S} \rightarrow \mathbb{R}$ 为状态转移分布,$r: \mathcal{S} \rightarrow \mathbb{R}$ 奖励函数,$\rho_0: \mathcal{S} \rightarrow \mathbb{R}$ 是初始状态 $s_0$ 的分布,$\gamma \in (0,1)$ 是折扣因子。
定义 $\pi: \mathcal{S}\times \mathcal{A} \rightarrow [0,1]$ 是随机策略,定义 $\eta(\pi)$ 为 discounted reward
的期望:
$$ \eta(\pi) = \mathbb{E}_{s_0, a_0, \cdots} \left[ \sum_{t=0}^\infty \gamma^t r(s_t) \right] $$
定义状态动作价值函数 $Q_\pi$、状态价值函数 $V_\pi$、优势函数 $A_\pi$:
$$ \begin{align*} &Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \cdots}\left[ \sum_{l=0}^\infty \gamma^l r(s_l) \right] \\ &V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \cdots}\left[ \sum_{l=0}^\infty \gamma^l r(s_l) \right] \\ &A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s) \\ \end{align*} $$
其中 $a_t \sim \pi(a_t|s_t), s_{t+1} \sim P(s_{t+1}| s_t, a_t)$ 对于所有的 $t \ge 0$
新策略 $\tilde{\pi}$ 与老策略 $\pi$ 有如下关系:
$$ \eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \cdots \sim \tilde{\pi}}\left[ \sum_{t=0}^\infty \gamma^t A_\pi(s_t, a_t) \right] \tag{1} $$
证明: #
设 trajectory
$\tau:=(s_0, a_0, s_1, a_1, \cdots)$ 由 $\tilde{\pi}$ 采样生成,另外有 $A_\pi(s, a) = \mathbb{E}_{s^\prime\sim P(s^\prime|s,a)}[r + \gamma V(s^\prime) - V(s)]$
$$ \begin{align*} &\mathbb{E}_{\tau|\tilde{\pi}}\left[ \sum_{t=0}^\infty \gamma^t A_\pi(s_t, a_t) \right] \\ &= \mathbb{E}_{\tau|\tilde{\pi}}\left[ \sum_{t=0}^\infty \gamma^t \left(r(s_t) + \gamma V_\pi(s_{t+1}) - V_\pi(s_t) \right) \right] \\ &= \mathbb{E}_{\tau|\tilde{\pi}}\left[ -V_\pi(s_0) + \sum_{t=0}^\infty \gamma^t r(s_t) \right] \\ &= -\mathbb{E}_{s_0}\Big[ V_\pi(s_0) \Big] + \mathbb{E}_{\tau|\tilde{\pi}}\left[ \sum_{t=0}^\infty \gamma^t r(s_t) \right] \tag{ $s_0$ 跟 policy 无关} \\ &= -\eta(\pi) + \eta(\tilde{\pi}) \end{align*} $$
证毕!
设:
$$ \rho_\pi(s) = P(s_0=s) + \gamma P(s_1=s) + \gamma^2 P(s_2=s) + \cdots $$
其中 $s_0 \sim \rho_0$,重写 (1)
:
$$ \begin{align*} \eta(\tilde{\pi}) &= \eta(\pi) + \sum_{t=0}^\infty \sum_s P(s_t=s|\tilde{\pi}) \sum_a \tilde{\pi}(a|s) \gamma^t A_\pi (s,a) \\ &= \eta(\pi) + \sum_s \sum_{t=0}^\infty \gamma^t P(s_t=s|\tilde{\pi}) \sum_a \tilde{\pi}(a|s) A_\pi (s,a) \\ &= \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) {\color {red} \sum_a \tilde{\pi}(a|s) A_\pi (s,a) } \tag{$1^\prime$} \\ \end{align*} $$
上面公式显示,只要能保证红色部分为正,策略就会提升,可以通过 $\tilde{\pi}(s) = \arg\max_a A_\pi (s,a)$ 方法来迭代提升;但是很难避免红色部分为负导致效果反而降低,另外由于 $\rho_{\tilde{\pi}}$ 和 $\tilde{\pi}$ 依赖新策略,直接优化是很困难的,所以提出近似:
$$ L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\color{red}\pi}(s) \sum_a \tilde{\pi}(a|s) A_\pi (s,a) \tag{2} $$
可以看到,上面式子使用旧策略的状态分布替代新策略的状态分布,要求两者差异不大,优化 $L_{\pi}(\tilde{\pi})$ 便可
11.3.2 局部近似 #
假设有一个参数化策略 $\pi_\theta(a|s)$,那么对于任何参数 $\theta_0$ 有
$$ L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}) \tag{3} $$
$$ \nabla_\theta L_{\pi_{\theta_0}}(\pi_{\theta})|_{\theta=\theta_0} = \nabla_\theta\eta(\pi_{\theta}) |_{\theta=\theta_0} \tag{4} $$
这两个式子能够看到,两者在 $\theta_0$ 的值相同,梯度也相同,所以,优化 $L_{\pi}(\tilde{\pi})$,也就是优化 $\eta(\tilde{\pi})$,前提是,步子得小哈,只有局部是近似的
证明: #
$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0})$ 证明:
$$ \begin{align*} L_{\pi_{\theta_0}}(\pi_{\theta_0}) &= \eta(\pi_{\theta_0}) + \sum_s \rho_{\pi_{\theta_0}}(s) \sum_a \pi_{\theta_0}(a|s) A_{\pi_{\theta_0}} (s,a) \\ &= \eta(\pi_{\theta_0}) + \sum_s \rho_{\pi_{\theta_0}}(s) \left( \sum_a \pi_{\theta_0}(a|s) \Big( Q_{\pi_{\theta_0}}(s,a) - V_{\pi_{\theta_0}}(s) \Big) \right) \\ &= \eta(\pi_{\theta_0}) + \sum_s \rho_{\pi_{\theta_0}}(s) \left( \sum_a \pi_{\theta_0}(a|s) Q_{\pi_{\theta_0}}(s,a) - V_{\pi_{\theta_0}}(s) \right) \tag{其中:$\sum_a \pi_{\theta_0}(a|s) = 1$}\\ &= \eta(\pi_{\theta_0}) \end{align*} $$
$\nabla_\theta L_{\pi_{\theta_0}}(\pi_{\theta})|_{\theta=\theta_0} = \nabla_\theta\eta(\pi_{\theta}) |_{\theta=\theta_0}$ 证明:
$$ \begin{align*} \nabla_\theta L_{\pi_{\theta_0}}(\pi_{\theta}) &= \sum_s \rho_{\pi_{\theta_0}}(s) \sum_a \nabla_\theta\pi_{\theta}(a|s) A_{\pi_{\theta_0}} (s,a) \\ \nabla_\theta\eta(\pi_{\theta}) &= \nabla_\theta \left(\sum_s \rho_{\pi_\theta}(s) \sum_a \pi_{\theta}(a|s) A_{\pi_{\theta_0}} (s,a) \right) \\ &= \sum_s \nabla_\theta \rho_{\pi_\theta}(s) \sum_a \pi_{\theta}(a|s) A_{\pi_{\theta_0}} (s,a) + \sum_s \rho_{\pi_\theta}(s) \sum_a \nabla_\theta\pi_{\theta}(a|s) A_{\pi_{\theta_0}} (s,a) \\ &= \sum_s \rho_{\pi_\theta}(s)\sum_a \nabla_\theta\pi_{\theta}(a|s) A_{\pi_{\theta_0}} (s,a) \tag{当 $\theta = \theta_0$ 时,$\sum_a \pi_{\theta}(a|s) A_{\pi_{\theta_0}} (s,a) = 0$} \\ \end{align*} $$
11.3.3 一般随机策略的单调提升保证 #
定义:
$$ D_{TV}^{\max}(\pi, \tilde{\pi}) = \max_s D_{TV}\big(\pi(\cdot|s) \Vert \tilde{\pi}(\cdot|s)\big) $$
定理 1 #
令 $\alpha = D_{TV}^{\max}(\pi_{old}, \pi_{new})$,则:
$$ \eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - \cfrac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2 \tag{5} $$
其中 $\epsilon=\max_{s,a}|A_\pi(s,a)|$
由于 $D_{TV}(p \Vert q)^2 \le D_{KL}(p \Vert q)$,定义 $D_{KL}^{\max}(\pi, \tilde{\pi}) = \max_s D_{KL}(\pi(\cdot|s) \Vert \tilde{\pi}(\cdot|s))$,则 定理 1 可以写成:
$$ \eta(\tilde{\pi}) \ge L_\pi(\tilde{\pi}) - C D_{KL}^{\max}(\pi, \tilde{\pi}) \tag{6} $$
其中 $C = \cfrac{4\epsilon\gamma}{(1-\gamma)^2}$
保证 expected return
$\eta$ 不递减的策略迭代算法
#
- Initialize $\pi_0$.
- for $i = 0, 1, 2, \cdots$ 直到收敛 do
- 计算所有优势函数 $A_{\pi_i}(s,a)$
- 更新 $\pi_{i+1} = \arg\max_{\pi}\Big[ L_{\pi_i}(\pi) - C D_{KL}^{\max}(\pi_i, \pi) \Big]$
- end for
定义 $M_{\pi_i}(\pi) = L_{\pi_i}(\pi) - C D_{KL}^{\max}(\pi_i, \pi)$,从公式 (6)
可以得到:
$$ \eta(\pi_{i+1}) \ge M_{\pi_i}(\pi_{i+1}) $$
当 $\pi = \tilde{\pi}$ 时,$C D_{KL}^{\max}(\pi, \tilde{\pi}) = 0$,还有 (2)
中 ,$\sum_a \tilde{\pi}(a|s) A_\pi (s,a) = 0$,可以得到:
$$ \eta(\pi_i) = M_{\pi_i}(\pi_i) $$
所以:
$$ \eta(\pi_{i+1}) - \eta(\pi_i) \ge M_{\pi_i}(\pi_{i+1}) - M_{\pi_i}(\pi_i) $$
所以,能保证 $M_{\pi_i}(\pi_i) \rightarrow M_{\pi_i}(\pi_{i+1})$ 的提升,就能保证 $\eta(\pi_i) \rightarrow \eta(\pi_{i+1})$ 的提升,真牛逼
定理 1 证明 #
引理 1:
就是把 (1)
写下来:
$$ \eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\left[ \sum_{t=0}^\infty \gamma^t A_\pi(s_t, a_t) \right] $$
将 $\bar{A}(s)$ 定义为 $\tilde{\pi}$ 在状态 $s$ 下相对于 $\pi$ 的优势期望,也就是 ($1^\prime$) 红色部分:
$$ \bar{A}(s) = \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A_\pi(s,a)] $$
引理 1
就可以改写成:
$$ \eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\left[ \sum_{t=0}^\infty \gamma^t \bar{A}(s_t) \right] $$
公式 (2)
就可以改写成:
$$ L_{\pi}(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^\infty \gamma^t \bar{A}(s_t) \right] $$
这两个表达的区别,仅仅是使用 $\pi$ 还是 $\tilde{\pi}$ 来进行数据采样,为了限制 $\eta(\tilde{\pi})$ 和 $L_\pi(\tilde{\pi})$ 之间的差异,我们将限制每个时间步所产生的差异。要做到这一点,我们首先需要引入一个衡量 $\pi$ 和 $\tilde{\pi}$ 一致性的度量。
定义 1:
$(\pi, \tilde{\pi})$ 策略对的联合分布 $(a,\tilde{a}) \big| s$,对于所有的 $s$ 都使得 $P(a \neq \tilde{a} |s) \le \alpha$。则称它们是一个 $\alpha$-耦合策略对。$\pi$ 和 $\tilde{\pi}$ 分别表示 $a$ 和 $\tilde{a}$ 的边际分布。
引理 2:
$(\pi, \tilde{\pi})$ 是 $\alpha$-耦合策略对,则对于所有的 $s$:
$$ \big| \bar{A}(s) \big| \le 2\alpha\max_{s,a} \big| A_{\pi}(s,a) \big| $$
证明:
$$ \begin{align*} \bar{A}(s) &= \mathbb{E}_{\tilde{a} \sim \tilde{\pi}} \big[A_{\pi}(s, \tilde{a}) \big] \\ &= \mathbb{E}_{(\tilde{a}, a) \sim (\tilde{\pi}, \pi)} \big[ A_{\pi}(s, \tilde{a}) - A_{\pi}(s, a) \big] \tag{因为 $\mathbb{E}_{a \sim \pi} \big[ A_{\pi}(s, a) \big] = 0$} \\ &= P(a \neq \tilde{a} |s)\mathbb{E}_{(\tilde{a}, a) \sim (\tilde{\pi}, \pi) |_{a\neq \tilde{a}}} \big[ A_{\pi}(s, \tilde{a}) - A_{\pi}(s, a) \big] \end{align*} $$
两边加上绝对值,就得证!
引理 3:
$(\pi, \tilde{\pi})$ 是 $\alpha$-耦合策略对,则:
$$ \big|\mathbb{E}_{s_t\sim\tilde{\pi}}[\bar{A}(s_t)] - \mathbb{E}_{s_t\sim\pi}[\bar{A}(s_t)] \big| \le 2\alpha\max_s\bar{A}(s) \le 4\alpha(1 - (1-\alpha)^t)\max_{s,a}|A_\pi(s,a)| $$
证明:
设 $\tau$、$\tilde{\tau}$ 分别基于 $\pi$、$\tilde{\pi}$ 采样得到,设 $n_t$ 代表在 $\pi$ 和 $\tilde{\pi}$ 分布下,$i < t$ 之前,$a_i \neq \tilde{a}_i$ 的个数
$$ \mathbb{E}_{s_t\sim\tilde{\pi}}[\bar{A}(s_t)] = P(n_t=0) \mathbb{E}_{s_t\sim\tilde{\pi}|_{n_t=0}}[\bar{A}(s_t)] + P(n_t>0) \mathbb{E}_{s_t\sim\tilde{\pi}|_{n_t>0}}[\bar{A}(s_t)] \\ \mathbb{E}_{s_t\sim\pi}[\bar{A}(s_t)] = P(n_t=0) \mathbb{E}_{s_t\sim\pi|_{n_t=0}}[\bar{A}(s_t)] + P(n_t>0) \mathbb{E}_{s_t\sim\pi|_{n_t>0}}[\bar{A}(s_t)] $$
注意到当 $n_t = 0$ 时:
$$ \mathbb{E}_{s_t\sim\tilde{\pi}|_{n_t=0}}[\bar{A}(s_t)] = \mathbb{E}_{s_t\sim\pi|_{n_t=0}}[\bar{A}(s_t)] $$
所以:
$$ \mathbb{E}_{s_t\sim\tilde{\pi}}[\bar{A}(s_t)] - \mathbb{E}_{s_t\sim\pi}[\bar{A}(s_t)] = P(n_t>0) \left( \mathbb{E}_{s_t\sim\tilde{\pi}|_{n_t>0}}[\bar{A}(s_t)] - \mathbb{E}_{s_t\sim\pi|_{n_t>0}}[\bar{A}(s_t)] \right) $$
$P(a \neq \tilde{a} |s) \le \alpha$ 则 $P(a = \tilde{a} |s) \ge 1 - \alpha$,所以 $P(n_t=0) \ge(1-\alpha)^t$,所以:
$$ P(n_t>0) \le 1-(1-\alpha)^t $$
另外:
$$ \begin{align*} \left| \mathbb{E}_{s_t\sim\tilde{\pi}|_{n_t>0}}[\bar{A}(s_t)] - \mathbb{E}_{s_t\sim\pi|_{n_t>0}}[\bar{A}(s_t)] \right| &\le |\mathbb{E}_{s_t\sim\tilde{\pi}|_{n_t>0}}[\bar{A}(s_t)]| + |\mathbb{E}_{s_t\sim\pi|_{n_t>0}}[\bar{A}(s_t)]| \\ &\le 4\alpha\max_{s,a}|A_\pi(s,a)| \tag{用到 引理 2} \end{align*} $$
然后就可以得到:
$$ \left|\mathbb{E}_{s_t\sim\tilde{\pi}}[\bar{A}(s_t)] - \mathbb{E}_{s_t\sim\pi}[\bar{A}(s_t)]\right| \le 4\alpha(1 - (1-\alpha)^t)\max_{s,a}|A_\pi(s,a)| $$
中间那个,我想不明白,不想了
再把上面 $\eta(\tilde{\pi})$ 和 $L_\pi(\tilde{\pi})$ 改写公式带下来,并设 $\epsilon=\max_{s,a}|A_\pi(s,a)|$
$$ \begin{align*} |\eta(\tilde{\pi}) - L_\pi(\tilde{\pi})| &= \sum_{t=0}^\infty\gamma^t\left|\mathbb{E}_{s_t\sim\tilde{\pi}}[\bar{A}(s_t)] - \mathbb{E}_{s_t\sim\pi}[\bar{A}(s_t)]\right| \\ &\le \sum_{t=0}^\infty\gamma^t \cdot 4\epsilon\alpha(1-(1-\alpha)^t) \\ &= 4\epsilon\alpha \left( \cfrac{1}{1-\gamma} - \cfrac{1}{1-\gamma(1-\alpha)} \right) \\ &= \cfrac{4\alpha^2 \gamma \epsilon}{(1-\gamma)(1-\gamma(1-\alpha))} \\ &\le \cfrac{4\alpha^2 \gamma \epsilon}{(1-\gamma)^2} \end{align*} $$
最后,为了用 $D_{TV}$ 代替 $\alpha$,我们需要使用 TV 散度和耦合随机变量之间的对应关系:
假设 $p_X$ 和 $p_Y$ 是 $D_{TV}(p_X || p_Y)=\alpha$ 的分布,然后存在联合分布 $(X,Y)$,其边界分布为 $p_X、p_Y$,并且 $X=Y$ 的概率为 $1-\alpha$
因此,如果我们有两个策略 $\pi$ 和 $\tilde{\pi}$,使得 $\max_s D_{TV}(\pi(\cdot|s)||\tilde{\pi}(\cdot|s)) \le \alpha$,那么我们可以定义一个具有适当边值的 $\alpha$-耦合策略对 $(\pi, \tilde{\pi})$。然后将 $\alpha= \max_s D_{TV}(\pi(\cdot|s)||\tilde{\pi}(\cdot|s)) \le \alpha$ 带入上面,就可以证得定理 1
11.3.4 参数化策略的优化问题 #
- 优化问题为最大化目标函数:
$$ \max_\theta \left[ L_{\theta_{old}}(\theta) - CD_{KL}^{\max}(\theta_{old}, \theta) \right] $$
在实践中,如果采用上述惩罚系数 $C$,则步长会非常小,一种较为鲁棒的方法是把惩罚项转变为约束条件,使用新旧策略之间 $KL$ 散度的约束,即信赖域约束,因此优化问题转化为:
$$ \max_\theta L_{\theta_{old}}(\theta) $$
$$ s.t. \ \ D_{KL}^{\max}(\theta_{old}, \theta) \le \delta $$
上述优化问题约束条件用来约束所有状态下两两策略的 $KL$ 散度都有界,这样造成约束条件的个数过于庞大,甚至可能是无穷多个,求解这一问题是不切实际的;因此,本文可以考虑用平均 $KL$ 散度来近似上述约束条件:
$$ \max_\theta L_{\theta_{old}}(\theta) $$
$$ s.t. \ \ \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta $$
其中 $\bar{D}_{KL}^{\rho}(\theta, \tilde{\theta}) = \mathbb{E}_{s\sim\rho}\left[ \pi_\theta(\cdot|s) | \pi_{\tilde{\theta}}(\cdot|s) \right]$
11.3.5 Sample-Based Estimation of the Obiective and Constraint #
- 目标函数:
$$ L_{\theta_{old}}(\theta) = \eta(\theta_{old}) + \sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a|s) A_{\theta_{old}} (s,a) $$
- 第一项与 $\theta$ 无关,因此优化问题进一步写为:
$$ \max_\theta \sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a|s) A_{\theta_{old}} (s,a) $$
$$ s.t. \ \ \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta $$
- 其中,利用重要性采样来估计动作分布,给定单个状态 $s$:
$$ \sum_a \pi_\theta(a|s) A_{\theta_{old}} (s,a) = \mathbb{E}_{a\sim\pi_{\theta_{old}}(\cdot|s)} \left[ \cfrac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} A_{\theta_{old}} (s,a) \right] $$
- 由于 $A=Q-V$,旧策略下,$V$ 为常数,与 $\theta$ 无关,因此可以舍去 $V$,因此,优化问题进一步变为:
$$ \max_\theta J_{\theta_{old}}(\theta) = \mathbb{E}_{s\sim\rho_{\theta_{old}}, a\sim\pi_{\theta_{old}}} \left[ \cfrac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} Q_{\theta_{old}} (s,a) \right] $$
$$ s.t. \ \ \mathbb{E}_{s\sim\rho_{\theta_{old}}}\left[ D_{KL}\left( \pi_{\theta_{old}}(\cdot|s) | \pi_{\theta}(\cdot|s) \right) \right] $$
11.3.6 约束优化问题的求解 #
- 泰勒展开式:
$$ f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x-a)^n = f(a) + f^{\prime}(a)(x-a) + \frac{f^{\prime\prime}(a)}{2!}(x-a)^2 + \frac{f^{\prime\prime\prime}(a)}{3!}(x-a)^3 + \cdots $$
对上述目标函数采用线性近似(类似于求解 $Ax=y$ 中的 $x$),对约束条件采用二次近似:
$$ J_{\theta_{old}}(\theta) \approx \nabla_\theta J_{\theta_{old}}(\theta) \big|_{\theta=\theta_{old}} \cdot (\theta - \theta_{old}) $$
$$ \mathbb{E}_{s\sim\rho_{\theta_{old}}}\left[ D_{KL}\left( \pi_{\theta_{old}}(\cdot|s) | \pi_{\theta}(\cdot|s) \right) \right] \approx \cfrac{1}{2} (\theta - \theta_{old})^T \cdot I(\theta_{old}) \cdot (\theta - \theta_{old}) $$
其中 $I_{ij}(\theta_{old}) = \cfrac{\partial^2 {\mathbb{E}_{s\sim\rho_{\theta_{old}}}\left[ D_{KL}\left( \pi_{\theta_{old}}(\cdot|s) | \pi_{\theta}(\cdot|s) \right) \right]} }{\partial i \partial j} \Bigg | _{\theta=\theta_{old}}$
- 因此,约束优化问题近似为:
$$ \max_\theta \nabla_\theta J_{\theta_{old}}(\theta) \big|_{\theta=\theta_{old}} \cdot (\theta - \theta_{old}) $$
$$ s.t. \ \ \cfrac{1}{2} (\theta - \theta_{old})^T \cdot I(\theta_{old}) \cdot (\theta - \theta_{old}) \le \delta $$
- 上述转化为:
$$ \begin{align*} \min_\theta &\ \ -\nabla_\theta J_{\theta_{old}}(\theta) \big|_{\theta=\theta_{old}} \cdot (\theta - \theta_{old}) \\ s.t. &\ \ \cfrac{1}{2} (\theta - \theta_{old})^T \cdot I(\theta_{old}) \cdot (\theta - \theta_{old}) - \delta + c = 0 \\ s.t. &\ \ c \le 0 \end{align*} $$
构造拉格朗日函数,$\gamma$ 为 拉格朗日乘子
$$ \begin{align*} L(\theta, \gamma) &= -\nabla_\theta J_{\theta_{old}}(\theta) \big|_{\theta=\theta_{old}} \cdot (\theta - \theta_{old}) + \gamma\left( \cfrac{1}{2} (\theta - \theta_{old})^T \cdot I(\theta_{old}) \cdot (\theta - \theta_{old}) - \delta + c = 0 \right) \\ \cfrac{\partial L(\theta, \gamma)}{\partial \theta} &= -\nabla_\theta L_{\theta_{old}}(\theta) \big|_{\theta=\theta_{old}} + \gamma\left( I(\theta_{old}) \cdot (\theta - \theta_{old}) \right) = 0 \\ \end{align*} $$
$$ \Downarrow \\ \theta = \theta_{old} + \underbrace{\cfrac{1}{\gamma}}_{步长} \underbrace{I^{-1}(\theta_{old}) \cdot \nabla_\theta J_{\theta_{old}}(\theta) \big|_{\theta=\theta_{old}}}_{搜索方向} $$
然而,求解 $I$ 的逆矩阵依旧难求,于是用共轭梯度法 (Conjugate Gradient Algorithm
) 近似求解搜索方向;假设已通过共轭梯度法获得了近似搜索方向 $\hat{g} \approx I^{-1}(\theta_{old}) \cdot \nabla_\theta J_{\theta_{old}}(\theta) \big|_{\theta=\theta_{old}}$,则 $\theta$ 的更新为 $\theta=\theta+\beta\hat{g}$
- 现在确定步长 $\beta$,约束条件:
$$ \cfrac{\partial L(\theta, \gamma)}{\partial \theta} = 0 \Rightarrow \delta-c = \cfrac{1}{2}(\beta\hat{g})^T I(\theta_{old}) (\beta\hat{g}) = \cfrac{1}{2} \beta^2 \hat{g}^T I(\theta_{old}) \hat{g} $$
- 则 $\beta$ 为:
$$ \beta = \sqrt{ \cfrac{2(\delta-c)}{\hat{g}^T I(\theta_{old}) \hat{g}} } $$
$\hat{g}^T I(\theta_{old}) \hat{g}$ 可以通过单个 Hessian 向量积进行计算
- 因此,参数的更新为:
$$ \theta = \theta_{old} + \sqrt{ \cfrac{2(\delta-c)}{\hat{g}^T I(\theta_{old}) \hat{g}} } \cdot \hat{g} $$
其中,最大步长:
$$ \beta_{\max} = \sqrt{ \cfrac{2\delta}{\hat{g}^T I(\theta_{old}) \hat{g}} } \tag{当 $c = 0$ 时} $$
11.4 代码 #
import copy
import torch
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
import torch.nn.functional as F
import rl_utils
class PolicyNet(torch.nn.Module):
def __init__(self, state_dim, hidden_dim, action_dim):
super(PolicyNet, self).__init__()
self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
self.fc2 = torch.nn.Linear(hidden_dim, action_dim)
def forward(self, x):
x = F.relu(self.fc1(x))
return F.softmax(self.fc2(x), dim=1)
class ValueNet(torch.nn.Module):
def __init__(self, state_dim, hidden_dim):
super(ValueNet, self).__init__()
self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
self.fc2 = torch.nn.Linear(hidden_dim, 1)
def forward(self, x):
x = F.relu(self.fc1(x))
return self.fc2(x)
def compute_advantage(gamma, lmbda, td_delta):
td_delta = td_delta.detach().numpy()
advantage_list = []
advantage = 0.0
for delta in td_delta[::-1]:
advantage = gamma * lmbda * advantage + delta
advantage_list.append(advantage)
advantage_list.reverse()
return torch.tensor(advantage_list, dtype=torch.float)
class TRPO:
""" TRPO算法 """
def __init__(self, hidden_dim, state_space, action_space, lmbda,
kl_constraint, alpha, critic_lr, gamma, device):
state_dim = state_space.shape[0]
action_dim = action_space.n
# 策略网络参数不需要优化器更新
self.actor = PolicyNet(state_dim, hidden_dim, action_dim).to(device)
self.critic = ValueNet(state_dim, hidden_dim).to(device)
self.critic_optimizer = torch.optim.Adam(self.critic.parameters(),
lr=critic_lr)
self.gamma = gamma
self.lmbda = lmbda # GAE参数
self.kl_constraint = kl_constraint # KL距离最大限制
self.alpha = alpha # 线性搜索参数
self.device = device
def take_action(self, state):
state = torch.tensor([state], dtype=torch.float).to(self.device)
probs = self.actor(state)
action_dist = torch.distributions.Categorical(probs)
action = action_dist.sample()
return action.item()
def hessian_matrix_vector_product(self, states, old_action_dists, vector):
# 计算黑塞矩阵和一个向量的乘积
new_action_dists = torch.distributions.Categorical(self.actor(states))
kl = torch.mean(
torch.distributions.kl.kl_divergence(old_action_dists,
new_action_dists)) # 计算平均KL距离
kl_grad = torch.autograd.grad(kl,
self.actor.parameters(),
create_graph=True)
kl_grad_vector = torch.cat([grad.view(-1) for grad in kl_grad])
# KL距离的梯度先和向量进行点积运算
kl_grad_vector_product = torch.dot(kl_grad_vector, vector)
grad2 = torch.autograd.grad(kl_grad_vector_product,
self.actor.parameters())
grad2_vector = torch.cat([grad.view(-1) for grad in grad2])
return grad2_vector
def conjugate_gradient(self, grad, states, old_action_dists): # 共轭梯度法求解方程
x = torch.zeros_like(grad)
r = grad.clone()
p = grad.clone()
rdotr = torch.dot(r, r)
for i in range(10): # 共轭梯度主循环
Hp = self.hessian_matrix_vector_product(states, old_action_dists,
p)
alpha = rdotr / torch.dot(p, Hp)
x += alpha * p
r -= alpha * Hp
new_rdotr = torch.dot(r, r)
if new_rdotr < 1e-10:
break
beta = new_rdotr / rdotr
p = r + beta * p
rdotr = new_rdotr
return x
def compute_surrogate_obj(self, states, actions, advantage, old_log_probs,
actor): # 计算策略目标
log_probs = torch.log(actor(states).gather(1, actions))
ratio = torch.exp(log_probs - old_log_probs)
return torch.mean(ratio * advantage)
def line_search(self, states, actions, advantage, old_log_probs,
old_action_dists, max_vec): # 线性搜索
old_para = torch.nn.utils.convert_parameters.parameters_to_vector(
self.actor.parameters())
old_obj = self.compute_surrogate_obj(states, actions, advantage,
old_log_probs, self.actor)
for i in range(15): # 线性搜索主循环
coef = self.alpha**i
new_para = old_para + coef * max_vec
new_actor = copy.deepcopy(self.actor)
torch.nn.utils.convert_parameters.vector_to_parameters(
new_para, new_actor.parameters())
new_action_dists = torch.distributions.Categorical(
new_actor(states))
kl_div = torch.mean(
torch.distributions.kl.kl_divergence(old_action_dists,
new_action_dists))
new_obj = self.compute_surrogate_obj(states, actions, advantage,
old_log_probs, new_actor)
if new_obj > old_obj and kl_div < self.kl_constraint:
return new_para
return old_para
def policy_learn(self, states, actions, old_action_dists, old_log_probs,
advantage): # 更新策略函数
surrogate_obj = self.compute_surrogate_obj(states, actions, advantage,
old_log_probs, self.actor)
grads = torch.autograd.grad(surrogate_obj, self.actor.parameters())
obj_grad = torch.cat([grad.view(-1) for grad in grads]).detach()
# 用共轭梯度法计算x = H^(-1)g
descent_direction = self.conjugate_gradient(obj_grad, states,
old_action_dists)
Hd = self.hessian_matrix_vector_product(states, old_action_dists,
descent_direction)
max_coef = torch.sqrt(2 * self.kl_constraint /
(torch.dot(descent_direction, Hd) + 1e-8))
new_para = self.line_search(states, actions, advantage, old_log_probs,
old_action_dists,
descent_direction * max_coef) # 线性搜索
torch.nn.utils.convert_parameters.vector_to_parameters(
new_para, self.actor.parameters()) # 用线性搜索后的参数更新策略
def update(self, transition_dict):
states = torch.tensor(transition_dict['states'],
dtype=torch.float).to(self.device)
actions = torch.tensor(transition_dict['actions']).view(-1, 1).to(
self.device)
rewards = torch.tensor(transition_dict['rewards'],
dtype=torch.float).view(-1, 1).to(self.device)
next_states = torch.tensor(transition_dict['next_states'],
dtype=torch.float).to(self.device)
dones = torch.tensor(transition_dict['dones'],
dtype=torch.float).view(-1, 1).to(self.device)
td_target = rewards + self.gamma * self.critic(next_states) * (1 -
dones)
td_delta = td_target - self.critic(states)
advantage = compute_advantage(self.gamma, self.lmbda,
td_delta.cpu()).to(self.device)
old_log_probs = torch.log(self.actor(states).gather(1,
actions)).detach()
old_action_dists = torch.distributions.Categorical(
self.actor(states).detach())
critic_loss = torch.mean(
F.mse_loss(self.critic(states), td_target.detach()))
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step() # 更新价值函数
# 更新策略函数
self.policy_learn(states, actions, old_action_dists, old_log_probs,
advantage)
num_episodes = 1000
hidden_dim = 128
gamma = 0.98
lmbda = 0.95
critic_lr = 1e-2
kl_constraint = 0.0005
alpha = 0.5
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
env_name = 'CartPole-v1'
env = gym.make(env_name)
np.random.seed(0)
torch.manual_seed(0)
agent = TRPO(hidden_dim, env.observation_space, env.action_space, lmbda, kl_constraint, alpha, critic_lr, gamma, device)
return_list = rl_utils.train_on_policy_agent(env, agent, num_episodes)
episodes_list = list(range(len(return_list)))
plt.plot(episodes_list, return_list)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('TRPO on {}'.format(env_name))
plt.show()
mv_return = rl_utils.moving_average(return_list, 9)
plt.plot(episodes_list, mv_return)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('TRPO on {}'.format(env_name))
plt.show()