zhuzilin's Blog


Some Mathematical Foundations of Reinforcement Learning

date: 2025-10-09
tags: math

Performance Difference Lemma

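This lemma is very powerful: it lets us use the advantage function of the old policy to estimate how much the new policy improves on it. From this we can derive lower bounds on the improvement.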

$$
\begin{aligned}
V^\pi(s_0)&=\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[r(s_0,a_0)+\gamma\mathbb{E}_{s'\sim P(s_0,a_0)}V^\pi(s')]\\
Q^\pi(s_0,a_0)&=r(s_0,a_0)+\gamma\mathbb{E}_{s'\sim P(s_0,a_0)}V^\pi(s')
\end{aligned}
$$
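
To make the two Bellman equations above concrete, here is a minimal sketch that evaluates a policy on a tiny tabular MDP by fixed-point iteration. The MDP (`P`, `R`), the policy `PI`, the discount `GAMMA`, and the helper names are all made up for illustration and are not from the original post.

```python
# Toy tabular MDP with 3 states and 2 actions (all numbers are hypothetical).
GAMMA = 0.9
# P[s][a][t] = probability of landing in state t after taking action a in state s
P = [
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]],
    [[0.2, 0.1, 0.7], [0.6, 0.3, 0.1]],
]
# R[s][a] = reward r(s, a)
R = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.5]]
# PI[s][a] = pi(a | s)
PI = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
N_S, N_A = len(R), len(R[0])


def q_from_v(V):
    """Q(s, a) = r(s, a) + gamma * E_{s' ~ P(s, a)} V(s')."""
    return [
        [R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(N_S)) for a in range(N_A)]
        for s in range(N_S)
    ]


def evaluate_policy(pi, iters=2000):
    """Iterate V(s) <- E_{a ~ pi(.|s)} Q(s, a), starting from V = 0."""
    V = [0.0] * N_S
    for _ in range(iters):
        Q = q_from_v(V)
        V = [sum(pi[s][a] * Q[s][a] for a in range(N_A)) for s in range(N_S)]
    return V, q_from_v(V)


V, Q = evaluate_policy(PI)
for s in range(N_S):
    # Right-hand side of the first Bellman equation, evaluated with the computed Q.
    bellman_rhs = sum(PI[s][a] * Q[s][a] for a in range(N_A))
    print(s, V[s], bellman_rhs)
```

For each state the two printed numbers should agree up to floating-point error, since the iteration is a contraction with factor `GAMMA` and has long converged after 2000 sweeps.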

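Define $\mathbb{P}_h^\pi(s;s_0)$ as the probability of reaching state $s$ from $s_0$ in $h$ steps under $\pi$, and $\mathbb{P}_h^\pi(s,a;s_0)$ as the probability of reaching $s$ from $s_0$ in $h$ steps under $\pi$ and then choosing action $a$ in $s$. It follows that: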

$$
\mathbb{P}_h^\pi(s;s_0)=\sum_a\mathbb{P}^\pi_h(s,a;s_0)
$$

Define the discounted state-action occupancy measure:

$$
d_{s_0}^\pi(s,a)=(1-\gamma)\sum_{h=0}^\infty\gamma^h\mathbb{P}_h^\pi(s,a;s_0)
$$
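
The occupancy measure can be approximated by propagating the state distribution forward and truncating the infinite sum. The sketch below assumes a hypothetical 3-state, 2-action MDP (`P`), a made-up policy `PI`, and a truncation `horizon`; none of these come from the original post.

```python
# Toy transition model and policy (hypothetical numbers).
GAMMA = 0.9
# P[s][a][t] = probability of landing in state t after taking action a in state s
P = [
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]],
    [[0.2, 0.1, 0.7], [0.6, 0.3, 0.1]],
]
# PI[s][a] = pi(a | s)
PI = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
N_S, N_A = len(P), len(P[0])


def occupancy(pi, s0, horizon=2000):
    """Truncated d(s, a) = (1 - gamma) * sum_{h < horizon} gamma^h * P_h(s, a; s0)."""
    state_dist = [1.0 if s == s0 else 0.0 for s in range(N_S)]  # P_0(. ; s0)
    d = [[0.0] * N_A for _ in range(N_S)]
    discount = 1.0
    for _ in range(horizon):
        # P_h(s, a; s0) = P_h(s; s0) * pi(a | s)
        sa_dist = [[state_dist[s] * pi[s][a] for a in range(N_A)] for s in range(N_S)]
        for s in range(N_S):
            for a in range(N_A):
                d[s][a] += (1 - GAMMA) * discount * sa_dist[s][a]
        # P_{h+1}(t; s0) = sum_{s, a} P_h(s, a; s0) * P(t | s, a)
        state_dist = [
            sum(sa_dist[s][a] * P[s][a][t] for s in range(N_S) for a in range(N_A))
            for t in range(N_S)
        ]
        discount *= GAMMA
    return d


d = occupancy(PI, s0=0)
total = sum(sum(row) for row in d)
print(d)
print(total)
```

The factor $(1-\gamma)$ is what makes $d_{s_0}^\pi$ a probability distribution, so `total` should be close to 1, up to the truncation error $\gamma^{\text{horizon}}$.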

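Lemma (Performance Difference Lemma): for any two policies $\pi$ and $\pi'$, with the advantage $A^{\pi'}(s,a)=Q^{\pi'}(s,a)-V^{\pi'}(s)$ and with $s\sim d_{s_0}^\pi$ denoting a sample from the state marginal $\sum_a d_{s_0}^\pi(s,a)$,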

$$
V^\pi(s_0) -V^{\pi'}(s_0)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{s_0}^\pi}[\mathbb{E}_{a\sim\pi(\cdot|s)}A^{\pi'}(s,a)]
$$
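
Before going through the proof, the identity can be checked numerically. The following is a minimal sketch on a made-up tabular MDP: `PI_NEW` plays the role of $\pi$, `PI_OLD` plays the role of $\pi'$, and every number and helper name here is an assumption for illustration only.

```python
# Toy tabular MDP with 3 states and 2 actions (all numbers are hypothetical).
GAMMA = 0.9
# P[s][a][t] = probability of landing in state t after taking action a in state s
P = [
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]],
    [[0.2, 0.1, 0.7], [0.6, 0.3, 0.1]],
]
R = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.5]]           # R[s][a] = r(s, a)
PI_NEW = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]      # pi  (new policy)
PI_OLD = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]      # pi' (old policy)
N_S, N_A = len(R), len(R[0])


def q_from_v(V):
    """Q(s, a) = r(s, a) + gamma * E_{s' ~ P(s, a)} V(s')."""
    return [
        [R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(N_S)) for a in range(N_A)]
        for s in range(N_S)
    ]


def evaluate_policy(pi, iters=2000):
    """Fixed-point iteration on the Bellman equation for V^pi."""
    V = [0.0] * N_S
    for _ in range(iters):
        Q = q_from_v(V)
        V = [sum(pi[s][a] * Q[s][a] for a in range(N_A)) for s in range(N_S)]
    return V, q_from_v(V)


def occupancy(pi, s0, horizon=2000):
    """Truncated d(s, a) = (1 - gamma) * sum_h gamma^h * P_h(s, a; s0)."""
    state_dist = [1.0 if s == s0 else 0.0 for s in range(N_S)]
    d = [[0.0] * N_A for _ in range(N_S)]
    discount = 1.0
    for _ in range(horizon):
        sa_dist = [[state_dist[s] * pi[s][a] for a in range(N_A)] for s in range(N_S)]
        for s in range(N_S):
            for a in range(N_A):
                d[s][a] += (1 - GAMMA) * discount * sa_dist[s][a]
        state_dist = [
            sum(sa_dist[s][a] * P[s][a][t] for s in range(N_S) for a in range(N_A))
            for t in range(N_S)
        ]
        discount *= GAMMA
    return d


s0 = 0
V_new, _ = evaluate_policy(PI_NEW)
V_old, Q_old = evaluate_policy(PI_OLD)
d_new = occupancy(PI_NEW, s0)

# Left side: actual value gap. Right side: old advantage averaged under the new occupancy.
lhs = V_new[s0] - V_old[s0]
rhs = sum(
    d_new[s][a] * (Q_old[s][a] - V_old[s]) for s in range(N_S) for a in range(N_A)
) / (1 - GAMMA)
print(lhs, rhs)
```

Note that the advantage comes from the old policy while the states and actions are weighted by the new policy's occupancy; swapping either one would break the equality.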

Proof:

$$
\begin{aligned}
&V^\pi(s_0)-V^{\pi'}(s_0)\\
&=V^\pi(s_0) -\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[Q^{\pi'}(s_0,a_0)]+\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[Q^{\pi'}(s_0,a_0)]-V^{\pi'}(s_0)\\
&= V^\pi(s_0) -\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[r(s_0,a_0)+\gamma\mathbb{E}_{s'\sim P(s_0,a_0)}V^{\pi'}(s')]+\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[Q^{\pi'}(s_0,a_0)]-V^{\pi'}(s_0)\\
&=\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[r(s_0,a_0)+\gamma\mathbb{E}_{s'\sim P(s_0,a_0)}V^\pi(s')]-\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[r(s_0,a_0)+\gamma\mathbb{E}_{s'\sim P(s_0,a_0)}V^{\pi'}(s')]\\
&\quad+\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[Q^{\pi'}(s_0,a_0)]-V^{\pi'}(s_0)\\
&=\gamma\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[\mathbb{E}_{s_1\sim P(s_0,a_0)}[V^\pi(s_1)-V^{\pi'}(s_1)]]+\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[Q^{\pi'}(s_0,a_0)]-V^{\pi'}(s_0)\\
&=\gamma\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[\mathbb{E}_{s_1\sim P(s_0,a_0)}[V^\pi(s_1)-V^{\pi'}(s_1)]]+\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[Q^{\pi'}(s_0,a_0)-V^{\pi'}(s_0)]\\
&=\gamma\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[\mathbb{E}_{s_1\sim P(s_0,a_0)}[V^\pi(s_1)-V^{\pi'}(s_1)]]+\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[A^{\pi'}(s_0,a_0)]
\end{aligned}
$$

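If we define the one-step state transition under $\pi$ (this is $\mathbb{P}_1^\pi$ in the notation above):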

$$
\mathbb{P}^\pi(s_1;s_0)=\sum_{a_0}\pi(a_0|s_0)P(s_1|s_0,a_0)=\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}P(s_1|s_0,a_0)
$$

then we have:

$$
V^\pi(s_0)-V^{\pi'}(s_0)=\gamma\mathbb{E}_{s_1\sim\mathbb{P}^{\pi}(\cdot;s_0)}[V^\pi(s_1)-V^{\pi'}(s_1)]+\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[A^{\pi'}(s_0,a_0)]
$$

Applying the same identity with $s_1$ in place of $s_0$, we have:

$$
V^\pi(s_1)-V^{\pi'}(s_1)=\gamma\mathbb{E}_{s_2\sim\mathbb{P}^{\pi}(\cdot;s_1)}[V^\pi(s_2)-V^{\pi'}(s_2)]+\mathbb{E}_{a_1\sim\pi(\cdot|s_1)}[A^{\pi'}(s_1,a_1)]
$$

Substituting this back gives:

$$
\begin{aligned}
V^\pi(s_0)-V^{\pi'}(s_0)&= \gamma\mathbb{E}_{s_1\sim\mathbb{P}_1^{\pi}(\cdot;s_0)}\Big[\gamma\mathbb{E}_{s_2\sim\mathbb{P}_1^{\pi}(\cdot;s_1)}[V^\pi(s_2)-V^{\pi'}(s_2)] +\mathbb{E}_{a_1\sim\pi(\cdot|s_1)}[A^{\pi'}(s_1,a_1)]\Big]\\
&\quad+\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[A^{\pi'}(s_0,a_0)]\\
&=\gamma^2\mathbb{E}_{s_2\sim\mathbb{P}_2^{\pi}(\cdot;s_0)}[V^\pi(s_2)-V^{\pi'}(s_2)]+\gamma\mathbb{E}_{s_1\sim\mathbb{P}_1^{\pi}(\cdot;s_0)}[\mathbb{E}_{a_1\sim\pi(\cdot|s_1)}[A^{\pi'}(s_1,a_1)]]\\
&\quad+\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[A^{\pi'}(s_0,a_0)]\\
&=\gamma^2\mathbb{E}_{s_2\sim\mathbb{P}_2^{\pi}(\cdot;s_0)}[V^\pi(s_2)-V^{\pi'}(s_2)]+\gamma\mathbb{E}_{s_1,a_1\sim\mathbb{P}_1^{\pi}(\cdot,\cdot;s_0)}[A^{\pi'}(s_1,a_1)]\\
&\quad+\mathbb{E}_{a_0\sim\pi(\cdot|s_0)}[A^{\pi'}(s_0,a_0)]\\
&=\dots\\
&=\sum_{h=0}^{\infty}\gamma^h\mathbb{E}_{s,a\sim \mathbb{P}_h^\pi(\cdot,\cdot;s_0)}A^{\pi'}(s,a)\\
&=\frac{1}{1-\gamma}\mathbb{E}_{s,a\sim d_{s_0}^\pi}A^{\pi'}(s,a)
\end{aligned}
$$

In the step marked $\dots$, the leftover term $\gamma^H\mathbb{E}_{s_H\sim\mathbb{P}_H^{\pi}(\cdot;s_0)}[V^\pi(s_H)-V^{\pi'}(s_H)]$ tends to $0$ as $H\to\infty$ when rewards are bounded, and the last equality is just the definition of $d_{s_0}^\pi$.
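
The unrolling in the proof can also be checked at a finite depth $H$: the value gap should equal the discounted advantage terms for $h<H$ plus a remainder $\gamma^H\mathbb{E}[V^\pi(s_H)-V^{\pi'}(s_H)]$. The sketch below does this on a made-up tabular MDP; the MDP, both policies, and the helper names `evaluate_policy` and `unroll` are assumptions for illustration.

```python
# Toy tabular MDP with 3 states and 2 actions (all numbers are hypothetical).
GAMMA = 0.9
# P[s][a][t] = probability of landing in state t after taking action a in state s
P = [
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]],
    [[0.2, 0.1, 0.7], [0.6, 0.3, 0.1]],
]
R = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.5]]           # R[s][a] = r(s, a)
PI_NEW = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]      # pi  (new policy)
PI_OLD = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]      # pi' (old policy)
N_S, N_A = len(R), len(R[0])
S0 = 0


def q_from_v(V):
    """Q(s, a) = r(s, a) + gamma * E_{s' ~ P(s, a)} V(s')."""
    return [
        [R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(N_S)) for a in range(N_A)]
        for s in range(N_S)
    ]


def evaluate_policy(pi, iters=2000):
    """Fixed-point iteration on the Bellman equation for V^pi."""
    V = [0.0] * N_S
    for _ in range(iters):
        Q = q_from_v(V)
        V = [sum(pi[s][a] * Q[s][a] for a in range(N_A)) for s in range(N_S)]
    return V, q_from_v(V)


V_new, _ = evaluate_policy(PI_NEW)
V_old, Q_old = evaluate_policy(PI_OLD)


def unroll(H):
    """Return (sum_{h<H} gamma^h E_{P_h}[A^{pi'}], gamma^H E_{P_H}[V^pi - V^{pi'}])."""
    state_dist = [1.0 if s == S0 else 0.0 for s in range(N_S)]  # P_0(. ; s0)
    adv_sum, discount = 0.0, 1.0
    for _ in range(H):
        for s in range(N_S):
            for a in range(N_A):
                adv_sum += discount * state_dist[s] * PI_NEW[s][a] * (Q_old[s][a] - V_old[s])
        # Advance the state distribution one step under the new policy.
        state_dist = [
            sum(state_dist[s] * PI_NEW[s][a] * P[s][a][t] for s in range(N_S) for a in range(N_A))
            for t in range(N_S)
        ]
        discount *= GAMMA
    remainder = discount * sum(state_dist[s] * (V_new[s] - V_old[s]) for s in range(N_S))
    return adv_sum, remainder


gap = V_new[S0] - V_old[S0]
results = {H: unroll(H) for H in (1, 2, 10, 200)}
for H, (adv_sum, remainder) in results.items():
    print(H, gap, adv_sum + remainder, remainder)
```

For every `H` the second and third printed columns should match, and the last column (the remainder) should shrink roughly like $\gamma^H$, which is the reason the infinite sum in the proof contains only advantage terms.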