zhuzilin's Blog


About Matrix Calculus

date: 2019-03-10
tags: math  

For a long time, even while studying machine learning, my grasp of basic matrix derivatives remained fuzzy, so here I am organizing the topic properly. Almost all of the following comes from the Wikipedia page on Matrix calculus.

Scope

Let's start with a simple example, the gradient of a scalar function:

$$
\nabla f(x_1, x_2, x_3)=\frac{\partial f}{\partial x_1}\hat{x}_1+\frac{\partial f}{\partial x_2}\hat{x}_2+\frac{\partial f}{\partial x_3}\hat{x}_3
$$

This can also be written in matrix form, as the derivative of a scalar function with respect to a vector:

$$
\nabla f=\frac{\partial f}{\partial \mathbf{x}}=
\begin{bmatrix}
\frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \frac{\partial f}{\partial x_3}
\end{bmatrix}^T
$$

A more complicated example is the derivative of a scalar function with respect to a matrix, known as the gradient matrix, which collects the derivative with respect to every element of the matrix.

As another example, suppose we have $n$ dependent variables and $m$ independent variables. The derivative of the vector of dependent variables with respect to the vector of independent variables is then an $n\times m$ matrix (in numerator layout) containing all $nm$ partial derivatives.

Scalars, vectors, and matrices can each be differentiated with respect to one another, but only some of these combinations are commonly used:

| | Scalar $y$ | Vector $\mathbf{y}$ | Matrix $\mathbf{Y}$ |
|---|---|---|---|
| Scalar $x$ | $\frac{\partial y}{\partial x}$ | $\frac{\partial \mathbf{y}}{\partial x}$ | $\frac{\partial \mathbf{Y}}{\partial x}$ |
| Vector $\mathbf{x}$ | $\frac{\partial y}{\partial \mathbf{x}}$ | $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ | |
| Matrix $\mathbf{X}$ | $\frac{\partial y}{\partial \mathbf{X}}$ | | |

Note that we could also differentiate, say, a matrix function with respect to a vector, but the result would be a tensor of order higher than 2, so such cases are not discussed here.
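To make the scalar-by-vector case concrete, here is a minimal numpy sketch (the function `f` and the finite-difference helper are my own illustration, not from the original article) comparing a hand-derived gradient against central differences:

```python
import numpy as np

# f(x1, x2, x3) = x1^2 + x2 * x3, a scalar function of three variables
def f(x):
    return x[0] ** 2 + x[1] * x[2]

def analytic_grad(x):
    # the partial derivatives, worked out by hand
    return np.array([2 * x[0], x[2], x[1]])

def numeric_grad(f, x, h=1e-6):
    # central differences, one coordinate at a time
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, 2.0, 3.0])
print(np.allclose(numeric_grad(f, x), analytic_grad(x)))  # True
```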

Derivative

Each of the derivative types listed above is expanded below:

Vector-by-scalar

$$
\frac{\partial\mathbf{y}_{m\times 1}}{\partial x}=
\begin{bmatrix}
\frac{\partial y_1}{\partial x}\\
\frac{\partial y_2}{\partial x}\\
\vdots\\
\frac{\partial y_m}{\partial x}
\end{bmatrix}_{m\times 1}
$$

In vector calculus, this is called the tangent vector of $\mathbf{y}$.

Scalar-by-vector

$$
\frac{\partial y}{\partial \mathbf{x}_{n\times 1}}=
\begin{bmatrix}
\frac{\partial y}{\partial x_1}\\
\frac{\partial y}{\partial x_2}\\
\vdots\\
\frac{\partial y}{\partial x_n}
\end{bmatrix}_{n\times 1}
$$

Vector-by-vector

$$
\frac{\partial \mathbf{y}_{m\times 1}}{\partial \mathbf{x}_{n\times 1}}=
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \dots & \frac{\partial y_1}{\partial x_n}\\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \dots & \frac{\partial y_2}{\partial x_n}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \dots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}_{m\times n}
$$

In vector calculus, the derivative of a vector function $\mathbf{y}$ with respect to a vector $\mathbf{x}$ is called the Jacobian matrix. (Note that numerator layout is used here.)
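In numerator layout the Jacobian has one row per component of $\mathbf{y}$ and one column per component of $\mathbf{x}$. A small numpy sketch (the linear map and the helper function are my own illustration): for $\mathbf{y} = \mathbf{A}\mathbf{x}$, the Jacobian is just $\mathbf{A}$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))        # y = A x maps R^4 -> R^3 (m=3, n=4)

def y(x):
    return A @ x

def numeric_jacobian(f, x, h=1e-6):
    # J[i, j] = d f_i / d x_j  (numerator layout: m rows, n columns)
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = rng.standard_normal(4)
J = numeric_jacobian(y, x)
print(J.shape)            # (3, 4): m x n in numerator layout
print(np.allclose(J, A))  # True: the Jacobian of A x is A
```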

Matrix-by-scalar

$$
\frac{\partial \mathbf{Y}_{m\times n}}{\partial x}=
\begin{bmatrix}
\frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} & \dots & \frac{\partial y_{1n}}{\partial x}\\
\frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} & \dots & \frac{\partial y_{2n}}{\partial x}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y_{m1}}{\partial x} & \frac{\partial y_{m2}}{\partial x} & \dots & \frac{\partial y_{mn}}{\partial x}
\end{bmatrix}_{m\times n}
$$

Scalar-by-matrix

$$
\frac{\partial y}{\partial \mathbf{X}_{m\times n}}=
\begin{bmatrix}
\frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{12}} & \dots & \frac{\partial y}{\partial x_{1n}}\\
\frac{\partial y}{\partial x_{21}} & \frac{\partial y}{\partial x_{22}} & \dots & \frac{\partial y}{\partial x_{2n}}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y}{\partial x_{m1}} & \frac{\partial y}{\partial x_{m2}} & \dots & \frac{\partial y}{\partial x_{mn}}
\end{bmatrix}_{m\times n}
$$

Layout convention

There are two ways to lay out the result of $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$: if $\mathbf{y}$ is $m$-dimensional and $\mathbf{x}$ is $n$-dimensional, the result can be $m\times n$ or $n\times m$. The common conventions are:

  1. Numerator layout: lay out according to $\mathbf{y}$ and $\mathbf{x}^T$, giving an $m\times n$ result. This is often called the Jacobian formulation.
  2. Denominator layout: lay out according to $\mathbf{y}^T$ and $\mathbf{x}$, giving an $n\times m$ result. This is often called the Hessian formulation and is the transpose of the previous one.
  3. A third option is to write $\frac{\partial \mathbf{y}}{\partial \mathbf{x}'}$, i.e., differentiate with respect to the transpose of $\mathbf{x}$ while reading the result in numerator layout; this notation makes the two layouts consistent.

The two layouts imply the following conventions:

  1. In numerator layout, $\frac{\partial\mathbf{y}}{\partial x}$ is a column vector and $\frac{\partial y}{\partial\mathbf{x}}$ is a row vector.
  2. In denominator layout, $\frac{\partial\mathbf{y}}{\partial x}$ is a row vector and $\frac{\partial y}{\partial\mathbf{x}}$ is a column vector.
  3. In the third notation, these are written $\frac{\partial\mathbf{y}}{\partial x}$ and $\frac{\partial y}{\partial\mathbf{x}'}$.

Matrix-valued derivatives follow corresponding conventions.
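A tiny numpy sketch of the shape difference (the matrix here is my own example): for the linear map $\mathbf{y} = \mathbf{A}\mathbf{x}$, the numerator-layout derivative is $\mathbf{A}$ ($m\times n$) while the denominator-layout derivative is its transpose ($n\times m$).

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)   # y = A x maps R^3 -> R^2 (m=2, n=3)

# Numerator layout: dy/dx = A, shape m x n.
# Denominator layout: the transpose, shape n x m.
numerator_layout = A
denominator_layout = A.T

print(numerator_layout.shape)    # (2, 3)
print(denominator_layout.shape)  # (3, 2)
print(np.array_equal(denominator_layout, numerator_layout.T))  # True
```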

Differential Form

Sometimes the differential form is easier to work with.

Scalar

| Expression | Result (numerator layout) |
|---|---|
| $d(\operatorname{tr}(\mathbf{X}))$ | $\operatorname{tr}(d\mathbf{X})$ |
| $d(\lvert\mathbf{X}\rvert)$ | $\lvert\mathbf{X}\rvert\operatorname{tr}\left(\mathbf{X}^{-1}d\mathbf{X}\right)=\operatorname{tr}(\operatorname{adj}(\mathbf{X})\,d\mathbf{X})$ |
| $d(\ln\lvert\mathbf{X}\rvert)$ | $\operatorname{tr}\left(\mathbf{X}^{-1}d\mathbf{X}\right)$ |

Matrix

| Expression | Result (numerator layout) |
|---|---|
| $d(\mathbf{A})$ | $0$ |
| $d(a\mathbf{X})$ | $a\,d\mathbf{X}$ |
| $d(\mathbf{X}+\mathbf{Y})$ | $d\mathbf{X}+d\mathbf{Y}$ |
| $d(\mathbf{X}\mathbf{Y})$ | $(d\mathbf{X})\mathbf{Y}+\mathbf{X}(d\mathbf{Y})$ |
| $d\left(\mathbf{X}^{\top}\right)$ | $(d\mathbf{X})^{\top}$ |
| $d\left(\mathbf{X}^{-1}\right)$ | $-\mathbf{X}^{-1}\left(d\mathbf{X}\right)\mathbf{X}^{-1}$ |
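Identities like $d(\mathbf{X}^{-1})$ and $d(\ln\lvert\mathbf{X}\rvert)$ can be checked numerically by applying a small perturbation $d\mathbf{X}$ and comparing both sides to first order (a sketch; the well-conditioned random $X$ and the perturbation size are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # keep X well away from singular
dX = 1e-6 * rng.standard_normal((4, 4))          # a small perturbation
Xi = np.linalg.inv(X)

# d(X^{-1}) = -X^{-1} (dX) X^{-1}, to first order in dX
lhs = np.linalg.inv(X + dX) - Xi
rhs = -Xi @ dX @ Xi
print(np.allclose(lhs, rhs, atol=1e-10))

# d(ln|X|) = tr(X^{-1} dX), to first order in dX
_, logdet1 = np.linalg.slogdet(X + dX)
_, logdet0 = np.linalg.slogdet(X)
print(np.isclose(logdet1 - logdet0, np.trace(Xi @ dX), atol=1e-10))
```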

Conversion from differential to derivative form

| Canonical differential form | Equivalent derivative form |
|---|---|
| $dy = a\,dx$ | $\frac{dy}{dx} = a$ |
| $dy = \mathbf{a}\,d\mathbf{x}$ | $\frac{dy}{d\mathbf{x}} = \mathbf{a}$ |
| $dy = \operatorname{tr}(\mathbf{A}\,d\mathbf{X})$ | $\frac{dy}{d\mathbf{X}} = \mathbf{A}$ |
| $d\mathbf{y} = \mathbf{a}\,dx$ | $\frac{d\mathbf{y}}{dx} = \mathbf{a}$ |
| $d\mathbf{y} = \mathbf{A}\,d\mathbf{x}$ | $\frac{d\mathbf{y}}{d\mathbf{x}} = \mathbf{A}$ |
| $d\mathbf{Y} = \mathbf{A}\,dx$ | $\frac{d\mathbf{Y}}{dx} = \mathbf{A}$ |

Some useful formulas

The original article has an excellent table; here are some formulas frequently used in machine learning.

  - $\frac{\partial\mathbf{A}\mathbf{x}}{\partial \mathbf{x}}=\mathbf{A}$ (numerator) / $\mathbf{A}^T$ (denominator)
  - $\frac{\partial\mathbf{x}^T\mathbf{A}}{\partial \mathbf{x}}=\mathbf{A}^T$ (numerator) / $\mathbf{A}$ (denominator)
  - $\frac{\partial(\mathbf{u}\cdot \mathbf{v})}{\partial \mathbf{x}}= \frac{\partial\mathbf{u}^T\mathbf{v}}{\partial \mathbf{x}}= \mathbf{u}^T\frac{\partial\mathbf{v}}{\partial \mathbf{x}}+\mathbf{v}^T\frac{\partial\mathbf{u}}{\partial \mathbf{x}}$ (numerator) / $\frac{\partial\mathbf{v}}{\partial \mathbf{x}}\mathbf{u}+\frac{\partial\mathbf{u}}{\partial \mathbf{x}}\mathbf{v}$ (denominator)
  - The derivative of a matrix trace can be obtained from its cyclic and transposition properties, e.g.:

    $$
    \begin{aligned}
    d\,\operatorname{tr}(AXBX^TC) &= \operatorname{tr}(d(CAXBX^T)) \\
    &= \operatorname{tr}(d(CAX)\,BX^T + CAX\,d(BX^T)) \\
    &= \operatorname{tr}(CA\,(dX)\,BX^T + CAXB\,(dX^T)) \\
    &= \operatorname{tr}(BX^TCA\,dX) + \operatorname{tr}(CAXB\,(dX)^T) \\
    &= \operatorname{tr}(BX^TCA\,dX) + \operatorname{tr}\left(\left((dX)\,B^TX^TA^TC^T\right)^T\right) \\
    &= \operatorname{tr}(BX^TCA\,dX) + \operatorname{tr}((dX)\,B^TX^TA^TC^T) \\
    &= \operatorname{tr}\left(\left(BX^TCA + B^TX^TA^TC^T\right)dX\right) \\
    \frac{\partial \operatorname{tr}(AXBX^TC)}{\partial X} &= BX^TCA + B^TX^TA^TC^T
    \end{aligned}
    $$

    Note that in many cases, converting the expression into a trace makes the result easier to obtain.

  - For determinants, we have:

    $$
    \begin{aligned}
    \frac{\partial |X|}{\partial X} &= |X|\,X^{-1}\\
    \frac{\partial \log|X|}{\partial X} &= X^{-1}
    \end{aligned}
    $$
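Both the trace and determinant formulas can be verified numerically (a sketch; the shapes and the finite-difference helper are my own choices). One subtlety: with the convention $dy = \operatorname{tr}(G\,dX) \Rightarrow \partial y/\partial X = G$, the elementwise gradient matrix $[\partial y/\partial x_{ij}]$ is $G^T$, so we compare against the transpose.

```python
import numpy as np

rng = np.random.default_rng(2)

def numeric_grad(f, X, h=1e-6):
    # elementwise central differences: G[i, j] = df / dX[i, j]
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

# --- trace formula: d tr(A X B X^T C)/dX = B X^T C A + B^T X^T A^T C^T ---
m, n = 3, 4
A = rng.standard_normal((m, m))
X = rng.standard_normal((m, n))
B = rng.standard_normal((n, n))
C = rng.standard_normal((m, m))

f = lambda X: np.trace(A @ X @ B @ X.T @ C)
G_formula = B @ X.T @ C @ A + B.T @ X.T @ A.T @ C.T   # n x m, numerator layout
G_num = numeric_grad(f, X)                            # m x n, elementwise
print(np.allclose(G_num, G_formula.T))

# --- determinant formula: d|X|/dX = |X| X^{-1} (numerator layout) ---
Xs = rng.standard_normal((4, 4)) + 4 * np.eye(4)      # keep Xs well-conditioned
det_formula = (np.linalg.det(Xs) * np.linalg.inv(Xs)).T  # transpose to elementwise
G_det = numeric_grad(np.linalg.det, Xs)
print(np.allclose(G_det, det_formula))
```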