softmax回归
softmax回归常用于多分类问题,其输出可直接看成对类别的预测概率
假设对\(k\)类标签(\([1, 2, ..., k]\))进行分类,那么经过softmax回归计算后,输出一个\(k\)维向量,向量中每个值都代表对一个类别的预测概率
下面先以单个输入数据为例,进行评分函数、损失函数的计算和求导,然后扩展到多个输入数据同步计算
对数函数操作
对数求和
\[ \log_{a}x+\log_{a}y = \log_{a}(xy) \]
对数求差
\[ \log_{a}x-\log_{a}y = \log_{a}\frac{x}{y} \]
指数乘法
\[ e^{x}\cdot e^{y} = e^{x+y} \]
求导公式
若函数\(u(x)\)、\(v(x)\)均可导,那么
\[ \left(\frac{u(x)}{v(x)}\right)^{\prime}=\frac{u^{\prime}(x) v(x)-v^{\prime}(x) u(x)}{v^{2}(x)} \]
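例如,将该公式应用于\(f(x)=\frac{e^{x}}{1+e^{x}}\)(取\(u(x)=e^{x}\),\(v(x)=1+e^{x}\)),可得
\[ f^{\prime}(x)=\frac{e^{x}\left(1+e^{x}\right)-e^{x}\cdot e^{x}}{\left(1+e^{x}\right)^{2}}=\frac{e^{x}}{\left(1+e^{x}\right)^{2}}=f(x)\left(1-f(x)\right) \]
后文对softmax输出求导时使用的正是同一套路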
单个输入数据进行softmax回归计算
评分函数
假设使用softmax回归分类数据\(x\),共\(k\)个标签,首先进行线性回归操作
\[ z_{\theta}(x)=\theta^T\cdot x =\begin{bmatrix} \theta_{1}^T\\ \theta_{2}^T\\ ...\\ \theta_{k}^T \end{bmatrix}\cdot x =\begin{bmatrix} \theta_{1}^T\cdot x\\ \theta_{2}^T\cdot x\\ ...\\ \theta_{k}^T\cdot x \end{bmatrix} \]
其中输入数据\(x\)大小为\((n+1)\times 1\),\(\theta\)大小为\((n+1)\times k\),\(n\)表示权重数量(即特征维度),\(k\)表示类别标签数量
输出结果\(z\)大小为\(k\times 1\),然后对计算结果进行归一化操作,使得输出值能够表示类别概率,如下所示
\[ h_{\theta}\left(x\right)=\left[ \begin{array}{c}{p\left(y=1 | x ; \theta\right)} \\ {p\left(y=2 | x ; \theta\right)} \\ {\vdots} \\ {p\left(y=k | x ; \theta\right)}\end{array}\right] =\frac{1}{\sum_{j=1}^{k} e^{\theta_{j}^{T} x}} \left[ \begin{array}{c}{e^{\theta_{1}^{T} x}} \\ {e^{\theta_{2}^{T} x}} \\ {\vdots} \\ {e^{\theta_{k}^{T} x}}\end{array}\right] \]
其中\(\theta_{1},\theta_{2},\ldots,\theta_{k}\)的大小均为\((n+1)\times 1\),输出结果是一个\(k\times 1\)大小的向量,每个元素表示对应标签的预测概率
所以对于输入数据\(x\)而言,其属于标签\(j\)的概率是
\[ p\left(y=j | x; \theta\right)=\frac{e^{\theta_{j}^{T} x}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x}} \]
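下面用numpy给出单个输入数据的评分与归一化计算示意(其中的维度取值与变量名均为示例假设):

```python
import numpy as np

np.random.seed(0)
n, k = 4, 3                            # 示例假设:4 个特征,3 类标签
x = np.random.randn(n + 1, 1)          # 输入数据(已含偏置项),大小 (n+1)x1
theta = np.random.randn(n + 1, k)      # 权重,大小 (n+1)xk

z = theta.T.dot(x)                     # 线性评分,大小 kx1
probs = np.exp(z) / np.sum(np.exp(z))  # 归一化,得到各类别概率
print(probs.ravel(), probs.sum())      # 概率之和为 1.0
```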
损失函数
利用交叉熵损失(cross entropy loss)作为softmax回归的损失函数,用于计算训练数据对应的真正标签的损失值
\[ J(\theta) = (-1)\cdot \sum_{j=1}^{k} 1\left\{y=j\right\} \ln p\left(y=j | x; \theta\right) = (-1)\cdot \sum_{j=1}^{k} 1\left\{y=j\right\} \ln \frac{e^{\theta_{j}^{T} x}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x}} \]
其中函数\(1\{\cdot\}\)是一个示性函数(indicator function),其取值规则为
\[ 1\{\text{a true statement}\} = 1, \quad 1\{\text{a false statement}\} = 0 \]
也就是示性函数输入为True时,输出为1;否则,输出为0
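示性函数作用在标签上,等价于把标签转为one-hot向量。下面给出单个数据交叉熵损失的计算示意(其中probs、y的取值为示例假设):

```python
import numpy as np

def indicator(y, k):
    """把标签 y(取值 1..k)转为示性向量,第 j 位即 1{y=j}"""
    vec = np.zeros(k)
    vec[y - 1] = 1.0
    return vec

probs = np.array([0.2, 0.7, 0.1])   # 示例假设:h_theta(x) 输出的预测概率
y = 2                               # 真实标签
loss = -np.sum(indicator(y, 3) * np.log(probs))
print(loss)                         # -ln(0.7) ≈ 0.3567
```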
对权重向量\(\theta_{s}\)进行求导:
\[ \frac{\partial J(\theta)}{\partial \theta_{s}} =(-1)\cdot \frac{\partial }{\partial \theta_{s}} \left[ \sum_{j=1,j\neq s}^{k} 1\left\{y=j \right\} \ln p\left(y=j | x; \theta\right) +1\left\{y=s \right\} \ln p\left(y=s | x; \theta\right) \right] \]
\[ =(-1)\cdot \sum_{j=1,j\neq s}^{k} 1\left\{y=j \right\} \frac{1}{p\left(y=j | x; \theta\right)}\frac{\partial p\left(y=j | x; \theta\right)}{\partial \theta_{s}} +(-1)\cdot 1\left\{y=s \right\} \frac{1}{p\left(y=s | x; \theta\right)}\frac{\partial p\left(y=s | x; \theta\right)}{\partial \theta_{s}} \]
分为两种情况
- 当计算结果正好由\(\theta_{s}\)计算得到,此时线性运算为\(z=\theta_{s}^{T} x\),计算结果为\(p\left(y=s | x; \theta\right)=\frac{e^{\theta_{s}^{T} x}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x}}\),求导如下
\[ \frac{\partial p\left(y=s | x; \theta\right)}{\partial \theta_{s}} =\frac{u^{\prime}(x) v(x)-v^{\prime}(x) u(x)}{v^{2}(x)} \]
其中
\[ u(x) = e^{\theta_{s}^{T} x}, v(x)=\sum_{l=1}^{k} e^{\theta_{l}^{T} x} \]
所以
\[ \frac{\partial u(x)}{\partial \theta_s} = e^{\theta_{s}^{T} x}\cdot x=u(x)\cdot x, \quad \frac{\partial v(x)}{\partial \theta_s} = e^{\theta_{s}^{T} x}\cdot x=u(x)\cdot x \\ \frac{\partial p\left(y=s | x; \theta\right)}{\partial \theta_{s}} = p\left(y=s | x; \theta\right)\cdot x-p\left(y=s | x; \theta\right)^2\cdot x \]
- 当计算结果不是由\(\theta_{s}\)计算得到,此时线性运算为\(z=\theta_{j}^{T} x, j\neq s\),计算结果为\(p\left(y=j | x; \theta\right)=\frac{e^{\theta_{j}^{T} x}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x}}\)
\[ \frac{\partial p\left(y=j | x; \theta\right)}{\partial \theta_{s}} =\frac{u^{\prime}(x) v(x)-v^{\prime}(x) u(x)}{v^{2}(x)} \]
其中
\[ u(x) = e^{\theta_{j}^{T} x}, v(x)=\sum_{l=1}^{k} e^{\theta_{l}^{T} x} \]
所以
此时\(u(x)\)不含\(\theta_{s}\),有
\[ \frac{\partial u(x)}{\partial \theta_s} = 0, \quad \frac{\partial v(x)}{\partial \theta_s} = e^{\theta_{s}^{T} x}\cdot x \\ \frac{\partial p\left(y=j | x; \theta\right)}{\partial \theta_{s}} = -p\left(y=s | x; \theta\right)p\left(y=j | x; \theta\right)\cdot x \]
综合上述两种情况可知,求导结果为
\[ \frac{\partial J(\theta)}{\partial \theta_{s}} =(-1)\cdot \sum_{j=1,j\neq s}^{k} 1\left\{y=j \right\} \frac{1}{p\left(y=j | x; \theta\right)}\frac{\partial p\left(y=j | x; \theta\right)}{\partial \theta_{s}} +(-1)\cdot 1\left\{y=s \right\} \frac{1}{p\left(y=s | x; \theta\right)}\frac{\partial p\left(y=s | x; \theta\right)}{\partial \theta_{s}} \\ =(-1)\cdot \sum_{j=1,j\neq s}^{k} 1\left\{y=j \right\} \frac{1}{p\left(y=j | x; \theta\right)}\cdot (-1)\cdot p\left(y=s | x; \theta\right)p\left(y=j | x; \theta\right)\cdot x + (-1)\cdot 1\left\{y=s \right\} \frac{1}{p\left(y=s | x; \theta\right)}\left[p\left(y=s | x; \theta\right)\cdot x-p\left(y=s | x; \theta\right)^2\cdot x\right] \\ =(-1)\cdot \sum_{j=1,j\neq s}^{k} 1\left\{y=j \right\}\cdot (-1)\cdot p\left(y=s | x; \theta\right)\cdot x + (-1)\cdot 1\left\{y=s \right\} \left[x-p\left(y=s | x; \theta\right)\cdot x\right] \\ =(-1)\cdot 1\left\{y=s \right\} x - (-1)\cdot \sum_{j=1}^{k} 1\left\{y=j \right\} p\left(y=s | x; \theta\right)\cdot x \]
因为\(\sum_{j=1}^{k} 1\left\{y=j \right\}=1\),所以最终结果为
\[ \frac{\partial J(\theta)}{\partial \theta_{s}} =(-1)\cdot \left[ 1\left\{y=s \right\} - p\left(y=s | x; \theta\right) \right]\cdot x \]
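按照上式,可用numpy直接写出单个数据的梯度计算示意(维度取值与变量名为示例假设,梯度矩阵的第\(s\)列即对\(\theta_{s}\)的导数):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # 减去最大值保证数值稳定,见后文
    return e / e.sum()

np.random.seed(0)
n, k = 4, 3
x = np.random.randn(n + 1)           # 单个输入(已含偏置项)
theta = np.random.randn(n + 1, k)
y = 2                                # 真实标签(取值 1..k)

probs = softmax(theta.T.dot(x))      # k 维预测概率
one_hot = np.eye(k)[y - 1]           # 示性向量 1{y=s}
grad = -np.outer(x, one_hot - probs) # (n+1)xk,即 -(1{y=s}-p_s)·x 按列排布
```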
批量数据进行softmax回归计算
上面实现了单个数据进行类别概率和损失函数的计算以及求导,进一步推导到批量数据进行操作
评分函数
假设使用softmax回归分类数据\(x\),共\(k\)个标签,首先进行线性回归操作
\[ z_{\theta}(x_{i})=\theta^T\cdot x_{i} =\begin{bmatrix} \theta_{1}^T\\ \theta_{2}^T\\ ...\\ \theta_{k}^T \end{bmatrix}\cdot x_{i} =\begin{bmatrix} \theta_{1}^T\cdot x_{i}\\ \theta_{2}^T\cdot x_{i}\\ ...\\ \theta_{k}^T\cdot x_{i} \end{bmatrix} \]
其中输入数据\(x\)大小为\((n+1)\times m\),\(\theta\)大小为\((n+1)\times k\),\(n\)表示权重数量,\(m\)表示训练数据个数,\(k\)表示类别标签数量
输出结果\(z\)大小为\(k\times m\),然后对计算结果进行归一化操作,使得输出值能够表示类别概率,如下所示
\[ h_{\theta}\left(x_{i}\right)=\left[ \begin{array}{c}{p\left(y_{i}=1 | x_{i} ; \theta\right)} \\ {p\left(y_{i}=2 | x_{i} ; \theta\right)} \\ {\vdots} \\ {p\left(y_{i}=k | x_{i} ; \theta\right)}\end{array}\right] =\frac{1}{\sum_{j=1}^{k} e^{\theta_{j}^{T} x_{i}}} \left[ \begin{array}{c}{e^{\theta_{1}^{T} x_{i}}} \\ {e^{\theta_{2}^{T} x_{i}}} \\ {\vdots} \\ {e^{\theta_{k}^{T} x_{i}}}\end{array}\right] \]
其中\(\theta_{1},\theta_{2},\ldots,\theta_{k}\)的大小均为\((n+1)\times 1\);对全部\(m\)个输入而言,输出结果是一个\(k\times m\)大小的矩阵,每列表示对应样本在\(k\)类标签上的预测概率
所以对于输入数据\(x_{i}\)而言,其属于标签\(j\)的概率是
\[ p\left(y_{i}=j | x_{i}; \theta\right)=\frac{e^{\theta_{j}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}} \]
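批量计算时,可把\(m\)个输入拼成矩阵一次完成评分与归一化,示意如下(维度取值为示例假设):

```python
import numpy as np

np.random.seed(0)
n, m, k = 4, 10, 3
X = np.random.randn(n + 1, m)                  # 批量输入,大小 (n+1)xm
theta = np.random.randn(n + 1, k)

Z = theta.T.dot(X)                             # 评分矩阵,大小 kxm
E = np.exp(Z - Z.max(axis=0, keepdims=True))   # 数值稳定
P = E / E.sum(axis=0, keepdims=True)           # 每列为对应样本的类别概率,列和为 1
```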
代价函数
利用交叉熵损失(cross entropy loss)作为softmax回归的代价函数,用于计算训练数据对应的真正标签的损失值
\[ J(\theta) = (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y_{i}=j\right\} \ln p\left(y_{i}=j | x_{i}; \theta\right) = (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y_{i}=j\right\} \ln \frac{e^{\theta_{j}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}} \]
其中函数\(1\{\cdot\}\)是一个示性函数(indicator function),其取值规则为
\[ 1\{\text{a true statement}\} = 1, \quad 1\{\text{a false statement}\} = 0 \]
也就是示性函数输入为True时,输出为1;否则,输出为0
对权重向量\(\theta_{s}\)进行求导:
\[ \frac{\partial J(\theta)}{\partial \theta_{s}} =(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \frac{\partial }{\partial \theta_{s}} \left[ \sum_{j=1,j\neq s}^{k} 1\left\{y_{i}=j \right\} \ln p\left(y_{i}=j | x_{i}; \theta\right)+1\left\{y_{i}=s \right\} \ln p\left(y_{i}=s | x_{i}; \theta\right) \right] \]
\[ =(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1,j\neq s}^{k} 1\left\{y_{i}=j \right\} \frac{1}{p\left(y_{i}=j | x_{i}; \theta\right)}\frac{\partial p\left(y_{i}=j | x_{i}; \theta\right)}{\partial \theta_{s}} +(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\left\{y_{i}=s \right\} \frac{1}{p\left(y_{i}=s | x_{i}; \theta\right)}\frac{\partial p\left(y_{i}=s | x_{i}; \theta\right)}{\partial \theta_{s}} \]
分为两种情况
- 当计算结果正好由\(\theta_{s}\)计算得到,此时线性运算为\(z=\theta_{s}^{T} x_{i}\),计算结果为\(p\left(y_{i}=s | x_{i}; \theta\right)=\frac{e^{\theta_{s}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}}\),求导如下
\[ \frac{\partial p\left(y_{i}=s | x_{i}; \theta\right)}{\partial \theta_{s}} =\frac{u^{\prime}(x_{i}) v(x_{i})-v^{\prime}(x_{i}) u(x_{i})}{v^{2}(x_{i})} \]
其中
\[ u(x_{i}) = e^{\theta_{s}^{T} x_{i}}, \quad v(x_{i})=\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}} \]
所以
\[ \frac{\partial u(x_{i})}{\partial \theta_s} = e^{\theta_{s}^{T} x_{i}}\cdot x_{i}=u(x_{i})\cdot x_{i}, \quad \frac{\partial v(x_{i})}{\partial \theta_s} = e^{\theta_{s}^{T} x_{i}}\cdot x_{i}=u(x_{i})\cdot x_{i} \\ \frac{\partial p\left(y_{i}=s | x_{i}; \theta\right)}{\partial \theta_{s}} = p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i}-p\left(y_{i}=s | x_{i}; \theta\right)^2\cdot x_{i} \]
- 当计算结果不是由\(\theta_{s}\)计算得到,此时线性运算为\(z=\theta_{j}^{T} x_{i}, j\neq s\),计算结果为\(p\left(y_{i}=j | x_{i}; \theta\right)=\frac{e^{\theta_{j}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}}\)
\[ \frac{\partial p\left(y_{i}=j | x_{i}; \theta\right)}{\partial \theta_{s}} =\frac{u^{\prime}(x_{i}) v(x_{i})-v^{\prime}(x_{i}) u(x_{i})}{v^{2}(x_{i})} \]
其中
\[ u(x_{i}) = e^{\theta_{j}^{T} x_{i}}, \quad v(x_{i})=\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}} \]
所以
此时\(u(x_{i})\)不含\(\theta_{s}\),有
\[ \frac{\partial u(x_{i})}{\partial \theta_s} = 0, \quad \frac{\partial v(x_{i})}{\partial \theta_s} = e^{\theta_{s}^{T} x_{i}}\cdot x_{i} \\ \frac{\partial p\left(y_{i}=j | x_{i}; \theta\right)}{\partial \theta_{s}} = -p\left(y_{i}=s | x_{i}; \theta\right)p\left(y_{i}=j | x_{i}; \theta\right)\cdot x_{i} \]
综合上述两种情况可知,求导结果为
\[ \frac{\partial J(\theta)}{\partial \theta_{s}} =(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1,j\neq s}^{k} 1\left\{y_{i}=j \right\} \frac{1}{p\left(y_{i}=j | x_{i}; \theta\right)}\frac{\partial p\left(y_{i}=j | x_{i}; \theta\right)}{\partial \theta_{s}} +(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\left\{y_{i}=s \right\} \frac{1}{p\left(y_{i}=s | x_{i}; \theta\right)}\frac{\partial p\left(y_{i}=s | x_{i}; \theta\right)}{\partial \theta_{s}} \\ =(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1,j\neq s}^{k} 1\left\{y_{i}=j \right\} \frac{1}{p\left(y_{i}=j | x_{i}; \theta\right)}\cdot (-1)\cdot p\left(y_{i}=s | x_{i}; \theta\right)p\left(y_{i}=j | x_{i}; \theta\right)\cdot x_{i} + (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\left\{y_{i}=s \right\} \frac{1}{p\left(y_{i}=s | x_{i}; \theta\right)}\left[p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i}-p\left(y_{i}=s | x_{i}; \theta\right)^2\cdot x_{i}\right] \\ =(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1,j\neq s}^{k} 1\left\{y_{i}=j \right\}\cdot (-1)\cdot p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i} + (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\left\{y_{i}=s \right\} \left[x_{i}-p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i}\right] \\ =(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\left\{y_{i}=s \right\} x_{i} - (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y_{i}=j \right\} p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i} \]
因为\(\sum_{j=1}^{k} 1\left\{y_{i}=j \right\}=1\),所以最终结果为
\[ \frac{\partial J(\theta)}{\partial \theta_{s}} =(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \left[ 1\left\{y_{i}=s \right\} - p\left(y_{i}=s | x_{i}; \theta\right) \right]\cdot x_{i} \]
梯度下降
权重\(W\)大小为\((n+1)\times k\),输入数据集大小为\(m\times (n+1)\),输出数据集大小为\(m\times k\)
矩阵求导如下:
\[ \frac{\partial J(\theta)}{\partial \theta} =\frac{1}{m}\cdot \sum_{i=1}^{m} \begin{bmatrix} (-1)\cdot\left[ 1\left\{y_{i}=1 \right\} - p\left(y_{i}=1 | x_{i}; \theta\right) \right]\cdot x_{i}\\ (-1)\cdot\left[ 1\left\{y_{i}=2 \right\} - p\left(y_{i}=2 | x_{i}; \theta\right) \right]\cdot x_{i}\\ ...\\ (-1)\cdot\left[ 1\left\{y_{i}=k \right\} - p\left(y_{i}=k | x_{i}; \theta\right) \right]\cdot x_{i} \end{bmatrix} =(-1)\cdot \frac{1}{m}\cdot X_{m\times (n+1)}^T \cdot (I_{m\times k} - Y_{m\times k}) \]
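按该矩阵形式,可写出一步梯度下降更新的示意(学习率lr等取值为示例假设,\(X\)为\(m\times (n+1)\)输入,\(I\)为\(m\times k\)示性矩阵):

```python
import numpy as np

def gradient_descent_step(theta, X, I, lr=0.1):
    """theta: (n+1)xk 权重; X: mx(n+1) 输入; I: mxk 示性矩阵(one-hot)"""
    m = X.shape[0]
    Z = X.dot(theta)                              # mxk 评分
    E = np.exp(Z - Z.max(axis=1, keepdims=True))  # 数值稳定
    Y = E / E.sum(axis=1, keepdims=True)          # mxk 预测概率
    grad = -X.T.dot(I - Y) / m                    # (n+1)xk 梯度
    return theta - lr * grad
```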
参考:
- Softmax regression for Iris classification
- Derivative of Softmax loss function
若每次仅输入单个数据计算评分、损失和梯度,对应使用随机梯度下降法进行权重更新;按上述矩阵形式批量输入数据时,则对应批量梯度下降法
参数冗余和权重衰减
softmax回归存在参数冗余现象,即对每个参数向量\(\theta_{j}\)都减去同一个向量\(\psi\),不改变预测结果。证明如下:
\[ \begin{aligned} p\left(y^{(i)}=j | x^{(i)} ; \theta\right) &=\frac{e^{\left(\theta_{j}-\psi\right)^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\left(\theta_{l}-\psi\right)^{T} x^{(i)}}} \\ &=\frac{e^{\theta_{j}^{T} x^{(i)}} e^{-\psi^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x^{(i)}} e^{-\psi^{T} x^{(i)}}} \\ &=\frac{e^{\theta_{j}^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x^{(i)}}} \end{aligned} \]
假设\((\theta_{1},\theta_{2},...,\theta_{k})\)能得到\(J(\theta)\)的极小值点,那么\((\theta_{1}-\psi,\theta_{2}-\psi,...,\theta_{k}-\psi)\)同样能得到相同的极小值
与此同时,损失函数仍是凸函数,局部最小值就是全局最小值,但由于参数冗余,极小值点不唯一(海森矩阵奇异),优化可能收敛到参数取值过大的解,影响模型泛化能力
在代价函数中加入权重衰减,能够避免过度参数化,得到泛化性能更强的模型
在代价函数中加入L2正则化项,如下所示:
\[ J(\theta) = (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y_{i}=j\right\} \ln p\left(y_{i}=j | x_{i}; \theta\right) + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{i j}^{2} = (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y_{i}=j\right\} \ln \frac{e^{\theta_{j}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}} + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{i j}^{2} \]
求导结果如下:
\[ \frac{\partial J(\theta)}{\partial \theta_{s}} =(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \left[ 1\left\{y_{i}=s \right\} - p\left(y_{i}=s | x_{i}; \theta\right) \right]\cdot x_{i}+ \lambda \theta_{s} \]
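对应的梯度实现只需在原有矩阵求导结果上再加上\(\lambda W\)一项,示意如下(函数与变量命名为示例假设,lam即\(\lambda\)):

```python
def compute_gradient(probs, indicator, X, W, lam):
    """dJ/dW = -(1/m) X^T (I - Y) + lam * W
    probs/indicator: mxk; X: mx(n+1); W: (n+1)xk"""
    m = X.shape[0]
    return -X.T.dot(indicator - probs) / m + lam * W
```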
代价函数的一个numpy实现草图如下(scores为\(m\times k\)评分矩阵,indicator为\(m\times k\)示性矩阵,W为权重,lam的取值为示例假设):

```python
import numpy as np

def compute_loss(scores, indicator, W, lam=2e-4):
    """交叉熵损失 + L2 权重衰减"""
    m = scores.shape[0]
    E = np.exp(scores - np.max(scores, axis=1, keepdims=True))  # 数值稳定
    probs = E / np.sum(E, axis=1, keepdims=True)
    data_loss = -np.sum(indicator * np.log(probs)) / m          # 交叉熵
    reg_loss = 0.5 * lam * np.sum(W ** 2)                       # 权重衰减项
    return data_loss + reg_loss
```
鸢尾数据集
使用鸢尾(iris)数据集,参考Iris Species
共4个变量:
- SepalLengthCm - 花萼长度
- SepalWidthCm - 花萼宽度
- PetalLengthCm - 花瓣长度
- PetalWidthCm - 花瓣宽度
以及3个类别:
- Iris-setosa
- Iris-versicolor
- Iris-virginica
数据加载函数的一个实现草图如下(假设数据来自Kaggle下载的Iris.csv文件,文件名与标签映射方式均为示例假设):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

label_map = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

def load_data(shuffle=True, tsize=0.8):
    """读取鸢尾数据,返回添加偏置列后的训练/测试集"""
    data = pd.read_csv('Iris.csv')
    X = data[['SepalLengthCm', 'SepalWidthCm',
              'PetalLengthCm', 'PetalWidthCm']].values
    y = data['Species'].map(label_map).values
    x_train, x_test, y_train, y_test = train_test_split(
        X, y, train_size=tsize, shuffle=shuffle, random_state=42)
    # 添加偏置列,使输入大小为 mx(n+1)
    x_train = np.hstack([x_train, np.ones((x_train.shape[0], 1))])
    x_test = np.hstack([x_test, np.ones((x_test.shape[0], 1))])
    return x_train, x_test, y_train, y_test
```
numpy实现
完整的训练与评估流程草图如下(依据前述公式组织,load_data同上文,超参数取值均为示例假设):

```python
# -*- coding: utf-8 -*-

import numpy as np

def softmax(scores):
    """按行计算 softmax,减去最大值保证数值稳定"""
    E = np.exp(scores - np.max(scores, axis=1, keepdims=True))
    return E / np.sum(E, axis=1, keepdims=True)

def compute_gradient(probs, indicator, X, W, lam):
    """dJ/dW = -(1/m) X^T (I - Y) + lam * W"""
    m = X.shape[0]
    return -X.T.dot(indicator - probs) / m + lam * W

def train(x_train, y_train, k, epochs=100000, lr=0.01, lam=2e-4):
    m, n1 = x_train.shape                  # n1 = n + 1(已含偏置列)
    W = 0.01 * np.random.randn(n1, k)      # 权重初始化
    indicator = np.eye(k)[y_train]         # mxk 示性矩阵(标签取值 0..k-1)
    for _ in range(epochs):
        probs = softmax(x_train.dot(W))    # mxk 预测概率
        W -= lr * compute_gradient(probs, indicator, x_train, W, lam)
    return W

def accuracy(W, X, y):
    """分类精度:取预测概率最大的类别与真实标签比较"""
    return np.mean(np.argmax(X.dot(W), axis=1) == y)

if __name__ == '__main__':
    x_train, x_test, y_train, y_test = load_data()
    W = train(x_train, y_train, k=3)
    print('train acc: %.4f' % accuracy(W, x_train, y_train))
    print('test  acc: %.4f' % accuracy(W, x_test, y_test))
```
训练10万次的最好训练结果以及对应的测试结果:
```python
# 测试集精度
```
指数计算 - 数值稳定性考虑
在softmax回归中,需要利用指数函数\(e^x\)对线性操作的结果进行归一化,这有可能造成数值溢出,常用的做法是对分子分母同乘以一个常数\(C\)
\[ \frac{e^{f_{y_{i}}}}{\sum_{j} e^{f_{j}}}=\frac{C e^{f_{y_{i}}}}{C \sum_{j} e^{f_{j}}}=\frac{e^{f_{y_{i}}+\log C}}{\sum_{j} e^{f_{j}+\log C}} \]
这个操作不改变结果。若取\(\log C=-\max _{j} f_{j}\),就能够将向量\(f\)的取值整体平移,使最大值为\(0\),避免数值不稳定
对应的数值稳定版softmax实现草图如下(假设输入为\(m\times k\)的评分矩阵):

```python
import numpy as np

def softmax(x):
    # 每行减去最大值,等价于取 log C = -max_j f_j
    x = x - np.max(x, axis=1, keepdims=True)
    E = np.exp(x)
    return E / np.sum(E, axis=1, keepdims=True)
```
softmax回归和logistic回归
softmax回归是logistic回归在多分类任务上的扩展。当\(k=2\)时,softmax回归模型可转换成logistic回归模型
\[ h_{\theta}(x)=\frac{1}{e^{\theta_{1}^{T} x}+e^{\theta_{2}^{T} x}} \left[ \begin{array}{c}{e^{\theta_{1}^{T} x}} \\ {e^{\theta_{2}^{T} x}}\end{array}\right] =\frac{1}{e^{\vec{0}^{T} x}+e^{(\theta_{2}-\theta_{1})^{T} x}} \left[ \begin{array}{c}{e^{\vec{0}^{T} x}} \\ {e^{(\theta_{2}-\theta_{1})^{T} x}}\end{array}\right] \\ =\frac{1}{1+e^{(\theta_{2}-\theta_{1})^{T} x}} \left[ \begin{array}{c}{1} \\ {e^{(\theta_{2}-\theta_{1})^{T} x}}\end{array}\right] = \left[ \begin{array}{c}{\frac{1}{1+e^{(\theta_{2}-\theta_{1})^{T} x}}} \\ {\frac{e^{(\theta_{2}-\theta_{1})^{T} x}}{1+e^{(\theta_{2}-\theta_{1})^{T} x}}}\end{array}\right] =\left[ \begin{array}{c}{\frac{1}{1+e^{(\theta_{2}-\theta_{1})^{T} x}}} \\ {1- \frac{1}{1+e^{(\theta_{2}-\theta_{1})^{T} x}}}\end{array}\right] \]
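可以用数值验证\(k=2\)时softmax与logistic形式的等价性(随机数据为示例假设):

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(5)
theta1, theta2 = np.random.randn(5), np.random.randn(5)

z = np.array([theta1.dot(x), theta2.dot(x)])
p_softmax = np.exp(z) / np.exp(z).sum()              # 二类 softmax

p1 = 1.0 / (1.0 + np.exp((theta2 - theta1).dot(x)))  # logistic 形式的 p(y=1)
print(np.allclose(p_softmax[0], p1))                 # True
print(np.allclose(p_softmax[1], 1 - p1))             # True
```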
针对多分类任务,可以选择softmax回归模型进行多分类,也可以选择logistic回归模型进行若干个二分类。区别在于选择的类别是否互斥:如果类别互斥,使用softmax回归分类更为合适;如果类别不互斥,使用logistic回归分类更为合适