# Deriving the loss function for softmax regression

Imagine we have a dataset $D =\left\{(\textbf{x}^{(i)}, {y}^{(i)})\right\}$ consisting of $N$ training examples, where each feature vector $\textbf{x}^{(i)} \in \mathbb{R}^{m}$ and each label ${y}^{(i)} \in \left\{1,\ldots,K\right\}$. Our goal is to build a model that takes as input a new example $\textbf{x}^{*}$ and predicts its class label $y^{*}$. One straightforward approach to deriving supervised learning algorithms is to use maximum likelihood estimation to model the conditional probability of the label given the features:
$$p(y = k \mid \textbf{x}) = \phi_{k} \tag{1}$$
where
$$\phi_{k} \geq 0 \tag{2}$$
and
$$\sum_{k=1}^{K} \phi_{k} = 1 \tag{3}$$
In our case, we will be modeling our multinomial distribution with the softmax function:
$$\phi_{k} = p(y = k \mid \textbf{x}; \textbf{W}, \textbf{b}) = \frac{e^{\textbf{w}_{k}^{T}\textbf{x} + b_{k}}}{\sum_{j=1}^{K} e^{\textbf{w}_{j}^{T}\textbf{x} + b_{j}}} \tag{4}$$

where $\textbf{w}_{k}$ is the $k$-th row of the weight matrix $\textbf{W} \in \mathbb{R}^{K \times m}$ and $b_{k}$ is the $k$-th entry of the bias vector $\textbf{b} \in \mathbb{R}^{K}$.
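As a quick sanity check, the softmax mapping above can be sketched in NumPy (the parameter values below are made up for illustration):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting max(z) leaves the
    output unchanged but avoids overflow in exp."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical parameters for K = 3 classes and m = 2 features.
W = np.array([[1.0, -0.5],
              [0.2,  0.8],
              [-1.0, 0.3]])     # shape (K, m)
b = np.array([0.1, 0.0, -0.2])  # shape (K,)

x = np.array([0.5, 1.5])
phi = softmax(W @ x + b)        # class probabilities phi_1, ..., phi_K
```

The output is a proper probability distribution over the $K$ classes: its entries are positive and sum to one, as required by equations $(2)$ and $(3)$.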
To make the following calculations easier, we change the form of each label from the scalar $y^{(i)} \in \left\{1,\ldots,K\right\}$ to a vector $\textbf{y}^{(i)} \in \mathbb{R}^{K}$ using one-hot encoding. For example, if $y^{(i)}=3$ and $K=5$, then:
$$\textbf{y}^{(i)} = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 \end{bmatrix}^{T} \tag{5}$$
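This encoding is easy to sketch in NumPy; `one_hot` is a hypothetical helper name, and labels are assumed 1-indexed as in the text:

```python
import numpy as np

def one_hot(y, K):
    """Map a 1-indexed label y in {1, ..., K} to a length-K one-hot vector
    by selecting the (y-1)-th row of the K x K identity matrix."""
    return np.eye(K)[y - 1]

# The example from the text: y = 3, K = 5.
y_vec = one_hot(3, 5)  # → [0. 0. 1. 0. 0.]
```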
By encoding $y^{(i)}$ using one-hot encoding, we get a slightly modified dataset $D =\left\{(\textbf{x}^{(i)}, \textbf{y}^{(i)})\right\}$, where $\textbf{x}^{(i)} \in \mathbb{R}^{m}$, $\textbf{y}^{(i)} \in \mathbb{R}^{K}$, and $i = 1,\ldots,N$. If the data are i.i.d., we can use maximum-likelihood estimation to find the parameters $\textbf{W}$ and $\textbf{b}$ that maximize the likelihood of the data:
$$L(\textbf{W}, \textbf{b}) = \prod_{i=1}^{N} p(\textbf{y}^{(i)} \mid \textbf{x}^{(i)}; \textbf{W}, \textbf{b}) \tag{6}$$
Instead of maximizing $L(\textbf{W}, \textbf{b})$, we can instead minimize its negative log-likelihood. This is equivalent because $(i)$ the natural log function is monotonic, and $(ii)$ maximizing a function is the same as minimizing the negative of that function.
$$
\begin{aligned}
\ell(\textbf{W}, \textbf{b}) &= -\ln L(\textbf{W}, \textbf{b}) \\
&= -\sum_{i=1}^{N} \ln p(\textbf{y}^{(i)} \mid \textbf{x}^{(i)}; \textbf{W}, \textbf{b}) \\
&= -\sum_{i=1}^{N} \sum_{k=1}^{K} \textbf{u}_{k}^{T}\textbf{y}^{(i)} \ln \frac{e^{\textbf{w}_{k}^{T}\textbf{x}^{(i)} + b_{k}}}{\sum_{j=1}^{K} e^{\textbf{w}_{j}^{T}\textbf{x}^{(i)} + b_{j}}} \\
&= -\sum_{i=1}^{N} \left[ \sum_{k=1}^{K} \textbf{u}_{k}^{T}\textbf{y}^{(i)} \left( \textbf{w}_{k}^{T}\textbf{x}^{(i)} + b_{k} \right) - \ln \sum_{j=1}^{K} e^{\textbf{w}_{j}^{T}\textbf{x}^{(i)} + b_{j}} \right]
\end{aligned}
\tag{7}
$$

where $\textbf{u}_{k} \in \mathbb{R}^{K}$ is the $k$-th standard basis vector, so that $\textbf{u}_{k}^{T}\textbf{y}^{(i)}$ picks out the $k$-th entry of $\textbf{y}^{(i)}$.
Note that in equation $(7)$, we were able to simplify the right-hand side of the expression because $\sum_{k=1}^K {\textbf{u}_{k}^T\textbf{y}^{(i)}} = 1$: the entries of a one-hot vector sum to one.
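As a sketch, the loss in equation $(7)$ translates almost line for line into NumPy. The function name `nll` and the batch layout (rows of `X` are examples) are my own choices, and the log-sum-exp term is stabilized by subtracting each row's maximum:

```python
import numpy as np

def nll(W, b, X, Y):
    """Negative log-likelihood of equation (7).

    X is the (N, m) matrix of feature vectors, Y the (N, K) matrix of
    one-hot labels; W is (K, m) and b is (K,)."""
    Z = X @ W.T + b                      # (N, K) logits: w_k^T x^(i) + b_k
    Zmax = Z.max(axis=1, keepdims=True)  # row max, subtracted for stability
    lse = Zmax[:, 0] + np.log(np.exp(Z - Zmax).sum(axis=1))  # log-sum-exp term
    # (Y * Z).sum(axis=1) is sum_k u_k^T y^(i) (w_k^T x^(i) + b_k).
    return -np.sum((Y * Z).sum(axis=1) - lse)

# Tiny synthetic check: N = 2 examples, m = 2 features, K = 3 classes.
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
b = np.zeros(3)
X = np.array([[1.0, 2.0], [0.0, 1.0]])
Y = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
loss = nll(W, b, X, Y)
```

Because each $\textbf{y}^{(i)}$ is one-hot, this value agrees with the familiar cross-entropy form $-\sum_{i} \ln \phi_{y^{(i)}}^{(i)}$, i.e. the negative log of the softmax probability assigned to each true class.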

Now that we have an expression for $\ell(\textbf{W}, \textbf{b})$, we can choose from a number of algorithms to minimize the loss. In a future note, we are going to derive parameter update equations for minimizing $\ell(\textbf{W}, \textbf{b})$ with stochastic gradient descent.