Deriving the loss function for softmax regression

Imagine we have a dataset $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ consisting of $N$ training examples, where each feature vector $\mathbf{x}^{(i)} \in \mathbb{R}^{d}$ and each label $y^{(i)} \in \{1, \dots, K\}$. Our goal is to build a model that takes as input a new example $\mathbf{x}$ and predicts its class label $\hat{y}$. One straightforward approach to deriving supervised learning algorithms is to use maximum likelihood estimation to model the probability of the class label conditioned on the input:

$$p(y = k \mid \mathbf{x}; \mathbf{W}, \mathbf{b}),$$

where

$$\mathbf{W} \in \mathbb{R}^{K \times d}$$

and

$$\mathbf{b} \in \mathbb{R}^{K}.$$
In our case, we will be modeling this multinomial distribution with the softmax function:

$$p(y = k \mid \mathbf{x}; \mathbf{W}, \mathbf{b}) = \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x} + b_k)}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^{\top} \mathbf{x} + b_j)}, \qquad k = 1, \dots, K,$$

where $\mathbf{w}_k$ denotes the $k$-th row of $\mathbf{W}$ and $b_k$ the $k$-th element of $\mathbf{b}$.
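To make the mapping concrete, here is a minimal NumPy sketch of this model. The names `softmax` and `predict_proba` are just illustrative choices, and the max-subtraction inside `softmax` is a standard numerical-stability trick rather than part of the derivation above.

```python
import numpy as np

def softmax(z):
    """Map a vector of scores z in R^K to a probability distribution over K classes."""
    z = z - np.max(z)              # subtract the max for numerical stability; does not change the result
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def predict_proba(x, W, b):
    """Compute p(y = k | x; W, b) for every class k, with W of shape (K, d) and b of shape (K,)."""
    return softmax(W @ x + b)

# Example: d = 2 features, K = 3 classes (arbitrary made-up numbers).
W = np.array([[ 0.5, -1.0],
              [ 1.5,  0.2],
              [-0.3,  0.8]])
b = np.array([0.1, -0.2, 0.0])
x = np.array([2.0, 1.0])

p = predict_proba(x, W, b)
print(p, p.sum())  # probabilities over the 3 classes; they sum to 1
```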
To make the following calculations easier, we change the form of each label from $y^{(i)} \in \{1, \dots, K\}$ to a vector $\mathbf{y}^{(i)} \in \{0, 1\}^{K}$ using one-hot encoding. For example, if $K = 3$ and $y^{(i)} = 2$, then:

$$\mathbf{y}^{(i)} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}.$$
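A quick sketch of the same encoding in NumPy (the helper name `one_hot` and the 1-indexed labels are assumptions made to match the example above):

```python
import numpy as np

def one_hot(y, K):
    """Convert a 1-indexed class label y in {1, ..., K} into a one-hot vector of length K."""
    e = np.zeros(K)
    e[y - 1] = 1.0
    return e

print(one_hot(2, 3))  # [0. 1. 0.], matching the example above
```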
By encoding the labels this way, we get a slightly modified dataset $\mathcal{D}' = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$, where $\mathbf{x}^{(i)} \in \mathbb{R}^{d}$ and $\mathbf{y}^{(i)} \in \{0, 1\}^{K}$. If the data are independent and identically distributed, we can use maximum-likelihood estimation to find the parameters $\mathbf{W}$ and $\mathbf{b}$ that maximize the likelihood of the data:

$$\mathbf{W}^{*}, \mathbf{b}^{*} = \arg\max_{\mathbf{W}, \mathbf{b}} L(\mathbf{W}, \mathbf{b}) = \arg\max_{\mathbf{W}, \mathbf{b}} \prod_{i=1}^{N} p(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{W}, \mathbf{b}).$$
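Before taking logs, it can help to see this objective itself in code. The sketch below (the function names and argument shapes are my own, not from the original note) evaluates the likelihood product directly; for datasets of any realistic size this product underflows floating point, which is a practical motivation, beyond algebraic convenience, for working with the log-likelihood next.

```python
import numpy as np

def softmax(z):
    """Scores z in R^K -> probability distribution over K classes."""
    z = z - np.max(z)            # numerical-stability shift; does not change the output
    e = np.exp(z)
    return e / e.sum()

def likelihood(X, Y, W, b):
    """Product over all N examples of p(y^(i) | x^(i); W, b).

    X: (N, d) features, Y: (N, K) one-hot labels, W: (K, d), b: (K,).
    """
    L = 1.0
    for x, y in zip(X, Y):
        p = softmax(W @ x + b)    # predicted class probabilities for this example
        L *= float(np.dot(y, p))  # the one-hot label selects the true class's probability
    return L
```

Each factor `np.dot(y, p)` is the model's predicted probability of that example's true class, so maximizing $L(\mathbf{W}, \mathbf{b})$ pushes those probabilities toward 1.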
Instead of maximizing $L(\mathbf{W}, \mathbf{b})$, we can instead minimize its negative log-likelihood. This is equivalent because the natural log function is monotonically increasing, and maximizing a function is the same as minimizing the negative of that function:

$$
\begin{aligned}
\ell(\mathbf{W}, \mathbf{b}) &= -\log \prod_{i=1}^{N} p(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{W}, \mathbf{b}) \\
&= -\sum_{i=1}^{N} \log \prod_{k=1}^{K} \left( \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)} + b_k)}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^{\top} \mathbf{x}^{(i)} + b_j)} \right)^{y_k^{(i)}} \\
&= -\sum_{i=1}^{N} \sum_{k=1}^{K} y_k^{(i)} \log \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)} + b_k)}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^{\top} \mathbf{x}^{(i)} + b_j)}.
\end{aligned}
$$

Note that in the second line above, we were able to write each factor $p(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{W}, \mathbf{b})$ as a product over classes because $\mathbf{y}^{(i)}$ is one-hot: exactly one element $y_k^{(i)}$ equals $1$ and the rest are $0$, so the product (and the inner sum in the last line) simply picks out the log-probability of the true class.
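This final expression is the familiar cross-entropy loss, and it translates almost line-for-line into code. Below is a minimal NumPy sketch evaluated on made-up toy numbers (the names `log_softmax` and `nll_loss` are illustrative, not from the original note); the max-subtraction inside `log_softmax` is the usual log-sum-exp stabilization and does not change the value of the loss.

```python
import numpy as np

def log_softmax(z):
    """Compute log p(y = k | x) for a score vector z, using max-subtraction for stability."""
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

def nll_loss(X, Y, W, b):
    """Negative log-likelihood l(W, b) for one-hot labels.

    X: (N, d) feature matrix, Y: (N, K) one-hot labels,
    W: (K, d) weights, b: (K,) biases.
    """
    total = 0.0
    for x, y in zip(X, Y):
        log_p = log_softmax(W @ x + b)   # log p(y = k | x; W, b) for every class k
        total -= np.dot(y, log_p)        # the one-hot y picks out the true class's log-probability
    return total

# Tiny illustrative example: N = 2, d = 2, K = 3 (arbitrary numbers).
X = np.array([[2.0, 1.0],
              [0.5, -1.5]])
Y = np.array([[0.0, 1.0, 0.0],    # example 1 belongs to class 2
              [1.0, 0.0, 0.0]])   # example 2 belongs to class 1
W = np.array([[ 0.5, -1.0],
              [ 1.5,  0.2],
              [-0.3,  0.8]])
b = np.array([0.1, -0.2, 0.0])

print(nll_loss(X, Y, W, b))  # a single scalar; smaller means the data are more likely under (W, b)
```

As a sanity check, `np.exp(-nll_loss(X, Y, W, b))` recovers the likelihood product from the earlier sketch, since the loss is just the negative log of that product.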
Now that we have an expression for $\ell(\mathbf{W}, \mathbf{b})$, we can choose from a number of algorithms to minimize the loss. In a future note, we will derive parameter update equations for minimizing $\ell(\mathbf{W}, \mathbf{b})$ with stochastic gradient descent.