Softmax and Cross Entropy with Python implementation

This blog post focuses on the forward pass and the backpropagation of a network using a softmax classifier with cross-entropy loss. We will go through how it works, derive the backpropagation equations, implement the code in NumPy, and look into some practical numerical-stability issues.

Function definitions

Cross entropy

Cross entropy is a loss function used for multi-class classification. Let \(\custommedium C\) be the number of classes, \(\custommedium y_i\) the true value for class \(i\), and \(\custommedium p_i\) the predicted probability for that class.

\[\customsmall L = - \sum_{i}^C y_i \log p_i\]

Binary cross entropy is the loss function used for binary classification in deep learning, i.e. when we have only two classes to predict from. It is the special case of cross entropy where the number of classes is 2.

\[\customsmall L = -{(y\log(p) + (1 - y)\log(1 - p))}\]
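
Both definitions can be checked directly in NumPy. A minimal sketch, with made-up label and probability values chosen only to illustrate the formulas:

import numpy as np

# One example with C = 3 classes; y is one-hot, p is the predicted distribution.
y = np.array([0, 1, 0])            # true class is index 1
p = np.array([0.2, 0.7, 0.1])      # predicted probabilities, sum to 1
cross_entropy = -np.sum(y * np.log(p))                # = -log(0.7) ≈ 0.357

# Binary cross entropy is the two-class special case of the same formula.
y_bin, p_bin = 1, 0.7
binary_cross_entropy = -(y_bin * np.log(p_bin) + (1 - y_bin) * np.log(1 - p_bin))
print(cross_entropy, binary_cross_entropy)            # both equal -log(0.7)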

Softmax

Softmax takes a C-dimensional vector of real numbers, corresponding to the scores predicted for each of the C classes, and transforms it into a vector of real numbers in the range (0, 1) that sum to 1.

\[\customsmall P_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}\] \[\customsmall \sum_i P_i = 1\]
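
In NumPy this is a one-liner. The score vector below is made up purely for illustration:

import numpy as np

x = np.array([2.0, 1.0, 0.1])              # made-up scores for C = 3 classes
p = np.exp(x) / np.sum(np.exp(x))
print(p)                                   # ≈ [0.659 0.242 0.099]
print(p.sum())                             # 1.0 (up to floating point error)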

Forward and Backward propagation

Let us consider the last two layers of an architecture that first applies an affine transformation and then uses softmax with cross-entropy loss. Let the inputs to the second-last layer be \(\custommedium X\) and the weights connecting the last two layers be \(\custommedium W\) (ignoring biases).

Hence the shapes of \(\customsmall X\) and \(\customsmall W\) are \(\customsmall N \times D\) and \(\customsmall D \times C\) respectively.
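
For concreteness, a tiny sketch with hypothetical sizes (the values of N, D and C below are arbitrary):

import numpy as np

N, D, C = 5, 4, 3                          # number of examples, input dimension, classes
X = np.random.randn(N, D)                  # inputs to the second-last layer
W = np.random.randn(D, C)                  # weights connecting to the C output scores
print(X.shape, W.shape, X.dot(W).shape)    # (5, 4) (4, 3) (5, 3)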

Figure: Architecture of the affine layer followed by softmax and cross-entropy loss

Forward Pass

  1. Affine transform

    \[\customsmall scores = X.W + b\]

    \(\customsmall scores\) is now of shape \(\customsmall N \times C\)

  2. Softmax

    We must calculate the softmax over each row, i.e. each of the \(\custommedium N\) examples, since each example predicts one class out of the \(\custommedium C\) classes.

    \[\customsmall scores\_normalised_{i,j} = \frac{\exp(scores_{i,j})}{\sum_{m}^{C} \exp(scores_{i,m})}\]

    \(scores\_normalised_i\), the i’th row, is of shape \(\custommedium 1 \times C\)

  3. Cross entropy loss

    \[\customsmall L = - \sum_{j}^C y_j \log p_j\]

    In the forward pass, assuming \(\custommedium y_j\) is 1 for one class (say the k’th class) and 0 for all the other classes (\(\custommedium j \ne k\)), we only need to consider the value predicted for that class while calculating the cross-entropy loss. Hence our loss function for each example becomes

    \[\customsmall L_i = - \log p_{i,k}\]

    where k corresponds to the true class in the i’th example

Code with forward pass
loss = 0.0
for i in range(X.shape[0]):
    scores = X[i].dot(W)                                         # shape (C,)
    scores_normalised = np.exp(scores) / np.sum(np.exp(scores))  # softmax over the row
    loss += -np.log(scores_normalised[y[i]])                     # cross-entropy for example i
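
The snippet above assumes X, W and y already exist. A self-contained way to run it end to end, with made-up data (the sizes and random initialisation are assumptions for illustration):

import numpy as np

np.random.seed(0)
N, D, C = 5, 4, 3
X = np.random.randn(N, D)
W = 0.01 * np.random.randn(D, C)           # small random weights
y = np.random.randint(C, size=N)           # true class index for each example

loss = 0.0
for i in range(X.shape[0]):
    scores = X[i].dot(W)
    scores_normalised = np.exp(scores) / np.sum(np.exp(scores))
    loss += -np.log(scores_normalised[y[i]])

loss /= X.shape[0]                         # average over the N examples
print(loss)                                # ≈ log(C) ≈ 1.099 since the scores are near zero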

Backpropagation

Our aim is to find the derivative of the loss with respect to the weight matrix, so we can perform gradient descent and optimise the weight matrix.

Essentially, we must evaluate

\[\customsmall \frac{\partial L}{\partial W_{i,j}}\]

which will have the same shape as W because we must find by how much we need to change the weights at each link between the layers.

The equations that we used are

\[\customsmall L = - log p_{k}\] \[\customsmall p_k = \frac{\exp(scores_k)}{\sum_j^C \exp(scores_j)}\] \[\customsmall scores = X.W + b\]

By chain rule,

\[\customsmall \frac{\partial L}{\partial W_{i,j}} = \frac{\partial L}{\partial p_{k}} \times \frac{\partial p_{k}}{\partial scores_j} \times \frac{\partial scores_j}{\partial W_{i,j}}\]
  1. Derivative of \(\custommedium L\) with respect to \(\custommedium p_k\)

    \[\customsmall \frac{\partial L}{\partial p_{k}} = \frac{-1}{p_k}\]
  2. Derivative of \(\custommedium p_k\) with respect to \(\custommedium scores_j\)

    We need to consider two cases:

    Case 1 : \(\custommedium j \ne k\)

    When j does not correspond to the correct class, the numerator is constant with respect to \(\custommedium scores_j\), so only the denominator is differentiated.

    \[\customsmall \frac{\partial p_{k}}{\partial scores_{j}} = \exp(scores_k) \times \left( \frac{-1}{(\sum_j^C \exp(scores_j))^2} \times \exp(scores_j)\right)\]

    Case 2 : \(\custommedium j = k\)

    When j corresponds to the correct class, the numerator must also be taken into account while differentiating, so we use the quotient rule.

    \[\customsmall \frac{\partial p_{k}}{\partial scores_{j}} = \frac{\exp(scores_k)\times \sum_j^C \exp(scores_j) - (\exp(scores_k))^2}{(\sum_j^C \exp(scores_j))^2}\]
  3. Derivative of \(\custommedium scores_j\) with respect to \(\custommedium W_{i,j}\)

\[\customsmall \frac{\partial scores_{j}}{\partial W_{i,j}} = X_i\]

Bringing them all together using the chain rule, we get an equation for each case.

When considering a wrong class j (\(\custommedium j \ne k\)),

\[\customsmall \frac{\partial L}{\partial W_{i,j}} = \frac{-1}{p_k} \times \exp(scores_k) \times \left( \frac{-1}{(\sum_j^C \exp(scores_j))^2} \times \exp(scores_j)\right) \times X_i\] \[\customsmall = \frac{-1}{\frac{\exp(scores_k)}{\sum_j^C \exp(scores_j)}} \times \exp(scores_k) \times \left( \frac{-1}{(\sum_j^C \exp(scores_j))^2} \times \exp(scores_j)\right) \times X_i\] \[\customsmall \frac{\partial L}{\partial W_{i,j}} = \frac{\exp(scores_j)}{\sum_j^C \exp(scores_j)} \times X_i\]

When considering the correct class (\(\custommedium j = k\)),

\[\customsmall \frac{\partial L}{\partial W_{i,j=k}} = \frac{-1}{p_k} \times \frac{\exp(scores_k)\times \sum_j^C \exp(scores_j) - (\exp(scores_k))^2}{(\sum_j^C \exp(scores_j))^2} \times X_i\] \[\customsmall = \frac{-1}{\frac{\exp(scores_k)}{\sum_j^C \exp(scores_j)}} \times \left( \frac{\exp(scores_k)\times \sum_j^C \exp(scores_j) - (\exp(scores_k))^2}{(\sum_j^C \exp(scores_j))^2}\right) \times X_i\] \[\customsmall \frac{\partial L}{\partial W_{i,j=k}} = \left(-1 +\frac{\exp(scores_k)}{\sum_j^C \exp(scores_j)}\right) \times X_i\]
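
Notice that in both cases the derivative of the loss with respect to \(\customsmall scores_j\) is simply the predicted probability minus the one-hot target. Stacking all \(\custommedium N\) examples, with \(\custommedium P\) the \(\custommedium N \times C\) matrix of softmax outputs and \(\custommedium Y\) the one-hot indicator matrix, this can be written compactly as

\[\customsmall \frac{\partial L_i}{\partial scores_{i,j}} = p_{i,j} - \mathbb{1}[j = k]\] \[\customsmall \frac{\partial L}{\partial W} = X^T (P - Y)\]

with an extra factor of \(\customsmall \frac{1}{N}\) when the loss is averaged over the examples. This is the form the vectorised code further below relies on.
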
Code with backward pass

This is the implementation of both the forward and backward pass. dW is the derivative of the loss w.r.t. the weight matrix.

loss = 0.0
dW = np.zeros_like(W)                  # D x C, same shape as W

for i in range(X.shape[0]):
    scores = X[i].dot(W)
    scores_normalised = np.exp(scores) / np.sum(np.exp(scores))
    loss += -np.log(scores_normalised[y[i]])

    # Common term p_j * X_i for every class j ...
    for j in range(W.shape[1]):
        dW[:, j] += X[i] * scores_normalised[j]
    # ... and the extra -X_i for the correct class
    dW[:, y[i]] -= X[i]

dW /= X.shape[0]
loss /= X.shape[0]

We have optimised the code a little by using one loop for both cases, since both equations share a common term.

The common term is \(\customsmall \frac{\exp(scores_j)}{\sum_j^C \exp(scores_j)} \times X_i\) (with \(\custommedium j = k\) for the correct class); the inner loop adds it for every class, and the extra \(\customsmall -X_i\) for the correct class is applied once per example.
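
To convince yourself that the derivation and the code agree, you can compare dW against a finite-difference estimate. The helper below simply wraps the loop above into a function (the name softmax_loss_and_grad and the toy data are assumptions for this sketch):

import numpy as np

def softmax_loss_and_grad(W, X, y):
    """Average cross-entropy loss and its gradient w.r.t. W, as in the loop above."""
    N = X.shape[0]
    loss = 0.0
    dW = np.zeros_like(W)
    for i in range(N):
        scores = X[i].dot(W)
        p = np.exp(scores) / np.sum(np.exp(scores))
        loss += -np.log(p[y[i]])
        for j in range(W.shape[1]):
            dW[:, j] += X[i] * p[j]
        dW[:, y[i]] -= X[i]
    return loss / N, dW / N

np.random.seed(0)
X = np.random.randn(5, 4)
y = np.random.randint(3, size=5)
W = 0.01 * np.random.randn(4, 3)

loss, dW = softmax_loss_and_grad(W, X, y)
h, i, j = 1e-5, 2, 1                       # check one arbitrary entry of W
W_plus, W_minus = W.copy(), W.copy()
W_plus[i, j] += h
W_minus[i, j] -= h
num_grad = (softmax_loss_and_grad(W_plus, X, y)[0]
            - softmax_loss_and_grad(W_minus, X, y)[0]) / (2 * h)
print(num_grad, dW[i, j])                  # should agree to several decimal places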

Further Optimisation

NumPy hates loops and performs very well with matrices, so a more optimised approach is to get rid of the loop and use matrix multiplication.

Here we use an indicator matrix, which is a matrix of size \(\custommedium N \times C\) where each row contains a single 1 in the column corresponding to the correct class.

scores = X.dot(W)                                  # N x C
scores_normalised = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)
loss = np.sum(-np.log(scores_normalised[range(X.shape[0]), y]))

y_ind = np.zeros_like(scores)                      # indicator matrix : N x C
y_ind[range(X.shape[0]), y] = 1
dW += X.T.dot(scores_normalised)
dW -= X.T.dot(y_ind)

An even more optimised solution, without the use of an indicator matrix, is to just subtract one from the scores_normalised entry corresponding to the correct class in each row, and then perform the matrix multiplication.

scores_normalised[range(X.shape[0]),y] -= 1
dW += X.T.dot(scores_normalised)
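
A quick way to see that the two vectorised forms are equivalent (the toy data below is made up):

import numpy as np

np.random.seed(0)
N, D, C = 5, 4, 3
X = np.random.randn(N, D)
W = 0.01 * np.random.randn(D, C)
y = np.random.randint(C, size=N)

scores = X.dot(W)
P = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)

# Indicator-matrix version
Y = np.zeros_like(P)
Y[range(N), y] = 1
dW_indicator = X.T.dot(P) - X.T.dot(Y)

# Subtract-one version
P_minus = P.copy()
P_minus[range(N), y] -= 1
dW_subtract = X.T.dot(P_minus)

print(np.allclose(dW_indicator, dW_subtract))   # True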

An important note

When we implement softmax in practice, the terms \(\custommedium \exp(scores_j)\) and \(\custommedium \sum_j^C \exp(scores_j)\) may be very large due to the exponentials, so we must preserve numerical stability. We can use a small trick: multiply both the numerator and the denominator by a constant \(\custommedium C\) (not to be confused with the number of classes).

\[\customsmall softmax(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\] \[\customsmall = \frac{C e^{x_i}}{C \sum_j e^{x_j}}\] \[\customsmall = \frac{e^{x_i+\log C}}{\sum_j e^{x_j + \log C}}\]

A common choice is \(\custommedium \log C = -\max_j x_j\), i.e. subtracting the maximum score in each row before exponentiating.

scores = X.dot(W)
max_scores = np.max(scores, axis=1, keepdims=True)      # row-wise maximum, N x 1
scores -= max_scores                                    # shift so the largest score in each row is 0
scores_normalised = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)
loss = np.sum(-np.log(scores_normalised[range(X.shape[0]), y]))


# Without indicator matrix (optimised)
scores_normalised[range(X.shape[0]), y] -= 1
dW += X.T.dot(scores_normalised)
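
To see why the shift matters, compare the naive and shifted versions on some deliberately large made-up scores:

import numpy as np

scores = np.array([[1000.0, 1001.0, 1002.0]])

# Naive softmax: exp(1000) overflows to inf, and inf/inf gives nan.
naive = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)

# Shifted softmax: subtracting the row maximum keeps every exponent <= 0.
shifted = scores - np.max(scores, axis=1, keepdims=True)
stable = np.exp(shifted) / np.sum(np.exp(shifted), axis=1, keepdims=True)

print(naive)    # [[nan nan nan]] (with overflow warnings)
print(stable)   # ≈ [[0.090 0.245 0.665]]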

That’s it! Now you have the derivative of the weight matrix and the loss!

Complete code

Check out the full code for this implementation here.
