One-Layer NN (Classification)
A one-layer neural network (also known as a perceptron) is a fundamental architecture that offers insight into basic concepts such as parameters, training, the role of the loss function, backpropagation, and so on.
Consider a one-layer neural network that has 4 inputs and 3 outputs as follows.
Figure: a one-layer neural network architecture.
Parameters
The sizes of the input and output determine the parameter set of this neural network. Since it has only a single linear layer, the output can be written as
\[
\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b},
\]
where \(\mathbf{x} \in \mathbb{R}^{4}\) is the input, and \(\mathbf{W}\) and \(\mathbf{b}\) are the weight matrix and bias vector of the network, with sizes determined by the input and output dimensions. Thus,
\[
\mathbf{W} \in \mathbb{R}^{3 \times 4}, \qquad \mathbf{b} \in \mathbb{R}^{3}.
\]
The vector \(\mathbf{z}\) is called the logit vector.
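As a concrete sketch of this layer, here is a minimal NumPy example that computes the logit vector for one input; the values of `W`, `b`, and `x` are arbitrary illustrations, not taken from the figure:

```python
import numpy as np

n_in, n_out = 4, 3                    # 4 inputs, 3 outputs, as in the figure

rng = np.random.default_rng(0)
W = rng.normal(size=(n_out, n_in))    # weight matrix W in R^{3x4}
b = np.zeros(n_out)                   # bias vector b in R^3

x = np.array([0.5, -1.2, 3.0, 0.7])   # an arbitrary input vector in R^4

z = W @ x + b                         # logit vector z = Wx + b, shape (3,)
print(z)
```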
Loss function
The above only defines the architecture of the network; it is not yet a neural network model. To turn an architecture into a model, we need a loss function, which must be scalar-valued. Notice that right now the architecture returns a vector in \(\mathbb{R}^3\).
Suppose the loss function is the cross-entropy function, which measures how close two probability vectors are to one another. The problem is that the logit vector is not a probability vector, so we need to convert it into one. Transforming the logit vector into a probability vector is done by the softmax function (vector-valued), which exponentiates each component and divides by the sum of the exponentials:
\[
\hat{\mathbf{y}} = \text{softmax}(\mathbf{z}), \qquad \hat{y}_j = \frac{e^{z_j}}{\sum_{k=1}^{3} e^{z_k}}, \quad j = 1, 2, 3.
\]
The vector \(\hat{\mathbf{y}}\) is now the output of the single-layer neural network, as the picture shows.
Note that the transformations applied to the logit vectors of a neural network's layers are called activation functions. Thus, the softmax above is an activation function.
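As a small sketch, the softmax activation could be implemented as follows; subtracting the maximum logit is a standard numerical-stability trick and is not part of the definition above:

```python
import numpy as np

def softmax(z):
    """Map a logit vector z to a probability vector with components in (0, 1)."""
    e = np.exp(z - np.max(z))   # shift by the max to avoid overflow; result is unchanged
    return e / e.sum()

y_hat = softmax(np.array([2.0, -1.0, 0.5]))
print(y_hat, y_hat.sum())       # components strictly between 0 and 1, summing to 1
```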
Finally, we can write
\[
\hat{\mathbf{y}} = f(\mathbf{z}) = f(\mathbf{W}\mathbf{x} + \mathbf{b}),
\]
where \(f:\mathbb{R}^3 \to (0,1)^3\), \(f(\mathbf{z})=\text{softmax}(\mathbf{z})\), takes an arbitrary vector in \(\mathbb{R}^3\) and maps it to a probability vector in \(\mathbb{R}^3\). Notice that the softmax function can never produce exactly zero or one in any of its components.
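Putting the pieces together, here is a minimal sketch of the full forward pass \(\hat{\mathbf{y}} = \text{softmax}(\mathbf{W}\mathbf{x} + \mathbf{b})\), again with arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))            # weight matrix
b = np.zeros(3)                        # bias vector
x = np.array([0.5, -1.2, 3.0, 0.7])    # input vector in R^4

z = W @ x + b                          # logit vector
e = np.exp(z - z.max())
y_hat = e / e.sum()                    # softmax: probability vector in (0, 1)^3

print(y_hat, y_hat.sum())              # sums to 1; no component is exactly 0 or 1
```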
Optimization
Once we have a scalar-valued function, we can minimize it using optimization algorithms such as gradient descent. Algebraically speaking, we can minimize the following:
\[
\min_{\mathbf{W},\, \mathbf{b}} \; l\big(\mathbf{y}, \hat{\mathbf{y}}\big) = \min_{\mathbf{W},\, \mathbf{b}} \; -\sum_{j=1}^{3} y_j \log \hat{y}_j,
\]
where \(\mathbf{x}\) is a given data vector in \(\mathbb{R}^{4}\) with its corresponding label \(\mathbf{y}\).
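As a sketch, the cross-entropy loss for a single example with a one-hot label might look like the following; the small constant inside the logarithm is a common numerical safeguard and not part of the formula above:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Cross-entropy between a one-hot label y and a predicted probability vector y_hat."""
    return -np.sum(y * np.log(y_hat + 1e-12))

y = np.array([0.0, 1.0, 0.0])       # one-hot label: the true class is the second one
y_hat = np.array([0.2, 0.7, 0.1])   # network output (probability vector)
print(cross_entropy(y, y_hat))      # ~0.357, i.e. -log(0.7)
```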
Minimizing the above is certainly possible in theory, but in practice we are given a set of \(N\) data pairs \(\{(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \dots, (\mathbf{x}^{(N)}, \mathbf{y}^{(N)})\}\). Looking at the above, the loss can now be defined as the average loss over all data points as follows:
\[
\ell(X, Y, \mathbf{W}, \mathbf{b}) = \frac{1}{N} \sum_{i=1}^{N} l\big(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}\big),
\]
where
\( X=\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\} \) and \( Y=\{\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(N)}\}. \)
For notational simplicity, \(\ell(X, Y, \mathbf{W}, \mathbf{b})\) is usually written as \(\ell(\boldsymbol{\theta})\), where \(\boldsymbol{\theta}=\{\mathbf{W}, \mathbf{b}\}\). Also, the loss associated with a data point, \(l ( \mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)} )\), is written as \(l_i ( \boldsymbol{\theta} )\). Thus,
\[
\ell(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} l_i(\boldsymbol{\theta}).
\]
Finally, we can minimize the loss function as follows:
\[
\min_{\boldsymbol{\theta}} \; \ell(\boldsymbol{\theta}) = \min_{\boldsymbol{\theta}} \; \frac{1}{N} \sum_{i=1}^{N} l_i(\boldsymbol{\theta}).
\]
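To make the whole procedure concrete, here is a minimal sketch of minimizing \(\ell(\boldsymbol{\theta})\) with full-batch gradient descent on randomly generated, purely illustrative data; it uses the well-known fact that the gradient of the combined softmax/cross-entropy loss with respect to the logits is \(\hat{\mathbf{y}} - \mathbf{y}\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_in, n_out = 100, 4, 3

# Purely illustrative random data: N inputs in R^4 and one-hot labels in R^3.
X = rng.normal(size=(N, n_in))
Y = np.eye(n_out)[rng.integers(0, n_out, size=N)]

# Parameters theta = {W, b}.
W = rng.normal(scale=0.1, size=(n_out, n_in))
b = np.zeros(n_out)

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

lr = 0.1
for step in range(500):
    Z = X @ W.T + b                                         # logits, shape (N, 3)
    Y_hat = softmax(Z)                                      # predictions, shape (N, 3)
    loss = -np.mean(np.sum(Y * np.log(Y_hat + 1e-12), axis=1))

    dZ = (Y_hat - Y) / N       # gradient of the average loss w.r.t. the logits
    dW = dZ.T @ X              # gradient w.r.t. W, shape (3, 4)
    db = dZ.sum(axis=0)        # gradient w.r.t. b, shape (3,)

    W -= lr * dW               # gradient descent update
    b -= lr * db

print(loss)                    # final average training loss
```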