One-Layer NN (Classification)
A one-layer neural network (also known as a perceptron) is a fundamental architecture that offers insight into basic concepts such as parameters, training, the role of the loss function, backpropagation, and so on.
Consider a one-layer neural network that has 4 inputs and 3 outputs as follows.
Figure: a one-layer neural network architecture.
Parameters
The sizes of the input and output determine the parameter set of this neural network. Since it has only a single linear layer, the output can be written as
\[
\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b},
\]
where \(\mathbf{x} \in \mathbb{R}^{4}\) is the input, and \(\mathbf{W}\) and \(\mathbf{b}\) are the weight matrix and bias vector of the network, with sizes determined by the input and output dimensions. Thus,
\[
\mathbf{W} \in \mathbb{R}^{3 \times 4}, \qquad \mathbf{b} \in \mathbb{R}^{3}.
\]
The vector \(\mathbf{z}\) is called the logit vector.
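As a concrete sketch of this layer, here is a minimal NumPy example that computes the logit vector for one input; the values of `W`, `b`, and `x` are arbitrary illustrations, not taken from the figure:

```python
import numpy as np

n_in, n_out = 4, 3                    # 4 inputs, 3 outputs, as in the figure

rng = np.random.default_rng(0)
W = rng.normal(size=(n_out, n_in))    # weight matrix W in R^{3x4}
b = np.zeros(n_out)                   # bias vector b in R^3

x = np.array([0.5, -1.2, 3.0, 0.7])   # an arbitrary input vector in R^4

z = W @ x + b                         # logit vector z = Wx + b, shape (3,)
print(z)
```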
Loss function
The above only defines the architecture of the network; it is not yet a neural network model. To turn an architecture into a model, we need a loss function, which must be scalar-valued. Notice that right now the architecture returns a vector in \(\mathbb{R}^3\).
Suppose the loss function is the cross-entropy function, which measures how close two probability vectors are to one another. The problem is that the logit vector is not a probability vector, so we need to convert it into one. Transforming the logit vector into a probability vector is done by the softmax function (vector-valued), which exponentiates each component and divides by the sum of the exponentials:
\[
\hat{\mathbf{y}} = \text{softmax}(\mathbf{z}), \qquad \hat{y}_j = \frac{e^{z_j}}{\sum_{k=1}^{3} e^{z_k}}, \quad j = 1, 2, 3.
\]
The vector \(\hat{\mathbf{y}}\) is now the output of the single-layer neural network, as the picture shows.
Note that the transformations applied to the logit vectors of a neural network's layers are called activation functions. Thus, the softmax above is an activation function.
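As a small sketch, the softmax activation could be implemented as follows; subtracting the maximum logit is a standard numerical-stability trick and is not part of the definition above:

```python
import numpy as np

def softmax(z):
    """Map a logit vector z to a probability vector with components in (0, 1)."""
    e = np.exp(z - np.max(z))   # shift by the max to avoid overflow; result is unchanged
    return e / e.sum()

y_hat = softmax(np.array([2.0, -1.0, 0.5]))
print(y_hat, y_hat.sum())       # components strictly between 0 and 1, summing to 1
```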
Finally, we can write
\[
\hat{\mathbf{y}} = f(\mathbf{z}) = f(\mathbf{W}\mathbf{x} + \mathbf{b}),
\]
where \(f:\mathbb{R}^3 \to (0,1)^3\), \(f(\mathbf{z})=\text{softmax}(\mathbf{z})\), takes an arbitrary vector in \(\mathbb{R}^3\) and maps it to a probability vector in \(\mathbb{R}^3\). Notice that the softmax function can never produce exactly zero or one in any of its components.
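Putting the pieces together, here is a minimal sketch of the full forward pass \(\hat{\mathbf{y}} = \text{softmax}(\mathbf{W}\mathbf{x} + \mathbf{b})\), again with arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))            # weight matrix
b = np.zeros(3)                        # bias vector
x = np.array([0.5, -1.2, 3.0, 0.7])    # input vector in R^4

z = W @ x + b                          # logit vector
e = np.exp(z - z.max())
y_hat = e / e.sum()                    # softmax: probability vector in (0, 1)^3

print(y_hat, y_hat.sum())              # sums to 1; no component is exactly 0 or 1
```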
Optimization
Once we have a scalar-valued function, we can minimize it using optimization algorithms such as gradient descent. Algebraically speaking, we can minimize the following:
\[
\min_{\mathbf{W},\, \mathbf{b}} \; l\big(\mathbf{y}, \hat{\mathbf{y}}\big) = \min_{\mathbf{W},\, \mathbf{b}} \; -\sum_{j=1}^{3} y_j \log \hat{y}_j,
\]
where \(\mathbf{x}\) is a given data vector in \(\mathbb{R}^{4}\) with its corresponding label \(\mathbf{y}\).
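As a sketch, the cross-entropy loss for a single example with a one-hot label might look like the following; the small constant inside the logarithm is a common numerical safeguard and not part of the formula above:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Cross-entropy between a one-hot label y and a predicted probability vector y_hat."""
    return -np.sum(y * np.log(y_hat + 1e-12))

y = np.array([0.0, 1.0, 0.0])       # one-hot label: the true class is the second one
y_hat = np.array([0.2, 0.7, 0.1])   # network output (probability vector)
print(cross_entropy(y, y_hat))      # ~0.357, i.e. -log(0.7)
```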
Minimizing the above is certainly possible in theory, but in practice we are given a set of \(N\) data pairs \(\{(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \dots, (\mathbf{x}^{(N)}, \mathbf{y}^{(N)})\}\). Looking at the above, the loss can now be defined as the average loss over all data points as follows:
\[
\ell(X, Y, \mathbf{W}, \mathbf{b}) = \frac{1}{N} \sum_{i=1}^{N} l\big(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}\big),
\]
where
\( X=\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\} \) and \( Y=\{\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(N)}\}. \)
For notational simplicity, \(\ell(X, Y, \mathbf{W}, \mathbf{b})\) is usually written as \(\ell(\boldsymbol{\theta})\), where \(\boldsymbol{\theta}=\{\mathbf{W}, \mathbf{b}\}\). Also, the loss associated with a data point, \(l ( \mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)} )\), is written as \(l_i ( \boldsymbol{\theta} )\). Thus,
\[
\ell(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} l_i(\boldsymbol{\theta}).
\]
Finally, we can minimize the loss function as follows:
\[
\min_{\boldsymbol{\theta}} \; \ell(\boldsymbol{\theta}) = \min_{\boldsymbol{\theta}} \; \frac{1}{N} \sum_{i=1}^{N} l_i(\boldsymbol{\theta}).
\]
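To make the whole procedure concrete, here is a minimal sketch of minimizing \(\ell(\boldsymbol{\theta})\) with full-batch gradient descent on randomly generated, purely illustrative data; it uses the well-known fact that the gradient of the combined softmax/cross-entropy loss with respect to the logits is \(\hat{\mathbf{y}} - \mathbf{y}\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_in, n_out = 100, 4, 3

# Purely illustrative random data: N inputs in R^4 and one-hot labels in R^3.
X = rng.normal(size=(N, n_in))
Y = np.eye(n_out)[rng.integers(0, n_out, size=N)]

# Parameters theta = {W, b}.
W = rng.normal(scale=0.1, size=(n_out, n_in))
b = np.zeros(n_out)

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

lr = 0.1
for step in range(500):
    Z = X @ W.T + b                                         # logits, shape (N, 3)
    Y_hat = softmax(Z)                                      # predictions, shape (N, 3)
    loss = -np.mean(np.sum(Y * np.log(Y_hat + 1e-12), axis=1))

    dZ = (Y_hat - Y) / N       # gradient of the average loss w.r.t. the logits
    dW = dZ.T @ X              # gradient w.r.t. W, shape (3, 4)
    db = dZ.sum(axis=0)        # gradient w.r.t. b, shape (3,)

    W -= lr * dW               # gradient descent update
    b -= lr * db

print(loss)                    # final average training loss
```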