# One-Layer NN (Classification)

A one-layer neural network (also known as a perceptron) is a fundamental architecture that offers useful insight into basic concepts such as parameters, training, the role of the loss function, backpropagation, and so on. Consider a one-layer neural network that has 4 inputs and 3 outputs as follows.
A one-layer neural network architecture.
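To make the setup concrete, here is a minimal sketch, assuming PyTorch is available, that instantiates the pictured architecture as a single linear layer with 4 inputs and 3 outputs; the variable names are chosen here for illustration. Its learnable parameters are exactly the weight matrix and bias vector derived by hand in the next section.

```python
import torch
import torch.nn as nn

# A single linear layer: 4 input features -> 3 output logits.
layer = nn.Linear(in_features=4, out_features=3)

# The parameters are a 3x4 weight matrix and a length-3 bias vector.
print(layer.weight.shape)  # torch.Size([3, 4])
print(layer.bias.shape)    # torch.Size([3])

# A forward pass on one input vector x in R^4 returns a logit vector z in R^3.
x = torch.randn(4)
z = layer(x)
print(z.shape)             # torch.Size([3])
```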
**Parameters** The sizes of the input and output determine the parameter set of this neural network. Since it has only a single linear layer, the output can be written as

$$
\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b},
$$

where $\mathbf{W}$ and $\mathbf{b}$ are the weight matrix and bias vector of the network, with sizes determined by the input and output dimensions. Thus,

$$
\mathbf{W}=
\begin{bmatrix}
w_{11} & w_{12} & w_{13} & w_{14} \\
w_{21} & w_{22} & w_{23} & w_{24} \\
w_{31} & w_{32} & w_{33} & w_{34}
\end{bmatrix},
\qquad
\mathbf{b}=
\begin{bmatrix}
b_{1} \\ b_{2} \\ b_{3}
\end{bmatrix}.
$$

The vector $\mathbf{z}$ is called the logit vector.

**Loss function** The above only defines the architecture of the network; by itself it is not yet a neural network model. To turn an architecture into a model, we need a loss function, which must be scalar-valued. Notice that, so far, the architecture returns a vector in $\mathbb{R}^3$. Suppose the loss function is the cross-entropy function, $\text{CE}(\hat{\mathbf{y}}, \mathbf{y}) = -\sum_{j} y_j \log \hat{y}_j$, which measures how close two probability vectors are to one another. The problem is that the logit vector is not a probability vector, so we need to convert it into one. Transforming the logit vector into a probability vector is done by the softmax function (vector-valued), which exponentiates each component and divides by the sum of the exponentials:

$$
\hat{\mathbf{y}} = \text{softmax}(\mathbf{z}) =
\begin{bmatrix}
\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \\
\frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}} \\
\frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}
\end{bmatrix}.
$$

The vector $\hat{\mathbf{y}}$ is now the output of the single-layer neural network, as the picture shows.
A one-layer neural network model.
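As a sanity check on these formulas, the following NumPy sketch (with illustrative names and random parameter values) computes the logit vector $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$ and applies the softmax to obtain a probability vector.

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the resulting probabilities.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Arbitrary parameter values with the shapes derived above: W is 3x4, b is 3.
W = np.random.randn(3, 4)
b = np.random.randn(3)

x = np.random.randn(4)      # one input vector in R^4
z = W @ x + b               # logit vector in R^3
y_hat = softmax(z)          # probability vector in (0, 1)^3

print(y_hat, y_hat.sum())   # components lie strictly in (0, 1) and sum to 1
```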
Note that the transformations applied to the logit vectors of a neural network's layers are called activation functions; thus, the above softmax function is an activation function. Finally, we can write

$$
l ( \mathbf{y}, \hat{\mathbf{y}} ) = \text{CE} \Big( \text{softmax}( \mathbf{W}\mathbf{x}+\mathbf{b}), \mathbf{y} \Big) = l \Big( \mathbf{f}(\mathbf{W}\mathbf{x}+\mathbf{b}), \mathbf{y} \Big),
$$

where $\mathbf{f}:\mathbb{R}^3 \to (0,1)^3$ with $\mathbf{f}(\mathbf{z})=\text{softmax}(\mathbf{z})$ takes an arbitrary vector in $\mathbb{R}^3$ and maps it to a probability vector in $\mathbb{R}^3$. Notice that the softmax function can never produce exactly zero or one in any of its components.

**Optimization** Once we have a scalar-valued function, we can minimize it using optimization algorithms such as gradient descent. Algebraically speaking, we can minimize

$$
\min_{\mathbf{W}\in \mathbb{R}^{3\times 4}, \mathbf{b} \in \mathbb{R}^3} \text{CE} \Big( \text{softmax}( \mathbf{W}\mathbf{x}+\mathbf{b}), \mathbf{y} \Big),
$$

where $\mathbf{x}$ is a given data vector in $\mathbb{R}^{4}$ with its corresponding label

$$
\mathbf{y} \in \Bigg\{
\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix},
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix},
\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}
\Bigg\}.
$$

Minimizing the above for a single data point is certainly possible, but in practice we are given a set of $N$ data pairs $\{(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}), \dots, (\mathbf{x}^{(N)}, \mathbf{y}^{(N)})\}$. The loss can then be defined as the average loss over all data points:

$$
\ell(X, Y, \mathbf{W}, \mathbf{b}) = \frac{1}{N} \sum_{i=1}^N \text{CE} \Big( \text{softmax}( \mathbf{W}\mathbf{x}^{(i)}+\mathbf{b}), \mathbf{y}^{(i)} \Big),
$$

where $X=\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$ and $Y=\{\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(N)}\}$. For notational simplicity, $\ell(X, Y, \mathbf{W}, \mathbf{b})$ is usually written as $\ell(\boldsymbol{\theta})$, where $\boldsymbol{\theta}=\{\mathbf{W}, \mathbf{b}\}$. Likewise, the loss associated with a single data point, $l ( \mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)} )$, is written as $l_i ( \boldsymbol{\theta} )$. Thus,

$$
\ell(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^N l_i ( \boldsymbol{\theta} ) = \frac{1}{N} \sum_{i=1}^N \text{CE} \Big( \text{softmax}( \mathbf{W}\mathbf{x}^{(i)}+\mathbf{b}), \mathbf{y}^{(i)} \Big).
$$

Finally, we can minimize the loss function as follows:

$$
\min_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}) = \min_{\mathbf{W}\in \mathbb{R}^{3\times 4}, \mathbf{b} \in \mathbb{R}^3} \frac{1}{N} \sum_{i=1}^N \text{CE} \Big( \text{softmax}( \mathbf{W}\mathbf{x}^{(i)}+\mathbf{b}), \mathbf{y}^{(i)} \Big).
$$
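Putting the pieces together, here is a minimal sketch, assuming PyTorch and a synthetic dataset invented for illustration, of minimizing the averaged cross-entropy loss with gradient descent. Note that PyTorch's `nn.CrossEntropyLoss` applies the softmax internally (via log-softmax), so the model only outputs logits, and the targets are given as class indices rather than one-hot vectors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic dataset: N input vectors in R^4 with class labels in {0, 1, 2}.
N = 100
X = torch.randn(N, 4)
y = torch.randint(0, 3, (N,))

model = nn.Linear(4, 3)                      # parameters theta = {W, b}
criterion = nn.CrossEntropyLoss()            # averaged CE with softmax built in
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(200):
    logits = model(X)                        # z^(i) = W x^(i) + b for all i
    loss = criterion(logits, y)              # (1/N) sum_i CE(softmax(z^(i)), y^(i))
    optimizer.zero_grad()
    loss.backward()                          # gradients via backpropagation
    optimizer.step()                         # gradient descent update on W and b

print(loss.item())
```

Full-batch gradient descent is used here only to mirror the formula above; in practice the same loop is typically run on mini-batches of the data.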