
Activation Layer - Motivation

Activation layers play a critical role in neural networks. Without them, even deep networks collapse into simple linear models and fail to learn complex patterns. This lesson shows why, then surveys the most common activation functions.


Why Do We Need Activation Layers?

Consider a neural network with a hidden layer but no activation function.


For a hidden layer:

$$h_1 = w_{01} + w_{11}x_1 + w_{21}x_2$$

$$h_2 = w_{02} + w_{12}x_1 + w_{22}x_2$$

And an output layer:

$$y = v_0 + v_1h_1 + v_2h_2$$

Substituting the hidden layer equations:

$$
\begin{aligned}
y &= v_0 + v_1h_1 + v_2h_2 \\
&= v_0 + v_1(w_{01} + w_{11}x_1 + w_{21}x_2) + v_2(w_{02} + w_{12}x_1 + w_{22}x_2) \\
&= (v_0 + v_1w_{01} + v_2w_{02}) + (v_1w_{11} + v_2w_{12})x_1 + (v_1w_{21} + v_2w_{22})x_2 \\
&= W_0 + W_1x_1 + W_2x_2
\end{aligned}
$$

Here $W_0 = v_0 + v_1w_{01} + v_2w_{02}$, $W_1 = v_1w_{11} + v_2w_{12}$, and $W_2 = v_1w_{21} + v_2w_{22}$.

This shows that without activation functions, the entire network reduces to a single linear equation, no matter how many layers it has.

Stacking linear layers still results in a linear model.
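This collapse is easy to verify numerically. The sketch below (with arbitrary, randomly chosen weights) composes two affine layers and checks that the result equals a single affine map with combined weights:

```python
import numpy as np

# Two "layers" with no activation: h = W @ x + b, then y = v @ h + v0.
# Weights are arbitrary values for illustration.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))   # hidden-layer weights
b = rng.normal(size=2)        # hidden-layer biases
v = rng.normal(size=2)        # output weights
v0 = rng.normal()             # output bias

def two_layer_linear(x):
    h = W @ x + b             # hidden layer, no activation
    return v @ h + v0         # output layer

# The same map collapsed into one linear layer:
# v @ (W @ x + b) + v0 = (v @ W) @ x + (v @ b + v0)
W_eff = v @ W
b_eff = v @ b + v0

def one_layer_linear(x):
    return W_eff @ x + b_eff

x = np.array([0.5, -1.2])
print(two_layer_linear(x), one_layer_linear(x))  # identical outputs
```

No matter how many affine layers are stacked, the composition is still one affine map.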

Why Linear Models Fail: The XOR Problem

The XOR function is a classic example that exposes the limitation of linear models.


XOR is not linearly separable, meaning:

  • No single straight line can separate its classes
  • Any model of the form $y = W_0 + W_1x_1 + W_2x_2$ will fail

This is why non-linearity is essential.
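To make this concrete, here is a minimal sketch of a two-neuron ReLU network that computes XOR exactly (the weights are hand-crafted for illustration, not learned):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# A hand-crafted two-neuron ReLU network computing XOR,
# something no linear model y = W0 + W1*x1 + W2*x2 can do.
def xor_net(x1, x2):
    s = x1 + x2
    h1 = relu(s)        # fires when at least one input is active
    h2 = relu(s - 1)    # fires only when both inputs are active
    return h1 - 2 * h2  # 0, 1, 1, 0 over the four binary inputs

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))  # matches x1 XOR x2
```

The non-linearity of ReLU is what lets the two hidden neurons carve out a decision region that no single straight line could.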

Activation Functions Introduce Non-Linearity for Real-World Data

Most real-world data is non-linear, and without activation layers a network cannot represent it.

  • Image Classification: Recognizing a cat in an image requires understanding curves, textures, shapes, and spatial relationships — all inherently non-linear patterns.

  • Natural Language Processing (NLP): Understanding and classifying human language (e.g., sentiment analysis, spam detection, translation) involves non-linear relationships between words and context.

  • Recommender Systems: Predicting user preferences for products, movies, or music involves highly non-linear interactions between user history, item features, and context.

  • Speech Recognition: Audio data contains complex patterns that vary greatly, requiring non-linear models to interpret human speech accurately.

  • Autonomous Driving: Data used for object detection and decision making in self-driving cars (e.g., distinguishing pedestrians from vehicles in real-time video feeds) is inherently non-linear. 

Activation functions:

  • Break linearity
  • Allow neural networks to learn complex decision boundaries
  • Enable hierarchical feature learning

Linear models cannot capture any of these relationships.

Sigmoid Activation Function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Properties

  • Output range: $(0, 1)$
  • Smooth and differentiable

Where It Is Used

  • Output layer for binary classification
  • When probabilities are required

Why It Works

  • Outputs can be interpreted as probabilities
  • Enables gradient-based learning

Limitations

  • Vanishing gradient problem
  • Not zero-centered
  • Slow training in deep networks
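A short NumPy sketch makes the vanishing-gradient limitation visible: the derivative $\sigma(x)(1 - \sigma(x))$ peaks at only 0.25 and is nearly zero for large $|x|$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # derivative: sigma(x) * (1 - sigma(x))

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(xs))       # all outputs stay inside (0, 1)
print(sigmoid_grad(xs))  # peaks at 0.25 at x = 0, nearly 0 for |x| large
```

In a deep network these small derivatives multiply layer after layer, which is exactly the vanishing-gradient problem.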

Hyperbolic Tangent (tanh)

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

Properties

  • Output range: $(-1, 1)$
  • Zero-centered

Where It Is Used

  • Hidden layers in older networks
  • RNNs and LSTMs

Why It Works

  • Stronger gradients than sigmoid
  • Better convergence than sigmoid

Limitations

  • Still suffers from vanishing gradients
  • Slower than ReLU
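The two advantages over sigmoid can be checked directly: tanh's gradient $1 - \tanh^2(x)$ peaks at 1.0 (vs. sigmoid's 0.25), and its outputs are zero-centered:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-3, 3, 7)
tanh_grad = 1 - np.tanh(xs) ** 2  # d/dx tanh(x)
s = sigmoid(xs)
sig_grad = s * (1 - s)            # d/dx sigmoid(x)

print(tanh_grad.max(), sig_grad.max())  # 1.0 vs 0.25
print(np.tanh(xs).mean())               # ~0: zero-centered outputs
```

Stronger gradients mean faster learning, but both curves still flatten out for large $|x|$, so the vanishing-gradient problem remains.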

Rectified Linear Unit (ReLU)

$$f(x) = \max(0, x)$$

Where It Is Used

  • Hidden layers of deep neural networks
  • Computer vision models
  • Most modern deep learning architectures

Why It Works

  • Fast computation
  • No vanishing gradient for $x > 0$
  • Sparse activation helps reduce overfitting

Limitations

  • Dying ReLU problem (neurons stuck at zero)
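The dying-ReLU problem follows directly from the gradient, as this small sketch shows: the derivative is exactly zero for all non-positive inputs, so a neuron that only receives negative pre-activations gets no learning signal at all:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)  # 1 for x > 0, exactly 0 otherwise

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(xs))       # negative inputs are clipped to 0
print(relu_grad(xs))  # zero gradient for x <= 0: the neuron "dies"
```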

Leaky ReLU

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}$$

where $0.01 \le \alpha \le 0.3$.

Where It Is Used

  • Hidden layers when ReLU fails
  • GANs
  • Deep CNNs with dying ReLU issues

Why It Works

  • Prevents neurons from dying
  • Maintains small gradients for negative inputs

Limitations

  • Slightly slower than ReLU
  • $\alpha$ is a hyperparameter that must be tuned
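A minimal sketch of the piecewise definition above (using the common default $\alpha = 0.01$) shows how the gradient never reaches exactly zero:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is a hyperparameter, typically in [0.01, 0.3]
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)  # small but nonzero for x <= 0

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(xs))       # negative inputs leak through, scaled by alpha
print(leaky_relu_grad(xs))  # gradient never vanishes, so neurons can't die
```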

Softmax Activation Function

For $K$ classes with logits $z = [z_1, z_2, \dots, z_K]$:

$$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Where It Is Used

  • Output layer for multi-class classification
  • Final layers of CNNs
  • Transformer architectures

Why It Works

  • Converts raw scores into probabilities
  • Outputs sum to 1
  • Differentiable and trainable

Limitations

  • Only suitable for mutually exclusive classes
  • Sensitive to large values (numerical overflow)
  • Not used in hidden layers
  • Vanishing gradients when one class dominates
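The overflow sensitivity has a standard fix: subtract the maximum logit before exponentiating, which leaves the result unchanged because softmax is shift-invariant. A minimal NumPy sketch:

```python
import numpy as np

def softmax(z):
    # Subtracting the max avoids overflow in exp() without changing
    # the result: softmax(z) == softmax(z - c) for any constant c.
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])  # naive exp(1000.0) would overflow
p = softmax(z)
print(p, p.sum())  # valid probabilities that sum to 1
```

This "max trick" is what deep learning frameworks use internally for stable softmax and cross-entropy computation.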

Key Takeaways

  • Without activation functions, neural networks collapse into linear models
  • Non-linearity is essential to solve real-world problems
  • Different activation functions serve different purposes
  • Choosing the right activation function is crucial for performance

Activation functions are what give neural networks their power.
