Neural Networks Hyper-parameters (Quick Revision)-1
Activation Functions Revision
Welcome back! This is the #10 blog of the Data Science/Machine Learning Series. So far, we have covered Loss Functions, Tree-based Algorithms, Regressions, and Weight Initialization for neural networks. Today we are going to discuss one of the key Hyper-parameters: Activation Functions. If you have missed the previous blogs, you can catch up on them here.
Neural Networks are a vast topic with many variants, and as the algorithms keep getting better, new concepts emerge every year. Activation Functions help the network learn from the data more effectively: each neuron applies an activation function to the weighted sum of its inputs before passing the result on to the next layer. Pictorially it can be represented as shown.
A neuron has two parts: the summation Σ, which computes the weighted sum of the inputs, and the activation function, which transforms that sum into the neuron's output, as shown in the diagram (a small code sketch of this two-part structure follows the list below). The activation function can be a crucial hyper-parameter while training a Deep Learning model, and it is usually chosen based on the problem we are solving. There are numerous activation functions to discuss; however, we will revise the most popular and important ones. Those are
- Sigmoid Activation Function
- Tanh Activation Function
- ReLU Activation Function
- Leaky ReLU Activation Function
- Softmax Activation Function
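To make the two parts of a neuron concrete, here is a minimal NumPy sketch. The function name and the example numbers are purely illustrative, not from any particular library:

```python
import numpy as np

def neuron_output(x, w, b, activation):
    """One neuron: the summation part (Σ) followed by the activation function."""
    z = np.dot(w, x) + b        # Σ: weighted sum of the inputs plus a bias
    return activation(z)        # the activation function transforms the sum into the output

# Example with a sigmoid activation (revised in the next section)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(neuron_output(np.array([0.5, -1.2]), np.array([0.8, 0.3]), 0.1, sigmoid))
```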
So, without further delay, let's dive in.
Sigmoid
The sigmoid function is given by the equation σ(z) = 1 / (1 + e^(-z)).
It is also known as the logistic function. It maps any input z (ranging from -infinity to +infinity) to a value between 0 and 1. When z = 0, the output is 0.5. As z grows large, the output tends towards 1; as z becomes very negative, the output tends towards 0. The graph of this function is shown below.
The sigmoid function is mainly used for binary classification, as it gives a probability-like output between 0 and 1. One of its disadvantages is that it saturates for very large or very small z: as z approaches +infinity, the output gets ever closer to 1 but never reaches it, and as z approaches -infinity, the output gets ever closer to 0 but never reaches it. In these saturated regions the gradient is almost zero, and since the gradient is always less than 1, stacking many sigmoid layers leads to the vanishing gradient problem. The exponential also makes it more expensive to compute than simpler functions such as ReLU.
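A quick NumPy sketch, just to see the (0, 1) range and the saturation described above:

```python
import numpy as np

def sigmoid(z):
    """Map any real input to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))
# ~[0.000045, 0.269, 0.5, 0.731, 0.99995] -- saturates near 0 and 1 at the extremes
```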
Tanh
Tanh stands for Tan Hyperbolic (the hyperbolic tangent). The function is given by the equation tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)).
This function maps inputs ranging from -infinity to +infinity to values between -1 and +1. Like sigmoid, it can be used for binary classification; if we need outputs ranging from -1 to +1, we can use the Tanh function. Its disadvantages are also similar to sigmoid: the vanishing gradient problem and the extra computational cost of the exponentials for very large or very small values. When z = 0, the output is 0. As z grows large, the output tends towards 1; as z becomes very negative, it tends towards -1. The graph for Tanh is given below.
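A minimal sketch of the formula in NumPy (NumPy also ships np.tanh, which is the numerically safer choice in practice):

```python
import numpy as np

def tanh(z):
    """Map any real input to the range (-1, 1)."""
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(tanh(z))      # ~[-1.0, -0.762, 0.0, 0.762, 1.0]
print(np.tanh(z))   # NumPy's built-in gives the same result
```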
ReLU
ReLU stands for Rectified Linear Unit. This function can be used at the last layer for regression problems with non-negative targets, and it is a useful default choice for hidden layers. The ReLU activation function is given by the simple equation f(z) = max(0, z).
The equation is self-explanatory: if the input is less than zero, the output is fixed to zero; otherwise the output equals the input. Graphically it can be shown as
This function is used widely, as it resolves the vanishing gradient problem to some extent compared to sigmoid or tanh: for positive inputs the gradient is exactly 1, so it does not shrink as it is propagated back through the layers. Since it involves only a max operation, it is also cheaper to compute than the other functions, which require an exponential or a tanh.
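A one-line sketch in NumPy:

```python
import numpy as np

def relu(z):
    """Zero for negative inputs, identity for positive inputs."""
    return np.maximum(0, z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))   # [0.  0.  0.  0.5 3. ]
```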
Leaky ReLU
Leaky ReLU is a modification of the ReLU function. As we have seen, ReLU deactivates neurons with negative inputs (it sets their output to zero). Leaky ReLU solves this problem by allowing some leakage on the negative half: negative inputs are multiplied by a small constant α (for example 0.01) instead of being zeroed out. The equation is given by f(z) = z for z > 0 and f(z) = αz for z ≤ 0.
The graph below illustrates the concept clearly.
This is also a popular activation function for deep neural networks. It keeps the advantages of ReLU while avoiding the problem of neurons shutting down (the "dying ReLU" problem).
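A minimal NumPy sketch of the leakage on the negative side (the default slope 0.01 here is just a common illustrative choice):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Identity for positive inputs, a small slope alpha for negative inputs."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(z))   # [-0.03  -0.005  0.  0.5  3. ]
```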
Softmax
The softmax activation function is mainly used for multi-class classification and is typically placed at the last layer of the neural network. The equation is given by softmax(z_i) = e^(z_i) / Σ_j e^(z_j).
The softmax function returns values between 0 and 1 that sum to 1, so if there are 5 classes, the outputs for the 5 classes together form a probability distribution that adds up to 1. Pictorially it can be represented as
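A short NumPy sketch; subtracting the maximum before exponentiating is a common trick for numerical stability and does not change the result:

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    exp_z = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return exp_z / np.sum(exp_z)

scores = np.array([2.0, 1.0, 0.1, -1.0, 0.5])   # scores for 5 classes
probs = softmax(scores)
print(probs, probs.sum())   # five probabilities, and their sum is 1.0
```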
Customizing the Activation Function
We have seen most of the popular activation functions. There are a few more, such as the linear activation and parametric ReLU, and many variants can be derived from ReLU, sigmoid, etc. This can be done via the numpy library or keras. For example, leaky ReLU multiplies the negative side of ReLU by a small constant; if you make that constant a learnable parameter, you get parametric ReLU. Similarly, the exponential linear unit (ELU) can be created by modifying the negative side of ReLU with an exponential. Which one to use depends on your problem statement: if your model suffers from the vanishing gradient problem, you can choose ReLU; if you don't want the neurons to shut down, you can use parametric ReLU or leaky ReLU, or you can create a custom activation function based on the hyper-parameters those functions expose, as in the sketch below. I highly recommend reading this blog to get more insights.
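As a rough sketch of how a custom activation can be plugged into Keras (assuming TensorFlow's Keras API; the slope 0.05, the layer sizes, and the input shape are only illustrative choices, not recommendations):

```python
import tensorflow as tf

# A custom leaky-ReLU-style activation: identity for positive inputs,
# a small slope (0.05 here, purely as an example) for negative inputs.
def custom_leaky_relu(z):
    return tf.where(z > 0, z, 0.05 * z)

# Any Python callable can be passed as the `activation` argument of a layer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation=custom_leaky_relu, input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),   # softmax at the last layer
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```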
Awesome! We have revised one of the Hyper-parameters of Neural Networks. In the next blog, we will revise the Optimization Functions used in Deep Learning. Till then, take care and see you next time!