Convolutional Neural Network Series 2 — Delving into Convolution: Exploring the Core Operations of CNNs
In the first article of our Convolutional Neural Network (CNN) series, “Convolutional Neural Network Series 1 — Decoding the Visual World: An Introduction to CNNs,” we explored the fundamental concepts and history of CNNs, highlighting their pivotal role in image processing. We discussed the basic idea behind convolution operations, the role of activation functions in CNNs, and provided an overview of simple CNN architectures. This laid a solid foundation for understanding Convolutional Neural Networks.
In this installment, we will delve deeper into one of the most critical components of CNNs: the convolutional layer. Convolutional layers are key in processing and interpreting visual information through the application of filters (or kernels) on input data, extracting features to form feature maps. This process is crucial for CNNs to effectively recognize and understand images. We will thoroughly discuss how filters work, their role in capturing important information from images, and the impact of strides and padding in convolution operations. By deepening our understanding of these concepts, we will be better equipped to harness the power of CNNs for visual information processing and application.
Let’s embark on unraveling the mysteries of the convolutional layer and unlocking the core operations of CNNs.
Fundamentals of Convolutional Layers
Definition and Function of Convolutional Layers
The Convolutional Layer is the cornerstone of a Convolutional Neural Network. Its primary function is feature extraction, achieved by applying a series of learned filters (or kernels) to the input data. Each filter is specifically designed to capture a particular type of feature, such as edges, textures, or specific shapes. As these filters slide (or “convolve”) across the input data, they progressively construct a more abstract and refined representation of the data, crucial for subsequent image recognition and classification tasks.
Understanding Filters and Feature Maps
Filters (or kernels) in a convolutional layer are small windows containing a set of learnable parameters. As these filters slide over the input image, they perform a dot product operation with the local region of the image, producing a new two-dimensional array, known as the feature map. Each feature map represents the activation of certain specific features in the input image. For instance, one filter might focus on capturing vertical edges, while another might recognize blue regions.
This sliding process involves each filter moving across the entire width and height of the image, computing at each position to produce a value. The result is that, for each filter, the network learns important features from the image, which are then mapped out in the feature maps. This method allows the network to utilize these features in subsequent layers for more complex tasks, such as recognizing specific objects or scenes.
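To make this sliding computation concrete, here is a minimal NumPy sketch of the operation just described: a single-channel filter moved across an image with stride 1 and no padding. The function name slide_filter is our own illustrative choice, not a library API.
import numpy as np

def slide_filter(image, kernel):
    # image: 2D array (H x W); kernel: 2D array (kH x kW)
    kh, kw = kernel.shape
    h, w = image.shape
    # 'Valid' output size: the filter must fit entirely inside the image
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product between the kernel and the local image patch
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
Each output value measures how strongly the local patch matches the filter's pattern; collected over all positions, these values form the feature map.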
Through convolutional layers, CNNs are able to capture the rich spatial hierarchy of images, which is key to their success in the field of image processing. Next, we will explore the diverse functions of different types of filters and their role in feature extraction, as well as the influence of stride and padding in convolution operations.
Filters and Feature Extraction
Types and Functions of Filters
Filters play a vital role in Convolutional Neural Networks, as they directly determine the types of features the network can identify and extract. Filters come in various forms, each aimed at capturing specific types of image features.
- Edge Detection Filters: These filters are capable of identifying edges in an image. They work by enhancing the intensity changes near the edges. For example, a horizontal edge detection filter will capture horizontal edges, while a vertical edge detection filter focuses on vertical ones.
- Blurring Filters: These filters are used to smooth an image, reducing its detail and noise. They operate by averaging the pixels in the vicinity, resulting in a smoother image.
- Sharpening Filters: Sharpening filters enhance the details of an image, making it appear clearer. They achieve this by increasing the contrast between a pixel and its neighboring pixels.
The application of these filters in a convolutional neural network allows the network to abstract more complex features from the raw pixels, which is crucial for understanding and classifying images.
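As a rough illustration of these three filter types, here are hand-crafted 3x3 kernels of the kind commonly used for each effect (exact values vary from source to source); any of them can be applied with a sliding-window operation like the slide_filter sketch above.
import numpy as np

# Vertical edge detection (Sobel-style): responds to left-to-right intensity changes
vertical_edge = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]], dtype=float)

# Box blur: replaces each pixel with the average of its 3x3 neighborhood
blur = np.ones((3, 3)) / 9.0

# Sharpen: boosts the center pixel relative to its neighbors
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)
In a CNN, of course, such kernels are not hand-designed; their values are learned from data during training.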
The Process of Generating Feature Maps
The generation of feature maps is accomplished by applying filters to the input data. Let’s understand this process in detail through an example:
- Selecting a Filter: First, decide on the type of filter to use. For instance, we choose a filter designed to detect vertical edges.
- Applying the Filter: Next, this filter slides over the input image, covering different areas of the image. At each position, a dot product operation is performed between the filter and the covered area of the image, and then the results are summed up.
- Generating the Feature Map: The result of each dot product operation forms one pixel on the feature map. As the filter traverses the entire image, a complete feature map is constructed. In our example, this feature map would highlight the vertical edges in the image, as the worked example below shows.
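Putting these three steps together, a tiny worked example (reusing the slide_filter sketch from earlier) shows how a vertical edge in a 4x4 image shows up in the feature map; the image and kernel values here are made up for illustration.
import numpy as np

# 4x4 image: dark left half (0), bright right half (1) -> one vertical edge
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# Vertical edge kernel: negative left column, positive right column
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = slide_filter(image, kernel)
print(feature_map)
# [[3. 3.]
#  [3. 3.]]  -- every 3x3 window straddles the edge, so all responses fire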
Through this process, convolutional layers start from basic visual elements, such as edges, and gradually build up complex visual features, essential for subsequent image classification and recognition tasks. Next, we will explore the roles of stride and padding in convolution operations and how they affect the size and quality of feature maps.
Stride and Padding
Concept and Impact of Stride
Stride refers to the step length with which the filter moves across the input image: the number of pixels the filter shifts each time it moves from one position to the next. Stride is a key parameter in the design of a convolutional layer, as it directly affects the size of the resulting feature map.
- Impact of Stride Size: With a stride of 1, the filter moves one pixel at a time, producing a feature map close in size to the input image. Increasing the stride (e.g., to 2 or more) makes the filter take larger steps, so it visits fewer positions, resulting in smaller feature maps and potentially coarser information capture.
- Considerations in Choosing Stride: Opting for a smaller stride produces more detailed feature maps at a higher computational cost; a larger stride reduces computational load but may miss finer details. The short demo below shows the effect of stride on output size.
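For 'valid' convolution, the output size follows the standard relation output = floor((input - kernel) / stride) + 1. A minimal Keras check on a random 8x8 input (a sketch, not a full training setup):
import numpy as np
from tensorflow.keras.layers import Conv2D

x = np.random.rand(1, 8, 8, 3).astype('float32')  # batch of one 8x8 RGB image
for stride in (1, 2):
    layer = Conv2D(filters=1, kernel_size=(3, 3), strides=stride, padding='valid')
    print(stride, layer(x).shape)
# stride 1 -> (1, 6, 6, 1): floor((8 - 3) / 1) + 1 = 6
# stride 2 -> (1, 3, 3, 1): floor((8 - 3) / 2) + 1 = 3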
Application of Padding
Padding is another important concept in convolutional layers. During the convolution process, extra pixels, usually zero-valued, are sometimes added around the border of the input image. This is done for two main reasons:
- Maintaining Dimensionality: Padding allows the convolutional layer’s output to have the same spatial dimensions as the input. This is particularly important for building deep networks, as it allows for the stacking of multiple layers without rapidly diminishing the size of the feature maps.
- Reducing Information Loss: Without padding, pixels at the image's edges participate in fewer convolution windows than pixels near the center, potentially leading to a loss of edge information. Padding ensures these border areas are used more fully (see the sketch below).
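The dimension-preserving effect is easy to verify in Keras: with padding='same' and a stride of 1, zeros are added around the border so the output keeps the input's spatial size (a minimal sketch):
import numpy as np
from tensorflow.keras.layers import Conv2D

x = np.random.rand(1, 8, 8, 3).astype('float32')
valid = Conv2D(filters=1, kernel_size=(3, 3), padding='valid')(x)  # no padding
same = Conv2D(filters=1, kernel_size=(3, 3), padding='same')(x)    # zero-padded border
print(valid.shape)  # (1, 6, 6, 1) -- shrinks by kernel_size - 1
print(same.shape)   # (1, 8, 8, 1) -- spatial size preserved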
In summary, stride and padding are two key factors that influence the size and quality of the output feature maps in a convolutional layer. They must be carefully chosen and adjusted based on the specific requirements of the task and the network architecture. Next, we will demonstrate these concepts in practice with a coding example.
Practical Coding Example of a Convolutional Layer
Coding Demonstration
To better understand the practical application of convolutional layers, let’s demonstrate how to implement one using the popular deep learning framework, TensorFlow. Here’s a simple example:
import tensorflow as tf
from tensorflow.keras.layers import Conv2D
import numpy as np
# Example input image - here we use a randomly generated 8x8 image
input_image = np.random.rand(8, 8, 3).astype('float32')  # 8x8 size, 3 color channels
input_image = np.expand_dims(input_image, axis=0)  # Adding a batch dimension
# Creating a simple convolutional layer
# We use a single 3x3 filter, with the number of output channels set to 1
conv_layer = Conv2D(filters=1, kernel_size=(3, 3), strides=(1, 1), padding='valid')
# Applying the convolutional layer to the image
output = conv_layer(input_image)
print(output.shape)  # (1, 6, 6, 1): 'valid' padding shrinks 8x8 to 6x6
Code Explanation
- Importing Necessary Libraries: We start by importing TensorFlow and the necessary submodules. NumPy is also needed to handle the input data.
- Preparing Input Data: A random 8x8 image is created, representing a simple three-channel (e.g., RGB) image. np.expand_dims is used to add an extra batch dimension, which TensorFlow expects for its input data.
- Creating the Convolutional Layer: We use Conv2D to create a convolutional layer with a single filter (filters=1) of size 3x3 (kernel_size=(3, 3)), a stride of 1 (strides=(1, 1)), and no padding (padding='valid'). With no padding, the output feature map (6x6 here) is smaller than the input image.
- Applying the Convolutional Layer: Finally, we apply the convolutional layer to the input image to obtain the output feature map.
In this example, we see how a convolutional layer applies a filter to an input image and produces a new feature map. This process captures specific features in the image, which are vital for subsequent image processing tasks. By altering the filter size and count, the stride, and the padding, we can adjust the behavior of the convolutional layer to suit different needs and scenarios in practical applications.
Weight Sharing and the Receptive Field
Weight sharing is a core concept in convolutional neural networks. Unlike traditional fully connected neural networks where each input and output connection has a unique weight, in convolutional layers, the weights of a filter are shared across the entire input data.
- Function of Weight Sharing: This mechanism of weight sharing allows the convolutional layer to efficiently recognize the same feature no matter where it appears in the input. For instance, a filter designed to detect vertical edges can identify these edges anywhere in the image without having to relearn the same feature with new weights.
- Enhancing Efficiency: Weight sharing significantly reduces the number of parameters in the model, lowering computational costs and helping to reduce the risk of overfitting. The parameter-count comparison below makes this concrete.
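The savings can be demonstrated by comparing parameter counts in Keras for the same 8x8 RGB input: one shared 3x3 filter versus a fully connected layer producing an output of the same 6x6 size (a rough sketch; the counts are reported by count_params).
import tensorflow as tf

# One 3x3 filter shared across all positions of the 8x8x3 input
conv_model = tf.keras.Sequential([
    tf.keras.Input(shape=(8, 8, 3)),
    tf.keras.layers.Conv2D(filters=1, kernel_size=(3, 3), padding='valid'),
])
print(conv_model.count_params())  # 28 = 3*3*3 weights + 1 bias

# A fully connected layer with a unique weight per input-output pair
dense_model = tf.keras.Sequential([
    tf.keras.Input(shape=(8, 8, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(36),  # 36 outputs, matching a 6x6 feature map
])
print(dense_model.count_params())  # 6948 = 8*8*3*36 weights + 36 biases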
The receptive field refers to the area of the input data that a single neuron in a convolutional layer observes. In convolutional neural networks, each neuron in the output feature map extracts information from a small window (its receptive field) of the input data.
- Importance of the Receptive Field: The concept of the receptive field is crucial for understanding how convolutional layers capture local features. Each neuron focuses on a small part of the input data, allowing the network to capture local features like edges, corners, or textures.
- Impact of Stacked Convolutional Layers: In multi-layered convolutional networks, the receptive field grows as data passes through each layer. Neurons in deeper layers therefore draw on a larger area of the input, enabling a progression from local to more global feature understanding; the small calculation below shows the growth rate.
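For a stack of stride-1 convolutions with k x k kernels, each layer adds k - 1 pixels to the receptive field, so after L layers a neuron sees 1 + L(k - 1) input pixels in each direction. A small sketch of that arithmetic (generalized to arbitrary strides):
def receptive_field(num_layers, kernel_size, stride=1):
    # Receptive field of one output neuron after stacking identical conv layers
    r, jump = 1, 1  # start from a single input pixel
    for _ in range(num_layers):
        r += (kernel_size - 1) * jump  # each layer widens the view
        jump *= stride                 # strides compound across layers
    return r

for n in (1, 2, 3):
    print(n, receptive_field(n, kernel_size=3))
# 1 -> 3, 2 -> 5, 3 -> 7: two stacked 3x3 layers already see a 5x5 patch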
In conclusion, weight sharing and the concept of the receptive field are two key aspects of the design of convolutional neural networks. They make CNNs efficient and powerful in handling images and other high-dimensional data. Understanding these concepts allows for a better grasp of how CNNs function and their effective application in practical problems.