Efficient Convolutional Neural Network with MobileNet
Reduce CNN operations with depth-wise and point-wise convolutions
Problem
A regular convolution is quite resource-intensive. Suppose we have an input image of size \(D_x \times D_x \times M\) and a convolution block with a kernel of size \(D_k \times D_k \times M\). First, the conv block does element-wise multiplication with a segment of the input image. Then, we sum the resulting values over the whole \(D_k \times D_k \times M\) volume into a single output value. The kernel then slides by one block to the right (if the stride is 1), and the whole process is repeated until all cells are covered. $N$ kernels are used to obtain $N$ channels as output.

Regular Convolution (Source: https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)
The operations of a regular conv block are:
One conv block's operations at one position: \(D_k \cdot D_k \cdot M\)
Number of positions the block slides over the input (stride 1, same padding): \(D_x \cdot D_x\)
Using $N$ conv blocks, the total number of operations is:
$$T = D_k \cdot D_k \cdot M \cdot D_x \cdot D_x \cdot N$$
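This count can be checked with a small helper function. It is a minimal sketch: the names mirror the symbols above, and it assumes stride 1 with 'same' padding, so the kernel visits \(D_x \cdot D_x\) output positions.

```python
def regular_conv_ops(D_x, D_k, M, N):
    """Multiply-accumulate count of a regular convolution layer.

    Assumes stride 1 and 'same' padding, so there are D_x * D_x output
    positions; each of the N kernels does D_k * D_k * M multiplications
    at every position.
    """
    return D_k * D_k * M * D_x * D_x * N
```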
Solution
One key issue is that the conv multiplication is repeated for all $N$ blocks. For example, one conv block is applied to the top-left corner of the input; the second block then re-computes everything over that same region. The idea of a depth-wise separable convolution block is to avoid repeating that element-wise multiplication.
Each regular conv block does multiplications followed by additions. In depth-wise separable convolution, the process is split in two: one layer handles the per-channel multiplication (spatial filtering) and another handles the addition across channels.
- Depthwise Convolution: This is similar to a regular conv block, except it does not perform the addition across all channels; each input channel is filtered independently with its own \(D_k \times D_k\) kernel. Hence, the number of operations is \(D_k \cdot D_k \cdot M \cdot D_x \cdot D_x\).

Depthwise Convolution (Source: https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)
- Pointwise Convolution: This is a \(1 \times 1 \times M\) kernel that combines the channels. $N$ kernels are used to obtain $N$ channels as output. The number of operations: \(D_x \cdot D_x \cdot M \cdot N\).

Pointwise Convolution (Source: https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)
Total operations of Depth-wise Separable Convolution:
$$(D_k \cdot D_k \cdot M \cdot D_x \cdot D_x) + (D_x \cdot D_x \cdot M \cdot N)$$
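The same counting can be sketched for the separable version; the two terms correspond to the depthwise and pointwise layers described above (names are illustrative and match the helper from the previous section).

```python
def depthwise_separable_ops(D_x, D_k, M, N):
    """Multiply-accumulate count of a depthwise separable convolution."""
    depthwise = D_k * D_k * M * D_x * D_x  # one D_k x D_k filter per input channel
    pointwise = D_x * D_x * M * N          # N kernels of size 1 x 1 x M
    return depthwise + pointwise
```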
Let's compare:
$$\frac{(D_k \cdot D_k \cdot M \cdot D_x \cdot D_x) + (D_x \cdot D_x \cdot M \cdot N)}{D_k \cdot D_k \cdot M \cdot D_x \cdot D_x \cdot N} =\frac{1}{N} + \frac{1}{D_k \cdot D_k}$$
Here we can see that the depth-wise approach requires far fewer operations than the regular conv block. For example, with \(D_k = 3\) and \(N = 128\), the ratio is \(1/128 + 1/9 \approx 0.12\), roughly an 8x reduction.
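Plugging in some concrete (purely illustrative) sizes makes the saving tangible. Using the two helper functions defined above:

```python
# Illustrative sizes: 32x32 input, 3x3 kernel, 64 input channels, 128 output channels.
regular = regular_conv_ops(32, 3, 64, 128)            # 75,497,472 operations
separable = depthwise_separable_ops(32, 3, 64, 128)   # 8,978,432 operations
print(regular / separable)                            # ~8.4, i.e. roughly 8x fewer
```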
How to use
Instead of using a regular convolution layer, we use two separate operations - a depthwise convolution followed by a pointwise convolution (see the sketch after the figure below).

Regular Conv vs. Depthwise Separable Conv (Source: Original research paper)
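As a concrete illustration, here is a minimal PyTorch sketch of such a block, following the conv -> batch norm -> ReLU layout shown in the figure. The class name and default arguments are choices made for this example, not code from the original paper.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution followed by pointwise convolution."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: groups=in_channels gives each channel its own D_k x D_k filter.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        # Pointwise: a 1 x 1 convolution combines the M channels into N outputs.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x
```

Stacking blocks like this in place of regular \(3 \times 3\) convolutions is essentially how the MobileNet body is built.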




