Efficient Convolutional Neural Network with MobileNet
Reduce CNN operations with depth-wise and point-wise convolutions
Problem
A regular convolution is quite resource-intensive. Suppose we have an input image of size \(D_x \times D_x \times M\) and a convolution block with a kernel of size \(D_k \times D_k \times M\). First, the conv block does element-wise multiplication with a segment of the input image. Then, we sum the resulting values over the whole \(D_k \times D_k \times M\) volume into a single output value. The kernel then slides by one block to the right (if the stride is 1), and the whole process is repeated until all cells are covered. $N$ kernels are used to obtain $N$ channels as output.

Regular Convolution (Source: https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)
The operations of a regular conv block are:
One conv block's operations at one position: \(D_k \cdot D_k \cdot M\)
Number of positions the block slides over the input (stride 1, same padding): \(D_x \cdot D_x\)
Using $N$ conv blocks, the total number of operations is:
$$T = D_k \cdot D_k \cdot M \cdot D_x \cdot D_x \cdot N$$
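This count can be checked with a small helper function. It is a minimal sketch: the names mirror the symbols above, and it assumes stride 1 with 'same' padding, so the kernel visits \(D_x \cdot D_x\) output positions.

```python
def regular_conv_ops(D_x, D_k, M, N):
    """Multiply-accumulate count of a regular convolution layer.

    Assumes stride 1 and 'same' padding, so there are D_x * D_x output
    positions; each of the N kernels does D_k * D_k * M multiplications
    at every position.
    """
    return D_k * D_k * M * D_x * D_x * N
```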
Solution
One key issue is that the conv multiplication is repeated for all $N$ blocks. For example, one conv block is applied to the top-left corner of the input; the second block then re-computes everything over that same region. The idea of a depth-wise separable convolution block is to avoid repeating that element-wise multiplication.
Each regular conv block does multiplications followed by additions. In depth-wise separable convolution, the process is split in two: one layer handles the per-channel multiplication (spatial filtering) and another handles the addition across channels.
- Depthwise Convolution: This is similar to a regular conv block, except it does not perform the addition across all channels; each input channel is filtered independently with its own \(D_k \times D_k\) kernel. Hence, the number of operations is \(D_k \cdot D_k \cdot M \cdot D_x \cdot D_x\).

Depthwise Convolution (Source: https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)
- Pointwise Convolution: This is a \(1 \times 1 \times M\) kernel that combines the channels. $N$ kernels are used to obtain $N$ channels as output. The number of operations: \(D_x \cdot D_x \cdot M \cdot N\).

Pointwise Convolution (Source: https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)
Total operations of Depth-wise Separable Convolution:
$$(D_k \cdot D_k \cdot M \cdot D_x \cdot D_x) + (D_x \cdot D_x \cdot M \cdot N)$$
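The same counting can be sketched for the separable version; the two terms correspond to the depthwise and pointwise layers described above (names are illustrative and match the helper from the previous section).

```python
def depthwise_separable_ops(D_x, D_k, M, N):
    """Multiply-accumulate count of a depthwise separable convolution."""
    depthwise = D_k * D_k * M * D_x * D_x  # one D_k x D_k filter per input channel
    pointwise = D_x * D_x * M * N          # N kernels of size 1 x 1 x M
    return depthwise + pointwise
```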
Let's compare:
$$\frac{(D_k \cdot D_k \cdot M \cdot D_x \cdot D_x) + (D_x \cdot D_x \cdot M \cdot N)}{D_k \cdot D_k \cdot M \cdot D_x \cdot D_x \cdot N} =\frac{1}{N} + \frac{1}{D_k \cdot D_k}$$
Here we can see that the depth-wise approach requires far fewer operations than the regular conv block. For example, with \(D_k = 3\) and \(N = 128\), the ratio is \(1/128 + 1/9 \approx 0.12\), roughly an 8x reduction.
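Plugging in some concrete (purely illustrative) sizes makes the saving tangible. Using the two helper functions defined above:

```python
# Illustrative sizes: 32x32 input, 3x3 kernel, 64 input channels, 128 output channels.
regular = regular_conv_ops(32, 3, 64, 128)            # 75,497,472 operations
separable = depthwise_separable_ops(32, 3, 64, 128)   # 8,978,432 operations
print(regular / separable)                            # ~8.4, i.e. roughly 8x fewer
```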
How to use
Instead of using a regular convolution layer, we use two separate operations - a depthwise convolution followed by a pointwise convolution (see the sketch after the figure below).

Regular Conv vs. Depthwise Separable Conv (Source: Original research paper)
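As a concrete illustration, here is a minimal PyTorch sketch of such a block, following the conv -> batch norm -> ReLU layout shown in the figure. The class name and default arguments are choices made for this example, not code from the original paper.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution followed by pointwise convolution."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: groups=in_channels gives each channel its own D_k x D_k filter.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        # Pointwise: a 1 x 1 convolution combines the M channels into N outputs.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x
```

Stacking blocks like this in place of regular \(3 \times 3\) convolutions is essentially how the MobileNet body is built.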




