MLP Mixer - MLP in Computer Vision
For image tasks, CNNs and Transformers have been the standard choices (I will definitely talk about Transformers in computer vision in the future).
However, with the arrival of MLP-Mixer, models built on convolutions or attention, acceptable as their performance is, may eventually be rendered obsolete.
Basic Structure
Two types of MLP layers are applied: one that acts within each image patch and one that acts across the image patches. The former, the channel-mixing MLP, is described as "mixing the per-location features"; the latter, the token-mixing MLP, as "mixing the spatial information."
The input is a patches × channels matrix; its rows (one per patch) are called tokens. The details are described below.
This is the overall structure of MLP-Mixer. The channel-mixing MLP is applied to each token independently and mixes features across its channels; the token-mixing MLP is applied to each channel independently and mixes features across tokens, i.e. across spatial locations.
Channel-mixing is similar to a CNN: it can be seen as a 1×1 convolutional filter applied over the patch grid.
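As a quick sanity check of that equivalence, here is a small PyTorch sketch (my own, not from the paper) showing that a linear layer shared across tokens gives the same result as a 1×1 convolution over the patch grid with the same weights.

```python
import torch
import torch.nn as nn

# Illustrative sizes: a 4x4 grid of patches (S = 16) with C = 8 channels.
S, C = 16, 8
tokens = torch.randn(S, C)

fc = nn.Linear(C, C, bias=False)                      # channel-mixing: one Linear shared by every token
conv = nn.Conv2d(C, C, kernel_size=1, bias=False)
conv.weight.data = fc.weight.data.view(C, C, 1, 1)    # copy the same weights into the 1x1 conv

out_fc = fc(tokens)                                   # (S, C)
grid = tokens.t().reshape(1, C, 4, 4)                 # lay the tokens out on the patch grid
out_conv = conv(grid).view(C, S).t()                  # back to (S, C)

print(torch.allclose(out_fc, out_conv, atol=1e-6))    # True: the two operations match
```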
Architecture
Models for computer vision are built from two kinds of layers:
- a layer that mixes features within a given spatial location,
- a layer that mixes features between different spatial locations.
In MLP-Mixer, these two operations are kept strictly separate.
The input image of size (3, H, W) is divided into image patches of size (3, P, P), giving $S = HW / P^2$ patches.
Each patch is then linearly projected into a hidden dimension C, so the input to the mixer layers is an S × C matrix.
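As an illustration, here is a rough PyTorch sketch of this patch-embedding step; the sizes (a 224×224 image, P = 16, C = 512) and the variable names are my own assumptions, not values from the post.

```python
import torch
import torch.nn as nn

# Illustrative sizes: a 224x224 RGB image, patch size P = 16, hidden dimension C = 512.
H, W, P, C = 224, 224, 16, 512
S = (H * W) // (P * P)                        # number of patches: S = HW / P^2 = 196

image = torch.randn(1, 3, H, W)

# The per-patch linear projection is commonly implemented as a strided convolution
# whose kernel size and stride both equal the patch size.
to_tokens = nn.Conv2d(3, C, kernel_size=P, stride=P)

tokens = to_tokens(image)                     # (1, C, H/P, W/P)
tokens = tokens.flatten(2).transpose(1, 2)    # (1, S, C): the patches x channels table
print(tokens.shape)                           # torch.Size([1, 196, 512])
```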
This is the token-mixing layer. The input first goes through layer norm, is then transposed so that the MLP operates along the patch (token) dimension, and passes through the MLP.
After the features go through the token-mixing MLP, they are transposed back. Here, skip connections come into play.
Skip connections are a type of shortcut that connects the output of one layer to the input of another layer that is not adjacent to it.
The skip connection adds the original matrix back to the token-mixing output; the result then goes through another layer norm, followed by a second MLP that mixes features across the channels.
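To make the data flow concrete, here is a minimal PyTorch sketch of one mixer layer; the class and argument names (`MlpBlock`, `MixerBlock`, `d_s`, `d_c`) and the example sizes are my own, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MlpBlock(nn.Module):
    """Two fully connected layers with a GELU in between."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class MixerBlock(nn.Module):
    """One mixer layer: token mixing, then channel mixing, each with a skip connection."""
    def __init__(self, num_patches, channels, d_s, d_c):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mix = MlpBlock(num_patches, d_s)   # MLP along the patch (token) axis
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mix = MlpBlock(channels, d_c)    # MLP along the channel axis

    def forward(self, x):                             # x: (batch, S, C)
        # Token mixing: norm, transpose so the MLP sees the S axis, transpose back, add skip.
        y = self.norm1(x).transpose(1, 2)             # (batch, C, S)
        x = x + self.token_mix(y).transpose(1, 2)     # (batch, S, C)
        # Channel mixing: norm, MLP over the channels, add skip.
        x = x + self.channel_mix(self.norm2(x))
        return x

# Quick shape check with illustrative sizes (S = 196 patches, C = 512 channels).
block = MixerBlock(num_patches=196, channels=512, d_s=256, d_c=2048)
print(block(torch.randn(2, 196, 512)).shape)          # torch.Size([2, 196, 512])
```

Stacking several of these blocks gives the body of the network.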
These operations can be written as the equations below, where $\sigma$ is an element-wise non-linearity and $D_S$ and $D_C$ are the hidden widths of the token-mixing and channel-mixing MLPs; both are chosen independently of the number of input patches and of the patch size.
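Following the paper's notation ($\mathbf{X}$ is the $S \times C$ input table, $\mathbf{U}$ the output of token mixing, $\mathbf{Y}$ the layer output, and $\mathbf{W}_1, \ldots, \mathbf{W}_4$ the MLP weight matrices):

$$\mathbf{U}_{*,i} = \mathbf{X}_{*,i} + \mathbf{W}_2\,\sigma\!\left(\mathbf{W}_1\,\mathrm{LayerNorm}(\mathbf{X})_{*,i}\right), \qquad i = 1, \ldots, C$$

$$\mathbf{Y}_{j,*} = \mathbf{U}_{j,*} + \mathbf{W}_4\,\sigma\!\left(\mathbf{W}_3\,\mathrm{LayerNorm}(\mathbf{U})_{j,*}\right), \qquad j = 1, \ldots, S$$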
After the mixer layers, global average pooling is applied before a linear classifier head yields the predictions.
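A minimal sketch of such a head in PyTorch, assuming the `(batch, S, C)` output of the mixer layers above; the pre-pooling layer norm and the names are my own choices.

```python
import torch
import torch.nn as nn

class MixerHead(nn.Module):
    """Layer norm, global average pooling over the patches, then a linear classifier."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):               # x: (batch, S, C) from the mixer layers
        x = self.norm(x).mean(dim=1)    # global average pooling over the S patches
        return self.fc(x)               # (batch, num_classes)

print(MixerHead(channels=512, num_classes=1000)(torch.randn(2, 196, 512)).shape)
# torch.Size([2, 1000])
```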
Experiments
The results suggest that, while the accuracy is not outstanding compared to existing computer vision models, MLP-Mixer is more efficient. In addition, its performance improves with larger pre-training datasets.
In a further experiment, the original images are transformed in two ways:
- Patch + Pixel Shuffling
- Global Shuffling
The former mixes the order of the patches and also randomly mixes the pixels inside each patch. The latter mixes the pixels across the whole image.
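A rough PyTorch sketch (my own, not the authors' code) of the two transformations; both permutations are sampled once and then reused for every image.

```python
import torch

def make_permutations(h=224, w=224, p=16, seed=0):
    g = torch.Generator().manual_seed(seed)
    patch_perm = torch.randperm((h // p) * (w // p), generator=g)  # new order of the patches
    pixel_perm = torch.randperm(p * p, generator=g)                # new order of pixels inside a patch
    global_perm = torch.randperm(h * w, generator=g)               # new order of all pixels
    return patch_perm, pixel_perm, global_perm

def patch_pixel_shuffle(img, patch_perm, pixel_perm, p=16):
    """Patch + pixel shuffling: permute the patch order and the pixels inside each patch."""
    c, h, w = img.shape
    patches = img.unfold(1, p, p).unfold(2, p, p)       # (3, H/P, W/P, P, P)
    patches = patches.reshape(c, -1, p * p)             # (3, S, P*P)
    patches = patches[:, patch_perm][:, :, pixel_perm]  # apply both permutations
    patches = patches.reshape(c, h // p, w // p, p, p)
    return patches.permute(0, 1, 3, 2, 4).reshape(c, h, w)

def global_shuffle(img, global_perm):
    """Global shuffling: permute every pixel of the image."""
    c, h, w = img.shape
    return img.reshape(c, -1)[:, global_perm].reshape(c, h, w)

patch_perm, pixel_perm, global_perm = make_permutations()
img = torch.randn(3, 224, 224)
print(patch_pixel_shuffle(img, patch_perm, pixel_perm).shape, global_shuffle(img, global_perm).shape)
```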
MLP-Mixer is not affected by the order of the patches, so the first transformation leaves its results unchanged. Global shuffling, however, does hurt it, because the input no longer respects the patch grouping the model relies on. For CNNs, both transformations degrade the results.
The full paper can be found here.