For image tasks, CNNs and Transformers have been the standard approaches (Transformers in computer vision will definitely be covered in a future post).

However, with the introduction of MLP-Mixer, although the performance of CNN- and attention-based models is acceptable, they may eventually be rendered obsolete.

Basic Structure

There are two types of layers: one applies MLPs to each image patch independently, and the other applies MLPs across the image patches. The former is also described as “mixing the per-location features” and the latter as “mixing the spatial information.” The former is the channel-mixing MLP, and the latter is the token-mixing MLP.

The input is a patches × channels matrix, and each row of this matrix is called a token. The details are covered below.


This is the overall structure of MLP-Mixer. The channel-mixing MLP is applied to each token independently, mixing information across its channels. The token-mixing MLP is applied to each channel independently, mixing information across the different spatial locations.

Channel-mixing is similar to a CNN with 1x1 convolutional filters.
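
To see this equivalence concretely, here is a small PyTorch sketch (the tensor sizes are arbitrary and only for illustration) showing that a linear layer applied at every spatial location gives the same output as a 1x1 convolution with the same weights:

```python
import torch
import torch.nn as nn

# A linear layer applied independently at each spatial location (channel mixing)
# is the same operation as a 1x1 convolution with identical weights.
C_in, C_out, H, W = 8, 16, 4, 4
linear = nn.Linear(C_in, C_out)
conv1x1 = nn.Conv2d(C_in, C_out, kernel_size=1)

# Copy the linear weights and bias into the 1x1 conv.
conv1x1.weight.data = linear.weight.data.view(C_out, C_in, 1, 1).clone()
conv1x1.bias.data = linear.bias.data.clone()

x = torch.randn(1, C_in, H, W)
out_conv = conv1x1(x)                                     # (1, C_out, H, W)
out_lin = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(torch.allclose(out_conv, out_lin, atol=1e-5))       # True
```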

Architecture

Models for computer vision are made up of two kinds of layers:

  1. Layers that mix features at a given spatial location,
  2. Layers that mix features between different spatial locations.

In MLP-Mixer, these two operations are strictly separated.

The input image of size (3, H, W) is divided into image patches of size (3, P, P). This yields $S = HW / P^2$ image patches.

Each patch is then linearly projected to a hidden dimension C, so the input becomes a table of shape S × C.
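
As a rough sketch of this patch-embedding step (the class name `PatchEmbed` and the default sizes are illustrative, not from the paper), the per-patch projection can be implemented as a strided convolution whose kernel and stride equal the patch size:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a (3, H, W) image into P x P patches and project each to C channels."""
    def __init__(self, patch_size=16, in_channels=3, hidden_dim=512):
        super().__init__()
        # A Conv2d with kernel_size = stride = P is equivalent to a
        # per-patch linear projection of the flattened patch.
        self.proj = nn.Conv2d(in_channels, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):            # x: (batch, 3, H, W)
        x = self.proj(x)             # (batch, C, H/P, W/P)
        x = x.flatten(2)             # (batch, C, S) with S = HW / P^2
        return x.transpose(1, 2)     # (batch, S, C): S tokens of width C

# Example: a 224x224 image with P = 16 gives S = 196 patches.
img = torch.randn(1, 3, 224, 224)
tokens = PatchEmbed()(img)
print(tokens.shape)                  # torch.Size([1, 196, 512])
```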

Token Mixing Layer

This is the token-mixing layer. The input table first goes through layer normalization, is then transposed so that the MLP operates across the patch dimension, and finally passes through the MLP.
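
A minimal sketch of the token-mixing layer, assuming a standard two-layer MLP with a GELU non-linearity in between (the class names are illustrative):

```python
import torch
import torch.nn as nn

class MlpBlock(nn.Module):
    """Two fully connected layers with a GELU non-linearity in between."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

class TokenMixing(nn.Module):
    """LayerNorm, transpose so the MLP mixes across the S patches, transpose back."""
    def __init__(self, num_patches, channels, token_hidden_dim):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.mlp = MlpBlock(num_patches, token_hidden_dim)

    def forward(self, x):             # x: (batch, S, C)
        y = self.norm(x)              # normalize over the channel dimension
        y = y.transpose(1, 2)         # (batch, C, S): the MLP now acts across patches
        y = self.mlp(y)
        return y.transpose(1, 2)      # back to (batch, S, C)
```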

Channel Mixing Layer

After the features go through the token-mixing MLP, they are transposed back. Here, the concept of a skip connection comes in.

Skip connections are a type of shortcut that connects the output of one layer to the input of another layer that is not adjacent to it.

This allows the token-mixing output to be added to the original input matrix, which then goes through another layer norm. After the norm, a second MLP is applied so that features are mixed across the channels, again followed by a skip connection.
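
Putting the two mixings and the skip connections together, a full Mixer block might look like the following sketch (it reuses `MlpBlock` from the token-mixing sketch above; the names are illustrative):

```python
class MixerBlock(nn.Module):
    """One Mixer layer: token mixing and channel mixing, each wrapped in a skip connection.
    Reuses MlpBlock from the token-mixing sketch above."""
    def __init__(self, num_patches, channels, token_hidden_dim, channel_hidden_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mlp = MlpBlock(num_patches, token_hidden_dim)
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mlp = MlpBlock(channels, channel_hidden_dim)

    def forward(self, x):                              # x: (batch, S, C)
        # Token mixing: norm, transpose, MLP across patches, transpose back,
        # then add the original input (skip connection).
        y = self.norm1(x).transpose(1, 2)              # (batch, C, S)
        x = x + self.token_mlp(y).transpose(1, 2)      # (batch, S, C)
        # Channel mixing: norm, MLP across channels, plus a second skip connection.
        return x + self.channel_mlp(self.norm2(x))
```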

Equations for the layers

In the equations below, $\sigma$ is an element-wise non-linearity, and $D_S$ and $D_C$ are the hidden widths of the token-mixing and channel-mixing MLPs. They are selected independently of the number of input patches and the patch size.
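
Using the paper's notation, where $\mathbf{X} \in \mathbb{R}^{S \times C}$ is the input table, the two operations of a Mixer layer can be written as:

$$
\begin{aligned}
\mathbf{U}_{*,i} &= \mathbf{X}_{*,i} + \mathbf{W}_2\,\sigma\big(\mathbf{W}_1\,\mathrm{LayerNorm}(\mathbf{X})_{*,i}\big), && i = 1 \ldots C,\\
\mathbf{Y}_{j,*} &= \mathbf{U}_{j,*} + \mathbf{W}_4\,\sigma\big(\mathbf{W}_3\,\mathrm{LayerNorm}(\mathbf{U})_{j,*}\big), && j = 1 \ldots S.
\end{aligned}
$$

The first line is token mixing, applied to each column (channel $i$) independently; the second is channel mixing, applied to each row (token $j$) independently.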

After the Mixer layers, global average pooling is applied, followed by a linear classifier that yields the predictions.
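
Assembling everything, a complete model is just the patch embedding, a stack of Mixer blocks, global average pooling over the patches, and a linear head. This sketch reuses `PatchEmbed` and `MixerBlock` from above; the default hyperparameters roughly follow the small Mixer-S/16 configuration, but treat them as placeholders:

```python
class MlpMixer(nn.Module):
    """Patch embedding, a stack of Mixer blocks, global average pooling, linear classifier."""
    def __init__(self, num_classes=1000, num_blocks=8, patch_size=16,
                 channels=512, token_hidden_dim=256, channel_hidden_dim=2048,
                 image_size=224):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.embed = PatchEmbed(patch_size, 3, channels)
        self.blocks = nn.Sequential(*[
            MixerBlock(num_patches, channels, token_hidden_dim, channel_hidden_dim)
            for _ in range(num_blocks)])
        self.norm = nn.LayerNorm(channels)
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):                  # x: (batch, 3, H, W)
        x = self.blocks(self.embed(x))     # (batch, S, C)
        x = self.norm(x).mean(dim=1)       # global average pooling over the patches
        return self.head(x)                # (batch, num_classes)
```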

Experiments

Although the results suggest that accuracy is not outstanding compared to existing computer vision models, efficiency is high compared to them. In addition, performance increases as the pre-training dataset grows.

Experiment

The original images are transformed in two ways:

  1. Patch + Pixel Shuffling
  2. Global Shuffling

The former shuffles the order of the patches and also randomly shuffles the pixels inside each patch. The latter shuffles pixels across the entire image.
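
Here is a rough sketch of the two shuffling schemes, assuming one fixed permutation shared by every image in the dataset (the function names are mine, not from the paper). The first variant returns the shuffled patch table directly, which is what the Mixer sees after patch extraction:

```python
import torch

def patch_pixel_shuffle(img, patch_size, patch_perm, pixel_perm):
    """Shuffle the order of patches and the pixel order inside each patch,
    using fixed permutations. img: (3, H, W)."""
    c, _, _ = img.shape
    p = patch_size
    patches = img.unfold(1, p, p).unfold(2, p, p)      # (3, H/P, W/P, P, P)
    patches = patches.reshape(c, -1, p * p)            # (3, S, P*P)
    return patches[:, patch_perm][:, :, pixel_perm]    # reorder patches, then pixels

def global_shuffle(img, pixel_perm):
    """Shuffle all pixels of the image with one fixed permutation."""
    c, h, w = img.shape
    return img.reshape(c, -1)[:, pixel_perm].reshape(c, h, w)

# Example permutations for 224x224 images with 16x16 patches.
patch_perm = torch.randperm(14 * 14)
pixel_perm_patch = torch.randperm(16 * 16)
pixel_perm_global = torch.randperm(224 * 224)
```

Because the Mixer has no built-in notion of pixel locality, a fixed re-ordering of the patches or of the pixels within each patch does not change what its MLPs can learn, which is why the first transformation leaves its accuracy essentially unchanged while global shuffling does not.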

Result

MLP-Mixer was unaffected by the order of the patches (and of the pixels within each patch). However, global shuffling did affect the result, because the input no longer respects the patch grouping the model relies on. For the CNN, both kinds of shuffling degraded the results.

The full paper can be found here.