Understanding ConvMixer (with a simple PyTorch implementation)

Mriganka Nath
4 min read · Oct 17, 2021

A paper recently went viral after a tweet by Andrej Karpathy (https://twitter.com/karpathy/status/1445915220644229124). Its model accomplishes two things at once: 1) it crosses 80% top-1 accuracy on ImageNet-1k, and 2) the whole model architecture actually fits in a tweet! The paper is titled “Patches Are All You Need?”, and it introduces a new model architecture called “ConvMixer”. It is still under review, so we don't know who the authors are.

Link to the paper: https://openreview.net/forum?id=TVHS5Y4dNvM

Before diving into the architecture of ConvMixer, let us see what motivated the authors and how they built on existing ideas to design their new model.

Convolutional neural networks have long dominated computer vision, and now Transformers are creating the buzz. With their powerful architectural design, Transformers have been very successful in NLP, and they are now doing the same in vision. The cost of “self-attention” in these vision transformers is quadratic in the number of input tokens, O(n²), which is why they work on “patches of images” rather than on individual pixels. Patches group small regions of an image into single input features, which makes it feasible to apply the model to larger images. In this paper, the authors ask whether patch embeddings alone are powerful enough to let a simple model reach the level of ResNets and some ViT variants.
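
To get a feel for why patches matter, here is a small back-of-the-envelope sketch. It counts how many token pairs self-attention would compare for a square image. The helper name `attention_pairs`, the 224x224 image size, and the 16x16 patch size are my own illustrative assumptions, not numbers taken from the paper.

```python
def attention_pairs(image_size: int, patch_size: int = 1) -> int:
    """Number of token pairs self-attention compares for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    tokens_per_side = image_size // patch_size
    n_tokens = tokens_per_side ** 2  # n in O(n^2)
    return n_tokens ** 2


# One token per pixel versus one token per 16x16 patch (assumed sizes)
pixel_pairs = attention_pairs(224, patch_size=1)
patch_pairs = attention_pairs(224, patch_size=16)
print(pixel_pairs, patch_pairs, pixel_pairs // patch_pairs)
```

With these assumed sizes, grouping pixels into patches cuts the number of pairs by a factor of 16⁴, which is what makes attention over an image practical.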

Modern architectures consist of layers that mix different features of an image, mainly spatial features and channel-wise features. ConvMixer uses two types of convolutional layers to do this. Depthwise convolution mixes the spatial features, and pointwise convolution mixes the channel-wise features. A depthwise convolution is a grouped convolution with the number of groups equal to the number of channels. A pointwise convolution is a convolution with a 1x1 kernel.
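
The two layer types can be sketched in PyTorch as below. The channel count, kernel size, and input size are illustrative values I picked. `padding="same"` keeps the spatial size unchanged and needs PyTorch 1.9 or later.

```python
import torch
import torch.nn as nn

channels, kernel_size = 8, 3  # illustrative values, not from the paper

# Depthwise: groups == channels, so every channel gets its own spatial filter
depthwise = nn.Conv2d(channels, channels, kernel_size, groups=channels, padding="same")
# Pointwise: a 1x1 kernel mixes the channels at each spatial location
pointwise = nn.Conv2d(channels, channels, kernel_size=1)

x = torch.randn(1, channels, 16, 16)
y = pointwise(depthwise(x))

depthwise_weights = depthwise.weight.numel()  # channels * 1 * k * k
pointwise_weights = pointwise.weight.numel()  # channels * channels * 1 * 1
print(tuple(y.shape), depthwise_weights, pointwise_weights)
```

The weight counts show the split of work: the depthwise layer never looks across channels, and the pointwise layer never looks across neighbouring positions.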

Some design parameters that may affect ConvMixer’s performance are:

  1. the dimension of the patch embeddings
  2. the patch size, which controls the internal resolution of the model. A smaller patch size generates more patch embeddings, so the model may also take longer to run.
  3. the depth of the model, that is, how many times the depthwise and pointwise convolutions are applied
  4. the kernel size of the depthwise convolutions
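
The four parameters above can be collected in a small config object, which also makes the effect of the patch size easy to see. This is only a sketch: `ConvMixerConfig` and its methods are hypothetical names of mine, and the values are illustrative rather than a configuration from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvMixerConfig:
    dim: int          # dimension of the patch embeddings
    patch_size: int   # controls the internal resolution
    depth: int        # number of repeated depthwise + pointwise blocks
    kernel_size: int  # kernel size of the depthwise convolutions

    def internal_resolution(self, image_size: int) -> int:
        # Side length of the feature map after the patch embedding
        return image_size // self.patch_size

    def num_patches(self, image_size: int) -> int:
        return self.internal_resolution(image_size) ** 2


small_patches = ConvMixerConfig(dim=256, patch_size=7, depth=8, kernel_size=5)
large_patches = ConvMixerConfig(dim=256, patch_size=14, depth=8, kernel_size=5)
print(small_patches.num_patches(224))  # 32 * 32 positions
print(large_patches.num_patches(224))  # 16 * 16 positions
```

Halving the patch size gives four times as many positions for every layer to process, which is where the extra run time comes from.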

Now let us discuss the full architecture. An interesting thing about this model is that its internal feature maps keep the same size all the way through. CNNs are generally shaped like a pyramid, where the spatial size decreases from the input to the final output. An architecture where the size and shape remain the same throughout is called an “isotropic” architecture. This design has mainly been used in Vision Transformers.

The official implementation is very short: just 280 characters, to be specific. All you need is “from torch.nn import *” before the model.

The authors also provide a more readable implementation, but I decided to implement it on my own. (You need a recent PyTorch version, 1.9 or later, to run it.)

Every convolutional layer is followed by a GELU activation and a BatchNorm layer. The first convolutional layer has its kernel size and stride equal to the patch size, and we choose its number of output channels, which makes the output shape (output channels x image size / patch size x image size / patch size). This shape is maintained throughout the model. After this comes a block of depthwise and pointwise convolutions that is repeated “d” times; d is the depth, a hyperparameter that can be tuned. Finally, a global average pooling layer and a fully connected layer give the final classification.
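
Putting the description above together, a minimal sketch of the model could look like this. It is my own reconstruction, not the official code. One thing it adds that the paragraph above does not mention is the residual connection around the depthwise step, which is how I understand the paper's block. The function name `conv_mixer` and the small sizes in the usage lines are illustrative assumptions.

```python
import torch
import torch.nn as nn


class Residual(nn.Module):
    """Adds the input back to the output of the wrapped module."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x


def conv_mixer(dim, depth, kernel_size=9, patch_size=7, n_classes=1000):
    def block():
        return nn.Sequential(
            # Depthwise conv mixes spatial features (residual is an assumption, see above)
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            # Pointwise conv mixes channel-wise features
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )

    return nn.Sequential(
        # Patch embedding: kernel size and stride both equal the patch size
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[block() for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)),  # global average pooling
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )


# Tiny illustrative configuration so the shapes are easy to follow
model = conv_mixer(dim=64, depth=2, kernel_size=5, patch_size=4, n_classes=10)
model.eval()
x = torch.randn(2, 3, 32, 32)
with torch.no_grad():
    after_embedding = model[:3](x)  # patch embedding only
    after_blocks = model[:5](x)     # patch embedding + both blocks
    logits = model(x)
print(tuple(after_embedding.shape), tuple(after_blocks.shape), tuple(logits.shape))
```

The shape after the patch embedding and the shape after the repeated blocks are identical, which is the isotropic property in action. Only the pooling and the final linear layer change it.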


To be honest, it is amazing to see that such a small architecture can reach 80% on ImageNet-1k, which suggests that working with patch embeddings may be beneficial. Several training techniques were used to train the model. Augmentation included RandAugment, mixup, CutMix, and random erasing. AdamW was used as the optimizer with a triangular learning rate schedule. The authors believe that the model’s performance can be improved with more layers, a bigger patch size, and some hyperparameter tuning.
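
For readers who have not seen a triangular learning rate schedule, here is a minimal sketch of the idea in plain Python. The function name `triangular_lr`, the peak learning rate, and the position of the peak are assumptions of mine; I am not claiming these are the values or the exact schedule the authors used.

```python
def triangular_lr(step: int, total_steps: int, peak_lr: float,
                  warmup_fraction: float = 0.5) -> float:
    """Linear ramp from 0 up to peak_lr, then linear decay back to 0."""
    if not 0.0 < warmup_fraction < 1.0:
        raise ValueError("warmup_fraction must be strictly between 0 and 1")
    peak_step = warmup_fraction * total_steps
    if step <= peak_step:
        return peak_lr * step / peak_step
    return peak_lr * (total_steps - step) / (total_steps - peak_step)


# Illustrative run: 10 steps, peak of 0.01 reached halfway (assumed values)
schedule = [triangular_lr(s, total_steps=10, peak_lr=0.01) for s in range(11)]
print(schedule)
```

In PyTorch, a function like this could be plugged into `torch.optim.lr_scheduler.LambdaLR` on top of an AdamW optimizer, as long as it returns a multiplier of the base learning rate rather than the learning rate itself.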


