Vision Transformer (ViT) under the magnifying glass, Part 1

Embeddings

Kate Yurkova
39 min read · Jan 17, 2023

Introduction

Transformers in NLP and Vision Transformers (ViTs) for Computer Vision have been around for a while. First, “Attention Is All You Need”¹ revolutionized the field of NLP, and then “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”² had a tremendous impact on Computer Vision. The latter is the paper that we’ll dissect in this article. Both works weren’t the first to attempt to use attention in their fields, but they were the first to propose simple and effective architectures. Since then there have been a lot of interesting uses of attention and self-attention layers in Computer Vision. Here are a few:

  • “End-to-End Object Detection with Transformers”³, was released a bit earlier than the ViT we are reviewing (object detection);
  • “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows” (classification);
  • “BEIT: BERT Pre-Training of Image Transformers” (classification);
  • “Emerging Properties in Self-Supervised Vision Transformers” (self-supervised pre-training);
  • “LoFTR: Detector-Free Local Feature Matching with Transformers” (image feature matching);
  • “Injecting Semantic Concepts into End-to-End Image Captioning” (image captioning);
  • “PCT: Point Cloud Transformer” (3D point cloud segmentation);
  • “TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up”¹⁰ (image generation);
  • “StyTr2: Image Style Transfer with Transformers”¹¹ (style transfer);
  • “Zero-Shot Text-to-Image Generation”¹² (text-to-image);
  • Etc.

It is definitely very exciting to witness a certain paradigm shift and to see alternative or at least complementary layers to the plain old convolution. This is possibly one of the reasons why vision transformers have been widely talked about, along with their success in various tasks of computer vision. So most likely you’ve already heard about them and might even know how they work at some level. But if you still feel like there are a few pieces of the puzzle missing, then this series of articles is for you. They are meant to glue the important details together in an interactive walkthrough over the classic ViT and answer some engineering and theoretical questions which can arise when seeing this architecture for the first time. We are going to look critically at all layers and explore their properties together with the motivation behind them, according to the authors of the paper and based on other works.

Luckily, many visual transformers from recent publications aim to improve ViT by introducing only incremental changes; at least this holds true at the time of publishing this article. It means that understanding how it is built will help you understand other research in the domain better. In fact, some works will be briefly discussed here for their contribution to mitigating specific issues present in the model.

Here is a sneak peek into questions that will be addressed in this article:

  • What layers are there in the network and how to implement them?
  • How many parameters does each layer introduce?
  • How are the weights shared over the input sequence?
  • How is the transition from a convolutional to an attention layer made?
  • Where are the activations and what type are they?
  • What are the motivation, strengths, and weaknesses behind various design choices?
  • For example, what is a LayerNorm and why not use BatchNorm?
  • What is the [CLS] token and did anyone try anything else instead for the same purpose?
  • What are positional masks and head masks and when are they used?
  • How to do inference in a different image resolution?
  • How is the whole network trained and are there nuances compared to training a CNN?
  • Etc.

All of these will be answered while moving through the layers one by one.

The prerequisite for the material presented below is familiarity with the general concepts of building and training a neural network. Prior knowledge in vision will be very helpful, but prior knowledge in NLP is not required. Regardless, it can be useful to anyone who wants to understand any kind of transformer better.

The architecture

The task that the ViT from “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”² solves is image classification. As with any classification network, it can also serve as a backbone for other tasks and can even be combined with a CNN. Below you can see the layers the ViT is composed of. They are listed with references to the sections in the article series discussing them. We’ll take a look at the code for each layer, but please note that the original implementation is written in JAX, while this explanation is based on the implementation of the Hugging Face model “google/vit-base-patch16-224” in PyTorch and is verified with the paper and with the source code.

Figure 1: Model overview, taken from “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”²
1 Embedding (Stem)
  1.1 Patch Embeddings
  1.2 (Optional) Positional masking
  1.3 [CLS] token
  1.4 Positional embedding
  1.5 Dropout
2 Encoder
  ViTLayer x N
    2.1 Attention
      2.1.1 Layer Normalization
      2.1.2 MHSA
        Dropout
      2.1.3 (Optional) Head masking
      2.1.4 Linear
        Dropout
    2.2 MLP
      Layer Normalization
      Linear
      GeLU
      Linear
      Dropout
  Layer Normalization
3 Classifier

The authors chose this architecture deliberately to prove a very powerful yet simple idea: the great success of Transformers¹ in NLP can be brought to Vision almost out of the box. What it means in practice is that ViT looks like the encoder part of the Transformer, but does have a few differences. The most obvious one is that the input is an image, so it has to be somehow represented as a sequence. Apart from that, unlike in the Transformer, positional embedding is learnable and normalization precedes each block. We will, of course, get back to these discrepancies later.

It can be said with certainty that they succeeded in bringing NLP and Vision closer together. For example, we can see relatively recent works like “BEiT: BERT Pre-Training of Image Transformers” and similar ones, which aim to mimic the masked language modeling task with masked image modeling. More on this when we get to the masking stage. On the other hand, some progress is being made by applying convolutional prior knowledge to Vision Transformers, which we’ll also discuss later. One direction where both fields meet is making transformers faster and more efficient to enable real-time applications. This is relevant to both Computer Vision and NLP, which is why optimization ideas are often shared. A few examples of works that aim to speed up Vision Transformers are ResT¹³, EfficientFormer¹⁴, EdgeViT¹⁵, etc.

In general, there are probably hundreds of ViT-related papers by now. To be able to understand those, it is essential to have a good understanding of the base, so let’s jump into unraveling the architecture. Based on the main stages in it, this article will be split into 3 parts:

  1. Part 1: Embeddings (you are here)
  2. Part 2: Encoder (in progress)
  3. Part 3: A classification head, model sizes, training, inference, etc (in progress)

Additionally, there is also a Colab Notebook where you can experiment with all the code you’ll find below.

Table of Contents

  1. Embedding (Stem)
    1.1 Patch Embeddings
    1.2 (Optional) Positional masking
    1.3 [CLS] token
    1.4 Positional embedding
    1.5 Dropout

Summary
Acknowledgments
References

1. Embeddings (Stem)

The embedding is the stage that is there to prepare an input image for the Transformer encoder, that is, the rest of the network. To be a bit more specific, after this stage we’ll have a certain representation, and it will be impossible to tell just by looking at it whether it came from an image or from a sentence. In other words, it will have the same shape an embedded text sentence would have.

1.1 Patch embeddings

The embedding of an image starts with splitting it into patches. A bit later, each patch will be treated as a token in a sentence. This is where the title comes from: “an image is worth 16x16 words”, where the terms “word” and “token” are used interchangeably and 16x16 refers to pixels. In NLP a token stands for a unique vector that represents a meaningful word or part of a word, and all tokens together define a vocabulary. The vocabulary of a pre-trained model, e.g. of BERT¹⁶, is fixed. It is there to map each input token the model gets to its embedding. If you want to learn more about tokenization in NLP you can start for example here.

But if you feel like you don’t fully understand what this all means, don’t worry. You don’t have to, because in ViTs the definition of a token/word is a bit different. In particular, we won’t store any vocabulary of image patches, as it would be redundant. Unlike a linguistic token which is initially represented as a number in a vocabulary, in some sense, an image patch is already a certain meaningful embedding consisting of pixel values. So instead of a fixed mapping, a linear layer is applied to each image patch directly, resulting in a unique embedding. Here’s how it’s done.

To begin with, a patch size must be defined. It’s a hyperparameter that is set manually by the authors through experimentation and it cannot be changed after a model is trained. If you ever see ViTs in various comparison tables, the “/P” suffix in a model name refers to the used patch size, e.g. ViT-L/16 stands for a large model with 16x16 patches. Model sizes introduced in the paper will be discussed in part 3 of the article series.

In the ViT, patches do not overlap in any way; you can think of them as a grid. They will “interact” only later, in other layers. But for now, we take each one and apply the same linear layer to it, that is, with shared weights. To apply a linear layer to a PxPx3 image patch we can flatten it into P*P*3 values and use Linear(P*P*3, D), where D is the new dimension. If we look at it again, we can see that this operation is in fact equivalent to convolving the input image with stride P and D kernels of size PxP (16 in our case). This eliminates the need to do any flattening and reshaping (for now). Indeed, each patch pixel value, in its spatial location and channel (red, green, and blue), gets a unique weight, and the same weight is applied to the value at the same coordinate in every other patch. Let’s stop right there and take a look at the animation of the process. After all, one animation is worth 1000 words:)

Figure 2: Linear embedding of an RGB input image. [Source: Image by the author.]

ViT is trained on images at resolution 224x224 on ImageNet²⁴ or JFT-300M²³, so the convolution (or the linear layer) yields a 14x14 grid of patches. Each patch initially has 16 * 16 * 3 = 768 values, and the convolution has 768 kernels too, namely D = 768. In other words, the 768 values of each patch are mapped into 768 new values. This is a pure coincidence, or maybe it would be more correct to call it a design choice. The new dimension D is called the hidden size or latent vector size, and of course, it is not a must to make it equal to the number of input values in a patch. In fact, D = 768 is also used for models with other patch sizes, like for example in the Hugging Face model “google/vit-base-patch32-384”. In the paper, the authors vary the latent vector size D based on the model size, but keep it the same throughout all layers of one model, except in the MLP (you’ll see it in part 2 of the articles).

Now that we have 14x14 patches with 768 features in each, the next step is to form a sequence from them. This is done by simply flattening the 14x14 grid into a sequence of 14*14 = 196 elements, row after row.

Figure 3: Embeddings flattening [Source: Image by the author.]

Code:

For simplicity, here and in all the snippets below we’ll omit a definition of torch.nn.Module with forward(…) and other methods. Instead, we’ll focus only on required operations, torch layers, and shapes. Obviously, to be able to do inference or train a model, it has to be properly organized into sub-modules, so if you want to see the whole model’s class together you can look at the code in Hugging Face. We’ll import the entire model later on in this article.

Back to embeddings:

import torch

im_h, im_w = 224, 224
num_channels = 3
input_image = torch.ones((1, num_channels, im_h, im_w))  # torch.Size([1, 3, 224, 224])

patch_size = 16
hidden_size = 768
# Patch embedding as a convolution with kernel size and stride equal to the patch size
projection = torch.nn.Conv2d(num_channels, hidden_size,
                             kernel_size=patch_size, stride=patch_size)
x = projection(input_image)  # torch.Size([1, 768, 14, 14])
embeddings = x.flatten(2).transpose(1, 2)  # torch.Size([1, 196, 768]), 196 = 14*14

# Run yourself here: https://colab.research.google.com/github/yurkovak/ViT_notebook/blob/main/Vision_Transormer_(ViT)_under_the_magnifying_glass.ipynb

Let’s also write some code to prove that a convolution and a linear layer applied to flattened patches do the same thing. For this, we’ll take the weights of Conv2d, reshape them and copy them into the weights of Linear, patchify the image ourselves prior to applying the layer, and compare the results; they should match.

import copy

embeddings_conv = embeddings

# Copy weights from Conv2d into Linear
conv_weight = projection.weight  # torch.Size([768, 3, 16, 16])
conv_bias = projection.bias  # torch.Size([768])
linear_projection = torch.nn.Linear(patch_size * patch_size * num_channels, hidden_size)
linear_projection.weight = torch.nn.Parameter(copy.deepcopy(conv_weight).flatten(1))  # torch.Size([768, 768])
linear_projection.bias = torch.nn.Parameter(copy.deepcopy(conv_bias))  # torch.Size([768])

# Patchify the image and project with Linear
patched_image = torch.nn.Unfold(kernel_size=patch_size, stride=patch_size)(input_image)  # torch.Size([1, 768, 196])
patched_sequence = patched_image.transpose(1, 2)  # torch.Size([1, 196, 768])
embeddings_linear = linear_projection(patched_sequence)

# Compare the results of Linear and Conv2d
print('Outputs of Linear and Conv2d match:',
      torch.allclose(embeddings_conv, embeddings_linear, atol=1e-03))  # prints True

# Run yourself here: https://colab.research.google.com/github/yurkovak/ViT_notebook/blob/main/Vision_Transormer_(ViT)_under_the_magnifying_glass.ipynb

Parameter count:

The linear projection (Conv2d) has hidden_size kernels of size patch_size x patch_size x num_channels plus one bias per kernel: 768 * (16 * 16 * 3) + 768 = 590,592 parameters.

Let’s dig deeper:

Weaknesses of the selected patching approach

As you can see, the separation into patches is very simple and even naive to some extent. To embed a 2D patch, a plain linear layer is used on its flattened representation. Let’s look back in time for a moment and recall that years ago linear layers were commonly used on flattened images to train NNs for tasks like classification. Back then there was a known issue that any two vertically neighboring pixels at coordinates (i, j) and (i+1, j) end up far apart in a flattened representation. Luckily, CNNs with very small kernels addressed this problem by introducing the local connectivity pattern. For example, when a small 3x3 kernel with stride 1 is applied to an image, its weights are equally applied to each pixel together with 8 neighboring pixels around it.

Well, in the ViT, chunks of an image (patches) much bigger than 3x3 are grouped together and there is no overlap between them. So the same issue from the past, which we can say was solved by CNNs, is in fact present in ViTs within each patch. To re-iterate, coordinates (i, j) and (i+1, j) within each patch do end up far apart in flattened vectors. This creates two distinct issues:

  • Due to 16x16 kernels, or any other big ones, local pixel-level features and relationships are disregarded;
  • Due to stride 16, or any other stride equal to a kernel size, information along patch borders is somewhat underrepresented.

To understand the second issue, think about the following example. Imagine you have an image of a puppy. Since patches will “pay attention” to each other later in the attention layers, it would be probably preferable to have meaningful parts of an image within each patch. Instead, with a non-overlapping grid of the pre-defined size that we have in the ViT, we can end up with the grid cutting right through the puppy’s face:( Without conducting proper experiments, it’s hard to say how much of a problem patch borders create in practice, but it’s clearly not ideal.

Figure 4: You can see how the eyes and the nose ended up split into separate patches, even though they can be considered individual semantic units [Source: Image by the author.]

The natural solution that comes to mind is to use much smaller patches instead. Then we’ll have a smaller kernel, better local connectivity within each patch, and arguably more meaningful and semantically homogeneous chunks of an image. The authors of the paper experimented with this and report that decreasing the patch size shows surprisingly robust improvements. However, it comes at a price and turns out to be a bit trickier than it sounds. First and foremost, the usage of smaller patches increases the sequence length and makes the network computationally more expensive. When you finish part 2 of the article series, you’ll understand exactly why it’s the case. But even if we could afford more computations, Namuk Park et al. in their work “How Do Vision Transformers Work?”¹⁷ show that extremely small patch sizes, like 2x2, hurt the predictive performance of ViT. It’s fair to speculate that other real-world datasets and image resolutions might also require careful reselection of this parameter.

Related works

To deal with these problems some research works explore more elaborate image-dividing methods. We won’t dive into the accuracy improvements each one brings and at which computational cost. Instead, we’ll just go over the methods proposed in a few prominent papers, and if you are curious you can of course check their result tables yourself. On the other hand, if you don’t feel like reading about other visual transformers yet, feel free to skip to the next section.

A model called TNT from “Transformer in Transformer”¹⁸ also uses 16x16 patches, but inside each patch there are 4x4 sub-patches. To differentiate, they came up with the following names: they call the big patches “sentences” and the small patches inside them “words”. The network has two parallel flows: outer and inner. The outer Transformer is pretty much the same ViT we are reviewing in this article (with minor differences), and it operates on sentences. The inner Transformer is actually the same ViT too (minus the [CLS] token you’ll learn about soon), but it operates on the 4x4 sub-patches of each sentence independently. The inner Transformer shares weights between sentences as if the sentences from one image formed a batch. The flows are combined by addition in each and every encoder block, which is why we say that the inner and outer flows are parallel. To be a bit more specific, they add 1. the sentence-level embeddings and 2. the flattened embeddings of all words within a sentence. Here is the source code in the MindSpore framework and a PyTorch implementation by timm.
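To make the two patch levels concrete, here is a rough sketch of how an image could be split into sentences and words with torch.nn.Unfold (my own illustration with assumed shapes, not the TNT code):

# Split a 224x224 image into 196 "sentences" of 16x16 pixels,
# then split each sentence into 16 "words" of 4x4 pixels
image = torch.ones(1, 3, 224, 224)
sentences = torch.nn.Unfold(kernel_size=16, stride=16)(image)  # torch.Size([1, 768, 196])
sentences = sentences.transpose(1, 2).reshape(196, 3, 16, 16)  # one "sentence" per row
words = torch.nn.Unfold(kernel_size=4, stride=4)(sentences)    # torch.Size([196, 48, 16])
words = words.transpose(1, 2)                                  # 196 sentences x 16 words x 48 values
print(words.shape)  # torch.Size([196, 16, 48])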

Then, there is also a model called Swin from “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, which actually received the best paper award at ICCV 2021. Forget about the two flows of TNT: Swin has only one. But just like TNT, it also deals with sub-patches. Instead of the names “sentences” and “words” used in TNT, they call big 28x28-pixel patches “windows” and the 4x4 sub-patches inside them, well, simply “patches”. Since there is no outer flow, Swin applies attention only within each window. And just like in TNT, weights are shared between windows as if they formed a batch. But there are two more tricks to know here. One trick is that every few blocks, they take groups of 2x2 spatially neighboring hidden vectors and combine them into one with a space-to-depth operation. This way they gradually reduce the sequence length so that at the end of the network there is only one window left. The other trick is the shifting mentioned in the title, which is done before every other block and reverted after it. It’s performed with torch.roll, and the analogy for it is a treadmill: whatever disappears from one side appears on the other side. So attention layers can actually receive weird-looking windows that contain two opposite sides of an image stitched together. The source code for this model is available in PyTorch here.
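To get some intuition for the treadmill analogy, here is a tiny illustration of the cyclic shift itself (the shift size of 2 is arbitrary, just for demonstration):

# A toy 8x8 single-channel "feature map"; values simply encode the flattened position
fmap = torch.arange(64).reshape(1, 1, 8, 8)

# Cyclic shift: whatever falls off the top-left edge re-appears at the bottom-right
shift = 2
shifted = torch.roll(fmap, shifts=(-shift, -shift), dims=(2, 3))

# Revert the shift after the attention block
restored = torch.roll(shifted, shifts=(shift, shift), dims=(2, 3))
print(torch.equal(fmap, restored))  # True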

A model called T2T from “Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet”¹⁹ recursively tokenizes an input image into tokens. We won’t review it further, so as not to diverge too much from the main topic.

It’s quite evident that the one thing TNT, Swin, and T2T have in common is a somewhat hierarchical approach to patching, which does seem closer to the way humans extract information from images. It’s also not rare to see a network that uses Transformer blocks on top of CNN-extracted features, as it was done for example in LoFTR. Only time will tell if there are better ways to create a sequence from an image, but regardless, there is no doubt that patch embedding is the part of the classic ViT that calls for a lot of improvement.

1.2 (Optional) Positional masking

After the previous stage of patch embedding, we have a sequence of patches/tokens, each projected into a latent vector of length D. Now is the time to apply positional masking. However, it is very important to emphasize that this step is optional: it depends on the type of training that’s being done. The main training recipe for classification doesn’t require any masking at all and so this step is skipped completely. However, it is necessary for training the masked patch prediction task briefly mentioned in the paper.

The motivation behind such a training regime is self-supervision. In this context, it means training the model to restore the masked patches, which are known to us, but not shown to the model. This creates a form of supervision without the need for any human-made labels. Clearly, this doesn’t mean that the model will be able to actually classify images yet, but it will have a good initialization. When we finally train for the classification task, we’ll start with a model that’s already somewhat aware of the contents of images.

We’ll discuss it in more detail soon, but first, let’s understand the mechanics. For this type of training, some elements of the sequence are randomly selected and masked. How exactly? Knowing that masked patch prediction training is planned, we’ll initialize a parameter vector that will serve as a special [MASK] token. We’ll go through the hidden vectors we want to mask (random for each training iteration) and will overwrite their embeddings with this token. This way we are masking, or in other words hiding, some patches in the image.

The [MASK] token is exactly the same in every masked position, meaning that the parameters are shared. It is initialized with zeros but will change throughout training together with the other parameters of the model. It is not mentioned in the paper why a learnable token is used. An alternative would be to just always mask hidden vectors with zeros. It is possible that this choice was arbitrary, following the “why not” logic. But reasoning about it after the fact, we can hypothesize that in some images zero vectors would resemble other embeddings too much, so perhaps there are better values for this special token, and they should be learned instead.

If you are wondering whether the mask is applied in any other layer of the model, then the answer is no, it’s not. Right after patch embeddings is the only place where positional masking happens. The purpose of this is to pre-train the model in a self-supervised manner. Such pre-training does not require any labels, only images themselves. You randomly hide parts of an image from the model, but then feed them as labels in a specific form: either as a mean pixel value or as all patch pixel values, possibly resized down. By learning to restore the hidden pixels the model is supposed to learn more meaningful representations and relationships in the data before it even gets to see the real labels.

The data used for this stage can be images from the same labeled set you have or some other similar set of images. The authors state that a small subset is enough because they observe diminishing returns on downstream performance after 100k pretraining steps. After the pre-training stage, the head of the model is changed back to a classification head and regular supervised training begins. The reported improvement in accuracy on ImageNet compared to training from scratch is 2%.
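To make the label construction concrete, here is a rough sketch (my own, with assumed details: 16x16 patches and mean RGB targets, rather than the paper’s exact 3-bit color setup) of how such targets could be built from the image itself:

# Build per-patch mean-color targets from the input image (input_image as defined above)
patches = torch.nn.Unfold(kernel_size=16, stride=16)(input_image)  # torch.Size([1, 768, 196])
patches = patches.transpose(1, 2).reshape(1, 196, 3, 16 * 16)      # per-patch, per-channel pixel values
mean_color_targets = patches.mean(dim=-1)                          # torch.Size([1, 196, 3])

# Only the masked positions (e.g. 4 and 20, as in the snippet below) contribute to the loss
masked_targets = mean_color_targets[:, [4, 20]]                    # torch.Size([1, 2, 3])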

Figure 5: Positional masking with a learnable [MASK] token [Source: Image by the author.]

Code:

Below, the embeddings obtained in the previous step (right after Conv2d) are used.

batch_size, seq_length = embeddings.shape[:2]  # 1 and 196 = 14*14

# Define a learnable [MASK] token, requires_grad is True by default
mask_token = torch.nn.Parameter(torch.zeros(1, 1, hidden_size)) # torch.Size([1, 1, 768])

# Create a binary mask, choose to mask out 4th and 20th patches as an example
bool_masked_pos = torch.zeros((seq_length,)) # torch.Size([196])
bool_masked_pos[4] = 1.
bool_masked_pos[20] = 1.

# Replace patch embeddings with mask token on the selected positions
mask_tokens = mask_token.expand(batch_size, seq_length, -1) # torch.Size([1, 196, 768])
masked_positions = bool_masked_pos.unsqueeze(-1).type_as(mask_tokens) # torch.Size([196, 1])
embeddings = embeddings * (1.0 - masked_positions) + mask_tokens * masked_positions # torch.Size([1, 196, 768])

# Compare new embeddings to [MASK] token
print('The 4th token matches the 20th token', torch.equal(embeddings[:, 4], embeddings[:, 20])) # prints True
print('The 4th token matches the [MASK] token', torch.equal(embeddings[:, 4], mask_token.squeeze(1))) # prints True

# Run yourself here: https://colab.research.google.com/github/yurkovak/ViT_notebook/blob/main/Vision_Transormer_(ViT)_under_the_magnifying_glass.ipynb

Note a neat implementation trick: instead of indexing, the masked embeddings are computed arithmetically as

embeddings = embeddings * (1.0 - masked_positions) + mask_tokens * masked_positions

Alternatively, it’s possible to use indexing directly. PyTorch won’t have any problem differentiating something like this:

positions_to_mask = [4, 20]
embeddings[:, positions_to_mask] = mask_token.expand(batch_size, len(positions_to_mask), -1)

Parameter count:

The [MASK] token is a single learnable vector, so it adds hidden_size = 768 parameters.

Remember that if positional masking isn’t performed then new parameters aren’t introduced.

Let’s dig deeper:

Similarly to BERT

The masking we are discussing now was initially introduced in the field of NLP in BERT from “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”¹⁶. In this paper, the authors shared very impressive accuracy gains over the classic Transformer encoder¹ and achieved state-of-the-art results across a wide range of benchmark datasets, in various tasks. All of this without making any significant changes to the architecture, only by masked language modeling and a few other types of pre-training. They also use a [MASK] token, which is ultimately just a name for a learnable parameter vector.

Considering their success, it clearly seems very promising to make it work in ViT, so that’s what the authors of ViT tested. They even used a very similar corruption scheme, although the masking ratio differs: in BERT 15% of positions are masked, while in ViT 50% are. But whenever a token is picked to be restored, both papers replace it with the [MASK] token 80% of the time, with a random token 10% of the time, and leave the embedding unchanged in the remaining 10% of cases. The motivation for having a bit of unpredictability, as opposed to just using [MASK] all the time, is explained in BERT as follows:

“… Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token.”¹⁶

Note that this randomness isn’t implemented in the Hugging Face model, so it’s not present in the code above either.
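If you wanted to add this randomness yourself, a minimal sketch could look like the following (the helper name corrupt_selected and its arguments are mine, not from any library; embeddings and mask_token are assumed to be defined as above):

def corrupt_selected(embeddings: torch.Tensor,
                     mask_token: torch.Tensor,
                     selected_positions: list) -> torch.Tensor:
    """BERT-style 80/10/10 corruption of the positions chosen for prediction."""
    corrupted = embeddings.clone()
    seq_length = embeddings.shape[1]
    for pos in selected_positions:
        draw = torch.rand(1).item()
        if draw < 0.8:
            # 80%: overwrite with the learnable [MASK] token
            corrupted[:, pos] = mask_token.squeeze(1)
        elif draw < 0.9:
            # 10%: overwrite with the embedding of another, randomly chosen patch
            random_pos = torch.randint(seq_length, (1,)).item()
            corrupted[:, pos] = embeddings[:, random_pos]
        # remaining 10%: leave the embedding as it is
    return corrupted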

Related works

The procedure we are seeing now is, as the authors of ViT fairly call it, just a preliminary exploration. Many works have followed taking similar but slightly different approaches to masked image modeling (MIM for short, by an analogy to MLM, masked language modeling). For example, both MAE from “Masked Autoencoders Are Scalable Vision Learners”²⁰ and BEiT from “BEIT: BERT Pre-Training of Image Transformers” also try to improve downstream accuracy by reconstructing missing pixels.

On the other hand, there is another well-established direction for self-supervised pre-training that uses self-distillation. There, instead of pixels directly, a latent vector produced by a teacher is used as a label for a student. A teacher can be, for example, an exponential moving average of a student, which is why the approach is called self-distillation. One example of such a method would be a recent work “What to Hide from Your Students: Attention-Guided Masked Image Modeling”²¹ where the authors question the effectiveness of choosing patches for masking completely at random. They show that some patches are too easy, so reconstructing them is less helpful than masking and reconstructing harder, less trivial patches. They use the teacher’s attention scores to guide which parts of an input image should be masked.

It won’t go amiss to repeat once again that masking is one of the approaches for self-supervised pre-training, but it is not required for the usual supervised classification which is actually the main task of the ViT. So let’s get back on track and start the next section where we’ll learn about an interesting trick used for classification.

1.3 [CLS] token

So far we’ve seen one special learnable token, here comes another one. In this section, we are going to talk about the [CLS] token, another concept inherited from BERT. But unlike [MASK], this token will always be present in the model. Remember that [MASK] is used to completely replace the embeddings of certain patches? [CLS], in contrast, has a special position in the sequence: it always precedes all the other embeddings and increases the sequence length by 1. This token is also a learnable parameter vector, and it’s stacked right before all the other embeddings we’ve obtained until now.

Figure 6: Concatenation of learnable parameters of the [CLS] token [Source: Image by the author.]

Code:

cls_token = torch.nn.Parameter(torch.zeros(1, 1, hidden_size))  # torch.Size([1, 1, 768])
cls_tokens = cls_token.expand(batch_size, -1, -1) # torch.Size([1, 1, 768])
embeddings = torch.cat((cls_tokens, embeddings), dim=1) # torch.Size([1, 197, 768])
new_seq_length = embeddings.shape[1]
print(new_seq_length) # prints 197 = 14*14 + 1

# Run yourself here: https://colab.research.google.com/github/yurkovak/ViT_notebook/blob/main/Vision_Transormer_(ViT)_under_the_magnifying_glass.ipynb

Parameter count:

The [CLS] token is another single learnable vector, adding hidden_size = 768 parameters.

Let’s dig deeper:

If you haven’t seen it in the past, then I bet you are not sure why it is even needed. Let’s fast-forward for a moment to the end of the network. Spoiler alert: all the layers in the model will be dealing with a sequence of latent vectors, up to the very end when it’s time to attach a classification head. The question is, how can we use all the vectors together, and where should the classification head be attached? There are a few ways to aggregate all the sequence elements. One of the commonly used methods is Global Average Pooling (GAP), which consists of averaging all the sequence vectors and using the result as input to a fully-connected classification layer. But the [CLS] token is another, more unusual way to serve the same purpose. It was initially introduced in BERT¹⁶ and subsequently used in ViT. Let’s try to understand how it can aid a classification head.

To state the obvious, the name [CLS] stands for “classification”. When we reach self-attention layers, you’ll see that each hidden vector will be able to attend to other elements according to their relevance. There, all the other hidden vectors that originally corresponded to image patches will only “care” about themselves: each position will gather the information that helps its own patch to be represented better. But the [CLS] token has a special nature: it doesn’t store any knowledge of its own. Instead, it will collect/aggregate information useful for the classification task.

The way to make this concept work is by encouraging this behavior with the architecture. To be specific, all the other encoding vectors returned by the last layer of the encoder will be discarded, all except the encoding on the 0-th position corresponding to the [CLS] token. Only the encoding of the [CLS] token from the last encoder layer will be used by a classification head. This way, this sequence element will be forced to learn to encapsulate information useful for classification.
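To make this concrete, here is a toy sketch of the two readout options (the shapes and names, such as encoder_output, are illustrative only; the actual classification head is covered in part 3):

# A hypothetical encoder output for a batch of 2 images: (batch, 197, 768)
encoder_output = torch.randn(2, 197, 768)
num_classes = 1000

# Option 1 (used by ViT and BERT): keep only the encoding at position 0, i.e. the [CLS] token
cls_encoding = encoder_output[:, 0]               # torch.Size([2, 768])

# Option 2 (a common alternative): Global Average Pooling over all patch encodings
gap_encoding = encoder_output[:, 1:].mean(dim=1)  # torch.Size([2, 768])

# Either vector can be fed to a fully-connected classification head
classifier = torch.nn.Linear(768, num_classes)
logits = classifier(cls_encoding)                 # torch.Size([2, 1000])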

But let’s not rush too much to the last layers, we have the whole encoder (part 2) ahead of us. Just remember for now about the special purpose of this new sequence element.

1.4 Positional Embedding (PE)

We’ve reached the last part of embedding, the positional one. It is quite an intriguing concept, because it’s not something intrinsic to CNNs, and not even to RNNs for that matter. Instead, it is pretty specific to transformers for one particular reason: they contain MHSA layers (covered in part 2) that are permutation-equivariant. It means that if you permute the elements such a layer receives, its outputs are simply permuted in the same way; the layer itself has no notion of order. This property of course has its advantages, but considering we’ve just split an image into non-overlapping patches and flattened them into one long 1D sequence, we do want to preserve some information about the order. So positional embedding is supposed to do exactly that.

In practice, the solution used in ViT is surprisingly trivial. We’ll let the network learn a unique parameter vector for each position, and that’s how layers in the rest of the network will be able to differentiate where each token is coming from. In other words, yet another special parameter has to be introduced, this time with a shape that includes the sequence length. Then, this parameter tensor will be added to the current embeddings. That’s all. It can be said that this trick works “out of the box”: there is no need to dedicate a loss term to it. The model is able to learn a meaningful positional embedding as a byproduct of training for the main task, simply because positional information is helpful. And in this section, we’ll investigate and visualize what exactly is learned in these parameters.

Note that none of the parameters we’ve introduced previously depended on the sequence length, so in case you’ve lost track of the number of elements we have in the sequence, now is a good time to reiterate. Recall that we preceded the sequence with an extra [CLS] token in the previous step. Thus, the length is 1 plus the total number of patches in the non-overlapping grid, (1 + H/P x W/P), where (H x W) is the initial input image resolution and P is the used patch size.

Figure 7: Summation of learnable positional embedding [Source: Image by the author.]

Code:

# Initialize position_embeddings parameters and add to the sequence embeddings
position_embeddings = torch.nn.Parameter(torch.zeros(1, new_seq_length, hidden_size)) # torch.Size([1, 197, 768])
embeddings = embeddings + position_embeddings # torch.Size([1, 197, 768])

# Run yourself here: https://colab.research.google.com/github/yurkovak/ViT_notebook/blob/main/Vision_Transormer_(ViT)_under_the_magnifying_glass.ipynb

Parameter count:

The positional embedding adds new_seq_length * hidden_size = 197 * 768 = 151,296 parameters.

Let’s dig deeper:

2D structure

One more time we observe here how the spatial structure of an image is ignored. Each patch has a 2D coordinate in the original image, X and Y on a 14x14 grid, yet the positional embedding is 1D. To clarify, the positional embedding of each token isn’t one-dimensional (it has D values), but on the image level there are no heuristics that would explicitly convey to the model the 2D positioning of patches and the corresponding relationships between them. Such heuristics would imply that, for example, two patches in the 4-th column but from different rows share the same embedding for the X axis (the column coordinate).

Instead, in 1D, each position in the embedding is free to be completely unique; the authors of ViT just let the model learn everything it needs. Why? This is not a random design choice, but rather something that was properly ablated in the paper. They tested many options: 1D, 2D, no positional embedding at all, and even a more elaborate relative one. The only significant gap in accuracy, of almost 3%, was observed when comparing models with and without any positional embedding, while the particular strategy itself didn’t matter much. So considering these results, it definitely makes sense to stick to 1D for simplicity.
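For reference, here is a sketch of what the 2D variant could look like: separate learnable embeddings for the row and column coordinate, each of size D/2, concatenated per patch (this is my own reconstruction of the idea described in the paper’s ablation, not released code):

# Learn one embedding per row (Y) and one per column (X), each of size D/2
row_embed = torch.nn.Parameter(torch.zeros(14, hidden_size // 2))
col_embed = torch.nn.Parameter(torch.zeros(14, hidden_size // 2))

# Build the full 196 x D table: patches in the same row share row_embed[i],
# patches in the same column share col_embed[j]
pos_2d = torch.cat((
    row_embed.unsqueeze(1).expand(14, 14, -1),
    col_embed.unsqueeze(0).expand(14, 14, -1),
), dim=-1).reshape(14 * 14, hidden_size)  # torch.Size([196, 768])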

It’s worth noting that it’s also stated in the paper that a model learns to represent a 2D image topology with 1D embeddings anyway. The authors speculate that it might be because a 14x14 grid isn’t that hard to “figure out” due to a relatively small grid resolution. They provide a cool figure to prove the claim.

Figure 8: From “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”², captioned as “Position embeddings of models trained with different hyperparameters.”

Let’s try to understand what’s going on here. What you see is the cosine similarity between all pairs of positional embeddings of models trained with three different regimes. That is, to navigate the chart you should pick one of the three plots, look at some position (i, j), and find its similarity map. On this small square, the color at each position (k, m) tells you how similar the positional embedding of the patch (i, j) is to the positional embedding of the patch (k, m).

Since it is emphasized that different training hyperparameters lead to different patterns, it would be interesting to take a look at such a figure for a pre-trained model available in Hugging Face. The source code for generating the charts from the paper wasn’t open-sourced, so let’s write it ourselves and try to reproduce the visualization:

from typing import List

import matplotlib.pyplot as plt
import numpy as np
from transformers import ViTForImageClassification


def similarity_one_to_all(pos_embedding: torch.Tensor, all_embeddings: torch.Tensor) -> np.ndarray:
    cos_sim = torch.nn.CosineSimilarity(dim=-1)
    similarities = []
    # Compare the given pos_embedding to all other embeddings one by one
    for i in range(len(all_embeddings)):
        similarity = cos_sim(pos_embedding, all_embeddings[i])  # one value
        similarities.append(similarity.item())
    return np.array(similarities)


def similarity_all_to_all(pos_embeddings: torch.Tensor, grid_h: int, grid_w: int) -> List[List]:
    """
    Compute similarity between positional embeddings, all to all
    :param pos_embeddings: embeddings of shape (sequence_length x D),
        sequence_length must be equal to grid_h * grid_w
    :param grid_h: number of patches along Y axis
    :param grid_w: number of patches along X axis
    :return: grid_h x grid_w array of similarity heatmaps,
        where similarity_heatmaps[i][j] stores similarity between embedding
        for the patch at (i, j) to all the other embeddings
    """
    # Reshape pos. embeddings into a grid
    pos_embeddings_grid = pos_embeddings.reshape((grid_h, grid_w, -1))
    similarity_heatmaps = []
    # Go over patch rows
    for i in range(grid_h):
        row_i_heatmaps = []
        # Go over patch columns
        for j in range(grid_w):
            # Compare pos. embedding (i, j) to all other embeddings and reshape similarities into a heatmap
            similarities = similarity_one_to_all(pos_embeddings_grid[i, j], pos_embeddings)
            row_i_heatmaps.append(similarities.reshape((grid_h, grid_w)))
        similarity_heatmaps.append(row_i_heatmaps)

    return similarity_heatmaps


# Load pre-trained ViT-B/16 and get its position embedding
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
trained_pos_embeddings_all = model.vit.embeddings.position_embeddings[0].detach()  # torch.Size([197, 768])

# Discard batch dimension and [CLS] token position
image_pos_embeddings = trained_pos_embeddings_all[1:]  # torch.Size([196, 768])
grid_h, grid_w = im_h // patch_size, im_w // patch_size

# Compute similarities and visualize (visualize_heatmaps is a plotting helper defined in the Colab notebook)
similarity_heatmaps = similarity_all_to_all(image_pos_embeddings, grid_h, grid_w)
visualize_heatmaps(similarity_heatmaps, grid_h, grid_w)

# Run yourself here: https://colab.research.google.com/github/yurkovak/ViT_notebook/blob/main/Vision_Transormer_(ViT)_under_the_magnifying_glass.ipynb

In this snippet, you can find the function for computing similarities, but the plotting function is omitted due to irrelevance to the explanation. You can find it, as well as a much shorter vectorized form of similarity_all_to_all(…) in the Colab notebook. The code produces the plot below, except for the zoom-in which was added manually to add clarity:

Figure 9: Similarity of the learned positional embeddings of the pre-trained model from Hugging Face “google/vit-base-patch16-224”, all-to-all; the highlighted similarity map is the similarity of the PE of the patch at (11, 9) to the PEs of all other patches [Source: Image by the author.]

As you can see, each positional embedding is the most similar to itself (obviously), but it is also very similar to the blob around it. This means that the model embeds spatially neighboring regions with very similar vectors, which is a desirable behavior. Surprisingly, for each positional embedding there are also other regions it’s highly similar to, that is, there is more than one bright blob. This indicates that similar positional embedding vectors are reused for various regions.

In case you haven’t noticed, for visualization purposes we had to discard the positional embedding of the [CLS] token and compared only the PEs of image patches to one another. But what about the PE of the [CLS] token? After all, it is not located within the image, so it doesn’t have a 2D coordinate on the patch grid. It’s interesting to see whether it ended up similar to any other positional embeddings. Logic tells us that it shouldn’t, considering its special nature. Indeed, in the plot below you can see that the learned values are different from all the other positional embeddings: the similarity is on average 0, which means that the PE of the [CLS] token is almost orthogonal to the PEs of all the real patches. Intuitively, it bears no resemblance to the PE of any given patch. This makes sense, since the position of the [CLS] token is special, unlike any of the positions within the image grid.

# Get the positional embedding that corresponds to the [CLS] token
cls_pos_embedding = trained_pos_embeddings_all[0]  # 0-th position, torch.Size([768])

# Visualize similarity between the [CLS] token PE and all other PEs
cls_pos_to_all_similarity = similarity_one_to_all(cls_pos_embedding, image_pos_embeddings)
plt.figure(figsize=(10, 10))
plt.imshow(cls_pos_to_all_similarity.reshape((grid_h, grid_w)), vmin=-1, vmax=1)
cbar = plt.colorbar(aspect=25, ticks=[-1, 1], label='Cosine similarity')
cbar.set_label(label='Cosine similarity', fontsize=24)
plt.xticks([0, grid_w - 1], labels=[1, grid_w], fontsize=24)
plt.yticks([0, grid_h - 1], labels=[1, grid_h], fontsize=24)
plt.show()

# Run yourself here: https://colab.research.google.com/github/yurkovak/ViT_notebook/blob/main/Vision_Transormer_(ViT)_under_the_magnifying_glass.ipynb
Figure 10: Similarity of the positional embedding of the [CLS] token to the position embeddings of all other patches [Source: Image by the author.]

Fixed position encodings

Note that a learnable positional embedding is one of the discrepancies between ViT and the Transformer¹ encoder. The NLP Transformer uses a fixed formula for this purpose: for each dimension i of the D-dimensional encoding vector it computes a sine or cosine with a unique wavelength (sine for even i, cosine for odd), evaluated at the token’s position in the sequence. This function was chosen in “Attention Is All You Need”¹ for its nice properties:

  • the further the elements, the smaller the similarity;
  • it is symmetric;
  • the (j + k)-th encoding can be expressed as a linear function of the j-th encoding: for each frequency ω used in the formula, the pair (sin(ω(j + k)), cos(ω(j + k))) is a fixed rotation of (sin(ωj), cos(ωj)), with the rotation angle ωk depending only on the offset k.

You can find the exact formula and proof in this great article, and we’ll take advantage of the code we have and visualize the all-to-all similarities for such fixed embedding. For this, we’ll use an implementation of the cosine-sine formula from a PyTorch tutorial.

import math

def get_fixed_pos_encodings(sequence_length: int, D: int) -> torch.Tensor:
    """
    Create positional encodings according to the formula used for the Transformer in
    "Attention Is All You Need"
    """
    position = torch.arange(sequence_length).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, D, 2) * (-math.log(10000.0) / D))
    pe = torch.zeros(sequence_length, D)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


# Create fixed positional encodings and visualize their all-to-all similarities
fixed_pos_encodings = get_fixed_pos_encodings(grid_h * grid_w + 1, hidden_size)  # torch.Size([197, 768])
similarity_heatmaps_fixed = similarity_all_to_all(fixed_pos_encodings[1:], grid_h, grid_w)
visualize_heatmaps(similarity_heatmaps_fixed, grid_h, grid_w)

# Run yourself here: https://colab.research.google.com/github/yurkovak/ViT_notebook/blob/main/Vision_Transormer_(ViT)_under_the_magnifying_glass.ipynb
Figure 11: Similarity of all-to-all the positional encodings defined by the sine and cosine formula from “Attention Is All You Need”¹ [Source: Image by the author.]

As you can see, the pattern is very different from the one learned by ViT. In particular, the similarity is the highest between patches in the same row and the spatial structure isn’t apparent at all. It comes as no surprise, considering the formula was picked for text sequences and wasn’t meant to deal with images. To understand why the similarities of fixed encoding are the way they are, let’s take a look at the values themselves.

Figure 12: Top: Absolute values of the Transformer’s positional encodings defined by the sine and cosine formula from “Attention Is All You Need”¹; Bottom: absolute values of the learned positional embeddings of the pre-trained ViT “google/vit-base-patch16-224” [Source: Image by the author.]

As usual with learned weights, the values of ViT are hard to interpret. But the values of the fixed encoding of the Transformer pretty much explain their similarity map in Figure 11: vectors are the most similar for the adjacent sequence elements. This means that with fixed encoding vertically neighboring patches would get very different PE parameters. All in all, apparently, the learned embedding suits ViT much better, since it naturally learns a 2D structure without any special heuristics.
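Before moving on to the next design question, we can also quickly sanity-check the linear-relation property listed above, reusing the get_fixed_pos_encodings helper from the previous snippet (the positions j and k are arbitrary):

# For the first sine/cosine pair the frequency is 1/10000^(0/D) = 1
pe = get_fixed_pos_encodings(sequence_length=50, D=hidden_size)
j, k = 3, 7
rotation = torch.tensor([[math.cos(k), math.sin(k)],
                         [-math.sin(k), math.cos(k)]])
print(torch.allclose(rotation @ pe[j, :2], pe[j + k, :2], atol=1e-6))  # prints True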

Placement in the network

The last architecture design to consider is the placement of positional embedding in the network. We already know that it is added once, shortly after the linear patch embeddings. But what are the alternatives? One is to move from the stem, where we are now, to more locations. The authors of the paper test the addition of positional embeddings in every layer, with a new set of learnable parameters or with shared weights. The difference in accuracy compared to the baseline approach was negligible. It might be thanks to the skip connections that, as we’ll see in part 2, run through all the layers and let the information added in the stem, including the positional one, propagate forward.

A similar and related discussion was raised in this GitHub issue. Why are the positional embeddings added to the patch embeddings when it is often the case in neural networks that the concatenation of feature maps/vectors yields better performance than adding them? While I’m not aware of any paper that would directly ablate concatenation of patch embeddings and positional embedding, there are a few potential reasons to resort to addition anyway. Firstly, addition saves parameters in the layers to come: with concatenation, the subsequent layers would need bigger weight matrices to process the longer vectors. Secondly, even with summation, the network can easily learn to dedicate some positions within a vector of length D to a stronger signal from the positional embedding, and other positions to a stronger signal from the patch embedding. And the third hypothetical reason is, in my opinion, the most interesting one. It lies in the fact that it might be helpful to encode some patches differently depending on the position they come from. To rephrase, the embedding of a patch, which is the same regardless of its location, will be altered by the positional embedding through summation, resulting in unique position-dependent features.
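To put a rough number on the first reason, here is a quick back-of-the-envelope comparison (illustrative only, the layer names are mine):

# If positional information were concatenated instead of added, every subsequent
# projection from the token dimension would need roughly twice as many weights
D = 768
proj_after_addition = torch.nn.Linear(D, D)    # used on vectors of length D
proj_after_concat = torch.nn.Linear(2 * D, D)  # would be needed for concatenated vectors
count = lambda module: sum(p.numel() for p in module.parameters())
print(count(proj_after_addition), count(proj_after_concat))  # prints 590592 1180416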

1.5 Dropout (DP)

We’ve reached the last layer in the embeddings stage, Dropout²². This layer is widely used for all kinds of deep neural networks to prevent overfitting, so there is a high chance that you are already familiar with it. Let’s quickly recall the idea behind this trick.

The problem that Dropout is meant to solve is the complex co-adaptations that build up between neurons during training. That is, during back-propagation, specific weights may learn to correct the mistakes other neurons cause. The authors claim that such co-adaptations generalize poorly to new data and, as a result, lead to overfitting. The proposed solution is to drop out random neurons during training, different ones each iteration. There is no need to literally delete a neuron and its weights. Instead, it can be implemented by simply zeroing out some neurons’ outputs, as if they weren’t present in the network at the current iteration. Since all the neurons are eventually used at test time, this can be perceived as a form of combining exponentially many different architectures.

As you can see, Dropout is one of those layers that behave differently at train and test time. This creates the following caveat: think about the next layer, the one that relies on inputs that had Dropout applied to them. During training, this following layer adapts to a certain distribution of values. Let’s take some value x at position (i, j), the output of a certain neuron. Its expected value at test time, when nothing is zeroed out, is simply

E_test[x(i, j)] = x(i, j)

But its expected value during training is

E_train[x(i, j)] = (1 - p) * x(i, j) + p * 0 = (1 - p) * x(i, j)

where p is the probability of x at (i, j) being zeroed out.

So that’s what the next layer will in fact learn to expect. However, the expected value of x at (i, j) at test time will be bigger, so we must scale it back by the factor of (1 - p). Alternatively, the process can be optimized for inference by applying inverse scaling during training. In particular, if during training, together with zeroing out some values, we divide all remaining values by (1 - p), then at train time we get the expectation

E_train[x(i, j)] = (1 - p) * x(i, j) / (1 - p) + p * 0 = x(i, j)
This allows the next layer to always “work” with the same distribution of inputs, both during training and during inference.

That being said, in PyTorch there is no need to write it manually, as it’s implemented in torch.nn.Dropout, but it’s always good to remember what happens under the hood.
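To make “what happens under the hood” concrete, here is a minimal sketch of inverted dropout (an illustration only; in practice just use torch.nn.Dropout):

def inverted_dropout(x: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    if not training or p == 0.0:
        return x  # at test time the layer is a no-op
    keep_mask = (torch.rand_like(x) >= p).float()
    # Zero out roughly p of the values and rescale the rest by 1/(1 - p)
    return x * keep_mask / (1.0 - p)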

Figure 13: Application of Dropout to embeddings, train mode [Source: Image by the author.]

Code:

The code below applies Dropout to the embeddings. But unlike with other layers, it does so both in train and in eval mode. Outputs obtained in both modes are saved into separate variables.

Don’t get confused by the double application: in a real model Dropout is applied only once, in the mode that corresponds to the current stage (training or evaluation). But for debugging purposes we’ll need both, because we are going to actually compare the values and make sure that the results match the formulas explained above.

# Define the Dropout layer and apply it in train mode (the default mode)
dropout_prob = 0.1
dropout = torch.nn.Dropout(dropout_prob)
embeddings_after_train_dp = dropout(embeddings)  # torch.Size([1, 197, 768])

# Apply Dropout to the same input embeddings, but in eval mode
dropout.eval()
with torch.no_grad():
    embeddings_after_eval_dp = dropout(embeddings)

# Run yourself here: https://colab.research.google.com/github/yurkovak/ViT_notebook/blob/main/Vision_Transormer_(ViT)_under_the_magnifying_glass.ipynb

Using embeddings_after_train_dp and embeddings_after_eval_dp, let’s check a few things for both modes:

  • how many zeros there are in the tensor before and after the operation, in % from the total number of values;
  • what the actual values are.
# Count zero values percentage before and after the operation
with torch.no_grad():
    get_zeros_percentage = lambda t: round(((t == 0.).sum() / t.numel()).item() * 100, 1)
    print(f'Percentage of zeros BEFORE applying dropout: {get_zeros_percentage(embeddings)}%')
    print(f'Percentage of zeros AFTER applying dropout (train): {get_zeros_percentage(embeddings_after_train_dp)}%')
    print(f'Percentage of zeros AFTER applying dropout (eval): {get_zeros_percentage(embeddings_after_eval_dp)}%\n')

# Check values before and after the operation
with torch.no_grad():
    embeddings_after_eval_dp = dropout(embeddings)  # dropout is still in eval mode, so this is a no-op
    print('Values of input embeddings:\n', embeddings[0, :4, 0].detach())
    print('Values of embeddings after DP (train):\n', embeddings_after_train_dp[0, :4, 0].detach())
    print('Values of input embeddings scaled with 1/(1 - p):\n', embeddings[0, :4, 0] / (1 - dropout_prob))
    print('Values of embeddings after DP (eval):\n', embeddings_after_eval_dp[0, :4, 0].detach())

# Run yourself here: https://colab.research.google.com/github/yurkovak/ViT_notebook/blob/main/Vision_Transormer_(ViT)_under_the_magnifying_glass.ipynb

The result will vary due to randomness, but here is an example log of these prints:

Percentage of zeros BEFORE applying dropout: 1.5%
Percentage of zeros AFTER applying dropout (train): 11.3%
Percentage of zeros AFTER applying dropout (eval): 1.5%

Values of input embeddings:
tensor([0.0000, 0.5994, 0.5994, 0.5994])
Values of embeddings after DP (train):
tensor([0.0000, 0.6660, 0.6660, 0.0000])
Values of input embeddings scaled with 1/(1 - p):
tensor([0.0000, 0.6660, 0.6660, 0.6660])
Values of embeddings after DP (eval):
tensor([0.0000, 0.5994, 0.5994, 0.5994])

As you can see, in the train-mode output there are roughly (dropout_prob * 100) % zeros. There are actually slightly more, due to some naturally occurring zeros in the input; that is, not all zeros are induced by Dropout. The values that remain are equal to the input value divided by (1 - p). Conversely, in eval mode, Dropout does absolutely nothing to the input.

Parameter count:

Adding Dropout doesn’t introduce any new parameters.

Let’s dig deeper:

The used probabilities

The dropout probability p is a hyperparameter that must be set manually. The authors of ViT use p = 0 for training on JFT-300M²³, an internal Google dataset with more than 300 million images. Such probability makes it equivalent to not using Dropout at all. However, p = 0.1 is used for all model sizes when training from scratch on ImageNet²⁴, both IN-1K (with roughly 1 million images) and IN-21K (with roughly 14 million images). They write that, with this dataset, strong regularization was crucial.

Placement in the network

If you scroll up to the “Architecture” section and look at the list of layers, you may notice that there are more locations in the network where a Dropout layer is used. The authors state that they apply Dropout “after every dense layer except for the qkv-projections and directly after adding positional embedding to patch embeddings.”² If you have prior experience with various CNN architectures, you may recall that it’s not common to find a Dropout layer in CNNs anywhere except in the classification head. The reason for this lies in the way feature maps are produced in CNNs. There, due to the ubiquitous usage of stride = 1 and also due to spatially shared convolutional weights, zeroing out a random value neither fully removes access to an input region nor deletes a whole neuron altogether. As a result, various structured forms of Dropout were introduced in the past, e.g. DropBlock²⁵, DropPath²⁶, DropConnect²⁷, etc. When applied to feature maps throughout the network, these have a much stronger effect on reducing overfitting than Dropout.

But let’s get back to ViT and think for a moment. Which neurons can we say get deleted when we temporarily zero out a value in the (H/P x W/P + 1) x D embeddings tensor? It is important to emphasize that Dropout was initially proposed for networks that consist of fully connected layers and operate on a 1D vector, not on a sequence of vectors. In such a neural network, zeroing out one value indeed imitates a deletion of the whole neuron used to produce it.

But the embeddings we have at this stage are 2D. Recall that they are the sum of patch embeddings and position embeddings. Zeroing out a value in this tensor completely removes access to the corresponding value among the positional embedding parameters. As for the linear patch embedding layer (Conv2d) from section 1.1, none of its kernels is ever “gone” completely: even if x at (i, j) is set to 0, some other x at (i + k, j) most likely isn’t, so the kernel that produces the j-th feature of the D-dimensional embeddings still contributes to the output.
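
To back this up with a quick sanity check, here is a sketch with hypothetical ViT-Base-like shapes (197 tokens, D = 768): with element-wise Dropout, the chance that some feature column is zeroed in every token at once is p to the power of 197, so in practice no patch-embedding kernel is ever fully cut off:

import torch
import torch.nn as nn

torch.manual_seed(0)
num_tokens, dim = 197, 768            # hypothetical H/P * W/P + 1 tokens, embedding size D
embeddings = torch.randn(1, num_tokens, dim)

dropout = nn.Dropout(p=0.1)
dropout.train()
out = dropout(embeddings)

# Count feature columns j that are zeroed for *all* tokens simultaneously.
fully_dropped = (out == 0).all(dim=1).sum().item()
print('Feature columns zeroed across all tokens:', fully_dropped)   # expected: 0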

That’s why a Dropout layer in this particular placement is sometimes called positional dropout. As for the other locations in the network, some vision transformers do favor a structured form of Dropout, e.g. DropPath²⁶, which is used in the Swin Transformer⁴.
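
For reference, here is a minimal sketch of a DropPath-style layer (not the exact implementation used by Swin or any particular library): during training it zeroes the entire residual branch for a random subset of samples and rescales the survivors, analogously to inverted Dropout:

import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Minimal DropPath sketch: drop an entire residual branch per sample."""

    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return x
        keep_prob = 1.0 - self.p
        # One Bernoulli draw per sample, broadcast over tokens and features.
        mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = torch.bernoulli(torch.full(mask_shape, keep_prob, device=x.device))
        return x * mask / keep_prob

# Typical usage inside a transformer block (hypothetical `block` and `x`):
# x = x + drop_path(block(x))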

Summary

This wraps up the embedding stage of the ViT and part 1 of the article. In this part, we learned which operations are applied to an image before it reaches encoder blocks. We saw what’s done to represent an image as a sequence and figured out the mechanics behind a few special tokens. We also talked through each step in depth: a bit of history, motivation, some nice properties, issues, and further research directions.

If you would like to go over all the animations together, it is convenient to do it in the Colab Notebook. Let’s also do a short recap of what we have so far (a minimal code sketch of these steps follows the list):

  • An image is convolved with a big kernel and stride equal to the kernel size. This is equivalent to splitting an image into patches and applying a linear projection.
  • The resulting feature map is flattened into a sequence. For self-supervised pre-training, some elements in the sequence may be masked at this stage with a learnable [MASK] token.
  • The sequence is prepended with a learnable parameter vector dedicated to a [CLS] token.
  • Positional embedding is added to the sequence; each token gets its own learnable parameter vector to encode its position.
  • Dropout is applied to the embedding.
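
To tie the recap together, here is a minimal sketch of the whole embedding stage with hypothetical ViT-Base-like hyperparameters (224×224 input, 16×16 patches, D = 768); the [MASK] token and weight initialization are omitted for brevity:

import torch
import torch.nn as nn

B, C, H, W = 2, 3, 224, 224                 # batch of RGB images
P, D, p_drop = 16, 768, 0.1                 # patch size, embedding dim, dropout prob
num_patches = (H // P) * (W // P)           # 196

patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)        # patch split + linear projection
cls_token = nn.Parameter(torch.zeros(1, 1, D))                # learnable [CLS] token
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, D))  # learnable positional embedding
dropout = nn.Dropout(p_drop)                                  # positional dropout

x = torch.randn(B, C, H, W)
x = patch_embed(x)                          # (B, D, H/P, W/P)
x = x.flatten(2).transpose(1, 2)            # flatten into a sequence: (B, H/P * W/P, D)
x = torch.cat([cls_token.expand(B, -1, -1), x], dim=1)        # prepend [CLS]: (B, N + 1, D)
x = x + pos_embed                           # add positional embedding
x = dropout(x)
print(x.shape)                              # torch.Size([2, 197, 768])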

As for the total number of parameters in the embedding stage, as you can see in Figure 14, most of them are concentrated in the linear projection layer, i.e. patch embedding.

Figure 14: Parameters in the embedding part of the network [Source: Image by the author.]

In the next part of the article we will dig into the building blocks of the encoder, including multi-head self-attention, and in part 3 we will discuss the classification head and everything that goes beyond the architecture: training, inference, model sizes, some general model statistics, etc. Stay tuned!

If you learned something new, please consider clapping and sharing with friends, so that more people can benefit from it.

Acknowledgments

Major thanks go to Professor Ran El-Yaniv, Anton Yurkov, Omri Puny, Natan Bagrov, Ido Shahaf, and Ofri Masad for taking the time to review the article and for contributing to making it better, more clear, and informative.

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. arXiv:1706.03762
  2. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929
  3. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. arXiv:2005.12872
  4. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv:2103.14030
  5. Bao, H., Dong, L., Piao, S., & Wei, F. (2021). BEiT: BERT Pre-Training of Image Transformers. arXiv:2106.08254
  6. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers. arXiv:2104.14294
  7. Sun, J., Shen, Z., Wang, Y., Bao, H., & Zhou, X. (2021). LoFTR: Detector-Free Local Feature Matching with Transformers. arXiv:2104.00680
  8. Fang, Z., Wang, J., Hu, X., Liang, L., Gan, Z., Wang, L., Yang, Y., & Liu, Z. (2021). Injecting Semantic Concepts into End-to-End Image Captioning. arXiv:2112.05230
  9. Guo, M., Cai, J., Liu, Z., Mu, T., Martin, R. R., & Hu, S. (2020). PCT: Point cloud transformer. arXiv:2012.09688
  10. Jiang, Y., Chang, S., & Wang, Z. (2021). TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up. arXiv:2102.07074
  11. Deng, Y., Tang, F., Dong, W., Ma, C., Pan, X., Wang, L., & Xu, C. (2021). StyTr²: Image Style Transfer with Transformers. arXiv:2105.14576
  12. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021). Zero-Shot Text-to-Image Generation. arXiv:2102.12092
  13. Zhang, Q., & Yang, Y. (2021). ResT: An Efficient Transformer for Visual Recognition. arXiv:2105.13677
  14. Li, Y., Yuan, G., Wen, Y., Hu, J., Evangelidis, G., Tulyakov, S., Wang, Y., & Ren, J. (2022). EfficientFormer: Vision Transformers at MobileNet Speed. arXiv:2206.01191
  15. Pan, J., Bulat, A., Tan, F., Zhu, X., Dudziak, L., Li, H., Tzimiropoulos, G., & Martinez, B. (2022). EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers. arXiv:2205.03436
  16. Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805
  17. Park, N., & Kim, S. (2022). How Do Vision Transformers Work? arXiv:2202.06709
  18. Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., & Wang, Y. (2021). Transformer in Transformer. arXiv:2103.00112
  19. Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z., Tay, F. E., Feng, J., & Yan, S. (2021). Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. arXiv:2101.11986
  20. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked Autoencoders Are Scalable Vision Learners. arXiv:2111.06377
  21. Kakogeorgiou, I., Gidaris, S., Psomas, B., Avrithis, Y., Bursuc, A., Karantzalos, K., & Komodakis, N. (2022). What to Hide from Your Students: Attention-Guided Masked Image Modeling. arXiv:2203.12719
  22. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research
  23. Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. arXiv:1707.02968
  24. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848
  25. Ghiasi, G., Lin, T., & Le, Q. V. (2018). DropBlock: A regularization method for convolutional networks. arXiv:1810.12890
  26. Larsson, G., Maire, M., & Shakhnarovich, G. (2016). FractalNet: Ultra-Deep Neural Networks without Residuals. arXiv:1605.07648
  27. Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., & Fergus, R. (2013). Regularization of Neural Networks using DropConnect. Proceedings of the 30th International Conference on Machine Learning, Proceedings of Machine Learning Research 28(3):1058–1066
