Paper Summary — What is being transferred in transfer learning?
Nowadays we widely use transfer learning, either to achieve better results or to save the time it takes to train a model from scratch. But we don't really know what is actually being transferred during this process.
If you take an ImageNet pretrained classification model and fine-tune it on some medical domain (say, chest X-ray images for some disease), you will see it performs better than training that same model from scratch (assuming we have a small training set)! But the objects present in the ImageNet dataset look nothing like human chest X-ray images, so how can that be?
A few days ago, I came across this paper — What is being transferred in transfer learning? (link). It's a paper from Google Brain. This paper helps you understand what is happening when pre-trained models are trained on a different task. This blog is a summary of the paper. I will try to cover most of the topics, but read the paper too. It's worth the time!
Coming to the paper…
Note — If I quote anything from the paper, it'll be in italics and inside quotes. For example — “this is an example.”
I have not discussed the performance barriers and loss landscape section because I believe it is better to read that part directly from the paper; summarizing it would amount to copying, and there is no point in just copying.
Abstract
In the abstract, the authors point out that although we use transfer learning so much, we still do not understand “which part of the neural network is responsible” for it. In this paper, the authors “provide new tools and analyses to address these fundamental questions”. The paper presents the results of several experiments designed to understand what is being transferred in transfer learning. In short, one experiment was on “block-shuffled images”; the authors then separated the “effect of feature reuse from learning low-level statistics of data”; they also show that when we start from pretrained weights, the “model stays in the same basin in the loss landscape and different instances of such model are similar in feature space and close in parameter space.”
Introduction
We use transfer learning to transfer knowledge from one domain (the source domain) to another domain (the target domain) when we have little data or need to produce results in less time. In total, three datasets were used to carry out the experiments — ImageNet, DomainNet (which contains real images, sketches, clip art and other renderings of the same categories, making it useful for studying transfer learning) and CheXpert (chest X-ray images).
Problem Setup
The authors use ImageNet (source domain) pre-trained weights as the starting weights for transfer learning and consider CheXpert and datasets from DomainNet as the target domains (or downstream tasks). In other words, they take ImageNet pretrained weights and train the model on the CheXpert and DomainNet datasets to study transfer learning.
The authors “analyze networks in four different cases: pre-trained network, the network at random initialization, the network that is fine-tuned on target domain after pre-training on source domain and the model that is trained on target domain from random initialization.” These four cases are referred to throughout the paper as: “RI (random initialization), P (pre-trained model), RI-T (model trained on target domain from random initialization), P-T (model trained/fine-tuned on target domain starting from pre-trained weights).”
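To make the four setups concrete, here is a minimal PyTorch sketch of how I would set them up myself. The ResNet-50 backbone and the number of target classes are my assumptions for illustration, not details taken from the paper.

```python
# A minimal sketch of the four setups: P, RI, P-T, RI-T.
# Assumptions (mine, not the paper's): ResNet-50 backbone, 5 target classes.
import torch.nn as nn
from torchvision import models

num_target_classes = 5  # hypothetical number of classes in the target domain

def build_model(pretrained: bool) -> nn.Module:
    """P / P-T start from ImageNet weights; RI / RI-T start from random init."""
    weights = models.ResNet50_Weights.IMAGENET1K_V1 if pretrained else None
    backbone = models.resnet50(weights=weights)
    # replace the classification head for the target domain
    backbone.fc = nn.Linear(backbone.fc.in_features, num_target_classes)
    return backbone

P  = build_model(pretrained=True)   # pre-trained network (before fine-tuning)
RI = build_model(pretrained=False)  # network at random initialization
# P-T  : fine-tune P  on the target dataset (CheXpert / DomainNet)
# RI-T : train     RI on the target dataset from scratch
```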
Role of Feature reuse
We usually use transfer learning when we have a very small dataset for the target domain. But we still don't know how it works even when the source domain (the dataset the weights were pre-trained on) is visually very different from the target domain (the dataset for the current task).
To study the role of feature reuse, ImageNet pre-trained model was used and target domains were CheXpert and DomainNet.
When RI-T (model trained on target domain from random initialization) was compared to P-T (model trained/fine-tuned on target domain starting from pre-trained weights), a huge performance boost was seen on DomainNet's real dataset. “This confirms the intuition that feature reuse plays an important role in transfer learning”. However, even when ImageNet pretrained weights were used to fine-tune the model on the CheXpert dataset, which looks nothing like ImageNet, P-T converged faster than RI-T in all cases. “This suggests additional benefits of pre-trained weights that are not directly coming from feature reuse”.
To further verify this hypothesis, “a series of modified downstream tasks” was created, each progressively more distant from the source domain visually. To be precise, each image was partitioned into equal-sized blocks and those blocks were shuffled randomly across the image. This changes the image visually, but the low-level statistics remain the same. (Block size 1 means individual pixels are shuffled across the image.) The target domain now looks visually very distinct from the source domain; a sketch of the block-shuffling operation follows below.
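Here is my own minimal NumPy re-implementation of that block-shuffling idea (not the authors' code), where block_size=1 reduces to shuffling raw pixels:

```python
# Block-shuffling sketch: split an HxWxC image into block_size x block_size tiles
# and randomly permute the tiles. My own illustration of the paper's idea.
import numpy as np

def block_shuffle(img: np.ndarray, block_size: int, rng=None) -> np.ndarray:
    rng = np.random.default_rng(0) if rng is None else rng
    h, w, c = img.shape
    assert h % block_size == 0 and w % block_size == 0, "image must divide into blocks"
    nh, nw = h // block_size, w // block_size
    # reshape into a (nh, nw) grid of (block_size, block_size, c) tiles
    tiles = img.reshape(nh, block_size, nw, block_size, c).transpose(0, 2, 1, 3, 4)
    tiles = tiles.reshape(nh * nw, block_size, block_size, c)
    rng.shuffle(tiles)  # permute the tiles; block_size == 1 shuffles individual pixels
    # reassemble the shuffled tiles back into an image
    tiles = tiles.reshape(nh, nw, block_size, block_size, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(h, w, c)
```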
After performing this shuffling, it was observed that: 1) the final performance of both RI-T and P-T drops as the block size becomes smaller, indicating that the task is becoming harder to learn; 2) the relative accuracy difference decreases with decreasing block size on the real and clipart datasets from DomainNet, “showing consistency with the intuition that decreasing feature reuse leads to diminishing benefits”; 3) on the quickdraw dataset from DomainNet, there is not much decrease. Quickdraw was already very dissimilar to the ImageNet dataset, indicating that some other factors of the pre-trained weights are helping the model on the downstream task.
The authors conclude that “feature reuse plays a very important role in transfer learning, especially when the downstream task shares similar visual features with the pre-training domain.” They also point out that low-level statistics which carry no visual information play a role in transfer learning: even after shuffling the image pixels completely and shuffling the input channels, they did not see a significant decreasing trend in the metrics for the quickdraw dataset.
Opening up the model
In the second experiment, the authors investigate the mistakes made by the models. That is, they look at the “common mistakes”, where both models classify the same data point incorrectly, and the “uncommon mistakes”, where one model classifies a data point correctly and the other incorrectly. They first “compare the ratio of common and uncommon mistakes between two P-Ts, a P-T and a RI-T and two RI-Ts”. They see that a good number of uncommon mistakes occur between a P-T and a RI-T, whereas two P-Ts have very few uncommon mistakes. This trend is visible for both the CheXpert and DomainNet datasets.
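To make this concrete, here is a tiny sketch of how such a common/uncommon-mistake comparison could be computed. This is my own illustration; preds_a, preds_b and labels are assumed to be prediction and label arrays from two trained models.

```python
# Count common mistakes (both models wrong on the same example)
# and uncommon mistakes (exactly one model wrong).
import numpy as np

def mistake_overlap(preds_a: np.ndarray, preds_b: np.ndarray, labels: np.ndarray):
    wrong_a, wrong_b = preds_a != labels, preds_b != labels
    common = int(np.sum(wrong_a & wrong_b))    # both models misclassify
    uncommon = int(np.sum(wrong_a ^ wrong_b))  # only one of the two is wrong
    return common, uncommon

# e.g. compare two P-T instances, a P-T vs an RI-T, and two RI-T instances
```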
Feature similarity: The authors use the Centered Kernel Alignment (CKA) technique to measure similarity between “two output features in a layer of network architecture given two instances of such network”. It was observed that “two instances of P-T are highly similar across different layers”. It was also noted that the feature similarity is stronger towards the last layers of the network in the case of P-T and RI-T. An important point to note here — “These experiments show that the initialization point, whether pre-trained or random, drastically impacts feature similarity, and although both networks are showing high accuracy, they are not that similar in the feature space. This emphasizes on role of feature reuse and that two P-T are reusing the same features.”
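For reference, here is a minimal sketch of linear CKA on two feature matrices, based on the standard linear-CKA formula rather than the authors' exact implementation:

```python
# Linear CKA between two feature matrices X, Y of shape (n_examples, n_features):
# CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) after centering.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    X = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)  # 1.0 means identical representations (up to rotation/scale)
```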
Module Criticality
In this section, the authors discuss how to determine which modules of the architecture are more critical than the rest.
The most important outcome: the lower layers of the model are responsible for more general features, while the higher layers learn features that are specific to the target domain/dataset the model is being fine-tuned on.
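The paper defines module criticality more carefully (it moves each module between its initial and final weights and looks at how performance behaves along that path), but as a rough proxy you could rewind one module to its initial weights while keeping the rest fine-tuned and see how much target accuracy drops. A hypothetical sketch, assuming a saved initial checkpoint and an evaluate helper of your own:

```python
# Simplified proxy for module criticality (not the paper's full definition):
# rewind a single module to its initial weights and measure the accuracy drop.
# `final_model`, `init_state_dict`, `module_prefix`, and `evaluate` are assumed
# to be provided by your own training/evaluation code.
import copy

def rewind_module_and_eval(final_model, init_state_dict, module_prefix, evaluate):
    probe = copy.deepcopy(final_model)
    probe_state = probe.state_dict()
    for name, init_w in init_state_dict.items():
        if name.startswith(module_prefix):  # e.g. "layer1." vs "layer4." in a ResNet
            probe_state[name] = init_w
    probe.load_state_dict(probe_state)
    return evaluate(probe)  # a larger drop suggests a more critical module
```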
Conclusion
Do check out the paper to gain more insight into the experiments performed. This paper tells us that:
- We can use (or at least consider trying) transfer learning even if the target domain is visually very different from the source domain (the dataset used to obtain the pre-trained weights).
- Features in the lower layers, along with low-level statistics that are not visually apparent, are largely responsible for the strong performance of transfer learning.
- Since the upper layers are more domain specific, we can freeze the lower layers and train the upper layers for better performance (see the sketch after this list). But deciding from which layer to start unfreezing would require extensive experiments and resources.
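For example, here is a minimal PyTorch sketch of freezing the lower layers and fine-tuning only the upper ones. The split at layer4/fc and the hyperparameters are my assumptions, not a recommendation from the paper.

```python
# Freeze lower ResNet-50 layers, fine-tune only the top block and the new head.
import torch
from torchvision import models

num_target_classes = 5  # hypothetical number of classes in the target domain

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(model.fc.in_features, num_target_classes)  # new head

for name, param in model.named_parameters():
    # train only the last residual block and the classification head
    param.requires_grad = name.startswith(("layer4", "fc"))

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9
)
```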
Thanks for reading the blog. Hope you find it helpful!
Connect with me on LinkedIn :).