Paper Summary — Efficient Long Range Attention Network (ELAN)

Sahil Chachra
Jun 29, 2023


Source — ELAN paper

In this blog, I will summarize the paper “Efficient Long-Range Attention Network for Image Super-resolution”. The authors propose the Efficient Long-Range Attention Network (ELAN), whose basic block combines two components: a shift-conv layer, which extracts local structural information from the image at roughly the complexity of a 1x1 convolution, and a group-wise multi-scale self-attention (GMSA) module, which exploits long-range image dependencies.

ELAN was originally introduced for super-resolution, but it is now also an important part of the well-known YOLOv7 model, which is the main reason for covering this paper. In YOLOv7, the authors use a slightly modified version of ELAN.

Efficient Long Range Attention Network for Image Super-resolution paper — Link

Note: “Any content copied from the paper will be in italics and inside quotes.”

The main purpose of ELAN is to learn long-range dependencies, meaning the model learns relationships between parts of an image that may be spatially far apart but correlated.

The Efficient Long-Range Attention Block (ELAB) is built by “cascading two shift-conv with a GMSA module, which is further accelerated by using a shared attention mechanism”.

Architecture Overview

Source — ELAN paper

The picture above shows the entire ELAN model. First, low-level features are extracted with a basic convolution operation. The features from this convolution block are passed through the cascade of ELABs, and a copy of those features is concatenated with the output of the ELAB stack. The combined features are then used to reconstruct the high-resolution image.
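To make the data flow concrete, here is a minimal PyTorch sketch of this pipeline. It is not the authors' code: the channel count, number of blocks, the PixelShuffle reconstruction head, the concatenation used for the skip connection, and the `ELAB` placeholder are my own simplifications.

```python
import torch
import torch.nn as nn

class ELANSketch(nn.Module):
    """Illustrative pipeline only: conv feature extractor -> stack of ELAB
    blocks -> combine with the shallow features -> reconstruct the HR image.
    `elab` is a stand-in for the real block described later in the post."""
    def __init__(self, channels=60, num_blocks=4, scale=4, elab=None):
        super().__init__()
        self.shallow = nn.Conv2d(3, channels, 3, padding=1)        # low-level features
        blocks = [elab(channels) if elab else nn.Identity() for _ in range(num_blocks)]
        self.body = nn.Sequential(*blocks)                         # cascade of ELABs
        self.reconstruct = nn.Sequential(                          # image reconstruction
            nn.Conv2d(2 * channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),                                # upsample to HR resolution
        )

    def forward(self, x):
        shallow = self.shallow(x)
        deep = self.body(shallow)
        fused = torch.cat([shallow, deep], dim=1)                  # skip connection (concat here)
        return self.reconstruct(fused)
```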

The loss function used was L1, which minimizes the pixel-wise difference between the super-resolved output and the ground-truth image. The optimizer used was Adam.
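Using the `ELANSketch` from the previous snippet, the training setup boils down to something like this; the learning rate value is a placeholder of mine, not a number from the paper summary.

```python
import torch.nn as nn
import torch.optim as optim

model = ELANSketch()                                  # the sketch from above
criterion = nn.L1Loss()                               # pixel-wise L1 loss
optimizer = optim.Adam(model.parameters(), lr=2e-4)   # lr is a placeholder value

# One training step, where lr_img is the low-resolution input batch
# and hr_img is the matching high-resolution ground truth:
#   loss = criterion(model(lr_img), hr_img)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```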

Explaining the Blocks of ELAB

Source — ELAN paper

The ELAB block is made of two parts: local feature extraction and the GMSA block.

Local Feature extraction

It comprises a shift-conv layer and a ReLU activation function. The shift-conv consists of a set of shift operations followed by one 1x1 convolution. The input features are split into five groups: four of the groups are shifted left, right, up, and down respectively, the fifth remains as it is, and the result is passed through the 1x1 convolution. This provides a larger receptive field.

Shift-conv: the shift operation (paper link) moves each channel of its input tensor in a different spatial direction. For example, consider a 3D tensor: when the shift operation is applied, each channel is shifted independently, meaning channel 1 may shift to the right, channel 2 to the top, and channel 3 to the left.

Source — ELAN

By applying this operation, we introduce spatial shifts and rearrangements to the features. This helps capture different spatial relationships and patterns, improving the overall understanding of the image.
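Here is a minimal sketch of shift-conv, assuming the five-group split described above. The one-pixel shift distance and the border handling (border values are simply kept) are my simplifications, not details confirmed by the paper summary.

```python
import torch
import torch.nn as nn

class ShiftConv(nn.Module):
    """Sketch of shift-conv: split channels into 5 groups, shift four of them
    by one pixel (left / right / up / down), keep the fifth in place, then mix
    everything with a 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        g = x.shape[1] // 5                                # channels per group
        out = x.clone()                                    # border pixels keep their original values
        out[:, 0*g:1*g, :, :-1] = x[:, 0*g:1*g, :, 1:]     # shift left
        out[:, 1*g:2*g, :, 1:]  = x[:, 1*g:2*g, :, :-1]    # shift right
        out[:, 2*g:3*g, :-1, :] = x[:, 2*g:3*g, 1:, :]     # shift up
        out[:, 3*g:4*g, 1:, :]  = x[:, 3*g:4*g, :-1, :]    # shift down
        # the remaining channels (from 4*g onward) stay as they are
        return self.conv1x1(out)
```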

Group-wise multi-scale self attention (GMSA)

Group-wise means that the input is divided into groups. Multi-scale means the attention mechanism is applied with different window sizes. Self-attention calculates how important the pixels at different spatial locations are to each other. Hence, the GMSA operator “divides features into different groups of different window sizes and calculates Self Attention separately”. “This strategy provides a larger receptive field than a small fixed size window for long-range attention modeling, while being more flexible than a large fixed-size window for reducing the computation resource cost.”

Here, self-attention is calculated within non-overlapping windows of each feature group (the feature maps are divided into groups along the channel dimension). By using different window sizes for different groups, GMSA can capture long-range dependencies in the input while using less computation. A sketch of this idea follows.
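In this illustrative sketch, channels are chunked into groups, each group runs plain windowed self-attention with its own window size, and a 1x1 convolution merges the results. The window sizes, the unprojected attention (no separate query/key/value convolutions), and the assumption that H and W are divisible by each window size are all simplifications of mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def window_self_attention(x, window):
    """Plain (unprojected) self-attention inside non-overlapping windows.
    x: (B, C, H, W); H and W are assumed divisible by `window`."""
    B, C, H, W = x.shape
    # partition into (B * num_windows, window*window, C) token sequences
    x = x.view(B, C, H // window, window, W // window, window)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, window * window, C)
    attn = F.softmax(x @ x.transpose(1, 2) / C ** 0.5, dim=-1)   # (Bw, N, N)
    x = attn @ x                                                 # (Bw, N, C)
    # reverse the window partition
    x = x.view(B, H // window, W // window, window, window, C)
    return x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

class GMSASketch(nn.Module):
    """Sketch of GMSA: split channels into groups, run windowed self-attention
    with a different window size per group, then merge with a 1x1 conv.
    The window sizes here are illustrative placeholders."""
    def __init__(self, channels, windows=(4, 8, 16)):
        super().__init__()
        self.windows = windows
        self.merge = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        groups = torch.chunk(x, len(self.windows), dim=1)
        out = [window_self_attention(g, w) for g, w in zip(groups, self.windows)]
        return self.merge(torch.cat(out, dim=1))
```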

Source — ELAN paper. ASA is Accelerated Self Attention

Accelerated Self Attention

The authors have made a few modifications to the original self-attention module. They replaced Layer Normalization with a Batch Normalization layer: Layer Normalization requires many element-wise operations, whereas Batch Normalization can be merged into the convolution layer and adds no extra computation cost during inference.

“Self Attention in SwinIR is calculated on the embedded Gaussian space, where three independent 1x1 convolutions, denoted by theta, phi and g, are employed to map the input feature X into three different feature maps. We set theta = phi and calculate the SA in the symmetric embedded Gaussian space, which can save one 1x1 convolution in each SA block.”
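Below is a sketch of the accelerated self-attention idea, with BatchNorm in place of LayerNorm and a single 1x1 convolution shared between theta and phi. Attention is computed over the whole feature map here for brevity, whereas in ELAN it runs inside the GMSA windows; the exact layer arrangement is my assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class AcceleratedSA(nn.Module):
    """Sketch of accelerated self-attention: BatchNorm instead of LayerNorm
    (it folds into the conv at inference time), and one 1x1 conv shared for
    theta and phi (symmetric embedding), plus one 1x1 conv for g."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels)
        self.theta_phi = nn.Conv2d(channels, channels, 1)   # shared: theta == phi
        self.g = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        x = self.norm(x)
        k = self.theta_phi(x).flatten(2)          # (B, C, HW), serves as both query and key
        v = self.g(x).flatten(2)                  # (B, C, HW)
        attn = F.softmax(k.transpose(1, 2) @ k / C ** 0.5, dim=-1)   # (B, HW, HW)
        return (v @ attn.transpose(1, 2)).view(B, C, H, W)
```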

Shared attention

The accelerated attention mechanism still has two 1x1 convolutions and four reshaping operations, which cost extra time because of the large feature sizes in super-resolution-like tasks. Hence, the authors propose to share the attention weights among adjacent modules.

Source — ELAN paper

This sharing helps reduce computation and memory footprint and hence improves efficiency.
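Conceptually, a module that reuses a previously computed attention map only needs its own value projection, which is roughly what the sharing amounts to. The sketch below is illustrative, not the authors' implementation.

```python
import torch.nn as nn

class SharedASA(nn.Module):
    """Sketch of attention sharing: this module reuses an attention map that
    was already computed by a preceding ASA block, so it only needs its own
    value (g) projection and skips the theta/phi conv and the softmax."""
    def __init__(self, channels):
        super().__init__()
        self.g = nn.Conv2d(channels, channels, 1)

    def forward(self, x, attn):
        # attn: (B, HW, HW) attention weights from the previous ASA block
        B, C, H, W = x.shape
        v = self.g(x).flatten(2)                              # (B, C, HW)
        return (v @ attn.transpose(1, 2)).view(B, C, H, W)
```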

The main drawback of standard self-attention in Transformers is that its cost grows quadratically with the number of pixels in the feature map, so even a modest increase in input size blows up the computation. This limits its use in tasks such as super-resolution, where the input and intermediate features are large!

Shifted Window

Source — ELAN paper

The authors first apply a circular shift to the features along the diagonal direction and calculate GMSA on the shifted features. Then they shift the result back via an inverse circular shift. The circular shift by half the window size leads to a new partition of the feature map and introduces connections among neighboring non-overlapping windows of the previous GMSA module. Although some pixels on the border are shifted to distant areas by the circular shift, they found it has negligible impact on the final SR performance, since such pixels only occupy a small portion of the entire feature map in the SR task. (This part closely follows the paper, as the authors have themselves explained it in simple terms.)
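A sketch of the shifted-window step, assuming a GMSA module like the one sketched earlier and a half-window diagonal shift implemented with torch.roll:

```python
import torch

def shifted_gmsa(x, gmsa, window=8):
    """Sketch of shifted-window GMSA: circularly shift the feature map by half
    the window size along both spatial axes (the diagonal shift), run GMSA on
    the shifted features, then undo the shift. `gmsa` can be any callable such
    as the GMSASketch above; the window size here is a placeholder."""
    shift = window // 2
    x = torch.roll(x, shifts=(-shift, -shift), dims=(2, 3))   # circular shift
    x = gmsa(x)                                               # attention on the new window partition
    return torch.roll(x, shifts=(shift, shift), dims=(2, 3))  # inverse circular shift
```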

I have covered this paper only briefly and highly recommend that readers go through the paper once to check the results and experiments. Also, feel free to share any resources that helped you understand this paper or model better.

Thanks for reading :)
