Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

Researchers from Meta have developed a new, simpler, and faster hierarchical vision transformer called Hiera.

Scientists have developed a new, simpler and faster hierarchical vision transformer called Hiera. By pretraining with a strong visual pretext task (MAE), the researchers were able to strip out unnecessary components from previous models, resulting in a more accurate and faster transformer. The researchers evaluated Hiera on a variety of image and video recognition tasks.
Figure from the paper (Hiera Setup): "Modern hierarchical transformers like Swin (Liu et al., 2021) or MViT (Li et al., 2022c) are more parameter efficient than vanilla ViTs (Dosovitskiy et al., 2021), but end up slower due to overhead from adding spatial bias through vision-specific modules like shifted windows or convs. In contrast, we design Hiera to be as simple as possible. To add spatial bias, we opt to teach it to the model using a strong pretext task like MAE (pictured here) instead. Hiera consists entirely of standard ViT blocks. For efficiency, we use local attention within 'mask units' (Fig. 4, 5) for the first two stages and global attention for the rest. At each stage transition, Q and the skip connection have their features doubled by a linear layer and spatial dimension pooled by a 2 × 2 maxpool."
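
To make the stage-transition mechanics concrete, here is a minimal PyTorch sketch of the idea described in the caption: the query path and the skip connection are projected to double the channel width and 2 × 2 max-pooled before standard attention. This is an illustrative re-implementation under assumed tensor layouts (batch, H*W tokens, channels); the class name `StageTransitionAttention` and its internals are hypothetical and are not taken from the facebookresearch/hiera code.

```python
import torch
import torch.nn as nn


class StageTransitionAttention(nn.Module):
    """Illustrative sketch of a Hiera-style stage transition.

    At a stage transition, Q and the skip connection have their features
    doubled by a linear layer and their spatial dimension pooled by a
    2 x 2 maxpool; K and V keep the pre-pool resolution. Hypothetical
    simplification for exposition, not the official implementation.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.dim_out = dim * 2                      # features doubled at the transition
        self.proj = nn.Linear(dim, self.dim_out)    # shared projection for Q and the skip path
        self.kv = nn.Linear(dim, self.dim_out * 2)  # keys/values from the un-pooled tokens
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2 x 2 spatial maxpool
        self.attn = nn.MultiheadAttention(self.dim_out, num_heads, batch_first=True)
        self.out = nn.Linear(self.dim_out, self.dim_out)

    def _pool_tokens(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # (B, H*W, C) -> (B, C, H, W) -> maxpool -> (B, H/2 * W/2, C)
        B, _, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.pool(x)
        return x.flatten(2).transpose(1, 2)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # Q and the skip connection: double the features, then pool spatially.
        q = self._pool_tokens(self.proj(x), H, W)
        skip = q                                    # pooled, projected tokens serve as the residual
        k, v = self.kv(x).chunk(2, dim=-1)          # K/V stay at the pre-pool resolution
        attn_out, _ = self.attn(q, k, v, need_weights=False)
        return skip + self.out(attn_out)


# Example: 56 x 56 tokens with 96 channels -> 28 x 28 tokens with 192 channels
block = StageTransitionAttention(dim=96)
tokens = torch.randn(2, 56 * 56, 96)
print(block(tokens, H=56, W=56).shape)  # torch.Size([2, 784, 192])
```

The same pattern repeats at every stage boundary, so resolution drops and channel width grows as in other hierarchical backbones, but using only standard ViT building blocks.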


Paper

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vani…

Source Code

GitHub - facebookresearch/hiera: Hiera: A fast, powerful, and simple hierarchical vision transformer.
