Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

Researchers from Meta have developed a new, simpler, and faster hierarchical vision transformer called Hiera.

Scientists have developed a new, simpler and faster hierarchical vision transformer called Hiera. By pretraining with a strong visual pretext task (MAE), the researchers were able to strip out unnecessary components from previous models, resulting in a more accurate and faster transformer. The researchers evaluated Hiera on a variety of image and video recognition tasks.
Figure from the paper (Hiera Setup): "Modern hierarchical transformers like Swin (Liu et al., 2021) or MViT (Li et al., 2022c) are more parameter efficient than vanilla ViTs (Dosovitskiy et al., 2021), but end up slower due to overhead from adding spatial bias through vision-specific modules like shifted windows or convs. In contrast, we design Hiera to be as simple as possible. To add spatial bias, we opt to teach it to the model using a strong pretext task like MAE (pictured here) instead. Hiera consists entirely of standard ViT blocks. For efficiency, we use local attention within 'mask units' (Fig. 4, 5) for the first two stages and global attention for the rest. At each stage transition, Q and the skip connection have their features doubled by a linear layer and spatial dimension pooled by a 2 × 2 maxpool."
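
To make the stage-transition mechanics concrete, here is a minimal PyTorch sketch of the idea described in the caption: the query path and the skip connection are projected to double the channel width and 2 × 2 max-pooled before standard attention. This is an illustrative re-implementation under assumed tensor layouts (batch, H*W tokens, channels); the class name `StageTransitionAttention` and its internals are hypothetical and are not taken from the facebookresearch/hiera code.

```python
import torch
import torch.nn as nn


class StageTransitionAttention(nn.Module):
    """Illustrative sketch of a Hiera-style stage transition.

    At a stage transition, Q and the skip connection have their features
    doubled by a linear layer and their spatial dimension pooled by a
    2 x 2 maxpool; K and V keep the pre-pool resolution. Hypothetical
    simplification for exposition, not the official implementation.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.dim_out = dim * 2                      # features doubled at the transition
        self.proj = nn.Linear(dim, self.dim_out)    # shared projection for Q and the skip path
        self.kv = nn.Linear(dim, self.dim_out * 2)  # keys/values from the un-pooled tokens
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2 x 2 spatial maxpool
        self.attn = nn.MultiheadAttention(self.dim_out, num_heads, batch_first=True)
        self.out = nn.Linear(self.dim_out, self.dim_out)

    def _pool_tokens(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # (B, H*W, C) -> (B, C, H, W) -> maxpool -> (B, H/2 * W/2, C)
        B, _, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.pool(x)
        return x.flatten(2).transpose(1, 2)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # Q and the skip connection: double the features, then pool spatially.
        q = self._pool_tokens(self.proj(x), H, W)
        skip = q                                    # pooled, projected tokens serve as the residual
        k, v = self.kv(x).chunk(2, dim=-1)          # K/V stay at the pre-pool resolution
        attn_out, _ = self.attn(q, k, v, need_weights=False)
        return skip + self.out(attn_out)


# Example: 56 x 56 tokens with 96 channels -> 28 x 28 tokens with 192 channels
block = StageTransitionAttention(dim=96)
tokens = torch.randn(2, 56 * 56, 96)
print(block(tokens, H=56, W=56).shape)  # torch.Size([2, 784, 192])
```

The same pattern repeats at every stage boundary, so resolution drops and channel width grows as in other hierarchical backbones, but using only standard ViT building blocks.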


Paper

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vani…

Source Code

GitHub - facebookresearch/hiera: Hiera: A fast, powerful, and simple hierarchical vision transformer.
