Meta AI Research open-sourced DINOv2, a foundation model for computer vision (CV) tasks. DINOv2 is pretrained on a curated dataset of 142 million images and can be used as a backbone for many tasks, including image classification, video action recognition, semantic segmentation, and depth estimation.
The model is based on the Vision Transformer (ViT) architecture, with modifications for self-supervised learning objectives. To train it, the team built an automated pipeline for assembling a curated dataset from images crawled from the web. The main contribution of the work is an improved training process that is twice as fast and uses a third of the memory of previous methods. When evaluated on CV benchmarks, DINOv2 outperformed other self-supervised learning (SSL) models and matched or exceeded the performance of weakly supervised models. According to Meta:
Going forward, the team plans to integrate this model, which can act as a building block, into a larger, more complex AI system that can interact with large language models. A visual backbone that provides rich information about images will allow complex AI systems to reason about images in a deeper way than describing them with a single text sentence. Models trained with text supervision are ultimately limited by the image captions. With DINOv2, there is no such built-in limitation.
Deep learning models for CV tasks have typically relied on large datasets of human-annotated images, such as ImageNet. In 2021, OpenAI released CLIP, a foundation CV model trained with weak supervision, where the annotations were automatically derived by scraping HTML tags and other web-based metadata associated with the source images. In the same year, Google published the ViT model, which uses SSL for training, and Meta published its work on the original version of DINO, which combined the ViT model with knowledge distillation, resulting in smaller models with comparable performance.
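The core idea behind DINO is self-distillation without labels: a student network is trained to match the output distribution of a momentum-averaged teacher across different augmented views of an image. Below is a minimal, illustrative PyTorch sketch of that objective; the head dimensions, temperatures, and centering are simplified assumptions, and DINOv2 itself combines this loss with additional components such as a masked-image (iBOT-style) objective.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between sharpened teacher targets and the student's predictions."""
    # Teacher targets are centered, sharpened with a low temperature, and detached.
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    """The teacher is an exponential moving average of the student; it gets no gradients."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Toy usage: random logits stand in for student/teacher projection-head outputs.
batch, out_dim = 8, 1024
student_out = torch.randn(batch, out_dim, requires_grad=True)
teacher_out = torch.randn(batch, out_dim)
center = teacher_out.mean(dim=0)            # a running mean in the real recipe
loss = dino_loss(student_out, teacher_out, center)
loss.backward()

# EMA update applied to a pair of hypothetical student/teacher projection heads.
student_head = nn.Linear(384, out_dim)
teacher_head = copy.deepcopy(student_head)
update_teacher(student_head, teacher_head)
```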
For DINOv2, Meta focused on gathering more training data and scaling up the training process. For the training data, Meta collected 1.2 billion unique images from the web, then curated them based on their similarity to images in the ImageNet dataset to arrive at a final set of 142 million images. To scale up the training, Meta implemented a custom version of FlashAttention and used Fully-Sharded Data Parallel (FSDP) training with PyTorch. Overall, the project consumed around 200,000 GPU-days of compute.
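The curation step can be pictured as embedding-based retrieval: both web images and curated reference images (e.g. from ImageNet) are mapped to feature vectors, and a web image is kept when it is close enough to some reference image. The threshold and random embeddings below are illustrative placeholders, not Meta's actual pipeline, which also involves deduplication and clustering.

```python
import torch
import torch.nn.functional as F

def select_similar(web_feats, reference_feats, threshold=0.5):
    """Keep web images whose best cosine similarity to a reference image exceeds a threshold.

    web_feats:       (N, D) embeddings of uncurated web images
    reference_feats: (M, D) embeddings of curated reference images (e.g. ImageNet)
    """
    web = F.normalize(web_feats, dim=-1)
    ref = F.normalize(reference_feats, dim=-1)
    sims = web @ ref.T                 # (N, M) cosine similarities
    best, _ = sims.max(dim=1)          # best matching reference per web image
    return best >= threshold           # boolean mask of retained web images

# Toy usage with random vectors standing in for real image embeddings.
web_feats = torch.randn(1000, 384)
reference_feats = torch.randn(100, 384)
mask = select_similar(web_feats, reference_feats)
print(f"retained {int(mask.sum())} of {len(web_feats)} web images")
```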
To evaluate DINOv2's performance as a foundation model, the team tested it on a variety of CV tasks and compared it with several SSL baseline models as well as weakly supervised models such as CLIP. On the ImageNet-1k classification task, DINOv2 showed a “very significant improvement” over other SSL models and also outperformed the weakly supervised models. It also set a new SSL state-of-the-art on three video action recognition benchmarks and beat baselines on instance-level recognition benchmarks and on three monocular depth estimation benchmarks.
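Evaluations such as the ImageNet-1k result typically use a linear probe: the SSL backbone is frozen and only a linear classifier is trained on its features. The sketch below illustrates that general protocol with a stand-in backbone and placeholder dimensions; it is not Meta's evaluation code.

```python
import torch
import torch.nn as nn

feat_dim, num_classes = 384, 1000          # e.g. ViT-S/14 features, ImageNet-1k classes
probe = nn.Linear(feat_dim, num_classes)   # the only trainable component
optimizer = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def probe_step(backbone, images, labels):
    """One linear-probe step: features come from the frozen backbone, only the head learns."""
    with torch.no_grad():                  # no gradients flow into the backbone
        features = backbone(images)        # (B, feat_dim) image-level embeddings
    loss = criterion(probe(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy check with a stand-in backbone that returns random features.
dummy_backbone = lambda x: torch.randn(x.shape[0], feat_dim)
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
print(probe_step(dummy_backbone, images, labels))
```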
In a Hacker News discussion about the work, many users praised Meta’s recent work in computer vision as well as earlier contributions such as PyTorch. One user noted a shift in Meta’s communications about its work:
As a grad student in the field, Meta has always had great contributions to open source machine learning efforts, in no small part due to Yann LeCun’s internal advocacy. What has changed recently is their PR strategy: [OpenAI] basically showed everyone that it doesn’t matter if you have the best models if your publicity is bad.
The DINOv2 code and pretrained models are available on GitHub. The project website hosts an interactive demo of several computer vision tasks using DINOv2.
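Per the repository, the pretrained backbones can be loaded through PyTorch Hub. The snippet below is a minimal sketch, assuming the smallest ViT-S/14 variant and a random tensor in place of a preprocessed image, that extracts an image-level embedding:

```python
import torch

# Load the pretrained ViT-S/14 backbone from the DINOv2 GitHub repository (downloads weights).
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

# DINOv2 uses 14x14 patches, so input height and width should be multiples of 14.
image = torch.randn(1, 3, 224, 224)        # placeholder for a normalized RGB image
with torch.no_grad():
    embedding = model(image)               # (1, 384) image-level feature for ViT-S/14
print(embedding.shape)
```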