Meta bets big on AI with custom chips – and a supercomputer

Image credits: Bryce Durbin/TechCrunch

At a virtual event this morning, Meta unveiled its efforts to develop its internal infrastructure for AI workloads, including generative AI like the kind that powers its recently launched ad creative and creation tools.

It was something of a show of strength from Meta, which has historically been slow to adopt AI-friendly hardware systems, a lag that has hindered its ability to keep pace with rivals like Google and Microsoft.

“Building our own [hardware] capabilities gives us control over every layer of the stack, from data center design to training frameworks,” Alexis Bjorlin, vice president of infrastructure at Meta, told TechCrunch. “This level of vertical integration is needed to push the boundaries of AI research at scale.”

Over the past decade or so, Meta has spent billions of dollars recruiting top data scientists and building new kinds of AI, including the AI that now powers the discovery engines, moderation filters, and ad recommenders found across its apps and services. But the company has struggled to turn many of its most ambitious AI research innovations into products, particularly in generative AI.

Until 2022, Meta largely ran its AI workloads on a combination of CPUs — which tend to be less efficient than GPUs at these kinds of tasks — and a custom chip designed to accelerate AI algorithms. Meta pulled the plug on a large-scale rollout of the custom chip, which was planned for 2022, and instead placed orders for billions of dollars’ worth of Nvidia GPUs, a shift that required major redesigns of several of its data centers.

In an effort to turn things around, Meta has drawn up plans to develop a more ambitious in-house chip, due out in 2025, capable of both training and running AI models. That was the main topic of today’s presentation.

Meta calls the new chip the Meta Training and Inference Accelerator, or MTIA for short, and describes it as part of a “family” of chips for accelerating AI training and inference workloads. (“Inference” refers to running a trained model.) MTIA is an ASIC, or application-specific integrated circuit: a chip that combines different circuits on a single die and can be programmed to carry out one or more tasks in parallel.
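To make the training-versus-inference distinction concrete, here is a toy sketch (a one-variable linear model, nothing to do with Meta's actual workloads): training fits a model's parameters from data, while inference simply applies the already-fitted parameters to new input.

```python
# Toy illustration of training vs. inference with a 1-D linear model.
# Training: fit a parameter from data. Inference: apply the fitted model.

def train(xs, ys):
    """Least-squares fit of y = w * x (no intercept) — the 'training' phase."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den  # learned weight w

def infer(w, x):
    """The 'inference' phase: run the trained model on a new input."""
    return w * x

w = train([1, 2, 3], [2, 4, 6])  # data follows y = 2x, so w = 2.0
print(infer(w, 10))              # 20.0
```

Chips like MTIA v1 are specialized for the second phase only: applying trained models, billions of times a day, as cheaply as possible.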


Meta’s MTIA chip, built specifically for AI workloads.

“To gain better levels of efficiency and performance across our critical workloads, we needed a custom solution designed jointly with the model, software stack, and system hardware,” Bjorlin continued. “This provides a better experience for our users across a variety of services.”

Custom AI chips are increasingly the name of the game among Big Tech players. Google created a custom processor, the TPU (short for “tensor processing unit”), to train large generative AI systems like PaLM-2 and Imagen. Amazon offers proprietary chips to AWS customers for training (Trainium) and inference (Inferentia). And Microsoft is reportedly working with AMD to develop an in-house AI chip called Athena.

Meta says it created the first generation of the chip — the MTIA v1 — in 2020, built on a 7nm process. It packs 128MB of internal memory and can scale up to 128GB, and in a Meta-designed benchmark test — which, of course, should be taken with a grain of salt — Meta claims the MTIA handled “low-complexity” and “medium-complexity” AI models more efficiently than a GPU.

Meta says work remains to be done on the chip’s memory and networking, which become bottlenecks as AI models grow in size and workloads have to be split across multiple chips. (Not coincidentally, Meta recently acquired an Oslo-based team that had been building AI networking technology at British chip unicorn Graphcore.) For now, MTIA’s focus is strictly on inference — not training — for “recommendation workloads” across Meta’s family of apps.

But Meta stressed that the MTIA, which it continues to refine, “greatly” increases the company’s efficiency in terms of performance per watt when running recommendation workloads — which in turn lets Meta run (ostensibly) better, more advanced AI workloads.
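Performance per watt is simply throughput divided by power draw. A minimal sketch with hypothetical numbers (the TOPS and wattage figures below are illustrative placeholders, not published MTIA or GPU specs):

```python
# Hypothetical comparison of performance-per-watt for two accelerators.
# All numbers are illustrative placeholders, not published chip specs.

def perf_per_watt(tops, watts):
    """Throughput (tera-operations/sec) per watt of power draw."""
    return tops / watts

custom_asic = perf_per_watt(tops=100.0, watts=25.0)   # hypothetical ASIC
general_gpu = perf_per_watt(tops=300.0, watts=300.0)  # hypothetical GPU

print(custom_asic)  # 4.0 TOPS/W
print(general_gpu)  # 1.0 TOPS/W
```

This is the general argument for custom silicon: a chip that does one thing (here, the lower-throughput ASIC) can still come out well ahead once power, and therefore data center operating cost, is factored in.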

Artificial intelligence supercomputer

Perhaps one day Meta will hand the bulk of its AI workloads over to banks of MTIAs. But for now, the social network is relying on the GPUs in its research-focused supercomputer, the Research SuperCluster (RSC).

First revealed in January 2022, RSC—assembled in partnership with Penguin Computing, Nvidia, and Pure Storage—has completed phase two of construction. Meta says it now has a total of 2,000 Nvidia DGX A100 systems sporting 16,000 Nvidia A100 GPUs.

So why build an in-house supercomputer? Well, for one, there’s peer pressure. Several years ago, Microsoft made a big splash with its AI supercomputer built in partnership with OpenAI, and it more recently said it would team up with Nvidia to build a new AI supercomputer in the Azure cloud. Elsewhere, Google has been touting its own AI-focused supercomputer, which has 26,000 Nvidia H100 GPUs — putting it well ahead of Meta’s.


Meta’s AI research supercomputer, the Research SuperCluster.

But beyond keeping up with the Joneses, Meta says RSC gives it the advantage of letting its researchers train models using real-world examples from Meta’s production systems. That’s in contrast to the company’s previous AI infrastructure, which could only make use of open source and publicly available datasets.

“The RSC AI supercomputer is being used to push the boundaries of AI research in many areas, including generative AI,” said a Meta spokesperson. “It is really about AI research productivity. We wanted to provide AI researchers with a state-of-the-art infrastructure to be able to develop models and empower them with an AI development training platform.”

At its peak, RSC can reach nearly 5 exaflops of compute, which the company claims makes it among the fastest in the world. (Lest that seem overly impressive, it’s worth noting that some experts view the exaflops performance metric with a pinch of salt, and that RSC is well outperformed by many of the world’s fastest supercomputers.)
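That figure roughly checks out against RSC's hardware. As a back-of-envelope sanity check (assuming each A100 delivers about 312 teraflops of BF16 compute, Nvidia's published peak for dense tensor-core operations):

```python
# Back-of-envelope check: do 16,000 A100s add up to ~5 exaflops?
# Assumes ~312 TFLOPS of BF16 per A100 (Nvidia's peak dense tensor-core figure).

A100_TFLOPS_BF16 = 312          # peak teraflops per GPU (assumed spec)
NUM_GPUS = 16_000               # RSC's GPU count, per Meta

total_flops = NUM_GPUS * A100_TFLOPS_BF16 * 1e12  # FLOPs per second
exaflops = total_flops / 1e18

print(round(exaflops, 2))  # 4.99 — consistent with the ~5 exaflop claim
```

Peak numbers like these assume every GPU runs flat out at its theoretical maximum, which real training jobs never achieve — one reason experts treat headline exaflops figures skeptically.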

Meta says it used RSC to train LLaMA, a tortured acronym for “Large Language Model Meta AI” — a large language model that the company released in a limited, “gated” fashion to researchers earlier in the year (and which subsequently leaked to various online communities). The largest LLaMA model was trained across 2,048 A100 GPUs, Meta says, which took 21 days.
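Those two disclosed figures imply a hefty compute bill. A quick back-of-envelope using only the numbers above:

```python
# Back-of-envelope: total GPU-hours for the largest LLaMA training run,
# using only the figures Meta disclosed (2,048 A100s for 21 days).

NUM_GPUS = 2_048
DAYS = 21

gpu_hours = NUM_GPUS * DAYS * 24
print(gpu_hours)  # 1032192
```

Over a million A100-hours for a single run helps explain why Meta wants both a supercomputer it controls and, eventually, cheaper in-house silicon.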

“Building our own supercomputing capabilities gives us control over every layer of the stack, from data center design to training frameworks,” the spokesperson added. “RSC will help Meta’s AI researchers build new and better AI models that can learn from trillions of examples; work across hundreds of different languages; seamlessly analyze text, images, and video together; develop new augmented reality tools; and much more.”

Video transcoder

In addition to MTIA, Meta is developing another chip to handle particular types of computing workloads, the company revealed at today’s event. Called the Meta Scalable Video Processor, or MSVP, it is Meta’s first in-house-developed ASIC solution, designed for the processing needs of video on demand and live streaming.

Meta began exploring dedicated server-side video chips years ago, readers may recall, announcing an ASIC for video transcoding and inference work in 2019. The MSVP is the fruit of some of those efforts, as well as a renewed push for a competitive advantage in the live video space specifically.

“On Facebook alone, people spend 50% of their time on the app watching video,” Meta’s Harikrishna Reddy and Yunqing Chen wrote in a co-authored blog post published this morning. “To serve the wide variety of devices all over the world (mobile devices, laptops, TVs, etc.), videos uploaded to Facebook or Instagram, for example, are transcoded into multiple bitstreams, with different encoding formats, resolutions and quality … MSVP is programmable and scalable, and can be configured to support both the high-quality transcoding needed for video on demand as well as the low latency and faster processing times required by live streaming.”


The dedicated Meta chip is designed to speed up video workloads, such as streaming and transcoding.
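The multi-rendition pipeline Reddy and Chen describe can be sketched as an "encoding ladder": one upload is transcoded into several resolution/bitrate variants so each device and network picks a suitable stream. The specific rungs below are hypothetical, not Meta's actual configuration:

```python
# Sketch of an adaptive-bitrate "encoding ladder": one uploaded video is
# transcoded into several renditions at different resolutions and bitrates.
# The ladder below is a hypothetical example, not Meta's real configuration.

LADDER = [
    # (height_px, bitrate_kbps)
    (1080, 4500),
    (720, 2500),
    (480, 1000),
    (360, 600),
]

def plan_renditions(source_height):
    """Keep only rungs at or below the source resolution (never upscale)."""
    return [(h, kbps) for h, kbps in LADDER if h <= source_height]

print(plan_renditions(720))  # [(720, 2500), (480, 1000), (360, 600)]
```

Each rung is an independent encode of the same source, which is exactly the kind of embarrassingly parallel, repetitive work that a fixed-function ASIC like MSVP is built to take off the CPU fleet.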

Meta says its plan is to eventually offload the majority of its “stable and mature” video processing workloads to the MSVP and use software video encoding only for workloads that require specific customization or “significantly” higher quality. Work continues to improve video quality with MSVP using preprocessing methods such as smart denoising and image enhancement, Meta says, as well as post-processing methods such as artifact removal and super-resolution.

“In the future, MSVP will allow us to support more critical Meta use cases and needs, including short videos — enabling efficient delivery of generative AI, augmented/virtual reality, and other metaverse content,” Reddy and Chen said.

AI focus

If there’s a common thread in today’s hardware announcements, it’s that Meta is desperately trying to pick up the pace in terms of AI, especially generative AI.

Not that this hasn’t been telegraphed before. In February, CEO Mark Zuckerberg — who has reportedly made boosting Meta’s AI compute capacity a top priority — announced a new top-level AI team to, in his words, “turbocharge” the company’s R&D. CTO Andrew Bosworth likewise said recently that generative AI is the area where he and Zuckerberg are spending the most time. And chief scientist Yann LeCun has said that Meta plans to deploy generative AI tools to let people create items in virtual reality.

“We’re exploring chat experiences in WhatsApp and Messenger, visual creation tools for Facebook and Instagram posts and ads, and over time video and multimodal experiences as well,” Zuckerberg said during Meta’s first-quarter earnings call in April. “I expect that these tools will be valuable for everyone from regular people to creators to businesses. For example, I expect that a lot of interest in AI agents for business messaging and customer support will come once we nail that experience. Over time, this will extend to our work on the metaverse, too, where people will much more easily be able to create avatars, objects, worlds, and code to tie all of them together.”

In part, Meta is feeling increasing pressure from investors worried that the company isn’t moving fast enough to capture the (potentially large) market for generative AI. It doesn’t have an answer — yet — to chatbots like Bard, Bing Chat, or ChatGPT. Nor has it made much progress in image generation, another key segment that has seen explosive growth.

If the predictions are right, the total addressable market for generative AI software could reach $150 billion. Goldman Sachs predicts it could raise global GDP by 7%.

Even a small slice of that could erase the billions Meta has lost on investments in “metaverse” technologies like augmented reality headsets, meetings software, and VR playgrounds like Horizon Worlds. Reality Labs, the Meta division responsible for augmented reality tech, reported a net loss of $4 billion in the fourth quarter, and the company said during its first-quarter call that it expects Reality Labs’ operating losses to “increase year-over-year in 2023.”
