Meta, the social media giant formerly Facebook, has been a leader in artificial intelligence (AI) for more than a decade, using it to power its products and services like News Feed, Facebook ads, Messenger and virtual reality. But as the demand for more advanced and scalable AI solutions grows, so does the need for more innovative and efficient AI infrastructure.
At today’s AI Infra @ Scale event — a one-day virtual conference hosted by Meta’s engineering and infrastructure teams — the company announced a series of new hardware and software projects aimed at powering the next generation of AI applications. The event featured speakers from Meta who shared their opinions and experiences on building and deploying AI systems at scale.
Among the announcements was a new AI data center design that will be optimized for both AI training and inference, the two main phases of developing and running AI models. The new data centers will leverage Meta’s own silicon, the Meta Training and Inference Accelerator (MTIA), a chip that will help accelerate AI workloads across areas as diverse as computer vision, natural language processing and recommendation systems.
Meta also revealed that it has already built the Research SuperCluster (RSC), an AI supercomputer that integrates 16,000 GPUs to help train large language models (LLMs) like the LLaMA project, which Meta announced at the end of February.
“We’ve been building advanced AI infrastructure for years now, and this work reflects long-term efforts that will enable further advances and better use of this technology in everything we do,” Meta CEO Mark Zuckerberg said in a statement.
Building the infrastructure for AI will be important in 2023
Meta is far from the only high-profile or large-scale IT company considering purpose-built AI infrastructure. In November, Microsoft and Nvidia announced a partnership to build an AI supercomputer in the cloud. The system makes use of (not surprisingly) Nvidia GPUs connected with Nvidia’s Quantum-2 InfiniBand networking technology.
A few months later, in February, IBM outlined the details of its AI supercomputer, codenamed Vela. It uses x86 silicon, along with Nvidia GPUs and Ethernet-based networking. Each node in the Vela system is packed with eight 80GB A100 GPUs. IBM’s goal is to build new foundation models that can help meet the AI needs of enterprises.
Not to be outdone, Google also jumped into the AI supercomputer race with an announcement on May 10. The Google system uses Nvidia GPUs along with custom-designed Infrastructure Processing Units (IPUs) to enable the fast flow of data.
Meta is now also jumping into the custom silicon space with its MTIA chip. Purpose-built AI inference chips are nothing new either. Google has been building its own Tensor Processing Unit (TPU) for several years, and Amazon has had its own AWS Inferentia chips since 2018.
For Meta, the need for AI inference extends to multiple aspects of its operations for its social media sites, including news feeds, rankings, content understanding, and recommendations. In a video demonstrating the MTIA silicon, Meta Infrastructure Research Scientist Amin Firoozshahian commented that traditional CPUs are not designed to handle the inference demands of the applications that Meta runs. That’s why the company decided to build its own custom silicon.
“The MTIA is a chip that is optimized for the workloads we care about and is specifically designed to meet those needs,” said Firoozshahian.
Meta is also a big user of the open source machine learning (ML) framework PyTorch, which it originally created. Since 2022, PyTorch has been under the governance of the Linux Foundation’s PyTorch Foundation. Part of the goal with MTIA is to have highly optimized silicon for running PyTorch workloads at Meta’s large scale.
The MTIA silicon is built on a 7-nanometer (nm) process and can deliver up to 102.4 TOPS (trillion operations per second). The MTIA is part of a highly integrated approach within Meta to improve AI operations, including networking, data center optimization, and energy use.
The data center of the future is designed for AI
Meta has been building its own data centers for over a decade to serve the needs of billions of users. So far, that approach has worked well, but the exponential growth in AI requirements means it’s time to do more.
“Our current generation of data center designs are world-class, energy and power efficient,” said Rachel Peterson, Vice President of Data Center Strategy at Meta, during a roundtable discussion at the AI Infra @ Scale event. “It’s really supported us with multiple generations of servers and storage and networking and is really able to serve our existing AI workloads really well.”
As the use of AI across Meta grows, more computing capacity will be required. Peterson noted that Meta sees a future where AI chips are expected to consume more than 5 times the power of Meta’s typical CPU servers. That expectation has led Meta to rethink data center cooling and bring liquid cooling to the chips in order to reach the right level of energy efficiency. Providing the right cooling and power for AI is the driving force behind Meta’s new data center designs.
“As we look ahead, it has always been about planning for the future of AI hardware and systems and how we can get the most performant systems in our fleet,” Peterson said.