As we indicated a year ago when some key silicon experts from Intel and Broadcom were hired to work at Meta Platforms, the company formerly known as Facebook has always been the most obvious place for custom silicon work. Among the eight largest Internet companies in the world, which are also the eight largest buyers of IT equipment in the world, Meta Platforms is the only one that is a pure hyperscaler and does not sell infrastructure capacity on a public cloud.
As such, Meta Platforms has its own software stacks, top to bottom, and can do whatever it wants to create the hardware that drives them. And it is rich enough to do this, and spends enough money on silicon and on the people who tune software for that silicon, that it can no doubt save a lot of money by controlling more of its own stack.
Meta Platforms is revealing homegrown AI inference and video encoding chips at today's AI Infra @ Scale event, as well as talking about the deployment of its Research Super Computer, new datacenter designs to accommodate heavy AI workloads, and the evolution of its AI frameworks. We'll dig through all of this content over the next few days, and we'll also do a Q&A with Alexis Black Bjorlin, vice president of infrastructure at Meta Platforms and one of the key executives added to the company's custom silicon team a year ago. But for now, we'll focus on the Meta Training and Inference Accelerator, or MTIA v1 for short.
To all the companies that thought you were going to sell a lot of GPUs or NNPs to Facebook, as we say in New York City: Fuhgeddaboudit.
And in the long run, CPUs, DPUs, and switch and router ASICs could be added to the list of semiconductor components that Meta Platforms will not be buying. And if Meta Platforms really wanted to wreak havoc, it could sell its silicon to others. . . . Stranger things have happened. Like the pet rock craze of 1975, just to give one example.
NNPs And GPUs Can't Handle The Load At A Good Total Cost Of Ownership
The MTIA AI inference engine effort kicked off in 2020, just as the coronavirus pandemic made everything harder and AI moved beyond image recognition, speech recognition, and text translation into the generative capabilities of large language models, which seem to know how to do many things they weren't explicitly designed to do. Deep learning recommendation models, or DLRMs, are in some ways a harder compute and memory problem than LLMs because of their reliance on embeddings – a kind of dense numerical representation of the context of a dataset – that must be stored in the main memory of the devices running the neural networks. LLMs do not lean on embedding tables the way DLRMs do, and that memory footprint is why host CPU memory capacity and fast, high-bandwidth links between CPUs and accelerators matter more for DLRMs than they do for LLMs.
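To make the memory-bound nature of DLRM inference concrete, here is a toy sketch (our illustration, not Meta's actual model) of an embedding table lookup. The vocabulary size and embedding dimension below are hypothetical, but they show the core issue: the table dominates the footprint, while a single request does very little arithmetic per byte fetched.

```python
import numpy as np

# Toy illustration of why DLRM inference is memory-bound: an embedding
# table holds one dense vector per categorical ID, so capacity scales
# with the vocabulary, not with the amount of math performed.
num_ids = 10_000_000   # hypothetical vocabulary of ONE sparse feature
dim = 64               # hypothetical embedding dimension
table = np.zeros((num_ids, dim), dtype=np.float16)

# One inference request just gathers a handful of rows and pools them --
# a few hundred FLOPs against gigabytes of resident state.
ids = np.array([3, 17, 999_999])
pooled = table[ids].sum(axis=0)

print(round(table.nbytes / 2**30, 2))  # ~1.19 GiB for one feature's table
print(pooled.shape)                    # (64,)
```

A production DLRM has many such sparse features, which is why tens of gigabytes of fast memory near the accelerator matter so much.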
Joel Coburn, a software engineer at Meta Platforms, showed this chart as part of the AI Infra @ Scale event, illustrating how the company's DLRM inference models have grown in size and computational demand over the past three years and how it expects them to grow over the next 18 months:
Keep in mind that this is for inference, not training. These models have long since outgrown the relatively modest amount of low-precision compute delivered by the on-chip vector engines in most CPUs, and the on-chip matrix math engines like those now on Intel Xeon SP v4 and IBM Power10 chips, and coming soon to AMD Epycs, may not be enough, either.
Anyway, this is the kind of chart we see all the time, although we haven't seen one for DLRMs before. It is a scary chart, but it is not quite as terrifying as this one:
On the left side of the chart, Meta Platforms was replacing CPU-based inference with neural network processors for inference, or NNPIs, in the "Yosemite" microserver platforms we talked about in 2019. CPU-based inference is still going strong, as you can see, but as DLRM models got fatter and blew past the limits of the NNPIs, Meta Platforms had to bring in GPUs to do the inference. We presume these were not the same GPUs used for AI training, but rather PCI-Express cards like Nvidia's T4 and A40; Coburn wasn't specific. And this got more and more costly as more inference capacity was needed.
"You can see the inference requirements rapidly outpacing the capabilities of the NNPIs, and Meta turned to GPUs as they provide more compute to meet the growing demand," Coburn explained in the MTIA launch presentation. "But it turns out that although GPUs deliver a great deal of memory bandwidth and compute throughput, they were not designed with inference in mind, and their efficiency is low on real models despite significant software optimizations. This makes them challenging and expensive to deploy at scale."
We have no doubt Nvidia would argue that Meta Platforms was using the wrong hardware for DLRMs, or would perhaps just show how a "Grace" CPU plus a "Hopper" GPU will save the day. None of that really matters, because Meta Platforms wants to control its own silicon destiny just as it did in 2011 when it launched the Open Compute Project to open up server and datacenter architectures.
Which raises the question: Will Meta Platforms open source the RTL and design specifications for the MTIA device?
Banking on RISC-V for MTIA
Facebook has long been a strong proponent of open source software and hardware, and it would have been a huge surprise if Meta Platforms had not adopted the RISC-V architecture for the MTIA accelerator. It is, of course, based on a dual-core RISC-V processing element, wrapped in a whole bunch of stuff – but not so much that it can't fit into a 25 watt chip and a 35 watt dual M.2 peripheral card.
Here are the main specifications of the MTIA v1 chip:
Because it runs at a low frequency, the MTIA v1 chip burns very little power, and being implemented in a 7 nanometer process means the chip is small enough to run just fine without being etched with the most advanced processes from Taiwan Semiconductor Manufacturing Co, which range from 5 nanometers down to 3 nanometers with 4 nanometers in between. Those are more expensive processes, too, and perhaps something to save for a later day – and a later generation of hardware doing training and inference separately or together, as Google does with its TPUs – when those processes are more mature and therefore cheaper.
The MTIA v1 inference chip has a 64-element grid of processing elements with 128MB of SRAM wrapped around it that can be used as primary storage or as cache, fed by sixteen Low Power DDR5 (LPDDR5) memory controllers. This LPDDR5 memory is used in laptops and is also used in Nvidia's impending "Grace" Arm server CPU. Those 16 channels of LPDDR5 memory can provide up to 64GB of external memory, suitable for holding those big, fat embeddings that are necessary for DLRMs. (More on that in a moment.)
Each of these 64 processing elements is based on a pair of RISC-V cores, one plain vanilla and one with vector math extensions. Each processing element has 128 KB of local memory and fixed function units to do FP16 and INT8 arithmetic, run nonlinear functions, and move data.
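A quick back-of-the-envelope on the on-chip memory described above, working only from the figures in the text (we are assuming the 64 elements are laid out as an 8×8 grid, which the text does not state explicitly):

```python
# On-chip memory arithmetic for MTIA v1, from the figures in the text:
# 64 processing elements (PEs), 128 KB of local memory per PE, and
# 128 MB of shared SRAM wrapped around the grid.
pes = 8 * 8                     # 64 PEs, assumed to be an 8x8 grid
local_kb_per_pe = 128           # KB of local memory per PE
shared_mb = 128                 # MB of shared SRAM around the grid

total_local_mb = pes * local_kb_per_pe / 1024
shared_per_pe_mb = shared_mb / pes

print(total_local_mb)    # 8.0 MB of local memory across the whole grid
print(shared_per_pe_mb)  # 2.0 MB of shared SRAM per PE, if split evenly
```

So the shared SRAM pool is sixteen times larger than the aggregate local memory, which is consistent with it doing double duty as primary storage or cache for the grid.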
This is what the MTIA v1 board looks like:
There is no way you could put a fan atop a chip like that and fit dozens of them inside a Yosemite V3 server. Perhaps this cooler is just being shown for scale?
Here is the neat part of the MTIA server design. There is a leaf/spine network of PCI-Express switches in the Yosemite server that not only allows the MTIA modules to talk to the host, but also to each other and to 96GB of host DRAM that can cache large embeddings if needed. (Just as Nvidia will do with Grace-Hopper.) The whole shebang comes in at 780 watts per system – or just over the 700 watts a single Hopper SXM5 GPU burns when running full tilt.
An Nvidia H100 can do 2,000 teraops at INT8 precision in a 700 watt device, but the Meta Platforms Yosemite inference platform can do 1,230 teraops in a 780 watt system. A DGX H100 with eight GPUs comes in at 10,200 watts and 16,000 teraops, which works out to 1.57 teraops per watt. MTIA comes in at 1.58 teraops per watt, and it is tuned for Meta Platforms' DLRM frameworks and PyTorch – and it will be tuned even further. We strongly suspect the MTIA machinery costs significantly less per unit of work than a DGX H100 system – otherwise Meta Platforms wouldn't be showing it off.
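The arithmetic behind those comparisons is easy to check. All figures below are as stated in the text – INT8 teraops and watts per device or system:

```python
# Performance-per-watt arithmetic from the figures stated in the text.
# (teraops at INT8 precision, watts)
systems = {
    "H100 SXM5 (single GPU)":  (2_000, 700),
    "Yosemite MTIA system":    (1_230, 780),
    "DGX H100 (eight GPUs)":   (16_000, 10_200),
}

for name, (teraops, watts) in systems.items():
    print(f"{name}: {teraops / watts:.2f} teraops per watt")
# H100 SXM5 (single GPU): 2.86 teraops per watt
# Yosemite MTIA system: 1.58 teraops per watt
# DGX H100 (eight GPUs): 1.57 teraops per watt
```

Note that a single H100 card beats both systems on this raw metric; the MTIA system's edge over the DGX H100 only shows up at the full-system level, and the real argument is cost per unit of work, which neither company has disclosed.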
Raw feeds and speeds aren't the best way to compare systems, of course. DLRMs vary in complexity and model size, and not everything is good at everything. Here is how DLRMs break down within Meta Platforms:
"The breakdown also gives us insight into where and how MTIA is more efficient," explained Roman Levinstein, principal engineering manager at Meta Platforms. "MTIA can reach up to 2X better performance per watt on fully connected layers compared to GPUs."
Here’s how the performance per watt stacks up on low-, medium-, and high-complexity models:
Levinstein cautioned that the MTIA device has not yet been optimized for DLRM inference on the higher-complexity models.
We will try to find out which NNPIs and GPUs were tested here and do a little price/performance analysis. We will also ponder how an AI training device might be built from this foundational chip. Stay tuned.