Training – CUDA Moat Still Alive – SemiAnalysis


Intro

SemiAnalysis has been on a five-month-long quest to settle the reality of the MI300X. In theory, the MI300X should have a huge advantage over Nvidia's H100 and H200 in terms of specifications and Total Cost of Ownership (TCO). In reality, however, the on-paper specs given below are not representative of the performance one can expect in a real-world environment. If AMD could deliver the marketed performance below with this memory, it would be a very strong competitor in the market.

Source: SemiAnalysis, Nvidia, AMD

Today we are going to walk through our five-month journey conducting independent analysis and training-focused benchmarking of the MI300X, the H100 and the H200, engaging with both NVIDIA and AMD. We will give a detailed overview of the many low-level benchmarks that we ran; see the table of contents for a summary. Furthermore, we will evaluate the total cost of ownership of Nvidia and AMD GPUs and factor in performance. Ultimately, much of what we are doing amounts to a comprehensive public recommendation to AMD on what they need to do to be competitive and fix their software issues after five months of submitting and squashing bugs. It's not just that the software is immature; AMD needs to change how it does development.

In short, when comparing Nvidia's GPUs to AMD's MI300X, we found that the potential on-paper advantage of the MI300X was not realized due to shortcomings within AMD's public release software stack and a lack of testing on AMD's part.

AMD's software experience is riddled with bugs, rendering out-of-the-box training with AMD impossible. We were hopeful that AMD could emerge as a strong competitor to NVIDIA in training workloads, but, as of today, this is unfortunately not the case. The CUDA moat has yet to be crossed by AMD due to AMD's weaker-than-expected software Quality Assurance (QA) culture and its challenging out-of-the-box experience. As fast as AMD tries to fill in the CUDA moat, NVIDIA engineers are working overtime to deepen said moat with new features, libraries, and performance updates.

We shared benchmark source code and intermediate test results for the GEMM benchmark and single-node training with both Nvidia and AMD, held calls and discussions to solicit feedback and implement improvements to the benchmarks, and worked with AMD to implement bug fixes for its software stack.

Our goal with this highly iterative interaction was to ensure that our tests are an unbiased evaluation of what real-world users would experience.

We initially planned to publish this article a few months ago but wanted to take the extra time to engage with the AMD team and investigate possible fixes or development work. We spent considerable time identifying and fixing AMD software bugs so that we could give AMD every chance to show the MI300X unobstructed by AMD software stack bugs, as opposed to only showing problematic out-of-the-box performance. To give an impartial impression, we also explain the considerable amount of tuning and bug-squashing work it took to get there. We think this approach provides readers with the best possible level of transparency.

We wanted to contribute in any way we could to try to improve the AMD ecosystem. Though AMD software is much better now thanks to our bug reports and tire-kicking, its public software stack still falls short. We have open-sourced many of the benchmarks and created simple one-liners to reproduce them.

If Lisa Su and AMD leadership redouble their investment with a focus on their software and testing stack, they have a chance to be competitive with Nvidia on training. We think the engineers at AMD are extremely talented and are doing their best to advance the AMD ecosystem – and indeed, support from these engineers in the form of bug fixes, configuration help and custom images improved the results we were able to get from the MI300X.

To bring our benchmarking process to a close, on November 15th, 2024 we sent Nvidia and AMD a draft of most of our major GEMM and single-node benchmarking code and results for comments, verification, and fine-tuning. We asked that any final comments, fixes, feedback and performance improvements be submitted by November 25th. We set this time frame to crystallize the test results and to allow time to write an in-depth analysis and commentary and conduct multiple rounds of internal and external review, steps that can take a variable and often unpredictable amount of time, typically 2-4 weeks.

A few days ago, after we informed both parties that we had confirmed an article publication date of December 20th, AMD asked that we delay publication to include results based on a beta WIP development build from an AMD developer's branch. All of our benchmarking on Nvidia was conducted on publicly available stable release builds. In the spirit of transparency and impartiality, we include these results, as well as updated testing-harness results on the original November 25th deadline image and the latest publicly available software. However, we believe that the correct way to interpret the results is to look at the performance of the public stable releases of AMD's and Nvidia's software.

Below is the list of software builds that we used for benchmarking:

  • H100 Public Stable Release – Out-of-box experience for the Nvidia H100.
  • H200 Public Stable Release – Out-of-box experience for the Nvidia H200.
  • MI300X Nov 25th Custom Build – A custom VIP docker image, hand-crafted by AMD principal engineers, that builds all dependencies from source code.
  • MI300X Stable Public Release PyTorch 2.5.1 – Out-of-box experience for the AMD MI300X.
  • MI300X Public Nightly Dec 19th – This can indicate where AMD performance may be by January 2025, when PyTorch 2.6 is released, over a year after the MI300X launched.
  • MI300X Dec 21st WIP dev build – The image AMD submitted to us after we agreed to delay publication of the article. It is an experimental development build that has not yet been merged into AMD's internal main branch, and it does not use the native PyTorch flash attention API. Performance with this image can indicate where AMD public stable release performance will be 1-2 quarters from now.

We are very grateful for the technical support provided by AMD and Nvidia throughout this process, but we maintain our independence in the results we publish. We want to shout out and thank our AMD counterparts – Anush Elangovan (AMD VP of AI), Hui Liu, and many dozens of amazing AMD Principal/Senior engineers, VPs of Engineering, Engineering Fellows, CVPs of Engineering, Directors of Engineering and Software Library Leads – for triaging and fixing our various bug reports. On the Nvidia side, we are grateful to Kedar Potdar, Ian Buck, Sylvain Jeaugey and the NCCL team from NVIDIA for their amazing support.

Thank you to Crusoe, TensorWave (AMD Ventures portco), Nebius, Lambda, Hot Aisle and Sustainable Metal Cloud (SMC) / Firmus for the compute and for being supporters of open-source benchmarking. Crusoe, Nebius, SMC / Firmus and Lambda support managed SLURM and shared home directories out of the box. TensorWave currently has managed SLURM in beta, and this feature will reach general availability (GA) at the start of next year. Sustainable Metal Cloud is one of the few neoclouds with official MLPerf GPT-3 175B Training results.

We will be releasing a follow-up article on inference for the H100, H200 and MI300X. We may also release a follow-up article in a few months to revisit AMD training performance, see whether the out-of-box experience has improved, and test other models such as LLaVA & Mamba.

Source: SemiAnalysis

Key Findings

  1. Comparing on-paper FLOP/s and HBM bandwidth/capacity is akin to comparing cameras by merely examining megapixel count. The only way to tell the actual performance is to run benchmarks.
  2. Nvidia's out-of-the-box performance and experience is amazing, and we did not run into any Nvidia-specific bugs during our benchmarks. Nvidia assigned a single engineer to us for technical support, but since we didn't run into any Nvidia software bugs, we didn't need much help.
  3. AMD's out-of-the-box experience is very difficult to work with and can require considerable patience and elbow grease to move towards a usable state. On most of our benchmarks, public stable releases of AMD PyTorch are still broken and we needed workarounds.
  4. If we hadn't been supported by multiple teams of AMD engineers triaging and fixing the AMD software bugs we ran into, AMD's results would have been much lower than Nvidia's.
  5. We ran unofficial MLPerf Training GPT-3 175B on 256 H100s in collaboration with Sustainable Metal Cloud to test the effects of different VBoost settings.
  6. For AMD, real-world performance on publicly released stable software is nowhere close to its on-paper marketed TFLOP/s. Nvidia's real-world performance also undershoots its marketed TFLOP/s, but not by nearly as much.
  7. The MI300X has a lower total cost of ownership (TCO) than the H100/H200, but training performance per TCO is worse on the MI300X when using public stable releases of AMD software. This changes if one uses custom development builds of AMD software.
  8. Training performance is weaker, as demonstrated by the MI300X's matrix multiplication micro-benchmarks, and single-node training throughput on AMD public release software still lags that of Nvidia's H100 and H200.
  9. MI300X performance is held back by AMD software. AMD MI300X software on BF16 development branches has better performance, but it has not yet been merged into the main branch of AMD's internal repos. By the time it gets merged into the main branch and into the PyTorch stable release, Nvidia Blackwell will already have been available to everyone.
  10. AMD's training performance is also held back because the MI300X does not deliver strong scale-out performance. This is due to its weaker ROCm Communication Collectives Library (RCCL) and AMD's lower degree of vertical integration with networking and switching hardware, compared to Nvidia's strong integration of its Nvidia Collective Communications Library (NCCL), InfiniBand/Spectrum-X network fabric and switches.
  11. Many of AMD's AI libraries are forks of NVIDIA's AI libraries, leading to suboptimal outcomes and compatibility issues.
  12. AMD customers tend to use hand-crafted kernels only for inference, which means their performance outside of very narrow, well-defined use cases is poor, and their flexibility to rapidly shifting workloads is nonexistent.

Executive Recommendation to AMD

We genuinely want to see another effective competitor to Nvidia and want to help AMD get to that spot, but, unfortunately, there is still much work to be done on that front. At the bottom of this article, we have a detailed list of feedback for Lisa Su and the AMD leadership team, but we provide a summary here:

  1. Give AMD engineers more compute and engineering resources to fix and improve the AMD ecosystem; they have very few internal GPU boxes relative to what Nvidia provides to its engineers. TensorWave, the largest AMD GPU cloud, has given GPU time for free to a team at AMD to fix software issues, which is insane given that TensorWave paid for the GPUs.
  2. AMD needs to hook up thousands more MI300X and MI325X GPUs to PyTorch CI/CD for automated testing to ensure there are no AMD performance regressions or functional AMD bugs. Nvidia has provided thousands of GPUs for PyTorch CI/CD to ensure an amazing out-of-box experience.
  3. The AMD executive team should personally and intensively test (i.e., "dogfood") the products being shipped to the public rather than focus on testing internal builds. Preferably, dogfood during a livestream (twitch.tv) to show the real out-of-box experience, much like geohot's livestreams.
  4. AMD should collaborate with Meta to get production LLM training workloads working as soon as possible on PyTorch with ROCm, AMD's answer to CUDA, as PyTorch code paths that Meta isn't using typically have many bugs.
  5. Move away from over-reliance on properly setting numerous environment flags (up to dozens) to make an AMD deployment usable. Instead, bake these settings into the default configuration. Make the out-of-the-box experience usable!
  6. Focus on making the out-of-box experience good instead of over-relying on custom VIP images that build all dependencies from source at main@specific-commit and take 5 hours to build.
  7. Stop expecting end users to use PYTORCH_TUNABLE_OPS, which is a buggy prototype feature and is not respectful of end users' time, as it takes ~1 hour of tuning every time an end user wants to make any changes to their code.
  8. AMD should submit MLPerf Training GPT-3 175B results. MLPerf is an apples-to-apples benchmarking methodology that uses time-to-convergence as the north star.
  9. We want AMD to be competitive and are open to meeting with more detailed feedback on how to fix the AMD datacenter GPU ecosystem for the better.

A Summary of the AMD vs Nvidia Narrative

Before we dive into the various facets of AMD's software stack that hold AMD back, we will discuss the MI300X's basic specifications, its comparative total cost of ownership, and how most analysts and investors have evaluated its competitiveness.

The MI300X launched in late 2023 with an exciting set of on-paper specifications – featuring 1,307 TFLOP/s of FP16 compute (stronger than the H100's 989 TFLOP/s), 5.3 TB/s of memory bandwidth and 192GB of HBM3 (versus the H100's 3.35 TB/s of memory bandwidth and 80GB of HBM3). These specs also outstrip those of the H200, which itself is, effectively, a memory-spec-bumped version of the H100, delivering 4.8 TB/s of memory bandwidth and 141GB of HBM3e.

Source: SemiAnalysis, Nvidia, AMD

The on-paper total cost of ownership for an MI300X deployment is extremely compelling, not only due to the lower ASP of the MI300X, but also because it is typically deployed using cheaper Ethernet networking. Comparing a cluster of 16k H200s vs a 16k MI300X Ethernet cluster, nearly 40% of the cost savings come from networking alone, with the remainder of the savings from a lower accelerator cost. The use of whitebox Ethernet switches is a substantial cost saving compared to using Nvidia's Quantum-2 switches, but the real difference is cheaper transceivers, as Nvidia-branded transceivers cost as much as 2-3x what a generic transceiver OEM charges.

At face value, the MI300X seems to offer the best of both worlds: higher performance and lower total cost of ownership. At the time of its launch, it was reasonable to expect share gains for the underdog AMD from this compelling combination. The table below shows total upfront cluster capex – we present a more detailed breakdown of cluster capex components as well as a detailed networking BoM analysis in the sections near the bottom of the article.

Source: SemiAnalysis AI TCO Model

As orders solidified, excitement built up around the MI300X's potential, helped along by bullish commentary and guidance from AMD. With a compelling spec advantage, it was easy to argue for further upside to AMD's guidance, which most investors presumed management was sandbagging. AMD had a strong hand, in theory. After all, they have mid-single-digit market share in datacenter GPUs for 2024, and, reasonably, a glide path towards even 10-12% market share by 2027 could be conservative while offering considerable earnings upside for AMD.

However, from late 2023 and through most of 2024, guidance for full-year 2024 datacenter GPU sales repeatedly fell short of those lofty expectations. From its 1Q24 earnings through its 3Q24 earnings, AMD only raised guidance from $4B to $5B, well under the $6-8B investor bogey based on CoWoS and HBM supply commitments. Our demand view in the Accelerator Model tracked Microsoft's disappointment early in the year and the lack of follow-on orders.

The earlier bullish line of reasoning was like buying a certain car model from a magazine without a test drive, asking owners of that model for feedback, or reading any reviews. But fear not – SemiAnalysis has put the MI300X, H100, and H200 through their paces at scale and can show why AMD's current software stack issues firmly disprove this line of reasoning.

General Matrix Multiply (GEMM) Performance

Most FLOPS in a transformer-based architecture (i.e. ChatGPT, Llama, etc.) go towards matrix multiplications, also known as GEMMs. For this reason, GEMM performance is a good proxy for how well frontier transformers, such as ChatGPT, Llama, Claude, Grok, etc., will train on the hardware.

GEMMs take two input matrices, Matrix A and Matrix B, with Matrix A having shape (M, K), i.e. M rows and K columns, and Matrix B having shape (K, N), to produce an output matrix of shape (M, N).

Source: Nvidia

Conceptually, each element of the resulting matrix is a sum of element-wise multiplications along the "K" dimension of the inputs. For this reason, the K dimension is also known as the reduction dimension.

Source: SemiAnalysis

Below, we have tested the following real-world shapes, given in the form (M, N, K) – which is shorthand for multiplying a matrix of dimensions (M, K) and a matrix of dimensions (K, N) together.

The following matrix shapes were actually used in Meta's Llama 70B production training:

  • (16384, 8192, 1280) – Fused QKV Projection GEMM shape
  • (16384, 1024, 8192) – Attention Output Projection shape
  • (16384, 8192, 7168) – FFN GEMM shape
  • (16384, 3584, 8192) – FFN GEMM shape
  • (8192, 8192, 8192) – Standard GEMM shape for benchmarking

We used OpenAI's do_bench function for the benchmark setup, an industry-standard way of benchmarking PyTorch. The do_bench function provides cache clearing between runs by default and provides a way to warm up and run the benchmark multiple times, taking the median result. We used warmup=30 and rep=200 for these tests. Both input tensors A and B were randomly initialized from a normal distribution with mean 0 and variance 1, because a normal distribution comes closest to matching the actual distribution of weights and activations in modern neural networks. The distribution of the input tensors affects the measured TFLOP/s; we will discuss the reasons why later in the article.
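
A minimal sketch of a GEMM benchmark along these lines, using Triton's do_bench, is shown below. This is illustrative and not our exact open-sourced harness; the return_mode argument assumes a recent Triton version.

    import torch
    from triton.testing import do_bench

    def bench_gemm(M, N, K, dtype=torch.bfloat16):
        # Inputs drawn from a normal distribution (mean 0, variance 1), as described above.
        a = torch.randn(M, K, dtype=dtype, device="cuda")
        b = torch.randn(K, N, dtype=dtype, device="cuda")
        # Median runtime in milliseconds; do_bench handles warmup and L2 cache flushing.
        ms = do_bench(lambda: a @ b, warmup=30, rep=200, return_mode="median")
        tflops = 2 * M * N * K / (ms * 1e-3) / 1e12  # a GEMM performs 2*M*N*K FLOP
        return tflops

    for shape in [(16384, 8192, 1280), (16384, 1024, 8192), (16384, 8192, 7168),
                  (16384, 3584, 8192), (8192, 8192, 8192)]:
        print(shape, f"{bench_gemm(*shape):.1f} TFLOP/s")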

For BF16, we can see that the H100 and H200 achieve roughly 720 TFLOP/s against their marketed 989.5 TFLOP/s, while the MI300X achieves a mere ~620 TFLOP/s compared with its marketed 1,307 TFLOP/s.

This means that, despite a much higher marketed BF16 TFLOP/s, the MI300X is 14% slower than the H100 and H200. This AMD result used a custom docker image hand-crafted by an AMD principal engineer and still achieved slower performance than Nvidia's GPUs. In our out-of-the-box testing of the MI300X, the TFLOP/s throughput was even lower than this! In addition to a custom image, AMD also requires the user to set numerous environment flags that aren't set by default to achieve these performance results.

Source: SemiAnalysis

Unfortunately, the story is worse for FP8. The H100/H200 achieve ~1,280 TFLOP/s out of a marketed 1,979 TFLOP/s. The MI300X, in comparison, only achieves ~990 TFLOP/s. Thus, for FP8, the MI300X is 22% slower than the H100. This is with both inputs in the e4m3 FP8 (i.e. 4 exponent bits and 3 mantissa bits) datatype.

Source: SemiAnalysis

It is important to note that calling a GEMM is a simple task, and we shouldn't expect to run into AMD software bugs doing so. Unfortunately, a major bug we encountered is that the torch.matmul and F.linear APIs delivered different performance on AMD for a couple of months during the summer. One would expect torch.matmul and F.linear to have the same performance, but, surprisingly, F.linear was much slower!

This is a strange bug, as torch.matmul and F.linear are both wrappers around the hardware vendor's GEMM libraries, so they should achieve the same level of performance. F.linear, in particular, is important, as it is the way most end users in PyTorch launch GEMM kernels.
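
A quick sanity check of this equivalence can be written in a few lines (a sketch, not our full harness; the shape is illustrative):

    import torch
    import torch.nn.functional as F
    from triton.testing import do_bench

    M, N, K = 16384, 8192, 8192
    x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
    w = torch.randn(N, K, dtype=torch.bfloat16, device="cuda")  # F.linear computes x @ w.T

    ms_matmul = do_bench(lambda: torch.matmul(x, w.t()), warmup=30, rep=200, return_mode="median")
    ms_linear = do_bench(lambda: F.linear(x, w), warmup=30, rep=200, return_mode="median")
    # On a healthy software stack these two numbers should be nearly identical.
    print(f"torch.matmul: {ms_matmul:.3f} ms, F.linear: {ms_linear:.3f} ms")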

When we started testing AMD five months ago, public AMD PyTorch still had this bug. The root cause was that AMD in fact has two different underlying GEMM libraries, rocBLAS and hipBLASLt, with hipBLASLt being more optimized for the MI300X. The bug was that torch.matmul used the optimized hipBLASLt, but AMD had not switched F.linear over by default, leaving it to use the unoptimized rocBLAS library.

This major bug was ultimately fixed by AMD a few months ago following our bug reports, and we hope it doesn't re-emerge due to a lack of proper regression testing. AMD's usability could improve considerably if it stepped up its testing efforts instead of waiting for users to discover these critical issues.

We have open-sourced the GEMM benchmark used in our tests as a simple three-liner that anyone can easily run:

Source: SemiAnalysis

Stas' GEMM Benchmark Is Wrong

Recently, a benchmark has been floating around the internet claiming that, on GEMMs, the AMD MI300X's performance is close to that of the H100. We love Stas Bekman and think he does a lot of positive work for the ML community, but unfortunately, his benchmark has some flaws.

Source: Stas Bekman

There are two main issues with Stas' benchmark: it doesn't properly perform L2 cache clearing, and it simply takes the maximum performance rather than the median/mean TFLOP/s over the iterations for a particular shape. Without L2 cache clearing between iterations, the benchmark does not accurately reflect real-world GEMM performance. Furthermore, since TFLOP/s varies from iteration to iteration, you need to use a mean/median over at least 100 iterations as the basis for an accurate GEMM benchmark. OpenAI's do_bench provides L2 cache clearing and mean/median out of the box by default, so we recommend that engineers use it for micro-benchmarking. Below, we have simplified Stas' benchmark into pseudocode and commented on the issues mentioned above.

Source: SemiAnalysis
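
In code form, the problematic pattern looks roughly like this (our paraphrase as a runnable sketch, not Stas' exact code):

    import torch, time

    def flawed_gemm_bench(M, N, K, iters=50, dtype=torch.bfloat16):
        a = torch.randn(M, K, dtype=dtype, device="cuda")
        b = torch.randn(K, N, dtype=dtype, device="cuda")
        best_tflops = 0.0
        for _ in range(iters):
            torch.cuda.synchronize()
            start = time.perf_counter()
            a @ b                       # issue: operands may still be hot in the L2 cache
            torch.cuda.synchronize()
            elapsed = time.perf_counter() - start
            tflops = 2 * M * N * K / elapsed / 1e12
            best_tflops = max(best_tflops, tflops)   # issue: max instead of mean/median
        return best_tflops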

HBM Memory Bandwidth Performance

It is widely known that the AMD MI300X has better memory bandwidth than the Nvidia H100 and H200, offering 5.3 TB/s of bandwidth vs 4.8 TB/s for the H200 and 3.35 TB/s for the H100. Improved HBM memory bandwidth is very useful in inference and is sometimes useful in training. In training, users can set a larger batch size if they have more HBM memory capacity and memory bandwidth, although, past a certain size, a larger global batch size will make the model take longer to converge. It is easy to run fast with a huge global batch size, but at a high level, it will hurt time-to-convergence.

Our HBM memory bandwidth benchmarking shows that the MI300X indeed has much better memory bandwidth than both the H200 and the H100. We tested memory bandwidth in PyTorch with Tensor.copy_ and used the industry-standard OpenAI do_bench to ensure accuracy.
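
A minimal sketch of a Tensor.copy_ bandwidth test in this style (the buffer size is an illustrative assumption, and this is not our exact harness):

    import torch
    from triton.testing import do_bench

    n_bytes = 8 * 1024**3                       # 8 GiB per buffer
    src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    dst = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")

    ms = do_bench(lambda: dst.copy_(src), warmup=30, rep=200, return_mode="median")
    # Each copy reads the source and writes the destination, so 2x bytes move through HBM.
    print(f"{2 * n_bytes / (ms * 1e-3) / 1e12:.2f} TB/s")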

As you will see in our upcoming H100 vs H200 vs MI300X inference article, memory bandwidth is very important for inference.

Source: SemiAnalysis
Source: SemiAnalysis

AMD Hand-Crafted VIP Custom Builds and WIP Development Builds

The only reason we have been able to get AMD performance within 25% of H100/H200 performance is that we have been helped by multiple teams at AMD in fixing many AMD software bugs. To get AMD to a usable state with somewhat reasonable performance, a giant hand-crafted Dockerfile of roughly 60 steps, written by an AMD principal engineer, was specifically provided for us, since the PyTorch nightly and public PyTorch AMD images functioned poorly and had version differences. This docker image takes ~5 hours to build from source and installs dependencies and sub-dependencies (hipBLASLt, Triton, PyTorch, TransformerEngine), a huge difference compared to Nvidia, which offers a pre-built, out-of-the-box experience that takes but a single line of code. Most users do not build PyTorch and hipBLASLt from source code but instead use the stable release.

When using public PyTorch, users have the choice of working with the latest stable images or a nightly PyTorch upload. Although a nightly PyTorch upload may have the latest commits, which could potentially lead to better performance or fix some bugs, users must acknowledge that the upload may not be fully tested and could contain new, not-yet-discovered bugs from Meta/AMD/Nvidia or other PyTorch contributors. Note that most end users use the stable release of PyTorch.

Source: SemiAnalysis, AMD
Source: Nvidia

Delightfully, Nvidia's Docker images contain the complete set of developer tools needed for profiling and debugging, like Nsight Compute and Nsight Systems. AMD, in contrast, does not include its OmniTrace developer tool out of the box.

Until a couple of weeks ago, the AMD docker images only supported PyTorch 2.3, which was released 8 months ago. Mainline PyTorch 2.4 and PyTorch 2.5 have since been released, and PyTorch 2.6 is about to come out in Q1 2025. We recommended to an AMD Principal Engineer and to AMD's VP of AI that AMD should offer the latest AMD PyTorch version – AMD has since started publishing containers for some of these AMD PyTorch versions. A Docker image for AMD PyTorch 2.5 is still missing.

Source: Nvidia

Dec 21st AMD Development Builds

Below is AMD's December 21st development build docker image. As you can see, it uses a number of non-stable development branches for dependencies such as hipBLASLt, AOTriton, and ROCm Attention, and installs everything, including PyTorch, from source code, taking upwards of 5 hours to build. These versions of the dependencies haven't even been merged into AMD's own main branch yet. 99.9% of users will not be installing PyTorch and all of its dependencies from source on development branches, but will instead use the public stable PyPI PyTorch.

Furthermore, instead of using Flash Attention through the user-friendly PyTorch-native torch.scaled_dot_product_attention API, this AMD development build launches its attention implementation through another library (also on a development branch). We have seen more users use Flash Attention through the PyTorch-native torch.scaled_dot_product_attention API since it is more user-friendly and bundled into out-of-box PyTorch. Even AMD's own public documentation recommends using Flash Attention through the torch.scaled_dot_product_attention API. We hope that these kernels get merged into PyTorch flash attention instead of making the end user install a separate library that takes hours of their time to build. This is not a user-friendly experience. Furthermore, AMD must support FlexAttention, as it has rapidly become the go-to in the industry.
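
For reference, calling flash attention through the PyTorch-native API is a one-liner once q, k and v are laid out; the tensor sizes below are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    q = torch.randn(4, 32, 8192, 128, dtype=torch.bfloat16, device="cuda")  # (B, H, S, D)
    k = torch.randn(4, 32, 8192, 128, dtype=torch.bfloat16, device="cuda")
    v = torch.randn(4, 32, 8192, 128, dtype=torch.bfloat16, device="cuda")

    # is_causal=True selects standard causal masking; PyTorch picks the backend
    # (flash attention, memory-efficient, or math) for the given hardware.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)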

AMD's December 21st dev build is on a dangling development branch. That means it is a branch that has not been fully QA'ed and is a use-at-your-own-risk branch. There are many concerns about the validity of results obtained from development builds and branches and from building from source code, as most users do not do this in real life. Most users will be installing AMD/Nvidia PyTorch from the PyPI stable release, so we recommend readers keep this in mind when analyzing these results.

That being said, we are including these development build results as an indication of where AMD public stable release software will be 1-2 quarters from now. At the same time, when it comes to the competition, 1-2 quarters from now Nvidia Blackwell will already be widely deployed, while the AMD MI355X will not start shipments until H2 2025.

Source: SemiAnalysis, AMD

Training Testing Methodology (GPT1.5B, Llama 8B, Llama 70B, Mistral)

There are many ways to test training performance. The most accurate way is to take a medium-sized AI startup's internal model codebases and run them on a 512-1024 GPU cluster. This way, the test run has all the optimizations that a typical user would have. Everything else is just a proxy for the performance of these training runs. Training performance takes into account HBM bandwidth, HBM capacity, TFLOP/s, networking, and system architecture. Comparing on-paper HBM bandwidth/capacity is just like comparing on-paper camera megapixels.

MLPerf GPT-3 175B Training is also a good proxy, measuring the time it takes to train to a specific convergence target. The MLPerf benchmark considers global batch sizes and whether a mixed-precision implementation incurs a convergence penalty. Unfortunately, MLPerf is quite difficult to run due to a lack of user-friendly documentation and instructions, and the performance is often min-maxed via a custom-tuned configuration crafted specifically for MLPerf that an ordinary user would not adopt. Note that Nvidia has submitted MLPerf Training results with over 11k H100s, while AMD runs MLPerf Training internally. AMD's results are presumably weak, so they have never submitted any MLPerf Training result, let alone the MLPerf GPT-3 175B benchmark.

When designing our SemiAnalysis benchmark, we wanted to reflect the ordinary user's model implementation, and so opted for the torch.scaled_dot_product_attention API (which uses the flash attention backend), PyTorch Distributed Data Parallel (DDP) and/or Fully Sharded Data Parallel (FSDP), with torch.compile. Also note that AMD recommends users use torch.scaled_dot_product_attention in its own documentation. We believe this is the most representative of a typical user workload. Further, we used a generic PyTorch-native implementation of these models to keep it close to a typical ML scientist's usage and make it easy to run with a single line of code. In contrast to MLPerf, the goal of our benchmark is to be as simple to run as possible, while still being a good proxy for performance. Note that, since we don't take time-to-convergence into account, this benchmark has a slight bias towards AMD, as we set the micro batch size higher on AMD than on Nvidia. When taking time-to-convergence into account, AMD's results would be worse than what is stated.
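
As a reference point, a minimal sketch of this style of setup, a PyTorch-native transformer block using scaled_dot_product_attention wrapped in DDP and compiled with torch.compile, is shown below. The module, sizes and names are illustrative assumptions, not our exact harness.

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.nn.parallel import DistributedDataParallel as DDP

    class Block(nn.Module):
        def __init__(self, d_model=2048, n_heads=16):
            super().__init__()
            self.n_heads, self.head_dim = n_heads, d_model // n_heads
            self.qkv = nn.Linear(d_model, 3 * d_model)
            self.proj = nn.Linear(d_model, d_model)

        def forward(self, x):
            B, S, D = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            q, k, v = (t.view(B, S, self.n_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # flash attention backend
            return self.proj(out.transpose(1, 2).reshape(B, S, D))

    local_rank = int(os.environ["LOCAL_RANK"])     # launch with torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl")                # RCCL is used transparently on ROCm
    model = DDP(Block().cuda(), device_ids=[local_rank])
    model = torch.compile(model)                   # compile the wrapped model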

As an aside, many AI practitioners have said they are not using Megatron, NeMo, or 3D parallelism due to the high level of complexity and lack of flexibility associated with those libraries, whose inflexibility and complexity make their usage for ML research effectively impossible. Note that in terms of 3D parallelism, both Nvidia and AMD would get higher performance, assuming their software stacks work, which is a big assumption for AMD. AMD Megatron is a fork of Nvidia Megatron and has less than 10 stars, which means that it is probably not dogfooded well. Submitting bug reports would take extra months just to get AMD Megatron working for simple models.

For our SemiAnalysis model training benchmark, we test four models. The first is a simple GPT 1.5B with DDP, as we believe this is representative of what small-scale experiments/ablations look like before scaling out to bigger model sizes. DDP is a much simpler and less network-intensive form of parallelism. Next, we tested the standard Llama3 8B and a Llama3 70B 4-layer proxy as a baseline for a popular model's performance. Third, we tested Mistral 7B v0.1, which evaluates whether the hardware performs well when adding a bit of complexity, as Mistral uses sliding window attention instead of standard causal attention. Modern models such as ChatGPT, Claude, Gemini, o1 and o3 do not use standard causal attention and instead use more complex attention mechanisms.

A modern GPT/Llama/Transformer model is built by stacking the same transformer layer over and over again. As such, measuring the performance of just 4 layers is a great proxy for the overall performance of the model.

Source: Imgur

Furthermore, in modern LLM training for all frontier LLM models, pipeline parallelism is used, which means that a couple of transformer layers are placed on each GPU server. A whole model is never placed on a single node in modern pretraining.

Source: SemiAnalysis

The model FLOP per trained token is defined by the following formula:

6 * non_input_embedding_params + 12 * num_layers * num_heads * head_dim * max_seq_len * density

Here, density measures how dense the attention is relative to a full mask. Causal attention, for example, has a density of 50%, while sliding window attention has an even lower density.
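
As a worked example, the formula can be written as a small helper (the example parameters are illustrative, not a specific model's exact configuration):

    def model_flop_per_token(non_input_embedding_params, num_layers, num_heads,
                             head_dim, max_seq_len, density):
        return (6 * non_input_embedding_params
                + 12 * num_layers * num_heads * head_dim * max_seq_len * density)

    # Example: a 4-layer proxy with 32 heads of dim 128, 8192 sequence length, causal attention (density 0.5).
    flop_per_token = model_flop_per_token(1.2e9, 4, 32, 128, 8192, 0.5)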

Note that our testing harness originally used 6 * params instead of 6 * non_input_embedding_params, which is the wrong way of calculating model FLOP per token. Furthermore, there was another bug in the way we used FSDP. We have since updated our testing harness and retroactively retested, updating all benchmark results across all versions of software for the H100, H200 and MI300X – public stable, public nightly, VIP images and AMD development builds. All results listed below use the updated testing harness.

Single Node Training Performance

Note that the H100/H200 performance we present in this report reflects out-of-the-box performance without any hand-crafted tuning from Nvidia engineers, while the MI300X results come after many months of tuning and bug fixes from AMD's engineers. We did not run into any Nvidia-specific bugs, in contrast to AMD training, which was comparatively bug-filled. Five months ago, many models couldn't run at more than 150 TFLOP/s on the AMD MI300X due to an AMD software bug in the attention backwards pass and torch.compile, which forced the user to manually mark a region of the model as non-compilable instead of having a full graph compile.

We see that, for all models, the H100/H200 wins relative to the MI300X public release, public nightly release, and Nov 25th build-from-source VIP image. It is interesting that the MI300X does not perform well on smaller models such as GPT 1.5B or on any model that uses a non-causal attention layer, like Mistral 7B v0.1. This is due to FlexAttention not being fully operational at the time of the deadline, while, on Nvidia GPUs, it has been working since August 2024. As such, the H100/H200 beats the MI300X by more than 2.5x in terms of TFLOP/s on the MI300X public release, public nightly release, and Nov 25th VIP build.

For the Dec 21st MI300X internal WIP development branch build, we still see it perform worse than the H100/H200 on GPT 1.5B. Furthermore, it performs slightly worse than the H100 on Mistral 7B. For the Llama3 8B and Llama3 70B proxy, the Dec 21st MI300X WIP development build performs better than the H100/H200, but note that this is because the MI300X WIP development build uses an AMD engineer's development branch that has not even been merged into the AMD main branch.

Source: SemiAnalysis

Three months ago, attempting to do FP8 training on AMD led to segfaults and hard errors. On the off chance it did work, it was, in fact, slower than the same run using BF16. We worked with AMD's FP8 team to fix this issue, as well as the AMD hipBLASLt team, which created tunings to fix MI300X FP8 performance. FP8 training is important, as it speeds up training compared to BF16, and most frontier labs use FP8 training.

After many fixes, we can see that the MI300X's Nov 25th throughput for Llama3 8B and GPT 1.5B is somewhat competitive with the H100's. As usual, the H200 wins in this category. However, for the Llama3 70B 4-layer proxy, AMD's Nov 25th results are soundly beaten.

For Mistral 7B, which has a non-causal attention layer, AMD's Nov 25th performance is close to half that of an H100. This shows that, for anything that isn't a simple model, even after months of tuning, AMD is still not competitive when there is a slight tweak in the model setup. Many frontier models and AI training startups are using complex attention layers for long context spans and efficient attention, but AMD is still far behind on those.

Unfortunately, FP8 training on AMD only works on custom images such as our November 25th VIP image and the December 21st WIP development branch image. When we first started trying AMD FP8 training, it was slower than AMD BF16 training on public releases.

Source: SemiAnalysis

For AMD's WIP development builds, we see that on Llama3 8B it wins against the H100 but is still slower than the H200 on its public stable software release. The H200 completely beats the MI300X even on its Dec 21st WIP development branches.

It is interesting that the MI300X does not perform well on non-causal attention layers, like Mistral 7B v0.1, even on AMD's internal builds. Mistral uses sliding window attention, which some frontier models use. It seems that if you want to train a model that doesn't use causal attention, the AMD MI300X automatically loses out.

While a lot of people put out performance comparisons between hardware, most do not open source their testing code, and their results are not easily reproducible. We took an open-source approach: we have open-sourced our single-node training benchmark and made it easy to run with only a couple of lines:

Source: SemiAnalysis

Multi-Node Training Performance

For multi-node, we benchmarked two nodes of H100s and two nodes of MI300Xs. Unfortunately, we didn't get access to a multi-node H200 deployment in time for the article.

The H100 wins again by a large margin in this benchmark compared to the MI300X, with the H100 ranging from 10-25% faster. This gap widens as you add more nodes working together on a single training workload. This is a known problem, which AMD is attempting to fix next year by deploying its new in-house 400G AI-focused NIC.

AMD PYTORCH_TUNABLE_OPS FLAG is a Bad User Experience

In order to get AMD training working decently, users need to use PYTORCH_TUNABLE_OPS, an AMD-specific prototype flag that has the end user tune GEMMs. Since this is a prototype feature (i.e. not stable), a lot of bugs with this feature have cropped up in the past, including but not limited to seg faults, HBM memory leaks, and a whole host of other issues such as many unit tests being disabled. These known tunable-ops bugs have been fixed now, but there are presumably many more undiscovered AMD software bugs.

Furthermore, even if users do not encounter any bugs and the runway is clear for this prototype AMD flag to work, it still takes users anywhere from 1-2 hours to tune any modern LLM model. Although these GEMM tunings can be cached by the end user, any minor change to the end user's code means the user must spend another 1-2 hours tuning. As you can imagine, this slows down an ML scientist's iteration cycle when carrying out model R&D and ablation experiments.

On Nvidia, this flag isn't needed, as its GEMM library (cuBLASLt) comes tuned out of the box, and cuBLASLt's heuristic model picks the correct algorithm for most shapes on the H100/H200. In contrast, AMD's hipBLASLt/rocBLAS heuristic model picks the wrong algorithm for most shapes out of the box, which is why so much time-consuming tuning is required of the end user.

We recommend that AMD fix their GEMM libraries' heuristic model so that it picks the correct algorithm out of the box instead of wasting the end user's time on tuning. Users typically iterate rapidly when doing research, and rerunning tunable ops therefore slows down research velocity significantly.

Scale Up NVLink/xGMI Topology

The scale-up fabric is extremely important for GPU clusters, as it provides an extremely fast path for the tensor and expert parallelism used in frontier model training. For this reason, we have conducted benchmarks to measure scale-up fabric performance.

The scale-up fabric on the H100 and H200 is called NVLink; it provides 450GByte/s of bandwidth per GPU and connects 8 GPUs together. On the MI300X, the scale-up fabric is called xGMI and, on paper, it connects 8 GPUs, providing 448GByte/s of bandwidth per GPU. On the surface, the MI300X's scale-up network is extremely similar and close in performance to that of the H100/H200, providing just 0.5% less on-paper bandwidth. Unfortunately, the reality of the situation differs sharply.

First, the MI300X's xGMI is a point-to-point fabric, which means that it isn't actually providing 448GByte/s of bandwidth between GPU pairs. Instead, each GPU can only talk to any other single GPU at 64GByte/s. A GPU can only reach the stated 448GByte/s if it addresses all 7 other GPUs simultaneously. That means that, for Tensor Parallelism, the maximum bandwidth is 64GByte/s at TP=2 and 189GByte/s at TP=4.

Source: SemiAnalysis

In contrast, since Nvidia's NVLink uses a switched topology, one GPU can talk to another GPU at the full 450GByte/s. Furthermore, the four NVSwitches in the H100/H200 support in-network reduction (referred to as NVLink SHARP (NVLS), enabled by default), a technique that reduces data movement by carrying out collectives/reductions inside the switch itself.

Source: SemiAnalysis

All Reduce/All to All/Reduce Scatter/All Gather Collectives Overview

We will showcase benchmarks across scale-up and scale-out networks for both the Nvidia H100/H200 and AMD's MI300X. The collectives we test are the main set of collectives used in frontier LLM training: all_reduce, all_gather, reduce_scatter, and all-to-all. All-reduce is used for data parallelism and tensor parallelism, all-gather is used for ZeRO/FSDP parallelism (as well as for tensor parallelism), and reduce-scatter is used for ZeRO/FSDP parallelism.

Due to the way that compute-communication overlapping works, real-world message sizes range from 16MiB to 256MiB, with the default PyTorch DDP bucket size being 25MiB (NVIDIA's MLPerf 11,000 H100 GPT-3 175B run used a max message size of 200MiB). We also test 8GiB and 16GiB just to see what the peak bus bandwidth is, though these message sizes are not used in the real world. All the collectives discussed above are used during 3D parallelism and FSDP/ZeRO parallelism, which are common techniques for training frontier models.
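
For readers who want to reproduce a rough version of this measurement without nccl-tests/rccl-tests, a simplified torch.distributed sketch at the real-world message sizes looks like this (launch with torchrun; iteration counts are illustrative):

    import os, time, torch
    import torch.distributed as dist

    dist.init_process_group("nccl")                  # NCCL on Nvidia, RCCL on ROCm
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    for mib in (16, 64, 256):
        x = torch.ones(mib * 1024**2 // 2, dtype=torch.bfloat16, device="cuda")
        for _ in range(5):                           # warmup
            dist.all_reduce(x)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(20):
            dist.all_reduce(x)
        torch.cuda.synchronize()
        if dist.get_rank() == 0:
            print(f"{mib} MiB all_reduce: {(time.perf_counter() - t0) / 20 * 1e3:.3f} ms")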

Source: DeepSpeed
Source: Meta

Single Node NCCL Collective

We see that Nvidia does much better than AMD across all real-world message sizes for every single collective. This is not surprising given the H100/H200's better 450GByte/s NVLink switched topology with in-network reduction (NVLS), compared to the MI300X's 7x64GByte/s xGMI point-to-point topology.

Source: SemiAnalysis
Source: SemiAnalysis
Source: SemiAnalysis
Source: SemiAnalysis

To reproduce this test, you can use our open-source ClusterMax-NCCL/RCCL benchmark, which we developed to be easily run with one line of Bash. ClusterMax is our upcoming evaluation of quantitative performance and qualitative user experience for ranking H100/B200/GB200/MI300X neocloud clusters. Look forward to our upcoming "ClusterMax Neocloud Evaluation | How to Rent GPUs" article.

Source: SemiAnalysis

Multi Node RCCL/NCCL Collectives and Scale Out Network Benchmarks

On both Nvidia's H100/H200 and the MI300X, each GPU is connected to other nodes over the scale-out network using a 400G Network Interface Card (NIC) attached directly to each GPU. The H100/H200 reference design typically uses ConnectX-7 NICs for InfiniBand NDR or BlueField-3 for Spectrum-X Ethernet. Spectrum-X is NVIDIA's custom Ethernet solution purpose-built for AI workloads. For the MI300X, the reference design recommends using RoCEv2 Ethernet with the Broadcom Thor-2 NIC.

Source: Nvidia

A typical GPU cluster almost always needs more layers than a single-tier network, as a one-tier network can only support 128 GPUs (in the case of Broadcom Ethernet or Nvidia Spectrum-X Ethernet) or 64 GPUs (for H100/H200 InfiniBand). In such a multi-tier network, deployments typically use an 8-rail optimized fat tree, where each one of the 8 GPUs is connected to a separate switch (such a connection is called a "rail"). In our AI Neocloud Playbook and Anatomy article, we explained in detail how a rail-optimized network works.

Source: SemiAnalysis

Just as Nvidia's NVLink offers NVLS for its scale-up network, Nvidia's H100/H200 InfiniBand scale-out network also offers InfiniBand SHARP in-network reduction, which is, again, exclusive to Nvidia. AMD does not have an analogous product for the MI300X. InfiniBand SHARP works similarly to NVLink SHARP in-network reduction: both provide a way to reduce the amount of traffic going through the network, with the reductions carried out inside the Quantum-2 InfiniBand switches in the case of InfiniBand SHARP.

Unfortunately, unlike NVLink SHARP, which is enabled by default, InfiniBand SHARP is not enabled by default in the UFM/IB subnet manager. We have spoken to many neoclouds, H100 cluster operators, and AI frontier labs, and most have said that they have not enabled SHARP due to increased NCCL_TIMEOUT rates and difficulties installing and configuring the network. We asked NVIDIA which AI customers use InfiniBand SHARP, but they declined to answer with specifics. One could speculate that if InfiniBand SHARP were useful in AI production workloads, NVIDIA marketing would shout about its successful deployment at the top of their lungs. Given the apparently limited adoption of InfiniBand SHARP for now, we show collective performance for Nvidia both with and without SHARP enabled.

For some of the benchmarks, we have also collected Nvidia Spectrum-X Ethernet data on an Nvidia internal cluster called Israel-1. Nvidia Spectrum-X is used in xAI's 200k H100/H200 cluster and can support clusters of up to 100k GPUs in the Spectrum-X reference architecture version 1.2, but could potentially support up to 512k GPUs with a non-reference custom design.

We are also in the process of testing Google Cloud's (GCP) H100 in-house Ethernet, as well as AWS H100s and H200s that are deployed on AWS's in-house Ethernet (called EFAv2/EFAv3). We will share the results in our upcoming "Collective Deep Dive" article, which will provide visualizations of the different types of collectives and explain the different NCCL protocols (SIMPLE, LL, LL128), the different NCCL algorithms (NVLS, NVLSTREE, RING, TREE, COLNETDIRECT, COLNETCHAIN, PAT), and how collectives run on GCP H100 Ethernet, AWS H100/H200 EFA, InfiniBand H100, Spectrum-X, etc.

Below we show a 32-GPU all-reduce collective test. You can see that MI300X RoCEv2 is in last place compared to standard InfiniBand H100 and InfiniBand H100 with SHARP enabled. Simply put, poor all-reduce performance leads to poor scale-out training.

Source: SemiAnalysis

The MI300X's performance degrades as you scale out (i.e. increase) the number of GPUs participating in a collective. As you can imagine, modern frontier training is carried out on clusters of at least 100,000 GPUs. MI300X RoCEv2 runs at half the speed for all real-world message sizes of 16MiB to 256MiB when compared to the baseline of InfiniBand non-SHARP. As per the chart below, Nvidia Spectrum-X Ethernet performance is quite close to InfiniBand non-SHARP's performance, due to Spectrum-X's vertical integration with the NCCL collective library as well as its use of good congestion control and adaptive routing. AMD is attempting to vertically integrate next year with its upcoming Pollara 400G NIC, which supports Ultra Ethernet, hopefully making AMD competitive with Nvidia. As always, Nvidia is not standing still, and by late next year it will be ready to go into production with its 800G ConnectX-8 NICs, which provide a line rate twice as fast as AMD's Pollara NIC.

AMD RCCL is a fork of Nvidia NCCL. AMD's RCCL team and many other teams at AMD are resource-constrained and don't have enough compute or headcount to improve the AMD ecosystem. AMD's RCCL team currently has stable access to fewer than 32 MI300Xs for R&D, which is ironic, as improving collective operations is all about having access to many GPUs. This is frankly silly; AMD should spend more so its software teams have access to more GPUs.

This contrasts with Nvidia's NCCL team, which has access to R&D resources on Nvidia's 11,000 H100 internal EOS cluster. Furthermore, Nvidia has Sylvain Jeaugey, the subject matter expert on collective communication. There are a lot of other world-class collective experts working at Nvidia as well, and, unfortunately, AMD has largely failed to attract collective library talent due to less attractive compensation and resources – as opposed to Nvidia, where it is not atypical to see engineers make greater than a million dollars per year thanks to appreciation in the value of their RSUs.

To help ease these issues, TensorWave and SemiAnalysis are currently working with the AMD RCCL team to improve collective performance. TensorWave has kindly provided AMD with a medium-sized cluster so that the RCCL team has greater resources to do its job. The fact that TensorWave, after buying many GPUs, has to give AMD GPUs so that AMD can fix its own software is insane.

Another trend to note is that, for non-SHARP networks, all-reduce collective speed decreases logarithmically as you double the number of GPUs. In contrast, with SHARP, the speed/completion time stays the same. We have results for up to 1,024 H100s showing that IB SHARP all-reduce is constant-time across any number of GPUs in a collective. We will publish this in our upcoming "Collective Deep Dive" article.

Source: SemiAnalysis

For all-gather, all-to-all, and reduce-scatter collectives, the MI300X is anywhere from 2-4 times slower than InfiniBand. Unfortunately, we did not have access to Spectrum-X or InfiniBand SHARP benchmark data for all-gather or reduce-scatter.

Source: SemiAnalysis
Source: SemiAnalysis
Source: SemiAnalysis

Below, we provide our nccl/rccl benchmarking script. Unfortunately, due to the nature of cluster-specific setups, it is not as simple as a one-liner. It does require that you follow the README.md of nccl/rccl and nccl-tests/rccl-tests to run properly. On AWS and Google Cloud, there may also be custom NCCL adapters that you will need to install.

Source: SemiAnalysis

AMD's User Experience is Suboptimal and the MI300X is Not Usable Out of the Box

Due to poor internal testing (i.e. "dogfooding") and a lack of automated testing on AMD's part, the MI300X is not usable out of the box and requires considerable amounts of work and tuning. In November 2024 at AMD's "Advancing AI" event, AMD's SVP of AI stated that over 200k tests run every evening internally at AMD. However, this seems to have done little to ameliorate the many AMD software bugs we ran into, and we question whether AMD's CI/CD tests include proper performance regression, functional, and convergence/numerics testing. We will summarize a few examples here for readers to understand the nature of the AMD software bugs we encountered and why we feel they have been very obstructive to a good user experience on AMD.

Although AMD's own documentation recommends using PyTorch-native Flash Attention, for a couple of months this summer, AMD's PyTorch-native Flash Attention kernel ran at less than 20 TFLOP/s, meaning that a modern CPU would have computed the attention backwards layer faster than an MI300X GPU. For a time, practically all Transformer/GPT model training using PyTorch on the MI300X ran at a turtle's pace. Nobody at AMD noticed this until a bug report was filed following detailed PyTorch/Perfetto profiling showing that the backwards pass (purple/brown kernels) took up far more time than the forward pass (dim green section). Normally, the backwards section should take up just ~2x as much time as the forward pass (slightly more if using activation checkpointing).

Source: SemiAnalysis

Another issue we encountered was that the AMD PyTorch attention layer led to a hard error when used with torch.compile, due to the rank of the logsumexp tensor being incorrect. What was frustrating is that this had already been fixed in internal builds of AMD PyTorch on May 30th, but did not reach any AMD PyTorch distributions or even any PyTorch nightly builds until October, when it was pointed out to them that there was a bug. This shows a lack of testing and dogfooding of the packages AMD puts out to the public. Another core reason for this problem is that the lead maintainer of PyTorch (Meta) does not currently use the MI300X internally for production LLM training, leading to code paths not used internally at Meta being buggy and not dogfooded properly. We believe AMD should partner with Meta to get Meta's internal LLM training working on the MI300X.

Source: SemiAnalysis

On August 8th, Horace He and the Meta PyTorch team released FlexAttention, a critical API for creating non-causal attention layers without losing speed. Previously, to use attention variants like document masking, sliding window attention, softcap, and ALiBi, a user would need to spend weeks hand-crafting their own kernel in CUDA/HIP and subsequently pybinding it into PyTorch. With FlexAttention, a user can rapidly create all these attention variants using the API. FlexAttention achieves great performance by using block sparsity, only computing the blocks of the mask that are needed and ignoring the rest.

Source: SemiAnalysis
Source: Meta
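
As an illustration of why this API matters, a sliding-window-plus-causal mask can be expressed in a few lines with FlexAttention (a sketch for PyTorch 2.5+; the window size and tensor shapes are assumptions):

    import torch
    from torch.nn.attention.flex_attention import flex_attention, create_block_mask

    WINDOW = 4096  # assumed sliding window size

    def sliding_window_causal(b, h, q_idx, kv_idx):
        # Attend only to past positions that fall inside the sliding window.
        return (q_idx >= kv_idx) & (q_idx - kv_idx <= WINDOW)

    B, H, S, D = 1, 32, 8192, 128
    q, k, v = (torch.randn(B, H, S, D, dtype=torch.bfloat16, device="cuda") for _ in range(3))
    block_mask = create_block_mask(sliding_window_causal, B=None, H=None, Q_LEN=S, KV_LEN=S)
    out = flex_attention(q, k, v, block_mask=block_mask)  # fully-masked blocks are skipped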

With sliding window attention, FlexAttention can improve performance by 10-20x! This is amazing for the end user, but unfortunately, MI300X FlexAttention was in a poor state and suffered from many AMD software bugs (including convergence issues) until but a couple of days ago. While the latest PyTorch nightly now fixes the convergence issues, this contrasts starkly with FlexAttention on Nvidia, which has been available since August. That means a ~6 month gap exists between the availability of these wonderful PyTorch features on Nvidia's and AMD's platforms. For frontier AI labs, six months is a lifetime, with OpenAI, Anthropic, and Google having released many models in such a span.

Source: SemiAnalysis

Exploring Ideas for Better Performance on AMD

AMD recommended we try PYTORCH_TUNABLE_OPS to improve GEMM performance by sweeping through GEMM algorithms at runtime. However, as we mentioned earlier, this API works poorly because GEMMs should be tuned when compiling hipBLASLt/rocBLAS/cuBLASLt, not during the user's runtime. Users of Nvidia H100s do not need to use PYTORCH_TUNABLE_OPS for most shapes because the cuBLAS heuristic model picks the correct algorithm. This contrasts with AMD's heuristic model, which seems to pick the wrong algorithm for most shapes. We recommend that AMD stop suggesting that users try tunable ops and instead focus on properly tuning their GEMM libraries internally.

When we tried PYTORCH_TUNABLE_OPS on AMD, it led to an HBM memory leak of over 25 GBytes out of the total MI300X capacity of 192 GBytes, essentially wiping out the MI300X's HBM capacity advantage over the H100. The fix for this is to set a default hipBLASLt and rocBLAS workspace to prevent memory leaks.

Source: PyTorch/AMD

As we mentioned earlier in this article, another issue we ran into was the plethora of environment flags needed on the MI300X to make it actually usable. We recommend that AMD stop putting users in the position of having to set these environment flags themselves and, instead, set defaults that lead to a usable environment. It is not simply their number, but also the complex interactions between the flags, that makes troubleshooting difficult. Getting reasonable training performance out of the AMD MI300X is an NP-hard problem.

Another issue is that certain AMD ROCm libraries could not be installed inside Docker due to AMD software CMake bugs leading to hard errors. This has since been fixed. On AMD GPUs, you need to pass in a convoluted set of flags to get the GPUs to work inside a container, whereas on Nvidia, getting GPUs to work in Docker is as simple as passing in "--gpus=all". We recommend that AMD partner with Docker and ensure that Docker can auto-detect AMD GPUs as well, making the workflow as streamlined as when working with Nvidia GPUs.

Source: SemiAnalysis

AMD’s Forked Libraries

Many of AMD's libraries are forked off Nvidia's open-source or ecosystem libraries. AMD uses a tool called HIPIFY to carry out source-to-source translation of Nvidia CUDA to AMD HIP. While the motivation is understandable, they are nevertheless building on top of their competitor's platform and cannot expect to match or exceed Nvidia's user experience with this software development strategy. They need to build out software for their own ecosystem. For example, instead of supporting FP8 training by forking Nvidia/TransformerEngine and doing source-to-source translation, they should work to make PyTorch-native FP8 training run well on their own hardware. Currently, PyTorch-native FP8 training recipes don't work on AMD and the unit tests don't even pass; there is no CI/CD for AMD PyTorch-native FP8 training.

Source: SemiAnalysis

Detailed Recommendations to AMD on How to Fix Their Software

First, AMD needs to focus on drawing in more software engineering resources and improving compensation for current engineers. The current compensation gap between AMD and Nvidia means that top talent is lured to Nvidia over AMD. This top talent is also attracted to Nvidia because it has far more compute/resources for engineers. AMD should procure more GPUs for its in-house development work and submit an MLPerf GPT-3 175B result as soon as possible. Even if the result is not competitive with Nvidia right now, submitting such a benchmark will kick off the process of iterative improvement.

We also note that AMD often gives its customers custom images, and, in fact, AMD developers themselves often work on top of such bespoke images. This is not best practice, as it means that AMD engineers have a different experience than what the images available to the public provide. AMD should instead raise the standard of public images by using those images internally and with its customers, and the AMD executive team should personally test (i.e. "dogfood") what is getting shipped publicly.

We recommend that AMD create a public dashboard that runs every night, showing the performance of their hardware on benchmarks such as MLPerf or TorchBench. This dashboard should also include H100/H200 performance as a baseline.

Finally, AMD needs to completely change its approach to environment flags. Instead of requiring users to set a myriad of flags to get running out of the box, it should set them to recommended defaults so users can get started quickly.

AMD should collaborate with Meta to get production training workloads working on ROCm, as it is well known amongst PyTorch users that PyTorch code paths tend to have tons of bugs unless Meta uses them internally. Meta currently hand-writes HIP kernels for its production MI300X inference but does not use the MI300X for real training. It would be a wonderful improvement for the AMD ecosystem, and a marketing win, if a smaller version of the next Llama were trained on AMD. Not to mention that this would open the door to AMD progressively moving towards larger models/clusters with Meta. Meta using AMD GPUs for actual model training would be a win-win for both companies, as Meta is also looking for alternative training chips to Nvidia.

Currently, Nvidia offers well over 1,000 GPUs for continuous improvement and development of PyTorch externally and many more internally. AMD doesn't. AMD needs to work with an AMD-focused GPU neocloud to have ~10,000 GPUs of each generation for internal development purposes and PyTorch. This will still be 1/8th of what Nvidia has with its coming huge Blackwell clusters, but it's a start. These can be dedicated to internal development and CI/CD for PyTorch.

Lisa, we are open to a meeting on how to change AMD's datacenter GPU user experience for the better!

H100/H200/MI300X Networking BoM Analysis and Performance per TCO

In addition to our benchmarking of collectives and GEMM throughput, we have carried out several experiments exploring relevant topics for conducting further benchmarks and running real-world workloads on clusters. These experiments cover benchmarking warmup and repeat effects, VBoost power shifting, MLPerf Training GPT-3, BF16 vs FP16 throughput, throughput by GEMM input distribution, power per FLOP, and throughput for the PyTorch PyPI distribution vs Nvidia NGC stable PyTorch images.

We also present a detailed networking bill of materials (BoM) analysis for 1k GPU Ethernet, 1k GPU InfiniBand, 16k GPU Ethernet, and 16k GPU InfiniBand clusters. We also discuss the impact of using 51.2T radix vs. 25.6T radix switches for back-end networking.

Lastly, we present a performance-per-TCO analysis that shows how the H100/H200/MI300X stack up in terms of $/hr per effective training petaflop. These items are available below to all SemiAnalysis subscribers and will be of great interest to datacenter operators, ML scientists, and investors.

