- Irrational Analysis is heavily invested in the semiconductor industry.
- Please check the 'about' page for a list of active positions.
- Positions will change over time and are regularly updated.
- Opinions are the author's own and do not represent past, present, and/or future employers.
- All content published on this newsletter is based on public information and independent research conducted since 2011.
- This newsletter is not financial advice, and readers should always do their own research before investing in any security.
- Feel free to reach out to me via email at: irrational_analysis@proton.me
I wrote some negative things after seeing Tenstorrent's Hot Chips 2024 presentation.
A lot of Tenstorrent employees subscribed because of this post.
Surprisingly, David Bennet (Chief Customer Officer) invited me to meet senior software, architecture, and customer relations leadership for an open discussion.
The meeting lasted 1.5 hours and was a ton of fun.
Before going over the actual content of this post, I would like to discuss the topic of bias.
Am I biased? Yes! This is why I regularly post all >1% positions with average price.
The Tenstorrent people are very pleasant, and it was a lot of fun talking to them. This writeup is largely optimistic because Tenstorrent did a very good job of answering my questions with honest, technical answers. It's up to you to decide for yourself if what I write is trustworthy.
- I hold no economic interest in Tenstorrent.
- Tenstorrent did not pay me any money or provide gifts of meaningful value.
- I ate one cookie and drank a bottle of water.
- Got a (branded) swag bag containing a hat, thermos, and cable carrying/travel bag.
- No Tenstorrent staff have reviewed or edited this post. They will read it at the same time as everyone else.
I made a framework for thinking about AI hardware startups at the end, so this post has value to people who are not interested in Tenstorrent.
- Tenstorrent Technical Overview
- Discussion Summary
- Register Spilling and v_
- Baby RISC-V Fragmentation
- Baby RISC-V Capabilities
- History of the Software Stacks
- Latency
- Davor Capalija Whiteboard Diagram
- Tenstorrent Opinion and Valuation Framework
- Broader AI Hardware Startup Framework
Blackhole (gen 2) has detailed information available and will be the focus of discussion.
Main takeaways for Grendel (gen 3):
- Move to a chiplet architecture.
- Separate high-performance RISC-V CPU cores and AI cores into separate chiplets.
The Tenstorrent architecture is a mesh topology with a variety of blocks.
The 16 big RISC-V CPU cores are for general-purpose code. You can run Linux on them.
Baby RISC-V cores are super tiny embedded CPU cores. The 752 baby RISC-V cores take up less than 1% of die area.
The point of these baby RISC-V cores is to launch kernels. They are basically the control logic.
Tensix cores are the AI compute building blocks. Each has five baby RISC-V cores to control kernel launches. "Compute" means the vector and matrix math engines.
Note that two of these baby RISC-V cores are dedicated to router nodes. The data-movement model is important for understanding a lot of the decisions made in this architecture.
The compiler for these baby RISC-V cores is a lightly modified GCC. Users do not need to structure the kernel splitting for the three compute-assigned baby RISC-V cores, as the compiler does this automatically. In other words, you only need to write one compute kernel, in theory.
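To illustrate the one-kernel-in, multiple-binaries-out idea, here is a toy sketch. This is not the real TT toolchain; the function and the unpack/math/pack stage names are my own shorthand for the three compute-assigned cores.

```python
# Toy model (not the real TT toolchain) of how one user-written compute
# kernel fans out to the three compute-assigned baby RISC-V cores.

def split_compute_kernel(kernel_name):
    """Mimic the compiler splitting one kernel into three per-core stages."""
    stages = ["unpack", "math", "pack"]  # one stage per compute baby RISC-V
    return [f"{kernel_name}_{stage}" for stage in stages]

# The user writes ONE kernel...
artifacts = split_compute_kernel("eltwise_add")
print(artifacts)  # ['eltwise_add_unpack', 'eltwise_add_math', 'eltwise_add_pack']

# ...and the toolchain compiles each stage separately, plus the firmware
# for the two data-movement cores: five GCC invocations per Tensix core.
gcc_invocations = len(artifacts) + 2
print(gcc_invocations)  # 5
```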
The baby RISC-V cores in the DRAM and Ethernet "cores" are just router nodes.
I have tried to organize the 90 minutes of discussion into something readable. I only have my illegible notes, mediocre memory, and a few pictures of whiteboard drawings to work with.
Converting a free-form conversation among ~10 people into a coherent written summary is unexpectedly difficult. If you have feedback, please share.
To prepare, I browsed the Tenstorrent documentation to see if anything jumped out.
Initially, I thought that register spilling was a super important feature. Tenstorrent responded that the baby RISC-V cores are so small that register spilling does not make sense, given the intended programming model.
Cores are either dedicated to data movement (router node control) or to the compute units. In the compute case, three baby RISC-V cores are assigned as a group. The higher-level compiler calls GCC five (5) times for each Tensix core.
So the user's compute kernel is already split before the three compute instances of GCC get called.
Given the intended programming model, this seems fine. It still looks like a challenge for kernel writers, but nothing insurmountable.
These "v_" things showed up a lot, and one line jumped out. It almost looked like some kind of sparsity support?
Turns out, this is the Tenstorrent RISC-V way of implementing the masking features from AVX-512 (x86) and SVE (aarch64).
Masking features help with power saving.
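As a rough illustration of what masking buys, with NumPy standing in for hardware predication (this is not Tenstorrent code): lanes where the mask is false are left untouched instead of being computed and thrown away, which is also where the power saving comes from.

```python
import numpy as np

# Predicated (masked) vector add, in the spirit of AVX-512/SVE masking:
# only lanes where mask is True are updated; inactive lanes keep their
# old values (and, in hardware, can be gated off to save power).
def masked_add(dst, a, b, mask):
    out = dst.copy()
    out[mask] = a[mask] + b[mask]
    return out

dst  = np.zeros(4)
a    = np.array([1.0, 2.0, 3.0, 4.0])
b    = np.array([10.0, 20.0, 30.0, 40.0])
mask = np.array([True, False, True, False])

print(masked_add(dst, a, b, mask))  # [11.  0. 33.  0.]
```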
At Hot Chips 2024, there was a moment in the Q&A that I misinterpreted.
I incorrectly thought that the baby RISC-V cores have different ISAs.
Tenstorrent clarified that this is not accurate. All 752 baby RISC-V cores have the same ISA. The ones assigned to the NoC (data movement) have structural optimizations for that workload (physical layout...?) but no ISA difference.
One of my major concerns was RISC-V ISA fragmentation, and they clarified that this is not a problem. The same modified GCC gets called for everything.
One of the programming examples on GitHub caught my interest.
I was curious about the compute capabilities of these baby RISC-V cores. Can users weave in some branchy logic or niche calculations? Is anyone using this capability?
They were honest in answering this question. Yes, it is possible to have complex control flow in the data-movement baby RISC-V cores, but users should expect to hit a bottleneck. These tiny cores are meant for kernel launches.
Given the open-source nature of everything Tenstorrent, I still hope some cracked programmer figures out how to make something cool happen along these lines.
There is a wide variety of software stacks supported by Tenstorrent.
The current software stack is attempt #6. A lot of good stuff, but I was wondering what happened to attempts 1 through 5.
Apparently, the old way was to re-create the software stack with each new hardware revision and set of target workloads.
They literally had the RTL engineers write software kernels... lol.
But the new method is to build unified software, top to bottom. Let people come in from the top (ML model -> graph -> netlist) via Buda, from the bottom (low-level kernels), or somewhere in between (Metalium, TT-NN, TT-MLIR).
Entry points for everyone. Maximum open source and transparency. Easy access to developer kits/hardware.
These days, some of the best kernels for Tenstorrent hardware are written by new grads and hobbyists on the official Discord.
One major concern I had (and still have, to an extent) is latency in the mesh topology of Tenstorrent scale-out. Latency kills inference performance.
The first argument offered by Tenstorrent is that their scale-out is much lighter-weight than standard RDMA over Converged Ethernet (RoCE).
Apparently, they stripped a lot of features out. No TCP/IP.
Other strategies they have been working on:
- Overlapping of data prefetch.
- Enabled by pipelining.
- Runs on the DRAM baby RISC-V cores.
- Z-shortcut, very important for latency optimization.
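The prefetch-overlap bullet is classic double buffering: while compute works on tile N, the next tile is already being fetched. A toy timing model with made-up per-tile costs (my numbers, not Tenstorrent's) shows why this matters for latency:

```python
# Toy double-buffering timeline (illustrative only): with overlap, the
# fetch of tile N+1 hides behind the compute on tile N.
FETCH, COMPUTE = 4, 6  # hypothetical per-tile costs (arbitrary time units)

def serial_time(n_tiles):
    # fetch then compute, one tile at a time, no overlap
    return n_tiles * (FETCH + COMPUTE)

def overlapped_time(n_tiles):
    # only the first fetch is exposed; afterwards each step costs
    # max(fetch, compute) because the two proceed in parallel
    return FETCH + n_tiles * max(FETCH, COMPUTE)

print(serial_time(8))      # 80
print(overlapped_time(8))  # 52
```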
Tenstorrent showed me some internal-only slides on their recent latency-reduction improvements, including the performance improvement for Llama 3.1 70B inference. The non-public slides and the ~10 minutes of conversation related to this topic were compelling.
(To be clear, there is still a huge amount of work to be done.)
They claim the slides will eventually be published as a case study on their GitHub.
¯\_(ツ)_/¯
Jim Keller (CEO of Tenstorrent) previously worked on Tesla FSD. In multiple public talks, he has brought up an interesting line of thinking.
The anecdote is simple. Jim Keller says he taught his daughters to drive with two hours of practice. He did not need to feed his daughters exabytes of video data and spend months having them pore over all that video.
Tenstorrent has an implicit view that the future of AI is mixed-workload, not pure linear algebra spam. Yes, MATMUL go BRRRRRR is valuable, but CPU workloads will be needed in the future. That is the hope.
So far, this has not played out. Nvidia Grace is an overpriced memory controller. AMD had mixed CPU and GPU chiplets in MI300A (almost no sales outside of two traditional HPC supercomputers) but is going all-GPU chiplets in future generations. The vast majority of AMD datacenter GPU volume is the pure-GPU MI300X.
What's interesting is that Tenstorrent is the last company standing that is working on AI and has a good CPU microarchitecture team. If future AI workloads need CPU integration, they are the only well-positioned vendor.
I brought this up and asked Tenstorrent if they have customers moving towards, or interested in, a more mixed AI workload, and they said not yet.
Davor (Senior Software+Architecture Fellow) spent the majority of the meeting drawing stuff on a whiteboard and explaining the Tenstorrent architecture in great detail. He even went beyond and made honest comparisons to Nvidia/GPU (rip AMD), Google TPU, Groq, SambaNova, and Cerebras.
I found his framework so compelling that all this material needed to be broken out into a dedicated section.
I took pictures of the whiteboard (with permission) and re-created everything in Visio.
His arguments are very good. I will tell you up front that I agree with 80% of what he said/drew.
A paraphrased summary of Davor's arguments will be written in italics. I will include my comments (including disagreements) afterward in normal (non-italicized) font.
CPU workloads are designed around the cache hierarchy.
GPUs were originally built for graphics and designed for thread-based scaling. Each CUDA core is really an ALU. The streaming multiprocessor (SM) is the real GPU "core". GPUs have fewer cache levels (only L1 and L2) compared to CPUs.
AI workloads need many cores and a DMA engine. Including an L2 does not make sense.
The Tenstorrent architecture is essentially a GPU, but better.
- Cores are the right size in terms of compute and SRAM (2MB).
- No L2 cache.
- Baby RISC-V CPU cores are used for kernel launching, data movement, and scheduling.
- The TT architecture is built around DMA from the ground up.
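One way to see the "built around DMA, no L2" point is that a software-managed scratchpad makes data movement explicit in the schedule instead of implicit in a cache replacement policy. A toy sketch (my own illustration, not TT code):

```python
# Toy sketch: explicit DMA into a two-slot (double-buffered) scratchpad.
# The schedule itself says which tile lands in which SRAM slot and when,
# so there is no cache hierarchy guessing at the access pattern.

SCRATCHPAD_SLOTS = 2  # e.g. a 2 MB SRAM split into two tile slots

def dma_schedule(tiles):
    """Assign each streamed tile to a scratchpad slot, ping-pong style."""
    return [(tile, f"slot{i % SCRATCHPAD_SLOTS}") for i, tile in enumerate(tiles)]

print(dma_schedule(["t0", "t1", "t2", "t3"]))
# [('t0', 'slot0'), ('t1', 'slot1'), ('t2', 'slot0'), ('t3', 'slot1')]
```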
With respect to the non-Nvidia competition, Tenstorrent is also better.
- The Google TPU has big cores that make utilization difficult.
- Exotic "kernel-less" startups like Groq, Cerebras, and SambaNova place an enormous burden on ahead-of-time compilation.
I find Davor's arguments compelling.
(Reminder that NVDA 0.00%↑ is my biggest position, at an average price of $16/share.)
The architecture design philosophy may be "right", but the existing CUDA moat does not care. Nvidia has way more resources and can easily support all the legacy stuff and various hacks, and win even if the architecture is sub-optimal. I know people who work on the Nvidia software performance optimization teams. They are very smart, and there are more of them than Tenstorrent has employees in total.
More on this in section [5].
The other disagreement I have is Davor lumping Groq, Cerebras, and SambaNova into a single category. These three architectures are wildly different.
Groq is 144-wide VLIW and has to compile everything ahead of time in a cycle-accurate manner. It is retarded.
I really believe Groq is a fraud. There is no way their private inference cloud has positive gross margins.
Cerebras is a wafer full of tiny cores. Massive scheduling issues, but not in the same way as Groq. If Davor had just compared Cerebras with Tenstorrent, then I would agree with all his whiteboard arguments.
Looking forward to the Cerebras IPO so I can trade the crap out of this stock.
SambaNova is a CGRA (a very cool super-FPGA) that is in its own category. The compiler is more like FPGA logic synthesis.
I went into the meeting thinking Tenstorrent was just another Nvidia-roadkill company doomed to death.
I left the meeting really optimistic.
Every question I had got a good, well-explained, credible answer/counterargument. The only remaining technical concern I have is latency, as that is a hard problem.
What surprised me is that every single Tenstorrent person I talked to was smart and (more importantly) reasonable. It is very common for smart engineers to become delusionally optimistic, blinded by hubris.
This is not Tenstorrent. They are reasonable, down-to-earth, and rational. If any AI hardware startup can break through the Nvidia and semi-custom (Google TPU, Amazon Trainium, Microsoft Maia, Meta ???) moat, it is Tenstorrent.
Tenstorrent is the open-source champion of AI hardware. Everything about their developer strategy is right. Multiple open software tools. Excellent (A+) community and developer engagement. Easy access to purchasing sample hardware.
Many technical/engineering problems need to be solved, but this team can get it done. Very excited to see how next-gen Grendel plays out.
Now for the valuation question.
Tenstorrent recently closed a Series D venture capital raise at a $2B valuation.
Earlier this year, I heard chatter that Tenstorrent was pivoting to selling RISC-V IP. This was a sign of weakness and desperation... until ARM decided to engage in moronic behavior.
Let's take a look at the ARM stock chart.
Many (basically all) of my finance-land friends think the ARM 0.00%↑ valuation is nuts. Yes, there is low float and a crazy multiple, but... it's been trading that way for quite a while.
Given that ARM is aggressively raising license prices and royalty rates and has (metaphorically) nuked Qualcomm, RISC-V CPU IP clearly has a bright future.
I think it is reasonable to say Tenstorrent is worth 1% of ARM's market cap ($1.6B) just from the IP business.
There are no other high-performance RISC-V IP providers. All other options are for low-end stuff. Tenstorrent has Jim Keller (CPU design living legend) and many other great CPU microarchitects.
The only other competitor (SiFive) decided to commit seppuku. They fired the entire high-performance team.
Tenstorrent could completely fail at AI, and the company would still be worth at least 1% of ARM's market cap!
The other valuation method I have is against Cerebras, who is attempting to IPO and unload equity on uninformed retail investors.
First, let's compare the CEOs.
Now let's compare the companies themselves.
I think Tenstorrent should be worth more than Cerebras. Either Tenstorrent is comically undervalued or Cerebras is comically overvalued.
AI hardware startups have a problem. A customer problem.
Who is going to buy horizontally supplied AI chips from a startup?
There are many options:
- Nvidia
  - Buy the latest generation.
  - Rent the latest generation from Azure, AWS, or GCP.
  - Rent old-generation GPUs at comically cheap prices from a neocloud that is about to go bankrupt.
- AMD (they keep dropping prices)
- Build your own semi-custom chip.
  - Amazon Trainium, Google TPU, Microsoft Maia, Meta ???
  - Use Broadcom, Marvell, Alchip, MediaTek, or GUC as a design partner.
- Rent semi-custom chips (TPU, Trainium) from GCP or AWS.
Who are all these AI hardware startups going to sell to?
Mega-corps are making their own chips in partnership with five design companies for semi-custom solutions.
AMD makes a meh product, but it is cheap and has lots of HBM.
Smaller companies can easily rent old Nvidia gear from a wide variety of shitcos (sorry, "Neoclouds") at rock-bottom prices. H100 hourly rental prices have already collapsed, and Blackwell has not even ramped yet!
Horizontally supplied training hardware is a dead market. Nvidia and the semi-custom hyperscaler chips crowd everyone else out.
Bluntly, I believe every AI hardware startup should give up on training and exclusively focus on inference. There is no hope. Pivot now, maybe live. Keep working on training, definitely die.
As for inference, there is hope, but only if cost is irrationally low or performance is irrationally high from some exotic strategy. If your chip uses HBM and is built on TSMC N3, it probably won't be cheap enough to compete with all those depreciated H100s and heavily subsidized Trainiums.
Tenstorrent is using Samsung Foundry SF4X, a dirt-cheap process node. Samsung Ventures has also invested in them, so I suspect they got a good deal on tape-out. Finally, Tenstorrent is not using HBM. If demand materializes for Tenstorrent hardware, they can achieve healthy 60% gross margins while meaningfully undercutting the competition on price.
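To make the margin claim concrete with purely hypothetical numbers (none of these come from Tenstorrent): a low bill of materials lets a healthy gross margin and a price undercut coexist.

```python
# Hypothetical illustration only -- every number here is made up.
def price_for_margin(unit_cost, gross_margin):
    """Selling price needed to hit a target gross margin."""
    return unit_cost / (1.0 - gross_margin)

unit_cost = 2_000.0          # assumed BOM: cheap node, no HBM
competitor_price = 10_000.0  # assumed competitor ASP

price = price_for_margin(unit_cost, 0.60)
print(round(price))              # 5000
print(price < competitor_price)  # True: 60% margin while undercutting
```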
If Tenstorrent executes on their next generation and spends another 18-36 months improving the software stack, they have a credible chance of meaningful sales into the rapidly growing inference market. I believe Tenstorrent is the only investment-grade AI hardware startup.
With that said, there are two AI hardware startups that I would like to positively mention.
SambaNova has a crazy CGRA architecture that is really cool. It's so crazy it might work.
Positron has a galaxy-brain founder+CEO.
I cannot recommend enough that you watch the incredible 15-minute AI-and-economics presentation by Positron AI CEO Thomas Sohmers.
There are lots of smart engineers out there. It is very rare for someone to be a top 0.1% engineer AND smart in other topics such as economics, history, game theory, and philosophy.
The investment case for Positron AI is a galaxy-brain founder in Founder Mode.
(I hold no economic interest in any private company. I only invest in public markets.)