Everyone loves to talk about tech monopolies. Their acquisition sprees and obvious market power in a world with no distribution cost are likely better discussed in the DOJ recommendation or at the venerable Ben Thompson’s Stratechery. Instead, I want to talk about some good old-fashioned monopolizing: vertical integration by going down the technology stack into hardware. I want to discuss why now, why it matters, and how each of the large platforms is positioned.
Moore’s Void
The phrase “Owe the bank 500 dollars, that is your problem. Owe the bank 500 million, that is the bank’s problem” comes to mind for some of the tech monopolies right now. The relationship between the largest software companies in the world and their suppliers is shifting. As the leading software companies have become ever-larger portions of the compute pie, pushing forward the natural limits of hardware has become their problem, and not the problem of the semiconductor companies that service them. Software ate the world so completely that the large tech companies now have to deal with the actual hardware that underlies their stack, especially as some companies like Intel have fallen behind.
At this point in time, no other company has ever held such a concentrated share of absolute compute and sold it as a service. Even IBM at its zenith sold PCs and mainframes (and they still ran a tightly integrated stack!), not units of compute disaggregated the way an infrastructure-as-a-service provider sells them. As Moore’s law has broken down and AI compute demand has skyrocketed, this has become a real problem for these companies, and they are aware of it. This great video about open-source EDA and tooling problems (if you’re a nerd you’ll enjoy it) started with some interesting caveats.
However, if you look at Google's products, where our demand for compute power continues to grow substantially, frequently at exponential type numbers, and this used to be a free ride with Moore's law. Giving us increasing compute power to keep up with this increasing demand for computation. But that's kind of come to an end now which isn't great, and so we have a lot of projects at Google that are trying to solve this and I'm working on one of them. I don't work on the most successful project we've had here, which is the TPU. This is a ML accelerator that has drastically increased our ability to do ML compute, and this is kind of an interesting thing in that it's showing that hardware that is domain specific can potentially keep up with this growing demand for compute. The problem is it's taking a lot of effort to create these hardware accelerators and while some groups at Google are big enough, like the ML people, to have dedicated teams working on dedicated hardware like the TPU, we are looking at the problem that every team at Google is eventually probably going to have to look at hardware accelerating their workloads, especially if their demand continues to rise.
Btw - this transcript was made with https://hierogly.ph/ made by @_nd_go and me.
So clearly this is top of mind at many of the tech companies around the world. I had wanted to write about this in February of this year, but just a few months later we’ve already seen the thesis play out in a big way: Apple’s M1/AX chips, AWS’s Graviton, Azure’s Catapult, and heck, even Facebook is rumored to be starting its own chip platform. I don’t see it stopping any time soon; in fact, I think this will accelerate.
I believe that in a few years, most of the large tech companies will have a much tighter level of integration and we will likely see far fewer “commoditized” platforms. Yes, they might run on partially open stacks (think open networking roadmap and Facebook), but their differentiation is going to be not only software but also hardware. We are going back to the old pattern of integrating software and hardware.
The unit economics here are profound, partly because a company that doesn’t pursue this will have to pay the exponential cost of AI compute at face value, and partly because potential competitors will face a new barrier to entry. The profit deserts around their moats, as mentioned in the first @modestproposal1 Invest like the Best podcast, will climb even higher. They will be able to sell products below their competitors’ prices while still making a profit.
Let’s walk through each of their plans.
No company better exhibits vertical integration than Apple. Something I have always wondered is “what could Apple be spending its absolute dollars of R&D on?” As time has gone on, the answer has become clear: building its in-house semiconductor expertise.
Maybe it’s the rumored coming glasses or car, but it’s much more likely they are following the exponential curve of new nodes. 5nm is hitting an asymptote, a place where many other companies can’t compete in terms of absolute dollars, and by being there first they now have an absolute hardware advantage that can be tightly integrated into their software. Also, this will help them increase their leverage over their partners who make software in their ecosystem, requiring them to tightly integrate or face terrible performance.
Every year that goes by, Apple slowly brings more of its platform in house. The big recent announcement was the M1, which offered not only a fast CPU but the best of what heterogeneous compute has to offer: better battery life, better optimization with certain applications, and a lower cost for Apple. This was a slam dunk and just a sign of the things to come. They have also of course worked on making their own smartphone modems, designed the new U1 UWB chip in house, and continue to grind higher with the out-of-this-world Axx mobile phone platform. Apple aspires one day to be close to all in house, offering better products for cheaper: a fully walled silicon garden.
Google on the other hand has only had a single large foray into custom silicon, but one with a big splash, their TPU. Google is the leading company in AI, with more cited research than any other company or university in the world.
Google’s TPU is a tensor processing unit, created specifically to work with Google’s TensorFlow framework. TPUs are an effort to capture both the hardware and the software components and create an end-to-end AI stack that is controlled by Google. If AI continues to grow (it will) and it does so on the back of Google’s TensorFlow platform, Google will inevitably be well-positioned, both from a software perspective and from a selling-the-compute perspective. The TPU is their attempt at a walled garden within AI. And since you can only rent TPUs in the cloud, this is how they will monetize their strategic place so close to the software.
But I don’t think Google is stopping there. Google just recently announced a partnership with TSMC to pursue SoIC, and likely will be launching a whole array of custom silicon for their Waymo subsidiary, as driverless cars are one of the highest compute demand drivers. They are also rumored to be launching a new ARM chip for their Pixel phone lineup. There will be more to come.
AWS is the world’s largest IaaS (Infrastructure as a Service) player, and it is no surprise that they have been pursuing this strategy all along. Their strategy is much lower level and a bit out of the view of the public, but they have done a lot of impressive work here. For Amazon, it all starts with their acquisition of Annapurna, which has led to DPUs and Graviton. From an infrastructure-as-a-service perspective, they are the furthest along toward offering a full-stack silicon walled garden. Annapurna pioneered one of the first network DPUs, launched the now successful Graviton chip, and is likely to launch an AI accelerator soon. Something that is really striking to me is the anecdata I have seen coming out of Twitter as of late regarding Graviton.
A reminder that while there aren’t intense price cuts like there used to be at the dawn of cloud computing, price matters a lot for IaaS. The point is that you’re buying generic compute units, and if those are delivered to you for much cheaper from Graviton chips, that surely delights the customer. This will help scale the Graviton platform, pushing it to smaller geometries like 5nm and creating huge economies of scale and cost savings that AWS gets to pocket. Intel’s margin is now Amazon’s opportunity. The barrier to entry in this notoriously capital-intensive industry (IaaS) is just getting larger by the day.
Amazon isn’t just an IaaS company as you know, and recently they shifted some of their Alexa chips in-house from Nvidia. As they enter new segments, they will continue to bring their chip expertise to every market Amazon touches. And as they expand into retail, healthcare, and other new verticals, that’s bound to amount to some more interesting platforms.
Microsoft’s silicon plays are a lot quieter than the others, and I only really started to peel back the onion after learning about Inphi. In particular, they have a relatively novel datacenter peering strategy dependent on ColorZ, but I also believe that they will start to walk the way of custom silicon very soon. An example of this is the ARM-based Surface that Microsoft has been designing. This is pretty striking if you remember the Wintel alliance, as it seems that Microsoft is willing and ready to give the marriage up.
The platform they have talked about the most, and where they have made the most progress on custom silicon, is the edge. Azure Sphere is a new, highly opinionated ARM-based ecosystem anchored by their Pluton chip, promising security and performance.
Last, there is Microsoft’s Catapult platform. This is a custom-designed product built on FPGAs supplied by Altera, and it is their specific take on a flexible network design (think DPUs). Within the Azure datacenter, Catapult helps move data throughout their network, as well as accelerate certain workloads. Going forward I think the Catapult configuration will approach traditional DPUs, a new emerging market that has just begun. Watch for more announcements.
And this pattern does not just apply to the IaaS providers, but to the largest internet properties as well. Facebook as a standalone cloud is likely in the top 10 globally. Facebook so far has not announced significant custom silicon initiatives, but guessing at their footprint and Oculus, I’m going to say something will pop up in the next few years.
Custom Silicon’s High Barrier to Entry
The reason why I wrote this entire article was to highlight what it looks like for a company from the outside of these custom platforms. And it’s daunting.
Source: https://www.extremetech.com/computing/272096-3nm-process-node
This is a barrier to entry that few companies can really climb over anymore. An R&D budget of 500 million is only possible for a few companies (270 according to a screener I used), and many of the companies with R&D budgets larger than 500m are large tech companies themselves. It is no surprise they are going custom, as this is now a very capital-intensive way to create a gulf between them and the rest. For example, something I wanted to note is that every single company mentioned so far spends more on absolute R&D than Intel! Samsung, a company that is out of the scope of this discussion, rounds out the list of the companies that spend more than Intel on R&D worldwide. This is likely not a coincidence! Semiconductors are becoming more capital-intensive as we hit the wall of physics, and by being at that leading edge the new technology monopolies will get to operate in that world alone.
Just imagine now that you are an entrant, trying to sell IaaS, maybe like Digital Ocean (huge fan). If Intel and AMD chips are all that you can use, you better pray and hope their roadmaps are strong, because now that your competitors are able to create and expand their own roadmaps faster than the large semiconductor platforms, you may be forced to eventually buy from them or just be at a structural gross margin disadvantage. You could offer identical services but make worse profits, just on the basis that you don’t make your own chips. If they lower prices, you could even lose money! You cannot compete.
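To make that margin math concrete, here is a rough sketch of the arithmetic. Every number in it is made up for illustration (the prices, the per-hour chip costs, and the `Provider` class itself are my assumptions, not real figures from any cloud provider). It only shows how two companies selling an identical compute unit at an identical price can end up with very different gross margins when one of them pays a merchant-silicon markup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provider:
    """A hypothetical IaaS provider selling one generic compute unit."""
    name: str
    price_per_hour: float        # what the customer pays (illustrative)
    chip_cost_per_hour: float    # amortized silicon cost (illustrative)
    other_cost_per_hour: float   # power, networking, staff (illustrative)

    def gross_margin(self, price: Optional[float] = None) -> float:
        """Gross margin as a fraction of price, optionally at a new price."""
        p = self.price_per_hour if price is None else price
        cost = self.chip_cost_per_hour + self.other_cost_per_hour
        return (p - cost) / p


# Same service, same list price; only the cost of silicon differs.
incumbent = Provider("in-house silicon", 1.00, 0.20, 0.40)
entrant = Provider("merchant silicon", 1.00, 0.35, 0.40)

for provider in (incumbent, entrant):
    print(f"{provider.name}: margin at list price = {provider.gross_margin():.1%}")

# Now suppose the incumbent cuts the market price to 0.70 per hour.
cut_price = 0.70
for provider in (incumbent, entrant):
    print(f"{provider.name}: margin after price cut = "
          f"{provider.gross_margin(cut_price):.1%}")
```

With these invented inputs, the in-house provider stays profitable after the price cut while the entrant's margin goes negative, which is the structural disadvantage described above. Swap in your own cost assumptions to see how sensitive the gap is.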
But before we cry wolf, there is a company that is pretty well aware of this and is now the largest post-Intel semiconductor company around: Nvidia. Their acquisition of ARM is really important, and while it was expensive, ARM is going to inevitably be embedded into every single roadmap I mentioned above. In fact, the majority of the custom products are ARM-based, and Nvidia knows this. Nvidia is positioning itself as a large and independent silicon platform in the AI age, like the Intel of yesteryear. Nvidia will now be a relevant company no matter what happens with the tech platforms pushing forward. And now Nvidia is even going up against their stack by offering software. This is a conversation for another day, however.
The push and tug will be violent, but clearly, the ball is in the large software companies’ court. They are right now the leading edge of all innovation on the internet, and now many hyperscalers will be some of the first in line at leading-edge fabs. The void left behind in Intel’s wake is massive, and everyone realizes that they can benefit from their “death” by bringing their own silicon and creating an end to end platform that cannot be replicated.
Software ate the world and hardware has been struggling to keep up recently. Now the largest software companies are slowly becoming hardware companies. They are pursuing an integrated strategy that can only be achieved at the largest scale possible, with barriers to entry that are quickly expanding on top of their well-known network or aggregation effects. The walls are slowly rising, the moats slowly widening, and as we are on the cusp of a new hardware renaissance, the decisions the hyperscalers make now are going to cast a long-lasting competitive shadow. Stay tuned.
Let's say that, as a software engineer, I have the next 10 years to invest my free time in building a "moat" for my own career by "going down the stack" and integrating specialized hardware with my software. Where should I start my learning journey, and what skills do you think will become essential?
I'm tempted to go to Apple, Google, FB, and AWS's career pages and find such info, but I'm really curious what you have to say!
Thank you!
Hey man, I'm curious if you plan to provide an update on how these compute platforms have evolved over the past three years. Very curious to better understand Google's Tensorflow platform given the drama since ChatGPT.
What's your preferred way to track each platform and their capabilities? I'd assume that some of the in-house silicon will be used for internal workloads vs cloud compute for clients (e.g. the video chip from Google).