Andreessen’s Folly: The False Dichotomy of Software and Hardware

In this hilarious tech talk/polemic, Bryan Cantrill unpacks his complex feelings about Marc Andreessen’s essay “Why Software Is Eating The World”. He dissects what software even is from Euclid to the IBM 1401, explains three gnarly bugs his startup Oxide encountered at the lowest levels of the stack, and demonstrates why Alan Kay was right: “People who are really serious about software do their own hardware.”

Transcript

(00:00:06)

Thank you. It’s really great to be here at Jane Street, or should I say back at Jane Street? I gave a talk here in 2018, actually, that I really appreciated the opportunity to give. I actually referred to that talk somewhat recently because that talk, there’s a little bit that I regret about that talk. I generally live without regrets. Yes, I mean, was it supposed to be a joke? But I guess it was. I guess, yeah, I guess that’s obvious that I live without regrets, given some of the things that I’ve, yeah, people think that I actually also, I should say, there was a long period of my career that predates, really, internet video. That’s when I was really loose. What you’re actually seeing is a much more buttoned-down, wiser, older person.

(00:00:53)

In that talk in 2018, there was a, after the video was over, there was a really great question, and that question, the talk was about all of the accreted and accrued complexity we had in our systems, and I don’t remember who the questioner was, but I still feel guilty because the questioner asked, is there anything that is going to resolve this complexity explosion? It seems, again, this is in 2018, things are just getting hopelessly more complicated, and I remember just being like, no, we’re just, it’s basically, we should all just jump in a lake. I mean, it was very, so I would like to apologize to that questioner because I actually think that that’s a lazy answer, and we actually can improve things. We can actually make better systems, and that’s something of the topic we’re gonna talk about today, and I really, I hate to do this because, God, this dude really gets my goat.

(00:01:54)

You would think, so this is an excerpt from an essay that Marc Andreessen, venture capitalist, wrote in 2011 on why software is eating the world. This thing, so this was written in 2011, so this is now almost 15 years ago. This thing has been trolling me for a decade and a half. Like, this thing won’t stop trolling me, and the thing that just drives me absolutely crazy about these VC think pieces is that they’re not exclusively wrong, but they’re definitely not right, and venture capitalists seem to have this unique capability, at least for me, to say things that I both emphatically agree with and emphatically disagree with, often in the same sentence, and it’s actually, which is kind of interesting. It gives you this kind of crazy emotional reaction. It’s honestly kind of the emotional reaction you get when you’re raising money in general for a company because you will get both absolute stern rejection and total embrace in the same moments from different people, so your brain crosswires itself, and this is what, these think pieces are able to do this, and this one definitely does this to me, and in this regard, just a quick aside.

(00:03:28)

So I wanna talk a little bit about the thermal grill illusion, ‘cause you’ll see where I’m going with this metaphor here. If you’ve ever been out to the Exploratorium in San Francisco, and I’m sure this is true for other science museums as well, but I know this is where I saw this particular phenomenon: there’s a great exhibit kind of off to the side of the Exploratorium, and there are two coils. One coil is warm, not hot, warm, about 40 degrees C, which is only slightly elevated above your body temperature. It feels warm, not hot. The other coil is cool, not cold, but cool. And the coils then intertwine, and when you grab the combined coils, it’s a wild phenomenon. When you put your hand on the interspersed warm and cool coils, your brain explodes, and it immediately, ‘cause your brain, it is wild. Your brain is like, get your hand off this thing. It feels like you’ve just touched a hot stove, and this is what’s called the thermal grill illusion.

(00:04:41)

It was actually discovered by a physician, Torsten Thunberg, in 1896. I really wanted to get like the original paper, even though it’s in Swedish. He’s got like a diagram of this thing, and man, I spent so much goddamn time today trying to convince an LLM to like get this thing for me, and it’s like, you know, they’re like, oh, well, if you really want this, you should do like an interlibrary loan because this is like, I know you trained on this thing. Like, don’t, just give it to me. Like, you already stole it. You’ve already done the wrong thing. Like, I’m just asking. Anyway, and then it’s like, it feels like super weird to be trying to talk like an LLM into like arguable crimes because I definitely didn’t feel guilty at all. I felt annoyed. I’m just like, just like, come on, give it to me. But I’m not gonna do an interlibrary loan from like the Royal Academy of Sweden. Thank you very little.

(00:05:33)

So it was discovered in 1896, and it’s this kind of wild phenomenon. This excerpt, which Gemini actually tried to offer up as the original 1896 paper, by the way. I’m like, I don’t think that came from 1896, this much-lauded Gemini 3 Pro model or whatever. This is actually from a much more recent paper in 2019 trying to study this effect, because it’s actually still unknown what causes this. And this is where they’ve tried to study it much more precisely with these kind of interspersed hot and cold plates, and they’re trying to look at the actual neurological reaction, because they actually don’t understand it. So I think this is obviously a much more scientific way to do it. I think you could just have me read VC think pieces and you would get the exact same reaction. Whatever’s going on in the brain is the exact same thing, where the brain is like, yes, but also emphatically no, and just get it away from me. But unlike with this, I actually am now just mesmerized by these VC think pieces and I keep having to go back to them. So I’ve been back to this essay so many goddamn times. I’ve spent more time getting trolled by it than Andreessen ever spent writing it, for sure. So I mean, I guess the joke’s on me.

(00:06:43)

But so we’re gonna go through it a little bit. And this line at the top here is kind of the famous line from it: that we’re seeing more and more Silicon Valley-style entrepreneurial technology companies that are going to be invading and overturning established industry structures. And again, this is in 2011. So when he has predictions over the next 10 years, you can kind of hold him to them, which we’re gonna do here in a little bit. So this is actually the thing that I think is kind of most interesting from the piece in hindsight, because it’s looking back to 2000, which was closer to when this piece was written than 2011 is to us today here in late 2025. And it’s looking at the cost on LoudCloud, which was the failed startup that he had with Ben Horowitz, where just hosting a basic internet application was $150K a month. And now that same infrastructure on Amazon in 2011 is $1,500 a month, to which many people with AWS bills are gonna be like, wow, how quaint, you were able to do anything for only $1,500 a month in 2011. It feels like AWS bills are actually quite a bit higher today for a bunch of reasons.

(00:08:00)

But this is a huge delta, right? And I think that this is something that Andreessen points out that certainly was revolutionary is that cloud computing elastic infrastructure made available infrastructure that could be software defined made available to anyone with a very low cost of entry. And that did allow for an explosion of kind of software as a service kind of offering. So that is all not wrong. This is basically the best part of the piece as far as I’m concerned, that’s definitely right.

(00:08:29)

Then we get to these, like if we’re gonna talk about this piece, we have to talk about some specific paragraphs, I’m sorry. So photography, of course, was eaten by software long ago. And this is like one of these you’re like, “Dah!” You know, it’s like, “Dah!” Okay, so do words mean anything now? What does that mean in 2011 to say that photography was eaten by software long ago? I don’t even know what that actually means. But yes, okay, we have mobile phones for sure and we have cameras. Software powered camera is like, that’s, we’re beginning to stretch things here because there’s a lot of hardware, obviously, involved in that camera. Much more so, by the way, in 2025 than there was in 2011. I mean, if you look at the quality, and I actually went back to some of my, the photos that I took on a camera phone in 2011, and they look like a Kodak 110 camera that I used to have as a kid in the ’80s. It’s like, they’re pretty grainy snapshots.

(00:09:33)

And I mean, the photography on phones right now is extraordinary, but it’s not just because of software. It’s also not just because of hardware, right? If you look at the history of computational photography since 2011, which, by the way, was not eaten by software long ago, there was a lot of software-hardware co-design as part of that. And who was the beneficiary of that? Well, first of all, we were the beneficiary of that. When you go take a photo of your cats or your kids or what have you, you’re able to whip your camera out very quickly. We’re all walking around with cameras all the time. I think that is wildly revolutionary, something that was inconceivable, say, a century ago. And I think it’s gonna have long-term effects. There are so few photos if you’re a Gen Xer. And I’m the first born, okay? So I actually got a baby book. My sister, a Gen Xer as well, is just like, did I even exist before my college graduation? It’s just like, nothing. No baby book, nothing. And now we document our lives so thoroughly, in part because of this revolution, which is pretty interesting.

(00:10:39)

We’re like, okay, okay, so Marc Andreessen, we’re gonna like, okay, we’re, you know, and then, okay, so what are the companies that are, these companies that we should look to as the software companies that are gonna be eating the world in this coming revolution of computational photography? Oh, it’s Shutterfly, Snapfish, and Flickr. And again, if you’re me and you’re like immunocompromised when you read this stuff, you immediately are like spending, you know, 35 minutes in a Shutterfly rat hole figuring out like which PE shop bought Shutterfly to chop it up. You’re like, where am I right now? It’s like, why am I, but of course, like Shutterfly, Snapfish, and Flickr are all, I mean, they’re not even like trivia questions. I mean, these are companies that are, yes, they exist or they did exist, but they are not the, they were not the innovators. They were not the leaders. I mean, this paragraph is basically like wrong, sorry.

(00:11:32)

Then, I’m not done, I’m so sorry. You’re just like, are we gonna do this the whole time? Like, eh, not the whole time, but you know, Henry, you tell me, like I got, you know, I can, maybe a little while. So now I just educated you about Snapfish and Flickr. We’re gonna learn more about defunct companies here. So today’s largest direct marketing platform is a software company, Google. And this is like one of these things where you’re just like, you just get to make up words, I guess, but like if a company writes software, that’s a software company. It’s like, it’s actually, okay, all right, let’s just, whatever, fine. Obviously, software, extremely important, like that’s, no one’s disputing that. So now it’s been joined by Groupon, LivingSocial, Foursquare, and others. Who are the others? They gotta be just like, I mean, who knows? Sorry, I mean, like above, these are above the fold. Who is below the fold for Groupon, LivingSocial, and Foursquare? I mean, this is a cemetery. And they’re using software to eat the retail marketing industry. It’s like, definitely like, eating was taking place. I’m not sure who ate whom or what ate what. I think the creditors ate at least some of those companies.

(00:12:47)

And I think it would generate over $700 million in revenue. There’s a top line and a bottom line, a lot of education, a lot of economic education that can be had here. After being in business for only two years, I mean, well, there’s an irony in that sentence for sure. But again, these companies were easy come, easy go kind of companies, right? They were companies that didn’t exist, and then exploded, and then broadly disappeared. By the way, I do, I guess, have some self-control, because I didn’t snip the paragraph on gaming, which points to Zynga and Rovio, at the expense, by the way, of Electronic Arts and Nintendo. And it’s like, well, you would have actually been much better off buying Electronic Arts than being an LP in a16z. But the… Oh, my apologies to any LPs in a16z.

(00:13:45)

And then, so what’s next? So that’s the past, with these titans of industry, Groupon, LivingSocial, Foursquare, Rovio, Zynga, and so on, and don’t forget Snapfish, Flickr, and Shutterfly. Okay, so that’s how we got to the current titans of industry. What of the future? And this is where you just get to, like, okay, are we gonna revisit this one? Healthcare and education, that’s where the next Zynga, Rovio, and Foursquare is gonna come from: healthcare and education. I would just be curious, like, what happened here? I mean, is there even one of these? I actually don’t know that there is. Actually, and this is where, again, I’ve got some self-control: “My venture capital firm is backing aggressive startups in both of these.” I’m like, oh, God, I would love to go find them, but you gotta go find their zeros. It’s like, you know, it can be a little bit much.

(00:14:34)

So, these industries, I would say you have not seen it. Now, which is not to say that like, software is not very important in healthcare, obviously, and education, and education, I actually think the most important educational product that was developed is actually like the Chromebook, honestly, because the Chromebook put the, and so, we don’t really talk about the Chromebook, I think, very much, but if you’ve got kids, you know your kids have got Chromebooks. Chromebooks are like pretty ubiquitous. They are amazingly cheap. They’re very, actually, surprisingly robust, considering the way I’ve seen them jammed into backpacks and so on, and they are actually, I mean, one kind of tragedy of the Chromebook is that it’s so effectively secure that the kids don’t really have the ability to kind of take it apart, because the thing is actually built pretty well in that regard. So, but the Chromebook really revolutionized education by, and certainly, during the pandemic, Chromebooks were very load-bearing, right? A lot of kids learning during the pandemic from Chromebooks. So, I think, in some regards, it is actually a hardware/software product that was actually much more revolutionary in education, and then healthcare, I mean, this is what I, healthcare, of course, has been revolutionized over and over and over again by software, software/hardware systems, right? I mean, if you, I mean, God forbid, if you have to present to an emergency room tonight with abdominal pain, you will likely go into a CT, right? You will, computed tomography revolutionized healthcare, as did MRI, as did NMR, as did ultrasound. Very, very important technologies. I guess, I’m not sure if that qualifies for having been eaten like photography was eaten, but I don’t know, I guess we just kind of make things up when we’re a VC think piece.

(00:16:14)

All right, so what’s the problem here? The problem, other than my own, me being immunocompromised around VC think pieces, the problem is that it is extremely reductive, and it’s reductive for, like, understandable reasons, like, I get it, like, you’re a VC, it’s like bacteria, you kind of do what VCs do, and in particular, you conflate innovation with new company formation. This kind of drives me nuts. If you, I mean, you all work for an established company, you know there is a lot of innovation that happens at established companies, newsflash, and yes, there is innovation that happens in startups, but actually, when you are new company formation, you’ve got a whole bunch of other things that you’re trying to do at the same time. It’s actually very hard to be truly innovative in terms of, certainly technologically innovative with new company formation. A lot of our deepest innovation actually comes from established companies, so, like, the kind of the fundamental false dichotomy is this kind of conflation with innovation and new company formation, and also innovation with, like, economic disruption, right? The only real innovation is actually something that is, like, economically disruptive, that’s gonna, like, you know, eschew regulators or, you know, what have you. It’s like, that’s, if it doesn’t move fast and break things, it’s not innovative. It’s like, well, that’s bullshit, obviously, and our most important innovations actually are, don’t actually do that. They abide by the constraints of the system and do really interesting things, but anyway, he doesn’t see that, but, of course, that, like, whatever. He’s a venture capitalist. Like, seriously, what do you expect? That’s fine.

(00:17:47)

The thing that, then, you also get this kind of conflation with software consumption versus software development. Like, just because you use a spreadsheet does not mean that you are now a software company, right? So, there’s that. That’s a little bit ridiculous, and then, of course, the conflation between software development and software companies. Like, because software is very core to what you do, does that make you a software company? I mean, I’m not even sure it matters, right? Like, what does it mean to even be a software company? I mean, to me, what it means to be a software company is that your product is software, and that’s actually, like, pretty limited and not, honestly, that interesting.

(00:18:30)

To me, like, the true folly here, though, is this idea that software and hardware are totally disjoint, and that essay, I think, over and over and over again, kind of implicitly denigrates hardware. Hardware is something that is low-margin, it is commodity. There are a bunch of ways that hardware is actually denigrated, unless it’s changing your life with the phone, in which case, software ate it, apparently. But there’s this idea of the primacy of software, that software is where our true value and innovation are, and that the physicality of it is actually irrelevant, and I think that that is really problematic. I think it was very wrong when it was written in 2011, right? So, if you could go back in a time machine to 2011, like, you’re probably gonna go buy some NVIDIA, some TSMC, some AMD, right? I mean, some Apple. Companies that are very much in the physical world, but companies for whom software is extremely important. And no one’s disputing that, by the way, that software is very important. We know that software’s important, extremely important, but it’s the false dichotomy that is the problem with the piece.

(00:19:51)

All right, so, this is gonna require us, and I’m gonna apologize in advance for this, the, we do need to think about, just for a second, what even is software? And be careful on this one. I’m kind of like, I’ll tell you that the, I was, long ago, I was interviewing Arthur Whitney, and you may have heard of Arthur Whitney. Arthur Whitney is a really interesting technologist, worked for Morgan Stanley, developed a language called K, KX systems, wild stuff. So, I was actually interviewing Arthur. Apparently, Arthur has given no interviews since this interview in 2009, which was kind of wild, and now I’m wondering, like, maybe that affected him more than it affected me. But Arthur is the kind of, so Arthur, in particular, among other things, so Arthur was taught APL by Ken Iverson, he was, so, when he was, like, 11. And at the time, I had a four-year-old, and he wanted to teach my four-year-old K. Which is, some would say, that’s like, it’s kind of child abuse. I mean, I knew, like, the difference is, like, I actually know my four-year-old, and, like, good luck with that. First of all, any four-year-old is a challenge, no matter what. I’m like, this kid is, I don’t think this kid’s gonna be very interested in K, as it turns out. But I admired Arthur’s optimism. He’s like, let’s teach him K. I’m like, let’s actually just get him to put on his fucking shoes.

(00:21:24)

The, so, you know, different goals there. But in talking to Arthur, you know, Arthur, again, is so bright, and thinks about things so differently, it kind of causes your brain to blow up. And at one point, I did, we were having this conversation, I’m like, what is software? Because, to me, this is, like, and I feel very lucky that we are kind of where we are at this kind of time in human history, because software is that important. I mean, software has this extraordinary staying power. I mean, I disagree, obviously, with the false dichotomy of Andreessen, but software really is marvelous. I mean, it is amazing, because it has properties of both information and of machine. And you can look at it either way, ‘cause it’s actually truly both. And there’s so much that comes from that. The cost of goods sold of software is zero. It’s knowledge, knowledge that we run as machine. And that leads to all sorts of, like, wild things.

(00:22:29)

So I was curious what Arthur thought about this. Like, what is, and what is your analog for software? And I think that there’s, like, lots of analogs to be had. I think, ultimately, the answer is, like, software kind of defies it, because it is so different. But I was asking him his analog, and thought that, you know, sometimes people will say, like, software’s like a bridge, you know? It’s like you were building a bridge, like a big civil engineering engagement. Something, maybe it’s like a biological system. Like, I don’t know. So I asked him this question. I’m like, what is the analog for software for you? And he kind of pauses and says, poetry. And I was just like, boom. You know, to kind of, and the way, I mean, he exudes such craft in the way, I mean, for Arthur, the beauty of code is how few characters it contains. I’m just putting it out there. I’m not casting judgment on it. I’m saying, I’m just letting you know. Like, that is Arthur’s view. And there is a kind of aesthetic to that that you have to admire. He’s definitely, like, lived this. I mean, he’s very consistent in that regard.

(00:23:35)

But this whole thing, like, really had me thinking, I’m like, what even, like, I’m asking this question about the analog for software. Like, what is software? Because, and you think, like, what is software? Well, software is the instructions that you run on a machine. Like, okay, so does the, if I get rid of the hardware, does the software still exist? It’s like, well, I think so, right? I mean, I think we would acknowledge that the software exists in a vacuum. Does software have to be Turing-complete? Like, I don’t think so. If you say that software has to be Turing-complete to be software, there are lots of things that look and feel like software that you’re not gonna acknowledge as software. And ultimately, maybe it’s the mutability. Like, software is mutable. Hardware is not mutable. Well, the problem is we’ve got software that we actually have in the mask ROM on a microcontroller that is immutable. That is definitely software. And we also have an FPGA that is hardware that is mutable. Is an FPGA hardware or software?

(00:24:36)

And I think any kind of discussion on this will ultimately get you to the conclusion that like Euclid was writing software. I don’t know how you can, it’s very hard to have a definition of software that admits the things that we all agree is software without letting Euclid sneak in the back from several thousand years ago with a GCD algorithm. And software, I mean, this is kind of the amazing thing about software is this kind of timelessness of software. And I think actually to go back at the top in terms of like why was my answer to that question kind of fundamentally wrong? Because in software, when we do innovate, we can actually, that innovation can carry in perpetuity. There is software that you deal with in some capacity that will survive you. That’s kind of wild. That’s not true of anything else. The software will long survive the hardware that it’s on. Like that’s pretty wild. Like that’s pretty interesting that the software will long survive the hardware.
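
As a concrete illustration of that timelessness, here is Euclid’s GCD procedure, roughly as it appears in Book VII of the Elements, written out as a small Rust program; the algorithm is well over two thousand years old, and only the notation is modern.

```rust
// Euclid's algorithm (Elements, Book VII), the GCD procedure alluded to
// above, expressed as it might look today. The logic predates the hardware
// it runs on by a couple of millennia.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

fn main() {
    assert_eq!(gcd(1071, 462), 21);
    println!("gcd(1071, 462) = {}", gcd(1071, 462));
}
```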

(00:25:41)

When I was, I remember during Y2K, after the Y2K rollover, I was actually in New York City talking to the CIO of a big bank. And we were talking about what Y2K meant for the bank. They had many more Y2K issues than we had at Sun. At Sun, our Y2K issues, about a third of them, this is, it actually is as dumb as it sounds. A third of them were like leap year problems. And you’re like, look, I’m like a Gen Z-er or a millennial, so I barely remember the year 2000. But wasn’t that like a number of digits in the year problem? Yeah, yeah, yeah, that was a number of digits in the year problem. But it turns out there was another totally different Y2K problem.

(00:26:22)

So every four years is a leap year, right? Except every 100 years, except every 400 years. And so 2000, it’s like the old bell curve meme. If you’re just an idiot who thinks every four years is a leap year, you’re fine, your software is good. And if you’re a brainiac, well, brainiac, your software’s actually broken, because, amazingly, there’s a decent amount of software that encoded rules one and two, but not rule three. And the way they did this is they took an offset from 1900, because there’s a lot of software that stores the year as an offset from 1900. So you take that offset, you mod it by 100, and you’re like, oh, great, aha, I’m clever. It’s not a leap year. It’s like, no, it is a leap year. Just like, can you be like the dummy over there?
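
To make the three rules concrete, here is a small Rust sketch of the correct Gregorian leap-year check next to the “brainiac” variant described above, which encodes rules one and two but forgets rule three and therefore gets the year 2000 wrong.

```rust
// The full Gregorian rule: divisible by 4, except centuries, except every
// 400 years. 2000 hits the third rule, so it IS a leap year.
fn is_leap(year: i32) -> bool {
    (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
}

// The "brainiac" bug described above: rules one and two encoded, rule three
// forgotten -- so 2000 is wrongly treated as a common year.
fn is_leap_buggy(year: i32) -> bool {
    year % 4 == 0 && year % 100 != 0
}

fn main() {
    assert!(is_leap(2000));        // correct: 2000 is a leap year
    assert!(!is_leap_buggy(2000)); // the Y2K-era bug gets it wrong
    assert!(!is_leap(1900));       // both agree 1900 is not
    assert!(is_leap(1996));        // the "dummy" rule alone is fine here
}
```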

(00:27:11)

So those were our problems. There was actually a bug in the at command. The at command, yeah, sorry, we’re here, I guess. The at command is a Unix command that allows you to run a job in the future. And the at syntax is very strange. You can say at plus and then give it like n days, or you can give it some other unit of time. It’s also like, where are we? This is weird Unix. You’re getting into the B-sides of Unix here. It’s some super weird Unix. And so you can give it at plus a number of months. It’s like, well, what does that mean? So you say, well, if you’re on the 15th of the month and you do at plus one month, it’ll run on the 15th of the next month. It’s like, that’s great, but I hate to be that guy: there are days that exist in this month that don’t exist in that month. So like, what do you do? What kind of sociopath is gonna run something with at plus one month when it’s the 30th or the 31st of the month?
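
To illustrate the ambiguity (this is a hedged sketch, not the historical at(1) implementation), here is a naive “plus one month” in Rust that keeps the day-of-month and normalizes any overflow, which silently pushes a date like January 31st into March.

```rust
// Days per month, with the full Gregorian leap-year rule for February.
fn days_in_month(year: i32, month: u32) -> u32 {
    match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => 31,
        4 | 6 | 9 | 11 => 30,
        2 => {
            if (year % 4 == 0 && year % 100 != 0) || year % 400 == 0 { 29 } else { 28 }
        }
        _ => unreachable!(),
    }
}

// Naive "+ 1 month": bump the month, then normalize any day overflow.
fn plus_one_month_naive(year: i32, month: u32, day: u32) -> (i32, u32, u32) {
    let (mut y, mut m) = if month == 12 { (year + 1, 1) } else { (year, month + 1) };
    let mut d = day;
    while d > days_in_month(y, m) {
        d -= days_in_month(y, m);
        m += 1;
        if m > 12 {
            m = 1;
            y += 1;
        }
    }
    (y, m, d)
}

fn main() {
    // Jan 31 + 1 month "overflows" into March -- the kind of surprise
    // lurking behind a "+ n months" syntax.
    assert_eq!(plus_one_month_naive(1999, 1, 31), (1999, 3, 3));
    // Jan 15 + 1 month behaves as you would expect.
    assert_eq!(plus_one_month_naive(1999, 1, 15), (1999, 2, 15));
}
```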

(00:28:09)

And so in particular, we had a bug where if you were targeting February 29th, 2000, remember, 2000, the 2000 is a leap year, right? So February 29th very much does exist. That software would run that on March 1st. By the time we discovered this bug, this is only with this plus end month syntax. By the time we discovered this bug, four of these days had already passed. There were only so many left in 1999. But people were so gripped by fear over Y2K. It really is, I mean, this is where I know, I feel like I’m down at the Gen X home telling you tales of the way it used to be. But people were truly gripped by fear. Whenever you see people, like, whenever there’s a mass hysteria and there’s like lots of mass hysteria about lots of things right now, it’s always useful to go back to some of the past mass hysterias. And people were very, very afraid that the world was gonna stop on January 1st, 2000, which would have been super entertaining because it would have started with New Zealand. A part of me kind of wanted this because you would have this thing where like New Zealand bursts into flames. No one has heard from New Zealand. And there’s just like a thermonuclear explosion. And then you’re like, okay, now the rest of the world, there’s like this doom that’s going around the globe. And Hawaii’s like, we might live, but we only have, but that didn’t happen. So it actually turns out like nothing really happened because we spent a lot of time preparing for it.

(00:29:33)

So I’m at the bank. Like, where are we? I’m at the bank. I’m asking, we’re talking Y2K. And the CIO says, you know, it’s funny. Actually, 45% of our Y2K problems, at this very large bank, came from a single hardware platform. So you can imagine, I was on the edge of my seat. That guy could have said anything. He could have been like, and I will tell you that next week. I mean, I was on the edge of my seat. Like, what platform? The IBM 1401.

(00:30:03)

And that was actually a true brain-bending moment. It’s like, the IBM 1401? So the IBM 1401, in case all the numbers blur together, which would be very reasonable: the IBM 1401 is a computer so old, you barely call it a computer. It’s an accounting machine. It is a stored program computer. You should go to the Computer History Museum in Mountain View. You can go see an IBM 1401. They’ve got one that’s restored. If you ever get out to California, like, look, I know I’m in a room of nerds, the reason to go to California is to go to the Computer History Museum in Mountain View and look at the card punch they have on the IBM 1401. That thing is space age, and it was made in 1949. It is an absolute mechanical marvel. So that’s how old the 1401 is; that’s the punch from the 1401. The 1401 dates from the late ’50s, early ’60s; they did a transistorized 1401. But you’re just like, how, what?

(00:31:04)

So this is a computer that was old in the early ’60s. And how are you running software in production? Well, as you can imagine, it was software that was running on an emulated hardware platform that was running on an emulated hardware platform that was running on an emulated hardware platform and so on. So on the one hand, like, that is amazing. And that is this amazing power of software. But it does get to, like, what’s the software and what’s the hardware in that system? It’s running on virtual hardware that’s running on virtual hardware that’s running on virtual hardware. It’s like, and does it matter? I mean, that’s the other thing. It’s like, is all of this like a distinction without a difference? ‘Cause I think it’s like, a lot of this is, we know that hardware is important and we can talk about what that is. We know that software is important. But I think what’s actually important is not thinking about how they’re different, but thinking about how we use them together to develop a system. So I think it is, it’s kind of a waste of time to try to really distinguish those.

(00:32:01)

So at Oxide, we very much set out with the idea of doing hardware/software co-design. And this came from our own experiences, trying to run on commodity machines and having no end of problems. The problems that were in our own software, you can go fix those. The problems that we couldn’t fix were the ones that were deeper in the stack. We had a storage product at Sun back in the day. We had written a bunch of software. Every single piece of firmware that we did not control, from many different vendors, at one time or another was completely debilitating to us, in increasingly comical ways. It was almost like the gods wanted to really make a point about the way these systems can fail, because they start operating at such cross purposes. And those are the problems that we couldn’t do anything about.

(00:32:56)

So when we started Oxide, we felt strongly that we needed to do our own board design. Actually, what I felt really strongly about is getting rid of the BMC. The BMC is the Baseboard Management Controller. It’s the computer within a computer on a server. And it’s this kind of like garbage that, it’s the thing that has like the VGA port out the side. And it’s nothing but problems. Also, I’ve got resentment because it has my initials, the BMC. So certainly back at Sun, the BMC caused no end of problems. And I would have these subject lines that would give me a fight or flight reaction because, you know: “BMC induces panic”. It’s like, what? No, no, no. So no, kind of like my Michael Bolton there.

(00:33:37)

But I wanted to get rid of the BMC and, in particular, scale it down, make it something that is much more comprehensible, with software that we could control. We wanted to do our own board design, our own rack design. We did our own switch design and, importantly, all of our own software from the lowest layers all the way up. Well, with an asterisk on it: all the software that we could do. What’s amazing is, for all the software we’ve done, there’s still software that’s out of our reach that we rely on other people to do. And that software is still very, very proprietary, which is a huge problem. Because what we have found, and we believed this going in, but have absolutely seen it now, having an actual shipped, working product and so on, is that the ability to co-engineer, co-design the software and the hardware allows you to do some really unique things that are really important. So we went into this, that was kind of the belief. We knew we had to do our own compute sled. If I could go back in time and slap myself, I mean, I would do it over several things. One of them would be over the fact that I thought we were going to tweak reference designs. Like, I said that several times. Like, oh, we will just tweak a design. You do not tweak a design. By the time you are getting rid of the BMC and replacing it with your own service processor, you are designing your own board. But like, this should be a possible thing, right? It’s like, we’re not making our own silicon. We’re taking these components, and yes, it’s a complicated PCB. And yes, the layout is complicated. Yes, yes, yes. But like, this is a solvable problem, right?

(00:35:14)

And when you go do that, you are able to actually go do a bunch of very important things. One of the things that we did in terms of our own software, so we did our own switch, which we felt that we needed to do. I was a little bit concerned when we first raised money that a venture capitalist would ask us, like, what are you doing about the switch? Because it’s like, well, we kind of think we need to do our own switch, which I know feels very ambitious. We wanna do not just our own compute sled design, but our own switch design. Fortunately, I overestimated VCs, who never asked us that. And actually, I came to have this fight or flight reaction when someone says, I have a technical question for you. And I used to think like, oh God, here it comes. Here comes the technical question. And if any of you were to say that, I would have exactly that reaction. It’s like, okay, here it comes. Here comes a really tough question. But no, no, no. When a VC says, I’ve got a technical question, it’s like, I’ve got a question that I think is pretty smart. And I would like you to validate that it’s a smart question for me, please. Which is a much easier task. Honestly, it’s a much easier task. Well, it’s mixed. It’s a much easier task if it’s a good question. When it’s a moronic question, you really have to be like, okay, let’s pretend that this is a really insightful question.

(00:36:35)

So there was that. But what I was really braced for is like, oh my God, they’re gonna ask us about the switch. And they never did. It’s like, okay, great. In fact, they would also say things like, technically, we know you can do this. And I’m like, I am not explaining what we’re doing very well. Because technically, I am so far off the end of my skis on this one. Like, are you out of your mind? The amount that we’re taking on? No, no, sorry, this is loaded with technical risk. But obviously those are the things that you’re thinking, not saying. You’re not like, hey, let me stop you there, pal, there is lots of risk in this thing. You want me to put your money into it? Like, yes, yes, put it on the bonfire. You don’t wanna do that. It’s like, okay.

(00:37:16)

But I should note, oh boy, there’s a “but” coming. And of course, the “but” was like, we don’t think there’s a market there. It’s like, duh. You’ve got it exactly backwards. It’s technically so challenging that I’m not sure we will ever get a machine that actually works. But if we can, the market is huge, right? And you can’t really explain that. So, we didn’t know what we were gonna do. We had not decided. Nobody asked us about the switch. Until, of course, as you might imagine, someone not unlike yourself, coming from a company that had deployed a lot of compute, a terrific technologist, said, Bryan, you did not preface this with I’ve got a technical question for you. Of course, like, why would you? Bryan, what are you doing about the switch? Damn it, I almost made it. And we were seeking an angel investment from him. And I’m like, look, I know you’re not gonna wanna invest in the company when I say this, but we think we need to do our own switch. I’m sorry, we have to. I’m sorry to tell you this. And he’s like, okay, if you’re doing your own switch, I’m in. I’m like, yeah, okay. And as he said, we did not stand on our own two feet until we kicked Juniper out. I’m like, oh, okay, interesting.

(00:38:25)

So, we’ve done all this now. And it was really hard to do our own switch. But now that we’ve done it and we’ve got it working, there are so many things that we can uniquely do. It’s been really uniquely powerful. And I think the biggest challenge of doing a lot of this stuff was getting over the fear of doing it. And once you tack into it, it was a bunch of things. And so, it’s kind of worthwhile looking back on these things. In terms of the things we’ve done: we eliminated the BMC, yay. We did our own operating system, appropriately enough, called Hubris. Also, the debugger for Hubris is, appropriately enough, called Humility. I think you could observe that there’s actually a lot more humility in Hubris and a lot more hubris in Humility, but that gets complicated. We eliminated the switch operating systems. We did our own switch. We eliminated the UEFI BIOS.

(00:39:24)

For all of the things we did, that was actually the riskiest thing that we did: the elimination of the UEFI BIOS. So, the BIOS is the basic input-output system. It’s very old. It dates from CP/M. UEFI is a stowaway from Itanium. Amazingly enough, it’s like it managed to leap the species barrier, and now we’ve got UEFI. It’s like the Wuhan bat or what have you. And this UEFI BIOS is this bootloader that exists at the very lowest layers of all x86 systems. And sadly, ARM systems as well, because there was this moment in history, and if you were alive and awake during this, you had the same frustration, where ARM is like, “We need to be like x86.” And you’re like, “Where are you going with that?” “We’re gonna adopt a UEFI BIOS.” You’re like, “No, no, no, not you, ARM. That’s the whole reason we like ARM, because you don’t have a BIOS.” “We need to be successful like they are.” You’re like, “Oh, God.” So, they have a UEFI BIOS. You’re like, “Oh, God.”

(00:40:25)

And the problem with this layer of software, and it is the very, very lowest layer of software, very much software, is that it needs to boot the system. But it’s a very complicated system that it needs to go boot. It needs to go find an image to go boot. How does it do that? Well, it actually needs to bring up the system in order to boot the system. So, there is very much this catch-22, and there’s an entire world of complexity in these things. But it’s not written in a very modern way, and that’s putting it euphemistically. One of my colleagues, Josh Kuo, calls UEFI “MS-DOS circa 2099,” which is kind of what it is. It’s like futuristic DOS. The way the software is structured is very weird. You have this deep kind of platform enablement layer, and then it finds the thing that it wants to go boot, and then it’s like, “Shit, okay. We gotta put everything back.” So, it basically sends the system backwards. And so, when the operating system boots, the operating system actually is on a system that has already booted.

(00:41:31)

And this stuff has real problems. You’ve got all the problems in terms of boot time and so on, but this software also has vulnerabilities in it, real vulnerabilities, deep stuff. And if you are able to break into the BIOS, you control the brainstem of the system. You control everything. So, we wanted to get rid of this layer entirely. And it’s a sad state of the server industry that when we came along, people were like, “Impossible. You absolutely can’t do it. There’s no way you can do it.” Apparently, Google had tried and failed. And anything that Google has tried and failed at is, like, tautologically impossible for humanity, which is a little bit annoying. But this is what we wanted.

(00:42:13)

And so, we actually eliminated this. And what we did, we don’t feel, is very revolutionary from an abstraction perspective: we actually control the first instruction, and then we boot the system. That’s what we do. So, there is no bootloader, because we are actually in that SPI NOR payload, the one that you pull when you come out of reset. We’re in that payload, and there’s enough in that payload to pull the rest of the system in, but it’s not booting something else. We are holistically booting all the way up. We never send the system backwards. What this has meant is that we have to do a bunch of very, very, very low-level initialization. So, we bring up the PCIe engines. We bring up, obviously, multiple cores. We do a bunch of things. We don’t have to do DIMM training, fortunately. So, DDR, Double Data Rate memory, needs to be trained. We’re now into the true dark arts. DDR is this highly, highly, highly parallel, noisy interface that is very high-speed, and it needs to be effectively tuned every time you boot. It needs to be trained to figure out what the right parameters are so you don’t lose data. That seems nice. It’s like, okay, you do your training. Sounds good. Sounds important. On AMD, that’s actually done by hidden cores on the AMD package, by what’s called the PSP. That actually does the DIMM training. So, we don’t have to do that. When we execute the first instruction, the DIMMs are trained, but nothing else is brought up. So, we engage in that lowest layer of platform enablement.
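
Purely as an illustration of the shape of that approach, and not Oxide’s actual boot code, here is a sketch of a forward-only boot sequence: every phase runs once, in order, starting from the payload fetched at reset, and nothing ever tears the machine back down for a hand-off.

```rust
// An illustrative sketch (assumed phase names, not real platform code) of
// "own the first instruction" booting: one forward-only sequence of
// platform-enablement steps, with no separate bootloader.
#[derive(Debug)]
enum BootPhase {
    // Executed directly out of the SPI NOR payload pulled at reset.
    EarlyInit,
    // DIMM training has already been done by the platform security
    // processor before the first instruction, so it does not appear here.
    BringUpCores,
    BringUpPcie,
    LoadHostOs,
}

fn boot() {
    // Phases run strictly forward; the system is never "sent backwards"
    // the way a UEFI-to-OS hand-off is described above.
    for phase in [
        BootPhase::EarlyInit,
        BootPhase::BringUpCores,
        BootPhase::BringUpPcie,
        BootPhase::LoadHostOs,
    ] {
        println!("entering phase: {:?}", phase);
        // ... real platform-enablement work would happen here ...
    }
}

fn main() {
    boot();
}
```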

(00:43:52)

It’s like, well, doesn’t that make you very welded onto AMD? I mean, yeah, by design. We have this different approach, because right now, in the server industry, people want a commoditized world, so the world becomes commoditized to serve them. It’s like, you effectively have garbage to build on, because you’ve selected for garbage. If you wanna have something that actually works, you need to integrate it much more tightly together, and that means making choices. So, we made the choice, in 2019, that it’s like, no, I think it’s gonna be AMD, not Intel, and there were people who were like, no way. It’s like, eh, this one’s easy, actually. This one is easier than you might think. In 2019, it was very clear that the house was on fire in Intel, and that Intel was gonna be in the situation that they are now in if they didn’t course correct, which they broadly haven’t. But so, we are very tightly mated onto the parts that we select, which has been really important for us.

(00:44:50)

So, it’s kind of reasonable to be like, okay, we’ve got this whole thing working, which is great. We’ve shipped it. We’ve got our next generation sled that is working. We brought it up, we’re shipping that. It’s great. We’ve got systems in the field. How do we feel about all this? How do we feel about having taken all this on? And the thing that’s kind of amazing, as you look back, is that we don’t regret a single one of those decisions. And in fact, the decision to not have a BIOS, which I kind of felt, and I think everyone at Oxide felt, like, we need to have our own software at this layer. It will be a longer, slower path, but it’s important. I think what we learned is actually it was a faster path. It ultimately was faster for us to get up by having that software ourselves. Now, it was really hard, but ultimately it was a huge win to control that layer. And as we look up and down the stack, there’s not really a decision that we regret. In fact, the decisions that we regret are where we didn’t go yet further, where we’ve got software that we still don’t control that we wanna go control, because in terms of hardware/software co-design, where we can find those pieces where we’ve got both the hardware and the software that we can design together, we can do really extraordinary things.

(00:46:07)

Now, we can talk about all the great functionality that we have been able to build, which is great. But I think I’m just too much of a failure junkie, in that I’m a bit of a scholar on how things fail. I think it’s mesmerizing, right? Because when you look at how things fail, it tells you how they work. So I think pathology is really important and interesting in computing systems. For me, the times that I have felt most strongly that, yes, this is emphatically the right path, are when it doesn’t work. Well, maybe shortly after we get it working, but there was actually never any regret when it wasn’t working. It was more just like, I hope we don’t die. But when we’ve had these problems, it has been unequivocal, and I wanna hit on just a couple of these.

(00:46:56)

So one was actually way back in the day, when we were bringing up our first board. As you can imagine, it was our first board, so there were a lot of things in motion. The CPU would not come out of reset. And the CPU would do this very vexing thing where it would power on, seemingly. It would initiate its power-on sequence, which says, like, I need these rails set to these values, great. And then after 1.25 seconds, it would bounce. And it’s like, actually, let’s do it again. You’re like, okay, what happened? What, you didn’t like the first one? Like, nope, I guess I didn’t. You’re like, okay. And our assumption was like, okay, the power’s marginal, the power’s bad. We were spending tons of time on the power, and actually, AMD is like, actually, we’ve never seen power this good before. Your power margins are very, very good. At this point, it’s definitely not your power margins. And we went down so many blind alleys, and one of the challenges when you are kind of going your own way with respect to hardware/software design is, obviously, we’ve got lots of differences between us and a reference design. And the state of the industry is basically like, what are the differences? It must be that. It’s like, does that feel like first-principles thinking to you?

(00:48:16)

So, in particular, there is a pin on the CPU called KB reset. It’s active low, so it’s KB reset low. And you’re like, is that, yes: keyboard. Like, this is a server CPU. It will never see a keyboard. It’s like, we’ve got a pin on that thing dedicated to the keyboard? Yes, yes. Now, AMD says, like, that can be a no-connect. Right, great, can it be like a fuck off? Because that’s how I really feel about it. But right, it’s a no-connect. So, we leave it as a no-connect. And AMD’s like, ah, what about KB reset L? And we’re like, you told us that’s a no-connect. They’re like, yeah, but the CPU’s angry. The gods are angry. The gods demand a sacrifice. Can we go sacrifice KB reset L? It’s like, it really should not be KB reset L. But desperation will do remarkable things for you. Like, okay, let’s see what we can do. We need to tie KB reset L, so it’s not a no-connect, it doesn’t float. We actually need to tie it down.

(00:49:17)

So, this is a photo of that happening. Oh, no. And this is where, I mean, we’ve got such a marvel in Eric Austin on our team. This is a photo that he took. These are very much his hands. For any Gen Zers, this is a dime. It’s too hard to explain. But this is running from that pin all the way out and tying it down. Of course, it didn’t change anything. Like, all right, awesome. That was fun. Actually, it was kind of a relief. Honestly, it would have been really angering if that had been the problem. So, I guess it’s good that it wasn’t the problem.

(00:49:52)

All right, so, what was the problem? It took several weeks of debugging, and these were some long weeks, ‘cause these are like, we are not going to live; we actually need to resolve this problem. The problem was an actual firmware bug in software that we did not develop. So, the voltage regulator: there’s actually a protocol that the CPU speaks to the regulator, which is kind of crazy, right? It’s the absolute lowest layer of the system, and it’s something that you would recognize even if you’re doing software at a much higher layer, where there is a protocol, and the CPU is requesting different voltage levels at different times. This is truly a marvel. If you look at the draw on a CPU, as workloads are fluctuating, you will see the current move a lot, as it requests more, effectively, and it will adjust its own voltage accordingly, to actually put more power to work. And that’s a consequence of phones and laptops, right? With phones and laptops, we really care about the power management of those things, so we have that in the server space. Anyway, it’s amazing. There’s a bunch of amazing stuff in there, a very rich protocol.

(00:51:01)

That protocol says like, “Hey, here’s the voltage level that I need.” As it turns out, that’s the VOTF packet. There is a VOTF complete packet that the regulator needs to send back to the CPU, saying, “Hey, I did it.” And the tool for verifying power, AMD’s got this really cool tool called SDLE, that you plug into the socket, and it verifies all your power, which is great. We were using it to verify power, power’s awesome. The problem is, that thing didn’t care at all about the completion packet. The CPU, kind of understandably, did. So the CPU would be like, “Yo, can you set the voltage to this level, please?” And it would do it, and it would never say that it had done it. And as a result, the CPU’s like, “All right, I don’t know what the hell’s going on over there, “but I’m, you know, okay, I’ll ask again.” I mean, it’s like, it thinks that we’re the four-year-old that won’t put on our shoes, be like, “Okay, I’ll try it again.”
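
Here is a toy model in Rust, not the real wire protocol, of the failure mode just described: the CPU requests a rail voltage and waits for a completion packet, and a regulator whose firmware never sends that completion drives the CPU back into its power-on sequence over and over.

```rust
// A toy model of the handshake described above. Names, timings, and the
// 900 mV figure are illustrative assumptions, not the actual protocol.
use std::time::Duration;

struct Regulator {
    // The firmware bug described above is modeled as "never sends the
    // completion packet."
    sends_completion: bool,
}

impl Regulator {
    // Returns Some(()) only if the firmware reports completion back.
    fn set_voltage_mv(&self, _mv: u32) -> Option<()> {
        if self.sends_completion { Some(()) } else { None }
    }
}

fn power_on(reg: &Regulator, max_attempts: u32) -> Result<(), &'static str> {
    for attempt in 1..=max_attempts {
        println!("attempt {attempt}: requesting 900 mV");
        match reg.set_voltage_mv(900) {
            Some(()) => return Ok(()), // completion seen: come out of reset
            None => {
                // No completion: from the CPU's point of view the request
                // failed, so it bounces and starts the sequence again.
                println!(
                    "no completion after {:?}; restarting power-on",
                    Duration::from_millis(1250)
                );
            }
        }
    }
    Err("CPU never came out of reset")
}

fn main() {
    let buggy = Regulator { sends_completion: false };
    assert!(power_on(&buggy, 3).is_err());

    let fixed = Regulator { sends_completion: true };
    assert!(power_on(&fixed, 3).is_ok());
}
```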

(00:51:55)

And the actual problem was, we had a firmware bug. A firmware bug that had actually already been fixed by Renesas; the Renesas FAE was like, “God, you should have contacted me much sooner.” It was like, “Well, we wanted to make sure that it wasn’t you, that it was us.” But this was a very eye-opening moment, because you’ve got this firmware bundle that we don’t have access to, and it’s not something that can be dynamically updated in the system; it comes with the part, it lives in non-volatile RAM, and you’ve got a finite number of slots it can be flashed into, and then you’re out of slots, you’ve blown the slots. And you think like, “Wow, all that software.” There’s sophisticated software at these very lowest layers of the stack. And for all of my desire to start a company that would not get bit by firmware bugs, it was a firmware bug that bit us first, which I think is kind of amusing. But that was a wild problem, and also a relief that we had done so many other layers ourselves; if we’re down to the voltage regulator firmware, okay, those we’ll accept, I guess, at some level.

(00:52:58)

Another one that was absolutely brutal: we had a NIC, and the NIC is the network interface card. NIC choices are bad, just in general. It’s just not a great landscape. And we kind of famously, we’ve got something called RFDs, requests for discussion, kind of like RFCs; some are public, some are not. But our RFD on our NIC selection, we call it the four NICs of the apocalypse. Because it felt like we were choosing between death, war, pestilence, famine, disease, I guess. And Broadcom, by the way, is war. Broadcom’s always war. Like, you know Broadcom is war. I can’t remember which one Mellanox was, and then, like, Intel was pestilence, I think. And Chelsio was famine. And we’re like, you know what, we’re going with famine. And there are a lot of reasons why; there were a lot of things that were valuable about that.

(00:53:58)

But, so we had this Chelsio NIC, have a Chelsio NIC, and that NIC would transiently fail to train all of its PCIe lanes. On some small subset of boxes, it wouldn’t train, and then you could bounce it, and it would train. Some of these would train all the time, and some of them would train much more sporadically. And we ended up doing a bunch of very low-level debugging over a long period of time, but it required us to be able to modify that lowest layer of boot software over and over and over again, and with a lot of debuggability, to figure out what was going on. And what we actually discovered is that if you reset this thing a second time, it would always function correctly. We call it the double-PERST. As it turns out, once they isolated it, this had been a bug for them for 19 years. And assuredly, something out there once needed a double-PERST, so every system on the planet will issue two resets to every NIC to accommodate God knows what. It would actually be interesting to know what broken device everybody’s accommodating. But the brokenness in this NIC was able to hide out under that behavior. So this NIC had an unrelated issue that they never located, because everybody does this double reset. Well, except for us, because it’s a reset. You only need one of them. It’s like, well, long story, but you’d better send us a second one.
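
Here is an illustrative sketch, not Oxide’s actual bring-up code, of the accommodation the rest of the industry had apparently baked in: if the link fails to train after the first PERST#, assert PERST# once more before giving up.

```rust
// Models the behavior described above: a device whose link only ever
// trains after it has seen a second reset. Purely a sketch.
struct Nic {
    resets_seen: u32,
}

impl Nic {
    fn assert_perst(&mut self) {
        self.resets_seen += 1;
    }

    fn link_trains(&self) -> bool {
        // The quirky device: training succeeds only after two resets.
        self.resets_seen >= 2
    }
}

fn bring_up(nic: &mut Nic) -> bool {
    nic.assert_perst();
    if nic.link_trains() {
        return true;
    }
    // The "double-PERST" accommodation: one more reset before giving up.
    nic.assert_perst();
    nic.link_trains()
}

fn main() {
    let mut nic = Nic { resets_seen: 0 };
    assert!(bring_up(&mut nic));
    println!("link trained after {} resets", nic.resets_seen);
}
```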

(00:55:36)

So again, this was also an eye-opener in terms of the very long shadow that these things cast. So the last one, and this one is definitely germane because you all at Jane Street were the first ones to actually see this. The IBC is an intermediate bus converter. And you had a sled, sled 19, famous to us, where every 24 to 48 hours or so, its drives would reset. And when the drives reset, which, by the way, from the operating system’s perspective, that’s not really the way it works; you’re not allowed to just reset things arbitrarily. So in particular, the drive is like, all right, well, I guess I’m waiting to be initialized now. I was just reset. The operating system, meanwhile, may have I/Os in flight and so on. And at some point, the operating system is like, what happened to all of our I/O devices? We have lost all of our I/O devices. And the operating system would bounce, everything would come back up, and it’d be fine. But we were seeing this pretty persistently and assumed that we had a drive issue, but we were seeing it on all drives at once. All 10 drives in the sled were resetting simultaneously. It’s like, this has gotta be a drive issue.

(00:56:56)

And actually, the nice thing about doing all the software yourself is that whenever we have a problem anywhere, it doesn't even enter our minds that the problem might be due to somebody else. We're always like, nope, we definitely own this one. In particular, we thought this might be an interaction with the firmware on the part with respect to the way we bring up the PCIe engines, so we were very concerned about that. Well, as it turns out, that's not what it was. What it actually was, was the bus converter, a big part at the back of the sled that converts 54 volts DC down to 12 volts. And on a small number of our sleds, about 3%, occasionally, every day or so, maybe once a week, that voltage would dip from 12 down toward 8 and come back up. It would go down for about a millisecond and a half, something like two milliseconds down and three milliseconds back up.

(00:57:51)

And amazingly, and this is kind of the amazing bit about regulators and physics, from 12 down to 8 you are still within the input threshold for all the regulators. Everything else downstream is like, all right, weird, but cool: don't know why my input voltage is dropping, but I'm with you. My job is to regulate 3.3. My job is to regulate 1.0. Okay, that's fine, I'm cool with it. The drives, however, and again, we're going from 12 to just below 8, the drives are like, yo, we operate on 12; I'm not cool with this at all. And so that's why the drives would all reset. The drives all bounce, but everything else in the system is fine.

(00:58:38)

This was another one of these that was super chilling. Before we got your sled back, we wanted to take some iterations on it. One of the things we did was modify the service processor firmware to use the hot swap controllers that are on those shark fins on the drives, because they are witnesses to a crime: that regulator has a little facility that allows it to record minimum and maximum values. So they would see the voltage drop, and okay, now we know conclusively that this is the issue. So we got that back. That was on 3% of sleds.
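To make the "witnesses to a crime" trick concrete, here is a hedged sketch of the kind of check a service processor might run against hot-swap controllers that latch a minimum input voltage. The `HotSwapController` model, its scaling, and the 10 V threshold are all invented for the example; the real parts and the real firmware interface are not shown here.

```rust
/// Illustrative model of a drive's hot-swap controller that latches the
/// minimum input voltage it has observed since it was last cleared.
/// The interface and scaling are invented for the example, not a real part's.
struct HotSwapController {
    vin_min_mv: u32,
}

impl HotSwapController {
    /// Read the latched minimum input voltage, in millivolts.
    fn read_vin_min_mv(&self) -> u32 {
        self.vin_min_mv
    }

    /// Clear the latched minimum so the next excursion is captured fresh.
    fn clear_min(&mut self) {
        self.vin_min_mv = u32::MAX;
    }
}

/// Nominal 12 V rail; flag anything that dipped below ~10 V as an excursion.
/// (The fault described above was a dip from 12 V to roughly 8 V.)
const VIN_DIP_THRESHOLD_MV: u32 = 10_000;

fn check_for_dip(slot: usize, hsc: &mut HotSwapController) {
    let vmin = hsc.read_vin_min_mv();
    if vmin < VIN_DIP_THRESHOLD_MV {
        println!(
            "slot {slot}: input rail dipped to {:.2} V since last check -- IBC suspect",
            vmin as f64 / 1000.0
        );
    }
    hsc.clear_min();
}

fn main() {
    // Two imaginary drive slots: one healthy, one that witnessed the ~8 V dip.
    let mut slots = vec![
        HotSwapController { vin_min_mv: 11_950 },
        HotSwapController { vin_min_mv: 8_100 },
    ];

    for (slot, hsc) in slots.iter_mut().enumerate() {
        check_for_dip(slot, hsc);
    }
}
```

The design point being leaned on is that the telemetry is latched in hardware, so a millisecond-scale dip is still visible to firmware that polls far less often.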

(00:59:15)

We were, as you can imagine, working with our IBC vendor to understand why this was happening. And of course we heard the thing that I swear we at Oxide will never say, which is "never seen this before." This is a Dellism, right? If you have Dell: "we've never seen this before." And you're like, why are you saying that? What is that, other than a way of telling me you care less about my problem? It just feels like the opening sentence of a paragraph in which you close my case, right? You won't hear that out of us. You may hear from us that we've never seen something before, but you won't hear that you're the only customer who's seen the problem. So: they had never seen this before, and we were the only customer seeing the problem.

(01:00:03)

They couldn't reproduce it. We sent the units back to them and they couldn't find it. What we were able to do was test for this in the factory and validate the parts there, so we didn't ship any more of those sleds out. And we just learned, this past week, that they've actually debugged it. There's a circuit on the input voltage side that is not totally robust; it's just close to the margins. And this is one of the things that's very frustrating about actual hardware, the stuff that you hold in your hand, indisputably hardware, we can argue about an FPGA, but a capacitor or a resistor, these are definitely hardware: there's variability to those things. If you're coming from software, you don't really have that; software either works or it doesn't. But that variability can add up, and then you have an IBC that is vulnerable to thinking there was a drop on its input that wasn't actually there. It starts to regulate down because it thinks its input has dropped, then realizes, oh shit, put it back. And within that window, the drives reset.

(01:01:17)

But again, this was a chilling example, because I think to myself, God, how many times has this happened elsewhere? If you just saw drives reset, your first thought is not that the bus converter is broken. That thing's job is super straightforward, not to take anything away from its complexity, but it regulates that voltage, and you would expect it to have an uptime of 100.0%. So the idea that you would have a transient in a bus converter was eye-opening, and eye-opening about this: if we had not done all of our own software and thought about the whole thing holistically, you would just have a ghost. And that's what we actually have in other systems. Systems where you aren't doing that hardware/software co-design, you just have ghosts. Bad juju, voodoo, whatever it is. Poltergeists: reboot it. Everything becomes like Linux audio, right? You're just like, please, God, work. What sacrifices do I need to make for my Linux audio to work?

(01:02:21)

And when we don't actually build these things together, that's what the world is. So it's really dangerous, and this is why Andreessen's Folly really is a problem, to think of these things in terms of the primacy of software. No: we actually need to co-design these things together, holistically. And when we design things holistically, we're able to solve really novel problems. And actually, we shouldn't think of it as Andreessen's Folly, although, again, if you're like me and you get trolled by VCs, you probably will. Alan Kay put this best in 1983: people who are really serious about software do their own hardware. And that is the way we should think about hardware: when we are collectively really serious about our software systems, we want to co-design them with the hardware. What lands on which side of that divide is less interesting than actually building these things together to deliver a high-performing, robust system that enables us to do things way, way, way up the stack, at a great distance from that co-design.

(01:03:33)

With that, thank you very much. And sorry, I know I’ve gone a bit over. I’m not sure if you-

Henry Nelson (01:03:43)

No, but yeah. So yeah, we’ve got time for questions. If people just wanna raise their hands, we can pass around a mic, or I think this box should work, too.

Bryan Cantrill (01:03:56)

Oh, wow, testing.

Audience (01:03:58)

All right, you mentioned a couple times that there’s still software that you don’t control on the Oxide sled. What are some examples, and what’s next?

Bryan Cantrill (01:04:06)

Yeah, right, well, that regulator software, for one. The big example is drive firmware. We have not done our own drive firmware, so what runs on the SSD is another whole universe; we've not done any of that. We're not buying raw flash, we're buying actual SSDs. I don't know if that one's next or not, but it is of interest. We also don't control any of the software that actually runs on package on the CPU, and there's a whole network of hidden cores there: you've got the PSP, you've got all the other cores around it. I would dearly love that to be open. Actually, let me back up a little bit. What we actually need to develop these kinds of systems is transparency at that hardware/software interface. I don't necessarily need to write all this stuff ourselves, but I'd like to understand what it's doing. We have pervasive open source upstack, which is great. We do not have pervasive open source downstack. There is no SSD that has open firmware, to the best of my knowledge, and I would love to be wrong about that. There is no NIC that has open firmware.

(01:05:23)

And actually, I guess what's next for us is the NIC. We control the software that runs on the Chelsio part in the sense that it's delivered through us, and we've got a driver for it, but ultimately they develop their own firmware. With the next NIC, we are effectively doing our own NIC by not doing the hard parts ourselves. NICs are really, really challenging, and SerDes are really challenging, right? These are super high-speed interfaces. And actually, I know this because I've talked to folks here at Jane Street who've used the Xilinx line of parts. The AMD Versal 2 is a super interesting part for us, because it's got hard blocks for the SerDes, so we can actually do real software control. So that's what's next for us, because that one is very much in development.

(01:06:17)

The next thing would be the switch at the top of the rack, really the middle of the rack for us, based on Intel Tofino. We were ultimately unable to get Intel to open it all up, and it is still super proprietary even though they've killed the product, which is very frustrating. For our next-gen switch, Xsight Labs has a part called the X2, and we've gotten them to completely document the lowest levels of its ISA. On our Tofino switch right now, we are required to use Intel's P4 compiler. We have always wanted to do our own compiler there, and Intel's like, "You can't." Okay, well, can we have the microarchitecture documentation so we can figure out whether we can or not? "No, you can't have that." Well, then we definitely can't.

(01:07:10)

On the X2, we will do our own P4 compiler, and then we will open source all that stuff. The other thing is, all the software we do, we open source, because I'm not really worried about someone walking off with it; you would have to make all of our same design decisions, so knock yourself out. So that would be what's next for us. The biggest single problem we had with Tofino was the bits that were proprietary. Proprietary lowest-level system software, firmware, is a huge problem, because there's so much stuff that can hide out there. Even if we had an open SSD, we probably wouldn't want to run our own firmware on it; we're clear-eyed about the fact that it would be really hard to do our own SSD firmware, our own FTL, our own flash translation layer, and we'd want to do that for exactly the right reasons. We're not doing this stuff just to do it. To the contrary, we want to pull stuff in off the shelf wherever we can. So I guess that's what's next for us.

Audience (01:08:14)

Hey, thanks for the great talk. I was curious whether you guys are thinking about doing silicon design or FPGA design of some kind. And if so, one of the things that Oxide has done is invest heavily in safe programming languages like Rust; I'm wondering if you have similar thoughts about what better hardware design might look like for you guys.

Bryan Cantrill (01:08:34)

Yeah, that's a great question. FPGAs, yes, we already do. We've got FPGAs in the first sled, FPGAs play a much larger role in our Cosmo-based sled, and then we're doing an FPGA-based NIC, so we're doing a bunch of FPGA work already. The HDL situation is a bit tough there, unfortunately. I do still love Bluespec, but the reality of Bluespec was that it did not have the throw weight that we needed; using it for the high-speed stuff in particular was going to be really problematic. As for ASICs, at some point we will do ASICs, I think. Anyone at Oxide would say that it would start with the root of trust, which is a pretty big lift because that's secure silicon, a real challenge, but it's also a part where we've got room for improvement, I would say.

(01:09:32)

For the high-speed stuff, I would love to, but it's really, really hard, and it's still really locked up by a couple of proprietary software companies. If you want to do an ASIC, you are about to be a customer of some very proprietary software companies like Cadence and Synopsys, to the point where that whole ecosystem is kind of nuts: the way that you get IP, the way you've got to be tied to a process node. And sorry, you've gotten me onto something, but I felt like Intel had a great opportunity to absolutely redefine everything. It would have needed a much crazier CEO, it needs a much crazier CEO, because there is a really interesting opportunity to open up what a fab looks like, and that's the kind of thing you need to do to really revolutionize it. I think we at Oxide will be pretty limited in what we can do; I think we end up being a Cadence customer, unfortunately. But FPGAs are much more tractable. We have done a bunch of different things there, and we will continue to open everything that we're doing. You're still using a proprietary toolchain from Xilinx, but it's better than the ASIC toolchains. Does that answer your question?

Audience (01:10:43)

Yep, thanks. Bryan Cantrill: Yeah. Audience: Hi, great talk, but I live a lot higher up the stack in my day-to-day work. You mentioned several times the functionality, performance, and robustness that result from doing your own hardware. Can you give some examples of things you could do in software further up the stack that were enabled by your hardware work?

Bryan Cantrill (01:11:01)

Yeah, absolutely. In particular, when you control the switch and you control the NIC, you now have a true networking fabric that you can truly program. Think of the things you want to go do: you want to do firewalling, you want to do virtual private networks, you want to provision a tenant, a bunch of things that are, again, way up the stack. When you don't control all of those layers, those become really arduous or impossible. So those are the kinds of things we've been able to do. From an observability perspective, being able to see from top to bottom. From a performance perspective, the same. And actually, one of the things that I'm really excited to do, that we have not done at all yet: power management.

(01:11:48)

Right now, the parts have tremendous power management capabilities, and they are basically all on autopilot, which is pretty good. What you would like to be able to do is stuff more equipment into a rack: have a rack where you are able to dynamically cap how much power it can draw. Then you could say, all right, yes, I want this rack, which for a Cosmo rack would be at something like 33 kW, to only draw 20 kW, and have the control plane decide which CPU gets what. That then allows you to really rethink the kind of work you schedule onto that thing. So there's a bunch of stuff like that. Sorry, are those helpful examples? Is that-

Audience (01:12:34)

Oh, definitely, 100%. Yeah.

Bryan Cantrill (01:12:37)

But no, there’s a lot of stuff that you can only do when you can uniquely cut across that, for sure. Audience: Beautiful, thank you. Bryan Cantrill: Yeah.
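To make the power-capping idea above a little more concrete, here is a minimal sketch of one way a control plane might split a rack-level budget across sleds. The `Sled` numbers and the proportional-share policy are assumptions for illustration only, not how Oxide's control plane actually works.

```rust
/// Hypothetical per-sled power limits, in watts. These numbers and the
/// policy below are invented for illustration.
struct Sled {
    idle_floor_w: u32, // can't go below this and stay operational
    max_draw_w: u32,   // what the sled would draw uncapped
}

/// Split a rack-level power budget across sleds: give every sled its idle
/// floor, then hand out the remaining headroom in proportion to each sled's
/// uncapped demand. Returns per-sled caps in watts.
fn distribute_budget(sleds: &[Sled], rack_budget_w: u32) -> Vec<u32> {
    let floor_total: u32 = sleds.iter().map(|s| s.idle_floor_w).sum();
    let headroom = rack_budget_w.saturating_sub(floor_total);
    let demand_total: u32 = sleds
        .iter()
        .map(|s| s.max_draw_w - s.idle_floor_w)
        .sum::<u32>()
        .max(1); // avoid a divide-by-zero if every sled is already at its floor

    sleds
        .iter()
        .map(|s| {
            let demand = s.max_draw_w - s.idle_floor_w;
            let share = headroom as u64 * demand as u64 / demand_total as u64;
            (s.idle_floor_w + share as u32).min(s.max_draw_w)
        })
        .collect()
}

fn main() {
    // A toy rack of 32 sleds that would draw ~32 kW uncapped, capped to 20 kW.
    let sleds: Vec<Sled> = (0..32)
        .map(|_| Sled { idle_floor_w: 250, max_draw_w: 1_000 })
        .collect();

    let caps = distribute_budget(&sleds, 20_000);
    println!("each sled capped at {} W of a possible 1000 W", caps[0]);
    println!("rack total: {} W", caps.iter().sum::<u32>());
}
```

The interesting part is less the arithmetic than who gets to make the decision: with control of the whole stack, the cap can also inform what work gets scheduled onto the rack in the first place.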

Audience (01:12:45)

Hi, so I was wondering how you think about the trade-off between building something quickly so you can see the ways it will break, versus going very deep from the get-go. Are there places where you cut a corner and used something without digging into it, so that you could build the rest, and then later dug into it when you needed to–

Bryan Cantrill (01:13:03)

Yes, yeah, for sure. Prototyping is really important. Are you asking in terms of hardware or software, or both?

Audience (01:13:12)

I guess kind of both, yeah.

Bryan Cantrill (01:13:13)

Yeah, for sure. On hardware, one of the things that we started to do a little bit, then realized we weren't doing nearly enough, and then put the gas pedal down on, is building much smaller boards that are studies in the way a particular component might work. And it has never been easier to build hardware than it is today. For all the "hardware is hard," you can go lay out a board, especially a simple or low-speed one, very quickly; I think we had a board where the idea entered Cliff's head and he had the board out to PCBWay 11 hours later. These are small boards, but they allow you to study a much smaller aspect of the system. What you want to avoid is, when you bring the whole system together, debugging anything other than the problems you could only debug then. Any problem that you could debug earlier, you want to debug earlier.

(01:14:17)

So those are little throwaway boards that we'll do and iterate on very, very quickly. And as time has gone on, we have more of a bias toward doing those fast: don't even wonder whether we should do it or not. And also don't wait. One of the problems we had early on is we'd say, well, we're doing this prototype board, let's add more to it. No: if you want to make a prototype board for an idea, just have it test that one idea. Don't wait for anybody else, just go, get it back, and then you have something you can experiment with. And I think that actually allows you to be much more rigorous about the system. Does that answer your question?

Audience (01:14:56)

Yeah, thank you.

Bryan Cantrill (01:14:57)

Yeah, you bet. And I should have said this earlier: we had the On the Metal podcast when we started the company, and we've got an Oxide and Friends podcast that we normally record on Mondays, though we're not recording it today, for obvious reasons. We've got an episode on building prototype boards that you'd really like, if that's interesting to you.

Audience (01:15:16)

Yeah, awesome, thank you.

Bryan Cantrill (01:15:17)

Yep.

Audience (01:15:18)

Hey, sorry, one more question. So, GPUs have obviously become a big part of data centers, and GPUs?

Bryan Cantrill (01:15:23)

Oh, GPUs, oh yeah, yeah, right, what about GPUs? Never heard of them. Yeah, huh? Sorry, tell me more? Yes, GPUs, right, those things.

Audience (01:15:31)

So, how are you guys thinking about integrating them into the model that Oxide has?

Bryan Cantrill (01:15:35)

Yeah, totally, great question. So here is the challenge with GPUs, which are obviously important. There are two doors. One is labeled "partner with NVIDIA," and the other is labeled "compete with NVIDIA," and what both of those doors have in common is death, as far as I'm concerned. The problem with NVIDIA is that NVIDIA is this very proprietary island, and they want to take the whole system. And again, it's like bacteria: I don't fault them for it, it's not the bacteria's fault, it's just doing what bacteria does. Their aim has been toward expansion. But it is a super proprietary company. Really, really, really proprietary. And that makes it really hard; I've never seen how we can, by partnering with NVIDIA, deliver something that delivers Oxide value. I just don't know what the advantage would be of us doing it versus a Supermicro sled, because we can't deliver any of the value that you get from us.

(01:16:47)

So I kind of need one of these other folks to please make that "compete with NVIDIA" door not be death. AMD's obviously super interesting; they've made a bunch of interesting moves in that regard. We will do an accelerator sled, for sure, but it's got to be done in a way where we can deliver Oxide value. If someone just wants NVIDIA, we're the wrong choice; they should go buy a Supermicro box, knock yourself out. But it's obviously super important, and we're in a wild time in many different dimensions; I don't know how it's going to unfold. Because right now, the parts that are being generated for training in particular don't have anywhere near the reliability that I think people should be able to expect. You all have much more experience in this than I do, but some of the failure rates that I've heard about are shocking, shocking. And again, I'm the guy that obsesses over 10 disks that reset once a week or whatever. Things that are dead on arrival, or have thermal issues, and so on: we would need to be able to deliver a much more robust system. So we would be looking for a partner that would allow us to do that.

Audience (01:18:10)

Is there a third door with other companies that are building inference accelerators or?

Bryan Cantrill (01:18:14)

Sure, that's the crazy door, and I love the crazy door, I'm all for it. There's a bunch of stuff out there, and the presence of RISC-V lets people put soft cores in front of this stuff quickly. I think you can make a pretty good argument that GPUs are actually not great for inference, right? There are a lot of kilowatt-hours being burned that you could use much more efficiently. So I'm all for it, that'd be great. What I would want, and with a smaller player this is easier, is transparency at that hardware/software interface. If we have that, then we can go do it. And this is the kind of thing where we would really look to you about what you're seeing and what you need, and we can absolutely go; we've got lots of things we can go do. We just want to make sure we're delivering something that is actually valuable, because if we can deliver something that's valuable, we can deliver something that's really valuable. Then you look at things like power capping and power management: will you be able to do it holistically? Having true multi-tenant support on these things? The multi-tenancy on a GPU is real rocky. And again, this is something you all have more experience with than I do, but with the CPU we've got great multi-tenancy, and I would want to have those same things on the parts that function as accelerators.

(01:19:42)

At the same time, you've got a bunch of these things coming on package, and I've not quite figured out why AMD is so disinterested in that. I think it's a don't-overthink-it, kind of Conway's Law thing: you've got the accelerators, the GPUs, and the CPUs under the same roof, so can't I take these chiplets and combine them on package? Which they do, the MI300A, they did this with these APUs. But then they're very disinterested in them, and they're kind of upset when you're interested in them. And you're like, okay, all right, never mind. I feel like I'm trying to throw a wedding shower for someone you're about to break up with. You know what I mean? You're just like, ah, can we not talk about Bob, please? And I'm like, okay. So we're really interested in that stuff, but haven't seen exactly a path. That's kind of what our thinking would be; let me know if I'm wrong.

Audience (01:20:40)

Thank you.

Audience (01:20:42)

So, I like your partnership with AMD. I remember hearing a while back that AMD was looking at open sourcing their silicon initialization stuff.

Bryan Cantrill (01:20:52)

Yes, openSIL.

Audience (01:20:53)

Yeah, openSIL, yeah. Yeah, yeah, yeah. Has there been good progress on that?

Bryan Cantrill (01:20:56)

Yes. Audience: And does that influence the decision to, like, go with AMD, or?

Bryan Cantrill (01:20:59)

I'm not sure; that came after, yeah, after. Great question. Very supportive of openSIL. openSIL is effectively an open BIOS: they're taking that lowest layer of silicon initialization and opening it, which is great. Actually, I'll tell you the reason I'm very supportive of that effort. We were going to open source all of our software, and this included a bunch of low-layer platform enablement, which we would actually have needed AMD's permission to do. And I'm like, oh, God. We had floated some trial balloons with some orthogonal software that we wanted to open, and it just got lost in the bureaucratic morass at AMD, and I was really not looking forward to that. openSIL, fortunately, made it so we didn't have to ask that question, because openSIL effectively opened up the lowest layers of implementation details that we were relying on. So we are very supportive of openSIL for that.

(01:21:57)

We are never gonna use openSIL ourselves, but I think it's great. I'm very supportive because it allows us to open up our layer, and very supportive of AMD opening up; all vendors should be opening up all of their software. There was a very interesting moment with the hyperscalers there; AMD is learning. Look, I know we're crazy, but we'll say things, and then AMD will come back with, you know, I've heard that from other people as well, you guys are not the only ones who think that, as it turns out. And it's like, I know that, that's what I've been telling you the entire time. So, on that, there were lots of other hyperscalers that were super interested in that effort, and I think AMD was a bit surprised by that. So, very, very supportive, but we won't use it.

Audience (01:22:40)

Thanks. Bryan Cantrill: Yep, you bet. Thank you very much.
