The Hurricane's Butterfly: Debugging Pathologically Performing Systems

Bryan Cantrill

Joyent

Despite significant advances in tooling over the past two decades, performance debugging, finding and rectifying those limiters to systems performance, remains a singular challenge in our production systems. This challenge persists in part because of a butterfly effect in complicated systems: small but ill-behaving components can have an outsized effect on the performance of a system in aggregate.

This talk will explore this challenge, including why simple problems can cause non-linear performance effects, how they can remain so elusive and what we can do to better debug them.

Transcript

Great. I’m a fool for giving this talk because I, like all software engineers, I think, I’m superstitious at some level. And this is true not just in our domain, this is true in baseball too. If you know any baseball players, baseball players are very superstitious. I didn’t really play baseball growing up, but my oldest son, Will, loves playing baseball. Superstition is very endemic to baseball, and he developed superstitions on his own when he was as young as seven. He’s like, “I need to wear that same underwear. I need to do this and do that.” Why is he superstitious? Because he feels he’s doing the same thing when he’s at the plate or in the field, but sometimes good things happen and sometimes bad things happen.

I feel that with software engineers, too. We’re superstitious because sometimes good things happen to us and sometimes bad things happen to us, and it must be the gods that are doing this to us. The gods love to toy with me, and I’m too stupid to learn to just keep a low profile. And in particular, I just foolishly gave this talk in May on debugging during an outage, which if that’s not on your hands and knees begging for a conga line of outages to come your way, I don’t know what is. Which is exactly what happened. I gave that talk, and it must have been as I was leaving the stage that the pager started going off and the klaxons and everything else. And I got to debug during an outage.

But I’m a slow learner, and in, I think it was in, August I gave a talk called Zebras All the Way Down, where I decided that I was going to take an old enemy out to the back and finally shoot it, and that was firmware. Firmware and I have had a long and antagonistic relationship, and I felt it was time to finally send a message. And by the way, we are engaged in a war of humanity versus firmware, and you all need to decide which side you’re going to be on. But this was stupid, obviously, because you know what happens. You know what ends up happening here. I give a talk on the perils of firmware, and, of course, every firmware bug on the planet put the crosshairs right on me.

Because I control time and space with my mind, I have brought Spectre upon us. I’m sorry. I apologize. And I really apologize because I’m too stupid to stop. I can guarantee you… I’ve actually turned my phone off right now because I don’t want to know about the giant system that begins behaving pathologically during this talk and won’t stop for years afterward. I’m stupid to give this talk, but it’s a talk that I really wanted to give. And when Spiros gave me the opportunity, I was thinking about what are some of the things that are important to me that would also be important to you here.

Pathologically performing systems have effectively been my career, at some level or another. I’ve spent a very long time dealing with pathological systems. Pathological systems of many varieties. Systems that don’t work. God forbid they ever all start working completely all of the time. I don’t know what we’re going to do. I don’t know what I’m going to do. Maybe you know what you’re going to do. I don’t know what I’m going to do. There’s a level at which the fact that these systems are broken keeps us employed, but I’ve been dealing with broken systems my entire career.

And, of course, when you’re talking about broken systems, there’s a taxonomy there. The failures that are easiest for us to debug are when the system does us the service of detonating, of blowing up, of saying, “I’m taking myself out of the software gene pool. I’m going to blow my brains out.” Ideally, it says, “I’m going to blow my brains out because I have determined that there is some condition that to me is fatal.” The reason that does such a service is because you know exactly where to start. How did we enter the condition that the software perceived as fatal? More often, the software is not quite that aware.

More often, it will actually do something that is fatal to it and not realize why or how. It will have… This is the uncaught exception or this is the null pointer dereference, or what have you, where we decide that the software can no longer continue and we kill it, and we give you a core dump or a crash dump, which is a capture of that application’s state. You have all of that state that you can then go debug and figure out what happened and why. So you can take out from this production system, you can begin to debug what happened, and it’s very gratifying when you have one of these problems that happen once in a blue moon and you’re able to take a core or crash dump, debug it, realize, “Wait a minute. Actually, this condition is much easier to hit than we think. We’ve only seen this once, but I think if I just do this and just do that and run this test over here.”

And then, boom, the system blows up exactly as your debugging indicates. There is no more satisfying feeling, I think, in software engineering. It feels so gratifying to take this failure that happened once, be able to reproduce it, be able to fix it. Those are the good days. It’s great to be able to go do that. Unfortunately, these are not all failures. Not all failures are fatal. May all your failures be fatal. But not all failures are fatal. In fact, many failures are nonfatal. The system continues to move. The system is continuing to operate. It has failed in some important capacity, but it’s moving on.

Now, if you’re lucky, that failure is explicit. It has encountered an error condition, it’s going to emit this error condition or log the error condition. The software has detected there is brokenness about, and maybe it’s a service that I depend on is not available so I’m going to log an error condition, I’m going to point you in the right direction. Something is wrong, humans. Humans play a very important… I don’t know about you, but I chuckle at the idea of this artificial general intelligence, the AGI. This idea that the robots are going to take over. It’s like, “Maybe the robots can actually go fix themselves first.” Because you and I actually know how brittle these software systems are, so as soon as we get post singularity I got a long list for the software to go do while I go make myself a margarita or whatever.

I’m not worried about being enslaved by broken software. That is not a concern. When software indicates this is where something is broken, go start over here, that’s helpful because the problem is nonfatal, but you’ve got a log message. You’ve got something to start on. You’ve got an idea of what’s going wrong. What’s much, much worse is when the failure is totally implicit, when the software keeps running and just does the wrong thing. That’s very unsporting of software, when it just does the wrong thing. This can be the software giving you the wrong answer, which is the absolute worst thing.

Those of you who do numeric programming know it is brutally hard to debug bugs when you’ve got numeric programming. In fact, some of the earliest bugs… Actually, one of the earliest computer programs, a program written for the EDSAC, has got an infamous bug in it that took two decades, three decades to debug. It’s a numeric programming issue. This is where you’re dealing with floating point issues and so on. It can be really, really brutal. It can give you the wrong answer, it gives you the wrong thing. Or it can have pathological side effects, like leaking resources. But I actually think the gnarliest class of all is the software that operates fine, isn’t broken by a strict definition of broken, but is very unsatisfying to us because it doesn’t perform well.

It performs pathologically. It’s not doing what it’s supposed to do as quickly as we demand that it does it. And that is to say the system just sucks. It’s like, “I’m up, I’m working, but no one’s happy about it.” And guess why? I don’t know why. No one knows why. Then these problems are really painful, especially when they happen on production systems, especially when they happen on a system that was performing very well. Which, to me, is the much more common case. It’s one thing to have something that’s not performing well when you’re developing it. It’s like, “I’m timing it. That takes a little too long. Let me time that. Yuck. Okay, I need to go optimize that a little bit, spend some time.” That’s a very gratifying feeling.

What’s much more common is like, “Looks good to me.” Performs well, seems to perform well. We put it in production, it performs well for six months or a year, two years, or five years. And then one morning it just performs terribly, and nobody knows why. And everyone gets very, very upset. You get very upset because it’s very hard to determine why. This was brought home to me very viscerally when I first started my career, where we had a very large benchmarking system. We had a system, this was an E10K, for those of you who are as ancient as I am. For those of you Millennials in the room who don’t recognize Dan Aykroyd or Three’s Company or any other relevant cultural references, the E10K was… This was a machine made in 1997, 64 processors, 64 gigs of RAM, which you still don’t have on your laptop, actually. This is a big enough machine that it was made over two decades ago and you still don’t have that much memory on your laptop and you still don’t have that many cores in your laptop. This is a big machine. 747 pallet sized machine. We were doing a benchmark with five of them. We had one machine that was running the benchmark and four machines that were feeding it, and it was awesome. This fire-breathing monster ripping through this SAP benchmark. And it was great until it was terrible. And it was terrible for like three or four minutes of absolute, “What is going on? Nobody knows.”

Not doing productive work, and then, as if a cloud lifted, the machine would suddenly start doing work again. And this was very frustrating because we could see that if the machine could just do well all of the time, it would be an awesome computer that many people would want to buy. But a computer that slips on the ice every four minutes and can’t get up for three, that’s not very fun. And it wasn’t every four minutes, unfortunately. It would be like every hour or every 20 minutes, and we were debugging it and debugging it and debugging it and debugging it, and it was a very early lesson to me in the stacks of abstraction that we have.

We built these incredible stacks, and this was in 1997. The stacks have grown so much taller and so much deeper since then. These stacks of abstraction, where we build software on top of software on top of software on top of software on top of software. The reason we do this, the reason we are able to do this is because of what I call the software paradox. Software is both information and machine. It’s like nothing else we have ever made. Software has the properties of information and it has the properties of machine. And in your frustrating days, let me remind you that you are living in the absolute golden age. You are living in Athens, you are living in… Pick your Renaissance, pick your golden age. You are living right now in the golden age of software because we are all still figuring out what that means.

One of the things it means is when we build software that’s correct, that sediments into the stack of abstraction, and we build ever upwards. So we have this stack that goes down so far we can’t even see it. And if you have a headache when you’re trying to go through these stacks of abstraction, it’s not you. It’s the actual stack itself. And there’s a level at which it’s majestic. It’s also terrifying. And in fact, the stacks go so deep it will challenge your definition of what software is. I don’t know if people know who Arthur Whitney is. Is Arthur Whitney hailed as a god around here? He should be.

Arthur Whitney developed an APL derivative called A+ at Morgan Stanley. Arthur Whitney is a very, very bright, somewhat strange person. I don’t think he’d mind me saying this. Eccentric. And I was interviewing Arthur Whitney for an issue of ACM Queue, and interviewing Arthur Whitney was like doing some hard drugs. I was asking… You’d ask Arthur these questions, and he would give you these short answers because he values being concise. He would give you answers that are so short and so perplexing that you’re convinced that it’s your own mortality and stupidity that are preventing you from being on this higher divine plane that Arthur Whitney is on.

To give you an example of this, I had asked him what he felt the closest metaphor was for software. Is software like… Is it a bridge? Or is it DNA? What’s the closest physical metaphor? What’s the closest metaphor for software? He waits for a second, he says, “Poetry.” And I was like… Kind of like, “Okay, I’m going to need a long second on that one.” So we’re having this conversation, I’m going deeper and deeper into this Alice in Wonderland kind of trippy thing. And then I made the mistake of… I kind of left that. I shouldn’t have been driving after that. I should have called a cab at that point.

But I was driving home and I was going to go check up on a friend, Mike Olson, who is the founder of Cloudera. So I check in on the Cloudera guys. This is when Cloudera had like four people. They kind of welcome me in, and we’re sitting around. And I’m like, “What is software? What is software?” And they’re like, “Software is what you run on the hardware.” It’s like, “I don’t think so.” Software is not what you run on the hardware because if you annihilate all of the hardware, does the software cease to exist? Let me ask you a different question. Was Euclid writing software when he developed his GCD algorithm? And if you answer no to that, you get really caught up in a lot of problems.

If Euclid’s not writing software, you’re not writing software either. And then hardware isn’t even what you think, right? Hardware is much more malleable than what you think. So I’m kind of having this discussion, and they’re like, “What is this guy’s problem?” And Mike is finally like, “Hey, can I talk to you? Can I talk to you? Can I talk to you? What’s going on? Are you on something?” And I’m like, “Mike, I’m sorry. I just had this conversation with Arthur Whitney. My brain’s kind of blown up. I don’t know what anything is anymore.” But this is kind of amazing when you stop and think about it, that we can’t even define what software is.

Obviously, we know what software is. This shouldn’t prevent you from doing your day job. We do know what software is. It’s like, “Sorry man. I don’t even know. What is software?” It’s like, “Can you please get this code reviewed and get it in please?” But it does go to some of these unique attributes of software and the fact that we’ve got this huge, huge stack that goes deeper. And I think people now appreciate, especially in the last week and a half, they appreciate some of that depth that maybe they didn’t appreciate before. Is the microcode that you downloaded from Intel, is that software or not?

You’re downloading a binary blob that you’re putting on your CPU that’s adding a new MSR to that CPU. Is that software? I don’t know. It sounds like software, it feels like software. It certainly isn’t Turing Complete. Actually, don’t install that microcode. That microcode will actually [inaudible at 15:55], so be careful about it. Don’t get too hung up on that. But the thing about the giant stack of abstraction is it’s amazing and majestic, for some kind of disturbed definition of majestic, when it works. But when it fails, and when it fails not from an explosive perspective, but when it fails to perform, when it performs pathologically, all of the power of those abstractions begins to cut against it in two very important ways.

First, all this layering is going to actually amplify performance pathologies. As we go down the stack, the actual work we’re going to do is going to increase. When we’re at those highest layers of abstraction, less does more, right? That’s the point of being at a high level of abstraction, less does more. Well, if less does more, that typo, or that simple bug, that one-liner gets in the Perl monitoring script that gets rolled out to every machine, and is running every 20 milliseconds instead of every 10 seconds, or what have you. All of a sudden you’ve generated this huge amount of additional work down stack. You get this explosion of work.
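To put rough numbers on that explosion of work, here is a back-of-the-envelope sketch. Only the two intervals (10 seconds intended, 20 milliseconds actual) come from the talk; the fleet size and the helper name `runs_per_day` are assumptions made up for illustration.

```python
# Back-of-the-envelope amplification estimate.
# Only the 10 s and 20 ms intervals come from the talk; everything else is assumed.
INTENDED_INTERVAL_S = 10.0   # how often the monitoring script was meant to run
ACTUAL_INTERVAL_S = 0.020    # how often it runs after the one-line bug
MACHINES = 1000              # hypothetical fleet size, for illustration only
SECONDS_PER_DAY = 24 * 60 * 60


def runs_per_day(interval_s: float, machines: int) -> int:
    """Total script invocations per day across the whole fleet."""
    return round(SECONDS_PER_DAY / interval_s) * machines


intended = runs_per_day(INTENDED_INTERVAL_S, MACHINES)
actual = runs_per_day(ACTUAL_INTERVAL_S, MACHINES)
amplification = actual // intended

print(f"intended runs/day: {intended:,}")
print(f"actual runs/day:   {actual:,}")
print(f"amplification:     {amplification}x")
```

The ratio is independent of the assumed fleet size: one wrong constant at the top of the stack multiplies whatever work each invocation triggers further down by the same factor.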

It goes the other way, too. When you’ve got problems that are very low in the stack, they can induce this explosion of latency as you go up the stack. So these layers of the stack are not totally airtight sealed from one another. Again, we’ve got new appreciation for that now, but these things are not completely sealed, and especially when they misbehave. These issues, these issues in one layer that induce an explosion of work or an explosion of latency in another layer, these are the butterflies that cause hurricanes. People are familiar with the idea of the butterfly effect, that a butterfly flapping its wings, and pick your city, as it turns out, there’s no real agreement on what city it should be in, so whatever city you first learned is as valid as any other city.

I think it’s supposed to even flap its wings in Dallas, that was the original. Anyway, a butterfly flaps its wings, and pick your city, and induces a hurricane in, pick your other city. It’s obviously extreme to show the example, but the idea is that very small input changes can have massive changes in output. And we see that in performance. We’ve got these butterflies that can cause hurricanes. I have collected these over the years. I’ve had many, many butterflies inducing many, many hurricanes, so I’m going to try not to fixate on these for too long because this is where I’ve gotten my old war stories. I just want to go through a couple of these.

One of these, I call it the ARC-induced black hole. The ARC is the adaptive replacement cache that’s present in SmartOS, our operating system, part of ZFS, present in FreeBSD and other systems as well. And the performance pathology here was the most extreme performance pathology you can have. The performance pathology here is machine no longer doing useful work, machine furiously doing something. We should get back to actually that example early on in my career from 1997. I didn’t actually tell you what the cause of that was. What was the cause of that benchmark machine that furiously was doing something but we didn’t know what? It was clearly in the networking stack, and we kept digging deeper and deeper and deeper in the networking stack, and we found algorithms that were quadratic and so on.

Great, we’d fix that and nothing would happen. It would still be sad. We’d go fix other problems, still be sad, still be sad, still be sad. And we finally did debug it. The problem was actually not due to the machine at all. There was a router in the lab, this was a lab environment, there was a router with bad firmware on it. The router would periodically just reboot, and this machine was misconfigured. It was misconfigured to be able to route packets. So this E10K, world’s most expensive database machine, would be like, “Hey, wait a minute database. Hold on. Somebody needs to route packets somewhere. I’m about to become the world’s most expensive, world’s worst router. Please stand by.” And it would be routing packets badly, routing packets badly, routing packets badly. And then this router would finally be like, “Okay, I can route packets again. Okay! So I have some database work to do. What’s going on with the database?”

And that, to me… It took us two weeks to debug that. Two weeks of around the clock, around the globe, because this was an era when these five machines needed to be sold so we could make our quarter’s numbers. That’s how important this was, how expensive they were. It took us two weeks to debug that, with all of the world’s experts brought to bear on that. And the fact that it took us two weeks and the problem was so simple and we couldn’t see it was chilling. And it showed the size of the hurricane that these butterflies can induce. This was literally a misconfiguration in a configuration file. A mistake anyone can make. Very, very hard to debug.

This was similar in spirit, not quite as knuckle headed in root cause, but similar in spirit, where the system would frenetically do something, but that something did not involve running processes. So the operating system, to all accounts, looks dead to the world. It would be actually much… I shouldn’t say better. It would be worse, but emotionally better, if the operating system refused to be pingable as well, because then you could be like, “I don’t know. The hardware is just stopped. I guess we have to power cycle it and I don’t have to debug it.” But you know that’s wrong. You know you’re a sinner when you think that. I know I’m a sinner when I think that. Don’t think that.

The machine is pingable, but nothing else is happening, which means you know it’s the operating system. I know who this is. I know who the murderer is because I’ve been tracking this murderer my entire career. It’s the operating system. The operating system kernel is doing something wildly stupid. In this case, what happened was the operating system… This is free memory over here, and what happened is the operating system saw this explosion of free memory. But before it saw the explosion of free memory, the free memory dipped below a very important line. And this is the line at which it says, “Okay, memory is actually so tight, I’m going to go ask the rest of the system, Hey, can you give me some memory. I need some memory back.” And this explosion is actually something very important, giving it a lot of memory back.

But right here it says, “I need some memory back.” And when it does that, it kicks off this kmem cache reap, this is the task queue length here, and this thing takes a very long period of time. What ends up happening while this is going on is this other unrelated ARC process comes in and becomes effectively blocked on this. The ARC wants to reduce in size. Wanted to reduce in size right there. It goes to reduce in size, but then it actually blocks behind this kmem reap and the ARC begins growing, growing, growing. And then the ARC gets so large that the system is like, “Okay, this gap has become so large between where we want the ARC to be and where it actually is that we’re going to stop everybody.”

And so for this period of time the system is stopped waiting for memory to free up. Very frustrating. We were able to debug this, ultimately, but it was excruciating. The root cause was very small, but very, very hard to debug.
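The shape of that pathology can be sketched as a toy model. This is not the actual kernel logic: the single FIFO task queue, the tick granularity, and every name and number below (`simulate`, `throttle_gap`, the sizes) are assumptions chosen only to show how a shrink request stuck behind a long-running task lets a cache overshoot its target until everyone is stalled.

```python
from collections import deque


def simulate(reap_ticks, target=100, growth_per_tick=1, throttle_gap=20):
    """Toy model, not real kernel behavior.

    One worker drains a FIFO queue holding a long reap task followed by a
    cache-shrink task. While the reap runs, the cache keeps growing. Once the
    cache exceeds its target by more than throttle_gap, callers are stalled
    (growth stops) until the shrink task finally gets to run.
    Returns (peak cache size, number of stalled ticks).
    """
    queue = deque([["reap", reap_ticks], ["arc_shrink", 1]])
    cache_size = target
    peak = cache_size
    stalled_ticks = 0

    while queue:
        task = queue[0]
        if task[0] == "arc_shrink":
            cache_size = target          # the shrink finally runs
        elif cache_size - target > throttle_gap:
            stalled_ticks += 1           # everyone waits for memory
        else:
            cache_size += growth_per_tick
            peak = max(peak, cache_size)
        task[1] -= 1                     # one tick of work on the head task
        if task[1] == 0:
            queue.popleft()

    return peak, stalled_ticks


print(simulate(reap_ticks=5))    # short reap: no stall
print(simulate(reap_ticks=50))   # long reap: cache overshoots, system stalls
```

In this sketch nothing is wrong with either task on its own. The stall comes entirely from the ordering: the task that would relieve the pressure is queued behind the task that takes a long time.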

This is one that I brought on much more recently. This is firmware cackling. This particular hurricane was… We had Postgres running, yes these are spinning disks, so yes, I’m sorry. Some of us can’t run SSDs. “Well, we’ll just run SSDs,” everyone’s like, “Yeah, I get it. Thank you for all your work. I can’t do that.” These are spinning disks. We had spinning disks running a database server that loves to autovacuum. Postgres loves to autovacuum anyway. And Postgres does this massive autovacuum. It kicks off a bunch of reads, and these reads become very slow because it’s spinning media, so we’re blocked on all these reads.

But the whole system begins to cork behind this, and we are convinced the entire time we’re debugging this hurricane that the problem is somewhere in Postgres. We’ve got to get Postgres. It’s autovacuuming too much, it’s doing too many reads, and we were trying to minimize it. We finally looked at the actual latency of the reads because it’s spinning media, it’s rotating rust. Reads are really, really, really slow. It’s like, “Yeah, reads are always slow.” But then you look at them, like, “These reads are really, really… These reads are very slow, even for spinning media. These reads are really, really bad.” This is actually showing you time on the X axis here. This is over a 30 second period, roughly. Latency, log latency on the Y axis. Over here on the Y2 axis, this is showing you the reads outstanding and the writes outstanding.

And what you can see here is this is where we’ve got 10 writes and 10 reads outstanding, but the reads, which are here, are not completing. And there’s this very long period where the writes are all completing, but it’s like, “Hey, jerk. You’ve had reads. You want to give me a read, please?” And it’s like, “Do you have any writes to do?” It’s like, “No, you have reads to do.” You’re looking at the trace, you’re screaming at the drive, “Give me my read back.” And finally, at this cliff, and there’s a cliff here at 1,700 milliseconds. I got the firmware developers on the phone, I’m like, “Take me to 1,700 in your source code.”

They were kind of deep in the bargaining phase at that point because they acknowledged this should not be doing this. And they’re like, “We looked for 1,700, and we can’t find it.” It turns out it was 2,000 minus 300, but anyway. They didn’t actually grep their source base for 1,700. Of course, that assumes they found out whose home directory was being delivered out of first. And so we would see these way outlier reads, and then the writes would stop. And it’s like, “Oh, I’ve got a bunch of reads to do.” This drive was designed to be a benchmark special. That’s what it boils down to. This drive can do… If you give it nothing but reads, it’s great. If you give it nothing but writes, it’s great. Meanwhile, in the real world, if you give it some reads and some writes, the writes will complete quickly and the reads will complete basically not at all.
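The step that broke this case open was looking at per-operation latency instead of assuming "reads are always slow." A minimal sketch of that kind of breakdown follows, assuming a trace of (operation, issue time, completion time) records. The records below are invented for illustration and are not data from the talk; `latency_summary` is a hypothetical helper.

```python
from statistics import median

# Hypothetical I/O trace: (operation, issue_ms, complete_ms).
# These records are made up for illustration; they are not data from the talk.
TRACE = [
    ("write", 0, 4), ("read", 0, 1700), ("write", 5, 9), ("read", 6, 1702),
    ("write", 10, 15), ("write", 16, 20), ("read", 18, 1705), ("write", 21, 26),
]


def latency_summary(trace):
    """Group completion latency by operation type and summarize it."""
    by_op = {}
    for op, issue_ms, complete_ms in trace:
        by_op.setdefault(op, []).append(complete_ms - issue_ms)
    return {
        op: {"count": len(lat), "median_ms": median(lat), "max_ms": max(lat)}
        for op, lat in by_op.items()
    }


summary = latency_summary(TRACE)
for op, stats in summary.items():
    print(op, stats)
```

Splitting by operation type is what makes the starvation visible here: an aggregate average over all eight I/Os would blur fast writes and stalled reads into one unremarkable number.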

And this turned out to be an extremely complicated problem that had to be solved by removing drives, sadly, and replacing them with a different vendor. And again, I brought this on myself by even talking about firmware, so I deserve it, and that’s what it is. We’re all sinners, and so on.

Okay, so this is one that’s much more recent, and this is data from Scaleway, one of the many cloud companies that I’ve been collaborating with during this rather difficult period of Meltdown and Spectre. This is showing us, actually, a very interesting opportunity because normally we have a system that just behaves pathologically and we’ve got to go find the butterfly.

With Spectre and Meltdown, we are actually introducing exactly one butterfly and then seeing what happens. And wow, what happens. This is really surprising. This is a PHP workload running with the kernel page-table isolation. Kernel page-table isolation should not induce this much overhead. This is actually very surprising to me that you see this huge spike in system time. This is IO time. The Linux implementation is still a bit rough. It keeps crashing and dropping data, so that’s why you have the idle time. But you can see that the system gets pegged, we get a lot more IRQ time. And this is from a deep change, but one that shouldn’t cause this.

The kernel page-table isolation, which is separating… Well, it’s not separating user and kernel memory, because user and kernel memory already are separated. It’s just that the microprocessor is refusing to recognize that under the… The microprocessor no longer recognizes protection when it comes to speculation, or it doesn’t recognize protection when it comes to speculation. So we’re forced to actually stratify it in a way that even the microprocessor can’t work around, and we see this really serious effect. God forbid you had to debug this just from this. You’d be very hard pressed to get to that butterfly that induced it. And we’re going to see that with Spectre as well, by the way. We’re going to talk about that afterwards. Talk about all the problems with the Spectre and Meltdown workarounds.

This is the challenge that we have. The challenge is we’ve got this hurricane. The hurricane is the system sucks, and we have to find the butterflies. I call this Leventhal’s conundrum, after my friend Adam Leventhal. I’ve collaborated with Adam Leventhal for a long time. This is something that Adam pointed out that always stuck with me. I think the number of Google hits for Leventhal’s conundrum is exactly two right now, so we’ll give it a third here. I keep trying to juice it a little bit for Adam, so Leventhal’s conundrum. But you can now sound smart and kind of drop this into conversation and be shocked when people haven’t heard of it.

Leventhal’s conundrum is, you’ve got the hurricane, now go find the butterflies. This is excruciating. This is really, really, really difficult. And don’t let anyone convince you that it’s not because there’ll be people that are around you or you’ll see on the internet or whatever that seem like they solve these problems effortlessly, and nobody does. This is not effortless. This is really, really hard. Why is it hard? First of all, the symptoms are often very far removed from the root cause. Root cause is a firmware bug on a router, and the symptom is here in the networking stack of an operating system of an unrelated machine. Symptoms are very, very far removed.

There often is not a single root cause, but several. This is really tough because you’re dealing with correct software. Correct by some definition. Logically correct, but pathologically performing. So it is very easy to find a problem that’s not the problem, and this is… You’ve done this and your head is nodding slowly and painfully, and maybe a single tear is coming down your cheek right now because you’re remembering a time that you found the problem and crushed it and implemented it and deployed it in production, and it’s slower now. And you’re like, “That’s a mistake.” It’s like, “No, no. That’s not a mistake. It’s slower.”

Or more often it’s like… More often, because when your coworkers aren’t deliberately trying to torture you, they’re like, “It’s a little faster. I’m seeing…” it’s like, “No, you don’t have to placate me. How much faster?” “It’s positive noise.” It’s like, “Positive… That’s not a thing. You’re just making that up for me. I appreciate that, but…” So this can be really, really brutal. It’s really brutal to spend a lot of time, and you think you’ve got the answer, but it’s not the answer. But it may be part of the answer. If you’ve ever dealt with a scalability problem, these are actually in some ways more frustrating.

If you have a scalability problem, you will not actually unlock the full power of the system until all of the scalability problems have been removed. You can remove a problem and immediately hit the next problem. It is true that what you found was the problem absolutely emphatically is the problem. You had to fix it, you did fix it. Good on you. System is now slower because we’re hitting a bottleneck that’s even worse. And that’s when you really have to have the long conversation with yourself and pick yourself up off the mat and go to the next one and the next one and the next one and the next one. And when you finally release it, and hopefully it’s you and not some other jerk that replaces you.

But hopefully it’s you that gets to that final bottleneck that you remove and all of a sudden throughput is flowing through the system as if through an open sluice. Very, very, very, very gratifying. But you’re so exhausted by that point you just kind of fall asleep at your desk. All of those things were issues, but you had many of them, so that’s a challenge. The system, the stupid system, it’s dynamic. By nature of the problem, it’s dynamic, it’s moving. Right now, it’s moving. While you’re debugging, it’s moving because it’s functioning. And it’s very frustrating to be… I’m sure you’ve had this happen, where you’re debugging a problem and you’re making progress. You’re making progress, and you’re like, “Wait a minute. What’s going on now?”

And then you’ll hear in your office, “Hey, what happened? Is the load off? What just happened?” And you’re just like, “The autovacuum stopped.” Or something happened, or the load was off. Something changed, and now the hurricane is gone. But the hurricane isn’t gone because you fixed it, and this is important. And it’s important, and this is a tough moment because you have to be like, “Okay, good. Problem solved. I don’t want to think about it anymore.” It’s like, “No, no. You actually need to go back through the data that you have. And what did you learn, and how can you now go find the thing that you know heart of hearts that you haven’t fixed?” Hope is not a strategy when it comes to debugging. Hope may be an outlet, but it’s not a strategy.

The fact that it’s dynamic is very challenging. And then, and this kind of goes to the scalability point, improvements to the system are very hard to model. One of the things I always try to do is I think I have found the problem, or I found a problem. Is there any way that I can model how the system will perform with this improvement in place? And it’s often very difficult. And often that modeling is fraught with its own peril, because you’re going to now change the system, fix a bug, fix a subsystem, make it scale, what have you, it’s going to change the system. But it’s very hard to model. We need to give it a shot, but it’s very, very hard.

With all of this, it’s very difficult. And I think that of all of debugging, I think this is the most difficult debugging. And to be absolutely emphatically clear on one point, this is not tuning. This is debugging. You tune a guitar, okay? And someone with even a modest amount of talent and a modestly good ear can tune any guitar very quickly. We do not tune performance. We debug pathologically performing systems. And when you think of it as debugging, your mindset should change because you’re not thinking… When you’re tuning a guitar, you’re moving things back and forth. If you’re tuning a system, you’re like, “Oh, let’s just move things back and forth. Let’s change GC algorithms, let’s do this. Let’s reset this.” Just kind of jigger with it.

No, please don’t. Debug it. Understand why the system is behaving the way the system is behaving. Please don’t just change your GC algorithm. Now, I know that I’m probably in a safe space with respect to Java, but I don’t know. I’m sure there are many people who deploy JVMs. You know that every Java performance problem is that you’re using the wrong garbage collection algorithm, because that’s what people know. And actually, I was with one customer who was running software that they had bought from someone else, and the customer is… they want… Like, “Would you mind being here while we’re talking to this other?” I’m like, “Why do you want…” It’s like, “Well, we just kind of need an advocate.” I’m like, “I’m like your attorney now? Okay.”

Sure enough, I could see why, because they were kind of being snowed a little bit. And in particular, the software company… Well, software companies are still misbehaving, like, “Well, here’s the problem. You’ve set your GC algorithm to… You’ve got the second generation, infant, newborn zombie garbage collection, and you’re supposed to set…” And they’re like, “Well, we set it to that because you told us to set it to that two weeks ago, when it was actually just set to stock defaults.” It’s like, “Oh, now what?” So we don’t want to just fiddle with the system until it performs because you won’t necessarily solve the problem. And GC is not necessarily as much of a problem as people think it is because that’s what they’re used to fiddling with. So you are not tuning, you are debugging.

And when we think of it this way, we think of it as debugging, it allows us to focus on how hard this is. This is not rote, it is not mechanical, it is not easy. There’s not a recipe you can follow to debug a pathologically performing system. Just like when you’re debugging a bug, it’s like if you’ve got a bug in your software, you’re like, “Where do I go? What recipe do I follow to debug my bug?” You don’t. You need to actually debug it. And when we think of it as debugging, we can resist, or try to resist, being guided by folklore. And I know that hopefully I’m not part of the problem here. Hopefully I’m not creating folklore here. I probably am.

Everyone’s going to go back and be like, “I know what the performance problem is. It’s our disk performance. It’s the ARC, it’s the disk, or it’s kernel page-table isolation.” I don’t want to create folklore here. We have to resist being guided by folklore, like, “Oh, once upon a time, I was at one job and the problem was that we had set this thing the other way, so let’s go set it back the other way.” Because that’s a story I heard once. It’s like, “Please don’t do that.” Don’t do that because you don’t actually know… Well, first of all, you don’t even know that caused the problem then, often. Often you take these things apart, it’s like, “That’s actually not even what was causing the problem then, let alone now.”

So please don’t operate based on folklore. And, just to sharpen the point, don’t change the system before you understand it. Let’s understand the system first, and then change it. I get that people are really excited. Excited is the wrong word. Traumatized. You want the problem to go away, right? We want to do whatever it takes to make this problem go away. But we actually do need to understand the problem. It can be very difficult when that’s a production system. How do we debug? Well, it can be very tempting to think about debugging as the process of forming hypotheses. That is not what debugging is. Hypotheses are… Yes, a hypothesis is going to be a critical stage to debugging. It is the last stage of debugging.

When you debug, you should be iterating between questions and answers, questions and answers, questions and answers. And as you’re iterating, the possible space of hypotheses gets smaller and smaller and smaller, until you don’t have a choice. This is what it has to be, because we have eliminated everything else. You want to avoid playing 20 questions the way my kids play 20 questions. You play 20 questions, and they’re like, “Okay, my first question, is it a badger?” Like, “No, don’t… No. You want to… Is it an animal? Is it a mineral?” It’s like, “Yeah, yeah, yeah. I get it, but is it a badger?” And if you are a parent and you’ve been through this, “It is a badger, you know it’s a badger, but under those conditions, you can change what it is.”

I want to give you permission to be like, “It is totally not a badger.” They’re like, “But I think it is a badger because we were just talking of badgers, we saw a badger. You have short-term memory loss. I think it’s a badger.” It’s like, “It’s not a badger. Go to your room.” Ask, “Is it an animal?” “Yes, it’s an animal.” “Is it a badger?” “Stop, please. Can you please…” All right. You don’t want to be leaping to hypotheses. Even when they’re right, it’s an anti-pattern. You want to be iterating between asking questions of the system and then actually observing the system. So how do we ask questions? How do we make observations? Well, for performance debugging, it can be very, very hard.

The right question to ask is, dear software system, why do you suck so much right now, and why do you always do it at the worst possible time? That’s the question you want to ask. Unfortunately, that’s not a very technical question. That’s an emotional question, not a really technical one. We need to find what is the initial question. And there are a couple of methodologies out there. One that I think is pretty good is the USE method, Brendan Gregg’s USE method, which is a resource-centric methodology. It’s focusing on, for each resource in the system, for the CPU, for the NIC, for memory, what is the utilization, what is the saturation, are there any errors? And that is a great approach for first questions.
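
As a rough illustration of how a USE-style pass produces first questions rather than answers, here is a minimal Python sketch. The `ResourceSample` type, the 0.8 threshold, and the sample numbers are all assumptions made up for this example; they are not part of the USE method itself or of any real tool.

```python
from dataclasses import dataclass

@dataclass
class ResourceSample:
    """One hypothetical reading for a single resource (CPU, NIC, disk, ...)."""
    name: str
    utilization: float   # fraction of time busy, 0.0..1.0
    saturation: float    # amount of queued work, e.g. run-queue or wait-queue depth
    errors: int          # error count over the sampling interval

def first_questions(samples, util_threshold=0.8):
    """Turn USE readings into questions to ask next, not conclusions."""
    questions = []
    for s in samples:
        if s.errors > 0:
            questions.append(f"{s.name}: {s.errors} errors; what is failing and when?")
        if s.saturation > 0:
            questions.append(f"{s.name}: work is queueing (saturation {s.saturation}); who is waiting?")
        if s.utilization >= util_threshold:
            questions.append(f"{s.name}: {s.utilization:.0%} utilized; what is consuming it?")
    return questions

# Made-up readings, purely for illustration.
samples = [
    ResourceSample("cpu", 0.35, 0.0, 0),
    ResourceSample("disk", 0.60, 4.0, 0),
    ResourceSample("nic", 0.10, 0.0, 12),
]
for q in first_questions(samples):
    print(q)
```

Note that the output is a list of questions. Nothing here says what is wrong, which is the point: the readings only tell you where to look next.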

It is not a recipe to debug your problem. I think some people think, again, “Oh, just spray some USE method on it.” It’s like, “No, no, no, no. It doesn’t work that way.” The USE method can get you to ask some early questions, but by the way, the USE method does not get you to that router that’s popping the firmware bug in the lab. It can get you to some early questions, but ultimately those questions it answers, you’re going to need to go figure out. So these are not actual recipes, but they are very good starting points. The methodologies that are out there are a good place to start. Once you have the questions, then you want to actually observe the system. You want to make observations. What is the system doing?

In this regard, the observability of the system is crucial to everything that we need to do. If we cannot observe the system, you’re going to be guessing. You’re going to be guessing, making changes, and drawing inferences. That’s what we do with systems we can’t observe. I would love to tell you that every system we have is observable. But remember that whole firmware bit, or take Spectre and Meltdown. These things are not necessarily directly observable, and we are observing them by their side effects. So there are layers to the stack that are so incredibly deep that we are going to have to do this because they are very hard to observe.

But for the software we write, there is no excuse for it to be unobservable. And it has to be said, when you are doing this, this approach of making guesses, and even when you fix the problem, you haven’t necessarily fixed the problem. Correlation does not imply causation. You may have fixed the problem simply because you reset the process. Anything could have… You could have fixed the problem effectively by accident, and this is a regard in which success can teach you nothing. If you’ve done no debugging prior to making the change that putatively fixed the problem, you will naturally be like, “I fixed the problem.” It’s like, “No, no. You didn’t fix the problem. It’s more nuanced than that.” So be very careful about drawing the wrong conclusion.

We’ve got to be able to instrument these systems. Obviously, this is something that… And honestly, the origin of DTrace came out of that E10K benchmark when my conclusion was… It was bad. It would have been bad to debug that today, with systems being much more instrumentable. It was excruciating when we couldn’t instrument it, and it took us way longer than it should have. I would like to believe that we’d be able to debug that problem a lot faster. Still a very hard problem, though. Still very hard to debug. There are a couple of ways to instrument systems. Static instrumentation is modifying your source to provide semantically relevant information, like your logging, your accounting, or something. This is very important.

Static instrumentation is the way we can form some of those initial questions. If software is totally silent about everything all the time, that initial question that we’re going to ask it when it’s behaving badly becomes very, very difficult. So static instrumentation in the operating system, in libraries, in your software is very, very important. But static instrumentation doesn’t get us all the way there. We need to be able to instrument the system dynamically. That’s what DTrace does. But I would point… Folks that aren’t familiar with the OpenTracing effort from the CNCF, I would check that out. It’s another way, totally different way, of dynamically instrumenting the system.
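
To make the static half of this concrete, here is a minimal Python sketch of static instrumentation: the source itself is modified to log a semantically relevant record for every request. The logger name `orders`, the `handle_request` wrapper, and the record fields are hypothetical, chosen only for illustration.

```python
import logging
import time

log = logging.getLogger("orders")  # hypothetical subsystem name

def handle_request(request_id, work):
    """Run `work` and log one semantically meaningful record about it."""
    start = time.monotonic()
    status = "ok"
    try:
        return work()
    except Exception:
        status = "error"
        raise
    finally:
        # This runs on both success and failure, so the software is never silent.
        elapsed_ms = (time.monotonic() - start) * 1000.0
        log.info("request=%s status=%s latency_ms=%.3f", request_id, status, elapsed_ms)

logging.basicConfig(level=logging.INFO)
handle_request("req-1", lambda: sum(range(1000)))
```

Records like this are what make a first question possible when the system misbehaves. What they cannot do is answer a question nobody anticipated when the code was written, and that gap is where dynamic instrumentation comes in.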

There are lots of ways to dynamically instrument the system. DTrace is certainly one of them, but there are many that are out there. Some that are inspired, some that are totally different layers of the stack, but you are going to need the capacity to dynamically instrument your system because you’re going to need to have the capacity to change what it’s doing to emit the datum that you’re going to need to answer your question. That’s what we’re trying to do. We’re trying to answer your question. So both static instrumentation and dynamic instrumentation are really essential.

I want to take a quick aside because observability has apparently become a trendy topic, which is very weird for me. It’s a little bit like… I spell Bryan with a Y. I’ve been correcting people my entire life, but Bryan with a Y is actually more popular now than Brian with an I, which very much upends my whole worldview, like the chip that I have on my shoulder because everyone misspells my name. So if people start spelling my name properly, I’m not going to know what to do with myself. It’s a little bit kind of like observability. Everybody’s talking about this new thing, observability. I’ve been doing observability my entire career.

But I think what’s happening, the reason that you are hearing the term observability more, which is great, by the way, I just want to emphasize that, just like the Bryans with a Y are. The reason you are hearing it more is because people are realizing the limits of monitoring. That we monitor systems that we only operate, systems that we do not mutate, we do not develop, that we receive them from somewhere. We need to monitor firmware, okay? We have to monitor firmware. It’s not observable because we’re not actually developing it, but for software that we are actually developing ourselves, when we deploy the systems that we develop, when operations become devops, which I get is all about culture or whatever.

There’s a technical underpinning there as well, that devops is about marrying the development of software with its operations, then monitoring it is insufficient. We actually want to expand that to observability. I think it’s great that this is happening. This is happening industry wide. Let’s get a word on aggregation because when you’ve got a lot… When you’re instrumenting the system, and now you’ve got a lot of trace data that’s coming out of the system, it is not just tempting but essential to aggregate that data. This is actually an early observation we made in DTrace, was that the aggregation of data really needed to be a first class notion in DTrace. We had to have the ability to aggregate data in situ.

DTrace has aggregation as a first class primitive, and we aggregate on a per-CPU basis and roll that up. It allows you to have pretty invasive questions that don’t actually perturb the performance of the system because of all that in situ aggregation. So aggregation is really, really, really important, but there are also limits to aggregation. When you are aggregating data, you are eliminating time. That time variable is being chucked, and time can be very important. There are questions that you can only answer with disaggregated data. And that’s challenging because disaggregated data is a lot of data. We aggregate for a good reason. This is not an either or, but there are limits to aggregation.
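
The point about aggregation eliminating time can be shown with a small Python sketch. The `quantize` function below is loosely modeled on the idea of a power-of-two frequency distribution; it is an illustration written for this example, not DTrace's implementation, and the two latency traces are made up.

```python
from collections import Counter

def quantize(values):
    """Count values into power-of-two buckets; the bucket key is its lower bound."""
    buckets = Counter()
    for v in values:
        lower = 0 if v < 1 else 1 << (int(v).bit_length() - 1)
        buckets[lower] += 1
    return dict(buckets)

# Two hypothetical latency traces (microseconds): same values, different order in time.
steady = [100, 900, 100, 900, 100, 900, 100, 900]
bursty = [100, 100, 100, 100, 900, 900, 900, 900]

print(quantize(steady))
print(quantize(bursty))
```

Both traces aggregate to the same distribution, even though one alternates and the other degrades halfway through. Once the data is aggregated, only the disaggregated trace can tell those two systems apart.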

We will use aggregation, we have to use aggregation, but we also need to understand its limits, and we need to know when it’s time to actually switch to… And honestly, debugging that firmware problem. If you go back to this firmware problem, this took me longer to debug than it should have because it took me longer than it should have to disaggregate. That is to say if you look at the actual counters when this is running, you don’t see this pattern as vividly. So there were things that I was not seeing because I was aggregating too much for too long. Now that said, we still need to aggregate. It’s not… We all need to aggregate. We want to aggregate. We just need to be aware of its limitations.

Let’s talk about visualization, because visualization is really, really essential. Our visual cortex is amazing at… We find food, we find mates, we find shelter, we find all this great stuff with this visual cortex. We find patterns. We are very, very good at finding patterns. We’re still better than the computers at finding patterns, especially new ones that we haven’t seen before. We don’t need training data to discover patterns, we can discover de novo patterns, and that is a very powerful and unique property. We want to visualize data, and the value of visualizing data, by the way, is more often than not, it’s not to give you the answer. It’s to provoke a whole nother series of questions. Sometimes it gives you the answer, but more often than not, when you visualize data, you’ll be surprised by something.

And maybe it will disconfirm my hypothesis, but it will open up a new avenue of investigation. It will give you new questions to ask. I think that visualizing the systems is really an essential skill, and it’s one that feels like we’re still on the starting block in so many ways. We’re pretty deep into this whole Von Neumann model. It hasn’t really changed that much, but we still don’t really have great ways of visualizing these systems, it feels to me. It feels like we’re still struggling with visualization. Speaking of struggling with visualization, I kind of looked back over my career and realized that the technology that has been with me my entire career, that has… I wouldn’t say never failed me, because it’s too idiosyncratic to say it’s never failed me, but the technology that has carried me more often than not is good old Gnuplot.

I’ve been using Gnuplot, and on the one hand it’s pedestrian or quotidian to say you should plot data, but you should plot data. You should visualize data. If you’re looking at data, you should visualize it, and you should have fluidity in some way of plotting data. Maybe that’s Excel, maybe that’s R. I guess that it’s probably R around here. It feels like R feels very foreign to me still. It feels like a grad student somehow got way off the reservation. But for me it’s Gnuplot. You know when a Gnuplot lover is saying that something feels foreign, that actually that is a value judgment because Gnuplot is a little weird. Gnuplot is a little strange, but it is very powerful, and especially when you combine it with the traditional Unix toolchain.

It’s powerful stuff, and the fact that we’re talking about Gnuplot highlights a very important point. When you are debugging, the tools that you use are not magicians. As you can imagine, people often turn to me like, “Hey, can you DTrace this? Can you DTrace it and do the DTrace thing?” You know David Copperfield is just an illusionist, right? It’s not actually like… Not actually a magician. I don’t actually have magical powers, and I obviously know how to use DTrace, and I’ve used it a lot. But I don’t know anything about your performance problem. I need to go ask all the same hard questions that you need to ask. DTrace can answer your questions. Tools can help you, but they can’t do it for you. They’re not going to do the thinking for you. The thinking has to be done by you, by all of us, and it’s hard. Gnuplot is actually an important one, and certainly not a magician.

Another visualization, this is something we… Actually, we first did a while ago. It was in 2008. I think that we had… This is a product that I developed at Sun. I think this is one of the first products to treat latency as a heat map. I don’t know, I wouldn’t say it was the first, but I’m sure there are others. But a lot of people are doing this now, which is great. Great to see because I think it’s a very useful technique, where you are plotting latency on the Y axis and then the hue, the saturation here is actually determining the frequency of latency outliers or latency at that particular value.
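
For readers who have not built one, a latency heat map is just a two-dimensional count: bucket each operation by when it happened and by how long it took, then shade each cell by how many operations landed there. The Python sketch below renders the grid as text; the bucket sizes, the shade characters, and the sample events are assumptions for illustration only, and it assumes time runs along the other axis.

```python
from collections import Counter

def heatmap_counts(events, time_step, latency_step):
    """Map (timestamp, latency) events to counts per (time bucket, latency bucket)."""
    counts = Counter()
    for timestamp, latency in events:
        cell = (int(timestamp // time_step), int(latency // latency_step))
        counts[cell] += 1
    return counts

def render(counts, shades=" .:*#"):
    """Draw the grid as text: time runs left to right, latency bottom to top."""
    if not counts:
        return ""
    max_t = max(t for t, _ in counts)
    max_l = max(l for _, l in counts)
    peak = max(counts.values())
    rows = []
    for l in range(max_l, -1, -1):
        row = ""
        for t in range(max_t + 1):
            n = counts.get((t, l), 0)
            # Scale the count to a shade; any nonzero count gets a visible mark.
            idx = 0 if n == 0 else 1 + (n * (len(shades) - 2)) // peak
            row += shades[idx]
        rows.append(row)
    return "\n".join(rows)

# Made-up events: (seconds since start, latency in milliseconds).
events = [(0.1, 1.2), (0.2, 1.4), (0.7, 1.1), (1.3, 1.3),
          (1.5, 9.8), (2.2, 1.0), (2.4, 1.6), (2.9, 9.5)]
counts = heatmap_counts(events, time_step=1.0, latency_step=1.0)
print(render(counts))
```

Even in this toy grid the two slow operations sit alone in the top row, far from the dense band at the bottom, which is exactly the kind of outlier that provokes the next question.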

And you can see lots and lots of interesting patterns here. There’s a lot of interesting work looking at heat maps and trying to understand, like, “Why are we seeing this kind of crazy, crazy pattern?” So heat maps have been very valuable, and I think we’re seeing that much more widespread, which is great. Great way of visualizing, and again, provoking questions. Provoking questions of, why are we seeing this outlier? What is actually going on? What’s happening there? Flame graphs are this work that Brendan Gregg did when he was at Joyent, which we’ve used a lot, certainly the industry has used a lot. Flame graphs are a way of taking aggregated profile data and then taking these stack traces and organizing them.

In this case, this is a node program, and we are taking this profile data. This tells you where you’re spending your time. Incredibly valuable when you’re spending your time on CPU, which is great. We should all have compute bound work. If you’ve got a performance problem where you’re spinning furiously on CPU, that can be very, very helpful because it’s happening right in front of you all the time. So you should, ideally, be able to figure out, using time based profiling techniques, what’s going on, where am I in my program. And you’ll see surprising results. Flame graphs have been very valuable and a very good way to visualize the system.
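
The aggregation behind a flame graph can be sketched in a few lines of Python: identical sampled stacks are collapsed into one line each with a count, using semicolon-separated frames followed by the sample count. The function names and the sample stacks below are hypothetical, and this sketch only produces the collapsed text; it does not draw anything.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse sampled stacks (outermost frame first) into 'a;b;c count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

def time_by_frame(samples):
    """Fraction of samples in which each function appears anywhere on the stack."""
    total = len(samples)
    seen = Counter()
    for stack in samples:
        for frame in set(stack):   # count a recursive frame once per sample
            seen[frame] += 1
    return {frame: n / total for frame, n in seen.items()}

# Made-up profile: each entry is one timer-based sample of an on-CPU stack.
samples = [
    ["main", "handle", "parse"],
    ["main", "handle", "parse"],
    ["main", "handle", "render"],
    ["main", "gc"],
]
for line in fold_stacks(samples):
    print(line)
print(time_by_frame(samples))
```

The width of a frame in a flame graph corresponds to these sample counts, so this only tells you about time spent on CPU. A thread that is blocked contributes no samples at all.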

One thing I really like about flame graphs is the way Brendan did this, is this is not actually mated onto any particular data source. What he wrote is an unholy Perl script, because that’s what Brendan does, that’s his shtick, that actually generates an SVG. It takes data from any system where you can gather the data, and it will actually generate an SVG, S-V-G, and SVGs are actually quite useful and kind of inspiring. We had a problem recently where I felt we needed a new way to visualize the system because I had just come off this other problem where the aggregation of data was, I felt, hiding the problem. I was working on a problem where we had a Cassandra benchmark that is… Man, Cassandra.

I’m developing a complicated relationship with Cassandra, but this was not Cassandra’s fault. So you can kind of put my complicated relationship aside. This was a Cassandra benchmark, and we should have been just running the machine full tilt, and we should have been hitting IO very, very hard in this benchmark. And it just wasn’t. It wasn’t that it wasn’t doing work, it was doing some work, but it was doing… We were like 60% IO saturation. I wanted to understand what does that look like if I break it down sub-second. If I disaggregate that, what does it actually look like? What it looks like is a whole lot of state transitions.

What I wanted was a way of taking all these state transitions and plotting time on the X axis and putting each thread on its own line and coloring the different states they were in. It seems like a reasonable thing to do. This is actually not a new way of visualizing the system. Back in the day, when threads were young and you would only create four of them instead of 600, in Cassandra’s case, the thread debuggers would often create a separate line for each thread and you could kind of see what they were doing. But those tools, and I tried to check up on this, those tools have kind of gone away or they don’t seem to be in widespread use. I couldn’t find something that did this anyway, and I couldn’t get Gnuplot to do it.

I did actually try very hard to get Gnuplot to do it, but Gnuplot is just like, “You know what? Let me just plot things. Let’s leave it at that.” I’m like, “Okay, Gnuplot.” Kind of inspired by what Brendan did, which was generating an SVG, because the nice thing with an SVG is you can actually generate a kind of an application when you generate an image. And so I developed this thing called Statemaps, and it’s something I’m just making available now. And actually, this is much better seen interactively, so I’m going to… If you don’t mind, we’ll just get out of that. And let us go over here. There we go. I think that’s the same thing. Okay, so hopefully folks can see that. I’m going to just zoom in a little better.

What we can do is we can actually interactively look at this data, and what we’re seeing here is a… We’ve got a different line for, it may be a little hard to see, but a different line for each thread in Cassandra. And this is, in this case, thread 794. See we’ve got a lot of threads. This is one process, by the way. One process, and you can see that… And the thing that is very striking when you first look at it is the amount of work we are not doing. In terms of the colors here, I’m off CPU waiting, this is off CPU on a futex. Unfortunately, the way Linux works in terms of an implementation detail, this is not necessarily the wrong thing to do. It just makes it problematic to understand what the system’s doing.

You would use a futex both if you were wildly contending on something, and if you’re simply waiting for a condition to become true. So futexes are used to implement both CVs and mutexes, which makes it a little… But in this case, all of these futex waits were not actually blocked on any thread. We are waiting for a condition to be true. We’re waiting to do IO or something. We’ve got some threads that are off CPU on a futex, waiting for a condition to become true. Others of these are off CPU, waiting for the OS to say, “Hey, you’ve got some work using the networking stack.” Then we’ve got a little bit of on CPU. Then the red, and especially the yellow are actually doing work. That’s what we care about, is actually doing work.

This thing is supposed to be… Again, we are supposed to be stressing the IO subsystem, and it’s like, “There’s a whole lot of idle in here.” The thing that was interesting were these striations of idle that you would see. It would do some amount of work for some period of time, and if we zoom in we can see it doing some amount of work, and then no work. So doing some work there, now, see that, on CPU. Imagine what that is. That’s actually GC. This is GC right here. GC is not the problem here. GC is just… But there is garbage that needs to be collected, so it is there. And you can see how long we stop for GC. We don’t actually stop that long for GC.

And you can see that then everyone is ready to go, and that’s great, and we kind of run. Yay, we run, we run, we run, and then we just stop doing work, not too much work. It’s like, “Hey, where are we in the desert here?” We’re in the desert of, “Yoo-hoo, hello? Benchmark, what are we doing? Is there a router popped somewhere? What are we doing?” We’re not furiously doing work. We’re just not doing anything. We’re waiting for work to show up. That is very frustrating, waiting for work to show up. And we tried putting more load on the system, but when we put more load on the system, the periods of idle didn’t reduce. It’s just that the periods of work became much more contended. So I’m like, “That’s frustrating.” It’s like, “Okay, is there some union contract with the Cassandra, where it can only work for like 15 minutes a day or something? What’s going on?”

That’s what it felt like. So with the Statemaps we can take this data… this data is generated by DTrace, but it just generates JSON payload effectively that we can then… You can generate that any way you want on any system you want, in part because I do feel sad for people that are like, “I want DTrace, but I don’t have it.” I’m like, “You know, you can have clean water? We have this technology. You don’t have to suffer through infant mortality. Your baby can survive into adulthood now.” Whatever. But you can actually gather this data however you gather the data, and then you can render it.
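
As a sketch of what "gather this data however you gather the data" might look like, the Python below consumes a stream of state transitions, one JSON record per line, and totals how long each entity spent in each state. The field names `time`, `entity`, and `state`, the state names, and the sample trace are assumptions for this example; check the Statemaps repo for the input format it actually expects.

```python
import json
from collections import defaultdict

def time_in_state(lines, end_time):
    """Sum time per (entity, state) from JSON lines of state transitions.

    Assumes each entity's records arrive in time order.
    """
    current = {}                      # entity -> (state, time it was entered)
    totals = defaultdict(float)       # (entity, state) -> accumulated time
    for line in lines:
        rec = json.loads(line)
        entity, state, now = rec["entity"], rec["state"], rec["time"]
        if entity in current:
            prev_state, since = current[entity]
            totals[(entity, prev_state)] += now - since
        current[entity] = (state, now)
    # Close out the state each entity was still in when the trace ended.
    for entity, (state, since) in current.items():
        totals[(entity, state)] += end_time - since
    return dict(totals)

# Made-up trace for one thread over 200 milliseconds.
trace = [
    '{"time": 0, "entity": "794", "state": "on-cpu"}',
    '{"time": 20, "entity": "794", "state": "futex"}',
    '{"time": 180, "entity": "794", "state": "on-cpu"}',
]
print(time_in_state(trace, end_time=200))
```

The totals are themselves an aggregation, so they throw away the timing again. The value of keeping the raw transitions is that each one can also be drawn as a colored segment on the entity's own line.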

You can see we’re over-zoomed here, in that we have exceeded the resolution that we can offer with the SVG, otherwise the SVG is going to represent literally all the data. So you can kind of make it more fine-tuned and zoom in on an area. I’ve got another one here where I’ve zoomed in a little bit. Here’s where I’ve zoomed in, and now when I zoom in here, you’re going to see that it’s going to hold up a little bit more and we haven’t over-zoomed as much. I think in this third one, I think I really wanted to zoom in on this GC area. There’s the GC. Now we have every state transition is present, so this is only a span of 199 milliseconds, so this is 200 milliseconds. This is a short. This is what happens in 200 milliseconds. This is a fifth of a second for Cassandra.

It’s kind of amazing, right? It’s especially amazing when you look at this and you kind of zoom in on this. It is really mesmerizing because you can see that the GC… Before, when we were zoomed out, the GC just kind of stopped. And it’s actually not exactly what happens. It kind of staggers across a little bit for another 20 milliseconds. Again, it’s not the problem, so we don’t want to obsess about it too much. Not the problem yet. We can see we’re blocked a little bit. That’s probably CPU contention. It’d be interesting to drill down on that, but you can see how when you look at the data this way, it’s like, “Aw, I’ve got a lot of followup questions I want to go ask.”

Of course, the followup question I wanted to ask is, “Why is the system not doing anything?” Which was very, very frustrating. In fact, here’s what the IO looked like. You don’t have to generate these. These are being generated for each thread, and there are D scripts in the repo that do all this stuff. Here is one that is looking at the devices. Now, we kind of knew this from the aggregated data, but this is over a 10 second period. You can see the devices look like they’re about 60% utilized. That’s actually not what was happening. What was happening is 100% utilized, 0% utilized, 100% utilized, 0% utilized, with these idle striations. And they’re not that long.
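
The arithmetic of how a striped pattern turns into a flat 60% is worth seeing once. This Python sketch uses a hypothetical device that is busy for 750 ms and idle for 500 ms, repeating; those numbers were picked so the average comes out to 60% and are not measurements from the benchmark.

```python
def utilization(busy_intervals, window_start, window_end):
    """Fraction of [window_start, window_end) covered by non-overlapping busy intervals."""
    busy = 0.0
    for start, end in busy_intervals:
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            busy += overlap
    return busy / (window_end - window_start)

# Hypothetical device: busy 750 ms, then idle 500 ms, repeating for 10 seconds.
busy_intervals = [(t, t + 750) for t in range(0, 10_000, 1250)]

# Aggregated over the whole 10 second window.
print(utilization(busy_intervals, 0, 10_000))

# Disaggregated into 250 ms windows over the first 2.5 seconds.
print([utilization(busy_intervals, t, t + 250) for t in range(0, 2500, 250)])
```

Over the full window the device reads as 0.6, while the 250 ms windows are either fully busy or fully idle. The size of the window decides whether the striations are visible at all.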

You kind of zoom in on this, it’s not that long necessarily. You look at this idle, this period, this is 700 milliseconds, so you’re getting a 500 millisecond idle period. Little hard to capture that one while it’s idle. Very, very, very frustrating. And so what did this problem end up being, by the way? And this is one we’re still trying to understand all the details on, but we have actually figured out how to unlock throughput. What was happening in the system is it was slightly memory over-subscribed. Now, it wasn’t memory over-subscribed in the sense that applications couldn’t get memory. They got plenty of memory. Everyone was able to get memory, but it was just over-subscribed enough that the kernel was wandering around every subsystem being like, “Hey, if you’ve got any memory for me… Memory’s kind of tight around here. So you got any memory? Give it back to me.”

As it turns out the networking subsystem is a little too helpful. Networking’s like, “I’ve got some memory for you. I have routing entries.” You’re like, “No, no, no, no. Please, please.” It’s like, “I’ll give you my baby.” It’s like, “Don’t give the baby.” “They’re going to turn your baby into dog food. We don’t need the baby that badly.” That’s a little bit of a visceral analogy, sorry. If you’re a new parent and you’re sobbing right now, that’s totally natural, so sorry about that. It had like 500 routing entries. This is 10K on a system with a quarter of a terabyte of DRAM. Do not ever give this up, ever. But it’s like, “Take my routing entries, please.” And it goes to take your routing entries, and they’re just like… I don’t know where anything is anymore. It’s like “rah rah rah”…

It’s like, “Wow, that must go on forever.” It’s like, “It goes on for about 500 milliseconds.” 500 milliseconds of… Then everyone’s going to be like, “Okay, now I know where everyone is again. Okay, I’m running again.” And the kernel is like, “Hey, does anyone have some memory?” And then it’s like, “Hey, I’ve got some memory for you right over here. Got some memory.” It’s like, “I just had another baby.” It’s like, “No, please, stop!” But this is where software doesn’t feel these things. I wish it did. And so how does it actually look when we… This is what it looks like when we actually unlock all that throughput.

And there’s still some questions to ask. I still have some questions about why are we not seeing this 100% red. I’ve got a lot of followup questions that I want to go ask. On the one hand, I wouldn’t say that… This didn’t crack the case, but it gave us a lot of followup questions, and definitely helped substantiate the perception that we’re having these oscillating periods of idle. I’ll show you another one of these, and we’ll wrap up in a couple minutes here. But just to show you another one of these, this is Postgres. A Postgres instance that’s running in production, and yes, this is on spinning media. This is all now blocked on random reads on those spinners. And if you look at this, you can see that it’s telling you what the actual Postgres process is.

This is the checkpointer over here. And I got another one zoomed in on that IO region right here. And so this, again, is every state is actually represented here, so we can zoom in arbitrarily and we can go see what exactly is… And if you go into the minutiae, it’s actually pretty interesting in terms of what’s actually happening. One of the things that’s interesting is you look at these IOs and any one of these IOs you take… Let’s kind of zoom in here. You’ll see this… see a little bit of CPU. CPU, and then IO, IO, IO, IO, IO. And then what about CPU? That shows you how much faster your CPU is than your disk. This little thread, just a little bit of work, and it’s like, “See you in November. Off CPU.”

This so far has proven valuable for us, and again, putting it out there so other folks can play around with it. It’s at a very primordial state, so… I had actually really hoped for it to be in a much less primordial state by now, but Intel really kind of ruined those plans, I’m afraid. I’m afraid that it is still mostly primordial. Let’s go back here. We showed you that one, we looked at that one, there’s that one. I’ll make these SVGs available to you, too, so you can see them and play around with them because they’re kind of neat. Okay, just to kind of wrap up here, and if you’re going to take not much away from this, do take this away from it, that you are debugging these systems when you are looking at the performance.

Treat it like a debugging exercise. Don’t treat it like a tuning exercise, but treat it like a debugging exercise, and go in there knowing this is the hardest kind of debugging. You want to stop making the guesses and stop your coworkers from making the guesses. This can be even more of a challenge when it’s like, “Oh, I know what it is. It’s this. It’s a badger, it’s a badger.” It’s like, “Okay, just settle down.” “It’s a badger. I know it’s a badger. It was a badger last time, it was a badger at my other job, I know it’s a badger.” It’s like, “Okay, badger. Settle down. Let’s hypothesize. Maybe it is that. Let’s actually use that to inform a question. What question can we ask of the system? What data can we gather to explore that idea?”

We have to enshrine observability, and one thing that I would emphasize in terms of the psychology of debugging, that too often, and I’m guilty of this myself, when I debug something, and even as I’ve done it tonight. I’ve showed you these bugs I’ve debugged, and I’m making it look easy, like, “What do you have behind your ear?” “Oh, you’ve got a firmware bug. Look at that.” And making balloon animals out of firmware bugs or whatever. Don’t let me fool you. It’s really hard. It took us a long time and went down a bunch of blind alleys. The people that do this well, they want you to think it’s effortless. It is not effortless. It is excruciatingly hard work, and debugging very much rewards persistence, grit, and resilience much more than intuition because it’s very tempting for you to say, “I don’t have the intuition for this stuff.” It’s like, “Yeah, no one does.”

And sometimes the intuition actually counts against you. What you can have, what anyone can have, is the persistence, the grit, the resilience to just grind until you make forward progress, and it is very, very hard. But debugging is more perspiration than inspiration. And we have to have the faith, ultimately, at the end of the day. Damn it, we made these systems. Every one of these instructions that’s executing right now is executing because a human told it to execute at some level. We can understand this. This is not a natural system. This is not a natural phenomenon. We can understand this. You may need to repeat that to yourself many times when you are suffering through a problem that seems that you can’t understand. I certainly have had to repeat that to myself many times, and so will you, but we can understand them. They are synthetic systems. We ultimately can find the hurricane’s butterfly. Thank you very much.
