On Verification for System Configuration Languages

Arjun Guha

University of Massachusetts, Amherst

Puppet is a configuration management system used by hundreds of organizations to manage thousands of machines.

Transcript

Let’s welcome Arjun.

Thanks for the introduction and this thing is on, right? Great. So thank you all for coming and thanks for having me here. As Ron said, I’m Arjun. I’m a professor of computer science at UMass Amherst out in western Massachusetts.

Yeah, this is one of several projects that are going on in my lab and it’s a pretty typical example of the kind of thing that I like to do, which as he said is bring programming languages and verification techniques to messy systems problems that people generally thought were immune to clean solutions.

Today what I’m going to talk about is system configuration languages, Puppet specifically. Just from some of the conversations I was hearing it sounds like there are people in the room who have worked with Puppet, Chef, Ansible, things of that sort. Could I get a sense if you raise your hands? Okay, awesome.

Be patient at first. I’m going to introduce the system configuration languages for the general audience.

Okay. I want you to think about the problem of setting up a web server. Let’s imagine you wanted to set up a web server. Okay. So the first thing you might do to do that is install your favorite web server. Maybe it’s Apache. Now that doesn’t do anything, right? All that does is set up the default, welcome to Apache webpage. So you’ve got to edit the Apache site configuration, point it to your actual web page. After that you need to actually restart the web server because it doesn’t pick up your configuration changes and then when you go to your web browser you probably made a mistake the first time. Maybe there’s a typo in your directory or something of that sort.

Alright, so you go back to your terminal. You edit that configuration again, restart it again, test, this time it works. Okay. So there’s probably a couple more things you need to do if you’re serious about what you’re doing. Maybe you want to block SSH access, not if you’re using Ansible, that people are starting to learn about this. Maybe you want to mount some sort of backup volume, use a crontab to set up some sort of backup.

Alright, now imagine that your website starts getting a lot of traffic and one web server is no longer enough and so what you need to do is spin up three more web servers. Okay, I hope you all were taking notes because I can’t remember those commands anymore. Right, so this is a problem. In practice if you have even a moderately sized organization you probably have lots and lots of computers, not just three, some of which may be Mac machines, some of which may be Windows machines, some database servers, some web servers and configuring all of these computers is a problem.

So in fact, we all know that software bugs are a problem but a lot of the major systems outages that we’ve been hearing about over the last couple of years have been not software bugs but configuration bugs. Both system configuration bugs and network configuration bugs.

For example, I think this was a year or two ago, the New York Stock Exchange went down for about a couple of hours and the root cause was determined to be a configuration error, not really a software bug. A difference between testing and production environments. A different kind of configuration, so there was a military plane crash that was attributed to incorrect settings in the ECU. Perhaps the worst of these was a Facebook outage. I think it was early last year, Facebook went down for two and a half hours and everybody panicked. People started calling the police. The police replied, not on Facebook because it was down but on Twitter telling people to go outside.

So the system configuration errors are problems that your average human notices. All right. And so that’s where configuration management languages come in. All right. These are just a small sample of some of the configuration languages that are in use today. There are several more that you might’ve heard of. The ones along the top are, let me say, contemporary languages. The ones on the bottom are somewhat older. So there’s Nix, which is a lot of fun. Some people might’ve seen this as like this Haskell-based immutable configuration language. I can talk about how that relates to these in more detail if people want.

Oh and by the way, feel free to just interrupt me if you have any questions. I will deflect questions or tell you to shut up if I need to. Right, so just raise your hand or start yelling.

Okay, so what I’m going to talk about today is Puppet. All right. Why Puppet? One reason is that it’s very popular. There’s over 30,000 organizations that use Puppet the last time that they reported these numbers. Though a lot of the things that I’m talking about, and will talk about, a lot of the techniques and tools that we’ve developed for Puppet should be possible to reproduce for these other system configuration languages too. I’m not going to talk about that in great detail but if you’re interested come talk to me and I’ll tell you what I think the unique challenges are in domains like Chef and Ansible and whatnot.

Okay. So a little bit about how Puppet works. Puppet follows a client server architecture. So you have your Puppet Master, which has configurations for your entire organization and these configurations are pushed down to various machines. Right. That’s all I want to talk about as far as the architecture goes. But one other thing. So, the idea is that as you update a configuration on the master the changes are automatically propagated down to the machines that have the Puppet agent running on them. Okay.

What I want to focus on is the Puppet configuration language, right, and so let me talk about that using a small example. So here’s the smallest Puppet configuration file I could think of. All it does is create an authorized key file in my home directory, right, and in this particular one, this is not good Puppet. I have sort of an in line public key as the content of my authorized key. You don’t have to do this of course. You can also say, “Source”. and have it load keys off some remote server. You can add some more attributes. So it turns out that if this is what you actually write down by default, Puppet creates a file that’s owned by rules, which is probably not what’s intended here. So you can also specify ownership and permissions and group like that.

Now one other thing. If the .ssh directory doesn’t exist, creating this configuration will fail. You can create that directory separately but the simplest thing to do is probably to say, “force => true”, and what that does is create parents as necessary. Now if the user already has something in there and you don’t want to sort of blast it away you might use the backup attribute to say, “backup the older version.” If that’s not what you want to do you can also say, “replace => false”, to not blast away users’ changes in case you want to preserve changes that someone has made locally. And if all of this is too complicated there’s a built in resource called ssh_authorized_key that sort of abstracts management for that file. Right, so Puppet has many, many, many different kinds of resource types as they’re called that do this kind of thing.

So, hopefully you get a sense of what’s going on here, right. So think all about all the sort of low level operations that are involved in doing all this kind of stuff. So what Puppet has given you is, let’s call it a declarative specification of machine state. So what makes a language declarative? Does anyone have any definitions or guesses on that? Yeah?

The way I like to think about it is that it tells you what it should do as opposed to how.

Right. Yeah, and that’s exactly what the Puppet lab says. It tells you what as opposed to how to set up the machine. That’s a pretty solid definition. Any others that people can think of? So is OCaml declarative? Yeah?

I was just going to say it’s idempotent so you-

Yeah, okay great. Anything else? That’s a good one, we’ll get to that. Yeah?

I often feel like declarative is just code for “easy to reason about.”

Yeah, easy to reason about. Okay, so you’re all right. So I think when you say declarative what you really mean is that there’s certain kinds of properties of the programming, of programs and the language that you can reason about. So for example, and I’ll get to idempotence in a sec. So one of the properties that you might care about in system configurations is that they be deterministic. Declarative is not a replacement for testing. You still got to test things, right?

So for example, maybe you write a manifest and you test it out in some sort of virtual machine and everything works fine. You might want to know that when you actually deploy that manifest in production, (I’m going to try to avoid Puppet jargon: Manifest is Puppet jargon for configuration) so when you deploy a Puppet manifest in production you might want to know that those systems are set up in exactly the same way as in testing.

So the other property that’s much talked about Puppet is that it’s idempotent. So what happens, to sort of oversimplify things a little bit, is that the configuration is sort of repeatedly applied to the system to ensure that the system is in sync with the specified configuration and so what that means is that applying the same configuration twice should do nothing. Right, so manifests ought to be idempotent.

So let’s just start with these two basic properties to begin with. Manifest should be deterministic and manifest should be idempotent. Just to be clear, these aren’t properties that I’m making up. If you go to the Puppet documentation you’ll see that they see that feature of the system, first one, idempotency and there’s a desire that system configuration should be deterministic.

So, given that these are desired features of the language is it really deterministic and idempotent? Let’s start with that. You’re shaking your head. All right. So, okay. You can go online and there’s all sorts of blogs that talk about how to write good Puppet code. I’m not going to do that. I’m going to give you some small examples of terrible code. Just bear with me for a moment, just for a couple of slides. We’re going to look at some terrible, terrible system configurations.

Okay. Here’s a really small example of a Puppet manifest. So there’s three different resources. Create a user account, create a file called vimrc, and install the vin package. Okay. Everything seems fine but there’s this thing about Puppet. If you don’t explicitly specify the order in which resources are installed it’s free to install resources in an arbitrary order with some small caveats. What may happen here is it may actually try to create the file before the user account. It actually does in this example and so unless you write down that edge that’s what happens here. It’ll error out unless you write down that edge. Okay?

Let’s talk about why that might be the case. This seems totally obvious. It seems like user account, file of user’s home directory, it should figure out that that edge should exist. Can anyone think why it’s not inferred that that edge needs to always exist?

So imagine you’re not using Linux and you’re using macOS or heaven forbid, Windows. You’ve got to assume that the home directory structure is different on different operating systems which is why that edge is not automatically created. Okay.

So, what’s interesting here is that Puppet tries to be consistent. When there’s choices to be made it doesn’t randomly pick different orders, so it picks different lexicographic orders. So it’s the case that if we rename Carol to Alice it works correctly but with Carol, based on whatever ordering it picks it doesn’t work. Okay?

So here’s another example. Here’s something that you might want to do. You might want to install Go because you like Go and you might want to remove Perl because Perl is terrible. Right? Seems like a pretty reasonable thing to want to do. Okay. So unfortunately under Ubuntu Trusty, this is not true anymore, but under Ubuntu Trusty it turns out that the Go compiler depends on Perl. I don’t know why. It’s not relevant. It just does and so think about what this does. The other thing is that Puppet doesn’t read package level dependencies so it’s up to you to specify dependencies if necessary. So if it executes this one first it’s going to first remove Perl and Go if it’s installed and then when it runs that one it’s going to install Go and Perl. If you run it again in the other order it’s going to first install Go and Perl and then once it runs that it’ll remove Perl and Go. Okay, so this is like a manifest that may have a system flip back and forth between two different states.

Okay. So there’s an easy fix to all this of course. We can simply pick an installation order. Right? So for example, let’s just consider a little bit of a straw man solution. Let’s just take all our resources and fix an installation order. So for example, we’ll just pick that one. If we pick the wrong one we’ll at least get a deterministic error and we can fix it. Out here, well, I would argue that picking an installation order isn’t the right thing to do. It’s that neither order makes any sense but let’s just entertain that solution for the moment. You can do this if you want but the problem with having a fixed installation order is that it tends to break modularity.

All right, so let me introduce one more Puppet feature, which is that Puppet lets you sort of organize your configuration into independent modules. All right so here we have a module called cpp which installs a make, m4 and gcc because that’s what’s involved in building lots of C programs and here we have another module for setting up an OCaml developer’s tool chain. It includes the OCaml compiler and make and m4 because a lot of packages use make and m4. Then we can instantiate both modules in case I want to set up a box that has both C++ and OCaml developer tools.

All right. So the problem is that if these two modules are authored by two different people and they fix different orders between m4 and make you wind up with a cycle dependency graph and the two modules aren’t co-installable. So it’s actually important that these manifests be loosely constrained so that coinstallability is possible.

Okay. Any questions so far? Okay. So I want to give just one last example of Puppet. This is sort of contrived. But here is a manifest that’s not idempotent. All right, so this says we got to run that one before that one. This one creates a file and it copies file-2 over to file-1 and then we say, “File-2 must be deleted.” So you run it once, it works, you run it a second time, that one blows up because the second operation deleted file-2. Right? So that’s a triple manifest that’s not idempotent.

Okay. So stepping back. What we’re after is showing that Puppet manifests are truly deterministic and idempotent. Before we think about how we’re going to actually establish that, let’s think about the various things that make this hard to establish. So one of the issues in Puppet is that there are many, many, I think over 50 different built in resource types. This is a slightly outdated list and there’s many, many more different custom resource types that people have written using Puppet’s Ruby based plug-in framework. So all of these resource types, since we’re installing software they fundamentally set things up in a file system, there’s possibilities for endless interactions between them and it’s the interactions where these properties are violated.

So apart from this sort of vast number of resource types that Puppet has, Puppet is also actually a pretty complicated language. So this is kind of small. It’s like a list of features. This is one of those tables where people brag about how great their language is by numbering the features it has, which I have strong opinions about. So as you can see they’ve kept adding and adding features as versions have gone on. They’ve also removed features, which is sort of troubling when you’re a programming language. One thing I’m happy about is that Puppet hasn’t had dynamic scope since 2012 so if people don’t know what dynamic scope is, we don’t have to deal with that but Puppet still has a lot of crazy linguistic features many of which aren’t found in your conventional programming language.

Okay, so I’m not going to get into all of this in great detail. Frankly a lot of it is pretty boring. The interesting thing here is that a lot of these crazy features of Puppet can be modeled as datalog roles. And you can effectively take a Puppet program and from it generate a datalog program which, when evaluated, the generated tables encode a graph between primitive resources. All right. So this is the first step in our tool chain. We go from Puppet with all its crazy complexities, via datalog to just a graph of atomic resources with attributes.

And what lets us do is think about properties not of Puppet but of resource graphs. Let me briefly talk about other system configuration languages. Everything I talk about from here on will really be about resource graphs, not Puppet, so if you wanted to force ideas to something else like Chef or Ansible, it’s about building a different front end here.

So I want to first start talking about showing that Puppet configurations, really resource graphs, are deterministic. Okay, so quick straw man. I want to point out that testing is intractable. Right, so one thing you can imagine doing is taking a Puppet manifest and trying to install resources in all possible orders and verifying that they all sort of produce the same result but you very quickly get into a state space explosion problem. All right, so you can’t really do that. So that leaves us static analysis. Yeah, okay so the problem with static analysis … Yeah, go ahead?

The biggest counter argument [inaudible 00:20:26] a lot of values for each note of the graph. Is that what you’re basically saying?

I’m sorry, could you repeat the question?

Are you positing that [inaudible] practice where each configuration value has many values and it has many relationships over any other configuration values and that’s why it explodes?

I’m saying that in a well structured Puppet manifest there’s potentially an exponential number of possible orders in which things may be installed, right. Given a fixed manifest Puppet will always pick one but once that manifest is composed with others, other installation orders get exposed and exploring that exponential space is intractable. We’ve also tried it with Docker containers and things of that sort. It takes a long time even for pretty trivial manifests. Is that clear?

Okay.

Okay, yeah?

You mean it’s intractable in the sense that it’s impractical to spin off machines [inaudible] as opposed to?

Yeah, I mean you would need … it’s just exponential search space.

Okay.

Okay, so that leaves static analysis. The problem with doing static analysis is that all these resources mutate machine state and again, remember there’s like 50 different resources and there’s endless possible interactions.

Okay, so the next step in our tool chain. We go from graphs of resources where a resource is like a labeled node with several attributes like a filed back the contents, et cetera, to a graph of little file-system manipulating programs. All right. So what these programs are are little imperative programs written in, think of it as basically the Y language for people who have seen this in program analysis textbooks, small imperative programs with built in primitives that let you modify systems, that let you model updates and tests on system state. So instead of getting into that languaging detail I’m just going to show you a little example.

This is somewhat simplified from what our system actually does. Okay. So for example, let’s say we had that resource so we’re going to create the authorized keys file and its content is that. So that turns into that program and so some of the details of Puppet are sort of becoming clearer here. So what this does is check to see whether that’s a file. If it is removes it and then creates it, right? That’s what that program does. On the other hand, if you had the replaced false attribute we get that program so if the file doesn’t exist we create it but if the file already exists we leave it in place. Then if you say, “Force => true”, to create the directory tree we get this much longer program that sort of sets up the directory tree.

Yeah?

So the thing to the left of the program-

Yeah, this is-

Right is also a Puppet program?

No, the thing on the left is a Puppet program. The thing on the right is like our modeling language which is this like imperative program with low level file-system operations like [inaudible]

[inaudible] language you’re using for explaining what things-

Yeah, this is a language that we made up that’s the simple language on which we’re going to do our analysis because it makes the interactions between Puppet resources manifest.

And the real thinking teaching language of the resource on the left hand side is like-

Is Ruby, yeah.

And are you going to analyze the Ruby itself?

No, we’re not going to do that. I’m going to get into things of that sort later but for the moment what we’re doing is we’re modeling the semantics of the resources using this language so this is an abstraction. Certain kinds of details are left out. But I’ll get to sort of real code analysis towards the end.

Okay, so for people who know Puppet, for example, exacts don’t work here. I’ll get to exacts in a moment for how they fit into the story. Okay. All right, any further questions here? All right, cool.

So those are files. Another really important resource is packages, right? So what packages, so for example when you say, “Package ensure present”, what your system does is it uses the package manager to create scores of files on your file system so what we do is we have a little sort of web service that in containers runs either repoquery apt-file, the idea being that you can sort of plug in more containers from more environments and we get the list of files that the package creates and we model it as a really long program that creates one file for each file that’s returned to the listing.

For the purpose of checking determinism the actual contents of the file is not so important. What really matters is just knowing that each file has a potentially unique content. The abstraction is sort of complicated and there’s certain kinds of crazy situations where you can get false positives. For example, if a package creates a file with say, some contents X and then the user overrides that file with the same contents X, we wouldn’t detect that but apart from those kinds of details it’s an abstraction that works reasonably well. As far as soundness goes this doesn’t support post install scripts, if you know what those are. I’ll again talk about those kinds of things towards the end.

Yeah?

So, who’s going to manage the real interpretation of the Puppet, like you said, the Puppet language is just the declaration of the snippet and then to translate to apt-file-

I mean, we do that. So we have a system that-

So it’s done by the Puppet package?

No. So this is what the Puppet package auto would have written. Our system reads in, schleps in all the Puppet code and for packages this is what it does to build an abstraction off what the package does. Right, so packages fundamentally create a whole a bunch of files and directories and we use repoquery or apt-file to get that listing. That’s basically all there is to it. All right. Okay.

We’ve done this for I want to say 25 or so common Puppet resource types. There’s a bunch of sort of macOS and Windows specific Puppet resources that we haven’t deal with. We’ve sort of focused on Linux, which is the more common use case.

Okay, so popping back up a little bit. You can think of the semantics of one of these programs, when I say these programs I mean these programs that we’re generating from Puppet manifest, you can think of these as functions that sort consume a file system and produce a file system. So you get an input file system and reproduce a file system unless of course preconditions aren’t met in which case an error is signaled. So that’s what the type there is but when you put these programs together in a resource graph that isn’t fully constrained the programs may run in any order and so the semantics becomes, we get an input file system and we produce potentially a set of file systems including errors.

The property of determinism is simply ensuring that for all input file systems, the cardinality of that set is one but each input file system should produce exactly one state and that state may be error but that’s okay.

So given that, what we do is take a function with that signature and think of it instead as a relation between a file system and a file system and then you can specify non-determinism relationally like this. So think of non-determinism as, basically, if you can show that there exist three states, sigma one, two and three such that they’re related like this where sigma two and sigma three are not equal. What that means is on this input state sigma one, we get to two different input states, sigma two and sigma three so that’s what it means for a manifest to be non-deterministic.

All right. Okay and so what we’re going to do and I’m not going to get into this in detail is given this relational specification we then model the semantics of our language in a sat-solver in the Z3 theorem prover and use the theorem prover to sort of either find sigma one, sigma two and sigma three, which then shows that the manifest is non-deterministic or if it can’t find one that tells us, that’s essentially a proof that it is deterministic given our model.

So, there’s a couple of issues in pulling this off. So I think we’re going to talk about three of them. So encoding this problem as a sat solving problem only works well if you’re reasoning about a finite set of states, and if you think about the space of possible file systems that’s essentially an infinite space. The other problem is that the size of each state is exponential because each state represents a file system, and the other is that if you do this naively translating that graph into a formula that encodes all possible interactions produces a formula that’s exponentially large. So this is all great but we’re not off the hook yet.

Just for people who know things about sat-solvers, we’re talking about formulas that have on the order of a billion variables, which is intractable for even contemporary sat-solvers.

Okay, so how do we deal with this problem? So the first thing we’re going to do is we’re going to use an old technique called partial-order reduction. It’s a really simple idea. Basically if you have a state and you have two actions that you’re applying, the square action and the circle action and you can prove that they commute, you don’t need to explore both paths. You can just sort of explore that one or the other one. So the way that you normally do this kind of check is by having some sort of … Well, okay so what this means is that you can generate that graph instead of this graph assuming that you can do this and do this fast.

Now, the way that this normally works is that you look at these actions or these programs, square and circle, you sort of calculate the set of locations that they read and write and if you can show that you have two programs that don’t write to the same location, which is what these overlapping sets show, then it doesn’t matter in what order you sort of run them. Okay.

So this is the kind of thing that becomes relevant when you’re doing analysis of concurrency programs and whatnot. Okay, so that’s fine. Unfortunately this doesn’t quite work in the context of Puppet mostly because of packages. So the semantics of a package is that it creates its directory tree if necessary. So what that means is that two packages almost always may interfere with each other because we have a set of common files, standard system paths, in which programs are installed and two packages may need to create that directory tree if necessary and /opt is a great example because that’s a pretty common path but doesn’t always exist in base systems.

Okay, so what’s needed is a sort commutativity check that’s specialized for this domain and what we essentially do is something like this. We basically abstract programs not just through the sets of locations that they read and write but we also abstract them to the set of directory trees that they create. Right, and once you do that you can actually do this fast check. You can take two programs, see that they definitely don’t interact with each other in any way and just consider one ordering instead of both orderings and a common thing to do is to have a manifest that installs, I don’t know, let’s say 10 packages, and you know that you don’t have something crazy like that Perl/Go example that I showed you earlier. Let’s say you have 10 packages and no conflicts between them and you can rely on the system package manager to install them in the right order and install dependencies as necessary. So there’s just no need to specify dependencies in between them. And if you do the naïve thing, the usual thing, you’ll run into this problem and the system would have to consider all possible interleavings of these 10 packages which is [inaudible]. On the other hand if you use our specialized commutativity check it’s able to figure out the packages don’t interfere with each other.

Yeah?

Do you get any weird cases where the makers set some property in the directories, like permissions?

Yeah. Okay, so in this work we didn’t model permissions. That’s not something that typically happens. That’s typically not the cause of determinism, bots makers do set permissions. There’s something I want to talk about briefly towards the end when we get into modeling permissions in great detail because permissions turn out to be relevant for another problem we’re going to talk about. In a couple of slides I’m going to show you benchmarks. The thing to keep in mind is that those benchmarks assume that models don’t have permissions in them so once you have permissions things become a little bit slower.

Any other questions? All right.

So that’s one thing. That deals with the problem of formulas that are exponentially large. One way to do this is to generate this thing and then prune it down to that. That doesn’t help because you still create the exponentially large optics but the trick is to leverage this check to directly generate that and never generate this. So that deals with the problem of formulas being exponentially large.

I want to briefly talk about the problem of state size. The main thing that we’re modeling is the state of files in the file system. Well, really the state of paths in the file system to be precise. There may be nothing of that path, the path may be a directory, the path may be a file and so on. So for example let’s say that you have a simple package that installs git because git and the lxde-icon-theme package to pick a random example which installs like 7,000 icons. Because it has to, right? So if we think about how we model this, we take these, using our crazy Docker based apt, repoquery tool chain, this will turn into a program that essentially creates 500 files. That will turn into another program that creates 7,000 files. The way you model this is that you say, “All right, in the initial state I have 7,500 files that don’t exist. Then maybe I run git and that creates 500 files and 7,000 files don’t exist and then once I install that package all these files exist.”

So that’s why states are really large. Because if you do this naively, we’re tracking the state of all of these files but I mean come on, think about it, right. Git and this package of icons are not interacting with each other in any sort of interesting way. So this is a dumb problem to have. So let’s think about the property that we’re actually interested in proving right? Remember we don’t actually care about modeling Puppet in all this gory detail. All that we really care about is proving that things are deterministic or coming up with a counter example that shows that two resources in different orders produces something that’s non-deterministic. So again, thinking back to commutativity, if you can identify a block of code in let’s say, in my square. So let’s say this blue triangle and you can prove that it commutes with everything else, everything else in there. It turns out that that inequality holds, even only if that inequality holds because of commutativity and then you can actually pull blue out of the equation completely and that inequality holds.

Right, so basically what we’re able to do is build up this big model and then pull out chunks that don’t matter. So let me show you at a high level what that means. So let’s say you have a manifest that looks like this and this is a common bug at least from what we’ve seen. So we’re installing the Apache package and we’re then creating an Apache configuration but here what happened is that the user forgot to say that, “Oh, we should only create this file after the package has been installed. Okay, so what our system will do is it will take all the hundreds of files that Apache creates and it’ll say, “None of these files matter for the properties that I’m trying to prove which is non-determinism, but it will keep around the Apache two file because it does interact with this other resource.

Or here’s another example. Let’s say that we’re creating two user accounts, Arjun and Rian and I’m also creating a file in my home directory. If this is all that you’re working with our system will sort of prune out everything that has to do with the Rian resource but keep around /home/arjun because it interacts with that file. So these are some of the tricks that we use to shrink down the state space.

Yeah?

When you say you prune files that don’t interact with anything else, do you just mean that if apache writes a file like user bin apache-

Yeah.

[inaudible] like reads or writes that file [inaudible], what’s the property that-

Let’s say you install Apache and it creates a user in Apache and then you have some other resource that reads or writes to it. Then that part can’t be proved because it may be that there’s some sort of non-deterministic interaction with it.

Okay.

But in practice, you probably aren’t doing anything else with usr/bin Apache, right, so in practice those kinds of paths will get pruned out from the model. Basically we created this super detailed model to account for everything that could possibly go wrong and then these tricks help prune out the things that obviously don’t interact in ways that are non-deterministic.

This is sort of asking his question again but is the property for nothing pruned that there exists a pair of resources that touch a path, that read and write, that both write to that path or something or both write…

So the precise property is sort of this. So let me try to explain it. Let me try to explain it in more detail, right. Imagine that square and circle are boxes that represent the program that abstracts of two different resources, right, and what we’re trying to do is show that determinism, what we’re trying to do is show that when you run them, square followed by circle, you get a different result than when you do circle followed by square. Right and so we’re trying to sort of find some input state sigma that will drive these two programs to produce, these two different orderings of the two resources to produce, a different outlook. So the property is that if you can find a section of code in square that you can commute with everything else then that inequality holds, if and only if that inequality holds.

So you’re doing this like pair wise analysis?

Yeah.

You’ll do this analysis whether individual pieces can be pruned. It’s not like globally you prune things. It’s like as you’re doing individual…

Right. So what we’re actually doing is that we’re…so let me back up a little bit more. So we are generating formulas that represent graphs that look like this and what we’re actually trying to do is we’re trying to take fragments of resources and moving them down to the leafs where they’re the last thing that’s down. So basically if you can take a chunk of code and push it down to the very leaf of this formula that we’re generating, what you’re eventually saying is that this is the last chunk of code that I run, and whether these two paths lead to different results or not will not depend on whether this chunk of code runs and so then you can pull out the chunk of code and if the chunk of code is working with a set of paths that nothing else uses, well then you don’t need to reason about those paths at all.

All right, so there’s a chunk of code that, let’s say, creates user in Apache and a bunch of those other things. If you can sort of push it all the way to the end on all possible paths and nothing else works with usr/bin Apache, well then you don’t need to represent usr/bin Apache at all in the model. Is that in the clear? Okay.

All right, so where was I? Okay, cool. So one last technical detail I want to talk about is about finite state [inaudible]. So all this [inaudible] technology and whatnot. It works well if you’re dealing with a finite state space. So the problem of course is that there’s effectively an infinite space of possible file systems that you may need to reason about and so here’s something that you may observe. So if you take a public manifest it’s only like creating a fixed set of files, so maybe you’re dealing with a few thousand, few hundred thousand files so it seems like all we need to do is reason about the files that are mentioned in the manifest. So we don’t need to actually consider, for example, some other random path. If all we’re doing is installing Apache we don’t need to think about the paths that the engine X web server creates. We can work one manifest at a time.

This is almost true. The only caveat has to do with how empty directories work. So let’s say you have this program P here, which checks to see if /a is a file, if so it removes it and then creates /a and this other program, it simply creates /a without trying to check if it’s a file. These programs aren’t equivalent so there is a file systems state which sort of shows that they’re different and to reason about that you don’t need to think about all possible file system states. You just need to sort of think about file systems that have the path /a but consider these two programs, right. So what this program does is it checks to see if /a is a directory and if so it does nothing and if not it blows up, if it’s not a directory, whereas this other program, q prime, it checks to see if /a is an empty directory in which case it sort of does nothing, otherwise it blows up.

So these two programs are different but to reason about the fact that they’re different you need to say, “Well, imagine there was a path, imagine there was a file inside /a.” It’s not enough to just look at the paths that are mentioned in the program to tell that these two programs are different. You need to add another path to the set of paths that you’re reasoning about to tell those two programs apart. Okay, so there’s some subtly here. If you do the naïve thing you’ll get a result that’s unsound. Once you realize that this is a test that programs do you have to add a few more points to your representation of state and it sort of works out.

Well, okay so let me do this. I think I’m going to hazard a very quick demo of what our system does before getting onto numbers and whatnot. And, oh I’m off the wifi. It’ll hopefully get me back on without much drama. Let me go to a website. Let’s go to Yahoo. I haven’t been there in awhile, I don’t want to get us something that’s cached. All right.

Okay. So here’s our little web-based demo of what our tool does. So there’s a couple of sizeable examples here and we’ll talk about them once I show you some benchmark results. So let me just go over a really simple one. Can people read this, by the way?

No.

Okay, let me make it a little larger. Well, this is bootstrapping. Super clever. Is that a little better? So this is just the example that I had earlier. I have creating a package vim. The file home/carol/vimrc and the user carol and so when you run our tool it’s called Rehearsal. It thinks for a little while and it comes back with this. It says, “Checking if manifest deterministic failed.” Then one thing that’s nice about the system is that it reports the failures as a fragment of Puppet code. So it says, “This dependency may be missing”, so in this example you can actually copy this fragment of code, paste it in here, run it again and it comes back saying it’s deterministic and it’s idempotent. I’ll get to idempotence in a second.

So it’s not always set to do this. You’ve got to read error messages and process them and make sure that they actually make sense. So there’s some examples here of manifests where it comes up with a correction that will make it deterministic but if we’re honest with ourselves, it’s not the right fix. Anyway, that’s linked to from my website. So let’s go back here.

Yeah?

Part of the way this is running, is literally like spinning up a Docker instance to run the little snippets to generate their interpretations or is that done somehow in advance so you know…

The interpretation of each resource is written by ourselves in our modeling language so Docker is not necessarily there. The Docker instance is necessary to come up with an interpretation of packages. So to come up with an interpretation of package we need to run repoquery or apt-file to get the package list. It’s not doing that live in this case. It has a little cache but if you throw in some package name that I haven’t tried, yeah it’ll spin up a little Docker instance and do what it needs to do.

Okay, so that worked, amazing. So let me skip through that. Back up slides. Okay, so quick graphs. All that’s sort of happening here is that I’m sort of pointing out that all the sort of hacks that reduce the sizes of state space are really necessary. So here are two manifests that are timing out if you don’t have all the hacks and you just model the thing naively but with the optimizations and heuristics the manifests are sort of analyzed in a second and a half or so. There’s about I think 13 or 14 benchmarks on this graph and a couple of them are marked non-det, which means they were non-deterministic. To be clear, these are all manifests that we found by scraping GitHub and so there a couple of manifests that we found that were truly non-deterministic.

As far as what’s happening here, the reason that with our heuristics things go a lot faster is because what our heuristics do is they cut down the number of files that the system needs to reason about so without a heuristic here for example, this one which times out has a model that’s reasoning about 1,500 files and 1,500 files plus all the possible interleavings of resources makes the thing time out but with our heuristics it cuts the number down to 250 files or so. Right, so it’s sort of pruning down the model to just a set of things that may interfere with each other and it’s leaving out the things that don’t obviously not interfere with each other.

Okay, so … Yeah?

I had a question about how you [inaudible].

Yeah?

You started going into this detail where you’re going to be worried about whether the directory was empty or not.

Right.

Is that kind of like a detail of the instruction that you kind of left out or to add? Is that sort of an example of that layer for the extraction? Do you also have to worry about for instance a hash [inaudible] some of the files aren’t installed [inaudible]

So for this work we model file equality so the system reasons about where the files are the same or not. It doesn’t model the contents in any detail so if you have a manifest that reads the content of a file, cuts it up or something and puts half of it here and half of it there, that won’t be captured accurately. On the other hand if you have a manifest that’s sort of copying one file from here to there that will be modeled precisely. So we also model directories and anti-directories in great detail. Something else that was asked earlier, something else that we don’t model are file permissions and modes and things of that sort because they don’t seem to play a role in these kinds of determinism bugs in our experience but there’s something that I’m going to talk about very briefly towards the end where we sort of bring that back in.

Okay.

Okay so once you do all this you prove that the darn manifest you’re dealing with is deterministic. What you really get is this. So you get a go from this relational semantics for Puppet to thinking of Puppet programs as just functions from input file systems to output file systems. That’s really what determinism gives you. Okay, and once you know that you’re dealing with functions a whole bunch of other things can be done so for example, we looked at this briefly in the demo, proving that a function is idempotent is super easy with a sat-solver where F compose, F equals F. The other thing you can do is check preconditions and post-conditions. That’s something else that the web-based demo shows. They have this little goofy insertion language which lets you write down post-conditions if you want. There’s certain cases where I can imagine that being useful.

What I like about determinism and idempotence is that they are sort of like these meta properties that you sort of always want and you don’t have to come up with annotations or anything of that sort. So what’s this? Okay, so this is another graph that shows that checking idempotence is fast and the reason that this is fast is not because we did anything clever. It’s because we’re just dealing with functions. Yeah, it’s really fast. Okay, so in another minute or two I want to briefly talk about two other things that we’re doing in this space and this is a little more work in progress. It’s not published graphs yet. So something that I’ve alluded to earlier, something that we’re ignoring in Puppet so far is shell scripts.

I’ve shown you that Puppet has all these beautiful abstractions for creating packages, files and all of this stuff. And the point of this is that you should be using the abstractions instead of writing your own shell scripts. Unfortunately there are times when you really have to do that and this is my favorite example. So let’s say that you wanted to install the Oracle JDK on a Linux machine, right. We all want to do this because Oracle JDK’s terrible. Okay, so this is what you have to do. So it’s not standard on Ubuntu at least the last time I checked so you’ve got to add a PPA, update and run Oracle, Java 7 installer. Right. So have people done this? I’m just curious. Okay a couple of people who are programming in Scala presumably.

It’s easy to take this and translate it into a Puppet manifest that looks like that. Unfortunately when you do that it doesn’t work. It blows up in your face. Does anyone know why? Yeah?

[inaudible].

Yeah, yeah. So Oracle in their infinite wisdom has one of these interactive license screens and since Puppet runs in a non-interactive shell it says, “Well, I can’t show you the license so I guess you didn’t accept it so fail to install package.” So how do you solve this problem? Okay, what you do is you go to stack overflow and you ask how do you say auto yes to the license agreement and then the answer with the green check marks has that little script. It doesn’t matter what it says. You just blindly paste it into your manifest using the exec resource which is just short for resource, which is just like a shell script and it works. I love taking shell script from the internet and running them on my console.

Okay. So something that we’ve been working on currently is learning finite-state models of shell scripts from I/O examples, running them in containers perturbing the input, looking at the S trays, trying to drive it to explore different states to learn a finite-state model. I want to emphasize that a really hard problem would be if I told you, here’s wget, come up with a model of it. That would be a really hard problem to solve but that’s actually not the problem that we need to solve here. The problem we need to solve is more like model wget, that URL. So it’s a much simpler problem to work with. We’re not trying to model commands given arbitrary input. The way you use shell scripts in a Puppet manifest is that you’re invoking particular arguments.

Okay. So the other thing that we’re working on is a little bit different. Okay, so the point of doing this is that once we do this we’ll be able to send many more manifests through our system. I think we found that at least half, about half the Puppet manifests that are out there on GitHub use embedded shell scripts and our tool won’t work until we get this thing working.

So the last thing I want to talk about is a slightly different problem. It has to do with fixing bugs. Okay, so let’s say that you write this manifest. It’s creating a vimrc file that says, “Column-number-mode-true.” Right. You deploy it on a machine and then you still got to test. It’s easy to show that it’s deterministic. You still got to test so you test by firing up vim and it says, “Error detected while processing, not …” So that’s how you turn on column numbering in Emacs. Right. So how do you fix this problem? So you remove the file so that you make the error go away. You fire up vim, let’s say, and then you type in set ruler because that’s how you turn on column numbers in vim and then you fire up vim again, you observe that everything’s okay and well, unfortunately what you need to do is you need to then sort of apply this edit back to the manifest and test again just to make sure you didn’t make a mistake the second time.

So basically what you need to do is original configuration followed by a bunch of shell commands but then you need to sort of come up with the updated configuration. So what we’re working on is effectively calculating this configuration update given the original configuration and the sequence of shell commands. It’s sort of a search program in a funny domain. One way to think of it is that Puppet development has a really, really long edit, compile debug loop. You’ve got to deploy a system and this will hopefully make that loop go a lot faster in certain cases and there’s a bunch of interesting problems because it’s often the case that there may be several possible configuration updates and that’s really what we’re sort of exploring right now is how do you present several possible updates to the user.

All right, so that’s all I wanted to talk about today. Thank you all for coming. I’m around and I’ll be happy to take any more questions if you have any.

On Verification for System Configuration Languages

Transcript

The next great idea will come from you