Abstractions for Expressive, Efficient Parallel and Distributed Computing

Lindsey Kuper

University of California, Santa Cruz

Parallel and distributed systems are notoriously difficult to build correctly or efficiently. In parallel systems, the manipulation of shared state can cause unintended behavior in the presence of unpredictable task scheduling, while in distributed systems, the manipulation of replicated state can cause unintended behavior in the presence of an unreliable network. Meanwhile, decades of research have not yet produced a general solution to the problem of automatic program parallelization.

In this talk, I'll discuss how my research addresses these challenges from both theoretical and applied points of view. My work on lattice-based data structures, or LVars, proposes new foundations for expressive deterministic-by-construction parallel and distributed programming models. My work on non-invasive domain-specific languages for parallelism gives programmers language-based tools for safe, deterministic parallelization. The guiding principle and goal of both of these lines of work is to find the right high-level abstractions to express computation in a way that not only does not compromise efficiency, but actually enables it. I'll conclude by discussing the role that this principle of finding the right efficiency-enabling abstractions can play in my ongoing investigation into SMT-based verification of neural networks.

Transcript

Thank you. Is the mic working?

Yeah.

Great. Well, thanks to Ron for that wonderful introduction, thanks to all of you for being here. So, yes, as was said, I’m Lindsey Kuper and I’m really excited to be talking with you today about abstractions for expressive and efficient parallel and distributed computing. So I recently joined the computer science and engineering department at UC Santa Cruz, over on the best coast, and in my job there and before that, as was mentioned for four years at Intel Labs and then before that, during my PhD work, I’ve spent a lot of time thinking about two big problems in computer science. And these problems are, first, how do we build correct and efficient parallel systems? That is, how do we build systems that let us do many computations at the same time, so that we can compute as fast as possible and yet still correctly?

And, second, how do we build correct and efficient distributed systems? That is, how do we build systems that let us do computation in many places across a network that’s subject to network partitions and unbounded latencies between those places, and yet still do it correctly. And these problems of how to build correct and efficient parallel systems, and how to build correct and efficient distributed systems, each have a research community around them, and the communities have each developed their own culture and their own vocabulary and their own formalisms. And so, I think it’s helpful to try to find the unifying abstractions that can help us understand and hopefully make progress on both, and to that end I work on programming language based approaches that address these problems from both theoretical and applied points of view.

And this talk is going to have a little bit of both, so, on the theoretical side, I’m going to talk about my work on lattice based data structures or LVars, and this is new foundations for expressive deterministic parallel and distributed programming. And LVars generalize previous approaches to guaranteed deterministic programming, and I’m going to show how we used LVars to parallelize some applications that have traditionally been hard for deterministic parallel programming models, and I’ll talk a little bit about how this worked, it kind of extends naturally to a distributed setting, too.

And then, on the more applied side, I’m going to talk about work on what I call non-invasive, domain specific languages for parallel programming. And this is work that I did with my Intel colleagues about safe deterministic parallelization without requiring invasive changes to the source code that you’re parallelizing. And my colleagues and I at Intel implemented a non-invasive DSL called Parallel Accelerator, that we released as a Julia package and we used it to parallelize code written in the Julia programming and made it run orders of magnitude faster.

And the guiding principle and goal of all of this work, and the theme of this talk is to try to find the right high level abstractions to express computation in a way that not only doesn’t compromise efficiency, but actually enables it. And it seems at first like it’s a contradiction, right? Because ordinarily we think the received wisdom is that making things high performance requires us to go low level and close to the metal, but it turns out that if we have a high level representation of programmer intent, then that can enable smart scheduling choices or optimizations that would be hard or impossible otherwise.

And this principle of finding the right high level abstractions to enable efficiency that’s been so fruitful in this work on parallel and distributed systems, well, it turns out to also bear fruit in an area of research that I recently got excited about which is using SMT solvers to formally verify properties of neural networks, and especially those that are used in safety critical systems, and doing that more efficiently than was previously possible.

So I’ll talk a bit about that, and then finally I’m going to try to outline a research agenda that connects that work back to language based approaches to parallel and distributed systems. So, it’s a lot to cover in one talk, but let’s jump in.

So to start with, let’s talk about deterministic programming. And by that, I mean programming in such a way that given certain inputs to a program, the observable output of the program is going to be the same, and I’ll try to illustrate this with a story. So, let’s suppose that one day I’m shopping at my favorite large corporate online store. And right now, my shopping cart is empty. So it just so happens, and those of you who’ve heard me give talks will already know this, that at my favorite large corporate online store, all the code is written in Haskell.

So, the programmers have defined a data type that was all the items that are available for purchase, and they’re representing the contents of my shopping cart as a key value map where the keys are items, and the values are the quantities that I’m buying of those items. And this code happens to be Haskell, but we could do the same thing in OCaml or any language of your choice. So this map is stored in what’s known in Haskell as IORef, and that means that it can be shared between parallel tasks, and when this IORef is first created here, it’s initial contents are an empty map, meaning that my cart is empty.

So all the items that I buy have to get added to this cart somehow. So, in particular the store that I’m shopping at is known for its fine selection of books, so I add a book to my cart and doing so launches a thread that modifies the contents of the cart and adds the book that I picked by calling this insert function. I add a quantity of one. And this happens asynchronously, so I can keep shopping while that’s going on. So the store also offers all kinds of other fine merchandise, so, along with my book I add a pair of shoes to my cart as well, and that thread runs asynchronously, as well.

Okay, so now, suppose I want to look at the contents of my cart and suppose that that computation happens in yet another thread, but this time I want the value that the computation returns, so I have to wait for that thread that I spawned to complete, I return my carts contents and that ends up being what this whole program evaluates to. So, what do you think the contents of my cart are going to be? What does this program evaluate to? If I were to print P at this point, what would I get?

Well, I tried running this code on my laptop and I ran it a few hundred times, just to be sure, and the answer seems to be, it depends, right? So, the first time I ran it, I got the book and the shoes. But sometimes I just got the shoes. And sometimes I just got the book. So this is a non-deterministic program. And that seems less than ideal. It would be nice if when I look at my shopping cart, what I see is not subject to the whim of the language runtime system and how it chose to schedule the addition of items to my cart. And although this particular program is only running on one machine, the problem is only exacerbated, of course, when we run it in a distributed setting on many machines and we’ll talk more about that later.

So this happened because our program has two tasks that are sharing state with each other, but sharing is a nice thing to do, right? Sharing is caring, and these tasks are not so much sharing, I say, as fighting viciously over a piece of state. So the slogan that I like to use is that if we want determinism, we have to find a way to share nicely. So, what went wrong here? Well, one way of explaining what went wrong is to say, well, this program that we wrote is undersynchronized, and we need to put in more synchronization. And if all we wanted to do was fix this undersynchronization in this particular program, then we could do that. So, in this case, we’d probably have to return something from each of these two async calls, and then wait for those actions to finish, and then, and only then, do the read.

And this process of having to put in synchronization barriers in the right places to make sure that your code is correct is pretty common. And I would bet that a lot of us have had to write code that did the equivalent of that at some point in our lives. But rather than fixing individual programs in an ad hoc way like this, what we might like to do instead is come up with a programming model such that all programs written in the model are guaranteed to be deterministic.

So let’s look back at the non-deterministic version of our shopping cart program. So we can see that this code is guilty of the mortal sin of doing two multiple writes to the same shared data structure, and ruling out programs that do that has been the foundation of a significant body of work on deterministic programming, and a particularly prominent example of that is something called IVars, and IVars are shared variables that can only be written to once and for which reads block until that one write has arrived. And the I in IVar stands for immutable.

And if you want to know more about IVars, you can look up Arvind’s work from the ’80s on the Id programming language. So, in an IVar based language, and by the way, IVars are also, if you use the monad-par package in Haskell it also has IVars, as well as other languages. So, in an IVar based language, you would not be able to write a program with a non-determinism bug like this, in fact, that combination that IVars have of single writes and blocking reads ensures determinism for programs in which all communication happens through IVars.

So IVars are very useful. But on second thought, is this program really so terrible? Because even though these threads are both writing to this same shared data structure, notice that neither one is really clobbering the other’s state, instead it builds on the other one’s state, and so in fact these first two threads updates commute and so it’s okay for them to run an arbitrary order. The non-determinism that we saw earlier when we saw that this program produced different results, that is only introduced at the point when the IORef is read and we are at risk of accidentally observing an intermediate state before both of the threads have had a chance to write.

So, what if we could take inspiration from IVars, but loosen that restriction that they’re only allowed to write once, while keeping around a form of blocking read that can prevent the order of updates from being observed, and that’s what LVars do. So like IVars, LVars are data structures that can be shared among multiple threads, but it’s fine to write to an LVar multiple times. And the rules are these writes have to be commutative, meaning that they can happen in arbitrary order, so, for instance, it doesn’t matter whether we add the book or the shoes to our shopping cart first. And second, writes have to be inflationary, which means that with every write, the LVars contents have to stay the same or grow bigger with respect to an application specific ordering of states, or a lattice.

So instead of I for immutable, we have L for lattice. And, finally, LVar reads have to be what we call threshold reads, and I’ll explain that next. So, we’ll walk through another program that adds things to our cart, but this time the cart is represented by an LVar. So, what are the states that our cart can be in? It can be empty, and that’s represented by the bottom element of this lattice, or we could have one copy of the book, or two copies, or so on, or one pair of shoes, or two, or so on, or it could have both in some combination. So we’ll look at how the state of the cart changes as we add things.

So we start at the bottom of this lattice, and in this program here, we fork a couple of threads and each of those does a write, and those threads can run in arbitrary order, so if the first thread, this one that adds shoes to my cart, happens to run first, then we would move up into the appropriate lattice state. And then if the other thread ran, we would move up again, but we could have also done that in the other order, so starting at the bottom, we could add the book to the cart, or the books, rather, because we’re getting two books this time, and then the shoes.

And either way, we would end up in the same place. So, now let’s try doing a read, this is the fun part. So, what we want to do is, here, we’re given a key, we want to get the value associated with that key. So we want to know, in this case, how many copies of the book are in my cart. And this get key operation could run before either of those forked threads finish. But it might block at that point. So I mentioned earlier that IVar reads are blocking reads, if you try to read from an IVar before it’s been filled in, your read will block, but when it is filled in, your read will unblock and you’ll be able to see the exact complete contents of the IVar.

LVar reads are also blocking, but they’re what we call threshold reads, so what a threshold read does is it designates a subset of elements in the lattice that serve as a sort of threshold, such that if the actual state of the LVar reaches a point that’s at or above any of those threshold elements, then the read can unblock and return a result. But the result it returns is not necessarily going to be the exact LVar contents, it’s just going to be the element of the threshold set that we crossed over on our way up the lattice.

So, at the point that this read unblocks, the shoes might be in the cart or they might not, but regardless of whether they’re there yet, this operation is going to return a deterministic result, two books. And one thing that I think helps in trying to understand threshold sets is to visualize a sort of tripwire going across the lattice, so the LVar state moves up over time, and eventually it might cross the tripwire, and at that point, that’s when we’re allowed to unblock and return a result, but the result will be the same on every run, it’ll be a particular element of the threshold set regardless of the timing of when we crossed the tripwire.

So for something, you might ask, what are the conditions on these threshold sets? This seems important. And indeed it is. So for something to be a legal threshold set, we need a particular mathematical property called pairwising compatibility to hold of it, and that happens to be true here. But in this code, I’m using a data structure from my LVar library and that data structure, its API, only provides read operations like get key that can be expressed in terms of these legal threshold sets.

So I didn’t have to think about threshold sets or pairwising compatibility or any of that when I was writing the code, I just had to use what the library gave me. So in general the obligation on somebody who wants to implement a new one of these LVar data structures is that they have to make sure that writes are commutative and inflationary and they have to make sure that whatever operations they provide for reading from the LVar can be expressed in terms of these threshold sets, and if they meet those obligations then determinism will be guaranteed in the client code.

So that’s how LVar writes and reads work. And next I want to look at a more interesting problem that’s traditionally been hard for deterministic programming models, and this is parallel graph traversal. So, alright, let’s say that we have a directed graph like this one and we want to consider the problem of traversing this graph started from a given node and finding all the nodes that are reachable from it and we want to do this in parallel so it’s fast.

So, ordinarily the algorithm for doing this could go something like the following. So we have a starting node, and we mark it as having been seen, and then we look at its neighbors, maybe launching a thread for each one, and for each neighbor we check to see if it’s mark is having been seen yet, if so, then that thread is done, and if not, then we mark it as seen and we launch a thread for each of its neighbors. And we keep going like this and this goes on until all of the reachable nodes have been seen.

So as we traverse the graph, notice that set of seen nodes over on the right just grew monotonically because we kept on adding neighbors of seen nodes to that set, and you do this until the set reaches a fixed point. So because this algorithm involves this monotonically growing data structure, you might think, it might seem, to be a nice match for that LVars programming model that I talked about. We could use an LVar to represent this set of seen nodes, have different threads writing to it, have set union as the write operation that adds new nodes to the set.

But unfortunately, we cannot express this algorithm using only the threshold reads that I talked about before, because there’s no way to check if a node has been seen or not, and without that test, we won’t know when to stop looking for new nodes. In fact, this particular graph has a cycle in it, so we could get stuck. So, what do we do?

Well, the way that we accomplish this with LVars is by extending the API a little bit to use what we call events and event handlers. So we say that an event is a write that changes an LVar state, and an event handler is associated with an LVar and listens for state changing updates to that LVar, such as a new node arriving in the set of seen nodes. And an event handler can run a callback in reaction to an event, and then we also have a quiesce operation that will block until all of the callbacks that were launched by a particular event handler are done running.

So with these ingredients, we have almost everything that we need to write our parallel graph traversal, and it could look something like this code. And there’s just one more piece that we need, and that’s an operation called freeze. So the freeze operation is a non-blocking read that lets us find out the exact contents of an LVar, and once an LVar has been frozen, its state can’t change anymore, and any further writes that would change its state instead raise an exception.

But even if we forget to quiesce before we freeze, or even if we do it in the wrong place, the only bad thing that can happen in this programming model by freezing too early is that we raise a write after freeze exception. So, formally, our determinism result says that if we have two executions starting from a particular program configuration, and those two executions end, then either the ending configurations are the same or one or the other of the ending configurations is the error configuration.

So, for instance, to make this really concrete, it’s not possible to see a shopping cart that doesn’t have all our stuff in it, instead you’re either going to see the complete contents of the cart or you’ll get an error. And if you hit the error case, then it always means that there’s a synchronization bug in your code and the error message can even tell you exactly which write and which freeze were racing with each other so that you can more easily debug it.

So I want to take a second to talk about this quasi-determinism result and the structure of the proof. By the way, we call this quasi-determinism because it’s either the same answer every time or an error, as opposed to determinism which is just the same answer every time. So for simplicity’s sake, I’m just going to talk about determinism, but the structure of the quasi-determinism proof is similar.

So determinism says the following, it says, suppose you start from a given program configuration, which is just, it’s a snapshot of the contents of memory plus an expression that you’re evaluating. And you step it or run it until it can’t run anymore, you’ve reached a value, and then you do it again, you step it or run it until it can’t run anymore, determinism says that those two results are going to be equal. And note that we’re not claiming to necessarily go through the same series of reduction steps on both runs, we’re just claiming that the outcome is the same, and this is sometimes called external determinism.

So, how do you prove this? So, to prove this in our setting, we needed a property that I call independence that captures the idea that independent effects commute with each other. And I’m not going to explain every detail here, I just want to talk about the shape of it. So what does this property say? It’s an inference rule, so the truth of the thing above the line implies the truth of the thing below, and this inference rule says that if you have a program configuration, so, this S here is the store, that’s like the contents of memory that I mentioned before. And the E is an expression, so that’s a program configuration and you take a step with it and this property says that if you do that, then you will also be able to take a step and end up with the same ending expression even if the starting store is bigger in a particular way.

In other words, our ability to take that step and end up with that ending expression is independent of whether the store looks like it did on the top or if it looks like it did on the bottom, and that’s how this property gets its name, independence. So why am I showing you this? Well, to give you an idea of why I think this is cool, I want to put it next to what’s known as the frame rule from separation logic, which is one of the really cool developments in program verification from the last 20 years.

And the frame rule is also written like an inference rule, so, again, the truth of the thing above the line implies the truth of the thing below. So what does the frame rule say? It says if you have a program, in this case C, the program is called C, C stands for command, frame rules kind of come from this imperative world, so, C stands for command. So you have this program called C and it runs starting in a state that satisfies preconditioned P, ending in a state that satisfies post condition Q.

If you have that, then it does not do any harm to also have this other disjoint property, R, be true when the program runs. So R might be, I don’t know, an assertion about some memory location that’s outside of the memory footprint of the program, and the star here in P star R and key star R says that R is disjoined from P or Q, those other assertions. So the truth of R will not change the execution of our program, and furthermore, the execution of our program won’t change the truth of this unrelated thing, R.

So if you already know that if you start from P you can run C and get to Q, you know that if you also have R to begin with, C won’t interfere with the truth of R, and furthermore, R won’t interfere with the post condition being true. So there’s this nice noninterference. And the frame rule I think is a nice way of talking about what doesn’t change when you run a piece of code, and it gets its name from this notion of the frame problem, which McCarthy and Hayes wrote about in 1969 when they were writing about AI and philosophy.

And they have this really kind of charmingly dated example in one of their papers where they say, “in proving that 1 person could get into conversation with another, we were obliged to add the hypothesis that if a person has a telephone, he still has it after looking up a number in the telephone book”. And, obviously, a lot has changed since 1969, we don’t really use telephone books anymore and we don’t default to male pronouns, but it’s still the case that in formal systems you usually have to put a lot of effort into talking about what does not change.

And a frame rule is one way of addressing that. So this independence property that I have here is a frame property, if you squint, and except that instead of having that star, that separating conjunction where the pieces you’re considering have to be disjoint, up here the pieces are actually allowed to overlap in memory, but you can still reason about them as if they were separate. So we have this idea of disjointedness and use it to reason about things that do overlap but don’t interfere.

And I think this is an example of actually what I meant at the beginning of this talk when I said that our goal was to find the right high level abstractions to enable efficient computation, because it turns out that knowing that these two pieces of state won’t interfere with each other gives the runtime system the freedom to be able to schedule things however it wants and take advantage of whatever parallel resources exist, and ultimately make programs run faster.

So I said at the beginning of this talk that I was going to talk about parallel and distributed computing, so far I’ve been talking a lot about parallelism, but I want to say something about how this idea is also applied to distributed computing. So let’s talk about distributed computing. So, effectively programming distributed systems, as I said, means writing programs in such a way that they run correctly across a network that’s subject to network partitions and other kinds of failures.

And one of the things that makes distributed programming so hard is that you have state that’s replicated across a number of physical locations. And we saw before that it’s hard to get the contents of our shopping cart deterministic or it’s hard to keep code that operates on our shopping cart deterministic, right? And as if that isn’t hard enough, just to deal with our shopping cart when it’s running on my laptop, imagine if it’s replicated in data centers around the world as it almost certainly actually is.

So ideally, it would be the case that every replica always agrees on the data, or the contents of my cart, but in practice that won’t be true, because that goal is intentioned with our desire for this system to be highly available for both reading and writing. And if that wasn’t hard enough, we also have to deal with parts of the network catching on fire and being unable to talk to each other from time to time.

So the CAP theorem from distributed systems tells us that if we want to be robust to these inevitable network partitions, we have to compromise on at least one of consistency or availability. And if we want to have high availability, we sacrifice strong consistency sometimes for what’s known as eventual consistency in which replicas may not always agree and the only guarantee is that they’ll eventually come to agree.

But that leaves us with a new problem, which is how do we get all replicas to agree? And if they differ, how do we know which one is right? And so we could try to update them to the one that’s most recently written, but it’s not necessarily a great idea because even if we can figure out which one is most recently written, which is itself a hard problem, that might not be the semantics that we want. So there’s a nice quotation about this that I want to share from the dynamo paper on Amazon’s distributed key value cloud storage system.

And this has to do with application specific mechanisms for resolving conflicts between replicas, so they say the application can, “decide on the conflict resolution method that’s best suited for its clients experience, for instance, the application that maintains customers shopping carts can choose to merge the conflicting versions and return a single unified shopping cart”. So, in other words, they’re saying that if two replicas disagree on what your cart is, then that’s okay, you can combine them in some way that makes sense for your application rather than trying to figure out which one wrote last.

So this was a very influential paper, it heavily influenced, for instance, the design of distributed databases like REAC and Cassandra, and a little while after this paper was published, Mark Shapiro and his collaborators formalized some of the ideas that I’m talking about here as what they called “conflict free replicated data types” or CRDT’s. And CRDT’s are data structures based on the idea that if you can define a lattice for the states that an object in the data store can be in, then replicas of that object that differ can merge with one another in a deterministic way because that domain specific conflict resolution operation is just the least upper bound operation in this lattice.

So that sounds kind of familiar. So, because CRDTs are already kind of a cousin to the LVars that I talked about earlier, one thing that you might think about doing is trying to bring LVar style threshold reads to the setting of CRDTs. So this is nice because it could give us a deterministic way to query the contents of CRDTs, so coming back to this shopping cart idea, if you want to know whether the book is in the cart, you could ask a bunch of replicas whether it’s there and the query will block only until it appears at one of them, not until it appears at all of them.

So you could have a deterministic query even though the data store isn’t necessarily consistent, and this suggests a way to save on synchronization costs, and this I think is in line, again, with our guiding principle about choosing the right high level abstractions to unlock efficiency, and it’s related to that frame property that I was talking about earlier. And I have a workshop paper that talks about this a little bit, but I think there’s more to do here, actually, and if you’re interested I’d love to talk to you more about it later.

All right, so we’ve talked about lattice based data structures and how they generalize existing approaches to building guaranteed deterministic parallel and distributed systems, and how this is an example of this principle of finding the right high level abstractions to enable efficiency, and so for the next part of the talk, I want to switch gears and talk about this principle in a different context in this work on non-invasive DSLs for productive parallelism.

So I want to start by introducing you to my friend Laura. This is Laura, and Laura worked in a research group that collects radar data by flying a plane over Antarctica, transmitting radar pulses and listening for and recording the result, which appears in radargrams like this one. So I think this is actually a picture of ice and water under the ice, and rocks underneath that, and so by studying this radar profile, they can learn more about the ice and the rocks beneath the surface.

So Laura wants to spend her time thinking about geophysics. She does not want to spend her time thinking about manual memory management or how to schedule parallel tasks, so when she writes code, she uses what I will call a productivity programming language, like MATLAB or Python or R or Julia, but once she has a working prototype in one of those languages, a dilemma comes up. When she wants to scale up her code to run on a larger input size or at a finer granularity, this productivity language turns out to be too slow.

So the next step, usually, is to port the slow parts of the code from the productivity language to what I’ll call an efficiency language, which is usually C or C++ or Fortran, and doing this requires expertise in the efficiency language, it usually also requires expertise in high performance computing tools like OpenMP, and the result is code that is indeed faster but it’s usually also more brittle and it’s harder to experiment with and it’s harder to maintain, and this is sometimes called the two language problem.

So one proposed solution to this two language problem is using what are known as high performance domain specific languages, and an example of work done in this space is the Delight framework for building high performance DSL’s, which was developed by Kunle Olukotun’s group at Stanford, and I really like this unicorn picture which Kunle uses in all his talks and which I borrowed from the Delight team, and the idea is if you pick a language like Python or Julia, you get productivity and generality, and if you pick one like C++, you get performance and generality, and the combination of all three is this magical unicorn that doesn’t exist.

But what the Delight people found is that by giving up generality and restricting things to a particular domain, you can offer both a high level program model and high performance, so once again, it’s this guiding principle of finding the right high level abstractions, because it turns out that if the compiler knows a little more about what you’re trying to do, it can use that information to make smarter optimizations. And I find this to be a really beautiful idea.

But if these high performance DSL’s, built using these wonderful tools like Delight, are so great, then why are they not more widely used by practitioners like my friend Laura? So, I think there are a couple of reasons, one is the learning curve, if you have an existing body of code in Julia or Python, then learning the DSL and porting code to it is a lot to ask. Maybe it’s easier than porting to C++, but it’s probably still a pain. And then there are functionality cliffs, what if the DSL only does 95% of what you need it to do? Then you can’t use the DSL, and it’s hard for even the most well designed DSL to anticipate every use case that you might eventually encounter.

So my colleagues and I at Intel labs, when I was there, spent some time thinking about these problems and what we ended up coming up with was this system called ParallelAccelerator, which is what we call a non-invasive domain specific language. So what do I mean by non-invasive? Well, ParallelAccelerator is an embedded DSL, meaning we start with an existing programming language, in this case it’s Julia, but unlike most embedded DSL’s, it’s mostly focused on speeding up execution of particular constructs that are already present in the language, in particular it’s mostly targeting numeric code that can be expressed with aggregate array operations, so it’s easy to get started coming from plain Julia.

And the key piece of the interface is a macro called ACC, which is short for accelerate. It also supports an additional domain specific construct which is called runStencil for doing Stencil computations which I’ll say more about in a minute. And for anything new, like runStencil, that we add to Julia, there are two implementations. One as a library function that’s written in Julia, and one high performance native implementation, so that means that it’s possible to turn off the compiler and run in library mode during development and debugging and then turn on the compiler when it’s time to deploy. And that lets us kind of sidestep some of the issues that hindered option of DSL’s.

So we implemented all this as a package for Julia called ParallelAccelerator.jl and it’s available on GitHub. So, I used to say at this point in the talk that it was one of the top 20 most popular Julia packages, but I think it’s fallen off of the top 20 at this point so I can’t brag about that any more. But let’s see a couple examples of how it’s used. So this is a Julia implementation of the Black-Scholes Options pricing benchmark, and I feel kind of silly talking about this at Jane Street, and let me tell you the goal here is not to talk about how Option pricing works because I know nothing about that, but the goal is instead just to give you an idea of the kind of code that ParallelAccelerator can parallelize.

So we have this function, Black-Scholes, which takes five arguments and all these arguments are arrays of numbers, and this function does a lot of computations involving pointwise, addition and subtraction and multiplication and division on those arrays, and I’ve highlighted all of those operations here in red. So this code is written entirely in what we call array style, and that means it’s really kind of screaming out to be parallelized with ParallelAccelerator. So to do that, we only had to make non-invasive additions to the code, which I’ve highlighted in yellow here, so we import the ParallelAccelerator library at the top using ParallelAccelerator, it’s what you write in Julia, and then we annotate the Black-Scholes function with this ACC annotation.

Other than that, the rest of the code is just regular Julia, and what ParallelAccelerator does now is it translates these pointwise array operations into data parallel map operations, and it’s also possible for ParallelAccelerator to avoid certain array bounds checking once we’ve established that the input arrays and the output arrays are the same size, and it can also fuse some of these operations to avoid allocating intermediate arrays.

So that means it goes faster, so without ParallelAccelerator, the original code here took about 22 seconds to run in Julia when the arrays that were passing in have 100 million elements each. So let’s see how that compares to ParallelAccelerator, and while we’re at it, we’ll also compare to MATLAB, so these are running times that I collected for this Black-Scholes code when each of those arrays has 100 million elements, as I said, and this plot shows speed up, so higher is better here. So, on the left first, there’s MATLAB, and the two bars in the middle are results for plain Julia without ParallelAccelerator, one in array style and one with explicit for loops.

By the way, notice that the version with the explicit for loops is a little bit faster, this is something that’s kind of weird. In Julia, writing explicit for loops, in traditional Julia, is faster than writing aggregate array operations, and in fact, we got some push back when we developed this library because people thought it was going to encourage people to write non-idiomatic Julia, because explicit for loops were considered idiomatic. Thankfully, the Julia team was on our side on this, and they thought we were doing the right thing, but we did get a little push back from people who thought that we were encouraging people to write Julia the wrong way.

So we see the Julia results are a little bit slower than MATLAB but they’re in the same ballpark, and then we get to the ParallelAccelerator results which are in blue on the right, so I have a couple different versions, we’re using a 36 core machine here so we have 36 threads of parallelism available to us, if we use even just one thread, under ParallelAccelerator we run in just over 13 seconds, and that’s with the array style code that I showed on the previous slide.

So it’s already a bit of a speed up over plain Julia and that’s because of that getting rid of that intermediate array allocation and array bounds checking that I mentioned, but then we can add parallelism to that and run on 36 threads, which is this right most bar, and for that the running time drops down to half a second, which is over 40 times faster than our original code and all we had to do was import the ParallelAccelerator library and add that one annotation.

So we’re really happy with this. This code also is expressly chosen to illustrate the advantages of this style, so we don’t always get results this good, but this is an example of the sort of thing you can do with ParallelAccelerator. I want to show one more example, and this next one… Yeah, go ahead.

So for MATLAB [inaudible], and it automatically is parallel if you have multi-core architecture, so it doesn’t use all 36 cores? These only get three times the speed up?

Yeah, so the lighter colored MATLAB bar is with the single thread argument to MATLAB and then the other one is just what it does automatically.

But maybe the default thread number used is set at a 12 or something, 36, maybe?

I don’t know.

I always did that, because you can’t have a default number of threads [inaudible].

It does, so the darker colored MATLAB bar, it’s default parallelization settings, so I can’t remember what those settings are but this is kind of like what it does out of the box. There might be ways to make it better, but I put MATLAB up here as an example because usually it seems like what the issue is is somebody’s starting with Julia who’s coming to Julia from MATLAB, they’re surprised at first when their Julia code is slower than MATLAB because everybody told them that Julia was supposed to be super fast, and this is indeed what we observed as well, that MATLAB was a little bit faster out of the box.

So what we’re trying to show people is that you can make things super fast with Julia, but at least at the time it had no native parallelization, so you can get really great performance out of it but you have to use a tool that actually takes advantage of the parallel hardware that you have. So this other example that I want to talk about uses this runStencil construct that I mentioned. So I’ll talk briefly about Stencil computations.

A Stencil computation is a computation involving an array of some kind, it could be a one dimensional array or two dimensional or whatever, and you want to update every element in the array based on the nearby elements, so kind of the classic “hello, world” example of Stencil computations is you want to blur an image that’s represented as a two dimensional array of pixels. So in order to blur an image, you change the value of every pixel to a particular average of that pixel’s value and the values of the neighboring pixels, and that’s what this code does, it’s a weighted average of that pixel’s value and the values of the pixels around it.

This is for a black and white image so we don’t have all the different channels, but it would be similar for a color image. So here we’re using the runStencil construct that ParallelAccelerator gives you, but it doesn’t look all that different from what you would write in plain Julia if you were writing this code in array style, but in plain Julia, it would actually be quite slow. So if we ran this on a large image for 100 iterations, then the plain Julia version of this code, without runStencil, would take almost 900 seconds to run.

So let’s see how that compares to ParallelAccelerator. So, again, I’ve got MATLAB results here, too, and, again, MATLAB is a little bit faster than plain Julia, and then there are the ParallelAccelerator results, so this is where I think things get exciting. So the first blue bar shows the speed up for ParallelAccelerator just on one thread, and when we do that, the running time drops down to under 40 seconds, and then when we add parallelism to that, which is on the right, then we’re down to 1.4 seconds which is more than 600 times faster than the Julia code.

Now, for this one we did have to do more work to get the speed up than we had to for the Black-Scholes code because for this one we had to use that runStencil construct that’s specific to ParallelAccelerator, and we had to rewrite a few lines of Julia to use it, but the runStencil version is really not that different from the plain Julia code that it’s replacing. I don’t have a slide showing the plain Julia code but it’s in our GitHub repo, so you can go see it for yourself if you want.

So that’s runStencil, and so before I finish this part of the talk, I want to say something about the impact of ParallelAccelerator. So, ParallelAccelerator in Julia itself didn’t have that much of an impact, I think it mentioned it was near the top of this Julia package popularity list for a while, but it kind of quickly dropped off and the Julia ecosystem changes pretty quickly, so it didn’t make that much of an impact on Julia. But the impact that it had was actually elsewhere, so it was used as the basis for another high performance DSL for deep neural networks that appeared at PLDI ‘16, and then my colleague Ehsan Totoni developed a distributed data analytics toolkit called HPAT that’s based on it that appeared at a conference a couple of years ago, too.

But what I think is really most exciting is that some of my colleagues at Intel, and I can take no credit for this, they ported the ParallelAccelerator technology to Python, and they integrated it with the Numba compiler for numerical programming in Python. So this is now part of Numba, I think it was actually the largest external contribution that Numba has ever accepted, and it makes it possible for Numba users to run their code multithreaded and it’s with almost no user intervention, you just have to turn on a particular option to Numba. And this is a post on the Anaconda blog from last December that you can check out to learn more about that.

All right, so that wraps up the part of the talk about non-invasive DSLs for productive parallelism and how they also, I think, exemplify finding the right high level abstractions to enable efficient computation. And, finally, I want to change gears yet again and talk about this third project that’s been my obsession of late, and that I presented at SysML last year on using SMT solvers to verify properties of neural networks used in safety critical software, and I’ll talk about how that fits in nicely with our guiding principle and how it even connects back to all the other stuff I talked about and points to a longer term research agenda.

And I should mention that most of this is, well, all of this, is done in collaboration with a group of people at Stanford who did really most of the work. All right, so let’s talk about safety critical software systems. In safety critical settings, it’s prudent if not actually required by law to rigorously show that certain bad things can’t happen before you use a piece of software, and today we have some good tools for formal verification of software, often based on SMT solvers.

The trouble is that verification tools aren’t necessarily keeping pace with how safety critical software is being built, in particular, often safety critical software is now using neural networks. And the example that I’m going to focus on here is from a paper that was published in 2016 by my collaborators in the aeronautics and astronautics department at Stanford.

So their goal here was to decrease the storage space requirements of an aircraft collision avoidance system, and they wanted to do this so that it would fit in memory on the verified hardware on board an aircraft, and they found that by representing this collision avoidance policy as a neural network it took up much less space than the original implementation they had. But as they write here, “there are significant certification concerns with neural network representations, which may be addressed in the future”.

So let’s look a little bit at the system that they’re discussing, so this is an overview of the system, this is before we get to the neural network part. So, it runs on board an aircraft during flight and once per second it collects sensor measurements and it produces one of five of what are called resolution advisories, so the most common resolution advisory is COC which stands for clear of conflict, which means that the plane doesn’t need to do anything special, and there are four others which are weak left turn, weak right turn, strong left turn and strong right turn.

And the resolution advisory that the system produces is the result of a lookup in a table, which is this table here, this optimized logic table, so for each of about 120 million possible states of the environment that you see when you do one of these sensor readings where a state describes the relationship between your aircraft and this possible intruder aircraft, for every one of those 120 million possible states, the table has an entry that gives a score to each one of those five advisories, COC, weak left, weak right, et cetera.

And this table is really, really big, it needs hundreds of gigabytes of storage and this is too big to fit in memory on verified hardware, and it has to be in memory because these table lookups have to be fast. So hence this idea that these folks at Stanford had of training a neural network on the table and then storing that trained model instead of storing the whole table. So what does that look like?

Well, the input to the trained network is a state of the environment, just like we saw on the previous slide, and the output is a score for each one of those advisories. And this network, the trained network, the storage space it takes up is smaller than what the original table took up by a factor of 1,000, which makes it small enough to fit in memory on their hardware, it takes a few megabytes.

However, the trained network is only an approximation of the original table. So to illustrate what that looks like, here are some illustrations of, these are like top down views of encounters between our aircraft and an intruder aircraft, and the color at every point in the plot here shows the advisory that would be issued if the intruder aircraft were at that location. So on the left is what the original table’s advisories would say to do, and the plot on the right is what the neural network would advise for you to do.

So they’re obviously not the same. But it might not necessarily be a problem that they’re not the same, the question is whether or not certain properties that we care about still hold. So, for example, we might want to prove that if the intruder is close, let’s say it’s 15,000 feet away, which is considered close, and it’s approaching from the left, then the network will always advise strong right. So once we know those properties that we’re interested in showing, we can try to use an SMT solver to try to prove that they hold.

So let’s talk about how that would work. So the first step is to take that trained network that we want to prove something about, and take the property that we want to prove, and we encode them both as a huge Boolean formula, so in this case of this network, every unit in this network will become a variable in the formula, and everything that relates to that unit will become part of the formula somehow as a constraint on that variable.

And now we have something that we can determine the satisfiability or unsatisfiability of by feeding it to an SMT solver. So what is SMT? Well, you might have heard of a SAT problem, and that’s the problem of determining whether a Boolean formula is satisfiable, whether there’s a satisfying assignment of true or false to every variable in the formula. SMT stands for satisfiability modular theories and the SMT problem is an extension of the SAT problem, so it’s also about determining whether a formula is satisfiable, but instead of just being a formula with Boolean variables in it, the formula can be more complicated in the case of SMT.

So in this case, instead of just being a Boolean formula, it can contain expressions that come from what’s called the theory, that’s the T in SMT. In this case it’s the theory of linear real arithmetic, so that means that the variables in this formula, instead of just being Boolean variables, now they can be real numbers, and we can do addition and subtraction and multiplication on them.

So this is the overall strategy. We construct this giant formula, we hand it off to the SMT solver, and we let the solver crank on the problem for a long, long time until it returns satisfiable or unsatisfiable. So right now we’re working with this theory of real arithmetic, this is what’s called a decidable theory, so that should mean that the solver should eventually have an answer for us. But in practice, it’s really slow, SMT solving is often really slow and so a big part of research in this area is trying to find more efficient ways to determine the satisfiability of a formula.

So there are traditionally two approaches to the architecture of an SMT solver, and the first one of these is what’s called the eager approach, in which you take the SMT formula and you essentially compile it down to a Boolean SAT formula and then you can solve it with a SAT solver. So, in principle, as long as you have a decidable theory, you can do that in much the same way that you can compile code in a high level language to machine code.

But it turns out that this is not actually the approach that most modern solvers take, and that’s because it turns out to be really inefficient, that if you compile to a plain old Boolean SAT formula, you may lose a lot of the high level structure of the problem that you started with, and then you lose the ability to apply domain knowledge from whatever theory you had, so, for example, addition is commutative, and if you want to exploit that fact that three plus four equals four plus three, then you have to be able to talk about addition, and if you’re working at a level where you can talk about addition, it’s easy. But it’s hard to do if you only have Boolean expressions, in the same way that it’s hard to write machine code.

So if you take the eager route, then you end up having to do a lot more work to prove whatever it is that you want to prove. So what is more popular is this lazy approach to SMT solving, and in the lazy approach, you still have an underlying SAT solver, but you have all of these what are called theory solvers, that are each specific to a different theory, and the solver looks something like this picture which is taken from one of Clark Barrett’s talks. So here you have several different theory solvers at the top, so here I guess it’s the theory of uninterpreted functions, the theory of arithmetic, the theory of arrays, theory of bit vectors, and these are all talking to a piece called the solver core and then a SAT solver underneath that the core communicates with.

And this is yet another example of this principle of finding the right high level abstractions to enable efficiency. In that last section of the talk, we talked about how with high performance DSLs it helps to have that high level representation of programmer intent, and it’s the same thing with lazy SMT. If you can do certain optimizations up at the level of the theory solver, then the problem can be solved a lot more efficiently than if you just eagerly compiled everything down to a SAT formula right away.

And you really need all the help that you can get for trying to solve things efficiently, because we’re working with problems that are NP-complete and we do encounter these worst case running times. And it turns out that to be able to verify anything interesting about reasonably sized neural networks, it’s not enough to just have a solver for, say, the theory of linear real arithmetic, you also need the theory solver to have some domain knowledge about neural networks, this is what my colleagues at Stanford discovered, and here’s why.

So, going back to this picture of the high level verification strategy, we can expand out that SMT solver piece now that we’ve talked about SMT solver internals a bit. So we can expand that out to be a little more detailed. So, we just talked about how lazy SMT solvers can have these different theory solvers depending on what theory our formulas are written with. In this case, I mentioned it’s the theory of linear real arithmetic that we care about, and for that, our theory solver could be a linear programming solver or LP solver.

But there’s a catch, right? Because our neural network has activation functions and this is the thing that’s stopping it from just being a linear combination of its inputs, it can learn much more interesting functions because it can also do this non-linear stuff. So in this case, it happens to be a network with ReLu activations. A ReLu activation has to be encoded as a disjunction in our SMT formula that we compiled down to. Because it has to be encoded as a disjunction in the formula, that means it cannot be handled by the LP solver, rather, it requires you to drop down to the level of the SAT solver, and dropping down to the level of the SAT solver makes things slow.

So what my Stanford collaborators figured out is that you can take this LP solver and just make it a little bit smarter by adding a higher level ReLu primitive to this theory of linear real arithmetic, and that means that instead of eagerly splitting ReLu’s into disjunctions, you can be lazy and leave them unsplit and proceed with trying to solve the rest of the problem.

So, in this case, using this higher level ReLu primitive, instead of compiling it down to a disjunction to be handled by the SAT solver right away, this was enough to make this problem tractable. And I just want to show a few of the properties that this solver was able to show about that aircraft collision avoidance network, using this lazy ReLu splitting approach. So the interesting thing here, I think, is this last column which says maximum ReLu split depth.

So, this network, for context here, this network is divided into sub-networks, and there’s 300 ReLu activations in each of these sub-networks. That means that if you can think of a ReLu as being either on or off, that means there’s two to the 300 possible combinations of on or off. Two to the 300 is a really large number. And if the solver had to split every one of those ReLu’s into disjuncts, then properties like these would be impossible to verify in any reasonable amount of time.

But it turns out it doesn’t have to do that, it only has to split a small fraction of them. So that’s what this ReLu split depth is here, for example, the first one is 22. So we still have to look at two to the 22, which is still a big number, but it’s a lot less than two to the 300. So, in general, we’re looking at running times of tens of hours to prove or disprove these properties rather than the lifetime that they would take if we didn’t have this lazy ReLu splitting.

So in the use of this higher level ReLu constraint, and this custom theory solver that knew about neural network activation functions, was what enabled the solver to work efficiently. And I did a workshop paper with these folks in which we proposed extending this approach to verify the properties of networks that use other kinds of activations such as sigmoid functions by using approximations of those functions that are piecewise linear. We have not actually implemented this yet, if you want to work on this, talk to me.

All right, so I’ve talked a lot about a lot of different projects here, all of which kind of I think fit in with this guiding principle of finding the right high level abstractions to enable efficient computation and finally I want to bring it back around to what I want to do in the future.

So we just saw that it is possible to formally verify properties of neural networks using SMT solvers if you’re willing to dig into solver internals, rather than just using an off the shelf solver. And, in fact, I think this is true not just for neural networks but in all kinds of domains where we want to make the use of solvers practical, so I would like to invest in developing new domain specific solvers that are well suited for the problems of parallel and distributed computing.

In particular, I think that solvers that have baked in support for reasoning about lattices and partial orders could get us a really long way. And then, we also want to parallelize the solvers themselves, so this is a hard problem and it’s one that to my knowledge has never been done in a guaranteed deterministic by construction setting. Here’s why I think this matters. So, it’s especially important to be able to safely and efficiently share state between different solver sub-tasks, because the solver remembers cranking on this big formula, it needs to be able to work on different parts of that large formula in parallel, and we need those sub-tasks to be able to share knowledge with each other as we monotonically move toward getting more of that formula processed as solving goes along.

So my hypothesis, as yet untested, is that we can exploit the same techniques that we developed for guaranteeing determinism in other settings, and we can use them in this setting of parallel SMT solving, which is kind of a holy grail. But I don’t want to stop there. My long term vision here is actually to democratize solver hacking, so what I want is a framework for quickly building high performance domain specific solvers. So, today it’s common in a number of sub-fields of computer science for people to use off the shelf solvers as black boxes. If you go and pick out any random paper from POPL or PLDI, they’ll say we used an SMT solver.

But usually people in this line of work don’t necessarily dig into the internals of the solver, and to build your own theory solver, one would appear to need to be an SMT solver internals expert, even though that architecture that we saw of the lazy SMT solver with all those nice modular theory solvers, you’d think that would lend itself to this modular style of development in which theory solvers could be developed independently. But in practice, it would seem that SMT solvers are somewhat monolithic, and that this SMT internals expertise is required, so I want to change that, my claim is that if you were a domain expert, if you know about distributed consistency, for example, you should not have to be an SMT internals expert in order to create a custom theory solver for your domain.

So I mentioned that Delight framework for implementing high performance DSLs, which tried to make it possible to rapidly implement high performance DSLs without having to necessarily be an expert on the guts of compilers. I want that, but for implementing high performance domain specific solvers, and that’s a long term goal, and I think I’m going to be working on this for the next five years to life, so if that sounds like something you want to be part of, I’m looking for grad students, so get in touch.

I am almost done, one final thing that I want to plug, so, last term at UC Santa Cruz I taught a course on languages and abstractions for distributed programming, and my students and I wrote a blog about what we did. So if you want to know about what kinds of problems that I’m interested in in this intersection of programming languages and distributed systems and verification, go check out this blog, my students did a lot of really great work on this and I want to share it with the world.

That’s all I’ve got and I would love to take questions, thank you.

Hey, I was wondering if there are any existing SMT solvers that you could use as a base for the kind of thing that you want to do, or do you think you have to start from scratch?

Yeah, so it’s a great question, so the question is could you use an existing SMT solver as the basis for what you want to do. Well, in principle, yes. So, Z3, for example, is a solver that has an architecture like I talked about, or CVC4, they have this architecture where there are theory solvers and a solver core, and it should be possible, I think, for regular folks to develop theory solvers that plug in nicely, but it’s not easy to do. And I don’t know if the problem is… So I would like to claim that the problem is largely cultural, regular people think they can’t hack on MST solvers, but I think maybe the problem is also technical, maybe the problem is lack of documentation, maybe the problem is… Well, I can’t say that the problem is that the solvers aren’t open source, because they are, but just because the code is available doesn’t mean it’s easy to hack on. So, yeah, I think there are existing starting points and there is kind of some folk knowledge about which solvers are easier to hack on than others, that I don’t know is true, whether it’s true or not. Yeah?

So you compared the performance between MATLAB and also using the ParallelAccelerator, I’m wondering if you had any measurement to do with the low level refactoring and how they compared?

Low level, so the question is-

Into a performance language like [inaudible 00:58:34].

Yeah, great question, so the question is, I was only showing MATLAB and Julia results up here and to really be fair I would also have a column of C++. So if you look at our paper, we had expert C++ implementations of three of the workloads that we looked at. We didn’t have them for all of them, and the reason for that is that expert parallel C++ implementations are often kind of hard to come by.

So the ones that we had were done by experts at Intel who, no, are not me, so we had those for three workloads that we looked at, and so we were able to get within shouting distance of the expert implementations, I think we were considering it a success if we could be only twice as slow as C++. And we got there. So I was really pleased with that.

But you’re right, it’s kind of cheating to not put that on the slide as well, yeah, Ron.

So, we’re going down this [inaudible 00:59:43] of being non-invasive DSLs, [inaudible 00:59:44], so there’s another approach people often do to make the ladder between ESL and the ordinary language easier, which is [inaudible 00:59:51] DSL’s. We’d show those more commonly in fancy type system [inaudible 00:59:58], OCaml and Haskell, and I’m kind of curious what you think are the trade offs between this approach where you’re more aggressively hacking at the compiler level, [inaudible 01:00:07] ordinary surface syntax and get the better behavior, versus something where you just use [inaudible 01:00:13] style approaches to [inaudible 01:00:16].

Yeah, so, Ron’s question I guess is contrast non-invasive DSLs with embedded DSLs. So a non-invasive DSL is a kind of embedded DSL, there’s not necessarily a firm definition of what non-invasive means, this is a term that was coined by Jan Vitek, by the way, he really helped us figure out how to market this work and he was the one who came up with this term non-invasive. I’m not kidding, it took us a while to get this paper published and trying to find the right story to tell, and it was Jan who helped us do it.

So I guess the question is, when would you need to do this versus… So, I guess an example, you could also think of the LVars library, I didn’t talk about it very much, but we have a library called LVish for writing LVars code, and you can think of it as an embedded DSL and so in the same way, so, all LVar operations run inside a particular monad and I like to think of monads as DSLs. So that’s kind of one way of having an embedded DSL that’s kind of enforced through the type system.

But there’s this firm boundary between the DSL code and the rest of the code. So, I guess a non-invasive DSL tries to make that boundary more porous, and whether one or the other is right for a problem depends on the problem.

I guess I wonder specifically, it doesn’t just [inaudible 01:01:40], but how explicit it is? It knows where the combinator ends and where [inaudible 01:01:46]. And I wonder how that played out, for example, in the delivery of errors?

Yeah, so, okay, the question is about errors. You’re right, so one of the things that is kind of a struggle when you’re using ParallelAccelerator is if it runs into something it can’t parallelize, we kind of have to drop back to using the sequential version, and there’s a lot of stuff that it can’t parallelize, it can only parallelize this narrow subset of Julia code that’s aggregate array operations on arrays of numbers. So, for example, ParallelAccelerator can’t handle strings, or at least last I checked it can’t handle strings.

But what we tried to encourage people to do is not put the ACC annotation around giant blocks of code, instead start by putting it around tiny pieces of code and factor the code that does that stuff out into small functions which is maybe a good idea anyway, and then have ACC around those small functions and then it can interact with the other code. Does that answer your question?

Yeah, I was just wondering, when you produce errors for the user, is that you doing some kind of syntactic transformation, I imagine, between the code that you see and something that I might recognize as a little sub-language [inaudible 01:03:11]?

[crosstalk 01:03:11] Yeah, so I think what you’re getting at is one of the bug bears of programming with embedded DSLs is that you get error message that are in the context of the language, the underlying host language, rather than in the context of the DSL. So I think the Racket people have done a lot of amazing work on this, but a lot of the work ends up having to be somebody’s blood, sweat and tears trying to make the error messages good. And I am the first to confess that ParallelAccelerator does a bad job at this, as is so often the case with research quality code.

But I think there are huge usability issues with DSL’s which I think is one of the reasons why they’re not so widely adopted, I don’t know, there’s some Racket people here, like Carl, maybe he has thoughts on… I don’t know, are you a Racket person?

Not actively.

Not actively, okay. I feel like those people are at the forefront of figuring out how to make embedded DSLs and towers of embedded DSLs more usable. Yes?

So I was inspired by that previous question, there was very similar sounding work, like Neil Conway’s Bloom, which is an embedded DSL in Ruby for doing almost exactly the same thing, could you contextualize those two pieces of work and what has come since pushing this lattice programming?

Yeah, you bet, so Bloom was actually also worked on by my colleague Peter Alvaro at UCSC, so he’s the office right next to mine, so we actually just submitted a grant proposal about trying to do this stuff. So Bloom is a language for distributed programming and the work that Neil Conway did was called BloomL, so kind of extending Bloom to programming over arbitrary lattices, whereas the original Bloom was just about sets.

And so Bloom doesn’t have a determinism guarantee, so Bloom is about distributed programming with guaranteed eventual consistency, but eventual consistency is different from determinism, because you can still observe intermediate results that are different, for example with the shopping cart, you know that your shopping cart, your different replicas are eventually going to converge, but in the meantime you might see one or the other thing in the cart.

And Bloom actually does something interesting, so, they actually rule out, and especially in BloomL, you’re not allowed to do non-monotonic operations at all. So the only kinds of operations you can do in BloomL is those that allow the contents of your data structure to grow, and so this is a really, some might say, draconian approach. So with LVars, what you might do is, because this LVish library is in Haskell, what you might do is you might have certain parts of your code where the data structure is growing monotonically, and you would do those, you would write those parts using the LVish library, and then when you were done with that phase of the code where the data structure grew monotonically, you would freeze the LVar, and then just use it as a plain old regular data structure from that point on.

So you can have designated parts of your code where the data structure only grows monotonically, whereas with Bloom it’s like it must be the case, all the time. Or with BloomL, at least, it must be the case, all the time, but you’re right, these things are related, they’re similar in a lot of ways, they were done basically concurrently and it’s really interesting because during my PhD work, I was working on this stuff and submitting papers and getting them rejected and then people from the distributed programming community actually came to me and they told me that they were interested in my work so it was distributed systems people who were interested in what I was doing before any PL people were interested, and then that’s sort of how I became part of the distributed systems community and then ended up jumping into it with both feet later. And I would love to enable more interaction between these communities, and that’s what I’m trying to do now. Yeah?

So, the ParallelAccelerator has some parallels to [inaudible 01:07:30] and LVMs, and, traditionally, when you can build a [inaudible 01:07:36] problem is [inaudible 01:07:37]. And so you get it working, and you look [inaudible 01:07:41], and then two months later it doesn’t work at all, where you changed a line in the middle, it broke it. It’s really fragile, so the abstraction doesn’t meet your goal, your high level abstraction, for encouraging efficient computation. It’s so bad that you end up often just writing a [inaudible 01:07:59], and that’s fine, you can see on your [inaudible 01:08:02] it’s not near the productive side, that’s fine [inaudible 01:08:07]. But for these productive languages, you kind of want to avoid that, so I’m wondering, [inaudible 01:08:10] that you explore any robust abstractions where someone steps outside [inaudible 01:08:16], it tells you what you know?

Yeah, that’s a great question, so let me try to paraphrase the question, so, kind of comparing ParallelAccelerator to auto-vectorization, yeah, I think you run into the same problems where you might, like the Black-Scholes code, for instance, we were only doing operations that ParallelAccelerator knows about and can parallelize, but if we had the code were a little different, there’s actually… ParallelAccelerator cannot parallelize all Julia array operations, it can parallelize a set of them that it knows about.

And so you still have correctness, because we’ll just fall back to using the original sequential version if we can’t parallelize the code, but it might not be fast, and so I think what you’re asking for is something that will give you some kind of optimization advice, like, warning, you’re trying to introduce an operation that the parallelizer, in your case the auto-vectorizor, won’t be able to handle, so what should you do? I think that’s a fascinating question, I don’t know the answer. Yeah?

Could you comment on the scalability, in essence you [inaudible 01:09:31], what kind of scalability do you have of these big networks, or…

So, the short answer it’s not where it needs to be. So the work that the team at Stanford did on this system called Reluplex, that’s what the system was that I was talking about, was able to handle networks that were a couple of orders of magnitude bigger than what the previous best approach for verifying properties of neural networks with SMT solvers could do. That previous approach was published in 2010, and it was able to verify things about a network that was tiny, it was like ten neurons.

So they were looking, they had a larger network, and then they divided it into these sub-networks of 300 activations or 300 neurons each, which is still really small, but it was a couple of orders of magnitude bigger than what they could do before, and if you look at the whole network… But it’s still not necessarily able to handle modern reasonably sized neural networks. So there’s been some follow up work that’s really interesting, there’s a paper, I think called AI Squared, it was abstract interpretation for AI which was about trying to be even smarter about…

I see Ron laughing, the titles of these things are sometimes funny. I have not yet taken the time to sit down and understand the work, but they were able to do what Reluplex was doing but an efficient way by pruning off larger pieces of the search space, or I think by being able to put parts of a search space in the same equivalence class and kind of applying this abstract interpretation approach to solving, which I think is super cool and I wish I understood better.

So the state of the art has advanced a little bit in terms of how big of a network you can handle, but it’s still not where it ought to be, and I think really an even bigger problem is just even writing down the properties that you want to prove about these networks in the first place. So in the case of this aircraft collision avoidance system, you can state the properties that you want to prove fairly straightforwardly in terms of if the input falls within a certain class, then the output should fall within a certain class.

And you can do that because the input is, in some sense, relatively high level, it’s like, this airplane is here, the other airplane is there, they’re going this fast, but if the input were something like pixels from a camera, then how do you write down a property, if this pixel is this, if this array of pixels is this, how do you write down the property like this car doesn’t hit a pedestrian?

So I think there’s this fundamental issue with networks that they’re, the first layer of the network is operating on this very low level input data to even be able to write down the property that you want to verify in terms of the relationship between inputs and outputs and I think that’s even a harder problem. Yeah?

So, kind of related to that, I was wondering, once you have this point and other things they really care about this network are the following fairly clear and small set of properties. Doesn’t that point to a significantly better conversion storage? Which is, you can just… In some sense you have these fairly simple [inaudible 01:12:58] which feels like… Or is the story that these are like rough sanity cases that we want to make sure are true, but actually we care about some other kind of rough equivalence that we don’t know how to express between the particular…

I’m kind of wondering how complete of a correctness story are these theories and I was also wondering this because I was wondering how applicable is this in other cases, do you need these proofs because they’re somewhere in the airline safety checklist, it should have proofs about things? Or does it really capture at the core what [inaudible 01:13:33] important?

Yeah, that’s a great question, so I guess one way to talk about this question is say, how do we decide what property we needed to prove there like when I said the collision avoidance system always says to take a strong left when the intruder is near and approaching from the right? So if that’s the case, how do you know that that and maybe a handful of other properties are the interesting ones to prove? And how do you know that once you’ve proven those, then you’ve covered the space of bad things that could happen?

And this is not limited to this situation, like always in verification, there is the question about are you trying to verify the right things? And somebody asked me if we had a proof of “full functional correctness” of the aircraft collision avoidance system, and it’s like, no, we have proofs of these particular properties that we were able to write down. Were those the right ones? I don’t know.

So people who know more about that domain might be able to say more and we had people who were domain experts involved in this project but I don’t know what claims they would necessarily make about… And I think any time you ask, is this piece of software safe? It’s kind of like if you ask a security person, is this piece of software secure? Well, the answer is, what is your threat model? And it’s the same here. What is your threat model? What are the things you think could go wrong?

And we can protect against those things, but we don’t know what we don’t know. So I don’t have an answer that’s better than that. Yes?

I wanted to talk a little bit about quasi-determinism, and I’m kind of curious, is it possible to prove statically, without having to run, that it simply won’t error out or that it will throw an error? If you were to eyeball it, if you froze something, and then you tried to write to it or something, that it should throw an error, but is there any attempt to actually statically verify these things rather than just prove this quasi-deterministic property?

Yeah, so there are type systems for ensuring determinism, and probably one of the best known of those is something called DPJ or deterministic parallel Java, and so DPJ is an extension to Java and you can put in it annotations in that will say that one operation commutes with another, and then it’ll do this sophisticated type checking to try to make sure that your program is deterministic. So, LVars is actually not a type system. And, in fact, one thing that has always irritated me about the LVars work is that we’re relying on certain assumptions, like I talked about the assumptions that the programmer had to make of those LVar libraries, they’re assuming that whoever wrote that library did their homework and made the read and write operations have those properties of commutativity and inflationary-ness and that read operations would correspond to threshold sets.

But we don’t have anything that proves that that’s the case in the LVar library per se. So this idea of verified LVars is actually something else that I want to do. So, Niki Vazou had a little bit of work on proving commutativity in LiquidHaskell, which is relevant to this, but what she was doing was she was looking at, so, in Haskell there’s the ord type class, which says, the ord type class, says that the elements of the values that have a certain type are a partial order, or, excuse me, a total order.

But there’s nothing in Haskell that actually proves that that is true. So she did a verified ord type class where it actually calls off to an SMT solver that LiquidHaskell uses and so you know that your ord actually means ord when you’re using her system. So, in principle, you could do something like that with everything that you need to do to ensure that something is actually a lattice, and I want to do that. So that’s yet another thing on my list.

Just one more question?

Sure. One more question. Who wants to be last?

[inaudible 01:18:03].

All right, thank you.

Abstractions for Expressive, Efficient Parallel and Distributed Computing

Transcript

The next great idea will come from you