How Language Models Model Programming Languages and How Programmers Model Language Models

Arjun Guha takes a deeper look at how LLMs internally reason about programming concepts, and how programmers reason about the programming abilities of LLMs.

Transcript

John Crepezzi (00:00:05)

Today we’re pleased to have Arjun Guha here to speak with us. Arjun is a professor of computer science at Northeastern. If you actually go look back at prior talks on the New York Tech Talks list on the website, you’ll see that Arjun was actually the first person outside the company to come speak and give a Tech Talk here. That was way back in March of 2017. So it’s been eight years. It’s been a long time

He’s got a wide range of interests that overlap with people at Jane Street. Back in 2017, when he was here, he was talking about verification of system configuration languages like Puppet. He’s also broadly interested in programming languages, systems, software engineering, mechanistic interpretability, and making LLMs better at programming tasks, especially on low-resource languages like OCaml. I see we’ve got a lot of people from the compiler team here, so this is going to be a good one for you

Today, he’ll be talking both about how LLMs reason internally about working with programming languages and how humans have adapted their communication styles and working standards for working with large language models. We’ll have some time for questions at the end, and I hope you enjoy it. Welcome back, Arjun

Arjun Guha (00:01:19)

Thank you so much, and thank you, everyone, for coming. And thanks to everyone who I met this afternoon. I’ve really enjoyed my time here so far

So my background is in programming languages research. I was an OCaml hacker back in grad school. For the last five years or so, I’ve written, you know, more Python than one should, because I’ve been sort of in the weeds with code LLMs

There’s sort of three major strands to my research. So as you said, one is trying to make LLMs better at low-resource languages. This is a term of art for programming languages for which there is limited training data, such as OCaml. Understanding how people use LLMs, and finally, understanding how LLMs work under the hood as they do programming tasks. And I’m going to try to tie these three threads together in this talk

But before I get into any of that, I want to talk about benchmarking. I want to begin by introducing some benchmarks that my group has built for LLMs as they do tasks in lower resource languages, such as OCaml. In this field, without benchmarks, you’re just sort of, you know, taking shots in the dark. So let’s talk about some benchmarks

(00:02:10)

So to understand our work, I think we need to turn back the clock to about summer of 2022. So the summer of 2022, you can think of that as after Copilot, but before ChatGPT. So in the summer of 2022, the labs that were training language models had begun to sort of standardize on how they were evaluating models on coding tasks

So OpenAI had released their Codex model, the model that powered GitHub Copilot, about a year earlier. The paper that introduced that model also introduced this benchmark called HumanEval. And it had been rapidly adopted. There were also other benchmarks, such as MBPP, which were very much sort of along the same lines. And other labs, such as a group at Meta and a group at Salesforce, had all begun to sort of adopt these benchmarks for evaluation

(00:03:15)

So what I want us to do is begin by looking at what some of these benchmark tasks look like. So what I have here is an example problem from HumanEval. So that is the prompt to the model. It’s a Python function signature followed by a docstring. The model sees that and is prompted to generate the function body. And then the tests at the bottom are run to see whether the model generated the right code. The model doesn’t see the tests at all

This is actually a great problem for evaluating a language model. For models at the time, possibly even for some models today, this kind of prompt would absolutely stump the model, because it sort of redefines what a vowel is. It says that y is also a vowel, but only when it appears at the end of a given word. And this sort of violates the strong prior that models have about what vowels are, so a lot of them will just sort of do the usual vowel thing and not really follow the instruction
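For concreteness, here is a sketch, from memory, of the kind of HumanEval prompt being described; the exact wording may not be verbatim, and the hidden tests are omitted.

```python
# A sketch (from memory, possibly not verbatim) of the HumanEval-style prompt
# being described: a function signature plus a docstring. The model completes
# the body; the hidden tests at the bottom of the benchmark file are not shown.
def vowels_count(s: str) -> int:
    """Write a function vowels_count which takes a string representing a word
    as input and returns the number of vowels in the string. Vowels are
    'a', 'e', 'i', 'o' and 'u'. Here, 'y' is also a vowel, but only when it
    is at the end of the given word.
    >>> vowels_count("abcde")
    2
    >>> vowels_count("ACEDY")
    3
    """
```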

(00:04:19)

So if you look at some of the benchmarks, HumanEval is a benchmark suite of 164 problems that are all sort of along these lines. The MBPP benchmark from Google has 400 or so problems, a little bit easier, but again, sort of along these lines. And there’s a bunch of other papers that are all sort of along these lines. But the common theme that they all have is there’s a prompt, there’s hidden test cases, oh, and it’s all just Python. So people were training these multilingual models, but they were being evaluated almost exclusively on their Python programming performance

(00:05:13)

And so the natural question that we asked is, how did they do at other languages? So what we did back then is that we took these benchmarks and developed a little suite of, I mean, technically these are compilers or transpilers, but really trivial test-case compilers that would translate the benchmarks from Python into a whole suite of target languages

So basically it kind of works like this. Given the Python prompt at the top, we can turn it into, I have here a Rust example, a Rust prompt at the bottom, doing a little bit of type inference to figure out what the type annotations need to be, and also mechanically translating the test cases from Python into Rust. And once we do that, we have this parallel suite of benchmarks across a whole bunch of programming languages

(00:06:15)

And this graph is from our paper, where we call the benchmark MultiPL-E. It was, I believe, the first large-scale multi-language evaluation of the models of the time. So we have here the OpenAI Codex model, which is like a 175 billion parameter model. The largest model, of course, does better than the smaller models. But this was the first multi-programming-language evaluation of code LLMs

And by the way, on the x-axis, we have the various programming languages, of course. And on the y-axis, the metric is pass@1, which is frequently used. You should just think of that as the fraction of the benchmark that the models are able to solve. So on, say, Python, the Codex model was getting around 45% or so, and doing, of course, much worse on lower-resource languages
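As a side note, pass@1 numbers like these are usually computed with the unbiased pass@k estimator from the Codex paper; a minimal sketch in Python:

```python
import math

# Unbiased pass@k estimator from the Codex/HumanEval paper: draw n samples per
# problem, count the c that pass the hidden tests, compute this quantity, and
# average it over the benchmark. With n = 1, pass@1 reduces to the plain
# fraction of problems solved.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```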

(00:07:39)

Okay, so here’s the state of multi-language benchmarking today. So I pulled these numbers from the Kimi-K2 paper, which is a very large open model that came out a couple of months ago. And as you can see, the benchmark is getting pretty saturated. The best models are at 90% or approaching 90%. And we know, from work from others, that about 10% of the problems are faulty in various ways. This is a recurring theme: if you have a large enough benchmark, some of the problems are faulty in some sort of way. And so the benchmark truly is saturated. And so there is a need for new benchmarks

(00:08:32)

So we did some recent work that we called Agnostics, where, you know, we realized that we can just give these models much harder tasks to do now, of course. So instead of asking them to just complete a function body, you can give them detailed instructions, not just about the task to do, but, you know, about exact input and output formatting

So for example, you can take a problem from HumanEval, which was just a Python prompt, and turn it into a problem such as this, which sort of says, “Give me a complete program that takes input in the following format on standard in, and produces output in the following format on standard out.” And once you have a problem like that, you can just say, “Why don’t you solve this in OCaml?” And as long as you can compile and run OCaml, you sort of get a multi-language benchmark actually much more easily than we did before
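To make that concrete, here is a purely hypothetical sketch of what such a language-agnostic reformulation could look like; the template text and placeholder are illustrative, not the exact format from the paper.

```python
# Hypothetical sketch of a language-agnostic reformulation of a HumanEval-style
# problem. The wording and the {language} placeholder are illustrative only.
TEMPLATE = """\
Write a complete program that reads from standard input and writes to standard output.

Input: a single line containing one word (ASCII letters only).
Output: a single integer, the number of vowels in the word, where 'y' counts
as a vowel only when it is the last character of the word.

Solve this problem in {language}.
"""

ocaml_prompt = TEMPLATE.format(language="OCaml")
python_prompt = TEMPLATE.format(language="Python")
```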

In fact, in many cases, you can have a capable LLM do this transformation to turn a language-specific benchmark in Python into a language-agnostic benchmark. So if you do this to a benchmark, you get a language-agnostic benchmark. If you do it to training items, you get language-agnostic training items that you can use for, say, reinforcement learning

(00:09:43)

So just an aside, so we’ve done some recent work on reinforcement learning for code LLMs applied to low resource languages. I have numbers here for like OCaml and Fortran. But the paper has a result for several other models and for a couple of other programming languages as well

But I don’t want to talk about training models today. What I really want to ask is, are there other baselines that we should include? So what I’m doing here is I’m saying that, you know, we train this little model, and it does quite well on OCaml. It does really well on Fortran, it turns out. And what I mean is that it does much better than the model that we trained it up from. But are there other baselines that we should include? So I’m going to get back to this question

(00:10:48)

But to keep things simple for this talk, I’m going to talk about one little model that I trained up on OCaml. So I don’t want to talk about, you know, Sonnet 4, and I don’t want to talk about the Qwen 3 model, which is this weird hybrid thinking model. I’m going to talk about Qwen 2.5 Coder 3B Instruct. It’s a very vanilla little model. It’s actually a really strong model for its size

On a language-agnostic version of HumanEval, when I ask it to solve the problems in OCaml, the base model gets about 10% of the problems right, but the trained model gets 17% right. So this looks great, like, you know, a significant improvement. But the question is, are there other baselines that we should include? And by the way, just to give you a sense of how hard this benchmark is: it’s still really easy for a frontier model, like GPT-5 Mini, which gets 72% of these problems correct

(00:11:39)

Okay, so before we get to other baselines, we’re going to take a little bit of a detour through the main technical part of this talk, which is we’re going to try to understand what is really going on inside these models? And so to answer this question, we’re going to work with the model that I just introduced. We’re going to work with a benchmark, which is a language-agnostic version of HumanEval

By the way, the way I constructed this is I had, I think, Sonnet 4.5 translate these problems and verify that the tests were consistent with the translation. And I’ve clipped off these prompts, but just by reading the prefix of these prompts, you can hopefully get a sense of what these problems are like. They’re still really straightforward. So it’s just trivial, you know, single functions that you can sort of solve in a couple of lines of code

(00:12:39)

And so what I’m going to first do is have the model solve these tasks with two variations of this dataset. In one variation, the prompts will be prefixed with “Write an OCaml program to solve this problem.” In the second variation, the prompts will be prefixed with “Write a Python program to solve this problem.” So we’re going to end up with two parallel datasets, one for which the model is asked to generate an OCaml solution, and one for which it is asked to generate a Python solution

Okay, so just to be clear, Qwen 2.5 Coder 3B Instruct is a chat model. And so when you give it a prompt like, “Write hello world in OCaml,” what the model actually receives is that. So these are the special tokens that are inserted into the text stream to sort of clearly mark what is the user message and what is the response from the model. And of course, what these models actually receive as input is not text. They receive what are called tokens. And what the model really sees is this stream of integers. This is what it actually receives as input, okay?

(00:13:29)

And so the last five tokens of the prompt are that. So every prompt to the model in a single turn ends with im_end, newline, im_start, assistant, newline. And they happen to map into those five token IDs
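For readers who want to see this concretely, here is a minimal sketch, assuming the Hugging Face transformers library and the public Qwen2.5-Coder-3B-Instruct checkpoint, of rendering the chat template and inspecting those final token IDs.

```python
from transformers import AutoTokenizer

# Render the chat template for a single user turn and look at the token IDs.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")
messages = [{"role": "user", "content": "Write hello world in OCaml"}]

# The text the model actually sees: <|im_start|>/<|im_end|> markers around the
# user turn, plus the assistant header appended so the model knows to respond.
text = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(text)

# The same prompt as a stream of integer token IDs; the last five correspond to
# <|im_end|>, newline, <|im_start|>, "assistant", newline.
ids = tok.apply_chat_template(messages, add_generation_prompt=True)
print(ids[-5:])
```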

And what I’m not going to do is the usual thing, which is to ask, what is the next predicted token? What I’m going to instead do, and I’ll unpack this slowly, is ask, what are the intermediate values that the model produces as it’s doing its computation? Okay, so I want to give a very high-level schematic, for people who haven’t seen this, of what a model is. And we’ll dig into this a bit as we go. So we know that models are made of layers. Everyone knows that. So when we feed into a model a prompt such as, you know, “Write fizz buzz in OCaml,” you can think of the model as consisting of, roughly speaking, three parts

(00:14:24)

The first part is what’s called the embedding layer, which maps integers into high-dimensional vectors. So these are vectors in R^n, where n is, I think, 2048 in the case of the particular model that I’m working with. Then there are several layers. These are the transformer layers, or the decoder blocks, that sort of refine what that vector is. And finally, there’s the unembedding layer, which maps from these vectors to a distribution over the space of tokens

So given a prompt such as this, hopefully what we’ll get at the end is a distribution. And I’ve just made up a distribution here. So hopefully let will be more likely than def, because let is how you begin OCaml programs
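For contrast, the “usual thing” that the talk is not going to do, reading off the next-token distribution, looks roughly like this; a sketch assuming the transformers library and the same Qwen checkpoint, with a naive lookup of the first token of each word.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-Coder-3B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

ids = tok.apply_chat_template(
    [{"role": "user", "content": "Write fizz buzz in OCaml"}],
    add_generation_prompt=True,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(ids).logits[0, -1]        # scores for the very next token
probs = torch.softmax(logits.float(), dim=-1)

# Compare the probability mass on the first token of "let" versus "def".
for word in ["let", "def"]:
    token_id = tok(word, add_special_tokens=False)["input_ids"][0]
    print(word, float(probs[token_id]))
```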

(00:15:50)

So I want to quickly show you what this looks like in code. So this is, again, the schematic for this particular model in code. So there’s your embedding layer, which maps from the space of tokens, there’s like 150,000 or so different token types this model takes, into 2,000-dimensional real-valued vectors. Then there are 36 layers along which those vectors are refined. We’re not going to need to get into the innards of a layer, but this is where attention and all of that stuff happens, and you can see some of that here. So there are 36 layers, followed by what’s called a language modeling head, which maps back from these 2,000-dimensional vectors into a distribution over tokens

So here’s what we’re going to do. We’re not going to do the usual thing, which is just get out the distribution over tokens and ask what the most likely token is, or sample from the distribution. What we’re instead going to do is read off the intermediate values, also known as activations or the residual stream, that the model produces as it goes through the transformer layers

(00:16:54)

Okay, and so the way I’m going to do this is using another project that is developed at Northeastern, which is called NNSight and NDIF. NDIF is the National Deep Inference Facility. So NNSight is, you can think of it as a DSL embedded in Python that makes it easy to both read models’ internal states and even manipulate them, and we’ll see manipulations in a bit. And NDIF is an NSF-sponsored hosted GPU service on which researchers in the U.S. can query very, very large models, like the 400-billion-parameter Llama model, for instance, which would be, you know, very expensive to sort of host at home

So I want to show, like, two bits of code in this talk. And on the next slide, what I’d like to show is what it looks like to actually access a model’s internal states in NNSight

So this is really what it looks like. So we have the prompt. We tokenize it to get out that stream of integers. We then feed the prompt to the model. And within this with block, we can perform various manipulations. I’m not actually modifying the model outputs here. What I’m saying is that at every layer, get the output of the layer and give me the last five token positions, which are the special tokens after the user-written prompt. Okay, and I get a list of those
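This is not the NNSight code from the slide; as a rough stand-in for readers without NDIF access, the same intermediate values can be read with plain transformers and output_hidden_states, along these lines.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-Coder-3B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

ids = tok.apply_chat_template(
    [{"role": "user", "content": "Write an OCaml program that ..."}],
    add_generation_prompt=True,
    return_tensors="pt",
)
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# out.hidden_states is a tuple: the embedding output plus one entry per layer,
# each of shape (batch, sequence_length, hidden_size). Keep the residual
# stream at the last five prompt positions for every transformer layer.
acts = [h[0, -5:, :] for h in out.hidden_states[1:]]   # 36 tensors of shape (5, 2048)
```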

(00:17:56)

All right, so great. So we have lots of high-dimensional vectors. We have five per layer, and 36 layers, and that’s just for one prompt. And we have a dataset of, you know, 300 or so prompts, like 154 Python and 154 OCaml. And what I want to do is I want to look at these vectors

So, looking at high-dimensional vectors is hard. The way I’m going to do it is by projecting them into 2D using a principal components analysis, PCA. What you need to know about PCA is that it’s a linear method. You should think of it as learning a change of basis in high dimension, from the sort of standard basis into a new basis where the basis vectors are ranked by the amount of variation in the dataset that they capture. So the first basis vector is the one along which you can observe the most variation in the data, the second basis vector is the one along which you can observe the second most variation in the data, and so on. And to do a 2D projection, we’re going to just take the top two vectors
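A minimal sketch of that projection, assuming scikit-learn, with random arrays standing in for the real activations at one layer and token position:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-ins for the real activations at one (layer, token position):
# one row per prompt, one column per hidden dimension.
ocaml_acts = np.random.randn(154, 2048)
python_acts = np.random.randn(154, 2048)

X = np.vstack([ocaml_acts, python_acts])
proj = PCA(n_components=2).fit_transform(X)   # top two directions of variation

ocaml_2d, python_2d = proj[:154], proj[154:]  # plot as two colored point clouds
```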

(00:19:23)

And I’m going to apply PCA to the last five tokens of the prompts in this dataset. Not to the generated programs, mind you, just to the prompts. So there’s no programs being generated here. We’re just looking at the prompts. And, you know, the dataset is what we’ve talked about already

So let’s have a look. Due to some bugs in Marimo, this is easiest to actually do not in full-screen mode. But Marimo’s great. Where are we? Okay, great

So, I have two sliders here. At this point, all the points are sort of on top of each other. But the blue points are OCaml, the orange points are Python. I can slide along the token position. I can slide along the layers

So starting at the first layer of the model, they’re basically right on top of each other. The prompts are basically identical. One says OCaml, one says Python. But as we sort of go down the layers, we start seeing some sort of separation. The model is actually sort of encoding some information about the language in which it needs to do the task

(00:20:52)

I just wanted to show the tool tips here, right? So sorry, it’s a little small, but this says, OCaml program, rectangular grid of wells, blah, blah, blah. And the one right next to it is Python program, rectangular grid of wells, blah, blah, blah. So the two parallel tasks are right next to each other. This is hopefully unsurprising. Okay, and this sort of holds. And then eventually, other things start happening. I don’t know, I don’t know how to interpret this

Okay, so let’s move along the tokens, okay? So what layer am I at? So that’s token position zero. That’s token position one, sort of the same thing. A little bit more separation. That’s token position three, sort of the same thing. I’m counting from the back

But then at some point, I think it’s around here, if I can drag this, we start seeing a clearer separation between the Python intermediate values and the OCaml intermediate values. And it holds pretty much across the layers. And it seems to be the case here. Seems to be the case here, clicking is hard. It’s the case at the last token, it’s the case here

(00:22:29)

All right, so to be clear, this is dimensionality reduction. This could all be a mirage. We’re going to see that it’s not. But, you know, there is some sort of separation appearing between the prompts that say do it in Python and the prompts that say do it in OCaml

Okay, so here’s the first experiment that we’re going to do. So, let’s see. The average Python prompt, and I’m doing this in the high-dimensional space, its intermediate value is really somewhere around there, right? And the average OCaml one is sort of there. So what I’m going to do is I’m just going to compute that difference vector and add it. There’s probably an optimal layer in which to do this, and I did some exploration. But it was getting too complicated, so I’m going to be stupidly lazy. I’m just going to do this vector addition at every single layer

So this technique is called activation steering. It’s been used a bunch for natural language tasks. So we’re seeing it for some programming tasks, and we’re going to do it for a couple of other programming tasks as we go along

(00:23:30)

So again, I want to quickly show you what we’re computing. So we compute the set of, basically, the list of intermediate values for Python, the list of intermediate values for OCaml, using a version of the code that I showed you earlier. And I compute this language-changing patch. And the patch is literally this. So we start at Python. We step away from Python. And we move toward OCaml. That’s it, okay? So minus Python plus OCaml
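In code, the patch really is just a difference of means; a sketch assuming the activations have already been collected into tensors with the shapes described above.

```python
import torch

def language_patch(python_acts: torch.Tensor, ocaml_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means steering vectors.

    Both inputs have shape (num_prompts, num_layers, 5, hidden_size); the
    result has shape (num_layers, 5, hidden_size): minus the Python mean,
    plus the OCaml mean, independently at every layer and token position.
    """
    return ocaml_acts.mean(dim=0) - python_acts.mean(dim=0)
```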

And again, I want to quickly talk through the code that we used to actually do this intervention in NNSight. So let me start by, okay, let’s start by ignoring the loop. So we’re going to generate output from the model up to some number of tokens. If the list of patches is empty, we’re going to skip the loop. And the output is just going to be the normal output from the model

(00:24:26)

So let’s talk about the case when there actually are patches. Essentially, what we do is… Where are we? Right, so patch i is the patch for layer i. And j is the index of the token. And because we’re counting from the back, I need to do something to deal with the negative indexing. But basically, what I do is, at layer i, at token position, you can read this as negative j, I add in the right patch to the output. So I’m just really adding that computed patch vector to the output of the layer
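Again, this is not the talk’s NNSight code; here is a rough sketch of the same intervention using plain PyTorch forward hooks on a Qwen-style transformers model (where model.model.layers is the list of decoder blocks), simplified to patch only on the full-prompt forward pass.

```python
import torch

def add_steering_hooks(model, patch):
    """Add the (num_layers, 5, hidden_size) patch to each layer's output at the
    last five positions. Simplification: only patch when the layer sees the
    whole prompt (skipping the single-token passes during decoding)."""
    handles = []
    for i, layer in enumerate(model.model.layers):
        def hook(module, inputs, output, i=i):
            hidden = output[0]                        # (batch, seq_len, hidden_size)
            if hidden.shape[1] >= patch.shape[1]:
                hidden[:, -patch.shape[1]:, :] += patch[i].to(
                    device=hidden.device, dtype=hidden.dtype
                )
            return (hidden,) + output[1:]
        handles.append(layer.register_forward_hook(hook))
    return handles   # call h.remove() on each handle to undo the intervention
```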

So here’s what happens. What I’m going to now do is feed the model these prompts without any language-specific direction at all. And what ends up happening, and I think I have just the first three examples from this dataset, and I’ve clipped the generations so they don’t get too long, is that we’ve basically changed the default programming language of the model from Python to OCaml. The prompt doesn’t say write this in OCaml, but adding in that patch makes the model generate OCaml. Okay, so a couple more examples here

(00:25:27)

All right, so let’s go back, so remember I started talking about evaluations and fine-tuning models and baselines. I’m going to do a different kind of patching experiment next. This was just a warm-up just where I wanted to sort of show you all the details. I’m going to present the next one a little more quickly

We’re going to do the same kind of thing, you know, we need two parallel datasets. I’m going to compute this sort of difference of means, but the two parallel datasets here are going to both be… Oh, okay, let me skip this for a moment

Right, so I’m going to have two parallel datasets. They’re going to both be in OCaml. Again, just the prompts, not the solutions, but it’s going to be OCaml problems that the model solves correctly and OCaml problems that the model fails to solve correctly. And what I’m going to patch the model to do is step away from the intermediate values on the problems it solves incorrectly and step toward the intermediate values on the problems it actually solves correctly. And then generate and evaluate, and that’s the result that I get

(00:27:03)

So I’m relieved. I was worried that it would, you know, reach as high as our trained model. I’m glad it didn’t. It means that it wasn’t wasted work. But, you know, the question to ask is, when you’re training up a model, there are perhaps other baselines that you should consider to ask the question, what is it that you’ve really achieved, right? If you can get a bunch of performance just by doing this, I mean, you wouldn’t want to do it, but clearly I’m not endowing the model with new knowledge as I do it. You have to ask, are you just aligning the model? Are you actually endowing the model with new knowledge? So I think experiments like this can give you more clarity into what you’ve achieved

(00:27:42)

I also did some, like, you know, prompt engineering by hand. You can also use various prompt optimizers to do this, but I think I know how to optimize an OCaml prompt. So in this case, I think what I did was I told it what libraries I had installed. So I said, I don’t have any of the Jane Street public libraries installed. Because, you know, the language really has two dialects. And you’ve got to know, when you’re evaluating the model, are you evaluating the fact that your environment doesn’t have the Jane Street libraries installed, or do you want to encourage the model to write in that way? You just have to understand what it is that you want and make sure you’re measuring the right thing

Okay, so yeah, so basically, you know, my conclusion is that our effort doing RL was worthwhile on this model. And I think that’ll hold for the larger models as well. But, you know, the gap to the baseline has been narrowed from this experiment

(00:28:39)

Okay, so I want to move on to a third application of activation steering, which I think is actually the most interesting of these three, which is using activation steering to understand why models mispredict types. And I want to be precise about what I mean here. I do not mean OCaml-style type inference. What I mean is predicting type annotations in languages with explicit type annotations. And I’m going to work with Python and TypeScript

So the better your model is at the task, the cleaner the results you get. Models are really good at Python and TypeScript, which is why I’m picking Python and TypeScript as opposed to, I don’t know, OCaml or Haskell, for instance

(00:29:34)

So the question that I’m asking is, on a prompt such as that, can the model fill in the right type annotation? And, you know, hopefully people have an intuitive sense that models sometimes understand program structure, but in other cases, they just sort of go off variable names. And in a prompt like that, where the variable is named n, but it’s clearly being used as a string, many weaker models will just get that wrong and say, oh, n is an integer. But I want to sort of disentangle, like, you know, why is this the case?

So again, datasets. So again, we’re doing activation steering, so there are two parallel datasets. So dataset one is a combination of two datasets. One is TypeScript programs from The Stack, which is this GitHub dataset, TypeScript programs that type check. And for Python, there’s a dataset called ManyTypes4Py, which is a dataset of Python programs with many type annotations, as the name suggests. So again, both are programs that type check, using tsc or using Pyright

(00:30:47)

And for my negative dataset, so this is harder to construct. But what we do is we apply a bunch of semantics-preserving edits to the positive examples until the LLM mispredicts. So it’s things like this. We take this program, which type checks. We rename the class Point to some generic name like Type0, making sure we do the right thing at uses of the type. We’ll rename a variable from x to temp. And eventually, we’ll start dropping type annotations as well. So we do this sort of grab bag of semantics-preserving syntactic mutations to the program until type prediction breaks. And those breaking programs, which are model-specific, become the negative dataset for the model
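As a purely hypothetical illustration of this kind of edit (the snippet and the names are made up, not drawn from the actual datasets):

```python
# Hypothetical before/after illustrating a semantics-preserving edit. The model
# is asked to fill in the marked annotation; renaming the class and a variable
# changes no behavior but can break a weak model's type prediction.

# Before the edit: the model tends to predict `Point` for the annotation.
class Point:
    def __init__(self, x: float, y: float) -> None:
        self.x = x
        self.y = y

def origin() -> "Point":      # <- annotation the model is asked to predict
    return Point(0.0, 0.0)

# After renaming Point -> Type0 and x -> temp (same behavior), a weaker model
# may no longer predict `Type0` here.
class Type0:
    def __init__(self, temp: float, y: float) -> None:
        self.x = temp
        self.y = y

def origin0() -> "Type0":     # <- same prediction task, now often mispredicted
    return Type0(0.0, 0.0)
```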

(00:32:15)

And so, okay, it turns out, let me zoom in here. And we do a bunch of ablations in the paper covering various different kinds of edits. We’re able to actually correct a whole bunch of type predictions using the same activation-steering methodology. So we step away from the average intermediate values for the prompts on which the model mispredicts types and towards the average intermediate values for the prompts where the model gets the type prediction task right

The paper goes into details about how we class balance and, you know, what the held-out set is, but it works pretty well. How well depends on what layer you patch at. And out here, my student was doing something much more sophisticated: in my earlier experiment I just patched every layer because I was lazy; out here, we’re trying to do a targeted patch to one layer, or maybe three consecutive layers. And so you get different results when you patch different layers, but you get up to, like, you know, 60% accuracy on a bunch of these type prediction tasks. And by the way, the baseline is zero. So these are all type prediction tasks where the model just gets it wrong, but we get 60% of them right with the patch

So this is not the most interesting result from this paper. So I want to show you two more results

(00:33:05)

So this graph is a bit hard to read. So in languages like TypeScript and Python, and really many other languages, when I say predicts the type correctly, what I really mean is the model fills in the type that the program had before. But in these gradually typed languages, there’s always a type that you can write down which will make the type checker happy, which is any. So I really mean, does the model fill in the type that the programmer had written? And so what I have here is, let’s see, on the x-axis… One sec, let me just figure out how I want to pronounce this. Yeah, so on the x-axis, we ask, does the model mispredict a type that’s actually okay? So if it mispredicts a type and it predicts any, that’s actually okay, even though it is technically a misprediction. And on the y-axis, we have the accuracy of steering

(00:34:01)

So steering accuracy is high when the base model predicts a type that actually leads to a type error. So there are cases where the unsteered model predicted a type, the type was not the expected one, but it was still okay. For example, it predicted any, which is not the type that we wanted, but it’s still technically an okay type to predict. We aren’t able to correct those. But when the base model predicts a type that was just wrong and would have led to a type error, we are able to correct those much more often. So really, what the steering is doing is not improving type precision. It really is correcting type errors. And those are just two different things. That’s good. I mean, that is actually what I set myself up to do. So I think that’s interesting

(00:35:32)

And finally, so remember, I’m working with Python and TypeScript datasets. I think this figure has both Python and TypeScript, and these numbers were from TypeScript. But we really have two sets of steering vectors, one for correcting Python type errors, and another for correcting TypeScript type errors

So there’s another question you can ask, which is what happens when you try to correct Python type errors with a TypeScript steering vector, or vice versa? And what we find is that it’s just as effective no matter which way you do it. And I think that’s pretty cool. Because I think that to me, that sort of strongly suggests that there’s some sort of like shared representation of type that the model is learning across at least these two languages, and I would speculate other languages as well. But this is sort of a strong signal that that is the case, as opposed to the model sort of representing them in sort of entirely different ways internally

So, you know, wrapping up, I think we’re at this point where we just have a surface-level understanding of what LLMs can and cannot do, and we’re just beginning to dig deeper with interpretability techniques. Most interpretability research is not on programming tasks, but I think the formal properties of code, like the ability to do semantics-preserving code edits, to actually test generated code, to run the type checker, et cetera, make programming languages a really good platform for studying model internals. And again, we’re just beginning to scratch the surface of what can be done here

(00:36:44)

However, I think asking what tasks an LLM can do is actually a really narrow question. We also need to understand humans’ mental models of LLM capabilities to truly understand what is possible. And that’s going to be the second and shorter part of my talk

So again, to really understand our work here, we need to turn back the clock again to, let’s see, I think this is early ‘23. Oh, actually, sorry. I want to have a quick interlude about some more recent work from us

So this is like a point about benchmarks. So like what we normally want to ask is, can a model do some task X? The problem with benchmarks is that they really ask the question, can a model produce the correct answer on a prompt P, which is not quite the same thing

(00:38:17)

So just to give an example, here’s a prompt from a benchmark called ParEval. It’s like this, it’s a benchmark for parallel programming. It has like CUDA benchmarks, ROCm benchmarks. This is an OpenMP prompt. And it’s a hard benchmark. But, you know, maybe the model will do a lot better if you just add a little bit more detail to this very terse prompt

And so something that we did recently is that we mechanically dialed up the detail to get all benchmark results to pass with high reliability, then had the model generate, and then dialed the detail back down, using a model, to get prompts that were less and less detailed. And we were able to generate these nice curves that show that as you drive down the level of detail, on various models, performance goes down in a predictable way. And so I think this is one way to get at, you know, what you need to put into a prompt to solve a particular family of tasks

(00:39:19)

Okay, but I want to actually talk about, you know, prompts and people. So there’s like a lot of research in the space. According to my, you know, Sonnet summary of like 60 paper titles from this year, like 20% of the papers are about LLMs and CS education

I’ve been studying student–LLM interactions since ’23, and I think what’s interesting about our work in this space is that it lends itself to interesting secondary analyses, which I’ll show you in a little bit

But again, I want to introduce the study that we did back in early ’23 with 120 students who had all completed CS1 in Python and no other course. At Northeastern, we have students who learn Scheme; those students were excluded. So it’s just people who know Python

Okay, so just a reminder, what is early ‘23? ChatGPT had just been released. None of these students had used GitHub Copilot. None of them had used ChatGPT. Many hadn’t even heard of ChatGPT. Like, you know, college students have better things to do than like, you know, keep up on the latest tech news. And so in some sense, like, the high-level question is, how do students with zero LLM experience but basic programming knowledge do at prompting a state-of-the-art code LLM from ‘23?

(00:40:10)

But I want to talk to you about the experimental design. So the model is, it’s the largest OpenAI Codex model. It was easily the best code LLM of the time. We had 120 students from three universities

When you ask a student who has just finished CS1 to use a model to solve a task, you have to be careful what problems you give them. You can’t just tell them, write a web server. They’ll say, what’s a server? So we were careful to pick trivial programming problems taken from their homeworks and exams. So this, for example, is one of the problems

The task that we gave them was only to write the natural language prompt. So we were focusing entirely on the prompt writing ability. So we did not allow them to edit code. If the model produced wrong output, they could either give up, they could roll the dice again, models are non-deterministic, or they could revise the prompt to try to get the model to pass

(00:40:55)

Also, we didn’t want to ask, could they verify the code, which is also something that one needs to do in real life. We have a separate study on that. But for this, we would run unit tests for them and show them all the unit test results. There were no hidden tests here. This was a Zoom study, 60 minutes to do six tasks. Students could retry as many times as they wanted and give up whenever they wanted

So here, let me just sort of zoom in and show you what the UI looked like. So there’s sort of two screenshots here. So, oh yeah, so when you want to like study student prompting ability, you can’t sort of tell them what the task is in English, because they might just parrot back what you told them. So we showed them test cases, the test cases we were using to evaluate correctness, and we showed them the function signature. And we said, you know, write a description of this function

They hit Submit. Codex thinks for a bit, produces code. In this case, it’s wrong, because I think they wrote dictionary instead. So you can see how we formatted it. We just sort of plugged it in as a docstring. We showed them expected output, actual output, and they could try again and move on. This is basically all there is to the study. It’s like some other survey questions as well

(00:41:50)

Okay, so quickly, how do students do? Perhaps unsurprisingly, they don’t do super well. So there’s various ways of like slicing the data. So one question that you can ask is like, after infinite retries, how do they do? So that’s what we call the eventual success rate. That’s sort of how they do it. It’s a wide distribution

Another question you can ask is, if you count every failed attempt as a failure, how did they do? The success rate is much lower. And remember, this is with us giving sort of perfect feedback, you got it wrong, so we’re removing the fact that in real life, you need to actually evaluate the model’s output yourself

(00:42:54)

Okay, but I think what’s more interesting here is the dataset. The paper goes into a whole bunch more details, like demographics, et cetera, but what’s really interesting here is the dataset that we managed to gather from this experiment. We’ve published this anonymized dataset of 2,000-plus student prompts that has the full prompting trajectory for each student, including the model-generated code, the test results, and so on

And one thing you can do with this dataset is you can turn it into a benchmark. So what we do is we take the first and last prompt from each trajectory, and we get a benchmark which has several prompts of varying quality per task. And in this sense, this benchmark is unique. In most benchmarks, there’s one task and there’s one prompt for the task. It’s actually not the case for ParEval, but it is the case for most benchmarks; there’s a sort of identification of prompt and task. That is not the case here

(00:44:16)

And since we have a sense of what the good prompts are, these are the prompts on which, you know, Codex succeeded, and what the bad prompts are, these are the prompts on which the students eventually gave up, we can plot these curves that show how various other models do when we resample them on these prompts

So what I think is interesting, so let’s see, okay, so let’s look at, say, GPT-3.5, because it’s, you know, another OpenAI model. What I think is interesting is that if you look at the green line, these are the prompts on which students succeeded. Well, actually, a bunch of them were just really low-quality prompts. The students just got lucky. Perhaps worse is that for some of the prompts on which the students failed, actually a bunch of them, like, I don’t know, 25% or so, were actually really good prompts. The students just got unlucky in the moment and gave up. They could have just rolled the dice again, and just as you get a high pass rate with GPT-3.5, I think they would have also gotten a high pass rate with code-davinci-002

(00:45:17)

Something else that’s sort of clear here is that the first prompts where a student solved a problem in one shot, in a single attempt, tend to be more reliable than students’ last successful attempt. So a prompt on which a student succeeded, but it took them multiple iterations to get there

The intuition here is that, you know, you write a great prompt, you solve it, whereas if you take multiple iterations, you end up just sort of dragging it on and adding more and more detail. There’s a thing that I often find when I teach programming: students will write code, it won’t work, so then they’ll write more code, and then it won’t work, and then they’ll write more code, and they’ll just keep writing code. And the thing to do is, like, no, no, stop, just throw away everything you have. It’s really hard to do

(00:45:54)

Students tend to do the same thing with prompts, which is possibly worse than just adding code, right? Because we know that with these models, if you add more and more context, at some point they just sort of randomly ignore what’s in the context. So that’s a failure mode that we observe here as well

Okay, so one last thing I want to mention, which is the other thing you can do with these prompt trajectories is that you can ask, you know, what is it that really makes student-written prompts unreliable?

(00:46:44)

So I want to introduce a problem from our study. It’s the total bill problem. It’s hopefully obvious what this problem is. It’s like a grocery bill. You’ve got to multiply quantity by price, and then, it says so right there, add the sales tax. So you’ve got to write down the prompt to do this
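For reference, a hedged sketch of what a correct solution might look like; the exact signature, list layout, and tax rate in the study’s problem may differ.

```python
# Hypothetical reference solution for the "total bill" task: each item is a
# list like [name, quantity, unit_price]; the signature, list layout, and the
# 5% tax rate are assumptions, not the study's exact problem statement.
def total_bill(items: list, tax_rate: float = 0.05) -> float:
    subtotal = sum(quantity * price for _name, quantity, price in items)
    return round(subtotal * (1 + tax_rate), 2)   # clue seven: two decimal places

print(total_bill([["milk", 2, 3.49], ["bread", 1, 2.99]]))  # 10.47
```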

During the study, there were 13 students that attempted this problem. Many of them made multiple attempts. So we have a large dataset of attempts at this problem. Some of the students eventually succeeded. Some of the students eventually failed

(00:47:11)

What’s hard here is analyzing this kind of natural language text. There’s just a huge amount of variation in what people have written. It took us a year and a half before we finally sort of had a breakthrough in like how to understand these things

And what we came up with is that, and this was my grad student, Francesca, who came up with this, is that the question we ask ourselves is, you know, what are the essential set of facts, we call them clues, that get added, removed, or updated in every attempt? So students are making sort of ad hoc changes to prompts in various ways. And I’ll show you one of these prompt trajectories in a moment. But what is the actual information content of these prompts that they’re changing?

(00:48:26)

And for any problem, we can actually come up with the set of facts or clues that are necessary to solve the problem correctly. And for the total bill problem, we came up with eight of them, just based on analysis of the successful prompts. It’s things like: inputs are lists, list structure explained, round to two decimal places; you can sort of see them here. And what we do is we label every edit by the delta it makes to the set of clues

And let me zoom in here. And I’ll walk you through one of the trajectories

So what this picture shows is the trajectories of all the students as they attempted this total bill problem. These are the eight clues. The way to read this graph is: a diamond represents a student making a change to their prompt; a circle represents the model generating code, our platform running tests on the code and presenting the test results to the student, who then observes them and makes another edit, which is another diamond. So it’s this sort of alternating graph

(00:49:43)

So green is, of course, the successful state. Reds are cases where students gave up. And we cluster the failures together. So multiple students end up in this state because it’s the same failure. The same set of tests are failing in the same way

So let’s just talk about student, who is this? Student 23. So student 23 starts out. They look at the initial problem description. And that is the prompt that they write. It’s like, you know, function takes on a list, blah, blah, blah, okay, so they’ve almost gotten it. They add every single clue except for clue number seven, which is you’ve got to round the answer to two decimal places

So they observe a failure. They make an edit. What’s the edit? They change tax to taxes. That’s the change that they make. It absolutely doesn’t change the information content of the prompt, so there’s no sort of annotation here

(00:50:37)

Model produces another result. It’s stochastic. It fails in a way that’s different from the original failure

The student observes this failure, and they make an edit. So they go back to tax, and then they add “which is the last two components of the list.” But they’d actually already said that. So they sort of modify their description of the list structure. And then they’re back to where they were before. They observe the same generation and the same failure

And what happens then? So then they make another edit. They do this sort of rewording here. It really doesn’t… well, what do we think they’ve done? They add clue four, but they actually delete the information about adding up the results. And at this point, they give up

(00:51:40)

And it sucks that they’ve given up here, because they’re actually at a state where most of the other students sort of realize in one more attempt that I’ve just got to add in clue seven, which is round to two decimal places. I think, you know, intro CS students, they don’t have the mental model of like, oh, when I see a number with 500 decimal places, it’s just like that floating point thing I’ve got to round. Like, they haven’t actually learned floating point yet in any detail

Yeah, so my takeaway from this is that students just don’t have a good mental model of what the model already knows or what the priors of the model are. And we do this for all the problems in the dataset. And basically, we come up with findings such as, if all the clues are present with high probability, your prompt will solve the problem. If even one clue is missing, with high probability, you will not solve the problem. If you get stuck in a cycle where you sort of revisit the same error state, with very high probability, you will give up. There’s a bunch of these sort of findings in the paper

(00:52:40)

So just to conclude, everyone knows someone who claims that LLMs make them, like, you know, 3 to 4 times more productive. I claim that they make me significantly more productive. But we’re at a point where it’s not clear, for the randomly sampled developer, what benefit they get from LLMs. And if you’ve looked at some of these controlled studies, like the recent METR study, for example, there’s evidence that, at least in controlled environments, LLMs are actually slowing developers down. The METR study includes some of the compiler hackers who work on NNSight and NDIF. So which is… Well, that’s just a fact

So anyway, but I think we’re at an interesting point. We’re at a point where coding agents are actually exploding in popularity. So here’s a graph from a recent paper that we wrote. It’s really easy to actually mine GitHub commits by agents. They’re all, it’s proudly signed, co-authored by Claude Code, or, you know, there’s a link to a ChatGPT website. So it’s very, very easy to tell. I mean, I suppose it’s possible that people are just writing co-authored by Claude Code on their commits, but I don’t think they are

So in just the first three months or so since Claude Code was released, we managed to gather, I think it was, 1.3 million commits, and there are many more in month number four. So there’s an enormous amount of data being generated by agents, and we’re beginning to look into how developers are actually using this in the wild in open source. But that’s, you know, work still to be done

So that’s all that I really wanted to talk about today. Yeah, I think there’s only the extras left to get into. Thank you all for coming to the talk. I would be happy to take questions from the audience if anyone has any. Please, I was told I can throw this at people

Audience: Yeah, go for it

Arjun Guha: All right

Audience (00:54:19)

Oh my god. Thank you. Hello, hello. Okay, how do you see kind of like, I think I’m especially interested because the second half of your presentation you talked about working with CS1 students. How do you see the face and the shape of CS education changing with the introduction of large language models? I went to Northeastern, I started in 2020, and then I definitely saw a difference in how we, like worked with the introduction of ChatGPT, so I’m just curious how you see it now and if it’s like it’s gonna be symbiotic, if it’s gonna be like just how you see it basically?

Arjun Guha (00:54:55)

I mean, I think… Okay, there’s probably a way to use them well, but I don’t think we know how to yet. I think that there’s, you know, I sincerely believe that they can increase programmer productivity, but the goal when you’re in college is not to be productive. Like I’m not looking for 50 implementations of Pac-Man, for example. When we assign a task like that, the goal is to get students to, like learn. And when the model does the task for you, it’s not clear to me that one learns

Now, it may be that the goal at hand is to learn how to use the model. I mean, I teach a class where students are learning how to use models better, and it’s fine when that is the task. But I think for most of the things that people are learning, you know, the model can like short-circuit learning. So I think, you know, we face an uphill battle

Audience (00:55:50)

So the way you analyze things was in terms of these clues. And you were talking about how, oh, with high probability, if someone had all the clues then they’re going to succeed. So it’s a good way to structure your analysis of the student’s progress. Do you have an idea of what the students’ mental model was for what progress they were making and like were they thinking in terms of adding or removing, or modifying clues, or-

Arjun Guha (00:56:13)

No, no. Great question. So that was the first part of the paper that I didn’t talk about. So very briefly, when you talk to students after they do the task and ask them what their mental model was, well, first, at the time, people said all sorts of things; like, no one said language model. And when you ask them, you know, why is it that it’s hard, they would sort of suggest that there were syntactic things or, like, vocabulary issues that they had

So the first part of the paper actually does this like causal intervention experiment where we take prompts with like bad vocabulary and substitute them with good vocabulary and the other way around. So for example, you know, they’re doing Java… Sorry, I’m sorry, they’re doing Python, so we would take prompts where it says like array and change it to list and other like more peculiar things like people who say, like set of characters instead of string

And we find that those kinds of interventions, in general, have no impact. So I don’t think students realize that the problem they have is that they’re not conveying the right information. The finesse of the grammar, I mean, it probably matters a tiny bit, but it doesn’t really matter

Audience (00:57:31)

That makes a lot of sense given that like when you’re an intro programmer it’s like, oh you forgot the colon or you forgot to indent-

Arjun Guha (00:57:37)

Yeah, yeah, you can miss a period, miss a colon, it doesn’t matter. Yeah. There’s a question from Jacob at the back

Audience (00:57:49)

I want to go back to the activation steering for a second

Arjun Guha: Sure

Audience : I’m curious if you tried inverting the patch and seeing what you got

Arjun Guha (00:57:59)

What do you mean by invert?

Audience (00:58:00)

So like the one where you are correcting the type inference

Arjun Guha (00:58:06)

Oh, no, not in that paper-

Audience (00:58:08)

I’m wondering what, you know, what it was that like made them better at it. I don’t know, I’ve seen a bunch of papers that go and do it the other way around

Arjun Guha (00:58:15)

Yeah, yeah. So we didn’t do that. What we do, sorry, and I skipped over the slide, we do like a random baseline. So we try to patch with random noise, also like random noise of the same magnitude, to see if that intervention, like what effect it has. And the answer, I mean, it has an effect in certain cases, but it doesn’t have the sort of pronounced effect of correcting type errors. It would be funny to try to introduce type errors, I guess. Yeah. But we haven’t done that one

Audience (00:58:41)

Thanks

Arjun Guha (00:58:43)

I’m happy to take one more question

Audience (00:58:47)

So earlier when you were doing the addition of a vector thing and looking at the PCA, what is your interpretation of what it means when the two groups are separated? The two groups being the OCaml group and the Python group. I don’t know how to make sense of it

Arjun Guha (00:59:14)

So these are giant stacked classifiers. There is some sort of hyperplane being learned in the high dimensional space that the model is using to classify this set of prompts. And I’m relying on my intuition that if I have two sets of prompts, one say do it in OCaml, the other says do it in Python, the model must be separating them some way. I’m lucking out in that I’m finding that the separation is happening sort of so consistently at every layer. But it’s just classification

Audience (00:59:52)

So it’s kind of like magic?

Arjun Guha (00:59:55)

So, I mean, as far as classification is magic, which I don’t think it is. Yeah, it’s just, we’re sort of exposing the fact that it’s a bunch of classifiers under the hood, very, very complicated classifiers. But it’s just classifiers. All right, thank you

The next great idea will come from you