I recently spoke with a friend — I'll call him "Gert" — who is a very senior data scientist at a Big Tech company. Part of our conversation went like this:
Me: Execs don't realize that LLM use cases are narrow and brittle.
Gert: Yeah, but LLMs are good for coding. Junior engineers can get a lot done using AI to write initial code or test cases.
Me: Have you had one write code for you?
Gert: Well, no, not personally.
Me: Try it! The code I've seen is terrible.
This exchange aligns with two observations about the typical discussion of LLMs:
Many stakeholders imagine the value of LLMs for someone else, while they personally have limited experience with their use cases.
Discussion of LLM value relies too much on generalization from single use cases. A common pattern is, "It is valuable for me, so it must be valuable for [something else]."
In a previous blog post, I tried ChatGPT on a simple R programming exercise (taken from my Quant UX book). Two of my conclusions were:
"[ChatGPT code for R] contains so many errors that one would be better served to start over"
"[using LLM code requires high] expertise to debug some other entity's code"
Rather than saving time and energy, I found that the ChatGPT code in R was not only bug-ridden but nearly unintelligible. Only an advanced programmer (like Gert above) would be able to sort it out ... and yet such a programmer has no difficulty writing code, and they do not need the LLM at all.
But that was ChatGPT. How about Google Bard? In this post, I'll try the same problem on Google Bard, and see whether my two main conclusions still apply.
(tl;dr: yes, IMHO it still requires more advanced programming skills to use AI-generated code than to write your own.)
(Update: see the Appendix at the end of this post for an update after the release of Bard + Gemini.)
Bard's Answer
I gave Bard the same problem I had used with ChatGPT:
Using the R language, write a function that will take a typical license plate string and convert it to the NATO phonetic alphabet (https://en.wikipedia.org/wiki/NATO_phonetic_alphabet). For example, the license plate input "QUANT 1" would give the result "Quebec Uniform Alpha November Tango One".
LLM output is not generally reproducible — although LLMs are deterministic systems, the determinants, such as system architecture and random number seeds, are not shared with users — but the following is what Google Bard gave me on 11/26/23:
Bard then gave a test case and references:
I share and evaluate Bard's R code below, but first, let's look at Bard's references.
Bard Makes References
Bard shared two references: a book on Google Books and a Python repository on GitHub. My first thought was "References? Great!" ... but then I clicked on the links.
Here's the first reference that Bard gave me, a link to Google Books:
What? That's a UK political history book. What does that reference have to do with programming? Could it be that it references the NATO alphabet?
I clicked "Search Inside" to see whether I could search for "NATO". But "Search Inside" didn't link to the book as listed. Instead, it linked to something else that was unrelated, a US Navy publication from the 1950s:
In short, the first of Bard's two references was garbage.
How about the 2nd reference? That pointed to this GitHub repo, which is directly related to the coding problem (although in Python rather than R):
That is a good reference for coding the NATO alphabet ... except that when you look closely, you'll see that the reference doesn't match Bard's solution!
The screenshot below shows a portion of the code in that Python repo. There are at least four substantial differences between the reference and Bard (setting aside the R/Python difference):
The reference uses different variants for some entries (e.g., it uses "alpha" whereas Bard uses the more official "Alfa"; it uses "ate" instead of "eight" or the official "ait"; and so forth)
The reference includes many additional symbols such as "tilde"
The reference code doesn't capitalize its translations
The reference code [in portions not shown] preserves capitalization and announces it in the translation (such as "UPPERCASE Quebec uniform alpha november tango")
In other words, it is one reference to code that translates the NATO alphabet but is not a good reference for Bard's solution.
So, for its references, I would score Bard at -0.5 out of 2.0. It listed one somewhat suitable reference but did so misleadingly (because its solution didn't match the reference), and I score that as 0.5 out of 1.0. But then Bard refers to a wildly irrelevant book, compounded by an inaccurate link in the Google Books entry itself. For being misleading, I score that reference as -1.0 out of 1.0. That gives a total of -0.5 / 2.0. (If you want to score the irrelevant link as 0, then the total would be 0.5 / 2.0. In any case, that is an "F" grade for its references, which I find to be shocking for Google.)
A much better reference comes rather easily from ... wait for it ... Google Search. That is the official NATO alphabet reference (warning: PDF link).
OK, so Bard falsifies references. That's bad, yet is sadly unsurprising for an LLM. But how about its code? Let's see.
Testing Bard's Solution
Here is the complete function from Bard, followed by 3 test cases. The first test case is one that Bard wrote, and the next two are ones I wrote. (You can copy the R code using the "copy" button in the upper right.)
# This version exactly as written by Google Bard
nato_license_plate <- function(license_plate) {
# Split the license plate into characters
license_plate_chars <- str_split(license_plate, "", simplify = TRUE)
# Convert each character to its NATO phonetic equivalent
nato_chars <- lapply(license_plate_chars, function(char) {
nato_alphabet <- c("A" = "Alfa", "B" = "Bravo", "C" = "Charlie", "D" = "Delta",
"E" = "Echo", "F" = "Foxtrot", "G" = "Golf", "H" = "Hotel",
"I" = "India", "J" = "Juliett", "K" = "Kilo", "L" = "Lima",
"M" = "Mike", "N" = "November", "O" = "Oscar", "P" = "Papa",
"Q" = "Quebec", "R" = "Romeo", "S" = "Sierra", "T" = "Tango",
"U" = "Uniform", "V" = "Victor", "W" = "Whiskey",
"X" = "X-ray", "Y" = "Yankee", "Z" = "Zulu",
"0" = "Zero", "1" = "One", "2" = "Two", "3" = "Tree",
"4" = "Fower", "5" = "Fiver", "6" = "Six", "7" = "Seven",
"8" = "Eight", "9" = "Niner")
if (char %in% nato_alphabet) {
nato_alphabet[[char]]
} else {
""
}
})
# Combine the NATO phonetic equivalents into a single string
nato_license_plate <- paste(unlist(nato_chars), collapse = " ")
return(nato_license_plate)
}
# test cases (the first one is from Bard; next two from me)
nato_license_plate("QUANT 1")
nato_license_plate("qUant 1")
nato_license_plate("QU@NT 1")
When I run that function, I get an error that str_split() can't be found:

That's a simple oversight of not loading the required package (stringr), which is not difficult to fix. So I added a line to use the package and tried again:
# added package call before the str_split() function
# ...
library(stringr) # FIX: required package
license_plate_chars <- str_split(license_plate, "", simplify = TRUE)
# ...
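An alternative fix, which I would generally prefer, is to namespace-qualify the call instead of loading a package inside the function body (a style suggestion of mine, not something Bard offered):

```r
# Alternative fix (my suggestion, not Bard's): namespace-qualify the call,
# which avoids calling library() inside a function
license_plate_chars <- stringr::str_split(license_plate, "", simplify = TRUE)
```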
Now the function runs, but it returns all blanks instead of a correctly translated string:
This tells us that something else is going wrong. I'll return to that in a moment, but first, let's take a look at Bard's test case.
Look at Bard's Test Case Again
Bard wrote a test case for us, nato_license_plate("QUANT 1"), and it gave the supposed output from it:
As we saw in the previous section, the function doesn't work. That's problem #1: LLMs fabricate both their test cases and their results.
Problem #2 is that the test case itself is not great. Although it came from my input prompt, a much better result would be multiple test cases (my prompt said "for example" and did not say "only this example").
Problem #3 is slightly more subtle: the supposed output does not match Bard's function, even if the function were working. You can see that the test output shows "Alpha" whereas Bard's function uses "Alfa", as shown in the code above. Again, Bard simply copied the desired output from the prompt, regardless of whether its code could generate it.
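To make Problem #2 concrete, here is the kind of test coverage I would want to see — a sketch of my own, written as assertions against the corrected function (nato_license_plate2) that I define in the next section:

```r
# Multiple test cases as assertions (my sketch; nato_license_plate2 is the
# corrected function defined below)
stopifnot(nato_license_plate2("QUANT 1") == "Quebec Uniform Alfa November Tango One")
stopifnot(nato_license_plate2("qUant 1") == "Quebec Uniform Alfa November Tango One") # mixed case
stopifnot(nato_license_plate2("QU@NT 1") == "Quebec Uniform @ November Tango One")    # unknown symbol
```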
Fixing All the Problems
At this point, I'll simply fix all of the problems in the code. Following is complete and functional code. I append "2" to the function name to keep my version of the function separate from Bard's code. After the code block, I comment on each change that I made.
nato_license_plate2 <- function(license_plate) {
# IMPROVE: declare this outside the lapply() loop
nato_alphabet <- c("A" = "Alfa", "B" = "Bravo", "C" = "Charlie", "D" = "Delta",
"E" = "Echo", "F" = "Foxtrot", "G" = "Golf", "H" = "Hotel",
"I" = "India", "J" = "Juliett", "K" = "Kilo", "L" = "Lima",
"M" = "Mike", "N" = "November", "O" = "Oscar", "P" = "Papa",
"Q" = "Quebec", "R" = "Romeo", "S" = "Sierra", "T" = "Tango",
"U" = "Uniform", "V" = "Victor", "W" = "Whiskey",
"X" = "X-ray", "Y" = "Yankee", "Z" = "Zulu",
"0" = "Zero", "1" = "One", "2" = "Two", "3" = "Tree",
"4" = "Fower", "5" = "Fiver", "6" = "Six", "7" = "Seven",
"8" = "Eight", "9" = "Niner")
license_plate <- toupper(license_plate) # IMPROVE: convert to upper case
# Split the license plate into characters
library(stringr) # FIX: required package
license_plate_chars <- str_split(license_plate, "", simplify = TRUE)
# Convert each character to its NATO phonetic equivalent
nato_chars <- lapply(license_plate_chars, function(char) {
if (char %in% names(nato_alphabet)) { # FIX: lookup is to names, not values
nato_alphabet[[char]]
} else {
char # FIX/IMPROVE: return unknown characters
}
})
# Combine the NATO phonetic equivalents into a single string
nato_license_plate <- paste(unlist(nato_chars), collapse = " ")
return(str_squish(nato_license_plate)) # IMPROVE: str_squish to remove duplicate spaces
}
nato_license_plate2("QUANT 1")
nato_license_plate2("qUant 1")
nato_license_plate2("QU@NT 1")
Stepping through the changes:

1. I moved the NATO alphabet declaration outside of the nato_chars assignment, where it initially appeared inside an anonymous function. That makes the code more readable, and it reflects the better programming practice of separating data declarations from functional code. Score: Bard 0.75 — the code works but is oddly constructed. I give that a middle "C" grade, using the common US letter grade system.
2. I added a toupper() conversion so the input case would not matter. That is implicit in my prompt (and in general usage of the NATO alphabet in most contexts). Better would have been for Bard to ask which we wanted, to include both cases, or to call out its assumption. Without any of those, I score Bard 0.60, a "D-".
3. I added the required call to the stringr package, as above. I'll be generous here and score that as 0.75, a "C". It's a definite error, but it is a common type of error and is easily fixed.
4. Now we get to the worst error in the code, and the reason all outputs were blank: Bard's code incorrectly indexes the translated strings and not their lookup names. In its list structure nato_alphabet, Bard created object names such as "A" that match values such as "Alfa". But when it checks whether an input letter (such as "A") is in the list, it looks at the translations and not at the names. I added names(nato_alphabet) so that would work correctly (see the sketch after this list). Score: 0.0 — a complete failure. It not only guarantees 100% wrong output, it uses completely wrong indexing logic. No points for that.
5. In the case of an unmatched character, I simply return the character (char) instead of Bard's choice to return a blank. This was not specified in my prompt — Bard should have clarified it interactively — but without specification, it is a very bad choice to silently eliminate input data; much better is to give it back untranslated (perhaps with a warning). Score 0.60, a "D-".
6. Finally, I use str_squish() to remove extra blank characters. Bard's output — after fixing its function — carries spaces that are present in the input but are not needed in the output. The extra spaces should either be eliminated (which would match my prompt) or announced (with the translation "space"). Score 0.75, a "C", because it's not terrible but is not great.
7. To round off the scores, I'll give 3.4 / 4.0 for the other parts of the function (which amount to about 4 lines of code). Those lines are rudimentary but have the virtue of working. That's an equivalent 85%, or a solid "B", for those lines.
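As promised in item 4 above, here is a minimal sketch (mine, not Bard's) that isolates the indexing bug:

```r
# A minimal sketch of the indexing bug: %in% on a named vector tests the
# *values*, not the names
nato_alphabet <- c("A" = "Alfa", "B" = "Bravo")
"A" %in% nato_alphabet         # FALSE -- "A" is a name, not a value
"A" %in% names(nato_alphabet)  # TRUE  -- this is what the lookup needs
nato_alphabet[["A"]]           # "Alfa" -- [[ ]] does index by name
```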
The total score for Bard's code, if I were grading, is 68% or a "D+". The code was broken but wasn't very difficult — for an experienced R programmer — to coerce into a working function.
Before we leave this section, I'll note one other thing: Bard used a fairly advanced R construction that pairs lapply() with an anonymous function (in this code, lapply(license_plate_chars, function(char) { ... })). That is something I would expect only from a fluent R programmer.
That illustrates both a pro and a con of LLM code: you can learn new things, but understanding the code takes much more experience than a new programmer would have. In Bard's original code — before I refactored it — the anonymous function was further obfuscated by a lengthy declaration of the NATO lookup table. New programmers may find it difficult to refactor code in the presence of fairly complex structures like anonymous functions — and that increases the odds of introducing bugs into the code (and decreases the odds of finding or solving an LLM's bugs).
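For contrast, here is a simpler sketch of my own (not Bard's) showing that base R alone, with no stringr and no anonymous function, can solve the same problem:

```r
# A simpler base-R sketch (mine, not Bard's), reusing Bard's word list
nato_words <- setNames(
  c("Alfa", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf",
    "Hotel", "India", "Juliett", "Kilo", "Lima", "Mike", "November",
    "Oscar", "Papa", "Quebec", "Romeo", "Sierra", "Tango", "Uniform",
    "Victor", "Whiskey", "X-ray", "Yankee", "Zulu",
    "Zero", "One", "Two", "Tree", "Fower", "Fiver", "Six", "Seven",
    "Eight", "Niner"),
  c(LETTERS, 0:9))

nato_simple <- function(plate) {
  chars <- strsplit(toupper(plate), "")[[1]]                               # split into characters
  words <- ifelse(chars %in% names(nato_words), nato_words[chars], chars)  # vectorized lookup
  trimws(gsub(" +", " ", paste(words, collapse = " ")))                    # join, collapse spaces
}

nato_simple("qUant 1")  # "Quebec Uniform Alfa November Tango One"
```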
What if Bard were a Quant UX Candidate?
When I worked at Google, I was often the "programming hat" interviewer for Quant UX candidates (and FWIW, I was the lead/owner of various docs for the job description, interview questions, interview evaluations, and quant careers, and a member of the central hiring committee).
I've written separately about why programming is highly desired for Quant UXRs. How would I score Bard if it were a Quant UX candidate?
First of all, just as IQ tests don't apply to LLMs, interview rubrics don't apply either. The only "correct" answer is that an LLM can't be a Quant UXR, so the question is not answerable. So let's ask this closely related question: what if Bard's code had been written by a Quant UX candidate?
Considering only the code in itself, I would grade it as "Pass". It uses R fluently, well above the minimum bar, as is demonstrated by correctly using lapply() and an anonymous function. (Those are not required for R fluency, but they help demonstrate fluency.) It demonstrates correct usage of control structures (if()) and looping (implicit in lapply()). The syntax errors are minor and do not worry me; they would appear quickly when testing.
However, I would consider more than just the code. First of all, in interviews I probe on the code. What if lapply() were not available? What would they do instead? What if the maximum string size were 100B characters, exceeding memory? What if the dictionary needed to change ad hoc? (For example, they might discuss using a separate data file that defines the dictionary instead of hard-coding it; see the sketch after this paragraph.)
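For instance, a minimal, hypothetical sketch of that data-file approach (mine, not any candidate's; it assumes a file nato_alphabet.csv with columns char and word):

```r
# Hypothetical sketch: read the dictionary from a data file instead of
# hard-coding it. Assumes "nato_alphabet.csv" with columns "char" and "word".
library(stringr)

nato_from_file <- function(license_plate, dict_path = "nato_alphabet.csv") {
  dict   <- read.csv(dict_path, colClasses = "character")
  lookup <- setNames(dict$word, dict$char)
  chars  <- str_split(toupper(license_plate), "", simplify = TRUE)
  words  <- vapply(chars, function(ch) {
    if (ch %in% names(lookup)) lookup[[ch]] else ch
  }, character(1))
  str_squish(paste(words, collapse = " "))
}
```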
Second, I would look for the candidate's questions of me, and their considerations and assumptions. Bard asked no questions and stated no assumptions; for that reason, even with decent code, a candidate would fail. (TBF, I would first encourage a candidate to ask questions!)
Third, I would see whether anything else raises issues of a candidate's competence. In Bard's case, it gave unnecessary and irrelevant references along with fictitious output for the test case. Again, I would ask a candidate about such things ... but my default for such things — before a good explanation — is "no hire."
In short, Bard would get a "hire" for its code, but a "no hire" for its overall responses.
Is Bard's Answer Useful?
I find this code to be useful for only one thing: the definition of the data dictionary ("A" = "Alfa", etc) which saves some typing. Even that is a mixed bag because it would usually be much more flexible to separate the data definition from the code.
Fixing the code as I did above — refactoring the data definition, correcting the errors, making it work regardless of input case, and so forth — takes at least as much time as writing such a function from scratch.
You might say, "well, but it's a good starting point." I'll grant that, although I think it would be odd for someone to have the R skills needed to debug a list-apply with an anonymous function and yet not know where to start with a simple problem.
Would it do better on larger, more complex problems? I don't know ... although I see no reason to expect fewer bugs or better code architecture for more complex problems. I would expect the bugs to scale up at least linearly.
Conclusions
To return to my initial observations from my previous review:
"[ChatGPT code for R] contains so many errors that one would be better served to start over" ==> This is less clear-cut with Bard. Its code is markedly better than what I saw from ChatGPT for this problem, and some people may find its code to be a useful starting point.
"[using LLM code requires high] expertise to debug some other entity's code" ==> This is still true. Although Bard's code is easier to fix than ChatGPT's, it still requires substantial expertise.
In short, from this exercise I cannot recommend Bard to help beginners write R code. It requires advanced skills in R syntax, debugging, and testing. Experienced R programmers might find its approach to be inspirational or even time-saving, although I personally did not.
To update my opening discussion with my friend "Gert," I didn't find Bard's code — unlike ChatGPT's — to be "terrible," just not good enough.
What about me? I do not use LLMs to write any of my code. I have not yet found any of their code to be worth the effort and the risk of bugs. I also have concerns about the ethics of using code that may be plagiarized; Bard's references are not reassuring.
Appendix: All the Code
As always, I compile all my R code. Here is Bard's code and my corrected version.
##### as written by Bard
nato_license_plate <- function(license_plate) {
# Split the license plate into characters
license_plate_chars <- str_split(license_plate, "", simplify = TRUE)
# Convert each character to its NATO phonetic equivalent
nato_chars <- lapply(license_plate_chars, function(char) {
nato_alphabet <- c("A" = "Alfa", "B" = "Bravo", "C" = "Charlie", "D" = "Delta",
"E" = "Echo", "F" = "Foxtrot", "G" = "Golf", "H" = "Hotel",
"I" = "India", "J" = "Juliett", "K" = "Kilo", "L" = "Lima",
"M" = "Mike", "N" = "November", "O" = "Oscar", "P" = "Papa",
"Q" = "Quebec", "R" = "Romeo", "S" = "Sierra", "T" = "Tango",
"U" = "Uniform", "V" = "Victor", "W" = "Whiskey",
"X" = "X-ray", "Y" = "Yankee", "Z" = "Zulu",
"0" = "Zero", "1" = "One", "2" = "Two", "3" = "Tree",
"4" = "Fower", "5" = "Fiver", "6" = "Six", "7" = "Seven",
"8" = "Eight", "9" = "Niner")
if (char %in% nato_alphabet) {
nato_alphabet[[char]]
} else {
""
}
})
# Combine the NATO phonetic equivalents into a single string
nato_license_plate <- paste(unlist(nato_chars), collapse = " ")
return(nato_license_plate)
}
nato_license_plate("QUANT 1") # Bard
nato_license_plate("qUant 1") # Added
nato_license_plate("QU@NT 1") # Added
###### revised
nato_license_plate2 <- function(license_plate) {
# IMPROVE: declare this outside the lapply() loop
nato_alphabet <- c("A" = "Alfa", "B" = "Bravo", "C" = "Charlie", "D" = "Delta",
"E" = "Echo", "F" = "Foxtrot", "G" = "Golf", "H" = "Hotel",
"I" = "India", "J" = "Juliett", "K" = "Kilo", "L" = "Lima",
"M" = "Mike", "N" = "November", "O" = "Oscar", "P" = "Papa",
"Q" = "Quebec", "R" = "Romeo", "S" = "Sierra", "T" = "Tango",
"U" = "Uniform", "V" = "Victor", "W" = "Whiskey",
"X" = "X-ray", "Y" = "Yankee", "Z" = "Zulu",
"0" = "Zero", "1" = "One", "2" = "Two", "3" = "Tree",
"4" = "Fower", "5" = "Fiver", "6" = "Six", "7" = "Seven",
"8" = "Eight", "9" = "Niner")
license_plate <- toupper(license_plate) # IMPROVE: convert to upper case
# Split the license plate into characters
library(stringr) # FIX: required package
license_plate_chars <- str_split(license_plate, "", simplify = TRUE)
# Convert each character to its NATO phonetic equivalent
nato_chars <- lapply(license_plate_chars, function(char) {
if (char %in% names(nato_alphabet)) { # FIX: lookup is to names, not values
nato_alphabet[[char]]
} else {
char # FIX/IMPROVE: return unknown characters
}
})
# Combine the NATO phonetic equivalents into a single string
nato_license_plate <- paste(unlist(nato_chars), collapse = " ")
return(str_squish(nato_license_plate)) # IMPROVE: str_squish to remove duplicate spaces
}
nato_license_plate2("QUANT 1")
nato_license_plate2("qUant 1")
nato_license_plate2("QU@NT 1")
Appendix: An Update Using Bard + Gemini (Dec 8, 2023)
Just before publishing the above post, Google released the first version of "Gemini" and added it into the Bard LLM.
I tried the same problem again, and Bard + Gemini's answer was much worse. It was so much worse that it was essentially worthless and unrecoverable, given the number of bugs and poor coding decisions. Rather than a substantial improvement over the old ChatGPT code, I would say it is now roughly as bad.
Briefly, I observed:
It uses "." as if it were an automatically defined object
Its indexing logic is even more incorrect than in the version above
It uses indexing that (even if it were correct) would fail in case of unmatched input
It attempts to index under the assumption of an initial character match — which, although it would work for alphabetic characters using one version of their phonetic equivalents, would fail for other variants and for numerals
It matches numbers to the wrong words (namely, to letters)
It fails to include the phonetic strings for numbers
It uses tidyverse for no particularly good reason, which obfuscates the code
It fails to load the required string library (or to use a base R equivalent), and fails to load a required tidyverse library.
That is so many errors and poor decisions — for a function that should take only 4-10 total lines of code! — that it is pretty much worthless for anything except saving the typing of NATO phonetic spellings (and TBH those would be better handled in a data file anyway).
The initial Bard code above was salvageable with moderate effort and some degree of R knowledge. The Gemini code here is not salvageable in my opinion. It has too many errors and bizarre choices to be worth the effort.
In short, Bard + Gemini appears to be a significant downgrade from the original Bard version above for this R problem. (Caveat: LLM output is not generally reproducible so this is a single-shot observation.)
For reference, here's what Bard + Gemini wrote. I made the bullet points above from reading its code (which doesn't run correctly, of course).
# ed. note: version written by Bard after Gemini release, 12/8/23
# Function to convert license plate to NATO phonetic alphabet
to_nato_phonetic <- function(plate) {
# Define conversion table
nato_table <- c(
"Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf",
"Hotel", "India", "Juliett", "Kilo", "Lima", "Mike", "November",
"Oscar", "Papa", "Quebec", "Romeo", "Sierra", "Tango", "Uniform",
"Victor", "Whiskey", "X-ray", "Yankee", "Zulu",
0:9
)
# Split the plate into individual characters
chars <- str_split(plate, "", simplify = TRUE)
# Convert each character to its NATO phonetic equivalent
phonetic_chars <- sapply(chars, function(x) {
if (is.numeric(x)) {
# Convert numbers to NATO words
nato_table[x + 1]
} else {
# Convert letters to NATO words
tolower(x) %>% toupper() %>% str_sub(1, 1) %>%
which(nato_table == .) %>% nato_table[[.]]
}
})
# Combine the phonetic characters into a single string
return(paste0(phonetic_chars, collapse = " "))
}
# Example usage
plate <- "QUANT 1"
phonetic_plate <- to_nato_phonetic(plate)
cat("Input:", plate, "\nOutput:", phonetic_plate)