Is Programming Required for Quant UX?

Social scientists considering quant UX roles often ask me, "Is programming really required for quant UX?" They might add, "I've done lots of research and statistics and have never needed to program!"

Here are a few reasons why I think programming is advantageous and almost "required" for Quant UXRs.

Answer 1: Job Postings

This week, I saw Quant UX openings listed online at 6 companies (some, like Google, had multiple openings). Of those, 3/6 required programming in R, Python, or similar languages (Google, Robinhood, and AnitaB). Another 1/6 might be satisfied by SQL as an alternative (Cisco), 1/6 was unclear (Applause), and 1/6 did not require programming (Trioptus).

Here are excerpts from the postings:

  1. Google: "Experience in programming languages used for data manipulation and computational statistics (Python, R, MATLAB, C++, Java, Go), and with SQL"
    [full disclosure, I wrote a previous version of this when I worked there.]

  2. Cisco: "Skilled at Python, Javascript, R, or SQL"
    [Javascript is odd here but I imagine there's some reason, perhaps an A/B testing platform.]

  3. AnitaB: "Experience in programming languages used for data manipulation and computational statistics (Python, R, MATLAB, C++, Java, Go), and with SQL"
    [This text is not a mistake; it is identical to the Google listing.]

  4. Applause: "Experience with data analysis and data wrangling tools, such as R, SPSS, SAS, Python, or similar"
    [I would not include SPSS or SAS because they are not general purpose.]

  5. Robinhood: "SQL (advanced queries). R and/or Python."
    [Short and clear!]

  6. Trioptus: "Proficiency in statistical analysis software (SPSS, Qualtrics) and data visualization tools"
    [This is confused; Qualtrics is not "statistical software".]

An inference from these postings is that 4, perhaps 5, out of 6 companies hiring for Quant UX require programming skills. Three of them (Google, AnitaB, and Robinhood) also list SQL as a separate requirement.

(Digression: bonus stats interview questions!
[1] I observed 6 companies offering jobs. What statistical distribution might this observation come from? Assuming this week is a typical observation, what would be some other expected observations and their probabilities?
[2] Of the 6 companies, 4 require programming. What distribution does that 4/6 come from? How would you simulate other possible fractions? [For answers to #1 and #2, see Chapters 3 and 6 of the R book or Python book.]
[3] Google posted multiple jobs but I only counted them as one observation. If you had a larger data set, what's a better way to model that? [Answer: see (generally related) Chapters 9 and 13 in the R book; or Gelman & Hill (2007).])
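For readers who want to try questions [1] and [2] in code, here is a minimal Python sketch. It rests on two assumptions made purely for illustration: that the weekly count of hiring companies is roughly Poisson with a mean of 6, and that each of 6 companies independently requires programming with probability 4/6, so the number requiring it is binomial. It does not address the clustering issue in question [3], and the books cited above remain the place to go for fuller answers.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the sketch is reproducible

# Question 1 (assumption): companies hiring in a week ~ Poisson(mean = 6)
weekly_counts = rng.poisson(lam=6, size=10_000)
counts, freqs = np.unique(weekly_counts, return_counts=True)
count_probs = dict(zip(counts.tolist(), (freqs / weekly_counts.size).tolist()))

# Question 2 (assumption): of 6 companies, the number requiring
# programming ~ Binomial(n = 6, p = 4/6)
n_requiring = rng.binomial(n=6, p=4 / 6, size=10_000)
fractions = n_requiring / 6

print("Estimated P(exactly 6 companies in a week):", round(count_probs.get(6, 0.0), 3))
print("Middle 95% of simulated fractions:", np.percentile(fractions, [2.5, 97.5]))
```

Changing the assumed mean or probability and rerunning shows how much a single week's observation can vary by chance.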

Answer 2: Why is Programming Required? Example.

In Chapter 6 of the Quant UX Book, Kerry Rodden and I present a detailed discussion of the importance of programming for Quant UX. In a nutshell, most of the data that Quant UXRs receive is messy and needs systematic work to acquire, compile, clean, and pre-process for analysis. Industry data sets do not have nice structures like academic studies. Programming helps tremendously when working with them.

Here's a different answer: programming dramatically expands what you can do. I'll give a real example from a recent problem. First, I'll set up the question.

Two days ago (as of the time of writing), some colleagues and I were discussing external research that showed that [one set of options] would appeal to X% of a product's target audience, while [another, partially overlapping set of options] would appeal to X+2%. The question arose, "Is the 2% difference a 'significant' difference?" In other words, should we care about the difference?

You might have one or more of these thoughts:

  1. Even if it's not "statistically significant," 2% is probably more than 0% and you should go with that. (Generally, I agree, assuming that all else is equal. Yet the question here is whether we should even make that decision at all.)

  2. Stakeholders don't understand 'significant' and we shouldn't get caught up on that. (I agree! Yet knowing the answer could help me make that case.)

  3. You can just look at the confidence intervals (CIs) for the estimates X% and X+2% and see if they are overlapping. That wouldn't work here, for several reasons. First, confidence intervals don't work that way; you need a pooled variance estimate. Even more importantly, (a) we didn't have CIs for the base estimates because they came from a complex process, run externally, and that process didn't give CIs; and (b) even if we had CIs for each of the two estimates, what we want to know is the CI for the difference score, and that can't be derived from the separate CIs on their own.
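Point (b) can be made concrete with the standard formula for the variance of a difference: Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B). The sketch below uses hypothetical standard errors and correlations (none of these numbers come from the research discussed here) to show that the same two individual CIs are compatible with very different CIs for the difference, depending on how strongly the two estimates are correlated.

```python
import math

def diff_ci_halfwidth(se_a, se_b, rho, z=1.96):
    """Approximate 95% CI half-width for (A - B), given each estimate's
    standard error and the correlation rho between the two estimates."""
    var_diff = se_a**2 + se_b**2 - 2 * rho * se_a * se_b
    # Guard against tiny negative values caused by floating-point rounding.
    return z * math.sqrt(max(var_diff, 0.0))

# Hypothetical values: both estimates have a standard error of 2 percentage points.
se_a = se_b = 2.0
for rho in (0.0, 0.5, 0.9):
    hw = diff_ci_halfwidth(se_a, se_b, rho)
    print(f"rho = {rho:.1f}: difference CI is about +/- {hw:.2f} points")
```

With these made-up inputs, the half-width for the difference ranges from roughly 5.5 points when the estimates are uncorrelated to roughly 1.8 points when they are highly correlated. Partially overlapping option sets are likely to be correlated, and the separate CIs carry no information about that correlation.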

Without programming skills, the few honest answers include:

  1. "I don't know, call the research provider and get them to figure it out."

  2. "Based on my experience, I would guess the CI is Y%."

  3. "It can't be done."

Yet with programming skills, it can be done — at least well enough for a reasonable answer. The short answer is data simulation combined with bootstrap estimation. In case those don't mean much to you, here's the process. (I won't go into all the nuances, but this is a general outline.)

  1. The underlying data here come from a discrete choice model (DCM). I knew from the research report what the average sample estimates were for the preference of each item that went into the various "sets of options" being considered.

  2. DCMs have a well-defined structure where individual respondents' preference estimates are related to the group mean and the standard deviation of preferences around that group mean. From the data we had, I knew the group means. I didn't have the standard deviations, but I could make a reasonable guess of the expected distribution of their possible values (namely, the exponential distribution, which is used in estimating/fitting DCMs).

  3. Given #1 and #2, I could use the known means and choose a reasonable, possible set of SDs for each item by drawing values from the distribution assumed in DCMs.

  4. With those, I could then simulate a reasonable data set that matched what we knew about the overall structure of the provider's data. (Note: simulated DCM data also adds errors for the means — which follow a normal distribution — and individual-level estimates (Gumbel error); I'll skip those details.)

  5. Given a single plausible, simulated data set, I could find the scores for the two computed [sets of preferences], and then find the difference between the two. In other words, I could calculate a new difference that might be 2% or 5% or 0%, or whatever.

But steps #3 and #4 in that process involve randomization, and the answer in step #5 is only a single estimate, based on a single simulated data set. The question then is: what tends to happen, on average? This is where programming shines: I could create a simulated data set many, many times — where every time matches the structure of the data we knew — and look at the typical variation of estimates.
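The steps above, together with the repetition, can be outlined in a short Python sketch. This is not the R code used for the analysis, and everything specific in it is an assumption for illustration: the item means, the exponential scale for the SDs, the sample size, and especially the definition of a set's "appeal" as the share of simulated respondents with a positive preference for at least one item in the set. It also leaves out the error terms mentioned in step 4.

```python
import numpy as np

def simulate_difference(means, set_a, set_b, n_resp=400, sd_scale=1.0, rng=None):
    """Simulate one plausible data set and return appeal(set_b) - appeal(set_a)."""
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    # Step 3: draw a plausible SD for each item (assumed exponential distribution).
    sds = rng.exponential(scale=sd_scale, size=means.size)
    # Step 4: individual-level preferences scattered around the group means.
    utils = rng.normal(loc=means, scale=sds, size=(n_resp, means.size))
    # Step 5: "appeal" = share of respondents who like at least one item in the set
    # (a hypothetical definition; the real metric may differ).
    appeal_a = (utils[:, set_a] > 0).any(axis=1).mean()
    appeal_b = (utils[:, set_b] > 0).any(axis=1).mean()
    return appeal_b - appeal_a

rng = np.random.default_rng(seed=1)
item_means = [-0.5, 0.2, -1.0, 0.1, -0.3]  # hypothetical group means
set_a, set_b = [0, 1, 2], [1, 2, 3]        # partially overlapping option sets

# Repeat the simulation many times to see how much the difference varies.
diffs = np.array([simulate_difference(item_means, set_a, set_b, rng=rng)
                  for _ in range(2000)])
low, high = np.percentile(diffs, [2.5, 97.5])
print(f"Middle 95% of simulated differences: {low:+.3f} to {high:+.3f}")
```

The spread of `diffs` is the quantity of interest: if an observed difference sits comfortably inside it, that difference is consistent with random variation under the assumed structure.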

I'll omit various details here and get to the point: I wrote R code to perform this process and ran it 500,000 times (1,000 times each for 500 different sets of comparisons similar to the one of interest).

Then I looked at the distribution of results and found that 95% of the data sets produced confidence intervals for the difference scores that were greater than 2%, with a median CI of 5%.

In short, the CI for the difference scores was almost certainly much higher than 2%. The CI was probably around 5% and might be as high as 8%. The observed difference of 2% in the data we had was well within what we would expect from random variation. That finding supported other information we had, increasing our confidence, and I found the answer only thanks to programming.

(Bonus interview note. If you know enough to answer follow-up questions about it, then "bootstrap estimation" is a good answer for many Quant UX interview statistics and programming problems. Such questions are similar to — but simpler than — the problem I presented here. Bootstrapping almost always requires programming.)

Answer 3: Some Things Are Worth Doing Just Because They're Worth Doing

If you don't enjoy programming, that's no problem! Most people don't program, and I imagine that most people who try programming give it up (same as calculus or reading classic novels).

However, some of us enjoy programming and it adds value to problems like the one above. To be honest, I enjoy it! I was delighted to get the reasonable answer above — and I got the answer by writing R code faster than I was able to complete the back-and-forth in an email thread with the provider about the same problem.

To be sure, I'd rather have the original data. Real data gives a more confident answer than an answer from informed simulation. But even if I had the data, I would still need to code a bootstrapping routine to get the answer here for a difference CI (I would not need the simulation aspect). If I get the data later, I'll see how it compares to the simulation answer. In the past, I've found such answers to be in good or adequate agreement. In this case, we needed an answer quickly to compare with other information and it was great to have this option on the table!
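For the case with real respondent-level data, a minimal sketch of a paired bootstrap for the difference might look like the following. The data here are randomly generated stand-ins (not real results), and the 0/1 coding of whether each option set appeals to a respondent is an assumed format. The main design choice is to resample whole respondents, so that the overlap between the two sets is preserved in every resample.

```python
import numpy as np

def bootstrap_diff_ci(appeals_a, appeals_b, n_boot=2000, level=0.95, rng=None):
    """Percentile bootstrap CI for mean(appeals_b) - mean(appeals_a),
    resampling respondents (rows) with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(appeals_a, dtype=float)
    b = np.asarray(appeals_b, dtype=float)
    n = a.size
    # Each row of idx holds the respondent indices for one bootstrap sample.
    idx = rng.integers(0, n, size=(n_boot, n))
    # The same indices are used for both sets, which keeps the pairing intact.
    diffs = b[idx].mean(axis=1) - a[idx].mean(axis=1)
    alpha = (1 - level) / 2
    low, high = np.quantile(diffs, [alpha, 1 - alpha])
    return low, high

# Stand-in data: True = the set appeals to that respondent, False = it does not.
rng = np.random.default_rng(seed=7)
likes_a = rng.random(500) < 0.60
likes_b = likes_a | (rng.random(500) < 0.05)  # set B overlaps heavily with set A

low, high = bootstrap_diff_ci(likes_a, likes_b, rng=rng)
print(f"95% bootstrap CI for the difference: {low:.3f} to {high:.3f}")
```

With real data, `likes_a` and `likes_b` would be replaced by the per-respondent results for the two option sets, computed however the provider's model defines appeal.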

Answer 4: There are More Examples

In the Quant UX book, and also in the R book and the Python book, there are many more examples where we use R or Python code to solve Quant UX problems. We share all of the code online, as well as example data sets and practice problems.

If you find the problems, approaches, and solutions valuable, that is a good reason to program. I hope our work will help you to accelerate that.

Convinced? Building Programming Skills

The best and perhaps only answer is to practice. The more you code, the better you'll get. But to get you started and accelerate your learning in programming, I can suggest:

  1. OK, I hate to say it, but the R book and Python book. Each of them teaches programming alongside real-world projects for data analysis, and they are intended to be starting points. Most importantly: bring your own data sets and apply the skills to problems you care about as you go.

  2. For a second book on R programming: Norm Matloff's The Art of R Programming. It goes into things you'll want to know beyond the basics, and teaches a surprising amount of general computer science and CS intuition along the way.

  3. The Quant UX book doesn't teach programming but it does outline what you need to know, answering "how much programming?" It includes a few example interview questions, plus three chapters with code for applied quant UX projects that supplement the R and Python books.

As for whether to learn R or Python, it depends.

If you're an engineer, I suggest Python; if a social scientist, I suggest R. Otherwise, go for the language with the most people around you who can answer questions. Beyond that, if you want to focus on data and statistics, choose R; if you want to focus on programming or applications, choose Python.

Either way, once you are fluent in one language, learning others is usually easy.

Best wishes and thank you for reading!