<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Quantitative UX Research Blog]]></title><description><![CDATA[A blog about Quant UX Research, applications, topics, and careers. Features additional discussion related to the book "Quantitative User Experience Research" and the Quant UX Conference.]]></description><link>https://quantuxblog.com</link><generator>RSS for Node</generator><lastBuildDate>Sun, 12 Apr 2026 03:14:58 GMT</lastBuildDate><atom:link href="https://quantuxblog.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[My UX Research "Rolodex" ]]></title><description><![CDATA[I'm often asked to recommend a tool or vendor for a research project. Over the course of 20+ years doing applied research, I've come to rely on a set of suppliers and software to consider when needed ]]></description><link>https://quantuxblog.com/my-ux-research-rolodex</link><guid isPermaLink="true">https://quantuxblog.com/my-ux-research-rolodex</guid><category><![CDATA[#Ux research]]></category><category><![CDATA[survey research]]></category><category><![CDATA[quantux]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Tue, 24 Feb 2026 19:59:05 GMT</pubDate><enclosure url="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/64ee1580b33ffb853a63cb58/a7129942-b823-46c4-9346-f244829a1b10.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I'm often asked to recommend a tool or vendor for a research project. Over the course of 20+ years doing applied research, I've come to rely on a set of suppliers and software to consider when needed on any project. Here is that list!</p>
<p>First, a few caveats. <strong>No one here knows I'm recommending them</strong>, and I'm not incentivized to do so. Still, if you contact them, it's fine to mention my recommendation (or not). <em>Please don't infer anything from anyone's absence</em>. It might be that I don't recommend them, but more likely I don't know of them, forgot about them, or the list was too long to include them. Finally, this list reflects my own needs and history. It may not align perfectly with anyone else's.</p>
<p>In a nutshell, these are folks I call first. If you know what a Rolodex is, this is mine. Working with these folks will help your projects, too!</p>
<hr />
<h3>General Software Tools</h3>
<p><em>Before recommending suppliers, these are the baseline tools I use every day.</em></p>
<p><strong>R and RStudio</strong>. This is pretty obvious, especially given my <a href="https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155">R book</a>. I mostly use base R instead of <code>tidyverse</code> for simplicity and long-term stability. I have nothing against <a href="https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194">Python</a>, though. My Quant UX book was actually <a href="https://quantuxblog.com/how-the-quant-ux-book-was-written">written inside RStudio</a>, from draft through near-final PDF! BTW, I reported empirical observations of my <a href="https://quantuxblog.com/favorite-r-packages-part-1">most-used R packages in this post</a>.</p>
<p><strong>Google Docs</strong>. Again, pretty obvious. I will call out Docs' speed, reliability, and seamless ability to collaborate and co-edit. Yes, there are downsides, and I wish Proton were a more full-featured alternative, but it isn't. (In case you're wondering, no, I do not use Excel or Word.)</p>
<p><strong>TeXShop / LaTeX</strong>. The <a href="https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155">R book</a> and the <a href="https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194">Python book</a> were both written start-to-finish using <a href="https://pages.uoregon.edu/koch/texshop/">TeXShop</a>, all the way to a near camera-ready PDF handed off to the publisher. The <a href="https://www.quantuxbook.com">Quant UX book</a> was also LaTeX but using RStudio instead of TeXShop. The start-up effort of LaTeX is high but the power is unmatched. I love to see things "camera ready" as I write.</p>
<p><strong>GitHub / BitBucket</strong>. My coauthors and I collaborate using GitHub (or BitBucket, depending) so we can work on drafts simultaneously, and have backups and versioning. If you code, or write using any local file system tool (e.g., Scrivener), you should learn git.</p>
<p><em>BTW, my laptops have all been Macs for the past 15 years. In addition to other features, Macs are especially well-suited for R and LaTeX users. I'm writing this with a MacBook Pro M5.</em></p>
<hr />
<h3>Survey Research Suppliers</h3>
<p><em>Platforms and people I rely on for great survey research authoring and fielding.</em></p>
<p><a href="https://sawtoothsoftware.com/discover"><strong>Sawtooth Discover</strong></a>. This is my all-purpose survey authoring &amp; hosting platform. Sawtooth has long been the leader for MaxDiff and conjoint analysis surveys, but even when I'm just doing a general survey I use their platform. It's free to use for up to N=50 respondents.</p>
<p><a href="https://connect.cloudresearch.com"><strong>Cloud Research</strong></a>. Their Connect product is my most-used provider to obtain general "self service" panel responses to surveys. It is exceptionally easy to use and transparent on pricing, and they have done a lot of work on respondent quality.</p>
<p><a href="https://www.verasight.io"><strong>Verasight</strong></a>. They are a new player in the space of quality sampling and research with probability-based survey panels. I have not used them yet, but they do great research on the current sample landscape, and they are now on my short list for DIY sample needs.</p>
<p><a href="https://answersresearch.com"><strong>Answers Research</strong></a>. For 15+ years, when I've needed someone to manage survey research I can't handle fully — complex fielding, multiple languages, authoring with stakeholders, fielding, qualitative add-ons, and/or analytics — Answers is the first partner I've asked.</p>
<p><a href="http://www.unabashedresearch.com"><strong>Unabashed Research</strong></a>. This is a different kind of recommendation. When I want someone to directly partner with me to scope, lead, run, and report research, my first call is to James Alford at Unabashed. I've worked with him on dozens and dozens of projects for 20 years.</p>
<hr />
<h3>General Research Suppliers</h3>
<p><em>The folks I call when I need someone to lead specific projects, especially in-person research. I have used these groups for many projects, each for 15+ years. Besides the US, I have done a lot of research in Japan; it is an unusually rewarding and insightful locale.</em></p>
<p><a href="https://caelus-usa.com"><strong>Caelus</strong></a> (US). I've known the folks at Caelus for 25 years and they handle everything from in-person research, to validated recruiting, to online survey and related projects. They are unparalleled at assessing research needs and helping to form honest, transparent plans.</p>
<p><a href="https://blinkux.com/services/insights-strategy"><strong>Blink UX</strong></a> (US). Blink UX is a large agency able to handle all aspects of UX product work from research, to concept testing, to design iteration. Several of their leaders are former close colleagues of mine, and I highly recommend them.</p>
<p><a href="https://www.r2-insights.com"><strong>R2 Insights</strong></a> (Japan). A boutique, highly personalized research service provider for Japan. For English-speaking product teams, R2 is the ideal partner to help you understand the Japan market. I've done 20+ projects with them over the years.</p>
<p><a href="https://www.sugataresearch.com"><strong>Sugata Research</strong></a> (Japan). My other favorite Japan research agency is Sugata, with whom I've also done many projects over the years. They are larger than R2 Insights and may be able to support you with a larger team when that is needed (e.g., parallel, multi-city research).</p>
<hr />
<h3>Qualitative Research Facilities</h3>
<p><em>When I run field research myself (as opposed to having one of the above suppliers lead it) — whether it involves in-depth interviews, home visits, or focus groups — I call these folks for recruiting and local facilities. (For Japan, see R2 Insights and Sugata Research above.)</em></p>
<p><a href="https://www.fieldwork.com/market-research-services/research-venues/"><strong>Fieldwork</strong></a> (US). These are top-notch research facilities all around the US. They have layouts suitable for in-depth interviews, focus groups, usability and concept tests, and specialty projects such as cooking tests. Very easy to coordinate efficient, multi-city projects.</p>
<p><a href="https://sago.com/en/"><strong>Sago</strong></a> (Europe). I've used their network and partners many times in Munich, Paris, Berlin, London, Hamburg, and Rome. In the whole world, my personal all-time favorite research facility is their location in old city Munich!</p>
<p><a href="https://www.acornasia.com"><strong>Acorn Asia</strong></a> (Asia Pacific generally). I've used Acorn Asia for multiple projects; if I recall correctly, in China, South Korea, Indonesia, and the Philippines. Similar to Fieldwork and Sago, it is easy to coordinate multiple locations as part of a single efficient project plan.</p>
<hr />
<h3>P.S. A Plug for International Research</h3>
<p>It is shockingly common for US-based companies to skip international customer research, despite the fact that customer needs, perceptions, considerations, and behaviors in other countries may vary substantially from those in the US.</p>
<p>The usual reasons not to do international research are:</p>
<ul>
<li><p>the US market is the largest</p>
</li>
<li><p>we'll launch first in the US</p>
</li>
<li><p>any other single market is too small</p>
</li>
<li><p>it is too expensive to do international research</p>
</li>
<li><p>we don't have the time for international research</p>
</li>
</ul>
<p>My responses to those claims are:</p>
<ul>
<li><p><em>the US market is the largest</em> ==&gt; OK, and the rest of the world is larger when combined</p>
</li>
<li><p><em>we'll launch first in the US</em> ==&gt; it's better to find international problems before the product is locked-in for any market</p>
</li>
<li><p><em>any other single market is too small</em> ==&gt; the point is not to test or develop for any single market, but to use one or two locations to "pressure test" the product and needs</p>
</li>
<li><p><em>it is too expensive to do international research</em> ==&gt; don't do international testing all the time, but allocate some money; the risk of not doing it is higher than the cost of doing some</p>
</li>
<li><p><em>we don't have the time for international research</em> ==&gt; with efficient research partners as above, it will take very little additional time</p>
</li>
</ul>
<p>As a general matter, working as a US-based researcher for US-based companies, I tended to allocate around 20% of projects to international locations. That ensured that we understood those markets at a baseline level, with enough "sample" to be confident we would catch big issues. Yet it was a small enough proportion to keep stakeholders comfortable with the overall allocation, and it did not claim excessive time.</p>
<p>A shorter way to explain it: <strong>if you're building for the whole world (and you probably are), then you owe it to the world to consider their needs and issues</strong>. Ultimately, those customers are paying you — and you should devote some research to them!</p>
]]></content:encoded></item><item><title><![CDATA["Rigor" in Quant UX Research]]></title><description><![CDATA[I’ve recently had discussions with other Quant UXRs about “rigor” and what it means in our work. I’ve often heard it raised as a question over the years, and this prompted me to compile some thoughts.
I propose a stack with 4 levels of how I view “ri...]]></description><link>https://quantuxblog.com/rigor-in-quant-ux-research</link><guid isPermaLink="true">https://quantuxblog.com/rigor-in-quant-ux-research</guid><category><![CDATA[quantux]]></category><category><![CDATA[#Ux research]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Tue, 03 Feb 2026 22:50:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/OqClyFZl3Go/upload/42f3014be05edeb2bf6d902ebca571a5.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I’ve recently had discussions with other Quant UXRs about “rigor” and what it means in our work. I’ve often heard it raised as a question over the years, and this prompted me to compile some thoughts.</p>
<p>I propose a stack with 4 levels of how I view “rigor” … but first, I’ll cover what is <em>not</em> rigor.</p>
<hr />
<h2 id="heading-what-is-not-rigor">What Is Not Rigor?</h2>
<p><strong>Rigor is not a synonym for advanced methods</strong>. In fact, I believe that using so-called advanced methods (such as, say, causal modeling, deep learning, multilevel regression) is too often a crutch that people grab when they are confronted with a serious problem <em>apart</em> from analytics. A few of those problems are:</p>
<ul>
<li><p><em>Insecurity</em>. They want to use advanced methods to say “look how smart I am”</p>
</li>
<li><p><em>Poor data</em>. They hope to rescue something when the data set is a mess</p>
</li>
<li><p><em>Wanting to learn something</em>. They’d like to stretch the boundaries of their knowledge</p>
</li>
</ul>
<p>I am <em>not</em> saying those problems are bad or wrong to address. Also, I am not saying that every use of an advanced method is wrong. Far from it! Quants should happily use any method that is appropriate. Rather, I am simply observing that complex methods are often chosen for reasons <em>other than</em> statistical suitability. And even when they are suitable, a complex analysis is not inherently more rigorous than a simple one.</p>
<p><strong>Rigor is not statistical significance or p-values</strong>. First of all, as a Bayesian, I don’t find the concepts of <em>statistical significance</em> and <em>p-values</em> to be very useful, for reasons similar to <a target="_blank" href="https://www.fharrell.com/post/journey/">those described by Frank Harrell</a> (who TBH is a much more knowledgeable statistician than I am). Setting Bayesian stats aside, even in the frequentist world statistical significance is closely tied to null hypothesis significance testing (NHST) … and in my opinion that’s not what quants should do most of the time (see <a target="_blank" href="https://www.sjsu.edu/faculty/gerstman/misc/Cohen1994.pdf">Cohen’s famous takedown</a> of p-values and NHST).</p>
<p>Instead, I believe quant UXRs should use data to learn and inform decisions. Most of that process cannot be forced into an NHST paradigm, and assigning a p-value to some portion is misleading at best.</p>
<p>In short, statistical significance is a misleading concept. Unfortunately, those who ask about it most often may understand it the least. Again, that’s not 100% of the time and NHST has occasional uses (some forms of A/B and multivariate testing <em>sometimes</em> are exceptions).</p>
<p><strong>Rigor is not massive data</strong>. I often see studies that attempt to collect massive data, either at a sample level (such as 10000s of survey responses, or 100Ms of online users) or at a response level (such as collecting 100s or 1000s of variables per observation). The problem is that massive data is not necessarily good data, and analysis of it may increase the odds of finding spurious associations and misleading results.</p>
<p>In other words, a bigger load of garbage is still a load of garbage, and “advanced methods” (see above) won’t save it. Getting more data is not more “rigorous”; it is just more data, good or bad.</p>
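<p><em>To see the many-variables version of this problem concretely, here is a minimal simulation sketch in R. The sample size, variable count, and thresholds are arbitrary assumptions for illustration only:</em></p>
<pre><code># Minimal sketch: with many unrelated variables, some correlations look
# "strong" purely by chance, even though every true relationship is zero.
set.seed(98101)
n.obs  &lt;- 200
n.vars &lt;- 1000
x &lt;- matrix(rnorm(n.obs * n.vars), nrow = n.obs)   # pure noise "predictors"
y &lt;- rnorm(n.obs)                                   # unrelated outcome

cors &lt;- cor(x, y)          # 1000 sample correlations with y
max(abs(cors))             # often 0.25 or higher, by chance alone
mean(abs(cors) &gt; 0.14)     # roughly 5% clear a nominal p &lt; .05 bar at this n
</code></pre>
<p>Nothing in that simulated data is real, yet scanning enough variables will always “find” apparent relationships; more data multiplies the opportunities for noise, not the insight.</p>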
<p><strong>Rigor is not an unwavering protocol</strong>. This is a problem I sometimes see when a researcher transitions from academia to industry. One common form is to put every UX research participant through an identical set of instructions, tasks, and data collection methods. This may occur in interviews, usability tests, or focus groups, as well as surveys. The usual motivation is to collect identical, consistent, and “unbiased” data.</p>
<p>The problem with a rigid protocol is this: it maximizes the defensibility of the data against criticism from other researchers (such as journal reviewers) but it does not maximize what we learn from our participants. I vastly prefer to engage with participants and adapt an interview or other protocol in response to what I hear. If that gives “incomparable data,” then that is <em>my</em> problem to determine how to have impact with stakeholders … but however I resolve that, I will have learned more from participants than I would by subjecting them to a tedious, unwavering test. And that maximizes the business value of the research.</p>
<p>In summary, rigorous research does <em>not</em> imply or necessitate:</p>
<ul>
<li><p>Advanced methods</p>
</li>
<li><p>A focus on hypothesis testing, “statistical significance,” or p-values</p>
</li>
<li><p>Massive data sets</p>
</li>
<li><p>Unwavering protocols for interviews or other data collection</p>
</li>
</ul>
<hr />
<h2 id="heading-what-is-rigor">What is Rigor?</h2>
<p>You may have noticed above that I repeatedly suggested that <strong>the point of research is to learn</strong> from our users, customers, and research participants <strong>and to inform product or other business decisions</strong>. To do that <em>rigorously</em> is to learn deeply and to focus relentlessly on making decisions well.</p>
<p>To those ends, I propose the following four areas of research rigor. I believe there is some degree of logical order to these areas such that each one presupposes and is built on the previous one, yet we could also view them as complementary rather than strictly nested.</p>
<h3 id="heading-rigor-1-focus-on-learning-and-decisions">Rigor 1. Focus on Learning and Decisions</h3>
<p>The first area of rigor is to focus on decisions that we need to make, and how to learn appropriately from customers, users, and others to make those decisions.</p>
<p>This may seem obvious and yet it often goes wrong in industry UX practice. For example, a classic usability test to find usability issues may go wrong if there is no process in place to act on and fix the issues that are found, or if management will simply dismiss them as “by design” or “won’t fix.” Likewise, a customer satisfaction (CSat) tracker is of no use unless there is a process in place to do something with the results.</p>
<p>To focus on decisions doesn’t mean to reduce everything to an A/B test or a specific product decision. A “decision” may mean a product action such as changing a feature but it may also mean to take some other action. For instance, if CSat drops on a tracking survey, an appropriate decision may be to launch a follow-up survey, to do usability testing, or to conduct depth interviews.</p>
<p>The important point is to have a decision process in place before research commences. That helps both to frame the research and to know what to do with it rather than simply hoping for “impact.”</p>
<h3 id="heading-rigor-2-quality-data-and-methods">Rigor 2. Quality Data and Methods</h3>
<p>Once we know what decision(s) to influence, we need high quality data and appropriate research methods. This area is most similar to traditional conceptions of “rigor” although I would emphasize a few differences. For one thing, as I noted above, it doesn’t imply rigid data collection or advanced methods.</p>
<p>Also, to address a decision point rigorously, we typically need multiple kinds of information. The proverbial “quant-qual sandwich” (alternating quantitative and qualitative projects) is one approach. Perhaps the most common question in this realm is “why?” Suppose customers prefer some potential feature on a survey. That answer is insufficient; we also need to know “why?” We may also wish to know “what else?” and to gain insight into “instead of what?” … for example, whether their preference comes at the expense of brand perception, another product of ours, an improved competitive position, and so forth.</p>
<p>From a methods point of view, this implies two things. First, the methods required may be straightforward. If multiple sets of data agree or disagree, there is no need for fancy analysis. Second, it implies that simply “getting an answer” is not enough. The point of having a specific decision frame (“Rigor 1” above) is not to exclude broader learning; rather, it is to set a target that we can hit directly and, ideally, surround with complementary insight.</p>
<h3 id="heading-rigor-3-effective-stakeholder-engagement">Rigor 3. Effective Stakeholder Engagement</h3>
<p>The third aspect of rigor is to ensure that you have both the relevant processes and skills to engage with stakeholders. Too often, I meet Quant UX researchers who hope that “the data speak for themselves.” But data sets don’t speak — that is our job.</p>
<p>In an ideal situation, a research team may have multiple members who divide the tasks of stakeholder engagement, research operations, analytics, and so forth. It is unreasonable to expect a single researcher to be highly competent in all of those areas.</p>
<p>Now, you might wonder, “Wait? Why is stakeholder engagement at level 3 instead of being a baseline level 1?” The reason is simple: <em>I’ve met too many UXRs, marketing researchers, and even academic researchers who are great at stakeholder engagement but who do not have the technical skills to deliver quality research</em>. Without quality research, great stakeholder engagement and storytelling too often end up as random walks or exercises in self-congratulation.</p>
<p>What is effective engagement? I would identify a few aspects. First, it is decision-oriented and the results are both clear and directly informative of the decisions to be made. Second, it pushes back on stakeholders to clarify the questions and say “no” when a project or question is infeasible. Third, it is unafraid to deliver negative results even when they are unpopular, such as, “No, users just don’t want that.” Fourth, it does this in a way that stakeholders will hear and value the results, and will want more in the future.</p>
<p>Is all of that a tall order and difficult to accomplish? Yes, and that’s why it is level 3 in my stack of “rigor”.</p>
<p><em>One other note</em>: <strong>too many organizations gaslight researchers into believing that it is a researcher’s fault for not having enough “impact”</strong> or poor stakeholder engagement. These relationships require two parties (at least) and failure, poor fit, or less-than-useful results do not arise because of researchers alone. Sometimes the best way to have better stakeholder relations is to change stakeholders (i.e., teams, organizations, jobs, or industries).</p>
<blockquote>
<p>As a side note, I’ll call out an implication of areas 1 + 2 + 3 when they are taken together: Rigor is not about the quantity of research, but about how well it addresses crucial problems.</p>
<p>It is better to determine the most important stakeholder questions and address them head-on than to produce an endless stream of learning about users or answer small-stakes questions. And that leads us to the final area …</p>
</blockquote>
<h3 id="heading-rigor-4-attention-to-higher-order-strategic-decisions">Rigor 4. Attention to Higher-Order Strategic Decisions</h3>
<p>The final area of “rigor” builds on effective research and stakeholder engagement to deliver broader insight for a business. Unlike areas 1, 2, and 3 above, this is a relatively rare area but one that is rewarding to both organizations and researchers if (and only if) they are prepared for it.</p>
<p>This involves attention to issues including:</p>
<ul>
<li><p><strong>Opportunity costs of both product decisions and research engagement.</strong> If we do X, what does that mean for all of the things T, U, V, and W that we did not do? If we spend time researching the <em>hot question of the day</em>, what are we not learning that could be more valuable?</p>
</li>
<li><p><strong>The asymmetric payoff matrix for all decisions and findings</strong>. For any given research result, there is some chance we will be wrong; and for any decision there is an expected benefit as well as a potential cost. When we put those together, we can better inform product and business decisions (see the brief sketch after this list). A full consideration of this area is far beyond the scope here, but I will note two extremely common failure points: (a) a presumption that a product, feature, service, or business will happen, paired with a research demand to “validate” it, failing to consider the alternative that the business may be unneeded or impossible; and (b) a presumption that research should occur just because a question is “important.”</p>
</li>
<li><p><strong>Models for research allocation and staffing</strong>. With good stakeholder engagement, credibility, and skills in broad research for decision making, we can turn those skills to address questions such as, “What should we be doing with research?”, “What are the things we don’t know?”, and “What is the right mix of research staffing?” The skills needed to answer those questions are exactly the same as those needed to answer any other research question. If you’re thinking, “wait! that would be a massive research project,” then I’ll point out again that research does not require “statistical significance” or large data sets. Instead, it requires an important decision plus learning to answer it … and even when an answer is imperfect, the expected value of <em>not</em> answering it is much worse than getting an approximate and useful answer.</p>
</li>
</ul>
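<p><em>To make the asymmetric payoff point concrete, here is a minimal sketch in R. The probabilities and payoffs are purely hypothetical assumptions for illustration, not values from any real project:</em></p>
<pre><code># Minimal sketch (hypothetical numbers): expected value of shipping a feature
# when the research finding might be wrong and the payoffs are asymmetric.
p.correct &lt;- 0.80     # assumed chance the favorable finding is right
benefit   &lt;- 10       # relative payoff if we ship and it succeeds
cost      &lt;- -40      # relative payoff if we ship and it fails

ev.ship &lt;- p.correct * benefit + (1 - p.correct) * cost
ev.wait &lt;- 0          # baseline: hold off, learn more, decide later

ev.ship   # 0.8 * 10 + 0.2 * -40 = 0 ... no better than waiting
</code></pre>
<p>Even a finding we are 80% confident in may not justify acting when the downside is large; making that arithmetic explicit is what this area of rigor asks us to do.</p>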
<p>There is <strong>one particular failure mode of “strategic UX research”</strong> that I would like to call out: research into product strategy that is untethered to any particular decision or decision process, and that instead is merely hoping to find influence by calling itself strategic. Too many projects in “<em>foundational research</em>” fall into this category. They learn a lot about customers but do not have the decision orientation, stakeholder buy-in, quality data, and attention to asymmetric results that are actually required to inform strategy. Sadly, such projects all too often end up with the UX researchers showing up like junior product managers advocating for some particular direction in hopes they will win a battle for influence.</p>
<p>A common question about Level 4 strategic research is, “how do I get there? how can I sell it?” My take is that it can’t and shouldn’t be pressed as a direction by researchers. Instead, it requires (a) building the skills needed to do it, such as depth in game theoretic thinking, (b) great stakeholder engagement based on prior, more focused success, and (c) waiting for the right time and set of questions.</p>
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>I’ll boil this post down to two sentences:</p>
<p><strong>UX Research rigor is not a property of methods or data. Instead, it is about informing effective decisions, in conditions of uncertainty, through appropriate learning from our users and customers.</strong></p>
<p>If I had to emphasize one skill for effective and rigorous research — and really, it is a frame of thought more than a skill as such — it would be to ask “what if?” incessantly. What if … I used 4 small sample methods instead of 1 large project? What if … the answer from customers is “no”? What if … we didn’t do this project? What if ... we shipped this feature or this product and we are wrong?</p>
<p>When that orientation is paired with individual skill in data collection and analysis, you will be well on the way to having truly “rigorous” research!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770145888684/9b27a0ea-150c-4f47-94df-7bb1e5e1c012.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Qualitative Inquiry Can Make Quantitative Work More Meaningful: But Which Comes First?]]></title><description><![CDATA[Kelly Moran
[Note from Chris: This week I've invited a post from Kelly Moran, a longtime colleague and exceptionally thoughtful and experienced qualitative researcher and anthropologist. Many of us are familiar with the qual/quant "sandwich" that alt...]]></description><link>https://quantuxblog.com/qualitative-inquiry-can-make-quantitative-work-more-meaningful-but-which-comes-first</link><guid isPermaLink="true">https://quantuxblog.com/qualitative-inquiry-can-make-quantitative-work-more-meaningful-but-which-comes-first</guid><category><![CDATA[qualitative data and quantitative]]></category><category><![CDATA[#Ux research]]></category><category><![CDATA[UX]]></category><category><![CDATA[quantux]]></category><dc:creator><![CDATA[Kelly Moran]]></dc:creator><pubDate>Thu, 20 Nov 2025 19:11:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762792658523/b459dc13-78ce-4758-b5fc-b06a8c441e6b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kelly Moran</p>
<p>[<em>Note from Chris: This week I've invited a post from Kelly Moran, a longtime colleague and exceptionally thoughtful and experienced qualitative researcher and anthropologist. Many of us are familiar with the qual/quant "sandwich" that alternates research approaches. Kelly shares deeper thoughts about how to get the sequence right, and I recommend these considerations to all UX researchers.</em>]</p>
<p>In the data-obsessed world we live in, companies are constantly reaching for numbers as the foundation for improving the performance of their products and services. A commitment to gathering and analyzing user data is a critical component for building an understanding of how, and whether, we’re meeting customer needs. Harnessing this information allows teams to move beyond assumptions, identify pain points, and strategically design solutions that enhance usability, satisfaction, and ultimately, business outcomes. We know in UX research that numbers are often only part of the story.</p>
<p>My name is Kelly Moran, and I’ve been working on applying research to business and consumer problems since the early 00’s. I have an MS degree in Applied Anthropology and I’ve worked in both consulting and in-house, most recently on Google Search. I’m currently leading Experience Research at the global experience consulting company, Geniant.</p>
<h3 id="heading-mixing-methods">Mixing Methods</h3>
<p>Combining quantitative (quant) and qualitative (qual) research methods is a hot topic right now, and for good reason. A common question is which one should come first. I often hear the perspective that starting with quant will bring up questions that the team can then dig into with qual research for a deeper understanding of the “why.” This is a great approach. By identifying <em>what</em> is happening through quantitative data (the numbers and counts), teams can then use qualitative methods (the rich, descriptive, word-based data) to explore the motivations, context, and underlying reasons behind those metrics. It allows for a targeted investigation into the phenomena the numbers highlight. But sometimes flipping that script can ensure you're measuring the right things in the first place, or perhaps more accurately, that the quantitative measurements you are taking truly reflect the real-world behaviors and experiences you intend to capture.</p>
<p>This alternative approach (starting with qualitative research) can be incredibly powerful for checking your underlying assumptions and refining your data collection instruments before a large-scale quantitative study is even launched. By first engaging with a smaller group through observation, interviews, or other qualitative methods, you gain a deep, nuanced understanding of the landscape. This initial insight can reveal blind spots in your current metrics or suggest entirely new categories of data that should be tracked. In essence, starting with qual lets you refine the <em>questions</em> you ask in your quant research, ensuring the resulting data is both accurate and meaningful.</p>
<p>I have an illustrative example from some consulting work that demonstrates the value of a qualitative-first approach. I start by laying out the problem as the business saw it, the approach our research took, what we learned, and how we recommended a larger change to the way the business approached understanding and building for their own team.</p>
<h3 id="heading-responding-to-poor-metrics">Responding to Poor Metrics</h3>
<p>A company that services loans wanted to improve the metrics coming in from their call center. They had a team of agents for borrowers to reach out to by phone with questions or for general service on their loans, and at the end of each call, agents were required to enter a reason for the call into their call management system. They had a list of about half a dozen options to choose from. The reason “make a payment” was selected over 80% of the time. And this made sense. Paying a loan is important, and hopefully every borrower is making frequent and timely payments. But the company had put a lot of work into making it easy for people to pay their loans online. Loan payment should be a straightforward process, yet according to the metrics, it was taking a lot of time for call center agents to get through all these calls processing payments for customers over the phone. Relieving this load from the agents would decrease the wait time for borrowers calling in about other matters.</p>
<p>So they kept working on improving the online payment process.</p>
<p>And they kept seeing “make a payment” as the dominant reason borrowers called in.</p>
<p>They began to wonder if they needed to work this from another angle. Shifting your mindset is a great way to approach a sticky problem. They decided making the software easier on the <em>agent</em> side might at least help agents get through those payments faster.</p>
<p>They engaged with my team to observe agents in the call center so we could make context-informed design recommendations for the agent-facing software.</p>
<h3 id="heading-what-we-learned-was-surprising">What we learned was surprising</h3>
<p>We ended up listening in on 98 calls over a few days of observation with several agents. We did see opportunities to improve the design of the agent software, but we also learned something else.</p>
<p>In our observations, the frequency of calling in to make a payment was much lower than 80%. In fact it was under 50%. Instead, many borrowers were calling to <em>check</em> on a payment. These customers had used the website to make a payment but the system had a delay in showing funds applied (a banking issue we recommended they address), so borrowers were calling to be sure their payment had gone through or at least to ask that it not be counted as late, as the delay was out of their control.</p>
<p>Agents would assure the customer that all was well, and at the end of the call, they’d check the “Make a Payment” box because “Reassure the Borrower” was not an option. Nothing else on the reasons list was close enough.</p>
<p>The data was wrong because the selection options did not accurately reflect the real world.</p>
<p>The fix was multi-faceted. On the one hand, they needed to either address the delay in applying payments or at least provide clearer messaging that a payment was being applied and that no late fees would result from the processing time. A sophisticated solution would include a call routing system that identified a caller by phone number, registered that their account had recently had a payment submitted online, and provided a recorded message that the payment was in process. This alone could have brought their “make a payment” numbers down, as perhaps some callers would hang up before reaching an agent.</p>
<p>The agent-side fix would not only improve their “make a payment” metric but also provide a more accurate picture of why customers call in by improving the data being collected. They needed to reassess the list of reasons for calls, which our qualitative data provided a great starting point for. On the other hand, if they had begun their original effort to collect data on “reasons for call” with a qualitative review of those calls, they could have had a more accurate picture from day 1 of metrics collection.</p>
<h3 id="heading-there-is-real-power-in-rotating-between-qual-and-quant">There is real power in rotating between qual and quant</h3>
<p>The benefits of pairing quantitative and qualitative research methodologies are clear, but the sequencing can feel challenging. While the quant-first approach, using numbers to raise questions for qual to answer, is valuable for targeted follow-up, the example from the loan servicing company demonstrates the practical value of leading with qualitative inquiry. Had the company implemented a qualitative discovery project before they implemented their call categorization metrics, they could have saved a lot of time trying to fix an online payment system that was already working (albeit missing critical follow-up messaging). And without my team happening to be in the right place observing for a different design purpose, the company would have continued to invest in the wrong solutions.</p>
<p>Qualitative research can serve as a vital diagnostic tool, ensuring your quantitative instruments are calibrated to the real world. By doing qual first, you don't just dig deeper into existing data; you ensure you are measuring the right things in the first place, leading to truly effective, context-informed solutions.</p>
<h3 id="heading-so-which-should-you-do-first">So which should you do first?</h3>
<p>As always with research, it depends. Is your team confident they have a clear enough understanding to create meaningful categories for your quantitative data? Is there any reason why they could not pause for a qualitative review of the landscape? Sometimes secondary datasets can provide useful context for this work. And critically, <strong>what risks</strong> might there be in collecting inaccurate or incomplete data? Talk through these questions with your team and decide how to create the best combined quant/qual plan for your workstream. If confidence is high, timing is tight (and second-guess that decision, always), and the risk is low, suggest circling back once metrics start rolling in to see how a qualitative project could add clarity. Otherwise, get out there and start defining your customers’ reality with descriptive data.</p>
<hr />
<p>The cover image is adapted from a photo by <a target="_blank" href="https://unsplash.com/@jemsahagun?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Jem Sahagun</a> found on <a target="_blank" href="https://unsplash.com/photos/danbo-standing-on-laptop--kqC3rZEMBI?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Things I'm Hearing about UX Research]]></title><description><![CDATA[In the past few months, I’ve had many conversations with senior folks across a variety of MAMANG and similar companies. These are all UXR managers or senior ICs (L5 and up, largely L6-L8 if you’re familiar with those levels). I expected to hear a var...]]></description><link>https://quantuxblog.com/things-im-hearing-about-ux-research</link><guid isPermaLink="true">https://quantuxblog.com/things-im-hearing-about-ux-research</guid><category><![CDATA[uxresearch]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Wed, 20 Aug 2025 17:25:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/zoCDWPuiRuA/upload/697289b7c3a47322e84e5f0f5468b1af.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the past few months, I’ve had <strong>many conversations with senior folks</strong> across a variety of MAMANG and similar companies. These are all UXR managers or senior ICs (L5 and up, largely L6-L8 if you’re familiar with those levels). I expected to hear a variety of one-off, situational discussions, but to my surprise, <strong>the themes have been surprisingly coherent</strong> across my conversations.</p>
<p>In this post, I want to share what I’m hearing. This is reporting, and <strong>I’m not arguing for or against any particular claim.</strong> Each claim could be a contentious discussion, and I’m neither inclined nor sufficiently broadly informed to engage in the arguments in detail.</p>
<p>Put differently, <strong>my goal is simply awareness</strong> — and closely related, I hope that UXRs who feel isolated, gaslit, anxious, or confused will understand that they are not alone.</p>
<p>I will preface this with a <strong>caveat</strong>: <strong>these themes are what I am hearing</strong>, but I am not a random or representative sample. My connections and I self-select to chat, exactly because we often agree. Nevertheless, the views I’m hearing from senior UXRs are important in the UX industry.</p>
<hr />
<h2 id="heading-1-the-job-market-is-chaotic-amp-unpredictable-for-uxrs">(1) The Job Market is Chaotic &amp; Unpredictable for UXRs</h2>
<p>I expect that no one will be surprised by this section.</p>
<p>For various reasons — ranging from layoffs to dissatisfaction — many of the UXRs I talk with are looking for new positions, and are quite frustrated by the experience. I’m hearing four primary trends:</p>
<ul>
<li><p>A feeling that “there are no jobs”</p>
</li>
<li><p>Over-leveling, with everyone seeking higher positions</p>
</li>
<li><p>More emphasis on specific skills, especially “quant” and “mixed methods” (maybe without understanding them)</p>
</li>
<li><p>Extreme asymmetry in openings between tech hubs — especially Silicon Valley, and to some extent Seattle and New York — versus most other locations and remote</p>
</li>
</ul>
<p>As for <strong>over-leveling</strong>, there is a trend towards acceleration of career levels, with folks expecting promotion every few years at least. This leads to two problems: (1) folks who are in over their heads when their breadth of experience (in methods, approaches, products, and politics) doesn’t match expectations of senior stakeholders, and (2) fewer job openings as they move up, paired with higher demands, leading to lower satisfaction. We say much more about that in Chapter 14 of the <a target="_blank" href="https://www.amazon.com/Quantitative-User-Experience-Research-Understanding/dp/1484292677">Quant UX book</a>.</p>
<p>I am also hearing a <strong>lot of demand for quant skills</strong>, with positions often listing something “quant” added onto general UX research skills. However, it’s difficult to sort out the extent to which hiring managers actually understand what they are seeking in these cases. Some managers may expect little more than general attention to UX metrics (see <a target="_blank" href="https://quantuxblog.com/how-to-make-heart-metrics-work-in-practice">Kerry Rodden’s excellent post</a>) while others may imagine — quite unrealistically — that they can find a UXR who spans everything from qualitative fieldwork to data science logs analysis. Mostly, my advice here is for hiring managers to be very specific (see the <a target="_blank" href="https://www.amazon.com/Quantitative-User-Experience-Research-Understanding/dp/1484292677">Quant UX book</a>) and for candidates to ask lots of questions and not oversell their quant skills. To be fair, what folks tell me about this may be biased due to my own experience.</p>
<p>There are some signs of recovery in the market, but the recovery appears tepid. My main recommendation for folks outside the Bay Area is to <strong>consider relocating</strong> there or at least to Seattle or another major tech hub. <em>Yes, I know relocation is painful, expensive, and perhaps even unfair</em>. I’m only saying that because it is the reality I am seeing.</p>
<blockquote>
<p><em>TBH, I took the relocation advice once myself. In 2011, I was looking for new positions and the market in Seattle was bad at that time. Google offered me positions in Mountain View and New York, and I ended up moving — for 3 years, until transferring back to Seattle. In retrospect, it was a great career choice.</em></p>
</blockquote>
<hr />
<h2 id="heading-2-yet-hiring-managers-dont-see-the-candidates-they-expect">(2) Yet Hiring Managers Don’t See the Candidates They Expect</h2>
<p>In social media (e.g., LinkedIn) there is a pervasive sense that UXR jobs are few, difficult to land, and swamped with applications. Given that, <strong>I was surprised by several hiring managers (HMs) who perceive a lack of UXR candidates.</strong> Not a single HM said they have <em>too many</em> candidates.</p>
<p>The three main things I heard about hiring are:</p>
<ol>
<li><p>There are some jobs although headcount is tight</p>
</li>
<li><p>HMs don’t see enough candidates</p>
</li>
<li><p>Candidates don’t have the right skills</p>
</li>
</ol>
<p>The first point is pretty obvious. Hiring managers often report that they are able to hire again, but they have <strong>substantially reduced headcount</strong>, compared to peak years such as 2022.</p>
<p>The second claim, from several folks, greatly <strong>surprised me: that they do not have enough candidates</strong>. For instance, I heard of one senior position at a MAMANG company that received fewer than 10 candidates, and among those applicants, the HM judged that none was qualified. To be clear, this is <em>not</em> due to an overall lack of applicants; I’ve heard this from managers at two of the top five largest tech companies, where the application pools are overflowing with candidates. And yet few filter through to HMs.</p>
<p>I don’t have a good explanation for this, although <strong>one important factor is how recruiting has changed</strong> since COVID and after the big rounds of layoffs in 2023 &amp; 2024. Many long-tenured and knowledgeable recruiters lost their jobs or changed companies.</p>
<p>By the time hiring started to resume, <strong>companies had lost the recruiting skills and knowledge needed to understand, screen, assess, and attract specialized candidates such as Quant UXRs</strong>. It is not a position that can be reduced to a list of qualifications such as knowledge of methods. Instead, it is crucial for recruiters to understand how quant skills relate to stakeholder needs. When they understand that, they can improve screening by lightly probing candidates in initial interviews to see whether they are able to adapt and explain concepts and impact clearly. It takes time, experience, and support for <em>recruiters</em> to develop those skills. Additionally, as recruiters and hiring managers turn over, they lose the connections and understanding that previously led to alignment on potential candidates.</p>
<p>Meanwhile, the deluge of job-seekers — multiplied by AI and resume-padding — has overwhelmed recruiters. The difficulty for recruiters is compounded by rapidly changing requirements from upper management, fluctuating numbers of openings, and consequent pressure from hiring managers to fill positions quickly. An open position one week may be repurposed the next week, abandoning candidates who are in process … and discouraging recruiters from investing time in the next round of candidates.</p>
<p>The net result: <strong>the candidates who make it to hiring managers are far more random</strong>, and less well-screened than they were in pre-COVID times.</p>
<blockquote>
<p><strong>Recommendation</strong> for current, senior Quant UXRs: have you done 10+ interviews with candidates? Reach out to your recruiter(s) and ask whether they would sync with you about the experience. Perhaps you could sit down for an hour to review candidates in their pipeline. You might discuss the things you would look for on CVs, or simple phone screen questions with good and not-so-good example answers. This may boost the quality of candidates, and make a recruiting friend!</p>
</blockquote>
<hr />
<h2 id="heading-3-many-senior-uxrs-are-disillusioned-by-colleagues">(3) Many Senior UXRs Are Disillusioned by Colleagues</h2>
<p><em>I love the UXR community — so this section is painful for me. But I am sharing it as I heard it.</em></p>
<p>This is a short section and makes me sad. I have increasingly heard from senior UXRs (L6 aka “Staff Researchers” and up) that they view colleagues and junior researchers as lacking skills. They claim things like:</p>
<ul>
<li><p>Research is rudimentary and responds only to direct stakeholder questions, missing the larger picture</p>
</li>
<li><p>Research obtains obvious results and then is oversold as being very important</p>
</li>
<li><p>Research design has significant flaws such as skewed samples, leading questions, bad survey panel data, dubious methods, and the like</p>
</li>
<li><p>Specifically on the quant side, research may be primarily “procedural” in applying advanced methods or a great deal of code and analysis, but is applied to low quality or largely irrelevant data</p>
</li>
<li><p>Colleagues use questionable tools — LLMs are often called out — to write reports, deliver unreliable findings, and the like</p>
</li>
</ul>
<p>This is often summed up as, “they have no idea what they’re doing” or “they just crank out meaningless reports.”</p>
<p>Now, <strong>I’m not making those claims, just reporting them</strong>. My personal impression is that we’re seeing a combination of several things:</p>
<ul>
<li><p><strong>Everyone is stressed and stretched</strong>. Junior UXRs do not have the mentoring or time to develop skills on the long time horizons that used to be common. Senior UXRs take out their frustrations by complaining about research quality.</p>
</li>
<li><p><strong>Demands are increasing</strong>, as described above about management expectations</p>
</li>
<li><p><strong>Research is becoming more difficult to do, with less support</strong>. This comes from a combination of difficulty recruiting participants, low quality survey panels, and shrinking budgets for research.</p>
</li>
<li><p>The <strong>breadth of UX methods has grown</strong> such that no one can be expert in everything; and yet stakeholders may well demand that any UXR tackle any kind of research.</p>
</li>
<li><p>As mentioned above, <strong>over-leveling</strong> causes both real problems and also interpersonal resentment.</p>
</li>
</ul>
<p>My main recommendation here is for everyone to try to look for the best in others.</p>
<blockquote>
<p><strong>Managers</strong>: help your team do less in terms of breadth, and instead work with greater depth and partnership. Instead of having 4 UXRs run 8 different projects … try having 2 teams of 2 UXRs focus on 4 more important projects. You will have less “coverage” but that will be strongly offset by having much greater depth on the most important projects. 2 UXRs working together have a multiplicative effect. And it will build cohesion, develop skills, and insulate your team from shocks of turnover and the like.</p>
</blockquote>
<hr />
<h2 id="heading-4-and-uxrs-are-too-often-disillusioned-especially-llm-hype-from-upper-management">(4) ... and UXRs are too often disillusioned, especially LLM hype from upper management</h2>
<p>LLMs were an important topic in <strong>every single conversation</strong> with senior UXRs. Typically AI / LLMs emerged as the #1 issue of discussion as a conversation progressed.</p>
<p>The most common concern is this: <strong>UXRs feel that LLMs are being pushed into products without any clear understanding of user needs</strong> for them.</p>
<p>Specifically, UXRs often see, feel, or believe one or more of the following:</p>
<ul>
<li><p>Teams are told to “find a need” under a predetermined assumption that LLMs must be valuable</p>
</li>
<li><p>If UXRs find that customers do <em>not</em> want an LLM feature, management won’t accept the finding</p>
</li>
<li><p>If UXRs push back, they believe their own jobs are at risk in an anticipated next round of layoffs</p>
</li>
<li><p>They worry that management doesn’t understand LLMs to begin with, so it is all an exercise in futility</p>
</li>
<li><p>They fear (or observe) this LLM push will erode product value, user satisfaction, and brand trust</p>
</li>
</ul>
<p>These UXRs view the continual search for “ways to use LLMs” as a fool’s errand when there are urgent and obvious unmet needs and pain points in products. They may be seeing decreasing user satisfaction, pushback from users about AI, security threats, worsening UI experience, declining overall code quality, and immediately addressable needs that users are requesting — all of which are ignored in the promise of some AI utopia to come. <strong>Doing research on <em>yet-another-AI-concept</em> can be demoralizing in the face of actual user needs</strong>.</p>
<p>To be clear, <em>I’m not saying that there are no uses for LLMs</em>. I’m saying that AI does not change everything overnight, and that management expectations for AI are often out of alignment with users and users’ needs.</p>
<p><strong>This mismatch between management expectation and the reality of interacting with users can lead to UXRs responding with helplessness, apathy, cynicism, dishonesty</strong> (such as telling management what it wants to hear), and other ultimately self-destructive responses.</p>
<blockquote>
<p><strong>Recommendation</strong>: if your research projects appear unrealistic, short-sighted, unlikely to ship or have impact, or even border on being delusional, find a way to add something that delivers value in another way. That might be foundational research you can do “for free” (and perhaps not even report), it might be learning a new method to build your skills, or it might be following up on prior research. Research plans almost always offer the flexibility to learn more than simply meeting immediate goals. Best case, the team learns something. But the worst case is also good: you learn something!</p>
</blockquote>
<hr />
<h2 id="heading-but-uxrs-love-being-uxrs-so-cautious-optimism">But UXRs Love Being UXRs: So, Cautious Optimism</h2>
<p>The final thing I heard across my conversations was this: <strong>UXRs love being UXRs</strong>. Much of the anxiety concerned whether they would be able to continue as UXRs (or perhaps have to look for jobs in data science or PM); and whether the role of UXR could continue to look the way it has in the past, with the excitement and enjoyment of learning about users from direct interaction as well as data.</p>
<p>I’ve often said that UX Researcher (and its variants such as Quant UXR) is the best job in the world … for the right person. For people who love learning about many different products, tackling ambiguous research goals, doing research well, meeting customers and helping to solve their needs, and working in generally enjoyable and good positions, it is difficult to imagine a better career. <strong>Ultimately, learning from users and delivering better products must deliver high value on average</strong>.</p>
<p>So I expect the UXR situation to improve both in terms of jobs and research demands. <em>When</em>? That’s harder to say … but each of us can help that day come sooner. I hope these reflections and recommendations might help.</p>
<blockquote>
<p><strong>Final personal note</strong>: to increase well-being and generally build community, I’ve organized a Zen meditation group for the Tech — especially UXR — community. This is not a sales pitch (and we don’t sell anything!) but just my personal note. There are many paths to well-being apart from Zen, but if you’re interested, details are at <a target="_blank" href="https://tczen.org"><strong>https://tczen.org</strong></a>. Join a session or sign up for the newsletter.</p>
<p>More generally, treat your colleagues, the world, and yourself kindly. We’re all in this together!</p>
</blockquote>
<p>For much more discussion, to meet other UXRs, and to hear different perspectives, I also encourage you to join us for Quant UX Con 2025 in November. <a target="_blank" href="https://quantuxcon.org">Find details here</a>!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748018267452/49cc88ed-1e5b-4359-8ac5-a38db6c5bd11.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Two Year Highlights of the Blog]]></title><description><![CDATA[I started this blog in August 2023. After 2 years I’d like to highlight a few posts. I hope these will have continuing interest for anyone who missed them; and also may be good introductions for new subscribers (“subscribe” is always free here; click...]]></description><link>https://quantuxblog.com/two-year-highlights-of-the-blog</link><guid isPermaLink="true">https://quantuxblog.com/two-year-highlights-of-the-blog</guid><category><![CDATA[quantux]]></category><category><![CDATA[#Ux research]]></category><category><![CDATA[survey research]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Thu, 14 Aug 2025 15:15:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754247857655/9d8a60ed-5e97-4bbe-a0b7-7e73dd3da49e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I started this blog in August 2023. After 2 years I’d like to <strong>highlight a few posts</strong>. I hope these will have continuing interest for anyone who missed them; and also may be good introductions for new subscribers (“subscribe” is always free here; click the envelope ✉️+ icon in the upper right!)</p>
<p>I sort these articles into categories of:</p>
<ul>
<li><p><em>Tech Community Reflections</em></p>
</li>
<li><p><em>Quant Methods</em></p>
</li>
<li><p><em>Method Critiques</em></p>
</li>
<li><p><em>Career Advice</em></p>
</li>
<li><p><em>Distinguished Guests</em></p>
</li>
</ul>
<p>I’ve added a brief note about each topic as a preview or comment. I hope you find something new and provocative in these articles!</p>
<hr />
<h2 id="heading-tech-community-reflections">Tech Community Reflections</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754855222853/0f14b40b-fb8d-4666-b993-d6e527181333.png" alt class="image--center mx-auto" /></p>
<p><strong>The End of Tech as a Big Family</strong>: <a target="_blank" href="https://quantuxblog.com/the-end-of-tech-as-a-big-family">https://quantuxblog.com/the-end-of-tech-as-a-big-family</a>. This post reflects on the personal anxieties and changes in Big Tech that were exposed during recent layoffs.</p>
<p><strong>We’re Far from AGI:</strong> <a target="_blank" href="https://quantuxblog.com/were-far-from-agi">https://quantuxblog.com/were-far-from-agi</a>. As a psychologist and sometimes philosopher, I explain why proponents of AGI and “superintelligence” don’t understand what they’re claiming (or more precisely, they don’t understand what they are <strong>not</strong> discussing).</p>
<hr />
<h2 id="heading-quant-methods">Quant Methods</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754669737180/01dfd98c-d94b-40e9-b22a-f8ca15d4e769.jpeg" alt class="image--center mx-auto" /></p>
<p><strong>Multidimensional Sentiment Analysis</strong>: <a target="_blank" href="https://quantuxblog.com/multidimensional-sentiment-analysis-part-1">Part 1</a>, <a target="_blank" href="https://quantuxblog.com/multidimensional-sentiment-analysis-part-2">Part 2</a>. If I were to sort methods into quadrants on a 2×2 grid, arranging them by powerfulness vs. frequency of usage, then multidimensional sentiment analysis would score as a “high opportunity” — very powerful but surprisingly rarely used. I’d like to change that!</p>
<p><strong>Easy MaxDiff in R</strong>: <a target="_blank" href="https://quantuxblog.com/easy-maxdiff-in-r">https://quantuxblog.com/easy-maxdiff-in-r</a>. If you’d like to understand MaxDiff from the ground up, this short code explainer may help (note: the code is for <em>educational</em> purposes, not production). It looks at the observations that MaxDiff collects, and basic statistics for that data.</p>
<p><strong>Individual Scores in Choice Models, Part 3: Respondent Segments</strong>: <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-3-respondent-segments">https://quantuxblog.com/individual-scores-in-choice-models-part-3-respondent-segments</a>. I examine one way to segment MaxDiff data. But even more, this article is about how <em>segmentation is usually misunderstood</em> and what it actually does.</p>
<p><strong>Individual Scores in Choice Models, Part 4: Inspecting Model Fit with RLH</strong>: <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-4-inspecting-model-fit-with-rlh">https://quantuxblog.com/individual-scores-in-choice-models-part-4-inspecting-model-fit-with-rlh</a>. I look at the RLH fit statistic in conjoint analysis and MaxDiff. Using basic math, theory, and code, I show why <em>common heuristics to interpret RLH are misleading</em>. RLH is diagnostic but cannot “filter” respondents.</p>
<hr />
<h2 id="heading-method-critiques">Method Critiques</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754669797978/785aa112-c9fe-4909-8c4b-ab9aa1d1a552.jpeg" alt class="image--center mx-auto" /></p>
<p><strong>Surveys and the “True Score” Mistake</strong>: <a target="_blank" href="https://quantuxblog.com/surveys-and-the-true-score-mistake">https://quantuxblog.com/surveys-and-the-true-score-mistake</a>. This article explains why <em>surveys are not about finding “the truth”, aka a latent score, in people's heads</em>. Instead, surveys are about listening to people. (This article is also one of the grounds for my criticism of synthetic data in the next article.)</p>
<p><strong>Synthetic Survey Data? It’s Not Data</strong>: <a target="_blank" href="https://quantuxblog.com/synthetic-survey-data-its-not-data">https://quantuxblog.com/synthetic-survey-data-its-not-data</a>. This article explains why I believe <em>the concept of “synthetic data” is fundamentally flawed</em> in multiple ways, including basic logic, statistics, and the scientific method.</p>
<p><strong>Critical Assessment of the Kano Model</strong>: <a target="_blank" href="https://quantuxblog.com/critical-assessment-of-the-kano-model-part-1">Part 1</a>, <a target="_blank" href="https://quantuxblog.com/critical-assessment-of-the-kano-model-part-2">Part 2</a>. If you are tempted to use the Kano Model to identify “delighter” products … I don’t recommend it. This pair of articles explains why not, and gives alternatives.</p>
<hr />
<h2 id="heading-career-advice">Career Advice</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754669833462/d33a541e-acd2-410b-931e-aa6a02de7c7d.jpeg" alt class="image--center mx-auto" /></p>
<p><strong>Skills Combination for Quant UX Applications:</strong> <a target="_blank" href="https://quantuxblog.com/skills-combination-for-quant-ux-applications">https://quantuxblog.com/skills-combination-for-quant-ux-applications</a>. I’ve interviewed hundreds of candidates and, for several years, coordinated the standard Quant UX hiring criteria at Google. This article <em>summarizes the skills needed to be a Quant UXR</em>, and is also a good preview of the <a target="_blank" href="https://quantuxbook.com">Quant UX book</a>!</p>
<p><strong>Recommendations for Quant UX Interview "Portfolio" Presentations</strong>: <a target="_blank" href="https://quantuxblog.com/quant-ux-interview-portfolio-presentations-recommendation">https://quantuxblog.com/quant-ux-interview-portfolio-presentations-recommendation</a>. If you are asked to give a Quant UX research presentation — sometimes confusingly called a “portfolio” presentation — this is how to approach it. The short version: <em>give a research presentation that interests you</em>; don’t try to guess some “right” content.</p>
<hr />
<h2 id="heading-distinguished-guests">Distinguished Guests!</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754669945252/01b1c51e-9ff6-418f-9040-3f5ec6a588b7.jpeg" alt class="image--center mx-auto" /></p>
<p><strong>How to make HEART metrics work in practice</strong>: <a target="_blank" href="https://quantuxblog.com/how-to-make-heart-metrics-work-in-practice">https://quantuxblog.com/how-to-make-heart-metrics-work-in-practice</a>. Kerry Rodden, one of the founders of the Quant UX discipline, examines the evergreen and useful HEART framework, detailing practical advice to apply it. <em>HEART is my all-time single favorite Quant UX framework</em>, and a memorable heuristic to draw on in many, many research contexts.</p>
<p><strong>Be a T-shaped Quant UXR: How Doing Qualitative Research Made Me a Better Quantitative UX Researcher</strong>: <a target="_blank" href="https://quantuxblog.com/be-a-t-shaped-quant-uxr-how-doing-qualitative-research-made-me-a-better-quantitative-ux-researcher">https://quantuxblog.com/be-a-t-shaped-quant-uxr-how-doing-qualitative-research-made-me-a-better-quantitative-ux-researcher</a>. Kitty Xu, cofounder of the Quant UX Conference, explains why Quant UXRs should also engage personally with qualitative research.</p>
<hr />
<h2 id="heading-more-reading">More Reading?</h2>
<p>Each article above has pointers to additional articles or references. Or check out:</p>
<ul>
<li><p>Hundreds of talks in archives of the <a target="_blank" href="https://quantuxcon.org">Quant UX Conference</a> … or <a target="_blank" href="https://events.ringcentral.com/events/quant-ux-con-2025-worldwide"><strong>join us in November</strong></a> 2025!</p>
</li>
<li><p>With my coauthors: the <a target="_blank" href="https://www.amazon.com/Quantitative-User-Experience-Research-Understanding/dp/1484292677">Quant UX book</a>, the <a target="_blank" href="https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155">R book</a>, and/or the <a target="_blank" href="https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194">Python book</a></p>
</li>
<li><p>And I hope you’ll subscribe to this blog, if you haven’t already</p>
</li>
</ul>
<p>Happy reading!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755288272665/03aa908f-dd87-4d51-a2c1-0a74113b6bad.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Friday Break: Book Recs, Science Fiction]]></title><description><![CDATA[This post is for newsletter subscribers, to recommend a few science fiction books. These are all entertaining and thought provoking. Also, each of them engages with themes that are directly relevant today with regard to social structures such as politics...]]></description><link>https://quantuxblog.com/friday-break-book-recs-science-fiction</link><guid isPermaLink="true">https://quantuxblog.com/friday-break-book-recs-science-fiction</guid><category><![CDATA[books]]></category><category><![CDATA[scifi]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Fri, 01 Aug 2025 17:32:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/oMpAz-DN-9I/upload/e0a29ba63fdca4bf4500cef4b008c8fd.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post is for newsletter subscribers, to recommend a few science fiction books. These are all entertaining and thought provoking. Also, each of them engages with <strong>themes that are directly relevant today</strong> with regard to social structures such as politics and economics, and/or AI technologies.</p>
<p>I was fortunate to study writing with the <a target="_blank" href="http://blogs.scientificamerican.com/observations/can-science-fiction-save-the-world">late SF Grand Master, James Gunn</a>. He often made two points: that <strong>science fiction might save the world</strong> (by letting us explore hypothetical futures, so we can avoid them or plan potential responses), and that <strong>all fiction should be entertaining</strong> (otherwise, what is the point?).</p>
<p>To Gunn's points, each of these books is entertaining — and maybe each is helpful to save the world. All of these authors offer more to read if you like them. Several of the titles anchor series with 3+ books.</p>
<p>I’ll also acknowledge the counter-narrative to Gunn’s positive claims for SF: it also contributes to expectations and fantasies — such as those about AI — that are destructive when we absorb them uncritically. (IMO this usually happens when readers don’t understand that SF is not about predicting the future; it’s about exploring themes that are happening <em>now</em>.)</p>
<p>Jemisin's and Scalzi's books are reportedly being filmed as TV series (Corey's and Chiang's already have been filmed). Read the books first!</p>
<hr />
<p><strong>Ted Chiang</strong>. <em>Stories of Your Life and Others</em>. One of the best short story collections I've read. The title story is an astounding look at the limits of human experience. Read it first and then watch the movie adaptation, <em>Arrival</em>. <a target="_blank" href="https://www.amazon.com/Stories-Your-Life-Others-Chiang/dp/1101972122">Amazon</a>, <a target="_blank" href="https://www.powells.com/book/stories-of-your-life-and-others-9781101972120">Powells</a>.</p>
<p><strong>James S.A. Corey</strong>. <em>Leviathan Wakes</em>. First in a series filmed as <em>The Expanse</em>. A great look at a future that feels much more probable than those often portrayed in much of SF, with battling factions, politics, economies, billionaires, and solar system conflict. Read and then watch <em>The Expanse</em>. <a target="_blank" href="https://www.amazon.com/Leviathan-Wakes-James-S-Corey/dp/0316129089">Amazon</a>, <a target="_blank" href="https://www.powells.com/book/leviathan-wakes-9780316129084">Powells</a>.</p>
<p><strong>N.K. Jemisin</strong>. <em>The Fifth Season</em> (The Broken Earth Trilogy). It takes a while — perhaps 100 pages — to "learn how to read" this book. Stick with it! Is it fantasy or science fiction? Either way the entire series deserves its 3-in-a-row Hugo awards. <a target="_blank" href="https://www.amazon.com/Fifth-Season-Broken-Earth/dp/0316229296">Amazon</a>, <a target="_blank" href="https://www.powells.com/book/the-fifth-season-broken-earth-1-9780316229296">Powells</a>.</p>
<p><strong>Stephen Markley</strong>. <em>The Deluge</em>. This title is <em>not</em> genre SciFi; it is literary fiction set in the near future, but it is close enough for me to include it. A sprawling book about climate change, societal breakdown, power, and political action in the near term US, with multiple story lines that converge dramatically. <a target="_blank" href="https://www.amazon.com/Deluge-Stephen-Markley/dp/1982123109">Amazon</a>, <a target="_blank" href="https://www.powells.com/book/deluge-9781982123109">Powells</a>.</p>
<p><strong>John Scalzi</strong>. <em>Old Man's War</em>. This has one of the best and most memorable opening pages of any book. It is military style SF, but even if that's not your thing, you'll likely enjoy it. It is a fun read that maxes out on the factors of fast plot and entertainment in SF. <a target="_blank" href="https://www.amazon.com/Old-Mans-War-John-Scalzi/dp/0765348276">Amazon</a>, <a target="_blank" href="https://www.powells.com/book/old-mans-war-9780765348272/1-19">Powells</a>.</p>
<p><strong>Adrian Tchaikovsky</strong>. <em>Children of Time</em>. Besides being great space opera, this series has outstanding portrayals of non-human cognition. For those interested in AI and AGI, I recommend reading the series and considering what those themes tell us about AI. <a target="_blank" href="https://www.amazon.com/Children-Time-Adrian-Tchaikovsky/dp/0316452505">Amazon</a>, <a target="_blank" href="https://www.powells.com/book/children-of-time-book-1-9780316452502">Powells</a>.</p>
<p><strong>Robert Charles Wilson</strong>. <em>Spin.</em> I love Wilson's books for their portrayal of near-future society. Even more, his writing is especially humane in its focus on people. <em>Spin</em> is a great tale that starts one night when the stars all go dark. <a target="_blank" href="https://www.amazon.com/Spin-1-Robert-Charles-Wilson/dp/1250237513">Amazon</a>, <a target="_blank" href="https://www.powells.com/book/spin-9781250237514">Powells</a>.</p>
<hr />
<h3 id="heading-read-them-already-how-about">Read them already? How about ... ?</h3>
<p>If you like the books above, try some of these authors, according to similarities I see:</p>
<p><strong>Octavia Butler</strong> ... may appeal if you like Wilson or Jemisin. At least read <em>Parable of the Sower</em>.<br /><strong>David Louis Edelman</strong> ... may appeal if you like Tchaikovsky<br /><strong>Greg Egan</strong> ... may appeal if you like Chiang (nb, Egan is sometimes very technical)<br /><strong>Ann Leckie</strong> ... may appeal if you like Corey<br /><strong>Alastair Reynolds</strong> ... may appeal if you like Corey or Tchaikovsky<br /><strong>Kim Stanley Robinson</strong> ... may appeal if you like Wilson. Great mix of hard science and human themes.<br /><strong>Neal Stephenson</strong> ... no near neighbor. Uneven IMO; suggest starting with <em>Snow Crash</em> or <em>Cryptonomicon</em><br /><strong>Peter Watts</strong> ... may appeal if you like Tchaikovsky (although Watts’s worlds are substantially darker)<br /><strong>Martha Wells</strong> ... may appeal if you like Scalzi. The first <em>Murderbot</em> book has been filmed &amp; aired on Apple TV.<br /><strong>Connie Willis</strong> ... may appeal if you like Wilson</p>
<hr />
<h3 id="heading-my-science-fiction-short-list-to-read-soon">My Science Fiction Short List, to Read Soon</h3>
<p><strong>R.F. Kuang</strong>: <em>Babel.</em> Have heard superlatives; saving it for the right time to read.<br /><strong>Arkady Martine</strong>: <em>A Memory Called Empire.</em> Said to be a space opera plus diplomacy SciFi.</p>
<p>Happy reading!</p>
<p>P.S. I've also recommended <a target="_blank" href="https://quantuxblog.com/some-book-recs-literary-fiction">literary fiction</a>. Sometime I'll recommend mysteries, too!</p>
<p><em>Finally</em>, as always, this post was …</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750445067989/393fc6d6-35bd-4a6a-a06e-b22714ae4f30.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Present at Quant UX Con 2025!]]></title><description><![CDATA[Today I’m encouraging all readers of this newsletter / blog: send in an idea for Quant UX Con. I’m the chair of Quant UX Con and I’d like to demystify the Call for Presentations!
First, if you’re wondering whether you can do it … you can! Quant UX Co...]]></description><link>https://quantuxblog.com/present-at-quant-ux-con-2025</link><guid isPermaLink="true">https://quantuxblog.com/present-at-quant-ux-con-2025</guid><category><![CDATA[quantux]]></category><category><![CDATA[conference]]></category><category><![CDATA[uxresearch]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Fri, 27 Jun 2025 15:48:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1751037631761/f51e2298-a8f2-4492-b535-ee614c39ae8f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today I’m encouraging all readers of this newsletter / blog: <strong>send in an idea for Quant UX Con</strong>. I’m the chair of <a target="_blank" href="http://quantuxcon.org">Quant UX Con</a> and I’d like to demystify the Call for Presentations!</p>
<p>First, <strong>if you’re wondering whether you can do it … you can!</strong> Quant UX Con is intended to be <em>useful</em>, and does not require that presentations are academic “contributions”, or theoretical, or advanced methods, or anything like that. As long as a presentation is something of interest to UX Researchers who use quant methods, it fits.</p>
<p>Second, <strong>the Quant UX Con audience is friendly</strong> and welcoming, and new presenters are especially welcomed! And you can present from anywhere worldwide, around the clock in any time zone. Or join live at a watch party location.</p>
<p>Third, if you’re thinking, “my company doesn’t care about publishing, it won’t help me” … I can say this: <strong>presenting will help you</strong>! Maybe it won’t help in your current job, but it <em>will</em> help your confidence, learning, and meeting people, and it will quite possibly help with your next job. (My own publishing was quite important when I made two job changes in my career, to Google and to Amazon!)</p>
<hr />
<h2 id="heading-i-dont-have-anything-to-present">“I don’t have anything to present!”</h2>
<p>Ask yourself this question: <strong>what is something I learned in the past two years, that I wish I had known before?</strong> If you can answer that, then you probably have a presentation that will interest others!</p>
<p>Your talk might walk through a method, or give a literature review, or a case study, or a career reflection, or put together a panel discussion, or a code walkthrough … all of those are welcomed! You can find past talks here: <a target="_blank" href="https://www.quantuxcon.org/quant-ux-con/past-conferences">https://www.quantuxcon.org/quant-ux-con/past-conferences</a></p>
<p><em>Want more specific ideas?</em> Here are some ideas I would personally like to see, off the top of my head:</p>
<ul>
<li><p>A walkthrough of a favorite but little-known R (or Python) package</p>
</li>
<li><p>How to set up a local SQL server and IDE, so you can learn SQL on your own</p>
</li>
<li><p>Scanning old PDFs to get data from them</p>
</li>
<li><p>How we tackled [difficult problem X] and the lessons we learned</p>
</li>
<li><p>Getting started with GitHub: how and why to set up your own code repo</p>
</li>
<li><p>Integrated reporting with Quarto: creating reports with reproducible code</p>
</li>
<li><p>Why Bayes?</p>
</li>
<li><p>Connecting Excel to R — using the power of R directly from Excel</p>
</li>
<li><p>Kill the chart clutter — how to make cleaner, better, more useful charts</p>
</li>
<li><p>Round table on hiring: 4 hiring managers [or 4 candidates] discuss the UXR job market today</p>
</li>
<li><p>Keeping up with the bots, addressing survey panel issues today</p>
</li>
<li><p>An alternative to Powerpoint: making slides in Beamer / R markdown / Quarto …</p>
</li>
<li><p>Put it in a notebook! Boost your collaboration with Jupyter (or Colab, or Quarto, etc.)</p>
</li>
</ul>
<p>... and of course, traditional research methods discussions, innovations, think pieces, or specific research findings.</p>
<hr />
<h2 id="heading-do-it-now-but-wait-i-dont-have-time">Do It Now … “But wait, I don’t have time!”</h2>
<p>Yes, the CFP is due on Tuesday July 1 … but that is doable! Proposals require only an abstract with these pieces:</p>
<ul>
<li><p><strong>Summary</strong>: 2-5 paragraph abstract that describes the presentation</p>
</li>
<li><p><strong>Audience</strong>: who will be interested?</p>
</li>
<li><p><strong>Goal / takeaways</strong>: what will the audience get from the talk?</p>
</li>
</ul>
<p>Once you have a concept, I’d estimate the CFP itself might take 30-45 minutes. Perhaps 60 or 120 if you really want to polish it. (FWIW, I timed myself and wrote up my own very complete proposal, directly in the CFP form, in 20 minutes.)</p>
<p>And <em>you’ll hear it here first</em>: <strong>we might extend the CFP by a few days</strong>. That has <em>not</em> been decided, so don’t count on it — and go ahead and start. But if it takes you slightly longer, you may be OK! Or go ahead and submit now, and then update it after you polish it.</p>
<p>Or worst case, write it up now and save it for next year :) But better, send it in!</p>
<hr />
<h2 id="heading-how">How?</h2>
<p>Find the CFP form at: <a target="_blank" href="https://app.oxfordabstracts.com/stages/78930/submitter">https://app.oxfordabstracts.com/stages/78930/submitter</a></p>
<p>Thanks for considering it — we really do hope you’ll share ideas, and of course, please <a target="_blank" href="http://quantuxcon.org">join us in November</a>!</p>
]]></content:encoded></item><item><title><![CDATA[Synthetic Survey Data? It's Not Data]]></title><description><![CDATA[[I write this reluctantly, because (1) AI is split into factions and this post may not change any minds; (2) it may upset some well-intentioned researchers. On the other hand, I’ve been asked and hope it is helpful.]
There is widespread discussion of...]]></description><link>https://quantuxblog.com/synthetic-survey-data-its-not-data</link><guid isPermaLink="true">https://quantuxblog.com/synthetic-survey-data-its-not-data</guid><category><![CDATA[survey]]></category><category><![CDATA[quantux]]></category><category><![CDATA[uxresearch]]></category><category><![CDATA[synthetic data]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Wed, 18 Jun 2025 15:47:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/IEiAmhXehwE/upload/c94d1fad8398dba59aa3fe191c6e1bc2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>[I write this reluctantly, because (1) AI is split into factions and this post may not change any minds; (2) it may upset some well-intentioned researchers.</em> <em>On the other hand, I’ve been asked and hope it is helpful.]</em></p>
<p>There is widespread discussion of LLM-generated synthetic survey data and its utility. Some survey panel providers promote synthetic data as being fast and cheap and, they claim, as a way to “boost the prevalence” of infrequent respondent groups.</p>
<p>In this post I share my point of view on synthetic survey data — not because I want to wade into any AI wars, but because the topic arises repeatedly and colleagues have <em>asked</em> me to write a post.</p>
<p>Among other things, I argue in this article that:</p>
<ul>
<li><p>the concept of synthetic data is <strong>logically flawed</strong></p>
</li>
<li><p>synthetic data <strong>fails empirically</strong> to emulate human data</p>
</li>
<li><p>synthetic data <strong>cannot overcome sampling limitations</strong></p>
</li>
</ul>
<p>I include some references but this is not a comprehensive review. It only compiles my thoughts, in varying degrees of completeness, plus a few pointers to other folks’ work. You may agree with some arguments more than others — and that’s fine, I only hope you agree with <em>something</em> :)</p>
<blockquote>
<p><strong>Side note</strong>: this post discusses the utility of synthetic data, but that is only one consideration. I’ve described elsewhere why it is important to <a target="_blank" href="https://quantuxblog.com/four-areas-of-uxr-thinking-about-ai-llms">consider ethics, externalities, aesthetics, and social structures</a> alongside the utility of AI. However, proponents of synthetic data discuss it in terms of utility, so that’s where I focus discussion in this article.</p>
</blockquote>
<hr />
<h2 id="heading-the-concept-of-synthetic-data-is-logically-flawed">The Concept of Synthetic Data is Logically Flawed</h2>
<p>I see three fundamental <em>logical</em> problems with LLM synthetic survey data:</p>
<ul>
<li><strong>It rests on the common, yet I believe incorrect, assumption that surveys are about sampling a “true state of affairs”</strong> in the world. That view is imported from classical psychometrics where it is a simplifying assumption. But it is not what surveys do in the real world. As <a target="_blank" href="https://quantuxblog.com/surveys-and-the-true-score-mistake">I have explained elsewhere</a>, <em>surveys are a form of motivated communication</em> and they must be interpreted as such. They cannot be viewed as measurements of any particular “reality” that is accessible or meaningful apart from considering motivation.</li>
</ul>
<p>Thus, unless we are interested in the motivations of LLM systems, their data logically cannot replace human data. It doesn’t even matter whether their answers might in some way be “the same” … because <strong>the point of surveys is not to measure “the answer” in the first place</strong>. The point is to listen to people, and that requires … well, listening to people.</p>
<ul>
<li>The second logical problem is temporal: <strong>LLMs are trained on past data, whereas the goal of a survey should be to listen to people now</strong>. Even if an LLM has historical data aligning with our question, it is outdated as soon as it has been trained. Thus, synthetic data has no determinable relationship to what we want to know <em>now</em>. It might or might not be relevant; we don’t know.</li>
</ul>
<p>You might object, “what if I want data about something that other researchers have asked in the past?” That’s fine … but <strong>if those data already exist, you don’t need to go through the convoluted process of writing a survey that you hope will align with those data, and then subject that survey to an LLM</strong> that you hope will use the data to give responses to your survey, which you then hope will recreate the data. Instead, simply Google your question and access existing data directly.</p>
<ul>
<li>The third problem concerns the domain of inquiry: <strong>the space of potential business questions is infinite, but existing data is finite</strong>. Thus we may expect that most questions (the infinite space) have not been answered … especially when we are working in a new product area. Although LLMs create novel data on demand, there is no logical reason to expect that their statistical models will infer any particular novel truth — and, indeed, LLMs are not even designed to represent truth <em>within</em> their training sets. [<em>Mathematical note</em>: there are <a target="_blank" href="https://www.scientificamerican.com/article/strange-but-true-infinity-comes-in-different-sizes/">infinities of differing sizes</a>. Even if an LLM can infer within one infinity of data, it doesn’t necessarily infer within all infinities of data. I’ll set that aside.]</li>
</ul>
<p>To summarize: there is no need to do research when the answer to your question already exists. But <strong>you can’t find out whether an answer to your question exists by asking an LLM to create an answer</strong>. You need to find an actual data set or else collect new data.</p>
<hr />
<h2 id="heading-synthetic-data-fails-empirically-yet-thats-not-the-right-question">Synthetic Data Fails Empirically (yet that’s not the right question)</h2>
<p>There have been various <em>empirical</em> studies assessing whether LLM synthetic data aligns with real data. I believe that is somewhat of a wrong question, which I’ll explain later. Meanwhile, <strong>empirical results contradict the claim that synthetic data is similar to human data</strong>. Here are a few examples.</p>
<ul>
<li><p>Bisbee et al (2024) demonstrated that <strong>ChatGPT survey results are unstable and are not representative of human survey answers</strong>. They found that, “sampling by ChatGPT is not reliable for statistical inference … [Also] the distribution of synthetic responses varies with minor changes in prompt wording, and … the same prompt yields significantly different results over a 3-month period.”</p>
</li>
<li><p>Paxton &amp; Yang (2024) found that <strong>LLMs and humans report strongly differing “attitudes” about technology products</strong>. They found that “language model responses diverge from human responses—often dramatically … [These] divergent results are robust to multiple prompt variations, model families (Gemini, GPT, etc.), and major updates to the models … [Therefore] language model responses should not be used to replace or augment human survey responses at this point in time.”</p>
</li>
</ul>
<p>The following table is one snapshot from Paxton &amp; Yang, assessing the correlations in attitudinal ratings obtained from ChatGPT, Google Gemini, and human raters. Among 8 attitudes, it was only for ratings of “helpfulness” that the correlation exceeded <em>r</em> = 0.20; and even then, it was only <em>r</em> = 0.28-0.30, a weak correlation. The median correlation between LLMs and human attitudes was <em>r</em> = 0.10. This means that <strong>LLMs reproduced a median of only 1% (r²) of the pattern of attitudinal ratings of human respondents</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749685971994/ef04646f-0974-4a94-a4ac-a4f204fdb0a5.png" alt="A correlation matrix from Paxton &amp; Yang, 2024, comparing the emotional valence of ratings obtained from human respondents, Gemini LLM, and ChatGPT LLM. Across all 8 dimensions, the correlation between LLM and human ratings was low, in no case exceeding r = 0.3, and with a median agreement of r=0.10. By contrast, the two LLM models had higher agreement with one another (median r=0.19 but high variance, ranging r=-0.04-0.98)." class="image--center mx-auto" /></p>
<ul>
<li>Samoylov (2024) noted multiple problems with LLM-created data, especially that <strong>LLM results vary dramatically, across many dimensions, in response to relatively simple rephrasing of prompts</strong>. Similar to my point above, he further noted that it is impossible to know when one’s domain of interest is covered by LLM training data. He wrote, “<strong>how [a prompt] was worded massively affected the results. This is a demonstration of the test-retest unreliability of using LLM-generated responses</strong> … because you do not know what most of the LLMs were trained on, you do not know what kind of knowledge they encode … this one observation is enough to make anyone interested in getting real data look the other way”.</li>
</ul>
<blockquote>
<p>An especially nice feature of Samoylov’s article is the inclusion of <strong>R code to demonstrate the unreliability of LLM synthetic data</strong>. You can update his prompts and examine for yourself whether, in your domain, the answers from an LLM are reliable. The code is at the end of <a target="_blank" href="https://conjointly.com/blog/synthetic-respondents-are-the-homeopathy-of-market-research/">his article</a>.</p>
</blockquote>
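<p>Samoylov’s article includes the actual LLM-calling code; as a rough illustration of the underlying idea, here is a minimal sketch of a test-retest check that could be applied to any two batches of responses, however they were generated. The vectors <code>run1</code> and <code>run2</code> are hypothetical stand-ins for the answer-option shares from two runs of the same item with a trivially reworded prompt.</p>
<pre><code class="lang-r"># hypothetical example: share of responses choosing each option of a 5-point item,
# from two runs that differ only in minor prompt wording
run1 &lt;- c(opt1 = 0.05, opt2 = 0.10, opt3 = 0.20, opt4 = 0.40, opt5 = 0.25)
run2 &lt;- c(opt1 = 0.15, opt2 = 0.30, opt3 = 0.25, opt4 = 0.20, opt5 = 0.10)

# test-retest agreement: reliable data should give near-identical distributions
cor(run1, run2)          # correlation of the two response distributions
max(abs(run1 - run2))    # largest absolute shift in any option's share
</code></pre>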
<p>In short, these <strong>empirical studies demonstrate that</strong>:</p>
<ul>
<li><p><strong>Results vary from LLM model to model, time to time, and prompt to prompt</strong>.<br />  <em>Implication</em>: synthetic data are not reliable in the way we expect data to be reliable.</p>
</li>
<li><p><strong>Results do not agree with human responses</strong> or the patterns of human responses.<br />  <em>Implication</em>: synthetic data do not have construct validity to mimic human responses in the way they claim. (<a target="_blank" href="https://quantuxblog.com/convergent-and-discriminant-validity">Read more about construct validity here</a>.)</p>
</li>
</ul>
<p>In the post so far, we’ve seen: (1) that <strong>the concept of synthetic data misunderstands how surveys work</strong> as motivated communication so the data don’t make sense logically; and, (2) even if we set that problem aside, the <strong>resulting data empirically do not align with human survey data</strong>. Next, I’ll examine why empirical claims about synthetic data are unscientific in any case.</p>
<hr />
<h2 id="heading-yet-empirical-evaluation-is-not-the-right-question">… Yet empirical evaluation is not the right question</h2>
<p><strong><em>[tl/dr; This is a long sub-section, but I hope it will be worth your time to consider the logic here!]</em></strong></p>
<p>Although the empirical work above is admirable, it reflects a <strong>whack-a-mole strategy</strong>. Empirical investigations have no particular theoretical justification; instead, they respond to claims by AI proponents. Such proponents claim synthetic data “works” and demand proof otherwise.</p>
<p><em>Let’s assume the opposite of the previous empirical results for a moment. Let’s suppose — for the sake of argument — that synthetic responses to survey items</em> <strong><em>are</em></strong> <em>expected to be universally reliable and valid.</em> What would that mean? It would imply that an LLM could, in principle, be expected to answer any question, on any survey, similarly to how a human would answer.</p>
<p>To make that more explicit, let’s look again at the general claims for synthetic data. Suppose we want to know about the overall population likelihood that people will purchase our new product. A traditional survey of purchasing intent will:</p>
<ul>
<li><p><strong>Target a group</strong> of people and ask about each person’s <strong>intention</strong> to purchase our product</p>
</li>
<li><p>Expect that the intention will be somewhat <strong>indicative of future behavior</strong></p>
</li>
<li><p>Sample a population so we can <strong>estimate the aggregate behavior</strong> of the targeted group</p>
</li>
</ul>
<p>The claim for synthetic data is the same: <strong>that we can use synthetic responses to “sample” a group and estimate its aggregate intentions or behavior</strong>. For example, we might estimate whether synthetic “purchasers” are likely to purchase our product. And then we project that estimate to humans.</p>
<p>When we think about that for a moment, we might notice two things. First, <strong>it seems highly implausible.</strong> If a data provider has an LLM with such capability — to credibly answer any survey question about future behavior, similarly to how humans would answer — why would they use it to sell survey panel responses?</p>
<p>Imagine, as one example, that we can accurately assess the intentions of executives and stock traders. For instance, we could envision survey items about their intended behaviors, such as these:</p>
<ul>
<li><p>Tomorrow, will you sell XYZ stock?</p>
</li>
<li><p>Today, will you increase your hedge of ABC currency?</p>
</li>
<li><p>In the next hour, will you short UVW stock?</p>
</li>
</ul>
<p>Just to be clear, <strong>these questions do not ask for <em>predictions</em> about future stock prices</strong> or the like. They ask only about a person’s <strong><em>intended behavior</em></strong>. In exactly the same way we might survey consumers about intentions to purchase a product, we could ask traders about intentions to purchase stocks, commodities, or currency. We can use the aggregate of those responses to infer likely population behavior.</p>
<p>If we could accurately assess such behavioral intentions, we could use that aggregate information quite profitably to predict stock, currency, or commodity futures — and <strong>that would be a much better business than selling survey responses</strong>. It is easy to imagine similar examples of valuable data in the domains of politics, medicine, pharmaceuticals, military affairs, logistics, shipping, and the like. If we could use synthetic data to estimate intended behaviors of groups of people in those domains, it would have value that far exceeds that of selling survey panel data.</p>
<blockquote>
<p><strong>“But wait!” : Side note about such a hypothetical survey</strong></p>
<p>Let’s set aside LLMs for a moment. Do you object that executives and traders would not answer such items on a survey? That they would not report their intended trades? Or that they would not answer honestly? Do you worry that any answers they give wouldn’t predict real behavior?</p>
<p>Good! If you think that, your views may align with interpreting surveys as being about motivated communication, and not as about sampling any particular “reality” apart from that. And that implies that the core premise of synthetic data is illogical, as I noted at the outset of this post (you might return to the top of this post and re-read my discussion; or see <a target="_blank" href="https://quantuxblog.com/surveys-and-the-true-score-mistake">this separate post</a> for more.)</p>
</blockquote>
<p>For this first point, my inference is this: <strong>synthetic data providers do not believe that their systems can do anything like this</strong> or that their data have such value. Instead, they sell such data <em>because</em> it has low value. Besides their own behavior, they reveal their belief in the value of such data through their pricing, and in their statements about how cheap it is.</p>
<blockquote>
<p>Now, maybe you’re thinking, “I understand the argument, but perhaps providers do use the LLMs profitably in the way you describe, and they do so secretly while they also sell survey responses as a second business.” That is logically possible, yet: (1) why would they give away a competitive advantage of that kind and let others match what they are doing? (2) where are the case studies to prove such capability? (3) why is their pricing so low? Overall, this possibility seems extremely unlikely to be the case, for all the reasons outlined here.</p>
</blockquote>
<p>Second, the claim that any survey question might credibly be answered by an LLM is <strong>not a claim that can be tested empirically</strong>. How would one go about testing that? One would need to define the space of all possible survey questions, determine a sampling strategy for that space, and determine a way to find the “ground truth” of human responses across that entire space to compare to LLM responses. And one would need to show that the infinity of possible spaces maps to a majority of its constituent subspaces.</p>
<p>I have no idea how one could do that in a general way. Therefore, the question of empirically assessing LLMs is necessarily limited to specific domains (as in each of the examples above) and cannot be evaluated generally.</p>
<p>Consider, third, that when we evaluate the reliability of synthetic data in any particular domain, an LLM proponent could always claim “that is just one example.” This shows that <strong>an empirical assessment strategy cannot answer the general claim</strong> about answering survey items, because that general claim is not an empirical claim. Instead, it is a belief disguised as a claim.</p>
<p>In summary, <strong>the “hypothesis” that is supposedly being tested in empirical evaluation of synthetic data — namely, that it can replace human data — is a <em>marketing</em> <em>claim</em> and not a scientific hypothesis</strong>.</p>
<p>Next we’ll look at another marketing claim: that synthetic data can replace hard-to-reach subgroups of respondents.</p>
<hr />
<h2 id="heading-synthetic-data-cannot-overcome-sampling-limitations">Synthetic Data Cannot Overcome Sampling Limitations</h2>
<p>LLM-generated data <strong>providers postulate that synthetic data can emulate “hard to reach audiences”</strong> (<a target="_blank" href="https://www.qualtrics.com/edge/">example</a>). The term “hard to reach” is rarely defined by providers. Instead, they use vague references to imprecise notions such as “increasing the diversity” of respondents.</p>
<p>From discussions I’ve had, this is often interpreted in two ways. First, it may mean <em>niche audiences</em> who have low prevalence in panels (corporate executives, physicians, developers, users of one specific product, etc). Second, it may mean historically <em>marginalized respondents</em> such as ethnic minority groups, people with disabilities, different language groups, and others.</p>
<p>In either case, the premise is that <strong>these are people who do not answer our surveys as often as we researchers would like them to</strong>. (<em>Remember my argument that surveys reflect motivated communication? This premise also aligns with that view</em>.) Platforms claim that they can “boost” the responses for such people using synthetic data that either supplements or completely replaces real data.</p>
<p>Unfortunately, <strong>there is no reason to believe that LLMs can accomplish the goal of representing difficult-to-reach audiences, and there are several reasons to expect otherwise</strong>:</p>
<ul>
<li><p>We saw above that LLM data does not agree with general population responses. <strong>Why would an LLM give better data for subgroups</strong> than it does for larger populations? There is no logical or statistical reason to expect that.</p>
</li>
<li><p>We know that LLM training data over-represents some groups (English speaking, white, educated, western, technology-interested, affluent) and under-represents other groups (non-English speaking, other than white, non-western, etc.). There are algorithmic approaches to reduce bias, but — even if a provider has implemented some of them — <strong>how would we know that bias has been eliminated for some potentially unique group of interest that we want to sample?</strong> There is no a priori bias reduction that works in advance, for targeted samples that will only be specified later. (This is similar to the empirical questions above; it cannot be answered in a general way but only on a per-domain basis.)</p>
</li>
<li><p>Even when subgroup bias has been addressed in some way, the approaches are fragile and non-generalizable. Ferrara (2024) has written of this as the “butterfly effect,” that <strong>small changes to algorithms, training data, or prompts can lead to substantial changes in the output of LLM models that magnify the effect of biases</strong>. What does that imply? Even if an LLM system generates reliable synthetic data that overcomes biases at one point in time and for one group, we have no reason to expect that it will do so at a later time, or with a different group, or after an algorithm update.</p>
</li>
</ul>
<p>In short, <strong>the claim that LLM synthetic data can amplify otherwise underrepresented groups appears to be highly unlikely as a statistical matter, and is impossible to prove</strong> as a general expectation.</p>
<p>There is another sampling problem that I’ve <a target="_blank" href="https://quantuxblog.com/research-concerns-for-llm-applications">written about separately</a>: the basic concepts and statistics of sampling do not apply to LLMs. This is additional to the concerns in this post, and it means that, besides the concerns here, <strong>any “sample” from an LLM has no particular statistical meaning</strong>. It is, in fact, not a sample at all but output of a mostly uncharacterized stochastic process. (This implies, for example, that we cannot place meaningful confidence intervals on statistics from synthetic data; the estimates certainly are <em>not</em> exact values, yet we have no way to assess our degree of sampling certainty about them.)</p>
<hr />
<h2 id="heading-side-note-questions-for-research-ethics">Side Note: Questions for Research Ethics</h2>
<p>Although I don’t have space here to consider all aspects of research ethics, there is a core question that I encourage every researcher to ask: does synthetic data meet ethical obligations to society and ourselves?</p>
<p>Various governmental and professional organizations outline ethical requirements for research. These include legal definitions such as the US Code of Federal Regulations (<a target="_blank" href="https://www.ecfr.gov/current/title-42/section-93.234">42 CFR 93.234</a>) and professional standards such as the <a target="_blank" href="https://www.apa.org/ethics/code">Ethics Code</a> of the American Psychological Association (2017). The legal definition in US 42 CFR is relatively typical:</p>
<blockquote>
<p>Research misconduct means fabrication, falsification, or plagiarism in proposing, performing, or reviewing research, or in reporting research results.</p>
</blockquote>
<p>So the question is: <strong>are synthetic data fabricated, falsified, or plagiarized</strong>?</p>
<p>I don’t propose an answer here because the set of considerations — from models of ethics such as deontology vs. consequentialism, to the definition of individual words and the functioning of specific LLM systems with respect to plagiarism (e.g., <a target="_blank" href="https://www.theatlantic.com/technology/archive/2025/03/search-libgen-data-set/682094/">Reisner, 2025</a>) — is too large to tackle in this post.</p>
<p>Instead, I believe that the question is important (and sometimes mandated) for each of us to consider and answer for ourselves. We should be confident that our research aligns with ethical requirements!</p>
<hr />
<h2 id="heading-other-uses-for-synthetic-data-unfortunately-unconvincing">Other Uses for Synthetic Data: Unfortunately Unconvincing</h2>
<p>Here are four common claims for synthetic data other than reporting it, with my brief rebuttals.</p>
<ul>
<li><p><strong>Synthetic data accelerates research</strong>. <em>Rebuttal</em>: synthetic data is not data; using it is not research.</p>
</li>
<li><p><strong>Synthetic data can pre-test a survey and analyses</strong>. <em>Rebuttal</em>: random responses, as many platforms provide, are preferable. Random responses are free of assumptions and correlated patterns, giving more comprehensive and unbiased tests. Random data can also be used to <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-4-inspecting-model-fit-with-rlh">help assess data validity</a>. (A brief sketch of this approach appears after this list.)</p>
</li>
<li><p><strong>Synthetic data can preview expected results with stakeholders</strong>. <em>Rebuttal</em>: such data doesn’t preview anything. Better — and much less risky — is for a researcher to use a combination of domain knowledge and stakeholder engagement to create a few scenarios reflecting potential outcomes. We can discuss those with high specificity and relevance, without fabricating data.</p>
</li>
<li><p><strong>Other colleagues will use such data, but I can use it more carefully</strong>. <em>Rebuttal</em>: they shouldn’t use it, either. We can’t let research ethics and practice be defined by what non-researchers might do.</p>
</li>
</ul>
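<p>As an illustration of the random-response alternative, here is a minimal sketch in base R. It generates uniformly random answers for a hypothetical 10-item, 5-point survey, which can then be run through the planned analysis code as a dry run; the sample size and item names are arbitrary placeholders.</p>
<pre><code class="lang-r">set.seed(98101)
n.resp  &lt;- 200      # arbitrary pre-test sample size
n.items &lt;- 10       # hypothetical 10-item, 5-point survey

# uniformly random responses: no assumptions and no built-in correlations
pretest &lt;- data.frame(
  matrix(sample(1:5, n.resp * n.items, replace = TRUE),
         nrow = n.resp,
         dimnames = list(NULL, paste0("item", 1:n.items))))

# dry-run the planned analysis, e.g., item summaries and a correlation matrix
summary(pretest)
round(cor(pretest), 2)   # near-zero correlations, by construction
</code></pre>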
<p>Overall, <strong>I don’t view any of these claims as a good reason to use synthetic data</strong>. Let’s consider the benefits and risks. On the benefits side: synthetic data shows low value and there are preferable alternatives. On the risks side: creating synthetic data could lead to it being accidentally used, reported, or demanded by executives. In my opinion, the low benefits are strongly outweighed by the substantial risk.</p>
<p>Although it is great to pre-test a survey and its analyses, and it’s also great to preview potential results with stakeholders, those goals do not require synthetic data and are accomplished better without it.</p>
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>My personal conclusion — as a matter of logic, empirical findings, statistical reasoning, and scientific principles — is that <strong>synthetic data has no place in survey research</strong>. I also believe the purported use cases for synthetic data are unconvincing; alternative approaches are superior in their results and are less risky in practice. And there are important questions about research ethics that each of us should consider.</p>
<blockquote>
<p>Disagree? Publish your reasoning and results! (For a venue, check out <a target="_blank" href="https://quantuxcon.org">quantuxcon.org</a>)</p>
</blockquote>
<p>All of this poses a question: why is there so much interest in synthetic data? Samoylov (2024) argues that it reflects a “snake oil” industry that is intent on selling products to naive customers.</p>
<p>For my part, <strong>I see the hope placed in synthetic data as a form of magical thinking</strong>. It is certainly appealing to believe that a data genie can magically create the data I need! And it is even more appealing to believe that it can free me from the difficulties of collecting real data, while getting results faster.</p>
<p>But more likely, we can’t escape the real work of survey research: collecting good data that informs unique decisions. On a happy note, <strong>actual research — which is to say, learning from people — is not only informative about the real world, it is also an enjoyable and rewarding enterprise</strong>.</p>
<p>And that process of learning from people will always deliver enough value to exist!</p>
<hr />
<h2 id="heading-references">References</h2>
<p>American Psychological Association (2017). Ethical Principles of Psychologists and Code of Conduct. At <a target="_blank" href="https://www.apa.org/ethics/code">https://www.apa.org/ethics/code</a></p>
<p>Bisbee J, Clinton JD, Dorff C, Kenkel B, Larson JM. Synthetic Replacements for Human Survey Data? The Perils of Large Language Models. <em>Political Analysis</em>. 2024; 32(4):401-416. doi:10.1017/pan.2024.5</p>
<p>Ferrara E (2024). The Butterfly Effect in artificial intelligence systems: Implications for AI bias and fairness. <em>Machine Learning with Applications</em>, Volume 15. <a target="_blank" href="https://doi.org/10.1016/j.mlwa.2024.100525">https://doi.org/10.1016/j.mlwa.2024.100525</a>. At <a target="_blank" href="https://www.sciencedirect.com/science/article/pii/S266682702400001X">https://www.sciencedirect.com/science/article/pii/S266682702400001X</a>.</p>
<p>Paxton J, Yang Y. (2024). “Do LLMs simulate human attitudes about technology products?” In <em>Proceedings of the 2024 Quantitative User Experience Conference</em>. At <a target="_blank" href="https://drive.google.com/file/d/16F_JZv4eHNiDMJT6BT7F6m97C2rBX8-7/view?usp=sharing">https://drive.google.com/file/d/16F_JZv4eHNiDMJT6BT7F6m97C2rBX8-7/view?usp=sharing</a></p>
<p>Reisner A (2025). “Search LibGen, the Pirated-Books Database That Meta Used to Train AI”. <em>The Atlantic</em>, at <a target="_blank" href="https://www.theatlantic.com/technology/archive/2025/03/search-libgen-data-set/682094/">https://www.theatlantic.com/technology/archive/2025/03/search-libgen-data-set/682094/</a></p>
<p>Samoylov N (2024). Synthetic respondents are the homeopathy of market research. At <a target="_blank" href="https://conjointly.com/blog/synthetic-respondents-are-the-homeopathy-of-market-research/">https://conjointly.com/blog/synthetic-respondents-are-the-homeopathy-of-market-research/</a></p>
<p>US Code of Federal Regulations (2024). Research misconduct. 42 CFR 93.234. <a target="_blank" href="https://www.ecfr.gov/current/title-42/section-93.234">https://www.ecfr.gov/current/title-42/section-93.234</a></p>
<hr />
<p><em>Finally</em>, as always, this post was …</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749920743009/ba0286e7-fcad-403c-b6af-68878cb6b090.png" alt="written by a human, not by AI" class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Community Scheduling: 
An Example of TURF Analysis]]></title><description><![CDATA[Today I’ll take a brief look at TURF analysis and show a simple example.
At a high level, TURF answers this question: after we do the #1 best thing — whether that is a product, message, placement, etc. — that customers desire most, what should we do ...]]></description><link>https://quantuxblog.com/community-scheduling-an-example-of-turf-analysis</link><guid isPermaLink="true">https://quantuxblog.com/community-scheduling-an-example-of-turf-analysis</guid><category><![CDATA[quantux]]></category><category><![CDATA[mrx]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Mon, 12 May 2025 22:54:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746910061344/3fc0b335-4837-443b-9751-ab7d2a8d9b7f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today I’ll take a brief look at TURF analysis and show a simple example.</p>
<p>At a high level, TURF answers this question: after we do the #1 best thing — whether that is a product, message, placement, etc. — that customers desire most, what should we do next to reach <em>additional</em> customers, beyond those who want the #1 thing? It is not necessarily the #2 thing from our list.</p>
<p><strong>TURF</strong> stands for “<strong>total unduplicated reach and frequency</strong>” (Miaoulis et al, 1990). That means that it finds combinations of items that will <em>reach</em> the largest <em>total</em> number of people, while having as few <em>duplications</em> [multiple exposures per person] as possible. And, secondarily, it will maximize the <em>frequency</em> — average number of reaches per person — <em>after</em> reaching the most unique people. (A minimal code sketch of this idea follows the example below.)</p>
<blockquote>
<p>A classic TURF example is to reach as many unique viewers as possible with advertisements. If you ignore the question of pricing, you would want first of all to place an ad into the most popular channel (TV show, magazine, etc.). That would have the largest audience (“reach”) for a single placement. But after that placement, what is the best second choice? Is it the 2nd most popular channel? Not necessarily, because the second most popular channel may have high overlap with the first channel. For instance, visitors to the most popular website also tend to visit many of the other most popular websites; they are heavy internet users across many sites.</p>
<p>If you just go by ranked popularity, you would often be advertising to — reaching — the same people again. It may be better to go farther down the list of popularity and find a channel that is smaller but that reaches a unique audience, relative to the #1 channel.</p>
</blockquote>
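<p>Before getting to the scheduling data, here is a minimal sketch of that core TURF idea on a toy availability matrix (rows are people, columns are candidate items; 1 = reached). A simple greedy pass adds, at each step, the item that reaches the most not-yet-reached people. This is only an illustration of the concept, not the full analysis used later in this post.</p>
<pre><code class="lang-r"># toy 0/1 matrix: 6 people (rows) x 4 candidate items (columns)
avail &lt;- rbind(c(1, 1, 0, 0),
               c(1, 1, 0, 0),
               c(1, 0, 0, 0),
               c(0, 1, 0, 1),
               c(0, 0, 1, 0),
               c(0, 0, 1, 0))
colnames(avail) &lt;- paste0("item", 1:4)

# greedy TURF: repeatedly add the item that reaches the most new people
chosen  &lt;- character(0)
reached &lt;- rep(FALSE, nrow(avail))
for (i in 1:2) {                                   # choose a set of 2 items
  new.reach &lt;- colSums(avail[!reached, , drop = FALSE] == 1)
  best      &lt;- names(which.max(new.reach[setdiff(colnames(avail), chosen)]))
  chosen    &lt;- c(chosen, best)
  reached   &lt;- reached | (avail[, best] == 1)
}
chosen          # item1 first, then item3 (not item2, which largely duplicates item1)
sum(reached)    # total unduplicated reach of the chosen set (5 of 6 people)
</code></pre>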
<p>My example is this: <strong>when scheduling practice times for an</strong> <a target="_blank" href="https://quantuxblog.com/announcing-tech-community-zen"><strong>online Zen group</strong></a><strong>, what is the best <em>set</em> of times that will reach the most people</strong>, so at least one time in the set will work for as many people as possible? Respondents took a survey and said when they are available. Given that, I want to find a small number of times that will make practice available to as many unique people as possible. TURF is a way to do that.</p>
<p>In this post, I share a data set — after protecting privacy as noted below — and R code for TURF analysis. In this data set, it would be possible to do TURF by hand. However, it is a great example for R analysis, and would easily scale to larger data sets. As always, I share R code along the way and compile it at the end.</p>
<hr />
<h2 id="heading-the-data">The Data</h2>
<p>I fielded a <a target="_blank" href="https://surveys.sawtoothsoftware.com/67c8902733613c9ad1b9818e">survey</a> that asks respondents what times they are available on weekdays. Each respondent answered for their local time zone, which was later converted to standard time. (BTW, I also asked about <em>weekends</em>, and preferred <em>days</em> of the week. But for this post, we’ll only look at weekday times.) The results are a grid of availability by respondent, adjusted for time zone, mapped to 24 hours of the day.</p>
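<p>As a side note on that time-zone step, here is a tiny, hypothetical sketch of how one local 24-hour availability vector can be shifted to a common reference zone with modular arithmetic; the offset and hours are made up for illustration.</p>
<pre><code class="lang-r"># hypothetical example: shift a 24-hour availability vector from local time
# (UTC-7 in this example) to UTC, using modular arithmetic
avail.local &lt;- rep(0, 24)
avail.local[18:20] &lt;- 1            # available during the 5, 6, and 7 pm hours, local time

offset    &lt;- -7                    # local time = UTC + offset
avail.utc &lt;- avail.local[((0:23 + offset) %% 24) + 1]

which(avail.utc == 1) - 1          # the same availability expressed as UTC hours (0, 1, 2)
</code></pre>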
<p>Unlike many of my surveys, these data are private. However, I have created <strong>a “permuted,” simulated data set</strong> that I can share. It preserves enough high-level characteristics to give an identical TURF answer.</p>
<blockquote>
<p><strong>More detail</strong>. In these data, using random permutation of rows and columns, no observation is identical to the original data, and all sets of individual answers — and the relationships among times for every person — are altered randomly. This way, the data set preserves privacy while also reproducing the overall counts of availability by row and column. It gives the same TURF answers as my real data. The permuted data set was generated by the R <code>vegan::permatswap()</code> function (Oksanen et al, 2025).</p>
<p>Why don’t I just use fake data? Because this way, I can share results with the Zen community as well as the Quant community, and show exactly how the schedule was decided.</p>
</blockquote>
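<p>For readers curious about that step, here is a rough sketch of how such a permutation can be produced with <code>vegan::permatswap()</code>. The matrix below is a simulated stand-in, because the real responses stay private:</p>
<pre><code class="lang-r"># sketch of the privacy-preserving permutation step (simulated stand-in data)
library(vegan)
set.seed(2025)
private.grid &lt;- matrix(rbinom(39 * 24, 1, 0.11), nrow = 39)  # stand-in for the real grid
perm      &lt;- permatswap(private.grid, mtype = "prab", method = "quasiswap", times = 1)
shareable &lt;- perm$perm[[1]]        # one permuted binary matrix
# both marginals are preserved, so overall counts of availability are unchanged
all(colSums(shareable) == colSums(private.grid))
</code></pre>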
<p>The permuted data set is small enough to share directly in code, with no need to download anything. I start with output obtained from the R <code>dput()</code> function, which gives the data in text format. To create our data set <code>hourgrid</code>:</p>
<pre><code class="lang-r"><span class="hljs-comment"># the following was obtained from R dput() after survey import, data permutation, etc.</span>
hourgrid &lt;- structure(c(<span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>), 
  dim = c(<span class="hljs-number">39L</span>, <span class="hljs-number">24L</span>), 
  dimnames = list(
    c(<span class="hljs-string">"1"</span>, <span class="hljs-string">"2"</span>, <span class="hljs-string">"3"</span>, <span class="hljs-string">"4"</span>, <span class="hljs-string">"5"</span>, <span class="hljs-string">"6"</span>, <span class="hljs-string">"7"</span>, <span class="hljs-string">"8"</span>, 
      <span class="hljs-string">"9"</span>, <span class="hljs-string">"10"</span>, <span class="hljs-string">"11"</span>, <span class="hljs-string">"12"</span>, <span class="hljs-string">"13"</span>, <span class="hljs-string">"14"</span>, <span class="hljs-string">"15"</span>, <span class="hljs-string">"16"</span>, <span class="hljs-string">"17"</span>, <span class="hljs-string">"18"</span>, <span class="hljs-string">"19"</span>, 
      <span class="hljs-string">"20"</span>, <span class="hljs-string">"21"</span>, <span class="hljs-string">"22"</span>, <span class="hljs-string">"23"</span>, <span class="hljs-string">"24"</span>, <span class="hljs-string">"25"</span>, <span class="hljs-string">"26"</span>, <span class="hljs-string">"27"</span>, <span class="hljs-string">"28"</span>, <span class="hljs-string">"29"</span>, <span class="hljs-string">"30"</span>, 
      <span class="hljs-string">"31"</span>, <span class="hljs-string">"32"</span>, <span class="hljs-string">"33"</span>, <span class="hljs-string">"34"</span>, <span class="hljs-string">"35"</span>, <span class="hljs-string">"36"</span>, <span class="hljs-string">"37"</span>, <span class="hljs-string">"38"</span>, <span class="hljs-string">"39"</span>), 
    c(<span class="hljs-string">"1:00"</span>, <span class="hljs-string">"2:00"</span>, <span class="hljs-string">"3:00"</span>, <span class="hljs-string">"4:00"</span>, <span class="hljs-string">"5:00"</span>, <span class="hljs-string">"6:00"</span>, <span class="hljs-string">"7:00"</span>, <span class="hljs-string">"8:00"</span>, <span class="hljs-string">"9:00"</span>, 
      <span class="hljs-string">"10:00"</span>, <span class="hljs-string">"11:00"</span>, <span class="hljs-string">"12:00"</span>, <span class="hljs-string">"13:00"</span>, <span class="hljs-string">"14:00"</span>, <span class="hljs-string">"15:00"</span>, <span class="hljs-string">"16:00"</span>, 
      <span class="hljs-string">"17:00"</span>, <span class="hljs-string">"18:00"</span>, <span class="hljs-string">"19:00"</span>, <span class="hljs-string">"20:00"</span>, <span class="hljs-string">"21:00"</span>, <span class="hljs-string">"22:00"</span>, <span class="hljs-string">"23:00"</span>, <span class="hljs-string">"0:00"</span>)
    )
  )
</code></pre>
<p>Next I structure it as a <code>data.frame</code> and clean it up a bit:</p>
<pre><code class="lang-r"><span class="hljs-comment"># set that up as a nice dataframe and keep colnames as "hours"</span>
tmp.names       &lt;- colnames(hourgrid)   <span class="hljs-comment"># save original column names (data.frame() would change them)</span>
hourgrid        &lt;- data.frame(hourgrid)
names(hourgrid) &lt;- tmp.names     
hourgrid$ID     &lt;- <span class="hljs-number">1</span>:nrow(hourgrid) <span class="hljs-comment"># add ID variable needed later by TURF</span>
<span class="hljs-comment"># check its structure</span>
head(hourgrid)
</code></pre>
<p>With the <code>head()</code> command, we see that each row shows a person’s availability for each hour of the day. Availability is coded as 1 if they said they are available at that time. (As noted, the survey asked for responses in each respondent’s local time; those were standardized to Pacific time before this analysis.)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746911828498/cb85075a-bf5c-409d-b10b-169a95838066.png" alt class="image--center mx-auto" /></p>
<p>We can find the most popular times by summing the 24 columns representing each hour:</p>
<pre><code class="lang-r"><span class="hljs-comment"># count of preferred times (omitting ID column)</span>
colSums(hourgrid[ , <span class="hljs-number">1</span>:<span class="hljs-number">24</span>])
</code></pre>
<p>We see that 09:00 and 15:00 (not shown here) are the most popular times, followed by 10:00 and 12:00:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746912062081/87e09cd3-7a05-4ead-b271-65d4b5eefc88.png" alt class="image--center mx-auto" /></p>
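<p>If you prefer to read the ranking directly from the console instead of scanning all 24 counts, sorting the column sums does the trick:</p>
<pre><code class="lang-r"># rank the hours by how many respondents selected each one
sort(colSums(hourgrid[ , 1:24]), decreasing = TRUE)[1:4]
</code></pre>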
<p>Even better is a data visualization. For this data set, we can do a <strong>heatmap</strong> of times by respondent. Here’s R code for that:</p>
<pre><code class="lang-r"><span class="hljs-keyword">library</span>(ggplot2)
<span class="hljs-keyword">library</span>(reshape2)
<span class="hljs-comment"># melt the data for nice ggplot structure</span>
hours.m &lt;- melt(subset(hourgrid, rowSums(hourgrid[ , <span class="hljs-number">1</span>:<span class="hljs-number">24</span>]) &gt; <span class="hljs-number">0</span>), 
                id.vars = <span class="hljs-string">"ID"</span>)
names(hours.m) &lt;- c(<span class="hljs-string">"Respondent"</span>, <span class="hljs-string">"Time"</span>, <span class="hljs-string">"Available"</span>)

<span class="hljs-comment"># plot it</span>
p &lt;- ggplot(data=hours.m, 
            aes(x=Respondent, y=Time, fill=Available)) +
  geom_tile(color = <span class="hljs-string">"grey90"</span>,
            lwd = <span class="hljs-number">0.5</span>,
            linetype = <span class="hljs-number">1</span>) +
  scale_fill_gradient(low = <span class="hljs-string">"white"</span>, high = <span class="hljs-string">"darkblue"</span>) +
  scale_y_discrete(limits = rev(levels(hours.m$Time))) +
  coord_fixed() +
  theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  theme(legend.position = <span class="hljs-string">"none"</span>) +
  xlab(<span class="hljs-string">"Respondent (permuted data)"</span>) +
  ylab(<span class="hljs-string">"Times Selected by Respondent (permuted data)"</span>)

p
</code></pre>
<p>Two small notes on that code. First, when <code>melt()</code>’ing the data, I keep only respondents with <strong>at least 1 time</strong> selected (<em>nb</em>, in <code>ggplot()</code> this can lead to blank columns; solve by making ID an ordered factor, or by creating a new ID without gaps; a small sketch of that fix follows the plot). Second, I <strong>reverse the Y axis</strong> using <code>scale_y_discrete(limits = rev(levels(…)))</code> so the times are shown in a more natural order. Here’s the plot:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746912440219/4ff28009-068f-4141-84e8-27d678f894ae.png" alt class="image--center mx-auto" /></p>
<p>We can see a few characteristics of the data. First of all — remembering that these are permuted data with the same high level structure but not people’s actual individual responses — <strong>most respondents only selected a few times</strong> they are available, commonly 1, 2, or 3 times on a given day. Second, we see that the times are distributed across almost all hours of the day (which reflects the worldwide community of respondents). Thus, <strong>the overall grid is relatively sparse</strong>.</p>
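<p>Regarding the blank-column note above, here is one small way to handle it (a judgment call; renumbering the IDs would work just as well): convert <code>Respondent</code> to a factor whose levels are only the IDs actually present, so <code>ggplot()</code> treats the axis as discrete with no gaps.</p>
<pre><code class="lang-r"># drop the gaps: keep only respondent IDs that actually appear in the melted data
# (an ordered factor works equally well here)
hours.m$Respondent &lt;- factor(hours.m$Respondent,
                             levels = sort(unique(hours.m$Respondent)))
# re-running the ggplot() call above then shows no empty columns
</code></pre>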
<p>FWIW, we can calculate the sparseness. Add up all the selected times and divide by the size of the grid:</p>
<pre><code class="lang-r"><span class="hljs-comment"># how sparse is it?</span>
sum(hourgrid[ , <span class="hljs-number">1</span>:<span class="hljs-number">24</span>]) / nrow(hourgrid) / <span class="hljs-number">24</span>
</code></pre>
<p>The answer is <code>0.108</code>, i.e., the “average” respondent selected 10.8% of the available times.</p>
<p>All of this tells us that the single most popular time would be either 09:00 or 15:00 Pacific. But <strong>what if we want to schedule 2 or 3 times that maximize availability for the largest number of people</strong>, where each of them has at least one time that would work? <strong>That’s where TURF comes in</strong>.</p>
<blockquote>
<p><strong>What about days of the week?</strong> For purposes here, I’m ignoring that. This grid has responses for weekdays (weekends were asked separately). Among all 7 days of the week, the most popular days were Monday and Tuesday. Of those, only Tuesday works for me to schedule the meetings. Thus, in this post, I assume that we’re talking about times on Tuesdays.</p>
<p>BTW, you might wonder: why not ask about all times, across all days using a 7 day x 24 hour grid? Answer: in that format, respondents tend to select only a small number and data are extremely sparse. However, we could still use TURF with data of that kind.</p>
</blockquote>
<hr />
<h2 id="heading-turf-code-and-results">TURF Code and Results</h2>
<p>It is relatively straightforward to write one’s own TURF function, but we don’t need to; there is already an easy-to-use <code>turfR</code> library for R (Horne, 2014). First, we slightly reformat the data to match its expectation:</p>
<pre><code class="lang-r"><span class="hljs-comment"># TURF Analysis</span>
<span class="hljs-keyword">library</span>(turfR)

<span class="hljs-comment"># set up the data for TURF package</span>
<span class="hljs-comment"># add ID variable so respondents are identified to TURF</span>
turf.dat &lt;- data.frame(
  ID = hourgrid[ , <span class="hljs-string">"ID"</span>],           <span class="hljs-comment"># respondent ids</span>
  Weight = rep(<span class="hljs-number">1</span>, nrow(hourgrid)),  <span class="hljs-comment"># the weight / importance for each respondent</span>
  hourgrid[, <span class="hljs-number">1</span>:(ncol(hourgrid)-<span class="hljs-number">1</span>)]  <span class="hljs-comment"># columns of data to use; all except "ID"</span>
)
head(turf.dat)
</code></pre>
<p>This format is almost identical to our original format, except that it adds a <code>Weight</code> for each respondent. The <em>weight</em> is the “importance” to give each observation. For example, if you are working with a customer database, you might want to weight customers according to their observed revenue, longevity, retention, segment membership, or the like; or use similar estimates from a regression model; or use responses to a survey item such as their stated level of interest. For our data, I assign an equal <strong>weight</strong> of 1 to all observations.</p>
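<p>Purely as an illustration of non-constant weights (nothing like this exists in the scheduling survey), here is how the same data frame might be built if we had an importance measure such as customer spend, simulated below:</p>
<pre><code class="lang-r"># illustrative only: weight respondents by a simulated "spend" measure
set.seed(123)
spend &lt;- rexp(nrow(hourgrid), rate = 1/100)      # hypothetical importance values
turf.weighted &lt;- data.frame(
  ID     = hourgrid$ID,
  Weight = spend / mean(spend),                  # normalize so the mean weight is 1
  hourgrid[ , 1:24]
)
head(turf.weighted)
</code></pre>
<p>For the scheduling data, though, every respondent counts equally, so the constant weight of 1 is what we use.</p>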
<p>Here’s what the <code>turfR</code> data object looks like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746915956498/67aa8f99-6000-49a0-8124-478d279789d1.png" alt class="image--center mx-auto" /></p>
<p>Although our data has simple values of 1 and 0 for people’s availability, <strong>binary data is not required for TURF</strong>. TURF can work with continuous values (see note at the end), such as likelihood scores from a database, regression model, or simple Likert responses; estimated preference values (e.g., from a choice model survey); and so forth. In our case, the data come from checkboxes of availability on a survey, so 1/0 is the natural format.</p>
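<p>If your measures are continuous but you want a simple binary reach definition anyway, one common approach (not the only one) is to apply a threshold, such as a top-2-box rule on a Likert item. The scores below are simulated purely for illustration:</p>
<pre><code class="lang-r"># illustrative only: derive 1/0 reach indicators from simulated Likert-type scores
set.seed(456)
scores  &lt;- matrix(sample(1:7, 39 * 5, replace = TRUE), nrow = 39)  # 39 people, 5 items
reached &lt;- 1 * (scores &gt;= 6)   # count a respondent as "reached" at 6 or 7 (top-2-box)
colMeans(reached)              # per-item reach if each item were offered alone
</code></pre>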
<p>With that, <strong>we can now fit a TURF model</strong>. I’ll start by looking for the best 2 times (<code>k=2</code>) and ask for the top 10 results (<code>keep=10</code>). We tell it that we have 24 items (times) to consider (<code>n=24</code>). Here’s the code and result:</p>
<pre><code class="lang-r"><span class="hljs-comment"># run the TURF analysis for best 2 times</span>
(turf.2 &lt;- turf(turf.dat, n=<span class="hljs-number">24</span>, k=<span class="hljs-number">2</span>, keep=<span class="hljs-number">10</span>))$turf
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746916003151/dc0e7a4a-b4e0-4016-9ca0-f314edf41431.png" alt class="image--center mx-auto" /></p>
<p>I’ll interpret results in the next section, but will note here an <strong><em>unusual but happy coincidence</em></strong> in these data: <strong>the item &amp; column numbers (1—24) correspond exactly to the hours of the day</strong> (01:00—24:00/0:00). That makes the results easy to read, but it is a unique situation for this data set and question. In most cases, you would have to match those numbers to interpretable names of items, features, messages, etc.</p>
<p>For my purposes, I’ll use TURF results for the tentative 3 best times to schedule. Here’s how to get those, asking for the 3 best times (<code>k=3</code>) and keeping the top 15 results (<code>keep=15</code>):</p>
<pre><code class="lang-r"><span class="hljs-comment"># what if we scheduled 3 times instead of 2? </span>
(turf.3 &lt;- turf(turf.dat, n=<span class="hljs-number">24</span>, k=<span class="hljs-number">3</span>, keep=<span class="hljs-number">15</span>))$turf
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746916051182/3d2c26f6-ca33-40a3-99cb-f33489759b97.png" alt class="image--center mx-auto" /></p>
<p>BTW, in this simple case, the algorithm does a <em>full expansion</em> to consider <strong>all possible combinations</strong>. If we had larger sets to consider, that would become computationally intractable (e.g., choosing the best sets of 20 ads to rotate out of a set of 100 options). In such cases, <code>turfR</code> offers <strong>Monte Carlo sampling</strong> options.</p>
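<p>To see what that full expansion amounts to, here is a small base R sketch (it is not <code>turfR</code>’s internal code) that scores every possible pair of hours by reach:</p>
<pre><code class="lang-r"># illustration of "full expansion": exhaustive reach for every set of k=2 hours
k      &lt;- 2
combos &lt;- combn(24, k)                       # all 276 possible pairs of hours
reach  &lt;- apply(combos, 2, function(set) {
  mean(rowSums(hourgrid[ , set, drop = FALSE]) &gt;= 1)   # prop. reached at least once
})
best   &lt;- which.max(reach)
colnames(hourgrid)[combos[ , best]]          # the pair of times with the highest reach
reach[best]
</code></pre>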
<hr />
<h2 id="heading-selecting-among-the-results">Selecting Among the Results</h2>
<p>For my scheduling problem, I will start by scheduling meetings 1 time per week and then soon expand to 2 and eventually perhaps 3 times per week. Given that plan (i.e., “strategic goal”), I’ll use the <code>k=3</code> results for the optimal 3 meetings per week, and develop the schedule in light of those results.</p>
<p>Following is the <code>k=3</code> result again (first 6 solutions). Before jumping into the schedule choice, I’ll explain the first 3 columns: <code>combo</code>, <code>rchX</code>, and <code>frqX</code>. The <em>combo</em> is simply the internal solution number that <code>turfR</code> considered. We can ignore that unless we want to dig deeply into its solution design matrix.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747069403432/81b77386-0365-4f69-b5e1-9c2c447dedad.png" alt class="image--center mx-auto" /></p>
<p><strong>The next two columns,</strong> <code>rchX</code> <strong>and</strong> <code>frqX</code><strong>, are the heart of the TURF solution</strong>: the <em>reach</em> (<code>rchX</code>) and <em>frequency</em> (<code>frqX</code>) achieved by each solution, sorted in decreasing order of reach. Specifically, <code>rchX</code> is the proportion of respondents who are reached at least once by a given solution, while <code>frqX</code> is how often the average respondent is reached.</p>
<p>Consider the first solution, which is to offer meeting times at 09:00, 12:00, and 18:00 Pacific Time. That set of times is estimated to <em>reach</em> 69% of the respondents — meaning that at least 1 of those times would work for 69% of people who replied to the survey. However, some people are available for multiple times within that set. The <em>frequency</em> estimate tells us that this set of 3 times would offer 0.74 available meeting times per week to the average respondent. I.e., some of the 69% would have multiple options.</p>
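<p>Because the column numbers line up with the hours in these data (as noted above), it is easy to cross-check those two numbers against the raw grid. Assuming <code>frqX</code> is the mean count across all respondents, as described, this should reproduce the reported reach and frequency for solution #1:</p>
<pre><code class="lang-r"># hand cross-check of solution #1 (09:00, 12:00, 18:00 = columns 9, 12, 18)
set1 &lt;- c(9, 12, 18)
mean(rowSums(hourgrid[ , set1]) &gt;= 1)   # reach: at least one workable time
mean(rowSums(hourgrid[ , set1]))        # frequency: mean workable times per person
</code></pre>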
<p>You might wonder, “<strong>what about confidence intervals (CIs)?</strong> Are there meaningful differences between 69% for the first solution and 67% for the second one?” Like anything involving CIs, it is a complex question. In this case, it is unclear to what extent we should consider these data to be a <em>sample</em> of a population — as a CI would assume — as opposed to being a descriptive <em>census</em> of a community. However, if we assume that it is a random sample, we could find CIs either with a bootstrap operation (best) or using a <strong>quick</strong> <strong>approximation for the binomial proportions</strong>. Given N=39 respondents and looking for an 80% confidence interval (±1.28 standard errors), we get a simple estimate of:</p>
<pre><code class="lang-r"><span class="hljs-comment"># est'd CI for the TURF #1 solution</span>
<span class="hljs-comment"># get the proportion reached for solution #1</span>
p.turf &lt;- turf.3$turf[[<span class="hljs-number">1</span>]]$rchX[<span class="hljs-number">1</span>]  <span class="hljs-comment"># solution #1 from the k=3 run discussed above</span>
<span class="hljs-comment"># estimate of 80% confidence interval (swag; bootstrap etc would be better)</span>
<span class="hljs-comment"># set the Z value (standard errors) as appropriate; in this case "80% CI" Z == 1.28</span>
ci     &lt;- <span class="hljs-number">1.28</span> * sqrt(p.turf * (<span class="hljs-number">1</span>-p.turf) / nrow(hourgrid))
c(p.turf - ci, p.turf + ci)
</code></pre>
<p>The answer is that our CI is a range of 0.598—0.787, meaning that solution #1 is estimated to reach approximately 60-79% of respondents. The other 14 solutions all fall within that range for reach. So, as a first approximation, none of those 14 are dramatically worse than solution #1. Here’s the CI computation using simple binomial approximation at the 80% level:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746981398957/140bfd13-22bf-4913-a8d4-a2cd4ffde71c.png" alt class="image--center mx-auto" /></p>
<p>With all of this in place, let’s look at the first 6 solutions. Then I’ll discuss how I chose among them with respect to overall goals. The solutions are:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746981590290/be1580ed-c6c7-43e5-802e-7f171a88ae7b.png" alt class="image--center mx-auto" /></p>
<p>Examining the patterns, a few things stand out. First, <strong>the 09:00 time appears in all 6 of the top results</strong>. Second, we see that solutions 2—5 are all <strong>tied for reach</strong> with one another, and are only slightly behind solution #1. Third, solutions 2 &amp; 3 — and to a lesser degree solutions 4 &amp; 5 — score somewhat <strong>higher on frequency</strong> than solution #1. Finally, some of the solutions including #1 and #5 have times <strong>spread out</strong> across the day (e.g., 09:00, 12:00, 18:00) whereas others are more compact, such as #4 that has times at 09:00, 10:00, and 12:00.</p>
<p><strong>Here’s how I assessed the possibilities and made a choice</strong> for the schedule.</p>
<ol>
<li><p><strong>For the first, initially scheduled time, 09:00 is a clear winner</strong>. Although it is slightly less popular on its own than 15:00 (N=12 vs. N=13), the 09:00 time appears in all of the top solutions. That makes it a good choice as a foundational time, while allowing the schedule to evolve later.</p>
</li>
<li><p>Beyond that, I <strong>tentatively select 10:00 as the second, <em>expected</em> time slot</strong>. 10:00 appears in solutions #2 and #4 and thus is an excellent second choice. Additionally, it has the advantage of being compact in time (assuming both sessions occur on the same day), which is logistically simple. It also appears in solutions with relatively higher frequency, meaning that it will give some flexibility to people.</p>
</li>
<li><p>If and when the community needs to have a <strong>third time scheduled</strong>, I’d want to see how 09:00 and 10:00 are going and then decide. Tentatively, 12:00 or 15:00 look promising; another option might be 18:00.</p>
</li>
</ol>
<p>This process of selecting 09:00 — while tentatively planning 10:00, and considering a more complete schedule over time — is a good example of cautious, reversible decision-making.</p>
<blockquote>
<p>At Amazon (and elsewhere) this is often described as the difference between a “two-way door” and a “one-way door”. A one-way door is a decision that is difficult to reverse. For example, announcing a 09:00 time and scheduling practice then is difficult to reverse because people come to expect it. Similarly, shipping a product, buying a house, and getting married are all one-way doors.</p>
<p>By contrast, a two-way door is comparatively easy to reverse. Deciding tentatively on 10:00 as a planned next step is easy to change later. It is similarly “two way” and reversible to test a product prototype, rent an apartment, or date someone (as opposed to getting married).</p>
</blockquote>
<hr />
<h2 id="heading-a-few-applications-for-turf">A Few Applications for TURF</h2>
<p>In the worlds of Quant UX and marketing research, TURF analysis has many applications. Here are a few that occur to me somewhat randomly:</p>
<ul>
<li><p><strong>Constructing a restaurant or other “menu”</strong>: given preference data (such as a <a target="_blank" href="https://quantuxblog.com/easy-maxdiff-in-r">MaxDiff assessment</a>), which smallest set of items will offer the most choices that appeal to as many customers as possible? (<em>Want to try it?</em> Get the Quant UX Association <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages">class preference data</a> and run TURF on the estimates!)</p>
</li>
<li><p><strong>Planning conference or workshop locations over time</strong>: which set of locations will maximize the appeal — on at least one occasion — for as many people as possible?</p>
</li>
<li><p><strong>Selecting marketing messages</strong>: if we can feature only a few messages, such as 3 claims on a product package or website, which set of messages will catch attention from as many people as possible?</p>
</li>
<li><p><strong>Offering employee perks or gifts</strong>: suppose we want a small number of things to offer to employees. Which set will be most likely to have “something for everyone”?</p>
</li>
<li><p><strong>Prioritizing customer outreach efforts</strong>: suppose our sales team wants to call customers who are highly interested in at least one of our new features. We want to select a limited number to call, prioritized jointly by feature interest <em>[reach]</em> and account size <em>[weight]</em>. Who are they?</p>
</li>
<li><p><strong>Pharmaceutical formulary</strong>: which combination of prescription drugs will offer the greatest coverage, to as many of our patients and/or physicians as possible?</p>
</li>
<li><p><strong>Excursions for a cruise ship</strong>: what is the smallest combination of outings and on-ship activities that will interest the largest number of passengers?</p>
</li>
<li><p><em>Similarly</em>: selecting political party platforms or legislative priorities; books for a mobile library; design templates for an app; destinations for an airline to serve in order to be “complete”; and of course, the <em>original goal</em> of optimizing ad placement for reach; among many others.</p>
</li>
</ul>
<p><strong>Remember that TURF can use continuous estimates for “reach” (see the note at the end), and thus can take advantage of any sort of estimate of interest or preference; it can also weight respondents.</strong> The reach estimates may come from respondent choices, as in the data here, but may also come from observed data such as product instrumentation or a CRM system, or from a regression model. With so many options, there are many applications for TURF!</p>
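<p>For example, weighting respondents in <code>turfR</code> only requires changing the second (weight) column when building the analysis data frame. Here is a minimal sketch, reusing the <code>hourgrid</code> data from this post but with a <em>hypothetical</em> <code>account.size</code> importance score standing in for real weights (the values are placeholders, not real data):</p>
<pre><code class="lang-r"># hypothetical importance weights, e.g., account size (placeholder values only)
set.seed(98101)
account.size &lt;- runif(nrow(hourgrid), min=1, max=10)

turf.wt &lt;- data.frame(
  ID     = hourgrid$ID,                # respondent ids
  Weight = account.size,               # weight each respondent by importance
  hourgrid[ , 1:24]                    # the 0/1 reach indicators
)
# "reach" in the results now reflects total weight rather than head count
(turf.wt3 &lt;- turf(turf.wt, n=24, k=3, keep=10))$turf
</code></pre>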
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>As we saw, TURF analysis provides potential solutions to a very practical question: how to schedule a small number of events (weekly meeting times) for a group of people. Given a few candidate solutions, I was able to consider other factors such as practicality and create an initial schedule.</p>
<p>We discussed how TURF can be extended to larger and more complex cases, including ones where we have continuous (real number) estimates of interest instead of binary indications, and could weight respondents by some kind of importance estimate.</p>
<p><em>Could I have done this analysis by hand?</em> With this small data set and number of options, <em>yes</em>. On the other hand, by doing it in code, I have an answer that is more scalable and will be easy to apply again as new data arrive. Plus I like coding (and also like blogging and explaining code).</p>
<p><strong>Learning More</strong>. Although TURF is widely used, I’m not aware of a great single source to learn more about it. The original paper was by Miaoulis et al. (1990), listed in “Citations” below. To find more applications, I’d suggest checking <a target="_blank" href="https://scholar.google.com/scholar?cites=2630395812592527312">that paper’s citations</a> in Google Scholar and branching out from there. The <code>turfR</code> library also has helpful documentation and references.</p>
<p><strong>Try it!</strong> If you’d like to try TURF with another, more complex data set, <em>here’s a homework exercise</em>. Get the <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages">public MaxDiff data set with individual preferences</a> for classes to be offered by the Quant UX Association. What is the best set of 3, 4, or 5 classes that would reach the most people? <em>(Note: some choice modeling software offers TURF analysis built-in, but by adapting the code here, you can do it on your own in R!)</em></p>
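<p>If you want a starting point for that exercise, here is a minimal, untested sketch. The file name, the assumption that every column holds a numeric per-respondent preference estimate, and the 75th-percentile cutoff are all illustrative choices, not part of the actual data set; the binary conversion mirrors the note in the Appendix below:</p>
<pre><code class="lang-r">library(turfR)

# hypothetical file name and layout: one row per respondent, one column per class
md.est &lt;- read.csv("maxdiff-class-estimates.csv")

# convert continuous estimates to 0/1: top 25% of classes per respondent = 1
md.bin &lt;- t(apply(md.est, 1,
                  function(x) as.numeric(x &gt;= quantile(x, probs=0.75))))

md.turf &lt;- data.frame(ID = 1:nrow(md.bin), Weight = 1, md.bin)

# best sets of 3, 4, and 5 classes
for (k in 3:5) {
  print(turf(md.turf, n=ncol(md.bin), k=k, keep=5)$turf)
}
</code></pre>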
<p>Best wishes!</p>
<hr />
<h2 id="heading-citations">Citations</h2>
<p>Horne J (2014). <strong>turfR</strong>: TURF Analysis for R. R package version 0.8-7, <a target="_blank" href="https://CRAN.R-project.org/package=turfR">https://CRAN.R-project.org/package=turfR</a>.</p>
<p>Miaoulis, G., Parsons, H., &amp; Free, V. (1990). TURF: A New Planning Approach for Product Line Extensions. <em>Marketing Research</em>, <em>2</em>(1).</p>
<p>Oksanen J, Simpson G, Blanchet F, Kindt R, Legendre P, Minchin P, O'Hara R, Solymos P, Stevens M, Szoecs E, Wagner H, Barbour M, Bedward M, Bolker B, Borcard D, Carvalho G, Chirico M, De Caceres M, Durand S, Evangelista H, FitzJohn R, Friendly M, Furneaux B, Hannigan G, Hill M, Lahti L, McGlinn D, Ouellette M, Ribeiro Cunha E, Smith T, Stier A, Ter Braak C, Weedon J, Borman T (2025). <strong>vegan</strong>: Community Ecology Package. R package version 2.6-10, <a target="_blank" href="https://CRAN.R-project.org/package=vegan">https://CRAN.R-project.org/package=vegan</a>.</p>
<p>Wickham H (2007). Reshaping Data with the <strong>reshape</strong> Package. Journal of Statistical Software, 21(12), 1-20. URL <a target="_blank" href="http://www.jstatsoft.org/v21/i12/">http://www.jstatsoft.org/v21/i12/</a>.</p>
<p>Wickham H (2016). <strong>ggplot2</strong>: Elegant Graphics for Data Analysis. Springer-Verlag New York.</p>
<hr />
<h2 id="heading-appendix-note-on-continuous-estimates-with-turf">Appendix: Note on Continuous Estimates with TURF</h2>
<p>TURF in general can work with continuous estimates if you define an appropriate conversion and/or estimation value for it to optimize. However, the most recent version of the <code>turfR</code> package doesn’t support that. It only uses 0 and 1 indicators. If you want to work with continuous data, there are at least three options:</p>
<ul>
<li><p><strong>Convert your data to 0/1 based on some sort of cutoff value.</strong> This is by far the most common option in practice, in my experience. Then use <code>turfR</code> exactly as shown above.</p>
<ul>
<li><p>For example, following is an R command that will take a data frame with continuous values and give a transformation such that the top 25% of values for each respondent are coded as “1” while all others are coded as “0”. (There are many, many variations and alternatives for such transformations. I’m sure there is a nice tidy function to do this, too, but I write base R by habit!)</p>
</li>
<li><p>This code uses <code>quantile()</code> to find a cutting point, and uses <code>apply()</code> to apply a binary cutoff function to each respondent (<code>MARGIN=1</code>). Finally it uses <code>t()</code> to reshape the result back to the expected row-by-column layout.</p>
</li>
</ul>
</li>
<li><pre><code class="lang-r">  data.bin &lt;- t(apply(data, MARGIN=<span class="hljs-number">1</span>, 
                      <span class="hljs-keyword">function</span>(x) { as.numeric(x &gt;= quantile(x, probs=<span class="hljs-number">0.75</span>)) } ))
</code></pre>
</li>
</ul>
<ul>
<li><p><strong>Write your own TURF function or find one online</strong>. I’ve written TURF code at work — although unfortunately that lives with a former employer and I can’t share it — and will say it is not terribly complex. You could use the code in <code>turfR</code> as a starting point, or a <a target="_blank" href="https://bookdown.org/rossialessio095/R_Market_Research/turf.html">source like this one</a>. Just find the point where the comparison is made to 0/1 data or combinations and insert the continuous criterion you want instead (such as identifying the combo with the highest sum of values); a minimal sketch of this approach appears after this list.</p>
</li>
<li><p><strong>Use a platform that implements TURF for the data format you have</strong>. For example, if you want preference estimates from MaxDiff or Conjoint Analysis, Sawtooth Software’s Lighthouse Studio and Discover products will do TURF with continuous estimates.</p>
</li>
</ul>
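<p>As a concrete illustration of the “write your own” option, here is a minimal sketch of a brute-force TURF for continuous data. It is <em>not</em> the <code>turfR</code> algorithm; it enumerates every k-item combination with <code>combn()</code> and counts a respondent as “reached” if any item in the combination exceeds a threshold (you could swap in a highest-sum-of-values criterion instead). The function name, threshold, and simulated data are all assumptions for illustration:</p>
<pre><code class="lang-r"># brute-force TURF sketch for continuous estimates (illustration only)
# 'prefs' is assumed to be a respondent x item matrix of continuous scores
turf.continuous &lt;- function(prefs, k, threshold = 0, keep = 10) {
  combos &lt;- combn(ncol(prefs), k)                    # all k-item combinations
  reach  &lt;- apply(combos, 2, function(cols) {
    # reached == any item in the combo exceeds the threshold for that respondent
    mean(apply(prefs[ , cols, drop=FALSE], 1, max) &gt; threshold)
  })
  best &lt;- order(reach, decreasing=TRUE)[1:keep]      # keep the top combinations
  data.frame(
    reach = reach[best],
    items = apply(combos[ , best, drop=FALSE], 2,
                  function(cols) paste(colnames(prefs)[cols], collapse=" + "))
  )
}

# example with simulated data (purely illustrative)
set.seed(98101)
prefs &lt;- matrix(rnorm(39 * 8), nrow=39,
                dimnames=list(NULL, paste0("item", 1:8)))
turf.continuous(prefs, k=3, threshold=1)
</code></pre>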
<hr />
<h2 id="heading-all-the-code">All the Code</h2>
<p>Following is all of the R code from this post, including inline code that creates the small data set.</p>
<pre><code class="lang-r"><span class="hljs-comment">####</span>
<span class="hljs-comment"># load our data</span>
<span class="hljs-comment"># note, data have been permuted from original responses,</span>
<span class="hljs-comment"># preserving overall characteristics while changed for each respondent (cf. vegan::permatswap)</span>
hourgrid &lt;- structure(c(<span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, 
  <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">1L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>, <span class="hljs-number">0L</span>), 
  dim = c(<span class="hljs-number">39L</span>, <span class="hljs-number">24L</span>), 
  dimnames = list(
    c(<span class="hljs-string">"1"</span>, <span class="hljs-string">"2"</span>, <span class="hljs-string">"3"</span>, <span class="hljs-string">"4"</span>, <span class="hljs-string">"5"</span>, <span class="hljs-string">"6"</span>, <span class="hljs-string">"7"</span>, <span class="hljs-string">"8"</span>, 
      <span class="hljs-string">"9"</span>, <span class="hljs-string">"10"</span>, <span class="hljs-string">"11"</span>, <span class="hljs-string">"12"</span>, <span class="hljs-string">"13"</span>, <span class="hljs-string">"14"</span>, <span class="hljs-string">"15"</span>, <span class="hljs-string">"16"</span>, <span class="hljs-string">"17"</span>, <span class="hljs-string">"18"</span>, <span class="hljs-string">"19"</span>, 
      <span class="hljs-string">"20"</span>, <span class="hljs-string">"21"</span>, <span class="hljs-string">"22"</span>, <span class="hljs-string">"23"</span>, <span class="hljs-string">"24"</span>, <span class="hljs-string">"25"</span>, <span class="hljs-string">"26"</span>, <span class="hljs-string">"27"</span>, <span class="hljs-string">"28"</span>, <span class="hljs-string">"29"</span>, <span class="hljs-string">"30"</span>, 
      <span class="hljs-string">"31"</span>, <span class="hljs-string">"32"</span>, <span class="hljs-string">"33"</span>, <span class="hljs-string">"34"</span>, <span class="hljs-string">"35"</span>, <span class="hljs-string">"36"</span>, <span class="hljs-string">"37"</span>, <span class="hljs-string">"38"</span>, <span class="hljs-string">"39"</span>), 
    c(<span class="hljs-string">"1:00"</span>, <span class="hljs-string">"2:00"</span>, <span class="hljs-string">"3:00"</span>, <span class="hljs-string">"4:00"</span>, <span class="hljs-string">"5:00"</span>, <span class="hljs-string">"6:00"</span>, <span class="hljs-string">"7:00"</span>, <span class="hljs-string">"8:00"</span>, <span class="hljs-string">"9:00"</span>, 
      <span class="hljs-string">"10:00"</span>, <span class="hljs-string">"11:00"</span>, <span class="hljs-string">"12:00"</span>, <span class="hljs-string">"13:00"</span>, <span class="hljs-string">"14:00"</span>, <span class="hljs-string">"15:00"</span>, <span class="hljs-string">"16:00"</span>, 
      <span class="hljs-string">"17:00"</span>, <span class="hljs-string">"18:00"</span>, <span class="hljs-string">"19:00"</span>, <span class="hljs-string">"20:00"</span>, <span class="hljs-string">"21:00"</span>, <span class="hljs-string">"22:00"</span>, <span class="hljs-string">"23:00"</span>, <span class="hljs-string">"0:00"</span>)
    )
  )
<span class="hljs-comment"># set that up as a nice dataframe and keep colnames as "hours"</span>
tmp.names &lt;- colnames(hourgrid)
hourgrid &lt;- data.frame(hourgrid)
names(hourgrid) &lt;- tmp.names
hourgrid$ID &lt;- <span class="hljs-number">1</span>:nrow(hourgrid) <span class="hljs-comment"># add ID variable needed later by TURF</span>
<span class="hljs-comment"># check its structure</span>
head(hourgrid)

<span class="hljs-comment"># basic stats</span>
<span class="hljs-comment"># count of preferred times</span>
colSums(hourgrid[ , <span class="hljs-number">1</span>:<span class="hljs-number">24</span>])
<span class="hljs-comment"># number of times chosen, per respondent</span>
rowSums(hourgrid[ , <span class="hljs-number">1</span>:<span class="hljs-number">24</span>])
<span class="hljs-comment"># how sparse is it?</span>
sum(hourgrid[ , <span class="hljs-number">1</span>:<span class="hljs-number">24</span>]) / nrow(hourgrid) / <span class="hljs-number">24</span>

<span class="hljs-comment"># heat map of times</span>
<span class="hljs-comment"># make heatmap</span>
<span class="hljs-keyword">library</span>(ggplot2)
<span class="hljs-keyword">library</span>(reshape2)
<span class="hljs-keyword">library</span>(car)
<span class="hljs-comment"># melt the data for nice ggplot structure</span>
hours.m &lt;- melt(subset(hourgrid, rowSums(hourgrid[ , <span class="hljs-number">1</span>:<span class="hljs-number">24</span>]) &gt; <span class="hljs-number">0</span>), 
                id.vars = <span class="hljs-string">"ID"</span>)
names(hours.m) &lt;- c(<span class="hljs-string">"Respondent"</span>, <span class="hljs-string">"Time"</span>, <span class="hljs-string">"Available"</span>)
<span class="hljs-comment"># some(hours.m, 15)</span>

<span class="hljs-comment"># plot it</span>
p &lt;- ggplot(data=hours.m, 
            aes(x=Respondent, y=Time, fill=Available)) +
  geom_tile(color = <span class="hljs-string">"grey90"</span>,
            lwd = <span class="hljs-number">0.5</span>,
            linetype = <span class="hljs-number">1</span>) +
  scale_fill_gradient(low = <span class="hljs-string">"white"</span>, high = <span class="hljs-string">"darkblue"</span>) +
  scale_y_discrete(limits = rev(levels(hours.m$Time))) +
  coord_fixed() +
  theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  theme(legend.position = <span class="hljs-string">"none"</span>) +
  xlab(<span class="hljs-string">"Respondent (permuted data)"</span>) +
  ylab(<span class="hljs-string">"Times Selected by Respondent (permuted data)"</span>)

p

<span class="hljs-comment"># TURF Analysis</span>
<span class="hljs-keyword">library</span>(turfR)

<span class="hljs-comment"># set up the data for TURF package</span>
<span class="hljs-comment"># add ID variable so respondents are identified to TURF</span>
turf.dat &lt;- data.frame(
  ID = hourgrid[ , <span class="hljs-string">"ID"</span>],           <span class="hljs-comment"># respondent ids</span>
  Weight = rep(<span class="hljs-number">1</span>, nrow(hourgrid)),  <span class="hljs-comment"># the weight / importance for each respondent</span>
  hourgrid[, <span class="hljs-number">1</span>:(ncol(hourgrid)-<span class="hljs-number">1</span>)]  <span class="hljs-comment"># columns of data to use; all except "ID"</span>
)
head(turf.dat)

<span class="hljs-comment"># run the TURF analysis for best 2 times</span>
(turf.2 &lt;- turf(turf.dat, n=<span class="hljs-number">24</span>, k=<span class="hljs-number">2</span>, keep=<span class="hljs-number">10</span>))$turf

<span class="hljs-comment"># est'd CI for the TURF #1 solution</span>
<span class="hljs-comment"># get the proportion reached for solution #1</span>
p.turf &lt;- turf.3$turf[[<span class="hljs-number">1</span>]]$rchX[<span class="hljs-number">1</span>]
<span class="hljs-comment"># estimate of 80% confidence interval (swag; bootstrap etc would be better)</span>
<span class="hljs-comment"># set the Z value (standard errors) as appropriate; in this case "80% CI" Z == 1.28</span>
ci     &lt;- <span class="hljs-number">1.28</span> * sqrt(p.turf * (<span class="hljs-number">1</span>-p.turf) / nrow(hourgrid))
c(p.turf - ci, p.turf + ci)

<span class="hljs-comment"># what if we scheduled 3 times instead of 2? </span>
(turf.3 &lt;- turf(turf.dat, n=<span class="hljs-number">24</span>, k=<span class="hljs-number">3</span>, keep=<span class="hljs-number">15</span>))$turf
</code></pre>
<p><a target="_blank" href="https://notbyai.fyi"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746999945541/9d224843-9e9f-44cc-98c5-276915794420.png" alt class="image--center mx-auto" /></a></p>
]]></content:encoded></item><item><title><![CDATA[Announcing: Tech Community Zen]]></title><description><![CDATA[Today I’m sharing something different: the launch of a new Zen meditation practice group for the online tech community. We’ll start practice May 27, 2025, with an initial time of 9am Pacific (12p Eastern, 18p CEST).
If you’re only here for Quant UX d...]]></description><link>https://quantuxblog.com/announcing-tech-community-zen</link><guid isPermaLink="true">https://quantuxblog.com/announcing-tech-community-zen</guid><category><![CDATA[zen]]></category><category><![CDATA[Tech community]]></category><category><![CDATA[meditation]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Tue, 15 Apr 2025 18:46:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743090081478/cff5dafd-7818-4e41-b70a-3bf3a9f69b43.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today I’m sharing something different: the launch of a <strong>new Zen meditation practice group for the online tech community</strong>. We’ll start practice <em>May 27, 2025</em>, with an initial time of 9am Pacific (12p Eastern, 18p CEST).</p>
<p>If you’re only here for Quant UX discussion, you can skip the rest of this newsletter. However, I’ve heard from some readers interested in Zen. If you are one, keep reading. Or visit <a target="_blank" href="https://tczen.org">Tech Community Zen</a>!</p>
<hr />
<h2 id="heading-brief-background">Brief Background</h2>
<p>Apart from my work and writing, I’m an <a target="_blank" href="https://www.tczen.org/about-us">ordained Zen teacher</a> (aka “priest”). I started Zen practice in 2001, was ordained as a Dharma Teacher in 2015, and received additional teaching authorization (“Inka”) in 2023.</p>
<p><strong>I’m starting Tech Community Zen for a simple reason: the tech community can benefit from Zen</strong>. I see the tech community experiencing chaos, with many people hoping for something stable and fulfilling. Zen practice can be part of an answer to that chaos. Its benefits extend to our colleagues, families, communities, and the world around us.</p>
<p>Zen teaches deep attention, helping others, compassion, and counteracting destructive desire and anger. In these times, the tech community — and the whole world — needs those.</p>
<hr />
<h2 id="heading-what-is-tech-community-zen">What is Tech Community Zen?</h2>
<p><strong>Tech Community Zen practices authentic Zen online as a community</strong> … and will develop from there. This includes <em>group practice</em> (Zazen; meditation) and, for those interested, <em>individual koan practice</em> (if that doesn’t mean anything to you, don’t worry; I’m mentioning it for anyone who wonders).</p>
<p>Depending on interest over time, we might add other online or in-person events such as intensive retreats. We’ll see!</p>
<p>Why does this focus on the tech community? Attention to the tech community is beneficial for two reasons: the community needs it, and the focus helps with teaching, programming, expectations, and community interaction. At the same time, <strong>everyone is welcomed</strong> and no one is excluded.</p>
<hr />
<h2 id="heading-interested-help-with-the-planning">Interested? Help with the Planning!</h2>
<p>You can find out more — and help plan the schedule and activities — in two ways:</p>
<ol>
<li><p><strong>Share your preferences</strong>, especially for days &amp; times for online practice, in this short survey:<br /> <a target="_blank" href="https://surveys.sawtoothsoftware.com/67c8902733613c9ad1b9818e">Tech Community Zen Interest Survey</a></p>
</li>
<li><p><strong>Sign up</strong> for the Tech Community Zen newsletter:<br /> <a target="_blank" href="https://tech-community-zen-newsletter.beehiiv.com/subscribe">https://tech-community-zen-newsletter.beehiiv.com/subscribe</a></p>
</li>
</ol>
<h3 id="heading-more-questions"><em>More Questions?</em></h3>
<p>Visit the FAQ and other information pages at <a target="_blank" href="https://tczen.org">tczen.org</a> (== <a target="_blank" href="https://techcommunityzen.org">techcommunityzen.org</a>). Or email and ask: <a target="_blank" href="mailto:techcommunityzen@gmail.com">techcommunityzen@gmail.com</a></p>
<p>Cheers, and I’ll return to Quant UX specific topics in the next post :)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748018386060/1113c068-803c-413b-aa15-b75956044def.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Four areas of (UXR) thinking about AI / LLMs]]></title><description><![CDATA[This post is my attempt to make sense (largely for myself) of various trends and responses about AI / LLMs, and thematic disagreements that I observe among UX Researchers discussing LLMs.
I structure the conversations I’ve recently had with UXRs abou...]]></description><link>https://quantuxblog.com/four-areas-of-uxr-thinking-about-ai-llms</link><guid isPermaLink="true">https://quantuxblog.com/four-areas-of-uxr-thinking-about-ai-llms</guid><category><![CDATA[quantux]]></category><category><![CDATA[uxresearch]]></category><category><![CDATA[AI]]></category><category><![CDATA[Ethics in AI]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Tue, 18 Mar 2025 16:34:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/v0Lu245JUOI/upload/7fa3a1f1d5dc41aa4c3ff348408f6f8a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post is my attempt to make sense (largely for myself) of various trends and responses about AI / LLMs, and thematic disagreements that I observe among UX Researchers discussing LLMs.</p>
<p>I structure the conversations I’ve recently had with UXRs about LLMs into four thematic areas, giving a few pro &amp; con arguments in each. <strong>I’m writing this not to advance specific claims but instead to share a schema that has been useful to me</strong>.</p>
<p>Thanks to corporate interests, there is a lot of “pro AI” discussion, and the claims on that side are well known. On the other hand, social pressures act to obscure “con” arguments, and I spend slightly more time on them. Although I outline both pro &amp; con, <strong>I’m not advocating for either “both sides” or “no one knows.”</strong> Instead, my hope here is that any reader (and especially me) will gain clarity from the considerations and can thereby have a stronger, more specific, and more coherent framework wherever they land.</p>
<p>I outline the following four general areas:</p>
<ul>
<li><p><strong>Utility</strong></p>
</li>
<li><p><strong>Aesthetics</strong></p>
</li>
<li><p><strong>Social Context &amp; Reaction</strong></p>
</li>
<li><p><strong>Ethics &amp; Externalities</strong></p>
</li>
</ul>
<p>I present this in terms of “what UXRs are thinking”. That is only because my conversations are most often with UXRs. <strong>By understanding these areas, I hope we can disentangle arguments about LLMs</strong>, such as when claims of <em>utility</em> are pitted against <em>aesthetic</em> or <em>ethical</em> perceptions. After laying out the framework, I recount a small illustration of how it may help us think about claims that people make.</p>
<hr />
<h2 id="heading-theme-1-utility">Theme 1: Utility</h2>
<p>The first pattern is the most obvious for UX researchers: <strong>consideration of the <em>utility</em> of AI</strong> in delivering use cases for a product, or similarly, in helping with UXR activities.</p>
<p>On the pro side there are <a target="_blank" href="https://github.com/NiuTrans/ABigSurveyOfLLMs?tab=readme-ov-file#section29"><em>utility claims</em></a> that LLMs deliver new experiences, assist in running surveys, analyze user responses, summarize documents, meetings, or transcripts, draft reports, write code, and so forth.</p>
<p>On the negative side, I see two typical patterns:</p>
<ul>
<li><p>Claims that LLMs <em>don’t</em> make things better, or <a target="_blank" href="https://www.lesswrong.com/posts/tqmQTezvXGFmfSe7f/how-much-are-llms-actually-boosting-real-world-programmer">not much better</a>, or are not worth the effort (cf. my <a target="_blank" href="https://quantuxblog.com/thoughts-on-llm-ai-to-learn-r-programming">article on R coding</a>)</p>
</li>
<li><p>Claims that LLMs trade off short term gains for long-term harms in utility (cf. some studies in <a target="_blank" href="https://arxiv.org/html/2410.01026v1#S6">this review of AI for developer productivity</a>, or in this study for <a target="_blank" href="https://www.microsoft.com/en-us/research/uploads/prod/2025/01/lee_2025_ai_critical_thinking_survey.pdf">general cognition</a>).</p>
</li>
</ul>
<p>A common view is that any AI application has a “game theory” payoff matrix:</p>
<ul>
<li><p>Potential <em>benefits</em> when AI is correct or helpful</p>
</li>
<li><p>vs. Potential <em>risks</em> when AI is incorrect or misleading</p>
</li>
</ul>
<p>Some activities have low risk (e.g., recommender systems, role playing chat, games) whereas others have high risk (e.g., finance, health care, security). A complete game theoretic model could additionally include the risks and benefits from ethical considerations and externalities as noted below.</p>
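<p>To make that payoff framing concrete, here is a minimal sketch in R. <em>All of the numbers are invented purely for illustration</em>; the only point is that the expected value flips sign as the cost of being wrong grows relative to the benefit of being right.</p>
<pre><code class="lang-r"># Toy payoff calculation for a hypothetical AI feature (all numbers invented)
p.correct &lt;- 0.90   # assumed probability the AI output is correct or helpful
benefit   &lt;- 1      # assumed value when it is correct (arbitrary units)

cost.low  &lt;- 2      # hypothetical cost of an error in a low-stakes use (e.g., a recommender)
cost.high &lt;- 50     # hypothetical cost of an error in a high-stakes use (e.g., finance, health)

p.correct * benefit - (1 - p.correct) * cost.low    #  0.7 : net positive
p.correct * benefit - (1 - p.correct) * cost.high   # -4.1 : net negative
</code></pre>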
<p>A <a target="_blank" href="https://aial.ie/pages/aiparis/">criticism of that sort of game theoretic view</a> is that it is based on (1) an assumption of a “net profit” model where an LLM delivers some positive value, and the question is how problems trade off against that value; and (2) an assumption that such trade-offs can be made at all. As I’m arguing here, <strong>some of the trade-offs are between incommensurate areas</strong>, such that a game theory approach — although not “wrong” as such — is limited at best and strongly misleading at worst.</p>
<p>Overall, utility is clearly the #1 area of discussion for UXRs and product teams generally.</p>
<blockquote>
<p>Side note: unlike classic utilitarian philosophy, <strong>“utility” in the AI context is often used reductively to mean “[individual] productivity.”</strong> To the extent that such claims align with classic, longer-run utilitarian considerations, they often fall into the <a target="_blank" href="https://en.wikipedia.org/wiki/TESCREAL">problems</a> of so-called “effective altruism” and related movements. That caveat is worth noting, although an in-depth discussion is out of scope here.</p>
</blockquote>
<hr />
<h2 id="heading-theme-2-aesthetics">Theme 2: Aesthetics</h2>
<p>The second area I see is aesthetic evaluation of AI (again, for purposes here, meaning LLMs). By this, I mean the evaluation of the <strong><em>beauty, tastefulness, appeal</em></strong>, and similar properties of AI output. <em>This is not restricted to AI-generated art</em> but applies to any AI output, such as the “beauty” of AI-written code, reports, texts, etc.</p>
<p>Although many people are using AI-generated texts and graphics, I’m not aware of positive claims that the aesthetics of those outputs are actually <em>good</em> (<a target="_blank" href="https://www.sciencedirect.com/science/article/pii/S0747563223000584">related study</a>). Instead, AI output appears to be regarded by its users as aesthetically tolerable. They often comment that AI results are <strong>easy or “democratized”</strong>.</p>
<p>On a closely related positive note, there are many UXRs and others who are <strong>fascinated</strong> by AI interactions and who enjoy engaging with AI systems at length. I would not categorize that enjoyment as <em>aesthetic</em> but rather as more generally <em>emotionally engaging</em>.</p>
<p>On the negative side, I observe three primary areas of aesthetic complaint:</p>
<ul>
<li><p><strong>Claims that AI products are aesthetically bad</strong>, such as being ugly, badly written, “<a target="_blank" href="https://www.techradar.com/computing/artificial-intelligence/ai-slop-is-taking-over-the-internet-and-ive-had-enough-of-it">slop</a>,” and the like. This view seems to be common among <a target="_blank" href="https://whatever.scalzi.com/2023/02/23/omg-is-the-ai-coming-for-my-job/">professional writers</a>, <a target="_blank" href="https://www.cityweekly.net/utah/the-case-for-not-messing-with-ai-art/Content?oid=21297415">artists</a>, and musicians.</p>
<ul>
<li><p>Many developers see AI code as needing <a target="_blank" href="https://www.infoworld.com/article/3610521/refactoring-ai-code-the-good-the-bad-and-the-weird.html">large scale refactoring</a> before it is “good.”</p>
</li>
<li><p>Differently, the Authors’ Guild — verified published authors, of which I’m a member — recently launched a “human authored” <a target="_blank" href="https://authorsguild.org/news/ag-launches-human-authored-certification-to-preserve-authenticity-in-literature/">certification program</a> to guarantee “authentic” works.</p>
</li>
</ul>
</li>
<li><p>Worries that <strong>AI proliferation will drown out good work</strong>, rewrite aesthetic preferences, or generally lead people to miss the important aspects of creation while focusing on mere output and “productivity”. A <a target="_blank" href="https://www.hachettebookgroup.com/titles/john-warner/more-than-words/9781541605510/?lens=basic-books">new book, More than Words, by John Warner</a> discusses the latter angle in depth.</p>
</li>
<li><p>A more generalized feeling that AI <strong>diminishes the appreciation of human creativity</strong> and human experiences, and that it devalues the many years and thousands of hours of effort that it takes to master a craft such as writing, drawing, painting, music, or programming.</p>
</li>
</ul>
<p>Overall, some professional writers, developers, artists, and musicians experience AI output with a sense of revulsion. Although such an aesthetic response is emotional, I would note that it is <em>not</em> <em>only</em> emotional.</p>
<p>In my view, <strong>there are good reasons to treat aesthetic truth as being on par with other kinds of truths</strong> such as propositional (classically logical) truth. However, that discussion goes deep into philosophy and related areas such as cultural studies, and that is out of scope here (for more, a starting philosopher, which is not to say an <em>easy</em> philosopher, is <a target="_blank" href="https://plato.stanford.edu/entries/adorno/">Adorno</a>).</p>
<p><strong>How does this relate to the category of utility above?</strong> Answer: aesthetics and utility are separate. An interaction between product team members might go like this:</p>
<ul>
<li><p><em>Person A:</em> This LLM letter-writing feature will lower the bar substantially for new writers [a utility claim]</p>
</li>
<li><p><em>Person B</em>: But its output is awful to read, so bland and tasteless [an aesthetic claim]</p>
</li>
</ul>
<p>By sorting evaluations into these two categories, as well as the subcategories noted, we can better evaluate competing claims (and identify when they are not competing but are just <em>different</em>).</p>
<hr />
<h2 id="heading-theme-3-social-context-amp-reaction">Theme 3: Social Context &amp; Reaction</h2>
<p>A third set of responses to AI depend on the interplay of any particular UXR’s own social situation and personality.</p>
<p>As we all know, AI has had widespread pressure for institutional adoption, and has also suffered from tremendous hype. At times this has been paired with claims — such as those about AGI — that are somewhere between fraudulent and incredibly ignorant (I’ve written about <a target="_blank" href="https://quantuxblog.com/were-far-from-agi">AGI</a> and <a target="_blank" href="https://quantuxblog.com/debunking-llm-iq-test-results">AI “IQ”</a> as examples).</p>
<p><strong>Each UXR responds to the social context of AI according to their own needs and personal style</strong>. While one UXR may wish to go along with demands, another may rebel against authority or hype. One may emphasize the future promise of AI, while another will see LLMs as a tool of corporate subjugation.</p>
<p>The impacts of AI in social contexts are not only evolving but are also more complex than any single person could observe. As in the old story about the blind men and the elephant, different aspects appear to different people.</p>
<p>On the negative side, I’ve observed particularly hostile reactions prompted by the following:</p>
<ul>
<li><p><strong>Hype in general</strong>. Some people detest hype and react against it (<a target="_blank" href="https://www.dair-institute.org/maiht3k/">the Mystery AI Hype podcast</a> is an example)</p>
</li>
<li><p><strong>Illogic</strong>. Some UXRs have a particular dislike of illogical claims made about AI, as they also might in other domains (<em>nb</em>: I personally have this tendency!)</p>
</li>
<li><p><strong>Management demands</strong>. Some UXRs expect the worst from management (perhaps because of <a target="_blank" href="https://quantuxblog.com/the-end-of-tech-as-a-big-family">past trauma</a>) and may view AI as an extension of managerial or corporate domination.</p>
</li>
</ul>
<p>In the specific UXR context, the pro &amp; con trends for social context show up in claims such as:</p>
<ul>
<li><p><em>[pro]</em> "AI is the future and UXers have to learn it”</p>
</li>
<li><p><em>[con]</em> “Management doesn’t understand what they’re talking about, and we need to stop the harm to users”</p>
</li>
<li><p><em>[pro &amp; con]</em> “Excessive hype is destroying opportunities to talk about the real value of AI”</p>
</li>
<li><p><em>[pro]</em> “AI is an area where I can show thought leadership” or “By using AI, I can get ahead”</p>
</li>
</ul>
<p>Again — as we saw above for claims of utility vs. aesthetics — <strong>the set of issues regarding the social positioning of AI is logically separate from the other areas outlined here</strong>. If someone says, “We need to learn AI to get ahead,” that doesn’t imply anything in particular, positively or negatively, about whether AI is useful, aesthetic, or as I discuss next, ethical.</p>
<hr />
<h2 id="heading-theme-4-ethics-amp-externalities">Theme 4: Ethics &amp; Externalities</h2>
<h3 id="heading-preliminary-points"><em>Preliminary Points</em></h3>
<p>This is the longest section, because <strong>ethical issues are perhaps the most contentious claims about AI</strong>. Extreme positive arguments claim that AI is a force for good that will solve cancer, mitigate climate change, and eliminate human drudgery. On the extreme negative side, AI is said to be a short-term tool of theft, oppression, corporatism, and militarism, and a long-term existential risk to humans and the world ecosystem.</p>
<p>Before outlining my framework, I have two overall observations. First, <strong>my impression is that ethical takes are often derived post hoc</strong>. They may be used to reinforce stances that one has <em>already</em> adopted for reasons of aesthetics, emotions, self-interest, etc. In other words, I believe UXRs often either like AI or not, for various reasons, and then adopt “ethical” reasons to justify a stance retroactively. I’m not saying that is always the case, but it often occurs (and that is perfectly OK, in my view — there is no reason why ethics should be one’s <em>first</em> consideration, as long as they are an <em>eventual</em> consideration).</p>
<p>Second, and closely related, I observe that <strong>ethical takes are generally unpersuasive to anyone who doesn’t already hold them</strong>. The proponents of ethical arguments appear to hope that they will be persuasive to others — whether that is “AI will cure cancer!” or that “AI is theft!” — but I have seen no good evidence that such arguments are persuasive to anyone (in the sense of changing opinions).</p>
<p>Now, <strong>my claim that ethical arguments are unpersuasive does not imply that ethical arguments are <em>wrong</em></strong>, nor that they are useless. Ethical positions may be useful for providing a backstop and support for views held otherwise, for giving a “record” for posterity, and for deepening one’s own thinking and moral actions. Persuasion is not the only salient outcome. (<em>Full disclosure</em>: I think about ethics a lot because I was a philosophy graduate student prior to psychology, and I wrote a philosophy Master’s thesis in ethics. I also taught the mandatory Research Ethics class to over 1500 employees when I worked at Google.)</p>
<p>With those preliminaries in place, the following points outline how I see ethical issues lining up. (<em>n.b.</em>, this area is vast and I don’t pretend that these points are complete).</p>
<h3 id="heading-ethics-in-ai-training"><em>Ethics in AI Training</em></h3>
<ul>
<li><p>Positive claims:</p>
<ul>
<li><p>LLMs are a vastly <a target="_blank" href="https://copyrightblog.kluweriplaw.com/2024/02/29/is-generative-ai-fair-use-of-copyright-works-nyt-v-openai/">scaled-up version</a> of <strong>fair use</strong>, making massive information available to everyone</p>
</li>
<li><p>Training creates jobs in economically disadvantaged areas like Africa and Southeast Asia</p>
</li>
</ul>
</li>
<li><p>Negative claims:</p>
<ul>
<li><p>LLMs are only possible because they use <strong>stolen</strong> <strong>intellectual property</strong> (<a target="_blank" href="https://www.debevoise.com/insights/publications/2025/02/an-early-win-for-copyright-owners-in-ai-cases-as">example</a>)</p>
</li>
<li><p>The <a target="_blank" href="https://www.theguardian.com/technology/2023/aug/02/ai-chatbot-training-human-toll-content-moderator-meta-openai">workers who train LLMs</a> are <strong>underpaid</strong> and treated abusively, undergoing substantial <strong>trauma</strong></p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-ethics-in-ai-applications-and-other-short-term-results"><em>Ethics in AI Applications and Other Short-Term Results</em></h3>
<ul>
<li><p>Positive claims:</p>
<ul>
<li><p>LLMs deliver various <strong>features and productivity enhancement</strong> (refer to the “utility” section above)</p>
</li>
<li><p>LLMs <strong>help people such as non-native speakers or people with disabilities</strong> achieve parity with others (<a target="_blank" href="https://www.technologyreview.com/2024/08/23/1096607/ai-people-with-disabilities-accessibility/">example</a>)</p>
</li>
</ul>
</li>
<li><p>Negative claims:</p>
<ul>
<li><p>LLM features have generally <strong>low marginal value</strong> (<a target="_blank" href="https://www.techradar.com/phones/new-survey-suggests-the-vast-majority-of-iphone-and-samsung-galaxy-users-find-ai-useless-and-to-be-honest-im-not-surprised">example</a>), and may worsen products they go into</p>
</li>
<li><p>LLM features may exist not to deliver value but to <strong>capture personal data</strong> for corporate purposes</p>
</li>
<li><p>LLMs contribute to <strong>declining skills</strong> in reading, writing, and <a target="_blank" href="https://www.microsoft.com/en-us/research/wp-content/uploads/2025/01/lee_2025_ai_critical_thinking_survey.pdf">critical thinking</a></p>
</li>
<li><p>The push for LLMs brings an <strong>opportunity cost</strong> of less product attention to <a target="_blank" href="https://uxdesign.cc/no-ai-user-research-is-not-better-than-nothing-its-much-worse-5add678ab9e7">real user needs</a></p>
</li>
<li><p>LLMs push all outputs towards <strong>white, western, English-language, educated, affluent, and heteronormative modes</strong> of expression because those modes dominate their training data</p>
</li>
<li><p>LLM data centers capture <strong>water and electricity</strong> needed by homes, businesses, and agriculture</p>
</li>
<li><p>Contrary to claims that they achieve parity, <strong>LLMs perpetuate</strong> <a target="_blank" href="https://www.psu.edu/news/information-sciences-and-technology/story/trained-ai-models-exhibit-learned-disability-bias-ist"><strong>stereotypes and biases</strong></a> about disabled people and other groups</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-externalities-and-structural-issues-current-as-opposed-to-hypothetical"><em>Externalities and Structural Issues (current, as opposed to hypothetical)</em></h3>
<ul>
<li><p>Positive claims:</p>
<ul>
<li><p>AI productivity (see “<em>utility</em>” above) will help companies achieve <strong>higher profits</strong></p>
</li>
<li><p>AI is helpful for <strong>non-native speakers</strong> (by which, proponents often mean “<em>of English</em>”) and helps them write better, grammatically, etc.</p>
</li>
</ul>
</li>
<li><p>Negative claims:</p>
<ul>
<li><p>LLMs contribute to the increasing <strong>domination and centering of white, male, affluent English</strong> as a structural norm (similar to the point in “applications” above). (Examples: <a target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3630106.3658975">1</a>, <a target="_blank" href="https://ojs.aaai.org/index.php/AIES/article/view/31758">2</a>, <a target="_blank" href="https://www.mcpdigitalhealth.org/article/S2949-7612\(24\)00020-8/fulltext">3</a>, <a target="_blank" href="https://drive.google.com/file/d/16F_JZv4eHNiDMJT6BT7F6m97C2rBX8-7/view?usp=sharing">4</a>.)</p>
</li>
<li><p>Promises (or realities) of AI productivity are contributing to <strong>job losses</strong></p>
</li>
<li><p>LLMs devalue workers and <strong>replace creators</strong>, leading to a race to the bottom for artistic creation</p>
</li>
<li><p>LLMs increase the <strong>concentration of power</strong> among a few companies and their billionaires, achieving higher profit and more power with fewer workers and worker protections</p>
</li>
<li><p>LLM data centers <strong>pollute</strong> the planet and contribute to acceleration of <a target="_blank" href="https://sloanreview.mit.edu/article/tackling-ais-climate-change-problem/">climate change</a></p>
</li>
<li><p>LLMs <a target="_blank" href="https://medium.com/@jl115/harm-from-language-models-discrimination-exclusion-and-toxicity-b94d7cf44c57">differentially hurt</a> <strong>marginal communities</strong> (non-white, low-wage, global south, etc.)</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-ethics-amp-hypothetical-future-results"><em>Ethics &amp; Hypothetical Future Results</em></h3>
<ul>
<li><p>Positive claims (speculative):</p>
<ul>
<li><p>AI speeds up scientific discovery, and new discoveries could <strong>mitigate many scientific problems</strong> (medical research, climate change, etc.)</p>
</li>
<li><p>AI could usher in an <strong>era of productivity</strong> such that people are freed to work less, more creatively</p>
</li>
</ul>
</li>
<li><p>Negative claims (speculative):</p>
<ul>
<li><p>Bugs or malicious usage of AI pose a <strong>range of threats</strong>, ranging from <em>modest scale</em> problems (fraud, impersonation, theft) to <em>socially catastrophic</em> (financial collapse) to <em>existential</em> (engineered viruses, killer machines, nuclear war)</p>
</li>
<li><p>AI may lead to humans being massively unemployed as <strong>jobs are replaced by AI</strong>, and largely confined to “lower skilled” labor that AI cannot replace</p>
</li>
<li><p>LLMs <a target="_blank" href="https://link.springer.com/article/10.1007/s11948-025-00529-0">change <strong>expectations about truth</strong></a> and the veracity of both information and work products</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-research-specific-ethical-issues"><em>Research-Specific Ethical Issues</em></h3>
<ul>
<li><p>Positive claims:</p>
<ul>
<li><p>LLMs <strong>democratize and decrease the cost of research</strong> by providing synthetic training data and by analyzing data faster (<a target="_blank" href="https://medium.com/towards-data-science/how-llms-will-democratize-exploratory-data-analysis-70e526e1cf1c">example claim</a>)</p>
</li>
<li><p>LLMs can discover new things, faster, by <strong>exploring research domains in new ways</strong></p>
</li>
</ul>
</li>
<li><p>Negative claims:</p>
<ul>
<li><p>Research applications depend on the <strong>uncredited work</strong> of other researchers</p>
</li>
<li><p><strong>“Democratizing research” as machine output devalues research</strong>. Normalizing <em>research-as-production</em> abandons the value of researchers and their contributions (and jobs).</p>
</li>
<li><p><a target="_blank" href="https://www.linkedin.com/pulse/dangers-risks-depending-synthetic-data-from-large-language-jdihf/">Synthetic data is <strong>misleading and biased</strong></a>, over-representing dominant social groups (as above) and specifically downplaying and <strong>systematically under-estimating marginalized groups</strong></p>
</li>
<li><p>Subjecting respondent and user data to LLM systems <strong>violates privacy and informed consent (</strong><a target="_blank" href="https://www.sciencedirect.com/science/article/pii/S2667295225000042"><strong>1</strong></a><strong>,</strong> <a target="_blank" href="https://fpf.org/blog/lets-look-at-llms-understanding-data-flows-and-risks-in-the-workplace/"><strong>2</strong></a><strong>)</strong></p>
</li>
<li><p>LLM research results <strong>amplify what is obvious</strong> in a data set and/or is “centroid” in its training data, rather than discovering new things or representing actual data (<a target="_blank" href="https://arxiv.org/html/2503.01631v1">related</a>)</p>
</li>
<li><p>Using a system with ethical problems (if one believes that) is itself a <strong>research ethics violation</strong></p>
</li>
</ul>
</li>
</ul>
<p>In some discussions, folks go further with their concerns about LLMs. <strong>In those cases, some UXRs and managers actively question the ethics</strong> of the pressure they feel to research or use AI.</p>
<p>That ethical questioning takes two forms. One form is <strong>whether it is ethical to do UX research about AI</strong> when the research is expected to say something they don’t believe — namely, that LLMs add value. I often hear researchers complaining that they are being “forced” to find reasons that AI will add product value. In those cases, they feel that the users and customers they represent are being ignored or even actively harmed. They question whether there is any point in doing research that is forced to “validate” preordained executive decisions.</p>
<blockquote>
<p>Side note: research usually is not directly ordered to support a decision (although that sometimes happens). Rather, results are demanded through organizational pressure, such as: having negative results challenged; being asked to do A/B testing between options that both assume some need (AI, for example); being promoted or praised for work that extolls the desired outcome; etc. Such mechanisms bias results but preserve management ability to believe that research is “honestly sharing data”.</p>
</blockquote>
<p>The second form of UXR ethical questioning is at a higher level: <strong>whether it is ethical to use LLMs at all</strong> or to work for a company that promotes them or requires one to use them. That arises if one holds some of the negative ethical views above. Similarly — but as far as I can tell, much less frequently — a UXR who is an AI proponent may have ethical concerns about NOT using AI tools. This may occur when they are in an organization that is not using LLMs because of concerns about privacy, data protection, and the like.</p>
<hr />
<h2 id="heading-how-the-themes-help-a-brief-example">How the Themes Help: a Brief Example</h2>
<p>Here’s <strong>a small example of how I personally find this framework to be helpful</strong>. Yesterday I was reading a LinkedIn post and comment that went like this (I’m not going to call out anyone, so I paraphrase):</p>
<ul>
<li><p><strong>Post</strong>: AI tools seem so potentially useful for research! What are people doing with them?</p>
</li>
<li><p><strong>Comment</strong>: I use them all the time. Anyone not using them is a dinosaur and will be left behind.</p>
</li>
</ul>
<p>In the past, I would have read this sort of exchange as annoyingly <em>lacking in information</em>, as being <em>hype</em>, and as <em>name-calling</em> (“dinosaur”) that uses belittlement instead of informed discourse (perhaps to cover up the commenter’s own anxiety).</p>
<p>However, using the 4 themes outlined here, I now decode it as follows:</p>
<ul>
<li><p><strong>Utility</strong>: there is an implicit claim of utility but no evidence</p>
</li>
<li><p><strong>Aesthetics</strong>: <em>ignored</em></p>
</li>
<li><p><strong>Social</strong> <strong>context</strong>: an explicit claim that AI can help one get ahead, again without evidence</p>
</li>
<li><p><strong>Ethics</strong>: <em>ignored</em></p>
</li>
</ul>
<p>In short, the exchange boils down to an explicit claim that AI can help someone advance, and it ignores (or somewhat denigrates) the other potential positions. I imagine this reflects anxiety more than anything else.</p>
<p>With the 4 elements of the framework, the argument can be deconstructed to see (a) where it is making claims and not making claims; and — just as importantly — (b) partially defused of emotion, anger, confusion, and the like, because it is clearer what is going on.</p>
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>If you made it this far, congratulations :) As I noted at the beginning, my goal here has been to organize some areas of discourse and debate for my own thinking.</p>
<p>Reviewing that framework, a few themes and observations stand out:</p>
<ul>
<li><p><strong>The four themes here — utility, aesthetics, social function, and ethics — are mostly incommensurate</strong> and not directly comparable. That leads to two things:</p>
<ul>
<li><p>The <em>various sides in debates tend to miss each other’s points</em> by responding across domains, such as answering a utility claim with an ethical rebuttal or vice versa. That convinces no one.</p>
</li>
<li><p><em>People simply ignore themes</em> where the prevailing arguments don’t agree with them. For example, proponents of utility may dismiss concerns about aesthetics and ethics.</p>
</li>
</ul>
</li>
<li><p>After outlining the themes here, I observe that most of <strong>the discussion about AI reflects a strongly productivist orientation</strong> — in society as well as in product teams and among UXRs. LLMs are discussed in terms of how much they help people produce, how fast one can write reports or code or create images, how much data one can analyze, how many articles one can summarize, and so forth.</p>
</li>
<li><p>The <strong>UXRs I talk with are increasingly frustrated</strong> at the situation because it leads to untenable demands on research. I believe much of that results from two things:</p>
<ul>
<li><p>The incommensurability of the themes as outlined. <strong>Researchers have increasing cognitive dissonance and dissatisfaction as the thematic areas collide</strong>. For example, they may have ethical concerns competing with pressure to be productive; or pressure to support an AI feature competing with user data that suggests dissatisfaction. That leads to sharply decreased researcher morale.</p>
</li>
<li><p>They feel <strong>particular tension around the productivist orientation</strong>, that research is to be judged primarily on the basis of faster and more output (as opposed, say, to user benefit, or truth).</p>
</li>
</ul>
</li>
<li><p><strong>None of this implies a both-sides equivalence</strong>. I outline the pro and con arguments not to equate them but instead to reflect that different people hold different views. Any given UXR will — and IMO <em>should</em> — take a principled stance that aligns with their own perspective. To reiterate, my goal in this article was to clarify the issues, so I can better understand the positions I take.</p>
</li>
</ul>
<p>Finally, I would note that none of these tensions are new with AI. The tensions between research demands, ethics, productivity, and so forth have always existed in product research. However, the confusions and tensions appear to be worsening rapidly. I think that reflects two things:</p>
<ol>
<li><p>With its expectations, hype, and extremes, <strong>AI brings the preexisting tensions into sharper and more intense focus.</strong></p>
</li>
<li><p><strong>This is happening at a time of</strong> <a target="_blank" href="https://quantuxblog.com/the-end-of-tech-as-a-big-family"><strong>organizational and “cultural” turmoil within tech</strong></a>, where UXRs already feel acutely anxious. With that baseline anxiety and uncertainty, the intensified tensions of AI are more difficult to understand, process, or resolve.</p>
</li>
</ol>
<p>What should we do about it? I don’t have specific recommendations for now. Instead, what I believe is that <strong>awareness of the situation is the first step towards finding whatever is next</strong>.</p>
<p>I hope these reflections and an outline of the themes will contribute to such awareness. Cheers!</p>
<p><a target="_blank" href="https://notbyai.fyi"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746999945541/9d224843-9e9f-44cc-98c5-276915794420.png" alt class="image--center mx-auto" /></a></p>
]]></content:encoded></item><item><title><![CDATA[Tinkering with Job Statistics]]></title><description><![CDATA[Introduction
I recently read comments from politicians that “college should prepare students for high paying jobs.” Without getting into politics, it left me wondering: what were the high-paying jobs when I went to college?
In this post, I detail how...]]></description><link>https://quantuxblog.com/tinkering-with-job-statistics</link><guid isPermaLink="true">https://quantuxblog.com/tinkering-with-job-statistics</guid><category><![CDATA[quantux]]></category><category><![CDATA[uxresearch]]></category><category><![CDATA[R Language]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Thu, 06 Mar 2025 15:24:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1741215643819/1f0f6198-b8d7-4242-bcd6-4e855a4b469f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>I recently read comments from politicians that “college should prepare students for high paying jobs.” Without getting into politics, it left me wondering: <em>what were the high-paying jobs when</em> <strong><em>I</em></strong> <em>went to college</em>?</p>
<p>In this post, I detail how I found government statistics from 1985 — the year I went to college at age 17 — and <strong>used R to extract data from a PDF file</strong> and make a simple “opportunity plot.”</p>
<p>I’m not claiming this is a complete (nor even a good) answer to the question about what jobs someone should choose. Instead, it’s a fun way to get data from a difficult source and then look back at jobs data from 40 years ago — and develop some R skills!</p>
<p>As always, I share R code along the way, and I compile it again at the end.</p>
<hr />
<h2 id="heading-getting-amp-cleaning-the-jobs-data">Getting &amp; Cleaning the Jobs Data</h2>
<p>The data I used come from the US Bureau of Labor Statistics in a <a target="_blank" href="https://www.bls.gov/opub/mlr/1985/10/rpt1full.pdf">PDF here</a> (as of this writing), published in October 1985 (Prieser, 1985). This PDF file appears to be converted from a scanned document with optical character recognition (OCR).</p>
<p>The tables of interest show <strong>job counts and salary</strong>, on Page 2. Here is an excerpt, which among other things shows a slight slant to the page reflecting its presumed origin as a scanned document:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741224272870/3c9e35f9-1213-4a30-aebb-ad147a1dd1b4.png" alt class="image--center mx-auto" /></p>
<p>What I will do in R is this:</p>
<ol>
<li><p>Read the job names, counts, and salaries from the PDF</p>
</li>
<li><p>Clean up the data</p>
</li>
<li><p>Group the rows into higher level categories (“Accountants”, “Chemists”, etc.)</p>
</li>
<li><p>Plot a 2×2 showing job counts vs salaries</p>
</li>
</ol>
<p>To get the data from the PDF, I’ll use the <code>tabulapdf</code> package in R (Sepulveda, 2024).</p>
<blockquote>
<p>Installation note: <code>tabulapdf</code> uses a Java library, and thus requires a local installation of the Java Runtime Environment (JRE) and the Java Development Kit (JDK). That installation process is somewhat complex, so I will skip it here; but there are notes in the Appendix below on how I did it on my Macbook.</p>
</blockquote>
<p>First, I download the <a target="_blank" href="https://www.bls.gov/opub/mlr/1985/10/rpt1full.pdf">PDF document</a> from the BLS site and save it locally to <code>~/Downloads</code>. Using <code>tabulapdf</code>, I extract “Table 2” from that file as <strong>two separate tables</strong> (one for each column). If you download the PDF and update the <code>filename</code> variable to match where you saved it, you should be able to follow along.</p>
<p>Here is the code:</p>
<pre><code class="lang-r">filename &lt;- <span class="hljs-string">"~/Downloads/bls-report-1985.pdf"</span>  <span class="hljs-comment"># update for your folder and filename</span>
<span class="hljs-keyword">library</span>(tabulapdf)                             <span class="hljs-comment"># see Appendix for installation notes</span>
tables &lt;- extract_areas(filename, pages=c(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>))
tables[[<span class="hljs-number">1</span>]]
</code></pre>
<p>In <code>tabulapdf</code>, you can either extract tables automatically (<code>extract_tables()</code>) or specify exactly which regions you want by <strong>selecting rectangular areas</strong> with <code>extract_areas()</code>. Our PDF has several tables with complex formatting, so I used <code>extract_areas()</code> to tell it exactly where to look.</p>
<p>In <code>extract_areas()</code>, the option <code>pages=c(2, 2)</code> tells it to look on page 2 and then to let me select 2 areas for extraction. When I run that, page 2 appears in the RStudio viewer. I <strong>click and drag to select each of the two areas</strong> in turn as shown here:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741225167662/27238096-1489-46ac-beed-f4b605494c4d.png" alt class="image--center mx-auto" /></p>
<p>The initially-converted first column (<code>tables[[1]]</code>) looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741224863032/1fb05773-4cac-4023-8f3a-f2b3b5255048.png" alt class="image--center mx-auto" /></p>
<p><strong>Success</strong>! However, there is a lot of junk in the data such as ellipses, extra spaces, a “$” symbol, commas, etc. Next, we’ll clean that up.</p>
<blockquote>
<p><em>To extract data from a</em> <em>different PDF</em>, <em>simply change the</em> <code>filename</code> <em>as above and then alter the</em> <code>pages=</code> <em>argument as needed for your file. Or see the help pages for</em> <code>tabulapdf</code> <em>for more options.</em></p>
</blockquote>
<hr />
<h2 id="heading-cleaning-the-data">Cleaning the Data</h2>
<p><code>tabulapdf</code> gave us 2 tables, one for each “column” in the PDF. To proceed, we’ll first <strong>combine those into one data set</strong> and then <strong>clean up messy data</strong> that comes from 1980s table formatting and OCR scanning.</p>
<p>To combine the two column sets, I select only the columns with data we want — namely, columns 1, 11, and 12 from the left hand part of the table and 1, 7, and 8 from the right hand part. Then I set friendly names and bind them into a single data frame:</p>
<pre><code class="lang-r">df1 &lt;- data.frame(tables[[<span class="hljs-number">1</span>]])[ , c(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>, <span class="hljs-number">12</span>)]
df2 &lt;- data.frame(tables[[<span class="hljs-number">2</span>]])[ , c(<span class="hljs-number">1</span>,  <span class="hljs-number">7</span>,  <span class="hljs-number">8</span>)]
names(df1) &lt;- names(df2) &lt;- c(<span class="hljs-string">"Job"</span>, <span class="hljs-string">"N"</span>, <span class="hljs-string">"Salary"</span>)
job.data &lt;- na.omit(rbind(df1, df2))   <span class="hljs-comment"># removes header rows from the tables</span>
</code></pre>
<p>In this case, <code>na.omit()</code> removes rows without complete data (such as within-table header lines).</p>
<p>Next, I remove junk from the data and convert the numeric columns to numbers:</p>
<pre><code class="lang-r"><span class="hljs-comment"># clean up ellipses in the job names</span>
job.data[ , <span class="hljs-number">1</span>]       &lt;- gsub(<span class="hljs-string">" \\."</span>, <span class="hljs-string">""</span>, job.data[ , <span class="hljs-number">1</span>])  
<span class="hljs-comment"># remove nuisance spaces, etc., and convert numeric columns to numbers</span>
job.data[ , c(<span class="hljs-number">2</span>, <span class="hljs-number">3</span>)] &lt;- lapply(job.data[ , c(<span class="hljs-number">2</span>, <span class="hljs-number">3</span>)],      
                                 <span class="hljs-keyword">function</span>(x) as.numeric(gsub(<span class="hljs-string">"[^0-9.-]"</span>, <span class="hljs-string">""</span>, x)))
</code></pre>
<p>In this code, the first line uses <code>gsub()</code> to <strong>remove every space-plus-period sequence</strong> (the OCR’d ellipsis dots in the job titles). The second line uses <code>gsub()</code> to <strong>remove all non-numeric characters</strong> from the two number columns, and then converts them to numeric type with <code>as.numeric()</code>. (If you’re wondering, one could use tidyverse functions and/or base R pipes. I default to “very” base R, but everything in R has multiple solutions!)</p>
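<p>As a quick illustration of what that second <code>gsub()</code> pattern does (the input string here is made up, not a row from the table):</p>
<pre><code class="lang-r"># "[^0-9.-]" matches anything that is NOT a digit, period, or hyphen
as.numeric(gsub("[^0-9.-]", "", " $12,345 "))   # returns 12345
</code></pre>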
<p>So far, our data look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741226033851/7ade9278-a441-4a71-baf9-b5a6c98b5d47.png" alt class="image--center mx-auto" /></p>
<p>Next we’ll <strong>group</strong> all of the Accountants together, Chemists together, and so forth. For the first step, we split each job title where there is a space + parenthesis (“ <code>(</code>“) to separate the parenthetically noted GS scales, and keep only the text that appears before that point, such as “Accountants I”. To do that, I apply <code>str_split()</code> from the <code>stringr</code> package to each job title, using an anonymous function that retains only the first part (<code>[1]</code>) of each split string:</p>
<pre><code class="lang-r"><span class="hljs-keyword">library</span>(stringr)
<span class="hljs-comment"># Split on "(" [job levels] and only keep the descriptive part before that</span>
job.data$JobGroup &lt;- sapply(job.data$Job, 
                              <span class="hljs-keyword">function</span>(x) str_split(x, <span class="hljs-string">" \\("</span>, simplify = <span class="hljs-literal">TRUE</span>)[<span class="hljs-number">1</span>])
</code></pre>
<p>Next I <strong>remove the Roman numerals</strong> (“I”, “IV” etc.) where they appear, and apply <code>trimws()</code> to strip off any left over white space (thanks to Ben Bolker for the regex in <a target="_blank" href="https://stackoverflow.com/questions/64595514/r-remove-roman-numerals-from-column">this post</a>).</p>
<pre><code class="lang-r">job.data$JobGroup &lt;- trimws(gsub(<span class="hljs-string">'([IVXLCM]+)\\.?$'</span>,<span class="hljs-string">''</span>, job.data$JobGroup))
</code></pre>
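<p>Here is how those two steps behave on a single made-up job title (the format mimics the table, but this exact string is for illustration only):</p>
<pre><code class="lang-r">library(stringr)
x &lt;- "Accountants IV (GS-12)"                   # hypothetical title, for illustration
x &lt;- str_split(x, " \\(", simplify = TRUE)[1]   # keeps "Accountants IV"
trimws(gsub('([IVXLCM]+)\\.?$', '', x))         # drops the numeral: "Accountants"
</code></pre>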
<p>While inspecting the data, I see 3 rows where OCR errors caused problems for the <code>JobGroup</code>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741226604873/1ea608e7-57ee-450c-97cf-8bf24b44ba52.png" alt class="image--center mx-auto" /></p>
<p>So I <strong>fix those manually</strong>:</p>
<pre><code class="lang-r">job.data[c(<span class="hljs-number">96</span>, <span class="hljs-number">106</span>, <span class="hljs-number">107</span>), <span class="hljs-string">"JobGroup"</span>] &lt;- c(<span class="hljs-string">"Purchasing assistants"</span>, <span class="hljs-string">"Typists"</span>, <span class="hljs-string">"Typists"</span>)
</code></pre>
<p>Finally, I convert <code>JobGroup</code> to a factor (nominal) variable and review the result:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Make the result into a factor variable</span>
job.data$JobGroup &lt;- factor(job.data$JobGroup)
<span class="hljs-comment"># check the data</span>
summary(job.data)
head(job.data, <span class="hljs-number">8</span>)
</code></pre>
<p><strong>It’s working</strong> and we see that all of the Accountants are grouped (and Chief accountants are a different group — which is OK for now), and so forth:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741226750297/8707723c-f09c-46c8-8668-678cbffc35b3.png" alt class="image--center mx-auto" /></p>
<p>Next we’ll compute the total employment and average salary per <code>JobGroup</code>.</p>
<blockquote>
<p><em>For an important project, I’d do a deeper review to make sure I caught all of the data issues. For instance, I see some potential OCR errors like “GS-2” for "Accountants II”. For purposes of this post, I’ll simply go ahead and use the data as extracted to this point.</em></p>
</blockquote>
<hr />
<h2 id="heading-grouping-the-data">Grouping the Data</h2>
<p>Now that we have assigned job groups, we need to <strong>aggregate them for total employment</strong> (sum of the N column per group) <strong>and average salary</strong> (taking the weighted average of Salary x N).</p>
<p>The tricky part is <strong>how to apply the</strong> <code>weighted.mean()</code> <strong>function in R across multiple groups</strong>. As always, there are multiple options (tidyverse, <code>by()</code>, a custom function, etc.) but a <a target="_blank" href="https://stackoverflow.com/questions/33692439/using-aggregate-to-compute-monthly-weighted-average">simple base R solution</a> is to use an index column inside <code>aggregate()</code>. Here’s the code:</p>
<pre><code class="lang-r"><span class="hljs-comment"># tip from https://stackoverflow.com/questions/33692439/using-aggregate-to-compute-monthly-weighted-average </span>
job.data$row  &lt;- <span class="hljs-number">1</span>:nrow(job.data)
<span class="hljs-comment"># get the weighted mean salaries per group</span>
job.sum       &lt;- aggregate(row ~ JobGroup, job.data, 
                           <span class="hljs-keyword">function</span>(i) weighted.mean(job.data$Salary[i], job.data$N[i])) 
<span class="hljs-comment"># rename the "row" column</span>
names(job.sum)[<span class="hljs-number">2</span>] &lt;- <span class="hljs-string">"AvgSalary"</span>
</code></pre>
<p>In this code, <code>aggregate()</code> passes the <code>row</code> numbers belonging to each <code>JobGroup</code> to the anonymous function, which then computes the <code>weighted.mean()</code> of the salaries in those rows, weighted by their job counts.</p>
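<p>A quick way to convince yourself that the indexing trick worked is to recompute one group directly and compare it to <code>job.sum</code>. A small check, using the Accountants group (any group would do):</p>
<pre><code class="lang-r"># manual check for one group: the two values should match
acct &lt;- subset(job.data, JobGroup == "Accountants")
weighted.mean(acct$Salary, acct$N)
job.sum$AvgSalary[job.sum$JobGroup == "Accountants"]
</code></pre>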
<p>The next step is to sum up the number of jobs in each JobGroup, which is a simple <code>aggregate(…, sum)</code>:</p>
<pre><code class="lang-r"><span class="hljs-comment"># find the total number of jobs per group</span>
job.sum$Total &lt;- aggregate(N ~ JobGroup, job.data, sum)$N
</code></pre>
<p>That’s it! Now we have <strong>data aggregated by job group with counts and weighted average salaries</strong>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741230488782/ada9e699-63d0-470e-854d-8fec05969bf5.png" alt class="image--center mx-auto" /></p>
<p>Next I’ll plot those.</p>
<hr />
<h2 id="heading-making-an-opportunity-plot">Making an Opportunity Plot</h2>
<p>A simple thought is that the “best jobs” are those that have the highest combination of availability and salary. (I’ll leave aside the question of <em>future-looking</em> availability and salary!) A <strong>2×2 interpretation of a scatter plot</strong> is one way to examine that.</p>
<p>I plot <strong>average salary vs. job count</strong> in the BLS data as follows:</p>
<pre><code class="lang-r"><span class="hljs-keyword">library</span>(ggplot2)
<span class="hljs-keyword">library</span>(ggrepel)
<span class="hljs-keyword">library</span>(scales)
p &lt;- ggplot(data=job.sum, 
            aes(x=Total, y=AvgSalary, label=JobGroup)) +
  geom_point(color=<span class="hljs-string">"red"</span>) +
  geom_text_repel() +
  scale_x_log10(labels = label_number()) +
  xlab(<span class="hljs-string">"Total Employment (log scale)"</span>) +
  ylab(<span class="hljs-string">"Average Salary, weighted across levels"</span>) +
  theme_minimal() +
  ggtitle(<span class="hljs-string">"Salary vs. Employment, US Statistics for 1985"</span>)

p
</code></pre>
<p>Here are a few notes on that code. First, it uses <code>ggrepel</code> (Slowikowski, 2024) to place the <strong>text labels</strong> (job groups) in readable positions. Second, it puts the X axis on a <strong>log scale</strong> for compactness; otherwise the engineers group would distort it substantially. Third, it uses <code>scales</code> (Wickham et al., 2023) to label the X axis more legibly. Finally, <code>theme_minimal()</code> removes some chart junk.</p>
<p>Here’s the result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741230905633/dc3f8a14-3208-450c-912b-bb7363f59777.png" alt class="image--center mx-auto" /></p>
<p>We see <strong>two particularly interesting groups</strong>. In the upper right, there is a group of relatively common to very common professions that were also relatively highly paid in 1985: Engineers, Systems analysts, and Chemists. In the upper left, there is a group of less common professions that were even more highly paid: Chief accountants, Directors of personnel, and Attorneys. (All of that is likely still true today!)</p>
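<p>To make the 2×2 reading explicit, one option is to overlay reference lines at the medians of the two axes. Here is a minimal sketch that assumes the <code>p</code> object built above; splitting at the medians is my own choice for illustration, not anything inherent in the BLS data:</p>
<pre><code class="lang-r"># add dashed median lines so the scatter plot reads as four quadrants
p +
  geom_vline(xintercept = median(job.sum$Total),     linetype = "dashed") +
  geom_hline(yintercept = median(job.sum$AvgSalary), linetype = "dashed")
</code></pre>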
<p>What would we do for <strong>further analysis</strong>? First, I’d want to include <strong>additional data</strong> that (presumably) is in other BLS data sets. For example, <em>health care and education professionals</em>, among others, are not included here. Second, we might use their data (page 1 in the PDF) on <strong>trends</strong> and changes to project forward.</p>
<p>Third, within the data set, we could consider some sort of implied <strong>transition among career levels</strong> to look at more of a <em>career-spanning</em> set of expectations. For example, the movement from one level to a higher one, within a job group, might be modeled as a <strong>Markov chain</strong> (somehow combining data across years, and/or within a data set, and/or using the change metrics — problems one would want to consider). Fourth, we could <strong>adjust salary levels</strong> to make them comparable across years, using inflation-adjusted values or the like.</p>
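<p>For the Markov chain idea, the basic machinery is compact: a matrix of level-to-level transition probabilities, multiplied repeatedly to project careers forward. <em>The probabilities below are invented solely to show the mechanics</em>; real values would need to be estimated from longitudinal data:</p>
<pre><code class="lang-r"># Hypothetical annual transitions among three job levels (each row sums to 1)
P &lt;- matrix(c(0.80, 0.15, 0.05,
              0.00, 0.85, 0.15,
              0.00, 0.00, 1.00),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("I", "II", "III"), c("I", "II", "III")))

# distribution across levels after 5 years for someone starting at Level I
start &lt;- c(1, 0, 0)
start %*% P %*% P %*% P %*% P %*% P
</code></pre>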
<hr />
<h2 id="heading-back-to-1985">Back to 1985</h2>
<p><strong>What about my choices in 1985?</strong> My personal college major was not influenced at all by expectation of salary or availability! I completed a double major in <em>Psychology and Comparative Religion</em>, and went (initially) to graduate school in <em>Philosophy</em>. That all reflected personal interest and a planned academic career. Eventually I changed to graduate school in <em>Clinical Psychology</em>. Data such as these would have been of no use or concern to 17 year old me (except as a matter of curiosity — which I still have).</p>
<p>What I can say, 40 years later, is this: I ended up in an uncommon profession (Quant UX Research) that is also highly paid — and was nowhere to be found in the world or data of 1985! My completely <em>non-job-related</em> education was perfect preparation in my case. I’ll write more about that another time.</p>
<p>I would, at a first approximation, do exactly the same thing again. But the data are still interesting!</p>
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>This post demonstrated <strong>how to get data from a PDF — even a messy, slanted, older, erratically-OCR’d, scanned PDF — into R</strong>. And that let us have some fun (at least in my opinion) to look at jobs from way back when I started college.</p>
<p>Cheers and I hope you find it useful!</p>
<hr />
<h3 id="heading-appendix-notes-on-installing-tabulapdf-and-its-java-dependencies"><em>Appendix: Notes on installing</em> <code>tabulapdf</code> <em>and its Java dependencies</em></h3>
<p>As I noted above, <code>tabulapdf</code> requires Java — both the runtime environment (aka “Java”, aka the “JRE”) and the development kit (JDK) — and the integration of those with R using <code>rJava</code> (Urbanek 2024). Installing those may be tricky depending on the details of your system.</p>
<p>On my M2 Macbook, here’s what I did. The <code>rJavaEnv</code> package (Kotov, 2024) was a particularly nice &amp; helpful touch to install the JDK and get everything set up inside R. The steps:</p>
<pre><code class="lang-r"><span class="hljs-comment">### Appendix: setting up rJava on Mac OS X machine</span>

<span class="hljs-comment"># This is the sequence I used on MacBook with M2, may vary with other systems</span>

<span class="hljs-comment"># A1. Install Java runtime (if needed; usually is)</span>
<span class="hljs-comment"># ... https://www.java.com/en/download/</span>
<span class="hljs-comment"># ... installed relevant JRE (Java 8 Update 441, for Mac 64-bit ARM)</span>

<span class="hljs-comment"># A2. install the rJava library in R</span>
install.packages(<span class="hljs-string">"rJava"</span>)
<span class="hljs-comment"># check that rJava is working:</span>
<span class="hljs-keyword">library</span>(<span class="hljs-string">"rJava"</span>)

<span class="hljs-comment"># A3. install Java development kit using rJavaEnv in R</span>
<span class="hljs-comment">#    (rJavaEnv helps get everything working correctly)</span>
install.packages(<span class="hljs-string">"rJavaEnv"</span>)
<span class="hljs-keyword">library</span>(rJavaEnv)
java_quick_install()        <span class="hljs-comment"># gets and installs the JDK, then sets system pointers</span>
java_check_version_rjava()  <span class="hljs-comment"># should show "21" + "21.0.6" or some similar/later version</span>

<span class="hljs-comment"># A4. install Tabula PDF library and dependencies needed for interactive usage in R</span>
install.packages(<span class="hljs-string">"tabulapdf"</span>)
install.packages(c(<span class="hljs-string">"shiny"</span>, <span class="hljs-string">"miniUI"</span>))

<span class="hljs-comment"># A5. quit R / RStudio, reboot Mac to make sure everything is squared away and reloaded</span>

<span class="hljs-comment"># A6. test installation using tabulapdf data set</span>
<span class="hljs-keyword">library</span>(rJava)
<span class="hljs-keyword">library</span>(tabulapdf)
<span class="hljs-comment"># the following code is updated from tabulapdf vignette</span>
f   &lt;- system.file(<span class="hljs-string">"examples"</span>, <span class="hljs-string">"mtcars.pdf"</span>, package = <span class="hljs-string">"tabulapdf"</span>)
out &lt;- extract_tables(f)
str(out)
out[[<span class="hljs-number">1</span>]]   <span class="hljs-comment"># the "mtcars" data</span>
</code></pre>
<p>If you run into issues, my main recommendations are (1) to update everything, (2) to make sure all installers (Java, R, and RStudio) use matching CPU builds (e.g., 64-bit ARM), (3) to reinstall and reboot, and (4) to search for answers online.</p>
<p>I will note that there is not a lot of help online for debugging Java integration with R. It’s a bit of an edge case, although a highly useful one as we’ve seen.</p>
<hr />
<h2 id="heading-all-the-code">All the Code</h2>
<p>As always, I share all of my R code in one place for simplicity. Here it is, including the Appendix:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Using R to parse 1985 jobs stats in a PDF</span>
<span class="hljs-comment"># Chris Chapman, March 2025</span>
<span class="hljs-comment"># for quantuxblog.com</span>

<span class="hljs-comment"># <span class="hljs-doctag">NOTE:</span> the tabulapdf package requires Java; see appendix for setup notes</span>

<span class="hljs-comment"># 1. load BLS data for 1985</span>
<span class="hljs-comment"># get job data from US Bureau of Labor Statistics</span>
<span class="hljs-comment"># note: was working as of March 5, 2025</span>
filename &lt;- <span class="hljs-string">"~/Downloads/bls-report-1985.pdf"</span>   <span class="hljs-comment"># update for your folder and filename</span>
<span class="hljs-keyword">library</span>(tabulapdf)                              <span class="hljs-comment"># see Appendix for installation notes</span>
tables &lt;- extract_areas(filename, pages=c(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>))
tables[[<span class="hljs-number">1</span>]]

<span class="hljs-comment"># 2. convert those to usable R format as a single data set</span>
<span class="hljs-comment">#    get the 3 main columns; bind them to 1 data frame</span>
df1 &lt;- data.frame(tables[[<span class="hljs-number">1</span>]])[ , c(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>, <span class="hljs-number">12</span>)]
df2 &lt;- data.frame(tables[[<span class="hljs-number">2</span>]])[ , c(<span class="hljs-number">1</span>,  <span class="hljs-number">7</span>,  <span class="hljs-number">8</span>)]
names(df1) &lt;- names(df2) &lt;- c(<span class="hljs-string">"Job"</span>, <span class="hljs-string">"N"</span>, <span class="hljs-string">"Salary"</span>)
job.data &lt;- na.omit(rbind(df1, df2))   <span class="hljs-comment"># removes header rows from the tables</span>

<span class="hljs-comment"># 2.1 clean up the data to remove spaces, etc.</span>
<span class="hljs-comment"># clean up ellipses in the job names</span>
job.data[ , <span class="hljs-number">1</span>]       &lt;- gsub(<span class="hljs-string">" \\."</span>, <span class="hljs-string">""</span>, job.data[ , <span class="hljs-number">1</span>])  
<span class="hljs-comment"># remove nuisance spaces, etc., and convert numeric columns to numbers</span>
job.data[ , c(<span class="hljs-number">2</span>, <span class="hljs-number">3</span>)] &lt;- lapply(job.data[ , c(<span class="hljs-number">2</span>, <span class="hljs-number">3</span>)],      
                                 <span class="hljs-keyword">function</span>(x) as.numeric(gsub(<span class="hljs-string">"[^0-9.-]"</span>, <span class="hljs-string">""</span>, x)))
job.data

<span class="hljs-comment"># 2.2 create job group factor that collapses the levels</span>
<span class="hljs-keyword">library</span>(stringr)
<span class="hljs-comment"># Split on "(" [job levels] and only keep the descriptive part before that</span>
job.data$JobGroup &lt;- sapply(job.data$Job, 
                              <span class="hljs-keyword">function</span>(x) str_split(x, <span class="hljs-string">" \\("</span>, simplify = <span class="hljs-literal">TRUE</span>)[<span class="hljs-number">1</span>])
<span class="hljs-comment"># Remove the Roman numerals from the descriptions (another part of job levels)</span>
job.data$JobGroup &lt;- trimws(gsub(<span class="hljs-string">'([IVXLCM]+)\\.?$'</span>,<span class="hljs-string">''</span>, job.data$JobGroup))

<span class="hljs-comment"># And finally, fix 3 extraction errors manually</span>
job.data[c(<span class="hljs-number">96</span>, <span class="hljs-number">106</span>, <span class="hljs-number">107</span>), ]
job.data[c(<span class="hljs-number">96</span>, <span class="hljs-number">106</span>, <span class="hljs-number">107</span>), <span class="hljs-string">"JobGroup"</span>] &lt;- c(<span class="hljs-string">"Purchasing assistants"</span>, <span class="hljs-string">"Typists"</span>, <span class="hljs-string">"Typists"</span>)
<span class="hljs-comment"># Make the result into a factor variable</span>
job.data$JobGroup &lt;- factor(job.data$JobGroup)
<span class="hljs-comment"># check the data</span>
summary(job.data)
head(job.data, <span class="hljs-number">8</span>)

<span class="hljs-comment"># 3. get the total employment and weighted salaries per group</span>
<span class="hljs-comment"># create a row counter we can use to index the data when aggregating by group</span>
<span class="hljs-comment"># thanks for this tip, https://stackoverflow.com/questions/33692439/using-aggregate-to-compute-monthly-weighted-average </span>
job.data$row  &lt;- <span class="hljs-number">1</span>:nrow(job.data)
<span class="hljs-comment"># get the weighted mean salaries per group</span>
job.sum       &lt;- aggregate(row ~ JobGroup, job.data, 
                           <span class="hljs-keyword">function</span>(i) weighted.mean(job.data$Salary[i], job.data$N[i])) 
<span class="hljs-comment"># rename the "row" column</span>
names(job.sum)[<span class="hljs-number">2</span>] &lt;- <span class="hljs-string">"AvgSalary"</span>
<span class="hljs-comment"># find the total number of jobs per group</span>
job.sum$Total &lt;- aggregate(N   ~ JobGroup, job.data, sum)$N

head(job.sum)


<span class="hljs-comment"># 4. plot "opportunity" as # jobs vs average salary</span>
<span class="hljs-keyword">library</span>(ggplot2)
<span class="hljs-keyword">library</span>(ggrepel)
<span class="hljs-keyword">library</span>(scales)
p &lt;- ggplot(data=job.sum, 
            aes(x=Total, y=AvgSalary, label=JobGroup)) +
  geom_point(color=<span class="hljs-string">"red"</span>) +
  geom_text_repel() +
  scale_x_log10(labels = label_number()) +
  xlab(<span class="hljs-string">"Total Employment (log scale)"</span>) +
  ylab(<span class="hljs-string">"Average Salary, weighted across levels"</span>) +
  theme_minimal() +
  ggtitle(<span class="hljs-string">"Salary vs. Employment, US Statistics for 1985"</span>)

p

<span class="hljs-comment">### Appendix: setting up rJava on Mac OS X machine</span>

<span class="hljs-comment"># This is the sequence I used on MacBook with M2, may vary with other systems</span>

<span class="hljs-comment"># A1. Install Java runtime (if needed; usually is)</span>
<span class="hljs-comment"># ... https://www.java.com/en/download/</span>
<span class="hljs-comment"># ... installed relevant JRE (Java 8 Update 441, for Mac 64-bit ARM)</span>

<span class="hljs-comment"># A2. install the rJava library in R</span>
install.packages(<span class="hljs-string">"rJava"</span>)
<span class="hljs-comment"># check that rJava is working:</span>
<span class="hljs-keyword">library</span>(<span class="hljs-string">"rJava"</span>)

<span class="hljs-comment"># A3. install Java development kit using rJavaEnv in R</span>
<span class="hljs-comment">#    (rJavaEnv helps get everything working correctly)</span>
install.packages(<span class="hljs-string">"rJavaEnv"</span>)
<span class="hljs-keyword">library</span>(rJavaEnv)
java_quick_install()        <span class="hljs-comment"># gets and installs the JDK, then sets system pointers</span>
java_check_version_rjava()  <span class="hljs-comment"># should show "21" + "21.0.6" or some similar/later version</span>

<span class="hljs-comment"># A4. install Tabula PDF library and dependencies needed for interactive usage in R</span>
install.packages(<span class="hljs-string">"tabulapdf"</span>)
install.packages(c(<span class="hljs-string">"shiny"</span>, <span class="hljs-string">"miniUI"</span>))

<span class="hljs-comment"># A5. quit R / RStudio, reboot Mac to make sure everything is squared away and reloaded</span>

<span class="hljs-comment"># A6. test installation using tabulapdf data set</span>
<span class="hljs-keyword">library</span>(rJava)
<span class="hljs-keyword">library</span>(tabulapdf)
<span class="hljs-comment"># the following code is updated from tabulapdf vignette</span>
f   &lt;- system.file(<span class="hljs-string">"examples"</span>, <span class="hljs-string">"mtcars.pdf"</span>, package = <span class="hljs-string">"tabulapdf"</span>)
out &lt;- extract_tables(f)
str(out)
out[[<span class="hljs-number">1</span>]]   <span class="hljs-comment"># the "mtcars" data</span>
</code></pre>
<hr />
<h2 id="heading-references">References</h2>
<p>Chang W, Cheng J, Allaire J, Sievert C, Schloerke B, Xie Y, Allen J, McPherson J, Dipert A, Borges B (2024). <em>shiny: Web Application Framework for R</em>. R package version 1.10.0, <a target="_blank" href="https://CRAN.R-project.org/package=shiny">https://CRAN.R-project.org/package=shiny</a>.</p>
<p>Cheng J (2018). <em>miniUI: Shiny UI Widgets for Small Screens</em>. R package version 0.1.1.1, <a target="_blank" href="https://CRAN.R-project.org/package=miniUI">https://CRAN.R-project.org/package=miniUI</a>.</p>
<p>Kotov E (2024). <em>rJavaEnv: Java Environments for R Projects</em>. doi:10.32614/CRAN.package.rJavaEnv <a target="_blank" href="https://doi.org/10.32614/CRAN.package.rJavaEnv">https://doi.org/10.32614/CRAN.package.rJavaEnv</a>, <a target="_blank" href="https://github.com/e-kotov/rJavaEnv">https://github.com/e-kotov/rJavaEnv</a>.</p>
<p>Prieser C (1985). “Occupational salary levels for white-collar workers, 1985”. <em>Monthly Labor Review,</em> Bureau of Labor Statistics, October 1985. At <a target="_blank" href="https://www.bls.gov/opub/mlr/1985/10/rpt1full.pdf">https://www.bls.gov/opub/mlr/1985/10/rpt1full.pdf</a>, retrieved March 5, 2025.</p>
<p>R Core Team (2025). <em>R: A Language and Environment for Statistical Computing</em>. R Foundation for Statistical Computing, Vienna, Austria. <a target="_blank" href="https://www.R-project.org/">https://www.R-project.org/</a>.</p>
<p>Sepulveda MV (2024). <em>tabulapdf: Extract Tables from PDF Documents</em>. <a target="_blank" href="https://github.com/ropensci/tabulapdf">https://github.com/ropensci/tabulapdf</a>.</p>
<p>Slowikowski K (2024). <em>ggrepel: Automatically Position Non-Overlapping Text Labels with 'ggplot2'</em>. R package version 0.9.5, <a target="_blank" href="https://CRAN.R-project.org/package=ggrepel">https://CRAN.R-project.org/package=ggrepel</a>.</p>
<p>Urbanek S (2024). <em>rJava: Low-Level R to Java Interface</em>. R package version 1.0-11, <a target="_blank" href="https://CRAN.R-project.org/package=rJava">https://CRAN.R-project.org/package=rJava</a>.</p>
<p>Wickham H (2016). <em>ggplot2: Elegant Graphics for Data Analysis</em>. Springer-Verlag New York.</p>
<p>Wickham H, Pedersen T, Seidel D (2023). <em>scales: Scale Functions for Visualization</em>. R package version 1.3.0, <a target="_blank" href="https://CRAN.R-project.org/package=scales">https://CRAN.R-project.org/package=scales</a>.</p>
<p>Wickham H (2023). <em>stringr: Simple, Consistent Wrappers for Common String Operations</em>. R package version 1.5.1, <a target="_blank" href="https://CRAN.R-project.org/package=stringr">https://CRAN.R-project.org/package=stringr</a>.</p>
<p><a target="_blank" href="https://notbyai.fyi"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746999945541/9d224843-9e9f-44cc-98c5-276915794420.png" alt class="image--center mx-auto" /></a></p>
]]></content:encoded></item><item><title><![CDATA[Individual Scores in Choice Models, Part 4: Inspecting Model Fit with RLH]]></title><description><![CDATA[I expect this to be the last post of my “Individual Scores in Choice Models” series. To recap the series so far:

Post 1 discussed the data — real data on UXRs’ preferences among Quant classes on a MaxDiff survey — and it reviewed the stack rank of p...]]></description><link>https://quantuxblog.com/individual-scores-in-choice-models-part-4-inspecting-model-fit-with-rlh</link><guid isPermaLink="true">https://quantuxblog.com/individual-scores-in-choice-models-part-4-inspecting-model-fit-with-rlh</guid><category><![CDATA[quantux]]></category><category><![CDATA[conjoint]]></category><category><![CDATA[maxdiff]]></category><category><![CDATA[R Language]]></category><category><![CDATA[survey research]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Tue, 14 Jan 2025 15:22:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/wF0dXfsxY8g/upload/904ef1285fa61efc5d45dad29bf32845.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I expect this to be the last post of my “Individual Scores in Choice Models” series. To recap the series so far:</p>
<ul>
<li><p><a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages">Post 1 discussed the data</a> — <em>real</em> data on UXRs’ preferences among Quant classes on a MaxDiff survey — and it reviewed the stack rank of preferences and the individual <em>distribution</em> of preferences.</p>
</li>
<li><p><a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-2-correlations-among-items">Post 2 examined patterns of correlation</a> among UXRs’ class preferences.</p>
</li>
<li><p><a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-3-respondent-segments">Post 3 discussed one process to look for useful Segments</a> among respondents.</p>
</li>
</ul>
<p><strong>In this 4th post, I examine the Root Likelihood (RLH) fit measure</strong>, which summarizes how well MaxDiff (or Conjoint) utilities fit the observed data from survey respondents. It’s a long post, partly because the concept needs explanation, but also because I want to share closely-related ideas along the way.</p>
<p>This is only a partial discussion of quality assessment overall, because <strong>there are <em>many</em> ways to review data quality and model fit</strong>. You should add RLH inspection to other methods you might already use, such as identifying speeders. (However, I have warnings below about filtering respondents!)</p>
<p>As always, I share R code along the way. You can follow along live with the actual data — thanks to the generous sharing of anonymized data by the <a target="_blank" href="https://quantuxcon.org">Quant UX Association</a>!</p>
<hr />
<h2 id="heading-get-the-data-to-follow-along">Get the Data to Follow Along</h2>
<p>The data here are <strong>individual estimates of interest in Quant UX training classes for N=308</strong> mostly UXR respondents, as assessed on a MaxDiff survey. You can find the details in <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages">Post 1</a>.</p>
<p>The following code loads the data needed for this post, and renames the variables to be shorter &amp; friendly:</p>
<pre><code class="lang-r"><span class="hljs-comment"># get the data; repeating here for blog post 4, see post 1 for details</span>
<span class="hljs-keyword">library</span>(openxlsx)   <span class="hljs-comment"># install if needed, as with all package calls</span>
md.dat &lt;- read.xlsx(<span class="hljs-string">"https://quantuxbook.com/misc/QUX%20Survey%202024%20-%20Future%20Classes%20-%20MaxDiff%20Individual%20raw%20scores.xlsx"</span>)   <span class="hljs-comment"># </span>
md.dat$Anchor &lt;- <span class="hljs-literal">NULL</span>
names(md.dat)[<span class="hljs-number">3</span>:<span class="hljs-number">16</span>] &lt;- c(<span class="hljs-string">"Choice Models"</span>,  <span class="hljs-string">"Surveys"</span>,       <span class="hljs-string">"Log Sequences"</span>,    <span class="hljs-string">"Psychometrics"</span>, 
                         <span class="hljs-string">"R Programming"</span>,  <span class="hljs-string">"Pricing"</span>,       <span class="hljs-string">"UX Metrics"</span>,       <span class="hljs-string">"Bayes Stats"</span>,
                         <span class="hljs-string">"Text Analytics"</span>, <span class="hljs-string">"Causal Models"</span>, <span class="hljs-string">"Interviewer-ing"</span>,  <span class="hljs-string">"Advanced Choice"</span>, 
                         <span class="hljs-string">"Segmentation"</span>,   <span class="hljs-string">"Metrics Sprints"</span>)
classCols &lt;- <span class="hljs-number">3</span>:ncol(md.dat)    <span class="hljs-comment"># generally, Sawtooth exported utilities start in column 3</span>
</code></pre>
<hr />
<h2 id="heading-model-fit-background-what-is-rlh">Model Fit Background: What is RLH?</h2>
<p><em>[This section is long because I explain RLH conceptually, using simple examples. If you already understand RLH (or don’t care!), skip to the next section.]</em></p>
<p>RLH stands for “<em>Root Likelihood</em>” and it expresses the degree to which a fitted model matches observed data from respondents. RLH is usually calculated for observations in the <em>training</em> data (those for whom a model was fit). That’s because those are the observations for which we have individual level estimates.</p>
<blockquote>
<p>Side note: in principle we could calculate RLH for <em>any</em> set of observations given a set of utility scores. For instance, we might calculate how well new observations match the utilities for a particular segment or an overall sample. That is rarely done in my experience, so I’ll just note it and not explore further.</p>
</blockquote>
<p>Let’s break down the term “root likelihood” to see what is going on. First: "<strong>likelihood</strong>”. For a single observation, the <em>likelihood</em> is the odds that the single observation would have occurred, according to our model. For example, suppose we have a MaxDiff set of items A, B, C, and D. Further imagine that we have the following utility scores: Item A is preferred for a particular individual, with a utility of +1.0, while items B, C, and D have utilities +0.5, -0.8, and -1.2 respectively.</p>
<p>Given those utilities, and under the MNL (multinomial logit) share-of-preference formula, the preference share for item A for this respondent, as calculated in R syntax, is <code>exp(1.0) / sum(exp(c(1.0, 0.5, -0.8, -1.2)))</code> or <code>0.53</code>. That means that on a task showing A, B, C, and D, the respondent is estimated to have a 53% chance of choosing their most preferred item, Item A. That leaves an estimated 47% chance that they would choose one of the other options (most likely Item B, but potentially C or D).</p>
<p>If the respondent did in fact choose Item A, then the <strong>likelihood</strong> value of that observation is 0.53 (the relevant share of preference for that choice). In this case, the model is doing pretty well because it correctly predicted the most likely choice.</p>
<p>On the other hand, if they chose Item D, then the likelihood — according to the estimated utilities — would be <code>exp(-1.2) / sum(exp(c(1.0, 0.5, -0.8, -1.2)))</code> or a 5.9% chance and a likelihood value of 0.059. In this case, unlike the odds for Item A above, the model is <em>not</em> doing well because its “agreement” with the actual choice was only 0.059.</p>
<p>In short, the “likelihood” part of the term calculates the odds we should have seen each individual response, according to the model. When the value is higher, it means that the data we saw was more “likely” according to the model.</p>
<p>Next: “<strong>root</strong>”. Given multiple observations, we can calculate their <em>joint likelihood</em> by multiplying the separate odds. For example, suppose that we ask the exact same two tasks in a row as we listed above, and that our respondent chooses A on one, but D on the other task. The likelihood of those two observations together is <code>0.53 * 0.059</code> or <code>0.031</code><em>.</em> But that number is across two trials. To get a single estimate, we take its square <strong>root</strong>, or <code>sqrt(0.53 * 0.059)</code> which is 0.177. That means that the <strong>root likelihood</strong> (RLH) of our two observations, given the estimated model, is RLH = 0.177.</p>
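<p>To make those numbers concrete, here is a quick check in R. It is just a sketch of the arithmetic above (the item names and the <code>pws</code> vector are only for illustration); the same calculations appear again in the consolidated code at the end of this post:</p>
<pre><code class="lang-r"># utilities for the hypothetical items A, B, C, D
pws &lt;- c(A = 1.0, B = 0.5, C = -0.8, D = -1.2)

# likelihood of each choice under the MNL share-of-preference formula
exp(pws["A"]) / sum(exp(pws))     # if they chose A: ~0.53
exp(pws["D"]) / sum(exp(pws))     # if they chose D: ~0.059

# RLH across the two hypothetical tasks (A chosen once, D chosen once)
sqrt(0.53 * 0.059)                # ~0.177
</code></pre>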
<blockquote>
<p>Side note: you might notice that taking the simple product of multiple likelihood scores assumes that they are probabilities of independent events … but are the events here “independent”? Isn’t it the case that choosing A on one task is highly correlated with the odds of choosing A on another task?</p>
<p>The answer, conceptually speaking, is that dependence across tasks is already included in the model when it estimates the utility scores. Given the fact that we have estimates for A, B, C, D, and so forth, we can now ask, in effect, “what happens when we draw random samples of comparisons?” (the tasks that were given to the respondents). There is no dependence in how those tasks were drawn — they are independent of the model — so the probabilities can be considered independent for this purpose.</p>
<p>To be sure, one might pose higher-order questions about interactions, such as the degree to which the exact task order may influence responses, such that they interact with the model in a non-independent way. However, such questions rapidly become unanswerable — and most likely would have little effect on a model fit score. A general assumption of independence between tasks seems to work well.</p>
</blockquote>
<p>If we have <em>more than two</em> observations, as we usually would in a MaxDiff survey, we just take an appropriate fractional exponent instead of the square root. If we have 10 trials, we would multiply 10 separate odds and then take the 10th root (exponentiate to 1/10) of the product.</p>
<p>In R, such a calculation looks like this:</p>
<pre><code class="lang-r">(individualOdds &lt;- seq(<span class="hljs-number">0.01</span>, <span class="hljs-number">0.99</span>, length=<span class="hljs-number">12</span>))  <span class="hljs-comment"># placeholder for the likelihoods</span>
(rlh &lt;- prod(individualOdds) ^ (<span class="hljs-number">1</span>/length(individualOdds)))
</code></pre>
<p>In the first line of this code, I create a set of fake odds between 0.01 and 0.99, so we will have values to work with. In real data the odds would be calculated from utilities and observations as noted above. The second line is the key point: it computes the root likelihood. First it multiplies all the odds (<code>prod()</code>) and then takes the Xth root of them (<code>^(1/length(individualOdds))</code>).</p>
<blockquote>
<p><strong>Side note about that calculation using</strong> <code>prod()</code> for repeated multiplication: this is the <em>didactic &amp; conceptual</em> way to calculate RLH. See the code below for a <em>better computational version</em>.</p>
</blockquote>
<p>In case you’re wondering … <strong>no, you don’t have to calculate RLH on your own</strong> from the data. A platform like Sawtooth Software Discover or Lighthouse Studio will calculate RLH and include it automatically in the individual level data. I’m showing the calculations only for one reason: so you will know how it works!</p>
<p>By the way, <strong>RLH works exactly the same way for conjoint analysis</strong>. After estimating a conjoint utility model, we can calculate the odds of each observed choice based on those utilities, and then take the Xth root to calculate RLH.</p>
<hr />
<h2 id="heading-what-you-can-do-with-rlh">What You Can Do with RLH</h2>
<p>In general, RLH can be used in two ways:</p>
<ul>
<li><p>As a <strong>high level signal of quality</strong>, namely whether your data are markedly <em>suspect</em></p>
</li>
<li><p>As a way to <strong>filter out potentially non-informative respondents</strong></p>
</li>
</ul>
<p>As for quality, I <em>don’t</em> say that RLH indicates whether a data set or a respondent is “good.” No single metric can assess that. However, if RLH is markedly low — as we explore below — then it indicates potentially serious problems. In particular, if the median RLH in a sample falls below a reasonable threshold (keep reading) then you have a problem such as inconsistent respondents, a poor survey, or a poor model.</p>
<p>Beyond the high level diagnostic value, some analysts use RLH to filter out so-called “bad” respondents. Typically this is done by setting a minimum threshold for an acceptable RLH (again, keep reading!), removing respondents who are below that value, and then estimating the model again without them.</p>
<p>In practice, <strong>I almost never remove respondents due to RLH</strong> for four reasons:</p>
<ol>
<li><p>RLH does <strong>not</strong> identify whether responses are “bad”. It only says how <em>likely</em> they are, conditional on the model that was estimated from them. I discuss this below at some length, near the end of this post.</p>
</li>
<li><p>Instead of removing respondents, I try instead to <strong>use high quality data sources</strong>. I don’t believe analytics can “rescue” bad data sources.</p>
</li>
<li><p><strong>Choice models are robust</strong> to random answers, straight lining, speeding, and similar “bad” response patterns. So, if even a modestly large number of respondents are answering in those ways, it is unlikely to seriously affect the ultimate estimates (at least at the level of sample averages).</p>
</li>
<li><p>Any method to remove “bad” respondents <strong>introduces other sources of bias</strong>. As one minor example: why not remove excessively high RLH respondents who are “too good”? Again, I say more below.</p>
</li>
</ol>
<p>A <strong>counterargument</strong> to my position is the following: if a respondent’s RLH is near the value of “random” observations, then we are not learning much from them, so we might as well remove them from the model to be more precise and efficient. <em>For reasons I discuss below, that is not the complete story</em> and there is more to consider. We’ll get to that!</p>
<p>Meanwhile, before acting on RLH, we need to inspect it. We’ll do that next.</p>
<hr />
<h2 id="heading-examining-rlh-in-the-uxr-maxdiff-data">Examining RLH in the UXR MaxDiff Data</h2>
<p>Now that you know how RLH works and what you can do with it, let’s look at the values from the <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages">Quant UX Association MaxDiff</a> for N=308 respondents. Sawtooth’s data file helpfully includes the individual RLH values when it exports individual level data, so you don’t have to calculate anything.</p>
<p>Using the data set loaded above, we can set a friendly column name and start exploring:</p>
<pre><code class="lang-r">names(md.dat)[<span class="hljs-number">2</span>] &lt;- <span class="hljs-string">"RLH"</span>
summary(md.dat$RLH)
</code></pre>
<p>In our data, the MaxDiff RLH values range from 0.32 to 0.80, with a median of 0.60:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733946480229/743b6a78-34e7-4710-8113-15f5996b93bd.png" alt class="image--center mx-auto" /></p>
<p>As always, it’s important to plot the values. The <code>ggridges</code> package makes nice density plots with the <code>geom_density_ridges()</code> function, including an option to fill in the individual points (<code>jittered</code> so they don’t overlap under the curve):</p>
<pre><code class="lang-r"><span class="hljs-comment"># density plot of the RLH values</span>
<span class="hljs-keyword">library</span>(ggplot2)
<span class="hljs-keyword">library</span>(ggridges)
set.seed(<span class="hljs-number">98101</span>)    <span class="hljs-comment"># jittered points are slightly randomized</span>
p &lt;- ggplot(data=md.dat, aes(x=RLH, y=<span class="hljs-number">1</span>)) +
  geom_density_ridges(jittered_points=<span class="hljs-literal">TRUE</span>, 
                      alpha=<span class="hljs-number">0.5</span>, colour=<span class="hljs-string">"darkblue"</span>) +
  xlab(<span class="hljs-string">"Model fit score (RLH; &gt; 0.3 is good)"</span>) +
  ylab(<span class="hljs-string">"Relative prevalence (density)"</span>) +
  theme_minimal()

p
</code></pre>
<p>In the code above, I set <code>y=1</code> because <code>ggridges</code> generally assumes that you have multiple series to plot overlapping density curves, identified with a nominal y variable. In this case, we have only one series, so I set a single dummy y value.</p>
<p>Here’s the result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733946669522/bc756907-8cd6-48e7-83da-100b43246acb.png" alt class="image--center mx-auto" /></p>
<p>How do we interpret this chart? Well, I’ve already labelled the X axis with “<em>&gt; 0.3 is good</em>” … but the reason for that label will have to wait until the next section.</p>
<p>Accepting that cutoff value for a moment, we see a few things in the chart:</p>
<ul>
<li><p><strong>All of the individual fit values are “good”</strong> according to the purported cutoff value of 0.3.</p>
</li>
<li><p>The curve is <strong>nicely Gaussian (normal)</strong> in shape, which is a good thing … for reasons I won’t go deeply into. (<em>Hint</em>: it relates to assumptions of the HB model as well as random tasks, and assumptions of random variation across people.)</p>
</li>
<li><p>There is a small <strong>elevated tail on the lower end</strong> with RLH &lt; 0.4. Those come from 2% of respondents for whom the model didn’t fit as well. We don’t know exactly why not, but perhaps they sped through it, or were distracted, or didn’t find the items or topic as relevant as other respondents did.</p>
</li>
</ul>
<p>So we’ve seen the results and understood how they are calculated. <strong>But what is a “good” value for RLH?</strong> Is it OK that our median is 0.60? Why not 0.30 or 0.90?</p>
<p>In the next section, I discuss the threshold levels that make RLH “good” or not.</p>
<hr />
<h2 id="heading-are-those-values-good-or-bad-what-should-we-expect">Are Those Values Good or Bad? What should we Expect?</h2>
<p>In this section, I briefly discuss 2 heuristics that build intuition around RLH expectations. In following sections, I’ll look at a code-based analysis in R; and then mention another approach suggested by Sawtooth that uses random data.</p>
<p>For the <strong>first heuristic</strong>, a “good” value for RLH will be substantially <em>better than random chance</em>. For a MaxDiff (or Conjoint) task, <strong>one way</strong> to consider random chance would be if every item had the same likelihood of being chosen. For example, if a MaxDiff task has C=5 items on each screen, then we might say that each one has a random likelihood of 1/C = 1/5 = 0.20.</p>
<p>Put differently, our model should get 20% of the choices right merely by picking an item from 1-5 randomly. We want to do substantially better than that … let’s say we want to do “50% better.” That’s a heuristic cutoff but agrees with a lot of experience. <strong>Then we could set a minimum likelihood of RLH = 0.30 to be better than that form of random chance</strong>.</p>
<p>For the <strong>second heuristic</strong>, we note that the one above is a somewhat simplistic version of random chance. We might instead say, “we know that items will vary in preference, so there’s no way they will all have 1/C odds.” A different way to look at random chance incorporates that concept. This will help us look not at the <em>minimum</em> acceptable RLH (as above) but at what we might expect for a <em>good</em> RLH value.</p>
<p>Suppose we assume that 5 items will have relative preference of (1, 2, 3, 5, 8). I’m making that up (using the Fibonacci sequence) just as a heuristic example. However, it’s not an unreasonable set because real world results often find 1-3 items that are “winners” plus a long tail of less-preferred items.</p>
<p>In R, we could calculate the RLH expected given those odds. We will <em>assume</em> that the model knows those odds and that they perfectly match the items chosen on the tasks (i.e., that the respondent always chose the most-likely item). Here’s the code in R:</p>
<pre><code class="lang-r"><span class="hljs-comment"># back of the envelope</span>
<span class="hljs-comment"># pick some preference distribution and calculate odds</span>
itemOdds &lt;- c(<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">5</span>, <span class="hljs-number">8</span>)           <span class="hljs-comment"># relative preference for 5 items</span>
itemOdds &lt;- itemOdds / sum(itemOdds)   <span class="hljs-comment"># rescale so they sum to 1.0</span>
max(itemOdds)          <span class="hljs-comment"># relative odds for the most-preferred item</span>
</code></pre>
<p>That gives a value of <code>max(itemOdds) = 0.42</code>. So, under this one set of preferences, a model that correctly predicted the choice of the most-preferred item would have a likelihood, or RLH, of 0.42. That suggests heuristically that <strong>an RLH of approximately 2/C (2/5 = 0.40) may be pretty good</strong>.</p>
<blockquote>
<p>BTW, you might note that this also means that a perfect model, given those preferences, would be wrong in its predictions ~60% of the time. We can’t expect our models to be extremely accurate about point predictions exactly because people’s preferences are tendencies — they are not 0/1 values. For example, someone who would register for one class may also have a high chance of registering for another one. Choice models are helpful because they tell us what is more or less likely, on average, and how options compare to one another … not because they exactly predict individual events.</p>
</blockquote>
<p>Putting this value together with the previous minimum heuristic, where 1.5/C may be a lower cutoff, we could form a range bounded like this:</p>
<ul>
<li><p><strong>Rough cutoff</strong>: &lt; approx 1.5/C. For example, with 5 items shown at a time: 1.5/5 = 0.30. For 4 items shown at a time: 1.5/4 = approx 0.375 as a lower minimum.</p>
</li>
<li><p><strong>Clearly good</strong>: = approx 2/C. For example, with 5 items: 2/5 = approx 0.40.</p>
</li>
</ul>
<p>In other words, RLH is heuristically “good” when it is in or above the range of (0.30 — 0.40) when we test 5 items at a time.</p>
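<p>As a quick sanity check, here is that heuristic range computed in R for a couple of common task sizes (this is only back-of-the-envelope arithmetic, not an official cutoff):</p>
<pre><code class="lang-r"># heuristic RLH benchmarks by number of items shown per task (C)
cShown &lt;- c(4, 5)
data.frame(C           = cShown,
           roughCutoff = 1.5 / cShown,   # below this, investigate further
           clearlyGood = 2.0 / cShown)   # at or around this, fit looks good
</code></pre>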
<p>Are you thinking, “Wait! That’s a pretty low value. I would have expected a good model to have RLH of 0.80 or 0.90 or 0.99!”</p>
<p>You’re right — it is not a high value. We’ll see more about that below. Meanwhile, it is always important to remember that <strong>respondents usually find multiple items to be interesting</strong>, and might prefer one or another more or less randomly. For example, I like Dr. Pepper Zero soda … but a model of my preference is not 100% Dr. Pepper! I often drink other things, especially water, but also coffee, Gatorade, tea, Zevia soda, and so forth. That means <strong>we should not expect a model to do any better than to capture my most likely choice, at the level of its relative share</strong> — a likelihood much lower than 1.0.</p>
<p><strong>Key point: these heuristic results should help you build intuition about why RLH will not be especially high in real data</strong>.</p>
<p>You might wonder next, “<em>Can we quantify that better, not just using a thought exercise?</em>” We’ll do that systematically in the next section.</p>
<hr />
<h2 id="heading-deep-dive-r-code-to-simulate-rlh-values">Deep Dive: R Code to Simulate RLH Values</h2>
<p><em>This section is a moderately deep dive into R simulation! If you are satisfied with the heuristics in the previous section, you can skip it … unless you like R as much as I do.</em></p>
<p>The previous section described why RLH in the range of RLH &gt; roughly 1.5/C — which is to say, 0.30 for a MaxDiff with 5 items shown at a time — may be a reasonable minimum cutoff for a good respondent using heuristic logic. We also saw that RLH ~ 2/C — or 0.40 for 5 items — may indicate a particularly good fit between the observations and model. Now <strong>we’ll examine the expectation more systematically using code</strong>.</p>
<p>Here’s what I’m going to do. The short version is that I’ll do many, many random iterations of the kind of heuristic I showed above, for different sets of preference values. That will inform us as to the generally expected range of RLH values when the observations and data match perfectly.</p>
<p>More specifically, the code will simulate “respondent utilities and choice” as follows:</p>
<ul>
<li><p><strong>Draw utilities as random normal values</strong> with mean=0 and some standard deviation (I’ll come back to that). This mirrors the assumptions of the HB models used to estimate MaxDiff utilities, so the values will be realistic.</p>
</li>
<li><p><strong>Draw random sets of “items” from those</strong>, 5 at a time, to mimic the utilities that might apply to a particular MaxDiff task.</p>
</li>
<li><p>For each set of 5 item utilities, <strong>calculate the share of preference for the single most-preferred item</strong>. That gives us the highest possible likelihood value for that task (if the respondent always chose the most-preferred item, according to the model)</p>
</li>
<li><p>Do the above using <strong>K [“nItems”] = 14 items</strong> (because that matches the <em>number of items</em> on the Quant UX survey that provided our data)</p>
</li>
<li><p>For each simulated “respondent”, do this for <strong>C [“nShown”] = 5 items</strong> at a time on each screen, for <strong>S [“nScreens”] = 6 total screens</strong> (because that matches the <em>length</em> of the Quant UX survey).</p>
</li>
<li><p>Simulate all of the above a total <strong>N [“nIter”]=10000 times</strong> (“respondents”)</p>
</li>
</ul>
<p>Here is code that sets up the initial conditions for all of that. It also sets a random number seed to make the analysis repeatable:</p>
<pre><code class="lang-r">nIter    &lt;- <span class="hljs-number">10000</span>            <span class="hljs-comment"># how many times to sample nItem MaxDiff simulations</span>
nItems   &lt;- <span class="hljs-number">14</span>               <span class="hljs-comment"># number of items in a MaxDiff set</span>
nShown   &lt;- <span class="hljs-number">5</span>                <span class="hljs-comment"># number of items shown on a single screen</span>
nScreens &lt;- <span class="hljs-number">6</span>                <span class="hljs-comment"># number of screens in our survey</span>
rlhDraws &lt;- rep(<span class="hljs-literal">NA</span>, nIter)   <span class="hljs-comment"># hold the results for each iteration</span>
set.seed(<span class="hljs-number">98250</span>)              <span class="hljs-comment"># make it repeatable</span>
</code></pre>
<p><em>Technical note</em>: although MaxDiff utilities are always zero-centered (in their “raw” form estimated by an HB model), utilities vary from study to study and item to item in their <em>standard deviation</em> (SD). To draw random zero-centered, simulated utilities, we need to specify that variance. To get an SD value for this simulation, I look empirically at the Quant UX data and use the median SD (<code>drawSD</code>) for the items in that study:</p>
<pre><code class="lang-r">summary(unlist(lapply(qux.md[ , classCols], sd)))   <span class="hljs-comment"># answer: somewhere around sd ==(2.2, 2.6)</span>
(drawSD  &lt;- median(unlist(lapply(qux.md[ , classCols], sd))))
</code></pre>
<p>Now we’re all set up and ready to run the simulation. Following is the code. First I’ll give the whole chunk, and then I’ll discuss it below.</p>
<pre><code class="lang-r"><span class="hljs-keyword">for</span> (i <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:nIter) {
  <span class="hljs-comment"># set up 1 respondent "trial"</span>
  pwsResp     &lt;- rnorm(nItems, mean=<span class="hljs-number">0</span>, sd=drawSD) <span class="hljs-comment"># "nItems" random normal, zero-centered simulated part worths</span>
  pwsResp     &lt;- scale(pwsResp, scale=<span class="hljs-literal">FALSE</span>) <span class="hljs-comment"># recenter to make sure they're zero-sum</span>

  <span class="hljs-comment"># iterate over nScreen tasks per respondent</span>
  drawMax &lt;- rep(<span class="hljs-literal">NA</span>, nScreens*<span class="hljs-number">2</span>)             <span class="hljs-comment"># hold the results across the screens for one respondent</span>
  <span class="hljs-keyword">for</span> (j <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:nScreens) {
    pwsDrawn    &lt;- sample(pwsResp, nShown)   <span class="hljs-comment"># get the part worths for 1 simulated task  </span>
    <span class="hljs-comment"># "best" choice</span>
    pwsExp      &lt;- exp(pwsDrawn)             <span class="hljs-comment"># exponentiate those part worths</span>
    pwsMaxShare &lt;- max(pwsExp) / sum(pwsExp) <span class="hljs-comment"># our best (most likely) prediction would be the max utility item</span>
    drawMax[j*<span class="hljs-number">2</span> - <span class="hljs-number">1</span>]  &lt;- pwsMaxShare
    <span class="hljs-comment"># repeat for the "worst" choice</span>
    pwsExp        &lt;- exp(-<span class="hljs-number">1</span> * pwsDrawn)      <span class="hljs-comment"># exponentiate part worths for the "worst" choice direction</span>
    pwsMaxShare   &lt;- max(pwsExp) / sum(pwsExp)   
    drawMax[j*<span class="hljs-number">2</span>]  &lt;- pwsMaxShare
  }
  <span class="hljs-comment"># calculate RLH from those observations</span>
  rlh         &lt;- exp(sum(log(drawMax[drawMax &gt; <span class="hljs-number">0</span>]), na.rm=<span class="hljs-literal">TRUE</span>) / length(drawMax))    <span class="hljs-comment"># safer equivalent of "prod(drawMax) ^ (1 / length(drawMax))"</span>
  rlhDraws[i] &lt;- rlh
  <span class="hljs-comment"># add a bit of error checking, just in case of an off by one error etc :-/</span>
  <span class="hljs-keyword">if</span> (length(drawMax) != nScreens * <span class="hljs-number">2</span>) <span class="hljs-keyword">warning</span>(<span class="hljs-string">"The vector of partworths is off somewhere!"</span>, i, <span class="hljs-string">":"</span>, drawmax)
}
</code></pre>
<p>This code has five main parts:</p>
<ol>
<li><p>An outer loop <strong>iterates over the 10000 (</strong><code>nIters</code><strong>) simulated “respondents”</strong> (Line 1). For each of those iterations:</p>
<ol>
<li><p>It <strong>draws zero-centered utilities</strong> for the 14 items (Lines 3-4) for that 1 simulated respondent</p>
</li>
<li><p>An inner loop <strong>iterates over 6 MaxDiff choice “trials”</strong> (Line 8), and inside that:</p>
<ol>
<li><p>It makes a <strong>trial with 5 randomly chosen items</strong> to be compared on 1 choice task (Line 9)</p>
</li>
<li><p>It <strong>chooses the most likely Best &amp; Worst</strong> on that random trial and calculates the likelihoods for those choices (Lines 11-17).</p>
</li>
<li><p>When it’s done with the 6 tasks, <strong>it calculates RLH for the set</strong> of 12 observed Best &amp; Worst “choices” for that respondent (6 Best + 6 Worst), and saves that result (Lines 20-21)</p>
</li>
</ol>
</li>
</ol>
</li>
</ol>
<p>The error check line at the end was just some defensive coding that occurred to me along the way (nothing in particular prompted it; I just like to code defensively). It never triggered.</p>
<blockquote>
<p><em>Side note</em>: this code assumes that the overall, upper level means of all items are also zero. One could instead structure those; that’s a longer discussion. A more general approach would be to iterate over different sets of part worth scores; different survey lengths; and different sizes of choice sets. The code above could serve as the basis for a function that is refactored for use inside a larger and more general set of loops and/or parameter variations. That would be a good exercise!</p>
</blockquote>
<p>BTW, you will see that <strong>RLH here is computed with a different formula</strong> than above (but one that is mathematically equivalent). Why? Because, when many odds are multiplied together, they will quickly underflow real number calculation engines. By taking the <code>log()</code> of them and then <em>adding</em> the logs rather than multiplying, we avoid that problem.</p>
<blockquote>
<p>Side note: R can handle multiplying 12 probabilities — but if the survey were longer, then we could have problems. It’s always good to anticipate such things and proactively avoid them when coding!</p>
</blockquote>
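<p>Here is a small demonstration of that underflow issue, using made-up likelihood values rather than survey data. With a few hundred tasks, the naive <code>prod()</code> version collapses to zero, while the log-based version still recovers the correct RLH:</p>
<pre><code class="lang-r"># 400 made-up tasks, each with likelihood 0.1
p &lt;- rep(0.1, 400)
prod(p) ^ (1 / length(p))        # prod(p) underflows to 0, so this returns 0
exp(sum(log(p)) / length(p))     # log-based version correctly returns 0.1
</code></pre>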
<hr />
<h2 id="heading-results-from-the-simulation">Results from the Simulation</h2>
<p>After running this code, we have RLH scores for 10000 simulated “respondents”. As a reminder, the code has assumed that all of the “observed” choices aligned <em>perfectly</em> with the most likely, most preferred items, according to the simulated individual level utilities. <strong>Thus it sets an <em>upper bound</em> for expectations of RLH</strong>. (<em>Caveat</em>: this assumes there is no “none of the above” option. I discuss that below.)</p>
<p>Here’s code to plot the results, slightly adapting the density chart code from above:</p>
<pre><code class="lang-r"><span class="hljs-comment"># density plot</span>
set.seed(<span class="hljs-number">98195</span>)    <span class="hljs-comment"># jittered points are slightly randomized</span>
p &lt;- ggplot(data=as.data.frame(rlhDraws), aes(x=rlhDraws, y=<span class="hljs-number">1</span>)) +
  geom_density_ridges(jittered_points=<span class="hljs-literal">TRUE</span>, 
                      alpha=<span class="hljs-number">0.1</span>, colour=<span class="hljs-string">"darkblue"</span>) +
  xlab(<span class="hljs-string">"Model fit score (RLH; &gt; 0.3 is good)"</span>) +
  ylab(<span class="hljs-string">"Relative prevalence (density)"</span>) +
  ggtitle(<span class="hljs-string">"RLH Simulation, Perfectly Made Choices (N=10000)"</span>) +
  theme_minimal()

p
</code></pre>
<p>And the resulting chart:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734724305501/128c83d6-98a5-4ecb-9567-c6cf860e30f2.png" alt class="image--center mx-auto" /></p>
<p>This tells us that when respondents make choices perfectly in alignment with their estimated preferences — with other assumptions about utility scores and survey parameters, as above — we can expect those <strong>upper-bound RLH values</strong> to fall between roughly 0.4 and 0.9, with most falling between 0.5 and 0.8.</p>
<p>In other words, if we have no other information except that a respondent shows RLH=0.5 or RLH=0.4, it could in fact be a maximum, perfect RLH score for them. <strong>In general, we should not expect to see RLH &gt;= 0.6 or so.</strong> If we do see it, we should be <em>surprised</em>, because it is near “perfect”.</p>
<p>As we saw above, our actual UXR data had median RLH = 0.60. That suggests that our respondents are not too far below “perfect” in their response patterns.</p>
<p>Put differently, UXRs are great respondents when a survey is relevant and motivating for them! (In a consumer sample, I would expect substantially lower RLH, more like a median of 0.40.)</p>
<p>We can compare the two sets of results — the actual UXR data vs. the simulation — using the same kind of density plot. Here’s the code:</p>
<pre><code class="lang-r"><span class="hljs-comment"># combined density plot (using random subset of simulated data)</span>
compare.df &lt;- data.frame(RLH    = c(md.dat$RLH, rlhDraws),
                         Source = rep(c(<span class="hljs-string">"UXR"</span>, <span class="hljs-string">"Sim"</span>), 
                                      times=c(length(md.dat$RLH), 
                                              length(rlhDraws))))

<span class="hljs-comment"># density plot</span>
set.seed(<span class="hljs-number">98102</span>)    <span class="hljs-comment"># jittered points are slightly randomized</span>
p &lt;- ggplot(data=compare.df, aes(x=RLH, y=Source, color=Source)) +
  geom_density_ridges(jittered_points=<span class="hljs-literal">TRUE</span>, 
                      alpha=<span class="hljs-number">0.1</span>) +
  xlab(<span class="hljs-string">"RLH"</span>) +
  ylab(<span class="hljs-string">"Relative prevalence (density)"</span>) +
  ggtitle(<span class="hljs-string">"RLH Comparison: UXR Data vs. 'Perfect' Simulation"</span>) +
  theme_minimal()

p
</code></pre>
<p>In this code, I first combine the two data sources into a single data set. Then I use the same density plot functions, adding the “Source” of each data point (UXR data vs. simulation) as the y variable, which will stack the results as two density curves.</p>
<p>Here’s the chart:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734725377728/c9b82018-d2bb-4a28-a798-71a2f834216e.png" alt class="image--center mx-auto" /></p>
<p>As you can see, the UXR data is pretty close to “perfect.” Again, I would not expect this in most data sets! Usually I would be happy with median RLH &gt; 0.4 or so. But this comparison <strong>highlights once more what we might expect as the upper bound</strong> of reasonable RLH.</p>
<p>Looking a bit further, we see an interesting suggestion of tri-modality in the UXR distribution of RLH. On the <em>right hand side</em> (peaking ~0.65), there is a suggestion of a group that looks exactly like the “perfect” data. Then there is a <em>middle peak</em> suggesting a group (peaking ~0.60) that looks slightly less than perfect. Finally, there is a <em>left hand peak</em> (peaking ~0.52) that shows lower RLH for some reason (there are many possibilities); it is not at all “bad” by any means, but it contrasts with the two higher modal peaks.</p>
<p>Overall I wouldn’t make much of that observation; I’m just highlighting it as a signal of potential additional questions we might have about respondent style, sources, quality, etc., in a sample.</p>
<hr />
<h2 id="heading-another-approach-rlh-cutoffs-via-random-data">Another Approach: RLH Cutoffs via Random Data</h2>
<p>Another approach to using RLH as a cutoff comes from the choice modeling experts at Sawtooth (<a target="_blank" href="https://content.sawtoothsoftware.com/assets/48af48f3-c01e-42ff-8447-6c8551a6d94f">Orme, 2019</a> PDF). They noted two things:</p>
<ol>
<li><p>RLH for “random” responses can vary on a given survey due to the details of the survey’s experimental design (e.g., some tasks are “harder” or “easier” relative to the model overall)</p>
</li>
<li><p>Their platform allows creation of random “respondents” who take the actual survey.</p>
</li>
</ol>
<p>Putting those together, they suggested using the random respondent method in their platform to empirically determine a point at which the RLH for a particular respondent is clearly below the expectation for a random respondent. Specifically, identify a percentile cutoff, such as the 95th percentile (highest 5%) of RLH in the simulated random respondents. After identifying that point, filter out respondents below that; and then run the overall model again without them.</p>
<p>For step by step details, see the excellent white paper (<a target="_blank" href="https://content.sawtoothsoftware.com/assets/48af48f3-c01e-42ff-8447-6c8551a6d94f">Orme, 2019</a> PDF). Unlike the heuristic R code above, this method exactly fits your own survey. (BTW, that method could be used in any platform that offers random responses plus individual RLH calculation.)</p>
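<p>I won’t reproduce their procedure here, but a minimal sketch of the filtering logic looks like the following. It assumes you have exported RLH values for the simulated random respondents into a vector; here I use a placeholder called <code>rlhRandom</code> with made-up values, alongside the real respondents’ <code>md.dat$RLH</code> from above:</p>
<pre><code class="lang-r"># placeholder for random-respondent RLH values (use your platform's export instead)
rlhRandom &lt;- runif(500, min = 0.15, max = 0.35)

# cutoff at the 95th percentile of the random-respondent RLH distribution
(cutoff &lt;- quantile(rlhRandom, 0.95))

# how many real respondents fall below that cutoff?
table(md.dat$RLH &lt; cutoff)
</code></pre>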
<p>However, I will also note that <strong>one should be cautious about filtering</strong> — as Orme notes in the white paper and as I will discuss next.</p>
<hr />
<h2 id="heading-caveats-rlh-fit-values-and-filtering-responses-for-data-quality">Caveats: RLH Fit Values and Filtering Responses for “Data Quality”</h2>
<p>Before we finish, I want to clarify two things: that <strong>RLH is not about respondents</strong> but is about the <em>fit</em> between observations and model; and that <strong>filtering respondents with any method is risky</strong>.</p>
<h3 id="heading-rlh-is-not-about-respondent-quality"><strong><em>RLH is not about respondent quality</em></strong></h3>
<p>Throughout the post, I’ve described how RLH relates to the mutual likelihood of a model and observations. I want to be very clear that <em>this does not necessarily say anything about respondent quality</em>.</p>
<p>Why not? Consider an edge case (actually somewhat common) when <strong>a respondent accurately reports that they do not like any of the items</strong> on a MaxDiff. Perhaps they are completely uninterested in the features or project … and they truthfully and diligently report that disinterest. Suppose that the model accurately says that their utilities are ~0 (±error) for every item.</p>
<p>In that case, the RLH for this diligent respondent on a C=5 items-at-a-time survey would approximate <code>exp(0) / (5*exp(0))</code> = 1/C … indistinguishable from the expectation for a completely random respondent!</p>
<p>That means that <em>low RLH is only a signal that something is wrong</em>. It may be that the respondents are answering randomly. Yet it may be that your items are unappealing, or that you are targeting the wrong audience, and respondents are accurately communicating that. <strong>In this case, filtering out low RLH respondents could be the exact wrong thing to do</strong>, because their disinterest may be important to know.</p>
<p>That doesn’t mean that RLH is useless. Rather, <strong>RLH is one diagnostic signal, and needs to be placed into context</strong> with other signals (such as observations from an in-person pilot study, which I always advocate).</p>
<blockquote>
<p>BTW, RLH can also be exceptionally <strong>high</strong> for similar reasons — which may happen in particular when a conjoint survey or MaxDiff has a “none” option. If a respondent answers “no, I don’t want it” on every task, then their “none” utility will have a very large value.</p>
<p>Suppose a respondent’s “none” utility = 5.0, while 10 other items each have utility = -0.5. Then, on a C=5 task including “none”, their expected RLH = <code>exp(5) / (4*exp(-0.5) + exp(5))</code> = 0.98.</p>
<p>That might indicate a “bad” respondent who simplifies the survey by picking “none” every time. OTOH, it might be a good respondent who accurately and importantly says that they don’t want the product. Again, the point is not that RLH identifies whether any respondent is good or bad. Rather, it must be used diagnostically with other knowledge and signals.</p>
</blockquote>
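<p>For completeness, here are those two edge-case calculations in R, using the utilities described above (a diligent but indifferent respondent, and a respondent who always picks “none”):</p>
<pre><code class="lang-r"># edge case 1: all 5 shown items have utility ~0 (accurate indifference)
exp(0) / sum(exp(rep(0, 5)))            # = 0.20, i.e. 1/C, same as random guessing

# edge case 2: "none" utility = 5.0 vs. four shown items at -0.5 each
exp(5) / (4 * exp(-0.5) + exp(5))       # = ~0.98, an unusually high likelihood
</code></pre>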
<h3 id="heading-filtering-respondents-is-risky"><strong><em>Filtering respondents is risky</em></strong></h3>
<p>I say this every chance I get, and won’t belabor it: anytime we filter respondents — whether that is from a screener, a model fit estimate, or whatever — we are introducing bias into the sample and its results.</p>
<p><strong>The worst offenders that I see in this area are surveys that use multiple screening items</strong> to identify the “right” audience. For instance, they might want to find “<em>intenders</em> in the <em>next 3 months</em> in our <em>target audience</em> who are <em>familiar</em> with the product category and <em>already use</em> at least one product in the category.”</p>
<p><strong>If a survey does that, stop and reconsider</strong>! Each of the 5 conditions in that hypothetical screener adds error. Unless the items have been psychometrically evaluated for prevalence, validity, and relation to the constructs of interest, then I would assume that filtering is <em>adding</em> a large and unknown degree of error, not reducing it!</p>
<p>The same considerations apply to filtering due to RLH as I described above. Filtering by RLH might reduce non-informative respondents … yet it also might remove respondents who are telling you something vitally important about their own disinterest. It’s important to know <em>when</em> RLH is low, because you can follow up by exploring <em>why</em> it is low.</p>
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>I’ll finish with my recommended sequence of considerations for RLH with a choice modeling survey (MaxDiff or Conjoint Analysis):</p>
<ol>
<li><p><strong>Always examine — through descriptive stats and plotting — the individual-level RLH estimates</strong> for your dataset in a choice model survey. Not just the overall average, but the individuals.</p>
</li>
<li><p>If the curve is <strong>approximately normal, and only a small % are below a cutoff of 1.5/C</strong> (e.g., 1.5 / 5 items = 0.3 for a 5-item-at-a-time MaxDiff) … then I’d say RLH looks good, and don’t worry further about it (a quick way to tally this is sketched just after this list). The UXR data above is an example of this. (Yay! UXRs are excellent respondents.)</p>
</li>
<li><p>OTOH, <strong>if (say) &gt;10% of respondents have RLH &lt; 1.5/C, then you need to identify why</strong> that is occurring. The reasons might reflect any or all of these:</p>
<ol>
<li><p><em>Accurate responses</em> of disinterest that are important to know</p>
</li>
<li><p>A <em>problem with the survey’s</em> <em>content</em>, concept phrasing, programming, etc.</p>
</li>
<li><p>The <em>wrong target audience,</em> perhaps indicating a need for new sample</p>
</li>
<li><p><em>Low quality respondents</em> due to response styles or panel characteristics</p>
</li>
<li><p><em>Poor model fit for other reasons</em>, such as inappropriate constraints on estimation</p>
</li>
</ol>
</li>
<li><p><strong>There is no way to distinguish those five possibilities for low RLH without additional information</strong>, such as other signals, qualitative pre-testing of the survey, reconsideration of any unusual model choices, and so forth.</p>
</li>
<li><p>Similarly, but less commonly, if your survey includes a “none” option, look out for an <em>unusually high RLH peak</em> (a moderate or high % of respondents with RLH &gt; 0.8 or so).</p>
</li>
</ol>
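<p>For the quick tally mentioned in point 2, here is one way to check the proportion of respondents below the heuristic cutoff, using the UXR data loaded above (with C = 5 items shown per task):</p>
<pre><code class="lang-r"># proportion of respondents below the heuristic 1.5/C cutoff
cutoff &lt;- 1.5 / 5
mean(md.dat$RLH &lt; cutoff)     # much above 0.10 would warrant investigating why
</code></pre>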
<p><strong>In short, don’t use RLH as an automatic filter!</strong> Use it as a way to <em>understand more about your data</em> and to identify potential questions about respondents’ response styles.</p>
<p>I hope this discussion and the R code above will help you to use RLH with your data. Cheers!</p>
<hr />
<h2 id="heading-all-the-r-code">All the R Code</h2>
<p>As always, following is the R code for all of the blocks above, in one place. [BTW, the references to “12.x” are there because this code is continuous with code from the previous 3 posts of this series.]</p>
<pre><code class="lang-r"><span class="hljs-comment">##### 12. Data Quality / RLH</span>

<span class="hljs-comment"># 12.1</span>
<span class="hljs-comment"># get the data; repeating here for blog post 4, see post 1 for details</span>
<span class="hljs-keyword">library</span>(openxlsx)   <span class="hljs-comment"># install if needed, as with all package calls</span>
md.dat &lt;- read.xlsx(<span class="hljs-string">"https://quantuxbook.com/misc/QUX%20Survey%202024%20-%20Future%20Classes%20-%20MaxDiff%20Individual%20raw%20scores.xlsx"</span>)   <span class="hljs-comment"># </span>
md.dat$Anchor &lt;- <span class="hljs-literal">NULL</span>
names(md.dat)[<span class="hljs-number">3</span>:<span class="hljs-number">16</span>] &lt;- c(<span class="hljs-string">"Choice Models"</span>,  <span class="hljs-string">"Surveys"</span>,       <span class="hljs-string">"Log Sequences"</span>,    <span class="hljs-string">"Psychometrics"</span>, 
                         <span class="hljs-string">"R Programming"</span>,  <span class="hljs-string">"Pricing"</span>,       <span class="hljs-string">"UX Metrics"</span>,       <span class="hljs-string">"Bayes Stats"</span>,
                         <span class="hljs-string">"Text Analytics"</span>, <span class="hljs-string">"Causal Models"</span>, <span class="hljs-string">"Interviewer-ing"</span>,  <span class="hljs-string">"Advanced Choice"</span>, 
                         <span class="hljs-string">"Segmentation"</span>,   <span class="hljs-string">"Metrics Sprints"</span>)
classCols &lt;- <span class="hljs-number">3</span>:ncol(md.dat)    <span class="hljs-comment"># generally, Sawtooth exported utilities start in column 3</span>

<span class="hljs-comment"># simple calculations as described in the post</span>
exp(<span class="hljs-number">1.0</span>) / sum(exp(c(<span class="hljs-number">1.0</span>, <span class="hljs-number">0.5</span>, -<span class="hljs-number">0.8</span>, -<span class="hljs-number">1.2</span>)))  <span class="hljs-comment"># item A</span>
exp(-<span class="hljs-number">1.2</span>) / sum(exp(c(<span class="hljs-number">1.0</span>, <span class="hljs-number">0.5</span>, -<span class="hljs-number">0.8</span>, -<span class="hljs-number">1.2</span>))) <span class="hljs-comment"># item D       </span>
<span class="hljs-number">0.53</span> * <span class="hljs-number">0.059</span>
sqrt(<span class="hljs-number">0.53</span> * <span class="hljs-number">0.059</span>)

(individualOdds &lt;- seq(<span class="hljs-number">0.01</span>, <span class="hljs-number">0.99</span>, length=<span class="hljs-number">12</span>))  <span class="hljs-comment"># placeholder for the likelihoods</span>
(rlh &lt;- prod(individualOdds) ^ (<span class="hljs-number">1</span>/length(individualOdds)))

<span class="hljs-comment"># 12.2 RLH</span>
names(md.dat)[<span class="hljs-number">2</span>] &lt;- <span class="hljs-string">"RLH"</span>
summary(md.dat$RLH)

<span class="hljs-comment"># let's look at our UXR data</span>
<span class="hljs-comment"># density plot of the RLH values</span>
<span class="hljs-keyword">library</span>(ggplot2)
<span class="hljs-keyword">library</span>(ggridges)
set.seed(<span class="hljs-number">98101</span>)    <span class="hljs-comment"># jittered points are slightly randomized</span>
p &lt;- ggplot(data=md.dat, aes(x=RLH, y=<span class="hljs-number">1</span>)) +
  geom_density_ridges(jittered_points=<span class="hljs-literal">TRUE</span>, 
                      alpha=<span class="hljs-number">0.5</span>, colour=<span class="hljs-string">"darkblue"</span>) +
  xlab(<span class="hljs-string">"Model fit score (RLH; &gt; 0.3 is good)"</span>) +
  ylab(<span class="hljs-string">"Relative prevalence (density)"</span>) +
  theme_minimal()

p

<span class="hljs-comment"># 12.3</span>
<span class="hljs-comment"># how well might we *expect* to predict, if respondents answers agree *perfectly* with an estimated model?</span>

<span class="hljs-comment"># back of the envelope</span>
<span class="hljs-comment"># pick some preference distribution and calculate odds</span>
itemOdds &lt;- c(<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">5</span>, <span class="hljs-number">8</span>)           <span class="hljs-comment"># relative preference for 5 items</span>
itemOdds &lt;- itemOdds / sum(itemOdds)   <span class="hljs-comment"># rescale so they sum to 1.0</span>
max(itemOdds)          <span class="hljs-comment"># relative odds for the most-preferred item</span>

<span class="hljs-comment"># code version</span>
<span class="hljs-comment"># we can do much better with a simulation model!</span>
nIter    &lt;- <span class="hljs-number">10000</span>            <span class="hljs-comment"># how many times to sample nItem MaxDiff simulations</span>
nItems   &lt;- <span class="hljs-number">14</span>               <span class="hljs-comment"># number of items in a MaxDiff set</span>
nShown   &lt;- <span class="hljs-number">5</span>                <span class="hljs-comment"># number of items shown on a single screen</span>
nScreens &lt;- <span class="hljs-number">6</span>                <span class="hljs-comment"># number of screens in our survey</span>
rlhDraws &lt;- rep(<span class="hljs-literal">NA</span>, nIter)   <span class="hljs-comment"># hold the results for each iteration</span>
set.seed(<span class="hljs-number">98250</span>)              <span class="hljs-comment"># make it repeatable</span>

<span class="hljs-comment"># we have to set distribution parameters for our simulated part worths</span>
<span class="hljs-comment"># in HB model, they are random normal, mean=0, ... but what sd should we use?</span>
<span class="hljs-comment"># look at empirical data to pick a reasonable sd</span>
summary(unlist(lapply(md.dat[ , classCols], sd)))   <span class="hljs-comment"># answer: somewhere around sd = 2.2 to 2.6</span>
(drawSD  &lt;- median(unlist(lapply(md.dat[ , classCols], sd))))

<span class="hljs-keyword">for</span> (i <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:nIter) {
  <span class="hljs-comment"># set up 1 respondent "trial"</span>
  pwsResp     &lt;- rnorm(nItems, mean=<span class="hljs-number">0</span>, sd=drawSD) <span class="hljs-comment"># "nItems" random normal, zero-centered simulated part worths</span>
  pwsResp     &lt;- scale(pwsResp, scale=<span class="hljs-literal">FALSE</span>) <span class="hljs-comment"># recenter to make sure they're zero-sum</span>

  <span class="hljs-comment"># iterate over nScreen tasks per respondent</span>
  drawMax &lt;- rep(<span class="hljs-literal">NA</span>, nScreens*<span class="hljs-number">2</span>)             <span class="hljs-comment"># hold the results across the screens for one respondent</span>
  <span class="hljs-keyword">for</span> (j <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:nScreens) {
    pwsDrawn    &lt;- sample(pwsResp, nShown)   <span class="hljs-comment"># get the part worths for 1 simulated task  </span>
    <span class="hljs-comment"># "best" choice</span>
    pwsExp      &lt;- exp(pwsDrawn)             <span class="hljs-comment"># exponentiate those part worths</span>
    pwsMaxShare &lt;- max(pwsExp) / sum(pwsExp) <span class="hljs-comment"># our best (most likely) prediction would be the max utility item</span>
    drawMax[j*<span class="hljs-number">2</span> - <span class="hljs-number">1</span>]  &lt;- pwsMaxShare
    <span class="hljs-comment"># repeat for the "worst" choice</span>
    pwsExp        &lt;- exp(-<span class="hljs-number">1</span> * pwsDrawn)      <span class="hljs-comment"># exponentiate part worths for the "worst" choice direction</span>
    pwsMaxShare   &lt;- max(pwsExp) / sum(pwsExp)   
    drawMax[j*<span class="hljs-number">2</span>]  &lt;- pwsMaxShare
  }
  <span class="hljs-comment"># calculate RLH from those observations</span>
  rlh         &lt;- exp(sum(log(drawMax[drawMax &gt; <span class="hljs-number">0</span>]), na.rm=<span class="hljs-literal">TRUE</span>) / length(drawMax))    <span class="hljs-comment"># safer equivalent of "prod(drawMax) ^ (1 / length(drawMax))"</span>
  rlhDraws[i] &lt;- rlh
  <span class="hljs-comment"># add a bit of error checking, just in case of an off by one error etc :-/</span>
  <span class="hljs-keyword">if</span> (length(drawMax) != nScreens * <span class="hljs-number">2</span>) <span class="hljs-keyword">warning</span>(<span class="hljs-string">"The vector of partworths is off somewhere!"</span>, i, <span class="hljs-string">":"</span>, drawmax)
}

<span class="hljs-comment"># density plot</span>
set.seed(<span class="hljs-number">98195</span>)    <span class="hljs-comment"># jittered points are slightly randomized</span>
p &lt;- ggplot(data=as.data.frame(rlhDraws), aes(x=rlhDraws, y=<span class="hljs-number">1</span>)) +
  geom_density_ridges(jittered_points=<span class="hljs-literal">TRUE</span>, 
                      alpha=<span class="hljs-number">0.1</span>, colour=<span class="hljs-string">"darkblue"</span>) +
  xlab(<span class="hljs-string">"Model fit score (RLH; &gt; 0.3 is good)"</span>) +
  ylab(<span class="hljs-string">"Relative prevalence (density)"</span>) +
  ggtitle(<span class="hljs-string">"RLH Simulation, Perfectly Made Choices (N=10000)"</span>) +
  theme_minimal()

p

<span class="hljs-comment"># where does our observed data fall in the distribution?</span>
ecdf(rlhDraws)(median(md.dat$RLH))               

<span class="hljs-comment"># combined density plot (using random subset of simulated data)</span>
compare.df &lt;- data.frame(RLH    = c(md.dat$RLH, rlhDraws),
                         Source = rep(c(<span class="hljs-string">"UXR"</span>, <span class="hljs-string">"Sim"</span>), 
                                      times=c(length(md.dat$RLH), 
                                              length(rlhDraws))))

<span class="hljs-comment"># density plot</span>
set.seed(<span class="hljs-number">98102</span>)    <span class="hljs-comment"># jittered points are slightly randomized</span>
p &lt;- ggplot(data=compare.df, aes(x=RLH, y=Source, color=Source)) +
  geom_density_ridges(jittered_points=<span class="hljs-literal">TRUE</span>, 
                      alpha=<span class="hljs-number">0.1</span>) +
  xlab(<span class="hljs-string">"RLH"</span>) +
  ylab(<span class="hljs-string">"Relative prevalence (density)"</span>) +
  ggtitle(<span class="hljs-string">"RLH Comparison: UXR Data vs. 'Perfect' Simulation"</span>) +
  theme_minimal()

p
</code></pre>
<hr />
<h3 id="heading-citations">Citations</h3>
<p>I’m reminding myself (and others) to systematically <strong>cite the important work others have done</strong>, the work that makes R and other tools so valuable. Following are key citations for today’s code!</p>
<p>BTW, you can find citations for almost everything in R with the <code>citation()</code> command, such as <code>citation("ggridges")</code>.</p>
<ul>
<li><p>Orme, B (2019). <em>Consistency Cutoffs to Identify "Bad" Respondents in CBC, ACBC, and MaxDiff.</em> Sawtooth Software technical paper, available at <a target="_blank" href="https://content.sawtoothsoftware.com/assets/48af48f3-c01e-42ff-8447-6c8551a6d94f">https://content.sawtoothsoftware.com/assets/48af48f3-c01e-42ff-8447-6c8551a6d94f</a></p>
</li>
<li><p>R Core Team (2024). <em>R: A Language and Environment for Statistical Computing</em>. R Foundation for Statistical Computing, Vienna, Austria. <a target="_blank" href="https://www.R-project.org/">https://www.R-project.org/</a>. Version 4.4.1.</p>
</li>
<li><p>Schauberger P, Walker A (2024). <em>openxlsx</em>: Read, Write and Edit xlsx Files. R package version 4.2.6.1, <a target="_blank" href="https://CRAN.R-project.org/package=openxlsx">https://CRAN.R-project.org/package=openxlsx</a>.</p>
</li>
<li><p>Wickham, H (2016). <em>ggplot2</em>: Elegant Graphics for Data Analysis. Springer-Verlag New York.</p>
</li>
<li><p>Wilke C (2024). <em>ggridges</em>: Ridgeline Plots in 'ggplot2'. R package version 0.5.6, <a target="_blank" href="https://CRAN.R-project.org/package=ggridges">https://CRAN.R-project.org/package=ggridges</a>.</p>
</li>
</ul>
<p><a target="_blank" href="https://notbyai.fyi"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746999945541/9d224843-9e9f-44cc-98c5-276915794420.png" alt class="image--center mx-auto" /></a></p>
]]></content:encoded></item><item><title><![CDATA[Be a T-shaped Quant UXR: How Doing Qualitative Research Made Me a Better Quantitative UX Researcher]]></title><description><![CDATA[Hi there, my name is Kitty Xu, and I am a quantitative user experience researcher (a.k.a., Quant UXR). What is Quant UX, you ask? Over the years, I've published 3 articles on it, and you can read them here:

What is quantitative user experience resea...]]></description><link>https://quantuxblog.com/be-a-t-shaped-quant-uxr-how-doing-qualitative-research-made-me-a-better-quantitative-ux-researcher</link><guid isPermaLink="true">https://quantuxblog.com/be-a-t-shaped-quant-uxr-how-doing-qualitative-research-made-me-a-better-quantitative-ux-researcher</guid><category><![CDATA[quantux]]></category><category><![CDATA[#Ux research]]></category><category><![CDATA[career advice]]></category><dc:creator><![CDATA[Kitty Xu]]></dc:creator><pubDate>Tue, 07 Jan 2025 16:11:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735703320479/7537bea5-17d5-48b0-9179-faa31f2556e2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi there, my name is Kitty Xu, and I am a quantitative user experience researcher (a.k.a., Quant UXR). What is Quant UX, you ask? Over the years, I've published 3 articles on it, and you can read them here:</p>
<ol>
<li><p><a target="_blank" href="https://medium.com/pinterest-studio/what-is-quantitative-user-experience-research-at-pinterest-8eb17c69a0fc">What is quantitative user experience research at Pinterest?</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/pinterest-studio/youre-a-stem-major-your-dream-job-might-be-in-design-b3f947941360">You’re a STEM major? Your dream job might be in Design</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/pinterest-studio/from-brain-science-to-quantitative-ux-531eeec82806">From brain science to quantitative UX</a></p>
</li>
</ol>
<p>Today, I want to tell a different story, one that involves doing <em>qualitative</em> UX research (a.k.a., qual UX) and how it made me a better Quant UXR.</p>
<p>For clarification, I’m not talking about partnering with a qual UXR. That’s often beneficial, especially if you have the privilege of working with multiple researchers (<a target="_blank" href="https://medium.com/pinterest-studio/from-brain-science-to-quantitative-ux-531eeec82806">Lesson #4 here</a>)! What I'm additionally arguing is that <strong>Quant UXRs can develop and execute qual UX research, and this will reciprocally improve their Quant UX research capabilities.</strong></p>
<p>If you consider yourself someone who always defaults to quant research methods and you are skeptical about whether building up some qual research experience in your portfolio will actually make you a better Quant UXR — or if you're just too intimidated to execute research outside your methodological comfort zone — read on ...</p>
<h2 id="heading-the-limitation-of-technical-expertise"><strong>The limitation of technical expertise</strong></h2>
<p>In the mid-2010s, <a target="_blank" href="https://www.linkedin.com/pulse/behind-scenes-making-first-ever-quant-uxr-conference-june-2022/">Quant UX research was still defining its professional identity</a>. When I entered the field, I thought what differentiated Quant UXRs from their qual counterparts was <a target="_blank" href="https://quantuxblog.com/skills-combination-for-quant-ux-applications">technical expertise</a> — particularly the ability to design surveys, query log data, and program in languages like R or Python to format, manipulate, and analyze large datasets, and to run statistical modeling and create data visualizations. Back then, I believed that advancing these technical capabilities was the key to becoming a better Quant UXR.</p>
<p>Early in my career, my work reflected this mindset. I wanted to be a "numbers expert," showcasing advanced statistical models, intricate SQL queries, and meticulously crafted surveys. I built strength in transforming survey responses and behavioral log data into actionable user insights at scale.</p>
<p>However, <strong>technical mastery alone doesn't guarantee research impact.</strong> As my career progressed, I faced increasingly complex and ambiguous challenges that couldn't be solved with complex data analysis alone. Such research required a deep understanding of nuanced spaces and the ability to identify opportunities for meaningful product improvement, balancing both user experience and business needs.</p>
<p>At the same time, I came to appreciate a hierarchy of research excellence:</p>
<ul>
<li><p>A <em>good</em> researcher <strong>answers a question effectively</strong>.</p>
</li>
<li><p>A <em>great</em> researcher <strong>selects the most appropriate tools</strong> to answer a question effectively.</p>
</li>
<li><p>An <em>amazing</em> researcher <strong>ensures they are tackling the right question</strong>, then selects the most appropriate tools to answer it.</p>
</li>
<li><p>An <em>exceptional</em> researcher ensures they are tackling the right question, then selects the most appropriate tools to answer it, <strong>and crafts compelling narratives that inspire action.</strong></p>
</li>
</ul>
<p><em>Modifiers used to describe a researcher shall not be used for a survey scale question</em> :).</p>
<p>My work evolved from technical execution to uncovering deeper user insights that could shape business strategy. Technical skills remained critical, but they became tools in service of a more fundamental goal: understanding user experience at its deepest, most nuanced level.</p>
<p><strong>The true measure of research impact, I realized, isn't the complexity of the analysis — it's the potential for meaningful change in both user experience and business outcomes.</strong></p>
<h2 id="heading-first-steps-into-qualitative-research"><strong>First steps into qualitative research</strong></h2>
<blockquote>
<p>“A fundamental principle of innovation or creative thinking is to start with empathy.” – IDEO</p>
</blockquote>
<p>People <a target="_blank" href="https://www.ideo.com/journal/build-your-creative-confidence-empathy-maps">SAY, DO, THINK and FEEL</a>, and these don’t always align. <strong>Quant UX research focuses on understanding <em>what</em> (and <em>how much)</em> people DO and SAY they feel about a product or experience at scale by collecting and analyzing empirical evidence. In contrast, qual UX research delves deeper, exploring <em>why</em> people THINK and FEEL the way they do, uncovering nuanced motivations and underlying emotions.</strong> Combining them gives me a richer data set to analyze patterns, synthesize the meaning, identify opportunities, and build a compelling narrative.</p>
<p>My initial experience with qual research came through collaborations with skilled qual UXRs. This was done in two ways: (a) observing and leveraging the work of my qual UXR colleagues, and (b) designing and commissioning qual studies done by external research vendors. Through projects ranging from consumer shopping journeys and advertiser trust to teen safety and user segmentation, I developed a deeper appreciation for the nuanced and layered insights that qual research provides, extending far beyond what my data charts and dashboards alone can reveal.</p>
<p>These early collaborations were eye-opening. While I wasn't conducting the user interviews directly, I learned to:</p>
<ul>
<li><p>Leverage qual findings to complement and enhance my quant analysis.</p>
</li>
<li><p>Triangulate quant and qual insights to craft narratives that are both data-driven and human-centered, combining breadth with depth.</p>
</li>
<li><p>Recognize and value contextual details from the periphery of observations and interviews, uncovering hidden insights that often prove invaluable during research synthesis.</p>
</li>
</ul>
<p>I thought I had gained a solid understanding of qual research through these collaborations, but conducting my own qual research projects taught me far more. Over the past couple years, I led several international research initiatives, employing both remote and in-person methodologies. The remote studies consisted of in-depth interviews (IDIs) conducted over video calls, working across multiple time zones and languages with the help of simultaneous interpreters.</p>
<p>For field research, I sourced and partnered with local research agencies to operationalize the studies and facilitate on-site market deep-dives. I brought cross-functional teammates from around the world to meet and interact with users in their own local context. My travel-mates spanned exec leadership, product managers (PMs), engineers, designers, analysts, and marketers. These studies were conducted across China, South Korea, and Turkey.</p>
<p>For a Quant UXR who revels in the structure and predictability of quant methods, <strong>the unstructured nature of qual research initially felt messy, disorienting and deeply uncomfortable. Yet, this discomfort was a catalyst that allowed me to grow</strong>.</p>
<p>I was fortunate to learn from veteran qual UXRs who provided invaluable mentorship throughout my growth. They shared essential resources like recruiting screeners and interview protocols, which gave me a strong foundation to build upon. By shadowing their interview sessions, I gained firsthand experience with effective interviewing techniques and learned to handle unexpected challenges, from technical issues to participant no-shows. Their guidance on working with international research vendors proved especially valuable, helping me navigate cross-cultural research complexities.</p>
<p>With their support and encouragement, I developed confidence in leading end-to-end research processes, developing skills in:</p>
<ul>
<li><p><strong>Research design and methodology:</strong> Crafting nuanced interview protocols and designing diverse research methodologies, including IDIs, focus groups, expert interviews, co-creation workshops, and dinner mixers that facilitate authentic conversations between my cross-functional teammates and research participants (who are English language learners in our target markets).</p>
</li>
<li><p><strong>Global research operations:</strong> Sourcing, vetting, and managing relationships with local qual UX agencies and facilities in target markets.</p>
</li>
<li><p><strong>Participant recruitment:</strong> Developing and localizing screeners to recruit research participants across complex cultural landscapes, languages, and timezones. (This one was particularly hard for me, not going to sugar coat it.)</p>
</li>
<li><p><strong>Moderation and facilitation:</strong> Conducting IDIs through simultaneous interpreters across multiple languages. In one challenging case, I conducted an interview in both English and Mandarin simultaneously to accommodate both the participant and stakeholders observing the session.</p>
</li>
<li><p><strong>Qualitative data analysis and synthesis:</strong> Synthesizing qual data, including individual stories, observations, and unstructured data, into a cohesive narrative.</p>
</li>
<li><p><strong>Leadership in cross-functional collaboration:</strong> Guiding large cross-functional teams during intensive international research trips, and facilitating cross-functional team debriefs and synthesis sessions.</p>
</li>
</ul>
<p>These projects offered far more than just data collection. <strong>By conducting qual research myself, I learned to step back, to shift my focus from being the primary driver of the research to truly listen — not just to what the participants said, but also what they didn't say, and the nuances between the lines.</strong></p>
<p>This shift in perspective taught me to set aside my own preconceptions and embrace genuine curiosity about what I might discover through these intimate exchanges across languages and cultures. The resulting data, synthesized insights, and the stories we crafted felt more human and more personally meaningful, because they were grounded in real human experiences.</p>
<h2 id="heading-the-power-of-qualitative-research-for-quant-uxrs"><strong>The power of qualitative research for Quant UXRs</strong></h2>
<p>Qual research has been instrumental in elevating my work as a Quant UXR. By delving into the nuances of user experience, it has enabled me to reframe research questions, develop holistic research strategies, and channel user empathy for greater outcomes. That, in turn, empowers me to deliver more impactful insights and to influence stakeholders to prioritize the most critical opportunities with conviction.</p>
<h3 id="heading-reframing-research-questions"><strong>Reframing research questions</strong></h3>
<p>In quant research, I often started with preset questions. However, qual research taught me the importance of understanding the user's world first. <strong>By listening to users' unfiltered experiences and identifying pain points they might not articulate in a survey, I am able to challenge assumptions and uncover deeper issues.</strong></p>
<p>For example, when I worked on an educational product that connects English language learners with native English-speaking tutors through 1-on-1 online lessons, one data analysis revealed that despite offering free trial lessons to all new users, free trial lesson usage was low. The growth product team initially focused on optimizing the free trial's visibility on the homepage. However during an international research trip, we observed new users struggling to understand how to use the product. This led us to reframe the problem from "why aren't new users using free trial lessons?" to "how can we improve new user comprehension during onboarding?" The team shifted its focus to educating new users on how to navigate the product, and take key actions earlier in their learning journey, which resulted in a more significant impact on free trial lesson adoption, and ultimately increased the conversion rate.</p>
<h3 id="heading-developing-holistic-research-strategies"><strong>Developing holistic research strategies</strong></h3>
<p>Leading qual studies myself, as a Quant UXR, allowed me to strategically design research that investigated the "why" behind my statistical findings. <strong>Rather than participating in other qual UXRs’ work and following someone else's research protocol, I could craft interview questions that directly probed into patterns I'd identified in my data, creating a powerful feedback loop between methods.</strong></p>
<p>For example, while my quant research report suggested that English language learners highly value access to native English-speaking tutors, the product team, based in Silicon Valley, initially prioritized AI technology in our new user acquisition strategy. When I took the team on an international research trip and immersed ourselves in non-English-speaking contexts, the team realized that many learners faced significant barriers to accessing native speakers, far greater than the team had initially imagined. One learner poignantly expressed, "Before discovering your product, learning English meant waiting 45 minutes in a classroom for a chance to speak one sentence with my English teacher."</p>
<p>Armed with both the statistical evidence from the quant report and these powerful firsthand experiences, my team was compelled to reevaluate our strategy and re-emphasize the value of native speaker access in our user acquisition efforts.</p>
<h3 id="heading-channeling-user-empathy-for-greater-outcomes"><strong>Channeling user empathy for greater outcomes</strong></h3>
<p>Perhaps, most significantly, qualitative research is pivotal in cultivating a shared, profound empathy for users among myself and stakeholders. <strong>By directly engaging with the people we design for, we collectively gain a deeper understanding of their motivations and frustrations — and the often unspoken anxieties that drive their behavior.</strong></p>
<p>For example, in my survey study, we discovered that many English language learners in the APAC region highly valued the curriculum during lessons with native English-speaking tutors. Initially, our takeaway was that APAC learners have a tendency to prefer structured lessons over free-form speaking. However, after conducting interviews with learners, we uncovered a deeper insight: structured lessons with a pre-selected curriculum served as an avoidance mechanism for these learners’ anxiety, alleviating the dreaded fear of running out of topics to discuss with their tutor — a fear that significantly impacted their learning experience! This insight helped our product team not only build more experiences that allowed structured English lessons, but also explore other ways they could address the more impactful pain point and opportunity — helping reduce learner anxiety.</p>
<h2 id="heading-tips-for-getting-started-with-qual-work"><strong>Tips for getting started with qual work</strong></h2>
<p>For Quant UXRs hesitant to dive into qual work, if you’ve read this far, I hope you are feeling more ready to take on the new challenge! Here are a few tips to get you started:</p>
<ul>
<li><p><strong>Get exposure</strong></p>
<ul>
<li><p>Start small: observe qualitative interviews</p>
</li>
<li><p>Find at least one qual researcher as your mentor (I got really lucky on this one!)</p>
</li>
</ul>
</li>
<li><p><strong>Change mindset/attitude</strong></p>
<ul>
<li><p>Approach individuals with curiosity and an open mind, recognizing that even small samples can offer valuable insights and unexpected discoveries</p>
</li>
<li><p>View it as diversifying the tools in your toolkit with new ones, not replacing them</p>
</li>
</ul>
</li>
<li><p><strong>Take action</strong></p>
<ul>
<li><p>Practice active listening</p>
</li>
<li><p>Ask open-ended questions</p>
</li>
<li><p>Anytime you write a survey, before going live, pre-test it. Do this with a few users or coworkers live (even friends or family!), using a <a target="_blank" href="https://www.lyssna.com/guides/think-aloud-protocol/">think aloud protocol</a></p>
</li>
<li><p>Ask for opportunities to conduct lightweight user interviews (for more references on how to conduct user interviews, <a target="_blank" href="https://portigal.com/Books/interviewing-users-2/">Interviewing Users: How to Uncover Compelling Insights (2nd Edition)</a>, by Steve Portigal, is a great book)</p>
</li>
</ul>
</li>
</ul>
<h2 id="heading-ux-managers"><strong>UX managers?</strong></h2>
<p>Give your Quant UXRs room for qualitative work! It’s not something they do as a <em>distraction</em> from Quant UX research. Rather, it is a <em>core part</em> of great research, regardless of role.</p>
<h2 id="heading-the-evolving-research-identity"><strong>The evolving research identity</strong></h2>
<p>My research identity is no longer about being "quant" or "qual", because these are tools. <strong>It's about being a</strong> <a target="_blank" href="https://chiefexecutive.net/ideo-ceo-tim-brown-t-shaped-stars-the-backbone-of-ideoaes-collaborative-culture__trashed/"><strong>T-shaped</strong></a><strong>,</strong> <a target="_blank" href="https://mitsloan.mit.edu/ideas-made-to-matter/3-traits-entrepreneurial-mindset"><strong>entrepreneurial-minded</strong></a> <strong>UXR</strong> who specializes in quant research methods (the vertical stroke of the “T”, the depth of my quant tools), who is also experienced in crafting and executing qual research (the horizontal stroke of the “T”, the breadth of my research toolkit), AND who is willing to do whatever it takes to get the insight needed to understand complex users and businesses, and to explore opportunities at their intersection.</p>
<p><em>This essay was inspired by the many many long conversations I had with Gabe Trinofi, Altay Sendil, Cassandra Rowe, Chris Chapman, Scott Tong and Omar Seyal, and is dedicated to all the qualitative UXRs whose work inspired me to be a better researcher.</em></p>
<p><em>Special shout-out to Altay Sendil for his meticulous edits, and Jeff Miao for his creative illustration!</em></p>
]]></content:encoded></item><item><title><![CDATA[Individual Scores in Choice Models, Part 3: Respondent Segments]]></title><description><![CDATA[This is Part 3 of a series examining how to work in R with individual-level estimates for preferences. We’re using real data from a MaxDiff survey on the preferences of N=308 respondents interested to take classes from the Quant UX Association.
If yo...]]></description><link>https://quantuxblog.com/individual-scores-in-choice-models-part-3-respondent-segments</link><guid isPermaLink="true">https://quantuxblog.com/individual-scores-in-choice-models-part-3-respondent-segments</guid><category><![CDATA[mrx]]></category><category><![CDATA[R Language]]></category><category><![CDATA[quantux]]></category><category><![CDATA[#Ux research]]></category><category><![CDATA[surveys]]></category><category><![CDATA[maxdiff]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Thu, 07 Nov 2024 16:18:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/i3EjqaHIyO0/upload/4a094ff832eee14613825fa0b54e20de.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is <strong>Part 3 of a series</strong> examining how to work in R with individual-level estimates for preferences. We’re using real data from a <a target="_blank" href="https://quantuxblog.com/easy-maxdiff-in-r">MaxDiff survey</a> on the preferences of N=308 respondents interested to take <a target="_blank" href="https://www.quantuxcon.org/classes">classes</a> from the Quant UX Association.</p>
<p>If you haven’t read Posts 1 and 2, you’ll want to catch up on the background and <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages">basic data in Post 1</a>, and on the question of correlational patterns of interest in <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-2-correlations-among-items">Post 2</a>. As usual, I illustrate everything with R code and will discuss the code as we go, with all of the code compiled at the end.</p>
<p>The topic of this Part 3 is this: are there <strong>clusters (segments)</strong> of <em>respondents</em> who have similar patterns in their interest in classes? As my coauthors and I discuss in the <a target="_blank" href="https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155">R book</a> and the <a target="_blank" href="https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194">Python book</a>, the process of finding segments in real data is complex and there is no magic answer. Segmentation is an iterative process of finding some <strong>number of segments</strong> and seeing whether those segments are <strong>useful and interpretable</strong> for your business need.</p>
<p>In this post, I discuss <strong>one way</strong> to find the number of segments and to visualize the results … with a big caveat: there is much more to say! This is <em>one</em> <em>illustration</em> and is <em>not a prescription</em>. No blog post can give a complete recipe for clustering. For comprehensive discussion and learning, see “Learning More” at the end of this post. Also stay tuned for a future “Segmentation FAQ” blog post soon!</p>
<p><strong>Warning: long post</strong>! I like talking about segmentation (both pro and con) :) Like many posts in this blog, this is intended to complement other writings. It complements the <a target="_blank" href="https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155">R book</a> and <a target="_blank" href="https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194">Python book</a> for conjoint, and the <a target="_blank" href="https://www.amazon.com/Quantitative-User-Experience-Research-Understanding/dp/1484292677">Quant UX book</a> for MaxDiff.</p>
<hr />
<h3 id="heading-first-get-the-data">First, Get the Data</h3>
<p>You can read <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages">Posts 1</a> and <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-2-correlations-among-items">2</a> for background. But if you’re jumping in here, the following R code will load the data. These are N=308 responses estimating researchers’ levels of interest in Quant UX classes, from a MaxDiff survey.</p>
<p>You can get the data with:</p>
<pre><code class="lang-r"><span class="hljs-comment"># get the data; repeating here for blog post 3, see post 1 for details</span>
<span class="hljs-keyword">library</span>(openxlsx) <span class="hljs-comment"># install if needed</span>
md.dat &lt;- read.xlsx(<span class="hljs-string">"https://quantuxbook.com/misc/QUX%20Survey%202024%20-%20Future%20Classes%20-%20MaxDiff%20Individual%20raw%20scores.xlsx"</span>)   <span class="hljs-comment"># </span>
md.dat$Anchor &lt;- <span class="hljs-literal">NULL</span>
names(md.dat)[<span class="hljs-number">3</span>:<span class="hljs-number">16</span>] &lt;- c(<span class="hljs-string">"Choice Models"</span>,  <span class="hljs-string">"Surveys"</span>,       <span class="hljs-string">"Log Sequences"</span>,    <span class="hljs-string">"Psychometrics"</span>, 
                         <span class="hljs-string">"R Programming"</span>,  <span class="hljs-string">"Pricing"</span>,       <span class="hljs-string">"UX Metrics"</span>,       <span class="hljs-string">"Bayes Stats"</span>,
                         <span class="hljs-string">"Text Analytics"</span>, <span class="hljs-string">"Causal Models"</span>, <span class="hljs-string">"Interviewer-ing"</span>,  <span class="hljs-string">"Advanced Choice"</span>, 
                         <span class="hljs-string">"Segmentation"</span>,   <span class="hljs-string">"Metrics Sprints"</span>)
classCols &lt;- <span class="hljs-number">3</span>:ncol(md.dat)    <span class="hljs-comment"># generally, Sawtooth exported utilities start in column 3</span>
</code></pre>
<p>I recommend following along live in R, installing packages as needed. All of the R code is given as I go, and it is also compiled at the end of the post.</p>
<hr />
<h3 id="heading-background-general-clustering-vs-latent-class-analysis">Background: General Clustering vs Latent Class Analysis</h3>
<p>At a high level, there are two approaches to clustering/segmenting data from choice surveys.</p>
<p>One approach is to estimate respondents' preferences first as a whole sample, and then apply <strong>general clustering</strong> methods <em>later</em> to those estimates to find segments. My experience suggests that a majority of analysts use this approach for several reasons. First, it is easiest because it divides analysis into tidy stages. Second, you can use the widest range of tools and methods. Third, it does not require specialized software for the clustering portion, but uses general methods from R, Python, or wherever. Fourth, the approach can be used with any kind of data and with mixed types of data.</p>
<p>The second approach is to model segments <em>at the same time</em> that you estimate individuals’ preferences. This is the approach taken by <strong>latent class analysis</strong> (LCA), as applied to choice surveys. The primary advantage of LCA is that, by doing both steps together (estimation + segmentation), it is more robust to sparse data and noise in the data. On the other hand, LCA requires more specific assumptions about the data, along with specialized tools.</p>
<p>In the choice modeling community, the most common tools used for choice model LCA are Statistical Innovation’s <em>Latent Gold</em>, Sawtooth Software’s <em>Lighthouse Studio</em>, and proprietary code bases within analytics firms. These may add cost and complexity to analyses, while limiting the generality of what you can do. By contrast, general clustering methods are widely available in R and other platforms. (Note that choice model LCA is possible in <strong>R</strong> … but the effort is beyond reasonable levels for general practitioners. FWIW, I personally <em>avoid</em> doing choice model LCA in R.)</p>
<p><strong>In this article, I demonstrate the general clustering approach</strong> … with one note: if you <em>frequently</em> need to cluster <em>choice model</em> data, you should check out Sawtooth Software’s Lighthouse Studio and/or Statistical Innovations’ Latent Gold. Their tailored functionality can be helpful, efficient, and more precise for practitioners. For my part, I often use LCA in Sawtooth’s Lighthouse Studio. However, I <em>also</em> often use general clustering, as this post illustrates.</p>
<p>In short, although I use choice survey data to demonstrate the methods here, the methods themselves are general. You can apply the tools and code in this post to a wide variety of data sources.</p>
<hr />
<h3 id="heading-finding-a-number-of-segments-part-1">Finding a Number of Segments: Part 1</h3>
<p><strong>The first question in segmentation is this: are there segments at all?</strong> And if so, how many?</p>
<p>A <em>reasonable assumption</em> — especially in high dimensional space, i.e., in data with many variables — <em>is that people do</em> not <em>cluster together</em>, and that instead, they differ from one another in unstructured (“random”) ways. Some jargon for that is that the data exhibit a <em>multivariate normal distribution (MVN) with one component</em>. In other words, there is a single “segment” and every respondent is a member!</p>
<p>My take — see the “clustering” chapters in the <a target="_blank" href="https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155">R book</a> and the <a target="_blank" href="https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194">Python book</a> — is that MVN data with 1 cluster is quite often the “best” statistical answer … but the <strong>number of segments is not primarily a statistical question</strong>. There are hundreds of methods to cluster data and they typically give different answers. The question is whether we can find segments that are <em>useful and informative</em> for a business question, while secondarily optimizing statistically.</p>
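<p>To make that concrete, here is a small cautionary sketch (my addition, with arbitrary simulation settings): even data drawn from a single multivariate normal component, with no clusters at all, will be split into however many “segments” you ask a partitioning algorithm for.</p>
<pre><code class="lang-r"># a cautionary sketch: structureless MVN data still "splits" if you ask it to
library(MASS)                                              # install if needed
set.seed(98103)
fake &lt;- mvrnorm(308, mu = rep(0, 14), Sigma = diag(14))   # 1-component MVN, same shape as our data
table(kmeans(fake, centers = 5)$cluster)                   # k-means happily returns 5 "segments" anyway
</code></pre>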
<p>There are several ways we might inspect the data to see whether it appears to have clusters. One way is a <strong>model-based</strong> approach that estimates how many clusters may underlie the data. The <code>mclust</code> library is one of my favorites. In R we can fit a clustering (segmentation) model to the MaxDiff data as follows:</p>
<pre><code class="lang-r"><span class="hljs-comment"># mclust uses model-based fit estimates to determine best number of latent clusters</span>
<span class="hljs-keyword">library</span>(mclust)    <span class="hljs-comment"># install if needed</span>
md.mc &lt;- Mclust(md.dat[ , classCols])
summary(md.mc)
</code></pre>
<p>The result is:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729526558978/24be6dc9-1825-4ccf-8bd4-cee8baea4703.png" alt="results of the mclust summary as produced by the code, with 1 cluster" class="image--center mx-auto" /></p>
<p>It found a multivariate normal distribution with one component. In other words, the best fit is <strong>1 segment</strong>.</p>
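<p>As an optional follow-up (my addition, not needed for the main analysis), you can check how decisive that answer was: <code>mclust</code> keeps the BIC values for every candidate solution, so you can plot them or list the top few.</p>
<pre><code class="lang-r"># optional check: how decisive was the 1-component answer?
plot(md.mc, what = "BIC")   # BIC by number of components and covariance model
summary(md.mc$BIC)          # top candidate solutions ranked by BIC
</code></pre>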
<p>Another approach is <strong>visual</strong>. We could use a dimensional reduction approach to represent the data in a visually suitable number of dimensions — 2 or 3 dimensions — and then inspect it to see whether there is some obvious clustering.</p>
<p>There are many dimensional reduction models but a modern starting point is UMAP (<em>uniform manifold approximation and projection</em>). UMAP attempts to “project” (i.e., re-estimate) data in a lower number of nonlinear dimensions (“uniform manifold”) while also preserving (“approximating”) the structure of the complete data. (For an introduction, see Coenen and Pearce on <a target="_blank" href="https://pair-code.github.io/understanding-umap/">“Understanding UMAP” here</a>.)</p>
<p>We can visualize the MaxDiff class interest data with UMAP as follows. First we fit a UMAP projection to the data, setting a random seed to make it reproducible:</p>
<pre><code class="lang-r"><span class="hljs-keyword">library</span>(umap)
<span class="hljs-comment"># get umap representation of the individual interest data</span>
umap.config &lt;- umap.defaults
umap.config$random_state &lt;- <span class="hljs-number">98101</span>

umap.sol &lt;- umap(md.dat[ , classCols], config = umap.config)
umap.dat &lt;- data.frame(umap.sol$layout)
<span class="hljs-comment"># review the kind of data we have</span>
str(umap.dat)
</code></pre>
<p>Using <code>str()</code> we see that the data now has 308 observations reduced to 2 dimensions, <code>X1</code> and <code>X2</code>. The particular values don’t really matter in themselves; it is the <em>relationships</em> among the values that reproduce the relative structure of the observations vs. one another:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729633063693/8967ace8-be13-4e47-9b42-03817be3f101.png" alt="data from UMAP, a data frame with values in 2 dimensions (columns)" class="image--center mx-auto" /></p>
<p>Now we can plot the UMAP points just like any other X / Y data using a simple scatterplot of the points:</p>
<pre><code class="lang-r"><span class="hljs-comment"># plot it</span>
ggplot(umap.dat, aes(x = X1, y = X2)) +
  geom_point() +
  xlab(<span class="hljs-string">"High order dimension 1"</span>) + ylab(<span class="hljs-string">"High order dimension 2"</span>) +
  theme_minimal()
</code></pre>
<p>Here’s the chart:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729633199019/380c3364-2c31-425c-b940-132efe12388d.png" alt="an ellipsoidal plot of the points that is very smoothly round with no obvious clustering from initial visual inspection" class="image--center mx-auto" /></p>
<p><strong>This chart shows no obvious “clusters” of respondents</strong>. It is slightly denser in some areas than others but overall has an appearance of being smoothly round without separation among groups. This appearance is not a statistical test, but it is congruent with the idea that there are not multiple obvious segments but 1 segment of everyone.</p>
<p>So far we are striking out on finding segments. The data set strongly appears to come from a single multivariate normal population (i.e., from one segment).</p>
<p>But remember I mentioned that there are hundreds of clustering methods? Let’s keep prodding and see if we can find something else that is useful!</p>
<hr />
<h3 id="heading-finding-a-number-of-segments-part-2">Finding a Number of Segments: Part 2</h3>
<p>Another R library that suggests the number of clusters in data is <code>NbClust</code>. It implements 8 clustering methods with a choice of 6 different distance (dissimilarity) metrics applicable to each one. After you select a distance metric and clustering method, it reports 30 heuristics (sometimes fewer, for reasons I’ll skip) that each suggest the number of clusters.</p>
<p>That’s easier to understand from an example:</p>
<pre><code class="lang-r"><span class="hljs-keyword">library</span>(NbClust)
md.nbc &lt;- NbClust(md.dat[ , classCols],
                  min.nc = <span class="hljs-number">2</span>, max.nc = <span class="hljs-number">10</span>,
                  method = <span class="hljs-string">"ward.D"</span>, 
                  distance = <span class="hljs-string">"euclidean"</span>)
</code></pre>
<p>In this code, I ask it to consider anywhere from 2 to 10 clusters (<code>min.nc</code>, <code>max.nc</code>). For a distance metric, I use the classic Euclidean metric (i.e., triangular hypotenuse, except extended to many dimensions), and cluster them with the “Ward’s D” method (see <code>?NbClust</code> for details).</p>
<p>Here are the results, first for 24 methods that give textual results, and then for 2 methods that give graphical results:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729634680960/67d11ca9-89cf-4cab-95a2-e483f480de41.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729634713712/d59b50ba-6c9c-478c-be80-63dd6cd79a43.png" alt class="image--center mx-auto" /></p>
<p>The textual results say that a “winner takes all” vote by the methods indicates <strong>2 clusters</strong> (as chosen by 9 out of 24 methods). However, almost as many methods (7 out of 24) suggest <strong>5 clusters</strong>. In the graphical results, both charts suggest 2 clusters as the best fit; and one of them suggests 5 clusters as a secondary possibility.</p>
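<p>If you prefer to tally that vote yourself rather than reading it off the console output, here is a quick sketch (my addition, using the <code>$Best.nc</code> slot that <code>NbClust</code> returns; its first row holds each index’s suggested number of clusters):</p>
<pre><code class="lang-r"># a quick sketch: tabulate the per-index recommendations from NbClust
table(md.nbc$Best.nc["Number_clusters", ])
</code></pre>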
<p>So, should we choose 2 clusters? In practice, I believe <strong>it is rarely useful or desired (by stakeholders) to discuss only 2 clusters</strong> — unless we are strictly speaking of likelihood of a binary outcome (such as purchase behavior). For purposes of <em>understanding</em> customers and users, binary options are rarely illuminating.</p>
<p>Thus, instead of 2 clusters, let’s look at the “second best” option of 5 clusters. What would that solution tell us?</p>
<p>To begin with, we’ll impose a 5 cluster solution. There are many ways to do that, but in this case we can simply run <code>NbClust()</code> again, setting the minimum number to 5:</p>
<pre><code class="lang-r"><span class="hljs-comment"># we'll try a 5 cluster solution instead of 2</span>
md.nbc &lt;- NbClust(md.dat[ , classCols],
                  min.nc = <span class="hljs-number">5</span>, max.nc = <span class="hljs-number">10</span>,
                  method = <span class="hljs-string">"ward.D"</span>, 
                  distance = <span class="hljs-string">"euclidean"</span>)

md.segs &lt;- md.nbc$Best.partition
table(md.segs)
proportions(table(md.segs))
</code></pre>
<p>This code runs <code>NbClust</code> again, with a minimum solution of 5 clusters (which is then found to be the best option with that limitation). We extract the segment assignment for each respondent (<code>$Best.partition</code>) and save the assignments as <code>md.segs</code>. The final two lines show the counts of segment assignments and their relative proportions:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729641005386/8c784c8c-181e-4656-8d67-3d6911d29b1c.png" alt class="image--center mx-auto" /></p>
<p>The smallest segment (segment 5) has 10% of respondents, while the largest (segment 1) has 28%. That is a reasonable spread. It does not suffer the common problem where one or two segments hold a very small proportion of respondents, such as only 2% or 5%.</p>
<p>Next we’ll see whether the segments differ in <em>interpretable</em> ways.</p>
<hr />
<h3 id="heading-visualizing-segments-in-dimensional-space">Visualizing Segments in Dimensional Space</h3>
<p>Remember the UMAP chart above? Now that we have segment assignments, we can add those to the chart. Are there visual patterns when the segments are plotted in the higher order dimensions found by UMAP?</p>
<p>Here’s the code. It’s the same as above, except it now colors the points according to which segment each respondent is in (<code>md.segs</code>):</p>
<pre><code class="lang-r"><span class="hljs-comment"># add the 5-segment membership to the UMAP data</span>
umap.dat$Segment &lt;- factor(paste0(<span class="hljs-string">"S"</span>, md.segs))

<span class="hljs-comment"># plot it with segments</span>
ggplot(umap.dat, aes(x = X1, y = X2, colour = Segment)) +
  geom_point() +
  xlab(<span class="hljs-string">"High order dimension 1"</span>) + ylab(<span class="hljs-string">"High order dimension 2"</span>) +
  theme_minimal() +
  ggtitle(<span class="hljs-string">"Dimensional map with 5 segments"</span>)
</code></pre>
<p>Here’s the result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729641289278/029031e1-6bba-44c2-b7b9-0789b2b1ab18.png" alt="the same UMAP chart as above, but the respondents are colored by segment assignment. There are 5 clear slices of color." class="image--center mx-auto" /></p>
<p><strong>Whoa</strong>! This is a chart to make a Quant UXer very happy! In it, there are five extremely clear slices of the plot. Almost all of the members of each segment are close to others in the same segment — which is to say that the segments map very cleanly to the higher order dimensions of the data. That suggests that we may see real differences when we examine those dimensions (interest in different classes).</p>
<p>But first, I hope you’re thinking to yourself, “<em>Wait, am I fooling myself</em>? Maybe the colors being close together doesn’t mean much.” Good question!</p>
<p>Let’s double-check ourselves by comparing what the UMAP chart would look like if the segments were <em>randomly</em> assigned. To do that, we shuffle the same set of assigned segment memberships (using <code>sample()</code>) such that they are randomly assigned to different rows of the UMAP data, and plot them again:</p>
<pre><code class="lang-r"><span class="hljs-comment"># make sure we're not fooling ourselves ...</span>
<span class="hljs-comment"># what would it look like if the segments were random?</span>
set.seed(<span class="hljs-number">98107</span>)
<span class="hljs-comment"># the same segments but in random order</span>
umap.dat$RndSeg &lt;- sample(umap.dat$Segment)
<span class="hljs-comment"># plot those</span>
ggplot(umap.dat, aes(x = X1, y = X2, colour = RndSeg)) +
  geom_point() +
  xlab(<span class="hljs-string">"High order dimension 1"</span>) + ylab(<span class="hljs-string">"High order dimension 2"</span>) +
  theme_minimal()+
  ggtitle(<span class="hljs-string">"Dimensional map with RANDOM segments"</span>)
</code></pre>
<p>The result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729641899282/ba88dc56-c552-4ca6-b4c1-b6f08e03be5d.png" alt="same chart as above, except the 5 segments have been assigned randomly as a permutation of the &quot;real&quot; segment assignments. Now the chart shows a random mix of colors with no segment structure." class="image--center mx-auto" /></p>
<p>In this chart, there is no pattern to the colors for segment assignment. This tells us that the extremely clear pattern shown in the previous chart reflects real structure of the segments, relative to the data; the appearance was not simply an accident.</p>
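<p>If you want a number to go along with the visual check, here is a small sketch (my addition, reusing the <code>umap.dat</code> and <code>md.segs</code> objects above). It compares the average distance among points that share a segment label, for the real assignments versus the shuffled ones; the real segments should sit much closer together.</p>
<pre><code class="lang-r"># a numeric companion to the visual check (a sketch, not part of the original analysis)
# mean pairwise distance among points that share a segment label
within.dist &lt;- function(xy, segs) {
  mean(unlist(lapply(split(xy, segs), dist)))
}
within.dist(umap.dat[ , c("X1", "X2")], umap.dat$Segment)  # real segments: relatively small
within.dist(umap.dat[ , c("X1", "X2")], umap.dat$RndSeg)   # shuffled segments: much larger
</code></pre>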
<p><strong>In short, the five segments align strongly with real properties of the data</strong>. Now the question becomes: <em>is that alignment interpretable and useful</em>? I look at that question next.</p>
<hr />
<h3 id="heading-visualize-segments-with-a-heat-map">Visualize Segments with a Heat Map</h3>
<p>In the above analyses, we found — or, if you prefer, “imposed” — segment assignments that show statistical differentiation (the assessment heuristics in <code>NbClust</code>) and that align with the high dimensional data (the segment visual differences in UMAP). But are those results merely statistical or are they strong enough to be <em>meaningful</em> as a practical matter?</p>
<p>To find out, I’ll plot a <strong>heat map</strong> for class interest, broken out by segment. And, as you’ve probably guessed from my previous posts, I’ll make it a <strong>function</strong> :) With nothing more than minor tweaks (at most), this function will work for segmentation cross-tabs from many different kinds of data sets.</p>
<p>Here’s the function, with comments following below:</p>
<pre><code class="lang-r">seg.heat &lt;- <span class="hljs-keyword">function</span>(dat, segs) {
  <span class="hljs-keyword">library</span>(superheat)   <span class="hljs-comment"># install if needed</span>
  <span class="hljs-comment"># aggregate mean value by segment</span>
  heat.sum &lt;- data.frame(t(aggregate(. ~ segs, data = dat, mean))[-<span class="hljs-number">1</span>, ])
  <span class="hljs-comment"># make the column names reflect the segment number and size (%)</span>
  names(heat.sum) &lt;- paste0(<span class="hljs-string">"S"</span>, <span class="hljs-number">1</span>:max(segs), <span class="hljs-string">" ("</span>, 
                            round(prop.table(table(segs)), <span class="hljs-number">2</span>)*<span class="hljs-number">100</span>, <span class="hljs-string">"%)"</span>)
  <span class="hljs-comment"># draw it with superheat</span>
  superheat(heat.sum, 
            heat.pal = c(<span class="hljs-string">"red3"</span>, <span class="hljs-string">"white"</span>, <span class="hljs-string">"green3"</span>),
            grid.hline.col = <span class="hljs-string">"white"</span>, grid.vline.col = <span class="hljs-string">"white"</span>,
            pretty.order.rows = <span class="hljs-literal">TRUE</span>,
            clustering.method = <span class="hljs-string">"hierarchical"</span>,
            X.text = round(as.matrix(heat.sum), <span class="hljs-number">2</span>),  
            X.text.size = <span class="hljs-number">4</span>, 
            left.label.size = <span class="hljs-number">0.3</span>,
            left.label.text.size = <span class="hljs-number">4</span>,
            bottom.label.text.size = <span class="hljs-number">4</span>,
            legend = <span class="hljs-literal">FALSE</span>)
}
</code></pre>
<p>In this function, I use the <code>superheat</code> package to draw a heat map. There are many alternatives but <code>superheat</code> is straightforward while offering many options. The first step is to use <code>aggregate()</code> to get the cross-tabs of item means broken out by segment. That single line of code also converts the result to a data frame as <code>superheat</code> expects. (FYI, I drop the initial row using <code>[-1, ]</code> in the conversion because that row contains column names rather than data.)</p>
<p>In the next line, I change the <code>names()</code> of the data frame so they start with “S” for segment, followed by the nominal segment “number”. I add a label for the relative <em>percentage</em> of each. That gives nice labels in <code>superheat</code> with the segment name and its size, such as “<em>S1 (28%)</em>”. BTW, if you had friendly names for the segments — such as “R Pirates” or “UX Mavens” — you could add them here.</p>
<p>After that, the final step is to call <code>superheat</code> to draw the heat map. I won’t explain all of the options in that call, but will mention the three most important ones: (1) I set a <strong>color</strong> <strong>palette</strong> that avoids extreme color intensity. (2) I use “<strong>pretty rows</strong>” and clustering to reorder the results to be grouped by similarity, making it easier to read and interpret the segments. (3) I add the <strong>mean values</strong> themselves — rounded off to 2 decimal places — to the cells using <code>X.text</code>. Without that, it would draw the heat colors but not label the values (which is also a perfectly appropriate choice).</p>
<p>We call the <code>seg.heat()</code> function with the MaxDiff estimates and assigned segments:</p>
<pre><code class="lang-r">seg.heat(md.dat[ , classCols], md.segs)
</code></pre>
<p>Here’s the result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729724203602/7638f3bf-0ca8-4473-ad60-a8d62fff08be.png" alt class="image--center mx-auto" /></p>
<p>What do we see? The five segments are strongly differentiated. I suggest you take a minute to interpret the results … before reading further!</p>
<p>What did you find? Here are a few reflections on each segment from my perspective. I tend to interpret segments according to <em>strongest positive interest</em> first.</p>
<ul>
<li><p><strong>Segment 4</strong> stands out strongly for its interest in R and statistical/data analyses. With roughly 25% of the sample, we can consider it to be a “hardcore Quant” or “data science in UX” segment.</p>
</li>
<li><p>After that, <strong>Segment 5</strong> stands out as a “Core UX” segment, strongly interested in classic mixed-methods UX issues, and not very interested in the stats topics.</p>
</li>
<li><p><strong>Segments 1 and 3</strong> are quite similar, where both are interested in survey methods and choice models. The difference is that Segment 3 is also interested in core UX topics whereas Segment 1 is interested in pricing methods and more advanced choice models. Without looking at the data, my guess is that Segment 1 reflects quant researchers working in <em>marketing</em>, whereas Segment 3 has folks with similar skills and interests but working in <em>UX</em>.</p>
</li>
<li><p>Finally, <strong>Segment 2</strong> is not interested in much … except a segmentation class. That’s fine and is a common pattern — respondents who take the chance to communicate strongly about the one thing they want the most.</p>
</li>
</ul>
<p>I will comment about <em>one thing I almost always see in segmentation projects … but do NOT see in these data</em>: there is often a segment that shows almost <strong>no differentiation</strong> of interest in anything. Such a segment is common in consumer data, and quite likely reflects low quality respondents, speeders, and/or those for whom the survey is inappropriate. It often gives a segment with 30% or more of the sample (I’ve seen as high as 60%). However, the present data come from an especially engaged and high quality panel (Quant UX conference attendees!) so they do not have that problem.</p>
<p>Now, is it actionable? What would we do from a business perspective with these data? From the point of view of the Quant UX Association offering classes, here are some thoughts:</p>
<ul>
<li><p>The results <strong>reinforce</strong> the idea (from Post 1) that, although the <em>average</em> interest in (for example) an <strong>R class</strong> is low, there is a core group of people (Segment 4) who are strongly interested. Now we have a good estimate for that interest, as a segment: 24%. (The short code sketch after this list shows where these percentages come from.)</p>
</li>
<li><p>The results also reinforce the interest in a <strong>Segmentation</strong> class. It has strong interest for every segment except Segment 5 … adding up to 90% of the sample.</p>
</li>
<li><p>After that, there is clear interest in <strong>Choice Models</strong> and related topics — in roughly 2nd or 3rd place for three segments — adding up to more than 70% of the sample.</p>
</li>
<li><p><em>Although</em> the size of <strong>Segment 5 is small</strong> (10%), its interest in <strong>UX metrics</strong> topics is extremely clear and crisp. That makes a nice possibility for a clear offering … if the Quant UX Association is able to reach the smallish group in a reasonable way.</p>
</li>
</ul>
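<p>For readers who want to trace those percentages, here is a short sketch (my own aside) using the segment sizes computed earlier:</p>
<pre><code class="lang-r"># where the percentages above come from
seg.pct &lt;- proportions(table(md.segs)) * 100
round(seg.pct)     # size of each segment; Segment 4 is roughly 24%
sum(seg.pct[-5])   # everyone except Segment 5: roughly 90%
</code></pre>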
<hr />
<h3 id="heading-visualizing-individual-distributions-within-a-segment">Visualizing Individual Distributions within a Segment</h3>
<p>Did you <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages">read post 1</a> in this series? Do you recall the chart of the <em>individual distribution</em> of interest for each class? (That was a density chart for each item.)</p>
<p>We can reuse that plot function now, and ask a similar question: <em>within a particular segment, what is the distribution of interest</em>? Is everyone in the segment alike, or are there strong differences?</p>
<p>Why would we ask that? For two reasons: <strong>to understand the details</strong> of the segment better, and <strong>to gain additional insight before we take action</strong>. For example, suppose we want to offer an R class and we believe that would target the 24% of respondents who fall into Segment 4. We might design the class or message it using insight about the <em>other</em> classes of interest to that segment (such as choice models and psychometrics).</p>
<p>However, if we design or message the class in that way, we would be well-advised not to choose something <strong>polarizing</strong>. It could be that some topic scores high on average with the segment and yet that is driven by a small group. We might be served better by choosing something that is lower on average but is more broadly acceptable. The distribution plot can help us to evaluate such questions.</p>
<p>As presented in <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages">Post #1</a> in this series, here is the function to make an individual-level distribution chart. This is just a repeat from that post (<em>see there for explanation</em>):</p>
<pre><code class="lang-r">cbc.plot &lt;- <span class="hljs-keyword">function</span>(dat, itemCols=<span class="hljs-number">3</span>:ncol(dat), 
                     title = <span class="hljs-string">"Preference estimates: Overall + Individual level"</span>, 
                     meanOrder=<span class="hljs-literal">TRUE</span>) {

  <span class="hljs-comment"># get the mean points so we can plot those over the density plot</span>
  mean.df &lt;- lapply(dat[ , itemCols], mean)

  <span class="hljs-comment"># melt the data for ggplot</span>
  <span class="hljs-keyword">library</span>(reshape2)
  <span class="hljs-comment">#                       vvvv  assumes Sawtooth order;       vvv  (ID in col 1, remove RLH in col 2)</span>
  plot.df &lt;- melt(dat[, c(<span class="hljs-number">1</span>, itemCols)], id.vars=names(dat)[<span class="hljs-number">1</span>])

  <span class="hljs-comment"># get the N of respondents so we can set an appropriate level of point transparency</span>
  p.resp  &lt;- length(unique(plot.df[ , <span class="hljs-number">1</span>]))

  <span class="hljs-comment"># optionally and by default order the results not by column but by mean value</span>
  <span class="hljs-comment"># because ggplot builds from the bottom, we'll reverse them to put max value at the top</span>
  <span class="hljs-comment"># we could use fct_reorder but manually setting the order is straightforward in this case</span>
  <span class="hljs-keyword">if</span> (meanOrder) {
    plot.df$variable &lt;- factor(plot.df$variable, levels = rev(names(mean.df)[order(unlist(mean.df))]))
  }

  <span class="hljs-comment">#### Now : Build the plot</span>
  <span class="hljs-comment"># set.seed(ran.seed)   # optional; points are jittered; setting a seed would make them exactly reproducible</span>
  <span class="hljs-keyword">library</span>(ggplot2)
  <span class="hljs-keyword">library</span>(ggridges)

  <span class="hljs-comment"># build the first layer with the individual distributions</span>
  p &lt;- ggplot(data=plot.df, aes(x=value, y=variable, group=variable)) +
    geom_density_ridges(scale=<span class="hljs-number">0.9</span>, alpha=<span class="hljs-number">0</span>, jittered_points=<span class="hljs-literal">TRUE</span>,
                        rel_min_height=<span class="hljs-number">0.005</span>,
                        position=<span class="hljs-string">"points_sina"</span>,
                        <span class="hljs-comment"># set individual point alphas in inverse proportion to sample size</span>
                        point_color = <span class="hljs-string">"blue"</span>, point_alpha=<span class="hljs-number">1</span>/sqrt(p.resp),
                        point_size=<span class="hljs-number">2.5</span>) +
    <span class="hljs-comment"># reverse y axis to match attribute order from top</span>
    scale_y_discrete(limits=rev) +                                                   
    ylab(<span class="hljs-string">"Level"</span>) + xlab(<span class="hljs-string">"Relative preference (blue=individuals, red=average)"</span>) +
    ggtitle(title) +
    theme_minimal()

  <span class="hljs-comment"># now add second layer to plot with the means of each item distribution</span>
  <span class="hljs-keyword">for</span> (i <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:length(mean.df)) {
    <span class="hljs-keyword">if</span> (meanOrder) {
      <span class="hljs-comment"># if we're drawing them in mean order, get the right one same as above</span>
      p &lt;- p + geom_point(x=mean.df[[rev(order(unlist(mean.df)))[i]]], 
                          y=length(mean.df)-i+<span class="hljs-number">1</span>, colour=<span class="hljs-string">"tomato"</span>,      <span class="hljs-comment"># adjust y axis because axis is reversed above</span>
                          alpha=<span class="hljs-number">0.5</span>, size=<span class="hljs-number">2.0</span>, shape=<span class="hljs-number">0</span>, inherit.aes=<span class="hljs-literal">FALSE</span>)

    } <span class="hljs-keyword">else</span> {
      p &lt;- p + geom_point(x=mean.df[[i]], 
                          y=length(mean.df)-i+<span class="hljs-number">1</span>, colour=<span class="hljs-string">"tomato"</span>,      <span class="hljs-comment"># adjust y axis because axis is reversed above</span>
                          alpha=<span class="hljs-number">0.5</span>, size=<span class="hljs-number">2.0</span>, shape=<span class="hljs-number">0</span>, inherit.aes=<span class="hljs-literal">FALSE</span>)
    }
  }
  p
}
</code></pre>
<p>We can reuse that function exactly as is, except we plot only the respondents in Segment 4 by subsetting the data appropriately (<code>md.segs == 4</code>):</p>
<pre><code class="lang-r">cbc.plot(md.dat[md.segs == <span class="hljs-number">4</span>, ], itemCols = classCols) + 
  ylab(<span class="hljs-string">"Quant Course Offering"</span>) +
  ggtitle(<span class="hljs-string">"Class interest: Segment 4 Only (R)"</span>)
</code></pre>
<p>The result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729725989889/682385fe-2f21-433f-9657-9e250fb18dd6.png" alt class="image--center mx-auto" /></p>
<p>In this chart we see the strong interest in R programming again: not only is it the #1 item on average, but there is also a notable group of individuals with exceptional interest on the right-hand side of the distribution.</p>
<p>After that, <strong>Choice Modeling</strong> appears to be a good secondary emphasis — not only is it #2 in position, but the distribution is tightly grouped and <strong>non-polarizing</strong>. Every respondent had interest greater than 0 (the “anchor” position for positive interest). Psychometrics, Causal Models, and Advanced Choice are all similar.</p>
<p>Farther down, we see that Metrics Sprints, Surveys, and Pricing are quite <strong>polarizing</strong> within this segment: a few people show strong interest, but others show strong disinterest.</p>
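<p>One quick numeric check of that (my own aside, not part of the original analysis) is to compare each item’s mean within the segment to the share of segment members whose score is above the anchor point of 0:</p>
<pre><code class="lang-r"># within Segment 4: mean interest vs. breadth of interest for each item;
# a high mean with a low "share positive" is a hint that the item is polarizing
seg4 &lt;- md.dat[md.segs == 4, classCols]
data.frame(mean         = round(colMeans(seg4), 2),
           pct.positive = round(colMeans(seg4 &gt; 0) * 100))
</code></pre>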
<p>From a business perspective, that reinforces the idea that we might offer an R class with a secondary emphasis on choice modeling (<em>like this post!</em>) or psychometrics. It also suggests avoiding a few polarizing topics; they would lack appeal for too many participants (assuming our strategy is to go after the general audience).</p>
<p>Again, <strong>the key point of such deep dives is to understand the preference structure in more detail, so we can consider actions</strong> in design and/or promotion of our offerings. In this case, it could help with constructing the content of a class, with targeting respondents who are likely to be interested in it, and with describing it.</p>
<p><em>This is a good time to add one important caveat</em> to all of this, an obvious point to researchers and yet one that is easy to forget: <em>such results come from a particular sample, and we have to limit our interpretation to that sample (or its implied population)</em>. In this case, the results come from attendees of Quant UX Con … and Quant UX Con promoted many of the same topics (R, choice models, metrics, and the like). So to some extent it is no surprise that this sample would have these patterns of interest! Yet, at the same time, the data deepen our understanding of the sample, and our confidence in potential strategic actions.</p>
<hr />
<h3 id="heading-but-again-there-is-no-magic-segmentation">But Again: There is no Magic Segmentation</h3>
<p>Let’s take a look again at a fact I mentioned earlier: there are hundreds of clustering methods — multiplied by dozens or even hundreds of ways to apply each of them. For example, with each one we might select among multiple distance metrics for observation similarity, or among multiple ways to choose the number of clusters.</p>
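<p>As a small illustration of how much those choices matter (my own aside, not part of the main analysis), here is the same hierarchical algorithm run with two different distance metrics, cross-tabbing the resulting 5-cluster partitions:</p>
<pre><code class="lang-r"># same algorithm, different distance metric, different partition
d.euc &lt;- dist(md.dat[ , classCols], method = "euclidean")
d.man &lt;- dist(md.dat[ , classCols], method = "manhattan")
seg.euc &lt;- cutree(hclust(d.euc, method = "ward.D"), k = 5)
seg.man &lt;- cutree(hclust(d.man, method = "ward.D"), k = 5)
table(seg.euc, seg.man)   # typically not a clean one-to-one correspondence
</code></pre>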
<p>To take one example, we have already seen the <code>mclust</code> package that finds model-based multivariate normal mixtures (i.e., clusters). As we saw above, it suggests that the data represent 1 cluster. However, we can force it to fit a 5-cluster model by setting the parameter <code>G</code>:</p>
<pre><code class="lang-r"><span class="hljs-comment"># we could force a 5-segment solution (for example) in Mclust</span>
<span class="hljs-keyword">library</span>(mclust)
md.mc &lt;- Mclust(md.dat[ , classCols], G=<span class="hljs-number">5</span>)
</code></pre>
<p>Now we can compare the solution from <code>mclust</code> to the solution from <code>NbClust</code> above:</p>
<pre><code class="lang-r"><span class="hljs-comment"># compare the mclust solution to the previous NbClust solution</span>
table(md.mc$classification, md.segs)
</code></pre>
<p>Here is the comparison table, which presents <code>NbClust</code> segment assignments in the columns, cross-tabbed by <code>mclust</code> assignments in the rows:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729815694693/e50726bb-8af8-4b21-b49e-d41838225971.png" alt class="image--center mx-auto" /></p>
<p>We can’t compare the segment numbers — such as Segment 1 in each solution — because the assignment labels are nominal and have no order or meaning in themselves. For example, most of the respondents (26/31) who are labeled as “Segment 5” in <code>NbClust</code> fall into the group called “Segment 4” in the <code>mclust</code> solution.</p>
<p>However, if it were the case that only the <em>labels</em> were different, then all of the respondents in one column would fall into a single row. When they don’t, it tells us that respondents assigned to one group in <code>NbClust</code> fall into multiple groups in <code>mclust</code>. In other words, the two methods find different groupings of respondents.</p>
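<p>A tiny made-up example (my own aside) shows the logic: two partitions that differ only in their labels produce a cross-tab in which every column has exactly one non-zero cell.</p>
<pre><code class="lang-r"># made-up example: identical groupings, different labels
a &lt;- c(1, 1, 2, 2, 3, 3)
b &lt;- c(3, 3, 1, 1, 2, 2)   # same grouping as "a", labels permuted
table(a, b)                # every column (and row) has a single non-zero cell
</code></pre>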
<p>With that perspective, <strong>we can easily see signs of substantial disagreement between the two 5-cluster solutions</strong>. For Segments 1, 2, and 3 from <code>NbClust</code> — and Segments 2, 3, and 5 from <code>mclust</code> — fewer than 50% of the observations in any of those segments are assigned to any single group in the other solution. In other words, the solutions disagree on a <em>majority</em> of segment assignments.</p>
<p>We might compute agreement by hand. Here’s one way. Column 4 maps best to the label of “3” in the other solution, capturing 55 respondents. Column 5 maps to “4” (26 respondents). Column 3 <em>would</em> map to “3” but that is already taken by Column 4, so we’ll map it to the second best assignment, “2” (N=21). That leaves Column 1 mapping to “5” (N=24) and Column 2 mapping to “1” (N=23). Adding them up, (55+26+21+24+23) / 308 = <strong>48% raw agreement</strong>. That tells us something, but it would be better to adjust that with respect to chance (for example, if one cluster is large, then agreement could be “high” just by assigning all 100% of respondents to it!)</p>
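<p>Here is that hand computation as a short code sketch (my own addition; the mapping vector simply encodes the column-to-row matches described above, and the result should be roughly 0.48):</p>
<pre><code class="lang-r"># sketch of the by-hand raw agreement; the mapping was chosen by eye, as in the text
conf    &lt;- table(md.mc$classification, md.segs)   # rows = mclust, cols = NbClust
col.map &lt;- c("5", "1", "2", "3", "4")             # NbClust columns 1..5 map to these mclust rows
sum(conf[cbind(col.map, colnames(conf))]) / sum(conf)
</code></pre>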
<p>The <strong>adjusted Rand index</strong> is one method to determine congruence between cluster solutions, relative to base rate. We check that:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Rand index for the degree of agreement between them</span>
adjustedRandIndex(md.mc$classification, md.segs)
</code></pre>
<p>The answer is ARI = 0.186 … which means that <strong>the 5-segment solutions from</strong> <code>NbClust</code> <strong>and</strong> <code>mclust</code> <strong>agree only slightly better than random chance</strong> (about “18% better than random” to use a heuristic explanation of the index). Put differently, the two statistical models often assign respondents to quite different segments.</p>
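<p>As a quick sanity check on how to read the index (my own aside): a pure relabeling of the same partition gives ARI = 1, while shuffled labels give a value near 0.</p>
<pre><code class="lang-r"># sanity checks on the adjusted Rand index
adjustedRandIndex(md.segs, md.segs)           # identical partitions: exactly 1
set.seed(98101)
adjustedRandIndex(md.segs, sample(md.segs))   # shuffled labels: near 0
</code></pre>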
<p>We can also draw a UMAP plot with the new <code>mclust</code> segment assignments:</p>
<pre><code class="lang-r"><span class="hljs-comment"># UMAP plot of the mclust assignments</span>
set.seed(<span class="hljs-number">98107</span>)
<span class="hljs-comment"># the same segments but in random order</span>
umap.dat$mclustSeg &lt;- factor(md.mc$classification)
<span class="hljs-comment"># plot those</span>
ggplot(umap.dat, aes(x = X1, y = X2, colour = mclustSeg)) +
  geom_point() +
  xlab(<span class="hljs-string">"High order dimension 1"</span>) + ylab(<span class="hljs-string">"High order dimension 2"</span>) +
  theme_minimal()+
  ggtitle(<span class="hljs-string">"Dimensional map with mclust segments"</span>)
</code></pre>
<p>Here’s the new plot:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729816624402/18ad8954-9a09-488a-a498-f0ac8d29b22a.png" alt class="image--center mx-auto" /></p>
<p>The result shows segments that are much less cleanly separated than the <code>NbClust</code> segments, with respect to the UMAP dimensions plotted above. There are still visible clusters, but they are not as clear as in the previous chart. Segment 3 is relatively separate from the others, but Segments 1, 2, 4, and 5 show a lot of mixing, with near neighbors assigned to other segments.</p>
<p><strong>That doesn’t mean that</strong> <code>NbClust</code> <strong>is better than</strong> <code>mclust</code> … because UMAP dimensions are not the only method to assess segment clarity or utility. <strong>It only shows that the two solutions are different</strong> … <em>because there is no single, canonical, or “best” segmentation method</em>.</p>
<p>That’s not the end of the story, however! We need to look at the implications.</p>
<hr />
<h3 id="heading-and-yet-it-may-be-useful">And Yet It May Be Useful</h3>
<p><strong>Beyond individual respondent assignment, there is a more important question</strong>: <em>do the strategic implications differ</em> between the two solutions? Let’s look at a heat map for the new <code>mclust</code> segments; our reusable function above makes it easy; we just pass those segment assignments to it (<code>md.mc$classification</code>):</p>
<pre><code class="lang-r"><span class="hljs-comment"># plot the 5-segment mclust solution as a heat map</span>
seg.heat(md.dat[ , classCols], md.mc$classification)
</code></pre>
<p>Here’s the plot:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729894992149/706111a8-af1e-45d9-8fcb-e9573fb51505.png" alt class="image--center mx-auto" /></p>
<p>To interpret it, I have copied below the comments I made above about the <code>NbClust</code> segments … and added commentary that compares each segment to the new, and — on an <em>individual level</em> — substantially different <code>mclust</code> segment:</p>
<ul>
<li><p><strong>Segment 4</strong> [NbClust] stands out strongly for its interest in R and statistical/data analyses. With roughly 25% of the sample, we can consider it to be a “hardcore Quant” or “data science in UX” segment. <strong>===&gt;</strong> <em>That aligns with Segment 3</em> in the new solution, although the segment is somewhat larger (34%).</p>
</li>
<li><p>After that, <strong>Segment 5</strong> [NbClust] stands out as a “Core UX” segment, strongly interested in classic mixed-methods UX issues, and not very interested in the stats topics. <strong>===&gt;</strong> <em>That aligns with Segment 4</em> in the new solution, similarly sized.</p>
</li>
<li><p><strong>Segments 1 and 3</strong> [NbClust] are quite similar, where both are interested in survey methods and choice models. Segment 3 is interested in core UX topics whereas Segment 1 is interested in pricing methods and more advanced choice models. <strong>===&gt;</strong> <em>These are generally similar to Segments 2 and 5</em> in the new solution.</p>
</li>
<li><p>Finally, <strong>Segment 2</strong> [NbClust] is not interested in much … except a segmentation class. That’s fine and is a common pattern — respondents who take the chance to communicate strongly about the one thing they want the most. <strong>===&gt;</strong> <em>That looks a lot like Segment 1</em> in the new solution.</p>
</li>
</ul>
<p>In short, although the segmentation solutions differ substantially according to which <em>respondents</em> they place into each segment, <strong>the two methods ended up finding quite similar segments from a strategic point of view</strong>! The business implications — e.g., for R/stats classes, high interest in segmentation, a core UX segment, and so forth — appear to be nearly identical between the two solutions. That happy result reassures us about the plausibility and statistical stability of the solutions.</p>
<p>Put differently, although the assignment of each <em>respondent</em> to a segment is uncertain here and often disagrees between methods, the segment <em>averages</em> end up being quite similar. That is a general observation in choice model data (and TBH it also relates to a bedrock principle of statistics: we do stats to understand a population better by sampling and averaging, among other ways!)</p>
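<p>One quick way to check that similarity of averages directly (a sketch of my own, assuming the S4-to-S3 alignment noted above) is to correlate the matched segment profiles:</p>
<pre><code class="lang-r"># compare segment-level mean profiles across the two solutions
agg.nbc &lt;- aggregate(md.dat[ , classCols], list(seg = md.segs), mean)
agg.mc  &lt;- aggregate(md.dat[ , classCols], list(seg = md.mc$classification), mean)
# e.g., the NbClust "R" segment (S4) vs. the mclust "R" segment (S3);
# given the heat maps above, we would expect this correlation to be high
cor(unlist(agg.nbc[4, -1]), unlist(agg.mc[3, -1]))
</code></pre>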
<p>However, I will add a caution: <strong>the strong similarity of the results here is an N=1 occurrence</strong>. <em>There is no guarantee that two segmentation methods will give such similar results</em>. In fact, given the low agreement we see for individual respondent assignment, <strong>my general presumption is that any two clustering solutions are somewhat likely to disagree in their strategic implications</strong>. We must do work to get beyond that, through careful interpretation and iteration, and by confirming and triangulating results!</p>
<hr />
<h3 id="heading-conclusion-and-a-final-warning">Conclusion and a Final Warning</h3>
<p>One point I have not yet made is this: <strong>an extremely common problem in cluster analysis is using far too many variables</strong>. In this post, I used 14 variables and I think that’s a perfectly reasonable size. It helps that they come from the same kind of observation (choice model) and estimation method (HB utilities). Yet many analysts attempt to cluster data sets with 50, 100, or 300 variables per respondent.</p>
<p>The problem with that is that <strong>clustering works by evaluating how similar respondents are to one another … and with so many variables, no one is similar!</strong> It is crucial to prune down the set of “basis variables” to be both <em>few</em> and of <em>known importance</em>. That’s a topic for more extended discussion.</p>
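<p>A tiny simulation (my own aside, using arbitrary random data) illustrates why high dimensionality hurts: as the number of variables grows, pairwise distances between respondents become nearly uniform, so no one stands out as especially similar to anyone else.</p>
<pre><code class="lang-r"># relative spread of pairwise distances shrinks as the number of variables grows
set.seed(98101)
for (k in c(2, 10, 50, 300)) {
  x &lt;- matrix(rnorm(200 * k), ncol = k)     # 200 simulated "respondents", k variables
  d &lt;- as.numeric(dist(x))                  # all pairwise Euclidean distances
  cat(k, "variables: sd/mean of distances =", round(sd(d) / mean(d), 3), "\n")
}
</code></pre>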
<p>Meanwhile, I hope this post demonstrates that:</p>
<ol>
<li><p>It is quite possible to <strong>find useful segments</strong> in data</p>
</li>
<li><p>… <strong>even when</strong> a statistical model claims there are “no segments”</p>
</li>
<li><p>… while using <strong>general methods</strong> that work across many kinds of data</p>
</li>
<li><p>… and that provide good opportunities to <strong>cross-check our work</strong> as we go</p>
</li>
<li><p>… while building <strong>reusable functions</strong> along the way</p>
</li>
<li><p>… as long as we are cautious and <strong>avoid common pitfalls</strong>.</p>
</li>
</ol>
<p>I hope you’ve found this interesting and helpful for your own analyses!</p>
<hr />
<h3 id="heading-learning-more">Learning More</h3>
<p>If you’d like to learn more about segmentation, here are some places to start:</p>
<ul>
<li><p>For <strong>general segmentation</strong>, see the chapters on clustering and classification in the <a target="_blank" href="https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155">R book</a> and the <a target="_blank" href="https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194">Python book</a>.</p>
</li>
<li><p>… or sign up for a <strong>class</strong>! In January 2025 there will be an <a target="_blank" href="https://events.ringcentral.com/events/segmentation-master-class-jan2025"><strong>online Segmentation Class</strong> from the Quant UX Association</a> taught by surveys expert Keith Chrzan</p>
</li>
<li><p>To learn about <strong>choice models</strong> in general, see discussion of conjoint analysis in the <a target="_blank" href="https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155">R book</a> and the <a target="_blank" href="https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194">Python book</a>, and of MaxDiff in the Quant UX book.</p>
</li>
<li><p>… or join me for an upcoming offering of the <a target="_blank" href="https://www.quantuxcon.org/classes">Choice Modeling Master Class</a>!</p>
</li>
<li><p>For <strong>latent class models</strong> specific to conjoint analysis and MaxDiff, check out many papers over the years in the <a target="_blank" href="https://sawtoothsoftware.com/resources/events/conferences">archives of the Sawtooth Software Conference</a> (scroll down for the PDF archives by year).</p>
</li>
</ul>
<hr />
<h3 id="heading-all-the-code">All the Code</h3>
<p>As always, here is all of the R code from this post compiled in one place. You can grab it with the <em>copy</em> icon and paste into your favorite R code editor.</p>
<pre><code class="lang-r"><span class="hljs-comment">##### Post 3 - Segmentation (clustering)</span>

<span class="hljs-comment"># get the data; repeating here for blog post 3, see post 1 for details</span>
<span class="hljs-keyword">library</span>(openxlsx)
md.dat &lt;- read.xlsx(<span class="hljs-string">"https://quantuxbook.com/misc/QUX%20Survey%202024%20-%20Future%20Classes%20-%20MaxDiff%20Individual%20raw%20scores.xlsx"</span>)   <span class="hljs-comment"># </span>
md.dat$Anchor &lt;- <span class="hljs-literal">NULL</span>
names(md.dat)[<span class="hljs-number">3</span>:<span class="hljs-number">16</span>] &lt;- c(<span class="hljs-string">"Choice Models"</span>,  <span class="hljs-string">"Surveys"</span>,       <span class="hljs-string">"Log Sequences"</span>,    <span class="hljs-string">"Psychometrics"</span>, 
                         <span class="hljs-string">"R Programming"</span>,  <span class="hljs-string">"Pricing"</span>,       <span class="hljs-string">"UX Metrics"</span>,       <span class="hljs-string">"Bayes Stats"</span>,
                         <span class="hljs-string">"Text Analytics"</span>, <span class="hljs-string">"Causal Models"</span>, <span class="hljs-string">"Interviewer-ing"</span>,  <span class="hljs-string">"Advanced Choice"</span>, 
                         <span class="hljs-string">"Segmentation"</span>,   <span class="hljs-string">"Metrics Sprints"</span>)
classCols &lt;- <span class="hljs-number">3</span>:ncol(md.dat)    <span class="hljs-comment"># generally, Sawtooth exported utilities start in column 3</span>


<span class="hljs-comment">##### 5. Number of segments?</span>

<span class="hljs-comment"># mclust uses model-based fit estimates to determine best number of latent clusters</span>
<span class="hljs-keyword">library</span>(mclust)
md.mc &lt;- Mclust(md.dat[ , classCols])
summary(md.mc)     <span class="hljs-comment"># best answer is only 1 "segment", multivariate normal within it</span>

<span class="hljs-comment"># initial visualization with UMAP</span>

<span class="hljs-keyword">library</span>(umap)
<span class="hljs-comment"># get umap representation of the individual interest data</span>
umap.config &lt;- umap.defaults
umap.config$random_state &lt;- <span class="hljs-number">98101</span>

umap.sol &lt;- umap(md.dat[ , classCols], config = umap.config)
umap.dat &lt;- data.frame(umap.sol$layout)
<span class="hljs-comment"># review the kind of data we have</span>
str(umap.dat)

<span class="hljs-comment"># plot it</span>
ggplot(umap.dat, aes(x = X1, y = X2)) +
  geom_point() +
  xlab(<span class="hljs-string">"High order dimension 1"</span>) + ylab(<span class="hljs-string">"High order dimension 2"</span>) +
  theme_minimal()


<span class="hljs-comment">##### 6. Find number of clusters and initial solution</span>

<span class="hljs-keyword">library</span>(NbClust)
md.nbc &lt;- NbClust(md.dat[ , classCols],
                  min.nc = <span class="hljs-number">2</span>, max.nc = <span class="hljs-number">10</span>,
                  method = <span class="hljs-string">"ward.D"</span>, 
                  distance = <span class="hljs-string">"euclidean"</span>)

<span class="hljs-comment"># we'll try a 5 cluster solution instead of 2</span>
md.nbc &lt;- NbClust(md.dat[ , classCols],
                  min.nc = <span class="hljs-number">5</span>, max.nc = <span class="hljs-number">10</span>,
                  method = <span class="hljs-string">"ward.D"</span>, 
                  distance = <span class="hljs-string">"euclidean"</span>)

md.segs &lt;- md.nbc$Best.partition
table(md.segs)
proportions(table(md.segs))


<span class="hljs-comment">##### 7. Plot umap with segment membership coded</span>

<span class="hljs-comment"># add the 5-segment membership to the UMAP data</span>
umap.dat$Segment &lt;- factor(paste0(<span class="hljs-string">"S"</span>, md.segs))

<span class="hljs-comment"># plot it with segments</span>
ggplot(umap.dat, aes(x = X1, y = X2, colour = Segment)) +
  geom_point() +
  xlab(<span class="hljs-string">"High order dimension 1"</span>) + ylab(<span class="hljs-string">"High order dimension 2"</span>) +
  theme_minimal() +
  ggtitle(<span class="hljs-string">"Dimensional map with 5 segments"</span>)

<span class="hljs-comment"># make sure we're not fooling ourselves ...</span>
<span class="hljs-comment"># what would it look like if the segments were random?</span>
set.seed(<span class="hljs-number">98107</span>)
<span class="hljs-comment"># the same segments but in random order</span>
umap.dat$RndSeg &lt;- sample(umap.dat$Segment)
<span class="hljs-comment"># plot those</span>
ggplot(umap.dat, aes(x = X1, y = X2, colour = RndSeg)) +
  geom_point() +
  xlab(<span class="hljs-string">"High order dimension 1"</span>) + ylab(<span class="hljs-string">"High order dimension 2"</span>) +
  theme_minimal()+
  ggtitle(<span class="hljs-string">"Dimensional map with RANDOM segments"</span>)


<span class="hljs-comment">##### 8. Plot heat map for segments</span>

<span class="hljs-comment"># function to plot average values by segment</span>
seg.heat &lt;- <span class="hljs-keyword">function</span>(dat, segs) {
  <span class="hljs-keyword">library</span>(superheat)
  <span class="hljs-comment"># aggregate mean value by segment</span>
  heat.sum &lt;- data.frame(t(aggregate(. ~ segs, data=dat, mean))[-<span class="hljs-number">1</span>, ])
  <span class="hljs-comment"># make the column names reflect the segment number and size (%)</span>
  names(heat.sum) &lt;- paste0(<span class="hljs-string">"S"</span>, <span class="hljs-number">1</span>:max(segs), <span class="hljs-string">" ("</span>, 
                            round(prop.table(table(segs)), <span class="hljs-number">2</span>)*<span class="hljs-number">100</span>, <span class="hljs-string">"%)"</span>)
  <span class="hljs-comment"># draw it with superheat</span>
  superheat(heat.sum, 
            heat.pal = c(<span class="hljs-string">"red3"</span>, <span class="hljs-string">"white"</span>, <span class="hljs-string">"green3"</span>),
            grid.hline.col = <span class="hljs-string">"white"</span>, grid.vline.col = <span class="hljs-string">"white"</span>,
            pretty.order.rows = <span class="hljs-literal">TRUE</span>,
            clustering.method = <span class="hljs-string">"hierarchical"</span>,
            X.text = round(as.matrix(heat.sum), <span class="hljs-number">2</span>),  
            X.text.size = <span class="hljs-number">4</span>, 
            left.label.size = <span class="hljs-number">0.3</span>,
            left.label.text.size = <span class="hljs-number">4</span>,
            bottom.label.text.size = <span class="hljs-number">4</span>,
            legend = <span class="hljs-literal">FALSE</span>)
}

<span class="hljs-comment"># plot the segmentation solution</span>
seg.heat(md.dat[ , classCols], md.segs)


<span class="hljs-comment">##### 9. Plot distribution of interest *within* a segment</span>
<span class="hljs-comment"># check the data briefly ; not discussed in the blog post</span>
summary(md.dat[md.segs==<span class="hljs-number">4</span>, ])

<span class="hljs-comment"># reuse our plot function -- copied here from the part 1 post</span>
<span class="hljs-comment"># see the part 1 blog post for an explanation of this function</span>
cbc.plot &lt;- <span class="hljs-keyword">function</span>(dat, itemCols=<span class="hljs-number">3</span>:ncol(dat), 
                     title = <span class="hljs-string">"Preference estimates: Overall + Individual level"</span>, 
                     meanOrder=<span class="hljs-literal">TRUE</span>) {

  <span class="hljs-comment"># get the mean points so we can plot those over the density plot</span>
  mean.df &lt;- lapply(dat[ , itemCols], mean)

  <span class="hljs-comment"># melt the data for ggplot</span>
  <span class="hljs-keyword">library</span>(reshape2)
  <span class="hljs-comment">#                       vvvv  assumes Sawtooth order;       vvv  (ID in col 1, remove RLH in col 2)</span>
  plot.df &lt;- melt(dat[, c(<span class="hljs-number">1</span>, itemCols)], id.vars=names(dat)[<span class="hljs-number">1</span>])

  <span class="hljs-comment"># get the N of respondents so we can set an appropriate level of point transparency</span>
  p.resp  &lt;- length(unique(plot.df[ , <span class="hljs-number">1</span>]))

  <span class="hljs-comment"># optionally and by default order the results not by column but by mean value</span>
  <span class="hljs-comment"># because ggplot builds from the bottom, we'll reverse them to put max value at the top</span>
  <span class="hljs-comment"># we could use fct_reorder but manually setting the order is straightforward in this case</span>
  <span class="hljs-keyword">if</span> (meanOrder) {
    plot.df$variable &lt;- factor(plot.df$variable, levels = rev(names(mean.df)[order(unlist(mean.df))]))
  }

  <span class="hljs-comment">#### Now : Build the plot</span>
  <span class="hljs-comment"># set.seed(ran.seed)   # optional; points are jittered; setting a seed would make them exactly reproducible</span>
  <span class="hljs-keyword">library</span>(ggplot2)
  <span class="hljs-keyword">library</span>(ggridges)

  <span class="hljs-comment"># build the first layer with the individual distributions</span>
  p &lt;- ggplot(data=plot.df, aes(x=value, y=variable, group=variable)) +
    geom_density_ridges(scale=<span class="hljs-number">0.9</span>, alpha=<span class="hljs-number">0</span>, jittered_points=<span class="hljs-literal">TRUE</span>,
                        rel_min_height=<span class="hljs-number">0.005</span>,
                        position=<span class="hljs-string">"points_sina"</span>,
                        <span class="hljs-comment"># set individual point alphas in inverse proportion to sample size</span>
                        point_color = <span class="hljs-string">"blue"</span>, point_alpha=<span class="hljs-number">1</span>/sqrt(p.resp),
                        point_size=<span class="hljs-number">2.5</span>) +
    <span class="hljs-comment"># reverse y axis to match attribute order from top</span>
    scale_y_discrete(limits=rev) +                                                   
    ylab(<span class="hljs-string">"Level"</span>) + xlab(<span class="hljs-string">"Relative preference (blue=individuals, red=average)"</span>) +
    ggtitle(title) +
    theme_minimal()

  <span class="hljs-comment"># now add second layer to plot with the means of each item distribution</span>
  <span class="hljs-keyword">for</span> (i <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:length(mean.df)) {
    <span class="hljs-keyword">if</span> (meanOrder) {
      <span class="hljs-comment"># if we're drawing them in mean order, get the right one same as above</span>
      p &lt;- p + geom_point(x=mean.df[[rev(order(unlist(mean.df)))[i]]], 
                          y=length(mean.df)-i+<span class="hljs-number">1</span>, colour=<span class="hljs-string">"tomato"</span>,      <span class="hljs-comment"># adjust y axis because axis is reversed above</span>
                          alpha=<span class="hljs-number">0.5</span>, size=<span class="hljs-number">2.0</span>, shape=<span class="hljs-number">0</span>, inherit.aes=<span class="hljs-literal">FALSE</span>)

    } <span class="hljs-keyword">else</span> {
      p &lt;- p + geom_point(x=mean.df[[i]], 
                          y=length(mean.df)-i+<span class="hljs-number">1</span>, colour=<span class="hljs-string">"tomato"</span>,      <span class="hljs-comment"># adjust y axis because axis is reversed above</span>
                          alpha=<span class="hljs-number">0.5</span>, size=<span class="hljs-number">2.0</span>, shape=<span class="hljs-number">0</span>, inherit.aes=<span class="hljs-literal">FALSE</span>)
    }
  }
  p
}

<span class="hljs-comment"># use our plot function with only the Segment 4 data</span>
cbc.plot(md.dat[md.segs==<span class="hljs-number">4</span>, ], itemCols=classCols) + 
  ylab(<span class="hljs-string">"Quant Course Offering"</span>) +
  ggtitle(<span class="hljs-string">"Class interest: Segment 4 Only (R)"</span>)

<span class="hljs-comment">##### 10. There is no magic answer in clustering</span>

<span class="hljs-comment"># we could force a 5-segment solution (for example) in Mclust</span>
<span class="hljs-keyword">library</span>(mclust)
md.mc &lt;- Mclust(md.dat[ , classCols], G=<span class="hljs-number">5</span>)

<span class="hljs-comment"># compare the mclust solution to the previous NbClust solution</span>
table(md.mc$classification, md.segs)
<span class="hljs-comment"># Rand index for the degree of agreement between them</span>
adjustedRandIndex(md.mc$classification, md.segs)

<span class="hljs-comment"># UMAP plot of the mclust assignments</span>
set.seed(<span class="hljs-number">98107</span>)
<span class="hljs-comment"># the same segments but in random order</span>
umap.dat$mclustSeg &lt;- factor(md.mc$classification)
<span class="hljs-comment"># plot those</span>
ggplot(umap.dat, aes(x = X1, y = X2, colour = mclustSeg)) +
  geom_point() +
  xlab(<span class="hljs-string">"High order dimension 1"</span>) + ylab(<span class="hljs-string">"High order dimension 2"</span>) +
  theme_minimal()+
  ggtitle(<span class="hljs-string">"Dimensional map with mclust segments"</span>)

<span class="hljs-comment"># plot the 5-segment mclust solution as a heat map</span>
seg.heat(md.dat[ , classCols], md.mc$classification)

<span class="hljs-comment"># and plot the individual distributions for the "R" segment, similar to above</span>
<span class="hljs-comment"># not discussed in the post; just illustrating the reusable plot function :)</span>
cbc.plot(md.dat[md.mc$classification==<span class="hljs-number">3</span>, ], itemCols=classCols) + 
  ylab(<span class="hljs-string">"Quant Course Offering"</span>) +
  ggtitle(<span class="hljs-string">"Class interest: Mclust R Segment Only"</span>)
</code></pre>
<p><a target="_blank" href="https://notbyai.fyi"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746999945541/9d224843-9e9f-44cc-98c5-276915794420.png" alt class="image--center mx-auto" /></a></p>
]]></content:encoded></item><item><title><![CDATA[Individual Scores in Choice Models, Part 2: Correlations among Items]]></title><description><![CDATA[Today’s post continues my examination of working with individual-level scores from choice modeling surveys, such as MaxDiff and Conjoint Analysis surveys.
In this post, we’ll examine and visualize correlation patterns among the items. This also sets ...]]></description><link>https://quantuxblog.com/individual-scores-in-choice-models-part-2-correlations-among-items</link><guid isPermaLink="true">https://quantuxblog.com/individual-scores-in-choice-models-part-2-correlations-among-items</guid><category><![CDATA[maxdif]]></category><category><![CDATA[quantux]]></category><category><![CDATA[R Language]]></category><category><![CDATA[uxresearch]]></category><category><![CDATA[marketing research]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Mon, 28 Oct 2024 15:39:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/xnqVGsbXgV4/upload/c2507cbb0de07f34f70c7d910ee7a096.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today’s post continues my examination of working with individual-level scores from choice modeling surveys, such as MaxDiff and Conjoint Analysis surveys.</p>
<p>In this post, we’ll examine and visualize <em>correlation</em> patterns among the items. This also sets up another opportunity (as if I needed one!) to discuss why “statistical significance” in such data is usually unimportant. We can ignore “significance” with the data and yet still learn much from them.</p>
<p>If you haven’t seen it, you could <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages">review Part 1 of this series</a>. It describes the data and presents basic data visualization code. Here I pick up where that post ended and discuss additional analyses. As usual, I share and discuss R code along the way; and compile all the code at the end.</p>
<hr />
<h3 id="heading-first-get-the-data">First: Get the Data</h3>
<p>If you ran the code in <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages">Part 1</a> then you may have the data. You can refer there for explanation; meanwhile, the following code obtains the data needed for this post.</p>
<p>Briefly, this code downloads MaxDiff scores in Excel format from a URL, and does minor clean up. This real data set estimates N=308 individuals’ interest in taking various Quant UX classes, such as classes on R, Choice Models, and Segmentation.</p>
<pre><code class="lang-r"><span class="hljs-comment"># get the data; repeating here for blog post 2, see post 1 for details</span>
<span class="hljs-keyword">library</span>(openxlsx)
md.dat &lt;- read.xlsx(<span class="hljs-string">"https://quantuxbook.com/misc/QUX%20Survey%202024%20-%20Future%20Classes%20-%20MaxDiff%20Individual%20raw%20scores.xlsx"</span>)   <span class="hljs-comment"># </span>
md.dat$Anchor &lt;- <span class="hljs-literal">NULL</span>          <span class="hljs-comment"># remove the anchor item that is a fixed value of 0</span>
names(md.dat)[<span class="hljs-number">3</span>:<span class="hljs-number">16</span>] &lt;- c(<span class="hljs-string">"Choice Models"</span>,  <span class="hljs-string">"Surveys"</span>,       <span class="hljs-string">"Log Sequences"</span>,    <span class="hljs-string">"Psychometrics"</span>, 
                         <span class="hljs-string">"R Programming"</span>,  <span class="hljs-string">"Pricing"</span>,       <span class="hljs-string">"UX Metrics"</span>,       <span class="hljs-string">"Bayes Stats"</span>,
                         <span class="hljs-string">"Text Analytics"</span>, <span class="hljs-string">"Causal Models"</span>, <span class="hljs-string">"Interviewer-ing"</span>,  <span class="hljs-string">"Advanced Choice"</span>, 
                         <span class="hljs-string">"Segmentation"</span>,   <span class="hljs-string">"Metrics Sprints"</span>)
classCols &lt;- <span class="hljs-number">3</span>:ncol(md.dat)    <span class="hljs-comment"># generally, Sawtooth exported utilities start in column 3</span>
</code></pre>
<hr />
<h3 id="heading-correlation-analysis-basic">Correlation Analysis: Basic</h3>
<p>After reviewing and comparing the preferences for each class, as we did in the previous <a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages">Post 1</a>, a next question might be whether there are patterns among the potential class offerings.</p>
<p>Specifically, when respondents are interested in <em>one class</em>, what does that say about their interest in <em>other classes</em>? Are there combinations of classes that are associated with one another?</p>
<p>There are several ways to examine such patterns (see, for example, Chapters 4, 11, and 12 in the <a target="_blank" href="https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155">R book</a>; or Chapters 4, 9, and 10 in the <a target="_blank" href="https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194">Python book</a>). Here I’ll demonstrate visualization of a Pearson’s <em>r</em> correlation matrix.</p>
<blockquote>
<p><strong>Background</strong>: In a nutshell, Pearson’s <strong>r</strong> varies between -1.0 and +1.0. A value of 0.0 indicates no relationship between a pair of variables — in this case, it would mean there is no relation between interest in one class and interest in another. As <strong>r</strong> approaches +1.0 there is a stronger and stronger pattern of interest in the same direction for both classes together. As <strong>r</strong> approaches -1.0, it means that interest in one class is increasingly associated with <strong>disinterest</strong> in the other class.</p>
</blockquote>
<p>The R code to visualize a general correlation matrix is not complex. First, we calculate the correlation matrix using <code>cor()</code>. Then we plot it with the <code>corrplot</code> library:</p>
<pre><code class="lang-r">md.cor &lt;- cor(md.dat[ , classCols])

<span class="hljs-comment"># basic plot</span>
<span class="hljs-keyword">library</span>(corrplot)
corrplot(md.cor)
</code></pre>
<p>Here’s the result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729297243137/6b2e66ff-458e-4e09-b960-1cf7e77cfed6.png" alt class="image--center mx-auto" /></p>
<p><strong>Wow!</strong> That’s a great chart for 3 lines of code. We immediately see some interesting combinations. For example, interest in <em>Choice Models</em> and <em>Advanced Choice Models</em> goes together, as does interest in <em>UX Metrics</em> and <em>Metrics Sprints</em>. On the other hand, interest in <em>Choice Models</em> is negatively associated with Quant <em>Interviewer</em> training.</p>
<p>If you’ve read any of my R posts before, <strong>you know what’s next: we can do better</strong>!</p>
<hr />
<h3 id="heading-correlation-analysis-improving-the-visualization">Correlation Analysis: Improving the Visualization</h3>
<p>One problem with the correlation plot above is that it is unordered with respect to the estimates themselves; items appear merely in the order listed in the data.</p>
<p><strong>A more useful correlation plot reorders the items in terms of similarity</strong> (i.e., clustering). Then we can see whether there are groups of items with similar patterns. Even better, we might draw boxes around those groups to highlight them.</p>
<p>That’s also straightforward in <code>corrplot</code>. Here’s the code:</p>
<pre><code class="lang-r"><span class="hljs-comment"># correlations with clustering</span>
corrplot(md.cor, method = <span class="hljs-string">"ellipse"</span>,
         order = <span class="hljs-string">"hclust"</span>, addrect = <span class="hljs-number">4</span>)
</code></pre>
<p>To explain the options, <code>method=”ellipse”</code> converts the circles into ellipses that indicate strength of association (by narrowness) and direction (by slope). By using <code>order=”hclust”</code>, we ask <code>corrplot</code> to cluster the items for similarity. The <code>addrect=4</code> option draws 4 boxes to highlight the clusters. Generally you pick that number according to what makes sense with some trial and error (but see below for more discussion of finding a “good” number of clusters!)</p>
<p>Here’s the result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729297949272/528de337-21d6-488f-a2f7-985f4c065d53.png" alt class="image--center mx-auto" /></p>
<p>In this chart we see 4 groups of classes: <em>Text Analytics</em>, which is relatively unrelated to interest in any other particular class; a group for “<em>core UX</em>” topics (metrics, surveys, and interviewing); a group for “<em>statistics</em>” (R, Bayes, etc.); and a group for more “<em>marketing</em>” type classes (pricing, choice models, segmentation).</p>
<p>All of that makes complete sense!</p>
<p>I use clustered correlation plots routinely. They are extremely helpful and are a fast way to identify patterns of interest. I often add the correlation coefficients (but rarely show those to stakeholders!). <code>corrplot</code> can do that:</p>
<pre><code class="lang-r">corrplot.mixed(md.cor, upper = <span class="hljs-string">"ellipse"</span>,
               order = <span class="hljs-string">"hclust"</span>, 
               tl.pos = <span class="hljs-string">"lt"</span>, number.cex = <span class="hljs-number">0.75</span>)
</code></pre>
<p>This uses the <code>corrplot.mixed()</code> function to visualize the correlations as ellipses and also show their values. The final options tell it put labels on the left and top (<code>tl.pos=”lt”</code>) and shrink the coefficient text to fit better (<code>number.cex=0.75</code>).</p>
<p>Here’s the result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729298211386/ef37627b-dfc1-4ca7-a2be-f650cc699f77.png" alt class="image--center mx-auto" /></p>
<hr />
<h3 id="heading-statistically-significant-dont-ask">Statistically Significant? Don’t Ask!</h3>
<p>Because the chart above shows correlation coefficients (Pearson’s <em>r</em> coefficient, as noted above), you might wonder, “are they statistically significant?”</p>
<p><strong>My recommendation is to avoid the question of “significance”.</strong> We might discuss that theoretically, which is a long discussion — see the topic as discussed in any of my books listed above. However, I will give a few short answers here … plus a visual demonstration.</p>
<p><strong>First</strong>, the correlations here are <em>post-hoc</em> examinations of many coefficients. They do not come from any specific hypothesis test, and thus the usual assumptions of significance testing do not apply. <strong>Second</strong>, in practice we care much more about the absolute <em>strength</em> of association rather than the binary question of “significance”. <strong>Third</strong>, with enough sample, <em>almost everything</em> will be “significant” … and that doesn’t help us decide anything. I’ll say more about that in a moment.</p>
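<p>To put a number on that third point, here is a back-of-envelope aside of my own (using the Fisher z approximation): with this sample size, even very small correlations cross the conventional threshold.</p>
<pre><code class="lang-r"># roughly the smallest r that is "significant" at p &lt; .05 with N = 308
N &lt;- 308
round(tanh(qnorm(0.975) / sqrt(N - 3)), 3)   # about 0.11
</code></pre>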
<p><strong>Fourth</strong>, and most importantly, significance testing compares associations in the data to a null hypothesis of “no association” … and <strong>the idea of “no association” is a completely unreasonable hypothesis for such data</strong>. How could we possibly think — or be interested to test — that interest in one class by default is completely unrelated to interest in another class? That notion is easily dispelled simply by talking with people who take classes … or, as we’ll see in moment, by data!</p>
<p>You might be thinking, “show me the code!” Well, here it is!</p>
<p>We can test the association between a single pair of items with the <code>cor.test()</code> function. For example, the association between <em>Text Analytics</em> and <em>Metrics Sprints</em> is one of the smaller <em>r</em> values in the chart above. We can test it with:</p>
<pre><code class="lang-r"><span class="hljs-comment"># correlation testing (not recommended in this case, just demonstrating)</span>
cor.test(md.dat[, <span class="hljs-string">"Text Analytics"</span>], 
         md.dat[, <span class="hljs-string">"Metrics Sprints"</span>])
</code></pre>
<p>R tells us that the Pearson’s <em>r</em> value is <code>0.1234</code> with a CI of <code>(0.0119, 0.2320)</code> … which is “significant” under classical assumptions (or mis-assumptions, as I mention above).</p>
<p>That’s just one pair of items, among the 14 * 13 / 2 = 91 total pairs. How many of those 91 are significant? We do a pairwise test for all pairs with <code>cor.mtest()</code> (“<em>m</em>” for “<em>matrix</em>” or “<em>multiple</em>” as you prefer). Then we compare the p-value to the alpha level of interest — let’s say 0.05 — and find the proportion (using <code>mean()</code> on the boolean comparison) that are “significant”. That’s easier than it might sound:</p>
<pre><code class="lang-r"><span class="hljs-comment"># first get the CIs for each pair</span>
ptest = cor.mtest(md.dat[ , classCols], conf.level = <span class="hljs-number">0.95</span>)

<span class="hljs-comment"># what proportion of those are "significant" ?</span>
mean(ptest$p[upper.tri(ptest$p)] &lt; <span class="hljs-number">0.05</span>)
</code></pre>
<p><strong>Code details</strong>. The second line there should be unpacked slightly. First, we want to compare the p-values (saved in the object <code>ptest</code> as <code>ptest$p</code>) to a threshold value for “significance”, in this case <code>0.05</code>. After making those comparisons, we take the <code>mean()</code> in order to find how often that comparison is <code>TRUE</code>. However, the <code>ptest</code> object includes <em>all</em> pairs of comparisons, including the comparison of each item with <em>itself</em>, which always has a correlation of <em>r</em> = 1.0. In order to exclude those comparisons, we select only the “upper triangle” of the matrix of comparisons, indexing the results with <code>upper.tri()</code>.</p>
<p><em>The answer</em>: 71% of all the associations in these data are significant.</p>
<p><strong>One might think, “wow, that’s great!” … but it’s not</strong>. First of all, it merely says that people who are interested in some classes tend to be systematically interested or disinterested in other, related classes. That is not much of a discovery!</p>
<p>Second, it’s not great because — when treated as a binary, “significant” or not — <em>it makes the situation less actionable in practice</em>, rather than <em>more</em> actionable.</p>
<p>To see why the binary question of significance leads to less actionable insight, we can visualize what’s happening. That’s next.</p>
<hr />
<h3 id="heading-visualizing-the-number-of-significant-pairs">Visualizing the Number of “Significant” Pairs</h3>
<p>With <code>corrplot</code>, we can add annotation to charts based on statistical significance (or, in fact, any matrix that has the same dimensions as the correlation matrix). The steps to do that are (1) to get a matrix with the p-values, as we already did above (<code>ptest</code>), and (2) add that to the correlation plot, indicating how to annotate it.</p>
<p>Given that we already have the matrix <code>ptest</code> with results of significance testing, we can annotate the correlation plot like this:</p>
<pre><code class="lang-r">corrplot(md.cor, method = <span class="hljs-string">'ellipse'</span>, 
         order = <span class="hljs-string">"hclust"</span>, addrect = <span class="hljs-number">4</span>, 
         p.mat = ptest$p,
         sig.level = c(<span class="hljs-number">0.001</span>, <span class="hljs-number">0.01</span>, <span class="hljs-number">0.05</span>), pch.cex = <span class="hljs-number">0.8</span>,
         insig = <span class="hljs-string">"label_sig"</span>)
</code></pre>
<p>The last 3 lines are new here. First, we use <code>p.mat=ptest$p</code> to point to the p-values of interest. Next we set cutoff points for annotating the p-values, where <code>sig.level=c(0.001, 0.01, 0.05)</code> says to use 3 asterisks for p&lt;.001, 2 asterisks for p&lt;.01, and 1 asterisk for p&lt;.05. You could change those cutoffs. <code>pch.cex=0.8</code> shrinks the asterisks to fit better in the cells. Finally <code>insig="label_sig"</code> plots increasing levels (as set by sig.level), as opposed to a single binary flag for significance.</p>
<p>Here’s the result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729359035028/71fd04ef-b3ad-430d-8241-b6e813e50fc0.png" alt class="image--center mx-auto" /></p>
<p>The key takeaway of this chart — compared to the ones above — is that <strong>almost everything is “significant”</strong> (under the [mis-]conception of significance discussed above). There are a few exceptions, especially for the Pricing class, but the general pattern is this: if we know that someone is interested or disinterested in one class, we can make pretty good guesses about what other classes might interest them.</p>
<p>But that is not a surprise! Of course we expect related interests to go together — and knowing that they go together “significantly” adds nothing actionable to our understanding.</p>
<p>Put differently, <strong>when more than 70% of the boxes on the chart are “significant”, there’s nothing in particular we can do with that information</strong>.</p>
<hr />
<h3 id="heading-instead-of-significance">Instead of “Significance”</h3>
<p>So what should we do instead?</p>
<p>First, if a stakeholder asks whether your results are statistically significant, the answer is NOT to dive into a technical discussion, and certainly not into coefficients, p-values, or charts with asterisks. Here’s how I recommend answering:</p>
<blockquote>
<p>Q: Is this statistically significant?</p>
<p>A: Yes! And the actions we should take are …</p>
</blockquote>
<p>How do we arrive at the actions to take? Consider evidence such as:</p>
<ol>
<li><p><strong>What are the overall patterns</strong>? You might, for example, consider the clusters of classes that we saw above.</p>
</li>
<li><p><strong>How strong are the patterns</strong>? We could consider the strength of association — not just whether it is “significant” but how strong it is. Any association above <em>r=0.5</em> or thereabouts is a strong relationship! (For more, try a search for “<em>cohen correlation analysis pearson r strength</em>”, see “Learning More” below, or see the short sketch after this list.)</p>
</li>
<li><p><strong>Are they actionable</strong>? This relates to our business goal. For example, consider in the results here that Segmentation is closely associated with both Text Analytics and Logs Analysis. Now, suppose we’re going to teach segmentation — because it was the #1 request, as we saw in blog post #1 — and that it will be offered as a half-day class. The results suggest that we might add Text Analytics or Logs Analysis as a second half-day class, in order to maximize appeal to the students taking Segmentation. Or instead — depending on our strategy — it might suggest <em>not</em> teaching one of those, if we want to reach different sets of students between the two half-day sessions.</p>
</li>
</ol>
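<p>As a companion to points 2 and 3, here is a short sketch using the <code>md.cor</code> matrix computed earlier. The 0.5 cutoff is only the rule of thumb mentioned above, not a hard rule, so adjust it to your situation:</p>
<pre><code class="lang-r"># pairs with |r| above ~0.5 (rule-of-thumb cutoff; adjust to taste)
strong &lt;- which(abs(md.cor) &gt; 0.5 &amp; upper.tri(md.cor), arr.ind = TRUE)
data.frame(item1 = rownames(md.cor)[strong[ , 1]],
           item2 = colnames(md.cor)[strong[ , 2]],
           r     = round(md.cor[strong], 2))

# items most (and least) associated with Segmentation
sort(md.cor["Segmentation", ], decreasing = TRUE)
</code></pre>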
<p><strong>The key point is this</strong>: if you have enough data to feel good about your sample, and it is quality data, then <em>you generally should not worry about “significance”</em>. Worry instead about how to make <strong>strategic recommendations</strong> and how to help your stakeholders focus on those!</p>
<hr />
<h3 id="heading-final-note-more-on-the-item-clusters">Final Note: More on the Item Clusters</h3>
<p>There is one thing I should emphasize about the clusters here. They are clusters of <em>items</em> — which classes go together in their patterns of interest. They are not clusters of <em>respondents</em>!</p>
<p>Thus, for example, although the patterns clearly identify a “core UX” group of classes (among others), they don’t tell us how many people would be <em>interested</em> in that group, nor how strongly they would be interested (or disinterested).</p>
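<p>If you want those item clusters as data rather than only as rectangles on the plot, one option is to run the hierarchical clustering yourself. This is a sketch, not <code>corrplot</code>’s internal code; <code>corrplot</code>’s own <code>hclust</code> ordering may use slightly different settings, so the groupings could differ at the margins:</p>
<pre><code class="lang-r"># hierarchical clustering of the *items*, using a correlation-based distance
item.dist  &lt;- as.dist(1 - md.cor)      # high correlation ==&gt; small distance
item.hc    &lt;- hclust(item.dist)        # default "complete" linkage
item.clust &lt;- cutree(item.hc, k = 4)   # 4 groups, to match addrect = 4 above
split(names(item.clust), item.clust)   # list the classes in each item cluster
</code></pre>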
<p>To examine whether there are groups of people who share interest patterns, we need to look at clustering (segmenting) the <em>respondents</em>. That will be the topic of Post #3 in this series — stay tuned!</p>
<hr />
<h3 id="heading-learning-more">Learning More</h3>
<p>As mentioned, stay tuned for Post #3 that will examine clusters of <em>people</em> (i.e., segmentation). That’s a different question than the clusters of <em>items</em> that we considered here.</p>
<p>To learn more about correlation analysis and related methods, check out Chapters 4, 11, and 12 in the <a target="_blank" href="https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155">R book</a>, and Chapters 4, 9, and 10 in the <a target="_blank" href="https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194">Python book</a>. A canonical, classic text for such analyses is Cohen, et al, <em>Applied Multiple Regression / Correlation Analysis for the Behavioral Sciences</em> (<a target="_blank" href="https://www.taylorfrancis.com/books/mono/10.4324/9780203774441/applied-multiple-regression-correlation-analysis-behavioral-sciences-jacob-cohen-patricia-cohen-stephen-west-leona-aiken">link</a>).</p>
<p>And if you’d like to develop hands-on skills in Quant methods — including choice modeling surveys like the one that yielded today’s data — check out the <a target="_blank" href="https://quantuxcon.org/classes">classes from the Quant UX Association</a> (full disclosure, I teach some of them but there are several other instructors lined up, too!)</p>
<hr />
<h3 id="heading-all-the-r-code">All the R Code</h3>
<p>As always, I compile the complete R code in one place. Here it is:</p>
<pre><code class="lang-r"><span class="hljs-comment"># get the data; repeating here for blog post 2, see post 1 for details</span>
<span class="hljs-keyword">library</span>(openxlsx)
md.dat &lt;- read.xlsx(<span class="hljs-string">"https://quantuxbook.com/misc/QUX%20Survey%202024%20-%20Future%20Classes%20-%20MaxDiff%20Individual%20raw%20scores.xlsx"</span>)   <span class="hljs-comment"># </span>
md.dat$Anchor &lt;- <span class="hljs-literal">NULL</span>
names(md.dat)[<span class="hljs-number">3</span>:<span class="hljs-number">16</span>] &lt;- c(<span class="hljs-string">"Choice Models"</span>,  <span class="hljs-string">"Surveys"</span>,       <span class="hljs-string">"Log Sequences"</span>,    <span class="hljs-string">"Psychometrics"</span>, 
                         <span class="hljs-string">"R Programming"</span>,  <span class="hljs-string">"Pricing"</span>,       <span class="hljs-string">"UX Metrics"</span>,       <span class="hljs-string">"Bayes Stats"</span>,
                         <span class="hljs-string">"Text Analytics"</span>, <span class="hljs-string">"Causal Models"</span>, <span class="hljs-string">"Interviewer-ing"</span>,  <span class="hljs-string">"Advanced Choice"</span>, 
                         <span class="hljs-string">"Segmentation"</span>,   <span class="hljs-string">"Metrics Sprints"</span>)
classCols &lt;- <span class="hljs-number">3</span>:ncol(md.dat)    <span class="hljs-comment"># generally, Sawtooth exported utilities start in column 3</span>

<span class="hljs-comment">##### 4. Correlation</span>
md.cor &lt;- cor(md.dat[ , classCols])

<span class="hljs-comment"># basic plot</span>
<span class="hljs-keyword">library</span>(corrplot)
corrplot(md.cor)

<span class="hljs-comment"># correlations with clustering</span>
corrplot(md.cor, method = <span class="hljs-string">"ellipse"</span>,
         order = <span class="hljs-string">"hclust"</span>, addrect = <span class="hljs-number">4</span>) 

<span class="hljs-comment"># add correlation coefficients</span>
corrplot.mixed(md.cor, upper = <span class="hljs-string">"ellipse"</span>,
               order = <span class="hljs-string">"hclust"</span>, 
               tl.pos = <span class="hljs-string">"lt"</span>, number.cex = <span class="hljs-number">0.75</span>)

<span class="hljs-comment"># correlation testing (not recommended in this case, just demonstrating)</span>
cor.test(md.dat[, <span class="hljs-string">"Text Analytics"</span>], 
         md.dat[, <span class="hljs-string">"Metrics Sprints"</span>])

<span class="hljs-comment"># plot with significance highlighted</span>
<span class="hljs-comment"># first get the CIs for each pair</span>
ptest = cor.mtest(md.dat[ , classCols], conf.level = <span class="hljs-number">0.95</span>)

<span class="hljs-comment"># what proportion of those are "significant" ?</span>
mean(ptest$p[upper.tri(ptest$p)] &lt; <span class="hljs-number">0.05</span>)

<span class="hljs-comment"># add asterisks to the plot for "significant" at p=0.05 level</span>
corrplot(md.cor, method = <span class="hljs-string">'ellipse'</span>, 
         order = <span class="hljs-string">"hclust"</span>, addrect = <span class="hljs-number">4</span>, 
         p.mat = ptest$p,
         sig.level = c(<span class="hljs-number">0.001</span>, <span class="hljs-number">0.01</span>, <span class="hljs-number">0.05</span>), pch.cex = <span class="hljs-number">0.8</span>,
         insig = <span class="hljs-string">"label_sig"</span>)
</code></pre>
<p><a target="_blank" href="https://notbyai.fyi"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746999945541/9d224843-9e9f-44cc-98c5-276915794420.png" alt class="image--center mx-auto" /></a></p>
]]></content:encoded></item><item><title><![CDATA[Individual Scores in Choice Models, Part 1: Data & Averages]]></title><description><![CDATA[Before jumping into today’s topic, I will highlight the previous post in case you missed it: guest author — and coauthor of the Quant UX book — Kerry Rodden discusses HEART metrics. It updates Kerry’s classic description of HEART with new reflections...]]></description><link>https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages</link><guid isPermaLink="true">https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages</guid><category><![CDATA[R Language]]></category><category><![CDATA[quantux]]></category><category><![CDATA[uxresearch]]></category><category><![CDATA[marketing research]]></category><category><![CDATA[conjoint]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Wed, 23 Oct 2024 14:33:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/JG35CpZLfVs/upload/3a43345ebb4901b421b9d12f93520998.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><em>Before jumping into today’s topic, I will highlight the</em> <a target="_blank" href="https://quantuxblog.com/how-to-make-heart-metrics-work-in-practice"><em>previous post</em></a> <em>in case you missed it: guest author — and coauthor of the Quant UX book — Kerry Rodden discusses HEART metrics. It updates Kerry’s classic description of HEART with new reflections on how to make UX metrics succeed in practice.</em> <a target="_blank" href="https://quantuxblog.com/how-to-make-heart-metrics-work-in-practice"><em>Read it here</em></a><em>!</em></p>
</blockquote>
<p>Today I begin a series of posts that demonstrate working in R with individual level estimates from choice models such as MaxDiff and Conjoint Analysis. Those are the best estimates of preference for each respondent to a choice survey. I hope they will inspire you both to run choice surveys and to learn more from them!</p>
<hr />
<h3 id="heading-background-and-assumptions">Background and Assumptions</h3>
<p><strong>Assumptions</strong>: I <em>assume</em> that you know what a choice model survey is, and — at a high level — what individual scores are. If not, check out <a target="_blank" href="https://quantuxblog.com/easy-maxdiff-in-r">this post about MaxDiff surveys</a>. Or take one of the upcoming <a target="_blank" href="https://www.quantuxcon.org/classes">Choice Modeling Master Classes</a> from the Quant Association! Or for even more, find sections on MaxDiff in the <a target="_blank" href="https://www.amazon.com/Quantitative-User-Experience-Research-Understanding/dp/1484292677">Quant UX book</a> and discussions of Conjoint Analysis in the <a target="_blank" href="https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155">R book</a> and <a target="_blank" href="https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194">Python book</a>.</p>
<p>However, to recap briefly, individual scores are the best estimates for each respondent who completed a choice survey. For example, they report each respondent’s interest — as estimated by the model — in every one of the items tested in the survey.</p>
<p><strong>Data.</strong> Here I discuss data from N=308 UX Researchers who took a <a target="_blank" href="https://quantuxblog.com/easy-maxdiff-in-r">MaxDiff survey</a> about potential classes from the Quant UX Association. As with most Quant Association surveys, the respondents agreed that we could use the (anonymous) data publicly. Yay! It is great to discuss realistic choice model data. The survey included 14 potential classes. Here is an example MaxDiff screen:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729177927658/0b703d58-8a4c-47a6-8940-e90ff38de22c.png" alt="a MaxDiff screen asking respondents to select the most and least interesting class, from a random subset of 5 classes" class="image--center mx-auto" /></p>
<p><strong>Data Format</strong>. The estimates here were produced by Sawtooth Software’s Discover product. The code here begins with an Excel file exactly as exported by Discover. Each row represents one respondent’s interest in each of the potential classes.</p>
<blockquote>
<p>Note: Without going into the details of hierarchical Bayes (HB) models, I’ll note that the HB process gives 1000s of estimates for each respondent. The values here — the ones that most analysts would use — are the mean (average) of those for each person. Although the estimates are somewhat uncertain for any one person, the overall <em>set</em> of estimates is an excellent representation of the group.</p>
</blockquote>
<p><strong>R code.</strong> I assume you can generally follow R. A key goal of this post is to share R code that will help you accelerate your own work. (More in the <a target="_blank" href="https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155">R book</a>!)</p>
<p>You should be able to follow along live in R — all of the data and code are shared here. In each block of R code, you can use the <strong>copy icon</strong> in the upper right to copy the code and paste it into R to follow along. Or see the end of the post for all of the code in one big chunk.</p>
<p><strong>Packages</strong>. You may need to install extra R packages along the way. If you get an error running a “<code>library()</code>” command, you probably need to install that package.</p>
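<p>For convenience, here is a one-time setup line covering the packages used in this post (a sketch; install only the ones you are missing):</p>
<pre><code class="lang-r"># one-time setup: packages used in this post
install.packages(c("openxlsx", "ggplot2", "reshape2",
                   "forcats", "ggridges", "Hmisc"))
</code></pre>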
<hr />
<h3 id="heading-load-the-data">Load the Data</h3>
<p>Sawtooth Discover exports individual estimates as Excel (<code>.xlsx</code>) files. I’ve uploaded that file, and we can download it directly from its URL using the <code>openxlsx</code> package:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Read Excel data as saved by Sawtooth</span>
<span class="hljs-keyword">library</span>(openxlsx)   <span class="hljs-comment"># install if needed</span>
<span class="hljs-comment"># online location (assuming you are able to read it)</span>
md.dat &lt;- read.xlsx(<span class="hljs-string">"https://quantuxbook.com/misc/QUX%20Survey%202024%20-%20Future%20Classes%20-%20MaxDiff%20Individual%20raw%20scores.xlsx"</span>)   <span class="hljs-comment">#</span>
</code></pre>
<p>It’s helpful to do a bit of minor cleanup. First of all, in Anchored MaxDiff, each item is compared to an “anchor” that indicates whether there is positive interest or not. The anchor estimate is always exported as “0”, so we can remove that as non-informative.</p>
<p>Second, the file sets column (variable) names that are the entire text of each item. It’s helpful to replace those with shorter versions that are easy to read.</p>
<p>Those steps are:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Some minor clean up</span>
<span class="hljs-comment"># remove the "Anchor" item that is always 0 after individual calibration</span>
md.dat$Anchor &lt;- <span class="hljs-literal">NULL</span>
<span class="hljs-comment"># Assign friendly names instead of the long names, so we can plot them better</span>
names(md.dat)[<span class="hljs-number">3</span>:<span class="hljs-number">16</span>]  <span class="hljs-comment"># check these to make sure we're renaming correctly</span>
names(md.dat)[<span class="hljs-number">3</span>:<span class="hljs-number">16</span>] &lt;- c(<span class="hljs-string">"Choice Models"</span>,  <span class="hljs-string">"Surveys"</span>,       <span class="hljs-string">"Log Sequences"</span>,    <span class="hljs-string">"Psychometrics"</span>, 
                         <span class="hljs-string">"R Programming"</span>,  <span class="hljs-string">"Pricing"</span>,       <span class="hljs-string">"UX Metrics"</span>,       <span class="hljs-string">"Bayes Stats"</span>,
                         <span class="hljs-string">"Text Analytics"</span>, <span class="hljs-string">"Causal Models"</span>, <span class="hljs-string">"Interviewer-ing"</span>,  <span class="hljs-string">"Advanced Choice"</span>, 
                         <span class="hljs-string">"Segmentation"</span>,   <span class="hljs-string">"Metrics Sprints"</span>)
</code></pre>
<p>Next, I use <code>str()</code> and <code>summary()</code> to do some basic data checks:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Basic data check</span>
str(md.dat)
summary(md.dat)
</code></pre>
<p>Finally, I set a variable for the item columns (columns <code>3:[end]</code>) so I can refer to them easily, without a magic number “3” showing up multiple times later:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Index the columns for the classes, since we'll use that repeatedly</span>
classCols &lt;- <span class="hljs-number">3</span>:ncol(md.dat)    <span class="hljs-comment"># generally, Sawtooth exported utilities start in column 3</span>
</code></pre>
<p>With those lines, we have data! It’s time to look at the results.</p>
<hr />
<h3 id="heading-initial-chart-overall-estimates">Initial Chart: Overall Estimates</h3>
<p>The first thing we might want to know — or that stakeholders will ask — is which classes are most desired. In the spreadsheet world, that might be done by finding the average level of interest and then creating a bar chart. We’ll do the same thing here: find the average interest and then generate a bar chart.</p>
<p>In R, there are many different ways to obtain column averages. I’ll use <code>lapply( , mean)</code> … just because it occurred to me first. I use <code>barplot()</code> to plot them:</p>
<pre><code class="lang-r"><span class="hljs-comment"># First, get the average values for the classes</span>
mean.dat &lt;- lapply(md.dat[ , classCols], mean)

<span class="hljs-comment"># Plot a simple bar chart of those averages</span>
<span class="hljs-comment"># first set the chart borders so everything will fit [this is trial and error]</span>
par(mar = c(<span class="hljs-number">3</span>, <span class="hljs-number">8</span>, <span class="hljs-number">2</span>, <span class="hljs-number">2</span>))
<span class="hljs-comment"># then draw the chart</span>
barplot(height = unlist(mean.dat), names.arg = names(mean.dat),    <span class="hljs-comment"># column values and labels</span>
        horiz = <span class="hljs-literal">TRUE</span>, col =<span class="hljs-string">"darkblue"</span>,                           <span class="hljs-comment"># make it horizontal and blue</span>
        cex.names = <span class="hljs-number">0.75</span>, las = <span class="hljs-number">1</span>,                                   <span class="hljs-comment"># shrink the labels and rotate them</span>
        main = <span class="hljs-string">"Interest Level by Quant Course"</span>)
</code></pre>
<p>In the plot, I set margins for the graphics window with <code>par(mar=…)</code> to make it fit. That is a trial and error process for base R plots. I set the <code>barplot()</code> to be <code>horiz</code>[ontal] and slightly shrink (<code>cex.names=0.75</code>) and rotate the labels (<code>las=1</code>) to be readable.</p>
<p>Here’s the result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729178394515/09ee901b-6dc8-4aec-8e05-7252a3c7c02e.png" alt class="image--center mx-auto" /></p>
<p>It is very, well, spreadsheet-ish. We see the winner (segmentation) and loser (interviewer training) … but we can do much better!</p>
<hr />
<h3 id="heading-better-chart-estimates-with-error-bars">Better Chart: Estimates with Error Bars</h3>
<p>In the chart above, we can see the averages but have no insight into the distribution. Are the averages strongly different? Or are they very close in comparison to the underlying distributions? Put differently, is the “winner” (segmentation) really much stronger than the next three options (psychometrics, etc.) or is it only slightly better?</p>
<p>A better chart would show error bars for the means, so that we can tell whether differences are — to use the common but somewhat misleading term (for reasons I’ll set aside) — “significant”. We’ll do that by using the <code>geom_errorbar()</code> visualization option in <code>ggplot2</code>.</p>
<p>I do lots of choice models and have to make plots like this all the time. I often make similar plots repeatedly, plotting different subsets of data, different samples, and the like. For that, it is useful to make a <strong><em>function</em></strong> for such plots. I can simply reuse one function and not rely on error-prone copying and repetition of code.</p>
<p>Here’s a relatively basic function to plot average estimates from an anchored MaxDiff with error bars. (For other estimates such as general MaxDiff or conjoint, it will also work; just change the <code>xlab()</code> label, either in the function or by adding it afterward.)</p>
<pre><code class="lang-r">plot.md.mean &lt;- <span class="hljs-keyword">function</span>(dat, itemCols) {
  <span class="hljs-keyword">library</span>(ggplot2)
  <span class="hljs-keyword">library</span>(reshape2)
  <span class="hljs-comment"># warning, next line assumes we're using Sawtooth formatted data!</span>
  md.m &lt;- melt(dat[ , c(<span class="hljs-number">1</span>, itemCols)])   <span class="hljs-comment"># add column 1 for the respondent ID</span>
  <span class="hljs-comment"># put them in mean order</span>
  <span class="hljs-keyword">library</span>(forcats)
  md.m$variable &lt;- fct_reorder(md.m$variable, md.m$value, .fun = mean)

  p &lt;- ggplot(data = md.m, aes(x = value, y = variable)) +
    <span class="hljs-comment"># error bars according to bootstrap estimation ("width" is of the lines, not the CIs)</span>
    geom_errorbar(stat = <span class="hljs-string">"summary"</span>, fun.data = mean_cl_boot, width = <span class="hljs-number">0.4</span>,) +
    <span class="hljs-comment"># add points for the mean value estimates</span>
    geom_point(size = <span class="hljs-number">4</span>, stat = <span class="hljs-string">"summary"</span>, fun = mean, shape = <span class="hljs-number">20</span>) +
    <span class="hljs-comment"># clean up the chart</span>
    theme_minimal() +
    xlab(<span class="hljs-string">"Average interest &amp; CI (0=Anchor)"</span>) +
    ylab(<span class="hljs-string">"Quant Course"</span>) 

  p 
}
</code></pre>
<p>There’s not a lot to say about this function. First, it melts the data to fit typical <code>ggplot</code> patterns. It adds the identifier column (<code>1</code>) for <code>melt()</code>; that would need adjusting if you have differently formatted data. Then it calls <code>fct_reorder()</code> from the <code>forcats</code> library to put the labels into order — in this case, to order them by the <code>mean</code> value of the grouped data. The error bars are plotted by <code>geom_errorbar()</code>, and that uses the <code>mean_cl_boot</code> option to find the confidence intervals by bootstrapping. (That function is in <code>Hmisc</code>, another potential package to install). Finally, after plotting the error bars, it adds the actual mean points with <code>geom_point()</code>.</p>
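<p>If you are curious what <code>mean_cl_boot</code> does behind the scenes, it wraps <code>Hmisc</code>’s bootstrap routine. Here is a quick spot check for a single class (the column name assumes the renaming above; exact values will vary slightly from run to run because of the resampling):</p>
<pre><code class="lang-r"># spot check the bootstrapped mean and CI for one column
library(Hmisc)   # install if needed
smean.cl.boot(md.dat[ , "Segmentation"], conf.int = 0.95, B = 1000)
</code></pre>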
<p>Now that we have a function, it is a simple command to plot the data:</p>
<pre><code class="lang-r">plot.md.mean(md.dat, classCols)
</code></pre>
<p>Here’s the result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729195497375/18c75c95-e6dc-4135-9784-64ff64c308fe.png" alt class="image--center mx-auto" /></p>
<p>Now we can see that the segmentation class is a fairly strong #1, while the next 3 classes (psychometrics, choice, surveys) are essentially tied. Among the 14 classes, 13 have average interest greater than zero — the MaxDiff anchor — while interviewer training falls below the anchor on average.</p>
<p>As a final note, because the function returns a <code>ggplot</code> object “<code>p</code>”, we could add other <code>ggplot2</code> options. For instance, we might add “<code>+ ggtitle(“My title!”)</code>” to add a title, or change the y axis label with “<code>+ ylab(“a different label”)</code>”.</p>
<p>The <strong>drawback</strong> of this plot is the following: it assumes that we want to know how classes compare in <em>average</em> values, according to <em>statistical significance</em>. In actual practice, that is not usually the case.</p>
<p>Why not? Why are we usually uninterested in averages and their confidence intervals? Because most often, practitioners need to know <em>how many</em> respondents are interested in something, and <em>how many</em> of them are <em>strongly interested</em> or <em>disinterested</em>. We do not reach any “average” customer — we reach individuals.</p>
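<p>To make that concrete, here is a one-liner that answers the “how many” question directly from the individual scores (0 is the anchor in this anchored MaxDiff, as noted above):</p>
<pre><code class="lang-r"># proportion of respondents with interest above the anchor (0), per class
sort(round(colMeans(md.dat[ , classCols] &gt; 0), 2), decreasing = TRUE)
</code></pre>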
<p>So although an average chart with error bars is a good high-level view, there is more to learn. That brings us to my favorite chart: individual distributions!</p>
<hr />
<h3 id="heading-another-great-chart-individual-distributions">Another Great Chart: Individual Distributions</h3>
<p>There is one important question that an average chart — such as the ones above — cannot answer. That question is: “<em>OK, this is the best item on average … but which item has people who are the</em> <strong>very most</strong> <em>interested in it?</em>”</p>
<p>For that, it is helpful to examine the individual distributions — not only where respondents are on average but whether there are groups who differ strongly in interest above or below the average.</p>
<p>As you probably guessed, I’ll plot it with a reusable function! Here’s the function. It’s long but I’ll break it down below.</p>
<pre><code class="lang-r">cbc.plot &lt;- <span class="hljs-keyword">function</span>(dat, itemCols = <span class="hljs-number">3</span>:ncol(dat), 
                     title = <span class="hljs-string">"Preference estimates: Overall + Individual level"</span>, 
                     meanOrder = <span class="hljs-literal">TRUE</span>) {

    <span class="hljs-comment"># get the mean points so we can plot those over the density plot</span>
  mean.df &lt;- lapply(dat[ , itemCols], mean)

  <span class="hljs-comment"># melt the data for ggplot</span>
  <span class="hljs-keyword">library</span>(reshape2)
  <span class="hljs-comment">#                       vvvv  assumes Sawtooth order;       vvv  (ID in col 1, remove RLH in col 2)</span>
  plot.df &lt;- melt(dat[, c(<span class="hljs-number">1</span>, itemCols)], id.vars = names(dat)[<span class="hljs-number">1</span>])

  <span class="hljs-comment"># get the N of respondents so we can set an appropriate level of point transparency</span>
  p.resp  &lt;- length(unique(plot.df[ , <span class="hljs-number">1</span>]))

  <span class="hljs-comment"># optionally and by default order the results not by column but by mean value</span>
  <span class="hljs-comment"># because ggplot builds from the bottom, we'll reverse them to put max value at the top</span>
  <span class="hljs-comment"># we could use fct_reorder but manually setting the order is straightforward in this case</span>
  <span class="hljs-keyword">if</span> (meanOrder) {
    plot.df$variable &lt;- factor(plot.df$variable, levels = rev(names(mean.df)[order(unlist(mean.df))]))
  }

  <span class="hljs-comment">#### Now : Build the plot</span>
  <span class="hljs-comment"># set.seed(ran.seed)   # optional; points are jittered; setting a seed would make them exactly reproducible</span>
  <span class="hljs-keyword">library</span>(ggplot2)
  <span class="hljs-keyword">library</span>(ggridges)

  <span class="hljs-comment"># build the first layer with the individual distributions</span>
  p &lt;- ggplot(data=plot.df, aes(x=value, y=variable, group=variable)) +
    geom_density_ridges(scale=<span class="hljs-number">0.9</span>, alpha=<span class="hljs-number">0</span>, jittered_points=<span class="hljs-literal">TRUE</span>,
                        rel_min_height=<span class="hljs-number">0.005</span>,
                        position=<span class="hljs-string">"points_sina"</span>,
                        <span class="hljs-comment"># set individual point alphas in inverse proportion to sample size</span>
                        point_color=<span class="hljs-string">"blue"</span>, point_alpha=<span class="hljs-number">1</span>/sqrt(p.resp),
                        point_size=<span class="hljs-number">2.5</span>) +
    <span class="hljs-comment"># reverse y axis to match attribute order from top</span>
    scale_y_discrete(limits=rev) +                                                   
    ylab(<span class="hljs-string">"Level"</span>) + xlab(<span class="hljs-string">"Relative preference (blue=individuals, red=average)"</span>) +
    ggtitle(title) +
    theme_minimal()

  <span class="hljs-comment"># now add second layer to plot with the means of each item distribution</span>
  <span class="hljs-keyword">for</span> (i <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:length(mean.df)) {
    <span class="hljs-keyword">if</span> (meanOrder) {
      <span class="hljs-comment"># if we're drawing them in mean order, get the right one same as above</span>
      p &lt;- p + geom_point(x=mean.df[[rev(order(unlist(mean.df)))[i]]], 
                          y=length(mean.df)-i+<span class="hljs-number">1</span>, colour=<span class="hljs-string">"tomato"</span>,      <span class="hljs-comment"># adjust y axis because axis is reversed above</span>
                          alpha=<span class="hljs-number">0.5</span>, size=<span class="hljs-number">2.0</span>, shape=<span class="hljs-number">0</span>, inherit.aes=<span class="hljs-literal">FALSE</span>)

    } <span class="hljs-keyword">else</span> {
      p &lt;- p + geom_point(x=mean.df[[i]], 
                          y=length(mean.df)-i+<span class="hljs-number">1</span>, colour=<span class="hljs-string">"tomato"</span>,      <span class="hljs-comment"># adjust y axis because axis is reversed above</span>
                          alpha=<span class="hljs-number">0.5</span>, size=<span class="hljs-number">2.0</span>, shape=<span class="hljs-number">0</span>, inherit.aes=<span class="hljs-literal">FALSE</span>)
    }
  }
  p
}
</code></pre>
<p>In the first couple of lines of the function, it finds the average value for each item using <code>lapply()</code>, the same as we already saw above. That’s so we can add those as a separate layer on the plot later. Then it <code>melts</code> the data, again just as we saw above.</p>
<p>Next, it finds the total N of respondents and saves that as <code>p.resp</code>. Why? Because when we plot the individuals, we want to set a transparency <code>alpha</code> value. Setting <code>alpha</code> in inverse proportion to the (<em>square root of the</em>) number of respondents makes those points more legible.</p>
<p>By default, it puts the labels into their mean order, using the averages we calculated instead of <code>fct_reorder()</code> as above (everything in R has multiple good options!)</p>
<p>The next two big chunks build the plot in two stages. The first big chunk uses the <code>ggridges</code> package to plot <code>geom_density_ridges()</code> density curves for the individual distributions. I won’t try to explain those; just look at the chart below! Its options add individual points to the curves and set a transparency alpha as I described above.</p>
<p>The second big chunk adds points to the chart, overlaying the density ridges with the average values for each item. To do that, it iterates over the items with a <code>for()</code> loop, and then adds the point in the proper place according to whether the items are displayed in sorted order or not.</p>
<p>We call the plot with a simple command, adding a custom y axis label:</p>
<pre><code class="lang-r">cbc.plot(md.dat, itemCols=classCols) + ylab(<span class="hljs-string">"Quant Course Offering"</span>)
</code></pre>
<p>Here’s the result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729197094156/af9ff8be-1f77-42c3-9101-d1fa11066cc9.png" alt class="image--center mx-auto" /></p>
<p>Wow! This chart has a lot of great information.</p>
<p>I won’t interpret it in complete detail but will note a couple of interesting features. First, it reinforces that Segmentation is a strong #1 option — not only does it have the highest average, more than 90% of respondents show positive interest greater than the anchor value of 0. We also see at the upper end of interest — the right hand side — that Segmentation has many more respondents with particularly strong interest (greater than a value of 5.0, to choose an arbitrary point) than any other class.</p>
<p>However, we see some other things with subsets of respondents with high interest. For example, although the R Programming course is a weak #11 out of 14 in <em>average</em> interest, it has a small number of respondents showing the <em>strongest</em> interest of anyone in any class.</p>
<p>When we consider that hands-on classes are small, and reach only the people interested in them, these results suggest that an R class could be a good offering, even if the average is lower. We don’t care how many people are <em>uninterested</em> — we only care whether we can reach enough people who are interested!</p>
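<p>Those observations from the chart can also be checked numerically. The sketch below counts respondents with strong interest, using the same arbitrary 5.0 cutoff mentioned above, and finds the single most enthusiastic score for each class:</p>
<pre><code class="lang-r"># respondents with strong interest (&gt; 5 on the estimate scale; arbitrary cutoff)
sort(colSums(md.dat[ , classCols] &gt; 5), decreasing = TRUE)

# the single highest individual estimate for each class
round(sort(apply(md.dat[ , classCols], 2, max), decreasing = TRUE), 1)
</code></pre>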
<p>With that, I’ll leave you to inspect the chart and find other interesting ideas.</p>
<hr />
<h3 id="heading-coming-up-in-post-2">Coming up in Post 2</h3>
<p>In the next post, I’ll go farther with these data and examine:</p>
<ul>
<li><p><a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-2-correlations-among-items"><strong>Correlations among interests</strong></a>: if they like X, what else do they like or dislike?</p>
</li>
<li><p><strong>Finding clusters of classes</strong> that go together (item clusters — we’ll look at respondent clusters in post 3)</p>
</li>
<li><p>… and later posts will look at <strong>respondent segmentation</strong> and (briefly) <strong>data quality</strong></p>
</li>
</ul>
<p><a target="_blank" href="https://quantuxblog.com/individual-scores-in-choice-models-part-2-correlations-among-items">Stay tuned for Post #2</a> in a few days!</p>
<p>Meanwhile, if you’re interested in more about choice models and/or R, check out <a target="_blank" href="https://quantuxblog.com/easy-maxdiff-in-r">this post about MaxDiff surveys</a>; and upcoming <a target="_blank" href="https://www.quantuxcon.org/classes">Choice Modeling Master Classes</a> from the Quant UX Association; and sections on MaxDiff in the <a target="_blank" href="https://www.amazon.com/Quantitative-User-Experience-Research-Understanding/dp/1484292677">Quant UX book</a> and Conjoint Analysis in the <a target="_blank" href="https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155">R book</a> or <a target="_blank" href="https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194">Python book</a>. Also, for more experienced choice modelers, I recently shared this <a target="_blank" href="https://quantuxblog.com/misconceptions-about-conjoint-analysis">post about misunderstandings of Conjoint Analysis</a>.</p>
<hr />
<h3 id="heading-all-the-code">All the Code</h3>
<p>As promised, following is all of the R code from this post. You can use the copy icon in the upper right to grab it all at once, and paste into RStudio or wherever you code.</p>
<p>Cheers!</p>
<pre><code class="lang-r"><span class="hljs-comment"># blog scripts for analysis of individual level MaxDiff data</span>
<span class="hljs-comment"># Chris Chapman, October 2024</span>

<span class="hljs-comment">##### 1. Get individual-level mean beta estimates as exported by Sawtooth Software</span>

<span class="hljs-comment"># 1a. read Excel data as saved by Sawtooth</span>
<span class="hljs-keyword">library</span>(openxlsx)
<span class="hljs-comment"># online location (assuming you are able to read it)</span>
md.dat &lt;- read.xlsx(<span class="hljs-string">"https://quantuxbook.com/misc/QUX%20Survey%202024%20-%20Future%20Classes%20-%20MaxDiff%20Individual%20raw%20scores.xlsx"</span>)   <span class="hljs-comment"># </span>

<span class="hljs-comment"># 1b. Some minor clean up</span>
<span class="hljs-comment"># remove the "Anchor" item that is always 0 after individual calibration</span>
md.dat$Anchor &lt;- <span class="hljs-literal">NULL</span>
<span class="hljs-comment"># Assign friendly names instead of the long names, so we can plot them better</span>
names(md.dat)[<span class="hljs-number">3</span>:<span class="hljs-number">16</span>]  <span class="hljs-comment"># check these to make sure we're renaming correctly</span>
names(md.dat)[<span class="hljs-number">3</span>:<span class="hljs-number">16</span>] &lt;- c(<span class="hljs-string">"Choice Models"</span>,  <span class="hljs-string">"Surveys"</span>,       <span class="hljs-string">"Log Sequences"</span>,    <span class="hljs-string">"Psychometrics"</span>, 
                         <span class="hljs-string">"R Programming"</span>,  <span class="hljs-string">"Pricing"</span>,       <span class="hljs-string">"UX Metrics"</span>,       <span class="hljs-string">"Bayes Stats"</span>,
                         <span class="hljs-string">"Text Analytics"</span>, <span class="hljs-string">"Causal Models"</span>, <span class="hljs-string">"Interviewer-ing"</span>,  <span class="hljs-string">"Advanced Choice"</span>, 
                         <span class="hljs-string">"Segmentation"</span>,   <span class="hljs-string">"Metrics Sprints"</span>)

<span class="hljs-comment"># 1c. Basic data check</span>
str(md.dat)
summary(md.dat)

<span class="hljs-comment"># 1d. Index the columns for the classes, since we'll use that repeatedly</span>
classCols &lt;- <span class="hljs-number">3</span>:ncol(md.dat)    <span class="hljs-comment"># generally, Sawtooth exported utilities start in column 3</span>


<span class="hljs-comment">##### 2. Plot the overall means</span>

<span class="hljs-comment"># 2a. The easy way (but not so good)</span>
<span class="hljs-comment"># First, get the average values for the classes</span>
mean.dat &lt;- lapply(md.dat[ , classCols], mean)

<span class="hljs-comment"># Plot a simple bar chart of those averages</span>
<span class="hljs-comment"># first set the chart borders so everything will fit [this is trial and error]</span>
par(mar=c(<span class="hljs-number">3</span>, <span class="hljs-number">8</span>, <span class="hljs-number">2</span>, <span class="hljs-number">2</span>))
<span class="hljs-comment"># then draw the chart</span>
barplot(height=unlist(mean.dat), names.arg = names(mean.dat),    <span class="hljs-comment"># column values and labels</span>
        horiz = <span class="hljs-literal">TRUE</span>, col =<span class="hljs-string">"darkblue"</span>,                           <span class="hljs-comment"># make it horizontal and blue</span>
        cex.names=<span class="hljs-number">0.75</span>, las=<span class="hljs-number">1</span>,                                   <span class="hljs-comment"># shrink the labels and rotate them</span>
        main =<span class="hljs-string">"Interest Level by Quant Course"</span>)


<span class="hljs-comment"># 2b. A somewhat more complex (but much better) way</span>
<span class="hljs-comment">#     we'll make this a function ... it's often good to make anything long into a function :)</span>
plot.md.mean &lt;- <span class="hljs-keyword">function</span>(dat, itemCols) {
  <span class="hljs-keyword">library</span>(ggplot2)
  <span class="hljs-keyword">library</span>(reshape2)
  <span class="hljs-comment"># warning, next line assumes we're using Sawtooth formatted data!</span>
  md.m &lt;- melt(dat[ , c(<span class="hljs-number">1</span>, itemCols)])   <span class="hljs-comment"># add column 1 for the respondent ID</span>
  <span class="hljs-comment"># put them in mean order</span>
  <span class="hljs-keyword">library</span>(forcats)
  md.m$variable &lt;- fct_reorder(md.m$variable, md.m$value, .fun=mean)

  p &lt;- ggplot(data=md.m, aes(x=value, y=variable)) +
    <span class="hljs-comment"># error bars according to bootstrap estimation ("width" is of the lines, not the CIs)</span>
    geom_errorbar(stat = <span class="hljs-string">"summary"</span>, fun.data = mean_cl_boot, width = <span class="hljs-number">0.4</span>,) +
    <span class="hljs-comment"># add points for the mean value estimates</span>
    geom_point(size = <span class="hljs-number">4</span>, stat = <span class="hljs-string">"summary"</span>, fun = mean, shape = <span class="hljs-number">20</span>) +
    <span class="hljs-comment"># clean up the chart</span>
    theme_minimal() +
    xlab(<span class="hljs-string">"Average interest &amp; CI (0=Anchor)"</span>) +
    ylab(<span class="hljs-string">"Quant Course"</span>) 

  p 
}
<span class="hljs-comment"># call our plot</span>
plot.md.mean(md.dat, classCols)



<span class="hljs-comment">##### 3. Even better: Plot the distributions (individual estimates)</span>

cbc.plot &lt;- <span class="hljs-keyword">function</span>(dat, itemCols=<span class="hljs-number">3</span>:ncol(dat), 
                     title = <span class="hljs-string">"Preference estimates: Overall + Individual level"</span>, 
                     meanOrder=<span class="hljs-literal">TRUE</span>) {

    <span class="hljs-comment"># get the mean points so we can plot those over the density plot</span>
  mean.df &lt;- lapply(dat[ , itemCols], mean)

  <span class="hljs-comment"># melt the data for ggplot</span>
  <span class="hljs-keyword">library</span>(reshape2)
  <span class="hljs-comment">#                       vvvv  assumes Sawtooth order;       vvv  (ID in col 1, remove RLH in col 2)</span>
  plot.df &lt;- melt(dat[, c(<span class="hljs-number">1</span>, itemCols)], id.vars=names(dat)[<span class="hljs-number">1</span>])

  <span class="hljs-comment"># get the N of respondents so we can set an appropriate level of point transparency</span>
  p.resp  &lt;- length(unique(plot.df[ , <span class="hljs-number">1</span>]))

  <span class="hljs-comment"># optionally and by default order the results not by column but by mean value</span>
  <span class="hljs-comment"># because ggplot builds from the bottom, we'll reverse them to put max value at the top</span>
  <span class="hljs-comment"># we could use fct_reorder but manually setting the order is straightforward in this case</span>
  <span class="hljs-keyword">if</span> (meanOrder) {
    plot.df$variable &lt;- factor(plot.df$variable, levels = rev(names(mean.df)[order(unlist(mean.df))]))
  }

  <span class="hljs-comment">#### Now : Build the plot</span>
  <span class="hljs-comment"># set.seed(ran.seed)   # optional; points are jittered; setting a seed would make them exactly reproducible</span>
  <span class="hljs-keyword">library</span>(ggplot2)
  <span class="hljs-keyword">library</span>(ggridges)

  <span class="hljs-comment"># build the first layer with the individual distributions</span>
  p &lt;- ggplot(data=plot.df, aes(x=value, y=variable, group=variable)) +
    geom_density_ridges(scale=<span class="hljs-number">0.9</span>, alpha=<span class="hljs-number">0</span>, jittered_points=<span class="hljs-literal">TRUE</span>,
                        rel_min_height=<span class="hljs-number">0.005</span>,
                        position=<span class="hljs-string">"points_sina"</span>,
                        <span class="hljs-comment"># set individual point alphas in inverse proportion to sample size</span>
                        point_color = <span class="hljs-string">"blue"</span>, point_alpha=<span class="hljs-number">1</span>/sqrt(p.resp),
                        point_size=<span class="hljs-number">2.5</span>) +
    <span class="hljs-comment"># reverse y axis to match attribute order from top</span>
    scale_y_discrete(limits=rev) +                                                   
    ylab(<span class="hljs-string">"Level"</span>) + xlab(<span class="hljs-string">"Relative preference (blue=individuals, red=average)"</span>) +
    ggtitle(title) +
    theme_minimal()

  <span class="hljs-comment"># now add second layer to plot with the means of each item distribution</span>
  <span class="hljs-keyword">for</span> (i <span class="hljs-keyword">in</span> <span class="hljs-number">1</span>:length(mean.df)) {
    <span class="hljs-keyword">if</span> (meanOrder) {
      <span class="hljs-comment"># if we're drawing them in mean order, get the right one same as above</span>
      p &lt;- p + geom_point(x=mean.df[[rev(order(unlist(mean.df)))[i]]], 
                          y=length(mean.df)-i+<span class="hljs-number">1</span>, colour=<span class="hljs-string">"tomato"</span>,      <span class="hljs-comment"># adjust y axis because axis is reversed above</span>
                          alpha=<span class="hljs-number">0.5</span>, size=<span class="hljs-number">2.0</span>, shape=<span class="hljs-number">0</span>, inherit.aes=<span class="hljs-literal">FALSE</span>)

    } <span class="hljs-keyword">else</span> {
      p &lt;- p + geom_point(x=mean.df[[i]], 
                          y=length(mean.df)-i+<span class="hljs-number">1</span>, colour=<span class="hljs-string">"tomato"</span>,      <span class="hljs-comment"># adjust y axis because axis is reversed above</span>
                        alpha=<span class="hljs-number">0.5</span>, size=<span class="hljs-number">2.0</span>, shape=<span class="hljs-number">0</span>, inherit.aes=<span class="hljs-literal">FALSE</span>)
    }
  }
  p
}

cbc.plot(md.dat, itemCols=classCols) + ylab(<span class="hljs-string">"Quant Course Offering"</span>)
</code></pre>
<p><a target="_blank" href="https://notbyai.fyi"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746999945541/9d224843-9e9f-44cc-98c5-276915794420.png" alt class="image--center mx-auto" /></a></p>
]]></content:encoded></item><item><title><![CDATA[How to make HEART metrics work in practice]]></title><description><![CDATA[👋 I'm happy to be making my first contribution to the Quant UX Blog! I'm Chris's co-author on the Quant UX Research book, and I'm excited to join him in sharing perspectives that complement the book.
I led the early Quant UX Research team at Google,...]]></description><link>https://quantuxblog.com/how-to-make-heart-metrics-work-in-practice</link><guid isPermaLink="true">https://quantuxblog.com/how-to-make-heart-metrics-work-in-practice</guid><category><![CDATA[metrics]]></category><category><![CDATA[user experience]]></category><category><![CDATA[UX]]></category><category><![CDATA[quantux]]></category><dc:creator><![CDATA[Kerry Rodden]]></dc:creator><pubDate>Tue, 15 Oct 2024 15:42:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/oCSol-lBtVA/upload/80a679f6ef54aceffaf1918f842b2af9.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>👋 I'm happy to be making my first contribution to the Quant UX Blog! I'm Chris's co-author on the <a target="_blank" href="https://quantuxbook.com/">Quant UX Research book</a>, and I'm excited to join him in sharing perspectives that complement the book.</p>
<p>I led the early Quant UX Research team at Google, where we came up with the HEART framework to help teams define metrics of user experience. More than a decade ago, I wrote this
<a target="_blank" href="https://library.gv.com/how-to-choose-the-right-ux-metrics-for-your-product-5f46359ab5be">blog post about UX metrics</a>
that introduced HEART to a wider audience outside of Google.</p>
<p>Since then, it has been applied by teams across the tech industry, and has found its way into many introductory resources about user experience and product management, including the popular book <a target="_blank" href="https://www.oreilly.com/library/view/escaping-the-build/9781491973783/"><em>Escaping the Build Trap</em></a>. Those resources generally cover similar ground to my original blog post, and there is little published material about what happens when teams actually apply HEART to real projects.</p>
<p>In this post, I'll go beyond those basic concepts and share some new perspectives on how to apply HEART in practice, based on material from Chapter 7 of the <a target="_blank" href="https://quantuxbook.com/">Quantitative UX Research book</a>. </p>
<p>It's divided into four sections:</p>
<ul>
<li>What is the HEART framework?</li>
<li>Don't skip the Goals-Signals-Metrics process</li>
<li>Avoid individual pitfalls</li>
<li>Look out for organizational challenges</li>
</ul>
<p>With this knowledge, you'll have a better chance of helping your team reach a successful implementation of user experience metrics.</p>
<hr />
<h2 id="heading-what-is-the-heart-framework">What is the HEART Framework?</h2>
<p>HEART helps teams break down the broad concept of "user experience" into more specific, measurable outcomes. It also encourages teams to consider multiple aspects of the user experience when defining metrics, although it does not cover every possible aspect.</p>
<p>The acronym stands for:</p>
<ol>
<li><p><strong>Happiness</strong>: Measures of user attitudes, often collected via surveys. This might include satisfaction or perceived ease of use.</p>
</li>
<li><p><strong>Engagement</strong>: The level of user involvement with a product, typically measured by frequency, intensity, or depth of interaction. For example, the number of visits during a certain time period, or the usage of key features.</p>
</li>
<li><p><strong>Adoption</strong>: How many new users start using a product or feature. Making an explicit distinction between new and existing users helps a team to understand growth.</p>
</li>
<li><p><strong>Retention</strong>: The rate at which existing users return to the product. This can be thought of as a long-term version of engagement. Some teams focus more specifically on failure to retain, which is known as "churn".</p>
</li>
<li><p><strong>Task success</strong>: The efficiency, effectiveness, and error rate of user actions. This category often yields the most useful metrics for UX changes, provided that task-specific data is available.</p>
</li>
</ol>
<hr />
<h2 id="heading-dont-skip-the-goals-signals-metrics-process">Don't Skip the Goals-Signals-Metrics Process</h2>
<p>HEART 💖 has a fun acronym that makes it easy to remember. Teams often get enthusiastic about it and jump straight to brainstorming metric ideas, because they want to get started on building a dashboard.</p>
<p>However, this is very unlikely to lead to a successful outcome. Metrics are not useful unless they are aligned closely with the team's high-level goals... and many teams are surprisingly unclear about what those are, especially in terms of user experience.</p>
<p>The Goals-Signals-Metrics process is designed to help with this problem by encouraging teams to start by thinking at a higher level.</p>
<ol>
<li><p><strong>Goals</strong>: Using the HEART framework as inspiration, define the overarching objectives for your product or feature. This step involves team discussions to align on priorities and user experience goals, including explicitly addressing the inevitable disagreements. Omit HEART categories that are less relevant to your project.</p>
</li>
<li><p><strong>Signals</strong>: For each goal, identify possible signals — ways that success or failure might manifest in user behavior or attitudes. Map the goals to the data that you are (or could be) collecting about user experience. Consider both the ease of tracking these signals and their likely sensitivity to design changes.</p>
</li>
<li><p><strong>Metrics</strong>: Develop specific, quantifiable measurements based on the signals. This involves steps like figuring out how to analyze the low-level signals (e.g., using averages or percentages) and deciding on appropriate time periods for aggregation.</p>
</li>
</ol>
<p>By following this process, teams can create meaningful metrics that align closely with their product goals and user experience priorities. To give a simple example, a goal of "make the upload process easier" might map to signals relating to completion of each stage of the process, or, alternatively, to responses to an inline survey question about ease of use. The signal about completion could translate to a specific metric like "the percentage of times a user finishes the upload flow successfully, having started it in the past 7 days".</p>
<p>Two important aspects of the process that I want to emphasize:</p>
<ul>
<li><p><strong>You Must Prioritize</strong>: Focus on implementing metrics related to your top goals. It's better to have a few well-chosen metrics than an overwhelming dashboard.</p>
</li>
<li><p><strong>You Must Iterate</strong>: When you've gone through the process the first time, you are not done. As you collect data and gain insights, be prepared to refine your choices over time.</p>
</li>
</ul>
<hr />
<h2 id="heading-avoid-these-individual-pitfalls">Avoid These Individual Pitfalls</h2>
<p>Certain issues come up over and over again for individual quantitative UX researchers when they try to apply HEART and Goals-Signals-Metrics. I still make some of these mistakes myself!</p>
<h3 id="heading-1-neglecting-team-involvement">1. Neglecting Team Involvement</h3>
<p>While it might seem efficient to develop metrics independently, this approach can backfire. Engaging your team throughout the process:</p>
<ul>
<li>Increases ownership and buy-in</li>
<li>Improves the quality of metric ideas</li>
<li>Enhances the impact of the eventual assessments</li>
</ul>
<p>I suggest scheduling a collaborative session to work through the first part of the Goals-Signals-Metrics process with key members of your team: agree on goals, and brainstorm possible signals.</p>
<h3 id="heading-2-starting-too-ambitiously">2. Starting Too Ambitiously</h3>
<p>The HEART framework is most useful when applied to specific projects with engaged teams. Avoid the temptation to create organization-wide dashboards immediately. Instead:</p>
<ul>
<li>Begin with a single, receptive team</li>
<li>Focus on details and learn from the experience</li>
<li>Use this initial project as a case study to inspire other teams</li>
</ul>
<h3 id="heading-3-underestimating-the-rest-of-the-process">3. Underestimating the Rest of the Process</h3>
<p>HEART and Goals-Signals-Metrics are only the beginning of a long process. Just because you used them to come up with a metric idea, that doesn't mean that it's a <em>good</em> or a <em>useful</em> metric, or that it has any correlation with quality of user experience.</p>
<p>Be prepared for:</p>
<ul>
<li>Data analysis to refine your signals and metrics</li>
<li>Iterating on your chosen metrics as you learn more</li>
<li>Implementation challenges, including instrumentation and dashboard creation</li>
</ul>
<p>Allocate sufficient time and resources for these crucial next steps. </p>
<h3 id="heading-4-metric-overload">4. Metric Overload</h3>
<p>While HEART can help you generate numerous ideas, implementing too many metrics can be counterproductive. To avoid overwhelming your team and other stakeholders:</p>
<ul>
<li>Clearly prioritize your most important metrics</li>
<li>Consider implementing secondary metrics on a separate dashboard</li>
<li>Remember that you don't need to use all HEART categories – focus on what's most relevant to your project</li>
</ul>
<hr />
<h2 id="heading-look-out-for-these-organizational-challenges">Look Out For These Organizational Challenges</h2>
<p>Every project takes place in context, and different organizations and teams have different challenges. You can attempt to address these proactively, or keep them in mind when selecting the right projects for applying HEART.</p>
<h3 id="heading-1-fear-of-evaluation">1. Fear of Evaluation</h3>
<p>Introducing targeted metrics can expose project shortcomings, which may create anxiety within teams. To address this:</p>
<ul>
<li>If you are in a leadership role, foster a culture of learning from failures (e.g., blameless post-mortems); otherwise, prioritize working with teams who already have this kind of culture</li>
<li>Involve stakeholders in the metric definition process to build trust and ownership</li>
<li>Emphasize the value of being informed by data, even when results are unexpected</li>
</ul>
<h3 id="heading-2-single-metric-tunnel-vision">2. Single Metric Tunnel Vision</h3>
<p>While having too many metrics is problematic, focusing on a single metric is also detrimental, even though leaders are often drawn to one in an attempt to create clarity and focus. Chris has a separate post about <a target="_blank" href="https://quantuxblog.com/north-star-a-path-to-being-lost">the problems with "North Star" metrics</a>. Remember:</p>
<ul>
<li>No metric can be a perfect representation of user experience</li>
<li>Additional key metrics provide essential checks and balances</li>
<li>Optimizing for one metric can render it useless for evaluation (<a target="_blank" href="https://en.wikipedia.org/wiki/Goodhart%27s_law">Goodhart's Law</a>)</li>
</ul>
<h3 id="heading-3-not-considering-ethical-implications">3. Not Considering Ethical Implications</h3>
<p>UX metrics are only proxies for user experiences. No metric of user engagement can actually identify how truly engaged a user is, or whether that engagement represents a positive experience for them. For example, time spent in a product is often used as a default metric of engagement, but this may not be appropriate for a given product, especially if unhealthy overuse is a possibility.</p>
<p>When implementing metrics:</p>
<ul>
<li>Consider long-term outcomes and potential negative consequences</li>
<li>Use qualitative research to gain deeper insights into actual user experiences</li>
<li>Be willing to iterate on metrics as you learn more about what constitutes a positive user experience</li>
</ul>
<hr />
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<p>HEART is a useful tool to help teams focus on the user experience when defining metrics. Here's a summary of the main things to keep in mind when applying it in practice:</p>
<ul>
<li>Use HEART as a starting point to consider various aspects of user experience, but don't feel you must use all of the categories, and don't force goals to fit into them</li>
<li>Follow the Goals-Signals-Metrics process to create meaningful, prioritized metrics</li>
<li>Collaborate with your team to define goals, signals, and metrics</li>
<li>Start small and grow your approach over time</li>
<li>Be prepared for the work that comes after initial metric definition</li>
<li>Prioritize your most important metrics to avoid overload, but don't narrow down to a single metric</li>
<li>Look out for the organizational challenges that come with negative results</li>
<li>Consider ethical implications and long-term outcomes</li>
</ul>
<p>For more detail on all of this, and a case study of a Gmail project, see Chapter 7 of the <a target="_blank" href="https://quantuxbook.com/">Quantitative UX Research book</a>. I'm also available for <a target="_blank" href="https://kerryrodden.com/heart">consulting or speaking</a> on these topics 😎</p>
<p>Have you applied HEART with your team? If you have, I encourage you to write about your experiences, so that the rest of the community can build on what you've learned.</p>
]]></content:encoded></item><item><title><![CDATA[So long 110%]]></title><description><![CDATA[Three weeks ago I "retired" from corporate work. This follows 24 years as a UX researcher at Google (11 years), Microsoft (11 years), and Amazon (2 years), and as a psychologist before that.
This post is mostly an announcement — it's not a retrospect...]]></description><link>https://quantuxblog.com/so-long-110</link><guid isPermaLink="true">https://quantuxblog.com/so-long-110</guid><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Tue, 20 Aug 2024 17:38:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724172762289/8e4dc294-b4ce-443e-97e4-74480df00ca7.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Three weeks ago I "retired" from corporate work. This follows 24 years as a UX researcher at Google (11 years), Microsoft (11 years), and Amazon (2 years), and as a psychologist before that.</p>
<p>This post is mostly an announcement — it's not a retrospective, which I'll save for another time. However, I share a couple of updates on what's next.</p>
<hr />
<h2 id="heading-why-now">Why Now?</h2>
<p><strong>Basically, I feel ready</strong>. There is no particular external reason or pressure. I'm 56 years old, and am fortunate to be in good health, with a job that was going well.</p>
<p>On the other hand, after 24 years, plus more years of pre-Tech work, I have accomplished "enough" in the corporate world. The challenges in UX looking forward are no longer as exciting to me as they once were. (That's not a commentary on the UX world — only a comment about my perception.)</p>
<p>BTW, I'll give particular thanks to my most recent colleagues at Amazon Lab 126, where I was very happy. This decision is not a reflection on them or that position. (I can say the same for my 11 years each at Google and Microsoft.)</p>
<p>And I have opportunities apart from corporate work, as I mention next!</p>
<hr />
<h2 id="heading-whats-next">What's Next?</h2>
<p>Many friends have asked me variations of, "What will you do? Don't just quit and watch TV!"</p>
<p>My answer: <strong>thanks and don't worry</strong>! For one thing, I have not turned on a TV set in 5 days (maybe longer, I don't even recall).</p>
<p>More to the point, I will have 4 primary activities keeping me busy:</p>
<ul>
<li><p><strong>Running the Quant UX Association</strong> — I'll lead both the annual conference, <a target="_blank" href="https://quantuxcon.org">Quant UX Con online</a>, and our new <a target="_blank" href="https://quantuxcon.org/classes">in-person training classes</a>.</p>
</li>
<li><p><strong>Other community</strong>: I'm an ordained <a target="_blank" href="https://blueheronzen.org/about/teachers/">Zen priest</a>, and last year I received teaching authorization. With corporate retirement, I can also devote more time to that.<br />  ==&gt; <em><mark>Update April 2025</mark></em>: I am starting the <strong>Tech Community Zen group</strong> for online practice. <a target="_blank" href="https://tczen.org"><strong>More here</strong></a>!</p>
</li>
<li><p><strong>Hobbies</strong>: I have several hobbies. Hobby #1 is boating in the Seattle region and north to British Columbia. We have plans to take our boat all the way to Alaska.</p>
</li>
<li><p><strong>Catching up on writing</strong>. I'm overdue on projects such as new editions of my <a target="_blank" href="https://quantuxbook.com">quant book</a> or <a target="_blank" href="https://r-marketing.r-forge.r-project.org">R book</a>, and/or new books and articles. (You'll notice I list writing <em>after</em> the three "fun" activities above.)</p>
</li>
</ul>
<hr />
<h2 id="heading-im-not-going-anywhere-but-keep-in-touch">I'm not going anywhere ... but keep in touch!</h2>
<p>If you haven't already, follow the <a target="_blank" href="https://www.linkedin.com/company/quantuxa">Quant UX Association</a> and <a target="_blank" href="https://www.linkedin.com/in/cnchapman/">me on LinkedIn</a> or <a target="_blank" href="https://bsky.app/profile/cchapman.bsky.social">Bluesky</a>. Also, subscribe to this blog (hit the "follow" button above).</p>
<p>I hope to hear from you ... or even better to see you at <a target="_blank" href="https://quantuxcon.org">Quant UX Con</a> or a Quant class! Cheers,</p>
<p>-- Chris</p>
<p><em>P.S. The cover photo shows me in Shinjuku National Garden in Tokyo, in winter time. Photo credit to my Google colleague, Katie Tzanidou.</em></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748018480826/b649704b-ea32-47a5-9e2c-afddf0af0040.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Quant UX Interview "Portfolio" Presentations: Recommendation]]></title><description><![CDATA[A reader recently asked me for advice on putting together a Quant UX portfolio. That sparked me to reflect on presentations that are part of a hiring interview process.

I'm talking about research presentations that are often confused with "portfolio...]]></description><link>https://quantuxblog.com/quant-ux-interview-portfolio-presentations-recommendation</link><guid isPermaLink="true">https://quantuxblog.com/quant-ux-interview-portfolio-presentations-recommendation</guid><category><![CDATA[quantux]]></category><category><![CDATA[#Ux research]]></category><category><![CDATA[interview]]></category><dc:creator><![CDATA[Chris Chapman]]></dc:creator><pubDate>Thu, 11 Jul 2024 19:01:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/Q80LYxv_Tbs/upload/202a6f32bf8ef2cbb796be7b01af0c90.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A reader recently asked me for advice on putting together a Quant UX portfolio. That sparked me to reflect on presentations that are part of a hiring interview process.</p>
<blockquote>
<p>I'm talking about research presentations that are often confused with "portfolios". They are different from actual "portfolios" that compile many projects and that live online. I doubt whether those are valuable for general tech job applicants (who would look at them?) and in any case, they are different.</p>
<p>However, a presentation as described here could be a key building block in a complete portfolio (if you decide you need one).</p>
</blockquote>
<p>Over the years as a hiring manager and interviewer, I've sat through perhaps 150 interview candidate presentations. I've also observed 1000s of industry research presentations — and of course I've been on the other side as a job candidate. Here are my reflections.</p>
<p>The short version is this: a UX <strong>research presentation should demonstrate how you have defined and answered an interesting research question</strong>. I emphasize "<em>presentation</em>", "<em>defined</em>", "<em>answered</em>", and "<em>interesting</em>". The presentation is about research impact and engagement, <strong>not</strong> about technical proficiency. In this post, I describe the pieces to deliver that.</p>
<p>An even shorter version is this: <strong>are you genuinely enthusiastic about the research you're presenting? If so, that is the best presentation!</strong></p>
<hr />
<h3 id="heading-its-a-presentation-not-a-portfolio">It's a Presentation, not a "Portfolio"</h3>
<p>The term "portfolio" is sometimes misleadingly used to refer to a presentation. It implies that a company wants you to compile an array of materials that describe the range of one's expertise and experience.</p>
<p>That has never been the case anywhere I have worked and hired candidates. Instead, what the hiring team really wants is a clear <strong>presentation</strong> of research. It should be an example of what they might expect from you, if you're hired.</p>
<p>My recommendation is to plan to present 1 case study ... and to have a 2nd case study as a backup. Almost always I find that the 1st case leads to plenty of questions and discussion, but occasionally it doesn't. I have never — not once in 20+ years — observed a need for more than 2 case studies.</p>
<p>I illustrate my points using <strong>slides from my own most recent job presentation</strong> (in early 2022, when I applied for my current job). I <em>don't</em> claim that my slides are awesome or that they are a model ... all I claim is that (a) they illustrate my points, and (b) they helped me land my job (and a quite similar approach helped land my previous job).</p>
<p>Throughout, I encourage you to <strong>focus on UX — user / customer experience — more than any "Quant" aspect</strong>. Impress the audience with what you have to contribute to product research, which is <em>built on, but goes beyond,</em> quant technical expertise.</p>
<hr />
<h3 id="heading-the-audience-has-broad-not-specific-expertise">The Audience has Broad, not Specific Expertise</h3>
<p>Approach a research presentation just as you would a typical stakeholder meeting with intelligent and interested colleagues who are not researchers — or, if they are researchers, who are not experts in your research area.</p>
<p>At companies like Google, Microsoft, and Amazon, candidates often assume that the interviewers are super-intelligent beings who will know and probe everything. That is not the case!</p>
<p><strong>A better assumption is that interviewers are very smart colleagues who have a different specialization than yours</strong>. They will catch on quickly and ask great questions — and they will immediately sense when you are bluffing — but they also need to be given grounding in your area. If you're an academic, think of them as quick-minded professors from different but related departments.</p>
<p>Explain the basics briefly and prepare to take questions, but don't talk down to the audience. Don't assume that they are experts in your methods or that they want detailed technical explanations (they'll ask if they do). Minimize assumptions and jargon.</p>
<p>Here's how I introduced the topic in my own portfolio with two slides. You'll notice that they are very simple — they highlight key points I made verbally.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720715955873/6127ae37-e102-4dcb-883d-224d65518fbf.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720715872166/384769db-22c2-4aee-ae30-0a3088a1e74d.png" alt class="image--center mx-auto" /></p>
<p>With this kind of framing, the audience knows what I'm going to talk about, and they are immediately focused on a problem statement ... as I discuss in the next section.</p>
<hr />
<h3 id="heading-have-a-narrative-story-with-a-dilemma">Have a Narrative Story with a Dilemma</h3>
<p>This is the most important point: <strong>your case should have a narrative arc</strong> that explains why you did the research — from a business or product point of view — and how it influenced some action.</p>
<p>A good way to frame the arc is with <strong>a dilemma that gets resolved</strong>. For example:</p>
<ul>
<li><p>We needed to change X, but users wanted Y. Here's how research led to a good compromise.</p>
</li>
<li><p>The team was pushing A and yet initial tests suggested that users hated A. Here's how research cleared up the situation.</p>
</li>
<li><p>Our business wanted to expand in B, but the direction was unclear. Here's how research recommended a direction and how it turned out.</p>
</li>
</ul>
<p>Remember that the audience are not specialists in your method. That means that <strong>the dilemma should involve the business and customers, not just research</strong>. For instance, it is <em>not</em> interesting to present, "Technical method A says X but technical method B says Y. I developed technical method C to resolve them." That is uninteresting to anyone except you and a small number of fellow researchers ... unless it connects very clearly to a customer or business problem.</p>
<p>Following the 2 slides above, here's how I laid out a specific dilemma:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720716209876/2ed81ea2-2a82-4cbc-8ac0-faa8f5aa687f.png" alt class="image--center mx-auto" /></p>
<p>Notice that I focus on a tradeoff involving both customers and the business — and again, it is simple and approachable for any intelligent audience.</p>
<p>A research presentation is a <em>job sample</em> and — as shown in my slide above — it should demonstrate the following two crucial skills for UX researchers:</p>
<ul>
<li><p>Focusing on the customer / user</p>
</li>
<li><p>Presenting research clearly to colleagues</p>
</li>
</ul>
<p>To be clear, I do <strong>not</strong> advocate "story telling" in the senses of inventing anything, including artificial drama, sprucing up a presentation with extraneous content or images, or anything like that.</p>
<p>Rather, a presentation should be clear and decision-focused. Focusing on a customer &amp; business dilemma helps to accomplish that.</p>
<hr />
<h3 id="heading-minimize-what-you-say-about-yourself">Minimize What You Say About Yourself</h3>
<p>Too many candidates present multiple slides that detail their academic background, previous work experience, projects they've done, and sometimes things like their hobbies, places of residence, and families.</p>
<p>I suggest having <strong>only 1 slide about yourself</strong>, omitting anything that is not particularly relevant, that already appears on your resume (such as academic background), or that is extraneous to the job (such as family and hobbies).</p>
<p>In my most recent portfolio, I did not mention my previous jobs (see resume!) or academic credentials (again, see resume!). Instead, I had <strong>one slide about myself</strong> ... and it framed how I view the landscape of user research, along with some things I've done:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720715681622/e1432908-ceff-4783-b3bd-0d5fbe6c973f.png" alt class="image--center mx-auto" /></p>
<p>Why did I use the 1 slide about "me" in this way? Because:</p>
<ul>
<li><p>This framing lets me share my "philosophy" of end-to-end research</p>
</li>
<li><p>It shows that my focus is not on "me" but on how I approach research</p>
</li>
<li><p>But it also highlights specific experience and competence</p>
</li>
<li><p>It lets me link my research philosophy to the case studies I'll present</p>
</li>
</ul>
<p>I then promised a deep dive into two areas ... which shows how and why the framing of this "about me" slide is directly relevant to the research itself:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720716731404/ee809361-6713-4ca9-8e58-2eee073af63e.png" alt class="image--center mx-auto" /></p>
<p>It also implicitly communicates, "I'm showing 2 things but I do much more."</p>
<hr />
<h3 id="heading-dont-include-confidential-information">Don't Include Confidential Information</h3>
<p>A common — and in my opinion, serious — mistake is to include information that is confidential to your current or previous employer.</p>
<p>I've repeatedly seen candidates who present information including:</p>
<ul>
<li><p>Screen mockups and images of unreleased products</p>
</li>
<li><p>Photos of customers and/or team members</p>
</li>
<li><p>Names of customers and/or team members</p>
</li>
<li><p>Statistical analyses of user needs, product behaviors, purchases, willingness to pay, etc., that come from internal, confidential data</p>
</li>
<li><p>Internal data about finances, sales, staffing, or other business metrics</p>
</li>
</ul>
<p>There are two problems with presenting such confidential information. First, it violates your employment agreement ... and the interviewing company will wonder whether you will do the same thing with them! <strong>Second, it demonstrates a lack of attention or competence with a core professional skill for UX researchers: ethical research</strong>.</p>
<p><em>But wait?!</em> What can you show if you can't show actual results? <strong>Your audience will understand and appreciate the need for confidentiality.</strong> It is OK to have blurred images, hypothetical screen mockups, similar but disguised products, stock photos of users, and altered statistics that communicate a parallel result.</p>
<p>Even better, <strong>if you get approval to publish research periodically in your career, you'll have a collection that is perfectly OK to share</strong>. That's not required but is nice when it occurs. Following is an example from my presentation (a similar example is shown in the Case 1 intro above). This slide also shows how to credit colleagues.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720717708330/48599548-a294-41e1-83e0-4caff7053f23.png" alt class="image--center mx-auto" /></p>
<hr />
<h3 id="heading-resolve-the-narrative-arc-with-a-decision-and-impact">Resolve the Narrative Arc with a Decision and Impact</h3>
<p><strong>Every good story needs a clear resolution</strong>. Your case study should start with a clear dilemma or decision to be made, as noted above. Then you should end with a resolution of that dilemma.</p>
<p>I'm not going to recap my entire presentation, but I will skip ahead to the key research result for Case #1. I used <a target="_blank" href="https://www.researchgate.net/publication/261697842_Game_theory_and_conjoint_analysis_using_choice_data_for_strategic_decisions">game theory + conjoint analysis</a> to answer the question, and presented the final result like this (after walking through the pieces):</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720717941966/04086d8c-c2e1-4251-8910-08a55db319c6.png" alt class="image--center mx-auto" /></p>
<p>You'll notice that this slide is <strong>technical but also quite approachable</strong> (especially after building up some of the components, which I did in previous slides). At the same time it is not particularly complex. The goal is to be clear and convincing, <strong>including only what is needed to get the point across</strong>. That has the nice side effect of making the audience feel smarter (which will make them like you!)</p>
<p><strong>That slide directly answers the dilemma</strong> that I started with — "yes, we should invest in developing the new feature" — and it set up the next slide about research impact:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1720718121205/5275a457-e85c-4192-a9bd-68813f4467fb.png" alt class="image--center mx-auto" /></p>
<p>Notice again that I focused clearly on the business and the customers. And I also worked in a not-so-subtle plug for listening to users through research!</p>
<hr />
<h3 id="heading-for-more">For More</h3>
<p>Do you want to see my complete job presentation? I presented it live at <a target="_blank" href="https://quantuxcon.org">Quant UX Con</a> 2022 (after my interview) and the <a target="_blank" href="https://drive.google.com/file/d/1NehSek1WjKx0Jhf5qipFNLyb6-GXQM9r/view?usp=sharing"><strong>complete PDF is here</strong></a>.</p>
<p>Does something here worry you, because you didn't "get it right" in a recent presentation? <strong>No one ever gets these things completely right</strong> for one simple reason: <em>a portfolio review is an interactive presentation</em> ... and the other side is 50% of the equation. But that side is unknowable in advance.</p>
<p>Sometimes the interviewers are awesome (<em>cue</em>: the colleagues who interviewed me at Microsoft, Google, and Amazon). But sometimes interviewers are jerks (stories I've heard) — and really, it's better to find that out before taking a job. Don't worry about it! Just update your expectations and improve your presentation for the next round.</p>
<p>I hope something here will be useful for you in putting together a Quant UX research presentation. Thanks for reading!</p>
<p><a target="_blank" href="https://notbyai.fyi"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746999945541/9d224843-9e9f-44cc-98c5-276915794420.png" alt class="image--center mx-auto" /></a></p>
]]></content:encoded></item></channel></rss>