# Individual Scores in Choice Models, Part 3: Respondent Segments

This is **Part 3 of a series** examining how to work in R with individual-level estimates for preferences. We’re using real data from a [MaxDiff survey](https://quantuxblog.com/easy-maxdiff-in-r) on the preferences of N=308 respondents interested to take [classes](https://www.quantuxcon.org/classes) from the Quant UX Association.

If you haven’t read Posts 1 and 2, you’ll want to catch up on the background and [basic data in Post 1](https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages), and on the question of correlational patterns of interest in [Post 2](https://quantuxblog.com/individual-scores-in-choice-models-part-2-correlations-among-items). As usual, I illustrate everything with R code and will discuss the code as we go, with all of the code compiled at the end.

The topic of this Part 3 is this: are there **clusters (segments)** of *respondents* who have similar patterns in their interest in classes? As my coauthors and I discuss in the [R book](https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155) and the [Python book](https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194), the process of finding segments in real data is complex and there is no magic answer. Segmentation is an iterative process of finding some **number of segments** and seeing whether those segments are **useful and interpretable** for your business need.

In this post, I discuss **one way** to find the number of segments and to visualize the results … with a big caveat: there is much more to say! This is *one* *illustration* and is *not a prescription*. No blog post can give a complete recipe for clustering. For comprehensive discussion and learning, see “Learning More” at the end of this post. Also stay tuned for a future “Segmentation FAQ” blog post soon!

**Warning: long post**! I like talking about segmentation (both pro and con) :) Like many posts in this blog, this is intended to complement other writings. It complements the [R book](https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155) and [Python book](https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194) for conjoint, and the [Quant UX book](https://www.amazon.com/Quantitative-User-Experience-Research-Understanding/dp/1484292677) for MaxDiff.

---

### First, Get the Data

You can read [Posts 1](https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages) and [2](https://quantuxblog.com/individual-scores-in-choice-models-part-2-correlations-among-items) for background. But if you’re jumping in here, the following R code will load the data. These are N=308 responses estimating researchers’ levels of interest in Quant UX classes, from a MaxDiff survey.

You can get the data with:

```r
# get the data; repeating here for blog post 3, see post 1 for details
library(openxlsx) # install if needed
md.dat <- read.xlsx("https://quantuxbook.com/misc/QUX%20Survey%202024%20-%20Future%20Classes%20-%20MaxDiff%20Individual%20raw%20scores.xlsx")   # 
md.dat$Anchor <- NULL
names(md.dat)[3:16] <- c("Choice Models",  "Surveys",       "Log Sequences",    "Psychometrics", 
                         "R Programming",  "Pricing",       "UX Metrics",       "Bayes Stats",
                         "Text Analytics", "Causal Models", "Interviewer-ing",  "Advanced Choice", 
                         "Segmentation",   "Metrics Sprints")
classCols <- 3:ncol(md.dat)    # generally, Sawtooth exported utilities start in column 3
```

I recommend to follow along live in R, installing packages as needed. All the R code is given as I go, and also compiled at the end of the post.

---

### Background: General Clustering vs Latent Class Analysis

At a high level, there are two approaches to clustering/segmenting data from choice surveys.

One approach is to estimate respondents' preferences first as a whole sample, and then apply **general clustering** methods *later* to those estimates to find segments. My experience suggests that a majority of analysts use this approach for several reasons. First, it is easiest because it divides analysis into tidy stages. Second, you can use the widest range of tools and methods. Third, it does not require specialized software for the clustering portion, but uses general methods from R, Python, or wherever. Fourth, the approach can be used with any kind of data and with mixed types of data.

The second approach is to model segments *at the same time* that you estimate individuals’ preferences. This is the approach taken by **latent class analysis** (LCA), as applied to choice surveys. The primary advantage of LCA is that, by doing both steps together (estimation + segmentation), it is more robust to sparse data and noise in the data. On the other hand, LCA requires more specific assumptions about the data, along with specialized tools.

In the choice modeling community, the most common tools used for choice model LCA are Statistical Innovation’s *Latent Gold*, Sawtooth Software’s *Lighthouse Studio*, and proprietary code bases within analytics firms. These may add cost and complexity to analyses, while limiting the generality of what you can do. By contrast, general clustering methods are widely available in R and other platforms. (Note that choice model LCA is possible in **R** … but the effort is beyond reasonable levels for general practitioners. FWIW, I personally *avoid* doing choice model LCA in R.)

**In this article, I demonstrate use the general clustering approach** … with one note: if you *frequently* need to cluster *choice model* data, you should check out Sawtooth Software’s Lighthouse Studio and/or Statistical Innovations’s Latent Gold. Their tailored functionality can be helpful, efficient, and more precise for practitioners. For my part, I often use LCA in Sawtooth’s Lighthouse Studio. However, I *also* often use general clustering as this post illustrates.

In short, although I use choice survey data to demonstrate the methods here, the methods here are general. You can apply the tools and code in this post to a wide variety of data sources.

---

### Finding a Number of Segments: Part 1

**The first question in segmentation is this: are there segments at all?** And if so, how many?

A *reasonable assumption* — especially in high dimensional space, i.e., in data with many variables — *is that people do* not *cluster together*, and that instead, they differ from one another in unstructured (“random”) ways. Some jargon for that is that the data exhibit a *multivariate normal distribution (MVN) with one component*. In other words, there is a single “segment” and every respondent is a member!

My take — see the “clustering” chapters in the [R book](https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155) and the [Python book](https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194) — is that MVN data with 1 cluster is quite often the “best” statistical answer … but the **number of segments is not primarily a statistical question**. There are hundreds of methods to cluster data and they typically give different answers. The question is whether we can find segments that are *useful and informative* for a business question, while secondarily optimizing statistically.

There are several ways we might inspect the data to see whether it appears to have clusters. One way is a **model-based** approach to determines how many clusters may underlie the data. The `mclust` library is one of my favorites. In R we can fit a clustering (segmentation) model to the MaxDiff data as follows:

```r
# mclust uses model-based fit estimates to determine best number of latent clusters
library(mclust)    # install if needed
md.mc <- Mclust(md.dat[ , classCols])
summary(md.mc)
```

The result is:

![results of the mclust summary as produced by the code, with 1 cluster](https://cdn.hashnode.com/res/hashnode/image/upload/v1729526558978/24be6dc9-1825-4ccf-8bd4-cee8baea4703.png align="center")

It found a multivariate normal distribution with one component. In other words, the best fit is **1 segment**.

Another approach is **visual**. We could use a dimensional reduction approach to represent the data in a visually suitable number of dimensions — 2 or 3 dimensions — and then inspect it to see whether there is some obvious clustering.

There are many dimensional reduction models but a modern starting point is UMAP (*uniform manifold approximation and projection*). UMAP attempts to “project” (i.e., re-estimate) data in a lower number of nonlinear dimensions (“uniform manifold”) while also preserving (“approximating”) the structure of the complete data. (For an introduction, see Coenen and Pearce on [“Understanding UMAP” here](https://pair-code.github.io/understanding-umap/).)

We can visualize the MaxDiff class interest data with UMAP as follows. First we fit a UMAP projection to the data, setting a random seed to make it reproducible:

```r
library(umap)
# get umap representation of the individual interest data
umap.config <- umap.defaults
umap.config$random_state <- 98101

umap.sol <- umap(md.dat[ , classCols], config = umap.config)
umap.dat <- data.frame(umap.sol$layout)
# review the kind of data we have
str(umap.dat)
```

Using `str()` we see that the data now has 308 observations reduced to 2 dimensions, `X1` and `X2`. The particular values don’t really matter in themselves; it is the *relationships* among the values that reproduce the relative structure of the observations vs. one another:

![data from UMAP, a data frame with values in 2 dimensions (columns)](https://cdn.hashnode.com/res/hashnode/image/upload/v1729633063693/8967ace8-be13-4e47-9b42-03817be3f101.png align="center")

Now we can plot the UMAP points just like any other X / Y data using a simple scatterplot of the points:

```r
# plot it
ggplot(umap.dat, aes(x = X1, y = X2)) +
  geom_point() +
  xlab("High order dimension 1") + ylab("High order dimension 2") +
  theme_minimal()
```

Here’s the chart:

![an ellipsoidal plot of the points that is very smoothly round with no obvious clustering from initial visual inspection](https://cdn.hashnode.com/res/hashnode/image/upload/v1729633199019/380c3364-2c31-425c-b940-132efe12388d.png align="center")

**This chart shows no obvious “clusters” of respondents**. It is slightly denser in some areas than others but overall has an appearance of being smoothly round without separation among groups. This appearance is not a statistical test, but it is congruent with the idea that there are not multiple obvious segments but 1 segment of everyone.

So far we are striking out on finding segments. The data set strongly appears to come from a single multivariate normal population (i.e., from one segment).

But remember I mentioned that there are hundreds of clustering methods? Let’s keep prodding and see if we can find something else that is useful!

---

### Finding a Number of Segments: Part 2

Another R library that suggests the number of clusters in data is `NbClust`. It implements 8 clustering methods with a choice of 6 different distance (dissimilarity) metrics applicable to each one. After you select a distance metric and clustering method, it reports 30 heuristics (sometimes fewer, for reasons I’ll skip) that each suggest the number of clusters.

That’s easier to understand from an example:

```r
library(NbClust)
md.nbc <- NbClust(md.dat[ , classCols],
                  min.nc = 2, max.nc = 10,
                  method = "ward.D", 
                  distance = "euclidean")
```

In this code, I ask it to consider anywhere from 2 to 10 clusters (`min.nc`, `max.nc`). For a distance metric, I use the classic Euclidean metric (i.e., triangular hypotenuse, except extended to many dimensions), and cluster them with the “Ward’s D” method (see `?NbClust` for details).

Here are the results, first for 24 methods that give textual results, and then for 2 methods that give graphical results:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1729634680960/67d11ca9-89cf-4cab-95a2-e483f480de41.png align="center")

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1729634713712/d59b50ba-6c9c-478c-be80-63dd6cd79a43.png align="center")

The textual results say that a “winner takes all” vote by the methods indicates **2 clusters** (as chosen by 9 out of 24 methods). However, almost as many methods (7 out of 24) suggest **5 clusters**. In the graphical results, both charts suggest 2 clusters as the best fit; and one of them suggests 5 clusters as a secondary possibility.

So, should we choose 2 clusters? In practice, I believe **it is rarely useful or desired (by stakeholders) to discuss only 2 clusters** — unless we are strictly speaking of likelihood of a binary outcome (such as purchase behavior). For purposes of *understanding* customers and users, binary options are rarely illuminating.

Thus, instead of 2 clusters, let’s look at the “second best” option of 5 clusters. What would that solution tell us?

To begin with, we’ll impose a 5 cluster solution. There are many ways to do that, but in this case we can simply run `NbClust()` again, setting the minimum number to 5:

```r
# we'll try a 5 cluster solution instead of 2
md.nbc <- NbClust(md.dat[ , classCols],
                  min.nc = 5, max.nc = 10,
                  method = "ward.D", 
                  distance = "euclidean")

md.segs <- md.nbc$Best.partition
table(md.segs)
proportions(table(md.segs))
```

This code runs `NbClust` again, with a minimum solution of 5 clusters (which is then found to be the best option with that limitation). We extract the segment assignment for each respondents (`$Best.partition`) and save them as `md.segs`. The final two lines show the counts of segment assignments and their relative proportions:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1729641005386/8c784c8c-181e-4656-8d67-3d6911d29b1c.png align="center")

The smallest segment (segment 5) has 10% of respondents, while the largest (segment 1) has 28% of respondents. That is a reasonable spread. It does not suffer the common problem where one or two segments have a very small proportion of respondents such as only 2% or 5% of respondents.

Next we’ll see whether the segments differ in *interpretable* ways.

---

### Visualizing Segments in Dimensional Space

Remember the UMAP chart above? Now that we have segment assignments, we can add those to the chart. Are there visual patterns when the segments are plotted in the higher order dimensions found by UMAP?

Here’s the code. It’s the same as above, except it now colors the points according to which segment each respondent is in (`md.segs`):

```r
# add the 5-segment membership to the UMAP data
umap.dat$Segment <- factor(paste0("S", md.segs))

# plot it with segments
ggplot(umap.dat, aes(x = X1, y = X2, colour = Segment)) +
  geom_point() +
  xlab("High order dimension 1") + ylab("High order dimension 2") +
  theme_minimal() +
  ggtitle("Dimensional map with 5 segments")
```

Here’s the result:

![the same UMAP chart as above, but the respondents are colored by segment assignment. There are 5 clear slices of color.](https://cdn.hashnode.com/res/hashnode/image/upload/v1729641289278/029031e1-6bba-44c2-b7b9-0789b2b1ab18.png align="center")

**Whoa**! This is a chart to make a Quant UXer very happy! In it, there are five extremely clear slices of the plot. Almost all of the members of each segment are close to others in the same segment — which is to say that the segments map very cleanly to the higher order dimensions of the data. That suggests that we may see real differences when we examine those dimensions (interest in different classes).

But first, I hope you’re thinking to yourself, “*Wait, am I fooling myself*? Maybe the colors being close together doesn’t mean much.” Good question!

Let’s double-check ourselves by comparing what the UMAP chart would look like if the segments were *randomly* assigned. To do that, we shuffle the same set of assigned segment memberships (using `sample()`) such that they are randomly assigned to different rows of the UMAP data, and plot them again:

```r
# make sure we're not fooling ourselves ...
# what would it look like if the segments were random?
set.seed(98107)
# the same segments but in random order
umap.dat$RndSeg <- sample(umap.dat$Segment)
# plot those
ggplot(umap.dat, aes(x = X1, y = X2, colour = RndSeg)) +
  geom_point() +
  xlab("High order dimension 1") + ylab("High order dimension 2") +
  theme_minimal()+
  ggtitle("Dimensional map with RANDOM segments")
```

The result:

![same chart as above, except the 5 segments have been assigned randomly as a permutation of the "real" segment assignments. Now the chart shows a random mix of colors with no segment structure.](https://cdn.hashnode.com/res/hashnode/image/upload/v1729641899282/ba88dc56-c552-4ca6-b4c1-b6f08e03be5d.png align="center")

In this chart, there is no pattern to the colors for segment assignment. This tells us that the extremely clear pattern shown in the previous chart reflects real structure of the segments, relative to the data; the appearance was not simply an accident.

**In short, the five segments align strongly with real properties of the data**. Now the question becomes: *is that alignment interpretable and useful*? I look at that question next.

---

### Visualize Segments with a Heat Map

In the above analyses, we found — or, if you prefer, “imposed” — segment assignments that show statistical differentiation (the assessment heuristics in `NbClust`) and that align with the high dimensional data (the segment visual differences in UMAP). But are those results merely statistical or are they strong enough to be *meaningful* as a practical matter?

To find out, I’ll plot a **heat map** for class interest, broken out by segment. And, as you’ve probably guessed from my previous posts, I’ll make it a **function** :) With nothing more than minor tweaks (at most), this function will work for segmentation cross-tabs from many different kinds of data sets.

Here’s the function, with comments following below:

```r
seg.heat <- function(dat, segs) {
  library(superheat)   # install if needed
  # aggregate mean value by segment
  heat.sum <- data.frame(t(aggregate(. ~ segs, data = dat, mean))[-1, ])
  # make the column names reflect the segment number and size (%)
  names(heat.sum) <- paste0("S", 1:max(segs), " (", 
                            round(prop.table(table(segs)), 2)*100, "%)")
  # draw it with superheat
  superheat(heat.sum, 
            heat.pal = c("red3", "white", "green3"),
            grid.hline.col = "white", grid.vline.col = "white",
            pretty.order.rows = TRUE,
            clustering.method = "hierarchical",
            X.text = round(as.matrix(heat.sum), 2),  
            X.text.size = 4, 
            left.label.size = 0.3,
            left.label.text.size = 4,
            bottom.label.text.size = 4,
            legend = FALSE)
}
```

In this function, I use the `superheat` package to draw a heat map. There are many alternatives but `superheat` is straightforward while offering many options. The first step is to use `aggregate()` to get the cross-tabs of item means broken out by segment. That single line of code also converts the result to a data frame as `superheat` expects. (FYI, I drop the initial row using `[-1, ]` in the conversion because that row contains column names rather than data.)

In the next line, I change the `names()` of the data frame so they start with “S” for segment, followed by the nominal segment “number”. I add a label for the relative *percentage* of each. That gives nice labels in `superheat` with the segment name and its size, such as “*S1 (28%)*”. BTW, if you had friendly names for the segments — such as “R Pirates” or “UX Mavens” — you could add them here.

After that, the final step is to call `superheat` to draw the heat map. I won’t explain all of the options in that call, but will mention the three most important ones: (1) I set a **color** **palette** that avoids extreme color intensity. (2) I use “**pretty rows**” and clustering to reorder the results to be grouped by similarity, making it easier to read and interpret the segments. (3) I add the **mean values** themselves — rounded off to 2 decimal places — to the cells using `X.text`. Without that, it would draw the heat colors but not label the values (which is also a perfectly appropriate choice).

We call the `seg.heat()` function with the MaxDiff estimates and assigned segments:

```r
seg.heat(md.dat[ , classCols], md.segs)
```

Here’s the result:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1729724203602/7638f3bf-0ca8-4473-ad60-a8d62fff08be.png align="center")

What do we see? The five segments are strongly differentiated. I suggest you take a minute to interpret the results … before reading further!

What did you find? Here are a few reflections on each segment from my perspective. I tend to interpret segments according to *strongest positive interest* first.

* **Segment 4** stands out strongly for its interest in R and statistical/data analyses. With roughly 25% of the sample, we can consider it to be a “hardcore Quant” or “data science in UX” segment.
    
* After that, **Segment 5** stands out as a “Core UX” segment, strongly interested in classic mixed-methods UX issues, and not very interested in the stats topics.
    
* **Segments 1 and 3** are quite similar, where both are interested in survey methods and choice models. The difference is that Segment 3 is also interested in core UX topics whereas Segment 1 is interested in pricing methods and more advanced choice models. Without looking at the data, my guess is that Segment 1 reflects quant researchers working in *marketing*, whereas Segment 3 has folks with similar skills and interests but working in *UX*.
    
* Finally, **Segment 2** is not interested in much … except a segmentation class. That’s fine and is a common pattern — respondents who take the chance to communicate strongly about the one thing they want the most.
    

I will comment about *one thing I almost always see in segmentation projects … but do NOT see in these data*: there is often a segment that shows almost **no differentiation** of interest in anything. Such a segment is common in consumer data, and quite likely reflects low quality respondents, speeders, and/or those for whom the survey is inappropriate. It often gives a segment with 30% or more of the sample (I’ve seen as high as 60%). However, the present data come from an especially engaged and high quality panel (Quant UX conference attendees!) so they do not have that problem.

Now, is it actionable? What would we do from a business perspective with these data? From the point of view of the Quant UX Association offering classes, here are some thoughts:

* The results **reinforce** the idea (from Post 1) that — although the *average* interest in (for example) an **R class** is low, there is a core group of people (Segment 4) who are strongly interested. Now we have a good estimate for that interest, as a segment: 24%
    
* The results also reinforce the interest in a **Segmentation** class. It has strong interest for every segment except Segment 5 … adding up to 90% of the sample.
    
* After that, there is clear interest in **Choice Models** and related topics — in roughly 2nd or 3rd place for three segments — adding up to more than 70% of the sample.
    
* *Although* the size of **Segment 5 is small** (10%), its interest in **UX metrics** topics is extremely clear and crisp. That makes a nice possibility for a clear offering … if the Quant UX Association is able to reach the smallish group in a reasonable way.
    

---

### Visualizing Individual Distributions within a Segment

Did you [read post 1](https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages) in this series? Do you recall the chart of the *individual distribution* of interest for each class? (That was a density chart for each item.)

We can reuse that plot function now, and ask a similar question: *within a particular segment, what is the distribution of interest*? Is everyone in the segment alike, or are there strong differences?

Why would we ask that? For two reasons: **to understand the details** of the segment better, and **to gain additional insight before we take action**. For example, suppose we want to offer an R class and we believe that would target the 24% of respondents who fall into Segment 4. We might design the class or message it using insight about the *other* classes of interest to that segment (such as choice models and psychometrics).

However, if we design or message the class in that way, we would be well-advised not to choose something **polarizing**. It could be that some topic scores high on average with the segment and yet that is driven by a small group. We might be served better by choosing something that is lower on average but is more broadly acceptable. The distribution plot can help us to evaluate such questions.

As presented in [Post #1](https://quantuxblog.com/individual-scores-in-choice-models-part-1-data-averages) in this series, here is the function to make an individual-level distribution chart. This is just a repeat from that post (*see there for explanation*):

```r
cbc.plot <- function(dat, itemCols=3:ncol(dat), 
                     title = "Preference estimates: Overall + Individual level", 
                     meanOrder=TRUE) {
  
  # get the mean points so we can plot those over the density plot
  mean.df <- lapply(dat[ , itemCols], mean)
  
  # melt the data for ggplot
  library(reshape2)
  #                       vvvv  assumes Sawtooth order;       vvv  (ID in col 1, remove RLH in col 2)
  plot.df <- melt(dat[, c(1, itemCols)], id.vars=names(dat)[1])
  
  # get the N of respondents so we can set an appropriate level of point transparency
  p.resp  <- length(unique(plot.df[ , 1]))
  
  # optionally and by default order the results not by column but by mean value
  # because ggplot builds from the bottom, we'll reverse them to put max value at the top
  # we could use fct_reorder but manually setting the order is straightforward in this case
  if (meanOrder) {
    plot.df$variable <- factor(plot.df$variable, levels = rev(names(mean.df)[order(unlist(mean.df))]))
  }
  
  #### Now : Build the plot
  # set.seed(ran.seed)   # optional; points are jittered; setting a seed would make them exactly reproducible
  library(ggplot2)
  library(ggridges)
  
  # build the first layer with the individual distributions
  p <- ggplot(data=plot.df, aes(x=value, y=variable, group=variable)) +
    geom_density_ridges(scale=0.9, alpha=0, jittered_points=TRUE,
                        rel_min_height=0.005,
                        position="points_sina",
                        # set individual point alphas in inverse proportion to sample size
                        point_color = "blue", point_alpha=1/sqrt(p.resp),
                        point_size=2.5) +
    # reverse y axis to match attribute order from top
    scale_y_discrete(limits=rev) +                                                   
    ylab("Level") + xlab("Relative preference (blue=individuals, red=average)") +
    ggtitle(title) +
    theme_minimal()
  
  # now add second layer to plot with the means of each item distribution
  for (i in 1:length(mean.df)) {
    if (meanOrder) {
      # if we're drawing them in mean order, get the right one same as above
      p <- p + geom_point(x=mean.df[[rev(order(unlist(mean.df)))[i]]], 
                          y=length(mean.df)-i+1, colour="tomato",      # adjust y axis because axis is reversed above
                          alpha=0.5, size=2.0, shape=0, inherit.aes=FALSE)
      
    } else {
      p <- p + geom_point(x=mean.df[[i]], 
                          y=length(mean.df)-i+1, colour="tomato",      # adjust y axis because axis is reversed above
                          alpha=0.5, size=2.0, shape=0, inherit.aes=FALSE)
    }
  }
  p
}
```

We can reuse that function exactly as is, except we plot only the respondents in Segment 4 by subsetting the data appropriately (`md.segs == 4`):

```r
cbc.plot(md.dat[md.segs == 4, ], itemCols = classCols) + 
  ylab("Quant Course Offering") +
  ggtitle("Class interest: Segment 4 Only (R)")
```

The result:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1729725989889/682385fe-2f21-433f-9657-9e250fb18dd6.png align="center")

In this chart we see the strong cluster for the R programming interest again, not only as the #1 item on average but showing a strong group of individuals with exceptional interest on the right hand side of the distribution.

After that, **Choice Modeling** appears to be a good secondary emphasis — not only is it #2 in position, but the distribution is tightly grouped and **non-polarizing**. Every respondent had interest greater than 0 (the “anchor” position for positive interest). Psychometrics, Causal Models, and Advanced Choice are all similar.

Farther down, we see that Metrics Sprints, Surveys, and Pricing are fairly highly **polarizing** within this segment: there are a few people with strong interest but also those with strong disinterest.

From a business perspective, that reinforces the idea that we might offer an R class with a secondary emphasis on choice modeling (*like this post!*) or psychometrics. And it suggests to avoid a few topics that are polarizing; they would lack appeal to too many participants (assuming our strategy is to go after the general audience).

Again, **the key point of such deep dives is to understand the preference structure in more detail, we we can consider actions** in design and/or promotion of our offerings. In this case, it could help with constructing the content of a class, with targeting respondents for interest in it, and describing it.

*This is a good time to add one important caveat* to all of this, an obvious point to researchers and yet one that is easy to forget: *such results come from a particular sample, and we have to limit our interpretation to that sample (or its implied population)*. In this case, the results come from attendees of Quant UX Con … and Quant UX Con promoted many of the same topics (R, choice models, metrics, and the like). So to some extent it is no surprise that this sample would have these patterns of interest! Yet, at the same time, the data deepen our understanding of the sample, and our confidence in potential strategic actions.

---

### But Again: There is no Magic Segmentation

Let’s take a look again at a fact I mentioned earlier: that there are hundreds of clustering methods — multiplied by dozens or even hundreds of ways to apply each of them. For example, with each one we might select among multiple distance metrics for observation similarity or multiple way to choose a number of clusters.

To take one example, we have already seen the `mclust` package that finds model-based multivariate normal mixtures (i.e., clusters). As we saw above, it suggests that the data represent 1 cluster. However, we can force it to fit a 5 cluster model, by setting the parameter `G`:

```r
# we could force a 5-segment solution (for example) in Mclust
library(mclust)
md.mc <- Mclust(md.dat[ , classCols], G=5)
```

Now we can compare the solution from `mclust` to the solution from `NbClust` above:

```r
# compare the mclust solution to the previous NbClust solution
table(md.mc$classification, md.segs)
```

Here is the comparison table, which presents `NbClust` segment assignments in the columns, cross-tabbed by `mclust` assignments in the rows:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1729815694693/e50726bb-8af8-4b21-b49e-d41838225971.png align="center")

We can’t compare the segment numbers — such as Segment 1 in each solution — because the assignment labels are nominal and have no order or meaning in themselves. For example, most of the respondents (26/31) who are labeled as “Segment 5” in `NbClust` fall into the group called “Segment 4” in the `mclust` solution.

However, if it were the case that only the *labels* were different, then all of the respondents in one column would fall into a single row. When they don’t, then it says that respondents assigned to one group in `NbClust` fall into multiple groups in `mclust`. In other words, the two methods find different groupings of respondents.

With that perspective, **we can easily see signs of substantial disagreement between the two 5-cluster solutions**. In Segments 1, 2, and 3 from `NbClust` — and segments 2, 3, and 5 from `mclust` — fewer than 50% of the observations from any segment are assigned to any single group in the other solution. In other words, the solutions disagree on a *majority* of segment assignments.

We might compute agreement by hand. Here’s one way. Column 4 maps best to the label of “3” in the other solution, capturing 55 respondents. Column 5 maps to “4” (26 respondents). Column 3 *would* map to “3” but that is already taken by Column 4, so we’ll map it to the second best assignment, “2” (N=21). That leaves Column 1 mapping to “5” (N=24) and Column 2 mapping to “1” (N=23). Adding them up, (55+26+21+24+23) / 308 = **48% raw agreement**. That tells us something, but it would be better to adjust that with respect to chance (for example, if one cluster is large, then agreement could be “high” just by assigning all 100% of respondents to it!)

The **adjusted Rand index** is one method to determine congruence between cluster solutions, relative to base rate. We check that:

```r
# Rand index for the degree of agreement between them
adjustedRandIndex(md.mc$classification, md.segs)
```

The answer is ARI = 0.186 … which means that **the 5-segment solutions from** `NbClust` **and** `mclust` **agree only slightly better than random chance** (about “18% better than random” to use a heuristic explanation of the index). Put differently, the two statistical models often assign respondents to quite different segments.

We can also draw a UMAP plot with the new `mclust` segment assignments:

```r
# UMAP plot of the mclust assignments
set.seed(98107)
# the same segments but in random order
umap.dat$mclustSeg <- factor(md.mc$classification)
# plot those
ggplot(umap.dat, aes(x = X1, y = X2, colour = mclustSeg)) +
  geom_point() +
  xlab("High order dimension 1") + ylab("High order dimension 2") +
  theme_minimal()+
  ggtitle("Dimensional map with mclust segments")
```

Here’s the new plot:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1729816624402/18ad8954-9a09-488a-a498-f0ac8d29b22a.png align="center")

The result shows segments that are much less cleanly separated than the `NbClust` segments, with respect to the UMAP dimensions that we plotted above. It has clear clusters, but not as clear as the chart above. Segment 3 is relatively separate from the others, but segments 1, 2, 4, and 5 have a lot of mixing with near neighbors assigned to other segments.

**That doesn’t mean that** `NbClust` **is better than** `mclust` … because UMAP dimensions are not the only method to assess segment clarity or utility. **It only shows that the two solutions are different** … *because there is no single, canonical, or “best” segmentation method*.

That’s not the end of the story, however! We need to look at the implications.

---

### And Yet It May Be Useful

**Beyond individual respondent assignment, there is a more important question**: *do the strategic implications differ* between the two solutions? Let’s look at a heat map for the new `mclust` segments; our reusable function above makes it easy; we just pass those segment assignments to it (`md.mc$classification`):

```r
# plot the 5-segment mclust solution as a heat map
seg.heat(md.dat[ , classCols], md.mc$classification)
```

Here’s the plot:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1729894992149/706111a8-af1e-45d9-8fcb-e9573fb51505.png align="center")

To interpret it, I have copied below the comments I made above about the `NbClust` segments … and added commentary that compares each segment to the new, and — on an *individual level* — substantially different `mclust` segment:

* **Segment 4** \[NbClust\] stands out strongly for its interest in R and statistical/data analyses. With roughly 25% of the sample, we can consider it to be a “hardcore Quant” or “data science in UX” segment. **\===&gt;** *That aligns with Segment 3* in the new solution, although the segment is somewhat larger (34%)
    
* After that, **Segment 5** \[NbClust\] stands out as a “Core UX” segment, strongly interested in classic mixed-methods UX issues, and not very interested in the stats topics. **\===&gt;** *That aligns with Segment 4* in the new solution, similarly sized.
    
* **Segments 1 and 3** \[NbClust\] are quite similar, where both are interested in survey methods and choice models. Segment 3 is interested in core UX topics whereas Segment 1 is interested in pricing methods and more advanced choice models. **\===&gt;** *These are generally similar to Segments 2 and 5* in the new solution.
    
* Finally, **Segment 2** \[NbClust\] is not interested in much … except a segmentation class. That’s fine and is a common pattern — respondents who take the chance to communicate strongly about the one thing they want the most. **\===&gt;** *That looks a lot like Segment 1* in the new solution.
    

In short, although the segmentation solutions differ substantially according to which *respondents* they places into each segment, **the two methods ended up finding quite similar segments from a strategic point of view**! The business implications — e.g., for R/stats classes, high interest in segmentation, a core UX segment, and so forth — appear to be nearly identical between the two solutions. That happy result reassures us about the plausibility and statistical stability of the solutions.

Put differently, although the assignment of each *respondent* to a segment is uncertain here and often disagrees, the segment *averages* end up being quite similar between the two methods. That is a general observation in choice model data (and TBH it also elates to a bedrock principle of statistics: we do stats to understand a population better by sampling and averaging! \[among other ways\])

However, I will add a caution: **the strong similarity of the results here is an N=1 occurrence**. *There is no guarantee that two segmentation methods will give such similar results*. In fact, given the low agreement we see for individual respondent assignment, **my general presumption is that any two clustering solutions are somewhat likely to disagree in their strategic implications**. We must do work to get beyond that, through careful interpretation and iteration, and by confirming and triangulating results!

---

### Conclusion and a Final Warning

One point I have not yet made is this: **an extremely common problem in cluster analysis is using far too many variables**. In this post, I used 14 variables and I think that’s a perfectly reasonable size. It helps that they come from the same kind of observation (choice model) and estimation method (HB utilities). Yet many analysts attempt to cluster data sets with 50, 100, or 300 variables per respondent.

The problem with that is that **clustering works by evaluating how similar respondents are to one another … and with so many variables, no one is similar!** It is crucial to prune down the set of “basis variables” to be both *few* and of *known importance*. That’s a topic for more extended discussion.

Meanwhile, I hope this post demonstrates that:

1. It is quite possible to **find useful segments** in data
    
2. … **even when** a statistical model claims there are “no segments”
    
3. … while using **general methods** that work across many kinds of data
    
4. … and that provide good opportunities to **cross-check our work** as we go
    
5. … while building **reusable functions** along the way
    
6. … as long as we are cautious and **avoid common pitfalls**.
    

I hope you’ve found this interesting and helpful for your own analyses!

---

### Learning More

If you’d like to learn more about segmentation, here are some places to start:

* For **general segmentation**, see the chapters on clustering and classification in the [R book](https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155) and the [Python book](https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194).
    
* … or sign up for a **class**! In January 2025 there will be an [**online Segmentation Class** from the Quant UX Association](https://events.ringcentral.com/events/segmentation-master-class-jan2025) taught by surveys expert Keith Chrzan
    
* To learn about **choice models** in general, see discussion of conjoint analysis in the [R book](https://www.amazon.com/Marketing-Research-Analytics-Use/dp/3030143155) and the [Python book](https://www.amazon.com/Python-Marketing-Research-Analytics-Schwarz/dp/3030497194), and of MaxDiff in the Quant UX book.
    
* … or join me for an upcoming offering of the [Choice Modeling Master Class](https://www.quantuxcon.org/classes)!
    
* For **latent class models** specific to conjoint analysis and MaxDiff, check out many papers over the years in the [archives of the Sawtooth Software Conference](https://sawtoothsoftware.com/resources/events/conferences) (scroll down for the PDF archives by year).
    

---

### All the Code

As always, here is all of the R code from this post compiled in one place. You can grab it with the *copy* icon and paste into your favorite R code editor.

```r
##### Post 3 - Segmentation (clustering)

# get the data; repeating here for blog post 3, see post 1 for details
library(openxlsx)
md.dat <- read.xlsx("https://quantuxbook.com/misc/QUX%20Survey%202024%20-%20Future%20Classes%20-%20MaxDiff%20Individual%20raw%20scores.xlsx")   # 
md.dat$Anchor <- NULL
names(md.dat)[3:16] <- c("Choice Models",  "Surveys",       "Log Sequences",    "Psychometrics", 
                         "R Programming",  "Pricing",       "UX Metrics",       "Bayes Stats",
                         "Text Analytics", "Causal Models", "Interviewer-ing",  "Advanced Choice", 
                         "Segmentation",   "Metrics Sprints")
classCols <- 3:ncol(md.dat)    # generally, Sawtooth exported utilities start in column 3


##### 5. Number of segments?

# mclust uses model-based fit estimates to determine best number of latent clusters
library(mclust)
md.mc <- Mclust(md.dat[ , classCols])
summary(md.mc)     # best answer is only 1 "segment", multivariate normal within it

# initial visualization with UMAP

library(umap)
# get umap representation of the individual interest data
umap.config <- umap.defaults
umap.config$random_state <- 98101

umap.sol <- umap(md.dat[ , classCols], config = umap.config)
umap.dat <- data.frame(umap.sol$layout)
# review the kind of data we have
str(umap.dat)

# plot it
ggplot(umap.dat, aes(x = X1, y = X2)) +
  geom_point() +
  xlab("High order dimension 1") + ylab("High order dimension 2") +
  theme_minimal()


##### 6. Find number of clusters and initial solution

library(NbClust)
md.nbc <- NbClust(md.dat[ , classCols],
                  min.nc = 2, max.nc = 10,
                  method = "ward.D", 
                  distance = "euclidean")

# we'll try a 5 cluster solution instead of 2
md.nbc <- NbClust(md.dat[ , classCols],
                  min.nc = 5, max.nc = 10,
                  method = "ward.D", 
                  distance = "euclidean")

md.segs <- md.nbc$Best.partition
table(md.segs)
proportions(table(md.segs))


##### 7. Plot umap with segment membership coded

# add the 5-segment membership to the UMAP data
umap.dat$Segment <- factor(paste0("S", md.segs))

# plot it with segments
ggplot(umap.dat, aes(x = X1, y = X2, colour = Segment)) +
  geom_point() +
  xlab("High order dimension 1") + ylab("High order dimension 2") +
  theme_minimal() +
  ggtitle("Dimensional map with 5 segments")

# make sure we're not fooling ourselves ...
# what would it look like if the segments were random?
set.seed(98107)
# the same segments but in random order
umap.dat$RndSeg <- sample(umap.dat$Segment)
# plot those
ggplot(umap.dat, aes(x = X1, y = X2, colour = RndSeg)) +
  geom_point() +
  xlab("High order dimension 1") + ylab("High order dimension 2") +
  theme_minimal()+
  ggtitle("Dimensional map with RANDOM segments")


##### 8. Plot heat map for segments

# function to plot average values by segment
seg.heat <- function(dat, segs) {
  library(superheat)
  # aggregate mean value by segment
  heat.sum <- data.frame(t(aggregate(. ~ segs, data=dat, mean))[-1, ])
  # make the column names reflect the segment number and size (%)
  names(heat.sum) <- paste0("S", 1:max(segs), " (", 
                            round(prop.table(table(segs)), 2)*100, "%)")
  # draw it with superheat
  superheat(heat.sum, 
            heat.pal = c("red3", "white", "green3"),
            grid.hline.col = "white", grid.vline.col = "white",
            pretty.order.rows = TRUE,
            clustering.method = "hierarchical",
            X.text = round(as.matrix(heat.sum), 2),  
            X.text.size = 4, 
            left.label.size = 0.3,
            left.label.text.size = 4,
            bottom.label.text.size = 4,
            legend = FALSE)
}

# plot the segmentation solution
seg.heat(md.dat[ , classCols], md.segs)


##### 9. Plot distribution of interest *within* a segment
# check the data briefly ; not discussed in the blog post
summary(md.dat[md.segs==4, ])

# reuse our plot function -- copied here from the part 1 post
# see the part 1 blog post for an explanation of this function
cbc.plot <- function(dat, itemCols=3:ncol(dat), 
                     title = "Preference estimates: Overall + Individual level", 
                     meanOrder=TRUE) {
  
  # get the mean points so we can plot those over the density plot
  mean.df <- lapply(dat[ , itemCols], mean)
  
  # melt the data for ggplot
  library(reshape2)
  #                       vvvv  assumes Sawtooth order;       vvv  (ID in col 1, remove RLH in col 2)
  plot.df <- melt(dat[, c(1, itemCols)], id.vars=names(dat)[1])
  
  # get the N of respondents so we can set an appropriate level of point transparency
  p.resp  <- length(unique(plot.df[ , 1]))
  
  # optionally and by default order the results not by column but by mean value
  # because ggplot builds from the bottom, we'll reverse them to put max value at the top
  # we could use fct_reorder but manually setting the order is straightforward in this case
  if (meanOrder) {
    plot.df$variable <- factor(plot.df$variable, levels = rev(names(mean.df)[order(unlist(mean.df))]))
  }
  
  #### Now : Build the plot
  # set.seed(ran.seed)   # optional; points are jittered; setting a seed would make them exactly reproducible
  library(ggplot2)
  library(ggridges)
  
  # build the first layer with the individual distributions
  p <- ggplot(data=plot.df, aes(x=value, y=variable, group=variable)) +
    geom_density_ridges(scale=0.9, alpha=0, jittered_points=TRUE,
                        rel_min_height=0.005,
                        position="points_sina",
                        # set individual point alphas in inverse proportion to sample size
                        point_color = "blue", point_alpha=1/sqrt(p.resp),
                        point_size=2.5) +
    # reverse y axis to match attribute order from top
    scale_y_discrete(limits=rev) +                                                   
    ylab("Level") + xlab("Relative preference (blue=individuals, red=average)") +
    ggtitle(title) +
    theme_minimal()
  
  # now add second layer to plot with the means of each item distribution
  for (i in 1:length(mean.df)) {
    if (meanOrder) {
      # if we're drawing them in mean order, get the right one same as above
      p <- p + geom_point(x=mean.df[[rev(order(unlist(mean.df)))[i]]], 
                          y=length(mean.df)-i+1, colour="tomato",      # adjust y axis because axis is reversed above
                          alpha=0.5, size=2.0, shape=0, inherit.aes=FALSE)
      
    } else {
      p <- p + geom_point(x=mean.df[[i]], 
                          y=length(mean.df)-i+1, colour="tomato",      # adjust y axis because axis is reversed above
                          alpha=0.5, size=2.0, shape=0, inherit.aes=FALSE)
    }
  }
  p
}

# use our plot function with only the Segment 4 data
cbc.plot(md.dat[md.segs==4, ], itemCols=classCols) + 
  ylab("Quant Course Offering") +
  ggtitle("Class interest: Segment 4 Only (R)")

##### 10. There is no magic answer in clustering

# we could force a 5-segment solution (for example) in Mclust
library(mclust)
md.mc <- Mclust(md.dat[ , classCols], G=5)

# compare the mclust solution to the previous NbClust solution
table(md.mc$classification, md.segs)
# Rand index for the degree of agreement between them
adjustedRandIndex(md.mc$classification, md.segs)

# UMAP plot of the mclust assignments
set.seed(98107)
# the same segments but in random order
umap.dat$mclustSeg <- factor(md.mc$classification)
# plot those
ggplot(umap.dat, aes(x = X1, y = X2, colour = mclustSeg)) +
  geom_point() +
  xlab("High order dimension 1") + ylab("High order dimension 2") +
  theme_minimal()+
  ggtitle("Dimensional map with mclust segments")

# plot the 5-segment mclust solution as a heat map
seg.heat(md.dat[ , classCols], md.mc$classification)

# and plot the individual distributions for the "R" segment, similar to above
# not discussed in the post; just illustrating the reusable plot function :)
cbc.plot(md.dat[md.mc$classification==3, ], itemCols=classCols) + 
  ylab("Quant Course Offering") +
  ggtitle("Class interest: Mclust R Segment Only")
```

[![](https://cdn.hashnode.com/res/hashnode/image/upload/v1746999945541/9d224843-9e9f-44cc-98c5-276915794420.png align="center")](https://notbyai.fyi)
