Tips for Using Cluster Analysis to Segment Markets (CASS)

Posted on November 17, 2010


When segmenting markets, the objective is to find distinct groups of consumers who are similar to each other on multiple variables of interest.  If that lofty goal is realized, then products and marketing programs can be designed to appeal to desirable consumer groups, so that when consumers choose between you and your competitors, what your company offers will be a better fit in their eyes.

– Cluster Analysis groups observations based on their similarity across multiple variables. (That sounds “kind of” like the goal.)

– Determine what similarities are useful.
– Adjust for different measurement scales. (Prices from $12,000 to $25,000 represent 13,000 measurement units, while a 1–7 agree/disagree survey question represents 7 units of measurement.)
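
To see why the scale mismatch matters, here is a minimal sketch (with hypothetical respondent values) of what Euclidean distance does when a price variable and a 1–7 survey question sit side by side:

```python
import math

# Two hypothetical respondents: (price paid, 1-7 agreement score)
a = (12000, 1)
b = (25000, 7)

# Raw Euclidean distance is dominated by the price axis:
price_gap = b[0] - a[0]   # 13000 measurement units
survey_gap = b[1] - a[1]  # 6 measurement units
dist = math.hypot(price_gap, survey_gap)

# dist is about 13000: the survey answer barely registers,
# even though these respondents sit at opposite ends of its scale.
print(round(dist, 2))
```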

A lot of people’s knee-jerk reaction to different measurement scales is to standardize the data to mean zero and standard deviation one. That changes things, but does it fix anything?  Now the original distribution of $12,000–$25,000 is about equal to a seven-point survey question. Is that appropriate? Maybe, but I could also imagine scenarios where the willingness to pay double the price might be worth 9 or 10 times more than agreement to a survey question.  In effect, the scale problem and the usefulness problem are inextricably linked.

The second problem with standardization is that the data that WAS on the same scale is now corrupted. A survey question showing large consumer differences across the whole 7-point range is deflated, while a question that all respondents answered as either 6 or 7 is inflated, so that both end up with a standard deviation of one. Consequently, the procedure blurs a lot of the consumer distinctiveness you set out to look for.
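
A quick sketch of that corruption, using two made-up survey questions: one spans the full 1–7 range, the other is all 6s and 7s, yet z-scoring forces both to the same spread.

```python
import statistics

def standardize(xs):
    """Z-score: subtract the mean, divide by the sample standard deviation."""
    m, s = statistics.mean(xs), statistics.stdev(xs)
    return [(x - m) / s for x in xs]

# Question A uses the whole 1-7 range; question B is all 6s and 7s.
qa = [1, 2, 3, 4, 5, 6, 7]
qb = [6, 7, 6, 7, 6, 7, 6]

za, zb = standardize(qa), standardize(qb)

# After standardizing, both questions have a standard deviation of 1,
# so the tiny 6-vs-7 wiggles in B now count exactly as much as the
# full-range differences in A.
print(round(statistics.stdev(za), 3), round(statistics.stdev(zb), 3))
```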

Another common practice is to run a Factor or Correspondence Analysis prior to the Cluster Analysis. Many of these procedures standardize the data automatically (in SAS the input is standardized unless you specify otherwise), but apart from that there is a second problem. Some of the factors represent large sources of variation (a.k.a. differences between consumers), while others represent trivial ones. Cluster Analysis just sees the units of measurement that you give it, so if you do not rescale the factors in accordance with their importance, this procedure will also wipe away data patterns you’d like to identify.
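
One way to rescale along these lines is to multiply each factor's scores by how much variation it actually represents before clustering. The scores and explained-variance shares below are hypothetical, and the weights themselves are a judgment call:

```python
# Hypothetical factor scores for three respondents on two factors,
# already standardized to unit variance by the factor-analysis routine.
scores = [[1.2, -0.3],
          [-0.8, 1.1],
          [-0.4, -0.8]]

# Suppose factor 1 explained 60% of the variance and factor 2 only 10%.
explained = [0.60, 0.10]

# Rescale each factor so its spread reflects its importance again,
# instead of the equal footing that standardization imposed.
weighted = [[s * w for s, w in zip(row, explained)] for row in scores]
print(weighted)
```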

Unlike many other statistical techniques, there is nothing “unbiased” about using all the data you have. Let’s say you’ve got one pricing variable from your database and twelve survey questions about different advertising themes. Cluster Analysis measures the distances you give it, so twelve measures of advertising implicitly contribute about twelve times more distance than one measure of price. Moreover, irrelevant data sums together just as easily as relevant data, so including everything available is likely to reduce the separation of more important concepts.
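
The block-size effect is easy to demonstrate. In this sketch every variable is already on the same scale and every gap between two respondents is identical, yet the twelve-variable advertising block still swamps the single price variable:

```python
# One (standardized) price variable vs. twelve (standardized) advertising
# variables, with an identical gap of 1.0 on every variable.
price_gap = 1.0
ad_gaps = [1.0] * 12

# Squared-distance contributions to a Euclidean distance:
price_contrib = price_gap ** 2
ad_contrib = sum(g ** 2 for g in ad_gaps)

print(ad_contrib / price_contrib)  # -> 12.0: advertising dominates 12-to-1
```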

Although there are several issues to watch for, all of the above problems can be solved by scaling the data according to its importance to your goals.  It’s very much akin to creating a dependent variable.  You could over- or under-weight something, by accident or on purpose, but if the scaling is done in good faith, the results should move you a lot closer to your goals.
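
A minimal sketch of that importance-based scaling: divide each variable by its own span of measurement units, then multiply by a judgmental importance weight before computing cluster distances. The spans come from the earlier example; the weights are hypothetical and would be set by you.

```python
def cass_scale(row, spans, weights):
    """Convert raw values into importance-weighted units:
    divide out each variable's own span, then apply its weight."""
    return [v / span * w for v, span, w in zip(row, spans, weights)]

raw = [18000, 5]       # price, 1-7 survey answer
spans = [13000, 7]     # measurement units for each variable
weights = [2.0, 1.0]   # judgment call: price matters twice as much

print(cass_scale(raw, spans, weights))
```

The weights play the same role a dependent variable would in a supervised method: they encode what "similar" should mean for your goals.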

In Cluster Analysis, the science takes you part way, but it reminds me of one of those “mathematical word problems” from the 5th grade: it’s up to you to pick which numbers have to be added together to get the desired answer.  I call the approach described here “Concept Availability Scaled Segmentation” (CASS). I hope you find it useful.