Best–worst scaling explained

Best–worst scaling should not be confused with MaxDiff.

Best–worst scaling (BWS)^[1] techniques involve choice modelling (or discrete choice experiment – "DCE") and were invented by Jordan Louviere in 1987 while on the faculty at the University of Alberta. In general with BWS, survey respondents are shown a subset of items from a master list and are asked to indicate the best and worst items (or most and least important, or most and least appealing, etc.). The task is repeated a number of times, varying the particular subset of items in a systematic way, typically according to a statistical design. Analysis is typically conducted, as with DCEs more generally, assuming that respondents make choices according to a random utility model (RUM). RUMs assume that an estimate of how much a respondent prefers item A over item B is provided by how often item A is chosen over item B in repeated choices. Thus, choice frequencies estimate the utilities on the relevant latent scale. BWS essentially aims to provide more choice information at the lower end of this scale without having to ask additional questions that are specific to lower ranked items.

History

Louviere attributes the idea to the PhD thesis of Anthony A. J. Marley, who together with Duncan Luce produced, in the 1960s, much of the ground-breaking research in mathematical psychology and psychophysics to axiomatise utility theory. Marley had encountered problems axiomatising certain types of ranking data and speculated in the discussion of his thesis that examination of the "inferior" and "superior" items in a list might be a fruitful topic for future research. The idea then languished for three decades until the first working papers and publications appeared in the early 1990s. The definitive textbook describing the theory, methods and applications was published in September 2015 (Cambridge University Press) by Jordan Louviere (University of South Australia), Terry N. Flynn (TF Choices Ltd.) and Anthony A. J. Marley (University of Victoria and University of South Australia). The book brings together the disparate research from various academic and practical disciplines, in the hope of avoiding duplicated effort and mistakes in implementation. The three authors have (individually and together) already published many of the key academic peer-reviewed articles describing BWS theory,^[2] ^[3] ^[4] practice,^[5] ^[6] and a number of applications in health,^[5] social care,^[7] marketing,^[6] transport, voting,^[8] and environmental economics.^[9] However, the method has now become popular in the wider research and practitioner communities, with other researchers exploring its use in areas as diverse as student evaluation of teaching,^[10] marketing of wine,^[11] quantification of concerns over ADHD medication,^[12] the importance of environmental sustainability,^[13] and priority-setting in genetic testing.^[14]

Purposes

There are two different purposes of BWS – as a method of data collection, and/or as a theory of how people make choices when confronted with three or more items. This distinction is crucial, given the continuing misuse of the term maxdiff to describe the method. As Marley and Louviere note, maxdiff is a long-established academic mathematical theory with very specific assumptions about how people make choices: it assumes that respondents evaluate all possible pairs of items within the displayed set and choose the pair that reflects the maximum difference in preference or importance.

As a theory of process (theory of decision-making)

Consider a set in which a respondent evaluates four items: A, B, C and D. If the respondent says that A is best and D is worst, these two responses inform us about five of six possible implied paired comparisons:

A > B, A > C, A > D, B > D, C > D

The only paired comparison that cannot be inferred is B vs. C. In a choice among five items, best–worst questioning informs on seven of ten implied paired comparisons. Thus BWS may be thought of as a variation of the method of paired comparisons.
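The counting argument above can be checked with a short sketch. The function names are illustrative only and do not come from any BWS software; the sketch simply enumerates the comparisons implied by a single best–worst answer.

```python
from itertools import combinations

def implied_pairs(items, best, worst):
    """Return the (preferred, less preferred) pairs implied by one best-worst answer."""
    pairs = set()
    for item in items:
        if item != best:
            pairs.add((best, item))       # best beats every other item
        if item != best and item != worst:
            pairs.add((item, worst))      # every middle item beats worst
    return pairs

def total_pairs(items):
    """Number of unordered pairs that can be formed from the displayed items."""
    return len(list(combinations(items, 2)))

four = ["A", "B", "C", "D"]
known = implied_pairs(four, best="A", worst="D")
print(len(known), "of", total_pairs(four))   # 5 of 6

five = ["A", "B", "C", "D", "E"]
print(len(implied_pairs(five, "A", "E")), "of", total_pairs(five))   # 7 of 10
```

In general a single best–worst answer among n items reveals 2n - 3 of the n(n-1)/2 possible paired comparisons.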

Yet respondents can produce best-worst data in any of a number of ways. Instead of evaluating all possible pairs (the maxdiff model), they might choose the best from n items, the worst from the remaining n-1, or vice versa. Or indeed they may use another method entirely. Thus it should be clear that maxdiff is a subset of BWS. The maxdiff model has proved to be useful in proving the properties of a number of estimators in BWS. However, its realism as a description of how humans might actually provide best and worst data can be questioned for the following reason. As the number of items increases, the number of possible pairs increases in a multiplicative fashion: n items produce n(n-1) ordered pairs (where best-worst order matters). To assume that respondents do evaluate all possible pairs is a strong assumption and in 14 years of presentations, the three co-authors have virtually never found a course or conference participant who admitted to using this method to decide their best and worst choices. Virtually all admitted to using sequential models (best then worst or worst then best).^[15]
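The difference between the two accounts can be illustrated with a minimal sketch. It assumes one common logit parameterisation of each model: a maxdiff probability proportional to exp(u_best - u_worst) over all ordered pairs, and a sequential model that applies a logit choice of best followed by a logit choice of worst among the remaining items. The utility values are hypothetical.

```python
import math
from itertools import permutations

def maxdiff_probs(utils):
    """P(best=b, worst=w) proportional to exp(u_b - u_w) over all ordered pairs."""
    weights = {(b, w): math.exp(utils[b] - utils[w])
               for b, w in permutations(utils, 2)}
    total = sum(weights.values())
    return {pair: wt / total for pair, wt in weights.items()}

def best_then_worst_probs(utils):
    """Sequential model: logit choice of best, then logit choice of worst from the rest."""
    probs = {}
    denom_best = sum(math.exp(u) for u in utils.values())
    for b in utils:
        p_best = math.exp(utils[b]) / denom_best
        rest = [i for i in utils if i != b]
        denom_worst = sum(math.exp(-utils[i]) for i in rest)
        for w in rest:
            probs[(b, w)] = p_best * math.exp(-utils[w]) / denom_worst
    return probs

utils = {"A": 1.0, "B": 0.5, "C": 0.0, "D": -1.0}   # hypothetical utilities
md = maxdiff_probs(utils)
seq = best_then_worst_probs(utils)
print(len(md))   # 12 ordered pairs for 4 items, i.e. n(n-1)
```

Both functions return a probability for each of the n(n-1) ordered (best, worst) pairs, but the probabilities generally differ, which is why the choice between the models is an empirical question rather than a matter of terminology.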

Early work (including that of Louviere himself) did use the term maxdiff to refer to BWS, but with the recruitment of Marley to the team developing the method, correct academic terminology has been disseminated throughout Europe and Asia-Pacific (if not North America, which continues to use the maxdiff term). Indeed, it is an open question whether major software manufacturers of discrete choice maxdiff routines actually implement maxdiff models in estimating parameters, despite their continuing advertising of maxdiff capabilities.

As a method of data collection

The second use of BWS is as a method of data collection (rather than as a theory of how humans produce a best and a worst item). BWS can, particularly in the age of web-based surveys, be used to collect data in a systematic way that (1) forces all respondents to provide best and worst data in the same way (by, for instance, asking best first, greying out the chosen option, then asking worst); and (2) enables collection of a full ranking, if repeated BWS questioning is implemented to collect the "inner rankings". In many contexts, BWS for data collection has been regarded merely as a way to obtain such data in order to facilitate data expansion (to estimate conditional logit models with far more choice sets) or to estimate conventional rank ordered logit models.^[16]
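A minimal sketch of this data-collection logic follows. The respondent is simulated with hypothetical preference scores, and the helper names are illustrative. The first function repeats the best-then-worst question on the shrinking set to recover a full ranking; the second "explodes" that ranking into a sequence of pseudo choice sets, as is conventionally done for rank ordered logit models.

```python
def repeated_best_worst(items, pick_best, pick_worst):
    """Peel off best and worst repeatedly to build a full ranking (best first)."""
    remaining = list(items)
    top, bottom = [], []
    while remaining:
        best = pick_best(remaining)
        top.append(best)
        remaining.remove(best)        # "grey out" the chosen option
        if remaining:
            worst = pick_worst(remaining)
            bottom.append(worst)
            remaining.remove(worst)
    return top + bottom[::-1]

def explode(ranking):
    """Expand a ranking into (choice set, chosen item) pseudo-observations."""
    return [(ranking[i:], ranking[i]) for i in range(len(ranking) - 1)]

# Hypothetical respondent: a higher score means more preferred.
scores = {"A": 3, "B": 1, "C": 4, "D": 2, "E": 0}
ranking = repeated_best_worst(
    scores,
    pick_best=lambda shown: max(shown, key=scores.get),
    pick_worst=lambda shown: min(shown, key=scores.get),
)
print(ranking)                 # ['C', 'A', 'D', 'B', 'E']
print(len(explode(ranking)))   # 4 pseudo choice sets from one set of 5 items
```

The expansion treats every rank as if it were produced by the same decision rule, which is the assumption questioned in the discussion of disadvantages.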

Types ("cases")

The renaming of the method, to make clear that maxdiff scaling is BWS but BWS is not necessarily maxdiff, was decided by Louviere in consultation with his two key contributors (Flynn and Marley) in preparation for the book, and was presented in an article by Flynn.^[17] That paper also took the opportunity to make clear that there are, in fact, three types ("cases") of BWS: Case 1 (the "object case"), Case 2 (the "profile case") and Case 3 (the "multi-profile case"). These three cases differ largely in the complexity of the choice items on offer.

Case 1 (the "object case")

Case 1 presents items that may be attitudinal statements, policy goals, marketing slogans or any type of item that has no attribute and level structure. It is primarily used to avoid scale biases known to affect rating (Likert) scale data.^[18] ^[19] It is particularly useful when eliciting the degree of importance or agreement that respondents ascribe from a set of statements and when the researcher wishes to ensure that the items compete with each other (so that respondents cannot easily rate multiple items as being of the same importance).

Case 2 (the "profile case")

Case 2 has predominated in health and the items are the attribute levels describing a single profile of the type familiar to choice modellers. Instead of making choices between profiles, the respondent must make best and worst (most and least) choices within a profile. Thus, for the example of a mobile (cell) phone, the choices would be the most acceptable and least acceptable features of a given phone. Case 2 has proved to be powerful in eliciting preferences among vulnerable groups, such as the elderly,^[20] ^[21] older carers,^[22] and children,^[23] who find conventional multi-profile discrete choice experiments difficult. Indeed, the first comparison of Case 2 with a DCE in a single model found that whilst the vast majority of (older) respondents provided usable data from the BWS task, only around one half did so for the DCE.^[20]

Case 3 (the "multi-profile case")

Case 3 is perhaps the most familiar to choice modellers, being merely an extension of a discrete choice model: the number of profiles must be three or more, and instead of simply choosing the one the respondent would purchase, (s)he chooses the best and worst profile.

Designs for studies

Case 1 BWS studies typically use balanced incomplete block designs (BIBDs). These cause every item to appear the same number of times and also force every item to compete with every other the same number of times. These features are attractive since the respondent is prevented from inferring erroneous information about the items (what items the designer is "really" interested in). They also ensure that there can be no "ties" in importance/salience at the very top or bottom of the scale.
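The two balance properties can be verified mechanically. The sketch below uses a well-known design for seven items in seven sets of three (the Fano plane) as an example; the function names are illustrative and not taken from any design software.

```python
from collections import Counter
from itertools import combinations

def design_counts(blocks):
    """Count how often each item appears and how often each pair co-occurs."""
    appearances = Counter(item for block in blocks for item in block)
    pair_counts = Counter(frozenset(pair)
                          for block in blocks
                          for pair in combinations(sorted(block), 2))
    return appearances, pair_counts

def is_balanced(blocks):
    """True if sets are equal-sized and item and pair frequencies are constant."""
    appearances, pair_counts = design_counts(blocks)
    all_pairs = [frozenset(p) for p in combinations(sorted(appearances), 2)]
    same_size = len({len(block) for block in blocks}) == 1
    same_reps = len(set(appearances.values())) == 1
    # Counter returns 0 for pairs that never co-occur, so gaps are detected.
    same_pairs = len({pair_counts[p] for p in all_pairs}) == 1
    return same_size and same_reps and same_pairs

# Seven items, seven choice sets of three items each.
fano = [(1, 2, 3), (1, 4, 5), (1, 6, 7), (2, 4, 6),
        (2, 5, 7), (3, 4, 7), (3, 5, 6)]
print(is_balanced(fano))        # True: each item appears 3 times, each pair once
print(is_balanced(fano[:-1]))   # False: dropping a set breaks the balance
```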

Case 2 BWS studies can use Orthogonal Main Effects Plans (OMEPs) or efficient designs, although the former has predominated to date.

Case 3 BWS studies may use any of the types of design typically used for a DCE, with the proviso that the number of profiles (alternatives) in a choice set must be three or more for the BWS task to make sense.

Recent history

Steve Cohen introduced BWS to the marketing research world in a paper presented at an ESOMAR Conference in Barcelona in 2002 entitled, "Renewing market segmentation: Some new tools to correct old problems." This paper was nominated for Best Paper at that conference. In 2003 at the ESOMAR Latin America Conference in Punta del Este, Uruguay, Cohen and his co-author, Dr. Leopoldo Neira, compared BWS results to those obtained by rating scale methods. This paper won Best Methodological Paper at that conference. Later the same year, it was selected as winner of the John and Mary Goodyear Award for Best Paper at all ESOMAR Conferences in 2003 and then it was published as the lead article in "Excellence in International Research 2004," published by ESOMAR. At the 2003 Sawtooth Software Conference, Cohen's paper, "Maximum Difference Scaling: Improved Measures of Importance and Preference for Segmentation," was selected as Best Presentation. Cohen and Sawtooth Software president Bryan Orme agreed that MaxDiff should be part of the Sawtooth package and it was introduced later that year. Later in 2004, Cohen and Orme won the David K. Hardin Award from the AMA for their paper which was published in Marketing Research Magazine entitled, "What's your preference? Asking survey respondents about their preferences creates new scaling decisions."

In parallel to this, Emma McIntosh and Jordan Louviere introduced BWS (case 2) to the health community at the 2002 Health Economists' Study Group conference. This prompted the collaboration with Flynn and ultimately the link-up with Marley, who had begun working with Louviere independently to prove the properties of BWS estimators. The popularity of the three cases has largely varied by academic discipline, with case 1 proving popular in marketing and food research, case 2 largely being adopted in health, and case 3 being used across a variety of disciplines that already use DCEs. It was partly this lack of understanding in many disciplines that there are actually three cases of BWS that prompted the three main developers to write the textbook.

The book contains an introductory chapter summarising the history of BWS and the three cases, together with why the researcher must consider whether (s)he wishes to use it to understand theory (processes) of decision-making and/or merely to collect data in a systematic way. Three chapters, one for each case, follow, detailing the intuition and application of each. A chapter bringing together Marley's work proving the properties of the key estimators and laying out some open issues then follows. Finally, nine chapters (three per case) describe applications from a variety of disciplines.

Conducting a study

The basic steps in conducting all types of BWS study are:

Conduct proper qualitative or other research to properly identify and describe all items of interest.^[24]
Construct a statistical design that indicates what items are to be presented in each set of items ("choice set") – designs may come from publicly available catalogues, be constructed by hand, or produced from commercially available software.
Use the design to construct the choice sets, which contain the actual relevant items (textually or visually).
Obtain response data where respondents choose the best and worst from each task; repeat best-worst (to obtain second best, second worst, etc.) may be conducted if the analyst wishes for more data.
Input the data into a statistical software program and analyse. The software will produce utility estimates for each of the items (or attribute levels). In addition to utility scores, raw counts can be requested, which simply sum the number of times each item was selected as best and as worst. The utility estimates indicate the perceived value of each item at the individual level and how sensitive preferences are to changes in the features on offer.
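The raw counts mentioned in the final step can be sketched as follows. The response data are invented for illustration, and the standardised score (best minus worst, divided by the number of times the item was shown) is one common convention rather than the only possible one.

```python
from collections import Counter

def best_worst_scores(responses):
    """Tally best and worst counts per item from (shown_items, best, worst) records."""
    best, worst, shown = Counter(), Counter(), Counter()
    for items, chosen_best, chosen_worst in responses:
        shown.update(items)
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {
        item: {
            "best": best[item],
            "worst": worst[item],
            "score": best[item] - worst[item],
            "std_score": (best[item] - worst[item]) / shown[item],
        }
        for item in shown
    }

# Hypothetical answers from one respondent to four choice sets.
responses = [
    (("A", "B", "C"), "A", "C"),
    (("A", "B", "D"), "A", "D"),
    (("B", "C", "D"), "B", "D"),
    (("A", "C", "D"), "C", "D"),
]
scores = best_worst_scores(responses)
for item in sorted(scores, key=lambda i: scores[i]["score"], reverse=True):
    print(item, scores[item]["score"])
# A 2
# B 1
# C 0
# D -3
```

Because every choice set contributes one best and one worst, the best-minus-worst scores always sum to zero across items.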

Analysis

Estimation of the utility function is performed using any of a variety of methods.

Multinomial discrete choice analysis, in particular multinomial logit (strictly speaking the conditional logit, although the two terms are now used interchangeably). The multinomial logit (MNL) model is often the first stage in analysis and provides a measure of average utility for the attribute levels or objects (depending on the Case).
In many cases, particularly cases 1 and 2, simple observation and plotting of choice frequencies should actually be the first step, as it is very useful in identifying preference heterogeneity and respondents using decision-rules based on a single attribute.
Several algorithms can be used in this estimation process, including maximum likelihood, neural networks, and the hierarchical Bayes model. The hierarchical Bayes model is beneficial because it allows information to be borrowed across respondents, although since BWS often allows the estimation of individual-level models, the benefits of Bayesian models are heavily attenuated. Response time models have recently been shown to replicate the utility estimates of BWS, which represents a major step forward in the validation of stated preferences generally, and BWS preferences specifically.^[25] ^[26]
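As an illustration of the maximum likelihood route, the following minimal sketch fits a logit model with one utility per item to "best" choices by plain gradient ascent. It is a teaching sketch under simplifying assumptions (best choices only, a fixed step size, no standard errors), not the procedure used by any particular package; worst choices are often added in practice by treating them as choices made with the sign of the utilities reversed.

```python
import math

def fit_mnl(observations, n_iter=2000):
    """Fit item utilities to (choice_set, chosen) data by gradient ascent
    on the multinomial logit log-likelihood."""
    items = sorted({i for choice_set, _ in observations for i in choice_set})
    utils = {i: 0.0 for i in items}
    step = 1.0 / len(observations)    # conservative fixed step; an assumption
    for _ in range(n_iter):
        grad = {i: 0.0 for i in items}
        for choice_set, chosen in observations:
            denom = sum(math.exp(utils[i]) for i in choice_set)
            for i in choice_set:
                grad[i] -= math.exp(utils[i]) / denom   # predicted share
            grad[chosen] += 1.0                         # observed choice
        for i in items:
            utils[i] += step * grad[i]
    return utils

# Hypothetical data: A chosen best 6 times, B 3 times, C once, from the same set.
shown = ("A", "B", "C")
obs = [(shown, "A")] * 6 + [(shown, "B")] * 3 + [(shown, "C")]
utils = fit_mnl(obs)
print(round(utils["A"] - utils["C"], 3))   # approaches ln(6), about 1.792
```

Only differences between utilities are identified, so results are reported relative to a reference item. In this example the fitted differences reproduce the log ratios of the choice frequencies, which is the sense in which choice frequencies estimate utilities.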

Advantages

BWS questionnaires are relatively easy for most respondents to understand. Furthermore, humans are much better at judging items at extremes than in discriminating among items of middling importance or preference. And since the responses involve choices of items rather than expressing strength of preference, there is no opportunity for scale use bias.

Respondents find rating scales very easy, but such scales tend to deliver results indicating that everything is "quite important", making the data not especially actionable. BWS, on the other hand, forces respondents to make choices between options, while still delivering rankings showing the relative importance of the items being rated. It also produces:

Distributions of "the scores" (calculated as the best frequency minus the worst frequency) for all items which allow the researcher to observe the empirical distribution of estimated utilities. This produces information on how realistic the results from traditional analysis methods assuming standard continuous distributions are likely to be. Consumers tend to form distinct groups with often very different preferences, giving rise to multi-modal distributions.
Data that allow investigation of the decision rule (functional form of the utility function) at various ranking depths (most simply, the "best decision rule vs the worst decision rule"). Emerging research is suggesting that in some contexts respondents do not use the same rule, which calls into question the use of estimation methods such as the rank ordered logit model.
Estimation of attribute impact, a measure of the overall impact of an attribute upon choices that is not available from conventional discrete choice models.
More data, that allow greater insights into choices, for a given number of choice sets. The same information could be obtained by simply presenting more choice sets but this runs the risk that respondents become bored and disengage with the task.
Quantifying the phenomena of response shift and adaptation to poor health states.

Disadvantages

Best–worst scaling involves the collection of at least two sets of data: at a minimum, first-best and first-worst, and in some cases additional ranks (second best, second worst, etc.). The issue of how to combine these data is pertinent. Early work assumed best was simply the inverse of worst: that respondents had an internal ranking of all items and just chose the highest/lowest ranked item in a given question. More recent work has suggested that in some contexts this is not the case: a person might (for instance) choose according to traditional economic theory for best (trading across attributes) but choose worst using an elimination by aspects strategy (choosing as worst the item that is simply unacceptable on one attribute). In the presence of such different decision rules it becomes impossible to know how to combine the data: at what point does the person, when moving down the rankings, move from "economic trading" to "elimination by aspects"?

This presents a clear problem for the data augmentation motivation for BWS but not necessarily for BWS when used as a way to understand process (decision-making). Psychologists in particular would be interested in the different types of decision-making. Marketers, also, might wish to know if a given product had an unacceptable feature. Work is ongoing to investigate when different decision rules arise, and whether/how data from such different sources may be combined.

BWS also suffers from the same disadvantages as all stated preference (SP) techniques. It is unknown if the preferences are consistent with choices made in the real world (revealed preferences). In some instances revealed preferences (typically real market decisions) are available, providing a test of the BWS choices. In others, quite often health, there are no revealed preference data and validation appears impossible. More recently attempts have been made to validate SP data using physiological data, such as eye-tracking and response times. Early work suggests that response time models are consistent with results from BWS models in health care but more research is required in other contexts.

References

Best-Worst Scaling. Cambridge University Press. 2015-10-01.
Some probabilistic models of best, worst, and best–worst choices. Journal of Mathematical Psychology. 2005-12-01. 464–480. 49. Special Issue Honoring Jean-Claude Falmagne: Part 1. 6. 10.1016/j.jmp.2005.05.003. A. A. J.. Marley. J. J.. Louviere.
Probabilistic models of set-dependent and attribute-level best–worst choice. Journal of Mathematical Psychology. 2008-10-01. 281–296. 52. 5. 10.1016/j.jmp.2008.02.002. A. A. J.. Marley. Terry N.. Flynn. J. J.. Louviere. 10453/8292. free.
Models of best–worst choice and ranking among multiattribute options (profiles). Journal of Mathematical Psychology. 2012-02-01. 24–34. 56. 1. 10.1016/j.jmp.2011.09.001. A. A. J.. Marley. D.. Pihlens.
Best–worst scaling: What it can do for health care research and how to do it. Journal of Health Economics. 2007-01-01. 171–189. 26. 1. 10.1016/j.jhealeco.2006.04.002. 16707175. Terry N.. Flynn. Jordan J.. Louviere. Tim J.. Peters. Joanna. Coast.
An introduction to the application of (case 1) best–worst scaling in marketing research. International Journal of Research in Marketing. 2013-09-01. 292–303. 30. 3. 10.1016/j.ijresmar.2012.10.002. Jordan. Louviere. Ian. Lings. Towhidul. Islam. Siegfried. Gudergan. Terry. Flynn.
Best–worst scaling vs. discrete choice experiments: An empirical comparison using social care data. Social Science & Medicine. 2011-05-01. 1717–1727. 72. 10. 10.1016/j.socscimed.2011.03.027. 21530040. Dimitris. Potoglou. Peter. Burge. Terry. Flynn. Ann. Netten. Juliette. Malley. Julien. Forder. John E.. Brazier. 10594387 .
Characterizing best–worst voting systems in the scoring context. Social Choice and Welfare. 2009-09-12. 0176-1714. 487–496. 34. 3. 10.1007/s00355-009-0417-1. José Luis. García-Lapresta. A. A. J.. Marley. Miguel. Martínez-Panero. 18334695 .
Exploring Scale Effects of Best/Worst Rank Ordered Choice Data to Estimate Benefits of Tourism in Alpine Grazing Commons. American Journal of Agricultural Economics. 93. 3. 2011-06-19. 0002-9092. 813–828. 10.1093/ajae/aaq174. Riccardo. Scarpa. Sandra. Notaro. Jordan. Louviere. Roberta. Raffaelli.
Student evaluation of teaching: the use of best–worst scaling. Assessment & Evaluation in Higher Education. 2014-05-19. 0260-2938. 496–513. 39. 4. 10.1080/02602938.2013.851782. Twan. Huybers. 144637200 .
Applying best‐worst scaling to wine marketing. International Journal of Wine Business Research. 2009-03-20. 1751-1062. 8–23. 21. 1. 10.1108/17511060910948008. Cohen. Eli.
A Best-Worst Scaling Experiment to Prioritize Caregiver Concerns About ADHD Medication for Children. Psychiatric Services. 2014-11-17. 1075-2730. 208–211. 66. 2. 10.1176/appi.ps.201300525. 25642618. Melissa. Ross. John F. P.. Bridges. Xinyi. Ng. Lauren D.. Wagner. Emily. Frosch. Gloria. Reeves. Susan. dosReis. 5294953.
Testing the robustness of best worst scaling for cross-national segmentation with different numbers of choice sets. Food Quality and Preference. 2013-03-01. 230–242. 27. Ninth Pangborn Sensory Science Symposium. 2. 10.1016/j.foodqual.2012.02.002. Simone. Mueller Loose. Larry. Lockshin.
Eliciting preferences for priority setting in genetic testing: a pilot study comparing best-worst scaling and discrete-choice experiments. European Journal of Human Genetics. 2013-11-01. 1018-4813. 3798841. 23486538. 1202–1208. 21. 11. 10.1038/ejhg.2013.36. Franziska. Severin. Jörg. Schmidtke. Axel. Mühlbacher. Wolf H.. Rogowski.
Estimating preferences for a dermatology consultation using Best-Worst Scaling: Comparison of various methods of analysis. BMC Medical Research Methodology. 2008-11-18. 1471-2288. 2600822. 19017376. 76. 8. 1. 10.1186/1471-2288-8-76. Terry N.. Flynn. Jordan J.. Louviere. Tim J.. Peters. Joanna. Coast . free .
Modeling the choices of individual decision-makers by combining efficient choice experiment designs with extra preference information. Journal of Choice Modelling. 2008-01-01. 128–164. 1. 1. 10.1016/S1755-5345(13)70025-3. Jordan J.. Louviere. Deborah. Street. Leonie. Burgess. Nada. Wasi. Towhidul. Islam. Anthony A. J.. Marley. free. 10453/9977. free.
Valuing citizen and patient preferences in health: recent developments in three types of best–worst scaling. Expert Review of Pharmacoeconomics & Outcomes Research. 2010-06-01. 1473-7167. 259–267. 10. 3. 10.1586/erp.10.29. 20545591. Terry N.. Flynn. 39949090 .
Response Styles in Marketing Research: A Cross-National Investigation. Journal of Marketing Research. 2001-05-01. 0022-2437. 143–156. 38. 2. 10.1509/jmkr.38.2.143.18840. Hans. Baumgartner. Jan-Benedict E.M.. Steenkamp. 11304067.
Assessing Measurement Invariance in Cross‐National Consumer Research. Journal of Consumer Research. 1998-06-01. 78–107. 25. 1. 10.1086/209528. Jan‐Benedict E. M.. Steenkamp. Hans. Baumgartner.
Quantifying response shift or adaptation effects in quality of life by synthesising best-worst scaling and discrete choice data. Journal of Choice Modelling. 2013-03-01. 34–43. 6. 10.1016/j.jocm.2013.04.004. Terry N.. Flynn. Tim J.. Peters. Joanna. Coast.
Valuing the ICECAP capability index for older people. Social Science & Medicine. 2008-09-01. 874–882. 67. Part Special Issue: Ethics and the ethnography of medical research in Africa. 5. 10.1016/j.socscimed.2008.05.015. 18572295. Joanna. Coast. Terry N.. Flynn. Lucy. Natarajan. Kerry. Sproston. Jane. Lewis. Jordan J.. Louviere. Tim J.. Peters. 10453/9747. free.
Estimation of a Preference-Based Carer Experience Scale. Medical Decision Making. 2011-05-01. 0272-989X. 20924044. 458–468. 31. 3. 10.1177/0272989X10381280. Hareth. Al-Janabi. Terry N.. Flynn. Joanna. Coast. 30922199 .
Developing Adolescent-Specific Health State Values for Economic Evaluation. PharmacoEconomics. 2012-12-23. 1170-7690. 713–727. 30. 8. 10.2165/11597900-000000000-00000. 22788261. Julie. Ratcliffe. Terry. Flynn. Frances. Terlich. Katherine. Stevens. John. Brazier. Michael. Sawyer. 21778695 .
Using qualitative methods for attribute development for discrete choice experiments: issues and recommendations. Health Economics. 2012-06-01. 1099-1050. 730–741. 21. 6. 10.1002/hec.1739. 21557381. Joanna. Coast. Hareth. Al-Janabi. Eileen J.. Sutton. Susan A.. Horrocks. A. Jane. Vosper. Dawn R.. Swancutt. Terry N.. Flynn.
Integrating Cognitive Process and Descriptive Models of Attitudes and Preferences. Cognitive Science. 2014-05-01. 1551-6709. 701–735. 38. 4. 10.1111/cogs.12094. 24124986. Guy E.. Hawkins. A. A. J.. Marley. Andrew. Heathcote. Terry N.. Flynn. Jordan J.. Louviere. Scott D.. Brown. 15328149. free. 1959.13/1053320. free.
The best of times and the worst of times are interchangeable. APA PsycNET. 2015-10-01.