All the hard work we put into the “model” on the right-hand side of the equation is only as accurate as the dependent variable’s reflection of the business problem at hand. Yet modeling efforts typically focus almost exclusively on predicting the objective variable, while often accepting the dependent “AS IS,” with all that that implies.
Dependent variable definitions are highly situation-specific. Perhaps that encourages many scientists, trained to be unbiased and consistent, to quickly move conversations away from the judgements necessary to define them, since those judgements are less scientifically defensible. That can be a critical mistake. The dependent variable is unequivocally the most important variable in the model, and its definition plays a pivotal role in the success of any project. Because each situation is different, I’ll contribute three anecdotal stories about defining the “model’s goal” and let everyone take away what he or she can apply to their own problems.
No Dependent Available
Targeting for a new model car: Once I worked on a targeting project for a new car that had, at that point, never been sold. In other words, there was NO sales history. A group of managers and I judgmentally determined how similar the new car was to competitive cars that did have a sales history, and then we modeled that similar sales history as the dependent. The mix of science and judgment worked quite well in predicting new sales and offered a 36-to-1 return over no targeting. (Admittedly, because cars are infrequently purchased durable goods, no targeting is an ineffective and low bar to jump, but the model was nevertheless a big success.)
Bad Dependent Available
Customer Attrition: This is an area that is often modeled poorly, because the initial temptation is to take everyone within a time period and define ALL those who later leave the company as the attriters. While this sounds okay at first pass, it works poorly because many customers don’t just leave; they phase out little by little. A model that says everyone who has quickly drawn down their bank balance to $5 will soon leave the bank isn’t very insightful, and more importantly, its warning comes too late. A better definition counts all these ghost accounts as another form of attrition. Modelers sometimes resist this because it will drop their “stated accuracy” (R-squared or whatever) like a rock, but it is clearly more useful for the business to offer an actionable prediction with low precision than a high-precision prediction that can’t be used.
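A broadened attrition label along these lines can be sketched as a small function. Everything here is illustrative: the field names (`closed`, `peak_balance`, `txns_last_90d`) and the thresholds are assumptions, not a prescription.

```python
from dataclasses import dataclass

@dataclass
class Account:
    closed: bool          # formally left the bank
    balance: float        # current balance
    peak_balance: float   # highest balance in a lookback window
    txns_last_90d: int    # transaction count over the last 90 days

def attrition_label(a: Account,
                    drawdown_frac: float = 0.10,
                    min_txns: int = 2) -> int:
    """1 = attriter. Counts formal closures AND 'ghost' accounts:
    balance drawn down below a fraction of its peak, with little
    recent activity. Thresholds are illustrative business judgments."""
    if a.closed:
        return 1
    ghost = (a.peak_balance > 0
             and a.balance <= drawdown_frac * a.peak_balance
             and a.txns_last_90d < min_txns)
    return int(ghost)
```

The values of `drawdown_frac` and `min_txns` are exactly the kind of judgment the post argues should be negotiated with the business rather than defaulted.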
Many Results but Little History
Soccer Modeling: I modeled European soccer outcomes for betting purposes for several years. One of the challenges was that the primary betting result (Home Win / Draw / Away Win) had very little granularity, but the predictions had to be very accurate in order to beat the odds consistently enough to make money over the long run. One thing that helped a lot was a multi-stage estimation: first estimate each team’s ability in terms of shots taken, corners, fouls, cards, etc., and then use those estimates as predictors of who would win. It was an effective way to take advantage of both game history and the data structure to get more finely tuned results.
If you liked this discussion, I’d appreciate you sharing it or clicking the “like” button. Your vote of approval is always appreciated and useful in the prioritization of further content.
Meta Brown
November 18, 2010
David,
You make a very good point here.
My clients have sometimes struggled to model an elusive dependent variable because their management had chosen that variable as a performance measure. For example, one marketing manager inquired about using text analytics to measure the performance of her campaigns. It turned out that her performance was judged on the basis of unique measures used only by her employer and calculated using some arcane and not necessarily objective formulas. Try building a model to predict that!
The most important modeling advice I could give her was to start by identifying some reasonable, concrete measures for success of media campaigns – such as the number of media mentions, the readership of the media outlets, views of online mentions and so forth. It was up to her to change the game with her management.
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Predictive Modeling, Data Mining, Actuary / Actuarial and Statistics
Group Discussion: The Most Important and Least Thought about Variable: The Dependent
Well spoken, sir! While I’m not a real statistician (I’m a programmer), I’ve certainly seen my share of dependent-variable problems. One key to avoiding them is for the analyst and (especially) his data master to take nothing for granted and always check things out for themselves, just as any good mechanic would. Data prep may not be glamorous, and your client/boss may regard it as something to be minimized, if not avoided altogether, but there really is no substitute for a good initial examination of the data before modeling begins, and a good, modelable dataset created accordingly. I’ve yet to see a database that was designed for analysis. I’m sure they exist, but I suspect that they’re rare.
Posted by John Ries
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: The R Project for Statistical Computing
Discussion: The Most Important and Least Thought about Variable: The Dependent
I totally agree with the statement of the problem. Interesting how it’s done in different fields. For example, in Organizational Psychology, where people frequently deal with courts, the “Criterion problem” is a well-recognized and widely discussed issue. What is “job performance”? How do we operationalize it in order to predict it?
Posted by Dimitri Liakhovitski
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
http://www.analyticbridge.com/group/analyticaltechniques/forum/topics/the-most-important-and-least?commentId=2004291%3AComment%3A77633&xg_source=msg_com_forum
Reply by Tom Wolfer on September 2, 2010 at 2:02pm
Hi David. I like your comments, and I agree with your assessment of the dependent variable challenges. As far as attrition is concerned, I also agree, which is why I think it is important to develop a ‘pattern of behaviour’ for each customer before he or she attrites: perhaps define attrition as the point when he or she begins to ‘deviate’ from that behaviour, rather than when the behaviour stops altogether. For example, if a customer usually buys a carton of milk every two days at the same store using the same credit card, a first red flag that he or she may attrite may be that he or she begins to buy only every week. Clustering algorithms may be helpful in grouping customers according to ‘behaviour patterns’. I have written a blog article about this.
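The “deviation from cadence” idea above can be sketched as a simple rule. The threshold `factor` is an assumption chosen for illustration; a real implementation would likely calibrate it per behaviour cluster.

```python
from statistics import mean

def cadence_alert(purchase_gaps_days, recent_gap_days, factor=2.0):
    """Flag a possible early attrition signal when the latest
    inter-purchase gap is much longer than the customer's usual
    cadence (the mean of past gaps). `factor` is illustrative."""
    if not purchase_gaps_days:
        return False  # no history, nothing to deviate from
    return recent_gap_days > factor * mean(purchase_gaps_days)
```

For the milk-every-two-days customer, a seven-day gap trips the alert long before activity stops entirely.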
Thanks
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Dallas R Users Group
Discussion: The Most Important and Least Thought about Variable: The Dependent
Great comments. I can remember one instance, while working for a client, when there was a misunderstanding about what we were modeling: we decided on one dependent variable, and the client wanted us to work on another. It’s very important, when working with clients, to make sure that expectations and assumptions are very clear up front.
Posted by Larry D’Agostino, P.E.
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Global Analytics Network (+5K analytic professionals)
Discussion: The Most Important and Least Thought about Variable: The Dependent
My experience has mostly been in the area of CPG but I have seen many of the problems you mentioned. A few more:
1. Trying to use one definition for multiple decision issues. An example from category management: these analyses usually start with “estimate the average weekly sales of the product.” However, to allocate shelf space you need to consider only the average weekly sales from the shelf area, ignoring the contribution from secondary locations such as end-aisle displays, whereas to determine whether the product should be included in the assortment mix, or dropped, you need to include the secondary-location sales.
2. Incomplete specification of the dependent. A typical CPG product has multiple UPCs, including special holiday packs or “special packs” that for a brief period of time supplant the “regular item” on shelves (think Hershey’s Kisses in red Valentine, orange Thanksgiving, or red-and-green Christmas wrappers that replace the regular product). If modeling category assortment at the UPC level, these must be properly handled.
3. Improper level of aggregation. We are often tempted to “model the data we have” and hope that we can contribute to understanding. However, often the data we have is either too aggregated (or occasionally too disaggregated) for the problem we are studying. I first encountered this in the study of the relation between advertising expenditures and market sales (or share) in the ’60s, both with annual data and bi-monthly Nielsen estimates. Over the years there have been several studies of aggregation bias in price/promotion models conducted on market-level models that aggregate over multiple retail chains. Now I see examples of people trying to model individual household loyalty-marketing data to understand retail sales, when (in my judgement) a higher level of aggregation is called for.
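Point 2 (seasonal UPCs supplanting the regular item) can be handled by rolling special-pack UPCs up to their base item before the dependent is computed. The UPC codes and mapping below are entirely hypothetical.

```python
# Hypothetical UPC -> base-item map; the seasonal wrappers
# (holiday packs) roll up to the regular item they replace.
UPC_TO_BASE = {
    "00100": "kisses_regular",
    "00101": "kisses_regular",   # Valentine wrapper
    "00102": "kisses_regular",   # Christmas wrapper
    "00200": "bar_regular",
}

def weekly_sales_by_item(rows):
    """rows: iterable of (upc, units). Returns units per base item,
    so the dependent isn't fragmented across seasonal UPCs."""
    totals = {}
    for upc, units in rows:
        base = UPC_TO_BASE.get(upc, upc)  # unmapped UPCs stand alone
        totals[base] = totals.get(base, 0) + units
    return totals
```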
Posted by John Totten
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Statistical Consultants
Discussion: The Most Important and Least Thought about Variable: The Dependent
@David,
Regarding “No statistics class I ever took said the first word about the dependent variable and, in practice as well, it is often taken “AS IS” with all that that implies,” I must point out that your experience does in fact constitute a sample of 1. When I was teaching graduate-level statistics and methods (and the latter is crucial in this regard), the concepts of measurement theory, measurement error, and the consequences for models of errors in both independent and dependent variables were emphasized quite heavily. (Also a sample of 1.) Now, it is still the case that we are often forced to model what are at best imperfect, and often ludicrous, outcome measures.
Posted by David Mangen
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Global Analytics Network (+5K analytic professionals)
Discussion: The Most Important and Least Thought about Variable: The Dependent
Using a dependent variable that is merely correlated with the variable you are really researching, or that has a “hidden” relationship with the real variable, is usually done as a matter of convenience. Homicide counts are a case in point: they are often used incorrectly as a dependent variable, frequently show up in studies regarding gun control, and can be used to prove either argument. A moderating variable is usually needed to explain the relationship.
-Ralph Winters
Posted by Ralph Winters
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Statistical Consultants
Discussion: The Most Important and Least Thought about Variable: The Dependent
The most I ever heard regarding the dependent variable had to do with issues of its distribution and scale (e.g., interval, normally distributed in the case of regression). And we often fail to sufficiently evaluate the psychometric properties of our outcome measures.
Posted by Barth Riley
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Global Analytics Network (+5K analytic professionals)
Discussion: The Most Important and Least Thought about Variable: The Dependent
Thanks for your thoughts on the dependent variable. The dependent variable indeed needs a careful definition and must reflect the target of the study/ business objectives. Here are a few experiences that we have had.
In credit risk, the definition of the dependent variable is critical. If you use a late-stage default definition and a lot of intermediate credit-risk indicators as predictors, the model will simply throw back what should already be obvious.
In retention management, it is very important that you define the dependent variable exactly and per the business requirement. If you use reduced engagement, for example, then you are including a lot of people who have reduced their engagement for reasons other than disinterest or dissatisfaction.
In direct marketing, you have to run several models for different dependent variables through the various stages of the acquisition process to understand engagement and conversions throughout.
Posted by Meduri Ravi Kumar
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Advanced Analytics, Predictive Modeling & Statistical Analyses Professionals Group
Discussion: The Most Important and Least Thought about Variable: The Dependent
In Medicine, we worry quite a bit about the dependent variable. Check the literature on surrogate outcomes.
Steve Simon, http://www.pmean.com
Posted by Steve Simon
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Advanced Business Analytics, Data Mining and Predictive Modeling
Discussion: The Most Important and Least Thought about Variable: The Dependent
The dependent variable in statistics is analogous to the labels in supervised learning, where training cases are by and large labeled. Your experience of having to devise your own dependent variable resembles situations in data mining where we sometimes have to artificially duplicate training cases in under-represented classes to avoid training bias. The process of labeling training data is laborious and expensive, and at times we have to devise quick methods of labeling in order to develop good models, or else fall back on unsupervised learning techniques or additive learning.
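The “artificially duplicate training cases” step described above might look like the following deliberately naive oversampler; class weights or synthetic-minority methods are common alternatives, and the function here is an illustration rather than anyone’s actual pipeline.

```python
import random

def oversample(rows, labels, target_label, seed=0):
    """Duplicate minority-class rows (sampling with replacement)
    until the target class matches the rest in count, then shuffle.
    A sketch of naive duplication oversampling."""
    rng = random.Random(seed)
    minority = [(r, l) for r, l in zip(rows, labels) if l == target_label]
    majority = [(r, l) for r, l in zip(rows, labels) if l != target_label]
    deficit = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(max(0, deficit))]
    combined = minority + majority + extra
    rng.shuffle(combined)
    rs, ls = zip(*combined)
    return list(rs), list(ls)
```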
Posted by Ernest Mugambi
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Statistical Consultants
Discussion: The Most Important and Least Thought about Variable: The Dependent
David,
One of the things that this thread has reminded me of is a structured process that can be used to develop measures, either independent or dependent variables. Long ago I wrote some academic pieces regarding this process, but the intellectual origins (at least to the best of my knowledge) stem from some work done in the mid-1960s by Murray Straus. He coined the phrase, “the rational approach to measurement,” to refer to a deductive logical model by which one can conceptually explicate what in fact is meant by any measure that you intend to create.
To briefly illustrate, the process entails starting out at the highest level, and then drilling down and laying out the different dimensions and facets that are applicable to the measure. So, the concept of customer loyalty might start by having behavioral and psychological dimensions. In a multi-tiered organization, you might overlay this with the different divisions of the company — let’s say for this example that the client company (whose loyalty is being assessed) has three different divisions. But you also can take into account the sponsoring organization’s divisional structure, which for the sake of this discussion we’ll assume has four different divisions.
This produces a 2 x 3 x 4 matrix where you might begin to look for or develop indicators appropriate for each cell of the matrix, and that your total loyalty index might be some combination of these different sub-indices. In some cases, a cell may be a structural zero — logically impossible — at which point it can be ruled out.
I certainly do not intend this brief example to constitute a recommendation for how I believe customer loyalty should be measured. What I hope is that it illustrates the logical process that can be used to develop a measure. For what it is worth, when I have used this approach in survey-research related endeavors I have typically found that the psychometric properties of the measures that I develop are quite good. In essence, the logical process forces you to carefully identify what is distinct to each cell of the matrix, and where the confusion exists across cells of the matrix. The process also lends itself quite readily to confirmatory modeling procedures when it comes time to do the statistical analysis of the data.
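The drill-down described above, 2 dimensions by 3 client divisions by 4 sponsor divisions with structural zeros removed, can be enumerated mechanically. All the division names and the example structural zero below are placeholders.

```python
from itertools import product

dimensions = ["behavioral", "psychological"]
client_divisions = ["client_A", "client_B", "client_C"]
our_divisions = ["sales", "service", "product", "finance"]

# Structural zeros: cells that are logically impossible
# (purely illustrative -- say 'finance' has no behavioral
# touchpoint with client_C).
structural_zeros = {("behavioral", "client_C", "finance")}

# Every remaining cell is a slot where an indicator is sought;
# the total loyalty index combines the per-cell sub-indices.
cells = [c for c in product(dimensions, client_divisions, our_divisions)
         if c not in structural_zeros]
```

With 2 × 3 × 4 = 24 candidate cells and one structural zero, 23 cells remain to be populated with indicators.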
Posted by David Mangen
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Statistical Consultants
Discussion: The Most Important and Least Thought about Variable: The Dependent
Straus, Murray A., 1964 “Measuring families.” Pp. 335-400 in Harold T. Christensen (Ed.), Handbook of Marriage and the Family. Chicago:Rand McNally.
Posted by David Mangen
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Advanced Business Analytics, Data Mining and Predictive Modeling
Discussion: The Most Important and Least Thought about Variable: The Dependent
David,
Your original question applies only to so-called supervised learning/modeling, since unsupervised modeling such as clustering does not have a dependent variable.
I have been building predictive models for over a decade. True, no class will teach you how to define dependent variables. Your value, as an employee of some kind, lies in how you define the dependent variable so that it solves the business problem at hand. In a typical modeling project, over 80% of the time, at least in my experience, goes into hashing out the definition of the model universe, which is to say, building the dependent variable. Once the universe is built and signed off by your customers, modelers simply start to ‘process the model’. Defining the dependent variable is forever an art. Your brain, equipped with all the technical skills, should serve as the intersection of all the important paths: business people, the data manager, technology people, vendors, and, needless to say, the almighty senior management. It is your job to design a dish that makes everybody as happy as possible.
Posted by Jia (Jason) Xin
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Statistical Consultants
Discussion: The Most Important and Least Thought about Variable: The Dependent
I think this is a great topic, and even more strongly I agree that the Y is often the most disregarded element. One approach I have used is to modify it (with respect to the business need) and convert it to a different level of granularity, or to exploit a natural hierarchy that the variable exhibits. For example, a buying amount can be decomposed into a model for whether a purchase occurs and a model for the buying amount. Additionally, in such situations the continuous variable might be binned to stabilize the process while still being actionable.
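The decomposition mentioned above, a buy/no-buy model plus an amount-given-buy model, is essentially a two-part (“hurdle”) dependent. A toy sketch, using simple class rates and means in place of the two fitted models:

```python
from statistics import mean

def hurdle_fit(amounts):
    """Decompose a zero-inflated spend variable into two simpler
    dependents: an incidence rate P(buy) and a conditional mean
    E[amount | buy]. Each part can then be modeled separately
    (e.g. a classifier for incidence, a regression on buyers)."""
    buyers = [a for a in amounts if a > 0]
    p_buy = len(buyers) / len(amounts)
    mean_given_buy = mean(buyers) if buyers else 0.0
    return p_buy, mean_given_buy

def hurdle_expected(p_buy, mean_given_buy):
    """Recombine: E[amount] = P(buy) * E[amount | buy]."""
    return p_buy * mean_given_buy
```

A quick consistency check: recombining the two parts reproduces the overall mean spend.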
David: I’m kind of curious about your soccer modeling... being from Europe, it strikes a chord 😉 Can you share more detail, if that’s possible?
Posted by Georgi Georgiev
David Young
March 11, 2011
Georgi,
I stopped betting professionally a couple of years ago when the largest betting exchange, Betfair, opened a sister company in Malta that bet its own money against the exchange participants, like a bookmaker does. This seemed to coincide with a vast improvement in the accuracy of the betting prices/odds and didn’t leave me enough margin to cover the 3-5% commission on winning bets. If you paid for the more detailed Carling Opta data, perhaps you could still out-predict them, but I didn’t want to up the stakes just when everything was going sour. I’d made 289% on funds in 2004, but I think those days are gone forever.
That said, the models were systems of non-linear equations that had the basic form (Team A strength less Team B strength) = Team A’s number of shots taken at goal. You had to estimate these strengths jointly across games because, of course, all the estimates were interdependent; i.e., it’s easier to get a shot at Wolverhampton’s goal than it is to get one at Man United’s. Step two was to take every team’s strengths on the different game attributes and use those as the predictors in an equation to predict the win. There was a lot more to it than this, but that was the basic strategy.
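A minimal sketch of stage one, fitting interdependent team strengths from the additive form described above. This uses simple incremental least-squares updates rather than whatever solver was actually used, and the team names and data are made up.

```python
def fit_strengths(games, n_iter=200, lr=0.1):
    """Fit the additive model  shots(A vs B) ~ attack[A] - defence[B]
    jointly across all games, since every estimate depends on every
    other. Incremental gradient steps on squared error.
    `games`: list of (attacking_team, defending_team, shots)."""
    teams = {t for a, d, _ in games for t in (a, d)}
    attack = {t: 0.0 for t in teams}
    defence = {t: 0.0 for t in teams}
    for _ in range(n_iter):
        for a, d, shots in games:
            err = shots - (attack[a] - defence[d])
            attack[a] += lr * err   # nudge both parameters toward
            defence[d] -= lr * err  # reducing this game's residual
    return attack, defence
```

Only differences like `attack["A"] - defence["B"]` are identified (adding a constant to every strength changes nothing), which is fine since only differences enter the stage-two win model.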
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Reply by John A Morrison on September 8, 2010 at 10:03pm
I just want to say I think the whole discussion is massively interesting, as much for the way it is capturing attention on LinkedIn as well as on AnalyticBridge. That matters, because it should be our mission to share and communicate quantitative ideas right now; they are the future, I think.
http://www.analyticbridge.com/group/analyticaltechniques/forum/topics/the-most-important-and-least?commentId=2004291%3AComment%3A78478&xg_source=msg_com_forum
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Reply by Tom Wolfer on September 12, 2010 at 3:01pm
This discussion has turned very, very interesting. Hopefully my post on Link Analysis tomorrow will generate as much discussion….
http://www.analyticbridge.com/group/analyticaltechniques/forum/topics/the-most-important-and-least?commentId=2004291%3AComment%3A78478&xg_source=msg_com_forum
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Global Analytics Network (+5K analytic professionals)
Discussion: The Most Important and Least Thought about Variable: The Dependent
In my experience, the dependent variable has to be formulated with respect to the business issue being researched. Beyond this and the usual challenges of data availability, time, organizational dynamics, and cost, I think the choice of level vs. change vs. % change is an area of challenge. Regardless of the metric (e.g., sales), I find that using % change vs. level can produce different results, some subtle but some important not to overlook.
Posted by Shwetal Patel
David Young
March 11, 2011
Group: Global Analytics Network (+5K analytic professionals)
Discussion: The Most Important and Least Thought about Variable: The Dependent
I agree that the dependent variable is at times loosely defined, but it is definitely not the least thought about. Most of the effort culminates when a business leader / stakeholder tracks the performance and calls for a corrective measure / improvement. In other words, the objective function is directly related to the business problem at hand. Quantitative methods rely heavily on the correctness of data, and in most cases a closer look at the data is required to arrive at, and agree upon, the definition.

If one were to model the likelihood of a customer being ‘Hit and Run’ (using the credit card just once), what should be the ideal wait period: 3/6/9/12 months? It cannot be indefinite. We need to look at the data to make an assertive statement such as ‘if a customer has not turned back in 90 days, it is highly unlikely that he is ever coming back.’ Of course it is possible that a few customers turn up, but the impact would be meager. There can only be an optimum definition, like an optimum solution in predictive modeling or optimization; the trade-off is necessary to tackle the bigger issue. A quick solution is required to mitigate losses / utilize opportunity at the earliest.

One can always revisit the implemented strategy and fine-tune it based on its performance. For example, one might find that new customers (early months on book) behave differently from older accounts; a segmented approach can be taken to handle such cases. It is a continuous improvement process. Re-iterating my point, one can only hope for an ‘optimum’ solution and not a ‘perfect’ solution, at least when dealing with predictive analytics. Adequate validation (such as testing the model on an out-of-time sample) and simulation are essential to test the predictive power and expected impact of any solution.
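The wait-period question (“3/6/9/12 months?”) can be answered empirically by asking, for each candidate window, how many of the eventual returners the window would already have captured. A sketch, with the input format assumed for illustration:

```python
def return_rates(days_to_second_use, windows=(30, 60, 90, 180)):
    """For each candidate wait window, the fraction of customers who
    DID eventually use the card again that had already returned within
    the window. High coverage at 90 days supports a 90-day 'hit and
    run' cutoff. Input: days to second use per customer, with None
    meaning no return within the observation period."""
    returned = [d for d in days_to_second_use if d is not None]
    if not returned:
        return {w: 1.0 for w in windows}
    return {w: sum(d <= w for d in returned) / len(returned)
            for w in windows}
```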
Working on real business cases would help one grasp the nuances of technique in any course. I find ‘on the job training’ (OJT) the most effective way of learning. Of course, one needs to know the theory, else ‘analytics’ itself would seem to be a black box.
Posted by Vinodh Kumar
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Global Analytics Network (+5K analytic professionals)
Discussion: The Most Important and Least Thought about Variable: The Dependent
Suppose the dependent variable Y is something that cannot be easily measured, for example, lifetime or success. On the other hand, the independent variables X (possibly a vector) are something that can be measured, like lifestyle or GPA. Under the assumption that (X, Y) has a multivariate normal distribution, one way to proceed is as follows. Consider the conditional probability of Y, given X. Restrict X to a selected subset so that the conditional probability that Y falls in the acceptable region, given X, is high enough. In the simplest case, where (X, Y) is bivariate normal, the solution is simple if the mean and variance of X are known, since today the computation of bivariate normal probabilities should be fairly routine. This procedure is sometimes called a selection process, or screening.
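Under the bivariate-normal assumption, the screening probability has a closed form via the standard conditional distribution Y | X = x ~ N(μ_y + ρ(σ_y/σ_x)(x − μ_x), σ_y²(1 − ρ²)). A stdlib-only sketch:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_y_above_given_x(x, y_min, mu_x, mu_y, sd_x, sd_y, rho):
    """P(Y >= y_min | X = x) under a bivariate normal model:
    screen on X by requiring this probability to be acceptable."""
    cond_mu = mu_y + rho * sd_y / sd_x * (x - mu_x)
    cond_sd = sd_y * sqrt(1.0 - rho ** 2)
    return 1.0 - normal_cdf((y_min - cond_mu) / cond_sd)
```

The selection rule then keeps the applicants (or units) whose measured x pushes this probability above a chosen bar.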
Posted by Roy W. Haas, Ph.D.
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Advanced Business Analytics, Data Mining and Predictive Modeling
Discussion: Statistical Modeling is like building with LEGOS
Enjoyed reading your blog! Gave me something to think about on how to improve my models. Thanks!
Posted by Marty Epstein
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Global Analytics Network (+5K analytic professionals)
Discussion: The Most Important and Least Thought about Variable: The Dependent
DV definition is absolutely the key to successful modeling. The key to the process is understanding what you’re modeling and how the model will be used, and not ignoring things like sunk costs. A ‘bad’ isn’t always a ‘bad’: if I consider the balance a customer owes me as sunk, then for modeling purposes the fact that he or she charges off isn’t necessarily bad. If a customer owes me $5000, and I collect $4000 and charge off $1000, I might want to call that charged-off account a GOOD account. At least I didn’t lose $3000 (or $4000, or $5000)!
Too often the definitions are just knee-jerk coding decisions:
if CO = 1 then BAD = 1; else BAD = 0;
And your model or strategy is doomed from the get-go…
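The point can be made concrete by replacing the knee-jerk `BAD = (CO = 1)` coding with a recovery-aware label. The 75% recovery cutoff below is purely illustrative, another business judgment to be negotiated, not a standard.

```python
def bad_label(charged_off: bool, balance_at_risk: float,
              amount_recovered: float,
              recovery_cutoff: float = 0.75) -> int:
    """Treat a charged-off account as GOOD when most of the exposure
    was recovered first (e.g. $4000 collected of $5000 owed before a
    $1000 charge-off). 1 = bad, 0 = good."""
    if not charged_off:
        return 0
    if balance_at_risk <= 0:
        return 0  # nothing at risk, nothing lost
    return int(amount_recovered / balance_at_risk < recovery_cutoff)
```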
Posted by Jim Nielsen
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Predictive Modeling, Data Mining, Actuary / Actuarial and Statistics Group
Discussion: The Most Important and Least Thought about Variable: The Dependent
Great discussion topic!
Here’s my 2 cents:
Even when you’re sure you’re measuring the right response variable, you still might want to tweak it to model well. Some examples:
1. Many standard statistical analyses are valid if the error variance is constant. Yet the variance of many response variables increases with the level of the response variable. To utilize such standard techniques it becomes important to transform the dependent variable before modeling. I frequently model the log of such response variables as my dependent variable.
2. Say you want to predict monthly sales of magazines. Some magazines are published monthly, but magazines published weekly will have 4 issues in some months and 5 in others. To stabilize the model so that it focuses on demand-related effects and removes the noise from differing numbers of issues per month, you could model sales/n_issues as the dependent variable.
3. Say you want to compare different forecasting methodologies, some which can model covariates and disruptive events, and some which can’t, on an equal footing. I’ve developed systems to come up with good parsimonious models of the effects of these covariates and events and their interactions using covariance structures which reflect the time dependencies, and adjusted the series for these effects. Then various exponential smoothing and ARIMA techniques can be compared on an equal footing to determine which best fits the baseline demand. The effects of the covariates and events are then added back for final forecasts.
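Points 1 and 2 from the list above can be combined in a single transform of the dependent: per-issue sales, then a log so that a multiplicative error structure becomes additive with roughly constant variance. A sketch (ignoring the log-normal bias correction on back-transform for simplicity):

```python
from math import log, exp

def transform_dependent(monthly_sales, n_issues):
    """Divide by issue count to remove 4-vs-5-issue noise, then take
    the log to stabilize variance that grows with the response level."""
    return log(monthly_sales / n_issues)

def back_transform(y_hat, n_issues):
    """Invert for reporting predictions on the original sales scale."""
    return exp(y_hat) * n_issues
```

Note that a 4-issue month selling 100 copies and a 5-issue month selling 125 map to the same transformed value, which is exactly the stabilization the list describes.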
I too am currently available to help clients with designing and implementing automated scalable statistical modeling frameworks, remotely or on-site in USA, Canada, or Israel.
Posted by Aaron Dukes