You’re working on the MAIN MODEL. The one that leverages half the company’s assets, and on which your paycheck and that of many others depends. You’ve already run through a stepwise, forward, and backward search of the variables, their interactions, and possible curvatures. What are the most productive things to do next?
Here are a couple of ideas revolving around relationship consistency and complex variable interactions.
1. COMPLEX VARIABLE INTERACTIONS – Predictive variables sometimes aren’t: It’s a funny statement, but it represents a common problem that’s usually ignored. We’ve all seen variable interactions that change the significance, curvature, and even the sign of an important predictor. It’s not uncommon. I think we can also agree that virtually no dataset contains all the data we’d like it to, so it stands to reason that there are many unavailable interacting variables. That is to say, there are many unidentified situations in which our predictors don’t predict the way we believe, or just aren’t predictive at all.
While we’ll never have all the data we’d like, it’s possible to look for situations in which our predictive variables aren’t behaving well, or in which normally unpredictive variables are useful.
Example: I was once predicting stock price movements and could graphically see long trends in prices, but none of a variety of trend calculations showed much promise for predicting future prices. There were a lot of issues at play: trends had to be calculated from prior highs and lows, not just from a fixed time interval; some time periods were noticeably more volatile than others; down trends were usually more volatile than up trends; and so on. That’s when it occurred to me that the solution rested not in showing that the trend was or wasn’t predictive, but in determining WHEN it was predictive. As a result, I began to create descriptive statistics about the trend calculations. These proved invaluable in showing when the trend did predict the future and when it did not.
Interestingly, while it is easy to see and show the value of these interactions once they are known, they aren’t detected by techniques such as stepwise regression or CART. That’s because while the trend calculation is predictive in specific situations, neither the trend nor its descriptive statistics is predictive individually, so most algorithms never identify them as valuable.
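Here’s a minimal simulation sketch in Python (hypothetical variables, not my original stock data) of how a predictor can look worthless overall yet be strongly predictive in the right situation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

trend = rng.normal(size=n)   # a trend calculation
vol = rng.uniform(size=n)    # a descriptive statistic about the trend
calm = vol < 0.3             # the "situation" in which the trend works

# Trends follow through in calm periods but mean-revert in volatile ones,
# so neither variable shows any predictive power on its own.
future_ret = np.where(calm, 0.7 * trend, -0.3 * trend) + rng.normal(size=n)

print(np.corrcoef(trend, future_ret)[0, 1])              # roughly 0 overall
print(np.corrcoef(vol, future_ret)[0, 1])                # roughly 0 overall
print(np.corrcoef(trend[calm], future_ret[calm])[0, 1])  # strong within the situation
```

A one-variable-at-a-time search scores both `trend` and `vol` near zero and discards them; only their interaction carries the signal.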
2. THE UNEXPECTED IMPACT OF MISSING: When a variable is added to or removed from a model, it affects the parameters of the other variables. The same thing happens when a variable contains missing values, and imputing the variable’s average doesn’t fix it.
Example: I was once predicting credit card transaction revenue for a bank. Two of the predictors were customer income and customer age, but the bank only had incomes for about half its clients. The presence or absence of income had a strong impact on how customer age was modeled. When income was present in the model, age appeared to act as a proxy for willingness to adopt technology, with card usage highest for younger customers and DECREASING with age, given the same income. However, when income was missing and represented by an average value, the age variable had a completely different relationship: in the absence of income, age acted as a proxy for income, with card transactions INCREASING with age until retirement, after which they dropped. In that case, I ended up building one model for when income was available and another for when it wasn’t.
Depending upon the assets being leveraged, this type of solution might become worthwhile long before the percentage of missing values gets high.
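For what it’s worth, here’s a minimal sketch of that split-model approach (hypothetical column names; scikit-learn used for brevity):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumes df has columns 'age', 'income' (NaN when unknown), and 'card_revenue'.
def fit_split_models(df: pd.DataFrame):
    has_income = df["income"].notna()

    # Model 1: income known -- age can act as a technology-adoption proxy.
    m_known = LinearRegression().fit(
        df.loc[has_income, ["age", "income"]], df.loc[has_income, "card_revenue"]
    )
    # Model 2: income missing -- age tends to act as an income proxy instead.
    m_missing = LinearRegression().fit(
        df.loc[~has_income, ["age"]], df.loc[~has_income, "card_revenue"]
    )
    return m_known, m_missing

def predict(df: pd.DataFrame, m_known, m_missing) -> np.ndarray:
    has_income = df["income"].notna().to_numpy()
    out = np.empty(len(df))
    out[has_income] = m_known.predict(df.loc[has_income, ["age", "income"]])
    out[~has_income] = m_missing.predict(df.loc[~has_income, ["age"]])
    return out
```

The point isn’t the linear models; it’s that missingness itself routes each customer to a model whose other coefficients were estimated for that situation.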
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: American Association for Public Opinion Research (AAPOR)
Discussion: Ideas for improving already good models
David,
You seem to be exploring the form of the model. I have had problems with intercorrelation among variables and with the selection of dependent variables; both can affect the structure of the models. You might wish to explore using canonical analysis to determine the general structure and, potentially, the need for multiple models. I have found it, along with factor analysis, to be a good exploratory tool. For example, with business portfolio models I’ve had problems when including both earnings and revenue: these are usually highly correlated and can produce strange effects.
It is also useful to explore the nature of the data. In particular, I have found it useful to check whether you are facing a mixture problem that calls for multiple models. Try regression-clustering techniques, such as latent class regression, to determine whether multiple models are necessary and what drives the groups. This is particularly useful when examining large portfolios of assets and businesses, since the drivers of success can vary greatly depending on the nature of the assets.
Posted by Gene Lieb
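A crude sketch of the regression-clustering idea Gene describes: an EM fit of a k-component mixture of linear regressions (dedicated latent-class tools, such as the flexmix package in R, do much more than this):

```python
import numpy as np

def mixture_of_regressions(X, y, k=2, n_iter=100, seed=0):
    """Crude EM for a k-component mixture of linear regressions,
    a bare-bones stand-in for latent class regression."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])       # design matrix with intercept
    resp = rng.dirichlet(np.ones(k), size=n)   # random initial responsibilities
    betas, sigmas = np.zeros((k, d + 1)), np.ones(k)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # M-step: weighted least squares and residual scale per component.
        for j in range(k):
            w = resp[:, j]
            sw = np.sqrt(w)[:, None]
            betas[j] = np.linalg.lstsq(Xb * sw, y * sw.ravel(), rcond=None)[0]
            r = y - Xb @ betas[j]
            sigmas[j] = np.sqrt((w * r**2).sum() / w.sum()) + 1e-9
            weights[j] = w.mean()
        # E-step: posterior probability each point belongs to each component.
        dens = np.empty((n, k))
        for j in range(k):
            r = y - Xb @ betas[j]
            dens[:, j] = weights[j] * np.exp(-0.5 * (r / sigmas[j]) ** 2) / sigmas[j]
        dens += 1e-300                          # guard against underflow
        resp = dens / dens.sum(axis=1, keepdims=True)
    return betas, sigmas, weights, resp
```

Rows whose responsibilities split cleanly across components suggest distinct groups, each of which may deserve its own model.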
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Quant Finance
Discussion: Ideas for improving already good models
David,
I read and enjoyed your above discussion on what we in accounting refer to as “Relevant Ranges.” Given a specific data-set “Context #1” … here is what is going to happen (Content = Cause–>Effect Domino Supply Chain). However, if the “Context” (relevant range) moves to “Context #2”, this is what will occur (Content = Cause–>Effect).
The questions I have … and obviously from my posting I am not a math guy … are as follows:
#1) Contextual algorithms change, but are they predictable?
#2) A higher or “Meta Context” can change or alter a lower “Context” in a contextual Cause–>Effect, while within that lower “Context” it has its own sub or smaller cause–>effect chain of results. Do you know of any current reading that I can access to gain a better understanding of this?
and finally,
#3) Assume there is a third-level Meta Context #3 — that impacts a lower Meta Context #2 — that impacts an even lower Meta Context #1 — which obviously has a lot of sub-context “Cause–>Effect” things going on. Systems Theory tells us that we are all living in a multi-dimensional feed-back … feed-forward system of loops. If that is so … wouldn’t this Quant Predictive industry … be chasing after a massive Random Generator called the universe … the Meta Matrix of the Mind of Allness? :-) Like Einstein said … “I just want to know how He thinks.” Confusion fed via Chaos
Posted by Lawrence Carson
David Young
March 11, 2011
Lawrence,
You’ve raised an important issue with regard to changes in the environment that might make situation-specific relationships obsolete. While you’ll never be able to predict or protect yourself from all types of risk, I’d say you have three lines of defense against environmental changes.
The first defense is consistency over time. A situational relationship that you can observe happening repeatedly over a long period, in the same situations, is more likely to repeat itself than one that cannot be shown to have happened repeatedly. If you can find this kind of periodic consistency, you can then verify whether or not it has been robust enough to survive prior environmental changes.
The second defense is tracking. Things can change even if they haven’t in the past, so the closer you track a relationship, the earlier you’ll be warned of deviations from the expected.
The third defense is diversification. If it is possible in your situation to divide your investment across industries, analysis systems, people, and other sources of error, you can gain some protection from problems arising in any specific area. The downside, of course, is that you’ll also reduce profitability: some industries, analysis systems, and people are more lucrative than others, and you’ll inevitably be trading down to some extent in order to achieve the diversification.
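Going back to the second defense, here’s a rough sketch of one way to track a relationship over time (illustrative names and thresholds only, not a production monitor):

```python
import pandas as pd

def rolling_relationship(signal: pd.Series, outcome: pd.Series,
                         window: int = 250) -> pd.DataFrame:
    """Track a situational relationship: a rolling correlation plus a
    crude warning flag when it drifts far from its long-run level."""
    roll = signal.rolling(window).corr(outcome)
    baseline = roll.expanding().mean()   # long-run level of the relationship
    band = 2 * roll.expanding().std()    # +/- 2 standard deviation tolerance
    drifting = (roll - baseline).abs() > band
    return pd.DataFrame({"rolling_corr": roll, "drifting": drifting})
```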
With all that, you can still get burned, but sometimes the only good defense is to try to earn as much as you can during the good times in order to withstand the inevitable unforeseen problems.
Dave
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Predictive Modeling, Data Mining, Actuary / Actuarial and Statistics Group
Discussion: Ideas for improving already good models
Great discussion, please keep posting :)
Posted by Jie Gao
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Stat-Math Statistics
Discussion: Ideas for improving already good models
I prefer to follow George Box’s advice: “all models are wrong, but some are useful.” So the question should be “how can our current model become more useful?”
Posted by Roy W. Haas, Ph.D.
David Young
March 11, 2011
Roy,
I think your general philosophy is a good one. I often focus more on making the models fit the problem better, as opposed to making them more accurate given the original problem formulation. This post focuses more on improving accuracy on problems that have been well defined and thought through, but in many situations I’ve seen more opportunity for improvement in the “problem definition” than in the “optimization of the solution”.
Dave
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: The R Project for Statistical Computing
Discussion: Ideas for improving already good models
David, my field is biostatistics, not economics, so I can only see the problem partially, but:
1) Have you measured the linearity of the trend variable in the two different contexts? Maybe the price is linear in one context and is not a good predictor as a continuous variable in the other.
2) On the missing data: we used the multiple imputation method to replace each missing value with a set of plausible values that represent the uncertainty about the right value to impute. (Roderick J. A. Little and Donald B. Rubin, Statistical Analysis with Missing Data, 2nd ed. Hoboken, NJ: Wiley-Interscience, 2002.)
Posted by Emilio Cabrera
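A rough scikit-learn sketch of the multiple-imputation idea Emilio cites (pooling here just averages coefficients; Rubin’s rules also combine within- and between-imputation variances):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

def multiply_impute_and_pool(X, y, m=5):
    """Create m stochastically imputed copies of X, fit one model per
    completed dataset, and pool the coefficient estimates."""
    coefs = []
    for seed in range(m):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        X_complete = imputer.fit_transform(X)
        coefs.append(LinearRegression().fit(X_complete, y).coef_)
    return np.mean(coefs, axis=0)
```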
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: American Association for Public Opinion Research (AAPOR)
Discussion: Ideas for improving already good models
Dave,
I agree with Roy. Let me go back to my engineering days and recall the concept that in engineering we obtain either an approximate solution to the “exact” problem or the “exact” solution to the approximate problem. In business analytics, our situation is worse: we really obtain approximate solutions to approximate problems. We are generally dealing with approximate models relying on unrealistic assumptions, with incomplete and inaccurate data, based on ill-defined metrics. So what else is new?
However, given that situation, it is generally useful to use multiple models, based on different assumptions, data, and approaches, to get at the same end points. I don’t think seeking the “better” or even an “optimum” model is necessarily a bad idea, but it will never be perfect or, in most cases, even “good”. I think multiple approaches are probably the only way to get any assurance that the results of models are credible and can be considered useful.
Gene
Posted by Gene Lieb
David Young
March 11, 2011
Gene, Roy, and other advocates of fitting the problem over fitting the solution:
If what you find yourself working on can be described as an “approximate problem”, then I’d be the first to agree that you can’t do better than an approximate solution, for many reasons, a limited number of significant digits being one of them. I’ll also confess that in years gone by, I’ve sometimes looked down on people making the kind of suggestions I’m making here: suggestions that could consume a considerable amount of time and reward only a moderate increase in accuracy. I thought of it all as “the quagmire of irrelevant precision” and a good way to produce little value for the time spent.
Nevertheless, here I am. I don’t believe that I was wrong then and right now, or vice versa; I believe it depends on the problem at hand and what you’ve got to work with. If you’ve got a problem that can be shown to:
1. be fairly stable over time and/or repeat consistently,
2. offer an ample supply of data,
3. and be sensitive enough that small changes in accuracy have a large impact on results,
then fine-tuning the solution makes sense. If not, your time might be better spent moving on to the next issue.
My first work meeting those criteria was probably sports wagering, where payouts were the direct inverse of probability, so any deviation of the true odds from the estimated odds was important, and small advantages in probability would be compounded hundreds of times across multiple games. While that might not be the typical case in business, it sometimes is. For instance, term life insurance would appear to fit all of the above criteria.
Dave
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Data Mining, Statistics, and Data Visualization
Discussion: Ideas for improving already good models
One of the things I have found useful for increasing the accuracy of already good models is to perform genetic feature selection on the selected model. It produces better combinations of variables for the model using genetic search algorithms.
Posted by Seyhan Yildiz
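A toy sketch of genetic feature selection along the lines Seyhan describes (a plain genetic algorithm with cross-validated fit as the fitness function; a stand-in, not any specific tool’s implementation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def ga_feature_selection(X, y, pop=30, gens=20, seed=0):
    """Evolve boolean feature masks; fitness is cross-validated R^2
    of a linear model on the selected columns."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    popn = rng.integers(0, 2, size=(pop, n_feat)).astype(bool)

    def fitness(mask):
        if not mask.any():
            return -np.inf
        return cross_val_score(LinearRegression(), X[:, mask], y, cv=5).mean()

    for _ in range(gens):
        scores = np.array([fitness(m) for m in popn])
        popn = popn[np.argsort(scores)[::-1]]      # best masks first
        survivors = popn[: pop // 2]               # selection: keep the top half
        children = []
        for _ in range(pop - len(survivors)):
            a, b = survivors[rng.integers(len(survivors), size=2)]
            cut = rng.integers(1, n_feat)          # single-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_feat) < 0.05     # mutation: flip ~5% of bits
            children.append(child)
        popn = np.vstack([survivors, children])
    return max(popn, key=fitness)
```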
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Global Analytics Network (+5K analytic professionals)
Discussion: Ideas for improving already good models
Great discussion. I think one additional way to improve an already good model, in terms of definition, is to simplify it by really understanding the problem. In other words, try to put in as much a priori domain knowledge as possible, instead of putting every variable we can think of into the model and then seeing if it works.
M. J. Crawley in his R book has a great discussion as to what constitutes a good model and how to find the “minimal adequate model”.
Posted by Theophano Mitsa Ph.D.
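Crawley builds the minimal adequate model with deletion tests in R; here is a rough Python analogue in the same spirit, simplifying backward by AIC (an adaptation, not his exact recipe):

```python
import statsmodels.api as sm

def simplify_by_aic(X, y):
    """Backward simplification toward a minimal adequate model: repeatedly
    drop the single term whose removal most lowers AIC, stopping when no
    removal helps. X is a pandas DataFrame of candidate terms."""
    cols = list(X.columns)
    best_aic = sm.OLS(y, sm.add_constant(X[cols])).fit().aic
    while len(cols) > 1:
        trials = {c: sm.OLS(y, sm.add_constant(X[cols].drop(columns=c))).fit().aic
                  for c in cols}
        best_drop = min(trials, key=trials.get)
        if trials[best_drop] >= best_aic:
            break                     # no single removal improves the model
        best_aic = trials[best_drop]
        cols.remove(best_drop)
    return cols, best_aic
```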
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Advanced Business Analytics, Data Mining and Predictive Modeling
Discussion: Statistical Modeling is like building with LEGOS
Enjoyed reading your blog! Gave me something to think about on how to improve my models. Thanks!
Posted by Marty Epstein
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Data Mining, Statistics, and Data Visualization
Discussion: Ideas for improving already good models
In your post above, you mentioned that data miners use forward/backward/stepwise regression. I’m puzzled by the use of those methods, interesting as they may be: serious problems with them have been demonstrated in many articles in the statistical literature.
Posted by Chris Barker
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Data Mining, Statistics, and Data Visualization
Discussion: Ideas for improving already good models
David, I appreciated the email. My read of the statistical literature is that step-anything over-selects noise variables and is a very biased method. There is an extensive literature, too extensive to cite here. A couple of article citations from my web search:
Citations herein: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.126.4133&rep=rep1&type=pdf
http://www.ncbi.nlm.nih.gov/pubmed/18189162
There are some isolated cases where “step” might work: http://www.jstor.org/pss/2282632
I think it’s OK to use the step- methods, as they may generate hypotheses worth exploring and may help to better understand some features of the data. Otherwise they are to be viewed with great caution, and there does not seem to be any methodology for correcting for the biases.
An excellent authority on the problems of step-anything is Frank Harrell:
http://www.stata.com/support/faqs/stat/stepwise.html
Posted by Chris Barker
David Young
March 11, 2011
Chris,
Thanks for bringing up the over-selection issue with regard to stepwise procedures. I used stepwise as a prelude to my suggestions because of its widespread use, and basically as a jumping-off point for the discussion.
My feeling about stepwise procedures is that while they do have some problems, they work fairly well in many situations. That said, since this discussion is about things you can do to take a step forward, it seems like an appropriate place to raise the issues with stepwise procedures and the solutions for them.
The abstract of the second article you’ve linked provides a good problem definition:
“When variable selection with stepwise regression and model fitting are conducted on the same data set, competition for inclusion in the model induces a selection bias in coefficient estimators away from zero.” There are other problems, but I think that’s the main one. The solution they propose is to use a bootstrap method to separate the variable selection and parameter estimation processes. That sounds like a pretty good idea to me. Thanks for raising the issue.
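Here’s a rough sketch of one simple version of that separation idea, using scikit-learn’s forward selector on bootstrap resamples (my own simplification, not necessarily the authors’ exact procedure):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

def bootstrap_then_estimate(X, y, n_boot=50, keep_frac=0.5, seed=0):
    """Run forward selection on bootstrap resamples, keep variables chosen
    in at least keep_frac of them, then estimate coefficients once on the
    full data using only those variables."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    votes = np.zeros(d)
    for _ in range(n_boot):
        idx = rng.integers(n, size=n)   # bootstrap resample
        sel = SequentialFeatureSelector(
            LinearRegression(), n_features_to_select="auto", tol=1e-4
        ).fit(X[idx], y[idx])
        votes += sel.get_support()
    chosen = votes / n_boot >= keep_frac
    model = LinearRegression().fit(X[:, chosen], y)
    return chosen, model
```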
Dave
David Young
March 11, 2011
Maybe I should just add a couple of things about stepwise procedures. I think that in practice most analysts would advocate selecting only among variables that have a plausible connection to the dependent variable, and any counterintuitive results are normally put under the microscope to see whether or not they make sense. Those “logic checks” significantly reduce the chance of false positives that might otherwise be associated with stepwise, or any search for predictors. Meanwhile, the automated search process weeds out non-competitive predictors, so that kind of labor-intensive scrutiny isn’t wasted.
Some of the other suggestions for countering the problems of “searched-for” predictors include using more than one model to average out model bias. That’s another suggestion that makes sense if the trade-off in implementation complexity seems worth it.
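As a quick, purely illustrative sketch of that averaging idea (the model choices here are arbitrary):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge

def averaged_prediction(X_train, y_train, X_new):
    """Average predictions from structurally different models so that no
    single model's selection or estimation bias dominates the result."""
    models = [Ridge(alpha=1.0),
              RandomForestRegressor(n_estimators=200, random_state=0),
              GradientBoostingRegressor(random_state=0)]
    preds = [m.fit(X_train, y_train).predict(X_new) for m in models]
    return np.mean(preds, axis=0)
```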
There’s my two cents worth on stepwise.
Dave
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Quant Finance
Discussion: Ideas for improving already good models
“Sociologists and anthropologists are under no such obligation to force the messy, multi-faceted world of human behavior into the economists’ Procrustean bed. Many of us still do quantitative analyses, but our professional norms place far more value on empirical data than on elegant mathematical models.” From:
http://thesocietypages.org/economicsociology/tag/procrustes/
Posted by Stjepan Anic
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Advanced Business Analytics, Data Mining and Predictive Modeling
Discussion: Ideas for improving already good models
More often than not, we miss out on exogenous variables that have powerful predictive influence on the output variable. In the credit card example quoted in David’s article, we possibly need to include the general buying sentiment of consumers in the area or country in question when predicting revenue. You could have income and age nicely built into the model but still fail the prediction-accuracy test, since the quality of income (implied cash-flow regularity and reliability), the credibility of the employer, and the family liabilities of the cardholder could just as well be powerful predictive influences missing from the model.
These are only some of the fundamental issues. More advanced analysis is required for difficult models that have repercussions in business decision making.
Posted by Debashish Banerjee
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Data Mining, Statistics, and Data Visualization
Discussion: Ideas for improving already good models
Several thoughts in general about models:
1) How robust is your model? Robust statistics seeks to provide methods that emulate popular statistical methods but are not unduly affected by outliers or other small departures from model assumptions. Classical methods rely heavily on assumptions that are often not met in practice; in particular, it is often assumed that the data residuals are normally distributed, at least approximately, or that the central limit theorem can be relied on to produce normally distributed estimates. Unfortunately, when there are outliers in the data, classical methods often perform very poorly. (A quick illustration follows this comment.)
What assumptions do you make, implicitly or explicitly? How far can you push them while the model still produces ‘good’ results?
2) Test with very unlikely possibilities. See David Merkel’s multi-part series on investment modeling at alephblog.com for more on this. This quote of his illustrates the idea: “The idea is model completely, and don’t ignore scenarios that could not happen. My interest rate model had scenarios that mimicked what we actually got, though what we got was not a high probability.”
3) Continually validate the model against real data. This will help you see when today’s well-performing model begins to diverge from reality. Companies will spend $$$ to build a model and then think it is something static, like the quadratic equation, that will always work. In reality, a model balances a number of approximations that, in the best case, happen to produce useful results within a limited scope of data. Seldom do you have a clue when the input data has crossed the boundary and moved outside the set where the model produces good results. Trust, but verify and verify and verify . . .
4) See if there are things that you can simplify that do not impact the quality of the model.
Posted by William Cormier
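A quick illustration of William’s first point, on made-up data: a handful of bad points drags an ordinary least squares slope well away from the truth, while a robust (Huber) fit stays close:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=500)  # true slope is 2.0
X[:10] = 3.0     # ten high-leverage points...
y[:10] = -20.0   # ...with wildly wrong outcomes

print(LinearRegression().fit(X, y).coef_)  # dragged well below 2.0
print(HuberRegressor().fit(X, y).coef_)    # stays much closer to 2.0
```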
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Quant Finance
Discussion: Ideas for improving already good models
David,
I like this discussion. May I suggest a couple of other notions?
#1. Does the magnitude of the trend / regime change matter?
When the model (be it CART or some other regime switch technique) recognizes (or fails to recognize) a new trend, is the trend itself meaningful enough?
If the model captures big regime switches even with slight lag, I would still consider it useful. However, it also depends on the nature of the objective function.
#2. That begs a second question: who is the target audience of the model?
Say, if you are trading just one contract back and forth, big regime changes don’t matter. In fact, big regime changes are more geared towards large position trades. On the other hand, for a market-maker who hedges his/her greeks locally, small signals are their bread & butter.
#3. Following on #1, would you say that regime changes themselves are stochastic variables?
And as such they also have their own distributional properties. Again, it depends on what your action plan is given a signal. Also, in the context of a specific strategy was the regime switch within loss tolerance levels?
It’s like driving a car with a “back-seat driver” (whom you want to tune out from time to time). 😉
#4. Finally, is there room to step in and avoid the pitfall of the self-fulfilling prophecy?
In other words, do you know the strengths and weaknesses of the model to tell when it should not be the guiding factor? — Isn’t this when we humans have added value?
Good luck with the new position! Let us know how it goes.
Thanks!
Posted by Ulan Asanov
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Global Analytics Network (+5K analytic professionals)
Discussion: Ideas for improving already good models
Most interesting and informative discussion.
Posted by Maria Luna Ruiz
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Advanced Business Analytics, Data Mining and Predictive Modeling
Discussion: Ideas for improving already good models
I would add:
* identifying better predictors (combinations and ratios of core metrics, rather than using the core metrics alone)
* leveraging external data sources
* using metrics derived from exploratory analysis (e.g. user segment rather than user ID, category of IP address rather than IP address)
* making sure that you use the right methodology to measure how good your model is; if you think it’s good based on back-testing, then switch to cross-validation and walk-forward tests: these (if correctly implemented from a methodological point of view) reveal the true strength and predictive power of your model (a walk-forward sketch follows this comment)
Posted by Vincent Granville
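A minimal sketch of the walk-forward idea in the last point, using scikit-learn’s expanding time-series split (the model choice is arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit

def walk_forward_score(X, y, n_splits=5):
    """Always train on the past and test on the next block, so the score
    cannot borrow information from the future."""
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = Ridge().fit(X[train_idx], y[train_idx])
        scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(scores))
```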
David Young
March 11, 2011
REPOSTED WITH PERMISSION OF THE CONTRIBUTOR
Group: Financial Risk Management Network
Discussion: Ideas for improving already good models
I’m not sure that something like this on LinkedIn is going to be a powerful hiring tool, but I did like what you had to say. The idea of focusing on factors that can predict when situations are ‘reliable’ or ‘not reliable’ is pretty brilliant.
You obviously know the game and play it at a higher level.
Posted by klancy kennedy