Mark Kritzman: Thank you, Kayla, and thank all of you for being here this afternoon. I'm going to talk about some research that I've done with my colleagues at Windham and State Street, Megan and David. It's research about a new way of forming predictions from data, and I'm going to illustrate it with two applications: one in finance and one in basketball. The basketball application, I think, is relevant in that it shows the versatility of this methodology, and it's the first time we ever applied it cross-sectionally, so it was a lot of fun doing that. Let me just give you an overview of relevance-based prediction first. It's a new forecasting routine, and it forms predictions as weighted averages of observed outcomes that vary either through time or, as I just mentioned, cross-sectionally, in which the weights are a precise statistical measure called relevance. You'll see over and over again in this talk that this is in no way arbitrary; it's very theoretically justified. Relevance has two components. One is similarity, the other is informativeness, and they are both measured by Mahalanobis distances. The approach also depends crucially on a measure called fit, which measures the average alignment of relevance and outcomes across all pairs of observations that go into a prediction.
It can also be thought of as the squared correlation of relevance and outcomes, and it tells us how much confidence you should assign to an individual prediction. All of you, I'm sure, have used ordinary least squares more times than you can count, and the confidence you have in your prediction is basically the R squared of the regression model. But what most people don't reflect on very often is that that R squared reflects both good predictions and bad predictions, right? It's just an average. What we're going to do is show you how you can measure the specific trust you should have in an individual prediction. We also use fit to identify the optimal combination of observations and variables to use in your prediction. So there are several advantages here. One is that relevance-based prediction addresses complexities, such as asymmetry, that are beyond the reach of linear regression analysis. Then, if you compare it to machine learning approaches, it's more transparent and more adaptive than model-based machine learning algorithms, and it is theoretically justified. It's also a sophisticated extension of model-free algorithms. And it uniquely converges to the same answer you'd get from linear regression analysis across the entire sample of observations.
So here we go.
This is a visual presentation of relevance. This is a scatter plot, and this point is the current circumstances, the circumstances from which you are forming a prediction. What we're going to do is measure the relevance of two observations. This observation is moderately relevant because it's similar to current circumstances. Observations that are similar are more relevant than those that are not. This other observation is highly relevant because it's equally similar to current circumstances, but it's much further away from the average, so it's much more unusual. Observations that are like current circumstances but different from average are more relevant than those that are not. Okay. So this is a bit contrary to the way we learned statistics, right? We were taught to be suspicious of outliers: either they reflect results that are not likely to recur, so why pay attention to them, or they might be data errors. Well, clearly, if they're data errors, that's a problem. But if they're simply not likely to recur, they still could be extremely informative. So that was a visual representation of relevance. Now I'm going to show you in some detail how to compute it mathematically. As I mentioned, it's composed of similarity and informativeness, and they're both measured as Mahalanobis distances. So this is how we measure similarity. Let me not talk about the one half for now, because that's a bit tricky.
I'll come back to it, I promise. But what you see is this: think of x_i as a vector. It's an observation, a vector of the values of the variables. And x_t is the vector of the values of those same variables for the current circumstances. So x_i minus x_t just tells you, variable by variable, how close an observation is to current circumstances. Then we multiply by the inverse of the covariance matrix. That takes into account the interactions, and it also divides by variance, so it standardizes everything. So we now have a vector, right? A vector times a matrix gives us a vector, and we post-multiply by the transpose of x_i minus x_t, and that gives us a single number. Okay, so that's how we measure similarity, but I'll come back to the one half. We also use the Mahalanobis distance to measure informativeness. In this case, instead of considering how far an observation is from current circumstances, we want to know how far it is from average circumstances. So this term tells us the informativeness of each observation, and this one tells us the informativeness of the current observation, which we need just to center this measure of relevance on zero. Positive numbers mean that the observation is relevant; negative numbers mean the observation is not relevant. Now let me just come back to the minus one half.
So when we're measuring similarity, we need to multiply by a minus sign to convert a measure of distance into a measure of similarity. That's why we have a minus sign there. And then the one half is because the average squared distances of observations from each other are twice as large as the squared distances of observations from the average, so we have to multiply by one half. So it turns out that we measure relevance in this very precise way. Now, as I mentioned, this is not arbitrary. It's based in part on information theory, which tells us that the information given by an observation is the negative logarithm of its likelihood. And this was taught to us by, in my view, the greatest genius of the 20th century, Claude Shannon. Information theory also tells us that the normal distribution has maximum entropy for a given variance. In addition, the central limit theorem motivates the normal distribution as a common limiting case of many random processes, and the Mahalanobis distance measures the information in an observation from a multivariate normal distribution, or between a pair of observations. So you see Claude Shannon and you see Mahalanobis there.
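To make that verbal description concrete, here is a minimal sketch in Python of how relevance might be computed from the pieces just described: similarity as minus one half of the squared Mahalanobis distance from current circumstances, plus the informativeness of the observation and of the current circumstances. The function name and implementation details are my own illustration, not the authors' code.

```python
import numpy as np

def relevance(X, x_t):
    """Relevance of each prior observation to current circumstances x_t:
    similarity plus informativeness, both from Mahalanobis distances."""
    X = np.asarray(X, dtype=float)        # N observations x K predictive variables
    x_t = np.asarray(x_t, dtype=float)    # current circumstances (length K)
    x_bar = X.mean(axis=0)                # average circumstances
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

    def sq_mahal(a, b):
        d = a - b
        return d @ inv_cov @ d            # squared Mahalanobis distance

    sim = np.array([-0.5 * sq_mahal(x_i, x_t) for x_i in X])     # similarity to x_t
    info = np.array([0.5 * sq_mahal(x_i, x_bar) for x_i in X])   # informativeness of each x_i
    info_t = 0.5 * sq_mahal(x_t, x_bar)   # informativeness of x_t, which centers the measure on zero
    return sim + info + info_t            # positive: relevant; negative: not relevant
```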
Now, let me give you an illustration of why it is that unusual observations contain more information than common observations. This is one of the key ideas of relevance. Suppose you asked ten people each to choose a number from 1 to 10, and then you consider the sum of the numbers that they chose. Well, suppose the sum is 100. First of all, that's very unusual: it's 1 in 10,000,000,000. But you know everything there is to know. They all chose the number ten. There's no more information to be gained. So you have something that's very, very unlikely, but super informative, fully informative. If the sum instead were 99, that's ten times as likely, but you lose one piece of information: you know somebody chose a nine, but you don't know who. And if the sum is 98, again, that's much more likely, but you lose more information. So what you see is that the probabilities are changing multiplicatively and the information is changing additively. That's why you get this relationship of an inverse logarithmic function. And as the sum moves toward the middle of the distribution, you know almost nothing. Common observations tell you almost nothing; unusual observations are where the information is. Now, the other reason that relevance is not arbitrary is that it converges to precisely the same answer that you would get from linear regression analysis if you use all of the observations in your sample. The top formula shows you how you get a prediction from relevance: the prediction is literally the relevance-weighted average of past outcomes. And that gives you exactly the same answer you'd get from ordinary least squares over the full sample.
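Here is a small sketch of that weighted-average prediction and of the full-sample convergence to ordinary least squares. It reuses the relevance() function from the previous sketch, and the specific normalization, anchoring at the mean outcome and dividing by N minus 1, follows my reading of the published relevance papers, so treat it as an assumption.

```python
import numpy as np

def relevance_prediction(X, y, x_t):
    """Relevance-weighted average of past outcomes (relevance() from the
    sketch above).  The normalization, anchoring at the mean outcome and
    dividing by N - 1, is my reading of the published papers: an assumption."""
    r = relevance(X, x_t)
    y = np.asarray(y, dtype=float)
    return y.mean() + (r * (y - y.mean())).sum() / (len(y) - 1)

# Full-sample check: with every observation included, the relevance-weighted
# prediction matches the ordinary least squares fitted value at x_t.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=200)
x_t = rng.normal(size=3)
beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(X)), X]), y, rcond=None)
print(relevance_prediction(X, y, x_t), np.r_[1.0, x_t] @ beta)  # should agree
```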
And Carl Gauss, back in, I think it was 1809, gave a great theoretical justification of why ordinary least squares gives the best prediction if you have a linear relationship. But this equivalence of relevance-based prediction with linear regression reveals a really intriguing insight. It reveals that linear regression places as much importance on non-relevant observations as it does on relevant observations; it just changes the sign of their effect. So let me give you an example of why that might not be a good thing to do. Let's suppose you're not feeling well and you visit your doctor, and she examines you and then prescribes some medicine, and you ask her, well, how should I expect this to work? How long is it going to take? Is it going to successfully treat my illness? And your doctor says, oh, yes. We've tested this medicine on a very large sample of people who have the exact opposite health profile that you have, and they all died. So this is going to be great for you. Well, that sounds funny, right? You would not really be inspired by your doctor if she told you that. But this is what linear regression analysis actually does. Anyway, Meg, Dave and I thought that maybe it might be better to form a prediction from just part of the sample, that part being the relevant observations.
So that's called partial sample regression, and there you see the formula for that. All we do here is set a threshold for relevance and select the observations that are beyond that threshold. We censor the others, and by censoring, we're introducing conditionality. Okay, so that's partial sample regression: we're just using the relevant subsample to form the prediction. So let me give a finance application. It's often the case that you expect a certain kind of economic environment, a regime, to prevail, and you need to forecast returns and risk and correlations based on the assumption that this regime is going to prevail. The typical way people do this is in a binary fashion. Let's say the regime is inflation. You pick an inflation threshold. For periods where inflation is below that threshold, you say that's our non-inflationary regime, and for periods where it's above that threshold, you say that's our inflationary regime. And then you just take all of those periods and equally weight them to come up with your prediction. That's the binary approach. That's what most people in this room probably do. Well, instead, what we're saying is that you shouldn't think of a regime as a binary, yes-or-no kind of outcome. You should think of degrees of being in a regime, a measure of, I'll make up a word, regime-ness. What you want are the prototypical circumstances of a regime.
And then you want to know how relevant each period is to those circumstances. That's what these weights are. If you were to do the binary approach, the weights would all be the same. So let me just review a comparison here. With binary weights, you typically define a regime based on a single indicator and equally weight the observations to come up with your forecasts of return, risk, or correlations. It assumes that all periods are equally useful in predicting a regime outcome, and it also makes it difficult to have more than one regime indicator in a principled way; I don't know how you would do that in a principled way. And it also risks the possibility that you're going to come up with a regime that has no observations in it, right, if you have too many indicators. Our approach instead uses partial sample prediction: it weights past outcomes by the relevance of each period, and the weights reflect the extent to which a period is characterized by the prototypical circumstances of a regime. It allows a period to be connected to more than one regime, so the same period can be in more than one regime to different degrees. And it also enables us to consider regimes that have never occurred in history.
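As a rough illustration of the contrast, here is a sketch of the two approaches to a regime-conditioned forecast. The positive-relevance cutoff and the normalization of the weights are illustrative choices on my part, not necessarily the authors' exact procedure; relevance() is the earlier sketch.

```python
import numpy as np

def binary_regime_return(returns, inflation, threshold):
    """Binary approach: call every period above the indicator threshold 'the
    regime' and equally weight those periods."""
    in_regime = inflation > threshold
    return returns[in_regime].mean()

def relevance_regime_return(returns, indicators, prototype):
    """Relevance approach: score every period's relevance to a prototypical
    set of regime circumstances (which can combine several indicators and
    need never have occurred exactly) and weight outcomes accordingly.
    Keeping only positive relevance and normalizing is an illustrative
    choice; relevance() is the earlier sketch."""
    r = relevance(indicators, prototype)
    w = np.maximum(r, 0.0)
    return (w * returns).sum() / w.sum()
```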
So that's my very brief finance example. Let me carry on with the math, and then we'll get to our basketball example. I want to talk about fit. I think fit is really a very, very powerful measure. It tells us how confident we should be in each of the predictions separately from the overall confidence in a prediction model. We can express fit as the sum of the products of normalized z-scores. The idea is that we take a pair of observations and we look at the relevance of each observation and the outcome of each observation. If both observations have high relevance and they both have high outcomes, then they're aligned. If they both have low relevance and low outcomes, they're aligned. And if one has high relevance and the other low, and the outcomes line up the same way, one high and one low, they're still aligned. So if there's strong alignment, that means there's strong fit, and we calculate fit across every pair of observations in our sample. That tells us how much confidence we should have in our predictions. And we haven't even made a prediction yet; we've just looked at how aligned the relevance is with the outcomes. You can also calculate it as just the squared correlation between the relevance weights and the outcomes. This calculation is only possible because we know each observation's contribution to the prediction: it's the relevance. That's how the prediction is formed; it's a relevance-weighted average of the outcomes.
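In the squared-correlation form, fit is very compact to compute. A minimal sketch, assuming the relevance weights for one prediction task are already in hand:

```python
import numpy as np

def fit(r, y):
    """Fit of one prediction task: the squared correlation between the
    relevance weights r and the observed outcomes y.  High fit means
    relevance and outcomes are strongly aligned across pairs of observations."""
    return float(np.corrcoef(r, y)[0, 1] ** 2)
```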
So we have a way of measuring the quality of a prediction before we even make the prediction. Now, where it really gets interesting is that if we were to look at fit across all of the predictions, it would actually aggregate up to the R squared of linear regression analysis. So you can get to R squared without ever doing a regression analysis, without ever calculating sums of squares or anything like that. You can think of fit as the fundamental building block of R squared, but it's so much more powerful than R squared. It's so much more nuanced, because it applies to a specific prediction task. And as you'll see, fit varies a lot across prediction tasks, which tells you that R squared is not necessarily a very reliable indicator of the quality of an individual prediction. It only measures the quality on average across all the predictions. Now, for partial sample regression, rather than choose the relevance threshold arbitrarily (remember, I said for partial sample regression we're going to pick a subsample of relevant observations), well, what should the threshold be? You don't want to just do it arbitrarily. What we can do is iteratively change the threshold, which we call r-star: continually raise it, which shrinks the subsample, to the point where fit is maximized.
When fit is at its maximum, that's the optimal subsample to use. The trade-off is that as we raise r-star, we're focusing on a narrower and narrower sample of observations which have greater fit, but at the expense of introducing more noise. So if we maximize fit, it's going to optimally trade off noise against the fit of the subsample and tell us exactly which observations to use. It turns out that the higher the fit, the more likely it is that we're going to get a reliable prediction. But this approach will also be better than regression analysis to the extent that there's asymmetry. Regression doesn't work if there's asymmetry between the predictive variables and the outcomes; that's why people have turned to machine learning to deal with that kind of complexity. So we're offering an alternative. What we're really interested in is not just how strong the alignment is, which is what fit tells us, but also how much asymmetry there is in the relationship. The greater the asymmetry, the better this is going to do compared to regression analysis. The asymmetry is given by this equation: it just compares the fit of the retained observations with the fit of the censored observations, and that difference tells us how much asymmetry there is. If the fit is the same, we effectively have a linear relationship and we can just use regression analysis. If the fit is not the same between the retained and censored subsamples, then we want to use this approach.
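Here is a sketch of the asymmetry measure and of a simple grid search over the threshold. The talk maximizes an adjusted fit that also accounts for the number of predictive variables; the score used below, retained-subsample fit plus asymmetry, is only an illustrative stand-in for that measure, and fit() is the earlier sketch.

```python
import numpy as np

def asymmetry(r, y, threshold):
    """Difference between the fit of the retained observations (relevance at
    or above the threshold) and the fit of the censored ones; near zero
    suggests an essentially linear, symmetric relationship."""
    keep = r >= threshold
    if keep.sum() < 3 or (~keep).sum() < 3:
        return np.nan
    return fit(r[keep], y[keep]) - fit(r[~keep], y[~keep])

def choose_threshold(r, y, candidates):
    """Grid search over the relevance threshold r*, scoring each candidate by
    retained-subsample fit plus asymmetry (an illustrative stand-in for the
    adjusted fit described in the talk)."""
    def score(t):
        keep = r >= t
        if keep.sum() < 3 or (~keep).sum() < 3:
            return -np.inf
        return fit(r[keep], y[keep]) + asymmetry(r, y, t)
    return max(candidates, key=score)
```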
Now, there's one final thing I want to mention, which is codependence. I've just described how we can come up with the optimal subsample, but that subsample is for a specific set of variables. It may not be optimal for a different set of predictive variables. So the idea here is that the choice of the subsample of observations and the choice of the variables are codependent. We can use this adjusted fit, which also accounts for the number of predictive variables, to simultaneously get the optimal combination of predictive variables and observations, and that optimal combination is going to be specific to each individual prediction task. So that's what this approach to regression does: it recognizes that the usefulness of observations and variables is not static, nor mutually independent; it depends on the prediction task's unique circumstances. Now, you all are familiar, I suppose, with lasso regression, which tells us how to select variables, though in a less principled way than what I've just described. And kernel regression tells us how to select and weight observations. But those are typically applied independently. Nobody has bothered to recognize that each one depends on the other: the choice of the variables depends on the observations you're using, and your choice of observations depends on the variables you're using, so you don't want to apply these independently.
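As a sketch of what choosing observations and variables jointly might look like for one prediction task, under the simplifications already noted (the score is a stand-in for the adjusted fit, and the variable sets, thresholds, and helper functions come from the earlier sketches):

```python
import numpy as np

def select_combination(panel_X, y, x_prospect, variable_sets, thresholds):
    """For one prediction task, try each candidate set of variable columns,
    find the threshold that maximizes the fit-based score, and keep whichever
    combination scores best.  relevance(), fit() and choose_threshold() are
    the earlier sketches; the score is a stand-in for adjusted fit."""
    best = None
    for cols in variable_sets:
        r = relevance(panel_X[:, cols], x_prospect[cols])
        t = choose_threshold(r, y, thresholds)
        keep = r >= t
        score = fit(r[keep], y[keep])
        if best is None or score > best[0]:
            best = (score, cols, t)
    return best  # the prediction is then formed from the retained, relevance-weighted observations
```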
So this notion of regression basically replaces both lasso regression and kernel regression in one single process. Now, Dave and Meg and I have given this talk a lot, all over the world, for the last few years, and invariably somebody in the audience will say, isn't this just kernel regression, what you're doing? And so we explain why it's not. We've actually written about this in some of the papers we've published; we've shown the specific steps you would need to take to convert a kernel regression into relevance-based prediction. But rather than go through those steps, I thought I would try to explain the difference by analogy. In terms of kernel regression and relevance-based prediction, they both want to make forecasts, and they both make forecasts by weighting past observations. Well, here you have on the left a kind of lawnmower that I used when I was a kid, and on the right is a robotic lawnmower. It doesn't require a driver; it just goes around and automatically adapts to its circumstances, just like relevance-based prediction adapts to its circumstances. And it even has this attachment to pick up the clippings. I thought of using a scythe instead, but I thought that would be a bit too harsh.
So anyway, if you are ever tempted to think that relevance-based prediction is just a kernel regression, I want you to keep this image in mind. All right, let's go to basketball. What we're going to do is use this approach to predict outcomes for NBA draft prospects. Specifically, we're going to predict a statistic called BPM, which stands for box score plus-minus. It's a composite that takes into account box score statistics that we're all familiar with (points scored, rebounds, assists, etcetera), team efficiency, and player positions. So it's a composite that's meant to measure how much a player is contributing to the success of the team while that player is actually playing. This is just one example; we thought it would be a good one since it's so widely used. So that's our prediction task and our training sample: we're going to predict this for all of the players that we have data for who entered the draft in 2018, and our training sample is the players who were drafted in previous seasons, the players that came before them, from the 2011 through 2017 seasons, all from Division I schools, who played at least one season in the NBA. And we're going to use a variety of predictive variables that you see listed there, and they fall into four categories.
One is performance in college; another is data from the combine; and then there are non-player factors, like how good the college that they played at is, and the team that they went to. So those are the categories of predictive variables. Some, you can see, are blue and some are black. We want to illustrate the idea that this approach can consider different combinations of predictive variables, so for convenience we're going to consider just two combinations: all of the predictive variables, and a subset of what we call the seven key predictive variables. In principle, you would test every combination of predictive variables for every subsample size. So we choose the combination of selection criteria (we're going to select the sample based on either relevance or just the similarity component, and we'll let fit decide which is better) and predictive variables that maximizes the prediction's adjusted fit. So here's a bit of a summary. The panel on the top tells you whether we used all of the variables or the key seven; that's the color. If it's blue, we used all of the variables; if it's green, we used just the seven key variables. The height of the bar tells you the fraction of previously drafted players that we used. Now, on the bottom, you see this measure called adjusted fit, which comprises two things. One is the fit, the alignment of relevance with the outcomes, and the other is asymmetry. So we would have confidence in a prediction that had high alignment and high asymmetry.
So look at, I don't know if you can read the names of those players, and I think that's the order in which they were drafted, but look at number four. Who is four? Mo Bamba. Okay, so there's no asymmetry. And what happens? This approach just reverts to linear regression analysis. It uses the full sample, which is what it should do. It's not going to improve upon linear regression, because there's no asymmetry between relevance and the outcomes for Bamba's prediction. And you see that with the other ones that are highlighted. Then over here, let's look at number 14. That's Donte DiVincenzo, who just got traded over the summer to the New York Knicks from Golden State. High fit and very significant asymmetry. And look what happens: relevance-based prediction selects a tiny subset of observations to form this prediction. And another great feature is that you get so much transparency into the prediction, which is good for preventing overfitting, for example. So this shows you Donte's numbers there in the left column, and you see he went to Villanova. I actually saw him play when he was in college, at Madison Square Garden. He was just a freshman at the time, and I was super impressed.
Anyway, you see the data for him, and then you see the three players who were drafted before him that were most relevant to forming his prediction, and you see the three players on the right that were least relevant to forming his prediction. Those players were censored from the prediction task; they were just kicked out, and Donte's prediction was formed by the three most relevant players and a few other relevant players. The main point here is that you have all of this transparency into how the prediction is being formed. Now let's look at another player. This is Mikal Bridges, who plays for the Nets. What you see here is that in this case, relevance-based prediction selected only the key seven variables to form the prediction; it didn't use the full set of variables. You see he also went to Villanova. And the other thing that you notice is that one of the most relevant players to forming his prediction is this guy Josh Hart, who also went to Villanova, and they all won national championships playing for the same coach, Jay Wright. So it's sort of interesting: this is just a statistical process, it's blind to whether it's Villanova or Jay Wright or whatever, but it's keying in on players who were coached by this particular person. So let's see how this does. As a benchmark, we just use vanilla linear regression analysis, and this is a measure of the correlation of the predictions with the actual outcomes.
So we have predictions before they enter the NBA, and then we know what they actually did. This is the correlation of our predictions of their BPM with what their BPM actually turned out to be in their rookie season. And you can see that even just using simple OLS, there's a decent relationship. Then we use partial sample regression. This is where we just select the optimal subsample but ignore the fact that we have the flexibility to choose different predictive variables. That improves significantly upon OLS. And then, when we give it a choice of these two different sets of variables, it improves even more. If we gave it a choice of lots of different sets of variables, I would imagine that would go up quite a bit. But then look at this. This, to me, is something I don't know of any other prediction system that can do, certainly not machine learning, certainly not regression-based approaches. Relevance-based prediction predicts the quality of the predictions before you even form the prediction. It's telling you: is this going to be a good prediction or a bad prediction? It's adjusted fit. If adjusted fit is high, it's going to be a good prediction. If adjusted fit is not high, it's not going to be a good prediction. It's telling us that ahead of time, before we make the prediction.
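A small sketch of the kind of evaluation being described here: correlate predicted rookie BPM with realized rookie BPM, overall and then split by whether each prediction's adjusted fit cleared a reliability cutoff. The arrays and the cutoff are hypothetical placeholders.

```python
import numpy as np

def prediction_quality(y_hat, y_actual, adj_fit, cutoff):
    """Correlation of predictions with realized outcomes: overall, for the
    predictions judged reliable in advance (adjusted fit at or above the
    cutoff), and for the discarded ones.  Inputs are hypothetical arrays."""
    keep = adj_fit >= cutoff
    overall = np.corrcoef(y_hat, y_actual)[0, 1]
    trusted = np.corrcoef(y_hat[keep], y_actual[keep])[0, 1]
    discarded = np.corrcoef(y_hat[~keep], y_actual[~keep])[0, 1]
    return overall, trusted, discarded
```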
This foreknowledge enables us to discard predictions that we know in advance aren't that trustworthy. So now look at this. The bar that has the correlation of 0.52 between the predictions and the outcomes: those are the predictions that we judged in advance to be reliable, before we knew how they turned out. And the final bar over on the right shows the predictions that we discarded, the ones we knew in advance were going to be unreliable. So I think there are two key takeaways on this slide. One is the difference between the 0.19 for OLS and the 0.52: this approach nearly triples the reliability of OLS. And then the difference between the 0.52 and the 0.09 tells you how good adjusted fit is at distinguishing, ahead of time, predictions that are going to be good from predictions that aren't. This would be extremely valuable to a team; it would tell them how committed they want to be to a draft prospect. And it would be very valuable to investors, because it would give them a really good way to scale their bets, if you know the quality of each bet you're making ahead of time. So let me just compare this approach, as a sort of summary, with linear regression and machine learning, based on three criteria, three principles: transparency, adaptability, and non-arbitrariness.
So blue is sort of okay: not great, not horrible. Green is great. Red is horrible. As far as OLS goes, on transparency it's not great and it's not horrible. It doesn't tell us how any of the individual observations inform the prediction, so it's not transparent in that way. It's not adaptable: it only works if you have a linear relationship, and it can't deal with any kind of complexity. But it scores very well on non-arbitrariness; it's very theoretically grounded, by none other than Carl Friedrich Gauss. Now let's consider machine learning. We like to think of machine learning in two categories: model-based algorithms and model-free algorithms. The most powerful machine learning algorithms are notoriously opaque. We have neural networks, some of which have more than a trillion parameters, so you get an answer, but you have no idea why you got that answer, and therefore you have no idea whether that answer is spurious or not. It's also not very adaptable. You build a very, very complicated neural network and you've captured a ton of information from the historical sample that you have, but if circumstances change, and if those circumstances aren't in that sample, it doesn't automatically adapt. You have to go back, start all over again, and build a new model. So it's not adaptable. And as far as being theoretically grounded, it's just empirically driven; it's a trial-and-error thing.
There's no theory. Then we get to relevance-based prediction. Transparent? Super transparent. You see precisely how each observation informs the prediction, and you have all of this insight into the quality of the predictions even before you make them. It's adaptable: it automatically adjusts to new circumstances. It retrieves a different set of relevant observations, based on their similarity and informativeness, for every new circumstance, and it automatically uses the full set of data that you have available for each new prediction task. So you can think of machine learning as sort of a fixed approach, with a lot of investment in it, and relevance-based prediction as more of a dynamic approach that keeps rebuilding itself for every new prediction. And as far as being non-arbitrary, it's justified by the central limit theorem, by information theory, and also by these two really cool convergences: across the full sample of observations, the answer converges to what you'd get from OLS, and across all of the prediction tasks, the fit aggregates up to the R squared. So it's got this mathematical cohesion. Okay, I was told to quiz you, and I will be stunned if you don't all get the right answer. So go ahead and choose the answer.
Wow. I'm stunned.
Mark Kritzman: Who got it right? Can you raise your hands? Okay. Well, I guess I didn't give a very clear presentation.
Speaker2: I'm sorry.
Mark Kritzman: I accept responsibility. Okay, I'm going to move on. This is too painful. That's the end. Okay. All right. That's it. That's the end. Thank you.
Speaker3: That was very compelling and eye opening. Pardon me. It's very compelling.
Speaker2: Who are those people?
Speaker3: We'll find them, don't worry. So I believe we have time for at least one question. Do we have any questions out in the crowd? Someone, I think, should be passing around a microphone. If not, you can also submit your questions on Slido. Okay. No brave souls; I think they're a little timid after that. Anyway, there are a couple that have been submitted from the virtual audience. I'm going to ask Kayla.
Speaker2: Yes. Okay. Okay.
Speaker3: Let's see. Okay. Why is such a great tool so underused? What are the drawbacks? Overfitting? Under which conditions would this likely underperform linear or lasso regression?
Mark Kritzman: I think it's not widely used because it's new. It hasn't been widely exposed. I mean, Dave and Meg and I have gotten in front of a lot of people, and I'm hoping they're using it, but I think that's the reason. I think it will be very widely used fairly soon. And then overfitting: you're much, much better off using this approach if you're worried about overfitting, because you have all of this transparency into how the prediction is being formed. You can see how each individual observation contributes to the prediction, so if there's something weird going on, you can figure that out. And then, when would it not do as well as linear regression?
Speaker2: And lasso.
Mark Kritzman: And lasso regression. I think it would never do worse than lasso, and it would not do as well as linear regression if there's a truly linear relationship, because what this will do is push you toward a relatively small relevant subsample, so it's going to be noisier. So it's likely that there are many circumstances in which linear regression is going to do better than this for that reason: if the relationship is linear, then you get a less noisy prediction because you have a larger sample of observations.
Speaker3: Wonderful. Well, thank you so much.
Mark Kritzman: Okay. Thank you all. Thanks, Kayla.