Selecting Mutual Funds from the Stocks They Hold: A Machine Learning Approach (2020)
Bin Li, Alberto G. Rossi
SSRN Working Paper, URL
The third AGNOSTIC Paper on the application of machine learning in manager selection. This week’s paper is very similar to AgPa #65 and AgPa #66, and again examines the data on US mutual funds. The methodology, however, is somewhat different. Most importantly, this week’s authors focus on the characteristics of the fund holdings and train their prediction models on total returns instead of alphas. The results nonetheless point in a similar direction. This, of course, further increases the evidence that machine learning is actually useful for manager selection…
- Week 1: US Mutual Funds – Alphas
- Week 2: US Mutual Funds – Long Only
- Week 3: US Mutual Funds – Total Returns
- Week 4: Hedge Funds
Everything that follows is only my summary of the original paper. So unless indicated otherwise, all tables and charts belong to the authors of the paper and I am just quoting them. The authors deserve full credit for creating this material, so please always cite the original source.
Setup and Idea
For the setup and idea of Machine-Learned Manager Selection, I kindly refer to the respective sections in AgPa #65 and AgPa #66. Having said that, there are some notable differences between this week’s and the previous two papers.
First, the authors again focus on hypothetical long-short portfolios of mutual funds. This is similar to Kaniel et al. (2022) in AgPa #65, but different from the more practical long-only approach of DeMiguel et al. (2023) in AgPa #66.
Second, the authors employ various machine learning models, but use Boosted Regression Trees for most of their analyses. Interestingly, this is different from both previous papers which find Neural Networks (AgPa #65) and Random Forests (AgPa #66) to be the most powerful models.[1]
Third and probably most important, the setup of this week’s study is conceptually different from the previous two. Both Kaniel et al. (2022) and DeMiguel et al. (2023) train their machine learning models on alphas. By contrast, this week’s authors focus on the unadjusted total return of mutual funds. Since Kaniel et al. (2022) also discussed this issue in their paper, I have written an entire section about it in AgPa #65. The general conclusion was that it seems more difficult to train a model on total returns than on alpha. A prediction like “This fund will make 1% next month” is very hard because the stock market itself is so volatile. Removing this component and predicting how funds will do relative to the market or some factor model should therefore increase the signal-to-noise ratio and thus the performance of the models. On the other hand, as this week’s paper shows, you can target total returns if you include the information from factor models as features in your prediction.[2] But more on that below.
Data and Methodology
Similar to the previous two, this week’s authors source their data from the CRSP Survivorship-Bias-Free US Mutual Fund database. They focus on active equity funds and therefore remove target date and index funds from the sample. The authors next merge quarterly fund holdings from the Thomson Reuters Mutual Fund Holdings database. The final data covers 2,980 mutual funds between January 1980 and December 2018.[3]
For the features in their machine learning models, the authors next collect a set of 94 stock characteristics for which the literature found significant relations with expected returns (see table above). While adhering to best practices regarding data management and avoiding look-ahead biases, the authors use their holdings data to aggregate each characteristic for each fund via a simple holdings-weighted average. In a second specification, they first rank stocks by each characteristic and calculate the fund exposure as a weighted average of the stocks’ percentile ranks in the US universe at the respective point in time.
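To make the two aggregation schemes concrete, here is a minimal sketch in plain Python. The function names, the toy universe of four stocks, and the tie-handling rule for ranks are my own assumptions for illustration; the paper's exact implementation may differ.

```python
from bisect import bisect_left, bisect_right

def weighted_average(weights, values):
    """Holdings-weighted average of one stock characteristic for one fund.
    Assumes every held stock has a value for the characteristic."""
    total = sum(weights.values())
    return sum(w * values[stock] for stock, w in weights.items()) / total

def percentile_ranks(values):
    """Percentile rank in (0, 1] of each stock within the cross-section.
    Ties receive their average rank (an assumption, not from the paper)."""
    ordered = sorted(values.values())
    n = len(ordered)
    ranks = {}
    for stock, v in values.items():
        lo, hi = bisect_left(ordered, v), bisect_right(ordered, v)
        ranks[stock] = (lo + hi + 1) / 2 / n  # average 1-based rank, scaled by n
    return ranks

# Hypothetical universe: book-to-market ratios of four stocks at one date.
book_to_market = {"A": 0.2, "B": 0.5, "C": 0.9, "D": 1.4}
# Hypothetical fund that holds three of them.
fund_weights = {"A": 0.5, "C": 0.3, "D": 0.2}

# Specification 1: weighted average of the raw characteristic.
raw_exposure = weighted_average(fund_weights, book_to_market)
# Specification 2: weighted average of percentile ranks in the universe.
rank_exposure = weighted_average(fund_weights, percentile_ranks(book_to_market))
```

The rank-based variant is less sensitive to outliers in the raw characteristic, which is one plausible reason to report both specifications.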
Using this large set of features, the authors train their machine learning models to predict next month’s fund return (t+1) with information available in the current month (t) and repeat this exercise every month. As I already mentioned, the authors decide on Boosted Regression Trees as their primary model mainly because it offers some advantages regarding interpretability. However, they also show that the results are generally similar for other well-known machine learning algorithms like LASSO, ElasticNet, Random Forests, and Neural Networks.
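The monthly "train at t, predict t+1" loop can be sketched as a walk-forward split. This is only an illustration of the timing logic: the expanding training window, the `min_train` parameter, and the function name are assumptions of mine, and the actual model fitting (Boosted Regression Trees in the paper) is deliberately left out.

```python
def walk_forward_splits(months, min_train=3):
    """Yield (train_pairs, feature_month, target_month) triples.

    Each training pair is (feature month j, return month j+1). At decision
    month t, only pairs whose return month is <= t are used, so the model
    never sees information that was unavailable at the time."""
    for i in range(min_train, len(months) - 1):
        train_pairs = [(months[j], months[j + 1]) for j in range(i)]
        yield train_pairs, months[i], months[i + 1]

# Hypothetical six-month sample.
months = ["2018-01", "2018-02", "2018-03", "2018-04", "2018-05", "2018-06"]
splits = list(walk_forward_splits(months))

for train_pairs, feature_month, target_month in splits:
    # A real implementation would fit the model on train_pairs here and
    # then predict target_month returns from feature_month characteristics.
    print(len(train_pairs), feature_month, target_month)
```

Whether the training window expands or rolls with a fixed length is a design choice; the sketch uses an expanding window simply because it is the shorter version to write down.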
To convert the model forecasts into portfolios, the authors simply rank funds according to each month’s prediction and sort them into deciles. They examine both equal- and value-weighted portfolios, although I personally believe that weighting mutual funds by their assets under management is not necessarily comparable to market-cap-weighting for single stocks. The equal-weighted portfolios are therefore the basis of the following results and the value-weighted ones serve as an additional robustness test.
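The sorting step can be sketched in a few lines of plain Python. Fund names, predictions, realized returns, and assets under management below are all made up, and the bucket-assignment rule is one simple choice among several, so please treat this as an illustration and not as the authors' code.

```python
def decile_portfolios(predictions, n_groups=10):
    """Sort funds by predicted return and split them into n_groups buckets.
    Bucket 0 is 'Low', bucket n_groups - 1 is 'High'.
    Assumes at least n_groups funds so that no bucket is empty."""
    ranked = sorted(predictions, key=predictions.get)
    n = len(ranked)
    buckets = [[] for _ in range(n_groups)]
    for i, fund in enumerate(ranked):
        buckets[i * n_groups // n].append(fund)
    return buckets

def portfolio_return(funds, realized, aum=None):
    """Equal-weighted return if aum is None, otherwise AUM-weighted."""
    if aum is None:
        return sum(realized[f] for f in funds) / len(funds)
    total = sum(aum[f] for f in funds)
    return sum(aum[f] * realized[f] for f in funds) / total

# Hypothetical cross-section of 20 funds for one month.
names = [f"F{i:02d}" for i in range(20)]
predictions = {name: i / 100 for i, name in enumerate(names)}
realized = {name: 0.001 * i for i, name in enumerate(names)}
aum = {name: 100.0 for name in names}
aum["F19"] = 300.0  # one large fund in the top decile

deciles = decile_portfolios(predictions)
low, high = deciles[0], deciles[-1]

high_ew = portfolio_return(high, realized)        # equal-weighted High
high_vw = portfolio_return(high, realized, aum)   # AUM-weighted High
spread_ew = high_ew - portfolio_return(low, realized)  # High-Low
```

In the toy data, the single large fund pulls the AUM-weighted return of the High decile towards its own return, which illustrates why equal and value weights can tell different stories.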
Important Results and Takeaways
Machine learning helps to identify outperforming funds
Before starting with the machine learning prediction, the authors first show that simple fund selection approaches are not really helpful. Specifically, they sort funds by each of the 94 characteristics and examine if there are any meaningful differences along the decile portfolios. There are not. Very few of the univariate sorts deliver significant performance patterns. Also remember that with 94 individual sorts, almost 5 of them will look statistically significant by pure chance if you apply the common significance level of 5% (i.e. 95% confidence). The authors therefore really just use this analysis to show the need for more advanced methods.
The key results of the paper are the performance statistics of the Machine-Learned Manager Selection, i.e. the decile portfolios of mutual funds. The table below shows the annualized return above the risk-free rate and Carhart 4-factor (market, size, value, momentum) alphas for the equal-weighted portfolios. For comparison to the machine learning selection (BRT), the authors also include the prediction of a simple linear regression (LR), and the best-performing univariate sort (BEST).
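The "almost 5 by pure chance" number is simple arithmetic for a 5% test (i.e. 95% confidence), sketched below. The second quantity additionally assumes that the 94 tests are independent, which is my simplification; the characteristics are correlated in reality, so it should be read as a rough guide only.

```python
n_sorts = 94   # univariate sorts, one per stock characteristic
alpha = 0.05   # 5% significance level

# Expected number of sorts that look significant although nothing is there.
expected_false_positives = n_sorts * alpha

# Probability of at least one false positive, ASSUMING independent tests.
p_at_least_one = 1 - (1 - alpha) ** n_sorts

print(f"{expected_false_positives:.1f}")
print(f"{p_at_least_one:.3f}")
```

Under these assumptions you should expect 4.7 spurious "discoveries", and observing at least one is close to certain.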
The results are generally very strong. The portfolio of funds with the highest return predictions for the next month (High) generates an annualized excess return of 10.91% over the sample period. This number is pretty astonishing as the total return of the S&P 500 was about 11% per year over the same period. The hypothetical fund selection therefore almost beat the overall market without even adding back the risk-free rate. Also note that the fund returns are already net of fees, but before potential trading costs. Of course, for a proper comparison we would also need information about the risk of the fund selection strategy but such strong returns are already quite promising.
At the other end, the portfolio of funds with the lowest return predictions for the next month (Low) generates excess returns of only 4.23%. That leads to a very sizable spread of 6.68% per year for the hypothetical long-short portfolio (High-Low). Further note that the t-statistics of 3.59 and 3.56 for the High and High-Low portfolios suggest that those returns are also statistically significant. The standalone returns of the Low portfolio, in contrast, are not even statistically different from zero (t-statistic of 1.43). Finally, note that with almost no exceptions, annualized returns increase monotonically along the deciles, which suggests that the machine learning prediction indeed identifies patterns in the data.
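For readers who want to retrace the numbers, the sketch below reproduces the High-Low spread from the two reported excess returns and shows how a plain t-statistic of a mean monthly return is computed. The monthly series is invented, the x12 annualization is a simplification, and the paper may well use adjusted standard errors, so the t-statistic here is only the textbook version.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(returns):
    """Textbook t-statistic for the null hypothesis of a zero mean return:
    mean divided by (sample standard deviation / sqrt(number of months))."""
    return mean(returns) / (stdev(returns) / sqrt(len(returns)))

# Annualized excess returns in percent, as quoted in the text.
high_excess, low_excess = 10.91, 4.23
spread = high_excess - low_excess  # High-Low, in percent per year

# Hypothetical monthly High-Low returns, purely for illustration.
monthly_spread = [0.01, 0.02, -0.01, 0.03, 0.00, 0.01]
t_stat = t_statistic(monthly_spread)
annualized = 12 * mean(monthly_spread)  # simple annualization, no compounding
```

With only six invented months the t-statistic is of course meaningless; the point is merely how the quoted numbers relate to the underlying return series.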
Turning to the 4-factor alphas, the picture looks generally similar but the details are interesting. The High portfolio generates a positive alpha of 2.88% per year, but with a t-statistic of 2.16 this number is not necessarily significant if we apply stricter standards (see AgPa #49, #50, #55). On the other hand, the worst funds (Low) generated a significantly negative alpha of -4.59% per year. So very similar to the results of Kaniel et al. (2022), most of the alpha from Machine-Learned Manager Selection apparently comes from the short side. Unfortunately, as I also mentioned in AgPa #65, it is quite difficult to short mutual funds and capitalize on such predictions in practice.
Let me conclude with some thoughts on those results. First, the strong excess returns and the considerably lower 4-factor alpha of the High portfolio suggest that the machine learning prediction identifies some ideas that are related to the four factors (market, size, value, momentum) in the model. Second, although the risk-adjusted alphas are not super strong, a manager selection that provides a long-only portfolio of mutual funds that makes almost 11% after fees over the risk-free rate is definitely interesting for investors. Third, despite the direction of the results, we shouldn’t be too excited about them. It was hardly possible to implement such machine learning models until maybe the late 2000s and early 2010s. Furthermore, the 94 predictors that the authors use to train them were also gradually discovered in the literature over the sample period. Therefore, there is some kind of look-ahead bias embedded in this analysis.[4] Finally, it would be interesting to see some more practical details of the portfolios like drawdowns or volatility to finally decide if the approach is indeed interesting for practitioners.
The table above again shows the same excess return statistics for various alternative specifications of the prediction model and portfolio construction. I will not go into too much detail, but the overall pattern seems very robust. The authors also provide a whole bunch of further robustness tests which strengthen the evidence that the patterns are not the result of data mining.
The best and worst funds share common characteristics
After observing such pervasive return patterns and profit opportunities, the next question of course becomes what kind of funds the machine selects. In the following table, the authors therefore provide some statistics about fund characteristics of each decile portfolio.
They highlight a few points. First, there is an inverse-U-shaped relation for fund age, assets under management, and number of holdings. What does this mean? Well, both the best and the worst funds are on average young, small, and concentrated. I think this makes intuitive sense because managers who dare to differ from the mainstream increase their chances of both becoming successful and failing badly.[5] A similar logic applies to the U-shaped distribution of expense ratios. Both the best and worst funds are on average more expensive than all others. Again, I think this makes intuitive sense. The best funds are worth higher fees, whereas the worst funds are probably in some part among the worst because they charge too much.[6]
For manager selectors without access to the machine learning predictions, these results are obviously not really helpful. If anything, the table shows that there is no magic shortcut. If you just buy young mutual funds with not too many assets and a concentrated portfolio, you may end up successful. But if you are wrong, chances are high that you will end up at the very bottom of the distribution instead. The authors therefore argue that these patterns support the idea that their Machine-Learned Manager Selection actually separates winners from losers and is better than such simple filters.
Finally, the authors mention that Small and Micro Cap funds are somewhat overrepresented in the High and Low decile portfolios. In my opinion, this fits perfectly with the arguments above. It is a reasonable assumption that there are more alpha opportunities among small and micro caps. At the same time, those areas of the market also tend to be more volatile and risky. Once again, focusing on such stocks increases the chance of both hitting the jackpot and screwing up badly.
Trading Frictions and Momentum are the most relevant variables
In one of the last sections of the paper, the authors go one step further and examine which firm characteristics are most relevant for their models’ predictions. As always, it is inherently difficult to make causal statements, but the authors do their best and use some established methodology to interpret the results of the Boosted Regression Trees algorithm.
Before going into specific variables, they mention an important general point. There is no single variable, and not even a single handful of variables, that predicts mutual fund returns. All 94 are somehow contributing to the prediction and the interaction among them is probably even more important. Further note that the relevance of individual characteristics fluctuates over time. This is not too surprising as most of the input variables are somehow related to the major factors which occasionally go through prolonged periods of underperformance.
With that in mind, the table above provides an overview of the variables that statistically contributed the most. The winners are clearly Momentum and Trading Frictions which might be disappointing for fundamentalists. Again, this is a statistical analysis and the fact that things like value or quality variables don’t show up in this table doesn’t mean they aren’t important. Interestingly, however, these results are pretty much in line with the findings in AgPa #65 and AgPa #66 that also find measures of past performance and momentum to be among the most important.[7]
Conclusions and Further Ideas
This is the third paper that examines the CRSP US Mutual Fund database and explores Machine-Learned Manager Selection. All three papers are similar in that they attempt to predict mutual fund performance by fund or holding characteristics and machine learning models. The details obviously differ, but I think the results are generally very similar. Needless to say, this is the way science should work. When three teams of authors more or less independently find similar patterns in the same data, it increases the chance that those patterns are actually there. To conclude, let me therefore summarize the common results of all three papers and the points where they deviate from each other.
- At least in hypothetical backtests, machine learning models seem to be useful for fund selection and outperform simpler approaches like linear regression or basic filters.
- From a risk-adjusted perspective, however, most of the alpha comes from shorting the funds with the worst machine-learning predictions for future returns. This is difficult in practice.
- Which features are relevant depends on what you predict. For alphas, fund characteristics are more important because the holding characteristics are already largely absorbed by the alpha calculation. Total returns are generally harder to predict, but it is possible if you include both fund characteristics and the stock characteristics of the underlying holdings.
- All results are quite robust and survive various practical difficulties like expense ratios, fee structures of different share classes, different weighting rules, and in some analyses even front or back loads.
While I really like the research and believe all of this is quite promising, the literature is still in its early stage. The three papers are not yet published (probably in the process of doing so) and so far, the analyses are limited to the US and equity funds. The natural way to build up more evidence is therefore, as so often, out-of-sample tests across regions, asset classes, and time periods. To finally conclude this series on Machine-Learned Manager Selection, we will look at such an analysis for a sample of hedge funds next week.
Endnotes
[1] Ask three people about the best model and you get three different answers. Welcome to the messy world of statistical analysis in financial markets…
[2] Despite generally arguing in favor of alpha-predictions, Kaniel et al. (2022) also mention this in their paper.
[3] This is pretty comparable to the sample of 3,275 mutual funds between 1980 and 2019 from Kaniel et al. (2022) in AgPa #65. However, it is less than the 8,767 unique mutual fund share classes from DeMiguel et al. (2023) in AgPa #66. This is mainly because the latter do their analyses on the share-class level whereas the first two aggregate them.
[4] Admittedly, this critique applies to virtually all longer backtests of machine learning strategies.
[5] You need to be different from everyone else to get different results than everyone else. Unfortunately, that works in both directions…
[6] Overcharging is probably the biggest problem of active management (see AgPa #17 and #35).
[7] In some part, this certainly comes from the monthly frequency of the machine-learning predictions (prices simply fluctuate more in one month than fundamentals). On the other hand, DeMiguel et al. (2023) in AgPa #66 forecast 12-month returns and also find momentum to be among the most important features.