Predicting Performance Using Consumer Big Data (2022)
Kenneth Froot, Namho Kang, Gideon Ozik, Ronnie Sadka
The Journal of Portfolio Management 48(3), 47-61
After the last posts on the global market portfolio and the long history of financial markets, this week’s AGNOSTIC Paper is again more related to my other content. The authors use proxies for in-store activity, brand awareness, and web traffic to predict fundamentals and returns of consumer-oriented companies. I like the paper because it examines alternative data and is published in a peer-reviewed journal. Many other studies are just white papers from data providers which, unsurprisingly, find that alternative datasets are insanely important and everybody should buy them.[1] So let’s look at a more scientific analysis. As usual, I divide this post into several parts and start with some background on alternative data.
Everything that follows is only my summary of the original paper. So unless indicated otherwise, all tables and charts belong to the authors of the paper and I am just quoting them. The authors deserve full credit for creating this material, so please always cite the original source.
Setup and Idea
With the exponential development of big data in recent years, there are now more data sources available than ever before, and investors have used them very creatively: satellite images of parking lots to predict store sales,[2] corporate jet movements to identify merger negotiations,[3] Amazon prices to construct real-time inflation measures,[4] and there are many more examples. All of this is now known as alternative data and has become an industry in itself.
What is the difference between alternative and non-alternative[5] data? There is no clear definition. A common theme, however, is that alternative data is somehow unstructured.[6] Unstructured simply means that the data is not readily available in columns and rows. For example, stock prices are recorded with a time stamp and are easy to analyze. This is very different for a video. To convert a video into analyzable data, you need some advanced methodology first. Therefore, it is alternative.
Another part of the definition is based on data sources. Non-alternative data is basically everything that is available from the standard services, for example prices and fundamentals. In contrast, data sources are virtually unlimited for alternative data. To provide some structure, the authors cite a study from J.P. Morgan Macro Quantitative and Derivatives Strategy, which clusters alternative datasets into three categories.
- Data from Individual Activity
- Data from Business Processes
- Data from Sensors
Individual-activity datasets capture all sorts of consumers’ actions, for example web searches, social media activities, or the use of discount coupons. Business-process data include, for example, credit card transactions or the number of open positions on a firm’s website. Finally, I guess that sensor data is self-explanatory. The most popular example from this category is satellite imagery used to count the number of cars in front of restaurants and malls.
The industry around alternative data has grown tremendously over the last few years by virtually every measure. According to the authors and AlternativeData.org, an industry website, there are now more than 400 alternative data vendors to choose from. Based on their estimates, revenues in this industry more than quadrupled since 2017 to about $1.7B in 2020. The demand is also growing. In 2017, already more than 70% of funds used or planned to use alternative data in their investment processes.[7]
Alternative data is therefore an important trend in the investment industry. In my opinion, this is not surprising. Active investing has always been about identifying superior information to generate outperformance. All of the novel data sources are potentially useful for that. So in a somewhat efficient market, we would expect that investors use them. But as with many other things in finance, there is a danger of overhyping and alternative data has limitations. Rigorous scientific analysis is therefore important and I appreciate the authors’ effort in this direction.
Data and Methodology
The authors use a proprietary dataset from MKT MediaStats LLC, a research firm. Specifically, they obtain three proxies for consumer behavior. Therefore, they only capture Data from Individual Activity. The sample ranges from 2009 to 2020 and only covers companies in the United States.
The first proxy, IN-STORE, measures the activity of consumers at retail stores. It combines various data sources, for example searches for driving directions to stores and downloads of discount coupons. Due to the nature of this data, it is only available for 66 retail firms.
The second proxy, WEB, captures consumer visits to firm websites. According to the authors, the research firm constructs this dataset from “[…] a panel of a few tens of millions of individual Internet users.” (p. 49). As this data is more general, it is available for 312 firms in various industries. Most of them are of course consumer-oriented.
The final proxy, BRAND, measures consumer interest in product brands. Most of it is based on social media activities and internet searches. The data is available for 250 firms in different industries. Again, most of them are consumer-oriented.
The non-alternative data comes from CRSP (stock prices), Compustat (fundamentals), and IBES (analyst forecasts). These are the standard high-quality databases for the US stock market and there shouldn’t be any issues with that. In contrast, we need to rely on the market research firm’s methodology and the authors’ analyses for the alternative datasets. Since the paper is published in the Journal of Portfolio Management, I assume that the datasets don’t suffer from common problems like survivorship or look-ahead bias. The following table shows a snapshot of the sample in 2017.
The table and the discussion before reveal the most evident limitations of alternative data. It is typically not available for all stocks and its history is limited. In many cases, there is no way to mitigate those problems. Social media and web searches simply didn’t exist 30 years ago. And even though they are available today, web activities are probably not that important for the performance of B2B businesses. Studies and applications of alternative data are therefore usually limited to a curated sub-sample of stocks.
The authors also cannot avoid those problems. Out of 4,511 stocks in total, their sample covers only 331 (7.3%). Although the statistics are much better for the two consumer-oriented sectors, it is not a market-wide analysis. In terms of market capitalization, the sample covers 45.9% of the aggregate value. This suggests that the sample has a substantial tilt to larger companies.
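To make the size tilt concrete, here is a small back-of-the-envelope calculation using only the figures quoted above. It assumes that the 45.9% market-cap share refers to the same universe of 4,511 stocks, which the post does not state explicitly, so treat the resulting ratio as a rough illustration.

```python
# Figures quoted in the post. Assumption: the 45.9% cap share refers to the
# same 4,511-stock universe as the stock counts.
n_total = 4511
n_sample = 331
cap_share_sample = 0.459

# Share of stocks covered by the alternative-data sample.
count_share = n_sample / n_total

# Average market cap per firm, expressed as a fraction of total market cap.
avg_cap_in = cap_share_sample / n_sample
avg_cap_out = (1 - cap_share_sample) / (n_total - n_sample)

# How much larger the average covered firm is than the average uncovered firm.
size_ratio = avg_cap_in / avg_cap_out

print(f"Share of stocks covered: {count_share:.1%}")
print(f"Average covered firm vs. average uncovered firm: {size_ratio:.1f}x")
```

Under this assumption, the average firm in the sample is on the order of ten times larger than the average firm outside of it.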
All of this sounds bad, but for a paper on alternative data, I think the sample is actually quite large. As mentioned before, alternative data is usually not available and/or relevant for the entire market. That is just something we have to live with. In fact, many studies on alternative data are just anecdotal. So I appreciate the larger-scale analysis of the authors.
Important Results and Takeaways
In the main part, the authors examine whether their alternative-data proxies are actually valuable for investors. First, they test their predictive power for firm fundamentals, in particular sales growth. Subsequently, they sort firms by their respective proxies and calculate the performance of quintile portfolios. They also test if such trading strategies remain profitable after transaction costs. In addition, the authors compare the COVID period with the years before. Unsurprisingly, they find a pronounced shift to more online activity.
Alternative consumer-data predicted firm fundamentals
The authors use a linear regression to predict quarterly sales growth, unexpected quarterly revenues and earnings,[8] and analyst forecast errors.[9] IN-STORE, WEB, and BRAND are all significantly related to future revenue growth. Since all of them are sales proxies, this is what we would like to see. The effects of BRAND and WEB are slightly more pronounced for the sub-sample of consumer firms. Again, this is not surprising as the measures are based on consumer activities.
Most regression coefficients are also significantly positive for the other firm fundamentals, although the relations are not as strong as for sales growth. Overall, the authors conclude that the proxies predict future fundamentals of consumer-oriented firms. However, the R² values of the various regressions are relatively modest (0%-20%). This suggests that predicting the future remains difficult, even with more sophisticated data.
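The combination of a clearly positive coefficient and a modest R² can be illustrated with a minimal sketch. The data below is synthetic and the single-regressor setup is my simplification. It is not the authors’ regression specification, and the names `proxy_growth` and `sales_growth` as well as all parameter values are hypothetical.

```python
import numpy as np

def ols_r2(x, y):
    """Regress y on a constant and x; return (intercept, slope, R^2)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1.0 - resid.var() / y.var()
    return coef[0], coef[1], r2

# Synthetic data: sales growth depends on the proxy, but noise dominates.
rng = np.random.default_rng(0)
proxy_growth = rng.normal(0.0, 0.10, size=500)    # hypothetical proxy growth
noise = rng.normal(0.0, 0.10, size=500)           # everything the proxy misses
sales_growth = 0.02 + 0.3 * proxy_growth + noise  # hypothetical relation

intercept, slope, r2 = ols_r2(proxy_growth, sales_growth)
print(f"slope = {slope:.2f}, R^2 = {r2:.1%}")
```

In this toy setup, the slope is reliably positive, yet most of the variation in sales growth remains unexplained. That is the same qualitative picture as a significant coefficient next to a low R².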
From an academic perspective, it is perfectly fine to estimate the predictive power of the three proxies with a simple linear regression. For practical applications, however, I don’t think it is particularly helpful. To bet on unexpected earnings or sales, we need to forecast the number for the individual company. To do this successfully, we probably need a more sophisticated model and sometimes even some human judgement.
Trading on alternative consumer-data generated monthly alphas of up to 1.9%
To test if IN-STORE, WEB, and BRAND are also profitable trading signals, the authors follow the standard methodology. They calculate alphas of quintile portfolios versus the Fama-French five-factor model plus momentum. There is a table with detailed statistics in the paper. But in my opinion, the following charts of the “Top”, “Bottom”, and long-short (“Spread”) portfolios summarize the results more intuitively.
The charts show that most of the proxies “worked” in the sense that companies with higher scores (better fundamentals) outperform those with lower scores (worse fundamentals). IN-STORE in particular exhibits a fairly stable pattern for both equal- and value-weighted portfolios (EW and VW, respectively). To exploit the high frequency of the proxies, all portfolios are rebalanced monthly. Depending on the specification, the hypothetical long-short portfolios generated monthly alphas of up to 1.93%. Except for the IN-STORE proxy, however, most alphas and return spreads are not statistically significant. BRAND and WEB tended to work better for larger firms and longer holding periods (e.g. rebalancing every two months).
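For readers who have not seen the standard methodology in code, the following is a minimal sketch of a monthly quintile sort and the alpha of the resulting long-short portfolio. It uses synthetic data, only the equal-weighted case, and random placeholder factors instead of the actual Fama-French and momentum returns. The column names `month`, `signal`, and `next_ret` and both function names are hypothetical.

```python
import numpy as np
import pandas as pd

def quintile_spread(panel: pd.DataFrame) -> pd.Series:
    """Equal-weighted Top-minus-Bottom return for each month.

    `panel` needs the (hypothetical) columns: month, signal, next_ret.
    """
    spreads = {}
    for month, grp in panel.groupby("month"):
        # Quintile 0 holds the lowest signals, quintile 4 the highest.
        quintile = pd.qcut(grp["signal"], 5, labels=False)
        top = grp.loc[quintile == 4, "next_ret"].mean()
        bottom = grp.loc[quintile == 0, "next_ret"].mean()
        spreads[month] = top - bottom
    return pd.Series(spreads).sort_index()

def factor_alpha(returns: np.ndarray, factors: np.ndarray) -> float:
    """Intercept of a time-series regression of returns on factor returns."""
    X = np.column_stack([np.ones(len(returns)), factors])
    coef, *_ = np.linalg.lstsq(X, returns, rcond=None)
    return float(coef[0])

# Synthetic panel: next month's return loads weakly on this month's signal.
rng = np.random.default_rng(1)
n_months, n_firms = 60, 50
panel = pd.DataFrame({
    "month": np.repeat(np.arange(n_months), n_firms),
    "signal": rng.normal(size=n_months * n_firms),
})
panel["next_ret"] = 0.005 * panel["signal"] + rng.normal(0.0, 0.05, len(panel))

spread = quintile_spread(panel)
# Placeholder for six factor return series (five factors plus momentum).
factors = rng.normal(0.0, 0.03, size=(n_months, 6))
alpha = factor_alpha(spread.to_numpy(), factors)
print(f"mean monthly spread = {spread.mean():.2%}, alpha = {alpha:.2%}")
```

A value-weighted version would replace the simple means with market-cap-weighted averages within each quintile.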
To further examine how the information moves into market prices, the authors conduct an event-study. For each proxy, they look at the performance of the long-short portfolio over the following 65 trading days (about 3 months) after portfolio-formation.
The charts for IN-STORE and BRAND almost look too good to be true. On average, the hypothetical long-short portfolio earns very stable returns over the 65 days after portfolio-formation. This suggests that the information of those proxies is not immediately reflected by the market price.
In contrast, the picture is more volatile for WEB. There are some pronounced spikes over the first 30 trading days after formation. After that, the long-short return remains basically flat. This indicates that the information from WEB moves into prices more quickly, perhaps because web searches and social media activity are easier to access than proprietary analyses of consumer behavior.
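The logic of such an event study can be sketched in a few lines. The two return patterns below are stylized and entirely made up to mimic a steady drift and a front-loaded reaction. They are not the paper’s data, and summing instead of compounding daily returns is a simplification on my part.

```python
import numpy as np

def cumulative_path(daily_spread: np.ndarray) -> np.ndarray:
    """Average daily long-short return across formation dates, then cumulate.

    `daily_spread` has shape (n_formation_dates, horizon_days).
    Returns are summed rather than compounded, which is a simplification.
    """
    return daily_spread.mean(axis=0).cumsum()

def share_earned_by(path: np.ndarray, day: int) -> float:
    """Fraction of the final cumulative return already earned after `day` days."""
    return float(path[day - 1] / path[-1])

horizon = 65  # trading days after portfolio formation

# Stylized pattern 1: the same small return every day (slow diffusion).
steady = np.full((10, horizon), 0.0005)

# Stylized pattern 2: all of the return arrives in the first 30 days.
front_loaded = np.zeros((10, horizon))
front_loaded[:, :30] = 0.0005

for name, data in [("steady", steady), ("front-loaded", front_loaded)]:
    path = cumulative_path(data)
    print(f"{name}: {share_earned_by(path, 30):.0%} of the return earned by day 30")
```

The share of the total return earned early in the window is one simple way to compare how quickly different signals are priced in.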
Finally, the authors test whether their strategies survive real-world trading costs. Although they are frequently ignored in papers, trading costs are actually very important. Depending on the turnover of the strategy, even just a few basis points add up quickly. The authors use a well-known study by Frazzini, Israel, and Moskowitz (2015) to estimate the performance of the portfolios after costs. Unsurprisingly, the results are substantially weaker for most specifications. However, IN-STORE remains a profitable signal with a monthly alpha of about 1.6% after costs (1.93% before costs).
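To see how quickly costs add up, here is a simple sketch that subtracts turnover times a flat cost per unit traded from the gross alpha. This is not the cost model of the study the authors rely on. The turnover and cost inputs are purely hypothetical and only the 1.93% and 1.6% figures come from the text above.

```python
def net_alpha(gross_alpha_pct: float, monthly_turnover: float, cost_bps: float) -> float:
    """Gross monthly alpha (in %) minus turnover times cost per unit traded.

    monthly_turnover: fraction of portfolio value traded per month (2.0 = 200%).
    cost_bps: assumed one-way cost in basis points of traded value.
    """
    return gross_alpha_pct - monthly_turnover * cost_bps / 100.0

# Figures from the post: IN-STORE alpha before and (about) after costs.
gross, net_reported = 1.93, 1.6
implied_drag = gross - net_reported  # percentage points per month

# Hypothetical inputs: 200% monthly turnover at 10 bps per unit traded.
hypothetical_net = net_alpha(gross, monthly_turnover=2.0, cost_bps=10.0)

print(f"Implied monthly cost drag: {implied_drag:.2f} pp")
print(f"Net alpha under hypothetical inputs: {hypothetical_net:.2f}%")
```

The implied drag of roughly a third of a percentage point per month is small relative to the IN-STORE alpha, but it would wipe out a much weaker signal.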
Conclusions and Further Ideas
Before I summarize the results, please keep in mind that the authors only use three arbitrary alternative datasets. As mentioned earlier, there are more than 400 other providers out there. So it is impossible to generalize from this paper. That said, I think that some insights also apply to other datasets.
The authors present some robust evidence that alternative data is indeed useful to predict both fundamentals and stock returns. But the results also suggest that fundamentals are the easier part. The predictive power of IN-STORE, WEB, and BRAND is much more robust for sales growth than for stock returns. I believe this also applies to other alternative datasets. Predicting returns is much harder than predicting fundamentals.
For this reason, alternative data is probably most relevant for discretionary investors who really try to evaluate the underlying business. Systematic investors will almost certainly struggle with limited data availability and short history. Moreover, they don’t necessarily have to predict fundamentals but need robust alpha signals.
There seems to be a trade-off between data availability and sophistication. As we have seen in the paper, the specialized IN-STORE proxy had the most predictive power. But it is only available for 66 retail firms. The implications of this are clear. If you are a discretionary stock picker, you should go for the most specialized data available. If you are a systematic investor who loves diversification, you must find some balance between data availability and sophistication. In the end, it all depends on the data at hand. While it is difficult to incorporate the IN-STORE proxy into a systematic investment process, the text of annual reports is also alternative data and available for all firms.
Endnotes
[1] I don’t want to be too sarcastic on that. Of course, there are valuable alternative datasets. However, research on a product that the researcher wants to sell can be dangerous.
[2] Exemplary reference from BerkeleyHaas.
[3] Paragon Intel is selling the data for this.
[4] Exemplary reference from Harvard Business School.
[5] At least to me, the opposite of alternative is not so clear. Maybe traditional?
[6] Israel et al. (2020) briefly discuss this in their paper on machine learning in finance.
[7] AlternativeData.org cite this figure from the EY Global Hedge Fund and Investor Survey 2017 as of June 4, 2022. The number on their website may change in the future.
[8] Measured by the quarter-on-quarter increase, relative to the last two years.
[9] Measured by the deviation between expected earnings per share (EPS) and actual EPS.