Aggregate Confusion: The Divergence of ESG Ratings (2022)
Florian Berg, Julian F. Kölbel, Roberto Rigobon
Review of Finance, Corrected Proof, 1-30, URL
As announced last week, this is the second AGNOSTIC Paper on the confusion around Environmental, Social, Governance a.k.a. ESG. As a reminder, last week’s paper highlights 1) that ESG ratings of different data providers disagree, and 2) that this disagreement has real consequences for firms and investors. The authors of this week’s paper examine the first part in much more detail and provide some specific explanations for why ESG ratings are so different. As with my last series, I recommend starting with the first post to get the full story.
Everything that follows is only my summary of the original paper. So unless indicated otherwise, all tables and charts belong to the authors of the paper and I am just quoting them. The authors deserve full credit for creating this material, so please always cite the original source.
Data and Methodology
The authors construct a comprehensive sample of ESG ratings from six data providers: KLD, Sustainalytics, Moody’s ESG, Refinitiv, MSCI, and S&P Global. So compared to last week’s paper, there are some overlaps (Sustainalytics, MSCI, KLD) but also differences. Depending on the data provider, the authors obtain ESG ratings for 1,665 to 9,662 companies. But since they focus on the divergence of ESG ratings across data providers, they only use the largest possible common sample. This covers 924 firms in the year 2014, so it’s a purely cross-sectional analysis.[1] The authors do not comment on it specifically, but I suspect that their sample covers not only US but also some international companies.
Similar to last week, the methodology of the first part is fairly straightforward and boils down to correlations and descriptive statistics. The second part is still simple but required considerable effort. To examine the divergence of ESG ratings on a more granular level, the authors develop their own ESG taxonomy from more than 700 indicators. In the last part, they use some sophisticated econometrics to identify the underlying mechanisms behind the divergence of ESG ratings.
Before I continue with the results, one brief remark. As I mentioned above, the data and sample are different from the last paper. Please keep that in mind and don’t compare the results up to the last decimal. The cool thing is: despite the different samples, the overall results are very similar. This speaks for the quality of both papers and suggests that the disagreement is a robust empirical fact.
Important Results and Takeaways
ESG ratings disagree: the average correlation is just 0.54
The first table again shows the average correlation of ESG ratings for each pair of the data providers. For aggregate ESG ratings, the average correlation among the 924 companies is a modest 0.54. This is slightly higher than last week’s 0.45 but very similar in terms of economic magnitude. Also similar to last week, disagreement is highest for Governance (correlation of 0.3) and lowest for Environmental (correlation of 0.53).[2] For some pairs, correlations of governance ratings are even close to zero or slightly negative. For example, the correlation of -0.05 between KLD and Refinitiv (KL-RE) suggests that the two data providers have a very different and maybe even opposing view of good governance.
Overall, those are the same patterns as last week and I suspect that the same explanations are also valid. Environmental issues are much easier to measure than governance and there is more consensus about them. I think we all agree that more CO2 emissions are bad but who knows what fair CEO compensation is? However, please also note that not all data providers disagree so strongly. There are some pairs with higher correlations. For example, Sustainalytics and Moody’s (SA-MO) or Moody’s and S&P Global (MO-SP) with values around 0.7.
If you don’t like tables and numbers, the authors also have a quite intuitive graphical illustration. The following chart shows a scatter plot of the Sustainalytics ESG rating relative to the five others. If ESG ratings were perfectly correlated, all points would lie on a straight line. But given that correlations are considerably lower than +1, it’s actually a quite noisy point cloud.
No matter how you present it, the authors provide convincing evidence that different ESG ratings for the same companies disagree considerably. Those results are very similar to last week’s paper and thus provide important out-of-sample evidence. So there is growing evidence that ESG ratings indeed do not serve their main purpose: to unambiguously measure the ESG performance of companies. But as you should know from last week’s introduction, this is no surprise because ESG is just a very subjective concept. The next part will also show this impressively…
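To make the idea of an "average pairwise correlation" concrete, here is a minimal Python sketch on simulated data. The six rater labels, the latent `true_esg` score, and the noise level are my own assumptions for illustration; nothing here uses the authors' data or reproduces their numbers.

```python
import itertools

import numpy as np


def average_pairwise_correlation(ratings):
    """Return the Pearson correlation for every rater pair and their mean."""
    pair_corr = {}
    for a, b in itertools.combinations(ratings, 2):
        pair_corr[(a, b)] = float(np.corrcoef(ratings[a], ratings[b])[0, 1])
    return pair_corr, float(np.mean(list(pair_corr.values())))


# Simulated data (assumption): every rater observes the same latent ESG
# performance plus independent noise of equal size.
rng = np.random.default_rng(0)
n_firms = 924
true_esg = rng.normal(size=n_firms)
ratings = {
    name: true_esg + rng.normal(scale=1.0, size=n_firms)
    for name in ["A", "B", "C", "D", "E", "F"]  # hypothetical raters
}

pair_corr, avg_corr = average_pairwise_correlation(ratings)
print(f"{len(pair_corr)} pairs, average correlation {avg_corr:.2f}")
```

With six raters there are 15 pairs, which is why a single average hides quite a bit of variation between individual pairs.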
709 indicators and 64 categories: no wonder that there is disagreement
Enviably, the authors not only have the final ESG ratings but also the underlying raw data. At the most granular level, the six data providers process 709 unique indicators to construct their ESG ratings. There are of course some overlaps but this is still a wide range of data. To make the different methodologies comparable, the authors develop their own ESG taxonomy from this long list. They group the 709 indicators into 64 categories using the following two rules. Each indicator belongs to exactly one category, and each category consists of at least two indicators. Of course, this methodology is also somewhat subjective but the authors are very transparent with their assumptions. In my opinion, that’s the best they can do and having such a common taxonomy is very helpful.
The number of indicators already reveals considerably different methodologies among the data providers. For example, Refinitiv processes 282 indicators while Moody’s ESG uses only 38. There are also notable differences on the category level. Only 10 out of the 64 categories are covered by all six data providers.[3] Without going into more detail, this analysis already suggests that data providers 1) look at different categories, and 2) measure the same categories by different indicators. The following table supports this and summarizes the correlation of ESG ratings for each category. Empty cells indicate that a data provider doesn’t consider the respective category.
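The two grouping rules are easy to express as a small data-structure check. The following Python sketch is only an illustration: the indicator names are invented, the three category names are borrowed from the paper's category list, and the helper `build_taxonomy` is hypothetical rather than anything the authors publish.

```python
from collections import defaultdict

# Hypothetical indicator -> category mapping (indicator names are invented).
# Using a dict enforces rule 1: each indicator maps to exactly one category.
indicator_to_category = {
    "power_consumption": "Energy",
    "renewable_power_share": "Energy",
    "water_withdrawal": "Water",
    "water_recycling_policy": "Water",
    "ceo_pay_ratio": "Remuneration",
    "pay_linked_to_targets": "Remuneration",
}


def build_taxonomy(mapping, min_indicators=2):
    """Group indicators by category and enforce rule 2 (minimum category size)."""
    categories = defaultdict(list)
    for indicator, category in mapping.items():
        categories[category].append(indicator)
    too_small = [c for c, members in categories.items() if len(members) < min_indicators]
    if too_small:
        raise ValueError(f"Categories with fewer than {min_indicators} indicators: {too_small}")
    return dict(categories)


taxonomy = build_taxonomy(indicator_to_category)
for category, members in taxonomy.items():
    print(category, members)
```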
The results are very mixed. The heat map indicates that there are some categories and pairs of data providers with fairly high correlations. On the other hand, correlations are also low or even negative for a considerable number of categories. Overall, the table once again highlights the current mess around ESG. There are a lot of different aspects to define it (categories), different indicators to measure them, and different data providers thus arrive at different conclusions. Reason enough to bring a little structure into the chaos…
Most disagreement comes from “measurement” and “scope”
To identify the reasons for the observed ESG disagreement, the authors use their 64-category-taxonomy and decompose rating differences into the following mechanisms.
- Scope: Which categories do data providers consider for their ESG ratings? A scope disagreement arises, for example, if Refinitiv analyzes the category Child Labor but MSCI does not.
- Measurement: Given that two data providers consider the same category, by which indicators do they measure it? A measurement disagreement arises, for example, if Refinitiv measures the category Energy by power consumption while MSCI looks at the share of renewable power.[4]
- Weight: How do data providers combine their different categories into a final ESG rating? A weight disagreement arises, for example, if Refinitiv puts a weight of 30% on the Energy category while MSCI considers it less important and weights it at 25%.[5]
Building on this methodology, the authors decompose the differences of ESG ratings and estimate how much is related to scope, measurement, and weight. Since the detailed methodology of the data providers is proprietary, this involves quite a bit of investigative econometric modeling. But to spare you the pleasure of non-negative least squares regressions, I will skip the details and go straight to the results.[6]
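The three mechanisms can be illustrated with a toy calculation for a single firm. This is a simplified sketch, not the authors' exact procedure: the two raters, all scores and weights are invented, and using the simple average of both raters' weights as the common benchmark `w_star` is my own assumption. The point is only that the rating difference splits exactly into a scope, a measurement, and a weight part.

```python
import math

# Hypothetical category scores (0-100) and weights for one firm.
rater_a = {
    "scores": {"Energy": 60.0, "Remuneration": 40.0, "Child Labor": 80.0},
    "weights": {"Energy": 0.30, "Remuneration": 0.50, "Child Labor": 0.20},
}
rater_b = {
    "scores": {"Energy": 75.0, "Remuneration": 40.0},
    "weights": {"Energy": 0.25, "Remuneration": 0.75},
}


def rating(rater):
    """Aggregate rating as the weighted sum of category scores."""
    return sum(rater["scores"][c] * rater["weights"][c] for c in rater["scores"])


def decompose(a, b):
    common = set(a["scores"]) & set(b["scores"])
    only_a = set(a["scores"]) - common
    only_b = set(b["scores"]) - common
    # Assumption: benchmark weights are the average of both raters' weights.
    w_star = {c: (a["weights"][c] + b["weights"][c]) / 2 for c in common}

    # Scope: contribution of categories that only one rater covers.
    scope = sum(a["scores"][c] * a["weights"][c] for c in only_a) - sum(
        b["scores"][c] * b["weights"][c] for c in only_b
    )
    # Measurement: different scores in common categories, valued at common weights.
    measurement = sum((a["scores"][c] - b["scores"][c]) * w_star[c] for c in common)
    # Weight: deviation of each rater's weights from the common weights.
    weight = sum(a["scores"][c] * (a["weights"][c] - w_star[c]) for c in common) - sum(
        b["scores"][c] * (b["weights"][c] - w_star[c]) for c in common
    )
    return scope, measurement, weight


scope, measurement, weight = decompose(rater_a, rater_b)
total_difference = rating(rater_a) - rating(rater_b)
print(scope, measurement, weight, total_difference)
```

In this made-up example the three parts have different signs, so they partly offset each other. That is one reason why shares of "explained disagreement" need some care when interpreting them.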
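For readers who do want a taste of the regression step, here is a minimal sketch of a non-negative least squares fit using `scipy.optimize.nnls`. The data are simulated and the "true" weights are invented by me; the sketch only shows the mechanics of recovering non-negative weights from category scores and final ratings, not the authors' actual specification.

```python
import numpy as np
from scipy.optimize import nnls

# Simulated data (assumption): one rater, 500 firms, 4 category scores each.
rng = np.random.default_rng(1)
n_firms, n_categories = 500, 4
category_scores = rng.uniform(0, 100, size=(n_firms, n_categories))

# Hypothetical proprietary weights; the last category is ignored by the rater.
true_weights = np.array([0.5, 0.3, 0.2, 0.0])
final_ratings = category_scores @ true_weights + rng.normal(scale=2.0, size=n_firms)

# Solve min ||category_scores @ w - final_ratings|| subject to w >= 0.
estimated_weights, residual_norm = nnls(category_scores, final_ratings)
print(np.round(estimated_weights, 3))
```

The non-negativity constraint matters mostly for categories with little or no true weight, where an unconstrained regression could return small negative coefficients purely from noise.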
On average, 56% of the differences in ESG ratings come from measurement, 38% from scope, and 6% from weight. This means that most ESG disagreement comes from data providers defining ESG by different categories (scope) and evaluating the same categories by different indicators (measurement). The methodology to merge categories into a final rating (weight) plays only a minor role. Of course, the results are different for each data provider pair. But overall, the results provide some explanations why ESG ratings are so different.
There is a “rater effect” for ESG ratings
In the final part of the paper, the authors document a rater effect for ESG ratings. According to the paper, the rater effect is “[…] a bias, where performance in one category influences perceived performance in other categories.”[7] In plain English: if a data provider rates a company well in category 1, the company has a better chance of also receiving a good rating in category 2. Note that this is not necessarily an insult to data providers. ESG performance could very well be positively correlated across categories. But as you probably already expected from my wording, the authors control for that and the effect still remains…
I will again not go into details, but the authors use two econometric approaches to test for a rater effect within ESG ratings.[8] Both of them yield statistically significant evidence in favor of it. So indeed, if a company scores well in one category, the same data provider tends to also give it a good ESG rating in another category. The authors argue that this might be explained by ESG analysts who cover companies instead of categories. If a company leaves an overall good ESG impression, the analyst may be more forgiving in categories where it still has some weaknesses.
Another explanation for the rater effect is deliberate methodological choices. The authors explain that some data providers make it impossible for companies to receive the best rating if they don’t answer all questions or don’t provide data. I believe that this is fair because you can’t give a good rating if you don’t know what is going on. But of course, such missing-data policies make the ESG ratings even more subjective. You see, no matter how we slice it, we always come back to the problem that ESG is subjective.
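The intuition behind the fixed-effects approach can be sketched with simulated data. This is my own simplified illustration and not the authors' specification: I compare how much variation in category scores is explained by firm dummies alone versus firm-times-rater dummies, once with and once without a rater-specific bias. Since OLS on a full set of group dummies simply fits group means, the R² values can be computed from group means directly. All sizes and parameters are assumptions.

```python
import numpy as np


def r_squared_from_groups(scores, groups):
    """R^2 of an OLS regression on a full set of group dummies (fits group means)."""
    fitted = np.empty_like(scores)
    for g in np.unique(groups):
        mask = groups == g
        fitted[mask] = scores[mask].mean()
    ss_res = np.sum((scores - fitted) ** 2)
    ss_tot = np.sum((scores - scores.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def simulate_r2_gain(bias_scale, seed=0, n_firms=50, n_raters=3, n_categories=10):
    """R^2 gain from adding firm-x-rater dummies on top of firm dummies."""
    rng = np.random.default_rng(seed)
    firm, rater, _ = np.meshgrid(
        np.arange(n_firms), np.arange(n_raters), np.arange(n_categories), indexing="ij"
    )
    firm, rater = firm.ravel(), rater.ravel()
    firm_quality = rng.normal(size=n_firms)  # shared by all raters
    # Rater effect (assumption): a firm-specific bias that one rater applies
    # to all categories of that firm.
    rater_bias = rng.normal(scale=bias_scale, size=(n_firms, n_raters))
    scores = firm_quality[firm] + rater_bias[firm, rater] + rng.normal(size=firm.size)
    r2_firm = r_squared_from_groups(scores, firm)
    r2_firm_rater = r_squared_from_groups(scores, firm * n_raters + rater)
    return r2_firm_rater - r2_firm


gain_with_bias = simulate_r2_gain(bias_scale=1.0)
gain_without_bias = simulate_r2_gain(bias_scale=0.0)
print(f"R^2 gain with rater bias: {gain_with_bias:.3f}")
print(f"R^2 gain without rater bias: {gain_without_bias:.3f}")
```

Adding more dummies always raises R² a little, even without any bias, which is why the comparison against the no-bias simulation is the relevant one here.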
Conclusions and Further Ideas
Looking at the two papers, I think there is convincing evidence that the current state of ESG ratings is quite messy and unsatisfying. Data providers are supposed to measure the same thing but arrive at quite different conclusions. For some ESG categories, ratings are even negatively correlated and thus indicate opposing views of data providers. Now, as I stressed several times, this subjective character is unfortunately an inherent feature of ESG and hardly solvable.
So what should investors do? Last week, I already mentioned The Economist’s Special Report that argues in favor of a radical simplification of ESG to just carbon emissions. This is appealing because 1) we can measure carbon emissions with at least some certainty, and 2) probably most of us agree that fewer emissions are better than more. The results of the two papers support this idea as correlations are highest for the Environmental dimension of ESG.
Another idea would be the direct opposite. Instead of (hopelessly) trying to standardize ESG in all of its details, we could also leave it to investors themselves. Looking at the sheer number of 709 indicators that are processed by the six data providers, ESG ratings are (in my opinion) just another type of fundamental analysis. We also don’t have a consensus about the one and only “right” way to analyze companies and live quite okay with it. Investors and asset managers just use the models they believe in. If it doesn’t work, they lose money and adjust accordingly (at least in theory). Why not do the same with ESG? Data providers deliver the raw data (as they do for all other fundamental data) and investors can do their own ESG analyses.
My impression is that data providers are currently stuck somewhere in between. They cover ESG in all of its details (remember the 709 indicators and 64 categories) and simultaneously try to standardize it into one final rating. The result is a lot of disagreement and confusion. As a consequence, the two approaches I mentioned above are already ongoing. Many asset managers advertise their own proprietary ESG methodology and only use raw data from data providers. At the same time, other investors simply exclude the worst-performing companies in terms of emissions or severe problems and simplify ESG accordingly. It’s going to be interesting to see how ESG develops further. At the moment, the skeptics[9] claim that the concept is just a money-printing machine for data providers and consultants without much added value. The results of the two papers support this view at least to some extent.
[1] This is a notable difference to last week’s panel dataset of S&P 500 companies.
[2] It’s interesting that the correlation for the total ESG score is slightly higher (0.54) than for the best dimension (0.53). The total score should by definition be some average of the three. But this is just a minor detail.
[3] These are: Biodiversity, Employee Development, Energy, Green Products, Health and Safety, Labor Practices, Product Safety, Remuneration, Supply Chain, and Water.
[4] I made up these indicators for illustration and the information does not correspond to the true methodology of those data providers. The paper doesn’t provide specific information about that.
[5] I again made up these examples for illustration and the information does not correspond to the true methodology of those data providers. The paper doesn’t provide specific information about that.
[6] Essentially, the authors just regress the final ESG ratings on the scores for each category. The resulting coefficients are an estimate for the proprietary weights of each data provider. And since negative weights don’t make sense, the authors force the coefficients to be non-negative.
[7] Quote from page 24 of the paper.
[8] A simple fixed-effects regression and a LASSO approach.
[9] One of the most prominent is Aswath Damodaran.