Measuring Diversity: Shannon’s Diversity Index

We keep hearing that Mumbai and New York are the most diverse cities in India and the US respectively, in terms of the culture and languages of their people. So we wondered: does our data support this? To test the hypothesis, we came up with a metric that scores the diversity of some of the largest cities in India and the US, based on our data. Essentially, the question we were trying to answer was: which city has the highest diversity in terms of the languages in which people listen to music?

 

What is diversity?

Before we get to the results and the details of how we computed them, let’s explain what we mean by diversity. The graph below shows sample data for 3 different cities and 3 languages: Hindi, English and one regional language.

What’s diversity? The 1st distribution of population is most diverse (all languages are heard equally), followed by 2nd and 3rd

In the first example, the streams for the 3 languages are exactly equal, and we say that this city is perfectly diverse: based on the music data, each of the 3 languages has an equal share of listeners in this city. In the second case, two-thirds of the streams are for Hindi, far more than for English or the regional language, so this city is less diverse, since the data implies that Hindi speakers (or listeners, to be precise) dominate the population.

The third city has almost all of its streams in Hindi, so its linguistic diversity is lower still.

How do we quantify this? Shannon’s diversity index! Here is the expression:

Shannon’s Index: $H = -\sum_{i} p_i \log p_i$

Here, $p_i$ represents the proportion of streams belonging to the i-th language, and the sum runs over all languages.

 

The more diverse the city, meaning the closer the languages are to equal proportions, the higher the value of the Shannon index. If a single language dominates, its proportion is close to 1 and the index is close to 0.
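The two extremes can be worked out directly from the definition. The lines below are a sketch using the standard form of the index; n, the number of languages, is a symbol introduced here for illustration.

```latex
H = -\sum_{i=1}^{n} p_i \log p_i

% All n languages equal: p_i = 1/n for every i
H_{\max} = -\sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n

% One language takes everything: p_1 = 1, all other p_i = 0 (with 0 \log 0 taken as 0)
H_{\min} = -1 \cdot \log 1 = 0
```

So the index always lies between 0 and log n, and it reaches the top of that range only when every language has the same share.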

For the three example cities in the graph above, the Shannon index is 0.477, 0.376 and 0.0735 respectively (using base-10 logarithms, so three equal languages give log10(3) ≈ 0.477), which is the ordering we would expect.
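To make the calculation concrete, here is a minimal Python sketch of the index. The function name `shannon_index` and the stream shares are assumptions made for illustration; only the first two cities follow the proportions described above (equal thirds, and two-thirds Hindi), while the third is a made-up "almost all Hindi" split rather than the exact sample data.

```python
import math

def shannon_index(proportions, base=10):
    """Shannon diversity index: -sum(p * log(p)) over the non-zero shares."""
    total = sum(proportions)
    if total <= 0:
        raise ValueError("proportions must sum to a positive number")
    index = 0.0
    for p in proportions:
        share = p / total  # rescale so the shares sum to 1
        if share > 0:      # a share of 0 contributes nothing (0 * log 0 is taken as 0)
            index -= share * math.log(share, base)
    return index

# Hypothetical stream shares for (Hindi, English, regional language)
city_1 = [1/3, 1/3, 1/3]       # perfectly even
city_2 = [2/3, 1/6, 1/6]       # two-thirds Hindi
city_3 = [0.97, 0.015, 0.015]  # illustrative only, not the post's exact data

for name, shares in [("city 1", city_1), ("city 2", city_2), ("city 3", city_3)]:
    print(name, round(shannon_index(shares), 3))
```

The base of the logarithm only rescales the index, so the ranking of cities is the same whichever base is used.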

 

Data Normalization

So, to calculate the diversity metric for cities, we just need to calculate the value of the Shannon index for each city, and that’s it! Apart from one minor detail.

The user base on Saavn is inherently biased towards the Hindi-speaking population, as compared to speakers of Tamil, Telugu, Kannada and other languages. As a result, our overall language distribution looks roughly like this.

Typical distribution of songs played on Saavn

This means that a city where all the languages have roughly equal streams is actually less diverse than it looks, because the two regional languages have disproportionately many listeners there compared with Saavn as a whole. To remove this bias, we normalized each city’s language distribution by the overall language distribution on Saavn. So if a city has a language distribution exactly like the one above, it is considered perfectly diverse.
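One way to read this normalization is sketched below in Python. It assumes that "normalized by" means dividing each language's share in the city by its overall share and then rescaling the ratios to sum to 1; the exact procedure is not spelled out here, so treat this as one plausible interpretation. The function names and all stream counts are made up for illustration.

```python
import math

def shannon_index(proportions, base=10):
    """Shannon index of a list of non-negative weights (rescaled to sum to 1)."""
    total = sum(proportions)
    return -sum((p / total) * math.log(p / total, base) for p in proportions if p > 0)

def normalized_distribution(city_streams, overall_streams):
    """Divide each language's city share by its overall share, then rescale to sum to 1.

    Assumes every language in overall_streams has a positive count.
    """
    city_total = sum(city_streams.values())
    overall_total = sum(overall_streams.values())
    ratios = {}
    for language, overall_count in overall_streams.items():
        city_share = city_streams.get(language, 0) / city_total
        overall_share = overall_count / overall_total
        ratios[language] = city_share / overall_share
    ratio_total = sum(ratios.values())
    return {language: r / ratio_total for language, r in ratios.items()}

# Made-up stream counts, for illustration only
overall = {"Hindi": 700, "English": 200, "Tamil": 50, "Telugu": 50}
city_like_overall = {"Hindi": 70, "English": 20, "Tamil": 5, "Telugu": 5}
city_equal_streams = {"Hindi": 25, "English": 25, "Tamil": 25, "Telugu": 25}

for name, city in [("matches overall", city_like_overall), ("equal streams", city_equal_streams)]:
    adjusted = normalized_distribution(city, overall)
    print(name, round(shannon_index(list(adjusted.values())), 3))
```

Under this reading, a city whose mix matches the overall distribution becomes uniform after normalization and scores the maximum, while a city with equal raw streams scores lower because its regional languages are over-represented.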

 

Results!

After normalization, we just need to calculate the Shannon index for each city – and here are the results for India!

Diversity of Indian Cities

Well well! Looks like the IT city takes the crown! Actually, the 3 cities at the top (Bangalore, Mumbai and Delhi) are roughly equal in terms of diversity, followed by a sharp dip. Kolkata and Chennai are disproportionately dominated by local languages, and are thus lower in the rankings.

Diversity of Indian States

 

Diversity of US Cities

Just for fun, we also did the same exercise for cities in the US, and the results were quite predictable.

As expected, New York is the most diverse city by a reasonable margin, followed by the Bay Area.

 

Artists Diversity

The diversity index that we defined above can also be used for a lot of other interesting cuts of data, like mapping the diversity of artists. We would expect evergreen artists to have high diversity, since a lot of their songs are popular.

Here is a graph showing the diversity of artists against their popularity and the number of songs they have.

Diversity of Artists

 

Some insights from here:

  • Evergreen artists like Kishore Kumar and Lata Mangeshkar are truly diverse: they have a large number of songs, many of which are still popular even after all these years.
  • Badshah and Arijit Singh have fewer songs, and their most recent songs dominate their catalogue, which is why they are less diverse.
  • Yo Yo Honey Singh kind of stands out (notice the yellow island in the middle of the data): he’s popular, and it turns out that all of his songs are roughly equally popular. This could also be because there hasn’t been a new release by him in some time.
  • One-hit stars! These are artists with very low diversity and high popularity, because one of their songs is disproportionately popular. For example: Amar Arshi (Kala Chashma), Isheeta (Sau Tarah Ke) and Harris Jayaraj (Halena).
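The contrast between an evergreen artist and a one-hit star can be reproduced with the same index, applied to how an artist's streams are spread across their songs. The sketch below assumes that artist diversity is computed over per-song stream shares; the function name `song_diversity` and all counts are made up for illustration.

```python
import math

def song_diversity(stream_counts, base=10):
    """Shannon index over an artist's per-song stream shares."""
    total = sum(stream_counts)
    return -sum((c / total) * math.log(c / total, base) for c in stream_counts if c > 0)

# Made-up per-song stream counts, for illustration only
evergreen_artist = [900, 850, 800, 780, 760, 740, 700, 690]  # many songs, all still played
one_hit_artist = [5000, 40, 30, 20, 10]                       # one song dominates

print("evergreen:", round(song_diversity(evergreen_artist), 3))
print("one hit:  ", round(song_diversity(one_hit_artist), 3))
```

Because the maximum of the index is the log of the number of songs, an artist with a larger catalogue can reach a higher score than one with only a few songs, even if both have evenly spread streams.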

 

Conclusion

Of course, Shannon’s diversity index isn’t the only metric to measure diversity, but hey – don’t you think that the results make sense? Let us know what you think in your comments – or email us at data [at] saavn.com – we would love to hear from all the data geeks out there!

The Author