Introduction

Cluster analysis is a valuable machine learning tool, with applications across virtually every discipline: from analyzing satellite images of agricultural areas to identify different types of crops, to finding themes across the billions of social media posts broadcast publicly every month.

Clustering is often used together with natural language processing (NLP). NLP lets us turn unstructured text into structured numerical data; this is how we transform human language into the computer-friendly inputs that machine learning requires. For a straightforward application of these two techniques in sequence, we've chosen an easily accessible public data set: the 'Top 50 Songs of 2015' by Genius, the digital media company. In our example, we look at how songs can be broken into similar clusters by their lyrics, and then conduct further statistical analysis on the resulting clusters.

Machine learning and other data science methodologies increasingly confer great power on those who practice them. They are not, however, without limitations. Misapplying or misinterpreting these methodologies can lead not only to a reduced ability to extract insights, but also to categorically false or misleading conclusions. We examine potential mishaps in the transformation and analysis of the data at various stages of this study.

This study explores the use of cluster analysis to answer the following questions: Can songs be effectively clustered by their lyrics alone? And do the resulting clusters reveal relationships, such as between vocabulary and ranking, that individual songs do not?
Methodology

We performed this analysis in Python, using Pandas, Sklearn, and NLTK to model the data. Graphs were plotted with matplotlib and/or seaborn.

Steps:
* We first fed the data into an affinity propagation model, but found the model unsuited to the data.

Further Analysis

After clustering, we analyzed the relationships between word count, vocabulary, songs, and the clusters the songs were assigned to. Results are below.

Project Results and Analysis
Top 5 Most Similar Songs:

1. Father John Misty, "The Night Josh Tillman Came To Our Apt" vs. Kendrick Lamar, "The Blacker The Berry" (Dist. = 0.835462649851)
2. Vince Staples, "Jump Off The Roof" vs. The Weeknd, "Tell Your Friends" (Dist. = 0.873377864682)
3. The Weeknd, "Tell Your Friends" vs. Drake, "Back To Back" (Dist. = 0.908259234026)
4. Joey Bada, "Paper Trail" vs. Lupe Fiasco, "Prisoner 1 And 2" (Dist. = 0.937557822226)
5. Dr Dre, "Darkside Gone" vs. Drake, "Back To Back" (Dist. = 0.939432020598)

Top 5 Most Dissimilar Songs:

1. Sufjan Stevens, "Blue Bucket Of Gold" vs. Major Lazer, "Lean On" (Dist. = 3.0627237839)
2. Major Lazer, "Lean On" vs. Dangelo And The Vanguard, "Really Love" (Dist. = 3.01516407241)
3. Towkio, "Heaven Only Knows" vs. Major Lazer, "Lean On" (Dist. = 2.8076161803)
4. Post Malone, "White Iverson" vs. Major Lazer, "Lean On" (Dist. = 2.77959991825)
5. Major Lazer, "Lean On" vs. The Weeknd, "The Hills" (Dist. = 2.76888048057)
Interestingly, Major Lazer's "Lean On" is so lyrically dissimilar from every other song that it occupies all five of the top dissimilarity ranks, each time paired with a different song.
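For reference, here is a minimal sketch of how such a ranking could be produced. The file name, column names, stop-word handling, and the choice of Euclidean distance on TF-IDF vectors are our assumptions, not necessarily the exact pipeline used above; later sketches reuse `songs` and `X` from this one.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# Hypothetical input: one row per song, with 'artist', 'title', and 'lyrics' columns.
songs = pd.read_csv("genius_top_50_2015.csv")  # assumed file name

# Turn each song's lyrics into a TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(songs["lyrics"])

# Pairwise Euclidean distances between all songs (assumed metric).
dist = pairwise_distances(X, metric="euclidean")

# Every unordered pair of songs, sorted from most to least similar.
pairs = sorted(
    (dist[i, j], songs["title"][i], songs["title"][j])
    for i in range(len(songs))
    for j in range(i + 1, len(songs))
)
print("Most similar:", pairs[:5])
print("Most dissimilar:", pairs[-5:])
```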
Principal Component Analysis
Unfortunately, PCA does not help as much here as it can in other cases: we need the first 9 principal components just to account for a little over 60% of the variance.
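A sketch of that check, assuming the TF-IDF matrix `X` from the earlier sketch (scikit-learn's PCA needs a dense array):

```python
import numpy as np
from sklearn.decomposition import PCA

# Project the TF-IDF vectors onto the first 9 principal components.
pca = PCA(n_components=9)
X_pca = pca.fit_transform(X.toarray())

# Cumulative share of variance captured; the 9th entry is just over 0.6 here.
print(np.cumsum(pca.explained_variance_ratio_))
```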
Clustering the songs
The Reddit dataset analysis that inspired this project used an affinity propagation algorithm to perform its clustering. Affinity propagation does not require you to specify the number of clusters up front. This means that, unlike with K-Means, the number of clusters is not arbitrary, which often makes it a great choice - when it works.
We, however, found K-Means to be a better choice for this dataset. The affinity propagation model kept producing very uneven class distributions: we'd get roughly 8 to 12 clusters, 3 of which would contain about 90% of the songs, while the remaining clusters had just 1 or 2 songs each.
This kind of clustering isn't necessarily bad. If many songs are similar to one another, and a few are very dissimilar to all the others, then the model is just doing its job and showing us those relationships. We wanted to analyze the summary statistics for each class, though, and the large number of single-song classes made that difficult.
After some initial tweaking, we fit a KMeans model to the first 9 dimensions of our PCA transform.
We found 8 clusters to be an effective tradeoff between the number of classes and the distribution of songs within them.
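A minimal sketch of this step, reusing `X_pca` from the PCA sketch above; the random seed is illustrative:

```python
from sklearn.cluster import KMeans

# Cluster the songs in 9-dimensional PCA space.
kmeans = KMeans(n_clusters=8, random_state=42)
songs["cluster"] = kmeans.fit_predict(X_pca)

# Check how evenly songs are spread across the 8 clusters.
print(songs["cluster"].value_counts())
```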
Of the 8 resulting classes, five contain six or more songs.
Now that we've clustered the Genius Top 50 Songs of 2015 into eight groups, how can we visually represent them?
If we plot the first 3 principal components along 3 axes, we can already start to see some clustering. This plot is more illustrative than might be expected: even though these 3 dimensions account for only ~25% of the variance, we can see a reasonable degree of grouping.
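One way to draw this with matplotlib, as a sketch reusing `X_pca` and the cluster labels from the sketches above:

```python
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# First three principal components, colored by cluster assignment.
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=songs["cluster"], cmap="tab10")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_zlabel("PC 3")
plt.show()
```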
Besides lyrics, what do the classes have in common?
First, let's visualize some of our distributions:
1. On the left, we have a box plot showing the distribution of words per song.
- The mean is just over 400 words.
- The standard deviation is large relative to the mean, and word count is distributed fairly evenly around it.
- The first and third quartiles range between ~300 and ~600 words.
2. On the right, we have a histogram depicting the number of songs that fall into bins defined by word count.
- The distribution is right-skewed, due to a single song with over 1,400 words (3.5x the mean). Both views are sketched below.
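A sketch reproducing these two views, assuming a naive whitespace word count (the original tokenization may differ) and the `songs` DataFrame from the earlier sketches:

```python
import matplotlib.pyplot as plt

# Naive word count per song; the original pipeline may tokenize differently.
songs["word_count"] = songs["lyrics"].str.split().str.len()

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))
left.boxplot(songs["word_count"])          # distribution of words per song
left.set_ylabel("Words per song")
right.hist(songs["word_count"], bins=10)   # songs per word-count bin
right.set_xlabel("Words per song")
right.set_ylabel("Number of songs")
plt.show()
```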
Next, let's see whether there is any correlation between classes and the average number of unique words within each class.
There is no reason there *should* be, as songs were clustered by the words themselves, rather than the number of words.
Still, there may be some correlation. Perhaps songs with similar lyrics share a similar breadth of vocabulary (number of unique words per song).
There is no correlation between the number of words or unique words and the class a song was assigned to.
This graph is somewhat misleading, because it implies the possibility of some kind of linear relationship between classes and the number of words or unique words (or anything else at all). In reality, the class labels themselves are meaningless; they are arbitrarily assigned by the KMeans algorithm. The information they contain relates to the ways in which songs with the same label relate to one another.
This graph itself, though, conveys valuable information.
If there were a correlation between classes and number of unique words, we would expect to see some kind of clustering along the horizontal axis, for each class along the vertical axis.
There is none of that whatsoever, indicating a complete lack of any correlation between classes and the number of words for songs within them. In fact, the R-Squared (not pictured here) between classes and unique words is ~0.01, or essentially zero.
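That R² can be recomputed as a quick sanity check; a sketch, with unique words counted via the same naive-tokenization assumption as above:

```python
from scipy.stats import linregress

# Unique words per song (naive whitespace tokenization; an assumption).
songs["unique_words"] = songs["lyrics"].str.split().apply(lambda w: len(set(w)))

# Regress vocabulary size on the (arbitrary) cluster label.
fit = linregress(songs["cluster"], songs["unique_words"])
print("R^2 =", fit.rvalue ** 2)  # ~0.01, i.e. essentially zero
```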
Word Count & Unique Words Vs Song Ranking
Is Song Rank a function of the number of unique words in a song?
Is Song Rank a function of a song's word count?
No. With R-Squared values of 0.0037 and 2.4e-6, there is essentially no correlation between the number of unique words - or total words - in a song and the song's ranking.
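The same check per song, assuming a `rank` column holding each song's position on the list (a hypothetical name) and the columns from the earlier sketches:

```python
from scipy.stats import linregress

# Rank vs vocabulary, and rank vs raw word count, per song.
for col in ["unique_words", "word_count"]:
    fit = linregress(songs[col], songs["rank"])
    print(col, "R^2 =", fit.rvalue ** 2)
```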
Is there a correlation between the AVERAGE rank per class and the AVERAGE word count or unique word count per class?
Note that we dropped 3 clusters here; they contained only 1, 1, and 3 songs, respectively.
Had we kept those clusters, individual songs could exert far too much leverage on each cluster's average ranking.
The remaining 5 classes contain 45 of our 50 songs - or 90% of the data.
Each of these has between 6 and 12 songs in it.
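A sketch of the class-level check, dropping the small clusters first as described above (column names follow the earlier sketches):

```python
from scipy.stats import linregress

# Drop clusters with 3 or fewer songs, keeping the 5 classes with 6-12 songs.
counts = songs["cluster"].value_counts()
kept = songs[songs["cluster"].isin(counts[counts > 3].index)]

# Average rank, word count, and vocabulary per remaining cluster.
per_class = kept.groupby("cluster")[["rank", "word_count", "unique_words"]].mean()

for col in ["word_count", "unique_words"]:
    fit = linregress(per_class[col], per_class["rank"])
    print(col, "R^2 =", fit.rvalue ** 2)
```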
Yes. There does appear to be a correlation (13% and 34% of the variance explained, for word count and unique words respectively).
This finding seems to imply that, while a song's word count or vocabulary breadth holds no value as a predictor of its ranking, the average word count or breadth of vocabulary for each kind* of song may actually hold some predictive value for that kind of song's average ranking.
* The easiest analogy to kinds of songs is genre. But these "genres" are not defined by popularly-agreed divisions or titles (e.g. rap, country, rock). The "genres" here are determined** by the songs' lyrical similarities - or dissimilarities - to one another.
** It would be interesting to compare the classifications made by analysis of the lyrics to the genres the songs actually fall under. The question here is: "Do songs that are clustered together tend to fall into the same genre? If so, how strong is the correlation?" This, however, is a job for another time.
Conclusion
1. Songs can be effectively clustered together by analysis of their lyrics.
2. In the context of individual songs, there is no relationship between a song's ranking and its number of words or unique words.
3. After clustering songs into similar kinds, relationships do emerge:
1. There is a weak inverse correlation between the average number of words in a cluster and the average ranking.
* This can account for ~13% of the variance.
2. There is a stronger - but still not huge - inverse correlation between the average number of unique words in a cluster and the average ranking.
* This can account for ~34% of the variance.
Considerations:
1. Small sample size: 50 songs, and the resulting 8 classes (5 of which we kept, accounting for 90% of the songs).
* The statistical significance of a regression (class rank vs. word count) with only 5 data points is *very* limited.
* It is neat, however, to see that there does at least *appear* to be a relationship between word count / vocabulary and ranking.
2. Subjective rankings.
* Genius describes their process as such: "Contributors voted on an initial poll, spent weeks discussing revisions and replacements, and elected to write about their favorite tracks."
* While it does seem that the Genius community at large was polled, and that poll determined the songs that were ultimately selected, the actual ranking was not necessarily reflective of the community at large. Rather, a select group of individual contributors had the final say.
3. Relatively few words per song.
* The average song had between ~300 and ~600 words. This led to relatively *few* repeated words between songs: of the 3,126 unique words found across the entire dataset, only the top 36 appeared more than 100 times cumulatively. In other words, nearly 99% of the words appeared fewer than 100 times across 50 songs, i.e. less than twice per song on average. This gives tremendous leverage to a small number of words (see the sketch after this list).
* While a larger dataset would improve the statistical significance of the regression techniques, it would still not likely change the reality that 1% of the words accounts for nearly all of the leverage in our model. Indeed, the analysis of Reddit's top 50 subreddits that inspired this project clustered entire *forums*, with an associated plethora of data to draw from.
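These vocabulary figures can be reproduced with a quick frequency count; a sketch under the same naive-tokenization assumption as the earlier sketches:

```python
from collections import Counter

# Count every word across all 50 songs.
freq = Counter(word for lyrics in songs["lyrics"] for word in lyrics.split())

print("Unique words:", len(freq))
print("Words with more than 100 occurrences:",
      sum(1 for n in freq.values() if n > 100))
```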