[Figure: scatterplot of the top 100 films, colored by cluster. Legend: Family, home, war; Police, killed, murders; Father, New York, brothers; Dance, singing, love; Killed, soldiers, captain]
How can you learn about the underlying structure of documents in a way that is informative and intuitive? This basic
motivating question led me on a journey to visualize and cluster documents in a two-dimensional space. What you see
above is an output of an analytical pipeline that began by gathering synopses on the top 100 films of all time and ended by
analyzing the latent topics within each document. In between I ran several preprocessing steps on these synopses (tokenization, stemming),
transformed them into a vector space model (tf-idf), and clustered them into groups (k-means). You can learn all about how
I did this with my detailed guide to Document Clustering with Python. But first, what did I learn?
A bit of background
I obtained a list of the top 100 films of all time from an IMDB user list called
"Top 100 Greatest Movies of All Time (The Ultimate List)" by ChrisWalczyk55.
ChrisWalczyk55 claims that "My lists are not based on my own personal favorites;
they are based on the true greatness and/or sucess of the person, place, or thing
being ranked." Ok, sure, whatever. Using this list and its ordinal rankings,
combined with synopses gathered from IMDB and Wikipedia, I was able to separate the films into 5 clusters.
Why 5? Clustering is more art than science. If I had selected 20 clusters they would have been too narrow to allow
me to draw any generalizations. If I had picked 2 or 3 clusters they would have been too broad. Anywhere from 5 to 8 generated a good fit,
but I chose 5 clusters since those groupings were the easiest to interpret.
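The steps named in the introduction (tokenization, tf-idf, k-means) can be sketched in plain Python. This is only a minimal illustration, not the pipeline used for this post: the toy synopses, the helper names `tokenize`, `tfidf_vectors` and `kmeans`, the smoothed idf formula and the deterministic seeding are all assumptions, stemming is left out, and k is 2 rather than 5 so the toy data stays readable. In practice a library such as scikit-learn would likely do this work.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (no stemming here)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return (vocabulary, one L2-normalized tf-idf vector per document)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed idf (an assumed variant) so no term gets a zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, n_iter=20):
    """Plain Lloyd's algorithm; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy synopses, not the real data.
synopses = [
    "The captain leads his soldiers into war",
    "Police investigate the murders in the city",
    "Soldiers follow the captain through the war",
    "The police hunt a killer after two murders",
]
vocab, vectors = tfidf_vectors(synopses)
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy input the two war synopses end up with one label and the two police synopses with the other. Seeding from the first k documents keeps the sketch deterministic; real k-means implementations usually use random or k-means++ initialization, so repeated runs can produce different clusters.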
Understanding the visualization
The visualization at the top of the page is a 2-dimensional scatterplot of the movies (colored by cluster), positioned according to
the cosine distance between them. The dimensions (X and Y) do not actually have labels. The way to interpret the scatterplot is
by examining the location of one film, relative to others, in this 2-d space. Proximity in this space equates to similarity as
determined by a multi-dimensional scaling of the cosine distance (1 minus cosine similarity) between synopses contained
within the term frequency-inverse document frequency (tf-idf) matrix. That was probably confusing and I plan to explain it in a
more detailed write up of my methodology, but the basic intuition is that, based on the collected synopses, each film is plotted in relation
to its similarity to all other films contained in the plot. You might find some weird relationships in this plot: keep in mind
that similarity was measured based on the words found in the film synopses. If a film's synopsis was poorly written or very short,
the results were almost certainly affected. Garbage in, garbage out. Mostly I was interested in exploring the methodology.
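To make "cosine distance (1 minus cosine similarity)" concrete, here is a small sketch of the distance step only. The tf-idf weights, the four example terms and the function names are made up for illustration. The projection to two dimensions is not shown; an MDS routine such as scikit-learn's `sklearn.manifold.MDS` could take a precomputed distance matrix like this one as input.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty synopsis as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Pairwise cosine distances; symmetric, with (near) zeros on the diagonal."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]


# Hypothetical tf-idf weights for three films over four terms
# (columns: "war", "soldiers", "police", "murders").
tfidf = [
    [0.8, 0.6, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.7],
]
dist = distance_matrix(tfidf)
for row in dist:
    print([round(d, 2) for d in row])
```

The first two films share their vocabulary, so their distance is close to 0 and they would be plotted near each other. The third shares no terms with either, so its distance to both is 1 and it would land far away.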
Scoring the clusters
Based on the outcome of the clustering, I used the average rank from the IMDB list to score the clusters (lower is better).
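| Rank | Cluster | Score (average rank) | Count (films) |
|------|---------|----------------------|---------------|
| 1 | Killed, soldiers, captain | 43.7 | 26 |
| 2 | Family, home, war | 47.2 | 25 |
| 3 | Father, New York, brothers | 49.4 | 21 |
| 4 | Dance, singing, love | 54.5 | 12 |
| 5 | Police, killed, murders | 58.8 | 16 |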
You can see that the war movies scored the best. The basic war epic cluster was at the top, followed
closely by family/home with some war mixed in. Family and "New York" or perhaps just cities follows the war grouping.
Dancing, singing, love beats out the crime-ish flicks which, in the scheme of the top 100 movies, tend towards the bottom.
This despite the dominance of the Godfather films.
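The scoring itself is just a grouped average. A minimal sketch, assuming each film is available as a (rank, cluster) pair; the six films and cluster names below are invented for illustration and are not the real assignments:

```python
from collections import defaultdict
from statistics import mean


def score_clusters(films):
    """Return (cluster, average rank, film count) tuples, best score first."""
    ranks_by_cluster = defaultdict(list)
    for rank, cluster in films:
        ranks_by_cluster[cluster].append(rank)
    rows = [(c, mean(r), len(r)) for c, r in ranks_by_cluster.items()]
    # A lower average rank means the cluster sits higher on the list.
    return sorted(rows, key=lambda row: row[1])


# Hypothetical (rank, cluster) pairs.
films = [
    (1, "crime"), (2, "war"), (3, "crime"),
    (4, "family"), (5, "war"), (6, "family"),
]
scores = score_clusters(films)
for position, (cluster, score, count) in enumerate(scores, start=1):
    print(position, cluster, round(score, 1), count)
```

Because the score is a plain mean, a small cluster can be pulled up or down by one or two films, which is worth keeping in mind when comparing a 12-film cluster with a 26-film one.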
Killed, soldiers, captain
| Rank | Title |
|------|-------|
| 2 | The Shawshank Redemption |
| 11 | Lawrence of Arabia |
| 18 | The Sound of Music |
| 20 | Star Wars |
| 22 | 2001: A Space Odyssey |
| 25 | The Bridge on the River Kwai |
| 30 | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb |