Top 100 Films of all Time
How can you learn about the underlying structure of documents in a way that is informative and intuitive? This basic
motivating question led me on a journey to visualize and cluster documents in a two-dimensional space. What you see
above is the output of an analytical pipeline that began by gathering synopses of the top 100 films of all time and ended by
analyzing the latent topics within each document. In between I preprocessed these synopses (tokenization, stemming),
transformed them into a vector space model (tf-idf), and clustered them into groups (k-means). You can learn all about how
I did this in my detailed guide to Document Clustering with Python. But first, what did I learn?
A bit of background
I obtained a list of the top 100 films of all time from an IMDB user list called
"Top 100 Greatest Movies of All Time (The Ultimate List)" by ChrisWalczyk55.
ChrisWalczyk55 claims that "My lists are not based on my own personal favorites;
they are based on the true greatness and/or success of the person, place, or thing
being ranked." Ok, sure, whatever. Using this list and its ordinal rankings,
combined with synopses gathered from IMDB and Wikipedia, I was able to separate the films into 5 clusters.
Why 5? Clustering is more art than science. If I had selected 20 clusters, they would have been too narrow to allow
me to draw any generalizations. If I had picked 2 or 3 clusters, they would have been too broad. Anywhere from 5 to 8 clusters
generated a good fit, but I chose 5 since this gave the most intuitive groupings.
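One common way to put numbers next to that judgment call is to watch how the within-cluster sum of squares (often called inertia) changes as the number of clusters grows. The sketch below is not the clustering code behind this post: it runs a plain Lloyd's k-means on hypothetical 2-d points, and `blobs`, `points`, and the `kmeans` helper are all made up for illustration. The real input was the tf-idf matrix of the film synopses.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    # Label each point with the index of its nearest center.
    return [min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points]

def kmeans(points, k, seed=0, iterations=50):
    """Plain Lloyd's algorithm; returns (labels, inertia)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # Move the center to the mean of its members.
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    inertia = sum(squared_distance(p, centers[label]) for p, label in zip(points, labels))
    return labels, inertia

# Hypothetical toy data: three loose clumps of 20 points each.
rng = random.Random(42)
blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blobs for _ in range(20)]

# Best inertia over a few random starts for each candidate k.
inertias = {k: min(kmeans(points, k, seed=s)[1] for s in range(5)) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

Adding clusters tends to keep lowering this number, so it cannot pick the cluster count by itself. Where the curve flattens is a hint, and interpretability still has to make the final call.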
Understanding the visualization
The visualization at the top of the page is a 2-dimensional scatterplot of the movies (colored by cluster), positioned
according to the cosine distance between their synopses. The dimensions (X and Y) do not have meaningful labels. The way to
interpret the scatterplot is by examining the location of one film, relative to others, in this 2-d space. Proximity in this
space equates to similarity as determined by a multi-dimensional scaling of the cosine distance (1 minus cosine similarity)
between synopses, computed from the term frequency-inverse document frequency (tf-idf) matrix. That was probably confusing and
I plan to explain it in a more detailed write up of my methodology, but the basic intuition is that, based on the collected
synopses, each film is plotted in relation to its similarity to all other films contained in the plot. You might find some weird
relationships in this plot: keep in mind that similarity was measured based on the words found in the film synopses. If the film
synopses were poorly written or very short, the results were almost certainly affected. Garbage in, garbage out. Mostly I was
interested in exploring the methodology.
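To make the distance measure concrete, here is a minimal sketch of tf-idf weighting followed by cosine distance (1 minus cosine similarity). The three synopses are hypothetical stand-ins, the whitespace tokenizer skips the stemming step, and the `tf * (log(N / df) + 1)` weighting is one common variant chosen for illustration, not necessarily the one behind the plot. The multi-dimensional scaling step that projects these distances into 2-d is left out.

```python
import math
from collections import Counter

# Hypothetical toy synopses; the real inputs were IMDB and Wikipedia film synopses.
synopses = {
    "war_a": "soldiers fight the war and the captain is killed",
    "war_b": "the captain leads soldiers into the war",
    "musical": "they dance and sing and fall in love",
}

def tokenize(text):
    # Lowercase and split on whitespace; no stemming or stop-word removal here.
    return text.lower().split()

def tfidf_vectors(docs):
    """Return {name: {term: weight}} with weight = tf * (log(N / df) + 1)."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: in how many synopses does each term appear?
    df = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        vectors[name] = {t: c * (math.log(n_docs / df[t]) + 1.0) for t, c in tf.items()}
    return vectors

def cosine_distance(u, v):
    # 1 minus the cosine of the angle between two sparse term vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 1.0 - dot / (norm_u * norm_v)

vectors = tfidf_vectors(synopses)
names = sorted(vectors)
dist = {(a, b): cosine_distance(vectors[a], vectors[b]) for a in names for b in names}
for a in names:
    print(a, [round(dist[(a, b)], 2) for b in names])
```

Two synopses that share no words at all end up at distance 1, which is why a short or oddly worded synopsis can push a film far from its natural neighbors.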
Scoring the clusters
Based on the outcome of the clustering, I used the average rank from the IMDB list to score the clusters (lower is better).
Rank | Cluster | Score | Count |
1 | Killed, soldiers, captain | 43.7 | 26 |
2 | Family, home, war | 47.2 | 25 |
3 | Father, New York, brothers | 49.4 | 21 |
4 | Dance, singing, love | 54.5 | 12 |
5 | Police, killed, murders | 58.8 | 16 |
You can see that the war movies scored the best. The basic war epic cluster was at the top, followed
closely by family/home with some war mixed in. Family and "New York", or perhaps just cities, follows the war groupings.
Dancing, singing, and love beats out the crime-ish flicks which, in the scheme of the top 100 movies, tend towards the bottom.
This is despite the dominance of the Godfather films.
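The scoring step can be sketched as a mean over each cluster's ranks. One assumption to flag: for the two clusters used here, the score in the table sits exactly one below the plain mean of the listed 1-based ranks, which would be consistent with ranks counted from 0 when the scores were computed. That zero-based offset is inferred from the numbers, not stated anywhere in the post, and `cluster_score` is a hypothetical helper.

```python
from statistics import mean

# Ranks copied from the "Dance, singing, love" and "Father, New York, brothers" tables.
clusters = {
    "Dance, singing, love": [19, 26, 27, 28, 42, 44, 66, 68, 76, 84, 86, 100],
    "Father, New York, brothers": [1, 4, 8, 12, 16, 29, 40, 43, 51, 52, 55, 60,
                                   65, 67, 69, 70, 71, 72, 83, 93, 97],
}

def cluster_score(ranks, zero_based=True):
    # Average rank of a cluster; lower is better.
    # zero_based=True shifts the listed 1-based ranks down by one (an assumption).
    offset = 1 if zero_based else 0
    return mean(r - offset for r in ranks)

scores = {name: cluster_score(ranks) for name, ranks in clusters.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.1f} ({len(clusters[name])} films)")
```

Because the score is a plain average, a small cluster such as Dance, singing, love (12 films) is moved a lot by a single low-ranked entry like the film at rank 100.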
Killed, soldiers, captain
Rank | Title |
2 | The Shawshank Redemption |
11 | Lawrence of Arabia |
18 | The Sound of Music |
20 | Star Wars |
22 | 2001: A Space Odyssey |
25 | The Bridge on the River Kwai |
30 | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb |
32 | Apocalypse Now |
34 | The Lord of the Rings: The Return of the King |
35 | Gladiator |
36 | From Here to Eternity |
37 | Saving Private Ryan |
38 | Unforgiven |
39 | Raiders of the Lost Ark |
49 | Patton |
50 | Jaws |
53 | Butch Cassidy and the Sundance Kid |
54 | The Treasure of the Sierra Madre |
56 | Platoon |
58 | Dances with Wolves |
62 | The Deer Hunter |
63 | All Quiet on the Western Front |
80 | Shane |
81 | The Green Mile |
88 | The African Queen |
90 | Mutiny on the Bounty |
Family, home, war
Rank | Title |
3 | Schindler's List |
6 | One Flew Over the Cuckoo's Nest |
7 | Gone with the Wind |
9 | The Wizard of Oz |
10 | Titanic |
17 | Forrest Gump |
21 | E.T. the Extra-Terrestrial |
23 | The Silence of the Lambs |
33 | Gandhi |
41 | A Streetcar Named Desire |
45 | The Best Years of Our Lives |
46 | My Fair Lady |
47 | Ben-Hur |
48 | Doctor Zhivago |
59 | The Pianist |
61 | The Exorcist |
73 | Out of Africa |
74 | Good Will Hunting |
75 | Terms of Endearment |
78 | Giant |
79 | The Grapes of Wrath |
82 | Close Encounters of the Third Kind |
85 | The Graduate |
89 | Stagecoach |
94 | Wuthering Heights |
Father, New York, brothers
Rank | Title |
1 | The Godfather |
4 | Raging Bull |
8 | Citizen Kane |
12 | The Godfather: Part II |
16 | On the Waterfront |
29 | 12 Angry Men |
40 | Rocky |
43 | To Kill a Mockingbird |
51 | Braveheart |
52 | The Good, the Bad and the Ugly |
55 | The Apartment |
60 | Goodfellas |
65 | City Lights |
67 | It Happened One Night |
69 | Midnight Cowboy |
70 | Mr. Smith Goes to Washington |
71 | Rain Man |
72 | Annie Hall |
83 | Network |
93 | Taxi Driver |
97 | Rear Window |
Dance, singing, love
Rank | Title |
19 | West Side Story |
26 | Singin' in the Rain |
27 | It's a Wonderful Life |
28 | Some Like It Hot |
42 | The Philadelphia Story |
44 | An American in Paris |
66 | The King's Speech |
68 | A Place in the Sun |
76 | Tootsie |
84 | Nashville |
86 | American Graffiti |
100 | Yankee Doodle Dandy |
Police, killed, murders
Rank | Title |
5 | Casablanca |
13 | Psycho |
14 | Sunset Blvd. |
15 | Vertigo |
24 | Chinatown |
31 | Amadeus |
57 | High Noon |
64 | The French Connection |
77 | Fargo |
87 | Pulp Fiction |
91 | The Maltese Falcon |
92 | A Clockwork Orange |
95 | Double Indemnity |
96 | Rebel Without a Cause |
98 | The Third Man |
99 | North by Northwest |