Top 100 Films of All Time

How can you learn about the underlying structure of documents in a way that is informative and intuitive? This motivating question led me on a journey to visualize and cluster documents in a two-dimensional space. What you see above is the output of an analytical pipeline that began by gathering synopses of the top 100 films of all time and ended by analyzing the latent topics within each document. In between, I ran significant manipulations on these synopses (tokenization, stemming), transformed them into a vector space model (tf-idf), and clustered them into groups (k-means). You can learn all about how I did this in my detailed guide to Document Clustering with Python. But first, what did I learn?

A bit of background

I obtained a list of the top 100 films of all time from an IMDB user list called Top 100 Greatest Movies of All Time (The Ultimate List) by ChrisWalczyk55. ChrisWalczyk55 claims that "My lists are not based on my own personal favorites; they are based on the true greatness and/or sucess of the person, place, or thing being ranked." Ok, sure, whatever. Using this list and its ordinal rankings, combined with synopses gathered from IMDB and Wikipedia, I was able to separate the films into 5 clusters. Why 5? Clustering is more art than science. If I had selected 20 clusters, they would have been too narrow to allow me to draw any generalizations. If I had picked 2 or 3 clusters, they would have been too broad. Anywhere from 5 to 8 clusters generated a good fit, but I chose 5 since this gave the most intuitive groupings.
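The trade-off in the number of clusters can be illustrated on toy data. The sketch below is a from-scratch version of Lloyd's algorithm for k-means, not the code used for this post, and the 2-d points are made up for illustration (the real input would be the tf-idf rows for the 100 synopses). It reports the within-cluster sum of squares for several values of k.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # converged
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia

# Hypothetical toy data: three loose groups of 2-d points.
points = [
    (0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
    (5.0, 5.0), (5.4, 4.8), (4.9, 5.5),
    (9.0, 0.0), (9.3, 0.4), (8.8, 0.5),
]

inertia_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3, 5, 9)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 3))
```

The sum of squares tends to shrink as k grows and reaches zero when every point is its own cluster, so the number alone cannot pick k. That is one reason the final choice comes down to which grouping is easiest to interpret.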

Understanding the visualization

The visualization at the top of the page is a 2-dimensional scatterplot of the movies, colored by cluster and positioned according to the cosine distances between them. The dimensions (X and Y) do not actually have labels. The way to interpret the scatterplot is to examine the location of one film, relative to the others, in this 2-d space. Proximity in this space equates to similarity as determined by a multi-dimensional scaling of the cosine distance (1 minus cosine similarity) between the synopses as represented in the term frequency-inverse document frequency (tf-idf) matrix. That was probably confusing, and I plan to explain it in a more detailed write-up of my methodology, but the basic intuition is that, based on the collected synopses, each film is plotted in relation to its similarity to all the other films in the plot. You might find some weird relationships in this plot: keep in mind that similarity was measured based on the words found in the film synopses. If the synopses were poorly written or very short, the results were almost certainly affected. Garbage in, garbage out. Mostly I was interested in exploring the methodology.

Scoring the clusters

Based on the outcome of the clustering, I used the average rank from the IMDB list to score the clusters (lower is better).

Rank  Cluster                      Score  Count
1     Killed, soldiers, captain    43.7   26
2     Family, home, war            47.2   25
3     Father, New York, brothers   49.4   21
4     Dance, singing, love         54.5   12
5     Police, killed, murders      58.8   16
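The scoring step is simple enough to sketch directly. The ranks below are copied from two of the cluster tables in this post. One assumption is worth flagging: the published scores line up with the mean of rank minus 1 (a zero-based list position) rather than the mean of the 1-based ranks shown in the tables. That offset is inferred from the numbers, not something stated in the text, so the sketch exposes it as a hypothetical `zero_based` flag.

```python
from statistics import mean

# Ranks copied from two of the cluster tables in this post.
cluster_ranks = {
    "Dance, singing, love": [19, 26, 27, 28, 42, 44, 66, 68, 76, 84, 86, 100],
    "Police, killed, murders": [5, 13, 14, 15, 24, 31, 57, 64,
                                77, 87, 91, 92, 95, 96, 98, 99],
}

def cluster_score(ranks, zero_based=True):
    """Average rank of a cluster (lower is better).

    zero_based=True subtracts 1 from every rank first; this is an assumption
    made to match the scores in the table, not a documented detail.
    """
    offset = 1 if zero_based else 0
    return mean([rank - offset for rank in ranks])

scores = {name: cluster_score(ranks) for name, ranks in cluster_ranks.items()}

# List clusters from best (lowest average rank) to worst.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.1f} ({len(cluster_ranks[name])} films)")
```

With `zero_based=False` each score comes out exactly one point higher, which does not change the ordering of the clusters.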

You can see that the war movies scored the best. The basic war epic cluster was at the top, followed closely by family/home with some war mixed in. Family and "New York" (or perhaps just cities) follow the war groupings. Dancing, singing, and love beat out the crime-ish flicks which, in the scheme of the top 100 movies, tend towards the bottom. This is despite the dominance of the Godfather films.

Killed, soldiers, captain
Rank Title
2 The Shawshank Redemption
11 Lawrence of Arabia
18 The Sound of Music
20 Star Wars
22 2001: A Space Odyssey
25 The Bridge on the River Kwai
30 Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
32 Apocalypse Now
34 The Lord of the Rings: The Return of the King
35 Gladiator
36 From Here to Eternity
37 Saving Private Ryan
38 Unforgiven
39 Raiders of the Lost Ark
49 Patton
50 Jaws
53 Butch Cassidy and the Sundance Kid
54 The Treasure of the Sierra Madre
56 Platoon
58 Dances with Wolves
62 The Deer Hunter
63 All Quiet on the Western Front
80 Shane
81 The Green Mile
88 The African Queen
90 Mutiny on the Bounty

Family, home, war
Rank Title
3 Schindler's List
6 One Flew Over the Cuckoo's Nest
7 Gone with the Wind
9 The Wizard of Oz
10 Titanic
17 Forrest Gump
21 E.T. the Extra-Terrestrial
23 The Silence of the Lambs
33 Gandhi
41 A Streetcar Named Desire
45 The Best Years of Our Lives
46 My Fair Lady
47 Ben-Hur
48 Doctor Zhivago
59 The Pianist
61 The Exorcist
73 Out of Africa
74 Good Will Hunting
75 Terms of Endearment
78 Giant
79 The Grapes of Wrath
82 Close Encounters of the Third Kind
85 The Graduate
89 Stagecoach
94 Wuthering Heights

Father, New York, brothers
Rank Title
1 The Godfather
4 Raging Bull
8 Citizen Kane
12 The Godfather: Part II
16 On the Waterfront
29 12 Angry Men
40 Rocky
43 To Kill a Mockingbird
51 Braveheart
52 The Good, the Bad and the Ugly
55 The Apartment
60 Goodfellas
65 City Lights
67 It Happened One Night
69 Midnight Cowboy
70 Mr. Smith Goes to Washington
71 Rain Man
72 Annie Hall
83 Network
93 Taxi Driver
97 Rear Window

Dance, singing, love
Rank Title
19 West Side Story
26 Singin' in the Rain
27 It's a Wonderful Life
28 Some Like It Hot
42 The Philadelphia Story
44 An American in Paris
66 The King's Speech
68 A Place in the Sun
76 Tootsie
84 Nashville
86 American Graffiti
100 Yankee Doodle Dandy

Police, killed, murders
Rank Title
5 Casablanca
13 Psycho
14 Sunset Blvd.
15 Vertigo
24 Chinatown
31 Amadeus
57 High Noon
64 The French Connection
77 Fargo
87 Pulp Fiction
91 The Maltese Falcon
92 A Clockwork Orange
95 Double Indemnity
96 Rebel Without a Cause
98 The Third Man
99 North by Northwest