Top 100 Films of all Time

How can you learn about the underlying structure of documents in a way that is informative and intuitive? This basic motivating question led me on a journey to visualize and cluster documents in a two-dimensional space. What you see above is the output of an analytical pipeline that began by gathering synopses of the top 100 films of all time and ended by analyzing the latent topics within each document. In between, I preprocessed these synopses (tokenization, stemming), transformed them into a vector space model (tf-idf), and clustered them into groups (k-means). You can learn all about how I did this in my detailed guide to Document Clustering with Python. But first, what did I learn?
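To make the tokenization and tf-idf steps concrete, here is a minimal sketch in plain Python. It is not the code behind the analysis: the function names, the sample synopses, and the smoothed idf formula are all assumptions chosen for illustration, and stemming is left out because the standard library has no stemmer.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters only (no stemming here)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return one sparse {term: weight} dict per document.

    Weight = raw term count * (ln((1 + n) / (1 + df)) + 1).
    This smoothed idf is one common variant, assumed for illustration.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in counts.items()}
        )
    return vectors


# Hypothetical one-line synopses, not the ones gathered for the analysis.
synopses = [
    "A soldier leads his men through the war.",
    "A family waits at home during the war.",
    "A detective hunts a killer in the city.",
]
vectors = tfidf(synopses)

# Terms unique to one synopsis get a higher weight than shared terms.
for term in ("soldier", "war", "the"):
    print(term, round(vectors[0][term], 3))
```

A term such as "the" that appears in every synopsis gets the lowest possible weight, while a term unique to one synopsis is weighted up, which is what lets the later clustering step focus on distinctive words.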

A bit of background

I obtained a list of the top 100 films of all time from an IMDB user list called Top 100 Greatest Movies of All Time (The Ultimate List) by ChrisWalczyk55. ChrisWalczyk55 claims that "My lists are not based on my own personal favorites; they are based on the true greatness and/or sucess of the person, place, or thing being ranked." Ok, sure, whatever. Using this list and its ordinal rankings, combined with synopses gathered from IMDB and Wikipedia, I was able to separate the films into 5 clusters. Why 5? Clustering is more art than science. If I had selected 20 clusters, they would have been too narrow to allow me to draw any generalizations. If I had picked 2 or 3 clusters, they would have been too broad. Anywhere from 5 to 8 generated a good fit, but I chose 5 clusters since this led to the most intuitive groupings.
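The trade-off between too few and too many clusters can be illustrated with a small k-means sketch. Everything here is an assumption for illustration: the toy 2-d points, the seeding rule (the first k points), and the use of within-cluster squared distance (often called inertia) as a rough guide. The post does not say that this is how the choice of 5 was made.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Inertia: total squared distance from each point to its centroid.
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia


# Hypothetical points forming three well-separated groups.
points = [
    (0, 0), (10, 10), (20, 0),
    (0, 1), (10, 11), (20, 1),
    (1, 0), (11, 10), (21, 0),
]

labels1, inertia1 = kmeans(points, 1)
labels2, inertia2 = kmeans(points, 2)
labels3, inertia3 = kmeans(points, 3)

print(labels3)
print(inertia1, inertia2, inertia3)
```

On this toy data the inertia drops sharply up to k=3 and has little room to improve afterwards. Real synopses rarely separate this cleanly, which is why the final choice still comes down to judgment.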

Understanding the visualization

The visualization at the top of the page is a 2-dimensional scatterplot of the movies, positioned by cosine distance and colored by cluster. The dimensions (X and Y) do not actually have labels. The way to interpret the scatterplot is by examining the location of one film, relative to others, in this 2-d space. Proximity in this space equates to similarity as determined by a multi-dimensional scaling of the cosine distance (1 minus cosine similarity) between synopses contained within the term frequency-inverse document frequency (tf-idf) matrix. That was probably confusing and I plan to explain it in a more detailed write-up of my methodology, but the basic intuition is that, based on the collected synopses, each film is plotted in relation to its similarity to all other films contained in the plot. You might find some weird relationships in this plot: keep in mind that similarity was measured based on the words found in the film synopses. If the film synopses were poorly written or very short, the results were almost certainly affected. Garbage in, garbage out. Mostly I was interested in exploring the methodology.
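The distance measure described above, 1 minus cosine similarity, can be sketched in a few lines of plain Python. The function name and the tiny weight vectors are hypothetical stand-ins for rows of a tf-idf matrix, and the multi-dimensional scaling step that turns these distances into X and Y coordinates is not shown.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        # Convention assumed here: an empty vector is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical weights, not taken from the real synopses.
war_film = {"soldier": 2.0, "war": 1.0}
home_film = {"family": 2.0, "war": 1.0}
crime_film = {"police": 1.0, "murder": 1.0}

print(cosine_distance(war_film, war_film))    # approximately 0: identical direction
print(cosine_distance(war_film, home_film))   # approximately 0.8: one shared term
print(cosine_distance(war_film, crime_film))  # 1.0: no shared terms
```

Two synopses with no words in common are at distance 1 regardless of their length, which is one reason very short synopses can end up in odd places on the plot.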

Scoring the clusters

Based on the outcome of the clustering, I used the average rank from the IMDB list to score the clusters (lower is better).

Rank	Cluster	Score	Count
1	Killed, soldiers, captain	43.7	26
2	Family, home, war	47.2	25
3	Father, New York, brothers	49.4	21
4	Dance, singing, love	54.5	12
5	Police, killed, murders	58.8	16

You can see that the war movies scored the best. The basic war epic cluster was at the top, followed closely by family/home with some war mixed in. Family and "New York", or perhaps just cities, follow the war groupings. Dancing, singing, and love beat out the crime-ish flicks which, in the scheme of the top 100 movies, tend towards the bottom. This is despite the dominance of the Godfather films.
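The scoring step is just an average of ranks within a cluster, sketched below for the twelve films listed under "Dance, singing, love". The variable names are illustrative. The zero-based variant is an assumption: it is shown only because it reproduces the 54.5 in the Score column, whereas the ranks as listed average to 55.5. The original scoring code is not shown, so this is a guess at the cause of the offset rather than a confirmed explanation.

```python
from statistics import mean

# Ranks of the twelve films listed under "Dance, singing, love".
dance_ranks = [19, 26, 27, 28, 42, 44, 66, 68, 76, 84, 86, 100]

# Average of the ranks exactly as listed (1 = best).
one_based = mean(dance_ranks)

# Assumption: ranks counted from 0 instead of 1 before averaging.
zero_based = mean([r - 1 for r in dance_ranks])

print(one_based)   # 55.5
print(zero_based)  # 54.5
```

Either way the ordering of the clusters is unchanged, since subtracting 1 from every rank shifts all five scores by the same amount.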

Killed, soldiers, captain
Rank	Title
2	The Shawshank Redemption
11	Lawrence of Arabia
18	The Sound of Music
20	Star Wars
22	2001: A Space Odyssey
25	The Bridge on the River Kwai
30	Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
32	Apocalypse Now
34	The Lord of the Rings: The Return of the King
35	Gladiator
36	From Here to Eternity
37	Saving Private Ryan
38	Unforgiven
39	Raiders of the Lost Ark
49	Patton
50	Jaws
53	Butch Cassidy and the Sundance Kid
54	The Treasure of the Sierra Madre
56	Platoon
58	Dances with Wolves
62	The Deer Hunter
63	All Quiet on the Western Front
80	Shane
81	The Green Mile
88	The African Queen
90	Mutiny on the Bounty

Family, home, war
Rank	Title
3	Schindler's List
6	One Flew Over the Cuckoo's Nest
7	Gone with the Wind
9	The Wizard of Oz
10	Titanic
17	Forrest Gump
21	E.T. the Extra-Terrestrial
23	The Silence of the Lambs
33	Gandhi
41	A Streetcar Named Desire
45	The Best Years of Our Lives
46	My Fair Lady
47	Ben-Hur
48	Doctor Zhivago
59	The Pianist
61	The Exorcist
73	Out of Africa
74	Good Will Hunting
75	Terms of Endearment
78	Giant
79	The Grapes of Wrath
82	Close Encounters of the Third Kind
85	The Graduate
89	Stagecoach
94	Wuthering Heights

Father, New York, brothers
Rank	Title
1	The Godfather
4	Raging Bull
8	Citizen Kane
12	The Godfather: Part II
16	On the Waterfront
29	12 Angry Men
40	Rocky
43	To Kill a Mockingbird
51	Braveheart
52	The Good, the Bad and the Ugly
55	The Apartment
60	Goodfellas
65	City Lights
67	It Happened One Night
69	Midnight Cowboy
70	Mr. Smith Goes to Washington
71	Rain Man
72	Annie Hall
83	Network
93	Taxi Driver
97	Rear Window

Dance, singing, love
Rank	Title
19	West Side Story
26	Singin' in the Rain
27	It's a Wonderful Life
28	Some Like It Hot
42	The Philadelphia Story
44	An American in Paris
66	The King's Speech
68	A Place in the Sun
76	Tootsie
84	Nashville
86	American Graffiti
100	Yankee Doodle Dandy

Police, killed, murders
Rank	Title
5	Casablanca
13	Psycho
14	Sunset Blvd.
15	Vertigo
24	Chinatown
31	Amadeus
57	High Noon
64	The French Connection
77	Fargo
87	Pulp Fiction
91	The Maltese Falcon
92	A Clockwork Orange
95	Double Indemnity
96	Rebel Without a Cause
98	The Third Man
99	North by Northwest