Or, how you can extract meaningful information from raw text and use it to analyze the networks of individuals hidden within your data set.
We are all drowning in text. Fortunately, there are a number of data science strategies for handling the deluge. If you'd like to learn about using machine learning for this, check out my guide on document clustering. In this guide I'm going to walk you through a strategy for making sense of massive troves of unstructured text using entity extraction and network analysis. These strategies are actively employed for legal e-discovery and within law enforcement and the intelligence community. Imagine you work at the FBI and you just uncovered a massive trove of documents on a confiscated laptop or server. What would you do? This guide offers an approach for dealing with this type of scenario. By the end of it you'll have generated a graph like the one above, which you can use to analyze the network hidden within your data set.
We are going to take a set of documents (in our case, news articles), extract entities from within them, and develop a social network based on entity document co-occurrence. This can be a useful approach for getting a sense of which entities exist in a set of documents and how those entities might be related. I'll talk more about using document co-occurrence as the mechanism for drawing an edge in a social network graph later.
In this guide I rely on 4 primary pieces of software:
If you're not familiar with these libraries, don't worry, I'll make it easy to get off to the races with them in no time.
Note that my github repo for the whole project is available. You can use corpus.txt as a sample data set if you'd like. Also, make sure to grab the force directory when you try to run this on your own. You need force/force.html, force/force.css, and force/force.js in order to create the chart at the end of the guide.
If you have any questions for me, feel free to reach out on Twitter to @brandonmrose or open up an issue on the github repo.
First, we need to get Core NLP running on Docker. If you're not familiar with Docker, that's OK! It's an easy-to-use containerization service. The concept is simple: anywhere you can run Docker, you can run a Docker container. Period. No need to worry about dependency management; just get Docker running and pull down the container you need. Easy.
Stanford Core NLP is one of the most popular natural language processing tools out there. It has a ton of functionality which includes part of speech tagging, parsing, lemmatization, tokenization, and what we are interested in: named entity recognition (NER). NER is the process of analyzing text in order to find people, places, and organizations contained within the text. These named entities will form the basis of the rest of our analysis, so being able to extract them from text is critical.
Docker now has great installation instructions (trust me, this wasn't always the case). I'm using a Mac so I followed their Mac OSX Docker installation guide. If you're using Windows check out their Windows install guide. If you're using Linux I'm pretty sure you'll be able to get Docker installed on your own.
To verify the installation was successful go to your command line and try running:
docker ps
You should see an empty docker listing that looks like this (I truncated a couple columns, but you get the idea):
CONTAINER ID IMAGE COMMAND CREATED
If this isn't empty, you already had docker running with a container. If you are not able to run the docker or docker ps commands from your command line, STOP. You need to get this installed before continuing.
This part is pretty easy. You just need to run the following command at your command line:
docker run -p 9000:9000 --name coreNLP --rm -i -t motiz88/corenlp
This will pull motiz88's Docker port of Core NLP and run it using port 9000. This means that port 9000 from the container will be forwarded to port 9000 on your localhost (your computer). So, you can access the Core NLP API over http://localhost:9000. Note that this is a fairly large container so it may take a few minutes to download and install.
To make sure that the server is running, in your browser go to http://localhost:9000. You should see:
If you don't, don't move forward until you can verify the Core NLP server is running. You might try docker ps to see if the container is listed. If it is, you can scope out the logs with docker logs coreNLP. If it is running, feel free to play around with the server UI. Input some text to get a feel for how it works!
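Under the hood, the pycorenlp wrapper we'll use next just talks to this same HTTP endpoint. If you'd rather verify the server from Python instead of the browser, a minimal sanity check with the requests library looks something like the sketch below (an optional aside, not part of the original pipeline; it assumes the server accepts raw text in the POST body with the annotation properties passed as a URL parameter):

import requests

# POST a sentence to the Core NLP server; a 200 response means it's up and annotating
resp = requests.post(
    'http://localhost:9000/',
    params={'properties': '{"annotators": "ner", "outputFormat": "json"}'},
    data='Bill and Ted are excellent!'.encode('utf-8'))
print(resp.status_code)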
To use Core NLP Server, we are going to leverage the pycorenlp Python wrapper, which can be installed with pip install pycorenlp. Once that's installed, you can instantiate a connection with the coreNLP server.
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')
Next, let's take a look at the basic functionality by feeding a few sentences of text to the coreNLP server:
text = ("Bill and Ted are excellent! "
"Pusheen Smith and Jillian Marie walked along the beach; Pusheen led the way. "
"Pusheen wanted to surf, but fell off the surfboard. "
"They are both friends with Jean Claude van Dam, Sam's neighbor.")
output = nlp.annotate(text, properties={
'annotators': 'ner',
'outputFormat': 'json'
})
print('The output object has keys: {}'.format(output.keys()))
print('Each sentence object has keys: {}'.format(output['sentences'][0].keys()))
The output object, as you can see for yourself, is extremely verbose. It's composed of a top-level key called sentences which contains one object per sentence. Each sentence object has an array of token objects that can be accessed at output['sentences'][i]['tokens'], where i is the index (e.g. 0, 1, 2, etc.) of the sentence of interest.
What is a token, you ask? Typically in natural language processing (NLP) when you process text you want to tokenize it. This means splitting the text into its respective components at the word and punctuation level. So, the sentence 'The quick brown fox jumped over the lazy dog.' would be tokenized into an array that looks like: ['The','quick','brown','fox','jumped','over','the','lazy','dog','.']. Some tokenizers ignore punctuation; others retain it.
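Each token that Core NLP returns is a dictionary of annotations. As a quick illustration (a sketch; 'word' and 'ner' are the fields we'll rely on below), we can peek at the first token of the first sentence:

# inspect one token object from the annotated output
first_token = output['sentences'][0]['tokens'][0]
print(first_token['word'], first_token['ner'])

This should print something like Bill PERSON.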
You can print out the output if you're interested in seeing what it looks like. That said, we need to be able to identify the people that the ner (Named Entity Recognition) module discovered. So, let's go ahead and define a function which takes a set of sentence tokens and finds the tokens which were labeled as PERSON. This gets a little tricky, as individual tokens can be labeled as PERSON when they actually correspond to the same person. For example, the tokens Jean, Claude, van, and Dam all correspond to the same person. So, the function below takes tokens which are contiguous (next to one another) within the same sentence and combines them into the same person entity. Perfect!
By the way, this proc_sentence function is not very Pythonic. Ideas for doing this more efficiently are welcome! (One more compact option is sketched right after the function.)
def proc_sentence(tokens):
    """
    Takes as input a set of tokens from Stanford Core NLP output and returns
    the set of people found within the sentence. This relies on the fact that
    named entities which are contiguous within a sentence should be part of
    the same name. For example, in the following:

    [
        {'word': 'Brandon', 'ner': 'PERSON'},
        {'word': 'Rose', 'ner': 'PERSON'},
        {'word': 'eats', 'ner': 'O'},
        {'word': 'bananas', 'ner': 'O'}
    ]

    we can safely assume that the contiguous PERSONs Brandon + Rose are part of the
    same named entity, Brandon Rose.
    """
    people = set()
    token_count = 0
    for i in range(len(tokens)):
        if token_count < len(tokens):
            person = ''
            token = tokens[token_count]
            if token['ner'] == 'PERSON':
                # keep consuming tokens as long as they are also tagged PERSON
                person += token['word'].lower()
                checking = True
                while checking == True:
                    if token_count + 1 < len(tokens):
                        if tokens[token_count + 1]['ner'] == 'PERSON':
                            token_count += 1
                            person += ' {}'.format(tokens[token_count]['word'].lower())
                        else:
                            checking = False
                            token_count += 1
                    else:
                        checking = False
                        token_count += 1
            else:
                token_count += 1
            if person != '':
                people.add(person)
    return people
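For example, here is a more compact sketch of the same idea using itertools.groupby (an alternative, not part of the original pipeline): group consecutive tokens by whether they are tagged PERSON and join each contiguous run into a single name.

from itertools import groupby

def proc_sentence_grouped(tokens):
    """Alternative sketch: join each contiguous run of PERSON tokens into one name."""
    people = set()
    for is_person, run in groupby(tokens, key=lambda t: t['ner'] == 'PERSON'):
        if is_person:
            people.add(' '.join(t['word'].lower() for t in run))
    return people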
Let's take a look at the people which we can extract from each of the sentences. Note that the output of the proc_sentence function is a set, which means that it will only contain unique people entities.
for sent in output['sentences']:
    people = proc_sentence(sent['tokens'])
    print(people)
As you can see, we receive a set of the extracted people entities from each sentence. We can combine the results into a superset:
people_super = set()

for sent in output['sentences']:
    people = proc_sentence(sent['tokens'])
    for person in people:
        people_super.add(person)

print(people_super)
Looking good, except notice that we see two items for Pusheen: 'pusheen' and 'pusheen smith'. We've done a decent job of entity extraction, but we need to take some additional steps for entity resolution.
If entity extraction is the process of finding entities (in this case, people) within a body of text, then entity resolution is the process of putting like with like. As humans we know that pusheen and pusheen smith are the same person. How do we get a computer to do the same?
There are many approaches that you can take for this, but we are going to use fuzzy deduplication found within a Python package called fuzzywuzzy (pip install fuzzywuzzy). Specifically, we'll use the fuzzy deduplication function (shameless plug, this is something I contributed to the fuzzywuzzy project). We can use the defaults, however you are welcome to tune the parameters.
Note that you may be asked to optionally install python-Levenshtein to speed up fuzzywuzzy; you can do this with pip install python-Levenshtein.
As an example of what fuzzy deduping is, let's try it!
from fuzzywuzzy.process import dedupe as fuzzy_dedupe
From our last step, we already have a list containing duplicates where some entities are partial representations of others (pusheen vs. pusheen smith). Using fuzzywuzzy's dedupe function we can take care of this pretty easily. Fuzzywuzzy defaults to returning the longest representation of the resolved entity, as it assumes this contains the most information. So, we expect to see pusheen resolve to pusheen smith. Also, fuzzywuzzy can handle slight misspellings.
contains_dupes = list(people_super)
fuzzy_dedupe(contains_dupes)
That looks like a useful list of entities to me!
For this guide I'll be using a selection of news articles from Breitbart's Big Government section. Who knows, maybe we'll gain some insights into the networks at play in "Big Government." Could be fun.
To get the articles, I'm using Newspaper. I'm going to scrape about 150 articles off the Breitbart Big Government section.
If you have your own data that's cool too. When you load the data it should be in JSON form:
{
    0: {'article': 'some article text here'},
    1: {'article': 'some other article text here'},
    ...
    n: {'article': 'the nth article text'}
}
import requests
import json
import time
import newspaper
First, we need to profile the site to find articles:
breitbart = newspaper.build('http://www.breitbart.com/big-government/')
Now we can actually download them:
corpus = []
count = 0

for article in breitbart.articles:
    time.sleep(1)
    article.download()
    article.parse()
    text = article.text
    corpus.append(text)
    if count % 10 == 0 and count != 0:
        print('Obtained {} articles'.format(count))
    count += 1
Since this type of scraping can lead to your IP address getting flagged by some news sites, I've added a small sleep of 1 second between each article download. Just in case we do get flagged, let's make sure to save our corpus to disk. If you have a hard time using newspaper to get data, you can just load up the data from corpus.txt within the github repo.
# append each article to corpus.txt as a one-line JSON record of {index: article text}
with open('corpus.txt', 'a') as fp:
    count = 0
    for item in corpus:
        loaded = item.encode('utf-8')
        loaded_j = {count: loaded}
        fp.write(json.dumps(loaded_j) + '\n')
        count += 1
We can read back in the data we wrote to disk in the format of:
data[index]: {'article': 'article text'}
where the index is the order we read in the data.
data = {}

with open('corpus.txt', 'r') as fp:
    for line in fp:
        item = json.loads(line)
        key = int(list(item.keys())[0])
        value = list(item.values())[0].encode('ascii', 'ignore')
        data[key] = {'article': value}
Now let's get the entities for each of the articles we've grabbed. We'll write the results back to the data dictionary in the format:
data[index]: {
    'article': article text,
    'people': [person entities]
}
Let's make a function that wraps up both using the Core NLP Server and Fuzzywuzzy to return the correct entities:
def proc_article(article):
    """
    Wrapper for coreNLP and fuzzywuzzy entity extraction and entity resolution.
    """
    output = nlp.annotate(article, properties={
        'annotators': 'ner',
        'outputFormat': 'json'
    })
    people_super = set()
    for sent in output['sentences']:
        people = proc_sentence(sent['tokens'])
        for person in people:
            people_super.add(person)
    contains_dupes = list(people_super)
    deduped = fuzzy_dedupe(contains_dupes)
    return deduped
We can now process each article we downloaded. Note that sometimes newspaper will return an empty article, so we double check for these to make sure that we don't try to send them to Core NLP Server.
fail_keys = []

for key in data:
    # make sure that the article actually has text
    if data[key]['article'] != '':
        people = proc_article(str(data[key]['article']))
        data[key]['people'] = people
    # if it's an empty article, let's save the key in `fail_keys`
    else:
        fail_keys.append(key)

# now let's ditch any pesky empty articles
for key in fail_keys:
    data.pop(key)
Now we need to actually generate the network graph. I'll use the Python library networkx (pip install networkx) to build it. To do this, I need to generate a dictionary of entities where each key is a unique entity and the values are a list of vertices that entity is connected to via an edge. For example, here we are indicating that George Clooney is connected to Bill Murray, Brad Pitt, and Seth Myers and has the highest degree centrality in the social network (due to having the highest number of edges):
{'George Clooney': ['Bill Murray', 'Brad Pitt', 'Seth Myers'],
 'Bill Murray': ['Brad Pitt', 'George Clooney'],
 'Seth Myers': ['George Clooney'],
 'Brad Pitt': ['Bill Murray', 'George Clooney']}
import networkx as nx
from networkx.readwrite import json_graph
from itertools import combinations
from fuzzywuzzy.process import extractBests
Before we get started building our graph, we need to conduct entity resolution across our document corpus. We already did this at the document level, but surely different articles will refer to the President and others in different ways (e.g. "Donald Trump", "Donald J. Trump", "President Trump"). We can deal with this in the same way we handled differences within the same article with a slight addition: we need to build a lookup dictionary so that we can quickly convert the original entity into its resolved form.
person_lookup = {}

for kk, vv in data.items():
    for person in vv['people']:
        person_lookup[person] = ''

people_deduped = list(fuzzy_dedupe(list(person_lookup.keys())))

# manually add the donald back in since fuzzy_dedupe will preference donald trump jr.
people_deduped.append('donald trump')

for person in person_lookup.keys():
    match = extractBests(person, people_deduped)[0][0]
    person_lookup[person] = match
Let's see if this works:
print('donald trump resolves to: {}'.format(person_lookup['donald trump']))
print('donald j. trump resolves to: {}'.format(person_lookup['donald j. trump']))
print('donald trumps resolves to: {}'.format(person_lookup['donald trumps']))
Looks good! This way we don't have multiple entities in our graph representing the same person. Now we can go about building an adjacency dictionary, which we'll call entities.
entities = {}

for key in data:
    people = data[key]['people']
    doc_ents = []
    for person in people:
        # let's make sure the person is a full name (has a space between two words)
        # let's also make sure that the total person name is at least 10 characters
        if ' ' in person and len(person) > 10:
            # note we will use our person_lookup to get the resolved person entity
            doc_ents.append(person_lookup[person])
    for ent in doc_ents:
        try:
            entities[ent].extend([doc for doc in doc_ents if doc != ent])
        except KeyError:
            entities[ent] = [doc for doc in doc_ents if doc != ent]
From here we need to actually build out the networkx graph. We can create a function which iteratively builds a networkx graph based on an entity adjacency dictionary:
def network_graph(ent_dict):
    """
    Takes in an entity adjacency dictionary and returns a networkx graph
    """
    index = ent_dict.keys()
    g = nx.Graph()
    for ind in index:
        ents = ent_dict[ind]
        # Add previously unseen entities as nodes; each node's 'degree' attribute
        # is the length of its own adjacency list
        for ent in ents:
            if ent not in g:
                g.add_node(ent,
                           name=ent,
                           type='person',
                           degree=str(len(ent_dict.get(ent, []))))
    for ind in index:
        ents = ent_dict[ind]
        for edge in ents:
            if edge in index:
                new_edge = (ind, edge)
                if new_edge not in g.edges():
                    g.add_edge(ind, edge)
    js = json_graph.node_link_data(g)
    js['adj'] = g.adj
    return (g, js)
Now we can use our function to build the graph:
graph = network_graph(entities)[0]
Before we continue, we can do some cool things with our graph. One of them is determining who the most important people in our network are. We can quickly do this using degree centrality, which is a measure of the number of edges a node in our graph has. In this case, each node in the graph represents a person entity which was extracted from the Breitbart articles. The more people that a given individual co-occurred with the higher the degree of that node and the stronger his or her degree centrality.
This image demonstrates how the degree of each node is calculated:
When we calculate degree centrality with networkx we get back normalized degree centrality scores: the degree of a node divided by the maximum possible degree within the graph (N-1, where N is the number of nodes in the graph). Note that the terms node and vertex can be taken to mean the same thing in network analysis. I prefer the term node.
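If the normalization is unclear, here is a toy example (separate from our Breitbart data; the node names are made up) showing what networkx returns: a node connected to all three other nodes in a four-node graph gets a score of 3/(4-1) = 1.0.

import networkx as nx

toy = nx.Graph()
toy.add_edges_from([('a', 'b'), ('a', 'c'), ('a', 'd')])
# 'a' has degree 3 out of a possible 3; the others have degree 1 out of 3
print(nx.degree_centrality(toy))
# {'a': 1.0, 'b': 0.333..., 'c': 0.333..., 'd': 0.333...}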
We can take a guess that since these articles were from Breitbart's government section there will be a number of articles referencing Donald Trump. So, we can assume he'll be at the top of the list. Who else will bubble up based on the number of people they are referenced along with? Let's find out!
centrality = nx.degree_centrality(graph)

centrality_ = []
for kk, vv in centrality.items():
    centrality_.append((vv, kk))
centrality_.sort(reverse=True)

for person in centrality_[:10]:
    print("{0}: {1}".format(person[1], person[0]))
The fact that someone co-occurs in a document with another person does not, by itself, tell us anything specific. We can't tell whether they are friends, lovers, enemies, etc. However, when we do this type of analysis in aggregate we can begin to see patterns. For example, if Donald Trump and Vladimir Putin co-occur in a large number of documents, we can assume that there is a dynamic or some sort of relationship between the two entities. There are approaches to entity extraction which attempt to explain the relationship between entities within documents, which might be fodder for another guide later on.
All that said, typically this type of analysis requires a human in the loop (HITL) to validate the results since without additional context we can't tell exactly what to make of the fact that Donald Trump and Vladimir Putin appear to have a relationship within our graph's context.
With that caveat, we are about ready to visualize our graph. Since these types of graphs are best explored as interactives, we are going to rely on some javascript and HTML to render the graph. You'll need to ensure that you copy the force directory from the github repo for this project so that you have access to the correct .css, .html, and .js files to build the charts.
You actually can generate static network plots using networkx and matplotlib, but they aren't very fun and are hard to read. So, I've packaged up some d3.js for you that will render the plots within an iframe.
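If you do want a quick static snapshot anyway, something along these lines should work (a sketch; it assumes you have matplotlib installed, which isn't otherwise used in this guide):

import matplotlib.pyplot as plt

# quick-and-dirty static rendering of the co-occurrence graph
nx.draw_networkx(graph, with_labels=True, node_size=50, font_size=8)
plt.axis('off')
plt.show()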
from networkx.readwrite import json_graph

# iterate over a copy of the node list since we remove nodes as we go
for node in list(graph.nodes()):
    # let's drop any node that has a degree less than 13
    # this is somewhat arbitrary, but helps trim our graph so that we
    # only focus on relatively high degree entities
    if int(graph.node[node]['degree']) < 13:
        graph.remove_node(node)

d = json_graph.node_link_data(graph)  # node-link format to serialize
json.dump(d, open('force/force.json', 'w'))
Here's what's cool: we're going to embed an iframe within the notebook. However, if you want, you've also got the d3.js-based javascript and HTML code to pop this into your own website. I've done a couple customizations to this network diagram, including adding a tooltip with the entity name when you hover over a node. Also, the nodes are sticky: you can click a node and freeze it wherever you would like. This can seriously help when you are trying to understand what you are looking at.
%%HTML
<iframe height=400px width=100% src='force/force.html'></iframe>