Home

Entity Extraction and Network Analysis

Or, how you can extract meaningful information from raw text and use it to analyze the networks of individuals hidden within your data set. Network Diagram

We are all drowning in text. Fortunately there are a number of data science strategies for handling the deluge. If you'd like to learn about using machine learning for this check out my guide on document clustering. In this guide I'm going to walk you through a strategy for making sense of massive troves of unstructured text using entity extration and network analysis. These strategies are actively employed for legal e-discovery and within law enforcement and the intelligence community. Imagine you work at the FBI and you just uncovered a massive trove of documents on a confiscated laptop or server. What would you do? This guide offers an approach for dealing with this type of scenario. By the end of it you'll have generated a graph like the one above, which you can use to analyze the network hidden within your data set.

Overview

We are going take a set of documents (in our case, news articles), extract entities from within them, and develop a social network based on entity document co-occurrence. This can be a useful approach for getting a sense of which entities exist in a set of documents and how those entities might be related. I'll talk more about using document co-occurrence as the mechanism for drawing an edge in a social network graph later.

In this guide I rely on 4 primary pieces of software:

  1. Stanford Core NLP
  2. Fuzzywuzzy
  3. Networkx
  4. D3.js

If you're not familiar with these libraries, don't worry, I'll make it easy to get off to the races with them in no time.

Note that my github repo for the whole project is available. You can use corpus.txt as a sample data set if you'd like. Also, make sure to capture the force directory when you try to run this on your own. You need force/force.html, force/force.css, and force/force.js in order to create the chart at the end of the guide.

If you have any questions for me, feel free to reach out on Twitter to @brandonmrose or open up an issue on the github repo.

Installing CoreNLP with Docker

First, we need to get Core NLP running on Docker. If you're not familiar with Docker, that's ok! It's an easy to use containerization service. The concept is that anywhere you can run docker you can run a docker container. Period. No need to worry about dependency management, just get docker running and pull down the container you need. Easy.

Stanford Core NLP is one of the most popular natural language processing tools out there. It has a ton of functionality which includes part of speech tagging, parsing, lemmatization, tokenization, and what we are interested in: named entity recognition (NER). NER is the process of analyzing text in order to find people, places, and organizations contained within the text. These named entities will form the basis of the rest of our analysis, so being able to extract them from text is critical.

Installing Docker

Docker now has great installation instructions (trust me, this wasn't always the case). I'm using a Mac so I followed their Mac OSX Docker installation guide. If you're using Windows check out their Windows install guide. If you're using Linux I'm pretty sure you'll be able to get Docker installed on your own.

To verify the installation was successful go to your command line and try running:

docker ps

You should an empty docker listing that looks like (I truncated a couple columns, but you get the idea):

CONTAINER ID        IMAGE               COMMAND             CREATED

If this isn't empty you already had docker running with a container. If you are not able to run the docker or docker ps commands from your command line, STOP. You need to get this installed before continuing.

Installing the Core NLP container

This part is pretty easy. You just need to run the following command at your command line:

docker run -p 9000:9000 --name coreNLP --rm -i -t motiz88/corenlp

This will pull motiz88's Docker port of Core NLP and run it using port 9000. This means that port 9000 from the container will be forwarded to port 9000 on your localhost (your computer). So, you can access the Core NLP API over http://localhost:9000. Note that this is a fairly large container so it may take a few minutes to download and install.

To make sure that the server is running, in your browser go to http://localhost:9000. You should see: Stanford Core NLP Server

If you don't, don't move forward until you can verify the Core NLP server is running. You might try docker ps to see if the container is listed. If it is, you can scope out the logs with docker logs coreNLP. If it is running feel free to play around with the server UI. Input some text to get a feel for how it works!

Entity Extraction with Core NLP Server

To use Core NLP Server, we are going to leverage the pycorenlp Python wrapper which can be installed with pip install pycorenlp. Once that's installed, you can instantiate a connection with the coreNLP server.

In [1]:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')

Next, let's take a look at the basic functionality by feeding a few sentences of text to the coreNLP server

In [2]:
text = ("Bill and Ted are excellent! "
        "Pusheen Smith and Jillian Marie walked along the beach; Pusheen led the way. "
        "Pusheen wanted to surf, but fell off the surfboard. "
        "They are both friends with Jean Claude van Dam, Sam's neighbor.")
output = nlp.annotate(text, properties={
  'annotators': 'ner',
  'outputFormat': 'json'
  })

print('The output object has keys: {}'.format(output.keys()))
print('Each sentence object has keys: {}'.format(output['sentences'][0].keys()))
The output object has keys: dict_keys(['sentences'])
Each sentence object has keys: dict_keys(['index', 'parse', 'tokens'])

The output object, as you can see for yourself is extremely verbose. It's comprised of a top-level key called sentences which contains one object per sentence. Each sentence object has an array of token objects that can be accessed at output['sentences'][i]['tokens'] where i is the index (e.g. 0, 1, 2, etc) of the sentence of interest.

What is a token you ask? Typically in natural language processing (NLP) when you process text you want to tokenize it. This means splitting the text into its respective components at the word and punctuation level. So, the sentence 'The quick brown fox jumped over the lazy dog.' would be tokenized into an array that looks like: ['The','quick','brown','fox','jumped','over','the','lazy','dog','.']. Some tokenizers ignore punctuation; others retain it.

You can print out the output if you're interested in seeing what it looks like. That said, we need to be able to identify the people that the ner or Named Entity Recognition module discovered. So, let's go ahead and define a function which takes a set of sentence tokens and finds the tokens which were labeled as PERSON. This gets a little tricky as individual tokens can be labeled as PERSON when they actually correspond to the same person. For example, the tokens Jean, Claude, van, and Dam all correspond to the same person. So, the function below take tokens which are contiguous (next to one another) within the same sentence and combines them into the same person entity. Perfect!

By the way, this proc_sentence function is not very Pythonic. Ideas for doing this more efficiently are welcome!

In [3]:
def proc_sentence(tokens):
    """
    Takes as input a set of tokens from Stanford Core NLP output and returns 
    the set of peoplefound within the sentence. This relies on the fact that
    named entities which are contiguous within a sentence should be part of 
    the same name. For example, in the following:
    [
        {'word': 'Brandon', 'ner': 'PERSON'},
        {'word': 'Rose', 'ner': 'PERSON'},
        {'word': 'eats', 'ner': 'O'},
        {'word': 'bananas', 'ner': 'O'}
    ]
    we can safely assume that the contiguous PERSONs Brandon + Rose are part of the 
    same named entity, Brandon Rose.
    """
    people = set()
    token_count = 0
    for i in range(len(tokens)):
        if token_count < len(tokens):
            person = ''
            token = tokens[token_count]
            if token['ner'] == 'PERSON':
                person += token['word'].lower()
                checking = True
                while checking == True:
                    if token_count + 1 < len(tokens):
                        if tokens[token_count + 1]['ner'] == 'PERSON':
                            token_count += 1
                            person += ' {}'.format(tokens[token_count]['word'].lower())
                        else:
                            checking = False
                            token_count += 1
                    else:
                        checking = False
                        token_count += 1
            else:
                token_count += 1
            if person != '':
                people.add(person)
    return people

Let's take a look at the people which we can extract from each of the sentences. Note that the output of the proc_sentence function is a set, which means that it will only contain unique people entities.

In [4]:
for sent in output['sentences']:
    people = proc_sentence(sent['tokens'])
    print(people)
{'ted', 'bill'}
{'pusheen smith', 'pusheen', 'jillian marie'}
{'pusheen'}
{'sam', 'jean claude van dam'}

As you can see, we receive a set of the extracted people entities from each sentence. We can join the results with a superset:

In [5]:
people_super = set()
for sent in output['sentences']:
    people = proc_sentence(sent['tokens'])
    for person in people:
        people_super.add(person)

print(people_super)
{'pusheen', 'bill', 'sam', 'jean claude van dam', 'jillian marie', 'ted', 'pusheen smith'}

Looking good, except notice that we see two items for Pusheen: 'pusheen' and 'pusheen smith'. We've done a decent job of entity extraction, but we need to take some additional steps for entity resolution.

Entity Resolution with Fuzzywuzzy

If entity extraction is the process of finding entities (in this case, people) within a body of text then entity resolution is the process of putting like with like. As humans we know that pusheen and pusheen smith are the same person. How do we get a computer to do the same?

There are many approaches that you can take for this, but we are going to use fuzzy deduplication found within a Python package called fuzzywuzzy (pip install fuzzywuzzy). Specifically, we'll use the fuzzy deduplication function (shameless plug, this is something I contributed to the fuzzywuzzy project). We can use the defaults, however you are welcome to tune the parameters.

Note that you may be asked to optionally install python-Levenshtein to speed up fuzzywuzzy; you can do this with pip install python-Levenshtein.

As an example of what fuzzy deduping is, let's try it!

In [10]:
from fuzzywuzzy.process import dedupe as fuzzy_dedupe

From our last step, we already have a list containing duplicates where some entities are partial representations of the other (pusheen vs. pusheen smith). Using fuzzywuzzy's dedupe function we can take care of this pretty easily. Fuzzywuzzy defaults to returning the longest representation of the resolved entity as it assumes this contains the most information. So, we expect to see pusheen resolve to pusheen smith. Also, fuzzywuzzy can handle slight mispellings.

In [12]:
contains_dupes = list(people_super)
fuzzy_dedupe(contains_dupes)
Out[12]:
dict_keys(['pusheen smith', 'bill', 'sam', 'jean claude van dam', 'jillian marie', 'ted'])

That looks like a useful list of entities to me!

Getting some data

For this guide I'll be using a selection of news articles from Breitbart's Big Government section. Who knows, maybe we'll gain some insights into the networks at play in "Big Government." Could be fun.

To get the articles, I'm using Newspaper. I'm going to scrape about 150 articles off the Breitbart Big Government section.

If you have your own data that's cool too. When you load the data it should be in JSON form:

{
    0: {'article': 'some article text here'},
    1: {'article': 'some other articles text here'},
    ...
    n: {'article': 'the nth articles text'}
}
In [302]:
import requests
import json
import time
import newspaper

First, we need to profile the site to find articles:

In [14]:
breitbart = newspaper.build('http://www.breitbart.com/big-government/')

Now we can actually download them

In [112]:
corpus = []
count = 0
for article in breitbart.articles:
    time.sleep(1)
    article.download()
    article.parse()
    text = article.text
    corpus.append(text)
    if count % 10 == 0 and count != 0:
        print('Obtained {} articles'.format(count))
    count += 1
Obtained 10 articles
Obtained 20 articles
Obtained 30 articles
Obtained 40 articles
Obtained 50 articles
Obtained 60 articles
Obtained 70 articles
Obtained 80 articles
Obtained 90 articles
Obtained 100 articles
Obtained 110 articles
Obtained 120 articles
Obtained 130 articles
Obtained 140 articles
Obtained 150 articles

Since this type of scraping can lead to your IP address getting flagged by some news sites, I've added a small sleep of 1 second between each article download. Just in case we get flagged, make sure to save our corpus. If you have a hard time using newspaper to get data you can just load up the data from corpus.txt within the github repo.

In [115]:
with open('corpus.txt', 'a') as fp:
    count = 0
    for item in corpus:
        loaded = item.encode('utf-8')
        loaded_j = {count: loaded}
        fp.write(json.dumps(loaded_j) + '\n')
        count += 1
    fp.close()

We can read back in the data we wrote to disk in the format of:

data[index]: {'article': 'article text'}

where the index is the order we read in the data.

In [27]:
data = {}
with open('corpus.txt', 'r') as fp:
    for line in fp:
        item = json.loads(line)
        key = int(list(item.keys())[0])
        value = list(item.values())[0].encode('ascii','ignore')
        data[key] = {'article':value}
    fp.close()

Now let's get the entities for each of the articles we've grabbed. We'll write the results back to the data dictionary in the format:

data[index]: {
              'article': article text,
              'people': [person entities]
             }

Let's make a function that wraps up both using the Core NLP Server and Fuzzywuzzy to return the correct entities:

In [96]:
def proc_article(article):
    """
    Wrapper for coreNLP and fuzzywuzzy entity extraction and entity resolution.
    """
    output = nlp.annotate(article, properties={
      'annotators': 'ner',
      'outputFormat': 'json'
      })
    
    people_super = set()
    for sent in output['sentences']:
        people = proc_sentence(sent['tokens'])
        for person in people:
            people_super.add(person)

    contains_dupes = list(people_super)
    
    deduped = fuzzy_dedupe(contains_dupes)
    
    return deduped

We can now process each article we downloaded. Note that sometimes newspaper will return an empty article so we can double check for these to make sure that we don't try to send them to Core NLP Server.

In [32]:
fail_keys = []
for key in data:
     # makes sure that the article actually has text
    if data[key]['article'] != '':
        people = proc_article(str(data[key]['article']))
        data[key]['people'] = people
    # if it's an empty article, let's save the key in `fail_keys`
    else: 
        fail_keys.append(key)

# now let's ditch any pesky empty articles
for key in fail_keys:
    data.pop(key)

Now we need to actually generate the network graph. I'll use the Python library networkx (pip install networkx) to build the network graph. To do this, I need to generate a dictionary of entities where each key is a unique entity and the values are a list of vertices that entity is connected to via an edge. For example, here we are indicating that George Clooney is connected to Bill Murray, Brad Pitt, and Seth Myers and has the highest degree centrality in the social network (due to having the highest number of edges).

{'George Clooney': ['Bill Murray', 'Brad Pitt', 'Seth Myers'],
 'Bill Murray': ['Brad Pitt', 'George Clooney'],
 'Seth Myers: ['George Clooney'],
 'Brad Pitt': ['Bill Murray', 'George Clooney']
 '}
In [303]:
import networkx as nx
from networkx.readwrite import json_graph
from itertools import combinations
from fuzzywuzzy.process import extractBests

Across document entity resolution

Before we get started building our graph, we need to conduct entity resolution across our document corpus. We already did this at the document level, but surely different articles will refer to the President and others in different ways (e.g. "Donald Trump", "Donald J. Trump", "President Trump"). We can deal with this in the same way we handled differences within the same article with a slight addition: we need to build a lookup dictionary so that we can quickly convert the original entity into its resolved form.

In [213]:
person_lookup = {}
for kk, vv in data.items():
    for person in vv['people']:
        person_lookup[person] = ''

people_deduped = list(fuzzy_dedupe(person_lookup.keys()))

# manually add the donald back in since fuzzy_dedupe will preference donald trump jr.
people_deduped.append('donald trump')

for person in person_lookup.keys():
    match = extractBests(person, people_deduped)[0][0]
    person_lookup[person] = match

Let's see if this works:

In [204]:
print('donald trump resolves to: {}'.format(person_lookup['donald trump']))
print('donald j. trump resolves to: {}'.format(person_lookup['donald j. trump']))
print('donald trumps resolves to: {}'.format(person_lookup['donald trumps']))
donald trump resolves to: donald trump
donald j. trump resolves to: donald trump
donald trumps resolves to: donald trump

Looks good! This way we don't have multiple entities in our graph representing the same person. Now we can go about building an adjacency dictionary which we'll call entities.

In [206]:
entities = {}

for key in data:
    people = data[key]['people']
    
    doc_ents = []
    for person in people:
        # let's makes sure the person is a full name (has a space in between two words)
        # let's also make sure that the total person name is at least 10 characters
        if ' ' in person and len(person) > 10:
            # note we will use our person_lookup to get the resolved person entity
            doc_ents.append(person_lookup[person])
    
    for ent in doc_ents:
        try:
            entities[ent].extend([doc for doc in doc_ents if doc != ent])
        except:
            entities[ent] = [doc for doc in doc_ents if doc != ent]

From here we need to actually build out the networkx graph. We can create a function which iteratively builds a networkx graph based on an entity adjacency dictionary:

In [214]:
def network_graph(ent_dict):
    """
    Takes in an entity adjacency dictionary and returns a networkx graph
    """
    index = ent_dict.keys()
    
    g = nx.Graph()

    for ind in index:
        ents = ent_dict[ind]

        # Add previously unseen entities as nodes
        for ent in ents:
            if ent not in g:
                g.add_node(ent, dict(
                    name = ent,
                    type = 'person',
                    degree = str(len(ents))))

    for ind in index:
        ent = ent_dict[ind]
        
        for edge in ent:
            if edge in index:
                new_edge = (ind,edge)
                if new_edge not in g.edges(): 
                    g.add_edge(ind, edge)
        
    js = json_graph.node_link_data(g)
    js['adj'] = g.adj
    return (g, js)

Now we can use our function to build the graph

In [262]:
graph = network_graph(entities)[0]

Before we continue, we can do some cool things with our graph. One of them is determining who the most important people in our network are. We can quickly do this using degree centrality, which is a measure of the number of edges a node in our graph has. In this case, each node in the graph represents a person entity which was extracted from the Breitbart articles. The more people that a given individual co-occurred with the higher the degree of that node and the stronger his or her degree centrality.

This image demonstrates how the degree of each node is calculated:

Degree Example

When we calculated degree centrality with networkx we are returned normalized degree centrality scores which is the degree of a node divided by the maximum possible degree within the graph (N-1, where N is the number of nodes in the graph). Note that the term node and vertex can be taken to mean the same thing in network analysis. I prefer the term node.

We can take a guess that since these articles were from Breitbart's government section there will be a number of articles referencing Donald Trump. So, we can assume he'll be at the top of the list. Who else will bubble up based on the number of people they are referenced along with? Let's find out!

In [263]:
centrality = nx.degree_centrality(G_)
centrality_ = []
for kk, vv in centrality.items():
    centrality_.append((vv,kk))
centrality_.sort(reverse=True)
for person in centrality_[:10]:
    print("{0}: {1}".format(person[1],person[0]))
donald trump: 0.5423728813559322
vladimir -rsb- putin: 0.21468926553672316
jerome hudson: 0.192090395480226
george soros: 0.15254237288135594
miosotis familias: 0.11864406779661017
james baldwin: 0.10734463276836158
barack obama: 0.10734463276836158
patrisse cullors: 0.0847457627118644
opal tometi: 0.0847457627118644
micah x. johnson: 0.0847457627118644

The fact someone co-occurs in a document with another person does not mean anything specifically. We can't tell that they are friends, lovers, enemies, etc. However, when we do this type of analysis in aggregate we can begin to see patterns. For example, if Donald Trump and Vladimir Putin co-occur in a large number of documents we can assume that there is a dynamic or some sort of relationship between the two entities. There are actually approaches to entity extraction which attempt to explain the relationship between entities within documents, which might be fodder for another guide later on.

All that said, typically this type of analysis requires a human in the loop (HITL) to validate the results since without additional context we can't tell exactly what to make of the fact that Donald Trump and Vladimir Putin appear to have a relationship within our graph's context.

Visualizing the graph

With that caveat, we are about ready to visualize our graph. Since these types of graphs are best for exploration as interactives, we are going to rely on some javascript and HTML to render the graph. You'll need to ensure that you copy the force directory from the github repo for this project so that you have access to the correct .css, .html and .js files to build the charts.

You actually can generate static network plots using networkx and matplotlib, but they aren't very fun and are hard to read. So, I've packaged up some d3.js for you that will render the plots within an iframe.

In [276]:
from networkx.readwrite import json_graph

for node in graph.nodes():
    # let's drop any node that has a degree less than 13
    # this is somewhat arbitrary, but helps trim our graph so that we
    # only focus on relatively high degree entities
    if int(graph.node[node]['degree']) < 13:
        graph.remove_node(node)
        
d = json_graph.node_link_data(graph) # node-link format to serialize

json.dump(d, open('force/force.json','w'))

Rendering the chart

Here's what's cool: we're going to embed an iframe within the notebook. However, if you want you've also got the d3.js based javascript and HTML code to pop this into your own website. I've done a couple customizations to this network diagram, including adding a tooltip with the entity name when you hover over the node. Also, the nodes are sticky--you can click the node and freeze it wherever you would like. This can seriously help when you are trying to understand what you are looking at.

In [300]:
%%HTML
<iframe height=400px width=100% src='force/force.html'></iframe>