Welcome to Text as Data

Welcome to my course, “Text as Data.” On this page, you will find an overview of the course, a description of each topic it covers, and instructions for accessing all of the software and materials you will need.

What is Text as Data?

The past decade has witnessed an explosion of data produced by websites such as Twitter, Facebook, Google, and Wikipedia, as well as by the mass digitization of historical archives and administrative records. Though these new data sources hold enormous potential to address a range of pressing problems within industry and academia, collecting and analyzing text-based data presents unique challenges. Fortunately, the widespread availability of text-based data coincides with major advances in the fields of computer science and natural language processing. This course will provide students with an overview of popular techniques for collecting, processing, and analyzing text-based data, including screen-scraping, mining data from application programming interfaces (APIs), topic modeling, text networks, and advanced text classifiers.

What Subjects are Covered in this Class?

This class covers a range of topics that build on one another. For example, in the first tutorial, you will learn how to collect data from Twitter, and in subsequent tutorials you will learn how to analyze those data using automated text analysis techniques. For this reason, you may find it difficult to jump ahead to the more advanced topics before covering the basics.


Introduction: Strengths and Weaknesses of Text as Data

Application Programming Interfaces

Screen-Scraping

Basic Text Analysis

Dictionary-Based Text Analysis

Topic Modeling

Text Networks

Word Embeddings


Who am I?

I am a Professor of Sociology, Public Policy, and Data Science at Duke University who studies political polarization on social media. You can learn more about my research here or follow me on Twitter here. Much of the material in the tutorials above draws upon my own research and the text analysis techniques I’ve developed. I also draw heavily on a number of excellent tutorials by other people, whom I have tried to thank in each tutorial above. If I forgot to recognize your work, please email me!

How can I Access the Course Materials?

All of the materials for this course are available on my GitHub page. There you will find the datasets used in the tutorials above as well as all of the source files necessary to produce them. If you notice problems or other limitations in the tutorials, kindly submit a pull request (if you know how to use GitHub).

How can I Get Started?

This course assumes basic familiarity with R. If you are new to R, I recommend the sequence of online courses described on this website to get you started.

What if the Code in the Tutorials does not work?

I will do my best to update the tutorials above as often as possible, but in the world of open-source software it is inevitable that problems will arise that I may not be able to address quickly.