Distant Reading with the Distant Reader

Objectives

By the end of this tutorial, you will be able to:

What You Need

About the Distant Reader

Distant reading generally refers to the use of computational methods to analyze literary texts. Created by Eric Lease Morgan at the University of Notre Dame, the Distant Reader is a web-based text analysis toolset for “reading” and analyzing texts. It takes unstructured data (text) as input, and it outputs sets of structured data for analysis. The Distant Reader is intended to supplement the traditional reading process by simplifying the process of identifying trends and anomalies in large volumes of text. It is also a method for pre-processing data to be used in natural language processing.

Much of the following is adapted from Eric Lease Morgan’s documentation and workshops about the Distant Reader tool. See the references section of this tutorial for links to these resources.

How the Distant Reader Works

The Distant Reader is a system which locally harvests/caches content you specify. It then transforms the content into plain text, performs sets of natural language processing & text mining against the text, saves the results in a number of formats, reduces the whole to a cross-platform database file, queries the database thus summarizing the collection, zips the results of the entire process into a single file, and makes the file available to you for further investigation.
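To make that pipeline more concrete, here is a toy sketch of the "analyze, reduce to a database, then query" steps. It is not the Distant Reader's own code: the table name `word_counts`, the function names, and the sample documents are all assumptions made for illustration, and the real tool performs far richer processing.

```python
import re
import sqlite3
from collections import Counter

def build_carrel_db(documents):
    """Tokenize each document and store per-document word counts in SQLite."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema; the Reader's real database layout may differ.
    conn.execute("CREATE TABLE word_counts (doc TEXT, word TEXT, count INTEGER)")
    for doc_id, text in documents.items():
        tokens = re.findall(r"[a-z']+", text.lower())
        for word, count in Counter(tokens).items():
            conn.execute(
                "INSERT INTO word_counts VALUES (?, ?, ?)", (doc_id, word, count)
            )
    conn.commit()
    return conn

def top_words(conn, limit=3):
    """Summarize the whole collection by querying the database."""
    rows = conn.execute(
        "SELECT word, SUM(count) AS total FROM word_counts "
        "GROUP BY word ORDER BY total DESC, word ASC LIMIT ?",
        (limit,),
    )
    return rows.fetchall()

# Made-up sample documents standing in for harvested plain text.
documents = {
    "a.txt": "Bike lanes make streets safer for bike riders.",
    "b.txt": "Protected bike lanes separate riders from traffic.",
}
conn = build_carrel_db(documents)
print(top_words(conn))
```

Storing the counts in a database file rather than in memory is what would let the same summary be queried later from any platform.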

What We Can Do with the Distant Reader

Designed to “read” everything from a single item to a large corpus, the Distant Reader can help answer questions like:

Getting Started

  1. Using your web browser, navigate to https://distantreader.org/.
  2. Select “Create Account” on the top right.
  3. Fill in your account information, confirm your account, and log in.
  4. You should now see your Distant Reader Portal Dashboard, where you can submit your input and download the results (referred to as “study carrels”).

Submitting Content (“Experiments”) to the Distant Reader

Types of Input

The Distant Reader can currently accept these five types of input:

Caveats:

Tip: Pulling URLs out of a page of links by hand can be incredibly tedious. Try using a URL extraction tool to speed up the process. Eric Lease Morgan suggests the Google Chrome extension called Link Grabber.
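If you would rather script this step than install a browser extension, a few lines of Python can do something similar. The sketch below uses only the standard library; the class name `LinkCollector`, the function `extract_links`, and the sample HTML are assumptions for illustration. It prints one URL per line, so check the Reader's documentation for the exact format it expects before submitting the result.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from the <a href="..."> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip empty links and same-page anchors such as "#top".
            if name != "href" or not value or value.startswith("#"):
                continue
            url = urljoin(self.base_url, value)  # resolve relative links
            if url.startswith(("http://", "https://")) and url not in self.links:
                self.links.append(url)

def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links

# Made-up sample page; in practice you would read a saved HTML file.
sample_html = """
<p><a href="/wiki/Bike_lane">Bike lane</a>
<a href="https://example.org/report.pdf">Report</a>
<a href="#top">Back to top</a>
<a href="mailto:someone@example.org">Email</a></p>
"""
links = extract_links(sample_html, "https://example.org/index.html")
print("\n".join(links))
```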

Caveats:

How to create this CSV File:

Creating a New “Experiment”

  1. Determine your type of input and select the corresponding experiment application in your Distant Reader dashboard. For this tutorial, let’s try using a single .csv file. The file used here is the Bike PGH Bicycle Pavement Markings June 2016 dataset, downloaded from the WPRDC site.
  2. Enter the name of your experiment, add your input (you can ignore pretty much everything else on the page), then submit your experiment by selecting “Save and Launch.”
  3. Your experiment is now sent to the queue. If there are fewer than ten jobs currently running, the submitted job will be run immediately. It takes almost two minutes for the Reader to instantiate a new virtual machine, and then, depending on the number and sizes of items to read, processing can take as short as four minutes or as long as twelve hours. Generally, this process takes less than ten minutes. It is not necessary to keep your Web browser open to the Reader’s interface; the Distant Reader will do its work and wait for you to return.
  4. When the Reader has finished, your dashboard will have been updated and you can navigate to the “Experiment Summary” page.

    From here you can:

    • Read the standard error report; send this to the author if something goes amiss.
    • Read the standard output report, which is a simple summary of what the Reader found; look at this report first.
    • Download the study carrel.

Working with the Distant Reader Results (“Study Carrels”)

Study Carrels

The result of the Distant Reader process is a “study carrel”: a .zip file containing a set of structured data that includes your original content, various transformations of it, and various sets of analysis. Uncompressing the Distant Reader study carrel results in a directory/folder containing a standard set of files and subdirectories. The following sections detail how to download your study carrel and some basic ways of interacting with the results.

Downloading Your Study Carrel

  1. Download the results of your experiment found under “ZipFile” in your experiment summary.
  2. Find your locally downloaded .zip file. The zipped file should have downloaded with “study-carrel” somewhere in the file name.
  3. Unzip the file by double-clicking it.
  4. The resulting unzipped folder will have a long, seemingly unrelated string of characters as its name. Rename the folder to something simpler and more relevant to you. Consider moving it to your desktop while you work with it.
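The same steps can also be scripted. The following is a minimal sketch using Python's standard `zipfile` module. The function name `unpack_carrel`, the folder name `bike-lanes-carrel`, and the tiny stand-in archive are assumptions for illustration; a real study carrel contains many more files.

```python
import tempfile
import zipfile
from pathlib import Path

def unpack_carrel(zip_path, target_dir):
    """Extract a study carrel .zip into target_dir and list the files it held."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return sorted(
        p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
    )

# Demo with a stand-in archive so the sketch runs anywhere.
# Replace zip_path with the path to your downloaded study carrel.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "study-carrel-demo.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("a1b2c3d4/index.htm", "<html></html>")
        archive.writestr("a1b2c3d4/standard-output.txt", "summary")
    # Extract into a folder with a name that means something to you.
    files = unpack_carrel(zip_path, Path(tmp) / "bike-lanes-carrel")
    print(files)
```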

Working with Your Study Carrel - Narrative Reports

  1. Open the unzipped folder and navigate to the HTML file called “index.htm.” This is the root of the narrative interface and will open in your web browser.
  2. The body of the study carrel’s narrative interface provides a very broad overview of your study carrel. Narrative reports including frequencies, keywords, and topic modeling can be accessed on the left-hand side of the page.
  3. Take some time now to explore the narrative reports. Eric Lease Morgan provides some detail as to what these reports can help tell you about your data. Also, see the standard-output.txt file in the unzipped study carrel, as it will both summarize and elaborate upon this narrative report.

    As you can see, this single .csv does not have a ton of variation or narrative data to analyze. While not the most exciting corpus, this does at least point to the idea that this dataset is used for bureaucratic purposes and is fairly consistent. Think about what you see in these reports and how using Distant Reader for this type of dataset might not be the best option. Let’s try running an experiment with more narrative content to see how this works more clearly. To keep with the theme, let’s find the URL for the Wikipedia page for Bike Lanes and follow the same steps as the Creating a New Experiment section above. What differences do you see?

Working with Your Study Carrel - Interactive Reports

  1. The links at the top of the page point to interactive HTML pages. Each page is really a table listing bi-grams, noun-verb combinations, adjective-noun combinations, questions, etc. Let’s look at how to browse, sort, and search the content of the menu items named Ngrams, POS, Grammars, and Others. You can use these reports to look for patterns or anomalies, ask yourself questions, and then enter text into any of the available text areas in order to answer (or at least address) your question. For example, enter the words “who”, “what”, “when”, “where”, “why”, “how”, or “how many” into the text area of the questions page. The interactive HTML pages are akin to a back-of-the-book index.

    As mentioned above, our data output regarding bike lanes does not have a lot of fodder for questions like these. Let’s continue using our Bike Lanes Wikipedia Study Carrel for the rest of this section.

  2. Now navigate to the “Named Entities” table in the “Other” dropdown.
  3. The named entity pages list names of people (PERSON), organizations (ORG), places (GPE), and locations (LOC). To learn who is mentioned in your study carrel, enter “PERSON” into the text area. To learn what organizations, places, or locations are mentioned in your study carrel, enter the labels “ORG”, “GPE”, or “LOC” into the text area.
  4. All of the other pages linked from the top of the index.htm page operate very similarly. Explore those now.
  5. And there you have it! You now know the basics of working with the Distant Reader. As with everything, using the tool requires practice and refining. Remember that it’s almost impossible to break anything using the Distant Reader, so you can rest assured when trying all sorts of experiments.
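The kind of filtering done on the questions page can be approximated in a few lines of Python, which may help if you want to try the same idea on your own plain text. This is only a sketch: the function name `find_questions`, the naive sentence splitting, and the sample text are assumptions, and the Reader itself uses more sophisticated natural language processing.

```python
import re

def find_questions(text, starts_with=None):
    """Return sentences ending in '?', optionally limited to a leading word or phrase."""
    # Naive split: break after '.', '?' or '!' when followed by whitespace.
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    questions = [s for s in sentences if s.endswith("?")]
    if starts_with:
        prefix = starts_with.lower()
        questions = [q for q in questions if q.lower().startswith(prefix)]
    return questions

# Made-up sample text.
sample = (
    "Bike lanes reduce conflicts. Who pays for them? "
    "How many miles exist? How are they marked?"
)
print(find_questions(sample, "how"))
print(find_questions(sample, "how many"))
```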

Cleaning and Analyzing Your Study Carrel

Notice how the content we used cannot simply be taken at face value. For example, Wikipedia and its related privacy policy information tend to take over frequency assessments if not cleaned.
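One simple way to see this effect, and to counter it, is to count words with and without a list of terms to ignore. The sketch below assumes a hand-made ignore list (`BOILERPLATE_WORDS`) and a made-up sample sentence; in practice you would build the list from what you actually observe in your own carrel.

```python
import re
from collections import Counter

# Hypothetical ignore list: site boilerplate plus a few common function words.
BOILERPLATE_WORDS = {
    "wikipedia", "privacy", "policy", "the", "a", "of", "and", "to", "is", "for",
}

def word_frequencies(text, stopwords=frozenset()):
    """Count lowercase word tokens, skipping anything in stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

sample = (
    "Bike lanes are lanes for bikes. Wikipedia privacy policy. "
    "Wikipedia is a registered trademark."
)
raw = word_frequencies(sample)
cleaned = word_frequencies(sample, BOILERPLATE_WORDS)
print(raw.most_common(3))
print(cleaned.most_common(3))
```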


The Distant Reader can help with pre-processing, but there is always work to do in terms of cleaning your data and conducting analysis. There are three essential types of desktop tools you will need/want in order to use the content of a study carrel: text editors, spreadsheet/database applications, and analysis programs.

Text Editors

Text editors read and write plain text files: files with no formatting and no binary characters. Plain text files usually have a .txt extension. Every single file in a Distant Reader study carrel (except one) is a plain text file, and therefore, every single file (except one) is openable by any text editor. You can use a text editor to find/replace any character and change it to something else, which is useful for stopwords, carriage returns, newline characters, etc. Another very useful function of a text editor, especially for the purposes of text mining and natural language processing, is the ability to change the case of all letters to either their upper or lower-case forms. This is the most basic of text normalization/cleaning processes.
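For larger files, the same find/replace and case-changing operations can be scripted. Here is a minimal sketch in Python; the function name `normalize` and the sample string are assumptions for illustration.

```python
import re

def normalize(text):
    """Lowercase the text and collapse carriage returns, newlines, and
    repeated spaces into single spaces."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(normalize("Bike  Lanes\r\nin PITTSBURGH\n"))
```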

Spreadsheet/Database Applications

Spreadsheet/database applications are designed to read “delimited” files: plain text files where each line is a row in a matrix and each item is separated by some special character, such as a tab or a comma. These items are the columns in the matrix. The whole file is akin to a spreadsheet or a database. As with your text editor, you will want your spreadsheet/database application to support sorting.

The majority of the files in a study carrel are delimited files, and these delimited files are really annotated lists. Examples include lists of words and their parts-of-speech, lists of documents and the URLs they contain, or lists of sentences and the named-entities they include. Given these files, a student, researcher, or scholar could compare and contrast the ratio of named entities across a corpus, or plot the ebb and flow of an idea over time.
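As an illustration of that kind of comparison, the sketch below reads a small tab-delimited list of named entities and computes, for each document, the share of entities of each type. The column names (`id`, `entity`, `type`) and the rows are made up for this example; check the header row of the files in your own study carrel and adjust accordingly.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up rows standing in for a tab-delimited entity file.
SAMPLE_ROWS = (
    "id\tentity\ttype\n"
    "doc-1\tJane Jacobs\tPERSON\n"
    "doc-1\tBike PGH\tORG\n"
    "doc-1\tRobert Moses\tPERSON\n"
    "doc-1\tAllegheny River\tLOC\n"
    "doc-2\tBike PGH\tORG\n"
    "doc-2\tJane Jacobs\tPERSON\n"
)

def entity_type_ratios(tsv_text):
    """For each document, return the fraction of its entities with each label."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        counts[row["id"]][row["type"]] += 1
    ratios = {}
    for doc_id, type_counts in counts.items():
        total = sum(type_counts.values())
        ratios[doc_id] = {
            label: n / total for label, n in sorted(type_counts.items())
        }
    return ratios

ratios = entity_type_ratios(SAMPLE_ROWS)
for doc_id, shares in ratios.items():
    print(doc_id, shares)
```

To work with a real file, you could pass `open(path, newline="", encoding="utf-8")` to `csv.DictReader` in place of the `io.StringIO` wrapper.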

Analysis Programs

Analysis programs cover a wide spectrum of tools and these tools fall into a number of categories including counting & tabulating, concordancing, topic modeling, and visualizing. It is not within the scope of this tutorial to cover all of these programs, but here are a few that can be useful when working with Distant Reader study carrels:

Conclusions

“The Distant Reader is a tool for reading. It takes an almost arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis – reading. As such, the Distant Reader is akin to books’ page numbers, tables-of-contents, back-of-the-book indexes, and other apparatuses used to make them easier to use and understand. The difference is the Distant Reader does these things at scale. The Distant Reader is not a replacement for the traditional reading process, but instead it supplements the process. If you were asked to articulate a few main themes or just about anything else about any given corpus, then you would probably be able to do so, but with the Distant Reader you would be able to do so more thoroughly. Moreover, you would be able to literally point to where those themes manifested themselves. The Distant Reader does not output truth nor meaning. It merely outputs observations. Just like the traditional reading processes, it is up to the student, researcher, or scholar to interpret the observations and discuss the possible truth or meaning. In this way, the Distant Reader is simply a tool in the ongoing dialog on what it means to be human. Given the amount of narrative data/information at our finger tips, discovery is not the problem to solve. I don’t know about you, but I can find plenty. No, the problem goes beyond that; the problem is about getting the data/information, and reading it. The Distant Reader is a tool intended to address just these sorts of issues.” From Eric Lease Morgan’s Distant Reader Workshop

Distant Reader Resources

This tutorial was adapted and written by Jane Thaler in 2020.