So recently in my Digital Humanities Seminar class, John Laudun asked us to find different tools to use to output a frequency list from a text that interested us. Instead, I decided to try out my R skills, and dusted off my notes from an R Bootcamp for corpus linguistics I attended last year (this was led by Stefan Th. Gries (author of Quantitative Corpus Linguistics with R: A Practical Introduction) and organized by Stefanie Wulff at the University of North Texas, Denton–a brilliant workshop that anyone interested in corpus linguistics and R should attend).
I won’t even try to go into what R is; however, a good introduction to it (besides Gries’ book) is at the R Project. What I’m going to do today is show you what I did to create that frequency list. It’s very simple and may be of use to some of you who are thinking about jumping into R and corpus linguistics.
Before I begin, I just want to make it clear that my information on the R commands comes from my notes of the bootcamp with Stefan Gries (I was/am very new to R). You can also find this information within his book.
Building the text file:
I am working with the text Charlotte Temple by Susanna Rowson, found on Project Gutenberg. I originally downloaded the text last year, and while cleaning it up (mis-scans from Project Gutenberg), I had also decided to break the text up into multiple files by chapter. Although I can work with multiple input files in R, I thought it might be easier to get my feet wet by beginning with only one text file. All of those individual files began with “ct” followed by the chapter number, ending with a .txt file extension (for example, ct01.txt). This made it easier for me to sort and locate the particular chapters I needed. I use Windows 7, and don’t know of an easy way to combine all the files into one file besides cutting and pasting, so I decided to go out to the Command Prompt. I first navigated to the directory where my files were located:
Here’s a list of all my chapter files: dir ct*.txt (this says show me all the files that begin with “ct” and end with “.txt”):
To combine all of them into one file, I entered: type ct*.txt > ctcompl.txt
This reads as follows:
- type = display the contents to the screen
- this is followed by what file to type–in this case, it’s all the files that begin with ‘ct’ and end with ‘.txt’ (the asterisk, *, is a wildcard).
- but instead of displaying the contents on the screen, I used a “>” to send the contents to a file.
- the file name I used for the complete text is ctcompl.txt (if the file doesn’t exist, it will be created; if it exists, then it will be overwritten).
(Notice how the file names are displayed. Though you can’t see it in this screen shot, it displays them all, including the new file created)
A quick command, type ctcompl.txt, will allow you to verify the contents of the file. One thing I should point out is that my original naming convention for the separate chapter files (zero-padded chapter numbers, such as ct01.txt) is what allowed the text to be built in the correct order. It’s something to keep in mind when building any sets of corpora.
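If you would rather stay inside R for this step, the same combining can be sketched there. This is only a minimal sketch under my own assumptions: the demo folder, the three tiny chapter files, and the vector names are all made up for illustration, and the pattern deliberately matches only “ct” plus digits so that a combined ctcompl.txt is never read back in.

```r
# Hypothetical demo folder with three tiny "chapter" files (illustration only)
demo.dir <- file.path(tempdir(), "ct_demo")
dir.create(demo.dir, showWarnings = FALSE)
writeLines("Chapter ten text.", file.path(demo.dir, "ct10.txt"))
writeLines("Chapter one text.", file.path(demo.dir, "ct01.txt"))
writeLines("Chapter two text.", file.path(demo.dir, "ct02.txt"))

# Match only ct + digits + .txt, and sort so zero-padded names give chapter order
chapter.files <- sort(list.files(demo.dir, pattern = "^ct[0-9]+\\.txt$",
                                 full.names = TRUE))

# Read each chapter in file-name order and flatten into one character vector
all.lines <- unlist(lapply(chapter.files, readLines))

# Write the combined text to a single file
writeLines(all.lines, file.path(demo.dir, "ctcompl.txt"))
```

The zero-padded names matter here for the same reason they matter at the Command Prompt: ct10.txt sorts after ct02.txt only because the single-digit chapters carry a leading zero.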
The R commands
I’m assuming you have R already loaded. If not, or for instructions and help with R, go to the R Project download page. This is the development environment I’m using. During the bootcamp, Stefan constantly warned us to type our commands in our favorite (R-friendly) text editor so that we would not mistakenly overwrite a vector that took a number of steps to create. It’s really good advice. Though I like Notepad++ for most of my text editing, I use Tinn-R for my R work. This is probably because this is what Stefan had us use in class (I’m a creature of building on familiarity when it comes to learning…)
Now, if you’re like me, one of the first things you’ll have to get over is what a variable is called in R. It’s called a vector. I’m sure someone (probably Stefan) knows why. But for my peace of mind, I still call it a variable. Creating a vector is very easy; you just type in the name you want (there are restrictions), then use the assignment operator, <-, to send it whatever information you want it to contain.
Okay; let’s start. First of all we need to tell R what text we want to use. And not only that, but we need for R to remember it so that we can do things with the text later. This means we have to tell R to read in our file and stuff it into a vector. There are a number of ways of doing it, but Stefan showed us a slick trick for Windows users (sorry Mac fans–does anyone know of the Mac equivalent?):
temple.text<-scan(choose.files(), what="char", sep="\n")
This reads as follows:
- create a vector (variable) called temple.text (this is just my own naming convention–it helps me to remember this is the complete text for Charlotte Temple)
- <- is like the Windows Command Prompt’s redirect command, “>”, in that it takes the output of the command and sends it somewhere, in this case to the temple.text vector (the arrow points at the destination, so the name sits on the left), and
- in this case, the command is scan . This will read the data from a file, which we specify afterwards using the choose.files() command:
- choose.files() opens up a browse window to allow you to choose your file (or files–it’s that cool; I really didn’t need to combine all my files after all!). Again, I’m not sure how to do this interactively on the Mac.
- you could always set a path manually (hardcode it) using scan’s file argument. For example, scan(file="c:/myTextFile.txt", …). Note that R treats a backslash inside a string as an escape character, so a Windows path needs either forward slashes or doubled backslashes ("c:\\myTextFile.txt").
- the what="char" tells R what kind of data the file contains (allowed types are logical, integer, numeric, complex, character, raw and list).
- the sep="\n" tells R how the data is delimited. In this case, I’m using lines. Without it, scan falls back on its default and splits the data on whitespace.
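For anyone who can’t (or doesn’t want to) use the interactive file picker, here is a minimal sketch of the same read with a hardcoded path. The sample file and the names sample.path and sample.text are hypothetical stand-ins for your own corpus file. The commented-out last line uses file.choose(), which is part of base R on every platform, so it may be worth trying on a Mac in place of choose.files().

```r
# Hypothetical two-line sample file standing in for the full text
sample.path <- file.path(tempdir(), "sample_text.txt")
writeLines(c("PREFACE.", "For the perusal of the young and thoughtless"),
           sample.path)

# what = "char" reads character data; sep = "\n" makes each line one element
sample.text <- scan(file = sample.path, what = "char", sep = "\n", quiet = TRUE)

# One element per line of the file
length(sample.text)

# Interactive alternative (not run here):
# sample.text <- scan(file.choose(), what = "char", sep = "\n")
```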
So now that we have the text, we might want to normalize it in some way. R is case sensitive, and so the words “Dog” and “dog” are different words. And though there are many cases where that might be important, for me, I would rather treat them as the same, and so I am going to convert the entire text to lower case:
temple.text<-tolower(temple.text)
This reads as follows (from right to left (or inner to outer)):
- tolower(temple.text) says to take the vector that contains all of our text, and convert it all to lowercase, then
- redirect that data (send to/save)
- back into the original vector name we read from (overwriting it with the newly made lowercase version)
Next we need to break up the data into words. Again, we are doing something to the complete text (temple.text) but instead of overwriting it like we did with the lowercase conversion, we are going to create a new vector in order to keep track of things.
temple.words.list<-strsplit(temple.text, "\\W+", perl=TRUE)
This reads as follows (again, from the innermost, or right function):
- strsplit tells R to break up (or split) a string (which is what a bunch of character data is called).
- In order for it to know what to split, we have to feed it some arguments (specific inputs), such as the data we want to split, temple.text, and how we want to split it, "\\W+", perl=TRUE (notice that the arguments are separated by commas)
- the "\\W+" says to split wherever there is a run of one or more non-word characters (anything other than letters, digits, and the underscore). That covers whitespace, but it also covers punctuation, so “dog.” and “dog” come out as the same word. The trade-off is that apostrophes and hyphens split words too (“don’t” becomes “don” and “t”), and a line that begins with punctuation produces an empty string as its first “word”; you’ll see that empty string show up in the counts later. For this post, I’m not going to clean that up, though it can be done.
- Then take all of this and save into something new, temple.words.list (temple.words.list <-)
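To see what the split actually does, here is a small sketch on two made-up lines (the sample sentences and the names sample.lines and sample.words.list are my own, not from Charlotte Temple). It shows the two side effects worth knowing about: an apostrophe splits a word in two, and a line that starts with punctuation yields an empty string as its first element.

```r
# Two hypothetical lines, already lowercased
sample.lines <- c("the dog. the dog's bone", "\"well,\" she said")

# Split on runs of non-word characters; the result is a list, one entry per line
sample.words.list <- strsplit(sample.lines, "\\W+", perl = TRUE)

# First line: the period disappears, but "dog's" becomes "dog" and "s"
sample.words.list[[1]]

# Second line: the leading quotation mark leaves an empty string up front
sample.words.list[[2]]
```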
So now, I’ve slightly lied. I’ve been talking about everything as if it’s a vector. What strsplit just gave us is a list, which has a different internal structure than a vector. For brevity’s sake, I’m just going to say that we need to convert the list to a vector to continue to work with it:
temple.words.vector<-unlist(temple.words.list)
Again, starting from the right, we can read it as
- “Take the list, temple.words.list, and turn it into a vector by unlisting it.
- Then save that information into a vector called temple.words.vector.”
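Here is a minimal sketch of the unlisting step on a tiny hand-made list (all names here are hypothetical). The last line is an optional extra that is not part of the steps in this post: it drops empty strings so they are not counted as a word later on.

```r
# Hypothetical list shaped like strsplit output: one character vector per line
sample.words.list <- list(c("the", "dog"), c("", "well", "she", "said"))

# Flatten the list into one long character vector
sample.words.vector <- unlist(sample.words.list)

# Optional cleanup (my addition): keep only non-empty strings
sample.words.vector <- sample.words.vector[nzchar(sample.words.vector)]
sample.words.vector
```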
Right now, temple.words.vector looks like this:
> head (temple.words.vector,10)
 “preface” “for” “the” “perusal” “of” “the” “young” “and” “thoughtless”
This is just giving a list of the words in the order they appear. So we need to use the table command to work with it in a slightly different fashion to make a frequency list:
temple.freq.list<-table(temple.words.vector)
Though we have a vector, another way to store data is within a table (think Excel, columns and rows). R performs certain kinds of calculations for tables that we don’t get with a plain vector. For a very useful site that explains tables (as well as other things R), go visit the R Tutorial by Clarkson College.
To read this though, again, start from the right:
- We are telling R to take our new vector, temple.words.vector,
- and to make a table out of it,
- and save it to temple.freq.list.
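As a small sketch of what table() does, here it is on an eight-word made-up vector (the words and the name sample.freq.list are hypothetical). Each distinct word becomes a name, in alphabetical order, and its count becomes the value.

```r
# Hypothetical word vector with some repeats
sample.words.vector <- c("the", "dog", "and", "the", "cat", "and", "the", "bird")

# Count how many times each distinct word occurs
sample.freq.list <- table(sample.words.vector)

# The distinct words, alphabetically
names(sample.freq.list)

# Look up the count for a single word by name
sample.freq.list["the"]
```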
What the temple.freq.list looks like now is (the first count, 644, has no visible label because it belongs to an empty string):
a abandon abandoned abated
644 1415 2 6 2
abbess abhorred abilities abject abjure
2 2 4 4 2
able abode about above abroad
4 2 42 6 4
absence absent absolutely absorbed abuse
6 4 2 2 2
abused abyss academy accent accept
2 4 2 8 2
This lists the different words with their frequencies. We can now use this list to sort by frequency rather than alphabetically:
temple.sorted.freq.list<-sort(temple.freq.list, decreasing=TRUE)
This works a lot like what we did with the tolower function, except that we are sorting rather than converting case, and we are saving the result under a new name instead of overwriting the original.
So, we are telling R
- to take the list we just made, temple.freq.list,
- and sort it descending (using the argument decreasing=TRUE),
- then save all of this into temple.sorted.freq.list
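A quick sketch of the sort on a small made-up table (names are hypothetical). I only look at the top two entries here, since words with tied counts can come out in either order.

```r
# Hypothetical frequency table built from seven words
sample.freq.list <- table(c("the", "dog", "and", "the", "cat", "and", "the"))

# Highest counts first; the words travel with their counts as names
sample.sorted <- sort(sample.freq.list, decreasing = TRUE)

# The two most frequent words and their counts
head(sample.sorted, 2)
```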
temple.sorted.freq.list now looks like this:
the to and of her a i she
3529 2522 2421 2291 1726 1415 1300 1137
in he was you my that his
1036 884 826 768 757 644 622 604
it but with not for charlotte be had
601 599 588 569 535 525 456 441
said me by as from on is at
432 431 399 394 394 391 379 362
This looks a lot like what we would expect: function words are typically the most frequent.
Next we are going to put this into a table format that is more readable than what we see above:
temple.sorted.table<-paste(names(temple.sorted.freq.list), temple.sorted.freq.list, sep="\t")
This is weird looking, I know. But basically, we want to take what we’ve come up with so far in temple.sorted.freq.list
the to and of her a i she
3529 2522 2421 2291 1726 1415 1300 1137
and somehow turn the words and frequencies into column-looking data.
So we have to build this by first just getting the words themselves, using the names() function
 “the” “to” “and” “of” “her” “a” “i”
 “she” “in” “he” “was” “you” “my” “”
 “that” “his” “it” “but” “with” “not” “for”
We then feed that list of names as one argument into the paste() function, followed by the frequency list itself as another argument, then followed by the last argument, sep="\t", which says to insert a tab between them, so that when we open the file, the values will appear separated by the tab. The paste() function basically takes each word and its count and joins them into one string:
> paste(names(temple.sorted.freq.list), temple.sorted.freq.list, sep="\t")
 "the\t3529" "to\t2522" "and\t2421" "of\t2291" "her\t1726" "a\t1415"
 "i\t1300" "she\t1137" "in\t1036" "he\t884" "was\t826" "you\t768"
 "my\t757" "\t644" "that\t622" "his\t604" "it\t601" "but\t599"
So what we’re doing is building a file to look the way we want it to: a frequency list in columns (rather than the output you saw above)
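The pairing that paste() does is easier to see on a tiny case. This is just a sketch with a made-up six-word table; the names sample.sorted and sample.table are hypothetical.

```r
# Hypothetical sorted frequency table: the = 3, and = 2, dog = 1
sample.sorted <- sort(table(c("the", "dog", "the", "and", "the", "and")),
                      decreasing = TRUE)

# Pair the first name with the first count, the second with the second, and so on,
# putting a tab between each word and its count
sample.table <- paste(names(sample.sorted), sample.sorted, sep = "\t")
sample.table
```

Each element of sample.table is now one finished row of the output file: a word, a tab, and a number.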
Saving the data to a file:
cat("Word\tFREQ", temple.sorted.table, file=choose.files(), sep="\n")
- cat() is a way to output a file, like type is within the Command Prompt, except that it concatenates its elements into a character string before outputting them.
- In this case, R is beginning the string of data with the string of characters "Word\tFREQ"; that is, “Word” and “FREQ” will be separated by a tab in a text editor. They are the column headers.
- Then R will concatenate the data we have in temple.sorted.table behind the column headers, putting each element on its own line (sep="\n"). All of this is then saved to a location (the file argument) using
- the choose.files() command, where Windows users may browse and create a new file (again, Mac users will need to do something different: instead of choose.files(), you could specify the path and what you want to call the new file, for example file="myTextFile.txt")
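Here is a minimal sketch of the save step with a hardcoded path instead of the file picker, which should behave the same on any platform. The rows, the output file name, and the variable names are hypothetical; the sketch reads the file back in at the end just to check what was written.

```r
# Hypothetical rows, already in "word<TAB>count" form
sample.table <- c("the\t3", "and\t2", "dog\t1")

# Hypothetical output location in R's temporary folder
out.path <- file.path(tempdir(), "sample_freq.txt")

# Header first, then one row per line
cat("Word\tFREQ", sample.table, file = out.path, sep = "\n")

# Read it back: the header line followed by the three rows
written.lines <- readLines(out.path, warn = FALSE)
written.lines
```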
The result looks like this within Notepad++:
You could also open the new text file within Excel to make it prettier: