Organic Markup?

I was just reading Claire Ross’s latest post, which is about integrating visitor interpretations as part of the museum experience and how that experience might be gauged for future improvements. In the course of her discussion, she uses the phrase “exhibition labels,” which made me immediately think of markup. Though I natter on occasionally about using markup for research, my feet have only just gotten wet; I don’t pretend to know a whole lot about it. However, thinking of markup while reading about her idea for an organic museum experience by way of visitor interpretations caused me to think about issues with semantic markup. That is, is there such a thing as dynamic markup? I suppose that running XSL transformations on XML is dynamic in a sense, but from what I understand, it’s still static in that the selections are programmed in advance, even if they can be run on demand. What I really am asking, I think, is whether there is such a thing as organic markup–markup that can be fed back into the original markup to grow it, rather than making use of pre-interpreted markup–a crowd-sourced, on-the-fly sort of markup. The reason I’m wondering is that I think it could help future interactions with previously marked-up texts–as a way to evolve with future interpretations of not only the text, but of the tag set used to mark up that text. I’m guessing that this is what natural language processing folk are dealing with–trying to interpret the text instead of the tags. Of course, I know even less about that group. But I would imagine that such organic markup could aid natural language processing. It just seems to me that something like this might treat the interpretive act of marking up text more as a conversation than as a monologue by one person/project team. I can’t help but think that people have already been working with such an idea. Is Wikipedia really this kind of markup?

In the interest of full disclosure, before reading Ross’ post, I had also just watched this video  by Barry Ridge, who is a Ph.D. student at the ViCoS lab, showing how their robot, George, used interactive learning to create knowledge updates (basically, how they started off with a simple knowledge schema and slowly grew it by way of his asking questions of the humans). Very cool stuff. And another thing I think is cool about it is that I think the core of what Ridge  is doing is also what Ross is getting at  (but in terms of a different discipline). It also shows how my procrastination tends to guide my reading into very cool things… Oh, the positive reinforcement!

Digital Humanists Skillsets

Recently on Claire Ross’ blog, she asks the question, “Do you need to be procedural literate to be a great digital humanist?” in response to a previous discussion of a paper by Michael Mateas (“Procedural Literacy – Educating the New Media Practitioner”). Her summation is that Mateas

“…suggests that procedural literacy is necessary for DH and new media researchers, because without understanding the back end of the programme, researchers will never be able to think critically about digital projects.”

I think her question is a great one and the easy answer is that being literate would definitely help, but is it necessary? It seems like the answer should be obvious but like all things worth pondering, it really depends.

For one thing, the scope and time-frame of any project will dictate much of who can do what by when and for whom. Before academia, I used to program for a large corporation. Many of our projects–all, if they were not an internal tool for the IT group or an infrastructure project for the company–were managed by people with the business expertise, usually having no formal IT skills (except what they gained through working on such projects). The company’s policy was that business needs ought to guide development and not the other way around. Having project managers didn’t necessarily mean a top-down workflow. These managers had to listen to input from the particular experts as well as be able to ask good questions. It was basically a collaborative learning as well as teaching environment. And it makes sense for large-scale projects.

But likewise, smaller projects–helping improve a particular department’s tools/workflow or creating something new based on new business demands–usually consisted of a developer or two acting as project manager and working with a representative from the department–again, someone who had the particular business expertise. It was a collaborative effort. In either of these scenarios, it took someone with vision as well as someone with the particular know-how. In my own experience, any sort of successful project often boils down to someone having great trouble-shooting skills, regardless of whether it’s an IT-related project or a strictly business-practice-related one.

Having said this though, I believe these same sorts of trouble-shooting skills are at the heart of writing essays as well as research projects in general. You break down the paper into sections that you know you need to explore, then work on learning what it is you need to in order to do the exploring. This may involve asking other experts, such as advisers, for leads to articles or books. Granted, projects involving developing research/archival sites or tools can feel a lot more like building a house (which can require a lot of different domains of expertise)–which brings me round again to my opening comments about the scope and time-frame of a project. I’ve been wrestling with this for the last year on my own project, knowing I don’t have forever to learn all the necessary programming languages and tools I believe I need to pull it off. But with slow, very minor steps, such as getting my feet wet last year with TEI via Brown University’s text encoding workshops, followed by an XSLT class at the Digital Humanities Summer Institute last month, though I don’t possess much experience with these tools, I’m seeing how I can actually get at some of my project’s questions while also seeing a way to maybe narrow the scope. At least today I feel this way. I admit though, that after hearing at the DHSI of all the different projects people are working on, I was overwhelmed by how large they were, as well as by the large infrastructures they required (whether it was time, training, developers, etc. through such organizations as the Nines); resources I don’t have. But the good news is that experiences with my smaller projects may lead to work with these larger collaborative efforts.

Back in my IT days, we used to refer to ourselves with that old saw about being a Jack of all trades, master of none. These days, I feel more like the squire…. The cool part about the promise of the Digital Humanities is the amount of cross-collaborative possibilities it holds. As organizations like Project Bamboo mature, they hopefully will become the model of an open marketplace of skills within which different universities and organizations can trade such skills frequently and within a flattened hierarchy. When that comes, I think the idea of cross-collaboration project managers will become more important than any one individual needing to know not only how to program but how to do so in multiple languages. But this then brings up a different issue: what happens when the majority of people within the field want to be only project managers? Will that create an imbalance that will eventually force people to acquire the procedural literacy that Claire and the rest of us are asking about? Hmmm. Might be best to work on those skills, if slowly, just in case.

 

More on Using Microsoft Word’s Find/Replace tool

Just because I’m finding  using Microsoft Word’s find/replace tool so much fun, I thought I would share another experience with it.

I had originally converted all my italicized words and phrases to tagged items in my earlier data prep entries. After having run my macro to output author and book lists, I found lots of mistakes in my (manual) tagging. But they were easy enough to correct. As I worked along, I also corrected spellings (words run together with other words, filling out full names, etc.). But I didn’t want to do all this again for my “clean” (readable) copy of the text file.

Now, I must admit, that for readable copies of texts, I need formatting such as italics. You may argue with me on future compatibility grounds, but the fact is, my reading experience, even with data, must take precedence over any such issues. So to “reprettify” my text, I used Microsoft Word’s find/replace feature to 1) find certain sets of tags and then 2) get rid of the surrounding tags, leaving the embedded content, and 3) format that content.

So in the Find/Replace window, I searched for:

(\<work*\>)(*)(\</work\>)

and replaced with:

\2

and added the italics format.

(*be sure to click More and check the Use wildcards option)

Like in other programs that use regular expressions, you can group them so you can refer to them later, in this case, with parentheses (for example, the first, second, or third grouped item). In this case, I wanted to get rid of the tags (<work> and </work>) but retain all the content between, which is the second group.

The backslashes “\” tell Word that I’m looking for the following character literally (in other words, escaping special characters like “<”). The reason I added a wildcard “*” after “<work” was that I had originally created my data file using <book>, <poem>, etc. However, while starting to mark up a different set of files, I decided to use a broader tag name, <work>, with an attribute of “type” (i.e. <work type=”book”>). So when I want to find complete tags, I have to account for the extra information between the name and the end bracket of the opening tag.
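For comparison, here is a minimal sketch of the same three-group find/replace in Python's `re` module rather than Word. The sample sentence is made up for illustration, and `[^>]*` and the lazy `.*?` are the rough regex counterparts I am assuming for Word's `*` wildcard; this is not a drop-in translation of Word's wildcard syntax.

```python
import re

# Hypothetical sample line; the tag name follows the post, the sentence is invented.
sample = ('She read <work type="book">Charlotte Temple</work> '
          'and <work type="poem">The Raven</work>.')

# Group 1: the opening tag, allowing attributes such as type="book"
# Group 2: the content between the tags (lazy, so it stops at the first closing tag)
# Group 3: the closing tag
pattern = re.compile(r'(<work[^>]*>)(.*?)(</work>)')

# Keep only group 2, just as the Word replacement "\2" does.
stripped = pattern.sub(r'\2', sample)
print(stripped)
```

Unlike in Word, the angle brackets need no backslashes here, because `<` and `>` are not special characters in Python's regular expressions.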

 

Here is an example of what I’m starting with:

 

And here’s what it looks like after finding/replacing the text:

 

 

Now, I could have done this for all tags rather than just this one particular set. But I didn’t want to highlight names within my reading (though I may change my mind about that). To do so,  I would have to alter the search terms to something like this:

(\<[!/]*\>)(*)(\</*\>)

Notice how I added [!/] to the first group. This tells Word to make sure that the character following the “<” is not a forward slash “/”. I needed to do this because, without that restriction, although Word found the first complete phrase just fine, the next match would begin with the closing tag of the previous phrase, causing my opening/ending tags to get out of alignment. I didn’t run into this problem in my first example because the first group in my search terms precluded it from bringing back a closing tag; that is, it could only match an opening tag.

So just to walk through this (to help make sure I’m following this myself), Group 1,  (\<[!/]*\>), reads:

“Look for the “<” character followed by something that is NOT a “/” character (again, to exclude the closing tag from the beginning), followed by all characters up to and including the opening tag’s closing bracket, “>”.”

The second group, (*), reads as

“(continue to) get me all characters”, followed by

the third group, (\</*\>), the final piece of the pattern:

“Look for the characters,  “</” (the opening bracket of the closing tag), followed by any text until and including the final bracket of the closing tag.”

As I was typing this out, I noticed that there is a potential problem with using multiply embedded tags which my first find/replace would not encounter (again, due to specifying the particular tag). That is, I’m guessing that if I had the line,

“<person>John Anderson’s</person> essay, <work type=”essay”>Surviving <person>Walt Whitman’s</person> <work type=”book”>Leaves of Grass</work></work>”

the find/replace pattern would probably retrieve:

<work type=”essay”>Surviving <person>Walt Whitman’s</person>

which is not aligned. I can see how to fix this easily in a macro, just saving off the tag name part in a variable to use later. If Word’s search/replace tool worked like other programs using regular expressions, I imagine that you could do the same thing by subdividing Group 1’s pattern, (\<[!/]*\>), into more groups: (\<[!/])(*)(\>) so that I could then use “\2” in the tag name part of the original Group 3’s pattern (where the “*” was originally): (\</\2\>), so that it would look like this:

(\<)([!/]*)\>*\<\/\2\>

Just out of curiosity, I went ahead and tested this and the embedded tags were indeed a problem. However, my solution only partially worked; that is, it found the first person tag just fine, but it only works with tag names that don’t contain anything else (like attributes) within the brackets besides the tag name itself. There might be a way around this using exclusion (!) and the range “[]” (the square brackets), but for now, I still think this is pretty cool. It just is good to know about problems/limitations before you start tagging your file so that you can come up with a scheme that will work using the tools you want (or have).
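As a point of comparison, the back-reference idea can be sketched in Python's `re` module, where the tag name is captured on its own and the attributes are skipped separately. This is an illustration under my own assumptions (the sample line is a simplified version of the one above, without a `<work>` nested inside another `<work>`), not something Word's wildcard search can do as written.

```python
import re

# Simplified, hypothetical sample: nested tags, but no tag nested inside a tag of the same name.
line = ("<person>John Anderson's</person> essay, "
        '<work type="essay">Surviving <person>Walt Whitman\'s</person> Leaves of Grass</work>')

# <(\w+)   capture only the tag name as group 1
# [^>]*>   skip any attributes up to the opening tag's closing bracket
# (.*?)    capture the content lazily as group 2
# </\1>    require a closing tag whose name matches group 1
paired = re.compile(r'<(\w+)[^>]*>(.*?)</\1>')

matches = paired.findall(line)
for tag_name, content in matches:
    print(tag_name, '->', content)
```

Because the closing tag must repeat the captured name, the `<work>` match is no longer cut short by the `</person>` inside it. A tag nested inside another tag of the same name would still be misaligned, since the lazy match stops at the first matching closing tag; that case generally calls for a real XML parser rather than a pattern.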

Although I’ve mentioned before that I’m not so worried about “elegant” solutions, the danger of finding new tricks for myself is that it can sometimes keep me from moving forward on projects because of the fun in tweaking them…

Lab Notebook: Data Prep 2 and Wordsmith and other tools

 

Now, in my previous post, I explained that I was tagging some data in my exam corpus for future use as well as simply making reading and author lists. I was planning on using a feature in Wordsmith Tools that allows you to use tags as selectors in addition to Markup to INclude or EXclude. But I’ve been having a heck of a time making it work the way I want. Initially, I was using a text file encoded in big-endian format for some reason, but after I sent some emails to Wordsmith Tools’ creator, Mike Scott, he quickly and politely explained that WS doesn’t like that format and instead prefers little-endian. WS thankfully also has a Unicode converter within its utilities that quickly fixed that problem–but WS 5’s version is buggy; WS 6’s beta version worked great.

I tried the following based on Wordsmith’s help (using Tags) on the web.

So within the Tags section of the settings, I told WS to automatically load a tag file in the Markup to Include section with just this line: <book>,</book>:

Don’t forget to click “load” (it’ll say “clear” if you have already loaded it). It’ll show you what tags it found:

 

In the Only Part of File, Sections to Keep, I have: <book> to <book>    (the Sections to Cut out is empty):

When I create a new concordance on the tag <book>*, it finds all 8 entries in my test file; however, it brings back the full concordance context line:

After trying a bunch of different search words as well as checking the settings, I emailed Mike Scott just to make sure I’m understanding what WS can actually do with this as well as to check my settings file. So while I’m waiting to hear back from him, I decided to go ahead and write a VBA macro (MS Word). I basically recorded my find of the tags and then edited it to save the results as a text file:

Sub WriteTagsToTextFile()
'
' WriteTagsToTextFile
' Collects every <book>...</book> entry in the document and saves the list to a text file
'
'declare my variables
Dim sBookList As String
Dim oRange As Range
Dim iCounter As Integer
Dim strPath As String

'initialize variables
iCounter = 0
strPath = "C:\Users\Me\BookList2.txt"

With Selection
    .HomeKey (wdStory) 'the HomeKey is like pressing Ctrl+Home to move to the beginning of the document
    .Find.ClearFormatting 'clear any formatting left over from a previous search
End With

With Selection.Find
    'find the tagged entries I'm interested in; in the future I may try to create an input box to enter this manually
    'both angle brackets of each tag are escaped, since < and > are special characters in a wildcard search
    .Text = "(\<book\>*\</book\>)"
    .Forward = True
    .Wrap = wdFindStop
    .MatchWildcards = True
    Do While .Execute
        Set oRange = Selection.Range
        With oRange
            'copy each found tag set into a variable, inserting a carriage return / line feed after each set
            sBookList = sBookList & .Text & vbCrLf
            iCounter = iCounter + 1
        End With
    Loop
End With

'cleanup
Set oRange = Nothing
Selection.Find.ClearFormatting
Selection.HomeKey (wdStory)

'Now, create the text file and save the list--if the file exists, it will be overwritten

'first, open the text file (to create or overwrite it)
'The #1 is how we refer to the file later to write and close it; I would like to use a save as box here instead, but this is easier for now...
Open strPath For Output As #1
'write the tag list to the file (note: Write # wraps the string in quotation marks; Print # would write it as-is)
Write #1, sBookList
'remember to close the file
Close #1 'again, we use #1 to refer to the opened file
'Remind yourself of what you just did--it isn't necessary, but it's also helpful to know that the script really finished
MsgBox iCounter & " entries found and saved to " & strPath

End Sub

(btw, I’m trying out My Syntax to display my code snippets.)

After my macro runs, I see:

 

and my text file looks like this:

 

I can now clean up this by finding/replacing the tags with nothing, or open up in Excel. The point is, I have my lists. I also can now count the frequency that particular titles are referenced.
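The counting step mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not the macro's output: the sample text and the titles in it are invented, and it assumes plain <book> tags without attributes.

```python
import re
from collections import Counter

# Invented sample; in practice this would be the tagged exam text.
text = ('Compare <book>Charlotte Temple</book> with <book>Wieland</book>. '
        'Then return to <book>Charlotte Temple</book>.')

# Pull out just the content of each <book>...</book> pair.
titles = re.findall(r'<book>(.*?)</book>', text)

# Count how often each title is referenced.
counts = Counter(titles)
for title, n in counts.most_common():
    print(n, title)
```

Because only the captured group is returned, the tags are already stripped, so no second find/replace pass is needed before counting.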

As I said in my comments, I would have liked NOT to hard-code the file name and path; that is, I would like the macro to prompt the user for that information. But maybe later. Or maybe not at all if I am able to do this in Wordsmith. But now that I have my macro, I have other options… My main point in doing this in VBA (besides getting at the data I want) is to highlight the usefulness of having multiple tools to get at the same or similar data. As I’ve mentioned elsewhere, I’m more interested in getting at the data than in elegant solutions (though they may be cool ones).

Next, I’ll share some results and future directions.

Using Wordsmith Tools at ULL

Here’s a set of instructions for accessing and using Wordsmith Tools on the ULL campus. Note that your ULL user id must be added to the share’s security (for licensing purposes), which you can request through me, Dr. Clai Rice, or Dr. John Laudun. If you are logged onto the campus’ domain, you won’t need to enter your credentials, but if you are using the wireless network, you will (see the instructions):

Accessing and Starting Wordsmith for ULL Users

 

Lab Notebook: Data Prep

So recently, in preparing for one of my comprehensive exams in early American literature, I scanned the last eleven years’ worth of exams into PDF format to make it easier to take notes, copy and paste book titles, authors, etc. Unfortunately, the state of the copies in our department folders meant that I first had to make clean photocopies in order to be able to scan them using our Digital Humanities department’s Fujitsu ScanSnap (the side effect of which is a serious case of scanner envy). It wasn’t until I was looking through the newly created PDF that I found a few missing pages and page sequence issues, as well as page direction problems. But using Acrobat Pro’s tools, cleanup of this sort was easy. I also used Acrobat’s OCR tool on the file so that I could copy and paste the text into Word and Excel. I used Word as my first step so that I could clean up the text (lots of minor OCR misreads) before copying and pasting into Excel (I’m using different worksheets based on the different sections of the exams; though the format has evolved a little over the last decade, they basically fall into IDs, shorter essays, and longer essay sections). And since I was already cleaning up the text within Word, I decided to also keep the Word document just to have a cleaner version of the exams. In cleaning up my Word file, I tried to maintain all the original significant formatting, such as italics for book titles; it just makes the exams easier to read.

I also stripped out nonessential information, such as sectional instructions (though that could make for an interesting rhetorical analysis in itself), and just labeled each section “Part I,” “Part II,” and “Part III.”

 

My Excel file is just beginning with rudimentary information for now:

Eventually, I’ll add columns such as themes, persons of interests, periods, related critics, related novels, etc.

 

After getting this far, it occurred to me that this task might be easier, and more valuable in the long run for gathering different sets of stats, if I were able to insert markup tags. For instance, I know Word will let me search and replace based on text formatting, and so I might be able to search and replace all the italicized words with their original words but also with opening and closing tags (something like <book>…</book>). I couldn’t figure out how to do the replace part until I googled around for Word and regular expressions. Sure enough, it handles them (read a good introduction on this by Microsoft: “Putting regular expressions to work in Word”).

However, in the end, I didn’t need to use them. I spent a great deal of time yesterday trying to get past the problem of being able to only replace every italicized word rather than the entire phrase. I eventually got it working using a VBA macro. But it turns out I didn’t need to use the regular expressions or my VBA code. Today, while retyping all of this (I lost my file while running test code—the lesson being, save all open files before testing out your VBA code!), I found exactly what I needed here. I just swapped out their replacement text with what I was looking for and it worked like a charm. The “^&” below is the  code for what I originally was finding (think of it like a variable that contains the original text). By using it in the replace box, I’m able to insert what I needed to as well as the original search terms (in this case, the formatted phrase).
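For readers more used to regular expressions than to Word, the role of “^&” can be sketched in Python, where `\g<0>` likewise stands for the whole matched text. Python cannot search by formatting, so this sketch assumes, purely for illustration, that the italicized phrases are marked with asterisks; the sample sentence is invented.

```python
import re

# Invented sample; asterisks stand in for Word's italic formatting (an assumption of this sketch).
text = 'Discuss *Charlotte Temple* and *The Coquette*.'

# \g<0> re-inserts whatever was matched, playing the role of Word's ^&,
# so the original phrase survives and the tags are wrapped around it.
tagged = re.sub(r'\*[^*]+\*', r'<book>\g<0></book>', text)
print(tagged)
```

As in Word, the matched text comes back unchanged (asterisks and all here, italics there), which is why a second cleanup pass is still needed afterwards.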

Very cool. And powerful.

One thing to note, though, was that before I did this, I first had to search for italicized paragraph marks and replace them with non-formatted paragraph marks, because otherwise Word would create an empty set of tags wherever the paragraph mark was also formatted as italics.

 

Since I didn’t remove formatting within the replace box, however, my <book>…</book> tags were inserted as italicized text. So a simple search and replace of the tags with a non-formatted version and presto:

Just remember to do this with the closing tag as well. Though there are other cooler ways of doing this, simple and fast go a long way in my book.

 

Of course, I also had to manually verify that all of these tags actually were for book titles. There were a few cases of quotes or exam instructions that I hadn’t taken out, or cases where the question text was being emphasized. In those cases, I used other tags (such as <emphasis>why</emphasis>  or <foreign>fin de siècle</foreign>). I currently have no need of these tags, but since it was easy (and I was verifying the text anyway), I decided to go ahead and use them.

 

Next, I want to use this file in Wordsmith Tools to see if anything interesting or useful pops up and see if I can simply create a list of the books based on the <book> tags.

Double-checking R’s Frequency List numbers

Always wanting to feel more confident about results, I decided to double-check the frequency list I generated the other day using R’s functions with Wordsmith Tools (which is a great tool btw).

I was very happy to find that their numbers do match–except one:

My gut feeling is that it has to do with extra consecutive spaces. I’ll do some checking into it…

Happy 75th Birthday, Iowa Writers’ Workshop

MFA. Three letters that mean a world to writers new or experienced, though it’s difficult to put into words. Which is funny considering… And though the non-debate over the possibilities of being able to teach writing rages on (quietly), anyone who is in, or has matriculated through, an MFA program owes their thanks to the Iowa Writers’ Workshop for its help in establishing, at the very least, an awareness of the importance of granting writers a place and time to write. And though I think this clip by PBS’s Newshour is very general, it does a good job of showing people a glimpse between the fence slats that line the backyards of some people who are trying to put their writing first in their lives, and the difficulty that entails. My thanks to Pop for telling me about the piece.

 

A Simple Frequency List using R

So recently in my Digital Humanities Seminar class, John Laudun asked us to find different tools to use to output a frequency list from a text that interested us. Instead, I decided to try out my R skills, and dusted off my notes from an R Bootcamp for corpus linguistics I attended last year (this was led by Stefan Th. Gries (author of Quantitative Corpus Linguistics with R: A Practical Introduction) and organized by Stefanie Wulff at the University of North Texas, Denton–a brilliant workshop that anyone interested in corpus linguistics and R should attend).

I won’t even try to go into what R is; however, a good introduction to it (besides Gries’ book) is at the R Project. What I’m going to do today is show you what I did to create that frequency list. It’s very simple and may be of use to some of you who are thinking about jumping into R and corpus linguistics.

Before I begin, I just want to make it clear that my information on the R commands comes from my notes of the bootcamp with Stefan Gries (I was/am very new to R). You can also find this information within his book.

 

Building the text file:

I am working with the text Charlotte Temple by Susanna Rowson, found on Project Gutenberg. I originally downloaded the text last year, and while cleaning up the text (mis-scans from Project Gutenberg), I had also decided to break the text up into multiple files by chapter. Although I can work with multiple input files in R, I thought it might be easier to get my feet wet by beginning with only one text file. All of those individual files began with “ct” followed by the chapter number, ending with a .txt file extension (for example, ct01.txt). This made it easier for me to sort and locate the particular chapters I needed. I use Windows 7, and don’t know of an easy way to combine all the files into one file besides cutting and pasting, so I decided to go out to the Command Prompt. I first navigated to the directory where my files were located:

Here’s a list of all my chapter files: dir ct*.txt  (this says show me all the files that begin with “ct” and end with “.txt”):

To combine all of them into one file, I entered:  type ct*.txt > ctcompl.txt

This reads as follows:

  • type = display the contents to the screen
  • this is followed by what file to type–in this case, it’s all the files that begin with ‘ct’ and end with ‘.txt’ (the asterisk, *, is a wildcard).
  • but instead of displaying the contents on the screen, I used a “>” to send the contents to a file.
  • the file name I used for the complete text is ctcompl.txt (if the file doesn’t exist, it will be created; if it exists, then it will be overwritten).

(Notice how the file names are displayed. Though you can’t see it in this screen shot, it displays them all, including the new file created)

A quick command, type ctcompl.txt will allow you to verify the contents of the file. Now one thing I should point out is, that my original naming convention for the separate chapter files is what allowed the text to be built in the correct order. It’s something to keep in mind when building any sets of corpora.
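For anyone not on Windows, the same combining step can be sketched in Python. The function name and its defaults are my own for illustration; the file names follow the post. Sorting the names explicitly spells out the ordering that the zero-padded naming convention (ct01, ct02, …) provides, and the output file is skipped because it also matches the ct*.txt pattern.

```python
from pathlib import Path

def combine_chapters(folder, pattern="ct*.txt", out_name="ctcompl.txt"):
    """Concatenate the chapter files in name order into one file (hypothetical helper)."""
    folder = Path(folder)
    out_path = folder / out_name
    # Sort by file name; skip the output file, which also matches the pattern.
    parts = sorted(p for p in folder.glob(pattern) if p.name != out_name)
    # Opening with "w" creates the file, or overwrites it if it already exists.
    with out_path.open("w", encoding="utf-8") as out:
        for part in parts:
            out.write(part.read_text(encoding="utf-8"))
    return out_path
```

Calling `combine_chapters(".")` from the chapter directory would write ctcompl.txt there. Note that an unpadded name such as ct2.txt would sort after ct10.txt, which is exactly why the naming convention matters.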

 

The R commands

I’m assuming you have R already loaded. If not, or for instructions and help with R, go to the R Project download page. This is the development environment I’m using. During the bootcamp, Stefan constantly warned us to type our commands in our favorite (R friendly) text editor so that we would not mistakenly overwrite a vector that took a number of steps to create. It’s really good advice. Though I like Notepad++ for most of my text editing, I use Tinn-R for my R work. This is probably because this is what Stefan had us use in class (I’m a creature of building on familiarity when it comes to learning…)

Now, if you’re like me, one of the first things I had to get over was what a variable is called in R. It’s called a vector. I’m sure someone (probably Stefan) knows why. But for my peace of mind, I still call it a variable. Declaring or creating a vector is very easy; you just type in the name you want (there are restrictions) then pipe to (send to) it whatever information you want it to contain.

Okay; let’s start. First of all we need to tell R what text we want to use. And not only that, but we need for R to remember it so that we can do things with the text later. This means we have to tell R to read in our file and stuff it into a vector. There are a number of ways of doing it, but Stefan showed us a slick trick for Windows users (sorry Mac fans–does anyone know of the Mac equivalent?):

temple.text<-scan(choose.files(), what="char", sep="\n")

This reads as follows:

  • create a vector (variable) called temple.text (this is just my own naming convention–it helps me to remember this is the complete text for Charlotte Temple)
  • <- is like the Windows Command Prompt’s redirect operator “>” in that it takes the output of the command and sends it to the temple.text vector, and
  • in this case, the command is scan. This will read the data from a file, which we specify afterwards using the choose.files() command:
  • choose.files() opens up a browse window to allow you to choose your file (or files–it’s that cool; I really didn’t need to combine all my files after all!). Again, I’m not sure how to do this interactively on the Mac.
    • you could always manually set a path (hardcode it), using the file argument of the scan command. For example, scan(file="c:/myTextFile.txt", …). Note that inside an R string a backslash has to be doubled (c:\\myTextFile.txt) or replaced with a forward slash.
  • the what="char" tells R what kind of data the file contains (allowed types are logical, integer, numeric, complex, character, raw and list).
  • the sep="\n" tells R how the data is delimited. In this case, I’m using lines. I believe if I didn’t, the data would just be delimited by whitespace.

So now that we have the text, we might want to normalize it in some way. R is case sensitive, and so the word “Dog” and “dog” are different words. And though there are many cases where that might be important, for me, I would rather treat them as the same, and so I am going to convert the entire text to lower case:

temple.text<-tolower(temple.text)

This reads as follows (from right to left (or inner to outer)):

  • tolower(temple.text) says to take the vector that contains all of our text, and convert it all to lowercase, then
  • redirect that data (send to/save)
  • <-
  • back into the original vector name we read from (overwriting it with the newly made lowercase version)
  • temple.text

Next we need to break up the data into words. Again, we are doing something to the complete text (temple.text) but instead of overwriting it like we did with the lowercase conversion, we are going to create a new vector in order to keep track of things.

temple.words.list<-strsplit(temple.text, "\\W+", perl=TRUE)

This reads as follows (again, from the innermost, or right function):

  • strsplit tells R to break up (or split) a string (which is what a bunch of character data is called).
    • In order for it to know what to split, we have to feed it some arguments (specific inputs), such as the data we want to split, temple.text, and how we want to split it, "\\W+", perl=TRUE (notice that the arguments are separated by commas)
    • the "\\W+" says to split the data wherever there is a run of one or more non-word characters, that is, anything other than letters, digits, and the underscore. Whitespace and punctuation both count, so “dog.” and “dog” end up as the same word. This is only one way to create words, and it has side effects: a contraction such as “don’t” is split into “don” and “t”, and a line that begins with a non-word character (an opening quotation mark, say) yields an empty string as its first element. For this post, I’m not going to clean that up, though it can be done.
  • Then take all of this and save into something new, temple.words.list (temple.words.list <-)
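
It can help to try the pattern on a couple of toy lines before trusting it on a whole novel. The sketch below uses made-up input (sample.text and sample.words.list are illustrative names) to show what "\\W+" does with punctuation, apostrophes, and a line that starts with a quotation mark:

```r
# Two hypothetical lines of text, already lowercased
sample.text <- c("the dog. the dog!", "\"don't go,\" she said")

# Split each line wherever one or more non-word characters occur
sample.words.list <- strsplit(sample.text, "\\W+", perl = TRUE)

# The result is a list with one character vector per input line
print(sample.words.list)
```

The first line yields "the" "dog" "the" "dog", with the punctuation gone. The second yields "" "don" "t" "go" "she" "said": the leading quotation mark produces an empty string, and the apostrophe splits the contraction in two.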

So now, I’ve slightly lied. I’ve been talking about everything as if it’s a vector. What we’ve created so far is a list, which has a different internal structure than a vector. For brevity’s sake, I’m just going to say that we need to convert the list to a vector to continue to work with it:

temple.words.vector<-unlist(temple.words.list)

Again, starting from the right, we can read it as

  • “Take the list, temple.words.list, and turn it into a vector by unlisting it.
  • Then save that information into a vector called temple.words.vector.”
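
A minimal sketch of the difference, using a hand-built list in place of the real one (all names here are illustrative). The last step, dropping empty strings, is an optional cleanup that this post does not otherwise perform:

```r
# A hypothetical list shaped like strsplit() output: one vector per line
sample.words.list <- list(c("preface", "for", "the"), c("", "perusal", "of"))

# The list has one element per line
n.lines <- length(sample.words.list)

# unlist() flattens it into a single character vector of all the words
sample.words.vector <- unlist(sample.words.list)
n.words <- length(sample.words.vector)

# Optional cleanup (an addition, not part of the original steps):
# keep only the elements that are not empty strings
sample.words.clean <- sample.words.vector[sample.words.vector != ""]

print(c(lines = n.lines, words = n.words, cleaned = length(sample.words.clean)))
```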

Right now, temple.words.vector looks like this:

> head (temple.words.vector,10)
[1] “preface”     “for”         “the”         “perusal”     “of”          “the”         “young”       “and”         “thoughtless”
[10] “of”

This is just giving a list of the words and their positions. To make a frequency list, we need to use the table command to work with it in a slightly different fashion:

temple.freq.list<-table(temple.words.vector)

Though we have a vector, another way to store data is within a table (think Excel, columns and rows). R performs certain kinds of calculations for tables that we don’t get with a plain vector. For a very useful site that explains tables (as well as other things R), go visit the R Tutorial by Clarkson College.

To read this though, again, start from the right:

  • We are telling R to take our new vector, temple.words.vector,
  • and to make a table out of it,
  • and save it to temple.freq.list.
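
Here is the same step on a toy vector, as a rough sketch (the words and the names sample.words.vector and sample.freq.list are made up):

```r
# A hypothetical word vector with some repeats
sample.words.vector <- c("the", "dog", "and", "the", "cat", "and", "the")

# table() counts how many times each distinct word occurs
sample.freq.list <- table(sample.words.vector)

# The words become the names, listed alphabetically, with counts underneath
print(sample.freq.list)

# A single count can be looked up by name
the.count <- as.integer(sample.freq.list["the"])
print(the.count)
```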

What the temple.freq.list looks like now is:

> table(temple.words.vector)
temple.words.vector
                       a     abandon   abandoned      abated
         644        1415           2           6           2
      abbess    abhorred   abilities      abject      abjure
           2           2           4           4           2
        able       abode       about       above      abroad
           4           2          42           6           4
     absence      absent  absolutely    absorbed       abuse
           6           4           2           2           2
      abused       abyss     academy      accent      accept
           2           4           2           8           2

This lists each distinct word with its frequency (the first count, 644, belongs to the empty string, which is why it has no visible name above it). We can now use this list to sort by frequency rather than alphabetically:

temple.sorted.freq.list<-sort(temple.freq.list, decreasing=TRUE)

This works a lot like what we did with the tolower function, except that we are sorting rather than converting case, and we are saving the result under a new name instead of overwriting the original.

So, we are telling R

  • to take the list we just made, temple.freq.list,
  • and sort it descending (using the argument decreasing=TRUE),
  • then save all of this into temple.sorted.freq.list
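
A small sketch of the sort on toy data (names and words are illustrative only), which also shows that the original table is left untouched:

```r
# A hypothetical frequency table built from a toy word vector
sample.freq.list <- table(c("the", "dog", "and", "the", "cat", "and", "the"))

# Sort by count, largest first, and save under a new name
sample.sorted.freq.list <- sort(sample.freq.list, decreasing = TRUE)

# The most frequent words now come first
print(sample.sorted.freq.list)

# The original table still has its alphabetical order
print(names(sample.freq.list))
```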

temple.sorted.freq.list now looks like this:

> temple.sorted.freq.list
temple.words.vector
       the        to       and        of       her         a         i       she
      3529      2522      2421      2291      1726      1415      1300      1137
        in        he       was       you        my                that       his
      1036       884       826       768       757       644       622       604
        it       but      with       not       for charlotte        be       had
       601       599       588       569       535       525       456       441
      said        me        by        as      from        on        is        at
       432       431       399       394       394       391       379       362

This looks a lot like how we would imagine:  function words are typically the most frequent.

Next we are going to put this into a table format that is more readable than what we see above:

temple.sorted.table<-paste(names(temple.sorted.freq.list), temple.sorted.freq.list, sep="\t")

This is weird looking, I know. But basically, we want to take what we’ve come up with so far in temple.sorted.freq.list

the              to             and              of             her               a               i             she
3529            2522            2421            2291            1726            1415            1300            1137

and somehow turn the words and frequencies into column-looking data.

So we have to build this by first just getting the words themselves, using the names() function

> names(temple.sorted.freq.list)
[1] “the”             “to”              “and”             “of”              “her”             “a”               “i”
[8] “she”             “in”              “he”              “was”             “you”             “my”              “”
[15] “that”            “his”             “it”              “but”             “with”            “not”             “for”

We then feed that list of names as one argument back into the paste() function, followed by the list itself as another argument, then followed by the last argument, which says to insert a tab ("\t") between them, so that when we open it up, the values will appear separated by the tab. The paste() function basically takes these two separate sets of values and joins them, pair by pair, into one:

> paste(names(temple.sorted.freq.list), temple.sorted.freq.list, sep="\t")

[1] "the\t3529"          "to\t2522"           "and\t2421"          "of\t2291"           "her\t1726"          "a\t1415"
[7] "i\t1300"            "she\t1137"          "in\t1036"           "he\t884"            "was\t826"           "you\t768"
[13] "my\t757"            "\t644"              "that\t622"          "his\t604"           "it\t601"            "but\t599"

So what we’re doing is building a file to look the way we want it to: a frequency list in columns (rather than the output you saw above)
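
To make the pairing concrete, here is a minimal sketch with a hand-made named vector standing in for the sorted frequency table (sample.freq and sample.sorted.table are illustrative names):

```r
# A hypothetical set of counts, named by word, already sorted by frequency
sample.freq <- c(the = 3L, and = 2L, dog = 1L)

# Join each name to its count with a tab in between
sample.sorted.table <- paste(names(sample.freq), sample.freq, sep = "\t")

# Printed as a vector, the tabs show up as \t
print(sample.sorted.table)

# Written out with cat(), one entry per line, the tabs become real columns
cat(sample.sorted.table, sep = "\n")
```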

Saving the data to a file:

cat("Word\tFREQ", temple.sorted.table, file=choose.files(), sep="\n")

  • cat() prints its arguments, much as the type command displays a file within the Command Prompt, except that it concatenates its elements into a character string before outputting them.
  • In this case, R is beginning the string of data with the string of characters "Word\tFREQ"; that is, “Word” and “FREQ” will be separated by a tab in a text editor. They are the column headers.
  • Then R will concatenate the data we have in temple.sorted.table behind the column headers, with each entry on its own line (sep="\n"), all into a character string, which then will be saved to a location (the file argument) using
  • the choose.files() function, where Windows users may browse and create a new file (again, Mac users will need to do something different: instead of choose.files(), you could specify the path and what you want to call the new file, for example: file="myTextFile.txt")
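
For anyone who cannot use choose.files(), the sketch below writes to an explicit path instead and then reads the file back to check it. It uses a temporary file and a made-up three-row table as stand-ins; swap in your own path (for example "myTextFile.txt") and temple.sorted.table.

```r
# Hypothetical word/frequency pairs, standing in for temple.sorted.table
sample.sorted.table <- c("the\t3", "and\t2", "dog\t1")

# An explicit output path; tempfile() is used here only so the sketch runs anywhere
out.file <- tempfile(fileext = ".txt")

# Write the header line, then one word/frequency pair per line
cat("Word\tFREQ", sample.sorted.table, file = out.file, sep = "\n")

# Read the file back to confirm what was written
written.lines <- readLines(out.file, warn = FALSE)
print(written.lines)
```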

The result looks like this within Notepad++:

You could also open the new text file within Excel to make it prettier:

Using IBM’s Word Cloud Generator on Windows

The following is just a revisioning of John Laudun’s Digital Humanities Blog post, Using IBM’s Word Cloud Generator, covering where his instructions would differ for Windows users (basic, but useful information for command prompt initiates):

And so, perhaps, the first place to begin is finding out how to get to the command line in Windows (XP, Vista, 7):

Click on the Start button, then All Programs, Accessories, Command Prompt (or, from the Run box, type cmd and press Enter):

[Screenshot: Command Prompt from the Start menu]

This will display the Command Prompt window:

It should take you to your user folder that corresponds to your login ID (this is slightly different in Windows XP). The Windows world uses the backslash character (\) to describe its folder structure.

So the following path C:\Users\Big John can be read as follows:

  • C: is the drive (in this case, the “C drive”)
  • The \ (backslash) separates different levels of folders and files, in this case,  \Users is the Users subfolder
  • followed by  \Big John, the “Big John” subdirectory (folder)
  • The folder names are not case sensitive (at least when navigating) (so “big john” is read the same as “Big John”)
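
For readers coming from the R posts, the same reading of a path can be sketched in R. This is only an illustration of the structure described above, not something the Command Prompt needs; sample.path and path.parts are made-up names, and note that a backslash has to be doubled inside an R string.

```r
# The example path from above; each \\ in R source is one literal backslash
sample.path <- "C:\\Users\\Big John"

# Split the path on the backslash to get the drive and each folder level
path.parts <- strsplit(sample.path, "\\", fixed = TRUE)[[1]]
print(path.parts)

# Windows ignores case when navigating, which is like comparing lowercased names
same.folder <- tolower("Big John") == tolower("big john")
print(same.folder)
```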

Okay, now you have the Command Prompt window open.

The > is known as the prompt, which is short for “the command line prompt.”

Your prompt is ready to receive instructions. (There’s a lot more to say about the environment in which you now find yourself, but for the sake of getting on with this tutorial we will leave that for another time.)

If you were to paste the code that you copied out of the .bat file we discussed in class and try to run it from where you are, chances are you will get nothing. That is because the prompt can only run things when it knows where they are. Much the same applies in the GUI, but Windows and Mac and Linux GUIs do a lot of work behind the scenes to find applications for you. You have two choices: tell the prompt where the application lives (by typing its full path in your command, or by adding its folder to the %PATH% variable), or navigate to where the WCG application is and run it from within its directory. (If you were going to use the application a lot, there are some other considerations, but we will leave those for another time; feel free to ask if you like.)

So to navigate within the Command Prompt, you can use the following commands

  • your current (working) directory is automatically shown in the command box to the left of the cursor
  • type dir to display the contents of the current working directory (this won’t show you hidden folders or files; to see those, type dir /a:h, which lists the items that have the hidden attribute)
  • type cd followed by a folder name to change the directory you are in
  • type cd .. (that’s cd followed by a space followed by two periods) to move “up” a directory
  • to see a list of most of the commands available from the command prompt, you can type help , or for help with a particular command, such as dir, you could enter after the > help dir

To navigate to the IBM Word Cloud directory, we are going to pretend it is on your Desktop:

C:\Users\Big John> cd desktop\IBM Word Cloud

This means: change directory (cd) to the subfolder called desktop, and within that one, go to another subfolder called IBM Word Cloud. You can always change one directory at a time and do a dir to see what folders are in there in case you don’t remember:

C:\Users\Big John> cd desktop

C:\Users\Big John\Desktop> cd IBM Word Cloud

Typically, most Terminal windows will start you in your user home directory. My best advice for the sake of this current activity is to use Windows Explorer or the Mac Finder and move the unzipped folder containing the WCG, which is named “IBM Word Cloud” in my case, to the Desktop or to your Documents folder. Some place easy to get to.

From here, you should be able to run the bat file for testing:

But if you want to paste in the script from within the bat file (right click from Explorer and open with your favorite text editor), then copy the text of the script as you normally would. To paste the text inside the Command Prompt window, click on the  upper left corner on the c:\ icon:

Click on Edit within the dropdown menu, then click Paste:

Keep watching the Digital Humanities Seminar blog for more information