Category: Lab Notebook

NEH 2012 Jefferson Lecture: Wendell Berry

Kudos to the NEH for making inspiring lectures like this available online (it also streamed this one live).

(I would have embedded the video, but I’m having a devil of a time disabling the autostart. Their embed code has no autostart variable, but I tried adding one and setting it to not play (play=”false”) along with other variations but none of my attempts have worked…)

Wendell Berry is one of those rare individuals who tries to walk his talk, inspiring the rest of us to also try. A write-up of the lecture appears here.

More on Using Microsoft Word’s Find/Replace tool

Because I’m having so much fun with Microsoft Word’s find/replace tool, I thought I would share another experience with it.

I had originally converted all my italicized words and phrases to tagged items in my earlier data prep entries. After running my macro to output author and book lists, I found lots of mistakes in my (manual) tagging, but they were easy enough to correct. As I worked along, I also corrected spellings (separating words that had run together, filling out full names, etc.). But I didn’t want to do all this again for my “clean” (readable) copy of the text file.

Now, I must admit, that for readable copies of texts, I need formatting such as italics. You may argue with me on future compatibility grounds, but the fact is, my reading experience, even with data, must take precedence over any such issues. So to “reprettify” my text, I used Microsoft Word’s find/replace feature to 1) find certain sets of tags and then 2) get rid of the surrounding tags, leaving the embedded content, and 3) format that content.

So in the Find/Replace window, I searched for:

(\<work*\>)(*)(\</work\>)

and replaced with:

\2

and added the italics format.

(*be sure to click More and check the Use wildcards option)

As in other programs that use regular expressions, you can group parts of the pattern with parentheses and refer to them later by position (\1 for the first group, \2 for the second, and so on). In this case, I wanted to get rid of the tags (<work> and </work>) but retain all the content between them, which is the second group.

The backslashes “\” tell Word to treat the following character literally (in other words, they escape special characters like “<” and “>”). I added a wildcard “*” after “<work” because I had originally created my data file using <book>, <poem>, etc. However, while starting to mark up a different set of files, I decided to use a broader tag name, <work>, with an attribute of “type” (e.g. <work type=”book”>). So when I want to find complete tags, I have to account for the extra information between the name and the end bracket of the opening tag.
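For readers more at home with conventional regular expressions, the same search can be sketched in Python. This is only a rough analogue and not what Word does internally: Python cannot apply italics, so the sketch simply drops the tags, and the sample sentence and the function name are made up for illustration.

```python
import re

# Rough Python analogue of Word's (\<work*\>)(*)(\</work\>) search.
# [^>]* plays the role of Word's "*" inside the opening tag (room for attributes);
# (.*?) is the content group that the replacement keeps.
WORK_TAG = re.compile(r"<work\b[^>]*>(.*?)</work>", re.DOTALL)

def strip_work_tags(text: str) -> str:
    """Remove <work ...>...</work> tags but keep the text between them."""
    return WORK_TAG.sub(r"\1", text)

# Hypothetical sample line, not taken from the exam corpus.
sample = 'Discuss <work type="book">Moby-Dick</work> and <work>Walden</work>.'
print(strip_work_tags(sample))
```

Here the content is the first group (\1) rather than the second, because the opening and closing tags are not wrapped in groups of their own.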

 

Here is an example of what I’m starting with:

 

And here’s what it looks like after finding/replacing the text:

 

 

Now, I could have done this for all tags rather than just this one particular set. But I didn’t want to highlight names within my reading (though I may change my mind about that). To do so, I would have to alter the search terms to something like this:

(\<[!/]*\>)(*)(\</*\>)

Notice how I added “[!/]” to the first group. This tells Word to make sure that the character following the “<” is not a forward slash “/”. I needed to do this because, without that restriction, Word found the first complete phrase just fine, but the next match would begin with the closing tag of the previous phrase, causing my opening/closing tags to get out of alignment. I didn’t run into this problem in my first example because the first group in my search terms precluded it from bringing back a closing tag; that is, it could only be an opening tag.

So just to walk through this (to help make sure I’m following this myself), Group 1, (\<[!/]*\>), reads:

“Look for the ‘<’ character followed by something that is NOT a ‘/’ character (again, to exclude the closing tag from the beginning), followed by all characters up to and including the opening tag’s closing bracket, ‘>’.”

The second group, (*), reads as

“(continue to) get me all characters”, followed by

the third group, (\</*\>), the final piece of the pattern:

“Look for the characters ‘</’ (the opening of the closing tag), followed by any text up to and including the final bracket of the closing tag.”
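The three groups can be tried out in Python as a rough equivalent. The pattern below is my own translation of the Word wildcards into conventional regex syntax (an assumption, not Word's own engine), and the sample sentence is invented.

```python
import re

# Group 1: "<", one character that is not "/", then everything up to and including ">".
# Group 2: the content, matched lazily so it stops at the first closing tag.
# Group 3: "</", any tag name, then ">".
ANY_TAG = re.compile(r"(<[^/][^>]*>)(.*?)(</[^>]*>)")

# Hypothetical sample line with two different tags.
sample = '<person>Anne Bradstreet</person> wrote <work type="poem">Contemplations</work>.'

for m in ANY_TAG.finditer(sample):
    # Show the three groups of each match side by side.
    print(m.group(1), "|", m.group(2), "|", m.group(3))

# Replacing each match with group 2 keeps only the content.
stripped = ANY_TAG.sub(r"\2", sample)
print(stripped)
```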

As I was typing this out, I noticed that there is a potential problem with using multiply embedded tags which my first find/replace would not encounter (again, due to specifying the particular tag). That is, I’m guessing that if I had the line,

“<person>John Anderson’s</person> essay, <work type=”essay”>Surviving <person>Walt Whitman’s</person> <work type=”book”>Leaves of Grass</work></work>”

the find/replace pattern would probably retrieve:

<work type=”essay”>Surviving <person>Walt Whitman’s</person>

which is not aligned. I can see how to fix this easily in a macro, just by saving off the tag-name part in a variable to use later. If Word’s search/replace tool works like other programs that use regular expressions, I imagine you could do the same thing by putting the tag-name part of Group 1’s pattern, (\<[!/]*\>), into a group of its own, (\<)([!/]*)\>, so that “\2” could then stand in for the tag name in the original Group 3’s pattern (where the “*” was originally): \<\/\2\>. The whole pattern would look like this:

(\<)([!/]*)\>*\<\/\2\>

Just out of curiosity, I went ahead and tested this, and the embedded tags were indeed a problem. However, my solution only partially worked: it found the first person tag just fine, but it only works with tags that contain nothing besides the tag name itself within the brackets (no attributes, for instance). There might be a way around this using exclusion (!) and the range brackets “[]”, but for now, I still think this is pretty cool. It is just good to know about problems/limitations before you start tagging your file so that you can come up with a scheme that will work with the tools you want (or have).
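Outside of Word, the backreference idea can be sketched in Python, where the tag-name group can stop before any attributes. This is an illustration under my own assumptions (straight quotes in the sample line, tag names made of word characters), not a claim about what Word's wildcard search can do.

```python
import re

# Group 1 captures only the tag name; an optional attribute part may follow it.
# \1 in the closing tag then requires the same name that opened the match.
PAIRED_TAG = re.compile(r"<(\w+)(?:\s[^>]*)?>(.*?)</\1>")

# The example line from above, typed with straight quotes.
line = ('<person>John Anderson\'s</person> essay, <work type="essay">Surviving '
        '<person>Walt Whitman\'s</person> <work type="book">Leaves of Grass</work></work>')

matches = [(m.group(1), m.group(0)) for m in PAIRED_TAG.finditer(line)]
for name, whole in matches:
    print(name, "->", whole)
```

The opening essay tag is no longer paired with </person>, but it is still paired with the first </work> it meets, which belongs to the inner book tag. A tag nested inside another tag of the same name remains out of reach for a single pattern like this and would need a macro or a real parser.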

Although I’ve mentioned before that I’m not so worried about “elegant” solutions, the danger of finding new tricks for myself is that it can sometimes keep me from moving forward on projects because of the fun in tweaking them…

Lab Notebook: Data Prep 2 and Wordsmith and other tools

 

Now, in my previous post, I explained that I was tagging some data in my exam corpus for future use as well as simply making reading and author lists. I was planning on using a feature in Wordsmith Tools that allows you to use tags as selectors in addition to Markup to INclude or EXclude. But I’ve been having a heck of a time making it work the way I want. Initially, I was using a text file encoded in big-endian format for some reason, but after some emails to the creator of Wordsmith Tools, Mike Scott, he quickly and politely explained that WS doesn’t like that format and instead prefers little-endian. WS thankfully also has a Unicode converter within its utilities that quickly fixed that problem, although WS 5’s version is buggy; WS 6’s beta version worked great.
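If the converter is not at hand, the same byte-order change can be sketched in a few lines of Python. This assumes the file is UTF-16 and that a little-endian file with a byte-order mark is what is wanted; the function name and sample text are mine, and I have not checked the result against Wordsmith itself.

```python
import codecs

def utf16_be_to_le(data: bytes) -> bytes:
    """Re-encode UTF-16 big-endian bytes as UTF-16 little-endian with a BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):
        data = data[len(codecs.BOM_UTF16_BE):]  # drop the big-endian byte-order mark
    text = data.decode("utf-16-be")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Hypothetical in-memory example; a real file would be read and written in binary mode.
big = codecs.BOM_UTF16_BE + "<book>Walden</book>".encode("utf-16-be")
little = utf16_be_to_le(big)
print(little[:2] == codecs.BOM_UTF16_LE)
```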

I tried the following based on Wordsmith’s help (using Tags) on the web.

So within the settings section, Tags, I told WS to automatically load a tag file in the Markup to Include section with just this line: <book>,</book>:

Don’t forget to click “load” (it’ll say “clear” if you have already loaded it). It’ll show you what tags it found:

 

In the Only Part of File, Sections to Keep, I have: <book> to <book> (the Sections to Cut out is empty):

When I create a new concordance on the tag <book>*, it finds all 8 entries in my test file. However, it brings back the full concordance context line:

After trying a bunch of different search words as well as checking the settings, I emailed Mike Scott just to make sure I understand what WS can actually do with this as well as to check my settings file. So while I’m waiting to hear back from him, I decided to go ahead and write a VBA macro (MS Word). I basically recorded my find of the tags and then edited it to save the results as a text file:

Sub WriteTagsToTextFile()
'
' WriteTagsToTextFile
' Collects every <book>...</book> entry in the active document and saves the list to a text file.
'
'declare my variables
Dim sBookList As String
Dim iCounter As Long
Dim strPath As String
Dim iFile As Integer

'initialize variables
iCounter = 0
strPath = "C:\Users\Me\BookList2.txt"

With Selection
    .HomeKey Unit:=wdStory 'like pressing Ctrl+Home to move to the beginning of the document
    .Find.ClearFormatting 'clear any formatting criteria left over from an earlier search
End With

With Selection.Find
    'find the tagged entries I'm interested in; in the future I may try to create an input box to enter this manually
    '"<" and ">" are special characters in wildcard searches, so each one is escaped with a backslash
    .Text = "(\<book\>*\</book\>)"
    .Forward = True
    .Wrap = wdFindStop
    .MatchWildcards = True
    Do While .Execute
        'append each found tag set to the list, with a carriage return / line feed after each one
        sBookList = sBookList & Selection.Range.Text & vbCrLf
        iCounter = iCounter + 1
    Loop
End With

'cleanup
Selection.Find.ClearFormatting
Selection.HomeKey Unit:=wdStory

'Now, create the text file and save the list--if the file exists, it will be overwritten
'I would like to use a Save As box here instead, but this is easier for now...
iFile = FreeFile 'next free file number; used below to write to and close the file
Open strPath For Output As #iFile
'Print # writes the text as-is (Write # would wrap the whole list in quotation marks)
Print #iFile, sBookList;
'remember to close the file
Close #iFile
'Remind yourself of what you just did--it isn't necessary, but it's helpful to know that the script really finished
MsgBox iCounter & " entries found and saved to " & strPath

End Sub

(btw, I’m trying out My Syntax to display my code snippets.)

After my macro runs, I see:

 

and my text file looks like this:

 

I can now clean this up by finding/replacing the tags with nothing, or open the file in Excel. The point is, I have my lists. I can also now count how often particular titles are referenced.

As I said in my code comments, I would rather NOT have hard-coded the file name and path; that is, I would like the macro to prompt the user for that information. But maybe later. Or maybe not at all if I am able to do this in Wordsmith. But now that I have my macro, I have other options… My main point in doing this in VBA (besides getting at the data I want) is to highlight the usefulness of having multiple tools to get at the same or similar data. As I’ve mentioned elsewhere, I’m more interested in getting at the data than in elegant solutions (though they may be cool ones).
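As one more example of reaching the same data with a different tool, the listing and counting can be sketched in Python. This is not the macro above: it assumes the tagged text has already been saved as plain text, and the sample string and function name are invented for illustration.

```python
import re
from collections import Counter

BOOK_TAG = re.compile(r"<book>(.*?)</book>", re.DOTALL)

def count_books(text: str) -> Counter:
    """Count how often each <book>...</book> title appears."""
    # Collapse runs of whitespace so a title split across lines counts as one title.
    titles = (" ".join(t.split()) for t in BOOK_TAG.findall(text))
    return Counter(titles)

# Hypothetical sample text, not taken from the exam corpus.
sample = ("Compare <book>Walden</book> with <book>Moby-Dick</book>. "
          "How does <book>Walden</book> treat solitude?")
counts = count_books(sample)
for title, n in counts.most_common():
    print(n, title)
```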

Next, I’ll share some results and future directions.

Lab Notebook: Data Prep

So recently, in preparing for one of my comprehensive exams in early American literature, I scanned the last eleven years’ worth of exams into PDF format to make it easier to take notes, copy and paste book titles, authors, etc. Unfortunately, the state of the copies in our department folders meant that I had to first make clean photocopies in order to be able to scan them using our Digital Humanities department’s Fujitsu ScanSnap (the side effect of which is a serious case of scanner envy). It wasn’t until I was looking through the newly created PDF that I found a few missing pages, page sequence issues, as well as page direction problems. But using Acrobat Pro’s tools, cleanup of this sort was easy. I also used Acrobat’s OCR tool on the file so that I could copy and paste the text into Word and Excel. I used Word as my first step so that I could clean up the text (lots of minor OCR misreads) before copying and pasting into Excel (I’m using different worksheets based on the different sections of the exams; though the format has evolved a little over the last decade, they basically fall into IDs, shorter essays, and longer essay sections). And since I was already cleaning up the text within Word, I decided to also keep the Word document just to have a cleaner version of the exams. In cleaning up my Word file, I tried to maintain all the original significant formatting, such as italics for book titles; it just makes the exams easier to read.

I also stripped out nonessential information, such as sectional instructions (though that could make for an interesting rhetorical analysis in itself), and just labeled each section “Part I,” “Part II,” and “Part III.”

 

My Excel file is just beginning with rudimentary information for now:

Eventually, I’ll add columns such as themes, persons of interests, periods, related critics, related novels, etc.

 

After getting this far, it occurred to me that this task might be easier and more valuable in the long run for gathering different sets of stats if I were able to insert markup tags. For instance, I know Word will let me search and replace based on text formatting, and so I might be able to search and replace all the italicized words with their original words but also with opening and closing tags (something like <book>…</book>). I couldn’t figure out how to do the replace part until I googled around for Word and using regular expressions. Sure enough, it handles them (read a good introduction on this by Microsoft: “Putting regular expressions to work in Word”).

However, in the end, I didn’t need to use them. I spent a great deal of time yesterday trying to get past the problem of being able to only replace every italicized word rather than the entire phrase. I eventually got it working using a VBA macro. But it turns out I didn’t need to use the regular expressions or my VBA code. Today, while retyping all of this (I lost my file while running test code; the lesson being, save all open files before testing out your VBA code!), I found exactly what I needed here. I just swapped out their replacement text with what I was looking for and it worked like a charm. The “^&” below is the code for what I originally was finding (think of it like a variable that contains the original text). By using it in the replace box, I’m able to insert what I needed to as well as the original search terms (in this case, the formatted phrase).

Very cool. And powerful.
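For comparison, conventional regular expressions have the same “whole match” idea. In Python's re module it is written \g<0> in the replacement text. The sketch below is only an analogue: plain text has no italics to search on, so it matches two made-up titles by name, and the function name is hypothetical.

```python
import re

def wrap_matches(text: str, pattern: str, tag: str) -> str:
    """Wrap every match of `pattern` in <tag>...</tag>, keeping the matched text."""
    # \g<0> stands for the entire match, much as ^& does in Word's Replace box.
    return re.sub(pattern, rf"<{tag}>\g<0></{tag}>", text)

# Hypothetical sample line and titles.
sample = "Discuss Walden in relation to Nature."
print(wrap_matches(sample, r"Walden|Nature", "book"))
```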

One thing to note, though: before I did this, I first had to search for italicized paragraph marks and replace them with unformatted ones, because an italicized paragraph mark would otherwise produce an empty set of tags.

 

Since I didn’t remove formatting within the replace box, however, my <book>…</book> tags were inserted as italicized text. So a simple search and replace of the tags with a non-formatted version and presto:

Just remember to do this with the closing tag as well. Though there are other cooler ways of doing this, simple and fast go a long way in my book.

 

Of course, I also had to manually verify that all of these tags actually were for book titles. There were a few cases of quotes or exam instructions that I hadn’t taken out, or cases where the question text was being emphasized. In those cases, I used other tags (such as <emphasis>why</emphasis> or <foreign>fin de siècle</foreign>). I currently have no need of these tags, but since it was easy (and I was verifying the text anyway), I decided to go ahead and use them.

 

Next, I want to use this file in Wordsmith Tools to see if anything interesting or useful pops up and see if I can simply create a list of the books based on the <book> tags.