Lab Notebook: Data Prep

So recently, in preparing for one of my comprehensive exams in early American literature, I scanned the last eleven years’ worth of exams into PDF format to make it easier to take notes, copy and paste book titles, authors, etc. Unfortunately, the state of the copies in our department folders meant that I first had to make clean photocopies before I could scan them on our Digital Humanities department’s Fujitsu ScanSnap (the side effect of which is a serious case of scanner envy). It wasn’t until I was looking through the newly created PDF that I found a few missing pages, page sequence issues, and page orientation problems. But with Acrobat Pro’s tools, cleanup of this sort was easy.

I also ran Acrobat’s OCR tool on the file so that I could copy and paste the text into Word and Excel. I used Word as my first step so that I could clean up the text (lots of minor OCR misreads) before copying and pasting into Excel. I’m using a different worksheet for each section of the exams; though the format has evolved a little over the last decade, the exams basically fall into IDs, shorter essays, and longer essays. And since I was already cleaning up the text in Word, I decided to keep the Word document as well, just to have a cleaner version of the exams. While cleaning it up, I tried to preserve all the significant original formatting, such as italics for book titles; it just makes the exams easier to read.

I also stripped out nonessential information, such as sectional instructions (though that could make for an interesting rhetorical analysis in itself), and just labeled each section “Part I,” “Part II,” and “Part III.”

 

My Excel file is just beginning with rudimentary information for now:

Eventually, I’ll add columns such as themes, persons of interest, periods, related critics, related novels, etc.

 

After getting this far, it occurred to me that this task might be easier, and more valuable in the long run for gathering different sets of stats, if I were able to insert markup tags. For instance, I know Word will let me search and replace based on text formatting, so I might be able to replace every italicized phrase with the same text wrapped in opening and closing tags (something like <book>…</book>). I couldn’t figure out how to do the replace part until I googled around for Word and regular expressions. Sure enough, it handles them (read a good introduction on this by Microsoft: “Putting regular expressions to work in Word”).

However, in the end, I didn’t need to use them. I spent a great deal of time yesterday trying to get past the problem of only being able to replace each italicized word individually rather than the entire phrase. I eventually got it working using a VBA macro. But it turns out I didn’t need the regular expressions or my VBA code. Today, while retyping all of this (I lost my file while running test code; the lesson being, save all open files before testing out your VBA code!), I found exactly what I needed here. I just swapped out their replacement text with what I was looking for and it worked like a charm. The “^&” code stands for whatever the search originally found (think of it like a variable that contains the original text). By using it in the replace box, I’m able to insert what I need along with the original search result (in this case, the formatted phrase).

Very cool. And powerful.
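For anyone more at home in a scripting language, the “^&” idea has a rough counterpart in Python’s regular expressions, where `\g<0>` in the replacement string stands for the whole matched text. This is only a sketch of the concept, not what Word does internally: the sample sentence and the list of titles are made up for illustration, and plain text has no italics to search on, so the titles are matched by name instead.

```python
import re

# Made-up exam sentence; in Word, the search matched on italic formatting instead.
text = "Compare Wieland and The Coquette as seduction narratives."

# Hypothetical list of titles standing in for "everything that was italicized".
titles = ["Wieland", "The Coquette"]
pattern = "|".join(re.escape(t) for t in titles)

# \g<0> plays the role of Word's ^&: it re-inserts whatever was matched.
tagged = re.sub(pattern, r"<book>\g<0></book>", text)
print(tagged)
```

The replacement string never spells out a title; it only says “put the match back here, with tags around it,” which is the same trick as “^&” in the replace box.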

One thing to note, though: before I did this, I first had to search for italicized paragraph marks and replace them with non-formatted ones, because an italicized paragraph mark would otherwise produce an empty set of tags.
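The two snags described here (tagging each word rather than the whole phrase, and empty tags around an italicized paragraph mark) can be sketched outside Word as well. The Python below is a simplified model, not Word’s actual behavior: it assumes the text arrives as a list of (text, is_italic) pairs, and the function name `tag_italic_runs` is invented for this example.

```python
def tag_italic_runs(runs, tag="book"):
    """Wrap each maximal stretch of italic text in <tag>...</tag>.

    `runs` is a list of (text, is_italic) pairs, a simplified stand-in
    for how a word processor stores formatted text.
    """
    out = []
    phrase = []  # italic pieces collected since the last non-italic run

    def flush():
        joined = "".join(phrase)
        phrase.clear()
        if not joined:
            return
        core = joined.strip()
        if core:
            # Keep leading and trailing whitespace outside the tags.
            lead = joined[: len(joined) - len(joined.lstrip())]
            trail = joined[len(joined.rstrip()):]
            out.append(f"{lead}<{tag}>{core}</{tag}>{trail}")
        else:
            # Italic whitespace or paragraph mark only: emit it untagged,
            # so no empty <tag></tag> pair is produced.
            out.append(joined)

    for text, is_italic in runs:
        if is_italic:
            phrase.append(text)
        else:
            flush()
            out.append(text)
    flush()
    return "".join(out)


# Illustrative input: a three-piece italic title, then an italic paragraph mark.
runs = [("Discuss ", False), ("The", True), (" ", True),
        ("Coquette", True), (".", False), ("\n", True)]
print(tag_italic_runs(runs))
```

Collecting adjacent italic pieces before tagging is what keeps “The Coquette” as one phrase, and the whitespace check covers the italicized paragraph mark case.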

 

Since I didn’t remove formatting within the replace box, however, my <book>…</book> tags were inserted as italicized text. So a simple search and replace swapping the italicized tags for a non-formatted version and presto:

Just remember to do this with the closing tag as well. Though there are other cooler ways of doing this, simple and fast go a long way in my book.

 

Of course, I also had to manually verify that all of these tags actually marked book titles. There were a few cases of quotes or exam instructions that I hadn’t taken out, or cases where the question text was being emphasized. In those cases, I used other tags (such as <emphasis>why</emphasis> or <foreign>fin de siècle</foreign>). I currently have no need of these tags, but since it was easy (and I was verifying the text anyway), I decided to go ahead and use them.

 

Next, I want to load this file into WordSmith Tools to see if anything interesting or useful pops up, and to see if I can simply create a list of the books based on the <book> tags.
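If the tagged text is saved as plain text, the book list could also be pulled out with a few lines of Python. This is a minimal sketch that does not involve WordSmith Tools at all; the sample string and the `list_books` name are invented for illustration, and it assumes the <book> tags are never nested.

```python
import re
from collections import Counter

# Made-up snippet in the tagged format described above.
sample = (
    "Part I. Identify <book>Wieland</book>. "
    "Part II. Compare <book>The Coquette</book> with <book>Wieland</book>. "
    "<emphasis>Why</emphasis> does it matter?"
)

# Non-greedy match so each <book>...</book> pair is captured separately.
BOOK_TAG = re.compile(r"<book>(.*?)</book>", re.DOTALL)


def list_books(text):
    """Count how often each tagged title appears in the text."""
    # Collapse internal whitespace so a title split across lines still matches.
    titles = [" ".join(match.split()) for match in BOOK_TAG.findall(text)]
    return Counter(titles)


counts = list_books(sample)
for title, n in counts.most_common():
    print(f"{title}\t{n}")
```

The tab-separated output pastes straight into an Excel worksheet, and other tags such as <emphasis> are simply ignored.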
