Because I’m having so much fun with Microsoft Word’s find/replace tool, I thought I would share another experience with it.
I had originally converted all my italicized words and phrases to tagged items in my earlier data prep entries. After running my macro to output author and book lists, I found lots of mistakes in my (manual) tagging, but they were easy enough to correct. As I worked along, I also fixed other problems (words run together, abbreviated names that needed filling out to full names, etc.). But I didn’t want to do all this again for my “clean” (readable) copy of the text file.
Now, I must admit that for readable copies of texts, I need formatting such as italics. You may argue with me on future compatibility grounds, but the fact is, my reading experience, even with data, must take precedence over any such issues. So to “reprettify” my text, I used Microsoft Word’s find/replace feature to 1) find certain sets of tags, 2) get rid of the surrounding tags, leaving the embedded content, and 3) format that content.
So in the Find/Replace window, I searched for:
and replaced with:
and added the italics format.
(*be sure to click More and check the Use wildcards option)
As in other programs that use regular expressions, you can group parts of the search pattern with parentheses so that you can refer to them later by position (the first, second, or third grouped item). In this case, I wanted to get rid of the tags (<work> and </work>) but retain all the content between them, which is the second group.
The backslashes “\” tell Word to treat the following character literally (in other words, they escape special characters such as “<”). The reason I added a wildcard “*” after “<work” was that I had originally created my data file using <book>, <poem>, etc. However, while starting to mark up a different set of files, I decided to use a broader tag name, <work>, with an attribute of “type” (for example, <work type="book">). So when I want to find complete tags, I have to account for the extra information between the name and the end bracket of the opening tag.
Here is an example of what I’m starting with:
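For readers who do not have Word handy, here is a rough Python analogue of the search described above. It is only a sketch: the Word search was presumably something like (\<work*\>)(*)(\</work\>) with \2 as the replacement, but that is a reconstruction from the description, and the function name `strip_work_tags` and the sample sentence are made up for illustration. Plain text has no italics, so the optional `marker` argument stands in for the formatting step.

```python
import re

# Assumed equivalent of the Word wildcard search:
#   group 1 = opening <work ...> tag (with or without attributes)
#   group 2 = the content between the tags (matched lazily)
#   group 3 = closing </work> tag
WORK_TAG = re.compile(r"(<work[^>]*>)(.*?)(</work>)", re.DOTALL)

def strip_work_tags(text: str, marker: str = "") -> str:
    """Drop the <work> tags, keep group 2, optionally wrap it in a marker."""
    return WORK_TAG.sub(lambda m: f"{marker}{m.group(2)}{marker}", text)

sample = 'I read <work type="book">Leaves of Grass</work> and <work>Song of Myself</work>.'
print(strip_work_tags(sample))        # tags removed, content kept
print(strip_work_tags(sample, "*"))   # content wrapped in * as stand-in italics
```

The `[^>]*` after `<work` plays the same role as the “*” wildcard in the Word search: it soaks up any attribute text before the closing bracket of the opening tag.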
And here’s what it looks like after finding/replacing the text:
Now, I could have done this for all tags rather than just this one particular set. But I didn’t want to highlight names within my reading (though I may change my mind about that). To cover every tag, I would have to alter the search terms to something like this:
Notice the “[!/]” I added to the first group of (\<[!/]*\>)(*)(\</*\>). This tells Word to make sure that the character following the “<” is not a forward slash “/”. I needed to do this because, although Word found the first complete phrase just fine, without that restriction the next match would begin with the closing tag of the previous phrase, causing my opening and closing tags to get out of alignment. I didn’t run into this problem in my first example because the first group in those search terms could only match an opening tag, never a closing one.
So just to walk through this (to help make sure I’m following this myself), Group 1, (\<[!/]*\>), reads:
“Look for the “<” character, followed by a character that is NOT “/” (again, to exclude a closing tag from the beginning), followed by all characters up to and including the opening tag’s closing bracket, “>”.”
The second group, (*), reads as
“(continue to) get me all characters”, followed by
the third group, (\</*\>), the final piece of the pattern:
“Look for the characters, “</” (the opening bracket of the closing tag), followed by any text until and including the final bracket of the closing tag.”
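The three groups walked through above can be mimicked in Python, which makes it easy to try the idea on a sample line. This is a sketch under assumptions: Python's `re` syntax differs from Word's wildcards (`[^/]` instead of `[!/]`, a lazy `.*?` instead of `*`), and the name `strip_any_tags` and the sample text are invented for the example.

```python
import re

# Group 1: "<", one character that is not "/", then everything up to ">".
# Group 2: the content, matched lazily (as few characters as possible).
# Group 3: "</", then everything up to and including the closing ">".
ANY_TAG_PAIR = re.compile(r"(<[^/][^>]*>)(.*?)(</[^>]*>)", re.DOTALL)

def strip_any_tags(text: str) -> str:
    """Replace each opening-tag/content/closing-tag run with its content."""
    return ANY_TAG_PAIR.sub(r"\2", text)

flat = '<person>Walt Whitman</person> wrote <work type="book">Leaves of Grass</work>.'
print(strip_any_tags(flat))
```

This works on flat text like the sample, where no tag sits inside another tag.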
As I was typing this out, I noticed a potential problem with nested tags that my first find/replace would not run into (again, because it specifies one particular tag). That is, I’m guessing that if I had the line,
“<person>John Anderson’s</person> essay, <work type="essay">Surviving <person>Walt Whitman’s</person> <work type="book">Leaves of Grass</work></work>”
the find/replace pattern would probably retrieve:
<work type="essay">Surviving <person>Walt Whitman’s</person>
which is not aligned. I can see how to fix this easily in a macro, just by saving off the tag name in a variable to use later. If Word’s search/replace tool worked like other programs using regular expressions, I imagine that I could do the same thing by subdividing Group 1’s pattern, (\<[!/]*\>), into more groups: (\<[!/])(*)(\>). I could then use “\2” in the tag name part of the original Group 3’s pattern (where the “*” was originally): (\</\2\>), so that it would look like this:
Just out of curiosity, I went ahead and tested this, and the nested tags were indeed a problem. However, my solution only partially worked: it found the first person tag just fine, but it only works with tags that contain nothing else (like attributes) within the brackets besides the tag name itself. There might be a way around this using exclusion (!) and the range “[ ]” (the square brackets), but for now, I still think this is pretty cool. It is just good to know about problems and limitations before you start tagging your file so that you can come up with a scheme that will work with the tools you want (or have).
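The misalignment, and the back-reference idea for fixing it, can both be reproduced in Python. The sketch below is an analogue, not Word syntax, and the variable names are made up. The `named` pattern captures only the tag name in its own group and lets a separate `[^>]*` absorb any attributes, which is one way around the attribute limitation; whether Word's wildcard syntax can express the same thing is untested here.

```python
import re

line = ('<person>John Anderson\'s</person> essay, <work type="essay">Surviving '
        '<person>Walt Whitman\'s</person> <work type="book">Leaves of Grass</work></work>')

# Generic pattern: any closing tag is accepted, so pairs can get misaligned.
generic = re.compile(r"(<[^/][^>]*>)(.*?)(</[^>]*>)")

# Back-reference pattern: group 1 is the tag name only, group 2 the attributes,
# and \1 forces the closing tag to carry the same name as the opening tag.
named = re.compile(r"<([A-Za-z]+)([^>]*)>(.*?)</\1>")

generic_hits = [m.group(0) for m in generic.finditer(line)]
named_hits = [m.group(0) for m in named.finditer(line)]

for hit in generic_hits:
    print("generic:", hit)
for hit in named_hits:
    print("named:  ", hit)
```

With the generic pattern, the second hit pairs the opening essay tag with the closing person tag, which is the misalignment predicted above. The back-reference version pairs it with a closing work tag instead, but because a work is nested inside another work, it stops at the inner closing tag and leaves the final closing tag behind. Same-name nesting is beyond what a single pattern like this can handle, which is one more reason to settle on a tagging scheme with the tool's limits in mind.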
Although I’ve mentioned before that I’m not so worried about “elegant” solutions, the danger of finding new tricks for myself is that it can sometimes keep me from moving forward on projects because of the fun in tweaking them…