The more you know

One of the most interesting things about working in the digital humanities is how much you can learn from even the smallest projects. In an ebook I was reading, I discovered a set of characters that had been encoded not as capital Ks but as capital Kappas (I suspect an OCR glitch, though I haven’t had a chance to confirm that yet). It seemed handy to run the whole ebook through a program that could extract every individual character and sort the results, so that I could spot any other similarly misencoded characters. Because the document was encoded as UTF-8, I knew that sorting the characters in code-point order would separate out any other Greek characters, which have higher numerical values than their look-alike Roman counterparts, making them easy to pick out. I put together a quick Python program and ran it on the text.
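
For the curious, the idea was roughly the following; this is a reconstruction rather than my original script, and the filename is just a placeholder (how the file gets decoded, it turns out, is the crux of everything that follows).

    from collections import Counter

    # Tally every distinct character in the ebook, then print the characters
    # in code-point order so that Greek look-alikes sort after their Roman
    # counterparts. ("ebook.txt" stands in for the actual file.)
    with open("ebook.txt") as f:        # how this file gets decoded is the catch
        text = f.read()

    counts = Counter(text)
    for char in sorted(counts):
        print(f"U+{ord(char):04X}  {char!r}  x{counts[char]}")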

The results were not what I’d expected. Where I’d expected the character Kappa to appear, I found nothing. I narrowed the text I was processing down to just known instances of Kappa, and again found nothing. Instead, each Kappa seemed to be breaking apart into two other characters, specifically “ö” and “Œ.” I had a hunch about what was going wrong, but it still seemed strange to me: I’d assumed that because UTF-8 is built on 8-bit units (one byte each), every character it stores would be a single byte. In fact, Kappa turns out to be a two-byte character, and my Python program was splitting those two bytes apart and treating them as two separate one-byte characters.
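
You can see those two bytes for yourself in an interactive Python session. And if, as I suspect, my output was being displayed as Mac Roman (that part is a guess about my setup rather than something I’ve confirmed), decoding each byte on its own produces exactly the two look-alikes I was seeing:

    >>> "Κ".encode("utf-8")                  # Greek capital Kappa, U+039A
    b'\xce\x9a'
    >>> bytes([0x9a]).decode("mac-roman")    # assuming a Mac Roman display
    'ö'
    >>> bytes([0xce]).decode("mac-roman")
    'Œ'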

A little research confirmed my hunch: UTF-8 does indeed include a number of two-byte characters (as well as three- and four-byte ones). The first byte of a multi-byte character tells whatever is doing the decoding how many bytes belong to that character, and each of the bytes that follow is marked as a continuation rather than the start of a new character. This allows the occasional multi-byte character to be mixed in among the single-byte characters that make up most English-language text, which saves a great deal of space and processing, since the most common characters aren’t padded out with empty bytes.
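
The pattern is easy to see if you print out the bits of a few characters of different widths: the first byte of a multi-byte character begins with as many 1s as the character has bytes (followed by a 0), and every continuation byte begins with 10.

    # One-, two-, three-, and four-byte examples: K, Kappa, the euro sign,
    # and the Gothic letter hwair. The leading 1s of the first byte announce
    # the character's width; every continuation byte begins with 10.
    for char in ("K", "Κ", "€", "𐍈"):
        encoded = char.encode("utf-8")
        print(f"U+{ord(char):04X}", [f"{byte:08b}" for byte in encoded])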

This scheme for mixing multi-byte characters in among single-byte ones works well for the most part, but it causes problems wherever the bytes aren’t decoded properly. That is what happened in my Python program, and it also happens frequently on webpages, where characters such as smart quotes are regularly rendered as “junk.” It’s why one of the first things you learn when working with HTML is to encode smart quotes using named or numeric character references rather than copying and pasting them in from Word (or other such programs). Personally, I find it a little odd that such commonly used characters are treated so awkwardly, and I’d love to learn more about why that is (something I’ll have to look into once I’ve cleared up some of my current research).
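
The smart-quote problem is the same phenomenon in miniature. The opening curly quote is three bytes in UTF-8, and anything that reads those bytes under a legacy single-byte encoding such as Windows-1252 turns them into the familiar junk (for the record, the named and numeric references for that character are &ldquo; and &#8220;):

    # The left curly quote (U+201C) occupies three bytes in UTF-8; decode
    # those same bytes as Windows-1252 and three junk characters appear.
    quote = "\u201c"
    raw = quote.encode("utf-8")         # b'\xe2\x80\x9c'
    print(raw.decode("cp1252"))         # prints: â€œ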

As usual, the digital humanities community was very helpful. After I posted my problem to Twitter, a number of people responded or retweeted, and with the help of Ted Underwood and my friend Chris I was able to get my program working properly (for those wondering, they pointed me to this helpful Python doc on Unicode, a must-read for DHers interested in data mining who would like to avoid the troubles I ran into; the essence of the fix is sketched below). Moments like this always make me realise how much digital humanists have to learn as we go. While most scholars will never have to think about the way their computers encode text, it is almost impossible for a digital humanist to work with a text without knowing exactly how it’s encoded, lest we mangle the very thing we’re trying to study. My students ran into a related problem last year when many of them uploaded .doc files to TAPoR instead of the .txt files they were supposed to be working with, and ended up with a mess of XML tagging in their results rather than the text they’d expected.
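
As for the fix itself, the gist is simply to tell Python up front how the file is encoded, so that the bytes are decoded into characters before the program starts counting anything; something along these lines (a sketch of the principle, not my exact change):

    # Tell Python how the file is encoded, so the two bytes of each Kappa
    # are stitched back into a single character before anything else runs.
    # ("ebook.txt" is again a placeholder.)
    with open("ebook.txt", encoding="utf-8") as f:
        text = f.read()

    print(text.count("\u039a"))         # counts whole Kappas, not stray bytes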

These incidents also help highlight the strong kinship DH work has with that of bibliographers. Anyone undertaking a bibliographical project will quickly find themselves immersed in the printing technology of the time period they’re studying. For example, the rise of word processing has meant that we never really have to deal with kerned type and the ligatures that prevent kerned sorts from fouling when they are forced to precede a sort with an ascender—something that would have been obvious to 18th-century printers, and would have informed all of their typesetting. As such, we find ourselves mentally (and for many bibliographers, physically) venturing into printing shops that reflect the periods we study, much as I found myself learning the fundamentals of UTF-8 encoding and decoding.

The more you know…
