Respect the Work: Shortcuts Used to Republish Public Domain Works
The first post of this series, “Respect the Work: Republishing Public Domain Works,” called on those who republish materials in the public domain to honor the original author with an authentic, faithful reproduction of their work. The post was prompted by the controversy over the inappropriate cover model chosen for a republication of Anne of Green Gables, by Lucy Maud Montgomery.
I first learned about the controversy when a friend posted a link to the Guardian article on his Facebook timeline. His post was scathing, and his friends were visibly disappointed and angry in their comments. Their reaction to what had been done to the image of the character of Anne was comparable to what I might expect if they were talking about a close friend in real life whose image and character had been maligned. The same intensity showed in the reviews left for the edition on Amazon.com.
This energetic reaction to a poor cover image for a story originally published in 1908 should serve as fair warning to publishers who elect to republish materials in the public domain: readers care. Capturing historic creative works so that they can be enjoyed now and in the future is a worthwhile enterprise. When public domain books are made available for free, as from Project Gutenberg, providing them in easily produced formats is understandable and acceptable. But if a publisher charges for an edition of a work that is otherwise freely available in the public domain, they should add value to the work, or at least not detract from it, as the poor cover choice did for Anne of Green Gables.
A poor quality cover design unsuitable to the work is perhaps the most visible indication that a publisher does not respect the work of the original author. Other shortcuts that reflect poorly on the original author involve the way publishers capture and reproduce text and graphics. Two common means of capturing text from a public domain work are recording each page as an image, and scanning the document and processing it with optical character recognition (OCR) software.
Compiling Page Images as Books
Publishers sometimes offer as books compilations of images of the pages from the original work, bound electronically or physically. If the book were composed of reproductions of major works of art or high quality illustration plates, or were hand-written calligraphy, a book consisting of bound page images might be warranted as the best means to make the work available to contemporary readers. Aside from unique cases such as these, copying and compiling page images most often results in a poor quality book. When the pages are imaged, compiled, and bound into a physical volume, often the text quality is poor to begin with and made even worse in the course of reproduction. If the image is recorded strictly as a high-contrast, low-resolution, black-and-white document image, illustrations in light-colored ink may be washed out altogether.
The snips below illustrate these differences. The snip on the left was taken from an edition of Margaret Hill McCarter’s The Corner Stone, provided for free in electronic (PDF) format by Forgotten Books. The snip on the right is a high resolution color scan from an original paperback copy of the book. First, comparing the two snips shows how the low resolution scanning process used to produce the PDF edition on the left has degraded the text quality. Second, in the Forgotten Books edition, the first word, with drop cap formatting, has been lost. This loss is only one word out of sixty, or less than 2% of the text, but it makes reading and understanding the passage incrementally more difficult. Finally, in capturing the text as a high-contrast, black-and-white document, the two lines of golden graphics the original publisher used to set off the text passage have been lost altogether.
In addition to poor text quality from the original scan, a book composed of page images has other disadvantages. Image compression used to format the page for an electronic reader can reduce the image quality even further. Also, since each page consists of a single image, the reader cannot change the font, the font size, or the other display features available on electronic readers. Finally, easy navigation within the text of a book and the ability to search for words and phrases are two benefits provided by electronic readers. When publishers offer a book composed of page images for reading on electronic readers, the book rarely includes an active table of contents or index, or the ability to search the text.
One publisher offers the caveat below for historic books that it produces as compiled page images:
This is a reproduction of a book published before 1923. This book may have occasional imperfections such as missing or blurred pages, poor pictures, errant marks, etc. that were either part of the original artifact, or were introduced by the scanning process. We believe this work is culturally important, and despite the imperfections, have elected to bring it back into print as part of our continuing commitment to the preservation of printed works worldwide. We appreciate your understanding of the imperfections in the preservation process, and hope you enjoy this valuable book.
Another publisher provides this caveat:
[This publisher] utilizes the latest technology to regenerate facsimiles of historically important writings. Careful attention has been made to accurately preserve the original format of each page whilst digitally enhancing the aged text.
At least the readers of these books have been told what to expect.
Compiling Unedited OCR Processed Pages as Books
The second method used to capture text from existing manuscripts for republication is to scan the document and process the text using optical character recognition. Numerous OCR programs can rapidly convert a page image into editable text. OCR is prone to errors if the page is dirty or damaged, if the type is too small or too tightly spaced, if the font in the original work is too ornate, or if the text in the image is badly aligned. Some OCR programs handle tables, columns and text boxes particularly badly, mixing the text together into a single line of text. Finally, OCR can introduce errors into punctuation, or more commonly, miss punctuation completely.
Processing a page image by OCR typically captures as much as 90 – 95% of the document text, but it is exceedingly rare for any OCR program to reproduce a complete page with 100% accuracy. Eliminating errors that have been introduced into the text and ensuring that all of the text was captured requires editing: comparing the original to the OCR text, page by page, line by line, word by word. It can take longer to proofread and correct the processed text than it would take for an accomplished typist to enter the text from scratch.
To illustrate the need for proofreading text generated using OCR on a page image, the snip on the left is part of a page from an original copy of The Corner Stone. The snip on the right shows the result from using OCR on this image, without editing. There are nine errors in a selection of about 150 words, or a 6% error rate. And I should note, the OCR program used to process this passage was the best of the three that I have used.
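For readers who want to measure an error rate like this themselves, the short Python sketch below shows one possible way to do it. It assumes that a hand-checked transcription of the passage is available to compare against, which is not something an OCR program provides. The function names and the sample sentences are made up for illustration (the sentences are not from The Corner Stone), and the counting rule is just one reasonable convention: every word that was changed, dropped, or added counts as one error.

```python
from difflib import SequenceMatcher


def count_word_errors(reference: str, ocr_output: str) -> int:
    """Count word-level differences between a trusted transcription
    and raw OCR output (hypothetical helper for illustration)."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    # autojunk=False stops difflib from ignoring very common words
    # when the sequences are long.
    matcher = SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A changed span counts as the larger of the two sides, so a
            # dropped or added word is one error, and so is a misread word.
            errors += max(i2 - i1, j2 - j1)
    return errors


def word_error_rate(reference: str, ocr_output: str) -> float:
    """Errors divided by the number of words in the trusted transcription."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text is empty")
    return count_word_errors(reference, ocr_output) / len(ref_words)


if __name__ == "__main__":
    # Made-up sample: two misread words and one dropped word.
    trusted = "The quick brown fox jumps over the lazy dog"
    raw_ocr = "The quick hrown fox jumps over tbe lazy"
    print(count_word_errors(trusted, raw_ocr), "errors")
    print(f"{word_error_rate(trusted, raw_ocr):.1%} error rate")

    # The arithmetic from the passage above: nine errors in about 150 words.
    print(f"{9 / 150:.0%}")
```

A count like this only says how much correction remains to be done. It does not replace reading the OCR text against the original page.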
Unfortunately, not all publishers are willing to invest the time and labor required to proofread text obtained by OCR from a scanned image. Some publishers offer the raw, unedited text obtained by processing the image with OCR as an electronic book. To me, this shows far more egregious disrespect for an original creative work than a poor choice of cover design does.
Recapturing a public domain work by processing the page image with OCR requires significant editing, but it also offers advantages over compiling page images, particularly with respect to publishing in electronic form. For a publisher who chooses to capture text from an original work using OCR, the resultant text is more amenable to formatting with section headings and tables of contents, and can be searched. The publisher can change the font used to present the text, and is not constrained to the font used in the original work. Once formatted for a particular brand of electronic reader, the font size – and in some cases even the choice of font – can be manipulated by the reader, making for an improved reading experience. These benefits, however, are not enough to overcome poorly edited text.
My friend was alarmed at the poor choice that a publisher made for the cover of a republication of a well-loved book; I find it egregious that publishers are able to profit from the creative work of others when their choices for how to capture and present that work ultimately detract from the original work. As someone who buys both hard copy and electronic copy of historic works, I avoid publishers who take shortcuts such as those described here, and look for publishers who honor the original creative work with an authentic and faithful reproduction of that work.
The next post in this series provides more detail on what it means to me that a publisher honors the original author.