EEBO Project: The reviewing process
Background
Files are selected for keying by John Latta in Michigan, and are sent to three firms specialising in capturing this material: PDCC (now known as SPI Content Sciences), TECH and APEX.
The keyed-in texts are then returned to Michigan, where once a month they are divided between the reviewers in Michigan and the reviewers in Oxford. The Oxford share is then sent by FTP from Michigan to our inbox here in Oxford. The name of the directory in the inbox normally corresponds to the previous month (since that is the month when the files actually arrived from the vendors). This is to ensure that the directory naming system matches the one in use at Michigan.
With each batch of files comes a simple file list, named after the same month, e.g. filelist.200202.txt. This can be converted to a spreadsheet along the following lines:
File name / Local Number / Verdict / Reviewer / Date / File Size (bytes) / Sample (bytes) / Inex. / excusable / $ / illegibles / illegibility rating
S10765.apex.sgm / A00631 / 190,343
S10787-5.pdcc.sgm / A00652 / 8,982
S11181.pdcc.sgm / A68062 / 201,937
S12206a-3.pdcc.sgm / A02062 / 7,686
At the end of the month, the files we have completed, together with their notes files, are collected from us by ftp to go back to Michigan, where they are converted to XML before being mounted on the EEBO searchable texts website.
Step One
First, select your text from the month’s inbox and move it to your own “in progress” folder (Emma, Judith, Jonathan).
Step Two
Prepare your proofing sample. First find out how many pages there are. You do this in TextPad by searching for all occurrences of <PB[^>]*> within the file. This is a regular expression which looks for all the page break <PB> tags, whatever their attributes. In TextPad, use “Search in Files” with options “binary files”, “all matching lines” and “regular expression” ticked. When you have the list in front of you, you should run your eye down and check that the <PB>s are complete and in order. Any anomalies should be noted and checked when you are working on the text later. Vendors have a habit of putting an unnecessary <PB> at the beginning and/or end of the text – watch out for this!
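For anyone who prefers a script to the TextPad search, the same page-break check can be sketched in Python. This is only an illustration: the function name and the attribute names in the sample string are made up, and only the regular expression comes from the step above.

```python
import re

# Same pattern as the TextPad search: any <PB ...> tag, whatever its attributes.
PB_PATTERN = re.compile(r"<PB[^>]*>")

def list_page_breaks(sgml_text):
    """Return every page-break tag in the order it appears in the file."""
    return PB_PATTERN.findall(sgml_text)

# Hypothetical fragment; real files will have different attributes.
sample = '<PB N="1"><P>First page</P><PB N="2" REF="3"><P>Second page</P><PB>'
breaks = list_page_breaks(sample)
print(len(breaks), "page breaks found")
for tag in breaks:
    print(tag)
```

The printed list still needs to be checked by eye for gaps, misordering and stray <PB>s at the start or end of the text.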
You should select 5% of the book, or five pages, whichever is greater. For example, in a 200 page book you would proof 10 pages, and in a 67 page book you would proof 5 pages. Do not pick the first page or two, or the last page or two, since these are usually blank flyleaves.
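The sample-size rule can be written out as a small calculation. The sketch below assumes that 5% is rounded up to a whole page and that a book shorter than five pages is proofed in full; neither point is stated above, so treat both as assumptions.

```python
def sample_pages(total_pages):
    """Pages to proof: 5% of the book or five pages, whichever is greater."""
    # 5% rounded up, using integer arithmetic (assumption: round up).
    five_percent = (total_pages + 19) // 20
    # Assumption: never ask for more pages than the book actually has.
    return min(total_pages, max(5, five_percent))

print(sample_pages(200))  # the 200 page example
print(sample_pages(67))   # the 67 page example
```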
Prepare the sample by opening the file in TextPad and copying the selected pages to a new document called filename.test.sgm
Step Three
Download your image set. In the header of the file there is the tag <VID>a number</VID>. This is the image reference number used by the EEBO Images website. Open up the “Quick PDF downloader” webpage and first click on St. Augustine (all will become clear) to go to the ProQuest image site and get authenticated to use the images. Then use the back button, enter the VID number and click “tolle, lege”. You will then be prompted to choose the location to save the image to, which should be your working directory, where the file is also held.
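If it helps, the VID number can be pulled out of the file with a short script rather than by eye. This is a minimal sketch: the function name and the surrounding header element in the example are hypothetical, and only the <VID> tag itself is taken from the description above.

```python
import re

def find_vid(sgml_text):
    """Return the contents of the first <VID>...</VID> element, or None."""
    match = re.search(r"<VID>\s*(.*?)\s*</VID>", sgml_text, re.DOTALL)
    return match.group(1) if match else None

# Hypothetical header fragment for illustration only.
print(find_vid("<HEADER><VID>12345</VID></HEADER>"))
```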
Step Four
Make your proofing sample and start your notes file.
Two Perl scripts are used. The first converts the SGML in the test file (created in Step Two) to HTML for proofing. The second strips the test file of all its tags, so that the resulting file size equals the number of characters keyed in by the vendor; this figure is used for the error count. There must not be more than one character error (mistranscription) per 20,000 characters. So for a stripped sample of 52,000 bytes, we would accept the file if two or fewer errors were discovered during proofing, pardon it if three were found, and reject it if there were more than three.
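The accept/pardon/reject arithmetic can be sketched as follows. The rule here is inferred from the single 52,000-byte example (one allowed error per complete 20,000 bytes, with exactly one error over the allowance being pardoned), so it is an assumption rather than the project's definitive formula.

```python
CHARS_PER_ALLOWED_ERROR = 20_000

def verdict(stripped_size_bytes, errors):
    """Suggest a verdict from the stripped sample size and the error count."""
    # Assumption: the allowance is rounded down to whole errors.
    allowed = stripped_size_bytes // CHARS_PER_ALLOWED_ERROR
    if errors <= allowed:
        return "accepted"
    # Assumption: exactly one error over the allowance is pardoned.
    if errors == allowed + 1:
        return "pardoned"
    return "rejected"

for found in (2, 3, 4):
    print(found, verdict(52_000, found))
```

The reviewer's judgement still takes precedence, for example with very small books.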
To run the Perl scripts, open a command prompt (make sure it has opened in the correct directory), type “makesample” and hit <ENTER>. Then enter “dir”. This will make the sample and give you the information you need to fill in the notes file you are about to create.
Open up the template notes file from the Docs folder and save it to your working directory with the name filenamenotes.vendor.txt e.g. S3142notes.pdcc.txt. Fill in the file name and file size information using the information from the command prompt window.
In TextPad, find out how many $ and $ groups there are by searching for the following regular expressions (with all the same boxes checked as for the page break search above):
\$*[^ ]*\$ for total number of dollar groups
\$+word\$+ for number of $word$
\$+words\$+ for number of $words$
\$+line\$+ for number of $line$
\$+span\$+ for number of $span$
\$+page\$+ for number of $page$
\$+para\$+ for number of $para$
$ for total number of $ (for this, untick the “regular expression” and “all matching lines” boxes).
Fill in this information in the dollar count boxes in the notes file.
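The named-group searches and the raw $ total can also be run in one pass with a script. The sketch below reuses the patterns listed above; the function and dictionary key names are invented for illustration. Note that it counts occurrences, whereas TextPad with “all matching lines” ticked lists matching lines, so the two figures may differ where a line contains more than one group.

```python
import re

# Group names taken from the search patterns above.
NAMED_GROUPS = ("word", "words", "line", "span", "page", "para")

def count_dollars(text):
    """Count each named $ group and the total number of $ characters."""
    counts = {
        name: len(re.findall(r"\$+" + name + r"\$+", text))
        for name in NAMED_GROUPS
    }
    counts["total_dollar_signs"] = text.count("$")
    return counts

# Made-up sample text.
print(count_dollars("a $word$ b $$line$$ c St$ll$n $words$"))
```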
The notes file is very important both for statistics and for the vendors, who are sent the error counts and any comments at the end of each month. This feedback alerts them (and us) to any persistent problems.
Step Five
Print out the html proofing sample which was generated in your working directory by the “makesample” batch file. Open the pdf file which you downloaded from ProQuest and start proofing the printed out sample against it. These two screen shots show you examples of a proofing sample and the text image.
As you proof the printed sample against the page image on screen, mark any errors as you go, being careful to distinguish:
- mistranscriptions that should have been avoided (using the online documents that supply the principles and examples of “inexcusable” errors).
- mistranscriptions that could not reasonably have been avoided (same principles).
- completely legible letters or words which are illegitimately flagged as illegible. Adopt a generous and forgiving spirit in compiling this count.
- letters (or words) that are more or less completely missing in the original, but which the vendor supplied anyway (we never count these towards the verdict on the text).
- spacing errors (we never count these towards the verdict on the text).
When a book is very bad, you may be able to stop proofing early. Very small books (5-10 pages) will normally be pardoned regardless of how many errors they have, since we will have effectively proofed the entire book anyway.
Step Six
After you have finished proofing, fill in the notes giving your verdict (accepted, pardoned or rejected) and the date in the form yyyy-mm-dd.
If you reject the text, move the sgm file into the rejected folder and the notes file into the notes folder, delete the other files from your directory and start again. Otherwise, move on to step seven.
Step Seven
If you have accepted or pardoned the file, the next step is to open it in XMetal and put a new header in. There is a different header for each vendor, called vendortempheader.txt (e.g. pdcctempheader.txt) and each is in the docs folder. This header calls up a different, slightly stricter DTD than that used by the vendors. It also contains a checklist of tasks to be carried out on each file. Copy the header and paste it in. You will also need to add </ETS> at the end of the file (XMetal does this automatically when you switch to tags on view).
Step Eight
Check the structure and add div types. Div types are the names we assign to the various text divisions. If the text division does not have a heading of its own, it is the div type that is used in the automatically generated table of contents on the EEBO searchable texts website. Usually the div types are things like “title page”, “dedication”, “colophon”, and so on, but occasionally they can be harder to name, especially if the text is not in English. There are guidelines in the online documentation. Lack of div types should be the main reason a file fails to validate. Pursue invalid bits until the file validates.
Step Nine
Proof title page. This is partly to make sure there are no mistranscriptions on the very first page of the text. However, it is mostly to remove unnecessary tagging. For example, on the page below we would not tag the change of font, which is purely for decoration. We would also move all the text apart from the printing information into one paragraph. It is also acceptable to be more adventurous in correcting illegible text on the title page. It is sometimes useful to judiciously use the citation to fill in any missing information.
Step Ten
Check for ^ and other areas where mistakes are commonly made. ^ is used to mark superscript, while ^^ marks subscript. The marker should precede each superscripted or subscripted letter, not just the initial one. For example, 12th May should be captured 12^t^h May and not 12^th May. In the text below, the example is y^u, a common abbreviation for you, also shown as it appears on the page image.
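One way to find likely caret mistakes is to look for a ^ followed by two letters in a row, which suggests that only the first letter of a longer superscript was marked. The sketch below is a rough heuristic of our own devising, not part of the project's tools: it will raise false alarms where a single superscript letter is followed directly by ordinary text, so each hit needs checking against the page image.

```python
import re

# A caret, a letter, then another letter with no caret of its own (e.g. 12^th).
SUSPECT_CARET = re.compile(r"\^[A-Za-z][A-Za-z]")

def suspect_carets(text):
    """Return (line number, line) pairs that contain a suspicious caret."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if SUSPECT_CARET.search(line):
            hits.append((number, line))
    return hits

# Made-up sample lines based on the examples above.
for number, line in suspect_carets("12^t^h May\n12^th May\ny^u"):
    print(number, line)
```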
Other characters to check for now are @ and # which are used by the vendors when they don’t recognise a symbol. Use the online guide to help you to identify the symbol required. (@ should no longer be used by the vendors, but we still check for it just in case).
Step Eleven
Use the checklist (Appendix One) to help you to check through other problem areas. These include adding extents for <GAP> tags, checking that <EPIGRAPH> and <ARGUMENT> have been used correctly, and checking for I/J problems (the vendors have a tendency to capture italic J as I, which we are attempting to correct).
Step Twelve
Make corrections from the proofing sample. Any mistranscriptions you have come across, or $ that you have been able to read, should be corrected. Use the proofsheets as indicators of possible other “global” problems: if a U is captured as V on the proofsheet, it is worth checking others in the file; if numbered Ps are captured as ITEMs on a proofsheet, it is worth checking ITEMs throughout. If a note is seriously misplaced (or notes are inconsistently placed) on the proofsheet, the same is probably true throughout.
Step Thirteen
If there are fewer than 100 occurrences of $ then you should try to correct all the dollar groups. This involves searching for each $, looking up the appropriate page image and trying to distinguish the letters. In the example (St$ll$n waters) below it is impossible to tell from the form of the characters on the printed page whether they are o, e, or c.
When you are finished, use TextPad to check how many $ are left, so that you are able to say how many you have corrected.
Step Fourteen
The $ groups which you haven’t been able to correct are not left as $. Instead they are converted to a <GAP DESC="illegible" EXTENT="xx" RESP="xxx"> tag, where the extent is “page”, “para”, “span”, “line”, “word”, or “3” for three characters, “1” for one character, etc.
If the file has more than 100 dollar groups and you haven’t tried to correct them, then they are all replaced with the <GAP> tag, with the RESP (responsibility) attribute being the name of the vendor concerned. If you have checked the gap yourself and you know it is illegible, you could make the RESP attribute RESP="OXF" instead.
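The conversion from $ groups to <GAP> tags could be sketched as below. Several points are assumptions: the function name and its resp parameter are hypothetical, a bare run of n dollar signs is treated as n illegible characters, and “words” is carried over as an extent from the $words$ search pattern even though the extent list above does not mention it.

```python
import re

# Named groups such as $word$ or $$line$$.
# Assumption: "words" is a valid extent, by analogy with the $words$ search.
NAMED_GROUP = re.compile(r"\$+(words|word|line|span|page|para)\$+")
# Any remaining run of dollar signs.
DOLLAR_RUN = re.compile(r"\$+")

def gap_tag(extent, resp):
    return f'<GAP DESC="illegible" EXTENT="{extent}" RESP="{resp}">'

def dollars_to_gaps(text, resp):
    """Replace remaining $ groups with GAP tags attributed to resp."""
    # Named groups first, so their delimiters are not mistaken for bare runs.
    text = NAMED_GROUP.sub(lambda m: gap_tag(m.group(1), resp), text)
    # Assumption: a run of n dollar signs stands for n illegible characters.
    return DOLLAR_RUN.sub(lambda m: gap_tag(len(m.group(0)), resp), text)

print(dollars_to_gaps("St$ll$n waters", "OXF"))
```

Use the vendor's name as the RESP value for gaps you have not checked yourself, and OXF only for gaps you have confirmed against the page image.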
Step Fifteen
Fill in header and complete notes. The header contains all the routine jobs you should do to every file, while the notes file contains the $ group count and any comments about the text which you want to be seen by the vendors or the team in Michigan.
Step Sixteen
Validate your file and fill in the metadata information. The metadata is simply kept in the Excel spreadsheet described in the Background section. It includes your name, verdict, date, number of errors ($ stands for perfectly legible characters that were not captured by the vendors) and total number of $ groups before you started correcting.
Validation involves running another Perl script at the command prompt. Simply type v followed by the file name, e.g. v S3452.pdcc.sgm
Step Seventeen
Move the file to the “done” or “rejected” folder and the notes to the “notes” folder, where they will be picked up by Michigan. If a file is rejected after step six, it is moved to the “rejected” folder, and eventually is resubmitted by the vendor. We bundle the done files into a zip folder at the end of the month to make the transfer to Michigan easier.
Step Eighteen
Delete all the other related files from the working directory (samples, .pdf, backup copies etc.)
Pick a new text and start again!
Appendix One
Added TYPEs to DIVs in order to validate.
Proofed title page(s).
Reviewed structure, including head and feet of divisions.
Checked ^s.
Checked placement and completeness of PBs.
Added extents for GAPs. Checked #s, @s, spacing of foreign GAPs
Checked for EPIGRAPHs and ARGUMENTs.
Checked for startqs, endqs, oes, q;s, and Qs.
Checked for OPENERs, SALUTEs, SIGNEDs and LETTERs
Checked for LISTs and TABLEs
Checked for STAGE
Checked for CLOSERs, TRAILERs, and BYLINEs.
Checked for I/J problems
Checked proofsheets and made corrections found there.
Corrected $s of
Converted $s to GAP DESC="illegible" RESP="**".
Checked for $s in plain text view
Filled in header (including name and date)!
Completed notes
Validated
Moved files
Filled in metadata
DONE