Saturday, September 29, 2012

Reading #4


Ochoa, X. & Duval, E. (2009). Automatic evaluation of metadata quality in digital repositories. International Journal on Digital Libraries, 10(2-3), 67-91.

The frequency of manual quality control of metadata is rapidly declining due to recent developments in automatic metadata creation and interoperability between digital repositories.  The result, however, is the occasional absence of any form of quality control, which can negatively affect the services provided to repository users.  The authors present various quality metrics for metadata and run three experiments to evaluate them.  The first compared the quality metrics with quality assessments made by human reviewers.  The second evaluated the metrics' ability to discriminate key properties of the metadata sets.  The third tested a practical application in which the metrics served as an automatic low-quality filter.  One metric, Textual Information Content, appeared to be a good approximation of the human reviews and was even able to evaluate quality characteristics that human reviewers cannot.  The development of quality metrics will allow metadata researchers to monitor the quality of a repository, as well as track its growth and the events that can affect it, without the need for costly human involvement. 

Reading #3



Caplan, P. (2003).  The TEI Header.  In Metadata fundamentals for all librarians (pp. 66-75).  Chicago: ALA Editions. 

The importance of the TEI Header is evaluated by reviewing the history of the Text Encoding Initiative, examining the comprehensive nature of the TEI Guidelines, assessing the goal of the TEI Header as the basis for library cataloging and contemplating the header’s intended flexibility.  An examination of the four sections of the TEI Header reveals the descriptive, subject, non-bibliographic and administrative strengths of the header.  The conclusion is that the header is a widely used metadata scheme that can be easily adapted for use with a wide range of XML documents.  The usefulness of the header in documenting non-bibliographic aspects ensures the TEI Header’s dominance as a standard for describing electronic texts. 

Friday, September 28, 2012

Week 4-6: Authoring Schematron

Earlier in the week, I wrapped up the process of transferring the data from the 'fluffy' version to the structured version.  After confirming that we wanted to create two separate Schematrons, one for encoders and one for editors, I began indicating in blue on the spreadsheet those fields intended for the editor Schematron.   I now have 25 asserts intended for the editor Schematron and 36 intended for the encoder Schematron. Many of the editor asserts are simple checks, such as tei:author or tei:date.  These are just for the editor to make sure that the encoder has not accidentally erased important information. 

Then, on Wednesday I started the process of authoring the Schematron for the TEI Header.  I began writing in Oxygen as an XML file, but saved the document with a .sch extension, for Schematron.  I wrote the encoder Schematron first, since in my mind it would be the first step in the workflow of checking the VWWP TEI-encoded documents.  I started by declaring the namespaces: the Schematron namespace on the root element (<schema xmlns="http://purl.oclc.org/dsdl/schematron">) and the TEI namespace, bound to a prefix so the rules can address TEI elements (<ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>).  Next, I included the title and a one-line description of the intention (This schema will test the legacy and new TEI-encoded texts of the Victorian Women Writers Project).
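The opening of the file described above might be sketched as follows; the title wording is illustrative, and the comment marks where the patterns go:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <title>VWWP TEI Header Checks (encoder version)</title>
  <!-- Bind the tei: prefix so rule contexts can address TEI elements -->
  <ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <p>This schema will test the legacy and new TEI-encoded texts of the
     Victorian Women Writers Project.</p>
  <!-- patterns and rules go here -->
</schema>
```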

Then, I began authoring the Schematron itself.  Taking what I had learned from all of the readings, I began by declaring a pattern.  A pattern is a set of related rules that are used to test the XML document.  It is important to assign an identifier to each pattern, so I named each pattern identifier based on its function.  For example, if a pattern contains a rule that checks for the template value of the responsibility statement, I would give it an abbreviated name as that pattern's identifier (respStmttemplate-valuecheck).  Next, I included a short paragraph that explained the purpose of the assert.  I did this partly so I could later remember the purpose of each assert, especially if it wasn't written out completely, and partly so that other people could understand my intentions after I have finished the project.

Next, I wrote the rule.  In Schematron, a rule contains one or more tests (called asserts or reports) that apply in a given context.  The context is very important here: it is an XPath expression that determines at which element the test or series of tests should be performed.  Hands On Schematron by Mulberry Technologies explains, "For every element in the document described as the context of a rule, the rule's tests will be made with that element as context."

Once I had written the part of the Schematron that declared the context, I needed to write the tests, which can be either asserts or reports.  Asserts are useful when you want to know that something is not true.  For example: I want there to be an author element in this context.  If it is true, fine.  If it is not true, let me know.  Reports can be used to locate elements of interest, or to check for the existence of things of interest.  If you are looking for '2010' as the year of encoding and write the check properly, you will get a message whenever that year does appear as a year of encoding in your document.  
This part took me a little while to figure out, but Hands On Schematron helped clarify the differences.  They write, "report means 'ho hum, show me where this is true' and assert means 'it better be true, or else'".  While I kind of disagree with the "ho hum", it did give me a better idea of the positive versus negative aspects of the two kinds of tests.   Another way to think about it is that reports are more like warnings for the user, whereas asserts are more like errors. 
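Putting the pieces described above together, a pattern with one rule, one assert, and one report could look like this; the exact contexts and messages are my own illustrative guesses, not the project's actual rules:

```xml
<pattern id="respStmttemplate-valuecheck">
  <!-- Context: every responsibility statement in the TEI header -->
  <rule context="tei:teiHeader//tei:respStmt">
    <!-- Assert: raises an error when the test is NOT true -->
    <assert test="tei:name">The responsibility statement must contain a
      name element identifying the encoder.</assert>
    <!-- Report: raises a warning when the test IS true -->
    <report test="tei:resp = ''">This responsibility statement has an
      empty resp element; check whether it was erased by mistake.</report>
  </rule>
</pattern>
```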

Once I completed the encoder version, I moved on to the editor version.  This process was essentially the same, except the Schematron was shorter because there were fewer checks to make.  For both versions, I began keeping a list of questions I had for Michelle and would either email them to her or make sure to ask her during our meetings.  Also, I realized during this process that all of the work with the spreadsheets really helped later on.  The structured asserts spreadsheet helped especially in creating the logical structure in my mind.

Sunday, September 9, 2012

Reading #2



Nellhaus, T. (2001).  XML, TEI, and Digital Libraries in the Humanities.  Libraries and the Academy, 1(3), 257-277.

The implications of the “new” language XML and its accompanying encoding structure TEI for academic digital humanities libraries are contemplated.  The history of XML and TEI is discussed, including their origins in SGML.  Next, the differences between SGML and XML are illuminated through an examination of the basic concepts of XML, such as DTDs, author-defined tag sets, XSL and expanded linking.  The history and structure of TEI is described, including the significance of the independence of the TEI Header from the rest of the TEI document as a way to increase interoperability, for example, with the MARC record.  Due to the optional specialized tag sets offered by TEI, standard TEI markup makes it easier to describe a wide variety of documents more richly.  The conclusion is that XML and TEI have great potential to help build digital libraries in the humanities. 

Friday, September 7, 2012

Week 3: Building Schematron- The Foundation 2.0

This week I began transferring all of my work from the 'fluffy' version of the conceptual mapping to the structured version. I found this to be pretty straightforward, now that I knew all of the elements and attributes so well, and what exactly I would be testing for each one.

I liked this part because, in addition to the context (XPath), I began writing the tests.  This is where I felt like I began to really create and control the kinds of conditions I would like Schematron to check for.  It was also at this point that I realized just how varied the kinds of things the Schematron will check for are. 

There are tests that check that an encoder has replaced a template value with an actual value. Example: tei:name[@xml:id='encoderusername'].  This tests whether the encoder has replaced the template value of xml:id with his or her actual username.  There are other template values that the encoder could miss, and Schematron needs to check for them.  Example: tei:title='$Title of introduction'.  This checks that the encoder has replaced '$Title of introduction' with the actual title of the introduction.  These are all very important checks that other schemas cannot perform. 
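A template-value check of this kind might be written as an assert that fails while the placeholder is still present; the surrounding context path is my assumption about where the element sits in the header:

```xml
<pattern id="encoderusername-template-valuecheck">
  <rule context="tei:teiHeader//tei:respStmt/tei:name">
    <!-- Fail while the xml:id still holds the template placeholder -->
    <assert test="not(@xml:id = 'encoderusername')">Replace the template
      value 'encoderusername' with your actual username.</assert>
  </rule>
</pattern>
```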

The Schematron will also check for patterns and consistencies.  For example, the publication statement (<publicationStmt>) includes important information about the encoding, such as which institution completed the encoding, the year of the encoding and a short paragraph about copyright.  Part of the publication statement is the element <idno>, an identifier used to identify an object, in this case the particular XML document being encoded.  The <idno> is included in the TEI root and must match the <idno> in the publication statement.  So, Schematron needs to alert the encoder or editor if the two values do not match.  Example: tei:idno='tei:TEI[@xml:id]'.  This is used to test that the idno value matches the TEI root's xml:id value. 
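A sketch of this consistency check as a Schematron assert; I am assuming the identifier lives on the TEI root as an xml:id attribute, so the rule compares the <idno> text against it:

```xml
<pattern id="idno-rootid-matchcheck">
  <rule context="tei:publicationStmt/tei:idno">
    <!-- The idno value must equal the xml:id on the TEI root element -->
    <assert test=". = /tei:TEI/@xml:id">The idno in the publication
      statement does not match the xml:id of the TEI root.</assert>
  </rule>
</pattern>
```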

During our weekly meeting on Wednesday, September 5th, Michelle and I discussed the possibility of writing two Schematrons: one for the encoder and one for the editor.  There are several checks that only need to happen on the editor's side.  One very obvious editor check will be making sure the editor has entered his or her name and assigned that element's xml:id as his or her username.  Example: tei:name='$Editor's First and Last Name' and tei:name[@xml:id='editorusername'].  There is no reason for this check to be performed while the encoder is still working, so it makes sense to create another Schematron that only the editor will need to use.  In this meeting Michelle and I also discussed the possibility that the pseudonym check and prosopography check should also be part of the editor-only version of Schematron.  We'll both think about it over the next week and go from there.
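The editor-only check described above could be sketched as a pair of asserts in one rule; the context path and messages are illustrative, and I am assuming the template name value begins with the '$' placeholder character as in the other templates:

```xml
<pattern id="editorname-template-valuecheck">
  <rule context="tei:teiHeader//tei:respStmt/tei:name[@xml:id]">
    <!-- Editor-only: the name text must no longer be a '$' placeholder -->
    <assert test="not(starts-with(., '$'))">Replace the template value
      with the editor's first and last name.</assert>
    <!-- ...and the xml:id must hold the editor's actual username -->
    <assert test="not(@xml:id = 'editorusername')">Replace the template
      xml:id 'editorusername' with the editor's actual username.</assert>
  </rule>
</pattern>
```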

This week's reading was an older article from 1998, called 'XML and the Future of Digital Libraries'.  It was fascinating reading this article because it was filled with wonder, excitement and also apprehension about this new language.  Yes, at one point I had no idea what XML was, and learning the very basics was exciting, but scary.  But I always accepted it, because by the time I started learning it, it was already so established.  I enjoyed this article and think it was useful for putting the metalanguage into context.