My Experience at the Digital Library Program: August 2012

Friday, August 31, 2012

Reading #1

Sperberg-McQueen, C. M. (1998). XML and the future of digital libraries. Journal of Academic Librarianship, 24(4), 314-317.

The article describes the rise of XML in digital libraries at the end of the 20th century by answering questions about the markup language's impact. It asks what the presence of XML means for users of the Web and for information professionals. A comparison with HTML shows that XML can be thought of as a subset of SGML, but that it is different enough from HTML to require some change in thinking from those who use HTML. Some key specifications, such as the Extensible Style Language (XSL), are introduced to demonstrate how XML documents will be displayed in a browser. Some frequently asked questions are addressed to correct misinformation about XML. The conclusion is that organizations that currently use only HTML for encoding and storing their documents will likely face difficulties in switching to XML, but most organizations that use TEI or another SGML DTD will need to make very few changes to their current procedures.

Thursday, August 30, 2012

Week 2: Building Schematron--The Foundation

So, this week was exciting. After spending a few good hours looking at the VWWP guidelines and some completed and in-progress documents on Xubmit, I began the conceptual mapping for Schematron. I started identifying elements and attributes that would need Schematron validation, as well as values that need encoder input, while watching out for any xml:id attributes.

Michelle had created two Excel spreadsheets for me: a 'fluffy' version and a Structured Assertions or Reports version. The 'fluffy' version is for recording the elements and attributes that are to be checked, the XPath needed to identify the appropriate node, and a description of what I will be checking (basically a rough version of the message that the user will see if there is any problem with the validity of their XML document). The Structured Assertions or Reports version includes the context (the XPath), the test (either an assert or a report), and the assertion or message.
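To make the three columns of the structured spreadsheet concrete, here is a minimal Python sketch of how one row could be evaluated. It is not Schematron itself and not the project's actual rules: the `Row` class, the `run_rows` function, the sample document, and both example rows are hypothetical, and the sample leaves out the TEI namespace for brevity. The one thing it borrows from Schematron is the difference between the two kinds of test: an assert produces its message when the test is false, and a report produces its message when the test is true.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Row:
    """One hypothetical spreadsheet row: context, kind of test, test, message."""
    context: str                          # path selecting the nodes to check
    kind: str                             # "assert" or "report"
    test: Callable[[ET.Element], bool]    # the condition evaluated on each node
    message: str                          # what the encoder would see


def run_rows(root: ET.Element, rows: List[Row]) -> List[str]:
    """Return the messages triggered by the rows for this document."""
    messages = []
    for row in rows:
        for node in root.findall(row.context):
            result = row.test(node)
            # assert: complain when the test fails.
            # report: complain when the test succeeds.
            if (row.kind == "assert" and not result) or (
                row.kind == "report" and result
            ):
                messages.append(row.message)
    return messages


# Hypothetical sample document (no namespace, invented content).
SAMPLE = """
<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title></title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>Chapter one.</p></body></text>
</TEI>
"""

rows = [
    Row(".//titleStmt/title", "assert",
        lambda n: bool((n.text or "").strip()),
        "The title must not be empty."),
    Row(".//body/p", "report",
        lambda n: "TODO" in (n.text or ""),
        "A paragraph still contains a TODO marker."),
]

if __name__ == "__main__":
    for message in run_rows(ET.fromstring(SAMPLE), rows):
        print(message)
```

In this sample the title is empty, so the assert row produces its message, while the report row stays silent because no paragraph contains the marker.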

I began with the 'fluffy' version since I thought it would be a good way to understand the architecture of the XML documents. I wanted to make sure I had mapped out all of the elements and attributes present in the encoded documents, along with their hierarchies and relationships. This took up a lot of my time, but I think it was necessary for me to really get a grasp of what I would be working with. When I met with Michelle on Wednesday, August 29th, she gave me a better understanding of what I really needed to check for, and it put me back on track. With that knowledge, I was able to quickly discern which elements, attributes, and values needed to be checked. Several values are imported from the MARC record. For example, in the title statement (<titleStmt>), the value of the author element is pulled directly from the MARC 100 entry. Therefore, it does not need to be validated with Schematron.

During this time we also realized that I would have to pay attention to elements that were affected by the change from using <biblFull> to <biblStruct>. While this is not a great problem, I have to remember to account for the <biblFull> element and its ancestors when I get to the stage of actually authoring Schematron.

I finished up my week by writing down several questions for Michelle to review. I have to say I am enjoying this process. I get to work with TEI (although more indirectly now) and I'm becoming more confident with XPath. And the quality control aspect is challenging, in a good way. It really demands that I make many logical decisions and consider all possibilities of the encoding process. I have a bit more of the 'fluffy' version to work on, and then I think I can move on to the more detailed spreadsheet.

Thursday, August 23, 2012

Week 1: What is Schematron?

So, this week I began my internship at the Digital Library Program here at Indiana University. After I met with my supervisor, Michelle Dalmau, it was decided that I would undertake the Schematron project to improve the Victorian Women Writers Project text-encoding workflow. Schematron is a second level of validation that is used to check the quality of XML data. Mainly, it checks for the presence or absence of patterns in the XML document and double-checks the validity of encoder-entered data. One thing I will have to keep reminding myself is that Schematron does not check what the schema already checks.

I had my first orientation meeting with Michelle and then I was put to work learning about Schematron. She gave me a binder called Hands On Schematron and some other readings that would help acquaint me with this validation language. Everything was a bit overwhelming at first. I feel like I am so new to this world of what I used to call "techy things" that I'm still in a bit of disbelief that I'm even doing these things. However, this is important to me. This feels right, so I knew I just had to dive right in.

A few things immediately stood out to me about Schematron that began to answer the question "what is Schematron?" It is an XML vocabulary, so a Schematron file is itself an XML document. The assertions (assert and report elements) are XML elements. But unlike a DTD, it cannot be used to describe the structure of a document, and it does not manipulate data in any way. However, the beauty of Schematron is that it can express constraints that other XML-based languages cannot. For example, a validator such as the W3C validator can assert that the list element must contain an item element when listing LCSH or MLA keywords, but it cannot assert that the encoder must enter at least one subject heading or term in their document. Schematron can check for this.
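The subject-heading example is easier to see with two small documents side by side. The Python sketch below is only an illustration of the kind of content check described above, not Schematron and not the VWWP rules: the `has_subject_term` function, the element nesting (keywords, list, item), the scheme value, and the sample heading are all assumptions made for the example. Both documents have the required item element, so a structural check would treat them the same, but only one of them actually contains a term.

```python
import xml.etree.ElementTree as ET


def has_subject_term(root: ET.Element) -> bool:
    """True if at least one keywords/list/item holds non-whitespace text."""
    items = root.findall(".//keywords/list/item")
    return any((item.text or "").strip() for item in items)


# Hypothetical fragments: same structure, different content.
EMPTY = (
    "<textClass><keywords scheme='#lcsh'>"
    "<list><item/></list>"
    "</keywords></textClass>"
)
FILLED = (
    "<textClass><keywords scheme='#lcsh'>"
    "<list><item>Women authors</item></list>"
    "</keywords></textClass>"
)

if __name__ == "__main__":
    for name, doc in (("EMPTY", EMPTY), ("FILLED", FILLED)):
        ok = has_subject_term(ET.fromstring(doc))
        print(name, "has a subject term" if ok else "needs a subject term")
```

In Schematron the same idea would be written as an assert whose message appears when no item has any text; the Python function simply plays the role of that test.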

So, this week was basically just me sitting at my desk reading over these manuals and papers. It's hard for me to picture what work I'll be doing next, because I really need to look at the VWWP encoding guidelines and some completed XML documents in order to know where to start. I'm looking forward to mapping out all of the elements, attributes, values and patterns that need to be checked.