Friday, October 26, 2012
Reading #7
Lim, S. & Liew, C. (2011). Metadata quality and interoperability of GLAM digital images. Aslib Proceedings, 63(5), 484-498. doi:10.1108/00012531111164978
Thursday, October 25, 2012
Week 10: A Very 'Duh' Moment
It's funny. Most of the time, the message 'Validation failed' with the little red icon is such a letdown. Now, it's a pleasant experience. Seeing that message means that I'm doing something correct! It means that my XPath expressions are correct and that my assertions and reports are well-written. After roughly two and a half weeks of stalling due to my silly mistake (I feel really bad for wasting everyone's time), the problem with Schematron has finally been worked out. It turns out that my XPath expressions were not properly written: I had been including, in the rule's context XPath, the child node that I wanted to test for, when the context XPath has to end at the parent node, with the child tested in the assert or report itself. Once I made all of the corrections, things started moving along again.
I'm still playing around with some assertions, such as the one that will check for date format. I tried comparing string lengths, but I kept receiving an error saying that a date and a string length cannot be compared. I tried matches(), but apparently matches() needs two arguments. I have solved some other things, though, such as how to check that an @href value contains 'http://purl.dlib.indiana.edu/' (a sketch of these checks is below). I'm relieved that this is back up and running.
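For my own reference, here is a minimal sketch of the kind of rule I mean. The element names (tei:publicationStmt, tei:date, tei:idno) are placeholders rather than the actual VWWP markup, and it assumes the schema uses the XSLT 2 query binding so that matches() is available. The point is that the rule context ends at the parent node, the children are tested in the asserts, matches() takes the value and a regular expression as its two arguments, and contains() handles the @href check:

    <pattern xmlns="http://purl.oclc.org/dsdl/schematron">
      <rule context="tei:publicationStmt">
        <!-- matches() needs two arguments: the value and a regex -->
        <assert test="matches(tei:date, '^\d{4}(-\d{2}(-\d{2})?)?$')">
          The date should be YYYY, YYYY-MM, or YYYY-MM-DD.
        </assert>
        <!-- check that the identifier points at an Indiana PURL -->
        <assert test="contains(tei:idno/@href, 'http://purl.dlib.indiana.edu/')">
          The @href value should contain 'http://purl.dlib.indiana.edu/'.
        </assert>
      </rule>
    </pattern>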
I have also started some preliminary work on the Image Collections Online project. I've reviewed the elements and attributes and begun to see redundancies. I've read the Wiki pages for the project and am getting a good idea of the project's workflow. Now that Schematron is working again, I'm not sure if I'll still be working on the ICO project. There is a meeting tomorrow that I am going to attend, and we will all discuss whether there is a need for an intern and, if so, what more specifically I can do. While some of the work that I would do has already been completed, it was done by several different people over a long period of time, and some of the end products may no longer be relevant or accurate.
Hopefully, I will have the Header section of the VWWP Schematron complete by the end of today and then finish the Body section tomorrow.
I also plan on attending the Digital Brown Bag next Wednesday about the Wikipedia GLAM project. I had never heard of it, so I decided to do some reading on it and hopefully will learn something interesting.
Thursday, October 18, 2012
Week 9: A Change of Plans
So, it seems like things have really changed. After the weeks-long stall due to the technical problems running Schematron, I spoke with Michelle about where to go from here. Until she is able to get in touch with people from Mulberry Technologies and other Schematron list groups, there is very little I can do with Schematron besides some editing here and there.
During our meeting, she suggested that I take a break from Schematron and start looking at another potential project. Based on my desire to expose myself to additional metadata standards (other than Dublin Core and RDF), she mentioned a project working with Image Collections Online. This project would further acquaint me with metadata and metadata mapping. DLP first began working with other collections to help them with their cataloging and metadata needs. Originally, DLP gave these special collections (e.g., the Liberian Photograph Collections) a core set of fields. Slowly, these collections started asking for more specialized fields to account for their diverse cataloging needs. While this may have started off well, and DLP was glad to be flexible, the process of customizing fields soon became too overwhelming and chaotic. In addition, even though a collection may have a large number of specialized fields, not all of them will show up in the interface anyway. So now DLP wants to start moving back toward a core set of fields.
What Michelle thinks I can do is play the part of a metadata analyst of sorts. I would first establish what the core fields are, then analyze the divergent fields and see if any of them are actually more similar than previously thought. We could then merge those fields to allow the collections to express what they feel is necessary, but without overwhelming the system. Then, I would figure out what metadata can be mapped to MODS (a sketch of what such a mapping might look like is below).
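To give a sense of the kind of mapping involved (the local field names and values here are made up for illustration, not taken from any actual ICO collection), a collection's 'Title' and 'Date taken' fields might map into MODS like this:

    <mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
      <!-- hypothetical local field "Title" maps to titleInfo/title -->
      <mods:titleInfo>
        <mods:title>Monrovia street scene</mods:title>
      </mods:titleInfo>
      <!-- hypothetical local field "Date taken" maps to originInfo/dateCreated -->
      <mods:originInfo>
        <mods:dateCreated encoding="w3cdtf">1968</mods:dateCreated>
      </mods:originInfo>
    </mods:mods>

The analysis work would come first, of course; the mapping only makes sense once the core fields are settled.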
One nice thing about this project is that I would be working with the Metadata Working Group associated with ICO. This would allow me to have the collaboration I have been missing. As much as I feel dismayed about leaving Schematron behind for now, I think this is a good direction to take things.
Friday, October 12, 2012
Reading #6
Dalmau, M. & Schlosser, M. (2010). Challenges of serials text encoding in the spirit of scholarly communication. Library Hi Tech, 28(3), 345-359.
In 2006, the Digital Library Program at Indiana University received a grant from the state of Indiana to digitize and encode the nearly 100-year run of the Indiana Magazine of History. The project intended to provide full-text and facsimile views, improve metadata for better search and retrieval, and develop a publishing model for the journal. The digitization and encoding were conducted by a combination of in-house and outsourced personnel, in coordination with several quality control guidelines. TEI was chosen for its strength in encoding texts that are literary in nature. TEI's independent header was also seen as a strength for its ability to capture bibliographic metadata. Quality control was handled manually to a small extent, but due to the limited time and budget, most of it was automated. The experience provided a few lessons for the future. It was the opinion of the program that it is best to perform semantically or structurally difficult encoding in-house. In addition, the more manual quality control is performed in advance, the more smoothly the subsequent automated process will run. The paper suggests future emphasis on guidelines and consistent communication with any outside vendors.
Thursday, October 11, 2012
Week 8: Moving On...
So, last week was tough. But after speaking with Michelle, we have decided that I should move on to the conceptual mapping for the TEI body. The process is essentially the same, starting with the spreadsheet, although this time I'm skipping the 'fluffy' version and going straight to the structured version. Michelle had told me that this section would in some ways be easier, but more difficult in others.
I think she's right. There seem to be fewer things that need to be selected and validated, but it is harder to create the logic for those that do exist. For example, there needs to be a way to check that if a note spans pages, the note should be collapsed onto one page for readability. While the easy way would be just to write <report test="tei:note"/>, all that would do is check whether a note exists; it wouldn't actually check whether the note spans pages. So, this is the kind of logic that I have to play around with (a first attempt is sketched below). There is also another encoding rule requiring the encoder to remove all end-of-line hyphens for word splits. However, sometimes a split traverses pages, and the logic for that is going to be tricky to figure out.
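Here is that first attempt at the page-spanning check, assuming (as seems to be the convention) that page breaks are encoded as tei:pb elements, so that a note containing a page break is one that spans pages:

    <pattern xmlns="http://purl.oclc.org/dsdl/schematron">
      <rule context="tei:text//tei:note">
        <!-- a note containing a page break spans pages; flag it -->
        <report test="descendant::tei:pb">
          This note spans a page break and should be collapsed onto one page.
        </report>
      </rule>
    </pattern>

The hyphenation check would presumably need similar logic against the text nodes on either side of each tei:pb, which is the part I still have to work out.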
So far, I have completed the conceptual mapping for the TEI body and have begun writing the Schematron, with 10 asserts so far. So, yes, for now: moving on.
Saturday, October 6, 2012
Reading #5
XPath Tutorial. (n.d.). Retrieved October 3, 2012, from W3Schools: http://www.w3schools.com/xpath/default.asp
This W3Schools tutorial was helpful in providing a concise yet comprehensive review of XPath. Given my difficulties with Schematron this past week, I did a lot of reading to try to figure out the problem. The tutorial begins simply, by first describing what XPath is. A Venn diagram demonstrates the relationship that XPath has with other XML technologies, such as XQuery, XPointer, XLink and XSLT. A bulleted list also aids understanding; for example, "XPath uses path expressions to navigate in XML documents." The reading then points out that these path expressions are used to select nodes or node-sets in an XML document, and that they look and behave much like file system paths (a few examples follow below).
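A few of the path expressions from the tutorial's bookstore example document, to show what I mean:

    /bookstore/book     selects all book elements that are children of bookstore
    //book              selects all book elements anywhere in the document
    bookstore//book     selects all book descendants of bookstore
    //@lang             selects all attributes named lang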
The reading then moves on to discuss the relationships between nodes. Parent, child, sibling, ancestor, and descendant nodes are explained. Even though I have a pretty good grasp of node relationships, I really appreciated that examples were included in the reading. I especially liked that the examples were consistent: if I were new to this concept, it would have really helped me get used to the XML example document, instead of it changing just as soon as I started understanding the idea.
I was hoping that the next section, on selecting nodes, would help me figure out what was wrong with my Schematron. Because no errors were being found in the XML documents I validated against the Schematron, I have a feeling that the Schematron is not latching on to the right places in the document. However, I still don't really understand why the direct XPath works but the validation run through Schematron does not. I reviewed the section on selecting nodes, but I didn't find anything that helped. I feel like I am back at square one now. At the very least, I had a good review of XPath.
Friday, October 5, 2012
Week 7: Bump in the Road
This week began well but quickly became frustrating. I finished much of the Schematron for the TEI Header and was ready to test what I had done against a couple of VWWP legacy and new texts. I was nervous, but hoping for the best. I felt like I had taken so much time with the two different kinds of conceptual mappings, and then the writing process had taken a while, so I wanted to be successful on the first try.
I pulled up one legacy and one new text in Oxygen, along with the editor version of the Schematron. I decided to try the new text first. I selected the correct version of the Schematron against which to validate the TEI document. Immediately, I got an error stating that there was a problem with the TEI namespace. I spent about four hours searching online, Googling as much as I could to figure out what could possibly be wrong with the way I had declared the namespace. I came across a post on the TEI boards where some other people were having issues with the TEI namespace. Someone had posted another version of the declaration, so I tried it. Essentially, it seemed like the namespace had to be declared twice. It looked strange to me, but Oxygen no longer complained about it. (I ended up changing the namespace yet again a bit later, after Michelle told me that Professor Walsh uses a different version.) I ran the validation again, and much to my surprise, the TEI document validated. I was in shock. I immediately knew that something was wrong, but I couldn't figure it out. I tried Googling and looking in all of the readings that Michelle had given me, but there was nothing I could find that addressed the problem of a Schematron not catching any errors.
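For what it's worth, my understanding is that in ISO Schematron the TEI namespace is normally bound just once, with a single ns element at the top of the schema, like this (a minimal sketch, not the exact declaration from my file):

    <schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
      <!-- bind the tei: prefix used in rule contexts and tests -->
      <ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
    </schema>

So I still don't understand why the doubled declaration was the only thing Oxygen would accept.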
I ended up meeting with Michelle later in the week, and we tried playing with the different options available within Oxygen. I clicked through what seemed an endless number of options, in endless combinations, and nothing worked. I thought that maybe I needed to change the version of XPath I was using, but that did not work either. I also tried typing the XPath directly into Oxygen's XPath search, and that worked perfectly. So, Michelle and I are thinking that my XPath is written fine, and we really don't know what the problem could be. She has promised to speak with Professor Walsh and to write to an acquaintance of hers at Mulberry Technologies. This was definitely a frustrating week, and I feel powerless.