Friday, December 7, 2012

Reading #9



Park, J. R., & Tosaka, Y. (2010).  Metadata quality control in digital repositories and collections: Criteria, semantics, and mechanisms. Cataloging & Classification Quarterly, 48(8), 696-715.

The article reports a systematic assessment of the practices and issues that affect the quality of metadata in digital repositories and collections.  The researchers distributed a web-based survey to approximately 600 participants, mostly heads of cataloging and technical services, via mailing lists relevant to the field; 303 people completed it.  The survey results fall into three categories: the perceived importance of metadata quality control, the criteria used to measure metadata quality, and the quality control mechanisms employed in digital repositories.  Of note, the study found that metadata semantics is perceived to be less important than content standards for quality control, even though 45% and 41% of respondents, respectively, cited semantic overlaps and ambiguities as the two most significant problems that arise in applying Dublin Core to their collections.  The study emphasizes the need for a strong awareness of content-based metadata quality control, in combination with metadata guidelines, to guarantee consistency in resource description within and across digital collections. 

Thursday, December 6, 2012

Weeks 14 & 15: Finishing up

So, these past two weeks I have not done much work because I have basically finished up my hours.  I have been coming in about once a week, and those hours may end up counting towards my internship next semester. 
Last week, the week of 11/26, I attended the Metadata Working Group meeting to discuss the findings of the survey.  We received 13 responses, which was great!  Based on the survey results, we further refined the core set and drew the following conclusions:
          1. There is not much of a unique cataloging emphasis.
          2. The primary goal of the collections/collection managers is end-user discovery.
          3. The majority of collections’ external records share some info with Photocat.
          4. The core set as presented in the survey is not satisfactory.

The following fields are all now (some more confidently than others) in the core:
  • ABSTRACT
  • CAPTION
  • CITY
  • CREATOR
  • COPYRIGHT OWNER
  • COPYRIGHT STATUS      
  • COUNTRY
  • FEATURED
  • MODIFYING USER
  • TITLE
  • TOPICAL SUBJECT
  • US STATE
I have also begun compiling the core set definitions.  I have sent out an email to the members of the metadata subgroup and hope to hear some responses soon.

This week, the week of 12/3, I came in on 12/6 to meet with Michelle to finalize everything and have her sign the evaluation.  After our meeting, I started some MODS mapping for the core fields that the group has decided upon.  So far, I have all of the mappings except for three fields: COPYRIGHT STATUS, FEATURED and MODIFYING USER.  I am really not sure whether it is even possible to map those to MODS, so I will wait until next semester to speak to Julie about it. 
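To give a flavor of what these mappings look like, here is a rough sketch for a few of the core fields.  The MODS targets below are my working notes for illustration, not the group's final decisions, and the bracketed values are placeholders:

  <mods xmlns="http://www.loc.gov/mods/v3">
    <!-- TITLE -->
    <titleInfo><title>[title]</title></titleInfo>
    <!-- ABSTRACT -->
    <abstract>[abstract]</abstract>
    <!-- CREATOR -->
    <name>
      <namePart>[creator]</namePart>
      <role><roleTerm type="text">creator</roleTerm></role>
    </name>
    <!-- TOPICAL SUBJECT -->
    <subject><topic>[topical subject]</topic></subject>
    <!-- COUNTRY / US STATE / CITY, as one hierarchical geographic subject -->
    <subject>
      <hierarchicalGeographic>
        <country>[country]</country>
        <state>[US state]</state>
        <city>[city]</city>
      </hierarchicalGeographic>
    </subject>
  </mods>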

So, I guess I am finished for this semester.  I really appreciate the opportunity I had to work here with the great people at the DLP.  I am really glad that I got a good, gentle introduction to XPath and XSL, and I hope to get further acquainted with them in the future.  I also learned that Schematron is not an easy thing to work on, especially when there are hangups.  It is not ubiquitous like XML or TEI, so it was tough to find a good community to consult when we were having problems.  It was also hard to find relevant literature on Schematron that wasn't just guidelines or documentation.  Hopefully, Schematron will catch on, because it is a really useful tool.  I will not wrap this up entirely, since I will be back in a few weeks to document another 4 1/2 months of my experiences at the Digital Library Program.

Friday, November 16, 2012

Week 13: Survey Revision and Indexing

After the meeting on Thursday, I made some changes to the survey.  Some members of the group brought up really important points.  One was that we need to take into account the difference between item-level cataloging and collection-level cataloging; the survey as previously written would not have captured that.  There was also the question of whether Photocat is primarily a cataloging tool for the collection or an end-user discovery tool.  With this in mind, I went back and revised the survey.  I added a question that will hopefully tell us whether a manager's collection includes information that is not recorded anywhere else.  I also added a Venn diagram accompanied by a question asking whether Photocat shares information with the collection's external records.  Lastly, I added a two-part question asking whether the metadata fields currently used by at least 50% of all collections would satisfy the needs of a core set and, if not, which fields would.  We hope to get this survey out right after Thanksgiving.

Another thing I did this week was think about indexable fields for Photocat.  After reviewing how the collections use the fields and considering what kind of information could easily be searched for, I narrowed the list down to five indexable fields: Country, Date Taken, Creator, Topical Subject and US State.  After discussing this with other members of the group, it seemed that most everyone had come to a similar conclusion.

Lastly, I did a bit of review of Schematron this week in order to keep it fresh in my mind since I will be picking that back up in January.  Now that I know what I'm doing, it should be easier. 

Friday, November 9, 2012

Reading #8



Greenberg, J. (2001). A quantitative categorical analysis of metadata elements in image-applicable metadata schemas. Journal of the American Society for Information Science and Technology, 52(11), 917-924. 

A quantitative analysis of the metadata schemas Dublin Core, VRA Core, REACH and EAD with regard to their usefulness for describing visual images.  The elements comprising each schema were individually studied and grouped according to the four metadata classes established for the study (discovery, use, authentication, and administration), taking care to evaluate the applicability of each element to both print and digital images.  Each element was assigned to at least one class and at most all four; an element that met the qualifications of more than one class was considered multi-functional.  Every schema had elements supporting the functions of each class established by the study.  The results illuminate the need to reconsider metadata schemas and perhaps move away from cataloging-based schemas toward a class-oriented, functionality-based metadata schema for images across multiple domains. 

Thursday, November 8, 2012

Week 12: Semantic Grouping of ICO Field Names

After the last meeting with the Metadata Working Group, each member was asked to write up or put into a spreadsheet their ideas for grouping duplicate fields and to send them to me so that I could write up a summary.  Semantically, several fields are similar or identical, and in order to determine a core set, some of these fields may need to be grouped together.  Alternatively, some of them may need to be de-emphasized in favor of a more universal field name.

One of my tasks this week was to write up a summary comparing two members' suggestions for dealing with duplicate or similar fields.  I identified the following:



Brad suggested using only Accession Number and doing away with Acquisition Date, Donor Name, Donor Notes, Location Code (archives notes this in their accession record), Physical Location (archives notes this in their accession record), Physical Location Shelf/Box/Folder (archives notes this in their accession record), Lily Location (archives notes this in their accession record), Seller and Provenance. 
Ronda did not suggest removing any of these field names, but she did divide and group them.  She combined Provenance, Seller and Donor Name into one category that she named Provenance.  She then tentatively grouped Acquisition Date and Donor notes into another category that she called Internal Technical/Administrative Information.  Location Code, Physical Location, Physical Location Shelf/Box/Folder and Lily Location were then all combined into a Location of Original category, along with Accession Number.  She felt that these were all used to identify the location of the original item or parent, but it was not completely clear from the descriptors how much they overlap.   
Brad then proposed combining six fields into one repeatable field: Alt ID, Call Number, Title Control Number (but this links to IUCAT), Donor ID, External URL (this links to something), and Roll and Frame #.  The main concern with removing these fields is losing semantics when a field links or points to something. 
Ronda addressed these fields as well, but did not suggest considering their removal.  Again, she tried to group them semantically.  She combined Alt ID, Call Number, Donor ID and Roll and Frame #.  She sees this group as various ID numbers assigned specifically to the item (as opposed to the parent or collection unit).  Title Control Number and External URL were combined into a Supplemental Metadata category.  She questioned whether Accession Number could function similarly and therefore belong in that category as well.  She mentioned that the Title Control Number could potentially be used to link out to a collection-level MARC record.  External URL is more generic so it could maybe be used for both, but she points out that it could only work if the external resource could be identified. 
Brad questioned whether Abstract, Caption, Physical Description and Photographer’s Description could all be combined into one free-text field.  For example, “The physical description is albumen print. . .” or “The photographer described this photo, in full, blah. . .” 
Ronda grouped Abstract, Caption and Photographer’s Description into one category called Description.  She thought these could be changed into something more generic, maybe with a dropdown box to indicate the source (cataloger, caption, photographer, person pictured, etc.).  Ronda did not include Physical Description in that grouping, but rather in a category she named Description of Physical Object, along with other fields like Material and Film Type. 
Ronda’s other groupings that do not overlap with Brad’s ideas are on the Metadata Subgroup wiki. 

Thursday, November 1, 2012

Week 11: Field Names, Display Labels and Surveys

This was the first full week I worked on Image Collections Online.  During our last meeting with the Metadata Working Group, I was given the task of comparing field names with the actual display labels that collection managers use when describing their digital resources in ICO.  The collection managers and other people working in the different collections use Photocat to enter metadata about the items in the collection.  I was provided screenshots for all of the collections that use Photocat so that I could see the difference between each field name and its display label.  I then put this information into a spreadsheet, with a column for field type and then, for each collection, a column for its display label accompanied by another column denoting whether the label is viewable to the public.  I actually created two spreadsheets, one for live collections and one for non-live collections.

As I started entering the data, I began to see that not only do collections use some field names differently than the DLP intended, but the field names are also perceived inconsistently across collections.  To flag the collections that use a field name in the same way, I highlighted those rows in green.  For example, all of the collections that use the field name 'Photographer' also use the same display label, 'Photographer'.  But the field name 'City' is not used the same way by all the collections that use it: the display labels differ, with some using 'City' and others 'City/Town/Village'.  It is this kind of information, laid out in a spreadsheet, that may help the members of the Metadata Working Group get an idea of how collection managers utilize the field names for the purposes of their collections.  It could also help in determining a core set.

Another method we are using to help narrow down a core set is the distribution of a survey.  I have begun drafting a short survey, no more than five questions, that will try to determine how people are using Photocat.  I enjoy this aspect because it draws on what I've learned here at SLIS: communicating with users to identify how best to create a system or service to help them with their information needs. 

Friday, October 26, 2012

Reading #7



Lim, S. & Liew, C. (2011). Metadata quality and interoperability of GLAM digital images. Aslib Proceedings, 63(5), 484-498. doi:10.1108/00012531111164978

An exploration of how metadata has been appropriated in galleries, libraries, archives and museums (GLAM) in New Zealand, and an analysis of its quality with regard to the interoperability of its metadata set.  Data collection took place in two stages.  First, the metadata records of 16 GLAM-affiliated institutions in New Zealand were analyzed for the kinds and extent of metadata used; however, because these records were publicly accessible, metadata kept from public view could not be examined, so interviews with staff from the institutions were conducted as well.  The study found that digital image metadata records amongst the four types of institutions differed in their emphases on metadata types and function.  A second issue was the limited variety of metadata.  Third, not enough institutions employ technical metadata in their records, risking the loss of important data.  It appears that many institutions treat their digital images as surrogates of physical collections.  Further research is proposed on which types of data matter most from the user perspective for the best retrieval and interoperability.

Thursday, October 25, 2012

Week 10: A Very 'Duh' Moment



It’s funny.  Most of the time, the message ‘Validation failed’ with the little red icon is such a letdown.  Now, it’s a pleasant experience.  Seeing that message means that I’m doing something correct!  It means that my XPath expressions are correct and that my assertions and reports are well-written.  After roughly two and a half weeks of stalling due to my silly mistake (I feel really bad for wasting everyone’s time), the problem with Schematron has finally been worked out.  It turns out that my XPath expressions were not properly written: I had been including the child node that I wanted to test for, when the rule context actually has to end at the parent node.  Once I made all of the corrections, things started moving along again.  I’m still playing around with some assertions, such as the one that will check for date format.  I tried comparing string lengths, but I’ve been receiving an error saying that a date and a string length cannot be compared.  I’ve tried matches, but apparently matches() needs two arguments.  But I have solved some other things, such as how to check that an @href value contains ‘http://purl.dlib.indiana.edu/’.  I’m relieved that this is back up and running.
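For future reference, here is a minimal sketch of the corrected pattern.  The element names (publicationStmt, idno, date) and the exact rules are my reconstructions for illustration, not the actual VWWP schema, and the date rule shows one idea I haven't gotten working yet: using XPath 2.0's two-argument matches() instead of string-length().

  <schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
    <pattern>
      <!-- The fix: the rule context ends at the parent node; -->
      <!-- the child node being tested for goes in the assert. -->
      <rule context="tei:fileDesc/tei:publicationStmt">
        <assert test="tei:idno">The publicationStmt should contain an idno.</assert>
      </rule>
      <!-- Checking that an @href value contains the IU PURL base. -->
      <rule context="*[@href]">
        <assert test="contains(@href, 'http://purl.dlib.indiana.edu/')">
          The @href value should contain 'http://purl.dlib.indiana.edu/'.
        </assert>
      </rule>
      <!-- One idea for the date-format check: a regular expression -->
      <!-- via matches(value, pattern) instead of string-length(). -->
      <rule context="tei:publicationStmt/tei:date">
        <report test="not(matches(normalize-space(.), '^\d{4}$'))">
          The date should be a four-digit year.
        </report>
      </rule>
    </pattern>
  </schema>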

I have also started some preliminary work on the Image Collections Online project.  I’ve reviewed the elements and attributes and begun to see redundancies.  I’ve read the wiki pages for the project and am getting a good idea of its workflow.  Now that Schematron is working again, I’m not sure if I’ll still be working on the ICO project.  There is a meeting tomorrow that I am going to attend, and we will all discuss whether there is a need for an intern and, if so, what more specifically I can do.  While some of the work that I would do has already been completed, it was done by several different people over a long period of time, and some of the end products may no longer be relevant or accurate. 

Hopefully, I will have the Header section of the VWWP Schematron complete by the end of today and then finish the Body section tomorrow.  

I also plan on attending the Digital Brown Bag next Wednesday about the Wikipedia GLAM project.  I have never heard of it, so I decided to do my reading on it and hopefully will learn something interesting.  

Thursday, October 18, 2012

Week 9: A change of plans



So, it seems like things have really changed.  After the weeks-long stall due to the technical problems running Schematron, I spoke with Michelle about where to go from here.  Until she is able to get in touch with people from Mulberry Technologies and other Schematron mailing lists, there is very little I can do with Schematron besides some editing here and there.  

During our meeting, she suggested that I take a break from Schematron and start looking at another potential project.  Based on my interest in exposing myself to additional metadata standards (other than Dublin Core and RDF), she mentioned a project working with Image Collections Online, which would help acquaint me with metadata and metadata mapping.  The DLP first began working with other collections to help them with their cataloging and metadata needs.  Originally, the DLP gave these special collections (e.g., the Liberian Photograph Collections) a core set of fields.  Slowly, these collections started asking for more specialized fields to account for their diverse cataloging needs.  While this may have started off well, and the DLP was glad to be flexible, the process of customizing fields soon became overwhelming and chaotic.  In addition, even though a collection may have a large number of specialized fields, they will not all show up in the interface anyway.  So now the DLP wants to move back toward a core set of fields.  What Michelle thinks I can do is play the part of a metadata analyst: first establish what the core fields are, then analyze the divergent fields and see if any of them are actually more similar than previously thought.  We could then merge those fields to allow the collections to express what they feel is necessary without overwhelming the system.  After that, I would figure out what metadata can be mapped to MODS.  One nice thing about this project is that I would be working with the Metadata Working Group associated with ICO, which would give me the collaboration I was missing.  As much as I feel dismayed about leaving Schematron behind for now, I think this is a good direction to take things. 

Friday, October 12, 2012

Reading #6

Dalmau, M. & Schlosser, M. (2010). Challenges of serials text encoding in the spirit of scholarly communication. Library Hi Tech, 28(3), 345-359.

In 2006, the Digital Library Program at Indiana University received a grant from the state of Indiana to digitize and encode the nearly 100-year run of the Indiana Magazine of History.  The project intended to provide full-text and facsimile views, improve metadata for better search and retrieval, and develop a publishing model for the journal.  The digitization and encoding were carried out by a combination of in-house and outsourced personnel in accordance with several quality control guidelines.  TEI was chosen for its strength in encoding texts that are literary in nature; its independent header was also seen as a strength for its ability to capture bibliographic metadata.  Quality control was handled manually to a small extent, but due to the limited time and budget, most of it was automated.  The experience provided a few lessons for the future.  It was the program's opinion that semantically or structurally difficult encoding is best performed in-house.  In addition, the more manual quality control is performed in advance, the more smoothly the subsequent automated process will run.  The paper suggests future emphasis on guidelines and consistent communication with any outside vendors.

Thursday, October 11, 2012

Week 8: Moving On. . .

So, last week was tough.  But after speaking with Michelle, we have decided that I should move on to the conceptual mapping for the TEI body.  The process is essentially the same, starting with the spreadsheet, although this time I'm skipping the 'fluffy' version and going straight to the structured version.  Michelle told me that this section would be easier in some ways, but more difficult in others.

I think she's right.  There seem to be fewer things that need to be selected and validated, but it is harder to create the logic for those that do exist.  For example, there needs to be a way to check that if a note spans pages, the note is collapsed onto one page for readability.  The easy way would be to write the report as <report test="tei:note"/>, but all that would do is check whether a note exists; it wouldn't actually check whether the note spans pages.  This is the kind of logic I have to play around with.  There is also an encoding rule that the encoder must remove all end-of-line hyphens from word splits; however, a split sometimes traverses pages, and that will be tricky to capture in logic.
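As a starting point, here is a hedged sketch of the direction I'm considering for the note rule, assuming the tei prefix is declared with <ns> at the schema level.  The underlying assumption (mine, not a settled encoding rule) is that a note spanning pages will contain a pb element somewhere inside it:

  <pattern xmlns="http://purl.oclc.org/dsdl/schematron">
    <rule context="tei:note">
      <!-- A note that spans pages carries a page break somewhere inside it. -->
      <report test=".//tei:pb">
        This note contains a page break; collapse it onto one page for readability.
      </report>
    </rule>
  </pattern>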

I have now completed the conceptual mapping for the TEI body and begun writing the Schematron; I have 10 asserts so far.  So, yes, for now: moving on. 

Saturday, October 6, 2012

Reading #5

XPath Tutorial. (n.d.). Retrieved October 3, 2012, from W3Schools: http://www.w3schools.com/xpath/default.asp


This W3Schools tutorial was helpful in providing a concise yet comprehensive review of XPath.  Given my difficulties with Schematron this past week, I did a lot of reading to try to figure out the problem.  The tutorial begins simply by describing what XPath is.  A Venn diagram demonstrates the relationship that XPath has with other XML technologies, such as XQuery, XPointer, XLink and XSLT.  A bulleted list of key points also aids understanding; for example, "XPath uses path expressions to navigate in XML documents."  The reading then points out that these path expressions are used to select nodes or node-sets in an XML document and look and behave much like paths in a computer file system.

The reading then moves on to discuss the relationships of nodes.  Parent, child, sibling, ancestor, and descendant nodes are explained.  Even though I have a pretty good grasp of node relationships, I really appreciated that examples were included.  I especially liked that the examples were consistent: if I were new to this concept, sticking with the same XML example document would have really helped me get used to it, instead of it changing just as I started to understand the idea.
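To make the terminology concrete, here are a few path expressions in the style of the tutorial's bookstore example (reproduced from memory, so the exact wording may differ from the tutorial's):

  /bookstore/book                       selects all book elements that are children of bookstore
  //title                               selects all title elements anywhere in the document
  /bookstore/book[1]                    selects the first book child of bookstore
  //title[@lang='en']                   selects all title elements with a lang attribute value of 'en'
  /bookstore/book[price>35.00]/title    selects the titles of books with a price greater than 35.00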


I was hoping that the next section, on selecting nodes, would help me figure out what was wrong with my Schematron.  I have a feeling that, because no errors were found in the XML documents I validated against the Schematron, the problem is that the Schematron is not latching onto the right places in the document.  However, I still don't really understand why the XPath works when entered directly, but the validation run through Schematron does not.  I reviewed the section on selecting nodes, but I didn't find anything that helped.  I feel like I am back at square one now.  At the very least, I got a good review of XPath. 

Friday, October 5, 2012

Week 7: Bump in the Road

This week began well, but quickly became frustrating.  I finished much of the Schematron for the TEI Header and was ready to test what I had done against a couple of VWWP legacy and new texts.  I was nervous, but hoping for the best.  I had spent so much time on the two different kinds of conceptual mappings, and the writing process had taken a while, so I wanted to succeed on the first try. 

I pulled up one legacy and one new text in Oxygen, along with my Schematron file, and decided to try the new text first.  I selected the correct version of the Schematron against which to validate the TEI document.  Immediately, I got an error stating that there was a problem with the TEI namespace.  I spent about 4 hours searching online, Googling as much as I could to figure out what could possibly be wrong with the way I had declared the namespace.  I came across a post on the TEI boards where some other people were having issues with the TEI namespace.  Someone had posted another version of the declaration, so I tried it.  Essentially, it seemed like the namespace had to be declared twice.  It looked strange to me, but Oxygen no longer complained about it.  (I ended up changing the namespace yet again, after Michelle told me that Professor Walsh uses a different version, but that was a bit later.)  I ran the validation again, and much to my surprise, the TEI document validated.  I was in shock.  I immediately knew that something was wrong, but I couldn't figure it out.  I tried Googling and looking through all of the readings that Michelle had given me, but I could find nothing that addressed the problem of Schematron not catching any errors. 
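For my own records, the double declaration looked roughly like this (reconstructed from memory, with a placeholder rule, so the details may not match what I actually had).  The xmlns:tei attribute serves the schema document itself, while the <ns> element is what makes the tei: prefix usable in the XPath tests:

  <schema xmlns="http://purl.oclc.org/dsdl/schematron"
          xmlns:tei="http://www.tei-c.org/ns/1.0">
    <!-- Declared "twice": once as an XML namespace binding above, -->
    <!-- and once as a Schematron <ns> so tei: resolves in XPath tests. -->
    <ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
    <pattern>
      <rule context="tei:teiHeader">
        <assert test="tei:fileDesc">A teiHeader should contain a fileDesc.</assert>
      </rule>
    </pattern>
  </schema>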

I ended up meeting with Michelle later in the week, and we tried playing with different options available within Oxygen.  I clicked on what seemed an endless number of options, in endless combinations, and nothing worked.  I thought that maybe I needed to change the version of XPath I was using, but that did not work either.  I also tried typing the XPath directly into the search, and that worked perfectly.  So Michelle and I are thinking that my XPath is written fine, and we really don't know what the problem could be.  She has promised to speak with Professor Walsh and to write to an acquaintance of hers at Mulberry Technologies.  This was definitely a frustrating week, and I feel powerless. 

Saturday, September 29, 2012

Reading #4


Ochoa, X. & Duval, E. (2009). Automatic evaluation of metadata quality in digital repositories. International Journal on Digital Libraries, 10(2-3), 67-91.

The frequency of manual quality control of metadata is rapidly declining due to recent developments in automatic metadata creation and interoperability between digital repositories.  The result, however, is the occasional absence of any form of quality control, which can negatively affect the services provided to repository users.  Various quality metrics for metadata are presented, and three experiments were run to evaluate them.  The first compared the quality metrics with quality assessments by human reviewers.  The second evaluated the metrics' ability to discriminate key properties of the metadata sets.  The third tested a practical application in which the metrics served as an automatic low-quality filter.  One metric, Textual Information Content, appeared to be a good approximation of the human reviews and is even able to evaluate quality characteristics that human reviewers cannot.  The development of quality metrics will allow metadata researchers to monitor the quality of a repository, as well as track its growth and the events that affect it, without the need for costly human involvement. 

Reading #3



Caplan, P. (2003). The TEI Header. In Metadata fundamentals for all librarians (pp. 66-75). Chicago: ALA Editions. 

The importance of the TEI Header is evaluated by reviewing the history of the Text Encoding Initiative, examining the comprehensive nature of the TEI Guidelines, assessing the goal of having the TEI Header serve as a basis for library cataloging, and contemplating the header's intended flexibility.  An examination of the four sections of the TEI Header reveals the header's descriptive, subject, non-bibliographic and administrative strengths.  The conclusion is that the header is a widely used metadata scheme that can be easily adapted for use with a wide range of XML documents.  The usefulness of the header in documenting non-bibliographic aspects ensures the TEI Header's dominance as a standard for describing electronic texts.