Saturday, May 28, 2011

Rosetta

On Friday I heard a presentation by Ex Libris staff about their Rosetta long-term digital preservation system. Marketing presentations generally do not interest me, but the presenter was the project manager and could in fact answer questions about technical issues.

Bitstream Integrity

Rosetta does not address this problem directly, but Ex Libris does not deny its importance either. They have in fact talked with David Rosenthal about it. The system structure separates bit management from the other layers and allows multiple solutions, including those that do active integrity checking. When Rosetta must manage the storage directly, it uses checksums and periodically tests the stored copy against them. But if the copy's checksum does not match the stored checksum, they can only ask someone else for a new copy, which could be troublesome in 100 years or so.
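The checksum-based integrity testing described above can be illustrated with a minimal sketch. This is not Rosetta's implementation: the presentation did not say which hash algorithm or storage layout Rosetta uses, so SHA-256, the `manifest` dictionary, and the function names `sha256_of` and `audit` are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest: Dict[str, str], root: Path) -> List[str]:
    """Return the relative paths that are missing or whose current
    checksum differs from the one recorded in the (hypothetical) manifest."""
    damaged = []
    for rel_path, recorded in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != recorded:
            damaged.append(rel_path)
    return damaged
```

A periodic job would call `audit` on a schedule and report the returned paths. Note that the check only detects damage. As the paragraph above says, repair still depends on getting a good copy from somewhere else.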

We talked about whether LOCKSS might integrate with Rosetta at this level. The general answer seemed to be yes, or at least that it might be worth a try.

Authenticity

Rosetta does maintain provenance information, but has no way to link back to check against the original to make sure that the authenticity remains synchronized. This problem is not unique to Rosetta. The digital preservation field really needs to develop reasonable criteria for authenticity testing.

Access

Here Rosetta seems to do a good job in making various access copies and controlling the access rules.

Risk Manager

This feature appears to function something like the migration manager in koLibRI. It uses a database to keep track of technical metadata about formats and versions. Rosetta has a knowledge base that allows institutions "to share their formats, related risks and applications". Rosetta has a work-group to enhance the knowledge base as well. [Thanks to Ido Peled for this addition.]

We talked about the risk problem generally with format change. It is not really 0 or 1, but more likely a scaled reduction of access to certain formats. Clearly there needs to be more thinking about when to trigger migration and what kind of migration (on-the-fly or preventative) makes sense.

Load Speeds

I was pleased to see that Rosetta has tested its performance loading different sizes of data and that the information is publicly accessible (see figure 3). I have talked with a number of publishers recently that have concerns about the ability of archiving systems to load their contents in a timely manner and I think other systems should test their ingest times.

Conclusion

The session ended with our agreeing to talk more about potential collaboration in the research arena.


Tuesday, May 24, 2011

ANADP in Tallinn - day 3

Educational Alignment Panel

At the EU level, there is an effort to get more involved with professional bodies. Internships play no significant role in UK digital preservation education, since the master's there tends to be a one-year degree. Knowledge Exchange is interested in these developments. The Library of Congress has collaborated with a number of US schools to establish internships, which benefit both the Library and the interns, as well as the professions that they later enter.

One of the biggest challenges is how to identify the essential skills needed for digital preservation. What correlates to bookbinding in the digital world? It may be programming. The students need much more technical competence. While adding courses step by step may be insufficient, finding the time for curriculum reform is challenging. Addressing the funding dilemma is a key aspect and George Coulbourne (LC) suggests corporate partnerships to share costs and responsibilities. In the question and answer period, the question of a "new" profession vs mainstreaming the new skills in the old profession arose. We need to remain aware of the difference between education and mere training that focuses only on particular skills and belongs to ongoing professional development.

Economic Alignment Panel

Costs are a vital issue for any digital archive. Sharing tools and collaboration are ways to manage costs. Examples include LOCKSS and NDIIPP. In Italy MiBAC offers a legal deposit service for its small institution partners. We can also learn from failed initiatives. PADI was, for example, discontinued after 10 years (see ACRL), largely for economic reasons because the national library ended up having to do most of the funding. Neil Grindley used an analogy with the computer game Asteroids -- in his version funders like JISC fire money at big problems like digital preservation in the hope of breaking the problem (asteroid) up. But what angle should the funder take when it funds? Looking back at the JISC funding efforts, Neil wonders whether someone should write a "really good" history of digital funding. JISC has been doing some cost-modeling. Archival storage is consistently a small portion (15%) of overall project costs. Repairing problems costs significantly more than initial preservation. The UK is building a higher education cloud infrastructure. PEPRS (Piloting an e-journals preservation registry service) is trying to build similar infrastructure for preservation.

In the Czech Republic a funding problem is that digital preservation is invisible and often ignored in favor of digitizing more documents. Electronic deposit began only in 2011 as a pilot project, but digitization began in the 1990s with historical manuscripts, and with endangered newspapers and monographs. The aim is to digitize 26 million pages by 2014. The budget is 12 million Euros.

Digital preservation is the flipside of collection development. At Auburn University in Alabama they are using distributed digital preservation in a Private LOCKSS Network. 7 institutions have joined the Alabama PLN and it has been self-supporting since 2008. The fee-base varies from $300 to $4800 per year. Governance took longer to establish. The guiding principles: keep it simple, keep it cheap, don't build something new if you don't have to. Recommendation: stop chasing soft money and start making tough choices about local commitment.

Breakout session: Education

A former student from the Royal School in Copenhagen suggested that we consider the Erasmus Mundus program and put together a focused program for that funding source. The students would like more specific job expectations, but the expectations vary widely. Employers look for the right mindset, not the right skill-set. Squeezing in internships is hard. From the employer perspective, an internship is like a year-long interview. Many of the schools have active hands-on programs that emphasize teamwork and practical problem-solving.

Summary Presentations

Benchmarking takes data from content providers and some are ready to make data available. We also need to communicate about benchmarking and other tests.

Cliff Lynch offered an "opinionated synthesis." What does this term "alignment" mean? Making our limited economic and intellectual resources go further through collaboration is obviously beneficial. Another aspect of alignment is that a common case should speak more effectively to governments.

In the tech discussion there were valuable conversations about benchmarking and testing. We need to be clear what we mean by interoperability, what we want to accomplish and what we want to get out of it. Two topics were missing: monoculture and hubris. We tend to have more confidence than is warranted that we know what we will need to do in 100 years, and diversity in the system is a valuable antidote to the mistakes we make. We need to focus on the bit-storage layer and there will be a lot of money flowing in this area. Security and integrity are topics that were mentioned but need more focus. Imagine a WikiLeaks-type leak of embargoed content. It would undermine the trust in cultural preservation institutions.

Strategies inside the national level were not discussed as much as they should have been. The question of the replication of material across organizations also needs more discussion. It is interesting that we see standards in so many roles in digital preservation. The legal issues are becoming more and more dominant and we need to look for more opportunities to collaborate here. The one thing he would note on education is that the discussion needs to feed back into the national discussions. We did not talk much about scale in the discussion about economics. The risk management tradeoff for digitizing needs assessment.

The elephant in the room is e-science and e-scholarship. There is a lot of money involved here and big investments. This is not a place where many national libraries have been involved, though universities often are. This is driving both technology and some educational efforts. Other smaller elephants are audiovisual material and the newborn digital contents.

There are two additional axes that matter. One is making the case outside our community. The second is collecting policy. News has been a fundamental part of the public record and we know that it is fundamentally changing its character. The same holds for software, personal records, and social media. Cliff hopes that this is helpful in providing a frame for the conversation we have had in the last two days.


Links

Another blog about the ANADP conference is Inge Angevaare's Long-term Access blog (this portion of the blog is in English -- sometimes it is also in Dutch).

Monday, May 23, 2011

ANADP in Tallinn - day 2

Keynote

The keynote speaker at the ANADP conference for the second day was Gunnar Sahlin from the National Library of Sweden. One of the National Library's explicit tasks is to support university libraries. Open access and e-publishing are key initiatives together with the other 4 Nordic countries. Linked open data is a more problematic topic because of resistance by publishers, but the National Library strongly supports Europeana's efforts in this area. There is close cooperation with the public sector, especially Swedish radio and television. The Swedish Parliament is considering a new copyright law that may clarify some issues.

Standards panel

The standards panel began with the idea that we have both too many standards and too few. Standards can be seen as a sign of maturity in a field. Digital preservation has not only its own standards, but many from other areas -- a Chinese menu of choices. Information security standards are for preserving confidentiality, integrity, and the availability of information. Many memory institutions have to comply with these standards. The issue was especially important for Estonia because of internet attacks, especially denial of service attacks. In general information security is well integrated into plans in the Baltic countries, but long term digital preservation is not. Only 12% have an offsite disaster recovery plan.

The UK Data Archive is an archive for social science and humanities data since 1967. "A standard is an agreed and repeatable way of doing something -- a specification of precise criteria designed to be used consistently and appropriately." In fact many standards are impractical, with unnecessary detail (8 [?] pages to explain options for gender in humans). Cal Lee spoke about 10 fundamental assertions, including that no particular level of preservation is canonically correct. Context is the set of symbolic and social relationships. With best practices and standards, trust is a key issue. PLANETS is concerned about quality standards and such standards begin with testing. Trust consists of audits, peer-reviewing, self-assessment, and certification. The process moves from awareness to evidence to learning. The biggest technology challenge comes from de facto standards from industry, and we have little control there. Good standards have metrics and measurement systems. Within our lifetime everything that we have as a preservation standard now will be superseded, but the principles will remain.

Copyright panel

Digital legal deposit is a key element, but not a form of alignment. In the UK, for example, legal deposit is still just for print material. In the Netherlands there is a voluntary agreement that works well. Territoriality is a problem – how to define the venue in which publishing takes place in the digital world, what is unlawful, what is protected, etc. The variance in legal deposit between countries leads to gaps. The rules for diligent search for orphan works are so complicated that they are too expensive to use. Even within the context of Europeana, cross-border access to orphan works is a problem. In US law, contract law takes precedence over copyright law. Too many licenses could undermine the ability to preserve materials.

To a question about Google a speaker said that Google's original defense of the scanning project was "fair use" (17 USC 107) and they had a good chance there. It changed to a class-action suit, which is more complicated. The breakout session on copyright went into further depth about what problems exist in dealing with copyright across national borders. Apparently a feature of Irish copyright law is that the copyright law takes precedence over private contracts. Generally contracts take priority.


Summary of the sessions

Panel chairs gave a summary of their sessions and breakout sessions. For the technical group I spoke about the need for testing, trust (or distrust) and metrics and argued that we are really just beginning to address these issues.


ANADP in Tallinn - day 1

Opening

This blog is beginning with a very international conference called Aligning National Approaches to Digital Preservation (ANADP) that is taking place today in Tallinn, Estonia.

The President of Estonia opened the conference. He emphasized how technologically and digitally aware the country and its national library are. 10 million Estonian books were destroyed during the Soviet occupation as part of an effort to erase Estonian identity. Further destruction took place when Estonia became free and people wanted to cover up their past. Digitization allows the country to preserve materials and make them accessible. The president closed by saying: "Digitizing our national memory is a cornerstone of liberty."

Laura Campbell (Library of Congress) gave the keynote address. NDIIPP (National Digital Information Infrastructure and Preservation Program) has the goal of preserving digital materials. Congress provided $100 million for this effort. LoC has worked on a distributed network to carry out this mission. The program model was to learn by doing. There was no clear pathway forward. She cited WARC development as one of the key technological components and argued that secrecy through proprietary systems does not lead to success in digital archiving. As an example she told the story of Goldcorp, a Canadian gold mining company that put their proprietary software online and offered a prize for the best recommendations on what to do. The company grew significantly as a result of crowdsourced suggestions. She recommended planning broad goals for collaboration for digital preservation and expanding national digital collections into international ones. LoC has a strong outreach program with classroom teachers to push discussion out to younger people.

Technical Alignment

The technical alignment panel looked at two issues: infrastructure and testing. Presentations on infrastructure included kopal, nestor, LuKII, and the UK LOCKSS Alliance. The presentations about testing called for benchmarking, public tests, and metrics that librarians can use when making decisions, rather than just believing vendor claims. A vendor raised questions about this, but admitted that they were not willing to share their test data, except among customers. (Note: I was panel chair and could not make detailed notes during this session.)

The panel on organizational alignment looked at long-term commitment, the scale necessary to make the work cost-efficient, and effective interaction with vendors. While the EU funds many projects that promise to continue when the funding ends, most do not. TRAC fostered the Audit and Certification of Trustworthy Digital Repositories, which is now an ISO Standard. Social collaboration is a necessary element of infrastructure and the National Digital Stewardship Alliance is an attempt to address this. Distributed digital preservation is an idea as old as monastic copying. MetaArchive is a distributed digital preservation initiative that began with NDIIPP funding. MetaArchive now also has European members and has been experimenting with cross-deposit with iRODS.

Later

The day ended with a reception.