March 22, 2016

An overly long post where I reflect on the IIPC Steering Committee and my tenure on it since 2010

Steering Committee Elections Are Upon Us


The IIPC Steering Committee (SC) elections are almost upon us. Last year the SC decided, as part of the new Consortium Agreement, to move the SC election from late in the year to the General Assembly. Terms no longer start on January 1st, but rather on June 1st. Term length remains three years, but obviously the term of current SC members is extended by five months to cover the transition.

It is hoped that by moving the elections to the General Assembly (GA), we can increase the visibility of the SC and make it easier for members to participate in the governance of the IIPC. It also means that SC chairs will now have the GA at the end of their term. As the GA serves as the most significant event each year, it makes sense to hand over the reins following the GA, instead of doing so a few months before it.

Of the 15 SC seats, two are allocated to institutions hosting IIPC officers: currently, the British Library with the communication and program officer and Bibliothèque nationale de France with the treasurer. About a third of the remaining 13 seats are voted on each year. This year five seats are up for a vote, including that of my organization, The National and University Library of Iceland. The other four are The Library of Congress, Internet Archive, The German National Library and the Swiss National Library.

As far as I know all five seek reelection to the SC. I haven't heard of additional SC candidate institutions, but they have until March 30 to declare their interest.

I thought I'd use this opportunity to reflect on my two terms serving on the SC.

Iceland on the Steering Committee


The National and University Library of Iceland is a founding member of the IIPC and has had a continuous presence on the SC from the start. From 2003 to 2009, we were represented by Þorsteinn Hallgrímsson, then Deputy National Librarian. He also served as chair of the SC in 2008.

In 2009 the library volunteered to be one of the five organizations that would be the first to put their seat on the SC up for election (for a term beginning January 1st 2010). Voting on SC seats was a part of the changes made with the third Consortium Agreement, which covered 2010-2012 and remained little changed for 2013-2015. Before this, SC seats had been fixed.

Given that Þorsteinn was planning to retire in 2010, it was decided that I would replace him should we retain our seat on the SC. As it turned out, there were only five candidates for the five seats as two incumbents decided to withdraw from the SC. We won our seat by default.

To confess, at the time I didn't quite know what I was getting myself into. The SC meetings are always closed and while Þorsteinn would occasionally confer with me about something discussed at an SC meeting, for the most part the operation of the SC was a bit of a black box. Unfortunately, I think this is how it appears to all too many members, especially those from institutions not serving on the SC. Some efforts have been made to make it more transparent, but I believe more is needed. I suppose this post is a small contribution to that end.

I arrive in Singapore


My first SC meeting was during the Singapore GA. I remember thinking, before going into the meeting, 'just shut up and listen'. I still try very hard to listen carefully. However, as my fellow SC members will surely attest to, I have long since given up on 'shutting up'.

Singapore, National Library
The Singapore meeting was relatively short (being shoehorned into an afternoon during the GA) and nothing of any lasting consequence was discussed. Looking at the minutes I see the usual budgetary items for the GA and following events. There was some review of the growth in membership (2007-2009 was the "growth" phase of the IIPC). Review of funded projects and talk of the strategic direction of the IIPC. A notable concern about a divide between "old" and "new" members.

Looking back now, it all seems very typical. Budget, events and sponsored projects have always occupied a notable fraction of the SC's time. Often to the point that we were unable to adequately discuss more transformative issues.

Especially during my first term, I frequently found the time for SC meetings to be far too short to tackle anything substantive. I remember sitting in the SC meeting at the 2013 GA in Ljubljana discussing changes to the working groups. Kris Carpenter and I were both quite unhappy with the status quo and were pushing for changes. But with only a half day for the SC meeting, the matter couldn't be adequately covered. In fact, today we are still struggling with the topic to some extent. At the time it was very frustrating.

Fortunately, this has been changing for the better in the last couple of years. We now typically have a two day, dedicated SC meeting in the fall. This has proven far more productive, in my opinion, and enabled us to make far more substantial changes to the consortium agreement this time around. We also try to squeeze in a whole day during the GA, although this has proven more difficult. The use of online meetings has also grown and become more productive.

An Executive Committee


Finding more time for SC meetings is important, as the SC faces a number of important issues at the moment. Some quite foundational, like what exactly is the IIPC. Are we a forum for discussion and knowledge sharing only? Are we an advocacy group? Should we build tools? Develop standards and APIs? Build collections?

Of course this shouldn't all be decided by the SC alone. But the SC needs to be in a leadership role here. Perhaps that is one of the things that needs to change more. Historically, SC members have fulfilled their role mostly by attending SC meetings, often with little activity in between.

The SC poses for a photo after a meeting in Paris, fall 2015
Starting with the newest consortium agreement, however, the SC will contain, in addition to the chair, four new recognized roles. The members serving in these roles will be expected to provide leadership with respect to their particular role.

The four new roles are Vice-Chair, which has been around unofficially for a while, and three topic-oriented roles: Tools Development, Membership Engagement and Partnerships & Outreach.

This is a big change. Whereas before an SC member could expect to be called on to serve as chair for one year once every 5 terms (15 years), now an SC member should expect to serve in one of these roles for at least one year each term.

Our current chair, Paul Wagner, was the one to come up with this and I must applaud him for it. This is the largest change to the SC's function since I joined it. Perhaps since the very beginning.

This also addresses (in part, at least) some of the issues we've had with the IIPC officers. All too often they've been saddled with strategic responsibilities. The officers are there to keep day-to-day matters running. But in the past we've asked them to essentially take up leadership roles in areas of outreach and member services (among others). The results have been poor, perhaps partly due to churn in the PCO role, but I suspect largely because they are not properly empowered, nor do they have enough time to devote to this.

It may prove that the SC members also do not have enough time for this. But, if that is the case, I think we either need new SC members or we can just close down the consortium.

It makes more sense that the SC (as a whole and as individuals) assume the responsibility for providing leadership. The SC isn't, in this way, like the board of a company. It isn't an oversight body with some broad policy influence. The SC is an executive board.

It is time we act like one.

New faces and old


Looking at the list of attendees from that first SC meeting in Singapore, I see that only three people remain on the SC. Myself, Birgit Henriksen (Denmark) and Sven Arne Solbakk (Norway). The latter two have been on the SC since day one!

Apparently, us Nordic folk are exceptionally stable :)

Several institutions have withdrawn from the SC in that time but much of the turnover comes from personnel changes at the member institutions. The fact that we get board turnover both from members leaving and from individual representatives leaving/being promoted means that we have a fairly "young" SC. It is sometimes strange being an "old hand" after only six years. Stranger still as I remain one of the youngest representatives on the SC.

Me and Gildas discussing something of profound importance at the 2012 GA reception in Washington
It is important to retain a reasonable amount of institutional memory. Over the last three years we've had many important people leave the SC (Gildas Illien, Martha Anderson, Kris Carpenter, Helen Hockx-Yu to name just a few), largely reshaping the SC.

It must be said, some of the new representatives have brought energizing ideas and changes with them. Notably, Paul Wagner, who joined the SC on behalf of Library and Archives Canada in 2014. I've found his approach to the issues we are tackling to often be refreshing. He has certainly left his mark on the SC and, I expect, the IIPC in general.

Still, we must also take care not to lose the IIPC's institutional memory. Even if only to avoid repeating the same mistakes.

The Future


So, why do I want to serve another term on the SC? It is a fair question, and one I hope all candidates ask themselves honestly.

I suppose there is some perceived prestige or status that comes with it (for the representative and institution). That might have tickled me slightly when I sought my first term, six years ago. But that is a poor reason.

Today, I seek to remain on the SC because I believe that the IIPC matters. Both broadly, in the world, and also directly to the mission of my library.

Iceland, National Library
I also believe that I have something worthwhile to contribute. A viewpoint that may otherwise go unheeded. And I am willing to shoulder my part of the responsibilities that come with serving on the Steering Committee. Even now, when those responsibilities are set to increase notably.

I think that should be clear from the work I have done over the last six years. I've never been content to just attend SC meetings. Instead, I have led a working group, managed a task force, helped organize a technical training workshop, helped update the WARC standard, worked on our open source projects (Heritrix and OpenWayback), provided leadership on those projects when needed and, now, my library is hosting this year's GA.

I do this, not because I'm generous, but because I've come to understand the value of our cooperative efforts. The Icelandic Web Archive would barely exist without the support, knowledge and tools we have garnered via international cooperation, notably within the IIPC.

I suppose we could have gained some of it without giving back. But if everyone does that, we all lose out. I have no doubt we've gotten back more than we have given. That is the beauty of the IIPC, it isn't a zero-sum game. We can all get more out of it than we put in. Also, some of the most rewarding things can only be had by fully engaging with the community.

We wish (my institution and I) to continue to contribute, because we know it is eminently worthwhile.



March 18, 2016

Declaring WARR on "CDX Server" API

Work is currently ongoing to specify a "CDX Server API" for OpenWayback. The name of this API has, however, caused an unfortunate amount of confusion. Despite the name, the data served via this API needn't be in CDX files!

The core purpose of this API is to respond to a query containing a URL and optionally a timestamp or time range with a set of records that fall within those parameters. This is meant to support two basic functionalities: one, replay of captured web content and, two, discovery of captured web content.

CDXs need not enter into it. It is just that the most common way (by far) to manage such an index is to use sorted CDX files. Thus the unfortunate name. Nothing prevents alternative indexing solutions being used. You could use a relational database, Lucene or whatever tool allows lookups of strings!
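To make the point concrete, here is a minimal sketch in Python of what such a lookup could look like when it is kept independent of how the index is stored. The names (`Record`, `InMemoryIndex`, `resolve`) are hypothetical and are not part of OpenWayback or any agreed API. The sorted in-memory list is only a stand-in for whatever backend is actually used, and 14-digit timestamps are assumed.

```python
from bisect import bisect_left, bisect_right
from typing import List, NamedTuple, Optional


class Record(NamedTuple):
    url_key: str      # canonicalized URL used as the lookup key
    timestamp: str    # assumed 14-digit capture time, e.g. "20160317120000"
    filename: str     # (W)ARC file holding the capture
    offset: int       # position of the record within that file


class InMemoryIndex:
    """Hypothetical backend: a sorted list standing in for any index."""

    def __init__(self, records: List[Record]) -> None:
        self._records = sorted(records)
        # Parallel list of (url_key, timestamp) pairs used for binary search.
        self._keys = [(r.url_key, r.timestamp) for r in self._records]

    def resolve(self, url_key: str,
                start: Optional[str] = None,
                end: Optional[str] = None) -> List[Record]:
        """Return captures of url_key with start <= timestamp <= end."""
        lo = bisect_left(self._keys, (url_key, start or ""))
        hi = bisect_right(self._keys, (url_key, end or "99999999999999"))
        return self._records[lo:hi]
```

A database-backed or Lucene-backed class exposing the same `resolve` method could be swapped in without the caller noticing, which is the whole argument for dropping "CDX" from the name.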

So, this API desperately needs a new name. My suggestion is "Web Archive Resource Resolution Service" or WARR Service for short. Yes, I did torture that until it produced a usable acronym.

In my last post I discussed changes to the CDX file format itself. Those changes should facilitate WARR servers running on CDX indexes. But ultimately, the development of the WARR Service API is not directly coupled to those changes. We should focus on developing the WARR Service API with respect to the established use cases.

In truth, the exact scope and nature of this new API remains debated. You can find some lively discussion in this GitHub issue. More on that topic another day.

March 17, 2016

Rewriting the CDX file format

CDX files are used to support URL+timestamp searching of web archives. They've been around for a long time, having first been used to catalog the contents of ARC files. Despite the advent of the WARC file format, they haven't changed much. I think it is past due that we reconsider the format from the ground up.

The current specification lists a large number of possible fields. Many are not used in typical scenarios.

The first field is a canonicalized URL, i.e. a URL with trivial elements (such as protocol) removed so that equivalent URLs end up with the same canonical URL here. This serves as the primary search key.

The only problem with this is that searching for content in all subdomains is not possible without scanning the entire CDX. This is because the subdomain comes before the domain. Instead, we should use a SURT (Sort-friendly URI Reordering Transform) form of the canonical URL. SURT URLs turn the domain/sub-domain structure around, making such queries fairly straightforward. There is essentially no downside to doing this and, in fact, a number of CDXs have been built in this manner, regardless of any "formal" standardization (as there isn't really any formal standard).

I suggest that any revised CDX format mandate the use of SURT URLs for the first field. Furthermore, we should utilize the correct SURT format. In most (probably all) current CDXs with SURT URLs, an annoying mistake has been made where the closing comma is missing. A URL that should read:
   com,example,www,)
instead reads:
   com,example,www)
The protocol prefix has been removed as unnecessary along with the opening parenthesis.
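As a rough illustration, the following Python sketch builds keys in the form with the closing comma and then finds a domain together with all of its subdomains in a sorted list of keys. The function names `surt_key` and `with_subdomains` are made up for this example, and the canonicalization is deliberately simplistic (lower-cased host, no scheme, port or fragment). It is not a full SURT implementation.

```python
from bisect import bisect_left
from typing import List
from urllib.parse import urlsplit


def surt_key(url: str) -> str:
    """Build a simplified SURT-style key: reversed host labels, then path.

    Simplifying assumptions: the host is lower-cased; scheme, port,
    user info and fragment are dropped; nothing else is canonicalized.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # The comma before ")" is the closing comma discussed above.
    return f"{labels},){path}"


def with_subdomains(sorted_keys: List[str], domain: str) -> List[str]:
    """Return keys for a domain and its subdomains using one prefix scan."""
    prefix = ",".join(reversed(domain.lower().split("."))) + ","
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches
```

One thing this sketch brings out, as far as I can tell: with the closing comma in place, the single prefix `com,example,` covers both `example.com` and `www.example.com` without also picking up `example2.com`. Without it, the bare domain would be stored as `com,example)` and would fall outside that prefix.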

The second field should remain the timestamp with whatever precision is available in the ARC/WARC, i.e. a w3c-iso8601 timestamp of varying accuracy, as per this proposed revision of the WARC standard (the revision is extremely likely to be included in WARC 1.1).

The third field would remain the original URL.

The fourth field should be a content digest including the hashing algorithm. Presently, this field is missing the algorithm.

The fifth field would be the WARC record type (or a special value to indicate an ARC response record). This is the most significant change as it allows us to capture additional WARC record types (such as metadata and conversion) while also handling the existing fields in a more targeted manner (e.g. response vs revisit). It might be argued that this should be the second field to facilitate searches of a specific record type. I believe that, properly implemented, this field would allow replay tools to effectively surface any content "related" to the URL currently being viewed, a problem that I know many are trying to tackle.

The next two fields would be the WARC (or ARC) filename (this is supposed to be unique) of the file containing the record and the offset at which the record exists within the (W)ARC. This is as it works currently. Some would argue for a more expressive resource locator here, but I believe that is best handled by a separate (W)ARC resolution service. Otherwise you may have to substantially rebuild your CDX index just because you moved your (W)ARCs to a new disk or service.
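To show why keeping only the filename in the index pays off, here is a very small Python sketch of such a resolution service. The class `WarcLocator` and its methods are hypothetical, and a real service would presumably sit behind a network API rather than a dictionary.

```python
from typing import Dict


class WarcLocator:
    """Hypothetical lookup from (W)ARC filename to its current location."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, filename: str, base: str) -> None:
        # Moving a file only means updating this entry; the CDX is untouched.
        self._locations[filename] = base.rstrip("/")

    def locate(self, filename: str) -> str:
        # Raises KeyError if the filename has never been registered.
        return f"{self._locations[filename]}/{filename}"
```

If a (W)ARC moves from a local disk to some storage service, only the `register` call changes. Every index line that mentions the filename stays valid.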

Lastly, there should be a single line JSON "blob" containing record type relevant additional data. For response records, this would include HTTP status code and content type which I've excluded from the "base" fields in the CDX. This part would be significantly more flexible due to the JSON format, allowing us to include optional data where appropriate etc. The full range of possible values is beyond the scope of this blog post.
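Putting the fields together, a line in the revised format might be written and read as in the Python sketch below. Several things here are my assumptions for the sake of illustration and not settled parts of the proposal: fields are separated by single spaces, the base fields themselves contain no spaces, the digest uses an `algorithm:value` form, and the JSON keys (`status`, `mime`) are placeholders.

```python
import json
from typing import Any, Dict, NamedTuple


class CdxLine(NamedTuple):
    surt: str
    timestamp: str
    original_url: str
    digest: str              # assumed form: "<algorithm>:<value>"
    record_type: str         # WARC record type, e.g. "response"
    filename: str
    offset: int
    extra: Dict[str, Any]    # record-type-specific data from the JSON blob


def format_line(rec: CdxLine) -> str:
    """Serialize a record as seven base fields plus a single-line JSON blob."""
    blob = json.dumps(rec.extra, separators=(",", ":"), sort_keys=True)
    fields = [rec.surt, rec.timestamp, rec.original_url, rec.digest,
              rec.record_type, rec.filename, str(rec.offset), blob]
    return " ".join(fields)


def parse_line(line: str) -> CdxLine:
    """Parse a line produced by format_line."""
    # Split on the first seven spaces only; the JSON blob may contain spaces.
    surt, ts, url, digest, rtype, fname, offset, blob = (
        line.rstrip("\n").split(" ", 7)
    )
    return CdxLine(surt, ts, url, digest, rtype, fname,
                   int(offset), json.loads(blob))
```

Keeping the JSON blob last means a reader that only cares about the base fields can ignore it entirely, while optional data can be added later without changing the field count.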

There is clearly more work to be done on the JSON aspect, plus some adjustments may be necessary to the base data, but I believe that, at minimum, this is the right direction to head in. Of course, this means we have to rebuild all our CDX files in order to implement this. That's a tall order, but the benefits should be more than enough to justify that one-time cost.