May 30, 2016

Heritrix 3.3.0-LBS-2016-02, now in stores

A month ago I posted that I was testing a 'semi-stable' build of Heritrix. The new build is called "Heritrix 3.3.0-LBS-2016-02" as this is built for LBS's (Icelandic acronym for my library) 2016-02 domain crawl (i.e. the second one this year).

I can now report that this version has passed all my tests without any regressions showing up. I've noted two minor issues, one of which was fixed immediately (Noisy alerts about 401s without auth challenge) and the other has been around since Heritrix 3.1.0 at the least and does not affect crawling in any way (Bug in non-fatal-error log).

Additionally, I heard from They also tested this version with no regressions found.

I think it is safe to say that if you are currently using my previous semi-stable build (LBS-2015-01), upgrading to this version should be entirely straightforward. There are no notable API changes to worry about either. Unless, of course, you are using features that are less 'mainstream'.

You can find this version on our Github page. You'll have to download the source and build it for yourself.

Update As you can see in the comments below, has put the artifacts into a publicly accessible repository. Very helpful if you have code with dependencies on Heritrix and you don't have your own repository.

Thanks for the heads-up, Nicholas.

May 17, 2016

WARC MIME Media Type

A curious thing came up during the WARC 1.1 review process. In version 1.0, section 8 talked about what MIME media types should be used when exchanging WARCs over the Internet. During the review process, however, it was pointed out that this is actually outside the scope of the standard. 1.1 consequently drops section 8.

For now we should regard the instructions from 1.0 section 8 as best practices. But it isn't part of any official standard.

That's not to say that it isn't important to have a standard set of MIME types for WARC content. Only that the WARC ISO standard isn't the place for it. This is actually something that IANA is responsible for, with specification work going through the IETF if I'm understanding this correctly.

I'm not at all familiar with this process. But it is clear that if we wish to have this standardized then going through this process is the only option. If anyone can offer further insight into how we could move this forward please get in touch.

May 12, 2016

What I learned hosting the 2016 IIPC GA/WAC

National Library of Iceland
Photo taken by GA/WAC attendee
It's been nearly a month since the 2016 IIPC General Assembly (GA) / Web Archiving Conference (WAC) in Reykjavik ended and I think I'm just about ready to try to deconstruct the experience a bit.

Plan ahead

Looking back, planning of the the practical aspects - logistics - of the conference seem to have been mostly spot on. The 2015 event in Stanford had had a problem with no-shows, but this wasn't a big factor in Reykjavik. I suspect largely due to the small number of local attendees. Our expectations about the number of people who would come ended up being more or less correct (about 90 for the GA and 145 for the WAC).

A big part of why the logistics side ran smoothly was, I feel, due to advance planning. We first decided to offer to host the 2016 GA in October of 2013. We made the space reservations at the conference hotel in September 2014. Consequently, there was never any rush or panic on the logistics. Everything felt like it was happening right on schedule with very few surprises.

The IIPC SC had a meeting in Lisbon
following the 2013 iPres conference.
The idea for Reykjavik as the venue
for the 2016 IIPC GA first arose there.

Given how much work it was, despite all the careful planning, I don't care to imagine what doing this under pressure would be like. I've been advocating in the IIPC Steering Committee (SC), for years, that we should leave each GA with a firm date and place for the next two GAs and a good idea of where the one to be held in three years will be be.

Nothing, in my experience hosting a GA, has changed my mind about that.


There was some discussion about whether some days/sessions should be recorded and put online. This was done in Stanford, but looking at the viewing numbers, I felt that it represented a poor use of money. Ultimately the SC agreed. Recording and editing can be quite costly. It may be worth reviewing this decision in the future. Or, perhaps something else can be used to 'open' the conference to those not physically present.

It was certainly a worthwhile experiment, but overall, I think we made the right decision not doing it in Reykjavik. Especially as the cost was quite, even compared to Stanford.

Another thing we decided not to spend money on was an event planner. I know one was used for the 2015 GA. That one needed to be planned in a hurry and thus may have required such a service. But I can't see how it would have made things much easier in 2016 unless you're willing to hand over the responsibility for making specific choices to the planner. Such as catering etc.

True, that does take a bit of effort, but I felt that was a part of the responsibility that comes with hosting. Just handing it over to a planner wouldn't have sat right. And if I'm vetting the planners choices, then very little effort is being saved.

I'm happy to concede, though, that this may vary very much by location and host.


Some of the communication surrounding the GA/WAC was sub-optimal. The GA page on was never really up to the task, although it got better over time. Some of this was down to the lack of flexibility of the netpreserve website. Future events should have a solid communication plan at an early date. Including what gets communicated where and who is responsible for it. Perhaps it is time that each GA/WAC gets its own little website? Or perhaps not.

The dual nature of the event also caused some confusion. This led some people to only register for one of the two events etc. There was also confusion (even among the program committee!) about whether the CFP was for the WAC and GA or WAC only.

This leads us to the most important lesson I took away from this all...

Clearly separate the General Assembly and the Web Archiving Conference!

This isn't a new insight. We've been discussing what separates the 'open days' from 'member only' days for several years. In Reykjavik this was, for the first time, formally divided into two separate events. Yet, the distinction between them was less than absolutely clear.

This is, at least in part, due to how the schedule for the two events was organized. A single program committee was set up (as has been the case most years). It was quite small this year. This committee then organized the call for proposals (CFP) and arranged the schedule to accommodate the proposals that come in from the CFP.

This led to the conference over-spilling onto GA days (notably Tuesday). And it wasn't the first time that has happened. There was definitely a lack separation in Stanford (although perhaps for slightly different reasons) and in Paris, in 2014, the effort to shoehorn in all the proposals from the CFP had a profound effect on the member-only days.

This model of a program committee and a CFP is entirely suitable for a conference and should be continued. But going forward, I think it is absolutely necessary that the program committee for the WAC have no responsibility or direct influence on the GA agenda.

To facilitate this I suggest that the organization of these two events consist of three bodies (in addition to the IIPC Steering Committee (SC) which will continue to bear overall responsibility).

  1. Logistics Team. Membership includes 1-2 people from the hosting institution, the IIPC officers, at least one SC member (if the hosting institution is an SC member this may be their representative) and perhaps one or two people with relevant experience (e.g. have hosted before etc.).
    This group is responsible for arranging space, catering, the reception, badges and other printed conference material, hotels (if needed) etc. They get their direction on the amount of space needed from the SC and the two other teams.
    This group is responsible for the event staying under budget. Which is why the treasurer is included.
  2. WAC Program Committee. The program committee would be comprised of a number of members and may include several non-members who bring notable expertise and have been engaged in this community for a long time.
    The program committee would have a reserved space on it for the hosting institution (which they may decline). There should also be a minimum of one SC member on the committee.
    The PCO (program and communications officer) would be included in all communications and assist the committee with communications with members and other prospective attendees (e.g. in sending out the CFP) but would not participate in evaluating the proposals sent in.
    The program committee would have a hand in crafting the CFP, but input on overall 'theme' would be expected from the SC.
    The program committee's primary task would be to evaluate the proposals sent in after the CFP and arranging them into a coherent schedule. The mechanism for evaluating (and potentially rejecting!) proposals needs to be established before the CFP's come in! Otherwise, it will be hard to avoid the feeling that they are being tailored to fit specific proposals.
  3. GA Organizing Group. The PCO would be responsible for coordinating this group. Included are the SC Chair and Vice Chair, portfolio leads and leaders of working and interest groups. For the most part, each member is primarily responsible for the the needs of their respective areas of responsibility.
    More on GA organization in a bit.
None of this gives the SC a free pass. As you'll note, I've mandated an SC presence in all the groups. This both gives the groups access to someone who can easily bring matters to the SC's attention and ensures that there is someone there to ensure that the direction the SC has laid out is, broadly speaking, followed.

For the WAC, the SC's biggest responsibility (aside from choosing the location and setting the budget) will be in deciding how much time it gets (two days, two and a half, three?), what themes to focus around and whether the conference should try to accomplish a specific outreach goal (and if so how).

This was, for example, the case in Stanford where the goal was to get the attention of the big tech companies. Getting Vint Cerf (a VP of Google) to be a keynote speaker was a good effort in that direction. Nothing similar was done during the Reykjavik meeting.


Keynotes are likely to be one of the best ways of accomplishing this. Getting a keynote speaker from a different background can help build bridges. I think this is absolutely a worthwhile path to consider.

However, unless we are hosting the WAC in their backyard (as was the case with Vint Cerf), we need to reach out to them very early and probably be prepared to cover the cost of travel. This is a choice that needs to be made very early. And, indeed, the choice of a keynote may ultimately help frame the overall them of the conference (or not).

Hjálmar Gíslason delivering the
2016 IIPC WAC opening keynote
We had two keynotes in Reykjavik. Both were great, although neither was chosen 'strategically'. The choice of Hjálmar Gíslason was largely with my library. Allowing the hosting institution some influence on one of the keynotes may be appropriate. The other keynote, Brewster Kahle, wasn't chosen until after the CFP was in. We essentially asked him to expand his proposal into a keynote. Given the topic and Brewster's acclaim within our community, this worked out very well. We did have other candidates in mind (but no one confirmed). It was quite fortunate that such a perfect candidate fell into our laps.

It is worth planning this early as people become unavailable surprisingly far in advance.

It could also be argued that we don't need keynotes. People aren't coming to the IIPC WAC to hear some 'rock star' presenter. The event itself is the draw. But I think a couple of keynotes really help tie the event together.

One change may be worth considering. Instead of a whole day with a single track featuring both keynotes, perhaps have multiple tracks on all days but do a single track session at the start of day one and at the end of day two that accommodates the keynotes and the welcome and wrap up talks.

When we were trying to fit in all the proposals we got for Reykjavik, we considered doing this, but the idea simply arose too late. We were unable to secure the additional space required.

Again, we need to plan early.

The General Assembly should not be a conference

The GAs have changed a lot over the years. The IIPC met in Reykjavik for the first time in 2005. Back then we didn't call the meetings "GAs", they were just meetings. And they mostly oriented around discussions. They were working meetings. And they were usually very good.

The first GA, in Paris 2007, largely retained that, despite the fact that the IIPC was already beginning to grow. There was no 'open day'. 

By 2010 in Singapore, the open day was there. But in a way that made sense and it didn't overly affect the rest of the GA. I did notice, however, a marked change in the level of engagement by the attendees during sessions.

There seemed to be more people there 'just to listen'. There had always been some of those, but I found it difficult to get discussions going, where two years prior, they'd had usually been difficult to stop in order to take breaks! Not that those discussions had always been all that productive (some of it was just talk), but the atmosphere was more restrained.

At that time I was co-chair of the Harvesting Working Group (HWG) along with Lewis Crawford of the British Library. And although there was always good attendance at the HWG meetings we really struggled to engage the attendees.

Helen Hockx-Yu and Kris Carpenter, who led the Access Working Group (AWG) did a better job of this but clearly felt the same problem. Ultimately, both HWG and AWG became more of GA events than working groups and have now been decommissioned.

With larger groups and especially with many there 'just to listen' it becomes much easier to just do a series of presentations. Its safer, more predictable and when you add the pressure to fit in all the material from the CFP, it becomes inevitable.

But, in the process we have lost something.

Now that the WAC is firmly established and can serve very well for the people who 'just want to listen', I think it is time we refocus the GA on being working meetings. A venue for addressing both consortium business (like the portfolio breakout sessions in Reykjavik, but with more time!) and the work of the consortium (like the OpenWayback meeting and the Preservation and Content Development Working Group meetings in Reykjavik).

This will inevitably include some presentations (but keep them to a minimum!) and there may be some panel discussions but the overall focus should be on working meetings. Where specific topics are discussed and, as much as possible, actions are decided.

That's why I nominated the people I did for the GA Organizing Group. These are the people driving the work of the consortium. They should help form the GA agenda. At least as far as their area of responsibility is concerned.

To accommodate the less knowledgeable GA attendee (e.g. new members) it may be a good idea to schedule tutorials and/or training sessions in parallel to some of these working meetings.

I believe this can build up a more engaged community. And for those not interested in participating in specific work, the WAC will be there to provide them with an opportunity to learn and connect with other members.

This wont be an easy transition. As my experience with the HWG showed, it can be difficult to engage people. But by having a conference (and perhaps training events) to divert those just looking to learn and building sessions around specific strategic goals, I think we can bring this element of 'work' back.

And if we can't, I'm not sure we have much of a future except as a yearly conference.