September 26, 2016

3 crawlers : 1 writer

Last week I attended an IIPC sponsored hackathon with the overarching theme of 'Building better crawlers'. I can't say we built a better crawler in the room, but it did help clarify for me the likely future of archival crawling. And it involves three types of crawlers.

The first type is the bulk crawler. Heritrix is an example of this. It can crawl a wide variety of sites 'good enough' and has fairly modest hardware requirements, allowing it to scale quite well. It is, however, limited in its ability to handle scripted content (i.e. JavaScript) as all link extraction is based on heuristics.

The second type is a browser driven crawler. Still fully (mostly) automated but using a browser to render pages. Additionally, scripts can be run on rendered pages to simulate scrolling, clicking and other user behavior we may wish to capture. Brozzler (Internet Archive) is an example of this approach. This allows far better capture of scripted content, but at a price in terms of resources.

For large scale crawls, it seems likely that a hybrid approach would serve us best: have a bulk crawler cover the majority of URLs, delegating only those URLs that are deemed 'troublesome' to the more expensive browser based rendering.

The trick here is to make the two approaches work together smoothly (Brozzler, for example, does state very differently from Heritrix) and to be smart about which content goes in which bucket.
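To make the 'which bucket' question a bit more concrete, here is a rough sketch in Python of how URLs might be routed. Everything in it is an assumption made for illustration: the host list, the PageStats counts, the thresholds and the choose_crawler function are hypothetical and do not come from Heritrix or Brozzler.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical list of hosts already known to need a real browser.
BROWSER_HOSTS = {"social.example.com", "maps.example.org"}


@dataclass
class PageStats:
    """Counts gathered from a first bulk fetch of a page (hypothetical)."""
    script_tags: int
    extracted_links: int


def choose_crawler(url: str, stats: Optional[PageStats] = None) -> str:
    """Return "browser" for 'troublesome' URLs and "bulk" for everything else."""
    host = urlsplit(url).hostname or ""
    if host in BROWSER_HOSTS:
        return "browser"
    # Assumed heuristic: lots of scripts but almost no links found suggests
    # that heuristic link extraction is missing scripted content.
    if stats is not None and stats.script_tags >= 10 and stats.extracted_links <= 2:
        return "browser"
    return "bulk"


print(choose_crawler("https://social.example.com/feed"))
print(choose_crawler("https://www.example.net/app", PageStats(25, 0)))
print(choose_crawler("https://www.example.net/news", PageStats(25, 40)))
```

The point is only that the bulk crawler stays the default and the expensive path is opt-in, whether by a curated list or by signals seen on a first fetch.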

The third type of crawler is what I'll call a manual crawler. I.e. a crawler whose activities are entirely driven by a human operator. An example of this is Webrecorder.io. This enables us to fill in whatever blanks the automated crawlers leave. It can also prove useful for highly targeted collection, where curators are handpicking, not just sites, but specific individual pages. They can then complete the process, right there in the browser.

There is, however, no reason that these crawlers can not all use the same back end for writing WARCs, handling deduplication and otherwise doing post acquisition tasks. By using a suitable archiving proxy all three types of crawlers can easily add their data to our collections.

Such proxy tools already exist. It is simply a matter of making sure these crawlers use them (many already do), and that they use them consistently. I.e. that there is a nice clear API for an archiving proxy that covers the use cases of all the crawlers, one that allows them to communicate collection metadata, dictate deduplication policies etc.
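As a purely illustrative sketch of what 'communicating collection metadata' could look like, a crawler might attach a small JSON document to each request it sends through the proxy. The header name X-Archive-Meta, the field names and the policy values below are all made up for this example. They are not an existing API and not the draft discussed at the hackathon.

```python
import json

# Hypothetical set of deduplication policies a proxy might accept.
DEDUP_POLICIES = {"none", "by-url", "by-digest"}


def build_proxy_headers(collection: str, crawler: str,
                        dedup_policy: str = "by-digest") -> dict:
    """Build the extra request headers a crawler would send to the proxy."""
    if dedup_policy not in DEDUP_POLICIES:
        raise ValueError(f"unknown deduplication policy: {dedup_policy}")
    meta = {
        "collection": collection,      # which collection the capture belongs to
        "crawler": crawler,            # e.g. "bulk", "browser" or "manual"
        "dedup-policy": dedup_policy,  # how the proxy should deduplicate
    }
    # "X-Archive-Meta" is an invented header name, used only in this sketch.
    return {"X-Archive-Meta": json.dumps(meta, sort_keys=True)}


headers = build_proxy_headers("domain-crawl-2016-02", "bulk")
print(headers["X-Archive-Meta"])
```

All three crawler types could build the same header, which is what would let them share one back end.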

Now is the right time to establish this API. I think the first steps in that direction were taken at the hackathon. Hopefully, we'll have a first draft available on the IIPC GitHub page before too long.

May 30, 2016

Heritrix 3.3.0-LBS-2016-02, now in stores

A month ago I posted that I was testing a 'semi-stable' build of Heritrix. The new build is called "Heritrix 3.3.0-LBS-2016-02" as this is built for LBS's (Icelandic acronym for my library) 2016-02 domain crawl (i.e. the second one this year).

I can now report that this version has passed all my tests without any regressions showing up. I've noted two minor issues, one of which was fixed immediately (Noisy alerts about 401s without auth challenge) and the other has been around since Heritrix 3.1.0 at the least and does not affect crawling in any way (Bug in non-fatal-error log).

Additionally, I heard from Netarkivet.dk. They also tested this version with no regressions found.

I think it is safe to say that if you are currently using my previous semi-stable build (LBS-2015-01), upgrading to this version should be entirely straightforward. There are no notable API changes to worry about either. Unless, of course, you are using features that are less 'mainstream'.

You can find this version on our GitHub page. You'll have to download the source and build it for yourself.

Update: As you can see in the comments below, Netarkivet.dk has put the artifacts into a publicly accessible repository. Very helpful if you have code with dependencies on Heritrix and you don't have your own repository.

Thanks for the heads-up, Nicholas.

May 17, 2016

WARC MIME Media Type

A curious thing came up during the WARC 1.1 review process. In version 1.0, section 8 talked about what MIME media types should be used when exchanging WARCs over the Internet. During the review process, however, it was pointed out that this is actually outside the scope of the standard. 1.1 consequently drops section 8.

For now we should regard the instructions from 1.0 section 8 as best practices. But it isn't part of any official standard.
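For anyone wanting to follow that best practice in their own tooling, here is a small Python sketch. It assumes the media type in question is application/warc, which is how I recall it from 1.0, and registers it on a private mimetypes.MimeTypes instance so that nothing global is changed. The file names are just examples.

```python
import mimetypes

# Private registry, so the process-wide MIME tables are left untouched.
registry = mimetypes.MimeTypes()

# Assumption: "application/warc" is the media type recommended for WARC files.
registry.add_type("application/warc", ".warc")

# guess_type returns a (type, encoding) pair; the .gz suffix is reported
# separately as the content encoding.
plain = registry.guess_type("example.warc")
gzipped = registry.guess_type("example.warc.gz")
print(plain)
print(gzipped)
```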

That's not to say that it isn't important to have a standard set of MIME types for WARC content. Only that the WARC ISO standard isn't the place for it. This is actually something that IANA is responsible for, with specification work going through the IETF if I'm understanding this correctly.

I'm not at all familiar with this process. But it is clear that if we wish to have this standardized then going through this process is the only option. If anyone can offer further insight into how we could move this forward please get in touch.


May 12, 2016

What I learned hosting the 2016 IIPC GA/WAC

National Library of Iceland
Photo taken by GA/WAC attendee

It's been nearly a month since the 2016 IIPC General Assembly (GA) / Web Archiving Conference (WAC) in Reykjavik ended and I think I'm just about ready to try to deconstruct the experience a bit.

Plan ahead


Looking back, planning of the practical aspects (logistics) of the conference seems to have been mostly spot on. The 2015 event in Stanford had had a problem with no-shows, but this wasn't a big factor in Reykjavik. I suspect largely due to the small number of local attendees. Our expectations about the number of people who would come ended up being more or less correct (about 90 for the GA and 145 for the WAC).

A big part of why the logistics side ran smoothly was, I feel, due to advance planning. We first decided to offer to host the 2016 GA in October of 2013. We made the space reservations at the conference hotel in September 2014. Consequently, there was never any rush or panic on the logistics. Everything felt like it was happening right on schedule with very few surprises.

The IIPC SC had a meeting in Lisbon
following the 2013 iPres conference.
The idea for Reykjavik as the venue
for the 2016 IIPC GA first arose there.

Given how much work it was, despite all the careful planning, I don't care to imagine what doing this under pressure would be like. I've been advocating in the IIPC Steering Committee (SC), for years, that we should leave each GA with a firm date and place for the next two GAs and a good idea of where the one to be held in three years will be.

Nothing, in my experience hosting a GA, has changed my mind about that.

Spendthrift 


There was some discussion about whether some days/sessions should be recorded and put online. This was done in Stanford, but looking at the viewing numbers, I felt that it represented a poor use of money. Ultimately the SC agreed. Recording and editing can be quite costly. It may be worth reviewing this decision in the future. Or, perhaps something else can be used to 'open' the conference to those not physically present.

It was certainly a worthwhile experiment, but overall, I think we made the right decision not doing it in Reykjavik. Especially as the cost was quite high, even compared to Stanford.

Another thing we decided not to spend money on was an event planner. I know one was used for the 2015 GA. That one needed to be planned in a hurry and thus may have required such a service. But I can't see how it would have made things much easier in 2016 unless you're willing to hand over the responsibility for making specific choices to the planner. Such as catering etc.

True, that does take a bit of effort, but I felt that was a part of the responsibility that comes with hosting. Just handing it over to a planner wouldn't have sat right. And if I'm vetting the planner's choices, then very little effort is being saved.

I'm happy to concede, though, that this may vary very much by location and host.

Communication


Some of the communication surrounding the GA/WAC was sub-optimal. The GA page on netpreserve.org was never really up to the task, although it got better over time. Some of this was down to the lack of flexibility of the netpreserve website. Future events should have a solid communication plan at an early date. Including what gets communicated where and who is responsible for it. Perhaps it is time that each GA/WAC gets its own little website? Or perhaps not.

The dual nature of the event also caused some confusion. This led some people to only register for one of the two events etc. There was also confusion (even among the program committee!) about whether the CFP was for the WAC and GA or WAC only.

This leads us to the most important lesson I took away from this all...

Clearly separate the General Assembly and the Web Archiving Conference!


This isn't a new insight. We've been discussing what separates the 'open days' from 'member only' days for several years. In Reykjavik this was, for the first time, formally divided into two separate events. Yet, the distinction between them was less than absolutely clear.

This is, at least in part, due to how the schedule for the two events was organized. A single program committee was set up (as has been the case most years). It was quite small this year. This committee then organized the call for proposals (CFP) and arranged the schedule to accommodate the proposals that came in from the CFP.

This led to the conference over-spilling onto GA days (notably Tuesday). And it wasn't the first time that has happened. There was definitely a lack of separation in Stanford (although perhaps for slightly different reasons) and in Paris, in 2014, the effort to shoehorn in all the proposals from the CFP had a profound effect on the member-only days.

This model of a program committee and a CFP is entirely suitable for a conference and should be continued. But going forward, I think it is absolutely necessary that the program committee for the WAC have no responsibility or direct influence on the GA agenda.

To facilitate this I suggest that the organization of these two events consist of three bodies (in addition to the IIPC Steering Committee (SC) which will continue to bear overall responsibility).

  1. Logistics Team. Membership includes 1-2 people from the hosting institution, the IIPC officers, at least one SC member (if the hosting institution is an SC member this may be their representative) and perhaps one or two people with relevant experience (e.g. have hosted before etc.).
    This group is responsible for arranging space, catering, the reception, badges and other printed conference material, hotels (if needed) etc. They get their direction on the amount of space needed from the SC and the two other teams.
    This group is responsible for the event staying under budget. Which is why the treasurer is included.
  2. WAC Program Committee. The program committee would be comprised of a number of members and may include several non-members who bring notable expertise and have been engaged in this community for a long time.
    The program committee would have a reserved space on it for the hosting institution (which they may decline). There should also be a minimum of one SC member on the committee.
    The PCO (program and communications officer) would be included in all communications and assist the committee with communications with members and other prospective attendees (e.g. in sending out the CFP) but would not participate in evaluating the proposals sent in.
    The program committee would have a hand in crafting the CFP, but input on overall 'theme' would be expected from the SC.
    The program committee's primary task would be to evaluate the proposals sent in after the CFP and arrange them into a coherent schedule. The mechanism for evaluating (and potentially rejecting!) proposals needs to be established before the proposals come in! Otherwise, it will be hard to avoid the feeling that the rules are being tailored to fit specific proposals.
  3. GA Organizing Group. The PCO would be responsible for coordinating this group. Included are the SC Chair and Vice Chair, portfolio leads and leaders of working and interest groups. For the most part, each member is primarily responsible for the needs of their respective areas of responsibility.
    More on GA organization in a bit.

None of this gives the SC a free pass. As you'll note, I've mandated an SC presence in all the groups. This both gives the groups access to someone who can easily bring matters to the SC's attention and ensures that there is someone there to ensure that the direction the SC has laid out is, broadly speaking, followed.

For the WAC, the SC's biggest responsibility (aside from choosing the location and setting the budget) will be in deciding how much time it gets (two days, two and a half, three?), what themes to focus around and whether the conference should try to accomplish a specific outreach goal (and if so how).

This was, for example, the case in Stanford where the goal was to get the attention of the big tech companies. Getting Vint Cerf (a VP of Google) to be a keynote speaker was a good effort in that direction. Nothing similar was done during the Reykjavik meeting.

Keynotes


Keynotes are likely to be one of the best ways of accomplishing this. Getting a keynote speaker from a different background can help build bridges. I think this is absolutely a worthwhile path to consider.

However, unless we are hosting the WAC in their backyard (as was the case with Vint Cerf), we need to reach out to them very early and probably be prepared to cover the cost of travel. This is a choice that needs to be made very early. And, indeed, the choice of a keynote may ultimately help frame the overall theme of the conference (or not).

Hjálmar Gíslason delivering the
2016 IIPC WAC opening keynote

We had two keynotes in Reykjavik. Both were great, although neither was chosen 'strategically'. The choice of Hjálmar Gíslason was largely with my library. Allowing the hosting institution some influence on one of the keynotes may be appropriate. The other keynote, Brewster Kahle, wasn't chosen until after the CFP was in. We essentially asked him to expand his proposal into a keynote. Given the topic and Brewster's acclaim within our community, this worked out very well. We did have other candidates in mind (but no one confirmed). It was quite fortunate that such a perfect candidate fell into our laps.

It is worth planning this early as people become unavailable surprisingly far in advance.

It could also be argued that we don't need keynotes. People aren't coming to the IIPC WAC to hear some 'rock star' presenter. The event itself is the draw. But I think a couple of keynotes really help tie the event together.

One change may be worth considering. Instead of a whole day with a single track featuring both keynotes, perhaps have multiple tracks on all days but do a single track session at the start of day one and at the end of day two that accommodates the keynotes and the welcome and wrap up talks.

When we were trying to fit in all the proposals we got for Reykjavik, we considered doing this, but the idea simply arose too late. We were unable to secure the additional space required.

Again, we need to plan early.

The General Assembly should not be a conference


The GAs have changed a lot over the years. The IIPC met in Reykjavik for the first time in 2005. Back then we didn't call the meetings "GAs", they were just meetings. And they were mostly oriented around discussions. They were working meetings. And they were usually very good.

The first GA, in Paris 2007, largely retained that, despite the fact that the IIPC was already beginning to grow. There was no 'open day'. 

By 2010 in Singapore, the open day was there. But in a way that made sense and it didn't overly affect the rest of the GA. I did notice, however, a marked change in the level of engagement by the attendees during sessions.

There seemed to be more people there 'just to listen'. There had always been some of those, but I found it difficult to get discussions going, where two years prior, they'd usually been difficult to stop in order to take breaks! Not that those discussions had always been all that productive (some of it was just talk), but the atmosphere was more restrained.

At that time I was co-chair of the Harvesting Working Group (HWG) along with Lewis Crawford of the British Library. And although there was always good attendance at the HWG meetings we really struggled to engage the attendees.

Helen Hockx-Yu and Kris Carpenter, who led the Access Working Group (AWG), did a better job of this but clearly felt the same problem. Ultimately, both HWG and AWG became more GA events than working groups and have now been decommissioned.

With larger groups and especially with many there 'just to listen' it becomes much easier to just do a series of presentations. It's safer, more predictable and when you add the pressure to fit in all the material from the CFP, it becomes inevitable.

But, in the process we have lost something.

Now that the WAC is firmly established and can serve very well for the people who 'just want to listen', I think it is time we refocus the GA on being working meetings. A venue for addressing both consortium business (like the portfolio breakout sessions in Reykjavik, but with more time!) and the work of the consortium (like the OpenWayback meeting and the Preservation and Content Development Working Group meetings in Reykjavik).

This will inevitably include some presentations (but keep them to a minimum!) and there may be some panel discussions but the overall focus should be on working meetings. Where specific topics are discussed and, as much as possible, actions are decided.

That's why I nominated the people I did for the GA Organizing Group. These are the people driving the work of the consortium. They should help form the GA agenda. At least as far as their area of responsibility is concerned.

To accommodate the less knowledgeable GA attendee (e.g. new members) it may be a good idea to schedule tutorials and/or training sessions in parallel to some of these working meetings.

I believe this can build up a more engaged community. And for those not interested in participating in specific work, the WAC will be there to provide them with an opportunity to learn and connect with other members.

This won't be an easy transition. As my experience with the HWG showed, it can be difficult to engage people. But by having a conference (and perhaps training events) to divert those just looking to learn and building sessions around specific strategic goals, I think we can bring this element of 'work' back.

And if we can't, I'm not sure we have much of a future except as a yearly conference.

April 28, 2016

New 'semi-stable' build for Heritrix

Earlier this month I mentioned that I was planning on making another one of my "semi-stable" Heritrix builds, using the current master 3.3.0. This work is now underway in the Landsbokasafn Heritrix repo on GitHub, as LBS-2016-02.

Heritrix 3.3.0-LBS-2016-02

I've merged in one pull request that is still open in the IA repository, #154 Fixes for apparent build errors. Most notably, this makes it possible to have Travis-CI build and test Heritrix.

You can review the full list of changes between my last Heritrix build (2015-01) and this new one here. Here is a list of the main changes:

  • Some fixes to how server-not-modified revisit records are written (PR #118).
  • Fix outlink hoppath in metadata records (PR #119)
  • Allow dots in filenames for known good extensions (PR #120)
  • Require Maven 3.3 (PR #126)
  • Allow realm to be set by server for basic auth (PR #124)
  • Better error handling in StatisticsTracker (PR #130)
  • Fix to Java 8 Keytool (PR #129) - I wrote a post about this back in 2014.
  • Changes to how cookies are stored in Bdb (PR #133)
  • Handle multiple clauses for same user agent in robots.txt (PR #139)
  • SourceSeedDecideRule and SeedLimitsEnforcer (PR #137 and #148)
  • 'Novel' URL and byte quotas (PR #138)
  • Only submit 'checked' checkbox and radio buttons when submitting forms (PR #122)
  • Form login improvements (PR #142 and #143)
  • Improvements to hosts report (PR #123)
  • Handle SNI error better (PR #141)
  • Allow some whitespace in URLs extracted by ExtractorJS (PR #145)
  • Fix to ExtractorHTML dealing with HTML comments (PR #149)
  • Build against Java 7 (PR #152)

I've ignored all pull requests that apply primarily to the contrib package in the above. There were quite a few there, mostly (but not exclusively) relating to AMQP.

I've done some preliminary testing and everything looks good.  So far, the only issue I've noted is one that I was already aware of, about noisy alerts relating to 401s.

I'll be testing this version further over the next few weeks and welcome any additional input.

April 18, 2016

A long week is over. Thank you all.

The 2016 IIPC General Assembly and Web Archiving Conference is over. Phew!

Me, opening the Harvesting Tools
session on Tuesday

I always look forward to this event each year. It is by far the most stimulating and productive meeting/conference that I attend regularly. I believe we managed to live up to that this time.

The meeting had a wonderful Twitter back-channel that you can still review using the hashtags #iipcGA16 and #iipcWAC16.

It has been over two years since we, at the National and University Library of Iceland, offered to host the 2016 GA, and over a half year before that when the initial decision was made. Even with a 2.5 year lead time, it barely felt like enough.

I'd like to take this opportunity to thank, again, all the people who helped make last week's event a success.

First off, there is the program committee, which was very small this year, comprising, in addition to myself, (in alphabetical order) Alex Thurman (Columbia University Libraries), Gina Jones (Library of Congress), Jason Webber (IIPC PCO/British Library), Nicholas Taylor (Stanford University Libraries) and Peter Stirling (Bibliothèque nationale de France). I literally couldn't have done this without you.

I'd also like to note the contribution of our incoming PCO in this list, Olga Holownia who put in a lot of work during the conference to help make sure everything was just right for each session.

Next, I'd like to thank my colleagues at the National Library who assisted me in organizing this event and helped out during the week by handling registration, running tours etc. It was a team effort. Notable mentions to Áki Karlsson and Erla Bjarnadóttir who spent much of the week making sure that all the little details were attended to.

The Steering Committee on Friday
following the SC meeting

A big thank you to all the speakers and session moderators.

And lastly, I'd like to thank the members of the Steering Committee for being willing to entrust the single most important event of the IIPC calendar to one of the IIPC's smallest members. Indeed, doing so without the slightest hesitation.

I've learned a lot from this past week and I hope to be able to distill that experience and write it up so that next year's GA/WAC can be even better. But that will have to wait for another day and another blog post.

For now, I'll just say thanks for coming and see you all again in Lisbon for #iipcGA17 and #iipcWAC17.

April 7, 2016

Still Looking For Stability In Heritrix Releases

I'd just like to briefly follow up on a blog post I wrote last September, Looking For Stability In Heritrix Releases.

The short version is that the response I got was, in my opinion, insufficient to proceed. I'm open to revisiting the idea if that changes, but for now it is on ice.

There is little doubt in my mind that having (somewhat) regular stable releases made of Heritrix would be of notable benefit. Even better if they are published to Maven Central.

Instead, I'll continue to make my own forks from time to time and make sure they are stable for me. The last one was dubbed LBS-2015-01. It is now over a year old and a lot has changed. I expect I'll be making a new one in May/June. You can see what's changed in Heritrix in the meantime here.

I know a few organizations are also using my semi-stable releases. If you are one of them and would like to get some changes in before the next version (to be dubbed LBS-2016-02), you should try to get a PR into Heritrix before the end of April. Likewise, if you know of a serious/blocking bug in the current master of Heritrix, please bring it to my attention.