August 12, 2014

JWAT WARC reading speed doubled

This is a follow-up to my last post about inefficiencies in WARC readers.

As I noted there, reading a WARC file straight through (via a GZIP reader, so decompressing it but not parsing the content) takes about 10 seconds, whereas iterating over the same file using the available Java tools (webarchive-commons, JWAT) takes about 40 seconds.

More specifically:

JWAT GzipReader: 10s
webarchive-commons WARCReader: 42s
JWAT WarcReader: 45s

The timings vary by a few hundred milliseconds between runs, so I've rounded to the nearest second. Note also that the WARC readers were run immediately after the GzipReader and would have benefited from any OS caching.
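For anyone wanting to reproduce a baseline of this kind, here is a minimal sketch of a "decompress but don't parse" timing run. It uses the JDK's own java.util.zip.GZIPInputStream rather than JWAT's GzipReader, so the numbers will not match the ones above exactly; the class name GzipReadTimer and the 8 KB chunk size are illustrative choices, not anything taken from JWAT.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/** Hypothetical timing harness; uses the JDK gzip reader, not JWAT's GzipReader. */
class GzipReadTimer {

    /** Decompresses the whole stream, discards the data, and returns the uncompressed byte count. */
    static long drain(InputStream compressed) throws IOException {
        byte[] chunk = new byte[8192]; // arbitrary chunk size for this sketch
        long total = 0;
        try (InputStream in = new GZIPInputStream(compressed)) {
            int n;
            while ((n = in.read(chunk)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long bytes;
        try (InputStream file = new FileInputStream(args[0])) {
            bytes = drain(file);
        }
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(bytes + " uncompressed bytes in " + millis + " ms");
    }
}
```

Run it with the path to a .warc.gz file as the only argument. As with the figures above, a second run over the same file will usually be faster because of OS caching.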

In my earlier post I speculated that adding a read buffer to JWAT's ByteCountingPushBackInputStream would likely improve the efficiency considerably. I proceeded to test this hypothesis. New run time:

JWAT WarcReader for GZ: 21s

So JWAT goes from being slightly slower than webarchive-commons to being twice as fast.

The class still passes all its unit tests and as far as I can tell it is still functionally equivalent.

I've forked the JWAT project to my account on GitHub. You can see the modified file here: ByteCountingPushBackInputStream.java

There are no doubt greater gains to be had, but they'll require a deeper understanding of the code than I possess at this moment.

The downside to this change is, of course, a slight uptick in the amount of memory used, as a 10K buffer is allocated every time an instance of ByteCountingPushBackInputStream is created (and that happens more often than just when wrapping the core gzip read). Still, it seems a small price to pay for the speed increase.
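To make the trade-off concrete, here is a minimal sketch of the general technique: a stream that counts consumed bytes, supports pushback, and reads from the underlying stream in 10K blocks. It is not the modified JWAT class; the name BufferedCountingInputStream, its methods, and the simplification that pushback only works on bytes still held in the current buffer are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical illustration of a buffered, byte-counting pushback stream; not the JWAT class. */
class BufferedCountingInputStream extends InputStream {
    private static final int BUFFER_SIZE = 10 * 1024; // the per-instance memory cost

    private final InputStream in;
    private final byte[] buf = new byte[BUFFER_SIZE];
    private int pos = 0;       // index of the next byte to hand out
    private int limit = 0;     // number of valid bytes in buf
    private long consumed = 0; // bytes handed to the caller, net of pushback

    BufferedCountingInputStream(InputStream in) {
        this.in = in;
    }

    /** Refills the buffer with one bulk read; returns false at end of stream. */
    private boolean fill() throws IOException {
        int n;
        do {
            n = in.read(buf, 0, buf.length);
        } while (n == 0);
        if (n == -1) {
            return false;
        }
        pos = 0;
        limit = n;
        return true;
    }

    @Override
    public int read() throws IOException {
        if (pos == limit && !fill()) {
            return -1;
        }
        consumed++;
        return buf[pos++] & 0xFF;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (pos == limit && !fill()) {
            return -1;
        }
        int n = Math.min(len, limit - pos);
        System.arraycopy(buf, pos, b, off, n);
        pos += n;
        consumed += n;
        return n;
    }

    /** Pushes one byte back; in this sketch it only works while there is room before pos. */
    public void unread(int b) throws IOException {
        if (pos == 0) {
            throw new IOException("No room to push back");
        }
        buf[--pos] = (byte) b;
        consumed--;
    }

    public long getConsumed() {
        return consumed;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

The speed-up comes from fill(): single-byte read() calls are served from the array instead of each one reaching down into the wrapped stream. The cost is the buf array, which every instance carries whether or not it ever reads much data.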

I have no doubt that improvements can also be made in webarchive-commons, but it is far less clear to me where those changes should be made.

2 comments:

  1. Hi, have you ever created a pull request on the JWAT project for this improvement? It seems like a must-have; our datasets are becoming bigger and bigger every day :-)

    1. No. The JWAT project is (or at least was) set up to disallow public forks on BitBucket, so I forked it to GitHub instead, which makes pull requests awkward. I was also unsure whether my particular approach was the best way to fix the issue. I considered it more of a proof of concept.
