December 5, 2014

URI agnostic deduplication on content discovered at crawl time


In my last blog post I showed that URI agnostic duplicates accounted for about 5% of all duplicates by volume (bytes) and about 11% by URI count. But that analysis was limited to looking up content digests that had been discovered in a previous crawl. What if we also deduplicated on content digests that are first discovered during the current crawl?

So I put together a little script and set it loose on the crawl logs for the domain crawl. As before, I only considered documents whose content (MIME) type does not start with "text/".
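
The sketch below shows the general idea. It is not the original script, just a minimal Python approximation: it assumes the standard Heritrix crawl.log layout (size in field 3, MIME type in field 7, content digest in field 10) and simply counts every non-"text/" URI whose digest was already seen earlier in the same log, without the further refinement of excluding duplicates that the existing index-based deduplication had already caught.

#!/usr/bin/env python3
"""Rough estimate of how much could be deduplicated on content digests
first discovered during the crawl, by scanning a Heritrix crawl.log."""
import sys

def scan(log_path):
    seen = set()     # content digests encountered so far in this crawl
    dup_uris = 0     # URIs whose digest was already seen
    dup_bytes = 0    # bytes accounted for by those URIs
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) < 10:
                continue                 # skip malformed or truncated lines
            size, mime, digest = fields[2], fields[6], fields[9]
            if mime.startswith("text/"):
                continue                 # text/* is excluded from deduplication
            if not size.isdigit() or digest == "-":
                continue                 # failed fetches and the like
            if digest in seen:
                dup_uris += 1
                dup_bytes += int(size)
            else:
                seen.add(digest)
    return dup_uris, dup_bytes

if __name__ == "__main__":
    uris, nbytes = scan(sys.argv[1])
    print("%d URIs, %.1f GiB share a digest first seen earlier in the crawl"
          % (uris, nbytes / 2**30))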

In all, this would have increased both the number of duplicates found and the number of bytes deemed duplicate by about 3.5%.

In practical terms this means I could have avoided storing 121 GiB of data. That is about 9.2% of the data volume that was subject to deduplication but deemed novel, or 3.3% of the overall data volume deemed novel and stored.

The following table shows the actual numbers. The difference between the total and 'subject to deduplication' is made up of URIs whose content type started with "text/".

                                             URIs    GiB
Total:                                106,690,792  7,127
Subject to deduplication:              33,096,219  4,791
Deemed duplicates (total):             24,522,409  3,477
 - Exact URL matches:                  18,505,138  2,941
 - Canonical URL matches:               3,273,397    353
 - Digest only matches:                 2,743,874    176
Missed digest at crawl time matches:      853,013    121
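
For anyone who wants to retrace the percentages quoted above, this small snippet derives them from the table (figures are taken from the table, percentages rounded):

# Sanity check of the percentages, using the figures from the table above.
missed_uris, missed_gib = 853_013, 121        # missed digest at crawl time
dup_uris, dup_gib       = 24_522_409, 3_477   # deemed duplicates (total)
subject_gib, total_gib  = 4_791, 7_127

print("%.1f%% more duplicates by URI count" % (100 * missed_uris / dup_uris))  # ~3.5%
print("%.1f%% more duplicate bytes"         % (100 * missed_gib / dup_gib))    # ~3.5%
novel_after_dedup = subject_gib - dup_gib   # deduplicated but deemed novel
print("%.1f%% of that novel volume" % (100 * missed_gib / novel_after_dedup))  # ~9.2%
novel_total = total_gib - dup_gib           # all data deemed novel and stored
print("%.1f%% of all stored data"   % (100 * missed_gib / novel_total))        # ~3.3%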

So there doesn't seem to be that much to gain from tackling this class of duplicates. Heritrix does offer a tool for this (which I haven't tried). I think it'll come down to how difficult this is to implement and what effect it has on performance. If it's easy and doesn't hurt performance, reducing data volume by 3-4% can add up.
