I've been running large and small scale crawls for almost 12 years. In that time I've encountered any number of unfortunate circumstances where our crawler has caused a website some amount of trouble. We always aim to run a "good" crawler. One that is respectful of website operators and doesn't cause any issues. To accomplish this we have some basic rules and limits to abide by. The reasons we sometimes do cause trouble comes down to the complicated nature of the internet.
During this time I've also been responsible for operating a number of websites. Including a few that are (by Icelandic standards) quite popular. Doing so has shown me the other side of web crawling. Turns out that a lot of crawlers are not following the basics dos and don'ts of web crawling. Including some supposedly respectable crawlers.
This seems to be getting worse each year.
Of course there are "bad" robots. Run by people who do not care about the negative impact they cause. But, even "good" robots (i.e. ones that at least seem to have good intentions) are all to frequently misbehaving.
So, here are 3 things you absolutely should abide by if you want your robot to be considered "good". Remember, it doesn't matter if your robot is a web-scale crawl or a scraping of a single site. As soon you've scripted something to automatically fetch stuff from a 3rd party server (without the 3rd party's explicit permission) you are running a crawler.
1. Be polite
Never do concurrent requests to the same site. Rate limit yourself to around 1 request every 2-5 seconds or so. More if the responses are slow. If you hit a 500 error code, wait a few minutes.
If you know (with certainty) that the site you are crawling can handle a more aggressive load (e.g. crawling google.com), it may be OK to step it up a bit. But when crawling sites where you do not have any insight, it is best to be cautious. Remember, yours is probably not the only crawler hitting them. Not to mention all the regular users.
I know, it can be infuriating when trying to scrape a large dataset. It could take weeks! But your needs do not outweigh the needs of other users. Also, crawlers are often more expensive to serve than "normal" users. They tend to systematically go through the entire site, meaning that caching strategies fail to speed up their requests.
2. Identify yourself
Your user agent string must contain enough information to allow a website operator to find out who you are, why you are crawling their site and how to get in touch with you. This isn't negotiable.
It also goes for little custom tools built to just scrape that one website. You don't have to set a user agent if you are using e.g. curl on the command line to get a single resource. But the moment you script or program something you must put identifying information in the user agent string. If all I see is
curl/7.19.7 (x86_64-redhat-linux-gnu) lib...
I'll assume it is a "bad" crawler. Same for things written in Python, Java etc. Always identify yourself.
And make sure you are responsive to feedback you may get. Something, a number of big, supposedly "good" crawl operators fail to do.
The thing is, if your robot is causing a problem and I don't know who is operating it. I will ban it. Should you wish to have that ban lifted, I will not be especially predisposed towards cooperation. You've already made a very bad first impression.
Yes, you might be able to get around a ban by getting a new IP address, but at that point you are no longer running a "good" crawler.
The bottom line is, even if you are very careful, your crawler may inadvertently cause a problem. In those scenarios. If your intentions are good, then you must make yourself available to deal with the issue and prevent it from reoccurring. If you do not do this, you are running a "bad" crawler.
3. Honor robots.txt
I'm aware of the irony here. As an operator of a legal deposit crawler, I do not respect robots.txt in most of the crawls I conduct. But you probably don't have that shield of being legally required to "get all the URLs".
A lot of websites block perfectly "legitimate" material using robots.txt (e.g. images). It is annoying. But if you are running a "good" crawler, then you have to abide by these rules, including the crawl delay. You really should also be able to parse the newer wildcard rules.
At most, you might bend the rules to get embedded content (images) necessary to render the page. Even that should be done carefully and only while adhering to the first two rules firmly.
If you feel you need content blocked by robots.txt, you must ask for it. Politely. Some sites may be happy to assist you (ours included). Others may tell you to go away. Either way, if you are running a "good" crawler, you'll have your answer.