Not sure if this applies to what you guys are doing with the crawler......But i´ll post it just incase:
http://en.wikipedia.org/wiki/Wikipedia
ownload
Quote:
Please do not use a web crawler
Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia. Our robots.txt restricts bots to one page per second and blocks many ill-behaved bots.
[edit]
Sample blocked crawler email
IP address nnn.nnn.nnn.nnn was retrieving up to 50 pages per second from wikipedia.org addresses. Robots.txt has a rate limit of one per second set using the Crawl-delay setting. Please respect that setting. If you must exceed it a little, do so only during the least busy times shown in our site load graphs at http://wikimedia.org/stats/live/org....ests-hits.html . It's worth noting that to crawl the whole site at one hit per second will take several weeks. The originating IP is now blocked or will be shortly. Please contact us if you want it unblocked. Please don't try to circumvent it - we'll just block your whole IP range.
If you want information on how to get our content more efficiently, we offer a variety of methods, including weekly database dumps which you can load into MySQL and crawl locally at any rate you find convenient. Tools are also available which will do that for you as often as you like once you have the infrastructure in place. More details are available at http://en.wikipedia.org/wiki/Wikiped...abase_download.
|