#87
February 19th, 2006, 12:50 PM
MrHiggy
Newbie Floating Down The Mistic River
 
Join Date: Feb 2006
Location: Nr Wrexham, Wales/Birmingham, England
Posts: 30
If you want a sample of the data my crawler has collected so far, I could provide it as a compressed file. There isn't a great deal yet: I'm still optimising the crawler as much as I can, and my slow connection doesn't help.

I've got a lot of cleaning up of my code to do before I'm happy to release it.

It basically downloads the edit page, finds the page data inside the text area on that page, looks for [[page]] tags to find other pages to download, removes the formatting tags and then saves the data to a text file. It then takes the next page name from the 'to_crawl' array and does the same for that. It simply repeats the process in a loop until, hopefully, it has crawled everything except orphan pages. For those I'll write another script that builds a list of all orphaned pages and puts them into a 'to_crawl' file.
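
To make the loop above concrete, here is a rough Python sketch of it. This is not the actual crawler code. The function names, the `seen` set used to avoid fetching a page twice, and the apostrophe-style emphasis stripping are all assumptions for illustration, and the download step is passed in as `fetch_edit_page` so nothing here depends on a particular wiki's URL layout.

```python
import html
import re
from collections import deque

# First <textarea> in the edit page is assumed to hold the raw page source.
TEXTAREA_RE = re.compile(r"<textarea[^>]*>(.*?)</textarea>", re.DOTALL | re.IGNORECASE)
# Matches [[page]], [[page|label]] and [[page#section]]; group 1 is the page name.
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|([^\]]*))?\]\]")


def extract_wikitext(edit_page_html):
    """Return the unescaped contents of the first <textarea>, or '' if none."""
    match = TEXTAREA_RE.search(edit_page_html)
    return html.unescape(match.group(1)) if match else ""


def find_links(wikitext):
    """Return the page names referenced by [[page]] style tags."""
    return [m.group(1).strip() for m in LINK_RE.finditer(wikitext)]


def strip_formatting(wikitext):
    """Replace link tags with their visible text and drop ''emphasis'' quotes
    (the quote syntax is an assumption about the wiki's markup)."""
    text = LINK_RE.sub(lambda m: m.group(2) or m.group(1), wikitext)
    return re.sub(r"'{2,}", "", text)


def crawl(start_page, fetch_edit_page, save_page):
    """Crawl outward from start_page until the to_crawl queue is empty.

    fetch_edit_page(name) -> HTML string and save_page(name, text) are
    supplied by the caller (hypothetical hooks, not part of any library).
    Returns the set of page names that were visited.
    """
    to_crawl = deque([start_page])
    seen = {start_page}
    while to_crawl:
        name = to_crawl.popleft()
        wikitext = extract_wikitext(fetch_edit_page(name))
        for linked in find_links(wikitext):
            if linked not in seen:
                seen.add(linked)
                to_crawl.append(linked)
        save_page(name, strip_formatting(wikitext))
    return seen
```

Orphan pages never show up in any [[page]] tag, so a loop like this cannot reach them, which is why they need a separate list.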

I know it isn't a terribly efficient system, but I really wanted to write a crawler.

Does anyone know if there's an easier way of getting the data (including the formatting, as I need the page tags) without having to download all this extra HTML? It really isn't helping that half the data I'm downloading gets discarded.