Welcome to the misticriver forums. |
|||
|
Don't know if it helps at all, but I've just spent some time hacking together a crawler in PHP. It needs some work, but I'm currently able to save the raw data in .txt files (including a lot of the formatting that makes it difficult to read in plain text) under folders named, for example, 'Eq'.
The script also allows crawling of other wikis using the MediaWiki software*. I'll have to re-read to see exactly what specifications are needed for how the data is stored. I don't suppose a viewer can be written that allows use of the formatting used in the wiki? It'd be nice to be able to see at least some of that formatting.
* At least it does if you don't have to be logged in to edit pages. |
|
|||
|
All I'm planning to do on this is make a search engine to parse raw text. Nice job on the crawler, by the way. Too bad we can't strip raw text from it, but I'm sure we'll come up with something. Want to share code? I've got some of the search engine done, but no way of entering information into it. Does someone want to mod ROCKword for this? That'd be the balls. Also, we still need a SourceForge project for this. Anyone want to set it up? I can't, because they reject all of my submissions for some reason. Dunno why.
|
|
|||
|
This looks promising. I would help, but I don't know any code.
|
|||
|
Just some Braggin Rights
Hi ya,
I'm a programmer (C#, PHP, ASP, SQL, REXX) and have dealt with manipulating volumes of documentation to do with the pharmaceutical industry and, to a lesser extent, seismic event data. Anyway, I was wondering if you need a hand? As I understand it, one of you built a crawler to scrape tag info from the wiki. If you have some tag schema it would save me some time. Just let me know what you need and, more importantly, what you would like.
regards
A
PS. Not much experience in electronics, but does anyone think it would be possible to make either the USB or the remote jack coexist with a keyboard? |
|
|||
|
PAYDIRT!
|
|
|||
|
If you want a sampling of the data my crawler has collected thus far (not a great deal but I'm working on optimising it as much as I can, and also my connection is quite slow so that doesn't help), I could provide a compressed file with the data.
I've got a lot of cleaning up of my code to do before I'm happy to release it. It basically downloads the edit page, looks for the page data within a text area in that page, looks for [[page]] tags to find other pages to download, removes the formatting tags and then saves the data into a text file. It then checks for the next page name in the 'to_crawl' array and does the same for that, repeating the process in a loop until, hopefully, it has crawled all but any orphan pages (for which I'll write another script to create a list of all orphaned pages and put them into a 'to_crawl' file).
I know it isn't a terribly efficient system but I really wanted to write a crawler. Does anyone know if there's an easier way of getting the data (including the formatting, as I need the page tags) without having to download all this extra HTML? It really isn't helping that half the data I'm downloading is discarded. |
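For anyone following along, here is a rough Python sketch of the loop described in this post. It is not the poster's PHP script: the function names, the regexes and the fake three-page site are all made up for illustration, and the download step is passed in as a function so the sketch runs without touching a real wiki.

```python
import html
import re
from collections import deque

# Hypothetical patterns: the first grabs whatever sits inside the edit page's
# <textarea>, the second grabs the target of each [[page]] or [[page|label]] tag.
TEXTAREA_RE = re.compile(r"<textarea[^>]*>(.*?)</textarea>", re.DOTALL | re.IGNORECASE)
LINK_RE = re.compile(r"\[\[([^\]\|#]+)")


def extract_wikitext(edit_page_html):
    """Return the raw wiki text found in the first textarea, or '' if none."""
    match = TEXTAREA_RE.search(edit_page_html)
    return html.unescape(match.group(1)) if match else ""


def find_links(wikitext):
    """Return the page titles referenced by [[...]] tags, in order of appearance."""
    return [t.strip() for t in LINK_RE.findall(wikitext) if t.strip()]


def crawl(start, fetch_edit_page, max_pages=100):
    """Breadth-first crawl from `start`; `fetch_edit_page(title)` returns HTML."""
    to_crawl = deque([start])
    seen = {start}          # titles already queued, so nothing is fetched twice
    pages = {}
    while to_crawl and len(pages) < max_pages:
        title = to_crawl.popleft()
        wikitext = extract_wikitext(fetch_edit_page(title))
        pages[title] = wikitext
        for link in find_links(wikitext):
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return pages


# Stand-in for the network: a tiny fake wiki held in a dict.
FAKE_SITE = {
    "Equalizer": "<html><textarea name='wpTextbox1'>See [[Eq|EQ]] and [[Codec#MP3]].</textarea></html>",
    "Eq": "<textarea>Back to [[Equalizer]].</textarea>",
    "Codec": "<textarea>No links &amp; nothing else.</textarea>",
}


def fake_fetch(title):
    return FAKE_SITE.get(title, "")


pages = crawl("Equalizer", fake_fetch)
```

The `seen` set is what stops the loop from being endless, and `max_pages` is a safety cap. Note that the link pattern also picks up namespace links such as image or category tags, which a real crawler would probably want to filter out.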
|
|||
|
This might help to get the data out of Wikipedia faster: http://en.wikipedia.org/wiki/Special:Export and http://download.wikimedia.org/
On another note: http://en.wikipedia.org/wiki/Wikipedia
And some more: http://en.wikipedia.org/wiki/Wikipedia:TomeRaider_database
Edit: I just checked out the TomeRaider article. Maybe it's an idea to make it a TomeRaider (or other handheld-optimised) ebook viewer. There are many (free) ebooks available in this format. (link)
And another edit: Maybe this reader (for Palm OS, GNU, source in C) can be ported. And there is also this one. It has more options, and I don't know what language it's programmed in, since most of the information is in Russian.
Last edited by salival : February 19th, 2006 at 02:53 PM. |
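As a small illustration of the Special:Export route: appending a page title to that URL returns the page's raw wiki text wrapped in XML, with none of the edit-page HTML around it. The sketch below only builds the URL. The helper name and the default base URL are assumptions, and it deliberately does not make any request.

```python
from urllib.parse import quote


def export_url(title, base="http://en.wikipedia.org/wiki/Special:Export"):
    """Build a Special:Export URL for one page title (hypothetical helper)."""
    # MediaWiki page URLs use underscores in place of spaces.
    slug = title.strip().replace(" ", "_")
    # Keep '/' and ':' readable; everything else unsafe gets percent-encoded.
    return f"{base}/{quote(slug, safe='/:')}"


url = export_url("John Frusciante")
```

Pointing `base` at another MediaWiki install should work the same way, as long as that wiki has the export page enabled.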
|
|||
|
Thanks for that, salival. Gave quite a speed boost. Looks as though I should probably download a dump and then try to get the data from that rather than risking an IP ban but that'll have to wait until I have a larger hard disk (hopefully that won't have to wait until I can fork out for a whole new system as well).
But... I'd just rewritten my script to make it far more efficient and it was working rather well until there was a huge crack, like a massive spark of electricity, and my computer died. It is now refusing to turn on. Fortunately my H340 doesn't appear to have suffered any damage, despite being plugged into the USB and transferring my collection of John Frusciante albums at the time. Unfortunately I only have a Win98 system now so I can't get any data off (or onto) it. So don't expect me to be able to provide anything for a while. Good luck to the rest of you. |
|
|||
|
Ugh. Unfortunately, I can't produce much right now because I'm preoccupied with family and friends. I will try to clean up some of my code and post it on the site. Sorry guys, looks like Lady Luck is working against us. That bitch. We should cut her.
|
|
|||
|
To add on to what salival said, http://download.wikimedia.org/enwiki/20060219/ looks like it'll have what we need, as soon as it finishes running.
Now, the thing we need to remember with the spidering etc. is that it'd need to be updatable fairly easily. Wikipedia is far from static.
Josh |
|
|||
|
Looking at salival's recent links now. I thought I set this thread to email me. Been waiting around for responses. What's your deal on the UI for the H300? One of you mentioned RockWord, which I have failed to get to do anything except write multi-coloured text. No save, no open, no bold; the only thing that works in the menu bar is the palette.
Anyone know of Rockword doco? With tag information you may be able to generate Rockword tag styles on the fly, but I'd like to know what the programming framework is for Rockbox. Also, alphabetic sequentialisation of wiki data seems a good idea, but isn't there a limit on directories and files in the H300 architecture? Has this already been addressed?
Here's an example of ripping apart huge amounts of text (60M) and presenting it very prettily. View source to get my drift: http://www.ozepharmacy.com.au/prescr...n=sicabboc.cpi and, for a difference, http://www.ozepharmacy.com.au/epharm...ACP.CPI&cols=3 |
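On the alphabetic layout and the files-per-directory worry, one way to check the idea before committing to it is to compute which folder each title would land in and count how full the folders get. This is only a sketch: the two-character prefix mirrors the 'Eq' folder example from earlier in the thread, while the function names, the `.txt` extension and the set of characters replaced in filenames are assumptions.

```python
import re
from collections import Counter


def bucket_for(title, width=2):
    """Folder name built from the first `width` letters or digits of a title."""
    letters = [c for c in title if c.isalnum()]
    if not letters:
        return "_"  # catch-all folder for titles with no letters or digits
    return "".join(letters[:width]).capitalize()


def path_for(title, width=2):
    """Relative path for a page, with filesystem-hostile characters replaced."""
    safe = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return f"{bucket_for(title, width)}/{safe}.txt"


def bucket_sizes(titles, width=2):
    """Count how many pages would land in each folder."""
    return Counter(bucket_for(t, width) for t in titles)


sizes = bucket_sizes(["Equalizer", "EQ", "Codec"])
```

If `bucket_sizes` shows some folder going over whatever limit the player's filesystem or firmware turns out to have, raising `width` to 3 splits the crowded folders without changing the scheme.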
|
|||
|
Not sure if this applies to what you guys are doing with the crawler... but I'll post it just in case:
http://en.wikipedia.org/wiki/Wikipedia Quote:
|
|
|||
|
I also believe it does apply. That's why I'm going to wait until I have a large enough hard drive to store the dump on and then try and get something running on that. It may be much quicker overall than trying to download each page from the server.
I don't think my connection is fast enough to have gone above 1 page per second. That is as long as they don't expect a 1 second wait between closing and opening connections, in which case I might've gone wrong. I guess we'll see. Anyone else is welcome to have a go at getting the data. I'll see how quickly I can get another system up and running. Hopefully my hard disk didn't get fried... |
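If anyone does keep crawling the live site, a simple way to stay at or under one page per second is to wait before each request until a full second has passed since the previous one. The sketch below is an illustration only: the class name is made up, and the one-second default is the figure discussed in this thread, not something checked against any official policy.

```python
import time


class Throttle:
    """Make sure successive calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous request, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Usage would be `throttle = Throttle()` once, then `throttle.wait()` immediately before each download. A slow connection that already takes longer than a second per page never sleeps at all.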
|
|||
|
The dumps are in XML format (one single file) (more info), so it shouldn't be too difficult for someone to write a parser to export it to a desired format.
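To give an idea of what such a parser could look like, here is a minimal Python sketch that streams through a dump and yields one (title, text) pair per page without loading the whole file. It assumes the usual MediaWiki export layout of `<page>` elements containing a `<title>` and a `<revision><text>`. The sample data and its namespace URI are made up for the demo, and the code ignores the namespace so it should not matter which export version the dump uses.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text or ""
            title, text = None, None
            elem.clear()  # drop the page's children so memory use stays low


# Tiny stand-in for a dump; the namespace URI here is illustrative only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Eq</title>
    <revision><text>Short for [[Equalizer]].</text></revision>
  </page>
  <page>
    <title>Codec</title>
    <revision><text></text></revision>
  </page>
</mediawiki>"""

parsed = list(iter_pages(io.BytesIO(SAMPLE)))
```

`iter_pages` accepts any binary file-like object, so an opened dump file can be passed in place of the `BytesIO` sample. The emptied `<page>` shells still hang off the root element, so a very large dump would want those removed as well.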
|
|
|||
|
"That's why I'm going to wait until I have a large enough hard drive to store the dump on and then try and get something running on that."
I have a massive amount of free space (roughly 120GB) if you want me to download something to store. I can't program anything, but if you develop something to extract the text, I could run it through the big block of wikiness and then upload the results in a torrent. |
|
|||
|
The main articles dump has finished. It's a .bz2 that's about 1GB. I'm sure it'd be a heck of a lot larger uncompressed though.
http://download.wikimedia.org/enwiki...ticles.xml.bz2
Josh |
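For those short on disk space, the dump does not have to be unpacked before it can be read: Python's standard `bz2` module can decompress it chunk by chunk. The sketch below just measures how big the data would be uncompressed, without writing any of it out. The function name and the 1 MiB chunk size are arbitrary choices, and the demo runs on a small in-memory stand-in, not the real dump.

```python
import bz2
import io


def uncompressed_size(source, chunk_size=1 << 20):
    """Return the decompressed byte count of a .bz2 file path or file object."""
    total = 0
    with bz2.open(source, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)  # only one chunk in memory at a time
            if not chunk:
                break
            total += len(chunk)
    return total


# In-memory stand-in for a dump file, so the sketch runs anywhere.
demo = io.BytesIO(bz2.compress(b"<page/>" * 1000))
size = uncompressed_size(demo)
```

The same `bz2.open(...)` stream can be handed straight to an XML parser, which means the full uncompressed file never needs to exist on disk.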
|
|||
|
"I also have large quantities of space left (175gb)"
Oh? So it's one-upmanship, is it? In that case, I have five billion petabytes free and a house in the country you can use 40 weeks every year. |
|
||||
|
No, just volunteering the use of my resources. Jesus, people just need to chill! I only quoted the amount of space the little dialogue-box told me! If you really feel that badly about it I could change it to...
"I have 175gb of free space but I'm not going to give up any of it because I'm a stingy git"
Now back on topic... 10% downloaded. Being no good at programming myself, would someone else be willing to write a simple parser, like salival suggested? |
|
|||
|
Oops, I didn't mean that to come off as serious at all. Sorry drippydoughnut, sarcasm is quite tricky to get right in text form, no offence meant.
Has anyone made a SourceForge project for this yet? I could do that. |