  #81 (permalink)  
Old February 17th, 2006, 06:36 AM
Hoping For A Cool Title
 
Join Date: May 2005
Posts: 78
Alright, well, if someone wants to set up a SourceForge project, go for it. I can't, however, since I've had bad experiences with SF.
  #82 (permalink)  
Old February 17th, 2006, 06:37 PM
Newbie Floating Down The Mistic River
 
Join Date: Feb 2006
Location: Nr Wrexham, Wales/Birmingham, England
Posts: 30
Don't know if it helps at all but I've just spent some time hacking together a crawler in PHP. It needs some work but I'm currently able to save the raw data in .txt files (including a lot of the formatting that makes it difficult to read in plain text) under folders named, for example 'Eq'.

The script also allows crawling of other wikis using the mediawiki software*.

I'll have to re-read to see exactly what specifications are needed for how the data is stored.

I don't suppose a viewer can be written that allows use of the formatting used in the wiki? It'd be nice to be able to see at least some of that formatting.

* at least it does if you don't have to be logged in to edit pages.
  #83 (permalink)  
Old February 17th, 2006, 06:54 PM
Hoping For A Cool Title
 
Join Date: May 2005
Posts: 78
All I'm doing for this is making a search engine to parse raw text. Nice job on the crawler, by the way. Too bad we can't strip raw text from it, but I'm sure we'll come up with something. Want to share code? I've got some of the search engine done, but no way of entering information into it. Does someone want to mod ROCKword for this? That'd be the balls. Also, we still need a SourceForge project for this. Anyone want to set it up? I can't, because they reject all of my submissions for some reason. Dunno why.
  #84 (permalink)  
Old February 17th, 2006, 08:37 PM
Eager Mistic Beaver
 
Join Date: Feb 2005
Posts: 220
This looks promising. I would help, but I don't know any code.
__________________
Iriver H340 - 1.29K - Got it Feb. 24, 2005 : 3:53 pm EST
2200 mah ipod battery - 60 Gig HDD
With Custom inSkin - Rockbox (H3XX Optimized Builds)
Sennheiser CX-300S
  #85 (permalink)  
Old February 19th, 2006, 04:32 AM
Newbie Floating Down The Mistic River
 
Join Date: Nov 2004
Location: Australia
Posts: 9
Just some Braggin Rights

Hi ya,
I'm a programmer (c#, php, asp, sql, rexx) and have dealt with manipulating volumes of documentation to do with the pharmaceutical industry and to a lesser extent, seismic event data. Anyway, was wondering if you need a hand?

As I understand it one of you built a crawler to scrape tag info from wiki. If you have some tag schema it would save me some time. Just let me know what you need and more importantly what you would like.

regards
A

PS. Not much experience in electronics but does anyone think it would be possible to make either usb or remote jack coexist with a keyboard?
  #86 (permalink)  
Old February 19th, 2006, 09:10 AM
Hoping For A Cool Title
 
Join Date: May 2005
Posts: 78
PAYDIRT! Yes, I'd love to have you on board. The search engine I've been hacking up is very slow. Ugh. The one who wrote the crawler was MrHiggy. I don't know anything about it, so you'll have to talk to him. If you want, I can send over the search code I have. It's in bits and pieces, though.
  #87 (permalink)  
Old February 19th, 2006, 11:50 AM
Newbie Floating Down The Mistic River
 
Join Date: Feb 2006
Location: Nr Wrexham, Wales/Birmingham, England
Posts: 30
If you want a sampling of the data my crawler has collected thus far (not a great deal but I'm working on optimising it as much as I can, and also my connection is quite slow so that doesn't help), I could provide a compressed file with the data.

I've got a lot of cleaning up of my code to do before I'm happy to release it.

It basically downloads the edit page, looks for the page data within a text area in that page, looks for [[page]] tags to find other pages to download, removes the formatting tags and then saves the data into a text file. Checks for the next page name in the 'to_crawl' array and does the same for that. Simply repeats the process in an endless loop until, hopefully, it has crawled all but any orphan pages (for which I'll write another script to create a list of all orphaned pages and put them into a 'to_crawl' file).
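For anyone following along, the loop described above can be sketched roughly like this. This is not the PHP script itself (that hasn't been released); it is a minimal Python sketch under assumptions: the function names, the regular expressions, and the `fetch_edit_page` callable are all made up for illustration, and real wiki markup has many more cases than the two handled here.

```python
import html
import re
from collections import deque

# Assumption: the page source sits inside the first <textarea> of the edit page.
TEXTAREA_RE = re.compile(r"<textarea[^>]*>(.*?)</textarea>", re.DOTALL | re.IGNORECASE)
# Matches [[Target]], [[Target|label]] and [[Target#section]]; captures Target.
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def extract_wikitext(edit_page_html):
    """Return the raw page data found in the edit page's text area."""
    match = TEXTAREA_RE.search(edit_page_html)
    return html.unescape(match.group(1)) if match else ""


def find_links(wikitext):
    """List the page names referenced by [[page]] tags."""
    return [t.strip() for t in LINK_RE.findall(wikitext) if t.strip()]


def strip_markup(wikitext):
    """Remove a small subset of formatting: link brackets and bold/italic quotes."""
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", wikitext)
    return re.sub(r"'{2,}", "", text)


def crawl(fetch_edit_page, start_title, max_pages=100):
    """Breadth-first walk over linked pages, using a 'to_crawl' queue."""
    to_crawl = deque([start_title])
    seen = {start_title}
    pages = {}
    while to_crawl and len(pages) < max_pages:
        title = to_crawl.popleft()
        wikitext = extract_wikitext(fetch_edit_page(title))
        pages[title] = strip_markup(wikitext)
        for linked in find_links(wikitext):
            if linked not in seen:
                seen.add(linked)
                to_crawl.append(linked)
    return pages
```

Passing the downloader in as `fetch_edit_page` keeps the network part separate, so the parsing can be tried on a few saved pages first. Orphan pages are never reached by this walk, which matches the limitation mentioned above.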

I know it isn't a terribly efficient system but I really wanted to write a crawler.

Does anyone know if there's an easier way of getting the data (including the formatting, as I need the page tags) without having to download all this extra HTML? It really isn't helping that half the data I'm downloading is discarded.
  #88 (permalink)  
Old February 19th, 2006, 12:37 PM
Eager Mistic Beaver
 
Join Date: Feb 2005
Posts: 310
This might help to get the data faster out of Wikipedia: http://en.wikipedia.org/wiki/Special:Export and http://download.wikimedia.org/
On another note: http://en.wikipedia.org/wiki/Wikipedia:Download
And some more: http://en.wikipedia.org/wiki/Wikipedia:TomeRaider_database Edit: I just checked the TomeRaider article out. Maybe it's an idea to make it a TomeRaider (or other handheld-optimised) ebook viewer. There are many (free) ebooks available in this format. (link)

And another edit: Maybe this reader (for Palm OS, GNU, source in C) can be ported. And there is also this one. It has more options, but I don't know what language it's programmed in, since most of the information is in Russian.
__________________
"I can't die; I don't have a life"
portfolio

Last edited by salival : February 19th, 2006 at 02:53 PM.
  #89 (permalink)  
Old February 19th, 2006, 05:58 PM
Newbie Floating Down The Mistic River
 
Join Date: Dec 2004
Posts: 24
Awesome idea, you can have my babies if you pull it off.
  #90 (permalink)  
Old February 19th, 2006, 06:41 PM
Newbie Floating Down The Mistic River
 
Join Date: Feb 2006
Location: Nr Wrexham, Wales/Birmingham, England
Posts: 30
Thanks for that, salival. Gave quite a speed boost. Looks as though I should probably download a dump and then try to get the data from that rather than risking an IP ban but that'll have to wait until I have a larger hard disk (hopefully that won't have to wait until I can fork out for a whole new system as well).

But... I'd just rewritten my script to make it far more efficient and it was working rather well until there was a huge crack, like a massive spark of electricity, and my computer died. It is now refusing to turn on.

Fortunately my H340 doesn't appear to have suffered any damage, despite being plugged into the USB and transferring my collection of John Frusciante albums at the time. Unfortunately I only have a Win98 system now so I can't get any data off (or onto) it.

So don't expect me to be able to provide anything for a while.

Good luck to the rest of you.
  #91 (permalink)  
Old February 19th, 2006, 07:04 PM
Hoping For A Cool Title
 
Join Date: May 2005
Posts: 78
Ugh. Unfortunately, I can't produce much right now because I'm preoccupied with family and friends. I will try to clean up some of my code and post it on the site. Sorry guys, looks like Lady Luck is working against us. That bitch. We should cut her.
  #92 (permalink)  
Old February 19th, 2006, 08:15 PM
Newbie Floating Down The Mistic River
 
Join Date: Feb 2006
Posts: 4
To add on to what salival said, http://download.wikimedia.org/enwiki/20060219/ looks like it'll have what we need, as soon as it finishes running.

Now, the thing we need to remember with the spidering etc. is that it'd need to be updatable fairly easily. Wikipedia is far from static.

Josh
  #93 (permalink)  
Old February 20th, 2006, 02:49 AM
Newbie Floating Down The Mistic River
 
Join Date: Nov 2004
Location: Australia
Posts: 9
Looking at salival's recent links now. I thought I set this thread to email me. Been waiting around for responses. What's your deal on the UI for the H300? One of you mentioned RockWord, which I have failed to get to do anything except write multi-colored text. No save, no open, no bold; the only thing that works in the menu bar is the palette.

Anyone know of Rockword doco?

With tag information you may be able to generate Rockword tag styles on the fly, but I'd like to know what the programming framework is for Rockbox. Also, alphabetic sequentialisation of wiki data seems a good idea, but isn't there a limit on directories and files in the H300 architecture? Has this already been addressed?
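To make the alphabetic idea concrete, here is a rough Python sketch of spreading articles over folders named after the first letters of the title, in the spirit of the 'Eq' folders mentioned earlier in the thread. Everything here is an assumption for illustration: the two-letter prefix, the file naming, and the helper names. Whatever the real per-directory limit on the H300 turns out to be, the count helper at least shows how full each folder would get.

```python
import re
from collections import Counter


def bucket_path(title, prefix_len=2):
    """Map an article title to 'Prefix/safe_title.txt' (hypothetical layout)."""
    # Replace anything that is not a plain letter or digit so the name is
    # safe on a FAT filesystem; fall back to "_" if nothing is left.
    safe = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_") or "_"
    bucket = safe[:prefix_len].capitalize()
    return f"{bucket}/{safe}.txt"


def bucket_counts(titles, prefix_len=2):
    """Count how many files would land in each folder."""
    return Counter(bucket_path(t, prefix_len).split("/", 1)[0] for t in titles)
```

If some folder ends up too full, raising `prefix_len` to 3 splits it further without changing the scheme.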

Here's an example of ripping apart huge amounts of text (60M) and presenting it very pretty. View source to get my drift.
http://www.ozepharmacy.com.au/prescr...n=sicabboc.cpi
and for a difference
http://www.ozepharmacy.com.au/epharm...ACP.CPI&cols=3
  #94 (permalink)  
Old February 20th, 2006, 04:49 AM
Newbie Floating Down The Mistic River
 
Join Date: Dec 2005
Posts: 24
Not sure if this applies to what you guys are doing with the crawler... But I'll post it just in case:

http://en.wikipedia.org/wiki/Wikipedia:Download

Quote:
Please do not use a web crawler

Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia. Our robots.txt restricts bots to one page per second and blocks many ill-behaved bots.

Sample blocked crawler email
IP address nnn.nnn.nnn.nnn was retrieving up to 50 pages per second from wikipedia.org addresses. Robots.txt has a rate limit of one per second set using the Crawl-delay setting. Please respect that setting. If you must exceed it a little, do so only during the least busy times shown in our site load graphs at http://wikimedia.org/stats/live/org....ests-hits.html . It's worth noting that to crawl the whole site at one hit per second will take several weeks. The originating IP is now blocked or will be shortly. Please contact us if you want it unblocked. Please don't try to circumvent it - we'll just block your whole IP range.
If you want information on how to get our content more efficiently, we offer a variety of methods, including weekly database dumps which you can load into MySQL and crawl locally at any rate you find convenient. Tools are also available which will do that for you as often as you like once you have the infrastructure in place. More details are available at http://en.wikipedia.org/wiki/Wikiped...abase_download.
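For anyone who still needs to fetch a handful of pages live, the Crawl-delay setting quoted above can be read and respected with Python's standard `urllib.robotparser`. This is only a sketch: the `Throttle` class and `crawl_delay_from` helper are made-up names, the robots.txt lines are passed in by hand rather than downloaded, and the one-second fallback is an assumption taken from the quote.

```python
import time
from urllib.robotparser import RobotFileParser


def crawl_delay_from(robots_lines, user_agent="*", default=1.0):
    """Return the Crawl-delay for user_agent, or `default` if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


class Throttle:
    """Make sure at least `delay` seconds pass between two requests."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        if self.last is not None:
            remaining = self.delay - (self.clock() - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()
```

Calling `wait()` before every download keeps the crawler at or below one request per `delay` seconds. For the whole site, the database dumps mentioned in the quote are still the better route.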
  #95 (permalink)  
Old February 20th, 2006, 04:51 AM
DD
♫ ♪ ♫ ♪ ♫ - misticlurker
 
Join Date: Aug 2005
Location: Hampshire, England
Posts: 2,695
Erm, I do believe it does apply! Ooops! Also, anyone else finding the smiley in that web address funny?
__________________

www.designcut.co.uk
  #96 (permalink)  
Old February 20th, 2006, 06:18 AM
Newbie Floating Down The Mistic River
 
Join Date: Feb 2006
Location: Nr Wrexham, Wales/Birmingham, England
Posts: 30
I also believe it does apply. That's why I'm going to wait until I have a large enough hard drive to store the dump on and then try and get something running on that. It may be much quicker overall than trying to download each page from the server.

I don't think my connection is fast enough to have gone above 1 page per second. That is, as long as they don't expect a 1 second wait between closing and opening connections, in which case I might've gone wrong.

I guess we'll see. Anyone else is welcome to have a go at getting the data. I'll see how quickly I can get another system up and running. Hopefully my hard disk didn't get fried...
  #97 (permalink)  
Old February 20th, 2006, 07:07 AM
Eager Mistic Beaver
 
Join Date: Feb 2005
Posts: 310
The dumps are in XML format (one single file) (more info), so it shouldn't be too difficult for someone to write a parser to export it to a desired format.
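A parser along those lines could start from something like the Python sketch below. It assumes the dump uses the usual MediaWiki export layout (`<page>` elements containing a `<title>` and a `<revision>` with a `<text>`); the function names are made up, and XML namespaces are simply ignored so the exact schema version does not matter. It streams the file, so the dump never has to fit in memory or be decompressed to disk first.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) pairs from an export XML stream, one page at a time."""
    title, text = "", ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            elem.clear()  # free the finished page's subtree


def pages_from_dump(path):
    """Stream pages straight out of a .xml.bz2 dump file."""
    with bz2.open(path, "rb") as handle:
        yield from iter_pages(handle)


def pages_from_compressed_bytes(data):
    """Same as above for an in-memory bz2 blob; handy for small test samples."""
    with bz2.open(io.BytesIO(data), "rb") as handle:
        return list(iter_pages(handle))
```

From there, each `(title, text)` pair can be written out in whatever format the viewer ends up needing.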
__________________
"I can't die; I don't have a life"
portfolio
  #98 (permalink)  
Old February 20th, 2006, 11:37 AM
Newbie Floating Down The Mistic River
 
Join Date: Dec 2004
Posts: 24
"That's why I'm going to wait until I have a large enough hard drive to store the dump on and then try and get something running on that."

I have a massive amount of free space (roughly 120GB) if you want me to download something to store.

I can't program anything, but if you develop something to extract the text, I could run it through the big block of wikiness then upload the results in a torrent.
  #99 (permalink)  
Old February 20th, 2006, 11:41 AM
DD
♫ ♪ ♫ ♪ ♫ - misticlurker
 
Join Date: Aug 2005
Location: Hampshire, England
Posts: 2,695
I also have large quantities of space left (175gb?) if storage space is needed. My internet is fairly slow though (1mbit) but I'm more than happy to download anything up to 4gb at a time. I don't have the knowledge to write a parser, but if someone else wanted to write a simple one like salival suggested I wouldn't mind testing it.
__________________

www.designcut.co.uk
  #100 (permalink)  
Old February 20th, 2006, 12:13 PM
Newbie Floating Down The Mistic River
 
Join Date: Feb 2006
Posts: 4
The main articles dump has finished. It's a .bz2 that's about 1GB. I'm sure it'd be a heck of a lot larger uncompressed though.

http://download.wikimedia.org/enwiki...ticles.xml.bz2

Josh
  #101 (permalink)  
Old February 20th, 2006, 12:21 PM
Newbie Floating Down The Mistic River
 
Join Date: Dec 2004
Posts: 24
"I also have large quantities of space left (175gb)"

Oh? So it's one-upmanship, is it?

In that case, I have five billion petabytes free and a house in the country you can use 40 weeks every year.
  #102 (permalink)  
Old February 20th, 2006, 12:26 PM
DD
♫ ♪ ♫ ♪ ♫ - misticlurker
 
Join Date: Aug 2005
Location: Hampshire, England
Posts: 2,695
No, just volunteering the use of my resources. Jesus, people just need to chill! I only quoted the amount of space the little dialogue-box told me! If you really feel that badly about it I could change it to...

"I have 175gb of free space but I'm not going to give up any of it because I'm a stingy git"


Now back on topic... 10% downloaded. Being no good at programming myself, would someone else be willing to write a simple parser, like salival suggested?
__________________

www.designcut.co.uk
  #103 (permalink)  
Old February 20th, 2006, 12:38 PM
Newbie Floating Down The Mistic River
 
Join Date: Dec 2004
Posts: 24
Oops, I didn't mean that to come off as serious at all. Sorry drippydoughnut, sarcasm is quite tricky to get right in text form, no offence meant.

Has anyone made a SourceForge project for this yet? I could do that.
  #104 (permalink)  
Old February 20th, 2006, 12:45 PM
DD
♫ ♪ ♫ ♪ ♫ - misticlurker
 
Join Date: Aug 2005
Location: Hampshire, England
Posts: 2,695
Lol, things like sarcasm don't usually come over too well on the internet. Apologies to