<rafa>
wolfspraul: morning.. do you know if there is some wikipedia reader effort for nn?
<wolfspraul>
rafa: we had several discussions in irc and the mailing list
<wolfspraul>
I think most people want to start with small apps that extract text-only from compressed files
<wolfspraul>
I'm in 3 other chats right now, maybe later...
<rafa>
no problem, I can check the mailing list archives
<rafa>
I was thinking the same: a text-only wikipedia reader
<rafa>
and my head is starting to want to do something right now :)
<kristianpaul>
rafa: are you aware of wikipedia dump reader?
<rafa>
kristianpaul: yes, but that is something with html, right?
<rafa>
kristianpaul: it says text only but the format is still html I guess. I want to try plain text so the whole wikipedia backup is a lot smaller
<rafa>
kristianpaul: my current problem is that all my disks are very busy.. so I do not have 300GB+ free to download the current dumps and convert them to just plain text
<kristianpaul>
rafa: not html
<kristianpaul>
xml actually
<rafa>
kristianpaul: Do you understand my idea? I want to have a plain-text wikipedia dump
<kristianpaul>
it is indexed
<kristianpaul>
xml is plain text, no?
<kristianpaul>
yes i understand
<rafa>
yes, it is plain.. but not for my idea. Example:
<kristianpaul>
well you could translate it to something else
<wolfspraul>
the key is to use/reuse the .tar.bz2 files that are already being published by the WMF
<wolfspraul>
don't just think about the algorithms, also think about the pile of data
<wolfspraul>
I think that's the #1 problem that the openzim people got wrong
<kristianpaul>
i think the wikipedia dump reader strategy is good because of the indexing
<wolfspraul>
this is a massive amount of data, and growing, and the dumps are being produced, servers running all the time
<wolfspraul>
we need to reuse that
<kristianpaul>
that saves time and lets you go straight to the right data
<wolfspraul>
most important for me is to reuse the .tar.bz2 files that are constantly being produced and offered on WMF servers already
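A minimal sketch of what reusing those files could look like in practice: stream pages straight out of the compressed pages-articles dump without ever unpacking it to disk. The dump filename and the XML namespace are assumptions; both vary per wiki and per dump version.

```python
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"       # hypothetical local copy of a WMF dump
NS = "{http://www.mediawiki.org/xml/export-0.4/}"   # namespace differs per dump version

def iter_pages(path):
    """Yield (title, wikitext) pairs one page at a time; memory stays flat."""
    with bz2.BZ2File(path) as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                yield title, text
                elem.clear()   # drop the parsed subtree, important on a small device

for title, text in iter_pages(DUMP):
    print(title, len(text))
    break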
<kristianpaul>
just moving wikipedia dump reader to C or something light is the key
<wolfspraul>
that's the only strategy that will get us to usable results fast
<rafa>
wolfspraul: okay, but what about the sizes.. I think the current dumps in xml/html formats are still huge.. and I would like to see if plain text only would not be something nicer.
<kristianpaul>
what is required, i think, is to move the qt gui to something lighter
<kristianpaul>
because underneath it is C code/libs and gnu utils like grep
<rafa>
if we could use the current dumps from WMF and convert them to compressed plain-text dumps we could perhaps have smaller dumps. Of course, we need to keep some nice indexes to work fast
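A sketch of that compressed-plain-text-plus-index idea, under made-up choices for chunk size, separator, and file naming. iter_pages() is the sketch above; strip_markup() is stubbed here (a rough version is sketched further down).

```python
import bz2
import pickle

CHUNK_ARTICLES = 64     # articles per chunk; tune for the device
SEP = "\n\x00\n"        # separator assumed not to occur in article text

def strip_markup(text):
    return text         # placeholder; see the markup-stripping sketch further down

def flush(buf, chunk_no, prefix):
    with bz2.BZ2File("%s.%05d.bz2" % (prefix, chunk_no), "w") as f:
        f.write(SEP.join(buf).encode("utf-8"))

def build(pages, prefix="wiki"):
    """Pack stripped articles into bz2 chunks; return title -> (chunk, slot)."""
    index, buf, chunk_no = {}, [], 0
    for title, text in pages:
        index[title] = (chunk_no, len(buf))
        buf.append(strip_markup(text))
        if len(buf) == CHUNK_ARTICLES:
            flush(buf, chunk_no, prefix)
            buf, chunk_no = [], chunk_no + 1
    if buf:
        flush(buf, chunk_no, prefix)
    with open(prefix + ".idx", "wb") as f:
        pickle.dump(index, f)   # a toy index; a real one would be sorted on disk
    return index
```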
<wolfspraul>
I need to check what they are offering now exactly
<wolfspraul>
I am just saying let's very carefully look at what files they offer for download
<kristianpaul>
articles
<wolfspraul>
there is _HUGE_ work (server load) necessary to create these files, and update them
<wolfspraul>
you don't want to have a server farm running somewhere only so that you can have a little more optimized format
<wolfspraul>
because in reality it will never happen :-)
<kristianpaul>
wikipedia dump reader does some indexing but it didn't take so long and doesn't require huge processing
<rafa>
I think that the markup uses a lot of space.. if articles have: <mark> text1 </mark> ... it would be nicer to have just: text1
<rafa>
without the markup
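A deliberately rough sketch of such a markup stripper, just to get a feel for how much the markup weighs. Real wikitext needs a proper parser; these regexes only cover templates, tags, links and bold/italic, and keep the contents of stripped tags.

```python
import re

def strip_markup(text):
    text = re.sub(r"(?s)\{\{.*?\}\}", "", text)                    # {{templates}} (non-nested)
    text = re.sub(r"<[^>]+>", "", text)                            # <ref>, <br/> and other tags
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                              # ''italic'' / '''bold'''
    return text

sample = "'''Foo''' is a [[metasyntactic variable|placeholder]] term.<ref>citation</ref>"
plain = strip_markup(sample)
print(len(sample), "->", len(plain), ":", plain)
```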
<kristianpaul>
we can save space but looking up articles is still a problem
<rafa>
wolfspraul: also, we would not be rebuilding the whole dumps often, I think that once a year is okay.. nobody would like to install new dump versions every week on SD
<kristianpaul>
and fast access time is important i think
<rafa>
kristianpaul: sure.. I would like to save space while staying as fast as now, or better
<rafa>
:)
<kristianpaul>
i don't have time now to migrate dump reader to C/C++
<kristianpaul>
but it is a task worth doing i think
<wolfspraul>
rafa: I think people want updates
<wolfspraul>
and more importantly, there are many wikis, not just English Wikipedia
<kristianpaul>
yeap
<kristianpaul>
fetching the French wikipedia dump now
<wolfspraul>
so every step where you think you'll 'quickly' run over whatever file the WMF has for download will easily turn into a _MASSIVE_ amount of work
<wolfspraul>
I have done this stuff as part of my work for Wikireader.
<kristianpaul>
anybody can try wikipedia dump reader on their nano, i can't right now because my uboot is not the latest one
<wolfspraul>
so I would really go to great lengths now to avoid doing any additional processing on top of the files being generated already on the hundreds of WMF servers
<kristianpaul>
and jlime doesn't have pyQt support, does it?
<wolfspraul>
instead I would focus my work on building indices or whatever is necessary to _complement_ the WMF files, then write a little reader
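And the matching little reader for the chunked layout sketched earlier: look the title up in the index, decompress only that chunk. Loading the whole pickled index per call is for brevity; on the NanoNote one would keep a sorted on-disk index instead.

```python
import bz2
import pickle

SEP = "\n\x00\n"   # must match the separator used when building the chunks

def fetch(title, prefix="wiki"):
    """Return one article's plain text, decompressing only its chunk."""
    with open(prefix + ".idx", "rb") as f:
        index = pickle.load(f)          # title -> (chunk, slot)
    chunk_no, slot = index[title]
    with bz2.BZ2File("%s.%05d.bz2" % (prefix, chunk_no)) as f:
        return f.read().decode("utf-8").split(SEP)[slot]

print(fetch("Ben NanoNote")[:200])      # hypothetical article title
```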
<kristianpaul>
agree
<kristianpaul>
the index works, but not as well as we would wish
<wolfspraul>
collectively, the wmf has at least a few hundred GB of text data for download now :-)
<wolfspraul>
without pictures of course (wikimedia commons)
<wolfspraul>
and it's growing fast
<kristianpaul>
too much
<kristianpaul>
is HUGE
<kristianpaul>
terabytes
<kristianpaul>
.....
<wolfspraul>
yes, so if we want something functioning for the NN soon, we need to build something that reuses those files
<wolfspraul>
everything else will end like openzim :-)
<rafa>
wolfspraul: okay, current dumps and better indexes sound great. I would just do the plain-text dump to see how much it would save.. just to know that detail.
<wolfspraul>
(don't envy those guys, I know they are still trying and working...)
<rafa>
kristianpaul: pyqt, no idea; if it is not in the repo I can check the current OE sources
<wolfspraul>
rafa: go and do that :-) I have shuffled gigabytes of WMF data for months...
<wolfspraul>
into sql, out of sql, into filesystems that crashed under the load, etc... :-)
<kristianpaul>
:(
<kristianpaul>
yes
<rafa>
wolfspraul: For those tests I need a lot of space on disks.. which I would need to gather right now :(
<kristianpaul>
openzim tends to be unmaintainable
<wolfspraul>
kristianpaul: agree
<wolfspraul>
rafa: your space problems are already starting?
<rafa>
wolfspraul: hahaha.. such fun stuff, eh?.. it breaks everything :) filesystems.. indexes.. sql tables
<rafa>
cool
<wolfspraul>
I had a small cluster of 5 quad-core machines cranking on this stuff, with 4 GB RAM each etc.
<rafa>
wolfspraul: yes, the space problem started when I was thinking of doing something with the wikipedia reader for jlime
<wolfspraul>
aha :-)
<wolfspraul>
so that's why I am making my point here - just from my perspective - reusing the WMF files is the way to go
<wolfspraul>
let's build on top of the work they have already done
<kristianpaul>
rafa: you know python?
<wolfspraul>
(and continue to do, I believe they firmly stand behind those dump files and have full-time people on it)
<rafa>
wolfspraul: but really, I am thinking about which application would be a real killer app for the nn device.. so far people are asking what the best application for the nn is, in order to decide whether to buy one
<rafa>
and I would like to have many different kinds of applications ready
<wolfspraul>
wmf data would be awesome
<rafa>
to try to understand the best final market
<rafa>
for nn
<wolfspraul>
also openstreetmap data, similar idea but very different data format and technical challenge
<wolfspraul>
I would love to have a little map in my pocket
<wolfspraul>
doesn't matter that it doesn't have GPS
<rafa>
kristianpaul: just enough to fix some python code :)
<wolfspraul>
if it has a good index I can just enter a street name and woops, I have a (color) map of my surroundings
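A sketch of that street-name lookup, assuming a local OSM XML extract: one pass remembers a coordinate for every named way, after which finding a street is a dictionary lookup. The extract filename and the street name are examples.

```python
import xml.etree.ElementTree as ET

def build_street_index(osm_path):
    """Map street name -> (lat, lon) of the first node of a way with that name."""
    nodes, streets = {}, {}
    for _, elem in ET.iterparse(osm_path):
        if elem.tag == "node":
            nodes[elem.get("id")] = (float(elem.get("lat")), float(elem.get("lon")))
            elem.clear()                 # keep memory bounded on the device
        elif elem.tag == "way":
            name = None
            for tag in elem.findall("tag"):
                if tag.get("k") == "name":
                    name = tag.get("v")
            ref = elem.find("nd")        # first node reference of the way
            if name and ref is not None and ref.get("ref") in nodes:
                streets[name] = nodes[ref.get("ref")]
            elem.clear()
    return streets

streets = build_street_index("city.osm")   # hypothetical local extract
print(streets.get("Karl-Marx-Allee"))      # -> (lat, lon) to center the map on
```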
<rafa>
wolfspraul: that idea sounds really cool as well
<rafa>
wolfspraul: I am adding it to the jlime TODO list :)
<wolfspraul>
yes, nanomap is a start, and kristianpaul is hacking on GPS modules
<david_>
it's very hard to install debian-lenny on my nano
<Guest90854>
hello. After flashing with debian, I get: "VFS: Unable to mount root fs on unknown-block(0,0)"...
<Guest90854>
help!
<urandom_>
Guest90854: seems like there is something wrong with your rootfs, uhm, I do not know much about debian; what about reflashing your rootfs once again?