I need a corpus for an as-yet-unnamed project I’m working on, so I’ve been struggling with the unwieldy Wikipedia XML dump.

It’s 1.4 GB of pure XML wiki content, and a huge pain to import, since SQL dumps are no longer released directly. I had to install the database structure for MediaWiki, the software that Wikipedia runs on (in the source code it’s in maintenance/tables.sql), then run a Java program called mwdumper to create an enormous SQL file. None of that took very long; what’s taking a while now is actually importing that SQL file.
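For anyone who wants to script those three steps, here is a rough Python sketch. It is not exactly what I ran, just the same sequence written down. The jar name `mwdumper.jar`, the `--format=sql:1.5` option (the one mwdumper's documentation shows for MediaWiki 1.5-style schemas), and every path, database name, and user name are assumptions to adjust for your own setup.

```python
import subprocess
from pathlib import Path


def build_import_steps(mediawiki_dir, dump_xml, sql_out, db_name, db_user):
    """Return the three steps as (argv, stdin_path, stdout_path) tuples.

    All arguments are hypothetical placeholders for your own setup.
    """
    schema = str(Path(mediawiki_dir) / "maintenance" / "tables.sql")
    # -p makes the mysql client prompt for a password on the terminal.
    mysql = ["mysql", "-u", db_user, "-p", db_name]
    return [
        # 1. Create MediaWiki's empty tables from the schema file.
        (mysql, schema, None),
        # 2. Convert the XML dump to one big SQL file with mwdumper
        #    (jar name and format flag are assumptions, see above).
        (["java", "-jar", "mwdumper.jar", "--format=sql:1.5", dump_xml],
         None, sql_out),
        # 3. Load the generated SQL into the database (the slow part).
        (mysql, sql_out, None),
    ]


def run_steps(steps):
    """Run each step in order, wiring files to stdin/stdout as needed."""
    for argv, stdin_path, stdout_path in steps:
        stdin = open(stdin_path, "rb") if stdin_path else None
        stdout = open(stdout_path, "wb") if stdout_path else None
        try:
            # check=True stops the pipeline if any step fails.
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
        finally:
            for handle in (stdin, stdout):
                if handle:
                    handle.close()
```

Something like `run_steps(build_import_steps("mediawiki", "enwiki.xml", "enwiki.sql", "wikidb", "wikiuser"))` would run the whole thing. Writing the SQL to a file first, rather than piping mwdumper straight into mysql, keeps the conversion and the slow import as separate steps you can retry independently.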

January 12th, 2007 at 7:07 am
[...] They provide a handy piece of software called MWDumper–which doesn’t work (on WinXP–apparently it works fine on *nix). [...]
May 7th, 2007 at 11:33 am
Do you have any old dumps available that I could download? The “official” ones aren’t working and nobody even replies to my many tries… Please?
Thanks,
Vasco
July 2nd, 2007 at 10:47 pm
I tried MWDumper on WinXP about a year ago and if I remember correctly it worked then.
Also, check here for English Wikipedia data dumps:
http://download.wikimedia.org/enwiki/
The 20070527 directory has a valid copy of Wikipedia.
Simon
November 28th, 2007 at 12:49 pm
I downloaded the 5.04G “full” dump and used their xml2sql program to turn the XML file into a SQL dump. So far the file is 13.9G (for just one of the files being created). How large were your SQL dumps when they were completed and which file did you originally download from Wikipedia?