I need a corpus for an as-yet-unnamed project I’m working on, so I’ve been struggling with the unwieldy Wikipedia XML dump.

It’s 1.4 GB of pure XML wiki content. It’s a huge pain to import, however, since SQL dumps are no longer released directly. I had to install the database structure for MediaWiki (the software that Wikipedia runs on; in the source code it’s in maintenance/tables.sql), then run a Java program called mwdumper to create an enormous SQL file. None of that took very long; what’s taking a while now is actually importing that SQL file.
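For anyone who wants to retrace the steps, here is a rough sketch in Python that just assembles the three shell commands in order (schema, conversion, import). It doesn't run anything. The file names, the database name, and the user are hypothetical placeholders, and the `--format=sql:1.5` flag is the one I remember from mwdumper's documentation, so check it against the version you have.

```python
import shlex


def build_import_commands(dump_path, database, user,
                          mediawiki_dir="mediawiki",
                          mwdumper_jar="mwdumper.jar",
                          sql_out="wikipedia.sql"):
    """Return the shell commands for the import, in the order they should run.

    All default paths are hypothetical placeholders; adjust them to your setup.
    """
    q = shlex.quote  # quote each argument so odd paths survive the shell
    schema = f"{mediawiki_dir}/maintenance/tables.sql"
    return [
        # 1. Create MediaWiki's empty table structure in the target database.
        f"mysql -u {q(user)} -p {q(database)} < {q(schema)}",
        # 2. Convert the XML dump into one big SQL file with mwdumper
        #    (flag assumed from mwdumper's docs; verify for your version).
        f"java -jar {q(mwdumper_jar)} --format=sql:1.5 {q(dump_path)} > {q(sql_out)}",
        # 3. Import the generated SQL file. This is the slow part.
        f"mysql -u {q(user)} -p {q(database)} < {q(sql_out)}",
    ]
```

Calling `build_import_commands("dump.xml", "wikidb", "me")` gives back three strings you can paste into a terminal one at a time, which makes it easy to see which step is the one eating all the time.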



