2011-08-27

Zombies are awesome. So is Wikipedia.

Zombie movies are cool. Zombie books are even better. Many a geek fantasizes about what they'd do when the zombies come. Memorize safe locations and make foolproof plans. Practice Rule #1: Cardio. Design their own version of a "Lobo". Create their own gated community with zombie tests, and blog about it. Nobody cares that it's scientifically impossible. It's just plain fun to imagine a world that's black and white, where you're fighting a clearly defined enemy, there are no morally gray areas, and you get to put your wits to the test.

Then there are people with other apocalypse scenarios they love to play through: Nuclear Armageddon. Waterworld-scale climate change. Political and economic collapse. Civil War. Massive, uncontrollable plague. Meteoric destruction. Worldwide natural disasters. Y2K-style computer bugs or viruses. The Other Political Party Wins. Rapture. Peak Oil. Peak Food. It's all a game of survival: How will you make sure that _you_ come out looking pretty after disaster strikes?

At least, they're all fun to think about as long as you're sitting on your couch, watching TV or talking online with friends. Long-term visits to Zimbabwe, Somalia, the Congo, Argentina (circa 2000-ish), Pakistan, and Afghanistan, though, are definitely not in the Survival Game lovers' travel plans.

Still, there is a lot of sense in preparing. Disaster _can_ strike. Let's look at the past 10 years.

1) Tsunamis taking out a lot of towns and villages, leaving millions stranded and starving? Check.
2) Massive earthquakes, destroying a huge number of homes and leaving millions stranded? Check.
3) Hurricanes taking out levees and drowning towns, with government not responding for over a week? Check.

Even without the daydreams of the Survival Game lovers, disaster _can_ happen. So prepare.

Certainly, there are the basics of survival: Food. Water. Shelter. But given our Google-rotted brains and our dependency on the internet and on authorities who aren't us, there's something even more crucial to us: Information.

Post-Apocalypse lovers hoard information by the gigabyte: How to build an off-the-grid home. Medical databases. Water filtration. Farming.

There's a huge amount of knowledge out there, though, and while you can download all the PDFs you want, managing all that information is a hassle. Thankfully, a group of awesome people made an easy way to catalogue and access a huge amount of information: Wikipedia.

Sure, there aren't many details, and you'll have to come up with plans on your own, but there's a lot of information there for the Survival Gamer: lots and lots and lots of knowledge, all in an easy-to-access format. Sure, Wikipedia can often be incorrect or deliberately wrong, but by and large, it's a better jack-of-all-trades knowledge database than many of us otherwise have access to.

But unfortunately, many disaster scenarios preclude internet access. Loss of power from downed lines. Loss of internet access from cut cables. Being stuck on the road. Living in your log cabin in your mountain getaway.

Sure, you can print out those gigabytes of data on paper, but why kill a forest? (The famous printed-Wikipedia book supposedly contains only 1/10,000th of it!)

For those times, you'll want to run Wikipedia on your laptop. Power is a lot more reliable than internet: laptops usually need only around 60 watts, and a DC-to-AC inverter for your car can put out 200 watts. Survival Game lovers like to tout generators, but your car is already a power generator; it just needs a $30 inverter. Or you might have solar panels, wind power, or more. With some help from Wikipedia, you can even make your own generator.

Fortunately, Wikipedia lets you download a copy of the entire database.

Unfortunately: while it's "only" 7GB compressed, uncompressed it takes a whopping 31GB, in a barely usable XML format. And if you want to run any of the variations of the Wikipedia server software, you'll need a database taking up at least 30GB, plus a huge amount of RAM, and so on.

Luckily, somebody deciphered how to run Wikipedia off of the provided, compressed XML file, so you can keep and run Wikipedia from only 7GB! Unfortunately, it's rather a pain to set up: Perl, Python, PHP, Xapian, and Django.

My thought: Why not just one program to do all of it?

As at the beginning of every project, I set down my goals so I could review and decide on the next step:

1) I want it user friendly: I should be able to download the latest pages-articles dump, stick it in a drop directory, run _one_ program, and have it do everything else. Maybe a config file for handling port binding.

2) I want it as a web service: I have no interest in writing a whole GUI just for it.

3) I want it to run on limited resources: It's for home laptop use, for a single person. It shouldn't chew up the whole machine once it's set up and running.

4) I want fast title searching, such as Xapian serves. But I also want other features: typo fixes, accents, different word order, etc. With 11 million titles, though, that can be a hassle: grep took 12 seconds to find every instance of the word "free" in the title file. For me, this is the number to beat; a rough sketch of the kind of brute-force scan involved follows this list. (Edit: As of now, a title search takes ~1.2 seconds on my laptop.)
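Just to make that baseline concrete, here's a rough sketch (not bzwikipedia's actual code) of the brute-force approach: load the title cache into memory once, then do a case-insensitive substring scan per query. The file name and function names here are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// loadTitles reads one title per line into memory.
// "title_cache.txt" is a made-up name, not bzwikipedia's real cache file.
func loadTitles(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var titles []string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		titles = append(titles, sc.Text())
	}
	return titles, sc.Err()
}

// searchTitles does a naive case-insensitive substring match over every title.
// This is the in-memory equivalent of the 12-second grep run.
func searchTitles(titles []string, query string) []string {
	q := strings.ToLower(query)
	var hits []string
	for _, t := range titles {
		if strings.Contains(strings.ToLower(t), q) {
			hits = append(hits, t)
		}
	}
	return hits
}

func main() {
	titles, err := loadTitles("title_cache.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, "loading titles:", err)
		os.Exit(1)
	}
	fmt.Printf("%d titles match %q\n", len(searchTitles(titles, "free")), "free")
}
```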

My first attempt was Ruby: the script took 45 minutes to generate a list of titles from the segmented .bz2 files, simply using bzcat and looking for <title>...</title>. Unfortunately, it then took a very long time to read the title list into memory for fast indexing: 'top' showed it using over 3 gigabytes of RAM to hold just 345 megabytes of title data. At that point I spent some minutes looking over the code and wondering if there was a fault in it, but all it was really doing was reading key:value pairs into a massive hash, and Ruby was choking on it.

Next try: Google Go. I've been looking for an excuse to play with it more ever since I wrote a little string glob matching library as a "First Date with Go" exercise. Go has: package bzip2, which lets it read bz2 files; an extensible, and fast, HTTP server; and actual system concurrency, which Ruby doesn't have. It would take more effort to get it to work on systems other than mine, but it can be compiled for all three major platforms. And, hell, I just want to learn more of it.
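As a small illustration of the bzip2-reading piece (a sketch using the modern standard library, not the project's code; the dump filename is a placeholder), scanning the compressed dump for title tags looks roughly like this:

```go
package main

import (
	"bufio"
	"compress/bzip2"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Placeholder filename: point this at your downloaded dump.
	f, err := os.Open("pages-articles.xml.bz2")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// compress/bzip2 only decompresses, which is all this needs.
	in := bufio.NewReader(bzip2.NewReader(f))
	count := 0
	for {
		line, err := in.ReadString('\n')
		if start := strings.Index(line, "<title>"); start >= 0 {
			if end := strings.Index(line, "</title>"); end > start {
				// A real title cache would write line[start+len("<title>"):end] to disk.
				count++
			}
		}
		if err != nil {
			break // io.EOF (or a read error) ends the scan
		}
	}
	fmt.Println("titles found:", count)
}
```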

So I started the process over in Go: make it recognize the latest timestamp (in the filename: *YYYYMMDD*.xml.bz2), check whether that's the current working dump in the cached data/ dir, run bzip2recover, generate the title cache file, and so on.
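The "find the latest dump" step could look something like this sketch; the drop directory name and the filename regexp here are assumptions, not bzwikipedia's actual ones:

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// dumpStamp pulls the YYYYMMDD date out of names like
// enwiki-20110803-pages-articles.xml.bz2.
var dumpStamp = regexp.MustCompile(`(\d{8}).*\.xml\.bz2$`)

// latestDump returns the dump file in dir with the newest date stamp.
func latestDump(dir string) (string, error) {
	matches, err := filepath.Glob(filepath.Join(dir, "*.xml.bz2"))
	if err != nil {
		return "", err
	}
	best, bestStamp := "", ""
	for _, m := range matches {
		sub := dumpStamp.FindStringSubmatch(filepath.Base(m))
		if sub == nil {
			continue
		}
		// A YYYYMMDD stamp sorts correctly as a plain string.
		if sub[1] > bestStamp {
			best, bestStamp = m, sub[1]
		}
	}
	if best == "" {
		return "", fmt.Errorf("no *.xml.bz2 dumps found in %s", dir)
	}
	return best, nil
}

func main() {
	// "drop" is an assumed name for the drop directory mentioned above.
	dump, err := latestDump("drop")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("latest dump:", dump)
}
```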

Some hours later, I have something to show. It's ugly and basic, and I'll be improving on it for a long while, but if you have Google Go installed and a 7GB pages-articles.xml.bz2 downloaded, you can have Wikipedia on your machine, taking only 7GB of disk space for the articles, ~350 MB for the title cache, and around 1.5GB of RAM when running. (I'm trying to think of a way to reduce that :D) (EDIT: It now takes only 20MB when running, up to around 80MB when searching, though it still climbs to around 600MB while building the initial index, a one-time run.)

The code I'm using to convert MediaWiki markup to HTML is from the nifty InstaView.js, created by Wikipedia user Pilaf. It doesn't do everything needed, and I still need to get proper .css set up, etc., but at least it's not so raw anymore!

Feel free to grab what I've done off of GitHub: https://github.com/captdeaf/bzwikipedia

2 comments:

  1. It'll be interesting to see how this pans out. Any plans to add parallelism (e.g., goroutines)?

    I don't know the Go language at all, so I can't comment on the quality of the code, but it looks like you don't do any error handling. For example, on http://en.wikipedia.org/wiki/Wikipedia:Database_download it looks like some dumps could be incomplete even if they say they are complete ("Dump complete, 1 item failed").

    Still, this is a promising start - keep it up! Zombies might attack at any time.

  2. That's exactly what it is: Just a start. =). Half of it's learning Go, half of it is experimenting with Go's features and standard library.
