This article gives instructions for loading Wikipedia articles into ElasticSearch. I did this on Windows, but all of these steps should work on any Java-friendly platform.
- Download ElasticSearch
- Download stream2es
- Download Wikipedia articles
- Start ElasticSearch
- Run stream2es
Download ElasticSearch and unzip it into a folder of your choice.
Download Wikipedia articles
(It's over 12GB, so be sure you have plenty of disk space.)
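Before starting the download, a quick free-space check can save a failed transfer. The following is only a minimal sketch in Python: the 12 GB figure comes from the note above, while the 1.5x safety factor and the `has_room` helper are my own assumptions, not part of any tool mentioned here. The index that ElasticSearch builds later needs space of its own on top of this.

```python
import shutil

# Size of the compressed dump as stated in the article (a bit over 12 GB).
DUMP_SIZE_GB = 12


def has_room(free_bytes, needed_gb, headroom=1.5):
    """Return True if free_bytes covers needed_gb times a safety factor.

    The headroom factor is an arbitrary assumption, not a measured value.
    """
    needed_bytes = needed_gb * headroom * 1024 ** 3
    return free_bytes >= needed_bytes


if __name__ == "__main__":
    # "." is the drive holding the current folder; point it at your download folder.
    free = shutil.disk_usage(".").free
    print("Free space: %.1f GB" % (free / 1024 ** 3))
    print("Enough room for the dump:", has_room(free, DUMP_SIZE_GB))
```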
I'm on Windows, so I opened a command window and ran this:
That starts up your local ElasticSearch instance at localhost:9200
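To confirm the instance is really answering at localhost:9200 before moving on, you can request the root URL and read the JSON it returns. This is a hedged sketch in Python rather than anything ElasticSearch ships: the `summarize` and `check` helpers are names I made up, and the code assumes the root response carries `cluster_name` and `version.number` fields.

```python
import json
import urllib.error
import urllib.request

ES_URL = "http://localhost:9200"  # address from the article


def summarize(info):
    """Build a one-line summary from the JSON the root endpoint returns."""
    version = info.get("version", {}).get("number", "unknown")
    cluster = info.get("cluster_name", "unknown")
    return "cluster=%s version=%s" % (cluster, version)


def check(url=ES_URL, timeout=5):
    """Return a summary string, or None if nothing usable answers at url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return summarize(json.loads(resp.read().decode("utf-8")))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply.
        return None


if __name__ == "__main__":
    print(check() or "ElasticSearch is not responding yet")
```

If the script reports that nothing is responding, give the instance a little more time to start and try again.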
- Move the stream2es file to your ElasticSearch bin folder. I put stream2es here: C:\elasticsearch-1.5.2\bin\
- Move the Wikipedia archive (enwiki-latest-pages-articles.xml.bz2) to your ElasticSearch bin folder too.
- Run the stream2es java file:
C:\elasticsearch-1.5.2\bin>java -jar stream2es wiki --target http://localhost:9200/mywiki --log debug --source /enwiki-latest-pages-articles.xml.bz2
- You can change the "mywiki" to whatever you want your specific ElasticSearch index name to be.
- I had some trouble getting stream2es to find my Wikipedia archive path on Windows, but putting a / in front of the file name worked.
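If you expect to rerun the load with a different index name, it can help to assemble the command in a small script. The sketch below is in Python and only rebuilds the command line shown above; the `stream2es_command` helper is a hypothetical name of mine, and the flags are exactly the ones used in this article, nothing more.

```python
import subprocess


def stream2es_command(index="mywiki",
                      dump="/enwiki-latest-pages-articles.xml.bz2",
                      host="http://localhost:9200"):
    """Return the stream2es command from the article as an argument list."""
    return [
        "java", "-jar", "stream2es", "wiki",
        "--target", "%s/%s" % (host, index),  # index name is the last URL segment
        "--log", "debug",
        "--source", dump,                     # leading / is what worked on Windows
    ]


if __name__ == "__main__":
    cmd = stream2es_command(index="mywiki")
    print(" ".join(cmd))
    # Uncomment to start the load from the bin folder used in the article:
    # subprocess.call(cmd, cwd=r"C:\elasticsearch-1.5.2\bin")
```

Passing the arguments as a list avoids quoting problems if your dump path contains spaces.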
I ran all of this locally on my Windows desktop, and it took 6-8 hours. It appeared to lock up near the end, but it did eventually exit.
Now, you should have over 16 million Wikipedia articles loaded into your local ElasticSearch index. Enjoy.
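One way to check how many documents actually landed in the index is to ask ElasticSearch for a count. This is a minimal Python sketch, assuming the index is named mywiki as in the command above and that the `_count` endpoint answers with a JSON body containing a `count` field; the helper names are mine.

```python
import json
import urllib.request


def count_url(index="mywiki", host="http://localhost:9200"):
    """Build the URL of the count endpoint for the given index."""
    return "%s/%s/_count" % (host, index)


def parse_count(body):
    """Extract the document count from the JSON text of a count response."""
    return int(json.loads(body)["count"])


def count_documents(index="mywiki", host="http://localhost:9200"):
    """Ask the local instance how many documents the index holds."""
    with urllib.request.urlopen(count_url(index, host), timeout=30) as resp:
        return parse_count(resp.read().decode("utf-8"))


if __name__ == "__main__":
    try:
        print("Documents in index:", count_documents())
    except OSError as err:
        # Raised when the instance is not running or the index does not exist.
        print("Could not get a count:", err)
```

Running it while stream2es is still working also gives a rough progress indicator during the long load.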
I plan to write future articles on using this Wikipedia data for machine learning, natural language processing, and topic clustering.