Most of my work online involves checking mail, browsing
forums for answers, reading Wikipedia for information,
or social networking. With the LAN cuts introduced in the IITs, it is
difficult for a student to access information after 12:10 unless they
break out somehow. In an earlier post, I explained, with references to
my code, how to download parts of Wikipedia; I thought it would be
helpful to download the whole of Wikipedia onto your computer. In this
post I will show you how Wikipedia / Stack Overflow / Gmail can be
downloaded for offline use.
Wikipedia
Requirements:
- LAMP (Linux, Apache, MySQL, PHP)
- Around 30 GB of free space in the partition where MySQL stores its data (the root partition, in my case)
- 7 GB of free Internet download
- 3 days of free time
Wikipedia dumps can be downloaded from the
Wikipedia site
in XML format, compressed with 7-Zip. This is around 6 GB compressed
and expands to around 25 GB of XML pages. It doesn’t include any images.
This
page shows how one can extract plain-text articles from the dump and construct corpora from them. Apart from this, a static
HTML dump
can also be downloaded from the Wikipedia page (wikipedia-en-html.tar.7z),
and this version has images in it. The compressed version is about 15 GB
and expands to over 200 GB because of all the images.
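The corpus-construction approach linked above can be sketched in a few lines of Python: stream the XML dump with `iterparse` so the 25 GB file is never loaded into memory, and pull out each page’s title and wikitext. This is a minimal sketch, not the linked tool; the tiny sample document is invented for illustration, and the real dump wraps everything in the MediaWiki export namespace, which the tag handling below strips.

```python
import io
import xml.etree.ElementTree as ET

def extract_pages(source):
    """Stream (title, wikitext) pairs from a MediaWiki XML dump
    without loading the whole file into memory."""
    for _, elem in ET.iterparse(source):
        if elem.tag.rsplit('}', 1)[-1] != 'page':  # strip any XML namespace
            continue
        title, text = None, ''
        for child in elem.iter():
            tag = child.tag.rsplit('}', 1)[-1]
            if tag == 'title':
                title = child.text
            elif tag == 'text':
                text = child.text or ''
        yield title, text
        elem.clear()  # free the subtree we just processed

# Invented one-page sample in the shape of a dump file.
sample = """<mediawiki>
  <page>
    <title>Example</title>
    <revision><text>'''Example''' is a page.</text></revision>
  </page>
</mediawiki>"""

for title, text in extract_pages(io.StringIO(sample)):
    print(title, '->', text)
```

Note that mwdumper (below) is still the right tool for loading the dump into MySQL; a generator like this is only useful for building text corpora directly from the XML.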
The static HTML dump can simply be extracted, and the required HTML
file opened to view the content. In case you download the XML dump,
there is more to do – you have to
extract the articles and create your customized offline Wikipedia with
the following steps.
- Download the latest MediaWiki and install
it on your Linux/Windows machine using LAMP/WAMP/XAMPP. MediaWiki is
the software that renders Wikipedia articles using the data stored in
MySQL.
- MediaWiki needs a few extensions that are installed on
Wikipedia. Once we have MediaWiki installed, say at /var/www/wiki/, download
each of them and install it by extracting the extension into the
/var/www/wiki/extensions directory.
The following extensions have to be installed – CategoryTree, CharInsert, Cite, ImageMap, InputBox, ParserFunctions (very important), Poem, RandomSelection, SyntaxHighlight GeSHi, Timeline and WikiHiero – all of which can be found on the MediaWiki extensions download page, by following the instructions there. In addition, you can install any template
to make your wiki look like whatever you want. Now your own wiki is
ready to use; you can add your own articles, but what we want now
is to copy the original Wikipedia articles into our wiki.
- It is easier to import all the data once and then construct the indexes
for the data in MySQL than to update them each time an article is
added. Open MySQL and your database; the tables used in the
import are text, page and revision. You can drop all the indexes on those
tables now and recreate them in the final step to speed up the process.
- Now that we have our XML dump, we need to import it into the MySQL database. You can find the instructions here.
In short, a summary of the instructions found on that page: the ONLY
way you can get Wikipedia onto your computer really fast is to use the mwdumper tool
to import the dump into the database. The importer built into MediaWiki is
slow and may run for several days. The following command imports
the dump into the database within an hour (the angle-bracket
placeholders stand for your dump file, MySQL username and database name):
java -jar mwdumper.jar --format=sql:1.5 <dump.xml> | mysql -u <username> -p <databasename>
- Recreate the indexes on the tables ‘page’, ‘revision’ and ‘text’ and you are done.
You can comment if you want to try the same or if you run into any problems while trying.
Stack-overflow
Requirements
- LAMP (Linux, Apache, MySQL, PHP)
- Around 15 GB of free space in the partition where MySQL stores its data (the root partition, in my case)
- 4 GB of free Internet download
media10.simplex.tv/content/xtendx/stu/stackoverflow has
several stackoverflow zip files available for direct download.
Alternatively, stack-overflow dumps can be downloaded using a
torrent. A torrent download can be converted into an FTP download using
http://www.torrific.com/home/.
Once you have the dumps, you can unpack them to get huge XML files for
several Stack sites. Stack Overflow is one of the Stack sites; its 7-Zip
archive is broken into four parts, which have to be combined with a
command like (cat part1 part2 part3 part4 > full.7z) before extracting.
Once combined and extracted, we
can see six XML files for each site (badges, comments, postHistory, posts,
users, votes). Among these, comments, posts and votes are the useful ones
for offline usage of the forum. A main post may consist of several reply
posts, and each such post may have follow-up comments. Votes are used to
rate an answer, and they can be used as signals while you browse through
questions. Follow these steps to import the data into the
database and use the UI to browse posts offline.
- Download Stack sites
- Create a database StackOverflow with the schema using the description here. (comments, posts and votes tables are enough)
- Use the code
to import the data to the database. (Suitably modify the variables
serveraddress, port, username, password, databasename, rowspercommit, filePath
and site in the code)
- Run the code on Stack Mathematics to import the mathematics site.
Bigger sites take much more time, and a lot of optimization is
needed, along with a lot of disk space in the partition where
MySQL stores its databases.
- Use the UI php files to view a post given the post number along with the comments and replies.
- TODO: additionally, we can add a search engine that searches the
table ‘posts’ for queries and returns the numbers of matching posts.
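Every file in the Stack Exchange dumps (posts.xml, comments.xml, votes.xml, …) has the same shape: one root element whose children are <row …/> elements, with each database column stored as an XML attribute. Any import code walks rows like this; here is a minimal sketch in Python, where the two sample rows are invented for illustration, not real dump data.

```python
import io
import xml.etree.ElementTree as ET

def iter_rows(source):
    """Stream the <row .../> elements of a Stack Exchange dump file,
    yielding each row's columns as a plain dict of strings."""
    for _, elem in ET.iterparse(source):
        if elem.tag == 'row':
            yield dict(elem.attrib)
            elem.clear()  # keep memory flat on multi-GB files

# Invented fragment in the shape of posts.xml.
sample = ('<posts>'
          '<row Id="1" PostTypeId="1" Score="5" Title="What is 7-Zip?" />'
          '<row Id="2" PostTypeId="2" ParentId="1" Score="3" />'
          '</posts>')

for row in iter_rows(io.StringIO(sample)):
    print(row['Id'], row.get('Title', '(answer)'))
```

Answers carry a ParentId pointing back at their question, which is how the UI stitches a question, its replies and their comments together.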
Gmail offline
Requirements:
- Windows / Mac preferred
- Firefox preferred
- 20 minutes for setup
- 1 hour for download
Gmail allows offline usage of mails, chats, calendar data and
contacts. Follow these simple steps to get Gmail on your computer.
- Install Google Gears for Firefox
- You can install Google Gears from http://gears.google.com
- If you are on Linux, you can install the gears package [sudo apt-get install xul-ext-gears]
- Note: Gears works well on Windows but may fail on Linux
- Login to gmail
- Create new label “offline-download”
- Create a filter ([subject contains: "Chat with"] or [from:
]) that adds the label “offline-download”, to selectively
download your conversations.
- Enable offline Gmail in Settings, and allow downloading “offline-download” for 5 years. You can select the period of time as well.
- Start the download; it will finish in around an hour, and you will have your mails on your computer.
Offline Gmail creates a database called [emailID]@gmail.com#database on your computer. The
Gears site gives you its location. You can find some information about
offline Gmail here.
If you want a custom interface for your mails / chats etc., you can
create one that queries the SQLite database mentioned above to present
the content however you want. The software
DiaryMaker can
be used to read your chat data, with plots of frequencies over time, and
to rank your friends based on interactivity. It works on Linux and uses
the Qt platform. I will add a post on it soon.
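Before writing such an interface, it helps to see which tables the Gears database actually contains; its schema is not documented here, so the sketch below just enumerates tables via sqlite_master. The demo database and its Messages table are invented stand-ins – point db_path at the real [emailID]@gmail.com#database file instead.

```python
import os
import sqlite3
import tempfile

def list_tables(db_path):
    """List the tables of an SQLite database, e.g. the one that
    offline Gmail (Gears) keeps on disk, so you can explore its schema."""
    con = sqlite3.connect(db_path)
    try:
        cur = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
        return [name for (name,) in cur.fetchall()]
    finally:
        con.close()

# Throwaway stand-in for the Gears file; 'Messages' is an invented
# table name for the demo, not the real Gears schema.
fd, db_path = tempfile.mkstemp(suffix='.db')
os.close(fd)
con = sqlite3.connect(db_path)
con.execute('CREATE TABLE Messages (id INTEGER PRIMARY KEY, body TEXT)')
con.commit()
con.close()

tables = list_tables(db_path)
print(tables)
os.remove(db_path)
```

From there, ordinary SELECT queries against the real tables are enough to drive whatever custom mail or chat viewer you build.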
Feel free to comment on any issue. If you have an idea for
downloading any other kind of data onto your computer for offline
usage, please let us know with a comment.
Update:
Source: http://kashthealien.wordpress.com/2011/08/06/wikipedia-offline/