Here's a Java program that transcribes any thread of this forum into a single file for offline reading! It processes about 260 pages a minute. Time might vary depending on connection speed and site traffic, as well as your preferences (check the menu).
The program can also be ordered to only keep the posts of certain users, great for reading only the GM's posts or finding the Toad's posts in the Future of the Fortress thread.
The program can combine multiple threads into one, arranging the posts chronologically.
If the output file ends in .html, the program will split the output into multiple pages based on user preferences (check the menu).
If the output file ends in .txt, the program will keep the output as a single plain text file, great for reading in an e-book or for processing with regular expressions or programs like grep.
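The user filter and the chronological merge described above can be sketched roughly as follows. This is only an illustration, not the program's actual code: the Post class, its fields and the ThreadMerger name are made up for the example.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical post model; the real program's classes are not shown in this thread.
final class Post {
    final String author;
    final LocalDateTime postedAt;
    final String body;

    Post(String author, LocalDateTime postedAt, String body) {
        this.author = author;
        this.postedAt = postedAt;
        this.body = body;
    }
}

final class ThreadMerger {
    /** Merges several threads into one list ordered by posting time, keeping only the given authors. */
    static List<Post> merge(List<List<Post>> threads, Set<String> keepAuthors) {
        List<Post> merged = new ArrayList<>();
        for (List<Post> thread : threads) {
            for (Post p : thread) {
                // An empty author set is treated as "keep everyone" (an assumption of this sketch).
                if (keepAuthors.isEmpty() || keepAuthors.contains(p.author)) {
                    merged.add(p);
                }
            }
        }
        // List.sort is stable, so posts with identical timestamps keep their input order.
        merged.sort(Comparator.comparing((Post p) -> p.postedAt));
        return merged;
    }
}
```

Passing a set containing only the GM's user name would keep just the GM's posts from every thread, still sorted by time.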
Here's the program: B12_PostProcessor.jar (http://goo.gl/x7obSt) (10 MB)
Should work on most common OSs (Windows, Linux, Mac) and architectures (x86, x86_64) with an up-to-date Java VM installed.
Bugfixing
Rewrite it to make the code more manageable.
Make it understand and store more info.
Make it download multiple pages at a time to save time, since processing a single page doesn't take long. It's establishing a connection and downloading that are the real time sink. Done
Make it able to login. Done
Make it able to download images. Done
Make it download theme information and images so that things don't look so black and white when offline. Done; needs some bugfixing but nothing serious
Long term:
Make it combine multiple threads into one. Done; needs to understand dates other than %a %d-%m-%Y, %H:%M:%S
Make it prettier.
Option to divide output to multiple files. Done
Filter options: Remove OOC (()), Only include posts with certain Text types (Italics, Bold, Underlined, Coloured, etc), Include posts with character speech "", mix and match rules
Output options: minimize post size (http://www.bay12forums.com/smf/index.php?topic=131917.msg4686380#msg4686380),
plain text output (http://www.bay12forums.com/smf/index.php?topic=131917.msg4688297#msg4688297) Done; maybe add a few more options and clean up output
Very Long term
Make it able to work on other forums.
Here's the source code (http://goo.gl/gZLMEK) if anyone wants to mess with it. (30 MB, Eclipse project)
Changelog:
1/4/2014: New version can combine multiple files into one based on time. Unfortunately, it can only process times in the %a %d-%m-%Y, %H:%M:%S format, so make sure to log in and change your date format to that if you want to use it until I get around to fixing it.
28/3/2014: Ability to split output to multiple files. New options menu. Various bugfixes.
2/11/2013: New version can save output as a "lightweight" .txt file.
1/11/2013: Fixed an out-of-memory error that occurred in 32-bit Windows JVMs when processing more than 1069 pages.
23/10/2013: New version can login and can download images and forum theme images. It also utilizes multiple downloader threads to reduce download time. Finally, it has a better GUI.
11/10/2013: New version should work on most common OSs (Windows, Linux, Mac) and architectures (x86, x86_64) with an up-to-date Java VM installed.
10/10/2013: Made the program create a window that acts as a terminal. This means that you can now just double click the file instead of having to launch it from the terminal. Bad thing is, it only works on 64-bit Linux now. I'll fix that tomorrow. Should be an easy fix (famous last words).
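For reference, the %a %d-%m-%Y, %H:%M:%S date format mentioned in the changelog corresponds to the java.time pattern "EEE dd-MM-yyyy, HH:mm:ss". Here is a minimal sketch of parsing it. The ForumDates class is made up for the example, and it assumes English day names; the program's real parsing code may look quite different.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not taken from the program's source.
final class ForumDates {
    // java.time equivalent of the strftime-style format "%a %d-%m-%Y, %H:%M:%S".
    // Locale.ENGLISH is an assumption: it expects day names such as "Tue".
    private static final DateTimeFormatter FORUM_FORMAT =
            DateTimeFormatter.ofPattern("EEE dd-MM-yyyy, HH:mm:ss", Locale.ENGLISH);

    /** Returns the parsed timestamp, or an empty Optional if the text is in some other format. */
    static Optional<LocalDateTime> parse(String text) {
        try {
            return Optional.of(LocalDateTime.parse(text.trim(), FORUM_FORMAT));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

Supporting other date formats could then be a matter of trying a list of such formatters in turn and keeping the first one that parses.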
(http://imageshack.us/a/img19/9011/4nl8.jpg)
Version 2 is coming along nicely. At this rate, I should be ready for a release near the end of the week or sometime next week. Most of my work so far has gone into designing a simple GUI and a login system. Next up is upgrading the downloader itself: downloading images, downloading through multiple connections to increase download speed, and making it smarter so that it can filter messages according to their content.
Speaking of filters, as of right now, I'm working on two filters. One that removes out of character content by removing any text enclosed in (( )) and another that checks for types of text like Italics, Bold, Underlined, Coloured, etc. Any ideas for other filters you would like to have?
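A filter that removes text enclosed in (( )) could look something like the sketch below. The OocFilter name is made up, and this version deliberately ignores nested or unclosed brackets.

```java
import java.util.regex.Pattern;

// Hypothetical filter, shown only to illustrate the idea.
final class OocFilter {
    // Reluctant match so each (( ... )) pair is removed on its own;
    // DOTALL lets an out-of-character remark span several lines.
    private static final Pattern OOC = Pattern.compile("\\(\\(.*?\\)\\)", Pattern.DOTALL);

    /** Returns the post text with every (( ... )) block removed. */
    static String strip(String postText) {
        return OOC.matcher(postText).replaceAll("");
    }
}
```

Text with an opening (( but no closing )) is left untouched, which seems like the safer default for a first version.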
Bug squashing took more time than expected (mostly due to personal issues and my lack of experience with working with SWT), but lo and behold:
(http://imageshack.us/a/img841/3186/4xzf.png)
A speed of about 260 pages per minute (which translates to roughly 3900 posts per minute at 15 posts per page; speed should increase with higher settings) with 6 threads downloading and processing data (I'll probably add an option to increase the number of threads).
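For anyone curious how downloading with several threads can be arranged, here is a minimal sketch using a fixed thread pool. It is not the program's actual downloader: ParallelPageDownloader and the fetchPage parameter are made-up names, and fetchPage stands in for whatever opens the connection and downloads a single page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

// Hypothetical downloader skeleton; names are illustrative only.
final class ParallelPageDownloader {
    /** Fetches pages 0..pageCount-1 on a fixed pool of workers and returns them in page order. */
    static List<String> downloadAll(int pageCount, int workers, IntFunction<String> fetchPage) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (int page = 0; page < pageCount; page++) {
                final int p = page;
                Callable<String> task = () -> fetchPage.apply(p);
                pending.add(pool.submit(task));
            }
            List<String> pages = new ArrayList<>();
            for (Future<String> f : pending) {
                // Waiting on the futures in submission order keeps the pages in order,
                // even if the workers finish out of order.
                pages.add(f.get());
            }
            return pages;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while downloading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A page failed to download", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With workers set to 6 this mirrors the setup described above. Since most of the time goes to waiting on the network, a handful of threads can keep several connections busy at once.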
Now all that's left is the ability to download images and it'll be ready for the next release.
New version can split the output into multiple files.
Simply click the "Stuff" menu and then the "Other Options" menu item. The posts per page is the option you want.
This option only works for HTML output. Plain text output is still one huge file. I should probably add an option to enable or disable it.
You can also mess with the other options there if you want, although the checkboxes there don't do anything yet.
Stopped the program from failing when downloading image links with no file type (like this one: https://i.chzbgr.com/maxW500/7047413504/h45184DA5/)
Fixed a divide-by-zero error that happened in some rare cases.
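One way to cope with image links that have no file type is to fall back to a default extension when the URL path has none. The sketch below shows that idea only; ImageNames and the fallback value are made up, and the actual fix in the program may work differently (for example by checking the response's content type).

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper; not taken from the program's source.
final class ImageNames {
    /** Returns the lower-case extension of the URL's last path segment, or fallback if it has none. */
    static String extensionOrDefault(String url, String fallback) {
        // URI.create throws IllegalArgumentException for malformed URLs; callers may want to catch that.
        String path = URI.create(url).getPath();
        if (path == null) {
            return fallback;
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        // No dot, a leading dot, or a trailing dot all count as "no file type".
        if (dot <= 0 || dot == lastSegment.length() - 1) {
            return fallback;
        }
        return lastSegment.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

For the link above the last path segment is empty, so the fallback is returned instead of the lookup failing.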
Can't get page numbers to show properly when expanding the ... part of the page navigation. All pages are shown correctly in the first, second and middle points, but I can't make the ... points show the correct page numbers when they expand, even though they point to the correct page files. Thus we are left with the following page number problem:
Not expanded:
0 ... 2 3 [4] 5 6 ... 8
Expanded (by clicking the ...):
0 2 2 3 [4] 5 6 8 8
Where the first 2 (shown in red in the original post) points to page 1 and the first 8 (also red) points to page 7.
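How the page navigation is actually generated isn't shown here, so the following is only a guess at one way to avoid this kind of mismatch: build every label directly from the index of the page it links to, so the text and the target can't drift apart. PageNav and its parameters are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical navigation builder; the real program may generate its links very differently.
final class PageNav {
    /**
     * Builds the labels for pages 0..last, showing the first page, a window of
     * radius pages around current, and the last page. Hidden pages are shown as
     * "..." when collapsed, or listed one by one when expanded.
     */
    static List<String> labels(int current, int last, int radius, boolean expanded) {
        int from = Math.max(0, current - radius);
        int to = Math.min(last, current + radius);
        List<String> out = new ArrayList<>();
        if (from > 0) {
            out.add("0");
            addGap(out, 1, from - 1, expanded);
        }
        for (int p = from; p <= to; p++) {
            out.add(p == current ? "[" + p + "]" : String.valueOf(p));
        }
        if (to < last) {
            addGap(out, to + 1, last - 1, expanded);
            out.add(String.valueOf(last));
        }
        return out;
    }

    private static void addGap(List<String> out, int firstHidden, int lastHidden, boolean expanded) {
        if (firstHidden > lastHidden) {
            return; // nothing is hidden between the two neighbours
        }
        if (!expanded) {
            out.add("...");
            return;
        }
        for (int p = firstHidden; p <= lastHidden; p++) {
            out.add(String.valueOf(p)); // the label is the target page index itself
        }
    }
}
```

For current page 4 of pages 0 to 8 with a radius of 2, the collapsed form gives 0 ... 2 3 [4] 5 6 ... 8 and the expanded form gives 0 1 2 3 [4] 5 6 7 8.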
There is sometimes an extra empty final page. Need to check my divisions, probably getting an extra page in there somewhere.
Add a check for <posts per page> equals 0 and throw an appropriate error. Right now it just stops working.
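Both of the last two points come down to the page count calculation. A common cause of an extra empty final page is adding one page unconditionally instead of only when there is a remainder, though that is just a guess about what happens here. A minimal sketch with the zero check included (Pagination is a made-up name):

```java
// Hypothetical helper illustrating the page count calculation.
final class Pagination {
    /** Returns how many output pages are needed for postCount posts. */
    static int pageCount(int postCount, int postsPerPage) {
        if (postsPerPage <= 0) {
            throw new IllegalArgumentException(
                    "Posts per page must be at least 1, but was " + postsPerPage);
        }
        if (postCount < 0) {
            throw new IllegalArgumentException("Post count cannot be negative: " + postCount);
        }
        // Ceiling division: the last page is only added when some posts are left over.
        int fullPages = postCount / postsPerPage;
        return postCount % postsPerPage == 0 ? fullPages : fullPages + 1;
    }
}
```

With 15 posts per page, 30 posts need 2 pages and 31 posts need 3, while a setting of 0 produces a readable error instead of a silent stop.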