Tadoku Tools

What I’ve got here today is a couple of tools designed for use in the Read More Or Die contest (aka Tadoku; nka タ毒), though they’re certainly not restricted to that.

First off, download the tools here. You’ll need Python 2.6 or 2.7 (32-bit) to run them. They’re command-line tools because I didn’t feel like making a GUI, but they shouldn’t be too hard to figure out.

I’ll just copy the documentation here, which should explain everything you need to know.

Tadoku Tools

A set of tools written specifically for the Read More Or Die contest (also known as Tadoku) designed to expedite the use of 青空文庫-formatted text documents on e-readers while maintaining their usefulness in the contest. These tools are designed around the specific features of the Java 青空文庫-to-PDF converter 青P (http://www.geocities.jp/is3000nx/soft/share/).

This toolset currently includes two tools:

–1. ImageFixer–

i. Converts inline comments containing image references to HTML image tags.

ii. Strips all non-image HTML tags from the document.

iii. Attempts to match image references to files that exist on the disk. It does so by checking for files with different image extensions and, if the reference points to a directory other than the source file’s, it also checks for images in the same physical location as the source file.
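To make step iii a bit more concrete, here is a minimal sketch of that kind of lookup. It is not the code from the tools: the function name `resolve_image` and the extension list are made up for illustration, and it is written for current Python 3 rather than the Python 2 the tools target.

```python
import os

# Assumed list of extensions to try; the real tool's list is not documented here.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".bmp")


def resolve_image(reference, source_dir):
    """Return a path (relative to source_dir) of an existing image, or None.

    Hypothetical helper: tries the reference as written, then with other
    extensions, then repeats both checks for the bare filename placed next
    to the source file.
    """
    reference = reference.replace("\\", "/")
    candidates = [reference]
    flat = os.path.basename(reference)
    if flat != reference:
        # The reference points into another directory, so also look for
        # the file sitting right beside the source text.
        candidates.append(flat)

    for candidate in candidates:
        stem, _ext = os.path.splitext(candidate)
        names = [candidate] + [stem + ext for ext in IMAGE_EXTENSIONS]
        for name in names:
            if os.path.isfile(os.path.join(source_dir, name)):
                return name
    return None
```

With a text that references `images/fig1.png` while only `fig1.jpg` sits next to it, this sketch would return `fig1.jpg`; if nothing matches, it returns `None` and the reference is left for you to fix by hand.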

–2. PageMarker–

i. Puts a page marker in the text every 400 characters, passing over existing comments, image tags, furigana, and other 青空文庫-specific markup.
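The idea behind the page marker can be sketched roughly as follows. This is an illustration under assumptions, not the contents of PageMarker.py: the marker text `［＃ページN］`, the choice not to count newlines or the ruby start mark `｜`, and the function name are all mine, and the code is written for current Python 3.

```python
import re

PAGE_LENGTH = 400  # characters per page, as described above

# Spans that do not count toward the page length: furigana 《…》,
# 青空文庫 annotations ［＃…］, and HTML tags such as image tags.
MARKUP = re.compile(r"《[^》]*》|［＃[^］]*］|<[^>]*>")

# Assumption: line breaks and the ruby start mark are not counted either.
UNCOUNTED = "\r\n｜"

# Hypothetical marker format; the real tool's marker text may differ.
MARKER = "［＃ページ%d］"


def insert_page_markers(text, page_length=PAGE_LENGTH):
    """Insert a numbered marker after every page_length counted characters."""
    out = []
    count = 0
    page = 0
    pos = 0
    while pos < len(text):
        match = MARKUP.match(text, pos)
        if match:
            # Copy markup through untouched without counting it.
            out.append(match.group(0))
            pos = match.end()
            continue
        ch = text[pos]
        out.append(ch)
        pos += 1
        if ch in UNCOUNTED:
            continue
        count += 1
        if count == page_length:
            # Keep any markup that directly follows (e.g. furigana)
            # attached to its text before placing the marker.
            match = MARKUP.match(text, pos)
            while match:
                out.append(match.group(0))
                pos = match.end()
                match = MARKUP.match(text, pos)
            page += 1
            out.append(MARKER % page)
            count = 0
    return "".join(out)
```

Deferring the marker until after trailing markup is a design choice in this sketch, so that a marker never lands between a word and its furigana.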

NOTE: These tools will try to convert the source document to UTF-8. If they cannot detect the original encoding properly, you might end up with a lot of gibberish. If so, try converting the document to UTF-8 yourself before using these tools.
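If you want to do that conversion by hand, something like the following might do. It is only a sketch using the standard library, not what the tools do internally: the candidate encoding list and both function names are assumptions, and it targets current Python 3.

```python
import io

# Assumed candidates, tried in order. EUC-JP is tried before cp932
# (Shift_JIS) because Shift_JIS text usually fails to decode as EUC-JP,
# while the reverse can "succeed" and produce gibberish.
CANDIDATE_ENCODINGS = ("utf-8", "euc_jp", "cp932")


def decode_japanese(raw):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode input with any candidate encoding")


def convert_to_utf8(src_path, dst_path):
    """Rewrite src_path as UTF-8 at dst_path; return the detected encoding."""
    with io.open(src_path, "rb") as src:
        text, encoding = decode_japanese(src.read())
    # newline="" keeps the original line endings as they were.
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    return encoding
```

This is a guess-and-check approach rather than real detection, so it can still pick the wrong encoding on unusual input. Have a look at the output before feeding it to the tools.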

Author: BlackDragonHunt ([name][at]gmail[dot]com)

License: Do whatever you want with it, I don’t really care.

Disclaimer: Provided “as-is,” no warranty, etc., etc. It’s not my fault if anything goes wrong, and I do not and will not take any responsibility for any damage done to you or your system as a result of using this program.

Usage instructions:

Run Tadoku Tools as follows:

python TadokuTools.py input_files [-i] [-p] [-h]

* “input_files” is a Unix-style glob which expands to a list of text files to be processed, using one or more of the following operations.

NOTE: Each file processed will generate a new file with “-fixed” appended to the original filename. If such a file exists, it will be overwritten.

* The “-i” flag tells the program to attempt to repair image references in the text file(s). It does so by converting commented images to HTML tags, checking to see if the referenced files exist on disk, and, if not, checking to see if they exist with a different extension or at the same physical location as the current text file.

NOTE: This also strips the file of all non-image HTML tags and modifies existing image tags to be self-closing.

* The “-p” flag tells the program to insert page markers into the text as comments. This gives a rough approximation of the number of pages and physical page break locations in printed text form.

* The “-h” flag shows this.

* NOTE: If neither the “-i” nor the “-p” flag is provided, both image repair and page marking are performed.
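For anyone curious how options like these can be wired up, here is a rough sketch using `argparse` and `glob`. It is not the actual TadokuTools.py: the function names are hypothetical, `argparse` is my choice rather than whatever the script really uses, and placing “-fixed” before the file extension is an assumption about what “appended to the original filename” means.

```python
import argparse
import glob
import os


def build_parser():
    # argparse adds the "-h" help flag on its own.
    parser = argparse.ArgumentParser(prog="TadokuTools.py")
    parser.add_argument("input_files", nargs="+",
                        help="Unix-style glob(s) of text files to process")
    parser.add_argument("-i", dest="fix_images", action="store_true",
                        help="repair image references")
    parser.add_argument("-p", dest="mark_pages", action="store_true",
                        help="insert page markers")
    return parser


def parse_options(argv):
    """Return (files, fix_images, mark_pages) for a list of arguments."""
    args = build_parser().parse_args(argv)
    if not (args.fix_images or args.mark_pages):
        # Neither flag given: run both operations.
        args.fix_images = args.mark_pages = True
    files = []
    for pattern in args.input_files:
        files.extend(sorted(glob.glob(pattern)))
    return files, args.fix_images, args.mark_pages


def output_name(path):
    """Assumed naming: book.txt becomes book-fixed.txt."""
    stem, ext = os.path.splitext(path)
    return stem + "-fixed" + ext
```

Expanding the glob inside the script, instead of relying on the shell, means a pattern like `*.txt` also works from the Windows command prompt, which does not expand wildcards itself.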

You’ll probably notice that the Image Fixer from my previous post is included in these tools. That’s because I figured the tools I had worked well together and didn’t want to have to run everything through two separate tools just to get the results I wanted. And so this was born.

Unlike the last one, however, Beautiful Soup is included, so you don’t need to install it. You might, however, still want to grab the chardet module.

Again, if you have any trouble, just let me know and I’ll try to get it all sorted out for you.

EDIT: Just a note: when running this, you’ll probably get some warnings from BeautifulSoup.py. As far as I know these are harmless, and I have no idea where they’re actually coming from.

EDIT 2: Fixed a small bug (pointed out to me by ImaginaryJapan) where the Page Marker script would erroneously add extra page markers in rare situations. Check the code for full details (PageMarker.py lines 57-63).

EDIT 3: Fixed a silly error that made the script incompatible with Python 2.6. It should work fine with either 2.6 or 2.7 now, but I haven’t tested it on any other versions.


7 Responses to Tadoku Tools

  1. Pingback: Tools of Mass Reconstruction | Nothing in Particular

  2. Pingback: Tweets that mention Tadoku Tools | Nothing in Particular -- Topsy.com

  3. Pingback: Ebook Readers, or How the Cool Kids Read « Read More or Die

  4. nacest says:

    Thanks for taking the time to write this!

    One question: what format should be fed to the script? A plain text file or a pdf?
    I use 青空キンドル http://a2k.aill.org/ to convert the stories in a kindle-perfect pdf format. Is that ok?

    • No problem! The script is expecting plain-text files, though HTML would probably work just fine, too. The image fixer strips all the HTML except image tags, so it’d end up being just plain text afterwards anyway.

      I like the tool you linked, though the scripts are written with 青P in mind (which I talk about here). I know the site doesn’t support embedding external images, and I don’t like how it handles the comments the page marker tool inserts, so it’s up to you whether to use it.

  5. Pingback: In preparation for the coming storm « 我輩はブリートである。
