Trying out OCRopus - moriyoshi's diary

As the name suggests, "OCRopus" is software that performs OCR. In the project's own words,

OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities

Setting aside whether you should call yourself "state-of-the-art", it also claims "multi-lingual capabilities", yet Japanese does not appear to be supported so far. The FAQ says the following.

Will it support Arabic/Chinese/Japanese/...?
(snip)
We hope to be able to plug in the first non-English recognizers right after the alpha release (when the OpenFST language modeling has been integrated).

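OCRopus is a wrapper around a character recognition engine that adds features such as layout extraction, rounding it out into a complete OCR package. Swapping out the character recognition engine is what makes multi-language support possible. At present, the only supported engine appears to be "Tesseract", which handles the Latin alphabet.

Incidentally, Tesseract was originally developed at HP over roughly ten years and was open-sourced in 2005 (source).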

Trying it on Mac OS X

Install the following packages with MacPorts.
- zlib
- libpng
- jpeg
- tiff
- aspell
- aspell-dict-en
- jam
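
The list above can be installed in one go. This is a minimal sketch assuming sh or bash, and that the port names match the list exactly as written; the DRY_RUN variable is my own addition, so that the command is only printed unless you explicitly ask for the install.

```shell
# Ports taken from the list above.
PORTS="zlib libpng jpeg tiff aspell aspell-dict-en jam"
CMD="sudo port install $PORTS"

# DRY_RUN is a hypothetical switch for this sketch: by default only print
# the command; run with DRY_RUN=0 to actually install (requires sudo).
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$CMD"
else
    sudo port install $PORTS
fi
```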

Check out Tesseract with svn, run runautoconf, then configure and install it.

svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr
cd tesseract-ocr
./runautoconf
LDFLAGS="-L/opt/local/lib" CPPFLAGS="-I/opt/local/include" ./configure --prefix=/opt/local
make
make install

Check out OCRopus with svn, run autoconf, then configure and build it. It uses Jam instead of Automake, so autoconf alone is enough.

svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus
cd ocropus
autoconf
LDFLAGS="-L/opt/local/lib" CPPFLAGS="-I/opt/local/include" ./configure --prefix=/opt/local
jam

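Accuracy is good, so... looking forward to what comes next

Apparently jam install does not work yet (WTF?), so for now try it from within the source tree. The front-end command-line tool is generated under ocropus-cmd, so run that. The result is written to standard output as HTML.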

ocropus-cmd/ocropus ocr ~/Desktop/test.jpg > test.html
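
To try several images at once, the same command can be wrapped in a loop. This is only a sketch: out_name is a hypothetical helper I made up to derive the output file name, and the script assumes it is run from the top of the ocropus source tree with the image paths passed as arguments.

```shell
# Hypothetical helper: map an image path to an HTML file name,
# e.g. ~/Desktop/test.jpg becomes test.html.
out_name() {
    printf '%s.html\n' "$(basename "${1%.*}")"
}

# Run the same "ocropus ocr" invocation as above on each image given as an
# argument, writing one HTML file per image into the current directory.
for img in "$@"; do
    ocropus-cmd/ocropus ocr "$img" > "$(out_name "$img")"
done
```

Prefixing the ocropus line with time gives a rough idea of how long each image takes.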

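It is fairly slow. The project mentions it might be usable for image spam filtering, but that looks a bit tough in its current state. Still, I found the accuracy to be at a genuinely practical level (at least on the few images I tried).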