Since I’m in the middle of my doctoral studies, I read A LOT of journal articles. Most of these articles are in PDF file format and I use Skim to read and annotate them. However, every so often I can only obtain PDFs that are images. These PDFs can’t be searched or annotated, and for my workflow this is a no-go. I know of programs that will automatically OCR (optical character recognition) documents, like DEVONthink Pro Office and PDFpen, but 1) I’m on a grad school budget and 2) I like the challenge of figuring out ways to configure and promote technology using open source resources.

So this past weekend I decided that I wanted to OCR some image-only PDFs into searchable PDFs that could also be annotated correctly. The following is the fruit of my journey so far. I don’t know if this journey is over, but I can tell you my OCR process works well enough for now.

This is written for those who have never (or barely) used the Terminal app on OSX and are new to Tesseract and OCR.

A lot of credit goes to mchristy at the Early Modern OCR Project, as you’ll notice many, but not all, things are the same as he outlined for OSX 10.8.

DISCLAIMER: Attempting the process outlined below may cause problems with the operation of your computer or cause you to lose data. Consider yourself warned… you are attempting this at your own risk. I HIGHLY recommend backing up your system before you do anything like what’s described below.

I was able to get the Tesseract 3.03 release candidate to build on OSX 10.9.4 from source, and it is working with some warnings (detailed below). Everything built well and without errors (note: I did have warnings, but no errors). Slight detour: if you know what MacPorts and Homebrew are, great, but I had trouble building Tesseract 3.03 when I had both installed on my machine, so my recommendation is to use only Homebrew.

I have tested Tesseract with TIFF (single and multiple pages) and it is working well. It gives me the following warning, in which the page number is always the last page of the file, but it doesn’t seem to be a problem.
“Warning in pixReadMemTiff: tiff page 25 not found”
PNG files do not seem to work as inputs (it outputs two identically named files: one that can’t be opened and one that only has the first page of the input). PDF files provide the following error, and I can’t remember if Leptonica is supposed to be able to input PDF files or not. Another problem for another day.
Error in pixRead: image file not found: %PDF-1.2
I can work on these if I find time, but since TIFF is working they aren’t a priority.

OK… now that all of that is out of the way, here is the process that worked for me.

Install, update, and verify Homebrew by entering the following in Terminal one at a time.

Install the Tesseract dependencies listed at the link above, again by entering them one at a time. (Note: I did not need to install aclocal or autoheader from Homebrew as they aren’t formulas in Homebrew.)

5. Assuming everything above installed without errors, run the following commands (still in Terminal, entering one at a time; again based on the instructions linked above).

6. Assuming you don’t get any failures or errors, you can then test whether or not your OCR works by doing the following.
a. If you don’t have a .TIFF file, open the file with Preview and Export the file as a *.TIFF (I use 300 pixels per inch).
b. In Terminal, navigate to the directory that contains the .TIFF file.
(Note: the italics should be changed to your input/output docs’ specific filenames and the filetype you want to output.) (Another note: Tesseract defaults its output to plain text.)
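The OCR test described in this post comes down to pointing Tesseract at a TIFF, giving it a base name for the output, and naming the output type. As a rough sketch of that step (not the exact commands I ran), the Python below checks that the input really is a TIFF and then prints the command so you can paste it into Terminal. Everything named here is made up for illustration: the helper functions `sniff_format`, `build_tesseract_command`, and `ocr`, and the filenames `scan.tiff` and `scan_ocr`. The one thing I am leaning on from Tesseract itself is the form `tesseract input outputbase pdf`, where leaving off the last word gives plain text. The file check is there because the text after "image file not found:" in the PDF error above matches the first bytes of a PDF file, which suggests the PDF's own header was being read where an image was expected.

```python
import shlex
import subprocess

# Leading bytes that identify a few common file types. The two TIFF
# signatures cover little-endian ("II") and big-endian ("MM") files.
SIGNATURES = {
    b"II*\x00": "tiff",
    b"MM\x00*": "tiff",
    b"%PDF": "pdf",
    b"\x89PNG": "png",
}


def sniff_format(path):
    """Return 'tiff', 'pdf', 'png', or 'unknown' from the first 4 bytes."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    return SIGNATURES.get(head, "unknown")


def build_tesseract_command(input_path, output_base, output_type="pdf"):
    """Build the argument list for: tesseract <input> <output_base> [pdf].

    Tesseract adds the file extension to output_base itself. With no third
    argument it writes plain text, so 'txt' adds nothing to the command.
    """
    command = ["tesseract", str(input_path), str(output_base)]
    if output_type != "txt":
        command.append(output_type)
    return command


def ocr(input_path, output_base, output_type="pdf", dry_run=True):
    """Check that the input is a TIFF, then print or run the command."""
    kind = sniff_format(input_path)
    if kind != "tiff":
        raise ValueError(
            f"{input_path} looks like {kind}, not TIFF; export it as TIFF first"
        )
    command = build_tesseract_command(input_path, output_base, output_type)
    if dry_run:
        # Print a copy-and-paste version instead of running anything.
        print(" ".join(shlex.quote(part) for part in command))
    else:
        subprocess.run(command, check=True)  # needs tesseract on the PATH
    return command
```

Calling `ocr("scan.tiff", "scan_ocr")` with a real TIFF in the current directory prints the command without running it; pass `dry_run=False` once the printed command looks right. I kept the dry run as the default so that nothing gets written until you have had a chance to read the command.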