CCNMTL (1999-2015) pages for archival purposes only. Please visit CTL.columbia.edu.

Using Google Docs to OCR Scanned Text

Files of many types can be uploaded to Google Docs for storage in the cloud, and some can be converted to the corresponding Google Docs format on upload. For example, a PowerPoint file can be converted to a Google Docs presentation. A lesser-known but useful option is converting scanned text (in an image or PDF file) to editable text: selecting a check box on the upload screen instructs Google Docs to perform optical character recognition (OCR) on the file. When the converted file opens, it contains the original scan followed by the editable text. An example is shown below.

A snippet of scanned text from a PDF uploaded for OCR processing:

[Image: scanned text snippet]

The text returned by the Google Docs OCR process (unedited):

And first, I have to reply to the older charges and to my first accusers, and then I will go to the later ones. For I have had many accusers, who accused me of old, and their false charges have continued during many years; and I am more afraid of them than of Anytus and his associates, who are dangerous, too, in their own way. But far more dangerous are these, who began when you were children, and took possession of your minds with their falsehoods, telling of one Socrates, a wise man, who speculated about the heaven above, and searched into the earth beneath, and made the Worse appear the better cause. These are the accusers Whom I dread; for they are the circulators of this rumor, and their hearers are too apt to fancy that speculators of this sort do not believe in the gods. And they are many, and their charges against me are of ancient date, and they made them in days when you were impressible - in childhood, or perhaps in youth - and the cause when heard went by default, for there was none to answer. And, hardest of all, their names I do not know and cannot tell; unless in the chance of a comic poet. But the main body of
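OCR output like the passage above can be spot-checked by flagging words that are capitalized in the middle of a sentence. The following is only a rough sketch of that idea, not part of Google Docs: the function name `suspicious_capitals` and the `allowed` set are made up for illustration, and the heuristic will also flag legitimate proper nouns unless they are added to `allowed`.

```python
def suspicious_capitals(text, allowed=frozenset({"I"})):
    """Return capitalized words that do not begin a sentence.

    Hypothetical helper for reviewing OCR output. Proper nouns are
    flagged too unless they are listed in `allowed`.
    """
    flagged = []
    sentence_start = True  # the first word of the text starts a sentence
    for token in text.split():
        # Remove surrounding punctuation to get the bare word.
        word = token.strip(".,;:!?\"'()-")
        if not word:
            continue  # token was punctuation only, e.g. a lone dash
        if word[0].isupper() and not sentence_start and word not in allowed:
            flagged.append(word)
        # The next word starts a sentence if this token ends one.
        sentence_start = token.rstrip("\"')").endswith((".", "!", "?"))
    return flagged


sample = (
    "He made the Worse appear the better cause. "
    "These are the accusers Whom I dread; for they circulate this rumor."
)
print(suspicious_capitals(sample))
```

Each flagged word still needs a human decision: names such as Socrates or Anytus are correct as written, while an ordinary word with a stray capital is likely an OCR slip.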

A review shows that the OCR made only a couple of errors, both incorrectly capitalized w's ("Worse" and "Whom"). The process is quick, but files uploaded for conversion are limited to 2 MB, and OCR on PDF files covers only the first 10 pages. To convert a longer document, you would have to split the file into smaller pieces.
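Splitting a longer PDF for this purpose can be sketched in Python. This is an illustration rather than a recommended tool: the function names are invented here, and `split_pdf` assumes the third-party `pypdf` package is installed. `chunk_page_ranges` needs only the standard library and works out which pages go in each piece.

```python
from pathlib import Path


def chunk_page_ranges(total_pages, max_pages=10):
    """Return 1-based, inclusive (first, last) page ranges of at most max_pages."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (first, min(first + max_pages - 1, total_pages))
        for first in range(1, total_pages + 1, max_pages)
    ]


def split_pdf(path, max_pages=10):
    """Write one PDF per page range next to the input file; return the new paths.

    Assumption: the third-party pypdf package is available.
    """
    from pypdf import PdfReader, PdfWriter  # imported here so the helper above runs without it

    source = Path(path)
    reader = PdfReader(str(source))
    outputs = []
    for first, last in chunk_page_ranges(len(reader.pages), max_pages):
        writer = PdfWriter()
        for index in range(first - 1, last):  # pypdf pages are 0-indexed
            writer.add_page(reader.pages[index])
        out_path = source.with_name(f"{source.stem}_p{first}-{last}.pdf")
        writer.write(str(out_path))
        outputs.append(out_path)
    return outputs


print(chunk_page_ranges(25))  # [(1, 10), (11, 20), (21, 25)]
```

Splitting by page count does not guarantee that each piece stays under the 2 MB upload limit, so image-heavy scans may need smaller pieces.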