Don’t Use AI-Enhanced OCR Just Yet!

Well, it depends on your prompts, and how high the quality of your image is. I’ll give you my example.

Several months ago I got hold of a missing chapter from a story I thought would be ideal to discuss in my thesis: The World’s Last Wonder, a science fiction story from 1901.

Having tested ChatGPT’s OCR capabilities a few months ago and found them not much better than what I already had, I thought I’d test them again this week to see if things had improved.

My initial thought was that they had. I got a complete transcription of a page more than 100 years old, and it appeared to be free of errors.

Here’s an OCR example of a section of the passage using basic software:

He said this so exultantly that I asked why lie was so cer,a.io. H-: replied that he had improved the iuventiou si~ce S:r Mark_ha.d the bet opportunity of steahng the designs.

Here’s ChatGPT’s version:

He said this so exultantly that I asked why he was so certain. He replied that he had improved the invention since Sir Mark had the last opportunity of stealing the design.

As large language models don’t possess intuition, I decided that a) it was because I had a good scan and b) where there were minor problems with letter recognition, the context would be considered and the correct letter recognised. In other words, I thought an LLM’s OCR came from predictions of what the poorly printed letter in a word should be, rather than what word would look best in that sentence. (I didn’t know it had dropped the s off designs until now!)

I didn’t think much more about it, as I had planned to start harvesting the 97 science fiction stories I’d found today and then ask ChatGPT to OCR the images that didn’t have legible chapters.

Coincidentally, and probably fortunately, Trove shut down at 8pm for maintenance. So I paused my harvesting and shifted back to getting ChatGPT to OCR. This time, though, I had time to look closely at the text in the original image and compare ChatGPT’s output against it.

Bear in mind that I wanted the original to be converted to text, retaining authorial intent, not edited or updated for the 21st century. Of course, I know ChatGPT is predominantly trained on 21st century language. Still, I wasn’t expecting it to dramatically change anything. And even if it did, I just assumed it would be my fault for not creating a suitable prompt.

I soon realised that wasn’t the case. And, after 3 hours of ChatGPT chipping away at my patience, I decided I would try one more time. If that didn’t work I would write this off as an exercise in futility.

And then a strange thing happened.

But I’ll get to that in a moment.

Here are a few excerpts of the conversation.

How can I improve the prompt so that ChatGPT does not correct spelling errors or rearrange words or fix grammar?


So, I used that. Then –

In column 4, for example, the original image says ‘why should I not plunge boldly’ but ChatGPT changed it to ‘why should not I plunge boldly’. What can I say to prevent this?


So, I used that. Then –

The original text of the 4th column says ‘He then took me a journey to where’ but ChatGPT changed it to ‘He then took me on a journey to where’. How can the prompt be updated so that ChatGPT will not add words to fix sentences?


So, I used that. Then ChatGPT took a relatively readable sentence in the image, “A string enabled him to turn the lever partly back, and the thing slowly descended.”, and changed it to “He then turned the lever partway back, and the thing slowly descended.” That string might be important later. I don’t know, but it’s best to keep everything. Authorial intent is important!
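Spotting changes like these by eye is slow, and they are easy to miss because the altered sentences still read well. If you have a hand-checked transcription of even a short passage, a word-by-word comparison will flag every swapped, added, or dropped word. Below is a minimal sketch in Python using only the standard library's difflib. The function name `word_changes` is my own, and the sample pairs are the ones quoted in this post; it is a rough checking aid, not a full collation tool.

```python
import difflib


def word_changes(original: str, ocr_output: str):
    """List every place the OCR output departs from the original.

    Both texts are split on whitespace and compared word by word.
    Each entry is (operation, words_in_original, words_in_output),
    where operation is 'replace', 'insert' or 'delete'.
    """
    a = original.split()
    b = ocr_output.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Pairs quoted in this post: (text in the image, ChatGPT's output).
examples = [
    ("why should I not plunge boldly",
     "why should not I plunge boldly"),
    ("He then took me a journey to where",
     "He then took me on a journey to where"),
    ("A string enabled him to turn the lever partly back, "
     "and the thing slowly descended.",
     "He then turned the lever partway back, "
     "and the thing slowly descended."),
]

for original, output in examples:
    for change in word_changes(original, output):
        print(change)
    print("---")
```

For the second pair this reports a single insertion of the word "on". An empty list means the two texts match word for word, which is the only result that respects authorial intent.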

ChatGPT created completely new sentences to replace the poor OCR. Please update the prompt so that if ChatGPT finds a word or group of words that can’t be OCR’d then that part of the output should feature stars ‘*’ replacing the letters that are difficult to discern.


So I did that with the next image, not having much hope it was going to work, and then this happened:

Then ChatGPT’s OCR looked like this:

“THE WORLD’S LAST WONDER.Fon Taz Tocsw.)Cmaprer I. + 7I was sitting in my room, or what by a politefiction was called mine—in reality it belonged –to somebody else, whose sole business on earthWas to extort’\’w charge from’ other people for»leave to be ona particular part of it. As formyself, it appeared just then that Thad no |Business on earth at all, althongh it appeatedjuss certain that I could not get away from‘the earth, How Sir Mark Solomon came to‘own my room. was thus .**
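The star-masking idea from my prompt is easier to enforce with a conventional OCR engine than with a chatbot, because such engines can report a confidence score for each word (Tesseract does this). Here is a minimal Python sketch of the masking step on its own. The function name `mask_uncertain`, the threshold of 60, and the confidence numbers in the sample are all invented for illustration; only the words come from the basic-software OCR quoted earlier. It does not describe how ChatGPT works internally.

```python
def mask_uncertain(words, threshold=60.0):
    """Replace low-confidence words with stars, one star per character.

    words: iterable of (text, confidence) pairs, confidence on a 0-100 scale.
    threshold: words scoring below this are masked (an arbitrary choice here).
    """
    masked = []
    for text, confidence in words:
        if confidence < threshold:
            masked.append("*" * len(text))
        else:
            masked.append(text)
    return " ".join(masked)


# Words from the basic OCR output quoted in this post.
# The confidence values are hypothetical, made up for this example.
sample = [
    ("He", 96.0), ("said", 93.0), ("this", 95.0), ("so", 91.0),
    ("exultantly", 88.0), ("that", 94.0), ("I", 90.0), ("asked", 92.0),
    ("why", 89.0), ("lie", 41.0), ("was", 90.0), ("so", 92.0),
    ("cer,a.io.", 23.0),
]

print(mask_uncertain(sample))
```

The point of masking is that nothing gets invented: a doubtful word stays visibly doubtful, so I know exactly where to go back to the image and check.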

I understand they’re using the Tesseract engine. I have that on my PC but have found it struggles with scans of old newspapers. But why was ChatGPT reset?

I guess that, being a subscriber who had also recently spent money on tokens and contacted support about a tier issue, I might have had extra attention paid to what I was doing, especially as I was asking for the same image to be OCR’d several times. They probably reviewed the discussion, realised that ChatGPT wasn’t really performing OCR, and pulled the plug on that feature to have a closer look. After all, while incorrect OCR of fiction isn’t that important in the grand scheme of things, incorrect OCR of old legal texts definitely is. Especially if ChatGPT wants to change words around.

Still, it’s a great start, and for the majority of the image it did a reasonable job. I’m looking forward to seeing a more accurate and less creative OCR LLM system in the future.
