Don’t Use AI-Enhanced OCR Just Yet! – Retrieving Vintage Science Fiction from early Australian Newspapers

Well, it depends on your prompts, and how high the quality of your image is. I’ll give you my example.

Several months ago I got hold of a missing chapter from a story I thought would be ideal to discuss in my thesis: The World’s Last Wonder, a science fiction story from 1901.

Having tested ChatGPT’s OCR capabilities a few months ago and found them not that much better than what I already had, I thought I’d test them again this week to see if things had improved.

My initial thought was that they had. I got a complete OCR of an over-100-year-old page that seemed to be free of errors.

Here’s an OCR example of a section of the passage using basic software: He said this so exultantly that I asked why lie was so cer,a.io. H-: replied that he had improved the iuventiou si~ce S:r Mark_ha.d the bet opportunity of steahng the designs.

Here’s ChatGPT’s version:

He said this so exultantly that I asked why he was so certain. He replied that he had improved the invention since Sir Mark had the last opportunity of stealing the design.

As large language models don’t possess intuition, I decided that a) it was because I had a good scan, and b) where there were minor problems with letter recognition, context would be considered and the correct letter recognised. That is, I thought an LLM’s OCR came from predicting what the poorly printed letter in a word should be, rather than what word would look best in that sentence. (I didn’t know it had dropped the s off designs until now!)

I didn’t think much more about it as I had planned to start harvesting the 97 science fiction stories I’d found today, then asking ChatGPT to OCR the images that didn’t have legible chapters.

Coincidentally, and probably fortunately, Trove shut down at 8pm for maintenance. So I paused my harvesting and shifted back to getting ChatGPT to OCR. This time, though, I had time to have a close look at the text in the image and compare it to the original.

Bear in mind that I wanted the original to be converted to text, retaining authorial intent, not edited or updated for the 21st century. Of course, I know ChatGPT is predominantly trained on 21st-century language. Still, I wasn’t expecting it to dramatically change anything. And even if it did, I just assumed it would be my fault for not creating a suitable prompt.

I soon realised that wasn’t the case. And, after 3 hours of ChatGPT chipping away at my patience, I decided I would try one more time. If that didn’t work I would write this off as an exercise in futility.

And then a strange thing happened.

But I’ll get to that in a moment.

Here are a few excerpts of the conversation.

How can I improve the prompt so that ChatGPT does not correct spelling errors or rearrange words or fix grammar?

To ensure that ChatGPT does not attempt to correct spelling, rearrange words, or fix grammar in OCR outputs, you can make the request explicitly clear. Here’s a revised prompt you can use:

Improved Prompt:

“Please perform OCR on this image column by column, in order. Extract the text exactly as it appears, including all spelling, grammar, and formatting errors. Do not interpret, correct, or modify the text in any way. Separate each column with ‘—’.”

So, I used that. Then –

In column 4, for example, the original image says ‘why should I not plunge boldly’ but ChatGPT changed it to ‘why should not I plunge boldly’. What can I say to prevent this?

To prevent any changes like this and ensure ChatGPT extracts text exactly as it appears in the image, you can emphasize verbatim extraction in your prompt. Use the following adjusted prompt:

Updated Prompt:

“Please perform OCR on this image column by column. Extract the text exactly as it appears in the image, preserving the original word order, spelling, grammar, punctuation, and formatting. Do not correct or interpret the text in any way. Separate each column with ‘—’.”

So, I used that. Then –

The original text of the 4th column says ‘He then took me a journey to where’ but ChatGPT changed it to ‘He then took me on a journey to where’. How can the prompt be updated so that ChatGPT will not add words to fix sentences?

To ensure that ChatGPT does not add or modify words to “fix” perceived issues in sentences during OCR, you need to emphasize strict verbatim extraction in the prompt. Here’s an updated version tailored for your needs:

Updated Prompt for Verbatim OCR:

“Please perform OCR on this image column by column. Extract the text exactly as it appears in the image, preserving every word, character, spelling, grammar, punctuation, and formatting. Do not add, remove, correct, or interpret any part of the text. Separate each column with ‘—’.”

So, I used that. Then ChatGPT took a relatively readable sentence in the image, “A string enabled him to turn the lever partly back, and the thing slowly descended.”, and changed it to “He then turned the lever partway back, and the thing slowly descended.” That string might be important later. I don’t know, but it’s best to keep everything. Authorial intent is important!

ChatGPT created completely new sentences to replace the poor OCR. Please update the prompt so that if ChatGPT finds a word or group of words that can’t be OCR’d then that part of the output should feature stars ‘*’ replacing the letters that are difficult to discern.

To address the issue where unclear or unreadable text is being replaced with new sentences, you can use a prompt that explicitly directs ChatGPT to replace any illegible words or sections with asterisks (*). Here’s the updated prompt:

Updated Prompt for Handling Unclear OCR Text:

“Please perform OCR on this image column by column. Extract the text exactly as it appears in the image, preserving every word, character, spelling, grammar, punctuation, and formatting. If any words or parts of the text cannot be clearly OCR’d, replace the unreadable sections with asterisks (*) for each missing character. Do not attempt to guess or reconstruct unclear words or sentences. Separate each column with ‘—’.”

So I did that with the next image, not having much hope it was going to work, and then this happened:

It seems my environment was reset, so I need to re-import the necessary tools. Let me reprocess the image and extract the OCR text with the updated criteria.

Then ChatGPT’s OCR looked like this:

“THE WORLD’S LAST WONDER.Fon Taz Tocsw.)Cmaprer I. + 7I was sitting in my room, or what by a politefiction was called mine—in reality it belonged –to somebody else, whose sole business on earthWas to extort’\’w charge from’ other people for»leave to be ona particular part of it. As formyself, it appeared just then that Thad no |Business on earth at all, althongh it appeatedjuss certain that I could not get away from‘the earth, How Sir Mark Solomon came to‘own my room. was thus .**

I understand they’re using the Tesseract engine. I have that on my PC but have found it struggles with scans of old newspapers. But why was ChatGPT reset?

My guess is that, as a subscriber who had also recently spent on tokens and contacted support about a tier issue, I might have had extra attention paid to what I was doing, especially as I was asking for the same image to be OCR’d several times. They probably reviewed the discussion, realised that ChatGPT wasn’t really performing OCR, and pulled the plug on that feature to have a closer look. After all, while fiction isn’t that important in the grand scheme of things, incorrect OCR of old legal texts definitely is, especially if ChatGPT wants to change words around.

Still, it’s a great start, and for the majority of the image it did a reasonable job. I’m looking forward to seeing a more accurate and less creative OCR LLM system in the future.
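
In the meantime, checking a transcription by eye is slow, so here’s a minimal sketch in Python of one way to flag word-level changes automatically. It assumes you have typed out a short passage by hand to act as a trusted reference, and it uses the standard library’s difflib to compare that reference with the model’s output word by word. The function name word_changes is hypothetical, made up for this illustration, and is not part of any OCR tool.

```python
import difflib


def word_changes(original: str, transcription: str):
    """List word-level differences between a reference text and a transcription.

    Each entry is (tag, words_in_original, words_in_transcription), where tag is
    'insert', 'delete' or 'replace' as reported by difflib.
    """
    a = original.split()
    b = transcription.split()
    # autojunk=False stops difflib from ignoring very common words in long texts.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Hand-typed reference versus the model's output, taken from the column 4 example.
reference = "He then took me a journey to where"
model_output = "He then took me on a journey to where"

for change in word_changes(reference, model_output):
    print(change)  # ('insert', '', 'on')
```

Run on the column 4 example, this reports a single inserted word, ‘on’. It only catches changes in passages you have already checked by hand, so it works as a spot check rather than a replacement for reading the scan.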
