I’m pleased to announce some success in classifying texts into fiction genres with ChatGPT, using the ChatGPT API and a bespoke Python script. It was a bit expensive, but as this was crucial to my entire PhD, it was worth it.
If you’ve come straight to this post without knowing about my research, here’s a quick rundown.
I’m researching science fiction in early 20th century Australian newspapers (1901-1940) with a focus on the To Be Continued Fiction Database (TBC), which contains extracted fiction texts from the National Library of Australia’s Trove database. TBC doesn’t have every chapter, though a team is working on that. Also, it contains a lot of poor OCR, though with the help of volunteers, that text is slowly being corrected.
However, none of the 51,000 stories in the TBC have an assigned genre. So I started by creating groups of keywords, which I dubbed keyclouds, to represent popular fiction genres. But when I got a copy of the TBC database to search with the keyclouds, I found that every chapter was a separate file, so the chapters first had to be concatenated into their story groups. As many stories had been syndicated, and some of those records were missing chapters, I decided the best way to use the keyclouds was to pack every version and every chapter copy of the same story into a single file. That left me with 25,100 stories to search. Using this corpus and my keyclouds (each keycloud took several weeks to develop), I was able to locate stories in about 20 different popular fiction genres (there may be about 140 subgenres, but I needed to start quickly).
This led me to about 50 science fiction stories. Once I had found those, I knew there had to be more, but keyclouds can only do so much. For example, this keycloud, “terror dread nightmare death scream blood madness darkness evil grave doom kill horror run fear murderer monster creature hunt frighten creep corpse danger dead knife alone fight secure victim beating soul villain scream village die bloody shadow suspicion violent heart”, found 65 stories in the Horror genre. But because most stories have poor OCR or are missing many chapters, a keycloud will only find stories that contain those words somewhere, so it won’t retrieve less complete stories. Also, the only keycloud I’ve really spent a lot of time on is the Planetary keycloud, which took days to refine; this Horror keycloud probably contains words that stop stories from appearing, such as ‘villain’.
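For readers who want to see the mechanics, here is a minimal sketch of a keycloud search over the concatenated story files. The directory layout is a hypothetical stand-in, not my actual file structure, and the all-words rule simply mirrors the behaviour described above.

```python
from pathlib import Path

# The Horror keycloud from above (abridged here for readability).
HORROR_KEYCLOUD = {
    "terror", "dread", "nightmare", "death", "scream", "blood", "madness",
    "darkness", "evil", "grave", "doom", "kill", "horror", "fear", "monster",
    "creature", "corpse", "victim", "villain", "shadow", "violent", "heart",
}

# Hypothetical layout: one concatenated .txt file per story.
matches = []
for story_file in Path("concatenated_stories").glob("*.txt"):
    words = set(story_file.read_text(errors="ignore").lower().split())
    # Every keycloud word must appear somewhere in the story, which is why
    # incomplete or badly OCR'd stories are missed, and why a single word
    # like 'villain' can stop a story from appearing at all.
    if HORROR_KEYCLOUD <= words:
        matches.append(story_file.name)

print(f"{len(matches)} candidate Horror stories")
```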
I then started working through various digital humanities tools and read about 150 papers on automatic genre identification. What I found was that fiction isn’t well served in this area: it was rarely the focus of a paper, and the papers that did focus on it had limited success. Even so, with my decades of fiction writing, I knew why their approaches didn’t work. (That will be part of a paper I’m working on.) After trying MALLET, KeyBERT, and various other classification methods and tools, I ended up using a combination of TF-IDF, bag of concepts, an ESA model trained on 20 popular vintage genres, and a few other features in a Python script. This clustered the 21,000 stories into various groups, though I suspect 100 clusters would have been better than 20; I was running out of time. Even so, the keyclouds, these steps, and the final clustering experiment got me to a total of around 70 stories in the era I was focused on, as well as 50 stories outside it. (The ESA results will be part of a future paper.)
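For anyone wanting to try the clustering step, here is a minimal sketch of TF-IDF plus k-means with scikit-learn. It leaves out the bag-of-concepts and ESA features entirely, and the directory name and vocabulary cap are my assumptions rather than the configuration I actually used.

```python
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Load the concatenated story files (hypothetical directory layout).
paths = sorted(Path("concatenated_stories").glob("*.txt"))
texts = [p.read_text(errors="ignore") for p in paths]

# TF-IDF features; capping the vocabulary keeps poor OCR from inflating it.
vectorizer = TfidfVectorizer(max_features=20_000, stop_words="english")
X = vectorizer.fit_transform(texts)

# Cluster into 20 groups, one per target genre (100 may separate better).
km = KMeans(n_clusters=20, n_init=10, random_state=42)
labels = km.fit_predict(X)

for path, label in zip(paths, labels):
    print(label, path.name)
```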
Still, this clustering didn’t give me great results. And there was one thing I hadn’t tried yet. While I had been able to give ChatGPT a text file and ask it to summarise it, I hadn’t yet asked it to identify the genre of a story, since its language is 21st century; I doubted it could deal with 19th and early 20th century vocabulary that has since fallen out of use. What I hadn’t considered was that ChatGPT’s extensive role-playing training could be used to recalibrate it as a vintage fiction researcher! So we discussed early genres and developed a way of assigning a genre to a text. It can take up to 10 text files at a time, analyse the first 1200-1500 words of each, summarise them, and suggest which genre that section represents. I tested it on several stories and it did quite well. I refined the prompts and added to the genres, then, with ChatGPT’s help, created a Python script that would use the ChatGPT API to analyse all of the concatenated files on my laptop. As fiction invariably mixes genres, I designed it to assign three genres per story. I also found that the first 1500 words were usually sufficient to determine a genre. It didn’t have to be entirely accurate; I just wanted to find more stories that could be identified as science fiction, and those would no doubt stand out.
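To give a sense of the approach before the full code goes up, here is a minimal sketch of the kind of API call involved. The prompt wording, model name, and abridged genre list are illustrative stand-ins, not my exact script.

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative subset of the vintage genre list the real script offers.
GENRES = ("Adventure, Colonial, Dystopian, Horror, Invention, "
          "Planetary, Romance, Travel, Western")

SYSTEM_PROMPT = (
    "You are a researcher specialising in 19th and early 20th century "
    "popular fiction. The text may contain OCR errors; interpret it charitably."
)

def classify(story_path: Path) -> str:
    # Only send roughly the first 1500 words to keep token costs down.
    excerpt = " ".join(story_path.read_text(errors="ignore").split()[:1500])
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption; any chat model would do
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"Summarise this story excerpt in two sentences, then suggest "
                f"the three genres from this list that best fit it: {GENRES}.\n\n"
                f"{excerpt}"
            )},
        ],
        max_tokens=150,
    )
    return response.choices[0].message.content

print(classify(Path("concatenated_stories/the_dark_planet_1913.txt")))
```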
Through this process, ChatGPT found about 500 more stories that could belong to the ‘invention’ genre. I skimmed through these and determined that 20 of them were predictive enough, and contained enough science, to be considered science fiction in that era. I won’t go into the genres here, as the explanation for why these and not the others would probably run two pages. Also, by today’s definition of science fiction, under which speculative, fantasy and superhero fiction can all be considered science fiction, there are probably at least 1,000 stories in the TBC that would qualify! I’m only looking for science fiction that followed the predictive ‘science in fiction’ definition of the time.
In any case, I now have my 97 stories to analyse, and a method to find more if I need to. Which is great. And as a bonus, researchers now have the first real Automated Popular English Fiction Genre Classification System! I’ll add the Python code for everyone to copy as a separate page soon.
Here’s the summary:
- Extracted stories from the TBC database (approx 51,000)
- Sorted chapter files into story-name folders (approx. 680,000 text files) (Python)
- Renamed chapter files with their chapter numbers (Python)
- Concatenated the files using the story names, year, TBC code, and chapter numbers. (Roman numeral headings sort alphabetically, so Chapter IX comes before Chapter V and L before X, unfortunately, but that doesn’t matter at this stage; see the sorting sketch after this list.) (various Python scripts)
- Created a Python script with ChatGPT for the API. Due to costs (roughly $0.007 per file, depending on tokens sent and received) I only asked the ChatGPT API to read the first few pages. I added several 19th and early 20th century fiction genres into the script for it to choose from. Summary of results from the three versions of the script:
  - Version 1: CSV with title, science fiction level rated 1-5 plus a high/moderate/other flag, story summary, and OCR level A-E; 1,000 tokens read (approx. 750 words), 100 tokens allowed for results per line. (Multiple combinations of this.) Results were promising but not that good; discarded.
  - Version 2: CSV with title, science fiction level rated 1-10 plus a high/moderate/other flag, story summary, OCR level A-E, and three suggested genres; 1,500 tokens read (approx. 1,125 words), 150 tokens allowed for results per line. Results were good.
  - Version 3: CSV with title, science fiction level rated 1-10 plus a high/moderate/other flag, summary, OCR level A-E, and three suggested genres; 2,000 tokens read (approx. 1,500 words), 100 tokens allowed for results per line. Results were great, though with only 100 tokens the genre occasionally wasn’t listed next to the story; the summary was usually good enough to fill in the gaps when that happened.
- Note that I didn’t clean the OCR or gather all the chapters for this experiment. In my experience ChatGPT can interpret poor OCR in large batches. The more text, the more accurate the summary, but with potentially 2.5 billion words to scan through in an incomplete database, and possibly 10 billion in a complete one, I don’t have the budget.
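Since the Roman numeral ordering issue mentioned above is easy to hit, here is a small sketch of a sort key that fixes it. The filename pattern is a hypothetical stand-in for the actual TBC naming scheme.

```python
import re

# Map Roman numeral symbols to values so "IX" sorts after "V" numerically.
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}

def roman_to_int(numeral: str) -> int:
    total = 0
    values = [ROMAN[ch] for ch in numeral]
    for i, value in enumerate(values):
        # Subtractive notation: a smaller value before a larger one (e.g. IX).
        total += -value if i + 1 < len(values) and value < values[i + 1] else value
    return total

def chapter_key(filename: str) -> int:
    # Hypothetical pattern: "story-name_chapter-IX.txt"
    match = re.search(r"chapter-([IVXLC]+)", filename)
    return roman_to_int(match.group(1)) if match else 0

files = ["story_chapter-IX.txt", "story_chapter-V.txt", "story_chapter-X.txt"]
print(sorted(files, key=chapter_key))
# ['story_chapter-V.txt', 'story_chapter-IX.txt', 'story_chapter-X.txt']
```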
Are the genres accurate? Yes, with a few caveats.
I skimmed most of the stories ChatGPT had rated 1-10, so at least 3,000 of them, plus some rated 0. Its interpretations of the first 1,500 words were good. The caveat is that many stories don’t stick to a single genre, so if the genre changes dramatically later in the story with no foreshadowing, ChatGPT won’t see it in those first 1,500 words (of whichever chapters the file starts with).
The Dark Planet (1913), for example: the section ChatGPT analysed didn’t feature enough markers to show the main character was on another planet (it only saw the character teaching a medieval-level society about science). It still assigned the story a level 6, which I thought was pretty good, and the genres it chose are a good reflection: Dystopian, Invention, Colonial. If it had more of the story it would probably have replaced Colonial with Planetary.
Another example is The Lighthouse of Bluff Cape (1887), the first Australian fiction story I’ve found so far to mention a ‘submarine vessel’, which jumps 30 years into the future in its final chapters. Adventure, Colonial, Travel are correct; if ChatGPT had had access to those final chapters and could add an extra genre, it might have added Prediction.
One story I found had really strange OCR in which ‘heir’ was replaced with ‘AIR’ several times, which confused ChatGPT enough that it decided the business meeting must be happening in an advanced aerial craft above the city. Unfortunately, no.
So, fairly accurate. Give it complete, perfectly clean records, along with detailed genre definitions of the time, and no doubt it would successfully identify the genre or genres of a fiction story.
Visit this page to copy the code and view an example of the output:
If you plan to use the script in your own fiction genre identification project, please leave a comment. It would be great to hear how this has helped other researchers.
If you wish to refer to this in your next paper or thesis, here’s an example citation you can use:
Hogan, Neil. “Automated Popular English Fiction Genre Classification System.” Retrieving “Science in Fiction” from Early 20th Century Australian Newspapers, Neil Hogan, 2 Nov. 2024, https://neilhogan.com/automated-popular-english-fiction-genre-classification-system/.