2024-04-19 NeilJHogan

Learning Digital Humanities tools such as Python to analyse 19th-20th century newspaper fiction

To Be Continued is a database of fiction published in Australian newspapers from the 19th century until the 1950s, and it is an integral part of my PhD thesis. As of February 2024, the database was made up of over 580,000 pieces of text in the form of chapters and half chapters, which together formed about 51,000 stories.

Because the database is relatively large, I felt I needed to focus on a small range of years and a single genre, as I thought it would be impossible to analyse all the stories first. I also believed that, even if I did learn the tools, the analysis would still take me months. With less than two years to go on my PhD, I doubted I would be able to complete it in time.

My supervisor kindly pointed out that things have moved forward in recent years and that, with the right tools, I could probably get the answers I wanted within 20 minutes!

The thing about digital humanities is that it IS possible to analyse 580,000 pieces of text using computational tools. But first, I had to learn how.

So, in the past month I have done the following:

Experimented with: MALLET, Orange Data Mining, KeyBERT

Worked through: University of Michigan’s free introductory Python course through edX

Watched YouTube videos on: SQLite, SQLiteStudio, Python (too many to list. Lots of great ones there! Special mention goes to Python Simplified).

I also had a copy of the To Be Continued SQLite database file, which meant I could use Python to run scripts that combined a db file with txt files and output the results to a csv or to more txt files. For example, I first wanted to concatenate those 580,000 pieces into their 51,000 stories in chapter ID order. No problem for a Python script. Did it in an hour.
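The concatenation step can be sketched roughly as follows. This is a minimal sketch, not the script actually used: the table name `chapters` and the columns `chapter_id`, `story_id` and `text` are assumptions made for illustration, and the real To Be Continued schema may well differ.

```python
import sqlite3
from pathlib import Path


def concatenate_stories(db_path, out_dir):
    """Write one txt file per story, with its chapters joined in chapter ID order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    written = 0
    try:
        # Hypothetical schema: chapters(chapter_id, story_id, text).
        rows = conn.execute(
            "SELECT story_id, text FROM chapters ORDER BY story_id, chapter_id"
        )
        current_id, parts = None, []
        for story_id, text in rows:
            # A new story_id means the previous story is complete: write it out.
            if story_id != current_id and parts:
                target = out_dir / f"{current_id}.txt"
                target.write_text("\n\n".join(parts), encoding="utf-8")
                written += 1
                parts = []
            current_id = story_id
            parts.append(text or "")
        # Write the final story once the rows run out.
        if parts:
            target = out_dir / f"{current_id}.txt"
            target.write_text("\n\n".join(parts), encoding="utf-8")
            written += 1
    finally:
        conn.close()
    return written
```

Letting the database do the sorting (`ORDER BY story_id, chapter_id`) means the script only ever holds one story in memory at a time, which matters with several hundred thousand rows.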

I can’t take the credit for the coding. I just needed to know the basics to know what to ask ChatGPT to do. I needed it to access certain files, with specific imported modules, in a certain way and a certain order, run a loop to move through all the files, then output them how I wanted them. If I hadn’t known what to ask ChatGPT, it would likely have given me something that simply didn’t work (which happened a few times when I asked the wrong thing).

This enabled me to ask ChatGPT to help me write several different scripts. After much trial and error, I ended up with about 7 useful scripts from about 30 attempts. Of course, it’s not really ChatGPT. It’s all those decades of Python scripts it has absorbed from Python enthusiasts posting their work online since 1991. You’re all legends.

With 51,000 concatenated files I now had data that could be topic modelled. However, after much research I found that topic modelling fiction doesn’t work as well as topic modelling non-fiction. That is, even though there were tools out there that could do it, the results would not be as clear as they would be for a document focused on one thing. Topic modelling doesn’t do well with endless dialogue, for example.

Instead, I ran those concatenated files through a script I’d worked out that output the 100 most frequent nouns per story, in order of frequency, into a csv, in the hope of quickly locating a particular genre. My theory was that if it output nouns like murder, detective, body, morgue, killer, then I could easily guess it was a crime story. Equally, if it output scientist, science, experiment, invention, professor, I might guess it was a proto-science-fiction story.
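The counting half of that idea can be sketched as below. This is an illustration under assumptions rather than the original script: it expects each story to arrive already split into (word, tag) pairs from a part-of-speech tagger (NLTK’s `pos_tag` is one tool that produces pairs like this), it treats the Penn Treebank tags `NN` and `NNS` as "noun", and the function names are made up for the example.

```python
import csv
from collections import Counter

# Assumed tag set: Penn Treebank tags for singular and plural common nouns.
NOUN_TAGS = {"NN", "NNS"}


def top_nouns(tagged_tokens, limit=100):
    """Return the most frequent nouns, most common first."""
    counts = Counter(
        word.lower()
        for word, tag in tagged_tokens
        if tag in NOUN_TAGS and word.isalpha()
    )
    return [word for word, _ in counts.most_common(limit)]


def write_noun_rows(stories, handle, limit=100):
    """Write one csv row per story: the story id followed by its top nouns."""
    writer = csv.writer(handle)
    for story_id, tagged_tokens in stories.items():
        writer.writerow([story_id, *top_nouns(tagged_tokens, limit)])
```

Requiring `word.isalpha()` is a crude way to drop some OCR debris such as tokens containing digits or stray punctuation. It will not catch misspelled words that are still all letters.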

But, as newspapers published predominantly romantic fiction, even after I had narrowed the selection to the years 1901-1939, the most common nouns were ‘man’, ‘girl’, and ‘time’, which don’t tell me anything. These were followed by the equally useless ‘face’ and ‘door’. (This selection was only 7,770 lines, so it was easy to look through, unlike my first attempt, which ran over all the unconcatenated txts with poorly scripted output and ended up being over a million lines.) There were also a lot of poorly OCR’d nouns, so I cleaned those out, ending with an average of 50 nouns per line. Near the end of each line I could start to see nouns that might give an idea of genre, but 7,700 lines is a bit time consuming to read, so what I really wanted was a program that looked at those nouns for me and could make the same kind of supposition that I can.

Was there anything like that?

No!

So, I tried to create my own. Spoiler alert. It failed. But I thought you might like to know the process.

I discussed the issue with ChatGPT and it suggested I look at word lists and see if I could match the nouns to a list, where each word list represented a category or genre. So, I looked online for word lists and found WordNet, a repository of words in categories, but also one where you can click on a word and find the word above it. The word above ‘murder’, for example, would be ‘homicide’. This is a direct hypernym. (Words below are known as hyponyms, so in this case they would be mainly what you’d find in a thesaurus.) After learning how WordNet works, I went back to ChatGPT and wrote a script that used WordNet to analyse each of the nouns in the spreadsheet and output their hypernyms, then looked for the most common hypernyms and output those.

The hypothesis was that I could find several matching direct hypernyms. For example, if the noun list showed ‘assassination’, ‘contract killing’, ‘shoot-down’, ‘bloodshed’, all those words would end up listing their direct hypernym as ‘murder’. The script would then output just a few words to describe the story, maybe even a recognisable genre word.

Imagine if I could just decide a rough genre tag for every story in the database in just 20 minutes?! Hmm. Why hasn’t this been done before?

Well, because it doesn’t work.

Invariably, using the faulty logic of ‘most common nouns’ resulted in concept-level results. As said before, with man, girl and time being the most common nouns, most of the stories ended up having direct hypernyms of ‘organism’ or ‘being’. Reviewing everything that led up to this point, I just groaned aloud. How stupid of me. Of course it would!
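The voting logic, and the reason it collapses, can be shown with a toy example. To be clear about the assumptions: the small dictionary below is a hand-made stand-in for WordNet, built only from the examples mentioned in this post, and is not real WordNet output. In a real run the lookups might come from something like NLTK’s WordNet interface instead.

```python
from collections import Counter

# Toy stand-in for WordNet lookups (illustrative only, not real WordNet data).
TOY_HYPERNYMS = {
    "assassination": "murder",
    "bloodshed": "murder",
    "shoot-down": "murder",
    "man": "organism",
    "girl": "organism",
}


def most_common_hypernyms(nouns, hypernym_of, limit=3):
    """Count each noun's direct hypernym and return the most frequent ones."""
    counts = Counter(hypernym_of[noun] for noun in nouns if noun in hypernym_of)
    return counts.most_common(limit)


# The hoped-for case: a noun list dominated by genre-specific words.
crime_nouns = ["assassination", "bloodshed", "shoot-down"]

# The actual case: generic nouns come first and outvote the telling ones.
typical_nouns = ["man", "girl", "time", "face", "door", "assassination"]
```

With `crime_nouns` the top hypernym is ‘murder’, but with `typical_nouns` it is ‘organism’, because the generic nouns that appear in nearly every story outnumber the ones that carry the genre signal.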

Back to the drawing board. Not the most common nouns. It has to be the most important nouns. And to do that, I need to do some intensive reading on what other scholars have done. KeyBERT has potential, but its current output is garbled, so I need to figure out what I’m doing wrong there. I might end up going back to training a classifier to detect genre another way. So, not quite a dead end. I am getting clarity on where I need to focus to find a more accurate way to identify a rough genre, one that does not involve going through thousands of lines of nouns. Then, I can give that result to the world.

Incidentally, if you do get ChatGPT to write something for you, you’ll need to remind it to do a lot of things:
1) Export the data to the csv/folder as it’s found, rather than waiting until the end.
2) Use an ‘r’ prefix or double backslashes for file paths, so that Python doesn’t read a backslash as the start of an escape sequence (it’s not programmed to include that).
3) It won’t assume a delay is needed for some APIs, like Trove’s.
4) It’s out of date, so it doesn’t know how to deal with Trove V3, which was released this year.
5) If you’re not careful with the language you use with it, and the code you first started working on has been updated during the conversation, it might add your latest update but go back to the original code, forgetting all the other updates.
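The first three reminders can be illustrated in a few lines. This is only a sketch: `process_incrementally`, its `work` argument and the example path are hypothetical names invented here, and the right delay depends on the API you are calling.

```python
import csv
import time

# Reminder 2: the r prefix keeps backslashes literal. Without it, the \n in
# this (hypothetical) Windows path would be read as a newline character.
EXAMPLE_PATH = r"C:\data\new_stories\results.csv"


def process_incrementally(items, work, csv_path, delay_seconds=1.0):
    """Append one csv row per item as soon as its result is ready."""
    done = 0
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for item in items:
            writer.writerow([item, work(item)])
            # Reminder 1: hand the row to the operating system straight away,
            # so a crash halfway through loses very little finished work.
            handle.flush()
            done += 1
            # Reminder 3: pause between calls so an API is not flooded.
            time.sleep(delay_seconds)
    return done
```

Opening the file in append mode (`"a"`) also means a rerun after a crash adds to the existing rows instead of wiping them.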

In my searching, though, I did discover something new. From about 1877 to about 1912 (from memory), there were many stories featuring the keywords ‘scientist’, ‘invention’ and ‘experiment’ that are not science fiction. In fact, I found about 900 references just searching with those three keywords through my concatenated txt files. It seems that this was a popular trope at the time. Some of the stories had the following plots:

Girl wants to marry scientist, but he wants to complete his invention to have enough money to support her
Scientist has an invention that he needs to sell to have money to impress the girl he wants
Family wants a girl to marry a scientist so that she is secure, yet the scientist hasn’t made any inventions yet that make money
Old scientist is pursued by young girl and ignores advances until she helps him with his experiments
Scientist’s invention is wanted around the world, and he needs to hide it from nefarious people until he can sell it.
Scientist’s invention is stolen, and the world could be in danger if he doesn’t get it back
Girl leaves scientist because he spends too much time with his inventions
Scientist wastes all his money on inventions and ignores the love in front of him

Of the ones I took a close look at, there doesn’t seem to be any science in them. The inventions are never explained. So, these don’t even count as ‘science in fiction’ stories which could have made them part of the ‘early science fiction’ genre.

Several of the science fiction stories I had found before also came up for those keywords so it may be there are other science fiction stories in those 900.
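A keyword search like this over a folder of txt files might look something like the sketch below. It is an assumption-laden illustration, not the search actually run: the function name is invented, and `require_all` defaults to demanding all three keywords in a story, which may or may not match how the original 900 references were counted.

```python
import re
from pathlib import Path

KEYWORDS = ("scientist", "invention", "experiment")


def find_keyword_stories(folder, keywords=KEYWORDS, require_all=True):
    """Return the names of txt files that mention the keywords."""
    # \w* lets 'scientist' also match 'scientists', 'experiment' match
    # 'experiments', and so on. Matching ignores upper and lower case.
    patterns = [
        re.compile(rf"\b{re.escape(word)}\w*", re.IGNORECASE) for word in keywords
    ]
    hits = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        found = [bool(pattern.search(text)) for pattern in patterns]
        if (all(found) if require_all else any(found)):
            hits.append(path.name)
    return hits
```

Calling it with `require_all=False` casts a wider net, returning any story that mentions at least one of the keywords.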

Even so, I think it is interesting that this trope was popular for about 35 years then seemed to vanish from all the newspapers in Australia. Then again, not all my concatenated files have all the chapters yet, so there could be more to find.

Oh, yeah, and I also tried to work with Trove’s API but haven’t been able to, so I moved on to working with a great Harvester script.

That’s my report for this month. Thanks for reading.
