A deep blog dive using tools from the Digital Humanities (Ep 1)

I first began blogging here back in 2007 to think about the bits of technology which got me excited. Little did I think that, nearly 20 years later, I would still enjoy it as much as I did that very first day.

For the last decade, at the end of each year, I have done a round-up of the most popular blogs, mostly by looking at the hits they received. However, for the longest time, I have felt that there’s much more to the story in the stats, but I couldn’t quite decide how to mine the blog for that story.

I did think about applying digital humanities techniques to this blog back in 2019, after I had spent time reading up on how literary informatics librarians work when approaching the canon. In that blog, I refer to Dr Heather Froehlich looking for the presence and absence of characters or gender-specific words in the plays of Shakespeare, no doubt because she had a theory. The problem was that I didn’t have anything I wanted to find. I didn’t have a theory. I just wanted to find the patterns.

At the end of 2023, I printed out the whole blog aka my complete works, which was no small feat, given that it came in at 343,000 words – nowadays it must be half a million – hoping that if I read it I would see the patterns. The book gave me no end of joy to carry around but it turned out to be too difficult to analyse. There were patterns but nothing substantial.

Then I thought that I would use some AI but, as I have said in previous blogs about machine learning and natural language processing, these algorithms cannot find these patterns by themselves without a human supervising what they do. This is done by a human, or group of humans, carefully selecting the data on which the algorithm will be trained and/or marking up said data so that it can easily find the patterns. This was no help at all, as I would still have to decide what data to feed in and how to mark it up for training. Alternatively, I could use someone else’s neural network, but I would either have to use their pattern matching and understand what it was, or still need to know what I was looking for whilst I sifted through all the potential patterns it could give me.

Looking for mutants

This is not new: my first AI research project was, in fact, looking for patterns. It was looking for errors which I had seeded in a knowledge-based system (KBS). By writing test cases to look for these specific bugs, I would hopefully uncover other bugs of the same type which were in the code, thus eradicating that type of bug. I presented the results of this research at the European Symposium on the Validation and Verification of Knowledge Based Systems in St-Baldolph, France. We called the paper: Mutants in the KBS Testing Process.

Recently, I have been reading up about the use of digital humanities techniques once more, and this inspired me to take a different approach, a digital humanities approach, to my complete works, aka this blog. There is, apparently, a very big worry about the flattening of the humanities by making the digital humanities a big part of it, in a neoliberal displacement of hermeneutics. In other words, you can’t just reduce everything down to numbers, especially when you try to tie that to economics and measure its utility. Hear, hear! However, I’ve tried reading the blog and I need a bit of help. This blog is born digital, that is, apart from me putting it all together, it hasn’t existed in any other format. So, it is a very modern kind of archive and I find it all very exciting but very large. So, I am going to borrow some computing techniques which have often been employed in the humanities ever since the Italian Jesuit priest Roberto Busa first decided to apply computers to the study of the works of Thomas Aquinas (1225-1274).

Helped by Woolf and Pepys

Inspired, I started by looking at two very famous diaries, those of Virginia Woolf and Samuel Pepys, and the ways in which I could look at my blog as less of a complete works and more like a diary. What similar questions might I want to ask of my blog? And how would mine differ?

The one caveat is that Woolf and Pepys wrote their diaries in private, though the link above says that Pepys always had one eye on posterity. Woolf went back and forth on whether she should destroy them or not, as she appreciated that people would want to know her writing process, but it was a private record.

In contrast, I have always written my blog publicly. It is highly curated and thus a selective public performance. Alongside this, for many years I have kept a private diary on various installations of WordPress on local servers on my computer, never for public consumption. Indeed, the two times when I did accidentally publish a personal blog on my public WordPress, I was mortified for days and days and really hoped that no one had read it. Writing for a public space is very different indeed from writing for a private one, which I have blogged about. I asked: what is privacy? And what is the trade-off between privacy and intimacy on the WWW? Two years ago, I moved my private writings to Day One so that I may never mix the two again.

Both Pepys’s and Woolf’s diaries have been studied at length, with common themes emerging: emotional trends, personal philosophy, language and selfhood, and key historical events interspersed with their daily lives. It was those themes I was interested in, and how they could be applied to studying my blog. For Pepys, this meant one day witnessing the Plague or the Great Fire of London, and another buying a wig or sexually assaulting the maid. For Woolf, this was the way she hated the working-class staff she had in her house whilst witnessing WWI and writing the famous A Room of One’s Own promoting feminism (just not for working-class women who were smelly). I really don’t know why Jeanette Winterson bangs on about Virginia Woolf when VW wouldn’t have considered JW as a human being. I say this myself as a working-class woman who got into quite a lather considering Alison Light’s 2009 book: Mrs Woolf and the Servants.

But back to my blog. Looking at the tag cloud on the sidebar, I might say human-computer interaction (HCI) is the topic I write about the most, and topics such as women in technology not so much because I only wrote one series. However, by employing statistical techniques, I could uncover much more, and I really won’t know what I write about the most until I see the analysis results.

The deep dive

There’s a whole range of very interesting techniques, from topic modelling, in particular Latent Dirichlet Allocation (LDA), which would uncover hidden topics, to stylometry, which can measure vocabulary richness using a metric called type-token ratio (TTR), so that I can see how my writing shifts between blog posts and how it has changed over time.
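To give a flavour of the TTR measurement, here is a minimal sketch in Python, not the finished analysis. TTR is the number of distinct words (types) divided by the total number of words (tokens). The function names, the very simple tokenising rule and the two sample "posts" are all assumptions made up for illustration. Because plain TTR falls as a text gets longer, the sketch also includes a windowed average over fixed-size chunks, which is one common way of making posts of different lengths more comparable.

```python
import re


def tokenize(text):
    """Lowercase the text and pull out word tokens.

    A deliberately simple rule (an assumption, not a standard): runs of
    letters, optionally followed by an apostrophe and more letters.
    """
    return re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())


def type_token_ratio(text):
    """Distinct words (types) divided by total words (tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def windowed_ttr(text, window=100):
    """Mean TTR over consecutive, non-overlapping windows of `window` tokens.

    Falls back to the plain TTR when the text is shorter than one window.
    """
    tokens = tokenize(text)
    chunks = [
        tokens[i:i + window]
        for i in range(0, len(tokens) - window + 1, window)
    ]
    if not chunks:
        return type_token_ratio(text)
    return sum(len(set(chunk)) / len(chunk) for chunk in chunks) / len(chunks)


if __name__ == "__main__":
    # Hypothetical sample posts keyed by year; real input would be the blog text.
    posts = {
        2007: "I love technology and I love writing about technology",
        2024: "Patterns emerge slowly when reading an archive end to end",
    }
    for year, text in sorted(posts.items()):
        print(year, round(type_token_ratio(text), 3))
```

Run over every post and plotted by date, a measure like this would show whether my vocabulary has become richer or more repetitive over the years, though it says nothing on its own about topic or sentiment.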

I would write them in Python. I love a bit of Python, having taught it to my girls as well as having used it for sentiment analysis. I could then mine this blog for information and insight in the way that I imagined I could back in 2023. Overall, this project has the potential to finally prove my husband right when he says that I will one day disappear up inside my blog, but I think that I would be very happy in there should that day come.

I guess that I will be writing more about this once I begin coding up the various digital humanities statistical techniques. For now, I am including below a recording of The Deep Dive (alternatively, you can watch it on YouTube), in which the two hosts discuss how my blog is prime grey literature (similar to white papers from think tanks, which are highly researched but not commercially published), some examples of the techniques, and the ethics digital humanists must consider in and around web scraping, particularly if, say, a team came in and started analysing this site without my permission.

It is a fascinating area and I cannot wait to begin.

The Deep Dive on using digital humanities techniques to analyse the complete works of ruthstalkerfirth.com
