---
title: "SherlockHolmes Part I"
author: Barry Zeeberg [aut, cre]
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{SherlockHolmes Part I}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

SherlockHolmes: An R Program to Analyze the Hidden Structure of Sherlock Holmes Stories by Statistical Pattern Analysis of Concordances

 

 

Barry Zeeberg

barryz2013@gmail.com


Motivation

 

Although Arthur Conan Doyle was best known for his 60 Sherlock Holmes stories, he was a prolific writer of many other works [https://en.wikipedia.org/wiki/Arthur_Conan_Doyle].

 

For many decades, I have been interested in the Sherlock Holmes stories. I have also had an interest in the Dutch artist Johannes Vermeer, the American artist Edward Hopper, and the American bluesman Robert Johnson. Perhaps it is just some peculiarity of my own subjective perception, but for each of these I tend to categorize, e.g., “real Vermeers” versus “fake Vermeers.” I do not mean “fake” in the sense of a forgery. I mean that certain of the Vermeer paintings strike me as representing why he is so highly regarded, and others are more or less “pedestrian” (as a math professor used to say about calculus proofs that were fairly routine).

 

I found that some of the Sherlock Holmes stories seemed more like stories about something else, with Sherlock added as an afterthought. Without any actual knowledge of the subject, I assumed that Sherlock was very popular, and that Conan Doyle could just use Sherlock as “bait” to attract more readers to his “other” stories.

 

It would be rather tedious to read each story and tabulate how much of a story was really Sherlock detecting, and how much was seemingly thousands of pages about soldiers in India or the KKK. It occurred to me that I could write an R program to perform a concordance that might shed some light on the matter in an objective manner. It is always gratifying to “validate” my subjective biases using an objective procedure :-)

 

The idea that I had was that Watson was usually present in the stories, and he would either address Sherlock directly, calling him “Holmes,” or he would mention that “Holmes did or said such and such.” The point is that, thanks to the presence of Watson as his chronicler, the literal string “Holmes” could be used as a proxy for the presence or activity of Sherlock. I am not sure that this could be done as successfully in general.

 

Technical Methods

 

All analyses are performed by invoking

`Sherlock(titles, texts, patterns, toupper, odir, minl = 100, P = 0.00001, verbose = FALSE)`

* titles is a character string containing the full path name for a text file containing the titles of the stories in the same order that they appear in the texts file. If titles=="NONE", the entire book is treated as one story.

* texts is a character string containing the full path name for a text file containing the full texts of all of the stories. Each story should be preceded by the title matching that given in the titles file.

* patterns is a vector containing the search patterns.

* toupper is a Boolean; if TRUE, the titles are converted to upper case.

* odir is a character string containing the full path name of the output directory.

* minl is an integer parameter passed to dpseg::dpseg.

* P is a numeric parameter passed to dpseg::dpseg.

* verbose is a Boolean; FALSE suppresses optional diagnostic output to the console.
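As a rough illustration of such a call, the sketch below assembles the arguments and invokes `Sherlock()` only if the package can be found. The package name `SherlockHolmes`, the use of `system.file()` to locate the abbreviated `contents3.txt` and `processed_download3.txt` files, the choice `toupper = TRUE`, and the temporary output directory are all assumptions made for this example; only the argument names and defaults come from the signature above.

```r
# Hypothetical usage sketch: package name, file locations, toupper value,
# and output directory are assumptions, not taken from the package docs.
pkg <- "SherlockHolmes"

args <- list(
  titles   = system.file("extdata", "contents3.txt", package = pkg),
  texts    = system.file("extdata", "processed_download3.txt", package = pkg),
  patterns = c("Holmes"),
  toupper  = TRUE,
  odir     = tempdir(),
  minl     = 100,
  P        = 0.00001,
  verbose  = FALSE
)

# system.file() returns "" when the package or file is not found,
# so the call is skipped unless everything is in place.
if (requireNamespace(pkg, quietly = TRUE) && nzchar(args$titles)) {
  do.call(getExportedValue(pkg, "Sherlock"), args)
}
```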

In two of the figures presented below (Figures 5 and 6), I used an excellent R language package, “dpseg: Piecewise Linear Segmentation by Dynamic Programming,” by Rainer Machne and Peter Stadler. This package implements piecewise linear regression modeling, and enables a quantitative analysis of the results in those 2 figures. Dr. Machne kindly provided me with the source code for the plotting function, which permitted me to make several minor custom modifications.
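To give a feel for what the segmentation does, the following sketch runs `dpseg::dpseg` on an invented cumulative count that stays flat for 200 positions and then rises steadily. The toy data and variable names are assumptions; `minl` (minimal segment length) and `P` (break-point penalty) are the two parameters that `Sherlock()` forwards, shown here with its default values. The call is skipped if dpseg is not installed.

```r
# Toy cumulative count: no hits for 200 positions, then one hit every other position.
hits <- c(rep(0, 200), rep(c(1, 0), 100))
y <- cumsum(hits)   # cumulative number of pattern hits
x <- seq_along(y)   # position in the text

if (requireNamespace("dpseg", quietly = TRUE)) {
  # minl: minimal segment length; P: break-point penalty
  seg <- dpseg::dpseg(x = x, y = y, minl = 100, P = 0.00001)
  print(seg$segments)   # one row per fitted linear segment
}
```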

Literary Methods

 

An excellent single text file containing all of the stories is readily available online [https://sherlock-holm.es/stories/plain-text/cano.txt]. A small amount of manual editing was required to prepare this file for automated analysis (mostly consisting of deleting some passages that were not part of the story, and some chapter subtitles, etc.).


The full version (contents.txt and processed_download.txt) and an abbreviated version (contents3.txt and processed_download3.txt) of the titles and text files for the Sherlock Holmes stories are provided in /inst/extdata.


The 60 Sherlock Holmes stories comprise 4 novels and 56 short stories. My recollection is that the 4 novels are the most flagrant in not being “real” Sherlock Holmes stories, although there were plenty of short stories for which this is also true. There was one short story, “The Adventure of the Blue Carbuncle,” which I remember had a very high degree of participation by Holmes. I certainly expect that the analyses described below will be consistent with these subjective impressions.

 

I performed 2 types of analyses.

 

The first type counted the number of times that the search pattern “Holmes” appeared, in comparison with the total number of words, in each story. A low ratio of “Holmes”/total could indicate a “fake” Sherlock story.
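A minimal sketch of this first type of analysis is shown below. The helper name `pattern_fraction`, the whitespace-based definition of a word, and the three-line toy text are assumptions made for illustration; the package may tokenize differently.

```r
# Count occurrences of a literal pattern and the total number of words.
# Helper name and word definition (whitespace-separated tokens) are assumptions.
pattern_fraction <- function(lines, pattern = "Holmes") {
  # Number of literal matches of the pattern on each line (0 if none)
  hits_per_line <- lengths(regmatches(lines, gregexpr(pattern, lines, fixed = TRUE)))
  # Split on whitespace and drop empty tokens
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words)]
  n_hits <- sum(hits_per_line)
  n_words <- length(words)
  c(hits = n_hits, words = n_words, fraction = n_hits / n_words)
}

# Invented toy text, not taken from any story
story <- c(
  "Holmes looked up from his paper.",
  "I said nothing, and Holmes smiled.",
  "The fog rolled in over Baker Street."
)
res <- pattern_fraction(story)
res   # 2 hits among 19 words
```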

 

The second type constructed a cumulative distribution of “Holmes” and of total words. For example, if the pattern “Holmes” appeared twice in the first line of a story, not at all in the second, and once in the third, the pattern cumulative distribution would be 2 2 3. The total words cumulative distribution would be something like 10 21 29. If Holmes appeared throughout the whole story, we would see a cumulative distribution like 2 4 5 8 9 . . . 20 25 32. If Holmes appeared only at the end of the story, we would see a cumulative distribution like 0 0 0 0 . . . 2 4 5 8 9.
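The three-line example above can be reproduced with `cumsum()`. The per-line counts are the ones from the text; the variable names are chosen here for illustration.

```r
pattern_per_line <- c(2, 0, 1)    # "Holmes" twice, not at all, once
words_per_line   <- c(10, 11, 8)  # illustrative word counts per line

cum_pattern <- cumsum(pattern_per_line)
cum_words   <- cumsum(words_per_line)

cum_pattern   # 2 2 3
cum_words     # 10 21 29
```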

 

We can see whether there is a relationship between the results of the 2 types of analyses.

 

We can also see whether there is a “typical” result for the 2 types of analyses that holds for “most” of the stories, with just a few stories that deviate. Or perhaps there is no “typical,” and each story has its own characteristic analysis.

 

Results and Discussion

 

Basic Word Count Analysis

Table 1 shows that the fraction (number of instances of “Holmes”/total number of words) spans a more than 18-fold range, from a low of 0.00066 for “The Musgrave Ritual” to a high of 0.01209 for “The Adventure of the Three Gables.” This large range is consistent with the hypothesis that Holmes is but a minor character in a number of the stories.
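The size of the range follows directly from the two extreme fractions quoted above; the variable names are illustrative.

```r
frac_low  <- 0.00066   # "The Musgrave Ritual"
frac_high <- 0.01209   # "The Adventure of the Three Gables"

fold_range <- frac_high / frac_low
round(fold_range, 1)   # 18.3
```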


Table 1. Fraction values for the search pattern “Holmes,” across all 60 Sherlock Holmes stories.

| Title | Words | Fraction |
|:------|------:|---------:|
| A Study In Scarlet | 43167 | 0.002220 |
| The Sign of the Four | 42915 | 0.003150 |
| A Scandal in Bohemia | 8512 | 0.005640 |
| The Red-Headed League | 9098 | 0.005830 |
| A Case of Identity | 6971 | 0.006600 |
| The Boscombe Valley Mystery | 9614 | 0.004890 |
| The Five Orange Pips | 7312 | 0.003420 |
| The Man with the Twisted Lip | 9192 | 0.003150 |
| The Adventure of the Blue Carbuncle | 7805 | 0.004870 |
| The Adventure of the Speckled Band | 9801 | 0.005710 |
| The Adventure of the Engineer's Thumb | 8281 | 0.001690 |
| The Adventure of the Noble Bachelor | 8100 | 0.004200 |
| The Adventure of the Beryl Coronet | 9674 | 0.002890 |
| The Adventure of the Copper Beeches | 9943 | 0.004320 |
| Silver Blaze | 9573 | 0.005330 |
| The Yellow Face | 7497 | 0.002530 |
| The Stock-Broker's Clerk | 6782 | 0.003830 |
| The "Gloria Scott" | 7835 | 0.001020 |
| The Musgrave Ritual | 7568 | 0.000660 |
| The Reigate Squires | 7196 | 0.007370 |
| The Crooked Man | 7126 | 0.001540 |
| The Resident Patient | 6607 | 0.005900 |
| The Greek Interpreter | 6996 | 0.004150 |
| The Naval Treaty | 12603 | 0.005320 |
| The Final Problem | 7155 | 0.004190 |
| The Adventure of the Empty House | 8689 | 0.004600 |
| The Adventure of the Norwood Builder | 9213 | 0.006950 |
| The Adventure of the Dancing Men | 9702 | 0.006290 |
| The Adventure of the Solitary Cyclist | 7824 | 0.006260 |
| The Adventure of the Priory School | 11458 | 0.007510 |
| The Adventure of Black Peter | 8098 | 0.006790 |
| The Adventure of Charles Augustus Milverton | 6699 | 0.008360 |
| The Adventure of the Six Napoleons | 8319 | 0.006970 |
| The Adventure of the Three Students | 6456 | 0.007590 |
| The Adventure of the Golden Pince-Nez | 8921 | 0.006500 |
| The Adventure of the Missing Three-Quarter | 8011 | 0.006490 |
| The Adventure of the Abbey Grange | 9141 | 0.004490 |
| The Adventure of the Second Stain | 9621 | 0.008320 |
| The Hound of the Baskervilles | 59015 | 0.003250 |
| The Valley Of Fear | 57480 | 0.002610 |
| The Adventure of Wisteria Lodge | 11375 | 0.005450 |
| The Adventure of the Cardboard Box | 8510 | 0.003170 |
| The Adventure of the Red Circle | 7277 | 0.004260 |
| The Adventure of the Bruce-Partington Plans | 10668 | 0.005810 |
| The Adventure of the Dying Detective | 5769 | 0.008670 |
| The Disappearance of Lady Frances Carfax | 7665 | 0.007050 |
| The Adventure of the Devil's Foot | 9968 | 0.005920 |
| His Last Bow | 6054 | 0.003630 |
| The Illustrious Client | 9731 | 0.006170 |
| The Blanched Soldier | 7705 | 0.001690 |
| The Adventure Of The Mazarin Stone | 5639 | 0.009040 |
| The Adventure of the Three Gables | 6039 | 0.012090 |
| The Adventure of the Sussex Vampire | 5957 | 0.007550 |
| The Adventure of the Three Garridebs | 6184 | 0.008090 |
| The Problem of Thor Bridge | 9569 | 0.006170 |
| The Adventure of the Creeping Man | 7646 | 0.008240 |
| The Adventure of the Lion's Mane | 7171 | 0.001950 |
| The Adventure of the Veiled Lodger | 4457 | 0.004940 |
| The Adventure of Shoscombe Old Place | 6230 | 0.008190 |
| The Adventure of the Retired Colourman | 5498 | 0.007640 |

 

 

The 4 novels (“A Study in Scarlet,” “The Valley of Fear,” “The Sign of the Four,” and “The Hound of the Baskervilles”) are among the 14 lowest fractions, but are on an even footing with a substantial number of the short stories. This observation is consistent with the hypothesis that the longer novels are mostly a ploy to tell a long non-Holmesian story, but a good number of the short stories also were used for that purpose.

 

In the interest of full disclosure, I recall that “The Adventure of the Blue Carbuncle” was a story in which Holmes was actively pursuing clues, and I would have expected it to be at the top of the range of fractions. Yet its fraction is roughly in the middle of the range.

 

The data of Table 1 are displayed as a histogram (Figure 1), illustrating a roughly normal distribution of fraction values.


![](Picture1.jpg){width=90%}

Figure 1. Histogram of fraction values for the search pattern “Holmes,” across all 60 Sherlock Holmes stories.


We can perhaps somewhat arbitrarily divide the fraction values into 3 types:

 

0.00066 <= low < 0.004

 

0.004 <= normal < 0.008

 

0.008 <= high <= 0.01209
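One way to apply these three bins is with `cut()`, sketched below on a handful of fractions copied from Table 1. The choice of stories and the variable names are for illustration only; the break points are the ones listed above.

```r
# A few fractions from Table 1 (selection is arbitrary, for illustration)
fractions <- c(
  "The Musgrave Ritual"                 = 0.000660,
  "A Study In Scarlet"                  = 0.002220,
  "The Adventure of the Blue Carbuncle" = 0.004870,
  "The Reigate Squires"                 = 0.007370,
  "The Adventure of the Second Stain"   = 0.008320,
  "The Adventure of the Three Gables"   = 0.012090
)

type <- cut(
  fractions,
  breaks = c(0.00066, 0.004, 0.008, 0.01209),
  labels = c("low", "normal", "high"),
  right = FALSE,          # intervals are closed on the left: [a, b)
  include.lowest = TRUE   # with right = FALSE, closes the last interval at its upper end
)

data.frame(fraction = fractions, type = type)
```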

 

Another way to look at the same data is a scatter plot of fraction values as a function of the total number of words (Figure 2).

![](Picture2.jpg){width=90%}

Figure 2. Scatter plot of fraction values as a function of the total number of words in the story for the search pattern “Holmes,” across all 60 Sherlock Holmes stories.

 

As expected, we can clearly see a pattern for the 4 novels in the lower right corner. However, the short stories do not display a discernible pattern.

 

It is interesting that the fraction values tend to increase with the chronological order of publication (Figure 3).

![](Picture3.jpg){width=90%}

Figure 3. Scatter plot of fraction values as a function of the chronological order for the search pattern “Holmes,” across all 60 Sherlock Holmes stories.

 

However, the large amount of scatter in the data prevents this correlation from achieving statistical significance. Perhaps Conan Doyle eventually started feeling some remorse over “cheating” his loyal readers. The trend is consistent with 2 of the 4 novels being written as the first 2 stories. The other 2 novels were written around the middle chronologically, and like the first 2, they were written one after the other.

 

Cumulative Distribution Analysis

The results of the cumulative distribution analysis are given in a series of 60 graphs, one for each story. Let us first take the 2 most extreme stories, as they might be expected to most clearly show distinct characteristic behaviors.

 

“The Musgrave Ritual” (Figure 4) exhibited the lowest overall fraction value (0.00066).

![](Picture4.jpg){width=90%}

Figure 4. Cumulative distribution analysis for the search pattern “Holmes” in “The Musgrave Ritual.”

 

This story had so few instances of “Holmes” that I was worried that the program had made a mistake, so I examined this story directly. Yes, there really were just 5 instances of “Holmes.” The cumulative analysis (Figure 4) shows that after the first 2000 words (or around 25% of the story), “Holmes” is mentioned only once. This is consistent with the final 75% of the story not really being so much a Sherlock Holmes story as it is a story about a family ritual.

 

“The Adventure of the Three Gables” (Figure 5) exhibited the highest overall fraction value (0.012090).

![](Picture5.jpg){width=90%}

Figure 5. Cumulative distribution analysis for the search pattern “Holmes” in “The Adventure of the Three Gables.”

 

The cumulative graph is so different from that for “The Musgrave Ritual” that it is almost hard to believe that the 2 stories were written by the same author. The cumulative graph for “The Adventure of the Three Gables” shows an uninterrupted presence of Holmes throughout the whole story.

 

Remarkably, I remembered that an annotator had questioned whether one particular story was written by Conan Doyle, because Holmes utters some racist epithets, which is totally contrary to his character. Believe it or not, I just now looked up the name of that story, and it is in fact “The Adventure of the Three Gables”; see, e.g., [https://lesliesklinger.com/2020/07/07/the-elephant-in-the-room/].

 

Although I mentioned that the cumulative graph was very different from that for “The Musgrave Ritual,” there are many other stories with cumulative graphs that are qualitatively essentially identical to that for “The Adventure of the Three Gables,” so it cannot be ruled out as an authentic Conan Doyle story on that basis.

 

I had mentioned above that there was one short story, “The Adventure of the Blue Carbuncle” (Figure 6), which I remember had a very consistent degree of participation by Holmes. This recollection is borne out by the cumulative distribution analysis. The fraction value is 0.004870. According to the histogram (Figure 1), this value is around the mean for all 60 stories. The cumulative distribution graph (Figure 6) shows a consistent presence of “Holmes” throughout the entire story. Apparently the moderate number of mentions of “Holmes” was distributed evenly through the story.

 

 

![](Picture6.jpg){width=90%}

Figure 6. Cumulative distribution analysis for the search pattern “Holmes” in “The Adventure of the Blue Carbuncle.”

 

Enhanced Features

 

In order to keep this initial description more comprehensible, I did not present certain enhanced features that were added to the package after the manuscript was completed. These features will be presented in a subsequent manuscript.

These include:

* Integrated output directory format archiving the results of multiple analyses mentioned below

* Expand studies to many more authors and stories

* Rolling averages in addition to cumulative distributions

* Tabulate numerical values of linear regression analyses (permits other programs to perform automated analyses on the segmentation of texts)

* Multiple search patterns overlay on same plot (e.g., "Holmes" and "Watson")

* Concordance of frequency of words (within a window of several sentences surrounding the search pattern)

* Full text of surrounding sentences

* Histogram of over-abundance of words in the windows (e.g., how often does the word “woman” appear near the search term “Watson”)

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```