SherlockHolmes Part I

SherlockHolmes: An R Program to Analyze the Hidden Structure of Sherlock Holmes Stories by Statistical Pattern Analysis of Concordances

Barry Zeeberg

Motivation

Although Arthur Conan Doyle was best known for his 60 Sherlock Holmes stories, he was a prolific writer of many other works [https://en.wikipedia.org/wiki/Arthur_Conan_Doyle].

For many decades, I have been interested in the Sherlock Holmes stories. I have also had an interest in the Dutch artist Johannes Vermeer, the American artist Edward Hopper, and the American bluesman Robert Johnson. Perhaps it is just some peculiarity of my own subjective perception, but for each of these I tend to distinguish, e.g., “real Vermeers” from “fake Vermeers.” I do not mean “fake” in the sense of a forgery. I mean that certain of the Vermeer paintings strike me as representing why he is so highly regarded, and others are more or less “pedestrian” (as a math professor used to say about calculus proofs that were fairly routine).

I found that some of the Sherlock Holmes stories seemed more like stories about something else, with Sherlock added as an afterthought. Without any actual knowledge of the subject, I assumed that Sherlock was very popular, and that Conan Doyle could use Sherlock as “bait” to get more readership of his “other” stories.

It would be rather tedious to read each story and tabulate how much of a story was really Sherlock detecting, and how much was seemingly thousands of pages about soldiers in India or the KKK. It occurred to me that I could write an R program to perform a concordance that might shed some light on the matter in an objective manner. It is always gratifying to “validate” my subjective biases using an objective procedure :-)

The idea that I had was that Watson was usually present in the stories, and he would either address Sherlock directly, calling him “Holmes,” or he would mention “Holmes did or said such and such.” The point is that, thanks to the presence of Watson as his chronicler, the literal string “Holmes” could be used as a proxy for the presence or activity of Sherlock. I am not sure that this could be done as successfully for other authors or characters in general.

Technical Methods

All analyses are performed by invoking

Sherlock(titles,texts,patterns,toupper,odir,minl=100,P=0.00001,verbose=FALSE)

titles is a character string containing the full path name for a text file containing the titles of the stories in the same order that they appear in the texts file. If titles=="NONE", the entire book is treated as one story.
texts is a character string containing the full path name for a text file containing the full texts of all of the stories. Each story should be preceded by the title matching that given in the titles file.
patterns is a vector containing the search patterns.
toupper is a Boolean; set it to TRUE if the titles should be converted to upper case.
odir is a character string containing the full path name of the output directory.
minl is an integer parameter passed to dpseg::dpseg.
P is a numeric parameter passed to dpseg::dpseg.
verbose is a Boolean; FALSE (the default) suppresses optional diagnostic output to the console.
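
To illustrate how the titles and texts arguments relate to each other, here is a minimal sketch in base R. It is not the package's own code: split_stories is a hypothetical helper, and it assumes that each title occupies a line of its own in the text and that the titles are listed in the order in which they appear.

```r
# Hypothetical helper (an assumption, not the package's implementation):
# split a vector of text lines into stories, using title lines as delimiters.
split_stories <- function(titles, lines, toupper = FALSE) {
  # Optionally convert the titles to upper case before matching.
  if (toupper) titles <- base::toupper(titles)
  # Line index of the first exact occurrence of each title.
  starts <- match(titles, lines)
  if (anyNA(starts)) stop("title not found: ", titles[is.na(starts)][1])
  # Each story ends just before the next title (or at the end of the text).
  ends <- c(starts[-1] - 1, length(lines))
  stories <- lapply(seq_along(titles), function(i) {
    if (starts[i] + 1 > ends[i]) character(0) else lines[(starts[i] + 1):ends[i]]
  })
  names(stories) <- titles
  stories
}

# Toy input, invented purely for illustration.
lines <- c("FIRST STORY", "Holmes sat down.", "Watson waited.",
           "SECOND STORY", "It rained all day.")
stories <- split_stories(c("First Story", "Second Story"), lines, toupper = TRUE)
print(stories)
```

In this sketch the toupper flag plays the role described above: the titles are given in mixed case, while the text carries them in upper case.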

In two of the figures presented below (Figures 5 and 6), I used an excellent R language package “dpseg: Piecewise Linear Segmentation by Dynamic Programming” by Rainer Machne and Peter Stadler. This package implements piecewise linear regression modeling, and enables a quantitative analysis of the results in those 2 figures. Dr. Machne kindly provided me with the source code for the plotting function, which permitted me to make several minor custom modifications.

Literary Methods

An excellent single text file containing all of the stories is readily available online [https://sherlock-holm.es/stories/plain-text/cano.txt]. A small amount of manual editing was required to prepare this file for automated analysis (mostly consisting of deleting some passages that were not part of the story, and some chapter subtitles, etc.).

The full version (contents.txt and processed_download.txt), and an abbreviated version (contents3.txt and processed_download3.txt) of the titles and text files for the Sherlock Holmes stories are provided in /inst/extdata.

The 60 Sherlock Holmes stories comprise 4 novels and 56 short stories. My recollection is that the 4 novels are the most flagrant in not being “real” Sherlock Holmes stories, although there were plenty of short stories for which this is also true. There was one short story, “The Adventure of the Blue Carbuncle,” which I remember had a very high degree of participation by Holmes. I certainly expect that the analyses described below will be consistent with these subjective impressions.

I performed 2 types of analyses.

The first type counted the number of times that the search pattern “Holmes” appeared, in comparison with the total number of words, in each story. A low ratio of “Holmes”/total could indicate a “fake” Sherlock story.
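
The first type of analysis can be sketched in a few lines of base R. This is only an illustration under stated assumptions: pattern_fraction is a hypothetical helper, words are taken to be whitespace-separated tokens, and the pattern is matched as a literal string. The package may count words differently.

```r
# Hypothetical helper (an assumption, not the package's implementation):
# count literal occurrences of a pattern relative to the total word count.
pattern_fraction <- function(lines, pattern = "Holmes") {
  text <- paste(lines, collapse = " ")
  # Words are assumed to be whitespace-separated tokens.
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # gregexpr() returns -1 when the pattern does not occur at all.
  hits <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  n_hits <- if (hits[1] == -1) 0L else length(hits)
  c(hits = n_hits, words = length(words), fraction = n_hits / length(words))
}

# Toy story, invented purely for illustration.
story <- c("Holmes looked at Holmes in the mirror.",
           "Nothing happened today.",
           "Then Holmes spoke.")
res <- pattern_fraction(story)
print(res)
```

For this toy input the pattern occurs 3 times among 13 words, so the fraction is 3/13.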

The second type constructed a cumulative distribution of “Holmes” and of total words. For example, if the pattern “Holmes” appeared twice in the first line of story, zero in the second and once in the third, the pattern cumulative distribution would be 2 2 3. The total words cumulative distribution would be something like e.g. 10 21 29. If Holmes appeared throughout the whole story, we would see a cumulative distribution like e.g. 2 4 5 8 9 . . . 20 25 32. If Holmes appeared only at the end of the story, we would see a cumulative distribution like e.g. 0 0 0 0 . . . 2 4 5 8 9.
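
The worked example above (2 2 3 for the pattern, 10 21 29 for the total words) can be reproduced with a short base R sketch. The helpers count_hits and count_words are hypothetical names, the three lines of text are invented to match the example, and words are again assumed to be whitespace-separated tokens.

```r
# Hypothetical helpers (assumptions, not the package's implementation).
# Number of literal occurrences of the pattern in each line.
count_hits <- function(lines, pattern = "Holmes") {
  vapply(gregexpr(pattern, lines, fixed = TRUE),
         function(m) if (m[1] == -1) 0L else length(m),
         integer(1))
}

# Number of whitespace-separated words in each line.
count_words <- function(lines) {
  lengths(strsplit(trimws(lines), "\\s+"))
}

# Three invented lines with 2, 0 and 1 occurrences of "Holmes"
# and 10, 11 and 8 words respectively.
lines <- c("Holmes rose, and Watson watched Holmes cross the quiet room.",
           "The fog lay thick upon Baker Street all through that morning.",
           "At last Holmes spoke of the strange case.")

cum_hits  <- cumsum(count_hits(lines))   # 2 2 3
cum_words <- cumsum(count_words(lines))  # 10 21 29
print(cum_hits)
print(cum_words)
```

Plotting cum_hits against cum_words would then give a curve that rises steadily when the pattern is present throughout and stays flat wherever it is absent.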

We can see whether there is a relationship between the results of the 2 types of analyses.

We can also see if there is a “typical” result for the 2 types of analyses that holds for “most” of the stories, with just a few stories that deviate. Or perhaps there is no “typical,” and each story has its own characteristic analysis.

Results and Discussion

Basic Word Count Analysis

Table 1 shows that the fraction (number of instances of “Holmes”/total number of words) spans a more than 18-fold range, from a low of 0.00066 for “The Musgrave Ritual” to a high of 0.01209 for “The Adventure of the Three Gables.” This large range is consistent with the hypothesis that Holmes is but a minor character in a number of the stories.

Table 1. Fraction values for the search pattern “Holmes,” across all 60 Sherlock Holmes stories.

Title	Words	Fraction

A Study In Scarlet	43167	0.002220
The Sign of the Four	42915	0.003150
A Scandal in Bohemia	8512	0.005640
The Red-Headed League	9098	0.005830
A Case of Identity	6971	0.006600
The Boscombe Valley Mystery	9614	0.004890
The Five Orange Pips	7312	0.003420
The Man with the Twisted Lip	9192	0.003150
The Adventure of the Blue Carbuncle	7805	0.004870
The Adventure of the Speckled Band	9801	0.005710
The Adventure of the Engineer’s Thumb	8281	0.001690
The Adventure of the Noble Bachelor	8100	0.004200
The Adventure of the Beryl Coronet	9674	0.002890
The Adventure of the Copper Beeches	9943	0.004320
Silver Blaze	9573	0.005330
The Yellow Face	7497	0.002530
The Stock-Broker’s Clerk	6782	0.003830
The "Gloria Scott"	7835	0.001020
The Musgrave Ritual	7568	0.000660
The Reigate Squires	7196	0.007370
The Crooked Man	7126	0.001540
The Resident Patient	6607	0.005900
The Greek Interpreter	6996	0.004150
The Naval Treaty	12603	0.005320
The Final Problem	7155	0.004190
The Adventure of the Empty House	8689	0.004600
The Adventure of the Norwood Builder	9213	0.006950
The Adventure of the Dancing Men	9702	0.006290
The Adventure of the Solitary Cyclist	7824	0.006260
The Adventure of the Priory School	11458	0.007510
The Adventure of Black Peter	8098	0.006790
The Adventure of Charles Augustus Milverton	6699	0.008360
The Adventure of the Six Napoleons	8319	0.006970
The Adventure of the Three Students	6456	0.007590
The Adventure of the Golden Pince-Nez	8921	0.006500
The Adventure of the Missing Three-Quarter	8011	0.006490
The Adventure of the Abbey Grange	9141	0.004490
The Adventure of the Second Stain	9621	0.008320
The Hound of the Baskervilles	59015	0.003250
The Valley Of Fear	57480	0.002610
The Adventure of Wisteria Lodge	11375	0.005450
The Adventure of the Cardboard Box	8510	0.003170
The Adventure of the Red Circle	7277	0.004260
The Adventure of the Bruce-Partington Plans	10668	0.005810
The Adventure of the Dying Detective	5769	0.008670
The Disappearance of Lady Frances Carfax	7665	0.007050
The Adventure of the Devil’s Foot	9968	0.005920
His Last Bow	6054	0.003630
The Illustrious Client	9731	0.006170
The Blanched Soldier	7705	0.001690
The Adventure Of The Mazarin Stone	5639	0.009040
The Adventure of the Three Gables	6039	0.012090
The Adventure of the Sussex Vampire	5957	0.007550
The Adventure of the Three Garridebs	6184	0.008090
The Problem of Thor Bridge	9569	0.006170
The Adventure of the Creeping Man	7646	0.008240
The Adventure of the Lion’s Mane	7171	0.001950
The Adventure of the Veiled Lodger	4457	0.004940
The Adventure of Shoscombe Old Place	6230	0.008190
The Adventure of the Retired Colourman	5498	0.007640

The 4 novels (“A Study in Scarlet,” “The Valley of Fear,” “The Sign of the Four,” and “The Hound of the Baskervilles”) are among the 14 lowest fractions, but are on an even footing with a substantial number of the short stories. This observation is consistent with the hypothesis that the longer novels are mostly a ploy to tell a long non-Holmesian story, but a good number of the short stories also were used for that purpose.

In the interest of full disclosure, I recall that “The Adventure of the Blue Carbuncle” was a story that featured Holmes actively pursuing clues, and I would have expected it to be at the top of the range of fractions. Yet its fraction is roughly in the middle of the range.

The data of Table 1 are displayed as a histogram (Figure 1), illustrating a roughly normal distribution of fraction values.

Figure 1. Histogram of fraction values for the search pattern “Holmes,” across all 60 Sherlock Holmes stories.

We can perhaps somewhat arbitrarily divide the fraction values into 3 types:

0.00066 <= low < 0.004

0.004 <= normal < 0.008

0.008 <= high <= 0.01209
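
The three bins above translate directly into a call to cut() in base R. The sketch below is only an illustration: classify_fraction is a hypothetical helper, and the three fraction values are taken from Table 1.

```r
# Hypothetical helper (an assumption, not part of the package):
# assign each fraction to one of the three bins defined above.
classify_fraction <- function(fraction) {
  cut(fraction,
      breaks = c(-Inf, 0.004, 0.008, Inf),
      labels = c("low", "normal", "high"),
      right = FALSE)  # intervals are closed on the left: [0.004, 0.008)
}

# Three values taken from Table 1.
fractions <- c("The Musgrave Ritual" = 0.00066,
               "The Adventure of the Blue Carbuncle" = 0.00487,
               "The Adventure of the Three Gables" = 0.01209)

classes <- classify_fraction(fractions)
print(data.frame(title = names(fractions),
                 fraction = unname(fractions),
                 class = classes))
```

Using open-ended outer breaks (-Inf and Inf) is a design choice of this sketch, so that values outside the observed range would still receive a label.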

Another way to look at the same data is a scatter plot of fraction values as a function of the total number of words (Figure 2).

Figure 2. Scatter plot of fraction values as a function of the total number of words in the story for the search pattern “Holmes,” across all 60 Sherlock Holmes stories.

As expected, we can clearly see a pattern for the 4 novels in the lower right corner. However, the short stories do not display a discernible pattern.

It is interesting that the fraction values tend to increase in accord with the publication date (Figure 3).

Figure 3. Scatter plot of fraction values as a function of the chronological order for the search pattern “Holmes,” across all 60 Sherlock Holmes stories.

However, the large amount of scatter in the data prevents this correlation from achieving statistical significance. Perhaps Conan Doyle eventually started feeling some remorse over “cheating” his loyal readers. The trend is consistent with 2 of the 4 novels being written as the first 2 stories. The other 2 novels were written around the middle chronologically, and like the first 2, they were written one after the other.

Cumulative Distribution Analysis

The results of the cumulative distribution analysis are given in a series of 60 graphs, one for each story. Let us first take the 2 most extreme stories, as they might be expected to most clearly show distinct characteristic behaviors.

“The Musgrave Ritual” (Figure 4) exhibited the lowest overall fraction value (0.00066).

Figure 4. Cumulative distribution analysis for the search pattern “Holmes” in “The Musgrave Ritual.”

This story had such a low number of instances of “Holmes” that I was worried that the program had made a mistake, so I examined this story directly. Yes, there really were just 5 instances of “Holmes.” The cumulative analysis (Figure 4) shows that after the first 2000 words (or around 25% of the story), “Holmes” is only mentioned once. This is consistent with the final 75% of the story not really being so much a Sherlock Holmes story as it is a story about a family ritual.

“The Adventure of the Three Gables” (Figure 5) exhibited the highest overall fraction value (0.012090).

Figure 5. Cumulative distribution analysis for the search pattern “Holmes” in “The Adventure of the Three Gables.”

The cumulative graph is so different from that for “The Musgrave Ritual” that it is almost hard to believe that the 2 stories were written by the same author. The cumulative graph for “The Adventure of the Three Gables” shows an uninterrupted presence of Holmes throughout the whole story.

Remarkably, I remembered that there was one story for which an annotator questioned whether it was written by Conan Doyle, because it contained some racist epithets by Holmes, which were totally contrary to his character. Believe it or not, I just now looked up the name of that story, and it is in fact “The Adventure of the Three Gables”; see, e.g., [https://lesliesklinger.com/2020/07/07/the-elephant-in-the-room/].

Although I mentioned that the cumulative graph was very different from that for “The Musgrave Ritual,” there are many other stories with cumulative graphs that are qualitatively essentially identical to that for “The Adventure of the Three Gables,” so it cannot be ruled out as an authentic Conan Doyle story on that basis.

I had mentioned above that there was one short story “The Adventure of the Blue Carbuncle,” (Figure 6) which I remember had a very consistent degree of participation by Holmes. This recollection is borne out by the cumulative distribution analysis. The fraction value is 0.004870. According to the histogram (Figure 1), this value is around the mean for all 60 stories. The cumulative distribution graph (Figure 6) shows a consistent presence of “Holmes” throughout the entire story. Apparently the moderate number of mentions of “Holmes” were distributed evenly through the story.

Figure 6. Cumulative distribution analysis for the search pattern “Holmes” in “The Adventure of the Blue Carbuncle.”

Enhanced Features

In order to keep this initial description more comprehensible, I did not present certain enhanced features that were added to the package after the manuscript was completed. These features will be presented in a subsequent manuscript.

These include:

Integrated output directory format archiving the results of multiple analyses mentioned below
Expanded studies covering many more authors and stories
Rolling averages in addition to cumulative distributions
Tabulate numerical values of linear regression analyses (permits other programs to perform automated analyses on the segmentation of texts)
Multiple search patterns overlay on same plot (e.g., “Holmes” and “Watson”)
Concordance of frequency of words (within a window of several sentences surrounding the search pattern)
- Full text of surrounding sentences
- Histogram of over-abundance of words in the windows (e.g., how often does the word “woman” appear near the search term “Watson”)