---
title: "SherlockHolmes Part I"
author: Barry Zeeberg [aut, cre]
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{SherlockHolmes Part I}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
SherlockHolmes: An R Program to Analyze the Hidden Structure of Sherlock
Holmes Stories by Statistical Pattern Analysis of Concordances
Barry Zeeberg
barryz2013@gmail.com
Motivation
Although Arthur Conan Doyle was best known
for his 60 Sherlock Holmes stories, he was a prolific writer of many other
works [https://en.wikipedia.org/wiki/Arthur_Conan_Doyle].
For many decades, I have been interested
in the Sherlock Holmes stories. I have also had an interest in the Dutch artist
Johannes Vermeer, the American artist Edward Hopper, and the American bluesman
Robert Johnson. Perhaps it is just some peculiarity of my own subjective
perception, but for each of these I tend to distinguish, e.g., “real Vermeers” from “fake Vermeers.” I do not mean
“fake” in the sense of a forgery. I mean that certain of the Vermeer paintings
strike me as representing why he is so highly regarded, while others are more or
less “pedestrian” (as a math professor used to say about calculus proofs that
were fairly routine).
I found that some of the Sherlock Holmes
stories seemed more like stories about something else, with Sherlock just added
as an afterthought. Without any actual knowledge of the subject, I assumed that
Sherlock was very popular, and that Conan Doyle could just use Sherlock as “bait”
to get more readership of his “other” stories.
It would be rather tedious to read each
story and tabulate how much of a story was really Sherlock detecting, and how
much was seemingly thousands of pages about soldiers in India or the KKK. It
occurred to me that I could write an R program to perform a concordance that
might shed some light on the matter in an objective manner. It is always
gratifying to “validate” my subjective biases using an objective procedure :-)
The idea that I had was that Watson was
usually present in the stories, and he would either address Sherlock directly,
calling him “Holmes,” or he would mention “Holmes did or said such and such.” The
point is that, thanks to the presence of Watson as his chronicler, the literal
string “Holmes” could be used as a proxy for the presence or activity of
Sherlock. I am not sure that this could be done as successfully in general.
Technical Methods
All analyses are performed by invoking
Sherlock(titles,texts,patterns,toupper,odir,minl=100,P=0.00001,verbose=FALSE)
* titles is a character string containing the full path name for a text file containing the titles of the stories in the same order that they appear in the texts file. If titles=="NONE", the entire book is treated as one story.
* texts is a character string containing the full path name for a text file containing the full texts of all of the stories. Each story should be preceded by the title matching that given in the titles file.
* patterns is a vector containing the search patterns.
* toupper is a Boolean; TRUE if the titles should be converted to upper case.
* odir is a character string containing the full path name of the output directory.
* minl is an integer parameter passed to dpseg::dpseg.
* P is a numeric parameter passed to dpseg::dpseg.
* verbose is a Boolean; FALSE suppresses optional diagnostic output to the console.
In two of the figures presented below (Figures 5 and 6), I used the excellent R package
“dpseg: Piecewise Linear Segmentation by Dynamic
Programming” by Rainer Machne and Peter Stadler. This
package implements piecewise linear regression modeling, and enables a
quantitative analysis of the results in those 2 figures. Dr. Machne kindly provided me with the source code for the plotting function, which permitted me to make several minor custom modifications.
Literary Methods
An excellent single text file containing
all of the stories is readily available online [https://sherlock-holm.es/stories/plain-text/cano.txt]. A small amount
of manual editing was required to prepare this file for automated analysis
(mostly consisting of deleting some passages that were not part of the story,
and some chapter subtitles, etc.).
The full version (contents.txt and processed_download.txt), and an abbreviated version (contents3.txt and processed_download3.txt) of the titles and text files for the Sherlock Holmes stories are provided in /inst/extdata.
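As a rough illustration of how the abbreviated files might be passed to the function, here is a minimal sketch. It assumes that the package is installed under the name SherlockHolmes, that Sherlock() is exported, and that the files under inst/extdata are reachable through system.file(); the choice of toupper = TRUE and the use of tempdir() as the output directory are illustrative assumptions, not documented requirements.

```r
# Search pattern used throughout this vignette
patterns <- c("Holmes")

# Only attempt the call when the package is available (package name is an assumption)
if (requireNamespace("SherlockHolmes", quietly = TRUE)) {
  # Abbreviated titles and texts files described above
  titles <- system.file("extdata", "contents3.txt", package = "SherlockHolmes")
  texts  <- system.file("extdata", "processed_download3.txt", package = "SherlockHolmes")

  # Write output to a temporary directory (assumed to be an acceptable odir)
  odir <- tempdir()

  # toupper = TRUE is an assumption about how the titles appear in the texts file
  SherlockHolmes::Sherlock(titles = titles, texts = texts, patterns = patterns,
                           toupper = TRUE, odir = odir,
                           minl = 100, P = 0.00001, verbose = FALSE)
}
```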
The 60 Sherlock Holmes stories
comprise 4 novels and 56 short stories. My recollection is that the 4 novels
are the most flagrant in not being “real” Sherlock Holmes stories, although
there were plenty of short stories for which this is also true. There was one
short story, “The Adventure of the Blue Carbuncle,” which I remember had a very
high degree of participation by Holmes. I certainly expect that the analyses
described below will be consistent with these subjective impressions.
I performed 2 types of analyses.
The first type counted the number of times
that the search pattern “Holmes” appeared, in comparison with the total number
of words, in each story. A low ratio of “Holmes”/total could indicate a “fake”
Sherlock story.
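The first type of analysis can be sketched in a few lines of base R. This is only an illustration of the idea, not the code used inside the package: the helper name pattern_fraction, the whitespace-based definition of a word, and the toy text are all assumptions made for the example.

```r
# Hypothetical helper: fraction = (occurrences of pattern) / (total words)
pattern_fraction <- function(lines, pattern) {
  # Words are taken to be runs of non-whitespace characters (an assumption)
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words)]

  # Count literal occurrences of the pattern; gregexpr returns -1 for "no match"
  hits <- gregexpr(pattern, lines, fixed = TRUE)
  n_pattern <- sum(vapply(hits, function(m) sum(m > 0), integer(1)))

  list(n_pattern = n_pattern, n_words = length(words),
       fraction = n_pattern / length(words))
}

# Toy text standing in for one story (not taken from the corpus)
story <- c(
  "Holmes looked up. 'Watson,' said Holmes, 'observe.'",
  "I observed nothing at all.",
  "Then Holmes laughed."
)

res <- pattern_fraction(story, "Holmes")
res$fraction
```

A low value of res$fraction for a whole story would correspond to the "fake" case described above.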
The second type constructed a cumulative
distribution of “Holmes” and of total words. For example, if the pattern
“Holmes” appeared twice in the first line of a story, zero times in the second, and once
in the third, the pattern cumulative distribution would be 2 2 3. The total
words cumulative distribution would be something like 10 21 29. If Holmes appeared throughout the whole story, we
would see a cumulative distribution like
2 4 5 8 9 . . . 20 25 32. If Holmes appeared only at the end of the story, we
would see a cumulative distribution like
0 0 0 0 . . . 2 4 5 8 9.
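The cumulative distributions described above can be sketched with cumsum() in base R. The helper name cumulative_counts and the toy lines are hypothetical; the package's own implementation may differ in how it splits lines and words.

```r
# Hypothetical helper: per-line counts accumulated over the story
cumulative_counts <- function(lines, pattern) {
  # Occurrences of the literal pattern on each line (-1 means no match)
  hits <- gregexpr(pattern, lines, fixed = TRUE)
  per_line_pattern <- vapply(hits, function(m) sum(m > 0), integer(1))

  # Words per line, splitting on whitespace (an assumption)
  per_line_words <- vapply(strsplit(trimws(lines), "[[:space:]]+"),
                           function(w) sum(nzchar(w)), integer(1))

  data.frame(pattern = cumsum(per_line_pattern),
             words   = cumsum(per_line_words))
}

# Toy story: "Holmes" twice on line 1, zero times on line 2, once on line 3
lines <- c("Holmes spoke and Holmes smiled.",
           "We waited in silence.",
           "Holmes rose.")

cum <- cumulative_counts(lines, "Holmes")
cum
```

Plotting cum$pattern against cum$words gives the kind of cumulative graph discussed in the results: a flat stretch means that the pattern is absent over that part of the story.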
We can see whether there is a relationship
between the results of the 2 types of analyses.
We can also see if there is a
“typical” result for the 2 types of analyses that holds for “most” of the
stories, with just a few stories that deviate. Or perhaps there is no “typical,”
and each story has its own characteristic analysis.
Results and Discussion
Basic Word Count Analysis
Table 1 shows that the
fraction (number of instances of “Holmes”/total number of words) spans a more than
18-fold range, from a low of 0.00066 for “The Musgrave Ritual” to a high of
0.01209 for “The Adventure of the Three Gables.” This large range is consistent
with the hypothesis that Holmes is but a minor character in a number of the
stories.
Table 1. Fraction values for
the search pattern “Holmes,” across all 60 Sherlock Holmes stories.
| Title | Words | Fraction |
|---|---|---|
| A Study In Scarlet | 43167 | 0.002220 |
| The Sign of the Four | 42915 | 0.003150 |
| A Scandal in Bohemia | 8512 | 0.005640 |
| The Red-Headed League | 9098 | 0.005830 |
| A Case of Identity | 6971 | 0.006600 |
| The Boscombe Valley Mystery | 9614 | 0.004890 |
| The Five Orange Pips | 7312 | 0.003420 |
| The Man with the Twisted Lip | 9192 | 0.003150 |
| The Adventure of the Blue Carbuncle | 7805 | 0.004870 |
| The Adventure of the Speckled Band | 9801 | 0.005710 |
| The Adventure of the Engineer's Thumb | 8281 | 0.001690 |
| The Adventure of the Noble Bachelor | 8100 | 0.004200 |
| The Adventure of the Beryl Coronet | 9674 | 0.002890 |
| The Adventure of the Copper Beeches | 9943 | 0.004320 |
| Silver Blaze | 9573 | 0.005330 |
| The Yellow Face | 7497 | 0.002530 |
| The Stock-Broker's Clerk | 6782 | 0.003830 |
| The "Gloria Scott" | 7835 | 0.001020 |
| The Musgrave Ritual | 7568 | 0.000660 |
| The Reigate Squires | 7196 | 0.007370 |
| The Crooked Man | 7126 | 0.001540 |
| The Resident Patient | 6607 | 0.005900 |
| The Greek Interpreter | 6996 | 0.004150 |
| The Naval Treaty | 12603 | 0.005320 |
| The Final Problem | 7155 | 0.004190 |
| The Adventure of the Empty House | 8689 | 0.004600 |
| The Adventure of the Norwood Builder | 9213 | 0.006950 |
| The Adventure of the Dancing Men | 9702 | 0.006290 |
| The Adventure of the Solitary Cyclist | 7824 | 0.006260 |
| The Adventure of the Priory School | 11458 | 0.007510 |
| The Adventure of Black Peter | 8098 | 0.006790 |
| The Adventure of Charles Augustus Milverton | 6699 | 0.008360 |
| The Adventure of the Six Napoleons | 8319 | 0.006970 |
| The Adventure of the Three Students | 6456 | 0.007590 |
| The Adventure of the Golden Pince-Nez | 8921 | 0.006500 |
| The Adventure of the Missing Three-Quarter | 8011 | 0.006490 |
| The Adventure of the Abbey Grange | 9141 | 0.004490 |
| The Adventure of the Second Stain | 9621 | 0.008320 |
| The Hound of the Baskervilles | 59015 | 0.003250 |
| The Valley Of Fear | 57480 | 0.002610 |
| The Adventure of Wisteria Lodge | 11375 | 0.005450 |
| The Adventure of the Cardboard Box | 8510 | 0.003170 |
| The Adventure of the Red Circle | 7277 | 0.004260 |
| The Adventure of the Bruce-Partington Plans | 10668 | 0.005810 |
| The Adventure of the Dying Detective | 5769 | 0.008670 |
| The Disappearance of Lady Frances Carfax | 7665 | 0.007050 |
| The Adventure of the Devil's Foot | 9968 | 0.005920 |
| His Last Bow | 6054 | 0.003630 |
| The Illustrious Client | 9731 | 0.006170 |
| The Blanched Soldier | 7705 | 0.001690 |
| The Adventure Of The Mazarin Stone | 5639 | 0.009040 |
| The Adventure of the Three Gables | 6039 | 0.012090 |
| The Adventure of the Sussex Vampire | 5957 | 0.007550 |
| The Adventure of the Three Garridebs | 6184 | 0.008090 |
| The Problem of Thor Bridge | 9569 | 0.006170 |
| The Adventure of the Creeping Man | 7646 | 0.008240 |
| The Adventure of the Lion's Mane | 7171 | 0.001950 |
| The Adventure of the Veiled Lodger | 4457 | 0.004940 |
| The Adventure of Shoscombe Old Place | 6230 | 0.008190 |
| The Adventure of the Retired Colourman | 5498 | 0.007640 |
The 4 novels (“A Study in Scarlet,” “The
Valley of Fear,” “The Sign of the Four,” and “The Hound of the Baskervilles”)
are among the 14 lowest fractions, but are on an even footing with a
substantial number of the short stories. This observation is consistent with
the hypothesis that the longer novels are mostly a ploy to tell a long
non-Holmesian story, but that a good number of the short stories were also used for
that purpose.
In the interest of full disclosure, I
recall that “The Adventure of the Blue Carbuncle” was a story that featured
Holmes actively pursuing clues, and I would have expected it to be at the
top of the range of fractions. Yet its fraction is roughly in the middle of the
range.
The data of Table 1 are displayed as a histogram
(Figure 1), illustrating a roughly normal distribution of fraction values.
Figure 1. Histogram of fraction values for
the search pattern “Holmes,” across all 60 Sherlock Holmes stories.
We can perhaps somewhat arbitrarily divide
the fraction values into 3 types:
0.00066 <= low < 0.004
0.004 <= normal < 0.008
0.008 <= high <= 0.01209
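One way to apply these three bins in R is with cut(). This is a sketch only; the helper name classify_fraction is hypothetical, and the break points are simply the somewhat arbitrary boundaries listed above.

```r
# Hypothetical helper: assign each fraction value to low / normal / high
classify_fraction <- function(fraction) {
  cut(fraction,
      breaks = c(0.00066, 0.004, 0.008, 0.01209),
      labels = c("low", "normal", "high"),
      right = FALSE,          # bins are closed on the left: [a, b)
      include.lowest = TRUE)  # with right = FALSE, also closes the last bin at 0.01209
}

# Musgrave Ritual, Blue Carbuncle, and Three Gables fractions from Table 1
classes <- classify_fraction(c(0.00066, 0.00487, 0.01209))
as.character(classes)
```

Values outside the range 0.00066 to 0.01209 would be returned as NA, which is acceptable here because those are the observed minimum and maximum.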
Another way to look at the same data is a
scatter plot of fraction values as a function of the total number of words
(Figure 2).
Figure 2. Scatter plot of fraction values
as a function of the total number of words in the story for the search pattern
“Holmes,” across all 60 Sherlock Holmes stories.
As expected, we can clearly see a pattern for the 4
novels in the lower right corner. However, the short stories do not display a
discernible pattern.
It is interesting that the fraction values
tend to increase in accord with the publication date (Figure 3).
Figure 3. Scatter plot of fraction values
as a function of the chronological order for the search pattern “Holmes,”
across all 60 Sherlock Holmes stories.
However, the large amount of scatter in the data
prevents this correlation from achieving statistical significance. Perhaps Conan
Doyle eventually started feeling some remorse over “cheating” his loyal
readers. The trend is consistent with 2 of the 4 novels being written as the
first 2 stories. The other 2 novels were written around the middle
chronologically, and like the first 2, they were written one after the other.
Cumulative Distribution Analysis
The results of the cumulative distribution
analysis are given in a series of 60 graphs, one for each story. Let us first take
the 2 most extreme stories, as they might be expected to most clearly show
distinct characteristic behaviors.
“The Musgrave Ritual” (Figure 4) exhibited
the lowest overall fraction value (0.00066).
Figure 4. Cumulative distribution analysis
for the search pattern “Holmes” in “The Musgrave Ritual.”
This story had such a low number of
instances of Holmes that I was worried that the program had made a mistake, so
I examined this story directly. Yes, there really were just 5 instances of
Holmes. The cumulative analysis (Figure 4) shows that after the first 2000
words (or around 25% of the story), “Holmes” is only mentioned once. This is
consistent with the final 75% of the story not really being so much a Sherlock
Holmes story as a story about a family ritual.
“The Adventure of the Three Gables”
(Figure 5) exhibited the highest overall fraction value (0.012090).
Figure 5. Cumulative distribution analysis
for the search pattern “Holmes” in “The Adventure of the Three Gables.”
The cumulative graph is so different from
that for “The Musgrave Ritual” that it is almost hard to believe that the 2
stories were written by the same author. The cumulative graph for “The
Adventure of the Three Gables” shows an uninterrupted presence of Holmes
throughout the whole story.
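The piecewise linear segmentation used for Figures 5 and 6 can be tried on synthetic data as follows. This sketch assumes that the dpseg package is installed and calls dpseg::dpseg with the x, y, minl, and P arguments; the data are invented for illustration (a steady presence, a gap, then a denser presence), and minl = 50 is chosen only to suit this small example.

```r
# Synthetic story of 300 lines with 10 words per line (invented data)
n <- 300
words_per_line <- rep(10L, n)

# Pattern hits per line: one every 5 lines, then none, then one every 2 lines
hits <- integer(n)
hits[seq(5, 100, by = 5)] <- 1L
hits[seq(202, 300, by = 2)] <- 1L

# Cumulative total words (x) and cumulative pattern count (y)
x <- cumsum(words_per_line)
y <- cumsum(hits)

# Fit piecewise linear segments only when dpseg is available
if (requireNamespace("dpseg", quietly = TRUE)) {
  segs <- dpseg::dpseg(x = x, y = y, minl = 50, P = 0.00001)
  print(segs)  # one row per fitted linear segment
}
```

A story with an uninterrupted presence of the pattern would be expected to need few segments with similar slopes, whereas a long flat stretch should show up as a segment with a slope near zero.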
Remarkably, I remembered one
story for which an annotator questioned whether it was written by Conan Doyle,
because it contained some racist epithets by Holmes, which were totally contrary
to his character. Believe it or not, I just now looked up the name of that
story, and it is in fact “The Adventure of the Three Gables”; see, e.g., [https://lesliesklinger.com/2020/07/07/the-elephant-in-the-room/].
Although I mentioned that the cumulative
graph was very different from that for “The Musgrave Ritual,” there are many
other stories with cumulative graphs that are qualitatively essentially
identical to that for “The Adventure of the Three Gables,” so it cannot be
ruled out as an authentic Conan Doyle story on that basis.
I had mentioned above that there was one short
story, “The Adventure of the Blue Carbuncle” (Figure 6), which I remember had a
very consistent degree of participation by Holmes. This recollection is borne
out by the cumulative distribution analysis. The fraction value is 0.004870.
According to the histogram (Figure 1), this value is around the mean for all 60
stories. The cumulative distribution graph (Figure 6) shows a consistent
presence of “Holmes” throughout the entire story. Apparently
the moderate number of mentions of “Holmes” was distributed evenly through the
story.
Figure 6. Cumulative distribution analysis
for the search pattern “Holmes” in “The Adventure of the Blue Carbuncle.”
Enhanced Features
In order to keep this initial description
more comprehensible, I did not present certain enhanced features that were
added to the package after the manuscript was completed. These features will be
presented in a subsequent manuscript.
These include: