Holmes is a Python 3 library (tested with version 3.9.5) running on top of spaCy (tested with version 3.1.2) that supports a number of use cases involving information extraction from English and German texts. In all use cases, the information extraction is based on analysing the semantic relationships expressed by the component parts of each sentence:
In the chatbot use case, the system is configured using one or more search phrases. Holmes then looks for structures whose meanings correspond to those of these search phrases within a searched document, which in this case corresponds to an individual snippet of text or speech entered by the end user. Within a match, each word with its own meaning (i.e. that does not merely fulfil a grammatical function) in the search phrase corresponds to one or more such words in the document. Both the fact that a search phrase was matched and any structured information the search phrase extracts can be used to drive the chatbot.
The structural extraction use case uses exactly the same structural matching technology as the chatbot use case, but searching takes place with respect to a pre-existing document or documents that are typically much longer than the snippets analysed in the chatbot use case, and the aim is to extract and store structured information. For example, a set of business articles could be searched to find all the places where one company is said to be planning to take over a second company. The identities of the companies concerned could then be stored in a database.
The topic matching use case aims to find passages in a document or documents whose meaning is close to that of another document, which takes on the role of the query document, or to that of a query phrase entered ad-hoc by the user. Holmes extracts a number of small phraselets from the query phrase or query document, matches the documents being searched against each phraselet, and conflates the results to find the most relevant passages within the documents. Because there is no strict requirement that every word with its own meaning in the query document match a specific word or words in the searched documents, more matches are found than in the structural extraction use case, but the matches do not contain structured information that can be used in subsequent processing. The topic matching use case is demonstrated by a website allowing searches within the Harry Potter corpus (for English) and around 350 traditional stories (for German).
The supervised document classification use case uses training data to learn a classifier that assigns one or more classification labels to new documents based on what they are about. It classifies a new document by matching it against phraselets that were extracted from the training documents in the same way that phraselets are extracted from the query document in the topic matching use case. The technique is inspired by bag-of-words-based classification algorithms that use n-grams, but aims to derive n-grams whose component words are related semantically rather than that just happen to be neighbours in the surface representation of a language.
In all four use cases, the individual words are matched using a number of strategies. To work out whether two grammatical structures that contain individually matching words correspond logically and constitute a match, Holmes transforms the syntactic parse information provided by the spaCy library into semantic structures that allow texts to be compared using predicate logic. As a user of Holmes, you do not need to understand the intricacies of how this works, although there are some important tips around writing effective search phrases for the chatbot and structural extraction use cases that you should try and take on board.
Holmes aims to offer generalist solutions that can be used more or less out of the box with relatively little tuning, tweaking or training and that are rapidly applicable to a wide range of use cases. At its core lies a logical, programmed, rule-based system that describes how syntactic representations in each language express semantic relationships. Although the supervised document classification use case does incorporate a neural network and although the spaCy library upon which Holmes builds has itself been pre-trained using machine learning, the essentially rule-based nature of Holmes means that the chatbot, structural extraction and topic matching use cases can be put to use out of the box without any training and that the supervised document classification use case typically requires relatively little training data, which is a great advantage because pre-labelled training data is not available for many real-world problems.
Install Holmes using the following commands:
Linux:
pip3 install holmes-extractor
Windows:
pip install holmes-extractor
To upgrade from a previous Holmes version, issue the following commands and then reissue the commands to download the spaCy and coreferee models to ensure you have the correct versions of them:
Linux:
pip3 install --upgrade holmes-extractor
Windows:
pip install --upgrade holmes-extractor
Note that if you are upgrading to a new Holmes version that uses a different major or minor version of Python from the pre-existing version, you will need to upgrade Python and then follow the instructions for installing Holmes from scratch.
If you are working on some versions of Windows and have not used Python before, several of Holmes' dependencies may require you to download Visual Studio and then rerun the installation. During the Visual Studio install, it is imperative to select the Desktop Development with C++ option, which is not checked by default.
If you wish to use the examples and tests, clone the source code using
git clone https://github.com/msg-systems/holmes-extractor
If you wish to experiment with changing the source code, you can override the installed code by starting Python (type python3 (Linux) or python (Windows)) in the parent directory of the directory where your altered module code is. If you have checked Holmes out of Git, this will be the
If you wish to uninstall Holmes again, this is achieved by deleting the installed
file(s) directly from the file system. These can be found by issuing the
following from the Python command prompt started from any directory other
than the parent directory of
import holmes_extractor
print(holmes_extractor.__file__)
The spaCy and coreferee libraries that Holmes builds upon require language-specific models that have to be downloaded separately before Holmes can be used:
Linux:
python3 -m spacy download en_core_web_trf
python3 -m spacy download en_core_web_lg
python3 -m coreferee install en

python3 -m spacy download de_core_news_lg
python3 -m coreferee install de

Windows:
python -m spacy download en_core_web_trf
python -m spacy download en_core_web_lg
python -m coreferee install en

python -m spacy download de_core_news_lg
python -m coreferee install de
and if you plan to run the regression tests:
Linux:
python3 -m spacy download en_core_web_sm
Windows:
python -m spacy download en_core_web_sm
You specify a spaCy model for Holmes to use when you instantiate the Manager facade class. en_core_web_trf and de_core_news_lg are the models that have been found to yield the best results for English and German respectively. Because en_core_web_trf does not have its own word vectors, but Holmes requires word vectors for embedding-based matching, the en_core_web_lg model is loaded as a vector source whenever en_core_web_trf is specified to the Manager class as the main model. The en_core_web_trf model requires significantly more resources than the other models; in a situation where resources are scarce, it may be a sensible compromise to use en_core_web_lg as the main model instead.
The best way of integrating Holmes into a non-Python environment is to wrap it as a RESTful HTTP service and to deploy it as a microservice. See here for an example.
Because Holmes performs complex, intelligent analysis, it is inevitable that it requires more hardware resources than more traditional search frameworks. The use cases that involve loading documents — structural extraction and topic matching — are most immediately applicable to large but not massive corpora (e.g. all the documents belonging to a certain organisation, all the patents on a certain topic, all the books by a certain author). For cost reasons, Holmes would not be an appropriate tool with which to analyse the content of the entire Internet!
That said, Holmes is both vertically and horizontally scalable. With sufficient hardware, both these use cases can be applied to an essentially unlimited number of documents by running Holmes on multiple machines, processing a different set of documents on each one and conflating the results. Note that this strategy is already employed to distribute matching amongst multiple cores on a single machine: the Manager class starts a number of worker processes and distributes registered documents between them.
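The distribution strategy described above can be pictured with a minimal sketch. It is not Holmes's own implementation: the function names, the round-robin assignment policy and the `(label, score)` result tuples are all assumptions made for illustration.

```python
from typing import List, Tuple


def partition_documents(labels: List[str], number_of_workers: int) -> List[List[str]]:
    """Assign document labels to workers round-robin (an assumed policy)."""
    if number_of_workers < 1:
        raise ValueError("number_of_workers must be at least 1")
    partitions: List[List[str]] = [[] for _ in range(number_of_workers)]
    for index, label in enumerate(labels):
        partitions[index % number_of_workers].append(label)
    return partitions


def conflate_results(per_worker: List[List[Tuple[str, float]]]) -> List[Tuple[str, float]]:
    """Merge the (label, score) results from all workers, best score first."""
    merged = [result for results in per_worker for result in results]
    return sorted(merged, key=lambda result: result[1], reverse=True)


# Hypothetical document labels and scores, for illustration only.
partitions = partition_documents(["doc1", "doc2", "doc3", "doc4", "doc5"], 2)
ranked = conflate_results([[("doc1", 0.4), ("doc3", 0.9)], [("doc2", 0.7)]])
```

Because each worker (or machine) only ever sees its own partition, the only shared step is the final merge, which is why the approach scales horizontally.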
Holmes holds loaded documents in memory, which ties in with its intended use with large but not massive corpora. The performance of document loading, structural extraction and topic matching degrades heavily if the operating system has to swap memory pages to secondary storage, because Holmes can require memory from a variety of pages to be addressed when processing a single sentence. This means it is important to supply enough RAM on each machine to hold all loaded documents.
Please note the above comments about the relative resource requirements of the different models.
The easiest use case with which to get a quick basic idea of how Holmes works is the chatbot use case.
Here one or more search phrases are defined to Holmes in advance, and the searched documents are short sentences or paragraphs typed in interactively by an end user. In a real-life setting, the extracted information would be used to determine the flow of interaction with the end user. For testing and demonstration purposes, there is a console that displays its matched findings interactively. It can be easily and quickly started from the Python command line (which is itself started from the operating system prompt by typing python3 (Linux) or python (Windows)) or from within a Jupyter notebook.
The following code snippet can be entered line for line into the Python command line, into a Jupyter notebook or into an IDE. It registers the fact that you are interested in sentences about big dogs chasing cats and starts a demonstration chatbot console:
import holmes_extractor as holmes
holmes_manager = holmes.Manager(model='en_core_web_lg', number_of_workers=1)
holmes_manager.register_search_phrase('A big dog chases a cat')
holmes_manager.start_chatbot_mode_console()

import holmes_extractor as holmes
holmes_manager = holmes.Manager(model='de_core_news_lg', number_of_workers=1)
holmes_manager.register_search_phrase('Ein großer Hund jagt eine Katze')
holmes_manager.start_chatbot_mode_console()
If you now enter a sentence that corresponds to the search phrase, the console will display a match:
Ready for input

A big dog chased a cat

Matched search phrase with text 'A big dog chases a cat':
'big'->'big' (Matches BIG directly); 'A big dog'->'dog' (Matches DOG directly); 'chased'->'chase' (Matches CHASE directly); 'a cat'->'cat' (Matches CAT directly)

Ready for input

Ein großer Hund jagte eine Katze

Matched search phrase 'Ein großer Hund jagt eine Katze':
'großer'->'groß' (Matches GROSS directly); 'Ein großer Hund'->'hund' (Matches HUND directly); 'jagte'->'jagen' (Matches JAGEN directly); 'eine Katze'->'katze' (Matches KATZE directly)
This could easily have been achieved with a simple matching algorithm, so type in a few more complex sentences to convince yourself that Holmes is really grasping them and that matches are still returned:
The big dog would not stop chasing the cat
The big dog who was tired chased the cat
The cat was chased by the big dog
The cat always used to be chased by the big dog
The big dog was going to chase the cat
The big dog decided to chase the cat
The cat was afraid of being chased by the big dog
I saw a cat-chasing big dog
The cat the big dog chased was scared
The big dog chasing the cat was a problem
There was a big dog that was chasing a cat
The cat chase by the big dog
There was a big dog and it was chasing a cat.
I saw a big dog. My cat was afraid of being chased by the dog.
There was a big dog. His name was Fido. He was chasing my cat.
A dog appeared. It was chasing a cat. It was very big.
The cat sneaked back into our lounge because a big dog had been chasing her outside.
Our big dog was excited because he had been chasing a cat.

Der große Hund hat die Katze ständig gejagt
Der große Hund, der müde war, jagte die Katze
Die Katze wurde vom großen Hund gejagt
Die Katze wurde immer wieder durch den großen Hund gejagt
Der große Hund wollte die Katze jagen
Der große Hund entschied sich, die Katze zu jagen
Die Katze, die der große Hund gejagt hatte, hatte Angst
Dass der große Hund die Katze jagte, war ein Problem
Es gab einen großen Hund, der eine Katze jagte
Die Katzenjagd durch den großen Hund
Es gab einen großen Hund und er jagte eine Katze
Es gab einen großen Hund. Er hieß Fido. Er jagte meine Katze
Es erschien ein Hund. Er jagte eine Katze. Er war sehr groß.
Die Katze schlich sich in unser Wohnzimmer zurück, weil ein großer Hund sie draußen gejagt hatte
Unser großer Hund war aufgeregt, weil er eine Katze gejagt hatte
The demonstration is not complete without trying other sentences that contain the same words but do not express the same idea and observing that they are not matched:
The dog chased a big cat
The big dog and the cat chased about
The big dog chased a mouse but the cat was tired
The big dog always used to be chased by the cat
The big dog the cat chased was scared
Our big dog was upset because he had been chased by a cat.
The dog chase of the big cat

Der Hund jagte eine große Katze
Die Katze jagte den großen Hund
Der große Hund und die Katze jagten
Der große Hund jagte eine Maus aber die Katze war müde
Der große Hund wurde ständig von der Katze gejagt
Der große Hund entschloss sich, von der Katze gejagt zu werden
Die Hundejagd durch den große Katze
In the above examples, Holmes has matched a variety of different sentence-level structures that share the same meaning, but the base forms of the three words in the matched documents have always been the same as the three words in the search phrase. Holmes provides several further strategies for matching at the individual word level. In combination with Holmes's ability to match different sentence structures, these can enable a search phrase to be matched to a document sentence that shares its meaning even where the two share no words and are grammatically completely different.
One of these additional word-matching strategies is named-entity
matching: special words can be included in search phrases
that match whole classes of names like people or places. Exit the
console by typing
exit, then register a second search phrase and
restart the console:
holmes_manager.register_search_phrase('An ENTITYPERSON goes into town')
holmes_manager.start_chatbot_mode_console()

holmes_manager.register_search_phrase('Ein ENTITYPER geht in die Stadt')
holmes_manager.start_chatbot_mode_console()
You have now registered your interest in people going into town and can enter appropriate sentences into the console:
Ready for input

I met Richard Hudson and John Doe last week. They didn't want to go into town.

Matched search phrase with text 'An ENTITYPERSON goes into town'; negated; uncertain; involves coreference:
'Richard Hudson'->'ENTITYPERSON' (Has an entity label matching ENTITYPERSON); 'go'->'go' (Matches GO directly); 'into'->'into' (Matches INTO directly); 'town'->'town' (Matches TOWN directly)

Matched search phrase with text 'An ENTITYPERSON goes into town'; negated; uncertain; involves coreference:
'John Doe'->'ENTITYPERSON' (Has an entity label matching ENTITYPERSON); 'go'->'go' (Matches GO directly); 'into'->'into' (Matches INTO directly); 'town'->'town' (Matches TOWN directly)

Ready for input

Letzte Woche sah ich Richard Hudson und Max Mustermann. Sie wollten nicht mehr in die Stadt gehen.

Matched search phrase with text 'Ein ENTITYPER geht in die Stadt'; negated; uncertain; involves coreference:
'Richard Hudson'->'ENTITYPER' (Has an entity label matching ENTITYPER); 'gehen'->'gehen' (Matches GEHEN directly); 'in'->'in' (Matches IN directly); 'die Stadt'->'stadt' (Matches STADT directly)

Matched search phrase with text 'Ein ENTITYPER geht in die Stadt'; negated; uncertain; involves coreference:
'Max Mustermann'->'ENTITYPER' (Has an entity label matching ENTITYPER); 'gehen'->'gehen' (Matches GEHEN directly); 'in'->'in' (Matches IN directly); 'die Stadt'->'stadt' (Matches STADT directly)
In each of the two languages, this last example demonstrates several further features of Holmes:
For more examples, please see section 5.
Direct matching between search phrase words and document words is always active. The strategy relies mainly on matching stem forms of words, e.g. matching English buy and child to bought and children, German steigen and Kind to stieg and Kinder. However, in order to increase the chance of direct matching working when the parser delivers an incorrect stem form for a word, the raw-text forms of both search-phrase and document words are also taken into consideration during direct matching.
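The fallback to raw-text forms can be sketched as follows. This is an illustrative reading of the paragraph above, not the library's code: the `Word` class and `matches_directly` function are hypothetical, and the assumption is that any stem or raw-text form of one word may equal any form of the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str   # raw-text form as it appears in the sentence
    lemma: str  # stem form as delivered by the parser (may be wrong)


def matches_directly(search_word: Word, document_word: Word) -> bool:
    """True if a stem or raw-text form of one word equals a form of the other."""
    search_forms = {search_word.lemma.lower(), search_word.text.lower()}
    document_forms = {document_word.lemma.lower(), document_word.text.lower()}
    return not search_forms.isdisjoint(document_forms)


# Normal case: both stems are correct.
stem_match = matches_directly(Word("buy", "buy"), Word("bought", "buy"))
# Parser error on the document side: "children" was wrongly given the stem
# "children", but the raw-text form of the search-phrase word rescues the match.
rescued_match = matches_directly(Word("children", "child"), Word("children", "children"))
no_match = matches_directly(Word("buy", "buy"), Word("children", "child"))
```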
Derivation-based matching involves distinct but related words that typically belong to different word classes, e.g. English assess and assessment, German jagen and Jagd. It is active by default but can be switched off using the analyze_derivational_morphology parameter, which is set when instantiating the Manager class.
Named-entity matching is activated by inserting a special named-entity identifier at the desired point in a search phrase in place of a noun, e.g.
An ENTITYPERSON goes into town (English)
Ein ENTITYPER geht in die Stadt (German).
The supported named-entity identifiers depend directly on the named-entity information supplied by the spaCy models for each language (descriptions copied from an earlier version of the spaCy documentation):
English:

| Identifier | Meaning |
|---|---|
| ENTITYNOUN | Any noun phrase. |
| ENTITYPERSON | People, including fictional. |
| ENTITYNORP | Nationalities or religious or political groups. |
| ENTITYFAC | Buildings, airports, highways, bridges, etc. |
| ENTITYORG | Companies, agencies, institutions, etc. |
| ENTITYGPE | Countries, cities, states. |
| ENTITYLOC | Non-GPE locations, mountain ranges, bodies of water. |
| ENTITYPRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| ENTITYEVENT | Named hurricanes, battles, wars, sports events, etc. |
| ENTITYWORK_OF_ART | Titles of books, songs, etc. |
| ENTITYLAW | Named documents made into laws. |
| ENTITYLANGUAGE | Any named language. |
| ENTITYDATE | Absolute or relative dates or periods. |
| ENTITYTIME | Times smaller than a day. |
| ENTITYPERCENT | Percentage, including "%". |
| ENTITYMONEY | Monetary values, including unit. |
| ENTITYQUANTITY | Measurements, as of weight or distance. |
| ENTITYORDINAL | "first", "second", etc. |
| ENTITYCARDINAL | Numerals that do not fall under another type. |

German:

| Identifier | Meaning |
|---|---|
| ENTITYNOUN | Any noun phrase. |
| ENTITYPER | Named person or family. |
| ENTITYLOC | Name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains). |
| ENTITYORG | Named corporate, governmental, or other organizational entity. |
| ENTITYMISC | Miscellaneous entities, e.g. events, nationalities, products or works of art. |
We have added
ENTITYNOUN to the genuine named-entity identifiers. As
it matches any noun phrase, it behaves in a similar fashion to generic pronouns.
The differences are that
ENTITYNOUN has to match a specific noun phrase within a document
and that this specific noun phrase is extracted and available for further processing.
ENTITYNOUN is not supported within the topic matching use case.
An ontology enables the user to define relationships between words that are then taken into account when matching documents to search phrases. The three relevant relationship types are hyponyms (something is a subtype of something), synonyms (something means the same as something) and named individuals (something is a specific instance of something). The three relationship types are exemplified in Figure 1:
Ontologies are defined to Holmes using the OWL ontology standard serialized using RDF/XML. Such ontologies can be generated with a variety of tools. For the Holmes examples and tests, the free tool Protege was used. It is recommended that you use Protege both to define your own ontologies and to browse the ontologies that ship with the examples and tests. When saving an ontology under Protege, please select RDF/XML as the format. Protege assigns standard labels for the hyponym, synonym and named-individual relationships that Holmes understands as defaults but that can also be overridden.
Ontology entries are defined using an Internationalized Resource Identifier (IRI). Holmes only uses the final fragment for matching, which allows homonyms (words with the same form but multiple meanings) to be defined at multiple points in the ontology tree.
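To make the role of the final fragment concrete, here is a small sketch using only the Python standard library. The two IRIs are invented for illustration and do not come from the ontologies shipped with Holmes; `final_fragment` is a hypothetical helper.

```python
from urllib.parse import urldefrag


def final_fragment(iri: str) -> str:
    """Return the part of an IRI after '#', i.e. the word used for matching."""
    return urldefrag(iri).fragment


# Two hypothetical entries for the homonym "crane" at different points in a tree.
bird = final_fragment("http://example.org/ontologies/animals#crane")
machine = final_fragment("http://example.org/ontologies/machines#crane")
```

Both entries reduce to the same matching word, which is what lets a homonym appear at more than one place in the ontology.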
Ontology-based matching gives the best results with Holmes when small ontologies are used that have been built for specific subject domains and use cases. For example, if you are implementing a chatbot for a building insurance use case, you should create a small ontology capturing the terms and relationships within that specific domain. On the other hand, it is not recommended to use large ontologies built for all domains within an entire language such as WordNet. This is because the many homonyms and relationships that only apply in narrow subject domains will tend to lead to a large number of incorrect matches. For general use cases, embedding-based matching will tend to yield better results.
Each word in an ontology can be regarded as heading a subtree consisting of its hyponyms, synonyms and named individuals, those words' hyponyms, synonyms and named individuals, and so on. With an ontology set up in the standard fashion that is appropriate for the chatbot and structural extraction use cases, a word in a Holmes search phrase matches a word in a document if the document word is within the subtree of the search phrase word. Were the ontology in Figure 1 defined to Holmes, in addition to the direct matching strategy, which would match each word to itself, the following combinations would match:
English phrasal verbs like eat up and German separable verbs like aufessen must be defined as single items within ontologies. When Holmes is analysing a text and comes across such a verb, the main verb and the particle are conflated into a single logical word that can then be matched via an ontology. This means that eat up within a text would match the subtree of eat up within the ontology but not the subtree of eat within the ontology.
If derivation-based matching is active, it is taken into account on both sides of a potential ontology-based match. For example, if alter and amend are defined as synonyms in an ontology, alteration and amendment would also match each other.
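The standard (non-symmetric) subtree rule can be sketched with a toy ontology. The ontology below is invented for illustration and is not the one in Figure 1; the dictionaries and function names are hypothetical, and phrasal verbs and derivation-based matching are left out to keep the example short.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical toy ontology: each relation maps a word to directly related words.
HYPONYMS: Dict[str, List[str]] = {"animal": ["dog", "cat"], "dog": ["puppy"], "cat": ["kitten"]}
SYNONYMS: Dict[str, List[str]] = {"dog": ["hound"], "hound": ["dog"]}
INDIVIDUALS: Dict[str, List[str]] = {"dog": ["Fido"]}


def subtree(word: str) -> Set[str]:
    """Collect every word reachable via hyponyms, synonyms and named individuals."""
    seen: Set[str] = set()
    queue = deque([word])
    while queue:
        current = queue.popleft()
        for relation in (HYPONYMS, SYNONYMS, INDIVIDUALS):
            for related in relation.get(current, []):
                if related != word and related not in seen:
                    seen.add(related)
                    queue.append(related)
    return seen


def ontology_match(search_phrase_word: str, document_word: str) -> bool:
    """A document word matches if it lies within the search phrase word's subtree."""
    return document_word in subtree(search_phrase_word)
```

Note the asymmetry: `ontology_match("animal", "puppy")` holds, but `ontology_match("puppy", "animal")` does not, and siblings such as dog and cat never match each other.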
In situations where finding relevant sentences is more important than ensuring the logical correspondence of document matches to search phrases, it may make sense to specify symmetric matching when defining the ontology. Symmetric matching is recommended for the topic matching use case, but is unlikely to be appropriate for the chatbot or structural extraction use cases. It means that the hypernym (reverse hyponym) relationship is taken into account as well as the hyponym and synonym relationships when matching, thus leading to a more symmetric relationship between documents and search phrases. An important rule applied when matching via a symmetric ontology is that a match path may not contain both hypernym and hyponym relationships, i.e. you cannot go back on yourself. Were the ontology above defined as symmetric, the following combinations would match:
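The "no going back on yourself" rule can be sketched as two separate searches, one that only moves downwards and one that only moves upwards, with synonyms allowed in either. The toy ontology is invented for illustration rather than taken from Figure 1, named individuals are omitted for brevity, and all names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical toy ontology.
HYPONYMS: Dict[str, List[str]] = {"animal": ["dog", "cat"], "dog": ["puppy"], "cat": ["kitten"]}
SYNONYMS: Dict[str, List[str]] = {"dog": ["hound"], "hound": ["dog"]}

# Hypernyms are simply the hyponym relation reversed.
HYPERNYMS: Dict[str, List[str]] = {}
for parent, children in HYPONYMS.items():
    for child in children:
        HYPERNYMS.setdefault(child, []).append(parent)


def reachable(start: str, vertical: Dict[str, List[str]]) -> Set[str]:
    """Words reachable from start using synonyms plus ONE vertical direction only."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for related in vertical.get(current, []) + SYNONYMS.get(current, []):
            if related not in seen:
                seen.add(related)
                queue.append(related)
    return seen


def symmetric_match(first: str, second: str) -> bool:
    """Match if a purely downward or a purely upward path links the two words."""
    return second in reachable(first, HYPONYMS) or second in reachable(first, HYPERNYMS)
```

Because a path may not mix directions, puppy and animal match each other in both orders, while the siblings dog and cat still do not match.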
In the supervised document classification use case, two separate ontologies can be used:
The structural matching ontology is used to analyse the content of both training and test documents. Each word from a document that is found in the ontology is replaced by its most general hypernym ancestor. It is important to realise that an ontology is only likely to work with structural matching for supervised document classification if it was built specifically for the purpose: such an ontology should consist of a number of separate trees representing the main classes of object in the documents to be classified. In the example ontology shown above, all words in the ontology would be replaced with animal; in an extreme case with a WordNet-style ontology, all nouns would end up being replaced with thing, which is clearly not a desirable outcome!
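Replacement by the most general hypernym ancestor can be sketched as a walk up a parent map. The two-tree ontology below is a made-up example of the "separate trees" layout recommended above; `PARENT`, `most_general_hypernym` and `generalise` are hypothetical names.

```python
from typing import Dict, List

# Hypothetical ontology with two separate trees, stored as child -> parent.
PARENT: Dict[str, str] = {
    "puppy": "dog",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "lorry": "vehicle",
}


def most_general_hypernym(word: str) -> str:
    """Follow parent links until the root of the word's tree is reached."""
    seen = {word}
    while word in PARENT:
        word = PARENT[word]
        if word in seen:
            raise ValueError("cycle detected in ontology")
        seen.add(word)
    return word


def generalise(words: List[str]) -> List[str]:
    """Replace each word found in the ontology; other words are left unchanged."""
    return [most_general_hypernym(word) for word in words]


generalised = generalise(["puppy", "chases", "car"])
```

With a single-rooted ontology every known word would collapse to the same root, which is the undesirable outcome the paragraph above warns about.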
The classification ontology is used to capture relationships between classification labels: that a document
has a certain classification implies it also has any classifications to whose subtree that classification belongs.
Synonyms should be used sparingly if at all in classification ontologies because they add to the complexity of the
neural network without adding any value; and although it is technically possible to set up a classification
ontology to use symmetric matching, there is no sensible reason for doing so. Note that a label within the
classification ontology that is not directly defined as the label of any training document
has to be registered specifically using the
SupervisedTopicTrainingBasis.register_additional_classification_label() method if it is to be taken into
account when training the classifier.
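The way one classification implies others can be sketched as collecting all ancestors of a label. The labels and the `BROADER` mapping are invented for illustration, and the code does not call Holmes itself.

```python
from typing import Dict, List, Set

# Hypothetical classification ontology: label -> directly broader labels.
BROADER: Dict[str, List[str]] = {
    "football": ["sport"],
    "tennis": ["sport"],
    "sport": ["leisure"],
}


def implied_labels(label: str) -> Set[str]:
    """Return the label together with every label whose subtree contains it."""
    implied: Set[str] = set()
    stack = [label]
    while stack:
        current = stack.pop()
        if current not in implied:
            implied.add(current)
            stack.extend(BROADER.get(current, []))
    return implied


football_labels = implied_labels("football")
```

In this toy setup, if no training document were labelled directly as sport or leisure, those would be the labels that need registering as additional classification labels, as described above.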
spaCy offers word embeddings: machine-learning-generated numerical vector representations of words that capture the contexts in which each word tends to occur. Two words with similar meaning tend to emerge with word embeddings that are close to each other, and spaCy can measure the cosine similarity between any two words' embeddings expressed as a decimal between 0.0 (no similarity) and 1.0 (the same word). Because dog and cat tend to appear in similar contexts, they have a similarity of 0.80; dog and horse have less in common and have a similarity of 0.62; and dog and iron have a similarity of only 0.25. Embedding-based matching is only activated for nouns, adjectives and adverbs because the results have been found to be unsatisfactory with other word classes.
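For readers unfamiliar with cosine similarity, the calculation itself is short. The sketch below uses plain Python lists and made-up toy vectors rather than real spaCy embeddings, so the values bear no relation to the dog, cat and horse figures quoted above.

```python
import math
from typing import Sequence


def cosine_similarity(first: Sequence[float], second: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(first) != len(second):
        raise ValueError("vectors must have the same length")
    dot_product = sum(x * y for x, y in zip(first, second))
    first_norm = math.sqrt(sum(x * x for x in first))
    second_norm = math.sqrt(sum(y * y for y in second))
    if first_norm == 0.0 or second_norm == 0.0:
        return 0.0  # convention chosen here for vectors with no content
    return dot_product / (first_norm * second_norm)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```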
It is important to understand that the fact that two words have similar embeddings does not imply the same sort of logical relationship between the two as when ontology-based matching is used: for example, the fact that dog and cat have similar embeddings means neither that a dog is a type of cat nor that a cat is a type of dog. Whether or not embedding-based matching is nonetheless an appropriate choice depends on the functional use case.
For the chatbot, structural extraction and supervised document classification use cases, Holmes makes use of word-embedding-based similarities using an overall_similarity_threshold parameter defined globally on the Manager class. A match is detected between a search phrase and a structure within a document whenever the geometric mean of the similarities between the individual corresponding word pairs is greater than this threshold. The intuition behind this technique is that where a search phrase with e.g. six lexical words has matched a document structure where five of these words match exactly and only one corresponds via an embedding, the similarity that should be required to match this sixth word is less than when only three of the words matched exactly and two of the other words also correspond via an embedding.
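The effect of the geometric mean can be checked with a few lines of arithmetic. This is a sketch of the rule as stated above, not Holmes's code: exact matches are assumed to count as similarity 1.0, the threshold of 0.85 is an arbitrary example value, and the function names are hypothetical.

```python
import math
from typing import Sequence


def geometric_mean(similarities: Sequence[float]) -> float:
    """Geometric mean of the word-pair similarities."""
    if not similarities:
        raise ValueError("at least one similarity is required")
    return math.prod(similarities) ** (1.0 / len(similarities))


def is_match(similarities: Sequence[float], overall_similarity_threshold: float) -> bool:
    """A structure matches when the geometric mean exceeds the threshold."""
    return geometric_mean(similarities) > overall_similarity_threshold


def required_similarity(total_words: int, exact_matches: int, threshold: float) -> float:
    """Similarity that each non-exact word must exceed, assuming they all share it."""
    remaining = total_words - exact_matches
    if remaining < 1:
        raise ValueError("at least one word must be matched via an embedding")
    return threshold ** (total_words / remaining)


one_embedding = required_similarity(6, 5, 0.85)     # 0.85 ** 6, roughly 0.38
three_embeddings = required_similarity(6, 3, 0.85)  # 0.85 ** 2
```

With six words and five exact matches, the sixth word needs a similarity of only about 0.38, whereas with three exact matches each of the remaining three words needs more than 0.72.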
Matching a search phrase to a document begins by finding words in the document that match the word at the root (syntactic head) of the search phrase. Holmes then investigates the structure around each of these matched document words to check whether the document structure matches the search phrase structure in its entirety. The document words that match the search phrase root word are normally found using an index. However, if embeddings have to be taken into account when finding document words that match a search phrase root word, every word in every document with a valid word class has to be compared for similarity to that search phrase root word. This has a very noticeable performance hit that renders all use cases except the chatbot use case essentially unusable.
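The difference in cost between the two ways of finding root-word candidates can be seen in a simplified sketch. Everything here is hypothetical: documents are reduced to lists of stem forms, the word-class filter is omitted, and `toy_similarity` stands in for a real embedding comparison.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Position = Tuple[str, int]  # (document label, word index)

# Toy documents, already reduced to stem forms.
DOCUMENTS: Dict[str, List[str]] = {
    "d1": ["dog", "chase", "cat"],
    "d2": ["hound", "pursue", "cat"],
}
TOY_SIMILARITIES = {frozenset(("chase", "pursue")): 0.9}


def toy_similarity(first: str, second: str) -> float:
    """Stand-in for an embedding similarity lookup."""
    if first == second:
        return 1.0
    return TOY_SIMILARITIES.get(frozenset((first, second)), 0.0)


def build_index(documents: Dict[str, List[str]]) -> Dict[str, List[Position]]:
    """Map each stem form to every position at which it occurs."""
    index: Dict[str, List[Position]] = defaultdict(list)
    for label, lemmas in documents.items():
        for position, lemma in enumerate(lemmas):
            index[lemma].append((label, position))
    return index


def candidates_by_index(index: Dict[str, List[Position]], root: str) -> List[Position]:
    """One dictionary lookup, independent of corpus size."""
    return index.get(root, [])


def candidates_by_embedding(documents: Dict[str, List[str]], root: str,
                            similarity: Callable[[str, str], float],
                            threshold: float) -> List[Position]:
    """Compares the root word with every word of every document."""
    return [(label, position)
            for label, lemmas in documents.items()
            for position, lemma in enumerate(lemmas)
            if similarity(root, lemma) >= threshold]


indexed = candidates_by_index(build_index(DOCUMENTS), "chase")
scanned = candidates_by_embedding(DOCUMENTS, "chase", toy_similarity, 0.8)
```

The scan finds the extra candidate in the second document, but only by touching every word in the corpus, which is the performance hit described above.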
To avoid the typically unnecessary performance hit that results from embedding-based matching of search phrase root words, it is controlled separately from embedding-based matching in general using the embedding_based_matching_on_root_words parameter, which is set when instantiating the Manager class. You are advised to keep this setting switched off (value False) for most use cases.
Neither the overall_similarity_threshold nor the embedding_based_matching_on_root_words parameter has any effect on the topic matching use case. Here word-level embedding similarity thresholds are set using the word_embedding_match_threshold and initial_question_word_embedding_match_threshold parameters when calling the topic_match_documents_against function on the Manager class.
A named-entity-embedding-based match obtains between a searched-document word that has a certain entity label and a search phrase or query document word whose embedding is sufficiently similar to the underlying meaning of that entity label, e.g. the word individual in a search phrase has a similar word embedding to the underlying meaning of the PERSON entity label. Note that named-entity-embedding-based matching is never active on root words regardless of the embedding_based_matching_on_root_words setting.
Initial-question-word matching is only active during topic matching. Initial question words in query phrases match entities in the searched documents that represent potential answers to the question, e.g. when comparing the query phrase When did Peter have breakfast to the searched-document phrase Peter had breakfast at 8 a.m., the question word When would match the temporal adverbial phrase at 8 a.m..
Initial-question-word matching is switched on and off using the initial_question_word_behaviour parameter when calling the topic_match_documents_against function on the Manager class. It is only likely to be useful when topic matching is being performed in an interactive setting where the user enters short query phrases, as opposed to when it is being used to find documents on a similar topic to a pre-existing query document: initial question words are only processed at the beginning of the first sentence of the query phrase or query document.
If a query phrase consists of a complex question with several elements dependent on the main verb, a finding in a searched document is only strictly an 'answer' if it contains matches to all these elements. Because recall is typically more important than precision when performing topic matching with interactive query phrases, however, Holmes will match an initial question word to a searched-document phrase wherever they correspond semantically (e.g. wherever when corresponds to a temporal adverbial phrase) and each depend on verbs that themselves match at the word level. One possible strategy to filter out 'incomplete answers' would be to calculate the maximum possible score for a query phrase and reject topic matches that score below a threshold scaled to this maximum.
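The filtering strategy suggested above might look like the following sketch. It assumes that each topic match is available as a dictionary with a 'score' entry and that the maximum possible score has already been worked out by some means not shown here; the function name, the 0.8 fraction and the example scores are all hypothetical.

```python
from typing import Dict, List


def filter_incomplete_answers(topic_matches: List[Dict], maximum_possible_score: float,
                              minimum_fraction: float) -> List[Dict]:
    """Keep only topic matches scoring at least a given fraction of the maximum."""
    if not 0.0 <= minimum_fraction <= 1.0:
        raise ValueError("minimum_fraction must lie between 0.0 and 1.0")
    cutoff = maximum_possible_score * minimum_fraction
    return [match for match in topic_matches if match["score"] >= cutoff]


# Invented scores for illustration only.
example_matches = [
    {"document_label": "d1", "score": 95.0},
    {"document_label": "d2", "score": 40.0},
]
complete_answers = filter_incomplete_answers(example_matches, 100.0, 0.8)
```

The right fraction would need tuning per application, since it trades the recall that topic matching favours against the completeness of each answer.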
Before Holmes analyses a searched document or query document, coreference resolution is performed using the coreferee library running on top of spaCy. This means that situations are recognised where pronouns and nouns that are located near one another within a text refer to the same entities. The information from one mention can then be applied to the analysis of further mentions:
I saw a big dog. It was chasing a cat.
I saw a big dog. The dog was chasing a cat.
Coreferee also detects situations where a noun refers back to a named entity:
We discussed msg systems. The company had made a profit.
If this example were to match the search phrase A company makes a profit, the coreference information that the company under discussion is msg systems is clearly relevant and worth extracting in addition to the word(s) directly matched to the search phrase. Such information is captured in the word_match.extracted_word field.
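As a rough sketch of how a calling application might pick up this information, the following helper lists the word matches whose extracted word differs from the word that was matched directly. The dictionary layout (a 'word_matches' list whose entries have 'document_word' and 'extracted_word' keys) is an assumption made for this illustration, and the example data is hand-written rather than real Holmes output.

```python
def coreference_extractions(match):
    """Return (document_word, extracted_word) pairs where the two words differ.

    match -- a dictionary assumed to contain a 'word_matches' list whose
        entries have 'document_word' and 'extracted_word' keys.
    """
    pairs = []
    for word_match in match["word_matches"]:
        document_word = word_match["document_word"]
        extracted_word = word_match["extracted_word"]
        # A difference suggests that coreference supplied additional information.
        if extracted_word.lower() != document_word.lower():
            pairs.append((document_word, extracted_word))
    return pairs


# Hand-written stand-in for a match of 'A company makes a profit'.
example_match = {
    "word_matches": [
        {"document_word": "company", "extracted_word": "msg systems"},
        {"document_word": "make", "extracted_word": "make"},
        {"document_word": "profit", "extracted_word": "profit"},
    ]
}
print(coreference_extractions(example_match))
```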
The concept of search phrases has already been introduced and is relevant to the chatbot use case, the structural extraction use case and to preselection within the supervised document classification use case.
It is crucial to understand that the tips and limitations set out in Section 4 do not apply in any way to query phrases in topic matching. If you are using Holmes for topic matching only, you can completely ignore this section!
Structural matching between search phrases and documents is not symmetric: there are many situations in which sentence X as a search phrase would match sentence Y within a document but where the converse would not be true. Although Holmes does its best to understand any search phrases, the results are better when the user writing them follows certain patterns and tendencies, and getting to grips with these patterns and tendencies is the key to using the relevant features of Holmes successfully.
Holmes distinguishes between: lexical words like dog, chase and cat (English) or Hund, jagen and Katze (German) in the initial example above; and grammatical words like a (English) or ein and eine (German) in the initial example above. Only lexical words match words in documents, but grammatical words still play a crucial role within a search phrase: they enable Holmes to understand it.
Dog chase cat (English)
Hund jagen Katze (German)
contain the same lexical words as the search phrases in the initial example above, but as they are not grammatical sentences Holmes is liable to misunderstand them if they are used as search phrases. This is a major difference between Holmes search phrases and the search phrases you use instinctively with standard search engines like Google, and it can take some getting used to.
A search phrase need not contain a verb:
A big dog (English)
Interest in fishing (English)
Ein großer Hund (German)
Interesse am Angeln (German)
are all perfectly valid and potentially useful search phrases.
Where a verb is present, however, Holmes delivers the best results when the verb is in the present active, as chases and jagt are in the initial example above. This gives Holmes the best chance of understanding the relationship correctly and of matching the widest range of document structures that share the target meaning.
Sometimes you may only wish to extract the object of a verb. For example, you might want to find sentences that are discussing a cat being chased regardless of who is doing the chasing. In order to avoid a search phrase containing a passive expression like
A cat is chased (English)
Eine Katze wird gejagt (German)
you can use a generic pronoun. This is a word that Holmes treats like a grammatical word in that it is not matched to documents; its sole purpose is to help the user form a grammatically optimal search phrase in the present active. Recognised generic pronouns are English something, somebody and someone and German jemand (and inflected forms of jemand) and etwas: Holmes treats them all as equivalent. Using generic pronouns, the passive search phrases above could be re-expressed as
Somebody chases a cat (English)
Jemand jagt eine Katze (German).
Experience shows that different prepositions are often used with the same meaning in equivalent phrases and that this can prevent search phrases from matching where one would intuitively expect it. For example, the search phrases
Somebody is at the market (English)
Jemand ist auf dem Marktplatz (German)
would fail to match the document phrases
Richard was in the market (English)
Richard war am Marktplatz (German)
The best way of solving this problem is to define the prepositions in question as synonyms in an ontology.
The following types of structures are prohibited in search phrases and result in Python user-defined errors:
A dog chases a cat. A cat chases a dog (English)
Ein Hund jagt eine Katze. Eine Katze jagt einen Hund (German)
Each clause must be separated out into its own search phrase and registered individually.
A dog does not chase a cat. (English)
Ein Hund jagt keine Katze. (German)
Negative expressions are recognised as such in documents and the generated matches marked as negative; allowing search phrases themselves to be negative would overcomplicate the library without offering any benefits.
A dog and a lion chase a cat. (English)
Ein Hund und ein Löwe jagen eine Katze. (German)
Wherever conjunction occurs in documents, Holmes distributes the information among multiple matches as explained above. In the unlikely event that there should be a requirement to capture conjunction explicitly when matching, this could be achieved by using the Manager.match() function and looking for situations where the document token objects are shared by multiple match objects.
A search phrase cannot be processed if it does not contain any words that can be matched to documents.
A dog chases a cat and he chases a mouse (English)
Ein Hund jagt eine Katze und er jagt eine Maus (German)
Pronouns that corefer with nouns elsewhere in the search phrase are not permitted as this would overcomplicate the library without offering any benefits.
The following types of structures are strongly discouraged in search phrases:
Dog chase cat (English)
Hund jagen Katze (German)
Although these will sometimes work, the results will be better if search phrases are expressed grammatically.
A cat is chased by a dog (English)
A dog will have chased a cat (English)
Eine Katze wird durch einen Hund gejagt (German)
Ein Hund wird eine Katze gejagt haben (German)
Although these will sometimes work, the results will be better if verbs in search phrases are expressed in the present active.
Who chases the cat? (English)
Wer jagt die Katze? (German)
Although questions are supported as query phrases in the topic matching use case, they are not appropriate as search phrases. Questions should be re-phrased as statements, in this case
Something chases the cat (English)
Etwas jagt die Katze (German).
Eine Zeltfeier (German)
The internal structure of German compound words is analysed within searched documents as well as within query phrases in the topic matching use case, but not within search phrases. Compound words should be reexpressed using multiple words, which in any case enables the relationship between the elements to be expressed less ambiguously. Compound words can often but not always be reexpressed as genitive phrases:
Extraktion der Information (German)
Eine Feier in einem Zelt (German)
The following types of structures should be used with caution in search phrases:
A fierce dog chases a scared cat on the way to the theatre (English)
Ein kämpferischer Hund jagt eine verängstigte Katze auf dem Weg ins Theater (German)
Holmes can handle any level of complexity within search phrases, but the more complex a structure, the less likely it becomes that a document sentence will match it. If it is really necessary to match such complex relationships with search phrases rather than with topic matching, they are typically better extracted by splitting the search phrase up, e.g.
A fierce dog (English)
A scared cat (English)
A dog chases a cat (English)
Something chases something on the way to the theatre (English)
Ein kämpferischer Hund (German)
Eine verängstigte Katze (German)
Ein Hund jagt eine Katze (German)
Etwas jagt etwas auf dem Weg ins Theater (German)
Correlations between the resulting matches can then be established by matching via the Manager.match() function and looking for situations where the document token objects are shared across multiple match objects.
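A minimal sketch of this correlation step is shown here. It does not operate on Holmes match objects directly: it assumes the caller has already reduced each match to a simplified record with the hypothetical keys 'document', 'search_phrase_label' and 'token_indexes' (the indexes of the document tokens involved in the match).

```python
from itertools import combinations


def correlate_matches(matches):
    """Find pairs of matches from different search phrases that share document tokens.

    matches -- a list of simplified records (hypothetical layout) with the keys
        'document', 'search_phrase_label' and 'token_indexes'.
    Returns a list of (label_a, label_b, shared_token_indexes) tuples.
    """
    correlations = []
    for first, second in combinations(matches, 2):
        if first["document"] != second["document"]:
            continue  # tokens from different documents can never coincide
        if first["search_phrase_label"] == second["search_phrase_label"]:
            continue  # only correlations across search phrases are of interest
        shared = set(first["token_indexes"]) & set(second["token_indexes"])
        if shared:
            correlations.append(
                (first["search_phrase_label"], second["search_phrase_label"], sorted(shared))
            )
    return correlations
```

The pairwise comparison is quadratic in the number of matches, which should be acceptable for the match counts typical of a single document but may need an index keyed on token position for larger result sets.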
One possible exception to this piece of advice is when embedding-based matching is active. Because whether or not each word in a search phrase matches then depends on whether or not other words in the same search phrase have been matched, large, complex search phrases can sometimes yield results that a combination of smaller, simpler search phrases would not.
The chasing of a cat (English)
Die Jagd einer Katze (German)
These will often work, but it is generally better practice to use verbal search phrases like
Something chases a cat (English)
Etwas jagt eine Katze (German)
and to allow the corresponding nominal phrases to be matched via derivation-based matching.
The chatbot use case has already been introduced: a predefined set of search phrases is used to extract information from phrases entered interactively by an end user, which in this use case act as the documents.
The Holmes source code ships with two examples demonstrating the chatbot
use case, one for each language, with predefined ontologies. Having
cloned the source code and installed the Holmes library,
navigate to the
/examples directory and type the following (Linux):
or click on the files in Windows Explorer (Windows).
Holmes matches syntactically distinct structures that are semantically equivalent, i.e. that share the same meaning. In a real chatbot use case, users will typically enter equivalent information with phrases that are semantically distinct as well, i.e. that have different meanings. Because the effort involved in registering a search phrase is barely greater than the time it takes to type it in, it makes sense to register a large number of search phrases for each relationship you are trying to extract: essentially all ways people have been observed to express the information you are interested in or all ways you can imagine somebody might express the information you are interested in. To assist this, search phrases can be registered with labels that do not need to be unique: a label can then be used to express the relationship an entire group of search phrases is designed to extract. Note that when many search phrases have been defined to extract the same relationship, a single user entry will sometimes be matched by multiple search phrases. This must be handled appropriately by the calling application.
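One way a calling application might handle several search phrases matching the same user entry is to group the matches by label, so that each relationship is acted upon once. The sketch below assumes that each match is a dictionary carrying the label of its search phrase under a 'search_phrase_label' key; that key name is an assumption of this example.

```python
def group_matches_by_label(matches):
    """Group match dictionaries by search phrase label, keeping first-seen order.

    matches -- a list of dictionaries, each assumed to have a
        'search_phrase_label' entry.
    Returns a dictionary mapping each label to the list of its matches.
    """
    grouped = {}
    for match in matches:
        grouped.setdefault(match["search_phrase_label"], []).append(match)
    return grouped
```

The chatbot logic can then iterate over the keys of the returned dictionary and decide per relationship which of the grouped matches to use, for example the first one.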
One obvious weakness of Holmes in the chatbot setting is its sensitivity to correct spelling and, to a lesser extent, to correct grammar. Strategies for mitigating this weakness include:
The structural extraction use case uses structural matching in the same way as the chatbot use case, and many of the same comments and tips apply to it. The principal differences are that pre-existing and often lengthy documents are scanned rather than text snippets entered ad-hoc by the user, and that the returned match objects are not used to drive a dialog flow; they are examined solely to extract and store structured information.
Code for performing structural extraction would typically perform the following tasks:
Call Manager.register_search_phrase() several times to define a number of search phrases specifying the information to be extracted.
Call Manager.parse_and_register_document() several times to load a number of documents within which to search.
Call Manager.match() to perform the matching.
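The tasks listed above could be wrapped in a small helper along the following lines. The helper name run_structural_extraction and its argument shapes are hypothetical. It only relies on the three Manager methods named above and accepts any object that offers them, so it makes no assumptions about how the Manager itself is constructed.

```python
def run_structural_extraction(manager, search_phrases, documents):
    """Register search phrases and documents on a Manager-like object, then match.

    manager -- an object offering register_search_phrase(),
        parse_and_register_document() and match().
    search_phrases -- an iterable of (search_phrase_text, label) pairs.
    documents -- a dictionary mapping unique document labels to raw document texts.
    Returns whatever manager.match() returns.
    """
    for search_phrase_text, label in search_phrases:
        manager.register_search_phrase(search_phrase_text, label)
    for label, document_text in documents.items():
        manager.parse_and_register_document(document_text, label)
    return manager.match()
```

What is done with the returned matches, for example writing the extracted company names to a database, is up to the calling application.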
The topic matching use case matches a query document, or alternatively a query phrase entered ad-hoc by the user, against a set of documents pre-loaded into memory. The aim is to find the passages in the documents whose topic most closely corresponds to the topic of the query document; the output is an ordered list of passages scored according to topic similarity. Additionally, if a query phrase contains an initial question word, the output will contain potential answers to the question.
Topic matching queries may contain generic pronouns and
named-entity identifiers just like search phrases, although the
token is not supported. However, an important difference from
search phrases is that the topic matching use case places no
restrictions on the grammatical structures permissible within the query document.
The Holmes source code ships with three examples demonstrating the topic matching use case with an English literature corpus, a German literature corpus and a German legal corpus respectively. The two literature examples are hosted at the Holmes demonstration website, although users are encouraged to run the scripts locally as well to get a feel for how they work. The German law example starts a simple interactive console and its script contains some example queries as comments.
Topic matching uses a variety of strategies to find text passages that are relevant to the query. These include resource-hungry procedures like investigating semantic relationships and comparing embeddings. Because applying these across the board would prevent topic matching from scaling, Holmes only attempts them for specific areas of the text that less resource-intensive strategies have already marked as looking promising. This and the other interior workings of topic matching are explained here.
In the supervised document classification use case, a classifier is trained with a number of documents that are each pre-labelled with a classification. The trained classifier then assigns one or more labels to new documents according to what each new document is about. As explained here, ontologies can be used both to enrich the comparison of the content of the various documents and to capture implication relationships between classification labels.
A classifier makes use of a neural network (a multilayer perceptron) whose topology can either be determined automatically by Holmes or specified explicitly by the user. With a large number of training documents, the automatically determined topology can easily exhaust the memory available on a typical machine; if there is no opportunity to scale up the memory, this problem can be remedied by specifying a smaller number of hidden layers or a smaller number of nodes in one or more of the layers.
A trained document classification model retains no references to its training data. This is an advantage from a data protection viewpoint, although it cannot presently be guaranteed that models will not contain individual personal or company names.
A typical problem with the execution of many document classification use cases is that a new classification label is added when the system is already live but that there are initially no examples of this new classification with which to train a new model. The best course of action in such a situation is to define search phrases which preselect the more obvious documents with the new classification using structural matching. Those documents that are not preselected as having the new classification label are then passed to the existing, previously trained classifier in the normal way. When enough documents exemplifying the new classification have accumulated in the system, the model can be retrained and the preselection search phrases removed.
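The preselection flow described above might be organised as in the following sketch. The function name and the preselection_matcher callable are hypothetical: the callable stands for whatever code runs the preselection search phrases against the text and returns the resulting structural matches. The classifier is only assumed to offer a parse_and_classify(text) method.

```python
def classify_with_preselection(text, preselection_matcher, classifier, new_label):
    """Assign new_label if a preselection search phrase matches, else use the classifier.

    text -- the raw text of the document to classify.
    preselection_matcher -- hypothetical callable that takes the text and returns
        the list of structural matches found by the preselection search phrases.
    classifier -- an object offering parse_and_classify(text).
    new_label -- the classification label for which no trained model exists yet.
    """
    if preselection_matcher(text):
        # At least one preselection search phrase matched.
        return [new_label]
    # Not preselected: fall back to the previously trained classifier.
    return classifier.parse_and_classify(text)
```

Once the model has been retrained with examples of the new classification, the call sites can revert to using the classifier directly.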
Holmes ships with an example script demonstrating supervised document classification for English with the BBC Documents dataset. The script downloads the documents (for this operation and for this operation alone, you will need to be online) and places them in a working directory. When training is complete, the script saves the model to the working directory. If the model file is found in the working directory on subsequent invocations of the script, the training phase is skipped and the script goes straight to the testing phase. To repeat the training phase, either delete the model from the working directory or specify a new working directory to the script.
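The skip-training-if-a-model-exists behaviour can be illustrated with a short sketch. This is not the code of the example script: the file name model.bin, the helper name and the assumption that the serialized model is handled as bytes are all choices made for this illustration.

```python
from pathlib import Path


def load_or_train_model(working_directory, train_model, model_filename="model.bin"):
    """Return (serialized_model, was_trained), training only if no model file exists.

    working_directory -- the directory in which the model file is kept.
    train_model -- a callable that performs training and returns the
        serialized model (assumed here to be bytes).
    model_filename -- hypothetical file name for the stored model.
    """
    model_path = Path(working_directory) / model_filename
    if model_path.exists():
        # A model from an earlier run is present: skip the training phase.
        return model_path.read_bytes(), False
    serialized_model = train_model()
    model_path.parent.mkdir(parents=True, exist_ok=True)
    model_path.write_bytes(serialized_model)
    return serialized_model, True
```

Deleting the model file or passing a different working directory forces the training callable to run again.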
Having cloned the source code and installed the Holmes library,
navigate to the
/examples directory. Specify a working directory at the top of the
example_supervised_topic_model_EN.py file, then type
python3 example_supervised_topic_model_EN.py (Linux)
or click on the script in Windows Explorer (Windows).
It is important to realise that Holmes learns to classify documents according to the words or semantic relationships they contain, taking any structural matching ontology into account in the process. For many classification tasks, this is exactly what is required; but there are tasks (e.g. author attribution according to the frequency of grammatical constructions typical for each author) where it is not. For the right task, Holmes achieves impressive results. For the BBC Documents benchmark processed by the example script, Holmes performs slightly better than benchmarks available online (see e.g. here) although the difference is probably too slight to be significant, especially given that different training/test splits were used in each case: Holmes has been observed to learn models that predict the correct result between 96.9% and 98.7% of the time. The range is explained by the fact that the behaviour of the neural network is not fully deterministic.
The interior workings of supervised document classification are explained here.
holmes_extractor.Manager(self, model, *, overall_similarity_threshold=1.0, embedding_based_matching_on_root_words=False, ontology=None, analyze_derivational_morphology=True, perform_coreference_resolution=None, number_of_workers=None, verbose=False)

The facade class for the Holmes library.

Parameters:

model -- the name of the spaCy model, e.g. *en_core_web_trf*.
overall_similarity_threshold -- the overall similarity threshold for embedding-based matching. Defaults to *1.0*, which deactivates embedding-based matching. Note that this parameter is not relevant for topic matching, where the thresholds for embedding-based matching are set on the call to *topic_match_documents_against*.
embedding_based_matching_on_root_words -- determines whether or not embedding-based matching should be attempted on search-phrase root tokens, which has a considerable performance hit. Defaults to *False*. Note that this parameter is not relevant for topic matching.
ontology -- an *Ontology* object. Defaults to *None* (no ontology).
analyze_derivational_morphology -- *True* if matching should be attempted between different words from the same word family. Defaults to *True*.
perform_coreference_resolution -- *True* if coreference resolution should be taken into account when matching. Defaults to *True*.
use_reverse_dependency_matching -- *True* if appropriate dependencies in documents can be matched to dependencies in search phrases where the two dependencies point in opposite directions. Defaults to *True*.
number_of_workers -- the number of worker processes to use, or *None* if the number of worker processes should depend on the number of available cores. Defaults to *None*.
verbose -- a boolean value specifying whether multiprocessing messages should be outputted to the console. Defaults to *False*.
Manager.register_serialized_document(self, serialized_document:bytes, label:str='') -> None Parameters: serialized_document -- a preparsed Holmes document. label -- a label for the document which must be unique. Defaults to the empty string, which is intended for use cases involving single documents (typically user entries).
Manager.register_serialized_documents(self, document_dictionary:dict[str, bytes]) -> None Note that this function is the most efficient way of loading documents. Parameters: document_dictionary -- a dictionary from labels to serialized documents.
Manager.parse_and_register_document(self, document_text:str, label:str='') -> None Parameters: document_text -- the raw document text. label -- a label for the document which must be unique. Defaults to the empty string, which is intended for use cases involving single documents (typically user entries).
Manager.remove_document(self, label:str) -> None
Manager.remove_all_documents(self) -> None
Manager.document_labels(self) -> list[str] Returns a list of the labels of the currently registered documents.
Manager.serialize_document(self, label:str) -> bytes Returns a serialized representation of a Holmes document that can be persisted to a file. If 'label' is not the label of a registered document, 'None' is returned instead. Parameters: label -- the label of the document to be serialized.
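As an illustration of persisting documents to files, the following sketch writes each registered document to its own file and later reloads all of them in one call. It relies only on the document_labels(), serialize_document() and register_serialized_documents() methods described in this section. The file naming scheme (a doc_ prefix followed by the hex-encoded label and a .holmes suffix) is an arbitrary choice made for this example so that labels containing characters that are not valid in file names are handled safely.

```python
from pathlib import Path

PREFIX = "doc_"  # hypothetical naming scheme, not a Holmes convention


def save_documents(manager, directory):
    """Write every registered document to its own file inside directory."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for label in manager.document_labels():
        filename = PREFIX + label.encode("utf-8").hex() + ".holmes"
        (directory / filename).write_bytes(manager.serialize_document(label))


def load_documents(manager, directory):
    """Read the files written by save_documents and register them in one call."""
    document_dictionary = {}
    for path in sorted(Path(directory).glob(PREFIX + "*.holmes")):
        # Recover the original label from the hex-encoded file name.
        label = bytes.fromhex(path.stem[len(PREFIX):]).decode("utf-8")
        document_dictionary[label] = path.read_bytes()
    manager.register_serialized_documents(document_dictionary)
```

Loading everything through a single register_serialized_documents() call follows the note above that this function is the most efficient way of loading documents.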
Manager.get_document(self, label:str='') -> Doc Returns a Holmes document. If *label* is not the label of a registered document, *None* is returned instead. Parameters: label -- the label of the document to be returned.
Manager.debug_document(self, label:str='') -> Doc Outputs a debug representation for a loaded document. Parameters: label -- the label of the document for which the debug representation is to be output.
Manager.register_search_phrase(self, search_phrase_text:str, label:str=None) -> SearchPhrase Registers and returns a new search phrase. Parameters: search_phrase_text -- the raw search phrase text. label -- a label for the search phrase, which need not be unique. If label==None, the assigned label defaults to the raw search phrase text.
Manager.remove_all_search_phrases_with_label(self, label:str) -> None
Manager.remove_all_search_phrases(self) -> None
Manager.list_search_phrase_labels(self) -> list[str]
Manager.match(self, search_phrase_text:str=None, document_text:str=None) -> list[dict] Matches search phrases to documents and returns the result as match dictionaries. Parameters: search_phrase_text -- a text from which to generate a search phrase, or *None* if the preloaded search phrases should be used for matching. document_text -- a text from which to generate a document, or *None* if the preloaded documents should be used for matching.
Manager.topic_match_documents_against(self, text_to_match:str, *, use_frequency_factor:bool=True, maximum_activation_distance:int=75, word_embedding_match_threshold:float=0.8, initial_question_word_embedding_match_threshold:float=0.7, relation_score:int=300, reverse_only_relation_score:int=200, single_word_score:int=50, single_word_any_tag_score:int=20, initial_question_word_answer_score:int=600, initial_question_word_behaviour:str='process', different_match_cutoff_score:int=15, overlapping_relation_multiplier:float=1.5, embedding_penalty:float=0.6, ontology_penalty:float=0.9, relation_matching_frequency_threshold:float=0.25, embedding_matching_frequency_threshold:float=0.5, sideways_match_extent:int=100, only_one_result_per_document:bool=False, number_of_results:int=10, document_label_filter:str=None, tied_result_quotient:float=0.9) -> list[dict]:

Returns a list of dictionaries representing the results of a topic match between an entered text and the loaded documents.

Parameters:

text_to_match -- the text to match against the loaded documents.
use_frequency_factor -- *True* if scores should be multiplied by a factor between 0 and 1 expressing how rare the words matching each phraselet are in the corpus. Note that, even if set to *False*, the factors are still calculated as they are required for determining which relation and embedding matches should be attempted.
maximum_activation_distance -- the number of words it takes for a previous phraselet activation to reduce to zero when the library is reading through a document.
word_embedding_match_threshold -- the cosine similarity above which two words match where the search phrase word does not govern an interrogative pronoun.
initial_question_word_embedding_match_threshold -- the cosine similarity above which two words match where the search phrase word governs an interrogative pronoun.
relation_score -- the activation score added when a normal two-word relation is matched.
reverse_only_relation_score -- the activation score added when a two-word relation is matched using a search phrase that can only be reverse-matched.
single_word_score -- the activation score added when a normal single word is matched.
single_word_any_tag_score -- the activation score added when a single word is matched whose tag would not normally allow it to be matched by phraselets.
initial_question_word_answer_score -- the activation score added when a question word is matched to a potential answer phrase.
initial_question_word_behaviour -- 'process' if a question word in the sentence constituent at the beginning of *text_to_match* is to be matched to document phrases that answer it; 'exclusive' if only topic matches that involve such question words are to be permitted; 'ignore' if question words are to be ignored.
different_match_cutoff_score -- the activation threshold under which topic matches are separated from one another. Note that the default value will probably be too low if *use_frequency_factor* is set to *False*.
overlapping_relation_multiplier -- the value by which the activation score is multiplied when two relations were matched and the matches involved a common document word.
embedding_penalty -- a value between 0 and 1 with which scores are multiplied when the match involved an embedding. The result is additionally multiplied by the overall similarity measure of the match.
ontology_penalty -- a value between 0 and 1 with which scores are multiplied for each word match within a match that involved the ontology. For each such word match, the score is multiplied by the value (abs(depth) + 1) times, so that the penalty is higher for hyponyms and hypernyms than for synonyms and increases with the depth distance.
relation_matching_frequency_threshold -- the frequency threshold above which single word matches are used as the basis for attempting relation matches.
embedding_matching_frequency_threshold -- the frequency threshold above which single word matches are used as the basis for attempting relation matches with embedding-based matching on the second word.
sideways_match_extent -- the maximum number of words that may be incorporated into a topic match either side of the word where the activation peaked.
only_one_result_per_document -- if 'True', prevents multiple results from being returned for the same document.
number_of_results -- the number of topic match objects to return.
document_label_filter -- optionally, a string with which document labels must start to be considered for inclusion in the results.
tied_result_quotient -- the quotient between a result and following results above which the results are interpreted as tied.
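The arithmetic behind the ontology_penalty and embedding_penalty descriptions can be made concrete with a short worked sketch. The two helper functions are hypothetical and simply restate the parameter descriptions above; they are not taken from the Holmes source, and how Holmes combines these factors with the other scores internally is not shown.

```python
def apply_ontology_penalty(score, ontology_penalty=0.9, depth=0):
    """Multiply score by ontology_penalty (abs(depth) + 1) times.

    depth -- the depth distance of the ontology word match: 0 for a synonym,
        a non-zero value for a hyponym or hypernym.
    """
    return score * ontology_penalty ** (abs(depth) + 1)


def apply_embedding_penalty(score, embedding_penalty=0.6, overall_similarity_measure=1.0):
    """Multiply score by embedding_penalty and by the overall similarity measure."""
    return score * embedding_penalty * overall_similarity_measure


# With the default relation_score of 300 and ontology_penalty of 0.9:
# a synonym match (depth 0) is penalised once, a match at depth 2 three times.
synonym_score = apply_ontology_penalty(300, 0.9, 0)
deeper_score = apply_ontology_penalty(300, 0.9, 2)
print(round(synonym_score, 2), round(deeper_score, 2))
```

Because the exponent uses abs(depth), the penalty depends only on the depth distance and not on its sign.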
Manager.get_supervised_topic_training_basis(self, *, classification_ontology:Ontology=None, overlap_memory_size:int=10, oneshot:bool=True, match_all_words:bool=False, verbose:bool=True) -> SupervisedTopicTrainingBasis: Returns an object that is used to train and generate a model for the supervised document classification use case. Parameters: classification_ontology -- an Ontology object incorporating relationships between classification labels, or 'None' if no such ontology is to be used. overlap_memory_size -- how many non-word phraselet matches to the left should be checked for words in common with a current match. oneshot -- whether the same word or relationship matched multiple times within a single document should be counted once only (value 'True') or multiple times (value 'False') match_all_words -- whether all single words should be taken into account (value 'True') or only single words with noun tags (value 'False') verbose -- if 'True', information about training progress is outputted to the console.
Manager.deserialize_supervised_topic_classifier(self, serialized_model:str, verbose:bool=False) -> SupervisedTopicClassifier: Returns a classifier for the supervised document classification use case that will use a supplied pre-trained model. Parameters: serialized_model -- the pre-trained model as returned from `SupervisedTopicClassifier.serialize_model()`. verbose -- if 'True', information about matching is outputted to the console.
Manager.start_chatbot_mode_console(self) Starts a chatbot mode console enabling the matching of pre-registered search phrases to documents (chatbot entries) entered ad-hoc by the user.
Manager.start_structural_search_mode_console(self) Starts a structural extraction mode console enabling the matching of pre-registered documents to search phrases entered ad-hoc by the user.
Manager.start_topic_matching_search_mode_console(self, only_one_result_per_document:bool=False, word_embedding_match_threshold:float=0.8, initial_question_word_embedding_match_threshold:float=0.7): Starts a topic matching search mode console enabling the matching of pre-registered documents to query phrases entered ad-hoc by the user. Parameters: only_one_result_per_document -- if 'True', prevents multiple topic match results from being returned for the same document. word_embedding_match_threshold -- the cosine similarity above which two words match where the search phrase word does not govern an interrogative pronoun. initial_question_word_embedding_match_threshold -- the cosine similarity above which two words match where the search phrase word governs an interrogative pronoun.
Manager.close(self) -> None Terminates the worker processes.
manager.nlp is the underlying spaCy Language object on which both Coreferee and Holmes have been registered as custom pipeline components. The most efficient way of parsing documents for use with Holmes is to call
manager.nlp.pipe(). This yields an iterable of documents that can then be loaded into Holmes via
holmes_extractor.Ontology(self, ontology_path, owl_class_type='http://www.w3.org/2002/07/owl#Class', owl_individual_type='http://www.w3.org/2002/07/owl#NamedIndividual', owl_type_link='http://www.w3.org/1999/02/22-rdf-syntax-ns#type', owl_synonym_type='http://www.w3.org/2002/07/owl#equivalentClass', owl_hyponym_type='http://www.w3.org/2000/01/rdf-schema#subClassOf', symmetric_matching=False) Loads information from an existing ontology and manages ontology matching. The ontology must follow the W3C OWL 2 standard. Search phrase words are matched to hyponyms, synonyms and instances from within documents being searched. This class is designed for small ontologies that have been constructed by hand for specific use cases. Where the aim is to model a large number of semantic relationships, word embeddings are likely to offer better results. Matching is case-insensitive. Parameters: ontology_path -- the path from where the ontology is to be loaded, or a list of several such paths. See https://github.com/RDFLib/rdflib/. owl_class_type -- optionally overrides the OWL 2 URL for types. owl_individual_type -- optionally overrides the OWL 2 URL for individuals. owl_type_link -- optionally overrides the RDF URL for types. owl_synonym_type -- optionally overrides the OWL 2 URL for synonyms. owl_hyponym_type -- optionally overrides the RDF URL for hyponyms. symmetric_matching -- if 'True', means hypernym relationships are also taken into account.
Holder object for training documents and their classifications from which one or more SupervisedTopicModelTrainer objects can be derived. This class is NOT threadsafe.
SupervisedTopicTrainingBasis.parse_and_register_training_document(self, text, classification, label=None) Parses and registers a document to use for training. Parameters: text -- the document text classification -- the classification label label -- a label with which to identify the document in verbose training output, or 'None' if a random label should be assigned.
SupervisedTopicTrainingBasis.register_training_document(self, doc, classification, label=None) Registers a pre-parsed document to use for training. Parameters: doc -- the document classification -- the classification label label -- a label with which to identify the document in verbose training output, or 'None' if a random label should be assigned.
SupervisedTopicTrainingBasis.register_additional_classification_label(self, label) Register an additional classification label which no training document possesses explicitly but that should be assigned to documents whose explicit labels are related to the additional classification label via the classification ontology.
SupervisedTopicTrainingBasis.prepare() Matches the phraselets derived from the training documents against the training documents to generate frequencies that also include combined labels, and examines the explicit classification labels, the additional classification labels and the classification ontology to derive classification implications. Once this method has been called, the instance no longer accepts new training documents or additional classification labels.
SupervisedTopicTrainingBasis.train(self, *, minimum_occurrences=4, cv_threshold=1.0, mlp_activation='relu', mlp_solver='adam', mlp_learning_rate='constant', mlp_learning_rate_init=0.001, mlp_max_iter=200, mlp_shuffle=True, mlp_random_state=42, hidden_layer_sizes=None): Trains a model based on the prepared state. Parameters: minimum_occurrences -- the minimum number of times a word or relationship has to occur in the context of the same classification for the phraselet to be accepted into the final model. cv_threshold -- the minimum coefficient of variation with which a word or relationship has to occur across the explicit classification labels for the phraselet to be accepted into the final model. mlp_* -- see https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html. hidden_layer_sizes -- a list where each entry is the size of a hidden layer, or 'None' if the topology should be determined automatically.
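The effect of `minimum_occurrences` and `cv_threshold` can be sketched as a simple filter. This is an illustration rather than the Holmes implementation: the function name `keep_phraselet` is hypothetical, and two details are assumptions, namely that the occurrence test is applied to the best-represented classification and that the population standard deviation is used.

```python
from statistics import mean, pstdev
from typing import List


def keep_phraselet(occurrences: List[int],
                   minimum_occurrences: int = 4,
                   cv_threshold: float = 1.0) -> bool:
    """Decide whether a phraselet is retained.

    occurrences -- how often the phraselet matched within each explicit
                   classification label, one entry per label.
    """
    # Assumption: the phraselet must reach the minimum within at least one
    # single classification.
    if max(occurrences) < minimum_occurrences:
        return False
    average = mean(occurrences)
    if average == 0:
        return False
    # Coefficient of variation: standard deviation divided by arithmetic mean.
    # Assumption: population rather than sample standard deviation.
    coefficient_of_variation = pstdev(occurrences) / average
    return coefficient_of_variation >= cv_threshold
```

A phraselet that occurs ten times under one label and never under the others is kept, while one that is spread evenly across all labels carries no discriminating information and is dropped.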
Worker object used to train and generate models. This object could be removed from the public interface
(SupervisedTopicTrainingBasis.train() could return a
SupervisedTopicClassifier directly) but has
been retained to facilitate testability.
SupervisedTopicModelTrainer.classifier() Returns a supervised topic classifier which contains no explicit references to the training data and that can be serialized.
SupervisedTopicClassifier.parse_and_classify(self, text) Returns a list containing zero, one or many document classifications. Where more than one classification is returned, the labels are ordered by decreasing probability. Parameters: text -- the text to parse and classify.
SupervisedTopicClassifier.classify(self, doc) Returns a list containing zero, one or many document classifications. Where more than one classification is returned, the labels are ordered by decreasing probability. Parameters: doc -- the pre-parsed document to classify.
SupervisedTopicClassifier.serialize_model(self) -> str Returns a serialized model that can be reloaded using *Manager.deserialize_supervised_topic_classifier()*
A text-only representation of a match between a search phrase and a document. The indexes refer to tokens. Properties: search_phrase_label -- the label of the search phrase. search_phrase_text -- the text of the search phrase. document -- the label of the document. index_within_document -- the index of the match within the document. sentences_within_document -- the raw text of the sentences within the document that matched. negated -- 'True' if this match is negated. uncertain -- 'True' if this match is uncertain. involves_coreference -- 'True' if this match was found using coreference resolution. overall_similarity_measure -- the overall similarity of the match, or '1.0' if embedding-based matching was not involved in the match. word_matches -- an array of dictionaries with the properties: search_phrase_token_index -- the index of the token that matched from the search phrase. search_phrase_word -- the string that matched from the search phrase. document_token_index -- the index of the token that matched within the document. first_document_token_index -- the index of the first token that matched within the document. Identical to 'document_token_index' except where the match involves a multiword phrase. last_document_token_index -- the index of the last token that matched within the document (NOT one more than that index). Identical to 'document_token_index' except where the match involves a multiword phrase. structurally_matched_document_token_index -- the index of the token within the document that structurally matched the search phrase token. Is either the same as 'document_token_index' or is linked to 'document_token_index' within a coreference chain. document_subword_index -- the index of the token subword that matched within the document, or 'None' if matching was not with a subword but with an entire token. 
document_subword_containing_token_index -- the index of the document token that contained the subword that matched, which may be different from 'document_token_index' in situations where a word containing multiple subwords is split by hyphenation and a subword whose sense contributes to a word is not overtly realised within that word. document_word -- the string that matched from the document. document_phrase -- the phrase headed by the word that matched from the document. match_type -- 'direct', 'derivation', 'entity', 'embedding', 'ontology' or 'entity_embedding'. negated -- 'True' if this word match is negated. uncertain -- 'True' if this word match is uncertain. similarity_measure -- for types 'embedding' and 'entity_embedding', the similarity between the two tokens, otherwise '1.0'. involves_coreference -- 'True' if the word was matched using coreference resolution. extracted_word -- within the coreference chain, the most specific term that corresponded to the document_word. depth -- the number of hyponym relationships linking 'search_phrase_word' and 'extracted_word', or '0' if ontology-based matching is not active. Can be negative if symmetric matching is active. explanation -- creates a human-readable explanation of the word match from the perspective of the document word (e.g. to be used as a tooltip over it).
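Because `last_document_token_index` is inclusive, it has to be incremented by one before being used as the end of a Python slice. The following sketch shows this for a single entry of `word_matches`; the helper name `matched_document_text` and the plain list of token strings are assumptions made for the example.

```python
from typing import Dict, List


def matched_document_text(tokens: List[str], word_match: Dict[str, int]) -> str:
    """Return the document tokens covered by one word match.

    tokens     -- the document as a list of token strings (hypothetical input).
    word_match -- a dictionary with the index properties described above.
    """
    first = word_match["first_document_token_index"]
    last = word_match["last_document_token_index"]
    # The last index is inclusive, hence the '+ 1' for the slice end.
    return " ".join(tokens[first:last + 1])
```

For a single-token match both indexes equal `document_token_index`, so the slice contains exactly one token.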
A text-only representation of a topic match between a search text and a document. Properties: document_label -- the label of the document. text -- the document text that was matched. text_to_match -- the search text. rank -- a string representation of the scoring rank which can have the form '2=' in case of a tie. index_within_document -- the index of the document token where the activation peaked. subword_index -- the index of the subword within the document token where the activation peaked, or 'None' if the activation did not peak at a specific subword. start_index -- the index of the first document token in the topic match. end_index -- the index of the last document token in the topic match (NOT one more than that index). sentences_start_index -- the token start index within the document of the sentence that contains 'start_index' sentences_end_index -- the token end index within the document of the sentence that contains 'end_index' (NOT one more than that index). sentences_character_start_index_in_document -- the character index of the first character of 'text' within the document. sentences_character_end_index_in_document -- one more than the character index of the last character of 'text' within the document. score -- the score word_infos -- an array of arrays with the semantics:  -- 'relative_start_index' -- the index of the first character in the word relative to 'sentences_character_start_index_in_document'.  -- 'relative_end_index' -- one more than the index of the last character in the word relative to 'sentences_character_start_index_in_document'.  -- 'type' -- 'single' for a single-word match, 'relation' if within a relation match involving two words, 'overlapping_relation' if within a relation match involving three or more words.  -- 'is_highest_activation' -- 'True' if this was the word at which the highest activation score reported in 'score' was achieved, otherwise 'False'.  
-- 'explanation' -- a human-readable explanation of the word match from the perspective of the document word (e.g. to be used as a tooltip over it). answers -- an array of arrays with the semantics:  -- the index of the first character of a potential answer to an initial question word.  -- one more than the index of the last character of a potential answer to an initial question word.
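The relative character indexes in `word_infos` can be applied directly to the `text` property, for example to mark up matched words for display. The sketch below assumes that each entry is a five-element array in the order listed above and that entries do not overlap; the function name `highlight` and the asterisk markers are choices made for this example.

```python
from typing import List, Sequence


def highlight(text: str, word_infos: List[Sequence]) -> str:
    """Wrap each matched word in '*' and the peak word in '**'.

    Each entry of word_infos is assumed to have the form
    [relative_start_index, relative_end_index, type, is_highest_activation,
    explanation], with non-overlapping character ranges.
    """
    pieces, cursor = [], 0
    for start, end, _type, is_highest, _explanation in sorted(
            word_infos, key=lambda info: info[0]):
        pieces.append(text[cursor:start])  # unmatched text before the word
        marker = "**" if is_highest else "*"
        # relative_end_index is one more than the last character index, so it
        # can be used directly as a slice end.
        pieces.append(f"{marker}{text[start:end]}{marker}")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

In a user interface the `explanation` element could additionally be attached to each marked word as a tooltip.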
Holmes encompasses several concepts that build on work that the author, Richard Paul Hudson, carried out as a young graduate and for which his former employer, Definiens, has since been granted a U.S. patent. Definiens has kindly permitted the author to publish Holmes under the GNU General Public License ("GPL"). As long as you abide by the terms of the GPL, this means you can use the library without worrying about the patent, even if your activities take place in the United States of America.
The GPL is often misunderstood to be a license for non-commercial use. In reality, it certainly does permit commercial use as well in various scenarios, especially if you are building bespoke software in an enterprise context: consult the very comprehensive GPL FAQ to determine whether it is suitable for your needs.
If you wish to use Holmes in a way that is not permitted by the GPL, please get in touch with the author so that we can try to find a solution. Any such solution will need to involve Definiens as well if whatever you are proposing involves the USA in any way.
The word-level matching and the high-level operation of structural
matching between search-phrase and document subgraphs both work more or
less as one would expect. What is perhaps more in need of further
comment is the semantic analysis code subsumed in the parsing.py
script as well as in the
language_specific_rules.py script for each language.
SemanticAnalyzer is an abstract class that is subclassed for each
language: at present by
GermanSemanticAnalyzer. These classes contain most of the semantic analysis code.
SemanticMatchingHelper is a second abstract class, again with a concrete
implementation for each language, that contains semantic analysis code
that is required at matching time. Moving this out to a separate class family
was necessary because, on operating systems that spawn processes rather
than forking processes (e.g. Windows), SemanticMatchingHelper instances
have to be serialized when the worker processes are created: this would
not be possible for
SemanticAnalyzer instances because not all
spaCy models are serializable, and would also unnecessarily consume
large amounts of memory.
At present, all functionality that is common to the two languages is realised in the two abstract parent classes. Especially because English and German are closely related languages, it is probable that functionality will need to be moved from the abstract parent classes to specific implementing child classes if and when new semantic analyzers are added for new languages.
The HolmesDictionary class is defined as a spaCy extension
that is accessed using the syntax
token._.holmes. The most important
information in the dictionary is a list of semantic dependencies.
These are derived from the dependency relationships in the spaCy output
(token.dep_) but go through a considerable amount of processing to
make them 'less syntactic' and 'more semantic'.
Some new semantic dependency labels that do not occur in spaCy outputs
as values of
token.dep_ are added for Holmes semantic dependencies.
It is important to understand that Holmes semantic dependencies are used
exclusively for matching and are therefore neither intended nor required
to form a coherent set of linguistic theoretical entities or relationships;
whatever works best for matching is assigned on an ad-hoc basis.
For each language, the
match_implication_dict dictionary maps search-phrase semantic dependencies
to matching document semantic dependencies and is responsible for the asymmetry of matching
between search phrases and documents.
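The asymmetry can be pictured with a small sketch. The dependency labels below are invented for illustration and are not taken from the Holmes dictionaries, and `MATCH_IMPLICATIONS` and `dependency_matches` are hypothetical names.

```python
from typing import Dict, Set

# Hypothetical mapping: for each search-phrase dependency label, the set of
# document dependency labels that it is allowed to match.
MATCH_IMPLICATIONS: Dict[str, Set[str]] = {
    "general_argument": {"general_argument", "subject", "object"},
    "subject": {"subject"},
    "object": {"object"},
}


def dependency_matches(search_phrase_dep: str, document_dep: str) -> bool:
    """Return True if a dependency with label 'search_phrase_dep' in a search
    phrase may match a dependency with label 'document_dep' in a document."""
    return document_dep in MATCH_IMPLICATIONS.get(search_phrase_dep, set())
```

Because the lookup always starts from the search-phrase label, a general label in a search phrase can match a more specific label in a document without the reverse being true.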
Topic matching involves the following steps:
SemanticMatchingHelper.topic_matching_phraselet_stop_lemmas), which are consistently ignored throughout the whole process.
SemanticMatchingHelper.topic_matching_reverse_only_parent_lemmas) or when the frequency factor for the parent word is below the threshold for relation matching (
relation_matching_frequency_threshold, default: 0.25). These measures are necessary because matching on e.g. a parent preposition would lead to a large number of potential matches that would take a lot of resources to investigate: it is better to start investigation from the less frequent word within a given relation.
relation_matching_frequency_threshold, default: 0.25).
embedding_matching_frequency_threshold, default: 0.5), matching at all of those words where the relation template has not already been matched is retried using embeddings at the other word within the relation. A pair of words is then regarded as matching when their mutual cosine similarity is above
initial_question_word_embedding_match_threshold (default: 0.7) in situations where the document word has an initial question word in its phrase or
word_embedding_match_threshold (default: 0.8) in all other situations.
If use_frequency_factor is set to True (the default), each score is scaled by the frequency factor of its phraselet, meaning that words that occur less frequently in the corpus give rise to higher scores.
maximum_activation_distance; default: 75) as each new word is read.
single_word_score; default: 50), another type of single-word phraselet or a noun phraselet that matched a subword (
single_word_any_tag_score; default: 20), a relation phraselet produced by a reverse-only template (
reverse_only_relation_score; default: 200), any other (normally matched) relation phraselet (
relation_score; default: 300), or a relation phraselet involving an initial question word (
initial_question_word_answer_score; default: 600).
embedding_penalty; default: 0.6).
ontology_penalty; default: 0.9) once more often than the difference in depth between the two ontology entries, i.e. once for a synonym, twice for a child, three times for a grandchild and so on.
overlapping_relation_multiplier; default: 1.5).
sideways_match_extent; default: 100 words) within which the activation score is higher than the
different_match_cutoff_score (default: 15) are regarded as belonging to a contiguous passage around the peak that is then returned as a TopicMatch object. (Note that this default will almost certainly turn out to be too low if use_frequency_factor is set to False.) A word whose activation equals the threshold exactly is included at the beginning of the area as long as the next word where activation increases has a score above the threshold. If the topic match peak is below the threshold, the topic match will only consist of the peak word.
If initial_question_word_behaviour is set to process (the default) or to exclusive, where a document word has matched an initial question word from the query phrase, the subtree of the matched document word is identified as a potential answer to the question and added to the dictionary to be returned. If initial_question_word_behaviour is set to exclusive, any topic matches that do not contain answers to initial question words are discarded.
only_one_result_per_document = True prevents more than one result from being returned from the same document; only the result from each document with the highest score will then be returned.
tied_result_quotient (default: 0.9) are labelled as tied.
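How the default scores, penalties and multipliers named in the steps above might interact can be sketched as follows. This is not the Holmes scoring code: the function `phraselet_match_score` and its parameters are hypothetical, and it is an assumption that all factors are simply multiplied together. The exponent for the ontology penalty follows the rule that the penalty is applied once for a synonym (depth 0), twice for a child (depth 1) and so on.

```python
from typing import Optional

# Default values quoted in the steps above.
BASE_SCORES = {
    "single_word": 50.0,
    "single_word_any_tag": 20.0,
    "reverse_only_relation": 200.0,
    "relation": 300.0,
    "initial_question_word_answer": 600.0,
}
EMBEDDING_PENALTY = 0.6
ONTOLOGY_PENALTY = 0.9
OVERLAPPING_RELATION_MULTIPLIER = 1.5


def phraselet_match_score(kind: str, *,
                          frequency_factor: float = 1.0,
                          embedding_based: bool = False,
                          ontology_depth: Optional[int] = None,
                          overlapping_relation: bool = False) -> float:
    """Hypothetical combination of the scoring factors for one phraselet match.

    ontology_depth -- None if the match was not ontology-based, otherwise the
                      difference in depth between the two ontology entries.
    """
    score = BASE_SCORES[kind] * frequency_factor
    if embedding_based:
        score *= EMBEDDING_PENALTY
    if ontology_depth is not None:
        # Applied once more often than the depth difference.
        score *= ONTOLOGY_PENALTY ** (abs(ontology_depth) + 1)
    if overlapping_relation:
        score *= OVERLAPPING_RELATION_MULTIPLIER
    return score
```

Under these assumptions a normally matched relation scores 300, the same relation matched via an ontology child scores 300 × 0.9 × 0.9 = 243, and an embedding-based single-word match scores 50 × 0.6 = 30.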
The supervised document classification use case relies on the same phraselets as the
topic matching use case, although reverse-only templates are ignored and
a different set of stop words is used (
Classifiers are built and trained as follows:
oneshot; whether single-word phraselets are generated for all words with their own meaning or only for those such words whose part-of-speech tags match the single-word phraselet template specification (essentially: noun phraselets) depends on the value of
match_all_words. Wherever two phraselet matches overlap, a combined match is recorded. Combined matches are treated in the same way as other phraselet matches in further processing. This means that effectively the algorithm picks up one-word, two-word and three-word semantic combinations. See here for a discussion of the performance of this step.
minimum_occurrences; default: 4) or where the coefficient of variation (the standard deviation divided by the arithmetic mean) of the occurrences across the categories is below a threshold (
cv_threshold; default: 1.0).
oneshot==False respectively). The outputs are the category labels, including any additional labels determined via a classification ontology. By default, the multilayer perceptron has three hidden layers where the first hidden layer has the same number of neurons as the input layer and the second and third layers have sizes in between the input and the output layer with an equally sized step between each size; the user is however free to specify any other topology.
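The default topology described in the last step can be worked through in a few lines. The function name `default_hidden_layer_sizes` is hypothetical, and truncating fractional sizes with `int()` is an assumption about how non-integer steps are handled.

```python
from typing import List


def default_hidden_layer_sizes(input_size: int, output_size: int) -> List[int]:
    """Return three hidden layer sizes: the first equals the input layer size,
    and each following layer moves one equal step towards the output size."""
    # Four layers (three hidden plus output) are separated by three equal steps.
    step = (output_size - input_size) / 3
    return [
        input_size,
        int(input_size + step),       # assumption: fractions are truncated
        int(input_size + 2 * step),
    ]
```

With 100 input neurons and 10 category labels this yields hidden layers of 100, 70 and 40 neurons. A list of this form is what the `hidden_layer_sizes` parameter of `train()` accepts when a different topology is preferred.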
Holmes code adheres broadly to the PEP-8 standard. Because of the complexity of some of the code, Holmes uses a 100-character rather than an 80-character line width, which PEP-8 permits as an option.
The complexity of what Holmes does makes development impossible without
a robust set of over 1350 regression tests. These can be executed individually
with unittest or all at once by running the
pytest utility from the Holmes
source code root directory. (Note that the Python 3 command on Linux
pytest variant will only work on machines with sufficient memory resources. To
reduce this problem, the tests are distributed across three subdirectories, so that
pytest can be run three times, once from each subdirectory:
New languages can be added to Holmes by subclassing the
SemanticMatchingHelper classes as explained
The sets of matching semantic dependencies captured in the
_matching_dep_dict dictionary for each language have been obtained on
the basis of a mixture of linguistic-theoretical expectations and trial
and error. The results would probably be improved if the
dictionaries could be derived using machine learning instead; as yet this has not been
attempted because of the lack of appropriate training data.
An attempt should be made to remove personal data from supervised document classification models to make them more compliant with data protection laws.
In cases where embedding-based matching is not active, the second step of the supervised document classification procedure repeats a considerable amount of processing from the first step. Retaining the relevant information from the first step of the procedure would greatly improve training performance. This has not been attempted up to now because a large number of tests would be required to prove that such performance improvements did not have any inadvertent impacts on functionality.
The topic matching and supervised document classification use cases are both configured with a number of hyperparameters that are presently set to best-guess values derived on a purely theoretical basis. Results could be further improved by testing the use cases with a variety of hyperparameters to learn the optimal values.
The initial open-source version.
pobjp linking parents of prepositions directly with their children.
MultiprocessingManager object as its facade.
MultiprocessingManager classes into a single
Manager class, with a redesigned public interface, that uses worker threads for everything except supervised document classification.