2. Getting Started with CLTK
2.3. Getting Text Data
While the CLTK allows you to access large corpora for each of its languages, in this portion of the textbook we will work with local data to get familiar with the basics. A lot of what follows can also be found in a notebook on the CLTK GitHub repo. Inside this textbook’s repository, we already have a collection of texts available. Let’s take a look at a sample from Livy.
First, we need to load the textual data into a single string object.
with open("texts/lat-livy.txt", encoding="utf-8") as f:  # plain-text file, assumed UTF-8
    livy_full = f.read()
Excellent! Now, let’s analyze this just a bit so we can get a sense of our data.
print("Text snippet:", livy_full[:200])
print("Character count:", len(livy_full))
print("Approximate token count:", len(livy_full.split()))
Text snippet: Iam primum omnium satis constat Troia capta in ceteros saevitum esse Troianos, duobus, Aeneae Antenorique, et vetusti iure hospitii et quia pacis reddendaeque Helenae semper auctores fuerant, omne ius
Character count: 921462
Approximate token count: 129799
We use the term “approximate token count” here because a token is any unit that carries syntactic meaning in a text: not just words, but punctuation marks as well. The count is approximate because Python’s split() method separates a string on whitespace (by default). In other words, the true token count is noticeably higher, since split() leaves each punctuation mark attached to the word next to it.
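A quick illustration of the difference: split() treats a word and its trailing punctuation as a single item.
sample = "Troia capta, duobus."
# split() keeps punctuation attached to the neighboring word
print(sample.split())
['Troia', 'capta,', 'duobus.']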
2.4. Calling a CLTK Pipeline
The CLTK is specifically designed as a natural language processing pipeline for ancient and medieval languages. To leverage the power of the library, we first need to import the NLP class from the CLTK library.
from cltk import NLP
If the above code worked without error, then it means we have successfully imported the NLP class from CLTK. This allows us to now create a CLTK NLP pipeline. In order to do that, however, we need to know the language of the document that we are examining. Since Livy is a Latin author, the language code will be “lat”.
# Load the default Pipeline for Latin
cltk_nlp = NLP(language="lat")
𐤀 CLTK version '1.0.25'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinNERProcess`, `LatinLexiconProcess`.
By convention, an NLP object is named either “nlp” (the spaCy convention) or cltk_nlp (the CLTK convention). One reason for distinguishing between the two is that you may have two separate NLP pipelines in your workflow; it may help, therefore, to specify which nlp object is your CLTK pipeline. You can name this object whatever you like, but it is best to stick to these conventions, as they will make your code easier to understand.
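For example, if your workflow mixed Latin with Ancient Greek (ISO code “grc”), you might keep two clearly named pipelines side by side. A minimal sketch, assuming you also want the default Greek pipeline and its models:
from cltk import NLP

cltk_nlp_lat = NLP(language="lat")  # Latin pipeline
cltk_nlp_grc = NLP(language="grc")  # Ancient Greek pipeline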
The output from this cell provides key information about our pipeline. It includes the following pipes, that is, processes run on the input data (the text):
LatinNormalizeProcess
LatinStanzaProcess
LatinEmbeddingsProcess
StopsProcess
LatinNERProcess
LatinLexiconProcess
We will cover each of these in depth later in this notebook. For now, simply understand that as your text moves through the NLP class, it moves through a pipeline of different processes. The sequence here is important, as some processes rely on the output of earlier pipes.
If we wish to remove a pipe from the pipeline, we can use cltk_nlp.pipeline.processes.pop() and pass in the index of the pipe we wish to remove. Let’s see what this looks like in practice.
cltk_nlp.pipeline.processes.pop(-1)
print(cltk_nlp.pipeline.processes)
[<class 'cltk.alphabet.processes.LatinNormalizeProcess'>, <class 'cltk.dependency.processes.LatinStanzaProcess'>, <class 'cltk.embeddings.processes.LatinEmbeddingsProcess'>, <class 'cltk.stops.processes.StopsProcess'>, <class 'cltk.ner.processes.LatinNERProcess'>]
By using pop with index -1, we remove the final pipe. One reason for doing this may be speed: the LatinLexiconProcess is one of the more time-consuming pipes in the pipeline and may not be necessary for a workflow that only needs, say, the NER pipe.
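Because pipeline.processes is an ordinary Python list of process classes (as the output above shows), you can also drop a pipe by name rather than by position. A small sketch, assuming you first rebuild the full default pipeline:
cltk_nlp = NLP(language="lat")  # rebuild the full default pipeline
# Keep every process except the (slow) LatinLexiconProcess
cltk_nlp.pipeline.processes = [p for p in cltk_nlp.pipeline.processes
                               if p.__name__ != "LatinLexiconProcess"]
print(cltk_nlp.pipeline.processes)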
2.5. The CLTK Doc Object
Now that we have our pipeline assembled, let’s try to analyze a text. To do this, we need to create a CLTK Doc object. If you have worked with spaCy or other NLP libraries, this pattern should be familiar. The Doc object is a special class that holds data about the text. Before we can examine the Doc object, however, we need to create it; and before that, let’s shorten our Livy text to keep processing time manageable.
livy = livy_full[:len(livy_full) // 12]
print("Approximate token count:", len(livy.split()))
Approximate token count: 10905
Now that we have shortened Livy, let’s create the CLTK Doc object. To do this, we take our CLTK NLP object, call its analyze method, and pass in one argument: the text, livy. If this is your first time running it, you may be prompted to download the Stanza models. Type “Y” to download them.
cltk_doc = cltk_nlp.analyze(text=livy)
CLTK message: This part of the CLTK depends upon the Stanza NLP library.
CLTK message: Allow download of Stanza models to ``C:\Users\wma22/stanza_resources/la/tokenize/ittb.pt``? [Y/n]
y
2022-04-20 08:53:02 INFO: Downloading these customized packages for language: la (Latin)...
=======================
| Processor | Package |
-----------------------
| tokenize | ittb |
| pos | ittb |
| lemma | ittb |
| depparse | ittb |
| pretrain | ittb |
=======================
2022-04-20 08:53:59 INFO: Finished downloading models and saved to C:\Users\wma22\stanza_resources.
This part of the CLTK depends upon models from the CLTK project.
Do you want to download 'https://github.com/cltk/lat_models_cltk' to '~/cltk_data/lat'? [Y/n]
y
Now that all the models are downloaded, our pipeline should have completed its processing on the text. Let’s start examining the Doc object a bit more closely, beginning with what type of object it is.
print(type(cltk_doc))
<class 'cltk.core.data_types.Doc'>
Notice that it is a special class belonging to the CLTK, specifically a Doc object.
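Attributes of the Doc can be read directly, like the accessors we will meet below. For instance, the language the Doc was analyzed with should echo the ISO code we passed to NLP:
print(cltk_doc.language)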
2.6. Doc Object Accessors
The Doc object contains what the CLTK calls “accessors”. If you are familiar with spaCy syntax, these function rather like spaCy attributes: each contains a specific piece of data. In some instances, this data is at the token level (e.g. tokens, lemmata, pos); in other cases, at the sentence level (e.g. sentences, sentences_strings, sentences_tokens). This allows you to parse the Doc object in several different ways. Let’s take a look at all the accessors available to us from the Latin pipeline.
accessors = [x for x in dir(cltk_doc) if not x.startswith("__")]
for a in accessors:
    print(a)
_get_words_attribute
embeddings
embeddings_model
language
lemmata
morphosyntactic_features
normalized_text
pipeline
pos
raw
sentence_embeddings
sentences
sentences_strings
sentences_tokens
stanza_doc
stems
tokens
tokens_stops_filtered
words
Let’s now examine some of these a bit more closely. Each has its own header, so you can use the navigation in the textbook (on the right of the screen) to jump around more easily.
2.6.1. Raw
The raw accessor is no different from the plain text object that we passed to the pipeline. Its index, therefore, functions just as the input text does. Let’s take a look.
print(cltk_doc.raw[:20])
Iam primum omnium sa
2.6.2. Tokens
The tokens accessor, however, is fundamentally different. It contains all the tokens of the text in sequence. Let’s take a look at the first 20.
print(cltk_doc.tokens[:20])
['Iam', 'primum', 'omnium', 'satis', 'constat', 'Troia', 'capta', 'in', 'ceteros', 'saevitum', 'esse', 'Troianos', ',', 'duobus', ',', 'Aeneae', 'Antenorique', ',', 'et', 'vetusti']
Notice how not only words but also punctuation marks are separated out in the output above. This is part of what makes processing a text this way so powerful: we can analyze a text at the token level.
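We can also make the earlier “approximate token count” precise by comparing it with the pipeline’s tokenization. The CLTK count will be higher, since each punctuation mark now counts as its own token.
print("Approximate (whitespace) count:", len(livy.split()))
print("CLTK token count:", len(cltk_doc.tokens))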
2.6.3. Lemmata
Like the tokens accessor, the lemmata accessor also functions at the token level. Unlike the tokens accessor, however, lemmata contains the lemma (dictionary) form of each token. Note that “capta” above is now replaced with its lemma, “capio”.
print(cltk_doc.lemmata[:20])
['Iam', 'primus', 'omnis', 'satis', 'consto', 'Troia', 'capio', 'in', 'ceterus', 'saevitum', 'sum', 'Troianos', ',', 'duo', ',', 'Aeneae', 'Antenorique', ',', 'et', 'vetusti']
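Since tokens and lemmata are parallel lists, we can pair them with zip() to inspect the lemmatization token by token:
# Pair each token with its lemma
for token, lemma in zip(cltk_doc.tokens[:7], cltk_doc.lemmata[:7]):
    print(token, "->", lemma)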
2.6.4. POS
The pos accessor also functions at the token level. The abbreviation pos stands for part of speech, a term common across NLP libraries. It allows us to see what part of speech a given word is.
print(cltk_doc.pos[:20])
['ADV', 'ADJ', 'PRON', 'ADV', 'VERB', 'NOUN', 'VERB', 'ADP', 'PRON', 'VERB', 'AUX', 'NOUN', 'PUNCT', 'NUM', 'PUNCT', 'NOUN', 'VERB', 'PUNCT', 'CCONJ', 'NOUN']
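Because pos is likewise a flat list of tags, a Counter gives a quick grammatical profile of the text:
from collections import Counter

# Tally how often each part-of-speech tag occurs
pos_counts = Counter(cltk_doc.pos)
print(pos_counts.most_common(5))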
2.6.5. Words
The words accessor may seem on the surface to resemble the tokens accessor, but it is quite different: it contains all the metadata about each word. It functions rather like spaCy’s Token object. Let’s take a look at the seventh word in the text, “capta”.
print(cltk_doc.words[6])
Word(index_char_start=None, index_char_stop=None, index_token=6, index_sentence=0, string='capta', pos=verb, lemma='capio', stem=None, scansion=None, xpos='L2|modM|tem4|grp1|casA|gen2', upos='VERB', dependency_relation='acl', governor=5, features={Aspect: [perfective], Case: [nominative], Degree: [positive], Gender: [feminine], Number: [singular], Tense: [past], VerbForm: [participle], Voice: [passive]}, category={F: [neg], N: [neg], V: [pos]}, stop=False, named_entity=False, syllables=None, phonetic_transcription=None, definition=None)
Note that unlike the tokens accessor, the words accessor allows us to see all the metadata relevant to this individual word. We can access each of these attributes as well. Let’s say I was interested in knowing its part of speech. I can access that data like so:
print(cltk_doc.words[6].pos)
verb
Now we know it is a verb. What if we wanted to know its voice? We could access its features.
print(cltk_doc.words[6].features)
{Aspect: [perfective], Case: [nominative], Degree: [positive], Gender: [feminine], Number: [singular], Tense: [past], VerbForm: [participle], Voice: [passive]}
And from here we can navigate this dictionary to the “Voice” key.
print(cltk_doc.words[6].features["Voice"])
[passive]
And we can see that it is passive. We can access all these features equally easily.
print("Number:", cltk_doc.words[6].features["Number"])
print("Tense:", cltk_doc.words[6].features["Tense"])
print("VerbForm:", cltk_doc.words[6].features["VerbForm"])
print("Voice:", cltk_doc.words[6].features["Voice"])
Number: [singular]
Tense: [past]
VerbForm: [participle]
Voice: [passive]
The words accessor is one of the more powerful aspects of the CLTK pipeline. I encourage you to spend a bit of time exploring what it makes available with your own text.
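For example, combining the upos and stop attributes we saw in the Word output above lets us filter the text for particular kinds of words. A small sketch:
# Collect the lemmas of all verbs that are not stop words
verb_lemmas = [w.lemma for w in cltk_doc.words
               if w.upos == "VERB" and not w.stop]
print(verb_lemmas[:10])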
2.6.6. Sentence Tokens
Unlike the previous accessors, the sentences_tokens accessor allows us to analyze the Doc object at the sentence level. Parsing a text sentence-by-sentence is not easy with plain Python string methods: a split(".") approach separates the text at every ".", including periods that mark abbreviations. In Latin, as in English, abbreviations make that approach unreliable. The CLTK pipeline, however, allows us to parse ancient and medieval languages effectively at the sentence level.
print(cltk_doc.sentences_tokens[:2])
[['Iam', 'primum', 'omnium', 'satis', 'constat', 'Troia', 'capta', 'in', 'ceteros', 'saevitum', 'esse', 'Troianos', ',', 'duobus', ',', 'Aeneae', 'Antenorique', ',', 'et', 'vetusti', 'iure', 'hospitii', 'et', 'quia', 'pacis', 'reddendaeque', 'Helenae', 'semper', 'auctores', 'fuerant', ',', 'omne', 'ius', 'belli', 'Achiuos', 'abstinuisse', ';'], ['casibus', 'deinde', 'variis', 'Antenorem', 'cum', 'multitudine', 'Enetum', ',', 'qui', 'seditione', 'ex', 'Paphlagonia', 'pulsi', 'et', 'sedes', 'et', 'ducem', 'rege', 'Pylaemene', 'ad', 'Troiam', 'amisso', 'quaerebant', ',', 'venisse', 'in', 'intimum', 'maris', 'Hadriatici', 'sinum', ',', 'Euganeisque', 'qui', 'inter', 'mare', 'Alpesque', 'incolebant', 'pulsis', 'Enetos', 'Troianosque', 'eas', 'tenuisse', 'terras', '.']]
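To see why the naive split(".") approach fails, consider a made-up sentence containing the abbreviated praenomen “M.”; the related sentences_strings accessor, by contrast, returns each sentence as a single, correctly bounded string.
# A naive split breaks on the abbreviation "M."
text = "M. Tullius Cicero orationem habuit. Deinde abiit."
print(text.split("."))
['M', ' Tullius Cicero orationem habuit', ' Deinde abiit', '']

print(cltk_doc.sentences_strings[:1])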
2.7. Conclusion
This chapter has introduced you to the salient features of the CLTK NLP class: how to construct a pipeline and how to pass a text through it. In the next chapter, we will examine named entity recognition (NER) specifically.