1. Introduction to CLTK¶

Dr. W.J.B. Mattingly Smithsonian Data Science Lab and CLTK April 2022

1.1. Covered in this Chapter¶

1.2. About the Author¶

I am Dr. William Mattingly. I hold a PhD in Medieval History from the University of Kentucky where I explored early medieval social networks. A lot of my research was aided by my ability to code, specifically in Python, for data cleaning and analysis. I was even able to use Python to plot and visualize social networks and data. During the fourth year of my PhD, I used Python to create an app that I could use to plot and analyze these social networks. Currently, I am a Postdoctoral Fellow for the Analysis of Historical Documents at the Smithsonian Institution with a joint appointment at the United States Holocaust Memorial Museum. In both institutions, I use Python, machine learning, and natural language processing (NLP) to analyze historical texts in large quantities to generate new insights about the documents held in the archives. In all, I have nearly a decade of experience using Python as a historian.

When I first started to explore Python, there were not many available tutorials geared towards humanists and, for that reason, four years ago I started PythonHumanities.com and Python Tutorials for Digital Humanities on YouTube. I geared these resources to humanists who had no prior knowledge about computing or coding. This new JupyterBook is the third iteration of this textbook that brings a lot of the material that first appeared on PythonHumanities.com years ago into a new, more accessible JupyterBook. It will forever remain free to all as will the video lectures embedded in this book.

1.3. About this Textbook¶

Because this textbook is not peer-reviewed, typos may remain or errors may exist. I openly and freely admit to this. This textbook is community-inspired and I would like it to be community-supported. If you see a mistake, you can click thee GitHub logo inn the top right corner of the screen to submit a pull request or make a note for edits. I highly encourage this and I am open to and welcome any criticism to improve this textbook for all.

1.4. What is CLTK?¶

The Classical Language Toolkit (CLTK) is a Python library for performing natural langauge processing (NLP) on ancient and medieval languages. It was created by Kyle P. Johnson in 2014 to address the lack of resources for doing NLP on pre-modern languages. These languages suffer from several notable drawbacks:

limited texts
no living native speakers
inconsistent encoding
limited funding

All of these factors make pre-modern languages low-resource languages. The CLTK library resolves many of these issues by providing the data (texts), models (stanza and spaCy) and pipelines (CLTK) for processing pre-modern texts.

1.5. Why use CLTK?¶

Other available NLP libraries in Python, such as spaCy and Stanza, are powerful tools. They are not, however, necessarily equipped to handle some of the challenges pre-modern languages face. SpaCy cannot, for example, work with Arabic due to its right-to-left style. The CLTK was designed by researchers who are intimately familiar with the challenges of the languages they study. It was built from the ground up to address those specific challenges.

The chief reason to use the CLTK is, therefore, so that you do not need to reinvent the wheel. Rather than coding solutions to problems you are seeing with Old English, leverage the prebuilt pipelines CLTK offers. As you become familiar with the CLTK, you will undoubtedly find things you wish to improve. As you do, you can code your own solutions and make a formal request for CLTK to adopt those changes via GitHub.

1.6. Installation¶

On Linux and Mac, you can install CLTK with “pip install cltk”. If you are on Windows, however, you will encounter two potential errors:

Error “Microsoft Visual C++ 14.0 is required (at installation)
Stanza KeyError: ‘feat_dropout’ (when first trying to use a CLTK model)

1.6.1. Error “Microsoft Visual C++ 14.0 is required¶

The address the ‘Error “Microsoft Visual C++ 14.0 is required’ error at install, you must first download and install Microsoft Visual Studio. You can find it here. (modified slightly from source: https://stackoverflow.com/questions/29846087/error-microsoft-visual-c-14-0-is-required-unable-to-find-vcvarsall-bat)

Go to Build Tools for Visual Studio 2022
Select free download under Visual Studio Community 2017. This will download the installer. Run the installer.
Select what you need under workload tab:
- Under Windows, there are three choices. Only check Desktop development with C++.

If you have installed everything correctly, you should now be able to install CLTK. If you are stilling having issues, please submit an issue on GitHub for this textbook and I will update it accordingly.

1.6.2. Stanza KeyError: ‘feat_dropout’ (when first trying to use a CLTK model)¶

This error occurs when you try and run a CLTK pipeline over a text. It occurs because you have old stanza models on your computer. You will need to have the newer Stanza v. 1.3 models installed. To do this, you will first need to find where they are located. On Windows, this will be in your username under the foldere “stanza_resources”. Simply delete this folder and rerun your code. You will notice that CLTK will redownload the Stanza models.

1.7. Citation of the CLTK¶

When using the CLTK, you should cite the following publication, including the DOI:

Johnson, Kyle P., Patrick J. Burns, John Stewart, Todd Cook, Clément Besnier, and William J. B. Mattingly. “The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages.” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 20-29. 2021. 10.18653/v1/2021.acl-demo.3

1.8. Conclusion¶

This chapter has introduced you generally to the CLTK library and given you instructions on how to install it. In the next chapter, we will import and begin using the CLTK library to analyze a text of Livy, a Roman author.

Introduction to CLTK