Create_morpheus_TF_dataset

Create morpheus TF dataset

Introduction

This repository documents the creation of a new morphology-focused feature-set for the Nestle 1904 Greek New Testament Text-Fabric dataset (N1904-TF). The main goal is to add all possible morphological analyses to each Greek word, based on its textual form. To achieve this, the project uses the well-known Perseus Morpheus analyzer. The parses produced by Morpheus are ranked using a heuristic that compares them with existing Text-Fabric morphological features (such as case, number, and tense). The highest-ranked parse is the one that most closely matches the generally accepted interpretation of the word in its specific context.
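The ranking step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual heuristic: the function names `score_parse` and `rank_parses`, the dictionary layout, and the feature values are all assumptions made for the example. It simply counts how many known Text-Fabric features a Morpheus parse agrees with and sorts the parses by that count.

```python
from typing import Dict, List

Parse = Dict[str, str]


def score_parse(parse: Parse, reference: Parse) -> int:
    """Count the reference features (e.g. case, number, tense) the parse agrees with.

    Hypothetical scoring: one point per matching feature, empty reference
    values are ignored.
    """
    score = 0
    for feature, expected in reference.items():
        if expected and parse.get(feature) == expected:
            score += 1
    return score


def rank_parses(parses: List[Parse], reference: Parse) -> List[Parse]:
    """Return the parses ordered from best to worst match.

    sorted() is stable, so parses with equal scores keep their input order.
    """
    return sorted(parses, key=lambda p: score_parse(p, reference), reverse=True)


# Illustrative data only: two candidate parses for one word form.
candidates = [
    {"lemma": "lo/gos", "case": "nominative", "number": "plural"},
    {"lemma": "lo/gos", "case": "vocative", "number": "plural"},
]
known_features = {"case": "vocative", "number": "plural"}

ranked = rank_parses(candidates, known_features)
best = ranked[0]
```

A plain match count treats every feature as equally important; a production heuristic may well weight features differently or handle partial matches.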

This repository provides insight into the processing pipeline, including the Python code (primarily embedded in Jupyter Notebooks with comments), intermediate data, and the resulting Text-Fabric feature files. The final feature files (*.tf) are included in the package available at the tonyjurg/N1904addons repository. This repository also explains how an executable instance of Morpheus was set up to run inside a Docker virtualization environment.

The dataset builds on a previously developed Text-Fabric word-node feature, betacode, which stores the Beta Code equivalent of the Unicode text found in the text feature of each surface-level word.
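To illustrate what such a conversion involves, here is a minimal sketch of a Unicode-to-Beta-Code converter. It is not the code used to build the betacode feature: the function name `to_betacode` is hypothetical, only the basic letters and common diacritics are covered, and the order in which diacritics are emitted follows Unicode canonical decomposition, which is an assumption that may differ from what a given Morpheus build expects.

```python
import unicodedata

# Greek base letters mapped to their Beta Code equivalents.
LETTERS = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "h", "θ": "q", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "c", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "u", "φ": "f", "χ": "x", "ψ": "y",
    "ω": "w",
}

# Combining diacritics (as they appear after NFD decomposition).
MARKS = {
    "\u0313": ")",   # smooth breathing
    "\u0314": "(",   # rough breathing
    "\u0301": "/",   # acute
    "\u0300": "\\",  # grave
    "\u0342": "=",   # circumflex
    "\u0308": "+",   # diaeresis (emission order is an assumption)
    "\u0345": "|",   # iota subscript
}


def to_betacode(text: str) -> str:
    """Convert Unicode Greek to Beta Code; unknown characters pass through."""
    chars = unicodedata.normalize("NFD", text)
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        i += 1
        base = LETTERS.get(ch.lower())
        if base is None:
            out.append(ch)
            continue
        # Collect the diacritics attached to this letter.
        marks = ""
        while i < len(chars) and chars[i] in MARKS:
            marks += MARKS[chars[i]]
            i += 1
        if ch.isupper():
            # Capitals: asterisk, then diacritics, then the letter.
            out.append("*" + marks + base)
        else:
            out.append(base + marks)
    return "".join(out)


example = to_betacode("λόγος")
```

Working on the NFD-decomposed string keeps the lookup tables small, since each precomposed character splits into one base letter plus its combining marks.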

All procedures and tools are fully documented and openly accessible to ensure complete reproducibility. The workflow is implemented in Python using Jupyter Notebooks, with each stage of the process modularized into standalone notebooks or scripts. This openness aims to encourage reuse and highlight Text-Fabric’s transparency and flexibility.

Setting up the environment

Production pipeline

Full production notebook

Testing

Testing: tests of the data and its conversion.

Sandborg-Petersen-decoder: decoding the morphological tags.
Morphkit: Python package for interfacing with Morpheus and performing output analysis.
Creating the TF feature betacode: the foundational feature used to bridge the Morpheus and Text-Fabric worlds.
Creating the entropy features: entropy measures how uncertain or variable the relation is between an element (a surface-level word, its morph, or its lemma) and the syntactic function (Subject, Object, etc.) of the phrase it belongs to.
Creating statistical functions: measures such as word counts and the type-token ratio (TTR).
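The entropy and type-token ratio measures mentioned in the list above can be sketched as follows. This is a generic illustration under stated assumptions, not the notebooks' implementation: the function names are hypothetical, the entropy is plain Shannon entropy in bits (base-2 logarithm), and the sample data is made up.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


def shannon_entropy(labels: Iterable[str]) -> float:
    """Shannon entropy (in bits) of a collection of labels.

    For example, the syntactic functions of the phrases in which one
    word form occurs. Returns 0.0 when there are no labels.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Number of distinct tokens divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Illustrative data only: a form seen equally often as Subject and Object
# is maximally uncertain between the two functions.
functions_seen = ["Subject", "Object", "Subject", "Object"]
entropy_bits = shannon_entropy(functions_seen)

sample_tokens = ["kai/", "o(", "lo/gos", "kai/"]
ttr = type_token_ratio(sample_tokens)
```

An entropy of zero means the element always occurs with the same syntactic function; higher values mean the function is harder to predict from the element alone.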