Additional features for the N1904-TF, the syntactically annotated Text-Fabric dataset of the Greek New Testament.
About this dataset

Feature group | Feature type | Data type | Available for node types | Feature status |
---|---|---|---|---|
statistic | Node | str | book chapter verse sentence group wg phrase subphrase clause | ✅ |
Type-to-Token Ratio based on the morph tags of all word nodes under this node.
A float number stored as a string representing a ratio in the range 0 to 1 (inclusive) where the dot denotes a decimal point, not a thousands separator.
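Because the value is stored as a string, it has to be converted before any arithmetic. The following is a minimal sketch of such a conversion; the helper name `parse_ratio` and the sample value `"0.4375"` are illustrative assumptions and are not taken from the dataset.

```python
def parse_ratio(value: str) -> float:
    """Convert a ratio stored as a string (dot = decimal point) to a float.

    Hypothetical helper for illustration; raises ValueError when the
    string is not a number or lies outside the inclusive range 0..1.
    """
    ratio = float(value)  # float() treats the dot as a decimal point
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"ratio out of range 0..1: {value!r}")
    return ratio


# Invented sample value, used only to show the conversion.
example = parse_ratio("0.4375")
print(example)
```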
This feature provides the Morph-to-Token Ratio (MTR), a measure of morphological diversity. It is defined as:
\[\text{MTR} = \frac{|\{\text{unique morphs in the text}\}|}{N}\]
Here \(N\) is the total number of word tokens (word nodes) in the text.
The following plot compares the Type-to-Token Ratios measured over word form (TTR), lemma (LTR), and morphology (MTR) for each book of the New Testament. The plot shows that shorter books generally yield higher ratios, even though TTR is itself already a normalized measure. Various normalization methods exist to account for this length-related bias. Many of them are conveniently accessible through the Python package lexicalrichness.
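The ratio and its sensitivity to text length can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions, not the code of the production notebook: the function names and the sample morph tags are invented for illustration, and Guiraud's root TTR (types divided by the square root of the token count) is shown only as one common length-normalized variant, written by hand rather than through the lexicalrichness API.

```python
import math


def morph_token_ratio(morph_tags):
    """MTR: number of unique morph tags divided by the number of tokens."""
    tags = list(morph_tags)
    if not tags:
        raise ValueError("at least one token is required")
    return len(set(tags)) / len(tags)


def root_ttr(morph_tags):
    """Guiraud's root TTR: unique tags / sqrt(tokens).

    One common way to dampen the length effect; not bounded by 1.
    """
    tags = list(morph_tags)
    if not tags:
        raise ValueError("at least one token is required")
    return len(set(tags)) / math.sqrt(len(tags))


# Invented morph tags for eight word tokens (illustration only).
sample_tags = [
    "N-NSM", "V-PAI-3S", "T-NSM", "N-NSM",
    "CONJ", "V-PAI-3S", "N-ASF", "T-NSM",
]

mtr = morph_token_ratio(sample_tags)   # 5 unique tags over 8 tokens
print(f"MTR      = {mtr}")
print(f"root TTR = {root_ttr(sample_tags):.4f}")

# Repeating the same text doubles N but adds no new tags, so MTR halves.
print(f"MTR of doubled text = {morph_token_ratio(sample_tags * 2)}")
```

The last line shows the length effect in its simplest form: the set of unique tags stops growing while the token count keeps increasing, which is why longer books tend toward lower ratios.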
Related features:
The production notebook can be found on this repository.