Additional features for the N1904-TF, the syntactic annotated Text-Fabric dataset of the Greek New Testament.
About this dataset

Feature group | Feature type | Data type | Available for node types | Feature status |
---|---|---|---|---|
statistic | Node | int | word | ✅ |
Normalized entropy of a morph(-tag of a word)* as predictor of its parent phrase function (like Subject, Object, etc.).
*) For the calculation of entropy, the dialect tag (like -ATT) was ignored.
A number stored as an integer, representing the entropy normalized to a range from 0 to 1000 (inclusive).
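As a usage illustration (not taken from this page), the stored value can be read with the regular Text-Fabric API; the feature name passed to `Fs()` below is a placeholder for the actual name of this feature:

```python
# Hedged usage sketch. `CenterBLC/N1904` is the dataset's published location;
# the feature name given to Fs() is a PLACEHOLDER -- replace it with the
# actual name under which this feature is distributed.
from tf.app import use

A = use("CenterBLC/N1904", hoist=globals())   # loads F, Fs, L, T, ...
w = 1                                         # an arbitrary word node
value = Fs("ent_function_morph").v(w)         # integer in the range 0..1000
print(T.text(w), value)
```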
In practical terms, entropy reflects how predictable an element’s syntactic behavior is. This feature expresses how consistently a given morph tag (as stored in the morph
feature) maps to a particular phrase function.
In this context, these phrase functions are derived from the feature function
and expanded with several additional categories (see the table below for all details).
In the N1904-TF dataset, not all words belong to phrases with well-defined syntactic functions such as Subject or Object. For instance, conjunctions like δὲ or καὶ typically do not form part of syntactic phrases in the strict sense.
To ensure that every word can still be assigned a functional label, a Python script was developed. The script prioritizes the canonical phrase function where available, but supplements the gaps with a set of extended categories.
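That script is not reproduced on this page; the following minimal sketch only illustrates the prioritization it describes, assuming hypothetical `function` and `sp` inputs and simplified part-of-speech criteria for the pseudo classes (the actual tests, e.g. for Aux and Appo, may differ):

```python
# Illustrative sketch (not the original script): prefer the canonical phrase
# function; fall back to an augmented pseudo class for words outside
# function-bearing phrases. `function` and `sp` are hypothetical inputs.

CANONICAL = {"Cmpl", "Pred", "Subj", "Objc", "PreC", "Adv"}
FALLBACK = {"conj": "Conj", "intj": "Intj"}   # simplified criteria

def functional_label(function, sp):
    if function in CANONICAL:
        return function                       # canonical label wins
    return FALLBACK.get(sp, "Unkn")           # else a pseudo class

print(functional_label("Subj", "noun"))       # -> Subj
print(functional_label(None, "conj"))         # -> Conj (e.g. δὲ, καὶ)
```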
The table below distinguishes between these two types of categories and shows the number of word nodes mapped to each one.
Source | Value | Description | Frequency |
---|---|---|---|
From feature function (6 classes) | Cmpl | Complement | 35442 |
| Pred | Predicate | 25138 |
| Subj | Subject | 21567 |
| Objc | Object | 19371 |
| PreC | Predicate-Complement | 9595 |
| Adv | Adverbial | 5367 |
Augmented pseudo classes (5 classes) | Conj | Conjunction | 16316 |
| Unkn | Unknown | 2076 |
| Intj | Interjection | 1470 |
| Aux | Auxiliary | 1136 |
| Appo | Apposition | 301 |
High entropy values indicate that a form is ambiguous, as it appears in multiple syntactic functions with similar probabilities. In contrast, low entropy values signify that a form is strongly associated with a single syntactic function, making it a reliable indicator of that role within the parent phrase.
Entropy is a measure from information theory that quantifies uncertainty or unpredictability in a probability distribution. It is defined as:
$$H(X) = -\sum_i P(x_i) \log_2 P(x_i)$$

where \( P(x_i) \) is the probability of outcome \( x_i \).
Entropy measures the uncertainty associated with a probability distribution. It reaches its maximum when all outcomes are equally likely (i.e., maximum uncertainty), and its minimum (zero) when one outcome is certain.
In the context of the N1904-TF dataset, we apply this principle to estimate the uncertainty of syntactic function prediction based on linguistic features.
Let an element \(e \in D \), where \( D = \{ \text{lemma}, \text{morph}, \text{text} \} \), represent a linguistic feature. If this element is associated with \( n \) different phrase functions \( f \), then the entropy \( H(e \mid f) \) in bits is calculated as:
$$H(e|f) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

where \( p_i \) is the probability that element \( e \) corresponds to the \( i \)-th function.
If the distribution is uniform (i.e., all \( p_i = \frac{1}{n} \) ), the entropy reaches its maximum:
$$H(e|f) = -n \cdot \frac{1}{n} \cdot \log_2\left(\frac{1}{n}\right) = \log_2(n)$$

In the mapping used for calculating this feature, there are \( n = 11 \) phrase function categories. Thus, the theoretical maximum entropy for a given datatype \( D \) is:
$$H_{\text{max}}(D) = \log_2(11) \approx 3.459 \text{ bits}$$

This value represents the upper bound of uncertainty when a linguistic feature provides no predictive information about phrase function.
To obtain a normalized entropy, where values for \( H(e|f) \) are in the range 0 to 1 (inclusive), the following formula can be applied for each datatype \( D \):
$$H_{\text{norm}}(D) = \frac{H(D)}{H_{\text{max}}(D)}$$
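As a worked sketch of the whole computation (plain Python; the function name and input shape are mine, and the final scaling mirrors the 0 to 1000 integer storage described above):

```python
# Minimal sketch: normalized entropy for one element (e.g. a morph tag),
# given the counts of the functional labels it maps to. The 11 classes and
# the log2(11) maximum follow the derivation above; names are illustrative.
import math

N_CLASSES = 11                       # functional labels in the mapping above
H_MAX = math.log2(N_CLASSES)         # ≈ 3.459 bits

def normalized_entropy_int(function_counts):
    total = sum(function_counts.values())
    h = -sum((c / total) * math.log2(c / total)
             for c in function_counts.values() if c > 0)
    return round(1000 * h / H_MAX)   # stored as an integer in 0..1000

# A morph tag seen 90 times as Subj and 10 times as Objc:
print(normalized_entropy_int({"Subj": 90, "Objc": 10}))  # -> 136
```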
The following table provides some key statistical metrics for this feature, computed over the total of 1055 unique morphs:
Metric | Value |
---|---|
Count | 1018 |
Min | 0 |
25%ile | 0 |
Median | 0 |
75%ile | 254 |
Max | 788 |
Mean | 125 |
StdDev | 168 |
This indicates that most morphs are highly predictable in terms of their syntactic roles, while a small subset shows high entropy due to usage in multiple phrase functions.
The following plot illustrates both the absolute and normalized entropy distribution for all 1055 unique morph tags in the N1904-TF dataset:
To better understand the significance of entropy values, the following plot illustrates the absolute entropy associated with several synthetic probability distributions.
The information can also be presented in a table augmented with the values for normalized entropy. In the example above, we assumed a total of 8 possible classes. Therefore, to compute the normalized entropy, we can simply divide each absolute entropy by 3, since \( \log_2(8) = 3 \).
Distribution | Absolute entropy (bits) | Normalized entropy (8 classes) |
---|---|---|
Uniform (2 outcomes) | 1.0000 | 0.3333 |
Skewed (0.9, 0.1) | 0.4690 | 0.1563 |
Heavy skewed (0.99, 0.01) | 0.0808 | 0.0269 |
Moderate skewed (0.7, 0.3) | 0.8813 | 0.2938 |
Uniform (4 outcomes) | 2.0000 | 0.6667 |
Uniform (8 outcomes) | 3.0000 | 1.0000 |
Mixed (0.6, 0.2, 0.1, 0.1) | ~1.5710 | ~0.5237 |
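For verification, the values in this table can be reproduced with a short Python sketch (function and variable names are mine):

```python
# Verification sketch: absolute entropy of each synthetic distribution and
# its normalization against 8 classes (log2(8) = 3).
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

distributions = {
    "Uniform (2 outcomes)": [0.5, 0.5],
    "Skewed (0.9, 0.1)": [0.9, 0.1],
    "Heavy skewed (0.99, 0.01)": [0.99, 0.01],
    "Moderate skewed (0.7, 0.3)": [0.7, 0.3],
    "Uniform (4 outcomes)": [0.25] * 4,
    "Uniform (8 outcomes)": [0.125] * 8,
    "Mixed (0.6, 0.2, 0.1, 0.1)": [0.6, 0.2, 0.1, 0.1],
}
for name, probs in distributions.items():
    h = entropy_bits(probs)
    print(f"{name}: {h:.4f} bits, normalized {h / 3:.4f}")
```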
These examples highlight the following key properties of entropy:

- Entropy is maximal (\( \log_2(n) \) bits) when all \( n \) outcomes are equally likely, as in the uniform distributions.
- The more a distribution is skewed toward a single outcome, the lower its entropy.
- Normalized entropy rescales these values to the range 0 to 1, making distributions over different numbers of outcomes comparable.
Related features