Additional features for the N1904-TF, the syntactically annotated Text-Fabric dataset of the Greek New Testament.
| Feature group | Feature type | Data type | Available for node types | Feature status |
| --- | --- | --- | --- | --- |
| statistic | Node | int | word | ✅ |
Absolute entropy of a surface-level wordform (feature `text`) as a predictor of its parent phrase function (such as Subject, Object, etc.).

A number stored as an integer representing the entropy in millibits. In the N1904-TF dataset, the actual values range from 0 to 2584.
In practical terms, entropy reflects how predictable an element's syntactic behavior is. This feature expresses how consistently a given surface-level word (as stored in the `text` feature) maps to a particular phrase function. In this context these phrase functions are derived from the feature `function` and expanded with several additional categories (detailed in the table below).
In the N1904-TF dataset, not all words belong to phrases with well-defined syntactic functions such as Subject or Object. For instance, conjunctions like δὲ or καὶ typically do not form part of syntactic phrases in the strict sense.
To ensure that every word can still be assigned a functional label, a Python script was developed (a sketch of the approach is given below). The script prioritizes assigning the canonical phrase function where available, and supplements the gaps with a set of extended categories.
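The original script is not reproduced here; the following is only a minimal sketch of the approach, written against the Text-Fabric API. The node type `wg`, the features `function` and `sp`, and the part-of-speech values used for the fallback classes are assumptions for illustration, not verified against the dataset.

```python
# Minimal sketch of the fallback assignment logic (not the original script).
# F (feature access) and L (hierarchy lookups) come from a loaded Text-Fabric
# instance; the node type "wg" and the features "function" and "sp" are assumed.

CANONICAL = {"Cmpl", "Pred", "Subj", "Objc", "PreC", "Adv"}

def assign_function(w, F, L):
    """Return a phrase-function label for word node w, with pseudo-class fallback."""
    # 1. Prefer the canonical `function` value of an enclosing word group.
    for wg in L.u(w, otype="wg"):
        func = F.function.v(wg)
        if func in CANONICAL:
            return func
    # 2. Otherwise derive a pseudo class from the word's part of speech
    #    (value names below are illustrative; Aux and Appo need extra checks).
    sp = F.sp.v(w)
    if sp == "conj":
        return "Conj"
    if sp == "intj":
        return "Intj"
    # 3. Anything left unclassified is labelled "Unkn".
    return "Unkn"
```

In a real run the resulting labels would be tallied per unique `text` value to produce the frequency table below.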
The table below distinguishes between these two types of categories and shows the number of word nodes mapped to each one.
| Source | Value | Description | Frequency |
| --- | --- | --- | --- |
| From feature `function` (6 classes) | Cmpl | Complement | 35442 |
| | Pred | Predicate | 25138 |
| | Subj | Subject | 21567 |
| | Objc | Object | 19371 |
| | PreC | Predicate-Complement | 9595 |
| | Adv | Adverbial | 5367 |
| Augmented pseudo classes (5 classes) | Conj | Conjunction | 16316 |
| | Unkn | Unknown | 2076 |
| | Intj | Interjection | 1470 |
| | Aux | Auxiliary | 1136 |
| | Appo | Apposition | 301 |
High entropy values indicate that a form is ambiguous, as it appears in multiple syntactic functions with similar probabilities. In contrast, low entropy values signify that a form is strongly associated with a single syntactic function, making it a reliable indicator of that role within the parent phrase.
Entropy is a measure from information theory that quantifies uncertainty or unpredictability in a probability distribution. It is defined as:

$$H(X) = -\sum_i P(x_i) \log_2 P(x_i)$$

where \( P(x_i) \) is the probability of outcome \( x_i \).
Entropy measures the uncertainty associated with a probability distribution. It reaches its maximum when all outcomes are equally likely (i.e., maximum uncertainty), and its minimum (zero) when one outcome is certain.
In the context of the N1904-TF dataset, we apply this principle to estimate the uncertainty of syntactic function prediction based on linguistic features.
Let an element \(e \in D \), where \( D = \{ \text{lemma}, \text{morph}, \text{text} \} \), represent a linguistic feature. If this element is associated with \( n \) different phrase functions \( f \), then the entropy \( H(e \mid f) \) in bits is calculated as:
$$H(e|f) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

where \( p_i \) is the probability that element \( e \) corresponds to the \( i \)-th function.
If the distribution is uniform (i.e., all \( p_i = \frac{1}{n} \) ), the entropy reaches its maximum:
$$H(e|f) = -n \cdot \frac{1}{n} \cdot \log_2\left(\frac{1}{n}\right) = \log_2(n)$$

In the mapping used for calculating this feature, there are \( n = 11 \) phrase function categories. Thus, the theoretical maximum entropy for a given datatype \( D \) is:
$$H_{\text{max}}(D) = \log_2(11) \approx 3.459 \text{ bits}$$

This value represents the upper bound of uncertainty when a linguistic feature provides no predictive information about phrase function.
To obtain a normalized entropy, where values for \( H(e|f) \) are in the range 0 to 1 (inclusive), the following formula can be applied for each datatype \( D \):
$$H_{\text{norm}}(D) = \frac{H(D)}{H_{\text{max}}(D)}$$
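As a minimal numeric illustration of the formulas above, the sketch below computes the absolute entropy of a distribution over phrase functions, normalizes it by \( \log_2(11) \), and converts it to the millibit integer used for storage. The helper names and the exact rounding are illustrative, not part of the dataset code.

```python
from math import log2

N_CLASSES = 11  # number of phrase-function categories in the mapping above

def entropy_bits(probs):
    """Absolute entropy H = -sum(p * log2(p)) in bits, ignoring zero probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def normalized_entropy(probs, n_classes=N_CLASSES):
    """Entropy scaled to [0, 1] by the theoretical maximum log2(n_classes)."""
    return entropy_bits(probs) / log2(n_classes)

# Example: a wordform seen 90% of the time as Subj and 10% as Objc.
probs = [0.9, 0.1]
h = entropy_bits(probs)              # ~0.4690 bits
h_norm = normalized_entropy(probs)   # ~0.1356 (relative to log2(11) ~ 3.459)
h_millibits = round(h * 1000)        # ~469; the feature stores such millibit integers
```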
The following table provides key statistical metrics for the absolute entropy (in millibits) of all 19446 unique `text` tokens (surface-level word forms) in N1904-TF:
| Statistic | Value |
| --- | --- |
| Count | 19446 |
| Min | 0 |
| 25%ile | 0 |
| Median | 0 |
| 75%ile | 0 |
| Max | 2584 |
| Mean | 150 |
| StdDev | 393 |
This indicates that most text elements are highly predictable in terms of their syntactic roles, while a small subset shows high entropy due to usage in multiple phrase functions.
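Summary statistics of this kind can be reproduced with standard library tools. The sketch below is illustrative and assumes the per-wordform entropy values (in millibits) are already available as a list.

```python
import statistics

def summarize(values):
    """Print count, range, quartiles, mean and sample standard deviation
    for a list of millibit entropy values."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # 25%, 50%, 75% percentiles
    print(f"Count:  {len(values)}")
    print(f"Min:    {min(values)}")
    print(f"25%ile: {q1:.0f}")
    print(f"Median: {median:.0f}")
    print(f"75%ile: {q3:.0f}")
    print(f"Max:    {max(values)}")
    print(f"Mean:   {statistics.mean(values):.0f}")
    print(f"StdDev: {statistics.stdev(values):.0f}")
```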
The following plot illustrates both the absolute and normalized entropy distribution for all 19446 unique surface-level word forms in the N1904-TF dataset:
To better understand the significance of entropy values, the following plot illustrates the absolute entropy associated with several synthetic probability distributions.
The information can also be presented in a table augmented with the values for normalized entropy. In the example above, we assumed a total of 8 possible classes. Therefore, to compute the normalized entropy, we can simply divide each absolute entropy by 3, since \( \log_2(8) = 3 \).
| Distribution | Absolute entropy (bits) | Normalized entropy (8 classes) |
| --- | --- | --- |
| Uniform (2 outcomes) | 1.0000 | 0.3333 |
| Skewed (0.9, 0.1) | 0.4689 | 0.1563 |
| Heavily skewed (0.99, 0.01) | 0.0808 | 0.0269 |
| Moderately skewed (0.7, 0.3) | 0.8813 | 0.2938 |
| Uniform (4 outcomes) | 2.0000 | 0.6667 |
| Uniform (8 outcomes) | 3.0000 | 1.0000 |
| Mixed (0.6, 0.2, 0.1, 0.1) | ~1.5710 | ~0.5237 |
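Any row of the table can be checked with a few lines of Python; the snippet below recomputes the skewed two-outcome case under the same 8-class normalization (the helper name is illustrative).

```python
from math import log2

def entropy_bits(probs):
    """Absolute entropy in bits of a discrete probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Recompute the "Skewed (0.9, 0.1)" row, normalizing by log2(8) = 3.
h = entropy_bits([0.9, 0.1])
print(f"absolute: {h:.4f} bits, normalized (8 classes): {h / 3:.4f}")
# -> absolute: 0.4690 bits, normalized (8 classes): 0.1563
```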
The table above highlights the following key properties of entropy:

- Entropy is maximal for uniform distributions and grows with the number of equally likely outcomes (\( \log_2(n) \) for \( n \) outcomes).
- The more skewed a distribution becomes, the lower its entropy; it approaches zero when a single outcome dominates.
- Normalized entropy rescales the absolute values to the range 0 to 1, making distributions with different numbers of classes comparable.
Related features: