N1904addons

Additional features for the N1904-TF, the syntactically annotated Text-Fabric dataset of the Greek New Testament.

N1904addons - Feature: text_entr_norm

| Feature group | Feature type | Data type | Available for node types | Feature status |
| --- | --- | --- | --- | --- |
| statistic | Node | int | word | |

Feature short description

Normalized entropy of a surface-level wordform as a predictor of its parent phrase function (such as Subject, Object, etc.).

Feature values

A number stored as an integer, representing the entropy normalized to a range from 0 to 1000 (inclusive).
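
For example (an illustrative conversion only; the value 347 is made up), dividing the stored integer by 1000 yields the normalized entropy, and multiplying by the maximum entropy \( \log_2(11) \) explained in the detailed description below recovers the absolute entropy in bits:

```python
import math

stored = 347                      # hypothetical feature value (integer 0..1000)
h_norm = stored / 1000            # normalized entropy in the range 0..1
h_bits = h_norm * math.log2(11)   # absolute entropy in bits (11 categories, see below)
print(f'{h_norm:.3f} normalized, {h_bits:.3f} bits')   # 0.347 normalized, 1.200 bits
```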

Feature detailed description

In practical terms, entropy reflects how predictable an element’s syntactic behavior is. This feature expresses how consistently a given surface wordform (as stored in the text feature) maps to a particular phrase function.

In this context these phrase functions are derived from feature function and expanded with several additional categories (see the details below).

Parent phrase function details

In the N1904-TF dataset, not all words belong to phrases with well-defined syntactic functions such as Subject or Object. For instance, conjunctions like δὲ or καὶ typically do not form part of syntactic phrases in the strict sense.

To ensure that every word can still be assigned a functional label, a Python script was developed. This script prioritizes assigning the canonical phrase function where available, and supplements gaps with a set of extended categories.
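
The script itself is not reproduced on this page. The sketch below only illustrates the general approach under the priorities just described; the Text-Fabric objects `F` and `L` are assumed to be loaded, and the fallback rules (based on the part-of-speech feature `sp`) are simplified assumptions rather than the actual implementation.

```python
def phrase_function_label(w):
    """Illustrative sketch (not the actual script): label a word node with a
    phrase function, preferring the canonical `function` feature of an
    embedding phrase and falling back to pseudo classes otherwise."""
    # Prefer the canonical function of the parent phrase, if present
    for phrase in L.u(w, otype='phrase'):
        func = F.function.v(phrase)
        if func:                       # e.g. 'Subj', 'Objc', 'Pred', ...
            return func
    # Fall back to pseudo classes for words outside functional phrases
    sp = F.sp.v(w)                     # part of speech (simplified assumption)
    if sp == 'conj':
        return 'Conj'
    if sp == 'intj':
        return 'Intj'
    return 'Unkn'                      # anything not covered by the rules above

labels = {w: phrase_function_label(w) for w in F.otype.s('word')}
```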

The table below distinguishes between these two types of categories and shows the number of word nodes mapped to each one.

| Source | Value | Description | Frequency |
| --- | --- | --- | --- |
| From feature function (6 classes) | Cmpl | Complement | 35442 |
| | Pred | Predicate | 25138 |
| | Subj | Subject | 21567 |
| | Objc | Object | 19371 |
| | PreC | Predicate-Complement | 9595 |
| | Adv | Adverbial | 5367 |
| Augmented pseudo classes (5 classes) | Conj | Conjunction | 16316 |
| | Unkn | Unknown | 2076 |
| | Intj | Interjection | 1470 |
| | Aux | Auxiliary | 1136 |
| | Appo | Apposition | 301 |

The "Unkn" (unknown) category accounts for approximately 1.5% of all mappings, slightly raising both the absolute and normalized entropy.


High entropy values indicate that a form is ambiguous, as it appears in multiple syntactic functions with similar probabilities. In contrast, low entropy values signify that a form is strongly associated with a single syntactic function, making it a reliable indicator of that role within the parent phrase.
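
As a minimal usage sketch (assuming the corpus and this add-on feature are already loaded into the Text-Fabric API objects `F` and `T` as described in the repository's loading instructions; the cut-off of 500 is purely illustrative, not an official threshold):

```python
AMBIGUOUS = 500   # illustrative cut-off on the 0..1000 scale, not an official threshold

ambiguous_forms = set()
for w in F.otype.s('word'):
    value = F.text_entr_norm.v(w)          # stored integer, 0..1000
    if value is not None and value >= AMBIGUOUS:
        ambiguous_forms.add(F.text.v(w))   # surface form with a weak function signal

print(len(ambiguous_forms), 'ambiguous surface forms')
```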

Detailed mathematical description

Definition

Entropy is a measure from information theory that quantifies uncertainty or unpredictability in a probability distribution. It is defined as:

$$H(X) = -\sum_i P(x_i) \log_2 P(x_i)$$

Where:

  • \( P(x_i) \) is the probability of the \( i \)-th outcome.
  • The base-2 logarithm \( \log_2 \) ensures the result is expressed in bits.
  • By convention, \( 0 \cdot \log_2(0) = 0 \), so outcomes with zero probability contribute nothing.

Entropy measures the uncertainty associated with a probability distribution. It reaches its maximum when all outcomes are equally likely (i.e., maximum uncertainty), and its minimum (zero) when one outcome is certain.
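
For example, a fair two-way choice has the maximum entropy of one bit, while a certain outcome has zero entropy:

$$H\left(\tfrac{1}{2},\tfrac{1}{2}\right) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit}, \qquad H(1, 0) = -(1 \cdot \log_2 1 + 0) = 0 \text{ bits}$$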

Application

In the context of the N1904-TF dataset, we apply this principle to estimate the uncertainty of syntactic function prediction based on linguistic features.

Let an element \(e \in D \), where \( D = \{ \text{lemma}, \text{morph}, \text{text} \} \), represent a linguistic feature. If this element is associated with \( n \) different phrase functions \( f \), then the entropy \( H(e \mid f) \) in bits is calculated as:

$$H(e|f) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

where \( p_i \) is the probability that element \( e \) corresponds to the \( i-th \) function.

If the distribution is uniform (i.e., all \( p_i = \frac{1}{n} \) ), the entropy reaches its maximum:

$$H(e|f) = -n \cdot \frac{1}{n} \cdot \log_2\left(\frac{1}{n}\right) = \log_2(n)$$

In the mapping used for calculating this feature, there are \( n = 11 \) phrase function categories. Thus, the theoretical maximum entropy for a given datatype \( D \) is:

$$H_{\text{max}}(D) = \log_2(11) \approx 3.459 \text{ bits}$$

This value represents the upper bound of uncertainty when a linguistic feature provides no predictive information about phrase function.

To obtain a normalized entropy, with values in the range 0 to 1 (inclusive), the following formula can be applied for each datatype \( D \):

$$H_{\text{norm}}(D) = \frac{H(D)}{H_{\text{max}}(D)}$$
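
As an illustration of how such a value could be computed and scaled to the integer range stored in this feature (a sketch only: the per-wordform counts are made up and the helper is not the actual pipeline code of this dataset):

```python
import math

N_CLASSES = 11                     # phrase function categories in the mapping above
H_MAX = math.log2(N_CLASSES)       # ≈ 3.459 bits, the theoretical maximum

def normalized_entropy(counts):
    """Normalized entropy (0..1) of a wordform -> phrase function distribution."""
    total = sum(counts.values())
    h = 0.0
    for c in counts.values():
        if c:                      # convention: 0 * log2(0) = 0
            p = c / total
            h -= p * math.log2(p)
    return h / H_MAX

# Made-up example: a wordform observed 80x as Subj, 15x as Objc, 5x as Cmpl
counts = {'Subj': 80, 'Objc': 15, 'Cmpl': 5}
stored_value = round(normalized_entropy(counts) * 1000)   # integer 0..1000 as stored
print(stored_value)                                        # -> 256
```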


The following table provides key statistical metrics for the absolute entropy across all unique text tokens (surface-level word forms):

=== text ===
Count:   19446
Min:     0000
25%ile:  0000
Median:  0000
75%ile:  0000
Max:     0747
Mean:    0043
StdDev:  0113

This indicates that most text elements are highly predictable in terms of their syntactic roles, while a small subset shows high entropy due to usage in multiple phrase functions.

The following plot illustrates both the absolute and normalized entropy distribution for all 19446 unique surface-level word forms in the N1904-TF dataset:

Entropy distribution text -> phrase function

Theoretical example

To better understand the significance of entropy values, the following plot illustrates the absolute entropy associated with several synthetic probability distributions.

Entropy examples

The information can also be presented in a table augmented with the values for normalized entropy. In the example above, we assumed a total of 8 possible classes. Therefore, to compute the normalized entropy, we can simply divide each absolute entropy by 3, since \( \log_2(8) = 3 \).

| Distribution | Absolute entropy (bits) | Normalized entropy (8 classes) |
| --- | --- | --- |
| Uniform (2 outcomes) | 1.0000 | 0.3333 |
| Skewed (0.9, 0.1) | 0.4689 | 0.1563 |
| Heavy skewed (0.99, 0.01) | 0.0808 | 0.0269 |
| Moderate skewed (0.7, 0.3) | 0.8813 | 0.2938 |
| Uniform (4 outcomes) | 2.0000 | 0.6667 |
| Uniform (8 outcomes) | 3.0000 | 1.0000 |
| Mixed (0.6, 0.2, 0.1, 0.1) | ~1.6855 | ~0.5618 |
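
The entropies in this table can be verified with a few lines of Python (a quick check, not part of the dataset pipeline):

```python
import math

def entropy(probs):
    """Absolute entropy in bits, using the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p)

examples = {
    'Uniform (2 outcomes)': [0.5, 0.5],
    'Skewed (0.9, 0.1)': [0.9, 0.1],
    'Heavy skewed (0.99, 0.01)': [0.99, 0.01],
    'Uniform (8 outcomes)': [1 / 8] * 8,
}

for name, probs in examples.items():
    h = entropy(probs)
    print(f'{name}: {h:.4f} bits, normalized {h / 3:.4f}')   # log2(8) = 3
```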

This highlights the following key properties of entropy:

  • Entropy is maximal when all outcomes are equally likely, and this maximum grows with the number of outcomes (\( \log_2(n) \)).
  • The more skewed a distribution is towards a single outcome, the lower its entropy.
  • Normalizing by the maximum entropy makes values comparable across distributions with different numbers of classes.

See also

Related features:

References

Data source

GitHub repository.