Additional features for N1904-TF, the syntactically annotated Text-Fabric dataset of the Greek New Testament.
About this dataset

| Feature group | Feature type | Data type | Available for node types | Feature status |
|---|---|---|---|---|
| statistic | Node | str | word | ✅ |

TF-IDF score (× 1,000,000) for this token, calculated using all tokens in the GNT corpus, aggregated per book.
A positive floating-point number, scaled by 1,000,000 and stored as a string.
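
Because the value is a scaled number stored as a string, it has to be converted before any numeric use. The following is a minimal sketch of that conversion; the helper name `decode_tfidf` and the sample strings are hypothetical and are not taken from the dataset.

```python
def decode_tfidf(value: str) -> float:
    """Convert a stored feature string back to an unscaled TF-IDF weight."""
    # The feature stores score * 1,000,000 as text, so parse and divide.
    return float(value) / 1_000_000


# Hypothetical stored values, for illustration only.
stored = ["98000.5", "123456.0", "7.25"]

# Sort numerically: a plain string sort would order "7.25" after "123456.0".
by_weight = sorted(stored, key=decode_tfidf, reverse=True)
weights = [decode_tfidf(v) for v in by_weight]
```

Sorting or comparing the raw strings directly gives lexicographic rather than numeric order, which is why the conversion is applied first.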
The Term Frequency-Inverse Document Frequency (TF-IDF) feature treats each book as a ‘document’, computes a TF-IDF score per normalized token, and then maps those scores back to each node in the corpus. This makes it possible to identify book-specific vocabulary and to use the weights in further quantitative or visualization-oriented analyses.
The computation follows the TF-IDF explanation on GeeksforGeeks and uses scikit-learn’s `TfidfVectorizer`.
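
The book-as-document approach can be sketched in plain Python. This is not the code used to build the feature: it is a dependency-free illustration that applies the smoothed IDF and L2 normalization that scikit-learn’s `TfidfVectorizer` uses by default. Whether the dataset was built with those default settings is an assumption, and the toy corpus, book names, and function names below are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: book name -> list of normalized tokens.
books = {
    "BookA": ["λογος", "θεος", "λογος", "και"],
    "BookB": ["θεος", "και", "αγαπη"],
    "BookC": ["και", "πιστις", "πιστις", "αγαπη"],
}


def tfidf_per_book(books):
    """Return {book: {token: weight}} using smoothed IDF and L2 normalization."""
    n_docs = len(books)
    # Document frequency: the number of books in which each token occurs.
    df = Counter()
    for tokens in books.values():
        df.update(set(tokens))

    scores = {}
    for book, tokens in books.items():
        counts = Counter(tokens)
        # Raw term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1.
        raw = {
            tok: count * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
            for tok, count in counts.items()
        }
        # L2-normalize the weights within each book.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        scores[book] = {tok: w / norm for tok, w in raw.items()}
    return scores


def as_feature_value(weight):
    """Scale by 1,000,000 and store as a string (exact formatting is assumed)."""
    return str(weight * 1_000_000)


scores = tfidf_per_book(books)

# Map the per-book scores back to every token occurrence (one entry per word).
node_values = [
    (book, tok, as_feature_value(scores[book][tok]))
    for book, tokens in books.items()
    for tok in tokens
]
```

In this toy corpus, `και` occurs in every book and therefore receives the lowest weight in `BookA`, while `λογος`, which occurs only there, receives the highest. That is the sense in which the scores surface book-specific vocabulary.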
Related features:
GitHub repository Create_TF-IDF_Text-Fabric_features.