N1904addons - Feature: tfidfns

Feature group	Feature type	Data type	Available for node types	Feature status
`statistic`	`Node`	`str`	`word`	✅

Feature short description

TF-IDF score (x 1,000,000) for this token, calculated using only non-stopword tokens in the GNT corpus, aggregated per book.

Feature values

A scaled by 1,000,000 positive float number stored as a string.

Detailed feature description

The Term Frequency-Inverse Document Frequency (TF-IDF) feature treats each book as a ‘document’ and providing the TF-IDF scores per normalized token, and then map those scores back to each node in the corpus. This allows to identify book-specific vocabulary and to use these weights for further quantitative or visualization-oriented analyses.

This feature excludes all stop words (any token with a part-of-speech value of ‘intj’, ‘prep’, ‘art’, ‘conj’); all values for these token are set to zero.

It follows the information provided in the TF-IDF explanation on GeeksforGeeks and uses scikit-learn’s TfidfVectorizer for computations.

Data source

GitHub repository Create_TF-IDF_Text-Fabric_features.

N1904addons - Feature: tfidfns

Feature short description

Feature values

Detailed feature description

See also

Data source