MorphKit Functions

This page documents the core functions available in the morphkit package.

Analyse morph tag

morphkit.analyse_morph_tag(parse: Dict[str, Any], debug: bool = False) → str

Compute the Sandborg–Petersen morphological tag for a single Morpheus analysis block.

Args:

parse (dict):

Morphological parse with keys like ‘pos’, ‘tense’, ‘voice’, ‘mood’, ‘case’, ‘number’, ‘gender’, etc.

debug (bool):

Optional argument. Defaults to False. If set to True, the function prints some debug information.

Returns:

str:

The SP morphological tag or ‘UNK’ if unrecognized.

Steps:

Determine the POS prefix (e.g. ‘N-’, ‘V-’, ‘A-’, ‘ADV’, etc.).

Return immediately for indeclinable POS (adverbs, particles, etc.).

For verbs, build the tag as ‘V-<Tense><Voice><Mood>’ plus a suffix for person & number (finite), infinitive (no suffix), or participle (case–number–gender).

For nouns, adjectives, and articles, append uppercase initials of case, number, and gender; adjectives may additionally get ‘-C’ (comparative) or ‘-S’ (superlative) for degree.

For pronouns, combine person, case, number, and gender.

If nothing matches, return ‘UNK’.
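The steps above can be sketched in a few lines. This is a minimal illustration, not morphkit's implementation: the function name sketch_morph_tag, the lookup tables, and the assumption that person values look like '1st' or '3rd' are all hypothetical, and only adverbs, verbs, and nouns are covered.

```python
from typing import Any, Dict

# Illustrative subset of letter codes; not morphkit's actual lookup tables.
TENSE = {"present": "P", "imperfect": "I", "future": "F",
         "aorist": "A", "perfect": "R", "pluperfect": "L"}
VOICE = {"active": "A", "middle": "M", "passive": "P"}
MOOD = {"indicative": "I", "subjunctive": "S", "optative": "O",
        "imperative": "M", "infinitive": "N", "participle": "P"}


def initials(parse: Dict[str, Any], keys) -> str:
    """Join the uppercase first letters of the given feature values."""
    return "".join(str(parse[key])[0].upper() for key in keys)


def sketch_morph_tag(parse: Dict[str, Any]) -> str:
    """Hypothetical, simplified stand-in for analyse_morph_tag()."""
    try:
        pos = parse.get("pos")
        if pos == "adverb":                    # indeclinable: return at once
            return "ADV"
        if pos == "verb":
            tag = "V-" + TENSE[parse["tense"]] + VOICE[parse["voice"]] + MOOD[parse["mood"]]
            if parse["mood"] == "infinitive":  # infinitive: no suffix
                return tag
            if parse["mood"] == "participle":  # participle: case, number, gender
                return tag + "-" + initials(parse, ("case", "number", "gender"))
            # finite verb: person digit (assumed '1st', '2nd', '3rd') plus number
            return tag + "-" + str(parse["person"])[0] + initials(parse, ("number",))
        if pos == "noun":
            return "N-" + initials(parse, ("case", "number", "gender"))
    except (KeyError, IndexError):
        pass                                   # missing or unknown feature value
    return "UNK"
```

With these assumed tables, a parse with pos 'verb', tense 'present', voice 'active', mood 'indicative', person '3rd', and number 'singular' yields 'V-PAI-3S'.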

Example:

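api_endpoint = "10.10.0.10:1315"
raw_text = morphkit.get_word_blocks('sune/rxomai', api_endpoint)
# get_word_blocks() returns plain text; split it into blocks before parsing
blocks = morphkit.split_into_raw_blocks(raw_text)
for block in blocks:
    parse = morphkit.parse_word_block(block)
    analysis = morphkit.analyse_morph_tag(parse)

# The dictionary now has an added 'morph' entry: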
{'analyses': [{'end_bc': 'omai',
               'end_codes': ['w_stem'],
               ...
               'mood': 'indicative',
               'morph': 'V-PEI-1S',
               'number': 'singular',
               ...

General notes:

The documentation for the SP morphology is available via: https://github.com/biblicalhumanities/Nestle1904/blob/master/morph/parsing.txt

Analyse POS

morphkit.analyse_pos(parse: Dict[str, Any], debug: bool = False) → str

Analyse a single Morpheus parse record and determine its part of speech.

Args:

parse (dict):
A parse dictionary with the following structure:
{
    'raw_uc': '...',
    'stam_codes': [...],
    ...
    'morph_flags': [...],
    'tense': 'present',
    ...
}
debug (bool):

Optional argument. Defaults to False. If set to True, the function prints some debug information.

Returns:

str:

The determined Part of Speech label (e.g. ‘noun’, ‘verb’, ‘adverb’, …), or ‘unknown’ if no rule applies.

Steps:

The analysis consists of the following major steps:

Verbs (presence of ‘tense’ or ‘mood’ keys).

Note: one could argue for two dedicated POS classes, for participle and infinitive, cf. Wallace GGBB p.613 & p.588. This was NOT done in order to stay in line with the current N1904-TF classification used by feature sp. The differentiation between participle, infinitive and ‘other’ verb types is done in module ‘init_compare_tags’.

Specific morph codes and flags → mapped POS (e.g. ‘conj’ → conjunction).

Indeclinable forms (‘indeclform’ flag):

Neuter-singular nom/acc → adverb.

Numeral indecl → numeral.

Proper noun indecl if gender/number present → proper noun.

Otherwise → other indeclinable noun.

Proclitic or enclitic forms → particle.

Anything with case or gender → noun.

If other_end_token == adverbial → adverb.

Fallback → unknown.
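The point of the ordered steps is that the first matching rule wins. The sketch below illustrates that cascade for a subset of the rules only. It is an assumption-laden illustration: the function name sketch_pos and the flag names 'proclitic' and 'enclitic' are hypothetical, and only the 'morph_flags', 'tense', 'mood', 'case', and 'gender' keys are taken from the description above.

```python
from typing import Any, Dict


def sketch_pos(parse: Dict[str, Any]) -> str:
    """Hypothetical, simplified rule cascade; the first matching rule wins."""
    flags = set(parse.get("morph_flags", []))

    # Rule 1: any parse carrying tense or mood is a verb. This rule comes
    # first, so participles (which also have case and gender) stay verbs.
    if "tense" in parse or "mood" in parse:
        return "verb"

    # Rule 2 (flag names assumed): clitic forms are particles.
    if flags & {"proclitic", "enclitic"}:
        return "particle"

    # Rule 3: anything left that declines is treated as a noun.
    if "case" in parse or "gender" in parse:
        return "noun"

    # Fallback: no rule applied.
    return "unknown"
```

Reordering the rules changes the outcome: if the case/gender rule ran before the verb rule, every participle would be labelled a noun.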

Example:

parse = {'raw_uc':'λέγω','tense':'present','mood':'indicative', ...}
morphkit.analyse_pos(parse)
'verb'

Analyse word with Morpheus

morphkit.analyse_word_with_morpheus(word_beta: str, api_endpoint: str, language: str = 'greek', add_pos: bool = True, add_morph: bool = True, debug: bool = False) → Dict[str, Any]

Query the Morpheus morphological analyser for a Greek word in Betacode and parse its analyses.

Args:

word_beta (str):

The input word in beta-code format to look up. A backslash in the input (the beta-code grave accent) must be escaped in a Python string literal: e.g., ‘a)nh\r’ -> ‘a)nh\\r’.

api_endpoint (str):

IP address & port of the Morpheus API endpoint (e.g., 192.168.0.5:1315).

language (str):

Optional argument. Defaults to ‘greek’. The other option is ‘latin’; if set to ‘latin’, no POS or morph field is added.

add_pos (bool):

Optional argument. Defaults to True. If set to False no POS field will be added to the parse.

add_morph (bool):

Optional argument. Defaults to True. If set to False no morph field will be added to the parse.

debug (bool):

Optional argument. Defaults to False. If set to True, the function prints some debug information.

Returns:

Dict[str, Any]:

A dictionary with the following structure:

{
    'word': str,            # Normalized Betacode key returned by Morpheus
    'raw_uni': str,         # Unicode Greek of raw format (not returned when 'language=latin')
    'blocks': int,          # Number of blocks parsed
    'analyses': List[dict], # Parsed analyses from each block
}

Steps:

Fetch raw Morpheus output using function get_word_blocks().

Split the response into analysis blocks at each ‘:raw’ marker using function split_into_raw_blocks().

For each block, call parse_word_block() to create a parse dictionary.

Add the Part of Speech tag to the parse dictionary by calling analyse_pos().

Add the SP morph tag to the parse dictionary by calling analyse_morph_tag().

Return a structured result.

Raises:

ValueError: If the language parameter is invalid (only ‘greek’ and ‘latin’ are allowed).
ValueError: If the api_endpoint parameter is malformed (format should be ‘host(IP or name):port’).
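
The two argument checks can be pictured roughly as follows. This is only a sketch of the kind of validation described: the helper name check_arguments and the regular expression are assumptions, and morphkit's actual checks may be stricter or looser.

```python
import re

ALLOWED_LANGUAGES = ("greek", "latin")

# Assumed shape of 'host(IP or name):port': a host made of letters, digits,
# dots, and hyphens, a colon, then a port of one to five digits.
ENDPOINT_RE = re.compile(r"[A-Za-z0-9.-]+:\d{1,5}")


def check_arguments(api_endpoint: str, language: str = "greek") -> None:
    """Hypothetical helper: raise ValueError for an invalid language or endpoint."""
    if language not in ALLOWED_LANGUAGES:
        raise ValueError(f"Invalid language {language!r}: use 'greek' or 'latin'.")
    if not ENDPOINT_RE.fullmatch(api_endpoint):
        raise ValueError(f"Malformed api_endpoint {api_endpoint!r}: expected 'host:port'.")
```

Under this assumed pattern, "192.168.0.5:1315" and "localhost:1315" pass, while "192.168.0.5" (no port) and "http://host:1315" (URL scheme included) raise ValueError.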

Example:

api_endpoint="192.168.0.5:1315"
result=morphkit.analyse_word_with_morpheus('au(/th',api_endpoint)

Flow diagram:

+------------------------------+
| analyse_word_with_morpheus() |
+--------------+---------------+
               |
               v
+-----------------------+   HTTP request  +--------------------+
|  1. get word blocks   +<--------------->+  Morpheus endpoint |
+--------------+--------+  HTTP response  +--------------------+
               |
               v
+--------------+----------------+
| 2. Split into blocks          |
+-------------------------------+
               |
               v
+--------------+----------------+
| 3. for each block:            |
|     +----------------------+  |
|     | analyse_pos          |  |
|     | analyse_morph_tag    |  |
|     +----------------------+  |
+--------------+----------------+
               |
               v
+--------------+----------------+
| 4. Return combined analyses   |
+-------------------------------+

Annotate and sort analyses

morphkit.annotate_and_sort_analyses(full_analysis: Dict[str, Any], reference_morph: str, reference_lemma: str, base_key: str = 'lem_base_bc', full_key: str = 'lem_full_bc', morph_key: str = 'morph', sim_key: str = 'morph_similarity', lower_case: bool = True, debug: bool = False) → Dict[str, Any]

Annotate and sort analyses in a morphkit-compatible structure, grouping by base lemma and appending homonym suffixes extracted from lem_full_bc minus lem_base_bc.

Args:

full_analysis (Dict[str, Any]):

A dict with an ‘analyses’ list of blocks (dicts).

reference_morph (str):

The reference morph tag to compare against each block.

reference_lemma (str):

The Betacode lemma (base form, without suffix) to prioritize.

base_key (str):

Optional argument. Defaults to ‘lem_base_bc’. Key under which the base lemma is stored in each block.

full_key (str):

Optional argument. Defaults to ‘lem_full_bc’. Key under which the full lemma is stored in each block.

morph_key (str):

Optional argument. Defaults to ‘morph’. Key under which the raw morph string is stored.

sim_key (str):

Optional argument. Defaults to ‘morph_similarity’. Key under which to store the similarity string.

lower_case (bool):

Optional argument. Defaults to True. If set to True, convert lemmas to lowercase before comparison.

debug (bool):

Optional argument. Defaults to False. If set to True, the function prints some debug information.

Returns:

Dict[str, Any]:

A new full_analysis dictionary with annotated and sorted analyses, and with lem_base_bc modified to include the homonym suffix when applicable.

Steps:

Deep-copy the input to avoid mutating the original data.

For each analysis block:

Compute the homonym suffix as the portion of lem_full_bc after lem_base_bc.

If non-empty, append “_(SUFFIX)” to lem_base_bc.

Compute similarity percentages for each tag against reference_morph.

Store sim_key as a slash-separated string of percentages.

Store ‘_max_’ + sim_key as the integer max similarity for this block.

Group blocks by their finalized lem_base_bc (with suffix).

Identify which group key should be first:

If reference_lemma matches any finalized base lemma exactly, that group is first.

Else if normalize(reference_lemma) matches normalize(base lemma), that group is first.

Compute for each group:

group_max: the highest block-level max similarity within that group.

Sort groups so that:

The chosen reference group (if any) comes first.

Remaining groups follow in descending order of group_max.

Within each group, sort its blocks by descending block-level max similarity.

Flatten groups back into a single list.

Remove temporary helper keys and return the new full_analysis dict.
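
The grouping and ordering steps can be sketched as below. This is a simplified illustration under stated assumptions, not the library's code: the function name sketch_sort is hypothetical, each block is assumed to already carry its block-level maximum under '_max_morph_similarity' (so the similarity computation is skipped), the suffix is written literally as "_(SUFFIX)", and only the exact-match rule for reference_lemma is implemented.

```python
import copy
from collections import defaultdict
from typing import Any, Dict


def sketch_sort(full_analysis: Dict[str, Any], reference_lemma: str,
                base_key: str = "lem_base_bc", full_key: str = "lem_full_bc",
                max_key: str = "_max_morph_similarity") -> Dict[str, Any]:
    """Hypothetical sketch: suffix lemmas, group, and order the analysis blocks."""
    result = copy.deepcopy(full_analysis)       # never mutate the caller's data
    groups = defaultdict(list)

    for block in result["analyses"]:
        base, full = block[base_key], block[full_key]
        # Homonym suffix: whatever follows the base lemma in the full lemma.
        suffix = full[len(base):] if full.startswith(base) else ""
        if suffix:
            block[base_key] = f"{base}_({suffix})"
        groups[block[base_key]].append(block)

    def group_rank(key: str):
        group_max = max(b[max_key] for b in groups[key])
        # False sorts before True, so the reference group comes first;
        # the remaining groups follow in descending order of group_max.
        return (key != reference_lemma, -group_max)

    ordered = []
    for key in sorted(groups, key=group_rank):
        ordered.extend(sorted(groups[key], key=lambda b: -b[max_key]))

    result["analyses"] = ordered                # flattened back into one list
    return result
```

Sorting on a (not-reference, negative maximum) tuple keeps both ordering rules in a single key, and Python's stable sort preserves the input order of blocks with equal scores.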

Compare tags

morphkit.compare_tags(tag1, tag2, debug=False)

Compare two morphological parsing tags by decoding them into features and computing a weighted similarity score.

This function is generated by init_compare_tags() and performs the following actions:

Uses decode_tag() to turn each tag (e.g. “V-PAI-3S”) into a dict of grammatical features.

For each feature (Part of Speech, Tense, Case, etc.), looks up the similarity via prebuilt similarity functions.

Multiplies each similarity by its weight, sums and normalizes to the range [0.0,1.0].

Returns both the overall score and a breakdown per feature.

Args:

tag1 (str):

The “gold standard” tag you expect (e.g. from a reference corpus).

tag2 (str):

The tag you want to evaluate against the “gold standard”.

debug (bool):

Optional argument. Defaults to False. If True, print each feature’s known vs. generated value, the raw similarity score, and the feature’s weight.

Returns:

dict:

A dictionary with the following structure:

"tag" (str),                   # echo of `generated_tag`.
"overall_similarity" (float)   # weighted, normalized [0.0–1.0].
"details" (dict)               # for each feature name, a sub-dict with:
    "tag1" (str)               # the decoded known feature.
    "tag2" (str)               # the decoded generated feature.
    "similarity" (float)       # the raw sim score (0.0–1.0).

Example:

result = morphkit.compare_tags("N-NSM", "N-DSM")
print(result["overall_similarity"])
0.875
print(result["details"]["Case"])
{"tag1": "Nominative", "tag2": "Dative", "similarity": 0.2}

Flow diagram:

       +----------------------------+
       | decode_tag(tag1)           |
       | decode_tag(tag2)           |
       +-------------+--------------+
                     |
                     v
+------------------------------------------+
|   Adjust POS if Mood = Participle/Inf    |
+--------------------+---------------------+
                     |
                     v
+------------------------------------------+
| for each feature in weights:             |
|   - get tag1/2 values                    |
|   - sim = sim_funcs[feature](tag1, tag2) |
|   - accumulate score × weight            |
|   - store details                        |
+--------------------+---------------------+
                     |
                     v
+------------------------------------------+
|     Normalize: total_score / weight      |
+--------------------+---------------------+
                     |
                     v
+------------------------------------------+
| Return: dict with tag1, tag2, similarity |
|          and per-feature details         |
+------------------------------------------+
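
The scoring loop in the diagram can be sketched as follows. This is an illustration only: the function name sketch_compare, the weights, and the all-or-nothing similarity function are invented for the example, and the inputs are already-decoded feature dicts, so the decoding step and the participle/infinitive adjustment are left out. It therefore does not reproduce the scores shown in the example above.

```python
from typing import Any, Dict, Optional

# Invented weights for illustration; morphkit's real weights are not shown here.
WEIGHTS = {"Part of Speech": 4.0, "Case": 2.0, "Number": 1.0, "Gender": 1.0}


def exact_match(value1: Optional[str], value2: Optional[str]) -> float:
    """Simplest possible similarity: 1.0 when equal, otherwise 0.0."""
    return 1.0 if value1 == value2 else 0.0


def sketch_compare(features1: Dict[str, str], features2: Dict[str, str]) -> Dict[str, Any]:
    """Hypothetical weighted comparison of two decoded feature dicts."""
    total_score = 0.0
    details = {}
    for feature, weight in WEIGHTS.items():
        value1, value2 = features1.get(feature), features2.get(feature)
        similarity = exact_match(value1, value2)
        total_score += similarity * weight      # accumulate score x weight
        details[feature] = {"tag1": value1, "tag2": value2, "similarity": similarity}
    # Normalize by the total weight so the result lies in [0.0, 1.0].
    overall = total_score / sum(WEIGHTS.values())
    return {"overall_similarity": overall, "details": details}
```

With these invented weights, two nouns that differ only in case score (4 + 0 + 1 + 1) / 8 = 0.75. A graded similarity function, such as one that treats some cases as closer than others, would raise that figure.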

Decode morph tag

morphkit.decode_tag(tag_input: str, debug: bool = False) → Dict[str, Any]

Decode a morphological tag into a set of human-readable features.

This function takes a morphological tag (e.g. “V-PAI-3S”) and returns a dictionary of interpreted grammatical properties, such as Part of Speech, case, number, gender, tense, voice, mood, person, and any suffix details.

Args:

tag_input (str):

The raw morphological tag string. Usually includes prefixes like “N-”, “V-”, “A-”, etc., followed by coded letters/numbers.

debug (bool):

Optional argument. Defaults to False. If set to True, the function prints some debug information.

Returns:

Dict[str, Any]:

A mapping from feature names to their full descriptions.

Possible keys include:

“Part of Speech”

“Case”, “Number”, “Gender”

“Tense”, “Voice”, “Mood”

“Verb Extra” or “Suffix”

“Warning”, warning related to feature elements

“Error” (e.g., if input is empty)

If the part of speech cannot be determined, it returns {“Part of Speech”: “Unknown or Unsupported”}.

If tag_input is empty or whitespace, it returns {“Error”: “Please enter a parsing tag.”}.

Example:

morphkit.decode_tag("N-NSM")
{
    "Part of Speech": "Noun",
    "Case": "Nominative",
    "Number": "Singular",
    "Gender": "Masculine",
    ...
}
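
For the simplest case, a three-letter noun tag, the decoding amounts to a table lookup per position. The sketch below is a hypothetical, noun-only illustration (the function name sketch_decode_noun and the lookup tables are assumptions); the real function handles many more tag types, suffixes, and warnings.

```python
from typing import Dict

CASES = {"N": "Nominative", "G": "Genitive", "D": "Dative",
         "A": "Accusative", "V": "Vocative"}
NUMBERS = {"S": "Singular", "P": "Plural"}
GENDERS = {"M": "Masculine", "F": "Feminine", "N": "Neuter"}


def sketch_decode_noun(tag_input: str) -> Dict[str, str]:
    """Hypothetical decoder for tags of the form 'N-<case><number><gender>'."""
    tag = tag_input.strip()
    if not tag:
        return {"Error": "Please enter a parsing tag."}

    prefix, _, code = tag.partition("-")        # 'N-NSM' -> ('N', '-', 'NSM')
    if (prefix != "N" or len(code) != 3 or code[0] not in CASES
            or code[1] not in NUMBERS or code[2] not in GENDERS):
        return {"Part of Speech": "Unknown or Unsupported"}

    return {
        "Part of Speech": "Noun",
        "Case": CASES[code[0]],
        "Number": NUMBERS[code[1]],
        "Gender": GENDERS[code[2]],
    }
```

The position of a letter decides its meaning: the 'N' in 'N-NSM' is read as Nominative because it is in the first slot, whereas in 'N-ASN' the final 'N' is read as Neuter.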

Note:

This function is an adapted version of the tool available at https://github.com/tonyjurg/Sandborg-Petersen-decoder.

Split into raw blocks

morphkit.split_into_raw_blocks(text: str, debug: bool = False) → List[List[str]]

Split the input text into blocks at each ‘:raw’ header using multiline regex.

Args:

text (str):

The input text to be split.

debug (bool):

Optional argument. Defaults to False. If set to True, the function prints some debug information.

Returns:

List[List[str]]:

A list of raw blocks, where each block is a list of lines.
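
One way to split at each ‘:raw’ header with a multiline regex is shown below. This is a sketch and not necessarily how morphkit does it: the function name split_at_raw_headers is hypothetical, and dropping any text that precedes the first header is a design choice made for this example.

```python
import re
from typing import List

# Zero-width match at the start of every line that begins with ':raw'.
RAW_HEADER = re.compile(r"^(?=:raw)", re.MULTILINE)


def split_at_raw_headers(text: str) -> List[List[str]]:
    """Hypothetical sketch: one block (a list of lines) per ':raw' header."""
    chunks = RAW_HEADER.split(text)
    # Keep only chunks that start with a header; this discards the empty
    # leading chunk and any preamble before the first ':raw' line.
    return [chunk.splitlines() for chunk in chunks if chunk.startswith(":raw")]
```

Because the pattern is a lookahead, the ‘:raw’ line is not consumed by the split and stays as the first line of its block. Splitting on an empty match requires Python 3.7 or later.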

Example:

api_endpoint = "10.10.0.10:1315"
raw_text = morphkit.get_word_blocks("tou", api_endpoint)
blocks = morphkit.split_into_raw_blocks(raw_text)
for block in blocks:
    # Process each individual block (a list of lines)
    print(block)

Get word blocks

morphkit.get_word_blocks(word_beta: str, api_endpoint: str, language: str = 'greek', output: str = 'full', debug: bool = False) → str

Retrieve the raw word blocks data for a given beta-code word from a Morpheus endpoint.

Args:

word_beta (str):

The input word in beta-code format to look up. A backslash in the input string (the beta-code grave accent) must be escaped in a Python string literal: e.g., ‘a)nh\r’ -> ‘a)nh\\r’.

api_endpoint (str):

IP address & port of the Morpheus API endpoint (e.g., ‘192.168.0.5:1315’).

language (str):

Optional argument. Defaults to ‘greek’. Sets the language of the word to analyse. It can be set to ‘greek’ or ‘latin’.

output (str):

Optional argument. Defaults to ‘full’. Output format of the analysis block: either ‘full’ for the internal database format, or ‘compact’ for a brief output.

debug (bool):

Optional argument. Defaults to False. If set to True, prints the constructed URL and response size.

Returns:

str:

The plain text response containing the word blocks for the requested beta-code form.

Raises:

ValueError:

The language parameter is invalid (only ‘greek’ and ‘latin’ are allowed).

ValueError:

The api_endpoint parameter is malformed (format should be ‘host(IP or name):port’).

requests.HTTPError:

HTTP request failed (non-2xx status code).

Example:

api_endpoint = "10.10.0.10:1315"
raw_text = morphkit.get_word_blocks('sune/rxomai', api_endpoint)