# Test 'default' Morpheus option for PoS info

## Table of contents (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Feeder script</a>
* <a href="#bullet3">3 - Output data</a>
* <a href="#bullet4">4 - Processing the output data</a>
* <a href="#bullet5">5 - Notebook version details</a>

# 1 - Introduction<a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

This notebook was created to verify whether the 'default' option of the Morpheus cruncher provides sufficient detail on the Part of Speech to be useful.

Note: this notebook still used a local script running inside the Morpheus Docker instance. This configuration has since been replaced by an API and a specially written Python package, [morphkit](https://tonyjurg.github.io/morphkit/), which simplifies this type of analysis significantly.

# 2 - Feeder script<a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

The following bash script was run in the Docker environment to feed the Morpheus cruncher (note the 'default': no switches are provided):

# 3 - Output data<a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

A small slice of the output data:

# 4 - Processing the output data<a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

After copying the output file back to my local machine, the following Python code was executed:

from collections import Counter

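def countLettersAfterNL(filePath):
    """Count the character that directly follows each <NL> marker in the file."""
    with open(filePath, "r", encoding="utf-8") as f:
        content = f.read()

    counter = Counter()
    idx = 0
    while idx < len(content):
        # Locate the next <NL> marker, starting from the current position
        idx = content.find("<NL>", idx)
        if idx == -1:
            break  # no more markers
        after_nl_idx = idx + len("<NL>")
        if after_nl_idx < len(content):
            letter = content[after_nl_idx]
            if letter.strip():  # skip any whitespace
                counter[letter] += 1
        # Continue searching just past this marker
        idx = after_nl_idx

    return counter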

if __name__ == "__main__":
    inputFile = "gnt_morphology_results2.txt"  # input text file
    letterCounts = countLettersAfterNL(inputFile)

    print("\nFrequency of first letters after <NL>:\n")
    for letter, count in letterCounts.most_common():
        print(f"{letter}: {count}")


Frequency of first letters after <NL>:

V: 14163
N: 11263
P: 4971
E: 9


Only four distinct letters follow `<NL>` (V, N, P and E). Hence the conclusion is clear: the 'default' option is not giving me the proper Part of Speech.
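
As a rough cross-check, the counting step can also be sketched with a regular expression, and the reported counts can be converted to percentage shares. This is only a minimal sketch: the names `count_letters_after_nl` and `shares` and the `sample` string are assumptions introduced here for illustration, and the sample is invented rather than taken from real Morpheus output. Only the four counts in `reported` come from the output above.

```python
import re
from collections import Counter

# The lookahead captures the character after <NL> without consuming it,
# so back-to-back markers are each counted, as in the loop-based version.
NL_PATTERN = re.compile(r"<NL>(?=(\S))")


def count_letters_after_nl(text):
    """Count the non-whitespace character directly following each <NL>."""
    return Counter(NL_PATTERN.findall(text))


def shares(counts):
    """Return each letter's share of the total count, in percent."""
    total = sum(counts.values())
    return {letter: 100 * n / total for letter, n in counts.items()}


# Hypothetical sample, invented for illustration (not real Morpheus output).
sample = "<NL>V aaa</NL><NL>N bbb</NL><NL> ccc</NL><NL>V ddd</NL>"
sample_counts = count_letters_after_nl(sample)
print(sample_counts)

# Counts as reported in the notebook output above.
reported = Counter({"V": 14163, "N": 11263, "P": 4971, "E": 9})
for letter, pct in shares(reported).items():
    print(f"{letter}: {pct:.2f}%")
```

The regular expression is meant to mirror the original behaviour: a marker followed by whitespace is skipped, and the text is taken as a string here so the function can be tried without the result file.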

# 5 - Notebook version details<a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.1</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>22 April 2025</td>
    </tr>
  </table>
</div>