Cornish texts in digital form

For text to feed into these programs, I recommend the Howlsedhes Services website, which offers the historical Cornish texts for download in plain-text form.
I have a version of Gwreans An Bys (The Creation of the World, 1611) from which I have stripped most of the extraneous material, such as line numbers and comments.
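
As a rough illustration of that kind of clean-up, the sketch below strips line numbers and bracketed comments from each line. The markup of the downloaded files is not described here, so the patterns (digits at either end of a line, comments in square brackets) and the names `clean_line` and `clean_text` are assumptions to adapt, not a description of the actual files.

```python
import re

# Assumed conventions (hypothetical): a line number may open or close a line,
# and editorial comments are wrapped in square brackets.
LINE_NUMBER = re.compile(r"^\s*\d+\s+|\s+\d+\s*$")
COMMENT = re.compile(r"\[[^\]]*\]")


def clean_line(line: str) -> str:
    """Return the line without line numbers, bracketed comments or extra spaces."""
    line = COMMENT.sub(" ", line)
    line = LINE_NUMBER.sub("", line)
    return " ".join(line.split())


def clean_text(raw: str) -> str:
    """Clean every line and drop the lines that end up empty."""
    cleaned = (clean_line(line) for line in raw.splitlines())
    return "\n".join(line for line in cleaned if line)
```

A pattern like this would also remove a genuine numeral at the start or end of a line, so the result is worth checking by eye.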

Descriptive statistics with NLTK

The Python Natural Language Toolkit (NLTK) offers a number of methods for corpus analysis, including frequency distributions, conditional frequency distributions and lists of collocations found within a text. I have written a script that runs a few of these analyses on the Cornish texts above, along with two samples of revived Cornish: the short story Solempnyta by Benjamin Bruch and some chapters of The Lord of the Rings translated by Jerry Jefferies. It is available in my Bitbucket repository.
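
To make these statistics concrete, here is a minimal sketch of the kind of figures the script reports (word count, distinct words, word-length distribution, most frequent words). It is not the script from the repository: it uses `collections.Counter` from the standard library as a stand-in for NLTK's frequency distributions, and the tokenizer and the names `tokenize` and `describe` are illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into runs of letters (assumed tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def describe(text: str, top_n: int = 50, min_letters: int = 4) -> dict:
    """Compute simple descriptive statistics for one text."""
    words = tokenize(text)
    word_freq = Counter(words)                      # word -> occurrences
    length_freq = Counter(len(w) for w in words)    # word length -> occurrences
    long_freq = Counter({w: c for w, c in word_freq.items() if len(w) >= min_letters})
    return {
        "number of words": len(words),
        "number of different words": len(word_freq),
        "lengths by frequency": length_freq.most_common(),
        "top words": [w for w, _ in word_freq.most_common(top_n)],
        "top long words": [w for w, _ in long_freq.most_common(top_n)],
    }
```

This tokenizer splits at apostrophes, so a form such as ha'n yields the two tokens ha and n. Whether the original script tokenizes in the same way is a guess.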

Some example output:

Below is a selection of the script's output. The collocations and high-frequency words often correspond to the characters of the drama and the theme of the text.

Bewnans Meryasek

Text: Improved version Bewnans Meryasek KK version from Stokes...

Collocations: pur wir; dhy hwi; Comes venetensis; Yesu Arloedh; heb falladow; Tertius tortor; Secundus tortor; Primus tortor; pub eur; Episcopus Kernow; Yesu Krist; wosa hemma; Rag kerensa; heb ahwer; kuv kolonn; deun alemma; pur dhiogel; pub termyn; heb namm; heb wow

number of words = 26815

number of different words = 4664

Lengths of words in descending order of frequency: [(3, 5094), (2, 4813), (4, 3857), (5, 3270), (1, 3078), (6, 2636), (7, 1697), (8, 1180), (9, 612), (10, 385), (11, 115), (12, 57), (13, 18), (14, 2), (18, 1)]

Top 50 words: ['a', 'y', 'n', 'dhe', 'ha', 'yn', 'an', 'ow', 'my', 'yw', 'c', 'ny', 'na', 're', 's', 'dha', 'omma', 'pur', 'ni', 'm', 'rag', 'meryasek', 'ma', 'sur', 'krist', 'yesu', 'bys', 'th', 'hwi', 'mar', 'heb', 'arloedh', 'oll', 'ev', 'vynn', 'gans', 'yma', 'dyw', 'vydh', 'lemmyn', 'vy', 'maria', 'den', 'ty', 'wir', 'dell', 'eus', 'meriadocus', 'dhymm', 'sertan']

Top 50 words of 4 or more letters: ['omma', 'meryasek', 'krist', 'yesu', 'arloedh', 'vynn', 'gans', 'vydh', 'lemmyn', 'maria', 'dell', 'meriadocus', 'dhymm', 'sertan', 'meur', 'dhymmo', 'dhyn', 'dhis', 'finit', 'episcopus', 'agas', 'comes', 'primus', 'secundus', 'nyns', 'yredi', 'orth', 'henna', 'prest', 'syrr', 'agan', 'devri', 'tortor', 'dhywgh', 'nevra', 'gweres', 'alemma', 'hanow', 'bydh', 'bynytha', 'deun', 'dhodho', 'epskop', 'hemma', 'lies', 'descendit', 'dhiso', 'lowena', 'mones', 'aredy']

Arloedh an Bysowyer - Chaptra 1

Text: Osta karer Arloedh An Bysowyer Wel ottomma dha...

Collocations: Yth esa; yth esa; dhe vos; dhe ves; haval orth; Unn Bysow; medh Gandalf; Bag End; Nyns eus; dro dhe; leveris Gandalf; wovynnas Frodo; neb kas; pup prys; medh Frodo; fatell wrug; dann gel; dell dybav; Parkow Gladen; res dhis

number of words = 11147

number of different words = 1966

Lengths of words in descending order of frequency: [(2, 2309), (3, 2004), (1, 1526), (4, 1442), (5, 1342), (6, 976), (7, 772), (8, 395), (9, 185), (10, 129), (11, 36), (12, 10), (13, 9), (17, 5), (15, 3), (18, 2), (14, 1), (16, 1)]

Top 50 words: ['a', 'an', 'ev', 'y', 'yn', 'ha', 'n', 'dhe', 'hag', 'mes', 'o', 'ow', 'na', 'yw', 'ny', 'frodo', 'bysow', 'esa', 'vy', 'yth', 're', 'my', 'nyns', 'gans', 'wrug', 'dell', 'bos', 'rag', 'i', 'oll', 'gandalf', 'vos', 'bylbo', 'orth', 'po', 'mar', 'termyn', 'henna', 'dre', 'leveris', 'meur', 'dhodho', 'medh', 'aga', 'es', 'pan', 'pur', 'dres', 'ta', 'yma']

Top 50 words of 4 or more letters: ['frodo', 'bysow', 'nyns', 'gans', 'wrug', 'dell', 'gandalf', 'bylbo', 'orth', 'termyn', 'henna', 'leveris', 'meur', 'dhodho', 'medh', 'dres', 'arta', 'kever', 'nerth', 'dhymm', 'diworth', 'golum', 'shayr', 'tewl', 'haval', 'hobytow', 'hwir', 'nebes', 'wosa', 'henn', 'honan', 'lemmyn', 'yndella', 'arall', 'kyns', 'vydh', 'hwath', 'ganso', 'klywes', 'pyth', 'woer', 'drefenn', 'elfow', 'leverel', 'owth', 'ytho', 'dhis', 'nans', 'nevra', 'orto']

The Tregear Homilies

Text: THE TREGEAR HOMILIES KK Version made from Christopher...

Collocations: Folio Homily; keth sam; dhe vos; kepar dell; Spyrys Sans; mab den; agan Savyour; Homily JHESUS; katholik eglos; pub eur; mar veur; heb diwedh; vab den; Yesu Krist; dre reson; fatell wrug; agan honan; Savyour Yesu; Katholik Eglos; res dhyn

number of words = 40897

number of different words = 5246

Lengths of words in descending order of frequency: [(2, 8508), (3, 7334), (1, 5121), (4, 5001), (5, 4461), (6, 3516), (7, 2555), (8, 2112), (9, 1009), (10, 637), (11, 317), (12, 155), (13, 99), (14, 43), (15, 17), (16, 5), (17, 3), (19, 2), (18, 1), (20, 1)]

Top 50 words: ['a', 'ha', 'an', 'n', 'dhe', 'y', 'yn', 'yw', 'ow', 'ni', 'ev', 'ma', 'na', 'rag', 'krist', 'agan', 'wrug', 's', 'oll', 'dre', 'yma', 'eglos', 'dyw', 'gans', 'hag', 'bonner', 'fatell', 'henna', 'et', 'kepar', 'den', 'leverel', 'vos', 'aga', 'yth', 'mar', 'keth', 're', 'honan', 'dell', 'bos', 'i', 'in', 'vydh', 'folio', 'ny', 'o', 'de', 'homily', 'nyns']

Top 50 words of 4 or more letters: ['krist', 'agan', 'wrug', 'eglos', 'gans', 'bonner', 'fatell', 'henna', 'kepar', 'leverel', 'keth', 'honan', 'dell', 'vydh', 'folio', 'homily', 'nyns', 'dhyn', 'dhyw', 'ynwedh', 'korf', 'savyour', 'rakhenna', 'hemma', 'henn', 'dhiworth', 'katholik', 'onan', 'geryow', 'pyth', 'hwath', 'arloedh', 'peder', 'chaptra', 'gwrys', 'omma', 'yndella', 'skryptor', 'lemmyn', 'bobel', 'sans', 'arall', 'dhodho', 'goes', 'leveris', 'lies', 'spyrys', 'agas', 'powl', 'termyn']
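
The collocation lists above pair words that occur together more often than chance would suggest. The sketch below shows one simple way to rank such pairs, using pointwise mutual information (PMI) over adjacent words. This is a swapped-in technique for illustration only: NLTK's own collocation finder applies its own scoring and filtering, so the function name `collocations`, the `min_count` threshold and the PMI score are all assumptions and will not reproduce the lists above exactly.

```python
import math
from collections import Counter


def collocations(words: list[str], min_count: int = 2, top_n: int = 20) -> list[tuple[str, str]]:
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    total = len(words)
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # pairs seen only rarely give unreliable scores
        # PMI = log2( P(first, second) / (P(first) * P(second)) ),
        # with every probability approximated as count / total.
        scores[(first, second)] = math.log2(count * total / (unigrams[first] * unigrams[second]))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]
```

Raising `min_count` trades coverage for reliability: PMI tends to favour rare pairs, which is one reason practical tools combine a score with a frequency filter.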


Cumulative frequency depending on the length of words, for the various texts.
The relative abundance of several words across the various texts.
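
The cumulative frequency by word length can be recomputed from the word-length counts printed in the output above. The sketch below does this for Bewnans Meryasek and expresses the running total as a percentage of all words. Whether the original plot uses percentages or raw counts is an assumption, as is the function name `cumulative_percent`.

```python
from itertools import accumulate

# Word-length counts for Bewnans Meryasek, copied from the output above.
length_counts = [(3, 5094), (2, 4813), (4, 3857), (5, 3270), (1, 3078),
                 (6, 2636), (7, 1697), (8, 1180), (9, 612), (10, 385),
                 (11, 115), (12, 57), (13, 18), (14, 2), (18, 1)]


def cumulative_percent(counts):
    """Return (length, cumulative % of words up to that length), shortest first."""
    ordered = sorted(counts)  # sort the pairs by word length
    total = sum(count for _, count in ordered)
    running = accumulate(count for _, count in ordered)
    return [(length, 100 * seen / total) for (length, _), seen in zip(ordered, running)]


for length, percent in cumulative_percent(length_counts):
    print(f"{length:2d} letters or fewer: {percent:5.1f}%")
```

The counts sum to 26815, which matches the number of words reported for that text.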


The program now provides the facility to customize the output via the GUI.
A general report on one of the texts. The boxes set how many of the most frequent words to list and, for the list of longer words, the minimum number of letters a word must have.
Reporting the 3 most frequent words of at least 5 letters for all of the texts.
With the Menowghder Ger (tresenn bar) option, the lower input box and the Keworra ger dhe'n rol button can be used to build a list of words, which is then used to create a grouped bar chart showing the frequency of each word in the list.
The percentage frequencies of the Cornish words for the numbers 1 to 8. The frequency of the number 8 (eth) is likely to be artificially high, since a word with the same spelling is a part of the verb bos (to be).
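
The numbers behind a grouped bar chart of this kind amount to a small conditional frequency table: for each text, the share of tokens taken up by each word in the list. The sketch below computes that table with the standard library. The function name `percentage_frequencies` and the two tiny token lists are invented for illustration and are not taken from the corpus.

```python
from collections import Counter


def percentage_frequencies(texts: dict[str, list[str]], words: list[str]) -> dict[str, dict[str, float]]:
    """For each text, give each listed word's share of all tokens, in percent."""
    table = {}
    for title, tokens in texts.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for an empty text
        table[title] = {word: 100 * counts[word] / total for word in words}
    return table


# Toy token lists (invented for illustration only).
samples = {
    "sample A": ["tri", "den", "ha", "eth", "den"],
    "sample B": ["eth", "yw", "eth"],
}
table = percentage_frequencies(samples, ["tri", "eth"])
```

Because the count is based on spelling alone, it cannot separate eth the number from the identically spelled verb form, which is the cause of the inflated figure noted above.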