My blog, imported from Blogger and converted using Jekyll.

Translation memory for Cornish now with a GUI

Aug 19, 2016

I have developed the translation memory software a little further as part of my TaklowKernewek tools.

It now has a GUI:

Using only bigrams and trigrams from the corpus that contain at least one non-stopword (based on the NLTK stopwords corpus).

Showing all bigrams and trigrams outputs a long list of sentences containing ('is', 'the').
Sentences in the corpus that contain multiple trigrams in common with the input are ranked highest, and similarly with bigrams.
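The ranking described here can be sketched in a few lines of plain Python. This is only an illustration and not the code from TaklowKernewek: the function names tokens, ngrams and rank are made up for the example, and the three-sentence corpus is a small sample taken from the output listed further down this page.

```python
import re

def tokens(text):
    # Lowercase, and keep punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def ngrams(seq, n):
    # All runs of n adjacent tokens, as a set of tuples.
    return set(zip(*(seq[i:] for i in range(n))))

def rank(query, corpus, n=3):
    """Return (shared_count, cornish, english) tuples, highest count first."""
    wanted = ngrams(tokens(query), n)
    scored = []
    for cornish, english in corpus:
        shared = wanted & ngrams(tokens(english), n)
        if shared:
            scored.append((len(shared), cornish, english))
    # Sentences sharing more n-grams with the input come first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

corpus = [
    ("Usi! Hag yma an gath ena ynwedh.", "Yes! And the cat is there also."),
    ("Yma an gath a'y growedh war an leur yn-dann an gador y'n esedhva.",
     "The cat is lying on the floor under the chair in the sitting room."),
    ("Ple'ma an gath?", "Where is the cat?"),
]

query = "The cat is sleeping on the floor next to the fire."
for count, cornish, english in rank(query, corpus):
    print(count, cornish, "--", english)
```

Calling rank(query, corpus, n=2) would apply the same idea to bigrams.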
After improvement to the text wrapping of the output sentences to split longer lines:

Translation memory software for Cornish

Aug 18, 2016

One of the discussions I was having with Mark Trevethan by email recently was about the translation service of the Cornish Language Office, and the idea of 'translation memory': storing examples of previous translation work so that they can be consulted when new text is to be translated. This has two main advantages: it saves labour, and it improves consistency.

I had an idea to make a rudimentary version of this myself, using the Python Natural Language Toolkit. To make this work, I needed a bilingual corpus with the same sentences in both Cornish and English.

The electronic version of the Cornish language textbook Skeul an Yeth 1 by Wella Brown has been made available online for free by Kesva an Taves Kernewek (The Cornish Language Board).

This contains a list of example sentences at the end of every chapter, which provides the bilingual corpus for this work.

The program asks for an input sentence in English (currently only via the command line), finds the 'bigrams' and 'trigrams' in it (runs of two and three adjacent words), and does the same for the sentences from Skeul an Yeth 1.

The program uses the NLTK 'stopwords' corpus to check whether each bigram or trigram consists only of common words that carry little lexical content. Sentences sharing trigrams that contain at least 1 non-stopword are listed first, followed by those sharing bigrams with at least 1 non-stopword, followed by those sharing trigrams and bigrams that consist solely of stopwords.

For a larger corpus, the number of sentences found for common bigrams such as ('in', 'the') could become very large.
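As a rough illustration of the n-gram and stopword steps, here is a minimal sketch in plain Python. It is not the program itself: the names tokenize, ngrams and has_content_word are invented for the example, and the small STOPWORDS set is only a stand-in for the much longer NLTK English stopword list.

```python
import re

# Assumption: a tiny stand-in for the NLTK English stopword list.
STOPWORDS = {"the", "is", "on", "to", "in", "a", "an", "and", "at", "of"}

def tokenize(sentence):
    # Lowercase the sentence; punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

def ngrams(tokens, n):
    # Every run of n adjacent tokens, in order of appearance.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def has_content_word(gram):
    # True if at least one token in the n-gram is not a stopword.
    return any(word not in STOPWORDS for word in gram)

toks = tokenize("The cat is sleeping on the floor next to the fire.")
bigrams = ngrams(toks, 2)
trigrams = ngrams(toks, 3)

# Split the bigrams into the two groups the program lists separately.
content_bigrams = [g for g in bigrams if has_content_word(g)]
stopword_only_bigrams = [g for g in bigrams if not has_content_word(g)]
print(content_bigrams)
print(stopword_only_bigrams)
```

With this stand-in list, ('on', 'the') and ('to', 'the') fall into the stopword-only group, which is why they appear last in the listing below.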

Enter an English sentence
The cat is sleeping on the floor next to the fire.

trigrams for input sentence are:
[('the', 'cat', 'is'), ('cat', 'is', 'sleeping'), ('is', 'sleeping', 'on'), ('sleeping', 'on', 'the'), ('on', 'the', 'floor'), ('the', 'floor', 'next'), ('floor', 'next', 'to'), ('next', 'to', 'the'), ('to', 'the', 'fire'), ('the', 'fire', '.')]

bigrams for input sentence are:
[('the', 'cat'), ('cat', 'is'), ('is', 'sleeping'), ('sleeping', 'on'), ('on', 'the'), ('the', 'floor'), ('floor', 'next'), ('next', 'to'), ('to', 'the'), ('the', 'fire'), ('fire', '.')]

Listing N grams with a minimum of 1 non-stopword each:
Common trigrams:

Yma an gath a'y growedh war an leur yn-dann an gador y'n esedhva. -- The cat is lying on the floor under the chair in the sitting room. (the cat is), (on the floor)
Ottena! An maw moen na ryb an daras. -- There look! That thin boy next to the door. (next to the)
War an leur yn-dann dha weli yn dha jambour, dell vydh usys! -- On the floor under your bed in your bedroom, as usual! (on the floor)
Usi! Hag yma an gath ena ynwedh. -- Yes! And the cat is there also. (the cat is)
Nag esons! Yma an ki war an leur mes yma an gath y'n wydhenn. -- No! The dog is on the ground but the cat is in the tree. (the cat is)
Gorr glow war an tan. Oer yw hi. -- Put coal on the fire. It's cold. (the fire.)
Esedh orth an tan! Ty a vydh toemma ena. -- Sit at the fire. You will be warmer there. (the fire.)
Dewgh orth an tan! Oer yw hi! -- Come to the fire it's very cold! (to the fire)

Common bigrams:

An gath a gosk war an gweliow. -- The cat sleeps on the beds. (the cat), (on the)
Yma Jerri ow koska lemmyn. -- Jerry is sleeping now. (is sleeping)
Ple'ma an gath? -- Where is the cat? (the cat)
Usi an gath y'n lowarth? -- Is the cat in the garden? (the cat)
orth an tan -- at the fire (the fire)
A esedhons i orth an tan pub gorthugher? -- Do they sit at the fire every evening? (the fire)

Other N grams containing only stopwords:
Common trigrams:

Common bigrams:

Ni a dhybris li. Ena ni a gerdhas. Kerdh hir o dhe'n kerrek war an hal -- We ate lunch. Then we walked. It was a long walk to the rocks on the moor. (to the), (on the)
Eus jynn-skrifa war an desk? -- Is there a typewriter on the desk? (on the)
Ottena - yma an genter war an eurlenn. -- Look there - there's the nail on the carpet. (on the)
Yma pras war an woen hag yma chi ryb an pras na. -- There's a field on the down and there's a house by that field. (on the)
Sur, yma lyver war an voes. -- Certainly there is a book on the table. (on the)
Eus traow gesys war an lestrier? -- Are there things left on the dresser? (on the)
Yma bleujyow byw gesys war an fordh omma. -- There are live flowers left on the road here. (on the)
Yma padell blos war voes an gegin. -- There's a dirty pan on the kitchen table. (on the)
Ottena teyr delenn rudh war an leur. -- Look there are three red leaves on the ground. (on the)
Eus hwetek plat byghan war an lestrier? -- Are there sixteen small plates on the dresser? (on the)
Yw. Yma hi war an voes y'n gegin. -- Yes. It's on the kitchen table. (on the)
Eus amanenn war an bara? Eus! -- Is there butter on the bread? Yes! (on the)
Deves yw tanow war an voen. -- Sheep are scarce on the down. (on the)
war an amari -- on the cupboard (on the)
A nyns usi an boes war an voes hwath? -- Isn't the food on the table yet? (on the)
Esons i war an voes? -- Are they on the table? (on the)
War an voes (yma) martesen. -- On the table (it is) perhaps. (on the)
Nebes fordhow y'n ynys yw ledan lowr mes meur a fordhow ena yw re gul. -- Few roads on the island are wide enough but many roads there are too narrow. (on the)
War an voes ymons. -- They are on the table. (on the)
Skrifewgh hanow an lyver war gynsa linen an folenn! -- Write the name of the book on the first line of the page! (on the)
Esesta war an treth? -- Were you on the beach? (on the)
Esewgh hwi war an treth? -- Were you on the beach? (on the)
Y'n koes yth esa del gell war an leur. -- In the wood there were brown leaves on the ground. (on the)
An vamm re worras an kinyow war an voes lemmyn. Kynsa yma kowl onyon. -- Mother has put the dinner on the table now. First there is onion soup. (on the)
War an voes y hworrons i an boes. -- On the table they put the food. (on the)
Ena y hworrav ow hota war an gador. -- Then I put my coat on the chair. (on the)
Yma krys ow kregi war benn an gweli. -- There is a shirt hanging on the end of the bed. (on the)
Gorr an kellylli war an voes! -- Put the knives on the table! (on the)
Y'n seythves dydh an dra o dien. -- On the seventh day the matter was complete. (on the)
Ny yllydh jy esedha war an glesin. Re lyb yw ev. -- You can't sit on the lawn. It's too wet. (on the)
Ev a redyas y hanow y'n peswara koloven war an pympes folenn a'n paper-nowodhow. -- He read his name in the fourth column on the fifth page of the newspaper. (on the)
Pan splann an loergann war an arvor a-dreus an mor kosel, assyw hi teg! -- When the full moon shines on the shore across the calm sea, how beautiful it is! (on the)
Tasik! Tasik! Ottena! Ergh war an glesin! -- Daddy! Daddy! Look! Snow on the lawn! (on the)
War drysa estyllenn an argh-lyvrow y'n esedhva yma, dell dybav. -- On the third shelf of the bookcase in the lounge it is, I think. (on the)
Nyns eus karr vyth y'n fordh. -- There is no car at all on the road. (on the)
An gewer yw hager war an heyl. Ny yll den gweles a-dreus dhodho. -- The weather is ugly on the estuary. A person cannot see across it. (on the)
An rewler a worras an lytherow war an desk rybdho. -- The manager put the letters on the desk beside him. (on the)
Ottena! A-dro dhe hanterkans hos war an lynn yn kres an hal. -- Look there! About fifty ducks on the lake in the middle of the moor. (on the)
An peswara drehevyans diworto yw ev a'n keth tu. -- It's the fourth building from it on the same side. (on the)
Goel Sen Pyran a vydh pub blydhen dhe'n pympes a vis Meurth. -- St Piran's Day is on the fifth of March each year. (on the)
Henri a vynn esedha war an isella kador. -- Henry will sit on the lowest chair. (on the)
Ny yll ev esedha war an ughella huni. -- He cannot sit on the highest one. (on the)
Prag y tregh ev an skorrennow na? Drefenn ev dh'aga leski war an tansys. -- Why does he cut those branches? Because he burns them on the bonfire. (on the)
'Yma diwros war an fordh ena ha gour shyndys a'y wrowedh war an leur', an gwithyas kres a leveris. 'Res yw dhis gortos deg mynysenn, mar pleg. Ni a vynn y worra dhe'n klavji a-dhistowgh.' -- 'There's a bicycle on the road there and a man lying injured.' replied the policeman. 'You must wait ten minutes, please. We will take him to hospital immediately.' (on the)
Kerdh hir yw dhe'n eglos. -- It's a long walk to the church. (to the)
Py chambour yw an nessa dhe'n lowarth a-rag? -- Which bedroom is nearest to the front garden? (to the)
Ke dhe'n fenester, mar pleg! -- Go to the window, please! (to the)
Nyns yw an traow ma pur haval orth an re erell, yns i? -- These are not very similar to the others, are they? (to the)
Martyn eth dhe'n treth mes nyns eth dhe neuvya. -- Martin went to the beach but he didn't go to swim. (to the)
Y'n eur na yth eth ev dhe skol an eglos. -- He then went to the church school. (to the)
My a wra lenna hwedhel dhe'n fleghes pub gorthugher. -- I read to the children every evening. (to the)
An dowr a yn nans dhe'n mor. -- The water goes down to the sea. (to the)
Ni oll warbarth eth yn-nans dhe'n treth rag neuvya. -- We all went down to the beach together in order to swim. (to the)
An keur a ganas dhe'n fleghes. -- The choir sang to the children. (to the)
A vynnowgh hwi mones genen dhe'n dons? -- Will you go with us to the dance? (to the)
Dowr an fenten a dhe'n gover. -- The spring water goes to the brook. (to the)
Karol a lanhas an lestri kyns aga daskorr dhe'n lestrier. -- Carol cleaned the dishes before returning them to the dresser. (to the)
An tiek a dhros y vughes dhe'n skiber. -- The farmer brought his cows to the barn. (to the)
An brassa stevell yw an nessa stevell dhe'n wolghva. -- The biggest room is the nearest room to the bathroom. (to the)
An skoloryon, mebyon ha mowesi, a dhe'n keth skol y'n dre. -- The schoolchildren, boys and girls, go to the same school in town. (to the)
An awel o krev. Ny allas an gorholyon dos ogas dhe'n porth. -- The wind was strong. The ships couldn't come near to the harbour. (to the)

Text to speech in Cornish

Aug 15, 2016

The program espeak offers text to speech in a variety of languages. Cornish is not yet among them, but I have made a bit of a hack that allows Cornish text to be spoken by it.

There is a Welsh language voice for it, and I have created a script that processes Cornish text with a series of replacements to make it conform to Welsh spelling rules.

It would be possible to get espeak to speak Cornish directly by creating a Cornish voice for it, and I did start doing this a long time ago, but unfortunately lost this work along with my previous laptop.

The GUI launcher currently only works on Linux-compatible systems, because it launches espeak on the command line via the Python os library. However, espeak itself is also available for Windows, and I will adapt the script to work on Windows dreckly.
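One possible route to a launcher that does not depend on a Linux shell is the subprocess module from the standard library. The sketch below is an assumption about how that might look, not the existing script: build_espeak_command and speak are hypothetical names, and it relies only on espeak's -v option with the Welsh voice "cy".

```python
import shutil
import subprocess

def build_espeak_command(text, voice="cy"):
    # -v selects the espeak voice; "cy" is the Welsh voice.
    return ["espeak", "-v", voice, text]

def speak(text, voice="cy"):
    """Speak the text if espeak is on the PATH; return True on success."""
    if shutil.which("espeak") is None:
        # espeak is not installed, or is not on the PATH.
        return False
    # Passing a list avoids any shell quoting of the text.
    result = subprocess.run(build_espeak_command(text, voice), check=False)
    return result.returncode == 0
```

The text passed in would be the Cornish text after it has been respelled to follow Welsh spelling rules.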

The first quote as an mp3 file. The second is generated by pressing the "Gorhemmyn" button, and an appropriate greeting is chosen according to the system clock.
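Choosing a greeting from the clock could look something like the sketch below. The function name choose_greeting, the hour boundaries and the particular greetings are all assumptions made for illustration; the program may well use different ones.

```python
from datetime import datetime

def choose_greeting(hour):
    # The hour boundaries here are assumptions, not taken from the program.
    if 5 <= hour < 12:
        return "Myttin da"      # good morning
    if 12 <= hour < 18:
        return "Dohajydh da"    # good afternoon
    if 18 <= hour < 22:
        return "Gorthugher da"  # good evening
    return "Nos dha"            # good night

# Pick the greeting for the current hour on the system clock.
print(choose_greeting(datetime.now().hour))
```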

Transliteration from Kernewek Kemmyn to Standard Written Form

Aug 14, 2016

The script and its GUI frontend convert text from Kernewek Kemmyn to Standard Written Form (Main Form).

See also the brief writeup on my website, and earlier on this blog.

A couple of example sentences I use to illustrate some of its features are:

  • Yth esa gwydhenn y'n goeswik
  • Yth esa gwydhennow y'n goeswik

'There was a tree in the forest' is the translation of the first sentence; gwydhennow is the plural of the singulative gwydhenn, which derives from the collective noun gwydh (trees). Gwydh would be used for a general mass of trees, gwydhenn for a single tree, and gwydhennow for a countable collection of individual trees.

In the left hand panel, gwydhenn becomes gwedhen, showing two changes. Firstly, the doubled consonant -nn becomes single -n. The program will make this change for unstressed syllables, excluding those that are prefixes with secondary stress, like penn- in pennseythun and some others.
The other change is the y becoming an e as part of vocalic alternation. This occurs for y vowels that are 'half-long' in Kernewek Kemmyn, which is detected via the syllable segmentation program.
The function converty(inputsyl) applies this change as long as the word isn't in a list of exceptions and the syllable ends in a consonant. If the syllable ends in a vowel (e.g. ay, ey, oy diphthongs, and -ya endings where the y, which is really a semi-vowel, has been erroneously assigned to the previous syllable) the change is not made.
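The rule can be written down as a minimal sketch. This is not the actual converty function: convert_y_sketch is a hypothetical name, the exception list is left empty as a placeholder, and the sketch assumes the caller has already established that the y is the half-long vowel of the syllable.

```python
VOWELS = set("aeiouy")

# Placeholder: the real list of exception words is not reproduced here.
EXCEPTIONS = set()

def convert_y_sketch(word, syllable):
    """Change y to e in a syllable that ends in a consonant."""
    if word in EXCEPTIONS:
        return syllable
    if not syllable or syllable[-1] in VOWELS:
        # Syllable ends in a vowel (e.g. 'gwy'): leave it alone.
        return syllable
    # Assumption: the first y in the syllable is its vowel.
    return syllable.replace("y", "e", 1)

print(convert_y_sketch("gwydhenn", "gwydh"))  # forward segmentation
print(convert_y_sketch("gwydhenn", "gwy"))    # backward segmentation
```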

If backwards segmentation is chosen, this change won't happen since gwydhenn will be segmented into ['gwy', 'dhenn'] and the y will not be changed since it is now in a syllable ending in a vowel.

The word goeswik (mutation of koeswik) becomes goswik, as the Kernewek Kemmyn oe becomes o where it is a short or half-long vowel, and oo in a syllable with a Kernewek Kemmyn long vowel.
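As a small sketch of the oe rule, assuming the vowel length of the syllable is already known (the name convert_oe_sketch and the length labels "short", "half-long" and "long" are invented for the example):

```python
def convert_oe_sketch(syllable, vowel_length):
    """Respell Kernewek Kemmyn oe as o or oo, depending on vowel length."""
    if "oe" not in syllable:
        return syllable
    # Long vowels give oo; short and half-long vowels give o.
    replacement = "oo" if vowel_length == "long" else "o"
    return syllable.replace("oe", replacement, 1)

print(convert_oe_sketch("goes", "half-long"))
print(convert_oe_sketch("koes", "long"))
```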

In the right hand panel, the word gwydhennow is unchanged, because the y vowel in the first syllable is now short rather than half-long, and the -nn is in a stressed syllable so retained as a double consonant.

Syllable segmentation in Cornish - forward vs. backward segmentation

Aug 14, 2016

I have commented on the syllable segmentation module of TaklowKernewek earlier in this blog, and on my website.

However there is much more to discuss, and one aspect of this is that the program offers a choice between forwards and backwards segmentation.

This means either starting from the beginning of the word, and working forwards assigning the letters to particular syllables, or starting from the end and working backwards.

I present some of the code from the program, which is admittedly difficult to read, and if you like, skip down to the examples at the bottom. It may also be easier to read at my Bitbucket site.

The core of this program is a set of regular expressions, as follows:

# syllabelRegExp should match syllable anywhere in a word
# a syllable could have structure CV, CVC, VC, V
# will now match traditional graphs c-, qw- in syllable initial position
syllabelRegExp = r'''(?x)
((bl|br|Bl|Br|kl|Kl|kr|Kr|kn|Kn|kwr?|Kwr?|qwr?|Qwr?|ch|Ch|Dhr?\'?|dhr?\'?|dl|dr|Dr|fl|Fl|fr|Fr|vl|Vl|vr|Vr|vv|ll|gwr?|gwl?|gl|gr|gg?h|gn|Gwr?|Gwl?|Gl|Gr|Gn|hwr?|Hwr?|ph|Ph|pr|pl|Pr|Pl|shr?|Shr?|str?|Str?|skr?|Skr?|skw?|Skw?|sbr|Sbr|spr|Spr|sp?l?|Sp?l?|sm|Sm|tth|Tth|thr?|Thr?|tr|Tr|tl|Tl|wr|Wr|wl|Wl|[bckdfjvlghmnprstwyzBCKDFJVLGHMNPRSTVWZY]) # consonant
\'?(ay|a\'?w|eu|ey|ew|iw|oe|oy|ow|ou|uw|yw|[aeoiuy])\'? #vowel
(lgh|ls|lt|bl|br|bb|kl|kr|kn|kwr?|kk|n?ch|dhr?|dl|n?dr|dd|fl|fr|ff|vl|vv|gg?ht?|gw|gl|gn|ld|lf|lk|ll|mm|mp|nk|nd|nj|ns|nth?|nn|ph|pr|pl|pp|rgh?|rdh?|rth?|rk|rl|rv|rm|rn|rr|rj|rf|rs|sh|st|sk|ss|sp?l?|tt?h|tt|[bdfgljmnpkrstvw])? # optional const.
)| # or
(\'?(ay|a\'?w|eu|ew|ey|iw|oe|oy|ow|ou|uw|yw|Ay|Aw|Ey|Eu|Ew|Iw|Oe|Oy|Ow|Ou|Uw|Yw|[aeoiuyAEIOUY])\'? # vowel
(lgh|ls|lt|bl|bb|kl|kr|kn|kwr?|kk|cch|n?ch|dhr?|dl|n?dr|dd|fl|fr|ff|vl|vv|gg?ht?|gw|gl|gn|ld|lf|lk|ll|mm|mp|nk|nd|nj|ns|nth?|nn|ph|pr|pl|pp|rgh?|rdh?|rth?|rk|rl|rv|rm|rn|rr|rj|rf|rs|sh|st|sk|ss|sp?l?|tt?h|tt|[bdfgljmnpkrstvw]\'?)?) # consonant (optional)
'''
# diwetRegExp matches a syllable at the end of the word
diwetRegExp = r'''(?x)
((bl|br|Bl|Br|kl|Kl|kr|Kr|kn|Kn|kwr?|Kwr?|qwr?|Qwr?|ch|Ch|Dhr?\'?|dhr?\'?|dl|dr|Dl|Dr|fl|Fl|fr|Fr|vl|Vl|vr|Vr|vv|ll|gwr?|gwl?|gl|gr|gg?h|gn|Gwr?|Gwl?|Gl|Gr|Gn|hwr?|Hwr?|ph|Ph|pr|pl|Pr|Pl|shr?|Shr?|str?|Str?|skr?|Skr?|skw?|Skw?|sbr|Sbr|spr|Spr|sp?l?|Sp?l?|sm|Sm|tth|Tth|thr?|Thr?|tr|Tr|tl|Tl|wr|Wr|wl|Wl|[bckdfjlghpmnrstvwyzBCKDFJLGHPMNRSTVWYZ]\'?)? #consonant or c. cluster
\'?(ay|a\'?w|eu|ew|ey|iw|oe|oy|ow|ou|uw|yw|Ay|Aw|Ey|Eu|Ew|Iw|Oe|Oy|Ow|Ou|Uw|Yw|\'?[aeoiuyAEIOUY]\'?) # vowel
(lgh|ls|lt|bl|br|bb|kl|kr|kn|kwr?|kk|cch|n?ch|dhr?|dl|n?dr|dd|fl|fr|ff|vl|vv|gg?ht?|gw|gl|gn|ld|lf|lk|ll|mm|mp|nk|nd|nj|ns|nth?|nn|ph|pr|pl|pp|rgh?|rdh?|rth?|rk|rl|rv|rm|rn|rr|rj|rf|rs|sh|st|sk|ss|sp?l?|tt?h|tt|[bdfgjklmnprstvw]\'?)? # optionally a second consonant or cluster ie CVC?
)$'''
# kynsaRegExp matches syllable at beginning of a word
# 1st syllable could be CV, CVC, VC, V
kynsaRegExp = r'''(?x)
^((\'?(bl|br|Bl|Br|kl|Kl|kr|Kr|kn|Kn|kwr?|Kwr?|qwr?|Qwr?|ch|Ch|Dhr?|dhr?|dl|dr|Dr|fl|Fl|fr|Fr|vl|Vl|vr|Vr|gwr?|gwl?|gl|gr|gn|Gwr?|Gwl?|Gl|Gr|Gn|hwr?|Hwr?|ph|Ph|pr|pl|Pr|Pl|shr?|Shr?|str?|Str?|skr?|Skr?|skw?|Skw?|sbr|Sbr|spr|Spr|sp?l?|Sp?l?|sm|Sm|tth|Tth|thr?|Thr?|tr|Tr|tl|Tl|wr|Wr|wl|Wl|[bckdfghjlmnprtvwyzBCKDFGHJLMNPRTVWYZ])\'?)? # optional C.
\'?(ay|a\'?w|eu|ew|ey|iw|oe|oy|ow|ou|uw|yw|Ay|Aw|Ey|Eu|Ew|Iw|Oe|Oy|Ow|Ou|Uw|Yw|[aeoiuyAEIOUY])\'? # Vowel
(lgh|ls|lk|ld|lf|lt|bb?|kk?|cch|n?ch|n?dr|dh|dd?|ff?|vv?|ght|gg?h?|ll?|mp|mm?|nk|nd|nj|ns|nth?|nn?|pp?|rgh?|rdh?|rth?|rk|rl|rv|rm|rn|rj|rf|rs|rr?|sh|st|sk|sp|ss?|tt?h|tt?|[jw]\'?)? # optional C.
)'''

In the actual segmentation of the word itself, the expressions kynsaRegExp and diwetRegExp are used, depending on whether we are going forwards starting from the beginning or backwards from the end:

if fwds:
    # go forwards
    sls = rannans.ranna_syl(self.graph,regexps.kynsaRegExp,fwd=True,bwd=False)
else:
    # go backwards from end
    sls = rannans.ranna_syl(self.graph,regexps.diwetRegExp,fwd=False,bwd=True)

where ranna_syl() is the actual function that returns a list of syllables from the word ger:

def ranna_syl(self,ger,regexp,fwd=True,bwd=False):
    """ divide a word into a list of its syllables
    and return this as a list of plain text strings """
    syl_list = []
    if fwd:
        # go forwards through the word
        while ger:
            # print(ger)
            k = self.match_syl(ger,regexp)
            # print("kynsa syl:{k}".format(k=k))
            # add the syllable to the list
            if k != '':
                syl_list.append(k)
            if k != '' and len(ger.split(k,1))>1:
                # if there is more of the word after the
                # 1st syllable
                # remove the 1st syllable
                ger = ger.split(k,1)[1]
            else:
                ger = ''
    if bwd:
        # go backwards from the end through the word
        while ger:
            # print(ger)
            d = self.match_syl(ger,regexp)
            # print(d)
            # add the syllable to the front of the list
            if d != '':
                syl_list.insert(0,d)
            if d != '' and len(ger.rsplit(d,1))>1:
                # if there is more of the word before the
                # last syllable
                # remove the last syllable
                ger = ger.rsplit(d,1)[0]
            else:
                ger = ''
    # this is returning
    # a list of plain text
    # not Syllabenn objects
    return syl_list

The syllabelRegExp regular expression is used in the Syllabenn class itself, as part of the code that initiates a Syllabenn object and works out the syllable parts, i.e. consonant clusters and vowels, and the overall length.

Example sentences

The effect of going forwards or backwards can be illustrated in the processing of an example sentence:

Going backwards from the end tends to maximise consonants at the beginning of syllables. For example, the word 'gewer' is processed into ['ge', 'wer'], i.e. the w is assigned to the second syllable, whereas in this word the 'ew' is actually pronounced as a diphthong. The geminated consonant 'mm' in lemmyn is split between two different syllables.
Working forwards, the word 'gewer' now splits into ['gew', 'er'], which accords with the status of 'ew' as a diphthong. 'Lemmyn' now splits into ['lemm', 'yn'], assigning the whole of the geminated consonant to the first syllable. The word 'Fatell' now has the 't' assigned to the first syllable.
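The difference between the two directions can be reproduced with a much-reduced sketch. The three patterns below are a drastically simplified stand-in for the full regular expressions shown earlier, chosen only to cover the example words, and the names ONSET, VOWEL, CODA and segment are made up for the illustration.

```python
import re

# Assumption: heavily cut-down onset, vowel and coda patterns.
ONSET = r"(?:gw|dh|th|ch|[bcdfghjklmnprstvwz])"
VOWEL = r"(?:ew|ow|oe|[aeiouy])"
CODA = r"(?:mm|nn|ll|[bdfgklmnprstvw])"

FIRST_SYL = re.compile(rf"^{ONSET}?{VOWEL}{CODA}?")  # anchored at the start
LAST_SYL = re.compile(rf"{ONSET}?{VOWEL}{CODA}?$")   # anchored at the end

def segment(word, forwards=True):
    """Split a word into syllables, working from the start or from the end."""
    syllables = []
    rest = word.lower()
    while rest:
        m = FIRST_SYL.match(rest) if forwards else LAST_SYL.search(rest)
        if m is None:
            # Nothing matched: keep the leftover letters as they are.
            if forwards:
                syllables.append(rest)
            else:
                syllables.insert(0, rest)
            break
        if forwards:
            syllables.append(m.group())
            rest = rest[m.end():]
        else:
            syllables.insert(0, m.group())
            rest = rest[:m.start()]
    return syllables

for word in ("gewer", "lemmyn", "fatell"):
    print(word, segment(word, forwards=True), segment(word, forwards=False))
```

Working forwards, each match greedily takes as much coda as it can, while working backwards the search finds the longest possible final syllable, which pulls consonants into the onset instead.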

A similar effect can be seen in another sentence:
Special cases such as the unstressed monosyllables 'ha', and 'dell' are detailed in the file

With forwards segmentation, the processing of 'kommolek' and 'hevel' assigns consonants to the coda of syllables rather than maximising the onset.
