Cornish corpus word clouds with the assistance of the Institute of the Czech National Corpus
Tags: kernewek, corpus linguistics, cornish language
21 Jun 2019 - MawKernewek
Intro
Kwords is an online tool from the Institute of the Czech National Corpus, available at kwords.korpus.cz. It is optimised for Czech and English, but it can be used for other languages if you upload your own reference text. I have fed it the traditional Cornish texts (originally prepared in digital form by Keith Syed of KDL), which I have stripped of everything except the Cornish text itself. They are at my Bitbucket repository.
My method has been to use Origo Mundi as the reference text and compare everything else to it. Normally you would use a much more extensive reference corpus, but this doesn't really exist for Cornish. There is an option to exclude certain non-content words such as pronouns, prepositions, conjunctions and numbers, but this is only available for Czech and English. The word clouds below show the keywords that the software has detected in each text by comparing the frequency of words within the text to the frequency of the same words within the reference corpus (i.e. in this case, Origo Mundi).
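To make the idea of comparing word frequencies against a reference text concrete, here is a minimal sketch in Python. It is not the Kwords implementation: the post does not say which statistic the tool uses, so the log-likelihood (G2) score below is an assumption, chosen because it is a common keyness measure. The function names `tokenize`, `log_likelihood` and `keywords` are made up for this illustration, and the tokeniser is deliberately naive.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?()[]\"'"

def tokenize(text):
    # Naive tokeniser (an assumption): lower-case, split on whitespace,
    # strip surrounding punctuation, drop empty tokens.
    tokens = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [t for t in tokens if t]

def log_likelihood(a, b, n1, n2):
    # G2 score for a word seen `a` times in a target text of n1 tokens
    # and `b` times in a reference text of n2 tokens.
    e1 = n1 * (a + b) / (n1 + n2)  # expected count in the target
    e2 = n2 * (a + b) / (n1 + n2)  # expected count in the reference
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def keywords(target_text, reference_text, top=10):
    target = Counter(tokenize(target_text))
    reference = Counter(tokenize(reference_text))
    n1 = sum(target.values())
    n2 = sum(reference.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both texts must contain at least one token")
    scored = []
    for word, a in target.items():
        b = reference.get(word, 0)
        # Keep only words that are relatively more frequent in the target.
        if a / n1 > b / n2:
            scored.append((word, log_likelihood(a, b, n1, n2)))
    # Highest-scoring words first: these would be the largest in a cloud.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]
```

In use, `target_text` would be the full text of one play and `reference_text` the full text of Origo Mundi, and the returned scores could set the word sizes in a cloud. With a reference as small as a single play, many words in the target never occur in the reference at all, which is one reason a larger reference corpus is normally preferred.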
Passyon agan Arloedh
Kommolenn Ger-alhwedh - Keyword cloud
Bewnans Meryasek
Kommolenn Ger-alhwedh - Keyword cloud
Gwreans an Bys
Kommolenn Ger-alhwedh - Keyword cloud
Passio Christi
Kommolenn Ger-alhwedh - Keyword cloud
Resurrectio Domini
Kommolenn Ger-alhwedh - Keyword cloud
Pregothow Tregear
Kommolenn Ger-alhwedh - Keyword cloud
Tolkien
Kommolenn Ger-alhwedh - Keyword cloud
Skeul an Yeth 1
Kommolenn Ger-alhwedh - Keyword cloud
Solempnyta
Kommolenn Ger-alhwedh - Keyword cloud
Screenshot of the full output for Bewnans Meryasek
Applications
A number of applications of this kind of analysis are illustrated in the slides of a talk given at the Workshop on Quantitative Text Analysis for the Humanities and Social Sciences, held at Brown University in April 2016.
Comparison of the keywords found in annual addresses by Gustav Husák when using a current corpus as the reference versus using a communist newspaper. Slides: Václav Cvrček & Masako Fidler
We can see that choosing a different reference corpus leads the software to flag different keywords as important. With Cornish we only have a small corpus available, so our future plans are that in the next five-year plan we increase the output of Cornish.
Gwren ni ynkressya agan eskorrans geryow Kernewek! Yn-rag kowetha yn unnveredh kuntellek! Gyllyn! - Let us increase our output of Cornish words! Forward, comrades, in collective unity! We can!