(no subject)
Apr. 17th, 2008 01:22 pm
I'm trying to create a vocabulary list from an e-text by means of histogram analysis, so I can concentrate on the most frequently used words.
But my program is unable to link the various forms of Russian verbs, nouns and adjectives.
Does anyone know a public domain algorithm (or a list of sets of related morphemes) that can do this with reasonable accuracy?
Pim
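For readers following along, the histogram step described in the post could be sketched roughly as below. This is only an illustration, not the poster's actual program: the `WORD_RE` pattern, the `word_histogram` name and the sample sentence are all assumptions made here.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of lowercase Cyrillic letters (ё is listed
# separately because it falls outside the а-я range).
WORD_RE = re.compile(r"[а-яё]+")

def word_histogram(text):
    """Count how often each lowercased Cyrillic word form occurs in text."""
    return Counter(WORD_RE.findall(text.lower()))

# Made-up sample text, standing in for the e-text.
sample = "Я читаю книгу. Ты читаешь книгу, а он читает книги."
hist = word_histogram(sample)

# Most frequent word forms first.
for word, count in hist.most_common(3):
    print(word, count)
```

Note that every inflected form gets its own entry here ("книгу" and "книги" are counted separately), which is exactly the problem the post asks about.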
no subject
Date: 2008-04-17 01:19 pm (UTC)
Leave the leading 3 letters as is; remove vowels and the letters "вгйлмт" from the tail.
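A minimal sketch of that rule of thumb, under one possible reading: "from the tail" is taken to mean "strip matching letters from the end of the word until a different letter is reached", and the vowel set is an assumption. The names `crude_stem` and `group_by_stem` are made up for this example.

```python
from collections import Counter

# Assumption: the ten Russian vowel letters.
VOWELS = set("аеёиоуыэюя")
# Vowels plus the consonants named in the reply.
STRIP = VOWELS | set("вгйлмт")

def crude_stem(word, keep=3):
    """Keep the first `keep` letters; strip trailing STRIP letters from the rest."""
    word = word.lower()
    head, tail = word[:keep], word[keep:]
    while tail and tail[-1] in STRIP:
        tail = tail[:-1]
    return head + tail

def group_by_stem(words):
    """Histogram of crude stems instead of raw word forms."""
    counts = Counter()
    for w in words:
        counts[crude_stem(w)] += 1
    return counts

forms = ["читаю", "читает", "читал", "книга", "книгу", "книги", "читаешь"]
stem_counts = group_by_stem(forms)
print(stem_counts)
```

With this reading, "читаю", "читает" and "читал" collapse to one stem, and so do the three forms of "книга". "читаешь" stays on its own, because neither "ш" nor "ь" is in the strip set. A rule this crude will also merge some unrelated words that share their first letters.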
no subject
Date: 2008-04-17 02:31 pm (UTC)
no subject
Date: 2008-04-17 02:07 pm (UTC)
http://starling.rinet.ru/cgi-bin/morphque.cgi?flags=kndnnnn
You can probably write a script that uses it to fetch the canonical form of any word.
There's more information at http://starling.rinet.ru/program.php?lan=en and http://starling.rinet.ru/downl.php?lan=en#soft
They have useful databases and Windows executables there, but I'm not sure about source code. It is probably available if you ask them.
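Once some canonical-form lookup is available, folding it into the histogram might look roughly like this. The sketch deliberately does not call the Starling service, since its request format is not described in this thread. `lookup` is a hypothetical callable, and `TOY_LEMMAS` is a made-up stand-in table used only so the example runs.

```python
from collections import Counter

def merge_by_lemma(form_counts, lookup):
    """Fold per-form counts into counts per canonical form.

    `lookup` is a hypothetical callable: word form -> canonical form,
    or None when the form is unknown.
    """
    merged = Counter()
    for form, n in form_counts.items():
        # Unknown forms fall back to themselves so no counts are lost.
        merged[lookup(form) or form] += n
    return merged

# Illustration only: a real script would query a morphology database here.
TOY_LEMMAS = {
    "читаю": "читать",
    "читает": "читать",
    "книгу": "книга",
    "книги": "книга",
}

form_counts = Counter({"читаю": 2, "читает": 1, "книгу": 3, "книги": 1, "он": 4})
lemma_counts = merge_by_lemma(form_counts, TOY_LEMMAS.get)
print(lemma_counts.most_common())
```

Keeping the lookup as a plain function argument means the same merging code works whether the canonical forms come from a web query, a downloaded database, or a simple stemming heuristic.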
no subject
Date: 2008-04-17 07:02 pm (UTC)
no subject
Date: 2008-04-17 10:03 pm (UTC)
Good luck!
no subject
Date: 2008-04-17 10:04 pm (UTC)