Level of proficiency
I only wanted to know how you calculate the level (A1 to C1) of a text. That would be very useful so that I can justify which texts to use with my students.
Thanks a lot
The difficulty of each text is calculated via quite a crude approach described here: https://readlang.uservoice.com/knowledgebase/articles/722085-how-is-the-difficulty-of-a-text-calculated
To be honest, the resulting levels should be taken with a pinch of salt.
Anna Vernerová commented
I think relying on the 2000 most common word forms is not suitable for languages with rich inflection, where the 2000 most common lexemes (dictionary entries, lemmas) may easily produce tens or even hundreds of thousands of different word forms. I am seeing the same problem for Finnish that Den K reports for Turkish, and the reason is the same in both cases: these are agglutinative languages where words take on a large number of endings.
I can think of three different approaches to address this problem:
- use comparable word lists for multiple languages (https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists, e.g. choose the OpenSubtitles collection) and, rather than calculating the proportion of words among the top 2000, calculate the average relative rank (e.g. "this text contains mostly words that appear in the top 5% of the list"). Because languages with richer inflection should produce longer word lists, the relative rank could be a good measure.
- take a list of not just the 2000 most common word forms but as many as you can possibly get, translate it into English, and compare the translations with the list of the 2000 most common English word forms (but be aware that for Finnish, for example, the translation will be something like 'of my house', and you only want to know whether 'house' appears in the English list). You will likely find that you can choose a cutoff such that the first X words mostly translate to common English words while words lower in the list mostly don't (even though it will of course not be entirely clear where that cutoff should be set).
- parse each text and then compare the resulting list of lemmas to the list of the X most common lemmas in the language (X < 2000, but I don't know by how much). One parser with a large number of language models (BY-NC-SA) is https://ufal.mff.cuni.cz/udpipe/2 .
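To make the difference between the current "proportion among the top 2000" measure and the "average relative rank" from the first suggestion concrete, here is a rough Python sketch. It is only an illustration under my own assumptions, not how Readlang actually computes levels: the function names are hypothetical, the tokenizer is deliberately crude, and the frequency list is expected to be a plain list of unique lowercase word forms ordered from most to least frequent.

```python
import re


def tokenize(text):
    # Crude tokenizer (assumption): lowercase, then take runs of Unicode letters.
    return re.findall(r"[^\W\d_]+", text.lower())


def top_n_coverage(tokens, freq_list, n=2000):
    # Share of tokens that appear among the n most frequent word forms.
    if not tokens:
        return 0.0
    top = set(freq_list[:n])
    return sum(t in top for t in tokens) / len(tokens)


def average_relative_rank(tokens, freq_list):
    # Mean of (rank / list length) over all tokens, with 0-based ranks.
    # Words missing from the list count as 1.0, i.e. "rarer than everything".
    size = len(freq_list)
    if not tokens or size == 0:
        return 1.0
    rank = {word: i for i, word in enumerate(freq_list)}
    return sum(rank.get(t, size) / size for t in tokens) / len(tokens)


# Made-up ten-word frequency list, purely for illustration.
freq_list = ["the", "house", "is", "big", "of", "my", "garden", "small", "old", "new"]
tokens = tokenize("The house is big")

coverage = top_n_coverage(tokens, freq_list, n=2)
avg_rank = average_relative_rank(tokens, freq_list)
```

Because the relative rank is divided by the length of the list, a language whose list is much longer (because of rich inflection) is not penalised just for having more distinct word forms. A lower value means an easier text. Where to put the thresholds between A1 and C1 would still have to be tuned per language.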
Den K commented
Am I correct that it's not possible to change the level manually? Currently it grades an A1 Turkish short story (from a somewhat credible source) as C1, without an option to change the level manually.