add text-grading functionality
As you mention in your blog, some texts can be too difficult. Sometimes we want to read an easy text fast, with only occasional translation, but it's difficult to determine the level of difficulty in a foreign language. One way is to run the text through a readability app. Then, if people want to share that text, they can have the option to select the readability score. This would save subsequent readers the effort of sifting through texts to find ones at a reasonable level (my current problem in Portuguese). If your app is also teacher-friendly, teachers would have the incentive to add graded content. YT video on this also forthcoming.
I’ve implemented a basic version of this now.
It’s enabled for most of the languages, and you will see it to the left of each book/article in the Library. The languages where it’s not enabled will have a warning status message at the bottom of the screen.
It uses a combination of the Automated Readability Index (http://en.wikipedia.org/wiki/Automated_Readability_Index) and the percentage of words which are in the 2000 most frequently used in the language.
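As a rough sketch of the two measures named above (this is not the site's actual code, and how the two numbers are weighted into a final grade is not stated here), the following computes the standard Automated Readability Index and the share of words found in a common-word list. The function names and the tiny word list in the usage note are assumptions for illustration, and counting letters only is a simplification.

```python
import re

# Unicode-aware "word" pattern: runs of letters only (a simplification).
WORD_RE = re.compile(r"[^\W\d_]+")

def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    words = WORD_RE.findall(text)
    # Naive sentence split on ., ! and ?; real texts need something more careful.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

def common_word_share(text, common_words):
    """Fraction of running words that appear in a set of common words."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return 0.0
    return sum(w in common_words for w in words) / len(words)
```

For example, `common_word_share("The cat sat. The dog ran.", {"the", "cat", "dog"})` counts four of the six words as common. In practice the set would hold the 2000 most frequent words of the language rather than a hand-written toy list.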
I don’t claim it’s perfect, far from it. But it seems to give reasonably sensible results in Spanish and English at least. I can’t judge the other languages. Please let me know how you find it, especially if you think it’s getting the grading wrong.
It’s very likely I’ll tweak this over the next few months so don’t be surprised if the grades start changing :).
-
patrickw commented
I'm a new user to the site and would love to see this grading approach refined.
I've tried parsing some German books and I'm getting pretty much the same result for all the books I enter - B2 - whether it is a book I know is pretty difficult or pretty easy (e.g., the first Harry Potter).
Anyway I am sure you have lots of other issues you'd like to sort out, but I just thought I would mention that at least one customer would be interested in a better implementation of the grading function.
Thanks for the very interesting and useful site.
-
Tom Tabaczynski commented
Sorry, don't really understand that. Research cited in Nation (2001) Learning Vocabulary in Another Language, indicates that knowledge of vocabulary is the main variable for comprehensibility, and that you need 95-98% coverage, or 3-5000 word families depending on the text (eg., novel vs. academic) to reach a threshold whereby you're reading. I'm gonna get tired of repeating these research findings, and it's not really my job to tell you how to implement them as it's not really my business, so I'd suggest you read that text yourself. After all, it's only one book and it will probably give you 98% of what you need to know about vocabulary learning for reading, and why extensive reading is so important.
-
Well synchronised indeed!
The problem with using the site you linked to count the number of unique word families is that it depends a lot on the length of the text too. e.g. a short text with only 100 word families may be far more difficult to read than a long text with 1000 word families. For this reason I'm using a general word frequency list for each language.
I don't group by word family, so plural and singular, or verb conjugations, are counted separately. I may look into grouping in future but it's not easy with all the languages I support. Anyway, first I'd like to see how well a system based on the current non-grouped word lists can work. I have the feeling that with a bit of tweaking it could be pretty good.
(Note that one nice side effect of not grouping is that it recognises that some verb conjugations are more difficult than others, e.g. the more difficult subjunctive tenses of Spanish verbs are lower frequency than the simpler tenses. This particular example doesn't work so well for English, but the principle of correlating word frequency with difficulty is still valid. I'm slightly worried about languages where words can be combined into one, like German, but even there I'm hopeful that the correlation will still hold on average.)
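To illustrate the length point above, here is a small sketch with made-up toy data (the word list and both sample texts are invented for the example, and this is not the site's implementation). It compares a raw count of distinct word forms with the share of words found in a general frequency list, with no grouping by family.

```python
import re

def tokens(text):
    """Lower-cased runs of letters; forms are NOT grouped by family or lemma."""
    return [w.lower() for w in re.findall(r"[^\W\d_]+", text)]

def unique_forms(text):
    """Number of distinct surface forms; tends to grow with text length."""
    return len(set(tokens(text)))

def coverage(text, frequent_forms):
    """Share of running words found in a general frequency list."""
    toks = tokens(text)
    if not toks:
        return 0.0
    return sum(t in frequent_forms for t in toks) / len(toks)

# Toy stand-in for a per-language frequency list (hypothetical data).
frequent = {"the", "cat", "dog", "sat", "ran", "and", "a", "on", "mat", "saw"}

long_easy = "The cat sat on a mat and the dog ran and the cat saw the dog"
short_hard = "Ephemeral quandaries obfuscate"

for name, text in [("long_easy", long_easy), ("short_hard", short_hard)]:
    print(name, unique_forms(text), coverage(text, frequent))
```

The longer text has more distinct forms yet is fully covered by the list, while the short one has few distinct forms and no coverage at all, which is why a share of frequent words is less sensitive to length than a raw count.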
-
Tom Tabaczynski commented
Hey Steve, we're well synchronised!
Well, I just did point you: lextutor.com. I'm using it to create materials for my students and I think that the look of the site might be misleading: these are cutting edge tools as far as I can tell.
From this, the important information seems to be the 1000 and 2000 benchmarks to reach what I'd guesstimate is B1 level, then 3-5k might be B2, and 6k+ would be into the C's.
Now, these numbers are either lemmas or families, not types. Eg., the General Service List which is still in use is 2000 word families.
Nation (2001) says that his research shows that the different lists overlap with the GSL and that they make little difference.
The different ways that the graded readers are graded would not be relevant. The question is whether the lists you are using are word types, lemmas or families, because it seems that it's the word families that you need so that, eg., singular and plural forms are counted as the same word.
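The difference between counting word types and counting grouped forms can be sketched as follows. The lemma map here is a hand-written, hypothetical example; a real system would need a proper lemmatiser or family list per language, and word families are broader than lemmas because they also include derived forms.

```python
import re

# Hypothetical hand-written map from inflected form to base form.
LEMMAS = {"books": "book", "reads": "read", "reading": "read"}

def count_types_and_lemmas(text, lemma_map):
    """Return (number of distinct word types, number of distinct lemmas)."""
    forms = {w.lower() for w in re.findall(r"[^\W\d_]+", text)}
    # Forms missing from the map are treated as their own lemma.
    lemmas = {lemma_map.get(form, form) for form in forms}
    return len(forms), len(lemmas)

print(count_types_and_lemmas("She reads books. Reading a book", LEMMAS))
```

Here six distinct types collapse to four lemmas, so a "2000 word" list means something different depending on which unit it counts.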
-
Hi Tom,
The codes are from the Common European Framework: http://en.wikipedia.org/wiki/Common_European_Framework_of_Reference_for_Languages
They are:
A1 - Beginner
A2 - Elementary
B1 - Intermediate
B2 - High Intermediate
C1 - Advanced
C2 - Mastery
I agree these are cryptic and will think about a clearer representation soon.
Using the number of words as some graded readers do (1000, 2000, etc) isn't a completely foolproof way of comparing texts from different publishers because some may base it on the number of headwords in the texts, some may use general word lists. In short, the methods they use are opaque and so can't be used for reliable comparison, except perhaps between books from the same publisher. If you can point me towards any standardised algorithm to grade texts I'd love to read about it.
In the short term, I just want to indicate to users roughly how difficult the texts are.
-
Tom Tabaczynski commented
Correction: Lextutor Vocab Profile for Brit Nat Corp here: http://www.lextutor.ca/vp/bnc/
The point of grading is extensive reading, so need to understand some theoretical issues here. Esp. this passage in the Wikipedia article on Extensive Reading:
"Laufer suggest that 3,000 word families or 5,000 lexical items are the threshold (Laufer 1997). Coady & Nation (1998) suggest 98% of lexical coverage and 5,000 word families or 8,000 items for a pleasurable reading (Coady & Huckin 1997, p. 233). After this threshold, the learner leave the beginner paradox, and enter a virtuous circle (Coady & Huckin 1997, p. 233). Then, extensive reading become more efficient."
The 'beginner paradox' is this: Learners need to read in order to learn vocabulary. But they need vocabulary in order to read. Grading is a way of overcoming the paradox to get learners to the 5-6000 level after which they can pick up vocabulary incidentally through ungraded extensive reading.
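A coverage threshold like the ones quoted above could be checked along these lines. This is a minimal sketch: the function names are made up for the example, the learner's known vocabulary is assumed to be available as a plain set of word forms, and words are matched as surface forms rather than families.

```python
import re

def lexical_coverage(text, known_words):
    """Return (share of running words that are known, sorted unknown words)."""
    toks = [w.lower() for w in re.findall(r"[^\W\d_]+", text)]
    if not toks:
        return 0.0, []
    unknown = sorted({t for t in toks if t not in known_words})
    known_count = sum(t in known_words for t in toks)
    return known_count / len(toks), unknown

def suitable_for_extensive_reading(text, known_words, threshold=0.95):
    """True if coverage reaches the threshold (0.95 to 0.98 in the figures cited)."""
    share, _ = lexical_coverage(text, known_words)
    return share >= threshold
```

With the default threshold a text passes when at most one running word in twenty is unknown; passing `threshold=0.98` applies the stricter figure.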
-
Tom Tabaczynski commented
I'm looking at it but not really sure how it works: what do the A1, A2, etc mean?
I'm reading Paul Nation at the moment.
It seems clear that the most important vocabulary is the first 1000 high frequency words (about 75% coverage of any text), then the next 1000 (extra 5% coverage). After that it drops very quickly in terms of coverage per 1000.
The way vocabulary learning should be distributed is this:
- 75% incidental exposure through extensive reading and listening in meaning-focused activities
- 25% of intentional learning through flash-cards and vocab activities.
- The 2000 high-frequency words are the most important and need to be learned intentionally (as well as through extensive reading etc.)
- Low frequency words need to be learned through a variety of strategies like guessing from context.
- Extensive reading is the most important activity, but it requires that 95-98% of the words are known. Therefore, the grading needs to indicate the percentage of words that are outside of that. Lexitutor.com has tools that do that sort of analysis for English and French.
- 6000 level is the minimum to understand 98% of most texts.
I think any app oriented towards reading for vocabulary needs to take account of these research findings, especially the different approaches to learning the high frequency words (first 2000) vs. low frequency, and the need to grade texts in a similar manner as with graded readers:
-1000
-2000
-3000
-etc.
-
Tom Tabaczynski commented
There are two models to look at.
One is YouTube. It has channels, which allow for branding. If I'm a professional seeking to establish online presence, I will be motivated to curate quality content, and moreover to keep updating it. I do something similar on Quizlet. People regularly request to join my classes there, and because my school name is visible that gives me some incentive to generate additional content in the form of flashcards.
Another model is educational apps like Wikispaces, or Quizlet itself, which are building in classroom management functions.
In either case, the level of visibility of the content is up to the content provider.
The worst case scenario is LingQ where, even if I add content there, it's behind their paywall, and essentially it's no longer my business, it's theirs, ie., they are in competition with me, so no incentive at all.
So to sum up, the incentive to share quality content for a teacher is to have presence in the role of a teacher, and to establish online presence. But this also assumes control over the content.
On LingQ people constantly request various sorts of stuff, but because of their general philosophy that it's 'the complete system', and you're not supposed to want other things, which is false, someone like me is prevented from trying to cater to that.
So what you're ending up with is stuff that people have uploaded, but ultimately their idea is that you upload your own materials! In other words, it's really a reader for Upper Intermediate to Advanced level students, and not a language learning site at all.
-
Tom Tabaczynski commented
Ok. Just realised that the word I need is this: curation ... curated content. What you want is for people to curate the content, eg., like in Scoop.it.
-
Tom Tabaczynski commented
Hi Steve,
That sounds good. I have to confess that my thoughts originate from reflections on the limitations of LingQ, and to my mind your project might suffer from similar limitations, so here are my thoughts.
Apart from the anti-teacher rhetoric, and other pedagogical opinions of its owner that he freely imposes on anyone and everyone, the intentions of some users to share and grade content have not come to fruition, at least in the Portuguese section that I can see.
Apart from the navigational issues of that site, the problem with crowdsourcing content is that LingQ is charging money for its use, yet it is asking its users to do the work of (a) contribute and share texts, (b) grade and file them, and (c) add audio.
The only texts that I found useful, including by Yuri Vieira, were those of a Brazilian woman who had her own educational site. However, LingQ's functionality, unlike Quizlet, is limited and clunky in terms of branding and content management.
These are some of the main reasons I believe the content sharing is unlikely to happen there, otherwise I'd be there getting the content right now.
Point: it has little to do with attractiveness or good will, and everything to do with incentives, esp. ability to brand. Otherwise it's crowdsourcing/free labour.
-
Hi Tom, yes this is a problem at the moment. I hope eventually it will be solved by:
1. Native speakers, probably teachers, sharing content on the site. As the site improves I hope it will be an attractive place for teachers to place content for their students, and the wider community.
2. To gradually build up a collection of links to sites that beginners enjoy reading.
3. To filter the above content based on difficulty level.
There's obviously a long way to go, but this is my rough plan at the moment.
-
Tom Tabaczynski commented
I don't see how someone at an A2-B1 level of reading can locate texts at approximately the right level in the target language. I'm trying to find easy stuff to read in Portuguese and it's not easy, because I have to Google stuff in Portuguese, but I'm not really at the level to do that effectively.
It seems to make more sense for NSs to share, grade, and perhaps even simplify texts for the learners, but there would have to be incentive to do that.
Otherwise only people who are already at B2 or higher can do that effectively. But it's the lower levels that need lower graded reading materials. Or am I confused?
-
@Claytanic: Thanks, yes I'm keen to use common word lists to do this kind of grading, and in addition to indicate how complex the sentences and words are using something like this: http://en.wikipedia.org/wiki/Automated_Readability_Index.
-
Claytanic commented
Have you thought about automatically passing the texts through a service such as the Oxford 3000 Language profile, at least for English? http://www.oup.com/oald-bin/oxfordProfiler.pl
-
Tom Tabaczynski commented
Yes, if people could indicate whether a site or a blog has content that is at a given level then that would be of help. Eg., a simple scheme of A, B, and C for Lower, Intermediate, Upper.
Generally, could use descriptors such as 'Everyday Life' 'Literature' 'Academic' ... blogs and sites describing daily life will tend to be more appropriate for lower levels.
I'll look for some examples.