normalize Unicode in typed answers

Unicode's a great standard, but it doesn't resolve all problems. Some characters with diacritics have more than one representation, which messes up string comparisons. E.g., the two strings below look the same but are different:
অলৌকিক কাজ
অলৌকিক কাজ

One starts "অ ল ে ৗ" (ending in two code points that together render as ৌ); the other starts "অ ল ৌ" (ending in the single code point ৌ), the latter being a "composed" form of the former.

The answer is apparently to use Unicode normalization forms, as documented (at least for a start) at http://en.wikipedia.org/wiki/Unicode_equivalence. Also, http://rishida.net/tools/conversion/ is great for peering into Unicode streams.

3 votes

Joel shared this idea · April 22, 2014

completed

AdminReadlang (Language learning app, Readlang) responded · May 08, 2014

I’ve implemented a fix, but it will only work if you are using the latest Chrome or Firefox, either:

Chrome – version 34 (desktop or android)
Firefox – version 31 (desktop only)

(The fix uses this recent addition to JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize )
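For anyone curious, here is a minimal sketch of how a comparison based on `String.prototype.normalize` might look. It is not Readlang's actual code: the helper name `answersMatch`, the choice of NFC, and the fallback for older browsers are all assumptions made for illustration.

```javascript
// Hypothetical helper (not Readlang's real implementation): compare a typed
// answer with the expected one after normalizing both to the same form.
function answersMatch(typed, expected) {
  // Assumed fallback: engines without normalize() just compare as-is.
  if (typeof String.prototype.normalize !== "function") {
    return typed === expected;
  }
  // NFC is an assumption here; any single form works if applied to both sides.
  return typed.normalize("NFC") === expected.normalize("NFC");
}

// The start of the Bengali example from the original post.
const bnDecomposed = "\u0985\u09B2\u09C7\u09D7"; // অ, ল, then ে + ৗ
const bnComposed = "\u0985\u09B2\u09CC";         // অ, ল, then ৌ

console.log(bnDecomposed === bnComposed);            // false
console.log(answersMatch(bnDecomposed, bnComposed)); // true
```

Normalizing both sides at comparison time means stored answers don't have to be rewritten.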

Hope this works for you. Please let me know if you find any more problems!

AdminReadlang (Language learning app, Readlang) commented · May 09, 2014 18:06

Yeah, I kept putting it off since it sounded like it would be a pain to do. It was quite a relief when I found out about that normalize function :-)

Joel commented · May 09, 2014 18:04

Brilliant! I usually don't use typing mode when on my mobile (on which I do use Firefox), and on other platforms it sounds like you've found a basically "free" solution. :-) Sounds great!

Joel commented · April 28, 2014 12:42

Re. the enya, from Wikipedia (http://en.wikipedia.org/wiki/%C3%91): it appears that it could be represented either as U+006E U+0303 (n + tilde diacritic) or as U+00F1 (the composed character ñ).
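A quick sketch to check this in a browser console or Node, assuming an engine that supports `String.prototype.normalize`. The variable names are just for illustration; lowercase n is U+006E and the combining tilde is U+0303.

```javascript
// Two ways to write the lowercase enya.
const enyeDecomposed = "n\u0303"; // U+006E + U+0303 (n + combining tilde)
const enyeComposed = "\u00F1";    // U+00F1 (single composed character)

console.log(enyeDecomposed === enyeComposed);                  // false
console.log(enyeDecomposed.length, enyeComposed.length);       // 2 1
console.log(enyeDecomposed.normalize("NFC") === enyeComposed); // true
console.log(enyeComposed.normalize("NFD") === enyeDecomposed); // true
```

NFC gives the composed form and NFD the decomposed one, so either works for comparison as long as both strings go through the same form.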

Joel commented · April 27, 2014 18:22

Unfortunately, I'm not sure which languages it's relevant for, though I'd guess applying it for Sanskrit-derived languages would go a long way toward dealing w/ the issue. I'm guessing, though, that it could theoretically arise any time a visual element (think the enya, or the grave-accented "e") has both a composed and a decomposed form: "n + ~ diacritic" vs. a single "enya". In Sanskrit-based languages, the problem is just exceptionally obvious because diacritics are used quite a bit in the Unicode representation of the writing system. I suspect that Thai and other "abugida" writing systems (cool new term I learned recently; see Wikipedia) *might* have similar issues, but I really don't know w/o knowing a bit more about the writing systems and their Unicode representations.

Joel commented · April 22, 2014 10:20

In case those don't render correctly on your computer, the Unicode code points (from the rishida.net page) for the two strings are:
অলৌকিক কাজ
অলৌকিক কাজ
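If the rishida.net page isn't handy, a listing like that can also be produced locally. This is only a sketch: `codePoints` is a made-up helper name, and the sample input is the start of the example strings written with escapes so that it survives copy and paste.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(text) {
  // Array.from iterates by code point, not by UTF-16 code unit.
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(codePoints("\u0985\u09B2\u09C7\u09D7").join(" ")); // U+0985 U+09B2 U+09C7 U+09D7
console.log(codePoints("\u0985\u09B2\u09CC").join(" "));       // U+0985 U+09B2 U+09CC
```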
