normalize Unicode in typed answers
Unicode's a great standard, but it doesn't resolve every problem. Some characters with diacritics can be encoded in more than one way, which messes up string comparisons. E.g., the two strings below look the same but are different:
অলৌকিক কাজ
অলৌকিক কাজ
One starts "অ ল ে ৗ" (two marks that together display as ৌ); the other starts "অ ল ৌ" (the single character ৌ), the latter being a "composed" form of the former.
The answer is apparently to compare the strings in a Unicode normalization form, as described (at least as a starting point) at http://en.wikipedia.org/wiki/Unicode_equivalence. Also, http://rishida.net/tools/conversion/ is great for peering into Unicode streams.
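To make that concrete, here is a minimal sketch, assuming a JavaScript engine that provides String.prototype.normalize. It builds the first few characters of each string from explicit escapes (the variable names are only illustrative) and compares them before and after converting both to NFC:

```javascript
// First characters of the two strings, written as escapes so the
// difference survives copy/paste. Variable names are illustrative only.
const decomposedStart = "\u0985\u09B2\u09C7\u09D7"; // অ ল + U+09C7 + U+09D7
const composedStart = "\u0985\u09B2\u09CC";         // অ ল + U+09CC

// A plain comparison sees two different sequences.
const rawEqual = decomposedStart === composedStart;

// After normalizing both sides to NFC, they compare equal.
const nfcEqual =
  decomposedStart.normalize("NFC") === composedStart.normalize("NFC");

console.log(rawEqual, nfcEqual); // false true
```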
I’ve implemented a fix, but it will only work if you are using a recent version of Chrome or Firefox:
Chrome – version 34 or later (desktop or Android)
Firefox – version 31 or later (desktop only)
(The fix uses this recent addition to JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize )
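For anyone curious, a comparison along these lines might look roughly like the sketch below. This is not the site's actual code: the function name answersMatch is made up for illustration, and falling back to a plain comparison on browsers without normalize is an assumption about how older browsers could be handled.

```javascript
// Hypothetical helper, not the site's actual code: compares a typed answer
// with the expected one after Unicode normalization.
function answersMatch(typed, expected) {
  // Older browsers lack String.prototype.normalize; fall back to a raw
  // comparison there instead of throwing.
  if (typeof String.prototype.normalize !== "function") {
    return typed === expected;
  }
  return typed.normalize("NFC") === expected.normalize("NFC");
}

// ক followed by U+09C7 U+09D7 vs. ক followed by the single U+09CC.
console.log(answersMatch("\u0995\u09C7\u09D7", "\u0995\u09CC"));
```

NFC is used here only because composed forms are what most keyboards produce; any one normalization form would do as long as both sides get the same one.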
Hope this works for you. Please let me know if you find any more problems!
-
Yeah, I kept putting it off since it sounded like it would be a pain to do. It was quite a relief when I found out about that normalize function :-)
-
Joel commented
Brilliant! I usually don't use typing mode when on my mobile (on which I do use Firefox), and on other platforms it sounds like you've found a basically "free" solution. :-) Sounds great!
-
Joel commented
Re. the enya, from Wikipedia (http://en.wikipedia.org/wiki/%C3%91): it appears that it could be represented either as U+006E U+0303 (n + combining tilde) or as U+00F1 (the composed character ñ).
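A quick way to check this, sketched here on the assumption that String.prototype.normalize is available (variable names are just for illustration), is to convert each form into the other with NFC and NFD:

```javascript
const tildeDecomposed = "n\u0303"; // U+006E followed by U+0303 (combining tilde)
const tildeComposed = "\u00F1";    // U+00F1, the precomposed ñ

// NFC composes and NFD decomposes, so each call maps one form onto the other.
const toComposed = tildeDecomposed.normalize("NFC");
const toDecomposed = tildeComposed.normalize("NFD");

console.log(toComposed === tildeComposed, toDecomposed === tildeDecomposed);
// The two forms also differ in length: 2 code units vs. 1.
console.log(tildeDecomposed.length, tildeComposed.length);
```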
-
Joel commented
Unfortunately, I'm not sure which languages it's relevant for, though I'd guess applying it for Sanskrit-derived languages would go a long way toward dealing with the issue. I'm guessing, though, that it could theoretically arise any time a visual element (think the enya, or the grave-accented "e") has both a composed and a decomposed form, e.g. "n + ~ diacritic" vs. a single "enya". In Sanskrit-based languages, the problem is just exceptionally obvious because diacritics are used quite a bit in the Unicode representation of the writing system. I suspect that Thai and other "abugida" writing systems (cool new term I learned recently; see Wikipedia) *might* have similar issues, but I really don't know without knowing a bit more about the writing systems and their Unicode representations.
-
Joel commented
In case those don't render correctly on your computer, the Unicode code points (from the rishida.net page) for the two strings are:
অলৌকিক কাজ
অলৌকিক কাজ
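If the characters still look identical on screen, the code points can also be listed programmatically. The sketch below is only illustrative: codePoints is a made-up helper name, and the inputs are the first few characters of each string written as escapes.

```javascript
// Hypothetical helper: lists a string's code points in U+XXXX notation.
function codePoints(str) {
  // Array.from iterates by code point, so characters outside the BMP stay whole.
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

console.log(codePoints("\u0985\u09B2\u09C7\u09D7")); // U+0985 U+09B2 U+09C7 U+09D7
console.log(codePoints("\u0985\u09B2\u09CC"));       // U+0985 U+09B2 U+09CC
```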