8
votes

Unicode specifies that \X should match an "extended grapheme cluster" - for instance, a base character followed by zero or more combining characters. (I believe this is a simplification, but it may suffice for my needs.)

I'm pretty sure that at least Perl supports \X in its regular expressions.

But Vim defines \X to match any character that is not a hexadecimal digit.

Does Vim have any equivalent to \X or any way to match a Unicode extended grapheme cluster?

Vim does have a concept of combining or "composing" characters, but its documentation does not cover whether or how they are supported in regular expressions.

It seems that Vim does not yet support this directly, but I am still interested in a workaround where a search will highlight all characters which include a combining character in at least the most basic range of U+0300 to U+0364.
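As a cross-check outside Vim, the kind of search described above can be sketched in JavaScript. This is only a rough illustration, not a Vim solution: the function name findCombiningClusters is made up for this example, the range U+0300 to U+0364 is simply the one mentioned above, and "an optional base character followed by one or more marks" is the simplified notion of a cluster, not the full Unicode definition.

```javascript
// Sketch only: finds runs of combining marks in U+0300..U+0364 together with
// the single code point that precedes them (the assumed "base" character).
// The base is optional so that a stray mark at the start of the text is
// still reported.
const CLUSTER_WITH_MARK = /[^\u0300-\u0364]?[\u0300-\u0364]+/gu;

// Hypothetical helper: returns the position and text of every match.
function findCombiningClusters(text) {
  return [...text.matchAll(CLUSTER_WITH_MARK)].map((m) => ({
    index: m.index, // offset in UTF-16 code units
    text: m[0],     // base character (if any) plus its combining marks
  }));
}

// Example: J + COMBINING CARON (004a 030c) followed by plain letters.
console.log(findCombiningClusters("J\u030Cabc"));
```

Plain characters with no combining mark are skipped, so the result lists only the spots that would need highlighting.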

What exactly do you want to do? Could you provide a sample case? Do you want to match such "characters" as à or Æ? - romainl
I'm going to write some JavaScript code to convert between Georgian language characters and various official and ad-hoc transliteration schemes. Some of these characters may involve combining characters, so I want to make sure my tools can work with them, including telling me which text that I find in the wild and paste in contains such characters. - hippietrail
For instance, I might need to handle J followed by COMBINING CARON (004a 030c). But more generally I just want to know whether Vim has or plans to have support for this, as it's becoming more and more common that we programmers have to deal with such things. - hippietrail
Your example is matched with /\%u004a\%u030c\Z. You'll have to come up with a seriously big pattern if you want to highlight every possible combination. The upside is that it will probably be portable to JS with "minimal" effort. Oh, and Kyle's answer is very informative. - romainl
@romainl: In fact my example is also matched by just \%u030c, but when I try to extend the pattern from just COMBINING CARON to the entire Combining Diacritical Marks range by using [\u0300-\u0364], nothing is matched any longer! - hippietrail

2 Answers

3
votes

You can search for all characters and ignore composing characters with \Z. Or you can search for a range of Unicode characters. Read :help /[] for more information on both.

The last post here may offer some more help:

http://vim.1045645.n5.nabble.com/using-regexp-to-search-for-Unicode-code-points-and-properties-td1190333.html

But Vim's regex does not have Unicode character classes like Perl's.

3
votes

If your Vim installation is compiled with Perl support, you may be able to run:

:perldo s/\X/replacement/g

I installed vim-nox on Debian (which includes Perl support), and matching \X with perldo does indeed work. I'm not sure it will do what you want, though, since all normal characters are also matched and it doesn't seem like perldo will get you highlighting in Vim.

While it's not perfect, if you can get Perl support, you can use Unicode blocks and categories. That means you can use \p{Block: Combining_Diacritical_Marks} or \p{Category: Nonspacing_Mark} to at least detect certain characters, though you still won't get highlighting.
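Since the end goal mentioned in the comments is JavaScript, roughly the same detection can be sketched there too. Treat this as an illustration under a few assumptions: describeMarks is a made-up helper name, \p{Mn} needs an engine with Unicode property escapes (the u flag), and because JavaScript regular expressions have no block property, the Combining Diacritical Marks block is approximated here by its code point range U+0300 to U+036F.

```javascript
// Sketch only: report every nonspacing mark found in a string.
// \p{Mn} is the Nonspacing_Mark general category, comparable to Perl's
// \p{Category: Nonspacing_Mark}.
const NONSPACING_MARK = /\p{Mn}/u;

// Assumption: stands in for \p{Block: Combining_Diacritical_Marks} by
// listing the block's range explicitly.
const COMBINING_DIACRITICAL_MARKS = /[\u0300-\u036F]/u;

// Hypothetical helper: one entry per nonspacing mark, in order of appearance.
function describeMarks(text) {
  const found = [];
  for (const ch of text) { // for...of iterates by code point
    if (NONSPACING_MARK.test(ch)) {
      const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
      found.push({
        codePoint: "U+" + hex,
        inDiacriticsBlock: COMBINING_DIACRITICAL_MARKS.test(ch),
      });
    }
  }
  return found;
}

// Example: J + COMBINING CARON.
console.log(describeMarks("J\u030C"));
```

As with the perldo approach, this only detects the marks; it does nothing for highlighting inside Vim.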