10 votes

I have multilingual strings consisting of both languages that use whitespace as a word separator (English, French, etc.) and languages that don't (Chinese, Japanese, Korean).

Given such a string, I want to separate the English/French/etc. part into words using whitespace as the separator, and to separate the Chinese/Japanese/Korean part into individual characters.

And I want to put all of those separated components into a list.

Some examples would probably make this clear:

Case 1: English-only string. This case is easy:

>>> "I love Python".split()
['I', 'love', 'Python']

Case 2: Chinese-only string:

>>> list(u"我爱蟒蛇")
[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']

In this case I can turn the string into a list of Chinese characters. But within the list I'm getting Unicode escape representations:

[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']

How do I get it to display the actual characters instead of the Unicode escapes? Something like:

['我', '爱', '蟒', '蛇']

??

Case 3: A mix of English & Chinese:

I want to take an input string such as

"我爱Python"

and turn it into a list like this:

['我', '爱', 'Python']

Is it possible to do something like that?

Unfortunately, there is a misfeature in Python's current re module that prevents re.split() from splitting on zero-length matches: stackoverflow.com/questions/2713060/… - therefore you can't use regular expressions in Python for this directly. – Tim Pietzcker

Korean uses whitespace for word separation. – Leovt

5 Answers

6 votes

I thought I'd show the regex approach, too. It doesn't feel right to me, but that's mostly because all of the language-specific i18n oddities I've seen make me worried that a regular expression might not be flexible enough for all of them--but you may well not need any of that. (In other words--overdesign.)

# -*- coding: utf-8 -*-
import re
def group_words(s):
    regex = []

    # Match a whole word:
    regex += [ur'\w+']

    # Match a single CJK character:
    regex += [ur'[\u4e00-\ufaff]']

    # Match one of anything else, except for spaces:
    regex += [ur'[^\s]']

    regex = "|".join(regex)
    r = re.compile(regex)

    return r.findall(s)

if __name__ == "__main__":
    print group_words(u"Testing English text")
    print group_words(u"我爱蟒蛇")
    print group_words(u"Testing English text我爱蟒蛇")

In practice, you'd probably want to compile the regex only once, not on each call. Again, filling in the particulars of character grouping is up to you.
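
For example, hoisting the compile out of the function might look like this (a minimal sketch keeping the Python 2 style of the answer above; the _WORD_RE name is just illustrative):

import re

# Compile the same pattern once, at import time, instead of on every call.
_WORD_RE = re.compile(
    ur'\w+'                # a whole word
    ur'|[\u4e00-\ufaff]'   # a single CJK character
    ur'|[^\s]'             # one of anything else, except spaces
)

def group_words(s):
    return _WORD_RE.findall(s)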

4 votes

In Python 3, this also splits out numbers, if you need that.

import re

def spliteKeyWord(s):
    regex = r"[\u4e00-\ufaff]|[0-9]+|[a-zA-Z]+\'*[a-z]*"
    matches = re.findall(regex, s, re.UNICODE)
    return matches

print(spliteKeyWord("Testing English text我爱Python123"))

=> ['Testing', 'English', 'text', '我', '爱', 'Python', '123']

2 votes

Formatting a list shows the repr of its components. If you want to view the strings naturally rather than escaped, you'll need to format it yourself. (repr should not be escaping these characters; repr(u'我') should return "u'我'", not "u'\\u6211'". Python 3 gets this right; only 2.x is stuck with the English-centric escaping for Unicode strings.)
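
For instance, a minimal way to "format it yourself" might be (a Python 2 sketch, assuming a UTF-8 terminal):

# -*- coding: utf-8 -*-
# Print the characters themselves instead of the list's repr.
chars = list(u"我爱蟒蛇")
print u"[%s]" % u", ".join(u"'%s'" % c for c in chars)
# ['我', '爱', '蟒', '蛇']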

A basic algorithm you can use is assigning a character class to each character, then grouping letters by class. Starter code is below.

I didn't use a doctest for this because I hit some odd encoding issues that I don't want to look into (out of scope). You'll need to implement a correct grouping function.

Note that if you're using this for word wrapping, there are other per-language considerations. For example, you don't want to break on non-breaking spaces; you do want to break on hyphens; for Japanese you don't want to split apart きゅ; and so on.

# -*- coding: utf-8 -*-
import itertools, unicodedata

def group_words(s):
    # This is a closure for key(), encapsulated in an array to work around
    # 2.x's lack of the nonlocal keyword.
    sequence = [0x10000000]

    def key(part):
        if part.isspace():
            return 0

        # This is incorrect, but serves this example; finding a more
        # accurate categorization of characters is up to the user.
        asian = unicodedata.category(part) == "Lo"
        if asian:
            # Never group asian characters, by returning a unique value for each one.
            sequence[0] += 1
            return sequence[0]

        return 2

    result = []
    for group_key, group in itertools.groupby(s, key):
        # Discard groups of whitespace.
        if group_key == 0:
            continue

        word = "".join(group)
        result.append(word)

    return result

if __name__ == "__main__":
    print group_words(u"Testing English text")
    print group_words(u"我爱蟒蛇")
    print group_words(u"Testing English text我爱蟒蛇")

0 votes

Modified Glenn's solution to drop symbols and to work for Russian, French, etc. alphabets:

import re

def rec_group_words():
    regex = []

    # Match a whole word (ASCII letters/digits, Latin-1 accented letters,
    # and Cyrillic letters, so French, Russian, etc. words stay intact):
    regex += [r'[A-Za-z0-9\xc0-\xff\u0400-\u04ff]+']

    # Match a single CJK character:
    regex += [r'[\u4e00-\ufaff]']

    regex = "|".join(regex)
    return re.compile(regex)
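
Since this returns the compiled pattern rather than the word list, usage would look something like (assuming Python 3, where re understands the \u escapes in a plain raw string):

pattern = rec_group_words()
print(pattern.findall("Testing English text我爱蟒蛇"))
# ['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']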

0 votes

The following works for Python 3.7:

import re
def group_words(s):
    return re.findall(u'[\u4e00-\u9fff]|[a-zA-Z0-9]+', s)


if __name__ == "__main__":
    print(group_words(u"Testing English text"))
    print(group_words(u"我爱蟒蛇"))
    print(group_words(u"Testing English text我爱蟒蛇"))

['Testing', 'English', 'text']
['我', '爱', '蟒', '蛇']
['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']

For some reason, I cannot adapt Glenn Maynard's answer to Python 3.
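
One possible reason: in Python 3, \w is Unicode-aware and matches the CJK characters too, so the \w+ alternative swallows a run of Chinese characters as a single "word". A sketch of a port that restricts the word alternative to ASCII (the group_words_py3 name is just illustrative):

import re

def group_words_py3(s):
    # Same three alternatives as the Python 2 regex answer, but the word
    # alternative is limited to ASCII word characters, because in Python 3
    # \w would also match the CJK characters.
    regex = "|".join([
        r'[A-Za-z0-9_]+',      # a whole ASCII word
        r'[\u4e00-\ufaff]',    # a single CJK character
        r'\S',                 # one of anything else, except spaces
    ])
    return re.findall(regex, s)

print(group_words_py3("Testing English text我爱蟒蛇"))
# ['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']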