I have two very related questions:

  • I want to match a string against a wildcard pattern (i.e. one containing one or more '*' or '?') and then build a replacement string from a second wildcard pattern. The placeholders in the second pattern should refer to the substrings matched by the first (as, for instance, in the DOS copy command).

    Example: pattern='*.txt' and replacement-pattern='*.doc': I want aaa.txt --> aaa.doc and xx.txt.txt --> xx.txt.doc

    Ideally it would work with multiple, arbitrarily placed wildcards: e.g., pattern='*.*' and replacement-pattern='XX*.*'.

    Of course, some constraint is needed (e.g. a greedy strategy); otherwise a pattern such as X*X*X can match the string XXXXXX in more than one way.

  • Or, alternatively, I want to form a multi-match. That is, I have one or more wildcard patterns, each with the same number of wildcard characters. Each pattern is matched against one string, but corresponding wildcards must refer to the same matched text.

    Example: pattern1='*.txt' and pattern2='*-suffix.txt' should match the pair string1='XX.txt' and string2='XX-suffix.txt', but not string1='XX.txt' and string2='YY-suffix.txt'.

    In contrast to the first, this is a better-defined problem, as it avoids the ambiguity, but it is perhaps quite similar.

I am sure there are algorithms for these tasks, however, I am unable to find anything useful.

The Python standard library has fnmatch, but it does not support what I want to do.

1 Answer

There are many ways to do this, but I came up with the following, which should work for your first question. Based on your examples I’m assuming you don’t want to match whitespace.

This function turns the first passed pattern into a regex and the passed replacement pattern into a string suitable for the re.sub function.

import re

def replaceWildcards(string, pattern, replacementPattern):
    # Split on wildcards; the capturing group keeps each '*' and '?' as its own piece
    splitPattern = re.split(r'([*?])', pattern)
    splitReplacement = re.split(r'([*?])', replacementPattern)
    if len(splitPattern) != len(splitReplacement):
        raise ValueError("Provided pattern wildcards do not match")
    reg = ""
    sub = ""
    for idx, (regexPiece, replacementPiece) in enumerate(zip(splitPattern, splitReplacement)):
        if regexPiece in ["*", "?"]:
            if replacementPiece != regexPiece:
                raise ValueError("Provided pattern wildcards do not match")
            reg += f"(\\S{regexPiece if regexPiece == '*' else ''})" # Match anything but whitespace
            sub += f"\\g<{idx + 1}>" # Groups start at 1, not 0; \g<n> stays unambiguous before a digit
        else:
            reg += f"({re.escape(regexPiece)})"
            sub += replacementPiece.replace("\\", "\\\\") # Keep literal backslashes literal in re.sub
    return re.sub(reg, sub, string)

Sample output:

replaceWildcards("aaa.txt xx.txt.txt aaa.bat", "*.txt", "*.doc")
# 'aaa.doc xx.txt.doc aaa.bat'

replaceWildcards("aaa10.txt a1.txt aaa23.bat", "a??.txt", "b??.doc")
# 'aab10.doc a1.txt aaa23.bat'

replaceWildcards("aaa10.txt a1-suffix.txt aaa23.bat", "a*-suffix.txt", "b*-suffix.doc")
# 'aaa10.txt b1-suffix.doc aaa23.bat'

replaceWildcards("prefix-2aaa10-suffix.txt a1-suffix.txt", "prefix-*a*-suffix.txt", "prefix-*b*-suffix.doc")
# 'prefix-2aab10-suffix.doc a1-suffix.txt'
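
On the ambiguity raised in the question: because each '*' is translated to a greedy \S*, the regex engine resolves ambiguous patterns by letting the leftmost wildcard take as much as it can. A minimal sketch of this, using a hand-written regex that corresponds to the pattern X*X*X (the variable names are only for illustration):

```python
import re

# Hand-written equivalent of the wildcard pattern "X*X*X":
# each '*' becomes a greedy, capturing \S*
greedy = re.fullmatch(r"X(\S*)X(\S*)X", "XXXXXX")
print(greedy.groups())  # ('XXX', '')

# Lazy quantifiers give the opposite split: the first wildcard takes as little as possible
lazy = re.fullmatch(r"X(\S*?)X(\S*?)X", "XXXXXX")
print(lazy.groups())  # ('', 'XXX')
```

If you would rather have the first wildcard match as little as possible, swapping \S* for \S*? in the generated regex should give that behaviour.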

Note that f-strings require Python 3.6 or later.
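
For the second question (multi-match), one possible approach is to build a single regex in which the first pattern captures each wildcard and every later pattern refers back to those captures, then match it against all strings joined by a separator. The sketch below is untested beyond the question's example; multiMatch is a hypothetical name, and it assumes that the strings contain no newlines (the newline is used as the separator) and that only the number of wildcards has to agree between patterns.

```python
import re

def multiMatch(patterns, strings):
    # Hypothetical helper: True if every string matches its pattern and
    # the n-th wildcard stands for the same text in all of them.
    if len(patterns) != len(strings):
        raise ValueError("Need exactly one string per pattern")
    if any("\n" in s for s in strings):
        raise ValueError("Strings must not contain newlines")  # newline is the separator
    regexes = []
    for patternIdx, pattern in enumerate(patterns):
        regex = ""
        wildcardIdx = 0
        for piece in re.split(r'([*?])', pattern):
            if piece in ["*", "?"]:
                wildcardIdx += 1
                if patternIdx == 0:
                    # First pattern captures the wildcard text ('.' never matches a newline)
                    regex += f"(?P<w{wildcardIdx}>.{'*' if piece == '*' else ''})"
                else:
                    # Later patterns must repeat the captured text
                    regex += f"(?P=w{wildcardIdx})"
            else:
                regex += re.escape(piece)
        regexes.append((regex, wildcardIdx))
    if len({count for _, count in regexes}) > 1:
        raise ValueError("Patterns must contain the same number of wildcards")
    combined = "\n".join(regex for regex, _ in regexes)
    return re.fullmatch(combined, "\n".join(strings)) is not None

print(multiMatch(["*.txt", "*-suffix.txt"], ["XX.txt", "XX-suffix.txt"]))  # True
print(multiMatch(["*.txt", "*-suffix.txt"], ["XX.txt", "YY-suffix.txt"]))  # False
```

Because the backreferences sit in the same regex as the captures, the engine backtracks over the first pattern's wildcards until it finds an assignment that also fits the other strings, so ambiguous patterns do not need a separate tie-breaking rule here.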