Issue with regular expressions returning an extra empty string

Question

I am trying to use re.findall to get all of the capitalized words and abbreviations in a text. I have worked out a regular expression for each one individually, but when I combine the two, I get back tuples containing an empty string alongside the item I wanted to find.

Here is the regular expression that doesn't work. I imagine it's a quick fix I'm just unaware of:

x = re.findall("([A-Z][A-Za-z]+\\.?)|(\\b[A-Z](?:[\\.&]?[A-Z]){2,}\\b)", txt)  # each tuple has an extra ""

edit:

I am currently using this as my test case:

"USA. U.S.A America."

This is my output:

[('USA.', ''), ('', 'U.S.A'), ('America.', '')]

Yes, I am currently using "USA. U.S.A America." as my test case. This is my output: [('USA.', ''), ('', 'U.S.A'), ('America.', '')] — vynabhnnqwxleicntw
@NielGodfreyPonciano They can be lowercase, thank you for pointing that out, that wasn't in one of my test cases — vynabhnnqwxleicntw
It's a good idea to show your strings as edits in the post body, with monospace font so there's no ambiguity as to the contents of the string. — ggorlen

Jiří Baum · Accepted Answer · 2021-09-02T23:51:18

In your regular expression, you have two capturing groups (...), one for each alternative, so re.findall() returns a list of tuples with one entry per group; the group belonging to the alternative that didn't match comes back as an empty string. This is useful if you need to extract several parts of each match, or if you need to know which alternative was the one that matched.
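As a small illustration of that behaviour, the sketch below runs the pattern from the question against the test string and then uses the position of the non-empty entry to tell the alternatives apart. The pattern is written as a raw string so the backslashes need no doubling, and the labels "word" and "abbreviation" are names made up for this example:

```python
import re

txt = "USA. U.S.A America."

# Same pattern as in the question, written as a raw string.
pattern = r"([A-Z][A-Za-z]+\.?)|(\b[A-Z](?:[\.&]?[A-Z]){2,}\b)"

# Two capturing groups, so every match becomes a 2-tuple; the group for
# the alternative that did not take part in the match is "".
pairs = re.findall(pattern, txt)
print(pairs)

# The non-empty slot shows which alternative matched (labels are illustrative).
kinds = ["word" if word else "abbreviation" for word, abbr in pairs]
print(kinds)
```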

To get just a flat list, you'll need to either omit the parentheses or turn them into non-capturing groups (?:...):

x = re.findall("[A-Z][A-Za-z]+\\.?|\\b[A-Z](?:[\\.&]?[A-Z]){2,}\\b", txt)

or, if the parentheses are needed for grouping (or you want them for clarity):

x = re.findall("(?:[A-Z][A-Za-z]+\\.?)|(?:\\b[A-Z](?:[\\.&]?[A-Z]){2,}\\b)", txt)

Either of these returns the value: ['USA.', 'U.S.A', 'America.']
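If you would rather keep the capturing groups, one alternative (not part of the fix above, just a sketch) is to iterate with re.finditer() and read group(0), which is always the whole match. The match object's lastindex attribute then reports which group took part:

```python
import re

txt = "USA. U.S.A America."

# Capturing groups kept on purpose; raw string avoids doubled backslashes.
pattern = re.compile(r"([A-Z][A-Za-z]+\.?)|(\b[A-Z](?:[\.&]?[A-Z]){2,}\b)")

matches = list(pattern.finditer(txt))

# group(0) is the full match, regardless of how many groups the pattern has.
flat = [m.group(0) for m in matches]
print(flat)

# lastindex is the number of the last capturing group that matched:
# 1 for the capitalized-word alternative, 2 for the abbreviation one.
which = [m.lastindex for m in matches]
print(which)
```

This keeps the information about which alternative matched while still giving a flat list of strings.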
