1
votes

I am trying to use re.findall to get all of the capitalized words and abbreviations in a string. I have figured out a regular expression for each one individually, but when I combine the two, re.findall returns tuples that contain an empty string alongside the item I wanted to find.

Here is the regular expression that does not seem to work. I imagine it's a quick fix that I am just unaware of:

x = re.findall("([A-Z][A-Za-z]+\.?)|(\\b[A-Z](?:[\\.&]?[A-Z]){2,}\\b)", txt) #just has extra "" in each set

edit:

I am currently using this as my test case:

"USA. U.S.A America."

This is my output:

[('USA.', ''), ('', 'U.S.A'), ('America.', '')]
2
Could you share a sample value of txt? - Niel Godfrey Ponciano
Yes, I am currently using "USA. U.S.A America." as my test case. This is my output: [('USA.', ''), ('', 'U.S.A'), ('America.', '')] - vynabhnnqwxleicntw
What output were you expecting? - Jiří Baum
@NielGodfreyPonciano They can be lowercase, thank you for pointing that out, that wasn't in one of my test cases. - vynabhnnqwxleicntw
It's a good idea to show your strings as edits in the post body, with monospace font so there's no ambiguity as to the contents of the string. - ggorlen

2 Answers

1
votes

Your regular expression contains two capturing groups (...), one for each alternative, so re.findall() returns a tuple with one entry per group for every match; the group belonging to the alternative that did not match is an empty string. This is useful if you need to extract several parts of a match, or if you need to know which alternative was the one that matched.

To get just a flat list, you'll need to either omit the capturing parentheses or turn them into non-capturing groups (?:...):

x = re.findall(r"[A-Z][A-Za-z]+\.?|\b[A-Z](?:[\.&]?[A-Z]){2,}\b", txt)

or, if the (...) were significant (or you want them for clarity):

x = re.findall(r"(?:[A-Z][A-Za-z]+\.?)|(?:\b[A-Z](?:[\.&]?[A-Z]){2,}\b)", txt)

Either of these returns the value: ['USA.', 'U.S.A', 'America.']
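
As a minimal sketch of the behaviour described above, the following compares the capturing and non-capturing patterns on the asker's test string, and shows how the tuple form can tell you which alternative matched. The names pattern_grouped, pattern_flat and which_alternative, and the "word"/"abbreviation" labels, are illustrative assumptions and not part of the question.

```python
import re

txt = "USA. U.S.A America."

# Two capturing groups: findall() returns one tuple per match,
# with an empty string for the alternative that did not match.
pattern_grouped = r"([A-Z][A-Za-z]+\.?)|(\b[A-Z](?:[.&]?[A-Z]){2,}\b)"
grouped = re.findall(pattern_grouped, txt)
print(grouped)  # [('USA.', ''), ('', 'U.S.A'), ('America.', '')]

# No capturing groups: findall() returns each whole match as a string.
pattern_flat = r"[A-Z][A-Za-z]+\.?|\b[A-Z](?:[.&]?[A-Z]){2,}\b"
flat = re.findall(pattern_flat, txt)
print(flat)  # ['USA.', 'U.S.A', 'America.']

# Using the tuples to label each match (hypothetical labels).
which_alternative = [
    ("word" if word else "abbreviation", word or abbr)
    for word, abbr in grouped
]
print(which_alternative)
```

If you only ever need the flat list, the non-capturing form is probably the simpler choice; the tuple form is worth keeping only when the label matters.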

0
votes

Use (?:...) to make a group non-capturing, as documented.

Here is a simplified regex that combines the following two searches:

  • Any word that starts with a capital letter
  • Any word that is an abbreviation/acronym marked by a separator dot (.)

Rather than capturing each search separately, make each alternative non-capturing with (?:...) and capture the result of the whole alternation in a single group, e.g. ( (?:...) | (?:...) ), where the first (?:...) is the capital-letter search and the second (?:...) is the acronym search. With exactly one capturing group, re.findall() returns a flat list of strings instead of tuples.

import re

txt = "USA. U.S.A   America. arctic u.s.a Mars v.. A.b earth c.D.e. .pluto nep.tune. uranus. f.g.h.i Sun  "
matches = re.findall(r"((?:[A-Z]\w+)|(?:\w+\.+\w+[\w.]*))", txt)
print(matches)

Output:

['USA', 'U.S.A', 'America', 'u.s.a', 'Mars', 'A.b', 'c.D.e.', 'nep.tune.', 'f.g.h.i', 'Sun']
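
For comparison, here is a small sketch suggesting that the outer capturing group is optional in this case: with exactly one group, re.findall() returns that group's text, and with no groups it returns the whole match, so both forms should give the same flat list. The shortened test string and the names one_group and no_group are illustrative assumptions.

```python
import re

# Shortened version of the test string above (an assumption, for brevity).
txt = "USA. U.S.A America. arctic u.s.a Mars"

# One capturing group wrapped around the whole alternation.
one_group = re.findall(r"((?:[A-Z]\w+)|(?:\w+\.+\w+[\w.]*))", txt)

# Same alternation with no groups at all.
no_group = re.findall(r"[A-Z]\w+|\w+\.+\w+[\w.]*", txt)

print(one_group)  # ['USA', 'U.S.A', 'America', 'u.s.a', 'Mars']
print(no_group)   # ['USA', 'U.S.A', 'America', 'u.s.a', 'Mars']
```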