2
votes

Extracting from text

For example, the following text contains several items, each starting with a capital letter followed by a period. How can I separate them?

Text:

A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor

Goal:

A. lorem ipsum dolor sit 
B . 41dipiscing elit sed 
C. lorem ipsum dolor sit amet 
D. 35 Consectetur adipiscing 
E .Sed do eiusmod tempor

What I have tried:

^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$

Result:

https://regex101.com/r/4HB0oD/1

But my regex only detects the first item and nothing after it. What is the reason for this?

2
Note that the quantifier {1} is inherently redundant. If you want to match something once, simply don't add a quantifier. - CAustin
What is the tool or language? - The fourth bird

2 Answers

2
votes

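Maybe a lookahead such as

(?=[A-Z]\s*\.)

might work OK. It matches the empty position before each capital letter that is followed by optional whitespace and a period, so a line break can be inserted there.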

Test

import re

string = '''
A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor
'''

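# Insert a newline at every position followed by a capital letter, optional whitespace, and a period.
print(re.sub(r'(?=[A-Z]\s*\.)', '\n', string))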

Output


A. lorem ipsum dolor sit 
B . 41dipiscing elit sed
C. lorem ipsum dolor sit amet 
D. 35 Consectetur adipiscing 
E .Sed do eiusmod tempor
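
If a list of items is more convenient than one string with embedded newlines, the same lookahead can probably be passed to re.split instead. This is only a minimal sketch: the names text and items are illustrative, and it assumes Python 3.7 or later, where re.split splits on empty matches.

```python
import re

text = ("A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum "
        "dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor")

# Split at every position followed by a capital letter, optional whitespace, and a period.
# The zero-width match at the very start yields an empty first piece, so empty pieces are dropped.
pieces = re.split(r'(?=[A-Z]\s*\.)', text)
items = [piece.strip() for piece in pieces if piece.strip()]

for item in items:
    print(item)
```

Stripping each piece also removes the trailing space that the re.sub version leaves at the end of some lines.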


If you wish to simplify, update, or explore the expression, it is explained in the top-right panel of regex101.com. If you are interested, the debugger there lets you watch the matching steps or modify them. It demonstrates how a regex engine might consume some sample input strings step by step while performing the match.
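
As for why the pattern in the question only finds the first item: it is anchored with ^ and $, and its last group (.*) is greedy, so a single match starting at "A." swallows the rest of the line. The sketch below illustrates this with Python's re module, which is an assumption since the question does not name a language; the variable names are illustrative.

```python
import re

text = ("A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum "
        "dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor")

# The pattern from the question, unchanged.
original = r'^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$'

matches = list(re.finditer(original, text))

# ^ can only match at the start of the string, so there is at most one match.
print(len(matches))

# Group 1 is the first label; group 3 is everything after it, up to the end of the line.
print(repr(matches[0].group(1)))
print(repr(matches[0].group(3)))
```

Dropping the anchors and making the trailing part stop before the next label, or splitting on a lookahead as above, avoids this.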


RegEx Circuit

jex.im can be used to visualize regular expressions.

0
votes

This pattern should do what you're looking for:

[A-Z\d] ?\..+?(?=$|[A-Z\d] ?\.)

https://regex101.com/r/i92QR1/1
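
For reference, here is a minimal sketch of how this pattern might be used from Python with re.findall. Python is an assumption, since the question does not say which tool is used, and the names text and items are illustrative. Note that the character class also accepts a digit as a label, so a digit directly followed by a period would start a new item too.

```python
import re

text = ("A. lorem ipsum dolor sit B . 41dipiscing elit sedC. lorem ipsum "
        "dolor sit amet D. 35 Consectetur adipiscing E .Sed do eiusmod tempor")

# Each match starts at a label (capital letter or digit, optional space, period)
# and lazily extends until the next label or the end of the string.
pattern = r'[A-Z\d] ?\..+?(?=$|[A-Z\d] ?\.)'

# The pattern has no capturing groups, so findall returns the full matches.
items = [match.strip() for match in re.findall(pattern, text)]

for item in items:
    print(item)
```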