1
votes

Hello everyone I need help with a mix of Python and RegEx. I am working on a project that is taking raw text and converting it to XML. The text consist of multiple people speaking saying different things. What I am trying to do is break the speeches up and convert them to XML.

The sample string is:

Mr. COX. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum sollicitudin vestibulum consectetur. Aliquam rhoncus nisl id velit gravida, quis volutpat est eleifend. Donec posuere a magna ac molestie. Vivamus sed lacinia lectus, quis feugiat libero. Nam sapien lacus, hendrerit at posuere ut, ullamcorper sit amet augue. Ut fringilla lobortis nulla. Nulla facilisi. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam efficitur rutrum dictum. Aenean a sem mollis justo scelerisque posuere eget sit amet orci. Praesent condimentum, leo at commodo dapibus, leo mi pretium lectus, et sagittis lorem sapien ut enim. Nulla sagittis varius eros, eget pretium arcu suscipit aliquet. Mr. SEABASS. Ut condimentum lobortis suscipit. Donec eget tempor ex, vel porttitor velit. Aliquam vulputate, leo in aliquet laoreet, sem ante dapibus velit, nec imperdiet felis tellus vel leo. Nunc mattis velit sed turpis consectetur tempus. Nam volutpat vel metus sed aliquam. Curabitur vitae elit urna. Nulla vehicula sapien quis libero elementum, vitae sodales tellus commodo. Pellentesque pulvinar felis vitae neque viverra posuere vitae sit amet neque. Curabitur lorem libero, mollis consectetur tempus sit amet, tincidunt vitae dolor. Cras ullamcorper arcu ac orci pharetra consequat. Nunc magna justo, sollicitudin at enim vel, volutpat elementum sapien. Mauris sit amet velit in diam imperdiet tempor facilisis et ex. Praesent consectetur leo a eros mattis tempor. Mr. REX. Nullam interdum urna quis nunc sodales, id posuere nisl malesuada. Nam nec lacus et ipsum ultrices pharetra. Nullam vitae mauris sodales, fringilla augue at, efficitur arcu. Sed ex diam, ullamcorper a auctor eget, volutpat sit amet est. Suspendisse urna eros, ullamcorper in semper at, lobortis eget quam. Fusce auctor, augue sit amet convallis condimentum, diam libero porta lectus, consectetur posuere nisi mi non nulla. Suspendisse vel ante efficitur, eleifend justo sed, lobortis augue. Sed rhoncus neque libero, et tempor ipsum imperdiet id. Integer at purus eget dolor pharetra varius ut et massa. Etiam risus enim, ultrices vitae nisl eu, interdum dignissim tellus. Nullam tellus metus, finibus non justo at, lobortis imperdiet tortor. Nulla nec tortor sagittis, fringilla nisi quis, bibendum leo.

The above is just a sample. The RegEx and Python code would need to read through an entire file writing speakers and their speech as it is found.

The above should yield something like:

<Gutenberg>
<Speaker>Mr. COX</Speaker>
<Speech>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum sollicitudin vestibulum consectetur. Aliquam rhoncus nisl id velit gravida, quis volutpat est eleifend. Donec posuere a magna ac molestie. Vivamus sed lacinia lectus, quis feugiat libero. Nam sapien lacus, hendrerit at posuere ut, ullamcorper sit amet augue. Ut fringilla lobortis nulla. Nulla facilisi. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam efficitur rutrum dictum. Aenean a sem mollis justo scelerisque posuere eget sit amet orci. Praesent condimentum, leo at commodo dapibus, leo mi pretium lectus, et sagittis lorem sapien ut enim. Nulla sagittis varius eros, eget pretium arcu suscipit aliquet. </Speech>
<Speaker>Mr. SEABASS</Speaker>
<Speech>Ut condimentum lobortis suscipit. Donec eget tempor ex, vel porttitor velit. Aliquam vulputate, leo in aliquet laoreet, sem ante dapibus velit, nec imperdiet felis tellus vel leo. Nunc mattis velit sed turpis consectetur tempus. Nam volutpat vel metus sed aliquam. Curabitur vitae elit urna. Nulla vehicula sapien quis libero elementum, vitae sodales tellus commodo. Pellentesque pulvinar felis vitae neque viverra posuere vitae sit amet neque. Curabitur lorem libero, mollis consectetur tempus sit amet, tincidunt vitae dolor. Cras ullamcorper arcu ac orci pharetra consequat. Nunc magna justo, sollicitudin at enim vel, volutpat elementum sapien. Mauris sit amet velit in diam imperdiet tempor facilisis et ex. Praesent consectetur leo a eros mattis tempor.</Speech>
<Speaker>Mr. REX</Speaker>
<Speech>Nullam interdum urna quis nunc sodales, id posuere nisl malesuada. Nam nec lacus et ipsum ultrices pharetra. Nullam vitae mauris sodales, fringilla augue at, efficitur arcu. Sed ex diam, ullamcorper a auctor eget, volutpat sit amet est. Suspendisse urna eros, ullamcorper in semper at, lobortis eget quam. Fusce auctor, augue sit amet convallis condimentum, diam libero porta lectus, consectetur posuere nisi mi non nulla. Suspendisse vel ante efficitur, eleifend justo sed, lobortis augue. Sed rhoncus neque libero, et tempor ipsum imperdiet id. Integer at purus eget dolor pharetra varius ut et massa. Etiam risus enim, ultrices vitae nisl eu, interdum dignissim tellus. Nullam tellus metus, finibus non justo at, lobortis imperdiet tortor. Nulla nec tortor sagittis, fringilla nisi quis, bibendum leo. </Speech>
</Gutenberg>

Now I already have the RegEx that takes care of finding the speakers. However, I am having difficulty matching the speeches to the speaker. Also the speeches and speakers will be different, not the same three speakers or speeches. So the RegEx will need to be flexible.

This question is not a duplicate as the other question in reference is for a movie script and this is for project gutenberg eBooks.

1
What's your RegEx to identify the speakers?Nick Weseman
The single line string is you posted is the exact format you're dealing with? Where are you getting this from so I can get a wider sample of speeches?Verbal_Kint
This looks like a challenge for natural language processing, not regular expressions.Chris
This could probably be accomplished using groups and look-ahead assertions, but it won't be simple.Joe D
We need more info about the rest of the file. What do the other lines look like? Can there be multiple paragraphs to a speaker's text?aghast

1 Answers

2
votes

Providing a regex for your limited example is easy but will ineveitably be unreliable and brittle. There must be another way to extract the information you have in more relevant chunks. The reason being that there are limited predictable patterns which will allow for identifying a new Speaker/Speech combo. In this case the Mr. ALLCAPSNAME. seems to be the only semi-reliable identifier. Using that, if any speech has the words Mr|Ms|Mrs followed by an all caps word, that will be mistaken for a breakpoint. So this would trip it up:

Mr. ABC. I think Mrs. ABC. is awsome.

Would give you:

[('Mr. ABC.', 'I think'), ('Mrs. ABC.', 'is awesome.')]

And I could easily see someone mentioning Mr/Ms/Mrs WHOMEVER within a speech. Failing a better way to extract it, this may work, but I wouldn't trust it:

In [1]: pattern = re.compile(r"""
            (?<=\b)\s*
            (?P<Speaker>M(?:rs?|s)\.\s+[A-Z]+\.)\s+
            (?P<Speech>.+?\.)
            (?=\s+M(?:rs?|s)\.\s+[A-Z]+\.|$)
            """, 
            re.VERBOSE)
In [1]:  pattern.findall(example)
Out[1]: 
[('Mr. COX.',
'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum sollicitudin vestibulum consectetur. Aliquam rhoncus nisl id velit gravida, quis volutpat est eleifend. Donec posuere a magna ac molestie. Vivamus sed lacinia lectus, quis feugiat libero. Nam sapien lacus, hendrerit at posuere ut, ullamcorper sit amet augue. Ut fringilla lobortis nulla. Nulla facilisi. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam efficitur rutrum dictum. Aenean a sem mollis justo scelerisque posuere eget sit amet orci. Praesent condimentum, leo at commodo dapibus, leo mi pretium lectus, et sagittis lorem sapien ut enim. Nulla sagittis varius eros, eget pretium arcu suscipit aliquet.'),
 ('Mr. SEABASS.',
  'Ut condimentum lobortis suscipit. Donec eget tempor ex, vel porttitor velit. Aliquam vulputate, leo in aliquet laoreet, sem ante dapibus velit, nec imperdiet felis tellus vel leo. Nunc mattis velit sed turpis consectetur tempus. Nam volutpat vel metus sed aliquam. Curabitur vitae elit urna. Nulla vehicula sapien quis libero elementum, vitae sodales tellus commodo. Pellentesque pulvinar felis vitae neque viverra posuere vitae sit amet neque. Curabitur lorem libero, mollis consectetur tempus sit amet, tincidunt vitae dolor. Cras ullamcorper arcu ac orci pharetra consequat. Nunc magna justo, sollicitudin at enim vel, volutpat elementum sapien. Mauris sit amet velit in diam imperdiet tempor facilisis et ex. Praesent consectetur leo a eros mattis tempor.'),
 ('Mr. REX.',
  'Nullam interdum urna quis nunc sodales, id posuere nisl malesuada. Nam nec lacus et ipsum ultrices pharetra. Nullam vitae mauris sodales, fringilla augue at, efficitur arcu. Sed ex diam, ullamcorper a auctor eget, volutpat sit amet est. Suspendisse urna eros, ullamcorper in semper at, lobortis eget quam. Fusce auctor, augue sit amet convallis condimentum, diam libero porta lectus, consectetur posuere nisi mi non nulla. Suspendisse vel ante efficitur, eleifend justo sed, lobortis augue. Sed rhoncus neque libero, et tempor ipsum imperdiet id. Integer at purus eget dolor pharetra varius ut et massa. Etiam risus enim, ultrices vitae nisl eu, interdum dignissim tellus. Nullam tellus metus, finibus non justo at, lobortis imperdiet tortor. Nulla nec tortor sagittis, fringilla nisi quis, bibendum leo.')]

If this pattern works you can use this to XMLize it:

def to_xml(l):
    base_element = Element('Gutenburg')
    speeches = SubElement(base_element, 'Speeches')

    for speaker, speech in l:
        sp = SubElement(speeches, 'Speech')

        s = SubElement(sp, 'Speaker')
        s.text = speaker

        text = SubElement(sp, 'Text')
        text.text = speech

    return base_element

Then:

tostring(result)

for your xml string.