2
votes

I am building a document retrieval engine in Python that returns documents ranked by their relevance to a user-submitted query. My collection of documents also includes PowerPoint files. For the PPTs, I want the results page to show the first few slide titles so the user gets a clearer picture of the content (much like the snippets we see in Google search results).

So basically, I want to extract the text of the slide titles from the PPT files using Python. I am using the python-pptx package for that. My current implementation looks something like this:

from pptx import Presentation

# characters trimmed from both ends of a title
STRIP_CHARS = """ !@#$%^&*)(_-+=}{][:;<,>.?"'/<,"""

prs = Presentation(filepath)  # load the ppt
slide_titles = []  # container for slide titles
for slide in prs.slides:  # iterate over each slide
    title_shape = slide.shapes[0]  # consider the zeroth indexed shape as the title
    if title_shape.has_text_frame:  # only shapes with a text frame carry text
        title = title_shape.text.strip(STRIP_CHARS) + '. '
        # add the title only if it is not already in the container
        if title not in slide_titles:
            slide_titles.append(title)

But as you can see, I am assuming that the zero-indexed shape on each slide is the slide title, which is obviously not always the case. Any ideas on how to do this properly?

Thanks in advance.

3 Answers

2
votes

Slide.shapes (a SlideShapes object) has a .title property that returns the title shape when the slide has one (most do) or None when no title is present.
http://python-pptx.readthedocs.io/en/latest/api/shapes.html#slideshapes-objects

This is the preferred way to access the title shape.

Note that not all slides have a title shape, so you have to test for a None result to avoid errors in that case.

Also note that users sometimes use a different shape for the title, such as a separate text box they added themselves. So you're not guaranteed to get the text that "appears" as the title on the slide. However, you will get the text that PowerPoint itself considers the title, for example, the text it displays as the title for that slide in the Outline view.

from pptx import Presentation

prs = Presentation(path)  # path to a .pptx file
for slide in prs.slides:
    title_shape = slide.shapes.title  # None if the slide has no title shape
    if title_shape is None:
        continue
    print(title_shape.text)
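
For the results-page use case in the question (show only the first few distinct titles), the loop above could be wrapped up roughly as follows. This is a minimal sketch, not part of python-pptx: the function name `first_slide_titles` and its `limit` parameter are hypothetical, and it only assumes that each slide exposes `slide.shapes.title` as described above.

```python
def first_slide_titles(slides, limit=3):
    """Return up to `limit` distinct, non-empty slide titles in slide order.

    `slides` is any iterable of slide objects exposing `shapes.title`,
    e.g. `Presentation(path).slides` (helper name and signature are
    illustrative assumptions).
    """
    titles = []
    for slide in slides:
        if len(titles) >= limit:
            break  # enough titles for the preview
        title_shape = slide.shapes.title  # None when the slide has no title shape
        if title_shape is None:
            continue
        # collapse line breaks and runs of whitespace into single spaces
        text = " ".join(title_shape.text.split())
        if text and text not in titles:
            titles.append(text)
    return titles
```

It would be called as `first_slide_titles(Presentation(path).slides, limit=3)`. Slides without a title shape, empty titles, and repeated titles are skipped, so the preview may come from later slides than the first three.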

2
votes

from pptx import Presentation

local_pptxFileList = ["abc.pptx"]

for i in local_pptxFileList:
    ppt = Presentation(i)
    for slide in ppt.slides:
        for shape in slide.shapes:
            # print the text of every shape that can hold text
            if shape.has_text_frame:
                print(shape.text)

0
votes

How to extract all the text from every .pptx file in a directory (from this blog):

from pptx import Presentation
import glob

for eachfile in glob.glob("*.pptx"):
    prs = Presentation(eachfile)
    print(eachfile)
    print("----------------------")
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
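
If the text is needed for indexing rather than printing, the same loop can return it per slide instead. A small sketch under that assumption: `collect_slide_text` is a hypothetical helper name, and it checks `has_text_frame` (as in the question) rather than `hasattr(shape, "text")`.

```python
def collect_slide_text(slides):
    """Return one list of non-empty text strings per slide, in slide order.

    `slides` is any iterable of slide objects, e.g. `Presentation(path).slides`
    (helper name is an illustrative assumption).
    """
    per_slide = []
    for slide in slides:
        texts = []
        for shape in slide.shapes:
            # pictures, connectors and similar shapes have no text frame
            if not getattr(shape, "has_text_frame", False):
                continue
            text = shape.text.strip()
            if text:
                texts.append(text)
        per_slide.append(texts)
    return per_slide
```

Inside the glob loop above, this would be called as `collect_slide_text(prs.slides)`.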