I am building a document retrieval engine in Python that returns documents ranked by their relevance to a user-submitted query. My collection of documents includes PowerPoint files. For these, I want the results page to show the first few slide titles so the user gets a clearer picture of the content (much like the snippets in Google search results).
So basically, I want to extract the text of the slide titles from the PPT files using Python. I am using the python-pptx package for that. My current implementation looks like this:
from pptx import Presentation

prs = Presentation(filepath)  # load the presentation; filepath is the path to the file
slide_titles = []  # container for slide titles
for slide in prs.slides:  # iterate over each slide
    title_shape = slide.shapes[0]  # treat the zeroth shape as the title
    if title_shape.has_text_frame:  # only shapes with a text frame carry text
        # strip surrounding spaces and punctuation, then append a separator
        title = title_shape.text.strip(""" !@#$%^&*)(_-+=}{][:;<,>.?"'/<,""") + '. '
        # add the title only if it is not already in the container
        if title not in slide_titles:
            slide_titles.append(title)
But as you can see, I am assuming that the zero-indexed shape on each slide is the slide title, which is obviously not always the case. Any ideas on how to accomplish this?
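One possible direction, given here as a sketch under assumptions rather than something tested on real decks: python-pptx exposes `slide.shapes.title`, which returns the slide's title placeholder shape, or None if the slide has no title placeholder. The helper names `collect_slide_titles` and `titles_from_file`, the `limit` parameter, and the choice to leave out the trailing '. ' separator are illustrative assumptions. Slides whose "title" was drawn as an ordinary text box have no title placeholder and would be skipped by this approach.

```python
# Characters stripped from both ends of a title (same set as in the question).
PUNCTUATION = """ !@#$%^&*)(_-+=}{][:;<,>.?"'/"""


def collect_slide_titles(slides, limit=None):
    """Return cleaned, de-duplicated titles from slide-like objects.

    Relies on `slide.shapes.title`, which is the title placeholder shape
    or None when the slide has no title placeholder.
    """
    titles = []
    for slide in slides:
        title_shape = slide.shapes.title
        if title_shape is None:
            continue  # this slide has no title placeholder
        text = title_shape.text.strip(PUNCTUATION)
        if text and text not in titles:  # skip empty and repeated titles
            titles.append(text)
        if limit is not None and len(titles) >= limit:
            break  # enough titles for a results-page snippet
    return titles


def titles_from_file(filepath, limit=3):
    """Hypothetical wrapper: open a .pptx file and return its first titles."""
    # Imported here so collect_slide_titles can be used without python-pptx.
    from pptx import Presentation

    return collect_slide_titles(Presentation(filepath).slides, limit)
```

Calling `titles_from_file(filepath)` would then give at most three titles for the results page; joining them with '. ' reproduces the separator used in the loop above.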
Thanks in advance.