
I am trying to extract text from .ppt and .pptx files. python-pptx handles the .pptx files without problems, but according to its documentation, ".ppt files from PowerPoint 2003 and earlier won’t work."

When I create a Presentation object with this line of code:

`prs = Presentation("Filepath\\presentation.ppt")`

I receive the following error:

`Traceback (most recent call last):
...shortened for brevity....
KeyError: "no relationship of type 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument' in collection"`

I believe that this error is occurring because python-pptx cannot handle .ppt files. I have tried to remedy this situation three ways:

  1. Use the .save() method of python-pptx. This requires a Presentation object, which I cannot create because python-pptx cannot open the .ppt file in the first place.
  2. Rename the file with os.rename(src, dst).
    • This did not work. Renaming is not the same as 'Save As': the contents stay in the old .ppt format, so the renamed file is effectively corrupt.
  3. Use win32com to start the PowerPoint application, open the .ppt file, save it as .pptx, and then close both the file and the application.

    • This method worked, but it is really clunky. (See code below.)

    import win32com.client

    # Start (or attach to) PowerPoint through COM
    Application = win32com.client.Dispatch("PowerPoint.Application")
    Application.Visible = True
    # Open the legacy file and re-save it under a .pptx name
    Presentation = Application.Presentations.Open("Filepath\\presentation.ppt")
    Presentation.SaveAs("Filepath\\presentation.pptx")
    # Clean up the document and the application
    Presentation.Close()
    Application.Quit()
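A minimal sketch of why approach 2 fails, assuming the files are ordinary PowerPoint output: a legacy .ppt is an OLE2 compound file, while a .pptx is a ZIP package, so the first bytes differ no matter what the extension says. The function names `classify_header` and `sniff_file` are hypothetical helpers, not part of python-pptx.

```python
from pathlib import Path

# Signature at the start of every OLE2 compound file (legacy .ppt, .doc, .xls).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local-file-header signature that starts a non-empty ZIP archive such as a .pptx.
ZIP_MAGIC = b"PK\x03\x04"


def classify_header(header):
    """Classify a file by its leading bytes: 'ole2', 'zip', or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path):
    """Read the first 8 bytes of `path` and classify them, ignoring the extension."""
    with Path(path).open("rb") as handle:
        return classify_header(handle.read(8))
```

A .ppt file renamed to .pptx would still be classified as `"ole2"` here, which is consistent with python-pptx failing to find the package relationships it expects.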

My question to the community is whether there is a more elegant way to solve this: I want to parse text from .ppt files, and python-pptx does not handle that file type.


1 Answer


Your approach is the way I would do it, perhaps as a batched process before starting python-pptx processing. I would probably use IronPython for accessing the MS API, but it's essentially the same approach.
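As a rough sketch of that batched pre-processing step, something like the following could convert a whole folder with a single PowerPoint instance before any python-pptx work starts. It assumes Windows with PowerPoint and pywin32 installed; `pending_conversions` and `convert_folder` are hypothetical names, and 24 is the value of PowerPoint's `ppSaveAsOpenXMLPresentation` constant. Passing `WithWindow=False` to avoid showing a window is a choice made for this sketch, not something the question's code does.

```python
from pathlib import Path

# PpSaveAsFileType.ppSaveAsOpenXMLPresentation in the PowerPoint object model
PP_SAVE_AS_OPEN_XML_PRESENTATION = 24


def pending_conversions(folder):
    """Yield (source, target) pairs for .ppt files that have no .pptx sibling yet."""
    for source in sorted(Path(folder).glob("*.ppt")):
        target = source.with_suffix(".pptx")
        if not target.exists():
            yield source, target


def convert_folder(folder):
    """Convert every pending .ppt in `folder`, reusing one PowerPoint instance."""
    # Imported lazily so the planning helper above works without pywin32.
    import win32com.client

    # PowerPoint expects absolute paths, so resolve them up front.
    jobs = [(s.resolve(), t.resolve()) for s, t in pending_conversions(folder)]
    if not jobs:
        return []
    app = win32com.client.Dispatch("PowerPoint.Application")
    converted = []
    try:
        for source, target in jobs:
            deck = app.Presentations.Open(str(source), WithWindow=False)
            try:
                deck.SaveAs(str(target), PP_SAVE_AS_OPEN_XML_PRESENTATION)
                converted.append(target)
            finally:
                deck.Close()  # close the file even if SaveAs fails
    finally:
        app.Quit()  # always shut PowerPoint down
    return converted
```

The returned paths can then be handed to `Presentation(...)` from python-pptx as usual. Skipping files that already have a .pptx sibling keeps the batch safe to re-run.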

It's possible you could do this with a Python library that drives the LibreOffice or OpenOffice libraries as an alternative (PyOO is an example). This might have the advantage of not requiring Windows, but it will still be essentially "scripting" a running Office application to do the work; it's not a direct library interface. This means it's probably not well-suited to running reliably server-side, if that's what you're after.
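For illustration only, here is a sketch of the LibreOffice route that uses LibreOffice's command-line converter rather than PyOO. It assumes a `soffice` executable is on the PATH; the function names and the timeout value are hypothetical. It still launches a LibreOffice process for each call, so the caveat above about server-side reliability applies.

```python
import subprocess
from pathlib import Path


def build_soffice_command(source, outdir, binary="soffice"):
    """Build the headless LibreOffice command that converts `source` to .pptx."""
    return [
        binary,
        "--headless",            # run without a GUI
        "--convert-to", "pptx",  # target format
        "--outdir", str(outdir),
        str(source),
    ]


def convert_with_libreoffice(source, outdir, timeout=120):
    """Run the conversion and return the path where the .pptx is expected."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if soffice exits with an error code.
    subprocess.run(build_soffice_command(source, outdir), check=True, timeout=timeout)
    return outdir / (Path(source).stem + ".pptx")
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to log or inspect the exact command before running it.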