Is there a way to get speaker notes accurately from a given PowerPoint file with Apache poi?

Question

I'm trying to transfer speaker notes from one PowerPoint file to another using Apache POI, and I can't get an accurate transfer.

After looking around a bit, I couldn't find many resources. I did find this link: How to get pptx slide notes text using apache poi?, and it works in most situations. But when features such as the slide master are involved in the original pptx, some text that isn't part of the speaker notes is interpreted as speaker notes.

XSLFNotes notes_src = slides_src[i].getNotes();
XSLFNotes notes_dst = ppt_dst.getNotesSlide(slides_dst[i]);

This is all inside a for loop where i is the iteration number. Here I'm getting slide i for the source and the corresponding slide i from the destination file.

for (XSLFShape shape_src : notes_src) {
    if (shape_src instanceof XSLFTextShape) {
        XSLFTextShape txShape = (XSLFTextShape) shape_src;
        for (XSLFTextParagraph xslfParagraph : txShape.getTextParagraphs()) {

Here I'm getting the text from the notes of the slide. The if statement below is where I have to start filtering out some "speaker" notes which aren't actually speaker notes (for example, the slide number is somehow interpreted as a note; a copyright line is also picked up).

    if (!(xslfParagraph.getText().startsWith("" + (i + 1)) && xslfParagraph.getText().length() < 3) && !(xslfParagraph.getText().startsWith("Copyright ©"))) {
        for (XSLFTextShape shape_dst : notes_dst.getPlaceholders()) {
            if (shape_dst.getTextType() == Placeholder.BODY) {
                shape_dst.setText(shape_dst.getText() + xslfParagraph.getText() + "\n");

The statement below is yet another filter; if the presentation uses master slide features, the placeholder text "Click to edit Master text styles..." is interpreted as speaker notes as well.

    shape_dst.setText(shape_dst.getText().replace("Click to edit Master text styles", "").replace("Second level", "").replace("Third level", "").replace("Fourth level", "").replace("Fifth level", ""));
}}}}}}

In short, things that aren't speaker notes are appearing as "notes". There aren't many resources online about this subject; can someone help?

Axel Richter · Accepted Answer · 2019-08-25T15:51:33

What XSLFSlide.getNotes returns is the notes slide. It may contain not only the body text shape holding the notes but also text shapes filled via other placeholders such as header, footer, date/time and slide number. To determine what kind of text shape one has, one can get the placeholder type from the shape:

CTShape cTShape = (CTShape)shape.getXmlObject(); 
STPlaceholderType.Enum type = cTShape.getNvSpPr().getNvPr().getPh().getType();

Then one could get only text shapes of type STPlaceholderType.BODY.

Example:

import java.io.FileInputStream;

import org.apache.poi.xslf.usermodel.*;

import org.openxmlformats.schemas.presentationml.x2006.main.CTShape;
import org.openxmlformats.schemas.presentationml.x2006.main.STPlaceholderType;

import java.util.List;

public class PowerPointReadNotes {

 public static void main(String[] args) throws Exception {

  XMLSlideShow slideShow = new XMLSlideShow(new FileInputStream("PowerPointHavingNotes.pptx"));

  List<XSLFSlide> slides = slideShow.getSlides();
  for (XSLFSlide slide : slides) {
   XSLFNotes notes = slide.getNotes();
   for (XSLFShape shape : notes) {
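    if (!(shape.getXmlObject() instanceof CTShape)) continue; // e.g. pictures are not CTShape
    CTShape cTShape = (CTShape)shape.getXmlObject();
    if (cTShape.getNvSpPr().getNvPr().getPh() == null) continue; // shape is not a placeholder
    STPlaceholderType.Enum type = cTShape.getNvSpPr().getNvPr().getPh().getType();
    System.out.println("type: " + type);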
    if (type == STPlaceholderType.BODY) { // get only shapes of type BODY
     if (shape instanceof XSLFTextShape) {
      XSLFTextShape textShape = (XSLFTextShape) shape;
      for (XSLFTextParagraph paragraph : textShape) {
       System.out.println(paragraph.getText());
      }
     }
    }
   }
  }
 }
}

Possible types are BODY, CHART, CLIP_ART, CTR_TITLE, DGM, DT, FTR, HDR, MEDIA, OBJ, PIC, SLD_IMG, SLD_NUM, SUB_TITLE, TBL, TITLE.

Unfortunately, there is no publicly available documentation for the ooxml-schemas classes. So one needs to download the sources of ooxml-schemas and run javadoc on them to get API documentation that describes the classes and methods.

There one finds the org.openxmlformats.schemas.presentationml.x2006.main.* classes, which are the classes for the presentation part of Office Open XML. One can open /org/openxmlformats/schemas/presentationml/x2006/main/CTShape.html in the API documentation created by javadoc and then follow getNvSpPr() - getNvPr() - getPh() - getType().
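To make that chain less abstract, here is a minimal sketch of what it walks through in the underlying notes slide XML: each p:sp shape has p:nvSpPr, then p:nvPr, then p:ph with a type attribute. The sketch uses only the JDK's DOM parser, not Apache POI. The class name NotesXmlSketch, the method bodyParagraphs and the SAMPLE fragment are made up for illustration; the XML is hand-written and heavily trimmed, not taken from a real file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NotesXmlSketch {

 // Namespace URIs of PresentationML (p:) and DrawingML (a:).
 static final String P = "http://schemas.openxmlformats.org/presentationml/2006/main";
 static final String A = "http://schemas.openxmlformats.org/drawingml/2006/main";

 // Hand-written, trimmed notes slide XML (an assumption for illustration only).
 static final String SAMPLE =
   "<p:notes xmlns:p='" + P + "' xmlns:a='" + A + "'><p:cSld><p:spTree>"
   + "<p:sp><p:nvSpPr><p:nvPr><p:ph type='body' idx='1'/></p:nvPr></p:nvSpPr>"
   + "<p:txBody><a:p><a:r><a:t>Remember to </a:t></a:r>"
   + "<a:r><a:t>greet the audience.</a:t></a:r></a:p></p:txBody></p:sp>"
   + "<p:sp><p:nvSpPr><p:nvPr><p:ph type='sldNum' idx='10'/></p:nvPr></p:nvSpPr>"
   + "<p:txBody><a:p><a:fld><a:t>1</a:t></a:fld></a:p></p:txBody></p:sp>"
   + "<p:sp><p:nvSpPr><p:nvPr><p:ph type='ftr' idx='11'/></p:nvPr></p:nvSpPr>"
   + "<p:txBody><a:p><a:r><a:t>Copyright</a:t></a:r></a:p></p:txBody></p:sp>"
   + "</p:spTree></p:cSld></p:notes>";

 // Returns the text of each paragraph found in shapes whose placeholder type is "body".
 static List<String> bodyParagraphs(String notesXml) {
  Document doc;
  try {
   DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
   factory.setNamespaceAware(true);
   doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(notesXml)));
  } catch (Exception e) {
   throw new IllegalStateException("Could not parse notes XML", e);
  }

  List<String> result = new ArrayList<>();
  NodeList shapes = doc.getElementsByTagNameNS(P, "sp");
  for (int i = 0; i < shapes.getLength(); i++) {
   Element shape = (Element) shapes.item(i);
   NodeList ph = shape.getElementsByTagNameNS(P, "ph");
   if (ph.getLength() == 0) continue; // shape is not a placeholder
   String type = ((Element) ph.item(0)).getAttribute("type");
   if (!"body".equals(type)) continue; // skip slide number, footer, ...
   NodeList paragraphs = shape.getElementsByTagNameNS(A, "p");
   for (int j = 0; j < paragraphs.getLength(); j++) {
    // Concatenate all text runs (a:t) of one paragraph (a:p).
    NodeList texts = ((Element) paragraphs.item(j)).getElementsByTagNameNS(A, "t");
    StringBuilder sb = new StringBuilder();
    for (int k = 0; k < texts.getLength(); k++) {
     sb.append(texts.item(k).getTextContent());
    }
    result.add(sb.toString());
   }
  }
  return result;
 }

 public static void main(String[] args) {
  System.out.println(bodyParagraphs(SAMPLE)); // prints [Remember to greet the audience.]
 }
}
```

The slide number and footer shapes carry text too, but they are skipped because their placeholder type is not body. With Apache POI the same decision is made through the generated classes rather than by parsing the XML by hand.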

Using the current apache poi 4.1.0, there is an enum Placeholder in the high-level API which can also be used.

Example:

import java.io.FileInputStream;

import org.apache.poi.xslf.usermodel.*;
import org.apache.poi.sl.usermodel.Placeholder;

import java.util.List;

public class PowerPointReadNotesHL {

 public static void main(String[] args) throws Exception {

  XMLSlideShow slideShow = new XMLSlideShow(new FileInputStream("PowerPointHavingNotes.pptx"));

  List<XSLFSlide> slides = slideShow.getSlides();
  for (XSLFSlide slide : slides) {
   XSLFNotes notes = slide.getNotes();
   for (XSLFShape shape : notes) {
    Placeholder placeholder = shape.getPlaceholder();
    System.out.println("placeholder: " + placeholder); 
    if (placeholder == Placeholder.BODY) { // get only shapes of type BODY
     if (shape instanceof XSLFTextShape) {
      XSLFTextShape textShape = (XSLFTextShape) shape;
      for (XSLFTextParagraph paragraph : textShape) {
       System.out.println(paragraph.getText());
      }
     }
    }
   }
  }
 }
}

Then direct use of the low-level ooxml-schemas classes is not necessary.
