
I followed this documentation as closely as I could:

https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/xml/XmlIO.html

My code is below:

public static void main(String args[])
{
    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setTempLocation("gs://balajee_test/stagging");
    options.setProject("test-1-130106");

    Pipeline p = Pipeline.create(options);

    PCollection<XMLFormatter> record = p.apply(XmlIO.<XMLFormatter>read()
            .from("gs://balajee_test/sample_3.xml")
            .withRootElement("book")
            .withRecordElement("author")
            .withRecordElement("title")
            .withRecordElement("genre")
            .withRecordElement("price")
            .withRecordElement("description")
            .withRecordClass(XMLFormatter.class));

    record.apply(ParDo.of(new DoFn<XMLFormatter, String>() {
        @ProcessElement
        public void processElement(ProcessContext c)
        {
            System.out.println(c.element().getAuthor());
        }
    }));

    p.run();
}

I get a 'null' value for every XML element. Could you please review my code and suggest how to fix it?

Test File

package com.bitwise.cloud;

import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlType;

@XmlRootElement(name = "book")
@XmlType(propOrder = {"author", "title", "genre", "price", "description"})
public class XMLFormatter {
    private String author;
    private String title;
    private String genre;
    private String price;
    private String description;

    public XMLFormatter() { }

    public XMLFormatter(String author, String title, String genre, String price, String description) {
        this.author = author;
        this.title = title;
        this.genre = genre;
        this.price = price;
        this.description = description;
    }

    @XmlElement
    public void setAuthor(String author) {
        this.author = author;
    }

    public String getAuthor() {
        return author;
    }

    @XmlElement
    public void setTitle(String title) {
        this.title = title;
    }

    public String getTitle() {
        return title;
    }

    @XmlElement
    public void setGenre(String genre) {
        this.genre = genre;
    }

    public String getGenre() {
        return genre;
    }

    @XmlElement
    public void setPrice(String price) {
        this.price = price;
    }

    public String getPrice() {
        return price;
    }

    @XmlElement
    public void setDescription(String description) {
        this.description = description;
    }

    public String getDescription() {
        return description;
    }
}
What runner are you using? The DirectRunner? The DataflowRunner? Something else? Do you have a job ID for the failing pipeline? - Ben Chambers
I tried both the DirectRunner and the DataflowRunner. The job ID for the failing pipeline is 2017-08-30_02_55_24-4448720439481076797. Would you mind sharing a working code sample for reading an XML file? - Balajee Venkatesh
@BenChambers Could you please review my code? I still haven't been able to resolve this issue. A working snippet would be very helpful. - Balajee Venkatesh
Can you provide your test file? - Lara Schmidt
Have you tried running a DirectPipeline from a local txt file instead of a GCS file? That would tell you whether the problem is in reading from GCS or in the XML formatter. Alternatively, you could read the GCS file and write the output line by line to verify that the file can be read. - Lara Schmidt

1 Answer


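The XmlIO.Read PTransform doesn't support multiple record elements (author, title, genre, etc.). You have to provide exactly one root element and exactly one record element, and every record in your XML document has to use that same record element. In your code, author, title, genre, price and description are fields inside a record, not record elements themselves. Since XMLFormatter is annotated with @XmlRootElement(name = "book"), book is most likely your record element, and the root element is whatever element wraps all the book entries. See the example in the XmlIO source at the following location.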

https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlIO.java#L59
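
To make the root/record layout concrete, here is a minimal sketch that uses only the JDK's StAX parser, not Beam, so it runs without any dependencies. The sample document, the root name catalog, the placeholder values, and the helper readField are all assumptions for illustration, since the actual sample_3.xml is not shown. For a file shaped like this, the corresponding XmlIO settings would presumably be withRootElement("catalog"), a single withRecordElement("book"), and withRecordClass(XMLFormatter.class).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class RecordLayoutSketch {
    // Hypothetical document: one root element (<catalog>) wrapping repeated
    // record elements (<book>). Names and values are placeholders.
    static final String SAMPLE =
        "<catalog>"
      + "<book><author>Author A</author><title>Title A</title></book>"
      + "<book><author>Author B</author><title>Title B</title></book>"
      + "</catalog>";

    /** Collects the text of every fieldName element found inside a recordElement. */
    static List<String> readField(String xml, String recordElement, String fieldName) {
        List<String> values = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            boolean insideRecord = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals(recordElement)) {
                        insideRecord = true;            // a new record starts here
                    } else if (insideRecord && name.equals(fieldName)) {
                        values.add(reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals(recordElement)) {
                    insideRecord = false;               // the record is complete
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the XML input", e);
        }
        return values;
    }

    public static void main(String[] args) {
        // One value per <book> record.
        for (String author : readField(SAMPLE, "book", "author")) {
            System.out.println(author);
        }
    }
}
```

The point of the sketch is the shape: each book is one record, and author and title are read as fields of that record rather than being named as records themselves.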