0 votes

In Pentaho Kettle, I configured the RSS Input step with some URLs. When I run the transformation it works fine most of the time, but sometimes it fails with the following error:

2016/06/29 13:10:48 - RSS Input.0 - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : Unexpected Exception : it.sauronsoftware.feed4j.FeedXMLParseException: org.dom4j.DocumentException: Error on line -1 of document  : Premature end of file. Nested exception: Premature end of file.
2016/06/29 13:10:48 - RSS Input.0 - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : it.sauronsoftware.feed4j.FeedXMLParseException: org.dom4j.DocumentException: Error on line -1 of document  : Premature end of file. Nested exception: Premature end of file.
2016/06/29 13:10:48 - RSS Input.0 -     at it.sauronsoftware.feed4j.FeedParser.parse(FeedParser.java:53)
2016/06/29 13:10:48 - RSS Input.0 -     at org.pentaho.di.trans.steps.rssinput.RssInput.readNextUrl(RssInput.java:168)
2016/06/29 13:10:48 - RSS Input.0 -     at org.pentaho.di.trans.steps.rssinput.RssInput.getOneRow(RssInput.java:198)
2016/06/29 13:10:48 - RSS Input.0 -     at org.pentaho.di.trans.steps.rssinput.RssInput.processRow(RssInput.java:312)
2016/06/29 13:10:48 - RSS Input.0 -     at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
2016/06/29 13:10:48 - RSS Input.0 -     at java.lang.Thread.run(Thread.java:745)
2016/06/29 13:10:48 - RSS Input.0 - Caused by: org.dom4j.DocumentException: Error on line -1 of document  : Premature end of file. Nested exception: Premature end of file.
2016/06/29 13:10:48 - RSS Input.0 -     at org.dom4j.io.SAXReader.read(SAXReader.java:482)
2016/06/29 13:10:48 - RSS Input.0 -     at org.dom4j.io.SAXReader.read(SAXReader.java:291)
2016/06/29 13:10:48 - RSS Input.0 -     at it.sauronsoftware.feed4j.FeedParser.parse(FeedParser.java:37)
2016/06/29 13:10:48 - RSS Input.0 -     ... 5 more

I am using the default RSS Input step that comes with Kettle; here is a screenshot of its configuration:

[screenshot of the RSS Input step configuration]

And these are the links I have configured as RSS feed URLs:

[screenshot of the configured feed URLs]

How can I resolve this issue? Even when I run the step against just one of the links, the same error shows up occasionally. Is there some problem with this plugin?

More details about this exception are here: stackoverflow.com/questions/10022796/… – simar
It looks like one of your feeds is sometimes unavailable, or the network connection is unstable, or the RSS server just drops the connection. – simar
You can try to use a User Defined Java Class step to manually download and parse the content of the RSS feed. You will gain control over the connection timeout, over how to handle such an error, and you can retry if the first attempt fails (see the sketch after these comments). – simar
Try to set "Number of rows in rowset" to 1. This will minimize the chance of such an error. – simar
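A minimal sketch of that download-with-timeout-and-retry idea, using plain java.net classes that could be called from a User Defined Java Class step or from a small standalone test. The class name, method name, timeout values and the single retry are only illustrative choices, not part of the original suggestion:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class FeedFetcher {

    // Downloads the raw feed XML, retrying once if the first attempt fails.
    // Timeouts and the retry count are example values, not recommendations.
    public static byte[] fetchFeed(String feedUrl) throws IOException {
        IOException lastFailure = null;
        for (int attempt = 0; attempt < 2; attempt++) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(feedUrl).openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(15000);
                try (InputStream in = conn.getInputStream();
                     ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                    byte[] buffer = new byte[8192];
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                    return out.toByteArray();
                }
            } catch (IOException e) {
                lastFailure = e; // remember the failure and retry once
            }
        }
        throw lastFailure;
    }
}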

2 Answers

1 vote

If it is really necessary, you can manually adjust the source code.

Just get the source of feed4j. It is quite old, so there is only a single version.

Open the file it.sauronsoftware.feed4j.FeedParser.java in an editor.

It has a single method, parse:

public static Feed parse(URL url) {
    SAXReader saxReader = new SAXReader();
    Document document = saxReader.read(url);
    ...

Good stuff: SAXReader has several overloaded read methods, and one of them is what you need:

   saxReader.read(InputStream is)

Instead of passing the URL to the read method, write code that downloads the data from the URL using HttpClient (the good news is that it is bundled with Kettle/PDI; to check the version, look at $KETTLE-HOME/lib/commons-httpclient-x.x.jar).

Then wrap the data received from the server by HttpClient in a ByteArrayInputStream and pass it to the SAXReader.

Build the library and replace feed4j-1.0.jar with your version.

And you are done.

The code will look something like this:

public static Feed parse(URL url) {
    SAXReader saxReader = new SAXReader();
    CloseableHttpClient client = HttpClients.createDefault();
    HttpGet get = new HttpGet(url.toString());
    CloseableHttpResponse response = client.execute(get);
    HttpEntity entity = response.getEntity();
    // assumes the server sends a Content-Length header (see the notes below)
    byte[] b = new byte[(int) entity.getContentLength()];
    // a single read() is not guaranteed to fill the buffer, so loop until it is
    InputStream content = entity.getContent();
    int off = 0;
    int n;
    while (off < b.length && (n = content.read(b, off, b.length - off)) != -1) {
        off += n;
    }
    InputStream is = new ByteArrayInputStream(b);

    Document document = saxReader.read(is);
    ...

Extra details

  • You might need to add code to wrap a possible IOException in a FeedXMLParseException (see the sketch below)
  • This code assumes the server sends a Content-Length header in the response (the sketch below avoids that assumption)
  • Use a matching JDK version
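To cover the first two points, the download part could look roughly like the helper below. This is only a sketch: the class and method names are just for illustration, it assumes the same HttpClient 4.x classes used above are on the classpath, it uses EntityUtils.toByteArray so no Content-Length header is needed, and it assumes FeedXMLParseException has a constructor taking a message and a cause (adjust to whatever constructors the class actually provides).

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import it.sauronsoftware.feed4j.FeedXMLParseException;

public class FeedDownload {

    // Downloads the feed body and returns it as an InputStream suitable for
    // SAXReader.read(InputStream). EntityUtils.toByteArray reads the whole
    // entity, so no Content-Length header is required.
    public static InputStream download(URL url) throws FeedXMLParseException {
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url.toString()))) {
            byte[] body = EntityUtils.toByteArray(response.getEntity());
            return new ByteArrayInputStream(body);
        } catch (IOException e) {
            // assumption: FeedXMLParseException accepts (String, Throwable);
            // if not, use whatever constructor feed4j actually provides
            throw new FeedXMLParseException("Cannot download feed from " + url, e);
        }
    }
}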
1 vote

The main problem is www.ft.com.

For some reason, after a while the website's server drops the connection in the middle of the response, while a Python implementation is able to read all the data from the HTTP stream and parse it successfully.

It seems to me that the website's implementation for building the RSS response has some bug.

Kettle uses feed4j to parse RSS. The feed4j library uses a plain HttpConnection to open a stream and get the data.

I wrote some simple code to read from the HttpConnection I/O stream and the same thing happened to me: the web server drops the connection occasionally.

Requests to the same resource using Apache HttpClient work well: no errors, and all the data is received from the server.

My guess is that a request to http://ft.com needs to be a properly formed HTTP request, most probably with some well-formed headers.
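For what it's worth, a quick way to experiment with that guess outside of Kettle is to request the feed with a plain HttpURLConnection and explicit headers. This is only a test sketch; the User-Agent and Accept values are guesses at what a "well formed" request might need, not confirmed requirements of ft.com:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FeedHeaderTest {

    public static void main(String[] args) throws Exception {
        // pass one of the feed URLs configured in the RSS Input step as the first argument
        URL url = new URL(args[0]);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // guessed "well formed" headers; not confirmed requirements of ft.com
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.setRequestProperty("Accept", "application/rss+xml, application/xml;q=0.9, */*;q=0.8");
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);

        int status = conn.getResponseCode();
        int chars = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                chars += line.length();
            }
        }
        // if the server drops the connection mid-stream, the loop above fails
        // with an IOException instead of reaching this line
        System.out.println("HTTP " + status + ", read " + chars + " characters");
    }
}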