Parsing RSS feed with Simple HTML DOM not working for all elements

Question

I am trying to get the title, description, link, image and date of each item from this RSS feed: http://www.autoexpress.co.uk/car-news/feed/. I don't understand why the text of the link tag and the src of the img tag are impossible to get, while the rest work fine. This is what I tried:

<?php
    include "testing3/lib/simple_html_dom.php";
    $url = 'http://www.autoexpress.co.uk/car-news/feed';
    $rss= file_get_html($url);
    $items = $rss->find('item');
    foreach ($items as $article) {
        $title[] = $article->find('title',0)->plaintext;
        $description[] = $article->find('description',0)->plaintext;
        $link[] = $article->find('link', 0)->plaintext;
        $image[] = $article->find('img', 0);
        $date[] = $article->find('pubDate', 0)->plaintext;
    }
    echo 'Title is '.$title[0].'<br>';
    echo 'Description is '.strip_tags(html_entity_decode($description[0])).'<br>';
    echo 'Link is '.$link[1].'<br>';
    echo 'Date is '.$date[1].'<br>';
    echo 'Image Source is '.$image[1];
?>

This is the output:

    Title is Fiat Panda 4x4 Antarctica review - pictures
    Description is Pictures See all 8 pictures 24 May, 2014
    Link is
    Date is Fri, 23 May 2014 16:29:39 +0000
    Image Source is

With var_dump($link); I get an array of empty strings:

array(40) { [0]=> string(0) "" [1]=> string(0) "" [2]=> string(0) "" etc

var_dump($image); gives the same thing, except that the values are NULL. What am I doing wrong?

Hektor · Accepted Answer · 2014-05-24T15:33:51

Straight off the bat, that's a pretty nasty-looking RSS feed. My guess is your library isn't capable of dealing with nested/escaped RSS tags. Since no-one's got back to you in 40-odd minutes, here's the bog-standard approach:

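    $rssfeed = simplexml_load_file('http://www.autoexpress.co.uk/car-news/feed');
    foreach ($rssfeed->channel as $channel) {
        echo '<ul>';
        foreach ($channel->item as $item) {
            // Cast each SimpleXMLElement to a string before escaping it.
            echo '<li><a href="' . htmlentities((string) $item->link) . '">';
            echo htmlentities((string) $item->title);
            echo '</a>';
            // The description holds escaped HTML, including the img tag.
            echo htmlentities((string) $item->description);
            // Prints nothing: <item> has no <img> child element.
            echo htmlentities((string) $item->img);
            echo htmlentities((string) $item->pubDate);
            echo '</li>';
        }
        echo '</ul>';
    }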

Yup, that doesn't even use the library you cited at the top of your question, but it grabs the required data, escaped img tag included, even if it needs some serious clean-up afterwards.

Actually, I think this script fails on the img tag, but that's because the escaped img tag is nested inside the description rather than being a child of the item, so $item->img finds nothing.
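
If the image really is escaped inside the description, one way to pull out its src is to take the description string from SimpleXML and parse it as an HTML fragment with DOMDocument. This is only a rough sketch: the sample feed below is made up so the snippet runs without network access, and firstImageSrc() and parseItems() are hypothetical helper names, not part of any library. I'm also assuming the real feed's descriptions look roughly like the sample.

```php
<?php
// Made-up sample standing in for the real feed (assumption: the img tag
// is entity-escaped inside <description>, as described above).
$sampleXml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example article</title>
      <link>http://example.com/article</link>
      <description>&lt;img src="http://example.com/pic.jpg" alt="" /&gt; Some summary text</description>
      <pubDate>Fri, 23 May 2014 16:29:39 +0000</pubDate>
    </item>
  </channel>
</rss>
XML;

// Hypothetical helper: src of the first <img> in an HTML fragment, or null.
function firstImageSrc(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Collect parser warnings about fragment markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = $doc->getElementsByTagName('img');
    return $images->length > 0 ? $images->item(0)->getAttribute('src') : null;
}

// Hypothetical helper: one associative array per <item> in the feed.
function parseItems(string $xml): array
{
    // LIBXML_NOCDATA merges CDATA sections into plain text nodes.
    $rss = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);
    if ($rss === false) {
        return [];
    }
    $items = [];
    foreach ($rss->channel->item as $item) {
        // SimpleXML has already decoded the entities, so this is real HTML.
        $descriptionHtml = (string) $item->description;
        $items[] = [
            'title'       => (string) $item->title,
            'link'        => (string) $item->link,
            'date'        => (string) $item->pubDate,
            'description' => trim(strip_tags($descriptionHtml)),
            'image'       => firstImageSrc($descriptionHtml),
        ];
    }
    return $items;
}

$items = parseItems($sampleXml);
print_r($items[0]);
```

For the live feed you would swap simplexml_load_string() for simplexml_load_file() with the feed URL. As for the empty link values in the original attempt, that is probably because Simple HTML DOM treats link as a self-closing HTML tag and so never captures its text; an XML parser has no such rule.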
