4
votes

I need to parse a large XML file (>1 GB) which is located on an FTP server. I have an FTP stream acquired via ftp_connect(). (I use this stream for other FTP-related actions.)

I know XMLReader is preferred for large XML files, but it only accepts a URI, so I assume a stream wrapper will be required. The only FTP function I know of that lets me retrieve just a small part of the file is ftp_nb_fget() in combination with ftp_nb_continue().

However, I do not know how to put all of this together while keeping memory usage to a minimum.

3
If you have to parse the entire file, it may be better to simply download the whole thing first and work off that, rather than having to mess with streams and whatnot. - Marc B
That would certainly be easier, but I feel it would be more efficient to download and parse at the same time. I have no need to save the file to the hard drive. - Troubled Magento user
I guess if the XML doesn't nest itself too deep (e.g. a single <val>1</val> nested a kajillion layers down), parse-as-you-go would make sense then. - Marc B
The entries (elements) won't be larger than a few KB and shouldn't nest deeper than 4 levels. I think my biggest problem is memory management: ftp_nb_fget() won't let me specify how many bytes to read, and I'm not sure how it would react if I remove data from the file handle ($handle) when I'm finished with that data. - Troubled Magento user

3 Answers

0
votes

It looks like you may need to build on top of the low-level XML parser bits.

In particular, you can use xml_parse to process the XML one chunk at a time, after calling the various xml_set_* functions to register callbacks for elements, character data, namespaces, entities, and so on. Those callbacks fire whenever the parser has accumulated enough data, which means you can process the file as you read it in arbitrarily-sized chunks from the FTP server.


Proof of concept using CLI and xml_set_default_handler, which will get called for everything that doesn't have a specific handler:

php > $p = xml_parser_create('utf-8');
php > xml_set_default_handler($p, function() { print_r(func_get_args()); });
php > xml_parse($p, '<a');
php > xml_parse($p, '>');
php > xml_parse($p, 'Foo<b>Bar</b>Baz');
Array
(
    [0] => Resource id #3
    [1] => <a>
)
Array
(
    [0] => Resource id #3
    [1] => Foo
)
Array
(
    [0] => Resource id #3
    [1] => <b>
)
Array
(
    [0] => Resource id #3
    [1] => Bar
)
Array
(
    [0] => Resource id #3
    [1] => </b>
)
php > xml_parse($p, '</a>');
Array
(
    [0] => Resource id #3
    [1] => Baz
)
Array
(
    [0] => Resource id #3
    [1] => </a>
)
php >
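Building on the proof of concept above, here is a minimal sketch of how the chunked feeding might look in practice. It assumes the FTP file is exposed as an ordinary readable stream handle (for example via PHP's ftp:// stream wrapper); the element handlers below just print open tags and are purely illustrative:

```php
<?php
// Sketch: push arbitrarily-sized chunks from a stream into the XML
// push parser. Works on any readable stream handle, so only the
// chunks currently being read are held in memory.
function parseXmlStream($handle, int $chunkSize = 8192): void
{
    $parser = xml_parser_create('UTF-8');
    $depth = 0;
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$depth) {
            // note: the parser upper-cases names unless case folding is disabled
            echo str_repeat('  ', $depth), "<$name>\n";
            $depth++;
        },
        function ($p, $name) use (&$depth) {
            $depth--;
        }
    );

    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break;
        }
        // the final call (is_final = true) lets the parser verify the
        // document actually ended cleanly
        if (xml_parse($parser, $chunk, feof($handle)) === 0) {
            throw new RuntimeException(
                xml_error_string(xml_get_error_code($parser))
            );
        }
    }
    xml_parser_free($parser);
}
```

Usage would then be something like `parseXmlStream(fopen('ftp://user:pass@example.com/big.xml', 'r'))` (placeholder credentials and path).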
0
votes

This will depend on the schema of your XML file. But if it's similar to RSS in that it's really just a long list of items (all wrapped in a root tag), then what I've done is to split out the individual sections and parse each one as its own DOMDocument:

$buffer = '';
while ($line = getLineFromFtp()) {
    $buffer .= $line;
    if (strpos($line, '</item>') !== false) {
        parseBuffer($buffer);
        $buffer = '';
    }
}

That's pseudocode, but it's a lightweight way of handling a specific type of XML file without building your own XMLReader-based solution. You'd of course need to check for opening tags as well, to ensure the buffer always holds a well-formed XML fragment.

Note that this won't work with all XML documents. But where it fits, it's an easy and clean way of doing it while keeping your memory footprint as low as possible.
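A slightly more fleshed-out version of the pseudocode above might look like the following. It assumes a flat list of `<item>` elements with the closing tag at the end of a line, reads line by line from any stream handle (an ftp:// stream would work the same way), and hands each complete item to a callback as a SimpleXMLElement:

```php
<?php
// Sketch: buffer lines until a complete <item>...</item> fragment has
// accumulated, then parse just that fragment. Only one item is ever
// held in memory at a time.
function parseItems($handle, callable $onItem): void
{
    $buffer = '';
    $inItem = false;
    while (($line = fgets($handle)) !== false) {
        if (!$inItem && ($pos = strpos($line, '<item')) !== false) {
            $inItem = true;
            // drop anything on the line before the opening tag
            $line = substr($line, $pos);
        }
        if ($inItem) {
            $buffer .= $line;
            if (strpos($line, '</item>') !== false) {
                $onItem(simplexml_load_string($buffer));
                $buffer = '';
                $inItem = false;
            }
        }
    }
}
```

This breaks if two items share a line or a tag is split across lines, so it is only a sketch of the buffering idea, not a general solution.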

0
votes

Hmm, I never tried that with FTP, but setting the stream context that XMLReader uses can be done with libxml_set_streams_context().

Then just pass the FTP URI to open().

EDIT: Note that you can use the stream context for other actions as well. If you are uploading files, you can probably reuse the same stream context in combination with file_put_contents(), so you don't necessarily need any of the ftp_* functions at all.
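Putting that together, a minimal sketch might look like this. The element name `item` and the FTP URI in the usage note are placeholders; the context options shown are just an example of what you might pass:

```php
<?php
// Sketch: install a stream context for libxml, then let XMLReader pull
// the document through it one node at a time. Counts <item> elements
// without ever loading the whole file.
function countItems(string $uri, $context = null): int
{
    if ($context !== null) {
        libxml_set_streams_context($context);
    }
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("cannot open $uri");
    }
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
            $count++;
        }
    }
    $reader->close();
    return $count;
}
```

Usage against FTP would then be something like `countItems('ftp://user:pass@example.com/big.xml', stream_context_create(['ftp' => ['overwrite' => false]]))` (hypothetical host and credentials).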