I'm using lxml and Python 3 to parse many files and merge the ones that belong together. The files are stored in pairs (which are also merged first) inside zip files, but I don't think that matters here.

We're talking about 100k files that are about 900MB in zipped form.

My problem is that the script works fine, but at some point I get this error (over multiple runs it's not always the same point, so it shouldn't be a problem with one particular file):

  File "C:\Users\xxx\workspace\xxx\src\zip2xml.py", line 110, in _writetonorm
    normroot.getroottree().write(norm_file_path)
  File "lxml.etree.pyx", line 1866, in lxml.etree._ElementTree.write (src/lxml\lxml.etree.c:46006)
  File "serializer.pxi", line 481, in lxml.etree._tofilelike (src/lxml\lxml.etree.c:93719)
  File "serializer.pxi", line 187, in lxml.etree._raiseSerialisationError (src/lxml\lxml.etree.c:90965)
lxml.etree.SerialisationError: IO_WRITE

I have no idea what causes this error. The entire code is a little cumbersome, so I hope the relevant parts suffice:

# module-level imports used by this method
import os

from lxml import etree

def _writetonorm(self, outputpath):
    '''Writes the current XML to a file. 
    It'll update the file if it already exists and create the file otherwise'''

    #Find Name
    name = None
    try:
        name = self._xml.xpath("xxx")[0].text.strip()
    except Exception:
        try:
            name = self._xml.xpath("xxx")[0].text.strip()
        except Exception:
            name = "damn it!"

    if name is not None:
        #clean name a bit
        name = name[:35]
        table = str.maketrans(' /#*"$!&<>-:.,;()','_________________')
        name = name.translate(table)
        name = name.strip("_-")

        #generate filename
        norm_file_name = name + ".xml"
        norm_file_path = os.path.join(outputpath, norm_file_name) 

        #Check if we have that completefile already. If we do, update it.            
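        if os.path.isfile(norm_file_path):
            try:
                # etree.parse() is what fails on a broken file, so it belongs inside the try
                norm_file = etree.parse(norm_file_path, self._parser)
                normroot = norm_file.getroot()
            except etree.XMLSyntaxError:
                print(norm_file_path + " is broken !!!!")
                # re-raise: there is no usable normroot to continue with
                raise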
        else:
            normroot = etree.Element("norm")
        jurblock = etree.Element("jurblock")
        self._add_jurblok_attributes(jurblock)
        jurblock.insert(0, self._xml)
        normroot.insert(0, jurblock)
        try:
            normroot.getroottree().write(norm_file_path) #here the Exception occurs
        except Exception:
            print(norm_file_path)
            raise

I know that my exception handling isn't great, but this is just a proof of concept for now. Can anyone tell me why the error happens?

Looking at the file that causes the error, it's not well-formed, but I suspect that is a result of the error and that it was fine before the latest iteration.

It looks like an ordinary I/O error. Do you access these files simultaneously (multiple worker threads or something similar), or might a different program write to those files while your script is running? In the first case you may hit a system limit (number of open files); in the second case a program may lock the file you want to write. - Tupteq
I was worried about that, so I currently run it single-threaded, but I still get the error at some point. I'm always using context managers, so everything should be closed properly. - pypat
Not good. Last idea: try changing the environment. Run the script on a different computer or OS, try a different version of lxml and maybe even a different Python version. It would be good if you could isolate the error (provide small, self-contained code and data that reproduce it) so I could check it myself. - Tupteq
Thanks for your effort, but even if I could provide a small isolated snippet you'd also need the 900MB of data, because it occurs very randomly :) - pypat

1 Answer


It seems to have been a mistake to use mapped network drives for this. The exception no longer occurs when the script works with the files locally.

Learned something :)
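
For anyone who still needs the results to end up on the network drive, here is a minimal sketch of one way to keep all the reading and writing local and copy the output over only once at the end. The helper name `run_locally_then_copy` and the `process` callable are hypothetical stand-ins for the merge script, not something from my actual code, and `dirs_exist_ok` assumes Python 3.8 or newer.

```python
import shutil
import tempfile


def run_locally_then_copy(process, final_dir):
    """Run process(local_dir) on local disk, then copy the results to final_dir.

    process is a hypothetical callable standing in for the merge script;
    it receives the directory it should read and write its output files in.
    """
    # The temporary directory lives in the system temp location on local disk.
    with tempfile.TemporaryDirectory() as local_dir:
        process(local_dir)
        # Copy once, after all the writing is done (dirs_exist_ok needs Python 3.8+).
        shutil.copytree(local_dir, final_dir, dirs_exist_ok=True)
```

In the script from the question, `process` would be whatever loops over the zip files and calls `_writetonorm(local_dir)`. The final copy still touches the network drive, but only once per file instead of on every update.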