3
votes

I have an xml file containing only one text line of length more than 32767. For now SAS truncates it at 32767-th character and stops reading this line further. The task is to split the input line into separate variables or separate observations. The code I use to read file is:

data out (drop=v_length);
    length xml_text $32767;
    retain xml_text v_length gr_split;
    infile tempxml encoding='utf-8' end=last;
    input;
    if _n_ = 1 then do;
        v_length = length(left(_infile_));
        gr_split = 1;
    end; else
        v_length=v_length+length(left(_infile_));
    if v_length gt 32767 then do;
        gr_split + 1;
        v_length=length(left(_infile_));
    end;
    if _n_ = 1 or v_length=length(left(_infile_)) then do;
        xml_text = compress(left(_infile_),,'c');
    end; else
        xml_text = trim(xml_text)||compress(left(_infile_),,'c');
    if last then do;
        call symput('NumOfTextGroups',gr_split);
        call symput('LastRow',_n_);
    end;
run;

When the whole xml length is no longer than 32767 the code produces a single cell. Otherwise it outputs n rows. In first case I can parse it in Oracle directly (once data is delivered there). In second, I first bring data to Oracle and there I assemble the cell to parse. However it works only when each line of xml file is less 32767 characters.

1
How are you reading it in? What code? SAS is perfectly capable of reading well over 32767 long lines, I've read xml files in with close to a million characters on a single line no problem.Joe
@Joe I updated my question with the code I use now.griboedov
@Joe I'd love to see the approach you used to do this. Did you then parse the XML once it was stored in SAS variables? If so, how, because you would have to search either across rows or across vars to find the closing tags.Robert Penridge
@RobertPenridge It was at a former employer so I don't have the code any longer, but I did it two different ways for different needs. One I used LIBNAME with an XML map - that works very well in most cases. The other I used delimited by ">" if I remember correctly to read it in. You can use @"<tag>"` as well.Joe
UFF!!! This is a big one! SAS DS2 has overcome the 32767 limitation. But that only works on structured SQL tables. I am not sure how DS2 will behave with XMLs. When I had to deal with that, I gave up and worked around by running a Python script called from a SAS Datastep program. That worked well for me.Salva

1 Answers

3
votes

If you're fine bringing in 32767 at a time, then you should use recfm=f (fixed record length). That will create 32767-long lines.

data for_oracle;
  infile "\wherever\blah.xml" lrecl=32767 recfm=f truncover;
  input @1 myline $CHAR32767.; *char is important in case a space exists that you care about at the start;
run;

You can still do some of the same stuff you do in your code (but don't need most of it, I suspect, other than compressing the control characters).