0
votes

We tried reading 3-4 GB of XML file using SAS XML mapper .but when we PROC COPY the data from the XML engine to SAS Dataset its taking almost 5 to 6 mins which is too much time for us since we have to process 3000 files a day .We are running almost 10 files in parallel.One table almost have 230 columns.

Is there any other faster way to process the XML ? can we use PROC GROOVY ? will it be efficient? if yes can any one provide me a sample code? i tried searching online but not able to get one.

The XML has PII data and its huge of 3 GB .

The Code being run is very simple and straight forward:

filename NHL "/path/ODM.xml";
filename map "/path/odm_map.map";
libname NHL xmlv2 xmlmap=map;
proc copy in=nhl out=work;
run;

Total Table created : 54 in which more than 14 tables have ~18000 records and remaining tables have ~1000 records

The Log window shows

NOTE: PROCEDURE COPY used (Total process time): 
      real time           4:03.72 
      user cpu time       4:00.68 
      system cpu time        1.17 seconds 
      memory              32842.37k 
      OS Memory           52888.00k 
      Timestamp           19/05/2020 03:14:43 PM 
      Step Count          4 Switch Count 802
      Page Faults 3 
      Page Reclaims 17172 
      Page Swaps 0 
      Voluntary Context Switches 3662 
      Involuntary Context Switches 27536 
      Block Input Operations 504 
      Block Output Operations 56512 

      SAS Version : 9.4_M2 

total memsize is MEMSIZE=3221225472 in our server

3000 files total out of which 1000 will be 3 to 4 GB and some of which will be 1 GB and 1000 files will be in KB .The smaller files are getting processed quickly the problem is only with big files .it uses almost the entire CPU.

The copy time from XML engine varies when we reduce the number of file,but for that to happen we have to change the map file or the input xml.

Already raised SAS tracks and have questioned the same in SAS communities still no luck.looks like its parser limitation itself.

Any idea about the shredder in Teradata ? will it be efficient?

1
Show your code and log messages. Describe your system resources, CPU and RAM. Do all 3,000 files use the same map ? ? If so, when is that map created ? How long does Mapper take to make the XMLMap file? Can you upload a sample xml file and it's map to a site such as pastebin.com ?Richard
Side comment: You can edit your post to make corrections and include more information. An edit should not substantially change a question and ask a new one. If there is a new question, create a new question. I edited the topic and added to Log message you put in a comment. Also, can you clarify, do you have 3,000 files of 3-4GB size, or 3,000 files totaling 3-4GB in size.Richard
You should contact SAS Support and open a track if you are dealing with PII data. In the interim, have you tried upping the MEMMAXSZ option in a config.sas. Also, the XML engine might be bottlenecked in the resources available to the SAS private runtime JRE. Finally, you can use communities.sas.com for topics requiring a more threaded conversation.Richard
I can't imagine SAS being really able to help with this - 3000 files each a few GB in size? That's a humungous amount of ETL. You'd need a really big server to run that program efficiently. Sounds like they're already doing things the right way - run 10 in parallel times 300 sets of jobs, 300 x 4-5 mins each. Just not what SAS is built to do well.Joe
Does it work faster to modify the map to just read one of the 54 tables at a time?Tom

1 Answers

0
votes

I would do this in two pieces, first convert XML to ascii and then into SAS. SAS isn't going to be very fast at converting XML into SAS; it's just not something SAS is optimized for. You're using nearly entirely CPU time, so you're not disk limited - you're limited by SAS's ability to parse the XML file.

Write a program in a more optimized language that can parse the XML much faster, and then read the results of that into SAS. Python might be one option - it's not super optimized either, but it's more optimized for this sort of thing than SAS I suspect - or an even lower level language (like c/c++) might be your best bet.