I am reading a .dat data set into SAS as part of an exercise on informat use. Here is what I have so far.

DATA companies;
    INFILE "/folders/myshortcuts/Stat324/BigCompanies.dat" encoding='wlatin2';
    INPUT rank 3. @6 company $UTF8X25. @35 country $17. @53 sales comma6. @60 profits comma8. @70 assets comma8. @82 marketval comma6.;
RUN;

This works for every line except those containing special/international characters, such as:

94   Société Générale             France             $98.6B    $3.3B $1,531.1B    $25.8B

These lines trip up at the first currency value (@53 sales comma6.): a warning is thrown indicating that invalid data was found for that input, and a missing value (.) is assigned.

Playing around with @ pointers and informat w values seems to reveal that the special characters are throwing off the column alignments. Is this possible (a special character actually taking up 2 bytes/spaces even if it prints as a single character)? Is there a simple solution?

1 Answer

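Yes, you're exactly correct: if the characters are encoded in UTF-8, each one takes between 1 and 4 bytes. Plain ASCII characters take one byte, but others take more (what you call "special characters" here). If SAS is reading the file as WLATIN2, as your INFILE statement specifies, then it will assume each byte is a separate character, so every multi-byte character pushes the rest of the line one or more columns to the right of where your @ pointers expect it.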
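To see the mismatch directly, a quick check along these lines may help. It is only a sketch: it assumes a UTF-8 SAS session and a program file saved as UTF-8, and the variable names are just for illustration. LENGTH counts bytes, while KLENGTH counts characters.

```sas
/* Sketch: assumes a UTF-8 session; variable names are illustrative only */
DATA _NULL_;
    name  = "Société Générale";
    bytes = length(name);    /* storage length in bytes */
    chars = klength(name);   /* length in characters */
    PUT bytes= chars=;
RUN;
```

Under those assumptions the log should report 20 bytes against 16 characters, because each é occupies two bytes in UTF-8. Those four extra bytes are the kind of shift that moves the currency fields away from their expected columns.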

Your code is a bit confusing to me: you specify that the file is WLATIN2, but then you instruct SAS to read that field in as UTF-8 (the $UTF8X25. informat). Which is it?

If your session encoding is compatible with UTF-8, and the file to be read in is encoded as UTF-8, then you likely just need to switch the encoding on the INFILE statement to UTF-8. If your file has mixed encoding, or there is a reason you can't use UTF-8 encoding to read it in, then you may have a complicated problem that needs special code (for example, to figure out how many bytes the UTF-8 portion actually occupies, and then advance the pointer to the right spot to read the next field). You may also be able to read the file using a delimiter; that depends on the exact format of the data.
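As a rough illustration of the "special code" route, one option is to avoid byte-based column pointers altogether: load the whole record, then slice it by character position with KSUBSTR. This is a substitute for advancing the pointer by hand, not the same technique, and it is a sketch under assumptions: a UTF-8 session, a UTF-8 file, and the same column layout as in your INPUT statement. The variable lengths ($100, $68) are my own guesses, chosen to leave room for multi-byte characters.

```sas
/* Sketch: assumes a UTF-8 session and a UTF-8 encoded file */
DATA companies;
    INFILE "/folders/myshortcuts/Stat324/BigCompanies.dat" encoding='utf-8';
    LENGTH company $100 country $68;   /* assumed sizes: up to 4 bytes per character */
    INPUT;                             /* load the record into _INFILE_ without parsing it */
    /* KSUBSTR positions and lengths are counted in characters, not bytes */
    rank      = input(ksubstr(_infile_, 1, 3), 3.);
    company   = ksubstr(_infile_, 6, 25);
    country   = ksubstr(_infile_, 35, 17);
    sales     = input(ksubstr(_infile_, 53, 6), comma6.);
    profits   = input(ksubstr(_infile_, 60, 8), comma8.);
    assets    = input(ksubstr(_infile_, 70, 8), comma8.);
    marketval = input(ksubstr(_infile_, 82, 6), comma6.);
RUN;
```

If some records are shorter than the last field's end column, KSUBSTR may complain about an invalid argument, so the positions may need adjusting to the real file.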