1
votes

I have a data set with two main variables of interest now - Major and Major_Code. These should match up 1 to 1 but there are some errors I need to fix and what I've found is that for 14 Major_Code values, there are two different Majors. This is only due to a change in spelling or punctuation, like "ed." and "education". They are supposed to have the same value here but don't.

So I have a table with 7 pairs. Each pair has the same Major_Code and different a Major. How can I select one of the Major vales to use for each code? My only idea was through an if-then statement but that seems horribly inefficient.

I found the doubled values like this:

proc freq data=majorslist; tables Major_Code/out=majorcodedups; run;

proc print data=majorcodedups; where COUNT > 1; run;

So I can easily find these observations but can't extract certain values to overwrite onto another observation. I've looked into arrays, macros, sql and transpose but it's all a bit over my head right now.

Logically it would work like this:

from obs i to n, find value for variable x at obs i, output value onto variable y at obs i, go to obs(i+1) and repeat.

1
The key is to determine how you want to choose which value for "Major" is correct. The rest is easy. - Joe

1 Answers

0
votes

Assuming you have some rule for determining which MAJOR is correct for a MAJOR_CODE, you should do this:

This assumes majorslist is a dataset of every major/major_code pair whether unique or not - but only one per major/major_code pair.

proc sort data=majorslist;
by major_code major;
run;

data majorslist_unique;
set majorslist;
by major_code major;
if first.major_code and last.major_code then output;
else do;
*rule to determine whether to output it or not;
end;
run;

So, you now have the major_code/major relationship. Let's say you picked if first.major_code then output; as your rule (ie, take the major_code with the alphabetically first major value).

Now, you need to apply this to your larger dataset. There are a lot of ways to do that - merge this on is one, format is another, for starters. Format works like this:

Create a dataset with FMTNAME, START, LABEL defined. For each value of MAJOR_CODE, construct one row like that, where START is MAJOR_CODE and LABEL is MAJOR. We'll also add an extra line that says what to do with non-matches (in case you get new values of major_code).

data for_fmt;
set majorslist_unique;
fmtname='MAJORF'; *add a $ if MAJOR_CODE is a character variable;
start=major_code;
label=major;
output;
if _n_=1 then do;
  hlo='o';
  call missing(start);
  label='NONMATCHED';
  output;
end;
keep fmtname start label hlo;
run;

proc format cntlin=for_fmt;
quit;

Now you have a format, MAJORF. (or $MAJORF. if MAJOR_CODE is character), that you can use in a PUT statement.

data my_bigdata2;
set my_bigdata;
major = put(major_code,MAJORF.);
run;