0
votes

I am looking for a specific employer in a SAS data set. The data set has not been reviewed for spelling so if I am looking for Univ it could be entered as Unversity, University, Univercity ...

I've tried scaning, counting the matching letters, 'contains'. These are work but I am still missing some.

proc sql; create table SpecificEmployers as select * , case when employer contains 'Univ' then 'Y' else 'N' end as Emp from AllEmployers ;quit;

2

2 Answers

0
votes

In this case, rather than searching for a substring, I would suggest searching individual characters which can occur most commonly such as U, N, V etc. Then you can keep only those values which have all these characters available. For example- I have used findc function to search the string which has U, N and V

data have;
input string $15.;
datalines;
uNiverstY
UNVERSTy
college
univercity
school
schools
UNIVERSITY
Uversity
unvarcity
school123
;
run;

proc sql;
  select string from have
  where findc(upcase(string),'U')>=1
    and findc(upcase(string),'N')>=1
    and findc(upcase(string),'V')>=1;
  quit;

proc print data=want; run;

using upcase will also make your task easy .. so you don't have to worry about the case. You can put as many conditions as you need depending on the value

0
votes

You should investigate some of the edit distance functions:

http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm

http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206137.htm

One approach would be to loop through each word in the employer name and see if any of the individual words has an edit distance below a certain threshold when compared to the string university.