0
votes

Background

I have a large dataset in SAS that has 17 variables of which four are numeric and 13 character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.

  1. cylinders
  2. condition
  3. drive
  4. paint_color
  5. type
  6. manufacturer
  7. title_status
  8. model
  9. fuel
  10. transmission
  11. description
  12. region
  13. state
  14. price (num)
  15. posting_date (num)
  16. odometer (num)
  17. year (num)

After applying specific filters to the numeric columns, there are no missing values for each numeric variable. However, there are thousands to hundreds of thousands of missing variables for the remaining 14 char/string variables.

Request

Similar to the blog post towards data science as shown here (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regression-1b12386c69c8), specifically under the Feature Engineering section, how can I write the equivalent SAS code where I use regex on the description column to fill missing values of the other string/char columns with categorical values such as cylinders, condition, drive, paint_color, and so on?

Here is the Python code from the blog post.

import re

manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'

keys =    ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer,   condition,   fuel,  title_status, transmission ,drive, size, type_, paint_color,   cylinders]

for i,column in zip(keys,columns):
    database[i] = database[i].fillna(
      database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()

database.drop('description', axis=1, inplace= True)

What would be the equivalent SAS code for the Python code shown above?

1

1 Answers

1
votes

It's basically just doing a word search of sorts.

A simplified example in SAS:

data want;
set have;
array _fuel(*) $ _temporary_ ("gas", "hybrid", "diesel", "electric");

do i=1 to dim(_fuel);
if find(description, _fuel(i), 'it')>0 then fuel = _fuel(i);
*does not deal with multiple finds so the last one found will be kept;
end;

run;

You can expand this by creating an array for each variable and then looping through your lists. I think you can replace the loop with a REGEX command as well in SAS but regex requires too much thinking so someone else will have to provide that answer.