0
votes

I have transposed some data based upon a condition and I would now like to de-duplicate the data across the variables. I would have done this before transposing the data but that would have left me with only one example of each variable where I actually need to some kind of conditional deduplication.

I started with data that looks like this;

data have;
input ID Event level1 $;
datalines;
12345 9008 Activity1
12345 9008 Activity1
12345 9008 Activity1
12345 9008 Activity1
12345 1001 Activity2
12345 1001 Activity2
12345 1002 Activity3
12345 1002 Activity3
12345 6009 Activity4
12345 9009 Activity1
12345 9008 Activity1
12345 1001 Activity2
23145 9008 Activity1
23145 1001 Activity2
23145 1001 Activity2
23145 1001 Activity2
23145 1002 Activity3
23145 6009 Activity4
23145 1003 Activity5
23145 1003 Activity5
23145 1003 Activity5
23145 1004 Activity6
35762 9008 Activity1
35762 1001 Activity2
35762 1002 Activity3
35762 1002 Activity3
35762 6009 Activity4
35762 6009 Activity4
24958 9008 Activity1
24958 1002 Activity3
24958 1002 Activity3
24958 1002 Activity3
24958 9009 Activity1
24958 9009 Activity2
24958 9009 Activity3
;
run;

This data is sorted based upon the ID and a time stamp (not shown here). I want to divide this up into journeys and transpose the data based upon the first 9008 or 9009 code. This I have managed to do by using the following code

data AS.TENMAY_JOUR_TRANS;
set AS.TENMAY_LEVEL1;
if first.EVENT and EVENT = 9008 or EVENT = 9009 then;
proc transpose data=AS.TENMAY_LEVEL1
                out=AS.TENMAY_JOUR_TRANS
                name=Journey;
by ID;
var level1;
run;
run;

This then gives this;

data have;
   input ID Journey $ Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11 Var12 Var13;
datalines;
12345 level1 Activity1 Activity1 Activity1 Activity1 Activity2 Activity2 Activity3 Activity3 Activity4
12345 level1 Activity1 Activity1 Activity2
23145 level1 Activity1 Activity2 Activity2 Activity2 Activity3 Activity4 Activity5 Activity5 Activity5 Activity6
35762 level1 Activity1 Activity2 Activity3 Activity3 Activity4 Activity4
24958 level1 Activity1 Activity3 Activity3 Activity3
24958 level1 Activity1 Activity2 Activity3
;
run;

however I want to get to this;

data have ;
input ID Journey $ Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11 Var12 Var13;
datalines;
12345 level1 Activity1 Activity2 Activity3 Activity4
23145 level1 Activity1 Activity2 Activity3 ACtivity4 Activity5 Activity6
35762 level1 Activity1 Activity3 Activity4
24958 level1 Activity1 Activity3
12345 level1 Activity1 Activity2
24958 level1 Activity1 Activity2 Activity3
;
run;

I cant deduplicate the initial data as I require some of the duplication to remain. I only need the level1 variable to be deduplicated between the first new 9009 or 9008 of each ID

I am supposing the best way to do this is through using the 'NODUP' or 'NODUPKEY' commands in a 'PROC SORT' statement and I am exploring options like

data AS.TENMAY_JOUR_TRANS;
    set AS.TENMAY_LEVEL11;
    if first.EVENT and EVENT = Activity0 or EVENT = Activity1 then;
    proc sort data= AS.TENMAY_JOUR_TRANS NODUP;
    proc transpose data=AS2.TENMAY_LEVEL11
                    out=AS2.TENMAY_JOUR_TRANS
                    name=Journey;
    by ID_INTERNET;
    var level1;
    run;
    run;
run;

I am not sure this is the best way to do this and if so, basically I am not sure whether to place the 'PROC SORT' step, inside or outside of the 'PROC TRANSPOSE' step.

Any help much appreciated.

Thanks!

3
I don't see how you managed to get any results by incorporating a proc step into a data step. SAS just won't allow that...Dominic Comtois
I've got no idea either. What I typed above gave me those results though :-/ They are also seem to resemble what I expected to get when I wrote it. Bizarre. I have no idea why it works!Taylrl
Did you check your log? What happened is probably that your data step has been ignored (because it contains errors, such as the then;) and only your proc transpose worked.Dominic Comtois

3 Answers

1
votes

I managed to do this by creating a flag equal to 1 for each first entry of each type of Activity and then filtering based upon this before I do the first Transpose. This I did with the following code

data AS2.TENMAY_LOGFLAG;
    set AS2.TENMAY_LEVEL1;
    if level1 = 'Activity' then logflag =1;
run;

I then used EG to filter for 'where logflag = 1' and then did the Transpose

0
votes

You should deduplicate before having your data transposed. Let's assume you have

data beforehave;
   input ID Journey $ Activity $;
datalines;
12345 level1 Activity1
12345 level1 Activity1
12345 level1 Activity1 
12345 level1 Activity2 
12345 level1 Activity2 
23145 level1 Activity1 
23145 level1 Activity2 
23145 level1 Activity2 
;
RUN;

Then you have to deduplicate across both Journey and Activity:

PROC SORT data=beforehave; BY journey Activity nodupkey; RUN;

The you can transpose. Sorry not to give more code: I don't have access to my SAS today...

0
votes

So let's start with the big issue. You cannot mix Procedures inside Data Steps. The Data Step is basically a loop over a data set, allowing you to alter/change/create table(s). A procedure takes in a data set and does something with it.

So what you want to do is to transpose the data, remove the duplicates, and then transpose back to the wide format.

Step 1: Transpose.

proc sort data=have;
by id journey;
run;

proc transpose data=have out=have_t(where=(col1 ^= "") drop=_name_);
by id journey;
var var:;
run;

This pivots your table, creating a column named COL1 with the values inside all the VAR* variables. A _NAME_ column holds the old column name. We don't want that, so we drop it. The where on the out data set filters out variables that are not populated.

The sort is needed for the BY group processing.

Step 2, remove dups.

proc sort data=have_t nodup;
by id journey;
run;

Step 2.5, add back the _NAME_ variable that will be new column names.

data have_t(drop=i);
set have_t;
by id journey;
retain i;
format _name_ $32.;

if first.journey then
    i=0;

i = i + 1;
_name_ = "var" || strip(put(i,best.));
run; 

Here we create a count variable, i, and increment the count inside each journey. Drop the i variable in the output.

Step 3, Transpose back

proc transpose data=have_t out=want(drop=_name_);
by id journey;
id _name_;
var col1;
run;

Pretty straight forward transpose, removing the new _name_ variable.

3.5: If you want all 13 Var columns.

data want;
format id best. journey $8. var1-var13 $12.;
set want;
run;

This defines all the columns and then fills it with the values from the WANT data set, overwriting that data set.