0
votes

I was doing the transpose operation using the following code:

proc sort data=TPSDATA;
            by TIME SYMBOL ;
    run;
    PROC TRANSPOSE DATA=TPSDATA OUT=TPS;
            by TIME;
            id SYMBOL;
            var return_sec;
    RUN;

My original dataset is similar to the following structure:

| SYMBOL   | Time     | RETURN_SEC |
|----------|----------|--------------|
| Facebook | 12:00:01 | 1%           |
| Facebook | 12:00:02 | 1.5%         |
| ...      |          |              |
| Apple    | 12:00:01 | -0.5%        |
| Apple    | 12:00:02 | -0.3%        |
| ...      |          |              |
| Alibaba  | 12:00:01 | -0.5%        |
| Alibaba  | 12:00:02 | -0.3%        |
| ...      |          |              |

I got some weird empty cells:

WNR WNS WOOD    WOOF    WOR
0   0   0   0   0







-0.000183117    0   0   -0.000684932    0





0   0   0   0   0

0   0   0   0   0
0.000732467 0   0   0   0
0.001280878 0   -0.000430108    0   0

Some columns have empty column names but non-empty contents:

CRESW   CRFN
















0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0

0   0   0   0   0   0   0   0

0   0   0   0   0   0   0   0

I was thinking why?

I played around some sample cases, trying to replicate the issue. I copied and pasted 80 rows or my original data (of two different stocks) into the dataset, and it works fine.

Then I tried to delete some of the data. For example, I deleted facebook 12:00:01. However, instead of having an empty column name, I just got an empty cell.

The test code that failed to replicate my problem:

data long4; 
  input SYMBOL $ TIME RETURN_SEC $; 
cards; 
TEG 1 53.525
TEG 2 53.5
TEG 3 53.5
TEG 4 53.485
TEG 5 53.48
TEG 6 53.485
TEG 7 53.48
TEG 8 53.495
TEG 9 53.495
TEG 10 53.495
TEG 11 53.495
TEG 12 53.495
TEG 13 53.495
TEG 14 53.495
TEG 15 53.495
TEG 16 53.495
TEG 17 53.495
TEG 18 53.495
TEG 19 53.5
TEG 20 53.63
TEG 21 53.625
TEG 22 53.625
TEG 23 53.625
TEG 24 53.625
TEG 25 53.625
TEG 26 53.625
TEG 27 53.62
TEG 28 53.62
TEG 29 53.62
TEG 30 53.62
TEG 31 53.62
TEG 32 53.62
TEG 33 53.62
TEG 34 53.555
TEG 35 53.525
TEG 36 53.525
TEG 37 53.535
TEG 38 53.535
TEG 39 53.535
TEG 40 53.535
TEG 41 53.535
TEG 42 53.535
TEG 43 53.54
TEG 44 53.505
TEG 45 53.51
TEG 46 53.505
TEG 47 53.51
TEG 48 53.505
TEG 49 53.51
TEG 50 53.52
TEG 51 53.51
TEG 52 53.52
TEG 53 53.525
TEG 54 53.52
TEG 55 53.52
TEG 56 53.475
TEG 57 53.46
TEG 58 53.46
TEG 59 53.46
TEG 60 53.46
TEG 61 53.43
TEG 62 53.415
TEG 63 53.405
TEG 64 53.4
TEG 65 53.4
TEG 66 53.38
TEG 67 53.32
TEG 68 53.315
TEG 69 53.325
TEG 70 53.325
TEG 71 53.33
TEG 72 53.33
TEG 73 53.33
TEG 74 53.325
TEG 75 53.325
TEG 76 53.27
TEG 77 53.27
TEG 78 53.27
TEG 79 53.27
GOOG 10 53.495
GOOG 11 53.495
GOOG 12 53.495
GOOG 13 53.495
GOOG 14 53.495
GOOG 15 53.495
GOOG 16 53.495
GOOG 17 53.495
GOOG 18 53.495
GOOG 19 53.5
GOOG 20 53.63
GOOG 21 53.625
GOOG 22 53.625
GOOG 23 53.625
GOOG 24 53.625
GOOG 25 53.625
GOOG 26 53.625
GOOG 27 53.62
GOOG 28 53.62
GOOG 29 53.62
GOOG 30 53.62
GOOG 31 53.62
GOOG 32 53.62
GOOG 33 53.62
GOOG 34 53.555
GOOG 35 53.525
GOOG 36 53.525
GOOG 37 53.535
GOOG 38 53.535
GOOG 39 53.535
GOOG 40 53.535
GOOG 41 53.535
GOOG 42 53.535
GOOG 43 53.54
GOOG 44 53.505
GOOG 45 53.51
GOOG 46 53.505
GOOG 47 53.51
GOOG 48 53.505
GOOG 49 53.51
GOOG 50 53.52
GOOG 51 53.51
GOOG 52 53.52
GOOG 53 53.525
GOOG 54 53.52
GOOG 55 53.52
GOOG 56 53.475
GOOG 57 53.46
GOOG 58 53.46
GOOG 59 53.46
GOOG 60 53.46
GOOG 61 53.43
GOOG 62 53.415
GOOG 63 53.405
GOOG 64 53.4
GOOG 65 53.4
GOOG 66 53.38
GOOG 67 53.32
GOOG 68 53.315
GOOG 69 53.325
GOOG 70 53.325
GOOG 71 53.33
GOOG 72 53.33
GOOG 73 53.33
GOOG 74 53.325
GOOG 75 53.325
GOOG 76 53.27
GOOG 77 53.27
GOOG 78 53.27
; 
run; 

proc sort data=long4 force; by Time SYMBOL; run;

proc transpose data=long4 out=wide4;
  by TIME;
  id SYMBOL;
  var RETURN_SEC;
run;

proc print data=wide4;
run;

It's weird. I am just thinking what is cause? The original dataset is huge and I can only test my code on a subset of the dataset.

I also checked - there is no missing ticker symbol in the original dataset.

UPDATE:

Code:

/*
 * Principle Component Analysis
 */
%MACRO PCA (NBBOOUTSET,DATE_VAR);
        options nonotes;
        %put ;
        %put ### START PCA Estimation for: &DATE_VAR ;
        /* Sort the variable by time first */
        proc sort data=&NBBOOUTSET;
                by TIME SYMBOL ;
        run;
        DATA TPSDATA;
                SET &NBBOOUTSET;
        RUN;
        proc export data=TPSDATA (obs=99999)
        outfile="&date_var.tpstest.csv"
        dbms=CSV REPLACE;
        putname=YES;
        run;
        proc sort data=TPSDATA;
                by TIME SYMBOL ;
        run;
        PROC TRANSPOSE DATA=TPSDATA OUT=TPS;
                by TIME;
                id SYMBOL;
                var return_sec;
        RUN;
        DATA TPS_DROPPED;
                SET TPS;
                drop _NAME_ time;
        RUN;
        proc export data=TPS_DROPPED (obs=99999)
        outfile="&date_var.dropped.csv"
        dbms=CSV REPLACE;
        putname=YES;
        run;
%MEND;

Log:

https://www.dropbox.com/s/gciemd8fhno2o6k/pca.log?dl=0

As you can see, there is no error. But the columns are empty after ticker "CRFN" in the exported dataset.

1
Post your full log.Reeza
can you dropbox your full dataset? Clearly, from the information you've given it's not possible to help.DCR
It is really not possible to create variables with no name. Perhaps the issue is how you are looking at the results?Tom
I posted part of the results above (they are CSV files) @TomJinhua Wang
there are empty column names. @TomJinhua Wang

1 Answers

0
votes

Actually, Reeza is correct.

The problem is not on the code itself, but the way that I was reading the csv (or exporting the data).

It seems that problems occur when there are too much columns in a csv file (there are around 2000 ~ 4000 columns). I tried to read the csv with python, pandas and jupyternotebook, but the columns are completely off (thousands of them with missing names). I tried to read in the csv file with excel, hundreds of columns are off.

There are no empty column names.

The error produced by the code is caused by missing contents (proc factor does not allow a single missing value, other wise it will ignore the complete observation. it turns out i have at least one value missing in each observation, then sas ignored all of my dataset).