In MATLAB, I have a dataset in a table of the form:

SCHOOL  SEX  AGE  ADDRESS  STATUS  JOB  GUARDIAN  HEALTH  GRADE
UR      F    12   U        FT      TEA  MOTHER    1       11
GB      M    22   R        FT      SER  FATHER    5       15
GB      M    12   R        FT      OTH  FATHER    3       12
GB      M    11   R        PT      POL  FATHER    2       10

Some variables are binary, some are categorical, and some are numerical. Is it possible to extract a correlation matrix from it, containing the correlation coefficients between the variables? I tried both corrcoef and corrplot (the latter from the Econometrics Toolbox), but I run into errors such as 'observed data must be convertible to type double'.

Does anyone have a suggestion for how this can be done? Thank you.

2 Answers

0
votes

I think you need to make all the data numeric, i.e. recode the non-numeric columns as numbers, for example:

SCHOOL  SEX  AGE  ADDRESS  STATUS  JOB  GUARDIAN  HEALTH  GRADE
1       1    12   1        1       1    1         1       11
2       2    22   2        1       2    2         5       15
2       2    12   2        1       3    2         3       12
2       2    11   2        2       4    2         2       10

and then do the correlation.
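A minimal sketch of this recoding, assuming the four sample rows from the question are rebuilt by hand as a table `T` (the variable names `X` and `R` are illustrative, not from the original post). Text columns are replaced by integer codes in order of first appearance, which reproduces the coding shown above, and `corrcoef` is then run on the resulting numeric matrix. With only four rows the coefficients themselves are not meaningful; the point is the mechanics.

```matlab
% Rebuild the four sample rows from the question (illustrative data only).
SCHOOL   = {'UR';'GB';'GB';'GB'};
SEX      = {'F';'M';'M';'M'};
AGE      = [12;22;12;11];
ADDRESS  = {'U';'R';'R';'R'};
STATUS   = {'FT';'FT';'FT';'PT'};
JOB      = {'TEA';'SER';'OTH';'POL'};
GUARDIAN = {'MOTHER';'FATHER';'FATHER';'FATHER'};
HEALTH   = [1;5;3;2];
GRADE    = [11;15;12;10];
T = table(SCHOOL, SEX, AGE, ADDRESS, STATUS, JOB, GUARDIAN, HEALTH, GRADE);

% Build a purely numeric matrix: numeric columns are copied as-is,
% text columns are replaced by integer codes (1, 2, ...) assigned in
% order of first appearance.
X = zeros(height(T), width(T));
for k = 1:width(T)
    col = T{:, k};
    if isnumeric(col)
        X(:, k) = col;
    else
        [~, ~, X(:, k)] = unique(col, 'stable');
    end
end

% Pairwise correlation coefficients between all columns.
R = corrcoef(X);

% Label rows and columns with the variable names for readability.
names  = T.Properties.VariableNames;
Rtable = array2table(R, 'VariableNames', names, 'RowNames', names);
disp(Rtable)
```

Note that the integer codes impose an arbitrary order on columns such as JOB, so the coefficients involving those columns depend on how the codes happen to be assigned.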

0
votes

As said above, you first need to convert your categorical and binary variables to numeric values. If your data is in a table T, start by converting the text columns to categorical arrays, for example:

    T.SCHOOL = categorical(T.SCHOOL);

A worked example can be found in the MATLAB help here, where they use the patients dataset, which seems to be similar to your data.

You could then transform your categorical columns to double:

    T.SCHOOL = double(T.SCHOOL);

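Be careful with double though: it does not recover any meaningful numeric value, it simply returns the position of each value in the list of categories, so the resulting numbers are essentially arbitrary; see the MATLAB forum.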
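To make the caveat concrete, here is a small sketch using the four JOB values from the question (the variable names `c`, `codes`, `c2` and `codes2` are illustrative). It assumes the default behaviour of `categorical`, where the categories are the sorted unique values, and then shows how passing an explicit value set controls which number each label gets.

```matlab
JOB = {'TEA';'SER';'OTH';'POL'};

% Default: categories are the sorted unique values of the input.
c     = categorical(JOB);
cats  = categories(c);   % {'OTH';'POL';'SER';'TEA'}
codes = double(c);       % position of each value in cats: [4;3;1;2]

% Explicit value set: the codes now follow the order given here.
c2     = categorical(JOB, {'TEA','SER','OTH','POL'});
codes2 = double(c2);     % [1;2;3;4]
```

Either way, the numbers only identify categories; their magnitudes carry no information unless the variable really is ordinal.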

Also note that you introduce an order into your categorical variables if you simply convert them to numbers. For example, if you map the JOB values 'TEA', 'SER', 'OTH' to 1, 2, 3, etc., you are treating the variable as ordinal: 'TEA' is then < 'OTH'.

If you want to avoid that you can re-code the categorical columns into 'binary' dummy variables:

    dummy_schools = dummyvar(T.SCHOOL);

This returns a matrix of size nrows x (number of unique values in T.SCHOOL), with one 0/1 column per category.
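As a rough sketch of how the dummy variables could feed into a correlation, the snippet below uses the SCHOOL and GRADE columns from the question, typed in by hand (the names `D`, `cats` and `R` are illustrative). It assumes the Statistics and Machine Learning Toolbox is available, since that is where `dummyvar` lives.

```matlab
SCHOOL = categorical({'UR';'GB';'GB';'GB'});
GRADE  = [11;15;12;10];

% One 0/1 indicator column per category, in the order of categories(SCHOOL).
D    = dummyvar(SCHOOL);     % 4-by-2 here: columns for GB and UR
cats = categories(SCHOOL);   % column k of D corresponds to cats{k}

% Correlate the indicator columns together with a numeric variable.
R = corrcoef([D, GRADE]);    % 3-by-3 matrix
```

With only two categories, the two indicator columns are exact complements of each other, so their mutual correlation is -1. More generally, the dummy columns of one variable always sum to 1 per row, so one of them is redundant.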

And then there is the broader discussion of whether it is useful at all to calculate correlations between categorical variables. Like here.

I hope this helps :)