2
votes

I need to select values from a single column in a Julia DataFrame based on multiple criteria sourced from an array. Context: I'm attempting to format the data from a large Julia DataFrame to support a PCA (principal component analysis), so I first split the original data into an analytical matrix and a label array. This is my code so far (it doesn't work):

### Initialize source dataframe for PCA
dfSource=DataFrame(
    colDataX=[0,5,10,15,5,20,0,5,10,30],
    colDataY=[1,2,3,4,5,6,7,8,9,0],
    colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])
### Extract 1/2 of rows into analytical matrix
matSource=convert(Matrix,DataFrame(dfSource[1:2:end,1:2]))'
###  Extract last column as labels
arLabels=dfSource[1:2:end,3]
###  Select filtered rows
datGet=matSource[:,arLabels>=0.2 & arLabels<0.7][1,:]
print(datGet)

output> MethodError: no method matching...

At the last line before the print(datGet) statement, I get a MethodError indicating a method mismatch related to use of the & logic. What have I done wrong?


2 Answers

3
votes

Here is a small example of an alternative implementation (you may find it useful to see what DataFrames.jl has built in):

# avoid materialization if dfSource is large
dfSourceHalf = @view dfSource[1:2:end, :]
lazyFilter = Iterators.filter(row -> 0.2 <= row[3] < 0.7, eachrow(dfSourceHalf))
matFiltered = mapreduce(row -> collect(row[1:2]), hcat, lazyFilter)
matFiltered[1, :]

(This is not optimized for speed; it is meant as a showcase of what is possible. Even so, it is already several times faster than your code.)

1
votes

This code works:

using DataFrames

dfSource=DataFrame(
    colDataX=[0,5,10,15,5,20,0,5,10,30],
    colDataY=[1,2,3,4,5,6,7,8,9,0],
    colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])

matSource=convert(Matrix,DataFrame(dfSource[1:2:end,1:2]))'

arLabels=dfSource[1:2:end,3]

datGet=matSource[:,(arLabels.>=0.2) .& (arLabels.<0.7)][1,:]
print(datGet)

output> [0,10,0]

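Note the parentheses around each comparison, (arLabels.>=0.2) and (arLabels.<0.7), as well as the dotted operators .>= and .<, which broadcast the comparison elementwise over the array. The parentheses matter because & binds more tightly than the comparison operators, so without them Julia tries to evaluate 0.2 & arLabels first. Finally, and most crucially (since it's the part most people miss), note the use of .& in place of just &, so that the two Boolean masks are combined element by element. The dot operator makes all the difference!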
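To make the precedence and broadcasting points concrete, here is a minimal sketch that uses only Base Julia (no DataFrames.jl). The literal arLabels and matSource values are typed in by hand to match the odd-numbered rows of dfSource, and the names mask, maskChained and err are introduced purely for illustration.

```julia
# Labels and data for rows 1, 3, 5, 7, 9 of dfSource, entered by hand
arLabels  = [0.2, 0.5, 0.0, 0.2, 0.8]
matSource = [0 10 5 0 10;    # colDataX
             1  3 5 7  9]    # colDataY

# `&` binds tighter than `>=`, so the original expression starts by
# evaluating `0.2 & arLabels`, which has no matching method.
err = try
    0.2 & arLabels
catch e
    e
end
println(err isa MethodError)

# Broadcast each comparison, then combine the two masks elementwise.
mask = (arLabels .>= 0.2) .& (arLabels .< 0.7)

# A dotted chained comparison expresses the same range test without `.&`.
maskChained = 0.2 .<= arLabels .< 0.7

# Select the first row (colDataX) for the columns where the mask is true.
datGet = matSource[1, mask]
println(datGet)
```

Indexing with matSource[1, mask] picks the row and filters the columns in one step, which avoids building the intermediate matrix that matSource[:, mask][1, :] creates.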