3
votes

I am trying to understand how DataFrames work in Julia and I am having a rough time.

In Python, I usually worked with DataFrames by adding a new column at each simulation step and populating each row with values.

For example, I have this DataFrame, which contains the input data:

using DataFrames

df = DataFrame( A=Int[], B=Int[] )
push!(df, [1, 10])
push!(df, [2, 20])
push!(df, [3, 30])

Now, let's say that I do calculations based on columns A and B that generate a third column C holding DateTime objects. However, a DateTime is not generated for every row; some values may be null.

  1. How is that use case handled in Julia?
  2. How should I create the new C column and assign values inside the for r in eachrow(df) loop?

# Pseudocode of what I intend to do

df[!, :C] .= nothing

for r in eachrow(df)
    if condition
        r.C = mySuperComplexFunctionThatReturnsDateTimeForEachRow()
    else
        r.C = nothing
    end
end

To give runnable, concrete code, let's fake the condition and the function:

using Dates

df[!, :C] .= nothing

for r in eachrow(df)
    if r.A == 2
        r.C = Dates.now()
    else
        r.C = nothing
    end
end

2 Answers

5
votes

The efficient pattern to do this is:

df.C = f.(df.A, df.B)

where f is a function that takes scalars and calculates an output from them (i.e. your simulation code), and you pass it the columns of df that the calculation needs. Written this way, the Julia compiler is able to generate fast (type-stable) native code.
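As a minimal sketch of this pattern, assuming the input columns from the question: simulate_step is a hypothetical stand-in for the real simulation code, and the fixed DateTime is only there to make the result reproducible.

```julia
using DataFrames, Dates

df = DataFrame(A = [1, 2, 3], B = [10, 20, 30])

# Hypothetical scalar function standing in for the simulation code:
# it returns a DateTime for some rows and `nothing` for the others.
function simulate_step(a::Int, b::Int)
    return a == 2 ? DateTime(2020, 1, 1) + Day(b) : nothing
end

# The dot broadcasts the scalar function over the two columns,
# producing one result per row, which becomes the new column C.
df.C = simulate_step.(df.A, df.B)

# The element type of the new column admits both outcomes.
eltype(df.C) == Union{Nothing, DateTime}
```

Because the function receives plain scalars rather than a row object, it can be tested on its own and reused outside the DataFrame.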

In your example the function f would be ifelse so you could write:

df.C = ifelse.(df.A .== 2, Dates.now(), nothing)

Also consider whether to return nothing or missing. They have different interpretations in Julia: nothing means the absence of a value, whereas missing means that a value exists but is not known. I am not sure which would be better in your case.
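To illustrate the practical difference, here is a small sketch on a plain vector instead of a DataFrame column. The variable names and the fixed timestamp are illustrative assumptions, not part of the original example.

```julia
using Dates

a = [1, 2, 3]
stamp = DateTime(2020, 1, 1)  # fixed timestamp, chosen only for reproducibility

c_nothing = ifelse.(a .== 2, stamp, nothing)  # elements are DateTime or nothing
c_missing = ifelse.(a .== 2, stamp, missing)  # elements are DateTime or missing

# `missing` propagates through comparisons and has dedicated helpers.
ismissing(c_missing[1] == stamp)              # the comparison itself is missing
known = collect(skipmissing(c_missing))       # keeps only the known values
filled_m = coalesce.(c_missing, DateTime(0))  # replaces missing with a default

# `nothing` does not propagate, so it has to be tested for explicitly.
isnothing(c_nothing[1])
filled_n = something.(c_nothing, DateTime(0)) # replaces nothing with a default
```

If the column later feeds into statistics or joins, missing tends to be the more convenient choice, since much of the data ecosystem is built around it; that is a general observation rather than a recommendation for this specific case.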

3
votes

If you initialize the column with df[!, :C] .= nothing, it has the element type Nothing. When you then write a DateTime to this column, Julia attempts to convert it to Nothing and fails. I am not sure if this is the most efficient or recommended solution, but if you initialize the column as a union of DateTime and Nothing

df[!, :C] = Vector{Union{DateTime, Nothing}}(nothing, size(df, 1))

your example should work.
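
Combined with the loop from the question, a minimal sketch might look like the following. It assumes the same A and B input columns, uses nrow(df) in place of size(df, 1), and keeps Dates.now() as a placeholder for the real per-row computation.

```julia
using DataFrames, Dates

df = DataFrame(A = [1, 2, 3], B = [10, 20, 30])

# Allocate column C with an element type that accepts both outcomes.
df[!, :C] = Vector{Union{DateTime, Nothing}}(nothing, nrow(df))

for r in eachrow(df)
    if r.A == 2
        # Placeholder for the real per-row computation.
        r.C = Dates.now()
    end
    # Rows that do not match keep their initial `nothing`.
end
```

The else branch from the question is no longer needed, because every row already starts out as nothing.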