Description
I'm comparing a pipeline written in polars to the equivalent in dataframe. I'm seeing a couple of mild 'pain points' around apply, derive and lift2, specifically relating to the handling of OptionalColumn.
It may be that I'm taking the wrong approach (do let me know if I'm missing something here), but I wanted to share some thoughts and suggestions.
Essentially, if I have a column with missing values, I have to handle the column types @a and @(Maybe a) explicitly, whereas polars simply propagates missing values through the operation.
So, in the case of apply, instead of explicitly handling both cases:
D.apply @Text f c
D.apply @(Maybe Text) (fmap f) c
we could just use the first line.
Similarly, for derive and lift2:
D.derive @Text b (F.lift2 f (F.col a) (F.col b))
D.derive @(Maybe Text) b (F.lift2 (\x y -> f x <$> y) (F.col a) (F.col b))
we could just use the first line (the result would be Nothing if either or both arguments are missing).
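The "Nothing if either argument is missing" rule is what liftA2 already gives for Maybe. Below is a minimal sketch in plain Haskell, with lists standing in for columns. The helper combineMissing and the sample data are hypothetical names for illustration only and are not part of dataframe.

```haskell
import Control.Applicative (liftA2)

-- Hypothetical helper (not a dataframe function): apply a binary function
-- row by row, yielding Nothing whenever either input is missing.
combineMissing :: (a -> b -> c) -> [Maybe a] -> [Maybe b] -> [Maybe c]
combineMissing f = zipWith (liftA2 f)

-- Hypothetical sample columns covering all four present/missing cases.
xs :: [Maybe Int]
xs = [Just 1, Nothing, Just 3, Nothing]

ys :: [Maybe Int]
ys = [Just 10, Just 20, Nothing, Nothing]

-- Only the first row has both values, so only it produces a Just.
summed :: [Maybe Int]
summed = combineMissing (+) xs ys
```

Here summed evaluates to [Just 11, Nothing, Nothing, Nothing], which matches the null propagation polars applies to arithmetic.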
Obviously, this could be a design decision as well, given the usual role of Maybe. IMO it would be nicer to adopt the polars behaviour here and not have to worry about the above examples.
Implementation
In mapColumn, it was straightforward to add a runOptional to replace the run used on OptionalColumn and get this behaviour. This would take care of apply and derive.
To handle the lift2 behaviour, at first glance it would be a case of adding some extra cases in zipWithColumns. Happy to take a look, but I wanted to understand what the desired behaviour is first.
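To make the idea concrete, here is a toy model of the two changes. It is only a sketch: the Column type below is a simplified stand-in and not dataframe's real representation, and although mapColumn and zipWithColumns reuse the names mentioned above, their signatures here are assumptions made for illustration.

```haskell
-- Toy stand-in for a column: either every row has a value, or rows may
-- be missing. This is NOT dataframe's actual Column type.
data Column a
  = Plain [a]
  | Optional [Maybe a]
  deriving (Eq, Show)

-- Assumed shape of mapColumn: on an optional column the function is
-- applied under the Maybe, so missing rows stay missing.
mapColumn :: (a -> b) -> Column a -> Column b
mapColumn f (Plain xs)    = Plain (map f xs)
mapColumn f (Optional xs) = Optional (map (fmap f) xs)

-- Assumed shape of zipWithColumns with the extra cases: if either side
-- is optional, the result is optional, and a row is Nothing when either
-- input is missing in that row.
zipWithColumns :: (a -> b -> c) -> Column a -> Column b -> Column c
zipWithColumns f (Plain xs) (Plain ys) =
  Plain (zipWith f xs ys)
zipWithColumns f (Plain xs) (Optional ys) =
  Optional (zipWith (\x my -> fmap (f x) my) xs ys)
zipWithColumns f (Optional xs) (Plain ys) =
  Optional (zipWith (\mx y -> fmap (\x -> f x y) mx) xs ys)
zipWithColumns f (Optional xs) (Optional ys) =
  Optional (zipWith (\mx my -> f <$> mx <*> my) xs ys)

-- The "value" column from the example data, doubled.
doubled :: Column Double
doubled = mapColumn (* 2) (Optional [Just 10.5, Nothing, Just 50.5])
```

In this model doubled is Optional [Just 21.0, Nothing, Just 101.0]. One design question the sketch leaves open is whether the result type should always become optional when an input is optional, or only when a missing value is actually present.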
Current behaviour
dataframe> df <- D.readCsv "test_missing.csv"
dataframe> df
-----------------------
category | value
---------|-------------
Int      | Maybe Double
---------|-------------
1        | Just 10.5
1        | Nothing
2        | Just 50.5
dataframe> df |> D.derive "doubled_value" (F.col @Double "value" * 2)
*** Exception:
[Error]: Type Mismatch
While running your code I tried to get a column of type: "Double" but the column in the dataframe was actually of type: "Maybe Double"
This happened when calling function interpret on (mult (col @Double "value") (lit (2.0)))
To get the desired outcome, we use lift and fmap (although let me know if this is the wrong approach).
dataframe> df |> D.derive "doubled_value" (F.lift (fmap (*2)) $ F.col @(Maybe Double) "value")
---------------------------------------
category | value        | doubled_value
---------|--------------|--------------
Int      | Maybe Double | Maybe Double
---------|--------------|--------------
1        | Just 10.5    | Just 21.0
1        | Nothing      | Nothing
2        | Just 50.5    | Just 101.0
Proposed behaviour
dataframe> df |> D.derive "doubled_value" (F.col @Double "value" * 2)
---------------------------------------
category | value        | doubled_value
---------|--------------|--------------
Int      | Maybe Double | Maybe Double
---------|--------------|--------------
1        | Just 10.5    | Just 21.0
1        | Nothing      | Nothing
2        | Just 50.5    | Just 101.0
For reference, the polars behaviour
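import polars as pl

# A None entry becomes a null in the f64 "value" column.
df = pl.DataFrame({
    "category": [1, 1, 2],
    "value": [10.5, None, 50.5],
})
# Arithmetic propagates nulls: null * 2 is null.
df = df.with_columns((pl.col("value") * 2).alias("doubled_value"))
print(df)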
shape: (3, 3)
┌──────────┬───────┬───────────────┐
│ category ┆ value ┆ doubled_value │
│ ---      ┆ ---   ┆ ---           │
│ i64      ┆ f64   ┆ f64           │
╞══════════╪═══════╪═══════════════╡
│ 1        ┆ 10.5  ┆ 21.0          │
│ 1        ┆ null  ┆ null          │
│ 2        ┆ 50.5  ┆ 101.0         │
└──────────┴───────┴───────────────┘