Default behaviour of apply, derive and lift2 when column(s) have missing values #160

@mcoady

Description

I'm comparing a pipeline written in polars to the equivalent in dataframe. I'm seeing a couple of mild 'pain points' around apply, derive and lift2, specifically relating to the handling of OptionalColumn.

It may be that I'm taking the wrong approach (do let me know if I'm missing something here), but I wanted to share some thoughts and suggestions.

Essentially, if I have a column with missing values, I have to handle the column types @a and @(Maybe a) explicitly, whereas polars needs no extra handling: missing values simply propagate (null in, null out).

So, in the case of apply, instead of handling both cases explicitly:

D.apply @Text f c
D.apply @(Maybe Text) (fmap f) c

we could just use the first line.

Similarly for derive and lift2:

D.derive @Text b (F.lift2 f (F.col a) (F.col b))
D.derive @(Maybe Text) b (F.lift2 (\x y -> f x <$> y) (F.col a) (F.col b))

we could just use the first line; the result would be Nothing wherever one or both arguments are missing.
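The semantics suggested here are what the Functor and Applicative instances of Maybe already give in plain Haskell. A minimal sketch, independent of dataframe's API (the names applyMissing and lift2Missing are hypothetical and only for illustration):

```haskell
import Control.Applicative (liftA2)

-- Hypothetical helper: apply a unary function, passing Nothing through.
applyMissing :: (a -> b) -> Maybe a -> Maybe b
applyMissing = fmap

-- Hypothetical helper: combine two possibly-missing values.
-- The result is Nothing if either argument is Nothing.
lift2Missing :: (a -> b -> c) -> Maybe a -> Maybe b -> Maybe c
lift2Missing = liftA2
```

The open question is only whether the library should apply this lifting implicitly when the column turns out to be optional.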

Obviously, the current behaviour could be a deliberate design decision, given the usual role of Maybe. IMO it would be nicer to adopt the polars behaviour here and not have to write the second form in the examples above.

Implementation

In mapColumn it was straightforward to add a runOptional that replaces the run currently used on OptionalColumn. This would take care of apply and derive.

To handle the lift2 behaviour, at first glance it would mean adding some extra cases to zipWithColumns. Happy to take a look, but I wanted to understand what the desired behaviour is first.
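To make the question about the extra cases concrete, here is a toy model. It is only a sketch: Col, Plain, Optional, mapCol and zipCols are hypothetical stand-ins, not the library's actual column type, mapColumn or zipWithColumns, and it assumes the result should become optional as soon as either input is.

```haskell
import Control.Applicative (liftA2)

-- Toy stand-in for a column: fully populated, or with missing values.
data Col a
  = Plain [a]
  | Optional [Maybe a]
  deriving (Eq, Show)

-- Map over a column, passing Nothing through for optional columns.
mapCol :: (a -> b) -> Col a -> Col b
mapCol f (Plain xs)    = Plain (map f xs)
mapCol f (Optional xs) = Optional (map (fmap f) xs)

-- Zip two columns row by row; the result is optional if either input is.
zipCols :: (a -> b -> c) -> Col a -> Col b -> Col c
zipCols f (Plain xs)    (Plain ys)    = Plain (zipWith f xs ys)
zipCols f (Optional xs) (Plain ys)    = Optional (zipWith (\x y -> fmap (`f` y) x) xs ys)
zipCols f (Plain xs)    (Optional ys) = Optional (zipWith (\x y -> fmap (f x) y) xs ys)
zipCols f (Optional xs) (Optional ys) = Optional (zipWith (liftA2 f) xs ys)
```

In this model the caller always passes a function on the unwrapped element type, and only the three cases involving Optional are new relative to a plain zip.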

Current behaviour

dataframe> df <- D.readCsv "test_missing.csv"
dataframe> df
-----------------------
category |    value
---------|-------------
  Int    | Maybe Double
---------|-------------
1        | Just 10.5
1        | Nothing
2        | Just 50.5

dataframe> df |> D.derive "doubled_value" (F.col @Double "value" * 2)
*** Exception:

[Error]: Type Mismatch
        While running your code I tried to get a column of type: "Double" but the column in the dataframe was actually of type: "Maybe Double"
        This happened when calling function interpret on (mult (col @Double "value") (lit (2.0)))

To get the desired outcome, we use lift and fmap (although let me know if this is the wrong approach).

dataframe> df |> D.derive "doubled_value" (F.lift (fmap (*2)) $ F.col @(Maybe Double) "value")
---------------------------------------
category |    value     | doubled_value
---------|--------------|--------------
  Int    | Maybe Double | Maybe Double
---------|--------------|--------------
1        | Just 10.5    | Just 21.0
1        | Nothing      | Nothing
2        | Just 50.5    | Just 101.0

Proposed behaviour

dataframe> df |> D.derive "doubled_value" (F.col @Double "value" * 2)
---------------------------------------
category |    value     | doubled_value
---------|--------------|--------------
  Int    | Maybe Double | Maybe Double
---------|--------------|--------------
1        | Just 10.5    | Just 21.0
1        | Nothing      | Nothing
2        | Just 50.5    | Just 101.0

For reference, the polars behaviour

import polars as pl

df = pl.DataFrame({
  "category": [1, 1, 2],
  "value": [10.5, None, 50.5],
})
df = df.with_columns((pl.col("value") * 2).alias("doubled_value"))
print(df)
shape: (3, 3)
┌──────────┬───────┬───────────────┐
│ category ┆ value ┆ doubled_value │
│ ---      ┆ ---   ┆ ---           │
│ i64      ┆ f64   ┆ f64           │
╞══════════╪═══════╪═══════════════╡
│ 1        ┆ 10.5  ┆ 21.0          │
│ 1        ┆ null  ┆ null          │
│ 2        ┆ 50.5  ┆ 101.0         │
└──────────┴───────┴───────────────┘
