This repository was archived by the owner on Jan 9, 2020. It is now read-only.
Commit bfc7e1f
[SPARK-20396][SQL][PYSPARK] groupby().apply() with pandas udf
## What changes were proposed in this pull request?
This PR adds an apply() function on df.groupby(). apply() takes a pandas udf that is a transformation on `pandas.DataFrame` -> `pandas.DataFrame`.
Static schema
-------------------
```
schema = df.schema
pandas_udf(schema)
def normalize(df):
df = df.assign(v1 = (df.v1 - df.v1.mean()) / df.v1.std()
return df
df.groupBy('id').apply(normalize)
```
Dynamic schema
-----------------------
**This use case is removed from the PR and we will discuss this as a follow up. See discussion apache#18732 (review)
Another example to use pd.DataFrame dtypes as output schema of the udf:
```
sample_df = df.filter(df.id == 1).toPandas()
def foo(df):
ret = # Some transformation on the input pd.DataFrame
return ret
foo_udf = pandas_udf(foo, foo(sample_df).dtypes)
df.groupBy('id').apply(foo_udf)
```
In interactive use case, user usually have a sample pd.DataFrame to test function `foo` in their notebook. Having been able to use `foo(sample_df).dtypes` frees user from specifying the output schema of `foo`.
Design doc: https://github.com/icexelloss/spark/blob/pandas-udf-doc/docs/pyspark-pandas-udf.md
## How was this patch tested?
* Added GroupbyApplyTest
Author: Li Jin <ice.xelloss@gmail.com>
Author: Takuya UESHIN <ueshin@databricks.com>
Author: Bryan Cutler <cutlerb@gmail.com>
Closes apache#18732 from icexelloss/groupby-apply-SPARK-20396.1 parent 2028e5a commit bfc7e1f
File tree
14 files changed
+561
-69
lines changed- python/pyspark
- sql
- sql
- catalyst/src/main/scala/org/apache/spark/sql/catalyst
- optimizer
- plans/logical
- core/src/main/scala/org/apache/spark/sql
- execution
- python
14 files changed
+561
-69
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1227 | 1227 | | |
1228 | 1228 | | |
1229 | 1229 | | |
1230 | | - | |
| 1230 | + | |
1231 | 1231 | | |
1232 | 1232 | | |
1233 | 1233 | | |
| |||
1248 | 1248 | | |
1249 | 1249 | | |
1250 | 1250 | | |
1251 | | - | |
| 1251 | + | |
1252 | 1252 | | |
1253 | 1253 | | |
1254 | 1254 | | |
| |||
1271 | 1271 | | |
1272 | 1272 | | |
1273 | 1273 | | |
1274 | | - | |
| 1274 | + | |
1275 | 1275 | | |
1276 | 1276 | | |
1277 | 1277 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2058 | 2058 | | |
2059 | 2059 | | |
2060 | 2060 | | |
2061 | | - | |
| 2061 | + | |
2062 | 2062 | | |
2063 | 2063 | | |
2064 | 2064 | | |
| |||
2090 | 2090 | | |
2091 | 2091 | | |
2092 | 2092 | | |
2093 | | - | |
| 2093 | + | |
2094 | 2094 | | |
2095 | 2095 | | |
2096 | 2096 | | |
| |||
2118 | 2118 | | |
2119 | 2119 | | |
2120 | 2120 | | |
| 2121 | + | |
2121 | 2122 | | |
2122 | 2123 | | |
| 2124 | + | |
2123 | 2125 | | |
2124 | 2126 | | |
2125 | 2127 | | |
| |||
2129 | 2131 | | |
2130 | 2132 | | |
2131 | 2133 | | |
2132 | | - | |
2133 | | - | |
| 2134 | + | |
| 2135 | + | |
| 2136 | + | |
| 2137 | + | |
| 2138 | + | |
| 2139 | + | |
2134 | 2140 | | |
2135 | 2141 | | |
2136 | 2142 | | |
| |||
2146 | 2152 | | |
2147 | 2153 | | |
2148 | 2154 | | |
2149 | | - | |
| 2155 | + | |
2150 | 2156 | | |
2151 | 2157 | | |
2152 | 2158 | | |
| |||
2181 | 2187 | | |
2182 | 2188 | | |
2183 | 2189 | | |
2184 | | - | |
2185 | | - | |
| 2190 | + | |
2186 | 2191 | | |
2187 | | - | |
| 2192 | + | |
2188 | 2193 | | |
2189 | 2194 | | |
2190 | | - | |
2191 | | - | |
2192 | | - | |
2193 | | - | |
2194 | | - | |
2195 | | - | |
2196 | | - | |
2197 | | - | |
2198 | | - | |
2199 | | - | |
2200 | | - | |
2201 | | - | |
2202 | | - | |
2203 | | - | |
2204 | | - | |
2205 | | - | |
2206 | | - | |
2207 | | - | |
| 2195 | + | |
| 2196 | + | |
| 2197 | + | |
| 2198 | + | |
| 2199 | + | |
| 2200 | + | |
| 2201 | + | |
| 2202 | + | |
| 2203 | + | |
| 2204 | + | |
| 2205 | + | |
| 2206 | + | |
| 2207 | + | |
| 2208 | + | |
| 2209 | + | |
| 2210 | + | |
| 2211 | + | |
| 2212 | + | |
| 2213 | + | |
| 2214 | + | |
| 2215 | + | |
| 2216 | + | |
| 2217 | + | |
| 2218 | + | |
| 2219 | + | |
| 2220 | + | |
| 2221 | + | |
| 2222 | + | |
| 2223 | + | |
| 2224 | + | |
| 2225 | + | |
| 2226 | + | |
| 2227 | + | |
| 2228 | + | |
| 2229 | + | |
| 2230 | + | |
| 2231 | + | |
| 2232 | + | |
| 2233 | + | |
| 2234 | + | |
| 2235 | + | |
| 2236 | + | |
| 2237 | + | |
| 2238 | + | |
| 2239 | + | |
| 2240 | + | |
| 2241 | + | |
| 2242 | + | |
| 2243 | + | |
| 2244 | + | |
| 2245 | + | |
| 2246 | + | |
| 2247 | + | |
| 2248 | + | |
| 2249 | + | |
| 2250 | + | |
| 2251 | + | |
| 2252 | + | |
| 2253 | + | |
2208 | 2254 | | |
2209 | 2255 | | |
2210 | 2256 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
54 | 54 | | |
55 | 55 | | |
56 | 56 | | |
57 | | - | |
| 57 | + | |
58 | 58 | | |
59 | | - | |
| 59 | + | |
| 60 | + | |
60 | 61 | | |
61 | 62 | | |
62 | 63 | | |
| |||
170 | 171 | | |
171 | 172 | | |
172 | 173 | | |
173 | | - | |
| 174 | + | |
174 | 175 | | |
175 | 176 | | |
176 | 177 | | |
| |||
192 | 193 | | |
193 | 194 | | |
194 | 195 | | |
195 | | - | |
| 196 | + | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
| 200 | + | |
| 201 | + | |
| 202 | + | |
| 203 | + | |
| 204 | + | |
| 205 | + | |
| 206 | + | |
| 207 | + | |
| 208 | + | |
| 209 | + | |
| 210 | + | |
| 211 | + | |
| 212 | + | |
| 213 | + | |
| 214 | + | |
| 215 | + | |
| 216 | + | |
| 217 | + | |
| 218 | + | |
| 219 | + | |
| 220 | + | |
| 221 | + | |
| 222 | + | |
| 223 | + | |
| 224 | + | |
| 225 | + | |
| 226 | + | |
| 227 | + | |
| 228 | + | |
| 229 | + | |
| 230 | + | |
| 231 | + | |
| 232 | + | |
| 233 | + | |
| 234 | + | |
| 235 | + | |
| 236 | + | |
| 237 | + | |
| 238 | + | |
| 239 | + | |
| 240 | + | |
| 241 | + | |
| 242 | + | |
| 243 | + | |
| 244 | + | |
| 245 | + | |
| 246 | + | |
| 247 | + | |
| 248 | + | |
| 249 | + | |
| 250 | + | |
| 251 | + | |
| 252 | + | |
| 253 | + | |
| 254 | + | |
| 255 | + | |
| 256 | + | |
| 257 | + | |
| 258 | + | |
| 259 | + | |
| 260 | + | |
| 261 | + | |
| 262 | + | |
| 263 | + | |
| 264 | + | |
| 265 | + | |
| 266 | + | |
| 267 | + | |
| 268 | + | |
| 269 | + | |
| 270 | + | |
| 271 | + | |
| 272 | + | |
| 273 | + | |
| 274 | + | |
196 | 275 | | |
197 | 276 | | |
198 | 277 | | |
| |||
206 | 285 | | |
207 | 286 | | |
208 | 287 | | |
| 288 | + | |
209 | 289 | | |
210 | 290 | | |
211 | 291 | | |
| |||
0 commit comments