|[StockDataSource](pyspark_datasources/stock.py)|`stock`| Batch Read | Read stock data from Alpha Vantage | None |`pip install pyspark-data-sources`<br/>`spark.read.format("stock").option("symbols", "AAPL,GOOGL").option("api_key", "key").load()`|
|**Batch Write**||||||
|[LanceSink](pyspark_datasources/lance.py)|`lance`| Batch Write | Write data in Lance format |`lance`|`pip install pyspark-data-sources[lance]`<br/>`df.write.format("lance").mode("append").save("/tmp/lance_data")`|
|[SimpleJsonDataSource](pyspark_datasources/simplejson.py)|`simplejson`| Batch Write | Write JSON data to Databricks DBFS |`databricks-sdk`|`pip install pyspark-data-sources[simplejson]`<br/>`df.write.format("simplejson").save("/path/to/output")`|
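To put the table to work end to end, the snippet below registers one source and one sink and chains a batch read into a batch write. This is a minimal sketch: the class names `StockDataSource` and `LanceSink` are inferred from the module paths in the table above, and the Alpha Vantage key is a placeholder.

```python
# Minimal sketch (assumes PySpark >= 4.0 and `pip install pyspark-data-sources[lance]`;
# class names are inferred from the module paths in the table above).
from pyspark.sql import SparkSession
from pyspark_datasources.stock import StockDataSource
from pyspark_datasources.lance import LanceSink

spark = SparkSession.builder.getOrCreate()

# Register the custom data sources with this session.
spark.dataSource.register(StockDataSource)
spark.dataSource.register(LanceSink)

# Batch read from Alpha Vantage, then batch write in Lance format.
df = (
    spark.read.format("stock")
    .option("symbols", "AAPL,GOOGL")
    .option("api_key", "YOUR_API_KEY")  # placeholder: a real Alpha Vantage key
    .load()
)
df.write.format("lance").mode("append").save("/tmp/lance_data")
```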
`PythonMicroBatchStream`:

- Spawns a Python worker process running `python_streaming_source_runner.py`
- Serializes/deserializes data using Arrow format
- **Throughput ceiling**: limited by the driver's processing capacity
### Important Note from Source Code
From datasource.py:
> "Because SimpleDataSourceStreamReader read records in Spark driver node to determine end offset of each batch without partitioning, it is only supposed to be used in lightweight use cases where input rate and batch size is small."