Commit 97a3600

update
1 parent ca97e9c commit 97a3600

3 files changed: +4 −6 lines

README.md

Lines changed: 0 additions & 1 deletion
```diff
@@ -50,7 +50,6 @@ spark.readStream.format("fake").load().writeStream.format("console").start()
 | [StockDataSource](pyspark_datasources/stock.py) | `stock` | Batch Read | Read stock data from Alpha Vantage | None | `pip install pyspark-data-sources`<br/>`spark.read.format("stock").option("symbols", "AAPL,GOOGL").option("api_key", "key").load()` |
 | **Batch Write** | | | | | |
 | [LanceSink](pyspark_datasources/lance.py) | `lance` | Batch Write | Write data in Lance format | `lance` | `pip install pyspark-data-sources[lance]`<br/>`df.write.format("lance").mode("append").save("/tmp/lance_data")` |
-| [SimpleJsonDataSource](pyspark_datasources/simplejson.py) | `simplejson` | Batch Write | Write JSON data to Databricks DBFS | `databricks-sdk` | `pip install pyspark-data-sources[simplejson]`<br/>`df.write.format("simplejson").save("/path/to/output")` |
 | **Streaming Read** | | | | | |
 | [OpenSkyDataSource](pyspark_datasources/opensky.py) | `opensky` | Streaming Read | Read from OpenSky Network. | None | `pip install pyspark-data-sources`<br/>`spark.readStream.format("opensky").option("region", "EUROPE").load()` |
 | [WeatherDataSource](pyspark_datasources/weather.py) | `weather` | Streaming Read | Fetch weather data from tomorrow.io | None | `pip install pyspark-data-sources`<br/>`spark.readStream.format("weather").option("locations", "[(37.7749, -122.4194)]").option("apikey", "key").load()` |
```

docs/index.md

Lines changed: 0 additions & 1 deletion
```diff
@@ -37,7 +37,6 @@ spark.readStream.format("fake").load().writeStream.format("console").start()
 | [FakeDataSource](./datasources/fake.md) | `fake` | Generate fake data using the `Faker` library | `faker` |
 | [HuggingFaceDatasets](./datasources/huggingface.md) | `huggingface` | Read datasets from the HuggingFace Hub | `datasets` |
 | [StockDataSource](./datasources/stock.md) | `stock` | Read stock data from Alpha Vantage | None |
-| [SimpleJsonDataSource](./datasources/simplejson.md) | `simplejson` | Write JSON data to Databricks DBFS | `databricks-sdk` |
 | [SalesforceDataSource](./datasources/salesforce.md) | `pyspark.datasource.salesforce` | Write streaming data to Salesforce objects | `simple-salesforce` |
 | [GoogleSheetsDataSource](./datasources/googlesheets.md) | `googlesheets` | Read table from public Google Sheets document | None |
 | [KaggleDataSource](./datasources/kaggle.md) | `kaggle` | Read datasets from Kaggle | `kagglehub`, `pandas` |
```

docs/simple-stream-reader-architecture.md

Lines changed: 4 additions & 4 deletions
```diff
@@ -8,7 +8,7 @@
 
 ### Python-Side Components
 
-#### SimpleDataSourceStreamReader (datasource.py:816-911)
+#### SimpleDataSourceStreamReader (datasource.py)
 The user-facing API with three core methods:
 - `initialOffset()`: Returns the starting position for a new streaming query
 - `read(start)`: Reads all available data from a given offset and returns both the data and the next offset
```
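For context on the API this hunk documents, here is a minimal sketch of a `SimpleDataSourceStreamReader` subclass (the `CounterStreamReader` name and its counting logic are hypothetical; only `initialOffset()` and `read(start)` come from the API described above):

```python
from pyspark.sql.datasource import SimpleDataSourceStreamReader


class CounterStreamReader(SimpleDataSourceStreamReader):
    """Hypothetical reader that emits a few integers per micro-batch."""

    def initialOffset(self) -> dict:
        # Starting position for a brand-new streaming query.
        return {"offset": 0}

    def read(self, start: dict):
        # Read everything available past `start` and return both the
        # rows and the offset where the next micro-batch should begin.
        end = start["offset"] + 3
        rows = iter([(i,) for i in range(start["offset"], end)])
        return rows, {"offset": end}
```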
```diff
@@ -22,13 +22,13 @@ A private wrapper that implements the prefetch-and-cache pattern:
 
 ### Scala-Side Components
 
-#### PythonMicroBatchStream (PythonMicroBatchStream.scala:31-111)
+#### PythonMicroBatchStream (PythonMicroBatchStream.scala)
 Manages the micro-batch execution:
 - Creates and manages `PythonStreamingSourceRunner` for Python communication
 - Stores prefetched data in BlockManager with `PythonStreamBlockId`
 - Handles offset management and partition planning
 
-#### PythonStreamingSourceRunner (PythonStreamingSourceRunner.scala:63-268)
+#### PythonStreamingSourceRunner (PythonStreamingSourceRunner.scala)
 The bridge between JVM and Python:
 - Spawns a Python worker process running `python_streaming_source_runner.py`
 - Serializes/deserializes data using Arrow format
```
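The Arrow hand-off in the last bullet can be pictured with plain `pyarrow`; this is an illustration under assumptions, not the actual wire protocol (which lives in `python_streaming_source_runner.py` and `PythonStreamingSourceRunner.scala`):

```python
import pyarrow as pa

# Rows cross the JVM/Python boundary as Arrow record batches
# written to an IPC stream; round-trip one batch to show the idea.
batch = pa.RecordBatch.from_pydict({"value": [0, 1, 2]})

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
payload = sink.getvalue()  # bytes handed across the process boundary

with pa.ipc.open_stream(payload) as reader:
    assert reader.read_all().num_rows == 3
```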
```diff
@@ -146,7 +146,7 @@ PythonMicroBatchStream
 - **Throughput ceiling**: Limited by driver's processing capacity
 
 ### Important Note from Source Code
-From datasource.py:823-827:
+From datasource.py:
 > "Because SimpleDataSourceStreamReader read records in Spark driver node to determine end offset of each batch without partitioning, it is only supposed to be used in lightweight use cases where input rate and batch size is small."
 
 ## Use Cases
```
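To make the driver-only caveat concrete, here is a sketch of how such a reader is registered and consumed, building on the hypothetical `CounterStreamReader` above (the `counter` format name is likewise made up; `simpleStreamReader` and `spark.dataSource.register` are the standard Python data source hooks):

```python
from pyspark.sql.datasource import DataSource


class CounterDataSource(DataSource):
    """Hypothetical data source wrapping CounterStreamReader."""

    @classmethod
    def name(cls):
        return "counter"

    def schema(self):
        return "value int"

    def simpleStreamReader(self, schema):
        # Runs entirely on the driver -- hence the note that this
        # path suits only small input rates and batch sizes.
        return CounterStreamReader()


spark.dataSource.register(CounterDataSource)
spark.readStream.format("counter").load() \
    .writeStream.format("console").start()
```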
