Commit 97a3600

update
1 parent ca97e9c commit 97a3600

3 files changed: +4 −6 lines

README.md

Lines changed: 0 additions & 1 deletion
```diff
@@ -50,7 +50,6 @@ spark.readStream.format("fake").load().writeStream.format("console").start()
 | [StockDataSource](pyspark_datasources/stock.py) | `stock` | Batch Read | Read stock data from Alpha Vantage | None | `pip install pyspark-data-sources`<br/>`spark.read.format("stock").option("symbols", "AAPL,GOOGL").option("api_key", "key").load()` |
 | **Batch Write** | | | | | |
 | [LanceSink](pyspark_datasources/lance.py) | `lance` | Batch Write | Write data in Lance format | `lance` | `pip install pyspark-data-sources[lance]`<br/>`df.write.format("lance").mode("append").save("/tmp/lance_data")` |
-| [SimpleJsonDataSource](pyspark_datasources/simplejson.py) | `simplejson` | Batch Write | Write JSON data to Databricks DBFS | `databricks-sdk` | `pip install pyspark-data-sources[simplejson]`<br/>`df.write.format("simplejson").save("/path/to/output")` |
 | **Streaming Read** | | | | | |
 | [OpenSkyDataSource](pyspark_datasources/opensky.py) | `opensky` | Streaming Read | Read from OpenSky Network. | None | `pip install pyspark-data-sources`<br/>`spark.readStream.format("opensky").option("region", "EUROPE").load()` |
 | [WeatherDataSource](pyspark_datasources/weather.py) | `weather` | Streaming Read | Fetch weather data from tomorrow.io | None | `pip install pyspark-data-sources`<br/>`spark.readStream.format("weather").option("locations", "[(37.7749, -122.4194)]").option("apikey", "key").load()` |
```

docs/index.md

Lines changed: 0 additions & 1 deletion
```diff
@@ -37,7 +37,6 @@ spark.readStream.format("fake").load().writeStream.format("console").start()
 | [FakeDataSource](./datasources/fake.md) | `fake` | Generate fake data using the `Faker` library | `faker` |
 | [HuggingFaceDatasets](./datasources/huggingface.md) | `huggingface` | Read datasets from the HuggingFace Hub | `datasets` |
 | [StockDataSource](./datasources/stock.md) | `stock` | Read stock data from Alpha Vantage | None |
-| [SimpleJsonDataSource](./datasources/simplejson.md) | `simplejson` | Write JSON data to Databricks DBFS | `databricks-sdk` |
 | [SalesforceDataSource](./datasources/salesforce.md) | `pyspark.datasource.salesforce` | Write streaming data to Salesforce objects | `simple-salesforce` |
 | [GoogleSheetsDataSource](./datasources/googlesheets.md) | `googlesheets` | Read table from public Google Sheets document | None |
 | [KaggleDataSource](./datasources/kaggle.md) | `kaggle` | Read datasets from Kaggle | `kagglehub`, `pandas` |
```

docs/simple-stream-reader-architecture.md

Lines changed: 4 additions & 4 deletions
```diff
@@ -8,7 +8,7 @@
 
 ### Python-Side Components
 
-#### SimpleDataSourceStreamReader (datasource.py:816-911)
+#### SimpleDataSourceStreamReader (datasource.py)
 The user-facing API with three core methods:
 - `initialOffset()`: Returns the starting position for a new streaming query
 - `read(start)`: Reads all available data from a given offset and returns both the data and the next offset
```
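For context on the API this hunk documents, here is a minimal sketch of a `SimpleDataSourceStreamReader` subclass (the `CounterStreamReader` name and its counting logic are hypothetical; only `initialOffset()` and `read(start)` come from the API described above):

```python
from pyspark.sql.datasource import SimpleDataSourceStreamReader


class CounterStreamReader(SimpleDataSourceStreamReader):
    """Hypothetical reader that emits a few integers per micro-batch."""

    def initialOffset(self) -> dict:
        # Starting position for a brand-new streaming query.
        return {"offset": 0}

    def read(self, start: dict):
        # Read everything available past `start` and return both the
        # rows and the offset where the next micro-batch should begin.
        end = start["offset"] + 3
        rows = iter([(i,) for i in range(start["offset"], end)])
        return rows, {"offset": end}
```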
```diff
@@ -22,13 +22,13 @@ A private wrapper that implements the prefetch-and-cache pattern:
 
 ### Scala-Side Components
 
-#### PythonMicroBatchStream (PythonMicroBatchStream.scala:31-111)
+#### PythonMicroBatchStream (PythonMicroBatchStream.scala)
 Manages the micro-batch execution:
 - Creates and manages `PythonStreamingSourceRunner` for Python communication
 - Stores prefetched data in BlockManager with `PythonStreamBlockId`
 - Handles offset management and partition planning
 
-#### PythonStreamingSourceRunner (PythonStreamingSourceRunner.scala:63-268)
+#### PythonStreamingSourceRunner (PythonStreamingSourceRunner.scala)
 The bridge between JVM and Python:
 - Spawns a Python worker process running `python_streaming_source_runner.py`
 - Serializes/deserializes data using Arrow format
```
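The Arrow hand-off in the last bullet can be pictured with plain `pyarrow`; this is an illustration under assumptions, not the actual wire protocol (which lives in `python_streaming_source_runner.py` and `PythonStreamingSourceRunner.scala`):

```python
import pyarrow as pa

# Rows cross the JVM/Python boundary as Arrow record batches
# written to an IPC stream; round-trip one batch to show the idea.
batch = pa.RecordBatch.from_pydict({"value": [0, 1, 2]})

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
payload = sink.getvalue()  # bytes handed across the process boundary

with pa.ipc.open_stream(payload) as reader:
    assert reader.read_all().num_rows == 3
```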
```diff
@@ -146,7 +146,7 @@ PythonMicroBatchStream
 - **Throughput ceiling**: Limited by driver's processing capacity
 
 ### Important Note from Source Code
-From datasource.py:823-827:
+From datasource.py:
 > "Because SimpleDataSourceStreamReader read records in Spark driver node to determine end offset of each batch without partitioning, it is only supposed to be used in lightweight use cases where input rate and batch size is small."
 
 ## Use Cases
```
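To make the driver-only caveat concrete, here is a sketch of how such a reader is registered and consumed, building on the hypothetical `CounterStreamReader` above (the `counter` format name is likewise made up; `simpleStreamReader` and `spark.dataSource.register` are the standard Python data source hooks):

```python
from pyspark.sql.datasource import DataSource


class CounterDataSource(DataSource):
    """Hypothetical data source wrapping CounterStreamReader."""

    @classmethod
    def name(cls):
        return "counter"

    def schema(self):
        return "value int"

    def simpleStreamReader(self, schema):
        # Runs entirely on the driver -- hence the note that this
        # path suits only small input rates and batch sizes.
        return CounterStreamReader()


spark.dataSource.register(CounterDataSource)
spark.readStream.format("counter").load() \
    .writeStream.format("console").start()
```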
