Complete exercise.py by filling in all six TODOs. The Spark cluster is already
running. You just need to write the code and submit the job.
data/raw/coffee_orders.csv — 100 rows of coffee shop orders.
Columns: Order_ID, Customer_Name, Item, Quantity, Price, Total, Order_Date, Channel
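Before submitting the job, it can help to confirm that the CSV matches the description above. The following is a minimal plain-Python sketch, not part of exercise.py. The helper name check_csv and the UTF-8 encoding are assumptions, and the column list is copied from the line above.

```python
import csv

# Column names as listed in the exercise description.
EXPECTED_COLUMNS = [
    "Order_ID", "Customer_Name", "Item", "Quantity",
    "Price", "Total", "Order_Date", "Channel",
]


def check_csv(path, expected_columns=EXPECTED_COLUMNS):
    """Return the number of data rows after confirming the header matches."""
    # Assumption: the file is UTF-8 encoded without a byte-order mark.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        if header != expected_columns:
            raise ValueError(f"unexpected header: {header}")
        # Count non-empty rows only, so a trailing blank line is ignored.
        return sum(1 for row in reader if row)
```

If the file matches the description, check_csv("data/raw/coffee_orders.csv") run from the project root should return 100.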
TODO 0 - Create an output folder under the data folder for your results.
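One way to do this from Python is sketched below. The folder name output is inferred from the /opt/spark/data/output paths used in TODO 4. The helper name ensure_output_dir is hypothetical, and the sketch assumes the local data folder is what the container sees as /opt/spark/data.

```python
import os


def ensure_output_dir(data_dir):
    """Create <data_dir>/output if it is missing and return its path."""
    output_dir = os.path.join(data_dir, "output")
    # exist_ok=True makes the call safe to repeat on later runs.
    os.makedirs(output_dir, exist_ok=True)
    return output_dir
```

Calling ensure_output_dir("data") from the project root would create data/output. Creating the folder by hand works just as well.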
TODO 1 — Create a SparkSession with appName "CoffeeExercise", master
spark://spark-master:7077, and log level WARN.
TODO 2 — Read the CSV from /opt/spark/data/raw/coffee_orders.csv with
header=True and inferSchema=True. Print the schema and row count.
TODO 3 — Add three columns using withColumn():
- order_size: "small" if Total < 10, "medium" if Total < 30, "large" otherwise
- revenue_per_item: Total / Quantity rounded to 2 decimal places (use spark_round)
- processed_at: current timestamp
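The boundaries in these rules are easy to get wrong: a Total of exactly 10 is "medium" and a Total of exactly 30 is "large". The plain-Python reference below restates the two computed columns so you can spot-check a few rows of Spark output by hand. It is not the withColumn() implementation itself, and the function names are illustrative. It rounds half up, which is the mode Spark's round function documents. Results may still differ in the last digit for values that are not exactly representable as doubles.

```python
from decimal import Decimal, ROUND_HALF_UP


def order_size(total):
    """Bucket an order total using the thresholds from TODO 3."""
    if total < 10:
        return "small"
    if total < 30:
        return "medium"
    return "large"


def revenue_per_item(total, quantity):
    """Total / Quantity rounded to 2 decimal places, rounding half up."""
    # Assumption: quantity is non-zero; this reference raises on zero.
    ratio = Decimal(str(total)) / Decimal(str(quantity))
    return float(ratio.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
```

In exercise.py the same conditions would typically be written as column expressions, for example with when(...).otherwise(...) from pyspark.sql.functions.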
TODO 4 — Write the transformed DataFrame to Parquet and ORC with
mode="overwrite":
- Parquet → /opt/spark/data/output/parquet
- ORC → /opt/spark/data/output/orc
TODO 5 — Register a temp view named "orders" and run a Spark SQL query
returning order count, total revenue, and average order total grouped by Channel.
Time the query using time.time() before and after .collect() and print the
elapsed time.
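Spark builds queries lazily, so the timer has to wrap the call that actually fetches rows. The sketch below shows the shape of the aggregation and the timing pattern using an in-memory SQLite table as a stand-in for the Spark temp view, so it runs without a cluster. The sample rows, the channel values, the column aliases, the ORDER BY clause, and the helper name run_timed are all invented for illustration.

```python
import sqlite3
import time

# Aggregation described in TODO 5; alias names are illustrative, and
# ORDER BY is added only to make the printed output deterministic.
QUERY = """
SELECT Channel,
       COUNT(*)   AS order_count,
       SUM(Total) AS total_revenue,
       AVG(Total) AS avg_order_total
FROM orders
GROUP BY Channel
ORDER BY Channel
"""


def run_timed(fetch):
    """Call fetch() and return (result, elapsed_seconds)."""
    start = time.time()
    result = fetch()
    return result, time.time() - start


# Stand-in for the "orders" temp view, filled with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (Channel TEXT, Total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Online", 12.0), ("Online", 8.0), ("In-Store", 30.0)],
)

# fetchall() plays the role that .collect() plays in Spark.
rows, elapsed = run_timed(lambda: conn.execute(QUERY).fetchall())
for row in rows:
    print(row)
print(f"Query time: {elapsed:.4f} s")
```

In exercise.py, the analogous call after df.createOrReplaceTempView("orders") would be run_timed(lambda: spark.sql(QUERY).collect()), assuming df is the transformed DataFrame and spark is the session from TODO 1.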
TODO 6 — Call spark.stop().
Update the last line of the spark-submit command in docker-compose.yaml to
point to exercise.py, then run:
docker compose run spark-submit
Push the following to a private GitHub repo and submit the URL. Add the professor and TAs as collaborators.
- exercise.py: completed with all six TODOs
- screenshot.png: terminal showing the SQL result table and query time