[DOCS] Clarify DataFrames in quickstart #54428

Open
celestehorgan wants to merge 1 commit into apache:master from celestehorgan:update-quickstart

@celestehorgan

What changes were proposed in this pull request?

This pull request clarifies some of the language around DataFrames and Datasets in the Python Quickstart, and corrects some grammar/sentence structure in the first section of the Quickstart guide. No breaking changes are introduced.

Why are the changes needed?

The Quickstart is one of the highest-traffic pages on any documentation website. The original authors saw fit to introduce the idea of DataFrames vs. Datasets in the Python quickstart, but the user needs to understand why that matters: other languages they might use Spark in implement things differently. Indeed, the Scala quickstart one tab over sticks entirely with the concept of Datasets.

Does this PR introduce any user-facing change?

Yes! Some language in https://spark.apache.org/docs/latest/quick-start.html changes.

How was this patch tested?

The documentation site was built locally with this patch applied to confirm that it still builds.

Was this patch authored or co-authored using generative AI tooling?

No

Contributor

@holdenk holdenk left a comment

I know it's still WIP, but I just left a comment about leaving in the transformation language.

{% endhighlight %}

- You can get values from DataFrame directly, by calling some actions, or transform the DataFrame to get a new one. For more details, please read the _[API doc](api/python/index.html#pyspark.sql.DataFrame)_.
+ Once you've created the DataFrame, you can perform actions against it. For more details see the [API doc](api/python/index.html#pyspark.sql.DataFrame).
Contributor

Actions and transformations are separate concepts in Spark, so it would be better to mention both.

Broadly speaking, a transformation is something that gives you back another DataFrame/Dataset/RDD, and an action is something that collects, writes out, or otherwise forces evaluation of a DataFrame/Dataset/RDD.

The distinction is a bit blurrier for a few specific transformations, but that's beyond the scope of getting started.
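To make the transformation/action distinction concrete, here is a toy sketch in plain Python. It is not PySpark and not how Spark is implemented: `LazyFrame`, its methods, and `is_spark_line` are hypothetical names invented for illustration. It only models the idea that a transformation returns a new object and records work, while an action forces that work to run.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame (hypothetical; not a Spark class)."""

    def __init__(self, rows, ops=()):
        self._rows = rows  # source data
        self._ops = ops    # recorded steps that have not run yet

    # Transformations: return a new LazyFrame and do no work.
    def filter(self, predicate):
        return LazyFrame(self._rows, self._ops + (("filter", predicate),))

    def map(self, func):
        return LazyFrame(self._rows, self._ops + (("map", func),))

    # Actions: force evaluation and return plain Python values.
    def collect(self):
        out = []
        for row in self._rows:
            keep = True
            for kind, fn in self._ops:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                out.append(row)
        return out

    def count(self):
        return len(self.collect())


calls = []  # records every time the predicate actually runs


def is_spark_line(line):
    calls.append(line)
    return "Spark" in line


lines = LazyFrame(["Apache Spark", "hello", "Spark SQL"])

# Transformation: gives back another LazyFrame; the predicate has not run.
spark_lines = lines.filter(is_spark_line)
evaluated_before_action = len(calls)

# Action: forces evaluation of the recorded steps.
n = spark_lines.count()
```

In this toy model the predicate is never called until `count()` runs, which mirrors the "forces evaluation" wording above. Real Spark adds query planning, partitioning, and distributed execution that this sketch ignores.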

Author

Is this not addressed by the next section in L#76?

@celestehorgan celestehorgan changed the title [WIP][DOCS] Clarify DataFrames in quickstart [DOCS] Clarify DataFrames in quickstart Feb 24, 2026
@celestehorgan celestehorgan marked this pull request as ready for review February 24, 2026 18:41