Dataset vs dataframe in PySpark
2024
In PySpark, the DataFrame is the most commonly used data structure; the Dataset API is not available in PySpark (it exists only in Scala and Java). It is still worth explaining how PySpark DataFrames differ from Datasets in the context of Apache Spark, and why PySpark relies on DataFrames.
Conceptual Difference:
- DataFrame: A distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a pandas DataFrame in Python (see the short comparison sketch after this list).
- Dataset: Available in Scala and Java, Datasets are strongly-typed, meaning they provide compile-time type safety by allowing developers to work with domain-specific objects. Datasets in Scala and Java combine the benefits of RDDs (Resilient Distributed Datasets) and DataFrames.
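To make the pandas analogy concrete, here is a minimal sketch, the same small table built both ways. The application name and the sample rows are illustrative only; the session creation mirrors the hands-on example further below.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ConceptSketch").getOrCreate()

rows = [("John", 28), ("Alice", 24), ("Bob", 30)]
cols = ["Name", "Age"]

# A pandas DataFrame lives in local memory and is evaluated eagerly
pdf = pd.DataFrame(rows, columns=cols)

# A PySpark DataFrame has the same named-column structure,
# but the data is partitioned across the cluster and evaluated lazily
sdf = spark.createDataFrame(rows, cols)

print(pdf)   # prints immediately
sdf.show()   # triggers a Spark job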
Hands-on Example with PySpark DataFrames
Since Datasets are not directly available in PySpark, we focus on DataFrames, which are widely used and highly optimized in PySpark for distributed data processing.
PySpark DataFrame Example
1. Initialize PySpark and Create a DataFrame
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()
# Create a simple DataFrame
data = [("John", 28), ("Alice", 24), ("Bob", 30)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Show the DataFrame
df.show()
This creates a DataFrame with two columns: “Name” and “Age”.
2. Perform Basic Operations on the DataFrame
DataFrames allow you to apply various transformations and actions.
- Filtering Rows:
# Filter rows where age is greater than 25
df_filtered = df.filter(df.Age > 25)
df_filtered.show()
- Selecting Specific Columns:
# Select the 'Name' column
df_selected = df.select("Name")
df_selected.show()
- Adding a New Column:
# Add a new column 'Age_in_5_years'
df = df.withColumn("Age_in_5_years", df.Age + 5)
df.show()
Key Characteristics of PySpark DataFrames:
- Schema: DataFrames have schemas, meaning they know the structure of the data (data types, column names).
df.printSchema()
- Lazy Evaluation: Transformations like filter(), select(), or withColumn() are not executed immediately. The computation happens only when an action (like show() or collect()) is called.
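To see this laziness in practice, here is a small sketch reusing the df from above. The column name Age_doubled is just for illustration; the point is that building the transformation chain does no work until the action runs.

# Transformations return a new DataFrame immediately; no data is processed yet
pending = df.filter(df.Age > 25).withColumn("Age_doubled", df.Age * 2)

# Only an action such as count() (or show()) triggers the actual computation
print(pending.count())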
3. Write the DataFrame to a File (Action)
# Write the DataFrame to a CSV file
df.write.csv("output.csv", header=True)
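Writing is itself an action, so this line triggers execution. As a quick check, you can read the result back; note that Spark writes output.csv as a directory of part files rather than a single file. The variable name df_read is just for illustration.

# Read the CSV back; header=True uses the first row as column names,
# inferSchema=True asks Spark to detect the column types
df_read = spark.read.csv("output.csv", header=True, inferSchema=True)
df_read.show()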
Differences Between DataFrame and Dataset (Scala/Java):
- DataFrame:
  - Untyped and works with Row objects, which are similar to tuples (see the sketch after this list).
  - Focuses on high-level operations and is optimized for performance.
- Dataset (Scala/Java only):
  - Strongly-typed, meaning you can work with objects of a specific class (e.g., Person).
  - Provides compile-time type safety, which helps catch errors earlier in the development cycle.
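As a sketch of what "untyped" means in PySpark, collecting a DataFrame brings back generic Row objects whose fields are accessed by name, attribute, or position rather than instances of a user-defined class (reusing the df from the earlier example):

# collect() returns the data to the driver as a list of generic Row objects
collected = df.collect()
first = collected[0]

print(first)           # e.g. Row(Name='John', Age=28, ...)
print(first["Name"])   # access a field by column name
print(first.Age)       # access a field as an attribute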
Example in Scala (Dataset):
// Scala example to create a Dataset (not available in PySpark)
// spark.implicits._ supplies the encoders that createDataset needs
// (the spark-shell imports this automatically)
import spark.implicits._

case class Person(name: String, age: Int)
val data = Seq(Person("John", 28), Person("Alice", 24), Person("Bob", 30))
val ds = spark.createDataset(data)
// Dataset has strong typing
ds.show()
In the above Scala code, we define a Person case class, and the Dataset ensures that each entry conforms to the Person schema.
Why PySpark Uses DataFrames:
PySpark focuses on DataFrames because Python is dynamically typed: compile-time type checks cannot be enforced in Python, so the strongly-typed Dataset API available in Scala and Java is not exposed. DataFrames still give PySpark all the optimizations needed for efficient distributed processing, because every query is planned and optimized by the Catalyst optimizer regardless of the language it was written in.
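One way to see the Catalyst optimizer at work is to ask Spark for the extended query plan. A sketch, reusing the df from the earlier example; the variable name query is just for illustration:

# Build a small query and inspect how Catalyst handles it;
# explain(True) prints the parsed, analyzed, and optimized logical plans
# plus the physical plan chosen for execution
query = df.filter(df.Age > 25).select("Name")
query.explain(True)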
Conclusion:
- In PySpark, you work with DataFrames, which are optimized for distributed data processing and do not require strong typing.
- Datasets are a feature of Spark available in Scala and Java, where strong typing and compile-time type safety are essential.