Dataset vs dataframe in PySpark
2024
In PySpark, the DataFrame is the most commonly used data structure; the Dataset API is not available in PySpark (it exists only in Scala and Java). It is still worth explaining how PySpark DataFrames differ from Datasets in the context of Apache Spark, and why PySpark relies on DataFrames.
Conceptual Difference:
- DataFrame: A distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a pandas DataFrame in Python (see the short comparison sketch after this list).
- Dataset: Available in Scala and Java, Datasets are strongly-typed, meaning they provide compile-time type safety by allowing developers to work with domain-specific objects. Datasets in Scala and Java combine the benefits of RDDs (Resilient Distributed Datasets) and DataFrames.
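To make the pandas analogy concrete, here is a minimal sketch, the same small table built both ways. The application name and the sample rows are illustrative only; the session creation mirrors the hands-on example further below.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ConceptSketch").getOrCreate()

rows = [("John", 28), ("Alice", 24), ("Bob", 30)]
cols = ["Name", "Age"]

# A pandas DataFrame lives in local memory and is evaluated eagerly
pdf = pd.DataFrame(rows, columns=cols)

# A PySpark DataFrame has the same named-column structure,
# but the data is partitioned across the cluster and evaluated lazily
sdf = spark.createDataFrame(rows, cols)

print(pdf)   # prints immediately
sdf.show()   # triggers a Spark job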
Hands-on Example with PySpark DataFrames
Since Datasets are not directly available in PySpark, we focus on DataFrames, which are widely used and highly optimized in PySpark for distributed data processing.
PySpark DataFrame Example
1. Initialize PySpark and Create a DataFrame
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()
# Create a simple DataFrame
data = [("John", 28), ("Alice", 24), ("Bob", 30)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Show the DataFrame
df.show()
This creates a DataFrame with two columns: “Name” and “Age”.
2. Perform Basic Operations on the DataFrame
DataFrames allow you to apply various transformations and actions.
- Filtering Rows:
# Filter rows where age is greater than 25
df_filtered = df.filter(df.Age > 25)
df_filtered.show()
- Selecting Specific Columns:
# Select the 'Name' column
df_selected = df.select("Name")
df_selected.show()
- Adding a New Column:
# Add a new column 'Age_in_5_years'
df = df.withColumn("Age_in_5_years", df.Age + 5)
df.show()
Key Characteristics of PySpark DataFrames:
- Schema: DataFrames have schemas, meaning they know the structure of the data (data types, column names).
df.printSchema()
- Lazy Evaluation: Transformations like filter(), select(), or withColumn() are not executed immediately. The computation happens only when an action (like show() or collect()) is called.
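To see this laziness in practice, here is a small sketch reusing the df from above. The column name Age_doubled is just for illustration; the point is that building the transformation chain does no work until the action runs.

# Transformations return a new DataFrame immediately; no data is processed yet
pending = df.filter(df.Age > 25).withColumn("Age_doubled", df.Age * 2)

# Only an action such as count() (or show()) triggers the actual computation
print(pending.count())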
3. Write the DataFrame to a File (Action)
# Write the DataFrame to a CSV file
df.write.csv("output.csv", header=True)
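Writing is itself an action, so this line triggers execution. As a quick check, you can read the result back; note that Spark writes output.csv as a directory of part files rather than a single file. The variable name df_read is just for illustration.

# Read the CSV back; header=True uses the first row as column names,
# inferSchema=True asks Spark to detect the column types
df_read = spark.read.csv("output.csv", header=True, inferSchema=True)
df_read.show()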
Differences Between DataFrame and Dataset (Scala/Java):
- DataFrame:
  - Untyped and works with Row objects, which are similar to tuples (see the sketch after this list).
  - Focuses on high-level operations and is optimized for performance.
- Dataset (Scala/Java only):
  - Strongly-typed, meaning you can work with objects of a specific class (e.g., Person).
  - Provides compile-time type safety, which helps catch errors earlier in the development cycle.
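As a sketch of what "untyped" means in PySpark, collecting a DataFrame brings back generic Row objects whose fields are accessed by name, attribute, or position rather than instances of a user-defined class (reusing the df from the earlier example):

# collect() returns the data to the driver as a list of generic Row objects
collected = df.collect()
first = collected[0]

print(first)           # e.g. Row(Name='John', Age=28, ...)
print(first["Name"])   # access a field by column name
print(first.Age)       # access a field as an attribute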
Example in Scala (Dataset):
// Scala example to create a Dataset (not available in PySpark)
// spark.implicits._ supplies the encoders that createDataset needs
// (the spark-shell imports this automatically)
import spark.implicits._

case class Person(name: String, age: Int)
val data = Seq(Person("John", 28), Person("Alice", 24), Person("Bob", 30))
val ds = spark.createDataset(data)
// Dataset has strong typing
ds.show()
In the above Scala code, we define a Person case class, and the Dataset ensures that each entry conforms to the Person schema.
Why PySpark Uses DataFrames:
PySpark focuses on DataFrames because Python is dynamically typed: compile-time type checks cannot be enforced in Python, so the strongly-typed Dataset API available in Scala and Java is not exposed. DataFrames still give PySpark all the optimizations needed for efficient distributed processing, because every query is planned and optimized by the Catalyst optimizer regardless of the language it was written in.
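One way to see the Catalyst optimizer at work is to ask Spark for the extended query plan. A sketch, reusing the df from the earlier example; the variable name query is just for illustration:

# Build a small query and inspect how Catalyst handles it;
# explain(True) prints the parsed, analyzed, and optimized logical plans
# plus the physical plan chosen for execution
query = df.filter(df.Age > 25).select("Name")
query.explain(True)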
Conclusion:
- In PySpark, you work with DataFrames, which are optimized for distributed data processing and do not require strong typing.
- Datasets are a feature of Spark available in Scala and Java, where strong typing and compile-time type safety are essential.