
Month: September 2024

Dataset vs dataframe in PySpark

In PySpark, DataFrames are the most commonly used data structures, while Datasets are not available in PySpark (they are used in Scala and Java). However, I can explain the difference between DataFrames in PySpark and Datasets in the context of Read more…
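As a quick illustration of that point (a minimal sketch, not taken from the post; the app name and sample rows are made up), PySpark only exposes the untyped DataFrame API, which corresponds to Scala's Dataset[Row]:

from pyspark.sql import SparkSession

# Local Spark session for the demo (app name is arbitrary)
spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# PySpark always gives you a DataFrame (a Dataset[Row] on the JVM side);
# there is no Python equivalent of Scala's strongly typed Dataset[Person].
df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])
df.printSchema()
df.show()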


Accessing data stored in Amazon S3 through both Amazon Redshift Spectrum and PySpark

1. Accessing Data through Redshift Spectrum Amazon Redshift Spectrum allows you to query data stored in S3 without loading it into Redshift. It uses the AWS Glue Data Catalog (or an external Hive metastore) to manage table metadata, such as Read more…
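For a sense of what this looks like in practice, here is a rough sketch using psycopg2; the cluster endpoint, credentials, IAM role, Glue database name and the spectrum_schema.sales table are all placeholders, not details from the post.

import psycopg2  # any Redshift-compatible Postgres driver works

# All connection details and object names below are hypothetical
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",
)
conn.autocommit = True  # run external-schema DDL outside a transaction
cur = conn.cursor()

# Map a Glue Data Catalog database to an external schema in Redshift
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_schema
    FROM DATA CATALOG
    DATABASE 'my_glue_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
""")

# Query files that still live in S3, without loading them into Redshift
cur.execute("SELECT COUNT(*) FROM spectrum_schema.sales;")
print(cur.fetchone())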


PySpark (on S3) vs Redshift:

When using PySpark to process data stored in Amazon S3 instead of using Amazon Redshift, you will be working in different paradigms, and while many concepts are similar, there are key differences in how queries are handled between the two Read more…
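A minimal sketch of the PySpark side of that comparison (the bucket path, column names and app name are assumptions); the trailing comment shows the SQL you would run inside Redshift instead.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("s3-vs-redshift").getOrCreate()

# PySpark: read the raw files straight from S3 and aggregate on the Spark cluster
orders = spark.read.parquet("s3a://my-bucket/orders/")
daily = orders.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))
daily.show()

# The equivalent work in Redshift is a single SQL statement run inside the
# warehouse, e.g.:
#   SELECT order_date, SUM(amount) AS total_amount
#   FROM orders
#   GROUP BY order_date;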


df.drop_duplicates(subset=['name', 'age'], keep='first', inplace=True) vs df['name'] = df['name'].drop_duplicates(keep='first')

1. df['name'] = df['name'].drop_duplicates(keep='first') Example: if your original DataFrame was: name age gender | Alice 25 F | Bob 30 M | Alice 35 F | Charlie 28 M, the resulting DataFrame would be: name age gender | Alice 25 F | Bob 30 M | NaN Read more…
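The two calls can be reproduced with a short, runnable sketch built on the example table above:

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Charlie"],
    "age": [25, 30, 35, 28],
    "gender": ["F", "M", "F", "M"],
})

# Option 1: drop whole rows that duplicate the ('name', 'age') combination,
# keeping the first occurrence and modifying the frame in place
df1 = df.copy()
df1.drop_duplicates(subset=["name", "age"], keep="first", inplace=True)

# Option 2: deduplicate only the 'name' column; rows keep their index, so the
# second 'Alice' becomes NaN instead of the row being removed
df2 = df.copy()
df2["name"] = df2["name"].drop_duplicates(keep="first")

print(df1)
print(df2)

The second form only blanks out the duplicate name while the rest of the row survives, which is why the result above ends in NaN rather than losing the row.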


Basic transformations in Data Engineering (Python, SQL, PySpark)

1. Replace or Remove Special Characters in Text Fields (Pandas, PySpark, SQL) 2. Standardize Values in Columns, e.g., change Sales to SALES (Pandas, PySpark, SQL) 3. Fill Missing Numerical Data (Pandas) 4. Convert Date Strings into a Consistent Read more…
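As a hedged sketch of the first two transformations (the column names and sample values are invented for the demo), the Pandas and PySpark versions look like this:

import pandas as pd
from pyspark.sql import SparkSession, functions as F

raw = {"name": ["Al!ce", "B@b"], "department": ["Sales", "hr"]}

# Pandas: strip special characters, then upper-case department names
pdf = pd.DataFrame(raw)
pdf["name"] = pdf["name"].str.replace(r"[^A-Za-z0-9 ]", "", regex=True)
pdf["department"] = pdf["department"].str.upper()

# PySpark: the same two steps with regexp_replace and upper
spark = SparkSession.builder.appName("basic-transforms").getOrCreate()
sdf = spark.createDataFrame(pd.DataFrame(raw))
sdf = (sdf
       .withColumn("name", F.regexp_replace("name", r"[^A-Za-z0-9 ]", ""))
       .withColumn("department", F.upper("department")))
sdf.show()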


if data is not None: vs if data:

In Python, both if data is not None: and if data: are common ways to check if a variable has a value, but they behave slightly differently in terms of what they check for. Here’s the difference: 1. if data Read more…
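A small, self-contained sketch makes the difference visible for None, empty containers, 0 and strings:

samples = [None, [], 0, "", [1, 2], "text"]

for data in samples:
    # 'is not None' only rules out None; empty containers, 0 and "" still pass
    explicit = data is not None
    # plain truthiness also treats [], 0 and "" as False
    truthy = bool(data)
    print(repr(data), "-> is not None:", explicit, "| if data:", truthy)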


In a Pandas DataFrame, None, NaN, and NULL

In a Pandas DataFrame, None, NaN, and NULL are often used to represent missing values, but they are not exactly the same. Here’s a breakdown: Practical Differences in Pandas:
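A short sketch (the single value column is made up) shows how Pandas folds None into NaN inside a float column, and how the two differ in plain Python:

import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1.0, None, np.nan]})

# In a float column Pandas stores Python None as NaN, so rows 1 and 2 are both
# missing, and isna() treats them the same way
print(df)
print(df["value"].isna())

# None and NaN are still different objects in plain Python
print(None == None)      # True
print(np.nan == np.nan)  # False: NaN never equals itself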


Reading Large Datasets in Chunks with Pandas and PySpark using CSV

Reading large CSV files from S3 can be challenging, especially when dealing with 3 GB or more of data. Using Pandas to process data in chunks is a very effective method for handling large files without overwhelming your system’s memory. Here’s how to manage the process step-by-step: 1. Read more…
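A minimal sketch of the chunked approach; the S3 path, the amount column and the 100,000-row chunk size are assumptions, and reading s3:// paths directly requires the s3fs package:

import pandas as pd

# Hypothetical file location
path = "s3://my-bucket/large-file.csv"

total_rows = 0
# Stream the CSV in 100,000-row chunks instead of loading ~3 GB at once
for chunk in pd.read_csv(path, chunksize=100_000):
    # Example per-chunk work: filter and count rows
    filtered = chunk[chunk["amount"] > 0]
    total_rows += len(filtered)

print("rows kept:", total_rows)

With PySpark the same file can simply be read with spark.read.csv, since Spark splits the input across executors instead of pulling it all into a single process.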


Data engineering challenges

When data engineers work with an ETL pipeline involving Python, AWS services like S3 (data lake), AWS Glue (Data Catalog), Athena (for querying), and Amazon Redshift (final data storage), they face several common challenges. Below are these challenges and their Read more…


DataFrames and Spark SQL

In PySpark, you have two primary abstractions for working with data: DataFrames and Spark SQL. Both can be used to perform similar operations like filtering, joining, aggregating, etc., but there are some differences in use cases, readability, performance, and ease Read more…
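A compact sketch of the two styles side by side (the people view and its sample rows are invented for the demo):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("df-vs-sql").getOrCreate()
people = spark.createDataFrame(
    [("Alice", 25), ("Bob", 30), ("Cara", 41)],
    ["name", "age"],
)

# DataFrame API: method chaining
adults_df = people.filter(F.col("age") >= 30).select("name", "age")

# Spark SQL: register a temp view and express the same query as SQL
people.createOrReplaceTempView("people")
adults_sql = spark.sql("SELECT name, age FROM people WHERE age >= 30")

# Both routes go through the same optimizer and return the same rows
adults_df.show()
adults_sql.show()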