Browsing Category: Data Eng

How Pandas and PySpark handle adding a column with fewer rows than the existing DataFrame

Pandas: When you add a column with fewer rows in Pandas, the remaining rows will be filled with NaN (Not a Number) to represent the missing data. Pandas DataFrames allow mixed data types and handle missing values using NaN by Read more…
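A minimal sketch of this behavior in pandas (the column names are made up for illustration). Note that it is the index alignment of a Series that produces the NaN fill; assigning a plain list of the wrong length raises a ValueError instead:

```python
import pandas as pd

# Existing DataFrame with four rows
df = pd.DataFrame({"name": ["Alice", "Bob", "Carol", "Dan"]})

# A new column with only two values; assigning a Series aligns on the
# index, so rows without a matching label are filled with NaN
df["score"] = pd.Series([90, 85])

print(df)
#     name  score
# 0  Alice   90.0
# 1    Bob   85.0
# 2  Carol    NaN
# 3    Dan    NaN
```

Because NaN is a float, the column is upcast to float64 even though the original values were integers.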


SQL window function with “ORDER BY” and without “ORDER BY”

Dataset Example: Assume we have the following sales data for different employees across several months:

emp_id  sales  month
1       100    Jan
2       200    Jan
3       300    Jan
4       150    Feb
5       250    Feb
6       350    Mar
7       400    Mar

Now, Read more…
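A quick sketch of the two window behaviors, using pandas equivalents of the SQL (the dataset is taken from the excerpt above; treating pandas `transform("sum")` as `SUM() OVER (PARTITION BY ...)` and grouped `cumsum()` as the same window with `ORDER BY` is an assumption of this illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "emp_id": [1, 2, 3, 4, 5, 6, 7],
    "sales":  [100, 200, 300, 150, 250, 350, 400],
    "month":  ["Jan", "Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
})

# SUM(sales) OVER (PARTITION BY month): without ORDER BY, every row
# in the partition gets the same partition-wide total
df["month_total"] = df.groupby("month")["sales"].transform("sum")

# SUM(sales) OVER (PARTITION BY month ORDER BY emp_id): with ORDER BY,
# the default frame yields a running total that grows row by row
df["running_total"] = df.sort_values("emp_id").groupby("month")["sales"].cumsum()

print(df)
```

For the Jan partition, `month_total` is 600 on all three rows, while `running_total` steps through 100, 300, 600.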


Complex data types such as nested structures, arrays, and maps in CSV format

When dealing with complex data types such as nested structures, arrays, and maps in CSV format, handling them can be more challenging than in Parquet because CSV files are inherently flat and do not support hierarchical or complex data structures Read more…
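One common workaround, sketched below with hypothetical data: serialize the nested parts as JSON strings so they fit into flat CSV cells, then parse them back on read. The field names (`tags`, `address`) are invented for illustration:

```python
import json
import pandas as pd
from io import StringIO

# CSV cells are flat strings, so nested data must be serialized,
# e.g. as JSON text inside a single column
records = [
    {"id": 1, "tags": ["a", "b"], "address": {"city": "Paris", "zip": "75001"}},
    {"id": 2, "tags": ["c"],      "address": {"city": "Lyon",  "zip": "69001"}},
]
df = pd.DataFrame({
    "id": [r["id"] for r in records],
    "tags": [json.dumps(r["tags"]) for r in records],
    "address": [json.dumps(r["address"]) for r in records],
})

buf = StringIO()
df.to_csv(buf, index=False)  # commas inside JSON are handled by CSV quoting

# Reading it back requires parsing the JSON strings by hand,
# unlike Parquet, which preserves the nested types natively
df2 = pd.read_csv(StringIO(buf.getvalue()))
df2["address"] = df2["address"].apply(json.loads)
print(df2.loc[0, "address"]["city"])  # Paris
```

The round trip works, but the schema information is lost: nothing in the CSV itself says that `address` is a struct rather than free text.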


Dataset vs DataFrame in PySpark

In PySpark, DataFrames are the most commonly used data structures, while Datasets are not available in PySpark (they are used in Scala and Java). However, I can explain the difference between DataFrames in PySpark and Datasets in the context of Read more…


Accessing data stored in Amazon S3 through both Amazon Redshift Spectrum and PySpark

1. Accessing Data through Redshift Spectrum Amazon Redshift Spectrum allows you to query data stored in S3 without loading it into Redshift. It uses the AWS Glue Data Catalog (or an external Hive metastore) to manage table metadata, such as Read more…


PySpark (on S3) vs Redshift

When using PySpark to process data stored in Amazon S3 instead of using Amazon Redshift, you will be working in different paradigms, and while many concepts are similar, there are key differences in how queries are handled between the two Read more…


df.drop_duplicates(subset=['name', 'age'], keep='first', inplace=True) vs df['name'] = df['name'].drop_duplicates(keep='first')

1. df['name'] = df['name'].drop_duplicates(keep='first')

Example: If your original DataFrame was:

name     age  gender
Alice    25   F
Bob      30   M
Alice    35   F
Charlie  28   M

The resulting DataFrame would be:

name     age  gender
Alice    25   F
Bob      30   M
NaN Read more…
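A sketch of both calls side by side, using the sample data from the excerpt:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Charlie"],
    "age": [25, 30, 35, 28],
    "gender": ["F", "M", "F", "M"],
})

# Column-level: drops the duplicate value from the 'name' Series only;
# index alignment leaves NaN where the duplicate was, while the other
# columns keep their values for that row
df1 = df.copy()
df1["name"] = df1["name"].drop_duplicates(keep="first")
# row 2 now has name=NaN but still age=35, gender='F'

# Row-level: removes entire rows, but only when they match on the full
# subset; here Alice/25 and Alice/35 differ in age, so nothing is dropped
df2 = df.copy()
df2.drop_duplicates(subset=["name", "age"], keep="first", inplace=True)
```

The contrast is the point: the Series version silently introduces missing values in one column, while the DataFrame version preserves row integrity and only removes rows that duplicate the whole subset.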


Basic transformations in Data Engineering (Python, SQL, PySpark)

1. Replace or Remove Special Characters in Text Fields Pandas: PySpark SQL: 2. Standardize Values in Columns Standardize department names (e.g., change Sales to SALES): Pandas PySpark SQL 3. Fill Missing Numerical Data Pandas 4. Convert Date Strings into a Consistent Read more…
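The four transformations can be sketched in pandas as follows (the column names and sample values are invented for illustration; the PySpark and SQL variants follow the same logic with `regexp_replace`, `upper`, `coalesce`, and `to_date`):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Al!ce", "B@b", None],
    "department": ["Sales", "sales", "HR"],
    "salary": [50000.0, None, 60000.0],
    "hired": ["2021-01-15", "2021-02-15", "2021-03-01"],
})

# 1. Remove special characters from a text field via a regex
df["name"] = df["name"].str.replace(r"[^A-Za-z0-9 ]", "", regex=True)

# 2. Standardize values in a column (Sales -> SALES)
df["department"] = df["department"].str.upper()

# 3. Fill missing numerical data, e.g. with the column mean
df["salary"] = df["salary"].fillna(df["salary"].mean())

# 4. Convert date strings into a proper datetime dtype
df["hired"] = pd.to_datetime(df["hired"])
```

Each step is vectorized, so the same pattern scales to full columns without explicit loops.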


if data is not None: vs if data:

In Python, both if data is not None: and if data: are common ways to check if a variable has a value, but they behave slightly differently in terms of what they check for. Here’s the difference: 1. if data Read more…
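A small sketch that makes the difference visible (the helper function is made up for this example):

```python
def check(data):
    """Report which of the two checks a value passes."""
    results = []
    if data is not None:      # identity check: only None fails
        results.append("not None")
    if data:                  # truthiness check: None, 0, '', [], {} all fail
        results.append("truthy")
    return results

# An empty list is not None, but it is falsy
print(check([]))        # ['not None']
print(check([1, 2]))    # ['not None', 'truthy']
print(check(None))      # []
print(check(0))         # ['not None']
```

The practical consequence: use `is not None` when an empty container or zero is a legitimate value you want to process, and the bare truthiness check only when "empty" and "missing" should be treated the same way.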


In a Pandas DataFrame, None, NaN, and NULL

In a Pandas DataFrame, None, NaN, and NULL are often used to represent missing values, but they are not exactly the same. Here’s a breakdown: Practical Differences in Pandas:
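A short sketch of how pandas treats these values in practice (NULL itself is a SQL concept with no Python object; pandas represents it as NaN or None when data is read in):

```python
import numpy as np
import pandas as pd

# In a numeric column, pandas stores None as NaN (a float)
s = pd.Series([1.0, None, np.nan])
print(s)
# 0    1.0
# 1    NaN
# 2    NaN
# dtype: float64

# isna() treats both None and NaN as missing
print(s.isna().tolist())  # [False, True, True]

# In an object column, None survives as the Python object None
obj = pd.Series([None, "x"], dtype=object)
print(obj[0] is None)  # True
```

So the distinction mostly matters at the dtype level: numeric columns coerce None to NaN (forcing a float dtype), while object columns can hold the two side by side, and `isna()`/`fillna()` handle both uniformly.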