Basic transformations in Data Engineering (Python, SQL, PySpark)
2024

1. Replace or Remove Special Characters in Text Fields
Pandas:
# Remove special characters from 'name'
df['name'] = df['name'].str.replace(r'[^\w\s]', '', regex=True)
print(df)
PySpark:
from pyspark.sql.functions import regexp_replace
# Remove special characters from 'name'
df = df.withColumn("name", regexp_replace(df["name"], r'[^\w\s]', ''))
df.show()
SQL:
SELECT id,
REGEXP_REPLACE(name, '[^a-zA-Z0-9\s]', '', 'g') AS name
FROM employees;
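One subtlety worth noting: in regex, \w matches letters, digits, and the underscore, so the pattern [^\w\s] strips punctuation but keeps underscores. A minimal sketch with made-up sample strings:

```python
import pandas as pd

# Hypothetical sample names containing punctuation and an underscore
df = pd.DataFrame({'name': ["O'Brien!", "Ana-Maria", "user_42"]})

# [^\w\s] matches anything that is NOT a word character or whitespace;
# \w covers letters, digits, and underscore, so '_' survives the cleanup
df['name'] = df['name'].str.replace(r'[^\w\s]', '', regex=True)
print(df['name'].tolist())
```

If underscores should also be removed, the character class would need to list them explicitly, e.g. [^a-zA-Z0-9\s].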
2. Standardize Values in Columns
Standardize department names (e.g., change Sales to SALES):
Pandas:
# Standardize 'department' values
df['department'] = df['department'].str.upper()
print(df)
PySpark:
from pyspark.sql.functions import upper
df = df.withColumn("department", upper(df["department"]))
df.show()
SQL:
UPDATE employees
SET department = UPPER(department)
WHERE department = 'Sales';
3. Fill Missing Numerical Data
Pandas:
# Fill missing salaries with the median
median_salary = df['salary'].median()
df['salary'] = df['salary'].fillna(median_salary)
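As a quick check, a minimal runnable sketch (the salary figures are made up) shows how the median fill behaves, and why the result must be assigned back:

```python
import pandas as pd

# Hypothetical sample data with one missing salary
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'salary': [50000.0, None, 70000.0, 60000.0]
})

# median() skips NaN values by default
median_salary = df['salary'].median()

# fillna() returns a new Series; assign it back to persist the change
df['salary'] = df['salary'].fillna(median_salary)
print(df)
```

The median is often preferred over the mean here because it is robust to salary outliers.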
4. Convert Date Strings into a Consistent Date Format
Convert join_date into YYYY-MM-DD format:
Pandas:
# Convert 'join_date' to consistent format
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')
print(df)
The errors='coerce' argument in pd.to_datetime() handles invalid date formats or entries that cannot be converted to a valid datetime object. Here's how it works:
Behavior:
errors='coerce': Any invalid or unparseable date in the join_date column will be replaced with NaT (Not a Time). This is useful when you want to ensure that all entries in the column are either valid dates or explicitly marked as missing, rather than causing errors or retaining invalid data.
Example:
import pandas as pd
# Sample data with an invalid date entry
data = {
'name': ['Alice', 'Bob', 'Charlie'],
'join_date': ['2023-01-12', 'invalid_date', '2022-05-03']
}
# Create DataFrame
df = pd.DataFrame(data)
# Convert 'join_date' to datetime with errors='coerce'
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')
print(df)

In this case, the invalid entry invalid_date is replaced with NaT, while valid date strings are converted to proper datetime objects.
Without errors='coerce':
If errors='coerce' were not used, pd.to_datetime() would raise an error when encountering an invalid date format, halting the program. This makes errors='coerce' useful for dealing with dirty or inconsistent data where some entries might not be in a valid date format.
PySpark:
from pyspark.sql.functions import to_date
# Convert 'join_date' to consistent format
df = df.withColumn("join_date", to_date(df["join_date"], "yyyy-MM-dd"))
df.show()
5. Remove Extra Spaces from Text Fields
Pandas:
# Remove leading and trailing spaces from 'name'
df['name'] = df['name'].str.strip()
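Note that str.strip() only trims leading and trailing whitespace; repeated spaces inside the text survive. Collapsing those as well takes a regex replace, as in this sketch with made-up sample names:

```python
import pandas as pd

# Hypothetical names with padding and repeated internal spaces
df = pd.DataFrame({'name': ['  Alice  ', 'Bob   Smith']})

# str.strip() only trims the ends of each string
df['name'] = df['name'].str.strip()

# Collapse any remaining run of whitespace to a single space
df['name'] = df['name'].str.replace(r'\s+', ' ', regex=True)
print(df['name'].tolist())
```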
PySpark:
from pyspark.sql.functions import trim
# Remove leading and trailing spaces
df = df.withColumn('name', trim(df['name']))
SQL:
SELECT TRIM(name) AS name FROM employees;
6. Convert Text to Uppercase to Ensure Consistency
Pandas:
df['role'] = df['role'].str.upper()
PySpark:
from pyspark.sql.functions import upper
df = df.withColumn('role', upper(df['role']))
SQL:
SELECT id, name, age, gender, salary, join_date,
UPPER(role) AS role
FROM employees;
7. Identifying and Deleting Duplicate Rows Based on Key Columns
Pandas:
# Remove duplicates based on 'name' and 'age'
df.drop_duplicates(subset=['name', 'age'], keep='first', inplace=True)
print(df)
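Since the heading also covers identifying duplicates, a small sketch (with made-up rows) shows how duplicated() flags repeats before drop_duplicates() removes them:

```python
import pandas as pd

# Hypothetical data with one repeated (name, age) pair
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice'],
    'age': [25, 30, 25],
    'city': ['NY', 'LA', 'SF']
})

# Flag rows whose (name, age) already appeared earlier in the frame
dupes = df.duplicated(subset=['name', 'age'], keep='first')
print(df[dupes])

# keep='first' retains the first occurrence and drops the rest
df = df.drop_duplicates(subset=['name', 'age'], keep='first')
print(df)
```

Inspecting df[dupes] before deleting is a cheap sanity check that the key columns really identify the rows you expect to lose.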
PySpark:
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName('example').getOrCreate()
# Example DataFrame creation
data = [("Alice", 25), ("Mohan", 12), ("Welden", 33), ("Anne", 32)]
columns = ['name', 'age']
# Create a DataFrame
df = spark.createDataFrame(data, columns)
# Remove duplicates based on 'name' and 'age'
df = df.dropDuplicates(['name', 'age'])
df.show()
SQL:
WITH CTE AS (
SELECT
id,
column1,
column2,
ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS row_num
FROM sales
)
DELETE FROM sales
WHERE id IN (
SELECT id
FROM CTE
WHERE row_num > 1
);