Basic transformations in Data Engineering (Python, SQL, PySpark)

1.Replace or Remove Special Characters in Text Fields

Pandas:

# Remove special characters from 'name'
df['name'] = df['name'].str.replace(r'[^\w\s]', '', regex=True)
print(df)

PySpark

from pyspark.sql.functions import regexp_replace

# Remove special characters from 'name'
df = df.withColumn("name", regexp_replace(df["name"], r'[^\w\s]', ''))
df.show()

SQL:

SELECT id, 
       REGEXP_REPLACE(name, '[^a-zA-Z0-9\s]', '', 'g') AS name
FROM employees;

2.Standardize Values in Columns

Standardize department names (e.g., change Sales to SALES):

Panda

# Standardize 'department' values
df['department'] = df['department'].str.upper()
print(df)

PySpark

df = df.withColumn("department", upper(df["department"]))
df.show()

SQL

UPDATE employees
SET department = UPPER(department)
WHERE department = 'Sales';

3.To fill Missing Numerical Data

Pandas

median_salary = df['salary'].median()
df['salary'].fillna (median_salary)

4. To convert Date Strings into a Consistent Date Format

Convert join_date into YYYY-MM-DD format:

Panda

# Convert 'join_date' to consistent format
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')
print(df)

The errors='coerce' argument in pd.to_datetime() is used to handle invalid date formats or entries that cannot be converted to a valid datetime object. Here’s how it works:

Behavior:

  • errors='coerce': Any invalid or unparseable date in the join_date column will be replaced with NaT (Not a Time). This is useful when you want to ensure that all entries in the column are either valid dates or explicitly marked as missing, rather than causing errors or retaining invalid data.

Example:

import pandas as pd

# Sample data with an invalid date entry
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'join_date': ['2023-01-12', 'invalid_date', '2022-05-03']
}

# Create DataFrame
df = pd.DataFrame(data)

# Convert 'join_date' to datetime with errors='coerce'
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')

print(df)

In this case, the invalid date invalid_date is replaced with NaT, while valid date strings are converted to proper datetime objects.

Without errors='coerce':

If errors='coerce' were not used, it would raise an error when encountering invalid date formats, halting the program.

This makes it useful for dealing with dirty or inconsistent data where some entries might not be in a valid date format.

Pyspark

from pyspark.sql.functions import to_date

# Convert 'join_date' to consistent format
df = df.withColumn("join_date", to_date(df["join_date"], "yyyy-MM-dd"))
df.show()

5. To remove Extra Spaces from Text Fields

Pandas

#remove extra space from the name 
df['name'] = df['name'].str.strip()

PySpark

from Pyspark.sql.functions import trim

#remove extra spaces
df['name'] = df.withColumn ('name', trim(df['name']))

SQL:

select trim(name) as name from employees;

6. To convert text to upper case to ensure consistency

Panda

df['role'] - df['role'].str.upper()

PySpark

from pyspark.sql.functions import upper
df['role] = df.withColumn('role', upper(df['role']))

SQL

SELECT id, name, age, gender, salary, join_date,
       UPPER(role) AS role
FROM employees;

7.Identifying and Deleting Duplicated Rows Based on Key Columns

Pandas

# Remove duplicates based on 'name' and 'age'
df.drop_duplicates(subset=['name', 'age'], keep='first', inplace=True)
print(df)

PySpark

from pyspark.sql SparkSession

#create a spark session
spark = SparkSession.builder.appName('example').getOrCreate()

# example data frame creation
data = [("Alice", 25), ("Mohan", 12),("Welden", 33),("Anne", 32)]
column = ['name', 'age']
# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Remove duplicates based on 'name' and 'age'
df = df.dropDuplicates(['name', 'age'])
df.show()


SQL:

WITH CTE AS (
    SELECT
        id,
        column1,
        column2,
        ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS row_num
    FROM sales
)
DELETE FROM sales
WHERE id IN (
    SELECT id
    FROM CTE
    WHERE row_num > 1
);

Leave a Reply

Your email address will not be published. Required fields are marked *

Deprecated: htmlspecialchars(): Passing null to parameter #1 ($string) of type string is deprecated in /var/www/html/wp-includes/formatting.php on line 4720