drop_duplicates() vs unique()

In pandas, there is no df.unique() method that applies to an entire DataFrame. Instead, .unique() is a Series method, and each column of a DataFrame is a Series. Calling it on a column returns that column's distinct values in order of first appearance.
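
A quick check makes the distinction concrete. This is only a sketch: the frame `demo` and its single column are made up for illustration and are not part of the example that follows.

```python
import pandas as pd

# Small illustrative frame; the name `demo` is hypothetical.
demo = pd.DataFrame({"name": ["Alice", "Bob", "Alice"]})

# A DataFrame exposes no unique() method...
frame_has_unique = hasattr(demo, "unique")

# ...but each column is a Series, which does.
series_has_unique = hasattr(demo["name"], "unique")

print(frame_has_unique, series_has_unique)  # False True
```

Calling demo.unique() directly would therefore raise an AttributeError.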

Using Series.unique() in pandas

If you want to find unique values from a specific column in a pandas DataFrame, you would call .unique() on that column. Here’s how you can do it:

Example DataFrame:

import pandas as pd

data = {
'id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
'department': ['HR', 'HR', 'HR', 'IT', 'HR'],
'salary': [50000, 60000, 52000, 55000, 61000]
}

df = pd.DataFrame(data)

Finding Unique Names:

# .unique() returns an array; convert it to a list for printing
uniq_name = df['name'].unique()
print(list(uniq_name))

Output:

['Alice', 'Bob', 'Charlie']
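
Two details are easy to miss here: .unique() returns an array rather than a Python list (a NumPy array for ordinary object columns), and the values come back in order of first appearance, not sorted. The sketch below rebuilds the 'name' column as a standalone Series so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Same values as the 'name' column in the example DataFrame.
names = pd.Series(['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'], name='name')

uniq = names.unique()        # array of distinct values, in order of first appearance
uniq_list = list(uniq)       # plain Python list, if that is more convenient
n_unique = names.nunique()   # number of distinct values

print(uniq_list)  # ['Alice', 'Bob', 'Charlie']
print(n_unique)   # 3
```

If you only need the count, .nunique() avoids building the array of values yourself.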

When You Need Unique Rows

If your goal is to get unique rows based on all or a subset of columns, you would use the drop_duplicates() method, as previously explained. This method allows you to specify which columns to consider when identifying duplicates and whether to keep the first or last occurrence of each duplicate row.

Here’s a quick reminder of how to use drop_duplicates():

Drop Duplicates Based on Multiple Columns:

unique_df = df.drop_duplicates(subset=['name', 'department'], keep='first')
print(unique_df)

Output:

   id     name department  salary
0   1    Alice         HR   50000
1   2      Bob         HR   60000
3   4  Charlie         IT   55000
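
For comparison, here is a sketch of the other keep options on the same data. The DataFrame is rebuilt so the block runs on its own, and the variable names (cols, last_rows, no_repeats, all_columns) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'department': ['HR', 'HR', 'HR', 'IT', 'HR'],
    'salary': [50000, 60000, 52000, 55000, 61000],
})

cols = ['name', 'department']

# keep='last' retains the final occurrence of each (name, department) pair.
last_rows = df.drop_duplicates(subset=cols, keep='last')

# keep=False drops every row whose pair appears more than once.
no_repeats = df.drop_duplicates(subset=cols, keep=False)

# With no subset, all columns are compared; every row here differs in id and salary.
all_columns = df.drop_duplicates()

print(last_rows['id'].tolist())   # [3, 4, 5]
print(no_repeats['id'].tolist())  # [4]
print(len(all_columns))           # 5
```

Note that the original index labels are preserved in each result; call .reset_index(drop=True) if you want them renumbered.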
