drop_duplicates() vs unique()
2024
In pandas, there is no df.unique() method that applies to an entire DataFrame. Instead, .unique() is a Series method, and a Series is essentially a single column of a DataFrame. It returns the distinct values found in that one column.
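As a quick check of the point above, the following sketch uses a small made-up DataFrame (the data and variable names are illustrative, not from a real dataset) to confirm that a column has .unique() while the DataFrame itself does not:

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({"name": ["Alice", "Bob", "Alice"]})

# A Series (single column) exposes .unique(); the DataFrame does not.
has_series_unique = hasattr(df["name"], "unique")
has_frame_unique = hasattr(df, "unique")

# Calling it on the DataFrame raises AttributeError.
try:
    df.unique()
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

print(has_series_unique, has_frame_unique, error_name)
```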
Using Series.unique() in pandas
To find the unique values in a specific column of a pandas DataFrame, call .unique() on that column. Here’s how you can do it:
Example DataFrame:
import pandas as pd
data = {
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'department': ['HR', 'HR', 'HR', 'IT', 'HR'],
    'salary': [50000, 60000, 52000, 55000, 61000]
}
df = pd.DataFrame(data)
Finding Unique Names:
uniq_name = df['name'].unique()
# .unique() returns an array; convert it to a list for display
print(list(uniq_name))
Output:
['Alice', 'Bob', 'Charlie']
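Two details of .unique() are easy to miss: the values come back in order of first appearance rather than sorted, and the result is an array (a NumPy array or a pandas array, depending on the column's dtype) rather than a Python list. A minimal sketch, using a made-up Series whose values are deliberately not in alphabetical order:

```python
import pandas as pd

# Illustrative data, deliberately unsorted
names = pd.Series(["Bob", "Alice", "Bob", "Charlie", "Alice"], name="name")

# Order of first appearance is preserved; nothing is sorted.
uniq_list = list(names.unique())

# Sort explicitly if an alphabetical list is wanted.
sorted_list = sorted(uniq_list)

# .nunique() gives just the number of distinct values.
n_unique = names.nunique()

print(uniq_list)
print(sorted_list)
print(n_unique)
```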
When You Need Unique Rows
If your goal is to get unique rows based on all or a subset of columns, use the drop_duplicates() method, as previously explained. Its subset parameter specifies which columns to consider when identifying duplicates, and its keep parameter controls whether the first or last occurrence of each duplicate row is retained.
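To make the effect of keep concrete, here is a small sketch that compares the ids retained under each setting. It rebuilds a trimmed copy of the example data so it can run on its own. Besides 'first' and 'last', keep also accepts False, which drops every row that has a duplicate.

```python
import pandas as pd

# Trimmed copy of the example data (salary omitted for brevity)
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "department": ["HR", "HR", "HR", "IT", "HR"],
})
cols = ["name", "department"]

# Keep the first occurrence of each (name, department) pair
first_ids = df.drop_duplicates(subset=cols, keep="first")["id"].tolist()

# Keep the last occurrence instead
last_ids = df.drop_duplicates(subset=cols, keep="last")["id"].tolist()

# keep=False removes every row that is part of a duplicate group
none_ids = df.drop_duplicates(subset=cols, keep=False)["id"].tolist()

print(first_ids, last_ids, none_ids)
```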
Here’s a quick reminder of how to use drop_duplicates():
Drop Duplicates Based on Multiple Columns:
unique_df = df.drop_duplicates(subset=['name', 'department'], keep='first')
print(unique_df)
Output:
   id     name department  salary
0   1    Alice         HR   50000
1   2      Bob         HR   60000
3   4  Charlie         IT   55000
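Note that the result keeps the original row labels (0, 1, 3), so the index has a gap where a duplicate was removed. If a clean 0-based index is preferred, one option is reset_index(drop=True). A minimal sketch, again rebuilding the example data so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "department": ["HR", "HR", "HR", "IT", "HR"],
    "salary": [50000, 60000, 52000, 55000, 61000],
})

unique_df = df.drop_duplicates(subset=["name", "department"], keep="first")

# The surviving rows keep their original labels.
kept_index = unique_df.index.tolist()

# drop=True discards the old labels instead of adding them as a column.
renumbered = unique_df.reset_index(drop=True)
new_index = renumbered.index.tolist()

print(kept_index, new_index)
```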