df.drop_duplicates(subset=['name', 'age'], keep='first', inplace=True) vs df['name'] = df['name'].drop_duplicates(keep='first')
2024
1. df['name'] = df['name'].drop_duplicates(keep='first')
- This operation is applied only to the `name` column.
- It removes duplicate values from the `name` Series and assigns the result back to `df['name']`, leaving all other columns unchanged.
- No rows are removed. The deduplicated Series is missing the index labels of the dropped duplicates, and because assignment aligns on the index, those positions are filled with `NaN`.
Example:
df['name'] = df['name'].drop_duplicates(keep='first')
If your original DataFrame was:
name | age | gender |
---|---|---|
Alice | 25 | F |
Bob | 30 | M |
Alice | 35 | F |
Charlie | 28 | M |
The resulting DataFrame would be:
name | age | gender |
---|---|---|
Alice | 25 | F |
Bob | 30 | M |
NaN | 35 | F |
Charlie | 28 | M |
Here, `Alice` is duplicated, so only the first instance is kept. The second `Alice` is dropped from the Series, which leaves `NaN` in that row's `name` column.
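The behaviour above can be reproduced with a short sketch. It assumes the four-row example table from this section; the variable name `deduped` is only illustrative.

```python
import pandas as pd

# Example data from the table above
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Charlie"],
    "age": [25, 30, 35, 28],
    "gender": ["F", "M", "F", "M"],
})

# drop_duplicates on a Series returns a shorter Series;
# the index label of the dropped duplicate (2) is simply absent
deduped = df["name"].drop_duplicates(keep="first")
print(deduped.index.tolist())

# Assigning back aligns on the index, so row 2 gets a missing value
df["name"] = deduped
print(df)
```

The DataFrame still has four rows afterwards; only the `name` value in the third row has been replaced with a missing value.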
2. df.drop_duplicates(subset=['name', 'age'], keep='first', inplace=True)
This removes duplicate rows based on the combination of the `name` and `age` columns. It keeps only the first occurrence of each unique `name` and `age` combination and removes the entire row for every later duplicate. Because `inplace=True` is set, `df` is modified directly and the call returns `None`.
df.drop_duplicates(subset=['name', 'age'], keep='first', inplace=True)
If the original DataFrame was:
name | age | gender |
---|---|---|
Alice | 25 | F |
Alice | 25 | F |
Bob | 30 | M |
Charlie | 28 | M |
Alice | 35 | F |
The resulting DataFrame would be:
name | age | gender |
---|---|---|
Alice | 25 | F |
Bob | 30 | M |
Charlie | 28 | M |
Alice | 35 | F |
Here, only rows with the same `name` and `age` combination (in this case, the first two rows with `Alice` and `25`) are considered duplicates, and only the first occurrence is kept. The last row is kept because its `age` is `35`, so its combination is unique.
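A minimal sketch of the same example follows. It uses the non-inplace form so the result can be compared with the original; the `reset_index` step is an optional addition, not part of the original call.

```python
import pandas as pd

# Example data from the table above
df = pd.DataFrame({
    "name": ["Alice", "Alice", "Bob", "Charlie", "Alice"],
    "age": [25, 25, 30, 28, 35],
    "gender": ["F", "F", "M", "M", "F"],
})

# Without inplace=True, a new DataFrame is returned and df is untouched
result = df.drop_duplicates(subset=["name", "age"], keep="first")

# The surviving rows keep their original index labels
kept_labels = result.index.tolist()
print(kept_labels)

# Optional: renumber the rows after dropping
result = result.reset_index(drop=True)
print(result)
```

Whole rows are removed here, so no `NaN` values appear and every remaining row keeps its own `age` and `gender`.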
Key Differences:
- The first operation removes duplicates only within the `name` column. Every row stays in place, and the duplicate positions are set to `NaN` when the result is assigned back.
- The second operation removes entire rows whose `name` and `age` values repeat an earlier row, so each remaining row stays intact.
In most cases, you would prefer using `df.drop_duplicates()` for handling duplicates in a DataFrame to ensure consistency across rows.
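If the goal behind the first operation was actually to keep one row per name, a row-level call with a narrower `subset` is one way to get there. The sketch below assumes the four-row table from the first example; the names `by_name` and `by_name_age` are illustrative.

```python
import pandas as pd

# Four-row table from the first example
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Charlie"],
    "age": [25, 30, 35, 28],
    "gender": ["F", "M", "F", "M"],
})

# One row per name: the second Alice row is removed entirely
by_name = df.drop_duplicates(subset=["name"], keep="first")
print(by_name)

# With both columns, (Alice, 25) and (Alice, 35) differ, so nothing is dropped
by_name_age = df.drop_duplicates(subset=["name", "age"], keep="first")
print(len(by_name_age))
```

The choice of `subset` decides what counts as a duplicate, while the rows that remain are always complete.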