I've seen a couple questions like this but not a satisfactory answer for my situation. Here is a sample DataFrame:
+------+-----+----+
| id|value|type|
+------+-----+----+
|283924| 1.5| 0|
|283924| 1.5| 1|
|982384| 3.0| 0|
|982384| 3.0| 1|
|892383| 2.0| 0|
|892383| 2.5| 1|
+------+-----+----+
I want to identify duplicates by just the "id" and "value" columns, and then remove all instances.
In this case:
- Rows 1 and 2 are duplicates (again we are ignoring the "type" column)
- Rows 3 and 4 are duplicates, and therefore only rows 5 and 6 should remain.
The output would be:
+------+-----+----+
| id|value|type|
+------+-----+----+
|892383| 2.5| 1|
|892383| 2.0| 0|
+------+-----+----+
I've tried:

df.dropDuplicates(subset=['id', 'value'], keep=False)

But the "keep" argument isn't available in PySpark's dropDuplicates (as it is in pandas.DataFrame.drop_duplicates).

How else could I do this?
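For reference, one possible workaround (a minimal sketch, not necessarily the idiomatic answer; it assumes a standard SparkSession and the column names from the example above) is to count the rows in each ("id", "value") group with a window function and keep only the groups of size one, which mimics pandas' keep=False behavior:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data matching the DataFrame above
df = spark.createDataFrame(
    [(283924, 1.5, 0), (283924, 1.5, 1),
     (982384, 3.0, 0), (982384, 3.0, 1),
     (892383, 2.0, 0), (892383, 2.5, 1)],
    ["id", "value", "type"],
)

# Count how many rows share each (id, value) pair, then keep only
# the pairs that occur exactly once (i.e. drop every duplicated pair).
w = Window.partitionBy("id", "value")
deduped = (df.withColumn("cnt", F.count("*").over(w))
             .filter(F.col("cnt") == 1)
             .drop("cnt"))

deduped.show()

With the sample data, this would keep only rows 5 and 6, matching the expected output above.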