This can be formulated as a pairwise Hamming distance calculation between all records, separating out subsequent pairs below some threshold. Fortunately, numpy/scipy/sklearn have already done the heavy lifting. I've included two functions that produce identical output: one that is fully vectorized (but consumes O(N^2) memory) and another that consumes O(N) memory but is only vectorized along a single dimension. At your scale, you almost certainly don't want the fully vectorized version, since it will likely give an OOM error. In both cases, the basic algorithm is as follows:
- encode each feature value as an integer value (thanks sklearn!)
- for all row pairs, calculate the Hamming distance (sum of different values)
- if two rows are found at `threshold` or below in Hamming distance, discard the latter, until no remaining row is within that threshold of an earlier one (a quick sketch of these steps follows)
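To make the encoding and distance steps concrete, here is a tiny sketch of my own (two toy records, same idea as the full functions below):

```
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

rows = np.array([["john", "doe", "m", "23"],
                 ["john", "dupe", "m", "23"]])
X = OrdinalEncoder().fit_transform(rows)  # each column becomes integer codes
hamming = np.sum(X[0] != X[1])            # count of fields that differ
print(hamming)                            # 1 -> a duplicate at threshold=1
```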
Full code for both versions:
```
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
from scipy.spatial.distance import pdist, squareform
import numpy as np


def dedupe_fully_vectorized(df, threshold=1):
    """
    fully vectorized memory hog version - best not to use for n > 10k
    """
    # convert field data to integers
    enc = OrdinalEncoder()
    X = enc.fit_transform(df.to_numpy())
    # calc the (unnormalized) hamming distance for all row pairs
    d = pdist(X, metric="hamming") * df.shape[1]
    s = squareform(d)
    # s contains each pair twice, as (j, k) and (k, j); keep only the lower
    # triangle so that only the later row of a close pair gets flagged as a duplicate
    s[np.triu_indices_from(s)] = -1
    dupe_pair_matrix = (0 <= s) & (s <= threshold)
    df_dupes = df[np.any(dupe_pair_matrix, axis=1)]
    df_deduped = df.drop(df_dupes.index).sort_index()
    return (df_deduped, df_dupes)


def dedupe_partially_vectorized(df, threshold=1):
    """
    - loop through each row, starting from the last
    - for each `row`, calculate the hamming distance to all previous rows
    - if any such distance is `threshold` or less, mark `idx` as a duplicate
    - the loop ends at the 2nd row (the 1st is by definition not a duplicate)
    """
    # convert field data to integers
    enc = OrdinalEncoder()
    X = enc.fit_transform(df.to_numpy())
    dupe_idx = []
    for j in range(len(X) - 1):
        idx = len(X) - j - 1
        row = X[idx]
        prev_rows = X[0:idx]
        # hamming distance (count of differing fields) to every earlier row
        dists = np.sum(row != prev_rows, axis=1)
        if min(dists) <= threshold:
            dupe_idx.append(idx)
    dupe_idx = sorted(dupe_idx)
    df_dupes = df.iloc[dupe_idx]
    # drop by index label so this also works with a non-default index
    df_deduped = df.drop(df_dupes.index)
    return (df_deduped, df_dupes)
```
Now let's test things out. Sanity check first:
```
df = pd.DataFrame(
    [
        ["john", "doe", "m", 23],
        ["john", "dupe", "m", 23],
        ["jane", "doe", "f", 29],
        ["jane", "dole", "f", 28],
        ["jon", "dupe", "m", 23],
        ["tom", "donald", "m", 12],
        ["john", "dupe", "m", 65],
    ],
    columns=["first", "last", "s", "age"],
)

(df_deduped_fv, df_dupes_fv) = dedupe_fully_vectorized(df)
(df_deduped, df_dupes) = dedupe_partially_vectorized(df)
df_deduped_fv.equals(df_deduped)  # True

# df_deduped
#   first    last  s  age
# 0  john     doe  m   23
# 2  jane     doe  f   29
# 3  jane    dole  f   28
# 5   tom  donald  m   12

# df_dupes
#   first  last  s  age
# 1  john  dupe  m   23
# 4   jon  dupe  m   23
# 6  john  dupe  m   65
```
I've tested this on dataframes up to ~40k rows (as below) and the two methods give identical results, though it may take several seconds. I haven't tried it at your scale, so it may be slow:
```
arr = np.array(list("abcdefgh"))
df = pd.DataFrame(np.random.choice(arr, (40000, 15)))
# (df_deduped, df_dupes) = dedupe_partially_vectorized(df)
```
If you can avoid doing all pairwise comparisons, for example by grouping on a field like name, that will improve performance significantly.
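As a rough sketch (my own, not tested at scale), assuming there is a column that true duplicates always share exactly, you could run the partially vectorized function within each group. Note that rows in different groups are never compared, so duplicates that differ on the grouping column will be missed:

```
def dedupe_by_group(df, group_col, threshold=1):
    # dedupe each group independently, then stitch the results back together
    deduped_parts, dupe_parts = [], []
    for _, group in df.groupby(group_col):
        deduped, dupes = dedupe_partially_vectorized(group, threshold=threshold)
        deduped_parts.append(deduped)
        dupe_parts.append(dupes)
    return (pd.concat(deduped_parts).sort_index(),
            pd.concat(dupe_parts).sort_index())

# e.g. (df_deduped, df_dupes) = dedupe_by_group(df, "last")
```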
Fun aside / issues with the approach
You may notice you can get interesting "Hamming chains" (I don't know if there is a term for this), where very different records are connected by a chain of records that each differ from the next by a single edit:
```
df_bad_news = pd.DataFrame(
    [
        ["john", "doe", "m", 88],
        ["jon", "doe", "m", 88],
        ["jan", "doe", "m", 88],
        ["jane", "doe", "m", 88],
        ["jane", "doe", "m", 12],
    ],
    columns=["first", "last", "s", "age"],
)

(df_deduped, df_dupes) = dedupe_partially_vectorized(df_bad_news)

# df_deduped
#   first last  s  age
# 0  john  doe  m   88

# df_dupes
#   first last  s  age
# 1   jon  doe  m   88
# 2   jan  doe  m   88
# 3  jane  doe  m   88
# 4  jane  doe  m   12
```
Performance will be greatly improved if there is a field you can group by (it was mentioned in the comments that `name` is expected to be identical). The fully vectorized pairwise calculation is O(n^2) in memory; it is possible to trade some time efficiency for memory efficiency as needed, for example along the lines of the sketch below.