A mailer, member database, and so much more, for digital activism.
Courtesy of GetUp, Identity integrates advanced machine-learning deduplication using the Python library ‘dedupe’. Here is a quick guide on how to set it up.
Key terms:
dedupe_settings -> The saved settings file which tells dedupe how to compare records
Blocking Map -> The table (dedupe_blocks) which groups members into blocks to be compared with each other (see https://dedupe.io/developers/library/en/latest/Making-smart-comparisons.html)
Index data -> A subset of your data used to generate indexes for blocking
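To make the idea of blocking concrete, here is a minimal sketch in plain Python. It is not how dedupe itself works internally: dedupe learns its own blocking rules during training, whereas this example uses a hand-written, hypothetical key (the first three letters of the last name plus the postcode). The field names are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def block_key(member):
    # Hypothetical blocking rule: first three letters of last name + postcode.
    # dedupe learns rules like this from training; this one is hand-written.
    return (member["last_name"][:3], member["postcode"])


def build_blocks(members):
    # Group member ids by block key, so only members that share a key
    # are ever compared with each other.
    blocks = defaultdict(list)
    for member_id, member in members.items():
        blocks[block_key(member)].append(member_id)
    return dict(blocks)


def candidate_pairs(blocks):
    # Every pair of members inside the same block is a candidate duplicate.
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


members = {
    1: {"last_name": "smith", "postcode": "2000"},
    2: {"last_name": "smithe", "postcode": "2000"},
    3: {"last_name": "nguyen", "postcode": "3000"},
}
blocks = build_blocks(members)
pairs = candidate_pairs(blocks)
```

Members 1 and 2 land in the same block and become a candidate pair, while member 3 is never compared with either of them. This is what keeps comparison affordable on large member tables.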
At the moment the deduper is called from UpsertMember#call. It assumes you have already deduplicated the existing member data in Identity, so it only deduplicates new data arriving at UpsertMember#call.
If you want to use the library to run a big dedupe of your Identity data you will need to adapt the scripts.
Load 10,000-50,000 rows into dedupe_index_data to train the deduper on. Make sure you lowercase the values and remove trailing spaces. It’s important that the data in dedupe_index_data is not changed after you have done training, because it is used to guarantee consistency of the block keys generated for each member.
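The clean-up described above could be done before loading, along these lines. This is only a sketch: the column names are hypothetical, and how you load the rows into dedupe_index_data depends on your own setup.

```python
def normalise_value(value):
    # Leave NULLs and non-text values untouched.
    if not isinstance(value, str):
        return value
    # Remove trailing whitespace, then lowercase.
    return value.rstrip().lower()


def normalise_row(row):
    # Apply the same clean-up to every column of a row.
    return {column: normalise_value(value) for column, value in row.items()}


# Hypothetical column names, for illustration only.
raw_row = {"first_name": "Alice  ", "email": "ALICE@Example.org ", "postcode": None}
clean_row = normalise_row(raw_row)
```

Applying the same function to every row before it goes into dedupe_index_data keeps the index data consistent.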
python lib/dedupe_init.py
(hint: you might find it easier to run this locally against your production database, just set DATABASE_URL=[prod db])
This will take a little while to initialize and then ask you to compare a series of records. This trains the deduper against your dataset. Once you’ve done at least 20 positive and negative comparisons you can finish. It will then start ‘blocking’ your members. This could take over an hour if you have several million member records.
It will write two settings files that it will use in future. You can discard dedupe_training.json (which contains personal data) once you are finished with setup, but hang onto dedupe_settings. You should rename dedupe_settings to dedupe_settings.[ORG_NAME], matching the ORG_NAME environment variable, and commit this file.
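If you would rather script the rename than do it by hand, a small helper like the following would do. It is a sketch only: the directory the settings file is written to is an assumption (pass in wherever yours ended up), and the function names are made up for this example.

```python
import os
from pathlib import Path


def org_settings_path(directory, org_name):
    # Target name follows the dedupe_settings.[ORG_NAME] convention.
    return Path(directory) / f"dedupe_settings.{org_name}"


def rename_settings(directory, org_name=None):
    # Fall back to the ORG_NAME environment variable if no name is given.
    if org_name is None:
        org_name = os.environ["ORG_NAME"]
    source = Path(directory) / "dedupe_settings"
    target = org_settings_path(directory, org_name)
    source.rename(target)
    return target
```

For example, with ORG_NAME set in your environment, rename_settings(".") renames dedupe_settings in the current directory.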
If you’re happy using another org’s settings you can just copy their settings file and add your organisation’s name.
Set DEDUPER_ENABLED=true or Settings.deduper.enabled = true. Any records which arrive at UpsertMember#call and fail to match on email or phone will be passed to the deduper to try and find a match.
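The matching order can be pictured with the Python sketch below. It only illustrates the control flow described above and is not Identity's actual code, which lives in UpsertMember#call. Every name here is hypothetical, the lookups are stand-ins for database queries, and checking email before phone is an assumption.

```python
def find_member(record, by_email, by_phone, deduper_enabled, dedupe_match):
    # 1. Exact match on email (assumed to be tried first).
    email = record.get("email")
    if email and email in by_email:
        return by_email[email]
    # 2. Exact match on phone.
    phone = record.get("phone")
    if phone and phone in by_phone:
        return by_phone[phone]
    # 3. Only records that matched on neither are passed to the deduper,
    #    and only when it is enabled.
    if deduper_enabled:
        return dedupe_match(record)
    return None


# Stand-in data: lookups map a value to a member id.
by_email = {"alice@example.org": 1}
by_phone = {"0400000000": 2}
calls = []


def fake_dedupe_match(record):
    # Placeholder for the real deduper; records the call and returns member 3.
    calls.append(record)
    return 3


matched_email = find_member({"email": "alice@example.org"}, by_email, by_phone, True, fake_dedupe_match)
matched_phone = find_member({"phone": "0400000000"}, by_email, by_phone, True, fake_dedupe_match)
matched_fuzzy = find_member({"first_name": "alice"}, by_email, by_phone, True, fake_dedupe_match)
unmatched = find_member({"first_name": "alice"}, by_email, by_phone, False, fake_dedupe_match)
```

With the deduper disabled, a record that fails both exact matches comes back unmatched and the deduper is never consulted.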