Deduper Service

A mailer, member database, and so much more, for digital activism.

Courtesy of GetUp, Identity integrates machine-learning deduplication using the Python library ‘dedupe’. This is a quick guide to setting it up.

Overview

Key terms:

deduper_settings -> The saved settings file which tells dedupe how to compare records

Blocking Map -> The table (dedupe_blocks) which groups members into blocks to be compared with each other (see https://dedupe.io/developers/library/en/latest/Making-smart-comparisons.html)

Index data -> A subset of your data used to generate indexes for blocking
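To make the idea of a blocking map concrete, here is a minimal sketch of grouping members into blocks so that only members sharing a block key are compared. The blocking rule used here (first three letters of the last name plus postcode) and the field names are assumptions for illustration only; dedupe learns its own blocking rules during training, and this is not how dedupe_blocks is actually populated.

```python
from collections import defaultdict


def block_key(record):
    # Hypothetical blocking rule, chosen only for illustration:
    # first three letters of the last name plus the postcode.
    last = (record.get("last_name") or "")[:3]
    postcode = record.get("postcode") or ""
    return f"{last}:{postcode}"


def build_blocks(records):
    # Map each block key to the list of member ids that share it.
    blocks = defaultdict(list)
    for member_id, record in records.items():
        blocks[block_key(record)].append(member_id)
    return dict(blocks)


# Made-up example members (not real data).
members = {
    1: {"last_name": "nguyen", "postcode": "2000"},
    2: {"last_name": "nguyen", "postcode": "2000"},
    3: {"last_name": "smith", "postcode": "3000"},
}
blocks = build_blocks(members)
```

Members 1 and 2 land in the same block and would be compared with each other; member 3 sits alone in its block and is never compared against them, which is what keeps the number of comparisons manageable.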

At the moment the deduper is called from UpsertMember#call. It assumes you have already deduplicated the existing member data in Identity, so it only deduplicates new data arriving at UpsertMember#call.

If you want to use the library to run a bulk deduplication of your existing Identity data, you will need to adapt the scripts.

Set Up

Import and train your dataset.

Load 10,000 to 50,000 rows into dedupe_index_data to train the deduper. Make sure you lowercase the values and remove trailing spaces. It’s important that the data in dedupe_index_data is not changed after training, because it is used to guarantee that the block keys generated for each member stay consistent.
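The clean-up step could look something like the following before rows are loaded into dedupe_index_data. This is a minimal sketch: the helper names and the column names in the example are hypothetical, and how you actually read and write the rows depends on your own setup.

```python
def normalize_value(value):
    # Lowercase and remove trailing whitespace; leave None untouched.
    if value is None:
        return None
    return value.rstrip().lower()


def normalize_row(row):
    # Apply the clean-up to string fields only; other types pass through.
    return {
        key: normalize_value(value) if isinstance(value, str) else value
        for key, value in row.items()
    }


# Made-up example row (column names are assumptions).
cleaned = normalize_row({"id": 7, "first_name": "Ada  ", "email": "ADA@Example.org "})
```

Applying the same normalization to every row, once, before training helps keep the index data stable afterwards.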

Run the initialization script

python lib/dedupe_init.py

Hint: you might find it easier to run this locally against your production database. Just set DATABASE_URL=[prod db].

This will take a little while to initialize and then ask you to compare a series of records. This trains the deduper against your dataset. Once you’ve done at least 20 positive and negative comparisons you can finish. It will then start ‘blocking’ your members. This could take over an hour if you have several million member records.

It will write two files that it will use in future: dedupe_training.json and dedupe_settings. You can discard dedupe_training.json (which contains personal data) once you are finished with setup, but keep dedupe_settings. Rename dedupe_settings to dedupe_settings.[ORG_NAME], matching the ORG_NAME environment variable, and commit this file.
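The rename can of course be done by hand; as a rough sketch, a small helper along these lines would build the target name from the ORG_NAME environment variable. The function name is hypothetical, and the default path assumes the settings file is in the current directory.

```python
import os
from pathlib import Path


def rename_settings(settings_path="dedupe_settings", org_name=None):
    # Fall back to the ORG_NAME environment variable when no name is given.
    org_name = org_name or os.environ["ORG_NAME"]
    source = Path(settings_path)
    # dedupe_settings -> dedupe_settings.[ORG_NAME], in the same directory.
    target = source.with_name(f"{source.name}.{org_name}")
    source.rename(target)
    return target
```

Calling rename_settings() with ORG_NAME set to, say, exampleorg would leave a file named dedupe_settings.exampleorg ready to commit.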

If you’re happy using another org’s settings you can just copy their settings file and add your organisation’s name.

Incremental deduplication

Set DEDUPER_ENABLED=true or Settings.deduper.enabled = true. Any records which arrive at UpsertMember#call and fail to match on email or phone will be passed to the deduper to try to find a match.
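The matching order described above can be sketched as follows. This is only an illustration of the fallback logic in Python, not the code in UpsertMember#call; every name here (find_member, by_email, by_phone, fuzzy_match) is hypothetical, and the way the DEDUPER_ENABLED value is parsed is an assumption.

```python
import os


def deduper_enabled(env=None):
    # Assumption: the flag counts as on only when it is the string "true".
    env = os.environ if env is None else env
    return env.get("DEDUPER_ENABLED", "").lower() == "true"


def find_member(record, by_email, by_phone, fuzzy_match, enabled):
    # 1. Exact match on email.
    email = record.get("email")
    if email and email in by_email:
        return by_email[email]
    # 2. Exact match on phone.
    phone = record.get("phone")
    if phone and phone in by_phone:
        return by_phone[phone]
    # 3. Only when both fail, and the deduper is on, try a fuzzy match.
    if enabled:
        return fuzzy_match(record)
    return None
```

Here fuzzy_match stands in for whatever call hands the record to the deduper; it returns a member id or None.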