Salesforce Data Loader and Duplicate Records Management

Set batch size to 1

Problem Statement

The problem to solve is detecting duplicate records that exist within an input file read by Salesforce Data Loader and loaded into Salesforce. In this scenario the duplicates are inside the input file itself; this is not to be confused with the different scenario of records in a Data Loader input file that duplicate records already in Salesforce. This note is intended to help users whose Data Loader input files (.csv files) may contain duplicate records.
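
For example, a hypothetical contact import file with an in-file duplicate might look like this (the names and email addresses are invented for illustration; the first and third data rows duplicate each other):

    FirstName,LastName,Email
    Ann,Ames,ann.ames@example.com
    Bob,Barnes,bob.barnes@example.com
    Ann,Ames,ann.ames@example.com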

If a Data Loader input file has duplicates in it, and batch size is not set to 1, then duplicate records in that input file may end up loaded into Salesforce undetected. This will happen regardless of the duplicate management settings that exist in Setup > Data > Duplicate Management > Matching Rules.

The Salesforce documentation spells out this issue in disparate places, and with something less than perfect clarity, as detailed below.

Salesforce Data Quality Manual

The Salesforce Data Quality manual, on pages 14-16, states:

“Duplicates can be detected when a sales rep creates, clones, or edits a record and a duplicate rule runs, or when you run a duplicate job. They can also be detected as part of other processes, such as importing or an API.”

While this would seem to suggest that duplicates in a Data Loader input file are detected, the issue is that all records within a batch, including any records that duplicate each other, are committed at one time.

Order of Execution in Apex Developer's Guide

In the Apex Developer’s Guide, “Triggers and Order of Execution,” duplicate checking is the sixth event in the sequence that runs “when you save a record with an insert, update, or upsert statement.” This too would seem to suggest that duplicates within a Data Loader input file should be detected. The problem is, they are not, unless batch size is set to one.

Timing of Saves

There is clarification of this issue on page 18 of the Data Quality manual, under “Things to know about duplicate rules”:

“Timing of Saves

If multiple records are saved simultaneously and your duplicate rule is set to block or alert sales reps, those records aren’t compared with each other. They’re compared only with records already in Salesforce. This behavior doesn't affect the Report option, and duplicate record sets do include records that match other records saved simultaneously.”

In other words, the timing of saves is such that records within the same batch are not compared with each other; they are compared only with records already in Salesforce.

Salesforce Duplicate Matching Rules

When using Data Loader to import records, Salesforce Duplicate Matching Rules compare input records with records ‘already in Salesforce’. Data Loader commits input records to the Salesforce database in increments equal to batch size. By setting batch size to one, input records are committed to the database one at a time. This allows for detection of duplicate records within a Data Loader input file.
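
To make the mechanics concrete, here is a minimal Python sketch of the commit behavior described above. It is not Salesforce code: the record values are invented, and “matching” is reduced to an exact match on email address, but it shows why a large batch size lets in-file duplicates through while a batch size of 1 catches them.

    def load(records, batch_size):
        database = set()  # stands in for records already in Salesforce
        results = []
        for start in range(0, len(records), batch_size):
            batch = records[start:start + batch_size]
            # Matching runs against the database only; records within the
            # same batch are never compared with each other.
            batch_results = [(rec, "blocked" if rec in database else "inserted")
                             for rec in batch]
            # The whole batch commits at once.
            database.update(rec for rec, status in batch_results
                            if status == "inserted")
            results.extend(batch_results)
        return results

    rows = ["ann@example.com", "ann@example.com", "bob@example.com"]
    print(load(rows, batch_size=200))  # both ann rows report "inserted"
    print(load(rows, batch_size=1))    # second ann row reports "blocked"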

Data Loader Guide

The default Data Loader batch size is 200. This means records read by Data Loader are committed to the Salesforce database in batches of 200 records at a time. If duplicates exist within that batch of 200 records, they are not detected. If batch size is set to 1, each input record is committed to the database as it is read. If an input record duplicates an earlier record and batch size is 1, matching rules detect this condition because each input record is committed to the database before the next record is inspected.
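
For users running Data Loader in command-line (batch) mode rather than through the settings panel, batch size is controlled by the sfdc.loadBatchSize key in the process configuration file. A minimal sketch of the relevant entry (the surrounding bean definition in process-conf.xml is omitted here):

    <!-- Commit one record per transaction so each input record is
         compared against records already committed to Salesforce. -->
    <entry key="sfdc.loadBatchSize" value="1"/>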

The Data Loader Guide spells this out in its explanation of settings:

Batch size field description

Data Loader Settings Panel

The Data Loader Settings panel looks like this. Note that Batch size is the second field:

Data Loader Settings Panel

If batch size is set to one, duplicate records within a Data Loader input file are detected. Users will need to look at the success, error, and log files created by Data Loader to see the full results of the run.
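
Because a blocked duplicate appears as a failed row in the error file rather than halting the run, a quick scan of that file confirms what was caught. Here is a minimal Python sketch, assuming the ERROR column that Data Loader appends to its error file and a duplicate-rule message containing the word “duplicate” (check your own rule’s alert text; the file name is a placeholder):

    import csv

    # "error.csv" is a placeholder; use the error file from your run.
    with open("error.csv", newline="") as f:
        failed = list(csv.DictReader(f))

    dupes = [row for row in failed
             if "duplicate" in row.get("ERROR", "").lower()]
    print(f"{len(dupes)} of {len(failed)} failed rows were flagged as duplicates")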

Calls to the Salesforce Help Desk

I first encountered this some time ago while teaching a class about duplicate management in Salesforce. During class a student promptly disproved my claim that matching rules would catch duplicates in an import file, by importing a small list of duplicate contacts with Data Loader. I raised the issue with the Salesforce help desk, and after some back-and-forth via phone and email it was explained how and why batch size must be set to one to correct this issue. It was also explained that a large batch size improves performance, and that setting batch size to one induces a performance penalty. There is a separate paper quantifying the scope and scale of this performance degradation, which is significant.

Since then I have seen other users and companies occasionally encounter this issue, so I wrote this note. I hope it is helpful.