Deduplication is often used in document reviews to reduce the amount of data to be searched or reviewed. When multiple files have identical content, only one copy is kept. Hash values, such as MD5 digests, are used to determine whether files are identical and therefore duplicates. Another way to avoid reviewing the same file twice is to automatically propagate coding decisions to all duplicate files.
In both cases, some file details are ignored when assessing the similarity of two files. This article describes the pitfalls of deduplication and propagation with respect to this discarded information.
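To make the two approaches concrete, here is a minimal Python sketch of hash-based duplicate detection and coding propagation. It is an illustration under stated assumptions, not how any particular review platform works: the function names (`content_hash`, `group_duplicates`, `propagate_coding`), the in-memory `files` mapping, and the sample paths are all hypothetical. Note that only the file content is hashed, so the file name and folder play no part in the comparison.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Return the MD5 digest of the file content only (no name, path, or dates)."""
    return hashlib.md5(data).hexdigest()


def group_duplicates(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file names by content hash; each group is one set of duplicates."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[content_hash(data)].append(name)
    return dict(groups)


def propagate_coding(groups: dict[str, list[str]],
                     decisions: dict[str, str]) -> dict[str, str]:
    """Copy a coding decision made on one file to every file with the same hash.

    Assumption for this sketch: if several files in a group were coded,
    the first decision found is the one that gets propagated.
    """
    propagated = {}
    for names in groups.values():
        coded = [decisions[n] for n in names if n in decisions]
        if coded:
            for n in names:
                propagated[n] = coded[0]
    return propagated


if __name__ == "__main__":
    # Hypothetical sample data: two custodians hold the same content
    # under different names and in different folders.
    files = {
        "custodian_a/contract.docx": b"Terms and conditions",
        "custodian_b/final_v2.docx": b"Terms and conditions",
        "custodian_a/notes.txt": b"Meeting notes",
    }
    groups = group_duplicates(files)
    print(groups)
    print(propagate_coding(groups, {"custodian_a/contract.docx": "privileged"}))
```

In this sketch the two identically worded files fall into one group even though their names and custodians differ, which is exactly the kind of detail that is discarded when similarity is judged by content hash alone.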