Duplicates: why are they an issue?

But doesn't each cluster contain thousands of copies of the same fragment?

Yes, but each cluster is read out as a whole. Each fragment in the library gives rise to one cluster and, although that cluster contains thousands of copies of the fragment, it produces only one read in the FASTQ file. So this is not an issue: the thousands of copies within a cluster do not distort the counts.

The issue arises when the library itself contains many copies of the same fragment: copies generated during library prep (e.g. by PCR) or by the sequencing itself, so copies that do not have a biological origin. Each copy gives rise to its own cluster and thus its own read in the FASTQ file, so the file contains duplicates that distort the counts.
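
As a rough illustration, you can gauge duplication at the FASTQ level by counting how often identical read sequences occur (this is, roughly, what FastQC's sequence duplication module estimates). A minimal Python sketch, assuming a gzipped single-end FASTQ file with the hypothetical name sample_R1.fastq.gz:

```python
import gzip
from collections import Counter

def sequence_duplication(fastq_path):
    """Count how often each read sequence occurs in a gzipped FASTQ file."""
    counts = Counter()
    with gzip.open(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the sequence is the 2nd line of every 4-line FASTQ record
                counts[line.strip()] += 1
    total = sum(counts.values())
    duplicates = total - len(counts)  # every extra occurrence beyond the first
    return total, duplicates

total, duplicates = sequence_duplication("sample_R1.fastq.gz")  # hypothetical file name
print(f"{duplicates} of {total} reads ({duplicates / total:.1%}) are duplicates")
```

Note that a count like this lumps biological and technical duplicates together; it only tells you how much duplication there is, not where it comes from.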

What about technical duplicates in RNASeq data?

Unless you use UMIs (unique molecular identifiers), you have no way to distinguish biological from technical duplicates in RNASeq data.

  • Biological duplicates have to be kept; removing them would distort the counts
  • Technical duplicates have to be removed; keeping them would distort the counts

Biological duplicates are expected to outnumber technical duplicates, because RNASeq library prep typically requires little PCR amplification. So, to be safe, all duplicates are kept, based on the underlying assumption that most of them are biological.
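
If you want a feel for the overall duplication level in a sample (without removing anything), you can count reads that map to the same position and strand, which is roughly how duplicate-marking tools such as Picard MarkDuplicates define duplicates. A minimal sketch using pysam, assuming an aligned BAM file with the hypothetical name sample.bam; note that a position-based count like this still cannot tell biological and technical duplicates apart:

```python
from collections import Counter

import pysam  # assumes pysam is installed

def positional_duplication(bam_path):
    """Count mapped reads that share the same reference, start position and strand."""
    positions = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            # Simplification: uses the left-most mapping coordinate; real duplicate
            # markers work from the (soft-clip corrected) 5' end of the read.
            key = (read.reference_id, read.reference_start, read.is_reverse)
            positions[key] += 1
    total = sum(positions.values())
    duplicates = total - len(positions)  # extra reads beyond the first at each position
    return total, duplicates

total, duplicates = positional_duplication("sample.bam")  # hypothetical file name
print(f"{duplicates} of {total} mapped reads ({duplicates / total:.1%}) look like duplicates")
```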

How to distinguish technical from biological duplicates in RNASeq data?

If you want to make the distinction, you can add an extra barcode to the adapters that is unique for each transcript molecule: the UMI. You ligate the adapters very early in the library prep, so each original molecule carries its own UMI. If a molecule is then copied during library prep, the library will contain multiple copies with the same UMI. In this way UMIs make it possible to identify technical duplicates. UMIs were originally used for single-cell RNASeq, but more and more people are also using them for bulk RNASeq.
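
Conceptually, UMI-aware deduplication groups reads by mapping position plus UMI: reads at the same position with the same UMI are copies of one original molecule (technical duplicates), while reads at the same position with different UMIs come from distinct molecules (biological duplicates). A toy Python sketch of that grouping logic; the read records are made up for illustration, and real tools such as UMI-tools additionally correct for sequencing errors in the UMI:

```python
from collections import defaultdict

# Toy aligned reads: (read_id, chromosome, position, strand, UMI) -- made-up data.
reads = [
    ("read1", "chr1", 1000, "+", "ACGTACGT"),
    ("read2", "chr1", 1000, "+", "ACGTACGT"),  # same position AND same UMI -> technical duplicate
    ("read3", "chr1", 1000, "+", "TTGCAGGA"),  # same position, different UMI -> biological duplicate
    ("read4", "chr2", 5200, "-", "ACGTACGT"),  # different position -> independent fragment
]

# Group reads by (chromosome, position, strand, UMI); each group = one original molecule.
molecules = defaultdict(list)
for read_id, chrom, pos, strand, umi in reads:
    molecules[(chrom, pos, strand, umi)].append(read_id)

# Keep one read per molecule; the extra reads in each group are technical duplicates.
kept = [ids[0] for ids in molecules.values()]
removed = [rid for ids in molecules.values() for rid in ids[1:]]

print("kept:", kept)        # read1, read3, read4
print("removed:", removed)  # read2 (a copy of read1)
```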

Without UMIs, you cannot distinguish technical from biological duplicates in RNASeq data.