An important setting to learn about is the filtering for unique reads, which is enabled by default in EaSeq. For some Datasets it might make sense to turn it off. The function discards, for each strand, any read that maps to a coordinate where an already imported read is located.
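As a rough illustration of the idea (a minimal sketch in Python, not EaSeq's actual implementation; it assumes each read can be reduced to a chromosome, a strand and a single coordinate):

```python
# Illustrative sketch only. Keeps the first read seen at each strand-specific
# coordinate and discards later reads mapping to the same position.
def filter_unique_reads(reads):
    seen = set()
    kept = []
    for chrom, strand, coord in reads:
        key = (chrom, strand, coord)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, strand, coord))
    return kept

reads = [("chr1", "+", 1000), ("chr1", "+", 1000), ("chr1", "-", 1000)]
print(filter_unique_reads(reads))  # the second '+' read at 1000 is discarded
```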
This can reduce the number of reads markedly, so why would anyone want to do this?
During PCR the original fragments in the library are amplified, and if the library originally contains 5M fragments, sequencing 20M reads cannot possibly provide more information than the coordinates of those 5M fragments. Most of the 20M reads will be redundant information about the same fragments, each amplified and sequenced multiple times. The result is that both the signal and the background will be quite bumpy or even spiky, and this will obscure most analyses.
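A back-of-the-envelope calculation with the hypothetical numbers above illustrates the degree of redundancy, assuming (unrealistically) that each read samples a library fragment uniformly at random:

```python
# Hypothetical numbers from the text: 5M distinct library fragments, 20M reads.
# If each read samples a fragment uniformly at random, the expected number of
# distinct fragments observed is N * (1 - (1 - 1/N)^M).
N = 5_000_000    # distinct fragments in the library
M = 20_000_000   # sequenced reads

distinct = N * (1 - (1 - 1 / N) ** M)
redundant = M - distinct
print(f"distinct fragments seen: {distinct:,.0f}")   # ~4.9M
print(f"redundant reads:         {redundant:,.0f}")  # ~15.1M, i.e. ~75% of the reads
```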
When should you not use this setting?
Typically when working with small genomes. Yeast and bacteria, for example, often have genomes of only a few Mbp, and most reads in a 20M read library would occupy identical positions simply due to the high coverage of the sequencing. When working in mammalian genomes, the likelihood of two reads occupying the same position by chance is usually rather low, as these genomes typically have in the range of approx. 3x10^9 coordinates on each strand. In a 20M read library, this would roughly mean that evenly distributed reads would be spaced approx. 300 bp apart on each strand.

In real life, things are a bit more complicated: the distribution is not uniformly random, not all of the genome is ‘mappable’, and the reads that result from the immunoprecipitation are usually located in confined areas corresponding to the peaks of the Dataset. Imagine a Dataset with no background at all and all signal located within 10,000 peaks of 250 bp each. The reads could only map to one of 5M potential coordinates and would therefore be very likely to share a coordinate with another read. In this theoretical case it would be best not to filter for unique reads, but this is very far from the situation of most Datasets, where 95% or more of the reads might be background.

The exact conditions therefore also depend on e.g. the signal-to-noise ratio and the total area bound by the analyzed factor, so it is currently not possible for us to set up a rigid rule of thumb or an automated decision process. Keeping or removing duplicate reads when data are imported is therefore optional in EaSeq. For our own analyses, we have considered the advantage of reducing false positives from duplicates to outweigh the disadvantage of losing linearity for strongly enriched regions in larger Datasets; peaks from those regions would still be called.
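The two scenarios above can be contrasted with a simple calculation, again assuming that reads fall uniformly at random over the available coordinates (which real data do not) and using the hypothetical numbers from the text:

```python
# Expected fraction of reads landing on an already-occupied coordinate, given
# C available coordinates per strand and R reads per strand (uniform model).
def duplicate_fraction(coordinates_per_strand, reads_per_strand):
    c, r = coordinates_per_strand, reads_per_strand
    distinct = c * (1 - (1 - 1 / c) ** r)  # expected distinct coordinates hit
    return 1 - distinct / r

# Mammalian genome: ~3x10^9 coordinates per strand, 20M reads split over two strands.
print(f"mammalian background: {duplicate_fraction(3_000_000_000, 10_000_000):.2%}")  # ~0.2%

# Thought experiment: all signal in 10,000 peaks of 250 bp, i.e. 2.5 Mbp per strand.
print(f"confined peaks:       {duplicate_fraction(2_500_000, 10_000_000):.2%}")      # ~75%
```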