Read-based Datasets contain the genomic coordinates of the millions of sequenced and mapped reads from, for example, a ChIP-seq sample. The reads can also come from other assays such as DIP-seq, RIP-seq, RNA-seq, ATAC-seq, or CAGE-seq, as long as the data come from ‘Single Read’ sequencing.
The genomic coordinates for each read in a Dataset consist of at least a chromosome name (e.g. chr8) and a one-dimensional position (e.g. 31,504,393). This means that the read is positioned at base 31,504,393 on the chromosome called chr8. EaSeq is quite relaxed about what chromosomes are named, but will help you to use the standard names in all data.
In most Dataset formats, each read has two coordinates corresponding to the start and end positions of the sequenced bases, so if 50 bp have been sequenced and mapped, the start and end coordinates would be 50 bp apart. For NGS data, this information is mostly redundant because the read lengths are highly uniform. Although the length might vary slightly due to mapping and quality issues, the small deviations will only affect the subset of methods that rely on single-bp resolution (e.g. CAGE-seq, Start-seq). For ChIP-seq, on the other hand, the obtained resolution is usually much coarser than the read length. Therefore, some data formats discard one of the coordinates in order to reduce file sizes.
EaSeq can handle Datasets with either one or two coordinates. Internally, EaSeq will (in its current form) use only one coordinate per read, so that as many Datasets as possible can be loaded into memory.
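As a rough illustration of how a one- or two-coordinate record can be reduced to a single position, consider the sketch below. The `Read` type, the `to_single_coordinate` function, and the choice to keep the start coordinate are all assumptions made for illustration; which coordinate EaSeq actually retains is not stated here.

```python
from typing import NamedTuple, Optional


class Read(NamedTuple):
    """Hypothetical in-memory record: one coordinate per read."""
    chrom: str
    pos: int
    strand: Optional[str] = None


def to_single_coordinate(chrom, start, end=None, strand=None):
    """Collapse a one- or two-coordinate record into a single position.

    Assumption: the start coordinate is kept and the end coordinate,
    if present, is dropped. This is an illustrative choice only.
    """
    return Read(chrom, int(start), strand)


# A 50 bp read given with both coordinates, and the same read with one.
two_coord = to_single_coordinate("chr8", 31504393, 31504443, "+")
one_coord = to_single_coordinate("chr8", 31504393)
```

Both calls yield a record with the same position, which is why dropping the second coordinate costs little when read lengths are uniform.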
EaSeq will not discard reads with negative coordinates; instead, their position will be set to 1. The largest allowed coordinate is 4 × 10^9, which is well above the size of any mammalian chromosome. EaSeq does not check the coordinates against a known reference, so it will not protest if reads lie outside the coordinate range of the chromosomes in the reference genome.
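The coordinate handling described above can be sketched as follows. This is not EaSeq's code: the function name `clamp_position` is hypothetical, the upper limit is taken to mean 4,000,000,000 (an interpretation), and raising an error for larger values is an assumption, since the behaviour above the limit is not described. A coordinate of 0 is left unchanged because only negative values are mentioned.

```python
# Assumed reading of the largest allowed coordinate.
MAX_COORDINATE = 4_000_000_000


def clamp_position(pos: int) -> int:
    """Return a usable position for a read.

    Negative coordinates are kept but moved to position 1.
    No check against a reference genome is performed.
    """
    if pos < 0:
        return 1
    if pos > MAX_COORDINATE:
        # Assumption: reject rather than silently truncate.
        raise ValueError(f"coordinate {pos} exceeds {MAX_COORDINATE}")
    return pos
```

For example, `clamp_position(-12)` returns 1, while an ordinary position such as 31504393 is returned unchanged.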
Finally, reads usually carry information about which DNA strand they map to: + or -, F(orward) or R(everse), etc. This information is optional, so you can import data without it if it has been lost. However, the uses of such Datasets might be limited (e.g. peak finding depends on this information), and the output might have a lower resolution.
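Because strand can be written in several ways, an importer typically maps the variants onto one representation. The sketch below shows one possible way to do this; the `normalize_strand` function and the alias table are hypothetical, and returning `None` for missing or unrecognized values is an assumption that reflects strand being optional.

```python
# Hypothetical alias table covering the spellings mentioned above.
_STRAND_ALIASES = {
    "+": "+", "f": "+", "forward": "+",
    "-": "-", "r": "-", "reverse": "-",
}


def normalize_strand(value):
    """Map a strand label to '+' or '-', or None if absent/unknown."""
    if value is None:
        return None
    return _STRAND_ALIASES.get(value.strip().lower())
```

A read whose strand comes back as `None` can still be stored, but downstream steps that need strand information would have to skip it or treat it specially.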
EaSeq has an import wizard that helps you define where it should look for the different types of information, and it is quite flexible about which column contains which information. Data need to be organized in columns, and each line in the file should preferably have the same number of columns. The character used to separate columns can be a tab, semicolon, comma, or space, and other custom separators should work as well.
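To make the column requirement concrete, here is a minimal sketch of how a separator could be guessed from the first few lines of a file. It is not the import wizard's logic: `guess_delimiter` is a hypothetical helper, and the heuristic (pick the candidate that yields the same column count on every line, preferring more columns) is an assumption.

```python
def guess_delimiter(lines, candidates=("\t", ";", ",", " ")):
    """Return the separator that splits every line into the same
    number of columns (more than one), or None if none qualifies."""
    best, best_cols = None, 1
    for delim in candidates:
        # Column counts for all non-empty lines with this separator.
        counts = {
            len(line.rstrip("\r\n").split(delim))
            for line in lines
            if line.strip()
        }
        if len(counts) == 1:
            n_cols = counts.pop()
            if n_cols > best_cols:
                best, best_cols = delim, n_cols
    return best


sample = ["chr8\t31504393\t31504443\t+", "chr1\t1000\t1050\t-"]
delimiter = guess_delimiter(sample)
columns = [line.split(delimiter) for line in sample]
```

The heuristic relies on every line having the same number of columns, which is one reason uniformly shaped files are easier to import.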
EaSeq accepts text files generated on a Mac, Linux, or Windows machine. Only one line of column headers is allowed, so if the file contains more, open it in e.g. Excel or Notepad and remove the extra lines. In some formats, # is used to mark meta-information, comments, or column headers, so EaSeq will ask the user whether it should ignore a line starting with # or use it as a header.
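The two choices for lines starting with # can be illustrated with a short sketch. The function `split_hash_lines` and its `use_as_header` flag are hypothetical names, and using only the first # line as the header is an assumption made to respect the one-header-line limit. Python's `str.splitlines` is used because it accepts the line endings produced on Mac, Linux, and Windows.

```python
def split_hash_lines(text, use_as_header=False):
    """Separate '#' lines from data rows.

    If use_as_header is True, the first '#' line becomes the header
    (with the leading '#' removed); any other '#' lines are ignored.
    """
    header = None
    rows = []
    for line in text.splitlines():  # handles \n, \r\n and \r
        if not line.strip():
            continue
        if line.startswith("#"):
            if use_as_header and header is None:
                header = line.lstrip("#").strip()
            continue
        rows.append(line)
    return header, rows


text = "#chrom\tpos\r\nchr8\t31504393\r\nchr1\t1000\n"
header, rows = split_hash_lines(text, use_as_header=True)
no_header, same_rows = split_hash_lines(text)
```

With `use_as_header=True` the # line supplies the column names; otherwise it is skipped, and the data rows are the same in both cases.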