Regionsets are lists of regions of interest, and it might help to think of Regionsets as viewpoints.
To visualize, for example, the signal at and around enhancers, EaSeq needs to ‘know’ the genomic coordinates of each enhancer. Regionsets are basically lists of genomic coordinates, along with optional additional information associated with each region.
For example, a heatmap or an average track showing ChIP-seq signal at a set of enhancers would require these enhancers to be loaded as a Regionset, and the ChIP-seq data as a Dataset.
As for Datasets, the genomic coordinates for each region in a Regionset consist of at least a chromosome name (e.g. chr8) and a one-dimensional position (e.g. 31,504,393). This means that the region is positioned at base 31,504,393 on the chromosome called chr8. In most Regionsets the regions also have a fixed or varying spatial extent, so that they contain two coordinates corresponding to the start and end position of each region. EaSeq also needs to assign each region to one of the DNA strands, and it will assign all regions to the plus strand if no strand information is supplied.
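The coordinate model described above can be sketched as a small data structure. This is only an illustration and not EaSeq's internal representation: the class name Region and its field names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    """Hypothetical model of one region in a Regionset (not EaSeq's own type)."""
    chrom: str                 # chromosome name, e.g. "chr8"
    start: int                 # one-dimensional position, e.g. 31504393
    end: Optional[int] = None  # optional; a region without extent is a single position
    strand: str = "+"          # plus strand is assumed when no strand is supplied

    def __post_init__(self):
        # A region with only one coordinate starts and ends at the same base.
        if self.end is None:
            self.end = self.start
        if self.end < self.start:
            raise ValueError("end must not be smaller than start")
        if self.strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")


# A single position on chr8, with no extent and no strand information.
point = Region("chr8", 31_504_393)

# A region with a spatial extent on the minus strand (coordinates are made up).
interval = Region("chr8", 31_504_393, 31_504_893, "-")
```

Here point ends up on the plus strand with identical start and end, mirroring the default described in the text.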
As for Datasets, EaSeq can load files in varying formats as long as the content is in a column-separated, text-based format. The import wizard will help you define where it should look for the different types of information, and it is quite flexible about which column contains what information. Data needs to be organized in columns, and each line in the file should preferably have the same number of columns. The character used to separate columns can be tab, semicolon, comma, or space, and other custom separators should work as well. EaSeq accepts text files generated on a Mac, a Linux, or a Windows machine. Only one line of column headers is allowed, so if the file contains more, open it in e.g. Excel or Notepad and remove the extra lines. In some formats # is used to mark meta-information, comments, or column headers, so EaSeq will ask the user whether it should ignore a line starting with # or use it as a header. EaSeq is quite relaxed about what chromosomes are named, but will help you to use the standard names in all data. It is also important to ensure that the reference genome used to generate a set of regions corresponds to the one that was used to map the reads in the Datasets.
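To check a file before importing it, the layout rules above can be sketched in a few lines of Python. This is not how EaSeq parses files; the function read_columns and its parameters sep and hash_is_header are hypothetical names chosen for the example, and the sample coordinates are made up.

```python
def read_columns(text, sep="\t", hash_is_header=False):
    """Split column-separated text into an optional header and data rows.

    Lines starting with '#' are skipped, unless hash_is_header is True,
    in which case the first such line is used as the column header.
    """
    header = None
    rows = []
    # splitlines() handles Windows (\r\n), Linux (\n) and old Mac (\r) endings.
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        if line.startswith("#"):
            if hash_is_header and header is None:
                header = line.lstrip("#").strip().split(sep)
            continue
        rows.append(line.split(sep))
    return header, rows


def column_counts(rows):
    """Return the set of distinct column counts; ideally it has one element."""
    return {len(row) for row in rows}


# Example with mixed line endings and a '#' header line.
text = "#chrom\tstart\tend\r\nchr8\t31504393\t31504893\r\nchr1\t100\t200\n"
header, rows = read_columns(text, hash_is_header=True)
```

If column_counts(rows) contains more than one value, some lines have a different number of columns than others, which is worth fixing before import.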
I would like to import a list of genes as a Regionset (e.g. expression data from a microarray), but I do not have the genomic coordinates for each gene. Can this be done?
This can be done if a reference genome is already loaded as a Geneset, by selecting ‘Find coordinates in an already loaded Geneset file’ in the import wizard. EaSeq will then look for matching gene names or accession numbers, and if there are multiple matches in the Geneset, it will take the most encompassing genomic coordinates and use those for the Regionset.
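One way to read ‘most encompassing genomic coordinates’ is the smallest start and the largest end among all matching entries. The sketch below follows that reading; it is an assumption about the behavior and not EaSeq's actual code. The function name, the tuple layout, the gene names, and the coordinates are all made up for illustration, and matches are assumed to lie on a single chromosome.

```python
def most_encompassing(name, geneset):
    """Return (chrom, start, end) spanning all entries that match name.

    geneset is a list of (name, chrom, start, end) tuples. Returns None
    when nothing matches. Only entries on the chromosome of the first
    match are considered (a simplifying assumption of this sketch).
    """
    matches = [g for g in geneset if g[0] == name]
    if not matches:
        return None
    chrom = matches[0][1]
    same_chrom = [g for g in matches if g[1] == chrom]
    start = min(g[2] for g in same_chrom)  # leftmost start
    end = max(g[3] for g in same_chrom)    # rightmost end
    return (chrom, start, end)


# Hypothetical Geneset with two transcripts of the same gene.
geneset = [
    ("GENE_A", "chr8", 1000, 4000),
    ("GENE_A", "chr8", 1500, 5000),
    ("GENE_B", "chr1", 200, 900),
]

region = most_encompassing("GENE_A", geneset)
```

For GENE_A the result spans both transcripts, from the earlier start to the later end.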
However, always use probe, transcript, or gene coordinates from the exported microarray data if they are available or can be identified. They are more precise, ensuring that potential relationships between ChIP-seq signal and expression will be as clear as possible. If EaSeq has to use the coordinates from a reference genome, they will usually not be exactly the same as those used for the microarray.