How to map a Dataset?

ChIP-seq data from published papers are usually stored in a repository such as the Gene Expression Omnibus (GEO), but most often they have not been mapped. Mapping ChIP-seq reads, also called aligning them, identifies the genomic coordinates in a specified reference genome that best match each read. This is not always possible, for example because of low sequencing quality or repeated sequences, so usually not all reads will yield a genomic coordinate.
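To make the idea concrete, here is a minimal toy sketch of what mapping a read means. It is an illustration only and not how BWA or Bowtie2 work internally (real aligners use indexed genomes and handle gaps and base qualities). The function names, the tiny reference string and the mismatch threshold are all assumptions made for this example.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return the 0-based position of the unique best match of `read`
    in `reference`, or None if the read cannot be placed.

    A read is left unmapped when no position is within `max_mismatches`
    (e.g. a low-quality read) or when several positions match equally
    well (e.g. a repeated sequence).
    """
    best_pos = None
    best_dist = max_mismatches + 1  # anything at or above this is rejected
    ties = 0
    for pos in range(len(reference) - len(read) + 1):
        dist = hamming(read, reference[pos:pos + len(read)])
        if dist < best_dist:
            best_pos, best_dist, ties = pos, dist, 1
        elif dist == best_dist and best_pos is not None:
            ties += 1
    if best_pos is None or ties > 1:
        return None
    return best_pos


# Hypothetical toy "genome"; note that AAACCC occurs twice in it.
reference = "AAACCCGGGTTTAAACCC"

print(map_read("CGGGTT", reference))  # unique exact match
print(map_read("CGGCTT", reference))  # one mismatch, still placed
print(map_read("AAACCC", reference))  # repeated sequence, unmapped
print(map_read("TGTGTG", reference))  # no acceptable match, unmapped
```

The first two reads are assigned a coordinate, whereas the last two are not, which mirrors why a real mapping run rarely places every read.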

The most accessible way to map your own or downloaded ChIP-seq data is to use the online analysis and processing platform Galaxy. An introduction to Galaxy can be found here.

As the procedure is quite standardized for ChIP-seq experiments, it might also be possible to have your sequencing facility or company map newly sequenced data for you.

The European Nucleotide Archive (ENA) mirrors all the public Datasets from GEO. Once you have located a Dataset of interest, the results on the ENA search page contain a link called ‘Fastq files (galaxy)’, which enables direct upload to your Galaxy account. Transferring the large files will take quite a while, so it might make sense to wait until the next day before proceeding.

In Galaxy, the reads can be mapped using one of two tools found under ‘NGS: Mapping’: BWA or Bowtie2. For newcomers, we recommend using a workflow made by Benjamin LeBlanc from Kristian Helin’s group at BRIC. It preprocesses the data by trimming the reads according to quality scores, performs quality control, and outputs the aligned reads as a BAM file that can then be downloaded and imported into EaSeq as a Dataset.
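If you want to check how many reads were actually mapped before importing the file, the following is a minimal sketch that is not part of the Galaxy workflow. It assumes the BAM file has first been converted to SAM text (for example with `samtools view -h`), and it relies on the SAM flag bit 0x4, which marks a read as unmapped. The function name and the example records are hypothetical.

```python
def count_mapped(sam_lines):
    """Count mapped and unmapped reads in an iterable of SAM text lines.

    Header lines (starting with '@') are skipped. Secondary (0x100) and
    supplementary (0x800) records are ignored so that each read is
    counted once. Returns a tuple (mapped, unmapped).
    """
    mapped = unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # second SAM column is the bitwise FLAG
        if flag & 0x900:
            continue
        if flag & 0x4:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Hypothetical records for illustration; a real file would be read with
# open("aligned.sam") and passed to count_mapped in the same way.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t101\t42\t6M\t*\t0\t0\tCGGGTT\tIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTGTGTG\tIIIIII",
    "read3\t16\tchr1\t201\t30\t6M\t*\t0\t0\tAAACCC\tIIIIII",
]

mapped, unmapped = count_mapped(example)
print(f"mapped: {mapped}, unmapped: {unmapped}")
```

A low mapped fraction can point to poor sequencing quality or to a mismatch between the data and the chosen reference genome.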

You can also use Galaxy to run the widely used peak-finding algorithm MACS on these data, and then download the peaks and import them into EaSeq as a Regionset.
