Frequently Asked Questions:
Sequencing
- The size of the genome, piece of the genome (ERRBS, Exome Capture etc.) or transcriptome (RNASeq) that you expect to sequence.
- The required depth/coverage for your downstream analysis. You can use Illumina's sequencing coverage calculator to help with this.
- Your pooling/multiplexing design (if you are submitting pooled libraries).
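As a rough guide, the factors above combine through the standard relation: mean coverage = (number of reads x read length) / size of the sequenced target. The sketch below only illustrates that arithmetic. The function names and the example numbers are assumptions made for illustration, not facility recommendations.

```python
import math


def expected_coverage(num_reads: int, read_length: int, target_size: int) -> float:
    """Mean depth = total sequenced bases / size of the sequenced target (in bases)."""
    return num_reads * read_length / target_size


def reads_needed(coverage: float, read_length: int, target_size: int) -> int:
    """Number of reads required to reach a given mean depth, rounded up."""
    return math.ceil(coverage * target_size / read_length)


# Hypothetical example: 30x mean depth over a 3.1 Gb target with 100 bp reads.
# For paired-end runs, count each mate as one read.
example_reads = reads_needed(30, 100, 3_100_000_000)
```

For pooled libraries, the reads from a lane are shared between the samples in the pool, so the per-sample read count is the figure to compare against this estimate.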
Data Analysis and Retrieval
Sequence data (base call files or bcl files) generated from the sequencer are demultiplexed and converted to FASTQ files using the Illumina bcl2fastq software.
Your raw data will be available for download as a tar archive (Sample_*.tar) containing the gzipped FASTQ files for each sample. Raw data can be post-processed upon request.
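A minimal sketch of inspecting a downloaded archive with Python's standard tarfile module is shown here. The file extensions checked (.fastq.gz, .fq.gz) and the function name are assumptions; the actual member names inside your Sample_*.tar may differ.

```python
import tarfile


def list_fastq_members(tar_path):
    """Return the names of gzipped FASTQ files stored in a tar archive."""
    with tarfile.open(tar_path, "r") as archive:
        return [
            member.name
            for member in archive.getmembers()
            # Assumed naming convention for gzipped FASTQ files.
            if member.isfile() and member.name.endswith((".fastq.gz", ".fq.gz"))
        ]


# Usage (hypothetical file name):
#   for name in list_fastq_members("Sample_X.tar"):
#       print(name)
```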
FASTQ quality scores (resulting from the CASAVA pipeline) are currently encoded with the standard offset value of 33. Legacy data generated by the discontinued CASAVA 1.7 pipeline may use the earlier Illumina-specific offset value of 64. See Illumina's documentation on quality scoring to learn more.
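To illustrate what the offset means: each character of a FASTQ quality string encodes a Phred score as its ASCII code minus the offset. The sketch below decodes a quality string and makes a simple guess at the offset. The function names are hypothetical, and the guess is a heuristic that returns None when the characters are consistent with both encodings.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def guess_offset(quality_string):
    """Heuristic guess of the quality offset; None if ambiguous.

    Offset-64 encodings never use characters below ';' (ASCII 59), and
    offset-33 data with scores up to 41 never go above 'J' (ASCII 74).
    """
    codes = [ord(ch) for ch in quality_string]
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None
```

In practice the guess is more reliable when it is made over many reads rather than a single quality string.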
If requested at the time of sequencing, the sequence data is aligned to genomes available via Illumina's iGenome using one of the following: Gerald (Illumina's ELAND aligner) via the CASAVA 1.8.2 pipeline (for ChIP Seq, DNA Seq etc.), the STAR aligner (for RNA Seq data), or our in-house ERRBS pipeline with the Bismark aligner (for ERRBS data). Only raw reads that pass Illumina's purity filter are aligned.
FASTQ files generated as described above are adapter trimmed and aligned to genomes available via Illumina's iGenome using the STAR aligner. Only raw reads that pass Illumina's purity filter are aligned. This pipeline results in the following file types:
*.bam - The alignments produced by the STAR aligner (if alignment was requested), in the widely accepted BAM format (a binary version of the SAM format).
*.bai - An index file for your BAM alignments that allows certain browsers (such as the IGV browser) to display the .bam file efficiently.
*-SJ.out.tab - The high confidence collapsed splice junctions in tab-delimited format. Only junctions supported by uniquely mapping reads are reported.
*-Log.final.out - A text file containing the STAR aligner generated summary statistics for the alignment of each sample.
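The STAR summary file is plain text in which each statistic is written as a label, a "|" separator, and a value. A minimal sketch for reading it into a dictionary follows. The function name is hypothetical, and values are kept as strings because some carry a percent sign.

```python
def parse_star_log(text):
    """Parse the contents of a STAR Log.final.out file into {label: value}."""
    stats = {}
    for line in text.splitlines():
        # Section titles such as "UNIQUE READS:" have no separator and are skipped.
        if "|" not in line:
            continue
        label, value = line.split("|", 1)
        stats[label.strip()] = value.strip()
    return stats


# Usage (hypothetical file name):
#   with open("Sample-Log.final.out") as handle:
#       stats = parse_star_log(handle.read())
#   print(stats.get("Uniquely mapped reads %"))
```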
Genomic Alignment Pipeline Results
FASTQ files generated as described above are aligned to genomes available via Illumina's iGenome using the BWA-MEM aligner. This pipeline results in the following file types:
*.maxL.bam - The top/best non-filtered alignment for each read in the widely accepted BAM format (a binary version of the SAM format) for each sample.
*.merged.bam - All alignments (including best and multiple alignments) for each read in the widely accepted BAM format (a binary version of the SAM format) for each sample.
*.maxL.bam.bai, *.merged.bam.bai - Index files (.bai files) for each sample, which allow for easier viewing of the bam files in genome browsers such as the UCSC Genome Browser or IGV.
*-metrics.log - Summary metrics such as adapter trimming and alignment rates.
ERRBS Pipeline Results
We process bisulfite sequencing data using an in-house BisSeq pipeline (Garrett-Bakelman F., Sheridan, C. et al. 2015) that generates 12 files for each sample:
methylcall.CpG.Sample.1x.txt.gz
methylcall.CHG.Sample.1x.txt.gz
methylcall.CHH.Sample.1x.txt.gz
cgunits.Sample.1x.txt.gz
The *.1x.txt.gz files contain all reported sites for the CpG, CHG, and CHH contexts. The minimum read coverage cutoff for these files is 1.
The cgunits file contains the reported sites for the CpG context, where consecutive CpGs on the forward and reverse strands have been combined into one site, or CpG unit.
These are tab-delimited text files that contain the locations of Cs in the CpG, CHG, or CHH context and their methylation levels.
The column headers are as follows:
- chrBase = This is the name (chromosome.base location)
- chr = chromosome on which the methylated base is located
- base = location of methylated base on the chromosome
- strand = forward strand (F) or reverse strand (R)
- coverage = read coverage
- freqC = % methylated
- freqT = % unmethylated
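A minimal sketch of reading one of these files into typed records is given below. It assumes that the first line of the file is a header row carrying exactly the column names listed above; the function name is hypothetical.

```python
import csv
import gzip


def read_methylcalls(path):
    """Yield one dict per site from a gzipped, tab-delimited methylcall file.

    Assumes the first line is a header with the columns
    chrBase, chr, base, strand, coverage, freqC, freqT.
    """
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            yield {
                "chrBase": row["chrBase"],
                "chr": row["chr"],
                "base": int(row["base"]),
                "strand": row["strand"],          # F or R
                "coverage": int(row["coverage"]),
                "freqC": float(row["freqC"]),     # % methylated
                "freqT": float(row["freqT"]),     # % unmethylated
            }


# Usage (file name follows the pattern above):
#   for site in read_methylcalls("methylcall.CpG.Sample.1x.txt.gz"):
#       print(site["chr"], site["base"], site["freqC"])
```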
cpg.Sample.10x.txt.gz
chg.Sample.10x.txt.gz
chh.Sample.10x.txt.gz
The *.10x.txt.gz files contain sites with at least 10x coverage for the CpG, CHG, and CHH contexts; the minimum read coverage cutoff for these files is 10.
These are tab-delimited text files that contain the locations of Cs in the CpG, CHG, or CHH context and their methylation levels.
The column headers are the same as above.
Sample_CpG.wig - This is a wiggle file that allows the display of the methylation levels of CpG sites at their location in the genome in a track format for uploading into a genome browser such as the UCSC genome browser or Broad Institute's IGV.
Sample.bam - This file contains the complete alignments in binary (BAM) format as output by the Bismark bisulfite mapper.
Sample.bedGraph.gz - This file contains methylation information for individual cytosines as output by the Bismark methylation extractor.
It is a sorted bedGraph file that reports the position of a given cytosine and its methylation state.
The bedGraph output is tab delimited with 0-based start coordinates and 1-based end coordinates. It begins with the header line track type=bedGraph, and the columns are:
chromosome
start position
end position
methylation percentage
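Because the start coordinate is 0-based and the end coordinate is 1-based, a single cytosine at 1-based position p appears as start = p - 1 and end = p. The sketch below parses one bedGraph line on that basis and skips the track header. The function name and the returned field names are hypothetical.

```python
def parse_bedgraph_line(line):
    """Parse one bedGraph data line; return None for header or blank lines."""
    if not line.strip() or line.startswith(("track", "browser", "#")):
        return None
    chromosome, start, end, percentage = line.rstrip("\n").split("\t")[:4]
    return {
        "chromosome": chromosome,
        # Convert the 0-based start to a 1-based position (equal to `end`
        # for a single-base interval).
        "position": int(start) + 1,
        "methylation_pct": float(percentage),
    }
```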
Sample.bismark.cov.gz - This file contains methylation information for individual cytosines as output by the Bismark methylation extractor.
The coverage output is tab delimited with 1-based genomic coordinates, and the columns are:
chromosome
start position
end position
methylation percentage
count methylated
count non-methylated
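The two count columns make it possible to recover the read coverage of each cytosine (methylated plus non-methylated counts) and to recompute the percentage as 100 x methylated / coverage. A minimal sketch with hypothetical function and field names:

```python
def parse_cov_line(line):
    """Parse one line of a Bismark coverage file and derive the read coverage."""
    chromosome, start, end, percentage, methylated, unmethylated = (
        line.rstrip("\n").split("\t")
    )
    methylated, unmethylated = int(methylated), int(unmethylated)
    coverage = methylated + unmethylated
    return {
        "chromosome": chromosome,
        "position": int(start),  # 1-based; start and end are the same base
        "coverage": coverage,
        "reported_pct": float(percentage),
        # Recomputed from the counts; guarded in case a line has zero coverage.
        "recomputed_pct": 100.0 * methylated / coverage if coverage else 0.0,
    }
```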
Sample_summary.txt - This file summarizes adapter trimming, alignment information, and mapping efficiency of the sample against the genome. It also reports the conversion rate and methylation statistics based on all reported CpG sites (cutoff of 10x), such as the average and median conversion rates and the number of CpGs covered.
For your convenience, we have also provided a Perl script that allows you to filter, line by line, a given methylcall file for a given coverage.
[filterMethylcall.pl] - Please right-click and save the link as a file.
For example, the following command will filter the methylcall.CHH.Sample.mincov0.txt for sites >= 10x coverage:
filterMethylcall.pl methylcall.CHH.Sample.mincov0.txt 10
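For readers who prefer Python, the same kind of line-by-line coverage filter can be sketched as follows. This is not the implementation of filterMethylcall.pl; it is an illustrative equivalent that assumes coverage is the fifth tab-delimited column (as in the column list above) and passes through any line whose coverage field is not a number, such as a header row.

```python
def filter_by_coverage(lines, min_coverage, coverage_column=4):
    """Yield lines whose coverage is >= min_coverage, one line at a time.

    `coverage_column` is the 0-based index of the coverage field (assumed 4).
    Lines without a numeric coverage field (e.g. a header) are kept unchanged.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        value = fields[coverage_column] if len(fields) > coverage_column else ""
        if not value.isdigit():
            yield line
        elif int(value) >= min_coverage:
            yield line


# Usage (hypothetical output file name):
#   with open("methylcall.CHH.Sample.mincov0.txt") as src, open("filtered.txt", "w") as dst:
#       dst.writelines(filter_by_coverage(src, 10))
```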
https://abc.med.cornell.edu/pubshare/ [Currently, data from July 2013 onwards is available in PubShare. However, we cannot guarantee continued access to your data more than 2 years after its release.]
Note: Large .bam files may not download successfully when using older versions of web browsers on 32-bit operating systems, such as Windows XP.
To copy data from this webpage via UNIX shell (example: to your server home or data directory) you can use the provided wget commands.
Instructions for wget:
Just as when you use a web browser to download the tracks, you need to provide a username and password for the download with wget. You can generate a one-line wget command using the "Links" button in the cart.
[Please note that any .wgetrc files that you may have previously created may interfere with this wget command so you may need to rename these files.]
Alternatively, you can create a text file listing the links for the samples you wish to download (for example: links.txt) and use:
wget --no-check-certificate --content-disposition --ask-password --user=youremail@your.institution.edu -i links.txt
(You can get the download link for a sample by right-clicking the sample link in a browser and selecting the option to copy the link.)
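If you script your downloads, the links file and the wget invocation above can be assembled programmatically. The sketch below uses only the wget options shown in the command above. The helper names are hypothetical, and the password is still prompted for interactively by wget.

```python
from pathlib import Path


def write_links_file(urls, path="links.txt"):
    """Write one download link per line, as expected by `wget -i`."""
    Path(path).write_text("".join(f"{url}\n" for url in urls))
    return path


def wget_command(user, links_file="links.txt"):
    """Build the wget argument list used above for a given user and links file."""
    return [
        "wget",
        "--no-check-certificate",
        "--content-disposition",
        "--ask-password",
        f"--user={user}",
        "-i",
        str(links_file),
    ]


# Usage:
#   import subprocess
#   write_links_file(copied_links)  # links copied from the browser
#   subprocess.run(wget_command("youremail@your.institution.edu"), check=True)
```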
You can view your bigBed (s_N.nh.bb) and bigWig (s_N.nh.overlap.bw) files in the UCSC genome browser using the following steps:
- Open your browser and go to the "add custom tracks" page: http://genome.ucsc.edu/cgi-bin/hgCustom
- In the input box where it says "Paste URLs or data:" edit and paste the following:
track type=bigWig name="some_meaningful_name_here" description="sample_description_here" bigDataUrl=http://epicore.med.cornell.edu/pubshare/more_url_here/s_N.nh.overlap.bw visibility=full color=128,0,255
You can get the URL for your file by right-clicking it and copying the link address. Copy and paste the whole URL in the bigDataUrl part of the custom track description. If you have several tracks, it may be useful to assign meaningful names and descriptions in the appropriate places as well.
- Click the submit button next to the input box.
- A link should be generated with the name you indicated in the input box in the custom tracks table.
- Click on the "go to genome browser" button next to this link. This should bring you to your wiggles in the UCSC genome browser.
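When several tracks need to be added, the custom track line from the steps above can be generated rather than edited by hand. The sketch below reproduces the layout of that line; the function name, its defaults, and the example URL in the usage comment are assumptions for illustration.

```python
def bigwig_track_line(name, description, big_data_url,
                      visibility="full", color=(128, 0, 255)):
    """Build a UCSC custom track line for a bigWig file."""
    rgb = ",".join(str(component) for component in color)
    return (
        f'track type=bigWig name="{name}" description="{description}" '
        f"bigDataUrl={big_data_url} visibility={visibility} color={rgb}"
    )


# Usage (hypothetical URL; paste the link address copied from your own file):
#   print(bigwig_track_line("sample1", "Sample 1 overlap",
#                           "http://example.org/s_1.nh.overlap.bw"))
```

Each generated line can be pasted on its own line into the "Paste URLs or data:" box.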
Analysis Resources
- Applied Bioinformatics Core
- WCM Library Bioinformatics Service
- methylKit and eDMR (Differential methylation analysis)
- ChIPseeqer (ChIP-Seq analysis)
- GobyWeb (NGS analysis)
- Galaxy (NGS analysis)
- featureCounts (Read summarization)
- DESeq (RNA-Seq analysis)
- A survey of best practices for RNA-seq data analysis (RNA-Seq analysis review paper)
- bedtools (genome analysis)
- IGV (Integrative Genomics Viewer)
- UCSC genome browser
- FastQC (Quality control)
- FASTX Toolkit (Quality control)
- Computational Tools at Broad Institute
- Cytoscape (Network visualization)
- BaseSpace (Illumina)
- Epigenome Roadmap (Nature papers)
- SEQanswers (NGS related forum)
- SEQanswers Wiki and Software Hub
- Statistics for biologists (Nature.com collection)