Description
These tracks display the level of sequence uniqueness of the reference GRCh37/hg19
genome assembly. They were generated using different window sizes, and high signal
will be found in areas where the sequence is unique.
Display Conventions and Configuration
This track is a multi-view composite track that contains multiple data types
organized into separate views. For each view, there are
multiple subtracks representing different sequence lengths or methods of preparation.
Instructions for configuring multi-view tracks are
here.
Mappability tracks consist of the following views:
- Alignability: These tracks provide a measure of how often the sequence found at a particular
location will align within the whole genome. Unlike measures of uniqueness, alignability
tolerates up to 2 mismatches. These tracks are in the form of signals ranging from
0 to 1 and have several configuration options.
- Uniqueness: These tracks are a direct measure of sequence uniqueness throughout the reference
genome. These tracks are in the form of signals ranging from 0 to 1 and have several
configuration options.
- Blacklisted Regions: Both tracks of blacklisted regions attempt to identify regions of the reference
genome that are troublesome for high-throughput sequencing aligners. Troublesome
regions may be due to repetitive elements or other anomalies. Each track contains a
set of regions of varying length with no special configuration options.
Methods
Alignability
The CRG Alignability tracks display how uniquely k-mer sequences align
to a region of the genome. The data were generated with the GEM-mappability
program. The method is equivalent to mapping sliding windows of k-mers
(where k was set to 36, 40, 50, 75 or 100 nt to produce these tracks)
back to the genome using the GEM mapper aligner (up to 2 mismatches were
allowed in this case). For each window, a mappability score was computed as
S = 1/(number of matches found in the genome): S=1 means one match in the
genome, S=0.5 means two matches in the genome, and so on. The
CRG Alignability tracks were
generated independently of the ENCODE project, in the framework of the GEM
(GEnome Multitool) project.
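As a rough illustration of the score defined above, the following Python sketch computes S for every k-mer window of a toy sequence by brute force. It is not the GEM-mappability algorithm: the function names, the Hamming-distance comparison, and the restriction to the forward strand are assumptions made for this example only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def alignability(genome, k, max_mismatches=2):
    """Return S = 1 / (number of matches) for each k-mer window.

    A window matches another window when they differ at no more than
    max_mismatches positions. Every window matches itself, so the
    count is always at least 1 and S lies in (0, 1].
    Assumption: only the forward strand of `genome` is searched.
    """
    n = len(genome) - k + 1
    windows = [genome[i:i + k] for i in range(n)]
    scores = []
    for w in windows:
        matches = sum(1 for other in windows
                      if hamming(w, other) <= max_mismatches)
        scores.append(1.0 / matches)
    return scores


if __name__ == "__main__":
    # Toy sequence; real tracks use k = 36, 40, 50, 75 or 100.
    for pos, s in enumerate(alignability("ACGTACGT", k=4, max_mismatches=0)):
        print(pos, s)
```

The brute-force comparison is quadratic in the number of windows, so it is only practical for very short sequences.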
Uniqueness
The Duke Uniqueness tracks display how unique each sequence is on the
positive strand starting at a particular base and of a particular length.
Thus, the 20 bp track reflects the uniqueness of all 20 base sequences with
the score being assigned to the first base of the sequence. Scores are
normalized to between 0 and 1, with 1 representing a completely unique sequence
and 0 representing a sequence that occurs more than 4 times in the genome
(excluding chrN_random and alternative haplotypes). A score of 0.5
indicates the sequence occurs exactly twice, likewise 0.33 for three times
and 0.25 for four times. The Duke Uniqueness tracks were generated
for the ENCODE project as tools in the development of the
Open Chromatin: DNaseI HS, FAIRE, TFBS and Synthesis tracks.
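The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration on a toy sequence, not the pipeline used to build the tracks: the function name and the `max_copies` parameter are hypothetical, and the sketch counts exact occurrences on the forward strand only.

```python
from collections import Counter


def uniqueness(genome, length, max_copies=4):
    """Score each start position by how often its sequence occurs.

    The score is 1 / count when the sequence occurs at most
    max_copies times, and 0 otherwise. The score is assigned to
    the first base of the sequence, as in the track description.
    Assumption: exact matches on the forward strand only.
    """
    n = len(genome) - length + 1
    counts = Counter(genome[i:i + length] for i in range(n))
    scores = []
    for i in range(n):
        c = counts[genome[i:i + length]]
        scores.append(1.0 / c if c <= max_copies else 0.0)
    return scores


if __name__ == "__main__":
    print(uniqueness("ACGTACGT", length=4))
    print(uniqueness("AAAAAA", length=2))
```

A sequence that occurs three times gets 1/3 here, which the track description rounds to 0.33.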
Blacklisted Regions
The DAC Blacklisted Regions aim to identify a comprehensive set of
regions in the human genome that have anomalous, unstructured, high
signal/read counts in next-generation sequencing experiments, independent of
cell line and type of experiment. In total, 80 open chromatin tracks
(DNase and FAIRE datasets) and 20 ChIP-seq input/control tracks spanning
~60 human tissue types/cell lines were used to identify these regions
with signal artifacts. These regions tend to have a very high ratio of
multi-mapping to unique mapping reads and high variance in mappability.
Some of these regions overlap pathological repeat elements such as satellite,
centromeric and telomeric repeats. However, simple mappability-based filters
do not account for most of these regions. Hence, it is recommended to use this
blacklist alongside mappability filters. The
DAC Blacklisted Regions track was generated for the ENCODE project.
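One way to combine a blacklist with a mappability filter, as recommended above, is sketched below in Python. The sketch assumes BED-style half-open coordinates (start inclusive, end exclusive). The peak tuple layout `(chrom, start, end, mappability)`, the function names, and the `min_mappability` threshold are all hypothetical choices for illustration, not part of the track or of any ENCODE tool.

```python
from bisect import bisect_right


def build_blacklist_index(regions):
    """Group (chrom, start, end) regions by chromosome and merge overlaps.

    Returns {chrom: (starts, ends)} with sorted, disjoint intervals.
    """
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def is_blacklisted(index, chrom, start, end):
    """True if the half-open interval [start, end) overlaps the blacklist."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last blacklisted interval that begins before `end`.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def filter_peaks(peaks, index, min_mappability=0.5):
    """Keep peaks that pass the mappability threshold and avoid the blacklist."""
    return [p for p in peaks
            if p[3] >= min_mappability
            and not is_blacklisted(index, p[0], p[1], p[2])]


if __name__ == "__main__":
    blacklist = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
    peaks = [("chr1", 10, 60, 1.0), ("chr1", 120, 180, 1.0), ("chr2", 60, 90, 0.25)]
    print(filter_peaks(peaks, build_blacklist_index(blacklist)))
```

Merging the blacklist first keeps the intervals disjoint, so a single binary search per query is enough to detect an overlap.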
The Duke Excluded Regions track displays genomic regions for
which mapped sequence tags were filtered out before signal generation
and peak calling for Open Chromatin: DNaseI HS and FAIRE tracks.
This track contains problematic regions for short sequence tag signal
detection (such as satellites and rRNA genes). The
Duke Excluded Regions track was generated for the ENCODE project.
Release Notes
This is Release 3 (October 2011) of this track, which now includes the DAC
Blacklisted regions, Duke Uniqueness and Duke Excluded regions.
Credits
The CRG Alignability track was created by Thomas Derrien and Paolo Ribeca
in Roderic Guigo's lab at the Centre for Genomic
Regulation (CRG), Barcelona, Spain. Thomas Derrien was supported by funds from NHGRI
for the ENCODE project, while Paolo Ribeca was funded by a Consolider grant
CDS2007-00050 from the Spanish Ministerio de Educación y Ciencia.
The Duke Uniqueness and Duke Excluded Regions tracks were created
by Terry Furey and Debbie Winter at Duke University's
Institute for Genome Sciences & Policy (IGSP), and Stefan Graf at the
University of Cambridge, Department of Oncology and CR-UK Cambridge Research Institute (CRI).
We thank NHGRI for ENCODE funding support.
The DAC Blacklisted Regions were created by Anshul Kundaje
at Stanford University in the labs of Batzoglou and Sidow,
in cooperation with Ewan Birney at the
European Bioinformatics Institute (EBI).
We thank NHGRI for ENCODE funding support. (Contact: Anshul Kundaje).
References
Derrien T, Estelle J, Marco Sola S, Knowles DG, Raineri E, Guigo R, Ribeca P.
Fast computation and applications of genome mappability.
PLoS One. 2012;7(1):e30377.
Data Release Policy
Data users may freely use all data in this track. ENCODE labs that
contributed annotations have exempted the data displayed here from the
ENCODE data release policy restrictions.