Description
This track is produced as part of the ENCODE Transcriptome Project.
It shows the starts and ends of DNA fragments from different cell lines,
determined by paired-end ditag (PET) sequencing using different DNA
fragment sizes, for analysis of genome structural variation.
Display Conventions and Configuration
In the graphical display, the ends are represented by blocks connected by a
horizontal line. In full and packed display modes, the arrowheads on the
horizontal line represent the strand, and an ID of the
format XXXXX-N-M is shown to the left of each PET, where X
is the unique ID for each PET, N indicates the number of mapping
locations in the genome (1 for a single mapping location, 2 for two mapping
locations, and so forth), and M is the number of PET sequences at
this location. PETs that mapped to multiple locations may represent low
complexity or repetitive sequences.
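As a minimal sketch of how the XXXXX-N-M label could be read programmatically, the Python below splits an ID into its three parts. The names `PetId`, `parse_pet_name`, and `is_multi_mapped` are hypothetical helpers introduced for illustration, and splitting from the right (so that the unique ID may itself contain hyphens) is an assumption, not something this page specifies.

```python
from typing import NamedTuple


class PetId(NamedTuple):
    pet_id: str       # XXXXX: unique ID of the PET
    n_locations: int  # N: number of mapping locations in the genome
    n_pets: int       # M: number of PET sequences at this location


def parse_pet_name(name: str) -> PetId:
    """Split an ID of the form XXXXX-N-M into its three fields."""
    # Assumption: split from the right so a hyphen inside XXXXX is tolerated.
    pet_id, n, m = name.rsplit("-", 2)
    return PetId(pet_id, int(n), int(m))


def is_multi_mapped(name: str) -> bool:
    """True when the PET maps to more than one genomic location (N > 1)."""
    return parse_pet_name(name).n_locations > 1
```

A filter such as `is_multi_mapped` could be used to set aside PETs that may come from low complexity or repetitive sequence.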
To show only selected subtracks, uncheck the boxes next to the tracks that
you wish to hide.
The query sequences in the SAM/BAM alignment representation
are normalized to the + strand of the reference genome
(see the SAM Format Specification
for more information on the SAM/BAM file format). If a query sequence was
originally the reverse of what has been stored and aligned, it will have the
following flag:
(0x10) Read is on '-' strand.
BAM/SAM alignment representations also have tags. The following tags are associated with this track: RG, CQ, CS, and MD.
Mapping quality is not available for this track and so, in accordance with the
SAM Format Specification,
a score of 255 is used.
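The flag and mapping-quality conventions above can be illustrated with a short Python sketch that reads one tab-separated SAM alignment line. The field positions follow the SAM Format Specification (QNAME, FLAG, RNAME, POS, MAPQ, ...); the function names and the choice to report an unavailable mapping quality as `None` are assumptions made for this example.

```python
REVERSE_STRAND_FLAG = 0x10  # SAM flag bit: read is on the '-' strand
MAPQ_UNAVAILABLE = 255      # SAM convention: mapping quality not available


def strand_from_flag(flag: int) -> str:
    """Return '-' if the 0x10 bit is set in the SAM flag, otherwise '+'."""
    return "-" if flag & REVERSE_STRAND_FLAG else "+"


def parse_sam_line(line: str) -> dict:
    """Extract name, position, strand, and mapping quality from a SAM line."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    mapq = int(fields[4])
    return {
        "name": fields[0],
        "chrom": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost mapping position
        "strand": strand_from_flag(flag),
        # Hypothetical convention for this sketch: report 255 as None.
        "mapq": None if mapq == MAPQ_UNAVAILABLE else mapq,
    }
```

Because every record in this track carries a score of 255, code that filters on mapping quality would need to treat that value as "unknown" rather than as a high score.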
Methods
Sample genomic DNA was isolated, hydrosheared to a given size range, ligated
with a specific DNA linker sequence at both ends, and then gel-selected for
the desired size (e.g., 1 kb or 10 kb).
The DNA fragments modified with linker at both ends (e.g., 10 kb) were then
circularized by ligation, followed by restriction digest with enzyme EcoP15I to
generate DNA PETs (a 25 bp tag from each end). The PETs were ligated with SOLiD
sequencing adaptors at both ends, then amplified by PCR and purified as complex
templates for high-throughput DNA sequencing. Most of the DNA PET data sets
submitted so far were generated on the SOLiD platform.
Cells were grown according to the approved
ENCODE cell culture
protocols.
Data:
Reads of DNA PETs were mapped onto the reference genome (GRCh37, hg19), excluding the mitochondrion, haplotypes, randoms, and chromosome Y. The majority of the PETs mapped to the same chromosome, in the correct orientation, and within the expected span (e.g., a 10 kb DNA PET was expected to map across a span of ~10 kb). A small portion of PETs, called discordant PETs, mapped too far from each other, in the wrong orientation, or to different chromosomes, indicating structural variation between the sample and the reference genome. Such variation could be due to deletion, inversion, tandem repeats, translocation, fusion, etc.
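The distinction between concordant and discordant PETs can be sketched in Python as follows. This is an illustration only: the `Tag` type, the `max_span` threshold, and the `same_strand_expected` switch are hypothetical parameters, since this page does not state the exact span tolerance or orientation rule used by the pipeline.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    chrom: str   # chromosome the tag mapped to
    pos: int     # mapping position of the tag
    strand: str  # '+' or '-'


def classify_pet(tag_a: Tag, tag_b: Tag, max_span: int,
                 same_strand_expected: bool = True) -> str:
    """Label a PET as concordant, or give the reason it is discordant."""
    if tag_a.chrom != tag_b.chrom:
        return "discordant: different chromosomes"
    # Assumption: the caller supplies the orientation expected for the library.
    if (tag_a.strand == tag_b.strand) != same_strand_expected:
        return "discordant: wrong orientation"
    # Assumption: max_span is the largest span still accepted for the library,
    # e.g. somewhat above 10,000 for a 10 kb DNA PET library.
    if abs(tag_b.pos - tag_a.pos) > max_span:
        return "discordant: span too large"
    return "concordant"
```

Checking the chromosome first keeps the span comparison meaningful, because positions on different chromosomes cannot be subtracted.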
Mapping parameters:
Mapping was done using Applied Biosystems' SOLiD alignment and pairing pipeline. The ungapped alignment is done in color space. A seed-and-extend strategy is adopted: an initial seed of length 25 is mapped with a maximum of 2 mismatches and then extended to the read length. Each color space match is awarded a score of +1 and each mismatch a penalty of -2.
Read Score = (read length - # of mismatches) - 2 * (# of mismatches)
After extension, each read is trimmed to the length that gives its maximum score; if several lengths give the same maximum score, the shortest is used.
The color space sequences are then converted into base space and checked to ensure that each sequence has a maximum of 2 base pair mismatches. If any sequence has more than 2 mismatches, then that pair is discarded. The final output is converted into SAM/BAM format.
Verification
Representative structural variations identified by DNA PET data have been
verified by targeted PCR and sequencing analysis to confirm the predicted
rearrangement sites. Some of them have also been validated by FISH.
Credits
The GIS DNA PET libraries and sequence data for genome structural
variation analysis were produced at the
Genome Institute of
Singapore.
The data were mapped and analyzed by scientists Xiaoan Ruan, Atif Shahab,
Chialin Wei, and Yijun Ruan at the Genome Institute of Singapore.
Contact:
Yijun Ruan (now at The Jackson Laboratory)
Data Release Policy
Data users may freely use ENCODE data, but may not, without prior
consent, submit publications that use an unpublished ENCODE dataset until
nine months following the release of the dataset. This date is listed in
the Restricted Until column, above. The full data release policy
for ENCODE is available
here.