Description
These tracks show the one-to-one v1_nfLO alignments of the GRCh37/hg19 assembly to the
T2T-CHM13 v2.0 assembly.
Display Conventions
The track displays boxes joined together by either single or double lines.
The boxes represent aligning regions, single lines indicate gaps that
are largely due to a deletion in the CHM13 v2.0 assembly or an insertion in
GRCh37/hg19, and double lines represent more complex gaps that involve
substantial sequence in both assemblies.
Methods
Alignment and Chain Creation
For the minimap2-based pipeline, the initial chain file was generated using
nf-LO v1.5.1 with
minimap2 v2.24 alignments.
These chains were then split at all locations that contained unaligned segments greater
than 1 kbp or gaps greater than 10 kbp. Split chain files were then converted to PAF format
with extended CIGAR strings using chaintools (v0.1),
and alignments between nonhomologous chromosomes were removed. The trim-paf operation of
rustybam (v0.1.29) was next used to remove overlapping
alignments in the query sequence, and then the target sequence, to create 1:1 alignments.
PAF alignments were converted back to the chain format with paf2chain commit f68eeca, and
finally, chaintools was used to generate the inverted chain file.
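The chain-splitting step can be illustrated with a short sketch. It assumes the UCSC chain block layout (size, dt, dq), where dt and dq are the gap lengths in the target and query after each block. The function name split_blocks and the single max_gap threshold are hypothetical simplifications: the sketch does not reproduce how chaintools distinguishes the 1 kbp and 10 kbp thresholds, and it does not recompute the start and end coordinates that each new chain header would need.

```python
def split_blocks(blocks, max_gap):
    """Split one chain's blocks wherever a gap exceeds max_gap.

    blocks: list of (size, dt, dq) tuples in UCSC chain order; the last
    block of a chain carries dt = dq = 0.
    Returns a list of sub-chains, each a list of (size, dt, dq) tuples.
    """
    pieces, current = [], []
    for size, dt, dq in blocks:
        if max(dt, dq) > max_gap:
            # Close the current piece here: its last block has no trailing gap.
            current.append((size, 0, 0))
            pieces.append(current)
            current = []
        else:
            current.append((size, dt, dq))
    if current:
        pieces.append(current)
    return pieces


# Hypothetical blocks: the 20,000 bp target gap triggers a split.
example = [(100, 5, 0), (200, 20000, 3), (50, 0, 0)]
pieces = split_blocks(example, max_gap=10000)
```

In this example the chain is cut into two pieces, and the large gap itself is discarded rather than assigned to either piece.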
Full commands with parameters used were:
nextflow run main.nf --source GRCh37.fa --target chm13v2.0.fasta --outdir dir -profile local --aligner minimap2
python chaintools/src/split.py -c input.chain -o input-split.chain
python chaintools/src/to_paf.py -c input-split.chain -t target.fa -q query.fa -o input-split.paf
awk '$1==$6' input-split.paf | rb break-paf --max-size 10000 | rb trim-paf -r | rb invert | rb trim-paf -r | rb invert > out.paf
paf2chain -i out.paf > out.chain
python chaintools/src/invert.py -c out.chain -o out_inverted.chain
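The awk '$1==$6' filter in the commands above keeps only PAF records whose query name (column 1) equals the target name (column 6), which is how alignments between nonhomologous chromosomes are removed. A minimal Python sketch of the same filter follows, for illustration only. The function name same_chromosome is hypothetical and is not part of any tool named here, and the sketch assumes tab-delimited PAF lines.

```python
def same_chromosome(paf_lines):
    """Yield PAF lines whose query name equals the target name."""
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        # PAF column 1 is the query name, column 6 is the target name.
        if len(fields) >= 6 and fields[0] == fields[5]:
            yield line


# Two made-up records: only the chr1-to-chr1 alignment is kept.
records = [
    "chr1\t1000\t0\t500\t+\tchr1\t2000\t100\t600\t480\t500\t60",
    "chr1\t1000\t500\t900\t+\tchr7\t3000\t50\t450\t390\t400\t60",
]
kept = list(same_chromosome(records))
```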
The above process does not add chain IDs or scores. The UCSC utilities
chainMergeSort and chainScore are used to update the chains:
chainMergeSort out.chain | chainScore stdin chm13v2.0.2bit hg19.2bit chm13v2.0-hg19.chain
chainMergeSort out_inverted.chain | chainScore stdin hg19.2bit chm13v2.0.2bit hg19-chm13v2.0.chain
Rustybam trim-paf
uses dynamic programming and the CIGAR string to find an optimal
splitting point between overlapping alignments in the query sequence. It
starts its trimming with the largest overlap and then recursively trims
smaller overlaps.
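The idea of choosing an optimal split point inside an overlap can be sketched as follows. This is a simplified illustration, not rustybam's implementation: it assumes each alignment has already been reduced to one score per overlapping query position (for example +1 for a match and -1 for a mismatch or indel, derived from the CIGAR string), and the function name best_split and the scoring scheme are hypothetical.

```python
def best_split(a_scores, b_scores):
    """Find the cut k that maximizes sum(a_scores[:k]) + sum(b_scores[k:]).

    Alignment A keeps overlap positions [0, k) and alignment B keeps
    positions [k, n). Returns (k, total_score).
    """
    if len(a_scores) != len(b_scores):
        raise ValueError("both alignments must cover the same overlap")
    total = sum(b_scores)  # k = 0: B keeps the whole overlap
    best_k, best_total = 0, total
    for k in range(1, len(a_scores) + 1):
        # Moving the cut right hands position k-1 from B to A.
        total += a_scores[k - 1] - b_scores[k - 1]
        if total > best_total:
            best_k, best_total = k, total
    return best_k, best_total


# A matches well at the start of the overlap, B at the end.
a = [1, 1, -1, -1]
b = [-1, -1, 1, 1]
cut, score = best_split(a, b)
```

A single left-to-right pass with a running total evaluates every possible cut, so the sketch is linear in the overlap length. Ties are resolved in favor of the leftmost cut.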
Results were validated by using chaintools to confirm that there were no
overlapping sequences with respect to both CHM13v2.0 and GRCh37 in the
released chain file. In addition, trimmed alignments were visually inspected
with SafFire to confirm their quality.
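A check for overlapping sequence of the kind described above can be sketched in a few lines. This is an illustrative stand-in rather than the chaintools code: it assumes the aligned blocks have already been extracted as half-open (name, start, end) intervals for one assembly, and the function name find_overlaps is hypothetical.

```python
from collections import defaultdict


def find_overlaps(intervals):
    """Return (name, earlier_span, later_span) for every overlap found.

    intervals: iterable of (name, start, end) with half-open coordinates.
    An empty result means the intervals are pairwise non-overlapping.
    """
    by_name = defaultdict(list)
    for name, start, end in intervals:
        by_name[name].append((start, end))

    overlaps = []
    for name, spans in by_name.items():
        spans.sort()
        furthest = spans[0]  # span with the largest end seen so far
        for span in spans[1:]:
            if span[0] < furthest[1]:
                overlaps.append((name, furthest, span))
            if span[1] > furthest[1]:
                furthest = span
    return overlaps


# Made-up intervals: adjacent blocks are fine, chr2 has a 10 bp overlap.
blocks = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 50), ("chr2", 40, 60)]
problems = find_overlaps(blocks)
```

Running the same check once on target coordinates and once on query coordinates covers both assemblies.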
Chains were swapped to make GRCh37/hg19 the target.
Credits
The v1_nfLO chains were generated by Nae-Chyun Chen and Mitchell Vollger.
References
Nurk S, Koren S, Rhie A, Rautiainen M, et al. The complete sequence of a human genome. bioRxiv, 2021.