Centromeres Track Settings
 
Centromere Locations

Assembly: Human Dec. 2013 (GRCh38/hg38)
Data last updated at UCSC: 2014-01-09

Description

This track indicates the location of the centromere sequences. Centromeres are specialized chromatin structures that are required for cell division. These genomic regions are normally defined by long tracts of tandem repeats, or satellite DNA, that contain a limited number of sequence differences to distinguish the linear order of repeat copies. Because of their size and repetitive nature, these regions are typically not represented in reference assemblies. Unlike all previous versions of the human reference assembly, in which the centromere regions were represented by a multi-megabase gap, GRCh38 incorporates centromere reference models that provide an initial genomic description derived from chromosome-assigned whole genome shotgun (WGS) read libraries of alpha satellite.

Each reference model provides an approximation of the true array sequence organization. Although the long-range repeat ordering is not expected to represent the true organization, the submissions are expected to provide a biologically rich description of array variants and local-monomer organization as observed in the initial WGS read dataset. As a result, these sequences serve as a useful mapping target to extend sequence-based studies to sites previously omitted from the human reference genome.

Methods

The sequences are generated from second-order Markov models of monomer variants and from graphical models of larger-scale higher-order repeats. The graphical models are based on an analysis of Sanger reads from the HuRef sequencing project (Assembly GCA_000002125.1; BioProject PRJNA19621), and their local ordering is supported by observed same-read monomer adjacencies. The Markov models are generated by linearSat, a program written for this project that also generates a linear representation of monomer order. The program generates a second-order Markov chain sized to a given array, with the array size provided by sequence coverage normalization estimates. The sequence definitions of transposable element insertions are limited to the sequences directly adjacent to alpha satellite within the read database, and incomplete representations are marked with an adjacent 100 bp gap. In total, these sequences, which were used to assemble the centromeric regions of the human chromosomes, provide a more complete reference of the sequence composition and higher-order repeat variation inherent to a given alpha satellite array.
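
As a rough illustration of what a second-order Markov model of monomer order means, the Python sketch below counts which monomer follows each ordered pair of monomers in a set of reads, then walks those counts to emit a linear ordering of a requested length. It is not the linearSat implementation. The function names, the monomer labels (m1 to m4), the toy reads, and the target length of 12 monomers are all hypothetical. In the method described above the target size would come from sequence coverage normalization estimates, whereas here it is simply passed in.

```python
import random
from collections import Counter, defaultdict


def build_second_order_model(reads):
    """Count how often each monomer follows each ordered pair of monomers.

    `reads` is a list of reads, each a list of monomer labels in read order
    (hypothetical input format for this sketch).
    """
    transitions = defaultdict(Counter)
    for read in reads:
        for a, b, c in zip(read, read[1:], read[2:]):
            transitions[(a, b)][c] += 1
    return transitions


def generate_array(transitions, start, n_monomers, seed=0):
    """Walk the chain from a starting pair until n_monomers are emitted.

    The next monomer is drawn in proportion to how often it was observed
    after the current pair. The walk stops early if the current pair was
    never followed by anything in the reads.
    """
    rng = random.Random(seed)
    order = list(start)
    while len(order) < n_monomers:
        counts = transitions.get((order[-2], order[-1]))
        if not counts:
            break  # dead end: no observed successor for this pair
        monomers = list(counts)
        weights = [counts[m] for m in monomers]
        order.append(rng.choices(monomers, weights=weights, k=1)[0])
    return order[:n_monomers]


# Toy reads: same-read monomer adjacencies, invented for illustration only.
reads = [
    ["m1", "m2", "m3", "m1", "m2"],
    ["m2", "m3", "m1", "m2", "m3"],
    ["m3", "m1", "m2", "m4", "m1", "m2"],
]

model = build_second_order_model(reads)
array = generate_array(model, ("m1", "m2"), n_monomers=12)
print(" ".join(array))
```

Every three-monomer window in the generated ordering was observed in at least one read, so local adjacencies are supported by the data even though the long-range order is only one sample from the model. This matches the caveat in the Description that long-range repeat ordering is not expected to represent the true organization.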

Credits

The data for this track was supplied by Karen Miga.

References

Miga KH, Newton Y, Jain M, Altemose N, Willard HF, Kent WJ. Centromere reference models for human chromosomes X and Y satellite arrays. Genome Res. 2014 Apr;24(4):697-707. PMID: 24501022; PMC: PMC3975068