Description
This track is based on text-mining of full-text biomedical articles and includes two types of subtracks:
- Sequences found in publications, grouped by article and searched in genomes with BLAT
- Identifiers in publications that directly relate to chromosome locations (e.g., gene symbols and SNP identifiers)
Both sources of information are linked to the respective articles.
Background information on how permission to use the full-text data was obtained can be found on the project website.
Display Convention and Configuration
The sequence subtrack indicates the location of sequences in publications
mapped back to the genome, annotated with the first author and the year of the
publication. All matches of one article are grouped ("chained") together.
Article titles are shown when you move the mouse cursor over the features.
Thicker parts of the features (exons) represent matching sequences,
connected by thin lines to matches from the same article within 30 kbp.
The subtrack "individual sequence matches" activates automatically when
the user clicks a sequence match and follows the link "Show sequence matches individually"
from the details page. Mouse-overs show flanking text around the sequence, and clicking
a feature links to its BLAT alignment.
All other subtracks (i.e., bands, genes, and SNPs) show the number of matching articles as
the feature description. Clicking on them shows the sentences and sections in articles
where the identifiers were found.
The track configuration includes a keyword and year filter. Keywords are space-separated
and are searched in the article's title, author list, and abstract.
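The keyword and year filter can be pictured with a minimal sketch. This is an illustration only, not the browser's implementation: the function name `matches_filter`, the article dictionary layout, the case-insensitive substring matching, the requirement that all keywords match, and the treatment of the year as a lower bound are all assumptions.

```python
def matches_filter(article, keywords, min_year=None):
    """Return True if the article passes a keyword and year filter.

    Assumptions (not stated in the track description): matching is
    case-insensitive, every keyword must occur somewhere, and the year
    filter is a lower bound.
    """
    # Keywords are space-separated, as described above.
    words = keywords.lower().split()
    # Search the title, author list, and abstract together.
    searched_text = " ".join(
        [article["title"], article["authors"], article["abstract"]]
    ).lower()
    if min_year is not None and article["year"] < min_year:
        return False
    return all(word in searched_text for word in words)


example_article = {
    "title": "Sonic hedgehog signalling in limb development",
    "authors": "Smith J, Doe A",
    "abstract": "We study enhancers.",
    "year": 2009,
}
```

With this hypothetical record, `matches_filter(example_article, "hedgehog smith")` passes because one keyword is found in the title and the other in the author list.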
Data
The track is based on text from biomedical research articles, obtained as
part of the UCSC Genocoding Project.
The current dataset consists of about 600,000 files (main text and
supplementary files) from PubMed Central (Open-Access set) and around 6 million text
files (main text) from Elsevier (as part of the Sciverse Apps program).
Methods
All file types (including XML, raw ASCII, PDFs, and various Microsoft
Office formats such as Excel, Word, and PowerPoint) were converted to text. The results were processed
to find groups of words that look like DNA/RNA sequences or
words that look like protein sequences. These were then mapped with BLAT to the
human genome and these model organisms: mouse (mm9), rat (rn4), zebrafish
(danRer6), Drosophila melanogaster (dm3), X. tropicalis (xenTro2), Medaka
(oryLat2), C. intestinalis (ci2), C. elegans (ce6) and yeast (sacCer2).
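As a rough illustration of what "groups of words that look like DNA/RNA sequences" could mean, the sketch below joins runs of adjacent words made up only of nucleotide letters. It is a simplified stand-in, not the actual pipeline: the function name `find_dna_candidates`, the allowed alphabet, the punctuation stripping, and the use of the 17 bp cutoff at this stage are assumptions.

```python
import re

# Assumed alphabet: the four DNA bases plus U (RNA) and N (unknown base).
DNA_WORD = re.compile(r"^[ACGTUN]+$", re.IGNORECASE)

# Assumption: reuse the "longer than 17 bp" cutoff from the match filter
# to discard short runs of English words such as "a cat".
MIN_CANDIDATE_LEN = 18


def find_dna_candidates(text, min_len=MIN_CANDIDATE_LEN):
    """Return runs of adjacent DNA-like words, joined and upper-cased."""
    candidates = []
    run = []
    for word in text.split():
        token = word.strip(".,;:()[]'\"-")
        if token and DNA_WORD.match(token):
            run.append(token.upper())
            continue
        # A non-DNA word ends the current run.
        if run:
            sequence = "".join(run)
            if len(sequence) >= min_len:
                candidates.append(sequence)
            run = []
    # Flush a run that reaches the end of the text.
    if run:
        sequence = "".join(run)
        if len(sequence) >= min_len:
            candidates.append(sequence)
    return candidates
```

The length cutoff matters here because short English words such as "tag", "a", and "cat" consist only of nucleotide letters and would otherwise be reported as sequences.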
The pipeline roughly proceeds through these steps:
- For sequences, the best match across all genomes is used, if it is longer than 17 bp and matches at 90% identity.
Two sets of BLAT parameters are tried: the default ones for sequences longer than 25 bp, and very sensitive ones (stepSize=5) for shorter sequences.
- Sequences are mapped to genomic DNA. Those that do not match are mapped to RefSeq cDNAs.
- Hits from the same article that are closer than 30 kbp are
joined into one feature (shown as exon-blocks on the browser).
- All parts of a joined feature have to match at least 25 bp.
- Non-unique hits are kept in the joined feature with the most members.
- Joined features with identical members in two different genomes are kept in both genomes.
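The filtering and joining steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used to build the track: the names `passes_filter` and `chain_hits` and the tuple layout of a hit are hypothetical, "matches at 90% identity" is read as "at least 90%", and the distance between two hits is measured from the end of one to the start of the next. The handling of non-unique hits and of multiple genomes is left out.

```python
from collections import defaultdict

MIN_MATCH_BP = 17     # keep matches longer than this (from the text)
MIN_IDENTITY = 0.90   # assumption: "at 90% identity" means at least 90%
MAX_GAP_BP = 30_000   # join hits closer than this (from the text)


def passes_filter(match_len, identity):
    """Apply the length and identity cutoffs to one BLAT match."""
    return match_len > MIN_MATCH_BP and identity >= MIN_IDENTITY


def chain_hits(hits, max_gap=MAX_GAP_BP):
    """Join hits from the same article that lie close together.

    hits: iterable of (article_id, chrom, start, end) tuples.
    Returns a list of (article_id, chrom, blocks), where blocks is a
    sorted list of (start, end) pairs forming one joined feature.
    """
    groups = defaultdict(list)
    for article, chrom, start, end in hits:
        groups[(article, chrom)].append((start, end))

    features = []
    for (article, chrom), blocks in sorted(groups.items()):
        blocks.sort()
        current = [blocks[0]]
        last_end = blocks[0][1]
        for start, end in blocks[1:]:
            if start - last_end < max_gap:
                current.append((start, end))
            else:
                # Gap too large: close this feature and start a new one.
                features.append((article, chrom, current))
                current = [(start, end)]
            last_end = max(last_end, end)
        features.append((article, chrom, current))
    return features


example_hits = [
    ("article1", "chr1", 100, 130),
    ("article1", "chr1", 5000, 5030),
    ("article1", "chr1", 100000, 100030),
    ("article2", "chr1", 200, 230),
]
```

In `example_hits`, the first two hits of `article1` are less than 30 kbp apart and end up in one feature with two blocks, while the third hit is too far away and forms a feature of its own.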
Note that due to the 90% identity filter, some sequences do not match
anywhere in the genome. Examples include primers with added restriction sites,
mutation primers, or any other sequence that joins or mixes two pieces of genomic
DNA not part of RefSeq. Note also that some gene symbols are ordinary
English words, which can lead to many false positives.
Credits
Software and processing by Maximilian Haeussler. UCSC Track visualisation by
Larry Meyer and Hiram Clawson. Elsevier support by Max Berenstein, Raphael
Sidi, Judd Dunham, Scott Robbins and colleagues. Original version written at the Bergman Lab,
University of Manchester, UK. Testing by Mary Mangan, OpenHelix Inc, and Greg Roe, UCSC.
Feedback
Please send ideas, comments or feedback on this track to
max@soe.ucsc.edu.
We are very interested in getting access to more articles from publishers for this
dataset; see the project website.
References
Aerts S, Haeussler M, van Vooren S, Griffith OL, Hulpiau P, Jones SJ, Montgomery SB, Bergman CM,
Open Regulatory Annotation Consortium.
Text-mining assisted regulatory annotation.
Genome Biol. 2008;9(2):R31.
PMID: 18271954; PMC: PMC2374703
Haeussler M, Gerner M, Bergman CM.
Annotating genes and genomes with DNA sequences extracted from biomedical articles.
Bioinformatics. 2011 Apr 1;27(7):980-6.
PMID: 21325301; PMC: PMC3065681
Van Noorden R.
Trouble at the text mine.
Nature. 2012 Mar 7;483(7388):134-5.