As I said in the previous post, 2 million additional SNPs were typed in the four HapMap populations, and the findings were reported in the latest issue of Nature.
In that paper, they go over some of the patterns they find regarding untaggable SNPs and recombination hotspots, and discuss some of the related ongoing and future genotyping efforts.
One of their main findings is the relatively high proportion of SNPs that are untaggable (in regions of high recombination, for example):
Despite the SNP density of the Phase II HapMap, there are high-frequency SNPs for which no tag can be identified. Among high-frequency SNPs (MAF ≥ 0.2), we marked as untaggable SNPs to which no other SNP within 100 kb has an r² value of at least 0.2. In Phase II, approximately 0.5–1.0% of all high-frequency SNPs are untaggable and the proportion in YRI is approximately twice as high as in the other panels.
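To make that definition concrete, here is a minimal sketch of how one might flag untaggable SNPs. It assumes phased haplotypes coded as 0/1 alleles and uses the standard haplotype-based r²; the function names (`maf`, `r_squared`, `untaggable`) and the toy data layout are my own, not anything from the paper, and the authors' actual pipeline may differ.

```python
def maf(alleles):
    """Minor allele frequency of one SNP, given 0/1 alleles across haplotypes."""
    p = sum(alleles) / len(alleles)
    return min(p, 1 - p)


def r_squared(a, b):
    """Haplotype-based r^2 between two SNPs (0/1 alleles, same haplotype order)."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x & y for x, y in zip(a, b)) / n  # frequency of the 1-1 haplotype
    d = pab - pa * pb                           # linkage disequilibrium coefficient D
    denom = pa * (1 - pa) * pb * (1 - pb)
    return 0.0 if denom == 0 else d * d / denom


def untaggable(positions, haplotypes, window=100_000, min_maf=0.2, min_r2=0.2):
    """Positions of high-frequency SNPs with no SNP within `window` bp at r^2 >= min_r2."""
    result = []
    for i, (pos_i, hap_i) in enumerate(zip(positions, haplotypes)):
        if maf(hap_i) < min_maf:
            continue  # only high-frequency SNPs are considered
        tagged = any(
            j != i
            and abs(pos_j - pos_i) <= window
            and r_squared(hap_i, hap_j) >= min_r2
            for j, (pos_j, hap_j) in enumerate(zip(positions, haplotypes))
        )
        if not tagged:
            result.append(pos_i)
    return result
```

The loop is quadratic in the number of SNPs, which is fine for a toy example but would need a sliding window over sorted positions for anything genome-scale.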
So are these SNPs untaggable because of high local recombination rates? (Answer: yes, probably.)
Over 50% of all untaggable SNPs lie within 1 kb of the centre of a detected recombination hotspot and over 90% are within 5 kb. Because only 3–4% of all SNPs lie within 1 kb from the centre of a detected recombination hotspot (16% are within 5 kb), this constitutes a marked enrichment and implies that at least 10% of all SNPs within 1 kb of hotspots are untaggable. The implication for association mapping is that when a region of interest contains a known hotspot it may be prudent to perform additional sequencing within the hotspot. Many of the variants identified in this manner will be untaggable SNPs that should be genotyped directly in association studies.
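The "at least 10%" figure is not derived in the quote, but it is roughly what Bayes' rule gives when you combine the numbers above. The sketch below is my own back-of-the-envelope reading, not the authors' calculation: it assumes the 0.5–1.0% untaggable rate (quoted for high-frequency SNPs) can be set against the 3–4% of all SNPs that lie within 1 kb of a hotspot centre, and the function name is made up for illustration.

```python
def fraction_untaggable_near_hotspot(p_untag, p_near_given_untag, p_near):
    """Bayes' rule: P(untaggable | near hotspot).

    p_untag            -- overall fraction of SNPs that are untaggable
    p_near_given_untag -- fraction of untaggable SNPs within 1 kb of a hotspot centre
    p_near             -- fraction of all SNPs within 1 kb of a hotspot centre
    """
    return p_near_given_untag * p_untag / p_near


# Most conservative and most generous combinations of the quoted ranges.
# "Over 50%" is itself a lower bound, so both values are lower bounds too.
low = fraction_untaggable_near_hotspot(0.005, 0.5, 0.04)
high = fraction_untaggable_near_hotspot(0.010, 0.5, 0.03)
print(f"{low:.3f} to {high:.3f}")
```

With these inputs the estimate runs from about 6% to about 17%, so the paper's 10% sits comfortably inside the range implied by its own figures.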
They also discuss something that I've always wondered about: what determines where recombination takes place? According to them, some of the leading ideas are specific DNA motifs that influence hotspot location, plus other features of the sequence, including the locations of genes. In other words, it seems that no one really knows for sure.
According to the authors:
The Phase II HapMap provides the resolution to separate these influences
What they find:
Within the transcribed region of genes there is a marked decrease in the estimated recombination rate. However, 5' of the transcription start site is a peak in recombination rate with a corresponding local increase in the density of hotspot motifs. This region also shows a marked increase in G+C content, reflecting the presence of CpG islands in promoter regions. There is also an asymmetry in recombination rate across genes, with recombination rates 3' of transcribed regions being elevated (as are motif density and G+C content) compared to regions 5' of genes. Studies in yeast have previously suggested an association between promoter regions and recombination hotspots [39]. Our results suggest a significant, although weak, relationship between promoters and recombination in humans. Nevertheless, the vast majority of hotspots in the human genome are not in gene promoters. The association may reflect a general association between regions of accessible chromatin and crossover activity.
They then examine whether recombination happens more around some types of genes. They use the Panther database of gene ontology to classify the genes and see if there are any noticeable patterns, and they find some interesting ones:
Average recombination rates vary more than sixfold among such gene classes (Fig. 5b), with defence and immunity genes showing the highest rates (1.9 cM Mb⁻¹) and chaperones showing the lowest rates (0.3 cM Mb⁻¹). Gene functions associated with cell surfaces and external functions tend to show higher recombination rates (immunity, cell adhesion, extracellular matrix, ion channels, signalling) whereas those with lower recombination rates are typically internal to cells (chaperones, ligase, isomerase, synthase). Controlling for systematic differences between gene classes in base composition and gene clustering, the differences between groups remain significant. We also find that the density of hotspot-associated DNA motifs varies systematically among gene classes and that variation in motif density explains over 50% of the variance in recombination rate among gene functions.
And then they briefly discuss the interesting evolutionary implications: namely, that highly conserved gene functions are "protected" from recombination, while those involved in defense and immunity against the ever-changing outside world are more likely to be shuffled around. Makes sense, I suppose.
...hotspots may be selected against in some highly conserved parts of the genome. In regions exposed to recurrent selection (for example, from changes in environment or pathogen pressure) it is plausible that recombination may be selected for. However, because the fine-scale structure of recombination seems to evolve rapidly, it will be important to learn whether patterns of recombination rate heterogeneity among molecular functions are conserved between species.
They go on to discuss some of the ongoing related projects, which basically cover more populations and more forms of variation.
The NHGRI collection has cell lines for the following populations, which they are currently hoping to get HapMap-style SNP genotypes on (some of the populations are interesting, and the website has interesting descriptions of them, such as the Gujarati Indians in Houston):
Maasai in Kinyawa, Kenya
Yoruba in Ibadan, Nigeria
Han Chinese in Beijing, China
Japanese in Tokyo, Japan
Luhya in Webuye, Kenya
Chinese in Metropolitan Denver, CO, USA
Gujarati Indians in Houston, TX, USA
Toscani in Italia
Mexican Ancestry in Los Angeles, CA, USA
African Ancestry in SW USA
Why don't they do this with the CEPH panel instead?
There are also people looking at CNVs in the original HapMap samples, and then, at some point, there will be whole-genome sequencing of these samples.
At some point I'll get to the selection paper.