[Translation] An introduction to genomics for programmers


About the author. Andy Thomason is the lead programmer at Genomics PLC. He has worked with graphics systems, games and compilers since the 70s and specializes in code performance.

Genes: a brief introduction


The human genome consists of two copies of about 3 billion base pairs of DNA, written with the letters A, C, G and T. That is about two bits for every base pair:

3,000,000,000 × 2 × 2/8 = 1,500,000,000 bytes, or about 1.5 GB of data.
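The two-bits-per-base estimate can be made concrete with a small sketch. The mapping A=0, C=1, G=2, T=3 and the helper names `pack_bases` and `unpack_bases` are assumptions chosen for illustration, not a standard file format.

```python
# Hypothetical 2-bit encoding; the numeric codes are an arbitrary choice.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def pack_bases(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Each base occupies two bits; the first base sits in the lowest bits.
        packed[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(packed)


def unpack_bases(packed: bytes, count: int) -> str:
    """Recover the first `count` bases from packed bytes."""
    return "".join(
        LETTERS[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(count)
    )


# Size estimate from the text: 3 billion bases, 2 copies, 2 bits each.
GENOME_BYTES = 3_000_000_000 * 2 * 2 // 8
```

Packing the seven-letter string `"GATTACA"` takes two bytes, and `unpack_bases` needs the base count because the last byte may be only partly filled.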

In fact, the two copies are very similar, and the DNA of all people is almost the same, from Wall Street traders to Australian Aborigines.

There are a number of “reference genomes”, such as the Ensembl FASTA files. A reference genome serves as a map of the features that are present in human DNA in general, rather than unique to specific people.

For example, we can look up the “location” of the gene that encodes the BRCA2 protein, which is involved in DNA repair and is associated with breast cancer.

It is located on chromosome 13, from position 32315474 to 32400266.



Genetic variations


People are so similar that to represent one person it is usually enough to store a small set of “variants”.

Over time, our DNA is damaged by cosmic rays and copying errors, so the DNA that parents transmit to children is slightly different from their own.

Recombination mixes genes even more: the DNA a child inherits from each parent is a mixture of the DNA of the two grandparents on that side.

Therefore, to describe a person's DNA, it is enough to keep only the differences from the reference genome. These are usually stored in a VCF (Variant Call Format) file.

Like almost all files in bioinformatics, this is a TSV file (tab-separated text).

You can get your own VCF file from companies like 23andMe and Ancestry.com: pay a relatively small fee and send in a sample, which is analyzed on a DNA microarray. The chip highlights the fragments where your DNA matches the expected sequences.

An abbreviated example from the VCF specification:

 ##fileDate=20090805
 ##source=myImputationProgramV3.1
 ##reference=1000GenomesPilot-NCBI36
 ##phasing=partial
 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

Here we have three people named NA00001, NA00002 and NA00003 (we are very serious about the security of personal data in the world of genetics). At position 14370 of chromosome 20, where the reference G is replaced by A, they have the genotypes 0|0, 1|0 and 1/1.

There are two numbers per person because we all have two copies of chromosome 20, one from each parent. The only exception is the sex chromosomes: I was unlucky enough to have only one X chromosome, so I inherited color blindness from my grandfather through my mother.

Possible options:

 0|0 both chromosomes match the reference
 1|0 and 0|1 only one chromosome differs from the reference
 1|1 both chromosomes differ from the reference 

VCF files are considered “phased” if you can tell which particular chromosome each variant is on, or at least where it lies relative to its neighbors. Phased genotypes are written with | and unphased ones with /. In practice, it is difficult to say which chromosome a piece of DNA came from, so you have to guess!

Thus, we have the bit vector 001011, which is enough to describe the three people at this variant. Each bit is a haplotype: the state of the variant on one individual chromosome.
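A minimal sketch of how that bit vector could be pulled out of one VCF data line. The function name `haplotype_bits` is hypothetical, the sketch treats any non-reference allele as 1, and it assumes there are no missing calls and no whitespace inside fields. It is not a full VCF parser.

```python
def haplotype_bits(vcf_line: str) -> str:
    """Return one bit per chromosome copy for every sample in a VCF data line."""
    # Real VCF is tab-separated; split() also accepts the spaces used here.
    fields = vcf_line.split()
    # Column 9 (index 8) names the per-sample sub-fields, e.g. GT:GQ:DP:HQ.
    gt_index = fields[8].split(":").index("GT")
    bits = []
    for sample in fields[9:]:
        genotype = sample.split(":")[gt_index]
        # "|" separates phased alleles and "/" unphased ones.
        for allele in genotype.replace("/", "|").split("|"):
            if not allele.isdigit():
                raise ValueError(f"missing or malformed allele: {allele!r}")
            # 0 means the reference allele; anything else is a variant.
            bits.append("0" if allele == "0" else "1")
    return "".join(bits)


LINE = (
    "20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
    "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,."
)
```

Calling `haplotype_bits(LINE)` on the example record gives `"001011"`, two bits per person in sample order.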

GWAS Research


Using this bit vector, we can try to figure out which parts of the genome affect diseases or other individual traits, such as hair color or height. For each variant, we compare the haplotypes against a measurable trait (the phenotype).

GWAS (genome-wide association study) is the basis for the genetic analysis of variants. It compares variants with observational data.

For example:

 Haplotype Height Person
 0 1.5m NA00001
 0 1.5m
 1 1.75m NA00002
 0 1.75m
 1 1.95m NA00003
 1 1.95m 

Note that each person has two haplotypes, because we have a pair of chromosomes.

Here we see that haplotype 1 is associated with greater height, and the relationship can be summarized by a linear regression with two values:

 beta The change in height associated with the variant.
 standard error The uncertainty of that estimate. 

In practice, the data is very noisy and the standard error is usually larger than beta, but often there are a few variants where beta is much larger than the error. The ratio of the two, the Z-score, and its associated p-value show which variants are most likely to affect height.
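As a rough sketch, the Z-score and a two-sided p-value can be computed with the standard library alone. The use of a normal approximation is an assumption made here for simplicity, and the function names are illustrative.

```python
import math


def z_score(beta: float, std_err: float) -> float:
    """Ratio of the effect size to its standard error."""
    return beta / std_err


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a Z-score, assuming a standard normal distribution."""
    # erfc(|z| / sqrt(2)) is the probability of a value at least this extreme.
    return math.erfc(abs(z) / math.sqrt(2))
```

A Z-score near zero gives a p-value near 1, while a Z-score of about 1.96 corresponds to the familiar 0.05 threshold.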

The easiest way to run the regression is to use a Moore-Penrose pseudoinverse.

We compose a 2 × 2 covariance matrix from dot products of the two vectors, and solve the problem by the method of least squares.

We have trillions of data points, so it is important to do this efficiently.
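The regression for the six haplotypes in the table can be sketched in plain Python. This is an illustration under stated assumptions, not the author's implementation: it solves the normal equations directly, which gives the same result as the pseudoinverse when the haplotype vector is not constant, and the names `regress`, `HAPLOTYPES` and `HEIGHTS` are made up for the example.

```python
import math

# Data from the table: one entry per chromosome copy, heights in metres.
HAPLOTYPES = [0, 0, 1, 0, 1, 1]
HEIGHTS = [1.5, 1.5, 1.75, 1.75, 1.95, 1.95]


def regress(x, y):
    """Least-squares fit of y = intercept + beta * x; returns (beta, std_err, z)."""
    n = len(x)
    # Dot products forming the 2 x 2 matrix [[n, sx], [sx, sxx]] and the
    # right-hand side [sy, sxy]; the design matrix has a column of ones
    # and a column holding the haplotype bits.
    sx = sum(x)
    sxx = sum(a * a for a in x)
    sy = sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx  # zero if every haplotype is the same
    intercept = (sxx * sy - sx * sxy) / det
    beta = (n * sxy - sx * sy) / det
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((b - intercept - beta * a) ** 2 for a, b in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) * n / det)
    return beta, std_err, beta / std_err
```

For the table above, `regress(HAPLOTYPES, HEIGHTS)` gives a beta of about 0.3 m, which is simply the difference between the mean height of the haplotype-1 rows and the haplotype-0 rows. Only five running sums are needed per variant, which matters when the same calculation is repeated across many variants.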

The curse of linkage disequilibrium


Since we inherit large fragments of the genome from our parents, certain regions of DNA look very similar: much more similar than chance would dictate.

This is good for us, because the genes keep working the same way they did in our ancestors, but bad for genomics researchers. It means there are not enough differences to pin down which variant caused a change in phenotype.

Linkage disequilibrium (LD) measures how similar two variant vectors are.

It calculates a value between -1 and 1, where

 -1 Completely opposite variations.
  0 Variations are not similar.
  1 The variations are exactly the same. 
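One measure with exactly this range is the Pearson correlation between two haplotype bit vectors. The text does not say which formula is used, so treating LD as a plain correlation is an assumption here, as is the name `ld_r`.

```python
import math


def ld_r(a, b):
    """Pearson correlation between two equal-length haplotype bit vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Assumption: a variant that never changes is reported as unrelated.
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

Identical vectors score 1, a vector against its bitwise complement scores -1, and vectors that vary independently score 0.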

To measure the similarity of variants, we build large square LD matrices for specific regions of the genome. In practice, many of the variants around a given position are almost identical to the one in the middle.

The matrix looks something like this, with large squares of similarity.

 v0 v2 v4 v6 v8 va vc ve vg
  v1 v3 v5 v7 v9 vb vd vf
 v0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
 v1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
 v2 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
 v3 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
 v4 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
 v5 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
 v6 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
 v7 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
 v8 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 v9 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 va 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 vb 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 vc 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 vd 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 ve 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 vf 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
 vg 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 

Real values are not exactly 0 or 1, but they are close.

A recombination occurred between v7 and v8. Because of this, v0..v7 is different from v8..vg.
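The block structure can be reproduced on a toy scale. The sketch below builds an LD matrix from four made-up variants and reports where a new block starts. The correlation measure, the 0.5 threshold, and the names `ld_matrix` and `block_starts` are all assumptions for illustration.

```python
import math


def correlation(a, b):
    """Pearson correlation between two haplotype bit vectors (0.0 if constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def ld_matrix(variants):
    """Square matrix of pairwise correlations between variants."""
    return [[correlation(a, b) for b in variants] for a in variants]


def block_starts(matrix, threshold=0.5):
    """Indices where a new block begins: neighbouring variants correlate weakly."""
    return [
        i + 1
        for i in range(len(matrix) - 1)
        if abs(matrix[i][i + 1]) < threshold
    ]


# Toy data: each row is one variant across eight chromosome copies.
VARIANTS = [
    [0, 0, 0, 0, 1, 1, 1, 1],  # v0
    [0, 0, 0, 0, 1, 1, 1, 1],  # v1, identical to v0
    [0, 0, 1, 1, 0, 0, 1, 1],  # v2, unrelated to v0 and v1
    [0, 0, 1, 1, 0, 0, 1, 1],  # v3, identical to v2
]
```

Here `block_starts(ld_matrix(VARIANTS))` returns `[2]`: the toy equivalent of the recombination point between v7 and v8 in the larger matrix.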

The problem with this similarity is that we know that one of the variants in the group caused something, but we do not know which one.

This limits the resolution of our genomic microscope, and additional techniques, such as functional genomics, have to be used to narrow it down.

Conclusion


In the end, you can never be 100% sure which particular region of the genome caused a particular individual trait; this is the essence of genetics. Biology is not an exact machine with perfect factory-made parts. It is a boiling mass of accidents that somehow creates what we call life. That is why statistics, or “machine learning” as it is now fashionable to call it, is so important.
