Human Genome Project

• Started in 1990
• Research effort to sequence all of our DNA (46 chromosomes)
• Over 3.3 billion nucleotides
• Mapping every gene location (loci)
• Conducted by scientists around the world
HGP Insights
• Only about 2% of the human genome codes for proteins (exons)
• The other 98% is non-coding (introns, regulatory sequences, repetitive DNA)
• Only about 20,000 to 25,000 genes (expected 100,000)
• Proteome – organism’s complete set of proteins
• About 8 million single nucleotide polymorphisms (SNPs) – places where humans differ by a single nucleotide
• About ½ of genome comes from transposons (pieces of DNA that move to different locations on chromosomes)
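The SNP idea from the list above can be illustrated with a minimal sketch. It assumes two short, already aligned sequences of equal length; the function name `snp_positions` and both example sequences are made up for illustration and are not real genomic data.

```python
def snp_positions(seq_a: str, seq_b: str) -> list[int]:
    """Return 0-based positions where two aligned sequences differ by one base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical sequences from two people (illustrative only).
person_1 = "ATGCCGTAAGCT"
person_2 = "ATGCCGTTAGCT"

print(snp_positions(person_1, person_2))  # [7]
```

Real SNP discovery works on many genomes at once and must first align the sequences; this sketch only shows what "differ by a single nucleotide" means at one site.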
Benefits of Human Genome Project
• Improvements in medical prevention of disease, gene therapies, diagnosis techniques …
• Production of useful protein products for use in medicine, agriculture, bioremediation and pharmaceutical industries.
• Improved bioinformatics – using computers to help in DNA sequencing

The Human Genome Project was started in 1990 with the goal of sequencing and identifying all three billion chemical units in the human genetic instruction set, finding the genetic roots of disease and then developing treatments. With the sequence in hand, the next step was to identify the genetic variants that increase the risk for common diseases like cancer and diabetes.
It was far too expensive at that time to think of sequencing patients’ whole genomes, so the National Institutes of Health embraced a "shortcut": looking only at sites on the genome where many people have a variant DNA unit. The theory behind the shortcut was that since the major diseases are common, so too would be the genetic variants that caused them. Natural selection keeps the human genome free of variants that damage health before children are grown, the theory held, but fails against variants that strike later in life, allowing them to become quite common. (In 2002 the National Institutes of Health started a $138 million project called the HapMap to catalog the common variants in European, East Asian and African genomes.)
The genome was broken into smaller pieces, each approximately 150,000 base pairs in length. These pieces were then ligated into a type of vector known as "bacterial artificial chromosomes", or BACs, which are derived from bacterial chromosomes that have been genetically engineered. The vectors containing the human DNA fragments can be inserted into bacteria, where they are copied by the bacterial DNA replication machinery. Each of these pieces was then sequenced separately as a small "shotgun" project and then assembled. The assembled 150,000-base-pair pieces were then joined to reconstruct the chromosomes. This is known as the "hierarchical shotgun" approach, because the genome is first broken into relatively large chunks, which are then mapped to chromosomes before being selected for sequencing.
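The two-level idea described above can be sketched as a toy program. This is a heavily simplified illustration, not the assembler the project used: it assumes error-free reads, a tiny made-up 24-base "genome", large chunks of 12 bases whose order is already known (standing in for mapped BACs), and a simple greedy overlap merge. All function names and parameters here are hypothetical.

```python
import random


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the largest suffix/prefix overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge; return the unjoined contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


def shotgun_reads(chunk: str, read_len: int = 6, step: int = 3) -> list[str]:
    """Cut a chunk into overlapping reads (fixed step keeps the toy deterministic)."""
    return [chunk[i:i + read_len] for i in range(0, len(chunk) - read_len + 1, step)]


def hierarchical_shotgun(genome: str, chunk_size: int = 12, seed: int = 0) -> str:
    """Split into mapped chunks, shotgun-assemble each one, then join in map order."""
    rng = random.Random(seed)
    # Level 1: large chunks whose order on the chromosome is known (the "map").
    chunks = [genome[i:i + chunk_size] for i in range(0, len(genome), chunk_size)]
    assembled = []
    for chunk in chunks:
        # Level 2: a small shotgun project per chunk; read order is discarded.
        reads = shotgun_reads(chunk)
        rng.shuffle(reads)
        assembled.append(greedy_assemble(reads)[0])
    return "".join(assembled)


toy_genome = "ATGCGTACCTGAAGTCCGATTGCA"  # made-up sequence, illustrative only
print(hierarchical_shotgun(toy_genome) == toy_genome)
```

The point of the hierarchy is that each small assembly only has to resolve overlaps within its own chunk, while the map supplies the order of the chunks. Real data contains repeats and sequencing errors, which a greedy merge like this one would not handle reliably.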
Funding came from the US government through the National Institutes of Health and from a UK charity, the Wellcome Trust, as well as numerous other groups from around the world. The funding supported a number of large sequencing centers, including those at the Whitehead Institute, the Sanger Centre, Washington University in St. Louis, and Baylor College of Medicine.
The Human Genome Project is considered a Mega Project because the human genome has approximately 3.3 billion base-pairs.
If the sequence were stored in book form, with each page containing 1,000 base-pairs and each book containing 1,000 pages, then 3,300 such books would be needed to store the complete genome. Expressed in units of computer data storage, however, 3.3 billion base-pairs recorded at 2 bits per pair would equal about 825 million bytes of raw data, or roughly 786 megabytes when a megabyte is counted as 1,048,576 bytes. This is comparable to a fully loaded CD.
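The storage arithmetic above can be checked with a short calculation. The sketch below uses the figures from the text (3.3 billion base-pairs, 1000 base-pairs per page, 1000 pages per book, 2 bits per base-pair); the `pack` helper and its A/C/G/T bit codes are an illustrative assumption showing one way four bases could fit in a byte, not a standard file format.

```python
BASE_PAIRS = 3_300_000_000
BASES_PER_PAGE = 1000
PAGES_PER_BOOK = 1000
BITS_PER_BASE = 2  # four possible bases (A, C, G, T) need 2 bits each

books = BASE_PAIRS // (BASES_PER_PAGE * PAGES_PER_BOOK)
total_bytes = BASE_PAIRS * BITS_PER_BASE // 8
megabytes_decimal = total_bytes / 10**6  # 1 MB = 1,000,000 bytes
megabytes_binary = total_bytes / 2**20   # 1 MiB = 1,048,576 bytes

print(f"{books} books")
print(f"{total_bytes:,} bytes")
print(f"{megabytes_decimal:.0f} MB (decimal), {megabytes_binary:.1f} MiB (binary)")

# Hypothetical 2-bit code for each base.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte (last byte zero-padded)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(group))  # pad a short final group on the right
        out.append(byte)
    return bytes(out)


print(len(pack("ACGTACGT")))  # 8 bases fit in 2 bytes
```

The 786 figure corresponds to the binary megabyte; counted in decimal megabytes the same data is 825 MB.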