Biophilosophy — Genome
NFA 2020 / Aziz Yardımlı






  • The genome is the genetic material of an organism.
  • The genome consists of DNA (or of RNA in RNA viruses).
  • DNA has been the carrier of genetic information for about 4 billion years.
  • The genome includes both the genes (coding regions) and noncoding DNA, as well as mitochondrial DNA and chloroplast DNA.
  • A genome sequence is the complete list of the nucleotides (for DNA: A, C, G, T) that make up all the chromosomes of an individual.
  • The study of the genome is genomics.

📘 Comparisons of Neanderthal and human DNA

Comparisons of Neanderthal and human DNA have helped anthropologists settle a long-running debate about the genetic relationship of the two. The evidence shows that Neanderthals and our own species, Homo sapiens, last shared a common ancestor between 600,000 and 800,000 years ago. Neanderthal ancestors migrated to Europe about 400,000 years ago while our own ancestors remained in Africa. The two groups remained out of contact until 40,000 years ago, when Homo sapiens first arrived in Europe. Within a few millennia, the Neanderthals were extinct. However, their recently recovered DNA suggests that during the 10,000 years that Neanderthals shared Europe with Homo sapiens, some interbreeding took place; 1–4% of the genomes of modern non-Africans can be traced to Neanderthals.



  • In 1976, Walter Fiers determined the complete nucleotide sequence of a viral RNA genome (Bacteriophage MS2).
  • In 1995, the first bacterial genome sequence was determined (Haemophilus influenzae).
  • In 1996, the first eukaryotic genome sequence was determined (Saccharomyces cerevisiae).
  • In 2007, the complete genome sequence of James D. Watson was determined.
  • In 2013, the complete genome sequence of a Neanderthal was determined.


  • A genome map is less detailed than a genome sequence.
  • Viral genomes consist of either RNA or DNA.
  • The genome of an RNA virus can consist of double-stranded or single-stranded RNA.
  • DNA viruses can likewise contain double-stranded or single-stranded DNA.


  • Prokaryotes and eukaryotes carry DNA genomes.
  • Archaea have a single circular chromosome.
  • Most bacteria have a single circular chromosome.
  • Eukaryotic genomes consist of one or more linear DNA chromosomes.
  • The number of chromosomes ranges from one pair to 720 pairs.
  • Humans have 46 chromosomes.
  • Chloroplasts and mitochondria have their own DNA (or genome), called the "plastome" and the "mitochondrial genome" respectively.



📹 An Introduction to the Human Genome / HMX Genetics (VIDEO)

Humans are 99.9% genetically identical - and yet we are all so different. How can this be? This video, taken from a lesson in Harvard Medical School’s HMX Genetics course, explains.


📹 Whole Genome Sequencing and You / Icahn School of Medicine (VIDEO)

This video is about whole genome sequencing. What is a genome? What are the basics of how whole genome sequencing works? What can you find out about yourself from getting your genome sequenced? And what are the potential benefits and risks? You might be considering getting your genome sequenced for clinical, research or personal reasons. Or you might just be curious and want to learn a bit more about this technology. This video was developed to help you understand a bit more about what whole genome sequencing is, and what it could mean for you. It was developed by researchers at Mount Sinai's Department of Genetics and Genomic Sciences and Department of Emergency Medicine with funding from the Charles Bronfman Institute for Personalized Medicine, with valuable input from several community consultants, patients and others from around the Mount Sinai community.


📹 How to sequence the human genome / TED-Ed (VIDEO)

Your genome, every human's genome, consists of a unique DNA sequence of A's, T's, C's and G's that tell your cells how to operate. Thanks to technological advances, scientists are now able to know the sequence of letters that makes up an individual genome relatively quickly and inexpensively. Mark J. Kiel takes an in-depth look at the science behind the sequence.




A labeled diagram explaining the different parts of a prokaryotic genome.


Composition of the human genome.
Genome (W)

In the fields of molecular biology and genetics, a genome is the genetic material of an organism. It consists of DNA (or RNA in RNA viruses). The genome includes both the genes (the coding regions) and the noncoding DNA, as well as mitochondrial DNA and chloroplast DNA. The study of the genome is called genomics.

Origin of term (W)

The term genome was created in 1920 by Hans Winkler, professor of botany at the University of Hamburg, Germany. The Oxford Dictionary suggests the name is a blend of the words gene and chromosome. However, see omics for a more thorough discussion. A few related -ome words already existed, such as biome and rhizome, forming a vocabulary into which genome fits systematically.


Sequencing and mapping (W)

A genome sequence is the complete list of the nucleotides (A, C, G, and T for DNA genomes) that make up all the chromosomes of an individual or a species. Within a species, the vast majority of nucleotides are identical between individuals, but sequencing multiple individuals is necessary to understand the genetic diversity.
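Because a genome sequence is simply an ordered list of nucleotides, basic summary statistics can be computed directly from the text of the sequence. The following is a minimal sketch, assuming the sequence is available as a plain string; the function names and the ten-base toy sequence are hypothetical and chosen only for illustration.

```python
from collections import Counter


def nucleotide_counts(sequence: str) -> dict:
    """Count the four DNA nucleotides in a sequence (case-insensitive)."""
    counts = Counter(sequence.upper())
    # Report only A, C, G and T; any other character is ignored.
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_content(sequence: str) -> float:
    """Return the fraction of A/C/G/T bases that are G or C."""
    counts = nucleotide_counts(sequence)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (counts["G"] + counts["C"]) / total


# Hypothetical toy sequence, not taken from any real genome.
toy_sequence = "ATGCGCTTAA"
print(nucleotide_counts(toy_sequence))
print(gc_content(toy_sequence))
```

Real genome sequences are usually stored in formats such as FASTA and are far too long to handle as one literal string, so a practical version would read the file in pieces and accumulate the counts.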

An image of the 46 chromosomes making up the diploid genome of a human male. (The mitochondrial chromosome is not shown.)

In 1976, Walter Fiers at the University of Ghent (Belgium) was the first to establish the complete nucleotide sequence of a viral RNA-genome (Bacteriophage MS2). The next year, Fred Sanger completed the first DNA-genome sequence: Phage Φ-X174, of 5386 base pairs. The first complete genome sequences among all three domains of life were released within a short period during the mid-1990s: The first bacterial genome to be sequenced was that of Haemophilus influenzae, completed by a team at The Institute for Genomic Research in 1995. A few months later, the first eukaryotic genome was completed, with sequences of the 16 chromosomes of budding yeast Saccharomyces cerevisiae published as the result of a European-led effort begun in the mid-1980s. The first genome sequence for an archaeon, Methanococcus jannaschii, was completed in 1996, again by The Institute for Genomic Research.

The development of new technologies has made genome sequencing dramatically cheaper and easier, and the number of complete genome sequences is growing rapidly. The US National Institutes of Health maintains one of several comprehensive databases of genomic information. The thousands of completed genome sequencing projects include those for rice, a mouse, the plant Arabidopsis thaliana, the puffer fish, and the bacterium E. coli. In December 2013, scientists first sequenced the entire genome of a Neanderthal, an extinct species of humans. The genome was extracted from the toe bone of a 130,000-year-old Neanderthal found in a Siberian cave.

New sequencing technologies, such as massively parallel sequencing, have also opened up the prospect of personal genome sequencing as a diagnostic tool, as pioneered by Manteia Predictive Medicine. A major step toward that goal was the completion in 2007 of the full genome of James D. Watson, one of the co-discoverers of the structure of DNA.

Whereas a genome sequence lists the order of every DNA base in a genome, a genome map identifies the landmarks. A genome map is less detailed than a genome sequence and aids in navigating around the genome. The Human Genome Project was organized to map and to sequence the human genome. A fundamental step in the project was the release of a detailed genomic map by Jean Weissenbach and his team at the Genoscope in Paris.

Reference genome sequences and maps continue to be updated, removing errors and clarifying regions of high allelic complexity. The decreasing cost of genomic mapping has permitted genealogical sites to offer it as a service, to the extent that one may submit one's genome to crowdsourced scientific endeavours such as DNA.LAND at the New York Genome Center, an example both of the economies of scale and of citizen science.


Part of a DNA sequence: a prototype of the complete genome of a virus.


Viral genomes (W)

Viral genomes can be composed of either RNA or DNA. The genomes of RNA viruses can be either single-stranded RNA or double-stranded RNA, and may contain one or more separate RNA molecules (segments: monopartite or multipartite genome). DNA viruses can have either single-stranded or double-stranded genomes. Most DNA virus genomes are composed of a single, linear molecule of DNA, but some are made up of a circular DNA molecule.


Prokaryotic genomes (W)

Prokaryotes and eukaryotes have DNA genomes. Archaea have a single circular chromosome. Most bacteria also have a single circular chromosome; however, some bacterial species have linear chromosomes or multiple chromosomes. If the DNA is replicated faster than the bacterial cells divide, multiple copies of the chromosome can be present in a single cell. If the cells divide faster than the DNA can be replicated, multiple rounds of chromosome replication are initiated before division occurs, allowing daughter cells to inherit complete genomes and already partially replicated chromosomes. Most prokaryotes have very little repetitive DNA in their genomes. However, some symbiotic bacteria (e.g. Serratia symbiotica) have reduced genomes and a high fraction of pseudogenes: only ~40% of their DNA encodes proteins.

Some bacteria have auxiliary genetic material, also part of their genome, which is carried in plasmids. For this reason, the word genome should not be used as a synonym of chromosome.


Eukaryotic genomes (W)

Eukaryotic genomes are composed of one or more linear DNA chromosomes. The number of chromosomes varies widely, from Jack jumper ants and an asexual nematode, which each have only one pair, to a fern species that has 720 pairs. A typical human cell has two copies of each of 22 autosomes, one inherited from each parent, plus two sex chromosomes, making it diploid. Gametes, such as ova, sperm, spores, and pollen, are haploid, meaning they carry only one copy of each chromosome.

In addition to the chromosomes in the nucleus, organelles such as the chloroplasts and mitochondria have their own DNA. Mitochondria are sometimes said to have their own genome, often referred to as the "mitochondrial genome." The DNA found within the chloroplast may be referred to as the "plastome." Like the bacteria they originated from, mitochondria and chloroplasts have a circular chromosome.

Unlike prokaryotes, eukaryotes have exon-intron organization of protein coding genes and variable amounts of repetitive DNA. In mammals and plants, the majority of the genome is composed of repetitive DNA.


Coding sequences (W)

DNA sequences that carry the instructions to make proteins are coding sequences. The proportion of the genome occupied by coding sequences varies widely. A larger genome does not necessarily contain more genes, and the proportion of non-repetitive DNA decreases along with increasing genome size in complex eukaryotes.

Simple eukaryotes, such as C. elegans and the fruit fly, have more non-repetitive DNA than repetitive DNA, while the genomes of more complex eukaryotes tend to be composed largely of repetitive DNA. In some plants and amphibians, the proportion of repetitive DNA is more than 80%. Similarly, only 2% of the human genome codes for proteins.


Noncoding sequences (W)

Noncoding sequences include introns, sequences for non-coding RNAs, regulatory regions, and repetitive DNA. Noncoding sequences make up 98% of the human genome. There are two categories of repetitive DNA in the genome: tandem repeats and interspersed repeats.


Composition of the human genome.


Tandem repeats (W)

Short, non-coding sequences that are repeated head-to-tail are called tandem repeats. Microsatellites consist of 2–5 base-pair repeats, while minisatellite repeats are 30–35 bp. Tandem repeats make up about 4% of the human genome and 9% of the fruit fly genome. Tandem repeats can be functional. For example, telomeres are composed of the tandem repeat TTAGGG in mammals, and they play an important role in protecting the ends of the chromosome.

In other cases, expansions in the number of tandem repeats in exons or introns can cause disease. For example, the human gene huntingtin typically contains 6–29 tandem repeats of the nucleotides CAG (encoding a polyglutamine tract). An expansion to over 36 repeats results in Huntington's disease, a neurodegenerative disease. Twenty human disorders are known to result from similar tandem repeat expansions in various genes. The mechanism by which proteins with expanded polyglutamine tracts cause death of neurons is not fully understood. One possibility is that the proteins fail to fold properly and avoid degradation, instead accumulating in aggregates that also sequester important transcription factors, thereby altering gene expression.

Tandem repeats are usually caused by slippage during replication, unequal crossing-over and gene conversion.
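The idea of counting head-to-tail repeats can be sketched in a few lines. The example below is an illustration only, assuming the sequence is a plain string: it finds the longest uninterrupted run of a repeat unit and compares a CAG count with the two ranges quoted above (6–29 typical, over 36 expanded). The function names, the labels, and the toy sequence are hypothetical, and nothing here is a diagnostic method.

```python
import re


def longest_tandem_run(sequence: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(unit.upper())})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


def classify_cag_count(repeat_count: int) -> str:
    """Label a CAG repeat count using only the ranges quoted in the text."""
    if repeat_count > 36:
        return "expanded"
    if 6 <= repeat_count <= 29:
        return "typical"
    # Counts below 6 or from 30 to 36 are not characterized in the text.
    return "not covered"


# Hypothetical toy sequence: a run of 7 CAG units and a separate run of 2.
toy_sequence = "GGT" + "CAG" * 7 + "TTC" + "CAG" * 2
run = longest_tandem_run(toy_sequence, "CAG")
print(run, classify_cag_count(run))
```

A regular expression is enough for exact repeats; interrupted or imperfect repeats, which occur in real sequences, would need a more tolerant matching approach.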


Transposable elements (W)

Transposable elements (TEs) are sequences of DNA with a defined structure that are able to change their location in the genome. TEs are categorized as either class I TEs, which replicate by a copy-and-paste mechanism, or class II TEs, which can be excised from the genome and inserted at a new location.

The movement of TEs is a driving force of genome evolution in eukaryotes because their insertion can disrupt gene functions, homologous recombination between TEs can produce duplications, and TEs can shuffle exons and regulatory sequences to new locations.



Retrotransposons (W)

Retrotransposons are transcribed into RNA, which is then copied back into DNA and inserted at another site in the genome. Retrotransposons can be divided into long terminal repeats (LTRs) and non-long terminal repeats (Non-LTRs).

Long terminal repeats (LTRs) are derived from ancient retroviral infections, so they encode proteins related to retroviral proteins including gag (structural proteins of the virus), pol (reverse transcriptase and integrase), pro (protease), and in some cases env (envelope) genes. These genes are flanked by long repeats at both 5' and 3' ends. It has been reported that LTRs make up the largest fraction of most plant genomes and might account for the huge variation in genome size.

Non-long terminal repeats (Non-LTRs) are classified as long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), and Penelope-like elements (PLEs). In Dictyostelium discoideum, DIRS-like elements form a further group of Non-LTRs. Non-LTRs are widely spread in eukaryotic genomes.

Long interspersed elements (LINEs) encode genes for reverse transcriptase and endonuclease, making them autonomous transposable elements. The human genome has around 500,000 LINEs, which make up around 17% of the genome.

Short interspersed elements (SINEs) are usually less than 500 base pairs and are non-autonomous, so they rely on the proteins encoded by LINEs for transposition. The Alu element is the most common SINE found in primates. It is about 350 base pairs and occupies about 11% of the human genome with around 1,500,000 copies.


DNA transposons (W)

DNA transposons encode a transposase enzyme between inverted terminal repeats. When expressed, the transposase recognizes the terminal inverted repeats that flank the transposon and catalyzes its excision and reinsertion in a new site. This cut-and-paste mechanism typically reinserts transposons near their original location (within 100kb). DNA transposons are found in bacteria and make up 3% of the human genome and 12% of the genome of the roundworm C. elegans.
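The defining feature mentioned above, inverted terminal repeats flanking the element, can be checked computationally: the sequence at one end reads as the reverse complement of the sequence at the other end. The following is a minimal sketch under that assumption; the function names and the toy element are hypothetical and do not correspond to any real transposon.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def has_terminal_inverted_repeats(element: str, repeat_length: int) -> bool:
    """True if the last `repeat_length` bases are the reverse complement of the first."""
    if repeat_length <= 0 or 2 * repeat_length > len(element):
        raise ValueError("repeat_length must be positive and fit twice in the element")
    element = element.upper()
    left_end = element[:repeat_length]
    right_end = element[-repeat_length:]
    return right_end == reverse_complement(left_end)


# Hypothetical toy element: 6-base inverted repeats around an internal region.
toy_element = "CAGGGT" + "ATGAAACCC" + "ACCCTG"
print(has_terminal_inverted_repeats(toy_element, 6))
```

This exact-match test is deliberately strict; repeats in real elements are often imperfect, so a practical check would allow some mismatches.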


Genome size (W)

Log-log plot of the total number of annotated proteins in genomes submitted to GenBank as a function of genome size.
Genome size is the total number of DNA base pairs in one copy of a haploid genome. In humans, the nuclear genome comprises approximately 3.2 billion nucleotides of DNA, divided into 24 linear molecules, the shortest 50,000,000 nucleotides in length and the longest 260,000,000 nucleotides, each contained in a different chromosome. Genome size is positively correlated with morphological complexity among prokaryotes and lower eukaryotes; however, from mollusks onward among the higher eukaryotes, this correlation no longer holds. This breakdown points to the strong influence of repetitive DNA on genome size.

Since genomes are very complex, one research strategy is to reduce the number of genes in a genome to the bare minimum and still have the organism in question survive. There is experimental work being done on minimal genomes for single cell organisms as well as minimal genomes for multi-cellular organisms (see Developmental biology). The work is both in vivo and in silico.

Here is a table of some significant or representative genomes.

| Organism type | Organism | Genome size (base pairs) | Approx. size | Approx. no. of genes | Note |
|---|---|---|---|---|---|
| Virus | Porcine circovirus type 1 | 1,759 | 1.8 kb | | Smallest viruses replicating autonomously in eukaryotic cells. |
| Virus | Bacteriophage MS2 | 3,569 | 3.5 kb | | First sequenced RNA genome |
| Virus | SV40 | 5,224 | 5.2 kb | | |
| Virus | Phage Φ-X174 | 5,386 | 5.4 kb | | First sequenced DNA genome |
| Virus | HIV | 9,749 | 9.7 kb | | |
| Virus | Phage λ | 48,502 | 48.5 kb | | Often used as a vector for the cloning of recombinant DNA. |
| Virus | Megavirus | 1,259,197 | 1.3 Mb | | Until 2013 the largest known viral genome. |
| Virus | Pandoravirus salinus | 2,470,000 | 2.47 Mb | | Largest known viral genome. |
| Eukaryotic organelle | Human mitochondrion | 16,569 | 16.6 kb | | |
| Bacterium | Nasuia deltocephalinicola (strain NAS-ALF) | 112,091 | 112 kb | 137 | Smallest known non-viral genome. Symbiont of leafhoppers. |
| Bacterium | Carsonella ruddii | 159,662 | 160 kb | | An endosymbiont of psyllid insects |
| Bacterium | Buchnera aphidicola | 600,000 | 600 kb | | An endosymbiont of aphids |
| Bacterium | Wigglesworthia glossinidia | 700,000 | 700 kb | | A symbiont in the gut of the tsetse fly |
| Bacterium – cyanobacterium | Prochlorococcus spp. | 1,700,000 | 1.7 Mb | 1,884 | Smallest known cyanobacterium genome. One of the primary photosynthesizers on Earth. |
| Bacterium | Haemophilus influenzae | 1,830,000 | 1.8 Mb | | First genome of a living organism sequenced, July 1995 |
| Bacterium | Escherichia coli | 4,600,000 | 4.6 Mb | 4,288 | |
| Bacterium – cyanobacterium | Nostoc punctiforme | 9,000,000 | 9 Mb | 7,432 | 7,432 open reading frames |
| Bacterium | Solibacter usitatus (strain Ellin 6076) | 9,970,000 | 10 Mb | | |
| Amoeboid | Polychaos dubium ("Amoeba" dubia) | 670,000,000,000 | 670 Gb | | Largest known genome. (Disputed) |
| Plant | Genlisea tuberosa | 61,000,000 | 61 Mb | | Smallest recorded flowering plant genome, 2014. |
| Plant | Arabidopsis thaliana | 135,000,000 | 135 Mb | 27,655 | First plant genome sequenced, December 2000. |
| Plant | Populus trichocarpa | 480,000,000 | 480 Mb | 73,013 | First tree genome sequenced, September 2006 |
| Plant | Fritillaria assyriaca | 130,000,000,000 | 130 Gb | | |
| Plant | Paris japonica (Japanese-native, pale-petal) | 150,000,000,000 | 150 Gb | | Largest plant genome known |
| Plant – moss | Physcomitrella patens | 480,000,000 | 480 Mb | | First genome of a bryophyte sequenced, January 2008. |
| Fungus – yeast | Saccharomyces cerevisiae | 12,100,000 | 12.1 Mb | 6,294 | First eukaryotic genome sequenced, 1996 |
| Fungus | Aspergillus nidulans | 30,000,000 | 30 Mb | 9,541 | |
| Nematode | Pratylenchus coffeae | 20,000,000 | 20 Mb | | Smallest animal genome known |
| Nematode | Caenorhabditis elegans | 100,300,000 | 100 Mb | 19,000 | First multicellular animal genome sequenced, December 1998 |
| Insect | Drosophila melanogaster (fruit fly) | 175,000,000 | 175 Mb | 13,600 | Size variation based on strain (175–180 Mb; standard y w strain is 175 Mb) |
| Insect | Apis mellifera (honey bee) | 236,000,000 | 236 Mb | 10,157 | |
| Insect | Bombyx mori (silk moth) | 432,000,000 | 432 Mb | 14,623 | 14,623 predicted genes |
| Insect | Solenopsis invicta (fire ant) | 480,000,000 | 480 Mb | 16,569 | |
| Mammal | Mus musculus | 2,700,000,000 | 2.7 Gb | 20,210 | |
| Mammal | Pan paniscus | 3,286,640,000 | 3.3 Gb | 20,000 | Bonobo; estimated genome size 3.29 billion bp |
| Mammal | Homo sapiens | 3,289,000,000 | 3.3 Gb | 20,000 | Homo sapiens estimated genome size 3.2 billion bp. See "Initial sequencing and analysis of the human genome". |
| Bird | Gallus gallus | 1,043,000,000 | 1.0 Gb | 20,000 | |
| Fish | Tetraodon nigroviridis (type of puffer fish) | 385,000,000 | 390 Mb | | Smallest vertebrate genome known, estimated to be 340–385 Mb. |
| Fish | Protopterus aethiopicus (marbled lungfish) | 130,000,000,000 | 130 Gb | | Largest vertebrate genome known |
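
The point that a larger genome does not necessarily contain more genes can be made concrete by dividing gene counts by genome size. The sketch below uses four rows of the table above exactly as listed; the dictionary layout and the function name are illustrative assumptions, and the gene counts are the approximate values from the table.

```python
# (genome size in base pairs, approximate number of genes), copied from the table.
GENOMES = {
    "Escherichia coli": (4_600_000, 4_288),
    "Saccharomyces cerevisiae": (12_100_000, 6_294),
    "Caenorhabditis elegans": (100_300_000, 19_000),
    "Homo sapiens": (3_289_000_000, 20_000),
}


def genes_per_megabase(genome_size_bp: int, gene_count: int) -> float:
    """Return the number of genes per million base pairs."""
    return gene_count / (genome_size_bp / 1_000_000)


density = {name: genes_per_megabase(size, genes) for name, (size, genes) in GENOMES.items()}
for name, value in density.items():
    print(f"{name}: {value:.1f} genes per Mb")
```

With these figures the density falls steadily from the bacterium to the human genome, even though the gene count rises, which is consistent with the growing share of noncoding and repetitive DNA described in the text.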


Genomic alterations (W)

All the cells of an organism originate from a single cell, so they are expected to have identical genomes; however, in some cases, differences arise. Both the process of copying DNA during cell division and exposure to environmental mutagens can result in mutations in somatic cells. In some cases, such mutations lead to cancer because they cause cells to divide more quickly and invade surrounding tissues. In certain lymphocytes in the human immune system, V(D)J recombination generates different genomic sequences such that each cell produces a unique antibody or T cell receptor.

During meiosis, diploid cells divide twice to produce haploid germ cells. During this process, recombination results in a reshuffling of the genetic material from homologous chromosomes so each gamete has a unique genome.


Genome-wide reprogramming (W)

Genome-wide reprogramming in mouse primordial germ cells involves epigenetic imprint erasure leading to totipotency. Reprogramming is facilitated by active DNA demethylation, a process that entails the DNA base excision repair pathway. This pathway is employed in the erasure of CpG methylation (5mC) in primordial germ cells. The erasure of 5mC occurs via its conversion to 5-hydroxymethylcytosine (5hmC) driven by high levels of the ten-eleven dioxygenase enzymes TET1 and TET2.


Genome evolution (W)

Genomes are more than the sum of an organism's genes and have traits that may be measured and studied without reference to the details of any particular genes and their products. Researchers compare traits such as karyotype (chromosome number), genome size, gene order, codon usage bias, and GC-content to determine what mechanisms could have produced the great variety of genomes that exist today (for recent overviews, see Brown 2002; Saccone and Pesole 2003; Benfey and Protopapas 2004; Gibson and Muse 2004; Reese 2004; Gregory 2005).

Duplications play a major role in shaping the genome. Duplication may range from extension of short tandem repeats, to duplication of a cluster of genes, and all the way to duplication of entire chromosomes or even entire genomes. Such duplications are probably fundamental to the creation of genetic novelty.

Horizontal gene transfer is invoked to explain how there is often an extreme similarity between small portions of the genomes of two organisms that are otherwise very distantly related. Horizontal gene transfer seems to be common among many microbes. Also, eukaryotic cells seem to have experienced a transfer of some genetic material from their chloroplast and mitochondrial genomes to their nuclear chromosomes. Recent empirical data suggest that viruses and sub-viral RNA networks play a major driving role in generating genetic novelty and natural genome editing.


In fiction (W)

Works of science fiction illustrate concerns about the availability of genome sequences.

Michael Crichton's 1990 novel Jurassic Park and the subsequent film tell the story of a billionaire who creates a theme park of cloned dinosaurs on a remote island, with disastrous outcomes. A geneticist extracts dinosaur DNA from the blood of ancient mosquitoes and fills in the gaps with DNA from modern species to create several species of dinosaurs. A chaos theorist is asked to give his expert opinion on the safety of engineering an ecosystem with the dinosaurs, and he repeatedly warns that the outcomes of the project will be unpredictable and ultimately uncontrollable. These warnings about the perils of using genomic information are a major theme of the book.

The 1997 film Gattaca is set in a futurist society where genomes of children are engineered to contain the most ideal combination of their parents' traits, and metrics such as risk of heart disease and predicted life expectancy are documented for each person based on their genome. People conceived outside of the eugenics program, known as "In-Valids," suffer discrimination and are relegated to menial occupations. The protagonist of the film is an In-Valid who works to defy the supposed genetic odds and achieve his dream of working as a space navigator. The film warns against a future where genomic information fuels prejudice and extreme class differences between those who can and can't afford genetically engineered children.


Human genome (B)

DNA; human genome
The human genome is made up of approximately three billion base pairs of deoxyribonucleic acid (DNA). The bases of DNA are adenine (A), thymine (T), guanine (G), and cytosine (C).

Human genome, all of the approximately three billion base pairs of deoxyribonucleic acid (DNA) that make up the entire set of chromosomes of the human organism. The human genome includes the coding regions of DNA, which encode all the genes (between 20,000 and 25,000) of the human organism, as well as the noncoding regions of DNA, which do not encode any genes. By 2003 the DNA sequence of the entire human genome was known.

The human genome, like the genomes of all other living animals, is a collection of long polymers of DNA. These polymers are maintained in duplicate copy in the form of chromosomes in every human cell and encode in their sequence of constituent bases (guanine [G], adenine [A], thymine [T], and cytosine [C]) the details of the molecular and physical characteristics that form the corresponding organism. The sequence of these polymers, their organization and structure, and the chemical modifications they contain not only provide the machinery needed to express the information held within the genome but also provide the genome with the capability to replicate, repair, package, and otherwise maintain itself. In addition, the genome is essential for the survival of the human organism; without it no cell or tissue could live beyond a short period of time. For example, red blood cells (erythrocytes), which live for only about 120 days, and skin cells, which on average live for only about 17 days, must be renewed to maintain the viability of the human body, and it is within the genome that the fundamental information for the renewal of these cells, and many other types of cells, is found.

The human genome is not uniform. Excepting identical (monozygous) twins, no two humans on Earth share exactly the same genomic sequence. Further, the human genome is not static. Subtle and sometimes not so subtle changes arise with startling frequency. Some of these changes are neutral or even advantageous; these are passed from parent to child and eventually become commonplace in the population. Other changes may be detrimental, resulting in reduced survival or decreased fertility of those individuals who harbour them; these changes tend to be rare in the population. The genome of modern humans, therefore, is a record of the trials and successes of the generations that have come before. Reflected in the variation of the modern genome is the range of diversity that underlies what are typical traits of the human species. There is also evidence in the human genome of the continuing burden of detrimental variations that sometimes lead to disease.

Knowledge of the human genome provides an understanding of the origin of the human species, the relationships between subpopulations of humans, and the health tendencies or disease risks of individual humans. Indeed, in the past 20 years knowledge of the sequence and structure of the human genome has revolutionized many fields of study, including medicine, anthropology, and forensics. With technological advances that enable inexpensive and expanded access to genomic information, the amount of, and the potential applications for, the information that is extracted from the human genome are extraordinary.

Role Of The Human Genome In Research (B)

Since the 1980s there has been an explosion in genetic and genomic research. The combination of the discovery of the polymerase chain reaction, improvements in DNA sequencing technologies, advances in bioinformatics (mathematical biological analysis), and increased availability of faster, cheaper computing power has given scientists the ability to discern and interpret vast amounts of genetic information from tiny samples of biological material. Further, methodologies such as fluorescence in situ hybridization (FISH) and comparative genomic hybridization (CGH) have enabled the detection of the organization and copy number of specific sequences in a given genome.


📹 Learn how DNA thermal cycler employs polymerase chain reaction to copy DNA strands (VIDEO)

Specific segments of DNA are amplified (copied) in a laboratory using polymerase chain reaction (PCR) techniques


Understanding the origin of the human genome is of particular interest to many researchers since the genome is indicative of the evolution of humans. The public availability of full or almost full genomic sequence databases for humans and a multitude of other species has allowed researchers to compare and contrast genomic information between individuals, populations, and species. From the similarities and differences observed, it is possible to track the origins of the human genome and to see evidence of how the human species has expanded and migrated to occupy the planet.


Origins Of The Human Genome (B)

Comparisons of specific DNA sequences between humans and their closest living relative, the chimpanzee, reveal 99 percent identity, although the homology drops to 96 percent if insertions and deletions in the organization of those sequences are taken into account. This degree of sequence variation between humans and chimpanzees is only about 10-fold greater than that seen between two unrelated humans. From comparisons of the human genome with the genomes of other species, it is clear that the genome of modern humans shares common ancestry with the genomes of all other animals on the planet and that the modern human genome arose between 150,000 and 300,000 years ago.
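
The gap between the 99 percent and 96 percent figures comes from whether insertions and deletions are counted. The sketch below illustrates that distinction on two already-aligned sequences, where "-" marks a position missing from one sequence. It is one possible convention for percent identity, not the method used in the studies cited; the function name and the toy alignment are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, count_gaps: bool) -> float:
    """Percent of compared alignment columns in which the two sequences agree.

    With count_gaps=False, columns containing a gap ("-") are skipped.
    With count_gaps=True, a column with a gap in one sequence counts as a difference.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" and base_b == "-":
            continue  # column carries no information about either sequence
        has_gap = base_a == "-" or base_b == "-"
        if has_gap and not count_gaps:
            continue
        compared += 1
        if base_a == base_b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Hypothetical toy alignment: one substitution and one single-base deletion.
seq_a = "ACGTACGTAC"
seq_b = "ACGTTCG-AC"
print(percent_identity(seq_a, seq_b, count_gaps=False))
print(percent_identity(seq_a, seq_b, count_gaps=True))
```

On the toy alignment the identity drops when the gap column is counted, mirroring in miniature the drop reported for the human and chimpanzee comparison.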

Ongoing collaboration between archaeologists, anthropologists, and molecular geneticists at the Max Planck Institute in Germany and the Lawrence Berkeley National Laboratory and the Joint Genome Institute in the United States has enabled sequence comparisons between modern humans (Homo sapiens) and Neanderthals (H. neanderthalensis). The data obtained so far demonstrate that modern humans and Neanderthals share about 99.5 percent genome sequence identity; some scientists have claimed that sequence identity may actually be as high as 99.9 percent.

Research suggests that populations of H. sapiens split from H. neanderthalensis ancestral populations perhaps as recently as 370,000 years ago and likely shared a common ancestor some 500,000–700,000 years ago. Genomic studies have indicated that there was almost no interbreeding between H. sapiens and H. neanderthalensis. This suggests that when Neanderthals, the last of the Homo relatives of modern humans, became extinct about 30,000 years ago, only modern humans were left to populate Earth. However, other research has revealed that modern H. sapiens in Eurasia, specifically peoples in Europe, China, and Papua New Guinea, have genomes that are more similar to the Neanderthal genome than they are to the genomes of modern H. sapiens in Africa. Scientists estimate that 1 to 4 percent of DNA of modern Eurasians is shared with Neanderthals, a level of similarity that is not found between Neanderthals and modern Africans. These findings indicate that limited interbreeding and gene flow took place between Neanderthals and ancestral H. sapiens populations after the latter migrated out of Africa but before they dispersed to other parts of the world.

Comparing the DNA sequences of groups of modern humans from different continents also allows scientists to define the relationships and even the ages of these different populations. By combining these genetic data with archeological and linguistic information, anthropologists have been able to discern the origins of Homo sapiens in Africa and to track the timing and location of the waves of human migration out of Africa that led to the eventual spread of humans to other continents of the globe. For example, genetic evidence indicates that the first humans migrated out of Africa approximately 60,000 years ago, settling in southern Europe, the Middle East, southern Asia, and Australia. From there, subsequent and sequential migrations brought humans to northern Eurasia and across what was then a land bridge to North America and finally to South America.

As humans migrated across the continents, sequence variations arose that became differentially fixed in different populations. Some variations likely reflect what are called founder effects, changes in gene frequency that occur in small populations. Founder effects are generally characterized by genes that are expressed with increasing frequency from one generation to the next and can be traced back to the original founders of the population. Other variations reflect differential selective pressures at work. For example, populations living in equatorial climates were under strong selective pressure that favoured dark skin colour to protect against extreme sun exposure, thereby decreasing the deleterious health effects caused by sunburn and skin cancer. In contrast, populations migrating to more polar latitudes, where levels of sun exposure are relatively low, experienced strong selective pressure that favoured light skin colour, thereby facilitating the absorption of sunlight by the skin for the synthesis of vitamin D. In northern Europe and Scandinavia, therefore, individuals with genetic variations leading to lighter skin colour were less likely to become vitamin D deficient and suffer from the bone disease known as rickets.


The global distribution of human skin colour is a well-defined example of genetic variation in which differential selective pressures favoured different characteristics in skin colour that conferred a survival advantage. Selective pressures for skin colour correlate with regional climate factors, such as latitude and sunlight. For example, the first populations of humans to settle in northern regions of the world were under selective pressure that favoured light skin colour to facilitate the absorption of sunlight, thereby preventing premature death from debilitating bone diseases.


Social Impacts Of Human Genome Research (B)

Databases have been compiled that list and summarize specific DNA variations that are common in certain human populations but not in others. Because the underlying DNA sequences are passed from parent to child in a stable manner, these genetic variations provide a tool for distinguishing the members of one population from those of the other. Public genetic ancestry projects, in which small samples of DNA can be submitted and analyzed, have allowed individuals to trace the continental or even subcontinental origins of their most ancient ancestors.

The role of genetics in defining traits and health risks for individuals has been recognized for generations. Long before DNA or genomes were understood, it was clear that many traits tended to run in families and that family history was one of the strongest predictors of health or disease. Knowledge of the human genome has advanced that realization, enabling studies that have identified the genes and even specific sequence variations that contribute to a multitude of traits and disease risks. With this information in hand, health care professionals are able to practice predictive medicine, which translates in the best of scenarios to preventative medicine. Indeed, presymptomatic genetic diagnoses have enabled countless people to live longer and healthier lives. For example, mutations responsible for familial cancers of the breast and colon have been identified, enabling presymptomatic testing of individuals in at-risk families. Individuals who carry the mutant gene or genes are counseled to seek heightened surveillance. In this way, if and when cancer appears, these individuals can be diagnosed early, when the cancers are most effectively treated.



  Human genome (W)

Graphical representation of the idealized human diploid karyotype, showing the organization of the genome into chromosomes. This drawing shows both the female (XX) and male (XY) versions of the 23rd chromosome pair. Chromosomes are shown aligned at their centromeres. The mitochondrial DNA is not shown.

The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome. Human genomes include both protein-coding DNA genes and noncoding DNA.

Haploid human genomes, which are contained in germ cells (the egg and sperm gamete cells created in the meiosis phase of sexual reproduction before fertilization creates a zygote) consist of three billion DNA base pairs, while diploid genomes (found in somatic cells) have twice the DNA content.

While there are significant differences among the genomes of human individuals (on the order of 0.1% due to single-nucleotide variants and 0.6% when considering indels), these are considerably smaller than the differences between humans and their closest living relatives, the bonobos and chimpanzees (~1.1% fixed single-nucleotide variants and 4% when including indels).

The first human genome sequences were published in nearly complete draft form in February 2001 by the Human Genome Project and Celera Corporation. Completion of the Human Genome Project's sequencing effort was announced in 2004 with the publication of a draft genome sequence, leaving just 341 gaps in the sequence, representing highly-repetitive and other DNA that could not be sequenced with the technology available at the time. The human genome was the first of all vertebrates to be sequenced to such near-completion, and as of 2018, the diploid genomes of over a million individual humans had been determined using next-generation sequencing. These data are used worldwide in biomedical science, anthropology, forensics and other branches of science. Such genomic studies have led to advances in the diagnosis and treatment of diseases, and to new insights in many fields of biology, including human evolution.

Although the sequence of the human genome has been (almost) completely determined by DNA sequencing, it is not yet fully understood. Most (though probably not all) genes have been identified by a combination of high throughput experimental and bioinformatics approaches, yet much work still needs to be done to further elucidate the biological functions of their protein and RNA products. Recent results suggest that most of the vast quantities of noncoding DNA within the genome have associated biochemical activities, including regulation of gene expression, organization of chromosome architecture, and signals controlling epigenetic inheritance.

Prior to the acquisition of the full genome sequence, estimates of the number of human genes ranged from 50,000 to 140,000 (with occasional vagueness about whether these estimates included non-protein coding genes). As genome sequence quality and the methods for identifying protein-coding genes improved, the count of recognized protein-coding genes dropped to 19,000-20,000. However, a fuller understanding of the role played by sequences that do not encode proteins, but instead express regulatory RNA, has raised the total number of genes to at least 46,831, plus another 2300 micro-RNA genes. By 2012, functional DNA elements that encode neither RNA nor proteins had been noted, and a 2018 population survey found additional sequence equivalent to another 10% of the human genome. Protein-coding sequences account for only a very small fraction of the genome (approximately 1.5%), and the rest is associated with non-coding RNA genes, regulatory DNA sequences, LINEs, SINEs, introns, and sequences for which as yet no function has been determined.

In June 2016, scientists formally announced HGP-Write, a plan to synthesize the human genome.

Molecular organization and gene content (W)

The total length of the human genome is over 3 billion base pairs. The genome is organized into 22 paired chromosomes, termed autosomes, plus the 23rd pair of sex chromosomes (XX) in the female, and (XY) in the male. These are all large linear DNA molecules contained within the cell nucleus. The genome also includes the mitochondrial DNA, a comparatively small circular molecule present in each mitochondrion. Basic information about these molecules and their gene content, based on a reference genome that does not represent the sequence of any specific individual, is provided in the following table. (Data source: Ensembl genome browser release 87, December 2016 for most values; Ensembl genome browser release 68, July 2012 for miRNA, rRNA, snRNA, snoRNA.)

Chromosome | Length (mm) | Base pairs | Variations | Protein-coding genes | Pseudogenes | Total long ncRNA | Total small ncRNA | miRNA | rRNA | snRNA | snoRNA | Misc ncRNA | Links | Centromere position (Mbp) | Cumulative (%)
1 | 85 | 248,956,422 | 12,151,146 | 2058 | 1220 | 1200 | 496 | 134 | 66 | 221 | 145 | 192 | EBI | 125 | 7.9
2 | 83 | 242,193,529 | 12,945,965 | 1309 | 1023 | 1037 | 375 | 115 | 40 | 161 | 117 | 176 | EBI | 93.3 | 16.2
3 | 67 | 198,295,559 | 10,638,715 | 1078 | 763 | 711 | 298 | 99 | 29 | 138 | 87 | 134 | EBI | 91 | 23
4 | 65 | 190,214,555 | 10,165,685 | 752 | 727 | 657 | 228 | 92 | 24 | 120 | 56 | 104 | EBI | 50.4 | 29.6
5 | 62 | 181,538,259 | 9,519,995 | 876 | 721 | 844 | 235 | 83 | 25 | 106 | 61 | 119 | EBI | 48.4 | 35.8
6 | 58 | 170,805,979 | 9,130,476 | 1048 | 801 | 639 | 234 | 81 | 26 | 111 | 73 | 105 | EBI | 61 | 41.6
7 | 54 | 159,345,973 | 8,613,298 | 989 | 885 | 605 | 208 | 90 | 24 | 90 | 76 | 143 | EBI | 59.9 | 47.1
8 | 50 | 145,138,636 | 8,221,520 | 677 | 613 | 735 | 214 | 80 | 28 | 86 | 52 | 82 | EBI | 45.6 | 52
9 | 48 | 138,394,717 | 6,590,811 | 786 | 661 | 491 | 190 | 69 | 19 | 66 | 51 | 96 | EBI | 49 | 56.3
10 | 46 | 133,797,422 | 7,223,944 | 733 | 568 | 579 | 204 | 64 | 32 | 87 | 56 | 89 | EBI | 40.2 | 60.9
11 | 46 | 135,086,622 | 7,535,370 | 1298 | 821 | 710 | 233 | 63 | 24 | 74 | 76 | 97 | EBI | 53.7 | 65.4
12 | 45 | 133,275,309 | 7,228,129 | 1034 | 617 | 848 | 227 | 72 | 27 | 106 | 62 | 115 | EBI | 35.8 | 70
13 | 39 | 114,364,328 | 5,082,574 | 327 | 372 | 397 | 104 | 42 | 16 | 45 | 34 | 75 | EBI | 17.9 | 73.4
14 | 36 | 107,043,718 | 4,865,950 | 830 | 523 | 533 | 239 | 92 | 10 | 65 | 97 | 79 | EBI | 17.6 | 76.4
15 | 35 | 101,991,189 | 4,515,076 | 613 | 510 | 639 | 250 | 78 | 13 | 63 | 136 | 93 | EBI | 19 | 79.3
16 | 31 | 90,338,345 | 5,101,702 | 873 | 465 | 799 | 187 | 52 | 32 | 53 | 58 | 51 | EBI | 36.6 | 82
17 | 28 | 83,257,441 | 4,614,972 | 1197 | 531 | 834 | 235 | 61 | 15 | 80 | 71 | 99 | EBI | 24 | 84.8
18 | 27 | 80,373,285 | 4,035,966 | 270 | 247 | 453 | 109 | 32 | 13 | 51 | 36 | 41 | EBI | 17.2 | 87.4
19 | 20 | 58,617,616 | 3,858,269 | 1472 | 512 | 628 | 179 | 110 | 13 | 29 | 31 | 61 | EBI | 26.5 | 89.3
20 | 21 | 64,444,167 | 3,439,621 | 544 | 249 | 384 | 131 | 57 | 15 | 46 | 37 | 68 | EBI | 27.5 | 91.4
21 | 16 | 46,709,983 | 2,049,697 | 234 | 185 | 305 | 71 | 16 | 5 | 21 | 19 | 24 | EBI | 13.2 | 92.6
22 | 17 | 50,818,468 | 2,135,311 | 488 | 324 | 357 | 78 | 31 | 5 | 23 | 23 | 62 | EBI | 14.7 | 93.8
X | 53 | 156,040,895 | 5,753,881 | 842 | 874 | 271 | 258 | 128 | 22 | 85 | 64 | 100 | EBI | 60.6 | 99.1
Y | 20 | 57,227,415 | 211,643 | 71 | 388 | 71 | 30 | 15 | 7 | 17 | 3 | 8 | EBI | 10.4 | 100
mtDNA | 0.0054 | 16,569 | 929 | 13 | 0 | 0 | 24 | 0 | 2 | 0 | 0 | 0 | EBI | N/A | 100
total | | 3,088,286,401 | 155,630,645 | 20412 | 14600 | 14727 | 5037 | 1756 | 532 | 1944 | 1521 | 2213 | | |

Table 1 (above) summarizes the physical organization and gene content of the human reference genome, with links to the original analysis, as published in the Ensembl database at the European Bioinformatics Institute (EBI) and Wellcome Trust Sanger Institute. Chromosome lengths were estimated by multiplying the number of base pairs by 0.34 nanometers, the distance between base pairs in the DNA double helix. A recent estimation of human chromosome lengths based on updated data reports 205.00 cm for the diploid male genome and 208.23 cm for female, corresponding to weights of 6.41 and 6.51 picograms (pg), respectively. The number of proteins is based on the number of initial precursor mRNA transcripts, and does not include products of alternative pre-mRNA splicing, or modifications to protein structure that occur after translation.
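The length estimate described above is a single multiplication, sketched here for two chromosomes. The base-pair counts are copied from Table 1; the function name `contour_length_mm` is a hypothetical helper introduced only for this illustration.

```python
# Distance between adjacent base pairs in the double helix, as quoted in the text.
NM_PER_BASE_PAIR = 0.34

def contour_length_mm(base_pairs: int) -> float:
    """Return the stretched-out length of a DNA molecule in millimetres."""
    nanometres = base_pairs * NM_PER_BASE_PAIR
    return nanometres * 1e-6  # 1 mm = 1,000,000 nm

# Base-pair counts copied from Table 1.
chromosome_1 = 248_956_422
chromosome_21 = 46_709_983

print(round(contour_length_mm(chromosome_1)))   # 85
print(round(contour_length_mm(chromosome_21)))  # 16
```

Both rounded values agree with the length column of Table 1.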

The number of genes in the human genome is not entirely clear because the function of numerous transcripts remains unclear. This is especially true for non-coding RNA (see below). The number of protein-coding genes is better known but there are still on the order of 1,400 questionable genes which may or may not encode functional proteins, usually encoded by short open reading frames. Table 2 gives estimates from various projects and shows these discrepancies.

Table 2. Number of human genes in different databases as of July 2018
Gencode Ensembl RefSeq CHESS
protein-coding genes 19,901 20,376 20,345 21,306
lncRNA genes 15,779 14,720 17,712 18,484
antisense RNA 5501 28 2694
miscellaneous RNA 2213 2222 13,899 4347
Pseudogenes 14,723 1740 15,952
total transcripts 203,835 203,903 154,484 328,827

Variations are unique DNA sequence differences that have been identified in the individual human genome sequences analyzed by Ensembl as of December 2016. The number of identified variations is expected to increase as further personal genomes are sequenced and analyzed. In addition to the gene content shown in this table, a large number of non-expressed functional sequences have been identified throughout the human genome (see below). Links open windows to the reference chromosome sequences in the EBI genome browser.

Small non-coding RNAs are RNAs of as many as 200 bases that do not have protein-coding potential. These include: microRNAs, or miRNAs (post-transcriptional regulators of gene expression), small nuclear RNAs, or snRNAs (the RNA components of spliceosomes), and small nucleolar RNAs, or snoRNAs (involved in guiding chemical modifications to other RNA molecules). Long non-coding RNAs are RNA molecules longer than 200 bases that do not have protein-coding potential. These include: ribosomal RNAs, or rRNAs (the RNA components of ribosomes), and a variety of other long RNAs that are involved in regulation of gene expression, epigenetic modifications of DNA nucleotides and histone proteins, and regulation of the activity of protein-coding genes. Small discrepancies between total-small-ncRNA numbers and the numbers of specific types of small ncRNAs result from the former values being sourced from Ensembl release 87 and the latter from Ensembl release 68.


Completeness of the human genome sequence (W)

Although the human genome has been completely sequenced for some practical purposes, there are still hundreds of gaps in the sequence and an uncertainty of about 5–10% (300 million base pairs added in 2018). A study published in 2015 noted more than 160 euchromatic gaps, of which 50 were closed. However, there are still numerous gaps in the heterochromatic parts of the genome, which are much harder to sequence due to numerous repeats and other intractable sequence features.


Information content (W)

The haploid human genome (23 chromosomes) is about 3 billion base pairs long and contains around 30,000 genes. Since every base pair can be coded by 2 bits, this is about 750 megabytes of data. An individual somatic (diploid) cell contains twice this amount, that is, about 6 billion base pairs. Men have fewer than women because the Y chromosome is about 57 million base pairs whereas the X is about 156 million, but in terms of information men have more because the second X contains almost the same information as the first. Since individual genomes vary in sequence by less than 1% from each other, the variations of a given human's genome from a common reference can be losslessly compressed to roughly 4 megabytes.
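The 750-megabyte figure follows from simple arithmetic, sketched below under the assumption of decimal megabytes (1 MB = 1,000,000 bytes) and the rounded 3-billion base-pair count used in the text. The X and Y sizes are the Table 1 values; all names in the code are illustrative.

```python
BITS_PER_BASE = 2                   # four symbols (A, C, G, T) need log2(4) = 2 bits
HAPLOID_BASE_PAIRS = 3_000_000_000  # rounded figure used in the text

def uncompressed_megabytes(base_pairs: int) -> float:
    """Size of a 2-bit-per-base encoding in decimal megabytes."""
    bits = base_pairs * BITS_PER_BASE
    return bits / 8 / 1_000_000

print(uncompressed_megabytes(HAPLOID_BASE_PAIRS))      # 750.0
print(uncompressed_megabytes(2 * HAPLOID_BASE_PAIRS))  # 1500.0

# Base-pair difference between the X and Y chromosomes (Table 1 values),
# which is why an XY cell holds slightly less DNA than an XX cell.
X_BASE_PAIRS = 156_040_895
Y_BASE_PAIRS = 57_227_415
print(X_BASE_PAIRS - Y_BASE_PAIRS)  # 98813480
```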

The entropy rate of the genome differs significantly between coding and non-coding sequences. It is close to the maximum of 2 bits per base pair for the coding sequences (about 45 million base pairs), but less for the non-coding parts. It ranges between 1.5 and 1.9 bits per base pair for the individual chromosomes, except for the Y chromosome, which has an entropy rate below 0.9 bits per base pair.
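To make the 2-bit maximum concrete, the sketch below computes the Shannon entropy of single-base frequencies for two toy sequences. This is a simplification: it ignores correlations between neighbouring bases, so it only bounds the entropy rate from above and is not the estimator behind the chromosome figures quoted here. The function name and sequences are illustrative.

```python
from collections import Counter
from math import log2

def base_entropy(sequence: str) -> float:
    """Shannon entropy, in bits per base, of the single-base frequencies."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * log2(n / total) for n in counts.values())

uniform = "ACGT" * 25  # all four bases equally frequent
skewed = "AAAT" * 25   # 75% A, 25% T, as in a low-complexity repeat

print(base_entropy(uniform))           # 2.0
print(round(base_entropy(skewed), 3))  # 0.811
```

A sequence dominated by a few bases or by repeats carries less information per base, which is consistent with the lower values reported for repeat-rich regions.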


Diagram showing the number of base pairs on each chromosome in green.


Coding vs. noncoding DNA (W)

The content of the human genome is commonly divided into coding and noncoding DNA sequences. Coding DNA is defined as those sequences that can be transcribed into mRNA and translated into proteins during the human life cycle; these sequences occupy only a small fraction of the genome (<2%). Noncoding DNA is made up of all of those sequences (ca. 98% of the genome) that are not used to encode proteins.

Some noncoding DNA contains genes for RNA molecules with important biological functions (noncoding RNA, for example ribosomal RNA and transfer RNA). The exploration of the function and evolutionary origin of noncoding DNA is an important goal of contemporary genome research, including the ENCODE (Encyclopedia of DNA Elements) project, which aims to survey the entire human genome, using a variety of experimental tools whose results are indicative of molecular activity.

Because non-coding DNA greatly outnumbers coding DNA, the concept of the sequenced genome has become a more focused analytical concept than the classical concept of the DNA-coding gene.



Coding sequences (protein-coding genes) (W)

Human genes categorized by function of the transcribed proteins, given both as number of encoding genes and percentage of all genes.

Protein-coding sequences represent the most widely studied and best understood component of the human genome. These sequences ultimately lead to the production of all human proteins, although several biological processes (e.g. DNA rearrangements and alternative pre-mRNA splicing) can lead to the production of many more unique proteins than the number of protein-coding genes. The complete modular protein-coding capacity of the genome is contained within the exome, and consists of DNA sequences encoded by exons that can be translated into proteins. Because of its biological importance, and the fact that it constitutes less than 2% of the genome, sequencing of the exome was the first major milepost of the Human Genome Project.

Number of protein-coding genes. About 20,000 human proteins have been annotated in databases such as Uniprot. Historically, estimates for the number of protein genes have varied widely, ranging up to 2,000,000 in the late 1960s, but several researchers pointed out in the early 1970s that the estimated mutational load from deleterious mutations placed an upper limit of approximately 40,000 for the total number of functional loci (this includes protein-coding and functional non-coding genes). The number of human protein-coding genes is not significantly larger than that of many less complex organisms, such as the roundworm and the fruit fly. This difference may result from the extensive use of alternative pre-mRNA splicing in humans, which provides the ability to build a very large number of modular proteins through the selective incorporation of exons.

Protein-coding capacity per chromosome. Protein-coding genes are distributed unevenly across the chromosomes, ranging from a few dozen to more than 2000, with an especially high gene density within chromosomes 19, 11, and 1 (Table 1). Each chromosome contains various gene-rich and gene-poor regions, which may be correlated with chromosome bands and GC-content. The significance of these nonrandom patterns of gene density is not well understood.

Size of protein-coding genes. The size of protein-coding genes within the human genome shows enormous variability (Table 2). The median size of a protein-coding gene is 26,288 bp (mean = 66,577 bp). For example, the gene for histone H1a (HIST1H1A) is relatively small and simple, lacking introns and encoding an mRNA of 781 nt and a 215 amino acid protein (648 nt open reading frame). Dystrophin (DMD) is the largest protein-coding gene in the human reference genome, spanning a total of 2.2 Mb, while titin (TTN) has the longest coding sequence (114,414 bp), the largest number of exons (363), and the longest single exon (17,106 bp). Over the whole genome, the median size of an exon is 122 bp (mean = 145 bp), the median number of exons is 7 (mean = 8.8), and the median coding sequence encodes 367 amino acids (mean = 447 amino acids).

Protein | Chrom | Gene | Length (bp) | Exons | Exon length (bp) | Intron length (bp) | Alt splicing
Breast cancer type 2 susceptibility protein | 13 | BRCA2 | 83,736 | 27 | 11,386 | 72,350 | yes
Cystic fibrosis transmembrane conductance regulator | 7 | CFTR | 202,881 | 27 | 4,440 | 198,441 | yes
Cytochrome b | MT | MTCYB | 1,140 | 1 | 1,140 | 0 | no
Dystrophin | X | DMD | 2,220,381 | 79 | 10,500 | 2,209,881 | yes
Glyceraldehyde-3-phosphate dehydrogenase | 12 | GAPDH | 4,444 | 9 | 1,425 | 3,019 | yes
Hemoglobin beta subunit | 11 | HBB | 1,605 | 3 | 626 | 979 | no
Histone H1A | 6 | HIST1H1A | 781 | 1 | 781 | 0 | no
Titin | 2 | TTN | 281,434 | 364 | 104,301 | 177,133 | yes

Table 2. Examples of human protein-coding genes. Chrom, chromosome. Alt splicing, alternative pre-mRNA splicing. (Data source: Ensembl genome browser release 68, July 2012)

Recently, a systematic meta-analysis of updated data of the human genome found that the largest protein-coding gene in the human reference genome is RBFOX1 (RNA binding protein, fox-1 homolog 1), spanning a total of 2.47 Mb. Over the whole genome, considering a curated set of protein-coding genes, the median size of an exon is currently estimated to be 133 bp (mean = 309 bp), the median number of exons is currently estimated to be 8 (mean = 11), and the median coding sequence is currently estimated to encode 425 amino acids (mean = 553 amino acids).


Noncoding DNA (ncDNA) (W)

Noncoding DNA is defined as all of the DNA sequences within a genome that are not found within protein-coding exons, and so are never represented within the amino acid sequence of expressed proteins. By this definition, more than 98% of the human genome is composed of ncDNA.

Numerous classes of noncoding DNA have been identified, including genes for noncoding RNA (e.g. tRNA and rRNA), pseudogenes, introns, untranslated regions of mRNA, regulatory DNA sequences, repetitive DNA sequences, and sequences related to mobile genetic elements.

Numerous sequences that are included within genes are also defined as noncoding DNA. These include genes for noncoding RNA (e.g. tRNA, rRNA), and untranslated components of protein-coding genes (e.g. introns, and 5' and 3' untranslated regions of mRNA).

Protein-coding sequences (specifically, coding exons) constitute less than 1.5% of the human genome. In addition, about 26% of the human genome is introns. Aside from genes (exons and introns) and known regulatory sequences (8–20%), the human genome contains regions of noncoding DNA. The exact amount of noncoding DNA that plays a role in cell physiology has been hotly debated. Recent analysis by the ENCODE project indicates that 80% of the entire human genome is either transcribed, binds to regulatory proteins, or is associated with some other biochemical activity.

It remains controversial, however, whether all of this biochemical activity contributes to cell physiology, or whether a substantial portion of it is the result of transcriptional and biochemical noise, which must be actively filtered out by the organism. Excluding protein-coding sequences, introns, and regulatory regions, much of the non-coding DNA is composed of the classes of sequence described in the sections that follow. Many DNA sequences that do not play a role in gene expression have important biological functions. Comparative genomics studies indicate that about 5% of the genome contains sequences of noncoding DNA that are highly conserved, sometimes on time-scales representing hundreds of millions of years, implying that these noncoding regions are under strong evolutionary pressure and purifying selection.

Many of these sequences regulate the structure of chromosomes by limiting the regions of heterochromatin formation and regulating structural features of the chromosomes, such as the telomeres and centromeres. Other noncoding regions serve as origins of DNA replication. Finally, several regions are transcribed into functional noncoding RNAs that regulate the expression of protein-coding genes, mRNA translation and stability (see miRNA), chromatin structure (including histone modifications), DNA methylation, and DNA recombination, and that cross-regulate other noncoding RNAs. It is also likely that many transcribed noncoding regions do not serve any role and that this transcription is the product of non-specific RNA polymerase activity.



Pseudogenes (W)

Pseudogenes are inactive copies of protein-coding genes, often generated by gene duplication, that have become nonfunctional through the accumulation of inactivating mutations. Table 1 shows that the number of pseudogenes in the human genome is on the order of 13,000, and in some chromosomes is nearly the same as the number of functional protein-coding genes. Gene duplication is a major mechanism through which new genetic material is generated during molecular evolution.

The olfactory receptor gene family is one of the best-documented examples of pseudogenes in the human genome. More than 60 percent of the genes in this family are non-functional pseudogenes in humans. By comparison, only 20 percent of genes in the mouse olfactory receptor gene family are pseudogenes. Research suggests that this is a species-specific characteristic, as the most closely related primates all have proportionally fewer pseudogenes. This genetic discovery helps to explain the less acute sense of smell in humans relative to other mammals.


Genes for noncoding RNA (ncRNA) (W)

Noncoding RNA molecules play many essential roles in cells, especially in the many reactions of protein synthesis and RNA processing. Noncoding RNAs include tRNA, ribosomal RNA, microRNA, snRNA and other non-coding RNA genes, including about 60,000 long non-coding RNAs (lncRNAs). Although the number of reported lncRNA genes continues to rise and the exact number in the human genome is yet to be defined, many of them are argued to be non-functional.

Many ncRNAs are critical elements in gene regulation and expression. Noncoding RNA also contributes to epigenetics, transcription, RNA splicing, and the translational machinery. The role of RNA in genetic regulation and disease offers a new potential level of unexplored genomic complexity.


Introns and untranslated regions of mRNA (W)

In addition to the ncRNA molecules that are encoded by discrete genes, the initial transcripts of protein coding genes usually contain extensive noncoding sequences, in the form of introns, 5'-untranslated regions (5'-UTR), and 3'-untranslated regions (3'-UTR). Within most protein-coding genes of the human genome, the length of intron sequences is 10- to 100-times the length of exon sequences (Table 2).
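The intron-to-exon ratio can be read directly off the example genes in Table 2. The short sketch below copies four rows from that table; `intron_exon_ratio` is a hypothetical helper. Because these are hand-picked examples rather than a genome-wide sample, the ratios they give need not fall inside the range quoted above.

```python
def intron_exon_ratio(exon_bp: int, intron_bp: int) -> float:
    """Total intron length divided by total exon length for one gene."""
    return intron_bp / exon_bp

# (gene, total exon length, total intron length) in base pairs, from Table 2.
genes = [
    ("BRCA2", 11_386, 72_350),
    ("CFTR", 4_440, 198_441),
    ("DMD", 10_500, 2_209_881),
    ("HBB", 626, 979),
]

for name, exon_bp, intron_bp in genes:
    print(f"{name}: intron/exon = {intron_exon_ratio(exon_bp, intron_bp):.1f}")
# BRCA2: intron/exon = 6.4
# CFTR: intron/exon = 44.7
# DMD: intron/exon = 210.5
# HBB: intron/exon = 1.6
```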


Regulatory DNA sequences (W)

The human genome has many different regulatory sequences which are crucial to controlling gene expression. Conservative estimates indicate that these sequences make up 8% of the genome; however, extrapolations from the ENCODE project suggest that 20-40% of the genome is gene regulatory sequence. Some types of non-coding DNA are genetic "switches" (called enhancers) that do not encode proteins, but do regulate when and where genes are expressed.

Regulatory sequences have been known since the late 1960s. The first identification of regulatory sequences in the human genome relied on recombinant DNA technology. Later with the advent of genomic sequencing, the identification of these sequences could be inferred by evolutionary conservation. The evolutionary branch between the primates and mouse, for example, occurred 70–90 million years ago. So computer comparisons of gene sequences that identify conserved non-coding sequences will be an indication of their importance in duties such as gene regulation.

Other genomes have been sequenced with the same intention of aiding conservation-guided methods, for example the pufferfish genome. However, regulatory sequences disappear and re-evolve during evolution at a high rate.

As of 2012, the efforts have shifted toward finding interactions between DNA and regulatory proteins by the technique ChIP-Seq, or gaps where the DNA is not packaged by histones (DNase hypersensitive sites), both of which tell where there are active regulatory sequences in the investigated cell type.


Repetitive DNA sequences (W)

Repetitive DNA sequences comprise approximately 50% of the human genome.

About 8% of the human genome consists of tandem DNA arrays or tandem repeats, low complexity repeat sequences that have multiple adjacent copies (e.g. "CAGCAGCAG..."). The tandem sequences may be of variable lengths, from two nucleotides to tens of nucleotides. These sequences are highly variable, even among closely related individuals, and so are used for genealogical DNA testing and forensic DNA analysis.

Repeated sequences of fewer than ten nucleotides (e.g. the dinucleotide repeat (AC)n) are termed microsatellite sequences. Among the microsatellite sequences, trinucleotide repeats are of particular importance, as they sometimes occur within coding regions of genes for proteins and may lead to genetic disorders. For example, Huntington's disease results from an expansion of the trinucleotide repeat (CAG)n within the Huntingtin gene on human chromosome 4. Telomeres (the ends of linear chromosomes) end with a microsatellite hexanucleotide repeat of the sequence (TTAGGG)n.

Tandem repeats of longer sequences (arrays of repeated sequences 10–60 nucleotides long) are termed minisatellites.
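As an illustration of what "multiple adjacent copies" means in practice, the sketch below locates runs of a given repeat unit in a short made-up sequence using a regular expression. It is a toy under stated assumptions: the function `find_tandem_repeats`, the `min_copies` threshold, and the test sequence are all hypothetical, and real repeat-finding tools handle imperfect copies that this exact-match search would miss.

```python
import re

def find_tandem_repeats(sequence: str, unit: str, min_copies: int = 3):
    """Yield (start, copies) for each run of at least min_copies adjacent units."""
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_copies},}}")
    for match in pattern.finditer(sequence):
        yield match.start(), len(match.group()) // len(unit)

# Made-up sequence containing a (CAG)5 run and an (AC)3 run.
toy = "TTGA" + "CAG" * 5 + "TT" + "AC" * 3 + "GG"

print(list(find_tandem_repeats(toy, "CAG")))  # [(4, 5)]
print(list(find_tandem_repeats(toy, "AC")))   # [(21, 3)]
```

The copy number reported for each run is the quantity that varies between individuals and that expands in trinucleotide repeat disorders.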


Mobile genetic elements (transposons) and their relics (W)

Transposable genetic elements, DNA sequences that can replicate and insert copies of themselves at other locations within a host genome, are an abundant component in the human genome. The most abundant transposon lineage, Alu, has about 50,000 active copies, and can be inserted into intragenic and intergenic regions. One other lineage, LINE-1, has about 100 active copies per genome (the number varies between people). Together with non-functional relics of old transposons, they account for over half of total human DNA. Sometimes called "jumping genes", transposons have played a major role in sculpting the human genome. Some of these sequences represent endogenous retroviruses, DNA copies of viral sequences that have become permanently integrated into the genome and are now passed on to succeeding generations.

Mobile elements within the human genome can be classified into LTR retrotransposons (8.3% of total genome), SINEs (13.1% of total genome) including Alu elements, LINEs (20.4% of total genome), SVAs and Class II DNA transposons (2.9% of total genome).


Genomic variation in humans (W)



Human reference genome (W)

With the exception of identical twins, all humans show significant variation in genomic DNA sequences. The human reference genome (HRG) is used as a standard sequence reference.

There are several important points concerning the human reference genome:

  • The HRG is a haploid sequence. Each chromosome is represented once.
  • The HRG is a composite sequence, and does not correspond to any actual human individual.
  • The HRG is periodically updated to correct errors, ambiguities, and unknown "gaps".
  • The HRG in no way represents an "ideal" or "perfect" human individual. It is simply a standardized representation or model that is used for comparative purposes.

The Genome Reference Consortium is responsible for updating the HRG. Version 38 was released in December 2013.


Measuring human genetic variation

Measuring human genetic variation (W)

Most studies of human genetic variation have focused on single-nucleotide polymorphisms (SNPs), which are substitutions in individual bases along a chromosome. Most analyses estimate that SNPs occur 1 in 1000 base pairs, on average, in the euchromatic human genome, although they do not occur at a uniform density. Thus follows the popular statement that "we are all, regardless of race, genetically 99.9% the same", although this would be somewhat qualified by most geneticists. For example, a much larger fraction of the genome is now thought to be involved in copy number variation. A large-scale collaborative effort to catalog SNP variations in the human genome is being undertaken by the International HapMap Project.
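The arithmetic behind the "99.9% the same" statement can be sketched as follows. This is a rough illustration under stated assumptions: it counts only single-base substitutions between two already aligned sequences of equal length, and it ignores copy number variation, which the paragraph above notes is a major qualification. The function name and the toy sequences are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two aligned sequences carry the same base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

# One SNP per 1000 base pairs, the average quoted in the text.
snp_rate = 1 / 1000
print(f"{100 * (1 - snp_rate):.1f}%")   # 99.9%

# Toy example: two 1000-base sequences differing at a single position.
person_a = "A" * 1000
person_b = "A" * 999 + "G"
print(percent_identity(person_a, person_b))
```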

The genomic loci and length of certain types of small repetitive sequences are highly variable from person to person, which is the basis of DNA fingerprinting and DNA paternity testing technologies. The heterochromatic portions of the human genome, which total several hundred million base pairs, are also thought to be quite variable within the human population (they are so repetitive and so long that they cannot be accurately sequenced with current technology). These regions contain few genes, and it is unclear whether any significant phenotypic effect results from typical variation in repeats or heterochromatin.

Most gross genomic mutations in gamete germ cells probably result in inviable embryos; however, a number of human diseases are related to large-scale genomic abnormalities. Down syndrome, Turner syndrome, and a number of other diseases result from nondisjunction of entire chromosomes. Cancer cells frequently have aneuploidy of chromosomes and chromosome arms, although a cause-and-effect relationship between aneuploidy and cancer has not been established.


Mapping human genomic variation

Mapping human genomic variation (W)

Whereas a genome sequence lists the order of every DNA base in a genome, a genome map identifies the landmarks. A genome map is less detailed than a genome sequence and aids in navigating around the genome.

An example of a variation map is the HapMap being developed by the International HapMap Project. The HapMap is a haplotype map of the human genome, "which will describe the common patterns of human DNA sequence variation." It catalogs the patterns of small-scale variations in the genome that involve single DNA letters, or bases.

Researchers published the first sequence-based map of large-scale structural variation across the human genome in the journal Nature in May 2008. Large-scale structural variations are differences in the genome among people that range from a few thousand to a few million DNA bases; some are gains or losses of stretches of genome sequence and others appear as re-arrangements of stretches of sequence. These variations include differences in the number of copies individuals have of a particular gene, deletions, translocations and inversions.


SNP frequency across the human genome

SNP frequency across the human genome (W)

Single-nucleotide polymorphisms (SNPs) do not occur homogeneously across the human genome. In fact, there is enormous diversity in SNP frequency between genes, reflecting different selective pressures on each gene as well as different mutation and recombination rates across the genome. However, because studies on SNPs are biased towards coding regions, the data generated from them are unlikely to reflect the overall distribution of SNPs throughout the genome. Therefore, the SNP Consortium protocol was designed to identify SNPs with no bias towards coding regions, and the Consortium's 100,000 SNPs generally reflect sequence diversity across the human chromosomes. The SNP Consortium aims to expand the number of SNPs identified across the genome to 300,000 by the end of the first quarter of 2001.

Changes in non-coding sequence and synonymous changes in coding sequence are generally more common than non-synonymous changes, reflecting greater selective pressure reducing diversity at positions dictating amino acid identity. Transitional changes are more common than transversions, with CpG dinucleotides showing the highest mutation rate, presumably due to deamination.
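The distinction between transitions and transversions can be made concrete with a small sketch. A transition swaps a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T); every other single-base substitution is a transversion. The function name is an assumption made for this illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    pair = {ref, alt}
    # Same chemical class on both sides means a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"

print(classify_substitution("C", "T"))   # transition
print(classify_substitution("A", "C"))   # transversion
```

Of the twelve possible directed substitutions, only four are transitions and eight are transversions, which is why the observation that transitions are nevertheless more common points to a mutational bias.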


TSC (The SNP Consortium) SNP distribution along the long arm of chromosome 22. Each column represents a 1 Mb interval; the approximate cytogenetic position is given on the x-axis. Clear peaks and troughs of SNP density can be seen, possibly reflecting different rates of mutation, recombination and selection.


Personal genomes

Personal genomes (W)

A personal genome sequence is a (nearly) complete sequence of the chemical base pairs that make up the DNA of a single person. Because medical treatments have different effects on different people due to genetic variations such as single-nucleotide polymorphisms (SNPs), the analysis of personal genomes may lead to personalized medical treatment based on individual genotypes.

The first personal genome sequence to be determined was that of Craig Venter in 2007. Personal genomes had not been sequenced in the public Human Genome Project, in order to protect the identity of the volunteers who provided DNA samples. That sequence was derived from the DNA of several volunteers from a diverse population. However, early in the Venter-led Celera Genomics genome sequencing effort, the decision was made to switch from sequencing a composite sample to using DNA from a single individual, later revealed to have been Venter himself. Thus the Celera human genome sequence released in 2000 was largely that of one man. Subsequent replacement of the early composite-derived data and determination of the diploid sequence, representing both sets of chromosomes rather than the haploid sequence originally reported, allowed the release of the first personal genome. In April 2008, that of James Watson was also completed. Since then hundreds of personal genome sequences have been released, including those of Desmond Tutu and of a Paleo-Eskimo. In 2012, the whole genome sequences of two family trios among 1092 genomes were made public. In November 2013, a Spanish family made four personal exome datasets (about 1% of the genome) publicly available under a Creative Commons public domain license. The Personal Genome Project (started in 2005) is among the few to make both genome sequences and corresponding medical phenotypes publicly available.

The sequencing of individual genomes further unveiled levels of genetic complexity that had not been appreciated before. Personal genomics helped reveal the significant level of diversity in the human genome, attributable not only to SNPs but to structural variations as well. However, the application of such knowledge to the treatment of disease and in the medical field is only at its very beginning. Exome sequencing has become increasingly popular.


Human knockouts

Human knockouts (W)

In humans, gene knockouts naturally occur as heterozygous or homozygous loss-of-function gene knockouts. These knockouts are often difficult to distinguish, especially within heterogeneous genetic backgrounds. They are also difficult to find as they occur in low frequencies.

Populations with a high level of parental-relatedness result in a larger number of homozygous gene knockouts as compared to outbred populations.

Populations with high rates of consanguinity, such as countries with high rates of first-cousin marriages, display the highest frequencies of homozygous gene knockouts. Such populations include Pakistan, Iceland, and Amish populations. These populations with a high level of parental relatedness have been subjects of human knockout research, which has helped to determine the function of specific genes in humans. By distinguishing specific knockouts, researchers are able to use phenotypic analyses of these individuals to help characterize the gene that has been knocked out.

Knockouts in specific genes can cause genetic diseases, can have beneficial effects, or may result in no phenotypic effect at all. However, determining a knockout's phenotypic effect in humans can be challenging. Challenges to characterizing and clinically interpreting knockouts include the difficulty of calling DNA variants, determining whether protein function is disrupted (annotation), and considering how much mosaicism influences the phenotype.

A pedigree displaying a first-cousin mating (both partners are carriers of a heterozygous knockout; the mating is marked by a double line) leading to offspring possessing a homozygous gene knockout.
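The outcome of such a carrier-by-carrier mating can be sketched by enumerating the allele combinations. This is a minimal illustration assuming simple Mendelian inheritance, in which each parent passes on either of its two alleles with equal probability. The allele symbols "K" (functional) and "k" (knockout) and the function name are hypothetical labels chosen for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def offspring_genotypes(parent_a, parent_b):
    """Return genotype -> probability for the offspring of two parents.

    Each parent is given as a two-character string of alleles, e.g. "Kk".
    """
    # Every pairing of one allele from each parent is equally likely.
    counts = Counter("".join(sorted(pair)) for pair in product(parent_a, parent_b))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}

# Two heterozygous carriers of the same knockout allele.
result = offspring_genotypes("Kk", "Kk")
for genotype, probability in sorted(result.items()):
    print(genotype, probability)
# KK 1/4
# Kk 1/2
# kk 1/4
```

Under these assumptions, one offspring in four is expected to carry the homozygous knockout ("kk"), which is why matings between related carriers raise the frequency of homozygous knockouts.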


One major study that investigated human knockouts is the Pakistan Risk of Myocardial Infarction study. It was found that individuals possessing a heterozygous loss-of-function gene knockout for the APOC3 gene had lower triglycerides in the blood after consuming a high fat meal as compared to individuals without the mutation. However, individuals possessing homozygous loss-of-function gene knockouts of the APOC3 gene displayed the lowest level of triglycerides in the blood after the fat load test, as they produce no functional APOC3 protein.


Human genetic disorders


Human genetic disorders (W)

Further information: Genetic disorder

Most aspects of human biology involve both genetic (inherited) and non-genetic (environmental) factors. Some inherited variation influences aspects of our biology that are not medical in nature (height, eye color, ability to taste or smell certain compounds, etc.). Moreover, some genetic disorders only cause disease in combination with the appropriate environmental factors (such as diet). With these caveats, genetic disorders may be described as clinically defined diseases caused by genomic DNA sequence variation. In the most straightforward cases, the disorder can be associated with variation in a single gene. For example, cystic fibrosis is caused by mutations in the CFTR gene and is the most common recessive disorder in Caucasian populations, with over 1,300 different mutations known.

Disease-causing mutations in specific genes are usually severe in terms of gene function and are fortunately rare, thus genetic disorders are similarly individually rare. However, since there are many genes that can vary to cause genetic disorders, in aggregate they constitute a significant component of known medical conditions, especially in pediatric medicine. Molecularly characterized genetic disorders are those for which the underlying causal gene has been identified. Currently there are approximately 2,200 such disorders annotated in the OMIM database.

Studies of genetic disorders are often performed by means of family-based studies. In some instances, population based approaches are employed, particularly in the case of so-called founder populations such as those in Finland, French-Canada, Utah, Sardinia, etc. Diagnosis and treatment of genetic disorders are usually performed by a geneticist-physician trained in clinical/medical genetics. The results of the Human Genome Project are likely to provide increased availability of genetic testing for gene-related disorders, and eventually improved treatment. Parents can be screened for hereditary conditions and counselled on the consequences, the probability of inheritance, and how to avoid or ameliorate it in their offspring.

There are many different kinds of DNA sequence variation, ranging from complete extra or missing chromosomes down to single nucleotide changes. It is generally presumed that much naturally occurring genetic variation in human populations is phenotypically neutral, i.e., has little or no detectable effect on the physiology of the individual (although there may be fractional differences in fitness defined over evolutionary time frames). Genetic disorders can be caused by any or all known types of sequence variation. To molecularly characterize a new genetic disorder, it is necessary to establish a causal link between a particular genomic sequence variant and the clinical disease under investigation. Such studies constitute the realm of human molecular genetics.

With the advent of the Human Genome and International HapMap Project, it has become feasible to explore subtle genetic influences on many common disease conditions such as diabetes, asthma, migraine, schizophrenia, etc. Although some causal links have been made between genomic sequence variants in particular genes and some of these diseases, often with much publicity in the general media, these are usually not considered to be genetic disorders per se as their causes are complex, involving many different genetic and environmental factors. Thus there may be disagreement in particular cases whether a specific medical condition should be termed a genetic disorder.

Additional genetic disorders worth mentioning are Kallmann syndrome and Pfeiffer syndrome (gene FGFR1), Fuchs corneal dystrophy (gene TCF4), Hirschsprung's disease (genes RET and FECH), Bardet-Biedl syndrome 1 (genes CCDC28B and BBS1), Bardet-Biedl syndrome 10 (gene BBS10), and facioscapulohumeral muscular dystrophy type 2 (genes D4Z4 and SMCHD1).

Genome sequencing can now narrow the search to specific locations in the genome in order to find, more accurately, the mutations that result in a genetic disorder. Copy number variants (CNVs) and single nucleotide variants (SNVs) can also be detected in the same run with the newer sequencing procedures known as next-generation sequencing (NGS). This only analyzes a small portion of the genome, around 1–2%. The results of this sequencing can be used for clinical diagnosis of a genetic condition, including Usher syndrome, retinal disease, hearing impairments, diabetes, epilepsy, Leigh disease, hereditary cancers, neuromuscular diseases, primary immunodeficiencies, severe combined immunodeficiency (SCID), and diseases of the mitochondria. NGS can also be used to identify carriers of diseases before conception. The diseases that can be detected in this way include Tay-Sachs disease, Bloom syndrome, Gaucher disease, Canavan disease, familial dysautonomia, cystic fibrosis, spinal muscular atrophy, and fragile X syndrome. NGS can be narrowed down to look specifically for diseases that are more prevalent in certain ethnic populations.

The categorized table below provides the prevalence as well as the genes or chromosomes associated with some human genetic disorders.

Disorder | Prevalence | Chromosome or gene involved

Chromosomal conditions
Down syndrome | 1:600 | Chromosome 21
Klinefelter syndrome | 1:500–1000 males | Additional X chromosome
Turner syndrome | 1:2000 females | Loss of X chromosome
Sickle cell anemia | 1 in 50 births in parts of Africa; rarer elsewhere | β-globin (on chromosome 11)
Bloom syndrome | 1:48000 Ashkenazi Jews | BLM
Breast/Ovarian cancer (susceptibility) | ~5% of cases of these cancer types | BRCA1, BRCA2
FAP (familial adenomatous polyposis) | 1:3500 | APC
Lynch syndrome | 5–10% of all cases of bowel cancer | MLH1, MSH2, MSH6, PMS2
Fanconi anemia | 1:130000 births | FANCC

Neurological conditions
Huntington disease | 1:20000 | Huntingtin
Alzheimer disease - early onset | 1:2500 | PS1, PS2, APP
Tay-Sachs | 1:3600 births in Ashkenazi Jews | HEXA gene (on chromosome 15)
Canavan disease | 2.5% Eastern European Jewish ancestry | ASPA gene (on chromosome 17)
Familial dysautonomia | 600 known cases worldwide since discovery | IKBKAP gene (on chromosome 9)
Fragile X syndrome | 1.4:10000 in males, 0.9:10000 in females | FMR1 gene (on X chromosome)
Mucolipidosis type IV | 1:90 to 1:100 in Ashkenazi Jews | MCOLN1

Other conditions
Cystic fibrosis | 1:2500 | CFTR
Duchenne muscular dystrophy | 1:3500 boys | Dystrophin
Becker muscular dystrophy | 1.5–6:100000 males | DMD
Beta thalassemia | 1:100000 | HBB
Congenital adrenal hyperplasia | 1:280 in Native Americans and Yupik Eskimos; 1:15000 in American Caucasians |
Glycogen storage disease type I | 1:100000 births in America | G6PC
Maple syrup urine disease | 1:180000 in the U.S.; 1:176 in Mennonite/Amish communities; 1:250000 in Austria |
Niemann–Pick disease, SMPD1-associated | 1,200 cases worldwide | SMPD1
Usher syndrome | 1:23000 in the U.S.; 1:28000 in Norway; 1:12500 in Germany |













Evolution (W)

See also: Human evolution and Chimpanzee Genome Project

Comparative genomics studies of mammalian genomes suggest that approximately 5% of the human genome has been conserved by evolution since the divergence of extant lineages approximately 200 million years ago, containing the vast majority of genes. The published chimpanzee genome differs from that of the human genome by 1.23% in direct sequence comparisons. Around 20% of this figure is accounted for by variation within each species, leaving only ~1.06% consistent sequence divergence between humans and chimps at shared genes. This nucleotide by nucleotide difference is dwarfed, however, by the portion of each genome that is not shared, including around 6% of functional genes that are unique to either humans or chimps.

In other words, the considerable observable differences between humans and chimps may be due as much or more to genome-level variation in the number, function and expression of genes as to DNA sequence changes in shared genes. Indeed, even within humans, a previously unappreciated amount of copy number variation (CNV) has been found, which can make up as much as 5–15% of the human genome. In other words, between humans, there could be ±500,000,000 base pairs of DNA, some being active genes, others inactivated, or active at different levels. The full significance of this finding remains to be seen. On average, a typical human protein-coding gene differs from its chimpanzee ortholog by only two amino acid substitutions; nearly one third of human genes have exactly the same protein translation as their chimpanzee orthologs. A major difference between the two genomes is human chromosome 2, which is equivalent to a fusion product of chimpanzee chromosomes 12 and 13 (later renamed chromosomes 2A and 2B, respectively).

Humans have undergone an extraordinary loss of olfactory receptor genes during our recent evolution, which explains our relatively crude sense of smell compared to most other mammals. Evolutionary evidence suggests that the emergence of color vision in humans and several other primate species has diminished the need for the sense of smell.

In September 2016, scientists reported that, based on human DNA genetic studies, all non-Africans in the world today can be traced to a single population that exited Africa between 50,000 and 80,000 years ago.



Mitochondrial DNA

Mitochondrial DNA (W)

The human mitochondrial DNA is of tremendous interest to geneticists, since it undoubtedly plays a role in mitochondrial disease. It also sheds light on human evolution; for example, analysis of variation in the human mitochondrial genome has led to the postulation of a recent common ancestor for all humans on the maternal line of descent (see Mitochondrial Eve).

Due to the lack of a system for checking for copying errors, mitochondrial DNA (mtDNA) has a more rapid rate of variation than nuclear DNA. This 20-fold higher mutation rate allows mtDNA to be used for more accurate tracing of maternal ancestry. Studies of mtDNA in populations have allowed ancient migration paths to be traced, such as the migration of Native Americans from Siberia or Polynesians from southeastern Asia. It has also been used to show that there is no trace of Neanderthal DNA in the European gene mixture inherited through purely maternal lineage. Due to the restrictive all-or-none manner of mtDNA inheritance, this result (no trace of Neanderthal mtDNA) would be likely unless there were a large percentage of Neanderthal ancestry, or there was strong positive selection for that mtDNA. For example, going back 5 generations, only 1 of your 32 ancestors contributed to your mtDNA, so if one of these 32 was pure Neanderthal you would expect that ~3% of your autosomal DNA would be of Neanderthal origin, yet you would have a ~97% chance of having no trace of Neanderthal mtDNA.
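The figures in this example can be reproduced with a short calculation. The sketch assumes a simple pedigree with no shared ancestors, exactly one Neanderthal ancestor in the chosen generation who is equally likely to occupy any position in it, and an equal autosomal contribution from every ancestor of that generation. The function name is hypothetical.

```python
def maternal_line_odds(generations):
    """Return (ancestors, autosomal_share, p_no_mtdna) for a given generation depth.

    ancestors       -- number of ancestors in that generation (2 ** generations)
    autosomal_share -- expected autosomal fraction inherited from one of them
    p_no_mtdna      -- chance that one particular ancestor is NOT the single
                       ancestor on the purely maternal line
    """
    ancestors = 2 ** generations
    autosomal_share = 1 / ancestors
    p_no_mtdna = 1 - 1 / ancestors
    return ancestors, autosomal_share, p_no_mtdna

ancestors, share, p_none = maternal_line_odds(5)
print(ancestors)            # 32
print(f"{share:.1%}")       # 3.1%
print(f"{p_none:.1%}")      # 96.9%
```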




Epigenome (W)

See also: Epigenetics

Epigenetics describes a variety of features of the human genome that transcend its primary DNA sequence, such as chromatin packaging, histone modifications and DNA methylation, and which are important in regulating gene expression, genome replication and other cellular processes. Epigenetic markers strengthen and weaken transcription of certain genes but do not affect the actual sequence of DNA nucleotides. DNA methylation is a major form of epigenetic control over gene expression and one of the most highly studied topics in epigenetics. During development, the human DNA methylation profile experiences dramatic changes. In early germ line cells, the genome has very low methylation levels. These low levels generally describe active genes. As development progresses, parental imprinting tags lead to increased methylation activity.

Epigenetic patterns can be identified between tissues within an individual as well as between individuals themselves. Identical genes that have differences only in their epigenetic state are called epialleles. Epialleles can be placed into three categories: those directly determined by an individual's genotype, those influenced by genotype, and those entirely independent of genotype. The epigenome is also influenced significantly by environmental factors. Diet, toxins, and hormones impact the epigenetic state. Studies in dietary manipulation have demonstrated that methyl-deficient diets are associated with hypomethylation of the epigenome. Such studies establish epigenetics as an important interface between the environment and the genome.





© Aziz Yardımlı 2020 | aziz@ideayayı