The image sketches the sequencing process for a DNA molecule (chromosome). At the beginning the sequences of the segments are unknown. Green lines represent pieces that have been read during the process.
In a human cell, the nuclear genome is distributed over 46 long DNA molecules (chromosomes, arranged as 23 pairs) contained within the nucleus. The chromosomes carry the genetic information.
Each DNA molecule consists of two strands that form a double helix. Each DNA strand is a linear polymer that consists of similar subunits (monomers, called nucleotides) connected end to end.
Each monomer contains a sugar, a phosphate and a base. The sequence of bases represents a form of linear information. There are four bases, denoted by the letters A, C, G and T. The bases are complementary in pairs: A binds to T, and G binds to C. Because of this base-pair complementarity, a single strand already contains the full genetic information.
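As a small illustration of why one strand is enough, the following Python sketch derives the partner strand from a given strand using the A/T and G/C pairing. The function names are chosen here for illustration only. The second function additionally reverses the result, because the two strands of a double helix run in opposite directions; that detail goes slightly beyond the text above.

```python
# Base pairing rules: A binds to T, G binds to C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base that pairs with each position of the given strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own reading direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

Applying `reverse_complement` twice returns the original string, which reflects that either strand can be recovered from the other.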
The goal of sequencing is to obtain the order of the bases contained in the DNA in the form of a long string.
Sequencing machines cannot read a whole genome in one step. Therefore, the genome has to be cut into smaller pieces. To make it possible to reassemble the pieces afterwards, they have to overlap.
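This can be achieved by generating many copies of a DNA strand and cutting each copy into pieces at random positions (for example with high pressure or ultrasound). Because the copies break at different positions, pieces from different copies overlap.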
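The random cutting can be imitated on a computer. The sketch below is only a toy model: the function names, the piece lengths and the number of copies are assumptions made for illustration and do not describe a real laboratory protocol. Each copy of the string is cut into consecutive pieces of random length, so the cut positions differ from copy to copy.

```python
import random


def shear(sequence, rng, min_len=4, max_len=8):
    """Cut one copy of the sequence into consecutive pieces of random length."""
    pieces = []
    start = 0
    while start < len(sequence):
        length = rng.randint(min_len, max_len)  # both bounds are inclusive
        pieces.append(sequence[start:start + length])
        start += length
    return pieces


def shotgun_fragments(sequence, copies=5, seed=0):
    """Cut several copies independently and pool all resulting pieces."""
    rng = random.Random(seed)  # fixed seed makes the toy run repeatable
    fragments = []
    for _ in range(copies):
        fragments.extend(shear(sequence, rng))
    return fragments


fragments = shotgun_fragments("ATGCGTACGTTAGCCGATAGCTTAGGCTA", copies=3)
print(fragments)
```

Since every copy is cut at different places, a piece from one copy usually spans a cut position of another copy, which is what produces the overlaps.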
In the process of sequence assembly, the full sequence of nucleotides is reconstructed from overlap information by performing a stepwise search for pieces with overlapping ends.
Then overlapping pieces are put together.
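The stepwise search and merge can be sketched as a simple greedy procedure: repeatedly find the two pieces whose ends overlap the most and join them. This is a minimal illustration under strong assumptions (error-free pieces, no repeats, all pieces from the same strand); the names `overlap`, `greedy_assemble` and the parameter `min_len` are invented for this example, and real assembly programs use far more refined methods.

```python
def overlap(left, right, min_len=3):
    """Length of the longest end of `left` that equals a start of `right`."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def drop_contained(pieces):
    """Remove duplicates and pieces that lie entirely inside another piece."""
    unique = list(dict.fromkeys(pieces))
    return [p for p in unique if not any(p != q and p in q for q in unique)]


def greedy_assemble(pieces, min_len=3):
    """Merge the best-overlapping pair until no overlap of min_len remains."""
    pieces = drop_contained(pieces)
    while len(pieces) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining pieces cannot be joined
        merged = pieces[best_i] + pieces[best_j][best_k:]
        rest = [p for n, p in enumerate(pieces) if n not in (best_i, best_j)]
        pieces = drop_contained(rest + [merged])
    return pieces


reads = ["GTTAGCCA", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGCCA']
```

If the pieces do not cover the whole sequence, the function returns several separate stretches instead of a single string.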
Bioinformatics provides suitable algorithms for the assembly step. These algorithms have to be very efficient, as the number of pieces, and hence the number of pairwise comparisons for overlaps, is large. In addition, the algorithms have to cope with problems such as repetitive sequences and reading errors in the pieces.
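To get a feeling for how quickly the number of comparisons grows, the following sketch counts the ordered pairs of pieces that a naive all-against-all overlap search would have to examine. The function name and the piece counts are arbitrary examples, not figures from an actual sequencing project.

```python
def ordered_pairs(num_pieces: int) -> int:
    """Each piece is compared with every other piece, in both orders."""
    return num_pieces * (num_pieces - 1)


for n in (1_000, 1_000_000):
    print(n, ordered_pairs(n))
# 1000 999000
# 1000000 999999000000
```

The count grows roughly with the square of the number of pieces, which is why practical assembly algorithms try to avoid comparing every pair explicitly.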