Essential Bioinformatics, by Jin Xiong, is a concise yet comprehensive textbook of bioinformatics that provides a broad introduction to the entire field.
Written specifically for a life science audience, the basics of bioinformatics are explained, followed by discussions of the state-of-the-art computational tools available to solve biological research problems. All key areas of bioinformatics are covered, including biological databases, sequence alignment, gene and promoter prediction, molecular phylogenetics, structural bioinformatics, genomics, and proteomics.
The book emphasizes how computational methods work and compares the strengths and weaknesses of different methods. This balanced yet easily accessible text will be invaluable to students who do not have sophisticated computational backgrounds.
Technical details of computational algorithms are explained with a minimum use of mathematical formulae; graphical illustrations are used in their place to aid understanding. The effective synthesis of existing literature as well as in-depth and up-to-date coverage of all key topics in bioinformatics make this an ideal textbook for all bioinformatics courses taken by life science students and for researchers wishing to develop their knowledge of bioinformatics to facilitate their own research.
The three public nucleotide sequence databases (GenBank, EMBL, and DDBJ) closely collaborate and exchange new data daily. This means that by connecting to any one of the three databases, one should have access to the same nucleotide sequence data.
Although the three databases all contain the same sets of raw data, each of the individual databases uses a slightly different format to represent the data. Fortunately, for the three-dimensional structures of biological macromolecules, there is only one centralized database, the PDB.
This database archives atomic coordinates of macromolecules (both proteins and nucleic acids) determined by x-ray crystallography and NMR. The web interface of PDB also provides viewing tools for simple image manipulation. More details of this database and its format are provided elsewhere in the book.
Secondary Databases
Sequence annotation information in the primary database is often minimal.
To turn the raw sequence information into more sophisticated biological knowledge, much postprocessing of the sequence information is needed. This creates the need for secondary databases. A prominent example of secondary databases is SWISS-PROT, which provides detailed sequence annotation that includes structure, function, and protein family assignment. The annotation of each entry is carefully curated by human experts and thus is of good quality. The data record also provides cross-referencing links to other online resources of interest.
The content of specialized databases may be sequences or other types of information. The sequences in these databases may overlap with a primary database, but may also include new data submitted directly by authors. In addition, there are also specialized databases that contain original data derived from functional analysis.
Interconnection between Biological Databases
As mentioned, primary databases are central repositories and distributors of raw sequence and structure information.
They support nearly all other types of biological databases in a way akin to the Associated Press providing news feeds to local news media, which then tailor the news to suit their own particular needs. Therefore, in the biological community, there is a frequent need for the secondary and specialized databases to connect to the primary databases. Instead of making users visit multiple databases, it is convenient for entries in a database to be cross-referenced and linked to related entries in other databases that contain additional information.
All these create a demand for linking different databases. The heterogeneous database structures limit communication between databases.
In this format, each biological record is broken down into small, basic components that are labeled with a hierarchical nesting of tags. Recently, a specialized protocol for bioinformatics data exchange has been developed. It is the distributed annotation system, which allows one computer to contact multiple servers, retrieve dispersed sequence annotation information related to a particular sequence, and integrate the results into a single combined report.
What is often ignored is the fact that there are many errors in sequence databases.
There are also high levels of redundancy in the primary sequence databases. Annotations of genes can also occasionally be false or incomplete. All these types of errors can be passed on to other databases, causing propagation of errors.
Most errors in nucleotide sequences are caused by sequencing errors.
Sometimes, gene sequences are contaminated with sequences from cloning vectors. Generally speaking, errors are more common in older sequences; sequence quality has greatly improved since. Therefore, exceptional care should be taken when dealing with more dated sequences.
Redundancy is another major problem affecting primary databases. There is tremendous duplication of information in the databases, for various reasons. This makes some primary databases excessively large and unwieldy for information retrieval.
Steps have been taken to reduce the redundancy. The National Center for Biotechnology Information (NCBI) has now created a nonredundant database, called RefSeq, in which identical sequences from the same organism and associated sequence fragments are merged into a single entry.
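The merging idea can be illustrated with a toy sketch. It assumes a hypothetical record layout of (accession, organism, sequence) tuples and treats a "fragment" simply as a sequence contained in a longer one from the same organism; it is not a description of how NCBI actually builds RefSeq.

```python
from collections import defaultdict


def merge_redundant(records):
    """Merge identical sequences and contained fragments per organism.

    `records` is an iterable of (accession, organism, sequence) tuples;
    this layout is hypothetical and chosen only for the illustration.
    """
    by_organism = defaultdict(list)
    for accession, organism, sequence in records:
        by_organism[organism].append((accession, sequence.upper()))

    merged = []
    for organism, entries in by_organism.items():
        # Visit the longest sequences first so that a fragment is always
        # compared against the full-length entry that may contain it.
        entries.sort(key=lambda entry: len(entry[1]), reverse=True)
        representatives = []  # each item: [sequence, [accessions]]
        for accession, sequence in entries:
            for rep in representatives:
                if sequence in rep[0]:  # identical, or a fragment of rep
                    rep[1].append(accession)
                    break
            else:
                representatives.append([sequence, [accession]])
        for sequence, accessions in representatives:
            merged.append(
                {"organism": organism, "sequence": sequence, "accessions": accessions}
            )
    return merged


records = [
    ("A1", "E. coli", "ATGCCGTA"),
    ("A2", "E. coli", "ATGCCGTA"),     # exact duplicate of A1
    ("A3", "E. coli", "CCGT"),         # fragment contained in A1
    ("A4", "H. sapiens", "ATGCCGTA"),  # same sequence, different organism
    ("A5", "E. coli", "GGGG"),         # unrelated sequence
]
merged = merge_redundant(records)
```

In this example the five made-up records collapse into three entries: duplicates and fragments are merged only within the same organism, so the H. sapiens record stays separate.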
Protein sequences derived from the same DNA sequences are explicitly linked as related entries. Sequence variants from the same organism with very minor differences, which may well be caused by sequencing errors, are treated as distinctly related entries.
This carefully curated database can be considered a secondary database. Another way to address the redundancy problem is to create sequence-cluster databases such as UniGene (see Chapter 18) that coalesce EST sequences that are derived from the same gene.
The other common problem is erroneous annotations. Often, the same gene sequence is found under different names, resulting in multiple entries and confusion about the data. Conversely, unrelated genes bearing the same name are found in the databases.
To alleviate the problem of naming genes, reannotation of genes and proteins using a set of common, controlled vocabulary to describe a gene or protein is necessary. The goal is to provide a consistent and unambiguous naming system for all genes and proteins.
A prominent example of such systems is Gene Ontology. There are also some errors that are simply caused by omissions or mistakes in typing. Errors in annotation can be particularly damaging because the large majority of new sequences are assigned functions based on similarity with sequences in the databases that are already annotated.
Therefore, a wrong annotation can be easily transferred to all similar genes in the entire database. It is possible that some of these errors can be corrected at the informatics level by studying the protein domains and families. However, others eventually have to be corrected using experimental work.
There are a number of retrieval systems for biological data. The most popular retrieval systems for biological databases are Entrez and the Sequence Retrieval System (SRS), which provide access to multiple databases for retrieval of integrated search results.
Keyword searches in these systems combine terms with the Boolean operators AND, OR, and NOT. AND means that the search result must contain both words; OR means to search for results containing either word or both; NOT excludes results containing the word that follows it.
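The three operators map directly onto set operations, which the following minimal sketch illustrates. The record identifiers and titles are made up for the example, and the tiny inverted index is only an assumption about how such a search could be organized, not how Entrez or SRS is implemented.

```python
from collections import defaultdict


def build_index(records):
    """Map each lowercase word to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in text.lower().split():
            index[word].add(record_id)
    return index


# Hypothetical records: id -> free-text title.
records = {
    "rec1": "human insulin gene",
    "rec2": "mouse insulin receptor",
    "rec3": "human hemoglobin gene",
}
index = build_index(records)

both = index["human"] & index["insulin"]         # human AND insulin
either = index["insulin"] | index["hemoglobin"]  # insulin OR hemoglobin
without = index["insulin"] - index["mouse"]      # insulin NOT mouse
```

AND becomes set intersection, OR becomes union, and NOT becomes set difference, so `both` and `without` each contain only `rec1`, while `either` contains all three records.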
Quotes can be used to specify a phrase. Most search engines of public biological databases use some form of this Boolean logic.
Entrez is a gateway that allows text-based searches for a wide variety of data, including annotated genetic sequence information, structural information, citations and abstracts, full papers, and taxonomic data. The key feature of Entrez is its ability to integrate information, which comes from cross-referencing between NCBI databases based on preexisting and logical relationships between individual entries.
This is highly convenient. Effective use of Entrez requires an understanding of the main features of the search engine. There are several options common to all NCBI databases that help to narrow the search.
A search can also be restricted to a particular database. One of the databases accessible from Entrez is a biomedical literature database known as PubMed, which contains abstracts and in some cases the full-text articles from several thousand journals.
An important feature of PubMed is the retrieval of information based on medical subject headings (MeSH) terms. The MeSH system consists of a collection of more than twenty thousand controlled and standardized vocabulary terms used for indexing articles. In other words, it is a thesaurus that helps convert search keywords into standardized terms to describe a concept. PubMed uses a word weight algorithm to identify related articles with similar words in the titles, abstracts, and MeSH.
By using this feature, articles on the same topic that were missed in the original search can be retrieved. PubMed uses a list of tags for literature searches. Another unique database accessible from Entrez is Online Mendelian Inheritance in Man (OMIM), which is a non-sequence-based database of human disease genes and human genetic disorders.
Each entry in OMIM contains summary information about a particular disease as well as genes related to the disease. The text contains numerous hyperlinks to literature citations, primary sequence records, as well as chromosome loci of the disease genes.
The database can serve as an excellent starting point to study genes related to a disease. NCBI also maintains a taxonomy database that contains the names and taxonomic positions of organisms with at least one nucleotide or protein sequence on record. The root level is Archaea, Eubacteria, and Eukaryota.
The database allows the taxonomic tree for a particular organism to be displayed. The tree is based on molecular phylogenetic data, namely, the small ribosomal RNA data.
GenBank
GenBank is the most complete collection of annotated nucleic acid sequence data for almost every organism. There is also a GenPept database for protein sequences, the majority of which are conceptual translations from DNA sequences, although a small number of the amino acid sequences are derived using peptide sequencing techniques.
There are two ways to search for sequences in GenBank. One is to use text-based keywords, similar to a PubMed search. GenBank is a relational database. In a GenBank record, the header includes a three-letter code for the GenBank division. Next to the division is the date when the record was made public (which is different from the date when the data were submitted).
The accession number is the number that should be cited in publications; it comes in two different formats. In addition to the accession number, there is also a version number and a gene index (gi) number.
The purpose of these numbers is to identify the current version of the sequence. If the sequence annotation is revised at a later date, the accession number remains the same, but the version number is incremented, as is the gi number. A translated protein sequence also has a different gi number from the DNA sequence it is derived from. The citation is often hyperlinked to the PubMed record for access to the original literature information.
The last part of the Header is the contact information of the sequence submitter. Some optional information includes the clone source, the tissue type, and the cell line.
In addition to the GenBank format, there are many other sequence formats. FASTA is one of the simplest and the most popular sequence formats because it contains plain sequence information that is readable by many bioinformatics analysis programs.
The extra information is considered optional and is ignored by sequence analysis programs. The plain sequence in standard one-letter symbols starts in the second line. Each line of sequence data is limited to sixty to eighty characters in width. The drawback of this format is that much annotation information is lost.
Abstract Syntax Notation One (ASN.1) describes sequences with each item of information in a sequence record separated by tags so that each subportion of the sequence record can be easily added to relational tables and later extracted.
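To make the FASTA layout described above concrete, here is a minimal sketch of a reader and writer. It assumes the usual convention, not spelled out in this excerpt, that each record opens with a definition line beginning with ">"; the function names are hypothetical.

```python
def format_fasta(name, sequence, width=60):
    """Return one FASTA record with sequence lines wrapped at `width`."""
    lines = [">" + name]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


fasta_text = format_fasta("seq1 example sequence", "ACGT" * 40, width=60)
parsed = parse_fasta(fasta_text)
```

Round-tripping a 160-residue sequence this way yields three wrapped sequence lines after the definition line, and parsing joins them back into the original string. Only the name and the residues survive, which illustrates why annotation is lost in this format.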
The Abstract Syntax Notation One format also facilitates the transmission and integration of data between databases.
Conversion of Sequence Formats
In sequence analysis and phylogenetic analysis, there is a frequent need to convert between sequence formats. One of the most popular computer programs for sequence format conversion is Readseq, written by Don Gilbert at Indiana University. A web interface version of the program is also available.
SRS is not as integrated as Entrez, but allows the user to query multiple databases simultaneously, another good example of database integration.
It also offers direct access to certain sequence analysis applications such as sequence similarity searching and Clustal sequence alignment (see Chapter 5). The search results contain the query sequence and sequence annotation as well as links to literature, metabolic pathways, and other biological databases.
The goal of a biological database is twofold. Relational databases organize data as tables and search information among tables with shared features. Object-oriented databases organize data as objects and associate the objects according to hierarchical relationships. Biological databases encompass flat-file, relational, and object-oriented types. Based on their content, biological databases are divided into primary, secondary, and specialized databases.
Primary databases simply archive sequence or structure information; secondary databases include further analysis on the sequences or structures. Specialized databases cater to a particular research interest.
Biological databases need to be interconnected so that entries in one database can be cross-linked to related entries in another database. NCBI databases accessible through Entrez are among the most integrated databases.
Effective information retrieval involves the use of Boolean operators. Entrez has additional user-friendly features to help conduct complex searches. It is also important to bear in mind that sequence data in these databases are less than perfect. There are sequence and annotation errors. Biological databases are also plagued by redundancy problems. There are various solutions to correct annotation and reduce redundancy, for example, merging redundant sequences into a single entry or storing highly redundant sequences in a separate database.
As new biological sequences are being generated at exponential rates, sequence comparison is becoming increasingly important to draw functional and evolutionary inference of a new protein with proteins already existing in the database.
The most fundamental process in this type of comparison is sequence alignment. This is the process by which sequences are compared by searching for common character patterns and establishing residue-residue correspondence among related sequences.
Pairwise sequence alignment is the process of aligning two sequences and is the basis of database similarity searching (see Chapter 4) and multiple sequence alignment (see Chapter 5). This chapter introduces the basics of pairwise alignment. The building blocks of biological macromolecules, nucleotide bases and amino acids, form linear sequences that determine the primary structure of the molecules.
These molecules can be considered molecular fossils that encode the history of millions of years of evolution.
During this time period, the molecular sequences undergo random changes, some of which are selected during the process of evolution. Evolutionary traces remain because some of the residues that perform key functional and structural roles tend to be preserved by natural selection; other residues that may be less crucial for structure and function tend to mutate more frequently.
For example, active site residues of an enzyme family tend to be conserved because they are responsible for catalytic functions. Identifying the evolutionary relationships between sequences helps to characterize the function of unknown sequences.
If one member within the family has a known structure and function, then that information can be transferred to those that have not yet been experimentally characterized.
Therefore, sequence alignment can be used as a basis for prediction of structure and function of uncharacterized sequences. Sequence alignment provides inference for the relatedness of two sequences under study. It is also possible that two sequences have derived from a common ancestor, but may have diverged to such an extent that the common ancestral relationships are not recognizable at the sequence level.
In that case, the distant evolutionary relationships have to be detected using other methods. When two sequences are descended from a common evolutionary origin, they are said to have a homologous relationship or share homology.
A related but different term is sequence similarity, which is the percentage of aligned residues that are similar in physicochemical properties such as size, charge, and hydrophobicity. To be clear, sequence homology is an inference or a conclusion about a common ancestral relationship drawn from sequence similarity comparison when the two sequences share a high enough degree of similarity.
On the other hand, similarity is a direct result of observation from the sequence alignment. Two sequences are either homologous or nonhomologous. Generally, if the sequence similarity level is high enough, a common evolutionary relationship can be inferred.
In dealing with real research problems, the issue of at what similarity level one can infer homologous relationships is not always clear. The answer depends on the type of sequences being examined and sequence lengths. Nucleotide sequences consist of only four characters, and therefore, unrelated sequences have a one-in-four chance of matching at any aligned position by chance alone.
Figure: The three zones of protein sequence alignments. Two protein sequences can be regarded as homologous if the percentage sequence identity falls in the safe zone.
Sequence length is also a crucial factor.
The shorter the sequence, the higher the chance that some alignment is attributable to random chance. The longer the sequence, the less likely the matching at the same level of similarity is attributable to random chance. This suggests that shorter sequences require higher cutoffs for inferring homologous relationships than longer sequences. This is not a precise rule for determining sequence relationships, especially for sequences in the twilight zone.
Sequence similarity and sequence identity are synonymous for nucleotide sequences. For protein sequences, however, the two concepts are very different. In a protein sequence alignment, sequence identity refers to the percentage of matches of the same amino acid residues between two aligned sequences.
Similarity refers to the percentage of aligned residues that have similar physicochemical characteristics and can be more readily substituted for each other. These percentages can be calculated in two ways: one involves the use of the overall sequence lengths of both sequences; the other normalizes by the size of the shorter sequence.
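The two normalizations can be sketched as follows. The formulas are one reading of the description above (twice the match count over the summed lengths, and the match count over the shorter length), the amino acid grouping is an illustrative assumption rather than something taken from the text, and identical residues are counted as similar.

```python
# Illustrative grouping of amino acids by broad physicochemical class.
# This grouping is an assumption for the example, not taken from the text.
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ"]


def group_of(residue):
    """Return the group containing `residue`, or None if it has no group."""
    for group in SIMILAR_GROUPS:
        if residue in group:
            return group
    return None


def count_identical_and_similar(aligned_a, aligned_b, gap="-"):
    """Count identical and similar positions in two gapped, aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if gap in (x, y):
            continue  # positions aligned to a gap count as neither
        if x == y:
            identical += 1
            similar += 1  # identical residues are also counted as similar
        elif group_of(x) is not None and group_of(x) == group_of(y):
            similar += 1
    return identical, similar


def percent_overall(count, len_a, len_b):
    """Normalize by the overall lengths of both sequences."""
    return 2 * count / (len_a + len_b) * 100


def percent_shorter(count, len_a, len_b):
    """Normalize by the length of the shorter sequence."""
    return count / min(len_a, len_b) * 100


aligned_a = "MKT-LLV"
aligned_b = "MRTALIV"
len_a = len(aligned_a.replace("-", ""))  # 6 residues
len_b = len(aligned_b.replace("-", ""))  # 7 residues
identical, similar = count_identical_and_similar(aligned_a, aligned_b)
```

For this made-up alignment there are 4 identical and 6 similar positions, so identity is about 61.5% by overall length and about 66.7% by the shorter sequence. The second normalization always gives a value at least as large as the first, which is worth remembering when comparing figures from different sources.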
There are two different alignment strategies that are often used: global alignment and local alignment.
Global Alignment and Local Alignment
In global alignment, two sequences to be aligned are assumed to be generally similar over their entire length.
This method is more applicable for aligning two closely related sequences of roughly the same length. For divergent sequences and sequences of variable lengths, this method may not be able to generate optimal results because it fails to recognize highly similar local regions between the two sequences. Local alignment, on the other hand, does not assume that the two sequences in question have similarity over the entire length.
The two sequences to be aligned can be of different lengths. This approach is more appropriate for aligning divergent biological sequences containing only modules that are similar, which are referred to as domains or motifs.
Figure: An example of pairwise sequence comparison showing the distinction between global and local alignment. The global alignment (top) includes all residues of both sequences. The region with the highest similarity is highlighted in a box. The local alignment only includes portions of the two sequences that have the highest regional similarity.
Alignment Algorithms
Alignment algorithms, both global and local, are fundamentally similar and only differ in the optimization strategy used in aligning similar residues.
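The following sketch illustrates how small that difference can be, using score-only dynamic programming in the style of the Needleman-Wunsch (global) and Smith-Waterman (local) algorithms. These algorithms are not spelled out in this excerpt, and the scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions chosen for the illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of `a` against all of `b`."""
    prev = [j * gap for j in range(len(b) + 1)]  # leading gaps are penalized
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]  # the score must cover both sequences end to end


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of subsequences of `a` and `b`."""
    best = 0
    prev = [0] * (len(b) + 1)  # an alignment may start anywhere for free
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 lets a poor prefix be discarded.
            value = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(value)
            best = max(best, value)
        prev = cur
    return best  # the best cell anywhere, not just the corner


seq_a = "TTTACGTTTT"
seq_b = "GGACGGG"
global_result = global_score(seq_a, seq_b)
local_result = local_score(seq_a, seq_b)
```

The recurrence is the same in both functions. The local version differs only in resetting negative scores to zero and in reporting the best cell anywhere in the matrix, which is why it can pick out the shared "ACG" region of these two otherwise dissimilar sequences.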
Both types of algorithms can be based on one of three methods: the dot matrix method, the dynamic programming method, and the word method. The dot matrix and dynamic programming methods are discussed herein. The word method, which is used in fast database similarity searching, is introduced in Chapter 4.
Dot Matrix Method
The most basic sequence alignment method is the dot matrix method, also known as the dot plot method.
It is a graphical way of comparing two sequences in a two-dimensional matrix. In a dot matrix, two sequences to be compared are written in the horizontal and vertical axes of the matrix.
The comparison is done by scanning each residue of one sequence for similarity with all residues in the other sequence. If a residue match is found, a dot is placed within the graph. Otherwise, the matrix positions are left blank. When the two sequences have substantial regions of similarity, many dots line up to form contiguous diagonal lines, which reveal the sequence alignment.
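The basic procedure can be sketched in a few lines. The function names and the two short example sequences are made up for illustration, and non-matching positions are printed as "." rather than left blank so that the grid stays readable as text.

```python
def dot_matrix(horizontal, vertical):
    """Return rows of booleans: matrix[i][j] is True when
    vertical[i] matches horizontal[j]."""
    return [[h == v for h in horizontal] for v in vertical]


def render(matrix, horizontal, vertical):
    """Draw the matrix as text: '*' for a dot, '.' for an empty position."""
    lines = ["  " + horizontal]
    for residue, row in zip(vertical, matrix):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


seq_x = "GATTACA"  # written along the horizontal axis
seq_y = "GATCACA"  # written along the vertical axis
matrix = dot_matrix(seq_x, seq_y)
print(render(matrix, seq_x, seq_y))
```

The two sequences differ at a single position, so the main diagonal is unbroken except at that one cell. The scattered off-diagonal matches are chance hits between identical letters.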
If there are interruptions in the middle of a diagonal line, they indicate insertions or deletions. Parallel diagonal lines within the matrix represent repetitive regions of the sequences. A problem exists when comparing large sequences using the dot matrix method, namely, the high noise level.
In most dot plots, dots are plotted all over the graph, obscuring the true alignment.
Figure: Example of comparing two sequences using dot plots. Lines linking the dots in diagonals indicate sequence alignment. Diagonal lines above or below the main diagonal represent internal repeats of either sequence.
For DNA sequences, the problem is particularly acute because there are only four possible characters in DNA and each residue therefore has a one-in-four chance of matching a residue in another sequence. To reduce the noise, matches can be scanned with a window of fixed length rather than one residue at a time. Dots are only placed when a stretch of residues equal to the window size from one sequence matches completely with a stretch of another sequence. This method has been shown to be effective in reducing the noise level.
The window is also called a tuple, the size of which can be manipulated so that a clear pattern of sequence match can be plotted. However, if the selected window size is too long, sensitivity of the alignment is lost.
There are many variations of using the dot plot method. For example, a sequence can be aligned with itself to identify internal repeat elements. In the self-comparison, there is a main diagonal for perfect matching of each residue.
If repeats are present, short parallel lines are observed above and below the main diagonal. Inverted repeats can be detected in a similar way; in this case, a DNA sequence is compared with its reverse-complemented sequence. Parallel diagonals represent the inverted repeats. For comparing protein sequences, a weighting scheme has to be used to account for similarities of physicochemical properties of amino acid residues.
The method thus has some applications in genomics.
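The window filter and the self-comparison variants described above can be sketched as follows. The function names and the example sequence are hypothetical, matches are required to be exact over the whole window, and the complement table covers only the four unambiguous DNA bases.

```python
def windowed_dots(horizontal, vertical, window=3):
    """Return the set of (row, column) positions where a stretch of
    `window` residues matches exactly in both sequences."""
    dots = set()
    for i in range(len(vertical) - window + 1):
        for j in range(len(horizontal) - window + 1):
            if vertical[i:i + window] == horizontal[j:j + window]:
                dots.add((i, j))
    return dots


def reverse_complement(dna):
    """Reverse-complement a DNA string made of A, C, G, and T."""
    complement = str.maketrans("ACGT", "TGCA")
    return dna.upper().translate(complement)[::-1]


seq = "ACGTACGT"

# Self-comparison: the main diagonal plus off-diagonal dots for the repeat.
self_dots = windowed_dots(seq, seq, window=4)

# A window of 1 is the unfiltered dot plot; a longer window removes noise.
unfiltered = windowed_dots(seq, seq, window=1)

# Comparison against the reverse complement to look for inverted repeats.
inverted_dots = windowed_dots(seq, reverse_complement(seq), window=4)
```

With a window of 4, the self-comparison keeps only the main diagonal and the two dots produced by the repeated "ACGT" unit, whereas the unfiltered plot also contains every chance single-letter match.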