What is UniProtKB?

The UniProt Knowledgebase (UniProtKB) provides a collection of manually and automatically annotated protein sequences, which is freely available at www.uniprot.org.

In addition to the minimal mandatory core data (i.e. the protein sequence, protein name, taxonomic data and citation), as much information as possible is added including functional information, classifications, ontologies and cross-references. Information added to an entry is linked to the original source so that users can trace back its origin and evaluate it by themselves.

The UniProt Knowledgebase (UniProtKB) is maintained by the UniProt Consortium, a collaboration between the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI), and the Protein Information Resource (PIR). Across the three institutes, close to 100 people are involved in tasks such as manual and automated curation, software development and user support.

The consortium maintains several databases for different uses: the UniProt Non-redundant Reference databases (UniRef), the UniProt Archive (UniParc), the UniProt Metagenomic and Environmental Sequences database (UniMES) and UniProtKB, which is the centerpiece.

UniProtKB consists of two sections:

  • UniProtKB/Swiss-Prot (‘gold star’) contains manually annotated records (reviewed) with information extracted from the literature and curator-evaluated computational predictions.
  • UniProtKB/TrEMBL (‘grey star’) contains computationally annotated records (unreviewed).

The TrEMBL section of UniProtKB was introduced in response to the increased data flow resulting from genome sequencing projects, because the traditional time- and labour-consuming manual annotation process, which is the hallmark of Swiss-Prot, could not be broadened to encompass all publicly available protein sequences.

Protein sequences are obtained from the translation of annotated coding sequences (CDS) submitted to the ENA/GenBank/DDBJ nucleotide sequence databases, or from other sequence resources, such as Ensembl. They are automatically processed and integrated into UniProtKB/TrEMBL, with the addition of high-quality automated annotation. The UniProtKB/Swiss-Prot annotation pipeline involves the manual annotation of UniProtKB/TrEMBL entries, their integration into UniProtKB/Swiss-Prot (with their original accession numbers) and their subsequent deletion from UniProtKB/TrEMBL.

FAQ: ‘Does UniProtKB contain all protein sequences?’ gives information on the UniProtKB protein sequence exclusion policies.

Although UniProtKB/Swiss-Prot provides annotated entries for more than 12,000 species, it focuses on the annotation of proteins from model organisms of distinct taxonomic groups.

UniProtKB/TrEMBL statistics - UniProtKB/Swiss-Prot statistics

What makes the UniProtKB/Swiss-Prot section unique?

Protein sequences and functional annotation are manually reviewed. Annotation methods applied to UniProtKB/Swiss-Prot include manual extraction and structuring of information from the literature, manual verification of results from computational analyses, mining and integration of large-scale data sets, and continuous updates as new information becomes available.

FAQ: How do we manually annotate a UniProtKB entry?

In order to achieve minimal redundancy and improve sequence reliability, protein sequences encoded by the same gene (in the same species) are merged into a single record. Sequence discrepancies are thoroughly documented.

FAQ: How redundant are the UniProt databases?

UniProtKB/Swiss-Prot is currently cross-referenced to over 140 different databases. It plays the role of a central hub for biological data, linking together relevant resources (more info).

UniProtKB/Swiss-Prot is distributed with a large number of index files and specialized documentation such as a user manual, release notes, FAQ, various species-specific documents, and lists of controlled vocabularies, nomenclature and guidelines, etc. (List of documents).

UniProtKB/Swiss-Prot is particularly well suited for obtaining an overview of the current knowledge about a given protein, for similarity searches and for training prediction software tools.

Anatomy of a UniProtKB/TrEMBL entry

Each UniProtKB/TrEMBL entry contains information associated with one protein sequence. Protein sequences from the same species that are 100% identical over their entire length are automatically merged. The different sections of the entry report selected information extracted from the original ENA/GenBank/DDBJ entry as well as additional high-quality automated annotation.
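The merging criterion above (same species, identical sequence over the full length) can be illustrated with a short Python sketch. This is only an illustration of the criterion, not the actual UniProt pipeline; the record fields (`source_id`, `taxon_id`, `sequence`) and all values are invented for the example.

```python
from collections import defaultdict

def merge_identical(records):
    """Group records that share the same species and exactly the same sequence."""
    merged = defaultdict(list)
    for rec in records:
        # The key requires identity over the full length, so a fragment
        # never merges with a longer sequence that merely contains it.
        key = (rec["taxon_id"], rec["sequence"].upper())
        merged[key].append(rec["source_id"])
    return dict(merged)

# Placeholder records: identifiers and sequences are invented.
records = [
    {"source_id": "SRC1", "taxon_id": 9606, "sequence": "MKTAYIAK"},
    {"source_id": "SRC2", "taxon_id": 9606, "sequence": "MKTAYIAK"},  # same species, same sequence
    {"source_id": "SRC3", "taxon_id": 9598, "sequence": "MKTAYIAK"},  # same sequence, other species
    {"source_id": "SRC4", "taxon_id": 9606, "sequence": "MKTAYIA"},   # shorter sequence
]
groups = merge_identical(records)
for (taxon_id, sequence), sources in groups.items():
    print(taxon_id, sequence, sources)
```

Only SRC1 and SRC2 end up in the same group; the record from another species and the shorter sequence each remain separate.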

A UniProtKB/TrEMBL entry contains at least the following sections:

Entry information Each entry is associated with a stable unique identifier: the primary accession number. When citing an entry, always use the primary accession number. The ‘entry name’ is composed of the primary accession number and a mnemonic species identification: it is not stable and will change as soon as the entry is reviewed and integrated into UniProtKB/Swiss-Prot. This section also provides additional information on the entry history (Example).

Names and origin Protein name (‘Submitted name’), synonyms, gene and locus names, and taxonomy information are automatically extracted from the original ENA/GenBank/DDBJ entry (Example).

Protein attributes This section provides information on the protein sequence length and indicates whether the protein sequence is complete or a fragment (according to the original ENA/GenBank/DDBJ record). It also provides the level of evidence that supports the existence of the protein. The vast majority of UniProtKB/TrEMBL entries have a protein existence level of ‘Predicted’, since their sequences are derived from in silico translations of nucleotide sequences (Example). More information on the UniProtKB evidence levels for protein existence is available in the user manual (Usermanual).

Sequences More than 99% of the protein sequences are obtained from the translation of annotated coding sequences (CDS) in the ENA/GenBank/DDBJ databases and are automatically processed and entered in UniProtKB/TrEMBL (Example).

FAQ: Where do the UniProtKB protein sequences come from?

References This section contains published articles or submissions that were cited in the original ENA/GenBank/DDBJ entry. Additional computationally mapped references can also be accessed from this section (Example).

Cross-references This section is used to point to related information stored in other data collections, including the links to the original ENA/GenBank/DDBJ submissions (Example).

High-quality automated annotation (including family attribution and inferred function and/or catalytic activity) is placed in the dedicated sections, i.e. General Annotation (Example) and Ontologies (Example). Other sections, such as Binary interactions, can also be present (for a complete list of available sections, see ‘Anatomy of a UniProtKB/Swiss-Prot entry’ below).

High quality automated annotation systems

Automated annotation systems are based either on automatically generated annotation rules (SAAS, which uses the C4.5 decision tree algorithm to derive annotation rules in a fully automated fashion from manually curated UniProtKB/Swiss-Prot entries) or on manually curated annotation rules (i.e. HAMAP, RuleBase, PIR name and site rules). The rules are linked to specific criteria, such as InterPro signatures or membership of a specific taxonomic group. Each rule consists of annotations and the conditions under which they are applied, and is tested and validated against UniProtKB/Swiss-Prot. Rules and annotations are updated with each UniProtKB release. About 38% of UniProtKB/TrEMBL entries receive at least one annotation from at least one of the automated annotation systems (more).
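To make the idea of a rule (conditions plus annotations) more concrete, here is a minimal Python sketch. It is not the format used by SAAS, HAMAP, RuleBase or the PIR rules; the `Rule` class, the signature names and the annotation text are all hypothetical and only illustrate how a condition on signatures and taxonomic group could trigger an annotation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    required_signatures: frozenset  # signature IDs the entry must match
    required_taxon: str             # taxonomic group the entry must belong to
    annotations: tuple              # annotations added when the conditions hold

def apply_rules(entry, rules):
    """Return the annotations of every rule whose conditions the entry meets."""
    added = []
    for rule in rules:
        has_signatures = rule.required_signatures <= entry["signatures"]
        in_taxon = rule.required_taxon in entry["lineage"]
        if has_signatures and in_taxon:
            added.extend(rule.annotations)
    return added

# Hypothetical rule and entries: none of these names come from a real rule set.
rule = Rule(frozenset({"SIG_A"}), "Bacteria", ("Function: example annotation",))
bacterial_entry = {"signatures": {"SIG_A", "SIG_B"}, "lineage": ["Bacteria", "Proteobacteria"]}
eukaryotic_entry = {"signatures": {"SIG_A"}, "lineage": ["Eukaryota", "Metazoa"]}

print(apply_rules(bacterial_entry, [rule]))   # conditions met
print(apply_rules(eukaryotic_entry, [rule]))  # wrong taxonomic group, nothing added
```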

FAQ: What is HAMAP?

All information found in UniProtKB entries is linked to the original source so that users can trace back its origin and evaluate it.

Anatomy of a UniProtKB/Swiss-Prot entry

Each UniProtKB/Swiss-Prot entry contains manually reviewed information about one or more protein sequence(s) encoded by one gene in one species. Different sections of the entry store specific biological information. The entry view can be customized.

Entry information Each entry is associated with a stable unique identifier: the primary accession number. When citing an entry, always use the primary accession number. The ‘entry name’ (which is a mnemonic identifier) is unique, but not stable. This section also provides additional information on the entry history (Example).

Names and origin Protein name (‘Recommended name’), synonyms and abbreviations found in the literature or in specialized databases are reported, as well as gene and locus names, and taxonomy information (Example).

Protein attributes This section provides information on the protein sequence length, indicates whether the protein sequence is complete or a fragment, and whether the mature form of the protein is derived by processing of a precursor. It also provides the level of evidence that supports the existence of the protein (more info on UniProtKB evidence for protein existence (Usermanual)) (Example).

General annotation (Comments) This section provides any useful information about the protein, mostly biological knowledge (function, subcellular location, enzyme-specific information (catalytic activity, cofactors, metabolic pathway), tissue expression…). Qualifiers (e.g. ‘By similarity’, ‘Probable’, ‘Potential’) are used in the absence of direct experimental evidence (Example).

Binary interactions This section provides information about binary protein-protein interactions. The data presented in this section are a quality-filtered subset of binary interactions automatically extracted from the IntAct database (Example).

Ontologies This section provides a selection of UniProtKB keywords, which are terms from a controlled vocabulary that summarize the content of the entry, and a selection of Gene Ontology (GO) terms (Example).

FAQ: What are the differences between UniProtKB keywords and the GO terms?

Sequences The protein sequence displayed by default in the entry is the most prevalent and/or the most similar to orthologous sequences. When the genomic sequence is available, we generally display the protein sequence derived from genome translation. Sequence discrepancies are thoroughly documented (Example).

FAQ: What are UniProtKB's criteria for defining a CDS as a protein?

Alternative products This section lists the alternative protein sequences that can be generated from the same gene by alternative promoter usage, alternative splicing, alternative initiation and/or ribosomal frameshifting. In addition, this section provides relevant information for each alternative protein isoform (Example).

FAQ: What is the canonical sequence? Are all isoforms described in one entry? How can I retrieve them?

Sequence annotation (Features) Over 30 feature keys are available for the description of regions or sites of interest in the protein sequence, such as post-translational modifications (glycosylation, phosphorylation…), binding sites, enzyme active sites, local secondary structure, or variants. Features can be either experimentally proven in the literature or predicted in silico. Qualifiers (‘By similarity’, ‘Probable’ and ‘Potential’) indicate that a feature is supported by indirect experimental evidence or by computational prediction. The sources of experimental data are indicated (e.g. Ref.30). Other sequence discrepancies, including sequencing errors, are also reported in this section (Example).

References This section contains the published articles or submissions that are the sources of the entry annotation. Additional computationally mapped references can also be accessed from this section (Example).

Cross-references This section is used to point to related information stored in other data collections. It provides links to sequence databases (nucleic acid and protein sequences), 3D structure databases, enzyme and pathway databases, family and domain databases, gene expression databases, genome annotation databases, organism-specific databases, phylogenomic databases, polymorphism databases, proteomic databases, protein-protein interaction databases, protein family/group databases, PTM databases…(Example).

Use of the UniProt website

Query - Advanced Search

  • The ‘Query’ box is a simple and powerful text search tool which guides users with autocompletion and helpful suggestions and hints.
  • The ‘Advanced Search’ tool box allows users to query specific sections of a UniProtKB entry with autocompletion and suggestions. This approach is useful when users need to run specific queries and already have an idea of the content and structure of a UniProtKB entry.
  • The result page can be ‘Customized’: users can choose to display by default one or more topic(s) of interest (such as ‘InterPro’ match, ‘Function’, ‘Protein existence’ evidence, ‘Subcellular locations’, ‘Gene ontology’…).
  • The result table can be ‘Downloaded’ in different formats (such as Excel).
  • The resulting URL can be bookmarked and used for programmatic access.
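As a rough illustration of how a query could be turned into a bookmarkable URL for programmatic use, the Python sketch below URL-encodes a query string. The endpoint path and the parameter names (`query`, `format`) are assumptions made for this example, not something stated on this page; check the UniProt programmatic-access documentation for the actual interface.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names (not confirmed by this page).
BASE_URL = "http://www.uniprot.org/uniprot/"

def search_url(query, fmt="tab"):
    """Build a search URL by URL-encoding the query text and output format."""
    return BASE_URL + "?" + urlencode({"query": query, "format": fmt})

url = search_url("(name:insulin) AND reviewed:yes")
print(url)
```

Encoding the query with `urlencode` keeps parentheses, colons and spaces from being misread as part of the URL itself.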

Look for insulins (reviewed entries)

Query: (name:insulin) AND reviewed:yes

Look for human and chimpanzee insulins (reviewed entries)

Query: ((name:insulin AND organism:"Homo sapiens") OR (name:insulin AND organism:"Pan troglodytes")) AND reviewed:yes

Look for the complete human proteome

Query: organism:"Homo sapiens (Human) [9606]" AND keyword:"Complete proteome [181]"

FAQ: What are Complete Proteome Sets?

Look for the complete human proteome (restricting the search to proteins which have evidence for existence at the protein level)

Query: organism:"Homo sapiens (Human) [9606]" AND keyword:"Complete proteome [181]" AND existence:"evidence at protein level"

Look for Saccharomyces cerevisiae proteins which are GPI anchored

Query: taxonomy:"Saccharomyces cerevisiae (Baker's yeast) [4932]" AND annotation:(type:lipid AND GPI-anchor)
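The example queries above all follow the same pattern: `field:value` terms combined with AND/OR and grouped with parentheses, with multi-word values in double quotes. A small Python sketch of assembling such a query is shown below; the helper names (`field`, `all_of`, `any_of`) are invented for this example and are not part of any UniProt library.

```python
def field(name, value):
    """Format one field:value term, quoting values that contain spaces."""
    value = value.strip()
    if " " in value:
        value = f'"{value}"'
    return f"{name}:{value}"

def all_of(*terms):
    """Combine terms so that every one of them must match."""
    return " AND ".join(terms)

def any_of(*terms):
    """Combine terms so that at least one must match, grouped in parentheses."""
    return "(" + " OR ".join(terms) + ")"

human = all_of(field("name", "insulin"), field("organism", "Homo sapiens"))
chimp = all_of(field("name", "insulin"), field("organism", "Pan troglodytes"))
query = all_of(any_of(f"({human})", f"({chimp})"), field("reviewed", "yes"))
print(query)
```

Stripping and quoting the value in one place avoids stray spaces inside the quotes and unquoted multi-word organism names.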

Blast - http://www.uniprot.org/Blast/

  • You can perform a similarity search (Blast) using a protein or a nucleotide sequence against UniProtKB, specifying different taxonomic groups (mammals, complete microbial proteomes…), PDB, different UniProtKB sections (i.e. UniProtKB/Swiss-Prot), UniRef, and UniParc. The datasets include not only the canonical sequences but also the ‘Alternative products’ sequences.
  • The Blast result page can be ‘Customized’: users can choose to display by default one or more topic(s) of interest (such as ‘InterPro’ match, ‘Function’, ‘Protein existence’ evidence, ‘Subcellular locations’, ‘Gene ontology’…).
  • The result table can be ‘Downloaded’ in different formats (such as Excel).
  • The annotated sites/features (including active sites, glycosylation sites) can be highlighted in the local alignment.

Look for plant protein sequences similar to human hemoglobin (HBB). What are their functions? Are the metal binding sites (iron binding) conserved? (answer)

Align - http://www.uniprot.org/align/

  • You can align protein sequences (ClustalW).
  • The annotated sites/features (including active sites, glycosylation sites) can be highlighted within the alignment.

Align the following insulins from different organisms (P01308, P30410, P67974, P04667, P67970) and look for the conservation of the mature peptides and disulfide bonds.

Retrieve - http://www.uniprot.org/retrieve/

  • You can retrieve or upload a list of UniProt identifiers to download the corresponding entries.
  • You can then ‘Query’ your dataset as you would query the whole UniProtKB.
  • The result page can be ‘Customized’: users can choose to display by default one or more topic(s) of interest (such as ‘InterPro’ match, ‘Function’, ‘Protein existence’ evidence, ‘Subcellular locations’, ‘Gene ontology’…).
  • The result table can be ‘Downloaded’ in different formats (such as Excel).

A biologist isolated a human threonine-phosphorylated protein by immunoaffinity. The monoclonal antibody recognizes the following epitope: V-S-T-Q, where T is the phosphorylated threonine. Can you find a list of candidate proteins? (Use ScanProsite (‘Matched UniProtKB entries’) and ‘Advanced query’.)
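The epitope search in this exercise amounts to looking for the short motif V-S-T-Q in protein sequences. The Python sketch below shows the idea on sequences you already have locally; it is not a substitute for ScanProsite, and the protein names and sequences are invented for the example.

```python
import re

def find_motif(sequence, motif="VSTQ"):
    """Return the 1-based start position of every occurrence of the motif."""
    # A lookahead is used so that overlapping occurrences are also reported.
    pattern = re.compile(f"(?={re.escape(motif)})")
    return [match.start() + 1 for match in pattern.finditer(sequence)]

# Invented sequences, for illustration only.
candidates = {
    "PROT_A": "MAVSTQLLKVSTQ",
    "PROT_B": "MAVSAQLL",
}
hits = {}
for name, seq in candidates.items():
    positions = find_motif(seq)
    if positions:
        hits[name] = positions
print(hits)
```

The threonine is the third residue of the motif, so its position in the sequence is the reported start position plus 2.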

ID mapping - http://www.uniprot.org/mapping/

  • The ID mapping tool allows you to map identifiers to or from UniProtKB, i.e. to track proteins across various databases.

Several proteins have been identified in a proteomic experiment. Which GO terms do they share? (GI numbers of the identified proteins: 16130093, 20664033, 1789812, 89110178, 85677033, 27574045, 89111003, 229597766).
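Once the identifiers have been mapped to UniProtKB and the GO terms of each entry have been retrieved, finding the shared terms is a set intersection. The Python sketch below assumes you already have such a mapping as a dictionary; the accessions and GO labels are placeholders, not results for the GI numbers above.

```python
def shared_terms(go_by_protein):
    """Return the GO terms that are annotated on every protein in the mapping."""
    term_sets = [set(terms) for terms in go_by_protein.values()]
    if not term_sets:
        return set()
    return set.intersection(*term_sets)

# Placeholder mapping: accessions and GO labels are invented.
go_by_protein = {
    "ACC_1": {"GO:term_a", "GO:term_b", "GO:term_c"},
    "ACC_2": {"GO:term_b", "GO:term_c"},
    "ACC_3": {"GO:term_c", "GO:term_d"},
}
print(sorted(shared_terms(go_by_protein)))
```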