Number of entries in biological sequence databases

What you should know about this indicator

  • Biological sequence data includes data such as DNA and RNA sequences, amino acid sequences of proteins, and three-dimensional structures of proteins and other molecules found in living organisms. These organisms can be bacteria, viruses, plants, animals and humans.
  • Researchers use this data to better understand the biology of organisms, including the functions of genes and proteins and how they interact. This knowledge can then be applied, for example, to develop new drugs and treatments for illnesses or to use biotechnology to improve agriculture and environmental science.
  • This dataset provides an overview of the growth of key biological sequence databases over time.
  • GenBank is a database of RNA and DNA sequences maintained by the National Center for Biotechnology Information (NCBI). Researchers can submit nucleotide sequences, which are then reviewed and annotated by NCBI staff. Researchers are responsible for the scientific accuracy of their submissions.
  • RefSeq is a curated collection of DNA, RNA, and protein sequences maintained by the NCBI. It provides reference sequences for major research organisms, including humans, model organisms, and pathogens. RefSeq records integrate information on genomic DNA, transcripts, and proteins to provide a complete picture of each gene in each organism.
  • Protein Data Bank (PDB) is a database of 3D structural data of large biological molecules, such as proteins, nucleic acids, lipids, carbohydrates, and complex assemblies of these molecules. It only includes experimentally validated structures and is managed by the Research Collaboratory for Structural Bioinformatics (RCSB), which reviews each submission for quality and accuracy before adding it to the database.
  • UniProtKB/Swiss-Prot is a database of protein sequence and functional information. This includes information on both the protein sequence and structure, as well as its effects and interactions in an organism. The Swiss-Prot section is manually curated and contains only experimentally verified information.
  • AlphaFoldDB is a database of predicted 3D structures of proteins generated by the AlphaFold AI model, which uses deep learning to predict protein structures based on their amino acid sequences. While structures are generally not experimentally validated, when they are, predictions have been shown to be highly accurate.
  • ESMAtlas is a database of predicted 3D structures of proteins generated by the ESM AI model, which predicts protein structures based on their amino acid sequences. While structures are generally not experimentally validated, when they are, predictions have been shown to be highly accurate.
  • All databases listed here are freely accessible to the public and are widely used by researchers, educators, and students worldwide for various purposes, including scientific research, drug discovery, and education.

How is this data described by its producer?

GenBank

GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.

RefSeq

The Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq sequences form a foundation for medical, functional, and diversity studies. They provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.

RefSeq genomes are copies of selected assembled genomes available in GenBank. RefSeq transcript and protein records are generated by several processes including:

  • Computation
    • Eukaryotic Genome Annotation Pipeline
    • Prokaryotic Genome Annotation Pipeline
  • Manual curation
  • Propagation from annotated genomes that are submitted to members of the International Nucleotide Sequence Database Collaboration (INSDC)

Protein Data Bank (PDB)

RCSB PDB (RCSB.org) is the US data center for the global Protein Data Bank (PDB) archive of 3D structure data for large biological molecules (proteins, DNA, and RNA) essential for research and education in fundamental biology, health, energy, and biotechnology.

UniProtKB/Swiss-Prot

The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. In addition to capturing the core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added.

The UniProt Knowledgebase consists of two sections: a section containing manually-annotated records with information extracted from literature and curator-evaluated computational analysis (UniProtKB/Swiss-Prot), and a section with computationally analyzed records that await full manual annotation (UniProtKB/TrEMBL).

AlphaFoldDB

AlphaFold DB provides open access to over 200 million protein structure predictions to accelerate scientific research.

AlphaFold is an AI system developed by Google DeepMind that predicts a protein’s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiment.

Google DeepMind and EMBL’s European Bioinformatics Institute (EMBL-EBI) have partnered to create AlphaFold DB to make these predictions freely available to the scientific community. The latest database release contains over 200 million entries, providing broad coverage of UniProt (the standard repository of protein sequences and annotations). We provide individual downloads for the human proteome and for the proteomes of 47 other key organisms important in research and global health. We also provide a download for the manually curated subset of UniProt (Swiss-Prot).

ESMAtlas

A protein’s structure, the three-dimensional coordinates of all the atoms in the chain of amino acids, can be a key to understanding its function. This Metagenomic Atlas is the first large-scale view of the structures of metagenomic proteins encompassing hundreds of millions of proteins. To make structure predictions at this scale, a breakthrough in the speed of protein folding was necessary. We developed a new protein structure prediction approach named ESMFold. ESMFold uses the representations from a large language model (ESM2) to generate an accurate structure prediction from the sequence of a protein.

Number of entries in biological sequence databases
Biological sequence databases store data such as DNA, RNA, and amino acid sequences and 3D protein structures. This data includes entries from GenBank, RefSeq, the Protein Data Bank (PDB), and UniProtKB/Swiss-Prot, as well as predicted protein structures in AlphaFoldDB and ESMAtlas.
Source: Epoch AI (2024), with major processing by Our World in Data
Last updated: September 9, 2025
Date range: 1976–2024
Unit: entries

Sources and processing

This data is based on the following sources

Epoch AI – Trends in Biological Sequence Data

Growth of key biological sequence databases between January 1976 and January 2024.

Biological sequence data used to train biological sequence models is provided by a vast array of public databases compiled by government, academic, and private institutions. Epoch delineates major sources into three primary categories:

  • DNA sequence databases. These have the highest growth rate of analyzed databases, with GenBank seeing a 31% increase in the number of sequences stored between 2022 and 2023. Whole genome shotgun sequencing studies have been the driving force of growth of DNA data, as the increase in number of entries in all other GenBank divisions, referred to as traditional entries, is greatly attenuated in comparison.

  • Protein sequence databases. The level of detail in protein sequence databases can vary. Databases with rich annotations such as UniProtKB have a much slower growth rate (6.7%), compared to metagenomic databases such as MGnify (20%), which provide protein sequences but lack detailed information about the protein’s structure, function, and origin.

  • Protein structure databases. Gathering experimental data on protein structures is slow and painstaking. Thus, the Protein Data Bank grows by only 6.5% per year. By contrast, databases publishing protein structures predicted by AI models can quickly generate large volumes of synthetic data. Databases of synthetic data such as AlphaFoldDB and ESMAtlas have dramatically boosted the supply of available data, though their growth could slow as opportunities for synthetic data are exhausted.
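
The percentages quoted in the list above appear to be year-over-year growth rates. A minimal sketch of that arithmetic follows. The helper name `yearly_growth_percent` is hypothetical, and the counts are made up purely so that the result lands on the 31% figure quoted for GenBank; they are not actual GenBank totals.

```python
def yearly_growth_percent(previous_count, current_count):
    """Percentage change in the number of entries from one year to the next."""
    if previous_count <= 0:
        raise ValueError("previous_count must be positive")
    return (current_count - previous_count) / previous_count * 100


# Hypothetical counts for illustration only, not real database sizes.
growth = yearly_growth_percent(1_000, 1_310)
print(f"Year-over-year growth: {growth:.1f}%")
```

The same calculation applies to any of the databases, as long as both counts use the same definition of an entry.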

The majority of entries in large biological databases such as the International Nucleotide Sequence Database Collaboration (INSDC), MGnify, UniProtKB and PDB pertain to cellular organisms (humans, animals, plants, fungi, yeast, bacteria). For example, UniProtKB entries comprise 97% cellular and 2% viral protein sequences, a subset of which are known pathogens.

Retrieved on: September 9, 2025
Citation
This is the citation of the original data obtained from the source, prior to any processing or adaptation by Our World in Data. To cite data downloaded from this page, please use the suggested citation given in Reuse This Work below.
Nicole Maug, Aidan O'Gara and Tamay Besiroglu (2024), "Biological Sequence Models in the Context of the AI Directives". Published online at epoch.ai. Retrieved from: 'https://epoch.ai/blog/biological-sequence-models-in-the-context-of-the-ai-directives' [online resource]

How we process data at Our World in Data

All data and visualizations on Our World in Data rely on data sourced from one or several original data providers. Preparing this original data involves several processing steps. Depending on the data, this can include standardizing country names and world region definitions, converting units, calculating derived indicators such as per capita measures, as well as adding or adapting metadata such as the name or the description given to an indicator.

At the link below you can find a detailed description of the structure of our data pipeline, including links to all the code used to prepare data across Our World in Data.

Read about our data pipeline

Notes on our processing step for this indicator

We use the data collected by Epoch AI on the growth of key biological sequence databases over time. We have added their extraction notes below for reference.

We show the maximum number of entries reported for each database in a given year.
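
As a minimal sketch of this step, assuming the raw data arrives as (database, date, entries) records, the yearly maximum could be computed as shown here. The record layout, the function name `max_entries_per_year`, and the sample values are all hypothetical illustrations, not the actual pipeline code or data.

```python
from collections import defaultdict
from datetime import date


def max_entries_per_year(records):
    """Return {(database, year): largest entry count reported in that year}.

    `records` is an iterable of (database, report_date, entries) tuples.
    This layout is an assumption made for illustration.
    """
    yearly_max = defaultdict(int)
    for database, report_date, entries in records:
        key = (database, report_date.year)
        yearly_max[key] = max(yearly_max[key], entries)
    return dict(yearly_max)


# Made-up sample values, not real database counts.
sample = [
    ("ExampleDB", date(2020, 2, 1), 100),
    ("ExampleDB", date(2020, 10, 1), 140),
    ("ExampleDB", date(2021, 4, 1), 180),
    ("OtherDB", date(2020, 6, 1), 30),
]

result = max_entries_per_year(sample)
for (database, year), entries in sorted(result.items()):
    print(database, year, entries)
```

Taking the maximum rather than the last reported value means a year with several releases is represented by its largest count, whatever order the reports arrive in.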

Extraction notes from Epoch AI

Reuse this work

  • All data produced by third-party providers and made available by Our World in Data are subject to the license terms from the original providers. Our work would not be possible without the data providers we rely on, so we ask you to always cite them appropriately (see below). This is crucial to allow data providers to continue doing their work, enhancing, maintaining and updating valuable data.
  • All data, visualizations, and code produced by Our World in Data are completely open access under the Creative Commons BY license. You have the permission to use, distribute, and reproduce these in any medium, provided the source and authors are credited.

Citations

How to cite this page

To cite this page overall, including any descriptions, FAQs or explanations of the data authored by Our World in Data, please use the following citation:

“Data Page: Number of entries in biological sequence databases”. Our World in Data (2025). Data adapted from Epoch AI. Retrieved from https://archive.ourworldindata.org/20251028-170323/grapher/number-of-entries-in-biological-sequence-databases.html [online resource] (archived on October 28, 2025).

How to cite this data

In-line citation

If you have limited space (e.g. in data visualizations), you can use this abbreviated in-line citation:

Epoch AI (2024) – with major processing by Our World in Data

Full citation

Epoch AI (2024) – with major processing by Our World in Data. “Number of entries in biological sequence databases” [dataset]. Epoch AI, “Trends in Biological Sequence Data” [original data]. Retrieved November 5, 2025 from https://archive.ourworldindata.org/20251028-170323/grapher/number-of-entries-in-biological-sequence-databases.html (archived on October 28, 2025).