DNA storage combines biotechnology (BT) with information technology (IT) and is often described as a green revolution in data storage. DNA is a large polymer composed of deoxyribonucleotides, each consisting of a base, a deoxyribose sugar, and a phosphate group. There are four bases in DNA: adenine (A), guanine (G), thymine (T), and cytosine (C). In nature, the order of these bases along the DNA strand encodes genetic information, which guides the growth, development, and activities of organisms throughout their life cycle. DNA storage technology uses the four bases as a coding alphabet, so that digital information can be translated into DNA sequences.
The conceptual diagram of DNA storage technology.
The mainstream DNA storage method is based on mapping 0 and 1 to the four DNA bases (A, G, C, T), converting digital signals into chemical signals through information encoding. In other words, to store binary digital files in DNA, scientists convert 0 and 1 into the letters A, G, C, and T, synthesizing DNA strands corresponding to the sequence of binary digital files. To retrieve the data, the DNA strands are sequenced, and then the base sequence is restored to the initial digital sequence according to the encoding rules.
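The mapping described above can be sketched in a few lines of Python. The article does not fix a particular table, so the assignment 00→A, 01→C, 10→G, 11→T and the function names below are assumptions chosen for illustration; real encoding schemes are usually more elaborate.

```python
# Hypothetical 2-bits-per-base table; the article does not specify one.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bits_to_dna(bits: str) -> str:
    """Encode a binary string (even length) as a string of DNA bases."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bits(dna: str) -> str:
    """Decode a string of DNA bases back into the binary string."""
    return "".join(BASE_TO_BITS[base] for base in dna)
```

With this table, bits_to_dna("00011011") gives "ACGT", and dna_to_bits("ACGT") restores "00011011".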
As a data storage medium, DNA offers significant advantages over traditional storage media in storage density, lifespan, energy consumption, and data security.
DNA has the highest information storage density among known storage technologies. For example, while flash memory can store 1 bit of data within 10 nm, DNA can store 2 bits of data within 0.34 nm. One kilogram of DNA can store 2×10^24 bits of data, whereas storing the same amount of data using flash memory would require over 10^9 kilograms of silicon raw materials.
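To put these figures in perspective, the short calculation below compares the two linear densities and converts the per-kilogram capacity into zettabytes. It uses only the numbers quoted in the text; the variable names are illustrative, and the result is a back-of-the-envelope estimate, not a measured value.

```python
# Figures quoted in the text above.
FLASH_BITS_PER_NM = 1 / 10    # 1 bit per 10 nm
DNA_BITS_PER_NM = 2 / 0.34    # 2 bits per 0.34 nm (one base)
DNA_BITS_PER_KG = 2e24        # bits per kilogram of DNA

linear_ratio = DNA_BITS_PER_NM / FLASH_BITS_PER_NM
zettabytes_per_kg = DNA_BITS_PER_KG / 8 / 1e21  # 8 bits per byte, 1 ZB = 1e21 bytes

print(f"linear density ratio (DNA / flash): {linear_ratio:.1f}")
print(f"capacity per kg of DNA: {zettabytes_per_kg:.0f} ZB")
```

On these numbers, DNA is roughly 59 times denser than flash along one dimension, and one kilogram corresponds to about 250 zettabytes.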
The lifespan of optical discs is 10-15 years, hard drives and flash memory have a lifespan of 5-10 years, and magnetic tape storage lasts 15-30 years. Additionally, traditional data storage systems require regular clearing of damaged data and replacement of faulty units, resulting in high maintenance costs for long-term, large-scale data storage. In contrast, DNA is an extremely stable biological molecule, with a half-life of over 500 years. Particularly under low-temperature conditions, DNA can be preserved for thousands to tens of thousands of years.
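A half-life can be turned into an estimate of how much DNA remains intact after a given time. The sketch below assumes simple exponential (first-order) decay, which the text does not state explicitly. The 500-year default is the lower bound quoted above, and the function name is hypothetical.

```python
def fraction_intact(years: float, half_life_years: float = 500.0) -> float:
    """Fraction of DNA still intact after `years`, assuming exponential decay."""
    return 0.5 ** (years / half_life_years)
```

For example, fraction_intact(1000) returns 0.25: with a 500-year half-life, a quarter of the material would remain after 1,000 years. Preservation over tens of thousands of years therefore implies a much longer effective half-life, which is why low-temperature storage matters.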
Fabricating traditional storage media requires large amounts of non-renewable resources such as rare earth metals, whereas DNA storage media require only nucleoside monomers and some essential reagents, which saves energy from the raw materials stage onward. Furthermore, retrieving information from traditional storage media requires electrical energy input, most of which is dissipated as heat (approximately 0.01-0.04 W/GB). This poses a substantial energy consumption challenge for large data centers, where energy utilization efficiency is low. In contrast, apart from the power consumed by the equipment itself, reading and writing DNA storage media involves mostly chemical reactions, so the energy input is lower and the utilization efficiency higher (estimates for cellular DNA are below 10^-10 W/GB), saving a significant amount of electrical energy.
With advances in mathematics and computing, traditional binary cryptography and steganography have become increasingly vulnerable to attack, weakening the protection they offer. As biotechnology (BT) and information technology (IT) advance, scientists are exploring new encryption techniques based on biomolecules. DNA, proteins, aptamers, and even bacteria are being used to protect information. In DNA storage in particular, the variety of possible encoding schemes and the complex retrieval, sequencing, and reading operations needed to recover the data create an inherent technological barrier, which enhances data security.
The current DNA data storage process consists of the following five key steps:
(1) Encoding: Transcoding data into DNA code (conversion from binary to DNA).
(2) Writing: Synthesizing the DNA sequence (chemical synthesis method/enzymatic method).
(3) Storing: Storing these data (freeze-drying and storing in inert gas/storing on particles or silicon chips).
(4) Reading: When data retrieval is needed, sequencing and reading the DNA (first-generation, second-generation, third-generation sequencing technologies).
(5) Decoding: Decoding the DNA sequence to restore it to binary data (converting DNA back to binary files).
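The five steps above can be simulated end to end in software. In the sketch below, synthesis and sequencing are replaced by plain string handling, and "storage" is just a shuffled list, reflecting the fact that strands in a pool have no inherent order. The 2-bit mapping, the 1-byte index prefix, and all function names are assumptions made for illustration, not the method of any particular system.

```python
import random

BASES = "ACGT"  # position in this string = 2-bit value (assumed mapping)


def encode(data: bytes) -> str:
    """Step 1: bytes -> bases, 4 bases per byte, most significant bits first."""
    return "".join(BASES[(byte >> shift) & 0b11]
                   for byte in data for shift in (6, 4, 2, 0))


def decode(dna: str) -> bytes:
    """Step 5: bases -> bytes (inverse of encode)."""
    out = bytearray()
    for i in range(0, len(dna), 4):
        value = 0
        for base in dna[i:i + 4]:
            value = (value << 2) | BASES.index(base)
        out.append(value)
    return bytes(out)


def write_oligos(data: bytes, payload_bytes: int = 4) -> list[str]:
    """Step 2 stand-in: split data into short strands, each with a 1-byte index."""
    chunks = [data[i:i + payload_bytes] for i in range(0, len(data), payload_bytes)]
    if len(chunks) > 256:
        raise ValueError("this sketch uses a 1-byte index (max 256 strands)")
    return [encode(bytes([idx]) + chunk) for idx, chunk in enumerate(chunks)]


def read_oligos(pool: list[str]) -> bytes:
    """Steps 4-5 stand-in: decode each strand, order by index, join payloads."""
    decoded = sorted(decode(strand) for strand in pool)
    return b"".join(record[1:] for record in decoded)


pool = write_oligos(b"Hello, DNA!")
random.shuffle(pool)      # Step 3 stand-in: the stored pool is unordered
print(read_oligos(pool))  # recovers the original bytes
```

The index prefix is what lets the decoder reassemble the file regardless of the order in which strands are read. A practical system would also add error correction, which this sketch omits.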
Encoding and decoding are the first and final stages of the DNA data storage process. Several methods exist for these steps, ranging from direct mapping, where one bit or a pair of bits is represented by a single DNA base (A, C, G, or T), to more ambitious strategies in which each oligonucleotide carries two entirely different sets of data within the same DNA sequence. Although encoding may seem like a straightforward conversion of binary data into a sequence of DNA bases, DNA as a data medium has limitations that must be taken into account.
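The text does not list these limitations. Two that are commonly cited in the DNA storage literature are long runs of the same base (homopolymers) and unbalanced GC content, both of which tend to raise synthesis and sequencing error rates. The sketch below screens a candidate sequence against both. The thresholds (runs of at most 3 bases, GC fraction between 40% and 60%) and the function names are illustrative assumptions, not values taken from the article.

```python
def max_homopolymer_run(dna: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    longest = current = 0
    previous = ""
    for base in dna:
        current = current + 1 if base == previous else 1
        longest = max(longest, current)
        previous = base
    return longest


def gc_fraction(dna: str) -> float:
    """Fraction of bases that are G or C."""
    if not dna:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in dna) / len(dna)


def passes_constraints(dna: str, max_run: int = 3,
                       gc_low: float = 0.4, gc_high: float = 0.6) -> bool:
    """True if the sequence meets both (illustrative) constraints."""
    return (max_homopolymer_run(dna) <= max_run
            and gc_low <= gc_fraction(dna) <= gc_high)
```

Under these thresholds, "ACGTACGT" passes, while "AAAAACGT" fails because of its run of five A bases. An encoder can use such a check to reject or remap sequences before synthesis.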
DNA can be stored in three ways: (1) in test tubes or (2) in specially designed DNA storage devices (both in vitro), or (3) as part of a living organism (in vivo). Regardless of the method, a controlled environment is crucial: DNA is relatively stable outside living organisms but is susceptible to degradation by UV light, mechanical forces, and hydrolysis. Preservation methods such as storage on filter paper or freeze-drying can be employed. Encapsulating DNA in nanoparticles or in capsules with an inert atmosphere has been reported to preserve it for tens of thousands or even hundreds of thousands of years. In vivo data storage keeps digital data in microorganisms such as bacteria or yeast, either by encoding information in their genetic material or by using their metabolic pathways for data storage and processing. Storing and replicating data within living organisms is highly scalable because cells have their own mechanisms for DNA replication and repair, and some organisms can maintain their DNA for years under adverse conditions. DNA carrying digital data can be integrated into stable regions of the host genome or stored separately in synthetic chromosomes.
DNA sequencing is the natural retrieval method for information stored in DNA sequences. Structure-based storage methods simplify the synthesis step: they rely on a limited number of predefined sequence folding blocks, in contrast with traditional sequence-based methods, which require many unique sequences to be synthesized. This approach takes advantage of the fact that copying DNA strands from predefined templates, as in the biological processes of DNA replication and transcription, is easier and faster than synthesizing them from scratch.
From the perspective of technological development, the characteristic dimensions of semiconductor devices, micro- and nanofabrication, and biological systems are converging, which makes the fusion of semiconductor technology and biotechnology a natural trend. DNA data storage is currently the most typical example of this fusion. The clear technological framework of DNA storage systems, together with accumulated advances in high-throughput synthesis and sequencing, is setting the stage for a new round of technological breakthroughs, and in this sense the emergence of DNA data storage follows naturally from technological development. DNA data storage is expected not only to change traditional data storage methods but also, by bringing digitization, miniaturization, integration, and intelligence to biochemical processing and operations, to have a significant impact on the future biochemical and medical testing industries and to alter product forms in fields such as biochemical detection and healthcare.