NC State researchers have developed new techniques for labeling and retrieving data files in DNA-based information storage systems, addressing two of the key obstacles to widespread adoption of DNA data storage technologies.
“DNA systems are attractive because of their potential information storage density; they could theoretically store a billion times the amount of data stored in a conventional electronic device of comparable size,” says Dr. James Tuck, co-corresponding author of a paper on the work and an associate professor of electrical and computer engineering.
“But two of the big challenges here are, how do you identify the strands of DNA that contain the file you are looking for? And once you identify those strands, how do you remove them so that they can be read — and do so without destroying the strands?”
“Previous work had come up with a system that appends short, 20-monomer-long sequences of DNA called primer-binding sequences to the ends of DNA strands that are storing information,” says Dr. Albert Keung, co-corresponding author of the paper and an assistant professor in the Department of Chemical and Biomolecular Engineering. “You could use a small DNA primer that matches the corresponding primer-binding sequence to identify the appropriate strands that comprise your desired file. However, there are only an estimated 30,000 of these binding sequences available, which is insufficient for practical use. We wanted to find a way to overcome this limitation.”
To address these problems, the researchers developed two techniques that, taken together, they call DNA Enrichment and Nested Separation, or DENSe.
The researchers tackled the file identification challenge by using two nested primer-binding sequences. The system first identifies all of the strands containing the first binding sequence. It then conducts a second “search” of that subset of strands to single out those that also contain the second binding sequence.
“This increases the number of estimated file names from approximately 30,000 to approximately 900 million,” Tuck says.
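The jump from roughly 30,000 to roughly 900 million is consistent with pairing: if each of about 30,000 outer sequences can be combined with each of about 30,000 inner sequences, the number of distinct file names is 30,000 × 30,000. The short Python sketch below illustrates that arithmetic and the two-pass search in a toy model. It is an illustration only, not the researchers' method: the function names `address_space` and `select`, the four-letter addresses, and the strand layout (outer address, then inner address, then payload) are all assumptions made for the example, and plain string matching stands in for the actual laboratory chemistry.

```python
# Toy model of nested addressing. All sequences and names here are
# hypothetical; real primer-binding sequences are about 20 monomers long.
ESTIMATED_PRIMERS = 30_000  # estimate quoted in the article


def address_space(primers: int, levels: int) -> int:
    """Distinct file names when `levels` binding sequences are nested."""
    return primers ** levels


def select(strands, outer, inner):
    """Two-pass search: keep strands with the outer address, then
    keep only those whose next segment is the inner address."""
    first_pass = [s for s in strands if s.startswith(outer)]
    return [s for s in first_pass if s[len(outer):].startswith(inner)]


pool = [
    "AAAC" + "GGTT" + "ACGTACGT",  # outer AAAC, inner GGTT (the target file)
    "AAAC" + "CCTA" + "TTGACCAA",  # same outer address, different inner
    "TTGC" + "GGTT" + "GGGGCCCC",  # different outer address, same inner
]

print(address_space(ESTIMATED_PRIMERS, 1))  # 30000
print(address_space(ESTIMATED_PRIMERS, 2))  # 900000000
print(select(pool, "AAAC", "GGTT"))         # ['AAACGGTTACGTACGT']
```

Only the first strand survives both passes: the second is dropped in the inner search and the third in the outer search.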
Once identified, the file still needs to be extracted. Existing techniques use polymerase chain reaction (PCR) to make lots (and lots) of copies of the relevant DNA strands, then sequence the entire sample. Because there are so many copies of the targeted DNA strands, their signal overwhelms the rest of the strands in the sample, making it possible to identify the targeted DNA sequence and read the file.
Co-lead authors of the paper are Kyle Tomek and Kevin Volkel, both Ph.D. students at NC State. The paper was co-authored by Alexander Simpson, a former graduate student at NC State, and by Austin Hass and Elaine Indermaur, both undergraduates at NC State.