Poster Presentation 45th Lorne Genome Conference 2024

Using machine learning approaches to resolve the junk DNA-dark DNA debate (#122)

Brett N Adey 1, Danielle J Maddock 1, Anthony M Poole 1, Paul P Gardner 2,3, Austen RD Ganley 1,4
  1. School of Biological Sciences, University of Auckland, Auckland, New Zealand
  2. Department of Biochemistry, University of Otago, Dunedin, New Zealand
  3. Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand
  4. Digital Life Institute, University of Auckland, Auckland, New Zealand

Only about 2% of the human genome codes for proteins, raising the question of what the rest of the genome does. This is under active debate, with two views predominating. In one view, these regions are conceptualised as ‘dark’ DNA: regions that have a function that is currently unknown. In the opposing view, these regions mostly comprise non-functional ‘junk’ DNA. Although most of the human genome does not encode proteins, most of it is transcribed, a phenomenon known as pervasive transcription. A key part of the debate is whether biochemical activities such as pervasive transcription can be interpreted as signatures of function, or whether they are simply background noise. To distinguish between these interpretations, a negative control is required: a DNA sequence that has not been shaped by selection and is therefore bona fide junk DNA. To provide such a control, we are using organisms that have heterologous DNA inserted into their genomes, including random DNA that has been enzymatically generated in the lab. These heterologous sequences are expected to provide no selected benefit to the host organism, thus allowing us to establish a baseline of genomic activity by measuring the levels of activities such as transcription, histone modification, and methylation. We are using machine learning approaches, both supervised (e.g. Random Forests) and unsupervised (e.g. PCA, k-means), to determine whether the activity patterns of native noncoding regions can be distinguished from those of heterologous junk DNA. This work provides the control necessary to identify candidate junk and functional regions within the noncoding genome, thereby resolving the junk DNA-dark DNA debate.
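
As a rough illustration of the unsupervised side of this comparison, the sketch below clusters regions by their activity profiles with a plain k-means and then asks how well the clusters line up with the native versus heterologous labels. It is not the authors' pipeline. The function names, the three features (transcription, histone modification, methylation), and all numeric values are assumptions made up for the example, and the separation between the two groups is built into the synthetic data rather than measured from any genome.

```python
import random


def make_synthetic_regions(n_per_class=100, seed=0):
    """Hypothetical data: one (transcription, histone mark, methylation) tuple per region.

    The group means below are invented for illustration only; they are not results.
    """
    rng = random.Random(seed)
    means = {"native": (2.0, 1.5, 1.0), "heterologous": (0.5, 0.5, 0.5)}
    points, classes = [], []
    for name, mu in means.items():
        for _ in range(n_per_class):
            points.append(tuple(rng.gauss(m, 0.5) for m in mu))
            classes.append(name)
    return points, classes


def kmeans(points, k=2, iters=50, seed=0):
    """Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen regions
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def purity(clusters, classes):
    """Fraction of regions that belong to the majority class of their cluster."""
    total = 0
    for c in set(clusters):
        members = [cls for cl, cls in zip(clusters, classes) if cl == c]
        total += max(members.count(v) for v in set(members))
    return total / len(classes)


points, classes = make_synthetic_regions()
clusters = kmeans(points)
print("cluster purity:", purity(clusters, classes))
```

With two equally sized classes and two clusters, a purity near 0.5 would mean the clusters carry no information about origin, while values approaching 1.0 would mean native and heterologous regions have separable activity profiles. A supervised model such as a Random Forest would address the same question through cross-validated classification accuracy instead.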