ORCID
0009-0004-9913-7150
Year
2025
Season
Fall
Paper Type
Master's Thesis
College
College of Computing, Engineering & Construction
Degree Name
Master of Science in Computer and Information Sciences (MS)
Department
Computing
NACO controlled Corporate Body
University of North Florida. School of Computing
Committee Chairperson
Dr. Indika Kahanda
Second Advisor
Dr. Xudong Liu
Third Advisor
Dr. Sandeep Reddivari
Department Chair
Dr. Nan Niu
College Dean
Dr. William Klostermeyer
Abstract
Phenotypes are the observable characteristics of an individual organism. Predicting quantitative phenotypes from genomic variation remains challenging when causal signals span both local motifs and distal regulatory contexts. Building on Frequented Regions (FRs), subsequences conserved across genomes and extracted from a pangenome graph generated from a large collection of closely related species, we compare several modeling strategies across 35 Saccharomyces cerevisiae growth phenotypes: Random Forest (RF) on FR counts (called RF-Counts), RF on FR sequences, 1D convolutional neural networks (CNN) on FR sequences, Long Short-Term Memory (LSTM) networks on FR sequences, a Genome-Wide Association Study (GWAS) baseline, and a sequence-based transformer model, gReLU-Enformer, trained on raw FR nucleotide windows with prediction aggregation across sliding windows. Across the 35 phenotypes, sequence-based models often outperform the GWAS baseline and remain competitive with RF-Counts, indicating that nucleotide-level sequence context may provide additional predictive signal for many conditions. At the same time, RF-Counts is the top-performing model for a substantial subset of phenotypes, demonstrating that simple count-based summaries of FR frequency may capture strong predictive cues in settings dominated by short-range features. The gReLU-Enformer model consistently outperforms the CNN and LSTM baselines across all phenotypes and surpasses RF on FR sequences for most conditions, while remaining competitive in the remainder. These results suggest that transformer-based modeling of raw sequence windows may yield measurable advantages when regulatory influences arise from multiple or distal genomic regions. In contrast, lightweight short-sequence or local-pattern learners may remain effective for phenotypes driven by more localized sequence signals.
Overall, while short-range motif statistics can suffice for certain traits, architectures that integrate positional context and potential long-range interactions can provide additional gains, particularly when phenotypic variation reflects dispersed regulatory mechanisms.
Suggested Citation
Vemuri, Tejaswi, "Computational pangenomics and machine learning for genotype-phenotype analysis" (2025). UNF Graduate Theses and Dissertations. 1386.
https://digitalcommons.unf.edu/etd/1386