ORCID

0009-0004-9913-7150

Year

2025

Season

Fall

Paper Type

Master's Thesis

College

College of Computing, Engineering & Construction

Degree Name

Master of Science in Computer and Information Sciences (MS)

Department

Computing

NACO controlled Corporate Body

University of North Florida. School of Computing

Committee Chairperson

Dr. Indika Kahanda

Second Advisor

Dr. Xudong Liu

Third Advisor

Dr. Sandeep Reddivari

Department Chair

Dr. Nan Niu

College Dean

Dr. William Klostermeyer

Abstract

Phenotypes are the observable characteristics of an individual organism. Predicting quantitative phenotypes from genomic variation remains challenging when causal signals span both local motifs and distal regulatory contexts. Building on Frequented Regions (FRs), subsequences conserved across genomes and extracted from a pangenome graph generated from a large collection of closely related species, we compare several modeling strategies across 35 Saccharomyces cerevisiae growth phenotypes: Random Forest (RF) on FR counts (called RF-Counts), RF on FR sequences, 1D convolutional neural networks (CNN) on FR sequences, Long Short-Term Memory (LSTM) networks on FR sequences, a Genome-Wide Association Study (GWAS) baseline, and a sequence-based transformer model, gReLU-Enformer, trained on raw FR nucleotide windows with prediction aggregation across sliding windows. Across the 35 phenotypes, sequence-based models often outperform the GWAS baseline and remain competitive with RF-Counts, indicating that nucleotide-level sequence context may provide additional predictive signal for many conditions. At the same time, RF-Counts is the top-performing model for a substantial subset of phenotypes, demonstrating that simple count-based summaries of FR frequency may capture strong predictive cues in settings dominated by short-range features. The gReLU-Enformer model consistently outperforms the CNN and LSTM baselines across all phenotypes and surpasses RF (FR-sequences) on most conditions, while remaining competitive in the remainder. These results suggest that transformer-based modeling of raw sequence windows may yield measurable advantages when regulatory influences arise from multiple or distal genomic regions. In contrast, lightweight short-sequence or local-pattern learners may remain effective for phenotypes driven by more localized sequence signals.
Overall, while short-range motif statistics can suffice for certain traits, architectures that integrate positional context and potential long-range interactions can provide additional gains, particularly when phenotypic variation reflects dispersed regulatory mechanisms.
