Predicting Phenotypic Traits Using a Massive RNA-seq Dataset

bioRxiv (2023)

Abstract
Transcriptomic data can be used to predict environmentally impacted phenotypic traits. This type of prediction is particularly useful for monitoring difficult-to-measure phenotypic traits and has become increasingly popular for monitoring high-value agricultural crops and in precision medicine. Despite this increase in popularity, little research has been done on how many samples these models need to be accurate, or on which normalization method should be used. Here we create a massive RNA-seq dataset from publicly available Arabidopsis thaliana data with corresponding measurements for age and tissue type. We use this dataset to determine how many samples are required for accurate model prediction and which normalization method to use. We find that Median Ratios Normalization significantly increases performance when predicting age. We also find that, in the case of our dataset, only a few hundred samples are required to predict tissue types, and only a few thousand samples are necessary to accurately predict age. Researchers should consider these results when choosing the number of samples in a transcriptomic experiment and during data processing.

Author Summary
Large datasets have become ubiquitous in both research and industry, with thousands and sometimes millions of samples being collected for a single project. In biology, a prominent new technology is RNA-seq, which can be used to measure the expression level of thousands of genes for a single sample. These measurements are used for a variety of downstream applications, including predicting phenotypic traits (e.g. height or disease). A number of experiments have attempted to use RNA-seq data to make phenotype predictions, with varying success. The mixed results are partly due to the small sample sizes of these experiments. RNA-seq datasets are currently relatively small (only a dozen to a few hundred samples) because of the cost per sample. This is expected to change as the cost of sequencing decreases.
In this paper we create a massive conglomerate RNA-seq dataset from publicly available Arabidopsis thaliana RNA-seq data. We use this dataset to determine how many samples are required to accurately predict plant age and tissue type using machine learning models. We also explore the best way to normalize large datasets. Our results show the potential of massive RNA-seq datasets, and can be used to inform experimental design for phenotype prediction.

### Competing Interest Statement

The authors have declared no competing interest.

* Arabidopsis: Arabidopsis thaliana
* DAG: dataset containing samples labeled as Days After Germination
* DAL: dataset containing samples “Days” Age annotation Labeled
* DAS: dataset containing samples labeled as Days After Sowing
* GEM: Gene Expression Matrix
* HSD: Tukey’s Honestly Significant Difference
* MRN: Median Ratios Normalization
* NCBI: National Center for Biotechnology Information
* NoNo: No Normalization
* SMOTE: Synthetic Minority Over-sampling Technique
* SRA: Sequence Read Archive
* TAIR: The Arabidopsis Information Resource
* tissue-4: dataset containing samples annotated as “leaf”, “seed”, “root”, and “flower”
* tissue-6: dataset containing samples annotated as “leaf”, “seedling”, “shoot”, “seed”, “root”, and “flower”
* TMM: Trimmed Mean of M values
* TPM: Transcripts Per kilobase Million
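
The abstract reports that Median Ratios Normalization (MRN) improves age prediction but does not spell out the computation. As a rough illustration only, the sketch below assumes the widely used DESeq-style median-of-ratios scheme: each sample's size factor is the median, over genes, of the ratio between that sample's count and the gene's geometric mean across samples. The exact MRN variant used by the authors may differ, and the function names `median_ratio_size_factors` and `normalize` are hypothetical.

```python
import math
from statistics import median


def median_ratio_size_factors(counts):
    """Return one size factor per sample.

    counts: list of genes, each a list of raw counts (one per sample).
    Assumption: DESeq-style median of ratios, not necessarily the
    exact MRN variant used in the paper.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        # Genes with a zero count in any sample have no usable
        # geometric mean, so they are skipped for factor estimation.
        if any(c <= 0 for c in gene):
            continue
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has nonzero counts in every sample")
    # Median of log-ratios, mapped back to the linear scale.
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts):
    """Divide every count by its sample's size factor."""
    factors = median_ratio_size_factors(counts)
    return [[c / f for c, f in zip(gene, factors)] for gene in counts]


# Toy gene expression matrix: sample 2 was sequenced twice as deeply.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # excluded from factor estimation because of the zero
]
toy_factors = median_ratio_size_factors(toy_counts)
toy_normalized = normalize(toy_counts)
```

In the toy matrix the second sample has twice the depth of the first, so its size factor comes out twice as large and the normalized counts for the first three genes match across the two samples. Working in log space keeps the geometric mean numerically stable when counts span several orders of magnitude.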