
Dear all,

Our next AI seminar, *"Over a trillion bases and counting: why leveraging public databases is essential for microbiome analysis"* by Professor Maude David, is scheduled for February 23rd (today), 1-2 PM PST.

Zoom Link: https://oregonstate.zoom.us/j/93591935144?pwd=YjZaSjBYS0NmNUtjQzBEdzhPeDZ5UT...

*Over a trillion bases and counting: why leveraging public databases is essential for microbiome analysis*
Maude David
Assistant Professor
Department of Microbiology
Oregon State University

*Abstract:*
The amount of publicly available sequencing data has doubled approximately every 18 months since 1982, and there are now over a billion Whole Genome Shotguns available on GenBank. But the number of studies that incorporate available datasets remains marginal. As a result, especially in human gut microbiome studies, where collecting clinical samples can be arduous, the number of taxa considered in any one study often exceeds the number of samples ten- to one hundred-fold. In this presentation we will first focus on 16S amplicon data, deriving microbiome-level properties by applying an embedding algorithm to quantify taxon co-occurrence patterns in over 18,000 samples from the American Gut Project microbiome crowdsourcing effort. We show that predictive models trained using property data are the most accurate, robust, and generalizable, and that property-based models can be trained on one dataset and deployed on another. Using these properties, we are able to extract known and new bacterial metabolic pathways associated with inflammatory bowel disease across two completely independent studies.

Using publicly available datasets presents several limitations, among them disparities in methodologies and sequencing technologies and, notably, poor functional annotations.
Rather than relying solely on completely curated databases for functional annotation, the second part of this presentation will focus on a new annotation strategy in which we recruited protein sequences carrying common protein domains (via Pfam) alongside KEGG Orthologs. We leveraged the unannotated sequences to generate KO-level Hidden Markov Models that proved more sensitive than non-propagated models on an independent testing set.

*Speaker Bio:*
Maude David graduated with a Ph.D. from Ecole Centrale de Lyon (France), and joined Oregon State University as an assistant professor in 2018 after her postdoctoral work at Lawrence Berkeley National Laboratory and Stanford University. The David Lab focuses on new biocomputing methods to utilize publicly available datasets to analyze microbiome sequencing data. The lab also works on how the gut microbiota can modulate behavior via the gut-microbiome-brain axis. We use an interdisciplinary approach including the use of mouse models, in vitro cell culture, anaerobic bacteria culture, bees (!) and machine learning to tackle those questions.

*Please watch this space for future AI Seminars:*
https://eecs.oregonstate.edu/ai-events

Rajesh Mangannavar,
Graduate Student
Oregon State University

----
AI Seminar Important Reminders:
-> For graduate students in the AI program, attendance is strongly encouraged.