SORA-TABA-DLSPH Workshop 2020

SORA-TABA Workshop & DLSPH Biostatistics Research Day

The Division of Biostatistics at the Dalla Lana School of Public Health is pleased to host the 2020 SORA-TABA Workshop as an online workshop. The event brings together regional and local statistical communities who are interested in biostatistics, financial statistics and other applied areas of statistics. Please join us in making this event a great success!

Please note: The DLSPH Biostatistics Research Day Poster Presentations were held as an online event on June 17-18, 2020.

Date: August 20-21, 2020
Time: 11:00 am to 5:30 pm EDT
Location: Online Event
Title: Statistical Machine Learning for Biomedical Data
Instructor: Dr. Noah Simon, University of Washington

Registration Fee:

Statistical Machine Learning for Biomedical Data

Dr. Noah Simon received his PhD in Statistics from Stanford University under the supervision of Professor Robert Tibshirani. He is an Associate Professor in the Department of Biostatistics at the University of Washington and has affiliate appointments at the Therapeutics Development Network of Seattle Children’s Hospital and the Kaiser Permanente Health Research Institute. His work is at the intersection of biostatistics, machine learning, and computational biology. He develops methodology that engages with machine learning, biomarker discovery, and clinical trial design. His collaborative work includes applications in immunology, oncology, and cystic fibrosis, among other areas.

Abstract:

Dr. Noah Simon will present a number of supervised learning methods that can be applied to Biomedical Big Data. In particular, he will cover penalized approaches to regression and classification, as well as support vector machines, tree-based methods, and deep learning.

Dr. Simon will consider the analysis of “high-dimensional Omics” data sets. These data are typically characterized by a huge number of molecular measurements (such as genes) and a relatively small number of samples (such as patients). In addition, he will discuss the use of these tools in the development of prognostic and predictive biomarkers.

Throughout the course, Dr. Simon will focus on common pitfalls in the supervised analysis of Biomedical Big Data and how to avoid them. The course will include interactive discussions and “Challenge Questions” to help participants actively engage with applying these tools in biomedical scenarios.

This course assumes some previous exposure to linear regression and statistical hypothesis testing.

Outline:

1) Overview of Supervised Learning:

We will define supervised learning, distinguish it from unsupervised learning, and discuss its general purpose in biomedical applications. We will also review linear regression and work through a simple example.

2) High Dimensional Data and the Bias/Variance Trade-off:

We will introduce some high dimensional applications, and discuss what goes wrong if we use low dimensional methods with high dimensional data. We will introduce the idea of bias and variance in this context, and illustrate these ideas with a high dimensional example.
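As a minimal sketch of what can go wrong (an illustration, not taken from the course materials), the pure-Python example below fits least squares with as many features as samples to a response that is pure noise. The function names `solve_linear_system` and `fit_noise`, the sample size, and the random seed are all assumptions made for this example. The fit reproduces the training responses almost exactly even though there is no signal, while the error on fresh data stays large.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def predict(rows, beta):
    return [sum(v * b for v, b in zip(row, beta)) for row in rows]


def mse(y, y_hat):
    return sum((u - v) ** 2 for u, v in zip(y, y_hat)) / len(y)


def fit_noise(n=10, seed=0):
    """Fit n features to n samples of pure noise; return (train MSE, test MSE)."""
    rng = random.Random(seed)

    def draw():
        # Features and response are independent: there is nothing to learn.
        rows = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
        resp = [rng.gauss(0, 1) for _ in range(n)]
        return rows, resp

    x_train, y_train = draw()
    x_test, y_test = draw()
    # With a square design matrix, least squares interpolates the training data.
    beta = solve_linear_system(x_train, y_train)
    return (mse(y_train, predict(x_train, beta)),
            mse(y_test, predict(x_test, beta)))


if __name__ == "__main__":
    train_mse, test_mse = fit_noise()
    print(f"training MSE: {train_mse:.2e}")
    print(f"test MSE:     {test_mse:.2e}")
```

The gap between the two numbers is the variance side of the trade-off: a model flexible enough to fit noise exactly has no bias on the training set but generalizes poorly.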

3) Split-Sample Validation:

We will discuss the use of sample splitting for evaluating model fit, and for tuning bias/variance. We will discuss both single-split validation and cross-validation. We will cover limitations of split-sample validation in the biomedical context (including issues of batch effects). We will also talk about common pitfalls with cross-validation.
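For readers new to the idea, the following is a small sketch of how k-fold cross-validation partitions a data set, written in plain Python. The helper name `k_fold_indices` and its arguments are hypothetical and are not part of the course materials. Each sample lands in exactly one test fold, and the model is refit on the remaining samples each time.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Return k (train_idx, test_idx) pairs that partition range(n_samples)."""
    indices = list(range(n_samples))
    # Shuffle once so that folds do not follow the original sample order.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(10, 3):
        print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

One widely documented pitfall, which may or may not be among those covered in the course, is performing a data-dependent step such as feature screening on the full data set before splitting. To keep the error estimate honest, any such step would be repeated inside each training fold using only `train_idx`.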

4) Regression (in high dimensions):

We will introduce variable selection techniques and penalized methods for regression in high dimensions (including pre-screening, ridge regression and the Lasso).
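To give a feel for how the two penalties behave, the sketch below works through the simplest possible case: a single predictor with no intercept, where both estimators have a closed form. The function names and the scaling of the penalty terms (stated in the docstring) are assumptions made for this illustration, and other texts and software scale the penalty differently. Ridge shrinks the slope toward zero proportionally, while the Lasso subtracts a fixed amount and sets the slope exactly to zero once the penalty is large enough.

```python
def soft_threshold(z, lam):
    """Move z toward zero by lam, returning exactly 0.0 if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def one_predictor_fits(x, y, lam):
    """Closed-form slopes for a single predictor with no intercept.

    ols   minimises 0.5 * sum((y - b*x)^2)
    ridge minimises 0.5 * sum((y - b*x)^2) + 0.5 * lam * b^2
    lasso minimises 0.5 * sum((y - b*x)^2) + lam * |b|
    """
    sxy = sum(u * v for u, v in zip(x, y))
    sxx = sum(u * u for u in x)
    return {
        "ols": sxy / sxx,
        "ridge": sxy / (sxx + lam),
        "lasso": soft_threshold(sxy, lam) / sxx,
    }


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    y = [1.0, 2.0, 3.0]
    for lam in (0.0, 7.0, 20.0):
        print(lam, one_predictor_fits(x, y, lam))
```

With many predictors the Lasso no longer has a closed form in general, but the same soft-thresholding step appears inside coordinate-descent solvers, which is why it tends to produce sparse fits.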

5) Classification (in high dimensions):

We will give an overview of classification and logistic regression. Then we will discuss high dimensional extensions to logistic regression (the same extensions as in section 4). We will also present support vector classifiers, as well as their extension to support vector machines (via the kernel trick). We will compare the logistic regression approach to the SVM approach.

6) Tree-based Methods:

We will introduce classification and regression trees. We will give details on cost-complexity pruning and relate this back to our bias/variance trade-off.

7) Model Aggregation:

We will discuss bagging, random forests, and (gradient) boosting using trees as a base learner.

8) Deep Learning:

We will discuss what exactly neural networks are, when and why they are useful, and how they relate to more classical techniques such as linear and logistic regression. We will discuss the most important neural network architectures (including convolutional and recurrent neural networks) and their application in image and audio analysis. We will also discuss why these tools have been so useful in certain engineering applications and why they may or may not be appropriate in others.

Within each section, there will be “challenge questions”. These will prompt participants to work out how the presented ideas fit together and how they apply to different biomedical applications.

Target Audience: This course targets both a) participants with very limited exposure to machine learning and high dimensional data, who would like an accelerated introduction to ideas and cutting-edge methods in high dimensional statistical learning; and b) participants with some background who wish to better understand the nuances of these tools, especially in biomedical applications.

Learning Outcomes:

By the end of the short course, the participants should:

  1. Understand the bias/variance trade-off and its various applications;
  2. Understand the use of split-sample validation for tuning bias/variance and evaluating performance;
  3. Have some intuition for the various regression/classification methods;
  4. Understand how model aggregation techniques can be applied;
  5. Have some working knowledge of how to apply these tools in common biomedical scenarios;
  6. Understand the main ideas in deep learning, how they relate to classical statistical ideas, and some scenarios where they may be useful.

If you have any questions regarding the workshop, please send your inquiry to Ryan Rosner at biostat.dlsph@utoronto.ca.