Statistical Learning in Computational Biology
Dr. Nico Pfeifer
Recent advances in high-throughput technologies have led to an exponential increase in biological data (such as genomic, epigenomic and proteomic data). Finding meaningful insights in such large data collections requires efficient statistical learning methods. In January 2013, the junior group "Statistical Learning in Computational Biology" was established at the Department of Computational Biology and Applied Algorithmics of the MPI for Informatics. The group develops and applies new machine learning / statistical learning methods to solve computational biology problems and answer new biological questions, with an emphasis on epigenomic, immunomic and genomic data.
Application areas include the study of viruses such as HIV, Hepatitis C and Influenza, as well as the field of epigenetics. Methodologically, the focus is on
- integration of heterogeneous data sets
- improving interpretability of non-linear estimators
- efficient learning methods for large data sets
With about two million new HIV infections per year and about 35 million people living with HIV worldwide, the virus remains a major threat to mankind. Two areas are of particular importance:
- research towards a vaccine against HIV
- personalized HIV treatment
We are conducting research in both of these areas. Examples include modeling the adaptation of HIV in response to pressure exerted by the immune system, building better and more interpretable predictors of HIV coreceptor usage and CCR5 antagonist resistance, and analyzing potent, broadly neutralizing HIV-1 antibodies.
We are also working on methods that cope better with noisy data. One application scenario is the analysis of molecular measurements from cancer samples, where many effects (e.g., batch effects) can introduce biases. Standard approaches are applicable if one builds a prediction tool under the assumption that future data will closely resemble the training data. Unfortunately, this assumption often does not hold. Therefore, we introduced a method that estimates certain differences between the underlying distributions of the training data and the test data and corrects for them in the final prediction method. Furthermore, we provided interpretable results that can be used to understand the underlying causes of the predicted label. Additional analyses will show how far the methods can be extended.
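To make the idea of correcting for differences between training and test distributions concrete, the following is a minimal sketch of one very simple correction: per-feature location-scale alignment of a test batch to the training data. It is an illustrative stand-in, not the group's method. The function names and the toy data are hypothetical, and the approach assumes that the two batches differ only by a per-feature shift and scaling and have a similar sample composition.

```python
from statistics import mean, pstdev

def fit_location_scale(rows):
    """Return per-feature (mean, std) for a list of equal-length feature rows."""
    columns = list(zip(*rows))
    # Fall back to a scale of 1.0 for constant features to avoid division by zero.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]

def align_to_training(test_rows, train_stats, test_stats):
    """Map test features onto the training location and scale, feature by feature."""
    aligned = []
    for row in test_rows:
        aligned.append([
            (x - t_mu) / t_sd * tr_sd + tr_mu
            for x, (tr_mu, tr_sd), (t_mu, t_sd) in zip(row, train_stats, test_stats)
        ])
    return aligned

# Hypothetical toy data: in the test batch the first feature is scaled by 2
# and shifted by +5 relative to training; the second feature is unaffected.
train = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
test = [[7.0, 10.0], [9.0, 12.0], [11.0, 14.0]]

train_stats = fit_location_scale(train)
test_stats = fit_location_scale(test)
corrected = align_to_training(test, train_stats, test_stats)
```

A predictor trained on `train` would then be applied to `corrected` rather than to `test`. In real data the batch difference is rarely a pure shift and scaling, which is why more careful estimation of the distribution differences is needed.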
Another area of great interest to us is how best to integrate different types of measurements (e.g., gene expression, DNA methylation, copy number variation).
Additionally, we are interested in developing methods for the analysis of open chromatin regions as well as the three-dimensional organization of chromosomes.