Our lab works in bioinformatics and computational biology, using the Python and R programming languages.
We use Python for machine learning (scikit-learn and the Keras deep-learning framework), visualization (matplotlib), and general data processing, such as handling biological sequence and structure data (Biopython).
We use R with Bioconductor to analyse high-throughput ‘-omics’ data such as gene expression data.
These projects, except for the last project, are most suitable if you intend to study the COMPM058 Bioinformatics module during Term 2 or have already taken a bioinformatics module in your 3rd year (International Programme students, for example). For the machine learning projects, you would ideally be studying a machine learning module (3rd year COMP3058 Artificial Intelligence and Neural Computing, or one of the 4th year Machine Learning modules).
Skills developed: bioinformatics, algorithm design, R and C programming.
The medical and bioscience communities are currently creating very large ‘-omics’ datasets for diseases such as cancer; examples in cancer research include the Cancer Cell Line Encyclopedia (CCLE) and The Cancer Genome Atlas (TCGA). One of the challenges with these huge datasets is to “mine” information and patterns that may be useful for disease diagnosis or prognosis. We have recently published a biclustering (unsupervised learning) method called MCbiclust that can detect patterns within these large genomic datasets (Nucleic Acids Research, Volume 45, Issue 15, 6 September 2017, Pages 8712–8730).
The method is available as a Bioconductor R package. This first project involves developing a new, more computationally efficient version, since the current method is quite CPU intensive and often requires the Legion supercomputer for analysis.
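As a rough illustration of the kind of pattern a correlation-based biclustering method looks for, the sketch below scores how strongly a set of genes co-varies over a chosen subset of samples. It is not the MCbiclust implementation: the function names, the toy expression matrix, and the choice of mean absolute Pearson correlation as the score are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def correlation_score(expr, genes, samples):
    """Mean absolute pairwise correlation of `genes` over `samples`."""
    profiles = [[expr[g][s] for s in samples] for g in genes]
    pairs = list(combinations(profiles, 2))
    return sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)


# Toy expression matrix (hypothetical values): one row per gene,
# one column per sample, samples indexed 0..5.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 2.0, 2.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 1.0, 9.0],
    "geneC": [8.0, 6.0, 4.0, 2.0, 7.0, 7.0],
}
genes = ["geneA", "geneB", "geneC"]

# The genes are tightly (anti-)correlated over samples 0-3 only.
score_subset = correlation_score(expr, genes, samples=[0, 1, 2, 3])
score_all = correlation_score(expr, genes, samples=range(6))
print(f"subset score: {score_subset:.2f}, all samples: {score_all:.2f}")
```

In this toy case the score over the first four samples is higher than the score over all six, which is the sense in which a bicluster is a pattern confined to a subset of genes and a subset of samples. Evaluating such a score repeatedly over many candidate subsets is one plausible reason a search of this kind becomes CPU intensive.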
So the aims of this project area are:
This second project also involves the further development of the MCbiclust method that is outlined in the description of Project 1.
Currently the MCbiclust method is only accessible to users who can develop scripts in the R programming language. This greatly reduces the number of bioscience researchers who can employ the method. The key aim of this project is to make this new biclustering method available as a web application, thus making it accessible to the full bioscience community.
Some of the key aims of the project are:
Supervised machine learning is extensively applied in the field of bioinformatics for all sorts of prediction tasks: predicting structural aspects of proteins from only their sequence, predicting protein function from sequence, and predicting cancer subtypes from gene expression data.
This project focusses on one particular structural aspect of proteins called intrinsic disorder. Normally a protein forms one stable “native” fold, which gives the protein its function. Intrinsically disordered regions of proteins do not do this; instead they jump between many different conformations. This allows certain proteins to have some unusual functions, such as the “entropic springs” that are employed within muscle and spider silk. Intrinsic disorder has also been implicated in a number of diseases.
The DisProt database holds information about protein disorder, in which the different types of disorder have been characterized in terms of an ontology. The key idea for this project is not only to predict disordered regions within proteins, but also to predict the ontology terms for these disordered regions, which is essentially a multi-class prediction problem.
The project would involve the complete development of a bioinformatics web application: from processing of raw DisProt and PDB structural data; to the application of scikit-learn for machine learning; to the development of a Django web application so that medical and bioscience users can employ your newly developed method.
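One common way to set up per-residue prediction of this kind is to describe each residue by a window of its sequence neighbours. The sketch below shows that feature-encoding step only, under stated assumptions: the window size, the one-hot encoding, the example sequence, and the class labels are all hypothetical and are not taken from DisProt or its ontology.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue):
    """Encode one residue as a 20-element 0/1 vector (all zeros if unknown)."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]


def window_features(sequence, half_width=2):
    """Return one feature row per residue, built from its sequence window.

    Positions that fall off either end of the sequence are padded with '-',
    which one_hot() encodes as all zeros.
    """
    padded = "-" * half_width + sequence + "-" * half_width
    rows = []
    for i in range(len(sequence)):
        window = padded[i : i + 2 * half_width + 1]
        row = []
        for residue in window:
            row.extend(one_hot(residue))
        rows.append(row)
    return rows


# Hypothetical annotated example: one made-up class label per residue.
sequence = "MSEEKPKS"
labels = ["ordered", "ordered", "linker", "linker",
          "linker", "binding", "binding", "binding"]

X = window_features(sequence, half_width=2)  # 8 rows of 5 * 20 features
y = labels
print(len(X), len(X[0]))
```

Rows and labels in this shape could then be passed to a scikit-learn classifier through its usual `fit(X, y)` interface, for example a `RandomForestClassifier`, which handles more than two classes directly.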
The Polymerase Chain Reaction (PCR) revolutionized molecular biology and won Kary Mullis the 1993 Nobel Prize in Chemistry. It allows minute quantities of specific types of DNA to be amplified and analysed. It has a wide range of applications, from DNA fingerprinting for forensics to helping in the sequencing of the human genome. In collaboration with the Royal Free hospital, we have published a new approach to help identify different types of infectious organisms using PCR primers that have been computationally determined using a machine learning decision tree approach (J. Clin. Microbiol. July 2012 vol. 50 no. 7 2419–2427).
This project intends to make these techniques much more widely available to the experimental and clinical community by developing an interactive web application that allows the optimal design of PCR primers that satisfy complex criteria across different phylogenetic species.
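To give a flavour of what a primer criterion can look like in code, the sketch below checks a candidate primer against three basic properties: length, GC content, and an approximate melting temperature from the Wallace rule (2 °C per A/T base plus 4 °C per G/C base). The thresholds, function names, and example sequences are assumptions for illustration and are not the criteria used in the cited paper.

```python
def gc_content(primer):
    """Fraction of bases in the primer that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature in degrees C using the Wallace rule.

    Tm = 2 * (A + T) + 4 * (G + C); a rough estimate for short oligos only.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def passes_basic_criteria(primer, min_len=18, max_len=25,
                          min_gc=0.40, max_gc=0.60,
                          min_tm=50, max_tm=65):
    """Check length, GC content and Tm against illustrative thresholds."""
    return (min_len <= len(primer) <= max_len
            and min_gc <= gc_content(primer) <= max_gc
            and min_tm <= wallace_tm(primer) <= max_tm)


# Made-up candidate sequences, not real primers for any organism.
good = "AGCTGACCTGAAGCTCATGA"
bad = "ATATATATAT"

for candidate in (good, bad):
    print(candidate, gc_content(candidate), wallace_tm(candidate),
          passes_basic_criteria(candidate))
```

A real design tool would add further checks, such as specificity across the target species, which is where the complexity described above comes from.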
Skills developed: experience with embedded hardware; networking; convolutional neural networks (CNNs) for image recognition; Python and C programming.
In particular, the aim would be to develop distributed sensors (MikroBUS cameras, PIR sensors) on PIC32 microcontroller “Clicker” boards that communicate using 6LoWPAN with a central Ci40 hub board. The central Ci40 hub would orchestrate how the sensors are employed to capture images when movement is detected, for example the movement of different types of animals in the wild. The Ci40 hub would then employ convolutional neural networks (CNNs) for image recognition and communicate up-to-date counts of the different animals to a central web application via a long-distance, low-power LoRa network.
Such a system could be used for animal monitoring in a remote location where only battery power is available and there is no mobile network coverage.