The format of the projects is quite flexible. I foresee three broad types of work:
For all three types of work, I would like to see a review of the literature, sample data and a prototype implementation (where applicable). The main difference between each type of work will be the relative importance of each of the components.
Teams will consist of 1 or 2 members. Larger teams are possible but will have to produce proportionally more work! Complementary work between teams is also welcome, i.e. two or more teams working on related but complementary topics, leading to a more realistic application.
The project is worth 50% of your final mark. Marking will be based on the outline, a written report, and a short presentation in class (10 minutes). Reports should be detailed enough that the approach could be implemented on the basis of the text alone. Having said that, you should also make every effort to keep the report concise. Assuming a team of size 2, a 10–15 page report should be appropriate. Suggested structure for the reports:
Some of the projects require a fairly good background in statistics; I have annotated them with the letter S. Others may require more advanced knowledge of biology. Given the large number of projects, there should be something for everyone. (A = application development, E = experiment, S = statistics, B = biology). You are also most welcome to propose new projects.
RNA molecules are demonstrating a surprising breadth of biological functions. The repertoire of known non-protein-coding RNAs (ncRNAs) has grown extremely rapidly. Through all those discoveries, a new understanding of gene expression regulation is emerging.
Herein, we take an important step to help better understand the cellular roles of RNA molecules by predicting their subcellular localization. The advent of widely available frameworks for machine learning and deep learning (e.g., scikit-learn and TensorFlow), as well as the recent release of RNALocate, a database for RNA subcellular localization, are making this project possible and timely.
Specifically, the project aims to answer the following research question: Can the subcellular localization of RNA molecules be predicted in silico from sequence information only?
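To make "sequence information only" concrete, here is a minimal sketch of one common way to turn an RNA sequence into numeric input for a neural network: one-hot encoding with zero padding. The function name `one_hot`, the fixed `length` parameter, and the choice to map unknown bases to all-zero rows are assumptions made for illustration; the project itself does not prescribe an encoding.

```python
# Illustrative sketch only: the encoding scheme and names are assumptions.
ALPHABET = "ACGU"

def one_hot(seq, length):
    """Encode seq as a list of `length` rows, each a 4-element one-hot vector.

    DNA-style T is treated as U. Unknown bases (e.g. N) and padding positions
    are encoded as all-zero rows. Sequences longer than `length` are truncated.
    """
    seq = seq.upper().replace("T", "U")[:length]
    rows = []
    for base in seq:
        row = [0] * len(ALPHABET)
        idx = ALPHABET.find(base)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    # Pad short sequences so every example has the same shape.
    rows.extend([0] * len(ALPHABET) for _ in range(length - len(seq)))
    return rows

encoded = one_hot("ACGUN", 6)
```

Fixed-length inputs like this suit convolutional or fully connected layers; recurrent architectures could instead take variable-length sequences, which is one of the design choices the experiment could compare.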
Artificial Intelligence has become a hot topic again. Through the groundbreaking work of Geoffrey Hinton (University of Toronto) and Yoshua Bengio (Université de Montréal), Canada is now playing a leading role in machine learning. This work has found applications in research, but also in industry. So much so that all the leading high-tech companies (Google, Facebook, Microsoft, Thales, etc.) now have research laboratories in Canada.
Throughout this project, the student will learn skills that are in high demand in industry, locally, nationally and worldwide. They will develop skills in preparing data for machine learning experiments. They will carefully design the research protocol to validate the results. They will evaluate the ability of different frameworks and deep neural network architectures to predict the localization of RNA molecules from sequence information only.
The project will consist of the following steps:
1) Review the literature on deep learning in bioinformatics (2 weeks).
2) Prepare the data (3 weeks): write parsers to extract the sequence information from the RNALocate database, and remove redundancy from the data.
3) Design the cross-validation protocol (2 weeks).
4) Carry out the experiment (3 weeks).
5) Analyze the results (2 weeks): write the abstract and create the poster presenting this work.
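The data preparation and cross-validation steps can be sketched as follows. This is only an illustration under stated assumptions: records are taken to be `(sequence, label)` pairs already parsed from the database, redundancy is reduced by dropping exact duplicates only (a real protocol would more likely cluster sequences by similarity), and the labels `"nucleus"` and `"cytoplasm"` are hypothetical examples. All function names are made up for this sketch.

```python
import random
from collections import defaultdict

def deduplicate(records):
    """Keep the first record for each distinct sequence (exact matches only)."""
    seen = set()
    unique = []
    for seq, label in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((seq, label))
    return unique

def stratified_folds(records, k, seed=0):
    """Assign record indices to k folds, keeping label proportions roughly equal."""
    by_label = defaultdict(list)
    for i, (_, label) in enumerate(records):
        by_label[label].append(i)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)
    return folds

def train_test_splits(folds):
    """Yield one (train_indices, test_indices) pair per fold."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical toy data; real records would come from the parsed database.
records = [
    ("ACGU", "nucleus"), ("acgu", "nucleus"), ("GGCC", "cytoplasm"),
    ("UUAA", "nucleus"), ("CCGG", "cytoplasm"),
]
unique = deduplicate(records)
folds = stratified_folds(unique, k=2)
splits = list(train_test_splits(folds))
```

Removing redundancy before splitting matters because near-identical sequences that land in both the training and test folds would inflate the measured accuracy.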
This project is similar in nature to the one above except for the data set. A machine learning approach will be used for the analysis of RNA sequences packaged in exosomes, which are cell-derived vesicles.
See Jurtz, V. I. et al. An introduction to deep learning on biological sequence data: examples and solutions. Bioinformatics 33, 3685–3690 (2017).
Seed is a computer program that takes as input a set of unaligned RNA sequences and produces a set of secondary structure motifs. Suffix arrays are used to enumerate complementary regions, possibly containing interior loops, as well as for matching RNA secondary structure expressions.
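As a rough illustration of the idea, and not Seed's actual algorithm, the sketch below builds a suffix array naively and uses binary search to find where the reverse complement of each k-mer occurs downstream, which gives candidate stem arms. It assumes sequences over A, C, G, U only, considers Watson-Crick pairs only (no G-U wobble), and ignores interior loops and minimum hairpin loop length. All function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern, via binary search on the suffix array."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Reverse complement of an RNA string; assumes only A, C, G, U."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def complementary_regions(text, k):
    """Return (i, j) pairs where text[i:i+k] can pair with text[j:j+k], j >= i + k."""
    sa = build_suffix_array(text)
    pairs = []
    for i in range(len(text) - k + 1):
        target = reverse_complement(text[i:i + k])
        for j in find_occurrences(text, sa, target):
            if j >= i + k:  # the 3' arm must not overlap the 5' arm
                pairs.append((i, j))
    return pairs

stems = complementary_regions("GGGAAAACCC", 3)
```

The naive construction is quadratic or worse in the sequence length; it is used here only to keep the example short, and a production tool would use a linear-time or O(n log n) construction.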
Seed has several criteria to rank the motifs that it produces: minimum description length, information theory, and free energy.
In order to enable one of our future research directions, it would be interesting to see if deep learning can be used to rank the motifs produced by Seed.
The project consists of: developing a deep neural network architecture, using information from Rfam to train the model, using a rigorous cross-validation strategy, and comparing the accuracy of deep learning for classifying motifs against the existing criteria used by Seed.
The BioCatalogue (www.biocatalogue.org) is a curated repertoire of Web services.
Consult the web sites of the major conferences in bioinformatics; many have their proceedings online.
Consult the sites of the major journals. Several journals allow free access based on your IP address, so you must use a UofO computer or a proxy account.