Optimization Models on Protein Structure and Function

Ye Tian

Traditional experimental methods for studying protein structure and function are often time-consuming and expensive. Scoring functions are a common computational approach to prediction. This dissertation is concerned with the development and application of computational modeling and optimization in two research topics in bioinformatics. Scoring functions are built using concepts from computational geometry. Two typical model validation techniques are presented: collecting new data to check the model, and cross-validation. First, I present the development of a scoring function for classifying protein tertiary structure in the SCOP database. Proteins are classified into four groups: α, β, α/β and α+β. The scoring function is a metric of the distance between two proteins based on their topological information. Linear programming (LP) is used to train the optimal scoring function. Representative proteins are calculated for the four groups, so that a new protein can be classified by its distance to each of the four representatives. To test the classifier, 385 proteins outside the training data are collected. The accuracies are 80%, 67%, 77% and 36% for the α, β, α/β and α+β groups, respectively. Second, I describe another kind of scoring function, for binary prediction of solubility changes upon mutagenesis. This type of scoring function uses not only protein sequence information but also three-dimensional structural characteristics. More specifically, a model is developed to predict whether a protein's solubility is increased or decreased by mutations. LP techniques are used to derive the optimal weighted scoring function. The values of the weights are taken into consideration to refine the model. The new model is also compared with SVM and LASSO methods. Cross-validation on data from 137 mutants is used to test model performance. The LP model has an overall prediction accuracy of 81%.
Finally, I extend the binary classification model from the second part to a three-class model that includes a third class of mutants with no change in solubility. With an enlarged dataset of 168 mutants, the LP model and LIBSVM with various kernels are used for prediction. LIBSVM with a sigmoid kernel achieves the best prediction accuracy.
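
The nearest-representative classification described in the first study can be illustrated with a minimal sketch. It is not the dissertation's implementation: the feature vectors, the weights and the weighted absolute-difference distance are all hypothetical stand-ins for the LP-trained, topology-based scoring function, and the names `weighted_distance` and `classify` are chosen only for illustration.

```python
from typing import Dict, Sequence


def weighted_distance(a: Sequence[float], b: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Hypothetical stand-in for the trained scoring function:
    a weighted sum of absolute feature differences."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


def classify(features: Sequence[float],
             representatives: Dict[str, Sequence[float]],
             weights: Sequence[float]) -> str:
    """Assign the class whose representative is closest to the new protein."""
    return min(
        representatives,
        key=lambda label: weighted_distance(features, representatives[label], weights),
    )


# Toy, made-up feature vectors for the four class representatives.
representatives = {
    "α": [0.8, 0.1],
    "β": [0.1, 0.8],
    "α/β": [0.5, 0.5],
    "α+β": [0.4, 0.3],
}
weights = [1.0, 1.0]  # assumed; in the dissertation the weights come from LP training

print(classify([0.75, 0.15], representatives, weights))
```

In this sketch, only the distance function would change if a different scoring function were substituted; the classification rule itself stays the same.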
