Proteins constantly interact with one another, and these interactions are central to annotating the biological functions they perform. However, identifying and understanding them remains a challenge because of the complexity of protein-protein interactions and the imbalance between interface and non-interface residues in the data. Overcoming these difficulties requires approaches that simplify imbalanced data or remove redundant information from it.
We propose an outlier-detection-based approach to remove redundancy from protein interface data and apply it to interface prediction. In principle, an outlier is an instance that appears to deviate markedly from the remainder of a given data set, taking into account the class labels of a binary classification. In this work, we use three factors to describe the extent to which each residue vector is an outlier relative to the others: distance from the center vector of the same class label (Dist), probability of the class label (PCL), and importance of within-class and between-class (IWB). Outlier scores integrating the three factors are computed, instances with larger scores are regarded as outliers and removed, and the data set without outliers is then input into a support vector machine (SVM) ensemble to identify protein interface residues. As expected, the SVM ensemble trained without outliers consistently performs better than the original SVM ensemble. Interestingly, some outlier interface residues lie close to non-interface regions and, similarly, some outlier non-interface residues lie close to interface regions.
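The score-and-remove step can be illustrated with a minimal sketch. It uses only the Dist factor, that is, the Euclidean distance of each residue vector from the center (mean) vector of its own class. PCL and IWB are not defined in this summary and are therefore omitted, as is the way the three factors are integrated. The function names, the per-class removal rule, and the `fraction` parameter are illustrative assumptions, not the method's actual settings.

```python
import math
from collections import defaultdict


def class_centers(X, y):
    """Return the mean (center) vector of each class label."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


def dist_scores(X, y):
    """Dist factor only: distance of each vector from its own class center."""
    centers = class_centers(X, y)
    return [math.dist(row, centers[label]) for row, label in zip(X, y)]


def remove_outliers(X, y, fraction=0.1):
    """Drop the highest-scoring `fraction` of instances in each class.

    Assumption: removal is done per class with a fixed fraction; the
    actual thresholding rule is not specified in the text above.
    """
    scores = dist_scores(X, y)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    keep = []
    for idxs in by_class.values():
        ranked = sorted(idxs, key=lambda i: scores[i])
        n_drop = int(len(idxs) * fraction)
        keep.extend(ranked[:len(idxs) - n_drop])
    keep.sort()  # preserve the original instance order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: label 1 = interface residue, label 0 = non-interface residue.
X = [[0, 0], [0, 1], [1, 0], [10, 10],
     [5, 5], [5, 6], [6, 5], [-20, -20]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
X_clean, y_clean = remove_outliers(X, y, fraction=0.25)
```

In this toy example the vectors farthest from their own class centers are dropped, and the cleaned set (`X_clean`, `y_clean`) is what would then be passed to a classifier such as an SVM ensemble.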