Statistical methods for feature extraction in shape analysis and bioinformatics (2010)

Abstract
Feature extraction aims to explain the underlying phenomena of interest in a given set of input data by reducing the amount of resources required to describe it accurately. The term remains very broad: it covers many different objectives and encompasses multiple types of techniques, methods and processes. The work contained in this thesis explores two types of feature extraction from two different domains, namely 3D shape analysis and bioinformatics. The objective of both projects is to detect and understand the relevant information in a noise-corrupted data set. However, the two processes differ significantly, as one aims to compress and smooth signals while the other consists of clustering data. In the first part of this thesis, a method for shape representation, compression and smoothing is proposed. First, it is shown that, similarly to spherical shapes, triangulated genus-one surfaces can be encoded using second-generation wavelet decomposition. Next, a novel model is proposed for wavelet-based surface compression and smoothing. This part of the work aims to develop an efficient and robust process for eliminating irrelevant and noise-corrupted parts of the shape signal. Surfaces are encoded using wavelet filtering, and the objective of the proposed methodology is to separate noise-like wavelet coefficients from those contributing to the relevant part of the signal. The technique developed in this thesis consists of adaptively thresholding coefficients using a data-driven Bayesian framework. Once thresholding is performed, the coefficients that have been identified as irrelevant are removed and the inverse wavelet transform is applied to the “clean” set of wavelet coefficients. Experimental results show the efficiency of the proposed technique for surface smoothing and compression.

The second part of this thesis proposes a statistical model for studying RNA (ribonucleic acid) spatial conformations. The functional diversity of the RNA molecule depends on the ability of the RNA polymer to fold into a large number of precisely defined spatial forms. Therefore, one of the main challenges of bioinformatics is to establish a clearer understanding of the structure/function relationships in these molecules. If the functionality of a specific substructure (or unit block) from a given part of an RNA strand is known, then similar substructures are assumed to have similar functionality. It is therefore important to find an efficient way to classify the unit blocks of the RNA molecule. Each type of substructure can be geometrically characterized by a set of d parameters, which define the spatial arrangement of its constituents. Thus, a set of substructures from the same family can be represented as a point cloud in a d-dimensional data space. A similarity measure can then be defined to perform clustering on this data set and classify the corresponding substructures into a limited number of groups. In the proposed work, a statistical clustering model is applied to this RNA structure classification problem. First, single-nucleotide structures are classified with respect to their spatial configurations. Application of the method to various data sets validates the process, and further analysis is conducted to compare the results with other classifications. Second, the same clustering scheme is applied to base doublet geometries (base pairs and base stacking). These conformations yield more complex and challenging data sets. The proposed clustering results bring new features into the existing classification schemes.
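The idea of treating substructures as a point cloud in a d-dimensional space and grouping them by a similarity measure can be illustrated with a minimal sketch. The abstract does not specify the thesis's statistical clustering model, so plain k-means with Euclidean distance is used here purely as a stand-in. The data are synthetic, and every name (`kmeans`, `farthest_point_init`, `make_blob`) is hypothetical. If the d parameters were angles, a periodic distance would likely be needed instead.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two d-dimensional points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic initialisation: start from the first point, then
    repeatedly add the point farthest from the centres chosen so far."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centres))
        )
    return centres


def kmeans(points, k, max_iter=100):
    """Plain k-means (an assumed stand-in, not the thesis's model).
    Returns one cluster label per point and the final centres."""
    centres = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centres


def make_blob(rng, centre, n, spread=0.1):
    """Synthetic stand-in for one family of conformations (d = len(centre))."""
    return [tuple(rng.gauss(c, spread) for c in centre) for _ in range(n)]


rng = random.Random(0)
# Two synthetic "conformation families" in a d = 3 parameter space.
points = make_blob(rng, (0.0, 0.0, 0.0), 20) + make_blob(rng, (5.0, 5.0, 5.0), 20)
labels, centres = kmeans(points, k=2)
print(labels)
```

A model-based method (for example a mixture model) would additionally give each point a probability of belonging to each group, which may be closer in spirit to a statistical clustering model than the hard assignments shown here.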
Keywords
input data, feature extraction, unit block, d-dimensional data space, RNA molecule, shape analysis, various data set, noise corrupted data, Statistical method, RNA polymer, clustering data, challenging data set