GuideWalk – Heterogeneous Data Fusion for Enhanced Learning – A Multiclass Document Classification Case
CoRR (2024)
Abstract
A central problem in computer science and machine learning is extracting
information efficiently from large-scale, heterogeneous data. Text data, with
its syntax, semantics, and even hidden information content, holds a special
place among these data types. Processing text data requires embedding, a
method of translating the content of the text into numeric vectors. A suitable
embedding algorithm is the starting point for recovering the full information
content of the text. This work proposes a new embedding method based on the
graph structure of meaningful sentences. The algorithm is designed to
construct an embedding vector that captures the syntactic and semantic
elements as well as the hidden content of the text. The proposed embedding
method is evaluated on classification problems. Among the wide range of
application areas, text classification is the best laboratory for embedding
methods: the classification power of a method can be tested using
dimensionality reduction without any further processing. Furthermore, the
method can be compared with different embedding algorithms and machine
learning methods. The proposed method is tested on real-world datasets and
compared against eight well-known and successful embedding algorithms. It
achieves significantly better classification performance on binary and
multiclass datasets than these well-known algorithms.
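
To make the idea of turning text into numeric vectors through a sentence graph more concrete, here is a minimal Python sketch. It is not the GuideWalk algorithm, whose details are not given in this abstract. It is a generic illustration that assumes a word-adjacency graph, uniform random walks, and visit-frequency vectors. All function names and parameters (`build_word_graph`, `random_walk`, `embed_document`, `walks_per_word`, `walk_length`) are hypothetical.

```python
import random
from collections import defaultdict


def build_word_graph(sentences):
    """Undirected graph linking words that are adjacent within a sentence."""
    graph = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for left, right in zip(tokens, tokens[1:]):
            if left != right:
                graph[left].add(right)
                graph[right].add(left)
    return graph


def random_walk(graph, start, length, rng):
    """Uniform random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = sorted(graph[walk[-1]])  # sorted for reproducibility
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


def embed_document(sentences, vocabulary, walks_per_word=5, walk_length=4, seed=0):
    """Illustrative embedding (an assumption, not the paper's method):
    the fraction of random-walk visits that land on each vocabulary word."""
    rng = random.Random(seed)
    graph = build_word_graph(sentences)
    counts = dict.fromkeys(vocabulary, 0)
    for word in sorted(graph):
        for _ in range(walks_per_word):
            for visited in random_walk(graph, word, walk_length, rng):
                if visited in counts:
                    counts[visited] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocabulary]


if __name__ == "__main__":
    vocabulary = ["graph", "walk", "text"]
    document = ["graph walk text", "walk graph"]
    print(embed_document(document, vocabulary))
```

Vectors of this kind could then be passed to any standard classifier. The fixed seed keeps the walks reproducible, so the same document always maps to the same vector.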