Improving Chinese Character Representation with Formation Tree
CoRR (2024)
Abstract
Learning effective representations for Chinese characters presents unique
challenges, primarily due to the vast number of characters and their continuous
growth, which requires models to handle an expanding category space.
Additionally, the inherent sparsity of character usage complicates the
generalization of learned representations. Prior research has explored
radical-based sequences to overcome these issues, achieving progress in
recognizing unseen characters. However, these approaches fail to fully exploit
the inherent tree structure of such sequences. To address these limitations and
leverage established data properties, we propose Formation Tree-CLIP (FT-CLIP).
This model utilizes formation trees to represent characters and incorporates a
dedicated tree encoder, significantly improving performance in both seen and
unseen character recognition tasks. We further introduce masking of both
character images and tree nodes, enabling efficient and effective training.
This approach accelerates training significantly (by a factor of 2 or more)
while enhancing accuracy. Extensive experiments show that processing characters
through formation trees aligns better with their inherent properties than
direct sequential methods, significantly enhancing the generality and usability
of the representations.
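The formation-tree idea can be illustrated with a small sketch (not the paper's implementation): a radical-based character sequence written in prefix Ideographic Description Sequence (IDS) form is parsed into a tree, where description characters such as ⿰ (left-right) and ⿱ (top-bottom) become internal nodes and radicals become leaves. The parser below and the example character 想 (⿱⿰木目心) are illustrative assumptions.

```python
# Minimal sketch: parse a prefix IDS token list into a nested-tuple
# formation tree. Binary ideographic description characters (IDCs)
# take two subtrees; ternary IDCs take three; anything else is a leaf.

BINARY_IDCS = {"⿰", "⿱", "⿴", "⿵", "⿶", "⿷", "⿸", "⿹", "⿺", "⿻"}
TERNARY_IDCS = {"⿲", "⿳"}

def parse_ids(tokens):
    """Recursively consume tokens, returning (tree, remaining_tokens)."""
    head, rest = tokens[0], tokens[1:]
    if head in BINARY_IDCS:
        left, rest = parse_ids(rest)
        right, rest = parse_ids(rest)
        return (head, left, right), rest
    if head in TERNARY_IDCS:
        a, rest = parse_ids(rest)
        b, rest = parse_ids(rest)
        c, rest = parse_ids(rest)
        return (head, a, b, c), rest
    return head, rest  # a radical/component is a leaf

# 想 decomposes as ⿱相心, and 相 as ⿰木目:
tree, _ = parse_ids(list("⿱⿰木目心"))
# tree == ("⿱", ("⿰", "木", "目"), "心")
```

Unlike a flat radical sequence, this nested form makes the compositional structure explicit, which is the property a dedicated tree encoder can exploit.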