CharSS: Character-Level Transformer Model for Sanskrit Word Segmentation
arxiv(2024)
Abstract
Subword tokens in Indian languages inherently carry meaning, and isolating
them can enhance NLP tasks, making subword segmentation a crucial process.
Segmenting Sanskrit and other Indian languages into subword tokens is not
straightforward, because sandhi can alter the characters at word
boundaries. We propose a new approach that uses a Character-level
Transformer model for Sanskrit Word Segmentation (CharSS). We perform
experiments on three benchmark datasets to compare the performance of our
method against existing methods. On the UoH+SandhiKosh dataset, our method
outperforms the current state-of-the-art system by an absolute gain of 6.72
points in split prediction accuracy. On the hackathon dataset, our method
achieves a gain of 2.27 points over the current state-of-the-art system on the
perfect match metric. We also propose a use case of Sanskrit-based segments for
a linguistically informed translation of technical terms into lexically similar
low-resource Indian languages. In two separate experimental settings for this
task, we achieve average improvements of 8.46 and 6.79 chrF++ points,
respectively.
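
To make the character-level framing concrete, the following is a minimal sketch and not the authors' implementation. It assumes segmentation is posed as mapping a joined character sequence to a segmented one, with "+" as a hypothetical split marker, and it treats an exact string match as a stand-in for a perfect-match style metric. The toy word pairs, the function names, and the special tokens are all illustrative assumptions.

```python
# Hypothetical toy pairs: (joined form, segmented form with "+" marking the split).
# Sandhi changes characters at the boundary, so the target is not just the
# source with a separator inserted.
PAIRS = [
    ("vidyālayaḥ", "vidyā+ālayaḥ"),
    ("sūryodayaḥ", "sūrya+udayaḥ"),
]

SPECIALS = ["<pad>", "<bos>", "<eos>"]  # assumed special tokens


def build_vocab(pairs):
    """Map every character seen in the sources and targets to an integer id."""
    chars = sorted({ch for src, tgt in pairs for ch in src + tgt})
    return {tok: i for i, tok in enumerate(SPECIALS + chars)}


def encode(text, vocab):
    """Turn a string into a list of character ids with begin/end markers."""
    return [vocab["<bos>"]] + [vocab[ch] for ch in text] + [vocab["<eos>"]]


def decode(ids, vocab):
    """Invert encode(), dropping the special tokens."""
    inv = {i: tok for tok, i in vocab.items()}
    return "".join(inv[i] for i in ids if inv[i] not in SPECIALS)


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal the reference segmentation exactly."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


vocab = build_vocab(PAIRS)
source_ids = encode("vidyālayaḥ", vocab)   # input to a sequence-to-sequence model
target_ids = encode("vidyā+ālayaḥ", vocab)  # sequence the model would learn to emit
```

In this sketch the target is longer than the source because the merged vowel is expanded back into two vowels, which is why a character-level model is a natural fit for sandhi splitting compared with a fixed subword vocabulary.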