Text clustering with LLM embeddings
CoRR (2024)
Abstract
Text clustering is an important approach for organising the growing amount of
digital content, helping to structure and find hidden patterns in uncategorised
data. However, the effectiveness of text clustering heavily relies on the
choice of textual embeddings and clustering algorithms. We argue that recent
advances in large language models (LLMs) can potentially improve this task. In
this research, we investigated how different textual embeddings – particularly
those used in LLMs – and clustering algorithms affect how text datasets are
clustered. A series of experiments was conducted to assess how embeddings
influence clustering results, and what role is played by dimensionality
reduction through summarisation and by model size adjustment. Findings reveal that LLM
embeddings excel at capturing subtleties in structured language, while BERT
leads the lightweight options in performance. In addition, we observe that
increasing model dimensionality and employing summarisation techniques do not
consistently lead to improvements in clustering efficiency, suggesting that
these strategies require careful analysis before being used in real-life models. These
results highlight a complex balance between the need for refined text
representation and computational feasibility in text clustering applications.
This study extends traditional text clustering frameworks by incorporating
embeddings from LLMs, providing a path for improved methodologies, while
informing new avenues for future research in various types of textual analysis.
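
The embed-then-cluster pipeline described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the clustering algorithm or the embedding models used, so plain k-means stands in for the former and a toy bag-of-words function (`embed_bow`, a hypothetical name) stands in for the latter. In practice, any function that maps a list of texts to a list of vectors, such as a BERT or LLM embedding model, could be passed as `embed` instead.

```python
import math
from collections import Counter


def embed_bow(texts):
    """Toy stand-in for an embedding model (an assumption, not the paper's
    method): L2-normalised bag-of-words count vectors."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in texts:
        vec = [0.0] * len(vocab)
        for word, count in Counter(t.lower().split()).items():
            vec[index[word]] = float(count)
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def kmeans(vectors, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Start from the first vector, then repeatedly add the vector that is
    # farthest from all centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(dist2(v, c) for c in centroids))
        )

    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: dist2(v, centroids[j])) for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_texts(texts, k, embed=embed_bow):
    """Embed the texts, then cluster the resulting vectors."""
    return kmeans(embed(texts), k)


texts = [
    "the cat sat on the mat",
    "a cat and a kitten sat together",
    "stock markets fell sharply today",
    "markets and stock prices fell again",
]
labels = cluster_texts(texts, k=2)
print(labels)
```

Keeping the embedding step behind a single `embed` argument mirrors the comparison the study makes: the clustering code stays fixed while the text representation is swapped.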