L+M-24: Building a Dataset for Language + Molecules @ ACL 2024
arXiv (2024)
Abstract
Language-molecule models have emerged as an exciting direction for molecular
discovery and understanding. However, training these models is challenging due
to the scarcity of molecule-language pair datasets. The datasets released to
date fall into three groups: 1) small and scraped from existing databases, 2)
large but noisy and constructed by performing entity linking on the scientific
literature, and 3) built by converting property prediction datasets to natural
language using templates. In this document, we detail the L+M-24
dataset, which has been created for the Language + Molecules Workshop shared
task at ACL 2024. In particular, L+M-24 is designed to focus on
three key benefits of natural language in molecule design: compositionality,
functionality, and abstraction.