Towards Fully-Automated Materials Discovery via Large-Scale Synthesis Dataset and Expert-Level LLM-as-a-Judge (open access)
- Authors
- Kim, Heegyu; Jeon, Taeyang; Choi, Seungtaek; Hong, Ji-hoon; Jeon, Dong-won; Baek, Ga-yeon; Kwak, Gyeong-won; Lee, Dong-hee; Bae, Jisu; Lee, Chi-hoon; Kim, Yoon-seo; Choi, Seon-Jin; Park, Jin-seong; Cho, Sung-beom; Cho, Hyunsouk
- Issue Date
- Nov-2025
- Publisher
- Association for Computing Machinery, Inc
- Keywords
- benchmark; dataset; human evaluation; large language model; llm-as-a-judge; materials science
- Citation
- CIKM 2025 - Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp 1302 - 1312
- Pages
- 11
- Indexed
- SCOPUS
- Journal Title
- CIKM 2025 - Proceedings of the 34th ACM International Conference on Information and Knowledge Management
- Start Page
- 1302
- End Page
- 1312
- URI
- https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209906
- DOI
- 10.1145/3746252.3761359
- Abstract
- Materials synthesis remains a critical bottleneck in developing innovations for energy storage, catalysis, electronics, and biomedical devices. Current synthesis design relies heavily on empirical trial-and-error methods guided by expert intuition, limiting the pace of materials discovery. To address this challenge, we present AlchemyBench, a comprehensive benchmark built upon a curated dataset of 17,667 expert-verified synthesis recipes from open-access literature. AlchemyBench provides an end-to-end framework that supports research in large language models (LLMs) applied to materials synthesis prediction. The benchmark encompasses four key tasks: raw materials prediction, equipment prediction, synthesis procedure generation, and characterization outcome forecasting. To enable scalable evaluation, we propose an LLM-as-a-Judge framework that leverages large language models for automated assessment, demonstrating strong agreement with expert evaluations (e.g., Pearson's r = 0.80, Spearman's ρ = 0.78). Our experimental results reveal that reasoning-focused models (Claude 3.7, GPT-4o) achieve scores around 4.0 on well-documented oxide and organic synthesis targets, but performance drops by approximately 0.3 points on electrochemical workflows. Fine-tuning on AlchemyBench data enables a 7B-parameter open-source model to surpass generic baselines trained on 1M samples, while retrieval-augmented generation provides an additional +0.20 improvement when supplied with five high-similarity contexts. AlchemyBench addresses a critical gap in the field by providing the first comprehensive, legally redistributable benchmark for automated materials synthesis prediction. Our contributions establish a foundation for exploring LLM capabilities in predicting and guiding materials synthesis, ultimately accelerating experimental design and innovation in materials science.
- Appears in Collections
- Seoul College of Engineering > Seoul Division of Materials Science and Engineering > 1. Journal Articles
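
The abstract reports agreement between the LLM judge and expert evaluators as Pearson's r and Spearman's ρ. As a minimal sketch of how such agreement statistics could be computed, the snippet below implements both coefficients in plain Python. The score lists and the function names (`pearson`, `average_ranks`, `spearman`) are illustrative assumptions, not data or code from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for six recipes (made up for illustration only).
expert_scores = [4.5, 3.0, 2.5, 4.0, 3.5, 1.5]
judge_scores = [4.0, 3.5, 2.0, 4.5, 3.0, 2.0]

print(f"Pearson r  = {pearson(expert_scores, judge_scores):.2f}")
print(f"Spearman ρ = {spearman(expert_scores, judge_scores):.2f}")
```

Pearson's r measures linear agreement between the two sets of scores, while Spearman's ρ only asks whether the judge orders the recipes the way the experts do, so reporting both gives a fuller picture. In practice, `scipy.stats.pearsonr` and `scipy.stats.spearmanr` compute the same quantities and also return p-values.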
