Detailed Information

Towards Fully-Automated Materials Discovery via Large-Scale Synthesis Dataset and Expert-Level LLM-as-a-Judge (open access)

Authors
Kim, Heegyu; Jeon, Taeyang; Choi, Seungtaek; Hong, Ji-hoon; Jeon, Dong-won; Baek, Ga-yeon; Kwak, Gyeong-won; Lee, Dong-hee; Bae, Jisu; Lee, Chi-hoon; Kim, Yoon-seo; Choi, Seon-Jin; Park, Jin-seong; Cho, Sung-beom; Cho, Hyunsouk
Issue Date
Nov-2025
Publisher
Association for Computing Machinery, Inc
Keywords
benchmark; dataset; human evaluation; large language model; llm-as-a-judge; materials science
Citation
CIKM 2025 - Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp. 1302-1312
Pages
11
Indexed
SCOPUS
Journal Title
CIKM 2025 - Proceedings of the 34th ACM International Conference on Information and Knowledge Management
Start Page
1302
End Page
1312
URI
https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209906
DOI
10.1145/3746252.3761359
Abstract
Materials synthesis remains a critical bottleneck in developing innovations for energy storage, catalysis, electronics, and biomedical devices. Current synthesis design relies heavily on empirical trial-and-error methods guided by expert intuition, limiting the pace of materials discovery. To address this challenge, we present AlchemyBench, a comprehensive benchmark built upon a curated dataset of 17,667 expert-verified synthesis recipes from open-access literature. AlchemyBench provides an end-to-end framework that supports research in large language models (LLMs) applied to materials synthesis prediction. The benchmark encompasses four key tasks: raw materials prediction, equipment prediction, synthesis procedure generation, and characterization outcome forecasting. To enable scalable evaluation, we propose an LLM-as-a-Judge framework that leverages large language models for automated assessment, demonstrating strong agreement with expert evaluations (e.g., Pearson's r = 0.80, Spearman's ρ = 0.78). Our experimental results reveal that reasoning-focused models (Claude 3.7, GPT-4o) achieve scores around 4.0 on well-documented oxide and organic synthesis targets, but performance drops by approximately 0.3 points on electrochemical workflows. Fine-tuning on AlchemyBench data enables a 7B-parameter open-source model to surpass generic baselines trained on 1M samples, while retrieval-augmented generation provides an additional +0.20 improvement when supplied with five high-similarity contexts. AlchemyBench addresses a critical gap in the field by providing the first comprehensive, legally redistributable benchmark for automated materials synthesis prediction. Our contributions establish a foundation for exploring LLM capabilities in predicting and guiding materials synthesis, ultimately accelerating experimental design and innovation in materials science.
Appears in Collections
Seoul College of Engineering > Seoul School of Materials Science and Engineering > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Park, Jinseong
COLLEGE OF ENGINEERING (SCHOOL OF MATERIALS SCIENCE AND ENGINEERING)