Towards Fully-Automated Materials Discovery via Large-Scale Synthesis Dataset and Expert-Level LLM-as-a-Judge

Kim, Heegyu; Jeon, Taeyang; Choi, Seungtaek; Hong, Ji-hoon; Jeon, Dong-won; Baek, Ga-yeon; Kwak, Gyeong-won; Lee, Dong-hee; Bae, Jisu; Lee, Chi-hoon; Kim, Yoon-seo; Choi, Seon-Jin; Park, Jin-seong; Cho, Sung-beom; Cho, Hyunsouk

doi:10.1145/3746252.3761359

Access: open access

Issue Date: Nov-2025

Publisher: Association for Computing Machinery, Inc

Keywords: benchmark; dataset; human evaluation; large language model; llm-as-a-judge; materials science

Citation: CIKM 2025 - Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp 1302 - 1312

Pages: 11

Indexed: SCOPUS

Journal Title: CIKM 2025 - Proceedings of the 34th ACM International Conference on Information and Knowledge Management

Start Page: 1302

End Page: 1312

URI: https://scholarworks.bwise.kr/hanyang/handle/2021.sw.hanyang/209906

Abstract: Materials synthesis remains a critical bottleneck in developing innovations for energy storage, catalysis, electronics, and biomedical devices. Current synthesis design relies heavily on empirical trial-and-error methods guided by expert intuition, limiting the pace of materials discovery. To address this challenge, we present AlchemyBench, a comprehensive benchmark built upon a curated dataset of 17,667 expert-verified synthesis recipes from open-access literature. AlchemyBench provides an end-to-end framework that supports research in large language models (LLMs) applied to materials synthesis prediction. The benchmark encompasses four key tasks: raw materials prediction, equipment prediction, synthesis procedure generation, and characterization outcome forecasting. To enable scalable evaluation, we propose an LLM-as-a-Judge framework that leverages large language models for automated assessment, demonstrating strong agreement with expert evaluations (e.g., Pearson's r = 0.80, Spearman's ρ = 0.78). Our experimental results reveal that reasoning-focused models (Claude 3.7, GPT-4o) achieve scores around 4.0 on well-documented oxide and organic synthesis targets, but performance drops by approximately 0.3 points on electrochemical workflows. Fine-tuning on AlchemyBench data enables a 7B-parameter open-source model to surpass generic baselines trained on 1M samples, while retrieval-augmented generation provides an additional +0.20 improvement when supplied with five high-similarity contexts. AlchemyBench addresses a critical gap in the field by providing the first comprehensive, legally redistributable benchmark for automated materials synthesis prediction. Our contributions establish a foundation for exploring LLM capabilities in predicting and guiding materials synthesis, ultimately accelerating experimental design and innovation in materials science.
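
Note on the agreement statistics: the abstract reports judge-versus-expert agreement as Pearson's r and Spearman's ρ. The sketch below shows one way such coefficients could be computed from paired scores. It is not the authors' evaluation code; the function names and the sample scores are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical scores for six recipes, NOT data from the paper.
    expert_scores = [4.5, 3.0, 2.5, 4.0, 3.5, 1.5]
    judge_scores = [4.0, 3.5, 2.0, 4.5, 3.0, 2.0]
    print("Pearson r:", round(pearson(expert_scores, judge_scores), 3))
    print("Spearman rho:", round(spearman(expert_scores, judge_scores), 3))
```

Pearson's r measures linear agreement between the raw scores, while Spearman's ρ only checks whether the judge orders the recipes the same way the experts do, which is why the two values can differ.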

Appears in Collections: College of Engineering (Seoul) > School of Materials Science and Engineering (Seoul) > 1. Journal Articles

Related Researcher

Park, Jinseong: College of Engineering (School of Materials Science and Engineering)