Award Abstract # 1922090

DMREF: Collaborative Research: The Synthesis Genome: Data Mining for Synthesis of New Materials

NSF Org:	DMR Division Of Materials Research
Recipient:	UNIVERSITY OF MASSACHUSETTS
Initial Amendment Date:	August 5, 2019
Latest Amendment Date:	August 5, 2019
Award Number:	1922090
Award Instrument:	Standard Grant
Program Manager:	Mohsen Asle Zaeem, DMR Division Of Materials Research, MPS Directorate for Mathematical and Physical Sciences
Start Date:	October 1, 2019
End Date:	September 30, 2023 (Estimated)
Total Intended Award Amount:	$399,997.00
Total Awarded Amount to Date:	$399,997.00
Funds Obligated to Date:	FY 2019 = $399,997.00
History of Investigator:	Andrew McCallum (Principal Investigator) mccallum@cs.umass.edu
Recipient Sponsored Research Office:	University of Massachusetts Amherst, 101 COMMONWEALTH AVE, AMHERST, MA, US 01003-9252, (413)545-0698
Sponsor Congressional District:	02
Primary Place of Performance:	College of Information and Computer Sciences, 100 Venture Way, Suite 201, Hadley, MA, US 01035-9450
Primary Place of Performance Congressional District:	02
Unique Entity Identifier (UEI):	VGJHK59NMPK9
Parent UEI:	VGJHK59NMPK9
NSF Program(s):	CI REUSE, DMREF
Primary Program Source:	01001920DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	026Z, 054Z, 8004, 8400
Program Element Code(s):	689200, 829200
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.049

ABSTRACT

Successes in accelerated materials design, made possible in part through the Materials Genome Initiative, have shifted the bottleneck in materials development towards the synthesis of novel compounds. Existing databases do not contain the synthesis recipes needed to make computationally designed compounds that are found to have promising properties. As a result, much of the momentum and efficiency gained in the design process becomes gated by trial-and-error synthesis techniques. This delay in going from promising materials concept to validation, optimization, and scale-up is a significant burden on the commercialization of novel materials. This Designing Materials to Revolutionize and Engineer our Future (DMREF) research will build predictive tools for synthesis so that chemical compounds with interesting properties can be synthesized in a matter of days, rather than months or years. The research activities include automatically extracting information from the published literature and patents on how solid inorganic materials have been made in the past by using natural language processing techniques. After this text extraction, the project will generate a "cookbook" of materials synthesis recipes. This cookbook can be mined through machine learning approaches for suggestions on how to make new materials by looking for patterns and similarities among previously made materials. The project outcome will be a data set of materials synthesis methods, to be made available to the community. Another key project outcome is to use machine learning to predict novel or optimized recipes for materials. These predictions will be accompanied by experimental confirmation for a class of materials used in catalysis called zeolites.
The major objective of the outreach component of this research is to enable the use of the database by non-experts. This will be accomplished through both online tutorials and in-person workshops. The online tutorials will teach the basic knowledge required to use the online tools and functionalities, while the workshops will be addressed to students and researchers who want to make use of the database itself.

The approach to automatic extraction of information from the literature will be semi-supervised from a machine learning perspective. Unsupervised methods, including word embeddings that capture the context of words within a scientific corpus, will be used first. Downstream supervised methods will then classify words by their type and their relationship to other words. This forms the basis of the recipe database. The extracted information will then be mined using machine learning tools from the materials informatics community. Because the recipe classification (described subsequently) draws on expertise from the NLP perspective and the target material classification draws on expertise from the materials perspective, there is significant leverage to be had from this interdisciplinary approach, a partnership not previously pursued to further materials design. This approach builds on established synthesis knowledge and combines it with modern data extraction, materials informatics, text mining and machine learning techniques, and the availability of high-throughput ab-initio thermochemical data. The integration of these different fields will provide a direct route towards more rational design of synthesis methods and thereby significantly accelerate the deployment and testing of new materials concepts.
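
As a rough illustration of the two-stage idea described above, the following minimal sketch first builds context vectors from unlabeled text and then assigns word types from a handful of labeled seed words. Everything in it is an assumption made for illustration: the toy corpus, the label names MATERIAL and ACTION, and the function names are hypothetical; neighbour counts stand in for learned word embeddings, and a nearest-seed rule stands in for the supervised classifier. It is not the project's actual pipeline.

```python
from collections import Counter, defaultdict
import math

# Toy unlabeled corpus (hypothetical sentences, not taken from the project's data).
CORPUS = [
    "dissolve NaOH in water and stir the solution",
    "dissolve KOH in water and heat the solution",
    "stir the gel then heat the gel",
    "add NaOH to the gel and stir",
    "add KOH to the gel and heat",
]

# A few labeled seed words (hypothetical label set).
SEEDS = {"NaOH": "MATERIAL", "stir": "ACTION"}


def context_vectors(sentences, window=2):
    """Unsupervised step: describe each word by counts of its neighbours."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(word, vectors, seeds):
    """Supervised step (stand-in): copy the label of the most similar seed word."""
    best_seed = max(seeds, key=lambda s: cosine(vectors[word], vectors[s]))
    return seeds[best_seed]


VECTORS = context_vectors(CORPUS)
for unseen in ("KOH", "heat"):
    print(unseen, classify(unseen, VECTORS, SEEDS))
```

In this toy setting, words that appear in similar contexts to a labeled seed inherit its type, which is the intuition behind pairing unsupervised representations with a small amount of supervision.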

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Chang, Haw-Shiuan and McCallum, Andrew "Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions" Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022 https://doi.org/10.18653/v1/2022.acl-long.554

Kang, Hyeonsu B. and Mysore, Sheshera and Huang, Kevin and Chang, Haw-Shiuan and Prein, Thorben and McCallum, Andrew and Kittur, Aniket and Olivetti, Elsa "Augmenting Scientific Creativity with Retrieval across Knowledge Domains" NLP+HCI Workshop at North American Chapter of the Association for Computational Linguistics 2022, 2022

Kim, Edward and Jensen, Zach and van Grootel, Alexander and Huang, Kevin and Staib, Matthew and Mysore, Sheshera and Chang, Haw-Shiuan and Strubell, Emma and McCallum, Andrew and Jegelka, Stefanie and Olivetti, Elsa "Inorganic Materials Synthesis Planning with Literature-Trained Neural Networks" Journal of Chemical Information and Modeling, v.60, 2020 https://doi.org/10.1021/acs.jcim.9b00995

Mysore, Sheshera and Cohan, Arman and Hope, Tom "Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity" Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022 https://doi.org/10.18653/v1/2022.naacl-main.331

Mysore, Sheshera and Jasim, Mahmood and McCallum, Andrew and Zamani, Hamed "Editable User Profiles for Controllable Text Recommendations" 2023 https://doi.org/10.1145/3539618.3591677

Mysore, Sheshera and McCallum, Andrew and Zamani, Hamed "Large Language Model Augmented Narrative Driven Recommendations" 2023 https://doi.org/10.1145/3604915.3608829

Mysore, Sheshera and O'Gorman, Tim and McCallum, Andrew and Zamani, Hamed "CSFCube - A Test Collection of Computer Science Research Articles for Faceted Query by Example" NeurIPS 2021 Track on Datasets and Benchmarks, 2021

Ricci, Kathryn and Chang, Haw-Shiuan and Goyal, Purujit and McCallum, Andrew "Unsupervised Partial Sentence Matching for Cited Text Identification" ACL Proceedings of the Third Workshop on Scholarly Document Processing, 2022

Swarup, Daivik and Bajaj, Ahsaas and Mysore, Sheshera and O'Gorman, Tim and Das, Rajarshi and McCallum, Andrew "An Instance Level Approach for Shallow Semantic Parsing in Scientific Procedural Text" Findings of the Association for Computational Linguistics: EMNLP 2020, 2020 https://doi.org/10.18653/v1/2020.findings-emnlp.270

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The published materials science literature contains millions of materials synthesis procedures described in unstructured natural language text. Large-scale extraction and analysis of these synthesis procedures promises to enable automated synthesis planning and hypothesis generation, which will dramatically speed up the development of new materials, a process currently gated by time-consuming trial-and-error synthesis techniques. This research focused on two lines of work to speed up the development of novel materials: (1) construction of materials synthesis knowledge bases (KBs) that enable automated synthesis planning and (2) exploration of collections of published papers with aspect-based information retrieval systems that enable human-AI collaboration for generating novel materials synthesis hypotheses.

To enable construction of materials KBs, our work led to the development of the first dataset of materials synthesis procedures annotated with materials science entities, such as compounds, properties, and actions, and the relationships between these entities. This dataset led to the development of novel retrieval-augmented entity and relation extraction models, which delivered high-quality extractions without the large amounts of labeled data commonly required for supervised NLP models. In subsequent work, we developed annotation procedures to scale up the construction of datasets used for materials KB construction. Using models developed with our datasets, we automatically constructed materials synthesis KBs, which enabled training variational autoencoder models capable of predicting the right precursor materials for specific target materials, a first step toward fully automated synthesis planning. In addition, our released datasets have seen significant uptake and expansion by the broader information extraction and NLP communities, resulting in the development of several datasets and models for synthesis extraction in materials science and for procedural text understanding more broadly.
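
To make the notion of a synthesis knowledge base of entities and relations more concrete, here is a minimal sketch of one possible in-memory representation and a lookup over it. The class names, the entity kinds, the relation labels PRECURSOR_OF and APPLIED_TO, and the placeholder material names are all hypothetical choices made for this example; the report does not specify the project's actual schema, and the prediction models mentioned above are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str  # surface string as it appeared in a procedure
    kind: str  # hypothetical type label, e.g. "MATERIAL" or "ACTION"


@dataclass(frozen=True)
class Relation:
    head: Entity
    label: str  # hypothetical relation name
    tail: Entity


def precursors_for(target, relations):
    """Return the names of materials recorded as precursors of `target`."""
    return sorted(
        r.head.text
        for r in relations
        if r.label == "PRECURSOR_OF" and r.tail.text == target
    )


# Placeholder entries; real entries would come from extraction models.
TARGET = Entity("target_oxide", "MATERIAL")
KB = [
    Relation(Entity("precursor_a", "MATERIAL"), "PRECURSOR_OF", TARGET),
    Relation(Entity("precursor_b", "MATERIAL"), "PRECURSOR_OF", TARGET),
    Relation(Entity("calcine", "ACTION"), "APPLIED_TO", Entity("precursor_a", "MATERIAL")),
]

print(precursors_for("target_oxide", KB))
```

A fixed schema like this makes lookups simple, which is also why changing the entity or relation types later is costly.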

While materials KBs allow effective planning and exploration of synthesis procedures when well-defined schemas of entity and relation types exist, they pose a bottleneck when scientists wish to change the schema or explore a new corpus. In such cases, directly exploring minimally structured text collections with information retrieval models offers an attractive alternative. Our work on human-AI collaboration for hypothesis generation explores this alternative through the development of aspect-oriented exploratory search systems. Such systems allow scientists to explore a large collection of scientific papers through aspects such as the “problems”, “solutions”, or “results” introduced in the papers. Our work led to the development of the first dataset to evaluate such an aspect-based search system, the development of novel multi-vector representation models, which represent text using bags of vectors that each capture a different aspect of the text, and the use of these multi-vector models in applications ranging from search to question answering and text generation. These multi-vector models were subsequently also used to develop a search system to facilitate creative hypothesis generation for materials scientists: our systems allowed scientists to search for solutions in disciplines unknown to them while describing the problem in the language of a discipline familiar to them. Our systems were found to allow scientists to see their problems in a new light and indicated a potential to support the development of creative solutions by leveraging interdisciplinary ideas.
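
The aspect-based search idea above can be sketched as follows: each paper is stored as one vector per aspect, and a query paper is compared with the others on a single chosen aspect. This is only a toy sketch under stated assumptions. The paper texts, identifiers, and function names are invented for illustration, and bag-of-words counts stand in for the learned multi-vector representations described in the report.

```python
from collections import Counter
import math

# Hypothetical mini-collection; each paper carries one text span per aspect.
PAPERS = {
    "paper_a": {"problem": "slow crystal growth in zeolite synthesis",
                "solution": "seed the gel with small crystals"},
    "paper_b": {"problem": "slow crystal growth in protein samples",
                "solution": "tune the buffer temperature"},
    "paper_c": {"problem": "noisy labels in text classification",
                "solution": "seed the model with small clean sets"},
}


def embed(text):
    """Stand-in encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_aspect(query_id, aspect, papers):
    """Rank all other papers by similarity to the query on one aspect only."""
    query_vec = embed(papers[query_id][aspect])
    scores = {
        pid: cosine(query_vec, embed(fields[aspect]))
        for pid, fields in papers.items()
        if pid != query_id
    }
    return sorted(scores, key=scores.get, reverse=True)


print(rank_by_aspect("paper_a", "problem", PAPERS))
print(rank_by_aspect("paper_a", "solution", PAPERS))
```

In this toy collection the same query paper retrieves a different nearest neighbour depending on the aspect, and the closest match on the "solution" aspect comes from an unrelated field, which mirrors the cross-disciplinary search use case described above.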

Last Modified: 02/02/2024
Modified by: Andrew K Mccallum