
NSF Org: | DMR Division Of Materials Research |
Recipient: |
|
Initial Amendment Date: | August 5, 2019 |
Latest Amendment Date: | August 5, 2019 |
Award Number: | 1922090 |
Award Instrument: | Standard Grant |
Program Manager: | Mohsen Asle Zaeem, DMR Division Of Materials Research, MPS Directorate for Mathematical and Physical Sciences |
Start Date: | October 1, 2019 |
End Date: | September 30, 2023 (Estimated) |
Total Intended Award Amount: | $399,997.00 |
Total Awarded Amount to Date: | $399,997.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: | 101 COMMONWEALTH AVE AMHERST MA US 01003-9252 (413)545-0698 |
Sponsor Congressional District: |
|
Primary Place of Performance: | 100 Venture Way, Suite 201 Hadley MA US 01035-9450 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | CI REUSE, DMREF |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.049 |
ABSTRACT
Successes in accelerated materials design, made possible in part through the Materials Genome Initiative, have shifted the bottleneck in materials development towards the synthesis of novel compounds. Existing databases do not contain the synthesis recipes needed to make the compounds that computational design methods identify as having promising properties. As a result, much of the momentum and efficiency gained in the design process becomes gated by trial-and-error synthesis techniques. This delay in going from promising materials concept to validation, optimization, and scale-up is a significant burden to the commercialization of novel materials. This Designing Materials to Revolutionize and Engineer our Future (DMREF) research will build predictive tools for synthesis so that chemical compounds with interesting properties can be synthesized in a matter of days, rather than months or years. The research activities include using natural language processing techniques to automatically extract information from the published literature and patents on how solid inorganic materials have been made in the past. After this text extraction, the project will generate a "cookbook" of materials synthesis recipes. This cookbook can be mined with machine learning approaches for suggestions on how to make new materials, by looking for patterns and similarities among previously made materials. One project outcome will be a data set of materials synthesis methods, to be made available to the community. Another key project outcome is the use of machine learning to predict novel or optimized recipes for materials. These predictions will be accompanied by experimental confirmation for zeolites, a class of materials used in catalysis. The major objective of the outreach component of this research is to enable the use of the database by non-experts. This will be accomplished through both online tutorials and in-person workshops. The online tutorials will teach the basic knowledge required to use the online tools and functionalities, while the workshops will be addressed to students and researchers who want to make use of the database itself.
The approach to automatic extraction of information from the literature will be semi-supervised from a machine learning perspective. Unsupervised methods, including word embeddings that capture the context of words within a scientific corpus, will be used first. Downstream supervised methods will then be used to classify words by their type and their relationship to other words. This forms the basis of the recipe database. The extracted information will then be mined using machine learning tools from the materials informatics community. Because the recipe classification (described subsequently) leverages expertise from the NLP perspective and the target material classification leverages expertise from the materials perspective, there is significant leverage to be had from this interdisciplinary approach, a partnership not previously pursued to further materials design. This approach builds on established synthesis knowledge and combines it with modern data extraction, materials informatics, text mining and machine learning techniques, and the availability of high-throughput ab-initio thermochemical data. The integration of these different fields will provide a direct route towards more rational design of synthesis methods and thereby significantly accelerate the deployment and testing of new materials concepts.
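The two-stage, semi-supervised idea described above can be illustrated with a minimal sketch. It is not the project's implementation: count-based context vectors stand in for learned word embeddings, a nearest-centroid rule stands in for the downstream supervised classifier, and the corpus, seed labels, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical unlabeled corpus of synthesis sentences (illustrative only).
CORPUS = [
    "heat the tio2 powder at 500 c",
    "heat the zno powder at 400 c",
    "grind the tio2 powder for 2 h",
    "grind the al2o3 powder for 1 h",
    "calcine the zno powder at 600 c",
    "calcine the al2o3 powder at 700 c",
]

# Hypothetical small set of hand-labeled seed words (the supervised signal).
SEED_LABELS = {
    "heat": "operation",
    "grind": "operation",
    "tio2": "material",
    "zno": "material",
}


def context_vectors(sentences, window=2):
    """Unsupervised step: describe each word by the words that appear near it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def centroids(vectors, labels):
    """Supervised step: average the context vectors of the labeled seed words."""
    sums, counts = defaultdict(Counter), Counter()
    for word, label in labels.items():
        sums[label].update(vectors[word])
        counts[label] += 1
    return {
        label: {key: value / counts[label] for key, value in total.items()}
        for label, total in sums.items()
    }


def classify(word, vectors, cents):
    """Assign the label whose centroid is most similar to the word's context vector."""
    return max(cents, key=lambda label: cosine(vectors[word], cents[label]))


vectors = context_vectors(CORPUS)
cents = centroids(vectors, SEED_LABELS)
for unlabeled in ("calcine", "al2o3"):
    print(unlabeled, "->", classify(unlabeled, vectors, cents))
```

On this toy corpus, "calcine" is grouped with the operations and "al2o3" with the materials, even though neither word was labeled, because each shares contexts with the labeled seeds.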
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The published materials science literature contains millions of materials synthesis procedures described in unstructured natural language text. Large-scale extraction and analysis of these synthesis procedures promises to enable automated synthesis planning and hypothesis generation, which will dramatically speed up the development of new materials, a process currently gated by time-consuming trial-and-error synthesis techniques. This research focused on two lines of work to speed up the development of novel materials: (1) construction of materials synthesis knowledge bases (KBs) that enable automated synthesis planning and (2) exploration of collections of published papers with aspect-based information retrieval systems that enable human-AI collaboration for generating novel materials synthesis hypotheses.
To enable construction of materials KBs, our work led to the development of the first dataset of materials synthesis procedures annotated with materials science entities (such as compounds, properties, and actions) and the relationships between these entities. This dataset led to the development of novel retrieval-augmented entity and relation extraction models, which delivered high-quality extractions without the large amounts of labeled data commonly required for supervised NLP models. In subsequent work, we developed annotation procedures to scale up the construction of datasets used for materials KB construction. Using models developed with our datasets, we automatically constructed materials synthesis KBs, which enabled training variational autoencoder models capable of predicting the right precursor materials for specific target materials, a first step toward fully automated synthesis planning. In addition, our released datasets have seen significant uptake and expansion by the broader information extraction and NLP communities, resulting in the development of several datasets and models for synthesis extraction in materials science and for procedural text understanding more broadly.
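To make the idea of entity and relation annotation concrete, the sketch below shows one way an annotated synthesis sentence could be represented and turned into a KB record that links a target material to its precursors. The example sentence, the entity labels ("material", "operation", "condition"), the relation labels ("acts_on", "produces", "condition"), and the record layout are all assumptions chosen for illustration; they are not the schema of the released dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    label: str   # hypothetical entity type


@dataclass(frozen=True)
class Relation:
    head: Entity
    tail: Entity
    label: str   # hypothetical relation type


def mark(text, mention, label):
    """Annotate the first occurrence of `mention` in `text` as an entity."""
    start = text.index(mention)
    return Entity(start, start + len(mention), label)


def surface(text, entity):
    """Recover the annotated string from its character span."""
    return text[entity.start:entity.end]


# Hypothetical annotated sentence.
TEXT = "TiO2 and BaCO3 were calcined at 900 C to obtain BaTiO3."
tio2 = mark(TEXT, "TiO2", "material")
baco3 = mark(TEXT, "BaCO3", "material")
calcine = mark(TEXT, "calcined", "operation")
temperature = mark(TEXT, "900 C", "condition")
batio3 = mark(TEXT, "BaTiO3", "material")

RELATIONS = [
    Relation(calcine, tio2, "acts_on"),
    Relation(calcine, baco3, "acts_on"),
    Relation(calcine, temperature, "condition"),
    Relation(calcine, batio3, "produces"),
]


def to_kb_records(text, relations):
    """Group relations by operation into target/precursor records."""
    records = []
    for produced in (r for r in relations if r.label == "produces"):
        operation = produced.head
        linked = [r for r in relations if r.head == operation]
        records.append({
            "target": surface(text, produced.tail),
            "operation": surface(text, operation),
            "precursors": [surface(text, r.tail) for r in linked if r.label == "acts_on"],
            "conditions": [surface(text, r.tail) for r in linked if r.label == "condition"],
        })
    return records


print(to_kb_records(TEXT, RELATIONS))
```

Records of this shape, pairing a target with its precursors, are the kind of structured input a precursor-prediction model could be trained on; the extraction models themselves are outside the scope of this sketch.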
While materials KBs allow effective planning and exploration of synthesis procedures when well-defined schemas of entity and relation types exist, they pose a bottleneck when scientists wish to change the schema or explore a new corpus. In such cases, directly exploring minimally structured text collections with information retrieval models offers an attractive alternative. Our work on human-AI collaboration for hypothesis generation explores this alternative through the development of aspect-oriented exploratory search systems. Such systems allow scientists to explore a large collection of scientific papers through aspects such as the “problems”, “solutions”, or “results” introduced in the papers. Our work led to the development of the first dataset for evaluating such an aspect-based search system, the development of novel multi-vector representation models that represent a text as a bag of vectors, each of which captures a different aspect of the text, and the use of these multi-vector models in applications ranging from search to question answering and text generation. These multi-vector models were subsequently also used to develop a search system that facilitates creative hypothesis generation for materials scientists: our systems allowed scientists to search for solutions in disciplines unknown to them while describing the problem in the language of a discipline familiar to them. Our systems were found to help scientists see their problems in a new light and indicated a potential to support the development of creative solutions by leveraging interdisciplinary ideas.
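The aspect-based search idea can be sketched as follows, under stated assumptions: each paper is stored as one vector per aspect, and a query is matched against only the aspect the scientist selects. Bag-of-words counts stand in for the learned multi-vector encoders, which means this toy version cannot bridge vocabulary differences between disciplines the way a learned model might. The papers, the query, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def embed(text):
    """Stand-in encoder (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counters; missing keys count as zero."""
    dot = sum(value * b[key] for key, value in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def encode(aspects):
    """Multi-vector representation: one vector per named aspect of a paper."""
    return {aspect: embed(text) for aspect, text in aspects.items()}


def search(query, papers, aspect):
    """Rank papers by similarity to the query on a single chosen aspect."""
    scored = [
        (cosine(query[aspect], vectors[aspect]), name)
        for name, vectors in papers.items()
        if aspect in vectors
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical collection spanning two disciplines.
PAPERS = {
    "wood_science": encode({
        "problem": "prevent moisture damage in timber",
        "solution": "apply wax sealant",
    }),
    "ceramics": encode({
        "problem": "increase density of ceramic parts",
        "solution": "sinter at high temperature",
    }),
}

# Hypothetical query written by a catalysis researcher.
QUERY = encode({
    "problem": "prevent moisture damage in porous catalyst",
    "solution": "calcine at high temperature",
})

print("by problem: ", search(QUERY, PAPERS, "problem"))
print("by solution:", search(QUERY, PAPERS, "solution"))
```

Searching on the "problem" aspect ranks the wood-science paper first, surfacing a solution from another discipline, while searching on the "solution" aspect ranks the ceramics paper first. Keeping the aspects as separate vectors is what lets the same query produce different rankings.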
Last Modified: 02/02/2024
Modified by: Andrew K Mccallum