Award Abstract # 1717997
III: Small: Improving Technical Paper Database Search through Math-Aware Search Engines

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: ROCHESTER INSTITUTE OF TECHNOLOGY
Initial Amendment Date: December 1, 2017
Latest Amendment Date: December 1, 2017
Award Number: 1717997
Award Instrument: Standard Grant
Program Manager: Sylvia Spengler
sspengle@nsf.gov
 (703)292-7347
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: December 1, 2017
End Date: January 31, 2022 (Estimated)
Total Intended Award Amount: $498,928.00
Total Awarded Amount to Date: $498,928.00
Funds Obligated to Date: FY 2018 = $498,928.00
History of Investigator:
  • Richard Zanibbi (Principal Investigator)
    rlaz@cs.rit.edu
  • Anurag Agarwal (Co-Principal Investigator)
Recipient Sponsored Research Office: Rochester Institute of Tech
1 LOMB MEMORIAL DR
ROCHESTER
NY  US  14623-5603
(585)475-7987
Sponsor Congressional District: 25
Primary Place of Performance: Rochester Institute of Tech
NY  US  14623-5608
Primary Place of Performance Congressional District: 25
Unique Entity Identifier (UEI): J6TWTRKC1X14
Parent UEI:
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001819DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7364, 7923
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Today's search engines make use of sophisticated techniques for searching based upon words, but are not able to make nuanced use of mathematical notation. This project aims to allow scientists, engineers, mathematicians, and students to locate technical information using words, mathematical notation, or some of each. For example, a mathematician studying graph theory could use these new capabilities to find related applications in physics, ecology, and social network analysis, despite any differences in the notation and terminology used in those disciplines. Given a large collection of technical documents, we will apply machine learning techniques to construct associations between the formulae and words used to explain mathematical ideas, and determine how to translate automatically between those two forms of expression. These associations and translations can then be used by students who write what they are looking for using words, with the search engine finding documents that express those same ideas, even if only in mathematical notation. These new math-aware search engines will accelerate innovation by allowing searchers to discover information both across technical disciplines and, by using mathematical notation as a pivot, even across human languages.

To accomplish these goals, the project will develop novel scalable techniques for indexing and retrieval of mathematical content in technical documents. These methods will accommodate a broad range of notational conventions, formats, and encodings. New context-based methods for inferring associations between formulae and related text will be used to build rich and flexible models of content equivalence. These equivalence models will be used in new ranking algorithms that integrate results found using words or using mathematical notation into a single ranked list. Open-source reference implementations will be shared publicly, and new test collections created to evaluate these implementations will be shared with other researchers. To gain experience with the use of these new capabilities, the project will add math-aware search to the CiteSeerX digital library of scientific literature. CiteSeerX is an open Web service that can be used to compare alternative retrieval methods in actual use. For further information see the project Web page: https://www.cs.rit.edu/~dprl/math-aware-search.html.
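One way to picture merging text results and formula results into a single ranked list is reciprocal rank fusion (RRF). The abstract does not say which fusion or ranking method the project uses, so RRF is a stand-in chosen here for illustration only. The document IDs, the function name, and the constant k=60 are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one list.

    Illustrative stand-in only: the abstract does not name the
    project's actual fusion method. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in any list receive a larger share.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a word-based and a formula-based retriever.
text_hits = ["doc_a", "doc_b", "doc_c"]
formula_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([text_hits, formula_hits])
print(fused)
```

Documents that appear in both lists (here doc_a and doc_c) rise above those found by only one retriever, which is the behavior a combined text-and-math ranking is meant to provide.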

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

(Showing: 1 - 10 of 28)
Avenoso, Robin and Mansouri, Behrooz and Zanibbi, Richard "XY-PHOC Symbol Location Embeddings for Math Formula Retrieval and Autocompletion" Proc. CLEF 2021 (CEUR Working Notes), 2021
Davila, Kenny and Joshi, Ritvik and Setlur, Srirangaraj and Govindaraju, Venu and Zanibbi, Richard "Tangent-V: Math Formula Image Search Using Line-of-Sight Graphs" Proceedings of the European Conference on Information Retrieval (ECIR), 2019 https://doi.org/10.1007/978-3-030-15712-8_44
Davila, Kenny and Zanibbi, Richard "Visual Search Engine for Handwritten and Typeset Math in Lecture Videos and LaTeX Notes" Proc. International Conference on Frontiers in Handwriting Recognition, 2018 https://doi.org/10.1109/ICFHR-2018.2018.00018
Dey, Abhisek and Zanibbi, Richard "ScanSSD-XYc: Faster Detection for Math Formulas" Proc. GREC 2021, 2021 https://doi.org/10.1007/978-3-030-86198-8_7
Diaz, Yancarlos and Nishizawa, Gavin and Mansouri, Behrooz and Davila, Kenny and Zanibbi, Richard "The MathDeck Formula Editor: Interactive Formula Entry Combining LaTeX, Structure Editing, and Search" Proc. CHI 2021, 2021 https://doi.org/10.1145/3411763.3451564
Langsenkamp, Matt and Mansouri, Behrooz and Zanibbi, Richard "Expanding Spatial Regions and Incorporating IDF for PHOC-Based Math Formula Retrieval at ARQMath-3" Proc. CLEF 2022 (CEUR Working Notes), 2022
Mahdavi, Mahshad and Condon, Michael and Davila, Kenny and Zanibbi, Richard "LPGA: Line-of-Sight Parsing with Graph-Based Attention for Math Formula Recognition" Proceedings of the International Conference on Document Analysis and Recognition, 2019
Mahdavi, Mahshad and Sun, Leilei and Zanibbi, Richard "Visual Parsing with Query-Driven Global Graph Attention (QD-GGA): Preliminary Results for Handwritten Math Formula Recognition" Proc. CVPR Workshop on Text and Documents in the Deep Learning Era, 2020 https://doi.org/10.1109/CVPRW50498.2020.00293
Mahdavi, Mahshad and Zanibbi, Richard and Mouchère, Harold and Viard-Gaudin, Christian and Garain, Utpal "ICDAR 2019 CROHME + TFD: Competition on Recognition of Handwritten Mathematical Expressions and Typeset Formula Detection" Proceedings of the International Conference on Document Analysis and Recognition, 2019
Mansouri, Behrooz and Agarwal, Anurag and Oard, Douglas W. and Zanibbi, Richard "Advancing Math-Aware Search: The ARQMath-2 Lab at CLEF 2021" Proc. ECIR 2021, 2021 https://doi.org/10.1007/978-3-030-72240-1_74
Mansouri, Behrooz and Agarwal, Anurag and Oard, Douglas W. and Zanibbi, Richard "Advancing Math-Aware Search: The ARQMath-3 Lab at CLEF 2022" Proc. ECIR 2022, 2022 https://doi.org/10.1007/978-3-030-99739-7_51

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The MathSeer project, funded jointly by the NSF and the Alfred P. Sloan Foundation, sought to create new math-aware search engines and interfaces that fully support formulas as well as text when searching technical documents (e.g., research papers and math course materials), and that are intended for use by mathematical non-experts. Our slogan was 'Math search for the masses.'

Products. By its end, the MathSeer project had produced the following. Source code for all software described is available as open source from GitLab.

  1. ARQMath Collection. A dataset created over three years through international competitions run by the MathSeer team for the CLEF conference. It was built from a large set of Math Stack Exchange posts. ARQMath contains over 200 search topics (queries + scored results) for both math question answer retrieval and formula search tasks. ARQMath is the largest collection of its type, and already serves as an international benchmark.

  2. Graphics Extraction and Retrieval Framework. PDFs generally contain drawing instructions and images, with no indication of where formula regions are located. To address this, we created a framework to detect and recognize math formulas in PDFs. The framework includes new detectors for math formulas in document images (ScanSSD and a YOLOv4 variant), and a math formula parsing technique designed to work for handwriting, images, and symbols extracted directly from PDF documents (QD-GGA). We also created a tool for accurately extracting symbol locations from PDF documents (SymbolScraper). SymbolScraper works well enough that it was used in creating Allen AI's initial Semantic Reader prototype. In addition, we created a framework that supports executing queries that combine text and formulas, where queries are written as text with formulas in LaTeX format.

  3. MathDeck. We created a novel search interface to make it easier to create, edit, and use formulas in search queries. This led us to create both a new structure-based formula editor supporting visual operations and LaTeX, and a representation of formulas as 'chips' that can be used in both search and editing, allowing formulas to literally be built in pieces and parts of formulas to be moved around. We also devised formula entity cards, each of which contains a functional formula 'chip' placed on a card with an attached description from Wikipedia. These cards appear during editing, as a form of information-rich autocompletion for formulas. Both chips and cards can be edited, saved, shared, and reused. A news story on MathDeck appeared in ACM TechNews in 2020.

  4. A variety of state-of-the-art formula retrieval models, including neural embeddings, spatial pyramid representations of symbols that can be used with standard search engine construction frameworks (PHOC), learning-to-rank models for combining representations, and a new multimodal representation for text and math that captures the argument structure of formulas and words in sentences in a uniform representation (MathAMR). MathAMR is adapted from the Abstract Meaning Representation graphs used in Natural Language Processing (NLP).
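To give a rough sense of what a spatial pyramid representation of symbols looks like, the sketch below builds a simplified PHOC-style bit vector for each symbol in a formula. It is a toy under stated assumptions, not the project's implementation: it uses horizontal splits only, takes hypothetical symbol positions normalized to [0, 1], and the function name and three-level pyramid are choices made here for illustration.

```python
def phoc_vectors(symbols, levels=3):
    """Build a simplified pyramid bit vector per symbol.

    symbols: list of (symbol, x) pairs, where x is the symbol's
             horizontal center normalized to [0, 1] (assumed input).
    levels:  pyramid depth; level n splits the formula into n
             equal-width vertical strips.

    Each vector has 1 + 2 + ... + levels bits. A bit is set when the
    symbol occurs in the corresponding strip.
    """
    size = sum(range(1, levels + 1))
    vectors = {}
    for sym, x in symbols:
        bits = vectors.setdefault(sym, [0] * size)
        offset = 0
        for level in range(1, levels + 1):
            # Clamp so that x == 1.0 falls in the last strip.
            region = min(int(x * level), level - 1)
            bits[offset + region] = 1
            offset += level
    return vectors


# Hypothetical symbol positions for the formula x^2 + y.
formula = [("x", 0.1), ("2", 0.3), ("+", 0.5), ("y", 0.9)]
vectors = phoc_vectors(formula)
print(vectors["x"])
print(vectors["y"])
```

Because every symbol becomes a fixed-length bit pattern, formulas can be indexed and compared much like bags of terms, which is consistent with the report's remark that these representations work with standard search engine frameworks.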

Publications. The project produced 28 peer-reviewed publications, and received two best paper awards (ICFHR 2018, ECIR 2019) and another best paper nomination at JCDL 2019. A number of papers appeared in top venues spanning different areas of computer science, including SIGIR, CHI, and CVPR.

Project Members. The project involved the participation of 20 students and 1 full-time research programmer between 2017 and 2022. The project provided a collaborative, supportive environment for developing research skills for students from the community college level through PhD. PhDs from the project have gone on to faculty positions and research posts in industry; a number of PhD and Master's students held industrial research internships, and some Master's students went on to work in industrial research as well. Companies where students interned or worked include Amazon, Apple, Microsoft, and start-up companies such as Petuum.

Next Steps. The Document and Pattern Recognition Lab at RIT is currently working to extend the systems and tools created for MathSeer to chemistry, in support of another NSF-funded project, the MMLI NSF AI Center. The project has explored new directions to advance the development of graphics-aware search engines, which we believe have a very bright future.


Last Modified: 07/10/2022
Modified by: Richard Zanibbi

Please report errors in award information by writing to: awardsearch@nsf.gov.