NSF Award Search: Award # 1319846

Award Abstract # 1319846

RI: Small: RUI: AIR: Automatic Idiom Recognition

NSF Org:	IIS Division of Information & Intelligent Systems
Recipient:	MONTCLAIR STATE UNIVERSITY
Initial Amendment Date:	July 29, 2013
Latest Amendment Date:	March 1, 2017
Award Number:	1319846
Award Instrument:	Standard Grant
Program Manager:	Tatiana Korelsky; IIS Division of Information & Intelligent Systems; CSE Directorate for Computer and Information Science and Engineering
Start Date:	August 1, 2013
End Date:	January 31, 2018 (Estimated)
Total Intended Award Amount:	$176,514.00
Total Awarded Amount to Date:	$176,514.00
Funds Obligated to Date:	FY 2013 = $176,514.00
History of Investigator:	Anna Feldman (Principal Investigator), feldmana@mail.montclair.edu; Jing Peng (Co-Principal Investigator)
Recipient Sponsored Research Office:	Montclair State University, 1 NORMAL AVE, MONTCLAIR, NJ, US 07043-1624, (973)655-6923
Sponsor Congressional District:	11
Primary Place of Performance:	Montclair State University, 1 Normal Avenue, Schmitt Hall 24, Montclair, NJ, US 07043-1624
Primary Place of Performance Congressional District:	11
Unique Entity Identifier (UEI):	CM4TTRKFCLF9
Parent UEI:
NSF Program(s):	Robust Intelligence
Primary Program Source:	01001314DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	7495, 7923, 9229
Program Element Code(s):	749500
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

The main goal of this research project is to develop a language-independent method for automatic idiom recognition. Idiomatic expressions, such as 'a blessing in disguise' and 'kick the bucket', are plentiful in everyday language, yet they remain mysterious, as it is not clear exactly how people learn and understand them. There is no single agreed-upon definition of idiom that covers all members of this class, but idioms tend to be relatively fixed in grammatical form and meaning, with relatively little predictability in the relation between form and meaning. Also, many idiomatic expressions can receive either an idiomatic or a literal (i.e., fully predictable from their form) interpretation; compare 'The little girl made a face at her mother.' (idiomatic) vs. 'The little girl made a face on the snowman using a carrot and two buttons.' (literal). As a result, idioms present great challenges for a variety of natural language processing applications, including machine translation systems, which often do not detect idiomatic language. To address these challenges, an algorithm is proposed that neither relies on target idiom types, lexicons, or large manually annotated corpora, nor limits the search space to a particular type of linguistic construction. The starting point is that idioms are semantic outliers that violate cohesive structure, especially in local contexts. The following properties are quantified and incorporated into the outlier detection algorithm: 1) idioms lack compositionality compared to literal expressions or other types of collocations; 2) idioms violate local cohesive ties, so they tend to be semantically distant from the local topics; 3) while not all semantic outliers are idioms, non-compositional semantic outliers are likely to be idiomatic; 4) idiomaticity is not a binary property; rather, idioms fall on a continuum from being compositional to being partly unanalyzable to completely non-compositional.

This research contributes to a better understanding of idiomatic language, to the computational treatment of such phenomena and, through the creation of high-quality, publicly available linguistic resources annotated for idioms, to the facilitation of machine learning research and big data science. Additional benefits include efficient algorithms for computing compositionality and topicality from large corpora, interesting new generalizations about the nature of figurative language, and the training of a cadre of undergraduate and graduate students in highly practical work on a difficult interdisciplinary problem.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 15)

Anna Feldman and Jing Peng "Automatic Detection of Idiomatic Clauses." Proceedings of Computational Linguistics and Intelligent Text Processing (CICLing 2013) , v.Part I , 2013 , p.435

Jing Peng and Anna Feldman "Automatic Idiom Recognition with Word Embeddings" Communications in Computer and Information Science , v.656 , 2016

Jing Peng and Anna Feldman "Experiments in Idiom Recognition" Proceedings of the 26th International Conference on Computational Linguistics (COLING). , 2016

Jing Peng and Anna Feldman. "In God We Trust. All Others Must Bring Data. -- W. Edwards Deming -- Using word embeddings to recognize idioms." Proceedings of the 3rd Annual International Symposium on Information Management and Big Data , 2016

Jing Peng, Anna Feldman, and Ekaterina Vylomova "Classifying Idiomatic and Literal Expressions Using Topic Models and Intensity of Emotions." Proceedings of the 2014 Empirical Methods for Natural Language Processing Conference (EMNLP) , 2014

Jing Peng, Anna Feldman, and Hamza Jazmati "Classifying Idiomatic and Literal Expressions Using Vector Space Representations" Proceedings of the Recent Advances in Natural Language Processing (RANLP) conference , 2015

Jing Peng, Anna Feldman, and William Bryan "'In God we trust. All others must bring data.' -- W. Edwards Deming -- Using word embeddings to recognize idioms" SIMBig 2016 : 3rd Annual International Symposium on Information Management and Big Data. , 2016

Jing Peng, Anna Feldman, and William Bryan. "Back to the drawing board: A new algorithm for idiom detection." Empirical Methods in Natural Language Processing (EMNLP) , 2016

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

For decades, humans have been learning computer languages. With Natural Language Processing (NLP), computers are now learning how humans use language: they learn to comprehend the structure and meaning of human languages, which allows users to interact with computers using natural sentences. This task is not easy. One of the challenges is to "understand" meaning, e.g., to draw inferences, derive sentiment relations, and understand figurative language.

This project has been concerned with developing an algorithm for detecting idiomatic expressions. Many idiomatic expressions are ambiguous. Ambiguities in semantics arise when multiple interpretations are possible. For example, 'hit the roof' can be interpreted literally or idiomatically depending on the context: 'sales hit the roof' (idiomatic) vs. 'hit the roof of the car' (literal).

The research team has explored various linguistic properties of idiomatic expressions cross-linguistically. In our earlier work, we approached the problem as outlier detection: literal expressions are semantically related to the rest of the context, while words that constitute an idiom appear inconsistent with the rest of the context. Our technique incorporated the following observations: (1) A sequence with literal meaning has many neighbors, whereas a figurative one has few. (2) Idiomatic expressions should demonstrate low semantic proximity between the words composing them. (3) Idiomatic expressions should demonstrate low semantic proximity between the expression and the preceding and subsequent segments.
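Observations (2) and (3) can be illustrated with a minimal sketch. It is not the project's implementation: the three-dimensional vectors in `TOY_VECTORS`, the function names, and the `threshold` value are all hypothetical, chosen only to show how low semantic proximity between an expression and its context could flag a candidate idiom.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_proximity(words_a, words_b, vectors):
    """Average cosine similarity over all cross pairs that have vectors."""
    sims = [cosine(vectors[a], vectors[b])
            for a in words_a for b in words_b
            if a in vectors and b in vectors]
    return sum(sims) / len(sims) if sims else 0.0

# Hypothetical toy vectors; dimensions loosely stand for
# (commerce, buildings/vehicles, motion). Real work would use learned vectors.
TOY_VECTORS = {
    "hit": [0.1, 0.3, 0.9],
    "roof": [0.0, 1.0, 0.1],
    "sales": [1.0, 0.0, 0.1],
    "quarter": [0.9, 0.1, 0.0],
    "car": [0.1, 0.9, 0.3],
    "ladder": [0.0, 0.8, 0.4],
}

def looks_idiomatic(expression, context, vectors, threshold=0.5):
    """Flag the expression as a semantic outlier when it is far from its context.

    The threshold is an assumption made for this sketch, not a tuned value.
    """
    return mean_proximity(expression, context, vectors) < threshold
```

With these toy vectors, `looks_idiomatic(["hit", "roof"], ["sales", "quarter"], TOY_VECTORS)` flags the expression as an outlier, while the same expression in the context `["car", "ladder"]` is not flagged.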

Later we refined our hypothesis and decided to work with text at the level of topic structure. Informally, topics are just clusters of similar words. A document usually contains several topics. The idea is that topic words in a given text segment, such as a paragraph, are less likely to be a part of an idiomatic expression. Our additional hypothesis is that the contexts in which idioms occur are typically more affective; therefore, we incorporated simple sentiment analysis, focusing on the intensity of emotions. This approach can still be viewed as outlier detection, and it allowed us to differentiate idioms from literals using local semantic contexts with better accuracy.
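The two signals described here can be sketched roughly as follows. This is only an illustration under strong assumptions: the most frequent content words of a paragraph stand in for the output of a real topic model, and `STOPWORDS`, the emotion lexicon, and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical, tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "of", "and", "in", "on", "was", "were", "he", "when"}
PUNCTUATION = ".,;:!?\"'"

def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    stripped = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in stripped if w]

def topic_words(paragraph, top_k=2):
    """Stand-in for a topic model: the top_k most frequent content words."""
    counts = Counter(w for w in tokenize(paragraph) if w not in STOPWORDS)
    return {w for w, _ in counts.most_common(top_k)}

def topic_overlap(expression, paragraph, top_k=2):
    """Share of the expression's content words that are topic words."""
    content = [w for w in tokenize(expression) if w not in STOPWORDS]
    if not content:
        return 0.0
    topics = topic_words(paragraph, top_k)
    return sum(w in topics for w in content) / len(content)

def emotion_intensity(paragraph, lexicon):
    """Average per-token intensity according to a word-to-score lexicon."""
    tokens = tokenize(paragraph)
    if not tokens:
        return 0.0
    return sum(lexicon.get(w, 0.0) for w in tokens) / len(tokens)

literal = ("The roof of the car was dented. He hit the roof of the car "
           "when the ladder fell on the roof.")
idiomatic = ("Sales rose in spring and sales rose again in summer. "
             "Sales hit the roof and the managers were thrilled.")
toy_lexicon = {"thrilled": 0.9}  # hypothetical intensity score
```

On these two made-up paragraphs, `topic_overlap("hit the roof", literal)` is higher than `topic_overlap("hit the roof", idiomatic)`, and `emotion_intensity(idiomatic, toy_lexicon)` is higher than the value for the literal paragraph, which matches the direction of both hypotheses.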

We further developed our approach and captured the intuition about cohesive ties, idioms, and local context by using a vector representation of words: we created a representation of words that captures their meaning, their semantic relationships, and the different types of contexts they are used in, and we used it in our classification task. Performance improved significantly. We applied our approach to English and Russian, two languages from different language groups (Germanic and Slavic, respectively) whose structural and morphological properties are different enough to test whether our method is (relatively) language-independent. Our experimental results were positive.
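As a rough illustration of classifying usages from word vectors, the sketch below averages the vectors of the context words and assigns the label of the nearest class centroid. The nearest-centroid rule is a simple stand-in and should not be read as the classifier used in the project; the two-dimensional `TOY_EMBEDDINGS`, the training examples, and the function names are hypothetical.

```python
def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def embed_context(words, embeddings):
    """Represent a context as the mean of its known word vectors."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        raise ValueError("no context word has an embedding")
    return average(known)

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def train_centroids(labeled_contexts, embeddings):
    """Compute one centroid per label from (context_words, label) pairs."""
    by_label = {}
    for words, label in labeled_contexts:
        by_label.setdefault(label, []).append(embed_context(words, embeddings))
    return {label: average(vecs) for label, vecs in by_label.items()}

def classify(words, centroids, embeddings):
    """Return the label whose centroid is closest to the context vector."""
    vector = embed_context(words, embeddings)
    return min(centroids, key=lambda label: squared_distance(vector, centroids[label]))

# Hypothetical 2-d embeddings: (finance-like, physical-object-like).
TOY_EMBEDDINGS = {
    "sales": [1.0, 0.0], "profits": [0.9, 0.1], "prices": [0.8, 0.0],
    "market": [0.9, 0.2], "car": [0.1, 0.9], "ladder": [0.0, 1.0],
    "ball": [0.1, 0.8], "garage": [0.2, 0.9],
}

# Made-up contexts of 'hit the roof', labeled by usage.
training = [
    (["sales", "profits"], "idiomatic"),
    (["prices"], "idiomatic"),
    (["car", "ladder"], "literal"),
    (["ball"], "literal"),
]
centroids = train_centroids(training, TOY_EMBEDDINGS)
```

Here `classify(["market"], centroids, TOY_EMBEDDINGS)` returns `"idiomatic"` and `classify(["garage"], centroids, TOY_EMBEDDINGS)` returns `"literal"`. Because the sketch needs only word vectors and labeled contexts, the same code would apply unchanged to another language, which is the sense in which such a method can be (relatively) language-independent.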

The result of this study is a method for idiom detection that does not rely on resource-heavy linguistic technology, such as syntactic parsing, part-of-speech tagging, or manual annotation. Our algorithm manages to capture the differences between the contexts in which literal and idiomatic expressions occur. This technology can be used in many other applications, such as machine translation, language understanding systems, and sentiment and emotion analysis.

A number of undergraduate and graduate students had the opportunity to participate in active research and gain experience in working collaboratively on an interdisciplinary problem. The project involved students from a variety of disciplines: linguistics, computer science, literature, and mathematics.

Last Modified: 06/01/2018
Modified by: Anna Feldman
