
NSF Org: | IIS Division of Information & Intelligent Systems |
Initial Amendment Date: | July 29, 2013 |
Latest Amendment Date: | March 1, 2017 |
Award Number: | 1319846 |
Award Instrument: | Standard Grant |
Program Manager: | Tatiana Korelsky; IIS Division of Information & Intelligent Systems; CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 1, 2013 |
End Date: | January 31, 2018 (Estimated) |
Total Intended Award Amount: | $176,514.00 |
Total Awarded Amount to Date: | $176,514.00 |
Recipient Sponsored Research Office: | 1 NORMAL AVE MONTCLAIR NJ US 07043-1624 (973)655-6923 |
Primary Place of Performance: | 1 Normal Avenue, Schmitt Hall 24 Montclair NJ US 07043-1624 |
NSF Program(s): | Robust Intelligence |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
The main goal of this research project is to develop a language-independent method for automatic idiom recognition. Idiomatic expressions, such as 'a blessing in disguise' and 'kick the bucket', are plentiful in everyday language, yet they remain mysterious: it is not clear exactly how people learn and understand them. There is no single agreed-upon definition of idiom that covers all members of this class. Idioms tend to be relatively fixed in grammatical form and meaning, but the relation between form and meaning is largely unpredictable. Also, many idiomatic expressions can receive either an idiomatic or a literal interpretation, i.e. one fully predictable from their form. Compare 'The little girl made a face at her mother.' (idiomatic) with 'The little girl made a face on the snowman using a carrot and two buttons.' (literal). As a result, idioms present great challenges for a variety of natural language processing applications, including machine translation systems, which often do not detect idiomatic language.
To address these challenges, an algorithm is proposed that neither relies on target idiom types, lexicons, or large manually annotated corpora, nor limits the search space to a particular type of linguistic construction. The starting point is that idioms are semantic outliers that violate cohesive structure, especially in local contexts. The following properties are quantified and incorporated into the outlier detection algorithm: 1) idioms lack compositionality compared to literal expressions or other types of collocations; 2) idioms violate local cohesive ties, so they tend to be semantically distant from the local topics; 3) while not all semantic outliers are idioms, non-compositional semantic outliers are likely to be idiomatic; 4) idiomaticity is not a binary property; rather, idioms fall on a continuum from compositional to partly unanalyzable to completely non-compositional.
This research contributes to a better understanding of idiomatic language, to the computational treatment of such phenomena and, through the creation of high-quality, publicly available linguistic resources annotated for idioms, to the facilitation of machine learning research and big data science. Additional benefits include efficient algorithms for computing compositionality and topicality from large corpora, interesting new generalizations about the nature of figurative language, and the training of a cadre of undergraduate and graduate students in highly practical work on a difficult interdisciplinary problem.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
For decades, humans have been learning computer languages. With Natural Language Processing (NLP), computers are now learning how humans use language: NLP is concerned with computers comprehending the structure and meaning of human languages, so that users can interact with computers using natural sentences. This task is not easy. One of the challenges is to "understand" meaning, e.g., to draw inferences, derive sentiment relations, and understand figurative language.
This project has been concerned with developing an algorithm for detecting idiomatic expressions. Many idiomatic expressions are ambiguous: semantic ambiguity arises when multiple interpretations are possible. For example, 'hit the roof' can be interpreted literally or idiomatically depending on the context: 'sales hit the roof' (idiomatic) vs. 'hit the roof of the car' (literal).
The research team has explored various linguistic properties of idiomatic expressions cross-linguistically. In our earlier work, we approached the problem as outlier detection: literal expressions are semantically related to the rest of the context, while words that constitute an idiom appear inconsistent with the rest of the context. Our technique incorporated the following observations: (1) A sequence with literal meaning has many neighbors, whereas a figurative one has few. (2) Idiomatic expressions should demonstrate low semantic proximity between the words composing them. (3) Idiomatic expressions should demonstrate low semantic proximity between the expression and the preceding and subsequent segments.
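Observation (3) can be illustrated with a minimal sketch. It assumes that each word has a vector and that "semantic proximity" is measured as cosine similarity between averaged vectors. The report does not specify the measure, and the 3-dimensional toy vectors and the function names below are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical hand-made toy vectors (not from the project's data).
# Rough axes: physical object, commerce, motion.
VECTORS = {
    "hit": (0.5, 0.1, 0.8),
    "roof": (0.9, 0.0, 0.1),
    "car": (0.8, 0.1, 0.4),
    "ladder": (0.9, 0.0, 0.3),
    "garage": (0.9, 0.1, 0.1),
    "sales": (0.0, 0.9, 0.1),
    "profits": (0.0, 1.0, 0.0),
    "quarter": (0.1, 0.8, 0.0),
}


def mean_vector(words, vectors):
    """Average the vectors of the words that have one."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("no known words to average")
    dims = len(known[0])
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dims))


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_proximity(expression, context, vectors):
    """Proximity between an expression and the words surrounding it."""
    return cosine(mean_vector(expression, vectors), mean_vector(context, vectors))


expression = ["hit", "roof"]
literal = context_proximity(expression, ["car", "ladder", "garage"], VECTORS)
idiomatic = context_proximity(expression, ["sales", "profits", "quarter"], VECTORS)
print(f"literal context: {literal:.2f}, idiomatic context: {idiomatic:.2f}")
```

With these toy values the expression sits close to the literal context and far from the idiomatic one, which is the outlier signal the observation describes. Real word vectors would be learned from a corpus rather than written by hand.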
Later we refined our hypothesis and decided to work with the topic-level structure of text. Informally, topics are just clusters of similar words, and a document usually contains several topics. The idea is that topic words in a given text segment, such as a paragraph, are less likely to be part of an idiomatic expression. Our additional hypothesis is that the contexts in which idioms occur are typically more affective; we therefore incorporate simple sentiment analysis, focusing on the intensity of emotions. This approach can still be viewed as outlier detection, and it allowed us to differentiate idioms from literals with better accuracy using local semantic contexts.
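The topic-word idea can be sketched as follows. This is an illustration under a strong simplifying assumption: the most frequent content words of a paragraph stand in for its topic words, whereas the report does not say how topics were actually derived. The stopword list, the function names, and the example paragraphs are hypothetical, and the sentiment component is left out.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list, just large enough for the examples below.
STOPWORDS = {"a", "and", "in", "of", "the", "this", "was"}


def content_words(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def topic_words(paragraph, k=2):
    """Stand-in for a topic model: the k most frequent content words."""
    counts = Counter(content_words(paragraph))
    return {word for word, _ in counts.most_common(k)}


def topic_overlap(expression, paragraph, k=2):
    """Fraction of the expression's content words that are topic words."""
    words = content_words(expression)
    if not words:
        return 0.0
    topics = topic_words(paragraph, k)
    return sum(w in topics for w in words) / len(words)


literal_paragraph = (
    "The roof of the car was dented. A branch hit the roof of the car "
    "in the storm, and the roof needs repair."
)
idiomatic_paragraph = (
    "Sales hit the roof this quarter. Sales of the new phone doubled, "
    "and quarter profits followed sales upward."
)

print(topic_overlap("hit the roof", literal_paragraph))
print(topic_overlap("hit the roof", idiomatic_paragraph))
```

In the literal paragraph 'roof' is itself a topic word, so the overlap is nonzero; in the idiomatic paragraph the topic words are about sales, and the expression shares none of them.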
We further developed our approach and captured the intuition about cohesive ties, idioms, and local context by using a vector representation of words: we created a representation that captures the meaning of words, their semantic relationships, and the different types of contexts they are used in, and used it in our classification task. Performance improved significantly. We applied our approach to English and Russian, two languages from different language families (Germanic and Slavic, respectively) whose structural and morphological properties are different enough to test whether our method is (relatively) language-independent. Our experimental results were positive.
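One way to picture a classifier built on word vectors is the sketch below. The report does not name the classifier or the vector model, so this uses a nearest-centroid rule as a stand-in: each labeled context is averaged into one vector, each class gets a centroid, and a new context takes the label of the most similar centroid. The 2-dimensional embeddings, the training examples, and all function names are hypothetical.

```python
import math

# Hypothetical toy embeddings. Rough axes: physical setting, abstract or emotional.
EMBEDDINGS = {
    "car": (0.9, 0.1), "ladder": (1.0, 0.0), "garage": (0.9, 0.2),
    "rain": (0.8, 0.1), "tiles": (0.9, 0.0),
    "sales": (0.1, 0.9), "profits": (0.0, 1.0), "angry": (0.1, 0.9),
    "boss": (0.3, 0.7), "furious": (0.0, 0.9),
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    return tuple(sum(v[i] for v in vectors) / len(vectors)
                 for i in range(len(vectors[0])))


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_vector(words, embeddings):
    """Represent a context as the mean of its known word vectors."""
    return mean_vector([embeddings[w] for w in words if w in embeddings])


def train_centroids(examples, embeddings):
    """Compute one centroid per label from (context_words, label) pairs."""
    by_label = {}
    for words, label in examples:
        by_label.setdefault(label, []).append(context_vector(words, embeddings))
    return {label: mean_vector(vecs) for label, vecs in by_label.items()}


def classify(words, centroids, embeddings):
    """Return the label whose centroid is most similar to the context."""
    vec = context_vector(words, embeddings)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Contexts around 'hit the roof', labeled by how the expression is used.
training = [
    (["car", "garage"], "literal"),
    (["ladder", "rain"], "literal"),
    (["sales", "profits"], "idiomatic"),
    (["angry", "boss"], "idiomatic"),
]
centroids = train_centroids(training, EMBEDDINGS)
print(classify(["tiles", "rain"], centroids, EMBEDDINGS))
print(classify(["furious", "boss"], centroids, EMBEDDINGS))
```

Nothing in the sketch depends on English: given vectors trained on a Russian corpus, the same code would run unchanged, which is the sense in which an approach of this kind can be (relatively) language-independent.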
The result of this study is a method for idiom detection that does not rely on resource-heavy linguistic technology, such as syntactic parsing, part-of-speech tagging, or manual annotation. Our algorithm captures the differences between the contexts in which literal and idiomatic expressions occur. This technology can be used in many other applications, such as machine translation, language understanding systems, and sentiment and emotion analysis.
A number of undergraduate and graduate students had the opportunity to participate in active research and gain experience in working collaboratively on an interdisciplinary problem. The project involved students from a variety of disciplines: linguistics, computer science, literature, and mathematics.
Last Modified: 06/01/2018
Modified by: Anna Feldman