
NSF Org: | SES Division of Social and Economic Sciences |
Recipient: |
|
Initial Amendment Date: | September 17, 2015 |
Latest Amendment Date: | August 24, 2017 |
Award Number: | 1528409 |
Award Instrument: | Continuing Grant |
Program Manager: | Brian Humes; SES Division of Social and Economic Sciences; SBE Directorate for Social, Behavioral and Economic Sciences |
Start Date: | September 15, 2015 |
End Date: | August 31, 2019 (Estimated) |
Total Intended Award Amount: | $690,353.00 |
Total Awarded Amount to Date: | $690,353.00 |
Funds Obligated to Date: | FY 2016 = $229,563.00; FY 2017 = $235,035.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: | 201 OLD MAIN, UNIVERSITY PARK, PA, US 16802-1503, (814)865-1372 |
Sponsor Congressional District: |
|
Primary Place of Performance: | 110 Technology Center Building, University Park, PA, US 16802-7000 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Political Science |
Primary Program Source: | 01001617DB NSF RESEARCH & RELATED ACTIVIT; 01001718DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.075 |
ABSTRACT
General Summary
The Correlates of War Project's Militarized Interstate Dispute (MID) Data is the most prominent and heavily used data collection in the study of international conflict. The most recent version (MID4) was released in 2014 and extends the period covered to 1816-2010. The MID4 project used automated text classification procedures to make the process of identifying relevant news stories more efficient. Over the course of that project, the PIs determined that the primary bottleneck in the workflow was the coding of those news documents. To address this inefficiency, the PIs completed a pilot project to determine whether crowdsourcing techniques could be used to code these documents. In the pilot, non-expert workers were paid small sums to read documents and answer sets of questions, and their answers were used to identify features of possible militarized incidents (the events that comprise MIDs). A systematic comparison of the crowdsourced responses with those of the MID4 Project's trained coders revealed that the crowdsourced codings were completely accurate for 68 percent of the news reports coded; more importantly, high agreement among crowd responses on a specific report was strongly associated with correct coding. This enables the PIs to detect which documents require further expert involvement. As a result, the PIs can produce a majority of the MID data in near real time and at limited financial cost. These procedures are applied in the MID5 Project, which will update the MID data for the period 2011-2017.
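The triage idea described above, in which strong agreement among crowd responses signals a likely correct coding and weak agreement signals a need for expert review, might be sketched as follows. This is a minimal illustration and not the PIs' procedure: the function name `triage`, the 0.8 agreement cutoff, and the sample answers are all assumptions made for the example.

```python
from collections import Counter


def triage(responses, threshold=0.8):
    """Return (modal_answer, agreement, needs_expert) for one document.

    responses: crowd answers to a single question about one news report.
    threshold: hypothetical agreement cutoff; not taken from the project.
    """
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    modal_answer, modal_count = counts.most_common(1)[0]
    # Share of workers who gave the most common answer.
    agreement = modal_count / len(responses)
    # Low agreement routes the document to a trained coder.
    return modal_answer, agreement, agreement < threshold


# Invented sample data: five workers per document.
docs = {
    "doc_a": ["yes", "yes", "yes", "yes", "yes"],
    "doc_b": ["yes", "no", "yes", "no", "unsure"],
}
routed = {doc_id: triage(answers) for doc_id, answers in docs.items()}
```

Under these assumptions, `doc_a` would be accepted from the crowd alone, while `doc_b` would be flagged for expert involvement.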
Technical Summary
The MID5 project workflow begins with document retrieval from LexisNexis and document classification using the software and methods implemented in MID4. We discard the negatively classified documents and extract metadata from the positively classified documents, including the document title, the news agency that published the report, the date, and any actors mentioned in the text. Crowd workers are recruited through Amazon's Mechanical Turk and paid a wage to read one of these documents and answer a series of simple, objective questions about it. The questionnaire is predefined, but some extracted metadata is automatically inserted into the questionnaire to improve the quality of responses. Several workers complete a questionnaire for each document, leaving the PIs with a problem of aggregation: how to combine multiple worker responses, possibly regarding multiple related questions, into the data needed to code the militarized incident. In the pilot study, the PIs show that Bayesian networks are the most effective way to achieve this aggregation. Recently, the PIs have made advances in semi-supervised text classification with hybrid deep restricted Boltzmann machines, which outperform previous methods in this task.
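To make the aggregation problem concrete, the sketch below combines several worker answers to a single question into a posterior distribution over the possible answers. It is deliberately much simpler than the Bayesian networks the PIs describe: it assumes a uniform prior, independent workers, and one shared worker accuracy, and it handles only one question at a time. The function name, the 0.7 accuracy value, the answer labels, and the sample responses are all assumptions made for illustration.

```python
import math


def posterior_over_answers(responses, options, accuracy=0.7):
    """Posterior probability that each option is the true answer.

    Assumes a uniform prior, independent workers, and a single shared
    accuracy (hypothetical value); a mistaken worker is assumed to pick
    uniformly among the remaining options.
    """
    if len(options) < 2:
        raise ValueError("need at least two options")
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    wrong = (1.0 - accuracy) / (len(options) - 1)
    log_post = {}
    for candidate in options:
        # Log-likelihood of all responses if `candidate` were true.
        log_post[candidate] = sum(
            math.log(accuracy if r == candidate else wrong) for r in responses
        )
    # Normalize in log space for numerical stability.
    peak = max(log_post.values())
    unnorm = {k: math.exp(v - peak) for k, v in log_post.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}


# Invented labels and responses for one question about one document.
options = ["threat", "display", "use", "none"]
responses = ["display", "display", "threat", "display", "none"]
posterior = posterior_over_answers(responses, options)
best = max(posterior, key=posterior.get)
```

A full treatment would also model dependencies between related questions and differences in worker reliability, which is where a Bayesian network becomes useful.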
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The project, MID 5, had two related goals. The ultimate purpose was to update the Militarized Interstate Dispute (MID) data to as close to the present date as possible. This goal continues past NSF-supported MID projects. The other purpose was to experiment with crowdsourcing as a method of coding disputes. We approached this task after having experimented with a variety of ways to give coders news reports and with different methods of asking questions. Most of the time spent on the project went to finding the best way to use the crowd. After attempting a range of alternative methods, which are documented in our reports, publications, and several conference presentations, we determined that coding the MIDs was, essentially, too demanding for untrained individuals. We were unable to develop an aggregation technique across a small number of coders (generally 3-7 per news story) that “correctly” captured the events reported; the MID coding rules are demanding and, in some ways, peculiar enough that intelligent but untrained readers cannot generally code the events consistently with the coding rules. To provide just one example, our meaning of “threat” is significantly different from common English usage. We concluded, unhappily, that the crowd could not be used to facilitate coding MIDs. In the last third of the project, we reverted to having trained graduate students code news reports, the process used in previous MID projects. We are currently (January 2020) coding MIDs through 2014, using funds provided by Penn State University. We hope to have this completed by the end of the spring semester.
Last Modified: 01/29/2020
Modified by: Glenn H Palmer