
NSF Org: | CCF Division of Computing and Communication Foundations |
Recipient: |
|
Initial Amendment Date: | August 5, 2010 |
Latest Amendment Date: | June 27, 2012 |
Award Number: | 1016937 |
Award Instrument: | Continuing Grant |
Program Manager: | Anindya Banerjee, abanerje@nsf.gov, (703)292-7885, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | August 15, 2010 |
End Date: | July 31, 2014 (Estimated) |
Total Intended Award Amount: | $500,000.00 |
Total Awarded Amount to Date: | $500,000.00 |
Funds Obligated to Date: | FY 2012 = $169,618.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: | 1 NASSAU HALL PRINCETON NJ US 08544-2001 (609)258-3090 |
Sponsor Congressional District: |
|
Primary Place of Performance: | 1 NASSAU HALL PRINCETON NJ US 08544-2001 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Software & Hardware Foundation, PROGRAMMING LANGUAGES |
Primary Program Source: | 01001213DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
In every business, engineering endeavour and scientific discipline, workers are digitizing their knowledge with the hope of using computational methods to categorize, query, filter, search, diagnose, and visualize their data. While this effort is leading to remarkable industrial and scientific advances, it is also generating enormous amounts of ad hoc data (i.e., data for which standard data processing tools such as query engines, statistical packages, graphing tools, or other software are not readily available). Ad hoc data poses tremendous challenges to its users because it is often highly varied, poorly documented, filled with errors, and continuously evolving, yet it also contains much valuable information. The goal of this research is to develop general-purpose software tools and techniques capable of managing ad hoc data efficiently. This research has the potential for a broad impact on society by dramatically improving the productivity of industrial data analysts, computer systems administrators and academics who must deal with ad hoc data on a day-to-day basis.
The central technical challenge of the research involves designing, implementing and evaluating a new domain-specific programming language that facilitates the management of ad hoc data sets. This new programming language will allow data analysts to specify the structure of ad hoc data files, how those files are arranged in a file system and what meta-data is associated with them. Once a specification is complete, it will be possible to use it as documentation for the data set or for generating data-processing tools. The research will also involve developing new methods for enabling users to generate specifications quickly and accurately, without actually having to write down all of the details by hand. Finally, the research will develop new algorithms for implementing the generated data-processing tools efficiently.
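As a rough illustration of the idea of generating a tool from a specification, the following Python sketch describes a directory layout declaratively and derives an error-finding check from it. It is not the language developed under this award; the `SPEC` format, the file names, and the `check_layout` function are hypothetical and chosen only for this example.

```python
import os
import tempfile

# Hypothetical specification: each relative path maps to the kind of
# entry expected at that location ("file" or "dir").
SPEC = {
    "README": "file",
    "logs": "dir",
    "logs/access.log": "file",
}


def check_layout(root, spec):
    """Return a sorted list of problems found when comparing root to spec."""
    problems = []
    for rel_path, kind in spec.items():
        full = os.path.join(root, *rel_path.split("/"))
        if kind == "file" and not os.path.isfile(full):
            problems.append(f"missing file: {rel_path}")
        elif kind == "dir" and not os.path.isdir(full):
            problems.append(f"missing directory: {rel_path}")
    return sorted(problems)


def demo():
    """Build a small directory tree that omits one file, then check it."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "logs"))
        with open(os.path.join(root, "README"), "w") as handle:
            handle.write("example\n")
        return check_layout(root, SPEC)


if __name__ == "__main__":
    print(demo())
```

The same `SPEC` value serves both as documentation of the expected layout and as input to the checker, which is the kind of reuse the paragraph above describes.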
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
We are currently in the midst of the "big data" revolution. In other words, we have, all around us, a growing number of computational processes that are each generating enormous amounts of information and storing it away for future analysis. This data can be used to help us investigate the causes of disease, understand the formation of our universe, improve the efficiency of our energy grid, or discover which products are most popular amongst our customers. Of course, to make use of this data, software engineers must write programs to parse, transform, analyze, query and communicate this data from place to place. The purpose of this research project was to investigate new kinds of programming languages and programming tools that will help software engineers manage this data more easily, more efficiently, and more reliably.
The study of languages, like the ones developed in this project, involves a number of inter-related activities. First, it requires some design: the primitives of the language must be defined so they fit together effectively and allow software engineers to construct an infinite spectrum of useful programs. Second, each primitive must have a clear semantics, i.e., a “meaning” or “definition.” These meanings are typically expressed in mathematical terms, and doing so allows language designers to prove strong properties of some, or possibly all, of the infinitely many programs that can be written in the language. Indeed, well-designed languages possess many useful safety properties that help programmers avoid errors in program construction. A language semantics is also useful to the engineers who develop the compilers or program analysis tools, and to the everyday programmer who needs to understand what their program does. Third, the study of languages requires implementation and experimentation. We must try the language out on real-world applications to find out how well it solves the problems of interest. Of course, each of these three activities complements the others: the semantic analysis typically tells us what kinds of designs are possible and guides the initial implementation; the implementation and applications tell us what kinds of designs are useful, and may suggest changes to the semantics of individual primitives.
During the execution of this grant, we engaged in each of the activities described above and developed several new programming languages and tools for data management and communication. More specifically, together with collaborators Kathleen Fisher (Tufts University) and Nate Foster (initially a postdoc working at Princeton on this research and now an assistant professor at Cornell University), the PI developed a new language, called Forest, for specifying the structure of multi-directory file systems. From a single, compact specification, the Forest system is able to generate a host of different data-processing tools. For instance, Forest can generate a collection of programmer libraries for parsing, printing, querying, traversing or finding errors in the described data. Forest also has a rigorous semantics, which we have used to prove key properties of the system, including various "round-tripping laws" that tell us, for example, that parsing is a proper inverse of printing. Such properties help improve our confidence in the reliability of our infrastructure and the basic soundness of our designs.
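To make the notion of a round-tripping law concrete, here is a minimal Python sketch. It is not Forest code (Forest is embedded in Haskell), and it only tests the law on sample values rather than proving it. The `Entry` record, the tab-separated line format, and the function names are assumptions made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical record: a file name and its size in bytes."""
    name: str
    size: int


def print_entry(entry):
    """Render an Entry as one tab-separated line."""
    return f"{entry.name}\t{entry.size}"


def parse_entry(line):
    """Parse one tab-separated line back into an Entry."""
    name, size = line.split("\t")
    return Entry(name, int(size))


def round_trips(entry):
    """Check the law parse(print(x)) == x for a single value."""
    return parse_entry(print_entry(entry)) == entry


if __name__ == "__main__":
    print(round_trips(Entry("log.txt", 42)))
```

The law holds here only for names that contain no tab character. A formal semantics lets such side conditions be stated and proved once for all inputs, rather than checked case by case as this sketch does.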
The Forest design was implemented as a domain-specific language embedded in Haskell, a modern functional programming language. Using...