Award Abstract # 1016937
SHF:Small:Language Support for Ad Hoc Data Processing

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: THE TRUSTEES OF PRINCETON UNIVERSITY
Initial Amendment Date: August 5, 2010
Latest Amendment Date: June 27, 2012
Award Number: 1016937
Award Instrument: Continuing Grant
Program Manager: Anindya Banerjee
abanerje@nsf.gov
 (703)292-7885
Directorate: CSE, Directorate for Computer and Information Science and Engineering
Start Date: August 15, 2010
End Date: July 31, 2014 (Estimated)
Total Intended Award Amount: $500,000.00
Total Awarded Amount to Date: $500,000.00
Funds Obligated to Date: FY 2010 = $330,382.00
FY 2012 = $169,618.00
History of Investigator:
  • David Walker (Principal Investigator)
Recipient Sponsored Research Office: Princeton University
1 NASSAU HALL
PRINCETON
NJ  US  08544-2001
(609)258-3090
Sponsor Congressional District: 12
Primary Place of Performance: Princeton University
1 NASSAU HALL
PRINCETON
NJ  US  08544-2001
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): NJ1YPQXQG7U5
Parent UEI:
NSF Program(s): Software & Hardware Foundation,
PROGRAMMING LANGUAGES
Primary Program Source: 01001011DB NSF RESEARCH & RELATED ACTIVIT
01001213DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 9218, HPCC
Program Element Code(s): 779800, 794300
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

In every business, engineering endeavour and scientific discipline, workers are digitizing their knowledge with the hope of using computational methods to categorize, query, filter, search, diagnose, and visualize their data. While this effort is leading to remarkable industrial and scientific advances, it is also generating enormous amounts of ad hoc data (i.e., data for which standard data processing tools such as query engines, statistical packages, graphing tools, or other software are not readily available). Ad hoc data poses tremendous challenges to its users because it is often highly varied, poorly documented, filled with errors, and continuously evolving, yet it also contains much valuable information. The goal of this research is to develop general-purpose software tools and techniques capable of managing ad hoc data efficiently. This research has the potential for a broad impact on society by dramatically improving the productivity of industrial data analysts, computer systems administrators and academics who must deal with ad hoc data on a day-to-day basis.

The central technical challenge of the research involves designing, implementing and evaluating a new domain-specific programming language that facilitates the management of ad hoc data sets. This new programming language will allow data analysts to specify the structure of ad hoc data files, how those files are arranged in a file system and what meta-data is associated with them. Once a specification is complete, it will be possible to use it as documentation for the data set or for generating data-processing tools. The research will also involve developing new methods for enabling users to generate specifications quickly and accurately, without actually having to write down all of the details by hand. Finally, the research will develop new algorithms for implementing the generated data-processing tools efficiently.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Cole Schlesinger, Karthik Pattabiraman, Nikhil Swamy, David Walker and Benjamin Zorn. "Modular Protections against Non-control Data Attacks." The Journal of Computer Security, v.22, 2014, p.699. 10.3233/JCS-140502
Cole Schlesinger, Karthik Pattabiraman, Nikhil Swamy, David Walker and Ben Zorn. "Modular Protections against Non-control Data Attacks." Computer Security Foundations Symposium, 2011, p.131. 10.1109/CSF.2011.16
Kathleen Fisher and David Walker. "The PADS project: An Overview." IEEE International Conference on Database Theory, 2011, p.11
Kathleen Fisher, Nate Foster, David Walker and Kenny Q. Zhu. "Forest: A Language and Toolkit for Programming with Filestores." ACM SIGPLAN International Conference on Functional Programming, 2011, p.192. 10.1145/2034773.2034814
Kenny Q. Zhu, Kathleen Fisher and David Walker. "LearnPADS++: Incremental inference of Ad Hoc Data Formats." ACM SIGPLAN International Symposium on Practical Aspects of Declarative Languages, 2012, p.168
Pat Bosshart, Dan Daly, Martin Izzard, Nick McKeown, Jennifer Rexford, Dan Talayco, Amin Vahdat, George Varghese and David Walker. "Programming Protocol-independent Packet Processors." Computer and Communications Review (CCR), v.44, 2014. http://dx.doi.org/00.0000/0000000.0000004

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

We are currently in the midst of the "big data" revolution. In other words, we have, all around us, a growing number of computational processes that are each generating enormous amounts of information and storing it away for future analysis. This data can be used to help us investigate the causes of disease, understand the formation of our universe, improve the efficiency of our energy grid, or discover which products are most popular amongst our customers. Of course, to make use of this data, software engineers must write programs to parse, transform, analyze, query and communicate this data from place to place. The purpose of this research project was to investigate new kinds of programming languages and programming tools that will help software engineers manage this data more easily, more efficiently, and more reliably.


The study of languages, like the ones developed in this project, involves a number of inter-related activities. First, it requires some design: the primitives of the language must be defined so they fit together effectively, and allow software engineers to construct an infinite spectrum of useful programs. Second, each primitive must have a clear semantics, i.e., a “meaning” or “definition.” These meanings are typically expressed in mathematical terms, and doing so allows language designers to prove strong properties of some, or possibly all, of the infinitely many programs that can be written in the language. Indeed, well-designed languages possess many useful safety properties that help programmers avoid errors in program construction. A language semantics is also useful to the engineers who develop the compilers or program analysis tools, and to the everyday programmer who needs to understand what their program does. Third, the study of languages requires implementation and experimentation. We must try the language out on real-world applications to find out how well it solves the problems of interest. Of course, each of these three activities complements the others: the semantic analysis typically tells us what kinds of designs are possible and guides the initial implementation; the implementation and applications tell us what kinds of designs are useful, and may suggest changes to the semantics of individual primitives.


During the execution of this grant, we engaged in each of the activities described above and developed several new programming languages and tools for data management and communication. More specifically, together with collaborators Kathleen Fisher (Tufts University) and Nate Foster (initially a postdoc working at Princeton on this research and now an assistant professor at Cornell University), the PI developed a new language, called Forest, for specifying the structure of multi-directory file systems. From a single, compact specification, the Forest system is able to generate a host of different data-processing tools. For instance, Forest can generate a collection of programmer libraries for parsing, printing, querying, traversing or finding errors in the described data. Forest also has a rigorous semantics, which we have used to prove key properties of the system, including various "round-tripping laws" that tell us, for example, that parsing is a proper inverse of printing. Such properties help improve our confidence in the reliability of our infrastructure and the basic soundness of our designs.
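To make the idea of a round-tripping law concrete, the following is a minimal sketch in Python. It is not Forest code and does not use Forest's API: the record layout, the `Field` type, `SPEC`, `parse_line` and `print_line` are all hypothetical names chosen for illustration. It assumes a single-line, pipe-separated record and derives both a parser and a printer from one description, then checks that each undoes the other.

```python
from typing import Callable, NamedTuple


class Field(NamedTuple):
    """One field of a record: its name, how to read it, how to write it."""
    name: str
    parse: Callable[[str], object]  # text -> value
    show: Callable[[object], str]   # value -> text


# Hypothetical description of one line of an ad hoc log: host|status|bytes
SPEC = [
    Field("host", str, str),
    Field("status", int, str),
    Field("bytes", int, str),
]
SEP = "|"


def parse_line(line: str) -> dict:
    """Parser derived from SPEC; rejects lines with the wrong field count."""
    parts = line.split(SEP)
    if len(parts) != len(SPEC):
        raise ValueError(f"expected {len(SPEC)} fields, got {len(parts)}")
    return {f.name: f.parse(p) for f, p in zip(SPEC, parts)}


def print_line(record: dict) -> str:
    """Printer derived from the same SPEC, so the two cannot drift apart."""
    return SEP.join(f.show(record[f.name]) for f in SPEC)


record = {"host": "example.org", "status": 200, "bytes": 5120}
line = print_line(record)

# Round-trip checks for this one record: each function undoes the other.
assert parse_line(line) == record
assert print_line(parse_line(line)) == line
```

Note that the second check holds only for text in canonical form: a line such as `example.org|0200|5120` parses successfully but prints back without the leading zero. Stating exactly which direction of the round trip holds, and under what conditions, is the kind of question a formal semantics is meant to settle.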

The Forest design was implemented as a domain-specific language embedded in Haskell, a modern functional programming language. Using...
