
NSF Org: |
IIS Division of Information & Intelligent Systems |
Recipient: |
|
Initial Amendment Date: | August 20, 2015 |
Latest Amendment Date: | August 1, 2016 |
Award Number: | 1528041 |
Award Instrument: | Standard Grant |
Program Manager: |
Wei-Shinn Ku
IIS Division of Information & Intelligent Systems CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 1, 2015 |
End Date: | August 31, 2020 (Estimated) |
Total Intended Award Amount: | $499,998.00 |
Total Awarded Amount to Date: | $515,998.00 |
Funds Obligated to Date: |
FY 2016 = $16,000.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
3227 CHEADLE HALL SANTA BARBARA CA US 93106-0001 (805)893-4188 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
Santa Babara CA US 93106-5110 |
Primary Place of
Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Info Integration & Informatics |
Primary Program Source: |
01001617DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Organizations and companies often archive high volumes of versioned digital datasets. There are research challenges and opportunities for developing integrated archival and search support needed for data preservation, electronic discovery, and regulatory compliance. Since versioned datasets contain highly repetitive content, deduplication can reduce the storage demand by an order of magnitude or more; however such an optimization is resource-intensive. After deduplication, the structure of an inverted index for versioned data becomes complex and it is expensive to search relevant results. This project will study low-cost solutions for compact archiving and indexing and develop efficient algorithms and systems techniques for searching versioned datasets. It will also consider that the archived data can be stored in an untrusted server environment and investigate tradeoffs in efficiency and privacy-preservation for search. The developed solutions will bring significant computing and storage cost advantages for application users involving large-scale versioned data management and search. The developed software will be made public for research communities. The research effort will be integrated with an educational plan containing research mentoring, instruction improvement, and outreach activities.
This project will be focused on studying key challenges and cost-sensitive technical aspects in integrated archival and search support for managing large versioned datasets. The main tasks include efficient software architecture and optimization for detecting duplicated content on a cloud cluster architecture, fast multi-phase search with a hybrid index structure to exploit content similarity and query characteristics, and an efficient privacy-preserving framework with top result ranking.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
Note:
When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external
site maintained by the publisher. Some full text articles may not yet be available without a
charge during the embargo (administrative interval).
Some links on this page may take you to non-federal websites. Their policies may differ from
this site.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
This project has studied key challenges and cost-sensitive technical aspects in archival and search support for managing large versioned datasets. The main activities include efficient software architecture and optimization in detecting duplicated content on a cloud cluster architecture, fast multi-phase search with a hybrid index structure to exploit content similarity and query characteristics, and an efficient privacy-preserving framework with top result ranking.
For cluster-based cloud backup, we have proposed a source-side backup scheme with low-resource usage through collaborative deduplication and approximated lazy deletion when frequent virtual machine snapshot backup is required in a large-scale cloud cluster. The key ideas are to orchestrate multi-round duplicate detection batches among machines in a partitioned asynchronous manner and remove most unreferenced content chunks with approximated snapshot deletion. For the tested datasets, network cost is reduced by 255% and storage cost is reduced by 355% times compared to a pure dirty-bit-based source-side deduplication. The new scheme is an order of magnitude faster than a synchronous scheme.
For versioned-data search, we proposed an approach called representative-guided two-phase search (RTP) that uses cluster-based retrieval to quickly narrow the search scope guided by version representatives at Phase 1 and developed a hybrid index structure with adaptive runtime data traversal to conduct fast Phase 2 search and ranking. The hybrid scheme exploits the advantages of forward index and inverted index based on the term characteristics to minimize the time in extracting positional and other features during runtime search. The evaluation shows that RTP can be up-to 4x as fast as the previously-developed two-phase method for the tested datasets. The RTP prototype software is accessible from the project website.
To improve query-time efficiency for search, we have proposed an approximation of interaction-based neural ranking algorithms based on locality-sensitive hashing. This scheme accelerates query-document interaction computation by using a query-time cache with precomputed term vectors, and speeds up kernel calculation by taking advantages of limited integer similarity values. This scheme preserves semantic similarity of search words with a small error while yielding a significant efficiency gain. For example, our approximation solution for neural ranking yields up-to 106x scoring time speedup compared to the previous work in the tested datasets while accomplishing competitive relevance. We have also developed cache-aware runtime optimization to speed up the scoring of tree ensembles in ranking a large number of feature vectors.
We have further addressed several open problems in privacy-aware top result ranking based on linear and nonlinear ranking algorithms. The main challenge for privacy protection is that letting a cloud server perform ranking computation may unsafely reveal privacy-sensitive information. For linear additive scoring used for searching modest-sized cloud datasets, we have developed a privacy-aware scheme with single-round client-server collaboration and server-side partial ranking based on blinded feature weights and random masks. This scheme strikes for a tradeoff between privacy and efficiency and client-side preprocessing includes query decomposition with chunked postings to facilitate earlier range intersection and fast access of server-side key-value stores. Server-side query processing can efficiently deal with feature vector sparsity through optional feature matching and enable result filtering with query-dependent chunk-wide random masks for queries that yield too many matched documents.
For nonlinear result ranking with tree-based ensembles, we have developed a privacy-aware server-side query processing scheme based on comparison-preserving mapping that hides feature values and tree thresholds. This solution scales well for large datasets since a server does not involve time-consuming heavyweight cryptography or additional client involvement after receiving encrypted search words from a client. Our solution restricts the computation complexity of feature composition which represents a trade-off of relevancy for privacy. The evaluation shows that decision trees relying more on raw features can still perform well with a hybrid model trained for each query length and our approach only yields a small relevance degradation compared to a traditional tree algorithm for the tested dataset. For ranking with interaction-based neural ranking, we have analyzed the critical leakages during score calculation and studied countermeasures to mitigate such a leakage. We have developed a privacy-aware scheme and its key techniques include the replacement of the exact kernel with a tree ensemble, a soft match map using obfuscated kernel values and term closures, and adaptive clustering for term occurrence obfuscation and storage optimization. Our design for privacy enhancement is to prevent the leakage of two critical text signals in terms of term frequency and occurrence needed for the attacks shown in the previous work and our analysis. We have also investigated the feasibility of using trusted processors to enhance privacy for search.
During the project period, the PI and co-PI have taught graduate and undergraduate courses on information retrieval, high performance computing, cryptography, and security, and have advised a number of PhD, master, and undergraduate students including under-represented students, working on related research projects.
Last Modified: 07/24/2020
Modified by: Tao Yang
Please report errors in award information by writing to: awardsearch@nsf.gov.