Award Abstract # 1528041
III: Small: Low-Cost Deduplication and Search for Versioned Datasets

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF CALIFORNIA, SANTA BARBARA
Initial Amendment Date: August 20, 2015
Latest Amendment Date: August 1, 2016
Award Number: 1528041
Award Instrument: Standard Grant
Program Manager: Wei-Shinn Ku
IIS
 Division of Information & Intelligent Systems
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2015
End Date: August 31, 2020 (Estimated)
Total Intended Award Amount: $499,998.00
Total Awarded Amount to Date: $515,998.00
Funds Obligated to Date: FY 2015 = $499,998.00
FY 2016 = $16,000.00
History of Investigator:
  • Tao Yang (Principal Investigator)
    tyang@cs.ucsb.edu
  • Stefano Tessaro (Co-Principal Investigator)
Recipient Sponsored Research Office: University of California-Santa Barbara
3227 CHEADLE HALL
SANTA BARBARA
CA  US  93106-0001
(805)893-4188
Sponsor Congressional District: 24
Primary Place of Performance: University of California-Santa Barbara
Santa Babara
CA  US  93106-5110
Primary Place of Performance
Congressional District:
24
Unique Entity Identifier (UEI): G9QBQDH39DF4
Parent UEI:
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01001516DB NSF RESEARCH & RELATED ACTIVIT
01001617DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923, 7364, 9251
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Organizations and companies often archive high volumes of versioned digital datasets. There are research challenges and opportunities for developing integrated archival and search support needed for data preservation, electronic discovery, and regulatory compliance. Since versioned datasets contain highly repetitive content, deduplication can reduce the storage demand by an order of magnitude or more; however such an optimization is resource-intensive. After deduplication, the structure of an inverted index for versioned data becomes complex and it is expensive to search relevant results. This project will study low-cost solutions for compact archiving and indexing and develop efficient algorithms and systems techniques for searching versioned datasets. It will also consider that the archived data can be stored in an untrusted server environment and investigate tradeoffs in efficiency and privacy-preservation for search. The developed solutions will bring significant computing and storage cost advantages for application users involving large-scale versioned data management and search. The developed software will be made public for research communities. The research effort will be integrated with an educational plan containing research mentoring, instruction improvement, and outreach activities.

This project will be focused on studying key challenges and cost-sensitive technical aspects in integrated archival and search support for managing large versioned datasets. The main tasks include efficient software architecture and optimization for detecting duplicated content on a cloud cluster architecture, fast multi-phase search with a hybrid index structure to exploit content similarity and query characteristics, and an efficient privacy-preserving framework with top result ranking.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

(Showing: 1 - 10 of 11)
D. Agun, T. Yang, W. Zhang. "Low-Profile Source-side Deduplication for Virtual Machine Backup" Proc. of USENIX HotCloud '16 , 2016
Cetin Sahin, Victor Zakhary, Amr El Abbadi, Huijia Lin, and Stefano Tessaro "TaoStore: Overcoming Asynchronicity in Oblivious Data Storage" IEEE Symposium on Security and Privacy, SP 2016 , 2016 , p.198 10.1109/SP.2016.20
D. Agun, J. Shao, S. Ji, S. Tessaro, T. Yang "Tradeoffs between Ranking Privacy and Efficiency for Multiword Search" Proc. of International World Wide Web Conference (WWW 2018). , 2018 , p.1725-1734
J. Li, J. A. Benediktsson, B. Zhang, T. Yang, A. Plaza "Spatial Technology and Social Media in Remote Sensing: A Survey" Proceedings of the IEEE , v.105 , 2017 , p.1855-1864
J. Li, J. A. Benediktsson, B. Zhang, T. Yang, A. Plaza "Spatial Technology and Social Media in Remote Sensing: Challenges and Opportunities" Proceedings of the IEEE , v.105 , 2017 , p.1583-1585
J. Shao, S. Ji, and T. Yang "Privacy-aware Document Ranking with Neural Signals" Proc. of 2019 ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR 2019) , 2019 , p.Pages 305
S. Ji, J. Shao, and T. Yang "Efficient Interaction-based Neural Ranking with Locality Sensitive Hashing" Proc. of International World Wide Web Conference (WWW 2019) , 2019 , p.Pages 285
S. Ji, J. Shao, D. Agun, T. Yang. "Privacy-aware Ranking with Tree Ensembles on the Cloud." Proc. of 2018 ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR '2018) , 2018 , p.315-324
Xin Jin, Daniel Agun, Tao Yang, Qinghao Wu, Yifan Shen, Susen Zhao "Hybrid Indexing for Versioned Document Search with Cluster-based Retrieval" 25th ACM International Conference on Information and Knowledge Management (CIKM) , 2016 , p.Pages 377
X. Jin, T. Yang, X. Tang "A Comparison of Cache Blocking Methods for Fast Execution of Ensemble-based Score Computation" Proc. of 2016 ACM SIGIR conference on Research and Development in Information Retrieval , 2016
X. Tang, M. Alabduljalil, X. Jin, T. Yang "Partitioned Similarity Search with Cache-Conscious Data Traversal" ACM Transactions on Knowledge Discovery from Data (TKDD). Article 34 (April 2017), 38 pages , v.11 , 2017
(Showing: 1 - 10 of 11)

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project has studied key challenges and cost-sensitive technical aspects in archival and search support for managing large versioned datasets. The main activities include efficient software architecture and optimization in detecting duplicated content on a cloud cluster architecture, fast multi-phase search with a hybrid index structure to exploit content similarity and query characteristics, and an efficient privacy-preserving framework with top result ranking.

For cluster-based cloud backup, we have proposed a source-side backup scheme with low-resource usage through collaborative deduplication and approximated lazy deletion when frequent virtual machine snapshot backup is required in a large-scale cloud cluster. The key ideas are to orchestrate multi-round duplicate detection batches among machines in a partitioned asynchronous manner and remove most unreferenced content chunks with approximated snapshot deletion. For the tested datasets, network cost is reduced by 255% and storage cost is reduced by 355% times compared to a pure dirty-bit-based source-side deduplication. The new scheme is an order of magnitude faster than a synchronous scheme.

For versioned-data search, we proposed an approach called representative-guided two-phase search (RTP) that uses cluster-based retrieval to quickly narrow the search scope guided by version representatives at Phase 1 and developed a hybrid index structure with adaptive runtime data traversal to conduct fast Phase 2 search and ranking. The hybrid scheme exploits the advantages of forward index and inverted index based on the term characteristics to minimize the time in extracting positional and other features during runtime search. The evaluation shows that  RTP can be up-to 4x as fast as the previously-developed two-phase method for the tested datasets. The RTP prototype software is accessible from the project website. 

To improve query-time efficiency for search, we have proposed an approximation of interaction-based neural ranking algorithms based on locality-sensitive hashing. This scheme accelerates query-document interaction computation by using a query-time cache with precomputed term vectors, and speeds up kernel calculation by taking advantages of limited integer similarity values. This scheme preserves semantic similarity of search words with a small error while yielding a significant efficiency gain. For example, our approximation solution for neural ranking yields up-to 106x scoring time speedup compared to the previous work in the tested datasets while accomplishing competitive relevance. We have also developed cache-aware runtime optimization to speed up the scoring of tree ensembles in ranking a large number of feature vectors. 

We have further addressed several open problems in privacy-aware top result ranking based on linear and nonlinear ranking algorithms. The main challenge for privacy protection is that letting a cloud server perform ranking computation may unsafely reveal privacy-sensitive information.  For linear additive scoring used for searching modest-sized cloud datasets, we have developed a privacy-aware scheme with single-round client-server collaboration and server-side partial ranking based on blinded feature weights and random masks.  This scheme strikes for a tradeoff between privacy and efficiency and client-side preprocessing includes query decomposition with chunked postings to facilitate earlier range intersection and fast access of server-side key-value stores. Server-side query processing can efficiently deal with feature vector sparsity through optional feature matching and enable result filtering with query-dependent chunk-wide random masks for queries that yield too many matched documents. 

For nonlinear result ranking with tree-based ensembles, we have developed a privacy-aware server-side query processing scheme based on comparison-preserving mapping that hides feature values and tree thresholds. This solution scales well for large datasets since a server does not involve time-consuming heavyweight cryptography or additional client involvement after receiving encrypted search words from a client. Our solution restricts the computation complexity of feature composition which represents a trade-off of relevancy for privacy. The evaluation shows that decision trees relying more on raw features can still perform well with a hybrid model trained for each query length and our approach only yields a small relevance degradation compared to a traditional tree algorithm for the tested dataset. For ranking with interaction-based neural ranking, we have analyzed the critical leakages during score calculation and studied countermeasures to mitigate such a leakage. We have developed a privacy-aware scheme and its key techniques include the replacement of the exact kernel with a tree ensemble, a soft match map using obfuscated kernel values and term closures, and adaptive clustering for term occurrence obfuscation and storage optimization. Our design for privacy enhancement is to prevent the leakage of two critical text signals in terms of term frequency and occurrence needed for the attacks shown in the previous work and our analysis. We have also investigated the feasibility of using trusted processors to enhance privacy for search.

During the project period, the PI and co-PI have taught graduate and undergraduate courses on information retrieval, high performance computing, cryptography, and security, and have 
advised a number of  PhD, master, and undergraduate students including under-represented students,  working on related research projects. 

 


Last Modified: 07/24/2020
Modified by: Tao Yang

Please report errors in award information by writing to: awardsearch@nsf.gov.

Print this page

Back to Top of page