Sangam: A Confluence of Knowledge Streams

QUALITY, RETRIEVAL AND ANALYSIS OF PROVENANCE IN LARGE-SCALE DATA

dc.contributor Plale, Beth
dc.creator CHEAH, YOU-WEI
dc.date 2014-06-16T21:12:12Z
dc.date 2014-02
dc.date 2014
dc.date.accessioned 2023-02-21T11:19:00Z
dc.date.available 2023-02-21T11:19:00Z
dc.identifier http://hdl.handle.net/2022/18263
dc.identifier.uri http://localhost:8080/xmlui/handle/CUHPOERS/252961
dc.description Thesis (Ph.D.) - Indiana University, Computer Sciences, 2014
dc.description Provenance is metadata that describes the lineage of a data product. Lineage is invaluable in advancing the reuse and reproducibility of scientific results in e-Science. With provenance available, future researchers can make valid assessments of data quality and judge the trustworthiness of the data. The shift towards 'Big Data' has presented challenges for provenance, driven by data volume and variety and by the need to make data more valuable and veracious. This dissertation examines provenance quality, capture, and representation, particularly for the highly voluminous provenance that occurs with growing frequency in large-scale science. At its core is a framework and methodology that identify three dimensions of provenance quality: correctness, completeness, and relevance. Based on these dimensions, the framework supports provenance quality analysis at the node/edge, graph, and multi-graph levels, including analysis of annotations, timestamps, and the structure of provenance traces. A supporting contribution is the design and generation of a pseudo-realistic provenance workload of 48,000 provenance traces, forming a provenance database 10 gigabytes in size. The workload is composed of provenance from six varied realistic workflows and includes a failure model that introduces several types of failures into the provenance data, including workflow executions that failed and workflow executions that experienced faults in message-passing communication between the application and the provenance system, the latter resulting in dropped provenance. Provenance in High Performance Computing is addressed directly with the design of a cache storage solution that supports multi-level provenance capture with minimal collection overhead. A distributed NoSQL database stores the collected provenance. Evaluation is carried out through experiments on two production systems at the National Energy Research Scientific Computing Center. The final contribution is an experimental evaluation of two storage approaches for provenance, graph and relational databases, and their impact on retrieval for realistic provenance-specific queries. Experiments carried out at scale using real-world provenance traces show that, in the evaluated scenarios, graph databases are better suited to retrieving large provenance graphs by ID, while relational databases are the better option for provenance graphs of great depth.
dc.language en
dc.publisher [Bloomington, Ind.] : Indiana University
dc.subject Large-scale Provenance
dc.subject Provenance Analysis
dc.subject Provenance Quality
dc.subject Provenance Query
dc.subject Computer science
dc.title QUALITY, RETRIEVAL AND ANALYSIS OF PROVENANCE IN LARGE-SCALE DATA
dc.type Doctoral Dissertation


Files in this item

Files Size Format View
CHEAH_indiana_0093A_12739.pdf 3.741Mb application/pdf View/Open
