QUALITY, RETRIEVAL AND ANALYSIS OF PROVENANCE IN LARGE-SCALE DATA

CHEAH, YOU-WEI

dc.contributor	Plale, Beth
dc.creator	CHEAH, YOU-WEI
dc.date	2014-06-16T21:12:12Z
dc.date	2014-02
dc.date	2014
dc.date.accessioned	2023-02-21T11:19:00Z
dc.date.available	2023-02-21T11:19:00Z
dc.identifier	http://hdl.handle.net/2022/18263
dc.identifier.uri	http://localhost:8080/xmlui/handle/CUHPOERS/252961
dc.description	Thesis (Ph.D.) - Indiana University, Computer Sciences, 2014
dc.description	Provenance is metadata that describes the lineage of a data product. Lineage is invaluable in advancing the reuse and reproducibility of scientific results in e-Science. Through the availability of provenance, future researchers can make valid assessments of data quality or consider the trustworthiness of the data. The shift towards 'Big Data' has presented challenges in provenance driven by data volume and variety, and by the need to make data more valuable and veracious. This dissertation examines provenance quality, capture, and representation, particularly for the highly voluminous provenance that occurs with growing frequency in large-scale science. This work has at its core a framework and methodology that identify three dimensions of provenance quality: correctness, completeness, and relevance. Based on the proposed quality dimensions, the framework supports provenance quality analysis at the node/edge, graph, and multi-graph levels, which includes analysis of annotations, timestamps, and the structure of provenance traces. A supporting contribution is the design and generation of a pseudo-realistic provenance workload that consists of 48,000 provenance traces, forming a provenance database 10 gigabytes in size. This workload is composed of provenance from six varied realistic workflows and includes a failure model that introduces several types of failures into the provenance data, including workflow executions that experienced failures and workflow executions that experienced faults in message-passing communication between the application and the provenance system, the latter resulting in dropped provenance. Provenance in High Performance Computing is directly addressed with the design of a cache storage solution that supports multi-level provenance capture with minimum collection overhead. A distributed NoSQL database stores the collected provenance. Evaluation is carried out through experiments performed on two production systems at the National Energy Research Scientific Computing Center. The final contribution is the experimental evaluation of two storage approaches for provenance, graph and relational databases, and their impact on retrieval for realistic provenance-specific queries. Results obtained at scale and using real-world provenance traces show that graph databases are better suited for the retrieval of large provenance graphs by ID, and relational databases provide a better option for provenance graphs that are of great depth in the evaluated scenarios.
dc.language	en
dc.publisher	[Bloomington, Ind.] : Indiana University
dc.subject	Large-scale Provenance
dc.subject	Provenance Analysis
dc.subject	Provenance Quality
dc.subject	Provenance Query
dc.subject	Computer science
dc.title	QUALITY, RETRIEVAL AND ANALYSIS OF PROVENANCE IN LARGE-SCALE DATA
dc.type	Doctoral Dissertation

Files in this item

Files	Size	Format
CHEAH_indiana_0093A_12739.pdf	3.741Mb	application/pdf

This item appears in the following Collection(s)

IUScholarWorks [635]
Indiana University Bloomington
