Providing best-effort services in dataspace systems

Providing best-effort services in dataspace systems (2007)

Citations: 25 | Views: 21


Abstract

Nowadays many data sharing applications need to manage a dataspace [68], which contains a number of heterogeneous data sources and partially unstructured data. Such scenarios include large enterprises, collaborative scientific projects, digital libraries, personal information, and the Web. Understanding the relationships between the data sources requires specifying schema mappings, such as one stating that full-name in one data source corresponds to the concatenation of first-name and last-name in another data source. However, the data sources in a dataspace are only loosely coupled, so we may not have schema mappings specified up front. This dissertation studies how to provide best-effort search, querying and browsing services in a dataspace system, even when precise schema mappings are not present. To provide useful services over all data in a dataspace, we need to resolve heterogeneity in the data. Heterogeneity exists at three levels in a dataspace. At the instance level, the same real-world entity can be referred to using different values; for example, a person can be referred to as "Mike" in some data sources and as "Michael" in others. At the schema level, the same domain can be described using different schemas; for example, a person can be described by his first-name and last-name in one data source and by his full-name and other-name in another data source. At the query level, user queries can be composed according to a schema different from the source schema, or even in a language that is not supported by the data model of the source data; for example, a user may compose a SQL query whereas some data sources are unstructured. In this dissertation we describe solutions for resolving heterogeneity in a dataspace. To resolve heterogeneity at the instance level, we describe an algorithm that reconciles references that refer to the same real-world entity.
Our algorithm can be applied to references that belong to multiple classes where rich associations between the references exist. To resolve heterogeneity at the schema level, we propose the concept of probabilistic schema mapping, with which we can return approximate answers even when precise mappings do not exist. We study the complexity of query answering with respect to probabilistic mappings. To resolve heterogeneity at the query level, we design an index over heterogeneous data in a dataspace. Our index extends inverted lists to capture both text and structure of the data to facilitate efficient answering of queries that combine keywords and structure. In addition, we design an algorithm that answers structured queries on unstructured data, such that we can provide seamless search on both structured and unstructured data. Finally, we have grounded all our technical solutions in a particular system, the SEMEX Personal Information Management System. SEMEX provides a logical view of one's personal information, such that it supports associative browsing and provides seamless search and querying over one's personal data.
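The idea of returning approximate answers from a probabilistic schema mapping can be illustrated with a small sketch. This is not the dissertation's algorithm: it assumes a simple reading in which each candidate mapping carries a probability, the query is answered under each mapping separately, and an answer's score is the sum of the probabilities of the mappings that produce it. The table contents, the attribute names other than full-name, and the function `answer_query` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical source table; attribute names follow the (assumed) source schema.
source_rows = [
    {"name": "Alice Smith", "email-addr": "alice@example.org", "alt-addr": "asmith@example.com"},
    {"name": "Bob Jones", "email-addr": "bob@example.org", "alt-addr": "bob@example.org"},
]

# Candidate mappings from target attributes to source attributes.
# The probabilities are made up for illustration and sum to 1.
candidate_mappings = [
    ({"full-name": "name", "email": "email-addr"}, 0.7),
    ({"full-name": "name", "email": "alt-addr"}, 0.3),
]

def answer_query(rows, mappings, select_attrs):
    """Project the target attributes under every candidate mapping.

    Each distinct answer tuple is scored by the total probability of the
    mappings under which it appears.
    """
    scores = defaultdict(float)
    for mapping, prob in mappings:
        # Use a set so a tuple counts once per mapping, even if several rows yield it.
        answers = {tuple(row[mapping[attr]] for attr in select_attrs) for row in rows}
        for answer in answers:
            scores[answer] += prob
    return dict(scores)

ranked = answer_query(source_rows, candidate_mappings, ["full-name", "email"])
for answer, score in sorted(ranked.items(), key=lambda item: -item[1]):
    print(answer, round(score, 2))
```

In this toy data, Bob's tuple is produced by both mappings and so receives the highest score, while each of Alice's two possible email values is scored by the single mapping that yields it.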


Keywords

data source corresponds, personal data, source data, heterogeneous data, schema level, best-effort service, schema mapping, heterogeneous data source, data model, unstructured data, dataspace system, data source
