Resource discovery and integration has been one of the major problems addressed over the last decade. The challenge is to provide advanced solutions that let business applications interoperate across many information sources and software services.
Scalability and heterogeneity are the main concerns. On the one hand, the number of resources (i.e. information sources and business services) provided by various business operators is increasing dramatically. On the other hand, IT actors propose a wide spectrum of software architectures, representation models and programming languages to support agile design and deployment of new applications. Building a flexible information system (IS) in this context consists in designing a high-level abstract solution composed of a virtual schema and a specification of a business process.
Given a deployment architecture, implementing such an IS consists in mapping the virtual schema onto selected data sources (i.e. defining queries which compute instances of the virtual schema from those of the source schemas) and binding business activities to selected software services (e.g. Web services). Dynamic discovery of mappings and services is the key to introducing flexibility in the design and evolution of these systems.
The scalability of discovery algorithms and the handling of heterogeneous resource descriptions are the main challenges that research projects have to deal with. Mappings discovery (MP) is considered extremely difficult because the designer of the system must have a thorough understanding of the semantics of the numerous data sources which compose the integration system, as well as of the target schemas to which they should be linked. Another major issue is the maintenance of the mappings when the integration system evolves frequently.
Our research focused on mapping generation and evolution algorithms, exploiting rich metadata and putting emphasis on scalability.
Services discovery (SD) consists in selecting the most appropriate software services (or Web services) to compose a business application. Current approaches to service retrieval are mostly limited to matching inputs and outputs, keyword search in registries such as UDDI or ebXML, or correspondence tables. The recall and precision of these approaches are not satisfactory for many applications.
Within the framework of the Semantic Web, description logics were proposed for a richer and more precise formal specification of services. Derived ontologies, such as OWL-S, are used as a basis for semantic matching between a declarative description of the required service and descriptions of the offered services. However, the few existing approaches consider only exact matches, while many other services could partially fulfill the user requirements.
Our research focused on SD based on behavioral specifications, allowing approximate and partial matching.
Our research on mapping generation and evolution started in the late 1990s in the context of relational-based mediation systems. Within the MediaGrid project (ACI GRID, 2002-2004), these algorithms were extended to XML data sources and the generation of XQuery mappings. Work done during the 2004-2008 period essentially consisted in improving the mapping generation algorithms for relational data sources and in specifying the mapping generation algorithm for XML data sources.
The problem of mappings discovery is formalized as a path search problem in a graph whose nodes are source relations (possibly hundreds or thousands of nodes) and whose edges are possible joins between them. The desired paths are those which constitute queries that compute target relations.
Two main problems have to be solved in mapping discovery: search path optimization and heterogeneity resolution. We introduced heuristics on the lengths of the paths to limit the exponential cost of an exhaustive search. To detect syntactic and semantic mismatches and resolve heterogeneity problems, we proposed to extend data source descriptions with a rich data typing mechanism, which later facilitates the matching procedures and the search for compensation rules. The mapping generation algorithm was then extended for this purpose, and a new advanced prototype was implemented and evaluated. Significant improvements were observed with respect to the first-generation algorithm.
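The path search with a length bound can be illustrated by a minimal sketch. It is not the algorithm of the prototype: the relation names, the `JOINS` graph and the function `find_join_paths` are hypothetical, and the sketch only enumerates join paths between two source relations, assuming that a bound on the number of joins is the heuristic used to limit the search.

```python
from collections import deque

def find_join_paths(joins, start, goal, max_len):
    """Enumerate simple join paths from `start` to `goal`.

    `joins` maps each source relation to the relations it can be joined with.
    `max_len` is the maximum number of joins allowed in a path (the
    length heuristic that keeps the search from growing exponentially).
    """
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_len:
            continue  # bound reached: do not extend this path further
        for nxt in sorted(joins.get(node, ())):
            if nxt not in path:  # keep paths simple (no repeated relation)
                queue.append(path + [nxt])
    return paths

# Hypothetical source relations and possible joins (illustration only).
JOINS = {
    "patient": {"stay", "sample"},
    "stay": {"patient", "ward", "sample"},
    "sample": {"patient", "stay", "germ"},
    "ward": {"stay"},
    "germ": {"sample"},
}

if __name__ == "__main__":
    for p in find_join_paths(JOINS, "patient", "germ", max_len=3):
        print(" JOIN ".join(p))
```

Each returned path is a candidate join sequence. In a full mapping generator, a path would be kept only if the joined relations provide the attributes of the target relation, a check that this sketch leaves out.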
Mapping generation for XML data sources is not fundamentally different from mapping generation for relational data sources. The main difference lies in the complexity of the data structures and in the variability of object structures (i.e. the same object may be structured in several ways). To handle the complexity of mapping discovery, the target schema is decomposed into subtrees, for which mappings are first created following the same methodology as for the relational model (assuming the existence of join operations between two structured objects). Mappings for the whole schema are then obtained by composing the partial mappings.
The initial funding of this research was provided by two national projects, Reanimatic (ACI Télémédecine) and MediaGrid (ACI GRID), which respectively concern the integration of epidemiological data (in particular data on nosocomial diseases) and the integration of genomic data.
Our objective in service discovery is to propose an approach for service retrieval based on behavioral specifications and allowing approximate matches. The originality of the work lies in the ability of the proposed algorithms to retrieve services with similar behavior on the basis of a behavior-based similarity measure. Consequently, even if no service exactly satisfies the user requirements, the most similar ones, called partial matches, are retrieved and proposed for reuse by extension or modification.
To do so, we reduce the problem of behavioral matching to a graph matching problem. We introduced a semantic distance measure and a set of edit operations which allow users to dynamically restructure their query graph when the target graphs do not match their requirements. We studied two types of behavioral models: a simple automata-based model and a more complex model allowing parallel tasks. The matching algorithms were improved by introducing quality factors, which make it possible to prune some undesired solutions. Another extension was initiated in collaboration with B. Benatallah, F. Casati and F. Toumani. It concerns a taxonomy of the main mismatches that can arise between two services and a set of appropriate adaptors to alleviate these mismatches. A prototype has been developed; it takes two conversation protocols as input and evaluates the semantic distance between them. It also provides the script of edit operations that can be used to alter the query graph so as to bring it as close as possible to the target one. Finally, it includes an evaluation platform which allows users to generate a catalog of service descriptions, a set of query graphs and the corresponding matches with their distance measures. This prototype is available as a Web service and has been demonstrated at the last EDBT conference.
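The idea of a distance derived from an edit script can be illustrated by a minimal sketch. It is a strong simplification and not the semantic distance of the prototype: it assumes that nodes are identified by their activity label, so no node correspondence has to be searched, and that every edit operation has unit cost. The service graphs and the names `edit_script`, `distance` and `rank` are hypothetical.

```python
def edit_script(query, target):
    """Return the edit operations that turn `query` into `target`.

    Each graph is a dict with a set of activity labels under "nodes"
    and a set of (source, destination) label pairs under "edges".
    """
    ops = []
    ops += [("delete_edge", e) for e in sorted(query["edges"] - target["edges"])]
    ops += [("delete_node", n) for n in sorted(query["nodes"] - target["nodes"])]
    ops += [("insert_node", n) for n in sorted(target["nodes"] - query["nodes"])]
    ops += [("insert_edge", e) for e in sorted(target["edges"] - query["edges"])]
    return ops

def distance(query, target):
    """Unit-cost distance: the number of operations in the edit script."""
    return len(edit_script(query, target))

def rank(query, catalog):
    """Order catalog entries from closest to farthest from the query."""
    return sorted(catalog, key=lambda name: (distance(query, catalog[name]), name))

# Hypothetical behavior graphs (illustration only).
QUERY = {
    "nodes": {"login", "search", "order", "pay"},
    "edges": {("login", "search"), ("search", "order"), ("order", "pay")},
}
CATALOG = {
    "shop_a": {
        "nodes": {"login", "search", "order", "invoice"},
        "edges": {("login", "search"), ("search", "order"), ("order", "invoice")},
    },
    "shop_b": {
        "nodes": {"search", "quote"},
        "edges": {("search", "quote")},
    },
}

if __name__ == "__main__":
    for name in rank(QUERY, CATALOG):
        print(name, distance(QUERY, CATALOG[name]))
    print(edit_script(QUERY, CATALOG["shop_a"]))
```

Under these assumptions no catalog entry matches the query exactly, yet the entries are still ranked and the closest one comes with the script of changes needed to align the query with it, which is the behavior expected from partial matching.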
The work has been partially funded by a grant from the Alban Program (Europe-Latin America cooperation).