Third DIALOGUE Workshop: Towards the Next Generation of Data Grid Software

overview | registration | accommodation | agenda | list of participants


Abstracts


Mining spatial gene expression data for association rules


Jano van Hemert, Richard Baldock (NeSC)


We present the results from a data mining study on data from the Edinburgh Mouse Atlas, a spatio-temporal framework for capturing in-situ gene expression patterns. It contains curated scientific results that are mapped onto 3D embryo models of common stages of development. The framework provides access to the atlas using regular web services, which can be linked together using a workflow tool, such as Taverna, to perform data mining tasks in a spatial and temporal context. We will present results from one such task, known as association rule mining, and highlight the challenges this type of data presents to the e-Science community. Moreover, we hope to attract collaborations in the area of workflow-enabled data mining in order to further broaden the set of analysis tools. Conventional association rule mining activities stop once rules are delivered to the end user. In the domain of spatial gene expression, we can go back to the original data to show the context of the association rules, providing the domain expert, i.e., the developmental biologist, with feedback on where in an embryo, or for which set of genes, particular rules are of interest. The data mining activity involves accessing spatio-temporal data in two stages, as well as allowing visual feedback by performing image processing steps. The web services that would allow the full mining process are still under construction, and we hope to get feedback on which architectural design would be most suitable for large-scale mining activities.
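To make the notion of association rules concrete, the following minimal sketch mines rules over hypothetical "transactions" in which each embryo region is mapped to the set of genes expressed there. The gene names and thresholds are illustrative assumptions, not data from the Atlas or the actual mining pipeline.

```python
from itertools import combinations

# Hypothetical transactions: each region -> set of expressed genes
# (gene names are illustrative, not taken from the Mouse Atlas).
regions = [
    {"Shh", "Gli1", "Ptch1"},
    {"Shh", "Gli1"},
    {"Shh", "Ptch1"},
    {"Gli1", "Ptch1"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules(transactions, min_support=0.5, min_confidence=0.6):
    """Enumerate rules {a} -> {b} over all item pairs, filtered by
    minimum support and minimum confidence."""
    items = set().union(*transactions)
    found = []
    for a, b in combinations(sorted(items), 2):
        for lhs, rhs in (({a}, {b}), ({b}, {a})):
            s = support(lhs | rhs, transactions)
            if s >= min_support:
                conf = s / support(lhs, transactions)
                if conf >= min_confidence:
                    found.append((lhs, rhs, s, conf))
    return found

for lhs, rhs, s, conf in rules(regions):
    print(lhs, "->", rhs, f"support={s:.2f} confidence={conf:.2f}")
```

A rule such as {Shh} -> {Gli1} would then be traced back to the regions that satisfy it, which is the "context" feedback step the abstract describes.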


OGSA-DAI: Ongoing Work and Future Plans


Konstantinos Karasavvas (NeSC) slides

OGSA-DAI is undergoing major architectural changes to many fundamental system aspects. In this talk we are going to introduce some of these changes. We will start by describing the new activity structure and explaining the rationale behind it. Then we will discuss improvements to the activity framework: how data is passed between activities, what the in-between buffers hold and exactly what is passed, leading to a more optimised enactment model and engine. In addition, grouping of data will be introduced and special activities to handle groups will be discussed. Finally, an overview of the new activities will be provided to show the variety of functionality that we intend to offer.
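The idea of activities connected by in-between buffers can be sketched with Python generators: each stage consumes an input stream and yields an output stream, so data flows item by item rather than being materialised in full between stages. The activity names below are illustrative stand-ins, not the real OGSA-DAI API.

```python
# Toy pipeline in the spirit of a streaming enactment model:
# each "activity" is a generator wired to the next one's input.

def sql_query(rows):
    """Stand-in for a query activity: emits result rows one at a time."""
    for row in rows:
        yield row

def project(stream, column):
    """Stand-in for a projection activity: keeps one column per row."""
    for row in stream:
        yield row[column]

def to_csv(stream):
    """Stand-in for a delivery activity: drains the stream."""
    return ",".join(str(v) for v in stream)

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
result = to_csv(project(sql_query(rows), "name"))
print(result)  # a,b
```

Because each stage pulls from its predecessor on demand, the only state held between activities is the small buffer of items in flight, which is the property a more optimised enactment engine exploits.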


OGSA-WebDB: An SQL-based Web Database Data Integration Tool for the Grid


Said Mirza Pahlevi (AIST) slides

Many databases are available online, although they can generally only be accessed via HTTP requests and queried using keywords or Boolean conditions. OGSA-WebDB was proposed to allow grid applications to access and integrate data from these kinds of databases (called Web databases) using the standard database query language, SQL. The presentation will describe the basic architecture of OGSA-WebDB and major extensions that have been made recently: one deals with the general characteristics of Web databases, and the other facilitates Web database management and integration. In addition, two recently developed OGSA-WebDB applications, Grid Search for Cancer Prevention (GSCP) and Grid Search for Estimation of Drug Side Effects by SNPs (GESE), will be presented. This work is a joint project of the National Institute of Advanced Industrial Science and Technology (AIST), the Division of Clinical and BioInformatic Engineering (CBIE) of the University of Tokyo, and Business Search Technologies Corporation (BST). Finally, some new functions and extensions that will be included in the next release are described.
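The core idea of bridging SQL to keyword/Boolean Web interfaces can be illustrated with a minimal rewrite of a restricted WHERE clause into a keyword query string. The grammar handled here (AND/OR of column='value' terms) is an illustrative assumption, not the translation OGSA-WebDB actually implements.

```python
import re

def sql_where_to_keywords(where_clause):
    """Rewrite  col = 'value'  terms as bare keywords while keeping
    the Boolean connectives, e.g.
      "title = 'cancer' AND author = 'smith'"  ->  "cancer AND smith"
    """
    term = re.compile(r"\w+\s*=\s*'([^']*)'")
    return term.sub(lambda m: m.group(1), where_clause)

q = sql_where_to_keywords("title = 'cancer' AND author = 'smith'")
print(q)  # cancer AND smith
```

The resulting string is the kind of query a Web database's HTTP interface expects; the real system must additionally post-filter results, since keyword matching is weaker than SQL's exact predicates.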


OGSA-DAI RDF: Extending OGSA-DAI to support Semantic Web/Grid applications


Isao Kojima (AIST) slides

OGSA-DAI RDF is a set of RDF processing activities that adds RDF data resource types to the OGSA-DAI database middleware. The software provides a range of functionality, including: 1) a SPARQL query interface supporting the W3C SPARQL query language; 2) a reasoning activity supporting reasoning functions; 3) an ontology-handling activity based on the preliminary WS-DAI-Ont specification; and 4) graph and triple operations providing basic operations on graphs and triples. Using OGSA-DAI RDF, it is possible to support a variety of Semantic Web applications that use RDF-based data formats in an OGSA-based Grid environment. Example applications include semantic resource discovery for grid resources and SPARQL-based matchmaking. Moreover, Semantic Web applications with complex data processing can be constructed easily, since these RDF processing activities can be combined with other OGSA-DAI data processing activities, such as network data transfer, data conversion and data transformation. The software is implemented to satisfy several related W3C standards, such as the SPARQL query results XML format. A preliminary version of OGSA-DAI RDF was released last year; it has since been totally re-designed based on feedback from that limited usage. The new version is enhanced with the following functions. 1) Reflection of the ongoing OGF WS-DAI RDF(S) discussion: the definition of a resource has been changed so that an RDF data resource is a set of graphs, and several graph operations have been introduced. 2) A common Java interface designed to achieve multi-product support: based on this specification, we provide implementations for database products such as Jena2 and Sesame. 3) All RDF processing activities are designed to support data passing by URI as well as stream-based data passing, which makes it possible to avoid transferring large amounts of data. Product-specific data representations, such as the Jena model, are supported in addition to the standard RDF formats in order to enhance performance. 4) A graphical user interface is provided, including our activity editor, originally developed for our OGSA-WebDB product, with which users can construct RDF-processing OGSA-DAI workflows in a graphical environment. The new version of OGSA-DAI RDF currently supports Jena and Sesame and will be released by March.
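The triple and graph operations the abstract mentions can be sketched with a tiny in-memory triple store. The store, the example triples, and the pattern syntax below are illustrative assumptions in plain Python, not Jena, Sesame, or the OGSA-DAI RDF API.

```python
# A graph as a set of (subject, predicate, object) triples.
graph = {
    ("ex:service1", "rdf:type", "ex:ComputeService"),
    ("ex:service1", "ex:cpuCount", "8"),
    ("ex:service2", "rdf:type", "ex:StorageService"),
}

def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a variable,
    much like one basic graph pattern in a SPARQL query."""
    return {t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

# Roughly: SELECT ?s WHERE { ?s rdf:type ex:ComputeService }
compute = {s for s, _, _ in match(graph, p="rdf:type", o="ex:ComputeService")}
print(compute)  # {'ex:service1'}
```

The resource-discovery example in the abstract amounts to running patterns like this over RDF descriptions of grid resources and matchmaking on the bindings.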


Transaction-based Grid data replication using OGSA-DAI


Yin Chen (NeSC) slides

Data replication is one of the most useful strategies for achieving high availability and fault tolerance, as well as minimal access times, in data grids, and it is commonly demanded by many data grid applications. However, most existing Grid replication systems, such as the EDG replica manager, the Globus data replication service, and SRB, deal only with files, through the use of caching methods. In contrast, relational database vendors provide replication tools that offer flexible operations. Nevertheless, the capabilities of these products are insufficient to address grid issues; they lack scalability and cannot cope with the heterogeneous nature of the Grid. We will present a solution for database replication in a data-intensive, large-scale distributed networking environment. In this approach, a powerful grid data-delivery tool transfers the initial large block of data. Thereafter, a high-level API allows a variety of relational database replication mechanisms to synchronize any changes. We extend OGSA-DAI by adding activities to control the replication data flow between heterogeneous data resources. In this way, we are able to support transaction-based synchronization of large data sets that maintains consistency in a grid environment.
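The two-stage approach described above, bulk transfer first, then incremental synchronization, can be sketched as follows. The change-log format and apply logic are illustrative assumptions, not the actual OGSA-DAI activities.

```python
def bulk_copy(source):
    """Stage 1: transfer the initial large block of data
    (here, a whole table modelled as a dict)."""
    return dict(source)

def apply_changes(replica, change_log):
    """Stage 2: synchronise subsequent changes in commit order,
    so the replica converges to the source state."""
    for op, key, value in change_log:
        if op == "upsert":
            replica[key] = value
        elif op == "delete":
            replica.pop(key, None)
    return replica

source = {"row1": "a", "row2": "b"}
replica = bulk_copy(source)
log = [("upsert", "row3", "c"), ("delete", "row2", None)]
apply_changes(replica, log)
print(replica)  # {'row1': 'a', 'row3': 'c'}
```

Applying logged operations in commit order is what makes the synchronization transaction-based: the replica only ever reflects prefixes of the source's committed history.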


Replica synchronization based on OGSA-DAI services


Marek Ciglan, Ladislav Hluchy (SAS) slides

Grid technology is gaining popularity among wider scientific communities and is reaching beyond its high-performance computing origins. Data management for grid environments, in particular, is an important topic for many research communities. For historical reasons, the data management tools for grid computing were primarily designed to manage read-only data sets. This simplification encouraged the adoption of the data replication concept as a way to increase data availability and provide fault tolerance. Tools that allow both replication and updating of data could increase cooperation and ease data sharing among researchers. In this paper, we present our work on services for replica update propagation and consistency handling, which addresses the absence of software tools for maintaining consistency among distributed data resources in a grid environment. We describe the architecture of the services as well as several important implementation details. Our Replica Update proPAGATION service (RUPAGATION) uses the OGSA-DAI framework as its basic operational environment. It is built as a set of modules that can be plugged into OGSA-DAI data services. The aim of RUPAGATION is to form the foundation for higher-level grid consistency services, which will provide automatic and autonomous synchronization of replicated data resources in the distributed, heterogeneous grid environment. The update propagation service provides common base functionality for different possible consistency services and approaches. It is able to catch and store updates submitted to specified data resources, propagate those updates to other replica sites, and apply updates to distinct data replicas. RUPAGATION can provide its functionality for all types of resources supported by OGSA-DAI. 
This means that if we have an implementation of a consistency model built on top of the RUPAGATION service, we can use the same implementation to ensure consistency for a large number of relational databases, XML databases and file resources. This virtualization aspect is provided by the OGSA-DAI framework. RUPAGATION has been decoupled into several modules with clearly defined functionality, which can be deployed separately according to system requirements and needs. From a technical point of view, the modules are sets of OGSA-DAI activities. We used the functionality of the RUPAGATION service to implement a consistency service for the primary copy update model (PCUM). In the primary copy update model, one replica of a data source is designated as the primary copy. For each non-primary replica, only read operations are allowed from users or applications; update operations are applied only to the primary replica. Changes performed on the primary replica are then propagated to and applied on the non-primary replicas. We also discuss error handling in case of resource, service or network failures. The proposed services, the PCUM consistency service and the RUPAGATION service, are beneficial not only to applications that operate over replicated, updatable grid data but also to legacy grid middleware services, especially central grid services. The failure of a central grid service can have a significant impact on the operation of the whole grid system. Metadata services, services for replica location, and security services for coarse-grained access control policy definition (such as CAS) are typically implemented as central services in today's grid middleware. Distributing those services with replicated and synchronized data can bring important improvements in the fault tolerance of the grid. 
To demonstrate this functionality, we have integrated the proposed replica synchronization services with the Metadata Catalog Service (MCS), a legacy central grid service for metadata management, enriching it with the ability to run multiple instances of MCS at distinct sites operating over replicated and synchronized data sets.
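The primary copy update model described above can be sketched in a few lines: writes are accepted only at the primary, captured as a log, and then propagated to read-only replicas. The class and method names are illustrative, not the RUPAGATION API.

```python
class Replica:
    """A minimal PCUM participant: one primary, many read-only copies."""

    def __init__(self, primary=False):
        self.primary = primary
        self.data = {}
        self.log = []   # captured updates awaiting propagation

    def write(self, key, value):
        """Only the primary copy accepts update operations."""
        if not self.primary:
            raise PermissionError("writes allowed only on the primary copy")
        self.data[key] = value
        self.log.append((key, value))

    def propagate_to(self, others):
        """Ship captured updates to non-primary replicas and apply them."""
        for key, value in self.log:
            for r in others:
                r.data[key] = value
        self.log.clear()

primary = Replica(primary=True)
secondary = Replica()
primary.write("k", "v")
primary.propagate_to([secondary])
print(secondary.data)  # {'k': 'v'}
```

Rejecting writes at non-primary replicas is what makes consistency tractable in this model: there is a single serialization point, and propagation only ever flows outward from it.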


Grid Authentication and Authorization with Reliably Distributed Services (GAARDS)


Stephen Langella (OSU) slides

In this presentation we will present GAARDS, the Grid Authentication and Authorization with Reliably Distributed Services framework. GAARDS provides services and tools for the administration and enforcement of security policy in an enterprise Grid. GAARDS was developed on top of the Globus Toolkit and extends the Grid Security Infrastructure (GSI) to provide enterprise services and administrative tools for: 1) grid user management, 2) identity federation, 3) trust management, 4) group/VO management, 5) access control policy management and enforcement, and 6) integration between existing security domains and the grid security domain. GAARDS services can be used individually or grouped together to meet the authentication and authorization needs of Grids. Some of the core services provided by GAARDS are: Dorian - a grid service for the provisioning and management of grid user accounts. Dorian provides an integration point between external security domains and the grid, allowing accounts managed in external domains to be federated and managed in the grid, so that users can use their existing credentials (external to the grid) to authenticate to the grid. Grid Trust Service (GTS) - a grid-wide mechanism for maintaining and provisioning a federated trust fabric of trusted certificate authorities, such that grid services may make authentication decisions against the most up-to-date information. Grid Grouper - provides a group-based authorization solution for the Grid, wherein grid services and applications enforce authorization policy based on membership of groups defined and managed at the grid level. Authentication Service - provides a framework for issuing SAML assertions for existing credential providers, so that they may be easily integrated with Dorian and other grid credential providers; it also provides a uniform authentication interface on which applications can be built. 
Common Security Module (CSM) - provides a centralized approach to managing and enforcing access control policy.
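The group-based authorization pattern used by Grid Grouper can be sketched as follows: a service consults grid-level group memberships before permitting an operation. The group names, membership store, and function names are illustrative assumptions, not the Grid Grouper API.

```python
# Grid-level group memberships (illustrative data).
groups = {
    "cn=radiologists": {"alice", "bob"},
    "cn=admins": {"carol"},
}

def is_member(user, group):
    """Membership lookup, standing in for a call to the group service."""
    return user in groups.get(group, set())

def authorize(user, required_group):
    """Enforce a policy of the form 'caller must belong to group G'."""
    if not is_member(user, required_group):
        raise PermissionError(f"{user} is not in {required_group}")
    return True

print(authorize("alice", "cn=radiologists"))  # True
```

The point of managing groups at the grid level is that every service enforcing this check sees the same memberships, so revoking a user in one place revokes them everywhere.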


Introduce: Grid Service Authoring Toolkit


Shannon Hastings (OSU) slides

A number of tools and middleware systems have been developed to support application development using Grid Services frameworks. Most of these efforts, however, have focused on low-level support for the management and execution of Grid services, management of Grid-enabled resources, and deployment and execution of applications that make use of Grid services. Simple-to-use service development tools, which allow a Grid service developer to leverage Grid technologies without needing to know low-level details, are becoming increasingly important. Moreover, support for the development of strongly-typed services, in which the data types consumed and produced by a service are well-defined and published in the Grid, is necessary to enable syntactic interoperability, so that two Grid endpoints can interact with each other programmatically and correctly. Introduce is an open-source, extensible toolkit to support easy development and deployment of WS/WSRF-compliant Grid services. Introduce aims to reduce the service development and deployment effort by hiding low-level details of the Globus Toolkit and to enable the implementation of strongly-typed Grid services. We expect that enabling strongly-typed grid services, while lowering the barrier to entry to the Grid via toolkits like Introduce, will have a major impact on the success of the Grid and its wider adoption as a viable technology of choice, not only in the commercial sector, but also in areas such as academic, medical, and government research.


Distributed Middleware Infrastructure for Imaging Applications


Tony Pan (OSU) slides

Image data management and image analysis present unique challenges for distributed computing. In this session we will present our current work on the in vivo Imaging Middleware (IVIM), the Out of Core Virtual Microscope (OCVM) toolkit, and GPU-based computation, within the context of distributed image and algorithm access and management and large-scale image analysis. IVIM is based on Globus and caBIG's caGrid middleware and leverages the grid service tools and security capabilities provided by these libraries. IVIM provides a set of libraries, tools, and services to facilitate the development and deployment of grid services for imaging applications, particularly those in the area of radiology. These include a library providing an interoperability layer between grid services and Digital Imaging and Communications in Medicine (DICOM) messaging, secure high-performance data transport, and facilities for user-friendly imaging service creation and deployment. An application for distributed clinical review of radiology cases, called gridIMAGE, will be discussed. OCVM is a distributed application based on DataCutter. It allows processing of sets of partitioned images on a cluster using arbitrary MATLAB algorithms, and facilitates rapid algorithm prototyping and large-scale image processing. A related effort involves using GPUs for high-performance, stream-based image processing. We will present an application where OCVM and GPU image processing are used for high-throughput analysis of microscopy images for computer-aided quantification of neuroblastoma.


The @neurIST Grid Infrastructure for Biomedical Data and Compute Services


Siegfried Benkner (UNIVIE)

The European Union's @neurIST project is developing a Grid-based IT infrastructure for the management of all processes linked to research, diagnosis and treatment development for complex and multi-factorial diseases, encompassing data repositories, computational analysis services and information systems handling multi-scale, multi-modal information at distributed sites. Although the focus of @neurIST is on one such disease, cerebral aneurysm and subarachnoid haemorrhage, the core technologies will be generic and transferable to other areas. The project bases its developments on a service-oriented Grid architecture which facilitates access to and integration of distributed resources by providing support for access control, advanced security, and quality-of-service guarantees. This talk provides an overview of the @neurIST Grid middleware and outlines the infrastructure offered for the provision of advanced compute and data services to support computationally demanding modeling and simulation tasks and to access heterogeneous distributed data sources through semantic integration.


Monitoring Support for Advanced Database Integration on the Grid


Alexander Woehrer, Peter Brezany (UNIVIE) slides

The trend in Grid computing towards more data-intensive applications, accessing ever more relational databases and requiring advanced data integration, continues. Metadata, e.g. for selecting suitable candidates among similar sources and excluding those not required, plays a vital role in efficient data integration on the Grid. The process of collecting information about the current (and sometimes the past) status of Grid resources is known as monitoring. However, there is a lack of service-oriented monitoring tools providing this metadata for relational data sources. We address this gap, and make relational databases first-class citizens in Grid computing, by providing a service-oriented monitoring tool tailored to them. To the best of our knowledge, no research effort has been reported on this so far. This paper presents novel usage scenarios needing additional metadata and monitoring information about relational data sources on the Grid - in the areas of query optimization, adaptive query processing and data integration management - and the requirements to be fulfilled by a monitoring service providing such information. Our approach supports coarse- and fine-grained information about heterogeneous relational databases via a uniform interface and provides a homogeneous view of the available metadata. We have evaluated our approach by implementing a research prototype based on the Web Services Resource Framework implementation of the current Globus Toolkit. The functionality and performance of the prototype are demonstrated for commonly used relational databases such as MySQL, PostgreSQL and Oracle.
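The uniform-interface idea above can be sketched with an adapter per database product feeding one homogeneous metadata view. The adapter interface, metadata keys, and sample values are illustrative assumptions, not the prototype's actual schema.

```python
class DatabaseMonitor:
    """Adapter base class: one subclass per DBMS wraps its native
    catalog/statistics interfaces."""
    def coarse(self):   # e.g. availability, version, total size
        raise NotImplementedError
    def fine(self):     # e.g. per-table row counts, index statistics
        raise NotImplementedError

class MySQLMonitor(DatabaseMonitor):
    """Illustrative adapter with hard-coded sample values."""
    def coarse(self):
        return {"dbms": "MySQL", "up": True, "size_mb": 512}
    def fine(self):
        return {"tables": {"genes": {"rows": 10000}}}

def homogeneous_view(monitors):
    """Merge per-source metadata into one uniform structure, so a
    query optimiser or integration manager sees every source alike."""
    return {m.coarse()["dbms"]: {"coarse": m.coarse(), "fine": m.fine()}
            for m in monitors}

view = homogeneous_view([MySQLMonitor()])
print(sorted(view))  # ['MySQL']
```

A consumer such as an adaptive query processor would poll `homogeneous_view` periodically and re-plan when, say, a source goes down or its row counts change substantially.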


Workflow Enactment Engine and OGSA-DAI


Ivan Janciak, Peter Brezany (UNIVIE) slides

Data-intensive e-Science applications pose challenging research tasks for distributed systems based on service-oriented architectures. Many data-intensive tasks are not implemented as monolithic code encapsulated in a single application, but rather as a composition of various services organized into a workflow. The management of such workflows is a key component of modern systems, in which applications can be viewed as complex workflows that orchestrate services so that they cooperate to achieve the desired behavior of the whole system. Such service compositions can be described in many different languages. The Business Process Execution Language for Web Services (WS-BPEL), proposed by IBM and Microsoft, has emerged as a leading workflow language for composing business Web services and is also being widely accepted by the scientific community. In this presentation we will present our recent work conducted within the Workflow Enactment Engine Project (WEEP), which was established as a spin-off of the GridMiner project. The project aims to build an easy-to-use and easy-to-manage workflow enactment engine for Grid and Web service orchestration, able to fulfil the requirements of highly dynamic and interactive workflows that can be fully controlled by the user. The engine is primarily oriented towards data-intensive Grid services, which creates a strong need for delivering and managing large volumes of data in scientific workflows.


Management of Grid Data Sets using the Data Space Paradigm


Ibrahim Elsayed, Adnan Muslimovic, Peter Brezany (UNIVIE) slides

Database management systems providing powerful data management services have been developed to a great extent over recent decades. Each of these systems is based on its own single data model, whereas the demand for managing multiple data sources with different data models is rapidly expanding. Modern collaborations in science are very often based on large-scale linking of databases that were not expected to be used together when they were originally developed. Within the distributed database community, database integration approaches have traditionally focused on structural heterogeneity. However, in many scientific applications there is additionally a strong demand to solve problems of semantic heterogeneity. Therefore the need for intelligent management systems providing access to these heterogeneous and often distributed data sources, and allowing them to be searched, queried and shared as a single information source, has never been greater. This research challenge is addressed by the vision of Data Spaces, recently introduced by M. Franklin et al. The idea is to raise the level of abstraction at which data is managed today, in order to provide a system that manages different data sources, each with its own data model. The Data Space concepts were presented in a visionary way; their implementation in real application environments, however, opens new research challenges, especially in distributed, dynamic environments like grids. So far, no effort has been devoted to realizing the Data Space paradigm on the Grid. We have developed GriDateX, a Grid-based Data Space Management System, which we understand as a set of grid services and software components that manage a large number of OGSA-DAI data resources and control the organization, storage and retrieval of data in a Data Space.


Runtime Provenance Collection for Dynamic Information Integration and Mining


Yogesh L. Simmhan, Beth Plale, Dennis Gannon (IU) slides

Workflows have evolved as the paradigm for conducting in silico experiments in scientific collaboratories, building on a service-oriented architecture that provides transparent access to Grid resources. Data-driven applications, as seen in mesoscale meteorology, are a major class of scientific applications that involve thousands of dynamic data products that are processed, transformed, fused, and reused in complex dataflows that adapt to realtime conditions. In such models, the need to track the execution of the workflow and the associated data products is paramount. Provenance is a form of metadata that describes the derivation history of data products and the execution trace of workflows. Provenance allows scientists to verify the execution of their experiment, provides a context in which to interpret workflow results, attributes data sources for copyright, and enables analysis of archived provenance for data and service quality predictions. In a dynamic workflow environment, provenance acts as an information integrator, tying workflow runs to the services used within them, service instances to their invocations from different workflows, data products to the services that produce and consume them, and the services and data to the virtualized grid resources they use. The individual metadata about these pieces may themselves be located in external metadata repositories, but provenance acts as the glue linking them together. The dynamic nature of the system, driven by external events, along with the adaptation of workflows to realtime resources, means that much of the provenance information has to be monitored at runtime and cannot be statically determined. The Karma provenance framework, developed for the Linked Environments for Atmospheric Discovery (LEAD) meteorology and education project, is a standalone service that uses the notion of activities published from instrumented services to track workflow orchestration and data derivation. 
These activities, published through a publish-subscribe system, describe stages in the lifecycle of a workflow and service run and their correlation with the data flow, and can later be stitched together to build the complete provenance graph for the workflow and dataflow. Karma disseminates different views of provenance, such as data provenance, the data usage trail, and the workflow trace, and allows incremental queries to explore the provenance space. Provenance provides an archive of information that can be mined and analyzed to provide insight into execution patterns and resource usage, which can help in searching for data and in allocating resources. Beyond visually describing workflow execution, we are investigating the use of provenance to predict the quality of derived data products and to assist in workflow composition through reasoning systems. Also of interest are the role of provenance in the long-term preservation of scientific data, resource management, and ways to automate the instrumentation of services to reduce the overhead for service providers.
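The stitching step can be sketched as follows: services publish lifecycle notifications naming the data products they produce or consume, and a listener links producers to consumers through shared products. The notification schema below is an illustrative assumption, not the Karma format.

```python
# Published activities: (service, event, data product).
activities = [
    ("ingest",   "produced", "radar.raw"),
    ("clean",    "consumed", "radar.raw"),
    ("clean",    "produced", "radar.qc"),
    ("forecast", "consumed", "radar.qc"),
]

def stitch(activities):
    """Build provenance edges: producer-service -> data -> consumer-service."""
    producers = {d: s for s, e, d in activities if e == "produced"}
    return [(producers[d], s, d)
            for s, e, d in activities
            if e == "consumed" and d in producers]

for upstream, downstream, data in stitch(activities):
    print(f"{upstream} --[{data}]--> {downstream}")
```

Walking these edges backwards from a product gives its data provenance; walking forwards gives its usage trail, which are two of the provenance views mentioned above.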


caGrid Middleware Overview


Joel Saltz (OSU) slides


Workflows Services in caGrid 1.0


Ravi Maduri (Argonne) slides