IC2E 2015 Accepted Full Papers with Abstracts

Abstract: Infrastructure as a Service (IaaS) generally provides a standard vanilla server that contains an OS and basic functions, and each user has to manually install the required applications for a proper server deployment. We are working on a composite application deployment approach to automatically install selected applications in a flexible manner, based on a set of application installation scripts that are invoked on the vanilla server. Some applications have installation dependencies involving multiple servers. Previous research projects on installing applications with multi-server dependencies have deployed the servers sequentially. This means the total deployment time grows linearly with the number of servers.

Our automated parallel approach makes the composite application deployment run in parallel when there are installation dependencies across multiple servers. We implemented a prototype system on Chef, a widely used automatic server installation framework, and evaluated the performance of our composite application deployment on a SoftLayer public cloud using two composite application server cases. The deployment times were reduced by roughly 40% in our trials.
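The difference between sequential and dependency-aware parallel deployment can be illustrated with a minimal sketch. It is not the authors' Chef-based prototype: the step names, durations, and dependency map are hypothetical, and it assumes unlimited parallelism and that every step named in `deps` also appears in `durations`.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def sequential_time(durations):
    """Total time when every install step runs one after another."""
    return sum(durations.values())


def parallel_time(durations, deps):
    """Finish time when each step starts as soon as its dependencies finish.

    `deps` maps a step to the set of steps it must wait for.
    """
    graph = {step: deps.get(step, set()) for step in durations}
    finish = {}
    # static_order() yields every step after all of its dependencies.
    for step in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in graph[step]), default=0)
        finish[step] = start + durations[step]
    return max(finish.values(), default=0)


# Hypothetical install steps (minutes) spread over several servers.
durations = {"db": 10, "app": 8, "web": 5, "monitor": 6}
deps = {"app": {"db"}, "web": {"app"}}  # "monitor" depends on nothing

print(sequential_time(durations))      # 29
print(parallel_time(durations, deps))  # 23
```

In this toy case only the independent step overlaps with the dependency chain, so the gain is bounded by the longest chain of dependent installs.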

Abstract: Graph stores are becoming increasingly popular among NoSQL applications seeking flexibility and heterogeneity in managing linked data. Conceptually and in practice, applications ranging from social networks and knowledge representation to the Internet of Things benefit from graph data stores built on a combination of relational and non-relational technologies aimed at desired performance characteristics. The most common data access pattern in querying graph stores is to traverse from a node to its neighboring nodes. This paper studies the impact of this traversal pattern on common data caching policies in a partitioned data environment where a big graph is distributed across servers in a cluster. We propose and evaluate a new graph-aware caching policy designed to keep and evict nodes, edges, and their metadata in a way optimized for the query traversal pattern. The algorithm takes into account the topology of the graph as well as the latency of access to graph nodes and their neighbors. We implemented graph-aware caching on Apache HBase, a distributed data store in the Hadoop family. Performance evaluations showed up to 15x speedup on the benchmark datasets with our new graph-aware policy compared with non-aware policies. We also show how to improve the performance of existing caching algorithms for distributed graphs by exploiting the topology information.

Abstract: Cloud providers must detect malicious traffic in and out of their network, virtual or otherwise. The use of Intrusion Detection Systems (IDS) has been hampered by the encryption of network communication. The result is that current signatures cannot match potentially malicious requests.

A method to acquire the encryption keys is Virtual Machine Introspection (VMI). VMI is a novel technique to view the internal, and yet raw, representation of a Virtual Machine (VM). Current methods to find keys are expensive and use sliding windows or entropy. This inevitably requires reading the memory space of the entire process, or worse the OS, in a live environment where performance is paramount. This paper describes a structured walk of memory to find keys, particularly RSA keys, using as few reads from the VM as possible. In doing this we create a scalable mechanism to populate an IDS with keys to analyse traffic.
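To make the cost of the existing approach concrete, the following is a minimal sketch of a sliding-window entropy scan, the kind of baseline the abstract describes as expensive. It is not the paper's structured walk; the window size, step, and threshold are arbitrary assumptions, and the "memory" is a synthetic byte string rather than a real VM dump.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


def high_entropy_windows(memory: bytes, window: int = 64, step: int = 16,
                         threshold: float = 5.5):
    """Yield offsets of windows whose entropy meets the threshold.

    Every byte of `memory` is read at least once, which is the cost a
    structured walk tries to avoid.
    """
    for offset in range(0, max(len(memory) - window + 1, 0), step):
        if shannon_entropy(memory[offset:offset + window]) >= threshold:
            yield offset


# Synthetic memory: zero padding around 64 distinct bytes (entropy 6.0).
memory = bytes(256) + bytes(range(64)) + bytes(256)
print(list(high_entropy_windows(memory)))  # [256]
```

Because key material tends to look random, high-entropy regions are candidate key locations, but the scan touches the whole address range to find them.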

Abstract: Frequent Itemset Mining (FIM) is a classic data
mining topic with many real world applications such as market
basket analysis. Many algorithms including Apriori, FP-Growth,
and Eclat were proposed in the FIM field. As the dataset size
grows, researchers have proposed MapReduce versions of FIM
algorithms to meet the big data challenge. This paper proposes
new improvements to the MapReduce implementation of FIM
algorithms by introducing a cache layer and a selective online
analyzer. We have evaluated the effectiveness and efficiency of
SmartCache via extensive experiments on four public datasets.
SmartCache can reduce on average 45.4%, and up to 97.0% of the
total execution time compared with the state-of-the-art solution.

Abstract: The abundance of compute and storage resources
available in the cloud makes it well-suited to addressing the
limitations of mobile devices. We explore the use of cloud
infrastructure to optimize content-centric mobile applications,
which can have high communication and storage requirements,
based on the analysis of user activity. We present two specific
optimizations, precaching and prefetching, as well as the design
and implementation of a middleware framework that allows mobile
application developers to easily utilize these techniques. Our
framework is fully generalizable to any content-centric mobile
application, a large and growing class of Internet applications. A
news aggregation application is used as a case study to evaluate
our implementation. We make use of a cosine similarity scheme
to identify users with similar interests, which in turn is used
to determine what content to prefetch. Various cache algorithms,
implemented for our framework, are also considered. A workload
trace and simulation are used to measure the performance of the
application and framework. We observe a dramatic improvement
in application performance due to use of our framework with a
reasonable amount of overhead. Our system also significantly
outperforms a baseline implementation that performs the same
optimizations without taking user activity into account.
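The idea of using cosine similarity between users to decide what to prefetch can be sketched as follows. This is an illustration under assumptions, not the authors' middleware: the per-topic activity counts, the user names, the article identifiers, and the `prefetch_candidates` helper are all hypothetical.

```python
import math


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def prefetch_candidates(user, activity, viewed, top_k=1):
    """Items viewed by the top_k most similar users that `user` has not seen."""
    neighbours = sorted(
        (u for u in activity if u != user),
        key=lambda u: cosine_similarity(activity[user], activity[u]),
        reverse=True,
    )[:top_k]
    seen = set(viewed[user])
    candidates = []
    for other in neighbours:
        for item in viewed[other]:
            if item not in seen and item not in candidates:
                candidates.append(item)
    return candidates


# Hypothetical per-topic read counts and viewed article ids.
activity = {
    "alice": {"sports": 5, "tech": 1},
    "bob": {"sports": 4, "tech": 1},
    "carol": {"politics": 3},
}
viewed = {"alice": ["a1"], "bob": ["a1", "a2"], "carol": ["a3"]}

print(prefetch_candidates("alice", activity, viewed))  # ['a2']
```

Here the user with the closest interest profile supplies the prefetch list, while a user with no overlapping interests contributes nothing.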

Abstract: Today, cloud computing engines such as stream-processing Storm and batch-processing Hadoop are being increasingly run atop software-defined networks (SDNs). In these cloud stacks, the scheduler of the application engine (which allocates tasks to servers) remains decoupled from the SDN scheduler (which allocates network routes). We propose a new approach that performs cross-layer scheduling between the application layer and the networking layer. This coordinated scheduling orchestrates the placement of application tasks (e.g., Hadoop maps and reduces, or Storm bolts) in tandem with the selection of network routes that arise from these tasks. We present results from both cluster deployment and simulation, using two representative network topologies: Fat-tree and Jellyfish. Our results show that cross-layer scheduling can improve the throughput of Hadoop and Storm by between 22% and 34% in a 30-host cluster, and it scales well.

Abstract: Network topology and routing are two important factors in determining the communication costs of big data applications at large scale. For a given Cluster, Cloud, or Grid (CCG) system, the network topology is fixed, and static or dynamic routing protocols are preinstalled to direct the network traffic. Once the system is deployed, users cannot change them. Hence, it is hard for application developers to identify the optimal network topology and routing algorithm for their applications with distinct communication patterns. In this study, we design a CCG virtual system (CCGVS) using container-based virtualization to allow users to create a farm of lightweight virtual machines on a single host. Moreover, we use software-defined networking (SDN) techniques to control the network traffic among these virtual machines. Users can change the network topology and control the network traffic programmatically, thereby enabling application developers to evaluate their applications on the same system with different network topologies and routing algorithms. The preliminary experimental results have shown that CCGVS can represent application performance variations caused by network topology and routing algorithm. Case studies through both synthetic big data programs and NPB benchmarks have indicated that this virtual system enables researchers to identify the optimal network topology and routing algorithm for their applications.

Abstract: Big data processing is one of the killer applications for cloud systems. MapReduce systems such as Hadoop are the most popular big data processing platforms used in cloud systems. Data corruption is one of the most critical problems in cloud data processing: it not only has a serious impact on the integrity of individual application results but also affects the performance and availability of the whole data processing system. In this paper, we present a comprehensive study of 139 real-world data corruption incidents reported in Hadoop/HDFS bug repositories. We characterize those data corruption problems in four aspects: 1) what impacts can the data corruption have on the application and system? 2) how are those data corruptions detected? 3) what are the causes of the data corruption problems? and 4) are the current data corruption handling mechanisms effective?
Our study reveals the following major findings: 1) 89% of the data corruption incidents we examined are caused by software bugs, which indicates the importance of fighting those data corruption bugs; 2) existing data corruption detection schemes are quite insufficient: only 24% of data corruption problems are correctly reported, 42% are silent data corruption without any error message, and 22% receive imprecise error reports. We also found the detection system raised 12% false alarms; and 3) existing data corruption handling mechanisms (i.e., data replication, replica deletion, simple re-execution) make frequent mistakes including replicating corrupted data blocks, deleting healthy data blocks, or causing undesirable resource hogging.

Abstract: Similar to memory or disk fragmentation in personal computers, emerging "virtual desktop cloud" (VDC) services experience the problem of data center resource fragmentation, which occurs due to on-the-fly provisioning of virtual desktop (VD) resources. Irregular resource holes due to fragmentation lead to sub-optimal VD resource allocations, and cause: (a) decreased user quality of experience (QoE), and (b) increased operational costs for VDC service providers. In this paper, we address this problem by developing a novel, optimal "Market-Driven Provisioning and Placement" (MDPP) scheme that is based upon distributed optimization principles. The MDPP scheme channels the inherent distributed nature of the resource allocation problem by capturing VD resource bids via a virtual market to explore soft spots in the problem space, and consequently defragments a VDC through cost-aware utility-maximal VD re-allocations or migrations. Through extensive simulations of VD request allocations to multiple data centers for diverse VD application and user QoE profiles, we demonstrate that our MDPP scheme outperforms existing schemes that are largely based on centralized optimization principles. Moreover, the MDPP scheme can achieve high VDC performance and scalability, measurable in terms of a 'Net Utility' metric, even when VD resource location constraints are imposed to meet orthogonal security objectives.

Abstract: In recent years, researchers have contributed promising new techniques for allocating cloud resources in more robust, efficient, and ecologically sustainable ways. Unfortunately, the widespread use of these techniques in production systems has, to date, remained elusive. One reason for this is that the state of the art for investigating these innovations at scale often relies solely on model-driven simulation. Production-grade cloud software, however, demands certainty and precision for development and business planning that only comes from validating simulation against empirical observation.

In this work, we take an alternative approach to facilitating cloud research and engineering in order to transition innovations to production deployment faster. In particular, we present a new methodology that complements existing model-driven simulation with platform-specific and statistically trustworthy results. We simulate systems at scales and on time frames that are testable, and then, based on the statistical validation of these simulations, investigate scenarios beyond those feasibly observable in practice. We demonstrate the approach by developing an energy-aware cloud scheduler and evaluating it using production and synthetic traces in faster than real time. Our results show that we can accurately simulate a production IaaS system, ease capacity planning, and expedite the reliable development of its components and extensions.

Abstract: Applications on cloud infrastructures acquire virtual machines (VMs) from providers when necessary. The current interface for acquiring VMs from most providers, however, is too limiting for the tenants, in terms of granularity in which VMs can be acquired (e.g., small, medium, large, etc.), while giving very limited control over their placement. The former leads to VM underutilization, and the latter has performance implications, both translating into higher costs for the tenants. In this work, we leverage nested virtualization and a networking overlay to tackle these problems. We present Kangaroo, an OpenStack-based virtual infrastructure provider, and IPOPsm, a virtual networking switch for communication between nested VMs over different infrastructure VMs. In addition, we design and implement Skippy, the realization of our proposed virtual infrastructure API for programming Kangaroo. Our benchmarks show that through careful mapping of nested VMs to infrastructure VMs, Kangaroo achieves up to an order of magnitude better performance, with only half the cost on Amazon EC2. Further, Kangaroo's unified OpenStack API allows us to migrate an entire application between Amazon EC2 and our local OpenNebula deployment within a few minutes, without any downtime or modification to the application code.

Abstract: There exist a huge number of vertical applications that are developed for isolated computing environments. Due to increasing demand for additional resources, there is a clear need to adapt these applications to distributed environments. However, this is not an easy task and numerous variants are possible. Moreover, in this transition new quality requirements become important, such as application elasticity. Application elasticity has to be built into a software system to enable smooth cost optimization at runtime.

In this paper, we provide a framework for evaluating different transformation variants of vertical Java EE multi-tiered applications into elastic cloud applications. With the support of this framework, the software developer is guided in transforming an application to achieve an optimal elasticity strategy. The framework is evaluated by slicing an existing multi-tiered SaaS Java application used in the Croatian market and evaluating its elasticity.

Abstract: Traditional Searchable Encryption (SE) solutions are only able to handle rather simple search queries, such as single or multi-keyword queries, and are not able to handle substring search queries over encrypted data. In this paper, for the first time, we propose a tree-based Substring Position Searchable Symmetric Encryption (SSP-SSE) to overcome the existing gap. Our solution efficiently finds occurrences of a queried substring over encrypted cloud data. We formally define the leakage functions and security properties of SSP-SSE. Then, we prove that the proposed scheme is secure against chosen-keyword attacks that involve an adaptive adversary. Our analysis demonstrates that SSP-SSE indeed introduces low overhead on computation and storage.

Abstract: Cloud providers face the challenge of efficiently managing their infrastructure through
minimizing resource consumption while allocating requests such that their profit is
maximized. We address this challenge by designing a greedy approximation algorithm
for solving the multi-resource sharing-aware virtual machine maximization (MSAVMM)
problem. The MSAVMM problem requires determining the set of VMs that can be instantiated
on a given server such that the profit derived from hosting the VMs is maximized.
The solution to this problem has to consider the sharing of memory pages among VMs
and the restricted capacities of each type of resource requested by the VMs. We analyze the performance
of the proposed algorithm by determining its approximation ratio and by performing
extensive experiments against other sharing-aware VM allocation algorithms.

Abstract: We initiate the study of the following problem:
Suppose Alice and Bob would like to outsource their encrypted
private data sets to the cloud, and they also want to conduct
the set intersection operation on their plaintext data sets.
The straightforward solution for them is to download their
outsourced ciphertexts, decrypt the ciphertexts locally, and
then execute a commodity two-party set intersection protocol.
Unfortunately, this solution is not practical.

We therefore motivate and introduce the novel notion of
Verifiable Delegated Set Intersection on outsourced encrypted data
(VDSI). The basic idea is to delegate the set intersection operation
to the cloud, while (i) not giving the decryption capability to
the cloud, and (ii) being able to hold the misbehaving cloud
accountable. We formalize security properties of VDSI and
present a construction. In our solution, the computational and
communication costs on the users are linear in the size of the
intersection set, meaning that the efficiency is optimal up to a
constant factor.

Abstract: In this work we present Scalable Attestation, a method which combines
both secure boot and trusted boot technologies, and extends them up
into the host, its programs, and up into the guest's operating system
and workloads, to both detect and prevent integrity attacks. Anchored
in hardware, this integrity appraisal and attestation protects
persistent data (files) from remote attack, even if the attack is root
privileged. As an added benefit of a hardware rooted attestation, we
gain a simple hardware based geolocation attestation. This design is
implemented in multiple cloud test beds based on the QEMU/KVM
hypervisor, OpenStack, and OpenAttestation, and is shown to provide
significant additional integrity protection at negligible cost.

Abstract: The main quest for cloud stakeholders is to find an optimal deployment architecture for cloud applications that maximizes availability, minimizes cost, and addresses portability and scalability. Unfortunately, the lack of a unified definition and adequate modeling language and methodologies that address the cloud domain specific characteristics makes architecting efficient cloud applications a daunting task. This paper introduces StratusML: a technology agnostic integrated modeling framework for cloud applications. StratusML provides an intuitive user interface that allows the cloud stakeholders (i.e., providers, developers, administrators, and financial decision makers) to define their application services, configure them, specify the applications' behaviour at runtime through a set of adaptation rules, and estimate cost under diverse cloud platforms and configurations. Moreover, through a set of model transformation templates, StratusML maintains consistency between the various artifacts of cloud applications. This paper presents StratusML and illustrates its usefulness and practical applicability from different stakeholder perspectives. A demo video, usage scenario and other relevant information can be found at the StratusML webpage.
