High-Efficient Fuzzy Querying with HiveQL for Big Data Warehousing

Abstract

Querying and reporting from large volumes of structured, semi-structured, and unstructured data often requires some flexibility. The flexibility provided by fuzzy sets allows the surrounding world to be categorized in a flexible, human mind-like manner. Apache Hive is a data warehousing framework that works on top of the Hadoop platform for Big Data processing. Hive allows executing queries and aggregating and analyzing data stored in the Hadoop Distributed File System and other repositories. Hive responds to the current need for efficient Big Data warehousing, which is impossible with traditional data warehouses due to their rigid nature. This paper presents the FuzzyHive library, which extends the Hive framework with fuzzy sets-based techniques for querying, analyzing, and reporting on Big Data warehouses. We formalize the fuzzy techniques used while operating on Hive-based data warehouses (including fuzzy filtering on dimensional attributes, projection with fuzzy transformation, fuzzy grouping, and joining). We also show how we embedded these operations in Hive Query Language, which has not been studied so far. Such extensions make Big Data warehousing more flexible and contribute to the portfolio of tools used by the community of people working with fuzzy sets and data analysis. The FuzzyHive library complements the spectrum of available solutions for fuzzy data processing and querying in large data sets. We investigate Hive fuzzy querying performance, effectiveness, and scalability for various data storage formats (text, Avro, and Parquet). Our experiments demonstrate that the proposed extensions introduce more elasticity and are also efficient for Big Data warehousing, making this the first solution of its kind for this environment.

Existing System

• This area of research is not new, but there is still considerable room to improve existing approaches and to create new ones.
• Fuzzy queries have emerged over the last 25 years to soften the two-valued Boolean logic used in relational databases.
• A fuzzy query system is an interface that lets users retrieve information from a database using (quasi-)natural language sentences.
• Many fuzzy query implementations have been proposed, resulting in slightly different languages.

Disadvantages

• A typical application might involve finding similar cases while analyzing a specific problem, using flexible filtering on dimensional attributes in a Big Data warehouse (e.g., cases of similar patients suffering from the same disease).
• Although there are ready solutions that allow fuzzy searching for items, e.g., in relational databases, most of them do not address the challenges and technical problems of Big Data.
• Therefore, they do not support modern, Big Data-oriented data warehousing with mechanisms that would give human-like flexibility in formulating elastic search criteria or abstract membership categories. This paper addresses these problems by extending the Hive framework.

Proposed System

• The performance of SQL and fuzzy queries cannot be compared unambiguously because the two querying concepts differ in nature.
• SQL is faster because it does not require the additional calculation of lower bounds of fuzzy sets, membership degrees, and QCIs for the selected records that its fuzzy counterpart needs.
• On the other hand, a fuzzy query provides more information than a classical one and gives the user more freedom in formulating a selection task.
• When the user has no ambiguities or uncertainties concerning the data, SQL meets all of the user's needs and fuzzy queries are not required.
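The contrast between a crisp SQL-style condition and a fuzzy one can be illustrated with a minimal Python sketch. It is not taken from the FuzzyHive library: the Trapezoid class, the crisp_filter and fuzzy_filter helpers, the "middle-aged" set, and the sample patient records are all hypothetical names and values chosen for illustration, and the trapezoidal shape is only one common way to define a membership function. The sketch shows the extra per-record work (computing a membership degree and comparing it with a threshold) and the extra information a fuzzy query returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trapezoid:
    """Hypothetical trapezoidal fuzzy set: support (a, d), core [b, c]."""
    a: float
    b: float
    c: float
    d: float

    def membership(self, x: float) -> float:
        # Full membership inside the core.
        if self.b <= x <= self.c:
            return 1.0
        # No membership outside the support.
        if x <= self.a or x >= self.d:
            return 0.0
        # Linear rise on the left slope, linear fall on the right slope.
        if x < self.b:
            return (x - self.a) / (self.b - self.a)
        return (self.d - x) / (self.d - self.c)


def crisp_filter(rows, key, low, high):
    """Classical two-valued condition: a row either matches or it does not."""
    return [r for r in rows if low <= r[key] <= high]


def fuzzy_filter(rows, key, fuzzy_set, threshold):
    """Keep rows whose membership degree reaches the threshold,
    and attach that degree to each returned row."""
    result = []
    for r in rows:
        degree = fuzzy_set.membership(r[key])
        if degree >= threshold:
            result.append({**r, "degree": degree})
    return result


# Illustrative data only (assumed, not from the paper's data sets).
patients = [
    {"id": 1, "age": 25},
    {"id": 2, "age": 35},
    {"id": 3, "age": 45},
    {"id": 4, "age": 58},
]
middle_aged = Trapezoid(30, 40, 50, 60)

crisp = crisp_filter(patients, "age", 40, 50)
fuzzy = fuzzy_filter(patients, "age", middle_aged, 0.5)

print([r["id"] for r in crisp])                  # [3]
print([(r["id"], r["degree"]) for r in fuzzy])   # [(2, 0.5), (3, 1.0)]
```

The crisp condition misses the 35-year-old patient entirely, while the fuzzy condition returns that record with a degree of 0.5, at the cost of one membership computation per row.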

Advantages

• To evaluate the effectiveness of the methods, we used the original data from both data sets. To assess time performance, we multiplied the data to several larger sizes by copying the records and modifying numerical values randomly.
• Those solutions focus mainly on the volume characteristic of Big Data, taking into account the extensive capabilities for scaling and partitioning data in NoSQL databases. However, the authors have not provided any performance test results that would prove their solution's efficiency.
• Data structures that facilitate the definition of domain-oriented fuzzy sets, linguistic variables, and categories; results of performance tests for extensive fuzzy searches executed in the Hadoop environment; and investigations of the impact of various parameters on query execution efficiency.
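The data multiplication step described in the first point above could look roughly like the following Python sketch. The source does not say how the numerical values were modified, so the relative jitter, the fixed seed, and the multiply_records function name with its parameters are assumptions made purely for illustration.

```python
import random


def multiply_records(rows, factor, numeric_keys, jitter=0.05, seed=0):
    """Return `factor` times as many rows: the originals plus perturbed copies.

    Assumption: each numeric value in a copy is scaled by a random factor
    in [1 - jitter, 1 + jitter]; the source does not specify the scheme.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    out = [dict(r) for r in rows]  # keep the original records unchanged
    for _ in range(factor - 1):
        for r in rows:
            copy = dict(r)
            for key in numeric_keys:
                copy[key] = r[key] * (1 + rng.uniform(-jitter, jitter))
            out.append(copy)
    return out


# Illustrative records only (assumed, not from the paper's data sets).
base = [
    {"id": 1, "name": "a", "value": 100.0},
    {"id": 2, "name": "b", "value": 200.0},
]
enlarged = multiply_records(base, factor=3, numeric_keys=["value"])
print(len(enlarged))  # 6
```

Non-numeric attributes are copied as they are, so the enlarged set keeps the same categories as the original while its numeric columns vary slightly from copy to copy.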
