2012
AI
This paper introduces a novel approach to Data Cleaning (DC) in Data Warehousing (DW) by actively involving users in the process to enhance Data Quality (DQ). It identifies key challenges in existing DC approaches, particularly the lack of user input, and demonstrates the benefits of interactivity in improving data accuracy and coherence. Experimental results showcase the effectiveness of user involvement in correcting errors and reducing the workload during the Extract, Transform, Load (ETL) process, ultimately leading to better quality data in DW.
In this paper we have discussed the problems of data quality that are addressed during the data cleaning phase. Data cleaning is one of the important processes during ETL. Data cleaning is especially required when integrating heterogeneous data sources, and this problem should be addressed together with schema-related data transformations. At the end we have also discussed current tools that support data cleaning.
Journal on Today's Ideas - Tomorrow's Technologies, 2014
In today's scenario, extraction-transformation-loading (ETL) tools have become important pieces of software responsible for integrating heterogeneous information from several sources. The task of carrying out the ETL process is potentially complex, hard and time-consuming. Organisations nowadays are concerned with vast quantities of data. Data quality is concerned with technical issues in the data warehouse environment, and research in the last few decades has laid increasing stress on data quality issues in the data warehouse ETL process. Data quality can be ensured by cleaning the data prior to loading it into the warehouse. Since the data is collected from various sources, it arrives in various formats; standardizing those formats and cleaning the data are therefore necessities for a clean data warehouse environment. Data quality attributes like accuracy, correctness, consistency and timeliness are required for a knowledge discovery process. The purpose of the present research work is to address data quality issues at all stages of data warehousing, namely 1) data sources, 2) data integration, 3) data staging, and 4) data warehouse modelling and schematic design, and to formulate a descriptive classification of the causes of these issues. The discovered knowledge is used to repair the data deficiencies. This work proposes a framework for quality of extraction, transformation and loading of data into a warehouse.
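To make the standardization point concrete, the following is a minimal sketch, assuming date fields and source formats that are not taken from the paper, of how heterogeneous source formats might be normalized during ETL staging.

# Minimal sketch of format standardization during ETL staging (illustrative only):
# the source date formats are assumptions, not the paper's actual rules.
from datetime import datetime

# Candidate formats that heterogeneous sources might use (assumed for illustration).
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y", "%d %b %Y"]

def standardize_date(raw: str) -> str:
    """Convert a date string from any known source format to ISO 8601."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

# Example: the same date arriving from three different sources.
for raw in ["2014-03-01", "01/03/2014", "1 Mar 2014"]:
    print(standardize_date(raw))   # all print 2014-03-01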
International Journal of …, 2010
Data quality is a critical factor for the success of data warehousing projects. If data is of inadequate quality, then the knowledge workers who query the data warehouse and the decision makers who receive the information cannot trust the results. In order to obtain clean and reliable data, it is imperative to focus on data quality. While many data warehouse projects do take data quality into consideration, it is often treated as a delayed afterthought. Even QA after ETL is not good enough; the quality process needs to be incorporated into the ETL process itself. Data quality has to be maintained for individual records, or even small bits of information, to ensure the accuracy of the complete database. Data quality is an increasingly serious issue for organizations large and small, and it is central to all data integration initiatives. Before data can be used effectively in a data warehouse, or in customer relationship management, enterprise resource planning or business analytics applications, it needs to be analyzed and cleansed. To sustain high-quality data, organizations need to apply ongoing data cleansing processes and procedures, and to monitor and track data quality levels over time. Otherwise, poor data quality will lead to increased costs, breakdowns in the supply chain and inferior customer relationship management. Defective data also hampers business decision making and efforts to meet regulatory compliance responsibilities. The key to successfully addressing data quality is to get business professionals centrally involved in the process. We have analyzed a possible set of causes of data quality issues based on an exhaustive survey and discussions with data warehouse groups working in distinguished organizations in India and abroad. We expect this paper will help modelers and designers of warehouses analyse and implement quality warehouse and business intelligence applications.
International Journal of Innovative Technology and Exploring Engineering, 2019
Data quality (DQ) is as old as data itself. In the last few years it has become clear that DQ cannot be ignored during the construction and utilization of a data warehouse (DW), as it is a major and critical issue for knowledge experts, workers and decision makers who test and query the data for organizational trust and customer satisfaction. Low data quality leads to high costs, losses in the supply chain and degraded customer relationship management. Hence, to ensure quality before the data is used in a DW, CRM (Customer Relationship Management), ERP (Enterprise Resource Planning) or business analytics application, it needs to be analyzed and cleansed. In this paper, we identify the problems arising from dirty data and attempt to solve them.
Commonly, DW development methodologies pay little attention to the problem of data quality and completeness. One of the common mistakes made during the planning of a data warehousing project is to assume that data quality will be addressed during testing. In addition to reviewing existing data warehouse development methodologies, in this paper we introduce a new approach to data warehouse development. This proposal is based on integrating data quality into the whole data warehouse development process, denoted Integrated Requirement Analysis for Designing Data Warehouse (IRADAH). This paper shows that data quality is not only an integral part of a data warehouse project, but also a sustained and ongoing activity.
Data cleaning is the process of identifying and removing errors in the data warehouse. Data cleaning is very important in the data mining process, and most organizations need quality data. The quality of the data needs to be improved in the data warehouse before the mining process. The available frameworks for data cleaning offer fundamental services such as attribute selection, formation of tokens, selection of a clustering algorithm, selection of a similarity function, selection of an elimination function and a merge function. This paper presents a new framework for data cleaning, along with a solution that handles the data cleaning process using the new framework design applied in sequential order.
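To make that sequence of services concrete, the following is a minimal sketch, assuming toy records, field names and a similarity threshold that are not taken from the paper, of a pipeline that selects attributes, forms tokens, clusters (blocks) records, compares them with a similarity function and eliminates near-duplicates.

# Illustrative sketch of the sequential cleaning services described above.
from difflib import SequenceMatcher
from collections import defaultdict

records = [
    {"id": 1, "name": "John Smith", "city": "Leeds"},
    {"id": 2, "name": "Jon Smith",  "city": "Leeds"},
    {"id": 3, "name": "Mary Jones", "city": "York"},
]

# 1. Attribute selection: choose the fields used for matching (assumed).
KEY_ATTRS = ["name", "city"]

# 2. Token formation: lower-case word tokens from the selected attributes.
def tokens(rec):
    return [t for attr in KEY_ATTRS for t in rec[attr].lower().split()]

# 3. Clustering (blocking): group records sharing the same last token (the city here).
blocks = defaultdict(list)
for rec in records:
    blocks[tokens(rec)[-1]].append(rec)

# 4. Similarity function: string similarity over the token sequence.
def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, " ".join(tokens(a)), " ".join(tokens(b))).ratio() >= threshold

# 5./6. Elimination and merge: within each block, keep only the first of each near-duplicate group.
cleaned = []
for block in blocks.values():
    kept_in_block = []
    for rec in block:
        if not any(similar(rec, kept) for kept in kept_in_block):
            kept_in_block.append(rec)
    cleaned.extend(kept_in_block)

print(cleaned)   # record 2 is eliminated as a near-duplicate of record 1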
Data cleansing is an activity involving a process of detecting and correcting the errors and inconsistencies in a data warehouse. It deals with identification of corrupt and duplicate data inherent in the data sets of a data warehouse to enhance the quality of data. This study investigated research works conducted in the area of data cleansing. A thorough review of these existing works was carried out to determine the goals achieved and the limitations that arose from the approaches taken by the researchers. The identification of errors by most of these researchers has led to the development of several frameworks and systems implemented in the area of data warehousing. Generally, these findings will contribute to the emerging empirical evidence of the strategic role data cleansing plays in the growth of organizations, institutions and government agencies in terms of data quality and reporting, and in gaining competitive advantage by overcoming the existence of dirty data.
International Journal of Computer Applications, 2014
The quality of data can only be improved by cleaning the data prior to loading it into the data warehouse, as correctness of data is essential for well-informed and reliable decision making, and the data warehouse is the only viable solution that can turn that goal into a reality. Data cleaning is a very important process for the data warehouse, but it is not an easy one, since many different types of unclean data can be present; whether data is clean or dirty is also highly dependent on the nature and source of the raw data. Many attempts have been made to clean data using different types of algorithms. In this paper an attempt has been made to provide a hybrid approach for cleaning data, which combines modified versions of the PNRS, transitive closure and semantic data matching algorithms that can be applied to the data to obtain better results in data correction.
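Of the three algorithms named, the transitive-closure step is the most self-contained. The sketch below, assuming a hypothetical list of pairwise matches produced by the upstream matchers, shows how transitive closure can group records with a union-find structure; this is a common way to implement the step, not necessarily the paper's.

# If record A matches B and B matches C, transitive closure groups all three as one entity.
from collections import defaultdict

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Pairwise matches produced by an upstream matcher (assumed values).
matches = [("r1", "r2"), ("r2", "r3"), ("r5", "r6")]
for a, b in matches:
    union(a, b)

groups = defaultdict(list)
for rec in ["r1", "r2", "r3", "r4", "r5", "r6"]:
    groups[find(rec)].append(rec)

print(list(groups.values()))   # [['r1', 'r2', 'r3'], ['r4'], ['r5', 'r6']]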
International Journal of Computer Applications, 2013
Data cleansing (or data scrubbing) is an activity involving a process of detecting and correcting the errors and inconsistencies in a data warehouse. Poor-quality data, i.e. dirty data present in a data mart, can thus be avoided using various data cleaning strategies, leading to more accurate and hence more reliable decision making. Quality data can only be produced by cleaning and pre-processing the data prior to loading it into the data warehouse.
IOSR Journal of Engineering, 2013
To ensure data quality for an enterprise data repository, various data quality tools that focus on this issue are used. The scope of these tools is moving from specific applications to a more global perspective so as to ensure data quality at every level. A more organized framework is needed to help managers choose these tools so that data repositories or data warehouses can be maintained in a very efficient way. Data quality tools are used in data warehousing to ready the data and ensure that clean data populates the warehouse, thus enhancing its usability. This research focuses on the various data quality tools which have been used and implemented successfully in preparing the examination data of the University of Kashmir for the preparation of results. This paper also proposes a mapping of data quality tools to the processes involved in efficient data migration to a data warehouse.
2016
Testing ETL (Extract, Transform, and Load) procedures is an important and vital phase in testing a data warehouse (DW); it is arguably the most complex phase, because it directly affects the quality of data. Automated testing has proved to be a valuable tool for improving the quality of DW systems, whereas manual testing is time-consuming and error-prone, so automating tests improves Data Quality (DQ) at lower time and cost. In this paper the authors propose a testing framework to automate data quality testing at the ETL stage. Different datasets with different volumes (ranging from 10,000 to 50,000 records) are used to evaluate the effectiveness of the proposed automated ETL testing. The experimental results showed that the proposed testing framework is effective in detecting errors across the different data volumes.
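As an illustration of what such automated checks can look like, here is a minimal sketch of post-load data quality checks on completeness, key uniqueness and mandatory fields; the table layout, column names and rules are assumptions, not the paper's actual test suite.

# Hedged sketch of automated post-load data quality checks.
def run_quality_checks(source_rows, target_rows, key="id", not_null=("id", "amount")):
    report = {}
    # Completeness: every source row should arrive in the target.
    report["row_count_match"] = len(source_rows) == len(target_rows)
    # Uniqueness: the business key must not be duplicated after the load.
    keys = [r[key] for r in target_rows]
    report["no_duplicate_keys"] = len(keys) == len(set(keys))
    # Validity: mandatory columns must not be null or empty.
    report["no_nulls"] = all(r.get(c) not in (None, "") for r in target_rows for c in not_null)
    return report

source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
target = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
print(run_quality_checks(source, target))
# {'row_count_match': True, 'no_duplicate_keys': True, 'no_nulls': True}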
2004
Today's informational entanglement makes it crucial to enforce adequate management systems. Data warehousing systems appeared with the specific mission of providing adequate contents for data analysis, ensuring the gathering, processing and maintenance of all data elements deemed valuable. Data analysis in general, and data mining and on-line analytical processing facilities in particular, can achieve better, sharper results because data quality is finally taken into account. The available elements must be submitted to intensive processing before they can be integrated into the data warehouse. Each data warehousing system embraces extraction, transformation and loading processes which are in charge of all the processing concerning the preparation of data for its integration into the data warehouse. Usually, data is scoped at several stages, inspecting data and schema issues and filtering out all those elements that do not comply with the established rules. This paper proposes an ag...
Data errors occur in various ways when data is transferred from one point to another. These errors do not necessarily arise at the formation or insertion of data but develop and transform as data is transferred from one process to another along the information chain within the data warehouse infrastructure. The main focus of this study is to conceptualize the data cleansing process from data acquisition to data maintenance. Data cleansing is an activity involving a process of detecting and correcting the errors and inconsistencies in a data warehouse. Poor data, or "dirty data", requires cleansing before it can be useful to organizations. Data cleansing therefore deals with identification of corrupt and duplicate data inherent in the data sets of a data warehouse to enhance the quality of data. The research was directed at investigating some existing approaches and frameworks for data cleansing. It attempted to address the gaps identified in those data cleansing approaches and came up with a conceptual framework to overcome the weaknesses identified in those frameworks and approaches. This novel conceptual framework considers the data cleansing process from the point at which data is obtained to the point of maintaining the data, using a periodic automatic cleansing approach.
Data cleansing is an activity involving a process of detecting and correcting the errors and inconsistencies in a data warehouse. It deals with identification of corrupt and duplicate data inherent in the data sets of a data warehouse to enhance the quality of data. The research was directed at investigating some existing approaches and frameworks for data cleansing that attempt to solve the data cleansing problem, examining their strengths and weaknesses, which led to the identification of gaps in those frameworks and approaches. A comparative analysis of the four frameworks was conducted and, using standard testing parameters, a proposed feature was discussed to fill those gaps.
TELKOMNIKA Telecommunication Computing Electronics and Control, 2018
A data warehouse is a collective entity of data from various data sources. In a data warehouse, data are prone to several complications and irregularities, and data cleaning is a non-trivial activity for ensuring data quality. Data cleaning services involve identifying errors, removing them and improving the quality of data. One of the common methods is duplicate elimination. This research focuses on the service of duplicate elimination on local data. It initially surveys data quality, focusing on quality problems, cleaning methodology, the stages involved and services within the data warehouse environment. It also provides a comparison through experiments on local data covering different cases, such as different spellings arising from different pronunciations, misspellings, name abbreviations, honorific prefixes, common nicknames, split names and exact matches. All services are evaluated against the proposed quality-of-service metrics such as performance, capability to process large numbers of records, platform support, data heterogeneity and price, so that in the future these services can reliably handle big data in a data warehouse.
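The name-variation cases listed above can be illustrated with a small, hedged sketch; the nickname table, honorific list and similarity threshold below are assumptions for illustration, not the services the paper evaluates.

# Illustrative duplicate detection for name variations (misspellings, abbreviations,
# honorific prefixes, nicknames, split names, exact match).
from difflib import SequenceMatcher

HONORIFICS = {"mr", "mrs", "ms", "dr", "prof"}
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}  # assumed sample

def normalize(name: str) -> str:
    parts = [p.strip(".").lower() for p in name.replace(",", " ").split()]
    parts = [p for p in parts if p not in HONORIFICS]          # drop honorific prefixes
    parts = [NICKNAMES.get(p, p) for p in parts]               # map common nicknames
    return " ".join(sorted(parts))                             # order-insensitive (split names)

def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    na, nb = normalize(a), normalize(b)
    if na == nb:                                               # exact match after normalization
        return True
    return SequenceMatcher(None, na, nb).ratio() >= threshold  # misspellings / abbreviations

print(is_duplicate("Dr. Bob Smith", "Smith, Robert"))   # True (honorific, nickname, split name)
print(is_duplicate("Jon Smyth", "John Smith"))          # True (misspelling within the assumed threshold)
print(is_duplicate("Mary Jones", "John Smith"))         # False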
We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
1997
DWQ is a cooperative project in the ESPRIT program of the European Communities. It aims at establishing foundations of data warehouse quality through linking semantic models of data warehouse architecture to explicit models of data quality. This paper provides an overview of the project goals and offers an architectural framework in which the individual research contributions are embedded.
2001
The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for some applications, existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. One important challenge is the design of a data flow graph that effectively generates clean data; a more general difficulty is the lack of explanation of cleaning results and of user interaction facilities for tuning a data cleaning program. This paper presents a solution to this problem that enables users to express user interactions declaratively and tune data cleaning programs.
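To give a flavour of what declaratively recorded user interaction can mean, the snippet below is a hedged sketch of the general idea rather than the paper's actual mechanism: an ambiguous duplicate pair is routed to the user once, the decision is recorded, and it is replayed on later runs; the example values are assumptions.

# Sketch: user decisions are captured once and replayed to tune the cleaning program.
user_decisions = {}   # (value_a, value_b) -> "merge" | "keep_both"

def resolve(pair, ask_user):
    if pair in user_decisions:              # replay a previously recorded decision
        return user_decisions[pair]
    decision = ask_user(pair)               # interactive step, expressed once
    user_decisions[pair] = decision         # persisted so the program is "tuned"
    return decision

# Example: a matcher flags an ambiguous duplicate pair (values are assumptions).
pair = ("ACME Corp.", "ACME Corporation")
print(resolve(pair, ask_user=lambda p: "merge"))   # asks once...
print(resolve(pair, ask_user=lambda p: "merge"))   # ...then replays the stored rule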
Poor quality of data in a warehouse adversely impacts the usability of the warehouse, so managing data quality in a warehouse is very important. In this article we describe a framework for managing data quality in a data warehouse. This framework is of interest to both academics and practitioners, as it offers an intuitive approach not just for managing data quality in a data warehouse but also for implementing total data quality management. The framework is based on the information product approach. Using this approach, it integrates existing metadata in a warehouse with quality-related metadata and proposes a visual representation for communicating data quality to decision-makers. It allows decision-makers to gauge data quality in a context-dependent manner. The representation also helps implement capabilities that are integral components of total data quality management.
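As a rough illustration of quality-related metadata attached to warehouse objects, here is a minimal sketch assuming simple dimension names and scores; it is not the information-product model of the article.

# Quality-related metadata attached to a warehouse table (assumed structure).
from dataclasses import dataclass, field
from datetime import date

@dataclass
class QualityMetadata:
    table: str
    completeness: float          # fraction of mandatory fields populated
    accuracy: float              # fraction of records passing validation rules
    timeliness_days: int         # age of the most recent refresh
    assessed_on: date = field(default_factory=date.today)

    def summary(self) -> str:
        return (f"{self.table}: completeness={self.completeness:.0%}, "
                f"accuracy={self.accuracy:.0%}, refreshed {self.timeliness_days} day(s) ago")

print(QualityMetadata("sales_fact", completeness=0.97, accuracy=0.93, timeliness_days=1).summary())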