QA Validation in Big Data

Contents

Abstract

Introduction:

What Does Big Data Offer to Industries?

Black Box Data:

Social Media Data:

Stock Exchange Data:

Power Grid Data:

Transport Data:

Search Engine Data:

Big Data Life Cycle

Data sourcing:

Data extraction & Structuring:

Data Modelling:

Data Analysis and Interpretation:

Big Data Characteristics

Volume –

Variety –

Velocity –

Variability –

Big Data in QA

Stages in Testing Big Data Applications

Data Staging Validation

MapReduce Validation

Output Validation Phase

QA activities of Big Data testing include functional and non-functional testing.

Abstract

This paper provides an overview of big data, its importance in our lives, and some of the technologies used to handle it, and elucidates how big data is approached in testing practice. Big data is data that exceeds the processing capacity of traditional databases: the data is too big to be processed by a single machine, and new and innovative methods are required to process and store such large volumes. The paper also describes some of the characteristics and features of big data and its uses in various industries. This article does not cover the topic exhaustively, but an attempt has been made to cover the major aspects of big data.

Introduction:

As we become a more digital society, the amount of data being accumulated is growing rapidly. Analysing this ever-growing data becomes a challenge with traditional tools, and innovation is required to bridge the gap between the data being generated and the data that can be analysed effectively. A few firms, such as FactSet, Hoovers and Bloomberg, persist data spanning hundreds of years for analysis. The cost of processing, analysing and reporting on data was far higher in the past than it is with current infrastructure.

Today, data management is faster: unstructured data can be stored without manual intervention, in less time and with higher quality. Big data tools and technologies offer opportunities to provide analysis tailored to customer preferences, gain a competitive advantage in the marketplace, and grow the business. Data management architectures have evolved from the traditional data warehousing model to more complex architectures that address a wider range of requirements, such as real-time and batch processing, structured and unstructured data, and high-velocity transactions.

What Does Big Data Offer to Industries?

Big data captures the data produced by diverse devices and applications. Given below are some of the industries and the variety of data captured in each.

Black Box Data: A black box is a component of helicopters, airplanes, jets, and so on. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.

Social Media Data: Social media sites such as Facebook and Twitter hold information and views posted by millions of people across the globe.

Stock Exchange Data: Stock exchange data holds information about the 'buy' and 'sell' decisions made by customers on the shares of different companies.

Power Grid Data: Power grid data holds information about the power consumed by a particular node with respect to a base station.

Transport Data: Transport data includes the model, capacity, distance and availability of a vehicle.

Search Engine Data: Search engines retrieve large volumes of data from different databases.

Big Data Life Cycle

The big data life cycle is the series of steps involved in storing, processing and managing data. The steps are described below.


Data sourcing: Data is obtained from customers and non-customers (agencies, etc.), and the sourced data may be in numerical or descriptive form.

Data extraction & Structuring: Sourced data can be in either structured or unstructured form, and it arrives in multiple formats such as text, numbers, documents, images, video and so on. Data received in unstructured form is processed and stored after being converted into a usable format.
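As a minimal illustration of structuring, the Python sketch below parses unstructured log lines into structured records; the log-line format, field names and sample values are assumptions for illustration only.

  import re

  # Hypothetical unstructured log lines; the format is assumed for illustration.
  raw_lines = [
      "2021-03-01 10:15:02 user=alice action=login",
      "2021-03-01 10:16:45 user=bob action=purchase",
  ]

  # Regular expression that extracts timestamp, user and action fields.
  pattern = re.compile(r"^(\S+ \S+) user=(\w+) action=(\w+)$")

  structured = []
  for line in raw_lines:
      match = pattern.match(line)
      if match:
          timestamp, user, action = match.groups()
          structured.append({"timestamp": timestamp, "user": user, "action": action})

  print(structured)  # structured records ready for storage

In practice this conversion runs at scale inside the extraction pipeline; the sketch only shows the principle.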

Data Modelling: Data modelling is the representation of the data obtained from various sources. Modelling refers to the design of the rows and columns in the database, and covers text, formats, graphs, etc.

Data Analysis and Interpretation: Data analysis and interpretation is the process of assessing the data and making a decision. Data analysis can take a qualitative or a quantitative approach. The quantitative approach uses basic statistical methods such as mean, median and mode, and advanced statistical methods such as regression and factor analysis, whereas the qualitative approach includes interviews, group discussions, focus groups, and the like. Interpretation is arriving at a decision based on the analysis obtained from the data gathered and processed.
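A minimal sketch of the quantitative approach, using Python's standard statistics module on illustrative sales figures (the data and the choice of regression over time periods are assumptions):

  import statistics

  # Illustrative quantitative data.
  sales = [120, 135, 150, 135, 160, 175]

  print("mean:", statistics.mean(sales))
  print("median:", statistics.median(sales))
  print("mode:", statistics.mode(sales))

  # A simple regression of sales against time periods (requires Python 3.10+).
  periods = list(range(len(sales)))
  slope, intercept = statistics.linear_regression(periods, sales)
  print(f"trend: sales = {intercept:.1f} + {slope:.1f} * period (approx.)")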

Big Data Characteristics

Volume – The size of the data plays a crucial role in determining the value that can be derived from it. Whether a particular data set can actually be considered big data also depends on its volume. Hence, 'volume' is one characteristic that needs to be considered while dealing with big data.

Variety – Earlier, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered by analysis applications.

This data falls into three types (a parsing sketch follows the list):

  • Structured data: relational data.
  • Semi-structured data: XML data.
  • Unstructured data: Word, PDF, text, media logs.
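Each type calls for a different parsing approach, as the Python sketch below illustrates (the sample data is hypothetical):

  import csv, io
  import xml.etree.ElementTree as ET

  # Structured: relational-style rows, here a CSV snippet parsed into dicts.
  csv_text = "id,name\n1,alice\n2,bob\n"
  rows = list(csv.DictReader(io.StringIO(csv_text)))

  # Semi-structured: XML carries structure in tags rather than a fixed schema.
  xml_text = "<orders><order id='1'>book</order></orders>"
  orders = [(o.get("id"), o.text) for o in ET.fromstring(xml_text)]

  # Unstructured: free text must be processed before it can be analysed.
  tokens = "Customer reported a delayed delivery.".lower().split()

  print(rows, orders, tokens)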

Velocity – Big data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

Variability – This refers to the inconsistency that data can show at times, which hampers the process of handling and managing the data effectively.

Big Data in QA

For data at this scale, big data techniques demand engineers with specialised skill sets for testing large and complex data sets. Current testing practices face the following limitations when testing applications that solve big data problems:

  • Software testing approaches are driven by the data rather than by the testing scenarios.
  • Standard data-matching tools do not work with large volumes of data, which places additional demands on the software testing engineer's skill set.

Stages in Testing Big Data Applications

Data Staging Validation

The first step of big data testing, also referred to as the pre-Hadoop stage, involves process validation; a count-and-checksum sketch follows the list below.

  • Data from various sources such as RDBMS, weblogs, social media, etc. should be validated to make sure the correct data is pulled into the system.
  • The source data should be compared with the data pushed into the Hadoop system to make sure they match.
  • Verify that the right data is extracted and loaded into the correct HDFS location.
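A minimal count-and-checksum sketch in Python, assuming both the source extract and the HDFS copy have already been exported to local CSV files (the file names are hypothetical; real validation would read from the RDBMS and HDFS directly):

  import csv
  import hashlib

  def file_stats(path):
      # Return (row count, order-independent checksum) for a CSV file.
      with open(path, newline="") as handle:
          rows = list(csv.reader(handle))
      digests = sorted(
          hashlib.md5(",".join(row).encode()).hexdigest() for row in rows
      )
      combined = hashlib.md5("".join(digests).encode()).hexdigest()
      return len(rows), combined

  # Hypothetical local exports of the source table and its HDFS copy.
  src_count, src_sum = file_stats("source_extract.csv")
  hdfs_count, hdfs_sum = file_stats("hdfs_copy.csv")

  assert src_count == hdfs_count, "row counts differ between source and HDFS"
  assert src_sum == hdfs_sum, "content differs between source and HDFS"
  print("staging validation passed")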

MapReduce Validation

The second step is validation of the MapReduce process. In this stage the tester verifies the business logic on a single node, then validates it after running against multiple nodes, ensuring that (see the sketch after this list):

  • the MapReduce process works correctly;
  • data aggregation or segregation rules are implemented on the data;
  • key-value pairs are generated;
  • the data is valid after the MapReduce process.
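A minimal local simulation of the map and reduce steps, useful for validating an aggregation rule against independently computed totals before running on the cluster (the per-region summation rule and sample records are assumptions):

  from collections import defaultdict

  def map_phase(record):
      # Map step: emit (region, amount) key-value pairs.
      yield record["region"], record["amount"]

  def reduce_phase(pairs):
      # Reduce step: aggregate amounts per region key.
      totals = defaultdict(int)
      for key, value in pairs:
          totals[key] += value
      return dict(totals)

  records = [
      {"region": "north", "amount": 10},
      {"region": "south", "amount": 5},
      {"region": "north", "amount": 7},
  ]

  pairs = [pair for record in records for pair in map_phase(record)]
  result = reduce_phase(pairs)

  # The tester compares the output against independently computed totals.
  assert result == {"north": 17, "south": 5}
  print("aggregation rule validated:", result)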

Output Validation Phase

The third and final stage of big data testing is the output validation process. The output data files are generated and ready to be moved to an EDW (Enterprise Data Warehouse) or any other system, based on the requirement.

Activities in the third stage include (a transformation-rule check is sketched after the list):

  • checking that the transformation rules are correctly applied;
  • checking data integrity and the successful load of data into the target system;
  • checking that there is no data corruption, by comparing the target data with the HDFS file system data.
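The check below applies a hypothetical documented rule to the source rows and compares the result with what was loaded into the target (the rule, field names and values are assumptions):

  def transform(record):
      # Hypothetical documented rule: uppercase names, amounts in cents.
      return {"name": record["name"].upper(), "amount_cents": record["amount"] * 100}

  source_rows = [{"name": "alice", "amount": 12}, {"name": "bob", "amount": 7}]
  # Rows as loaded into the target warehouse (illustrative values).
  target_rows = [{"name": "ALICE", "amount_cents": 1200},
                 {"name": "BOB", "amount_cents": 700}]

  expected = [transform(row) for row in source_rows]
  assert expected == target_rows, "transformation rules not applied correctly"
  print("output validation passed for", len(target_rows), "rows")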

QA activities of Big Data testing include functional and non-functional testing.

Reports/Dashboards Testing: Testing a report includes mapping the fields in the report to the schema; checking the report layout, font style, format and filter options; validating 'drill-in' on the report with various data sets; verifying the accuracy of scheduled reports and report printing; and confirming that reports can be exported to Excel, PDF, HTML, etc. A field-mapping sketch follows.
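The check below compares the fields a report uses against the warehouse schema columns; both sets are hypothetical:

  # Hypothetical report definition and warehouse schema columns.
  report_fields = {"customer_id", "order_date", "total_amount"}
  schema_columns = {"customer_id", "order_date", "total_amount", "region"}

  missing = report_fields - schema_columns
  assert not missing, f"report fields missing from schema: {missing}"
  print("all report fields map to schema columns")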

Verifying Source vs. Output: Validating the source count and content against the output records, and validating data consistency.

Data Security and Access: Data is available to all privileged users based on the security level or hierarchy of the business.

Configuration of Reports: Data in the reports is accessible based on the configuration assigned to users. Example: if an FMCG manager in a retail store is given access to view sales reports of FMCG products, he will not be able to view the sales reports of other departments. A role-check sketch follows.
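The sketch below encodes the example above as a role check; the role names, report names and rule table are assumptions:

  # Hypothetical mapping of roles to the report categories they may view.
  access_rules = {
      "fmcg_manager": {"fmcg_sales"},
      "store_director": {"fmcg_sales", "electronics_sales", "apparel_sales"},
  }

  def can_view(role, report):
      # Grant access only if the role's configuration lists the report.
      return report in access_rules.get(role, set())

  assert can_view("fmcg_manager", "fmcg_sales")
  assert not can_view("fmcg_manager", "electronics_sales")
  print("report access configuration behaves as expected")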

End-to-End Testing: End-to-end testing of reports at various levels is required to ensure that the quality and performance of the reports are not impacted.

Performance Testing: Obtain and understand the actual performance of big data applications under load, such as response time, maximum online user data capacity, and maximum processing time. Performance testing should be conducted with large volumes of data in an environment similar to production, and the usual performance metrics, such as job completion time, data throughput and memory utilization, should be noted and tracked. A timing sketch follows.
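The sketch below measures completion time and throughput around a stand-in workload (a summation, not a real big data job):

  import time

  def run_job(records):
      # Stand-in workload; a real test would submit the actual job.
      return sum(records)

  data = list(range(1_000_000))  # synthetic input of illustrative size

  start = time.perf_counter()
  result = run_job(data)
  elapsed = time.perf_counter() - start

  throughput = len(data) / elapsed
  print(f"completed in {elapsed:.3f}s, throughput {throughput:,.0f} records/s")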


Author Biography: Chandrakanth Gande is a Senior Consultant in the Banking and Capital Markets vertical. He has 9 years of experience in IT and banking, and has been involved in QA activities for banking projects.