Big Data — A Problem for Business, and How to Handle It

Nikhil S Wani
6 min read · Mar 22, 2021


Today's world is a world of automation, and we are moving toward an agile way of working. But as we move forward, we run into plenty of new problems, and one of the biggest of them is Big Data.

What is Big Data?

Big data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or a tool, rather it has become a complete subject, which involves various tools, techniques, and frameworks.

What Comes Under Big Data?

Big data involves the data produced by different devices and applications. Given below are some of the fields that come under the umbrella of Big Data.

  • Black Box Data − The black box is a component of helicopters, airplanes, jets, and so on. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
  • Social Media Data − Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
  • Stock Exchange Data − Stock exchange data holds information about the ‘buy’ and ‘sell’ decisions that customers make on the shares of different companies.
  • Power Grid Data − Power grid data holds information about the power consumed by a particular node with respect to a base station.
  • Transport Data − Transport data includes model, capacity, distance, and availability of a vehicle.
  • Search Engine Data − Search engines retrieve lots of data from different databases.

Benefits of Big Data

  • Using the information kept in social networks like Facebook, marketing agencies learn how their campaigns, promotions, and other advertising media are being received.
  • Using social media information such as the preferences and product perception of their consumers, product companies and retail organizations plan their production.
  • Using data about the previous medical history of patients, hospitals provide better and quicker service.

Did you know?

The top companies in the world run largely on their huge volumes of data. Take Google, Facebook, Amazon, or any of the others: these companies do well mainly because they manage big data so well.

To put today's scale in perspective, more than 500 TB of data can arrive on such a company's servers every single day. So what do you think: how can these companies manage this? After a lot of research, the industry arrived at solutions for managing big data. Broadly, there are two approaches.

Traditional Approach

In this approach, an enterprise has a computer to store and process big data. For storage, programmers rely on their choice of database vendor, such as Oracle or IBM. The user interacts with the application, which in turn handles the data storage and analysis.

Limitation

  • This approach works fine for applications that process less voluminous data, the kind that standard database servers can accommodate, or up to the limit of the processor doing the work. But when it comes to huge amounts of growing data, pushing everything through a single database server becomes a severe bottleneck.
  • With this approach you can store a lot of data, but performance suffers: as the stored data grows, the I/O speed of a single disk becomes the limiting factor, even with the fastest SSDs available. The rough calculation below illustrates the problem.
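
To see why the single disk is the bottleneck, here is a rough back-of-the-envelope sketch in Java. The disk throughput and cluster size are assumed, illustrative numbers, not measurements:

```java
// Rough, illustrative numbers: how long does it take to read 500 TB
// from one disk versus 100 disks reading in parallel?
public class IoBottleneck {
    public static void main(String[] args) {
        double dataTb = 500.0;                // daily volume mentioned above
        double diskMbPerSec = 500.0;          // assumed throughput of one fast SSD
        double dataMb = dataTb * 1024 * 1024; // TB -> MB

        double oneDiskHours = dataMb / diskMbPerSec / 3600;
        double clusterHours = oneDiskHours / 100; // 100 disks in parallel

        System.out.printf("One disk:  %.0f hours%n", oneDiskHours); // ~291 hours
        System.out.printf("100 disks: %.1f hours%n", clusterHours); // ~2.9 hours
    }
}
```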

What to do? How to overcome these issues?

Google solved this problem with an algorithm called MapReduce. The algorithm divides the task into small parts, assigns them to many computers, and collects the results, which, when integrated, form the final result dataset.
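
As a rough illustration of the idea (not Google's actual implementation), here is a minimal in-memory sketch in Java: each part of the input is mapped to key/value pairs in parallel, and the values that share a key are then reduced into one result:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> parts = Arrays.asList( // input already divided into parts
                "big data is big",
                "data is everywhere");

        Map<String, Long> counts = parts.parallelStream()           // map in parallel
                .flatMap(part -> Arrays.stream(part.split("\\s+"))) // emit one word per token
                .collect(Collectors.groupingBy(                     // shuffle + reduce:
                        word -> word,                               // group by key (the word)
                        Collectors.counting()));                    // merge values per key

        System.out.println(counts); // e.g. {big=2, data=2, everywhere=1, is=2}
    }
}
```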

Hadoop

Using the solution provided by Google, Doug Cutting and his team developed an open-source project called Hadoop.

Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel across many machines. In short, Hadoop is used to develop applications that can perform complete statistical analysis of huge amounts of data.

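For a concrete feel of what such an application looks like, here is a sketch of the classic word-count job, closely following the WordCount example from the Apache Hadoop MapReduce documentation. The input and output HDFS paths are supplied on the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: for every word in an input line, emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the 1s emitted for each word across the cluster.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each mapper node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```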

Hadoop Architecture

At its core, Hadoop has two major layers namely −

  • Processing/Computation layer (MapReduce), and
  • Storage layer (Hadoop Distributed File System).
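
As a small sketch of how an application talks to the storage layer, the snippet below writes and reads a file through the HDFS Java API. The path /tmp/hello.txt is a hypothetical example, and the cluster address is assumed to come from the usual core-site.xml configuration on the classpath:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/hello.txt");   // hypothetical HDFS path

        // Write: HDFS splits the file into blocks and replicates them behind the scenes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and copy it to stdout.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```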

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations that handle large-scale processing. As an alternative, you can tie together many commodity computers, each with a single CPU, into one functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than one high-end server. So the first motivational factor behind Hadoop is that it runs across clusters of low-cost machines.

Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs −

  • Data is initially divided into directories and files. Files are divided into uniform-sized blocks of 64 MB or 128 MB (preferably 128 MB); the worked example after this list shows the block math.
  • These files are then distributed across various cluster nodes for further processing.
  • HDFS, being on top of the local file system, supervises the processing.
  • Blocks are replicated for handling hardware failure.
  • Checking that the code was executed successfully.
  • Performing the sort that takes place between the map and reduce stages.
  • Sending the sorted data to a certain computer.
  • Writing the debugging logs for each job.
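
To make the block splitting and replication above concrete, here is a tiny worked example with illustrative numbers: a hypothetical 1 GB file, the preferred 128 MB block size, and HDFS's default replication factor of 3:

```java
public class BlockMath {
    public static void main(String[] args) {
        long fileSizeMb = 1024;    // a hypothetical 1 GB file
        long blockSizeMb = 128;    // the preferred block size from the list above
        int replication = 3;       // HDFS's default replication factor

        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // ceiling division
        long storedBlocks = blocks * replication;
        long storedMb = fileSizeMb * replication;

        System.out.println("Blocks:          " + blocks);       // 8
        System.out.println("Block replicas:  " + storedBlocks); // 24
        System.out.println("Disk used (MB):  " + storedMb);     // 3072 across the cluster
    }
}
```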

Advantages of Hadoop

  • The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and the work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
  • Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather the Hadoop library itself has been designed to detect and handle failures at the application layer.
  • Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.
  • Another big advantage of Hadoop is that apart from being open source, it is compatible with all the platforms since it is Java-based.

In this world of automation, the practical way to manage big data is distributed computing, the sharing of work and resources across many machines. There are also other products besides Hadoop built on the same idea, such as Ceph and GlusterFS, but they all work on the same concept of distributed computing. Most companies use a Hadoop cluster as their distributed-computing platform.

That’s all

Thank you for reading …
