Introduction to Big Data

Kvs Vishnu Kumar
5 min read · Sep 17, 2020

About Big Data

What is Big Data?

In the modern world, there are many big problems, and one of them is Big Data. Today, data collection is very important: it is the key to a company's success. But as the number of users grows day by day, data is becoming larger and larger.

Some of the companies that acquire enormous amounts of data on a daily basis are Google, Facebook, Twitter, Instagram, etc. People all around the world post images and other content every day. For example, Facebook generates 4 petabytes of data per day. See the stats below.

Stats: Per-Minute Activity

Here are some per-minute figures for various online services:

  • Snapchat: Over 527,760 photos shared by users
  • LinkedIn: Over 120 professionals join the network
  • YouTube: 4,146,600 videos watched
  • Twitter: 456,000 tweets sent or created
  • Instagram: 46,740 photos uploaded
  • Netflix: 69,444 hours of video watched
  • Giphy: 694,444 GIFs served
  • Tumblr: 74,220 posts published
  • Skype: 154,200 calls made by users

So, how do these companies manage the data? The answer: by using a combination of massively parallel systems.

The concept used to solve Big Data problems is the Distributed System. To understand distributed systems, we first need to understand another concept called IOPS.

What is IOPS?

IOPS stands for Input/Output operations per second. It is a unit for measuring the performance characteristics of storage devices: it represents how many read and write operations a given storage device or medium can perform each second. When writing data to a disk, we don't write it byte by byte; rather, we write it in blocks. Block sizes vary by system: SQL Server uses 64 KB blocks, whereas Windows Server uses 4 KB blocks. To get a better understanding of IOPS, let's compare SSDs and HDDs. We know that SSDs are faster than HDDs: the IOPS for an SSD is in the range of 3,000 to 40,000, whereas the IOPS for an HDD is in the range of 55 to 80.
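As a rough sketch of how IOPS and block size relate to throughput (a simplified model; real benchmarks also depend on queue depth, read/write mix, and sequential versus random access):

```python
# Rough relationship between IOPS, block size, and throughput.
# Simplified model: real-world performance also depends on queue
# depth, read/write mix, and sequential vs. random access patterns.

def throughput_mb_per_s(iops: int, block_size_kb: int) -> float:
    """Throughput (MB/s) = IOPS x block size (KB) / 1024."""
    return iops * block_size_kb / 1024

# An HDD at ~80 IOPS with 4 KB blocks:
print(throughput_mb_per_s(80, 4))       # 0.3125 MB/s
# An SSD at ~40,000 IOPS with 4 KB blocks:
print(throughput_mb_per_s(40_000, 4))   # 156.25 MB/s
```

This also shows why the block size matters: at the same IOPS figure, a system writing 64 KB blocks moves sixteen times as much data per second as one writing 4 KB blocks.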

What is a Distributed System?

A distributed system, also known as distributed computing, is a system with multiple components located on different machines that communicate and coordinate actions in order to appear as a single coherent system to the end-user.

Let us consider a storage appliance. Suppose we have 40 TB of data, and writing all of it to a single disk takes 40 minutes. If we split the data into 10 TB blocks and write them to 4 disks in parallel, it takes a total of 10 minutes. If we instead split the 40 TB into 5 TB blocks and store them on 8 disks, copying all the data takes 5 minutes. From this, we can say that spreading small amounts of data across many disks is more efficient than storing a large amount of data on one large disk. This concept is called parallelisation, and it is used by distributed systems. Not only storage: compute power and many other services use this concept as well.
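The timing example above can be sketched as a toy model. It assumes each disk writes at 1 TB per minute (matching the article's numbers) and that the split is perfectly even, which glosses over real-world overheads like coordination and network transfer:

```python
# Toy model of the splitting example: writing a fixed amount of data
# across N disks in parallel. Assumes each disk writes 1 TB/min and
# the data splits evenly -- a simplification matching the article.

def parallel_write_minutes(total_tb: float, num_disks: int,
                           tb_per_min_per_disk: float = 1.0) -> float:
    """All disks write their share simultaneously, so the total time
    is governed by one disk's share, not by the total data size."""
    share = total_tb / num_disks
    return share / tb_per_min_per_disk

print(parallel_write_minutes(40, 1))  # 40.0 -> one disk: 40 min
print(parallel_write_minutes(40, 4))  # 10.0 -> four disks: 10 min
print(parallel_write_minutes(40, 8))  # 5.0  -> eight disks: 5 min
```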

Main benefits of Distributed System

  • Horizontal Scalability — Since computing happens independently on each machine, it is easy and generally inexpensive to add additional devices and functionality as necessary.
  • Reliability — Most distributed systems are fault-tolerant as they can be made up of hundreds of machines that work together. The system generally doesn’t experience any disruptions if a single machine fails.
  • Performance — Distributed systems are extremely efficient because workloads can be broken up and sent to multiple machines.

There are many Big Data technologies, like Apache Hadoop, Microsoft HDInsight, NoSQL, Hive, Sqoop, etc. Of them all, the most widely used is Hadoop.

The Three Vs of Big Data

Volume: The amount of data. Big data is about volume: volumes of data that can, in fact, reach unprecedented heights. It's estimated that 2.5 quintillion bytes of data are created each day.

Velocity: Velocity is the fast rate at which data is received and (perhaps) acted on. For example, Facebook users upload more than 900 million photos a day. Facebook has to handle a tsunami of photographs every day. It has to ingest it all, process it, file it, and somehow, later, be able to retrieve it.

Variety: It refers to the many types of data that are available. For example, you may have noticed that I've talked about photographs, sensor data, tweets, encrypted packets, and so on. Each of these is very different from the others.

What is Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Hadoop uses a Master-Slave architecture. It comprises a single NameNode (master node), with the remaining nodes acting as DataNodes (slave nodes). All the data is stored in the DataNodes, not in the master node; the NameNode is used to manage the DataNodes.
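A minimal sketch of this split (hypothetical names and a heavily simplified model; real HDFS adds replication, heartbeats, and 128 MB default blocks): the NameNode records only *which* DataNode holds *which* block, while the blocks themselves live on the DataNodes.

```python
# Minimal sketch of the master/slave idea: the NameNode keeps only
# metadata (block -> DataNode mapping); DataNodes hold the actual data.
# Hypothetical simplification -- real HDFS adds replication, heartbeats,
# and much larger block sizes.

from itertools import cycle

def distribute_blocks(data: bytes, block_size: int,
                      datanodes: list[str]) -> dict[str, str]:
    """Split data into fixed-size blocks and assign them to DataNodes
    round-robin, returning the NameNode's metadata table."""
    blocks = [data[i:i + block_size]
              for i in range(0, len(data), block_size)]
    assignment = {}
    nodes = cycle(datanodes)
    for idx, _block in enumerate(blocks):
        assignment[f"block-{idx}"] = next(nodes)  # metadata only
    return assignment

print(distribute_blocks(b"x" * 10, 4, ["dn1", "dn2"]))
# {'block-0': 'dn1', 'block-1': 'dn2', 'block-2': 'dn1'}
```

The point of the design is that the master stays small: a client asks the NameNode where a block lives, then talks to that DataNode directly for the data itself.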

Hadoop makes it easier to use all the storage and processing capacity in cluster servers, and to execute distributed processes against huge amounts of data. Hadoop provides the building blocks on which other services and applications can be built.

This is just one piece of software. There are many other technologies for handling big data, and each has its own benefits.

Importance of Big Data

There is a famous quote: "Data is the new oil." It means data is very important in present times.

  • Data improves quality of life.
  • Data allows organizations to more effectively determine the cause of problems. Data allows organizations to visualize relationships between what is happening in different locations, departments, and systems.
  • Data Analytics provide us solutions for most of the problems we face today.
  • Data helps you understand performance. It helps you understand your customers.
