Introduction to Big Data

Mannan Ul Haq

Data vs. Information

Data is like building blocks: numbers, words, and other raw facts and figures. When we process data and put it together in a meaningful way, we get information.
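For example, a list of raw sensor readings is just data; once we summarize it, we get information we can act on. A tiny Python sketch (the readings below are made-up values):

    # Raw data: individual temperature readings (made-up values)
    readings = [21.5, 23.0, 22.8, 24.1, 22.3]

    # Processing the data gives us information: a summary we can act on
    average = sum(readings) / len(readings)
    peak = max(readings)

    print(f"Average temperature: {average:.1f} C")   # information
    print(f"Highest temperature: {peak:.1f} C")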


Big Data

Big Data refers to collections of data so large and complex that regular tools and hardware can't store or process them. We need special tools and techniques to make sense of this massive amount of data.


Five Characteristics of Big Data

  • Volume: How much data we have collected.
  • Velocity: How fast new data arrives and needs to be processed.
  • Variety: The different types of data (numbers, text, images, audio recordings).
  • Veracity: How much we can trust the data (accuracy, precision, integrity, reliability).
  • Value: How useful the data is for making decisions.

Big Data Concepts and Terminology

Clustered Computing:

Combines resources from multiple machines to work as a single, powerful unit.

Parallel Computing:

Performs multiple tasks simultaneously on a single computer using multiple processors or cores.
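A minimal sketch of this idea in Python, using the standard multiprocessing module to spread a simple task across CPU cores (the text chunks below are just examples):

    from multiprocessing import Pool

    def count_words(chunk):
        """Count the words in one chunk of text."""
        return len(chunk.split())

    if __name__ == "__main__":
        # Example chunks; in practice these could be pieces of one large file
        chunks = [
            "big data is big",
            "spark processes data fast",
            "hadoop stores data",
        ]

        # Run the same task on several cores at the same time
        with Pool(processes=3) as pool:
            counts = pool.map(count_words, chunks)

        print(counts)       # [4, 4, 3]
        print(sum(counts))  # total word count: 11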

Distributed Computing:

Uses a network of computers (nodes) to run tasks in parallel, sharing the workload across multiple machines.

Batch Processing:

Collects data over a period of time and processes it in large blocks (batches), typically by splitting the job into smaller pieces and running them on separate machines.

Real-Time Processing:

Handles data immediately as it arrives, enabling near-instant analysis and response.
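A toy Python comparison of the two styles (the "events" here are simulated with a generator, not a real data stream):

    import time

    # Batch processing: collect the data first, then process it all at once
    stored_events = [5, 3, 8, 2, 7]        # data gathered over some period
    print("Batch total:", sum(stored_events))

    # Real-time processing: handle each event the moment it arrives
    def event_stream():
        """Simulate events arriving one at a time."""
        for value in [5, 3, 8, 2, 7]:
            time.sleep(0.1)                # pretend we are waiting for new data
            yield value

    running_total = 0
    for event in event_stream():
        running_total += event             # react to each event immediately
        print("Running total so far:", running_total)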

Big Data Processing Systems

Hadoop/MapReduce:

  • Description: A scalable and fault-tolerant framework written in Java.
  • Type: Open-source.
  • Processing: Primarily used for batch processing.
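To get a feel for the MapReduce model, here is a rough word-count sketch in Python: the map step emits (word, 1) pairs and the reduce step sums the counts per word. Real Hadoop jobs are typically written in Java (or run through Hadoop Streaming); this is only the idea, not Hadoop's actual API.

    from collections import defaultdict

    def map_phase(lines):
        """Map: emit a (word, 1) pair for every word in the input."""
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reduce_phase(pairs):
        """Reduce: sum the counts for each word."""
        totals = defaultdict(int)
        for word, count in pairs:
            totals[word] += count
        return dict(totals)

    # Example input; a real job would read blocks of a huge file from HDFS
    lines = ["Big data needs big tools", "Hadoop processes big data in batches"]
    print(reduce_phase(map_phase(lines)))   # {'big': 3, 'data': 2, ...}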

Apache Spark:

  • Description: A general-purpose, lightning-fast cluster computing system.
  • Type: Open-source.
  • Processing: Supports both batch and real-time data processing.
Note: Apache Spark is now generally preferred over Hadoop MapReduce because it is faster (it can process data in memory) and more versatile.
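A minimal batch word-count sketch using PySpark (this assumes the pyspark package is installed and runs Spark locally; the input data is made up):

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session
    spark = (SparkSession.builder
             .appName("WordCount")
             .master("local[*]")
             .getOrCreate())

    # Example in-memory data; a real job would read from HDFS, S3, and so on
    lines = spark.sparkContext.parallelize([
        "Big data needs big tools",
        "Spark handles both batch and streaming data",
    ])

    # Classic word count expressed as Spark transformations
    counts = (lines.flatMap(lambda line: line.lower().split())  # split into words
                   .map(lambda word: (word, 1))                 # emit (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b))            # sum counts per word

    print(counts.collect())
    spark.stop()

The same engine can also read from streaming sources (for example Kafka) through Spark Structured Streaming, which is what makes it suitable for real-time processing as well.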
