Data Science provides the skills to learn from information. It helps us find hidden patterns in large sets of data, so we can make smart choices for the future.
Why Do We Need Data Science?
We use Data Science to understand things better. It helps us make sense of the massive amount of information we gather every day. Think of all the numbers, words, and facts we collect from things like websites, apps, and sensors. Data Science can also tell us what might happen next and help us solve difficult problems.
Data vs. Information:
Data is like building blocks: numbers, words, raw facts and figures. When we process data and organize it in a meaningful way, we get information.
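As a rough illustration of the step from data to information, the sketch below summarizes a few raw numbers into something a person can act on. The sensor readings and the names `raw_readings` and `information` are made up for this example.

```python
from statistics import mean

# Raw data: hypothetical hourly temperature readings (in Celsius) from a sensor.
raw_readings = [21.5, 22.0, 23.5, 25.0, 24.0]

# Processing: condense the raw numbers into a small, readable summary.
information = {
    "count": len(raw_readings),
    "average": mean(raw_readings),
    "highest": max(raw_readings),
}

print(information)
```

The list on its own is just data; the summary is information because it answers questions such as "how warm was it on average?".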
Big Data:
Big Data is a huge amount of data. It's so big that regular tools and ordinary hardware can't handle it. We need special tools and techniques to make sense of this massive amount of data.
Hence, Data Science is a field that uses scientific methods, algorithms, processes, and systems to extract valuable knowledge and insights from massive and complex datasets, commonly known as Big Data.
Five Characteristics of Big Data:
- Volume: How much data we have collected.
- Velocity: How fast new data arrives.
- Variety: Different types of data (like numbers, words, pictures, voice-records).
- Veracity: How much we can trust the data (accuracy, precision, integrity, reliability).
- Value: How useful the data is for making decisions.
Data Science Life Cycle:
Here is how a Data Scientist works:
- Problem Understanding: Figuring out what the problem is, which questions we are asking, and what answers we are trying to find.
- Data Acquisition: Collecting the data required to answer the question or to solve the problem at hand.
- Data Wrangling: Cleaning and getting the data ready. It involves looking for missing values and using knowledge of the data to shape the dataset so that it is suitable for visualization.
- Data Exploration: Using visualization and statistical measures to see whether the questions we asked at the beginning are being answered.
- Feature Engineering and Selection: Picking the best parts of the data for your task.
- Modeling: Building a model to solve the problem. It is about understanding the data’s behavior to make the model, which can be used for predictive analytics as described in the previous section.
- Deployment: Sharing your solution with others. The solution can be deployed in mobile and web applications.
- Monitoring: Keeping an eye on things to make sure they're going well. It also involves making changes to the analysis and starting over if required.
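The Data Wrangling step above mentions looking for missing values. A minimal sketch of one common way to handle them is shown below: count the gaps, then fill them with the median of the known values. The records, the field names, and the choice of median filling are assumptions for illustration; dropping incomplete rows is an equally valid option in many cases.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
records = [
    {"name": "A", "age": 34},
    {"name": "B", "age": None},
    {"name": "C", "age": 29},
    {"name": "D", "age": 41},
]

# Look for missing values.
known_ages = [r["age"] for r in records if r["age"] is not None]
missing_count = len(records) - len(known_ages)

# Fill each gap with the median of the known ages (one possible strategy).
fill_value = median(known_ages)
cleaned = [
    {**r, "age": r["age"] if r["age"] is not None else fill_value}
    for r in records
]

print(missing_count, cleaned)
```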
Data Structures:
Structured, Semi-Structured, and Unstructured data are terms used to describe different types of information based on their organization and format.
Structured Data:
Structured data is information clearly arranged in rows and columns, just like a spreadsheet. It's highly organized and follows a fixed format. Examples of structured data include databases, spreadsheets, and tables.
Semi-Structured Data:
Semi-structured data is a mix between structured and unstructured data. It doesn't fit neatly into rows and columns, but it has some level of structure. Think of it as a collection of documents where each document might have a title, author, and date, but the content itself might not follow a strict structure. Examples of semi-structured data include XML files and JSON files.
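To make this concrete, here is a small sketch that reads a JSON collection of documents and pulls the shared fields into rows. The document contents and the chosen columns are invented for the example; only the standard `json` module is used.

```python
import json

# Hypothetical JSON documents: both have a title and author,
# but only the first has a date and tags.
raw = """
[
  {"title": "Intro", "author": "Ana", "date": "2023-01-05", "tags": ["basics"]},
  {"title": "Notes", "author": "Ben"}
]
"""

documents = json.loads(raw)

# Pull the fields we care about into fixed columns.
# dict.get returns None when a document lacks a field.
columns = ["title", "author", "date"]
rows = [[doc.get(col) for col in columns] for doc in documents]

for row in rows:
    print(row)
```

The second document has no date, so its row contains `None` in that column. That gap is what separates semi-structured data from a strict table.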
Unstructured Data:
Unstructured data is information with no predefined order or format. It's more like the freeform text you find in a book or a social media post.
Examples:
- Images
- Videos
- Speeches
Data Types and their Characteristics
Data:
In Data Science, data means a collection of objects and their attributes.
Data Types:
Categorical Data:
Categorical data represents categories or characteristics like gender, language, or movie genre. It's also called qualitative data. You can use numbers to label the categories, but those numbers have no real mathematical meaning (like 0/1 for male/female).
Types:
1. Nominal Data:
- No order.
- Examples: gender, language, eye color.
- Analyze with frequencies, pie charts, etc.
2. Ordinal Data:
- Has order.
- Examples: happiness level, education level, movie ratings.
- Summarize with median or mode; visualize with bar charts.
3. Binary Data:
- Just two values: yes or no.
- Represented as "True" and "False" or 1 and 0.
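The summaries suggested above can be sketched with the standard library: frequencies for categories with no order, a median for categories with an order, and a simple count for two-valued data. All sample values and variable names below are made up for illustration.

```python
from collections import Counter
from statistics import median_low

# No order: counting frequencies is the natural summary.
eye_colors = ["brown", "blue", "brown", "green", "brown"]
frequencies = Counter(eye_colors)

# Ordered categories: map each level to its rank, then take the median rank.
levels = ["low", "medium", "high"]  # assumed order, lowest first
rank = {level: i for i, level in enumerate(levels)}
happiness = ["low", "high", "medium", "high", "medium"]
median_level = levels[median_low(rank[h] for h in happiness)]

# Two values: True/False behave like 1/0, so sum() counts the "yes" answers.
subscribed = [True, False, True, True]
yes_count = sum(subscribed)

print(frequencies.most_common(1), median_level, yes_count)
```

`median_low` is used so the result is always one of the actual ranks, which keeps the lookup back into `levels` valid even for an even number of answers.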
Numerical Data:
Numerical data is expressed as numbers, allowing quantification. It represents values like integers or real numbers. It is also called quantitative data. Examples include a person's height, product prices, IQ scores, the number of lessons in a course, etc.
Types:
1. Discrete Data
- Has a finite or countably infinite set of values.
- Values are distinct and separate.
- Examples: zip codes, words in a document collection, number of coin toss heads, students in a classroom, cars in a showroom.
- Often represented as integer variables.
- Analyzed using mean, median, quartiles, box plots, and histograms.
2. Continuous Data
- Cannot be counted but can be measured.
- Represents measurements.
- Examples: market share price, height/weight of a person, amount of rainfall, car speed, Wi-Fi frequency.
- Can be divided into meaningful parts.
- Has real numbers as attribute values.
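The contrast between counted and measured values can be sketched as follows. The classroom counts and heights are hypothetical, and `statistics.quantiles` (available from Python 3.8) is used here with its default method, which is only one of several conventions for computing quartiles.

```python
from statistics import mean, median, quantiles

# Discrete: hypothetical counts of students per classroom (whole numbers only).
students_per_class = [25, 27, 28, 30, 30, 30, 32]
discrete_summary = {
    "mean": mean(students_per_class),
    "median": median(students_per_class),
    "quartiles": quantiles(students_per_class, n=4),
}
all_counts_are_integers = all(isinstance(n, int) for n in students_per_class)

# Continuous: hypothetical heights in metres (real-valued measurements).
heights_m = [1.62, 1.75, 1.68, 1.80]
# Measurements can be divided into meaningful parts, e.g. metres to centimetres.
heights_cm = [round(h * 100, 1) for h in heights_m]

print(discrete_summary)
print(heights_cm)
```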
Types of Continuous Data:
a. Interval Data
- Categorized, ranked, and evenly spaced.
- Values have order and can be positive, zero, or negative.
- Allows comparison and quantification of differences.
- Examples: temperatures in Celsius or Fahrenheit, calendar dates.
b. Ratio Data
- Numeric attribute with an inherent zero-point.
- Values can be multiples or ratios of one another.
- Values are ordered, and the differences between them are meaningful.
- Examples: Kelvin temperature scale, years of experience, number of words.
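A short sketch of why the zero point matters: differences are the same on the Celsius and Kelvin scales, but ratios are meaningful only on Kelvin, where zero is a true zero. The two temperatures and the helper `celsius_to_kelvin` are chosen for illustration.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert a Celsius temperature to Kelvin."""
    return c + 273.15

morning_c, noon_c = 10.0, 20.0

# Interval scale: differences are meaningful and agree across both scales.
difference_c = noon_c - morning_c
difference_k = celsius_to_kelvin(noon_c) - celsius_to_kelvin(morning_c)

# Dividing Celsius values gives 2.0, but noon is not "twice as hot":
# the Celsius zero is arbitrary. The Kelvin ratio is the meaningful one.
naive_ratio = noon_c / morning_c
kelvin_ratio = celsius_to_kelvin(noon_c) / celsius_to_kelvin(morning_c)

print(difference_c, round(difference_k, 2), naive_ratio, round(kelvin_ratio, 3))
```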