Data Modeling, Featurization, and Visualization
1. What is Data Modeling?
Definition:
Data modeling is the process of building a mathematical or logical structure that represents data and
the relationships within it, typically so that outcomes can be predicted from input features.
Tools/Libraries:
- Python: scikit-learn, statsmodels
- R: caret, glm() (a function in R's built-in stats package)
Example:
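from sklearn.linear_model import LinearRegression
# One feature per sample (2-D input) and the matching targets; here y = 10 * x
X = [[5], [10], [15]]
y = [50, 100, 150]
model = LinearRegression()
model.fit(X, y)  # learn the slope and intercept from the data
print(model.predict([[20]]))  # Output: approximately [200.]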
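To make the "mathematical structure" concrete, here is a minimal sketch of what a one-feature linear fit computes, written without scikit-learn. The helper name fit_line is hypothetical, and the closed-form least-squares formulas are used only for illustration; scikit-learn's own solver may work differently internally.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Same data as the example above: y = 10 * x
slope, intercept = fit_line([5, 10, 15], [50, 100, 150])
prediction = slope * 20 + intercept
print(slope, intercept, prediction)  # 10.0 0.0 200.0
```

The learned slope and intercept are the "model": predicting for a new input is just evaluating slope * x + intercept.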
2. What is Featurization?
Definition:
Featurization is the process of converting raw data into meaningful input features that can be used
in machine learning models.
Tools/Libraries:
- pandas - for data manipulation
- scikit-learn - for encoding, scaling, etc.
- NLTK, spaCy - for text featurization
Examples:
Numerical Scaling:
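from sklearn.preprocessing import MinMaxScaler
import numpy as np
# One column of values to rescale (each row is a sample)
data = np.array([[1], [10], [20]])
scaler = MinMaxScaler()  # default feature_range is (0, 1)
# 1 maps to 0, 20 maps to 1, and 10 maps to 9/19 (about 0.474)
print(scaler.fit_transform(data))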
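The scaling above follows the formula x' = (x - min) / (max - min). As a rough sketch of that arithmetic, the hypothetical helper min_max_scale below applies it to a plain Python list; the handling of a constant column is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: return zeros to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([1, 10, 20])
print(scaled)  # first value is 0.0, last is 1.0, middle is 9/19
```

Scaling like this keeps features with large raw ranges from dominating features with small ones.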
Text to Features:
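from sklearn.feature_extraction.text import CountVectorizer
text = ["I love data", "Data is power"]
# Defaults: lowercases the text and ignores one-character tokens such as "I"
vectorizer = CountVectorizer()
# One row per document, one column per vocabulary word (data, is, love, power)
print(vectorizer.fit_transform(text).toarray())  # rows: [1 0 1 0] and [1 1 0 1]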
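Categorical Encoding:
The tools list mentions encoding, so here is a minimal sketch of one-hot encoding in plain Python. The helper one_hot and the sample colors are made up for illustration; in practice scikit-learn's OneHotEncoder or pandas' get_dummies would typically be used instead.

```python
def one_hot(values):
    """Return (categories, rows); each row is a 0/1 indicator vector."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # mark the column that matches this value
        rows.append(row)
    return categories, rows

colors = ["red", "green", "red", "blue"]  # made-up sample data
categories, encoded = one_hot(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Each category gets its own column, so the model never reads an artificial ordering into the labels.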
3. What is Data Visualization?
Definition:
Data visualization is the graphical representation of data. It makes patterns, trends, and outliers
easier to see than they are in raw numbers.
Tools/Libraries:
- matplotlib
- seaborn
- plotly
Example:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
plt.plot(x, y)  # draw a line through the (x, y) points
plt.title("Simple Line Chart")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()  # open the chart window (or render inline in a notebook)
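A line chart shows a trend; a histogram is a common choice for looking at a distribution. The sketch below uses made-up sample values and, as an assumption for running without a display, selects the non-interactive Agg backend and saves the figure to an in-memory buffer instead of calling plt.show().

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

values = [12, 15, 15, 18, 21, 22, 22, 23, 30, 41]  # made-up sample data

fig, ax = plt.subplots()
# hist returns the count per bin, the bin edges, and the drawn bars
counts, bin_edges, _ = ax.hist(values, bins=4)
ax.set_title("Simple Histogram")
ax.set_xlabel("Value")
ax.set_ylabel("Count")

# Render the figure as PNG bytes in memory rather than opening a window
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
print(counts)
```

In an interactive session, the backend line and the buffer can be dropped and plt.show() used as in the line chart example.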
Summary Table
| Concept       | Definition                            | Libraries Used                    | Example Use Case                 |
|---------------|---------------------------------------|-----------------------------------|----------------------------------|
| Data Modeling | Building predictive structures/models | scikit-learn, statsmodels         | Predicting sales or outcomes     |
| Featurization | Converting raw data into features     | pandas, scikit-learn, NLTK, spaCy | Scaling, encoding, text features |
| Visualization | Drawing plots to show patterns        | matplotlib, seaborn, plotly       | Trend or distribution analysis   |