Showing posts with label eda. Show all posts

Friday

Retail Demand Forecasting

 

Photo by RDNE Stock project on Pexels


Demand forecasting is a critical component of supply chain management. This solution uses historical data and machine learning algorithms to predict future demand.

Data Requirements

Historical sales data (3-5 years)
Seasonal data (e.g., holidays, promotions)
Product information (e.g., categories, subcategories)
External data (e.g., weather, economic indicators)

Data Preprocessing

Data cleaning: Handle missing values, outliers, and data inconsistencies.

Data transformation: Convert data into suitable formats for analysis.

Feature engineering: Extract relevant features from data, such as:

Time-based features (e.g., day of week, month)

Seasonal features (e.g., holiday indicators)

Product-based features (e.g., category, subcategory)
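The three feature types listed above can be sketched in pandas roughly as follows. This is a minimal illustration rather than a complete pipeline: the column names (`date`, `category`, `sales`), the toy values, and the holiday list are all made-up assumptions.

```python
import pandas as pd

# Hypothetical toy sales data; column names and values are assumptions.
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-12-24", "2023-12-25", "2023-12-26", "2023-12-27"]),
    "category": ["toys", "toys", "food", "food"],
    "sales": [120, 80, 60, 75],
})

# Time-based features.
df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["month"] = df["date"].dt.month

# Seasonal feature: flag rows that fall on an (assumed) holiday list.
holidays = pd.to_datetime(["2023-12-25"])
df["is_holiday"] = df["date"].isin(holidays).astype(int)

# Product-based feature: one-hot encode the category column.
df = pd.get_dummies(df, columns=["category"], prefix="cat")

print(df)
```

In a real project the holiday list would come from a calendar source, and product features would usually be joined in from a product master table.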

Model Selection

Choose a suitable algorithm based on data characteristics and performance metrics:
Traditional methods:

ARIMA (AutoRegressive Integrated Moving Average)

Exponential Smoothing (ES)
Naive Methods (e.g., moving average)
Machine learning methods:
Linear Regression
Decision Trees
Random Forest
LSTM (Long Short-Term Memory) networks
Prophet (Facebook's open-source forecasting tool)
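Before reaching for the heavier models, it can help to have a simple baseline to compare against. The sketch below shows a moving-average forecast and simple exponential smoothing using only pandas; the sales figures, the window size, and the smoothing factor `alpha` are made-up assumptions.

```python
import pandas as pd

# Hypothetical weekly sales figures (made-up numbers).
sales = pd.Series([100, 120, 90, 110, 130, 105], dtype=float)

# Naive baseline: forecast the next period as the mean of the last 3 observations.
window = 3
moving_avg_forecast = sales.tail(window).mean()

# Simple exponential smoothing: s_t = alpha * y_t + (1 - alpha) * s_{t-1}.
# With adjust=False, pandas' ewm applies this recursion, starting from s_0 = y_0.
alpha = 0.5
smoothed = sales.ewm(alpha=alpha, adjust=False).mean()
ses_forecast = smoothed.iloc[-1]

print(f"Moving-average forecast: {moving_avg_forecast:.2f}")
print(f"Exponential smoothing forecast: {ses_forecast:.2f}")
```

A more complex model is usually only worth deploying if it clearly beats baselines like these on held-out data.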

Model Evaluation

Assess model performance using metrics:
Mean Absolute Error (MAE)
Mean Absolute Percentage Error (MAPE)
Root Mean Squared Error (RMSE)
Coefficient of Determination (R-squared)
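As a rough illustration, the four metrics above can be computed directly with NumPy. The actual and predicted values here are made-up numbers chosen only to show the arithmetic.

```python
import numpy as np

# Hypothetical actual and predicted demand (made-up numbers).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 380.0])

errors = y_true - y_pred

# Mean Absolute Error: average size of the errors, in sales units.
mae = np.mean(np.abs(errors))

# Mean Absolute Percentage Error: undefined when y_true contains zeros.
mape = np.mean(np.abs(errors / y_true)) * 100

# Root Mean Squared Error: penalizes large errors more heavily than MAE.
rmse = np.sqrt(np.mean(errors ** 2))

# R-squared: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}, MAPE: {mape:.2f}%, RMSE: {rmse:.2f}, R-squared: {r2:.3f}")
```

Note that MAPE breaks down for products with zero sales in some periods, which is common in retail data, so MAE or RMSE may be safer in that case.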

Model Implementation

Train the selected model on historical data.
Tune hyperparameters for optimal performance.
Deploy the model in a production-ready environment.

Model Deployment

Integrate with existing ERP or supply chain systems.
Schedule regular updates to incorporate new data.
Provide user-friendly interface for stakeholders.

Solution Architecture

Data Ingestion: Load historical data into a data warehouse (e.g., AWS Redshift).
Data Processing: Use a data processing framework (e.g., Apache Spark).
Model Training: Train models using a machine learning framework (e.g., scikit-learn, TensorFlow).
Model Deployment: Deploy models using a containerization platform (e.g., Docker).
User Interface: Create a web-based interface using a framework (e.g., Flask, Django).

Tools and Technologies

Data visualization: Tableau, Power BI, or D3.js
Data preprocessing: Pandas, NumPy
Machine learning: scikit-learn, TensorFlow, PyTorch
Data warehouse: AWS Redshift, Google BigQuery
Containerization: Docker
Cloud platform: AWS, Google Cloud, Azure

Step-by-Step Implementation

Step 1: Data Collection and Preprocessing

Collect historical sales data

Clean and preprocess data

Transform data into suitable formats

Step 2: Feature Engineering

Extract relevant features from data
Create seasonal and time-based features

Step 3: Model Selection and Training
Choose suitable algorithm
Train model on historical data
Tune hyperparameters

Step 4: Model Evaluation
Assess model performance using metrics
Compare models and select best performer

Step 5: Model Deployment
Integrate with existing systems
Schedule regular updates
Provide user-friendly interface

Step 6: Monitoring and Maintenance
Monitor model performance
Update model with new data
Refine model as needed

Timeline

Data collection and preprocessing: 2 weeks
Feature engineering: 1 week
Model selection and training: 4 weeks
Model evaluation: 2 weeks
Model deployment: 4 weeks
Monitoring and maintenance: Ongoing

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Load historical sales data
data = pd.read_csv('sales_data.csv')

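# Handle missing values (fill numeric columns with their column means;
# numeric_only=True avoids errors on non-numeric columns such as 'date')
data.fillna(data.mean(numeric_only=True), inplace=True)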

# Convert date column to datetime
data['date'] = pd.to_datetime(data['date'])

# Extract relevant features
data['day_of_week'] = data['date'].dt.dayofweek
data['month'] = data['date'].dt.month

# Drop unnecessary columns
data.drop(['date', 'product_id'], axis=1, inplace=True)

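# Split data into training and testing sets
# Note: a random split mixes past and future rows; for time-ordered sales data,
# a chronological split gives a more realistic estimate of forecast error.
X = data.drop('sales', axis=1)
y = data['sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)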

# Scale data using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test_scaled)

# Evaluate model
mse_lr = mean_squared_error(y_test, y_pred_lr)
print(f'Linear Regression MSE: {mse_lr:.2f}')

# Train Random Forest Regressor model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test_scaled)

# Evaluate model
mse_rf = mean_squared_error(y_test, y_pred_rf)
print(f'Random Forest Regressor MSE: {mse_rf:.2f}')

# Perform hyperparameter tuning using GridSearchCV
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train_scaled, y_train)

# Print best parameters and score
# (best_score_ is the negative MSE, so flip the sign to report the MSE)
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best CV MSE: {-grid_search.best_score_:.2f}')
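Because sales data is ordered in time, one option (not something the code above does) is to validate only on periods that come after the training data. The following is a minimal sketch using scikit-learn's `TimeSeriesSplit`, with a small made-up array standing in for real, date-sorted features.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten time-ordered observations (placeholder for real features sorted by date).
X = np.arange(10).reshape(-1, 1)

# Each fold trains on an initial stretch and tests on the rows that follow it.
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

A splitter like this can be passed as the `cv` argument of `GridSearchCV` in place of `cv=5`, provided the rows are sorted by date first.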


Exploratory Data Analysis Topics

 

Photo by Clay Banks on Unsplash

Five main topics in EDA are:

  • Descriptive statistics
  • Univariate analysis
  • Bivariate analysis
  • Multivariate analysis
  • Dimensionality reduction

Descriptive statistics are a set of methods used to summarize and describe the main features of a dataset, such as its central tendency, variability, and distribution. Some of the most common descriptive statistics include:

  • Mean: The mean is the average of all the values in a dataset.
  • Median: The median is the middle value in a dataset, when all the values are sorted from least to greatest.
  • Mode: The mode is the most frequent value in a dataset.
  • Range: The range is the difference between the largest and smallest values in a dataset.
  • Variance: The variance is the average squared deviation of the values from the mean; it measures how spread out the values are.
  • Standard deviation: The standard deviation is the square root of the variance, so it expresses the spread in the same units as the data.

Here is an example of code that calculates the mean, median, mode, range, variance, and standard deviation of a dataset:

import numpy as np
import pandas as pd


# Create a dataset.
data = np.random.randint(0, 100, 100)


# Calculate the mean.
mean = np.mean(data)


# Calculate the median.
median = np.median(data)


# Calculate the mode (most frequent value; the smallest one if there are ties).
mode = pd.Series(data).mode()[0]


# Calculate the range (named data_range to avoid shadowing the built-in range).
data_range = np.max(data) - np.min(data)


# Calculate the variance (population variance; pass ddof=1 for the sample variance).
variance = np.var(data)


# Calculate the standard deviation.
standard_deviation = np.std(data)


# Print the results.
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Range:", data_range)
print("Variance:", variance)
print("Standard deviation:", standard_deviation)

Univariate analysis is a statistical method that is used to analyze a single variable. Univariate analysis can be used to describe the distribution of a variable, to identify outliers, and to test hypotheses about the variable. Some of the most common univariate analysis methods include:

  • Frequency distribution: A frequency distribution shows the number of times each value in a variable appears.
  • Histogram: A histogram is a graphical representation of a frequency distribution.
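  • Boxplot: A boxplot shows the distribution of a variable, including the median, quartiles, and outliers (some variants also mark the mean).
  • QQ plot: A QQ (quantile-quantile) plot is a graphical method for comparing two distributions, typically a sample against a theoretical distribution such as the normal.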

Here is an example of code that creates a frequency distribution and a histogram of a variable:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# Create a dataset.
data = np.random.randint(0, 100, 100)


# Create a frequency distribution (count of each distinct value).
frequency_distribution = pd.Series(data).value_counts()
print(frequency_distribution)


# Create a histogram.
plt.hist(data)
plt.show()
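The numbers behind a boxplot can also be computed directly, which is a common way to flag outliers. The sketch below uses a small made-up sample and the usual 1.5 × IQR convention; the threshold is a rule of thumb, not a fixed standard.

```python
import numpy as np

# Small made-up sample with one obvious extreme value.
data = np.array([12, 15, 14, 10, 13, 16, 11, 95])

# Quartiles and interquartile range (IQR).
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Common boxplot convention: points beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("Q1:", q1, "Median:", median, "Q3:", q3)
print("Outliers:", outliers)
```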

Bivariate analysis is a statistical method that is used to analyze two variables. Bivariate analysis can be used to investigate the relationship between two variables, to identify factors that influence a variable, and to make predictions about a variable. Some of the most common bivariate analysis methods include:

  • Correlation: Correlation measures the strength of the relationship between two variables.
  • Regression: Regression analysis is a statistical method that can be used to predict the value of one variable based on the value of another variable.
  • Chi-squared test: The chi-squared test is a statistical test that can be used to determine if there is a significant relationship between two categorical variables.

Here is an example of code that calculates the correlation coefficient between two variables:

import numpy as np
import pandas as pd


# Create two variables.
variable_1 = np.random.randint(0, 100, 100)
variable_2 = np.random.randint(0, 100, 100)


# Calculate the correlation coefficient.
correlation_coefficient = np.corrcoef(variable_1, variable_2)[0, 1]


# Print the correlation coefficient.
print("Correlation coefficient:", correlation_coefficient)
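The other two bivariate methods listed above, regression and the chi-squared test, can be sketched in a similar way. This example assumes SciPy is installed and uses synthetic data: the known slope of 2, the column names `region` and `bought`, and the random categories are all made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

# Regression: fit y = slope * x + intercept by least squares.
x = np.arange(50, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)  # synthetic data with a known slope of 2
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope: {slope:.3f}, intercept: {intercept:.3f}")

# Chi-squared test of independence on two made-up categorical variables.
df = pd.DataFrame({
    "region": rng.choice(["north", "south"], size=200),
    "bought": rng.choice(["yes", "no"], size=200),
})
table = pd.crosstab(df["region"], df["bought"])  # contingency table of counts
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2: {chi2:.3f}, p-value: {p_value:.3f}, degrees of freedom: {dof}")
```

A small p-value would suggest the two categorical variables are related; here they are generated independently, so a large p-value is the expected outcome in most runs.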

Multivariate analysis is a statistical method that is used to analyze multiple variables. Multivariate analysis can be used to investigate the relationships between multiple variables, to identify factors that influence multiple variables, and to make predictions about multiple variables.

It can be done in several ways:

  • Principal component analysis (PCA): PCA is a statistical method that can be used to reduce the dimensionality of a dataset while preserving as much information as possible. PCA is often used in machine learning applications, such as image recognition and text classification.
  • Factor analysis (FA): FA is a statistical method that can be used to identify the underlying factors that influence a set of variables. FA is often used in psychology and sociology applications, such as personality testing and market research.
  • Linear discriminant analysis (LDA): LDA is a statistical method that can be used to classify observations into two or more groups. LDA is often used in medical applications, such as cancer diagnosis and drug discovery.
  • Logistic regression: Logistic regression is a statistical method that can be used to predict the probability of an event occurring. Logistic regression is often used in marketing applications, such as customer segmentation and lead scoring.

Principal component analysis (PCA)

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


# Create a dataset.
data = np.random.randint(0, 100, (100, 3))


# Create a PCA model.
pca = PCA(n_components=2)


# Fit the PCA model to the data.
pca.fit(data)


# Transform the data to the principal components.
principal_components = pca.transform(data)


# Print the principal components.
print(principal_components)
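To judge how much information the reduced representation keeps, scikit-learn's `PCA` exposes `explained_variance_ratio_`. The sketch below uses synthetic data in which the third column is almost a copy of the first (an assumption made purely for illustration), so two components should capture nearly all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: column 3 is column 1 plus a little noise, so it adds almost no new information.
a = rng.normal(size=100)
b = rng.normal(size=100)
data = np.column_stack([a, b, a + 0.01 * rng.normal(size=100)])

# Fit PCA and project the data onto the first two components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

# Fraction of the total variance captured by each component, and in total.
print(pca.explained_variance_ratio_)
print("Total explained:", pca.explained_variance_ratio_.sum())
```

With purely random, uncorrelated columns, as in the example above, the explained variance would be spread fairly evenly and dropping a component would lose real information.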

Factor analysis (FA)

import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis


# Create a dataset.
data = np.random.randint(0, 100, (100, 5))


# Create a FA model.
fa = FactorAnalysis(n_components=3)


# Fit the FA model to the data.
fa.fit(data)


# Transform the data to the factors.
factors = fa.transform(data)


# Print the factors.
print(factors)

Linear discriminant analysis (LDA)

import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


# Create a dataset.
data = np.random.randint(0, 100, (100, 2))
labels = np.random.randint(0, 2, 100)


# Create an LDA model.
lda = LinearDiscriminantAnalysis()


# Fit the LDA model to the data.
lda.fit(data, labels)


# Predict the labels for the data.
predicted_labels = lda.predict(data)


# Print the accuracy of the model.
print(lda.score(data, labels))

Logistic regression

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


# Create a dataset.
data = np.random.randint(0, 100, (100, 2))
labels = np.random.randint(0, 2, 100)


# Create a logistic regression model.
logistic_regression = LogisticRegression()


# Fit the logistic regression model to the data.
logistic_regression.fit(data, labels)


# Predict the labels for the data.
predicted_labels = logistic_regression.predict(data)


# Print the accuracy of the model.
print(logistic_regression.score(data, labels))

There are many good tutorials on the above subjects; this post is meant to give a quick overview with an example of each.

I am a Software Architect | AI, Data Science, IoT, Cloud ⌨️ 👨🏽 💻

Developing software, system ….. for more than 26 years. Thank you.