
Machine Learning - Statistics and Math Common Questions

1. What is the difference between supervised and unsupervised learning?

   - Supervised Learning: In supervised learning, the algorithm learns from labeled training data, where the input and corresponding output are provided. The goal is to learn a mapping function to make predictions on new, unseen data.

   - Unsupervised Learning: Unsupervised learning involves learning patterns and relationships from unlabeled data. It includes clustering (grouping similar data points) and dimensionality reduction (reducing the number of features while preserving important information).
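
A minimal sketch of the contrast, assuming scikit-learn and its built-in iris dataset purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are used to fit a classifier that predicts them on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("supervised test accuracy:", clf.score(X_test, y_test))

# Unsupervised: the labels are ignored; K-Means groups rows by feature similarity alone.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("first ten cluster assignments:", kmeans.labels_[:10])
```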


2. Explain the bias-variance trade-off in machine learning.

   - Bias: Bias refers to the error due to overly simplistic assumptions in the learning algorithm, leading to underfitting. High bias can cause the model to miss relevant relations between the features and the target.

   - Variance: Variance is the error due to excessive model complexity, leading to overfitting. High variance makes the model overly sensitive to noise in the training data.
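
A rough illustration of the trade-off, assuming scikit-learn and a synthetic sine dataset chosen purely for demonstration: the same data are fit with an overly simple model (high bias) and an overly flexible one (high variance), and cross-validated error is compared:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)

for degree in (1, 4, 15):  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    print(f"degree {degree:2d}: cross-validated MSE = {-scores.mean():.3f}")
```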


3. What is regularization?

   - Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. It discourages the model from fitting the noise in the training data. Common regularization techniques include L1 (Lasso) and L2 (Ridge) regularization.
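
A minimal sketch of L1 vs. L2 regularization with scikit-learn; the synthetic data and alpha values are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: drives many coefficients exactly to zero

print("non-zero coefficients -> OLS:", (ols.coef_ != 0).sum(),
      "Ridge:", (ridge.coef_ != 0).sum(),
      "Lasso:", (lasso.coef_ != 0).sum())
```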


4. What is the curse of dimensionality?

   - The curse of dimensionality refers to the challenges that arise when working with high-dimensional data. As the number of dimensions increases, the data becomes sparse, and distances between points become less meaningful. This can lead to increased computation time and poor model performance.
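
A quick numerical illustration of one symptom, distance concentration, using NumPy (sample size and dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(500, d))
    # Distances from the first point to all others; the ratio of nearest to
    # farthest approaches 1 as the dimension grows, so "near" loses meaning.
    dists = np.linalg.norm(points - points[0], axis=1)[1:]
    print(f"d={d:4d}  min/max distance ratio = {dists.min() / dists.max():.3f}")
```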


5. Explain ROC and AUC in the context of binary classification.

   - ROC Curve (Receiver Operating Characteristic): It's a graphical representation of the performance of a binary classification model at different threshold settings. It plots the true positive rate against the false positive rate.

   - AUC (Area Under the Curve): AUC is the area under the ROC curve. It quantifies the model's ability to distinguish between positive and negative classes. A higher AUC indicates better model performance.
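
A minimal sketch of computing the ROC points and the AUC with scikit-learn; the model and synthetic dataset are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, scores))
```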


6. What is the difference between correlation and covariance?

   - Covariance: Covariance measures the degree to which two variables change together. A positive covariance indicates that as one variable increases, the other also tends to increase. Its magnitude depends on the scales of the variables.

   - Correlation: Correlation is a standardized version of covariance that measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear relationship.
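
A small NumPy example (with toy data) showing that covariance is scale-dependent while correlation is not:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)    # y moves with x, plus noise

print("covariance :", np.cov(x, y)[0, 1])        # depends on the units of x and y
print("correlation:", np.corrcoef(x, y)[0, 1])   # always in [-1, 1]

# Rescaling x changes the covariance but leaves the correlation unchanged.
print("cov(10x, y) :", np.cov(10 * x, y)[0, 1])
print("corr(10x, y):", np.corrcoef(10 * x, y)[0, 1])
```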


7. What is the Central Limit Theorem?

   - The Central Limit Theorem states that the distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the original distribution of the data (provided the population has a finite variance). This is a fundamental principle in statistics and is often used in hypothesis testing and confidence intervals.
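
A small simulation of the theorem, drawing samples from a heavily skewed exponential distribution; the choice of distribution and sample sizes is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (2, 10, 100):
    # 10,000 samples of size n from an exponential(1) population, reduced to their means.
    sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # The CLT predicts the means cluster around 1 with spread roughly 1/sqrt(n).
    print(f"n={n:3d}  mean of means = {sample_means.mean():.3f}, "
          f"std of means = {sample_means.std():.3f} (CLT predicts {1 / np.sqrt(n):.3f})")
```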


8. Explain gradient descent.

   - Gradient descent is an optimization algorithm used to minimize the loss function of a machine learning model. It involves iteratively adjusting the model's parameters in the direction of the steepest descent of the loss function. The learning rate determines the step size in each iteration.
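
A bare-bones sketch of gradient descent for least-squares linear regression with one feature; the learning rate, iteration count, and data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 100)   # true slope 3, intercept 1

w, b = 0.0, 0.0
lr = 0.1                                      # learning rate = step size per iteration
for _ in range(2000):
    y_hat = w * x + b
    # Gradients of the mean squared error loss with respect to w and b.
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w                          # step opposite the gradient (steepest descent)
    b -= lr * grad_b

print(f"learned slope = {w:.2f}, intercept = {b:.2f}")
```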


9. What is the difference between probability and statistics?

   - Probability: Probability starts from a known model and reasons forward, predicting the likelihood of future events under a given set of conditions. It's used to model uncertain events.

   - Statistics: Statistics involves collecting, analyzing, interpreting, presenting, and organizing data. It reasons backward from observed data to draw conclusions about the underlying population.
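
A tiny contrast using a coin-flip setup (assuming SciPy; the numbers are chosen purely for illustration): probability reasons forward from a known model, statistics reasons backward from observed data:

```python
from scipy import stats

# Probability: given a fair coin (p = 0.5), how likely are 8 or more heads in 10 flips?
p_eight_or_more = 1 - stats.binom.cdf(7, 10, 0.5)   # k=7, n=10, p=0.5
print("P(>= 8 heads | fair coin):", round(p_eight_or_more, 4))

# Statistics: having observed 8 heads in 10 flips, estimate p from the data.
p_hat = 8 / 10
print("estimated p from the observed flips:", p_hat)
```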


10. Explain the difference between correlation and causation.

   - Correlation: Correlation indicates a statistical relationship between two variables. However, correlation does not imply a cause-and-effect relationship.

   - Causation: Causation implies that changes in one variable directly cause changes in another variable. Establishing causation often requires rigorous experimentation and control.
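
A small simulation (with made-up variables) of correlation without causation: a hidden confounder Z drives both X and Y, so they correlate even though neither causes the other:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=1000)                   # hidden confounder
x = z + rng.normal(scale=0.5, size=1000)    # driven by z, not by y
y = z + rng.normal(scale=0.5, size=1000)    # driven by z, not by x

# X and Y are strongly correlated purely through Z.
print("corr(X, Y):", round(np.corrcoef(x, y)[0, 1], 2))
```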


