
Posts

Showing posts with the label inferential statistics

PDF & CDF

Students are often unclear about the difference between a #PDF (probability density function) and a #CDF (cumulative distribution function), so here is a concise explanation of both. Probability Density Function (PDF): A PDF is a mathematical function that describes the probability distribution of a continuous random variable. It represents the relative likelihood of the variable taking on a value within a given range. A PDF is always non-negative, and its integral over the entire range must equal 1. For a continuous random variable X, the PDF is denoted f(x), and the probability that X falls within a range [a, b] is the integral of the PDF over that range: P(a ≤ X ≤ b) = ∫[a, b] f(x) dx. Cumulative Distribution Function (CDF): A CDF is...
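To make the PDF/CDF relationship concrete, here is a minimal sketch in Python using scipy.stats for a standard normal variable; the interval [a, b] = [-1, 1] is an assumption chosen for illustration, not something from the post:

```python
from scipy.stats import norm
from scipy.integrate import quad

a, b = -1.0, 1.0  # assumed interval for illustration

# P(a <= X <= b) as the integral of the PDF over [a, b]
prob_integral, _ = quad(norm.pdf, a, b)

# The same probability from the CDF: F(b) - F(a)
prob_cdf = norm.cdf(b) - norm.cdf(a)

print(prob_integral, prob_cdf)  # both ~0.6827 for the standard normal
```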

Preparing a Dataset for Fine-Tuning Foundation Model

I am preparing a dataset for fine-tuning a foundation model on pathology lab data.

1. Dataset Collection
   - Sources: Gather data from pathology lab reports, medical journals, and any other relevant medical documents.
   - Format: Ensure that the data is in a readable format such as CSV, JSON, or plain text files.
2. Data Preprocessing
   - Cleaning: Remove irrelevant data, correct typos, and handle missing values.
   - Formatting: Convert the data into a format suitable for fine-tuning, usually pairs of input and output texts.
   - Example Format:
     - Input: "Patient exhibits symptoms of hyperglycemia."
     - Output: "Hyperglycemia"
3. Tokenization
   - Tokenize the text using the tokenizer that corresponds to the model you intend to fine-tune (a sketch follows below).

Example Code for Dataset Preparation Using Pandas and Transformers for Preprocessing

1. Install Required Libraries: ...
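As a rough sketch of the tokenization step, the snippet below uses the Hugging Face transformers tokenizer on the example pair above; the model name "bert-base-uncased" and the column names are placeholders for illustration, not choices made in the post:

```python
import pandas as pd
from transformers import AutoTokenizer

# Hypothetical input/output pairs in the format described above
df = pd.DataFrame({
    "input":  ["Patient exhibits symptoms of hyperglycemia."],
    "output": ["Hyperglycemia"],
})

# Use the tokenizer that matches the model you plan to fine-tune
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model

encodings = tokenizer(
    df["input"].tolist(),
    truncation=True,
    padding="max_length",
    max_length=64,
    return_tensors="pt",
)
print(encodings["input_ids"].shape)  # (1, 64)
```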

Calculating Vaccine Effectiveness with Bayes' Theorem

We can use Bayes' Theorem to estimate the probability that a vaccinated person still gets infected (meaning the vaccine had no effect for them) for both Covishield and Covaxin, considering a population of 1.4 billion individuals. Assumptions: We assume equal distribution of both vaccines in the population (700 million each). We focus on individual protection probabilities, not overall disease prevalence. Calculations: Covishield: Prior Probability P(Infected | Vaccinated): Assume 10% of the vaccinated population gets infected (no effect), making P(Infected | Vaccinated) = 0.1. Likelihood P(Not Infected | Vaccinated): This represents the probability of someone not being infected given they received Covishield. Given its 90% effectiveness, P(Not Infected | Vaccinated) = 0.9. Marginal Probability P(Not Infected): This needs calculation over both vaccinated and unvaccinated groups, by the law of total probability: P(Not Infected) = P(Not Infected | Vaccinated) * P(Vaccinated) + P(Not Infected | Unvaccinated) * P(Unvaccinated). Assuming 50% effectiveness for unvaccinated indivi...
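As a quick check of the arithmetic, the snippet below evaluates the law of total probability and the Bayes posterior with the post's assumed numbers (90% effectiveness for Covishield, 50% for unvaccinated individuals, and a 50/50 vaccinated split); all figures are illustrative assumptions, not measured data:

```python
# Assumed inputs from the post
p_vaccinated = 0.5                 # half the population vaccinated with Covishield
p_not_infected_given_vax = 0.9     # 90% effectiveness
p_not_infected_given_unvax = 0.5   # assumed 50% for the unvaccinated

# Law of total probability: P(Not Infected)
p_not_infected = (p_not_infected_given_vax * p_vaccinated
                  + p_not_infected_given_unvax * (1 - p_vaccinated))

# Bayes' Theorem: P(Vaccinated | Not Infected)
p_vax_given_not_infected = (p_not_infected_given_vax * p_vaccinated) / p_not_infected

print(p_not_infected)            # 0.70
print(p_vax_given_not_infected)  # ~0.643
```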

Running Model Inference on a Small Microcontroller

Photo by Google DeepMind

To improve model processing speed on a small microcontroller, consider the following strategies:

1. Optimize Your Model:
- Use a model that is optimized for edge devices. Frameworks like TensorFlow and PyTorch offer quantization techniques and smaller model architectures suited to resource-constrained devices.
- Prune your model to reduce its size by removing less important weights or neurons.
2. Accelerated Hardware:
- Utilize hardware accelerators if your Raspberry Pi has them. For example, the Raspberry Pi 4 and later versions have a VideoCore VI GPU, which can be used for certain AI workloads.
- Consider using a Neural Compute Stick (NCS) or a Coral USB Accelerator, which can significantly speed up inferencing f...
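As one concrete example of the quantization strategy mentioned above, here is a minimal sketch of post-training quantization with TensorFlow Lite; the tiny Keras model is a placeholder standing in for whatever trained model you actually have:

```python
import tensorflow as tf

# Placeholder model; substitute your trained model here
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Post-training quantization: shrinks the model and speeds up inference
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Write the quantized model for deployment on the device
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```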

Combine Several CSV Files for Time Series Analysis

Combining multiple CSV files for time series analysis typically involves concatenating or merging the data to create a single, unified dataset. Here's a step-by-step guide on how to do this in Python using the pandas library, assuming you have several CSV files in the same directory and each CSV file represents a time series for a specific period:

Step 1: Import the required libraries.
```python
import pandas as pd
import os
```
Step 2: List all CSV files in the directory.
```python
directory_path = "/path/to/your/csv/files"  # Replace with the path to your CSV files
csv_files = [file for file in os.listdir(directory_path) if file.endswith('.csv')]
```
Step 3: Initialize an empty DataFrame to store the combined data.
```python
combined_data = pd.DataFrame()
```
Step 4: Loop through the CSV files, read each one, and append its contents to the combined DataFrame.
```python
for file in csv_files:
    file_path = os.path.join(directory_path, file)
    df = pd.read_csv(file_path)
    # Append this file's rows to the running combined dataset
    combined_data = pd.concat([combined_data, df], ignore_index=True)
```
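For time series analysis, the combined rows usually also need a datetime column parsed and sorted chronologically. A small follow-up sketch, assuming a hypothetical 'timestamp' column in the CSVs:

```python
# 'timestamp' is an assumed column name; replace it with your actual time column
combined_data["timestamp"] = pd.to_datetime(combined_data["timestamp"])
combined_data = combined_data.sort_values("timestamp").reset_index(drop=True)
```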

Statistical Distributions

Different types of distributions. Bernoulli distribution: A Bernoulli distribution is a discrete probability distribution with two possible outcomes, usually called "success" and "failure." The probability of success is denoted by p and the probability of failure by q = 1 − p. The Bernoulli distribution can be used to model a variety of events, such as whether a coin toss results in heads or tails, whether a student passes an exam, or whether a customer makes a purchase. Uniform distribution: A uniform distribution assigns equal probability to all values within a specified range. In its continuous form it can model, for example, a random arrival time within an interval; its discrete form models events such as the roll of a fair die or the draw of a card from a deck. Binomial distribution: A binomial distribution is a discrete probability distribution that describes the number of successes in a sequence of n independent trials, eac...
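A quick way to build intuition for these distributions is to draw samples from each; the sketch below uses NumPy, with parameter values chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

bernoulli_samples = rng.binomial(n=1, p=0.3, size=10)      # Bernoulli is Binomial with n=1
uniform_samples = rng.uniform(low=0.0, high=1.0, size=10)  # continuous uniform on [0, 1)
binomial_samples = rng.binomial(n=20, p=0.3, size=10)      # successes in 20 trials

print(bernoulli_samples, uniform_samples, binomial_samples, sep="\n")
```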

Gini Index & Information Gain in Machine Learning

What is the Gini index? The Gini index is a measure of impurity in a set of data. It is calculated as one minus the sum of the squared probabilities of each class: Gini = 1 − Σ pᵢ². A lower Gini index indicates a purer set of data. What is information gain? Information gain is a measure of how much information is gained by splitting a set of data on a particular feature. It is calculated by comparing the entropy of the original set of data with the weighted entropy of the two child sets produced by the split. A higher information gain indicates that the feature is more effective at splitting the data. What is impurity? Impurity is a measure of how mixed the classes are in a set of data. A more impure set of data will have a higher Gini index. How are Gini index and information gain related? Both measure the quality of a split, but they are calculated differently. The Gini index is one minus the sum of the squared probabilities of each class, while information gain is calculated by comparing the entropy of the original ...
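To ground the definitions above, here is a minimal sketch of both measures in Python; the label arrays at the bottom are made-up examples:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Parent entropy minus the size-weighted entropy of the two child sets
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

parent = np.array([0, 0, 1, 1, 1, 1])  # made-up labels
left, right = parent[:2], parent[2:]   # a hypothetical split
print(gini(parent), information_gain(parent, left, right))
```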