Occurrence of Selection Bias:
Selection Bias occurs primarily during data collection. It arises when the sampled data is not representative of the overall population, leading to biased or skewed results in statistical analysis or machine learning models.
Key Points:
1️⃣ During Data Collection:
• If certain groups or categories are overrepresented or underrepresented in the collected dataset, selection bias occurs.
• Examples: Surveying only urban populations, excluding a demographic, or using a self-selected sample.
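To make the urban-only survey example concrete, here is a minimal simulation sketch. The population, the group means (60 and 40), and the sample sizes are made-up assumptions chosen only for illustration. It compares a random sample with a sample drawn only from the urban group.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: half urban, half rural, with different typical values.
population = (
    [("urban", random.gauss(60, 10)) for _ in range(5000)]
    + [("rural", random.gauss(40, 10)) for _ in range(5000)]
)


def mean_value(rows):
    """Average of the measured value over (group, value) rows."""
    return statistics.fmean(value for _, value in rows)


# Representative sample: every member has the same chance of being selected.
random_sample = random.sample(population, 500)

# Biased sample: only urban members can be selected.
urban_only = [row for row in population if row[0] == "urban"]
biased_sample = random.sample(urban_only, 500)

population_mean = mean_value(population)
random_error = abs(mean_value(random_sample) - population_mean)
biased_error = abs(mean_value(biased_sample) - population_mean)

print(f"population mean:       {population_mean:.1f}")
print(f"random sample error:   {random_error:.1f}")
print(f"urban-only sample error: {biased_error:.1f}")
```

Under these assumptions the urban-only estimate stays far from the population mean no matter how many urban rows are collected. A larger biased sample does not remove the bias.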
2️⃣ During Model Deployment:
• While selection bias originates from the dataset, its effects can propagate during model deployment.
• The model may perform poorly for underrepresented groups if the training data was biased.
• However, the bias itself is not created during deployment; it stems from the data used to train the model.
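The propagation described above can be sketched with a toy model. Everything here is a hypothetical setup: two groups "A" and "B" whose true decision cutoffs differ (0.3 and 0.7), a training set in which group B makes up only 5% of the rows, and a "model" that is just a single group-blind threshold. The point is that nothing goes wrong at deployment time itself; the poor accuracy for group B is inherited from the training data.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical ground truth: the correct cutoff differs by group.
TRUE_CUTOFF = {"A": 0.3, "B": 0.7}


def make_rows(group, n):
    """Generate n (feature, label) pairs for one group."""
    rows = []
    for _ in range(n):
        x = random.random()
        rows.append((x, x > TRUE_CUTOFF[group]))
    return rows


def accuracy(threshold, rows):
    """Fraction of rows where 'x > threshold' matches the true label."""
    return sum((x > threshold) == label for x, label in rows) / len(rows)


# Biased training set: group B is heavily underrepresented.
train = make_rows("A", 950) + make_rows("B", 50)

# "Model": one group-blind threshold chosen to maximize training accuracy.
candidates = [i / 100 for i in range(101)]
threshold = max(candidates, key=lambda t: accuracy(t, train))

# "Deployment": evaluate on fresh data from each group separately.
acc_a = accuracy(threshold, make_rows("A", 1000))
acc_b = accuracy(threshold, make_rows("B", 1000))

print(f"learned threshold: {threshold:.2f}")
print(f"accuracy on group A: {acc_a:.2f}")
print(f"accuracy on group B: {acc_b:.2f}")
```

Reporting accuracy per group, rather than one overall number, is what makes the problem visible: an overall score on data that mirrors the biased training mix would hide the weak performance on group B.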
Example:
• A hiring algorithm trained on historical data in which predominantly male candidates were hired can learn to favor male candidates. The bias comes from the non-representative training data (selection bias during data collection), not from the model deployment.
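A simple way to spot this kind of problem before training is to compare group shares in the dataset with a reference distribution. The sketch below uses made-up records and an assumed 50/50 reference share for the applicant pool; both are illustrative assumptions, not real figures.

```python
from collections import Counter

# Made-up historical records (illustrative only): gender of each past hire.
past_hires = ["male"] * 90 + ["female"] * 10

# Assumed reference shares for the qualified applicant pool (hypothetical).
reference_share = {"male": 0.5, "female": 0.5}

counts = Counter(past_hires)
total = sum(counts.values())

# Representation ratio: share in the data divided by share in the reference.
# Values well below 1 flag underrepresentation; well above 1, overrepresentation.
representation = {
    group: (counts[group] / total) / share
    for group, share in reference_share.items()
}

for group, ratio in representation.items():
    print(f"{group}: representation ratio {ratio:.2f}")
```

The check is only as good as the reference distribution, so the choice of reference (applicant pool, general population, or something else) has to be justified separately.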
Conclusion:
Selection Bias occurs during data collection, when the dataset is not representative of the population; its effects often become visible only during model deployment. Representative sampling (for example, random or stratified sampling) and checking that each group is adequately represented in the dataset help mitigate this bias.
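When the data has already been collected, one possible mitigation is to reweight it so that each group counts in proportion to its population share. The sketch below is a minimal illustration under stated assumptions: the population shares are taken as known, and the group means (60 and 40) and sample sizes are made up.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical population shares, assumed known (for example from a census).
population_share = {"urban": 0.5, "rural": 0.5}
group_mean = {"urban": 60.0, "rural": 40.0}  # made-up group means

# Biased sample: 900 urban rows but only 100 rural rows.
sample = (
    [("urban", random.gauss(group_mean["urban"], 10)) for _ in range(900)]
    + [("rural", random.gauss(group_mean["rural"], 10)) for _ in range(100)]
)

# Share of each group in the sample as collected.
sample_share = {
    g: sum(1 for group, _ in sample if group == g) / len(sample)
    for g in population_share
}

# Weight each row by (population share / sample share) of its group.
weights = [population_share[g] / sample_share[g] for g, _ in sample]
values = [v for _, v in sample]

naive_mean = statistics.fmean(values)
weighted_mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)

print(f"naive mean:    {naive_mean:.1f}")
print(f"weighted mean: {weighted_mean:.1f}")
```

Reweighting only helps when the underrepresented group is present in the data. If a group was excluded entirely during collection, no weight can recover it, and new data has to be collected.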