Step 1: Understanding the Question:
The question asks which measure of central tendency is most suitable for imputing (filling in) missing values in a categorical column named 'colour' in a dataset.
Step 2: Data Types and Central Tendency:
- Numerical Data: Quantitative measurements (like price or age) where mathematical calculations like averages can be performed. Useful measures: Mean, Median.
- Categorical Data: Qualitative groupings (like colour, brand, or gender) consisting of text labels instead of numbers.
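The contrast between the two data types can be sketched in a few lines of Python using only the standard library. The sample values and variable names (ages, colours) are hypothetical and chosen purely for illustration; they do not come from the dataset in the question.

```python
import statistics

# Hypothetical sample values, for illustration only.
ages = [23, 35, 31, 35, 40]                 # numerical column
colours = ["Red", "Blue", "Red", "Green"]   # categorical column

# Numerical data supports arithmetic, so mean and median are defined.
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)

# Text labels cannot be added and divided, so the same arithmetic fails.
try:
    sum(colours) / len(colours)
    labels_have_mean = True
except TypeError:
    labels_have_mean = False

# Counting occurrences needs no arithmetic on the labels themselves.
most_common_colour = statistics.mode(colours)

print(mean_age, median_age, labels_have_mean, most_common_colour)
```

Only the frequency count works on the text labels, which is the point developed in Step 3.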
Step 3: Detailed Explanation:
Let us evaluate the applicability of each measure of central tendency to the 'colour' column:
- Mean: Requires adding all the values and dividing by the total count. Since text values cannot be added arithmetically (e.g., "Red" + "Blue" + "Green"), the mean is undefined for categorical data.
- Median: Requires arranging the values in order and locating the middle element. Colour names are nominal: they have no inherent order, so there is no meaningful middle value and the median is undefined for this column. (A median is meaningful only for ordered categories, e.g., "low" < "medium" < "high".)
- Mode: Identifies the most frequently occurring value in the dataset. It is computed for text data by counting the frequency of each category (e.g., if "Red" is the most common colour, then "Red" is the mode).
Therefore, when cleaning a categorical column with missing values, a common practice is to replace each missing entry with the most frequent value, which is the Mode.
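The mode-based fill described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper name impute_with_mode, the use of None to mark a missing entry, and the sample colours are all assumptions made for the example.

```python
from collections import Counter

def impute_with_mode(values, missing=None):
    """Replace missing entries with the most frequent non-missing value.

    Hypothetical helper: `missing` is the marker used for absent entries.
    """
    observed = [v for v in values if v is not missing]
    if not observed:
        # Nothing to count, so the column is returned unchanged.
        return list(values)
    # most_common(1) returns [(value, count)] for the top category.
    mode_value, _count = Counter(observed).most_common(1)[0]
    return [mode_value if v is missing else v for v in values]

# Hypothetical 'colour' column with two missing entries.
colours = ["Red", "Blue", None, "Red", "Green", None, "Red"]
filled = impute_with_mode(colours)
print(filled)
```

If the data lives in a pandas DataFrame (here assumed to be named df), the same idea is usually written as df['colour'].fillna(df['colour'].mode()[0]). When two categories tie for the highest count, the choice between them is arbitrary and worth checking by hand.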
Step 4: Final Answer:
For a categorical column such as 'colour' that contains missing values, the Mode is the appropriate measure of central tendency for filling in the gaps.
Hence, option (C) is the correct choice.