Regarding Python: (Required)
Regarding Weka: (Optional)
java -Xmx768m -jar weka.jar
Note that although most attribute values in this dataset are represented using numeric values, this dataset contains attributes that are conceptually of different types:
1. name: Name of the country concerned
2. landmass: 1=N.America, 2=S.America, 3=Europe, 4=Africa, 5=Asia, 6=Oceania
3. zone: Geographic quadrant, based on Greenwich and the Equator; 1=NE, 2=SE, 3=SW, 4=NW
6. language: 1=English, 2=Spanish, 3=French, 4=German, 5=Slavic, 6=Other Indo-European, 7=Chinese, 8=Arabic, 9=Japanese/Turkish/Finnish/Magyar, 10=Others
7. religion: 0=Catholic, 1=Other Christian, 2=Muslim, 3=Buddhist, 4=Hindu, 5=Ethnic, 6=Marxist, 7=Others
18. mainhue: predominant colour in the flag (tie-breaks decided by taking the topmost hue, if that fails then the most central hue, and if that fails the leftmost hue)
29. topleft: colour in the top-left corner (moving right to decide tie-breaks)
30. botright: colour in the bottom-right corner (moving left to decide tie-breaks)

11. red: 0 if red absent, 1 if red present in the flag
12. green: same for green
13. blue: same for blue
14. gold: same for gold (also yellow)
15. white: same for white
16. black: same for black
17. orange: same for orange (also brown)
24. crescent: 1 if a crescent moon symbol present, else 0
25. triangle: 1 if any triangles present, 0 otherwise
26. icon: 1 if an inanimate image present (e.g., a boat), otherwise 0
27. animate: 1 if an animate image (e.g., an eagle, a tree, a human hand) present, 0 otherwise
28. text: 1 if any letters or writing on the flag (e.g., a motto or slogan), 0 otherwise

4. area: in thousands of square km
5. population: in round millions
8. bars: Number of vertical bars in the flag
9. stripes: Number of horizontal stripes in the flag
10. colours: Number of different colours in the flag
19. circles: Number of circles in the flag
20. crosses: Number of (upright) crosses
21. saltires: Number of diagonal crosses
22. quarters: Number of quartered sections
23. sunstars: Number of sun or star symbols
- (10 points) Univariate feature imputation using the SimpleImputer function, experimenting with different strategies (mean, median, most frequent, and constant).
- (10 points) Multivariate feature imputation using the IterativeImputer function with default parameters (except for the "missing_values" parameter, which should be equal to 0 or to NaN, depending on what convention you are using). You should allow this iterative imputation to predict the 0 (or NaN) values of area using all of the other attributes in the dataset.
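The two imputation items above can be tried out along the following lines. This is only a minimal sketch on a small made-up array (the three columns stand in for area, population and colours; the values are hypothetical, not taken from the Flags data). It assumes the NaN convention, so zeros in the area column are first converted to NaN; the fill value -1.0 for the constant strategy is an arbitrary choice.

```python
import numpy as np
# IterativeImputer is still experimental, so this enabling import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Hypothetical rows: [area, population, colours]; an area of 0 means "missing".
X = np.array([
    [648.0, 16.0, 5.0],
    [0.0, 3.0, 3.0],
    [2388.0, 20.0, 3.0],
    [0.0, 1.0, 5.0],
    [2777.0, 28.0, 2.0],
])
X[X[:, 0] == 0, 0] = np.nan  # mark missing areas only, leave other columns alone

# Univariate imputation: each strategy fills a column using that column alone.
simple_results = {}
for strategy in ("mean", "median", "most_frequent", "constant"):
    extra = {"fill_value": -1.0} if strategy == "constant" else {}
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy, **extra)
    simple_results[strategy] = imputer.fit_transform(X)
    print(strategy, simple_results[strategy][:, 0])

# Multivariate imputation: area is predicted from the other columns.
iterative = IterativeImputer(missing_values=np.nan, random_state=0)
X_iterative = iterative.fit_transform(X)
print("iterative", X_iterative[:, 0])
```

Converting only the area column to NaN avoids treating legitimate zeros in other attributes (for example, a binary colour flag) as missing, which would happen if missing_values=0 were applied to the whole table.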
- (5 points) standardization using the scale function. Read section 5.3.1 of the scikit-learn documentation on data transformations.
- (20 points) scaling to a range using the MinMaxScaler and the MaxAbsScaler functions (use the latter to see how it scales sparse data - read section 5.3.1.2); as well as robust_scale and RobustScaler, which handle outliers.
- (10 points) mapping to a uniform distribution using the QuantileTransformer and the quantile_transform functions. Read section 5.3.2.1
- (10 points) mapping to a Gaussian distribution using the PowerTransformer and the QuantileTransformer functions. Read section 5.3.2.2
- (5 points) normalization using the normalize function. Read section 5.3.3
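As a rough illustration of the transformations listed above, the sketch below applies each one to randomly generated, skewed data. The data is synthetic (a hypothetical stand-in for columns such as area and population), and the parameter values (n_quantiles=50, the Yeo-Johnson method, the L2 norm) are assumptions chosen for the example rather than settings required by the assignment.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    PowerTransformer,
    QuantileTransformer,
    RobustScaler,
    normalize,
    scale,
)

rng = np.random.default_rng(0)
# Synthetic, right-skewed, non-negative data: 50 rows, 2 columns.
X = rng.lognormal(mean=3.0, sigma=1.0, size=(50, 2))

X_std = scale(X)                             # zero mean, unit variance per column
X_minmax = MinMaxScaler().fit_transform(X)   # each column mapped onto [0, 1]
X_maxabs = MaxAbsScaler().fit_transform(X)   # divide by max |value|; zeros stay zero
X_robust = RobustScaler().fit_transform(X)   # centre on the median, scale by the IQR

# Map each column to a uniform, then to a Gaussian, distribution.
X_uniform = QuantileTransformer(
    n_quantiles=50, output_distribution="uniform"
).fit_transform(X)
X_gauss_q = QuantileTransformer(
    n_quantiles=50, output_distribution="normal"
).fit_transform(X)
X_gauss_p = PowerTransformer(method="yeo-johnson").fit_transform(X)

X_unit = normalize(X, norm="l2")             # each ROW rescaled to unit L2 norm

print("standardized means:", X_std.mean(axis=0).round(6))
print("min-max range:", X_minmax.min(), X_minmax.max())
print("row norms after normalize:", np.linalg.norm(X_unit, axis=1)[:3])
```

Note that normalize works row by row, whereas the scalers and distribution mappings work column by column.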
- (10 points) K-bins discretization using the KBinsDiscretizer function, experimenting with different encodings, strategies and numbers of bins, and inspecting the resulting n_bins_ and bin_edges_ values.
- (10 points) Feature binarization using the Binarizer function, experimenting with different threshold values.
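The discretization and binarization items above might be explored as in this sketch. The single column of values is hypothetical (loosely imitating a population attribute), and n_bins=3 and the three thresholds are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

# Hypothetical single-column data.
population = np.array(
    [[0.0], [1.0], [3.0], [8.0], [16.0], [28.0], [56.0], [60.0], [90.0], [119.0]]
)

# Compare binning strategies; "ordinal" encoding returns the bin index per row.
codes = {}
for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    codes[strategy] = disc.fit_transform(population).ravel()
    # n_bins_ and bin_edges_ are fitted attributes describing the result.
    print(strategy, disc.n_bins_, disc.bin_edges_[0].round(2), codes[strategy])

# One-hot encoding returns one indicator column per bin instead of an index.
onehot = KBinsDiscretizer(
    n_bins=3, encode="onehot-dense", strategy="uniform"
).fit_transform(population)
print("one-hot shape:", onehot.shape)

# Binarization: values strictly greater than the threshold become 1, others 0.
binarized = {}
for threshold in (1.0, 16.0, 100.0):
    binarized[threshold] = (
        Binarizer(threshold=threshold).fit_transform(population).ravel()
    )
    print(threshold, binarized[threshold])
```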
- In each of the problems below, use the original Flags dataset after converting discrete attributes to continuous (as in part II.2 above), but without any of the transformations in parts II.3-II.6.
- Study Section 2.4.5 of the textbook by Tan, Steinbach, Karpatne and Kumar for the definitions and formulas of correlation and covariance.
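For reference, covariance and Pearson correlation can be computed by hand as in the sketch below and checked against NumPy. It assumes the usual sample definitions with an n - 1 denominator (check this against the textbook's formulas), and the two vectors are hypothetical values, not actual Flags columns.

```python
import numpy as np

# Hypothetical paired observations (e.g. an area-like and a population-like attribute).
x = np.array([648.0, 29.0, 2388.0, 0.5, 1247.0])
y = np.array([16.0, 3.0, 20.0, 0.1, 7.0])

n = len(x)
x_dev = x - x.mean()
y_dev = y - y.mean()

# Sample covariance: average product of deviations, with n - 1 in the denominator.
cov_xy = np.sum(x_dev * y_dev) / (n - 1)

# Sample standard deviations, using the same n - 1 convention.
s_x = np.sqrt(np.sum(x_dev ** 2) / (n - 1))
s_y = np.sqrt(np.sum(y_dev ** 2) / (n - 1))

# Pearson correlation: covariance divided by the product of standard deviations.
corr_xy = cov_xy / (s_x * s_y)

print("covariance:", cov_xy, "numpy:", np.cov(x, y)[0, 1])
print("correlation:", corr_xy, "numpy:", np.corrcoef(x, y)[0, 1])
```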
- (5 points) (plain) Random sampling without replacement using uniform distribution;
- (5 points) (plain) Random sampling with replacement using uniform distribution;
- (5 points) Stratified random sampling without replacement using religion (attribute #7) as the target (or class) attribute.
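The three sampling schemes above can be sketched with the standard library alone, as below. The records are hypothetical (row id, religion code) pairs with four equally sized classes, and stratified_sample is a helper written for this example, not a library function. The sample size of 10 and the 30% fraction are arbitrary.

```python
import random
from collections import Counter, defaultdict

rng = random.Random(0)

# Hypothetical records: (row_id, religion_code), 10 rows for each of 4 classes.
records = [(i, i % 4) for i in range(40)]

# Plain random sampling WITHOUT replacement: no record can appear twice.
without_repl = rng.sample(records, k=10)

# Plain random sampling WITH replacement: the same record may be drawn repeatedly.
with_repl = rng.choices(records, k=10)


def stratified_sample(rows, key, fraction, rng):
    """Draw the same fraction from every class, without replacement."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(fraction * len(members)))  # at least one row per class
        sample.extend(rng.sample(members, k))
    return sample


stratified = stratified_sample(records, key=lambda r: r[1], fraction=0.3, rng=rng)

print("without replacement:", sorted(r[0] for r in without_repl))
print("with replacement:   ", sorted(r[0] for r in with_repl))
print("class counts in stratified sample:", Counter(r[1] for r in stratified))
```

With a pandas DataFrame, DataFrame.sample (with its replace argument) covers the first two cases, and scikit-learn's train_test_split accepts a stratify argument for the third.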
- (5 points) VarianceThreshold. Read section 1.13.1.
- SelectKBest, experimenting with the following score functions (passed as the score_func parameter). Read section 1.13.2.
- (5 points each) For regression: f_regression, mutual_info_regression
- (5 points each) For classification: chi2, f_classif, mutual_info_classif
- (10 points) Recursive Feature Elimination (RFE). Read section 1.13.3.
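The feature selection methods listed above might be exercised as follows. This is a sketch on synthetic data, not on the Flags dataset: one column is constructed to be informative, one is constant, and two are noise. The choice of k=1 and of LogisticRegression as the RFE estimator are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import (
    RFE,
    SelectKBest,
    VarianceThreshold,
    chi2,
    f_classif,
    f_regression,
    mutual_info_classif,
    mutual_info_regression,
)
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 150
y_class = rng.integers(0, 3, size=n)                    # synthetic class label
informative = 2 * y_class + rng.integers(0, 2, size=n)  # determines the class
X = np.column_stack([
    informative,
    rng.integers(0, 5, size=n),   # noise
    np.ones(n),                   # constant column
    rng.integers(0, 2, size=n),   # binary noise
]).astype(float)
y_reg = 3.0 * informative + rng.normal(0.0, 1.0, size=n)  # synthetic numeric target

# 1. VarianceThreshold removes the zero-variance (constant) column.
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X)
kept = vt.get_support(indices=True)
print("columns kept by VarianceThreshold:", kept)

# 2. SelectKBest with different score functions (chi2 needs non-negative features).
selected = {}
for name, func, target in [
    ("chi2", chi2, y_class),
    ("f_classif", f_classif, y_class),
    ("mutual_info_classif", mutual_info_classif, y_class),
    ("f_regression", f_regression, y_reg),
    ("mutual_info_regression", mutual_info_regression, y_reg),
]:
    selector = SelectKBest(score_func=func, k=1).fit(X_var, target)
    selected[name] = selector.get_support(indices=True)
    print(name, "->", selected[name])

# 3. RFE repeatedly drops the feature with the smallest model coefficients.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=1)
rfe.fit(X_var, y_class)
print("RFE ranking (1 = selected):", rfe.ranking_)
```

The constant column is removed first because univariate scores such as f_classif are undefined for a feature with zero variance.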
- (10 points) Look for an implementation of Correlation-based Feature Selection (CFS) in Python and experiment with it. See the Witten and Frank textbook slides (Chapter 7, Slides 5-6) and also Mark A. Hall's PhD thesis. See Section 2.4.6 of the textbook by Tan, Steinbach, Karpatne and Kumar for the definition and formulas of mutual information.
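To make the idea behind CFS concrete before looking for a full implementation, here is a simplified sketch of its subset merit, k * r_cf / sqrt(k + k(k-1) * r_ff), combined with a greedy forward search. It is an illustration under stated assumptions, not Hall's implementation: absolute Pearson correlation is used as a stand-in for the association measure (Hall's thesis uses measures such as symmetrical uncertainty for discrete attributes), the functions cfs_merit and cfs_forward are hypothetical helpers, and the data is synthetic.

```python
import numpy as np


def cfs_merit(X, y, subset):
    """Merit of a feature subset: high feature-class, low feature-feature correlation."""
    k = len(subset)
    # Mean absolute correlation between each chosen feature and the target.
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    # Mean absolute correlation over all pairs of chosen features.
    r_ff = np.mean([
        abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
        for i, a in enumerate(subset)
        for b in subset[i + 1:]
    ])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)


def cfs_forward(X, y):
    """Greedy forward search: add the feature that most improves the merit."""
    remaining = list(range(X.shape[1]))
    chosen, best = [], 0.0
    while remaining:
        merit, j = max((cfs_merit(X, y, chosen + [j]), j) for j in remaining)
        if merit <= best:
            break  # no candidate improves the merit any further
        chosen.append(j)
        remaining.remove(j)
        best = merit
    return chosen, best


rng = np.random.default_rng(0)
n = 500
x0 = rng.normal(size=n)
x1 = x0 + 0.1 * rng.normal(size=n)   # nearly a copy of x0 (redundant)
x2 = rng.normal(size=n)              # independent and informative
x3 = rng.normal(size=n)              # pure noise
X = np.column_stack([x0, x1, x2, x3])
y = x0 + x2 + 0.1 * rng.normal(size=n)

chosen, best = cfs_forward(X, y)
print("selected features:", chosen, "merit:", round(float(best), 3))
```

On this synthetic data the search should keep one of the two redundant columns together with the independent informative column, which is the behaviour the merit formula is designed to reward.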