MF3Classifier

Model Overview

This is a scikit-learn pipeline that classifies mutual fund performance from one numerical feature (AUM) and five categorical features. It chains preprocessing steps with a Random Forest classifier, which makes it suitable for tabular financial data.

Model Architecture

The model uses a two-branch preprocessing pipeline followed by a Random Forest classifier:

Preprocessing Pipeline

Numerical Features Branch
- Features: ['AUM']
- Transformation: StandardScaler
Categorical Features Branch
- Features: ['AMC', 'Fund Category', 'Sub-Sheme', 'Investment Type', 'Growth Option']
- Transformations:
  - OneHotEncoder (non-sparse output, handles unknown categories)
  - Feature Selection (SelectKBest with mutual_info_classif, k=30)

Classifier

Model: RandomForestClassifier
Key Parameters:
- n_estimators: 30
- max_depth: 20
- min_samples_split: 10
- min_samples_leaf: 5
- n_jobs: -1 (parallel processing)
- random_state: 42
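
As an illustration, the listed parameters map directly onto the `RandomForestClassifier` constructor. The sketch below fits the classifier on a synthetic numeric matrix from `make_classification` (an assumption standing in for the preprocessed fund features) and checks that the depth cap is respected.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=30,       # number of trees in the ensemble
    max_depth=20,          # cap on the depth of each tree
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf keeps at least 5 samples
    n_jobs=-1,             # use all available CPU cores
    random_state=42,       # reproducible bootstrapping and feature sampling
)

# Synthetic stand-in for the 31 preprocessed columns (1 numeric + 30 selected).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=31, n_informative=8, random_state=0
)
clf.fit(X_demo, y_demo)

n_trees = len(clf.estimators_)
deepest = max(tree.get_depth() for tree in clf.estimators_)
print(n_trees, deepest)
```

The `min_samples_split` and `min_samples_leaf` settings usually stop tree growth well before `max_depth` is reached on small datasets.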

Use Cases

- Mutual fund performance prediction
- Investment strategy optimization
- Portfolio management
- Risk assessment

Model Parameters

Preprocessing Configuration

Numerical Features:
- StandardScaler with default parameters
- Centers each feature to zero mean and scales it to unit variance
Categorical Features:
- OneHotEncoder:
  - handle_unknown: 'ignore'
  - sparse_output: False
  - dtype: numpy.float64
- Feature Selection:
  - Method: SelectKBest with mutual_info_classif
  - Number of features: 30

Random Forest Configuration

Tree Structure:
- Maximum depth: 20
- Minimum samples for split: 10
- Minimum samples per leaf: 5
Ensemble Settings:
- Number of trees: 30
- Features considered per split (max_features): sqrt (formerly 'auto')
- Bootstrap: True
- Criterion: gini

Technical Details

File Information

Model Type: Scikit-learn Pipeline
Last Updated: November 3, 2024

Input Features

Numerical Features:
- AUM (Assets Under Management)
Categorical Features:
- AMC (Asset Management Company)
- Fund Category
- Sub-Scheme
- Investment Type
- Growth Option

Limitations and Considerations

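- Feature selection uses mutual_info_classif, which scores each feature individually and can therefore miss features that are informative only in combination with others
- Only the 30 highest-scoring one-hot encoded columns are kept; all other category indicators are discarded
- Categories not seen during training are encoded as all zeros because of the 'ignore' setting in OneHotEncoder, so predictions for such funds may be less reliable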
