License: MIT
Dockerfile Commit Classification Model
This is a Logistic Regression model, enhanced with a rule-based system, for multi-label classification of Dockerfile-related commit messages. It combines machine learning with domain-specific rules to improve categorization accuracy.
Files
- logistic_model.joblib: Trained Logistic Regression model.
- tfidf_vectorizer.joblib: TF-IDF vectorizer for text preprocessing.
- label_binarizer.joblib: MultiLabelBinarizer for encoding/decoding labels.
Features
- Hybrid Approach: Combines machine learning with rule-based adjustments for better classification.
- Dockerfile-Specific Labels: Categorizes commit messages into predefined classes:
  - bug fix
  - code refactoring
  - feature addition
  - maintenance/other
  - Not enough information
- Multi-Label Support: Each commit message can belong to multiple categories.
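The hybrid approach can be pictured as a small post-processing step applied to the labels the classifier returns. The sketch below is only an illustration: the keyword patterns, the `apply_rules` helper, and the fallback handling are hypothetical assumptions, not the rules actually shipped with this model.

```python
import re

# Hypothetical keyword rules. These patterns are illustrative assumptions,
# not the rule set used by the published model.
RULES = {
    "bug fix": re.compile(r"\b(fix(es|ed)?|bug|issue)\b", re.IGNORECASE),
    "feature addition": re.compile(r"\b(add(s|ed)?|introduce[sd]?)\b", re.IGNORECASE),
    "code refactoring": re.compile(r"\b(refactor\w*|simplif\w*)\b", re.IGNORECASE),
}
FALLBACK_LABEL = "Not enough information"


def apply_rules(message, model_labels):
    """Return the model's labels plus any labels triggered by keyword rules."""
    labels = set(model_labels)
    for label, pattern in RULES.items():
        if pattern.search(message):
            labels.add(label)
    # Assumption: a concrete label supersedes the fallback label.
    if len(labels) > 1:
        labels.discard(FALLBACK_LABEL)
    # Assumption: a message with no labels at all gets the fallback label.
    if not labels:
        labels.add(FALLBACK_LABEL)
    return sorted(labels)
```

With rules like these, a message the classifier left unlabeled can still pick up a label from its wording, while labels the classifier did assign are kept.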
How to Use
To use this model, load the files and preprocess your data as follows:
from joblib import load
# Load the model and preprocessing artifacts
model = load("logistic_model.joblib")
tfidf_vectorizer = load("tfidf_vectorizer.joblib")
mlb = load("label_binarizer.joblib")
# Example usage
new_messages = [
    "Fixed an issue with the base image in Dockerfile",
    "Added multistage builds to reduce image size",
    "Updated Python version in Dockerfile to 3.10",
]
X_new_tfidf = tfidf_vectorizer.transform(new_messages)
# Predict the labels
predictions = model.predict(X_new_tfidf)
predicted_labels = mlb.inverse_transform(predictions)
# Print results
for msg, labels in zip(new_messages, predicted_labels):
    print(f"Message: {msg}")
    print(f"Predicted Labels: {', '.join(labels) if labels else 'No labels'}\n")
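For repeated use, the transform, predict, and decode steps above could be wrapped in one helper. This is a minimal sketch: `classify_messages` is a hypothetical name, and it assumes only that the three loaded objects expose the `transform`, `predict`, and `inverse_transform` methods used in the example above.

```python
def classify_messages(messages, model, vectorizer, binarizer):
    """Vectorize commit messages, predict, and decode the labels.

    Returns one dict per message with the keys "message" and "labels".
    """
    # Skip the pipeline entirely when there is nothing to classify.
    if not messages:
        return []
    features = vectorizer.transform(messages)
    predictions = model.predict(features)
    decoded = binarizer.inverse_transform(predictions)
    return [
        {"message": msg, "labels": list(labels)}
        for msg, labels in zip(messages, decoded)
    ]
```

Called as `classify_messages(new_messages, model, tfidf_vectorizer, mlb)`, it returns the same labels the loop above prints, in a form that is easier to pass on to other code.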