metadata
license: mit

Dockerfile Commit Classification Model

This is a Logistic Regression model, enhanced with a rule-based system, for multi-label classification of Dockerfile-related commit messages. It combines the machine learning predictions with domain-specific rules to improve categorization.

Files

  • logistic_model.joblib: Trained Logistic Regression model.
  • tfidf_vectorizer.joblib: TF-IDF vectorizer for text preprocessing.
  • label_binarizer.joblib: MultiLabelBinarizer for encoding/decoding labels.

Features

  • Hybrid Approach: Combines a machine learning classifier with rule-based adjustments.
  • Dockerfile-Specific Labels: Categorizes commit messages into predefined classes:
    • bug fix
    • code refactoring
    • feature addition
    • maintenance/other
    • Not enough information
  • Multi-Label Support: Each commit message can belong to multiple categories.
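This README does not list the rules themselves, so the following is only a minimal sketch of how rule-based adjustments could be layered on top of model output. The keyword table, the `apply_rules` helper, and the fallback behavior are all hypothetical illustrations, not the rules shipped with this model.

```python
# Hypothetical keyword rules. The real rules used with this model
# are not documented in this README.
KEYWORD_RULES = {
    "bug fix": ("fix", "bug", "error"),
    "feature addition": ("add", "introduce", "support"),
}
FALLBACK_LABEL = "Not enough information"


def apply_rules(message, predicted_labels):
    """Return the model's labels plus any labels triggered by keyword rules."""
    text = message.lower()
    labels = set(predicted_labels)

    # Add a label whenever one of its keywords appears in the message.
    for label, keywords in KEYWORD_RULES.items():
        if any(keyword in text for keyword in keywords):
            labels.add(label)

    # A specific label makes the fallback label redundant.
    if len(labels) > 1:
        labels.discard(FALLBACK_LABEL)

    # Nothing predicted and no rule fired: use the fallback label.
    if not labels:
        labels.add(FALLBACK_LABEL)

    return sorted(labels)


print(apply_rules("Fixed an issue with the base image", ()))
```

Plain substring matching is deliberately crude here (for instance, "add" also matches "address"); a real rule set would likely need word boundaries or regular expressions.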

How to Use

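To use this model, load the three files, vectorize your commit messages with the TF-IDF vectorizer, and decode the predictions with the label binarizer. The example requires joblib and scikit-learn to be installed: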

from joblib import load

# Load the model and preprocessing artifacts
model = load("logistic_model.joblib")
tfidf_vectorizer = load("tfidf_vectorizer.joblib")
mlb = load("label_binarizer.joblib")

# Example usage
new_messages = [
    "Fixed an issue with the base image in Dockerfile",
    "Added multistage builds to reduce image size",
    "Updated Python version in Dockerfile to 3.10"
]
X_new_tfidf = tfidf_vectorizer.transform(new_messages)

# Predict the labels
predictions = model.predict(X_new_tfidf)
predicted_labels = mlb.inverse_transform(predictions)

# Print results
for msg, labels in zip(new_messages, predicted_labels):
    print(f"Message: {msg}")
    print(f"Predicted Labels: {', '.join(labels) if labels else 'No labels'}\n")
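The example above returns hard label decisions. If per-label scores are wanted instead, one possible approach is to threshold the predicted probabilities. The sketch below assumes the loaded model implements `predict_proba` (scikit-learn's `LogisticRegression` and `OneVsRestClassifier` do, but how this particular model was wrapped is not stated here); the `labels_above_threshold` helper and the threshold value are hypothetical.

```python
def labels_above_threshold(probabilities, classes, threshold=0.5):
    """For each row of per-label probabilities, keep the (label, probability)
    pairs at or above the threshold, sorted from most to least likely."""
    results = []
    for row in probabilities:
        kept = [
            (label, float(p))
            for label, p in zip(classes, row)
            if p >= threshold
        ]
        kept.sort(key=lambda pair: pair[1], reverse=True)
        results.append(kept)
    return results


# With the artifacts loaded as in the example above (assumption: the model
# exposes predict_proba and returns one column per label in mlb.classes_):
# probabilities = model.predict_proba(X_new_tfidf)
# ranked = labels_above_threshold(probabilities, mlb.classes_, threshold=0.3)
```

Lowering the threshold trades precision for recall, which can be useful when a message should be reviewed under every plausible category.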