Sure, here's the content you provided formatted as a README file: # Resume Ranking App The Resume Ranking App is a Python application designed to rank and shortlist resumes based on their similarity to a given job description. The app utilizes various natural language processing (NLP) techniques and algorithms to process text data, extract information from PDF resumes, and calculate similarity scores between the job description and each resume. ## Algorithms and Techniques Used 1. **TF-IDF (Term Frequency-Inverse Document Frequency)**: The app uses TF-IDF, a feature extraction technique, to represent the text data (job description and resumes) as numerical vectors. It assigns weights to words based on their frequency in the document (TF) and inversely proportional to their frequency in the entire corpus (IDF). The `TfidfVectorizer` from `scikit-learn` is used to convert the text data into TF-IDF vectors. 2. **Cosine Similarity**: After representing the text data as TF-IDF vectors, the app calculates cosine similarity to measure the similarity between the job description and each resume. Cosine similarity calculates the cosine of the angle between two vectors, which represents their similarity. Higher cosine similarity values indicate higher similarity between the vectors. 3. **Text Preprocessing**: Before computing similarity scores, the text data undergoes preprocessing steps to remove noise and irrelevant information. The following preprocessing steps are applied to both the job description and resumes: - Tokenization: The text is split into individual words (tokens). - Lowercasing: All words are converted to lowercase to ensure case-insensitivity. - Stopword Removal: Commonly occurring English stopwords (e.g., "the", "and", "is") are removed from the text to reduce noise. - Stemming: Words are reduced to their base or root form (e.g., "running" to "run") using the Porter stemming algorithm. 4. **PDF Text Extraction**: The app utilizes the `PyPDF2` library to extract the content of PDF resumes. The extracted text is then cleaned to remove URLs, special characters, non-ASCII characters, etc. 5. **Regex Pattern Matching**: Regular expressions are used to extract candidate names from the resumes based on a specified regex pattern. 6. **Shortlisting**: Resumes with similarity scores above a specified threshold are shortlisted as potential matches for the job description. ## Getting Started To run the Resume Ranking App, follow these steps: 1. Install the required Python libraries by running: ``` pip install streamlit nltk scikit-learn PyPDF2 pdfminer.six ``` 2. Download the NLTK data by running the following code in Python: ```python import nltk nltk.download('punkt') nltk.download('stopwords') ``` 3. Run the app using the command: ``` streamlit run app.py ``` 4. The app will launch in your browser. You can upload the job description and resumes (in PDF format) using the provided file upload fields. 5. Click the "Submit" button to rank and shortlist the resumes based on similarity to the job description. ## Disclaimer This app provides an automated ranking and shortlisting process for resumes, but it is not a substitute for human judgment. The app's results are based on NLP techniques and algorithms and may not perfectly capture the best candidates. It is recommended to use the app's results as a starting point and perform further evaluations before making any final decisions. ## License The Resume Ranking App is licensed under the MIT License. Feel free to modify and use the code according to the terms of the license. --- _This README file provides an overview of the Resume Ranking App and instructions for running it. For detailed implementation and code, refer to the `app.py` file in the repository._