- 2024_main_document_lvl
- 2024_main_paragraph_lvl
- 2023_main_document_lvl
- 2023_main_paragraph_lvl
- Convert PDFs to embeddings
  - Paragraph level
  - Document level
- HNSW and K-Means for fast search
- K-Means graphs based on the topics
- Check for similarity within our own DB
  - Paragraph level
  - Document level
- Get the most important ones
- Get the unique sentences, such as the title and other content? The LLM can work this out.
- Search Google using the unique sentences --> get the top 3 results and run the same check again --> result
### 1. Data Input:
- **Input Data:** Collect a diverse dataset of academic papers, articles, or textual content from various sources.
- **Format:** Ensure the data is in a consistent and machine-readable format, such as plain text or a format compatible with your chosen NLP library.
### 2. Data Cleaning:
- **Text Cleaning:**
  - Remove metadata, formatting, and irrelevant details.
  - Handle special characters, punctuation, and stopwords.
- **Normalization:**
  - Convert text to lowercase to ensure uniformity.
- **Tokenization:**
  - Tokenize the text into words or subword tokens.
  - **Libraries:**
    - For Python, you can use NLTK or spaCy for tokenization.
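
The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not a replacement for NLTK or spaCy: the function name `clean_and_tokenize` and the tiny `STOPWORDS` set are assumptions made for the example, and whitespace splitting stands in for a real tokenizer.

```python
import string

# Tiny illustrative stopword list (an assumption); NLTK and spaCy ship fuller ones.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "on"}

# Map every ASCII punctuation character to a space so adjacent words stay separated.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_and_tokenize(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()                      # normalization
    text = text.translate(_PUNCT_TO_SPACE)   # punctuation handling
    tokens = text.split()                    # naive whitespace tokenization
    return [t for t in tokens if t not in stopwords]


print(clean_and_tokenize("The Similarity of Papers, in 2024!"))
# ['similarity', 'papers', '2024']
```

Subword tokenization, as mentioned above, would need a trained tokenizer and is out of scope for this sketch.
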
### 3. Embedding Generation:
- **Word Level Embeddings:**
  - Utilize pre-trained word embeddings like Word2Vec or GloVe.
  - **Libraries:**
    - For Word2Vec: Gensim library.
    - For GloVe: spaCy or Gensim.
- **Paragraph Level Embeddings:**
  - Aggregate word embeddings using techniques like averaging or using Doc2Vec.
  - **Libraries:**
    - Gensim for Doc2Vec.
- **Document Level Embeddings:**
  - Consider using the average of paragraph embeddings or more advanced models.
  - **Libraries:**
    - spaCy or transformers library for more advanced models.
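
The averaging route from word vectors to paragraph and document embeddings might look like the sketch below. The three-dimensional `TOY_VECTORS` table is made up for illustration; in practice the vectors would come from a pre-trained Word2Vec or GloVe model, and the helper names are assumptions.

```python
# Toy 3-dimensional word vectors (made up); real ones come from Word2Vec or GloVe.
TOY_VECTORS = {
    "neural": [1.0, 0.0, 0.0],
    "network": [0.0, 1.0, 0.0],
    "training": [0.0, 0.0, 1.0],
}


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def paragraph_embedding(tokens, word_vectors=TOY_VECTORS):
    """Average the vectors of known tokens; None if no token is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return mean_vector(known) if known else None


def document_embedding(paragraphs, word_vectors=TOY_VECTORS):
    """Average the paragraph embeddings, skipping paragraphs with no known tokens."""
    para_vecs = [paragraph_embedding(p, word_vectors) for p in paragraphs]
    para_vecs = [v for v in para_vecs if v is not None]
    return mean_vector(para_vecs) if para_vecs else None


doc = [["neural", "network"], ["training", "unknown"]]
print(paragraph_embedding(doc[0]))  # [0.5, 0.5, 0.0]
print(document_embedding(doc))      # [0.25, 0.25, 0.5]
```

Plain averaging ignores word order, which is one reason Doc2Vec or a transformer-based encoder may be worth the extra cost.
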
### 4. Pairwise Comparison:
- **Similarity Measures:**
  - Calculate cosine similarity, Jaccard similarity, or other relevant measures.
  - **Libraries:**
    - scikit-learn for cosine similarity.
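
To make the two measures concrete, here is a dependency-free sketch. Cosine similarity works on embedding vectors, while Jaccard similarity works on token sets. The zero-vector and empty-set conventions are choices made for this example; scikit-learn would be the practical option for cosine similarity on large matrices.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the token-set intersection divided by the size of the union."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(a & b) / len(a | b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))             # 0.0
print(jaccard_similarity(["a", "b", "c"], ["b", "c", "d"]))  # 0.5
```
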
### 5. Clustering:
- **K-Means Clustering:**
  - Partition documents into K clusters.
  - **Libraries:**
    - scikit-learn for K-Means.
- **Hierarchical Clustering:**
  - Build a hierarchy of clusters.
  - **Libraries:**
    - scipy.cluster.hierarchy for hierarchical clustering.
- **DBSCAN:**
  - Density-based clustering.
  - **Libraries:**
    - scikit-learn for DBSCAN.
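
As an illustration of what K-Means does with document embeddings, the sketch below implements the basic Lloyd iteration in plain Python. It is only meant to show the assign-then-update loop: seeding with the first `k` points is a simplification assumed here, and scikit-learn's implementation (with better seeding) is the sensible choice for real data.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100):
    """Basic Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive seeding: first k points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centroids


# Toy 2-D "embeddings": three points near the origin, two near (5, 5).
points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [0.2, 0.1]]
labels, centroids = k_means(points, k=2)
print(labels)  # [0, 0, 1, 1, 0]
```
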
### 6. Scoring System:
- **Threshold Setting:**
  - Establish a threshold for similarity scores to classify documents.
  - Determine the threshold through experimentation.
- **Scoring Logic:**
  - Develop a scoring system based on the results of pairwise comparison and clustering.
  - Decide on the scoring weights for each component.
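
One possible shape for such a scoring system is a weighted sum followed by a threshold check. The weights `(0.8, 0.2)`, the threshold `0.7`, and the labels `"similar"` and `"distinct"` are placeholders assumed for this sketch; as noted above, real values would have to come from experimentation.

```python
def combined_score(pairwise_similarity, same_cluster, weights=(0.8, 0.2)):
    """Weighted sum of a pairwise similarity in [0, 1] and a cluster-agreement flag."""
    w_similarity, w_cluster = weights
    cluster_signal = 1.0 if same_cluster else 0.0
    return w_similarity * pairwise_similarity + w_cluster * cluster_signal


def classify(score, threshold=0.7):
    """Label a document pair by comparing its score with the threshold."""
    return "similar" if score >= threshold else "distinct"


score = combined_score(pairwise_similarity=0.9, same_cluster=True)
print(classify(score))                       # similar
print(classify(combined_score(0.5, False)))  # distinct
```
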
### 7. Hybrid Approach:
- **Traditional Models:**
  - Use traditional similarity measures for efficiency.
  - Implement efficient algorithms for quick pairwise comparisons.
- **Large Language Models:**
  - Fine-tune or use pre-trained models for enhanced context understanding.
  - Hugging Face Transformers library for accessing pre-trained models.
- Fingerprinting concept
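
A hybrid pipeline could be arranged in two stages: a cheap filter first, then an expensive model only on the candidates that survive. The sketch below reads "fingerprinting" as hashing overlapping k-word shingles, which is an assumption about what the note intends. The `expensive_scorer` argument is a hypothetical stand-in for a large language model or transformer-based comparison, and the `prefilter` value is a placeholder.

```python
import hashlib


def fingerprint(tokens, k=3):
    """Set of short hashes, one per k-token shingle (one simple fingerprinting scheme)."""
    shingles = (" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return {hashlib.sha256(s.encode("utf-8")).hexdigest()[:12] for s in shingles}


def fingerprint_overlap(fp_a, fp_b):
    """Jaccard overlap of two fingerprint sets (0.0 if either is empty)."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def hybrid_compare(query, candidates, expensive_scorer, prefilter=0.1):
    """Stage 1: cheap fingerprint filter. Stage 2: expensive scorer on survivors only."""
    fp_query = fingerprint(query)
    results = {}
    for name, tokens in candidates.items():
        if fingerprint_overlap(fp_query, fingerprint(tokens)) >= prefilter:
            results[name] = expensive_scorer(query, tokens)
    return results


def stand_in_scorer(a, b):
    """Placeholder for a model-based score; here just token-set Jaccard."""
    return len(set(a) & set(b)) / len(set(a) | set(b))


query = "deep learning for text similarity detection".split()
candidates = {
    "near": "deep learning for text similarity search".split(),
    "far": "cooking pasta with fresh tomato sauce".split(),
}
print(sorted(hybrid_compare(query, candidates, stand_in_scorer)))  # ['near']
```

The point of the design is that the expensive scorer is never called on `"far"`, so its cost scales with the number of plausible matches rather than with the size of the whole collection.
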