Spaces:
Runtime error
Runtime error
2024_main_document_lvl
2024_main_paragraph_lvl
2023_main_document_lvl
2023_main_paragraph_lvl
Embeddings convert pdfs
- Para
- Docs
HNSW - Kmeans fast searcddh
K means graphs based on the topics
Check for similarity between our own db
- Para
- Docs
Get The most important Ones
Get the Unquine sentances like title & other content ?? - LLM think karun karel
Search Google using the unquine searches --> get the top 3 and do the same check again --> result
1. Data Input:
- Input Data: Collect a diverse dataset of academic papers, articles, or textual content from various sources.
- Format: Ensure the data is in a consistent and machine-readable format, such as plain text or a format compatible with your chosen NLP library.
2. Data Cleaning:
Text Cleaning:
- Remove metadata, formatting, and irrelevant details.
- Handle special characters, punctuation, and stopwords.
Normalization:
- Convert text to lowercase to ensure uniformity.
Tokenization:
- Tokenize the text into words or subword tokens.
- Libraries:
- For Python, you can use NLTK or spaCy for tokenization.
3. Embedding Generation:
Word Level Embeddings:
- Utilize pre-trained word embeddings like Word2Vec or GloVe.
- Libraries:
- For Word2Vec: Gensim library.
- For GloVe: spaCy or gensim.
Paragraph Level Embeddings:
- Aggregate word embeddings using techniques like averaging or using Doc2Vec.
- Libraries:
- Gensim for Doc2Vec.
Document Level Embeddings:
- Consider using the average of paragraph embeddings or more advanced models.
- Libraries:
- spaCy or transformers library for more advanced models.
4. Pairwise Comparison:
- Similarity Measures:
- Calculate cosine similarity, Jaccard similarity, or other relevant measures.
- Libraries:
- scikit-learn for cosine similarity.
5. Clustering:
K-Means Clustering:
- Partition documents into K clusters.
- Libraries:
- scikit-learn for K-Means.
Hierarchical Clustering:
- Build a hierarchy of clusters.
- Libraries:
- scipy.cluster.hierarchy for hierarchical clustering.
DBSCAN:
- Density-based clustering.
- Libraries:
- scikit-learn for DBSCAN.
6. Scoring System:
Threshold Setting:
- Establish a threshold for similarity scores to classify documents.
- Determine the threshold through experimentation.
Scoring Logic:
- Develop a scoring system based on the results of pairwise comparison and clustering.
- Decide on the scoring weights for each component.
7. Hybrid Approach:
Traditional Models:
- Use traditional similarity measures for efficiency.
- Implement efficient algorithms for quick pairwise comparisons.
Large Language Models:
- Fine-tune or use pre-trained models for enhanced context understanding.
- Hugging Face Transformers library for accessing pre-trained models.
Fingerprinting Concept