Commits · seanpedrickcase/topic

Debugged reference to random_seed in vectorisation and reference to torch in representation_model.py

8216d8c

seanpedrickcase committed 26 days ago

Importing space package near start of app now to avoid issue with cuda being initialised before

9e84863

seanpedrickcase committed on Dec 12, 2024

Llama-cpp-python in GPU mode doesn't seem to work well with Bertopic on Huggingface, so downgrading that to CPU version

88d81fa

seanpedrickcase committed on Dec 12, 2024

Rearranged functions for embeddings creation to be compatible with zero GPU space. Updated packages.

cc495e1

seanpedrickcase committed on Dec 12, 2024

Added and replaced relevant files to download in download_model.py to allow for app use on AWS

49e0db8

seanpedrickcase committed on Nov 20, 2024

Updated Dockerfile with latest packages

08eb30d

seanpedrickcase committed on Nov 20, 2024

Added example of how to run function from command line. Updated packages. Embedding model default now smaller and at fp16.

34f1e83

seanpedrickcase committed on Nov 20, 2024

Improved initial clean options. Now has option to return embeddings only.

89c4d20

seanpedrickcase committed on Nov 18, 2024

Corrected minor Dockerfile package version issue

593153e

seanpedrickcase committed on Sep 26, 2024

App now retains original index following cleaning to allow for referring back to original data

90553eb

seanpedrickcase committed on Sep 25, 2024

Now installed dependencies into correct folder in Dockerfile

5888649

seanpedrickcase committed on Aug 15, 2024

Finally managed to enforce cpu torch install in Dockerfile

97913c4

seanpedrickcase committed on Aug 15, 2024

Further optimised Dockerfile and requirements (smaller torch installation now hopefully)

00db72b

seanpedrickcase committed on Aug 15, 2024

Transferring across installed packages from build stage in Dockerfile

c9da99d

seanpedrickcase committed on Aug 14, 2024

Changed Dockerfile to multi-stage build to further reduce size

0fd155c

seanpedrickcase committed on Aug 14, 2024

Trying to make container image smaller through Dockerfile

7d5387e

seanpedrickcase committed on Aug 14, 2024

Minor changes to reduce Dockerfile size

b767539

seanpedrickcase committed on Aug 14, 2024

Updated download_model.py to download pytorch .bin file

1c0bfd4

seanpedrickcase committed on Aug 13, 2024

Removed some requirements from Dockerfile for AWS deployment to reduce container size

51ba1cb

seanpedrickcase committed on Aug 13, 2024

Added NUMBA_CACHE_DIR to Docker environment variables

cd6a3e0

seanpedrickcase committed on Aug 13, 2024

Allowed for app running on AWS to use smaller embedding model and not to load representation LLM (due to size restrictions).

22ca76e

seanpedrickcase committed on Aug 12, 2024

Dockerfile now installs models directly into user folder instead of moving from base folder

3c1c3de

seanpedrickcase committed on Aug 9, 2024

Updated Gradio version for spaces. Updated Dockerfile to enable Llama.cpp build with Cmake

d34af22

seanpedrickcase committed on Aug 9, 2024

Only aggregate topics not 'other', allowed for minimum sentence length, default max_topics now will auto aggregate topics. Added Cognito Auth functionality (boto3 with AWS).

1e2bb3e

seanpedrickcase committed on Aug 9, 2024

Can split passages into sentences. Improved embedding, LLM representation models, improved zero shot capabilities

55f0ce3

seanpedrickcase committed on Jun 27, 2024

Updated packages. Improve hierarchy vis. Better models - mixedbread and phi3. Now option to split texts into sentences before modelling.

04a15c5

seanpedrickcase committed on Jun 20, 2024

Minor cleaning, csv formatting changes

d80c8f5

Sean-Case committed on Feb 16, 2024

Reduce outliers now more efficient and relabels with correct vectoriser. Default topic labels now tidier. Hierarchical topics outputs more useful for joining to df afterwards. Switched low resource reduction algorithm to UMAP as default is not good.

e1c1f68

Sonnyjim committed on Feb 7, 2024

Should now parse custom regex correctly. Will now wipe previously created embeddings if 'low resource mode' option switched.

0a543a0

Sean-Case committed on Feb 7, 2024

Allowed for uploading custom regex for cleaning. Fixed calculate all probabilities, reduce outliers. Added text tree for hierarchical modelling.

381f959

Sonnyjim committed on Feb 6, 2024

Upgraded to Gradio 4.16.0. Guide for converting to exe added.

0a177ca

Sonnyjim committed on Feb 5, 2024

Hopefully now LLM download from hub should work

cdcd7af

Sonnyjim committed on Feb 5, 2024

Note about LLM not working now successfully added!

e2dfc1e

Sean-Case committed on Feb 2, 2024

Added note to say that LLM representation is not currently working on the HF website

3b4333f

Sean-Case committed on Feb 2, 2024

Trying to download LLM to local_dir instead of cache_dir

539aba9

Sean-Case committed on Feb 2, 2024

LLM model save is failing in Huggingface - attempting instead to save to base folder

c2bf185

Sean-Case committed on Feb 2, 2024

Some text changes. Fixed a couple of TF-IDF embeddings issues

87306c7

Sean-Case committed on Feb 2, 2024

Switched embeddings to low resource TF-IDF by default. Some text changes.

a7fdf3b

Sean-Case committed on Feb 2, 2024

Fixed file load with files including capital letters

9c6425d

Sonnyjim committed on Feb 2, 2024

Added clean data options, improved re-representation options and visualisation. General format changes

4effac0

Sonnyjim committed on Feb 2, 2024

Allowed for loading in external topic labels. A few visualisation modifications.

b27bab2

Sonnyjim committed on Jan 29, 2024

Model save now checks and makes a folder before writing the model

356791c

Sonnyjim committed on Jan 29, 2024

Lots of general fixes. New visualisations, fixed hierarchical vis for zero shot. Added calc all probabilities.

b4510a6

Sonnyjim committed on Jan 29, 2024

Changed Phi model to smaller StableLM 2 1.6. Fixed a None type detection error.

1f1a1c7

Sonnyjim committed on Jan 27, 2024

Disabled console logging as it was getting in the way of file load into the app

731ed23

Sonnyjim committed on Jan 26, 2024

Switched embeddings model to BGE Small 1.5 as Jina seemed unable to do zero shot topic modelling properly

be094ee

Sonnyjim committed on Jan 26, 2024

Added minimum similarity slider for zero shot topic modelling

0fe5421

Sean-Case committed on Jan 26, 2024

model and hierarchy details should now save properly

6622531

Sonnyjim committed on Jan 26, 2024

Split off LLM representation, visualisation, and reduce outliers from main function. Added hierarchical visualisation and logs

5d87c3c

Sonnyjim committed on Jan 26, 2024

More efficient embeddings save and representations load/process. Custom visualisation hover option added, formatting improvements. Version 0.1?

ffe5eb2

Sonnyjim committed on Jan 25, 2024
