adlumal committed on
Commit d9171f8
1 Parent(s): 09e4cde

Update README.md

Files changed (1)

1. README.md +17 -5
README.md CHANGED
@@ -19,7 +19,21 @@ language:
 
 This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
-<!--- Describe your model here -->
+This model is a fine-tune of [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) using the HCA case law in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) by Umar Butler. The PDF/OCR cases were not used.
+
+The cases were split into context chunks of < 512 tokens using the bge-small-en tokeniser and [semchunk](https://github.com/umarbutler/semchunk).
+[mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) was used to generate a legal question for each context chunk.
+
+129,137 context-question pairs were used for training.
+14,348 context-question pairs were used for evaluation (see the table below for results).
+
+Using a 10% subset of the validation dataset, the following hit-rate performance was reached, compared to the base model and OpenAI's default ada embedding model.
+
+| **Model**                 | **Avg. hit-rate** |
+|---------------------------|-------------------|
+| BAAI/bge-small-en         | 89%               |
+| OpenAI                    | 92%               |
+| adlumal/auslaw-embed-v1.0 | **97%**           |
 
 ## Usage (Sentence-Transformers)
 
@@ -40,13 +54,11 @@ embeddings = model.encode(sentences)
 print(embeddings)
 ```
 
-
-
 ## Evaluation Results
 
-<!--- Describe how your model was evaluated -->
+The model was evaluated on 10% of the available data. The automated eval results for the final step are presented below.
 
-| Test                   | Score        |
+| Eval                   | Score        |
 |------------------------|--------------|
 | cos_sim-Accuracy@1     | 0.730206301  |
 | cos_sim-Accuracy@3     | 0.859562308  |
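The Usage section of the README is only partially visible in this diff. For context, a minimal sketch of loading the fine-tuned model with sentence-transformers and ranking passages for a query; the example query and passages below are illustrative placeholders, not corpus text.

```python
from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned embedding model (384-dimensional output).
model = SentenceTransformer("adlumal/auslaw-embed-v1.0")

# Illustrative query and candidate passages (placeholders, not corpus text).
query = "What is the test for apprehended bias?"
passages = [
    "The question is whether a fair-minded lay observer might reasonably apprehend "
    "that the judge might not bring an impartial mind to the question to be decided.",
    "The applicant sought an extension of time to file the notice of appeal.",
]

# Encode and rank the passages by cosine similarity to the query.
query_emb = model.encode(query, convert_to_tensor=True)
passage_embs = model.encode(passages, convert_to_tensor=True)
print(util.cos_sim(query_emb, passage_embs))
```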
 
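The README notes that the cases were split into chunks of fewer than 512 tokens using the bge-small-en tokeniser and semchunk. A minimal sketch of that step, assuming semchunk's chunk(text, chunk_size, token_counter) interface; the exact chunking parameters used for the dataset are not stated in this commit.

```python
import semchunk
from transformers import AutoTokenizer

# Count tokens with the bge-small-en tokeniser so chunks respect the model's 512-token limit.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en")

def token_counter(text: str) -> int:
    return len(tokenizer.encode(text))

case_text = "..."  # placeholder: full text of one HCA judgment

# Split the judgment into context chunks of at most 512 tokens each.
chunks = semchunk.chunk(case_text, chunk_size=512, token_counter=token_counter)
print(len(chunks))
```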
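The README also says that mistralai/Mixtral-8x7B-Instruct-v0.1 generated a legal question for each chunk, but the prompt is not given. The sketch below is a hypothetical reconstruction using the transformers text-generation pipeline; the prompt wording and generation settings are assumptions.

```python
from transformers import pipeline

# Hypothetical reconstruction of the question-generation step (needs substantial GPU memory).
generator = pipeline(
    "text-generation",
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    device_map="auto",
)

chunk = "..."  # placeholder: one context chunk from an HCA judgment

# Assumed prompt: the actual prompt used to build the dataset is not published.
prompt = (
    "[INST] Read the following extract from an Australian court judgment and "
    "write one legal question that the extract answers.\n\n"
    f"{chunk} [/INST]"
)

output = generator(prompt, max_new_tokens=64, return_full_text=False)
print(output[0]["generated_text"])
```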
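The hit-rate comparison and the cos_sim-Accuracy@k scores are retrieval metrics over held-out context-question pairs (the metric names are consistent with sentence-transformers' InformationRetrievalEvaluator output). A rough sketch of computing hit-rate@k with sentence-transformers utilities; the pairs shown are placeholders, not the actual evaluation data.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("adlumal/auslaw-embed-v1.0")

# Placeholder pairs: questions[i] is answered by contexts[i].
questions = [
    "What must a plaintiff establish to succeed in negligence?",
    "When may special leave to appeal be granted?",
]
contexts = [
    "To succeed in negligence the plaintiff must establish a duty of care, "
    "a breach of that duty, and damage caused by the breach.",
    "Special leave to appeal may be granted where a question of law of public "
    "importance arises or the interests of the administration of justice require it.",
]

q_embs = model.encode(questions, convert_to_tensor=True)
c_embs = model.encode(contexts, convert_to_tensor=True)

# A question scores a hit if its paired context appears in the top-k retrieved contexts.
top_k = 1
results = util.semantic_search(q_embs, c_embs, top_k=top_k)
hits = sum(
    any(entry["corpus_id"] == i for entry in ranked)
    for i, ranked in enumerate(results)
)
print(f"hit-rate@{top_k}: {hits / len(questions):.3f}")
```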