phyloforfun committed on
Commit
142a372
1 Parent(s): 3b60ff3

Major update. Support for 15 LLMs, World Flora Online taxonomy validation, geolocation, 2 OCR methods, significant UI changes, stability improvements, consistent JSON parsing

.gitignore CHANGED
@@ -31,10 +31,6 @@ demo/validation_configs/*
  !/bin/version.yml
  release*
  expense_report/*
- /custom_prompts/*
- !/custom_prompts/required_structure.yaml
- !/custom_prompts/version_2.yaml
- !/custom_prompts/version_2_OSU.yaml
  leafmachine2/*/.gitignore
  
  /bin/*
custom_prompts/SLTPvA_long.yaml ADDED
@@ -0,0 +1,112 @@
+ prompt_author: Will Weaver
+ prompt_author_institution: University of Michigan
+ prompt_name: SLTPvA_long
+ prompt_version: v-1-0
+ prompt_description: Prompt developed by the University of Michigan.
+   SLTPvA prompts all have standardized column headers (fields) that were chosen due to their reliability and prevalence in herbarium records.
+   All field descriptions are based on the official Darwin Core guidelines.
+   SLTPvA_long - The most verbose prompt option. Descriptions closely follow DwC guides. Detailed rules for the LLM to follow. Works best with double or triple OCR to increase attention back to the OCR (select 'use both OCR models' or 'handwritten + printed' along with trOCR).
+   SLTPvA_medium - Shorter version of _long.
+   SLTPvA_short - The least verbose possible prompt while still providing rules and DwC descriptions.
+ LLM: General Purpose
+ instructions: 1. Refactor the unstructured OCR text into a dictionary based on the JSON structure outlined below.
+   2. Map the unstructured OCR text to the appropriate JSON key and populate the field given the user-defined rules.
+   3. JSON key values are permitted to remain empty strings if the corresponding information is not found in the unstructured OCR text.
+   4. Duplicate dictionary fields are not allowed.
+   5. Ensure all JSON keys are in camel case.
+   6. Ensure new JSON field values follow sentence case capitalization.
+   7. Ensure all key-value pairs in the JSON dictionary strictly adhere to the format and data types specified in the template.
+   8. Ensure output JSON string is valid JSON format. It should not have trailing commas or unquoted keys.
+   9. Only return a JSON dictionary represented as a string. You should not explain your answer.
+ json_formatting_instructions: This section provides rules for formatting each JSON value organized by the JSON key.
+ rules:
+   catalogNumber: Barcode identifier, typically a number with at least 6 digits, but fewer than 30 digits.
+   order: The full scientific name of the order in which the taxon is classified. Order must be capitalized.
+   family: The full scientific name of the family in which the taxon is classified. Family must be capitalized.
+   scientificName: The scientific name of the taxon including genus, specific epithet,
+     and any lower classifications.
+   scientificNameAuthorship: The authorship information for the scientificName formatted according to the conventions of the applicable Darwin Core nomenclaturalCode.
+   genus: Taxonomic determination to genus. Genus must be capitalized. If
+     genus is not present use the taxonomic family name followed by the word 'indet'.
+   subgenus: The full scientific name of the subgenus in which the taxon is classified.
+     Values should include the genus to avoid homonym confusion.
+   specificEpithet: The name of the first or species epithet of the scientificName. Only include the species epithet.
+   infraspecificEpithet: The name of the lowest or terminal infraspecific epithet of the scientificName, excluding any rank designation.
+   identifiedBy: A comma separated list of names of people, groups, or organizations who assigned the taxon to the subject organism. This is not the specimen collector.
+   recordedBy: A comma separated list of names of people, groups, or organizations responsible for observing, recording, collecting, or presenting the original specimen.
+     The primary collector or observer should be listed first.
+   recordNumber: An identifier given to the occurrence at the time it was recorded.
+     Often serves as a link between field notes and an occurrence record, such as a specimen collector's number.
+   verbatimEventDate: The verbatim original representation of the date and time information for when the specimen was collected.
+     Date of collection exactly as it appears on the label. Do not change
+     the format or correct typos.
+   eventDate: Date the specimen was collected formatted as year-month-day, YYYY-MM-DD. If
+     specific components of the date are unknown, they should be replaced with
+     zeros. Examples "0000-00-00" if the entire date is unknown, "YYYY-00-00"
+     if only the year is known, and "YYYY-MM-00" if year and month are known
+     but day is not.
+   habitat: A category or description of the habitat in which the specimen collection event occurred.
+   occurrenceRemarks: Text describing the specimen's geographic location. Text describing the appearance of the specimen.
+     A statement about the presence or absence of a taxon at the collection location.
+     Text describing the significance of the specimen, such as a specific expedition or notable collection.
+     Description of plant features such as leaf shape, size, color,
+     stem texture, height, flower structure, scent, fruit or seed characteristics,
+     root system type, overall growth habit and form, any notable aroma or secretions,
+     presence of hairs or bristles, and any other distinguishing morphological
+     or physiological characteristics.
+   country: The name of the country or major administrative unit in which the specimen was originally collected.
+   stateProvince: The name of the next smaller administrative region than country (state, province, canton, department, region, etc.) in which the specimen was originally collected.
+   county: The full, unabbreviated name of the next smaller administrative region than stateProvince (county, shire, department, parish etc.) in which the specimen was originally collected.
+   municipality: The full, unabbreviated name of the next smaller administrative region than county (city, municipality, etc.) in which the specimen was originally collected.
+   locality: Description of geographic location, landscape, landmarks, regional
+     features, nearby places, or any contextual information aiding in pinpointing
+     the exact origin or location of the specimen.
+   degreeOfEstablishment: Cultivated plants are intentionally grown by humans. In text descriptions,
+     look for planting dates, garden locations, ornamental, cultivar names, garden,
+     or farm to indicate cultivated plant. Use either - unknown or cultivated.
+   decimalLatitude: Latitude decimal coordinate. Correct and convert the verbatim location coordinates to conform
+     with the decimal degrees GPS coordinate format.
+   decimalLongitude: Longitude decimal coordinate. Correct and convert the verbatim location coordinates to conform
+     with the decimal degrees GPS coordinate format.
+   verbatimCoordinates: Verbatim location coordinates as they appear on the label. Do not
+     convert formats. Possible coordinate types include [Lat, Long, UTM, TRS].
+   minimumElevationInMeters: Minimum elevation or altitude in meters. Only if units are explicit
+     then convert from feet ("ft" or "ft." or "feet") to meters ("m" or "m." or
+     "meters"). Round to integer.
+   maximumElevationInMeters: Maximum elevation or altitude in meters. If only one elevation
+     is present, then max_elevation should be set to the null_value. Only if units
+     are explicit then convert from feet ("ft" or "ft." or "feet") to meters ("m"
+     or "m." or "meters"). Round to integer.
+ mapping:
+   COLLECTING:
+   - identifiedBy
+   - recordedBy
+   - recordNumber
+   - verbatimEventDate
+   - eventDate
+   GEOGRAPHY:
+   - country
+   - stateProvince
+   - county
+   - municipality
+   - minimumElevationInMeters
+   - maximumElevationInMeters
+   LOCALITY:
+   - locality
+   - habitat
+   - decimalLatitude
+   - decimalLongitude
+   - verbatimCoordinates
+   MISCELLANEOUS:
+   - degreeOfEstablishment
+   - occurrenceRemarks
+   TAXONOMY:
+   - catalogNumber
+   - order
+   - family
+   - scientificName
+   - scientificNameAuthorship
+   - genus
+   - subgenus
+   - specificEpithet
+   - infraspecificEpithet
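The eventDate rule in SLTPvA_long asks for YYYY-MM-DD with unknown components replaced by zeros. The rule is addressed to the LLM, but a post-processing check could apply the same convention. Below is a minimal Python sketch of that convention; the function name `format_event_date` is hypothetical and is not part of this commit, and dropping a month or day that follows a missing component is an assumption, since the rule only gives examples for year-only and year-month dates.

```python
from typing import Optional


def format_event_date(year: Optional[int] = None,
                      month: Optional[int] = None,
                      day: Optional[int] = None) -> str:
    """Return a YYYY-MM-DD string, zero-filling unknown components."""
    # Assumption: a month without a year, or a day without a month, is not
    # covered by the rule's examples, so everything after the first gap is
    # treated as unknown.
    if year is None:
        month = None
    if month is None:
        day = None
    return f"{year or 0:04d}-{month or 0:02d}-{day or 0:02d}"


print(format_event_date())             # 0000-00-00
print(format_event_date(1987))         # 1987-00-00
print(format_event_date(1987, 6))      # 1987-06-00
print(format_event_date(1987, 6, 14))  # 1987-06-14
```

The zero-filled strings are not valid ISO 8601 dates, so any downstream code that parses eventDate with a strict date parser would need to handle them separately.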
custom_prompts/SLTPvA_medium.yaml ADDED
@@ -0,0 +1,87 @@
+ prompt_author: Will Weaver
+ prompt_author_institution: University of Michigan
+ prompt_name: SLTPvA_medium
+ prompt_version: v-1-0
+ prompt_description: Prompt developed by the University of Michigan.
+   SLTPvA prompts all have standardized column headers (fields) that were chosen due to their reliability and prevalence in herbarium records.
+   All field descriptions are based on the official Darwin Core guidelines.
+   SLTPvA_long - The most verbose prompt option. Descriptions closely follow DwC guides. Detailed rules for the LLM to follow. Works best with double or triple OCR to increase attention back to the OCR (select 'use both OCR models' or 'handwritten + printed' along with trOCR).
+   SLTPvA_medium - Shorter version of _long.
+   SLTPvA_short - The least verbose possible prompt while still providing rules and DwC descriptions.
+ LLM: General Purpose
+ instructions: 1. Refactor the unstructured OCR text into a dictionary based on the JSON structure outlined below.
+   2. Map the unstructured OCR text to the appropriate JSON key and populate the field given the user-defined rules.
+   3. JSON key values are permitted to remain empty strings if the corresponding information is not found in the unstructured OCR text.
+   4. Duplicate dictionary fields are not allowed.
+   5. Ensure all JSON keys are in camel case.
+   6. Ensure new JSON field values follow sentence case capitalization.
+   7. Ensure all key-value pairs in the JSON dictionary strictly adhere to the format and data types specified in the template.
+   8. Ensure output JSON string is valid JSON format. It should not have trailing commas or unquoted keys.
+   9. Only return a JSON dictionary represented as a string. You should not explain your answer.
+ json_formatting_instructions: This section provides rules for formatting each JSON value organized by the JSON key.
+ rules:
+   catalogNumber: Barcode identifier, typically a number with at least 6 digits, but fewer than 30 digits.
+   order: The full scientific name of the order in which the taxon is classified. Order must be capitalized.
+   family: The full scientific name of the family in which the taxon is classified. Family must be capitalized.
+   scientificName: The scientific name of the taxon including genus, specific epithet,
+     and any lower classifications.
+   scientificNameAuthorship: The authorship information for the scientificName formatted according to the conventions of the applicable Darwin Core nomenclaturalCode.
+   genus: Taxonomic determination to genus. Genus must be capitalized.
+   subgenus: The full scientific name of the subgenus in which the taxon is classified.
+   specificEpithet: The name of the first or species epithet of the scientificName. Only include the species epithet.
+   infraspecificEpithet: The name of the lowest or terminal infraspecific epithet of the scientificName, excluding any rank designation.
+   identifiedBy: A comma separated list of names of people, groups, or organizations who assigned the taxon to the subject organism. This is not the specimen collector.
+   recordedBy: A comma separated list of names of people, groups, or organizations
+   recordNumber: An identifier given to the specimen at the time it was recorded.
+   verbatimEventDate: The verbatim original representation of the date and time information for when the specimen was collected.
+   eventDate: Date the specimen was collected formatted as year-month-day YYYY-MM-DD.
+   habitat: A category or description of the habitat in which the specimen collection event occurred.
+   occurrenceRemarks: Text describing the specimen's geographic location, appearance of the specimen, presence or absence of a taxon at the collection location, the significance of the specimen, such as a specific expedition or notable collection, plant features and descriptions.
+   country: The name of the country or major administrative unit in which the specimen was originally collected.
+   stateProvince: The name of the next smaller administrative region than country (state, province, canton, department, region, etc.) in which the specimen was originally collected.
+   county: The full, unabbreviated name of the next smaller administrative region than stateProvince (county, shire, department, parish etc.) in which the specimen was originally collected.
+   municipality: The full, unabbreviated name of the next smaller administrative region than county (city, municipality, etc.) in which the specimen was originally collected.
+   locality: Description of geographic location, landscape, landmarks, regional
+     features, nearby places, or any contextual information aiding in pinpointing
+     the exact origin or location of the specimen.
+   degreeOfEstablishment: Cultivated plants are intentionally grown by humans. In text descriptions,
+     look for planting dates, garden locations, ornamental, cultivar names, garden,
+     or farm to indicate cultivated plant. Use either - unknown or cultivated.
+   decimalLatitude: Latitude decimal coordinate. Correct and convert the verbatim location coordinates to conform with the decimal degrees GPS coordinate format.
+   decimalLongitude: Longitude decimal coordinate. Correct and convert the verbatim location coordinates to conform with the decimal degrees GPS coordinate format.
+   verbatimCoordinates: Verbatim location coordinates as they appear on the label.
+   minimumElevationInMeters: Minimum elevation or altitude in meters. Only if units are explicit then convert from feet ("ft" or "ft." or "feet") to meters ("m" or "m." or "meters"). Round to integer.
+   maximumElevationInMeters: Maximum elevation or altitude in meters. If only one elevation is present, then max_elevation should be set to the null_value. Only if units are explicit then convert from feet ("ft" or "ft." or "feet") to meters ("m" or "m." or "meters"). Round to integer.
+ mapping:
+   COLLECTING:
+   - identifiedBy
+   - recordedBy
+   - recordNumber
+   - verbatimEventDate
+   - eventDate
+   GEOGRAPHY:
+   - country
+   - stateProvince
+   - county
+   - municipality
+   - minimumElevationInMeters
+   - maximumElevationInMeters
+   LOCALITY:
+   - locality
+   - habitat
+   - decimalLatitude
+   - decimalLongitude
+   - verbatimCoordinates
+   MISCELLANEOUS:
+   - degreeOfEstablishment
+   - occurrenceRemarks
+   TAXONOMY:
+   - catalogNumber
+   - order
+   - family
+   - scientificName
+   - scientificNameAuthorship
+   - genus
+   - subgenus
+   - specificEpithet
+   - infraspecificEpithet
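The elevation rules in SLTPvA_medium ask for a feet-to-meters conversion only when the unit is explicit, rounded to an integer. As a rough illustration of what that rule implies, here is a Python sketch; `elevation_in_meters` and the regular expression are hypothetical helpers rather than code from this commit, and returning an empty string when no unit is found is an assumption based on instruction 3 (the prompt's `null_value` is not defined in the file).

```python
import re

FEET_TO_METERS = 0.3048  # exact, by definition of the international foot

# A number followed by one of the explicit units named in the rule:
# "ft", "ft.", "feet", "m", "m.", "meters".
_ELEVATION = re.compile(
    r"(\d+(?:\.\d+)?)\s*(ft\.?|feet|m\.?|meters)(?![a-z])",
    re.IGNORECASE,
)


def elevation_in_meters(text: str) -> str:
    """Return the first elevation in text as whole meters, or "" if no unit is explicit."""
    match = _ELEVATION.search(text)
    if match is None:
        return ""  # assumption: unitless numbers are left empty
    value = float(match.group(1))
    if match.group(2).lower().startswith("f"):
        value *= FEET_TO_METERS
    return str(round(value))


print(elevation_in_meters("Elev. 1200 ft."))  # 366
print(elevation_in_meters("350 m"))           # 350
print(elevation_in_meters("1200"))            # (empty string)
```

The negative lookahead keeps words such as "miles" from being read as a bare "m". Elevation ranges (minimum and maximum on one label) are outside the scope of this sketch.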
custom_prompts/SLTPvA_short.yaml ADDED
@@ -0,0 +1,82 @@
+ prompt_author: Will Weaver
+ prompt_author_institution: University of Michigan
+ prompt_name: SLTPvA_short
+ prompt_version: v-1-0
+ prompt_description: Prompt developed by the University of Michigan.
+   SLTPvA prompts all have standardized column headers (fields) that were chosen due to their reliability and prevalence in herbarium records.
+   All field descriptions are based on the official Darwin Core guidelines.
+   SLTPvA_long - The most verbose prompt option. Descriptions closely follow DwC guides. Detailed rules for the LLM to follow. Works best with double or triple OCR to increase attention back to the OCR (select 'use both OCR models' or 'handwritten + printed' along with trOCR).
+   SLTPvA_medium - Shorter version of _long.
+   SLTPvA_short - The least verbose possible prompt while still providing rules and DwC descriptions.
+ LLM: General Purpose
+ instructions: 1. Refactor the unstructured OCR text into a dictionary based on the JSON structure outlined below.
+   2. Map the unstructured OCR text to the appropriate JSON key and populate the field given the user-defined rules.
+   3. JSON key values are permitted to remain empty strings if the corresponding information is not found in the unstructured OCR text.
+   4. Duplicate dictionary fields are not allowed.
+   5. Ensure all JSON keys are in camel case.
+   6. Ensure new JSON field values follow sentence case capitalization.
+   7. Ensure all key-value pairs in the JSON dictionary strictly adhere to the format and data types specified in the template.
+   8. Ensure output JSON string is valid JSON format. It should not have trailing commas or unquoted keys.
+   9. Only return a JSON dictionary represented as a string. You should not explain your answer.
+ json_formatting_instructions: This section provides rules for formatting each JSON value organized by the JSON key.
+ rules:
+   catalogNumber: barcode identifier, at least 6 digits, fewer than 30 digits.
+   order: full scientific name of the Order in which the taxon is classified. Order must be capitalized.
+   family: full scientific name of the Family in which the taxon is classified. Family must be capitalized.
+   scientificName: scientific name of the taxon including Genus, specific epithet, and any lower classifications.
+   scientificNameAuthorship: authorship information for the scientificName formatted according to the conventions of the applicable Darwin Core nomenclaturalCode.
+   genus: taxonomic determination to Genus, Genus must be capitalized.
+   subgenus: name of the subgenus.
+   specificEpithet: The name of the first or species epithet of the scientificName. Only include the species epithet.
+   infraspecificEpithet: lowest or terminal infraspecific epithet of the scientificName.
+   identifiedBy: list of names of people, doctors, professors, groups, or organizations who identified, determined the taxon name to the subject organism. This is not the specimen collector.
+   recordedBy: list of names of people, doctors, professors, groups, or organizations.
+   recordNumber: identifier given to the specimen at the time it was recorded.
+   verbatimEventDate: The verbatim original representation of the date and time information for when the specimen was collected.
+   eventDate: collection date formatted as year-month-day YYYY-MM-DD.
+   habitat: habitat.
+   occurrenceRemarks: all descriptive text in the OCR rearranged into sensible sentences or sentence fragments.
+   country: country or major administrative unit.
+   stateProvince: state, province, canton, department, region, etc.
+   county: county, shire, department, parish etc.
+   municipality: city, municipality, etc.
+   locality: description of geographic information aiding in pinpointing the exact origin or location of the specimen.
+   degreeOfEstablishment: cultivated plants are intentionally grown by humans. Use either - unknown or cultivated.
+   decimalLatitude: latitude decimal coordinate.
+   decimalLongitude: longitude decimal coordinate.
+   verbatimCoordinates: verbatim location coordinates.
+   minimumElevationInMeters: minimum elevation or altitude in meters.
+   maximumElevationInMeters: maximum elevation or altitude in meters.
+ mapping:
+   COLLECTING:
+   - identifiedBy
+   - recordedBy
+   - recordNumber
+   - verbatimEventDate
+   - eventDate
+   GEOGRAPHY:
+   - country
+   - stateProvince
+   - county
+   - municipality
+   - minimumElevationInMeters
+   - maximumElevationInMeters
+   LOCALITY:
+   - locality
+   - habitat
+   - decimalLatitude
+   - decimalLongitude
+   - verbatimCoordinates
+   MISCELLANEOUS:
+   - degreeOfEstablishment
+   - occurrenceRemarks
+   TAXONOMY:
+   - catalogNumber
+   - order
+   - family
+   - scientificName
+   - scientificNameAuthorship
+   - genus
+   - subgenus
+   - specificEpithet
+   - infraspecificEpithet
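All three prompts share the same instructions (valid JSON, no duplicate fields, empty strings for missing values) and the same `mapping` of categories to fields. The following Python sketch shows one way a caller might check an LLM response against those constraints. It is an illustration only: `parse_response` and `_reject_duplicates` are hypothetical names, the mapping is copied by hand from the YAML above rather than loaded from the file, and rejecting unknown keys is an assumption that the prompts do not state.

```python
import json

# Category-to-field mapping copied from the prompt's `mapping` section.
MAPPING = {
    "COLLECTING": ["identifiedBy", "recordedBy", "recordNumber",
                   "verbatimEventDate", "eventDate"],
    "GEOGRAPHY": ["country", "stateProvince", "county", "municipality",
                  "minimumElevationInMeters", "maximumElevationInMeters"],
    "LOCALITY": ["locality", "habitat", "decimalLatitude",
                 "decimalLongitude", "verbatimCoordinates"],
    "MISCELLANEOUS": ["degreeOfEstablishment", "occurrenceRemarks"],
    "TAXONOMY": ["catalogNumber", "order", "family", "scientificName",
                 "scientificNameAuthorship", "genus", "subgenus",
                 "specificEpithet", "infraspecificEpithet"],
}
EXPECTED_KEYS = [field for fields in MAPPING.values() for field in fields]


def _reject_duplicates(pairs):
    """object_pairs_hook that fails on repeated keys (instruction 4)."""
    keys = [key for key, _ in pairs]
    duplicates = sorted({key for key in keys if keys.count(key) > 1})
    if duplicates:
        raise ValueError(f"duplicate keys: {duplicates}")
    return dict(pairs)


def parse_response(raw: str) -> dict:
    """Parse an LLM reply and return a record with every expected field."""
    # json.loads raises JSONDecodeError (a ValueError) on trailing commas
    # or unquoted keys, which covers instruction 8.
    record = json.loads(raw, object_pairs_hook=_reject_duplicates)
    if not isinstance(record, dict):
        raise ValueError("response is not a JSON object")
    unknown = sorted(set(record) - set(EXPECTED_KEYS))
    if unknown:
        raise ValueError(f"unexpected keys: {unknown}")  # assumption: strict
    # Missing fields become empty strings (instruction 3).
    return {key: record.get(key, "") for key in EXPECTED_KEYS}


record = parse_response('{"genus": "Quercus", "family": "Fagaceae"}')
print(record["genus"], repr(record["country"]))  # Quercus ''
```

A lenient variant could drop unknown keys instead of raising; which behavior suits a given workflow depends on how much the caller trusts the model output.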