---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- python
- document
- code
- code2doc
- instruction_tuned
- basemodel
- pytorch
- docstring
- documentation
- text-generation-inference
metrics:
- accuracy
pipeline_tag: text-generation
widget:
- text: '<example_response>--code:def function_divide2(x): return x / 2--question:Document the code--doc:Description:This function takes a number and divides it by 2.Parameters:- x (numeric): The input value to be divided by 2.Returns:- float: The result of x divided by 2.Example:To call the function, use the following code:function_divide2(1.0)</example_response><function_code>def _plot_bounding_polygon(polygons_coordinates, output_html_path=bounding_polygon_map.html):map_center = [sum([coord[0]for polygon_coords in polygons_coordinatesfor coord in polygon_coords])/ sum([len(polygon_coords) for polygon_coords in polygons_coordinates]),sum([coord[1]for polygon_coords in polygons_coordinatesfor coord in polygon_coords])/ sum([len(polygon_coords) for polygon_coords in polygons_coordinates]),]my_map = folium.Map(location=map_center, zoom_start=12)for polygon_coords in polygons_coordinates:folium.Polygon(locations=polygon_coords,color=blue,fill=True,fill_color=blue,fill_opacity=0.2,).add_to(my_map)marker_cluster = MarkerCluster().add_to(my_map)for polygon_coords in polygons_coordinates:for coord in polygon_coords:folium.Marker(location=[coord[0], coord[1]], popup=fCoordinates: {coord}).add_to(marker_cluster)draw = Draw(export=True)draw.add_to(my_map)my_map.save(output_html_path)return output_html_path</function_code><question>Document the python code above giving function description ,parameters and return type and example how to call the function</question><doc>'
  example_title: example
---
|
# pip-code-to-doc |
|
|
|
[pipableAi](https://www.linkedin.com/company/pipable.ai/about/) |
|
|
|
[colab_notebook](https://colab.research.google.com/drive/17PyMU_3QN9LROy7x-jmaema0cuLRzBvc?usp=sharing) |
|
|
|
[pip library_etl](https://github.com/PipableAI/pip-library-etl.git) |
|
|
|
## What have we built? |
|
|
|
A 1.3B-parameter code documentation model that outperforms most models at documenting code and makes your in-house libraries ready for LLM and RAG pipelines.

We have also open-sourced [pip library_etl](https://github.com/PipableAI/pip-library-etl.git) to go with it. Together, the library and the model can turn your codebase into a functional parse tree that LLMs can consume to execute complex tasks.

The model can also generate SQL queries with accuracy on par with [pip-sql-1.3b](https://huggingface.co/PipableAI/pip-sql-1.3b), and it additionally accepts extra examples, instructions, and column descriptions as context.

It is a further-trained version of pip-sql-1.3b.
|
|
|
## How did we build it?

We used softmax cross-entropy and a modified form of policy gradient along with Q loss, optimized in an EM setup.

[Figure: loss behaviour in the setup mentioned above.]
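
The full training objective is not spelled out here, so the following is only a minimal sketch of the first ingredient named above, softmax cross-entropy for a single next-token prediction. The function name and the toy logits are illustrative assumptions, and the policy gradient and Q loss terms are not shown.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Return -log(softmax(logits)[target_index]) for one token position."""
    # Subtract the largest logit before exponentiating for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits)
    )
    return log_sum_exp - logits[target_index]


# Toy vocabulary of three tokens; the correct next token is index 0.
loss = softmax_cross_entropy([2.0, 1.0, 0.1], target_index=0)
print(round(loss, 4))
```

The loss shrinks toward zero as the logit of the correct token grows relative to the others.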
|
|
|
## License |
|
|
|
The model is open source under the Apache 2.0 license.
|
|
|
## Usage |
|
|
|
|
|
### Library use |
|
|
|
To use the model's capabilities directly, without extra effort on schemas and prompts, try [pip library_etl](https://github.com/PipableAI/pip-library-etl.git).

For detailed usage, refer to the [colab_notebook](https://colab.research.google.com/drive/17PyMU_3QN9LROy7x-jmaema0cuLRzBvc?usp=sharing).
|
|
|
### Installation |
|
|
|
```bash
# PyTorch is required to load and run the model with transformers.
pip install transformers torch
```
|
|
|
### Prompt |
|
```python
# Placeholders to fill in before building a prompt.
example = "<one worked example: --question ... , --query ...>"
code = "<source code of the function to document>"
schema = "<CREATE TABLE statements with the columns described>"

# Code documentation prompt.
doc_prompt = f"""<example_response>{example}</example_response><function_code>{code}</function_code>
<question>Give one line description of the python code above in natural language.</question>
<doc>"""

# SQL generation prompt.
sql_prompt = f"""<example_response>{example}</example_response><schema>{schema}</schema>
<question>Write a sql query to ....</question>
<sql>"""
```
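
The documentation template above can be wrapped in a small helper so the tags are assembled the same way every time. This is a minimal sketch: `build_doc_prompt` and its parameter names are hypothetical and are not part of pip library_etl, and the default question is copied from the examples in this card.

```python
DEFAULT_QUESTION = (
    "Document the python code above giving function description ,"
    "parameters and return type and example how to call the function."
)


def build_doc_prompt(code: str, example: str = "", question: str = DEFAULT_QUESTION) -> str:
    """Assemble a code documentation prompt that ends with an open <doc> tag."""
    return (
        f"<example_response>{example}</example_response>"
        f"<function_code>{code}</function_code>\n"
        f"<question>{question}</question>\n"
        "<doc>"
    )


prompt = build_doc_prompt("def example_function(x):\n    return x * 2")
print(prompt)
```

Leaving the prompt open at `<doc>` lets the model continue with the documentation text.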
|
|
|
### PyTorch |
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("PipableAI/pip-library-etl-1.3b").to(device)
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-library-etl-1.3b")

# One worked example, the function to document, instructions, and the question.
prompt = """<example_response>
--code:def function_2(x): return x / 2
--question:Document the python code above giving function description ,parameters and return type and example how to call the function.
--doc:
Description:This function takes a number and divides it by 2.
Parameters:
- x (numeric): The input value to be divided by 2.
Returns:
- float: The result of x divided by 2
Example:
To call the function, use the following code:
function_2(1.0)</example_response>
<function_code>
def example_function(x):
    return x * 2
</function_code>
<instructions>
1. In the examples while calling function use the name mentioned after `def ` in the above function_code.
2. In the generated docs use valid python type hints as per PEP 484.
</instructions>
<question>Document the python code above giving function description ,parameters and return type and example how to call the function.</question>
<doc>"""

inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=450)

# Keep only the text generated between <doc> and </doc>.
doc = (
    tokenizer.decode(outputs[0], skip_special_tokens=True)
    .split("<doc>")[-1]
    .split("</doc>")[0]
)
# Strip markup tags that may surround the generated documentation.
doc = (
    doc.replace("<p>", "")
    .replace("</p>", "")
    .replace("<function_description>", "")
    .replace("</function_description>", "")
)
print(doc)
```
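
The post-processing at the end of the snippet above can be factored into a function and checked without loading the model. This is a sketch under assumptions: `extract_doc` is a hypothetical helper name, the sample string is made up for illustration, and the final `strip()` is an addition that the original snippet does not perform.

```python
def extract_doc(decoded: str) -> str:
    """Return the text between the last <doc> tag and the following </doc>."""
    doc = decoded.split("<doc>")[-1].split("</doc>")[0]
    # Remove the same markup tags that the snippet above strips out.
    for tag in ("<p>", "</p>", "<function_description>", "</function_description>"):
        doc = doc.replace(tag, "")
    # Assumption: trimming surrounding whitespace is harmless for display.
    return doc.strip()


# Made-up decoded output standing in for tokenizer.decode(...).
sample = "<question>Document the code</question>\n<doc>\n<p>Description:Doubles x.</p></doc>"
print(extract_doc(sample))
```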
|
|
|
|
|
|
|
## Examples |
|
|
|
### 1. Code Documentation |
|
### Prompt

```python
text = ''' <example_response>
--code:def function_2(x): return x / 2
--question:Document the python code above giving function description ,parameters and return type and example how to call the function.
--doc:
Description:This function takes a number and divides it by 2.
Parameters:
- x (numeric): The input value to be divided by 2.
Returns:
- float: The result of x divided by 2
Example:
To call the function, use the following code:
function_2(1.0)</example_response>
<function_code>def _plot_bounding_polygon(
    polygons_coordinates, output_html_path="bounding_polygon_map.html"
):
    # Create a Folium map centered at the average coordinates of all bounding boxes
    map_center = [
        sum(
            [
                coord[0]
                for polygon_coords in polygons_coordinates
                for coord in polygon_coords
            ]
        )
        / sum([len(polygon_coords) for polygon_coords in polygons_coordinates]),
        sum(
            [
                coord[1]
                for polygon_coords in polygons_coordinates
                for coord in polygon_coords
            ]
        )
        / sum([len(polygon_coords) for polygon_coords in polygons_coordinates]),
    ]

    my_map = folium.Map(location=map_center, zoom_start=12)

    # Add each bounding polygon to the map
    for polygon_coords in polygons_coordinates:
        folium.Polygon(
            locations=polygon_coords,
            color="blue",
            fill=True,
            fill_color="blue",
            fill_opacity=0.2,
        ).add_to(my_map)

    # Add bounding boxes as markers to the map
    marker_cluster = MarkerCluster().add_to(my_map)

    for polygon_coords in polygons_coordinates:
        for coord in polygon_coords:
            folium.Marker(
                location=[coord[0], coord[1]], popup=f"Coordinates: {coord}"
            ).add_to(marker_cluster)

    # Add draw control to allow users to draw additional polygons
    draw = Draw(export=True)
    draw.add_to(my_map)

    # Save the map as an HTML file
    my_map.save(output_html_path)

    return output_html_path
</function_code>
<instructions>
1. In the examples while calling function use the name mentioned after `def ` in the above function_code.
2. In the generated docs use valid python type hints as per PEP 484.
</instructions>
<question>Document the python code above giving function description ,parameters and return type and example how to call the function</question><doc>'''
```
|
|
|
### Response |
|
```txt |
|
Description:This function generates a map of the bounding polygons and saves it as an HTML file. |
|
Parameters: |
|
- polygons_coordinates (list of lists of tuples): A list of lists of tuples representing the coordinates of the polygons. Each polygon is a list of coordinates. |
|
- output_html_path (str, optional): The path where the HTML file should be saved. Defaults to "bounding_polygon_map.html". |
|
Returns: |
|
- str: The path to the saved HTML file. |
|
Example: |
|
To call the function, use the following code: |
|
plot_bounding_polygon([[(0, 0), (1, 0), (1, 1), (0, 1)], [(2, 2), (3, 2), (3, 3), (2, 3)]], "my_map.html"). |
|
``` |
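
A response in this shape can be split into its labelled sections for downstream use, for example before storing it alongside the function. The sketch below assumes the model keeps the four headers shown above (`Description`, `Parameters`, `Returns`, `Example`); `parse_doc_sections` is a hypothetical helper and the sample is an abridged copy of the response.

```python
def parse_doc_sections(doc: str) -> dict:
    """Split generated documentation into a dict keyed by section header."""
    headers = ("Description", "Parameters", "Returns", "Example")
    sections = {}
    current = None
    for line in doc.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # A line that starts with "<Header>:" opens a new section.
        matched = next((h for h in headers if stripped.startswith(h + ":")), None)
        if matched:
            current = matched
            rest = stripped[len(matched) + 1:].strip()
            sections[current] = [rest] if rest else []
        elif current is not None:
            sections[current].append(stripped)
    return {name: "\n".join(lines) for name, lines in sections.items()}


sample = """Description:This function generates a map of the bounding polygons and saves it as an HTML file.
Parameters:
- polygons_coordinates (list of lists of tuples): A list of lists of tuples representing the coordinates of the polygons.
Returns:
- str: The path to the saved HTML file.
Example:
To call the function, use the following code:
"""
parsed = parse_doc_sections(sample)
print(parsed["Returns"])
```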
|
|
|
|
|
### 2. SQL Generation |
|
### Prompt
|
```python |
|
text = """Generate a simple SQL query from the schema mentioned for the following question. |
|
<schema>CREATE TABLE department (Department_ID number, |
|
Name text, |
|
Creation text, |
|
Ranking number, |
|
Budget_in_Billions number, |
|
Num_Employees number); |
|
|
|
CREATE TABLE head (head_ID number, |
|
name text, |
|
born_state text, |
|
age number); |
|
|
|
CREATE TABLE management (department_ID number, |
|
head_ID number, |
|
temporary_acting text);</schema> |
|
<question>What are the names of the heads who are born outside the California state?</question> |
|
<sql>""" |
|
``` |
|
|
|
### Response
|
```sql |
|
SELECT head.name FROM head WHERE head.born_state <> 'California'; |
|
``` |
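
One way to sanity-check a generated query is to run it against an empty copy of the schema with a few test rows. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine, which is an assumption because this card does not name a target database; the two inserted rows are made-up test data.

```python
import sqlite3

# The `head` table from the prompt above.
schema = "CREATE TABLE head (head_ID number, name text, born_state text, age number);"
generated_sql = "SELECT head.name FROM head WHERE head.born_state <> 'California';"

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
# Hypothetical rows, only there to exercise the WHERE clause.
conn.executemany(
    "INSERT INTO head (head_ID, name, born_state, age) VALUES (?, ?, ?, ?)",
    [(1, "Alice", "California", 52), (2, "Bob", "Alabama", 61)],
)
rows = conn.execute(generated_sql).fetchall()
conn.close()
print(rows)  # [('Bob',)]
```

A query that references a missing table or column raises `sqlite3.OperationalError`, which makes this a cheap first filter before running anything on a real database.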
|
|
|
### 3. Performance Schema Monitoring |
|
### Prompt
|
```python |
|
text = """Generate the SQL query for SkySQL performance schema for the following question. |
|
<example> |
|
--question: What are the top 10 most frequently used queries/statements? |
|
--sql: SELECT DIGEST_TEXT, COUNT(*) as frequency FROM performance_schema.events_statements_summary_by_digest GROUP BY DIGEST_TEXT ORDER BY frequency DESC LIMIT 10; |
|
</example> |
|
<schema> |
|
CREATE TABLE `accounts` (`USER` char(128) DEFAULT NULL -- 'The connection''s client user name for the connection, or NULL if an internal thread.', |
|
`HOST` char(255) DEFAULT NULL -- 'The connection client''s host name, or NULL if an internal thread.', |
|
`CURRENT_CONNECTIONS` bigint(20) NOT NULL -- 'Current connections for the account.',
|
`TOTAL_CONNECTIONS` bigint(20) NOT NULL -- 'Total connections for the account.' |
|
) ; |
|
</schema> |
|
<question> |
|
Tell me the number of active connections each user has. |
|
</question> |
|
<sql> |
|
""" |
|
``` |
|
### Response
|
```sql |
|
SELECT USER, CURRENT_CONNECTIONS FROM accounts; |
|
``` |
|
|
|
### Prompt
|
```python |
|
text = """Generate the SQL query for SkySQL performance schema for the following question. |
|
<example> |
|
--question: What are the top 10 most frequently used queries/statements? |
|
--sql: SELECT DIGEST_TEXT, COUNT(*) as frequency FROM performance_schema.events_statements_summary_by_digest GROUP BY DIGEST_TEXT ORDER BY frequency DESC LIMIT 10; |
|
</example> |
|
<schema> |
|
CREATE TABLE `file_summary_by_instance` ( |
|
`FILE_NAME` varchar(512) NOT NULL -- 'File name.', |
|
`EVENT_NAME` varchar(128) NOT NULL -- 'Event name.', |
|
`OBJECT_INSTANCE_BEGIN` bigint(20) unsigned NOT NULL -- 'Address in memory. Together with FILE_NAME and EVENT_NAME uniquely identifies a row.', |
|
`COUNT_STAR` bigint(20) unsigned NOT NULL -- 'Number of summarized events', |
|
`SUM_TIMER_WAIT` bigint(20) unsigned NOT NULL -- 'Total wait time of the summarized events that are timed.', |
|
`MIN_TIMER_WAIT` bigint(20) unsigned NOT NULL -- 'Minimum wait time of the summarized events that are timed.', |
|
`AVG_TIMER_WAIT` bigint(20) unsigned NOT NULL -- 'Average wait time of the summarized events that are timed.', |
|
`MAX_TIMER_WAIT` bigint(20) unsigned NOT NULL -- 'Maximum wait time of the summarized events that are timed.', |
|
`COUNT_READ` bigint(20) unsigned NOT NULL -- 'Number of all read operations, including FGETS, FGETC, FREAD, and READ.', |
|
`SUM_TIMER_READ` bigint(20) unsigned NOT NULL -- 'Total wait time of all read operations that are timed.', |
|
`MIN_TIMER_READ` bigint(20) unsigned NOT NULL -- 'Minimum wait time of all read operations that are timed.', |
|
`AVG_TIMER_READ` bigint(20) unsigned NOT NULL -- 'Average wait time of all read operations that are timed.', |
|
`MAX_TIMER_READ` bigint(20) unsigned NOT NULL -- 'Maximum wait time of all read operations that are timed.', |
|
`SUM_NUMBER_OF_BYTES_READ` bigint(20) NOT NULL -- 'Bytes read by read operations.', |
|
`COUNT_WRITE` bigint(20) unsigned NOT NULL -- 'Number of all write operations, including FPUTS, FPUTC, FPRINTF, VFPRINTF, FWRITE, and PWRITE.', |
|
`SUM_TIMER_WRITE` bigint(20) unsigned NOT NULL -- 'Total wait time of all write operations that are timed.', |
|
`MIN_TIMER_WRITE` bigint(20) unsigned NOT NULL -- 'Minimum wait time of all write operations that are timed.', |
|
`AVG_TIMER_WRITE` bigint(20) unsigned NOT NULL -- 'Average wait time of all write operations that are timed.', |
|
`MAX_TIMER_WRITE` bigint(20) unsigned NOT NULL -- 'Maximum wait time of all write operations that are timed.', |
|
`SUM_NUMBER_OF_BYTES_WRITE` bigint(20) NOT NULL -- 'Bytes written by write operations.', |
|
`COUNT_MISC` bigint(20) unsigned NOT NULL -- 'Number of all miscellaneous operations not counted above, including CREATE, DELETE, OPEN, CLOSE, STREAM_OPEN, STREAM_CLOSE, SEEK, TELL, FLUSH, STAT, FSTAT, CHSIZE, RENAME, and SYNC.', |
|
`SUM_TIMER_MISC` bigint(20) unsigned NOT NULL -- 'Total wait time of all miscellaneous operations that are timed.', |
|
`MIN_TIMER_MISC` bigint(20) unsigned NOT NULL -- 'Minimum wait time of all miscellaneous operations that are timed.', |
|
`AVG_TIMER_MISC` bigint(20) unsigned NOT NULL -- 'Average wait time of all miscellaneous operations that are timed.', |
|
`MAX_TIMER_MISC` bigint(20) unsigned NOT NULL -- 'Maximum wait time of all miscellaneous operations that are timed.' |
|
); |
|
</schema> |
|
<question> |
|
List out 10 names of the files with the most read and writes |
|
</question> |
|
<sql> |
|
""" |
|
``` |
|
|
|
### Response
|
```sql |
|
SELECT FILE_NAME FROM file_summary_by_instance ORDER BY SUM_NUMBER_OF_BYTES_READ DESC, SUM_NUMBER_OF_BYTES_WRITE DESC LIMIT 10; |
|
``` |
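
Both performance schema prompts above share the same layout: an instruction line, an optional `<example>` block, the `<schema>`, the `<question>`, and an open `<sql>` tag. A minimal sketch of a builder for that layout follows; `build_sql_prompt` is a hypothetical helper, and the schema passed to it here is abridged (column comments dropped) for brevity.

```python
def build_sql_prompt(instruction, schema, question, examples=()):
    """Assemble a SQL prompt; `examples` is a sequence of (question, sql) pairs."""
    parts = [instruction]
    if examples:
        parts.append("<example>")
        for example_question, example_sql in examples:
            parts.append(f"--question: {example_question}")
            parts.append(f"--sql: {example_sql}")
        parts.append("</example>")
    parts += ["<schema>", schema, "</schema>"]
    parts += ["<question>", question, "</question>"]
    # End with an open <sql> tag and a newline so the model continues with the query.
    parts += ["<sql>", ""]
    return "\n".join(parts)


text = build_sql_prompt(
    instruction="Generate the SQL query for SkySQL performance schema for the following question.",
    schema="CREATE TABLE `accounts` (`USER` char(128) DEFAULT NULL, `CURRENT_CONNECTIONS` bigint(20) NOT NULL);",
    question="Tell me the number of active connections each user has.",
    examples=[
        (
            "What are the top 10 most frequently used queries/statements?",
            "SELECT DIGEST_TEXT, COUNT(*) as frequency FROM performance_schema.events_statements_summary_by_digest GROUP BY DIGEST_TEXT ORDER BY frequency DESC LIMIT 10;",
        )
    ],
)
print(text)
```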
|
|
|
### Team |
|
Avi Kothari, Gyan Ranjan, Pratham Gupta, Ritvik Aryan Kalra, Soham Acharya |