{% extends "page.html" %} {% block stylesheet %}
<style>
  .left-align {
    text-align: left;
  }
  .center-align {
    text-align: center;
  }
  .container {
    margin: 20px auto;
    max-width: 800px;
  }
  h2,
  h3,
  h4 {
    margin-top: 20px;
  }
  p {
    line-height: 1.6;
  }
  ul {
    margin-left: 20px;
  }
</style>
{% endblock %} {% block site %}
<div id="jupyter-main-app" class="container">
  <div class="center-align">
    <img
      src="https://huggingface.co/datasets/davanstrien/assets/resolve/main/logo.jpg"
      alt="Space Logo"
      style="width: 75%"
    />
    <p>
      This Space gives you an easy way to start generating synthetic datasets,
      using Spaces compute to host open LLMs. It comes with a ready-to-go
      environment and a series of notebooks that walk through examples of
      synthetic dataset generation.
    </p>
  </div>
  <div class="left-align">
    <h2>What's covered?</h2>
    <p>Currently this Space has notebooks covering the following topics:</p>
    <h3>Creating synthetic text similarity datasets</h3>
    <p>
      A set of notebooks covering the steps for creating a synthetic dataset for
      fine-tuning a sentence similarity model. These notebooks cover:
    </p>
    <ul>
      <li>
        How to do structured generation using the
        <a href="https://github.com/outlines-dev/outlines">outlines</a> library
        to gain more control over the outputs generated by an LLM.
      </li>
      <li>
        How to use
        <a href="https://docs.llamaindex.ai/en/stable/">LlamaIndex</a> to chunk
        texts to fit into the context length of sentence embedding models.
      </li>
      <li>
        How to use <a href="https://github.com/vllm-project/vllm">vLLM</a> to
        efficiently create a dataset that can be used to fine-tune a sentence
        similarity model.
      </li>
    </ul>
  </div>
  <div class="center-align">
    <h2>Using the Space</h2>
    <p>
      To use this Space, duplicate it using the button below. You'll want to
      enable persistent storage so you can save your work. To start, you may
      want to use a smaller GPU like the T4 and switch to a larger GPU when you
      want to use bigger models for generating data.
      <b>Reminder:</b> you can preview the notebooks in the Space without
      running them. You can find the notebooks in the
      <code>notebooks</code> folder
      <a href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop/tree/main/notebooks">here</a>.
    </p>
    <h2>Duplicate the Space to run your own instance</h2>
    <br />
    <a
      class="duplicate-button"
      style="display: inline-block"
      target="_blank"
      href="https://huggingface.co/spaces/davanstrien/synthetic-data-workshop?duplicate=true"
    >
      <img
        style="margin: 0"
        src="https://img.shields.io/badge/-Duplicate%20Space-blue?labelColor=white&amp;style=flat&amp;logo=&amp;logoWidth=14"
        alt="Duplicate Space"
      />
    </a>
    <br />
    <br />
    <h4>The default token is <span style="color: orange">huggingface</span></h4>
  </div>
  {% if login_available %}
  <div class="center-align">
    <form
      action="{{base_url}}login?next={{next}}"
      method="post"
      class="form-inline"
    >
      {{ xsrf_form_html() | safe }} {% if token_available %}
      <label for="password_input"
        ><strong>{% trans %}Token:{% endtrans %}</strong></label
      >
      {% else %}
      <label for="password_input"
        ><strong>{% trans %}Password:{% endtrans %}</strong></label
      >
      {% endif %}
      <input
        type="password"
        name="password"
        id="password_input"
        class="form-control"
      />
      <button type="submit" class="btn btn-default" id="login_submit">
        {% trans %}Log in{% endtrans %}
      </button>
    </form>
  </div>
  {% else %}
  <div class="center-align">
    <p>
      {% trans %}No login available, you shouldn't be seeing this page.{%
      endtrans %}
    </p>
  </div>
  {% endif %}
  <div class="center-align" style="font-size: 0.8em; color: #888">
    <p>
      This template was created by
      <a href="https://twitter.com/camenduru" target="_blank">camenduru</a> and
      <a href="https://huggingface.co/nateraw" target="_blank">nateraw</a>, with
      contributions from
      <a href="https://huggingface.co/osanseviero" target="_blank"
        >osanseviero</a
      >
      and <a href="https://huggingface.co/azzr" target="_blank">azzr</a>.
    </p>
  </div>
  {% if message %}
  <div class="row">
    {% for key in message %}
    <div class="message {{key}}">{{message[key]}}</div>
    {% endfor %}
  </div>
  {% endif %} {% if token_available %} {% block token_message %} {% endblock
  token_message %} {% endif %}
</div>
{% endblock %} {% block script %} {% endblock %}