Goekdeniz-Guelmez committed 098adf9 (verified) · 1 Parent(s): 07c2c89

Update README.md

Files changed (1): README.md (+49 −761)
language:
- de
- en
library_name: mlx
tags:
- moe
- multimodal
- vision
- audio
- endtoend
- j.o.s.i.e.
---
 
# J.O.S.I.E. (Just a Smart and Intelligent Entity)

Welcome to the J.O.S.I.E. project repository! J.O.S.I.E. is a cutting-edge AI assistant designed to revolutionize the way we interact with smart home systems and general AI capabilities. This document provides an overview of J.O.S.I.E.'s features, capabilities, and development roadmap.
 
39
- ## Table of Contents
40
 
41
- 1. [Introduction](#introduction)
42
- 2. [Features](#features)
43
- 3. [Training Stages](#training-stages)
44
- 4. [Current Progress](#current-progress)
45
- 5. [Usage](#usage)
46
- 6. [Contributing](#contributing)
47
- 7. [License](#license)
48
 
## Updates

I'm currently creating the multimodal smart-home-management and tool-calling datasets in German and English.
 
## Introduction

J.O.S.I.E. stands for "Just a Smart and Intelligent Entity." It is not just a conversational AI assistant but a fully multimodal AI designed to understand and process images, videos, thermal images, depth, and audio in real time. J.O.S.I.E. is built to autonomously manage smart homes and provide general-purpose assistance, with advanced capabilities accessible only to the main user.
 
## Features

- **Real-Time Processing:** J.O.S.I.E. operates in real time, ensuring quick and efficient responses.
- **Tool Calling:** Capable of calling various tools to perform tasks (only for the main user).
- **Short/Long-Term Memory:** Remembers past interactions and uses this data to provide a more personalized experience.
- **Secure Information Access:** Accesses top-secret information upon receiving a special password from the main user.
- **Contextual Greetings:** Greets users based on contextual data such as time of day, birthdays, and more.
- **Voice Interaction:** Will support real-time voice responses with a response time under 0.3 ms.
- **Advanced Multimodal Capabilities:** Initially uses Meta's ImageBind model, transitioning to a self-implemented encoder.
- **Uncensored Interaction:** Full, uncensored interaction capabilities are reserved for the main user.
- **Autonomous Smart Home Management:** Manages smart home devices and systems autonomously.
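The tool-calling feature above is not specified further in this README; as an illustration only, here is a minimal, entirely hypothetical dispatch loop for model-emitted tool calls. The tool names, JSON schema, and registry are placeholders, not the project's actual interface.

```python
import json

# Hypothetical tool registry -- the project's real tool schema isn't published.
TOOLS = {
    "set_light": lambda room, on: f"light in {room} turned {'on' if on else 'off'}",
    "get_temperature": lambda room: f"{room}: 21.5 C",
}

def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted call like {"name": ..., "arguments": {...}}
    and execute the matching registered tool."""
    call = json.loads(tool_call_json)
    tool = TOOLS[call["name"]]
    return tool(**call["arguments"])

result = dispatch('{"name": "set_light", "arguments": {"room": "kitchen", "on": true}}')
```

A real assistant would validate the arguments and restrict the registry to the main user before executing anything.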
 
## Training Stages

J.O.S.I.E.'s development is structured into several meticulously planned stages, each focusing on different aspects of its capabilities:

### Stage 1: **Genesis**
- **Objective:** Fine-tune the Large Language Model (LLM) with a custom dataset and prompt format. The LLMs used are Qwen2 7B and 0.5B.
- **Outcome:** A robust foundation for text-based interactions.

### Stage 2: **Fusion**
- **Objective:** Train encoders separately using transfer learning to align input embeddings with text embeddings.
- **Outcome:** Harmonized multimodal input processing.

### Stage 3: **Synergy**
- **Objective:** Fine-tune the LLM for multimodal reasoning using a custom dataset.
- **Outcome:** Enhanced reasoning capabilities across text and other modalities.

### Stage 4: **Vocalize**
- **Objective:** Fine-tune the decoder for audio output, giving J.O.S.I.E. a voice.
- **Outcome:** Synchronized text and audio responses.

### Stage 5: **Convergence**
- **Objective:** Perform full model fine-tuning for seamless integration of all components.
- **Outcome:** A fully multimodal, real-time interactive AI assistant.
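Stage 2's embedding alignment can be sketched with a minimal, pure-Python InfoNCE-style contrastive loss, assuming paired (encoder, text) embeddings per batch. The batch size, dimension, and function names are illustrative, not the project's actual training code.

```python
import math
import random

def info_nce_loss(enc_emb, txt_emb, temperature=0.07):
    """Contrastive loss: pull each encoder embedding toward its paired text
    embedding (the diagonal) and away from the other texts in the batch."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    enc = [normalize(v) for v in enc_emb]
    txt = [normalize(v) for v in txt_emb]
    loss = 0.0
    for i, e in enumerate(enc):
        # cosine similarities of sample i against every text in the batch
        logits = [sum(a * b for a, b in zip(e, t)) / temperature for t in txt]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)  # cross-entropy with target = i
    return loss / len(enc)

random.seed(0)
batch, dim = 4, 256  # illustrative sizes
image_embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
text_embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
loss = info_nce_loss(image_embeddings, text_embeddings)        # unaligned: high
aligned_loss = info_nce_loss(text_embeddings, text_embeddings)  # identical: near zero
```

Minimizing this loss while keeping the text embeddings frozen is one common way to pull a new encoder into an existing text embedding space.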
## Current Progress

J.O.S.I.E. is currently in its beta stage, specifically Stage 1. The model is being actively developed, and the current version is focused on fine-tuning the LLM with custom datasets.

### Latest Beta Version 4 of Stage 1
- **Model:** [Isaak-Carter/josiev4o-7b-stage1-v0.1](https://huggingface.co/Isaak-Carter/J.O.S.I.E.v4o-7b-stage1-v0.1-gguf)
- **Quants:** [Isaak-Carter/J.O.S.I.E.v4o-7b-stage1-v0.1-gguf](https://huggingface.co/Isaak-Carter/J.O.S.I.E.v4o-7b-stage1-v0.1-gguf)

For a sneak peek at the current progress, visit the [GitHub Repo](https://github.com/Goekdeniz-Guelmez/J.O.S.I.E.-v4o.git).
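The custom prompt format itself isn't shown in this README. Since the Stage 1 base models are Qwen2, which use the ChatML template, a placeholder prompt builder might look like the following; the system message and conversation are hypothetical examples.

```python
def build_chatml_prompt(system, turns):
    """Assemble a ChatML-style prompt (the template Qwen2 instruct models use).

    `turns` is a list of (role, content) pairs, e.g. ("user", "Hi").
    """
    parts = [f"<|im_start|>system\n{system}<|im_end|>"]
    for role, content in turns:
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # generation continues from here
    return "\n".join(parts)

prompt = build_chatml_prompt(
    "You are J.O.S.I.E., a smart and intelligent entity.",  # hypothetical system text
    [("user", "Turn off the living-room lights.")],
)
```

When using the GGUF quants with a ChatML-aware runtime, the runtime's own chat template should take precedence over a hand-rolled builder like this.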
 
 
## Source Code

For the latest updates on J.O.S.I.E.v4o, see my [GitHub Repo](https://github.com/Goekdeniz-Guelmez/J.O.S.I.E.-v4o.git).
## Contributing

I welcome contributions from you! To contribute to J.O.S.I.E., please fork the repository and create a pull request with your changes. Ensure that your code adheres to my coding standards and includes appropriate tests and comments.

## License

J.O.S.I.E. is licensed under the Apache 2.0 License. See the [LICENSE](LICENSE) file for more details.
# Big Updates!

I have finally trained the Vision and Audio encoder part. Big thanks to Facebook Research for the ImageBind model, which is what I have built on top of.

What I did was copy the weights from the original ImageBind model into a second 'downcycled' ImageBindVisionAudioHuge model. After that, I continued training the model on a custom Vision and Audio dataset, using the contrastive learning algorithm Google introduced with PaliGemma, with the text embeddings from the original ImageBind model.

After merging the encoder with the test reasoner (Qwen2-0.5B-Instruct), I got successful inference on video, image, and audio. I will slowly start writing the training script, creating the new dataset, and optimizing the model and inference code a little more, and lastly train the model.
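The weight-copying ('downcycling') step described above can be sketched like this; plain dicts stand in for real PyTorch state_dicts, and the layer names are made up for illustration.

```python
def downcycle_weights(source, target):
    """Copy every weight whose name and shape exist in the smaller target model.

    Mimics loading a filtered state_dict: entries the target lacks
    (e.g. dropped modalities or extra blocks) are simply skipped.
    """
    copied = {}
    for name, weight in source.items():
        if name in target and len(weight) == len(target[name]):
            copied[name] = weight          # keep the pretrained value
    return {**target, **copied}            # target-only keys stay initialized

# Toy example: the 'downcycled' model keeps vision/audio and drops thermal.
source = {"vision.block0.w": [1.0, 2.0], "audio.block0.w": [3.0], "thermal.block0.w": [4.0]}
target = {"vision.block0.w": [0.0, 0.0], "audio.block0.w": [0.0]}
merged = downcycle_weights(source, target)
```

With real models the same effect is usually achieved by filtering a state_dict before loading it non-strictly into the smaller module.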

Here are the actual model layers:

```txt
ImageBindModelAudioVision(
  (modality_preprocessors): ModuleDict(
    (vision): RGBDTPreprocessor(
      (cls_token): tensor((1, 1, 1280), requires_grad=True)
      (rgbt_stem): PatchEmbedGeneric(
        (proj): Sequential(
          (0): PadIm2Video()
          (1): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
        )
      )
      (pos_embedding_helper): SpatioTemporalPosEmbeddingHelper(
        (pos_embed): tensor((1, 257, 1280), requires_grad=True)
      )
    )
    (audio): AudioPreprocessor(
      (cls_token): tensor((1, 1, 768), requires_grad=True)
      (rgbt_stem): PatchEmbedGeneric(
        (proj): Conv2d(1, 768, kernel_size=(16, 16), stride=(10, 10), bias=False)
        (norm_layer): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
      (pos_embedding_helper): SpatioTemporalPosEmbeddingHelper(
        (pos_embed): tensor((1, 229, 768), requires_grad=True)
      )
    )
  )
  (modality_trunks): ModuleDict(
    (vision): SimpleTransformer(
      (pre_transformer_layer): Sequential(
        (0): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (1): EinOpsRearrange()
      )
      (blocks): Sequential(
        (0): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        ... (1)-(31): 31 further BlockWithMasking blocks, identical to (0) ...
      )
      (post_transformer_layer): EinOpsRearrange()
    )
    (audio): SimpleTransformer(
      (pre_transformer_layer): Sequential(
        (0): Identity()
        (1): EinOpsRearrange()
      )
      (blocks): Sequential(
        (0): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        )
        ... (1)-(11): 11 further BlockWithMasking blocks, identical to (0) except
            (drop_path): DropPath(drop_prob=0.009, 0.018, 0.027, 0.036, 0.045,
            0.055, 0.064, 0.073, 0.082, 0.091, 0.100), increasing per block ...
      )
      (post_transformer_layer): EinOpsRearrange()
    )
  )
  (modality_heads): ModuleDict(
    (vision): Sequential(
      (0): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
800
- (1): SelectElement()
801
- (2): Linear(in_features=1280, out_features=1024, bias=False)
802
- )
803
- (audio): Sequential(
804
- (0): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
805
- (1): SelectElement()
806
- (2): Linear(in_features=768, out_features=1024, bias=False)
807
- )
808
- )
809
- (modality_postprocessors): ModuleDict(
810
- (vision): Normalize()
811
- (audio): Sequential(
812
- (0): Normalize()
813
- (1): LearnableLogitScaling(logit_scale_init=20.0,learnable=False, max_logit_scale=100)
814
- )
815
- )
816
- )
817
- ```
 
  language:
  - de
  - en
  tags:
  - moe
  - multimodal
+ - any-to-any
  - vision
  - audio
  - endtoend
  - j.o.s.i.e.
  ---
 
+ Project JOSIE: Just an Outstandingly Smart Intelligent Entity

+ Overview:

+ Project JOSIE aims to create a next-generation multimodal AI assistant that operates in real time. The ultimate goal is to offer comprehensive support for personal assistance and smart home management, closely resembling the functionality of fictional AI assistants such as JARVIS. JOSIE’s architecture is designed to handle complex, multi-sensory input, processing diverse data formats such as text, speech, images, and video. The initial implementation focuses on text and speech-to-text capabilities, with future iterations planned to introduce robust visual processing of both image and video inputs.

+ The system is structured to be responsive, proactive, and capable of real-time decision-making. JOSIE’s core strengths lie in her ability to interact intelligently across multiple modalities, integrate ongoing data streams, and respond with contextually relevant, articulate outputs. Through multimodal encoding, JOSIE’s pipeline merges discrete data types into an agile and efficient data-handling model, with the flexibility for future expansions such as additional sensory inputs or specialized data-processing tasks.
+ Use Case:

+ JOSIE’s primary use case is real-time personal assistance, with an emphasis on home automation and management. She is intended to handle routine smart home tasks autonomously, initiating conversations or prompting user interaction when necessary (e.g., after identifying unrecognized individuals in security footage). While this core capability is geared toward managing an interconnected smart home, the system’s multimodal foundation allows JOSIE to adapt to other applications, potentially extending to health monitoring, environment mapping, and real-time AI-driven decision-making.
+ Model Card for JOSIE

+ Model Details:

+ • Model Name: JOSIE-v4
+ • Version: 4.0
+ • Model Type: Multimodal, real-time assistant
+ • Primary Use: Smart home management, personal assistance
+ • Current Modalities:
+   • Input: Text, Speech-to-Text
+   • Output: Text, Speech
+ • Upcoming Modalities:
+   • Input: Image, Video, Depth, Thermal imaging
+   • Output: Enhanced audiovisual feedback
+ • Target Audience: Authorized primary users receive full capabilities (smart home management and advanced AI interactions); a limited assistance mode is available for other authorized users.
+ Architecture:

+ • Core Framework: A central general-purpose LLM (LLaMA/Qwen) processes discrete tokens generated from the various sensory inputs.
+ • Audio Processing: Employs an RQ-Transformer with temporal and depth transformers, encoding raw audio into discrete tokens for the LLM. Output tokens are decoded back into audio: the RQ-Transformer converts them into Mel spectrograms, which a vocoder renders into a waveform.
+ • Vision Processing (Planned): Image and video inputs will employ a separate vision transformer that produces discrete tokens, merged with audio and text embeddings for unified interpretation.
+ • Quantization and Tokenization: Implements residual quantization (RQ) with sequential codebooks for efficient tokenization of audio and depth data. Chunked and normalized embeddings are iteratively refined through RQ to produce a compact final token representation.
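The iterative RQ refinement described above can be sketched in a few lines of Python. The two tiny codebooks below are purely illustrative stand-ins, not JOSIE’s actual configuration:

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Tokenize one embedding vector with a sequence of codebooks.

    Each stage encodes the residual left over by the previous stage,
    so later codebooks progressively refine the reconstruction and
    the result is one discrete token per stage.
    """
    residual = np.asarray(x, dtype=float)
    tokens = []
    recon = np.zeros_like(residual)
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)              # discrete token for this stage
        recon = recon + cb[idx]
        residual = residual - cb[idx]   # only the residual moves on
    return tokens, recon

# Two-stage toy example: a coarse codebook followed by a finer one.
coarse = np.array([[1.0, 0.0], [0.0, 1.0]])
fine = np.array([[0.5, 0.0], [0.0, 0.5]])
tokens, recon = residual_quantize([1.4, 0.1], [coarse, fine])
print(tokens, recon)  # prints: [0, 0] [1.5 0. ]
```

Stacking more codebooks shrinks the residual further at the cost of one extra token per stage, which is the trade-off behind the compact final token representation.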
+ Model Use:

+ • Input Formats: Currently accepts text and audio. Audio input is converted to discrete tokens via the RQ-Transformer, allowing for efficient encoding and storage.
+ • Output Formats: Generates text and audio responses, with audio decoded back into speech through the RQ-Transformer’s vocoder output pipeline.
+ • Inference Speed: Optimized for low-latency, real-time responses, using batched generation so that interaction stays seamless in time-sensitive environments such as smart home control.
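Conceptually, a single audio exchange flows encoder tokens → LLM → decoder tokens → spectrogram → waveform. The stub below only illustrates that ordering; every function body is a hypothetical placeholder, not JOSIE’s actual encoder, model, or vocoder:

```python
def encode_audio(waveform):
    """Stand-in for the RQ-Transformer encoder: raw samples -> discrete tokens."""
    return [int(abs(s) * 1000) % 1024 for s in waveform]

def llm_generate(tokens):
    """Stand-in for the core LLM: input tokens -> response tokens."""
    return list(reversed(tokens))

def tokens_to_mel(tokens):
    """Stand-in for the RQ-Transformer decoder: tokens -> Mel spectrogram frames."""
    return [[t / 1024.0] * 80 for t in tokens]  # 80 Mel bins per frame

def vocoder(mel_frames):
    """Stand-in for the vocoder: spectrogram frames -> waveform samples."""
    return [frame[0] for frame in mel_frames]

waveform = [0.1, -0.2, 0.05]
reply = vocoder(tokens_to_mel(llm_generate(encode_audio(waveform))))
print(len(reply))  # one output sample per generated token in this sketch
```

The point of the sketch is the staging: the LLM only ever sees and emits discrete tokens, while the encoder and vocoder sit at the boundaries of the pipeline.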
+ Intended Use Cases:

+ 1. Smart Home Management: Autonomously controls and manages smart home devices, providing alerts and requesting interaction when necessary.
+ 2. Security and Monitoring: Identifies and distinguishes authorized from unauthorized individuals using the upcoming vision capabilities.
+ 3. Personal Assistance: Engages in general-purpose conversation, provides reminders, and assists with basic daily tasks.
+ Model Capabilities:

+ • Real-Time Processing: Continuously ingests data with second-by-second updates, capable of issuing commands and engaging in dialogue.
+ • Autonomous Behavior: Responds to certain triggers (e.g., security concerns or abnormal events) autonomously, yet requests user input for actions requiring confirmation.
+ • Proactive Interactivity: Acts as both a responsive assistant and a proactive agent, initiating conversations when a task, anomaly, or user behavior warrants attention.
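The split between autonomous action and user confirmation can be pictured as a simple trigger policy; the event names and categories here are invented for illustration, not JOSIE’s real event taxonomy:

```python
# Events the assistant may act on without asking (hypothetical examples).
AUTONOMOUS = {"lights_left_on", "thermostat_drift"}
# Events that should prompt the user before any action is taken.
NEEDS_CONFIRMATION = {"unknown_person_at_door", "unlock_front_door"}

def handle_event(event: str) -> str:
    if event in AUTONOMOUS:
        return f"handled: {event}"            # act immediately
    if event in NEEDS_CONFIRMATION:
        return f"asking user about: {event}"  # proactive prompt to the user
    return f"logged: {event}"                 # default: observe only

for e in ["thermostat_drift", "unknown_person_at_door", "mail_delivered"]:
    print(handle_event(e))
```

The design choice is that anything security-sensitive or irreversible lands in the confirmation bucket, so proactivity never bypasses user consent.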
+ Limitations:

+ • Current Modalities: Limited to text and speech; vision functionality is forthcoming.
+ • Authorized Access Only: Full capabilities are limited to the primary user, with a restricted, general-purpose assistant mode for other authorized users.
+ • Data-Intensive: Real-time processing requires significant data bandwidth and computational resources, particularly for multimodal and high-frequency tasks.
+ Future Enhancements:

+ • Vision and Depth Modalities: Planned addition of image and video input, enabling JOSIE to analyze visual data for a broader range of use cases.
+ • Expanded Memory and Interaction Context: A larger memory module to increase contextual awareness and sustain longer interactions without losing track of prior exchanges.
+ • Enhanced Security and Recognition: Deep-learning models for security and monitoring applications, especially facial recognition, gesture detection, and other high-stakes, real-time tasks.
+ Ethical Considerations:

+ • Data Privacy: Ensures that any visual, auditory, or environmental data collected remains private to the authorized user.
+ • Bias and Fairness: JOSIE is trained to provide unbiased support; future model improvements will address potential biases in visual and auditory data processing.

+ Project JOSIE’s roadmap is set on delivering a true multimodal experience, with the real-time integration of diverse sensory inputs and outputs at its core. This combination positions JOSIE to turn everyday interactions into seamless, responsive, and secure AI-powered experiences.