Building a Custom Retrieval System with Motoko and Node.js

Community Article Published December 9, 2024

In this tutorial, we’ll walk through building a custom embedding storage and retrieval system using Motoko (a smart contract language for the Internet Computer) and Node.js (a JavaScript runtime for building server-side applications). This system can store, retrieve, and manage embeddings—numerical representations often used in machine learning or AI applications, like recommendation engines or NLP systems.

What We’ll Cover

Understanding the Problem Space
- Why embeddings are important.
- The challenges of storing embeddings efficiently.
System Design Overview
- The role of Motoko for storage.
- Node.js as a bridge to expose a REST API.
Step-by-Step Implementation
- Setting up the Motoko canister.
- Integrating Node.js with the canister.
- Building the REST API.
Enhancing and Scaling
- Security considerations.
- Potential optimizations.

1. Understanding the Problem Space

What are embeddings?
Embeddings are dense numerical representations of data that capture semantic meaning. For example:

In NLP, embeddings represent words or sentences so that texts with similar meanings map to vectors that are numerically close.
In recommendation systems, embeddings are used to compare items and users.
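
To make "numerically close" concrete, here is a minimal sketch that measures the Euclidean (straight-line) distance between vectors. The three-dimensional vectors and the function name euclideanDistance are made up for illustration; real embeddings usually have hundreds of dimensions and come from a model.

```javascript
// Euclidean (straight-line) distance between two equal-length vectors.
// Smaller values mean the vectors are closer together.
function euclideanDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Toy vectors, invented for this example.
const cat = [0.9, 0.1, 0.0];
const kitten = [0.8, 0.2, 0.1];
const car = [0.0, 0.1, 0.9];

// "cat" sits closer to "kitten" than to "car".
console.log(euclideanDistance(cat, kitten) < euclideanDistance(cat, car)); // true
```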

Challenges:

Storage: Embeddings are often arrays of floats, requiring structured storage.
Retrieval: Efficient querying of embeddings is crucial, especially for large datasets.
Integration: The embeddings must be exposed through a secure, accessible API.
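
To get a feel for the storage challenge, the sketch below estimates the raw size of the float data alone. It relies on Motoko's Float being a 64-bit value (8 bytes per dimension); the workload of 10,000 embeddings with 384 dimensions is a hypothetical example, and the real footprint in a canister will be larger because of the text, timestamps, and per-record overhead.

```javascript
// Rough size of the raw float payload only; it ignores the text field,
// timestamps, and any per-record overhead in the canister.
const BYTES_PER_FLOAT64 = 8;

function rawEmbeddingBytes(dimensions, count) {
  return dimensions * BYTES_PER_FLOAT64 * count;
}

// Hypothetical workload: 10,000 embeddings of 384 dimensions each.
const bytes = rawEmbeddingBytes(384, 10000);
console.log(`${(bytes / (1024 * 1024)).toFixed(1)} MiB`); // 29.3 MiB
```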

2. System Design Overview

Architecture:

Motoko Canister: A smart contract deployed on the Internet Computer to store embeddings persistently.
Node.js Server: Acts as a bridge, exposing REST endpoints for users to interact with the canister.
Frontend/Client: (Optional) Can interact with the Node.js API for UI/UX.

3. Step-by-Step Implementation

Step 1: Setting up the Motoko Canister

Install the DFINITY SDK:

sh -ci "$(curl -fsSL https://internetcomputer.org/install.sh)"

Create a new Motoko project:

dfx new embedding-store
cd embedding-store

Define the EmbeddingStore Actor in main.mo:

import Array "mo:base/Array";
import Time "mo:base/Time";

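actor EmbeddingStore {
    // One stored record: the source text, its vector, and a creation timestamp.
    type Embedding = {
        text: Text;
        embedding: [Float];
        createdAt: Int; // nanoseconds since the Unix epoch, from Time.now()
    };

    // `stable` keeps the array across canister upgrades.
    stable var embeddings: [Embedding] = [];

    // Append a new record. Array.append copies the whole array on every call,
    // which is fine for a demo but slow for large collections.
    public shared func storeEmbedding(text: Text, embedding: [Float]) : async () {
        let timestamp = Time.now();
        embeddings := Array.append(embeddings, [{
            text = text;
            embedding = embedding;
            createdAt = timestamp;
        }]);
    };

    // Return every stored record (read-only query call).
    public query func getEmbeddings() : async [Embedding] {
        return embeddings;
    };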
};
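
The createdAt field deserves a note for the JavaScript side. Time.now() in Motoko returns nanoseconds since the Unix epoch, and Candid int values are decoded by @dfinity/agent as BigInt. The sketch below, with a hypothetical helper name nanosToDate and a made-up sample value, shows one way to turn such a timestamp into a Date.

```javascript
// Convert a nanosecond timestamp (BigInt) into a JavaScript Date.
function nanosToDate(nanos) {
  // BigInt division is integer division, so sub-millisecond precision is dropped.
  const millis = BigInt(nanos) / 1000000n;
  return new Date(Number(millis));
}

// Sample createdAt value: 2024-12-09T00:00:00Z expressed in nanoseconds.
const createdAt = 1733702400000000000n;
console.log(nanosToDate(createdAt).toISOString()); // 2024-12-09T00:00:00.000Z
```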

Deploy the Canister: Update dfx.json so that your canister is defined under the name embedding-store (the name used in the commands below), then deploy:
```
dfx start --background
dfx deploy
```

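Test the Canister: Use dfx canister call to test methods. Arguments are written in Candid's textual syntax, where a vector is written as vec { ...; ... }:

dfx canister call embedding-store storeEmbedding '("Sample Text", vec { 1.0; 0.5; 0.25 })'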
dfx canister call embedding-store getEmbeddings

Step 2: Setting up the Node.js Server

Initialize a Node.js Project:

mkdir embedding-api
cd embedding-api
npm init -y
npm install express body-parser @dfinity/agent dotenv

Create the index.js File:

const express = require('express');
const bodyParser = require('body-parser');
const { HttpAgent, Actor } = require('@dfinity/agent');
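// Candid bindings for the canister (dfx can produce them with `dfx generate`).
// The generated file is an ES module, so it may need converting to CommonJS
// before require() can load it from ./idl.
const { idlFactory } = require('./idl/embedding_store.did.js');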
require('dotenv').config();

const app = express();
const port = 3000;

app.use(bodyParser.json());

const canisterId = process.env.CANISTER_ID;
const host = process.env.HOST;

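const agent = new HttpAgent({ host });
// fetchRootKey() is asynchronous and meant for local replicas only; never call
// it against mainnet. NODE_ENV is used as the switch here; adapt it to your setup.
if (process.env.NODE_ENV !== 'production') {
    agent.fetchRootKey().catch((error) => {
        console.error('Unable to fetch the root key. Is the local replica running?', error);
    });
}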

const embeddingStore = Actor.createActor(idlFactory, {
    agent,
    canisterId,
});

app.post('/storeEmbedding', async (req, res) => {
    const { text, embedding } = req.body;
    try {
        const embeddingFloat64 = embedding.map(Number);
        await embeddingStore.storeEmbedding(text, embeddingFloat64);
        res.status(200).send('Embedding stored successfully.');
    } catch (error) {
        res.status(500).send(`Error: ${error.message}`);
    }
});

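app.get('/getEmbeddings', async (req, res) => {
    try {
        const embeddings = await embeddingStore.getEmbeddings();
        // Candid `int` values (createdAt) arrive as BigInt, which JSON.stringify
        // cannot serialize, so convert them to strings before responding.
        const serializable = embeddings.map((item) => ({
            text: item.text,
            embedding: Array.from(item.embedding),
            createdAt: item.createdAt.toString(),
        }));
        res.status(200).json(serializable);
    } catch (error) {
        res.status(500).send(`Error: ${error.message}`);
    }
});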

app.listen(port, () => {
    console.log(`Server is running on http://localhost:${port}`);
});
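
The POST handler above trusts the request body, so malformed input only surfaces as a 500 error. As a hedged sketch, a small validation step could run first. The helper name validateEmbeddingPayload and its error messages are hypothetical, not part of any library.

```javascript
// Validate the JSON body expected by POST /storeEmbedding.
// Returns { ok: true, value } on success or { ok: false, error } on failure.
function validateEmbeddingPayload(body) {
  if (body === null || typeof body !== 'object') {
    return { ok: false, error: 'Body must be a JSON object.' };
  }
  const { text, embedding } = body;
  if (typeof text !== 'string' || text.length === 0) {
    return { ok: false, error: '"text" must be a non-empty string.' };
  }
  if (!Array.isArray(embedding) || embedding.length === 0) {
    return { ok: false, error: '"embedding" must be a non-empty array.' };
  }
  const allFinite = embedding.every(
    (x) => typeof x === 'number' && Number.isFinite(x)
  );
  if (!allFinite) {
    return { ok: false, error: '"embedding" must contain only finite numbers.' };
  }
  return { ok: true, value: { text, embedding } };
}

console.log(validateEmbeddingPayload({ text: 'Sample Text', embedding: [0.1, 0.2, 0.3] }).ok); // true
console.log(validateEmbeddingPayload({ text: 'Sample Text', embedding: ['a'] }).ok); // false
```

In the handler, a failed check could be answered with res.status(400).send(result.error) before the canister is called.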

Run the Server:
```
node index.js
```
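
The server reads CANISTER_ID and HOST from the environment, so both need to be set (for example in a .env file) before it starts. The sketch below shows one way to check them early. The loadConfig helper is hypothetical, and the default host http://127.0.0.1:4943 is an assumption based on the address a local dfx replica commonly uses; check the output of dfx start for the actual port.

```javascript
// Read the two settings index.js expects. The default host is an assumption:
// it is the address a local dfx replica commonly listens on.
function loadConfig(env) {
  const canisterId = env.CANISTER_ID;
  if (!canisterId) {
    throw new Error('CANISTER_ID is not set; add it to .env or the environment.');
  }
  const host = env.HOST || 'http://127.0.0.1:4943';
  // Treat loopback hosts as local development, where fetchRootKey() is appropriate.
  const isLocal = /^https?:\/\/(127\.0\.0\.1|localhost)(:\d+)?(\/|$)/.test(host);
  return { canisterId, host, isLocal };
}

// Example with a placeholder canister ID; in index.js you would pass process.env.
const config = loadConfig({ CANISTER_ID: 'your-canister-id', HOST: 'http://127.0.0.1:4943' });
console.log(config.isLocal); // true
```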

Step 3: Interacting with the API

Storing an Embedding: Use a tool like curl or Postman to send a POST request:

curl -X POST http://localhost:3000/storeEmbedding \
-H "Content-Type: application/json" \
-d '{"text":"Sample Text","embedding":[0.1,0.2,0.3]}'
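
The same request can be sent from JavaScript. This is a minimal sketch that mirrors the curl command above; the helper names buildStoreRequest and storeEmbedding are hypothetical, and the default fetchImpl assumes a runtime with a global fetch (Node.js 18 or later, or a browser).

```javascript
// Build the arguments for fetch() that mirror the curl command.
function buildStoreRequest(baseUrl, text, embedding) {
  return {
    url: `${baseUrl}/storeEmbedding`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, embedding }),
    },
  };
}

// Send the request. `fetchImpl` defaults to the global fetch and can be
// replaced with a stub when testing.
async function storeEmbedding(baseUrl, text, embedding, fetchImpl = fetch) {
  const { url, init } = buildStoreRequest(baseUrl, text, embedding);
  const response = await fetchImpl(url, init);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

const request = buildStoreRequest('http://localhost:3000', 'Sample Text', [0.1, 0.2, 0.3]);
console.log(request.init.body); // {"text":"Sample Text","embedding":[0.1,0.2,0.3]}
```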

Retrieving Embeddings: Send a GET request:

curl http://localhost:3000/getEmbeddings
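
The endpoint returns every stored record, so the actual "retrieval" step is left to the caller. As a rough sketch, a client could rank the returned records against a query vector with cosine similarity and keep the best few. The topK helper and the sample records are made up for illustration, and a brute-force scan like this is only reasonable for small collections.

```javascript
// Cosine similarity between two vectors assumed to have equal length.
// Returns 0 when either vector has zero norm.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every record against the query and return the k best matches.
function topK(records, query, k) {
  return records
    .map((record) => ({
      text: record.text,
      score: cosineSimilarity(record.embedding, query),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Records shaped like the /getEmbeddings response (values invented).
const records = [
  { text: 'cat', embedding: [0.9, 0.1, 0.0] },
  { text: 'kitten', embedding: [0.8, 0.2, 0.1] },
  { text: 'car', embedding: [0.0, 0.1, 0.9] },
];

console.log(topK(records, [1.0, 0.0, 0.0], 2).map((r) => r.text)); // [ 'cat', 'kitten' ]
```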

4. Enhancing and Scaling

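Security: Protect the endpoints with API keys, HTTPS, and rate limiting. As written, anyone who can reach the server can store embeddings, and the canister's storeEmbedding method accepts calls from any principal.
Performance: getEmbeddings returns every stored record, so retrieval cost grows with the dataset. Add an index or a vector search step so that queries return only the most relevant embeddings.
Scaling: Shard large embedding collections across multiple canisters for horizontal scaling.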
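
As one illustration of the security point, rate limiting can be sketched as a small Express-style middleware. This is a simplified fixed-window limiter under stated assumptions: the createRateLimiter name and its options are hypothetical, clients are identified by req.ip, and state lives in process memory. A maintained package would be a better fit for production.

```javascript
// Fixed-window rate limiter written as Express-style middleware.
// `now` is injectable so the behaviour can be checked without waiting.
function createRateLimiter({ limit, windowMs, now = () => Date.now() }) {
  const windows = new Map(); // client key -> { start, count }

  return function rateLimit(req, res, next) {
    const key = req.ip;
    const current = now();
    const entry = windows.get(key);

    if (!entry || current - entry.start >= windowMs) {
      // First request from this client, or the previous window has expired.
      windows.set(key, { start: current, count: 1 });
      return next();
    }
    if (entry.count < limit) {
      entry.count += 1;
      return next();
    }
    // Limit reached for the current window.
    res.status(429).send('Too many requests.');
  };
}
```

It could be registered with app.use(createRateLimiter({ limit: 60, windowMs: 60000 })) before the routes. The Map never evicts old clients and is not shared between processes, so treat this as a starting point only.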

Closing Thoughts

By combining Motoko’s decentralized, persistent storage capabilities with Node.js’s ease of building APIs, this tutorial showcases a practical system for storing and retrieving embeddings. This setup is modular and can be enhanced with additional features like filtering, vector similarity search, or integration with frontend systems.

If you have any questions or ideas for expanding this system, feel free to reach out! Let’s build scalable, efficient solutions together. 🚀

#Motoko #NodeJS #AI #InternetComputer #Tutorial #SoftwareDevelopment
