from typing import List, Union

from relik.inference.data.splitters.base_sentence_splitter import BaseSentenceSplitter


class WindowSentenceSplitter(BaseSentenceSplitter):
    """
    A :obj:`WindowSentenceSplitter` that splits a text into windows of a given size.
    """

    def __init__(self, window_size: int, window_stride: int, *args, **kwargs) -> None:
        super().__init__()
        self.window_size = window_size
        self.window_stride = window_stride

    def __call__(
        self,
        texts: Union[str, List[str], List[List[str]]],
        is_split_into_words: bool = False,
        **kwargs,
    ) -> Union[List[str], List[List[str]]]:
        """
        Tokenize the input into single words using SpaCy models.

        Args:
            texts (:obj:`str`, :obj:`List[str]`, :obj:`List[List[str]]`):
                Text to tag. It can be a single string, a batch of string and pre-tokenized strings.

        Returns:
            :obj:`List[List[str]]`: The input doc split into sentences.
        """
        return self.split_sentences(texts)

    def split_sentences(self, text: Union[str, List], *args, **kwargs) -> List[List]:
        """
        Splits `text` into windows of tokens.

        Args:
            text (:obj:`str` or :obj:`List`):
                Text to split. A plain string is whitespace-tokenized first.

        Returns:
            :obj:`List[List]`: The input text split into windows.
        """

        if isinstance(text, str):
            text = text.split()
        sentences = []
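        # Slide a fixed-size window over the tokens, advancing by
        # `window_stride` tokens at each step; consecutive windows overlap
        # whenever the stride is smaller than the window size.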
        for i in range(0, len(text), self.window_stride):
            # if the last window would run past the end of the text, shift it
            # back so it can include more tokens from the previous window.
            if i != 0 and i + self.window_size > len(text):
                overflowing_tokens = i + self.window_size - len(text)
                if overflowing_tokens >= self.window_stride:
                    break
                i -= overflowing_tokens
            involved_token_indices = list(
                range(i, min(i + self.window_size, len(text)))
            )
            window_tokens = [text[j] for j in involved_token_indices]
            sentences.append(window_tokens)
        return sentences
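
# Minimal usage sketch, not part of the original module: the sample text and
# the window_size/window_stride values below are illustrative assumptions.
if __name__ == "__main__":
    splitter = WindowSentenceSplitter(window_size=5, window_stride=3)
    tokens = "the quick brown fox jumps over the lazy dog again".split()
    for window in splitter(tokens):
        print(window)
    # With 10 tokens, window_size=5 and window_stride=3, this prints
    # tokens[0:5], tokens[3:8], and a final window shifted back to
    # tokens[5:10] so that it keeps the full window size.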