Spaces: LLM360/TxT360
victormiller committed on Sep 26, 2024

Commit be782f3 (verified) · 1 Parent(s): e04322e

Update web.py

Files changed (1)

web.py CHANGED

@@ -708,25 +708,19 @@ def web_data():
         P("""
         There is minimal variation among existing pipeline implementations. We simply compute the mean word length as follows:
         """),
-        Div(
-            Code("""
                 words = text.split()
                 word_count = len(words)
                 character_count = sum(len(word) for word in words)
                 mean_word_length = character_count / word_count
-            """),
-            cls="code-block",
-        ),
         P("""
         It's worth noting that Dolma used the median word length instead of the mean in their codes.
         """),
-        Div(
-            Code("""
                 from statistics import median
                 median_word_length = median(len(word) for word in words)
-            """),
-            cls="code-block",
-        ),
         H5("Number of Sentences"),
         P("""
         The only publicly available implementation of this quality signal is from RedPajama V2, which uses regular expressions

         P("""
         There is minimal variation among existing pipeline implementations. We simply compute the mean word length as follows:
         """),
+        D_code("""
                 words = text.split()
                 word_count = len(words)
                 character_count = sum(len(word) for word in words)
                 mean_word_length = character_count / word_count
+            """, block="block", language="python"),
         P("""
         It's worth noting that Dolma used the median word length instead of the mean in their codes.
         """),
+        D_code("""
                 from statistics import median
                 median_word_length = median(len(word) for word in words)
+            """, block="block", language="python"),
         H5("Number of Sentences"),
         P("""
         The only publicly available implementation of this quality signal is from RedPajama V2, which uses regular expressions
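Note on the snippets displayed in this hunk: as written, the mean computation divides by `word_count`, which is zero for empty or whitespace-only text. The following is a minimal sketch of both statistics with that case guarded. The function name `word_length_stats` and the choice to return zeros for empty input are assumptions made for illustration; neither comes from web.py or from the pipelines it describes.

```python
from statistics import median


def word_length_stats(text: str) -> dict:
    """Mean and median length of whitespace-separated words in `text`."""
    words = text.split()
    if not words:
        # Assumption: report zeros for empty input instead of raising
        # ZeroDivisionError (mean) or StatisticsError (median).
        return {"mean_word_length": 0.0, "median_word_length": 0.0}

    lengths = [len(word) for word in words]
    return {
        # Same quantity as character_count / word_count in the snippet above.
        "mean_word_length": sum(lengths) / len(lengths),
        # Median variant, as the page attributes to Dolma.
        "median_word_length": median(lengths),
    }
```

For example, `word_length_stats("a bb ccc")` gives a mean and a median of 2 for word lengths 1, 2, and 3.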