find ./csvs -type f -name "*.csv" -exec tail -n +2 {} \; | tr -cd '§' | wc -c | awk '{ print int($1 / 2) }'
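Counts the rows in the CSVs, skipping each file's header line. The result equals the number of labels we have. The pipeline keeps only the § characters and counts their bytes; in UTF-8 each § takes two bytes, so the byte count is halved.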
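A possible cross-check for the count above, as a minimal sketch: grep -o prints every § match on its own line, so wc -l counts the characters directly without relying on their byte width. The function name count_delims is hypothetical and not part of the original notes. If a row holds exactly one §, the result is the row count; with more than one § per row, divide accordingly.

```shell
# count_delims DIR (hypothetical helper): count § characters in all CSV rows
# under DIR, skipping the first (header) line of every file.
count_delims() {
  find "$1" -type f -name "*.csv" -exec tail -n +2 {} \; | grep -o '§' | wc -l
}

# Usage: count_delims ./csvs
```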


find ./txts -type f -name "*.txt" -exec cat {} \; | wc -w
Counts the total number of words across all .txt files.


The first version of the dataset contains 1,842,816 words and 264 labels. Dividing the word count by the label count (1,842,816 / 264 ≈ 6,980) means that each text is approximately 7,000 words long on average.
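The average can be recomputed whenever the counts change. A minimal sketch, assuming the two totals are passed in by hand; the function name avg_words is hypothetical:

```shell
# avg_words WORDS LABELS (hypothetical helper): print the average number of
# words per label, rounded to the nearest integer.
avg_words() {
  awk -v words="$1" -v labels="$2" 'BEGIN { printf "%.0f\n", words / labels }'
}

avg_words 1842816 264   # prints 6980
```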


find ./txts -type f -name "*.txt" -exec awk 'length($0) > 20 {gsub(/[^[:alnum:]]/, " "); for (i=1; i<=NF; i++) if (length($i) > 20) print FILENAME ":", $i}' {} \;
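Prints the file name and every word longer than 20 characters. Non-alphanumeric characters are replaced with spaces first, so a word here is an unbroken run of letters and digits.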

find ./txts -type f -name "*.txt" -exec awk 'length($0) > 20 {gsub(/[^[:alnum:]]/, " "); for (i=1; i<=NF; i++) if (length($i) > 20) count++} END {if (count > 0) print FILENAME ":", count}' {} \;
Prints, for each file, how many words are longer than 20 characters (treated as broken words). Files without such words produce no output.
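To get a single dataset-wide figure, the per-file counts can be summed. A minimal sketch that reuses the per-file command and adds up the last field of each output line; the function name total_long_words is hypothetical:

```shell
# total_long_words DIR (hypothetical helper): sum the per-file counts of words
# longer than 20 characters over all .txt files under DIR.
total_long_words() {
  find "$1" -type f -name "*.txt" -exec awk '
    length($0) > 20 {
      gsub(/[^[:alnum:]]/, " ")                       # keep only letters and digits
      for (i = 1; i <= NF; i++) if (length($i) > 20) count++
    }
    END { if (count > 0) print FILENAME ":", count }  # one line per affected file
  ' {} \; |
    awk '{ total += $NF } END { print total + 0 }'    # last field is the count
}

# Usage: total_long_words ./txts
```

Because awk runs once per file, the count restarts for each file, and the second awk only has to add up the numbers.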