find ./csvs -type f -name "*.csv" -exec tail -n +2 {} \; | tr -cd '§' | wc -c | awk '{ print int($1 / 2) }'
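Counts the rows in the CSVs, which equals the number of labels we have. The header line of each file is skipped, and the byte count of the § delimiters is halved because § takes two bytes in UTF-8.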
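As a sanity check, the § count can be tried on a tiny made-up CSV. The sketch below wraps the same pipeline in a hypothetical `count_labels` function. The sample file, its column names, and the assumption of exactly one § per row are invented for illustration.

```shell
# Hypothetical helper: count the § delimiters in every CSV under a directory,
# skipping the header line of each file.
count_labels() {
    find "$1" -type f -name "*.csv" -exec tail -n +2 {} \; \
        | tr -cd '§' \
        | wc -c \
        | awk '{ print int($1 / 2) }'   # § is two bytes in UTF-8
}

# Made-up sample: one header line plus two rows, one § per row (assumption).
sample_dir=$(mktemp -d)
printf 'label§text\nA§first text\nB§second text\n' > "$sample_dir/sample.csv"
count_labels "$sample_dir"   # prints 2
rm -rf "$sample_dir"
```

If a row ever contained more than one §, the result would overcount, so the figure is only as reliable as the one-delimiter-per-row assumption.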
find ./txts -type f -name "*.txt" -exec cat {} \; | wc -w
Counts the total number of words in the .txt files.
The first version of the dataset contains 1,842,816 words and 264 labels, so each text is about 7,000 words long on average (1,842,816 / 264 ≈ 6,980).
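The average can be recomputed from the two totals with shell arithmetic. This is only a sketch; the variable names are made up, and the figures are the ones quoted for the first version of the dataset.

```shell
# Totals quoted for the first version of the dataset.
total_words=1842816
total_labels=264

# Integer division gives the average number of words per label.
avg_words=$(( total_words / total_labels ))
echo "$avg_words"   # prints 6980
```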
find ./txts -type f -name "*.txt" -exec awk 'length($0) > 20 {gsub(/[^[:alnum:]]/, " "); for (i=1; i<=NF; i++) if (length($i) > 20) print FILENAME ":", $i}' {} \;
Prints each file name together with every word in it that is longer than 20 characters.
find ./txts -type f -name "*.txt" -exec awk 'length($0) > 20 {gsub(/[^[:alnum:]]/, " "); for (i=1; i<=NF; i++) if (length($i) > 20) count++} END {if (count > 0) print FILENAME ":", count}' {} \;
Prints how many broken words (words longer than 20 characters) there are in each file. Files without any are omitted.
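With `-exec ... \;`, find starts one awk process per file. A possible variant, sketched below, batches the files with `{} +` and keeps one counter per `FILENAME` in an awk array. The function name `count_long_words` is hypothetical, and the order of the output lines is not guaranteed.

```shell
# Hypothetical helper: report, per file, how many tokens exceed 20 characters.
count_long_words() {
    find "$1" -type f -name "*.txt" -exec awk '
        length($0) > 20 {
            gsub(/[^[:alnum:]]/, " ")   # turn punctuation into separators
            for (i = 1; i <= NF; i++)
                if (length($i) > 20) count[FILENAME]++
        }
        END { for (f in count) print f ":", count[f] }
    ' {} +
}

count_long_words ./txts
```

Files with no long tokens never get an array entry, so they are left out of the report just as in the per-file command.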