big data related cheatsheet (part i)

use a for loop to increase a dataset size

A bash loop that inflates a dataset in Parquet format by copying an existing part file under new part numbers.

$ for i in {10..20}; do cp ./part-r-00000-...snappy.parquet* ./part-r-000$i-...snappy.parquet; done

This adds 11 duplicate part files (numbered 10 through 20) to the same dataset folder.
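One caveat with the loop above: if the trailing `*` matches more than one file, `cp` refuses to run because the target is not a directory. A minimal sketch of a stricter variant follows, assuming the part files are named `part-r-NNNNN<suffix>`. The helper name `duplicate_part` and its arguments are hypothetical, introduced only for illustration.

```shell
#!/usr/bin/env bash
# duplicate_part is a hypothetical helper, not part of any tool.
# Usage: duplicate_part <source_file> <first_index> <last_index>
# Indices are plain decimal integers without leading zeros (e.g. 10 and 20).
duplicate_part() {
  local src="$1" first="$2" last="$3"
  local dir name suffix i
  dir="$(dirname "$src")"
  name="$(basename "$src")"
  # Strip the "part-r-" prefix plus the 5-character part number;
  # whatever follows is reused unchanged for every copy.
  suffix="${name#part-r-?????}"
  for ((i = first; i <= last; i++)); do
    # Zero-pad the new part number to five digits, e.g. 10 -> 00010.
    cp -- "$src" "$dir/part-r-$(printf '%05d' "$i")$suffix"
  done
}
```

Called as `duplicate_part ./part-r-00000-<rest of the file name> 10 20`, it should produce the same eleven copies as the one-liner while always copying from exactly one source file.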

use du to see files greater than a threshold size

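You may also want to order by size, to easily find the biggest ones. Because du -h prints human-readable sizes (such as 1.5G or 20G), pipe the output through sort -h; sort -n would compare only the leading numbers and ignore the unit suffix.

$ sudo du -h --threshold=1G / | sort -h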

(--threshold is a GNU du option, so this works on Ubuntu/Mint and other Linux distributions that ship GNU coreutils)
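Where `du` has no `--threshold` option, the same filtering can be approximated with `du -k` and `awk`. This is a rough sketch rather than a drop-in replacement: `big_dirs` is a hypothetical helper name, and the threshold is given in KiB instead of with a unit suffix.

```shell
#!/usr/bin/env bash
# big_dirs is a hypothetical helper, not a standard command.
# Usage: big_dirs <threshold_in_KiB> <path>
big_dirs() {
  local threshold_kib="$1" path="$2"
  # du -k prints "<size_in_KiB> <path>" for every directory under <path>.
  # awk keeps only the rows whose size is at or above the threshold.
  # sort -n is enough here because the sizes are plain integers.
  du -k "$path" 2>/dev/null \
    | awk -v min="$threshold_kib" '$1 + 0 >= min + 0' \
    | sort -n
}

# Example: directories under /var that use at least 1 GiB (1048576 KiB).
# big_dirs 1048576 /var
```

The largest directories end up at the bottom of the output, which keeps them visible when the list scrolls past in a terminal.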

Checking listening ports

$ sudo netstat -plnt
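To narrow that listing down to a single port, the output can be piped through a small filter. The sketch below assumes the local address sits in the fourth column, as it does in `netstat -plnt` output; `filter_port` is a hypothetical helper name. The same filter should also work on `ss -plnt` output, which uses a similar column layout.

```shell
#!/usr/bin/env bash
# filter_port is a hypothetical helper: it reads `netstat -plnt` output on stdin
# and prints only the rows whose local port equals the given port.
# Usage: sudo netstat -plnt | filter_port <port>
filter_port() {
  awk -v port="$1" '{
    # Column 4 is the local address, e.g. 0.0.0.0:8080 or :::22;
    # the port is whatever follows the last colon.
    n = split($4, parts, ":")
    if (parts[n] == port) print
  }'
}

# Example: show which process, if any, is listening on TCP port 8080.
# sudo netstat -plnt | filter_port 8080
```

Matching on the port field avoids the false positives a plain `grep 8080` would give, such as a PID or another port that happens to contain the same digits.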