Shard the output of any process for distributed processing

Shardz

Shardz is a lightweight C utility that shards (splits) the output of any process for distributed processing. It lets you distribute a workload across multiple processes or machines by splitting a line-based input stream into evenly sized, interleaved shards.

Use Cases

Distributing large datasets across multiple workers
Parallel processing of log files
Load balancing input streams
Splitting any line-based input for distributed processing

Building & Installation

Quick Build

gcc -o shardz shardz.c

Using Make

# Build only
make

# Build and install system-wide (requires root/sudo)
sudo make install

# To uninstall
sudo make uninstall

Usage

some_command | shardz INDEX/TOTAL

Where:

INDEX is the shard number (starting from 1)
TOTAL is the total number of shards
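The INDEX/TOTAL argument can be checked before a long pipeline is started. The sketch below is a hypothetical helper named valid_shard_spec; it is not part of shardz, and how shardz itself reacts to a malformed argument is not covered here. It assumes a POSIX shell and accepts only INDEX/TOTAL with 1 <= INDEX <= TOTAL.

```shell
# Hypothetical helper, not part of shardz.
# Succeeds only when the argument is INDEX/TOTAL with 1 <= INDEX <= TOTAL.
# The body runs in a subshell so its variables do not leak into the caller.
valid_shard_spec() (
	spec="$1"
	# Reject arguments without a slash
	case "$spec" in
		*/*) ;;
		*) exit 1 ;;
	esac
	index="${spec%%/*}"   # text before the first slash
	total="${spec#*/}"    # text after the first slash
	# Both parts must be non-empty and contain digits only
	case "$index" in ''|*[!0-9]*) exit 1 ;; esac
	case "$total" in ''|*[!0-9]*) exit 1 ;; esac
	[ "$index" -ge 1 ] && [ "$index" -le "$total" ]
)

# Try a few arguments
for spec in 1/3 3/3 0/3 4/3 x/3; do
	if valid_shard_spec "$spec"; then
		echo "$spec: valid"
	else
		echo "$spec: invalid"
	fi
done
```

A wrapper script could run valid_shard_spec "$1" || exit 1 before handing the argument to shardz.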

Examples

Let's say you have a very large list of domains and you want to do recon on each domain. Using a single machine, this could take a very long time. However, you can split the workload across multiple machines:

Machine number 1 would run:

curl https://example.com/datasets/large_domain_list.txt | shardz 1/3 | httpx -title -ip -tech-detect -json -o shard-1.json

Machine number 2 would run:

curl https://example.com/datasets/large_domain_list.txt | shardz 2/3 | httpx -title -ip -tech-detect -json -o shard-2.json

Machine number 3 would run:

curl https://example.com/datasets/large_domain_list.txt | shardz 3/3 | httpx -title -ip -tech-detect -json -o shard-3.json

How It Works

Shardz uses a modulo operation to determine which lines should be processed by each shard. For example, with 3 total shards:

Shard 1 processes lines 1, 4, 7, 10, ...
Shard 2 processes lines 2, 5, 8, 11, ...
Shard 3 processes lines 3, 6, 9, 12, ...

This spreads the lines evenly across all shards: no shard receives more than one line more than any other.
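The assignment listed above can be written as a single arithmetic expression. This is a minimal sketch of the mapping as described in this README, not code taken from shardz.c; the function name shard_for_line is made up for illustration, and both line numbers and shard numbers are assumed to start at 1.

```shell
# Hypothetical helper: shard_for_line LINE TOTAL
# Prints the 1-based shard that handles the given 1-based line number.
shard_for_line() {
	echo $(( ($1 - 1) % $2 + 1 ))
}

# Reproduce the 3-shard assignment listed above for the first few lines
for line in 1 2 3 4 5 6 7; do
	echo "line $line -> shard $(shard_for_line "$line" 3)"
done
```

Subtracting 1 before the modulo and adding 1 afterwards keeps both numbers 1-based, so the last shard (INDEX equal to TOTAL) receives the lines whose number is a multiple of TOTAL.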

Simplicity

For what it's worth, the same functionality can be achieved with a bash function in your .bashrc:

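shardz() {
	# Split the single INDEX/TOTAL argument (e.g. 1/3) into its two parts
	local n="${1%/*}" t="${1#*/}"
	# n % t maps the last shard (INDEX == TOTAL) to remainder 0
	awk -v n="$n" -v t="$t" 'NR % t == n % t'
}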

cat domains.txt | shardz 1/3 | httpx -title -ip -tech-detect -json -o shard-1.json
cat domains.txt | shardz 2/3 | httpx -title -ip -tech-detect -json -o shard-2.json
cat domains.txt | shardz 3/3 | httpx -title -ip -tech-detect -json -o shard-3.json
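Whichever variant is used, the binary or a shell function, it can be reassuring to confirm that the shards together cover the input exactly once. The sketch below is a hypothetical helper named check_partition; it assumes a POSIX shell with mktemp, sort and cmp available, and that the command under test reads lines on stdin and takes INDEX/TOTAL as its only argument.

```shell
# Hypothetical helper: check_partition CMD TOTAL FILE
# CMD is any command or shell function that reads lines on stdin and takes
# INDEX/TOTAL as its only argument (the usage shown in this README).
# Returns 0 if the shards together contain every input line exactly once.
check_partition() (
	cmd="$1"; total="$2"; file="$3"
	tmpdir="$(mktemp -d)" || exit 1
	: > "$tmpdir/merged"
	# Append the output of every shard to one file
	i=1
	while [ "$i" -le "$total" ]; do
		"$cmd" "$i/$total" < "$file" >> "$tmpdir/merged" || { rm -rf "$tmpdir"; exit 1; }
		i=$((i + 1))
	done
	# Compare sorted copies: any lost or duplicated line makes them differ
	sort "$file" > "$tmpdir/expected"
	sort "$tmpdir/merged" > "$tmpdir/actual"
	if cmp -s "$tmpdir/expected" "$tmpdir/actual"; then status=0; else status=1; fi
	rm -rf "$tmpdir"
	exit "$status"
)
```

For example, check_partition shardz 3 domains.txt && echo "complete" would print complete only if shards 1/3, 2/3 and 3/3 together reproduce domains.txt.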

This was just a fun little project to brush up on my C and to explore what it takes to get a package added to Linux package manager repositories.