4.7 KiB
Rules for handling Data Leaks
This normalization framework is designed to standardize data leaks for Eleakxir, the open-source search engine, using leak-utils, a dedicated CLI tool that converts and cleans files for efficient indexing and searching.
The Relevance of Parquet for Data Leaks
Parquet is an efficient, open-source columnar storage file format designed to handle complex data in bulk. When dealing with data leaks, its choice is highly relevant for several reasons:
- Compression: Parquet files offer superior compression compared to row-based formats like CSV. By storing data column by column, it applies more effective compression algorithms, which significantly reduces disk space. For data leaks, where file sizes can range from gigabytes to terabytes, this is crucial for minimizing storage costs.
- Query Performance: As a columnar format, Parquet allows you to read only the specific columns you need for a query. In a data leak, you might only be interested in emails and passwords, not full addresses or phone numbers. This selective reading drastically speeds up search operations, as the system doesn't have to scan through entire rows of irrelevant data.
- Efficiency: The format is optimized for analytics. It stores data with metadata and statistics (min/max values) for each column, allowing for query pruning. This means a query can skip entire blocks of data that don't match the filtering criteria, boosting performance even further.
Disclaimer
The information in this document is provided for research and educational purposes only. I am not responsible for how this data, methods, or guidelines are used. Any misuse, unlawful activity, or harm resulting from applying this content is the sole responsibility of the individual or organization using it.
File Naming Convention
-
Lowercase only, ASCII (no accents).
-
Separators:
_inside blocks (date_2023_10)-between blocks (instagram.com-date_2023_10)
-
Prefix: always start with the source name or url (e.g.,
instagram.com,alien_txt). (for url, replace all-by_) -
Blocks: each additional part must be prefixed by its block name:
date_YYYY[_MM[_DD]]→ use ISO format (year, or year-month, or full date).source_*→ origin of the leak (e.g.,scrape,dump,combo).version_v*→ versioning if regenerated or transformed.notes_*→ optional clarifications.
-
Extension: always
.parquet.
Recommended pattern:
{source}-date_{YYYY[_MM[_DD]]}-source_{origin}-version_{vN}-notes_{info}.parquet
Examples:
instagram.com-date_2023_10.parquet
alien_txt-date_2022-source_dump.parquet
combo_french-notes_crypto.parquet
Column Naming Convention
-
snake_case only (lowercase,
_separator). -
No dots (
.) in column names (husband.phone→husband_phone). -
Allowed characters:
[a-z0-9_]+(no spaces, hyphens, or accents). -
Multiple variants of the same field:
- Relations → prefix clearly:
husband_phone,mother_last_name. - Multiples of the same type → numbered prefix:
1_phone,2_phone,3_phone. - Always end with the column "type" (e.g.,
_phone,_last_name).
- Relations → prefix clearly:
-
Rename if mislabeled: If a
usernamecolumn actually contains only emails rename it toemail. -
Remove irrelevant columns: Drop meaningless identifiers like
idor fields with no analytical value. -
Standard columns: to enable schema alignment across leaks:
Column email username password password_hash phone date birth_date age first_name last_name full_name address city country state postal_code ip url city
Standard Column Formatting
-
Email: lowercase, trimmed, keep only
[^a-z0-9._@-]. -
Phone: keep only
[^0-9] -
Names:
- Keep
first_name/last_nameif present. - Generate
full_name = CONCAT(first_name, ' ', last_name). - If only
nameexists, rename it tofull_name.
- Keep
-
Passwords:
- Hashes →
password_hash. - Plaintext →
password. - Never mix hashes and plaintext in the same column.
- Hashes →
-
NULLs: always use SQL
NULL(never""or"NULL").
Deduplication
Deduplication is often impractical at scale (billions of rows). Do not attempt to deduplicate at ingestion time. Instead, handle deduplication after running a search to optimize performance and storage.