Rules for handling Data Leaks

This normalization framework is designed to standardize data leaks for Eleakxir, the open-source search engine, using leak-utils, a dedicated CLI tool that converts and cleans files for efficient indexing and searching.

The Relevance of Parquet for Data Leaks

Parquet is an open-source columnar storage file format designed to handle complex data in bulk. It is well suited to data leaks for several reasons:

  • Compression: Parquet files usually compress much better than row-based formats like CSV. Because data is stored column by column, similar values sit next to each other, so encoding and compression work more effectively and disk usage drops significantly. For data leaks, where file sizes can range from gigabytes to terabytes, this is crucial for minimizing storage costs.
  • Query Performance: As a columnar format, Parquet allows you to read only the specific columns you need for a query. In a data leak, you might only be interested in emails and passwords, not full addresses or phone numbers. This selective reading drastically speeds up search operations, as the system doesn't have to scan through entire rows of irrelevant data.
  • Efficiency: The format is optimized for analytics. It stores metadata and statistics (such as min/max values) for each column within each block of rows, which allows pruning at query time: a query can skip entire blocks whose statistics show they cannot match the filtering criteria, boosting performance even further.
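
To make the pruning idea concrete, here is a toy model in plain Python. It is not how Parquet is implemented and does not read Parquet files; the `RowGroup` class and `search` function are hypothetical names used only to show how per-block min/max statistics let a lookup skip blocks without scanning them.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a block of rows: one column plus its min/max statistics."""
    values: list

    @property
    def min_value(self):
        return min(self.values)

    @property
    def max_value(self):
        return max(self.values)


def search(row_groups, target):
    """Return (matches, groups_scanned), skipping groups whose range excludes target."""
    matches, scanned = [], 0
    for group in row_groups:
        if target < group.min_value or target > group.max_value:
            continue  # the statistics prove the value cannot be in this group
        scanned += 1
        matches.extend(v for v in group.values if v == target)
    return matches, scanned


groups = [
    RowGroup(["alice@example.com", "bob@example.com"]),
    RowGroup(["carol@example.com", "dave@example.com"]),
    RowGroup(["erin@example.com", "frank@example.com"]),
]
matches, scanned = search(groups, "carol@example.com")
print(matches, scanned)
```

In this sketch only one of the three groups is scanned. Pruning works best when the data is sorted or clustered on the searched column, because the min/max ranges of different blocks then overlap less.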

Disclaimer

The information in this document is provided for research and educational purposes only. I am not responsible for how the data, methods, or guidelines described here are used. Any misuse, unlawful activity, or harm resulting from applying this content is the sole responsibility of the individual or organization using it.

File Naming Convention

  • Lowercase only, ASCII (no accents).

  • Separators:

    • _ inside blocks (date_2023_10)
    • - between blocks (instagram.com-date_2023_10)
  • Prefix: always start with the source name or URL (e.g., instagram.com, alien_txt). For a URL, replace every - with _, since - is reserved as the separator between blocks.

  • Blocks: each additional part must be prefixed by its block name:

    • date_YYYY[_MM[_DD]] → use ISO ordering (year, or year and month, or full date), with _ in place of -.
    • source_* → origin of the leak (e.g., scrape, dump, combo).
    • version_v* → versioning if regenerated or transformed.
    • notes_* → optional clarifications.
  • Extension: always .parquet.

Recommended pattern:

{source}-date_{YYYY[_MM[_DD]]}-source_{origin}-version_{vN}-notes_{info}.parquet

Examples:

instagram.com-date_2023_10.parquet
alien_txt-date_2022-source_dump.parquet
combo_french-notes_crypto.parquet
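
The pattern above can be sketched as a small helper. This is only an illustration, not part of leak-utils: `build_leak_filename` and its parameters are hypothetical names, and the choice to allow dots in a part (so that domains like instagram.com pass validation) is an assumption drawn from the examples.

```python
import re

DATE_RE = re.compile(r"\d{4}(_\d{2}(_\d{2})?)?")  # YYYY, YYYY_MM or YYYY_MM_DD
PART_RE = re.compile(r"[a-z0-9_.]+")              # assumption: dots allowed for domains


def _check(part):
    """Reject anything that is not lowercase ASCII, digits, '_' or '.'."""
    if not PART_RE.fullmatch(part):
        raise ValueError(f"invalid characters in {part!r}")
    return part


def build_leak_filename(source, date=None, origin=None, version=None, notes=None):
    """Assemble {source}-date_...-source_...-version_vN-notes_....parquet."""
    # '-' is the block separator, so it is replaced by '_' inside the prefix.
    parts = [_check(source.lower().replace("-", "_"))]
    if date is not None:
        if not DATE_RE.fullmatch(date):
            raise ValueError(f"date must look like YYYY[_MM[_DD]], got {date!r}")
        parts.append(f"date_{date}")
    if origin is not None:
        parts.append(f"source_{_check(origin)}")
    if version is not None:
        parts.append(f"version_v{int(version)}")
    if notes is not None:
        parts.append(f"notes_{_check(notes)}")
    return "-".join(parts) + ".parquet"


print(build_leak_filename("instagram.com", date="2023_10"))
print(build_leak_filename("alien_txt", date="2022", origin="dump"))
print(build_leak_filename("combo_french", notes="crypto"))
```

The three calls reproduce the example names listed above. Blocks that are not supplied are simply omitted, which matches the examples, where only some blocks appear.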

Column Naming Convention

  • snake_case only (lowercase, _ separator).

  • No dots (.) in column names (husband.phone → husband_phone).

  • Allowed characters: [a-z0-9_]+ (no spaces, hyphens, or accents).

  • Multiple variants of the same field:

    • Relations → prefix clearly: husband_phone, mother_last_name.
    • Multiples of the same type → numbered prefix: 1_phone, 2_phone, 3_phone.
    • Always end with the column "type" (e.g., _phone, _last_name).
  • Rename if mislabeled: if a username column actually contains only emails, rename it to email.

  • Remove irrelevant columns: Drop meaningless identifiers like id or fields with no analytical value.

  • Standard columns: use these names wherever they apply, to enable schema alignment across leaks:

    Column
    email
    username
    password
    password_hash
    phone
    date
    birth_date
    age
    first_name
    last_name
    full_name
    address
    city
    country
    state
    postal_code
    ip
    url
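
The character rules above can be sketched as a renaming helper. This is a minimal illustration, not the leak-utils implementation: `normalize_column_name` is a hypothetical name, and splitting camelCase and stripping accents (rather than rejecting such names) are assumptions that go beyond the rules as written.

```python
import re
import unicodedata


def normalize_column_name(name):
    """Convert an arbitrary header into snake_case limited to [a-z0-9_]."""
    # Strip accents: decompose characters, then drop the non-ASCII marks.
    name = unicodedata.normalize("NFKD", name.strip())
    name = name.encode("ascii", "ignore").decode("ascii")
    # Assumption: split camelCase so firstName becomes first_name.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    name = name.lower()
    # Dots, spaces, hyphens and any other disallowed characters become '_'.
    name = re.sub(r"[^a-z0-9_]+", "_", name)
    # Collapse repeated underscores and trim them from both ends.
    return re.sub(r"_+", "_", name).strip("_")


for raw in ["husband.phone", "Mother Last-Name", "firstName", "1_phone"]:
    print(raw, "->", normalize_column_name(raw))
```

Renaming mislabeled columns and dropping irrelevant ones still require looking at the data itself, so they are left out of this sketch.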

Standard Column Formatting

  • Email: lowercase, trimmed, keep only [a-z0-9._@-] (remove anything matching [^a-z0-9._@-]).

  • Phone: keep only digits [0-9] (remove anything matching [^0-9]).

  • Names:

    • Keep first_name / last_name if present.
    • Generate full_name = CONCAT(first_name, ' ', last_name).
    • If only name exists, rename it to full_name.
  • Passwords:

    • Hashes → password_hash.
    • Plaintext → password.
    • Never mix hashes and plaintext in the same column.
  • NULLs: always use SQL NULL (never "" or "NULL").
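
The formatting rules above can be sketched with a few plain-Python helpers. All function names here are hypothetical, Python `None` stands in for SQL NULL, and two choices are assumptions rather than stated rules: "NULL" is matched case-insensitively, and `build_full_name` keeps whichever name part exists instead of returning NULL when one is missing.

```python
import re

NULL_STRINGS = {"", "null"}  # assumption: "NULL" is matched case-insensitively


def clean_null(value):
    """Map None, empty strings and the literal text NULL to None; trim the rest."""
    if value is None:
        return None
    stripped = value.strip()
    return None if stripped.lower() in NULL_STRINGS else stripped


def normalize_email(value):
    """Lowercase, trim, and drop every character outside [a-z0-9._@-]."""
    value = clean_null(value)
    if value is None:
        return None
    return re.sub(r"[^a-z0-9._@-]", "", value.lower()) or None


def normalize_phone(value):
    """Keep digits only."""
    value = clean_null(value)
    if value is None:
        return None
    return re.sub(r"[^0-9]", "", value) or None


def build_full_name(first_name, last_name):
    """Join the available name parts with a single space."""
    parts = [p for p in (clean_null(first_name), clean_null(last_name)) if p]
    return " ".join(parts) or None


print(normalize_email("  John.Doe@Example.COM "))
print(normalize_phone("+33 (0)6 12-34-56-78"))
print(build_full_name("Ada", "Lovelace"))
```

Note that stripping everything but digits also removes a leading +, so the international prefix survives only as digits.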

Deduplication

Deduplication is often impractical at scale (billions of rows). Do not attempt to deduplicate at ingestion time. Instead, deduplicate the results after running a search, when the set of rows is small, so that ingestion stays fast.
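
Deduplicating after a search can be as simple as the sketch below. It assumes search results arrive as a list of dictionaries keyed by the standard column names; `dedupe_results` and the default key of email plus password are hypothetical choices, not something Eleakxir is confirmed to do.

```python
def dedupe_results(rows, key_columns=("email", "password")):
    """Keep the first row for each distinct combination of key_columns."""
    seen = set()
    unique = []
    for row in rows:
        # Missing columns count as None, so rows from different leaks still compare.
        key = tuple(row.get(col) for col in key_columns)
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


results = [
    {"email": "a@example.com", "password": "hunter2", "city": "Paris"},
    {"email": "a@example.com", "password": "hunter2"},
    {"email": "a@example.com", "password": "letmein"},
]
unique = dedupe_results(results)
print(len(unique))
```

Because only the rows returned by one search are held in memory, the cost stays proportional to the result size rather than to the size of the indexed leaks.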