Hadi 30e5f91ed9 Add url file normalization information

Signed-off-by: Hadi <112569860+anotherhadi@users.noreply.github.com>

2025-09-28 23:07:11 +02:00

4.7 KiB

Rules for handling Data Leaks

This normalization framework is designed to standardize data leaks for Eleakxir, the open-source search engine, using leak-utils, a dedicated CLI tool that converts and cleans files for efficient indexing and searching.

The Relevance of Parquet for Data Leaks

Parquet is an efficient, open-source columnar storage file format designed to handle complex data in bulk. It suits data leaks for several reasons:

Compression: Parquet compresses better than row-based formats such as CSV. Because values are stored column by column, similar values sit next to each other, so compression algorithms work more effectively and disk usage drops significantly. For data leaks, where file sizes can range from gigabytes to terabytes, this is crucial for minimizing storage costs.
Query Performance: As a columnar format, Parquet lets a query read only the columns it needs. In a data leak, you might only be interested in emails and passwords, not full addresses or phone numbers. This selective reading speeds up search operations considerably, because the system does not have to scan entire rows of irrelevant data.
Efficiency: The format is optimized for analytics. It stores metadata and statistics (min/max values) for each column, which lets a query skip entire blocks of data that cannot match its filtering criteria, boosting performance even further.
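The block skipping described above can be illustrated with a toy model. This is a minimal pure-Python sketch, not how Parquet or leak-utils is implemented: RowGroup and search are hypothetical names, and real Parquet files store the min/max statistics in their metadata instead of recomputing them.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for one block of a single column (hypothetical type)."""
    values: list

    @property
    def min(self) -> str:
        # Real Parquet reads this from file metadata; recomputed here for brevity.
        return min(self.values)

    @property
    def max(self) -> str:
        return max(self.values)


def search(groups, target):
    """Return the matching values and how many blocks were actually scanned."""
    matches, scanned = [], 0
    for group in groups:
        # Skip the whole block when the target cannot lie within [min, max].
        if target < group.min or target > group.max:
            continue
        scanned += 1
        matches.extend(v for v in group.values if v == target)
    return matches, scanned


groups = [
    RowGroup(["alice@example.com", "bob@example.com"]),
    RowGroup(["carol@example.com", "dave@example.com"]),
    RowGroup(["erin@example.com", "frank@example.com"]),
]
matches, scanned = search(groups, "dave@example.com")
print(matches, scanned)  # only one of the three blocks is scanned
```

Skipping works best when the data is sorted or clustered on the filtered column, since the min/max ranges of the blocks then overlap less.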

Disclaimer

The information in this document is provided for research and educational purposes only. I am not responsible for how the data, methods, or guidelines described here are used. Any misuse, unlawful activity, or harm resulting from applying this content is the sole responsibility of the individual or organization using it.

File Naming Convention

Lowercase only, ASCII (no accents).
Separators:
- _ inside blocks (date_2023_10)
- - between blocks (instagram.com-date_2023_10)
Prefix: always start with the source name or URL (e.g., instagram.com, alien_txt). For URLs, replace every - with _, because - is reserved as the separator between blocks.
Blocks: each additional part must be prefixed by its block name:
- date_YYYY[_MM[_DD]] → use ISO date ordering with _ as the separator (year only, year and month, or full date).
- source_* → origin of the leak (e.g., scrape, dump, combo).
- version_v* → versioning if regenerated or transformed.
- notes_* → optional clarifications.
Extension: always .parquet.

Recommended pattern:

{source}-date_{YYYY[_MM[_DD]]}-source_{origin}-version_{vN}-notes_{info}.parquet

Examples:

instagram.com-date_2023_10.parquet
alien_txt-date_2022-source_dump.parquet
combo_french-notes_crypto.parquet
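The convention above can be sketched as a small helper. This is an illustration only, not part of leak-utils: build_filename and its parameters are hypothetical names, and treating every character outside the allowed set as _ is an assumption about how to sanitize input.

```python
import re
import unicodedata
from typing import Optional


def _ascii_lower(text: str) -> str:
    """Strip accents and lowercase the text."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def _block_value(text: str) -> str:
    """Keep [a-z0-9_]; any other run of characters becomes '_' (assumption)."""
    return re.sub(r"[^a-z0-9_]+", "_", _ascii_lower(text)).strip("_")


def build_filename(
    prefix: str,
    date: Optional[str] = None,     # "YYYY", "YYYY-MM" or "YYYY-MM-DD"
    origin: Optional[str] = None,   # becomes the source_* block
    version: Optional[int] = None,  # becomes the version_v* block
    notes: Optional[str] = None,    # becomes the notes_* block
) -> str:
    # The prefix may keep dots (instagram.com); '-' is reserved for
    # separating blocks, so it is replaced by '_'.
    blocks = [re.sub(r"[^a-z0-9._]+", "_", _ascii_lower(prefix)).strip("_")]
    if date is not None:
        if not re.fullmatch(r"\d{4}(-\d{2}(-\d{2})?)?", date):
            raise ValueError(f"expected YYYY[-MM[-DD]], got {date!r}")
        blocks.append("date_" + date.replace("-", "_"))
    if origin is not None:
        blocks.append("source_" + _block_value(origin))
    if version is not None:
        blocks.append(f"version_v{version}")
    if notes is not None:
        blocks.append("notes_" + _block_value(notes))
    return "-".join(blocks) + ".parquet"


print(build_filename("instagram.com", date="2023-10"))
print(build_filename("alien_txt", date="2022", origin="dump"))
print(build_filename("combo_french", notes="crypto"))
```

The three calls reproduce the example filenames listed above.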

Column Naming Convention

snake_case only (lowercase, _ separator).
No dots (.) in column names (husband.phone → husband_phone).
Allowed characters: [a-z0-9_]+ (no spaces, hyphens, or accents).
Multiple variants of the same field:
- Relations → prefix clearly: husband_phone, mother_last_name.
- Multiples of the same type → numbered prefix: 1_phone, 2_phone, 3_phone.
- Always end with the column "type" (e.g., _phone, _last_name).
Rename if mislabeled: If a username column actually contains only emails, rename it to email.
Remove irrelevant columns: Drop meaningless identifiers like id or fields with no analytical value.
Standard columns: use the following names wherever they apply, to enable schema alignment across leaks:

Column
email
username
password
password_hash
phone
date
birth_date
age
first_name
last_name
full_name
address
city
country
state
postal_code
ip
url
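The column naming rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the leak-utils implementation: normalize_column and number_duplicates are hypothetical names, and splitting camelCase headers is an extra convenience the rules do not mention.

```python
import re
import unicodedata
from collections import Counter


def normalize_column(name: str) -> str:
    """Convert a raw header to snake_case using only [a-z0-9_]."""
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Assumption: split camelCase boundaries, e.g. "lastName" -> "last_Name".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", ascii_name)
    # Dots, spaces, hyphens and anything else outside [a-z0-9] become "_".
    return re.sub(r"[^a-z0-9]+", "_", spaced.lower()).strip("_")


def number_duplicates(columns):
    """Prefix repeated names with 1_, 2_, ... and leave unique names alone."""
    totals = Counter(columns)
    seen = Counter()
    result = []
    for col in columns:
        if totals[col] > 1:
            seen[col] += 1
            result.append(f"{seen[col]}_{col}")
        else:
            result.append(col)
    return result


raw = ["husband.phone", "Last Name", "Phone", "phone", "E-mail"]
print(number_duplicates([normalize_column(c) for c in raw]))
```

Note that the sketch turns E-mail into e_mail; mapping such variants onto the standard email column is still a manual renaming decision.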

Standard Column Formatting

Email: lowercase, trimmed, keep only characters in [a-z0-9._@-] (remove anything matching [^a-z0-9._@-]).
Phone: keep digits only (remove anything matching [^0-9]).
Names:
- Keep first_name / last_name if present.
- Generate full_name = CONCAT(first_name, ' ', last_name).
- If only name exists, rename it to full_name.
Passwords:
- Hashes → password_hash.
- Plaintext → password.
- Never mix hashes and plaintext in the same column.
NULLs: always use SQL NULL (never "" or "NULL").
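The formatting rules above could be applied per value roughly as follows. This is a hedged sketch with hypothetical function names, not the leak-utils code; Python's None stands in for SQL NULL, and treating "null" case-insensitively and building full_name from a single available part are assumptions.

```python
import re
from typing import Optional


def clean_null(value: Optional[str]) -> Optional[str]:
    """Map empty strings and the literal text 'NULL' to None (SQL NULL)."""
    if value is None:
        return None
    stripped = value.strip()
    # Assumption: "null" in any letter case counts as a missing value.
    if stripped == "" or stripped.upper() == "NULL":
        return None
    return stripped


def normalize_email(value: Optional[str]) -> Optional[str]:
    cleaned = clean_null(value)
    if cleaned is None:
        return None
    # Lowercase, then remove every character outside [a-z0-9._@-].
    return re.sub(r"[^a-z0-9._@-]", "", cleaned.lower()) or None


def normalize_phone(value: Optional[str]) -> Optional[str]:
    cleaned = clean_null(value)
    if cleaned is None:
        return None
    # Keep digits only.
    return re.sub(r"[^0-9]", "", cleaned) or None


def full_name(first: Optional[str], last: Optional[str]) -> Optional[str]:
    # Assumption: if only one part is present, use it alone.
    parts = [p for p in (clean_null(first), clean_null(last)) if p]
    return " ".join(parts) or None


print(normalize_email("  John.Doe@Example.COM "))
print(normalize_phone("+33 (0)6 12-34-56-78"))
print(full_name("Ada", "Lovelace"))
```

Passwords are deliberately left out of the sketch, since trimming or lowercasing them would change the leaked value.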

Deduplication

Deduplication is often impractical at scale (billions of rows), so do not attempt it at ingestion time. Instead, deduplicate the results after running a search, when the set of rows is small enough for it to be cheap.
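Deduplicating after a search can be as simple as the sketch below. It assumes results arrive as a list of dictionaries keyed by column name; dedupe_results and the choice of key columns are hypothetical, not part of Eleakxir or leak-utils.

```python
def dedupe_results(rows, keys=("email", "password")):
    """Drop rows whose values for `keys` were already seen, keeping order."""
    seen = set()
    unique = []
    for row in rows:
        # Missing columns count as None, so rows from different schemas compare.
        fingerprint = tuple(row.get(k) for k in keys)
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        unique.append(row)
    return unique


results = [
    {"email": "a@example.com", "password": "hunter2", "city": "Paris"},
    {"email": "a@example.com", "password": "hunter2"},
    {"email": "a@example.com", "password": "letmein"},
]
unique = dedupe_results(results)
print(len(unique))
```

Keeping the first occurrence is a design choice; a variant could instead merge rows so that extra columns from later duplicates are not lost.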