108 lines
2.9 KiB
Markdown
108 lines
2.9 KiB
Markdown
# 🛠 leak-utils: The Eleakxir Data Utility Toolkit
|
|
|
|
`leak-utils` is a powerful command-line tool built to help you manage, process,
|
|
and optimize data leaks for use with the **Eleakxir** search engine. It provides
|
|
a suite of utilities for data cleaning, format conversion, and file
|
|
manipulation, all designed to ensure your data wells are efficient and
|
|
standardized.
|
|
|
|
`leak-utils` is written in **Go** and leverages **DuckDB** for its
|
|
high-performance in-memory processing, ensuring fast and reliable operations on
|
|
large datasets.
|
|
|
|
## 🚀 Features
|
|
|
|
- **Parquet File Management**: Clean and inspect existing `.parquet` files.
|
|
- **Format Conversion**: Seamlessly convert `.csv`, `.txt`, `.json` files into
|
|
the optimized `.parquet` format.
|
|
- **Schema Uniformity**: Tools designed to help you standardize and normalize
|
|
your data to align with the
|
|
[Eleakxir data leak normalization rules](./DATALEAKS-NORMALIZATION.md). This
|
|
ensures a consistent schema across all your files, which is crucial for
|
|
efficient searching and consistent results.
|
|
- **High Performance**: Built with Go and DuckDB for fast and efficient data
|
|
processing.
|
|
|
|
## ⚙️ How to Use
|
|
|
|
The tool operates via a single executable with different commands, each
|
|
corresponding to a specific action. You can find the executable in the
|
|
`leak-utils` directory of the Eleakxir project.
|
|
|
|
### Install
|
|
|
|
#### With go
|
|
|
|
```bash
|
|
go install "github.com/anotherhadi/eleakxir/leak-utils@latest"
|
|
```
|
|
|
|
#### With Nix/NixOS
|
|
|
|
<details>
|
|
<summary>Click to expand</summary>
|
|
|
|
**From anywhere (using the repo URL):**
|
|
|
|
```bash
|
|
nix run "github:anotherhadi/eleakxir#leak-utils" -- action [--flags value]
|
|
```
|
|
|
|
**Permanent Installation:**
|
|
|
|
```bash
|
|
# add the flake to your flake.nix
|
|
{
|
|
inputs = {
|
|
eleakxir.url = "github:anotherhadi/eleakxir";
|
|
};
|
|
}
|
|
|
|
# then add it to your packages
|
|
environment.systemPackages = with pkgs; [ # or home.packages
|
|
eleakxir.packages.${pkgs.system}.leak-utils
|
|
];
|
|
```
|
|
|
|
</details>
|
|
|
|
### Available Actions
|
|
|
|
#### `cleanParquet`
|
|
|
|
Optimizes and cleans an existing Parquet file. This can be used to change
|
|
columns, clean rows, ...
|
|
|
|
See:
|
|
|
|
```bash
|
|
leak-utils cleanParquet --help
|
|
```
|
|
|
|
#### `infoParquet`
|
|
|
|
Displays metadata and schema information for a given Parquet file. Useful for
|
|
inspecting file structure and column types.
|
|
|
|
#### `csvToParquet`
|
|
|
|
Converts a `.csv` file into a highly compressed and efficient `.parquet` file.
|
|
This is the recommended way to prepare your data for Eleakxir.
|
|
|
|
#### `mergeFiles`
|
|
|
|
Merges multiple files (of the same type) into a single, larger file. This is
|
|
useful for combining smaller data leaks.
|
|
|
|
#### `removeUrlSchemeFromUlp`
|
|
|
|
This utility prevents the colon (`:`) in URL schemes like `https://` from being
|
|
mistakenly parsed as a column separator when processing ULP data in flat files
|
|
like CSV or TXT.
|
|
|
|
## 🤝 Contributing
|
|
|
|
[Contributions](../CONTRIBUTING.md) to `leak-utils` are welcome! Feel free to
|
|
open issues or submit pull requests for new features, bug fixes, or performance
|
|
improvements.
|