utils.file_utils#

Module Contents#

Functions#

expand_outdir_and_mkdir

filter_files_by_extension

Given a list of files, filter it to only include files matching the given extension(s).

get_all_files_paths_under

This function returns a list of all the files under a specified directory.

get_batched_files

This function returns a batch of files that still remain to be processed.

get_remaining_files

This function returns a list of the files that still remain to be read.

merge_counts

mkdir

parse_str_of_num_bytes

remove_path_extension

reshard_jsonl

Reshards a directory of jsonl files to have a new (approximate) file size for each shard.

separate_by_metadata

Saves the dataframe to subfolders named after a metadata value.

write_dataframe_by_meta

write_record

Data#

API#

utils.file_utils.NEMO_CURATOR_HOME#

‘get(…)’

utils.file_utils.expand_outdir_and_mkdir(outdir: str) → str#
utils.file_utils.filter_files_by_extension(
files_list: list[str],
keep_extensions: str | list[str],
) → list[str]#

Given a list of files, filter it to only include files matching the given extension(s).

Args:
    files_list: List of files.
    keep_extensions: A string (e.g., “json”) or a list of strings (e.g., [“json”, “parquet”]) representing which file types to keep from files_list.
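
A minimal usage sketch (not part of the original docstring), assuming the module is importable under the utils.file_utils path shown here; the file names are illustrative:

from utils.file_utils import filter_files_by_extension

files = ["data/a.jsonl", "data/b.parquet", "data/readme.txt"]

# Keep only the jsonl and parquet files; a single string such as "jsonl" also works.
kept = filter_files_by_extension(files, keep_extensions=["jsonl", "parquet"])
print(kept)  # ['data/a.jsonl', 'data/b.parquet']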

utils.file_utils.get_all_files_paths_under(
root: str,
recurse_subdirectories: bool = True,
followlinks: bool = False,
keep_extensions: str | list[str] | None = None,
) → list[str]#

This function returns a list of all the files under a specified directory.

Args:
    root: The path to the directory to read.
    recurse_subdirectories: Whether to recurse into subdirectories. Please note that this can be slow for a large number of files.
    followlinks: Whether to follow symbolic links.
    keep_extensions: A string or list of strings representing a file type or multiple file types to include in the output, e.g., “jsonl” or [“jsonl”, “parquet”].
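
A hedged example of collecting files recursively; the directory path is illustrative:

from utils.file_utils import get_all_files_paths_under

# Recursively collect every .jsonl file under the dataset root.
# recurse_subdirectories=True can be slow when the tree holds many files.
paths = get_all_files_paths_under(
    "/data/my_dataset",
    recurse_subdirectories=True,
    followlinks=False,
    keep_extensions="jsonl",
)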

utils.file_utils.get_batched_files(
input_file_path: str,
output_file_path: str,
input_file_type: str,
batch_size: int = 64,
) → list[list[str]]#

This function returns a batch of files that still remain to be processed.

Args:
    input_file_path: The path of the input files.
    output_file_path: The path of the output files.
    input_file_type: The type of the input files.
    batch_size: The number of files to be processed at once.

Returns:
    A batch of files that are not in the output directory.
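
A sketch of batch-wise processing, relying on the documented behavior that files already present in the output directory are skipped; process() is a hypothetical user-defined step:

from utils.file_utils import get_batched_files

for batch in get_batched_files(
    input_file_path="/data/input",
    output_file_path="/data/output",
    input_file_type="jsonl",
    batch_size=64,
):
    # Each batch is a list of input file paths not yet present in the output.
    process(batch)  # hypothetical processing function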

utils.file_utils.get_remaining_files(
input_file_path: str,
output_file_path: str,
input_file_type: str,
output_file_type: str | None = None,
num_files: int = -1,
) → list[str]#

This function returns a list of the files that still remain to be read.

Args:
    input_file_path: The path of the input files.
    output_file_path: The path of the output files.
    input_file_type: The type of the input files.
    output_file_type: The type of the output files.
    num_files: The max number of files to be returned. If -1, all files are returned.

Returns:
    A list of files that still remain to be read.
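
A minimal sketch of resuming a partially finished run; the paths are illustrative:

from utils.file_utils import get_remaining_files

remaining = get_remaining_files(
    input_file_path="/data/input",
    output_file_path="/data/output",
    input_file_type="jsonl",
    num_files=-1,  # -1 returns all remaining files
)
print(f"{len(remaining)} files left to process")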

utils.file_utils.merge_counts(first: dict, second: dict) → dict#
utils.file_utils.mkdir(d: str) → None#
utils.file_utils.parse_str_of_num_bytes(s: str, return_str: bool = False) → str | int#
utils.file_utils.remove_path_extension(path: str) → str#
utils.file_utils.reshard_jsonl(
input_dir: str,
output_dir: str,
output_file_size: str = '100M',
start_index: int = 0,
file_prefix: str = '',
) → None#

Reshards a directory of jsonl files to have a new (approximate) file size for each shard.

Args:
    input_dir: The input directory containing jsonl files.
    output_dir: The output directory where the resharded jsonl files will be written.
    output_file_size: Approximate size of output files. Must be specified as a string with the unit K, M, or G for kilobytes, megabytes, or gigabytes.
    start_index: Starting index for naming the output files. Note: the indices may not be contiguous if the sharding process would otherwise produce an empty file in a slot.
    file_prefix: Prefix to prepend to the output file number.
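
A hedged example of resharding to roughly 256 MB shards; the exact output naming is an assumption based on the start_index and file_prefix parameters:

from utils.file_utils import reshard_jsonl

# Rewrite the jsonl shards under /data/raw into ~256 MB files whose names
# combine file_prefix with a running index starting at start_index.
reshard_jsonl(
    input_dir="/data/raw",
    output_dir="/data/resharded",
    output_file_size="256M",
    start_index=0,
    file_prefix="curated_",
)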

utils.file_utils.separate_by_metadata(
input_data: dask.dataframe.DataFrame | str,
output_dir: str,
metadata_field: str,
remove_metadata: bool = False,
output_type: str = 'jsonl',
input_type: str = 'jsonl',
include_values: list[str] | None = None,
exclude_values: list[str] | None = None,
filename_col: str = 'file_name',
) → dict#

Saves the dataframe to subfolders named after a metadata value.

Args:
    input_data: Either a DataFrame or a string representing the path to the input directory. If a DataFrame is provided, it must have a filename_col for the shard.
    output_dir: The base directory under which all metadata-based subdirectories will be created.
    metadata_field: The metadata field to split on.
    remove_metadata: Whether to remove the metadata from the dataframe when saving it.
    output_type: File type the dataset will be written to. Supported file formats include “jsonl” (default), “pickle”, or “parquet”.
    include_values: A list of strings representing specific values to be selected or included. If provided, only the items matching these values are kept.
    exclude_values: A list of strings representing specific values to be excluded or ignored. If provided, any items matching these values are skipped.
    filename_col: The column name in the DataFrame that contains the filename. Default is “file_name”.

Returns:
    A delayed dictionary mapping each metadata value to the count of entries with that value.
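
A sketch of splitting a dataset by a metadata field; the “language” field is hypothetical, and the call assumes the returned delayed dictionary can be materialized with Dask’s compute():

from utils.file_utils import separate_by_metadata

counts = separate_by_metadata(
    input_data="/data/all",       # path to a directory of jsonl shards
    output_dir="/data/by_language",
    metadata_field="language",    # hypothetical metadata field
    remove_metadata=True,
    output_type="jsonl",
    input_type="jsonl",
)
# The result is delayed; compute it to get {metadata value: entry count}.
print(counts.compute())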

utils.file_utils.write_dataframe_by_meta(
df: pandas.DataFrame,
output_dir: str,
metadata_field: str,
remove_metadata: bool = False,
output_type: str = 'jsonl',
include_values: list[str] | None = None,
exclude_values: list[str] | None = None,
filename_col: str = 'file_name',
) → dict#
utils.file_utils.write_record(
input_dir: str,
file_name: str,
line: str,
field: str,
output_dir: str,
include_values: list[str] | None = None,
exclude_values: list[str] | None = None,
) → str | None#