utils.file_utils

Module Contents

Functions
- filter_files_by_extension: Given a list of files, filter it to only include files matching the given extension(s).
- get_all_files_paths_under: Returns a list of all the files under a specified directory.
- get_batched_files: Returns a batch of files that still remain to be processed.
- get_remaining_files: Returns a list of the files that still remain to be read.
- reshard_jsonl: Reshards a directory of jsonl files to a new (approximate) file size for each shard.
- separate_by_metadata: Saves the dataframe to subfolders named after a metadata field.
Data

API
- utils.file_utils.NEMO_CURATOR_HOME
  'get(…)'
- utils.file_utils.expand_outdir_and_mkdir(outdir: str) → str
- utils.file_utils.filter_files_by_extension(
      files_list: list[str],
      keep_extensions: str | list[str],
  )

Given a list of files, filter it to only include files matching the given extension(s).

Args:
- files_list: List of files.
- keep_extensions: A string (e.g., "json") or a list of strings (e.g., ["json", "parquet"]) representing which file types to keep from files_list.
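A minimal usage sketch. The import path follows the module name shown on this page; in an installed NeMo Curator package the prefix may differ (e.g., nemo_curator.utils.file_utils), and the file list below is illustrative only.

```python
from utils.file_utils import filter_files_by_extension

files = ["data/a.jsonl", "data/b.parquet", "data/notes.txt"]

# Keep a single extension ...
jsonl_only = filter_files_by_extension(files, keep_extensions="jsonl")

# ... or several extensions at once.
tabular = filter_files_by_extension(files, keep_extensions=["jsonl", "parquet"])
```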
- utils.file_utils.get_all_files_paths_under(
      root: str,
      recurse_subdirectories: bool = True,
      followlinks: bool = False,
      keep_extensions: str | list[str] | None = None,
  )

Returns a list of all the files under a specified directory.

Args:
- root: The path to the directory to read.
- recurse_subdirectories: Whether to recurse into subdirectories. Note that this can be slow for a large number of files.
- followlinks: Whether to follow symbolic links.
- keep_extensions: A string or list of strings representing a file type or multiple file types to include in the output, e.g., "jsonl" or ["jsonl", "parquet"].
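A short sketch of recursively listing one file type under a directory; the directory name is illustrative, and the same import-path caveat as above applies.

```python
from utils.file_utils import get_all_files_paths_under

# List every .jsonl file under "my_dataset/", using the documented defaults
# for recursion and symbolic links.
paths = get_all_files_paths_under(
    root="my_dataset",
    recurse_subdirectories=True,
    followlinks=False,
    keep_extensions="jsonl",
)
```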
- utils.file_utils.get_batched_files(
      input_file_path: str,
      output_file_path: str,
      input_file_type: str,
      batch_size: int = 64,
  )

Returns a batch of files that still remain to be processed.

Args:
- input_file_path: The path of the input files.
- output_file_path: The path of the output files.
- input_file_type: The type of the input files.
- batch_size: The number of files to be processed at once.

Returns: A batch of files that are not in the output directory.
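The docstring above does not spell out whether the call returns a single batch or can be iterated for successive batches, so the loop below is an assumption rather than a guaranteed interface; the directory names and file type are illustrative.

```python
from utils.file_utils import get_batched_files

for batch in get_batched_files(
    input_file_path="raw_docs",
    output_file_path="processed_docs",
    input_file_type="jsonl",
    batch_size=64,
):
    # `batch` is assumed to be a list of input paths with no output yet.
    for path in batch:
        print("still to be processed:", path)
```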
- utils.file_utils.get_remaining_files(
      input_file_path: str,
      output_file_path: str,
      input_file_type: str,
      output_file_type: str | None = None,
      num_files: int = -1,
  )

Returns a list of the files that still remain to be read.

Args:
- input_file_path: The path of the input files.
- output_file_path: The path of the output files.
- input_file_type: The type of the input files.
- output_file_type: The type of the output files.
- num_files: The maximum number of files to be returned. If -1, all files are returned.

Returns: A list of files that still remain to be read.
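A sketch of resuming an interrupted job by reading only the inputs that have no corresponding output yet; the paths are illustrative.

```python
from utils.file_utils import get_remaining_files

remaining = get_remaining_files(
    input_file_path="raw_docs",
    output_file_path="processed_docs",
    input_file_type="jsonl",
    num_files=-1,  # -1 returns all remaining files
)
print(f"{len(remaining)} input files still need to be read")
```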
- utils.file_utils.merge_counts(first: dict, second: dict) → dict
- utils.file_utils.mkdir(d: str) → None
- utils.file_utils.parse_str_of_num_bytes(s: str, return_str: bool = False) → str | int
- utils.file_utils.remove_path_extension(path: str) → str
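A combined sketch of the small helpers above. Their behavior is inferred from their names and signatures (e.g., that parse_str_of_num_bytes converts "100M" to a byte count and that merge_counts combines two count dictionaries), so confirm against the implementation; all inputs are illustrative.

```python
from utils.file_utils import (
    expand_outdir_and_mkdir,
    merge_counts,
    parse_str_of_num_bytes,
    remove_path_extension,
)

# Presumably expands the path and creates the directory if it does not exist.
outdir = expand_outdir_and_mkdir("~/curator_output")

# Presumably merges two count dictionaries, e.g. per-shard record counts.
totals = merge_counts({"en": 10, "de": 3}, {"en": 5, "fr": 2})

# Presumably converts a human-readable size string into a number of bytes.
n_bytes = parse_str_of_num_bytes("100M")

# Presumably strips the extension, e.g. "data/shard_00.jsonl" -> "data/shard_00".
stem = remove_path_extension("data/shard_00.jsonl")
```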
- utils.file_utils.reshard_jsonl(
      input_dir: str,
      output_dir: str,
      output_file_size: str = '100M',
      start_index: int = 0,
      file_prefix: str = '',
  )

Reshards a directory of jsonl files so that each output shard has a new (approximate) file size.

Args:
- input_dir: The input directory containing jsonl files.
- output_dir: The output directory where the resharded jsonl files will be written.
- output_file_size: Approximate size of output files. Must be specified as a string with the unit K, M, or G for kilobytes, megabytes, or gigabytes.
- start_index: Starting index for naming the output files. Note: the indices may not be continuous if the sharding process would output an empty file in its place.
- file_prefix: Prefix to prepend to the output file number.
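A sketch of resharding a directory of jsonl files into roughly 250 MB shards; the paths, target size, and prefix are illustrative.

```python
from utils.file_utils import reshard_jsonl

reshard_jsonl(
    input_dir="raw_shards",
    output_dir="resharded",
    output_file_size="250M",  # size string must use the K, M, or G unit suffix
    start_index=0,
    file_prefix="web_",       # output files are numbered starting at start_index
)
```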
- utils.file_utils.separate_by_metadata(
      input_data: dask.dataframe.DataFrame | str,
      output_dir: str,
      metadata_field: str,
      remove_metadata: bool = False,
      output_type: str = 'jsonl',
      input_type: str = 'jsonl',
      include_values: list[str] | None = None,
      exclude_values: list[str] | None = None,
      filename_col: str = 'file_name',
  )

Saves the dataframe to subfolders named after a metadata field.

Args:
- input_data: Either a DataFrame or a string representing the path to the input directory. If a DataFrame is provided, it must have a filename_col for the shard.
- output_dir: The base directory under which all metadata-based subdirectories will be created.
- metadata_field: The metadata field to split on.
- remove_metadata: Whether to remove the metadata from the dataframe when saving it.
- output_type: File type the dataset will be written to. Supported file formats include 'jsonl' (default), 'pickle', or 'parquet'.
- include_values: A list of strings representing specific values to be selected or included. If provided, only the items matching these values are kept.
- exclude_values: A list of strings representing specific values to be excluded or ignored. If provided, any items matching these values are skipped.
- filename_col: The column name in the DataFrame that contains the filename. Default is "file_name".

Returns: A delayed dictionary mapping each metadata value to the count of entries with that value.
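A sketch of splitting a jsonl dataset into per-value subfolders on a metadata field; the paths and field name are illustrative, and materializing the documented delayed dictionary with .compute() assumes the usual Dask delayed interface.

```python
from utils.file_utils import separate_by_metadata

counts = separate_by_metadata(
    input_data="my_dataset",         # directory of jsonl shards
    output_dir="split_by_language",  # one subfolder per metadata value
    metadata_field="language",
    input_type="jsonl",
    output_type="jsonl",
).compute()                          # assumed Dask delayed -> dict of counts
```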
- utils.file_utils.write_dataframe_by_meta(
      df: pandas.DataFrame,
      output_dir: str,
      metadata_field: str,
      remove_metadata: bool = False,
      output_type: str = 'jsonl',
      include_values: list[str] | None = None,
      exclude_values: list[str] | None = None,
      filename_col: str = 'file_name',
  )
- utils.file_utils.write_record(
      input_dir: str,
      file_name: str,
      line: str,
      field: str,
      output_dir: str,
      include_values: list[str] | None = None,
      exclude_values: list[str] | None = None,
  )