Steps to generate CSVs#

Note

If you encounter value errors saying that the ASR language model is not allowed to vary inside a raw CSV file, run fix_varying_language_model.py before prepare_raw_data.py.

Prerequisites: python, bash, numpy, pandas, PyYAML.

The following steps generate the performance tables, given that the directory structure of --input_dir is correct:

cd docs/tabbed_tables_scripts/performance
python prepare_raw_data.py \
  --input_dir "${PATH_TO_RESULTS_DIR}" \
  --output_dir "${PATH_TO_FINAL_CSV_TABLES}" \
  --tmp_dir_with_parsed_performance_data __tmp_dir_with_parsed_performance_data \
  --keep_intermediate_parsed_data
python fill_performance_pages.py --metadata "${PATH_TO_FINAL_CSV_TABLES}/metadata.yaml"

Of the above parameters, only --input_dir is required.

Currently, for the above command to work, the --input_dir must have the following structure:

.
├── AWS
│   ├── g4dn.16xlarge
│   │   ├── ASR
│   │   │   └── All_CPU_Cores
│   │   │       ├── results_citrinet-1024_en-US_flashlight_offline.csv
│   │   │       ├── ...
│   │   │       └── results_conformer_en-US_flashlight_streaming.csv
│   │   ├── NLP
│   │   │   └── results.csv
│   │   └── TTS
│   │       ├── results_file_fastpitch_hifigan.csv
│   │       ...
│   ├── ...
│   └── p4d.24xlarge
│       ├── ASR
│       ├── NLP
│       └── TTS
├── GCP
│   ├── a2-highgpu-1g_[a100_12vcpu]
│   ├── ...
│   └── n1-highmem-8_[v100_8vcpu]
└── on_prem
    ├── A10
    │   ├── ASR
    │   │   ├── 16_Cores
    │   │   │   ├── results_conformer_en-US_flashlight_offline.csv
    │   │   │   ├── results_conformer_en-US_flashlight_streaming-throughput.csv
    │   │   │   └── results_conformer_en-US_flashlight_streaming.csv
    │   │   ├── 32_Cores
    │   │   ├── 64_Cores
    │   │   └── All_CPU_Cores
    │   │       ├── results_citrinet-1024_de-DE_flashlight_streaming.csv
    │   │       ├── ...
    │   │       └── results_quartznet_en-US_os2s_streaming.csv
    │   ├── NLP
    │   │   └── results.csv
    │   └── TTS
    │       └── results_file_fastpitch_hifigan.csv
    ├── ...
    └── V100

Please note that results on embedded devices are provided in a different format. Before putting them into the above directory tree, you need to convert the .csv files from the Embedded format to the format used by the other .csv files. Use the following command:

python embedded_format_to_usual_format.py \
  --input_dir "${PATH_TO_DIR_IN_EMBEDDED_FORMAT}" \
  --output_dir "${PATH_TO_DIR_IN_USUAL_FORMAT}"

The above directory tree is composed of results from 4 PBRs:

  • PBR for On-Prem A100, A30, A10, V100, T4

  • PBR for On-Prem L4

  • PBR for AWS and GCP

  • PBR for Embedded results (JA0, JAX, JAX-NX GPUs in the On-Prem directory)

Each directory level has to correspond to a specific parameter. In the example, the first level represents the cloud service: AWS, GCP, or on_prem (no cloud). The second level in the AWS and GCP directories is a cloud instance (e.g. g4dn.16xlarge), while in the on_prem directory the second level is the GPU type (e.g. A10). Thus, the GPU type level is missing in the AWS and GCP directories, and the cloud instance level is missing in the on_prem directory. You can find an explanation of missing levels in the “How to deal with missing levels” section of this README.

NOTE: You can omit directories on a level as long as the whole level is kept. For example, your directory structure can lack the AWS and GCP directories, but in that case it must contain the on_prem directory so that the cloud level remains. Thus, when new PBR results come, you DON’T have to search for previous PBR results to update the table.

The ASR directories are special because they always contain an additional directory level describing the number of CPU cores used. No “number of cores” directory level is expected for NLP and TTS tasks.

It is OK for a terminal directory with .csv files to contain an extra directory; only a warning will be shown.

It is also fine if there are extra files on any level of the results tree.
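
A sketch of how the ASR "number of cores" directory names can be validated (the regexes here are illustrative stand-ins for ALL_CPU_CORES_PATTERN and VARYING_CORES_PATTERN in config.py; the '^[0-9]+_Cores$' pattern is quoted later in this README, the all-cores pattern is an assumption based on the directory names above):

```python
import re

# Illustrative stand-ins for the compiled patterns in config.py.
ALL_CPU_CORES_PATTERN = re.compile(r'^All_CPU_Cores$')
VARYING_CORES_PATTERN = re.compile(r'^[0-9]+_Cores$')

def classify_asr_cores_dir(name: str) -> str:
    """Classify an ASR "number of cores" directory name."""
    if ALL_CPU_CORES_PATTERN.match(name):
        return 'all_cpu_cores'
    if VARYING_CORES_PATTERN.match(name):
        return 'varying_cores'
    return 'unexpected'
```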

If the --output_dir parameter of the prepare_raw_data.py script is not set, the resulting CSV tables are saved in ../../perf_data/perfomance_tables.

The --tmp_dir_with_parsed_performance_data directory contains intermediate parsed results, which you can use for debugging. If you do not provide --tmp_dir_with_parsed_performance_data, the script creates the directory automatically and removes it before exiting.
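
The intermediate-directory lifecycle can be sketched like this (function names are made up and the real script's cleanup rules may differ in detail; only the described behavior comes from this README):

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional, Tuple

def resolve_tmp_dir(tmp_dir: Optional[str]) -> Tuple[Path, bool]:
    """Return the directory for intermediate parsed data and whether it was
    auto-created. If the option is absent, a temporary directory is created."""
    if tmp_dir is not None:
        path = Path(tmp_dir)
        path.mkdir(parents=True, exist_ok=True)
        return path, False
    return Path(tempfile.mkdtemp(prefix='parsed_performance_')), True

def cleanup_tmp_dir(path: Path, auto_created: bool, keep_intermediate: bool) -> None:
    """Remove an auto-created intermediate directory unless the user asked to keep it
    (cf. --keep_intermediate_parsed_data)."""
    if auto_created and not keep_intermediate:
        shutil.rmtree(path, ignore_errors=True)
```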

The script fill_performance_pages.py updates the markdown files

  • docs/source/asr/asr-performance-table.md,

  • docs/source/asr/asr-cpu-effect-performance-table.md,

  • docs/source/tts/tts-performance-table.md

with the new CSV files created by prepare_raw_data.py.

High level explanation of table preprocessing scripts#

prepare_raw_data.py#

prepare_raw_data.py parses the CSV files in the --input_dir directory tree and saves ready performance tables in --output_dir together with a metadata.yaml file. An item in metadata.yaml contains the path to a final performance CSV file and a metadata section with several fields describing how the performance data in the table was collected.
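
For reference, a sketch of how a loaded metadata.yaml might be queried. Only the 'path' key is taken from this README (it is used by the tab creators); the metadata field names here are hypothetical:

```python
from typing import Any, Dict, List

# Illustrative metadata.yaml items after yaml.safe_load(); field names
# 'gpu_type' and 'asr_mode' are hypothetical.
sample_metadata: List[Dict[str, Any]] = [
    {'path': 'asr/table_1.csv', 'metadata': {'gpu_type': 'A10', 'asr_mode': 'streaming'}},
    {'path': 'asr/table_2.csv', 'metadata': {'gpu_type': 'V100', 'asr_mode': 'offline'}},
]

def paths_for(metadata: List[Dict[str, Any]], **conditions: Any) -> List[str]:
    """Return paths of final CSV tables whose metadata matches all given conditions."""
    return [
        item['path']
        for item in metadata
        if all(item['metadata'].get(key) == value for key, value in conditions.items())
    ]
```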

Table preparation is performed step by step and intermediate results are saved into --tmp_dir_with_parsed_performance_data. If you provide the option --keep_intermediate_parsed_data, then the intermediate results are not removed after preprocessing finishes.

Table preparation is governed by several constants defined inside config.py:

  • SupportedTasks - an enum which defines which tasks are supported by the system;

  • TASK_NAMES - the names of the tasks in --input_dir directory tree;

  • Level - an enum which defines supported levels in directory tree;

  • LEVELS - a list of levels which are expected in the input directory tree from top to bottom. This level structure is obtained after filler directories are added (see FILLER_DIRECTORIES_SCHEME) and the directory tree is split into “all CPU cores” and “varying number of CPU cores” trees;

  • ALL_CPU_CORES_PATTERN - a compiled pattern which must match “all CPU cores” directory names for ASR tasks;

  • VARYING_CORES_PATTERN - a compiled pattern which must match “varying number of CPU cores” directory names for ASR tasks;

  • RESULTS_FILE_PATTERN - a compiled pattern which must match raw results CSV files;

  • FILLER_DIRECTORIES_SCHEME - allows processing an “asymmetric” --input_dir directory tree (when 1 or more levels are missing in a branch of the tree);

  • PREPROCESSING_FUNCS_BY_TASK - how to preprocess a table. Sometimes a raw table needs cleanup or other small fixes.

  • LEVEL_TO_EXPERIMENT_DESCRIPTION - how to preprocess the name of a directory to extract an experiment description parameter from it. For example, number-of-CPU-cores level directories match the pattern '^[0-9]+_Cores$', and it is more convenient to work with an integer than with a string containing the excess suffix _Cores;

  • Experiment{ASR,TTS,NLP} - enums of parameters of ASR, TTS, NLP experiments which are collected from raw CSV files. There are also parameters which are encoded in directory names (see Level and LEVELS). Fields of these enums are collected after the PREPROCESSING_FUNCS_BY_TASK functions are applied.

  • AdditionalInfo{ASR,TTS,NLP} - enums with additional info about experiments which is generated during final CSV preparation. For example, source_file, which contains the path to the raw CSV table;

  • {ASR,TTS,NLP}ExperimentReplacementParameters, ReplacementParametersType, REPLACEMENT_EXPERIMENT_DESCRIPTION_FIELDS show how to check that results from a new PBR contain the same experiments as already present CSVs. {ASR,TTS,NLP}ExperimentReplacementParameters are similar to ExperimentDescription but contain fewer parameters. If an experiment from a new PBR has the same “replacement parameters” as old CSVs, then all old CSVs with these “replacement parameters” are removed. This is done to avoid situations such as “Streaming” mode being measured on version 2.8.0 while “Streaming-Throughput” is measured on version 2.9.0. “Replacement parameters” also ensure that only 1 CSV file is suitable for a table tab.

  • ASR_MODE_VALUES - a list of names of ASR modes;

  • EXPERIMENT_DESCRIPTION_COLS_IN_PREPROCESSED_TABLES - a dictionary with the column names from which Experiment{ASR,TTS,NLP} fields are collected. It is also used to verify the types of values taken from the columns;

  • EXPERIMENT_INFO_USED_FOR_FILE_NAME_CREATION - contains names of fields of ExperimentDescription{ASR,TTS,NLP} and names of levels which are used in final CSV file names. There have to be enough parameters for all names to be unique;

  • EXPERIMENT_DESCRIPTION_FIELDS_WHICH_ARE_ALLOWED_TO_VARY_INSIDE_PREPROCESSED_TABLE - contains sets of ExperimentDescription{ASR,TTS,NLP} field names which are allowed to vary inside a preprocessed CSV file. A file in which an experiment description field varies is split into 2 or more separate files;

  • LATENCY_COLUMNS_BY_TASK - compiled regex patterns which match latency column names;

  • THROUGHPUT_COLUMNS_BY_TASK - names of throughput columns;

  • NUM_PARALLEL_COLUMN_NAMES_BY_TASK - names of “# of streams” columns;

  • MAX_NOT_SUSPICIOUS_OUTLIER_QUOTIENT - used for detecting unusually bad measurements, which are counted and shown in a warning. MAX_NOT_SUSPICIOUS_OUTLIER_QUOTIENT is the maximum allowed quotient of a measured value and the best value acquired under the same conditions (currently, 3 trials are made for every set of parameters). For throughput the quotient is the largest value divided by the checked value, and for latency it is the checked value divided by the smallest value.

  • MERGE_STEPS - defines which tables will be merged (the second table is appended to the first table);

  • FINAL_CSV_SCHEMES - describes which columns will be added to a final CSV file and defines headers.
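
A minimal sketch of the MAX_NOT_SUSPICIOUS_OUTLIER_QUOTIENT logic described above (the function name and signature are made up; only the quotient rules come from this README):

```python
from typing import List

def count_suspicious(measurements: List[float], max_quotient: float, metric: str) -> int:
    """Count trials that look suspiciously bad relative to the best
    same-condition trial, following the quotient rules described above."""
    if metric == 'throughput':
        best = max(measurements)                      # best throughput is the largest
        quotients = [best / m for m in measurements]
    elif metric == 'latency':
        best = min(measurements)                      # best latency is the smallest
        quotients = [m / best for m in measurements]
    else:
        raise ValueError(f'unknown metric: {metric!r}')
    return sum(q > max_quotient for q in quotients)
```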

fill_performance_pages.py#

The fill_performance_pages.py script arranges CSV files prepared by prepare_raw_data.py into tabbed tables. A tabbed table structure is inferred from a tab level list (e.g. ASR_TABBED_TABLE_LEVELS for the ASR task) and the available CSV files. If a level is missing in a branch of a tabbed table (all values of the metadata field are None), the level is omitted in this branch (e.g. in version 23.01 there is no GPU type level for AWS and GCP). Customization options:

  • TAB_LEVEL_SORT_FUNCTIONS - can be used for sorting of tabs (in docs/source/tabbed_tables_scripts/tabbed_tables.py);

  • METADATA_VALUES_PRETTY - can be used for setting tab names (in docs/source/tabbed_tables_scripts/tabbed_tables.py);

  • METADATA_KEYS_PRETTY - customizes metadata field names inside the innermost tab. Currently, it is used only for ASR.
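
The level-omission rule can be sketched as follows (a simplified model where metadata items are flat dicts; the real script reads these values from metadata.yaml):

```python
from typing import Any, Dict, List

def effective_levels(levels: List[str], items: List[Dict[str, Any]]) -> List[str]:
    """Drop tab levels for which every metadata item has a None value,
    as described above for the AWS/GCP branches without a GPU type level."""
    return [
        level for level in levels
        if any(item.get(level) is not None for item in items)
    ]
```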

Adding a new parameter (level) to an input directory structure#

If you need to insert a level into an input directory structure, you will need to make several changes in the config.py and fill_performance_pages.py scripts. Let’s consider a case when, before the insertion, your directory structure is the following:

├── A100
│   ├── ASR
│   │   ├── 12_Cores
│   │   ...
│   │   └── All_CPU_Cores
│   ├── NLP
│   └── TTS
├── T4
│   ├── ASR
│   │   ├── 16_Cores
│   │   ...
│   │   └── All_CPU_Cores
│   ├── NLP
│   └── TTS
└── V100
    ├── ASR
    │   ├── 12_Cores
    │   ├── ...
    │   └── All_CPU_Cores
    ├── NLP
    └── TTS

and you need to add results on AWS. This means that there will be a new cloud level with two directories: aws and on_prem. The directory structure after the insertion is

├── aws
│   ├── A100
│   │   ├── ASR
│   │   │   ├── 12_Cores
│   │   │   ├── ...
│   │   │   └── All_CPU_Cores
│   │   ├── NLP
│   │   └── TTS
│   ├── T4
│   │   ├── ASR
│   │   │   ├── 16_Cores
│   │   │   ├── ...
│   │   │   └── All_CPU_Cores
│   │   ├── NLP
│   │   └── TTS
│   └── V100
│       ├── ASR
│       │   ├── 12_Cores
│       │   ├── ...
│       │   └── All_CPU_Cores
│       ├── NLP
│       └── TTS
└── on_prem
    ├── A100
    │   ├── ASR
    │   │   ├── 12_Cores
    │   │   ├── ...
    │   │   └── All_CPU_Cores
    │   ├── NLP
    │   └── TTS
    ├── T4
    │   ├── ASR
    │   │   ├── 16_Cores
    │   │   ├── ...
    │   │   └── All_CPU_Cores
    │   ├── NLP
    │   └── TTS
    └── V100
        ├── ASR
        │   ├── 12_Cores
        │   ├── ...
        │   └── All_CPU_Cores
        ├── NLP
        └── TTS

To add the cloud level to your tabbed tables, you will need to do the following:

Inside config.py#

  1. Add cloud to Level Enum:

class Level(Enum):
    ...
    cloud = auto()
  2. Add the new level to LEVELS.

LEVELS = (Level.cloud, Level.gpu_type, Level.task)
  3. It is possible that one of the “branches” of a directory tree lacks some levels (see the current required structure of --input_dir). In such a case, consult the section “How to deal with missing levels” of this README.

  4. (optional) If you need to preprocess a level directory name before using it as an experiment description field, then add a preprocessing method to the LEVEL_TO_EXPERIMENT_DESCRIPTION dictionary constant.
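
The optional preprocessing in step 4 can be sketched for the existing num_cpu level (the '^[0-9]+_Cores$' pattern is quoted earlier in this README; the function itself is illustrative, the real one lives in config.py):

```python
import re

def num_cpu_from_dir_name(name: str) -> int:
    """Turn a "number of CPU cores" directory name such as '12_Cores' into
    an integer, as a LEVEL_TO_EXPERIMENT_DESCRIPTION entry might do."""
    match = re.match(r'^([0-9]+)_Cores$', name)
    if match is None:
        raise ValueError(f'unexpected directory name: {name!r}')
    return int(match.group(1))
```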

Inside fill_performance_pages.py#

  1. Add Level.cloud.name to ASR_TABBED_TABLE_LEVELS and TTS_TABBED_TABLE_LEVELS. The position of Level.cloud.name in these lists determines the position of the cloud tab level in the resulting tabbed tables.

ASR_TABBED_TABLE_LEVELS = [
  Level.cloud.name,  # cloud name is the topmost level of the ASR table
  Level.gpu_type.name,
  ExperimentASR.asr_acoustic_model.name,
  ExperimentASR.language.name,
  ExperimentASR.asr_mode.name
]
TTS_TABBED_TABLE_LEVELS = [
  Level.gpu_type.name,
  ExperimentTTS.model.name,
  Level.cloud.name,  # cloud name is the lowest level of the TTS table
]

Inside docs/source/tabbed_tables_scripts/tabbed_tables.py#

  1. (optional) You can sort tabs by specifying a sorting key in a constant TAB_SORT_FUNCTIONS.

  2. (optional) You can prettify names of cloud instances in the table by adding mappings to the PRETTIFYING_TABS:

PRETTIFYING_TABS = {
    ...
    'cloud': {'aws': 'AWS', 'on_prem': 'on Prem'},
    ...
}

How to add a new task#

Raw CSV files for different tasks vary, so quite a few things need to be added to the config.py, fill_performance_pages.py and docs/source/tabbed_tables_scripts/tabbed_tables.py scripts to process a new task.

Below is a description of how to add a new task to the system. For convenience, let’s call this task NMT.

Inside config.py#

  1. Add a new task to the enum SupportedTasks.

class SupportedTasks(Enum):
    ...
    nmt = auto()
  2. Add the name of the directory which stores results for the new task to the TASK_NAMES constant. In the directory tree passed to prepare_raw_data.py, all NMT results have to be in NMT directories.

TASK_NAMES = {..., SupportedTasks.nmt: 'NMT'}
  3. Add enums ExperimentNMT and AdditionalInfoNMT. ExperimentNMT contains fields which allow distinguishing between different NMT experiments. Values for those fields will be taken from CSV files after PREPROCESSING_FUNCS_BY_TASK are applied. AdditionalInfoNMT fields are auxiliary info. AdditionalInfoNMT has to contain a source_file field (for storing the path to the raw CSV file in which the data originally resided before preprocessing). If you need to add other fields to additional info, you will have to write a separate function for each, as was done for max_effective_number_of_streams in the AdditionalInfoASR enum.

  4. Add ExperimentNMT and AdditionalInfoNMT to ExperimentDescription and AdditionalInfo respectively:

ExperimentDescription = namedtuple(
    "ExperimentDescription",
    unite_enum_members([Level, ExperimentASR, ExperimentTTS, ExperimentNLP, ExperimentNMT]),
    defaults=[None] * len(
        set().union(
            *[enum_.__members__ for enum_ in [Level, ExperimentASR, ExperimentTTS, ExperimentNLP, ExperimentNMT]]
        )
    ),
)
AdditionalInfo = namedtuple(
    "AdditionalInfo",
    unite_enum_members([AdditionalInfoASR, AdditionalInfoTTS, AdditionalInfoNLP, AdditionalInfoNMT]),
    defaults=[None] * len(
        set().union(
            *[
                enum_.__members__
                for enum_ in [AdditionalInfoASR, AdditionalInfoTTS, AdditionalInfoNLP, AdditionalInfoNMT]
            ]
        )
    ),
)
  5. Add NMT preprocessing functions to PREPROCESSING_FUNCS_BY_TASK. It is recommended to add the remove_spaces_near_commas function to the list.

  6. Specify types and column names for experiment description fields in the EXPERIMENT_DESCRIPTION_COLS_IN_PREPROCESSED_TABLES constant. For every experiment description field, there has to be a dictionary with 2 keys: ColSpec.name and ColSpec.types. The ColSpec.name value is the name of the column from which an experiment description field will be taken, and the ColSpec.types value is a tuple of types which are allowed in this column. For more details, see the comment above the EXPERIMENT_DESCRIPTION_COLS_IN_PREPROCESSED_TABLES declaration.

  7. Specify which ExperimentDescription fields will be used in the file names of resulting CSV files in EXPERIMENT_INFO_USED_FOR_FILE_NAME_CREATION. There have to be enough field names to distinguish between any resulting tables. Elements of LEVEL_NAMES are required in EXPERIMENT_INFO_USED_FOR_FILE_NAME_CREATION.

  8. Add a SupportedTasks.nmt item to EXPERIMENT_DESCRIPTION_FIELDS_WHICH_ARE_ALLOWED_TO_VARY_INSIDE_PREPROCESSED_TABLE. Sometimes 1 raw CSV file contains results for several experiments. If an experiment description column in a raw CSV can vary, then you need to add the corresponding field name to EXPERIMENT_DESCRIPTION_FIELDS_WHICH_ARE_ALLOWED_TO_VARY_INSIDE_PREPROCESSED_TABLE. Please note that splitting of CSV tables is performed after preprocessing.

  9. Add a SupportedTasks.nmt item to the ADDITIONAL_INFO_BY_TASK dictionary.

  10. Add an NMTExperimentReplacementParameters namedtuple. Add this namedtuple to ReplacementParametersType and REPLACEMENT_EXPERIMENT_DESCRIPTION_FIELDS.

  11. If the new task's performance tables have latency columns, add a regex for latency columns to the LATENCY_COLUMNS_BY_TASK dictionary.

  12. If the new task's performance tables have a throughput column, add the name of this column to the THROUGHPUT_COLUMNS_BY_TASK dictionary.

  13. Performance result tables are expected to have a “# of streams” column. Please add the name of this column to the NUM_PARALLEL_COLUMN_NAMES_BY_TASK dictionary.

  14. Define the format of the output tables for the NMT task in the FINAL_CSV_SCHEMES dictionary. Please see the present descriptions and the comment above FINAL_CSV_SCHEMES for more details.

Inside fill_performance_pages.py#

  15. Declare the NMT tabbed table level order. For this you will need to import ExperimentNMT and define a constant which lists levels from top to bottom, e.g.

NMT_TABBED_TABLE_LEVELS = [Level.gpu_type.name, ...]

Any Level and ExperimentNMT element can be used in NMT_TABBED_TABLE_LEVELS.

  16. (optional) If some metadata fields are used inside innermost tabs, you may set pretty names for these metadata fields in the METADATA_KEYS_PRETTY constant.

  17. Call the function create_tabbed_table() inside main():

    create_tabbed_table(
        args.metadata,
        Path(__file__).parent / '../../nmt/nmt-performance-table.md',
        NMT_TABBED_TABLE_LEVELS,
        SupportedTasks.nmt,
        nmt_tab_creator,
        {Level.task.name: TASK_NAMES[SupportedTasks.nmt]},
    )
  18. Write a function nmt_tab_creator() which is responsible for rendering the innermost tabs in the NMT table.

def nmt_tab_creator(level_values: Dict[str, Any], metadata_file: Path, metadata: List[MetadataItemType]) -> str:
    """A function for creation of TTS tab. This function should be passed in ``inner_most_tab_creator`` parameter of
    ``create_nested_tabs()` function."""
    relevant_items = extract_relevant_items_from_metadata(metadata, level_values, metadata_file)
    raise_error_if_more_than_1_item(relevant_items, level_values, metadata_file)
    return f'''.. csv-table::
   :header-rows: 2
   :file: {to_unix_path_str(build_path_to_performance_table(metadata_file.parent / relevant_items[0]['path']))}
'''

Inside docs/source/tabbed_tables_scripts/tabbed_tables.py#

  19. (optional) Prettify tab names for the NMT task in PRETTIFYING_TABS.

  20. (optional) You may sort tabs within a tab level by providing a sorting key in the TAB_SORT_FUNCTIONS constant.

How to deal with missing levels#

Look at the following directory tree

.
├── AWS
│   ├── g4dn.16xlarge
│   │   ├── ASR
│   │   │   └── All_CPU_Cores
│   │   │       ├── results_citrinet-1024_en-US_flashlight_offline.csv
│   │   │       ├── ...
│   │   │       └── results_conformer_en-US_flashlight_streaming.csv
│   │   ├── NLP
│   │   │   └── results.csv
│   │   └── TTS
│   │       ├── results_file_fastpitch_hifigan.csv
│   │       ...
│   ├── ...
│   └── p4d.24xlarge
│       ├── ASR
│       ├── NLP
│       └── TTS
├── GCP
│   ├── a2-highgpu-1g_[a100_12vcpu]
│   ├── ...
│   └── n1-highmem-8_[v100_8vcpu]
└── on_prem
    ├── A10
    │   ├── ASR
    │   │   ├── 16_Cores
    │   │   │   ├── results_conformer_en-US_flashlight_offline.csv
    │   │   │   ├── results_conformer_en-US_flashlight_streaming-throughput.csv
    │   │   │   └── results_conformer_en-US_flashlight_streaming.csv
    │   │   ├── 32_Cores
    │   │   ├── 64_Cores
    │   │   └── All_CPU_Cores
    │   │       ├── results_citrinet-1024_de-DE_flashlight_streaming.csv
    │   │       ├── ...
    │   │       └── results_quartznet_en-US_os2s_streaming.csv
    │   ├── NLP
    │   │   └── results.csv
    │   └── TTS
    │       └── results_file_fastpitch_hifigan.csv
    ├── ...
    └── V100

The on_prem directory contains GPU type directories, whereas AWS and GCP contain cloud instance directories. If you intend to pass such a directory tree to the prepare_raw_data.py script, you need to specify which levels are missing in the FILLER_DIRECTORIES_SCHEME constant in config.py.

A FILLER_DIRECTORIES_SCHEME for the above case is

FILLER_DIRECTORIES_SCHEME = {
    FillerScheme.all_present_directories: {FillerScheme.all_present_directories: FillerScheme.filler},
    'on_prem': FillerScheme.filler,
}

The FILLER_DIRECTORIES_SCHEME constant shows where to insert “filler” directories so that the level structure becomes identical in all branches of the input tree. A key of a dictionary inside FILLER_DIRECTORIES_SCHEME can be:

  • FillerScheme.filler,

  • FillerScheme.all_present_directories,

  • a name of a directory from the processed directory tree.

If a value in a dictionary from FILLER_DIRECTORIES_SCHEME is not a nested dictionary, then this value has to be FillerScheme.filler.

If a key in a dictionary is FillerScheme.filler, there can be no other keys in the same dictionary. In the next example, two “filler” directories are inserted in the 'AWS' branch.

FILLER_DIRECTORIES_SCHEME = {
    FillerScheme.all_present_directories: {FillerScheme.all_present_directories: FillerScheme.filler},
    'AWS': {FillerScheme.filler: {FillerScheme.all_present_directories: FillerScheme.filler}},
    'on_prem': FillerScheme.filler,
}

If a key k of a dictionary D is FillerScheme.all_present_directories, then the value corresponding to k applies to all directories which are not among the other keys of D.
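
Under the stated rules, filler insertion can be sketched as below. This is one possible reading of the scheme semantics, with directory trees modeled as nested dicts and '__filler__' as a made-up filler name; the real implementation in the scripts may differ:

```python
from enum import Enum, auto
from typing import Any, Dict

class FillerScheme(Enum):
    filler = auto()
    all_present_directories = auto()

def insert_fillers(tree: Any, scheme: Any, filler_name: str = '__filler__') -> Any:
    """Insert filler levels into a directory tree (nested dicts, leaves are None)
    according to a FILLER_DIRECTORIES_SCHEME-style scheme (sketch)."""
    if scheme is FillerScheme.filler:
        # A plain filler value: wrap the whole subtree in one filler directory.
        return {filler_name: tree}
    if not isinstance(scheme, dict):
        return tree
    if FillerScheme.filler in scheme:
        # A filler key: insert a filler level here, then apply the nested
        # scheme to the original children (no other keys may sit alongside it).
        return {filler_name: insert_fillers(tree, scheme[FillerScheme.filler], filler_name)}
    result: Dict[str, Any] = {}
    for name, child in tree.items():
        # A named key takes precedence; otherwise fall back to all_present_directories.
        sub = scheme.get(name, scheme.get(FillerScheme.all_present_directories))
        result[name] = insert_fillers(child, sub, filler_name) if sub is not None else child
    return result
```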

How to preprocess tables#

You can fix raw CSV files before metadata is extracted from them. To do this, create a function which

  • takes a path to a raw CSV file as input,

  • makes the necessary changes to the CSV file and then saves it in the original place.

Then add the function to the PREPROCESSING_FUNCS_BY_TASK constant in the config.py script. The functions in PREPROCESSING_FUNCS_BY_TASK are applied to CSV files in the order in which they are listed.

How to edit headers of final tables#

The FINAL_CSV_SCHEMES constant from config.py specifies which columns are added to the final CSV files.

There are different headers for ASR CSVs depending on the mode (streaming or offline) and the number of language models. The header type is determined by the HeaderDescriptionKeys.table_type_callback callback. When a callback is used, the header schemes are stored in the HeaderDescriptionKeys.headers item.

If all final CSVs belonging to a task have the same headers, then FINAL_CSV_SCHEMES[<task>] is a header scheme.

Headers can have 1 or 2 levels.

A 1-level header is a dictionary whose keys are the columns added to the final CSV and whose values are the names of these columns in the final CSV. A key of a 1-level header can also be a compiled regex matching column names; in that case, the corresponding value must be a callable which takes an old column name as input and returns a column name for the final CSV.

A 2-level header is a dictionary whose keys are the top-level column names of the final CSV and whose values are 1-level headers.

You can make a column optional: if the corresponding column in a raw CSV is missing, an optional column is not added to the final CSV. To achieve this, replace a usual key with a tuple of 2 elements: the first element is the key and the second is ColSpec.only_if_present_in_table. For an example, see the “speaker diarization” columns in FINAL_CSV_SCHEMES. If the top column of a 2-level header is optional, its subcolumns cannot be marked optional individually (all subcolumns become optional by default).
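
A minimal sketch of how a 1-level header scheme with regex keys and optional columns could be resolved against raw CSV columns. All concrete column names are illustrative, OPTIONAL stands in for ColSpec.only_if_present_in_table, and the raising behavior for missing required columns is an assumption:

```python
import re
from typing import Any, Dict, List, Tuple

OPTIONAL = 'only_if_present_in_table'  # stand-in for ColSpec.only_if_present_in_table

def resolve_header(scheme: Dict[Any, Any], raw_columns: List[str]) -> List[Tuple[str, str]]:
    """Return (raw_column, final_name) pairs for a 1-level header scheme."""
    resolved = []
    for key, value in scheme.items():
        optional = isinstance(key, tuple) and key[1] == OPTIONAL
        if optional:
            key = key[0]
        if isinstance(key, re.Pattern):
            # Regex key: the value must be a callable old_name -> final_name.
            for col in raw_columns:
                if key.match(col):
                    resolved.append((col, value(col)))
        elif key in raw_columns:
            resolved.append((key, value))
        elif not optional:
            # Assumed behavior; the real script may handle this differently.
            raise KeyError(f'required column {key!r} missing from raw CSV')
    return resolved
```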

How to prettify a tab name in a final CSV header#

By default, metadata values serve as tab names. However, this is often inconvenient. For example, values of the model metadata field for the TTS task can be 'fastpitch-hifigan' and 'tacotron-waveglow'. To improve tab names, you can use PRETTIFYING_TABS in the docs/source/tabbed_tables_scripts/tabbed_tables.py script. In PRETTIFYING_TABS, you can provide a dictionary which maps metadata values to tab names.

You can also provide a function which takes a metadata value as input and returns a tab name. For the num_cpu metadata field, it looks like this:

def prettify_number_of_cores(n: int) -> str:
    return f'{n} cores'

PRETTIFYING_TABS = {
    SupportedTasks.asr: {
        Level.num_cpu.name: prettify_number_of_cores,
        ...
    },
    ...
}

How to sort tabs in tabbed tables#

For sorting tabs on a tab level, set a sorting key in TAB_SORT_FUNCTIONS in the docs/source/tabbed_tables_scripts/tabbed_tables.py script. The sorting is performed by the Python built-in function sorted() applied to tuples (<metadata_value>, <list_of_metadata_items_which_belong_to_the_tab>). A function from TAB_SORT_FUNCTIONS is passed to sorted() as the key parameter. If a sorting function is missing, the key parameter of sorted() defaults to lambda x: x[0].
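
The sorting contract can be illustrated with plain sorted() calls (the tab values below are made up):

```python
# Tab data, per the description above, is a list of
# (<metadata_value>, <list_of_metadata_items_which_belong_to_the_tab>) tuples.
tabs = [(32, ['item_a']), (8, ['item_b']), (16, ['item_c'])]

# Default behavior when no sorting function is set: sort by the metadata value.
default_key = lambda pair: pair[0]

# An example custom key a TAB_SORT_FUNCTIONS entry could provide,
# e.g. to put larger core counts first.
descending = lambda pair: -pair[0]
```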