# User-Defined Table Functions RBC supports *user-defined table functions*. UDTFs can access multiple rows of a table column concurrently. The signature of a UDTF is different from a UDF. The signature contains the input columns and the output columns specified by their respective types. There can be any number of input and output columns; the only constraint is that the input columns must be declared before (i.e., to the left) the output columns in the function signature. The signature, then, is declared using `UDTF()`; the number of arguments inside represent the total number of input and output columns. ```python 'UDTF(Column[List], ..., OutputColumn, ...)' ``` By default, the output columns are named `out0, out1, ...`. It's possible to use aliases to reference input and output columns in further SQL constructions. For example: ```sql 'UDTF(Column input_alias, OutputColumn output_alias)' ``` The maximum number of rows on table columns that a UDTF can handle corresponds to the maximum value of an `int32 (`2\*\*32/2`).` In the following example, the UDTF `fahrenheit2celcius` is defined on a table with one column as input. The final line `return 5` means that a table with only 5 rows is returned. ```python @heavy('UDTF(Column, OutputColumn)') def fahrenheit2celsius(inp, out): size = len(inp) for i in range(size): out[i] = (inp[i] - 32) * 5 / 9 return 5 ``` Although the function returns 5 rows, it does not mean that only 5 rows are going to be processed by the function. If `size<5`, the output is padded with the value 0. And if `size>5`, the function will still iterate on all the rows while just returning the first 5 elements. If the number of rows in the output table from a UDTF needs to be adapted at runtime, the function `set_output_row_size` from the module `rbc.externals.heavydb` is required. The function must be called before any assignment on output columns. ```python from rbc.externals.heavydb import set_output_row_size @heavy('UDTF(Column, OutputColumn)') def fahrenheit2celsius(inp, output): size = ... set_output_row_size(size) ... return size ``` While the return value from a UDTF controls the number of rows in the output table, there are no restrictions on the assumed number of rows in the corresponding input table. The whole column—again, up to `int32`rows—will be loaded whenever the function executes. As with any SQL function, limits on number of rows in tables associated with a UDTF can be set using SQL keywords like `LIMIT` or `WHERE`. By default, a UDTF that has a variable number of rows in the output table is not thread-safe. To work around this constraint, use a [`TableFunctionManager`](https://app.gitbook.com/o/-M6jfCWx41RhOdl5oOoZ/s/-M6kHkWH17KJKscL0Cc4-887967055/~/changes/tDREZCfSeL3KCVPGY8Zg/udtf/remote-backend-compiler-rbc#undefined). ## Cursors In SQL, *cursors* are used to declare temporary memory for storing database tables. In particular, UDTFs use cursors as inputs. Here is a SQL request using a UDTF: ```sql select * from table(fahrenheit2celsius(cursor(select col from table))) ``` You can also define the signature with the cursor made explicit in the previous UDTF as follows: ```python @heavy('UDTF(Cursor(Column), OutputColumn)') def fahrenheit2celsius(inp, output): ... ``` For convenience, when a single cursor is used, you do not need to specify cursors in the definition of the UDTF. When multiple cursors are needed in a SQL query, including the literal `Cursor` in the UDTF definition as shown above is required. ## Table Function Manager Using the argument `TableFunctionManager` in the signature of a UDTF enables parallel execution of table functions. Without this argument, table functions are executed on a single thread; more importantly, the execution is not thread-safe. To enable threaded execution, the function signature the extra argument for the `TableFunctionManager` and the function `set_output_row_size` must be called on the manager to ensure thread safety. ```python from rbc.externals.heavydb import set_output_row_size @heavy('UDTF(TableFunctionManager, Column, OutputColumn)') def fahrenheit2celsius(mgr, inp, output): ... mgr.set_output_row_size(size) return size ``` ### Column Lists Instead of declaring a parameter per column, it is possible to group columns into a list using `ColumnList`. In the following example, the mean over each column is returned. It's possible to have multiple `ColumnList` parameters. Two helper attributes are available to get the number of rows and column, respectively `ColumnList.nrows` and `ColumnList.ncols`. ```python @heavy('UDTF(ColumnList, OutputColumn)') def fahrenheit2celsius(inp, out): ncols = inp.ncols nrows = inp.nrows set_output_row_size(ncols) for i in range(ncols): col = inp[i] out[i] = 0. for j in range(nrows): out[i] += col[j] out[i] /= nrows out[i] = (out[i] - 32) * 5 / 9 return ncols ``` ## Supported Functions ### Python Grammar The package RBC makes use of the Python [Numba](https://numba.readthedocs.io) compiler internally. As a result, RBC inherits some limitations in syntax and features from Numba. Specifically, the [nopython](https://numba.readthedocs.io/en/stable/user/5minguide.html?highlight=nopython#what-is-nopython-mode) mode of Numba is used which means that certain Python objects or class constructions have no—or, at best have limited—support. This includes—and is not limited to—list comprehensions, slicing or complex indexing (e.g., `[`:`]`, `[-1]`, `[1:6]`, `[::2]`). When using functions, a common pitfall is to have type mismatch errors. Casting rules are less forgiving than in Python and types have to be carefully handled. ### NumPy and Others The list of supported functions is always growing. Most functions are overwritten versions of functions from NumPy or the builtin `math` module. These functions are defined in `rbc.stdlib`, so, to get the full list of supported functions, inspect that module: ```python from rbc import stdlib print(stdlib.__all__) print(stdlib.array_api.__all__) ``` ### Numba Because RBC internally makes use of [Numba](https://numba.readthedocs.io/), RBC also supports the usage of Numba functions within RBC functions. For example, the function `fahrenheit2celsius_numba` embedded within `fahrenheit2celsius` has been decorated with `numba.njit` when `fahrenheit2celsius` is defined. ```python from numba import njit @njit def fahrenheit2celsius_numba(f): return (f - 32) * 5 / 9 @heavy('double(double)') def fahrenheit2celsius(f): return fahrenheit2celsius_numba(f) ``` ### External The module `rbc.external` describes functions known to the server. Those functions on the server can be used when constructing new UDFs or UDTFs by using the function `rbc.external.external`. In the following example, `log2` is a function which is known on by the database server. To use `log2` with a UDF or a UDTF defined using RBC, it needs to be typed using a C-like syntax similar to the one used when decorating functions for RBC. ```c "output_type function(input_types)" ``` ```python from rbc.external import external log2 = external("double log2(double)") @heavy("double(double)") def log2_heavy(x): return log2(x) ```