---
description: >-
  Explore large datasets in HEAVY.AI with the full power of SQL, with a
  pandas-like API
---

# Ibis

The Ibis open source [project documentation](http://ibis-project.org/docs/index.html) is the best place to learn more and [stay up to date](http://ibis-project.org/docs/release/index.html).

Ibis allows you to explore data in a variety of remote storage systems, with an API inspired by the [pandas](https://pandas.pydata.org/) toolkit. You can use Ibis to interact directly with HeavyDB and several other supported SQL systems by writing high-level Python code instead of lower-level SQL. Using familiar tools can increase productivity.

The following is a summary of what Ibis provides, directly from the [Ibis website](http://ibis-project.org):

> * ***Full coverage of SQL features**: You can code in Ibis anything you can implement in a SQL SELECT*
> * ***Transparent to SQL implementation differences**: Write standard code that translates to any SQL syntax*
> * ***High performance execution**: Execute at the speed of your backend, not your local computer*
> * ***Integration with community data formats and tools** (e.g. pandas, Parquet, Avro...)*

Ibis supports [several execution backends](https://ibis-project.org/docs/3.0.2/), including common SQL databases and processing engines, so you can author analytical workflows or operations once and run them consistently against any of these backends. You can also create complex workflows that simultaneously materialize data from multiple backends and combine the resulting outputs in various ways. Supported backends include PySpark, Google BigQuery, and PostgreSQL.

## Install the ibis-heavyai Backend

You can install [ibis-heavyai](https://github.com/heavyai/ibis-heavyai) using `conda` (recommended) or `mamba`, which is a faster implementation of `conda`. For more information, see [the Mamba documentation](https://mamba.readthedocs.io/en/latest/index.html).

```bash
conda install -c conda-forge ibis-heavyai

# or, to install ibis-heavyai into a new environment, run
conda create -n ibis -c conda-forge ibis-heavyai
conda activate ibis

# check that ibis-heavyai installed successfully
python -c 'import ibis; ibis.heavyai.version'
```

You can also use `pip` for package management:

```bash
pip install ibis-heavyai

# check that ibis-heavyai installed successfully
python -c 'import ibis; ibis.heavyai.version'
```

## Ibis and HeavyDB

This short example demonstrates Ibis. Inside a notebook cell, you can first connect to a HeavyDB database instance:

```python
import ibis

con = ibis.heavyai.connect(
    host='localhost',
    database='heavyai',
    user='admin',
    password='HyperInteractive'
)
```

If you are launching JupyterLab from Immerse, this connection is already set up for you based on your Immerse login credentials. The connection object has [several utility methods](http://ibis-project.org/docs/api.html#omniscidb-client) that can come in handy for listing database objects, creating tables, loading data, and so on.

Next, identify a table and define a simple Ibis expression against it:

```python
tablename = 'github'
t = con.table(tablename)
count_expr = t.count()
count_expr.compile()
```

Ibis compiles this expression into SQL that runs against HeavyDB:

```sql
SELECT count(*) AS "count"
FROM github
```

When executed, this SQL counts the rows of the table, as expected:

```python
print(f'Count of records in table {tablename} is {count_expr.execute()}')
```

```
Count of records in table github is 2033049065
```

You can also write raw SQL in Ibis: a SQL query can be wrapped as an Ibis expression and composed together with other Ibis expressions, as sketched below.
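The following is a minimal sketch of wrapping raw SQL, assuming the connection exposes the standard Ibis `sql()` method; the `repo_name` column is a hypothetical field of the `github` table, used purely for illustration.

```python
# A minimal sketch, assuming the connection `con` from the example above
# supports the standard Ibis sql() method; `repo_name` is a hypothetical
# column used only for illustration.
raw = con.sql('SELECT repo_name, count(*) AS n FROM github GROUP BY repo_name')

# The wrapped query behaves like any other Ibis table expression,
# so it composes with further Ibis operations before execution.
top_repos = raw.sort_by(ibis.desc('n')).limit(10)
top_repos.execute()
```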
Ibis offers a powerful way to [compose sophisticated analytical expressions](http://ibis-project.org/docs/user_guide/sql.html), and then leverage the processing power and scale of supported storage backends to evaluate them and return results.

The following shows a more complex Ibis expression and the SQL it generates. You can leverage a full programming language (Python) to assemble complex expressions that cover SQL features such as joins, filters, and aggregates.

```python
t = con.table('upstream_reservoir')  # the table used in this example

well_list = ['4201334812', '4201335379', '4201335112', '4201334947',
             '4201334329', '4201334965', '4201334978', '4201334998',
             '4201335467', '4201334917', '4201334930', '4212331826',
             '4212331164', '4212300017', '4212331782', '4212332076',
             '4212331839', '4228532777', '4228533451', '4212332354',
             '4212332207', '4212330405', '4212332014']

# Find oil well production for the specific well list above
sum_oil = t.oil_monthly.sum().log10().name('oil_monthly')
expr = (
    t.filter([t.api_no_10.isin(well_list), t.oil_monthly.notnull()])
    .group_by(['date_prod', 'api_no_10'])
    .having(sum_oil >= 0.)
    .aggregate(sum_oil)
    .sort_by('date_prod')
)

# Find the maximum production month per well in the well set
max_oil = expr.oil_monthly.max().name('oil_monthly')
expr1 = expr.group_by('api_no_10').aggregate(max_oil)
expr2 = expr.join(
    expr1,
    [expr.api_no_10 == expr1.api_no_10,
     expr.oil_monthly == expr1.oil_monthly],
)[
    expr.api_no_10.name('api'),
    expr.oil_monthly.name('om2'),
    expr.date_prod.name('dp2'),
]
print(expr2.compile())
```

This results in automatically generated SQL:

```sql
WITH t0 AS (
  SELECT "date_prod", "api_no_10", log10(sum("oil_monthly")) AS oil_monthly
  FROM upstream_reservoir
  WHERE ("api_no_10" IN ('4201334998', '4212331782', '4212331826', '4201334965',
                         '4212300017', '4201334329', '4228533451', '4212331164',
                         '4201335467', '4201334930', '4212330405', '4201334947',
                         '4212332354', '4201334978', '4201334917', '4201335379',
                         '4201334812', '4212332207', '4228532777', '4201335112',
                         '4212332076', '4212331839', '4212332014')) AND
        ("oil_monthly" IS NOT NULL)
  GROUP BY date_prod, api_no_10
  HAVING log10(sum("oil_monthly")) >= 0.0
  ORDER BY "date_prod"
)
SELECT t0."api_no_10" AS api, t0."oil_monthly" AS om2, t0."date_prod" AS dp2
FROM t0
  JOIN (
    SELECT "api_no_10", max("oil_monthly") AS oil_monthly
    FROM t0
    GROUP BY api_no_10
  ) t1
    ON (t0."api_no_10" = t1."api_no_10") AND
       (t0."oil_monthly" = t1."oil_monthly")
```

The result of the evaluation is, by default, a pandas DataFrame, making it convenient to use Ibis inside other Python data analysis workflows.

```
expr2.execute()
```

|   | api        | om2      | dp2        |
| - | ---------- | -------- | ---------- |
| 0 | 4201335379 | 3.944927 | 2016-11-01 |
| 1 | 4201335467 | 4.125253 | 2017-12-01 |
| 2 | 4201334998 | 3.822887 | 2014-09-01 |
| 3 | 4201334930 | 3.698709 | 2014-05-01 |
| 4 | 4201334917 | 4.353243 | 2014-06-01 |
| 5 | 4201334812 | 4.127299 | 2014-03-01 |
| 6 | 4201334947 | 3.715502 | 2014-06-01 |
| 7 | 4201334978 | 3.822887 | 2014-09-01 |
| 8 | 4201335112 | 4.010766 | 2015-01-01 |
| 9 | 4201334329 | 3.721068 | 2015-07-01 |

Although Ibis can output to pandas, the result is first materialized and then transferred over the network to the client running Ibis (such as a Jupyter notebook in a browser on a laptop). Avoid executing expressions that try to move millions of rows to the client; bound the result size first, as sketched below.
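A minimal sketch of bounding result size before execution; the 1,000-row cap is an arbitrary illustrative value.

```python
# A minimal sketch: cap the number of rows materialized on the client.
# The 1,000-row limit is an arbitrary illustrative value.
preview = expr2.limit(1000)  # Ibis composes the LIMIT into the generated SQL
df = preview.execute()       # pandas DataFrame with at most 1,000 rows
```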
## Apache Arrow Support

Ibis depends on [heavyai](https://heavyai.readthedocs.io/en/latest/) as a low-level connector, and can leverage some of its key features, including output of query results in the Arrow format via the `select_ipc` and `select_ipc_gpu` calls. Building on this capability, the HeavyDB backend for Ibis can output query results using Apache Arrow to CPU or GPU memory (the latter by using the [cudf](https://github.com/rapidsai/cudf) dataframe library from NVIDIA). This makes it convenient to use with GPU-based RAPIDS machine learning methods.

Output to cudf is as simple as passing a couple of extra parameters to the `execute` method in Ibis, for example:

```python
gdf = get_ts_by_period_station(zip_cd=94568).execute(gpu_device=3, ipc=True)
```

See the [RAPIDS example](/python-data-science/introduction-to-ibis) for how you can build a workflow integrating HEAVY.AI and RAPIDS. RAPIDS is currently usable only with supported NVIDIA GPUs and requires CUDA.

## Geospatial Operations

The HeavyDB backend for Ibis supports geospatial functions, building directly on HeavyDB [SQL support](/sql/data-manipulation-dml/geospatial-capabilities) for these functions. These operations produce [geopandas](https://geopandas.org) dataframes, allowing their use in Python-based geospatial analytics workflows. Here is an example of how a table with geospatial data in HEAVY.AI can output directly to a geopandas dataframe:

```python
heavyai_cli = ibis.heavyai.connect(
    host='metis.heavyai.com',
    user='heavyai',
    password='HyperInteractive',
    port=443,
    database='heavyai',
    protocol='https'
)

t = heavyai_cli.table('zipcodes_2017')
df = t.head().execute()
type(df)
# geopandas.geodataframe.GeoDataFrame
```

## User-Defined Functions

Using Ibis, you can create and use user-defined functions (UDFs) written in Python that execute inside HeavyDB. This UDF framework leverages integration with Numba, a Just-in-Time (JIT) compiler for Python, and produces lower-level code for a more performant execution path than Python itself. It also makes it easy to author UDFs in Python and then make them usable in a SQL workflow. A minimal sketch appears at the end of this page.

## Additional Resources

See the following Ibis documentation to get started:

* The Ibis User Guide, in particular the [Ibis for SQL programmers](https://ibis-project.org/docs/3.0.2/ibis-for-sql-programmers/) section, which shows how Ibis simplifies workflows involving analytical SQL.
* The extensive [tutorials](https://ibis-project.org/docs/3.0.2/tutorial/01-Introduction-to-Ibis/).
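As mentioned in the User-Defined Functions section above, UDFs are authored in Python and compiled through the Numba toolchain. The following is a minimal sketch using [RBC](https://github.com/xnd-project/rbc), the Numba-based remote backend compiler underlying this integration. The connection parameters and the `incr` function are illustrative assumptions rather than a definitive recipe; see the ibis-heavyai and RBC documentation for the authoritative API.

```python
# A minimal sketch, assuming the RBC package and its rbc.heavydb module;
# connection parameters and the example function are illustrative only.
from rbc.heavydb import RemoteHeavyDB

heavydb = RemoteHeavyDB(
    user='admin', password='HyperInteractive',
    host='localhost', port=6274, dbname='heavyai'
)

# The signature string tells the Numba-based compiler what to build:
# a function taking a double and returning a double.
@heavydb('double(double)')
def incr(x):
    return x + 1.0

heavydb.register()  # push the compiled UDF to the HeavyDB server

# Once registered, the UDF can be referenced from SQL, e.g.:
#   SELECT incr(oil_monthly) FROM upstream_reservoir LIMIT 5;
```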