Compute MD5
In this example, we will see how ETL can be used for something as simple as computing the MD5 checksum of an object. We will go over three ways of starting an ETL to achieve this goal. Get ready!
Note: ETL is still in development, so some steps may not work exactly as written below.
Prerequisites
- AIStore cluster deployed on Kubernetes. We recommend following this guide: Deploy AIStore on local Kubernetes cluster
Prepare ETL
To showcase ETL’s capabilities, we will go over a simple ETL container that computes the MD5 checksum of an object. There are three ways of approaching this problem:
**Simplified flow**

In this example, we will be using the `python3.11v2` runtime. In the simplified flow, we are only expected to write a simple `transform` function (`code.py`). The `transform` function must take bytes as an argument (the object’s content) and return the output bytes that will be saved in the transformed object.

Once we have the `transform` function defined, we can use the CLI to build and initialize the ETL.
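A minimal `code.py` for the simplified flow could look like the following sketch (the `transform` name is what the runtime expects; the rest is ordinary Python):

```python
import hashlib

def transform(input_bytes: bytes) -> bytes:
    # Receive the object's content as bytes and return the bytes
    # to be saved as the transformed object: here, the MD5 hex digest.
    return hashlib.md5(input_bytes).hexdigest().encode()
```

The ETL can then be built and initialized with the CLI, along the lines of `ais etl init code --from-file=code.py --runtime=python3.11v2 --name=compute-md5` (flag names may differ between AIStore versions).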
**Simplified flow with input/output**

Similar to the above example, we will be using the `python3.11v2` runtime. However, the Python code in this case reads the data from standard input and writes the output bytes to standard output (`code.py`). We can then use the CLI to build and initialize the ETL with the `io://` communicator type.
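For the input/output variant, `code.py` is a plain script that streams stdin to stdout. The `md5_stream` helper below is our own addition for clarity, not part of any required interface:

```python
import hashlib
import sys

def md5_stream(reader) -> bytes:
    # Hash the stream in fixed-size chunks so that large objects
    # are never loaded into memory all at once.
    md5 = hashlib.md5()
    for chunk in iter(lambda: reader.read(64 * 1024), b""):
        md5.update(chunk)
    return md5.hexdigest().encode()

if __name__ == "__main__":
    # The object's content arrives on stdin; the transformed
    # object (the MD5 hex digest) is written to stdout.
    sys.stdout.buffer.write(md5_stream(sys.stdin.buffer))
```

Initializing then uses the `io://` communicator, e.g. `ais etl init code --from-file=code.py --runtime=python3.11v2 --comm-type=io` (again, exact flags may vary by AIStore version).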
**Regular flow**

First, we need to write a server. In this case, we will write a Python 3 HTTP server (`server.py`). Once we have a server that computes the MD5, we need to create an image out of it. For that, we write a `Dockerfile`, build the image, and publish it to some Docker registry so that our Kubernetes cluster can pull the image later. In this example, we will use the docker.io Docker Registry.
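The `Dockerfile` can be as small as the following sketch (base image, paths, and the `server.serve()` entry point are assumptions):

```dockerfile
FROM python:3.11-slim
WORKDIR /
COPY server.py server.py
EXPOSE 80
# Start the HTTP server on the port the Pod spec will expose
ENTRYPOINT ["python", "-c", "import server; server.serve()"]
```

Build and publish it with, e.g., `docker build -t docker.io/<your-username>/transformer_md5 .` followed by `docker push docker.io/<your-username>/transformer_md5`.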
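As a sketch of what `server.py` might contain, assuming the `hpush://` style of communication, in which the target PUTs the object’s bytes and reads back the transformed bytes (the handler and function names are our own):

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

class MD5Handler(BaseHTTPRequestHandler):
    def do_PUT(self):
        # The target sends the object's content in the request body;
        # respond with its MD5 hex digest as the transformed object.
        length = int(self.headers.get("Content-Length", 0))
        digest = hashlib.md5(self.rfile.read(length)).hexdigest().encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(digest)))
        self.end_headers()
        self.wfile.write(digest)

def serve(port: int = 80) -> None:
    # The port must match ports.containerPort in the Pod spec.
    HTTPServer(("0.0.0.0", port), MD5Handler).serve_forever()
```

In the container, the entry point would simply call `serve()`.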
The next step is to create the spec of a Pod that will be run on Kubernetes (`spec.yaml`).

Important: the server must listen on the same port as specified in `ports.containerPort`. This is required, as a target needs to know the precise socket address of the ETL container.

Once we have our `spec.yaml`, we can initialize the ETL with the CLI.
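A `spec.yaml` along these lines could look like the following sketch (the Pod and image names are assumptions; the essential part is that `containerPort` matches the port the server listens on):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: transformer-md5
spec:
  containers:
    - name: server
      image: docker.io/<your-username>/transformer_md5:latest
      ports:
        - name: default
          containerPort: 80
```

With the spec in place, the ETL can be initialized with something like `ais etl init spec --from-file=spec.yaml --name=compute-md5` (the exact command shape depends on the AIStore CLI version).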
Just before starting the ETL containers, we can list the Pods (for example, with `kubectl get pods`) and see that the cluster is running with one proxy and two targets. After we initialize the ETL, we expect two more Pods to be started (#targets == #etl_containers). And indeed, two more Pods come up and run, one for each target.
ETL containers will be run on the same node as the targets that started them. In other words, each ETL container runs close to data and does not generate any extract-transform-load related network traffic. Given that there are as many ETL containers as storage nodes (one container per target) and that all ETL containers run in parallel, the cumulative “transformation” bandwidth scales proportionally to the number of storage nodes and disks.
Finally, we can use the newly created Pods to transform objects on the fly for us, for example with `ais etl object <etl-name> transform/shard.in -` (the exact command shape may differ between AIStore CLI versions).

Voilà! The ETL container successfully computed the MD5 of the `transform/shard.in` object.
Alternatively, one can use the offline ETL feature to transform the whole bucket.
Once the ETL isn’t needed anymore, the Pods can be stopped with `ais etl stop <etl-name>`.