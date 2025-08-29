



XLA is an optimizing graph compiler for TensorFlow. It optimizes parts of the TensorFlow GraphDef in an attempt to improve performance.



Unlike native TensorFlow, which executes GraphDef nodes one at a time, XLA considers many GraphDef nodes at once and generates optimized code for these nodes.



In many cases, XLA improves performance over native TensorFlow. The major difference between the two is the fusion optimizer in XLA. Instead of executing many small kernels back to back, XLA fuses them into larger kernels. This greatly reduces the execution time of bandwidth-bound kernels. XLA also offers many algebraic simplifications, far superior to what TensorFlow offers.

XLA can be enabled in a few ways:

Opt in for specific parts of the model, referred to as manual clustering.

tf.function is one way for users to control which parts of a model to optimize with XLA. See the TF 1.X and TF 2.X documentation for how to use tf.function.

Alternatively, a so-called jit_scope can be used to control which parts of a model to optimize with XLA. See the TF 1.X and TF 2.X documentation for how to use a jit_scope. Be advised that this is merely a hint, as unsupported nodes are not compiled.

Make TensorFlow decide which parts of the model to optimize, referred to as auto clustering.

Opt in at the source level. For TF 1.X, users can opt in to auto clustering at the session level by adding the following line of code to change the session configuration argument:

config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

For TF 2.X, users can opt in to auto clustering at the global level by adding the following line of code at the beginning of the script:

tf.config.optimizer.set_jit(True)

Opt in through environment variable. Users can opt in to auto clustering, where TensorFlow decides which nodes go to XLA, without any source code changes by setting the following environment variable:

TF_XLA_FLAGS=--tf_xla_auto_jit=1

This environment variable works for both TF 1.X and TF 2.X. For models that have XLA enabled through the session config, XLA can be disabled with:

TF_XLA_FLAGS=--tf_xla_auto_jit=-1

The environment variable takes precedence over the global_jit_level and the jit_scope, but not over tf.function. This document focuses primarily on auto clustering. Most of it applies to manual clustering as well.
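As a minimal sketch of the environment-variable route, the following Python helper builds a TF_XLA_FLAGS value and launches a model script in a child process. The helper names xla_env and run_with_xla and the script name train.py are hypothetical; only the --tf_xla_auto_jit flag and its values come from the text above.

```python
import os
import subprocess
import sys


def xla_env(auto_jit, base_env=None):
    """Return a copy of the environment with auto clustering configured.

    auto_jit is passed through verbatim, e.g. 1 (on), -1 (off) or "fusible".
    """
    env = dict(os.environ if base_env is None else base_env)
    # TF_XLA_FLAGS holds a space separated list of options; keep the other
    # options that are already set and replace any earlier auto_jit setting.
    flags = [f for f in env.get("TF_XLA_FLAGS", "").split()
             if not f.startswith("--tf_xla_auto_jit=")]
    flags.append(f"--tf_xla_auto_jit={auto_jit}")
    env["TF_XLA_FLAGS"] = " ".join(flags)
    return env


def run_with_xla(script="train.py", auto_jit=1):
    # "train.py" is a placeholder for the user's own model script.
    return subprocess.run([sys.executable, script],
                          env=xla_env(auto_jit), check=True)
```

Setting the variable on the child process keeps the model source unchanged, which is the point of this opt-in route.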



TensorFlow autoclustering clusters operations based on an allow-list.

By default, this list is quite extensive. This can result in large clusters which can result in loss of performance or excessive memory usage (see below). To mitigate such problems, users can make TensorFlow only cluster “small” and fusible operations such as pointwise and reduction operations, by setting:

TF_XLA_FLAGS=--tf_xla_auto_jit=fusible

Note that since models always have matrix multiplications or convolutions, this results in more, but smaller clusters.

Using XLA incurs two different costs when comparing against a native TensorFlow execution:

Compilation time. Code must be compiled at runtime. How long this takes depends on the size of the generated clusters and the number of times each cluster is compiled (once for every shape instance). This time might not be recoverable during execution.

Execution time overhead due to the TF-XLA interface. The way that XLA integrates back into TensorFlow adds a cost to the graph execution.

Both of these are discussed in more depth throughout the document. These caveats are mentioned here to highlight that XLA is not a silver bullet that speeds up all models under all scenarios. Short running scripts, on small batch sizes, are typically not good candidates for XLA. Models that exhibit dynamic shapes are another class of models that often don’t do well when using XLA.
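Whether compilation time is recoverable can be estimated with simple arithmetic. The sketch below is an illustration only: the function name break_even_steps is hypothetical and the numbers are made up, not measurements.

```python
import math


def break_even_steps(compile_ms, native_step_ms, xla_step_ms):
    """Number of steps after which compilation time has been recovered.

    Returns None when the XLA step is not faster than the native step,
    in which case the compilation time is never recovered.
    """
    saving_per_step = native_step_ms - xla_step_ms
    if saving_per_step <= 0:
        return None
    return math.ceil(compile_ms / saving_per_step)


# Made-up numbers: 60 s of total compilation, 100 ms native step, 80 ms XLA step.
print(break_even_steps(60_000, 100, 80))  # → 3000
```

A script that runs for fewer steps than this estimate ends up slower with XLA, even though each individual step is faster.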

Furthermore, the overhead incurred from using XLA is hard to pinpoint. Some overhead occurs at the beginning of the model execution, some occurs throughout the execution at a fixed cost, and some occurs irregularly during execution. It is important to be aware of this when instrumenting the code to get timing information from a model execution.
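One way to account for this when instrumenting a model is to time every step and report the warm-up steps separately from the steady state. The following is a minimal sketch; time_steps and its parameters are hypothetical names, and step_fn is assumed to block until the step's results are actually available.

```python
import time
from statistics import median


def time_steps(step_fn, num_steps, warmup_steps=10):
    """Time each call to step_fn; report warm-up and steady state separately."""
    durations = []
    for _ in range(num_steps):
        start = time.perf_counter()
        step_fn()
        durations.append(time.perf_counter() - start)
    warmup, steady = durations[:warmup_steps], durations[warmup_steps:]
    return {
        # Up-front overhead tends to land in the warm-up steps.
        "warmup_total": sum(warmup),
        # The median is robust against occasional slow steps.
        "steady_median": median(steady) if steady else None,
        # A maximum far above the median hints at irregular overhead.
        "steady_max": max(steady) if steady else None,
    }
```

Reporting only an average over all steps would mix the three kinds of overhead together.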

Besides these performance caveats, XLA sometimes causes an out-of-memory error during execution. This is discussed in more detail later as well.

This section describes how XLA currently changes the executed graph, and how it affects the execution of the graph.

This helps developers understand how TensorFlow interfaces with XLA, and how this affects performance. This section, as well as the remainder of this document, assumes the reader to have a rudimentary knowledge of TensorFlow GraphDef, and how the TensorFlow stream executor executes it.



When auto clustering is enabled, a part of the graph is chosen to be compiled with XLA. This part (G) has a set of inputs (I) and a set of outputs (O).

When XLA is enabled, this graph is transformed into the following graph for execution:

The part (G) has been clustered into a cluster (C). Be aware that C is merely a cluster of operations and has not yet been compiled. It cannot be compiled yet, as the actual shapes of the input and output tensors are not available until execution time.

When auto clustering, each part of the model that is clustered is replaced with a graph depicted above, referring to a cluster that represents that part. Note that the original GraphDef representation of the graph is still there to serve as a fallback path. The merge nodes (in the actual graph, there is one merge node per output) forward the outputs from either the native TensorFlow execution or the XLA execution.

During the graph execution, when the above graph is reached, the following happens for each unique shape instance of the set of inputs.

The first two times the _XlaCompile node is executed, it falls back to the original GraphDef execution of G.

The third time the _XlaCompile node is executed, cluster C is compiled into an XLA binary. After the compilation has finished, the binary is stored in a cache, accessible through a key that encodes the cluster, and the shapes and types of the inputs. This key is passed to the _XlaRun node, and the binary is executed. If compilation of the cluster took longer than 30 seconds, it will not be compiled for any other shape instance in the future.

Any subsequent time the _XlaCompile node is executed, compilation is skipped and the key of the cached binary is passed to the _XlaRun node to execute.

Note that in order for the _XlaCompile node to execute, all inputs (I) must be ready. Similarly, none of the outputs (O) are ready until either G or _XlaRun is done executing.

Note that the above steps are for each unique shape instance. If the shape of any input changes throughout the model execution, recompilation of the cluster happens.
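The per-shape behavior described above can be modeled in a few lines of Python. This is a toy model for building intuition, not TensorFlow's implementation: the class LazyClusterModel, its attributes, and the injected compile_seconds parameter are all hypothetical.

```python
class LazyClusterModel:
    """Toy model of the per-shape compile-or-fallback decision."""

    def __init__(self, fallback_runs=2, slow_compile_seconds=30.0):
        self.fallback_runs = fallback_runs
        self.slow_compile_seconds = slow_compile_seconds
        self.seen = {}        # signature -> number of executions so far
        self.cache = set()    # signatures that have a compiled binary
        self.compilation_disabled = False

    def execute(self, shapes, dtypes, compile_seconds=1.0):
        """Return which path runs: "fallback", "compile+run" or "run"."""
        signature = (tuple(shapes), tuple(dtypes))
        if signature in self.cache:
            return "run"  # cached binary, compilation is skipped
        count = self.seen.get(signature, 0) + 1
        self.seen[signature] = count
        if count <= self.fallback_runs or self.compilation_disabled:
            return "fallback"
        self.cache.add(signature)
        if compile_seconds > self.slow_compile_seconds:
            # A slow compilation stops compilation for other shape instances.
            self.compilation_disabled = True
        return "compile+run"


model = LazyClusterModel()
print([model.execute([(32, 128)], ["float32"]) for _ in range(4)])
# → ['fallback', 'fallback', 'compile+run', 'run']
```

Calling execute with a new shape, for example [(64, 128)], starts again at the fallback path, which is why models with varying input shapes keep paying for compilation.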

This section mentions the most common issues observed when using XLA, and why these can happen. The next section focuses on how to address them.



The most common functional issues fall into two categories:

Different output. XLA is a compiler that performs semantically equivalent transformations only. Since these transformations are on floating point tensors, the outcome is most likely (slightly) different compared to the native TensorFlow execution. Developers should be aware of this.

Out of memory. XLA, by nature of its design, increases the amount of memory needed to execute. As mentioned earlier, all inputs must be ready before execution, and all outputs are ready at the same time after execution. This means memory for all inputs and outputs must be allocated for the execution. Furthermore, all memory required to hold the intermediate tensors in the XLA binary must be allocated.
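The "different output" point follows from floating point arithmetic not being associative: a transformation that is mathematically equivalent can still change the rounding. The plain Python sketch below illustrates the effect with scalars; it does not involve XLA itself.

```python
import math

a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # one evaluation order
right = a + (b + c)   # a mathematically equivalent order

print(left == right)                            # → False
print(math.isclose(left, right, rel_tol=1e-9))  # → True
```

When validating an XLA run against native TensorFlow, comparing with a tolerance rather than for exact equality is therefore the more meaningful check.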

The most common performance issues fall into four categories:

TF-XLA integration. The way that TF interfaces with XLA (as described in 9.2.1) comes at a performance cost, partly due to the extra nodes inserted in the graph.

Compile time overhead. When a model has input data of varying shapes (for example sentences for an NLP model), compilation overhead can add up. Large clusters contribute to this as well.

Compute/communication overlap. XLA requires all inputs to be ready before execution, and the outputs are ready when the execution finishes. This synchronization can contribute to a performance loss when compared to TensorFlow. TensorFlow has a fine-grained execution model that allows for overlap of copy and computation, which is not possible when using XLA.

XLA optimization and code generation. Sometimes the XLA optimizer or code generator doesn't do as well as it could. In our experience this is the least common scenario when dealing with performance regressions.

For developers, XLA is mostly a black box. This section sheds some light on how to identify the symptoms, and take appropriate action in an attempt to solve these issues.



XLA exposes many ways to either retrieve information from it, or control the behavior of it. These options are not (well) documented, and can even change from release to release (behavior, format, and even existence). Even then, they offer lots of insights into XLA’s innards. They come in two classes:

TF_XLA_FLAGS environment variable. This variable can be set to options that control the TF-XLA boundary. It can be set to a space separated list of options, which can be found in tensorflow/compiler/jit/flags.cc.

XLA_FLAGS environment variable. This variable can be set to options that control the XLA optimizer and code generator. It can be set to a space separated list of options, which can be found in tensorflow/compiler/xla/debug_options_flags.cc.

Most of the options in these source files are primarily used by XLA developers and go beyond the scope of this document. Some are useful to model developers, and are mentioned below.

An out-of-memory issue is very easy to identify, as it is reported by the TensorFlow allocator. The reason for out-of-memory has been described in Functional Issues. When this happens, run the model with this extra parameter:

TF_XLA_FLAGS=--tf_xla_always_defer_compilation=true

This option instructs TensorFlow to never compile a cluster, but always execute the fallback path. If this still results in an out-of-memory error, the problem is that the inputs ( I ) and outputs ( O ) don’t all fit in memory simultaneously. If the execution succeeds, the problem is that the inputs ( I ), outputs ( O ) and all intermediate tensors in the XLA binary don’t all fit in memory simultaneously.

In both cases, the only guaranteed way to address an out-of-memory issue in XLA is to reduce the number of operations in a cluster. The number of operations (in)directly affects the number of inputs, outputs, and intermediate tensors.

One way to reduce the cluster sizes is by enabling XLA-Lite (--tf_xla_auto_jit=fusible). This pretty much guarantees that the issue disappears, as the cluster sizes are reduced rather drastically.

Another way is to limit the maximum size of each cluster. By default, there is no upper bound to the size of a cluster. The sizes of clusters can be retrieved with:

TF_CPP_VMODULE=mark_for_compilation_pass=2

Look for occurrences of " *** Clustering info for graph" to get the sizes of the generated clusters.

When running with:

TF_XLA_FLAGS=--tf_xla_max_cluster_size=<n>

where <n> is smaller than the largest cluster size, an upper bound can be found that avoids the out-of-memory.
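Finding such an upper bound by hand can take many runs, so a binary search is one way to organize them. The sketch below is hypothetical: fits_in_memory stands for a user-supplied callback that runs the model with --tf_xla_max_cluster_size=n and reports whether it completed without an out-of-memory error, and the search assumes that if a bound works, every smaller bound works too.

```python
def largest_safe_cluster_size(largest_cluster, fits_in_memory):
    """Binary search for the largest cluster size bound that avoids OOM.

    largest_cluster is the size of the largest cluster reported in the
    clustering info; fits_in_memory(n) is a hypothetical callback.
    Returns None when no bound below largest_cluster avoids the OOM.
    """
    low, high = 1, largest_cluster - 1
    best = None
    while low <= high:
        mid = (low + high) // 2
        if fits_in_memory(mid):
            best, low = mid, mid + 1   # works, try a larger bound
        else:
            high = mid - 1             # OOM, try a smaller bound
    return best
```

Each probe is a full model run, so in practice it may be enough to stop at the first bound that works rather than searching for the largest one.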

Reducing the mini-batch size is another way to possibly avoid an out-of-memory issue. This requires no changes to XLA. A smaller mini-batch size, in combination with XLA, can still outperform native TensorFlow.

The TensorFlow stream executor executes many operations in parallel. This increases the memory requirement as more intermediate tensors are alive concurrently. Limiting the amount of parallelism in the stream executor often reduces the amount of memory needed. The amount of parallelism is controlled by specifying the number of threads the executor can use, by setting the following environment variable:

TF_NUM_INTEROP_THREADS=<n>



By default, XLA allocates the required memory for the intermediate tensors in one allocation. Because of memory fragmentation, the TensorFlow allocator may be unable to allocate a contiguous memory buffer even though ample memory is still available.

By setting

XLA_FLAGS=--xla_multiheap_size_constraint_per_heap=<n>

a user can break up the monolithic memory allocation into multiple ones, each with a maximum size of n bytes.

Alternatively, TensorFlow-XLA Integration Issues describes an option that sometimes successfully circumvents memory fragmentation.

When XLA performs worse than native TensorFlow, one of the first things to try is to run it again with:

TF_XLA_FLAGS=--tf_xla_always_defer_compilation=true

If the performance is still worse, the TF-XLA integration is the issue, as this option will run the fallback path without any compilation. In that case, run again with:

TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false

This will completely change the way that TensorFlow interfaces with XLA. Unlike the diagram in Clustering, the graph now looks like:

Note that the fallback path is no longer present. This option often delivers good performance results for nodes that are executed many times for a given shape instance. It avoids the merge nodes and the compilation heuristics, as it must compile on first execution. A direct side effect of this option is a very different GraphDef node execution (a single _XlaRun node is executed instead of many nodes in the fallback path), and therefore a very different usage of the TensorFlow memory allocator. There are scenarios where this option successfully avoids the memory fragmentation issue mentioned before. This is a mere accidental side effect, and not a fix.

A second manner in which the TF-XLA interface can manifest itself as a performance issue is when lots of smaller clusters occur. When this happens, the overhead of the extra nodes in the graph, the compilation time (little as it is), and the XLA executor outweighs the performance gain achieved by the XLA compiler. By default, the lower bound for a cluster size is 4 operations. When running with:
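The two experiments above can be scripted so that the same command is run once per configuration. This is a rough sketch under stated assumptions: auto clustering is assumed to be enabled through the environment variable, the names CONFIGS and time_configs are hypothetical, and wall-clock time of the whole process (including start-up) is used as a coarse measure.

```python
import os
import subprocess
import sys
import time

# Flag combinations taken from the text; the dictionary keys are made up.
CONFIGS = {
    "xla_default": "--tf_xla_auto_jit=1",
    "fallback_only": "--tf_xla_auto_jit=1 --tf_xla_always_defer_compilation=true",
    "no_lazy_compilation": "--tf_xla_auto_jit=1 --tf_xla_enable_lazy_compilation=false",
}


def time_configs(command, configs=CONFIGS):
    """Run command once per configuration and return wall-clock seconds."""
    results = {}
    for name, flags in configs.items():
        env = dict(os.environ, TF_XLA_FLAGS=flags)
        start = time.perf_counter()
        subprocess.run(command, env=env, check=True)
        results[name] = time.perf_counter() - start
    return results
```

A call such as time_configs([sys.executable, "train.py"]), where train.py is a placeholder for the model script, returns one timing per configuration that can be compared against a run without XLA.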

TF_XLA_FLAGS=--tf_xla_min_cluster_size=<n>

where <n> is larger than 4, any cluster with fewer than <n> operations will be ignored, and the graph remains unchanged.

Compile time overhead can cause severe performance degradation, especially in models with varying shapes of input data.

To know how much time is spent on compilation, run the model with:

TF_CPP_VMODULE=xla_compilation_cache=1

This dumps information after each compilation happens. It shows how long the last compilation took, and the accumulated time of all compilations up to that moment. When running with:

TF_CPP_VMODULE=nvptx_compiler=2

it also dumps information regarding the size of the .ptx files and .cub files. Compile time overhead can be reduced by either increasing the lower bound for clusters (when there are many small compilations):

TF_XLA_FLAGS=--tf_xla_min_cluster_size=<n>

or by decreasing the upper bound for cluster sizes when some compilations take too long:

TF_XLA_FLAGS=--tf_xla_max_cluster_size=<n>

XLA can also perform cluster compilations in the background, while execution of the fallback path makes progress. These background compilations take CPU resources away from the main execution pipeline, but in many cases, not blocking the execution while waiting for compilation to finish can lower the end-to-end latency. Asynchronous compilation can be opted into by setting

TF_XLA_FLAGS=--tf_xla_async_compilation=true

Asynchronous compilation is not compatible with disabling of lazy compilation. When lazy compilation is disabled, it takes precedence over asynchronous compilation.

XLA can persistently cache parts of a cluster compilation to disk. This cache can be used to speed up the compilation time of subsequent runs by reusing cached compilation results. Persistent caching is enabled by setting

TF_XLA_FLAGS=--xla_gpu_persistent_cache_dir=<dir_name>

<dir_name> must refer to an existing directory, with read and write permissions. It will be used to store new entries, and retrieve existing entries. The recommended usage is a small cache per model, as all cache entries are read at initialization time and entries not used by the model will therefore be read, but never used. This is an experimental feature, and compatibility of the cache with respect to the TensorFlow version is not guaranteed.
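Because the directory must already exist and be readable and writable, it can help to check this before launching the model. The following is a minimal sketch; the helper names prepare_cache_dir and with_persistent_cache are hypothetical, and only the flag name comes from the text above.

```python
import os


def prepare_cache_dir(path):
    """Create path if needed and verify it is readable and writable."""
    os.makedirs(path, exist_ok=True)
    if not os.access(path, os.R_OK | os.W_OK):
        raise PermissionError(
            f"cache directory {path!r} is not readable and writable")
    return os.path.abspath(path)


def with_persistent_cache(env, path):
    """Return a copy of env with the persistent cache flag appended."""
    env = dict(env)
    flags = env.get("TF_XLA_FLAGS", "").split()
    # The flag list is space separated, so avoid spaces in the path.
    flags.append(f"--xla_gpu_persistent_cache_dir={prepare_cache_dir(path)}")
    env["TF_XLA_FLAGS"] = " ".join(flags)
    return env
```

Using a separate directory per model, for example one named after the model, follows the recommendation of keeping each cache small.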

As mentioned earlier, as a direct side effect of how XLA integrates with TensorFlow, all outputs of an _XlaRun node are ready at the same time.

This means that any consumer of any XLA output must wait until all outputs are produced. This limits the ability of the TensorFlow executor to execute computation and communication in parallel. Inside an XLA cluster, it is often the case that some outputs have already been computed and could be passed on to a consumer. XLA has an optimization referred to as AsyncIO, which allows consumers of XLA outputs to start execution the moment the output is available.

If XLA on one GPU outperforms native TensorFlow, but performs worse on multi-GPU systems, it could be because of the lack of compute and communication overlap. Run TensorFlow with:

TF_XLA_FLAGS=--tf_xla_async_io_level=1

to enable AsyncIO.

In most cases, the TF-XLA interface issues disappear when:



Running long enough. Compilation overhead decreases over time, as most shape instances of the clusters will hit in the cache.

Running with large enough tensors, i.e., a large enough batch size.

Filtering out tiny clusters.

If XLA compiled clusters run slower than the fallback path, the XLA compiler could be the reason for the performance degradation.



By default, XLA performs autotuning over the GEMM and CONV algorithms in cuBLAS and cuDNN. For each GEMM, all possible alternatives in cuBLAS are tried, and the fastest one is picked; similarly for each CONV operation into cuDNN. Autotuning is controlled with a level [0..4]. Level 4 is the default and it performs some functional tests between the different alternatives to catch cuBLAS and cuDNN problems.



Try running with:

XLA_FLAGS=--xla_gpu_autotune_level=2

This does full autotuning, but skips some initialization and comparisons done at level 4. For short running scripts, even disabling autotuning altogether can improve performance:

XLA_FLAGS=--xla_gpu_autotune_level=0

cuDNN Batch Normalization

By default, XLA breaks down the batch normalization layer into many smaller operations that then get optimized by XLA. For training only, XLA has an option to preserve batch normalization as a single operation that targets the cuDNN batch normalization API to execute it. Run with:

XLA_FLAGS=--xla_gpu_use_cudnn_batchnorm_level=2

to use cuDNN to execute the batch normalization layer for both forward and backward layers.

tf.enable_resource_variables()

TensorFlow resource variables are improved versions of TensorFlow variables. XLA does not cluster variables, whereas it does cluster resource variables. Clustering resource variables therefore clusters more GraphDef nodes, including assign nodes. We've seen at least one model in which many assign nodes were all depending on outputs from an XLA cluster. Enabling resource variables made these nodes a part of the XLA cluster, improving the latency, as these assign nodes did not have to wait for the XLA cluster to finish. Enabling resource variables is achieved by adding:

tf.enable_resource_variables()

to the model source before any variables have been created.

--xla_backend_optimization_level

XLA_FLAGS=--xla_backend_optimization_level=0

--xla_gpu_disable_ptxas_optimizations

XLA_FLAGS=--xla_gpu_disable_ptxas_optimizations=true

This section lists the options mentioned throughout this document with a short description.

TF_XLA Options

--tf_xla_auto_jit Controls clustering. 0: off, 1: on, fusible: XLA-Lite

--tf_xla_always_defer_compilation Controls deferred compilation. true: fallback path, false: default strategy

--tf_xla_max_cluster_size Sets upper bound for cluster size

--tf_xla_min_cluster_size Sets lower bound for cluster size

--tf_xla_enable_lazy_compilation Controls lazy compilation. false: always compile, true: default strategy

--tf_xla_async_compilation Controls asynchronous compilation. false: default strategy, true: compile asynchronously. Available as of 20.10

--xla_gpu_persistent_cache_dir Controls a persistent compilation cache. When set to an existing directory, that directory is used as the compilation cache. Available as of 21.04

--tf_xla_async_io_level Enable asynchronous IO. 0: off, 1: on

XLA Options

--xla_gpu_autotune_level Controls autotune level. 0: off, 1: w/o initialization, 2: w/ initialization, 3: w/ re-initialization, 4: w/ functional check

--xla_multiheap_size_constraint_per_heap Controls whether to use multiple heaps. -1: single heap, n: use multiple allocations with a maximum of n bytes each. Available as of 20.11