# tf\_feature\_self\_similarity

Given a query input of entity keys/IDs (for example, airplane tail numbers), a set of feature columns (for example, airports visited), and a metric column (for example, the number of times each airport was visited), scores each pair of entities based on their similarity. The score is computed as the cosine similarity of the feature column(s) between each entity pair, which can optionally be TF-IDF weighted.

```
select * from table(
  tf_feature_self_similarity(
    primary_features => cursor(
      select primary_key, pivot_features, metric from table
      group by primary_key, pivot_features
    ),
    use_tf_idf => <boolean>
  )
)
```

#### Input Arguments

| Parameter | Description | Data Type |
| --- | --- | --- |
| `primary_key` | Column containing keys/entity IDs that can be used to uniquely identify the entities for which the function computes co-similarity. Examples include countries, census block groups, user IDs of website visitors, and aircraft callsigns. | Column |
| `pivot_features` | One or more columns constituting a compound feature. For example, two columns of visit hour and census block group would compare entities specified by `primary_key` based on whether they visited the same census block group in the same hour. If a single census block group feature column is used, the `primary_key` entities would be compared only by the census block groups visited, regardless of time overlap. | Column |
| `metric` | Column denoting the values used as input for the cosine similarity metric computation. In many cases, this is `COUNT(*)` such that feature overlaps are weighted by the number of co-occurrences. | Column |
| `use_tf_idf` | Boolean constant denoting whether [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) weighting should be used in the cosine similarity score computation. | BOOLEAN |

#### Output Columns

| Name | Description | Data Types |
| --- | --- | --- |
| `class1` | ID of the first primary key in the pair-wise comparison. | Column (same type as the `primary_key` input column) |
| `class2` | ID of the second primary key in the pair-wise comparison. Because the computed similarity score for a pair of primary keys is order-invariant, results are output only for orderings such that `class1 <= class2`. For primary keys of type TextEncodingDict, the order is based on the internal integer IDs for each string value, not on lexicographic ordering. | Column (same type as the `primary_key` input column) |
| `similarity_score` | Computed cosine similarity score between each `primary_key` pair, with values falling between 0 (completely dissimilar) and 1 (completely similar). | Column |

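
To make the output columns concrete, here is a minimal Python sketch of an unweighted pairwise cosine similarity over `(primary_key, feature, metric)` rows. It is an illustration only, not the function's implementation: the name `feature_self_similarity`, the sample rows, and the use of Python's default sort order for keys (rather than the internal dictionary IDs noted above for TextEncodingDict keys) are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations_with_replacement
from math import sqrt


def feature_self_similarity(rows):
    """Hypothetical helper: rows is an iterable of (primary_key, feature, metric)."""
    # Build one sparse feature vector per primary key.
    vectors = defaultdict(dict)
    for key, feature, metric in rows:
        vectors[key][feature] = vectors[key].get(feature, 0.0) + metric

    # Precompute the Euclidean norm of each vector.
    norms = {k: sqrt(sum(w * w for w in vec.values())) for k, vec in vectors.items()}

    results = []
    # Emit each unordered pair once, with class1 <= class2 (self-pairs included).
    for a, b in combinations_with_replacement(sorted(vectors), 2):
        dot = sum(w * vectors[b].get(f, 0.0) for f, w in vectors[a].items())
        denom = norms[a] * norms[b]
        results.append((a, b, dot / denom if denom else 0.0))
    return results


# Made-up sample data: A and B share the same airports, C shares none.
rows = [
    ("A", "ATL", 3), ("A", "JFK", 4),
    ("B", "ATL", 3), ("B", "JFK", 4),
    ("C", "SFO", 5),
]
for class1, class2, score in feature_self_similarity(rows):
    print(class1, class2, score)
```

With non-negative metrics such as counts, every score falls between 0 and 1: entities with identical feature profiles score 1, and entities with no features in common score 0.
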
**Example**

```
/* Compute similarity of airlines by the airports they fly from */
select * from table(
  tf_feature_self_similarity(
    primary_features => cursor(
      select carrier_name, origin, count(*) as num_flights
      from flights_2008
      group by carrier_name, origin
    ),
    use_tf_idf => false
  )
)
where similarity_score <= 0.99
order by similarity_score desc
limit 20;

class1|class2|similarity_score
Expressjet Airlines|Continental Air Lines|0.9564615
Delta Air Lines|Atlantic Southeast Airlines|0.9436753
Delta Air Lines|AirTran Airways Corporation|0.9379856
Atlantic Southeast Airlines|AirTran Airways Corporation|0.9326661
American Eagle Airlines|American Airlines|0.8906327
Northwest Airlines|Pinnacle Airlines|0.8222722
Skywest Airlines|United Air Lines|0.6857293
Mesa Airlines|US Airways|0.6116939
United Air Lines|Frontier Airlines|0.5921053
Mesa Airlines|United Air Lines|0.5686765
United Air Lines|American Eagle Airlines|0.5272493
Skywest Airlines|Frontier Airlines|0.4684323
Southwest Airlines|US Airways|0.4166781
United Air Lines|American Airlines|0.397027
Comair|JetBlue Airways|0.3631534
Mesa Airlines|American Eagle Airlines|0.3379275
Skywest Airlines|American Eagle Airlines|0.3331468
Mesa Airlines|Skywest Airlines|0.3235496
Comair|Delta Air Lines|0.3075919
Southwest Airlines|Mesa Airlines|0.2901711

/* Compute the similarity of US States by the TF-IDF weighted cosine
   similarity of the words tweeted in each state */
select * from table(
  tf_feature_self_similarity(
    primary_features => cursor(
      select state_abbr, unnest(tweet_tokens), count(*)
      from tweets_2022_06
      where country = 'US'
      group by state_abbr, unnest(tweet_tokens)
    ),
    use_tf_idf => TRUE
  )
)
where class1 <> class2
order by similarity_score desc;

TX|GA|0.9928479
IL|TN|0.9920474
IL|NC|0.9920027
TX|IL|0.9917723
IN|OH|0.9916649
TN|NC|0.9915619
CA|TX|0.9910875
IN|VA|0.9909871
CA|IL|0.9909689
IL|OH|0.9909481
TX|NC|0.9908867
IL|MO|0.9907863
IN|MI|0.990751
TN|OH|0.9907123
IL|MD|0.9907106
OH|NC|0.9905779
VA|OH|0.990536
IN|IL|0.9904549
IN|MO|0.9903805
TX|TN|0.9903381
```

![Computed similarity score for US airlines for 2008, where similarity is computed by the cosine similarity of the airports each airline departs from, weighted by the number of flights from that airport (using the first example query above, sans LIMIT). Dataset courtesy of the FAA.](https://files.buildwithfern.com/heavyai.docs.buildwithfern.com/heavyai/331d95b59fb9ec85aebc38f524c05e68d9d1cae9b74808b1785b719571e2c08c/docs/assets/airline_similarity.png)
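
The effect of `use_tf_idf` can be illustrated with a small Python sketch. This page does not specify the exact weighting the function applies, so the code below assumes a textbook formulation (term frequency is the raw metric, inverse document frequency is `log(N / df)`, where `N` is the number of entities and `df` is the number of entities that have the feature). The helper names `build_vectors`, `tf_idf`, and `cosine`, and the sample rows, are hypothetical.

```python
from collections import defaultdict
from math import log, sqrt


def build_vectors(rows):
    """Group (primary_key, feature, metric) rows into sparse vectors."""
    vectors = defaultdict(dict)
    for key, feature, metric in rows:
        vectors[key][feature] = vectors[key].get(feature, 0.0) + metric
    return dict(vectors)


def tf_idf(vectors):
    """Assumed weighting: metric * log(N / number of entities with the feature)."""
    n_entities = len(vectors)
    doc_freq = defaultdict(int)
    for vec in vectors.values():
        for feature in vec:
            doc_freq[feature] += 1
    return {
        key: {f: tf * log(n_entities / doc_freq[f]) for f, tf in vec.items()}
        for key, vec in vectors.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    denom = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / denom if denom else 0.0


# Made-up token counts: "the" appears everywhere, other words are distinctive.
rows = [
    ("TX", "the", 10), ("TX", "rodeo", 2),
    ("GA", "the", 12), ("GA", "peach", 3),
    ("CA", "the", 9), ("CA", "surf", 4), ("CA", "rodeo", 1),
]
raw = build_vectors(rows)
weighted = tf_idf(raw)

print("raw TX-GA:     ", cosine(raw["TX"], raw["GA"]))
print("weighted TX-GA:", cosine(weighted["TX"], weighted["GA"]))
print("weighted TX-CA:", cosine(weighted["TX"], weighted["CA"]))
```

Under this assumed weighting, a feature shared by every entity receives a weight of zero, so the raw TX-GA similarity, which is driven almost entirely by the common token, drops to zero once weighted, while TX and CA keep a small positive score from the less common token they share.
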