# tf\_feature\_similarity Given a query input of entity keys, feature columns, and a metric column, and a second query input specifying a search vector of feature columns and metric, computes the similarity of each entity in the first input to the search vector based on their similarity. The score is computed as the cosine similarity of the feature column(s) for each entity with the feature column(s) for the search vector, which can optionally be TF/IDF weighted. ``` SELECT * FROM TABLE( tf_feature_similarity( primary_features => CURSOR( SELECT primary_key, pivot_features, metric from table where ... group by primary_key, pivot_features ), comparison_features => CURSOR( SELECT comparison_metric from table where ... group by ), use_tf_idf => ) ) ``` #### Input Arguments | Parameter | Description | Data Type | | ------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------- | | `primary_key` | Column containing keys/entity IDs that can be used to uniquely identify the entities for which the function will compute the similarity to the search vector specified by the `comparison_features` cursor. Examples include countries, census block groups, user IDs of website visitors, and aircraft call signs. | Column\ | | `pivot_features` | One or more columns constituting a compound feature. For example, two columns of visit hour and census block group would compare entities specified by `primary_key` based on whether they visited the same census block group in the same hour. If a single census block group feature column is used, the `primary_key` entities are compared only by the census block groups visited, regardless of time overlap. | Column\ | | `metric` | Column denoting the values used as input for the cosine similarity metric computation. In many cases, this is simply `COUNT(*)` such that feature overlaps are weighted by the number of co-occurrences. | Column\ | | `comparison_`pivot\_`features` | One or more columns constituting a compound feature for the search vector. This should match in number of sub-features, types, and semantics `pivot features`. | Column\ | | `comparison_metric` | Column denoting the values used as input for the cosine similarity metric computation from the search vector. In many cases, this is simply `COUNT(*)` such that feature overlaps are weighted by the number of co-occurrences. | Column\ | | `use_tf_idf` | Boolean constant denoting whether [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) weighting should be used in the cosine similarity score computation. | BOOLEAN | #### Output Columns
NameDescriptionData Types
classID of the primary key being compared against the search vector.Column (type will be the same of primary_key input column)
similarity_scoreComputed cosine similarity score between each primary_key pair, with values falling between 0 (completely dissimilar) and 1 (completely similar).Column
#### Example ``` /* Compute the similarity of US airline flight nums to a particular Delta flight (DAL795) based on the cosine similarity of the overlap of flight paths binned to a H3 Hex at zoom level 7 (roughly 5 sq km), and return the top 10 most similar flight nums */ SELECT * FROM TABLE( tf_feature_similarity( primary_features => CURSOR( SELECT callsign, geotoh3(st_x(location), st_y(location), 7) as h3, count(*) as n from adsb_2021_03_01 where operator in ( 'Delta Air Lines', 'Alaska Airlines', 'Southwest Airlines', 'American Airlines', 'United Airlines' ) and altitude >= 1000 group by callsign, h3 ), comparison_features => CURSOR( SELECT geotoh3(st_x(location), st_y(location), 7) as h3, COUNT(*) as n from adsb_2021_03_01 where callsign = 'DAL795' and altitude >= 1000 group by h3 ), use_tf_idf => false ) ) ORDER BY similarity_score desc limit 10; class|similarity_score DAL795|1 DAL538|0.610889 DAL1192|0.3419932 DAL1185|0.3391671 SWA4346|0.3206964 DAL365|0.3037131 SWA953|0.2912168 UAL1559|0.2747431 SWA2098|0.2511763 DAL526|0.2473387 ```