#BigQuery #VectorDB #VectorSearch #GCP Conducting Clustering Analysis on users, consumers groups is common techniques for us to uncover hidden patterns, hidden user behaviors etc, from classic K Means to now commonly used DBSCAN, we are able compute Vectors for each customers from the features we selected depends on particular scenarios. e.g.: A customer’s age, income, purchase_amount: [30, 60000, 100], thus a customer is now visualized in a vector.
It is ideal for “semantic” search based o similarity. Imaging we store vectorized survey data in DB, we are now able to compute similar feedbacks from users without initiated a python notebook again and again.
Given now, BigQuery native support VECTOR_SEARCH[1], now we basically can conduct **Approximate Nearest Neighbor** search directly in BigQuery console, without using python, below is a quick demonstration. Create two tables with `Array` data format storing `embedding values`:
  1. Create a base table - imaging this table is called Zoo, each vector are different animals
-- random is the dataset name in my GCP project
create or replace table random.embedding_example_base (
id INT64,
my_embedding array<float64>,
);

insert random.embedding_example_base (id, my_embedding)
VALUES(1, [1.0, 2.0]),
(2, [2.0, 4.0]),
(3, [1.5, 7.0]),
(4, [1.0, 3.2]),
(5, [5.0, 5.4]),
(6, [3.7, 1.8]),
(7, [4.4, 2.9]);
  1. Create a query table - imaging data points in this table is animals you try to find their “similar” species, in this case, you have dog and cat
create or replace table random.embedding_example_query (
query_id STRING,
my_embedding array<float64>,
);

insert random.embedding_example_query (query_id, my_embedding)
VALUES('dog', [1.0, 2.0]),
('cat', [3.0, 5.2]);
Now, let’s do `VECTOR_SEARCH`
select
*
from vector_search(
table random.embedding_example_base -- base table
, 'my_embedding' -- column to search
, (
select query_id
, my_embedding as query_embedding
from random.embedding_example_query
)
, 'query_embedding'
, top_k => 2
, distance_type => 'COSINE' -- EUCLINDEAN, DOT_PRODUCT
)
  • `=>` this is assign symbol
And as for Distance Types: Then it return following values:
  • Top 2 most similar **vectors** based on distance values

Final Thoughts

Besides, use BQ as a vector store Db for applications. Now, if a company has ML pipeline (or just BigQuery’s own ML sql)[2] constantly output vectors represent customers behaviors, then now, Data Scientists could directly query those tables and conduct quick similarity analysis without using Python.
Vector Search is the new select *
BigQuery vector search and embedding generation - YouTube BigQuery as a Vector Database — how cool is that_ _ by Shuvro @ Nimesa _ Medium.pdf