Dynamic Scalar Fields

Last updated: 2024-12-24 17:53:53


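Overview

When you create the Schema of a Collection, you must specify the name and data type of each field. Because business requirements and data models keep evolving, some fields cannot be fully anticipated when the Schema is first designed. Dynamic scalar fields let the database accommodate changes in data structure without modifying the existing Schema, which makes the database more flexible and adaptable.
Once dynamic scalar fields are enabled, a Filter index is created automatically for every scalar field in the collection to speed up queries. You can also exclude specific fields from indexing to save storage space or reduce index maintenance overhead, which gives you finer-grained control over the collection.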
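
The indexing rule described above can be pictured with a small, self-contained Java sketch. This is only a simplified model for illustration, not SDK or server code: the class `DynamicIndexRule`, the method `indexedFields`, and the idea that only Schema-declared fields are indexed when the feature is off are assumptions made here.

```java
import java.util.*;

// Simplified model of the rule described above; hypothetical names, not SDK code.
class DynamicIndexRule {
    // Returns the scalar fields that would carry a Filter index in this model.
    // declaredFilterFields: fields given a Filter index explicitly in the Schema.
    // writtenFields: scalar fields that appear in written documents.
    static Set<String> indexedFields(boolean filterAll,
                                     Set<String> declaredFilterFields,
                                     Set<String> fieldsWithoutIndex,
                                     List<String> writtenFields) {
        Set<String> result = new TreeSet<>(declaredFilterFields);
        if (filterAll) {
            for (String field : writtenFields) {
                // Fields on the exclusion list stay unindexed.
                if (!fieldsWithoutIndex.contains(field)) {
                    result.add(field);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> excluded = new HashSet<>(Arrays.asList("test1", "test2"));
        List<String> written = Arrays.asList("bookName", "author", "array_test", "test1");
        // prints [array_test, author, bookName]
        System.out.println(indexedFields(true, new HashSet<>(), excluded, written));
    }
}
```

With the feature switched off (`filterAll` set to `false`), the same call returns only the declared fields, which is an empty set in this example.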

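How to Enable

When you create a collection, set the configuration parameter listed below to enable full indexing of scalar fields and to name the fields that should not be indexed. The parameter and its usage in the Python, Java, and Go SDKs are as follows.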
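SDK: Python SDK
Parameter: filter_index_config
How to enable (when creating a collection):
filter_index_config=FilterIndexConfig(
    # filter_all=True enables full indexing of scalar fields
    filter_all=True,
    # fields_without_index: once enabled, the fields for which no index is created
    fields_without_index=['author'],
    # max_str_len: once enabled, the maximum number of characters indexed for a scalar field; longer values are truncated when the index is built
    max_str_len=32,
)

SDK: Java SDK
Parameter: FilterIndexConfig
How to enable (when creating a collection):
.withFilterIndexConfig(FilterIndexConfig.newBuilder()
        // FilterAll=true enables full indexing of scalar fields
        .withFilterAll(true)
        // FieldWithoutFilterIndex: once enabled, the fields for which no index is created
        .withFieldWithoutFilterIndex(Arrays.asList("test1", "test2"))
        // MaxStrLen: once enabled, the maximum number of characters indexed for a scalar field; longer values are truncated when the index is built
        .withMaxStrLen(64)
        .build())

SDK: Go SDK
Parameter: FilterIndexConfig
How to enable (when creating a collection):
maxStrLen := uint32(32)
param := &tcvectordb.CreateCollectionParams{
    FilterIndexConfig: &tcvectordb.FilterIndexConfig{
        // FilterAll=true enables full indexing of scalar fields
        FilterAll: true,
        // FieldsWithoutIndex: once enabled, the fields for which no index is created
        FieldsWithoutIndex: []string{"author"},
        // MaxStrLen: once enabled, the maximum number of characters indexed for a scalar field; longer values are truncated when the index is built
        MaxStrLen: &maxStrLen,
    },
}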
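
The maximum-string-length setting truncates long values when the index is built. The sketch below illustrates that idea under stated assumptions: it counts Unicode code points, whereas the service might count bytes or UTF-16 units, and the class `IndexKeyTruncation` and method `indexKey` are hypothetical names that do not exist in any SDK.

```java
// Hypothetical illustration of index-time truncation; not SDK or server code.
class IndexKeyTruncation {
    // Returns the prefix of value that would be indexed under the assumption
    // that the limit is measured in Unicode code points.
    static String indexKey(String value, int maxStrLen) {
        int codePoints = value.codePointCount(0, value.length());
        if (codePoints <= maxStrLen) {
            return value; // short enough, indexed as-is
        }
        // Find the char offset of the maxStrLen-th code point and cut there.
        int end = value.offsetByCodePoints(0, maxStrLen);
        return value.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(indexKey("abcdef", 4));   // prints abcd
        System.out.println(indexKey("三国演义", 2)); // prints 三国
        System.out.println(indexKey("abc", 32));     // prints abc
    }
}
```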

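Usage Example

The following example uses the Java SDK to show how to enable full indexing of dynamic scalar fields and how to search with a Filter expression. To obtain the Java SDK, see SDK Preparation.

Step 1: Create a database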

import com.tencent.tcvectordb.client.RPCVectorDBClient;
import com.tencent.tcvectordb.client.VectorDBClient;
import com.tencent.tcvectordb.model.*;
import com.tencent.tcvectordb.model.Collection;
import com.tencent.tcvectordb.model.param.collection.*;
import com.tencent.tcvectordb.model.param.database.ConnectParam;
import com.tencent.tcvectordb.model.param.dml.*;
import com.tencent.tcvectordb.model.param.entity.AffectRes;
import com.tencent.tcvectordb.model.param.entity.BaseRes;
import com.tencent.tcvectordb.model.param.enums.ReadConsistencyEnum;
import com.tencent.tcvectordb.utils.JsonUtils;
import java.util.*;

public class VectorDBExample {
    public static void main(String[] args) {
        // Create the VectorDB client
        ConnectParam connectParam = ConnectParam.newBuilder()
                .withUrl("http://10.0.X.X:80")
                .withUsername("root")
                .withKey("eC4bLRy2va******************************")
                .withTimeout(30)
                .build();
        VectorDBClient client = new RPCVectorDBClient(connectParam, ReadConsistencyEnum.EVENTUAL_CONSISTENCY);
        // Create the database; the snippets in the following steps continue inside this method
        Database db = client.createDatabase("db-test");
    }
}

Step 2: Create a collection with dynamic scalar fields enabled

// Initialize the collection parameters; withFilterIndexConfig enables dynamic scalar fields.
// In this example the collection enables dynamic scalar fields and excludes "test1" and "test2" from Filter indexing; every other scalar field gets a Filter index by default.
CreateCollectionParam collectionParam = CreateCollectionParam.newBuilder()
        .withName("book-vector")
        .withShardNum(1)
        .withReplicaNum(1)
        .withDescription("this is a java sdk test")
        .addField(new FilterIndex("id", FieldType.String, IndexType.PRIMARY_KEY))
        .addField(new VectorIndex("vector", 3, IndexType.HNSW,
                MetricType.COSINE, new HNSWParams(16, 200)))
        .withFilterIndexConfig(FilterIndexConfig.newBuilder()
                .withFilterAll(true)
                .withFieldWithoutFilterIndex(Arrays.asList("test1", "test2"))
                .withMaxStrLen(64)
                .build())
        .build();
Collection collection = db.createCollection(collectionParam);

Step 3: Insert data

Write vector data together with scalar fields, then query the collection's index structure.
List<Document> documentList = new ArrayList<>(Arrays.asList(
        Document.newBuilder()
                .withId("0001")
                .withVector(Arrays.asList(0.2123, 0.21, 0.213))
                .addDocField(new DocField("bookName", "西游记"))
                .addDocField(new DocField("author", "吴承恩"))
                .addDocField(new DocField("array_test", Arrays.asList("1","2","3")))
                .addDocField(new DocField("test1", 28))
                .build(),
        Document.newBuilder()
                .withId("0002")
                .withVector(Arrays.asList(0.2123, 0.22, 0.213))
                .addDocField(new DocField("bookName", "西游记"))
                .addDocField(new DocField("author", "吴承恩"))
                .addDocField(new DocField("array_test", Arrays.asList("4","5","6")))
                .addDocField(new DocField("test2", 25))
                .build(),
        Document.newBuilder()
                .withId("0003")
                .withVector(Arrays.asList(0.2123, 0.23, 0.213))
                .addDocField(new DocField("bookName", "三国演义"))
                .addDocField(new DocField("author", "罗贯中"))
                .addDocField(new DocField("array_test", Arrays.asList("7","8","9")))
                .build(),
        Document.newBuilder()
                .withId("0004")
                .withVector(Arrays.asList(0.2123, 0.24, 0.213))
                .addDocField(new DocField("bookName", "三国演义"))
                .addDocField(new DocField("author", "罗贯中"))
                .addDocField(new DocField("array_test", Arrays.asList("10","11","12")))
                .addDocField(new DocField("test1", 23))
                .build(),
        Document.newBuilder()
                .withId("0005")
                .withVector(Arrays.asList(0.2123, 0.25, 0.213))
                .addDocField(new DocField("bookName", "三国演义"))
                .addDocField(new DocField("author", "罗贯中"))
                .build()));
System.out.println("---------------------- upsert ----------------------");
InsertParam insertParam = InsertParam.newBuilder().withDocuments(documentList).build();
AffectRes affectRes = client.upsert("db-test", "book-vector", insertParam);
System.out.println(JsonUtils.toJsonString(affectRes));
// Query the collection structure
Database database = client.database("db-test");
// A new variable name avoids redeclaring the collection variable created in Step 2
Collection collectionInfo = database.describeCollection("book-vector");
System.out.println("\tres: " + collectionInfo.toString());
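Querying the collection structure with describeCollection returns the following. The scalar fields bookName, author, and array_test each have an automatically created Filter index, while the excluded fields test1 and test2 have none.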
{
"database": "db-test",
"collection": "book-vector",
"replicaNum": 1,
"shardNum": 1,
"description": "this is a java sdk test",
"indexes": [
{
"fieldName": "array_test",
"fieldType": "array",
"indexType": "filter",
"fieldElementType": "string"
},
{
"fieldName": "author",
"fieldType": "string",
"indexType": "filter"
},
{
"fieldName": "id",
"fieldType": "string",
"indexType": "primaryKey"
},
{
"fieldName": "bookName",
"fieldType": "string",
"indexType": "filter"
},
{
"fieldName": "vector",
"fieldType": "vector",
"indexType": "HNSW",
"metricType": "COSINE",
"params": {
"efConstruction": 200,
"M": 16
},
"dimension": 3
}
],
"createTime": "2024-12-19 17:07:53",
"documentCount": 0,
"indexStatus": {
"status": "ready"
},
"alias": [],
"filterIndexConfig": {
"filterAll": true,
"fieldsWithoutIndex": [
"test1",
"test2"
],
"maxStrLen": 32
}
}

Step 4: Run a similarity search that filters on dynamic scalar fields

// Build a Filter expression on the scalar fields
Filter filterParam = new Filter("bookName=\"三国演义\"")
        .and(Filter.exclude("array_test", Arrays.asList("7")));
System.out.println("---------------------- search ----------------------");
// Set the search parameters
SearchByVectorParam searchByVectorParam = SearchByVectorParam.newBuilder()
        .addVector(Arrays.asList(0.2123, 0.23, 0.213))
        // With an HNSW index you must specify ef; a larger ef gives higher recall but slower searches
        .withParams(new HNSWSearchParams(100))
        // The K in Top K
        .withLimit(10)
        // Filter applied to the results
        .withFilter(filterParam)
        .build();
// Print the similarity search results. The result is a two-dimensional list; each element holds the results for one of the query vectors passed to search
List<List<Document>> svDocs = client.search("db-test", "book-vector", searchByVectorParam);
int i = 0;
for (List<Document> docs : svDocs) {
    System.out.println("\tres: " + i);
    i++;
    for (Document doc : docs) {
        System.out.println("\tres: " + doc.toString());
    }
}
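The similarity search returns the results below. The dynamically written scalar fields can be used to filter the data, so only the documents that satisfy the Filter condition are returned.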
res: 0
res: {"id":"0004","score":0.9997869729995728,"bookName":"三国演义","author":"罗贯中","array_test":["10","11","12"]}
res: {"id":"0005","score":0.9991745948791504,"bookName":"三国演义","author":"罗贯中"}
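
To see why only documents 0004 and 0005 are returned, the Filter condition can be replayed in memory. The following is a minimal sketch that models the condition on plain Java maps and makes no SDK calls. The class `FilterSketch` and its methods are hypothetical, and treating a document without array_test as passing the exclude condition is an assumption chosen to match the sample output above (document 0005 has no array_test and is returned). The sketch ignores vector scores, so it reproduces only which documents pass, not their ranking.

```java
import java.util.*;

// In-memory model of: bookName = "三国演义" AND array_test excludes "7".
// Hypothetical helper for illustration; not part of the SDK.
class FilterSketch {
    static boolean matches(Map<String, Object> doc) {
        if (!"三国演义".equals(doc.get("bookName"))) {
            return false;
        }
        Object arrayTest = doc.get("array_test");
        if (arrayTest instanceof List) {
            // Exclude: reject the document if the array contains "7".
            return !((List<?>) arrayTest).contains("7");
        }
        // Assumption: a missing array_test field passes the exclude condition.
        return true;
    }

    static Map<String, Object> doc(String id, String bookName, List<String> arrayTest) {
        Map<String, Object> fields = new HashMap<>();
        fields.put("id", id);
        fields.put("bookName", bookName);
        if (arrayTest != null) {
            fields.put("array_test", arrayTest);
        }
        return fields;
    }

    static List<String> matchingIds() {
        List<Map<String, Object>> docs = Arrays.asList(
                doc("0001", "西游记", Arrays.asList("1", "2", "3")),
                doc("0002", "西游记", Arrays.asList("4", "5", "6")),
                doc("0003", "三国演义", Arrays.asList("7", "8", "9")),
                doc("0004", "三国演义", Arrays.asList("10", "11", "12")),
                doc("0005", "三国演义", null));
        List<String> ids = new ArrayList<>();
        for (Map<String, Object> d : docs) {
            if (matches(d)) {
                ids.add((String) d.get("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(matchingIds()); // prints [0004, 0005]
    }
}
```

Document 0003 shares the book name but is rejected because its array_test contains "7"; documents 0001 and 0002 fail the bookName condition.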