# Using the Typesense Search Engine

**Published by:** [Yooma](https://paragraph.com/@yooma/)
**Published on:** 2022-09-01
**URL:** https://paragraph.com/@yooma/typesense

## Content

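Note: all examples use Python.

### Part 1: Introduction to Typesense

Typesense stores your data on disk and keeps the index it builds in memory.

Typesense is an open-source, typo-tolerant search engine optimized for instant (typically under 50 ms) search-as-you-type experiences and developer productivity.

Typesense publishes a comparison with other search engines (as a document and as a table).

Indexing speed and resource usage:

- Indexing 2.2 million recipes (one recipe corresponds to one document, as described below) took about 900 MB of RAM.
- Indexing all 2.2 million records took 3.6 minutes.
- On a server with 4 vCPUs, Typesense handled 104 concurrent search queries per second, with an average search processing time of 11 ms.
- RAM: a dataset of X MB needs roughly 2X to 3X MB of RAM (two to three times the size of the data).

For more detail, see the official documentation.

### Part 2: Using Typesense

#### 1. Two ways to use Typesense

- Use the hosted cloud service, which is simple to configure and run (paid).
- Install Typesense locally and maintain the configuration yourself (the approach used in this post).

#### 2. Install and start Typesense

**(1) Download and install**

On CentOS (if you are on another system, the official documentation lists the matching download method):

```shell
curl -O https://dl.typesense.org/releases/0.23.1/typesense-server-0.23.1-1.x86_64.rpm
sudo yum install ./typesense-server-0.23.1-1.x86_64.rpm
```

**(2) Start the service and check its status**

When installed from the RPM package, Typesense starts its service automatically. Check the service status with:

```shell
sudo systemctl status typesense-server.service
```

If `Active` shows `running`, the service has started.

- The configuration file is at `/etc/typesense/typesense-server.ini`.
- Logs are in `/var/log/typesense/`.
- The data directory is `/var/lib/typesense/`.

The log and data directories can be changed in the configuration file.

**(3) Check that the running service accepts requests**

```shell
curl http://localhost:8108/health
{"ok":true}
```

**(4) Install the client**

This post uses the Python client, published on PyPI as `typesense` (install it with `pip install typesense`). Clients for other languages are listed on the official site.

#### 3. Build a dataset to search

**(1) Download a test dataset**

The data comes from the Typesense website and is only for testing. If you have your own test data, use that instead.

```shell
cd /typesense_test  # any directory
curl -O https://dl.typesense.org/datasets/books.jsonl.gz
gunzip books.jsonl.gz
```

**(2) Initialize the client**

```python
import typesense

client = typesense.Client({
    'nodes': [{
        'host': 'localhost',  # For Typesense Cloud use xxx.a1.typesense.net
        'port': '8108',       # For Typesense Cloud use 443
        'protocol': 'http'    # For Typesense Cloud use https
    }],
    # The API key is set in the config file from step 2 (2):
    # /etc/typesense/typesense-server.ini
    'api_key': '<API_KEY>',
    'connection_timeout_seconds': 2
})
```

You can now talk to Typesense. All of the code below assumes this `client` has been created.

**(3) Create a collection to hold the book data**

In Typesense, a collection is comparable to a table in a relational database, and documents are comparable to the rows in that table. When you create a collection you declare its field names and types. The code continues from the initialization above. If you use your own data, set each field's `name` and `type` to match the values in that data.

```python
books_schema = {
    'name': 'books',  # collection name; every later operation refers to it by this name
    'fields': [
        {'name': 'title', 'type': 'string'},
        {'name': 'authors', 'type': 'string[]', 'facet': True},  # facet fields are indexed verbatim
        {'name': 'publication_year', 'type': 'int32', 'facet': True},
        {'name': 'ratings_count', 'type': 'int32'},
        {'name': 'average_rating', 'type': 'float'}
    ],
    # Results are sorted by ratings_count when a search gives no sort_by
    'default_sorting_field': 'ratings_count'
}

client.collections.create(books_schema)
```

Field types:

| type | Description |
| --- | --- |
| `string` | String values |
| `string[]` | Array of strings |
| `int32` | Integer values up to 2,147,483,647 |
| `int32[]` | Array of `int32` |
| `int64` | Integer values larger than 2,147,483,647 |
| `int64[]` | Array of `int64` |
| `float` | Floating point / decimal numbers |
| `float[]` | Array of floating point / decimal numbers |
| `bool` | `true` or `false` |
| `bool[]` | Array of booleans |
| `geopoint` | Latitude and longitude specified as `[lat, lng]` |
| `geopoint[]` | Array of latitude and longitude pairs specified as `[[lat1, lng1], [lat2, lng2]]` |
| `string*` | Special type that automatically converts values to a `string` or `string[]`; for example, a value of 1 is stored as '1' |
| `auto` | Special type that automatically attempts to infer the data type from the documents added to the collection. See automatic schema detection. |

**(4) Add data**

Once the collection exists, add documents to it. To load the test data downloaded earlier:

```python
with open('/typesense_test/books.jsonl') as jsonl_file:
    # 'books' is the name defined in the collection schema
    client.collections['books'].documents.import_(jsonl_file.read().encode('utf-8'))
```

To add your own data instead, create documents one at a time or import them in bulk. The fields in a document must be fields defined in the collection, much like giving a value for each column when inserting a row into a table. A field may be left out of a document only if the schema declares it with `'optional': True`; such a field can be written later with an update. Substitute your own collection name for `collection_name`.

```python
## Import a single document
document = {
    'id': '124',  # optional; if omitted, Typesense auto-generates an id
    'company_name': 'Stark Industries',
    'num_employees': 5215,
    'country': 'USA'
}

client.collections['collection_name'].documents.create(document)

## Import many documents at once; this is much more efficient for large datasets
documents = [{
    'id': '124',
    'company_name': 'Stark Industries',
    'num_employees': 5215,
    'country': 'USA'
}]

# IMPORTANT: Be sure to increase connection_timeout_seconds to at least 5 minutes or more for imports,
# when instantiating the client
client.collections['collection_name'].documents.import_(documents, {'action': 'create'})
```

The `action` parameter controls how each document is written:

| action | Description |
| --- | --- |
| `create` | Creates a new document. Fails if a document with the same `id` already exists. |
| `update` | Updates an existing document. Fails if no document with the given `id` exists. |
| `upsert` | Creates a new document, or updates the existing one with that `id`. The whole document must be sent. |
| `emplace` | Creates a new document, or updates the existing one with that `id`. Either a partial or a whole document can be sent. |

**(5) Search the data**

Searching a single collection:

```python
search_parameters = {
    'q': 'harry potter',               # search term
    'query_by': 'title',               # search within the title field
    'sort_by': 'ratings_count:desc'    # sort results by ratings_count, descending
}

client.collections['books'].documents.search(search_parameters)
```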
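The search call returns a plain Python dict. As a minimal sketch of reading it, the helper below pulls the title and rating count out of each hit. It assumes the response shape described in the Typesense API documentation (a `found` count plus a `hits` list whose entries carry the stored record under `document`). The name `summarize_hits` is hypothetical, and `sample_response` is a hand-written stand-in with placeholder titles, not real server output.

```python
def summarize_hits(response):
    """Return (found, [(title, ratings_count), ...]) from a search response dict."""
    rows = []
    for hit in response.get('hits', []):
        doc = hit.get('document', {})
        rows.append((doc.get('title'), doc.get('ratings_count')))
    return response.get('found', 0), rows


# Hand-written stand-in for the dict returned by documents.search(...);
# the titles and counts are placeholders, not data from books.jsonl.
sample_response = {
    'found': 2,
    'hits': [
        {'document': {'title': 'Example Book A', 'ratings_count': 200}},
        {'document': {'title': 'Example Book B', 'ratings_count': 100}},
    ],
}

found, rows = summarize_hits(sample_response)
print(found, rows)
```

With a live server, passing the return value of `client.collections['books'].documents.search(search_parameters)` in place of `sample_response` should work the same way, provided the response has the shape assumed here.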
Searching several collections together:

```python
search_requests = {
    'searches': [
        {
            'collection': 'books',
            'query_by': 'title,authors'
        },
        {
            'collection': 'collection_name1',
            'query_by': 'field1,field2,field3'
        },
        {
            'collection': 'collection_name2',
            'query_by': 'field1'
        }
    ]
}

# Parameters shared by every search go here, e.g. the search term to look up in
# each collection. If query_by is the same for all collections it can also be
# placed here and removed from the entries above.
common_search_params = {
    'q': 'xxxx'  # search term
}

client.multi_search.perform(search_requests, common_search_params)
```

For questions about handling search results or about sorting, see the official documentation.

**(6) View collections and documents**

Retrieving a collection returns its schema (comparable to the structure of a database table):

```python
# A single collection
client.collections['collection_name'].retrieve()

# All collections
client.collections.retrieve()
```

Retrieving a document:

```python
# Fetch the document with id 124 from this collection
client.collections['collection_name'].documents['124'].retrieve()
```

**(7) Modify collections and documents**

Changing the collection schema:

```python
update_schema = {
    'fields': [
        {
            'name': 'num_employees',
            'drop': True  # remove this field
        },
        {
            'name': 'company_category',
            'type': 'string'
        }
    ]
}

client.collections['collection_name'].update(update_schema)
```

Updating a document:

```python
document = {
    'company_name': 'Stark Industries',
    'num_employees': 5500
}

# Update the document with id 124
client.collections['collection_name'].documents['124'].update(document)
```

**(8) Delete collections and documents**

Deleting a collection:

```python
client.collections['collection_name'].delete()
```

Deleting a document:

```python
# Delete the document with id 124
client.collections['collection_name'].documents['124'].delete()
```

Note: see the official API documentation.
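As a closing back-of-the-envelope check, the memory rule quoted in the introduction (a dataset of X MB needs about 2X to 3X MB of RAM) can be turned into a small calculation. This is only a sketch: `estimate_ram_mb` is a hypothetical helper, and the 500 MB dataset size is an arbitrary example figure, not a measurement.

```python
def estimate_ram_mb(dataset_mb, low_factor=2.0, high_factor=3.0):
    """Return the (low, high) RAM range in MB for a dataset of dataset_mb MB."""
    if dataset_mb < 0:
        raise ValueError('dataset_mb must be non-negative')
    return dataset_mb * low_factor, dataset_mb * high_factor


low, high = estimate_ram_mb(500)  # 500 MB is an arbitrary example size
print(f'Plan for roughly {low:.0f} to {high:.0f} MB of RAM')
```

The factors are kept as parameters so the range can be adjusted if your own measurements differ from the quoted rule.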
