Using the Typesense Search Engine

Note: the examples use Python.

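I: Introduction to Typesense

Typesense stores data on disk and keeps the index it builds in memory.

Typesense is an open-source, fault-tolerant search engine optimized for real-time (typically under 50 ms) search-as-you-type experiences and for developer productivity.

Typesense publishes a comparison with other search engines (a prose version and a table version).

Indexing speed and resource usage:

For a dataset of 2.2 million recipes (one recipe corresponds to one document, as described below):

Indexing took about 900 MB of RAM in Typesense.
Indexing all 2.2 million records took 3.6 minutes.
On a server with 4 vCPUs, Typesense handled 104 concurrent search queries per second, with an average search processing time of 11 ms.

RAM: if the dataset is X MB, Typesense needs roughly 2X to 3X MB of RAM (two to three times the size of the data).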
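As a quick illustration of the 2X to 3X rule of thumb above, here is a minimal sketch that turns a dataset size into a RAM range. The function name estimate_ram_mb is hypothetical and not part of Typesense; the result is only a rough planning figure.

```python
from typing import Tuple


def estimate_ram_mb(dataset_mb: float) -> Tuple[float, float]:
    """Return the (low, high) RAM estimate in MB using the 2x-3x rule of thumb."""
    if dataset_mb < 0:
        raise ValueError("dataset size must be non-negative")
    return 2 * dataset_mb, 3 * dataset_mb


low, high = estimate_ram_mb(500)
print(f"500 MB of data -> roughly {low:.0f}-{high:.0f} MB of RAM")
```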

See the official documentation for more detail.

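II: Using Typesense

1: Two ways to use Typesense

Use the hosted cloud service, which is simple to configure and run (paid).
Install Typesense locally and maintain the configuration yourself (the approach used in this article).

2: Installing and starting Typesense

(1): Download and install

CentOS (for other systems, the official documentation lists the corresponding download instructions):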

curl -O https://dl.typesense.org/releases/0.23.1/typesense-server-0.23.1-1.x86_64.rpm
sudo yum install ./typesense-server-0.23.1-1.x86_64.rpm

(2): Start the service and check its status

Installing from the RPM package starts the Typesense service automatically. Check its status with:

sudo systemctl status typesense-server.service

If Active shows running, the service has started.

The configuration file is at /etc/typesense/typesense-server.ini
Logs are in /var/log/typesense/
The data directory is /var/lib/typesense/
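The API key needed later lives in that configuration file. The sketch below reads it with Python's standard configparser. It assumes the file has a [server] section with an api-key entry, which is how the packaged file is usually laid out; check your own file. The sample contents and the helper name read_api_key are made up for illustration.

```python
import configparser


def read_api_key(ini_text: str) -> str:
    """Extract the api-key value from the text of a typesense-server.ini file."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # Assumption: the key is stored as "api-key" under the [server] section.
    return parser["server"]["api-key"]


# Made-up sample contents; a real file would be read from
# /etc/typesense/typesense-server.ini instead.
sample_ini = """
[server]
api-address = 0.0.0.0
api-port = 8108
data-dir = /var/lib/typesense
api-key = example-key
log-dir = /var/log/typesense
"""

print(read_api_key(sample_ini))
```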

(3): After the service starts, check that it accepts requests

curl http://localhost:8108/health
{"ok":true}
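The same check can be done from Python with only the standard library. This is a minimal sketch: parse_health and check_health are hypothetical helper names, and the URL is the local default used in the curl command above.

```python
import json
import urllib.request


def parse_health(body: str) -> bool:
    """Return True only if the health response body reports {"ok": true}."""
    try:
        return json.loads(body).get("ok") is True
    except (ValueError, AttributeError):
        # Not valid JSON, or valid JSON that is not an object.
        return False


def check_health(url: str = "http://localhost:8108/health", timeout: float = 2.0) -> bool:
    """Fetch the health endpoint; any network error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_health(resp.read().decode("utf-8"))
    except OSError:
        return False
```

Calling check_health() against a running local server should return True; it returns False if the server is unreachable.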

(4): Install the client

pip install typesense

This is the Python client; clients for other languages are listed on the official site.

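3: Building a dataset to search

(1): Download a test dataset

The data comes from the Typesense website and is meant for testing; if you have your own test data, use that instead.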

cd /typesense_test    # any directory
curl -O https://dl.typesense.org/datasets/books.jsonl.gz
gunzip books.jsonl.gz

(2): Initialize the client

import typesense

client = typesense.Client({
  'nodes': [{
    'host': 'localhost', # For Typesense Cloud use xxx.a1.typesense.net
    'port': '8108',      # For Typesense Cloud use 443
    'protocol': 'http'   # For Typesense Cloud use https
  }],
  'api_key': '<API_KEY>',    # the api-key is in the configuration file mentioned in II --> 2 --> (2)
  'connection_timeout_seconds': 2  
})

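The client can now talk to Typesense.

All the code below assumes this client has been initialized.

(3): Create a collection to hold the book data

In Typesense, a collection is roughly equivalent to a table in a relational database,

and documents are equivalent to the rows in that table.

Creating a collection requires specifying field names and types.

(Continuing from the initialization code above:)

(If you use your own data, set the name, type, and so on of each collection field to match the value types in your data.)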

books_schema = {
  'name': 'books',    # the collection name; all operations on the collection refer to it by this name
  'fields': [
    {'name': 'title', 'type': 'string' },
    {'name': 'authors', 'type': 'string[]', 'facet': True },  ### 'facet': True: facet fields are indexed verbatim
    {'name': 'publication_year', 'type': 'int32', 'facet': True },
    {'name': 'ratings_count', 'type': 'int32' },
    {'name': 'average_rating', 'type': 'float' }
  ],
  'default_sorting_field': 'ratings_count'  # results are sorted by ratings_count when no sort_by is given
}

client.collections.create(books_schema)

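######## Field types
type	Description
string	String values
string[]	Array of strings
int32	Integer values up to 2,147,483,647
int32[]	Array of int32
int64	Integer values larger than 2,147,483,647
int64[]	Array of int64
float	Floating-point / decimal numbers
float[]	Array of floating-point / decimal numbers
bool	true or false
bool[]	Array of booleans
geopoint	Latitude and longitude specified as [lat, lng]
geopoint[]	Array of latitude/longitude pairs specified as [[lat1, lng1], [lat2, lng2]]
string*	Special type that automatically converts values to a string or string[]. For example, a value of 1 is converted to '1'.
auto	Special type that automatically attempts to infer the data type from the documents added to the collection. See automatic schema detection.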
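When writing a schema for your own data, it can help to map sample Python values to the type names in the table above. The sketch below is a hypothetical helper (infer_field_type is not part of the Typesense client). It covers only the scalar and array types and deliberately does not try to detect geopoint values.

```python
INT32_MAX = 2_147_483_647


def infer_field_type(value) -> str:
    """Suggest a Typesense field type name for a sample Python value."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int32" if -INT32_MAX - 1 <= value <= INT32_MAX else "int64"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list) and value:
        element_type = infer_field_type(value[0])
        if element_type.endswith("[]"):
            raise TypeError("nested lists are not handled by this sketch")
        return element_type + "[]"
    raise TypeError(f"cannot suggest a type for {value!r}")


print(infer_field_type(2005))          # a publication_year style value
print(infer_field_type(["A", "B"]))    # an authors style value
```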

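(4): Add data

Once the collection exists, add data to it. Import the test data downloaded earlier into the collection:

with open('/typesense_test/books.jsonl') as jsonl_file:
  client.collections['books'].documents.import_(jsonl_file.read().encode('utf-8'))    # 'books' is the name defined in the collection schema

Or, if you have your own data to add: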

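## Import a single document
document = {
  'id': '124',    ### optional; if omitted, Typesense assigns an id automatically
  'company_name': 'Stark Industries',    ### the fields below must match the collection schema, like inserting a row into a table; a schema field may be left out only if it is declared with 'optional': True, and it can then be filled in by a later update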
  'num_employees': 5215,
  'country': 'USA'
}

client.collections['collection_name'].documents.create(document)

## Import multiple documents; for large datasets this is much more efficient than creating them one by one
documents = [{
  'id': '124',
  'company_name': 'Stark Industries',
  'num_employees': 5215,
  'country': 'USA'
}]

# IMPORTANT: Be sure to increase connection_timeout_seconds to at least 5 minutes or more for imports,
#  when instantiating the client

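client.collections['collection_name'].documents.import_(documents, {'action': 'create'})
action
 create  creates a new document; fails if a document with the same id already exists in the collection
 update  updates an existing document; fails if no document with the given id exists
 upsert  creates a new document, or updates the existing one if the id already exists; the whole document must be sent
 emplace creates a new document, or updates the existing one if the id already exists; either a partial or the whole document may be sent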
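A bulk import reports success or failure per document rather than failing as a whole, so it is worth inspecting the return value. The sketch below assumes import_ was called with a list of dicts and returned a list of per-document results, each carrying a 'success' flag. The sample results are hand-written to mimic that shape and are not real server output; failed_imports is a hypothetical helper.

```python
def failed_imports(results):
    """Return the result entries whose 'success' flag is not True."""
    return [r for r in results if not r.get("success")]


# Hand-written sample in the assumed shape of an import_ return value.
sample_results = [
    {"success": True},
    {"success": False, "error": "A document with id 124 already exists.", "code": 409},
]

for failure in failed_imports(sample_results):
    print("import failed:", failure.get("error"))
```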

(5): Search

Searching a single collection

search_parameters = {
  'q'         : 'harry potter',  # search term
  'query_by'  : 'title',          # search in the title field
  'sort_by'   : 'ratings_count:desc'  # sort results by ratings_count, descending
}

client.collections['books'].documents.search(search_parameters)
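The search call returns a dict. A minimal sketch of pulling the matched documents out of it follows, assuming the usual response shape with a 'found' count and a 'hits' list whose entries each carry a 'document'. The sample response is made up for illustration, and extract_documents is a hypothetical helper.

```python
def extract_documents(response):
    """Return the matched documents from a search response, in ranked order."""
    return [hit["document"] for hit in response.get("hits", [])]


# Made-up sample in the assumed shape of a search response.
sample_response = {
    "found": 2,
    "hits": [
        {"document": {"id": "1", "title": "Book A", "ratings_count": 300}},
        {"document": {"id": "2", "title": "Book B", "ratings_count": 120}},
    ],
}

for doc in extract_documents(sample_response):
    print(doc["title"], doc["ratings_count"])
```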

Searching multiple collections at once

search_requests = {
  'searches': [
    {
      'collection': 'books',
      'query_by': 'title, authors'
    },
    {
      'collection': 'collection_name1',  
      'query_by': 'field1, field2, field3'
    },
    {
      'collection': 'collection_name2',
      'query_by': 'field1'
    }
  ]
}

common_search_params = {
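    'q': 'xxxx'    # search term; parameters shared by every search go here, so this term is searched in each collection. If query_by is the same for all searches, it can also go here and be removed from the entries above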
}

client.multi_search.perform(search_requests, common_search_params)
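A multi-search response wraps one result per search. The sketch below assumes the response has a 'results' list in the same order as the 'searches' list, and that a failed search carries an 'error' entry instead of hits. The sample response is made up, and found_counts is a hypothetical helper.

```python
def found_counts(multi_response):
    """Return the 'found' count of each search, or None where a search failed."""
    counts = []
    for result in multi_response.get("results", []):
        counts.append(None if "error" in result else result.get("found", 0))
    return counts


# Made-up sample in the assumed shape of a multi-search response.
sample_multi_response = {
    "results": [
        {"found": 3, "hits": []},
        {"code": 404, "error": "Not found."},
        {"found": 0, "hits": []},
    ]
}

print(found_counts(sample_multi_response))
```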

For questions about processing search results or about sorting, see the official documentation.

(6): Viewing collections and documents

Retrieve a collection; this returns the collection's schema (comparable to a table's structure)

# a single collection
client.collections['collection_name'].retrieve()

# all collections
client.collections.retrieve()

Retrieve a document

# retrieve a document
client.collections['collection_name'].documents['124'].retrieve() # returns the document with id 124 in this collection

(7): Modifying collections and documents

Change a collection's schema

update_schema = {
  'fields': [
    {
      'name'  :  'num_employees',
      'drop'  :  True    ## drop this field
    },
    {
      'name'  :  'company_category',
      'type'  :  'string'
    }
  ]
}
client.collections['collection_name'].update(update_schema)

Update a document

document = {
  'company_name': 'Stark Industries',
  'num_employees': 5500
}

client.collections['collection_name'].documents['124'].update(document)  ## update the document with id 124

(8): Deleting collections and documents

Delete a collection

client.collections['collection_name'].delete()

Delete a document

client.collections['collection_name'].documents['124'].delete()  # delete the document with id 124

Note: see the official API documentation.

Yooma
