How To Set Up the Titan Graph Database with Cassandra and ElasticSearch on Ubuntu 16.04

Original

水门

Published on 2018-07-27 17:45:46

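Introduction

Titan is a highly scalable open-source graph database. A graph database is a type of NoSQL database in which all data is stored as nodes and edges. Graph databases suit applications built on highly connected data, where the relationships between data items are an important part of the application's functionality, such as a social networking site. Titan is used to store and query large volumes of data distributed across multiple machines. It can use a variety of storage backends, such as Apache Cassandra, HBase, and BerkeleyDB.

In this tutorial, you will install Titan 1.0 and then configure it to use Cassandra and ElasticSearch. Cassandra acts as the data store that holds the underlying data, while ElasticSearch is a full-text search engine that can be used to run more complex search operations against the database. You will also use Gremlin to create and query data in the database.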

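Prerequisites

To complete this tutorial, you will need:

One Ubuntu 16.04 server with at least 2 GB of RAM and a non-root user
Oracle JDK 8 installed; see the related Tencent Cloud tutorial for instructions.

If you don't have a server, you can purchase a Tencent Cloud server, or try things out directly on an Ubuntu server in the Tencent Cloud lab.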

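Step 1 - Downloading, Unpacking, and Starting Titan

To download Titan, go to the download page. You will see two Titan distributions available for download. Use wget to download the stable release: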

$ wget http://s3.thinkaurelius.com/downloads/titan/titan-1.0.0-hadoop1.zip

Once the download finishes, unzip the file. The unzip program is not installed by default, so install it first:

$ sudo apt-get install unzip

Then unzip Titan:

$ unzip titan-1.0.0-hadoop1.zip

This creates a directory named titan-1.0.0-hadoop1.

Change into the titan-1.0.0-hadoop1 directory and run the shell script to start Titan.

$ cd titan-1.0.0-hadoop1
$ ./bin/titan.sh start

You will see output similar to this:

Forking Cassandra...
Running `nodetool statusthrift`... OK (returned exit status 0 and printed string "running").
Forking Elasticsearch...
Connecting to Elasticsearch (127.0.0.1:9300)...... OK (connected to 127.0.0.1:9300).
Forking Gremlin-Server...
Connecting to Gremlin-Server (127.0.0.1:8182)...... OK (connected to 127.0.0.1:8182).
Run gremlin.sh to connect.

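Titan relies on several other tools to work. Whenever Titan starts, Cassandra, ElasticSearch, and Gremlin-Server are started along with it.

You can check Titan's status by running the following command.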

$ ./bin/titan.sh status

You will see this output:

Gremlin-Server (org.apache.tinkerpop.gremlin.server.GremlinServer) is running with pid 7490
Cassandra (org.apache.cassandra.service.CassandraDaemon) is running with pid 7077
Elasticsearch (org.elasticsearch.bootstrap.Elasticsearch) is running with pid 7358
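If you want to script this check, for example before running queries, one option is to scan the status output for the three services. The sketch below assumes the status lines look like the output shown above. The function name titan_all_running is made up for illustration and is not part of Titan.

```shell
# Read status text from standard input and succeed only if Gremlin-Server,
# Cassandra, and Elasticsearch each have an "is running with pid" line.
titan_all_running() {
    status_text=$(cat)
    for service in Gremlin-Server Cassandra Elasticsearch; do
        # A healthy service prints a line such as
        # "Cassandra (...) is running with pid 7077".
        if ! printf '%s\n' "$status_text" | grep -q "^$service .*is running with pid"; then
            echo "$service does not appear in the status output as running" >&2
            return 1
        fi
    done
    return 0
}

# Usage, from the Titan directory:
#   ./bin/titan.sh status | titan_all_running && echo "Titan stack is up"
```

The function only inspects text, so it can be tried against saved output before being wired into a real deployment script.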

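In the next step, you'll see how to query the graph.

Step 2 - Querying the Graph Using Gremlin

Gremlin is a graph traversal language used to query, analyze, and manipulate graph databases. Now that Titan is set up and running, you will use Gremlin to create and query nodes (called vertices in Gremlin) and edges in Titan.

To use Gremlin, open the Gremlin console with the following command.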

$ ./bin/gremlin.sh

You will see a response similar to this:

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.utilities
plugin activated: aurelius.titan
plugin activated: tinkerpop.tinkergraph
gremlin>

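The Gremlin console loads several plugins to support features specific to Titan and Gremlin.

First, instantiate the graph object. This object represents the graph we are currently working on. It has methods that help manage the graph, such as adding vertices, creating labels, and handling transactions. Run this command to instantiate the graph object: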

gremlin> graph = TitanFactory.open('conf/titan-cassandra-es.properties')

You will see this output:

==>standardtitangraph[cassandrathrift:[127.0.0.1]]

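The output shows the type of object returned by the TitanFactory.open() method, which is standardtitangraph. It also shows which storage backend the graph uses (cassandrathrift) and that it connects through localhost (127.0.0.1).

The open() method uses the configuration options in the specified properties file to create a new Titan graph or open an existing one. The configuration file holds high-level settings such as the storage backend to use, the caching backend, and a few other options. You can also create your own configuration file and use that instead.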
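To see which options a properties file actually sets before passing it to open(), you can strip out its comments from a regular shell (not the Gremlin console). This is a small sketch; show_active_settings is a hypothetical helper name, and it assumes comment lines start with #.

```shell
# Print only the active key=value lines of a properties file,
# skipping blank lines and lines whose first non-space character is '#'.
show_active_settings() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}

# Usage, from the Titan directory:
#   show_active_settings conf/titan-cassandra-es.properties
```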

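Once the command runs, the graph object is instantiated and stored in the graph variable. To see all the properties and methods available on the graph object, type graph. and then press TAB: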

gremlin> graph.
addVertex(                    assignID(                     buildTransaction()            close()                       
closeTransaction(             commit(                       compute(                      compute()                     
configuration()               containsEdgeLabel(            containsPropertyKey(          containsRelationType(         
containsVertexLabel(          edgeMultiQuery(               edgeQuery(                    edges(                        
features()                    getEdgeLabel(                 getOrCreateEdgeLabel(         getOrCreatePropertyKey(       
...
...

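In a graph database, you mostly query data by traversing the graph, rather than by retrieving records with joins and indexes as in a relational database. To traverse the graph, we need a graph traversal source obtained from the graph reference variable. The following command creates it.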

gremlin> g = graph.traversal()

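You use the g variable to perform traversals. Let's use it to create a couple of vertices. Vertices are like rows in SQL. Each vertex has a vertex type, or label, and associated properties, which are analogous to fields in SQL. Enter the following commands: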

gremlin> sammy = g.addV(label, 'fish', 'name', 'Sammy', 'residence', 'The Deep Blue Sea').next()
gremlin> company = g.addV(label, 'company', 'name', 'DigitalOcean', 'website', 'www.digitalocean.com').next()

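In this example, we created two vertices, labeled fish and company. We also defined two properties on each: name and residence on the first vertex, and name and website on the second. Now let's use the sammy and company variables to access these vertices.

For example, to list all the properties of the first vertex, run this command: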

gremlin> g.V(sammy).properties()

The output looks like this:

==>vp[name->Sammy]
==>vp[residence->The Deep Blue Sea]

You can also add new properties to a vertex. For example, we can add a color:

gremlin> g.V(sammy).property('color', 'blue')

Now let's define a relationship between these two vertices. We do this by creating an edge between them.

gremlin> company.addEdge('hasMascot', sammy, 'status', 'high')

This creates an edge labeled hasMascot from company to sammy, with a status property whose value is high.

Now let's look up the company's mascot:

gremlin> g.V(company).out('hasMascot')

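This follows the outgoing hasMascot edges from the company vertex and returns the vertices they point to. We can also go the other way and find the company whose mascot is sammy:

gremlin> g.V(sammy).in('hasMascot')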

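Press CTRL+C to exit the Gremlin console.

Now let's add some custom configuration options for Titan.

Step 3 - Configuring Titan

Let's create a new configuration file that you can use to define all of your custom configuration options for Titan.

Titan has a pluggable storage layer; rather than handling data storage itself, Titan delegates it to another database. Titan currently offers three choices of storage database: Cassandra, HBase, and BerkeleyDB. In this tutorial, we use Cassandra as the storage engine because of its high scalability and high availability.

First, create the configuration file: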

$ nano conf/gremlin-server/custom-titan-config.properties

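Add these lines to define the storage backend and where it can be reached. Setting the storage backend to cassandrathrift means that we are using Cassandra for storage, through Cassandra's Thrift interface: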

conf/gremlin-server/custom-titan-config.properties

storage.backend=cassandrathrift
storage.hostname=localhost

Then add these three lines to define the search backend. We use elasticsearch as the search backend.

conf/gremlin-server/custom-titan-config.properties

...
index.search.backend=elasticsearch
index.search.hostname=localhost
index.search.elasticsearch.client-only=true

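The third line tells Titan to connect to ElasticSearch as a thin client that stores no data. Setting it to false instead creates a regular ElasticSearch cluster node that can store data.

Finally, add this line to tell Gremlin Server what type of graph it will be serving.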

conf/gremlin-server/custom-titan-config.properties

...
gremlin.graph=com.thinkaurelius.titan.core.TitanFactory

The conf directory contains a number of sample configuration files that you can look at for reference.

Save the file and exit the editor.
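Putting the three fragments together, the finished file holds six settings. As an alternative to typing them into an editor, the sketch below writes the same lines with a here-document. The helper name write_custom_titan_config is illustrative and is not something Titan provides; note that it overwrites the target file.

```shell
# Write the six settings from this step to the file given as $1,
# replacing any existing content.
write_custom_titan_config() {
    target_file=$1
    cat > "$target_file" <<'EOF'
storage.backend=cassandrathrift
storage.hostname=localhost
index.search.backend=elasticsearch
index.search.hostname=localhost
index.search.elasticsearch.client-only=true
gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
EOF
}

# Usage, from the Titan directory:
#   write_custom_titan_config conf/gremlin-server/custom-titan-config.properties
```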

We need to add this new configuration file to Gremlin Server. Open Gremlin Server's configuration file.

$ nano conf/gremlin-server/gremlin-server.yaml

Navigate to the graphs section and find this line:

conf/gremlin-server/gremlin-server.yaml

..
 graph: conf/gremlin-server/titan-berkeleyje-server.properties}
..

Replace it with:

conf/gremlin-server/gremlin-server.yaml

..
 graph: conf/gremlin-server/custom-titan-config.properties}
..

Save and exit the file.
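The same one-line change can be made without an editor. The sketch below keeps a backup copy and rewrites the path with sed. The helper name use_custom_graph_config is hypothetical, and the sketch assumes the stock titan-berkeleyje-server.properties path appears in the file exactly as shown above.

```shell
# Point Gremlin Server at the custom properties file.
# $1 is the path to gremlin-server.yaml; the original is kept as "$1.bak".
use_custom_graph_config() {
    yaml_file=$1
    cp "$yaml_file" "$yaml_file.bak" || return 1
    # Read from the backup and write the edited text back to the original path.
    sed 's|conf/gremlin-server/titan-berkeleyje-server\.properties|conf/gremlin-server/custom-titan-config.properties|' \
        "$yaml_file.bak" > "$yaml_file"
}

# Usage, from the Titan directory:
#   use_custom_graph_config conf/gremlin-server/gremlin-server.yaml
```

If the result is not what you expected, copying the .bak file back over the original restores the previous configuration.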

Now stop and restart Titan.

$ ./bin/titan.sh stop
$ ./bin/titan.sh start

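Now that we have a custom configuration in place, let's configure Titan to run as a service.

Step 4 - Managing Titan with Systemd

We should make sure Titan starts automatically every time our server boots.

To set that up, we'll create a Systemd unit file for Titan so that we can manage it.

First, create a file for our application in the /etc/systemd/system directory, using the .service extension: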

$ sudo nano /etc/systemd/system/titan.service

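The [Unit] section specifies our service's metadata and dependencies, including a description of the service and when it should start.

Add this configuration to the file: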

/etc/systemd/system/titan.service

[Unit]
Description=The Titan database
After=network.target

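We specify that the service should start after the network target has been reached. In other words, this service starts only once networking is ready.

After the [Unit] section, we define the [Service] section, which specifies how to start the service. Add this to the configuration file: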

/etc/systemd/system/titan.service

[Service]
User=sammy
Group=www-data
Type=forking
Environment="PATH=/home/sammy/titan-1.0.0-hadoop1/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
WorkingDirectory=/home/sammy/titan-1.0.0-hadoop1/
ExecStart=/home/sammy/titan-1.0.0-hadoop1/bin/titan.sh start
ExecStop=/home/sammy/titan-1.0.0-hadoop1/bin/titan.sh stop

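We first define the user and group that the service runs as. Then we define the type of service. By default, the type is assumed to be simple. Because the startup script we use to start Titan launches other child programs, we set the service type to forking.

We then specify the PATH environment variable, Titan's working directory, and the commands to run. We assign the command that starts Titan to ExecStart and the command that stops it to ExecStop.

Finally, we add the [Install] section, which looks like this: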