Q: ArangoDB: insert as a function of query by example

Stack Overflow user

Asked 2016-10-18 15:47:58

1 answer, 332 views, 0 followers, 6 votes

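Part of my graph is constructed with a giant join between two large collections, and I run it every time I add documents to either collection. The query is based on an older post.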

FOR fromItem IN fromCollection
    FOR toItem IN toCollection
        FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
        INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {}} INTO edgeCollection

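This takes about 55,000 seconds to complete on my dataset. I would absolutely welcome suggestions for making it faster.

But I have two related questions:

1. I need an upsert. Normally UPSERT would be fine, but in this case it doesn't help me, because I have no way of knowing the key up front. To get the key, I would need to query by example to find the key of the otherwise identical, existing edge. That seems reasonable as long as it doesn't hurt my performance, but I don't know how to construct the query conditionally in AQL so that it inserts the edge if an equivalent edge does not exist yet and does nothing if one does. How can I do this?
2. I need to run this every time data is added to either collection, so I need a way to run it only on the newest data instead of joining the entire collections. How can I write AQL that joins only the newly inserted records? They are added with arangoimp, and I have no guarantee about the order in which they are updated, so I cannot create the edges at the same time as the nodes. I don't want to spend 55k seconds every time a record is added.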

query-optimization

graph-databases

arangodb

aql

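1 Answer

Stack Overflow user

Answered 2016-10-20 00:47:32

If you run your query as written without any indexes, it has to perform two nested full collection scans, as you can see from the output of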

db._explain(<your query here>);

which shows something like:

  1   SingletonNode                1   * ROOT
  2   EnumerateCollectionNode      3     - FOR fromItem IN fromCollection   /* full collection scan */
  3   EnumerateCollectionNode      9       - FOR toItem IN toCollection   /* full collection scan */
  4   CalculationNode              9         - LET #3 = (fromItem.`fromAttributeValue` == toItem.`toAttributeValue`)   /* simple expression */   /* collections used: fromItem : fromCollection, toItem : toCollection */
  5   FilterNode                   9         - FILTER #3
  ...

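If you create a hash index with

db.toCollection.ensureIndex({"type": "hash", fields: ["toAttributeValue"], unique: false})

then there will be a single full collection scan over fromCollection, and for each item found there is a hash lookup in toCollection, which will be much faster. Everything happens in batches, so this should already improve the situation. db._explain() will then show this: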

  1   SingletonNode                1   * ROOT
  2   EnumerateCollectionNode      3     - FOR fromItem IN fromCollection   /* full collection scan */
  8   IndexNode                    3       - FOR toItem IN toCollection   /* hash index scan */
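
To make the difference between the two plans concrete, here is a minimal in-memory sketch in plain JavaScript. It is not ArangoDB code: the sample documents, the function names nestedScanJoin and hashIndexJoin, and the counters are all made up for illustration. It only mimics the shape of the two plans, comparing every pair versus building one hash table and doing one lookup per fromItem.

```javascript
// Illustration only: plain JavaScript, no ArangoDB involved.
// The sample documents below are invented for this sketch.
const fromCollection = [
  { _id: "fromCollection/1", fromAttributeValue: "a" },
  { _id: "fromCollection/2", fromAttributeValue: "b" },
  { _id: "fromCollection/3", fromAttributeValue: "a" },
];
const toCollection = [
  { _id: "toCollection/1", toAttributeValue: "a" },
  { _id: "toCollection/2", toAttributeValue: "c" },
];

// Mimics two nested full collection scans: every pair is compared.
function nestedScanJoin(fromDocs, toDocs) {
  const edges = [];
  let comparisons = 0;
  for (const f of fromDocs) {
    for (const t of toDocs) {
      comparisons++;
      if (f.fromAttributeValue === t.toAttributeValue) {
        edges.push({ _from: f._id, _to: t._id });
      }
    }
  }
  return { edges, comparisons };
}

// Mimics one full scan plus a hash lookup per scanned document.
function hashIndexJoin(fromDocs, toDocs) {
  // Build the "hash index" on toAttributeValue once (non-unique).
  const index = new Map();
  for (const t of toDocs) {
    const bucket = index.get(t.toAttributeValue) || [];
    bucket.push(t);
    index.set(t.toAttributeValue, bucket);
  }
  const edges = [];
  let lookups = 0;
  for (const f of fromDocs) {
    lookups++;
    for (const t of index.get(f.fromAttributeValue) || []) {
      edges.push({ _from: f._id, _to: t._id });
    }
  }
  return { edges, lookups };
}

const nested = nestedScanJoin(fromCollection, toCollection);
const hashed = hashIndexJoin(fromCollection, toCollection);
console.log(nested.comparisons, hashed.lookups, hashed.edges.length);
```

Both functions produce the same edges. The nested version does one comparison per pair of documents, while the indexed version does one lookup per document in fromCollection, which is why the index matters so much for large collections.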

Working only on recently inserted items in fromCollection is relatively easy: just add a timestamp of the import time to all vertices, and use:

FOR fromItem IN fromCollection
    FILTER fromItem.timeStamp > @lastRun
    FOR toItem IN toCollection
        FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
        INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {}} INTO edgeCollection

And of course put a skiplist index on the timeStamp attribute in fromCollection.
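
One possible way to wire this up from arangosh is sketched below. The function name linkNewFromItems is hypothetical, and the sketch assumes timeStamp holds a number (for example milliseconds since the epoch) that your import writes into every vertex. The db object is passed in as a parameter so the function can also be exercised with a stub; in arangosh you would pass the global db.

```javascript
// Sketch only: linkNewFromItems is a made-up helper name.
// Assumption: every vertex carries a numeric timeStamp set at import time.
const NEW_FROM_ITEMS_QUERY = `
FOR fromItem IN fromCollection
    FILTER fromItem.timeStamp > @lastRun
    FOR toItem IN toCollection
        FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
        INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {} } INTO edgeCollection
`;

function linkNewFromItems(db, lastRun) {
  // Sorted (skiplist) index so the range filter on timeStamp can use an
  // index; ensureIndex is meant to return the existing index if present.
  db.fromCollection.ensureIndex({ type: "skiplist", fields: ["timeStamp"] });

  // @lastRun is supplied as a bind variable rather than concatenated
  // into the query string.
  return db._query(NEW_FROM_ITEMS_QUERY, { lastRun: lastRun });
}
```

In arangosh this would be called as linkNewFromItems(db, lastRun), where lastRun is whatever value you stored when the previous run started.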

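This should work well for discovering new vertices in fromCollection. However, it will "overlook" new vertices in toCollection that are linked to old vertices in fromCollection.

You can discover these by interchanging the roles of fromCollection and toCollection in your query (do not forget the index on fromAttributeValue in fromCollection) and remembering to only put in edges if the from vertex is old, as in: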

FOR toItem IN toCollection
    FILTER toItem.timeStamp > @lastRun
    FOR fromItem IN fromCollection
        FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
        FILTER fromItem.timeStamp <= @lastRun 
        INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {}} INTO edgeCollection
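
To check the reasoning behind running both passes with the same @lastRun, here is a small in-memory sketch in plain JavaScript. It is not ArangoDB code: the documents, timestamps, and the function names passNewFrom and passNewTo are invented for illustration. The first function mirrors the query that starts from new fromCollection items, the second mirrors the swapped query with its extra filter on old from vertices.

```javascript
// Illustration only: plain JavaScript with made-up data.
const lastRun = 3;
const fromCollection = [
  { _id: "fromCollection/old", fromAttributeValue: "a", timeStamp: 1 },
  { _id: "fromCollection/new", fromAttributeValue: "a", timeStamp: 5 },
];
const toCollection = [
  { _id: "toCollection/old", toAttributeValue: "a", timeStamp: 1 },
  { _id: "toCollection/new", toAttributeValue: "a", timeStamp: 5 },
];

// Pass 1: new from vertices joined against all to vertices.
function passNewFrom(fromDocs, toDocs, lastRun) {
  const edges = [];
  for (const f of fromDocs) {
    if (!(f.timeStamp > lastRun)) continue;
    for (const t of toDocs) {
      if (f.fromAttributeValue === t.toAttributeValue) {
        edges.push(f._id + " -> " + t._id);
      }
    }
  }
  return edges;
}

// Pass 2: new to vertices joined against OLD from vertices only,
// so pairs of two new vertices are not inserted a second time.
function passNewTo(fromDocs, toDocs, lastRun) {
  const edges = [];
  for (const t of toDocs) {
    if (!(t.timeStamp > lastRun)) continue;
    for (const f of fromDocs) {
      if (f.fromAttributeValue === t.toAttributeValue && f.timeStamp <= lastRun) {
        edges.push(f._id + " -> " + t._id);
      }
    }
  }
  return edges;
}

const edges = passNewFrom(fromCollection, toCollection, lastRun)
  .concat(passNewTo(fromCollection, toCollection, lastRun));
console.log(edges);
```

With this data the two passes together yield three edges: new to old, new to new, and old to new. The old to old pair is skipped because it was handled by an earlier run, and the new to new pair appears only once because the second pass requires the from vertex to be old.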