C/C++中的字典树数据结构

Trie数据结构作为一个动态数组的容器。在本文章中,我们将看一下如何在C/C++中实现一个Trie。

这是基于树数据结构的,但不一定存储键。在这里,每个节点只有一个值,该值是根据位置定义的。

所以,每个节点的价值表示前缀,因为它是经过一系列匹配后从根节点出发的点。

我们把这样的匹配称为前缀匹配。因此,我们可以使用字典树(Tries)来存储字典中的词语!

例如,在上图中,根节点为空,因此每个字符串都匹配根节点。节点7匹配了一个以to为前缀的字符串,而节点3匹配了一个以tea为前缀的字符串。

在这里,键只存储在叶子节点的位置上,因为前缀匹配会停在任何叶子节点。因此,任何非叶子节点都不存储整个字符串,而只存储前缀匹配的字符。

出于这些原因,尝试被称为前缀树。现在让我们从程序员的角度来理解如何实现这种数据结构。


在C/C++中实现一个Trie数据结构。

让我们首先写下Trie结构。Trie节点主要有两个组成部分。

  • It’s children
  • A marker to indicate a leaf node.

然而,由于我们还将打印Trie,如果我们能够在数据部分中存储一个附加属性,那将更容易。

那么让我们来定义TrieNode的结构。在这里,由于我只会存储小写字母,我将将其实现为一棵26叉树。因此,每个节点将有指向26个孩子的指针。

// The number of children for each node
// We will construct a N-ary tree and make it
// a Trie
// Since we have 26 english letters, we need
// 26 children per node
#define N 26

typedef struct TrieNode TrieNode;

struct TrieNode {
    // The Trie Node Structure
    // Each node has N children, starting from the root
    // and a flag to check if it's a leaf node
    char data; // Storing for printing purposes only
    TrieNode* children[N];
    int is_leaf;
};

既然我们已经定义了结构,让我们编写函数来初始化TrieNode,并且还要释放其指针。

TrieNode* make_trienode(char data) {
    // Allocate memory for a TrieNode
    TrieNode* node = (TrieNode*) calloc (1, sizeof(TrieNode));
    for (int i=0; i<N; i++)
        node->children[i] = NULL;
    node->is_leaf = 0;
    node->data = data;
    return node;
}

void free_trienode(TrieNode* node) {
    // Free the trienode sequence
    for(int i=0; i<N; i++) {
        if (node->children[i] != NULL) {
            free_trienode(node->children[i]);
        }
        else {
            continue;
        }
    }
    free(node);
}

将一个词插入到 Trie 数据结构中。

现在我们将编写insert_trie()函数,该函数接收根节点(最顶层节点)的指针,并将一个单词插入到Trie中。

插入过程很简单。它逐个字符地迭代单词,并评估其相对位置。

例如,字符b的位置是1,所以它将成为第二个子元素。

for (int i=0; word[i] != '\0'; i++) {
    // Get the relative position in the alphabet list
    int idx = (int) word[i] - 'a';
    if (temp->children[idx] == NULL) {
        // If the corresponding child doesn't exist,
        // simply create that child!
        temp->children[idx] = make_trienode(word[i]);
    }
    else {
        // Do nothing. The node already exists
    }

我们将逐个字符匹配前缀,并在必要时简单地初始化一个节点。

否则,我们只需要一直沿着链条向下移动,直到匹配所有字符。

temp = temp->children[idx];

最后,我们将插入所有不匹配的字符,并将更新后的Trie返回。

TrieNode* insert_trie(TrieNode* root, char* word) {
    // Inserts the word onto the Trie
    // ASSUMPTION: The word only has lower case characters
    TrieNode* temp = root;

    for (int i=0; word[i] != '\0'; i++) {
        // Get the relative position in the alphabet list
        int idx = (int) word[i] - 'a';
        if (temp->children[idx] == NULL) {
            // If the corresponding child doesn't exist,
            // simply create that child!
            temp->children[idx] = make_trienode(word[i]);
        }
        else {
            // Do nothing. The node already exists
        }
        // Go down a level, to the child referenced by idx
        // since we have a prefix match
        temp = temp->children[idx];
    }
    // At the end of the word, mark this node as the leaf node
    temp->is_leaf = 1;
    return root;
}

在 Trie 中搜索一个单词。

现在我们已经在Trie上实施了插入操作,让我们来看一下如何搜索给定的模式。

我们将尝试逐个字符匹配字符串,采用与上述相同的前缀匹配策略。

如果我们已经到达链的末尾,但仍然没有找到匹配项,这意味着字符串不存在,因为没有完成完整的前缀匹配。

为了使其正确返回,我们的模式必须完全匹配,并且在匹配完成后,我们必须到达一个叶节点。

int search_trie(TrieNode* root, char* word)
{
    // Searches for word in the Trie
    TrieNode* temp = root;

    for(int i=0; word[i]!='\0'; i++)
    {
        int position = word[i] - 'a';
        if (temp->children[position] == NULL)
            return 0;
        temp = temp->children[position];
    }
    if (temp != NULL && temp->is_leaf == 1)
        return 1;
    return 0;
}

好的,所以我们已经完成了插入和搜索的步骤。

为了测试它,我们首先编写一个print_tree()函数,用于打印每个节点的数据。

void print_trie(TrieNode* root) {
    // Prints the nodes of the trie
    if (!root)
        return;
    TrieNode* temp = root;
    printf("%c -> ", temp->data);
    for (int i=0; i<N; i++) {
        print_trie(temp->children[i]); 
    }
}

既然我们已经做到了这一点,让我们只需运行到目前为止的完整程序,以检查它是否正常工作。

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// The number of children for each node
// We will construct a N-ary tree and make it
// a Trie
// Since we have 26 english letters, we need
// 26 children per node
#define N 26

typedef struct TrieNode TrieNode;

struct TrieNode {
    // The Trie Node Structure
    // Each node has N children, starting from the root
    // and a flag to check if it's a leaf node
    char data; // Storing for printing purposes only
    TrieNode* children[N];
    int is_leaf;
};

TrieNode* make_trienode(char data) {
    // Allocate memory for a TrieNode
    TrieNode* node = (TrieNode*) calloc (1, sizeof(TrieNode));
    for (int i=0; i<N; i++)
        node->children[i] = NULL;
    node->is_leaf = 0;
    node->data = data;
    return node;
}

void free_trienode(TrieNode* node) {
    // Free the trienode sequence
    for(int i=0; i<N; i++) {
        if (node->children[i] != NULL) {
            free_trienode(node->children[i]);
        }
        else {
            continue;
        }
    }
    free(node);
}

TrieNode* insert_trie(TrieNode* root, char* word) {
    // Inserts the word onto the Trie
    // ASSUMPTION: The word only has lower case characters
    TrieNode* temp = root;

    for (int i=0; word[i] != '\0'; i++) {
        // Get the relative position in the alphabet list
        int idx = (int) word[i] - 'a';
        if (temp->children[idx] == NULL) {
            // If the corresponding child doesn't exist,
            // simply create that child!
            temp->children[idx] = make_trienode(word[i]);
        }
        else {
            // Do nothing. The node already exists
        }
        // Go down a level, to the child referenced by idx
        // since we have a prefix match
        temp = temp->children[idx];
    }
    // At the end of the word, mark this node as the leaf node
    temp->is_leaf = 1;
    return root;
}

int search_trie(TrieNode* root, char* word)
{
    // Searches for word in the Trie
    TrieNode* temp = root;

    for(int i=0; word[i]!='\0'; i++)
    {
        int position = word[i] - 'a';
        if (temp->children[position] == NULL)
            return 0;
        temp = temp->children[position];
    }
    if (temp != NULL && temp->is_leaf == 1)
        return 1;
    return 0;
}

void print_trie(TrieNode* root) {
    // Prints the nodes of the trie
    if (!root)
        return;
    TrieNode* temp = root;
    printf("%c -> ", temp->data);
    for (int i=0; i<N; i++) {
        print_trie(temp->children[i]); 
    }
}

void print_search(TrieNode* root, char* word) {
    printf("Searching for %s: ", word);
    if (search_trie(root, word) == 0)
        printf("Not Found\n");
    else
        printf("Found!\n");
}

int main() {
    // Driver program for the Trie Data Structure Operations
    TrieNode* root = make_trienode('\0');
    root = insert_trie(root, "hello");
    root = insert_trie(root, "hi");
    root = insert_trie(root, "teabag");
    root = insert_trie(root, "teacan");
    print_search(root, "tea");
    print_search(root, "teabag");
    print_search(root, "teacan");
    print_search(root, "hi");
    print_search(root, "hey");
    print_search(root, "hello");
    print_trie(root);
    free_trienode(root);
    return 0;
}

在使用你的gcc编译器编译后,你将得到以下输出。

Searching for tea: Not Found
Searching for teabag: Found!
Searching for teacan: Found!
Searching for hi: Found!
Searching for hey: Not Found
Searching for hello: Found!
 -> h -> e -> l -> l -> o -> i -> t -> e -> a -> b -> a -> g -> c -> a -> n -> 

虽然它的打印方式可能显而易见,但线索就在于查看print_tree()方法。因为该方法首先打印当前节点,然后是其所有子节点,所以顺序确实表明了打印顺序。

所以,现在我们来实现删除方法吧!

从Trie数据结构中删除一个单词。

这个方法比另外两个方法稍微复杂一些。

由于我们没有明确存储键和值的方式,因此不存在任何格式算法。

然而,针对本文的目的,我只会处理那些待删除的词是叶节点的情况。也就是说,它必须完全匹配前缀,直至达到一个叶节点。

那么让我们开始吧。我们删除函数的签名将是:

TrieNode* delete_trie(TrieNode* root, char* word);

在前面提到过,只有当匹配的前缀达到叶节点时,才会执行删除操作。因此,让我们编写另一个函数is_leaf_node()来为我们执行此操作。

int is_leaf_node(TrieNode* root, char* word) {
    // Checks if the prefix match of word and root
    // is a leaf node
    TrieNode* temp = root;
    for (int i=0; word[i]; i++) {
        int position = (int) word[i] - 'a';
        if (temp->children[position]) {
            temp = temp->children[position];
        }
    }
    return temp->is_leaf;
}

好的,完成了之后,让我们来看一下我们的delete_trie()方法的草稿。

TrieNode* delete_trie(TrieNode* root, char* word) {
    // Will try to delete the word sequence from the Trie only it 
    // ends up in a leaf node
    if (!root)
        return NULL;
    if (!word || word[0] == '\0')
        return root;
    // If the node corresponding to the match is not a leaf node,
    // we stop
    if (!is_leaf_node(root, word)) {
        return root;
    }
    // TODO
}

经过这个检查之后,我们现在知道我们的词将以叶节点结束。

但还有另外一种情况需要处理。如果存在另一个单词与当前字符串具有部分前缀匹配的情况怎么办?

例如,在下面的字典树中,单词”tea”和”ted”在”te”之前有部分相同的匹配。

所以, 如果发生这种情况,我们不能简单地从t到a删除整个序列,因为它们是链接在一起的,那么两个链都会被删除!

因此,我们需要找到能够出现这种匹配的最深点。只有在那之后,我们才能删除剩余的链条。

这涉及到查找最长的前缀字符串,所以让我们写另一个函数。

char* longest_prefix(TrieNode* root, char* word);

这将返回Trie中的最长匹配,该匹配并非当前的字词(word)。下面的代码在注释中解释了每个中间步骤。

char* find_longest_prefix(TrieNode* root, char* word) {
    // Finds the longest common prefix substring of word
    // in the Trie
    if (!word || word[0] == '\0')
        return NULL;
    // Length of the longest prefix
    int len = strlen(word);

    // We initially set the longest prefix as the word itself,
    // and try to back-track from the deepest position to
    // a point of divergence, if it exists
    char* longest_prefix = (char*) calloc (len + 1, sizeof(char));
    for (int i=0; word[i] != '\0'; i++)
        longest_prefix[i] = word[i];
    longest_prefix[len] = '\0';

    // If there is no branching from the root, this
    // means that we're matching the original string!
    // This is not what we want!
    int branch_idx  = check_divergence(root, longest_prefix) - 1;
    if (branch_idx >= 0) {
        // There is branching, We must update the position
        // to the longest match and update the longest prefix
        // by the branch index length
        longest_prefix[branch_idx] = '\0';
        longest_prefix = (char*) realloc (longest_prefix, (branch_idx + 1) * sizeof(char));
    }

    return longest_prefix;
}

这里还有一个变化!由于我们试图找到最长的匹配,算法实际上会贪婪地匹配原始字符串本身!这不是我们想要的。

我们希望找到最长的匹配项,但排除原始输入字符串。因此,我们必须使用另一种方法check_divergence() 进行必要的检查。

这将检查从根节点到当前位置是否存在任何分支,并返回最佳匹配的长度,该长度不是输入的长度。

如果我们处于坏链中,对应于原始字符串,那么就不会从该点分叉!所以对于我们来说,使用check_divergence()是避免那个讨厌的点的好方法。

int check_divergence(TrieNode* root, char* word) {
    // Checks if there is branching at the last character of word
 //and returns the largest position in the word where branching occurs
    TrieNode* temp = root;
    int len = strlen(word);
    if (len == 0)
        return 0;
    // We will return the largest index where branching occurs
    int last_index = 0;
    for (int i=0; i < len; i++) {
        int position = word[i] - 'a';
        if (temp->children[position]) {
            // If a child exists at that position
            // we will check if there exists any other child
            // so that branching occurs
            for (int j=0; j<N; j++) {
                if (j != position && temp->children[j]) {
                    // We've found another child! This is a branch.
                    // Update the branch position
                    last_index = i + 1;
                    break;
                }
            }
            // Go to the next child in the sequence
            temp = temp->children[position];
        }
    }
    return last_index;
}

最后,我们确保不会盲目地删除整个链条。所以现在让我们继续使用我们的删除方法。

TrieNode* delete_trie(TrieNode* root, char* word) {
    // Will try to delete the word sequence from the Trie only it 
    // ends up in a leaf node
    if (!root)
        return NULL;
    if (!word || word[0] == '\0')
        return root;
    // If the node corresponding to the match is not a leaf node,
    // we stop
    if (!is_leaf_node(root, word)) {
        return root;
    }
    TrieNode* temp = root;
    // Find the longest prefix string that is not the current word
    char* longest_prefix = find_longest_prefix(root, word);
    //printf("Longest Prefix = %s\n", longest_prefix);
    if (longest_prefix[0] == '\0') {
        free(longest_prefix);
        return root;
    }
    // Keep track of position in the Trie
    int i;
    for (i=0; longest_prefix[i] != '\0'; i++) {
        int position = (int) longest_prefix[i] - 'a';
        if (temp->children[position] != NULL) {
            // Keep moving to the deepest node in the common prefix
            temp = temp->children[position];
        }
        else {
            // There is no such node. Simply return.
            free(longest_prefix);
            return root;
        }
    }
    // Now, we have reached the deepest common node between
    // the two strings. We need to delete the sequence
    // corresponding to word
    int len = strlen(word);
    for (; i < len; i++) {
        int position = (int) word[i] - 'a';
        if (temp->children[position]) {
            // Delete the remaining sequence
            TrieNode* rm_node = temp->children[position];
            temp->children[position] = NULL;
            free_trienode(rm_node);
        }
    }
    free(longest_prefix);
    return root;
}

上述的代码简单地找到了前缀匹配的最深节点,并从Trie树中删除剩余匹配的单词序列,因为这与任何其他的匹配无关。

上述过程的时间复杂度

以下是这些过程的时间复杂度。

Procedure Time Complexity of Implementation
search_trie() O(n) -> n is the length of the input string
insert_trie() O(n) -> n is the length of the input string
delete_trie() O(C*n) -> C is the number of alphabets,
n is the length of the input word

几乎所有情况下,字母的数量是恒定的,因此 delete_trie() 的复杂度也被降低到 O(n)。


字典树数据结构的完整C/C++程序

最终,我们(希望如此)完成了delete_trie()函数。让我们使用我们的驱动程序来检查我们所编写的是否正确。

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// The number of children for each node
// We will construct a N-ary tree and make it
// a Trie
// Since we have 26 english letters, we need
// 26 children per node
#define N 26

typedef struct TrieNode TrieNode;

struct TrieNode {
    // The Trie Node Structure
    // Each node has N children, starting from the root
    // and a flag to check if it's a leaf node
    char data; // Storing for printing purposes only
    TrieNode* children[N];
    int is_leaf;
};

TrieNode* make_trienode(char data) {
    // Allocate memory for a TrieNode
    TrieNode* node = (TrieNode*) calloc (1, sizeof(TrieNode));
    for (int i=0; i<N; i++)
        node->children[i] = NULL;
    node->is_leaf = 0;
    node->data = data;
    return node;
}

void free_trienode(TrieNode* node) {
    // Free the trienode sequence
    for(int i=0; i<N; i++) {
        if (node->children[i] != NULL) {
            free_trienode(node->children[i]);
        }
        else {
            continue;
        }
    }
    free(node);
}

TrieNode* insert_trie(TrieNode* root, char* word) {
    // Inserts the word onto the Trie
    // ASSUMPTION: The word only has lower case characters
    TrieNode* temp = root;

    for (int i=0; word[i] != '\0'; i++) {
        // Get the relative position in the alphabet list
        int idx = (int) word[i] - 'a';
        if (temp->children[idx] == NULL) {
            // If the corresponding child doesn't exist,
            // simply create that child!
            temp->children[idx] = make_trienode(word[i]);
        }
        else {
            // Do nothing. The node already exists
        }
        // Go down a level, to the child referenced by idx
        // since we have a prefix match
        temp = temp->children[idx];
    }
    // At the end of the word, mark this node as the leaf node
    temp->is_leaf = 1;
    return root;
}

int search_trie(TrieNode* root, char* word)
{
    // Searches for word in the Trie
    TrieNode* temp = root;

    for(int i=0; word[i]!='\0'; i++)
    {
        int position = word[i] - 'a';
        if (temp->children[position] == NULL)
            return 0;
        temp = temp->children[position];
    }
    if (temp != NULL && temp->is_leaf == 1)
        return 1;
    return 0;
}

int check_divergence(TrieNode* root, char* word) {
    // Checks if there is branching at the last character of word
    // and returns the largest position in the word where branching occurs
    TrieNode* temp = root;
    int len = strlen(word);
    if (len == 0)
        return 0;
    // We will return the largest index where branching occurs
    int last_index = 0;
    for (int i=0; i < len; i++) {
        int position = word[i] - 'a';
        if (temp->children[position]) {
            // If a child exists at that position
            // we will check if there exists any other child
            // so that branching occurs
            for (int j=0; j<N; j++) {
                if (j != position && temp->children[j]) {
                    // We've found another child! This is a branch.
                    // Update the branch position
                    last_index = i + 1;
                    break;
                }
            }
            // Go to the next child in the sequence
            temp = temp->children[position];
        }
    }
    return last_index;
}

char* find_longest_prefix(TrieNode* root, char* word) {
    // Finds the longest common prefix substring of word
    // in the Trie
    if (!word || word[0] == '\0')
        return NULL;
    // Length of the longest prefix
    int len = strlen(word);

    // We initially set the longest prefix as the word itself,
    // and try to back-tracking from the deepst position to
    // a point of divergence, if it exists
    char* longest_prefix = (char*) calloc (len + 1, sizeof(char));
    for (int i=0; word[i] != '\0'; i++)
        longest_prefix[i] = word[i];
    longest_prefix[len] = '\0';

    // If there is no branching from the root, this
    // means that we're matching the original string!
    // This is not what we want!
    int branch_idx  = check_divergence(root, longest_prefix) - 1;
    if (branch_idx >= 0) {
        // There is branching, We must update the position
        // to the longest match and update the longest prefix
        // by the branch index length
        longest_prefix[branch_idx] = '\0';
        longest_prefix = (char*) realloc (longest_prefix, (branch_idx + 1) * sizeof(char));
    }

    return longest_prefix;
}

int is_leaf_node(TrieNode* root, char* word) {
    // Checks if the prefix match of word and root
    // is a leaf node
    TrieNode* temp = root;
    for (int i=0; word[i]; i++) {
        int position = (int) word[i] - 'a';
        if (temp->children[position]) {
            temp = temp->children[position];
        }
    }
    return temp->is_leaf;
}

TrieNode* delete_trie(TrieNode* root, char* word) {
    // Will try to delete the word sequence from the Trie only it 
    // ends up in a leaf node
    if (!root)
        return NULL;
    if (!word || word[0] == '\0')
        return root;
    // If the node corresponding to the match is not a leaf node,
    // we stop
    if (!is_leaf_node(root, word)) {
        return root;
    }
    TrieNode* temp = root;
    // Find the longest prefix string that is not the current word
    char* longest_prefix = find_longest_prefix(root, word);
    //printf("Longest Prefix = %s\n", longest_prefix);
    if (longest_prefix[0] == '\0') {
        free(longest_prefix);
        return root;
    }
    // Keep track of position in the Trie
    int i;
    for (i=0; longest_prefix[i] != '\0'; i++) {
        int position = (int) longest_prefix[i] - 'a';
        if (temp->children[position] != NULL) {
            // Keep moving to the deepest node in the common prefix
            temp = temp->children[position];
        }
        else {
            // There is no such node. Simply return.
            free(longest_prefix);
            return root;
        }
    }
    // Now, we have reached the deepest common node between
    // the two strings. We need to delete the sequence
    // corresponding to word
    int len = strlen(word);
    for (; i < len; i++) {
        int position = (int) word[i] - 'a';
        if (temp->children[position]) {
            // Delete the remaining sequence
            TrieNode* rm_node = temp->children[position];
            temp->children[position] = NULL;
            free_trienode(rm_node);
        }
    }
    free(longest_prefix);
    return root;
}

void print_trie(TrieNode* root) {
    // Prints the nodes of the trie
    if (!root)
        return;
    TrieNode* temp = root;
    printf("%c -> ", temp->data);
    for (int i=0; i<N; i++) {
        print_trie(temp->children[i]); 
    }
}

void print_search(TrieNode* root, char* word) {
    printf("Searching for %s: ", word);
    if (search_trie(root, word) == 0)
        printf("Not Found\n");
    else
        printf("Found!\n");
}

int main() {
    // Driver program for the Trie Data Structure Operations
    TrieNode* root = make_trienode('\0');
    root = insert_trie(root, "hello");
    root = insert_trie(root, "hi");
    root = insert_trie(root, "teabag");
    root = insert_trie(root, "teacan");
    print_search(root, "tea");
    print_search(root, "teabag");
    print_search(root, "teacan");
    print_search(root, "hi");
    print_search(root, "hey");
    print_search(root, "hello");
    print_trie(root);
    printf("\n");
    root = delete_trie(root, "hello");
    printf("After deleting 'hello'...\n");
    print_trie(root);
    printf("\n");
    root = delete_trie(root, "teacan");
    printf("After deleting 'teacan'...\n");
    print_trie(root);
    printf("\n");
    free_trienode(root);
    return 0;
}

输出

Searching for tea: Not Found
Searching for teabag: Found!
Searching for teacan: Found!
Searching for hi: Found!
Searching for hey: Not Found
Searching for hello: Found!
 -> h -> e -> l -> l -> o -> i -> t -> e -> a -> b -> a -> g -> c -> a -> n -> 
After deleting 'hello'...
 -> h -> i -> t -> e -> a -> b -> a -> g -> c -> a -> n -> 
After deleting 'teacan'...
 -> h -> i -> t -> e -> a -> b -> a -> g -> 

通过这个结束,我们已经完成了在C/C++中实现Trie数据结构的学习。我知道这是一篇很长的阅读,但希望你能正确理解如何应用这些方法!


下载该代码

你可以在我上传到Github Gist上找到完整的代码。它可能不是最高效的代码,但我已尽力确保它能正常工作。

如果你有任何问题或建议,欢迎在下方的评论区提出!


参考资料

  • Wikipedia Article on Tries

推荐阅读:

如果你对类似的话题感兴趣,可以参考《数据结构和算法》部分,其中包括哈希表和图论等主题。


发表回复 0

Your email address will not be published. Required fields are marked *


广告
将在 10 秒后关闭
bannerAds