C/C++中的字典树数据结构
Trie数据结构作为一个动态数组的容器。在本文章中,我们将看一下如何在C/C++中实现一个Trie。
这是基于树数据结构的,但不一定存储键。在这里,每个节点只有一个值,该值是根据位置定义的。

所以,每个节点的价值表示前缀,因为它是经过一系列匹配后从根节点出发的点。
我们把这样的匹配称为前缀匹配。因此,我们可以使用字典树(Tries)来存储字典中的词语!
例如,在上图中,根节点为空,因此每个字符串都匹配根节点。节点7匹配了一个以to为前缀的字符串,而节点3匹配了一个以tea为前缀的字符串。
在这里,键只存储在叶子节点的位置上,因为前缀匹配会停在任何叶子节点。因此,任何非叶子节点都不存储整个字符串,而只存储前缀匹配的字符。
出于这些原因,尝试被称为前缀树。现在让我们从程序员的角度来理解如何实现这种数据结构。
在C/C++中实现一个Trie数据结构。
让我们首先写下Trie结构。Trie节点主要有两个组成部分。
- It’s children
- A marker to indicate a leaf node.
然而,由于我们还将打印Trie,如果我们能够在数据部分中存储一个附加属性,那将更容易。
那么让我们来定义TrieNode的结构。在这里,由于我只会存储小写字母,我将将其实现为一棵26叉树。因此,每个节点将有指向26个孩子的指针。
// The number of children for each node
// We will construct a N-ary tree and make it
// a Trie
// Since we have 26 english letters, we need
// 26 children per node
#define N 26
typedef struct TrieNode TrieNode;
struct TrieNode {
// The Trie Node Structure
// Each node has N children, starting from the root
// and a flag to check if it's a leaf node
char data; // Storing for printing purposes only
TrieNode* children[N];
int is_leaf;
};
既然我们已经定义了结构,让我们编写函数来初始化TrieNode,并且还要释放其指针。
TrieNode* make_trienode(char data) {
// Allocate memory for a TrieNode
TrieNode* node = (TrieNode*) calloc (1, sizeof(TrieNode));
for (int i=0; i<N; i++)
node->children[i] = NULL;
node->is_leaf = 0;
node->data = data;
return node;
}
void free_trienode(TrieNode* node) {
// Free the trienode sequence
for(int i=0; i<N; i++) {
if (node->children[i] != NULL) {
free_trienode(node->children[i]);
}
else {
continue;
}
}
free(node);
}
将一个词插入到 Trie 数据结构中。
现在我们将编写insert_trie()函数,该函数接收根节点(最顶层节点)的指针,并将一个单词插入到Trie中。
插入过程很简单。它逐个字符地迭代单词,并评估其相对位置。
例如,字符b的位置是1,所以它将成为第二个子元素。
for (int i=0; word[i] != '\0'; i++) {
// Get the relative position in the alphabet list
int idx = (int) word[i] - 'a';
if (temp->children[idx] == NULL) {
// If the corresponding child doesn't exist,
// simply create that child!
temp->children[idx] = make_trienode(word[i]);
}
else {
// Do nothing. The node already exists
}
我们将逐个字符匹配前缀,并在必要时简单地初始化一个节点。
否则,我们只需要一直沿着链条向下移动,直到匹配所有字符。
temp = temp->children[idx];
最后,我们将插入所有不匹配的字符,并将更新后的Trie返回。
TrieNode* insert_trie(TrieNode* root, char* word) {
// Inserts the word onto the Trie
// ASSUMPTION: The word only has lower case characters
TrieNode* temp = root;
for (int i=0; word[i] != '\0'; i++) {
// Get the relative position in the alphabet list
int idx = (int) word[i] - 'a';
if (temp->children[idx] == NULL) {
// If the corresponding child doesn't exist,
// simply create that child!
temp->children[idx] = make_trienode(word[i]);
}
else {
// Do nothing. The node already exists
}
// Go down a level, to the child referenced by idx
// since we have a prefix match
temp = temp->children[idx];
}
// At the end of the word, mark this node as the leaf node
temp->is_leaf = 1;
return root;
}
在 Trie 中搜索一个单词。
现在我们已经在Trie上实施了插入操作,让我们来看一下如何搜索给定的模式。
我们将尝试逐个字符匹配字符串,采用与上述相同的前缀匹配策略。
如果我们已经到达链的末尾,但仍然没有找到匹配项,这意味着字符串不存在,因为没有完成完整的前缀匹配。
为了使其正确返回,我们的模式必须完全匹配,并且在匹配完成后,我们必须到达一个叶节点。
int search_trie(TrieNode* root, char* word)
{
// Searches for word in the Trie
TrieNode* temp = root;
for(int i=0; word[i]!='\0'; i++)
{
int position = word[i] - 'a';
if (temp->children[position] == NULL)
return 0;
temp = temp->children[position];
}
if (temp != NULL && temp->is_leaf == 1)
return 1;
return 0;
}
好的,所以我们已经完成了插入和搜索的步骤。
为了测试它,我们首先编写一个print_tree()函数,用于打印每个节点的数据。
void print_trie(TrieNode* root) {
// Prints the nodes of the trie
if (!root)
return;
TrieNode* temp = root;
printf("%c -> ", temp->data);
for (int i=0; i<N; i++) {
print_trie(temp->children[i]);
}
}
既然我们已经做到了这一点,让我们只需运行到目前为止的完整程序,以检查它是否正常工作。
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// The number of children for each node
// We will construct a N-ary tree and make it
// a Trie
// Since we have 26 english letters, we need
// 26 children per node
#define N 26
typedef struct TrieNode TrieNode;
struct TrieNode {
// The Trie Node Structure
// Each node has N children, starting from the root
// and a flag to check if it's a leaf node
char data; // Storing for printing purposes only
TrieNode* children[N];
int is_leaf;
};
TrieNode* make_trienode(char data) {
// Allocate memory for a TrieNode
TrieNode* node = (TrieNode*) calloc (1, sizeof(TrieNode));
for (int i=0; i<N; i++)
node->children[i] = NULL;
node->is_leaf = 0;
node->data = data;
return node;
}
void free_trienode(TrieNode* node) {
// Free the trienode sequence
for(int i=0; i<N; i++) {
if (node->children[i] != NULL) {
free_trienode(node->children[i]);
}
else {
continue;
}
}
free(node);
}
TrieNode* insert_trie(TrieNode* root, char* word) {
// Inserts the word onto the Trie
// ASSUMPTION: The word only has lower case characters
TrieNode* temp = root;
for (int i=0; word[i] != '\0'; i++) {
// Get the relative position in the alphabet list
int idx = (int) word[i] - 'a';
if (temp->children[idx] == NULL) {
// If the corresponding child doesn't exist,
// simply create that child!
temp->children[idx] = make_trienode(word[i]);
}
else {
// Do nothing. The node already exists
}
// Go down a level, to the child referenced by idx
// since we have a prefix match
temp = temp->children[idx];
}
// At the end of the word, mark this node as the leaf node
temp->is_leaf = 1;
return root;
}
int search_trie(TrieNode* root, char* word)
{
// Searches for word in the Trie
TrieNode* temp = root;
for(int i=0; word[i]!='\0'; i++)
{
int position = word[i] - 'a';
if (temp->children[position] == NULL)
return 0;
temp = temp->children[position];
}
if (temp != NULL && temp->is_leaf == 1)
return 1;
return 0;
}
void print_trie(TrieNode* root) {
// Prints the nodes of the trie
if (!root)
return;
TrieNode* temp = root;
printf("%c -> ", temp->data);
for (int i=0; i<N; i++) {
print_trie(temp->children[i]);
}
}
void print_search(TrieNode* root, char* word) {
printf("Searching for %s: ", word);
if (search_trie(root, word) == 0)
printf("Not Found\n");
else
printf("Found!\n");
}
int main() {
// Driver program for the Trie Data Structure Operations
TrieNode* root = make_trienode('\0');
root = insert_trie(root, "hello");
root = insert_trie(root, "hi");
root = insert_trie(root, "teabag");
root = insert_trie(root, "teacan");
print_search(root, "tea");
print_search(root, "teabag");
print_search(root, "teacan");
print_search(root, "hi");
print_search(root, "hey");
print_search(root, "hello");
print_trie(root);
free_trienode(root);
return 0;
}
在使用你的gcc编译器编译后,你将得到以下输出。
Searching for tea: Not Found
Searching for teabag: Found!
Searching for teacan: Found!
Searching for hi: Found!
Searching for hey: Not Found
Searching for hello: Found!
-> h -> e -> l -> l -> o -> i -> t -> e -> a -> b -> a -> g -> c -> a -> n ->
虽然它的打印方式可能显而易见,但线索就在于查看print_tree()方法。因为该方法首先打印当前节点,然后是其所有子节点,所以顺序确实表明了打印顺序。
所以,现在我们来实现删除方法吧!
从Trie数据结构中删除一个单词。
这个方法比另外两个方法稍微复杂一些。
由于我们没有明确存储键和值的方式,因此不存在任何格式算法。
然而,针对本文的目的,我只会处理那些待删除的词是叶节点的情况。也就是说,它必须完全匹配前缀,直至达到一个叶节点。
那么让我们开始吧。我们删除函数的签名将是:
TrieNode* delete_trie(TrieNode* root, char* word);
在前面提到过,只有当匹配的前缀达到叶节点时,才会执行删除操作。因此,让我们编写另一个函数is_leaf_node()来为我们执行此操作。
int is_leaf_node(TrieNode* root, char* word) {
// Checks if the prefix match of word and root
// is a leaf node
TrieNode* temp = root;
for (int i=0; word[i]; i++) {
int position = (int) word[i] - 'a';
if (temp->children[position]) {
temp = temp->children[position];
}
}
return temp->is_leaf;
}
好的,完成了之后,让我们来看一下我们的delete_trie()方法的草稿。
TrieNode* delete_trie(TrieNode* root, char* word) {
// Will try to delete the word sequence from the Trie only it
// ends up in a leaf node
if (!root)
return NULL;
if (!word || word[0] == '\0')
return root;
// If the node corresponding to the match is not a leaf node,
// we stop
if (!is_leaf_node(root, word)) {
return root;
}
// TODO
}
经过这个检查之后,我们现在知道我们的词将以叶节点结束。
但还有另外一种情况需要处理。如果存在另一个单词与当前字符串具有部分前缀匹配的情况怎么办?
例如,在下面的字典树中,单词”tea”和”ted”在”te”之前有部分相同的匹配。

所以, 如果发生这种情况,我们不能简单地从t到a删除整个序列,因为它们是链接在一起的,那么两个链都会被删除!
因此,我们需要找到能够出现这种匹配的最深点。只有在那之后,我们才能删除剩余的链条。
这涉及到查找最长的前缀字符串,所以让我们写另一个函数。
char* longest_prefix(TrieNode* root, char* word);
这将返回Trie中的最长匹配,该匹配并非当前的字词(word)。下面的代码在注释中解释了每个中间步骤。
char* find_longest_prefix(TrieNode* root, char* word) {
// Finds the longest common prefix substring of word
// in the Trie
if (!word || word[0] == '\0')
return NULL;
// Length of the longest prefix
int len = strlen(word);
// We initially set the longest prefix as the word itself,
// and try to back-track from the deepest position to
// a point of divergence, if it exists
char* longest_prefix = (char*) calloc (len + 1, sizeof(char));
for (int i=0; word[i] != '\0'; i++)
longest_prefix[i] = word[i];
longest_prefix[len] = '\0';
// If there is no branching from the root, this
// means that we're matching the original string!
// This is not what we want!
int branch_idx = check_divergence(root, longest_prefix) - 1;
if (branch_idx >= 0) {
// There is branching, We must update the position
// to the longest match and update the longest prefix
// by the branch index length
longest_prefix[branch_idx] = '\0';
longest_prefix = (char*) realloc (longest_prefix, (branch_idx + 1) * sizeof(char));
}
return longest_prefix;
}
这里还有一个变化!由于我们试图找到最长的匹配,算法实际上会贪婪地匹配原始字符串本身!这不是我们想要的。
我们希望找到最长的匹配项,但排除原始输入字符串。因此,我们必须使用另一种方法check_divergence() 进行必要的检查。
这将检查从根节点到当前位置是否存在任何分支,并返回最佳匹配的长度,该长度不是输入的长度。
如果我们处于坏链中,对应于原始字符串,那么就不会从该点分叉!所以对于我们来说,使用check_divergence()是避免那个讨厌的点的好方法。
int check_divergence(TrieNode* root, char* word) {
// Checks if there is branching at the last character of word
//and returns the largest position in the word where branching occurs
TrieNode* temp = root;
int len = strlen(word);
if (len == 0)
return 0;
// We will return the largest index where branching occurs
int last_index = 0;
for (int i=0; i < len; i++) {
int position = word[i] - 'a';
if (temp->children[position]) {
// If a child exists at that position
// we will check if there exists any other child
// so that branching occurs
for (int j=0; j<N; j++) {
if (j != position && temp->children[j]) {
// We've found another child! This is a branch.
// Update the branch position
last_index = i + 1;
break;
}
}
// Go to the next child in the sequence
temp = temp->children[position];
}
}
return last_index;
}
最后,我们确保不会盲目地删除整个链条。所以现在让我们继续使用我们的删除方法。
TrieNode* delete_trie(TrieNode* root, char* word) {
// Will try to delete the word sequence from the Trie only it
// ends up in a leaf node
if (!root)
return NULL;
if (!word || word[0] == '\0')
return root;
// If the node corresponding to the match is not a leaf node,
// we stop
if (!is_leaf_node(root, word)) {
return root;
}
TrieNode* temp = root;
// Find the longest prefix string that is not the current word
char* longest_prefix = find_longest_prefix(root, word);
//printf("Longest Prefix = %s\n", longest_prefix);
if (longest_prefix[0] == '\0') {
free(longest_prefix);
return root;
}
// Keep track of position in the Trie
int i;
for (i=0; longest_prefix[i] != '\0'; i++) {
int position = (int) longest_prefix[i] - 'a';
if (temp->children[position] != NULL) {
// Keep moving to the deepest node in the common prefix
temp = temp->children[position];
}
else {
// There is no such node. Simply return.
free(longest_prefix);
return root;
}
}
// Now, we have reached the deepest common node between
// the two strings. We need to delete the sequence
// corresponding to word
int len = strlen(word);
for (; i < len; i++) {
int position = (int) word[i] - 'a';
if (temp->children[position]) {
// Delete the remaining sequence
TrieNode* rm_node = temp->children[position];
temp->children[position] = NULL;
free_trienode(rm_node);
}
}
free(longest_prefix);
return root;
}
上述的代码简单地找到了前缀匹配的最深节点,并从Trie树中删除剩余匹配的单词序列,因为这与任何其他的匹配无关。
上述过程的时间复杂度
以下是这些过程的时间复杂度。
Procedure | Time Complexity of Implementation |
search_trie() |
O(n) -> n is the length of the input string |
insert_trie() |
O(n) -> n is the length of the input string |
delete_trie() |
O(C*n) -> C is the number of alphabets, |
n is the length of the input word |
几乎所有情况下,字母的数量是恒定的,因此 delete_trie() 的复杂度也被降低到 O(n)。
字典树数据结构的完整C/C++程序
最终,我们(希望如此)完成了delete_trie()函数。让我们使用我们的驱动程序来检查我们所编写的是否正确。
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// The number of children for each node
// We will construct a N-ary tree and make it
// a Trie
// Since we have 26 english letters, we need
// 26 children per node
#define N 26
typedef struct TrieNode TrieNode;
struct TrieNode {
// The Trie Node Structure
// Each node has N children, starting from the root
// and a flag to check if it's a leaf node
char data; // Storing for printing purposes only
TrieNode* children[N];
int is_leaf;
};
TrieNode* make_trienode(char data) {
// Allocate memory for a TrieNode
TrieNode* node = (TrieNode*) calloc (1, sizeof(TrieNode));
for (int i=0; i<N; i++)
node->children[i] = NULL;
node->is_leaf = 0;
node->data = data;
return node;
}
void free_trienode(TrieNode* node) {
// Free the trienode sequence
for(int i=0; i<N; i++) {
if (node->children[i] != NULL) {
free_trienode(node->children[i]);
}
else {
continue;
}
}
free(node);
}
TrieNode* insert_trie(TrieNode* root, char* word) {
// Inserts the word onto the Trie
// ASSUMPTION: The word only has lower case characters
TrieNode* temp = root;
for (int i=0; word[i] != '\0'; i++) {
// Get the relative position in the alphabet list
int idx = (int) word[i] - 'a';
if (temp->children[idx] == NULL) {
// If the corresponding child doesn't exist,
// simply create that child!
temp->children[idx] = make_trienode(word[i]);
}
else {
// Do nothing. The node already exists
}
// Go down a level, to the child referenced by idx
// since we have a prefix match
temp = temp->children[idx];
}
// At the end of the word, mark this node as the leaf node
temp->is_leaf = 1;
return root;
}
int search_trie(TrieNode* root, char* word)
{
// Searches for word in the Trie
TrieNode* temp = root;
for(int i=0; word[i]!='\0'; i++)
{
int position = word[i] - 'a';
if (temp->children[position] == NULL)
return 0;
temp = temp->children[position];
}
if (temp != NULL && temp->is_leaf == 1)
return 1;
return 0;
}
int check_divergence(TrieNode* root, char* word) {
// Checks if there is branching at the last character of word
// and returns the largest position in the word where branching occurs
TrieNode* temp = root;
int len = strlen(word);
if (len == 0)
return 0;
// We will return the largest index where branching occurs
int last_index = 0;
for (int i=0; i < len; i++) {
int position = word[i] - 'a';
if (temp->children[position]) {
// If a child exists at that position
// we will check if there exists any other child
// so that branching occurs
for (int j=0; j<N; j++) {
if (j != position && temp->children[j]) {
// We've found another child! This is a branch.
// Update the branch position
last_index = i + 1;
break;
}
}
// Go to the next child in the sequence
temp = temp->children[position];
}
}
return last_index;
}
char* find_longest_prefix(TrieNode* root, char* word) {
// Finds the longest common prefix substring of word
// in the Trie
if (!word || word[0] == '\0')
return NULL;
// Length of the longest prefix
int len = strlen(word);
// We initially set the longest prefix as the word itself,
// and try to back-tracking from the deepst position to
// a point of divergence, if it exists
char* longest_prefix = (char*) calloc (len + 1, sizeof(char));
for (int i=0; word[i] != '\0'; i++)
longest_prefix[i] = word[i];
longest_prefix[len] = '\0';
// If there is no branching from the root, this
// means that we're matching the original string!
// This is not what we want!
int branch_idx = check_divergence(root, longest_prefix) - 1;
if (branch_idx >= 0) {
// There is branching, We must update the position
// to the longest match and update the longest prefix
// by the branch index length
longest_prefix[branch_idx] = '\0';
longest_prefix = (char*) realloc (longest_prefix, (branch_idx + 1) * sizeof(char));
}
return longest_prefix;
}
int is_leaf_node(TrieNode* root, char* word) {
// Checks if the prefix match of word and root
// is a leaf node
TrieNode* temp = root;
for (int i=0; word[i]; i++) {
int position = (int) word[i] - 'a';
if (temp->children[position]) {
temp = temp->children[position];
}
}
return temp->is_leaf;
}
TrieNode* delete_trie(TrieNode* root, char* word) {
// Will try to delete the word sequence from the Trie only it
// ends up in a leaf node
if (!root)
return NULL;
if (!word || word[0] == '\0')
return root;
// If the node corresponding to the match is not a leaf node,
// we stop
if (!is_leaf_node(root, word)) {
return root;
}
TrieNode* temp = root;
// Find the longest prefix string that is not the current word
char* longest_prefix = find_longest_prefix(root, word);
//printf("Longest Prefix = %s\n", longest_prefix);
if (longest_prefix[0] == '\0') {
free(longest_prefix);
return root;
}
// Keep track of position in the Trie
int i;
for (i=0; longest_prefix[i] != '\0'; i++) {
int position = (int) longest_prefix[i] - 'a';
if (temp->children[position] != NULL) {
// Keep moving to the deepest node in the common prefix
temp = temp->children[position];
}
else {
// There is no such node. Simply return.
free(longest_prefix);
return root;
}
}
// Now, we have reached the deepest common node between
// the two strings. We need to delete the sequence
// corresponding to word
int len = strlen(word);
for (; i < len; i++) {
int position = (int) word[i] - 'a';
if (temp->children[position]) {
// Delete the remaining sequence
TrieNode* rm_node = temp->children[position];
temp->children[position] = NULL;
free_trienode(rm_node);
}
}
free(longest_prefix);
return root;
}
void print_trie(TrieNode* root) {
// Prints the nodes of the trie
if (!root)
return;
TrieNode* temp = root;
printf("%c -> ", temp->data);
for (int i=0; i<N; i++) {
print_trie(temp->children[i]);
}
}
void print_search(TrieNode* root, char* word) {
printf("Searching for %s: ", word);
if (search_trie(root, word) == 0)
printf("Not Found\n");
else
printf("Found!\n");
}
int main() {
// Driver program for the Trie Data Structure Operations
TrieNode* root = make_trienode('\0');
root = insert_trie(root, "hello");
root = insert_trie(root, "hi");
root = insert_trie(root, "teabag");
root = insert_trie(root, "teacan");
print_search(root, "tea");
print_search(root, "teabag");
print_search(root, "teacan");
print_search(root, "hi");
print_search(root, "hey");
print_search(root, "hello");
print_trie(root);
printf("\n");
root = delete_trie(root, "hello");
printf("After deleting 'hello'...\n");
print_trie(root);
printf("\n");
root = delete_trie(root, "teacan");
printf("After deleting 'teacan'...\n");
print_trie(root);
printf("\n");
free_trienode(root);
return 0;
}
输出
Searching for tea: Not Found
Searching for teabag: Found!
Searching for teacan: Found!
Searching for hi: Found!
Searching for hey: Not Found
Searching for hello: Found!
-> h -> e -> l -> l -> o -> i -> t -> e -> a -> b -> a -> g -> c -> a -> n ->
After deleting 'hello'...
-> h -> i -> t -> e -> a -> b -> a -> g -> c -> a -> n ->
After deleting 'teacan'...
-> h -> i -> t -> e -> a -> b -> a -> g ->
通过这个结束,我们已经完成了在C/C++中实现Trie数据结构的学习。我知道这是一篇很长的阅读,但希望你能正确理解如何应用这些方法!
下载该代码
你可以在我上传到Github Gist上找到完整的代码。它可能不是最高效的代码,但我已尽力确保它能正常工作。
如果你有任何问题或建议,欢迎在下方的评论区提出!
参考资料
- Wikipedia Article on Tries
推荐阅读:
如果你对类似的话题感兴趣,可以参考《数据结构和算法》部分,其中包括哈希表和图论等主题。