A Python Implementation of Dictionary-Based Word-by-Word Matching Segmentation

Overall idea of the algorithm

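Take the dictionary words in order of decreasing length and search the entire text for each one in turn, until every dictionary word that occurs in the text has been cut out. As a result, the whole dictionary is scanned once, no matter how large the dictionary is or how short the text is.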

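First, sort the dictionary words by length in descending order.
Iterate over the sorted dictionary and match each word against the original text.
Use Python's find() method to get the index at which the matched string occurs in the text.
Define a Python dictionary (word_cut in the code) whose keys are those indexes and whose values are the matched strings. To output the result, sort the dictionary by key in ascending order and print the values, so that the fragments appear in the same order as in the original text.
find() returns only the position of the first occurrence. If a string occurs more than once, a while loop is needed to find all of its positions. Each time a position is found, store the index and the string in the dictionary, then replace that occurrence in the text with spaces of the same length, so that the indexes of the remaining text do not shift. The loop ends once every occurrence has been replaced.
After the traversal, every matched string in the text has been replaced by spaces of the same length, and what remains is the unmatched content. Split it with split() into a list (unmatched in the code), then iterate over that list, find the index of each string, and store it in the dictionary as well.
Finally, the values of the dictionary, taken in key order, are the segmentation result.
The contents of the list are the unmatched fragments.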
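The effect of the length ordering and of the same-length space padding is easiest to see on a short string. The following is a minimal sketch, not the full program: `segment_sketch` and its `longest_first` flag are hypothetical names introduced here for illustration, and the word list is a small subset of the sample dictionary.

```python
def segment_sketch(text, words, longest_first=True):
    """Hypothetical helper: return the fragments in original text order."""
    cuts = {}  # start index -> fragment
    for word in sorted(words, key=len, reverse=longest_first):
        if not word:
            continue  # an empty string would match forever
        index = text.find(word)
        while index != -1:
            cuts[index] = word
            # Same-length padding keeps every later index valid.
            text = text[:index] + " " * len(word) + text[index + len(word):]
            index = text.find(word)
    # Leftover non-space runs are the unmatched fragments.
    for piece in text.split():
        cuts[text.find(piece)] = piece
        text = text.replace(piece, " " * len(piece), 1)
    return [cuts[i] for i in sorted(cuts)]


words = ["来", "来自", "天津", "理工", "大学", "天津理工", "理工大学", "天津理工大学"]
print(segment_sketch("来自天津理工大学", words))
# ['来自', '天津理工大学']
print(segment_sketch("来自天津理工大学", words, longest_first=False))
# ['来', '自', '天津', '理工', '大学']
```

With the shortest words tried first, 来 consumes the first character of 来自 and leaves 自 behind as an unmatched fragment, which is why the dictionary is sorted from longest to shortest.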

Python implementation

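# -*- coding: UTF-8 -*-
"""Dictionary-based word-by-word matching segmentation."""


def load_words(path):
    """Read the first whitespace-separated field of each non-blank line."""
    with open(path, 'r', encoding='utf8') as word_file:
        return [line.split()[0] for line in word_file if line.strip()]


def segment(text, word_dict):
    """Return ({start index: fragment}, [unmatched fragments]) for text."""
    # Record where each fragment starts in the text
    word_cut = {}

    # Try longer words first; sorted() is stable, so ties keep file order
    for word in sorted(word_dict, key=len, reverse=True):
        index = text.find(word)
        while index != -1:
            word_cut[index] = word
            # Blank out this one occurrence with spaces of equal length so
            # that shorter words cannot match inside it and indexes stay valid
            text = text.replace(word, ' ' * len(word), 1)
            index = text.find(word)

    # Whatever is left between the spaces was not found in the dictionary
    unmatched = text.split()
    for word in unmatched:
        word_cut[text.find(word)] = word
        text = text.replace(word, ' ' * len(word), 1)

    return word_cut, unmatched


if __name__ == '__main__':
    with open('text.txt', 'r', encoding='utf8') as text_file:
        text = text_file.read()

    print("原文本：")  # "Original text:"
    print(text)

    word_dict = load_words('dict.txt')
    word_cut, unmatched = segment(text, word_dict)

    print("\n分词结果：")  # "Segmentation result:"
    for i in sorted(word_cut):
        print(word_cut[i] + "/", end='')

    print("\n\n未匹配内容：")  # "Unmatched content:"
    for word in unmatched:
        print(word)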

Example

text.txt

很高兴见到你，我是Bob，今年24岁，来自天津理工大学，我是一个硕士研究生，今年研二。

dict.txt

很 adv
高兴 adj
见到 v
你 prep
我 prep
来 prep
来自 v
天津 n
理工 n
大学 n
天津理工 n
理工大学 n
天津理工大学 n
是 v
一 n
一个 n
硕士 n
研究 v
研究生 n
， comma
。 comma
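The dictionary format is not spelled out in the text. Judging from the sample, each line appears to hold a word followed by a tag, separated by whitespace, and only the word is used for matching. Under that assumption, a minimal sketch of the parsing step looks like this. It uses an in-memory `io.StringIO` as a stand-in for dict.txt, and it skips blank lines, which a bare `line.split()[0]` would trip over with an IndexError.

```python
import io

# Stand-in for dict.txt; the blank line and stray spaces are deliberate.
sample = io.StringIO(" 很 adv\n 高兴 adj\n\n天津理工大学 n\n， comma \n")

words = []
for line in sample:
    fields = line.split()  # splits on any whitespace, drops the newline
    if fields:             # skip blank lines
        words.append(fields[0])

# Longest words first; list.sort() is stable, so ties keep file order.
words.sort(key=len, reverse=True)
print(words)
# ['天津理工大学', '高兴', '很', '，']
```

Because split() is called without arguments, leading or trailing spaces around an entry do not affect the parsed word.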

Output:

原文本：
很高兴见到你，我是Bob，今年24岁，来自天津理工大学，我是一个硕士研究生，今年研二。

分词结果：
很/高兴/见到/你/，/我/是/Bob/，/今年24岁/，/来自/天津理工大学/，/我/是/一个/硕士/研究生/，/今年研二/。/

未匹配内容：
Bob
今年24岁
今年研二