欢迎来到代码驿站!

Python代码

当前位置:首页 > 软件编程 > Python代码

论文查重python文本相似性计算simhash源码

时间:2023-02-01 10:28:38|栏目:Python代码|点击:

场景:

1.计算SimHash值,及Hamming距离。
2.SimHash适用于较长文本(大于三五百字)的相似性比较,文本越短误判率越高。

Python实现:

代码如下

# -*- encoding:utf-8 -*-
import math
import jieba
import jieba.analyse
class SimHash(object):
    def getBinStr(self, source):
        if source == "":
            return 0
        else:
            x = ord(source[0]) << 7
            m = 1000003
            mask = 2 ** 128 - 1
            for c in source:
                x = ((x * m) ^ ord(c)) & mask
            x ^= len(source)
            if x == -1:
                x = -2
            x = bin(x).replace('0b', '').zfill(64)[-64:]
            return str(x)
    def getWeight(self, source):
        return ord(source)
    def unwrap_weight(self, arr):
        ret = ""
        for item in arr:
            tmp = 0
            if int(item) > 0:
                tmp = 1
            ret += str(tmp)
        return ret
    def sim_hash(self, rawstr):
        seg = jieba.cut(rawstr)
        keywords = jieba.analyse.extract_tags("|".join(seg), topK=100, withWeight=True)
        ret = []
        for keyword, weight in keywords:
            binstr = self.getBinStr(keyword)
            keylist = []
            for c in binstr:
                weight = math.ceil(weight)
                if c == "1":
                    keylist.append(int(weight))
                else:
                    keylist.append(-int(weight))
            ret.append(keylist)
        # 降维
        rows = len(ret)
        cols = len(ret[0])
        result = []
        for i in range(cols):
            tmp = 0
            for j in range(rows):
                tmp += int(ret[j][i])
            if tmp > 0:
                tmp = "1"
            elif tmp <= 0:
                tmp = "0"
            result.append(tmp)
        return "".join(result)
    def distince(self, hashstr1, hashstr2):
        length = 0
        for index, char in enumerate(hashstr1):
            if char == hashstr2[index]:
                continue
            else:
                length += 1
        return length
if __name__ == "__main__":
    simhash = SimHash()
    str1 = '咱哥俩谁跟谁啊'
    str2 = '咱们俩谁跟谁啊'
    hash1 = simhash.sim_hash(str1)
    print(hash1)
    hash2 = simhash.sim_hash(str2)
    distince = simhash.distince(hash1, hash2)
    value = 5
    print("simhash", distince, "距离:", value, "是否相似:", distince<=value)

上一篇:Python3基础教程之递归函数简单示例

栏    目:Python代码

下一篇:为Python程序添加图形化界面的教程

本文标题:论文查重python文本相似性计算simhash源码

本文地址:http://www.codeinn.net/misctech/224867.html

推荐教程

广告投放 | 联系我们 | 版权申明

重要申明:本站所有的文章、图片、评论等,均由网友发表或上传并维护或收集自网络,属个人行为,与本站立场无关。

如果侵犯了您的权利,请与我们联系,我们将在24小时内进行处理、任何非本站因素导致的法律后果,本站均不负任何责任。

联系QQ:914707363 | 邮箱:codeinn#126.com(#换成@)

Copyright © 2020 代码驿站 版权所有