数据相似性检测算法 – 刘爱贵的专栏 – 博客频道 – CSDN.NET

1、引言
“数据同步算法研究”一文研究了在网络上高效同步数据的方法，其中有个前提是文件A和B非常相似，即两者之间存在大量相同的数据。如果两个文件相似性很低，虽然这种方法依然可以正常工作，但数据同步性能却不会得到提高，甚至会有所降低。因为会产生部分元数据和网络通信消耗，这在两个文件完全不相关时尤为明显。因此，同步数据前需要计算种子文件(seed file)与目标文件之间的相似性，如果相似性大于指定阈值(通常应大于50%)则应用该数据同步算法，否则接传输文件即可。如此，可使得数据同步算法则具有较好的自适应性，在数据具有不同相似性的情形下均可进行高性能的数据同步。另外，在数据相似性检测的基础之上，可对于相似性高的数据进行数据编码处理(如Delta编码)，通过一个文件给另一个文件编码的方式进行数据压缩，这是一种基于相似数据检测与编码的重复数据删除技术。

2、相似性计算
　　Unix diff对文档进行逐行对比来检测相似文件，它采用经典的LCS(Longest Common Subsequence，最长公共子串)算法，运用动态规划方法来计算相似性。LCS的含义是同时包含在字符串里的一个最长字符序列，LCS的长度作为这两个字符串相似性的度量。Diff算法以整行作为”字符”来计算最长公共子串，性能上比字符级的LCS算法快很多。这种方法效率很低，而且只适用文本文件的相似比较，不能直接适用于二进制文件。

目前通常的做法是将文件相似性问题转换为集合相似性问题，如基于shingle的计算方法和基于bloom filter的计算方法，这两种方法都可适用于任何格式的数据文件。这种方式的核心思想是为每个文件提取组特征值，以特征值集合来计算相似性，从而降低计算复杂性来提高性能。shingle用特征值交集来计算相似性会导致高计算和空间开销，bloom filter技术在计算开销和匹配精度上更具势。Bloom filter所定义的集合元素是文件按照CDC(content-defined chunking)算法所切分数据块的指纹值，其相似性定义如下：
|fingerprints(f1) ∩ fingerprints(f2)|
Sim(f1, f2) = ——————————————— (公式1)
|fingerprints(f1) ∪ fingerprints(f2)|

另外一种方法，是将二进制文件进行切块，使用数据块指纹来表示数据块，然后将数据块映射为”字符”，再应用LCS算法寻找最大公共子串并计算出相似度。其相似性定义如下：
2 * length(LCS(fingerprints(f1), fingerprints(f2)))
Sim(f1, f2) = ——————————————————————　(公式2)
length(fingerprints(f1)) + length(fingerprints(f2))

上面两种相似性算法中均采用数据切分技术，数据块可以是定长或变长。为了相似性计算的精确性，实现中采用以数据块长度作为权的加权计算方法。

3、Bloom filter算法
　　该文件相似性计算流程如下：
(1) 采用CDC算法将文件切分成数据块集，并为每个数据块计算MD5指纹；
(2) 计算两个指纹集合的交集和并集，通过hashtable来实现；
(3) 按照公式1计算文件相似性，考虑重复数据块和数据块长度来提高计算精确度。
详细参见附录bsim源码中的file_chunk，chunk_file_process和similarity_detect函数实现。

4、LCS算法
　　该文件相似性计算流程如下：
(1) 采用CDC算法将文件切分成数据块集，并为每个数据块计算MD5指纹；
(2) 将MD5指纹串映射为”字符”，则文件转换为”字符串”表示；
(3) 应用LCS算法计算出最长公共子串，并计算其加权长度；
(4) 按照公式2计算文件相似性，考虑重复数据块和数据块长度来提高计算精确度。
详细参见附录bsim源码中的file_chunk，chunk_file_process，LCS和similarity_detect函数实现。

5、算法分析比较
两种算法都对文件进行切分操作，假设文件f1切为m个块，文件f2切分成n个块。Bloom filter算法没有考虑数据块顺序，因此在相似性精确度方面要低于LCS算法，其时间和空间复杂性都是O(m + n)。相反，LCS算法考虑了数据块顺序问题，相似性度量相对精确，然而其时间和空间复杂性是O(mn)，这个大大限制了应用规模。综合来看，Bloom filter算法精确度比LCS算法要低，但计算消耗要小很多，性能和适用性非常好。LCS比较适合精确的文件相似性计算，这些文件往往比较小，50MB以内比较合适。对于重复数据删除和网络数据同步来说，消重效果和性能与数据块顺序性无关，因此Bloom filter算法计算的数据相似性更适用，性能也更高。

附录：bsim.c源码
(完整源码请参见deduputil源码)

[cpp] view plaincopyprint?

*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License along
* with this program; if not, visit the http://fsf.org website.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include “hashtable.h”
#include “sync.h”
#define NEITHER 0
#define UP 1
#define LEFT 2
#define UP_AND_LEFT 3
#define MAX(x, y) (((x) > (y)) ? (x) : (y))
#define MIN(x, y) (((x) < (y)) ? (x) : (y))
#define MD5_LEN 17
enum {
FILE1 = 0,
FILE2
};
enum {
LCS_NOT = 0,
LCS_YES
};
typedef struct {
uint32_t nr1;
uint32_t nr2;
uint32_t len;
} hash_entry;
typedef struct {
char **str;
uint32_t len;
} lcs_entry;
static uint32_t sim_union = 0;
static uint32_t sim_intersect = 0;
static void usage()
{
fprintf(stderr, “Usage: bsim FILE1 FILE2 CHUNK_ALGO LCS/n/n”);
fprintf(stderr, “Similarity detect between FILE1 and FILE2 based on block level./n”);
fprintf(stderr, “CHUNK_ALGO:/n”);
fprintf(stderr, ” FSP – fixed-size partition/n”);
fprintf(stderr, ” CDC – content-defined chunking/n”);
fprintf(stderr, ” SBC – slide block chunking/n/n”);
fprintf(stderr, “LCS:/n”);
fprintf(stderr, ” LCS_NOT – do not use LCS(longest lommon subsequence) algorithms/n”);
fprintf(stderr, ” LCS_YES – use LCS algorithms/n/n”);
fprintf(stderr, “Report bugs to <Aigui.Liu@gmail.com>./n”);
}
static int parse_arg(char *argname)
{
if (0 == strcmp(argname, “FSP”))
return CHUNK_FSP;
else if (0 == strcmp(argname, “CDC”))
return CHUNK_CDC;
else if (0 == strcmp(argname, “SBC”))
return CHUNK_SBC;
else if (0 == strcmp(argname, “LCS_NOT”))
return LCS_NOT;
else if (0 == strcmp(argname, “LCS_YES”))
return LCS_YES;
else
return -1;
}
static char **alloc_2d_array(int row, int col)
{
int i;
char *p, **pp;
p = (char *)malloc(row * col * sizeof(char));
pp = (char **)malloc(row * sizeof(char *));
if (p == NULL || pp == NULL)
return NULL;
for (i = 0; i < row; i++) {
pp[i] = p + col * i;
}
return pp;
}
static void free_2d_array(char **str)
{
free(str[0]);
free(str);
}
static void show_md5_hex(unsigned char md5_checksum[16])
{
int i;
for (i = 0; i < 16; i++) {
printf(“%02x”, md5_checksum[i]);
}
printf(“/n”);
}
static int chunk_file_process(char *chunk_file, hashtable *htab, int which, int sim_algo, lcs_entry *le)
{
int fd, i, ret = 0;
ssize_t rwsize;
chunk_file_header chunk_file_hdr;
chunk_block_entry chunk_bentry;
hash_entry *he = NULL;
/* parse chunk file */
fd = open(chunk_file, O_RDONLY);
if (-1 == fd) {
return -1;
}
rwsize = read(fd, &chunk_file_hdr, CHUNK_FILE_HEADER_SZ);
if (rwsize != CHUNK_FILE_HEADER_SZ) {
ret = -1;
goto _CHUNK_FILE_PROCESS_EXIT;
}
if (sim_algo == LCS_YES) {
le->str = alloc_2d_array(chunk_file_hdr.block_nr, MD5_LEN);
if (le->str == NULL) {
ret = -1;
goto _CHUNK_FILE_PROCESS_EXIT;
}
le->len = chunk_file_hdr.block_nr;
}
for(i = 0; i < chunk_file_hdr.block_nr; i++) {
rwsize = read(fd, &chunk_bentry, CHUNK_BLOCK_ENTRY_SZ);
if (rwsize != CHUNK_BLOCK_ENTRY_SZ) {
ret = -1;
goto _CHUNK_FILE_PROCESS_EXIT;
}
he = (hash_entry *)hash_value((void *)chunk_bentry.md5, htab);
if (he == NULL) {
he = (hash_entry *)malloc(sizeof(hash_entry));
he->nr1 = he->nr2 = 0;
he->len = chunk_bentry.len;
}
(which == FILE1) ? he->nr1++ : he->nr2++;
/* insert or update hash entry */
hash_insert((void *)strdup(chunk_bentry.md5), (void *)he, htab);
if (sim_algo == LCS_YES) {
memcpy(le->str[i], chunk_bentry.md5, MD5_LEN);
}
}
_CHUNK_FILE_PROCESS_EXIT:
close(fd);
return ret;
}
uint32_t LCS(char** a, int n, char** b, int m, hashtable *htab)
{
int** S;
int** R;
int ii;
int jj;
int pos;
uint32_t len = 0;
hash_entry *he = NULL;
/* Memory allocation */
S = (int **)malloc( (n+1) * sizeof(int *) );
R = (int **)malloc( (n+1) * sizeof(int *) );
if (S == NULL || R == NULL) {
perror(“malloc for S and R in LCS”);
exit(0);
}
for(ii = 0; ii <= n; ++ii) {
S[ii] = (int*) malloc( (m+1) * sizeof(int) );
R[ii] = (int*) malloc( (m+1) * sizeof(int) );
if (S[ii] == NULL || R[ii] == NULL) {
perror(“malloc for S[ii] and R[ii] in LCS”);
exit(0);
}
}
/* It is important to use <=, not <. The next two for-loops are initialization */
for(ii = 0; ii <= n; ++ii) {
S[ii][0] = 0;
R[ii][0] = UP;
}
for(jj = 0; jj <= m; ++jj) {
S[0][jj] = 0;
R[0][jj] = LEFT;
}
/* This is the main dynamic programming loop that computes the score and */
/* backtracking arrays. */
for(ii = 1; ii <= n; ++ii) {
for(jj = 1; jj <= m; ++jj) {
if (strcmp(a[ii-1], b[jj-1]) == 0) {
S[ii][jj] = S[ii-1][jj-1] + 1;
R[ii][jj] = UP_AND_LEFT;
}
else {
S[ii][jj] = S[ii-1][jj-1] + 0;
R[ii][jj] = NEITHER;
}
if( S[ii-1][jj] >= S[ii][jj] ) {
S[ii][jj] = S[ii-1][jj];
R[ii][jj] = UP;
}
if( S[ii][jj-1] >= S[ii][jj] ) {
S[ii][jj] = S[ii][jj-1];
R[ii][jj] = LEFT;
}
}
}
/* The length of the longest substring is S[n][m] */
ii = n;
jj = m;
pos = S[ii][jj];
/* Trace the backtracking matrix. */
while( ii > 0 || jj > 0 ) {
if( R[ii][jj] == UP_AND_LEFT ) {
ii–;
jj–;
//lcs[pos–] = a[ii];
he = (hash_entry *)hash_value((void *)a[ii], htab);
len += ((he == NULL) ? 0: he->len);
}
else if( R[ii][jj] == UP ) {
ii–;
}
else if( R[ii][jj] == LEFT ) {
jj–;
}
}
for(ii = 0; ii <= n; ++ii ) {
free(S[ii]);
free(R[ii]);
}
free(S);
free(R);
return len;
}
int hash_callback(void *key, void *data)
{
hash_entry *he = (hash_entry *)data;
sim_union += (he->len * (he->nr1 + he->nr2));
sim_intersect += (he->len * MIN(he->nr1, he->nr2));
}
static float similarity_detect(hashtable *htab, char **str1, int n, char **str2, int m, int sim_algo)
{
uint32_t lcs_len = 0;
hash_for_each_do(htab, hash_callback);
if (sim_algo == LCS_YES) {
lcs_len = LCS(str1, n, str2, m, htab);
return lcs_len * 2.0 / sim_union;
} else { /* LCS_NOT */
return sim_intersect * 2.0 / sim_union;
}
}
int main(int argc, char *argv[])
{
int chunk_algo = CHUNK_CDC;
int sim_algo = LCS_NOT;
char *file1 = NULL;
char *file2 = NULL;
lcs_entry le1, le2;
char tmpname[NAME_MAX_SZ] = {0};
char template[] = “deduputil_bsim_XXXXXX”;
hashtable *htab = NULL;
int ret = 0;
if (argc < 5) {
usage();
return -1;
}
/* parse chunk algorithms */
file1 = argv[1];
file2 = argv[2];
chunk_algo = parse_arg(argv[3]);
sim_algo = parse_arg(argv[4]);
if (chunk_algo == -1 || sim_algo == -1) {
usage();
return -1;
}
htab = create_hashtable(HASHTABLE_BUCKET_SZ);
if (htab == NULL) {
fprintf(stderr, “create hashtabke failed/n”);
return -1;
}
/* chunk file1 and file2 into blocks */
sprintf(tmpname, “/tmp/%s_%d”, mktemp(template), getpid());
ret = file_chunk(file1, tmpname, chunk_algo);
if (0 != ret) {
fprintf(stderr, “chunk %s failed/n”, file1);
goto _BENCODE_EXIT;
}
le1.str = NULL;
ret = chunk_file_process(tmpname, htab, FILE1, sim_algo, &le1);
if (ret != 0) {
fprintf(stderr, “pasre %s failed/n”, file1);
goto _BENCODE_EXIT;
}
ret = file_chunk(file2, tmpname, chunk_algo);
if (0 != ret){
fprintf(stderr, “chunk %s failed/n”, file2);
goto _BENCODE_EXIT;
}
le2.str = NULL;
ret = chunk_file_process(tmpname, htab, FILE2, sim_algo, &le2);
if (ret != 0) {
fprintf(stderr, “pasre %s failed/n”, file2);
goto _BENCODE_EXIT;
}
fprintf(stderr, “similarity = %.4f/n”, similarity_detect(htab, le1.str, le1.len, le2.str, le2.len, sim_algo));
_BENCODE_EXIT:
unlink(tmpname);
hash_free(htab);
if (le1.str) free_2d_array(le1.str);
if (le2.str) free_2d_array(le2.str);
return ret;
}

/* Copyright (C) 2010 Aigui Liu * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 3 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License along * with this program; if not, visit the http://fsf.org website. */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include “hashtable.h” #include “sync.h” #define NEITHER 0 #define UP 1 #define LEFT 2 #define UP_AND_LEFT 3 #define MAX(x, y) (((x) > (y)) ? (x) : (y)) #define MIN(x, y) (((x) < (y)) ? (x) : (y)) #define MD5_LEN 17 enum { FILE1 = 0, FILE2 }; enum { LCS_NOT = 0, LCS_YES }; typedef struct { uint32_t nr1; uint32_t nr2; uint32_t len; } hash_entry; typedef struct { char **str; uint32_t len; } lcs_entry; static uint32_t sim_union = 0; static uint32_t sim_intersect = 0; static void usage() { fprintf(stderr, “Usage: bsim FILE1 FILE2 CHUNK_ALGO LCS/n/n”); fprintf(stderr, “Similarity detect between FILE1 and FILE2 based on block level./n”); fprintf(stderr, “CHUNK_ALGO:/n”); fprintf(stderr, ” FSP – fixed-size partition/n”); fprintf(stderr, ” CDC – content-defined chunking/n”); fprintf(stderr, ” SBC – slide block chunking/n/n”); fprintf(stderr, “LCS:/n”); fprintf(stderr, ” LCS_NOT – do not use LCS(longest lommon subsequence) algorithms/n”); fprintf(stderr, ” LCS_YES – use LCS algorithms/n/n”); fprintf(stderr, “Report bugs to <Aigui.Liu@gmail.com>./n”); } static int parse_arg(char *argname) { if (0 == strcmp(argname, “FSP”)) return CHUNK_FSP; else if (0 == strcmp(argname, “CDC”)) return CHUNK_CDC; else if (0 == strcmp(argname, “SBC”)) return CHUNK_SBC; else if (0 == strcmp(argname, “LCS_NOT”)) return LCS_NOT; else if (0 == strcmp(argname, “LCS_YES”)) return LCS_YES; else return -1; } static char **alloc_2d_array(int row, int col) { int i; char *p, **pp; p = (char *)malloc(row * col * sizeof(char)); pp = (char **)malloc(row * sizeof(char *)); if (p == NULL || pp == NULL) return NULL; for (i = 0; i < row; i++) { pp[i] = p + col * i; } return pp; } static void free_2d_array(char **str) { free(str[0]); free(str); } static void show_md5_hex(unsigned char md5_checksum[16]) { int i; for (i = 0; i < 16; i++) { printf(“%02x”, md5_checksum[i]); } printf(“/n”); } static int chunk_file_process(char *chunk_file, hashtable *htab, int which, int sim_algo, lcs_entry *le) { int fd, i, ret = 0; ssize_t rwsize; chunk_file_header chunk_file_hdr; chunk_block_entry chunk_bentry; hash_entry *he = NULL; /* parse chunk file */ fd = open(chunk_file, O_RDONLY); if (-1 == fd) { return -1; } rwsize = read(fd, &chunk_file_hdr, CHUNK_FILE_HEADER_SZ); if (rwsize != CHUNK_FILE_HEADER_SZ) { ret = -1; goto _CHUNK_FILE_PROCESS_EXIT; } if (sim_algo == LCS_YES) { le->str = alloc_2d_array(chunk_file_hdr.block_nr, MD5_LEN); if (le->str == NULL) { ret = -1; goto _CHUNK_FILE_PROCESS_EXIT; } le->len = chunk_file_hdr.block_nr; } for(i = 0; i < chunk_file_hdr.block_nr; i++) { rwsize = read(fd, &chunk_bentry, CHUNK_BLOCK_ENTRY_SZ); if (rwsize != CHUNK_BLOCK_ENTRY_SZ) { ret = -1; goto _CHUNK_FILE_PROCESS_EXIT; } he = (hash_entry *)hash_value((void *)chunk_bentry.md5, htab); if (he == NULL) { he = (hash_entry *)malloc(sizeof(hash_entry)); he->nr1 = he->nr2 = 0; he->len = chunk_bentry.len; } (which == FILE1) ? he->nr1++ : he->nr2++; /* insert or update hash entry */ hash_insert((void *)strdup(chunk_bentry.md5), (void *)he, htab); if (sim_algo == LCS_YES) { memcpy(le->str[i], chunk_bentry.md5, MD5_LEN); } } _CHUNK_FILE_PROCESS_EXIT: close(fd); return ret; } uint32_t LCS(char** a, int n, char** b, int m, hashtable *htab) { int** S; int** R; int ii; int jj; int pos; uint32_t len = 0; hash_entry *he = NULL; /* Memory allocation */ S = (int **)malloc( (n+1) * sizeof(int *) ); R = (int **)malloc( (n+1) * sizeof(int *) ); if (S == NULL || R == NULL) { perror(“malloc for S and R in LCS”); exit(0); } for(ii = 0; ii <= n; ++ii) { S[ii] = (int*) malloc( (m+1) * sizeof(int) ); R[ii] = (int*) malloc( (m+1) * sizeof(int) ); if (S[ii] == NULL || R[ii] == NULL) { perror(“malloc for S[ii] and R[ii] in LCS”); exit(0); } } /* It is important to use <=, not <. The next two for-loops are initialization */ for(ii = 0; ii <= n; ++ii) { S[ii][0] = 0; R[ii][0] = UP; } for(jj = 0; jj <= m; ++jj) { S[0][jj] = 0; R[0][jj] = LEFT; } /* This is the main dynamic programming loop that computes the score and */ /* backtracking arrays. */ for(ii = 1; ii <= n; ++ii) { for(jj = 1; jj <= m; ++jj) { if (strcmp(a[ii-1], b[jj-1]) == 0) { S[ii][jj] = S[ii-1][jj-1] + 1; R[ii][jj] = UP_AND_LEFT; } else { S[ii][jj] = S[ii-1][jj-1] + 0; R[ii][jj] = NEITHER; } if( S[ii-1][jj] >= S[ii][jj] ) { S[ii][jj] = S[ii-1][jj]; R[ii][jj] = UP; } if( S[ii][jj-1] >= S[ii][jj] ) { S[ii][jj] = S[ii][jj-1]; R[ii][jj] = LEFT; } } } /* The length of the longest substring is S[n][m] */ ii = n; jj = m; pos = S[ii][jj]; /* Trace the backtracking matrix. */ while( ii > 0 || jj > 0 ) { if( R[ii][jj] == UP_AND_LEFT ) { ii–; jj–; //lcs[pos–] = a[ii]; he = (hash_entry *)hash_value((void *)a[ii], htab); len += ((he == NULL) ? 0: he->len); } else if( R[ii][jj] == UP ) { ii–; } else if( R[ii][jj] == LEFT ) { jj–; } } for(ii = 0; ii <= n; ++ii ) { free(S[ii]); free(R[ii]); } free(S); free(R); return len; } int hash_callback(void *key, void *data) { hash_entry *he = (hash_entry *)data; sim_union += (he->len * (he->nr1 + he->nr2)); sim_intersect += (he->len * MIN(he->nr1, he->nr2)); } static float similarity_detect(hashtable *htab, char **str1, int n, char **str2, int m, int sim_algo) { uint32_t lcs_len = 0; hash_for_each_do(htab, hash_callback); if (sim_algo == LCS_YES) { lcs_len = LCS(str1, n, str2, m, htab); return lcs_len * 2.0 / sim_union; } else { /* LCS_NOT */ return sim_intersect * 2.0 / sim_union; } } int main(int argc, char *argv[]) { int chunk_algo = CHUNK_CDC; int sim_algo = LCS_NOT; char *file1 = NULL; char *file2 = NULL; lcs_entry le1, le2; char tmpname[NAME_MAX_SZ] = {0}; char template[] = “deduputil_bsim_XXXXXX”; hashtable *htab = NULL; int ret = 0; if (argc < 5) { usage(); return -1; } /* parse chunk algorithms */ file1 = argv[1]; file2 = argv[2]; chunk_algo = parse_arg(argv[3]); sim_algo = parse_arg(argv[4]); if (chunk_algo == -1 || sim_algo == -1) { usage(); return -1; } htab = create_hashtable(HASHTABLE_BUCKET_SZ); if (htab == NULL) { fprintf(stderr, “create hashtabke failed/n”); return -1; } /* chunk file1 and file2 into blocks */ sprintf(tmpname, “/tmp/%s_%d”, mktemp(template), getpid()); ret = file_chunk(file1, tmpname, chunk_algo); if (0 != ret) { fprintf(stderr, “chunk %s failed/n”, file1); goto _BENCODE_EXIT; } le1.str = NULL; ret = chunk_file_process(tmpname, htab, FILE1, sim_algo, &le1); if (ret != 0) { fprintf(stderr, “pasre %s failed/n”, file1); goto _BENCODE_EXIT; } ret = file_chunk(file2, tmpname, chunk_algo); if (0 != ret){ fprintf(stderr, “chunk %s failed/n”, file2); goto _BENCODE_EXIT; } le2.str = NULL; ret = chunk_file_process(tmpname, htab, FILE2, sim_algo, &le2); if (ret != 0) { fprintf(stderr, “pasre %s failed/n”, file2); goto _BENCODE_EXIT; } fprintf(stderr, “similarity = %.4f/n”, similarity_detect(htab, le1.str, le1.len, le2.str, le2.len, sim_algo)); _BENCODE_EXIT: unlink(tmpname); hash_free(htab); if (le1.str) free_2d_array(le1.str); if (le2.str) free_2d_array(le2.str); return ret; }

来源URL：http://blog.csdn.net/liuaigui/article/details/5870297