请教个字符串相似度算法, 300分送上(300)

suger · 2009-06-06

问题: 如今有50000个字符串需要进行相似度的计算,请个速度快的算法.有哪位高手有现成的例子是最好不过的了.经初步计算,最悲观的对比次数是12.5亿次,这样的次数耗时大概想一下都可怕,不知道需要多长时间才能计算完成.思路:1、考虑多线程同时处理2、发现相似度高于指定值（如90%）的字符串不再参与对比。如果A，B，C，D，E 5个字符串，A与B的相似度有90%以上，则B不再与C，D，E对比处理。如果有哪位实现过类似这样大数据量的相似度计算工作，请慷慨解囊帮忙小弟一下！分不够可以继续加！

凤冠坡 · 2009-06-06

兄弟：具体目标是什么，我觉得显示程序真的需要这样处理的情况应该不多哦！！能不能说下你得具体任务是什么，也许解决实际问题有更好的模型！！

suger · 2009-06-07

我所负责的物资管理系统中，如今系统有5万多个物资名称，但如今发现有很大一部分的相同或者是类似相同的，现在需要处理的就是把相同和类似度很高的物资找出来。比如“螺丝刀”和“螺丝批”这样的名称。对客户来说是要处理五万个物资的，对我们开发人员来说，就是处理5万个字符串了。呵呵。

cpj7406 · 2009-06-07

慢就慢，关系不大，反正又不是经常运行。难的其实还不是计算机的比对，是人工的参于，你对比后，列了出来，需要人工决定是否合并或保留。现实的做法一般是，分阶段分步骤慢慢整理，譬如，今天我们先整理标准件中的紧固件、明天我们整理液压件、后天我们整理电动工具.......你的算法再高明也不是就能摆平！数据的不理想是因为企业的技术管理体系不完善、标准化程度低所造成的，不是你的软件系统造成的！你的算法再高明，整理完之后，又能怎样呢？

fanronghua · 2009-06-07

去重复,用哈希表存储

suger · 2009-06-10

cpj7406朋友：现实中的处理的确很麻烦，得对于一个企业来说，造成物资名称的重复问题是无法避免的，只能是尽量的减少和时候维护。就按你说的今天处理紧固件，明天处理液压件，但如果靠人工的方式来检查液压件或紧固件有多少是重复的，保证是看到眼睛花了都没有找到重复的东西。因此才需要计算机辅助进行筛选。fanronghua朋友：能不能用具体的代码或者例子说说明呢。目前进展，我在网上找一个字符串相似度计算的例子strcmp，里面有个单元是CnStrDiff。但速度实在不敢恭候。对比1W个字符串需要3到4秒钟时间，如果按12亿条数据对比计算的话，最少都得100个小时。而且运行时电脑几乎是不会动了（电脑配置是2个4核CPU，16G内存的服务器）。需求更好的解决方法！

hs-kill · 2009-06-10

。。。。。。偷懒的方法，把字符串都扔到数据库里，用数据库进行对比

zkktom · 2009-06-10

偷懒好像不行的培林就是我们说的轴承这个好像没有相似度！

wllxf · 2009-06-10

那就分类吧，把相似的分成一类，5万多怎么也能分出大几千类别来。

逍遥小鱼 · 2009-06-22

晕。肯定要扔给数据库处理了，你自己处理不感觉麻烦吗，用数据库 like 处理吧。。应该差不多的，我经常这样处理，是仓库的。。

guanyue7613 · 2009-06-25

{******************************************************************************}{ CnPack For Delphi/C++Builder }{ 中国人自己的开放源码第三方开发包 }{ (C)Copyright 2001-2009 CnPack 开发组 }{ ------------------------------------ }{ }{ 本开发包是开源的自由软件，您可以遵照 CnPack 的发布协议来修 }{ 改和重新发布这一程序。 }{ }{ 发布这一开发包的目的是希望它有用，但没有任何担保。甚至没有 }{ 适合特定目的而隐含的担保。更详细的情况请参阅 CnPack 发布协议。 }{ }{ 您应该已经和开发包一起收到一份 CnPack 发布协议的副本。如果 }{ 还没有，可访问我们的网站： }{ }{ 网站地址：http://www.cnpack.org }{ 电子邮件：master@cnpack.org }{ }{******************************************************************************}{ 该单元基于 Angus Johnson 的 TDiffUnit.pas改写，以下是原的声明： }(******************************************************************************** Component TDiff ** Version: 1.1 ** Date: 24 February 2002 ** Compilers: Delphi 3 - Delphi 6 ** Author: Angus Johnson - ajohnson@rpi.net.au ** Copyright: ?2001-2002 Angus Johnson * ** Licence to use, terms and conditions: ** The code in the TDiff component is released as freeware ** provided you agree to the following terms & conditions: ** 1. the copyright notice, terms and conditions are ** left unchanged ** 2. modifications to the code by other authors must be ** clearly documented and accompanied by the modifier's name. ** 3. the TDiff component may be freely compiled into binary ** format and no acknowledgement is required. However, a ** discrete acknowledgement would be appreciated (eg. in a ** program's 'About Box'). ** ** Description: Component to list differences between two integer arrays ** using a "longest common sequence" algorithm. ** Typically, this component is used to diff 2 text files ** once their individuals lines have been hashed. ** By uncommenting {$DEFINE DIFF_BYTES} this component ** can also diff char arrays (eg to create file patches) ** ** Acknowledgements: The key algorithm in this component is based on: ** "An O(ND) Difference Algorithm and its Variations" ** By E Myers - Algorithmica Vol. 1 No. 2, 1986, pp. 251-266 ** http://www.cs.arizona.edu/people/gene/ ** http://www.cs.arizona.edu/people/gene/PAPERS/diff.ps ** ********************************************************************************)(******************************************************************************** History: ** 13 December 2001 - Original Release ** 24 February 2002 - OnProgress event added, improvements to code comments ********************************************************************************)unit CnStrDiff;{* |<PRE>================================================================================* 软件名称：开发包基础库* 单元名称：字符串详细比较单元* 单元作者：周劲羽 (zjy@cnpack.org)* 备注：该单元基于 Angus Johnson 的 TDiffUnit.pas改写* 开发平台：PWinXP + Delphi 5.01* 兼容测试：PWin9X/2000/XP + Delphi 5/6* 本地化：该单元中的字符串均符合本地化处理方式* 单元标识：$Id: CnStrDiff.pas,v 1.7 2009/01/02 08:27:38 liuxiao Exp $* 修改记录：2004.11.15 V1.0* 移植单元================================================================================|</PRE>}interface{$I CnPack.inc}uses Windows, SysUtils, Classes;const //Maximum allowed deviation from centre diagonal vector ... MAX_DIAGONAL = $FFFFF;type PDiagVectorArray = ^TDiagVectorArray; TDiagVectorArray = array[-MAX_DIAGONAL.. + MAX_DIAGONAL] of Integer; TScriptKind = (skAddRange, skDelRange, skDelDiagDel, skAddDiagAdd, skAddDel, skAddDiagDel, skDelDiagAdd); TChangeKind = (ckAdd, ckDelete, ckModify); PChangeRec = ^TChangeRec; TChangeRec = record Kind: TChangeKind; //(ckAdd, ckDelete, ckModify) x: Integer; //Array1 offset (where to add, delete, modify) y: Integer; //Array2 offset (what to add, modify) Range: Integer; //range

end; TProgressEvent = procedure(Sender: TObject; ProgressPercent: Integer) of object; TCnStrDiff = class private MaxD: Integer; fChangeList: TList; fLastAdd, fLastDel, fLastMod: PChangeRec; diagVecB: PDiagVectorArray; diagVecF: PDiagVectorArray; //forward and backward arrays Array1: PChar; Array2: PChar; fCancelled: Boolean; function RecursiveDiff(x1, y1, x2, y2: Integer): Boolean; procedure AddToScript(x1, y1, x2, y2: Integer; ScriptKind: TScriptKind); procedure ClearChanges; function GetChangeCount: Integer; function GetChanges(Index: Integer): TChangeRec; procedure PushAdd; procedure PushDel; procedure PushMod; public constructor Create; destructor Destroy; override; function Execute(const S1, S2: PChar; Size1, Size2: Integer): Boolean; property ChangeCount: Integer read GetChangeCount; property Changes[Index: Integer]: TChangeRec read GetChanges; default; end;// 计算两个字符串的相似程度，返回 0..1 之间的小数function SimilarText(S1, S2: string; CaseSensitive: Boolean = False): Double;implementation// Miscellaneous Functions ...function Min(a, b: Integer): Integer;begin if a < b then Result := a else Result := b;end;function Max(a, b: Integer): Integer;begin if a > b then Result := a else Result := b;end;function SimilarText(S1, S2: string; CaseSensitive: Boolean): Double;var Diff: TCnStrDiff; i: Integer; Count, Len: Integer;begin if (S1 = '') or (S2 = '') then begin if S1 = S2 then Result := 1 else Result := 0; Exit; end; if not CaseSensitive then begin S1 := UpperCase(S1); S2 := UpperCase(S2); end; Diff := TCnStrDiff.Create; try if Diff.Execute(PChar(S1), PChar(S2), Length(S1), Length(S2)) then begin Count := 0; for i := 0 to Diff.ChangeCount - 1 do Inc(Count, Diff.Changes.Range); Len := Max(Length(S1), Length(S2)); if Len > 1 then Dec(Len); Result := 1 - Count / Len; if Result < 0 then Result := 0; end else Result := 0; finally; Diff.Free; end;end;// TCnStrDiff Class ...constructor TCnStrDiff.Create;begin inherited; fChangeList := TList.Create;end;destructor TCnStrDiff.Destroy;begin ClearChanges; fChangeList.Free; inherited;end;function TCnStrDiff.Execute(const S1, S2: PChar; Size1, Size2: Integer): Boolean;var IntArr_f, IntArr_b: PChar;begin Result := False; ClearChanges; if not Assigned(S1) or not Assigned(S2) then Exit; Array1 := S1; Array2 := S2; //MaxD == Maximum possible deviation from centre diagonal vector //which can't be more than the largest intArray (with upperlimit = MAX_DIAGONAL) ... MaxD := Min(Max(Size1, Size2), MAX_DIAGONAL); //estimate the no. Changes == 1/8 total size rounded to a 32bit boundary fChangeList.capacity := (Max(MaxD, 1024) div 32) * 4; IntArr_f := nil; IntArr_b := nil; try //allocate the vector memory ... GetMem(IntArr_f, SizeOf(Integer) * (MaxD * 2 + 1)); GetMem(IntArr_b, SizeOf(Integer) * (MaxD * 2 + 1)); //Align the forward and backward diagonal vector arrays //with the memory which has just been allocated ... PChar(diagVecF) := PChar(IntArr_f) - SizeOf(Integer) * (MAX_DIAGONAL - MaxD); PChar(diagVecB) := PChar(IntArr_b) - SizeOf(Integer) * (MAX_DIAGONAL - MaxD); fCancelled := False; //NOW DO IT HERE... Result := RecursiveDiff(0, 0, Size1, Size2); //add remaining range buffers onto ChangeList... PushAdd; PushDel; if not Result then ClearChanges; finally FreeMem(IntArr_f); FreeMem(IntArr_b); end;end;function TCnStrDiff.RecursiveDiff(x1, y1, x2, y2: Integer): Boolean;var //normally, parameters and local vars should be stored on the heap for //recursive functions. However, as the maximum number of possible recursions //here is relatively small (<25) the risk of stack overflow is negligible. x, y, Delta, D, k: Integer;begin Result := True; //skip over initial and trailing matches... D := Min(x2 - x1, y2 - y1); k := 0; while (k < D) and (Array1[x1 + k + 1] = Array2[y1 + k + 1]) do Inc(k); Inc(x1, k); Inc(y1, k); Dec(D, k); k := 0; while (k < D) and (Array1[x2 - k] = Array2[y2 - k]) do Inc(k); Dec(x2, k); Dec(y2, k); //check if just all additions or all deletions... if (x2 = x1) then begin AddToScript(x1, y1, x2, y2, skAddRange); Exit; end else if (y2 = y1) then begin AddToScript(x1, y1, x2, y2, skDelRange); Exit; end; //divide and conquer ... //(recursively) find midpoints of the edit path... Delta := (x2 - x1) - (y2 - y1); //initialize forward and backward diagonal vectors... diagVecF[0] := x1; diagVecB[Delta] := x2; //OUTER LOOP ... //MAKE INCREASING OSCILLATIONS ABOUT CENTRE DIAGONAL UNTIL A FORWARD //DIAGONAL VECTOR IS GREATER THAN OR EQUAL TO A BACKWARD DIAGONAL. //nb: 'D' doesn't needs to start at 0 as there's never an initial match for D := 1 to MaxD do begin //forward loop............................................... //nb: k == index of current diagonal vector and // will oscillate (in increasing swings) between -MaxD and MaxD k := -D; while k <= D do begin //derive x from the larger of the adjacent vectors... if (k = -D) or ((k < D) and (diagVecF[k - 1] < diagVecF[k + 1])) then x := diagVecF[k + 1] else x := diagVecF[k - 1] + 1; y := x - x1 + y1 - k; //while (x+1,y+1) match - increment them... while (x < x2) and (y < y2) and (Array1[x + 1] = Array2[y + 1]) do begin Inc(x); Inc; end; //update current vector ... diagVecF[k] := x; //check if midpoint reached (ie: when diagVecF[k] & diagVecB[k] vectors overlap)... //nb: if midpoint found in forward loop then there must be common sub-sequences ... if odd(Delta) and (k > -D + Delta) and (k < D + Delta) and (diagVecF[k] >= diagVecB[k]) then begin //To avoid declaring 2 extra variables in this recursive function .. //Delta & k are simply reused to store the x & y values ... Delta := x; k := y; //slide up to top (left) of diagonal... while (x > x1) and (y > y1) and (Array1[x] = Array2[y]) do begin Dec(x); Dec; end; //do recursion with the first half... Result := RecursiveDiff(x1, y1, x, y); if not Result then Exit; //and again with the second half (nb: Delta & k are stored x & y)... Result := RecursiveDiff(Delta, k, x2, y2); Exit; //All done!!! end; Inc(k, 2); end; //backward loop.............................................. //nb: k will oscillate (in increasing swings) between -MaxD and MaxD k := -D + Delta; while k <= D + Delta do begin //make sure we remain within the diagVecB[] and diagVecF[] array bounds... if (k < -MaxD) then begin Inc(k, 2); Continue; end else if (k > MaxD) then Break; //derive x from the adjacent vectors... if (k = D + Delta) or ((k > -D + Delta) and (diagVecB[k + 1] > diagVecB[k - 1])) then x := diagVecB[k - 1] else x := diagVecB[k + 1] - 1; y := x - x1 + y1 - k; //while (x,y) match - decrement them... while (x > x1) and (y > y1) and (Array1[x] = Array2[y]) do begin Dec(x); Dec; end; //update current vector ... diagVecB[k] := x; //check if midpoint reached... if not odd(Delta) and (k >= -D) and (k <= D) and (diagVecF[k] >= diagVecB[k]) then begin //if D == 1 then the smallest common subsequence must have been found ... if D = 1 then //nb: if D == 1 then Delta must be in [-2,0,+2] begin if Delta = 2 then AddToScript(x1, y1, x2, y2, skDelDiagDel) else if Delta = -2 then AddToScript(x1, y1, x2, y2, skAddDiagAdd) else if (x1 + 1 = x2) then AddToScript(x1, y1, x2, y2, skAddDel) else if (Array1[x1 + 2] = Array2[y1 + 1]) then AddToScript(x1, y1, x2, y2, skDelDiagAdd) else AddToScript(x1, y1, x2, y2, skAddDiagDel); end else begin // D > 1 then find common sub-sequences... //process the first half... Result := RecursiveDiff(x1, y1, x, y); if not Result then Exit; //now slide down to bottom (right) of diagonal... while (x < x2) and (y < y2) and (Array1[x + 1] = Array2[y + 1]) do begin Inc(x); Inc; end; //and process the second half... Result := RecursiveDiff(x, y, x2, y2); end; Exit; //All done!!! end; Inc(k, 2); end; end; Result := False;end;(*................................. . skAddRange: | . (x1 == x2) | . | . . skDelRange: ---- . (y1 == y2) . .When the midpoint is reached in .the smallest possible editgrid, .D = 1 & Delta must be even and .the snake must appears as one of: . . skAddDiagAdd: | . (Delta == -2) / . | . . skDelDiagDel: _ . (Delta == +2) / . - . . skAddDel: |_ . (Delta == 0 . & Rec size == 1x1) . nb: skAddDel == skDelAdd . . skAddDiagDel | . (Delta == 0) / . - . . skDelDiagAdd _ . (Delta == 0) / . | . ..................................*)procedure TCnStrDiff.PushAdd;begin PushMod; if Assigned(fLastAdd) then fChangeList.Add(fLastAdd); fLastAdd := nil;end;procedure TCnStrDiff.PushDel;begin PushMod; if Assigned(fLastDel) then fChangeList.Add(fLastDel); fLastDel := nil;end;procedure TCnStrDiff.PushMod;begin if Assigned(fLastMod) then fChangeList.Add(fLastMod); fLastMod := nil;end;//This is a bit UGLY but simply reduces many adds & deletes to many fewer//add, delete & modify ranges which are then stored in ChangeList...procedure TCnStrDiff.AddToScript(x1, y1, x2, y2: Integer; ScriptKind: TScriptKind);var i: Integer; procedure TrashAdd; begin Dispose(fLastAdd); fLastAdd := nil; end; procedure TrashDel; begin Dispose(fLastDel); fLastDel := nil; end; procedure NewAdd(x1, y1: Integer); begin New(fLastAdd); fLastAdd.Kind := ckAdd; fLastAdd.x := x1; fLastAdd.y := y1; fLastAdd.Range := 1; end; procedure NewMod(x1, y1: Integer); begin New(fLastMod); fLastMod.Kind := ckModify; fLastMod.x := x1; fLastMod.y := y1; fLastMod.Range := 1; end; procedure NewDel(x1: Integer); begin New(fLastDel); fLastDel.Kind := ckDelete; fLastDel.x := x1; fLastDel.y := 0; fLastDel.Range := 1; end; // 1. there can NEVER be concurrent fLastAdd and fLastDel record ranges. // 2. fLastMod is always pushed onto ChangeList before fLastAdd & fLastDel. procedure Add(x1, y1: Integer); begin if Assigned(fLastAdd) then //OTHER ADDS PENDING begin if (fLastAdd.x = x1) and (fLastAdd.y + fLastAdd.Range = y1) then Inc(fLastAdd.Range) //add in series else begin PushAdd; NewAdd(x1, y1); end; //add NOT in series end else if Assigned(fLastDel) then //NO ADDS BUT DELETES PENDING begin if x1 = fLastDel.x then //add matches pending del so modify ... begin if Assigned(fLastMod) and (fLastMod.x + fLastMod.Range - 1 = x1) and (fLastMod.y + fLastMod.Range - 1 = y1) then Inc(fLastMod.Range) //modify in series else begin PushMod; NewMod(x1, y1); end; //start NEW modify if fLastDel.Range = 1 then TrashDel //decrement or remove existing del else begin Dec(fLastDel.Range); Inc(fLastDel.x); end; end else begin PushDel; NewAdd(x1, y1); end; //add does NOT match pending del's end else NewAdd(x1, y1); //NO ADDS OR DELETES PENDING end; procedure Delete(x1: Integer); begin if Assigned(fLastDel) then //OTHER DELS PENDING begin if (fLastDel.x + fLastDel.Range = x1) then Inc(fLastDel.Range) //del in series else begin PushDel; NewDel(x1); end; //del NOT in series end else if Assigned(fLastAdd) then //NO DELS BUT ADDS PENDING begin if x1 = fLastAdd.x then //del matches pending add so modify ... begin if Assigned(fLastMod) and (fLastMod.x + fLastMod.Range = x1) then Inc(fLastMod.Range) //mod in series else begin PushMod; NewMod(x1, fLastAdd.y); end; //start NEW modify ... if fLastAdd.Range = 1 then TrashAdd //decrement or remove existing add else begin Dec(fLastAdd.Range); Inc(fLastAdd.x); Inc(fLastAdd.y); end; end else begin PushAdd; NewDel(x1); end; //del does NOT match pending add's end else NewDel(x1); //NO ADDS OR DELETES PENDING end;begin case ScriptKind of skAddRange: for i := y1 to y2 - 1 do Add(x1, i); skDelRange: for i := x1 to x2 - 1 do Delete(i); skDelDiagDel: begin Delete(x1); Delete(x2 - 1); end; skAddDiagAdd: begin Add(x1, y1); Add(x2, y2 - 1); end; skAddDel: begin Add(x1, y1); Delete(x2 - 1); end; skDelDiagAdd: begin Delete(x1); Add(x2, y2 - 1); end; skAddDiagDel: begin Add(x1, y1); Delete(x2 - 1); end; end;end;procedure TCnStrDiff.ClearChanges;var i: Integer;begin for i := 0 to fChangeList.Count - 1 do dispose(PChangeRec(fChangeList)); fChangeList.Clear;end;function TCnStrDiff.GetChangeCount: Integer;begin Result := fChangeList.Count;end;function TCnStrDiff.GetChanges(Index: Integer): TChangeRec;begin Result := PChangeRec(fChangeList[Index])^;end;end.

suger · 2009-08-24

接受答案了.

请教个字符串相似度算法, 300分送上(300)

suger

Unregistered / Unconfirmed

凤冠坡

Unregistered / Unconfirmed

suger

Unregistered / Unconfirmed

cpj7406

Unregistered / Unconfirmed

fanronghua

Unregistered / Unconfirmed

suger

Unregistered / Unconfirmed

hs-kill

Unregistered / Unconfirmed

zkktom

Unregistered / Unconfirmed

wllxf

Unregistered / Unconfirmed

逍遥小鱼

Unregistered / Unconfirmed

guanyue7613

Unregistered / Unconfirmed

suger

Unregistered / Unconfirmed

Similar threads