请教字符串算法，很有意思的问题 (300分)

LiChaoHui · 2003-02-13

我再提个测试用例，楼主看一下理想的结果应该是什么
文本：
< tab le border = 1 mm = "abcde mmn" pp= me > f asa 342.dfsa </ pp>
关键字：
<table border=1 mm="abcde mmn" pp=me>fasa342.dfsa</pp>
对于这个例子，如果允许通过的话，那么程序就比较简单了
但我的程序不允许通过这种例子

Adnil · 2003-02-13

"table"不会变成"tab le"的，这种可能性不会有，不需要考虑。

我在哪里能看到您的代码？

copy_paste，不清楚哦，我的应用中html待查找字串就是读取文件后放在Memo中的

ARMADA · 2003-02-13

请问需要一次查找所有吗？还是一次一个结果

参照这个思路试试：

首先要整理子字符串：
第一个字符为非空格，非回车字符；
连续的空格转换为一个空格，回车或连续的回车转换为一个空格或消除（我不知道你需要怎样）
结尾处的空格、回车全部截除；
这样你就有一个标准的子字符串，因为考虑到子字符串也可能不标准

接下来查找
1。查找子字符串中的第一个有效字符{就是非空格，非回车字符}
取得这个字符在母字符串中的位置，记录这个位置，首部
2。然后比较子串的第二个有效字符与母串的下一个字符是否相同，
如果该字符不同则重复1。如果该字符为空格，则跳过接下来的空格
3。参照步骤2。比较其它数据
4。如果完成比较则记录结尾在母串的位置，标记首部与尾部位置之间的所有字符
查找完成。

可以用多线程分块同时查找，容许追踪到下一个块即可

估计要考虑到file stream stringlist string 等结构的转化

最近比较忙，可能没空写代码，思虑可以参考一下，最原始的方法

LiChaoHui · 2003-02-13

to 楼主：
我的测试程序在
http://www.playicq.com/dispdoc.php?id=2981
你先测试一下吧，如果可以，我再发给你源码

Adnil · 2003-02-13

LiChaoHui您好，您的代码我测试过了，基本上没有问题！

Adnil · 2003-02-13

不过你给的可执行程序中关键字串不能输入回车，这点我就没有测试到了

LiChaoHui · 2003-02-13

留下邮箱，我将所有的源代码发给你

源代码的核心是函数LcScan
第一个参数是待查找的文本
第二个参数是要查找的关键字原型
第三个参数是开始查找的位置，变量参数，按地址传递
函数的返回值是查找结果串的长度，如果为0，表示没有找到
函数的代码行数为147，执行效率未经测试

查找一次后，第三个参数会自动后移，所以可以在第一次查找的基础上继续查找
只需要计算下一次查找的起始位置，例子程序中仅仅查找了第一处关键字

LiChaoHui · 2003-02-13

要求：
如果关键字中含有回车，则原文中必须有回车，否则匹配失败

Adnil · 2003-02-13

adnil.zhou@hdtworld.com，谢谢！

LiChaoHui · 2003-02-13

已发送！！

Adnil · 2003-02-13

代码已经收到，基本上没有问题。

shenloqi,LiChaoHui,copy_paste每人两百分
ARMADA,50
zhukewen, 50
不知大家有没有意见？

几天内我会总结出一个效率比较高的最终版本贴上来。

shenloqi · 2003-02-13

Adnil：我的代码也已经修正。
支持Unicode查找（字数还是ansi的，不过你已经说是可以用Length了）
请再测试吧http://loqi.myetang.com/findtext.exe
上面的代码已修改

copy_paste · 2003-02-13

LiChaoHui干嘛不都代码贴出来，是不是太长了？

Adnil · 2003-02-13

我帮LiChaoHui贴出来吧，相信他不会介意

unit Unit1;

interface

uses
Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
Dialogs, StdCtrls;

type
TForm1 = class(TForm)
Button1: TButton;
Memo1: TMemo;
Edit1: TEdit;
Memo2: TMemo;
Label1: TLabel;
Label2: TLabel;
procedure Button1Click(Sender: TObject);
procedure FormCreate(Sender: TObject);
private
{ Private declarations }
public
function LcScan(Text: String
Keyw: String;
var iStart: Integer): Integer;
end;

var
Form1: TForm1;

implementation

{$R *.dfm}

{ TForm1 }

function TForm1.LcScan(Text, Keyw: String
var iStart: Integer): Integer;
var
iState: Integer;
LastState: Integer;
i, m, len, ml: Integer;
NextCh, LastCh: Char;
begin
len := Length(Text);
ml := Length(Keyw);
if (len = 0) or (ml = 0) then
begin
Result := 0;
Exit;
end;
iState := 0;
LastState := 0;
m := 1;
NextCh := Keyw[m];
LastCh := #0;
i := iStart;
while i <= len do
begin
case iState of
0: //表示当然位置在标记之外
begin
case Text of
' ',#13,#10: iState := 4;
'<': iState := 2;
end;
end;
1:
begin
case Text of
' ': iState := 2;
#13: iState := 2;
#10:
begin
if (i > 1) and (Text[i-1] = #13) then
iState := 2
else
iState := 2;
end;
'"': iState := 3;
'>': iState := 4;
end;
end;
2:
begin
case Text of
' ',#13,#10: ;
'"': iState := 3;
'>': iState := 4;
else
iState := 1;
end;
end;
3:
begin
case Text of
'"': iState := 1;
'>': iState := 4;
end;
end;
4:
begin
case Text of
' ',#13,#10: ;
else
iState := 0;
end;
end;
end;
if iState = 2 then
begin
if (NextCh = '=') or (LastCh = '=') or (NextCh = '>') then
begin
if m = 1 then iStart := i + 1;
end
else
if LastState <> 2 then
begin
if Text = Keyw[m] then
begin
m := m + 1;
if m > ml then
begin
//完成匹配
Break;
end
else
begin
LastCh := NextCh;
NextCh := Keyw[m];
end;
end
else
begin
if m > 1 then
begin
i := iStart
//进行回退
m := 1;
end;
iStart := i + 1;
end;
end;
end
else
begin
if Text = Keyw[m] then
begin
m := m + 1;
if m > ml then
begin
Break;
end
else
begin
LastCh := NextCh;
NextCh := Keyw[m];
end;
end
else
if (iState = 4)then
begin
if m = 1 then iStart := i + 1;
end
else
begin
if m > 1 then
begin
i := iStart
//进行回退
m := 1;
end;
iStart := i + 1;
end;
end;
LastState := iState;
i := i + 1;
end;
if i > len then
begin
iStart := len + 1;
Result := 0;
end
else
Result := i - iStart + 1;
end;

procedure TForm1.Button1Click(Sender: TObject);
var
im: Integer;
len: Integer;
s1: String;
s2: String;
begin
im := 1;
//s1 := 'ad<fa fa = "fd me" df >1234 fasdf fdas';
//s1 := ''#13#10'2001 . 4 . 1 ';
//s1 := ' ' + #13#10 +
// '2001 .4. 1';
//s2 := 'd<fa fa="fd me" df >1234';
//s2 := '2001.4.1';
s1 := Memo2.Lines.Text;
s2 := StringReplace(Edit1.Text, '^p', #13#10, [rfReplaceAll]);
len := LcScan(s1, s2, im);
//ShowMessage(Copy(s1, im, len));
Memo1.Lines.Clear;
Memo1.Lines.Add('Search Text:');
Memo1.Lines.Add(s1);
Memo1.Lines.Add('');
Memo1.Lines.Add('KeyWord:');
Memo1.Lines.Add(s2);
Memo1.Lines.Add('');
Memo1.Lines.Add('Search Result:');
Memo1.Lines.Add('StartPos: ' + IntToStr(im));
Memo1.Lines.Add('MatchLen: ' + IntToStr(len));
Memo1.Lines.Add('FoundTxt: ' + Copy(s1, im, len))
end;

procedure TForm1.FormCreate(Sender: TObject);
begin
Memo2.Clear;
Memo1.Clear;
Edit1.Clear;
end;

end.

LiChaoHui · 2003-02-13

请楼主仔细的测试一下，（另：将函数的前两个参数改为const型参数，可以提高多次调用的效率）
要用一些反例，即测试不允许通过的例子，再测试一下，然后把结果告诉我

再：
我很想知道我的算法的效率如何

还请那个告诉我一下，集合运算和常规的比较运算，那个效率更高？
譬如
thechar in ['a', #13, #32, 'm']
和
if (thechar = 'a') or
(thechar = #13) or
(thechar = #32) or
(thechar = 'm') then

哪一个效率更高呢？

shenloqi · 2003-02-13

LiChaoHui的代码使用了部分Token技术。
但是这段代码对Unicode是不支持的，修改成为支持Unicode也不容易。
我的测试：
把Button1Click的代码换成
procedure TForm1.Button1Click(Sender: TObject);
var
im: Integer;
len: Integer;
s1: String;
s2: String;
i:Integer;
begin
im := 1;
s2 := ' 中文测试';
s1 := '';
for i := 1 to Length(s2) do s1 := s1 + s2 + ' ';
len := LcScan(s1, s2, im);
Memo1.Lines.Clear;
Memo1.Lines.Add('Search Text:');
Memo1.Lines.Add(s1);
Memo1.Lines.Add('');
Memo1.Lines.Add('KeyWord:');
Memo1.Lines.Add(s2);
Memo1.Lines.Add('');
Memo1.Lines.Add('Search Result:');
Memo1.Lines.Add('StartPos: ' + IntToStr(im));
Memo1.Lines.Add('MatchLen: ' + IntToStr(len));
Memo1.Lines.Add('FoundTxt: ' + Copy(s1, im, len))
end;
得到的结果是：
Search Text:
????????

KeyWord:
中文测试

Search Result:
StartPos: 1
MatchLen: 17
FoundTxt: ????????
实际应该是找不到的。
我的代码中没有这个问题。

Adnil · 2003-02-13

我觉得
thechar in ['a', #13, #32, 'm']
和
if (thechar = 'a') or
(thechar = #13) or
(thechar = #32) or
(thechar = 'm') 这两者之间的效率应该没有太大的区别，编译器好的话或许都会编译
成同样的机器码，我倾向于第一种，起码可读性比较好，性能还是主要看算法，小地方
就忽略不计了，呵呵

LiChaoHui · 2003-02-13

如果是推测的可不行，有谁能证明一下更好

谢谢shenloqi,给我找出的毛病，我又加了十几行代码，增强了对中文的支持
修改的地方有两处，下面是修改后的源代码，
注意，我已将前两个参数改为 const 型，可以加快调用的速度
（是否真的能加快，有待讨论及测试）
function TForm1.LcScan(const Text, Keyw: String
var iStart: Integer): Integer;
var
iState: Integer;
LastState: Integer;
i, m, len, ml: Integer;
NextCh, LastCh: Char;
HzBt: String;
begin
len := Length(Text);
ml := Length(Keyw);
if (len = 0) or (ml = 0) then
begin
Result := 0;
Exit;
end;
SetLength(HzBt, ml);
Byte(HzBt[1]) := Ord(Keyw[1] > #128);
for i := 2 to ml do
begin
case Byte(HzBt[i-1]) of
0: Byte(HzBt) := Ord(Keyw > #128);
1:
begin
Byte(HzBt) := Ord(Keyw > #64);
if Byte(HzBt) > 0 then Byte(HzBt) := 2;
end;
2: Byte(HzBt) := Ord(Keyw > #128);
end;
end;
iState := 0;
LastState := 0;
m := 1;
NextCh := Keyw[m];
LastCh := #0;
i := iStart;
while i <= len do
begin
case iState of
0: //表示当然位置在标记之外
begin
case Text of
' ',#13,#10: iState := 4;
'<': iState := 2;
end;
end;
1:
begin
case Text of
' ': iState := 2;
#13: iState := 2;
#10:
begin
if (i > 1) and (Text[i-1] = #13) then
iState := 2
else
iState := 2;
end;
'"': iState := 3;
'>': iState := 4;
end;
end;
2:
begin
case Text of
' ',#13,#10: ;
'"': iState := 3;
'>': iState := 4;
else
iState := 1;
end;
end;
3:
begin
case Text of
'"': iState := 1;
'>': iState := 4;
end;
end;
4:
begin
case Text of
' ',#13,#10: ;
else
iState := 0;
end;
end;
end;
if iState = 2 then
begin
if (NextCh = '=') or (LastCh = '=') or (NextCh = '>') then
begin
if m = 1 then iStart := i + 1;
end
else
if LastState <> 2 then
begin
if Text = Keyw[m] then
begin
m := m + 1;
if m > ml then
begin
//完成匹配
Break;
end
else
begin
LastCh := NextCh;
NextCh := Keyw[m];
end;
end
else
begin
if m > 1 then
begin
i := iStart
//进行回退
m := 1;
end;
iStart := i + 1;
end;
end;
end
else
begin
if Text = Keyw[m] then
begin
m := m + 1;
if m > ml then
begin
Break;
end
else
begin
LastCh := NextCh;
NextCh := Keyw[m];
end;
end
else
if (iState = 4) and (Byte(HzBt[m]) < 2) then
begin
if m = 1 then iStart := i + 1;
end
else
begin
if m > 1 then
begin
i := iStart
//进行回退
m := 1;
end;
iStart := i + 1;
end;
end;
LastState := iState;
i := i + 1;
end;
if i > len then
begin
iStart := len + 1;
Result := 0;
end
else
Result := i - iStart + 1;
end;

LiChaoHui · 2003-02-13

至于修改成为支持Unicode也不难，简直太容易，仅修改了几个字母
下面便是Unicode版的查找函数，也不用考虑汉字的问题了

function TForm1.LcScanW(const Text, Keyw: WideString
var iStart: Integer): Integer;
var
iState: Integer;
LastState: Integer;
i, m, len, ml: Integer;
NextCh, LastCh: WideChar;
begin
len := Length(Text);
ml := Length(Keyw);
if (len = 0) or (ml = 0) then
begin
Result := 0;
Exit;
end;
iState := 0;
LastState := 0;
m := 1;
NextCh := Keyw[m];
LastCh := #0;
i := iStart;
while i <= len do
begin
case iState of
0: //表示当然位置在标记之外
begin
case Text of
' ',#13,#10: iState := 4;
'<': iState := 2;
end;
end;
1:
begin
case Text of
' ': iState := 2;
#13: iState := 2;
#10:
begin
if (i > 1) and (Text[i-1] = #13) then
iState := 2
else
iState := 2;
end;
'"': iState := 3;
'>': iState := 4;
end;
end;
2:
begin
case Text of
' ',#13,#10: ;
'"': iState := 3;
'>': iState := 4;
else
iState := 1;
end;
end;
3:
begin
case Text of
'"': iState := 1;
'>': iState := 4;
end;
end;
4:
begin
case Text of
' ',#13,#10: ;
else
iState := 0;
end;
end;
end;
if iState = 2 then
begin
if (NextCh = '=') or (LastCh = '=') or (NextCh = '>') then
begin
if m = 1 then iStart := i + 1;
end
else
if LastState <> 2 then
begin
if Text = Keyw[m] then
begin
m := m + 1;
if m > ml then
begin
//完成匹配
Break;
end
else
begin
LastCh := NextCh;
NextCh := Keyw[m];
end;
end
else
begin
if m > 1 then
begin
i := iStart
//进行回退
m := 1;
end;
iStart := i + 1;
end;
end;
end
else
begin
if Text = Keyw[m] then
begin
m := m + 1;
if m > ml then
begin
Break;
end
else
begin
LastCh := NextCh;
NextCh := Keyw[m];
end;
end
else
if (iState = 4) then
begin
if m = 1 then iStart := i + 1;
end
else
begin
if m > 1 then
begin
i := iStart
//进行回退
m := 1;
end;
iStart := i + 1;
end;
end;
LastState := iState;
i := i + 1;
end;
if i > len then
begin
iStart := len + 1;
Result := 0;
end
else
Result := i - iStart + 1;
end;

copy_paste · 2003-02-13

procedure TForm1.Button1Click(Sender: TObject);
var
A, I: Integer;
dwStart, dwEnd: DWORD;
begin
A := 20;
dwStart := GetTickCount;
for I := 0 to 100000000 do
if A in [9, 10, 13, 32] then
A := 20;
dwEnd := GetTickCount;
Memo1.Lines.Add(Format('Time: %d', [dwStart - dwEnd]));
dwStart := GetTickCount;
for I := 0 to 100000000 do
if (A = 9) or (A = 10) or (A = 13) or (A = 32) then
A := 20;
dwEnd := GetTickCount;
Memo1.Lines.Add(Format('Time: %d', [dwStart - dwEnd]));
end;

IN快点，在集合中如果多的情况下，
集合中如果有两个元素以上的话，就会比常规方法快了，如果是只是一个慢一点。

请教字符串算法，很有意思的问题 (300分)

LiChaoHui

Unregistered / Unconfirmed

Adnil

Unregistered / Unconfirmed

ARMADA

Unregistered / Unconfirmed

LiChaoHui

Unregistered / Unconfirmed

Adnil

Unregistered / Unconfirmed

Adnil

Unregistered / Unconfirmed

LiChaoHui

Unregistered / Unconfirmed

LiChaoHui

Unregistered / Unconfirmed

Adnil

Unregistered / Unconfirmed

LiChaoHui

Unregistered / Unconfirmed

Adnil

Unregistered / Unconfirmed

shenloqi

Unregistered / Unconfirmed

copy_paste

Unregistered / Unconfirmed

Adnil

Unregistered / Unconfirmed

LiChaoHui

Unregistered / Unconfirmed

shenloqi

Unregistered / Unconfirmed

Adnil

Unregistered / Unconfirmed

LiChaoHui

Unregistered / Unconfirmed

LiChaoHui

Unregistered / Unconfirmed

copy_paste

Unregistered / Unconfirmed

Similar threads