====How can I quickly extract specific strings from a text file? (High-point bounty ^O^)==== (100 points)

  • Thread starter: Topside

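1. The text file is fairly large, or there is essentially no limit on its size (the solution should handle text as large as the system itself allows).
2. Extract the text that contains a specific string (such as "大富翁论坛") or a certain structure (such as "although ... but"),
using '。', '?', '!' and similar marks as the cut-off boundaries (so it must cope with a line break that appears before the sentence boundary is reached).
3. It must handle Chinese text and mixed Chinese-English text.
4. The extracted results are saved to a text file with no size limit.
5. I have already spent two days browsing this forum and have not found an approach that both removes the file-size limit
and runs at high, or at least reasonably high, speed.
6. I am currently using RichEdit, and it responds too slowly or stops responding when the file is too large.
7. I have tried a good number of the related solutions posted on the forum, with poor results.
8. If it is not too much trouble, please point me to a reasonably detailed solution.
9. Thanks in advance, and my regards to all the experts.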
 
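10. My original idea was to read the text in, delete every #13, and then insert a line break after each sentence at the delimiters such as '。',
after which I would check the lines one by one, number each matching line, and write it to a text file.
Would that work? Is the file size limited? How exactly would I implement it?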
 
You could try dtSearch.
Here is the product information:
==================================
fuzzy searching. dtSearch's proprietary fuzzy searching uses a unique algorithm to find search terms even if they are misspelled. Search fuzziness adjusts from 0 to 10 to correspond to the level of typographical or OCR errors in files. With a fuzziness level of 1, a search for alphabet would find alphaqet. With a fuzziness level of 3, a search for alphabet would find not only alphaqet but also alpkaqet. Note: fuzziness is not hardwired into the index, so the same index can handle both fuzzy and non-fuzzy searches. (Unindexed searches can also be fuzzy!)

concept/synonym/thesaurus searching. dtSearch can perform automatic query expansion using a comprehensive semantic network of the English language with variable levels of expansion (user-defined synonyms, built-in synonyms, or built-in synonyms + related words).
relevancy-ranked natural language searching. Natural language searches, also known as query-by-example, look for all words in a search request and return results based on automatic term weighting. Using the "Vector Space" method, dtSearch's relevancy ranking takes into account the frequency of hits, relative frequency of the search terms in the index, and hit density in retrieved documents.
variable term weighting. dtSearch provides not only the automatic relevancy ranking in a natural language search request, but also the ability to specify relative weights. These weights can be positive or negative. For example, a user might assign a positive weight of 3 to the word green and a negative weight of five to the word orange.
field searching. dtSearch automatically recognizes and indexes fielded data in such file formats as MS Word, Excel, PowerPoint, HTML, PDF and XML, making these fields separately searchable by field name (as well as accessible for full-text searching). Version 6.0 adds support for searching based on nested field criteria in XML documents.
Unicode support. Version 6.0 adds Unicode support, which expands supported character sets to include Chinese and Japanese, while enhancing support for European language character sets.
 
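>>Extract the text that contains a specific string (such as "大富翁论坛")
First read the file into a TMemoryStream. Then:
pos('大富翁论坛',pchar(Memorystream.memory))
Searching a 100 MB file this way takes only a little over 10 seconds.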
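
A caveat on the Pos call above: the buffer behind TMemoryStream.Memory is not guaranteed to end with a #0 byte, so casting it to PChar can read past the end of the loaded data. Below is a minimal sketch of a safer variant. It assumes the file holds ANSI or UTF-8 text with no embedded #0 bytes, and that the search string is stored in the same encoding as the file. FindInFile is a hypothetical helper name, not a VCL routine; StrPos comes from SysUtils (newer Delphi versions move the PAnsiChar overload to the AnsiStrings unit).

```pascal
program FindInFileDemo;
{$IFDEF FPC}{$MODE DELPHI}{$H+}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, Classes;

// Hypothetical helper: returns the zero-based byte offset of the first
// occurrence of Needle in the file, or -1 when it does not occur.
function FindInFile(const FileName: string; const Needle: AnsiString): Int64;
var
  Stream: TMemoryStream;
  Zero: Byte;
  Hit: PAnsiChar;
begin
  Result := -1;
  Stream := TMemoryStream.Create;
  try
    Stream.LoadFromFile(FileName);

    // Append an explicit terminator so the buffer is a valid
    // null-terminated string before it is treated as one.
    Zero := 0;
    Stream.Position := Stream.Size;
    Stream.WriteBuffer(Zero, 1);

    // Search the bytes in place; no second copy of the file is made.
    Hit := StrPos(PAnsiChar(Stream.Memory), PAnsiChar(Needle));
    if Hit <> nil then
      Result := Hit - PAnsiChar(Stream.Memory);
  finally
    Stream.Free;
  end;
end;

begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: FindInFileDemo <file>');
    Halt(1);
  end;
  // Assumption: the literal is compiled in the same encoding as the file.
  WriteLn('byte offset: ', FindInFile(ParamStr(1), '大富翁论坛'));
end.
```

The whole file still has to fit in memory, so this only addresses the crash on the Pos call, not the size limit raised in the question.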
 
But reading a 100 MB file must take a long time, surely? If the machine is short on memory, it probably will not be fast.
 
No, it loads in an instant. :)
 
I used TMemoryStream to read a 36 MB text file and the machine froze several times. Why is that? Thanks.
 
Post your code so we can take a look!
 
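The code is roughly as follows. My own machine cannot get online, so I am posting from someone else's; I believe I have remembered it correctly.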
var
TempStream:TmemoryStream;
begin
if OpenDialog1.Execute then
TempStream:=TmemoryStream.Create;
TempStream.LoadFromFile(OpenDialog1.Filename);
Label1.Caption:='File Loaded.';
TempStream.Free;
end;
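A few more questions:
1. Calling pos('大富翁论坛',pchar(Memorystream.memory)) raises an error. Why is that?
2. Once the text has been read into memory, how do I strip out #13#10?
3. If I want to cut out the sentence that contains '大富翁论坛', using '。', '?' and so on as boundaries, how do I do it?
--Thanks!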
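
Questions 2 and 3 can be sketched together: locate the search string, widen the span outwards to the nearest sentence terminators, then drop the #13 and #10 characters from the result. This is only an illustration under stated assumptions. It expects the text in a Delphi string (UTF-16 in recent Delphi, or UTF-8 bytes under Free Pascal); with a double-byte encoding such as GBK a byte-wise scan could match a terminator in the middle of a character. SentenceContaining, TerminatorLenAt and the Terminators list are hypothetical names, and the per-position Copy is written for clarity, not speed.

```pascal
program SentenceAroundHit;
{$IFDEF FPC}{$MODE DELPHI}{$H+}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

const
  // Assumed terminator set, taken from the question.
  Terminators: array[0..2] of string = ('。', '?', '!');

// Length of the terminator that starts at Index, or 0 if none starts there.
function TerminatorLenAt(const S: string; Index: Integer): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := Low(Terminators) to High(Terminators) do
    if Copy(S, Index, Length(Terminators[I])) = Terminators[I] then
    begin
      Result := Length(Terminators[I]);
      Exit;
    end;
end;

// Returns the sentence holding the first occurrence of Needle, with
// CR and LF removed, or '' when Needle does not occur in Text.
function SentenceContaining(const Text, Needle: string): string;
var
  Hit, First, Last, I, N: Integer;
begin
  Result := '';
  Hit := Pos(Needle, Text);
  if Hit = 0 then
    Exit;

  // Walk backwards to just past the previous terminator.
  First := 1;
  for I := Hit - 1 downto 1 do
  begin
    N := TerminatorLenAt(Text, I);
    if N > 0 then
    begin
      First := I + N;
      Break;
    end;
  end;

  // Walk forwards to the end of the next terminator.
  Last := Length(Text);
  I := Hit + Length(Needle);
  while I <= Length(Text) do
  begin
    N := TerminatorLenAt(Text, I);
    if N > 0 then
    begin
      Last := I + N - 1;
      Break;
    end;
    Inc(I);
  end;

  Result := Copy(Text, First, Last - First + 1);
  // A line break inside the sentence is not a boundary, so remove it.
  Result := StringReplace(Result, #13, '', [rfReplaceAll]);
  Result := StringReplace(Result, #10, '', [rfReplaceAll]);
end;

begin
  WriteLn(SentenceContaining(
    '前一句。这一句被回车' + #13#10 + '打断,但提到了大富翁论坛!后一句。',
    '大富翁论坛'));
end.
```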
 
A very simple question: why does opening a 36 MB file work, while opening a 72 MB file freezes the machine?
procedure TForm1.Button2Click(Sender: TObject);
var
MStream:TMemoryStream;
begin
if OpenDialog1.Execute then
MStream:=TMemoryStream.Create;
MStream.LoadFromFile(OpenDialog1.FileName);
MStream.Position:=0;
ShowMessage('File have been loaded to MemoryStream.');
MStream.Free;
end;
Thank you.
 
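Maybe it is a begin-end problem! Without begin..end, the if statement guards only the Create call, so the lines after it run even when the dialog is cancelled.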

Try mine; it loaded 184 MB successfully.

procedure TForm1.Button1Click(Sender: TObject);
var
  MStream: TMemoryStream;
begin
  if OpenDialog1.Execute then
  begin
    MStream := TMemoryStream.Create;
    try
      MStream.LoadFromFile(OpenDialog1.FileName);
      MStream.Position := 0;
      ShowMessage('File have been loaded to MemoryStream.');
    finally
      MStream.Free;
    end;
  end;
end;
 
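Thanks, jiichen. However, when I used it to open a 288 MB file, the machine still froze without a moment's hesitation. Given what I am trying to do,
I have finally decided to solve the problem with the old-fashioned Readln() and to treat speed as a secondary concern. Thanks to all the experts.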
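
For reference, the Readln() route could look roughly like the sketch below. The poster's actual code was not shown, so this is an assumption about the approach, not a reconstruction of it. It reads one line at a time, carries an unfinished sentence over to the next line (Readln already discards the line break), and writes each numbered sentence that contains the search string. Only the pending sentence is held in memory, so the input size is not bounded by RAM. ExtractMatching and FirstTerminator are hypothetical names, and the terminator list is taken from the question. Text is assumed to be ANSI or UTF-8 under Free Pascal, or to be converted by the RTL under Unicode Delphi.

```pascal
program ExtractSentences;
{$IFDEF FPC}{$MODE DELPHI}{$H+}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

const
  // Assumed terminator set, taken from the question.
  Terminators: array[0..2] of string = ('。', '?', '!');

// Position of the earliest terminator in S (0 if none);
// TermLen receives the length of that terminator.
function FirstTerminator(const S: string; var TermLen: Integer): Integer;
var
  I, P: Integer;
begin
  Result := 0;
  TermLen := 0;
  for I := Low(Terminators) to High(Terminators) do
  begin
    P := Pos(Terminators[I], S);
    if (P > 0) and ((Result = 0) or (P < Result)) then
    begin
      Result := P;
      TermLen := Length(Terminators[I]);
    end;
  end;
end;

// Copies every sentence of InName that contains Needle to OutName,
// one numbered sentence per line.
procedure ExtractMatching(const InName, OutName, Needle: string);
var
  Src, Dst: TextFile;
  Line, Pending, Sentence: string;
  P, TermLen, Count: Integer;
begin
  AssignFile(Src, InName);
  Reset(Src);
  try
    AssignFile(Dst, OutName);
    Rewrite(Dst);
    try
      Pending := '';
      Count := 0;
      while not Eof(Src) do
      begin
        ReadLn(Src, Line);
        // Join with the unfinished sentence from earlier lines.
        Pending := Pending + Line;
        P := FirstTerminator(Pending, TermLen);
        while P > 0 do
        begin
          Sentence := Copy(Pending, 1, P + TermLen - 1);
          Delete(Pending, 1, P + TermLen - 1);
          if Pos(Needle, Sentence) > 0 then
          begin
            Inc(Count);
            WriteLn(Dst, Count, '. ', Sentence);
          end;
          P := FirstTerminator(Pending, TermLen);
        end;
      end;
      // Trailing text that never reached a terminator.
      if (Pending <> '') and (Pos(Needle, Pending) > 0) then
      begin
        Inc(Count);
        WriteLn(Dst, Count, '. ', Pending);
      end;
    finally
      CloseFile(Dst);
    end;
  finally
    CloseFile(Src);
  end;
end;

begin
  if ParamCount < 3 then
  begin
    WriteLn('usage: ExtractSentences <input> <output> <search string>');
    Halt(1);
  end;
  ExtractMatching(ParamStr(1), ParamStr(2), ParamStr(3));
end.
```

A file with very long stretches between terminators makes Pending grow and be rescanned on every line, so a tuned version would remember how far it has already searched.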
 
Several answers were accepted.
 