3 votes

I'm using TClientDataSet in an application to manage a large amount of data imported from multiple CSV files. This can be a million or more entries in total. I want to be able to delete all the dataset entries associated with a particular CSV file, but deleting large numbers of items is cripplingly slow.

As a test, to work out whether I was doing something stupid, I created a simple console application. All it does is:

  • Create a TClientDataSet instance with 1 field defined (ID):

    CDS := TClientDataSet.Create(nil);
    CDS.FieldDefs.Add('ID', ftInteger);
    CDS.CreateDataSet;
    CDS.LogChanges := False;

  • Append 100,000 items (takes 0.1 seconds):

    for i := 1 to 100000 do
      CDS.AppendRecord([i]);

  • Delete 50,000 items (takes ~4 secs, or ~4.4 secs with LogChanges=TRUE):

    CDS.First;
    while CDS['ID'] <= 50000 do
      CDS.Delete;
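Putting those steps together, the whole test is roughly the following (a sketch rather than my exact program; the unit names are for a recent, namespaced Delphi, and TStopwatch from System.Diagnostics needs Delphi 2010 or later - substitute GetTickCount or similar on older versions):

    program CdsDeleteTest;

    {$APPTYPE CONSOLE}

    uses
      System.SysUtils, System.Diagnostics, Data.DB, Datasnap.DBClient;

    var
      CDS: TClientDataSet;
      SW: TStopwatch;
      i: Integer;
    begin
      CDS := TClientDataSet.Create(nil);
      try
        // one integer field, in-memory dataset, no change log
        CDS.FieldDefs.Add('ID', ftInteger);
        CDS.CreateDataSet;
        CDS.LogChanges := False;

        // append 100,000 records
        SW := TStopwatch.StartNew;
        for i := 1 to 100000 do
          CDS.AppendRecord([i]);
        Writeln(Format('Append: %d ms', [SW.ElapsedMilliseconds]));

        // delete the first 50,000 records one at a time
        SW := TStopwatch.StartNew;
        CDS.First;
        while (not CDS.Eof) and (CDS['ID'] <= 50000) do
          CDS.Delete;
        Writeln(Format('Delete: %d ms', [SW.ElapsedMilliseconds]));
      finally
        CDS.Free;
      end;
    end.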

If I had 1.5M items in my dataset and wanted to remove 0.5M of them, it would take so long to delete them by this method that I haven't even been able to measure it.

As a workaround for now, I'm creating a new dataset, copying all the items I want to keep into it, and then deleting the original. Unless I'm only removing a small percentage of entries from the original dataset, this method is much faster.

Perhaps I am not using the most appropriate method to remove items from the dataset? I am guessing every delete triggers a bunch of internal processing. Is there some method to delete a range of items at once that I am missing? Perhaps I could set an index and a range based on that index, and then delete all the items in the current range in one operation?
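To illustrate the kind of thing I had in mind (just a sketch; TClientDataSet does let you set a range once an index on the field is active, but as far as I can tell each record in the range still has to be deleted individually):

    // narrow the cursor to ID values 1..50000 via an index, then delete
    // everything in the range - still one Delete call per record, though
    CDS.IndexFieldNames := 'ID';
    CDS.SetRange([1], [50000]);
    try
      while not CDS.Eof do
        CDS.Delete;
    finally
      CDS.CancelRange;
      CDS.IndexFieldNames := '';
    end;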

Maybe the problem is with ClientDataSet and not me? Perhaps I need to use a different component. Any suggestions?

Perhaps you forgot to turn off LogChanges. - Ondrej Kelle
Interesting, there should be a difference as maintaining the delta isn't cheap. - Ondrej Kelle
As you mentioned aggregates and indexes, these may be the culprit; try to disable them before the bulk delete operation and re-enable them (if needed) after. (The problem might be that they are updated after every single delete which is completely unnecessary.) - Ondrej Kelle
I think that is expected. Deleting requires moving data in the file. - Sertac Akyuz
Which version of Delphi? Have you tried Andreas Hausladen's MidasSpeedFix? - Ondrej Kelle

1 Answer

1 vote

Thought I might as well provide details of my own (temporary/maybe permanent?) workaround in case anyone with the same problem is interested.

PROBLEM: Deleting a large proportion of records from a large TClientDataSet (100k or more records) with the Delete method takes a very long time compared with the initial time to add the items (a factor of 40 or more).

SOLUTION: Copy all the records you do not wish to delete to a new dataset, then free the original. [Downsides: loss of the change log; extra RAM required during the copy?]

var
 CDS: TClientDataSet;

// initialize new CDS instance
function CreateNewCDSInstance: TClientDataSet;
begin
 Result := TClientDataSet.Create(nil);
 Result.FieldDefs.Add('ID', ftInteger);
 Result.CreateDataSet;
 Result.LogChanges := False;
end;

// close + free CDS instance
procedure CloseCDS;
begin
 CDS.EmptyDataSet;
 CDS.Close;
 CDS.Free;
end;

// delete current record?
function CanDeleteCurrentRecord: boolean;
begin
 Result := CDS['ID'] < 50001; //in this simple example
 // in my application it would be more like:
 // "CDS['FILE_ID'] = AFileIDToDelete"
end;

// delete block of records:
procedure DeleteRecords;
var
 aNewCopy: TClientDataSet;
begin
 aNewCopy := CreateNewCDSInstance;
 CDS.First;
 while not CDS.EoF do
 begin
  if not CanDeleteCurrentRecord then
  begin
   // NB: AppendRecord takes an array of values corresponding to the field definitions
   aNewCopy.AppendRecord([CDS['ID']]);
  end;
  CDS.Next;
 end;
 CloseCDS;
 CDS := aNewCopy;
 //NB: If you have any aggregates/indexes defined, they must be redefined
end;

Using the example given above, this method of removing 50k items takes ~94 ms instead of ~4 secs.
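For reference, in the console test this just means swapping the record-by-record delete loop for a call to DeleteRecords and timing that instead (TStopwatch again assumed, as in the earlier sketch):

  // replace the record-by-record delete loop with the copy-based approach
  SW := TStopwatch.StartNew;
  DeleteRecords;  // copies the records to keep and frees the old instance
  Writeln(Format('Copy-based delete: %d ms', [SW.ElapsedMilliseconds]));
  Writeln(Format('Remaining records: %d', [CDS.RecordCount]));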

However, while asking this question and reading through the comments, it has become obvious to me that this solution is more of a bandage than a cure. The bigger problem is that the system I'm working on isn't really designed to handle the amount of data it needs to. Perhaps it is not so much a "problem with TClientDataSet" as a "problem with how we're using TClientDataSet"! Even with the fix for delete speed, there will still be performance issues as the imported files grow in size and number, and in managing that data down the line.

Perhaps (on a rainy day!) I will start looking at an approach along the lines of SilverWarior's suggestion: use a separate dataset to hold each imported file instead of splurging all the data out into one huge memory table. Then deleting a file just becomes deleting its dataset, among other potential benefits.
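If I do get to it, the rough shape would presumably be a map from file ID to its own dataset, so removing a file is just freeing one dataset rather than deleting records one by one (sketch only; the names here are invented for illustration):

uses
  System.Generics.Collections, Datasnap.DBClient;

var
  // one TClientDataSet per imported CSV file, owned by the dictionary
  FileData: TObjectDictionary<Integer, TClientDataSet>;

procedure InitFileData;
begin
  FileData := TObjectDictionary<Integer, TClientDataSet>.Create([doOwnsValues]);
end;

procedure DeleteFileData(AFileID: Integer);
begin
  // removing the entry frees that file's entire dataset in one go;
  // no per-record Delete calls at all
  FileData.Remove(AFileID);
end;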