Test #120

closed

I/O performance with CbmDigi

Added by Volker Friese over 7 years ago. Updated about 7 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date: 06/01/2015
Due date: 07/01/2015
% Done: 100%
Estimated time: 20.00 h
Spent time:

Description

When using the Sts Digitizer, I noticed that the I/O takes an unusually large amount of time. While the digitizer (the only task in the run) takes about 0.25 s per event, the total event processing time is 2.4 s:

[DEBUG ] StsDigitize: processing event 8 with 8982 StsPoints
[DEBUG ] StsDigitize: 16001 digis sent to DAQ ( from 0.000 ns to 4743.000 ns )
[DEBUG ] StsDigitize: 0 signals in analog buffers
[INFO ] + StsDigitize : step 8, time 0.266125 s, points: 8982, signals: 0 / 8125, digis: 16001
[DEBUG ] Time to execute 1 event: 2.401268 s
[DEBUG ] Used resident memory: 558 MB
[DEBUG ] Used virtual memory: 589 MB

I traced this to the CbmMatch object being a member by pointer of CbmDigi. If I tell ROOT not to stream the CbmMatch* member of CbmDigi (by using //!), the behaviour is more as expected:

[DEBUG ] StsDigitize: processing event 9 with 8839 StsPoints
[DEBUG ] StsDigitize: 15625 digis sent to DAQ ( from 0.000 ns to 3732.000 ns )
[DEBUG ] StsDigitize: 0 signals in analog buffers
[INFO ] + StsDigitize : step 9, time 0.241610 s, points: 8839, signals: 0 / 7930, digis: 15625
[DEBUG ] Time to execute 1 event: 0.295737 s
[DEBUG ] Used resident memory: 564 MB
[DEBUG ] Used virtual memory: 596 MB

So, there seems to be an issue with ROOT streaming an object which has a pointer member to another object. Further observations:

1. If I use no streaming instructions for CbmDigi (i.e., nothing after the declaration of CbmMatch* as a member), the I/O is fast if the CbmMatch* member is NULL. It gets slow when the pointer addresses a valid object. However, in this case I see the match object in the TBrowser, but cannot browse it.

2. If I tell ROOT to use the streamer of CbmMatch (by adding //-> after the declaration in CbmDigi), the data in the output are correct, I can look at them in the TBrowser. However, in this case the I/O is slow, independent of whether the pointer is NULL or valid.

3. Suspecting that this behaviour might be connected to CbmMatch using std::vector, I replaced CbmMatch by a version with the same functionality, but using TList instead of vector. There is no change in the behaviour.

4. The feature is not connected to the CbmMatch in the additional output branch, which in CbmStsDigitize is created for backward compatibility. So, putting a CbmMatch directly in a TClonesArray is unproblematic.

5. If I make CbmMatch a real member of CbmDigi instead of a "member by pointer", the issue disappears: I/O is fast again.

I do think this needs further investigation, maybe an inquiry with ROOT. If the slow I/O when using "member by pointer" is a ROOT feature, we shall have to rethink the CbmDigi design.
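For reference, the member declarations compared in the observations above can be sketched as follows. This is a minimal sketch in plain C++, assuming simplified stand-in classes: `MatchSketch` and the four `Digi...` structs are hypothetical names, not the actual CbmDigi or CbmMatch code. The trailing comments are the ROOT streaming directives mentioned in the text; they only take effect once a ROOT dictionary is generated for the class.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for CbmMatch: a list of link indices.
struct MatchSketch {
  std::vector<std::int32_t> links;
};

// Pointer member excluded from streaming ("//!" directly after the member).
struct DigiTransient {
  double charge = 0.;
  MatchSketch* match = nullptr;  //! transient: not written to file
};

// Pointer member without any directive (observation 1).
struct DigiDefault {
  double charge = 0.;
  MatchSketch* match = nullptr;
};

// Pointer member marked with "//->" (observation 2); the directive
// promises ROOT that the pointer is valid whenever the object is written.
struct DigiAlwaysValid {
  double charge = 0.;
  MatchSketch* match = nullptr;  //-> streamed through the member's own streamer
};

// Match object as a direct member instead of a pointer (observation 5).
struct DigiByValue {
  double charge = 0.;
  MatchSketch match;
};
```

The C++ layout of the three pointer variants is identical; only the comment read by the dictionary generator differs, which is why the effect shows up in I/O and not in the task itself.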

Actions #1

Updated by Volker Friese about 7 years ago

  • Tracker changed from Bug to Test
  • Due date changed from 03/26/2015 to 07/01/2015
  • Status changed from New to Scheduled
  • Target version set to NOV15
  • Start date changed from 02/11/2015 to 06/01/2015

The best way to check would be to create a small example with two classes: mother and daughter. Two cases shall be compared w.r.t. I/O speed: when the daughter is a member of the mother, and when the mother has a pointer member to the daughter.

Actions #2

Updated by Volker Friese about 7 years ago

  • Estimated time changed from 10.00 h to 20.00 h

Actions #3

Updated by Volker Friese about 7 years ago

  • Status changed from Scheduled to In Progress
  • % Done changed from 0 to 50

The investigation proposed above was done in the development branch friese. The daughter data class, CbmStsDaughter, contains just three doubles as members. For the mother classes, three different versions were investigated:
  • CbmStsMother1: just six Double_t members (as reference);
  • CbmStsMother2: three Double_t members, plus a member of class CbmStsDaughter;
  • CbmStsMother3: same as Mother2, but the member is of type CbmStsDaughter*.

The task to fill these objects with data is CbmStsTestIo. In each event, six Double_t are randomly generated 10,000 times and filled into CbmStsMotherX and CbmStsDaughter, respectively. The measured quantities are the time of the Exec() of the task class, the event time obtained from FairRunInfo (which includes I/O), the total run time, and the output file size. 100 such events were generated.
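The layout of the three mother classes and the fill step can be sketched in plain C++ as below. This is a sketch under assumptions, not the code in the friese branch: all struct and function names are hypothetical, `double` stands in for `Double_t`, and `std::unique_ptr` replaces the raw `CbmStsDaughter*` to keep ownership simple. It only times the fill step (the analogue of the task time) and does not involve ROOT I/O.

```cpp
#include <chrono>
#include <cstddef>
#include <memory>
#include <random>
#include <vector>

// Hypothetical stand-ins for CbmStsDaughter and CbmStsMother1..3.
struct DaughterSketch { double a = 0., b = 0., c = 0.; };
struct Mother1Sketch { double x[6] = {}; };                      // six doubles (reference)
struct Mother2Sketch { double x[3] = {}; DaughterSketch d; };    // daughter as direct member
struct Mother3Sketch { double x[3] = {}; std::unique_ptr<DaughterSketch> d; };  // by pointer

// Fill n mothers that hold the daughter by value.
template <typename Rng>
std::vector<Mother2Sketch> fillByValue(std::size_t n, Rng& rng) {
  std::uniform_real_distribution<double> u(0., 1.);
  std::vector<Mother2Sketch> out(n);
  for (auto& m : out) {
    for (double& v : m.x) v = u(rng);
    m.d.a = u(rng);
    m.d.b = u(rng);
    m.d.c = u(rng);
  }
  return out;
}

// Fill n mothers that hold the daughter by pointer; each one needs
// an extra heap allocation for its daughter.
template <typename Rng>
std::vector<Mother3Sketch> fillByPointer(std::size_t n, Rng& rng) {
  std::uniform_real_distribution<double> u(0., 1.);
  std::vector<Mother3Sketch> out(n);
  for (auto& m : out) {
    for (double& v : m.x) v = u(rng);
    m.d = std::make_unique<DaughterSketch>();
    m.d->a = u(rng);
    m.d->b = u(rng);
    m.d->c = u(rng);
  }
  return out;
}

// Wall-clock seconds spent in f().
template <typename F>
double secondsFor(F&& f) {
  const auto t0 = std::chrono::steady_clock::now();
  f();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}
```

A call such as `secondsFor([&] { fillByPointer(10000, rng); })` with a `std::mt19937 rng` gives a per-event fill time that can be compared between the two variants.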

Results:
  • Mother1: Task 0.0036s, Event 0.03s, Run 7.53s, File 31 MB
  • Mother2: Task 0.0039s, Event 0.03s, Run 7.81s, File 31 MB
  • Mother3: Task 0.0077s, Event 0.045s, Run 9.44s, File 33 MB

Judgement: There is little difference whether the variables are stored directly in the mother class or through the daughter member (1/2). When using a pointer to the daughter (3), the task execution time doubles, which is connected to the instantiation of the daughter class within the Exec function. The I/O time also increases by some 50%; obviously the streaming of a member by pointer is somehow more expensive.
There is no measurable difference whether the streamer of the daughter class is used (//-> after member declaration) or not. Of course, since no streamer was defined by the user, in both cases the automatic ROOT streamers are used.

Now, the performance was checked in the case when there is no daughter data. This would correspond to the case of digis from real data (no link to MC). For all of these, the class CbmStsMother3 was used. Three cases were studied:
  • Case 1: Streaming of the daughter is deactivated in the code (//! after member declaration)
  • Case 2: A NULL pointer is given as argument to the mother constructor
  • Case 3: Like case 2, but with native daughter streamer (//->)
The performances for these three cases should be compared with "Mother3" above.
Results:
  • Case1: Task 0.0065s, Event 0.022s, Run 7.11s, File 16 MB
  • Case2: Task 0.0050s, Event 0.025s, Run 7.26s, File 16 MB
  • Case3: Task 0.0072s, Event 0.03s, Run 8.59s, File 16 MB

From these numbers, no big differences are seen.

There is, however, a difference between case 2 and case 3 when reading back data from the output file. Using the daughter streamer (case 3), valid pointers to daughter objects are obtained with null values in the daughter members. Obviously, daughter objects created by the default constructor are streamed in this case. In case 2, this is different: NULL pointers are read back, which is the behaviour we would like to have.
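The read-back difference between case 2 and case 3 can be illustrated with a hand-written serializer. This is only an analogy under stated assumptions, not ROOT's streamer: all names are hypothetical, scheme A writes a presence flag before the daughter, and scheme B always writes a daughter, substituting a default-constructed one for a NULL pointer (which mirrors the interpretation given above).

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

struct DaughterSketch { double a = 0., b = 0., c = 0.; };
using Buffer = std::vector<unsigned char>;

void putDouble(Buffer& buf, double v) {
  unsigned char raw[sizeof(double)];
  std::memcpy(raw, &v, sizeof v);
  buf.insert(buf.end(), raw, raw + sizeof v);
}

double getDouble(const Buffer& buf, std::size_t& pos) {
  double v;
  std::memcpy(&v, buf.data() + pos, sizeof v);
  pos += sizeof v;
  return v;
}

void putDaughter(Buffer& buf, const DaughterSketch& d) {
  putDouble(buf, d.a);
  putDouble(buf, d.b);
  putDouble(buf, d.c);
}

DaughterSketch getDaughter(const Buffer& buf, std::size_t& pos) {
  DaughterSketch d;
  d.a = getDouble(buf, pos);
  d.b = getDouble(buf, pos);
  d.c = getDouble(buf, pos);
  return d;
}

// Scheme A (analogous to case 2): a flag records whether a daughter exists.
void writeOptional(Buffer& buf, const DaughterSketch* d) {
  buf.push_back(d ? 1 : 0);
  if (d) putDaughter(buf, *d);
}

std::unique_ptr<DaughterSketch> readOptional(const Buffer& buf, std::size_t& pos) {
  if (buf[pos++] == 0) return nullptr;  // NULL in, NULL out
  return std::make_unique<DaughterSketch>(getDaughter(buf, pos));
}

// Scheme B (analogous to case 3): a daughter is always written and always read.
void writeAlways(Buffer& buf, const DaughterSketch* d) {
  putDaughter(buf, d ? *d : DaughterSketch{});
}

std::unique_ptr<DaughterSketch> readAlways(const Buffer& buf, std::size_t& pos) {
  return std::make_unique<DaughterSketch>(getDaughter(buf, pos));
}
```

With scheme A a NULL pointer survives the round trip; with scheme B the reader cannot tell "no daughter" from "daughter with all members zero", which is the unwanted behaviour described for case 3.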

Summary of this investigation:
With simple data classes, the concept of carrying a daughter object by pointer can be regarded as validated. ROOT I/O works with this concept. No big performance issues between direct membership and membership by pointer are observed. Using the //-> streaming directive for the daughter object must be avoided in order to have correct data in the output file.

The observation with CbmStsDigi above is thus not reproduced with these simple data classes. Possible reasons are:
  • While CbmStsDigi (mother) is a simple class, CbmMatch (daughter) is not, since it has a std::vector as member.
  • Different ROOT versions (for this investigation, fairsoft mar15p2 was used, while the observation with the StsDigitizer was made with jul14p3).

Actions #4

Updated by Volker Friese about 7 years ago

  • % Done changed from 50 to 80

Investigation continued, now with CbmMatch as daughter (instead of the simple class CbmStsDaughter). Mother4: with CbmMatch as member, mother5: with CbmMatch* as member. For each mother object, one CbmMatch is created with two CbmLinks, filled with random values.

Results:

mother4, match object is not filled (empty match)
Task: 0.012s, event: 0.033s, run: 6.91s, file: 16.6 MB

mother4, match object is filled with two links
Task: 0.016s, event: 0.067s, run: 10.92s, file: 43.2 MB

mother 5, NULL pointer to match
Task: 0.015s, event: 0.033s, run: 7.33s, file: 16.4 MB

mother 5, with valid pointer to filled match object
Task: 0.015s, event: 0.23s, run: 26.5s, file: 46.8 MB

Assessment: When there is no match object (empty in mother 4, NULL pointer in mother5), there is no difference between having it as member or by pointer. In case there is a filled match object, there is a strong I/O performance penalty when it is member by pointer (a factor of 3 here). This must be connected to the ROOT streamer of the mother class, which does not handle the std::vector member of CbmMatch very efficiently.

Actions #5

Updated by Volker Friese about 7 years ago

Discussed in the software meeting of 25 June 2015. Proposal: Use an alternative to CbmMatch using ROOT containers instead of std::vector. Provide simple example without FairRoot framework for reporting to ROOT.

Remark: according to the original posting, a CbmMatch with TList instead of vector was already tested, without change in the performance. Will try to reproduce that.

Actions #6

Updated by Volker Friese about 7 years ago

  • % Done changed from 80 to 90

I introduced an alternative version of CbmMatch, using TList instead of vector. The class is called CbmMatch2 (development/friese/cbmdata). I compared: CbmStsMother4 (with member CbmMatch), CbmStsMother5 (with member CbmMatch*), CbmStsMother6 (with member CbmMatch2), CbmStsMother7 (with member CbmMatch2*). Other conditions as in the tests above.

Results:

CbmMatch   - task 0.016s, event 0.050s, run  9.35s
CbmMatch*  - task 0.014s, event 0.214s, run 24.96s
CbmMatch2  - task 0.027s, event 0.095s, run 13.16s
CbmMatch2* - task 0.030s, event 0.104s, run 14.18s

I checked in addition that the difference between task time and event time is due to I/O (no difference between the two when making the array non-persistent), so it is not somewhere else in FairRoot.

Assessment:
Comparing CbmMatch and CbmMatch2, it is clear that a std::vector is more performant than a TList (almost a factor of 2, both in task and event time).
However, the strong performance penalty when using the Match object by pointer instead of as direct member (more than a factor of 4 from CbmMatch to CbmMatch*) almost vanishes for CbmMatch2, i.e. when using TList instead of std::vector. Our suspicion that the automatic ROOT streamer handles ROOT containers much more efficiently than STL containers seems justified.
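The factors quoted in this assessment follow from the table above. The small sketch below redoes the arithmetic, taking the I/O time as event time minus task time (as checked above with the non-persistent array). The struct and function names are hypothetical; the numbers are copied from the results.

```cpp
#include <cstdio>

struct Timing {
  const char* label;
  double task;   // seconds per event spent in the task
  double event;  // seconds per event in total
};

// I/O time estimated as event time minus task time.
constexpr double ioTime(const Timing& t) { return t.event - t.task; }

// Penalty factor of the by-pointer variant relative to the by-value one.
constexpr double penalty(double byPointer, double byValue) {
  return byPointer / byValue;
}

// Values taken from the results above.
constexpr Timing kMatchValue{"CbmMatch", 0.016, 0.050};
constexpr Timing kMatchPtr{"CbmMatch*", 0.014, 0.214};
constexpr Timing kMatch2Value{"CbmMatch2", 0.027, 0.095};
constexpr Timing kMatch2Ptr{"CbmMatch2*", 0.030, 0.104};

void printPenalties() {
  std::printf("std::vector: event x%.2f, I/O x%.2f\n",
              penalty(kMatchPtr.event, kMatchValue.event),
              penalty(ioTime(kMatchPtr), ioTime(kMatchValue)));
  std::printf("TList:       event x%.2f, I/O x%.2f\n",
              penalty(kMatch2Ptr.event, kMatch2Value.event),
              penalty(ioTime(kMatch2Ptr), ioTime(kMatch2Value)));
}
```

In event time this gives a pointer penalty of about 4.3 for CbmMatch and about 1.1 for CbmMatch2; in I/O time alone the corresponding factors are about 5.9 and 1.1.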

Actions #7

Updated by Volker Friese about 7 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 90 to 100

I checked again with the full STS digitizer (CbmStsDigitize). I now see an I/O performance penalty of about a factor of two between streaming / not streaming the CbmMatch* member of CbmStsDigi:

With match: Task: 0.26s/event; Event: 0.70s;  -> I/O 0.54s/event
W/O match:  Task: 0.26s/event; Event: 0.38s;  -> I/O 0.12s/event

This is in line with the finding from the toy example above of a penalty factor of about four for the I/O.
However, the total performance loss amounts to only a factor of two, because the task itself now consumes considerable time. This seems acceptable.

The difference to the original observation is unclear. A probable explanation is the different ROOT version; there probably was an improvement in the ROOT automatic streamers.

Conclusion: For the time being, we stick to the current implementation with CbmMatch* being a member of CbmDigi. In the long run, I will investigate the possibility of getting rid of it altogether; maybe a std::pair<CbmStsDigi*, CbmMatch*> can be transported through the DAQ and then be split into separate arrays for the final I/O, either in the event-based TClonesArray or in the time-based vector in CbmTimeSlice.
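The pair-based transport idea could look roughly like the sketch below. It is a sketch under assumptions only: `DigiSketch`, `MatchSketch`, `SplitArrays` and `splitForOutput` are hypothetical names, `std::unique_ptr` stands in for the raw pointers, and plain `std::vector` stands in for the output arrays. A missing match (real data without MC link) is stored as an empty match so that both arrays stay index-aligned; that is one possible choice, not a decided design.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for CbmStsDigi and CbmMatch.
struct DigiSketch { double time = 0.; double charge = 0.; };
struct MatchSketch { std::vector<int> links; };

// What travels through the DAQ: a digi together with its optional match.
using DigiWithMatch =
    std::pair<std::unique_ptr<DigiSketch>, std::unique_ptr<MatchSketch>>;

// What is written out: two parallel arrays, matches[i] belongs to digis[i].
struct SplitArrays {
  std::vector<DigiSketch> digis;
  std::vector<MatchSketch> matches;
};

// Split the transported pairs into separate arrays and empty the input.
SplitArrays splitForOutput(std::vector<DigiWithMatch>& stream) {
  SplitArrays out;
  out.digis.reserve(stream.size());
  out.matches.reserve(stream.size());
  for (auto& entry : stream) {
    if (!entry.first) continue;  // skip entries that carry no digi
    out.digis.push_back(*entry.first);
    // A NULL match becomes an empty match to keep the indices aligned.
    out.matches.push_back(entry.second ? std::move(*entry.second)
                                       : MatchSketch{});
  }
  stream.clear();
  return out;
}
```

With this layout the digi class itself carries no pointer member, so the digi array is streamed as a flat data class and the match array can be dropped from the output when it is not needed.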
