Mysteries of the NET Framework: Question 2

Last post 10-27-2008, 2:41 AM by Tormod. 11 replies.
  •  09-23-2008, 3:38 AM Post number 69670

    Mysteries of the NET Framework: Question 2

    What is the best approach to doing fast complex string manipulation in .NET?
    (See the article Mysteries of The Net Framework)
  •  09-24-2008, 4:50 AM Post number 69693 in reply to post number 69670

    Re: Mysteries of the NET Framework: Question 2

    How complex is the string manipulation being considered?

    How big is the final string? (What if it's > 200 MB?) Does this make a difference?

    What is meant by fastest? Is this time and/or resources?

  •  09-30-2008, 10:28 AM Post number 69780 in reply to post number 69693

    Re: Mysteries of the NET Framework: Question 2

    We've been deliberately vague with this question: we'd like to see answers discussing different types of string manipulation. One way to answer this question might be to discuss a specific form of string manipulation you consider 'complex', in whatever context you want to choose.

    Certainly, once the size of a string passes a certain limit, any kind of manipulation can become 'complex': fragmentation in the large object heap can become an issue on 32-bit systems, and there is an upper limit on the size of a string that can realistically be stored in memory.

    Size isn't the only thing that can make string manipulation complex, though: another example might be matching braces in a text editing application. An off-the-shelf parser might help a bit but isn't the whole answer, as the text will usually contain invalid syntax while the user is editing it.

    There are many other possible operations on strings that could be discussed, and different approaches may be more appropriate depending on the context.
  •  10-01-2008, 6:55 PM Post number 69814 in reply to post number 69670

    Re: Mysteries of the NET Framework: Question 2

    Primarily, do what you can with LINQ for sorting and filtering strings from arrays, collections, lists, etc. You can use it on individual characters in strings as well to get really wild.

    Use string builders (no brainer)

    Use culture when formatting strings - preferably invariant if an application is directed to one specific language, or if working with a string of data that does not change between different cultures. Use a resource file to store strings that do not change except for when translated to other languages.

    Use string.Compare to determine if one string is greater than, equal to, or less than another, and specify the string comparison, with preference for StringComparison.Ordinal. If case cannot be enforced, use OrdinalIgnoreCase.

    Use string.IsNullOrEmpty(foo) when checking for null, empty, or both. The method is quicker than checking directly against length or null.

    Use regular expressions to find or replace complex patterns
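
    A minimal sketch of a few of the tips above, for anyone who wants something concrete. The class and method names (StringTips, JoinWithBuilder, SameKey, FormatPrice, IsMissing) are made up for illustration; only the framework calls are real.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class StringTips
{
    // Build a delimited line with StringBuilder instead of repeated "+=".
    public static string JoinWithBuilder(string[] parts, char separator)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.Length; i++)
        {
            if (i > 0) sb.Append(separator);
            sb.Append(parts[i]);
        }
        return sb.ToString();
    }

    // Ordinal comparison ignores culture rules; here it also ignores case.
    public static bool SameKey(string a, string b)
    {
        return string.Compare(a, b, StringComparison.OrdinalIgnoreCase) == 0;
    }

    // Invariant culture keeps the output identical on every machine.
    public static string FormatPrice(decimal price)
    {
        return price.ToString("0.00", CultureInfo.InvariantCulture);
    }

    // One call covers both the null case and the empty case.
    public static bool IsMissing(string value)
    {
        return string.IsNullOrEmpty(value);
    }
}
```

    For example, StringTips.FormatPrice(1234.5m) gives "1234.50" regardless of the machine's regional settings.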

     

  •  10-05-2008, 6:53 PM Post number 69923 in reply to post number 69670

    • mrhassell

    Re: Mysteries of the NET Framework: Question 2

    StringBuilder. Very straightforward.
  •  10-05-2008, 11:10 PM Post number 69928 in reply to post number 69923

    • Damon

    Re: Mysteries of the NET Framework: Question 2

    Whenever I'm doing token replacement in a string (for example, an email that needs to have fields injected) I load the string into a StringBuilder, then use a regular expression to locate all of the tokens in the string. I only use one type of token (e.g. $TokenName$ where TokenName can be anything) because then you only need to run it through the Regex engine once. The actual token name gets saved in the regex match information so you can run the appropriate replacement.

    Then I iterate backwards over all the matches and remove the matched text using the Remove method on the StringBuilder, and insert whatever I want to replace it with using the Insert method.  My hope (I've never actually checked) is that these methods are implemented as memory move operations that should be relatively quick.

    I go through the matches backwards to ensure the match positions are retained (if you do the first one first, the positions on all the rest of the matches are going to be off unless you manually account for the length of the replacement text).  It works quite well. 
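
    A rough sketch of the backwards-iteration approach described above. The class name, the dictionary of values, the characters allowed in a token name, and the choice to leave unknown tokens untouched are all assumptions for illustration, not taken from the linked article.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TokenReplacer
{
    // Matches $TokenName$; the name is captured in the "name" group.
    // The allowed name characters are an assumption.
    private static readonly Regex TokenPattern =
        new Regex(@"\$(?<name>[A-Za-z0-9_.]+)\$");

    public static string Replace(string template, IDictionary<string, string> values)
    {
        StringBuilder sb = new StringBuilder(template);
        MatchCollection matches = TokenPattern.Matches(template);

        // Walk the matches from last to first so the indexes of the
        // earlier matches are still valid after each edit.
        for (int i = matches.Count - 1; i >= 0; i--)
        {
            Match m = matches[i];
            string replacement;
            if (!values.TryGetValue(m.Groups["name"].Value, out replacement))
            {
                continue; // Unknown token: leave it in place.
            }
            sb.Remove(m.Index, m.Length);
            sb.Insert(m.Index, replacement);
        }
        return sb.ToString();
    }
}
```

    The regex runs once over the original template; every edit after that happens on the StringBuilder.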

    Shameless plug follows -- Here's the article link with more detailed info --

    http://www.simple-talk.com/dotnet/asp.net/regular-expression-based-token-replacement-in-asp.net/

  •  10-15-2008, 1:45 PM Post number 70026 in reply to post number 69928

    Re: Mysteries of the NET Framework: Question 2

    Damon,

    For token replacements, I tend to lean towards string formatting with resource files.

    string email = string.Format(CultureInfo.InvariantCulture, Resource.WelcomeEmailTemplate, name);

    The problem, though, is that the token "{0}" doesn't really help describe what kind of data will be placed in that location. It is assumed that whoever is working with it already knows what goes where, and in what order, when formatting the string. Formatting JavaScript gets pretty tricky too, since you need to escape the "{" and "}" characters.
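
    To illustrate the escaping point, here is a small sketch. The class name, method name, and the JavaScript snippet are invented for the example; the rule itself (double the braces to get literal ones) is how string.Format works.

```csharp
using System;
using System.Globalization;

public static class FormatExamples
{
    // Literal braces are written as {{ and }}; {0} is the only placeholder.
    public static string BuildScript(string userName)
    {
        return string.Format(
            CultureInfo.InvariantCulture,
            "function greet() {{ alert('Hello, {0}'); }}",
            userName);
    }
}
```

    BuildScript("Ada") produces function greet() { alert('Hello, Ada'); } with single braces in the output.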

  •  10-16-2008, 8:27 AM Post number 70039 in reply to post number 69670

    • dmajkic

    Re: Mysteries of the NET Framework: Question 2

    The problem with string speed is that strings get copied around on manipulation.

    So the first thing we need to do is to destringify strings. You should stop there and think twice, as whatever you do next will make your code less readable.

    The second step needs more info on the kind of manipulation. Are you replacing or building?  If your manipulation is complex search and replace, then you should go with regex. It was invented for that. On the building side there is the StringBuilder class.

    But, let me guess that by "complex" you mean both, and then your clean dotnet answer is to use StringBuilder and Regex.  

    If you still insist on "fast", you can go to the dark side - unmanaged code.  Isolate the complex string manipulation in a pure unmanaged native C/C++ DLL (this sounds really dark nowadays). Then just call it from dotnet. Those dark compilers can treat your strings as pure bytes and words and can still use regex and other string stuff. Again, note that your assemblies using unmanaged code will be screamed at by dotnet body snatchers.

     

  •  10-16-2008, 2:15 PM Post number 70040 in reply to post number 70026

    • Damon

    Re: Mysteries of the NET Framework: Question 2

    I probably should have clarified a little more.  For simple string replacement I use string.Format with the {0} {1} syntax.  But a lot of the stuff I end up building needs to be human readable because Business Analysts are the ones dealing with it, and I find the tokens you can use with Regex to be more user-friendly.  We also have replacements that are based on the token itself, so the token

    $Object.PropertyName$

    ends up being replaced by the value of that property on an object that's available at the time when you call the replacement.  This is pretty handy because you can make quick updates to a string in a database or config file and have the appropriate values appear without having to recompile anything, which would be pretty tough to do with string.Format because you would have to predetermine everything that could possibly be in there and remember which {n} token represents each value.  I'm currently using an interface on the objects passed into the replacement function to get property values without reflection, and if the property value cannot be found through that interface we fall back to reflection (which is not very speedy in the grand scheme of things).

    But if you know everything that needs to be in your string, don't care about the {n} token syntax, and it's not going to change (or at least not on the fly), then I certainly think string.Format is a great built-in option.
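
    A hedged sketch of the interface-first, reflection-second lookup described above. ITokenValueProvider and PropertyTokenReplacer are hypothetical names, the token prefix is taken literally as "Object", and the sketch uses Regex.Replace with a MatchEvaluator rather than the StringBuilder technique from the earlier post, purely to keep it short.

```csharp
using System;
using System.Reflection;
using System.Text.RegularExpressions;

// Hypothetical interface: objects that can answer property lookups directly.
public interface ITokenValueProvider
{
    bool TryGetTokenValue(string propertyName, out string value);
}

public static class PropertyTokenReplacer
{
    // Matches $Object.PropertyName$ and captures the property name.
    private static readonly Regex Pattern = new Regex(@"\$Object\.(?<prop>\w+)\$");

    public static string Replace(string template, object source)
    {
        return Pattern.Replace(template, delegate(Match m)
        {
            string value;
            // Unresolvable tokens are left exactly as written.
            return TryResolve(source, m.Groups["prop"].Value, out value) ? value : m.Value;
        });
    }

    private static bool TryResolve(object source, string propertyName, out string value)
    {
        ITokenValueProvider provider = source as ITokenValueProvider;
        if (provider != null && provider.TryGetTokenValue(propertyName, out value))
        {
            return true; // Fast path: no reflection needed.
        }

        // Slow path: look the property up by name through reflection.
        PropertyInfo property = source.GetType().GetProperty(propertyName);
        if (property == null)
        {
            value = null;
            return false;
        }
        object raw = property.GetValue(source, null);
        value = raw == null ? string.Empty : raw.ToString();
        return true;
    }
}
```

    Any object works as the source: one that implements the interface takes the fast path, and anything else goes through reflection.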

  •  10-16-2008, 8:52 PM Post number 70047 in reply to post number 70040

    Re: Mysteries of the NET Framework: Question 2

    As with my other posts to these questions...."It depends"...

    There have certainly been some excellent suggestions for specific situations.

    Recently I was working on a project that involved very complex manipulation of text (size ranging from 25K to 500K characters). This was in preparation for publication of the raw information.

    After evaluating many of the techniques (StringBuilder, String.Format, String.Replace(), etc) we developed a solution where we used a linked list of "Segment Descriptors" which was a very simple class of the form below:

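    class SegmentDescriptor
    {
         public int StringID;    // which source string the segment points into
         public int StartIndex;  // first character of the segment (inclusive)
         public int EndIndex;    // last character of the segment (inclusive)
    }

    Initially the content was: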

    {1, 0, 50000}

    Removing a portion of the content (locations 100 through 125):

    {1, 0, 99}
    {1, 126, 50000}        

    Inserting additional content (a 30-character string) at offset 200:

    {1, 0, 99}
    {1, 126, 200}
    {2, 0, 29}
    {1, 201, 50000}

    etc...

    The advantage of this approach was that the character data itself was rarely if ever "moved" in memory. Nor were additional copies made (potentially inducing significant GC).

    After all of the work was done, the total file size was calculated and a StringBuilder initialized with the appropriate capacity, the individual segments were then appended into the StringBuilder, and the final string extracted [via ToString()]
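
    For readers who want to see the idea in code, here is a much-simplified sketch of a segment list. It is not the library described above: SegmentedText and its members are invented names, and offsets here are positions in the current (edited) text, which is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public sealed class SegmentedText
{
    // Mirrors the descriptor above: a source string and an inclusive range.
    private sealed class Segment
    {
        public int StringId;
        public int StartIndex;
        public int EndIndex;
        public int Length { get { return EndIndex - StartIndex + 1; } }
    }

    private readonly List<string> sources = new List<string>();
    private readonly LinkedList<Segment> segments = new LinkedList<Segment>();

    public SegmentedText(string original)
    {
        AddSource(original, null);
    }

    // Registers a source string and links one segment covering all of it.
    private void AddSource(string text, LinkedListNode<Segment> before)
    {
        if (text.Length == 0) return;
        sources.Add(text);
        Segment s = new Segment { StringId = sources.Count - 1, StartIndex = 0, EndIndex = text.Length - 1 };
        if (before == null) segments.AddLast(s); else segments.AddBefore(before, s);
    }

    // Ensures a segment boundary at 'offset'; returns the node that starts there (null = end).
    private LinkedListNode<Segment> SplitAt(int offset)
    {
        int position = 0;
        for (LinkedListNode<Segment> node = segments.First; node != null; node = node.Next)
        {
            if (offset == position) return node;
            int length = node.Value.Length;
            if (offset < position + length)
            {
                int cut = node.Value.StartIndex + (offset - position);
                Segment tail = new Segment { StringId = node.Value.StringId, StartIndex = cut, EndIndex = node.Value.EndIndex };
                node.Value.EndIndex = cut - 1;
                return segments.AddAfter(node, tail);
            }
            position += length;
        }
        if (offset == position) return null;
        throw new ArgumentOutOfRangeException("offset");
    }

    // Only descriptors change; no character data is copied.
    public void Insert(int offset, string text)
    {
        AddSource(text, SplitAt(offset));
    }

    public void Remove(int offset, int count)
    {
        LinkedListNode<Segment> node = SplitAt(offset);
        LinkedListNode<Segment> stop = SplitAt(offset + count);
        while (node != stop)
        {
            LinkedListNode<Segment> next = node.Next;
            segments.Remove(node);
            node = next;
        }
    }

    // Single pass at the end: size the StringBuilder once, then append each segment.
    public override string ToString()
    {
        int total = 0;
        foreach (Segment s in segments) total += s.Length;
        StringBuilder sb = new StringBuilder(total);
        foreach (Segment s in segments) sb.Append(sources[s.StringId], s.StartIndex, s.Length);
        return sb.ToString();
    }
}
```

    Starting from "0123456789", Remove(2, 3) followed by Insert(4, "abc") yields "0156abc789", and the characters are only copied once, in ToString().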

    For this particular application this provided significant performance improvements (in rare cases over an order of magnitude). The primary downside was that there was a fairly complex development process to create and validate the class library. Because this process was going to be run on thousands of documents on a regular basis, the amortized cost of the development compared to the improved response time of the application was a "win" situation.

     

     

     

     

     

  •  10-25-2008, 4:56 AM Post number 70193 in reply to post number 69670

    Re: Mysteries of the NET Framework: Question 2

    I would push the fragments of such a string into a collection and use string.Join to connect them later. (This may perform about the same as using StringBuilder?)
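
    A minimal sketch of that idea, with made-up names (FragmentJoiner, Build). Whether it beats StringBuilder in a given case is something to measure rather than assume.

```csharp
using System;
using System.Collections.Generic;

public static class FragmentJoiner
{
    // Collect the fragments first, then concatenate them in one call.
    public static string Build(IEnumerable<string> fragments)
    {
        List<string> parts = new List<string>(fragments);
        return string.Join(string.Empty, parts.ToArray());
    }
}
```

    Passing string.Empty as the separator gives plain concatenation; any other separator is inserted between fragments.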
  •  10-27-2008, 2:41 AM Post number 70205 in reply to post number 69670

    Re: Mysteries of the NET Framework: Question 2

    I think that a good performance profiler is essential. The performance snapshot feature in ANTS will be very useful.

    Strings are, for many reasons (encodings, streams, immutability, memory access and footprint, localization, framework support), very hard to get right on the first try. Not only do you need to know the penalty in the first place, but you also need to know whether, given how it is used, something will become a scaling issue.

    I freely admit that chances are that I would have to read up on whatever was hogging the resources and Reflector-dive into it before optimizing it.

    Chances are you'll be over-engineering something (the wrong "something") if you can't work incrementally.

    A performance tool is your friend.
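
    When a profiler isn't at hand, a crude Stopwatch comparison can at least show the order of magnitude. This is only a sketch with invented names (ConcatTiming, WithConcat, WithBuilder, Time); it is no substitute for a real profiler, and single-run timings are noisy.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ConcatTiming
{
    // Repeated concatenation: each "+=" allocates a new string.
    public static string WithConcat(int count)
    {
        string s = string.Empty;
        for (int i = 0; i < count; i++) s += "x";
        return s;
    }

    // Same result built in a pre-sized StringBuilder.
    public static string WithBuilder(int count)
    {
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) sb.Append('x');
        return sb.ToString();
    }

    // Returns the elapsed time for one run of the given builder.
    public static TimeSpan Time(Func<int, string> build, int count)
    {
        Stopwatch watch = Stopwatch.StartNew();
        build(count);
        watch.Stop();
        return watch.Elapsed;
    }
}
```

    Call Time(ConcatTiming.WithConcat, n) and Time(ConcatTiming.WithBuilder, n) for a few values of n, and repeat each measurement several times before drawing conclusions.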
