Performance Tuning – String Sorting
Performance Tuning – “Know What You Are Sorting”
You probably know that our Smalltalk products support CLDR-based internationalization, which is very useful for building applications that are deployed around the world and must work in many languages.
Unicode, the basis of internationalization frameworks, makes sorting significantly more expensive computationally.
Cincom engineers are working on techniques to improve the performance of full Unicode sorting.
Even if you don’t use the internationalization capabilities, you will probably want to know about this for potential performance tuning. Knowledge is power!
StringCollationPolicy is a class with the following comment:
“There are several collation algorithms available:
Fastest – Strings are sorted based on the values of the characters. No intelligent case folding is done, so that for example, A < B < a < b. This is very fast, but not usually what a user would expect to see.
Fast – The default collation algorithm for previous versions of Cincom® VisualWorks®. Case folding is done, but the algorithm is otherwise relatively primitive.
Unicode, low-priority punctuation – Unicode-compatible collation. White space and punctuation characters are ignored unless the strings cannot be distinguished based on letters, accents and upper-/lowercase.
Unicode, high-priority punctuation – Unicode-compatible collation. White space and punctuation characters are treated as first-class characters, with more influence over collation than distinctions like accents and upper-/lowercase.”
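To see the difference between the first two algorithms described in the comment, you can sort a small mixed-case sample under each one. This is only a sketch: the sample words are made up, and it uses nothing beyond the collationAlgorithm: setter and asSortedCollection that appear in the test later in this post.

```smalltalk
"Sketch: compare how #Fastest and #Fast order a small mixed-case sample.
 The words are invented for illustration only."
| words |
words := #('banana' 'Apple' 'apple' 'Banana').

"Plain character-value ordering."
StringCollationPolicy collationAlgorithm: #Fastest.
Transcript cr; show: words asSortedCollection asArray printString.

"Case-folded ordering."
StringCollationPolicy collationAlgorithm: #Fast.
Transcript cr; show: words asSortedCollection asArray printString.
```

Going by the class comment, #Fastest should place the capitalized words ahead of the lowercase ones, while #Fast should keep the two spellings of each word next to each other. The setter appears to change an image-wide setting, so you may want to set your preferred algorithm again afterward.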
Let’s run a simple test, sorting strings from a text file:
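"Read the file into a collection, one string per line."
lines := OrderedCollection new.
file := 'FAQ.txt'.
stream := file asFilename readStream.
[stream atEnd] whileFalse: [lines add: (stream upTo: Character cr)].
stream close.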
StringCollationPolicy collationAlgorithm: #UnicodeWithPunctuation. Transcript cr; show: [lines asSortedCollection] timeToRun printString.
StringCollationPolicy collationAlgorithm: #UnicodeNormal. Transcript cr; show: [lines asSortedCollection] timeToRun printString.
StringCollationPolicy collationAlgorithm: #Fast. Transcript cr; show: [lines asSortedCollection] timeToRun printString.
StringCollationPolicy collationAlgorithm: #Fastest. Transcript cr; show: [lines asSortedCollection] timeToRun printString.
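Results, in the order the tests were run:
- #UnicodeWithPunctuation: 100.806 milliseconds
- #UnicodeNormal: 80.348 milliseconds
- #Fast: 4.105 milliseconds
- #Fastest: 4.15 milliseconds
In this test, the two Unicode algorithms were roughly 20 to 25 times slower than #Fast and #Fastest, which were effectively tied.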
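The four timing statements in the test differ only in the algorithm symbol, so they can be folded into one loop that labels each timing with the algorithm that produced it. This is a minimal sketch, not a rigorous benchmark: it assumes lines already holds the strings read from the file, and it relies only on the messages used in the test.

```smalltalk
"Sketch: time each collation algorithm in turn and label the output.
 Assumes 'lines' already holds the strings read from the file."
#(#UnicodeWithPunctuation #UnicodeNormal #Fast #Fastest) do: [:algorithm |
    | elapsed |
    StringCollationPolicy collationAlgorithm: algorithm.
    elapsed := [lines asSortedCollection] timeToRun.
    Transcript cr; show: algorithm printString; tab; show: elapsed printString]
```

A single run on a small file is noisy, so differences of a few hundredths of a millisecond are not meaningful. For steadier numbers, you could repeat the sort several times inside the timed block.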
Good coding to you! –Arden Thomas