Comparing plain strings vs comparing codeUnits

warcayac · July 21, 2022, 3:29pm

You are using codeUnits method to compare strings in your solution to Challenge 2 of Chapter 4, why not to directly compare strings? performance issues?

suragch · July 22, 2022, 8:38pm

In order to check whether the parentheses are balanced in a string like '(hello world', you need to check each character. The only way to do that in Dart that I know of is to look at the code units since Dart strings are strings of code units. Dart provides ready access to those code units with myString.codeUnits.

It would be perfectly valid to convert each code unit back to a string and then compare it to ( and ). However, if your original string is 1000 characters, then you have to do the character-to-string conversion 1000 times. It’s much simpler to just convert ( and ) to their respective code units once and then compare those to the other code units you already have. You’re unlikely to notice a performance difference between those two algorithms in most situations, but the second algorithm requires less work than the first one.

warcayac · July 23, 2022, 5:49am

I’ve implemented two methods (hasBalancedParentheses1, hasBalancedParentheses3) to compare them with your method (hasBalancedParentheses2)

import 'package:data_structures_algorithms/stack.dart';

const kPars = ['(',')'];

void hasBalancedParentheses1(String value, [bool showResult = true]) {
  var filtered = value.split('').where((e) => kPars.contains(e));
  var stack = Stack<String>();
  
  if (filtered.isEmpty || filtered.length.isOdd) {
    if (showResult) print(false);
    return;
  }
  
  var flag = true;
  for (var e in filtered) {
    if (e == kPars.first) { stack.push(e); } 
    else {
      if (stack.isNotEmpty) { stack.pop(); }
      else { 
        flag = false;
        break; 
      }
    }
  }

  if (showResult) print(stack.isEmpty && flag);
}

void hasBalancedParentheses2(String text, [bool showResult = true]) {
  var stack = Stack<int>();
  
  final open = kPars.first.codeUnitAt(0);
  final close = kPars.last.codeUnitAt(0);
  
  for (int codeUnit in text.codeUnits) {
    if (codeUnit == open) {
      stack.push(codeUnit);
    } else if (codeUnit == close) {
      if (stack.isEmpty) {
        if (showResult) print(false);
        return;
      } else {
        stack.pop();
      }
    }
  }
  if (showResult) print(stack.isEmpty);
}

void hasBalancedParentheses3(String text, [bool showResult = true, int idx = 0, Stack<bool>? stack]) {
  stack ??= Stack<bool>();
  
  if (idx >= text.length) {
    if (showResult) print(stack.isEmpty && idx == text.length);
    return;
  }

  final char = text[idx];
  if (char == kPars.first) stack.push(true);
  if (char == kPars.last) {
    stack.isNotEmpty ? stack.pop() : idx = text.length;
  }
  hasBalancedParentheses3(text, showResult, ++idx, stack);
}

I’ve run this code to compare run times for every method:

int measureTime(void Function() task, {bool simplePrint = false, bool showResult = true}) {
  final start = DateTime.now();
  task();
  final end = DateTime.now();
  final time = end.difference(start).inMicroseconds;
  if (showResult) {
    simplePrint ? print(time) : print('Task done at $time μs');
  }
  return time;
}

void main(List<String> args) {
  const word3 = 'h((e))llo(world)()h((e))llo(world)()h((e))llo(world)()h((e))llo(world)()(())))))))h((e))llo(world)()';
  const n = 10;

  for (var t = 0; t < n; t++) {
    var times = <int>[];
    for (var i = 0; i < n; i++) {
      times.add(measureTime(() => hasBalancedParentheses1(word3*(t+1), false), showResult: false));
    }
    print((times.reduce((a, b) => a + b) / n).toStringAsFixed(2));
    times.clear();
  }
}

Evidently I’ve changed hasBalancedParentheses1 in each run by the next method. I placed the average results in a table:

|      | M1    | M2    | M3    |
| ---- | ----- | ----- | ----- |
| 100  | 855.9 | 319.4 | 173.3 |
| 200  | 227.8 | 51.7  | 47.1  |
| 300  | 161.5 | 36.3  | 31.1  |
| 400  | 196.3 | 37.3  | 35.4  |
| 500  | 163.3 | 35.2  | 42.1  |
| 600  | 108.1 | 36.8  | 32    |
| 700  | 96.9  | 61.2  | 20.6  |
| 800  | 107.4 | 21.4  | 20.8  |
| 900  | 116.7 | 21.9  | 30.3  |
| 1000 | 127   | 31.1  | 20.2  |

It seems that using recursion is slightly more efficient.

suragch · July 25, 2022, 4:47pm

This is great! I love seeing a data driven analysis like this. I’ll need to study this more closely in the future, especially the recursive version. Since recursion itself uses a stack internally, that makes me wonder if we could implement a recursive solution without our Stack.

TODO: Note to self: Consider linking to this in the solutions of the next edition.

warcayac · July 25, 2022, 6:05pm

According your suggestion I’ve created a new recursive method:

void hasBalancedParentheses4(String text, [bool showResult = true, int idx = 0, int stack = 0]) {
  if (idx >= text.length) {
    if (showResult) print(stack == 0 && idx == text.length);
    return;
  }

  final char = text[idx];
  if (char == kPars.first) stack++;
  if (char == kPars.last) {
    stack > 0 ? stack-- : idx = text.length;
  }
  hasBalancedParentheses4(text, showResult, ++idx, stack);
}

After comparing run times, I don’t know what to think, see these two graphics:

runtimes

Where: M? = hasBalancedParentheses?

Why are averages run times different with the same data tested and same CPU?

suragch · August 30, 2022, 12:38am

I also don’t know why the run times are different. Here are a couple guesses:

Other running processes on the system are affecting the outcome.
The Dart VM gets “smarter” over time as it tries to optimize how it runs. Maybe it is choosing a different internal algorithm to run the code.

urbanmarton · January 10, 2023, 11:45pm

You mentioned the only way you know how to check each character is to look at the code units. But did you know you can check them directly using string[index] in a for loop?

bool checkParentheses(String text) {
  var stack = Stack<String>();
  for (var i = 0; i < text.length; i++) {
    final currentChar = text[i];
    if (currentChar == '(') {
      stack.push(currentChar);
    } else if (currentChar == ')') {
      if (stack.isEmpty) {
        return false;
      } else {
        stack.pop();
      }
    }
  }
  return stack.isEmpty;
}

suragch · January 11, 2023, 4:58am

Thanks for pointing that out. I had forgotten about that option when I wrote my earlier comment.

I’m curious, though, whether internally Dart converts the code unit back to a string when using index notation on a String.

Note to self: Look up the source code for this.

suragch · April 19, 2023, 5:00am

I’m updating the second edition of the book to use the more simplified syntax that you suggest here. Thank you for posting!

Side note: I did try looking up the source code here for how String indexing is implemented in the Dart VM: sdk/object.h at master · dart-lang/sdk · GitHub (search for OneByteString or TwoByteString). Unfortunately I didn’t understand the C++ code well enough to answer my question to my satisfaction about whether Dart converts the code unit back to a string. From what I can see, the C++ code is using an unsigned 16-bit integer for the two-byte string and an unsigned 8-bit integer for the one-byte string. Since this comes back to Dart as a String, the conversion is happening somewhere, but I’m assuming this is an efficient process for the short strings we are using in the problem. The left and right parentheses should both be one-byte strings.

urbanmarton · April 24, 2023, 3:40pm

Great research!! Unfortunately my C++ is very rusty :))
Will there be many other changes in the new updated edition? Can’t wait to read it. Thanks for all your work!
Kindest regards,
Márton Urbán

suragch · April 27, 2023, 7:27am

The biggest change in the second edition is a new chapter on recursion. Edition 1.0 jumped into recursive solutions without much explanation of how recursion works. However, if you already made it through the first edition, the new chapter probably won’t help you that much.

Here are some other changes:

Using the generic Comparable<E> rather than Comparable<dynamic> for all the classes and algorithms that require comparisons. Since int doesn’t directly implement comparable, this requires users to mark integer lists with num, but overall the code is more type safe now.
Updated lots of images that were confusing or had problems.
Used an iterative solution for binary search rather than a recursive one.
Used a singly linked list for the queue rather than a doubly linked list (since the book never taught the reader how to make a doubly linked list).