timsort及最初版本的bug

什么是timsort

From Wikipedia, the free encyclopedia
Timsort is a hybrid stable sorting algorithm, derived from merge sort and insertion sort, designed to perform well on many kinds of real-world data. It uses techniques from Peter McIlroy’s “Optimistic Sorting and Information Theoretic Complexity”, in Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 467–474, January 1993. It was implemented by Tim Peters in 2002 for use in the Python programming language. The algorithm finds subsequences of the data that are already ordered, and uses that knowledge to sort the remainder more efficiently. This is done by merging an identified subsequence, called a run, with existing runs until certain criteria are fulfilled. Timsort has been Python’s standard sorting algorithm since version 2.3. It is also used to sort arrays of non-primitive type in Java SE 7, on the Android platform, and in GNU Octave.
Worst-case performance : nlogn
Best-case performance : n
Average performance : nlogn
Worst-case space complexity O(n)

原理

其实从根本上来说,这个排序是通过对数组的一些信息提前处理,结合归并和插入排序等处理实现的。

步骤

  1. 判断数组的大小,小于32使用二分插入排序
  2. 数组大于32时,将输入按其严格升序和严格降序特点进行了分区(run),其中分区长度(runLen)过短则用插入排序补充到minRun的个数
  3. 将run入栈,若栈中不满足runLen[i - 3] > runLen[i - 2] + runLen[i - 1]runLen[i - 2] > runLen[i - 1]那么就合并相邻较小栈,栈顶三元素满足则break,继续执行入栈操作
  4. 栈只剩一个run时结束
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
static <T> void sort(T[] a, int lo, int hi, Comparator<? super T> c) {
if (c == null) {
Arrays.sort(a, lo, hi);
return;
}

rangeCheck(a.length, lo, hi);
int nRemaining = hi - lo;
if (nRemaining < 2)
return; // Arrays of size 0 and 1 are always sorted

// If array is small, do a "mini-TimSort" with no merges
if (nRemaining < MIN_MERGE) {
int initRunLen = countRunAndMakeAscending(a, lo, hi, c);
binarySort(a, lo, hi, lo + initRunLen, c);
return;
}

/**
* March over the array once, left to right, finding natural runs,
* extending short natural runs to minRun elements, and merging runs
* to maintain stack invariant.
*/
TimSort<T> ts = new TimSort<>(a, c);
int minRun = minRunLength(nRemaining);
do {
// Identify next run
int runLen = countRunAndMakeAscending(a, lo, hi, c);

// If run is short, extend to min(minRun, nRemaining)
if (runLen < minRun) {
int force = nRemaining <= minRun ? nRemaining : minRun;
binarySort(a, lo, lo + force, lo + runLen, c);
runLen = force;
}

// Push run onto pending-run stack, and maybe merge
ts.pushRun(lo, runLen);
ts.mergeCollapse();

// Advance to find next run
lo += runLen;
nRemaining -= runLen;
} while (nRemaining != 0);

// Merge all remaining runs to complete sort
assert lo == hi;
ts.mergeForceCollapse();
assert ts.stackSize == 1;
}
/**
* Examines the stack of runs waiting to be merged and merges adjacent runs
* until the stack invariants are reestablished:
*
* 1. runLen[i - 3] > runLen[i - 2] + runLen[i - 1]
* 2. runLen[i - 2] > runLen[i - 1]
*
* This method is called each time a new run is pushed onto the stack,
* so the invariants are guaranteed to hold for i < stackSize upon
* entry to the method.
*/
private void mergeCollapse() {
while (stackSize > 1) {
int n = stackSize - 2;
if (n > 0 && runLen[n-1] <= runLen[n] + runLen[n+1]) {
if (runLen[n - 1] < runLen[n + 1])
n--;
mergeAt(n);
} else if (runLen[n] <= runLen[n + 1]) {
mergeAt(n);
} else {
break; // Invariant is established
}
}
}

BUG

问题出现在哪里呢?可以看看这份报告:
Proving that Android’s, Java’s and Python’s sorting algorithm is broken (and showing how to fix it)

BUG出现的地点正在这个条件里

1
2
3
4
/**
* 1. runLen[i - 3] > runLen[i - 2] + runLen[i - 1]
* 2. runLen[i - 2] > runLen[i - 1]
*/

这个条件在认为检查栈顶的三个元素时,通过递推每一次都满足(循环不变量),从而直接break出了while循环,视作循环结束也满足这个条件,但其实这个是不够的,举个网上的例子:
120, 80, 25, 20, 30
经过mergeCollapse()函数后变为:
120, 80, 45, 30
由于80 > 45 + 30,45 > 30,满足约束条件,此时归并就终止了。但是注意栈里的其他run,120 < 80 + 45,这是不满足约束条件的,而由于我们只判断了栈顶的run,因此在这里就留下了“隐患”。虽然平常使用不太可能会出现这个问题,但是由于栈的长度是事先定好的:

1
2
3
4
5
int stackLen = (len <    120  ?  5 :
len < 1542 ? 10 :
len < 119151 ? 19 : 40);
runBase = new int[stackLen];
runLen = new int[stackLen];

其中len表示输入的Array的长度,stackLen表示申请的栈的大小。其中的数字是通过约束条件得出的,可以设序列中的元素为F(n),只要满足F(n) = F(n-1) + F(n-2) + 1即可满足约束条件。但是请注意这是在理想条件下,也就是栈内每个run都必须满足这个约束条件。而我们在这个约束条件下能够举出反例,说明反例出现时,栈可能会用的多一些,那么就会出现java.lang.ArrayIndexOutOfBoundsException

当然JDK也修复了这个问题,不过是用粗暴的增加栈的长度来解决的,其实KeY团队有更严谨的一个解决方案,在Proving that Android’s, Java’s and Python’s sorting algorithm is broken (and showing how to fix it)这文章里有讲到,感兴趣可以去看看。

参考文章

  1. 形式化方法的逆袭——如何找出Timsort算法和玉兔月球车中的Bug?
  2. Proving that Android’s, Java’s and Python’s sorting algorithm is broken (and showing how to fix it)