Grokking Algorithms

Grokking Algorithms (TURING)

Author: Aditya Bhargava

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
def binary_search(list, item):
"""
list must be ordered
time: O(logn)
"""
low = 0
high = len(list)-1
while(low <= high):
mid = (low + high)/2
if list[mid] == item:
return mid
if list[mid] < item:
low = mid+1
else:
high = mid-1
return None
if __name__ == '__main__':
print(binary_search([1,3,5,7,9],5))
print(binary_search([1,3,4,7,8],9))

2.Selection_sort

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
def findSmallest(arr):
"""
input: arr
output: smallest_index
"""
smallest = arr[0]
smallest_index = 0
for i in range(len(arr)):
if arr[i] < smallest:
smallest = arr[i]
smallest_index = i
return smallest_index
def selectionSort(arr):
newArr = []
for i in range(len(arr)):
smallest_index = findSmallest(arr)
newArr.append(arr.pop(smallest_index))
return newArr
if __name__ == '__main__':
print(selectionSort([5, 3, 6, 2, 10]))

3.Quick_sort

1
2
3
4
5
6
7
8
9
10
11
12
13
14
def quicksort(array):
if(len(array) < 2): # baseline
return array
else:
pivot = array[0]
less = [i for i in array[1:] if i < pivot]
greater = [i for i in array[1:] if i > pivot]
return quicksort(less) + [pivot] + quicksort(greater)
print(quicksort([10, 5, 2, 7]))
# O(nlogn)
# layers: O(logn)
# for every layer: O(n)

用于图的查找算法

查找最短路径:

  • 从A出发,有前往B的路径么?
  • 从A出发,前往B的哪条路径最短?

在广度优先搜索中,搜索范围从起点开始逐渐向外延伸,即先检查一度关系,再检查二度关系

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
from collections import deque
graph = {}
graph["you"] = ["alice", "bob", "claire"]
graph["bob"] = ["anuj", "peggy"]
graph["alice"] = ["peggy"]
graph["claire"] = ["thom", "jonny"]
graph["anuj"] = []
graph["peggy"] = []
graph["thom"] = []
graph["jonny"] = []
def person_is_seller(name):
# sample logic to judge if the person is a seller
return name[-1] == 'm'
def search(name):
search_queue = deque()
search_queue += graph[name]
# This array is how you keep track of which people you've searched before.
searched = []
while search_queue:
person = search_queue.popleft()
# Only search this person if you haven't already searched them.
if not person in searched:
if person_is_seller(person):
print person + " is a mango seller!"
return True
else:
search_queue += graph[person]
# Marks this person as searched
searched.append(person)
return False
search("you")
# Running time: O(V+E)
# O(E) 边数 沿着每条边
# O(V) vertice 顶点数 将每个元素添加到队列的时间为O(1) 所有的元素共需要O(V)

Points:

  • 使用队列

  • 散列表是无序的,添加键值对的顺序无关紧要

  • 对于检查过的元素,务必不要再去检查,否则可能会导致无限循环

    如果任务A依赖于任务B,在列表中任务A必须在B后面,被称为拓扑排序,创建一个有序列表

    这种图被称为数,树是一种特殊的图,其中没有往后指的边

5.Dijkstra’s Algorithm

找出最快路径

Process:

  1. 可在最短时间内前往的节点
  2. 对于该节点的邻居,检查是否有前往他们的更短距离,如果有,就更新其开销
  3. 重复这个过程,直到对图中的每个节点都这样做过
  4. 计算最终路径

Dilkstra’s Algorithm只适用于有向无环图(directed acyclic graph) DAG

如果有负权边,就不能使用Dilkstra’s Algorithm,可以使用Bellman-Ford Algorithm

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
graph = {}
graph["start"] = {}
graph["start"]["a"] = 6
graph["start"]["b"] = 2
graph["a"] = {}
graph["a"]["fin"] = 1
graph["b"] = {}
graph["b"]["a"] = 3
graph["b"]["fin"] = 5
graph["fin"] = {}
infinity = float("inf")
costs = {}
costs["a"] = 6
costs["b"] = 2
costs["fin"] = infinity
parents = {}
parents["a"] = "start"
parents["b"] = "start"
parents["fin"] = None
processed = []
def find_lower_cost_node(costs):
lower_cost = float("inf")
lower_cost_node = None
for node in costs:
cost = costs[node]
if cost < lower_cost and node not in processed:
lower_cost = cost
lower_cost_node = node
return lower_cost_node
node = find_lower_cost_node(costs)
while node is not None:
cost = costs[node]
neighbors = graph[node]
for n in neighbors.keys():
new_cost = cost + neighbors[n]
if costs[n] > new_cost:
costs[n] = new_cost
parents[n] = node
processed.append(node)
node = find_lower_cost_node(costs)
print(costs)
node = "fin"
while node is not "start":
node = parents[node]
print(node)

6.Greedy Algorithm

每步都采取最优的算法

每部都采取局部最优解,最终得到的就是全局最优解

近似算法:使用贪婪算法可以得到非常接近的解

在获得精确解需要的时间太长时,可使用近似算法

(1)集合覆盖问题(广播台)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
states_needed = set(["mt", "wa", "or", "id", "nv", "ut", "ca", "az"])
stations = {}
stations["kone"] = set(["id", "nv", "ut"])
stations["ktwo"] = set(["wa", "id", "mt"])
stations["kthree"] = set(["or", "nv", "ca"])
stations["kfour"] = set(["nv", "ut"])
stations["kfive"] = set(["ca", "az"])
final_stations = set()
while states_needed:
best_station = None
states_covered = set()
for station, states_for_station in stations.items():
covered = states_for_station & states_needed
if len(covered) > len(states_covered):
states_covered = covered
best_station = station
states_needed -= states_covered
final_stations.add(best_station)
print(final_stations)

(2)旅行商问题

​ 旅行商前往5个不同的城市,找出前往这5个城市的最短路径

NP完全问题:

需要计算所有的解,并从中选出最小/短的

NP问题不用去找最完美的解,而是使用近似算法即可

没有办法完全判断是不是NP完全问题,但还是有蛛丝马迹可循

Points:

  • 元素较少时算法的运行速度较快,但随着元素的数量的增加,速度会变得非常慢
  • 涉及所有组合的问题,通常是NP问题
  • 不能讲问题分成小问题,必须考虑各种可能的情况,可能是NP问题
  • 如果问题涉及序列(如旅行商问题中的城市序列问题)且难以解决,可能是NP问题
  • 如果问题涉及集合(如广播台集合)且难以解决,可能是NP问题
  • 如果问题可以转换成旅行商问题或集合覆盖问题,那他就是NP完全问题

对于NP完全问题,没有找到快速解决的办法,最佳的做法是使用近似算法

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
states_needed = set(["mt", "wa", "or", "id", "nv", "ut", "ca", "az"])
stations = {}
stations["kone"] = set(["id", "nv", "ut"])
stations["ktwo"] = set(["wa", "id", "mt"])
stations["kthree"] = set(["or", "nv", "ca"])
stations["kfour"] = set(["nv", "ut"])
stations["kfive"] = set(["ca", "az"])
final_stations = set()
while states_needed:
best_station = None
states_covered = set()
for station, states_for_station in stations.items():
covered = states_for_station & states_needed
if len(covered) > len(states_covered):
states_covered = covered
best_station = station
states_needed -= states_covered
final_stations.add(best_station)
print(final_stations)

7.Dynamic Programming

Keys:

  • 最优子结构
  • 边界
  • 状态转移公式
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
// 有一座高度是10级台阶的楼梯,从下往上走,每跨一步只能
// 向上1级或者2级台阶。要求用程序来求出一共有多少种走法。
import java.util.HashMap;
public class Dynamic_programming{
public int getClimbingWays_recursive(int n){
if(n<1)
return 0;
if(n==1)
return 1;
if(n==2)
return 2;
return getClimbingWays_recursive(n-1)+getClimbingWays_recursive(n-2);
}
public int getClimbingWays_memo(int n, HashMap<Integer, Integer> map){
if(n<1)
return 0;
if(n==1)
return 1;
if(n==2)
return 2;
if(map.containsKey(n))
return map.get(n);
else{
int value=getClimbingWays_memo(n-1, map)+getClimbingWays_memo(n-2, map);
map.put(n,value);
return value;
}
}
public int getClimbingWays_DP(int n){
if(n<1)
return 0;
if(n==1)
return 1;
if(n==2)
return 2;
int a=1, b=2, temp=0;
for(int i=3;i<=n;i++){
temp=a+b;
a=b;
b=temp;
}
return temp;
}
public static void main(String args[]){
Dynamic_programming test = new Dynamic_programming();
HashMap<Integer, Integer> map = new HashMap<>();
System.out.println(test.getClimbingWays_DP(3));
System.out.println(test.getClimbingWays_memo(3,map));
System.out.println(test.getClimbingWays_recursive(3));
}
}

3 ways:

  • recursive
  • memo, for record
  • DP
The King and the Golden

10 workers + 5 Golden

Num Golden People
1 400 5
2 500 5
3 200 3
4 300 4
5 350 3

最优子结构:

​ 5个金矿的最优选择:Max(前4座金矿10人的挖金数量,前4座金矿7工人的挖金数量+第5座金矿的挖金数量)

金矿数量设为N,工人数W,金矿的黄金量设为数组G[ ],金矿的用工量设为数组P[ ]

F(5, 10) = Max( F(4, 10), F(4, 10-P[4])+G[4])

边界情况:

​ N=1, W>=P[0], F(N, W) = G[0]

​ N=1, W<P[0], F(N, W) = 0

Summary:

1
2
3
4
F(N, W) = 0 (N<=1, W<=P[0])
F(N, W) = G[0] (N==1, W>=P[0])
F(N, W) = F(N-1, W) (N>1, W<P[N-1])
F(N, W) = max(F(N-1,W), F(N-1, W-P[N-1])+G[N-1]) (N>1, W>=P[N-1])
N\W 1 2 3 4 5 6 7 8 9 10
1 400 400 400 400 400 400
2 500 500 500 500 500 900
3 200 200 500 500 500 700 700 900
4 200 300 500 500 500 700 800 900
5 200 350 500 550 650 850 850 900

除了第一行以外,每个格子都是前一行的一个或者两个格子推导而来的

并不需要存储整个表格,只需要存储前一行的结果,就可以推到出新的一行

8.Next

    • 二叉树

      查找:O(logn),插入:O(logn),删除:O(logn)

    • B树

    • 红黑树

    • 伸展树

  • 反向索引

  • 傅里叶变换

  • 并行算法

    • Map,Reduce
  • 布隆过滤器

    概率型数据结构,提供的答案可能不对,但很可能是正确的

    使用散列表时,答案绝对可靠,而使用布隆过滤器时,答案却很可能是正确的

    可能出现报错的情况,不可能出现漏报的情况

    • HyperLogLog 类似布隆过滤器的算法

    面对海量数据且要求答案八九不离十时,可考虑使用概率性算法

  • SHA算法

    安全散列函数(secure hash algorithm),生成一个散列值–一个较短的字符串

    散列算法是单向的

    局部敏感的散列函数:Simhash,需要检查两项内容相似程度时

  • Diffie-Hellman密钥交换

    私钥、公钥

    RSA