def method_0(df):
    for i in df.index:
        price_list1 = df['price_1'].loc[i]
        price_list2 = price_list1.copy()
        for price in price_list1:
            if price in [999, 9999, 99999, 999999, 9999999, 99999999]:
                price_list2.remove(price)
        if price_list2 == []:
            price_list2 = [df['price_2'].loc[i]]
        df['price_1'].loc[i] = price_list2
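When I first saw the code, my guess was that the biggest bottleneck was assigning through loc: Pandas.DataFrame.loc is, as always and as everyone knows, slow! (Under the hood, Pandas is implemented in Cython and C. Inside a for loop, execution keeps crossing back and forth between Python and C. On top of that, Pandas is weak at lookups and strong at computation, while Python is strong at lookups and weak at computation. Using Pandas.DataFrame.loc therefore plays to Pandas's weakness.)
First, let's see how fast the original code runs: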
Timer unit: 1e-06 s
Total time: 110.704 s
File: <ipython-input-6-da4d503d1068>
Function: method_0 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def method_0(df):
     2    100001     182837.0      1.8      0.2      for i in df.index:
     3    100000    5567047.0     55.7      5.0          price_list1 = df['price_1'].loc[i]
     4    100000     100540.0      1.0      0.1          price_list2 = price_list1.copy()
     5    400000     151158.0      0.4      0.1          for price in price_list1:
     6    300000     172759.0      0.6      0.2              if price in [999, 9999, 99999, 999999, 9999999, 99999999]:
     7    100044      86540.0      0.9      0.1                  price_list2.remove(price)
     8    100000      50259.0      0.5      0.0          if price_list2 == []:
     9                                                       price_list2 = [df['price_2'].loc[i]]
    10    100000  104392918.0   1043.9     94.3          df['price_1'].loc[i] = price_list2
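Sure enough, loc is the culprit, especially in the final assignment.
Basic coding habits
Although we have already found the key problem, we will start this round of optimization with the easiest step, tidying up basic coding habits:
The copy step turns out to be unnecessary; removing it does not change the result
The list remove step can be replaced with a simple list comprehension
The empty-list check can also be written more concisely
The new code looks like this: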
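As a quick sanity check on the three points above, here is a minimal sketch that isolates the per-row logic into two plain functions and compares them. The names clean_original and clean_simplified and the sample data are hypothetical, introduced only for illustration; they are not part of the original code.

```python
# Sentinel prices to strip out, copied from the article's code.
REMOVE_LIST = [999, 9999, 99999, 999999, 9999999, 99999999]


def clean_original(price_list1, fallback):
    """Per-row logic of method_0: copy, remove in a loop, compare with []."""
    price_list2 = price_list1.copy()
    for price in price_list1:
        if price in REMOVE_LIST:
            price_list2.remove(price)
    if price_list2 == []:
        price_list2 = [fallback]
    return price_list2


def clean_simplified(price_list, fallback):
    """Same logic with a list comprehension and a truthiness check."""
    kept = [x for x in price_list if x not in REMOVE_LIST]
    if not kept:
        kept = [fallback]
    return kept


# Made-up sample rows: (price_1 list, price_2 fallback).
samples = [([999, 120, 130], 100), ([999, 9999], 80), ([50, 999, 50], 60)]
results_match = all(
    clean_original(prices, fb) == clean_simplified(prices, fb)
    for prices, fb in samples
)
```

The comprehension already builds a fresh list, which is why the explicit copy is no longer needed.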
def method_1(df):
    remove_list = [999, 9999, 99999, 999999, 9999999, 99999999]
    for i in df.index:
        price_list = df['price_1'].loc[i]
        price_list_ = [x for x in price_list if x not in remove_list]
        # Another way to write this is price_list_ = list(set(price_list) - set(remove_list))
        # In testing the speed was about the same, so I won't go into it here
        if not price_list_:
            price_list_ = [df['price_2'].loc[i]]
        df['price_1'].loc[i] = price_list_
Then compare the running speed:
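One caveat about the set-based alternative mentioned in the comment: it is only interchangeable with the list comprehension when the price list has no repeated values and its order does not matter. A small sketch with made-up prices shows the difference:

```python
remove_list = [999, 9999, 99999, 999999, 9999999, 99999999]

# Made-up row with a repeated price.
price_list = [300, 999, 100, 100]

# The list comprehension keeps the original order and the duplicates.
filtered = [x for x in price_list if x not in remove_list]

# The set difference drops duplicates and gives no ordering guarantee.
via_set = list(set(price_list) - set(remove_list))
```

Here filtered keeps both copies of 100, while via_set holds each remaining price only once.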
Timer unit: 1e-06 s
Total time: 113.922 s
File: <ipython-input-6-d31e4af0eef1>
Function: method_1 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def method_1(df):
     2         1          5.0      5.0      0.0      remove_list = [999, 9999, 99999, 999999, 9999999, 99999999]
     3    100001     194521.0      1.9      0.2      for i in df.index:
     4    100000    5914280.0     59.1      5.2          price_list = df['price_1'].loc[i]
     5    100000     353655.0      3.5      0.3          price_list_ = [x for x in price_list if x not in remove_list]
     6    100000      50260.0      0.5      0.0          if not price_list_:
     7                                                       price_list_ = [df['price_2'].loc[i]]
     8    100000  107408811.0   1074.1     94.3          df['price_1'].loc[i] = price_list_
def method_2(df):
    remove_list = [999, 9999, 99999, 999999, 9999999, 99999999]
    for i, row in df.iterrows():
        price_list = row['price_1']
        price_list_ = [x for x in price_list if x not in remove_list]
        if not price_list_:
            price_list_ = [row['price_2']]
        df['price_1'].loc[i] = price_list_
The new running speed:
Timer unit: 1e-06 s
Total time: 136.877 s
File: <ipython-input-6-fe966249eec5>
Function: method_2 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def method_2(df):
     2         1          5.0      5.0      0.0      remove_list = [999, 9999, 99999, 999999, 9999999, 99999999]
     3    100001   20044080.0    200.4     14.6      for i, row in df.iterrows():
     4    100000    1805185.0     18.1      1.3          price_list = row['price_1']
     5    100000     376541.0      3.8      0.3          price_list_ = [x for x in price_list if x not in remove_list]
     6    100000      49942.0      0.5      0.0          if not price_list_:
     7                                                       price_list_ = [row['price_2']]
     8    100000  114601379.0   1146.0     83.7          df['price_1'].loc[i] = price_list_
That's right, it actually got slower! But if you look closely, you will notice:
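The for i, row in df.iterrows() step does take more time than the original for i in df.index (200.4 µs versus 1.9 µs per hit), although reading row['price_1'] is cheaper than df['price_1'].loc[i] (18.1 µs versus 59.1 µs per hit).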
def method_3(df):
    remove_list = [999, 9999, 99999, 999999, 9999999, 99999999]
    for row in df.itertuples():
        price_list = row[10]
        price_list_ = [x for x in price_list if x not in remove_list]
        if not price_list_:
            price_list_ = [row[11]]
        df['price_1'].loc[row[0]] = price_list_
The running speed this time:
Timer unit: 1e-06 s
Total time: 119.508 s
File: <ipython-input-7-a4704a52d255>
Function: method_3 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def method_3(df):
     2         1          4.0      4.0      0.0      remove_list = [999, 9999, 99999, 999999, 9999999, 99999999]
     3    100001    1000295.0     10.0      0.8      for row in df.itertuples():
     4    100000      69624.0      0.7      0.1          price_list = row[10]
     5    100000     368825.0      3.7      0.3          price_list_ = [x for x in price_list if x not in remove_list]
     6    100000      46745.0      0.5      0.0          if not price_list_:
     7                                                       price_list_ = [row[11]]
     8    100000  118022133.0   1180.2     98.8          df['price_1'].loc[row[0]] = price_list_
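The positional lookups row[10] and row[11] in method_3 depend on the column order of the real DataFrame, which is not shown here. As a hedged sketch on a made-up two-column frame, the tuples produced by itertuples can also be read by field name, as long as the column names are valid Python identifiers:

```python
import pandas as pd

# Toy frame: the column names follow the article, the values are made up.
df = pd.DataFrame(
    {"price_1": [[999, 120], [9999]], "price_2": [100, 80]},
    index=["a", "b"],
)

first = next(df.itertuples())

# Position 0 holds the index label; the columns follow in frame order.
index_by_position = first[0]
index_by_name = first.Index

# Each column is also reachable by name, which survives column reordering.
price_1_by_position = first[1]
price_1_by_name = first.price_1
```

In this toy frame price_1 sits at position 1; in the article's data it sits at position 10, so the named form avoids having to count columns.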
def method_4(df):
    remove_list = [999, 9999, 99999, 999999, 9999999, 99999999]
    temp = []
    for i, row in df.iterrows():
        price_list = row['price_1']
        price_list_ = [x for x in price_list if x not in remove_list]
        if not price_list_:
            price_list_ = [row['price_2']]
        temp.append(price_list_)
    df['price_1'] = temp
Then compare the running speed:
Timer unit: 1e-06 s
Total time: 19.9491 s
File: <ipython-input-6-43926ad4dcc0>
Function: method_4 at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def method_4(df):
     2         1          4.0      4.0      0.0      remove_list = [999, 9999, 99999, 999999, 9999999, 99999999]
     3         1          2.0      2.0      0.0      temp = []
     4    100001   17929375.0    179.3     89.9      for i, row in df.iterrows():
     5    100000    1576072.0     15.8      7.9          price_list = row['price_1']
     6    100000     293081.0      2.9      1.5          price_list_ = [x for x in price_list if x not in remove_list]
     7    100000      48867.0      0.5      0.2          if not price_list_:
     8                                                       price_list_ = [row['price_2']]
     9    100000      70884.0      0.7      0.4          temp.append(price_list_)
    10         1      30843.0  30843.0      0.2      df['price_1'] = temp
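Clearly, once the loc problem is out of the way, the code as a whole gets dramatically faster! Total processing time is now roughly one sixth of the original code's (19.9 s versus 110.7 s).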
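In the method_4 profile, the iterrows line now accounts for about 90% of the remaining time. One possible variation, sketched here as an assumption rather than as something the article measured, keeps the build-a-list-then-assign-once idea but walks the two columns with zip instead of iterrows. The names clean_prices and method_zip and the sample frame are hypothetical.

```python
import pandas as pd

REMOVE_LIST = [999, 9999, 99999, 999999, 9999999, 99999999]


def clean_prices(price_list, fallback):
    """Drop sentinel prices; fall back to [fallback] if nothing is left."""
    kept = [x for x in price_list if x not in REMOVE_LIST]
    return kept if kept else [fallback]


def method_zip(df):
    """Build the whole result list first, then assign the column once."""
    cleaned = [
        clean_prices(p1, p2)
        for p1, p2 in zip(df["price_1"], df["price_2"])
    ]
    df["price_1"] = cleaned
    return df


# Made-up sample data with the article's column names.
sample = pd.DataFrame(
    {"price_1": [[999, 120, 130], [9999], [50]], "price_2": [100, 80, 60]}
)
result = method_zip(sample)
```

No timing is claimed for this variant; it would need to be profiled on the real data in the same way as the methods above.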
apply: perhaps the best fit for this scenario
Straight to the code:
def method_5(df):
    remove_list = [999, 9999, 99999, 999999, 9999999, 99999999]

    def deal_with_it(price_1, price_2):
        price_list_ = [x for x in price_1 if x not in remove_list]
        if not price_list_:
            return [price_2]
        else:
            return price_list_