【发布时间】:2019-02-11 01:00:51
【问题描述】:
我有 100 个文件,每个文件 10 个 GB。我需要重新格式化文件并组合成更可用的表格格式,以便对数据进行分组、求和、平均等。使用 Python 重新格式化数据需要一个多星期的时间。即使我将其重新格式化为表格后,我也不知道它是否对于数据框来说太大了,但一次一个问题。
谁能建议一种更快的方法来重新格式化文本文件?我会考虑任何 C++、perl 等。
样本数据:
Scenario: Modeling_5305 (0.0001)
Position: NORTHERN UTILITIES SR NT,
" ","THEO/Effective Duration","THEO/Yield","THEO/Implied Spread","THEO/Value","THEO/Price","THEO/Outstanding Balance","THEO/Effective Convexity","ID","WAL","Type","Maturity Date","Coupon Rate","POS/Position Units","POS/Portfolio","POS/User Defined 1","POS/SE Cash 1","User Defined 2","CMO WAL","Spread Over Yield",
"2017/12/31",16.0137 T,4.4194 % SEMI 30/360,0.4980 % SEMI 30/360,"6,934,452.0000 USD","6,884,052.0000 USD","7,000,000.0000 USD",371.6160 T,CachedFilterPartitions-PL_SPLITTER.2:665876C#3,29.8548 T,Fixed Rate Bond,2047/11/01,4.3200 % SEMI 30/360,"70,000.0000",All Portfolios,030421000,0.0000 USD,FRB,N/A,0.4980 % SEMI 30/360,
"2018/01/12",15.5666 T,4.8499 % SEMI 30/360,0.4980 % SEMI 30/360,"6,477,803.7492 USD","6,418,163.7492 USD","7,000,000.0000 USD",356.9428 T,CachedFilterPartitions-PL_SPLITTER.2:665876C#3,29.8219 T,Fixed Rate Bond,2047/11/01,4.3200 % SEMI 30/360,"70,000.0000",All Portfolios,030421000,0.0000 USD,FRB,N/A,0.4980 % SEMI 30/360,
Scenario: Modeling_5305 (0.0001)
Position: OLIVIA ISSUER TR SER A (A,
" ","THEO/Effective Duration","THEO/Yield","THEO/Implied Spread","THEO/Value","THEO/Price","THEO/Outstanding Balance","THEO/Effective Convexity","ID","WAL","Type","Maturity Date","Coupon Rate","POS/Position Units","POS/Portfolio","POS/User Defined 1","POS/SE Cash 1","User Defined 2","CMO WAL","Spread Over Yield",
"2017/12/31",1.3160 T,19.0762 % SEMI 30/360,0.2990 % SEMI 30/360,"3,862,500.0000 USD","3,862,500.0000 USD","5,000,000.0000 USD",2.3811 T,CachedFilterPartitions-PL_SPLITTER.2:681071AA4,1.3288 T,Interest Rate Index Linked Note,2019/05/30,0.0000 % MON 30/360,"50,000.0000",All Portfolios,010421002,0.0000 USD,IRLIN,N/A,0.2990 % SEMI 30/360,
"2018/01/12",1.2766 T,21.9196 % SEMI 30/360,0.2990 % SEMI 30/360,"3,815,391.3467 USD","3,815,391.3467 USD","5,000,000.0000 USD",2.2565 T,CachedFilterPartitions-PL_SPLITTER.2:681071AA4,1.2959 T,Interest Rate Index Linked Note,2019/05/30,0.0000 % MON 30/360,"50,000.0000",All Portfolios,010421002,0.0000 USD,IRLIN,N/A,0.2990 % SEMI 30/360,
我想重新格式化这个 csv 表,以便导入数据框:
Position, Scenario, TimeSteps, THEO/Value
NORTHERN UTILITIES SR NT, Modeling_5305, 2018/01/12, 6477803.7492
OLIVIA ISSUER TR SER A (A, Modeling_5305, 2018/01/12, 3815391.3467
【问题讨论】:
-
根据我的经验,最大的瓶颈是您用来访问/操作 csv 或其他文件格式的库。你用什么图书馆?您是否尝试过其他库?为什么一个简单的 bash 脚本不能解决问题?