[Posted]: 2018-09-22 18:13:34
[Question]:
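I have several text files containing URLs. I am trying to create an SQLite database to store these URLs in a table. The URL table has two columns: a primary key (INTEGER) and the URL (TEXT).
I insert 100,000 entries per INSERT command and loop until the URL list is exhausted. Basically, I read the contents of all the text files into one list, split that list into smaller lists of 100,000 entries each, and insert each of those into the table.
The text files contain 4,591,415 URLs in total, and their combined size is about 97.5 MB.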
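The batching approach described above can be sketched roughly as follows. This is only an illustration, not the code from the repository: the table name `urls`, the helper names `iter_batches` and `insert_urls`, and the sample URLs are all hypothetical, and an in-memory database stands in for the real file.

```python
import sqlite3
from itertools import islice


def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def insert_urls(conn, urls, batch_size=100_000):
    """Insert urls in batches and return how many rows were submitted."""
    total = 0
    for batch in iter_batches(urls, batch_size):
        # "with conn" commits the batch on success and rolls back on error,
        # so each batch runs in a single transaction.
        with conn:
            conn.executemany(
                "INSERT INTO urls (url) VALUES (?)",
                ((u,) for u in batch),
            )
        total += len(batch)
    return total


# Hypothetical two-column table (INTEGER primary key plus TEXT url).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT NOT NULL)")

sample = [f"http://example.com/{i}" for i in range(250)]  # made-up data
inserted = insert_urls(conn, sample, batch_size=100)
row_count = conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
print(inserted, row_count)
```

Using `executemany` with one transaction per batch keeps the number of commits small, which is usually the main cost in bulk inserts.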
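Problems:
1. When I use a file-based database, the insert takes about 7 to 7.5 minutes. That does not feel like a very fast insert, given that I have a solid-state drive with fast read/write speeds. I also have about 10 GB of RAM available, as shown in Task Manager. The processor is an i5-6300U at 2.4 GHz.
2. The text files total about 97.5 MB, but after I insert the URLs the SQLite database is about 350 MB, almost 3.5 times the size of the original data. Since the database contains no other tables, indexes, etc., this size looks a bit odd.
For problem 1, I experimented with the parameters and arrived at the best ones based on test runs with different settings.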
<table style="width:100%">
<tr>
<th>Configuration</th>
<th>Time</th>
</tr>
<tr><td>50,000 - with journal = delete and no transaction</td><td>0:12:09.888404</td></tr>
<tr><td>50,000 - with journal = delete and with transaction</td><td>0:22:43.613580</td></tr>
<tr><td>50,000 - with journal = memory and transaction</td><td>0:09:01.140017</td></tr>
<tr><td>50,000 - with journal = memory</td><td>0:07:38.820148</td></tr>
<tr><td>50,000 - with journal = memory and synchronous=0</td><td>0:07:43.587135</td></tr>
<tr><td>50,000 - with journal = memory and synchronous=1 and page_size=65535</td><td>0:07:19.778217</td></tr>
<tr><td>50,000 - with journal = memory and synchronous=0 and page_size=65535</td><td>0:07:28.186541</td></tr>
<tr><td>50,000 - with journal = delete and synchronous=1 and page_size=65535</td><td>0:07:06.539198</td></tr>
<tr><td>50,000 - with journal = delete and synchronous=0 and page_size=65535</td><td>0:07:19.810333</td></tr>
<tr><td>50,000 - with journal = wal and synchronous=0 and page_size=65535</td><td>0:08:22.856690</td></tr>
<tr><td>50,000 - with journal = wal and synchronous=1 and page_size=65535</td><td>0:08:22.326936</td></tr>
<tr><td>50,000 - with journal = delete and synchronous=1 and page_size=4096</td><td>0:07:35.365883</td></tr>
<tr><td>50,000 - with journal = memory and synchronous=1 and page_size=4096</td><td>0:07:15.183948</td></tr>
<tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535</td><td>0:07:13.402985</td></tr>
</table>
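For readers who want to reproduce configurations like the ones in the table, here is a minimal sketch of how such settings might be applied from Python. The helper `open_tuned`, the file name `urls.db` and the table `urls` are hypothetical. Note one assumption: as far as I know, SQLite only accepts power-of-two page sizes between 512 and 65536, so the sketch uses 65536 rather than the 65535 shown in the table, which may have been silently ignored.

```python
import os
import sqlite3
import tempfile

ALLOWED_JOURNAL_MODES = {"DELETE", "TRUNCATE", "PERSIST", "MEMORY", "WAL", "OFF"}


def open_tuned(path, journal_mode="DELETE", synchronous=1,
               page_size=4096, cache_size=None):
    """Open a database and apply the PRAGMA settings before any data is written."""
    if journal_mode.upper() not in ALLOWED_JOURNAL_MODES:
        raise ValueError(f"unsupported journal mode: {journal_mode!r}")
    conn = sqlite3.connect(path)
    # page_size only takes effect on a new, empty database (or after VACUUM),
    # so it is set first.
    conn.execute(f"PRAGMA page_size = {int(page_size)}")
    conn.execute(f"PRAGMA journal_mode = {journal_mode}")
    conn.execute(f"PRAGMA synchronous = {int(synchronous)}")
    if cache_size is not None:
        # A positive cache_size is a number of pages, not bytes.
        conn.execute(f"PRAGMA cache_size = {int(cache_size)}")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = open_tuned(os.path.join(tmp, "urls.db"),
                      page_size=65536, cache_size=8192)
    conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT NOT NULL)")
    # Read the settings back to confirm what SQLite actually applied.
    settings = {
        name: conn.execute(f"PRAGMA {name}").fetchone()[0]
        for name in ("page_size", "journal_mode", "synchronous", "cache_size")
    }
    conn.close()  # close before the temporary directory is removed

print(settings)
```

Reading the values back after setting them is a cheap way to catch a setting that SQLite did not accept.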
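I searched online and found https://adamyork.com/2017/07/02/fast-database-inserts-with-python-3-6-and-sqlite/ where the system is much slower than mine, yet the performance is still very good. Two things stood out from that link:
- The table in the link has more columns than mine.
- The database file did not grow to 3.5 times the original size.
I have shared the Python code and files here: https://github.com/ksinghgithub/python_sqlite
Can anyone guide me in optimizing this code? Thanks.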
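Environment:
- Windows 10 Pro on an i5-6300U with 20 GB RAM and a 512 GB SSD.
- Python 3.7.0
Edit 1: A new performance chart, based on the feedback received about the UNIQUE constraint and on my experiments with the cache_size value.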
self.db.execute('CREATE TABLE blacklist (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL UNIQUE)')
<table>
<tr>
<th>Configuration</th>
<th>Action</th>
<th>Time</th>
<th>Notes</th>
</tr>
<tr><td>50,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = 8192</td><td>REMOVE UNIQUE FROM URL</td><td>0:00:18.011823</td><td>Size reduced to 196MB from 350MB</td></tr>
<tr><td>50,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = default</td><td>REMOVE UNIQUE FROM URL</td><td>0:00:25.692283</td><td>Size reduced to 196MB from 350MB</td></tr>
<tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535</td><td></td><td>0:07:13.402985</td><td></td></tr>
<tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = 4096</td><td></td><td>0:04:47.624909</td><td></td></tr>
<tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = 8192</td><td></td><td>0:03:32.473927</td><td></td></tr>
<tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = 8192</td><td>REMOVE UNIQUE FROM URL</td><td>0:00:17.927050</td><td>Size reduced to 196MB from 350MB</td></tr>
<tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = default</td><td>REMOVE UNIQUE FROM URL</td><td>0:00:21.804679</td><td>Size reduced to 196MB from 350MB</td></tr>
<tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = default</td><td>REMOVE UNIQUE FROM URL & ID</td><td>0:00:14.062386</td><td>Size reduced to 134MB from 350MB</td></tr>
<tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535 cache_size = default</td><td>REMOVE UNIQUE FROM URL & DELETE ID</td><td>0:00:11.961004</td><td>Size reduced to 134MB from 350MB</td></tr>
</table>
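One way to see why removing UNIQUE changes both the timing and the file size is to list what SQLite creates for each schema. The sketch below is an illustration under assumptions, not part of the shared code: the helper `schema_objects` is hypothetical, and it uses an in-memory database with the CREATE TABLE statement from this edit, once with and once without UNIQUE.

```python
import sqlite3


def schema_objects(create_sql):
    """Return (type, name) for every object SQLite records for the schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_sql)
    rows = conn.execute(
        "SELECT type, name FROM sqlite_master ORDER BY name"
    ).fetchall()
    conn.close()
    return rows


with_unique = schema_objects(
    "CREATE TABLE blacklist ("
    "id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, "
    "url TEXT NOT NULL UNIQUE)"
)
without_unique = schema_objects(
    "CREATE TABLE blacklist ("
    "id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, "
    "url TEXT NOT NULL)"
)

print("with UNIQUE:   ", with_unique)
print("without UNIQUE:", without_unique)
```

With UNIQUE on the url column, the listing includes an automatically created index on url, which stores a second copy of every URL and has to be updated on each insert. AUTOINCREMENT also adds the internal `sqlite_sequence` table in both cases.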
[Discussion]: