How to open a huge Excel file efficiently: answers

【Question title】: How to open a huge Excel file efficiently
【Posted】: 2018-03-04 13:47:52
【Question】:

I have a 150MB one-sheet Excel file that takes about 7 minutes to open on a very powerful machine using the following:

# using python
import xlrd
wb = xlrd.open_workbook(file)
sh = wb.sheet_by_index(0)

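Is there any way to open the Excel file quicker? I'm open to even very outlandish suggestions (such as Hadoop, Spark, C, Java, etc.). Ideally I'm looking for a way to open the file in under 30 seconds, if that's not a pipe dream. Also, the example above uses Python, but it doesn't have to be Python.

Note: this is an Excel file from a client. It cannot be converted into any other format before we receive it. It is not our file.

Update: an answer with a working example that opens the following 200MB Excel file in under 30 seconds will be rewarded with the bounty: https://drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view?usp=sharing. This file should contain strings (col 1), dates (col 9), and numbers (col 11).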

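【Comments】:

Is there any reason to keep this much data in Excel rather than in a database?
A 150 MB XL workbook is not that large these days. To open it efficiently in XL, first set Calculation to Manual and ForceFullCalculation to True; then, if you want to use Python, follow the suggestion of saving it as CSV. On a powerful multi-core machine the whole process should take much less than 7 minutes.
At the end of the day, an Excel file is a folder containing a bunch of XML documents. Can you extract the XML from the file and parse that?
How long does Excel itself take to open it? Just for reference.
I think it is also important to know what you want to do with the data once it is loaded. A loading method may not be the best in terms of raw disk performance, yet be better in terms of how you access the data afterwards.

Tags: java c# python c++ excel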

【Solution 1】:

Have you tried loading the worksheet on demand, which has been available since xlrd version 0.7.1?

To do so, pass on_demand=True to open_workbook().

xlrd.open_workbook(filename=None, logfile=sys.stdout, verbosity=0, use_mmap=1, file_contents=None, encoding_override=None, formatting_info=False, on_demand=False, ragged_rows=False)

Other potential Python solutions I found for reading xlsx files:

Read the raw XML in "xl/sharedStrings.xml" and "xl/worksheets/sheet1.xml".
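A minimal sketch of that raw-XML route, using only the Python standard library. It assumes the workbook keeps its first sheet at xl/worksheets/sheet1.xml and, optionally, a shared-string table at xl/sharedStrings.xml, as named above. The helper names read_shared_strings and iter_rows are made up for illustration, and no timing is claimed for it.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML parts inside an .xlsx archive.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def read_shared_strings(zf):
    """Return the shared-string table as a list (empty if the part is absent)."""
    if "xl/sharedStrings.xml" not in zf.namelist():
        return []
    strings = []
    with zf.open("xl/sharedStrings.xml") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "si":
                # One <si> may hold several <t> runs (rich text); join them.
                strings.append("".join(t.text or "" for t in elem.iter(NS + "t")))
                elem.clear()  # drop the children we no longer need
    return strings


def iter_rows(path, sheet="xl/worksheets/sheet1.xml"):
    """Yield each row as a list of (cell_reference, value) tuples.

    Values stay as text; only shared-string indexes (t="s") are resolved.
    Cells without a <v> child (empty or inline strings) are skipped.
    """
    with zipfile.ZipFile(path) as zf:
        shared = read_shared_strings(zf)
        with zf.open(sheet) as f:
            for _, elem in ET.iterparse(f, events=("end",)):
                if elem.tag != NS + "row":
                    continue
                row = []
                for c in elem.iter(NS + "c"):
                    v = c.find(NS + "v")
                    if v is None or v.text is None:
                        continue
                    value = v.text
                    if c.get("t") == "s":
                        value = shared[int(value)]
                    row.append((c.get("r"), value))
                yield row
                elem.clear()  # keep memory roughly bounded per row
```

Something like `for row in iter_rows("large_file.xlsx"): ...` would then stream the sheet row by row. Deciding whether a numeric cell is a date still requires looking at the cell's style, which this sketch does not do.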

Try the openpyxl library's Read Only mode, which claims to also be optimized for memory usage with large files.

from openpyxl import load_workbook
wb = load_workbook(filename='large_file.xlsx', read_only=True)
ws = wb['big_data']

for row in ws.rows:
    for cell in row:
        print(cell.value)

If you are running on Windows, you could use PyWin32 and 'Excel.Application'.

import time
import win32com.client as win32
def excel():
   xl = win32.gencache.EnsureDispatch('Excel.Application')
   ss = xl.Workbooks.Add()
...

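【Comments】:

This seems to load only the necessary sheets, but in the case above there is only one sheet to load, so on_demand saved no time (I tested it).

【Solution 2】:

Most programming languages that work with Office products go through some middle layer, and that is usually where the bottleneck is; good examples are the PIAs/Interop and the Open XML SDK.

One way to get at the data at a lower level (bypassing the middle layer) is to use a driver.

150MB one-sheet Excel file that takes about 7 minutes.

The best I could do was a 130MB file in 135 seconds, roughly 3 times faster: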

using System.Data;
using System.Data.OleDb;
using System.Diagnostics;

Stopwatch sw = new Stopwatch();
sw.Start();

DataSet excelDataSet = new DataSet();

string filePath = @"c:\temp\BigBook.xlsx";

// For .XLSXs we use =Microsoft.ACE.OLEDB.12.0;, for .XLS we'd use Microsoft.Jet.OLEDB.4.0; with  "';Extended Properties=\"Excel 8.0;HDR=YES;\"";
string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties=\"Excel 12.0;HDR=YES;\"";

using (OleDbConnection conn = new OleDbConnection(connectionString))
{
    conn.Open();
    OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter
    ("select * from [Sheet1$]", conn);
    objDA.Fill(excelDataSet);
    //dataGridView1.DataSource = excelDataSet.Tables[0];
}
sw.Stop();
Debug.Print("Load XLSX tool: " + sw.ElapsedMilliseconds + " millisecs. Records = "  + excelDataSet.Tables[0].Rows.Count);

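Win 7 x64, Intel i5, 2.3 GHz, 8GB RAM, 250GB SSD.

If I may also recommend a hardware solution: if you are on a standard HDD, try an SSD.

Note: I couldn't download your sample Excel spreadsheet because I'm behind a corporate firewall.

PS. See MSDN - Fastest Way to import xlsx files with 200 MB of Data; the consensus there is that OleDB is the fastest.

PS 2. Here is how you can do it with Python: http://code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/

【Comments】:

【Solution 3】:

I managed to read the file in about 30 seconds using .NET Core and the Open XML SDK.

The following example returns a list of objects containing all rows and cells with matching types; it supports date, number, and text cells. The project is available here: https://github.com/xferaa/BigSpreadSheetExample/ (it should work on Windows, Linux, and Mac OS, and it does not require Excel or any Excel component to be installed).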

public List<List<object>> ParseSpreadSheet()
{
    List<List<object>> rows = new List<List<object>>();

    using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
    {
        WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
        WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();

        OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);

        Dictionary<int, string> sharedStringCache = new Dictionary<int, string>();

        int i = 0;
        foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements)
        {
            sharedStringCache.Add(i++, el.InnerText);
        }

        while (reader.Read())
        {
            if(reader.ElementType == typeof(Row))
            {
                reader.ReadFirstChild();

                List<object> cells = new List<object>();

                do
                {
                    if (reader.ElementType == typeof(Cell))
                    {
                        Cell c = (Cell)reader.LoadCurrentElement();

                        if (c == null || c.DataType == null || !c.DataType.HasValue)
                            continue;

                        object value;

                        switch(c.DataType.Value)
                        {
                            case CellValues.Boolean:
                                value = bool.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Date:
                                value = DateTime.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Number:
                                value = double.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.InlineString:
                            case CellValues.String:
                                value = c.CellValue.InnerText;
                                break;
                            case CellValues.SharedString:
                                value = sharedStringCache[int.Parse(c.CellValue.InnerText)];
                                break;
                            default:
                                continue;
                        }

                        if (value != null)
                            cells.Add(value);
                    }

                } while (reader.ReadNextSibling());

                if (cells.Any())
                    rows.Add(cells);
            }
        }
    }

    return rows;
}

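I ran the program on a three-year-old laptop with an SSD drive, 8GB of RAM, and an Intel Core i7-4710 CPU @ 2.50GHz (two cores) running Windows 10 64-bit.

Note that although opening and parsing the whole file as strings takes a bit less than 30 seconds, when using objects, as in the example from my last edit, the time goes up to almost 50 seconds on my crappy laptop. You will probably get closer to 30 seconds on your server with Linux.

The trick was to use the SAX approach, as explained here: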

https://msdn.microsoft.com/en-us/library/office/gg575571.aspx
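For comparison, the same SAX idea can be sketched in Python with the standard library's xml.sax. This is not the Open XML SDK code above, only an illustration of event-driven parsing that never builds the whole tree. RowHandler and scan_sheet are hypothetical names, and the sketch assumes the sheet XML uses unprefixed row and v elements, which is what Excel normally writes.

```python
import xml.sax


class RowHandler(xml.sax.ContentHandler):
    """Collects the raw <v> text of each row and hands it to a callback."""

    def __init__(self, on_row):
        super().__init__()
        self.on_row = on_row
        self._row = None    # values of the row being read, or None
        self._text = None   # text chunks of the <v> being read, or None

    def startElement(self, name, attrs):
        if name == "row":
            self._row = []
        elif name == "v":
            self._text = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "v" and self._row is not None:
            self._row.append("".join(self._text))
            self._text = None
        elif name == "row":
            self.on_row(self._row)
            self._row = None


def scan_sheet(stream, on_row):
    """Parse an already-unzipped sheet XML stream, calling on_row per row."""
    xml.sax.parse(stream, RowHandler(on_row))
```

Values arrive as strings (shared-string indexes included); converting them to typed objects is the extra step that costs the additional time mentioned above.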

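【Comments】:

Very neat approach, thanks. What do you mean by using objects? Is that anything that is not a string, or what?
No problem, it was fun ;) The value of each cell in the Excel file (stored as XML text) is converted into an object according to the cell's data type, so you end up with a list of lists of objects. The first list contains every row, and each row contains a second list with every cell. It is a bit slower, but it pays off when you want to do something with the data, such as saving it to a database. For example, you could use an Entity Framework bulk import to import all the objects at once.
This is probably the fastest you can get without manually unzipping and parsing the XML, which would be a lot of code and hard to maintain.
I think so too, but people seem to lean towards the OleDb solution, which is at least 50% slower and, IMHO, not as clean.
@Isma case CellValues.Date: is an Office 2010-only data type. In the meantime, dates are stored as shared strings with a corresponding double value that is converted as an OADate with a 1900 or 1904 epoch; no data type exists for those cells. Phew. You need to check the cell's StyleIndex to tell whether to format it as a date or a number (or a percentage, etc.). So it is actually a lot of work.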
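To make the epoch remark concrete, here is a small sketch of the numeric-to-date conversion only. It assumes you already know, from the cell's style, that the number is a date, and whether the workbook uses the 1900 or the 1904 date system; excel_serial_to_datetime is a made-up helper name.

```python
from datetime import datetime, timedelta

# In the 1900 date system, serial 1 is 1900-01-01, but Excel also counts a
# non-existent 1900-02-29, so 1899-12-30 is the usual base for serials >= 61.
EPOCH_1900 = datetime(1899, 12, 30)
# In the 1904 date system, serial 0 is 1904-01-01.
EPOCH_1904 = datetime(1904, 1, 1)


def excel_serial_to_datetime(serial, date1904=False):
    """Convert an Excel serial number (days, with a day fraction) to datetime."""
    epoch = EPOCH_1904 if date1904 else EPOCH_1900
    days = int(serial)
    # Round the day fraction to whole seconds to avoid float noise.
    seconds = round((serial - days) * 86400)
    return epoch + timedelta(days=days, seconds=seconds)
```

For example, `excel_serial_to_datetime(43163.5)` gives noon on 2018-03-04 in the 1900 system. Serials below 61 in the 1900 system come out one day off with this base, because of the fictitious leap day.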

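【Solution 4】:

Python's Pandas library could be used to hold and process your data, but using it to load the .xlsx file directly, e.g. with read_excel(), will be quite slow.

One approach is to use Python to automate the conversion of your file into CSV using Excel itself, and then use Pandas to load the resulting CSV file with read_csv(). This gives you a good speed-up, but not to under 30 seconds: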

import win32com.client as win32        
import pandas as pd    
from datetime import datetime    

print ("Starting")
start = datetime.now()

# Use Excel to load the xlsx file and save it in csv format
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(r'c:\full path\BigSpreadsheet.xlsx')
excel.DisplayAlerts = False
wb.DoNotPromptForConvert = True
wb.CheckCompatibility = False

print('Saving')
wb.SaveAs(r'c:\full path\temp.csv', FileFormat=6, ConflictResolution=2) 
excel.Application.Quit()

# Use Pandas to load the resulting CSV file
print('Loading CSV')
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str)

print(df.shape)
print("Done", datetime.now() - start)

Column types
The types of your columns can be specified by passing dtype, converters, and parse_dates:

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True)

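You should also specify infer_datetime_format=True, as this will greatly speed up the date conversion (in pandas 2.0 and later the argument is deprecated, because a strict version of this inference is the default).

infer_datetime_format: boolean, default False

If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

If your dates are in the form DD/MM/YYYY, also add dayfirst=True.

Selective columns
If you actually only need to work on columns 1, 9, and 11, you could further reduce resources by specifying usecols=[0, 8, 10] as follows: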

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10])

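The resulting dataframe would then contain only those 3 columns of data.

RAM drive
Using a RAM drive to store the temporary CSV file would further speed up the load time.

Note: this assumes you are using a Windows PC with Excel available.

【Comments】:

Thanks for the answer. Could you time it with the sample Excel file to see how it performs?
It takes about 60 seconds on my machine. In comparison, using just xlrd I get 7 minutes 30 seconds.
Using Python Pandas is an excellent way to convert Excel to CSV. However, I was able to shave off more than 30 seconds on a slow computer by running a VBScript from Python (from 148 seconds down to 113 seconds, i.stack.imgur.com/EzbvR.jpg). The OP may not like that, though, so I am leaving it in the comments.

【Solution 5】:

Well, if your Excel file is going to be as simple as your example (https://drive.google.com/file/d/0B_CXvCTOo7_2UVZxbnpRaEVnaFk/view?usp=sharing), you can try to open the file as a zip file and read each XML part directly:

Intel i5 4460, 12 GB RAM, SSD Samsung EVO PRO.

If you have a lot of memory: this code needs a lot of memory, but it takes 20~25 seconds. (It needs the parameter -Xmx7g.)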

package com.devsaki.opensimpleexcel;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;

public class Multithread {

    public static final char CHAR_END = (char) -1;

    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        String excelFile = "C:/Downloads/BigSpreadsheetAllTypes.xlsx";
        ZipFile zipFile = new ZipFile(excelFile);
        long init = System.currentTimeMillis();
        ExecutorService executor = Executors.newFixedThreadPool(4);
        char[] sheet1 = readEntry(zipFile, "xl/worksheets/sheet1.xml").toCharArray();
        Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(new CharReader(sheet1), executor));
        char[] sharedString = readEntry(zipFile, "xl/sharedStrings.xml").toCharArray();
        Future<String[]> futureWords = executor.submit(() -> processSharedStrings(new CharReader(sharedString)));

        Object[][] sheet = futureSheet1.get();
        String[] words = futureWords.get();

        executor.shutdown();

        long end = System.currentTimeMillis();
        System.out.println("only read: " + (end - init) / 1000);

        ///Doing somethin with the file::Saving as csv
        init = System.currentTimeMillis();
        try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
            for (Object[] rows : sheet) {
                for (Object cell : rows) {
                    if (cell != null) {
                        if (cell instanceof Integer) {
                            writer.append(words[(Integer) cell]);
                        } else if (cell instanceof String) {
                            writer.append(toDate(Double.parseDouble(cell.toString())));
                        } else {
                            writer.append(cell.toString()); //Probably a number
                        }
                    }
                    writer.append(";");
                }
                writer.append("\n");
            }
        }
        end = System.currentTimeMillis();
        System.out.println("Main saving to csv: " + (end - init) / 1000);
    }

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
    private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);

    //The number in excel is from 1900-jan-1, so every number time that you get, you have to sum to that date
    public static String toDate(double s) {
        return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
    }

    public static String readEntry(ZipFile zipFile, String entry) throws IOException {
        System.out.println("Initialing readEntry " + entry);
        long init = System.currentTimeMillis();
        String result = null;

        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            br.readLine();
            result = br.readLine();
        }

        long end = System.currentTimeMillis();
        System.out.println("readEntry '" + entry + "': " + (end - init) / 1000);
        return result;
    }


    public static String[] processSharedStrings(CharReader br) throws IOException {
        System.out.println("Initialing processSharedStrings");
        long init = System.currentTimeMillis();
        String[] words = null;
        char[] wordCount = "Count=\"".toCharArray();
        char[] token = "<t>".toCharArray();
        String uniqueCount = extractNextValue(br, wordCount, '"');
        words = new String[Integer.parseInt(uniqueCount)];
        String nextWord;
        int currentIndex = 0;
        while ((nextWord = extractNextValue(br, token, '<')) != null) {
            words[currentIndex++] = nextWord;
            br.skip(11); //you can skip at least 11 chars "/t></si><si>"
        }
        long end = System.currentTimeMillis();
        System.out.println("SharedStrings: " + (end - init) / 1000);
        return words;
    }


    public static Object[][] processSheet1(CharReader br, ExecutorService executorService) throws IOException, ExecutionException, InterruptedException {
        System.out.println("Initialing processSheet1");
        long init = System.currentTimeMillis();
        char[] dimensionToken = "dimension ref=\"".toCharArray();
        String dimension = extractNextValue(br, dimensionToken, '"');
        int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
        br.skip(30); //Between dimension and next tag c exists more or less 30 chars
        Object[][] result = new Object[sizes[0]][sizes[1]];

        int parallelProcess = 8;
        int currentIndex = br.currentIndex;
        CharReader[] charReaders = new CharReader[parallelProcess];
        int totalChars = Math.round(br.chars.length / parallelProcess);
        for (int i = 0; i < parallelProcess; i++) {
            int endIndex = currentIndex + totalChars;
            charReaders[i] = new CharReader(br.chars, currentIndex, endIndex, i);
            currentIndex = endIndex;
        }
        Future[] futures = new Future[parallelProcess];
        for (int i = charReaders.length - 1; i >= 0; i--) {
            final int j = i;
            futures[i] = executorService.submit(() -> inParallelProcess(charReaders[j], j == 0 ? null : charReaders[j - 1], result));
        }
        for (Future future : futures) {
            future.get();
        }

        long end = System.currentTimeMillis();
        System.out.println("Sheet1: " + (end - init) / 1000);
        return result;
    }

    public static void inParallelProcess(CharReader br, CharReader back, Object[][] result) {
        System.out.println("Initialing inParallelProcess : " + br.identifier);

        char[] tokenOpenC = "<c r=\"".toCharArray();
        char[] tokenOpenV = "<v>".toCharArray();

        char[] tokenAttributS = " s=\"".toCharArray();
        char[] tokenAttributT = " t=\"".toCharArray();

        String v;
        int firstCurrentIndex = br.currentIndex;
        boolean first = true;

        while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
            if (first && back != null) {
                int sum = br.currentIndex - firstCurrentIndex - tokenOpenC.length - v.length() - 1;
                first = false;
                System.out.println("Adding to : " + back.identifier + " From : " + br.identifier);
                back.plusLength(sum);
            }
            int[] indexes = extractSizeFromDimention(v);

            int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
            char type = 's'; //3 types: number (n), string (s) and date (d)
            if (s == 0) { // Token S = number or date
                char read = br.read();
                if (read == '1') {
                    type = 'n';
                } else {
                    type = 'd';
                }
            } else if (s == -1) {
                type = 'n';
            }
            String c = extractNextValue(br, tokenOpenV, '<');
            Object value = null;
            switch (type) {
                case 'n':
                    value = Double.parseDouble(c);
                    break;
                case 's':
                    try {
                        value = Integer.parseInt(c);
                    } catch (Exception ex) {
                        System.out.println("Identifier Error : " + br.identifier);
                    }
                    break;
                case 'd':
                    value = c.toString();
                    break;
            }
            result[indexes[0] - 1][indexes[1] - 1] = value;
            br.skip(7); ///v></c>
        }
    }

    static class CharReader {
        char[] chars;
        int currentIndex;
        int length;

        int identifier;

        public CharReader(char[] chars) {
            this.chars = chars;
            this.length = chars.length;
        }

        public CharReader(char[] chars, int currentIndex, int length, int identifier) {
            this.chars = chars;
            this.currentIndex = currentIndex;
            if (length > chars.length) {
                this.length = chars.length;
            } else {
                this.length = length;
            }
            this.identifier = identifier;
        }

        public void plusLength(int n) {
            if (this.length + n <= chars.length) {
                this.length += n;
            }
        }

        public char read() {
            if (currentIndex >= length) {
                return CHAR_END;
            }
            return chars[currentIndex++];
        }

        public void skip(int n) {
            currentIndex += n;
        }
    }


    public static int[] extractSizeFromDimention(String dimention) {
        StringBuilder sb = new StringBuilder();
        int columns = 0;
        int rows = 0;
        for (char c : dimention.toCharArray()) {
            if (columns == 0) {
                if (Character.isDigit(c)) {
                    columns = convertExcelIndex(sb.toString());
                    sb = new StringBuilder();
                }
            }
            sb.append(c);
        }
        rows = Integer.parseInt(sb.toString());
        return new int[]{rows, columns};
    }

    public static int foundNextTokens(CharReader br, char until, char[]... tokens) {
        char character;
        int[] indexes = new int[tokens.length];
        while ((character = br.read()) != CHAR_END) {
            if (character == until) {
                break;
            }
            for (int i = 0; i < indexes.length; i++) {
                if (tokens[i][indexes[i]] == character) {
                    indexes[i]++;
                    if (indexes[i] == tokens[i].length) {
                        return i;
                    }
                } else {
                    indexes[i] = 0;
                }
            }
        }

        return -1;
    }

    public static String extractNextValue(CharReader br, char[] token, char until) {
        char character;
        StringBuilder sb = new StringBuilder();
        int index = 0;

        while ((character = br.read()) != CHAR_END) {
            if (index == token.length) {
                if (character == until) {
                    return sb.toString();
                } else {
                    sb.append(character);
                }
            } else {
                if (token[index] == character) {
                    index++;
                } else {
                    index = 0;
                }
            }
        }
        return null;
    }

    public static int convertExcelIndex(String index) {
        int result = 0;
        for (char c : index.toCharArray()) {
            result = result * 26 + ((int) c - (int) 'A' + 1);
        }
        return result;
    }
}

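Old answer (does not need the -Xmx7g parameter, so it uses less memory): opening and reading the example file (200MB) takes about 35 seconds with an HDD, and a little less (30 seconds) with an SSD.

Here is the code: https://github.com/csaki/OpenSimpleExcelFast.git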

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;

public class Launcher {

    public static final char CHAR_END = (char) -1;

    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        long init = System.currentTimeMillis();
        String excelFile = "D:/Downloads/BigSpreadsheet.xlsx";
        ZipFile zipFile = new ZipFile(excelFile);

        ExecutorService executor = Executors.newFixedThreadPool(4);
        Future<String[]> futureWords = executor.submit(() -> processSharedStrings(zipFile));
        Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(zipFile));
        String[] words = futureWords.get();
        Object[][] sheet1 = futureSheet1.get();
        executor.shutdown();

        long end = System.currentTimeMillis();
        System.out.println("Main only open and read: " + (end - init) / 1000);


        ///Doing somethin with the file::Saving as csv
        init = System.currentTimeMillis();
        try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
            for (Object[] rows : sheet1) {
                for (Object cell : rows) {
                    if (cell != null) {
                        if (cell instanceof Integer) {
                            writer.append(words[(Integer) cell]);
                        } else if (cell instanceof String) {
                            writer.append(toDate(Double.parseDouble(cell.toString())));
                        } else {
                            writer.append(cell.toString()); //Probably a number
                        }
                    }
                    writer.append(";");
                }
                writer.append("\n");
            }
        }
        end = System.currentTimeMillis();
        System.out.println("Main saving to csv: " + (end - init) / 1000);
    }

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
    private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);

    //The number in excel is from 1900-jan-1, so every number time that you get, you have to sum to that date
    public static String toDate(double s) {
        return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
    }

    public static Object[][] processSheet1(ZipFile zipFile) throws IOException {
        String entry = "xl/worksheets/sheet1.xml";
        Object[][] result = null;
        char[] dimensionToken = "dimension ref=\"".toCharArray();
        char[] tokenOpenC = "<c r=\"".toCharArray();
        char[] tokenOpenV = "<v>".toCharArray();

        char[] tokenAttributS = " s=\"".toCharArray();
        char[] tokenAttributT = " t=\"".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String dimension = extractNextValue(br, dimensionToken, '"');
            int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
            br.skip(30); //Between dimension and next tag c exists more or less 30 chars
            result = new Object[sizes[0]][sizes[1]];
            String v;
            while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
                int[] indexes = extractSizeFromDimention(v);

                int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
                char type = 's'; //3 types: number (n), string (s) and date (d)
                if (s == 0) { // Token S = number or date
                    char read = (char) br.read();
                    if (read == '1') {
                        type = 'n';
                    } else {
                        type = 'd';
                    }
                } else if (s == -1) {
                    type = 'n';
                }
                String c = extractNextValue(br, tokenOpenV, '<');
                Object value = null;
                switch (type) {
                    case 'n':
                        value = Double.parseDouble(c);
                        break;
                    case 's':
                        value = Integer.parseInt(c);
                        break;
                    case 'd':
                        value = c.toString();
                        break;
                }
                result[indexes[0] - 1][indexes[1] - 1] = value;
                br.skip(7); ///v></c>
            }
        }
        return result;
    }

    public static int[] extractSizeFromDimention(String dimention) {
        StringBuilder sb = new StringBuilder();
        int columns = 0;
        int rows = 0;
        for (char c : dimention.toCharArray()) {
            if (columns == 0) {
                if (Character.isDigit(c)) {
                    columns = convertExcelIndex(sb.toString());
                    sb = new StringBuilder();
                }
            }
            sb.append(c);
        }
        rows = Integer.parseInt(sb.toString());
        return new int[]{rows, columns};
    }

    public static String[] processSharedStrings(ZipFile zipFile) throws IOException {
        String entry = "xl/sharedStrings.xml";
        String[] words = null;
        char[] wordCount = "Count=\"".toCharArray();
        char[] token = "<t>".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String uniqueCount = extractNextValue(br, wordCount, '"');
            words = new String[Integer.parseInt(uniqueCount)];
            String nextWord;
            int currentIndex = 0;
            while ((nextWord = extractNextValue(br, token, '<')) != null) {
                words[currentIndex++] = nextWord;
                br.skip(11); //you can skip at least 11 chars "/t></si><si>"
            }
        }
        return words;
    }

    public static int foundNextTokens(BufferedReader br, char until, char[]... tokens) throws IOException {
        char character;
        int[] indexes = new int[tokens.length];
        while ((character = (char) br.read()) != CHAR_END) {
            if (character == until) {
                break;
            }
            for (int i = 0; i < indexes.length; i++) {
                if (tokens[i][indexes[i]] == character) {
                    indexes[i]++;
                    if (indexes[i] == tokens[i].length) {
                        return i;
                    }
                } else {
                    indexes[i] = 0;
                }
            }
        }

        return -1;
    }

    public static String extractNextValue(BufferedReader br, char[] token, char until) throws IOException {
        char character;
        StringBuilder sb = new StringBuilder();
        int index = 0;

        while ((character = (char) br.read()) != CHAR_END) {
            if (index == token.length) {
                if (character == until) {
                    return sb.toString();
                } else {
                    sb.append(character);
                }
            } else {
                if (token[index] == character) {
                    index++;
                } else {
                    index = 0;
                }
            }
        }
        return null;
    }

    public static int convertExcelIndex(String index) {
        int result = 0;
        for (char c : index.toCharArray()) {
            result = result * 26 + ((int) c - (int) 'A' + 1);
        }
        return result;
    }

}

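【Comments】:

Thanks, I will test it out. Would you mind putting the GitHub code for the main file, Launcher.java, in the answer itself? I think that might be more helpful to people who want to look at it in the future.
Well, I checked what happens when a cell is not a string, and the code does not work; but it is not hard to implement that (just check the attribute t in the c tag: "s" => string). Cell styles should not break the code.
Got it. I think the three types we need are (1) string; (2) number; (3) date. I know Excel stores dates as numbers, but we need some way to know that a cell is a date. What would the timing be when allowing for (2) and (3)?
OK, I added support for dates and numbers.
Great, I will also update the question/file so that it includes number and date fields, to make testing easy.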
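For readers who do not want to trace the Java, the cell-reference handling in the code above (convertExcelIndex and extractSizeFromDimention) boils down to the following. This is a Python paraphrase under the same assumption of plain references such as "K200000" with no "$" markers; the function names are mine.

```python
def column_index(letters):
    """Convert column letters to a 1-based index: 'A' -> 1, 'Z' -> 26, 'AA' -> 27."""
    result = 0
    for ch in letters.upper():
        # Bijective base 26: each letter is a digit in the range 1..26.
        result = result * 26 + (ord(ch) - ord("A") + 1)
    return result


def split_reference(ref):
    """Split a reference such as 'AB12' into (row, column), both 1-based."""
    i = 0
    while i < len(ref) and ref[i].isalpha():
        i += 1
    return int(ref[i:]), column_index(ref[:i])
```

The Java code subtracts 1 from both numbers to index its Object[][] result array; the second half of the sheet's dimension ref (for example the "K200000" in "A1:K200000") gives the array size the same way.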

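【Solution 6】:

It looks like this is hardly achievable in Python at all. If we unpack the sheet data file, it takes 30 seconds just to pass it through a C-based iterative SAX parser (using lxml, a very fast wrapper over libxml2):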

from __future__ import print_function

from lxml import etree
import time


start_ts = time.time()

# Binary mode lets the parser handle the file's encoding itself.
for data in etree.iterparse(open('xl/worksheets/sheet1.xml', 'rb'), events=('start',),
                            collect_ids=False, resolve_entities=False,
                            huge_tree=True):
    pass

print(time.time() - start_ts)

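Sample output: 27.2134890556

By the way, Excel itself needs about 40 seconds to load the workbook.

【Comments】:

【Solution 7】:

I'm using a Dell Precision T1700 workstation and C#, and I was able to open the file and read its contents in about 24 seconds using just the standard code for opening a workbook through the interop services. Here is my code, which uses a reference to the Microsoft Excel 15.0 Object Library.

My using statements: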

using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;

Code to open and read the workbook:

public partial class MainWindow : Window {
    public MainWindow() {
        InitializeComponent();

        Excel.Application xlApp;
        Excel.Workbook wb;
        Excel.Worksheet ws;

        xlApp = new Excel.Application();
        xlApp.Visible = false;
        xlApp.ScreenUpdating = false;

        wb = xlApp.Workbooks.Open(@"Desired Path of workbook\Copy of BigSpreadsheet.xlsx");

        ws = wb.Sheets["Sheet1"];

        //string rng = ws.get_Range("A1").Value;
        MessageBox.Show(ws.get_Range("A1").Value);

        Marshal.FinalReleaseComObject(ws);

        wb.Close();
        Marshal.FinalReleaseComObject(wb);

        xlApp.Quit();
        Marshal.FinalReleaseComObject(xlApp);

        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}

【Comments】:

【Solution 8】:

The C# and OLE solutions still have a bottleneck, so I tested with C++ and ADO.

_bstr_t connStr(makeConnStr(excelFile, header).c_str());

TESTHR(pRec.CreateInstance(__uuidof(Recordset)));       
TESTHR(pRec->Open(sqlSelectSheet(connStr, sheetIndex).c_str(), connStr, adOpenStatic, adLockOptimistic, adCmdText));

while(!pRec->adoEOF)
{
    for(long i = 0; i < pRec->Fields->GetCount(); ++i)
    {   
        _variant_t v = pRec->Fields->GetItem(i)->Value;
        if(v.vt == VT_R8)
            num[i] = v.dblVal;
        if(v.vt == VT_BSTR)
            str[i] = v.bstrVal;          
        ++cellCount;
    }                                    
    pRec->MoveNext();
}

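On an i5-4460 machine with an HDD, I found that 500 thousand cells in xls take 1.5 s, while the same data in xlsx takes 2.829 s. So it should be possible to process your data in under 30 s.

If you really need under 30 s, use a RAM drive to reduce file I/O. It will significantly improve your processing time. I cannot download your data to test it, so please tell me the result.

【Comments】:

Thank you. So what would be the entire code to "read" the xls(x) file from above and get it into memory?
@David542 here is the entire project
This still uses ADO internally, so the results will not be much faster than the ADO.NET version, which by the way got most of the votes but is not the fastest/cleanest solution.
@Isma There are two bottlenecks: the middle layer and file I/O. The fastest version depends on assumptions. I think the OP will eventually need a RAM drive to reduce the file I/O time.

【Solution 9】:

I have created a sample Java program that is able to load the file in about 40 seconds on my laptop (Intel i7, 4 cores, 16 GB RAM).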

https://github.com/skadyan/largefile

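This program uses the Apache POI library to load the .xlsx file through the XSSF SAX API.

An implementation of the callback interface com.stackoverlfow.largefile.RecordHandler can be used to process the data loaded from the Excel file. The interface defines only one method, which takes three arguments:

sheetname: String, the Excel sheet name
row number: int, the row number of the data
data map: Map, from Excel cell reference to the Excel-formatted cell value

The class com.stackoverlfow.largefile.Main demonstrates one basic implementation of this interface, which just prints the row number on the console.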
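The shape of that callback can be illustrated outside Java as well. The following is a hypothetical Python analogue, not the repository's code: RecordHandler, feed_rows, and print_row_number are names invented here, and the rows are assumed to arrive as (row number, data map) pairs from whatever parser sits underneath.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical counterpart of the one-method callback interface:
# (sheet name, row number, {cell reference: formatted value}) -> None
RecordHandler = Callable[[str, int, Dict[str, str]], None]


def feed_rows(sheet_name: str,
              rows: Iterable[Tuple[int, Dict[str, str]]],
              handler: RecordHandler) -> int:
    """Call handler once per row and return how many rows were delivered."""
    count = 0
    for row_number, data in rows:
        handler(sheet_name, row_number, data)
        count += 1
    return count


def print_row_number(sheet_name: str, row_number: int, data: Dict[str, str]) -> None:
    """Basic handler in the spirit of the Main class: print only the row number."""
    print(row_number)
```

The point of the design is that the parser never has to hold the whole sheet: each row is handed over and can be written to a database or discarded immediately.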

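Update

The woodstox parser seems to perform better than the standard SAXReader. (The code is updated in the repo.)

Also, to meet the desired performance requirement, you may consider re-implementing org.apache.poi...XSSFSheetXMLHandler. In your own implementation, string/text value handling can be made more efficient and unnecessary text formatting operations can be skipped.

【Comments】:

Actually, this is the exact implementation we currently use to parse the file, so we are looking for a solution faster than this.
To improve performance, I suggest you try a different XML parser (e.g. github.com/FasterXML/woodstox). I observed slightly better results with this parser. See the updated answer.

【Solution 10】:

I'd like to have more information about the system on which you are opening the file... anyway:

Look in your system for a Windows update named "Office File Validation Add-In ..."

If you have it, uninstall it.
The file should load much more quickly,
especially if it is loaded from a share.

【Comments】:

【Solution 11】:

Another thing that should improve load/operation time a lot is a RAMDrive.

Create a RAMDrive with enough space for your file plus 10%..20% extra space...
Copy the file to the RAMDrive...
Load the file from there... Depending on your drive and filesystem, the speed improvement should be huge...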
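The copy step is easy to script. Below is a rough sketch that assumes the RAM drive already exists and is mounted at some directory you pass in (a drive letter on Windows, or a tmpfs path on Linux); stage_on_ramdrive is a made-up helper, and the 20% headroom simply mirrors the rule of thumb above.

```python
import shutil
from pathlib import Path


def stage_on_ramdrive(src, ramdrive_dir, headroom=0.2):
    """Copy src into ramdrive_dir and return the path of the copy.

    Raises OSError if the target does not have room for the file plus
    the requested headroom (20% by default).
    """
    src = Path(src)
    target_dir = Path(ramdrive_dir)
    needed = int(src.stat().st_size * (1 + headroom))
    free = shutil.disk_usage(target_dir).free
    if free < needed:
        raise OSError(f"need {needed} bytes on {target_dir}, only {free} free")
    # copy2 also preserves timestamps, which some loaders use for caching.
    return Path(shutil.copy2(src, target_dir / src.name))
```

You would then point your loader at the returned path instead of the original file. Whether this helps depends on how much of the load time is really disk I/O rather than XML parsing.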

My favorite is the IMDisk toolkit
(https://sourceforge.net/projects/imdisk-toolkit/); it gives you a powerful command line for scripting everything...

I also recommend SoftPerfect RAM Disk
(http://www.majorgeeks.com/files/details/softperfect_ram_disk.html)

But that also depends on your OS...

【Comments】: