【Question title】:In Python 3, tabula.read_pdf raises TypeError: expected str, bytes or os.PathLike object, not builtin_function_or_method. How do I make it work?
【Posted】:2020-03-08 07:59:38
【Question description】:

I am running a scraping project with Python 3 in Jupyter Notebook on my server. For some reason, calling tabula.read_pdf (from tabula-py) fails with TypeError: expected str, bytes or os.PathLike object, not builtin_function_or_method. How do I make it work? I am passing an actual PDF file.

My failing code

import tabula
df = tabula.read_pdf("20200125-sitrep-5-2019-ncov.pdf", pages=all)

My error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-4f86b7402956> in <module>
----> 1 df = tabula.read_pdf("20200125-sitrep-5-2019-ncov.pdf", pages=all)

/usr/local/lib/python3.7/dist-packages/tabula/io.py in read_pdf(input_path, output_format,       encoding, java_options, pandas_options, multiple_tables, user_agent, **kwargs)
320 
321     try:
--> 322         output = _run(java_options, kwargs, path, encoding)
323     finally:
324         if temporary:

/usr/local/lib/python3.7/dist-packages/tabula/io.py in _run(java_options, options, path, encoding)
 83             stderr=subprocess.PIPE,
 84             stdin=subprocess.DEVNULL,
---> 85             check=True,
 86         )
 87         if result.stderr:

/usr/lib/python3.7/subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
470         kwargs['stderr'] = PIPE
471 
--> 472     with Popen(*popenargs, **kwargs) as process:
473         try:
474             stdout, stderr = process.communicate(input, timeout=timeout)

/usr/lib/python3.7/subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
773                                 c2pread, c2pwrite,
774                                 errread, errwrite,
--> 775                                 restore_signals, start_new_session)
776         except:
777             # Cleanup if the child failed starting.

/usr/lib/python3.7/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
1451                             errread, errwrite,
1452                             errpipe_read, errpipe_write,
-> 1453                             restore_signals, start_new_session, preexec_fn)
1454                     self._child_created = True
1455                 finally:

TypeError: expected str, bytes or os.PathLike object, not builtin_function_or_method

My PDF file is named 20200125-sitrep-5-2019-ncov.pdf. This is the PDF I scraped: https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200125-sitrep-5-2019-ncov.pdf?sfvrsn=429b143d_8

【Comments】:

    Tags: python-3.x pdf web-scraping jupyter-notebook


    【Solution 1】:

    I could not get Tabula to work on my server or in a virtual environment, so I decided to use another library called Camelot.

    Install Camelot

    pip install camelot-py
    

    Import Camelot

    import camelot
    

    My new code

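    # Parse the tables on page 3; process_background=True also picks up lines drawn in the background
    tables = camelot.read_pdf('20200125-sitrep-5-2019-ncov.pdf', pages='3', process_background=True)
    # Export every table to CSV, bundled into a ZIP archive
    tables.export('20200125-sitrep-5-2019-ncov.csv', f='csv', compress=True)
    tables[0]  # first table found
    tables[0].parsing_report  # quality metrics for that table, for example:
    # {
    #     'accuracy': 99.02,
    #     'whitespace': 12.24,
    #     'order': 1,
    #     'page': 1
    # }
    tables[0].to_csv('foo.csv')  # also available: to_json, to_excel, to_html
    df_1 = tables[0].df  # get a pandas DataFrame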
    

    The documentation can be found at https://camelot-py.readthedocs.io/en/master/user/quickstart.html; for further reading, see https://camelot-py.readthedocs.io/en/master/user/advanced.html#advanced
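
The parsing_report shown above can also be used to decide which tables are worth keeping. The following is only a sketch: tables_above_accuracy and the 90.0 threshold are hypothetical, the second report's values are made up for illustration, and it assumes the object returned by camelot.read_pdf can be iterated table by table. Stand-in objects are used so that it runs without Camelot or a PDF.

```python
from types import SimpleNamespace


def tables_above_accuracy(tables, min_accuracy=90.0):
    """Return the tables whose parsing_report accuracy is at least min_accuracy.

    Hypothetical helper, not part of Camelot.
    """
    return [t for t in tables if t.parsing_report["accuracy"] >= min_accuracy]


# Stand-ins that only mimic the parsing_report attribute; with Camelot,
# pass the result of camelot.read_pdf(...) instead.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 3}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 60.0, "order": 2, "page": 3}),
]

good = tables_above_accuracy(fake_tables)
print([t.parsing_report["order"] for t in good])  # [1]
```

A low accuracy value is often a hint that the page needs different parsing options rather than that the table should be dropped outright.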

    【Comments】:

      【Solution 2】:

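      Your

      pages=all
      

      should be

      pages="all"
      

      Without the quotes, all is Python's built-in function all(), not the string "all". tabula.read_pdf accepts a string (such as "all" or "1-2,3"), an int, or a list of ints for pages. Judging by the traceback, the function object is forwarded to the Java subprocess command line, where only strings and paths are allowed. That is why you see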

      expected str, bytes or os.PathLike object, not builtin_function_or_method
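
The same failure can be reproduced without Tabula or Java. This is a minimal sketch, not tabula-py's actual code: echo_pages is a hypothetical helper that forwards a value to a child process the way a wrapper around an external program might, using only the standard library.

```python
import subprocess
import sys


def echo_pages(pages):
    """Pass `pages` as a command-line argument to a child process and return what it printed.

    Hypothetical stand-in for a wrapper that forwards an option value
    to an external program through subprocess.run.
    """
    cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", pages]
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)
    return result.stdout.decode().strip()


print(type(all).__name__)  # builtin_function_or_method
print(echo_pages("all"))   # all

try:
    echo_pages(all)  # the unquoted name is the built-in function, not a string
except TypeError as exc:
    print("TypeError:", exc)
```

The quoted value reaches the child process unchanged, while the unquoted name is rejected by subprocess before the child even starts, which matches the last frames of the traceback in the question.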
      

      【Comments】:
