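Answers to: Read from buffer C

【Question title】: Read from buffer C
【Posted】: 2012-03-15 15:29:36
【Question description】:

I am trying to create a simple C program that strips the HTML from a web page and keeps the text. So far I have come up with the code below. It uses cURL to fetch the contents of the web page and writes them to a file. How can I go through the memory buffer, remove all HTML tags, and output the text to the terminal or to a file?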

#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#define WEBPAGE_URL "http://homepages.paradise.net.nz/adrianfu/index.html"
#define DESTINATION_FILE "/home/acwest/data.txt"

/* libcurl write callback: append each received chunk to the open file. */
size_t write_data(void *ptr, size_t size, size_t nmemb, void *stream)
{
    return fwrite(ptr, size, nmemb, (FILE *)stream);
}

int main(void)
{
    int in_tag = 0;   /* not used yet: meant for the tag-stripping pass */
    char *buffer;
    char c;           /* not used yet */
    long lSize;
    size_t result;

    FILE *file = fopen(DESTINATION_FILE, "w+");
    if (file == NULL) {
        fputs("File error", stderr);
        exit(1);
    }

    CURL *handle = curl_easy_init();
    curl_easy_setopt(handle, CURLOPT_URL, WEBPAGE_URL); /* Using the http protocol */
    curl_easy_setopt(handle, CURLOPT_WRITEFUNCTION, write_data);
    curl_easy_setopt(handle, CURLOPT_WRITEDATA, file);
    curl_easy_perform(handle);
    curl_easy_cleanup(handle);

    /* obtain file size */
    fseek(file, 0, SEEK_END);
    lSize = ftell(file);
    rewind(file);

    /* allocate memory to contain the whole file */
    buffer = (char *)malloc(sizeof(char) * lSize);
    if (buffer == NULL) {
        fputs("Memory error", stderr);
        exit(2);
    }

    /* copy the file into the buffer */
    result = fread(buffer, 1, lSize, file);
    if (result != (size_t)lSize) {
        fputs("Reading error", stderr);
        exit(3);
    }

    /* open question: walk buffer[0..result) here and drop the HTML tags */

    free(buffer);
    fclose(file);
    return 0;
}

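【Question comments】:

You could use an existing parsing library, e.g. expat.sourceforge.net
Just a note: what you are trying to achieve would be close to a one-liner in a bash script, using a combination of curl and sed.
@user667430: Your code doesn't even compile...
Don't even think about parsing HTML with that kind of naive syntax. For example, you would get all the JavaScript and CSS code. That is also why sed is a bad idea, although I agree that with curl plus some other utility (html2txt or something similar) this is a one-liner.

Tags: c linux curl libcurl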

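【Solution 1】:

Curl will not help you parse HTML, and parsing it is a complicated task. You could read the language specification and write a parser. There is an open-source C++ project at http://www.mbayer.de/html2text/ and a Python script at https://github.com/aaronsw/html2text. You can also install html2text and use it from the command line, or execute it from your C code.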
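For the narrower question of walking the buffer and dropping anything between '<' and '>', here is a minimal sketch. It is not an HTML parser. The function name strip_tags and its signature are made up for illustration; the in_tag flag follows the variable already declared in the question. As the comments above point out, this naive approach keeps the contents of script and style elements, and it also ignores HTML comments, character entities, and '>' characters inside attribute values.

```c
#include <stddef.h>

/* Hypothetical helper (not part of libcurl or any library).
 * Copies src[0..len) into dst, skipping every byte from a '<'
 * up to and including the next '>'.
 * dst must have room for at least len + 1 bytes.
 * Returns the number of bytes written, not counting the final '\0'. */
size_t strip_tags(const char *src, size_t len, char *dst)
{
    int in_tag = 0;   /* 1 while we are between '<' and '>' */
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        if (c == '<') {
            in_tag = 1;            /* a tag starts: stop copying */
        } else if (c == '>' && in_tag) {
            in_tag = 0;            /* the tag ends: resume copying */
        } else if (!in_tag) {
            dst[out++] = c;        /* plain text: keep it */
        }
    }
    dst[out] = '\0';
    return out;
}
```

In the question's program this could be called after fread, with buffer and result as the first two arguments and a second buffer of result + 1 bytes as dst; the returned text can then be passed to fputs or fwrite. For real-world pages, a proper parser or html2text is likely the more robust choice.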

【Comments】: