[Posted]: 2024-05-04 15:10:04
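[Question]:
I made a simple web scraper that grabs lyrics for me and then writes them to a database. Everything works, but for some reason it replaces some characters with question marks, and when I view the data on a simple PHP web page, the lyrics are full of errors.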
I?m = I'm
Let?s = Let's
haven?t = haven't
stuff like that.
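I know the error is on the C# side, in my code, because I set a breakpoint just before it writes to the database and displayed the text in a rich text box. How can I get it to show these characters correctly?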
public static string getSourceCode(string url)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
    StreamReader sr = new StreamReader(resp.GetResponseStream());
    string sourceCode = sr.ReadToEnd();
    sr.Close();
    resp.Close();
    return sourceCode;
}
........
string url = txbURL2.Text;
string sourceCode = WorkerClass.getSourceCode(url);
int startIndex = sourceCode.IndexOf("<td valign=\"top\" width=\"100%\">");
sourceCode = sourceCode.Substring(startIndex, sourceCode.Length - startIndex);
........
//Gets Lyric
startIndex = sourceCode.IndexOf("<br><b>Lyrics:</b><br><br>") + 30;
endIndex = sourceCode.IndexOf(" <br><br>", startIndex);
string lyric = sourceCode.Substring(startIndex, endIndex - startIndex) + "";
rtbLyric.Text = lyric;
//End Lyric
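The symptom described above is consistent with a mismatch between the page's encoding and the one used to decode it. `new StreamReader(stream)` decodes as UTF-8 by default, so bytes that are not valid UTF-8 are replaced with U+FFFD, which often ends up displayed as `?`. The following is a minimal sketch of decoding with an explicitly chosen encoding instead. It is not a confirmed fix for this site: that the page uses a non-UTF-8 encoding is an assumption, `ReadAll` and `ResolveEncoding` are hypothetical helper names, and ISO-8859-1 is used for the demonstration only because it is available on every .NET runtime without registering additional code page providers.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingSketch
{
    // Hypothetical helper: decodes a stream with an explicitly chosen
    // encoding instead of relying on StreamReader's UTF-8 default.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical helper: maps a charset name (for example the value of
    // HttpWebResponse.CharacterSet) to an Encoding, falling back to UTF-8
    // when the name is missing or not recognized.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            return Encoding.GetEncoding(charset);
        }
        catch (Exception ex) when (ex is ArgumentException || ex is NotSupportedException)
        {
            return Encoding.UTF8;
        }
    }

    public static void Main()
    {
        // Simulate a page served as ISO-8859-1: 'é' (U+00E9) is the single
        // byte 0xE9, which is not a complete UTF-8 sequence on its own.
        byte[] body = Encoding.GetEncoding("iso-8859-1").GetBytes("caf\u00E9");

        string wrong = ReadAll(new MemoryStream(body), Encoding.UTF8);
        string right = ReadAll(new MemoryStream(body), ResolveEncoding("iso-8859-1"));

        // Reports whether the UTF-8 decode produced a replacement character,
        // and whether the explicit decode round-tripped the original text.
        Console.WriteLine(wrong.Contains("\uFFFD"));
        Console.WriteLine(right == "caf\u00E9");
    }
}
```

In `getSourceCode`, the same idea would amount to passing `ResolveEncoding(resp.CharacterSet)` as the second argument to the `StreamReader` constructor. Whether the server reports a correct charset is something to verify against the actual response, since a wrong or missing header would require choosing the encoding by hand.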
[Comments]:
- Please don't prefix your titles with "C#" and such. That's what the tags are for.
Tags: c# character-encoding screen-scraping web-scraping