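Here's an over-engineered but really fun generic approach, in which you put attributes on your class's properties to match them to CSV column headers.
The first step is to parse your CSV. There are a variety of methods out there, but my favourite is the TextFieldParser that can be found in the Microsoft.VisualBasic.FileIO namespace. The advantage of using it is that it's 100% native; all you need to do is add Microsoft.VisualBasic to the references.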
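The splitting step could look roughly like the sketch below. The SplitFile name and its (path, encoding, separator) parameters are inferred from the usage call further down; the body, the CsvSplitter class name, and the SplitText helper are assumptions for illustration, not the original implementation.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.VisualBasic.FileIO;

public static class CsvSplitter
{
    // Hypothetical reconstruction: reads a whole delimited file into memory.
    public static List<String[]> SplitFile(String path, Encoding encoding, Char separator)
    {
        using (StreamReader reader = new StreamReader(path, encoding))
            return SplitText(reader, separator);
    }

    // Hypothetical helper: same logic for any TextReader, which also makes it easy to test.
    public static List<String[]> SplitText(TextReader reader, Char separator)
    {
        List<String[]> rows = new List<String[]>();
        using (TextFieldParser parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.Delimiters = new String[] { separator.ToString() };
            // Lets quoted fields contain the separator, e.g. "Main St, Unit 4".
            parser.HasFieldsEnclosedInQuotes = true;
            while (!parser.EndOfData)
                rows.Add(parser.ReadFields());
        }
        return rows;
    }
}
```

Be aware that ReadFields throws a MalformedLineException on lines it cannot parse, so real code may want to catch that and decide whether to skip or abort.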
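Once that's done, you have your data as a List<String[]>. Now things get interesting: we can create a custom attribute and add it to our class properties.
The attribute class: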
// Requires: using System; using System.Text.RegularExpressions;
[AttributeUsage(AttributeTargets.Property)]
public sealed class CsvColumnAttribute : System.Attribute
{
    public String Name { get; private set; }
    public Regex ValidationRegex { get; private set; }

    public CsvColumnAttribute(String name) : this(name, null) { }

    public CsvColumnAttribute(String name, String validationRegex)
    {
        this.Name = name;
        // Without an explicit pattern, accept any single-line value (including empty).
        this.ValidationRegex = new Regex(validationRegex ?? "^.*$");
    }
}
The data class:
public class AddressInfo
{
    [CsvColumnAttribute("SITE_ID", "^\\d+$")]
    public Int32 SiteId { get; set; }
    [CsvColumnAttribute("HOUSE", "^\\d+$")]
    public Int32 House { get; set; }
    [CsvColumnAttribute("STREET", "^[a-zA-Z0-9- ]+$")]
    public String Street { get; set; }
    [CsvColumnAttribute("CITY", "^[a-zA-Z0-9- ]+$")]
    public String City { get; set; }
    [CsvColumnAttribute("STATE", "^[a-zA-Z]{2}$")]
    public String State { get; set; }
    [CsvColumnAttribute("ZIP", "^\\d{1,5}$")]
    public Int32 Zip { get; set; }
    [CsvColumnAttribute("APARTMENT", "^\\d*$")]
    public Int32? Apartment { get; set; }
}
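As you can see, what I did here was link each property to a CSV column name and give it a regex to validate the contents. For fields that aren't required you can still use a regex, just one that allows an empty value, as shown for Apartment.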
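To see why the Apartment setup works, here is a small sketch of the two pieces involved: a pattern that accepts an empty string, and the default converter for a nullable type. The OptionalFieldDemo class and its method names are made up for this illustration.

```csharp
using System;
using System.ComponentModel;
using System.Text.RegularExpressions;

public static class OptionalFieldDemo
{
    // True when the validation pattern accepts an empty field.
    public static Boolean AllowsEmpty(String pattern)
    {
        return new Regex(pattern).IsMatch(String.Empty);
    }

    // Converts a raw CSV field to Int32? using the type's default converter.
    // For nullable types, the converter maps an empty string to null.
    public static Int32? ToNullableInt(String raw)
    {
        TypeConverter converter = TypeDescriptor.GetConverter(typeof(Int32?));
        return (Int32?)converter.ConvertFromString(raw);
    }
}
```

With these, "^\\d*$" allows an empty field while "^\\d+$" does not, and an empty field ends up as a null Apartment rather than a parse error. A non-nullable Int32 property paired with a pattern that allows empty values would instead fail at conversion time, so optional columns should use nullable property types.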
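Now, to actually match the columns to the CSV headers, we need to get the properties of the AddressInfo class, check for each property whether it has a CsvColumnAttribute, and if it does, match its Name to the column headers of the CSV file data. Once we have that, we have a list of PropertyInfo objects, which can be used to dynamically fill in the properties of a new object created for each row.
This method is completely generic and allows the columns to be given in any order in the CSV file, and the parsing will work for any class once you assign the CsvColumnAttribute to the properties you want filled in. It validates the data automatically, and you can handle failures however you like. In this code, though, all I do is skip invalid rows.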
// Requires: System, System.Collections.Generic, System.ComponentModel,
// System.Linq, System.Reflection, System.Text.RegularExpressions
public static List<T> ParseCsvInfo<T>(List<String[]> split) where T : new()
{
    // No template row, or only a template row but no data. Abort.
    if (split.Count < 2)
        return new List<T>();
    String[] templateRow = split[0];
    // Create a dictionary of column headers and their index in the file data.
    Dictionary<String, Int32> columnIndexing = new Dictionary<String, Int32>();
    for (Int32 i = 0; i < templateRow.Length; i++)
    {
        // ToUpperInvariant is optional, of course. You could have case sensitive headers.
        String colHeader = templateRow[i].Trim().ToUpperInvariant();
        if (!columnIndexing.ContainsKey(colHeader))
            columnIndexing.Add(colHeader, i);
    }
    // An empty header row would make Max() below throw. Abort.
    if (columnIndexing.Count == 0)
        return new List<T>();
    // Prepare the arrays of property parse info. We set the length
    // so the highest found column index exists in it.
    Int32 numCols = columnIndexing.Values.Max() + 1;
    // Actual property to fill in
    PropertyInfo[] properties = new PropertyInfo[numCols];
    // Regex to validate the string before parsing
    Regex[] propValidators = new Regex[numCols];
    // Type converters for automatic parsing
    TypeConverter[] propconverters = new TypeConverter[numCols];
    // go over the properties of the given type, see which ones have a
    // CsvColumnAttribute, and put these in the list at their CSV index.
    foreach (PropertyInfo p in typeof(T).GetProperties())
    {
        object[] attrs = p.GetCustomAttributes(true);
        foreach (Object attr in attrs)
        {
            CsvColumnAttribute csvAttr = attr as CsvColumnAttribute;
            if (csvAttr == null)
                continue;
            Int32 index;
            if (!columnIndexing.TryGetValue(csvAttr.Name.ToUpperInvariant(), out index))
            {
                // If no valid column is found, and the regex for this property
                // does not allow an empty value, then all lines are invalid.
                if (!csvAttr.ValidationRegex.IsMatch(String.Empty))
                    return new List<T>();
                // No valid column found: ignore this property.
                break;
            }
            properties[index] = p;
            propValidators[index] = csvAttr.ValidationRegex;
            // Automatic type converter. This function could be enhanced by giving a
            // list of custom converters as extra argument and checking those first.
            propconverters[index] = TypeDescriptor.GetConverter(p.PropertyType);
            break; // Only handle one CsvColumnAttribute per property.
        }
    }
    List<T> objList = new List<T>();
    // start from 1 since the first line is the template with the column names
    for (Int32 i = 1; i < split.Count; i++)
    {
        Boolean abortLine = false;
        String[] line = split[i];
        // make new object of the given type
        T obj = new T();
        for (Int32 col = 0; col < properties.Length; col++)
        {
            // It is possible a line is not long enough to contain all columns.
            String curVal = col < line.Length ? line[col] : String.Empty;
            PropertyInfo prop = properties[col];
            // this can be null if no property is mapped to this column.
            if (prop == null)
                continue;
            // check validity. Abort buildup of this object if not valid.
            Boolean valid = propValidators[col].IsMatch(curVal);
            if (!valid)
            {
                // Add logging here? We have the line and column index.
                abortLine = true;
                break;
            }
            // Automated parsing. Always use nullable types for nullable properties.
            Object value = propconverters[col].ConvertFromString(curVal);
            prop.SetValue(obj, value, null);
        }
        if (!abortLine)
            objList.Add(obj);
    }
    return objList;
}
To use it on your CSV file, simply do this:
// the function using VB's TextFieldParser
List<String[]> splitData = SplitFile(datafile, new UTF8Encoding(false), ',');
// The above function, applied to the AddressInfo class
List<AddressInfo> addresses = ParseCsvInfo<AddressInfo>(splitData);
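And that's it. Automatic parsing and validation, all through a few attributes added to the class properties.
Note that if splitting the data in advance would be too much of a performance hit for large data, that's not really a problem: TextFieldParser works from a Stream wrapped in a TextReader, so instead of giving a List<String[]> you can provide a stream and do the CSV parsing on the fly inside the ParseCsvInfo function, reading each CSV line directly from the TextFieldParser.
I didn't do that here because the original use case for which I wrote the reader to List<String[]> included automatic encoding detection, which required reading the whole file anyway.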
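For reference, the streaming variant described above might start from something like the following sketch. The CsvStreaming class and ReadRows method are hypothetical names; the idea is just to yield one row at a time instead of building the full list first.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvStreaming
{
    // Hypothetical helper: lazily yields one parsed row per CSV line.
    // The parser (and the reader it wraps) is disposed when enumeration ends.
    public static IEnumerable<String[]> ReadRows(TextReader reader, Char separator)
    {
        using (TextFieldParser parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.Delimiters = new String[] { separator.ToString() };
            parser.HasFieldsEnclosedInQuotes = true;
            while (!parser.EndOfData)
                yield return parser.ReadFields();
        }
    }
}
```

ParseCsvInfo would then need to accept an IEnumerable<String[]>, take the first row as the header, and process the remaining rows in a single pass rather than indexing into a list.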