【Question title】: Write all data from one CSV file to another, but include newly parsed geocoding data as additional fields
【Posted】: 2018-05-24 14:25:53
【Question】:

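I am trying to write a Python script that takes any CSV file, runs it through a geocoder, and then writes the resulting geocoding attributes (plus all the data from the original file) to a new CSV file.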

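My code so far is below. Everything works as expected except for combining the geocoding attributes with the data from the original CSV file. What currently happens is that all of the original field values for a given row show up as a single value in the output CSV file (the geocoding attributes are written correctly). The problem is at the end of the script. For brevity I have left out the code for the various classes.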

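I should also note that I am using hasattr because, although I do not know what all the fields in the original in_file will be, I do know that the fields needed for geocoding will appear somewhere in the input CSV.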

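Initially I tried changing "new_file.writerow([])" to "new_file.writerow()". With that change the input row (r) was written to the CSV file correctly, but the geocoding attributes could no longer be written, because they were treated as additional arguments.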
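Both symptoms can be reproduced in isolation. The following is a minimal sketch, assuming made-up values in place of a real input row and real geocoding results, and an in-memory buffer in place of the output file:

```python
import csv
import io

# Assumed stand-ins: one parsed input row and placeholder geocoding values.
r = ["100001", "Playstation Theater", "New York"]
lat, lon = "45.1234", "-110.4567"

buf = io.StringIO()
writer = csv.writer(buf)

# Nesting the row inside another list makes the whole row ONE field:
# csv.writer converts the inner list with str() and quotes the result.
writer.writerow([r, lat, lon])
nested_line = buf.getvalue()

# Passing the extra values as separate arguments fails instead,
# because writerow() accepts exactly one iterable.
try:
    writer.writerow(r, lat, lon)
    extra_args_error = None
except TypeError as exc:
    extra_args_error = exc

print(nested_line)
print(repr(extra_args_error))
```

The first call shows the "everything in one cell" behaviour; the second shows why the geocoding values cannot simply be passed as extra arguments.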

import csv
import re
import time
from collections import namedtuple

import requests


def locate(file=None):
    """Locate addresses by geocoding each row of a CSV file."""
    start_time = time.time()
    count = 0

    if file is not None:

        with open(file) as in_file:
            f_csv = csv.reader(in_file)

            # Replace quotes, whitespace and '+' with '_' and lowercase the headers to standardize them for hasattr
            headers = [re.sub(r'["\s+]', '_', h).lower() for h in next(f_csv)]

            # Use a namedtuple for the headers
            Row = namedtuple('Row', headers)

            # For each row in the file
            for r in f_csv:
                count += 1
                # Set the row values as named tuple values
                row = Row(*r)

                # Use hasattr to find the field names address, city, state, zipcode
                if hasattr(row, 'address'):
                    address = row.address
                elif hasattr(row, 'address1'):
                    address = row.address1
                if hasattr(row, 'city'):
                    city = row.city
                if hasattr(row, 'state'):
                    state = row.state
                elif hasattr(row, 'st'):
                    state = row.st
                if hasattr(row, 'zipcode'):
                    zipCode = row.zipcode
                elif hasattr(row, 'zip'):
                    # Read the 'zip' field here, not 'zipcode'
                    zipCode = row.zip

                # Create new address object
                addressObject = Address(address, city, state, zipCode)

                # Get response from the API
                data = requests.get(str(addressObject)).json()

                try:
                    data['geocodeStatusCode'] = int(data['geocodeStatusCode'])
                except (KeyError, TypeError, ValueError):
                    data['geocodeStatusCode'] = None

                if data['geocodeStatusCode'] == 'SomeNumber':

                    # Geocoded address ideally uses parent class attributes
                    geocodedAddressObject = GeocodedAddress(addressObject.address, addressObject.city, addressObject.state, addressObject.zipCode,
                                                            data['addressGeo']['latitude'], data['addressGeo']['longitude'], data['addressGeo']['score'])

                else:

                    geocodedAddressObject = GeocodedAddress(addressObject.address, addressObject.city, addressObject.state, addressObject.zipCode)

                # Problem area
                geocoded_file = file.replace('.csv', '_geocoded2') + '.csv'
                with open(geocoded_file, 'a', newline='') as geocoded:

                    # Problem area -- the r (row) value is written entirely into one cell even though its items are comma separated. The geocoding attributes do write correctly to the csv file
                    new_file = csv.writer(geocoded)
                    new_file.writerow([r, geocodedAddressObject.latitude, geocodedAddressObject.longitude, geocodedAddressObject.geocodeScore])

    print('The time to geocode {} records: {}'.format(count, (time.time() - start_time)))

Sample CSV input data:

"UID", "Occupant", "Address", "City", "State", "ZipCode"
"100001", "Playstation Theater", "New York", "NY", "10036"
"100002", "Ed Sullivan Theater", "New York", "NY", "10019"

Sample CSV output (with the additional fields parsed during geocoding):

"UID", "Occupant", "Address", "City", "State", "ZipCode", "GeoCodingLatitude", "GeoCodingLongitude", "GeoCodingScore"
"100001", "Playstation Theater", "New York", "NY", "10036", "45.1234", "-110.4567", "100"
"100002", "Ed Sullivan Theater", "New York", "NY", "10019", "44.1234", "-111.4567", "100"

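【Comments】:

  • It sounds like you should be using DictReader. Showing the expected vs. actual output and a sample input would help.
  • @MarkTolonen If you think I should add anything else to the question, in style or content, to make it easier to answer, please let me know. Thanks!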
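
One possible reading of the DictReader suggestion is sketched below. It is not code from the thread: `fake_geocode`, `geocode_csv`, and `GEO_FIELDS` are hypothetical names, the geocoding call is replaced by a stub that returns fixed values, and in-memory buffers stand in for the real files.

```python
import csv
import io

# Output column names taken from the sample output in the question.
GEO_FIELDS = ["GeoCodingLatitude", "GeoCodingLongitude", "GeoCodingScore"]


def fake_geocode(row):
    """Hypothetical stand-in for the real geocoding request."""
    return dict(zip(GEO_FIELDS, ["45.1234", "-110.4567", "100"]))


def geocode_csv(in_file, out_file, geocode=fake_geocode):
    """Copy every input column and append the geocoding columns."""
    # skipinitialspace lets the reader handle '"a", "b"' style input.
    reader = csv.DictReader(in_file, skipinitialspace=True)
    writer = csv.DictWriter(out_file, fieldnames=list(reader.fieldnames) + GEO_FIELDS)
    writer.writeheader()
    count = 0
    for row in reader:
        row.update(geocode(row))  # original fields stay, new ones are added
        writer.writerow(row)
        count += 1
    return count


sample = io.StringIO(
    '"UID", "Occupant", "City"\n'
    '"100001", "Playstation Theater", "New York"\n'
)
out = io.StringIO()
n = geocode_csv(sample, out)
result_lines = out.getvalue().splitlines()
print(result_lines)
```

Because each row is a mapping keyed by header name, the script never needs to know the input columns in advance, and the header row of the output is written once rather than being lost.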

Tags: csv python-3.6 namedtuple


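【Solution 1】:

I came up with a solution, although it may not be the most elegant. I use namedtuple._asdict() to convert the namedtuple to a dictionary, then loop over the row's values and add them to a new list. At that point I append the geocoding variables and write the whole list as the row. Here is the part of the code I changed. If you can think of a better solution, please let me know.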

                    if data['geocodeStatusCode'] == 'SomeNumber':

                        # geocoded address ideally should use parent class address values and not have to be restated 
                        geocodedAddressObject =  GeocodedAddress(addressObject.address, addressObject.city, addressObject.state, addressObject.zipCode,
                                                                data['addressGeo']['latitude'], data['addressGeo']['longitude'], data['addressGeo']['score'])              


                    else:

                        geocodedAddressObject =  GeocodedAddress(addressObject.address, addressObject.city, addressObject.state, addressObject.zipCode)              


                    # This is where I made the change - set new list
                    list_values = []    

                    # Use _asdict for the named tuple
                    row_content = row._asdict()

                    # Loop through and strip white space
                    for key, value in row_content.items():
                        # print(key, value.strip())
                        list_values.append(value.strip())

                    # Extend the list rather than append, because there are multiple values
                    list_values.extend((geocodedAddressObject.latitude, geocodedAddressObject.longitude, geocodedAddressObject.geocodeScore))

                    # Finally write the new list to the csv file, which includes both the row and the geocoded values,
                    # and is agnostic as to what data it is passed as long as it is UTF-8 compliant
                    new_file.writerow(list_values)
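
A shorter variant may also work, since a namedtuple is already iterable and does not need the `_asdict()` round trip. This is only a sketch under assumptions: the `Row` fields and the latitude, longitude, and score values are made up, and an in-memory buffer replaces the output file.

```python
import csv
import io
from collections import namedtuple

# Assumed example row; the leading spaces mimic '"a", "b"' style input.
Row = namedtuple("Row", ["uid", "occupant", "city"])
row = Row("100001", " Playstation Theater", " New York")

# Placeholder geocoding results.
lat, lon, score = 45.1234, -110.4567, 100

buf = io.StringIO()
writer = csv.writer(buf)

# Unpack the stripped row values and the geocoding values into ONE flat list,
# so each value becomes its own field.
writer.writerow([*(value.strip() for value in row), lat, lon, score])
flat_line = buf.getvalue()
print(flat_line)
```

The key point is the same as in the loop version: `writerow()` receives a single flat sequence rather than a list that contains the row as one element.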
