Python 3 Fundamentals Week 23 - CSV Module 課程筆記

Introduction

The Python csv module provides a simple way to handle CSV (Comma-Separated Values) files.

CSV Dialects

CSV formats vary and these different formats are often called dialects.

  • Delimiter: tab, pipe(|)
  • Line separator: \n in Unix, \r\n in Windows

Reading CSV Files

What is CSV Data

The Comma-Separated Values (CSV) is a data format suitable for storing tabular data.

  • Each row in a file is a row of data
  • Rows in a file are seperated by a newline (OS specific)
  • Each field in the row is separated by a separator aka delimiter

If field contains the field delimiter character?

"John","We hope to "win" the game"

Double up the quotes

"John","We hope to ""win"" the game"

Use escape character (\)

"John","We hope to \"win\" the game"

CSV is not a standard format

The most common dialect is used by Excel, but others exist as well.

Common Excel CSV Format

  • delimiter (field separator): ,
  • quotechar (field delimiter): "

When handling special characters inside fields:

  • Quotes inside a field: If a field contains a quote character, it is represented by doubling the quote. For example, We hope to "win" becomes "We hope to ""win"".
  • Delimiter inside a field: If a field contains the delimiter, it is enclosed in quotes. For example, Smith, John becomes "Smith, John".

Other valid CSV formats:

  • Pipe delimited, e.g., field1|field2|field3
  • Tab delimited, e.g., field1 field2 field3

Parsing CSV Data

The default parser dialect of CSV module is excel but the module allows custom settings for delimiter, quotechar, etc.

with open('some_file') as f:
	reader = csv.reader(f, delimiter=',', quotechar='"')
	for row in reader:
		# row is a list containing parsed fields
		print(row)

f: The file opened for reading.
delimiter : Optional. Default value is ‘,’.
quotechar: Optional, default value is ‘"’.

csv.reader() returns an iterator of parsed rows from the file.

Coding

File content in ‘actors.csv’

First Name,Last Name,DOB,Sketches
John,Cleese,10/27/39,"The Cheese Shop, Ministry of Silly Walks, It's the Arts"
Eric,Idle,3/29/43,"The Cheese Shop, Nudge Nudge, ""Spam"""
Peter,O'Toole,8/2/32,Lawrence of Arabia

Splitting data with ,

with open('actors.csv') as f:
    for row in f:
        row = row.strip()
        field = row.split(",")
        print(row)

Output:

['First Name', 'Last Name', 'DOB', 'Sketches']
['John', 'Cleese', '10/27/39', '"The Cheese Shop', ' Ministry of Silly Walks', ' It\'s the Arts"']
['Eric', 'Idle', '3/29/43', '"The Cheese Shop', ' Nudge Nudge', ' ""Spam"""']
['Peter', "O'Toole", '8/2/32', 'Lawrence of Arabia']

Using csv.reader()

import csv

with open('actors.csv') as f:
    csv_file = csv.reader(f)  # default uses , and "
    for row in csv_file:
        print(row)

Output:

['First Name', 'Last Name', 'DOB', 'Sketches']
['John', 'Cleese', '10/27/39', "The Cheese Shop, Ministry of Silly Walks, It's the Arts"]
['Eric', 'Idle', '3/29/43', 'The Cheese Shop, Nudge Nudge, "Spam"']
['Peter', "O'Toole", '8/2/32', 'Lawrence of Arabia']

Dialects

Custom dialects can be defined once in the program and reused multiple times.

Listing Available Dialects

The csv module comes with some pre-defined dialects.

csv.list_dialects()

# Output
# ['excel', 'excel-tab', 'unix']

Create a custom dialect

  • Name for the dialect
  • Values for delimiter, quotechar, etc
csv.register_dialect("<name>, delimiter=..., quotechar=..., ...)

Dialect Parameter

Parameter Description Default
delimiter Character used to separate fields ,
quotechar Character used to quote fields "
escapechar Character used to prefix escape sequences. If None, escaping is disabled None
skipinitialspace If True, ignore spaces after the field delimiter False
doublequote If True, two consecutive quote characters are used to escape a quote character inside a field True

Using a Dinfined Dialect

csv.reader(f, dialect='excel')

delimiter : Optional. Default value is excel.

Coding

File content in ‘actors.pdv’

FIRST_NAME| LAST_NAME| DOB| SKETCHES
John|Cleese| 10/27/39| 'The Cheese Shop, Ministry of Silly Walks, It\'s the Arts'
Eric| Idle| 3/29/43| 'The Cheese Shop, Nudge Nudge, "Spam"'
Peter| 'O\'Toole'| 8/2/32| Lawrence of Arabia
with open('actors.pdv') as f:
    reader = csv.reader(f,
                        delimiter='|',
                        quotechar="'",
                        escapechar='\\',
                        skipinitialspace=True
                       )
    for row in reader:
        print(row)

Output:

['FIRST_NAME', 'LAST_NAME', 'DOB', 'SKETCHES']
['John', 'Cleese', '10/27/39', "The Cheese Shop, Ministry of Silly Walks, It's the Arts"]
['Eric', 'Idle', '3/29/43', 'The Cheese Shop, Nudge Nudge, "Spam"']
['Peter', "O'Toole", '8/2/32', 'Lawrence of Arabia']

Registering and reusing custom dialects

csv.register_dialect(
	'pdv',
	delimiter='|',
    quotechar="'",
    escapechar='\\',
    skipinitialspace=True
    )

with open('actors.pdv') as f:
    reader = csv.reader(f, dialect='pdv')
    for row in reader:
        print(row)

Output:

['FIRST_NAME', 'LAST_NAME', 'DOB', 'SKETCHES']
['John', 'Cleese', '10/27/39', "The Cheese Shop, Ministry of Silly Walks, It's the Arts"]
['Eric', 'Idle', '3/29/43', 'The Cheese Shop, Nudge Nudge, "Spam"']
['Peter', "O'Toole", '8/2/32', 'Lawrence of Arabia']

Printing \

print("\")
# SyntaxError: EOL while scanning string literal

print("\\")
# \

More Examples Reading CSV Files

Two additional CSV files:

  • NASDAQ data
  • Old data from the U.S. Census Bureau

Coding

NASDAQ Data

Symbol,Company Name,Security Name,Market Category,Test Issue,Financial Status,Round Lot Size
AAIT,iShares MSCI All Country Asia Information Technology Index Fund,iShares MSCI All Country Asia Information Technology Index Fund,G,N,N,100.0
AAL,"American Airlines Group, Inc.","American Airlines Group, Inc. - Common Stock",Q,N,N,100.0
...

Reading csv by using default dialect (excel).

with open(nasdaq) as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

Output

['Symbol', 'Company Name', 'Security Name', 'Market Category', 'Test Issue', 'Financial Status', 'Round Lot Size']
['AAIT', 'iShares MSCI All Country Asia Information Technology Index Fund', 'iShares MSCI All Country Asia Information Technology Index Fund', 'G', 'N', 'N', '100.0']
['AAL', 'American Airlines Group, Inc.', 'American Airlines Group, Inc. - Common Stock', 'Q', 'N', 'N', '100.0']
...

Handling numeric fields - converting strings to floats:

def parse_nasdaq(f_name):
    result = []

    with open(f_name) as f:
        reader = csv.reader(f)
        header = next(reader)
        result.append(header)

        for row in reader:
            row[-1] = float(row[-1])
            result.append(row)

    return result

parse_nasdaq('nasdaq.csv')

Output

[['Symbol',
  'Company Name',
  'Security Name',
  'Market Category',
  'Test Issue',
  'Financial Status',
  'Round Lot Size'],
 ['AAIT',
  'iShares MSCI All Country Asia Information Technology Index Fund',
  'iShares MSCI All Country Asia Information Technology Index Fund',
  'G',
  'N',
  'N',
  100.0],
 ['AAL',
  'American Airlines Group, Inc.',
  'American Airlines Group, Inc. - Common Stock',
  'Q',
  'N',
  'N',
  100.0],
  ...]

Data from the U.S. Census Bureau

Geographic Area,"July 1, 2001 Estimate","July 1, 2000 Estimate","April 1, 2000 Population Estimates Base"
United States," 284,796,887 "," 282,124,631 "," 281,421,906 "
Alabama," 4,464,356 "," 4,451,493 "," 4,447,100 "
Alaska," 634,892 "," 627,601 "," 626,932 "
Arizona," 5,307,331 "," 5,165,274 "," 5,130,632 "
...

Handling numeric fields - removing commas and converting strings to integers:

def parse_census_data(file_name):
    results = []
    with open(file_name) as f:
        reader = csv.reader(f)
        # first row contains headers
        headers = next(reader)
        results.append(headers)

        # process rest of data
        for row in reader:
            # columns 1, 2, 3 are numerical
            area = row[0]
            data = row[1:]
            data = [area] + [int(field.replace(',', '')) for field in data]
            results.append(data)
            
    return results

parse_census_data('st-2001est-01.csv')

Output

[['Geographic Area', 'July 1, 2001 Estimate', 'July 1, 2000 Estimate', 'April 1, 2000 Population Estimates Base'], 
['United States', 284796887, 282124631, 281421906], 
['Alabama', 4464356, 4451493, 4447100], 
['Alaska', 634892, 627601, 626932], 
['Arizona', 5307331, 5165274, 5130632],
...]

Writing CSV Files

  • To write data to a CSV file, the data should be an iterable of rows, where each row is itself an iterable of fields.
  • Better to use the standard excel dialect to write a CSV file.

Writing a CSV File

Use csv.writer to create a CSV writer object and the writerow() to write out each row in your data.

import csv 

# open the file in the write mode 
with open('<file_name>', 'w') as f: 
	# create the csv writer 
	writer = csv.writer(f, dialect='...') 
	for low in data:
		# write a row to the csv file 
		writer.writerow(row)

The data is an iterable containing iterables of field.

data = [
	[row1_col1, row1_col2, row1_col3, ...],
	[row2_col1, row2_col2, row2_col3, ...],
	...
]

Coding

import csv

data = [
    ['First Name', 'Last Name', 'DOB', 'Sketches'],
    ['John', 'Cleese', '10/27/39', "The Cheese Shop, Ministry of Silly Walks, It's the Arts"],
    ['Eric', 'Idle', '3/29/43', 'The Cheese Shop, Nudge Nudge, "Spam"'],
    ['Peter', "O'Toole", '8/2/32', 'Lawrence of Arabia']
]

with open('test.csv', 'w') as f:
    writer = csv.writer(f)
    for row in data:
        writer.writerow(row)

Result

First Name,Last Name,DOB,Sketches
John,Cleese,10/27/39,"The Cheese Shop, Ministry of Silly Walks, It's the Arts"
Eric,Idle,3/29/43,"The Cheese Shop, Nudge Nudge, ""Spam"""
Peter,O'Toole,8/2/32,Lawrence of Arabia

Custom Dialect

# If doublequote is True, Python ignores escapechar 
# and just double up the quotechar while writing the CSV file.
csv.register_dialect(
	'pdv',
	delimiter='|',
    quotechar="'",
    escapechar='\\'
)

with open('test.csv', 'w') as f:
    writer = csv.writer(f, dialect='pdv')
    for row in data:
        writer.writerow(row)

# Result
# First Name|Last Name|DOB|Sketches
# John|Cleese|10/27/39|'The Cheese Shop, Ministry of Silly Walks, It''s the Arts'
# Eric|Idle|3/29/43|The Cheese Shop, Nudge Nudge, "Spam"
# Peter|'O''Toole'|8/2/32|Lawrence of Arabia


# doublequote = False 
csv.register_dialect(
	'pdv',
	delimiter='|',
    quotechar="'",
    escapechar='\\',
    doublequote=False
)

with open('test.csv', 'w') as f:
    writer = csv.writer(f, dialect='pdv')
    for row in data:
        writer.writerow(row)

# Result
First Name|Last Name|DOB|Sketches
John|Cleese|10/27/39|The Cheese Shop, Ministry of Silly Walks, It\'s the Arts
Eric|Idle|3/29/43|The Cheese Shop, Nudge Nudge, "Spam"
Peter|O\'Toole|8/2/32|Lawrence of Arabia

References

Common Format and MIME Type for Comma-Separated Values (CSV) Files
Dialect 與格式參數