Introduction
The Python csv
module provides a simple way to handle CSV (Comma-Separated Values) files.
CSV Dialects
CSV formats vary and these different formats are often called dialects.
- Delimiter: tab, pipe(
|
) - Line separator:
\n
in Unix,\r\n
in Windows
Reading CSV Files
What is CSV Data
The Comma-Separated Values (CSV) is a data format suitable for storing tabular data.
- Each row in a file is a row of data
- Rows in a file are seperated by a newline (OS specific)
- Each field in the row is separated by a separator aka delimiter
If field contains the field delimiter character?
"John","We hope to "win" the game"
Double up the quotes
"John","We hope to ""win"" the game"
Use escape character (\
)
"John","We hope to \"win\" the game"
CSV is not a standard format
The most common dialect is used by Excel, but others exist as well.
Common Excel CSV Format
delimiter
(field separator):,
quotechar
(field delimiter):"
When handling special characters inside fields:
- Quotes inside a field: If a field contains a quote character, it is represented by doubling the quote. For example,
We hope to "win"
becomes"We hope to ""win""
. - Delimiter inside a field: If a field contains the delimiter, it is enclosed in quotes. For example,
Smith, John
becomes"Smith, John"
.
Other valid CSV formats:
- Pipe delimited, e.g.,
field1|field2|field3
- Tab delimited, e.g.,
field1 field2 field3
Parsing CSV Data
The default parser dialect of CSV module is excel but the module allows custom settings for delimiter, quotechar, etc.
with open('some_file') as f:
reader = csv.reader(f, delimiter=',', quotechar='"')
for row in reader:
# row is a list containing parsed fields
print(row)
f
: The file opened for reading.
delimiter
: Optional. Default value is ‘,
’.
quotechar
: Optional, default value is ‘"
’.
csv.reader() returns an iterator of parsed rows from the file.
Coding
File content in ‘actors.csv’
First Name,Last Name,DOB,Sketches
John,Cleese,10/27/39,"The Cheese Shop, Ministry of Silly Walks, It's the Arts"
Eric,Idle,3/29/43,"The Cheese Shop, Nudge Nudge, ""Spam"""
Peter,O'Toole,8/2/32,Lawrence of Arabia
Splitting data with ,
with open('actors.csv') as f:
for row in f:
row = row.strip()
field = row.split(",")
print(row)
Output:
['First Name', 'Last Name', 'DOB', 'Sketches']
['John', 'Cleese', '10/27/39', '"The Cheese Shop', ' Ministry of Silly Walks', ' It\'s the Arts"']
['Eric', 'Idle', '3/29/43', '"The Cheese Shop', ' Nudge Nudge', ' ""Spam"""']
['Peter', "O'Toole", '8/2/32', 'Lawrence of Arabia']
Using csv.reader()
import csv
with open('actors.csv') as f:
csv_file = csv.reader(f) # default uses , and "
for row in csv_file:
print(row)
Output:
['First Name', 'Last Name', 'DOB', 'Sketches']
['John', 'Cleese', '10/27/39', "The Cheese Shop, Ministry of Silly Walks, It's the Arts"]
['Eric', 'Idle', '3/29/43', 'The Cheese Shop, Nudge Nudge, "Spam"']
['Peter', "O'Toole", '8/2/32', 'Lawrence of Arabia']
Dialects
Custom dialects can be defined once in the program and reused multiple times.
Listing Available Dialects
The csv module comes with some pre-defined dialects.
csv.list_dialects()
# Output
# ['excel', 'excel-tab', 'unix']
Create a custom dialect
- Name for the dialect
- Values for delimiter, quotechar, etc
csv.register_dialect("<name>, delimiter=..., quotechar=..., ...)
Dialect Parameter
Parameter | Description | Default |
---|---|---|
delimiter | Character used to separate fields | , |
quotechar | Character used to quote fields | " |
escapechar | Character used to prefix escape sequences. If None , escaping is disabled |
None |
skipinitialspace | If True , ignore spaces after the field delimiter |
False |
doublequote | If True , two consecutive quote characters are used to escape a quote character inside a field |
True |
Using a Dinfined Dialect
csv.reader(f, dialect='excel')
delimiter
: Optional. Default value is excel
.
Coding
File content in ‘actors.pdv’
FIRST_NAME| LAST_NAME| DOB| SKETCHES
John|Cleese| 10/27/39| 'The Cheese Shop, Ministry of Silly Walks, It\'s the Arts'
Eric| Idle| 3/29/43| 'The Cheese Shop, Nudge Nudge, "Spam"'
Peter| 'O\'Toole'| 8/2/32| Lawrence of Arabia
with open('actors.pdv') as f:
reader = csv.reader(f,
delimiter='|',
quotechar="'",
escapechar='\\',
skipinitialspace=True
)
for row in reader:
print(row)
Output:
['FIRST_NAME', 'LAST_NAME', 'DOB', 'SKETCHES']
['John', 'Cleese', '10/27/39', "The Cheese Shop, Ministry of Silly Walks, It's the Arts"]
['Eric', 'Idle', '3/29/43', 'The Cheese Shop, Nudge Nudge, "Spam"']
['Peter', "O'Toole", '8/2/32', 'Lawrence of Arabia']
Registering and reusing custom dialects
csv.register_dialect(
'pdv',
delimiter='|',
quotechar="'",
escapechar='\\',
skipinitialspace=True
)
with open('actors.pdv') as f:
reader = csv.reader(f, dialect='pdv')
for row in reader:
print(row)
Output:
['FIRST_NAME', 'LAST_NAME', 'DOB', 'SKETCHES']
['John', 'Cleese', '10/27/39', "The Cheese Shop, Ministry of Silly Walks, It's the Arts"]
['Eric', 'Idle', '3/29/43', 'The Cheese Shop, Nudge Nudge, "Spam"']
['Peter', "O'Toole", '8/2/32', 'Lawrence of Arabia']
Printing \
print("\")
# SyntaxError: EOL while scanning string literal
print("\\")
# \
More Examples Reading CSV Files
Two additional CSV files:
- NASDAQ data
- Old data from the U.S. Census Bureau
Coding
NASDAQ Data
Symbol,Company Name,Security Name,Market Category,Test Issue,Financial Status,Round Lot Size
AAIT,iShares MSCI All Country Asia Information Technology Index Fund,iShares MSCI All Country Asia Information Technology Index Fund,G,N,N,100.0
AAL,"American Airlines Group, Inc.","American Airlines Group, Inc. - Common Stock",Q,N,N,100.0
...
Reading csv by using default dialect (excel).
with open(nasdaq) as f:
reader = csv.reader(f)
for row in reader:
print(row)
Output
['Symbol', 'Company Name', 'Security Name', 'Market Category', 'Test Issue', 'Financial Status', 'Round Lot Size']
['AAIT', 'iShares MSCI All Country Asia Information Technology Index Fund', 'iShares MSCI All Country Asia Information Technology Index Fund', 'G', 'N', 'N', '100.0']
['AAL', 'American Airlines Group, Inc.', 'American Airlines Group, Inc. - Common Stock', 'Q', 'N', 'N', '100.0']
...
Handling numeric fields - converting strings to floats:
def parse_nasdaq(f_name):
result = []
with open(f_name) as f:
reader = csv.reader(f)
header = next(reader)
result.append(header)
for row in reader:
row[-1] = float(row[-1])
result.append(row)
return result
parse_nasdaq('nasdaq.csv')
Output
[['Symbol',
'Company Name',
'Security Name',
'Market Category',
'Test Issue',
'Financial Status',
'Round Lot Size'],
['AAIT',
'iShares MSCI All Country Asia Information Technology Index Fund',
'iShares MSCI All Country Asia Information Technology Index Fund',
'G',
'N',
'N',
100.0],
['AAL',
'American Airlines Group, Inc.',
'American Airlines Group, Inc. - Common Stock',
'Q',
'N',
'N',
100.0],
...]
Data from the U.S. Census Bureau
Geographic Area,"July 1, 2001 Estimate","July 1, 2000 Estimate","April 1, 2000 Population Estimates Base"
United States," 284,796,887 "," 282,124,631 "," 281,421,906 "
Alabama," 4,464,356 "," 4,451,493 "," 4,447,100 "
Alaska," 634,892 "," 627,601 "," 626,932 "
Arizona," 5,307,331 "," 5,165,274 "," 5,130,632 "
...
Handling numeric fields - removing commas and converting strings to integers:
def parse_census_data(file_name):
results = []
with open(file_name) as f:
reader = csv.reader(f)
# first row contains headers
headers = next(reader)
results.append(headers)
# process rest of data
for row in reader:
# columns 1, 2, 3 are numerical
area = row[0]
data = row[1:]
data = [area] + [int(field.replace(',', '')) for field in data]
results.append(data)
return results
parse_census_data('st-2001est-01.csv')
Output
[['Geographic Area', 'July 1, 2001 Estimate', 'July 1, 2000 Estimate', 'April 1, 2000 Population Estimates Base'],
['United States', 284796887, 282124631, 281421906],
['Alabama', 4464356, 4451493, 4447100],
['Alaska', 634892, 627601, 626932],
['Arizona', 5307331, 5165274, 5130632],
...]
Writing CSV Files
- To write data to a CSV file, the data should be an iterable of rows, where each row is itself an iterable of fields.
- Better to use the standard excel dialect to write a CSV file.
Writing a CSV File
Use csv.writer
to create a CSV writer object and the writerow()
to write out each row in your data.
import csv
# open the file in the write mode
with open('<file_name>', 'w') as f:
# create the csv writer
writer = csv.writer(f, dialect='...')
for low in data:
# write a row to the csv file
writer.writerow(row)
The data is an iterable containing iterables of field.
data = [
[row1_col1, row1_col2, row1_col3, ...],
[row2_col1, row2_col2, row2_col3, ...],
...
]
Coding
import csv
data = [
['First Name', 'Last Name', 'DOB', 'Sketches'],
['John', 'Cleese', '10/27/39', "The Cheese Shop, Ministry of Silly Walks, It's the Arts"],
['Eric', 'Idle', '3/29/43', 'The Cheese Shop, Nudge Nudge, "Spam"'],
['Peter', "O'Toole", '8/2/32', 'Lawrence of Arabia']
]
with open('test.csv', 'w') as f:
writer = csv.writer(f)
for row in data:
writer.writerow(row)
Result
First Name,Last Name,DOB,Sketches
John,Cleese,10/27/39,"The Cheese Shop, Ministry of Silly Walks, It's the Arts"
Eric,Idle,3/29/43,"The Cheese Shop, Nudge Nudge, ""Spam"""
Peter,O'Toole,8/2/32,Lawrence of Arabia
Custom Dialect
# If doublequote is True, Python ignores escapechar
# and just double up the quotechar while writing the CSV file.
csv.register_dialect(
'pdv',
delimiter='|',
quotechar="'",
escapechar='\\'
)
with open('test.csv', 'w') as f:
writer = csv.writer(f, dialect='pdv')
for row in data:
writer.writerow(row)
# Result
# First Name|Last Name|DOB|Sketches
# John|Cleese|10/27/39|'The Cheese Shop, Ministry of Silly Walks, It''s the Arts'
# Eric|Idle|3/29/43|The Cheese Shop, Nudge Nudge, "Spam"
# Peter|'O''Toole'|8/2/32|Lawrence of Arabia
# doublequote = False
csv.register_dialect(
'pdv',
delimiter='|',
quotechar="'",
escapechar='\\',
doublequote=False
)
with open('test.csv', 'w') as f:
writer = csv.writer(f, dialect='pdv')
for row in data:
writer.writerow(row)
# Result
First Name|Last Name|DOB|Sketches
John|Cleese|10/27/39|The Cheese Shop, Ministry of Silly Walks, It\'s the Arts
Eric|Idle|3/29/43|The Cheese Shop, Nudge Nudge, "Spam"
Peter|O\'Toole|8/2/32|Lawrence of Arabia
References
Common Format and MIME Type for Comma-Separated Values (CSV) Files
Dialect 與格式參數