CSV (Comma Separated Values) Files: CSV Files stores a tabular data that has been saved as a plaintext, where data is separated by commas. It follows two rules:
1) Number of rows in CSV File = Number of rows in table
2) Comma , appears after every field (except the last field)
RollNo Name Marks 1 Sheetal 500 2 Amit 400 3 Nidhi 300 4 Rajesh 450 5 Ansh 350 Tabular Data
RollNo,Name,Marks 1,Sheetal,500 2,Amit,400 3,Nidhi,300 4,Rajesh,450 5,Ansh,350 CSV File
Advantages of CSV Files:
1) It is simple and compact format for data storage.
2) It is a common format for data interchange.
3) It can be opened in any spreadsheet application software package like MS-Excel, calc etc.
4) All spreadsheets and databases support import and export to csv files.
Library Needed: We need pandas library to read and write data to/from csv files.
Import the required library: We can import any library in Python as import libraryname
Examples: import pandas
Before importing, Ensure that the library is installed in your system, otherwise you will get an error as
>>> import pandas Traceback (most recent call last): File "<pyshell#2>", line 1, in <module> import pandas ModuleNotFoundError: No module named 'pandas'
You can install the library as
1) Open command prompt by clicking on Run (Search on the windows) and then type cmd and click on OK
2) Type pip install pandas to install pandas
C:\Users\DELL>pip install pandas Collecting pandas Downloading pandas-1.5.3-cp39-cp39-win_amd64.whl (10.9 MB) |████████████████████████████████| 10.9 MB 3.3 MB/s Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pandas) (2.8.2) Requirement already satisfied: numpy>=1.20.3; python_version < "3.10" in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pandas) (1.23.1) Requirement already satisfied: pytz>=2020.1 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pandas) (2022.1) Requirement already satisfied: six>=1.5 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0) Installing collected packages: pandas Successfully installed pandas-1.5.3 WARNING: You are using pip version 20.2.3; however, version 23.0 is available. You should consider upgrading via the 'c:\users\dell\appdata\local\programs\python\python39\python.exe -m pip install --upgrade pip' command. C:\Users\DELL>
Now, you can import library as
>>> import pandas >>>
You can also give any other name (generally short name) ie alias name for this library as
>>> import pandas as p >>>
After giving the alias name for a library, you can refer to that library with the alias name (p here) . But you can’t refer to the library with its original name (pandas here) .
CSV to Data Frame: To read data from csv file into pandas dataframe, read_csv() function of pandas library is used. Syntax is
dataframe_name = pandas.read_csv(“name of csv file with complete path”)
If a file named “myfile.csv” is available in computer in a folder demo in F drive then, We will write as
import pandas
pandas.read_csv(“f:\\demo\\myfile.csv”)
read_csv() is a pandas library function and can be accessed only after importing the pandas library.
RollNo,Name,Marks 1,Sheetal,500 2,Amit,400 3,Nidhi,300 4,Rajesh,450 5,Ansh,350 myfile.csv
>>> import pandas as p >>> df = p.read_csv("f:\\demo\\myfile.csv") >>> df RollNo Name Marks 0 1 Sheetal 500 1 2 Amit 400 2 3 Nidhi 300 3 4 Rajesh 450 4 5 Ansh 350
Note: 1) row numbers are taken as default indexes ie 0,1,2….
2) column names are taken from the first row (RollNo, Name, Marks) of csv file.
3) If csv file don’t have column headers, then first row ie actual data will be treated as column names.
Example: If file is having contents as :
1,Sheetal,500 2,Amit,400 3,Nidhi,300 4,Rajesh,450 5,Ansh,350 myfile.csv
>>> import pandas as p >>> df1=p.read_csv("f:\\demo\\myfile.csv") >>> df1 1 Sheetal 500 0 2 Amit 400 1 3 Nidhi 300 2 4 Rajesh 450 3 5 Ansh 350
To avoid this, we can give column names using the names argument as
1,Sheetal,500 2,Amit,400 3,Nidhi,300 4,Rajesh,450 5,Ansh,350 myfile.csv
>>> import pandas as p >>> df1=p.read_csv("f:\\demo\\myfile.csv",names=["RN","Name","Marks"]) >>> df1 RN Name Marks 0 1 Sheetal 500 1 2 Amit 400 2 3 Nidhi 300 3 4 Rajesh 450 4 5 Ansh 350
We can also give default indexes as
1,Sheetal,500 2,Amit,400 3,Nidhi,300 4,Rajesh,450 5,Ansh,350 myfile.csv
>>> import pandas as p >>> df=p.read_csv("f:\\demo\\myfile.csv",header=None) >>> df 0 1 2 0 1 Sheetal 500 1 2 Amit 400 2 3 Nidhi 300 3 4 Rajesh 450 4 5 Ansh 350
But, if our file contains headers and we give column names then, our column headers (in CSV file) are treated as actual data
RollNo,Name,Marks 1,Sheetal,500 2,Amit,400 3,Nidhi,300 4,Rajesh,450 5,Ansh,350 myfile.csv
>>> import pandas as p >>> df1=p.read_csv("f:\\demo\\myfile.csv",names=["RN","Name","Marks"]) >>> df1 RN Name Marks 0 RollNo Name Marks 1 1 Sheetal 500 2 2 Amit 400 3 3 Nidhi 300 4 4 Rajesh 450 5 5 Ansh 350
and, if our file contains headers and we give header=None then also our column headers (in CSV file) are treated as actual data
RollNo,Name,Marks 1,Sheetal,500 2,Amit,400 3,Nidhi,300 4,Rajesh,450 5,Ansh,350 myfile.csv
>>> import pandas as p >>> df1=p.read_csv("f:\\demo\\myfile.csv",header=None) >>> df1 0 1 2 0 RollNo Name Marks 1 1 Sheetal 500 2 2 Amit 400 3 3 Nidhi 300 4 4 Rajesh 450 5 5 Ansh 350
To give our own column names, even if column headers appears in csv file, we can use skiprows arguement as
RollNo,Name,Marks 1,Sheetal,500 2,Amit,400 3,Nidhi,300 4,Rajesh,450 5,Ansh,350 myfile.csv
>>> import pandas as p >>> df1=p.read_csv("f:\\demo\\myfile.csv",names=["RN","Name","Marks"],skiprows=1) >>> df1 RN Name Marks 0 1 Sheetal 500 1 2 Amit 400 2 3 Nidhi 300 3 4 Rajesh 450 4 5 Ansh 350
To give default headers, even if column headers appears in csv file, we can write as
RollNo,Name,Marks 1,Sheetal,500 2,Amit,400 3,Nidhi,300 4,Rajesh,450 5,Ansh,350 myfile.csv
>>> import pandas as p >>> df1=p.read_csv("f:\\demo\\myfile.csv",header=None,skiprows=1) >>> df1 0 1 2 0 1 Sheetal 500 1 2 Amit 400 2 3 Nidhi 300 3 4 Rajesh 450 4 5 Ansh 350
Basically using skiprows, we can any number of rows to skip from reading. If we write skiprows=3, then first 3 rows are skipped.
RollNo,Name,Marks 1,Sheetal,500 2,Amit,400 3,Nidhi,300 4,Rajesh,450 5,Ansh,350 myfile.csv
>>> import pandas as p >>> df1=p.read_csv("f:\\demo\\myfile.csv",header=None,skiprows=3) >>> df1 0 1 2 0 3 Nidhi 300 1 4 Rajesh 450 2 5 Ansh 350 >>> df2=p.read_csv("f:\\demo\\myfile.csv",skiprows=3) >>> df2 3 Nidhi 300 0 4 Rajesh 450 1 5 Ansh 350 >>>
We can also skip multiple rows using the skiprows by writing all the row numbers in a list as:
RollNo,Name,Marks 1,Sheetal,500 2,Amit,400 3,Nidhi,300 4,Rajesh,450 5,Ansh,350 myfile.csv
>>> import pandas as p >>> df1=p.read_csv("f:\\demo\\myfile.csv",header=None,skiprows=[1,3,5]) >>> df1 0 1 2 0 RollNO Name Marks 1 2 Amit 400 2 4 Rajesh 450 >>> df2=p.read_csv("f:\\demo\\myfile.csv",skiprows=[1,3,5]) >>> df2 RollNO Name Marks 0 2 Amit 400 1 4 Rajesh 450
As line numbers starts from 0 in the csv file, notice that it has skipped lines accordingly. If we want to read only specific number of rows from the csv file, we can use nrows argument as
RollNo,Name,Marks 1,Sheetal,500 2,Amit,400 3,Nidhi,300 4,Rajesh,450 5,Ansh,350 myfile.csv
>>> import pandas as p >>> df1=p.read_csv("f:\\demo\\myfile.csv",nrows=3) >>> df1 RollNo Name Marks 0 1 Sheetal 500 1 2 Amit 400 2 3 Nidhi 300 >>> df1=p.read_csv("f:\\demo\\myfile.csv",nrows=3,header=None) >>> df1 0 1 2 0 RollNo Name Marks 1 1 Sheetal 500 2 2 Amit 400
If our file contains data separated by a character other than Comma , then we can write that character as the value of the sep argument as
RollNo#Name#Marks 1#Sheetal#500 2#Amit#400 3#Nidhi#300 4#Rajesh#450 5#Ansh#350 myfile.csv
>>> import pandas as p >>> df1=p.read_csv("f:\\demo\\myfile.csv",sep='#') >>> df1 RollNo Name Marks 0 1 Sheetal 500 1 2 Amit 400 2 3 Nidhi 300 3 4 Rajesh 450 4 5 Ansh 350
If we want anyone of our column headers to be used as row labels, we can give that column name as the value of the index_col argument as:
RollNo,Name,Marks 1,Sheetal,500 2,Amit,400 3,Nidhi,300 4,Rajesh,450 5,Ansh,350 myfile.csv
>>> import pandas as p >>> df1=p.read_csv("f:\\demo\\myfile.csv",index_col='RollNo') >>> df1 Name Marks RollNo 1 Sheetal 500 2 Amit 400 3 Nidhi 300 4 Rajesh 450 5 Ansh 350 >>> df2=p.read_csv("f:\\demo\\myfile.csv",index_col='Name') >>> df2 RollNo Marks Name Sheetal 1 500 Amit 2 400 Nidhi 3 300 Rajesh 4 450 Ansh 5 350
The structure of read_csv function (with most commonly used arguments, but not all arguments) is
pandas.read_csv (filepath , sep=’character’ , names=[column_names_list] , index_col=’name_of_column_header_in_csv_file’ , header=None, skiprows=n1, nrows=n2)
where
1) filepath is the file name with complete path
2) sep is the separator character
3) names is a list of column names
4) index_col is the name of column header in csv file
5) header = None will specify that csv contains no column headers, it contains only data.
6) skiprows is the number of rows to be skipped while reading csv file
7) nrows is the number of rows to be read from the csv file.
Note:
n2 must be an integer greater than or equal to zero otherwise python will give error.
n1 must be an integer less than n, where n is the number of rows in csv file, otherwise python will give error as
>>> df1=p.read_csv("f:\\demo\\myfile.csv",sep='#',skiprows=10) Traceback (most recent call last): File "<pyshell#72>", line 1, in <module> df1=p.read_csv("f:\\demo\\myfile.csv",sep='#',skiprows=10) File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 211, in wrapper return func(*args, **kwargs) File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 331, in wrapper return func(*args, **kwargs) File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 950, in read_csv return _read(filepath_or_buffer, kwds) File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 605, in _read parser = TextFileReader(filepath_or_buffer, **kwds) File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 1442, in __init__ self._engine = self._make_engine(f, self.engine) File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 1753, in _make_engine return mapping[engine](f, **self.options) File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 79, in __init__ self._reader = parsers.TextReader(src, **kwds) File "pandas\_libs\parsers.pyx", line 554, in pandas._libs.parsers.TextReader.__cinit__ pandas.errors.EmptyDataError: No columns to parse from file
Data Frame to csv: To write data from pandas dataframe to csv file, to_csv() function of dataframe structure is used. Syntax is:
dataframe_name.to_csv(“filepath”)
If a dataframe named df is created in python and we want to create a csv file named “mycsvfile.csv” in a folder demo in F drive in the computer, then, We will write as
import pandas
df.to_csv(“f:\\demo\\mycsvfile.csv”)
to_csv() is a dataframe structure’s function and can be accessed only after importing the pandas library and creating a dataframe.
>>> import pandas as p >>> df=p.DataFrame([[1,"Amit",38],[2,"Ansh",14],[3,"Vansh",12]]) >>> df 0 1 2 0 1 Amit 38 1 2 Ansh 14 2 3 Vansh 12 >>> df.to_csv("f:\\demo\\mycsvfile.csv")
,0,1,2 0,1,Amit,38 1,2,Ansh,14 2,3,Vansh,12 mycsvfile.csv
Note: 1) Notice the comma at the very first location in the first row.
2) Python will not give any warning, if the file with the same name already exists. It will just overwrite the contents of the file.
3) By default, comma is used to separate different fields in a row of the csv file. If we want any other character (must be only 1 character) to be used as the separator character, we can write the character as the value of the sep argument as
>>> import pandas as p >>> df=p.DataFrame([[1,"Amit",38],[2,"Ansh",14],[3,"Vansh",12]]) >>> df 0 1 2 0 1 Amit 38 1 2 Ansh 14 2 3 Vansh 12 >>> df.to_csv("f:\\demo\\mycsvfile.csv", sep="#")
#0#1#2 0#1#Amit#38 1#2#Ansh#14 2#3#Vansh#12 mycsvfile.csv
If the values of some fields are missing or None or NaN, then empty strings are stored in csv file.
Note: NaN is defined in numpy module and therefore, we must install and then import numpy module in python to use this. Otherwise we can simply write None (as we have done in 3rd row 2nd column of dataframe) or leave a column empty (as we have done in 2nd row 3rd column) to indicate a missing value.
If we define even a single value as numpy.NaN then all the values of that column are taken as float. So values 2 and 3 in first column are stored as 2.0 and 3.0 respectively in the dataframe as well as in the csv file.
>>> import pandas as p >>> import numpy as n >>> df=p.DataFrame([[n.NaN,"Amit",38],[2,"Ansh",],[3,None ,12]]) >>> df 0 1 2 0 NaN Amit 38.0 1 2.0 Ansh NaN 2 3.0 None 12.0 >>> df.to_csv("f:\\demo\\mycsvfile.csv")
,0,1,2 0,,Amit,38.0 1,2.0,Ansh, 2,3.0,,12.0 mycsvfile.csv
If we want, we can fill these missing values with any other string as
>>> import pandas as p >>> import numpy as n >>> df=p.DataFrame([[n.NaN,"Amit",38],[2,"Ansh",],[3,None ,12]]) >>> df 0 1 2 0 NaN Amit 38.0 1 2.0 Ansh NaN 2 3.0 None 12.0 >>> df.to_csv("f:\\demo\\mycsvfile1.csv", na_rep="NULL") >>> df.to_csv("f:\\demo\\mycsvfile2.csv", na_rep="GARG")
,0,1,2 0,NULL,Amit,38.0 1,2.0,Ansh,NULL 2,3.0,NULL,12.0 mycsvfile1.csv ,0,1,2 0,GARG,Amit,38.0 1,2.0,Ansh,GARG 2,3.0,GARG,12.0 mycsvfile2.csv