Import Export Data between CSV files and Data Frame in Python / Garg's Academy

CSV (Comma Separated Values) Files: CSV Files stores a tabular data that has been saved as a plaintext, where data is separated by commas. It follows two rules:
1) Number of rows in CSV File = Number of rows in table
2) Comma , appears after every field (except the last field)

RollNo         Name           Marks
1              Sheetal        500
2              Amit           400
3              Nidhi          300
4              Rajesh         450
5              Ansh           350
            Tabular Data

RollNo,Name,Marks
1,Sheetal,500
2,Amit,400
3,Nidhi,300
4,Rajesh,450
5,Ansh,350
                   CSV File

Advantages of CSV Files:
1) It is simple and compact format for data storage.
2) It is a common format for data interchange.
3) It can be opened in any spreadsheet application software package like MS-Excel, calc etc.
4) All spreadsheets and databases support import and export to csv files.

Library Needed: We need pandas library to read and write data to/from csv files.

Import the required library: We can import any library in Python as import libraryname
Examples: import pandas
Before importing, Ensure that the library is installed in your system, otherwise you will get an error as

>>> import pandas
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    import pandas
ModuleNotFoundError: No module named 'pandas'

You can install the library as
1) Open command prompt by clicking on Run (Search on the windows) and then type cmd and click on OK
2) Type pip install pandas to install pandas

C:\Users\DELL>pip install pandas
Collecting pandas
  Downloading pandas-1.5.3-cp39-cp39-win_amd64.whl (10.9 MB)
     |████████████████████████████████| 10.9 MB 3.3 MB/s
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: numpy>=1.20.3; python_version < "3.10" in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pandas) (1.23.1)
Requirement already satisfied: pytz>=2020.1 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pandas) (2022.1)
Requirement already satisfied: six>=1.5 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Installing collected packages: pandas
Successfully installed pandas-1.5.3
WARNING: You are using pip version 20.2.3; however, version 23.0 is available.
You should consider upgrading via the 'c:\users\dell\appdata\local\programs\python\python39\python.exe -m pip install --upgrade pip' command.

C:\Users\DELL>

Now, you can import library as

>>> import pandas
>>>

You can also give any other name (generally short name) ie alias name for this library as

>>> import pandas as p
>>>

After giving the alias name for a library, you can refer to that library with the alias name (p here) . But you can’t refer to the library with its original name (pandas here) .

CSV to Data Frame: To read data from csv file into pandas dataframe, read_csv() function of pandas library is used. Syntax is
dataframe_name = pandas.read_csv(“name of csv file with complete path”)
If a file named “myfile.csv” is available in computer in a folder demo in F drive then, We will write as
import pandas
pandas.read_csv(“f:\\demo\\myfile.csv”)
read_csv() is a pandas library function and can be accessed only after importing the pandas library.

RollNo,Name,Marks
1,Sheetal,500
2,Amit,400
3,Nidhi,300
4,Rajesh,450
5,Ansh,350


                myfile.csv

>>> import pandas as p
>>> df = p.read_csv("f:\\demo\\myfile.csv")
>>> df
   RollNo     Name  Marks
0       1  Sheetal    500
1       2     Amit    400
2       3    Nidhi    300
3       4   Rajesh    450
4       5     Ansh    350

Note: 1) row numbers are taken as default indexes ie 0,1,2….
2) column names are taken from the first row (RollNo, Name, Marks) of csv file.
3) If csv file don’t have column headers, then first row ie actual data will be treated as column names.
Example: If file is having contents as :

1,Sheetal,500
2,Amit,400
3,Nidhi,300
4,Rajesh,450
5,Ansh,350


               myfile.csv

>>> import pandas as p
>>> df1=p.read_csv("f:\\demo\\myfile.csv")
>>> df1
   1 Sheetal  500
0  2    Amit  400
1  3   Nidhi  300
2  4  Rajesh  450
3  5    Ansh  350

To avoid this, we can give column names using the names argument as

1,Sheetal,500
2,Amit,400
3,Nidhi,300
4,Rajesh,450
5,Ansh,350


               myfile.csv

>>> import pandas as p
>>> df1=p.read_csv("f:\\demo\\myfile.csv",names=["RN","Name","Marks"])
>>> df1
   RN     Name  Marks
0   1  Sheetal    500
1   2     Amit    400
2   3    Nidhi    300
3   4   Rajesh    450
4   5     Ansh    350

We can also give default indexes as

1,Sheetal,500
2,Amit,400
3,Nidhi,300
4,Rajesh,450
5,Ansh,350


               myfile.csv

>>> import pandas as p
>>> df=p.read_csv("f:\\demo\\myfile.csv",header=None)
>>> df
   0        1    2
0  1  Sheetal  500
1  2     Amit  400
2  3    Nidhi  300
3  4   Rajesh  450
4  5     Ansh  350

But, if our file contains headers and we give column names then, our column headers (in CSV file) are treated as actual data

RollNo,Name,Marks
1,Sheetal,500
2,Amit,400
3,Nidhi,300
4,Rajesh,450
5,Ansh,350


               myfile.csv

>>> import pandas as p
>>> df1=p.read_csv("f:\\demo\\myfile.csv",names=["RN","Name","Marks"])
>>> df1
       RN     Name  Marks
0  RollNo     Name  Marks
1       1  Sheetal    500
2       2     Amit    400
3       3    Nidhi    300
4       4   Rajesh    450
5       5     Ansh    350

and, if our file contains headers and we give header=None then also our column headers (in CSV file) are treated as actual data

RollNo,Name,Marks
1,Sheetal,500
2,Amit,400
3,Nidhi,300
4,Rajesh,450
5,Ansh,350


               myfile.csv

>>> import pandas as p
>>> df1=p.read_csv("f:\\demo\\myfile.csv",header=None)
>>> df1
        0        1      2
0  RollNo     Name  Marks
1       1  Sheetal    500
2       2     Amit    400
3       3    Nidhi    300
4       4   Rajesh    450
5       5     Ansh    350

To give our own column names, even if column headers appears in csv file, we can use skiprows arguement as

RollNo,Name,Marks
1,Sheetal,500
2,Amit,400
3,Nidhi,300
4,Rajesh,450
5,Ansh,350
         

      myfile.csv

>>> import pandas as p
>>> df1=p.read_csv("f:\\demo\\myfile.csv",names=["RN","Name","Marks"],skiprows=1)
>>> df1
   RN     Name  Marks
0   1  Sheetal    500
1   2     Amit    400
2   3    Nidhi    300
3   4   Rajesh    450
4   5     Ansh    350

To give default headers, even if column headers appears in csv file, we can write as

RollNo,Name,Marks
1,Sheetal,500
2,Amit,400
3,Nidhi,300
4,Rajesh,450
5,Ansh,350

               myfile.csv

>>> import pandas as p
>>> df1=p.read_csv("f:\\demo\\myfile.csv",header=None,skiprows=1)
>>> df1
   0        1    2
0  1  Sheetal  500
1  2     Amit  400
2  3    Nidhi  300
3  4   Rajesh  450
4  5     Ansh  350

Basically using skiprows, we can any number of rows to skip from reading. If we write skiprows=3, then first 3 rows are skipped.

RollNo,Name,Marks
1,Sheetal,500
2,Amit,400
3,Nidhi,300
4,Rajesh,450
5,Ansh,350





               myfile.csv

>>> import pandas as p
>>> df1=p.read_csv("f:\\demo\\myfile.csv",header=None,skiprows=3)
>>> df1
   0       1    2
0  3   Nidhi  300
1  4  Rajesh  450
2  5    Ansh  350
>>> df2=p.read_csv("f:\\demo\\myfile.csv",skiprows=3)
>>> df2
   3   Nidhi  300
0  4  Rajesh  450
1  5    Ansh  350
>>>

We can also skip multiple rows using the skiprows by writing all the row numbers in a list as:

RollNo,Name,Marks
1,Sheetal,500
2,Amit,400
3,Nidhi,300
4,Rajesh,450
5,Ansh,350




               myfile.csv

>>> import pandas as p
>>> df1=p.read_csv("f:\\demo\\myfile.csv",header=None,skiprows=[1,3,5])
>>> df1
        0       1      2
0  RollNO    Name  Marks
1       2    Amit    400
2       4  Rajesh    450
>>> df2=p.read_csv("f:\\demo\\myfile.csv",skiprows=[1,3,5])
>>> df2
   RollNO    Name  Marks
0       2    Amit    400
1       4  Rajesh    450

As line numbers starts from 0 in the csv file, notice that it has skipped lines accordingly. If we want to read only specific number of rows from the csv file, we can use nrows argument as

RollNo,Name,Marks
1,Sheetal,500
2,Amit,400
3,Nidhi,300
4,Rajesh,450
5,Ansh,350





               myfile.csv

>>> import pandas as p
>>> df1=p.read_csv("f:\\demo\\myfile.csv",nrows=3)
>>> df1
   RollNo     Name  Marks
0       1  Sheetal    500
1       2     Amit    400
2       3    Nidhi    300
>>> df1=p.read_csv("f:\\demo\\myfile.csv",nrows=3,header=None)
>>> df1
        0        1      2
0  RollNo     Name  Marks
1       1  Sheetal    500
2       2     Amit    400

If our file contains data separated by a character other than Comma , then we can write that character as the value of the sep argument as

RollNo#Name#Marks
1#Sheetal#500
2#Amit#400
3#Nidhi#300
4#Rajesh#450
5#Ansh#350

               myfile.csv

>>> import pandas as p
>>> df1=p.read_csv("f:\\demo\\myfile.csv",sep='#')
>>> df1
   RollNo     Name  Marks
0       1  Sheetal    500
1       2     Amit    400
2       3    Nidhi    300
3       4   Rajesh    450
4       5     Ansh    350

If we want anyone of our column headers to be used as row labels, we can give that column name as the value of the index_col argument as:

RollNo,Name,Marks
1,Sheetal,500
2,Amit,400
3,Nidhi,300
4,Rajesh,450
5,Ansh,350

  myfile.csv

>>> import pandas as p
>>> df1=p.read_csv("f:\\demo\\myfile.csv",index_col='RollNo')
>>> df1
           Name  Marks
RollNo                
1       Sheetal    500
2          Amit    400
3         Nidhi    300
4        Rajesh    450
5          Ansh    350
>>> df2=p.read_csv("f:\\demo\\myfile.csv",index_col='Name')
>>> df2
         RollNo  Marks
Name                  
Sheetal       1    500
Amit          2    400
Nidhi         3    300
Rajesh        4    450
Ansh          5    350

The structure of read_csv function (with most commonly used arguments, but not all arguments) is
pandas.read_csv (filepath , sep=’character’ , names=[column_names_list] , index_col=’name_of_column_header_in_csv_file’ , header=None, skiprows=n1, nrows=n2)
where
1) filepath is the file name with complete path
2) sep is the separator character
3) names is a list of column names
4) index_col is the name of column header in csv file
5) header = None will specify that csv contains no column headers, it contains only data.
6) skiprows is the number of rows to be skipped while reading csv file
7) nrows is the number of rows to be read from the csv file.
Note:
n2 must be an integer greater than or equal to zero otherwise python will give error.
n1 must be an integer less than n, where n is the number of rows in csv file, otherwise python will give error as

>>> df1=p.read_csv("f:\\demo\\myfile.csv",sep='#',skiprows=10)
Traceback (most recent call last):
  File "<pyshell#72>", line 1, in <module>
    df1=p.read_csv("f:\\demo\\myfile.csv",sep='#',skiprows=10)
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 1753, in _make_engine
    return mapping[engine](f, **self.options)
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 79, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 554, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

Data Frame to csv: To write data from pandas dataframe to csv file, to_csv() function of dataframe structure is used. Syntax is:
dataframe_name.to_csv(“filepath”)
If a dataframe named df is created in python and we want to create a csv file named “mycsvfile.csv” in a folder demo in F drive in the computer, then, We will write as
import pandas
df.to_csv(“f:\\demo\\mycsvfile.csv”)
to_csv() is a dataframe structure’s function and can be accessed only after importing the pandas library and creating a dataframe.

>>> import pandas as p
>>> df=p.DataFrame([[1,"Amit",38],[2,"Ansh",14],[3,"Vansh",12]])
>>> df
   0      1   2
0  1   Amit  38
1  2   Ansh  14
2  3  Vansh  12
>>> df.to_csv("f:\\demo\\mycsvfile.csv")

,0,1,2
0,1,Amit,38
1,2,Ansh,14
2,3,Vansh,12



      mycsvfile.csv

Note: 1) Notice the comma at the very first location in the first row.
2) Python will not give any warning, if the file with the same name already exists. It will just overwrite the contents of the file.
3) By default, comma is used to separate different fields in a row of the csv file. If we want any other character (must be only 1 character) to be used as the separator character, we can write the character as the value of the sep argument as

>>> import pandas as p
>>> df=p.DataFrame([[1,"Amit",38],[2,"Ansh",14],[3,"Vansh",12]])
>>> df
   0      1   2
0  1   Amit  38
1  2   Ansh  14
2  3  Vansh  12
>>> df.to_csv("f:\\demo\\mycsvfile.csv", sep="#")

#0#1#2
0#1#Amit#38
1#2#Ansh#14
2#3#Vansh#12



      mycsvfile.csv

If the values of some fields are missing or None or NaN, then empty strings are stored in csv file.
Note: NaN is defined in numpy module and therefore, we must install and then import numpy module in python to use this. Otherwise we can simply write None (as we have done in 3rd row 2nd column of dataframe) or leave a column empty (as we have done in 2nd row 3rd column) to indicate a missing value.
If we define even a single value as numpy.NaN then all the values of that column are taken as float. So values 2 and 3 in first column are stored as 2.0 and 3.0 respectively in the dataframe as well as in the csv file.

>>> import pandas as p
>>> import numpy as n
>>> df=p.DataFrame([[n.NaN,"Amit",38],[2,"Ansh",],[3,None ,12]])
>>> df
     0     1     2
0  NaN  Amit  38.0
1  2.0  Ansh   NaN
2  3.0  None  12.0
>>> df.to_csv("f:\\demo\\mycsvfile.csv")

,0,1,2
0,,Amit,38.0
1,2.0,Ansh,
2,3.0,,12.0





      mycsvfile.csv

If we want, we can fill these missing values with any other string as

>>> import pandas as p
>>> import numpy as n
>>> df=p.DataFrame([[n.NaN,"Amit",38],[2,"Ansh",],[3,None ,12]])
>>> df
     0     1     2
0  NaN  Amit  38.0
1  2.0  Ansh   NaN
2  3.0  None  12.0

>>> df.to_csv("f:\\demo\\mycsvfile1.csv", na_rep="NULL")

>>> df.to_csv("f:\\demo\\mycsvfile2.csv", na_rep="GARG")

,0,1,2
0,NULL,Amit,38.0
1,2.0,Ansh,NULL
2,3.0,NULL,12.0
      mycsvfile1.csv


,0,1,2
0,GARG,Amit,38.0
1,2.0,Ansh,GARG
2,3.0,GARG,12.0
      mycsvfile2.csv

Related Posts