CBSE CS and IP

CBSE Class 11 & 12 Computer Science and Informatics Practices Python Materials, Video Lecture


What is Pandas DataFrame ? How to Create it ?

Python Pandas DataFrame



What is DataFrame?

A DataFrame is a two-dimensional labelled data structure whose columns can be of different types. It is similar to a spreadsheet, an SQL table, or a dict of Series objects.


Characteristics of DataFrame Object:
  • It has two indexes (axes): the row index (axis = 0) and the column index (axis = 1)
  • The row index is known as the index and the column index is known as the column name
  • The index (row index) and the columns (column index) can be numbers, letters or strings
  • Different columns can have values of different types
  • DataFrame is Value Mutable and Size Mutable
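The characteristics listed above can be checked on a small DataFrame. The following is only a minimal sketch; the column names, row labels and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two columns that hold different types
df = pd.DataFrame({'name': ['Asha', 'Ravi'], 'marks': [91.5, 78.0]},
                  index=['r1', 'r2'])

row_labels = list(df.index)       # labels along axis = 0
col_labels = list(df.columns)     # labels along axis = 1
col_types = df.dtypes.astype(str).to_dict()   # one dtype per column

# Value mutable: change an existing cell
df.loc['r1', 'marks'] = 95.0

# Size mutable: add a new column
df['grade'] = ['A', 'B']

print(row_labels, col_labels)
print(col_types)
print(df.shape)
```

Here df.index and df.columns are the two axes, and the shape grows from two columns to three after the new column is added.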

Creation of DataFrame

Now we will discuss how to create a Pandas DataFrame. Before creating a DataFrame object, we have to import the pandas library.

import pandas

Syntax for DataFrame Creation:

<df_object> = pandas.DataFrame(data = None, index = None, 
                            columns = None, dtype = None, copy = False)
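As a rough illustration of the syntax above, the sketch below passes data, index, columns and dtype explicitly. The labels 'r1', 'r2', 'a' and 'b' are hypothetical and chosen only for this example.

```python
import pandas as pd

# data: the values, index: row labels, columns: column labels,
# dtype: a single type to apply to every column
df = pd.DataFrame(data=[[1, 2], [3, 4]],
                  index=['r1', 'r2'],
                  columns=['a', 'b'],
                  dtype='float64')
print(df)
```

When index or columns is left as None, pandas numbers the rows or columns starting from 0.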

1. Dictionary of List / Series

Dictionary of List

import pandas as pd

d = {'first_name': ['Sheldon', 'Raj', 'Leonard', 'Howard', 'Amy'],
        'last_name': ['Copper', 'Koothrappali', 'Hofstadter', 'Wolowitz', 'Fowler'],
        'age': [42, 38, 36, 41, 35],
        'Comedy_Score': [9, 7, 8, 8, 5],
        'Rating_Score': [25, 25, 49, 62, 70]}

df = pd.DataFrame(d)
print(df)


Output
------
  first_name     last_name  age  Comedy_Score  Rating_Score
0    Sheldon        Copper   42             9            25
1        Raj  Koothrappali   38             7            25
2    Leonard    Hofstadter   36             8            49
3     Howard      Wolowitz   41             8            62
4        Amy        Fowler   35             5            70

Dictionary of Series

import pandas as pd
d = {'first_name': pd.Series(['Sheldon', 'Raj', 'Leonard', 'Howard', 'Amy']),
        'last_name': pd.Series(['Copper', 'Koothrappali', 'Hofstadter', 'Wolowitz', 'Fowler']),
        'age': pd.Series([42, 38, 36, 41, 35]),
        'Comedy_Score': pd.Series([9, 7, 8, 8, 5]),
        'Rating_Score': pd.Series([25, 25, 49, 62, 70])}

df = pd.DataFrame(d)
print(df)

Output
------
  first_name     last_name  age  Comedy_Score  Rating_Score
0    Sheldon        Copper   42             9            25
1        Raj  Koothrappali   38             7            25
2    Leonard    Hofstadter   36             8            49
3     Howard      Wolowitz   41             8            62
4        Amy        Fowler   35             5            70
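One difference from a dictionary of lists is worth seeing: when the Series do not share the same index labels, pandas aligns them by label and fills the gaps with NaN. A minimal sketch, using hypothetical labels 'a', 'b' and 'c':

```python
import pandas as pd

# Hypothetical Series whose index labels only partly overlap
age = pd.Series([42, 38], index=['a', 'b'])
score = pd.Series([9, 7, 8], index=['a', 'b', 'c'])

df = pd.DataFrame({'age': age, 'score': score})
print(df)

# Row 'c' has no age, so that cell is NaN
missing = df['age'].isna().tolist()
print(missing)
```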


2. From List of List / Dictionaries

List of List

import pandas as pd
l= [ ['Sheldon', 'Copper', 42, 9, 25],
     ['Raj', 'Koothrappali', 38, 7, 25],
     ['Leonard', 'Hofstadter', 36, 8, 49],
     ['Howard', 'Wolowitz', 41, 8, 62],
     ['Amy', 'Fowler', 35, 5, 70] ]

df = pd.DataFrame(l)
print(df)


Output
------
         0             1   2  3   4
0  Sheldon        Copper  42  9  25
1      Raj  Koothrappali  38  7  25
2  Leonard    Hofstadter  36  8  49
3   Howard      Wolowitz  41  8  62
4      Amy        Fowler  35  5  70
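A list of lists carries no column names, which is why the columns are numbered 0 to 4 in this output. One way to name them is the columns parameter from the creation syntax. A shortened sketch with only two rows and three columns:

```python
import pandas as pd

rows = [['Sheldon', 'Copper', 42],
        ['Raj', 'Koothrappali', 38]]

# columns= supplies the names; without it pandas uses 0, 1, 2, ...
df = pd.DataFrame(rows, columns=['first_name', 'last_name', 'age'])
print(df)
```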


List of Dictionary

import pandas as pd

l = [ {'first_name': 'Sheldon', 'last_name': 'Copper', 'age': 42, 'Comedy_Score': 9, 'Rating_Score': 25},
{'first_name': 'Raj', 'last_name': 'Koothrappali', 'age': 38, 'Comedy_Score': 7, 'Rating_Score': 25},
{'first_name': 'Leonard', 'last_name': 'Hofstadter', 'age': 36, 'Comedy_Score': 8, 'Rating_Score': 49},
{'first_name': 'Howard', 'last_name': 'Wolowitz', 'age': 41, 'Comedy_Score': 8, 'Rating_Score': 62},
{'first_name': 'Amy', 'last_name': 'Fowler', 'age': 35, 'Comedy_Score': 5, 'Rating_Score': 70} ]

df = pd.DataFrame(l)
print(df)

Output
------
  first_name     last_name  age  Comedy_Score  Rating_Score
0    Sheldon        Copper   42             9            25
1        Raj  Koothrappali   38             7            25
2    Leonard    Hofstadter   36             8            49
3     Howard      Wolowitz   41             8            62
4        Amy        Fowler   35             5            70
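In a list of dictionaries the keys become the column names. If a dictionary lacks one of the keys, the corresponding cell is filled with NaN. A small sketch, with the second record deliberately left incomplete for illustration:

```python
import pandas as pd

records = [{'first_name': 'Sheldon', 'age': 42},
           {'first_name': 'Raj'}]          # no 'age' key in this record

df = pd.DataFrame(records)
print(df)

age_missing = df['age'].isna().tolist()
print(age_missing)
```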



3. Text / CSV Files

file: data.csv
--------------
first_name,last_name,age,Comedy_Score,Rating_Score
Sheldon, Copper, 42, 9, 25
Raj, Koothrappali, 38, 7, 25
Leonard, Hofstadter, 36, 8, 49
Howard, Wolowitz, 41, 8, 62
Amy, Fowler, 35, 5, 70

import pandas as pd
df = pd.read_csv("data.csv")
# a .txt file also works, as long as the data is comma separated (or you pass sep= for another delimiter)
print(df)

Output
------
  first_name      last_name  age  Comedy_Score  Rating_Score
0    Sheldon         Copper   42             9            25
1        Raj   Koothrappali   38             7            25
2    Leonard     Hofstadter   36             8            49
3     Howard       Wolowitz   41             8            62
4        Amy         Fowler   35             5            70
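Because every comma in data.csv is followed by a space, text values such as the last name are read with a leading space unless pandas is told otherwise. The sketch below shows the skipinitialspace parameter of read_csv, which skips spaces after the delimiter. It reads a shortened copy of the file from memory (io.StringIO) so that it runs without a file on disk.

```python
import io
import pandas as pd

# Same layout as data.csv above, shortened and held in memory
csv_text = """first_name,last_name,age
Sheldon, Copper, 42
Raj, Koothrappali, 38
"""

plain = pd.read_csv(io.StringIO(csv_text))
trimmed = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# repr() makes a leading space visible
print(repr(plain.loc[0, 'last_name']))
print(repr(trimmed.loc[0, 'last_name']))
```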


Modifying Pandas Series Elements


If you know how to extract a single element or a slice of a Series, changing the Series elements is simple. You can change a single element or a whole slice of the Series object.

To change anything in a Series, access that element (or slice) and assign the new value to it.

Consider the following Series Object:

    import pandas as pd
    student = pd.Series(
    data = ["BOB", "JHON", "RAM", "MOHAN"],
    index = ['S1','S2','S3','S4'])
    print(student)
    
    '''
    S1      BOB
    S2     JHON
    S3      RAM
    S4    MOHAN
    dtype: object
    '''
    


    The following are the different ways in which you can modify the elements of a Series object.

  1. <series object> [<index>] = <new data value>
    To assign a new value, access the element using its Series label and give the new value after the assignment operator. To change an element by its index position, use iloc; recent pandas versions deprecate plain integer positions such as student[2] when the Series has non-integer labels.

    student['S1'] = "JACK"
    student.iloc[2] = "PETER"   # position 2 is label 'S3'
    print(student)
    
    '''
    S1     JACK
    S2     JHON
    S3    PETER
    S4    MOHAN
    dtype: object
    '''
    

  2. <series object> [start : stop] = <new data value>
    To change a slice of values, give the range using the colon (:) and then assign either a scalar value or a list of values. A list must contain exactly as many values as the slice selects.

    student['S3':'S4'] = "JACK"
    student[0:1] = "PETER"
    print(student)
    
    '''
    S1    PETER
    S2     JHON
    S3     JACK
    S4     JACK
    dtype: object
    '''
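The example above assigns a scalar to each slice. The sketch below also assigns a list, starting again from the original student Series. The replacement names "ANU" and "RAVI" are made up for illustration.

```python
import pandas as pd

student = pd.Series(["BOB", "JHON", "RAM", "MOHAN"],
                    index=['S1', 'S2', 'S3', 'S4'])

# A list must have as many items as the slice selects (two here,
# because a label slice includes both 'S1' and 'S2')
student['S1':'S2'] = ["ANU", "RAVI"]

# A scalar is repeated over the whole slice (positions 2 and 3)
student.iloc[2:4] = "JACK"
print(student)
```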
    

  3. loc and iloc attribute
    You can use loc (label based) or iloc (position based) to modify the existing values in the Series. In both cases you provide the new value using the assignment operator.

    # starting again from the original student Series
    student.loc['S3'] = "JACK"
    student.iloc[3] = "PETER"
    print(student)
    '''
    S1      BOB
    S2     JHON
    S3     JACK
    S4    PETER
    dtype: object
    '''
    





Accessing Pandas Series Slices

Slicing means extracting a part of the Series. Slicing can be done in the following ways:

  1. Using indexing operator( [ start : stop : step ] )
    1. Position wise (slicing includes stop - 1 data)
    2. Data label wise (slicing includes both ends )
      1. With unique data labels
      2. With duplicate data labels
  2. Using .loc attribute
  3. Using .iloc attribute

Let us now discuss each type one by one:

1. Using indexing operator( [ start : stop : step ] )

The indexing operator is used for slicing, and it works much like list and string slicing. It takes three values: start, stop and step. The slice begins at Start, moves in increments of Step, and for index positions ends at Stop - 1 (when data labels are used, Stop itself is included).

Start and Stop can be Series data labels (index) or index positions. The Step can be a positive or negative number, and its default value is 1.

Let us now discuss what is Series Data Labels / Index and Series Index Position. To know the difference between these two terms, check the below given Series student:

import pandas as pd
student = pd.Series(
data = ["BOB", "JHON", "RAM", "MOHAN"],
index = ['S1','S2','S3','S4'])
print(student)

S1      BOB
S2     JHON
S3      RAM
S4    MOHAN
dtype: object

We have created a Series student with data elements as "BOB", "JHON", "RAM" and "MOHAN" and its data labels/index as 'S1', 'S2', 'S3' and 'S4'. Here 'S1', 'S2', 'S3' and 'S4' are called data labels/index of given Series student. Pandas internally maintains a position for these data labels, starting from 0 up to (length - 1) from the top and from -1 down to -length from the bottom. You can understand both the terms as below:
 
Position   Index   Data_Values
  0/-4        S1       BOB
  1/-3        S2       JHON
  2/-2        S3       RAM
  3/-1        S4       MOHAN

Since our Series student has 4 elements, we have positions starting from 0 up to 3. I hope you have now understood the difference between the Series index and index positions.
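The position table above can be rebuilt in code, which may help to see how the two position numbers relate. This is only a sketch: for an element at position pos in a Series of length n, the position counted from the bottom is pos - n.

```python
import pandas as pd

student = pd.Series(["BOB", "JHON", "RAM", "MOHAN"],
                    index=['S1', 'S2', 'S3', 'S4'])

n = len(student)
rows = []
for pos, (label, value) in enumerate(student.items()):
    # pos counts from the top, pos - n counts from the bottom
    rows.append((pos, pos - n, label, value))
    print(pos, pos - n, label, value)
```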

a) Position wise (slicing includes stop - 1 data)

As discussed, the position is a number that pandas assigns to each Series element internally, so we will use that position to find the Series slice.

In this type of slicing the data is returned up to Stop - 1.

syntax:

<Series Object> [start : stop : step]

>>> print(student)
S1      BOB
S2     JHON
S3      RAM
S4    MOHAN
dtype: object

>>> student[0:3:1]
S1     BOB
S2    JHON
S3     RAM
dtype: object

>>> student[-3:-1:1]
S2    JHON
S3     RAM
dtype: object

>>> student[-1:-4:-2]
S4    MOHAN
S2     JHON
dtype: object
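Position-wise slicing follows the same rules as slicing a Python list. The sketch below applies one start:stop:step to the Series and to a plain list holding the same names. It uses iloc for the Series, which is always position based.

```python
import pandas as pd

names = ["BOB", "JHON", "RAM", "MOHAN"]
student = pd.Series(names, index=['S1', 'S2', 'S3', 'S4'])

# Start at the last element, step backwards by 2, stop before position -4
from_series = student.iloc[-1:-4:-2].tolist()
from_list = names[-1:-4:-2]
print(from_series, from_list)
```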


b) Data label wise (slicing includes both ends )

We can also use Series data labels for slicing. In this case, Start and Stop are data labels and Step is a number.

syntax:

<Series Object> [start : stop : step]

Since the data labels of a Series can be duplicated, we will look at slicing for unique and duplicate data labels separately.

Note: In this type of slicing both the start and stop end will be included in the result.

i) With unique data labels

Check the following example, in this example all the data labels of student Series are unique.
>>> print(student)
S1      BOB
S2     JHON
S3      RAM
S4    MOHAN
dtype: object

>>> student['S1':'S4':2]
S1    BOB
S3    RAM
dtype: object

>>> student['S4':'S1':1]
Series([], dtype: object)

>>> student['S4':'S1':-1]
S4    MOHAN
S3      RAM
S2     JHON
S1      BOB
dtype: object


ii) With duplicate data labels 

Check the following example, in which the student Series has the data label S1 twice. If we use a non-unique data label as a slice boundary, we get an error, as shown in the example below. Slicing between labels that are unique still works.

>>> print(student)
S1      BOB
S2     JHON
S3      RAM
S1    MOHAN
dtype: object

>>> student['S1':'S2']
KeyError: "Cannot get left slice bound for non-unique label: 'S1'"

>>> student['S2':'S3']
S2    JHON
S3     RAM
dtype: object
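The example above does not show how the Series with the duplicate label was created. The sketch below builds it, catches the KeyError instead of letting the program stop, and then selects (rather than slices) by the duplicated label.

```python
import pandas as pd

student = pd.Series(["BOB", "JHON", "RAM", "MOHAN"],
                    index=['S1', 'S2', 'S3', 'S1'])

try:
    student['S1':'S2']          # 'S1' is not unique, so it cannot bound a slice
    failed = False
except KeyError as err:
    failed = True
    print("KeyError:", err)

# Selecting by a duplicated label returns every matching element
both = student.loc['S1'].tolist()
print(both)
```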


2. Using ".loc" attribute

The loc attribute accesses a group of elements by label(s) or a Boolean array.
  1. Series.loc[ start : stop : step ]
  2. Series.loc[[<list of labels>]]
Consider the following Series Object:
import pandas as pd
student = pd.Series(
data = ["BOB", "JHON", "RAM", "MOHAN"],
index = ['S1','S2','S3','S4'])
print(student)

  1. Series.loc[ start : stop : step ] : Using this you can extract Series slices by giving a range of index labels. Here start is the first label, stop is the label up to which you want to extract the slice and step is the step size used when reading the data. The element at stop is included in the result.
    Example:
    student.loc['S1':'S4':2]
    
    
    '''
    S1    BOB
    S3    RAM
    dtype: object
    '''
    

  2. Series.loc[[<list of labels>]] : If you want to access particular elements of a Series object you can use this type of loc attribute. Here you have to provide the index in the form of a list.
    Example:

    student.loc[['S1','S4']]
    
    '''
    S1      BOB
    S4    MOHAN
    dtype: object
    '''
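Besides a label slice and a list of labels, loc also accepts a Boolean array, as mentioned at the start of this section. A minimal sketch, with a hand-written mask chosen only for illustration:

```python
import pandas as pd

student = pd.Series(["BOB", "JHON", "RAM", "MOHAN"],
                    index=['S1', 'S2', 'S3', 'S4'])

# One True/False per element; True keeps the element
mask = [True, False, False, True]
picked = student.loc[mask]
print(picked)
```

The mask must have the same length as the Series.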
    

3. Using ".iloc" attribute

Using iloc attribute : Purely integer-location-based indexing for selection by position.
  1. Series.iloc[ start : stop : step ]   
  2. Series.iloc[[<list of positions>]]
Consider the following Series Object:
import pandas as pd
student = pd.Series(
data = ["BOB", "JHON", "RAM", "MOHAN"],
index = ['S1','S2','S3','S4'])
print(student)

    1. Series.iloc[ start : stop : step ] :  Using this you can extract Series slices by giving a range of index positions. Here start is the first position, stop is the position before which the slice ends and step is the step size used when reading the data. The data is returned up to stop - 1.
      Example:
      student.iloc[0:3:1]
      
      '''
      S1     BOB
      S2    JHON
      S3     RAM
      dtype: object
      '''
      

    2. Series.iloc[[<list of positions>]] :  If you want to access particular elements of a Series object you can use this type of iloc attribute. Here you have to provide the index positions in the form of a list.
      Example:
      student.iloc[[1,2,0]]
      
      '''
      S2    JHON
      S3     RAM
      S1     BOB
      dtype: object
      '''
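The main difference between loc and iloc slices is how stop is treated. The short sketch below puts the two side by side on the same student Series.

```python
import pandas as pd

student = pd.Series(["BOB", "JHON", "RAM", "MOHAN"],
                    index=['S1', 'S2', 'S3', 'S4'])

by_label = student.loc['S1':'S3']    # stop label 'S3' is included
by_position = student.iloc[0:2]      # stop position 2 is excluded
print(len(by_label), len(by_position))
```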
      




    Accessing Pandas Series Elements

    Pandas Series is a 1-D (one-dimensional) Pandas data structure. In the previous post, we have seen how to create a Series object. Here we will discuss how to access the elements of a Series in Pandas. There are three ways in which you can access individual Series elements:

    1. By using Data Labels / Index
    2. By using Index Position
    3. By using "at" and "iat" attributes
    Syntax:
    <Series Object> [ <Valid Index> ]

    <Series Object> . at [ <Valid Index> ]

    <Series Object> .iat [ <Valid Index position> ]





    Let us now discuss what is Series Data Labels / Index and Series Index Position. To know the difference between these two terms, check the below given Series student:

    import pandas as pd
    student = pd.Series(
    data = ["BOB", "JHON", "RAM", "MOHAN"],
    index = ['S1','S2','S3','S4'])
    print(student)
    
    S1      BOB
    S2     JHON
    S3      RAM
    S4    MOHAN
    dtype: object
    

    We have created a Series student with data elements as "BOB", "JHON", "RAM" and "MOHAN" and its data labels/index as 'S1', 'S2', 'S3' and 'S4'. Here 'S1', 'S2', 'S3' and 'S4' are called data labels/index of given Series student. Pandas internally maintains a position for these data labels, starting from 0 up to (length - 1) from the top and from -1 down to -length from the bottom. You can understand both the terms as below:
     
    Position   Index   Data_Values
      0/-4        S1       BOB
      1/-3        S2       JHON
      2/-2        S3       RAM
      3/-1        S4       MOHAN
    

    Since our Series student has 4 elements, we have positions starting from 0 up to 3. I hope you have now understood the difference between the Series index and index positions.

    It is time to discuss the ways in which we can access the Series elements:

    1. By using Data Labels / Index

    We will take our previous Series student and syntax mentioned above to find the elements by using Data Labels / Index i.e. 'S1', 'S2', 'S3' and 'S4'. Check the below-given examples:

    >>> print(student)
    S1      BOB
    S2     JHON
    S3      RAM
    S4    MOHAN
    dtype: object
    
    >>> student["S1"]
    'BOB'
    
    >>> student["S3"]
    'RAM'
    
    >>> student["S5"]
    KeyError: 'S5'
    

    2. By using Index Positions

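    Here again, we will use the Series student to find the elements by using index positions. The syntax remains the same as in the previous example. Note that recent pandas versions deprecate integer positions inside [ ] for a Series with non-integer labels such as this one; student.iloc[0] and student.iat[0] give the same results and are the recommended position-based forms.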
    >>> print(student)
    S1      BOB
    S2     JHON
    S3      RAM
    S4    MOHAN
    dtype: object
    >>> student[0]
    'BOB'
    
    >>> student[-4]
    'BOB'
    
    >>> student[2]
    'RAM'
    
    >>> student[-2]
    'RAM'
    
    >>> student[5]
    ## Error: position 5 does not exist (valid positions are 0 to 3)
    

    3. By using "at" and "iat" attributes

    We will use the same Series student. "at" and "iat" are both Series attributes that we can use to access a single element of a Series.
    "at": It takes a data label (index) to find the element
    "iat": It takes an index position to extract the element from the Series

    Let us check an example of both:
    >>> print(student)
    S1      BOB
    S2     JHON
    S3      RAM
    S4    MOHAN
    dtype: object
     
    >>> student.at['S1']
    'BOB'
    >>> student.iat[0]
    'BOB'
    
    >>> student.at['S4']
    'MOHAN'
    >>> student.iat[-1]
    'MOHAN'
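It can also help to see what happens when the label or position does not exist. The sketch below catches the errors so that the program keeps running; the label 'S5' and the position 5 are deliberately invalid.

```python
import pandas as pd

student = pd.Series(["BOB", "JHON", "RAM", "MOHAN"],
                    index=['S1', 'S2', 'S3', 'S4'])

errors = []
try:
    student.at['S5']        # a label that does not exist
except KeyError:
    errors.append('KeyError')

try:
    student.iat[5]          # a position beyond the last element
except IndexError:
    errors.append('IndexError')

print(errors)
```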
    

    I hope you have now learnt how to access Series elements. Now read the questions below and try to answer them yourself:

    Questions:
    1. How do you access the elements of a Pandas series?
    2. What will you write to display the third element of a Series object?
    3. How do you get the first element of the pandas series?
    4. How to get the last element of Series Object?
    5. How to get the second last element of Series Object? 
    Answers:
    1. You can access the Series elements using data labels (index), index positions, or the at and iat attributes.
    2. student[2]
    3. student[0] 
    4. student[-1]
    5. student[-2]
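The answers can be checked on the student Series. The sketch below uses iloc, which is always position based and therefore gives the same elements as the positions used in the answers.

```python
import pandas as pd

student = pd.Series(["BOB", "JHON", "RAM", "MOHAN"],
                    index=['S1', 'S2', 'S3', 'S4'])

third = student.iloc[2]          # answer 2
first = student.iloc[0]          # answer 3
last = student.iloc[-1]          # answer 4
second_last = student.iloc[-2]   # answer 5
print(third, first, last, second_last)
```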



    Python Pandas - Series Attribute



    Attributes are the properties of an object. Here we will discuss the Series attributes with programming examples. All the important Series attributes according to the CBSE Class 12 Informatics Practices syllabus are given in the table below:


    Attribute        Description
    Series.index     The index (axis labels) of the Series.
    Series.values    Return the Series as an ndarray or ndarray-like, depending on the dtype.
    Series.dtype     Return the dtype object of the underlying data.
    Series.shape     Return a tuple of the shape of the underlying data.
    Series.nbytes    Return the number of bytes in the underlying data.
    Series.ndim      The number of dimensions of the underlying data, by definition 1.
    Series.size      Return the number of elements in the underlying data.
    Series.hasnans   Return True if the Series has any NaN values.
    Series.empty     Return True if the Series is empty.
    at, iat          To access a single value from the Series.
    loc, iloc        To access slices from the Series.


    Let us now check each attribute with a programming example. We will use the following Series student for all of them.

    import pandas as pd
    student = pd.Series(["Sonal", "Rahul", "Mohan", "Siya",])
    print(student)
    
    '''
    Output:
    0    Sonal
    1    Rahul
    2    Mohan
    3     Siya
    dtype: object
    '''
    

    1. Series.index
    This attribute is used to get the index (axis labels) of the Series. Let us try this attribute on the student Series.
    >>> student.index
    RangeIndex(start=0, stop=4, step=1)
    

    2. Series.values
    values attribute returns Series as ndarray or ndarray like depending upon dtype.
    >>> student.values
    array(['Sonal', 'Rahul', 'Mohan', 'Siya'], dtype=object)
    

    3. Series.dtype
    dtype attribute is used to check the data type of the Series object. Since the student Series is of object type, the output below shows 'O'.
    >>> student.dtype
    dtype('O')
    

    4. Series.shape
    shape attribute gives the shape of the underlying data structure in the form of a tuple. Since the student Series is having 4 elements the output is (4,).
    >>> student.shape
    (4,)
    
    5. Series.nbytes
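    nbytes attribute gives the total number of bytes taken by the underlying data of the Series object. The output below tells us that the data of the student object takes 32 bytes of memory; for an object-type Series this counts only the references to the strings (8 bytes each on a typical 64-bit system), not the strings themselves or the index.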
    >>> student.nbytes
    32
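For a numeric Series the same figure can be worked out by hand: nbytes is the number of elements multiplied by the bytes used per element. A minimal sketch with a hypothetical Series of four 64-bit integers:

```python
import pandas as pd

# Hypothetical numeric Series with an explicit 64-bit integer type
marks = pd.Series([10, 20, 30, 40], dtype='int64')

per_item = marks.dtype.itemsize    # bytes per element
total = marks.nbytes               # bytes for all elements
print(per_item, marks.size, total)
```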
    

    6. Series.ndim
    ndim gives the number of dimensions of the underlying data structure. Since a Series is a 1-D data structure, it gives 1 for every Series object.
    >>> student.ndim
    1
    

    7. Series.size
    size gives the total number of elements in the series. Since the student series has 4 elements size will give 4.
    >>> student.size
    4
    

    8. Series.hasnans
    hasnans returns a Boolean value. If any of the Series elements is NaN it returns True, otherwise False.
    >>> student.hasnans
    False
    

    9. Series.empty
    empty attribute returns Boolean True if Series is empty, otherwise the output will be False.
    >>> student.empty
    False
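Both hasnans and empty give False for the student Series, so it may help to see a case where each gives True. The two Series below are hypothetical: one contains a missing value and the other has no elements.

```python
import pandas as pd

# One Series with a missing value, one with no elements at all
marks = pd.Series([90.0, float('nan'), 75.5])
nothing = pd.Series([], dtype='float64')

print(marks.hasnans, marks.empty)
print(nothing.hasnans, nothing.empty)
```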
    

    10. at, iat
    at and iat access a single value from a Series by label and by position respectively. They are discussed in detail in the post on accessing Pandas Series elements.

    11. loc, iloc
    loc and iloc access slices of a Series by label and by position respectively. They are discussed in detail in the post on accessing Pandas Series slices.